Likelihood of Purchase - Proper Method

Hi Everyone,

I'm trying to determine the odds of a product succeeding in a given area.
After doing some research, it seems that similar problems have been
solved using Logistic Regression, but I'm still not a hundred percent clear on how it would work.
Here's an example of how I was thinking of setting up the problem:

Say a product has been released in one area (Area 1) and I'm trying to determine
if it will succeed in another area (Area 2).

I was first thinking of creating a profile of the people that have bought the product in Area 1:

Note: everything in the bought column would be 1 because I don't have data on who didn't buy the product.

The next step I was thinking of doing was to create a profile of Area 2:

  • 45% Male
  • 55% Female
  • Average Income Male: 40000
  • Average Income Female: 41000
  • Average Age Male: 40
  • Average Age Female: 42

Now, based on this type of data I'd like to determine the likelihood that the product will be purchased in Area 2.

I would appreciate any advice/tips on:

  • What methods I could use to solve the problem
  • Any references I could use to learn said methods
  • How I can set the problem up differently to make it easier to analyze

Hi, welcome

This is an interesting problem. Without zeros in "bought" there's no logistic regression. Is the idea you'll append zeros to the data based on your knowledge of the overall population?
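
To make the "no zeros" point concrete, here is a minimal sketch using made-up data (ten buyers and no non-buyers, not anything from the original table). It evaluates the log-likelihood of an intercept-only logistic model and shows that it keeps improving as the intercept grows, so there is no finite estimate to find. The function name `log_likelihood` is just an illustrative helper.

```python
import math

def log_likelihood(intercept, outcomes):
    """Bernoulli log-likelihood of an intercept-only logistic model."""
    p = 1.0 / (1.0 + math.exp(-intercept))
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)

# Hypothetical data: ten buyers and no non-buyers.
bought = [1] * 10

# With only ones, a larger intercept always fits better, so the
# maximum-likelihood estimate runs off to infinity.
intercepts = (0, 2, 4, 8, 16)
values = [log_likelihood(b, bought) for b in intercepts]
for b, ll in zip(intercepts, values):
    print(f"intercept={b:>2}  log-likelihood={ll:.6f}")
```

The printed log-likelihoods rise toward zero without ever peaking, which is the numerical symptom of having only one outcome class.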

Also, in general, predictions made on one group using a model built from data on another group carry the assumption that the two groups come from the same population, so I guess Area 1 and Area 2 are considered similar enough (just making sure).

Hi ted00,

Sorry for the late reply. To answer your last question, yes, these two groups would be part of a larger population, the only difference being that the product was sold to people in the first area but not the second.
I'm not sure I'll be able to get information on people who didn't buy the product in the first area. But for argument's sake, say I was able to get a few zeros in the first table, would Logistic Regression be the best choice to solve the problem?

With no zeros there's no logistic regression.
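
For the hypothetical case where some zeros are available, a fit could look roughly like the sketch below. Everything in it is an assumption for illustration: the twelve Area 1 records are invented, the helper names (`features`, `fit_logistic`, `predict`) are hypothetical, and plain gradient ascent stands in for whatever fitting routine a statistics package would use. The Area 2 averages and the 45%/55% split are the ones given in the question. Note that plugging group averages into a nonlinear model only approximates averaging individual predictions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(male, income, age):
    # Rescale and roughly center so gradient ascent behaves well.
    return [male, income / 10000 - 4, age / 10 - 4]

def predict(w, x):
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def fit_logistic(rows, labels, steps=5000, lr=0.1):
    """Gradient ascent on the mean log-likelihood; w[0] is the intercept."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - predict(w, x)
            grad[0] += err
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    return w

# Invented Area 1 records: (male, income, age, bought).
area1 = [
    (1, 35000, 30, 0), (1, 42000, 38, 1), (1, 50000, 45, 1),
    (1, 38000, 52, 0), (1, 45000, 41, 0), (1, 39000, 35, 1),
    (0, 36000, 29, 0), (0, 44000, 40, 1), (0, 52000, 47, 1),
    (0, 40000, 55, 0), (0, 41000, 43, 1), (0, 47000, 33, 0),
]
rows = [features(m, inc, age) for m, inc, age, _ in area1]
labels = [b for *_, b in area1]
w = fit_logistic(rows, labels)

# Score the Area 2 average profiles from the question, then weight by 45%/55%.
male_p = predict(w, features(1, 40000, 40))
female_p = predict(w, features(0, 41000, 42))
area2_p = 0.45 * male_p + 0.55 * female_p
print(f"male={male_p:.3f}  female={female_p:.3f}  area 2={area2_p:.3f}")
```

The result is only as good as the zeros: if the non-buyers are not a representative sample of Area 1, the predicted probabilities will be off even if the fit itself runs fine.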

This reminds me of one of those no-denominator problems, which are sometimes solved using "disproportionality" methods. That is, find which strata appear in the data at a higher proportion than would be expected if the strata variables were truly independent. Other than something like this, or possibly some method that's really more data mining/machine learning, I don't know of a way.
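
A rough sketch of that observed-versus-expected idea, using invented buyer counts (60 buyers cross-classified by sex and a hypothetical income band; none of this comes from the original data). For each stratum it divides the observed count by the count expected if sex and income band were independent among buyers. The helper name `disproportionality` is made up for the example.

```python
from collections import Counter

# Invented buyer records: (sex, income band). Buyers only, no denominator.
buyers = (
    [("M", "low")] * 5 + [("M", "high")] * 25
    + [("F", "low")] * 15 + [("F", "high")] * 15
)

n = len(buyers)
joint = Counter(buyers)
sex_totals = Counter(sex for sex, _ in buyers)
band_totals = Counter(band for _, band in buyers)

def disproportionality(sex, band):
    """Observed count divided by the count expected under independence."""
    expected = sex_totals[sex] * band_totals[band] / n
    return joint[(sex, band)] / expected

for sex, band in sorted(joint):
    print(f"{sex}/{band}: observed/expected = {disproportionality(sex, band):.2f}")
```

With these made-up counts, high-income men show up 1.25 times as often as independence would predict and low-income men only half as often. Ratios well above 1 flag the strata worth looking at, but they describe the mix of buyers only; they are not purchase probabilities.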