Hi everyone,

I could do with some guidance on a problem I have. My stats knowledge is very elementary.

Let's say I'm a company that sells musical instruments and I have data on the proportion of musicians per ZIP code, the number of live performances per year per ZIP, and average household income per ZIP. For each data set I order the ZIP codes and split them into 10 groups, each containing a tenth of the population.
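
In case it helps to see the grouping step concretely, here is a rough Python sketch of what I mean. The function name `population_deciles` and the sample numbers are made up for illustration, and assigning each ZIP by the midpoint of its population span is just one assumed way of handling ZIPs that straddle a group boundary.

```python
def population_deciles(zips, n_groups=10):
    """Assign each ZIP to a group 1..n_groups so that each group holds
    roughly an equal share of the total population.

    `zips` is a list of (zip_code, metric_value, population) tuples.
    Group 1 has the lowest metric values, group n_groups the highest.
    """
    ordered = sorted(zips, key=lambda z: z[1])  # order ZIPs by the metric
    total_pop = sum(z[2] for z in ordered)
    groups = {}
    cumulative = 0
    for zip_code, _, pop in ordered:
        # Assumption: place the ZIP by the midpoint of its population span.
        midpoint = cumulative + pop / 2
        group = min(int(midpoint / total_pop * n_groups) + 1, n_groups)
        groups[zip_code] = group
        cumulative += pop
    return groups


# Made-up data: (ZIP, % musicians, population), split into 2 groups for brevity
example = [
    ("10001", 0.5, 1000),
    ("10002", 1.0, 1000),
    ("10003", 2.0, 1000),
    ("10004", 4.0, 1000),
]
print(population_deciles(example, n_groups=2))
```

I run this once per variable (musicians, performances, income), so each ZIP ends up with three group numbers.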

I then compare my own internal customer data with these groupings and calculate a return on investment (ROI) for each group (e.g. if I have 100 customers from ZIPs in the 1st group, I calculate an ROI for that group). I see that the cost of reaching customers for a first interaction (e.g. a website visit), the chance they buy something, and their basket value are all correlated with which group of ZIP codes they fall into. For example, if they are from the 10% group with the highest proportion of musicians, my cost of acquiring them as a lead is lower, their chance of converting is higher, and their basket value is higher compared to the mean. I could then, for example, say with confidence that, looking at proportion of musicians by ZIP, my ROI is 25% better than the mean for a customer from the top group (highest proportion of musicians).

I want to know how I can combine all three variables (proportion of musicians, income and number of performances) so that my estimate of ROI includes information from all of them. It is important that, just as when I look at one variable, I can estimate how much better or worse ROI is based on ZIP code.

To be clear, the data I have is the following:

External Data:

- Number of musicians per ZIP code + total ZIP population (so I can get a % of musicians)
- Avg. household income per ZIP code
- Number of live performances per ZIP code (could also convert this to performances per 1,000 people or something)

Internal Data:

- All customers and potential customers, and their ZIP codes
- This means I can work out the % of potential customers that convert
- The cost of getting a potential customer
- The value of an actual customer
- Essentially, I can get an ROI of customers for each 10th of ZIP codes.
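
To show how I get from the internal data to an ROI per group, here is a minimal Python sketch. It assumes ROI means (revenue - cost) / cost, and the function name `roi_by_group`, the field names and the numbers are all invented for the example.

```python
from collections import defaultdict


def roi_by_group(leads, zip_to_group):
    """Compute ROI per ZIP group, assuming ROI = (revenue - cost) / cost.

    `leads` is a list of dicts with keys: zip, cost, converted, value.
    `zip_to_group` maps each ZIP code to its group number.
    Assumes every group has a total acquisition cost greater than zero.
    """
    cost = defaultdict(float)
    revenue = defaultdict(float)
    for lead in leads:
        group = zip_to_group[lead["zip"]]
        cost[group] += lead["cost"]          # every lead costs something
        if lead["converted"]:
            revenue[group] += lead["value"]  # only buyers bring revenue
    return {g: (revenue[g] - cost[g]) / cost[g] for g in cost}


# Made-up leads and group assignments
leads = [
    {"zip": "10001", "cost": 10.0, "converted": False, "value": 0.0},
    {"zip": "10001", "cost": 10.0, "converted": True, "value": 30.0},
    {"zip": "10004", "cost": 5.0, "converted": True, "value": 40.0},
    {"zip": "10004", "cost": 5.0, "converted": True, "value": 20.0},
]
zip_to_group = {"10001": 1, "10004": 2}
print(roi_by_group(leads, zip_to_group))
```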

So far I have considered just taking the % difference from the mean for each variable, adding them together and dividing by three. However, this is obviously very simplistic, and for a start I can see it doesn't take into account whether one variable is more important than the others.
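
For reference, this is roughly what that simple averaging looks like in Python. The function names and numbers are made up, and the optional `weights` argument is only there to show where "importance" would have to come in; I don't know how to choose the weights, which is really my question.

```python
def pct_diff_from_mean(group_roi, mean_roi):
    """Percentage difference of a group's ROI from the overall mean ROI.
    Assumes mean_roi is non-zero."""
    return (group_roi - mean_roi) / mean_roi * 100


def naive_combined_uplift(uplifts, weights=None):
    """Average the per-variable % differences for one ZIP.

    With weights=None every variable counts equally, which is the
    'add together and divide by three' approach.
    """
    if weights is None:
        weights = [1.0] * len(uplifts)
    return sum(u * w for u, w in zip(uplifts, weights)) / sum(weights)


# Made-up uplifts for one ZIP: musicians, performances, income
uplifts = [25.0, 10.0, -5.0]
print(naive_combined_uplift(uplifts))  # equal weights: (25 + 10 - 5) / 3
```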

I would really appreciate some guidance on the best method to use and where to read up on it!

P.S. If there is a simpler/better method that would work using just two variables, let me know!
