Multiple Regression to receive cost for a single item of an order

Hi everybody,

I appreciate any kind of help and tips:

- I have a dataset with 1600 orders
- For each of these orders I know the total costs & products included (e.g. 7975€ & A, C, D, H)
- I don't know the individual prices for the range of products (in total 80), which would, for instance, be helpful to calcuate new offers .
--> please see the attached pic

Potential Solultion: Multiple regression
- no of observations = no of orders
- independent variables = all products (= ~80)
- dependent variable = total cost (per order)

- Is a multiple regression a suitable statistical tool?
- If not: Is there an alternative?
- If yes: Are there any specialities I have to consider (e.g. multicollinearity --> how can I reduce it, etc.)?

Many thanks for your help! :)


Less is more. Stay pure. Stay poor.
Yeah, I could see multiple linear regression working or getting you in the ballpark, then you confirm results based on a few future orders. Are you going to treat products as binary variables (1/0)? That is what I would do. I don't think MC should impact it.

This also seems solvable in matrix algebra. Though I do not know MA that well. Which regression uses MA.
Can you de-duplicate the rows and find the inverse matrix of the Product Matrix?

Linear algebra would be as follows:
1) AB = C where A = Product Matrix, B = Price Vector, and C = Order Price Vector
2) Inv(A)AB = Inv(A)C where Inv(A)A = Identity Matrix
3) IB = Inv(A)C
4) B = Inv(A)C

If you cannot invert matrix A then the regression solution might be good enough. But if it works, then B should yield a vector of the prices for each Product.
Thanks a lot for your instant answer, hlsmith & Oberon!
1) Linear Algebra:
- Based on my current state of knowledge, I think I cannot build a quadratic (= invertible) product matrix, which is necessary to use the approach of Oberon, right?

2) Multiple regression
- Sounds good that a multiple linear regression is a valid approach
- @hlsmith: yes, I want to treat products as binary variables
- When I conduct a multiple regression with Stata, I receive improper values, e.g.
- negative coefficients (= cost per product)
- low R^2 (0.72)
- high standard error (e.g. 638 for coefficient 1108)
- 30% of coefficients not statistically significant
- omitted variables because of multicollinearity​
- a potential reasons might be: costs for individual product calculated within an order vary strongly from order to order
- what are possible countermeasures: clean up data? My first approaches would be:
- combine correlated variables to factor?
- de-duplicate rows (orders)?
- exclude those orders that contain products with strongly varying cost per product over different orders?​
- Do you have more ideas?

Thanks a lot!


Less is more. Stay pure. Stay poor.
Can you just solve the easier products first? Find some one who only bought a couple of product and different combinations of those? Or if a person who just bought a single product - kind of like a word decipher!


Less is more. Stay pure. Stay poor.
Yes, that would be another approach. If there are 80 possible products, for observations, what is the fewest products selected.
Hello again,
I have worked on my data record and regression for the last few days, but unfortunately I have not received the results I expected. Please check out the excel file attached:

  • Some orders seem to be calculated accurately (e.g. order 156 - 172); Those orders contain mostly the same products)
  • For other orders the deviation (regression – actual, column BB) is partly pretty high; those orders contain often a variety of different products (e.g. order 103 -155).
  • Some products have negative costs (e.g. ca, cc,..)

-->Considering the results and the structure of the data record, is a multiple regression the right tool?

-->Will the results be more accurate if I drop more datasets (with outliers) or focus on those with more reasonable results first?
Many thanks!!