# Multiple Regression to receive cost for a single item of an order

#### christoph_b

##### New Member
Hi everybody,

I appreciate any kind of help and tips:

Problem:
- I have a dataset with 1600 orders
- For each of these orders I know the total costs & products included (e.g. 7975€ & A, C, D, H)
- I don't know the individual prices for the range of products (in total 80), which would, for instance, be helpful to calcuate new offers .
--> please see the attached pic

Potential Solultion: Multiple regression
- no of observations = no of orders
- dependent variable = total cost (per order)

Questions:
- Is a multiple regression a suitable statistical tool?
- If not: Is there an alternative?
- If yes: Are there any specialities I have to consider (e.g. multicollinearity --> how can I reduce it, etc.)?

Chris

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Yeah, I could see multiple linear regression working or getting you in the ballpark, then you confirm results based on a few future orders. Are you going to treat products as binary variables (1/0)? That is what I would do. I don't think MC should impact it.

This also seems solvable in matrix algebra. Though I do not know MA that well. Which regression uses MA.

#### Oberon

##### New Member
Can you de-duplicate the rows and find the inverse matrix of the Product Matrix?

Linear algebra would be as follows:
1) AB = C where A = Product Matrix, B = Price Vector, and C = Order Price Vector
2) Inv(A)AB = Inv(A)C where Inv(A)A = Identity Matrix
3) IB = Inv(A)C
4) B = Inv(A)C

If you cannot invert matrix A then the regression solution might be good enough. But if it works, then B should yield a vector of the prices for each Product.

#### christoph_b

##### New Member
1) Linear Algebra:
- Based on my current state of knowledge, I think I cannot build a quadratic (= invertible) product matrix, which is necessary to use the approach of Oberon, right?

2) Multiple regression
- Sounds good that a multiple linear regression is a valid approach
- @hlsmith: yes, I want to treat products as binary variables
- When I conduct a multiple regression with Stata, I receive improper values, e.g.
- negative coefficients (= cost per product)
- low R^2 (0.72)
- high standard error (e.g. 638 for coefficient 1108)
- 30% of coefficients not statistically significant
- omitted variables because of multicollinearity​
- a potential reasons might be: costs for individual product calculated within an order vary strongly from order to order
- what are possible countermeasures: clean up data? My first approaches would be:
- combine correlated variables to factor?
- de-duplicate rows (orders)?
- exclude those orders that contain products with strongly varying cost per product over different orders?​
- Do you have more ideas?

Thanks a lot!
Cheers
Chris​

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Can you just solve the easier products first? Find some one who only bought a couple of product and different combinations of those? Or if a person who just bought a single product - kind of like a word decipher!

#### christoph_b

##### New Member
Ok. So you mean start with some chosen orders and products and then enlarge it step by step?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Yes, that would be another approach. If there are 80 possible products, for observations, what is the fewest products selected.

#### christoph_b

##### New Member
Hello again,
I have worked on my data record and regression for the last few days, but unfortunately I have not received the results I expected. Please check out the excel file attached:

• Some orders seem to be calculated accurately (e.g. order 156 - 172); Those orders contain mostly the same products)
• For other orders the deviation (regression – actual, column BB) is partly pretty high; those orders contain often a variety of different products (e.g. order 103 -155).
• Some products have negative costs (e.g. ca, cc,..)

-->Considering the results and the structure of the data record, is a multiple regression the right tool?

-->Will the results be more accurate if I drop more datasets (with outliers) or focus on those with more reasonable results first?
Many thanks!!