Regression Analysis

#2
Naza, there are a few assumptions that are made during a regression analysis

1) The data generally fits a linear model, Yi = B0 + B1*X1i + B2*X2i + ... + ei
2) The error terms (e) are independent
3) The error terms have a mean of zero
4) The error terms have a constant variance
5) The error terms are normally distributed

Looking through the PDF quickly, they appear to have satisfied all of the assumptions.

Now looking at the model and its diagnostics, I am not totally convinced, because they left out the possibility of collinearity among the predictors. Usually when there are multiple predictors, people include the Variance Inflation Factor (VIF), which gives you an idea of how correlated they are. Collinearity is not good, as it gives you over-confidence in the model. Ask to see the VIFs of the coefficients. The general rule of thumb is that if they are all under 10, then you are okay.
 
#5
Yeah, I'd be okay adopting this model. The R-squared is around 50%, so understand that it is limited in how accurately it can predict future values. So long as people's lives are not at stake, I'd be okay with this.
 

ondansetron

TS Contributor
#7
VIF <10 should not be the only criterion, nor is 10 necessarily “the” value to start worrying about multicollinearity. You should check for estimate instability and compare the sign and magnitude of estimates with what’s expected by theory and see how these fluctuate or change when some variables are omitted or when you refit the model on a random subset of the data set. This is really a full picture assessment for multicollinearity rather than just using the VIF.
 

noetsi

Fortran must die
#8
You should start by simply seeing whether MC is an issue. It is if none of the slope coefficients are significant even though the overall model is. If that does not occur, I am not sure I would even pay attention to VIF.
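That classic symptom is easy to reproduce in a small simulation (hypothetical data; statsmodels assumed available): with two near-duplicate predictors, the overall F-test is highly significant while the slope standard errors are massively inflated, so the individual t-tests are typically non-significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.02 * rng.normal(size=n)   # nearly a duplicate of x1
y = 1.0 + x1 + x2 + rng.normal(scale=2.0, size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
f_p = fit.f_pvalue       # overall model: strongly significant
slope_ses = fit.bse[1:]  # slope standard errors: blown up by the collinearity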
 
#9
You should start by simply seeing whether MC is an issue. It is if none of the slope coefficients are significant even though the overall model is. If that does not occur, I am not sure I would even pay attention to VIF.
His attached report does show all 3 factors as significant, with all p-values under .02. So at this point I guess MC should be investigated next.

VIF <10 should not be the only criterion, nor is 10 necessarily “the” value to start worrying about multicollinearity. You should check for estimate instability and compare the sign and magnitude of estimates with what’s expected by theory and see how these fluctuate or change when some variables are omitted or when you refit the model on a random subset of the data set. This is really a full picture assessment for multicollinearity rather than just using the VIF.
I knew VIF < 10 was a rule-of-thumb shortcut, but I never knew how to truly test for MC. Are you saying to run reduced models and compare the coefficients? For example, run all combinations of factors and see whether, for the same predictor, the coefficient estimates have overlapping confidence intervals? Just curious how to do this check for MC.
 

ondansetron

TS Contributor
#10
I knew VIF < 10 was a rule-of-thumb shortcut, but I never knew how to truly test for MC. Are you saying to run reduced models and compare the coefficients? For example, run all combinations of factors and see whether, for the same predictor, the coefficient estimates have overlapping confidence intervals? Just curious how to do this check for MC.
In a way, this would be a way to check the so-called instability of the coefficients.

Imagine I fit a model of

y-hat = b0 + b1X1 + b2X2

where bi is the estimated beta coefficient for the i-th independent variable.

Suppose X1 and X2 are problematically collinear.

The estimate of b1 with both X1 and X2 in the model may be -5, for example, when theory tells us the coefficient should be positive. If I rerun the model without X2 and see that the coefficient on X1 is now +8, this might be evidence that the collinearity is making the estimate of b1 unstable, and that collinearity may be at a problematic level if we want to make inferences on the true value and direction of a beta parameter from this model. (This similarly applies to b2, dropping X1 to see what happens to X2's coefficient, and it extends to larger models, though it need not involve all estimates in the model, only those in the collinear group or groups.)

Another way to see some of the impact of multicollinearity is to take a resample with replacement of size equal to the original sample size (or a random subset of the original data set), fit the model on it, and see how dramatic the change is in the coefficients for the suspected collinear variables. I wouldn't necessarily say run all combinations, though, and I wouldn't use confidence intervals in that sense.

Evaluating the severity of multicollinearity is a more involved process usually than looking just at the VIF because we need to look for something to suggest a problematic symptom (i.e. beta estimates with wrong signs when we want to make an inference on the parameter).
 

ondansetron

TS Contributor
#11
You should start with simply seeing if MC is an issue. It is if the slope coefficients are all not significant when the model is. If this does not occur I am not sure I would even pay attention to VIF.
You can have severe MC in the presence of individually significant coefficients, but overall, the VIF or tolerance should only be one piece of the puzzle, and you need to decide whether prediction or inference is more your goal. If prediction, I think the bigger concern is that the pattern of MC holds in the new data sets (I don't have personal experience with this part, so I'm not sure how much of a real problem it would be, since the predictions remain unbiased, as do the coefficient estimates, assuming MC is the only issue), or that you don't extrapolate with the model (which is good advice regardless of the degree of MC). But prediction generally means MC is far less of a concern, as far as I know.
 

naza

New Member
#12
Hi, for this model I ran Best Subsets regression first before choosing the best model (using Minitab software). That means I already evaluated all combinations of predictors to determine which combination makes the best model. At first I had 5 predictors, but the best model only has 3 predictors.