# Comparing correlated data

#### bc1212

##### New Member
Hello,

I am a master's student at university; however, I do not have any statistical qualifications beyond A-level (many years ago!), so please excuse my lack of knowledge.

My research is comparing two pre-treatment measures (A and B) to treatment success, and investigating if A or B is a better predictor of treatment success.

However, A and B are correlated (Pearson 0.649 (sig 0.000)). I understand that linear regression cannot be used with correlated data because this indicates confounding. However, theoretically one pre-treatment measure could still be a better predictor of treatment success, despite the correlation.

An example of this in a different context is:
BMI and waist size both affect chance of getting diabetes, but it is known that waist size is a better predictor. However, BMI and waist size still correlate with each other.

If linear regression cannot be used, then is there a statistical test that can be used to show this?

Many thanks

#### staassis

##### Member
Almost everything in life is correlated. Allowing several correlated predictors into a linear model is totally fine, as long as there is no multicollinearity. To see which predictor has the highest predictive power, you can look at a number of metrics. One useful metric is the standardized regression coefficient (often called "beta").

#### ondansetron

##### TS Contributor
> Almost everything in life is correlated. Allowing several correlated predictors into a linear model is totally fine, as long as there is no multicollinearity. To see which predictor has the highest predictive power, you can look at a number of metrics. One useful metric is the standardized regression coefficient (often called "beta").

Absence of multicollinearity is not an assumption for something like ordinary least squares linear regression. The assumption is the absence of perfect collinearity between independent variables, as this brings up issues regarding unique parameter estimates (and general fitting of the model). I would also note that, at least in what I am familiar with, the estimated "beta" is a raw coefficient, as opposed to the "standardized beta estimates", which are standardized as the name implies.

Linear regression can certainly be used with correlated x-variables as long as there is not perfect correlation among some group of them. In the case of correlated x-variables, one needs to be more careful about the inferences made regarding each individual variable's relationship with the DV (but in cases of prediction, it's less of a concern). What are your independent variables and how are they measured? There may be a few options for you to reduce the collinearity. Also, have you done an assessment of the degree of multicollinearity to see how much of a problem this actually poses in your case? It may be trivial.
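One common way to assess the degree of multicollinearity is the variance inflation factor (VIF). The sketch below is illustrative only: the function names `vif` and `vif_two_predictors` are made up for this example, and it assumes a model with an intercept and nothing but numeric predictors. With exactly two predictors, both VIFs reduce to 1 / (1 - r^2), so the Pearson correlation of 0.649 quoted in the question would correspond to a VIF of roughly 1.73 if A and B are the only predictors.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (rows = observations, columns = predictors).

    For a model with an intercept, the VIFs equal the diagonal of the
    inverse of the predictors' correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def vif_two_predictors(r):
    """Special case: with only two predictors, VIF = 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r**2)

# Correlation reported in the question (assumes A and B are the only predictors).
print(vif_two_predictors(0.649))

# Simulated stand-in data, NOT the poster's measurements.
rng = np.random.default_rng(0)
a = rng.standard_normal(300)
b = 0.65 * a + rng.standard_normal(300)
print(vif(np.column_stack([a, b])))
```

A frequently quoted rule of thumb treats VIFs above about 5 or 10 as worth a closer look, but that threshold is a convention rather than a test.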

#### bc1212

##### New Member
My independent variables are both percentages that relate to surfaces in the mouth covered by plaque. One IV looks at a patient's mouth as a whole, and the other IV looks at a specific area of the patient's mouth.

I've just run the multicollinearity of the two IVs, which has come back as 1.000.

#### staassis

##### Member
> Absence of multicollinearity is not an assumption for something like ordinary least squares linear regression. The assumption is absence of perfect collinearity between independent variables, as this brings up issues regarding unique parameter estimates.

I cannot agree with this. Even imperfect multicollinearity may lead to high variance in the coefficient estimates and, thus, high standard errors. Think: why would anybody invent ridge regression or other forms of regularization? I can refer you to a number of books on this, for example, chapters 3 and 7 of Hastie, Tibshirani, Friedman.
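The effect on standard errors can be made concrete with a small sketch. It assumes an OLS model with an intercept, a known error standard deviation `sigma`, and two artificial predictors built to have an exact sample correlation `r`; the helper names are hypothetical. Under those assumptions the slope standard errors grow by a factor of 1 / sqrt(1 - r^2) relative to the uncorrelated case, which is modest at r = 0.649 and much larger at r = 0.95.

```python
import numpy as np

def exact_corr_predictors(r, n=200, seed=0):
    """Two centred predictors whose sample correlation is exactly r."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 2))
    z -= z.mean(axis=0)
    q, _ = np.linalg.qr(z)      # orthonormal columns, still centred
    q *= np.sqrt(n)             # give each column sum of squares n
    x1 = q[:, 0]
    x2 = r * q[:, 0] + np.sqrt(1.0 - r**2) * q[:, 1]
    return np.column_stack([x1, x2])

def coef_std_errors(X, sigma=1.0):
    """Theoretical OLS slope standard errors: sqrt(sigma^2 * diag((X'X)^-1))."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    cov = sigma**2 * np.linalg.inv(X1.T @ X1)
    return np.sqrt(np.diag(cov))[1:]             # drop the intercept entry

baseline = coef_std_errors(exact_corr_predictors(0.0))
for r in (0.0, 0.649, 0.95):
    se = coef_std_errors(exact_corr_predictors(r))
    print(f"r={r}: SE inflation factor = {se[0] / baseline[0]:.3f}")
```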

> I also would note that, at least what I am familiar with, estimated "beta" is a raw coefficient as opposed to the "standardized beta estimates" which are standardized as the name implies.

As I mentioned, "beta" is another name for "standardized regression coefficient". For example, this is the terminology that SPSS is using.

#### Dason

> I cannot agree with this. Even imperfect multicollinearity may lead to high variance in the coefficient estimates and, thus, high standard errors. Think: why would anybody invent ridge regression or other forms of regularization? I can refer you to a number of books on this, for example, chapters 3 and 7 of Hastie, Tibshirani, Friedman.

It isn't an assumption though. And although it impacts tests for individual coefficients, it doesn't change predictions. They aren't saying there are no reasons to care about it, only that calling it an actual assumption of the model is incorrect.

> As I mentioned, "beta" is another name for "standardized regression coefficient". For example, this is the terminology that SPSS is using.

I think you missed the point? 'beta' often refers to an unstandardized coefficient. The terminology isn't consistent across uses though, so it's just good to clarify. If somebody asked me to provide the betas from a model I fit, I would assume they want the raw unstandardized coefficients.
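To make the two meanings of "beta" concrete, here is a minimal sketch on simulated data (the variables `a`, `b` and `y` are invented stand-ins, not the poster's measurements, and `ols` and `zscore` are hypothetical helpers). It fits the same model twice, once on the raw variables and once on z-scored variables, and shows that the standardized coefficient is the raw coefficient rescaled by sd(x) / sd(y).

```python
import numpy as np

def ols(X, y):
    """OLS slopes (intercept fitted but not returned)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[1:]

def zscore(v):
    """Centre and scale to unit sample standard deviation."""
    return (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)

rng = np.random.default_rng(1)
n = 500
a = rng.normal(50, 20, n)                # hypothetical measure A
b = 0.5 * a + rng.normal(0, 10, n)       # hypothetical measure B, correlated with A
y = 0.02 * a + 0.05 * b + rng.normal(0, 1, n)
X = np.column_stack([a, b])

raw = ols(X, y)                          # unstandardized coefficients
standardized = ols(zscore(X), zscore(y)) # standardized coefficients
rescaled = raw * X.std(axis=0, ddof=1) / y.std(ddof=1)

print("raw:         ", raw)
print("standardized:", standardized)
print("raw rescaled:", rescaled)         # matches the standardized fit
```

Because the two sets of numbers can rank predictors differently when the predictors have different spreads, stating which kind is being reported avoids confusion.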

#### ondansetron

##### TS Contributor
> I cannot agree with this. Even imperfect multicollinearity may lead to high variance in the coefficient estimates and, thus, high standard errors.

I appreciate your disagreement, but it simply isn't an assumption that is required, or even one that is made for analysis and then relaxed in practice. The assumption is regarding perfect collinearity. You're correct that imperfect MC might lead to inflated standard errors, but that's not always a concern. In cases of building models for prediction purposes, you're much less likely to have any concern for the collinearity since prediction is the goal rather than inferences on beta parameters.

> Think: why would anybody invent ridge regression or other forms of regularization? I can refer you to a number of books on this, for example, chapters 3 and 7 of Hastie, Tibshirani, Friedman.

You needn't refer me to any sources to believe you. But the logic is rather poor: someone came up with some popular methods to handle this, therefore, it must violate an assumption of the method. These methods of regularization were developed for many reasons, including more trustworthy interpretations of the coefficients and the good old bias-vs.-efficiency trade-off that people often weigh. However, one of the reasons is not that imperfect multicollinearity violates an OLS assumption.
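The bias-versus-efficiency trade-off can be sketched with the closed-form ridge estimator. This is an illustration under stated assumptions, not a recommendation for the poster's data: predictors and response are centred so no intercept is needed, the error standard deviation `sigma` is treated as known, the data are simulated with a correlation near 0.95, and the function names are hypothetical. Setting `lam` to 0 recovers OLS; larger `lam` shrinks the coefficients and lowers their total variance at the cost of bias.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^-1 X'y for centred X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def total_coef_variance(X, lam, sigma=1.0):
    """Sum of coefficient variances: sigma^2 * trace(W X'X W), W = (X'X + lam*I)^-1."""
    p = X.shape[1]
    W = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma**2 * np.trace(W @ X.T @ X @ W)

# Simulated, strongly correlated predictors (illustrative only).
rng = np.random.default_rng(3)
n = 100
x1 = rng.standard_normal(n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = x1 + x2 + rng.standard_normal(n)
y = y - y.mean()

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_coefs(X, y, lam), total_coef_variance(X, lam))
```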

> As I mentioned, "beta" is another name for "standardized regression coefficient". For example, this is the terminology that SPSS is using.

In the regression context, beta has traditionally been the Greek letter used to denote the unknown population regression coefficients (or the probability of a Type II error, beyond and including regression), which are estimated by the corresponding beta-hat symbols/terms. I can refer you to several books where beta refers to the raw coefficient and standardized beta is referred to as the standardized beta coefficient, but that doesn't seem too important. I'm just pointing out what I had seen as a standard terminology to help clarify for the OP.

#### Dason

Technically the model doesn't even have an assumption of no perfect collinearity. It's perfectly possible to fit models like that. The estimates won't be unique, but the model itself doesn't make an assumption about that.
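A small sketch of that point, using simulated data in which one column is exactly twice another (an artificial setup chosen for illustration). NumPy's `np.linalg.lstsq` still returns a least-squares solution (the minimum-norm one), and adding any multiple of the null-space direction gives a different coefficient vector with identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.standard_normal(n)
# Third column is exactly 2 * second column: perfect collinearity.
X = np.column_stack([np.ones(n), x, 2.0 * x])
y = 1.0 + 3.0 * x + rng.standard_normal(n)

# lstsq copes with the rank-deficient design and returns the minimum-norm solution.
beta_min_norm, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# X @ (0, 2, -1) = 2x - 2x = 0, so moving along this direction changes
# the coefficients but not the fitted values.
beta_other = beta_min_norm + 5.0 * np.array([0.0, 2.0, -1.0])

print("rank of design matrix:", rank)
print("coefficients differ:  ", not np.allclose(beta_min_norm, beta_other))
print("fitted values agree:  ", np.allclose(X @ beta_min_norm, X @ beta_other))
```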

#### ondansetron

##### TS Contributor
> Technically the model doesn't even have an assumption of no perfect collinearity. It's perfectly possible to fit models like that. The estimates won't be unique but the model itself doesn't make an assumption about that.

I have at least one book that uses no perfect collinearity as an assumption, granted it's an econometrics book by Wooldridge... I wonder where he got that as an assumption; my guess would be the degenerate (singular) matrix part of the linear algebra. I always learned that there are essentially 4 assumptions in practice, though (errors ~ i.i.d. N(0, sigma^2), with sigma^2 constant). Now that I think about the derivations we did, the "collinearity assumption" they made never came into play, but it made sense as to why it plays into unique estimates.

#### staassis

##### Member
> I appreciate your disagreement, but it simply isn't an assumption that's required or even made for analysis but relaxed in practice. The assumption is regarding perfect collinearity. You're correct that imperfect MC might lead to inflated standard errors, but that's not always a concern. In cases of building models for prediction purposes, you're much less likely to have any concern for the collinearity since prediction is the goal rather than inferences on beta parameters.

I never said that no multicollinearity is a formal algebraic assumption of Ordinary Least Squares. Obviously, the solution (X'X)^{-1} * X'Y exists whenever the matrix (X'X) is invertible. This is trivial. We are not passing a midterm in Statistics 102 here. Our task is somewhat bigger. bc1212 is asking for an accurate framework for comparing the relative predictive power of predictors A and B. Read his/her post. He/she is not building a black-box data mining model, where out-of-sample prediction is the only concern. He/she wants to use the model for inference. Therefore, "inflated standard errors" are a huge concern, contrary to your statement above. Removing multicollinearity is likely to improve the accuracy of inference by an order of magnitude.

> You needn't refer me to any sources to believe you. But the logic is rather poor: someone came up with some popular methods to handle this, therefore, it must violate an assumption of the method.

Again, nobody is talking about "assumptions" which are formally violated. I am just drawing attention to bad statistical practices.

> In the regression context, beta has traditionally been the Greek letter used to denote the unknown population regression coefficients

I would not say "traditionally". Some resources use β as notation for unstandardized regression coefficients. But many others use B, A, φ or something else. Separately, many resources use the word "beta" to refer to standardized regression coefficients (see an SPSS example below). That is why I said: "One useful metric is the standardized regression coefficient (often called "beta")." Did I say "always"?.... But I agree: this is a minor point. Your tolerance to multicollinearity is the truly dangerous element here.

#### staassis

##### Member
> Technically the model doesn't even have an assumption of no perfect collinearity. It's perfectly possible to fit models like that. The estimates won't be unique but the model itself doesn't make an assumption about that.

Good point, Dason. Thank you for bringing it into the discussion.

#### ondansetron

##### TS Contributor
> I never said that no multicollinearity is a formal algebraic assumption of Ordinary Least Squares. Obviously, solution (X'X)^{-1} * X'Y exists whenever matrix (X'X) is invertible. This is trivial. We are not passing a midterm in Statistics 102 here. Our task is somewhat bigger. bc1212 is asking for an accurate framework for comparing relative predictive power of predictors A and B. Read his/her post. He/she is not building a black-box data mining model, where out-of-sample prediction is the only concern. He/she wants to use the model for inference. Therefore, "inflated standard errors" are a huge concern, contrary to your statement above. Removing multicollinearity is likely to improve the accuracy of inference by an order of magnitude.

You quoted my post where I said absence of MC is not an assumption, then went on to say you can't agree, which seemed very much like you were saying that absence of MC is an assumption. Maybe I misunderstood that part. If you reread post #3, you'll see that I'm clarifying for OP that your post isn't a generally true statement and that the OP should investigate the degree of MC in his or her model. It may be a huge problem or it may be negligible, even for inferential purposes. You'll also notice that I introduced the OP to both sides of the potential issue.

Without having seen any output, we can take a guess based on the Pearson correlation or compute a VIF, but without the actual output and some other information, it's hard to judge how problematic MC is in this case. I'm open to the possibility that OP has different independent variables beyond A and B, which would be useful to know. Again, no one is disagreeing with you on variance inflation, possibly inappropriate inferences, and potential issues with parameter estimate instability.

> Again, nobody is talking about "assumptions" which are formally violated. I am just drawing attention to bad statistical practices.

It's much clearer in this post what you are doing than in the several prior ones, where it seemed you were doing the former.

> I would not say "traditionally". Some resources use β as notation for unstandardized regression coefficients. But many others use B, A, φ or something else. Separately, many resources use word "beta" to refer to standardized regression coefficients (see an SPSS example below). That is why I said: "One useful metric is the standardized regression coefficient (often called "beta")." Did I say "always"?.... But I agree: this is a minor point. Your tolerance to multicollinearity is the truly dangerous element here.

For the "often" and "always" debate, sure, you said "often", but again, it was possibly unclear for the OP, who admitted to being less familiar with statistics. SPSS also isn't what I would call a resource for learning statistics (nor would I call any statistical software one). Also, the "B" in SPSS is the capital Greek letter beta, which is just another point that illustrates the benefit of clarification.

I think you're grasping at straws in a defensive effort and jumping to the conclusion that I have an unreasonable "tolerance to multicollinearity." Before jumping to further incorrect positions, reread what I wrote in post #3. I encouraged the OP to look into the degree of MC because it may or may not actually be problematic for inferences. I'd be curious to see how that indicates a "truly dangerous" "tolerance" regarding multicollinearity.

It seems, overall, that a few of these points of discussion could be avoided with increased clarification in the responses:

> Almost everything in life is correlated. Allowing several correlated predictors into a linear model is totally fine in cases of prediction, but for inferences on the betas as you want, you'll want to minimize the issues of multicollinearity.
>
> To see which predictor has the highest predictive power, you can look at a number of metrics. One useful metric is the standardized regression coefficient (called "beta" in SPSS, not to be confused with the unstandardized coefficient which may also be called beta).

My intention wasn't any sort of argument, but rather, clarification for OP. I think this is apparent in my initial few replies.