# Regression Diagnostics

#### noetsi

##### Fortran must die
It's been years since I ran a linear regression, and I wanted to ask about these diagnostics. I have about 5,400 cases.

My QQ plot. This does not look normal to me, but I am not sure the departure is enough to matter with this many cases.

The residuals. I don't think heteroscedasticity or non-linearity is indicated, but with this many points it never looks like the pictures in my textbooks.

What tests am I missing? (The tolerance test does not indicate multicollinearity, although I am not concerned about that anyway.)

#### noetsi

##### Fortran must die
This one baffles me. Both the dependent and independent variables are interval (spending/income), yet this looks like they were categorical.

#### Dason

##### Ambassador to the humans
Looks like you might want to investigate those outliers

#### noetsi

##### Fortran must die
I will, Dason. Can outliers change a continuous distribution this way? I have never seen results like this with interval data.

#### hlsmith

##### Not a robit
Agreed. In your plot way above, the mass looks like it is around maybe 2K or less, but you have a few outliers way the heck out there.

#### noetsi

##### Fortran must die
Ignoring outliers for a moment, hlsmith, would you agree that there is no indication of non-linearity or heteroscedasticity in the residuals at the top?

I will try to find out what outliers exist. I don't know if PROC REG has a way to determine what they are.

#### noetsi

##### Fortran must die
I removed 14 extreme data points and reran the regression. To me the residuals showed no obvious problem, but the data are not remotely normal. Given that we spend far more on some customers than others, I am not sure I should expect normality.

One problem is that a very high proportion of our customers get no spending, so say 90 percent of the cases have zero on certain predictors (interval variables, but with most cases at one level). I don't know how this affects the slopes or whether it violates the regression assumptions; I have not seen that point addressed for interval data.


#### noetsi

##### Fortran must die
One thing that I realized as I worked back through the data is that a high proportion of our customers get no spending on the key predictor variables, which are spending. I am honestly not sure how that influences the regression results. I have only seen this addressed with categorical data.

#### Buckeye

##### Member
What if you log-transform the dependent variable? That might help the nonconstant variance.


#### obh

##### Member
The QQ plot pattern looks very clean (and not normal). It doesn't look like random outliers...

#### GretaGarbo

##### Human
> What if you log-transform the dependent variable? That might help the nonconstant variance.

Yes, what happens then? Or even the log of logs. And it might get rid of the outlier problem.

Isn't it natural that a logged dependent variable would work better, as it often does for economic variables?

If a large proportion of customers do not get any money at all, maybe a zero-inflated model should be considered.

#### ondansetron

##### TS Contributor
> Yes, what happens then? Or even the log of logs. And it might get rid of the outlier problem.
>
> Isn't it natural that a logged dependent variable would work better, as it often does for economic variables?
>
> If a large proportion of customers do not get any money at all, maybe a zero-inflated model should be considered.
Yeah, I've heard that, thinking a priori, with salary or money you can just take ln(Y), since it's bounded below by zero and may have large outliers, and this will generally help stabilize the variance.

I've had a project or two where the nonconstant variance was explained by plotting against a grouping variable, although that didn't correct the problem since the variable was already in the model; I tried both the natural log and the square root of the DV, but nothing alleviated the huge disparity in error variance between groups. At least plotting by the grouping variable helped explain the finding in my paper (GPA error variance was far smaller in an honors program than in the general admission program).

#### noetsi

##### Fortran must die
I can log the DV. I am sort of flying blind here as, unlike economics, there is no theory to build on in what I do (I looked again last week).

What is the impact on the slope if most, say 90 percent, of an interval predictor's values are at one level (zero)? And is there any way to address this problem? For categorical variables, having a dominant level like this attenuates the slope, but I have never seen this addressed for an interval predictor.

#### noetsi

##### Fortran must die
I ran it against the log of my DV and the residuals look much better (although I am not sure how I interpret the logs now; do I transform them back to do the interpretation?).

I still don't understand what is occurring here. This is interval data, not categorical data... but I found out something baffling that may have to do with the queries. There are commonly 10 or fewer distinct levels for these variables, although in theory you could spend anywhere from a hundred dollars to a million. I have to find out why.


#### Buckeye

##### Member
For a 1 unit increase in the predictor variable, the dependent variable changes by a factor of 10^(beta coefficient).

#### ondansetron

##### TS Contributor
> For a 1 unit increase in the predictor variable, the dependent variable changes by a factor of 10^(beta coefficient).

Would it be helpful to point to the "semi-elasticity", too, for noetsi's econ familiarity, no? A one-unit increase in the x-variable is, on average, associated with a percent change in Y, all else constant? Or did I goof something?
Also, @noetsi, how were the data collected? Were respondents allowed to enter their own values for that independent variable, or were they forced to "check a box" for that variable, maybe with an option for "other", and most people just picked the closest?

#### Dason

##### Ambassador to the humans
Also make sure you know what base your log function is using. It will most likely either be base 'e' or base 10 but look at the documentation to check.
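For example, NumPy's `log` is the natural (base-e) log, with `log10` for base 10, and (if I recall the SAS functions correctly) SAS's `LOG` is natural log as well, with `LOG10` for base 10. Easy to verify:

```python
import math
import numpy as np

# numpy.log is base e; numpy.log10 is base 10
assert np.isclose(np.log(np.e), 1.0)
assert np.isclose(np.log10(10.0), 1.0)
# the two differ by a constant factor, so mixing them up rescales every slope
print(math.log(100), math.log10(100))
```

Using the wrong base doesn't change significance tests, but it rescales the coefficients, so the percent-change interpretation comes out wrong.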

#### noetsi

##### Fortran must die
I will look at it.

I have what I think is a final question. Some of my spending predictors have 21 or fewer distinct levels (because we pay to benchmarks, so everyone gets pretty much the same amount or falls in a narrow range); the fewest is 8 distinct levels. I don't know whether, when you get down to that few levels (there are 5,000-plus cases, so it's only the levels we are talking about), you have to make the variable categorical or can still treat it as interval.

#### Buckeye

##### Member
> Would it be helpful to point to the "semi-elasticity", too, for noetsi's econ familiarity, no? A one-unit increase in the x-variable is, on average, associated with a percent change in Y, all else constant? Or did I goof something?

I suppose you can interpret it in different ways. It just requires a few more steps of math.

#### noetsi

##### Fortran must die
I spent some time reading about it. It matters somewhat whether it's an interval or a dummy predictor, and which base you use.

For the natural log of the DV and an untransformed IV...

Here the slope of the IV is .46 and it is a dummy variable, so it's (exp(.46) - 1) times 100, which in this case is about 58 percent.

And the interpretation: married taxpayers, on average, make charitable contributions about 58 percent higher than unmarried taxpayers, holding income and price constant.

(Here married couples are coded 1; otherwise the slope would be negative .46.)

For interval predictors with a logged DV and an unlogged IV, with slope .00666305, (exp(slope) - 1) × 100 is the percent change in Y for a one-unit change in X, holding other variables constant.
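The arithmetic above, spelled out (the .46 and .00666305 slopes are the ones quoted in this post):

```python
import math

# dummy predictor, natural-logged DV: percent difference between the groups
pct_dummy = (math.exp(0.46) - 1) * 100       # about 58.4 percent

# interval predictor, natural-logged DV: percent change in Y per unit of X
pct_unit = (math.exp(0.00666305) - 1) * 100  # about 0.67 percent
print(pct_dummy, pct_unit)
```

For small slopes the shortcut "slope × 100 ≈ percent change" is close, as the second number shows, but the exact exp() form is safer for slopes like .46.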