Normality

#1
I ran a test of normality in Stata on my dependent variable. The test was significant, which indicates a violation of normality. After taking the log, normality is still violated. What is the next step to take?
 

hlsmith

Not a robit
#2
Why are you trying to normalize the variable?


Can you post a histogram and qq-plot of your data?


Large datasets will typically fail a normality test even when the departures from normality are very small. What is your sample size? Also, feel free to upload the normality test results.
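To illustrate the sample-size point, here is a minimal sketch using the Jarque-Bera statistic with its asymptotic chi-square p-value. The thread does not say which test was actually run, so the choice of test, the skewness and excess kurtosis values, and the sample sizes are all assumptions made for illustration.

```python
import math

def jarque_bera_p_value(n, skewness, excess_kurtosis):
    """Asymptotic p-value of the Jarque-Bera statistic (chi-square, 2 df)."""
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # For a chi-square with 2 degrees of freedom, P(X > x) = exp(-x / 2).
    return math.exp(-jb / 2.0)

# Hypothetical, deliberately small departure from normality.
skewness, excess_kurtosis = 0.1, 0.1
sample_sizes = [100, 1000, 10000]
p_values = [jarque_bera_p_value(n, skewness, excess_kurtosis) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>6}: p = {p:.4f}")
```

The shape of the data is held fixed here; only n changes, and the p-value shrinks as n grows.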
 

ondansetron

TS Contributor
#3
Very good questions to raise. Also, what is the dependent variable, including its units of measure? A log transformation might not be the most appropriate transformation.
 
#4
The next step is to check normality of the residuals (i.e., the dependent variable given the explanatory variables). The marginal distribution of the dependent variable is irrelevant.
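A minimal simulated sketch of why the residuals are what matter: with a 0/1 predictor that shifts the mean, the dependent variable is bimodal and clearly non-normal, while the residuals are normal. The data, the effect size of 10, and the helper `kurtosis` are hypothetical and not taken from the thread.

```python
import random
import statistics

def kurtosis(values):
    """Plain (non-excess) sample kurtosis; about 3 for normal data."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m4 = sum((v - mean) ** 4 for v in values) / len(values)
    return m4 / m2 ** 2

random.seed(1)
n = 2000
x = [i % 2 for i in range(n)]                    # hypothetical 0/1 predictor
y = [10 * xi + random.gauss(0, 1) for xi in x]   # normal errors around each group mean

# Simple linear regression fitted in closed form.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

kurt_y = kurtosis(y)
kurt_resid = kurtosis(residuals)
print(f"kurtosis of y: {kurt_y:.2f}, kurtosis of residuals: {kurt_resid:.2f}")
```

A normality check on `y` would flag this variable, yet the regression assumption about the errors holds.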
 

ondansetron

TS Contributor
#5
Good catch, I didn't think to make sure the OP was doing that since I kind of assumed that's what had been done :O
 
#6
The more I post in this forum the more I learn (THANKS) and the more I realize I don't know anything :eek: (CRIES)

I'm on my way out, but I will definitely go over everything
 
#7
My dependent variable is on a scale from 0 to 100 (percentage), but I took the log (it seemed logical at the time).

My main drivers are also on a scale from 0 to 100, but I had to transform them so that I only keep values of fifty and higher.

Then I have some economic control variables (GDP) and a nominal dummy control variable (1/0).

I also checked for outliers. I've attached the kdensity (histogram?) and qqplot output, along with the other tests I used for normality.
 
#8
That seems to be OK. And with the large sample size, the parameter estimates will be approximately normal by the central limit theorem anyway, so it is OK to go ahead with inference based on normal theory. (In my view, but we can all be wrong.)
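A small simulation sketch of the central limit argument: the errors are strongly skewed, yet the slope estimates across repeated samples are close to symmetric. The sample size, number of replications, error distribution, and true coefficients are all assumptions chosen for illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness; about 0 for symmetric data."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

def ols_slope(x, y):
    """Slope of a simple linear regression of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

def skewed_error():
    """Centered exponential error: mean 0, theoretical skewness 2."""
    return random.expovariate(1.0) - 1.0

random.seed(3)
n, replications = 200, 500
x = [(i / (n - 1)) ** 2 for i in range(n)]   # hypothetical fixed predictor values

slopes = []
for _ in range(replications):
    y = [1.0 + 2.0 * xi + skewed_error() for xi in x]
    slopes.append(ols_slope(x, y))

error_skew = skewness([skewed_error() for _ in range(5000)])
slope_skew = skewness(slopes)
print(f"skewness of errors: {error_skew:.2f}, skewness of slope estimates: {slope_skew:.2f}")
```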
 

ondansetron

TS Contributor
#9
I agree with you that the departure from normality doesn't appear to be large enough to cause an appreciable issue, especially at that sample size.

Would you be able to show us a histogram and normal probability plot of your untransformed residuals?
 

ondansetron

TS Contributor
#12
The tests for the untransformed and the transformed variable were both significant
Formal tests for normality aren't really a great idea for several reasons. In particular, they're often very sensitive to slight departures from normality, and this problem grows as the sample size increases. I would personally not even run a formal test of normality and would just rely on the histogram, stem-and-leaf plot, and normal probability plot. If you check your standardized residuals and investigate the suspect outliers (absolute standardized value between 2 and 3) and the outliers (absolute standardized value greater than 3), as well as other regression diagnostics (influential observations), I think you'll be better off than with the formal normality tests, especially since regression methods perform pretty well in the presence of outliers and moderate non-normality.

The difference between your transformed and untransformed plots doesn't look too big. I would see how much the conclusions vary between the two models. Did you happen to look into the constant variance assumption using a plot of residuals vs. predicted y values in the transformed and untransformed models? I think the homoscedasticity assumption is more important than the normality issue.
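The 2 and 3 cutoffs for standardized residuals can be sketched as follows. This is a simplified illustration: it standardizes by the overall mean and standard deviation of the residuals and ignores leverage, which regression software typically accounts for. The residual values and the function name `classify_residuals` are hypothetical.

```python
import statistics

def classify_residuals(residuals):
    """Label each residual by its absolute standardized value."""
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    labels = []
    for r in residuals:
        z = abs((r - mean) / sd)
        if z > 3:
            labels.append("outlier")    # absolute standardized value above 3
        elif z >= 2:
            labels.append("suspect")    # absolute standardized value between 2 and 3
        else:
            labels.append("ok")
    return labels

# Hypothetical residuals: 40 unremarkable values plus two large ones.
residuals = [-1.0, 1.0] * 20 + [3.0, 4.5]
labels = classify_residuals(residuals)

for r, label in zip(residuals, labels):
    if label != "ok":
        print(f"residual {r:+.1f}: {label}")
```

Flagged observations are candidates for a closer look, not automatic deletions.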
 
#13
Ohh I see. Good to know!

I didn't check for homoscedasticity using plots; I used the Breusch-Pagan test. In both cases there was heteroscedasticity, so I ran the regression with robust standard errors.

But I've attached the plots
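For readers unfamiliar with the idea behind the Breusch-Pagan test, here is a minimal sketch of its studentized (Koenker) variant for a single predictor: regress the squared residuals on the predictor and compare n times the R-squared with a chi-square distribution with 1 degree of freedom. This is not necessarily the exact statistic any particular software reports, and the simulated data are hypothetical.

```python
import math
import random
import statistics

def ols_fit(x, y):
    """Intercept and slope of a simple linear regression of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope

def r_squared(x, y):
    """R-squared of a simple linear regression of y on x."""
    intercept, slope = ols_fit(x, y)
    y_bar = statistics.fmean(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan_p_value(x, y):
    """Studentized Breusch-Pagan test with one predictor (chi-square, 1 df)."""
    intercept, slope = ols_fit(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, squared_resid)
    # For a chi-square with 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lm / 2.0))

random.seed(2)
n = 500
x = [1 + 9 * i / (n - 1) for i in range(n)]
# Hypothetical data whose error standard deviation grows with x.
y = [2 + 0.5 * xi + random.gauss(0, xi) for xi in x]

p_value = breusch_pagan_p_value(x, y)
print(f"Breusch-Pagan p-value: {p_value:.2e}")
```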
 

ondansetron

TS Contributor
#14
Understood that you used BP and then robust SEs. As you can see in the plot, the vertical spread of the residuals is not approximately constant, which indicates the errors might not have constant variance (as you already saw using the BP test). Are you using any categorical variables? If so, it would probably be helpful to plot the residuals using the categorical variable as a grouping variable (this should give you the same plots, but with a different symbol or color representing each group). This way you can get an idea of why the heteroscedasticity is occurring (i.e., whether each group has the same pattern or the groups just have different variances from one another).
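As a numeric companion to the grouped plot, the residual spread can also be compared across the levels of a categorical variable. A minimal sketch, with made-up residuals and a made-up 0/1 dummy (the function name `residual_sd_by_group` is hypothetical):

```python
import statistics
from collections import defaultdict

def residual_sd_by_group(residuals, groups):
    """Standard deviation of the residuals within each group level."""
    by_group = defaultdict(list)
    for r, g in zip(residuals, groups):
        by_group[g].append(r)
    return {g: statistics.stdev(values) for g, values in by_group.items()}

# Hypothetical residuals and the matching dummy variable values.
residuals = [-0.5, 0.4, 0.1, -0.2, 0.3, -3.0, 2.5, 1.8, -2.2, 0.9]
dummy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

spread = residual_sd_by_group(residuals, dummy)
for level, sd in sorted(spread.items()):
    print(f"group {level}: residual SD = {sd:.2f}")
```

A large gap between the group standard deviations would point to the groups having different variances, rather than a shared pattern within each group.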
 
#17
I would search the documentation on your software to see how to apply a plotting or grouping variable to a scatter plot. It really depends on the program you're using.
Not sure if I did it correctly. I am using Stata at the moment.
I managed to change the color for only one of my categorical variables and can't seem to select more (if that makes sense?). When I tried another categorical variable, the colored dots were exactly the same.
 
#18
I don't think your output is right. What ondansetron was looking for was just a scatterplot like the one you made before, but with a different marker for each level of your categorical variables. So if you had sex in the model, all of the females would be one color or shape and the males would be a different color or shape. It seems like you got close with your code, though it appears you have two values for each observation in the figure.