Normality: DV itself or residuals?

#1
I always thought that we should be testing the assumption of normality by looking at the distribution of the DV in our design (by each level of the IV). But I keep seeing people talking about the residuals of the DV. For example, in ANOVA (and I suppose by extension the t-test and regression), should we check normality of our DV or should we calculate the errors of the DV for each level of the IV? Would be great if someone can point me to a good resource that covers this.
 

hlsmith

Not a robit
#2
It is the residuals. Though it's nice to know the distribution of the variables you are working with.

@Dason - can you send a link to you all's paper!
 
Last edited:
#4
Thanks, I'll give it a read. Just skimming over the part on normality, though, it seems the assumption of normality concerns the difference between the observed data and the regression model's predictions, i.e. the residuals. But what I don't understand is that in order to find out what the residuals are, you must run the regression in the first place, yet you're supposed to test the assumption of normality before you run the regression... So I know that I'm misunderstanding something here, because this is circular logic.

I'm just thinking, if I want to decide whether to analyse my data with ANOVA or some non-parametric alternative, then I check whether I'm violating the assumptions. If the residuals are calculated after running the test itself, then I have no way of knowing whether the test is appropriate before running it in the first place. I was taught to check whether my DV itself was normally distributed, and if so, go parametric; otherwise, non-parametric.
 

Dason

Ambassador to the humans
#5
It might sound counter intuitive but we typically don't check the normality, equal variances, independence, ... assumptions until after we fit the model. The assumptions are basically all on the error term which we use the residuals as a stand-in replacement for when checking the assumptions. There are a few things you can do before fitting the model but honestly it doesn't really matter much - you'll need to fit the model anyways to get the other diagnostics.
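A minimal sketch of what that looks like in R (with made-up names 'score', 'group', and 'mydata' standing in for your actual data):

Code:
# Fit the model first, then check the assumptions on its residuals
fit <- aov(score ~ group, data = mydata)   # or lm() for a regression
res <- residuals(fit)

qqnorm(res); qqline(res)     # normality of the residuals
plot(fitted(fit), res)       # roughly constant spread across fitted values?
shapiro.test(res)            # a formal test, if you really want one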
 
#6
It might sound counter intuitive but we typically don't check the normality, equal variances, independence, ... assumptions until after we fit the model.
Not so much counter intuitive, just completely the opposite of what I thought I was being taught! I study psychology, and all resources explicitly talk about checking assumptions prior to applying tests. It makes me wonder whether people teaching/writing the textbooks don't fully understand it themselves, and/or there is a large degree of subjectivity.
 

Karabiner

TS Contributor
#7
Not so much counter intuitive, just completely the opposite of what I thought I was being taught! I study psychology, and all resources explicitly talk about checking assumptions prior to applying tests.
Not at all opposite. Dason wrote "we typically don't check ... assumptions until after we fit the model". He did not say: "we typically don't check assumptions until after we have performed statistical tests of significance".

It makes me wonder whether people teaching/writing the textbooks don't fully understand it themselves, and/or there is a large degree of subjectivity.
Or maybe students sometimes do not read carefully enough.

With kind regards

Karabiner
 
Last edited:

spunky

Doesn't actually exist
#8
Not so much counter intuitive, just completely the opposite of what I thought I was being taught! I study psychology, and all resources explicitly talk about checking assumptions prior to applying tests. It makes me wonder whether people teaching/writing the textbooks don't fully understand it themselves, and/or there is a large degree of subjectivity.
This is quite true indeed. I also come from a closely-related field (Education), and the quality of statistical/methodological training is abysmal, mostly the fault of people who do not quite understand these methods being tasked with teaching them to the new generations.

Every time I have a student tell me "but my textbook says...." or "well, I was taught that...", my immediate reply is something like "Sure. But then again you're coming from the field that started the Replication Crisis so... ¯\_(ツ)_/¯"

BTW, shameless self-promotion that is actually tied to what you're asking: https://psychometroscar.com/2018/07/11/normality-residuals-or-dependent-variable/
 
#9
Not at all opposite. Dason wrote "we typically don't check ... assumptions until after we fit the model". He did not say: "we typically don't check assumptions until after we have performed statistical tests of significance".
Good point.

Or maybe students sometimes do not read carefully enough.

With kind regards

Karabiner
I guess I'm one of those students then!

Every time I have student tell me "but my textbook says...." or "well, I was taught that..." my immediately reply is something like "Sure. But then again you're coming from the field that started the Replication Crisis so... ¯\_(ツ)_/¯"
So could you both please recommend a textbook/resource that covers these topics correctly? All I ever hear about is Andy Field's book, but I can't see him covering this in a definitive way. On one page we should check assumptions and our DV should be normal, and on the next page it doesn't matter because the sampling distribution is normal, so we shouldn't bother checking whether our DV is normal.
 

Karabiner

TS Contributor
#10
I guess I'm one of those students then!
Don't know. After reading what Spunky wrote, I'd guess that teachers and teaching material are the main sources of error.
So could you both please recommend a textbook/resource that covers these topics correctly?
Good question. Unfortunately, I am not really up to date on that.

With kind regards

Karabiner
 

noetsi

Fortran must die
#11
A lot of texts stress looking at the univariate distribution, but the actual requirements of regression do not. Normality is not, I think, even one of the Gauss-Markov assumptions. Once you get to a few hundred cases, many argue it does not matter because of the CLT.
 

spunky

Doesn't actually exist
#12
So could you both please recommend a textbook/resource that covers these topics correctly? All I ever hear about is Andy Field's book, but I can't see him covering this in a definitive way. On one page we should check assumptions and our DV should be normal, and on the next page it doesn't matter because the sampling distribution is normal, so we shouldn't bother checking whether our DV is normal.
OMG, did the Andy Field book say that? And here I was hoping at least *he* would get things right. But unfortunately, by the time this content moves from statistics to the social sciences it has been digested and re-interpreted so many times that what you end up with is some bastardized version of what was initially meant.

Like sometimes I feel I wanna scream every time I hear the whole "just look for n>30 and you should be OK". OMG SO.MUCH.WRONG.WITH.THAT.

Honestly, I'm still searching for a good book in the social sciences that accurately reflects the stuff we need. For better or worse, the most "approachable" book I've ever found is this one:

https://www.amazon.ca/Mathematical-Statistics-Data-Analysis-Sets/dp/0534399428

And even I wouldn't dare use that for a graduate methods course in the social sciences. I usually just recommend it to people I'm helping supervise if I see they've got good math chops. And then we work on it together.
 

spunky

Doesn't actually exist
#14
That in social-science land it gets sold as the be-all, end-all of what a 'large' sample size should be. So I review article after article, paper after paper, saying some variant of "all assumptions were met because n>30 and therefore the Central Limit Theorem applies (CITATION FROM A METHODS TEXTBOOK)".

I think I even have 1 or 2 simulation examples in R that I now copy-paste into my reviews to show why n>30 doesn't automatically mean you can get away with whatever you want. Because the papers that use it don't have Ns even in the 100s. It's more like "We got 35 people. But by the powers of the CLT, everything that we did is now justified."
 
#15
OMG, did the Andy Field book say that? And here I was hoping at least *he* would get things right.
Well, I do have to say that the Field book is the best book I've used; maybe it's just me not getting my head around this in general. He does state that normality refers to the residuals of the model, "or the sampling distribution". However, we don't have access to the sampling distribution so we use our data as representative of the sampling distribution. (I don't know how accurate/valid that is). Then shortly after he talks about the CLT and how it means we can assume normality regardless of the shape of our sample data, and that with larger samples (I guess > 30?) we don't really need to worry about normality.

So should we even care about normality when analysing our data or not? And why is there so much emphasis on normal distributions and statistics that assume normality?

Thanks for the discussion!
 

spunky

Doesn't actually exist
#16
He does state that normality refers to the residuals of the model
Yes, distributional assumptions are on the residuals of our models.


"or the sampling distribution". However, we don't have access to the sampling distribution so we use our data as representative of the sampling distribution. (I don't know how accurate/valid that is).
I'm.... not sure what he means here. I mean, the sampling distribution of the standard deviation (coming from a normal parent population) is not normal; for normal data, the scaled sample variance follows a chi-squared distribution. Maybe he means the sampling distribution of the mean?
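If you want to see that first point for yourself, here's a rough sketch in R (the sample size of 10 is arbitrary):

Code:
# The sampling distribution of s from a normal population is skewed, not normal
set.seed(1)
s <- replicate(10000, sd(rnorm(10)))
hist(s, breaks = 50)   # for normal data, (n-1)*s^2/sigma^2 is chi-squared with n-1 df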


Then shortly after he talks about the CLT and how it means we can assume normality regardless of the shape of our sample data, and that with larger samples (I guess > 30?) we don't really need to worry about normality.
So... a lot to unpack here. First and foremost, the statement as stated here is simply wrong. The sampling distribution of the sample mean taken from a Cauchy distribution is NOT normal. It's Cauchy-distributed. The sampling distribution of the maximum value of a sample coming from a normal parent population is, in itself, NOT normal. It follows a Gumbel distribution.
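You can check the Cauchy claim with a couple of lines of R (sample size and number of replications are arbitrary):

Code:
# Means of Cauchy samples are still Cauchy: averaging never tames the tails
set.seed(1)
cauchy_means <- replicate(10000, mean(rcauchy(1000)))
range(cauchy_means)               # still produces wild values
hist(cauchy_means, breaks = 100)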

Now, what Andy is forgetting (and I'm gonna take your word for it because I don't have the book here) is that the Central Limit Theorem is asymptotic. Sure, at infinity a lot of distributions do converge to the normal one. But I don't think you or I or most people have access to infinite sample sizes, right? So what becomes REALLY crucial here is the rate of convergence. Or, in other words: how large does a LARGE SAMPLE need to be?

Example here. Consider a Poisson random variable (a type of discrete distribution) with parameter \(\lambda=.01\). Folk wisdom tells us we only need to get to n>30 so we can assume normality of the sampling distribution of the mean, right? Well, let's be REALLY generous with ourselves and start with n=100. Small simulation in R:
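Something along these lines (the number of replications is arbitrary; only n and lambda matter for the point being made):

Code:
# Simulate the sampling distribution of the mean for Poisson(lambda = 0.01)
set.seed(123)
reps   <- 10000   # number of simulated samples
n      <- 100     # sample size; change to 1000 or 10000 for the later plots
lambda <- 0.01

xbar <- replicate(reps, mean(rpois(n, lambda)))
hist(xbar, breaks = 50, main = paste("Sampling distribution of the mean, n =", n))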

[histogram of the simulated sample means, n = 100]

Yeah... this is not looking very normal to me. And notice how we're already... what? More than 3 times over the recommended n>30? Let's try with a sample size of 1000:

[histogram of the simulated sample means, n = 1,000]

Better... but still a lot of gaps in between. Ok, what about a sample size of 10,000:

[histogram of the simulated sample means, n = 10,000]

Yeah, that looks more promising. So... yeah. You will, as \(n \to \infty\), get the normal distribution from the Central Limit Theorem, assuming a few other things hold true first, like the Lyapunov condition. So this is not a free lunch that you can go around assuming all willy-nilly. Yes, it is a powerful result, and yes, it holds in a lot of cases, but there are other things that need to be true for it to work out. The question is: how large a sample, how many trials, or how much data do you have to collect before you have access to it? Because for most social-sciency stuff, getting samples in the 1000s is simply not feasible.

So should we even care about normality when analysing our data or not?
Over time you will learn that the correct answer to most of these questions is the very underwhelming and much hair-pulling "it depends". Does it matter much if you're calculating a simple means difference t-test? Probably not. Does it matter if you're doing some complicated Structural Equation Model? Oh, quite a bit! To the point that people have made whole careers just out of corrections to non-normality.

Very few things of what we do are routine. Aside from mind-numbingly boring and trivial cases, very few types of analyses can be done in a "rote" fashion. Sometimes you need to transform the data. Sometimes you need to move to a different, non-standard model. Sometimes you need to come up with your *own* version of a regression-type model. Nothing that has any real scientific interest can be default-coded in SPSS. Yes, you see a lot of people doing it and getting away with it because, honest to god, there aren't enough people in our fields who are properly trained enough to catch the mistakes that other people make.


And why is there so much emphasis on normal distributions and statistics that assume normality?
Honestly, I think it's mostly a combination of convenience and lack of formal training in more advanced methods. Keep in mind that most of the statistical methods that you know of and will be using were developed... what? Maybe around 100-150 years ago? People did not have access to the computing power that we have today, so they needed to make simplifying assumptions to keep these methods usable to a certain degree. And the normal distribution is a convenient assumption because it shows up a lot in nature. Then psychology, sociology and all the other -ologies from social-science-land showed up trying to gain legitimacy as scientific endeavours circa the 19th century, so they decided that whatever physics was doing (physics being considered the "standard" of science in pre-WWI Europe) they needed to do as well. That's the convenience part.

Now, the overall lack of formal training. There simply is no way around this, so I'm just gonna say it. Anything that doesn't assume a normal distribution is hard. Like, REAL hard. I've been doing research on the sampling distribution of the simple, bivariate correlation under a very specific (and restrictive) type of non-normality. If you assume normality, the sampling distribution of the correlation is a simple, friendly t-distribution like you'd find in linear regression. You know, a simple regression with only 1 predictor, so that the slope is the correlation coefficient if the variables are standardized. So far so good. This is how it looks under a very restrictive and relatively "simple" type of non-normality:

[plot of the sampling distribution of the correlation under this type of non-normality]

Could you imagine if we taught something like *that* to intro psych students? ;)
 
Last edited:
#17
Maybe he means the sampling distribution of the mean?
Yeah this is what he means.

Interesting post. I would like to learn as much as I can about all of this now, just I don't have a maths background so this is quite heavy. And that's the thing, if it were taught in psych classes then it would be quite pointless because most wouldn't have a clue. But this is clearly part of the problem - people need better training, and those training them also need better training I suppose. I remember I had an ex-physicist teaching my stats, but whenever things got complicated the attitude was "I'm not going into the details of why, just trust me .........". I was grateful at the time but now I'm questioning everything :D

What do you think this means for other non-psychology applications of parametric statistics? For example I'm aware many applications of "machine learning" are simply based on multiple linear regression. There is never any talk of specific parametric assumptions before applying such methods in any resource I've come across (other than that the outcome data should be continuous). I can't see why these would be less of a consideration just because they aren't necessarily interested in p < .05... they are still ultimately trying to fit a model that might not be appropriate.
 

Karabiner

TS Contributor
#18
So which rules of thumb could apply? Something like "If the distribution looks Cauchy or Poisson, then you cannot achieve normality of the sampling distribution, unless you have very large n's (in which case standard errors become so small that one wouldn't care anyway). But if the distribution looks roughly normal (or if Poisson, Cauchy, Gamma distribution etc. can be ruled out, based on substantive considerations), then 30 (or 100?) should be sufficient"?

Regards

Karabiner
 

spunky

Doesn't actually exist
#19
Yeah this is what he means.

Interesting post. I would like to learn as much as I can about all of this now, just I don't have a maths background so this is quite heavy. And that's the thing, if it were taught in psych classes then it would be quite pointless because most wouldn't have a clue. But this is clearly part of the problem - people need better training, and those training them also need better training I suppose. I remember I had an ex-physicist teaching my stats, but whenever things got complicated the attitude was "I'm not going into the details of why, just trust me .........". I was grateful at the time but now I'm questioning everything :D

What do you think this means for other non-psychology applications of parametric statistics? For example I'm aware many applications of "machine learning" are simply based on multiple linear regression. There is never any talk of specific parametric assumptions before applying such methods in any resource I've come across (other than that the outcome data should be continuous). I can't see why these would be less of a consideration just because they aren't necessarily interested in p < .05... they are still ultimately trying to fit a model that might not be appropriate.
Social-science-land (including Psych, obviously) is in a very weird position these days. On the one hand, they all recognize (or pretend to recognize) the importance of statistical and methodological expertise. But, on the other hand, they prefer not to hire people with this kind of expertise into their programs and opt for the two-for-one deal (the research term I’ve heard is ‘toofer’) of someone who has an applied, substantive area of expertise (usually cognitive or social/personality) and happens to have some methods chops. Although there are some people who can do both well, I’ve always argued that this is unsustainable in the long run. I’m sure I know more statistics/methodology than an applied person who has an interest in it, for no other reason than that I only need to focus on *my* area whereas the other person needs to focus on two and, when everything is said and done, there are only 24h in a day. But somehow we’ve managed to establish, as a field, that this is OK.

Dr. Leona Aiken from Arizona State University has done a lot of interesting research on this in psychology, showing some pretty convincing (and damning) evidence that, at least in the United States, people with PhDs in Quantitative or Mathematical Psychology are undervalued, lose out on academic positions for which they are perfectly qualified if the other candidates have a substantive area of expertise and, when they do get hired, are expected to perform FAR more service for the department than others. So kudos to you for questioning everything! Borrowing something from Twitter, I honestly don’t think I trust any psych/social science paper from before 2011, when the Replication Crisis started making its rumbles.

Well… a lot of Machine Learning stuff is more interested in prediction than in inference, and parametric assumptions matter most for inference. HOWEVER (and this is my real pet peeve): it doesn’t matter which method from the Machine Learning/Data Mining/Data Science/choose-a-fancy-name toolbox you want to use, there are ALWAYS assumptions in everything we do. And these methods are pernicious in the sense that they are very easy to implement but very hard to understand from a theoretical/mathematical perspective. And sometimes the method is developed before the theory gets solid around it, which means we don’t always understand why things work the way they work and, more importantly, in which cases they *shouldn’t* work. Yet you have a constantly increasing group of people jumping into this (because, if anything, it promises a quick, good-paying job) without the necessary theoretical skills to be critical of these methods. And that’s gonna end up spelling disaster at some point. Dr. Cathy O’Neil (author of “Weapons of Math Destruction”) goes into this in quite a detailed fashion and concludes her book by saying she is expecting a “2008 Financial Crisis” type meltdown (she was, after all, one of the “quants” of Wall Street who contributed to the Crisis) from applying these types of algorithms without any regard to their limitations or critical appreciation of what they can and cannot do. Of much more epic proportions, of course, because this won’t be limited to the financial markets, given that most of our everyday lives are influenced, in one way or another, by the decisions made by these algorithms.
 

spunky

Doesn't actually exist
#20
So which rules of thumb could apply? Something like "If the distribution looks Cauchy or Poisson, then you cannot achieve normality of the sampling distribution, unless you have very large n's (in which case standard errors become so small that one wouldn't care anyway). But if the distribution looks roughly normal (or if Poisson, Cauchy, Gamma distribution etc. can be ruled out, based on substantive considerations), then 30 (or 100?) should be sufficient"?

Regards

Karabiner
Uhm… I’m not sure if I’d be the right person to ask this because I’m very suspicious of rules of thumb. If you look into the theory where rules of thumb come from, you can almost always come up with enough counter-examples that either (a) make the rule of thumb invalid or (b) make it valid in such a restricted set of cases as to render it irrelevant.

Like… let’s unpack a few things. When you say “you cannot achieve normality of the sampling distribution”, the immediate question is: the sampling distribution of what? Are we exclusively talking about the sampling distribution of the mean? Because, for example, the sampling distribution of the standard deviation is not normal (for normal data, the scaled sample variance is chi-squared). And if you rule out enough distributions (because a similar result to what I showed with the Poisson can be shown with the Binomial, for instance, or the Hypergeometric), then you are probably going to end up where the original rule of thumb comes from: the t-distribution. If my memory serves me right, the whole n>30 thing comes from the old times, back in the day when we needed tables of numbers to obtain p-values. I think the story goes that by the time you get to n>30, the area under the curve of the t-distribution is within some sort of minimal error of the normal, so you can use the z-tables as opposed to the t-table. The irony being that the “rule of thumb” was really more a matter of convenience, where textbooks wanted to save on paper by re-using the z-tables for the t-test. But of course, this took on a mind of its own and now it’s prescribed almost as a theorem.
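For what it's worth, you can check in R just how "close" t and z are at the magic cutoff (df = 30 only because of the folklore):

Code:
# How different are t and z once you hit the "magic" n > 30?
qt(0.975, df = 30)    # about 2.04
qnorm(0.975)          # about 1.96
x <- seq(-4, 4, by = 0.01)
max(abs(pt(x, df = 30) - pnorm(x)))   # largest gap between the two CDFs: a few thousandths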

Now, when you say “standard errors become so small that one wouldn't care anyway”, let’s place ourselves on the opposite side of things. If you look at Yuan, Bentler & Zhang (2005), Eqn (6), you can see that the kurtosis of the distribution biases the MLE standard error of the variance (and the covariance/correlation). If the kurtosis is positive, the standard error is biased downwards and, if it is negative, it is biased upwards. And this is an asymptotic result, so it is not as if letting \(n \to \infty\) makes its influence go away. Now you can find yourself in the awkward situation where minute effect sizes are statistically significant not because the effect is there, but because of the kurtosis of the distribution. Or, on the other hand, you have to deal with the curious interplay between a negative kurtosis and the sample size. So… I don’t know about the “one wouldn't care anyway” part. *UNLESS* we are really *only* talking about the sample mean and nothing beyond that.
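Here is a rough simulation of that first point, using a t distribution on 5 df as a convenient positive-kurtosis example (all the numbers are arbitrary):

Code:
# Normal-theory SE of the sample variance vs. the truth under positive kurtosis
set.seed(1)
n    <- 200
reps <- 20000
# t with 5 df, rescaled so the population variance is 1 (excess kurtosis = 6)
s2 <- replicate(reps, var(rt(n, df = 5) * sqrt(3/5)))
sd(s2)                # empirical SE of the sample variance: noticeably larger...
sqrt(2 / (n - 1))     # ...than the normal-theory value of about 0.10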

So, to be honest, what I do these days is to advocate for training in simulation methods OR pairing up with people who can do simulations for you. Especially when it comes to power analysis and stuff like that. That would be my rule of thumb: check by simulation first.
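To make that concrete, "check by simulation" can be as small as something like this (a toy two-group t-test, not tied to anyone's actual study):

Code:
# Power "by simulation" for a two-group mean difference
set.seed(42)
power_sim <- function(n, delta, reps = 5000) {
  mean(replicate(reps, t.test(rnorm(n), rnorm(n, mean = delta))$p.value < 0.05))
}
power_sim(n = 30, delta = 0.5)   # compare with power.t.test(n = 30, delta = 0.5)$power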
 
Last edited: