Question about the normality assumption

#1
I see references everywhere to the normality assumption, e.g. for a t-test you read that "data must be approximately normally distributed". Reading further, you see that for a repeated-measures t-test we want the differences to be approximately normal, and for an independent-samples t-test we want the values in each group to be approximately normal. In the regression scenario (which I mention because I think a t-test can be run as a regression?), it is the residuals that must be normal, but I also see normality described as the "least important assumption", and even for the t-test people seem to shrug it off beyond a certain sample size.

My issue is that normality is a hypothetical, because the normal distribution only exists at infinity; my data can only ever be approximately normal. I am confused because I think we pick a test not based on the data we have obtained, but on the characteristics we think those data have in theory (i.e. in the population, which we do not have access to). For example, I should not run a t-test if I think the population generating the data is chi-squared distributed; no amount of non-significant normality testing can justify the t-test in that case. Similarly, if I assume, based on theory, that my data come from a normally distributed population, then it doesn't matter if my data don't look normal: the t-test is the appropriate model.

So my questions: the assumption clearly says our data must "follow a normal distribution", but does this refer to (1) what we actually see in our data, or (2) what we are hypothesising about our data? These seem different. If the former, then it feels like we can justify running a t-test without caring about the hypothesised population distribution. If the latter, then it seems like testing our data for normality is a total waste of time and we should instead just THINK about whether it is normal.

Another question: why care about normality when we know it to be implausible? E.g. maybe I am measuring the heights of women at a doctor's clinic. I can sample and sample until infinity, but I will never get a height of -150. Yet I can write a little program to sample from a normal distribution with mean 150 and SD 20 (or whatever it might be for height), and, in principle, I will one day see -150, -1500, and so on under the normal distribution. I understand it is about the t-distribution approximating the normal at infinity, but why even bother talking about normality if it is implausible for most data we collect?
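For illustration, a minimal sketch of that little program (Python here just as an example, using the mean of 150 and SD of 20 from the paragraph above; all numbers are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical "height" model from the question: normal with mean 150, SD 20.
mu, sigma = 150, 20

# Theoretical probability the model assigns to an impossible negative height:
# P(X <= 0) = Phi((0 - 150) / 20) = Phi(-7.5), roughly 3e-14.
p_negative = stats.norm.cdf(0, loc=mu, scale=sigma)
print(f"P(height <= 0) under the model: {p_negative:.2e}")

# Simulate a large sample; in practice you will essentially never see a
# negative draw, even though the model allows one in principle.
sample = rng.normal(mu, sigma, size=1_000_000)
print("negative draws in 1,000,000 simulations:", np.sum(sample < 0))
```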

Thank you.
 

katxt

Well-Known Member
#2
You're right, of course. Nothing real is truly normal. But many things are near enough. There are more important things to worry about.
As far as the tests go, all the things you mention are "residuals". In actual fact, it is the sampling distribution of the thing you are looking at that is supposed to be normal. This is guaranteed if the residuals are normal no matter the sample size, but it is near enough to normal if the residuals are roughly normal, or you have enough of them.
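For illustration, a quick simulation sketch of that point (the skewed exponential "residuals" and the sample size of 40 are arbitrary choices): even when the individual values are clearly non-normal, the sampling distribution of the mean is already close to symmetric.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Clearly non-normal population: a skewed exponential (arbitrary choice).
n, n_experiments = 40, 10_000

# Sampling distribution of the mean: one sample mean per simulated experiment.
means = rng.exponential(scale=1.0, size=(n_experiments, n)).mean(axis=1)

# The raw values are very skewed, but the means are much closer to symmetric.
raw = rng.exponential(scale=1.0, size=n_experiments)
print("skewness of raw values:          ", round(stats.skew(raw), 2))
print("skewness of sample means (n=40): ", round(stats.skew(means), 2))
```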
 
#3
You're right, of course. Nothing real is truly normal. But many things are near enough. There are more important things to worry about.
As far as the tests go, all the things you mention are "residuals". In actual fact, it is the sampling distribution of the thing you are looking at that is supposed to be normal. This is guaranteed if the residuals are normal no matter the sample size, but it is near enough to normal if the residuals are roughly normal, or you have enough of them.
You said it is the sampling distribution of the thing we are looking at. But we do not have access to that, since we don't run infinite experiments; we estimate it from our sample. So my confusion is why we should care about normality at all, because (1) if we can make a good argument that our "thing" is theoretically normal, then that is that and we go with a test that assumes it, even if our data/residuals are not normal... but (2) if we cannot make such an argument, then we use a test that suits the distribution we expect (or makes no assumption), and we stop caring about it again. Especially when learning about statistical testing, I find this emphasis on the normality assumption quite weird if it truly is unimportant for our observed data...
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
You said it is the sampling distribution of the thing we are looking at. But we do not have access to that, since we don't run infinite experiments; we estimate it from our sample. So my confusion is why we should care about normality at all, because (1) if we can make a good argument that our "thing" is theoretically normal, then that is that and we go with a test that assumes it, even if our data/residuals are not normal... but (2) if we cannot make such an argument, then we use a test that suits the distribution we expect (or makes no assumption), and we stop caring about it again. Especially when learning about statistical testing, I find this emphasis on the normality assumption quite weird if it truly is unimportant for our observed data...
Yes, you could just use regression for a t-test here and examine the residuals. Have you run many regressions? Because if you have, you will have noticed that the residuals can help you discern a non-linear relationship, a wrongly specified data-generating function, or variance heterogeneity (heteroscedasticity) showing up as a funnel shape in the residuals. I imagine that if the residuals are not normal because of any of these causes, or because of a poor random sample, then the standard errors may be off.
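A minimal sketch of that idea, with simulated two-group data (all numbers invented): the two-sample t-test and the regression on a 0/1 group dummy give the same t statistic, and the regression leaves you residuals to inspect for the problems described above.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)

# Simulated two-group data (purely illustrative numbers).
group = np.repeat([0, 1], 30)                    # 0/1 dummy for group
y = rng.normal(10, 2, size=60) + 1.5 * group     # group 1 shifted up by 1.5

# Independent-samples t-test...
t, p = stats.ttest_ind(y[group == 1], y[group == 0])

# ...is the same test as an OLS regression of y on the group dummy.
model = sm.OLS(y, sm.add_constant(group)).fit()
print("t-test:     t = %.3f, p = %.4f" % (t, p))
print("regression: t = %.3f, p = %.4f" % (model.tvalues[1], model.pvalues[1]))

# The regression residuals are what you would inspect for curvature,
# funnel shapes (non-constant variance), or gross non-normality.
residuals = model.resid
fitted = model.fittedvalues
```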
 
#8
It seems like there is general agreement about normality not being an important assumption then. But I am still unclear what to take away from this!

For example, if I run a t-test (or whatever test), I need to make sure I meet its assumptions, because when I send it off to Nature I will be asked what I did to ensure my t-test was appropriate. When they ask "were your data normally distributed?", should I say "normality is bs, don't keep asking me about normality" :p? Or should I say that it is a bad question, because "the unobservable sampling distribution (which nobody has access to) for my outcome variable is ASSUMED to be normal, which means I tick the box and move on, don't keep asking me about normality"?

I feel there needs to be clarity on this if people demand assumption checks, but I am going around in circles: care? don't care? Maybe it is just me being slow.
 

katxt

Well-Known Member
#9
Nothing is perfect.
Do your analysis. Do a normal probability plot of the residuals. If it is obviously not straight, then try a transformation and see what happens, or use a non-parametric test.
If all is OK, then plot the residuals against the predicted values. If there is no obvious pattern, just say that the assumptions have been met.
(Mind you, I have never submitted anything to Nature.)
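For what it's worth, a rough sketch of those two checks (the residuals and fitted values here are simulated placeholders; in practice they come from your own fitted model):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)

# Placeholder residuals and fitted values standing in for a real model fit.
fitted = rng.normal(20, 3, size=100)
residuals = rng.normal(0, 1, size=100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Normal probability (Q-Q) plot: roughly a straight line if the residuals
# are close to normal.
stats.probplot(residuals, dist="norm", plot=ax1)
ax1.set_title("Normal probability plot of residuals")

# Residuals vs predicted values: look for curvature or a funnel shape.
ax2.scatter(fitted, residuals)
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Predicted values")
ax2.set_ylabel("Residuals")
ax2.set_title("Residuals vs predicted")

plt.tight_layout()
plt.show()
```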
 
#10
I find the topic very frustrating. Screening data for normality - and, if it's not there, using non-parametric tests or transforming variables - was one of the first things we were taught in quantitative methods, and it was emphasised as being very important.

But then in much of academia/research, outside of teaching, no-one seems to actually care much about it. I don't know about Nature, but a lecturer told me the other day that in practice no-one at most journals they deal with ever asks about it.

This is probably largely down to me being dumb, but I have also yet to come across a resource that clearly explains, for the non-technically-minded, exactly when and how it matters, or at least what the debate is (as there seems to be some disagreement), and how best to address it. For our introductory QM course, as I say, it was emphasised as very important, but our textbook at the time seemed to suggest otherwise (given the CLT), though imo rather unclearly.

It seems like there is general agreement about normality not being an important assumption then.
I'm not sure about this. I've definitely seen a fair few people (including pure statisticians) suggest that, and I think there may be something of a consensus that some other issues, such as outliers, are a bigger concern. On the other hand, I'm reading Rand Wilcox's Basic Statistical Methods book at the moment, and regarding t-tests he says that non-normality doesn't matter much for symmetric, light-tailed distributions, but that for asymmetric light-tailed distributions and symmetric heavy-tailed distributions power can be an issue, and that for asymmetric heavy-tailed distributions "serious concerns about Type I errors and power arise, and a sample size of 300 or more might be needed to correct any practical problems", which doesn't sound so good. (That's as far as I've got so far, but I imagine he would suggest using a robust t-test, e.g. based on trimmed means or winsorised data, as a sensitivity check or to replace Student's t-test altogether, as that seems to be his thing.)

Wilcox, Rand R. Understanding and Applying Basic Statistical Methods Using R. John Wiley & Sons, 2016.
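A minimal sketch of the kind of robust sensitivity check mentioned above, a 20% trimmed-mean (Yuen-type) t-test. This relies on the trim argument of scipy.stats.ttest_ind, which needs a reasonably recent SciPy (1.7 or later), and the lognormal samples are just an invented heavy-tailed example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two illustrative samples from a skewed, heavy-tailed distribution
# (lognormal), the kind of case where Wilcox raises concerns.
a = rng.lognormal(mean=0.0, sigma=1.0, size=40)
b = rng.lognormal(mean=0.3, sigma=1.0, size=40)

# Ordinary Student's/Welch t-test on the raw data.
print("Untrimmed:", stats.ttest_ind(a, b, equal_var=False))

# 20% trimmed-mean (Yuen-type) t-test as a robust sensitivity check.
# (The trim argument requires SciPy >= 1.7.)
print("Trimmed:  ", stats.ttest_ind(a, b, equal_var=False, trim=0.2))
```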
 
#11
I'm not sure about this.
That was possibly my bad wording: I meant not important to check when we run our test. In terms of the underlying theory of these statistical analyses, I understand that it forms part of the proof, that the entire thing rests on the assumption, and that is fine. But my problem is with observed data and testing THAT for normality. Why at all? As I understand it, normality is defined only at infinity and is therefore purely theoretical, and no amount of observed data can be considered 100% definitely normal.

But we can say that we drew some data from a normal distribution. IF we write a program that draws from a TRULY mathematically normal distribution, we know 100% that our data came out of it, because we made it that way! But we do not have this in real life; all we have is the assumption that our generating process (which we might assume, or reason, to be normal) produced some data. That seems to be something to defend through reasoning, not something to demonstrate by looking at a histogram or running a Shapiro-Wilk test. It is unlikely, but if I simulate a draw from a normal distribution I might obtain a very skewed sample. I don't care, though, because I claim it came from a normal distribution, so I carry on with my beloved t-test and publish in Nature.
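A small sketch of that last point: even when every sample really is drawn from a normal distribution, individual small samples can look quite skewed, and a Shapiro-Wilk test will still reject roughly 5% of them at the 0.05 level (the sample size of 20 and 2,000 repetitions are arbitrary choices).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

n, n_sims = 20, 2_000
max_skew, rejections = 0.0, 0

for _ in range(n_sims):
    # Every sample really is drawn from a normal distribution.
    x = rng.normal(loc=0, scale=1, size=n)
    max_skew = max(max_skew, abs(stats.skew(x)))
    if stats.shapiro(x).pvalue < 0.05:
        rejections += 1

print("most extreme sample skewness seen:", round(max_skew, 2))
print("Shapiro-Wilk rejection rate:", rejections / n_sims)  # close to 0.05
```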

This is causing me a lot of frustration.
 

katxt

Well-Known Member
#12
One of the reasons for checking the residuals is that deviations from normality may suggest a more appropriate model.
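One simulated example of what that can look like: residuals from a straight-line fit to data with a multiplicative error structure are clearly skewed, and the same check after a log transformation looks far better, i.e. the non-normality was pointing towards a more appropriate model (all numbers invented).

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)

# Simulated data with a multiplicative (lognormal) error structure.
x = rng.uniform(1, 10, size=200)
y = np.exp(1.0 + 0.4 * x + rng.normal(0, 0.5, size=200))

X = sm.add_constant(x)

# A straight-line fit to y: the residuals are skewed and fan out.
raw_fit = sm.OLS(y, X).fit()
print("skewness of residuals, y model:     ", round(stats.skew(raw_fit.resid), 2))

# The same model after a log transformation: residuals look much better,
# so the non-normality was pointing at a more appropriate model.
log_fit = sm.OLS(np.log(y), X).fit()
print("skewness of residuals, log(y) model:", round(stats.skew(log_fit.resid), 2))
```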