I see references to the normality assumption everywhere. For example, for a t-test you read that "data must be approximately normally distributed". Reading further, you see that for a repeated-measures t-test we want the differences to be approximately normal, and for an independent-samples t-test we want the values in each group to be approximately normal. In the regression setting (which I mention because I think a t-test can be run as a regression? see the sketch below), it is the residuals that must be normal, yet I also see normality described as the "least important assumption", and even for the t-test people seem to shrug it off beyond a certain sample size.
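A quick sketch of what I mean by "running a t-test as a regression" (the numbers are made up, just for illustration): an independent-samples t-test on two groups should give the same t and p as an OLS regression of the outcome on a 0/1 group indicator.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
group_a = rng.normal(150, 20, size=50)   # hypothetical heights, group A
group_b = rng.normal(155, 20, size=50)   # hypothetical heights, group B

# Classic independent-samples t-test (equal variances assumed)
t, p = stats.ttest_ind(group_a, group_b)

# The same comparison as a regression on a dummy variable
y = np.concatenate([group_a, group_b])
x = sm.add_constant(np.repeat([0, 1], 50))
fit = sm.OLS(y, x).fit()

print(t, p)                             # from the t-test
print(fit.tvalues[1], fit.pvalues[1])   # from the regression (t matches up to sign)
```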
My issue is that normality is a hypothetical: the normal distribution only exists in the limit, so my data can only ever be approximately normal. I am confused because I think we pick a test not based on the data we have obtained, but on the characteristics we believe those data have in theory (i.e. in the population, which we do not have access to). For example, I should not run a t-test if I think the data-generating population is chi-squared distributed; no number of non-significant normality tests can justify the t-test in that case. Conversely, if I assume on theoretical grounds that my data come from a normally distributed population, then it doesn't matter if my data don't look normal; the t-test is the appropriate model.
So my first question: the assumption says our data must "follow a normal distribution", but does this refer to (1) what we actually see in our data, or (2) what we are hypothesising about our data? These seem different. If the former, then it feels like we can justify running a t-test without caring about the hypothesised population distribution. If the latter, then testing our data for normality seems like a total waste of time, and we should instead just THINK about whether the population is normal.
Another question: why care about normality when we know it to be implausible? For example, suppose I am measuring the heights of women at a doctor's clinic. I can sample until infinity and I will never record a height of -150. But if I write a little program to sample from a normal distribution with mean 150 and SD 20 (or whatever the values might be for height), then in principle I will one day see -150, -1500, and so on, because the normal distribution puts probability on those values (a sketch of this below). I understand this has to do with the t-distribution approaching the normal in the limit, but why even bother talking about normality if it is implausible for most data we collect?
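The "little program" I have in mind looks something like this (again, the mean of 150 and SD of 20 are just illustrative numbers, not real height data):

```python
import numpy as np
from scipy import stats

mean, sd = 150, 20

# The model itself puts nonzero probability on impossible heights
print(stats.norm.cdf(0, loc=mean, scale=sd))      # P(height < 0), roughly 3e-14
print(stats.norm.cdf(-150, loc=mean, scale=sd))   # P(height < -150), vanishingly small but > 0

# Empirically, even many millions of draws will almost surely contain
# no negative values, yet "in principle" they can occur under this model
rng = np.random.default_rng(0)
draws = rng.normal(mean, sd, size=10_000_000)
print((draws < 0).sum())
```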
Thank you.