# z test and t test

#### awkhan

##### New Member
I understand that we go for the t-test in regression after the data have been normally distributed using the z distribution.

I need to know why we can't use the t-test in all phases. I mean, why does the t-test require the data to be normally distributed? Why can't the t-test handle this, given that it deals with samples as well?

I know that there are other non-parametric tests which don't expect the data to be normally distributed, but my question is specific to the t-test and z-test.

#### Miner

##### TS Contributor
The t-test only requires normality of the data at smaller sample sizes. It is robust against departures from normality at large sample sizes. The question then becomes how large a sample size? There is a rule of thumb that a sample size of >= 30 is sufficient. This will cover most situations where the departure from normality is moderate. More severe departures from normality require larger sample sizes.

#### awkhan

##### New Member
So are we saying here that the t-test doesn't need the data to be normally distributed if the sample size is >= 30? If that is the case, how will it handle the accuracy of the prediction? Here we are dealing with samples and making inferences about something we don't know.

#### Miner

##### TS Contributor
First, you aren't making a prediction. You are using a sample to make an inference about a population.

At small sample sizes a departure from normality will bias the p-value, which could potentially cause you to make a bad decision on rejecting or failing to reject the null hypothesis. As the sample size increases, the bias caused by the departure from normality gets smaller and smaller until the computed p-value converges to the true p-value.
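The shrinking bias can be illustrated with a small simulation. This is only a sketch under assumed settings (exponential data, a one-sample t-test, 5000 repetitions, an arbitrary seed); none of these choices come from the thread. Because the null hypothesis is true by construction, every rejection is a false positive, so the rejection rate should sit above the nominal 5% at n = 5 and move toward 5% as n grows.

```python
import numpy as np
from scipy import stats

def false_positive_rate(n, n_sims=5000, alpha=0.05, seed=0):
    """Fraction of one-sample t-tests that reject a null hypothesis that is true."""
    rng = np.random.default_rng(seed)
    # Exponential data with scale 1.0 are right-skewed and have a true mean of 1.0.
    samples = rng.exponential(scale=1.0, size=(n_sims, n))
    # Test H0: mean == 1.0 on each row. H0 is true, so each rejection is an error.
    result = stats.ttest_1samp(samples, popmean=1.0, axis=1)
    return float(np.mean(result.pvalue < alpha))

for n in (5, 30, 200):
    print(f"n={n:4d}  false positive rate={false_positive_rate(n):.3f}")
```

Changing the distribution (for example to something more heavily skewed) is a quick way to see how much larger the sample has to be before the rate settles near 5%.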

#### awkhan

##### New Member
I am a bit confused. Can you help answer the queries below?

- Are the z-test and t-test carried out only to make inferences about the population?
- Is the t-test preferred over the z-test when the sample size is >= 30?
- For prediction, do we not require any of these tests? If so, is there any prerequisite that we need to carry out before prediction so that we can minimize the error and improve the accuracy of the model?
- Earlier I thought that the data had to be normally distributed before prediction is carried out (where we probably go for the z-test). Is the t-test then used only for hypothesis testing?

#### Miner

##### TS Contributor
The z-test is usually only used in textbook examples. In actual practice a t-test is used. The heavier tails of the t distribution better reflect the real world, where the population standard deviation is unknown and has to be estimated from the sample. In a hypothesis test such as a t-test you use a sample to make an inference about the parent population. For example, in a 2-sample t-test, you use the results to infer whether the two samples came from the same population or from different populations.
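To make the "heavier tails" point concrete, here is a minimal sketch comparing two-sided 5% critical values from the t distribution and the standard normal. The helper name `t_critical` and the sample sizes are just illustrative choices. The t critical value is larger at small n and approaches the z value as n grows, which is why the two tests give nearly the same answer for large samples.

```python
from scipy import stats

# Two-sided 5% critical value of the standard normal (z) distribution.
z_crit = stats.norm.ppf(0.975)

def t_critical(n, alpha=0.05):
    """Two-sided critical value of Student's t for a single sample of size n."""
    return stats.t.ppf(1 - alpha / 2, df=n - 1)

for n in (5, 10, 30, 100, 1000):
    print(f"n={n:5d}  t critical={t_critical(n):.3f}  z critical={z_crit:.3f}")
```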

Part of the problem is that you are mixing hypothesis tests such as the z-test and t-test in with regression. I'm not sure why you are doing this, because they are used for very different purposes. Also, regression does not require normality of the data. I know some people and textbooks teach that, but it is false. Regression requires normality of the residuals, not of the data.
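The difference between normality of the data and normality of the residuals can be seen in a quick simulation. The sketch below assumes a made-up model (an exponentially distributed predictor, normal errors, intercept 3 and slope 2); the numbers are illustrative only. Both x and y come out strongly skewed, while the residuals from the fitted line are roughly symmetric.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 2000
x = rng.exponential(scale=2.0, size=n)           # strongly right-skewed predictor
noise = rng.normal(loc=0.0, scale=1.0, size=n)   # normally distributed errors
y = 3.0 + 2.0 * x + noise                        # response inherits the skew of x

# Ordinary least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

print("skewness of x:        ", stats.skew(x))
print("skewness of y:        ", stats.skew(y))
print("skewness of residuals:", stats.skew(residuals))
```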

#### awkhan

##### New Member
Thanks for all the responses. Can you help answer the queries below as well?
1) Do we need to carry out hypothesis tests before carrying out a prediction such as regression, to ensure that the samples taken come from the same population? If not, when are these tests carried out?
2) Regression requires normality of the residuals. If the residuals are not normally distributed, do we take one more sample until the residuals attain normality? Are there any other prerequisites for regression?
3) For the z-test, we plot a histogram to check whether the data are normally distributed. Similarly, is there a plot that we can visualize for the t-test? If yes, what do we need to look for in the plot to make inferences?
4) Will the t-test work better than the z-test when we take a sample size >= 30? If not, what would be the ideal sample size?
5) Do we use a box and whisker plot only to identify outliers? If there are outliers, do we go with the median, and if there are no outliers, with the mean? If not, is there any other measure?

#### Miner

##### TS Contributor
1) The only hypothesis test that is normally performed in association with regression is a test for normality of the residuals. I would recommend the Anderson-Darling test, but there are others as well.
2) Non-normality of the residuals is indicative of a problem with your model itself. This could be due to fitting a linear model to a nonlinear data set, missing variables, missing interactions, etc.
3) If you mean the residuals, see answer 1) above.
4) Why are you wanting to use either one?
5) Box and whisker charts are one way of many possibilities to detect outliers. Again, what are you trying to accomplish here?
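As a rough illustration of answers 1) and 2), the sketch below runs the Anderson-Darling test (via `scipy.stats.anderson`) on residuals from two fits to the same simulated data. The data-generating model (a quadratic relationship with normal noise), the seed, and the helper names are assumptions made up for this example. A straight line fitted to the curved data should produce a much larger A-D statistic than a quadratic fit, pointing at the model rather than at the sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, size=500)
y = 1.0 + x**2 + rng.normal(0.0, 1.0, size=500)  # truly quadratic relationship

def residuals_of_fit(x, y, degree):
    """Residuals from a least-squares polynomial fit of the given degree."""
    coefficients = np.polyfit(x, y, deg=degree)
    return y - np.polyval(coefficients, x)

def anderson_darling(residuals):
    """Return the A-D statistic and the 5% critical value for normality."""
    result = stats.anderson(residuals, dist="norm")
    # significance_level is expressed in percent; pick the entry closest to 5%.
    index_5pct = int(np.argmin(np.abs(result.significance_level - 5.0)))
    return result.statistic, result.critical_values[index_5pct]

for degree in (1, 2):
    statistic, critical = anderson_darling(residuals_of_fit(x, y, degree))
    print(f"degree={degree}  A-D statistic={statistic:.3f}  5% critical value={critical:.3f}")
```

A statistic above the critical value means normality of the residuals is rejected at that level.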

#### awkhan

##### New Member
4) My understanding of the t-test is that when we have one regressor (i.e. one independent variable), we carry out a t-test to check whether the variable is significant (i.e. whether X, the independent variable, has a direct relationship with the Y variable) by carrying out the hypothesis test. As we are working on a sample, I need to know what the ideal sample size would be to make inferences about the actual population parameter.
5) As x (the independent variable) should be normally distributed to make inferences about the population, my intention was to use a box and whisker plot to remove outliers.
6) When we have multiple regressors, I understand that we need to use the F-statistic to check whether the variables are significant, but in this case how should the samples be selected?
- Shall we take samples of the x1, x2, x3, ... variables in one go?
- Shall we take samples of x1 first and then check whether it is significant?
I am not sure what the approach should be.
7) Also, when plotting the residuals, the points lie very close to the x or y axis and the model appears to be a good fit, but I don't see any normality in the residual plot.
8) What would be the ideal sample size to choose when we have a single regression variable?
What would be the ideal sample size to choose when we have multiple regression variables?
9) I also read somewhere that Y (the dependent variable) will follow normality after carrying out predictions on the regressors over the N samples considered.
I mean:
y1 = c + b1x1 + b2x2 + E (1st sample)
y2 = c + b1x1 + b2x2 + E (2nd sample)
...
yn = c + b1x1 + b2x2 + E (nth sample)
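
For reference, the slope t-test described in point 4) can be sketched with `scipy.stats.linregress`, which reports a two-sided p-value for the null hypothesis that the slope is zero. The data below are simulated under assumed values (n = 40, true slope 0.5, normal errors) purely for illustration. The by-hand calculation uses the slope divided by its standard error, with n - 2 degrees of freedom, and should match the p-value that `linregress` returns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)  # assumed true slope of 0.5

fit = stats.linregress(x, y)

# t statistic for H0: slope == 0, with n - 2 degrees of freedom.
t_stat = fit.slope / fit.stderr
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("estimated slope:     ", fit.slope)
print("t statistic:         ", t_stat)
print("p-value (linregress):", fit.pvalue)
print("p-value (by hand):   ", p_manual)
```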