RESOLVED but still open for discussion: When to transform and when to use non-parametric tests?

#1
Hi Amazing Forum of wonderfully statistically minded people.

I recently did some statistical analyses on cattle trials, and was considerably helped by this forum. I used non-parametric tests for non-normal data. However, my supervisor says that, as a rule, it's better to transform the data and use 'more common' parametric tests (such as a GLM) than to use non-parametric tests. Is that so?

I've attached what I sent my supervisor, and I hoped some of you could take a side. Have I made a mistake? And if I should transform for parametric testing, why (so I can explain it)?

Best wishes
 

Miner

TS Contributor
#2
Here are my two cents worth (note: my background is industrial statistics):
  • You lose information when you transform data.
  • Non-normal data sets are often the result of mixtures, or of an underlying process that changes over time. Trying to transform these is a mistake.
The usual argument for transforming data is that the parametric test has more power than the equivalent non-parametric test. However, that increase in power is often small, and with reasonable sample sizes it is often irrelevant. Another argument, dating from the slide-rule era, was that transforming data allowed you to use linear regression models instead of nonlinear models. Since we no longer use slide rules, that is a rather outdated argument.
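To illustrate the power point, here is a minimal sketch in R (simulated lognormal data and arbitrary settings, nothing from the actual cattle trial) comparing a t-test with a Wilcoxon rank-sum test:

Code:
# Simulated comparison: empirical power of t-test vs Wilcoxon test on skewed data
set.seed(123)
n <- 30                     # per-group sample size (arbitrary)
reps <- 2000                # number of simulated experiments
shift <- 0.4                # true group difference on the log scale
p_t <- p_w <- numeric(reps)
for (i in seq_len(reps)) {
  x <- rlnorm(n, meanlog = 0)
  y <- rlnorm(n, meanlog = shift)
  p_t[i] <- t.test(x, y)$p.value
  p_w[i] <- wilcox.test(x, y)$p.value
}
mean(p_t < 0.05)            # empirical power of the t-test
mean(p_w < 0.05)            # empirical power of the Wilcoxon test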
 
#3
@lucyd123 - there is no attachment.

I may differ a little from @Miner this time. I would normally go for transformations first. Transformations can give more interpretable results, provided you interpret them correctly. Skewed data are common and are not necessarily the result of multiple processes, or at least not of well-defined processes (e.g., costs, or say economic growth). And parametric procedures can be robust to minor deviations from normality. You can control for covariates in GLMs, and there is more flexibility if you slide over to generalized estimating equations (GEEs).
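As a rough sketch of what I mean (hypothetical variable names, assuming an outcome weight_gain and covariates treatment and diet in a data frame cattle):

Code:
# Log-transformed outcome in an ordinary linear model, with a covariate
fit <- lm(log(weight_gain) ~ treatment + diet, data = cattle)
summary(fit)
exp(coef(fit))   # back-transformed coefficients read as multiplicative (ratio) effects
# For repeated measures on the same animal, geepack's geeglm() is the GEE route:
# library(geepack)
# fit_gee <- geeglm(log(weight_gain) ~ treatment + diet, id = animal_id,
#                   data = cattle, corstr = "exchangeable")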
 
#4
I recently did some statistical analyses on cattle trials, and was considerably helped by this forum. I used non-parametric tests for non-normal data.
How the data are distributed is most often irrelevant for testing. Instead, it is the distribution of the residuals from your statistical model that can be of concern.

Transformations should not be done just to achieve a distribution (of the residuals) which is suitable
for a certain statistical test, IMHO. They should make some real sense (distributions of some
entities such as income/wealth or reaction time are often better described on a
logarithmic or exponential scale).
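A minimal sketch of checking the residuals rather than the raw outcome (hypothetical model and variable names):

Code:
# Fit the model first, then inspect the residuals, not the raw data
fit <- lm(weight_gain ~ treatment, data = cattle)   # hypothetical names
qqnorm(resid(fit)); qqline(resid(fit))              # normal Q-Q plot of the residuals
hist(resid(fit))                                    # rough view of the residual distribution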

With kind regards

Karabiner
 

noetsi

Fortran must die
#5
I don't know which is more technically correct, but in economics it's well accepted to use, say, logs with skewed data. I assume that if it's that common in such a methodologically advanced field, there must be a good reason to do so...

It's not only violations of assumptions that are involved. Sometimes transforming data makes interpretations easier.
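For instance, with a log-transformed outcome the coefficients get a convenient percentage-change reading (a sketch with hypothetical names, not a real data set):

Code:
# With log(y) as the outcome, exponentiated coefficients act multiplicatively
fit <- lm(log(income) ~ education, data = survey_data)   # hypothetical skewed outcome
(exp(coef(fit)["education"]) - 1) * 100   # approx. % change in income per unit of education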
 
#8
The right answer is that you need not be concerned about normality for a t-test
if the sample size is large enough. Seemingly, your total sample size is > 200, which
certainly would be sufficient.
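A quick way to convince yourself of this is a simulation under the null hypothesis, assuming skewed (lognormal) data and about 100 observations per group; the t-test's Type I error rate stays close to the nominal level:

Code:
# Simulated check: Type I error of the t-test with skewed data, n = 100 per group
set.seed(42)
p <- replicate(5000, t.test(rlnorm(100), rlnorm(100))$p.value)
mean(p < 0.05)   # should come out close to the nominal 0.05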

With kind regards

Karabiner
 
#9
But you cannot replace a missing value with zero. (If I refuse to tell you my height, it does not mean that it is zero.)

Code:
# change NAs to 0's
my_data[is.na(my_data)] <- 0
It seems like most of your data are positive values. Imposing zero values will increase the skewness (and it will also be a sort of fabrication, sorry).
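Instead of overwriting NAs, let the analysis handle them explicitly; here is a sketch in base R (hypothetical column names):

Code:
# Keep NAs as NAs; summaries and models can deal with them explicitly
colSums(is.na(my_data))                          # see how much is actually missing
mean(my_data$weight, na.rm = TRUE)               # drop NAs for a summary statistic
fit <- lm(weight ~ treatment, data = my_data,
          na.action = na.omit)                   # model fitted on complete cases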
 

noetsi

Fortran must die
#10
But you cannot replace a missing value with zero. (If I refuse to tell you my height, it does not mean that it is zero.)

Code:
# change NAs to 0's
my_data[is.na(my_data)] <- 0
It seems like most of your data are positive values. Imposing zero values will increase the skewness (and it will also be a sort of fabrication, sorry).
Yes. Multiple imputation is the best way to address this.
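A minimal sketch of that workflow with the mice package (hypothetical variable names; default settings shown only for illustration):

Code:
# Multiple imputation with mice: impute, analyse each completed data set, then pool
library(mice)
imp <- mice(my_data, m = 5, seed = 1)        # create 5 imputed data sets
fit <- with(imp, lm(weight ~ treatment))     # fit the model in each of them
pool(fit)                                    # combine the results (Rubin's rules)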
 
#14
Transforming is worth it if reaching a particular condition (for example, normality) helps increase the quality of your analysis (reducing Type I or Type II error) or lets you test specific hypotheses (for example, you can run a linear regression even with totally non-normal data, but then you lose the possibility of testing several hypotheses). Otherwise, non-parametric is fine. Monte Carlo or bootstrap approaches are also very good. Re transformations, I like to use the Johnson transformation (you can find it in Minitab, XLSTAT, and R).
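As an example of the bootstrap route, here is a sketch in base R (made-up two-group data, not a real trial) that avoids both transformation and rank-based tests by bootstrapping the difference in means:

Code:
# Percentile bootstrap CI for a difference in means between two made-up groups
set.seed(7)
x <- rlnorm(40); y <- rlnorm(40, meanlog = 0.3)
boot_diff <- replicate(10000, mean(sample(x, replace = TRUE)) -
                              mean(sample(y, replace = TRUE)))
quantile(boot_diff, c(0.025, 0.975))   # 95% bootstrap confidence interval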