This is a good point to keep in mind: the distribution and standard error issues might have different impacts, depending on, e.g., which parameters are concerned.

This is very good advice. I remembered our (online) conversation yesterday when I was sitting in on a 4th-year undergrad course in econometrics. They were talking about instrumental variable (IV) estimators and deriving one of the 'simplest' forms of the standard error for this type of regression coefficient. As it turns out (and this blew my mind a little bit), the standard error for IV estimators has a sample-size term 'n' both in the numerator AND the denominator. So, as the sample size goes to infinity, you can have something that is asymptotically normal **BUT** is not consistent. The standard error never goes to 0 just by letting the sample size increase arbitrarily. The only way it goes to 0 is if you have both a large sample size AND what is called a 'strong' or 'valid' instrument. And those are hard to come by, because they cannot be selected on the basis of statistical considerations alone.
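A rough sketch of why both pieces matter (simple one-regressor, one-instrument case; the notation here is my own, not from that lecture): the asymptotic standard error behaves roughly like

\[
se\!\left(\hat{\beta}_{IV}\right) \approx \frac{\sigma_u}{\sqrt{n}\;\lvert\rho_{xz}\rvert\;\sigma_x},
\]

where \(\rho_{xz}\) is the correlation between the regressor and the instrument, \(\sigma_x\) is the regressor's standard deviation, and \(\sigma_u\) is the error standard deviation. A large \(n\) shrinks the \(\sqrt{n}\) part, but a weak instrument (\(\rho_{xz}\) near 0) blows the whole thing back up, which is why you need both a big sample and a strong instrument.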

So yeah... once you step outside the very basics, there is a lot of weirdo stuff that goes on.

Generally, this is a very good point. Starting from simulated populations and then taking a large number of samples, simulation gives me an idea of what can go wrong under defined circumstances, or what presents itself as robust. But I am not sure how to apply it in order to check from which distribution a given sample was drawn.

With kind regards

Karabiner

Well, I mean... it is not perfect, but I would say a good first step would always be to fit a distribution to the data. For example, let's play God for a moment and pretend that our dataset somehow came from a gamma distribution with shape and rate parameters both equal to 1, so \(X \sim G(1,1)\), and the sample size is \(n=1000\).

After some exploration and plotting, one could do something like this:

Code:

```
library(fitdistrplus)

g <- rgamma(n = 1000, shape = 1, rate = 1)  # hypothetical dataset we collected

summary(fitdist(g, "norm"))
#> Fitting of the distribution ' norm ' by maximum likelihood
#> Parameters :
#>      estimate Std. Error
#> mean 1.012525 0.03181838
#> sd   1.006186 0.02249889
#> Loglikelihood: -1425.105 AIC: 2854.21 BIC: 2864.026
#> Correlation matrix:
#>      mean sd
#> mean    1  0
#> sd      0  1

summary(fitdist(g, "gamma"))
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters :
#>       estimate Std. Error
#> shape 1.016245 0.04007357
#> rate  1.003637 0.05057144
#> Loglikelihood: -1012.363 AIC: 2028.726 BIC: 2038.542
#> Correlation matrix:
#>           shape      rate
#> shape 1.0000000 0.7825828
#> rate  0.7825828 1.0000000
```

The first part fits a normal distribution and estimates (via MLE) the most likely parameters for this dataset. The second part does the same for a gamma distribution. You can see by looking at the information criteria (AIC/BIC, lower is better) that the gamma distribution provides a MUCH better fit than the normal. And the parameter estimates come out close to the true values, because my sample size is large.

I'd imagine doing something like that at the first stages of analysis. And the fitdistrplus package provides a wide array of plotting techniques and methods to try and 'guesstimate' which distribution most likely generated your data.
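For instance, a minimal sketch continuing the simulated gamma example above (the function calls are all from fitdistrplus; the `boot` and `legendtext` settings are just illustrative choices):

```
library(fitdistrplus)

g <- rgamma(n = 1000, shape = 1, rate = 1)  # same kind of simulated dataset

# Cullen-Frey graph: places the sample by skewness/kurtosis relative to
# common families (normal, gamma, lognormal, ...), with bootstrap replicates
descdist(g, boot = 500)

fit_norm  <- fitdist(g, "norm")
fit_gamma <- fitdist(g, "gamma")

# overlay the fitted densities and Q-Q plots for a visual comparison
denscomp(list(fit_norm, fit_gamma), legendtext = c("normal", "gamma"))
qqcomp(list(fit_norm, fit_gamma), legendtext = c("normal", "gamma"))

# goodness-of-fit statistics (Kolmogorov-Smirnov, Anderson-Darling, AIC/BIC)
gofstat(list(fit_norm, fit_gamma), fitnames = c("normal", "gamma"))
```

The Cullen-Frey graph alone is often enough to rule out the normal for skewed data before you fit anything.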

I dunno why we don't teach stuff like this in my turf in social-science-land. I think a lot of better data practice could come from letting the data speak for themselves, as opposed to assuming the normal distribution everywhere and then adding patches and corrections to our analyses so they fit the normal model.