# Software to Determine Data Distribution

#### bscully

##### New Member
Does anyone know of any statistical program that takes a set of empirical data and determines what probability distribution it most closely matches?

Ideally this would be a package in Python or R.

Thanks for the help!

-Ben

#### lex logan

##### New Member
Well, of course, you can generate scatter plots and see what looks right. One problem with what you're asking for is that if you use the actual data you are interested in, your fit will be better than it should be, by definition. The best procedure in model-building is to play around with a different data set, such as a previous year, a different region, or, if necessary, a small subset of the full data. Once you've decided on what model to use, then apply it to the full data set. The basic idea is that you only get one shot at using data to get p-values, coefficients, etc. Any time you use data, change your model, and re-use the same data, your results are statistically invalid. So if software exists to do what you ask, you must still avoid feeding it your actual data set. (Unless, of course, your goal is to get published, not discover truth.)
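One way to read the advice above is as a hold-out split: estimate the model on an exploration subset, then test it on data that played no part in the fit. The sketch below assumes SciPy is available and that a random split is acceptable for the data (for time-ordered returns, splitting by period may be more appropriate). The function name `holdout_ks`, the 30% exploration fraction, and the choice of a Kolmogorov-Smirnov test are illustrative assumptions, not something prescribed in this thread.

```python
import numpy as np
from scipy import stats

def holdout_ks(data, dist_name="norm", explore_frac=0.3, seed=0):
    """Fit `dist_name` on an exploration subset, then KS-test it on the rest.

    Hypothetical helper: the parameters are estimated only from the
    exploration subset, so the held-out test does not reuse that data.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    # Random split into exploration and confirmation subsets.
    idx = rng.permutation(len(data))
    n_explore = int(len(data) * explore_frac)
    explore, confirm = data[idx[:n_explore]], data[idx[n_explore:]]

    # Estimate parameters by maximum likelihood on the exploration subset.
    dist = getattr(stats, dist_name)
    params = dist.fit(explore)

    # Compare the held-out data against the fitted distribution.
    statistic, pvalue = stats.kstest(confirm, dist_name, args=params)
    return params, statistic, pvalue
```

A small p-value on the held-out part suggests the chosen family does not describe the data well; a large one only means the test found no evidence against it.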

#### bscully

##### New Member
I would like to get away from viewing scatter plots and manually assessing distributions. Is there any software or programming package that performs this task?

#### Dason

##### Ambassador to the humans
The problem is that there are an infinite number of probability distributions. The distribution that most closely matches your data is the one that puts a mass of frequency(x)/n at each value of x in the empirical data set, where n is the number of observations. That's a valid probability distribution, and the observed data set would be very likely if that were the true distribution. Usually you need to specify a class of distributions you're interested in and then try to find the best fit within that class. So more information would be necessary.
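To make both points concrete, here is a minimal Python sketch, assuming SciPy is available. `empirical_pmf` builds the empirical distribution of the sample, and `rank_by_aic` fits a small, user-chosen class of candidate families by maximum likelihood and orders them by AIC. Both function names, the default candidate list, and the use of AIC as the criterion are assumptions made for illustration; other criteria (BIC, a goodness-of-fit test) would be equally reasonable.

```python
import numpy as np
from scipy import stats

def empirical_pmf(data):
    """Empirical distribution: mass count(x)/n on each observed value x."""
    values, counts = np.unique(np.asarray(data), return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

def rank_by_aic(data, candidates=("norm", "t", "laplace")):
    """Fit each candidate family and sort by AIC (lower is better).

    `candidates` holds names of continuous distributions in scipy.stats;
    the default list is only an example of a restricted class.
    """
    data = np.asarray(data, dtype=float)
    rows = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)                      # maximum-likelihood fit
        loglik = np.sum(dist.logpdf(data, *params))  # log-likelihood at the fit
        aic = 2 * len(params) - 2 * loglik           # penalize extra parameters
        rows.append((name, float(aic), params))
    return sorted(rows, key=lambda row: row[1])
```

Calling `rank_by_aic(returns)` on an array of returns gives the candidates ordered from best to worst by this criterion. The ranking says which family in the chosen class fits best, not that any of them is the true distribution.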

#### bscully

##### New Member
Dason, thanks for the information; what you say makes sense. I'm analyzing investment return data and want to see if the distribution is normal, fat-tailed, or something else (TBD).

I was hoping there were universal attributes that could define any and every distribution. All I can think of is mean, standard deviation, skew, and kurtosis, but that doesn't explain everything.
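For what it's worth, those four summaries are easy to compute as a first look, even though they do not pin down a distribution. The sketch below assumes SciPy and a one-dimensional array of returns; the function name `describe_returns` is hypothetical, and the Jarque-Bera test is added only as one common normality check based on skew and kurtosis.

```python
import numpy as np
from scipy import stats

def describe_returns(returns):
    """Summarize a return series with its first four moments.

    Excess kurtosis is 0 for a normal distribution; clearly positive
    values point to fatter tails than normal.
    """
    x = np.asarray(returns, dtype=float)
    jb_stat, jb_p = stats.jarque_bera(x)  # normality test from skew and kurtosis
    return {
        "mean": float(np.mean(x)),
        "std": float(np.std(x, ddof=1)),             # sample standard deviation
        "skew": float(stats.skew(x)),
        "excess_kurtosis": float(stats.kurtosis(x)),  # Fisher definition
        "jarque_bera_p": float(jb_p),
    }
```

A small `jarque_bera_p` together with a large positive `excess_kurtosis` would be consistent with fat tails, but two different distributions can share all four moments, so this is a screening step rather than an identification.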

I also came across an interesting package in MATLAB that outputs a best guess:
http://stats.stackexchange.com/ques...ine-probability-distribution-given-a-data-set