How do I fit a distribution to these data?

#1
I am trying to fit a distribution to some data consisting of dollar-amount measurements. Most values fall between $0 and $300,000 (this range covers more than 99% of all measurements), although some measurements exceed it. The summary stats for the data look like this:

Summary Stats:
Length: 32015
Missing Count: 0
Mean: 18002.581787
Minimum: 0.000000
1st Quartile: 137.880000
Median: 3146.500000
3rd Quartile: 14274.605000
Maximum: 6331830.630000
Type: Float64

The 99th percentile is $206,143 and a histogram of the data looks like this:

[Image: 1567174814776.png, histogram of the data]

As you can see, the data are largely bunched up in the $0 - $10,000 range. I tried to fit a truncated normal distribution to the data, which looks like this:

[Image: 1567173480225.png, fitted truncated normal distribution]

But when I do a quantile-quantile plot to check how well the data fit this distribution, it looks like this:

[Image: 1567174444236.png, quantile-quantile plot of the data against the fitted distribution]
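For anyone who wants to reproduce this kind of check, here is a minimal sketch of how the points of a quantile-quantile plot against a normal distribution truncated below at 0 could be computed. It uses only the Python standard library. The function names, the sample values, and the parameters `mu` and `sigma` are all made up for illustration; they are not the values fitted in this thread.

```python
from statistics import NormalDist


def truncated_normal_quantile(p, mu, sigma, lower=0.0):
    """Quantile of a normal(mu, sigma) distribution truncated below at `lower`."""
    std = NormalDist()
    cdf_lower = std.cdf((lower - mu) / sigma)
    # Map p from the truncated scale back onto the untruncated normal scale.
    return mu + sigma * std.inv_cdf(cdf_lower + p * (1.0 - cdf_lower))


def qq_points(sample, mu, sigma, lower=0.0):
    """Return (theoretical, observed) quantile pairs for a Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i + 0.5) / n avoid probabilities of exactly 0 or 1.
    return [
        (truncated_normal_quantile((i + 0.5) / n, mu, sigma, lower), x)
        for i, x in enumerate(ordered)
    ]


# Made-up, right-skewed dollar amounts (illustration only).
sample = [0.0, 50.0, 140.0, 900.0, 3100.0, 5200.0, 14000.0, 60000.0, 210000.0]
mu, sigma = 18000.0, 40000.0  # hypothetical parameters, not a real fit

for theoretical, observed in qq_points(sample, mu, sigma):
    print(f"{theoretical:12.1f} {observed:12.1f}")
```

If the candidate distribution described the data well, the pairs would lie close to the line where theoretical and observed quantiles are equal; systematic curvature away from that line suggests a poor fit.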

I'm trying to figure out what kind of distribution to use to represent these data and could use some feedback! I was reading about Gamma and Pareto distributions, but it seems that those won't work because the mode of my data is 0. Any ideas?
 

hlsmith

#3
What do you plan to do with these data once you have assumed a distribution? Why do you need to assign the data to a distribution at all?
 
#4
I actually have two different datasets; the second one is similar to this one, just with a lower mean, median, etc. I want to be able to make some probability statements about the two processes that generated these data. For example: under process A, the probability of a measurement exceeding $10,000 is 0.1, while under process B it is 0.2.
 

katxt

#6
I presume that Length: 32015 is your sample size. If so, it is large enough to answer your questions without a distribution: just use the observed proportion of the sample that is above $10,000, or whatever threshold you care about. This will probably give better answers than forcing the data into a distribution, and you can get a confidence interval for the proportion. kat
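A minimal sketch of that suggestion, using only the Python standard library: compute the observed proportion above a threshold and a confidence interval for it. The Wilson score interval is one common choice and is an assumption here, since the post does not name a method. The function name and both samples are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def exceedance_with_ci(sample, threshold, confidence=0.95):
    """Observed proportion above `threshold` with a Wilson score interval."""
    n = len(sample)
    k = sum(1 for x in sample if x > threshold)
    p_hat = k / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return p_hat, max(0.0, centre - half), min(1.0, centre + half)


# Made-up samples standing in for the two processes (illustration only).
process_a = [0.0, 120.0, 900.0, 3100.0, 4800.0, 7500.0, 9900.0, 12000.0, 45000.0, 150.0]
process_b = [0.0, 60.0, 300.0, 1100.0, 2500.0, 3900.0, 5200.0, 8000.0, 11000.0, 75.0]

for name, data in (("A", process_a), ("B", process_b)):
    p_hat, low, high = exceedance_with_ci(data, 10_000.0)
    print(f"process {name}: P(X > 10000) ~ {p_hat:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With a sample of about 32,000 the interval would be far narrower than with these ten-point toy samples, which is the point of the suggestion.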