How do I fit a distribution to these data?!?!

#1
I am trying to fit a distribution to some data that consist of measurements of dollar amounts. The range is basically 0 to 300,000 (this range encompasses more than 99% of all measurements), although there are measurements that exceed this. The summary stats for the data look like this:

Summary Stats:
Length: 32015
Missing Count: 0
Mean: 18002.581787
Minimum: 0.000000
1st Quartile: 137.880000
Median: 3146.500000
3rd Quartile: 14274.605000
Maximum: 6331830.630000
Type: Float64

The 99th percentile is $206,143 and a histogram of the data looks like this:

[Image: histogram of the data]

As you can see, the data are largely bunched up in the $0 - $10,000 range. I tried to fit a truncated normal distribution to the data, which looks like this:

[Image: truncated normal distribution fitted to the data]

But when I do a quantile-quantile plot to check how well the data fit this distribution, it looks like this:

[Image: quantile-quantile plot of the data against the fitted truncated normal]
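For readers who want to reproduce this kind of check, here is a minimal sketch of how the points of a quantile-quantile plot can be computed. It makes several assumptions that are not from the post: it uses Python's standard library, it fits a plain (not truncated) normal distribution for simplicity, and the `sample` values and the helper name `qq_points` are hypothetical stand-ins for the real data.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data, dist):
    """Return (theoretical, observed) quantile pairs for a Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1),
    # so inv_cdf is never asked for the 0th or 100th percentile.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(dist.inv_cdf(p), x) for p, x in zip(probs, xs)]

# Hypothetical right-skewed sample standing in for the dollar amounts.
sample = [0.0, 12.5, 140.0, 900.0, 3100.0, 5200.0, 14000.0, 60000.0, 250000.0]

# Assumption: a plain normal fitted by sample mean and standard deviation.
fitted = NormalDist(mean(sample), stdev(sample))
pairs = qq_points(sample, fitted)

for theoretical, observed in pairs:
    print(f"{theoretical:12.1f} {observed:12.1f}")
```

If the candidate distribution fit well, the pairs would lie close to the line where theoretical and observed quantiles are equal. Strong curvature, as with heavily right-skewed data against a normal, suggests a poor fit.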

I'm trying to figure out what kind of distribution to use to represent these data and could use some feedback! I was reading about Gamma and Pareto distributions, but it seems that those won't work because the mode of my data is 0. Any ideas?
 

hlsmith

#3
What do you plan to do with these data once you have assumed a distribution? In other words, why do you need to assign the data to a particular distribution?
 
#4
I actually have two different datasets - the second one is similar to this one, just with lower mean, median, etc. I want to be able to make some probability statements about the two processes that generated these data. For example, under process A, the probability of a measurement being > $10,000 is 0.1, while under process B, the probability of a measurement being > $10,000 is 0.2.
 
#6
I presume that "Length: 32015" is your sample size. If so, this is large enough to answer your questions without a distribution - just use the actual proportion of the sample that is above 10000, or whatever threshold you care about. This will probably give better answers than forcing the data into a distribution, and you can get a confidence interval for the proportion.
kat
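A minimal sketch of that suggestion, in case it helps: compute the share of measurements above a threshold and attach a confidence interval to it. The Wilson score interval used here is one common choice and is my assumption, not something specified in the thread. The function names and the `process_a` values are hypothetical placeholders for the real data.

```python
import math

def exceedance_count(data, threshold):
    """Return (k, n): how many of the n values are strictly above threshold."""
    k = sum(1 for x in data if x > threshold)
    return k, len(data)

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z=1.96 is roughly 95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical measurements standing in for one of the two datasets.
process_a = [0.0, 50.0, 137.88, 900.0, 3146.5, 5000.0,
             9999.0, 14274.6, 60000.0, 250000.0]

k, n = exceedance_count(process_a, 10_000)
low, high = wilson_interval(k, n)
print(f"P(x > 10000) is about {k / n:.3f}, interval ({low:.3f}, {high:.3f})")
```

Running the same two functions on each dataset gives one estimate and interval per process. With a sample in the tens of thousands, the interval should be much narrower than in this ten-value toy example.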