Comparing Distributions of Datasets

Furn

New Member
#1
I've been asked to find a method to best compare the distribution of a number of datasets that have small sample sizes. Bonus points for a solution/result that is on a scale of 0-1, i.e. a distribution approaching 1 is bordering on perfectly unequal and a distribution approaching 0 is bordering on perfectly equal.

Some examples within this dataset include:
  • Sample A: [10,1]
  • Sample B: [10,1,1]
  • Sample C: [4,4,3,2,2]
In other words, the method used should show A to have a distribution close to 1 (almost perfectly unevenly distributed), B to be close to 1 but further away from 1 than A's distribution, and C to be closer to 0 (quite an equal distribution).

I first thought of the Gini coefficient, which is precisely about distribution and gives values between 0 and 1. However, it seems the Gini has a 'small-sample bias' that limits its use here, where each dataset has between 1 and c. 10 values.

I then considered the coefficient of variation; however, given that results can go higher than 1, this also isn't well suited to this problem.
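To make the two candidates concrete, here is a minimal sketch that computes both for Samples A, B and C. It assumes the plain mean-absolute-difference form of the Gini coefficient (no small-sample correction) and the population standard deviation for the coefficient of variation; the function names are just illustrative.

```python
from statistics import mean, pstdev


def gini(values):
    """Gini coefficient as sum of |xi - xj| over all ordered pairs / (2 * n^2 * mean)."""
    n = len(values)
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean(values))


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (an assumed convention)."""
    return pstdev(values) / mean(values)


samples = {"A": [10, 1], "B": [10, 1, 1], "C": [4, 4, 3, 2, 2]}

for name, values in samples.items():
    n = len(values)
    print(
        name,
        "gini =", round(gini(values), 3),
        "| largest gini possible for this n =", round((n - 1) / n, 3),
        "| cv =", round(coefficient_of_variation(values), 3),
    )
```

With this uncorrected form, n non-negative values can never score above (n - 1)/n, so a two-value dataset is capped at 0.5. That is why A comes out lower than B here, the opposite of the ordering I'm after.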

Any pointers would be greatly appreciated!
 

Furn

New Member
#3
Thanks @GretaGarbo. Let me be a bit clearer about the examples.

Dataset A contains the values of 10 and 1
Dataset B contains the values of 10, 1, and 1
Dataset C contains the values of 4, 4, 3, 2, and 2

What I'm looking for is some way to depict the distribution of each dataset in a number that is between 0 and 1. So dataset A would get a result that is close to 1, meaning it is quite unevenly distributed. Dataset C on the other hand would get a result that is close to 0, as it is pretty evenly distributed. Dataset B on the other hand is somewhere in between, although closer to 1 than 0.

Is that any clearer?
 
#4
Is that any clearer?
Yes.

So it is not about statistical distribution functions, like the normal distribution or the binomial distribution. It has more to do with something like the income distribution.

Then, a possibility is different measures of "spread", like the variance, standard deviation, quartile distance, the range, 1/(standard deviation) etc. But you have already mentioned that.

So why not use the Gini coefficient (that you mentioned)? Please be realistic and face that often (but not always) you cannot infer very much from two to four observations.

Maybe you could do some simulations (random number generations from a known distribution) to make it clear for you how much a Gini coefficient could vary, and its bias.
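A rough sketch of such a simulation, under assumptions that are mine and not part of the question: the data are drawn from an exponential distribution with rate 1 (whose population Gini coefficient is 0.5), and the Gini is computed in its plain, uncorrected form. The function names and the number of replicates are arbitrary.

```python
import random
from statistics import mean, pstdev


def gini(values):
    """Uncorrected Gini: sum of |xi - xj| over ordered pairs / (2 * n^2 * mean)."""
    n = len(values)
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean(values))


def simulate_gini(sample_size, replicates=1000, seed=1):
    """Draw many samples of a given size and return the Gini of each one."""
    rng = random.Random(seed)
    return [
        gini([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(replicates)
    ]


# Compare the average and spread of the estimate across sample sizes;
# the population value for this distribution is 0.5.
for n in (2, 3, 5, 10, 50):
    results = simulate_gini(n)
    print(n, "mean =", round(mean(results), 3), "sd =", round(pstdev(results), 3))
```

Comparing the printed means against 0.5 gives a feel for the bias at each sample size, and the standard deviations show how much a single small dataset can move around.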
 

Dason

Ambassador to the humans
#7
And that is what the OP wants us to explain. Or at least give a suggestion.
I disagree. They're asking for methods to describe that. But they should at least be able to tell us what it means - because it's not clear to me. If they're talking about distributions that are uniform then why do they consider the case with only 2 values to be uneven?

It's not clear to me what concept they're actually trying to convey so I'm asking them to provide more details instead of restating the same exact thing they've said twice because it didn't clear it up for me.
 
#8
It's not clear to me what concept they're actually trying to convey so I'm asking them to provide more details instead of restating the same exact thing they've said twice because it didn't clear it up for me.
Well, yes. :) But let's see if she/he comes back and explains.
 

Furn

New Member
#9
Hi @GretaGarbo and @Dason. Apologies for the confusion - I think my terminology and use of words have been quite poor! I am indeed referring to spread rather than distribution (in statistical terms).

Let me give you an example that I hope makes it clearer.

Two children, Alice and Betty, are placed in a room with a sweet jar. The jar contains many sweets, each of which can be one of three different colours: red, blue and yellow. Each day the children are allowed to take one sweet from the jar, and we observe them over a 20 day period. At the end of this observation period, we find that Alice chose 7 red sweets, 8 blue sweets and 5 yellow sweets, whilst Betty chose 17 red sweets and 3 blue sweets.

What I'm trying to come up with is a mathematical method that describes our observation/confidence that Betty prefers a specific colour, whereas Alice seems somewhat colour-agnostic. The colour itself is irrelevant to me: the point is that colour appears to make a difference to one child whilst it does not to the other child.

I'd like this metric to take a value between 0 and 1.

I hope this is clearer - let me know if you would like further clarification.

Thanks in advance!
 

Dason

Ambassador to the humans
#10
Your original question didn't appear to have a fixed number of choices. What does your data look like exactly?
 

Furn

New Member
#11
Your original question didn't appear to have a fixed number of choices. What does your data look like exactly?
There is no fixed number of choices. Sometimes there will be 3 colours available, sometimes more or less. The example I just gave was for colour, but the same question applies for a variety of different observations - none of which have a consistent number of possible choices.

The data does look somewhat similar to what I first posted, e.g.:

Dataset A contains the values of 10 and 1 (10 observations for one colour, 1 observation for another colour)
Dataset B contains the values of 10, 1, and 1 (10 observations for one colour, 1 observation for another colour and 1 observation for a third colour)
Dataset C contains the values of 4, 4, 3, 2, and 2

If it is not clear from the above, it is also possible for the number of observations to be different per child.
 

Dason

Ambassador to the humans
#12
Gotcha. You never made it clear that those were counts for a category. The way it was presented it seemed more like actual raw observations so like you had a height of 10 and a height of 1 in Dataset A.

Is there a limit to the possible number of categories?
 

Furn

New Member
#13
There is no 'hard-coded' absolute limit, although if I look through the data in practice the maximum number of categories appears to be 12.
 
#14
But is it known in advance how many categories there are? For example, there might be a black category, so that Alice chose 7 red sweets, 8 blue sweets, 5 yellow sweets and 0 black sweets. And is it known how many sweets a child can take, like here n=20?

And are the choices statistically independent? Alice might get bored with red so that the probability of choosing black increases?

It seems to be about estimating parameters (the proportions) in a multinomial distribution and giving a confidence interval for them. Normally that is done with a Wald interval (the usual confidence interval for proportions), but that can have limits outside of the 0 to 1 range (when the sample size is small and the proportions are 0 or close to 0). I would do a likelihood interval, but maybe that is tricky.
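As a quick illustration of the Wald problem, here is a small sketch using Betty's counts from the sweets example (17 red, 3 blue, n=20). It treats each category separately with the usual p ± z·sqrt(p(1-p)/n) formula and z = 1.96; the function name is just for illustration, and these are per-category intervals, not simultaneous ones.

```python
from math import sqrt


def wald_interval(count, total, z=1.96):
    """Wald interval for one proportion: p +/- z * sqrt(p * (1 - p) / total)."""
    p = count / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


betty = {"red": 17, "blue": 3}
total = sum(betty.values())

for colour, count in betty.items():
    lower, upper = wald_interval(count, total)
    print(colour, "p =", count / total, "interval =", (round(lower, 4), round(upper, 4)))
```

For blue the lower limit falls slightly below 0, and for red the upper limit falls slightly above 1. Note also that for a category with a count of exactly 0 the formula collapses to the single point 0, which is not very informative either.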
 

Dason

Ambassador to the humans
#15
Knowing in advance how many categories there are seems to be a fairly important part of the problem, at least if I understand what you want to measure correctly. For instance if there are 5 categories and somebody's counts for those categories are: 10, 10, 10, 10, 11 and somebody else did: 15, 15, 15, 0, 0. If we don't know there are 5 categories then we might say that the second person is more evenly distributed because for their data all we would see is 15, 15, 15 right? Which would be 'perfectly evenly distributed' even though it's missing two categories entirely - and without additional information we might say this person is more 'evenly distributed' than the other person who almost had a perfectly even distribution across all five categories.
 

Furn

New Member
#16
For instance if there are 5 categories and somebody's counts for those categories are: 10, 10, 10, 10, 11 and somebody else did: 15, 15, 15, 0, 0. If we don't know there are 5 categories then we might say that the second person is more evenly distributed because for their data all we would see is 15, 15, 15 right? Which would be 'perfectly evenly distributed' even though it's missing two categories entirely - and without additional information we might say this person is more 'evenly distributed' than the other person who almost had a perfectly even distribution across all five categories.
I would indeed like that to be the case, i.e. we ignore any 0 values and only look at the observed categories. So even if, say, there were 10 possible sweet colours and a child took different amounts of sweets that were of 3 different colours, then I would ignore the fact that there are 7 categories with zero counts. I’d like to measure the spread of values only within the categories observed. Which means that in the example you gave, the individual with 15,15,15,0,0 would indeed be more ‘evenly spread’ than the one with 10,10,10,10,11.
 
#17
I’d like to measure the spread of values only within the categories observed. Which means that in the example you gave, the individual with 15,15,15,0,0 would indeed be more ‘evenly spread’ than the one with 10,10,10,10,11.
That is a peculiar view. Suppose that, by randomness, one of the 0 is changed to 1 then that category would be included.

So 15,15,15,0,0 would be more ‘evenly spread’ than 10,10,10,10,11.
But 15,15,15,1,0 would be less ‘evenly spread’ than 10,10,10,10,11.
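A small sketch of that jump, using the Gini coefficient as a stand-in for "evenly spread" (an assumption on my part, since no measure has been settled on) and dropping zero counts before computing it, as proposed. The function names are illustrative.

```python
from statistics import mean


def gini(values):
    """Uncorrected Gini: sum of |xi - xj| over ordered pairs / (2 * n^2 * mean)."""
    n = len(values)
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean(values))


def gini_observed_only(counts):
    """Gini over the categories that were actually chosen (zero counts dropped)."""
    observed = [c for c in counts if c > 0]
    return gini(observed)


for counts in ([15, 15, 15, 0, 0], [10, 10, 10, 10, 11], [15, 15, 15, 1, 0]):
    print(counts, round(gini_observed_only(counts), 4))
```

Under these assumptions 15,15,15,0,0 scores exactly 0, 10,10,10,10,11 scores a little above 0, and 15,15,15,1,0 scores clearly higher than both, so a single extra observation reverses the ranking.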

- - -

What are you really interested in? Dason made the example of comparing across relative frequencies (15,15,15,0,0). I thought that you were interested in the uncertainty in the proportion of 1/3 (i.e. 15 of 45).
 

Furn

New Member
#18
That is a peculiar view. Suppose that, by randomness, one of the 0 is changed to 1 then that category would be included. So 15,15,15,0,0 would be more ‘evenly spread’ than 10,10,10,10,11. But 15,15,15,1,0 would be less ‘evenly spread’ than 10,10,10,10,11.
That is though what I'd like to reflect. A new category suddenly becoming relevant, due to a child choosing a new colour sweet, would then change the picture of how evenly/unevenly spread their preferences are.

What are you really interested in? Dason made the example of comparing across relative frequencies (15,15,15,0,0). I thought that you were interested in the uncertainty in the proportion of 1/3 (i.e. 15 of 45).
I think I would like to compare across relative frequencies, and use that to understand to what extent, in the example with the sweets, colour is important as a factor in the child making that choice. If a child chose 17 red sweets and 1 blue sweet (and ignored any other colour sweets in the jar), then we can infer with some confidence that the child chose those sweets because of the colour and not some other factor. On the other hand, for a child that chose 5 blue sweets, 5 red sweets, and 5 yellow sweets, we could be reasonably confident that colour is unimportant in their choice. So child A has 17/18 of their sweets in one colour and 1/18 in the second colour, and hence we are reasonably confident that colour matters to them. Child B however has 5/15 of their sweets in each of the colours, and hence we cannot be sure that the sweet colour is important to them.
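To spell out the arithmetic for the two children, here is a minimal sketch that turns counts into relative frequencies over the observed categories only and sets them next to the equal share 1/k. It is only the bookkeeping, not a proposed metric, and the function name is illustrative.

```python
from fractions import Fraction


def relative_frequencies(counts):
    """Share of each observed category; categories with a zero count are ignored."""
    observed = [c for c in counts if c > 0]
    total = sum(observed)
    return [Fraction(c, total) for c in observed]


children = {"A": [17, 1], "B": [5, 5, 5]}

for name, counts in children.items():
    shares = relative_frequencies(counts)
    equal_share = Fraction(1, len(shares))
    print(
        name,
        "shares =", [str(s) for s in shares],
        "| equal share would be", str(equal_share),
    )
```

Child A's shares (17/18 and 1/18) sit far from the equal share of 1/2, while child B's shares are all exactly the equal share of 1/3.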