Probability Distribution of a Subset

ElaB

New Member
#1
I was wondering: if you have a set of data that follows a certain probability distribution, will a subset of it have that same distribution?

Lately it has come up for me in several questions, one being: Say that you have a pool of applicants for a job and that pool has a certain diversity. If you take a subset of that pool (for example, a subset of internal job applicants vs. all possible job applicants), will you have that same diversity?

I've looked through my statistics textbooks, some references, and this site, but can't find a definitive proof.
 

fed1

TS Contributor
#2
The answer is no: not every subset has the same distribution as the full set.

If you do not believe me, do the following experiment.

a) draw a histogram of your outcome, using the entire data set.
b) split the data set into two groups and draw a histogram for each.
c) if the histograms in b) are not identical, you have shown that subsets are not distributed the same.
d) if the histograms in b) are identical, drop a single observation and go back to a).
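
A minimal sketch of this experiment in Python, assuming simulated standard normal data. The seed, the sample size of 1000, the bin width, and the helper name `histogram` are all arbitrary choices for illustration, not anything from the original question.

```python
import random
from collections import Counter

def histogram(values, bin_width=1.0):
    """Count how many values fall into each bin of width bin_width."""
    return Counter(int(v // bin_width) for v in values)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # assumed data set

# Step b): split the data set into two groups.
first_half, second_half = data[:500], data[500:]

hist_all = histogram(data)
hist_a = histogram(first_half)
hist_b = histogram(second_half)

# Step c): compare the two histograms bin by bin.
identical = hist_a == hist_b
print("halves have identical histograms:", identical)
print("bin  all  first  second")
for b in sorted(hist_all):
    print(b, hist_all[b], hist_a.get(b, 0), hist_b.get(b, 0))
```

Both halves here come from the same underlying distribution, so any bin-by-bin differences in the counts are sampling noise rather than a different population distribution.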
 

ElaB

New Member
#3
Thank you for your response. I believe it, and intuitively that seems like the right answer. The example I had seems to indicate that the subset is not the same distribution, but I wondered if that was just an artifact of thinking about a small set of data.

Now it gets me thinking: is there anything meaningful that can be said about the subset? Is the subset always going to be "less random" than the set? That is, if the first set was normally distributed, would the subset have a distribution that is more concentrated around the mean (a smaller standard deviation, or some other more tightly grouped distribution)? It seems like the answer is that this isn't necessarily the case.

If anyone does have any thoughts on this or a theoretical reference for this, it would be most appreciated.
 

BGM

TS Contributor
#4
It depends on how you choose your subset.

Suppose you are given an i.i.d. random sample

\( X_1, X_2, \ldots, X_n \)

If you pick \( k \) of them uniformly at random, the distribution of each individual observation does not change, and these \( k \) random variables are still i.i.d.

However, if you select, say, the smallest \( k \) of them, then the distribution of each individual observation is no longer the same.
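
A small simulation sketch of the two selection rules, assuming a standard normal sample with \( n = 10000 \) and \( k = 1000 \) (the distribution, sizes, and seed are illustrative assumptions only):

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
n, k = 10_000, 1_000    # assumed sizes, chosen only for illustration
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Rule 1: pick k observations uniformly at random, without replacement.
random_subset = rng.sample(sample, k)

# Rule 2: pick the k smallest observations.
smallest_subset = sorted(sample)[:k]

for name, values in [
    ("full sample", sample),
    ("random k", random_subset),
    ("smallest k", smallest_subset),
]:
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    print(f"{name:12s} mean={mean:7.3f} sd={sd:6.3f}")
```

The random subset should have a mean and standard deviation close to those of the full sample, while the smallest-\( k \) subset is shifted to the left and much less spread out. So whether a subset looks "more tightly grouped" depends on the selection rule, not on the fact that it is a subset.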
 
#5
I also have a question regarding the latest answer. Forgive me if I should have started a new thread for it.

So, let us assume that the distribution of the complete sample (Sall) of a dependent variable is non-normal, and that a subset (Ssub) of random values from the complete sample has a normal distribution. If I use Sall, I should use non-parametric tests to estimate differences between means. My question is: should I use parametric or non-parametric tests with Ssub? Should I always check for normality on random subsets of the complete sample, or assume that the distribution of the complete sample holds in all random subsets?

Thank you very much in advance.
 

BGM

TS Contributor
#6
Usually it is better to start a new thread (unless your question is really a follow-up).

Regarding your question, consider a simple scenario:

Suppose you have two independent normal samples, drawn from two normal distributions that are not identical. A mixture of these two distributions is no longer normal. That means that if you merge the two samples, the resulting sample will no longer be normal. Does that answer your question?
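
A rough sketch of that scenario in Python, assuming two normal samples with means 0 and 6 and unit standard deviation (these parameters, the sample sizes, and the helper name `excess_kurtosis` are my own illustrative choices). Excess kurtosis is about 0 for a normal distribution, so a value far from 0 is one simple sign of non-normality:

```python
import random
import statistics

def excess_kurtosis(values):
    """Sample excess kurtosis; approximately 0 for normally distributed data."""
    m = statistics.mean(values)
    n = len(values)
    second = sum((v - m) ** 2 for v in values) / n  # second central moment
    fourth = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return fourth / second ** 2 - 3.0

rng = random.Random(2)  # fixed seed so the run is repeatable
group_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # assumed N(0, 1)
group_b = [rng.gauss(6.0, 1.0) for _ in range(5000)]  # assumed N(6, 1)
pooled = group_a + group_b                            # merged sample

for name, values in [("group A", group_a), ("group B", group_b), ("pooled", pooled)]:
    print(f"{name:8s} excess kurtosis = {excess_kurtosis(values):6.3f}")
```

Each group on its own should give a value near 0, while the pooled sample is bimodal and gives a clearly negative value. Read in the other direction, a non-normal complete sample can contain subsets that are individually normal, provided the subsets are selected by group rather than at random.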
 
#7
Thank you very much, BGM, for your reply. I will start a new thread, because I have not described my question very well. Although your answer does not address my original question, it has helped me think of a way around the problem! It was right in front of my eyes all along, but somehow your answer helped me make the connection :)

Still, just out of curiosity and to expand my knowledge, I will start a new thread to see what other people think.

Thank you very much.