Probability of overlap in independent samples

UsrX

New Member
#1
Hi all,

I have a question about calculating probabilities for random samples.

Let's assume I have two sets of sizes n(1) and n(2). Both sets are made up of the same kind of items but can have different sizes, so they may only partially overlap, e.g.:

n(1) = 7, set1 = [1,2,3,4,5,6,7]
n(2) = 9, set2 = [1,2,3,4,5,6,7,8,9].

From each set, I draw a random sample without replacement, of sizes k(1) and k(2) respectively.
How can I calculate the probability that the intersection of those 2 random samples has size X? The 2 samples are independent of each other, e.g.:

k(1) = 4, randomSample1 = [1,2,7,6]
k(2) = 5, randomSample2 = [3,2,9,6,4]
intersect = [2,6], intersect size X = 2

Following my intuition, I would calculate the probability of an overlap in the random samples as the conditional probability of the overlap in the random samples given the probability of the overlap in the two base sets. But obviously, my intuition fails me...

If you could point me in the right direction on how to calculate this accurately, I would be very grateful, for I am nothing but a sinner in the hands of the angry god of probability. Show me the light so that I can start to atone for my ignorance by studying probability.

Cheers

UsrX
 

Dason

Ambassador to the humans
#2
With different sets and different sizes I don't think there is any elegant solution to this. But for small sizes it's not too terrible to fully enumerate all possibilities to get the full sampling distribution. For your particular example, assuming uniform random samples, there are just 4410 possible outcomes (35 ways to choose 4 of 7 items times 126 ways to choose 5 of 9). Enumerating them and then calculating the overlap wouldn't be too difficult to do in most programming languages.


Alternatively, if that's too taxing, it wouldn't be too hard to just run a simulation to estimate the probabilities to within an acceptable tolerance.
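A minimal sketch of what such a simulation could look like in R, using the sets and sample sizes from the question. The seed, the number of replications (n_sim) and the variable names are arbitrary choices made for illustration; more replications tighten the estimate.

```r
set.seed(1)      # arbitrary seed, only so the run is reproducible

s1 <- 1:7        # base set 1 from the question
s2 <- 1:9        # base set 2 from the question
k1 <- 4          # sample size drawn from s1
k2 <- 5          # sample size drawn from s2
n_sim <- 50000   # number of simulated pairs of samples (assumed value)

# For each replication, draw both samples without replacement
# and record the size of their intersection.
overlap <- replicate(
  n_sim,
  length(intersect(sample(s1, k1), sample(s2, k2)))
)

# Relative frequency of each overlap size; the factor() call keeps
# sizes that were never observed in the table with frequency 0.
p_hat <- prop.table(table(factor(overlap, levels = 0:min(k1, k2))))
p_hat
```

The estimates are only approximate and will wobble from run to run, so for sets this small full enumeration is the more exact route.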
 

Dason

Ambassador to the humans
#3
For example, if you are willing to use R, the following code will calculate the probability distribution of the size of the overlap:

Code:
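# Base sets
s1 <- 1:7
s2 <- 1:9
# Sample sizes (k(1) and k(2) in the question)
n1 <- 4
n2 <- 5

# Every possible sample from each set, one sample per column
c1 <- combn(s1, n1)
c2 <- combn(s2, n2)

# Every pairing of a sample from s1 with a sample from s2
g <- expand.grid(seq_len(ncol(c1)), seq_len(ncol(c2)))

# Size of the overlap between sample i of c1 and sample j of c2
f <- function(i, j){
  a <- c1[,i]
  b <- c2[,j]
  length(intersect(a,b))
}

out <- mapply(f, g[,1], g[,2])

# Proportion of pairings for each overlap size
p <- prop.table(table(out))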
which gives

Code:
> p
out
          0           1           2           3           4 
0.007936508 0.158730159 0.476190476 0.317460317 0.039682540
 

UsrX

New Member
#4
Hi Dason,

Thanks for the advice. This approach is indeed feasible. I did enumerate all the possible outcomes and I can plot the probability (density). I was just puzzled that there is no simple solution...
Thank you very much for your detailed answer including code. I greatly appreciate your effort.

Thanks mate!
 

Dason

Ambassador to the humans
#5
It's possible there is a better solution, but nothing comes to mind for me. The version you've presented is pretty general, so if there are some simplifying assumptions it might be possible to get a nicer solution.
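As a sketch of one such simplification, offered tentatively rather than as a vetted result: suppose all that matters about the two base sets is how many items they share, call it m (a name introduced here, not used earlier in the thread). The number of shared items that land in the first sample is then hypergeometric, and given that number j, the overlap with the second sample is hypergeometric again, so the overlap distribution is a mixture of the two. The function name overlap_pmf and its arguments are hypothetical; note that here n1 and n2 are the set sizes and k1 and k2 the sample sizes, unlike in the enumeration code in #3.

```r
# Sketch: distribution of the overlap size, assuming uniform sampling
# without replacement and two independent samples.
#   n1, n2 : sizes of the two base sets
#   m      : number of items the two base sets have in common (assumed known)
#   k1, k2 : sample sizes drawn from set 1 and set 2
overlap_pmf <- function(n1, n2, m, k1, k2) {
  x <- 0:min(k1, k2, m)   # possible overlap sizes
  j <- 0:min(k1, m)       # possible counts of shared items in sample 1

  # P(J = j): shared items that end up in sample 1
  p_j <- dhyper(j, m, n1 - m, k1)

  # P(X = x | J = j): how many of those j items sample 2 also picks
  p_x_given_j <- outer(x, j, function(x, j) dhyper(x, j, n2 - j, k2))

  # Mix over j to get the marginal distribution of X
  p <- as.vector(p_x_given_j %*% p_j)
  names(p) <- x
  p
}

# Example from the question: set1 = 1:7 lies entirely inside set2 = 1:9, so m = 7
p_closed <- overlap_pmf(n1 = 7, n2 = 9, m = 7, k1 = 4, k2 = 5)
p_closed
```

For the example in the question this should agree with the enumerated distribution from #3, and since set1 lies entirely inside set2 the mixture collapses to a single term, dhyper(x, 4, 5, 5). For larger sets, where enumeration becomes expensive, it would still be worth checking the formula against a simulation.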