Fun (im)probability question!

#1
Hi folks - I need some help with a tricky probability. Here's the situation:

Let's say there are 4M internet users in Age Group A. (The total set)
Of those 4M, there are 1,000 users who play a specific sport.
Those 1,000 are spread evenly over 125 teams, so 8 players each.

1. What's the probability of selecting 3 random users (from the 4M) who all play that sport?
2. What's the probability of all 3 random users being members of the same team?

This isn't a school problem - I'm a 47-year-old writer, and I need to figure out this absurdly improbable number. (The improbability is the point.) Any advice / guidance / direction here would be appreciated! (I have a moderate familiarity with combinations & permutations, and their notations, so do feel free to use whatever formulae / notations are needed.)

Any questions, please feel free to let me know. Thanks! ~ MW.
 
#2
I am NOT a statistician, but have some limited understanding. Given no one has yet answered, I'll put in my 2 cents and let the experts correct me.

The initial assumptions:
  • 4 million people is a pretty narrow age group of internet users. This Quora question guesstimated that, on average, 6.85% of the week is spent on the internet. Since that question was answered, the number of active internet users has grown (and probably time online too, given current social distancing!), and it is still growing daily. The estimate at the time of writing is about 4.6 billion users; taking 6.85% of this gives 315.1 million online at any given time. 4 million would be <1.3% of the internet users online at a given time (prevalence), so if it is an age range it is a pretty narrow one (i.e. unlikely to span an entire age year unless it is a low-prevalence group of users, e.g. the elderly).
  • However, the players of a bespoke sport do smell of bias: is it popular in only a few countries, etc.? The way in which they found each other is important. Obviously a forum, which might attract users with the same or even related interests, invalidates the assumptions. Similarly, just typing in random IP addresses changes things: with the internet of things there are a lot more than 315 million devices online at any given time, and if typing random IPs the number of permutations of an address is more important than the number of occupied addresses.
Bearing this in mind, most people I know would approximate with replacement because it is slightly quicker and the result will effectively be the same:
= (1K/4M)^3 = 1.5625 x 10^-11, or 64 billion to 1 (quoting as odds, as they are effectively the same at these tiny probabilities).
The 'correct way', however, is to note that each time a user is selected there is 1 fewer available to be picked (called without replacement). You could do the same with the denominator, and you can represent the equation with factorials (!), but it is all pedantry:
= 1K⋅(1K-1)⋅(1K-2) / 4M^3 = 1.558 x 10^-11
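
For anyone who wants to check the first question numerically, here is a minimal Python sketch comparing the with-replacement shortcut to the fully without-replacement version (shrinking both numerator and denominator). The constant and function names are my own, chosen for illustration; the figures are the ones from the question.

```python
from fractions import Fraction

TOTAL_USERS = 4_000_000  # users in Age Group A (from the question)
PLAYERS = 1_000          # users who play the sport (from the question)


def p_three_players_with_replacement():
    """Approximation: each pick independently has chance PLAYERS / TOTAL_USERS."""
    return Fraction(PLAYERS, TOTAL_USERS) ** 3


def p_three_players_without_replacement():
    """Exact: both the player pool and the user pool shrink by 1 per pick."""
    p = Fraction(1)
    for i in range(3):
        p *= Fraction(PLAYERS - i, TOTAL_USERS - i)
    return p


approx = p_three_players_with_replacement()
exact = p_three_players_without_replacement()
print(float(approx))          # with replacement
print(float(exact))           # without replacement
print(float(exact / approx))  # ratio close to 1: the shortcut barely matters
```

Using Fraction keeps the arithmetic exact until the final conversion to float, which is handy when the numbers are this small.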

The second question has the same initial chance, i.e. 1K/4M, of selecting a player of that game (IF that was the a-priori intention). Then the next two picks must be on the same team as the initial pick, so there are 7 and then 6 players remaining:
= 1K⋅7⋅6 / 4M^3 = 6.56 x 10^-16, or about 1.5 quadrillion to 1
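
The same-team calculation can be sketched the same way. This is only an illustration under the thread's assumptions (uniform random picks, 8 players per team); the names are hypothetical. It uses the without-replacement denominator, and the reciprocal of the probability gives the "1 in N" figure.

```python
from fractions import Fraction

TOTAL_USERS = 4_000_000  # users in Age Group A (from the question)
PLAYERS = 1_000          # users who play the sport
TEAM_SIZE = 8            # players per team


def p_same_team():
    """First pick is any player; the next two must be that player's teammates."""
    first = Fraction(PLAYERS, TOTAL_USERS)
    second = Fraction(TEAM_SIZE - 1, TOTAL_USERS - 1)  # 7 teammates left
    third = Fraction(TEAM_SIZE - 2, TOTAL_USERS - 2)   # 6 teammates left
    return first * second * third


p = p_same_team()
print(float(p))                        # the probability itself
print(f"about 1 in {round(1 / p):,}")  # reciprocal, i.e. the "to 1" odds
```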
 
#3
Much appreciated! I understand the questions about set size, sampling bias, etc. - those are things I'm ignoring, given the actual circumstances behind the question. (It's not sports, but a niche subset that - for sensitivity reasons - I shan't mention.) Again, this is just an exercise in absurdity - I intentionally wanted to find how improbable such situations are (choosing 3 profiles on the same 'team' out of all available) to make the case that it can't be down to coincidence, much like DNA accuracy. Thanks again for your help here! It's important work, and it'll do good. Cheers, friend.
 

fed2

Active Member
#5
Never mind, I was going to say Q2 was an overcount, but it actually looks correct. I usually count it as 125*8*7*6 = pick team * pick permutation of players, but I guess that works out the same as above, since 125*8 = 1K.
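
A quick way to convince yourself the two counts agree is to compute the same-team probability both as ordered picks (125*8*7*6 over ordered triples of users) and as unordered picks (teams times "8 choose 3" over "4M choose 3"). This is a rough sketch with made-up names, assuming picks without replacement from the 4M users.

```python
from fractions import Fraction
from math import comb

TOTAL_USERS = 4_000_000  # users in Age Group A (from the question)
TEAMS = 125              # number of teams
TEAM_SIZE = 8            # players per team


def ordered_count():
    """Pick a team, then an ordered triple of its players: 125 * 8 * 7 * 6."""
    return TEAMS * TEAM_SIZE * (TEAM_SIZE - 1) * (TEAM_SIZE - 2)


def p_same_team_ordered():
    """Favourable ordered triples over all ordered triples of users."""
    all_ordered = TOTAL_USERS * (TOTAL_USERS - 1) * (TOTAL_USERS - 2)
    return Fraction(ordered_count(), all_ordered)


def p_same_team_unordered():
    """Favourable unordered triples over all unordered triples of users."""
    return Fraction(TEAMS * comb(TEAM_SIZE, 3), comb(TOTAL_USERS, 3))


print(ordered_count())  # same as 1000 * 7 * 6
print(p_same_team_ordered() == p_same_team_unordered())
```

The factor of 3! = 6 for ordering appears in both numerator and denominator, so it cancels and the two approaches give the same probability.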
 