# How to determine if there is a needle in a haystack

#### Tugboat

##### New Member
I have a workplace with a known hazard.
There are 2500 workers (let’s use balls to represent them) in the workplace every day.
Workers fit into one of two categories - (1) those that are exposed to the hazard (red balls), and (2) those that are not exposed (black balls).
The workplace claims that no one is being exposed to the hazard - I.e., they assert that all 2500 balls are black, without looking to know if this is true.

It is not possible to monitor all 2500 workers daily to verify whether or not they are exposed to the hazard (I.e., we cannot look at all 2500 balls).

Because we are dealing with people, who may or may not follow rules the same way from day to day, the number of exposures to the hazard (i.e., production of red balls) could change from day to day (it is a random number). And, because of various engineered and administrative barriers, the number of red balls (out of the total 2500) that would occur is likely very small (expecting single digits, or very low double digits).

If there is a sample of red balls amongst the 2500, what function would describe it?
Again, we don’t know if any of the 2500 balls are red; and even if there are red balls, the number of red balls on any given day is random (and we don’t know their number). If instead of a daily sample we sampled after a period of time (e.g., 30 days), it’s possible the number of red balls would increase, but there is no guarantee, and any increase would likely be random. As well, on some days, due to type of work, more people are required to work near the hazard. This could run for several weeks, maybe even 4 months, and then the work changes and the number of workers in proximity to the hazard drops. But at all times the hazard is always present. These campaigns could increase the potential for worker exposure, but again, workers following rules might not be exposed. So no guarantee.

My question is the following: “is there a statistical model that can be used to represent this situation?”
Is there a probability function that describes it?
A colleague told me this can be represented by a hypergeometric probability density function. They also said I should group the workers and look at those in close proximity as a different population than the remainder.

Ultimately, I want to make a statement (within some confidence level, 95% seems typical) as to whether or not there are any red balls among the 2500 balls.

I have the ability to sample the balls on a monthly basis.
But I don’t know how many balls to sample. And I don’t know how long to sample for. Would I stop sampling after I have received a defined number of samples (e.g., 300)? Or after a defined period (e.g., 15 samples/month for 20 months)? I have the ability to sample monthly, but is monthly appropriate?

Can this even be modelled using statistics?

#### Tugboat

##### New Member
For those reading my post, I would greatly appreciate if at least ONE person replied. Thanks.

#### katxt

##### Member
OK. I'll put in a thought, Tugboat. For a start, we will make some simplifying assumptions which you may be able to modify later. For example, we assume each ball sampled has the same probability p of being red, and that the probability of a ball being red is independent of the colour of any other ball.

"Ultimately, I want to make a statement (within some confidence level, 95% seems typical) as to whether or not there are any red balls among the 2500 balls."
First, on any one particular day you have a bag with 2500 balls in it, most of which we know to be black. You take a sample of n (say 50) and they are all black. What can we say about the proportion of red balls, say p? The answer will look something like "I'm 95% sure that less than 0.2% or 5 balls are red." Then you choose a sample size n which will give you an answer management can live with.
Or, second, on any one particular day you have a bag with 2500 balls in it, most of which we know to be black. How many balls do we need to sample to be able to say "we are 95% sure that there are 0 red balls in this bag."
Which do you like? kat

#### Tugboat

##### New Member
Ah-ha!! Finally, someone willing to take a chance.
Now, let’s parlez.

Regarding your simplifying assumptions- those are spot on.

I think you have described my situation quite accurately- the conundrum is which one to pick?

Personally, I believe your ‘option 2’ is what management “wants” to say - because historically they keep saying no one has been exposed (i.e., there are no red balls), with that claim coming without ever having conducted any sampling - just their ‘gut’ feel.
However, now they have had someone (don’t know who - I’ll follow up on that later, as apparently someone wants my job) draft up a memo that looks a lot like your ‘option 1’.
This may be a result of the fact that the minute they drew the first 20 balls, 3 came up red (management was in denial - how could it be - they started shouting ‘false positives’ - to no avail, because they most definitely were red balls).
So, I think management is forced to go with ‘option 1’.

Of the next 20 drawn, there have been zero red balls (management is ecstatic).
The next 20 balls have yet to be drawn (everyone is holding their breath).

Now, whoever wrote the memo has stated that the situation fits a hypergeometric model - is this a true statement?

They also claim that for a large population the hypergeometric model can be reduced to a much easier-to-use binomial model. Then they go on to simplify the model even further.

Ultimately, using their simplified model, ‘they’ (still don’t know who that is) have decided management needs to sample 90 balls out of the 2500, to be able to say with 99% confidence that there is less than 5% probability there are any red balls in the lot.
(Of course, I don’t know how ‘they’ account for the fact that 3 of the first 20 balls (or 3 out of 40 to date) are already red. What does that even say about their prediction?)

I am very interested to know if you think their model choices are appropriate.
Also, whether you think the fact that there are 3 red balls already has any bearing on their stated hypothesis.

#### Tugboat

##### New Member
Oh... and I should mention... I understand the plan is ultimately to draw 270 balls over a 3 year period, and then based on the results say that they have 99% confidence (or perhaps another confidence level, 95%?) that there is less than 5% chance (or some other %) of there being a red ball among the 2500, and then going forward stop sampling and live with that probability of occurrence.

I’m wondering if that is a fair means for them to arrive at a conclusion, and if that conclusion would even be defensible.

Thoughts?

#### katxt

##### Member
The hypergeometric model does apply when you take a sample from a finite population. It can be approximated by the binomial, and further approximated by the Poisson (and commonly is). However, it looks to me as if we are more interested in the long-term process: balls are being drawn continuously, and it just happens that we are getting about 2500 per day. What we hope to be able to say is something like "we are 95% sure that the red rate is less than 1 in 2500 long term" rather than "we are 95% sure that there were 0 reds in today's 2500."
If this is the case, then you can use the "rule of three", which says that to be 95% sure that the failure rate is less than 1/k, you must have no failures in a sample of 3k. So, if your bosses want to be able to say that the failure rate is less than 1 in 2500, then you would have to have a sample of 3 x 2500 = 7500 without a failure! https://en.wikipedia.org/wiki/Rule_of_three_(statistics)
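A minimal Python sketch of the rule of three, assuming the simplifying assumptions from earlier in the thread (independent draws, constant red rate). The function names are made up for illustration. It compares the shortcut 3/n with the bound obtained by solving (1 - p)^n = 0.05 directly.

```python
def exact_upper_bound(n, confidence=0.95):
    """Upper bound on the red rate p after n draws with 0 reds.

    Solves (1 - p)**n = 1 - confidence for p, i.e. the largest rate
    that would still produce an all-black sample with probability
    (1 - confidence). Assumes a binomial model.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def rule_of_three_bound(n):
    """Shortcut for the 95% upper bound: roughly 3 / n."""
    return 3.0 / n


# Sample sizes mentioned in the thread: 50, 90, 270, and 3 x 2500 = 7500.
for n in (50, 90, 270, 7500):
    print(f"n={n:5d}  exact={exact_upper_bound(n):.5f}  3/n={rule_of_three_bound(n):.5f}")
```

The shortcut is slightly conservative: the exact bound is always a little below 3/n, and the two agree closely once n is in the hundreds.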
I'll think some more about the "management needs to sample 90 balls out of the 2500, to be able to say with 99% confidence that there is less than 5% probability there are any red balls in the lot" claim.

#### katxt

##### Member
I'm happy to look at the calculations for the "management needs to sample 90 balls out of the 2500, to be able to say with 99% confidence that there is less than 5% probability there are any red balls in the lot" claim, but I'm fairly confident that that scheme won't work. As I read it, the proposal suggests that you take a sample of 90 and check them all. If there is at least one red you say "there is at least one red in the 2500." If there are no reds in the sample you say "there is less than a 5% chance that there are any reds in the 2500." But is that chance actually less than 5%?
Say that there is one red in the 2500. Then what is the probability of getting 0 reds out of 90 in our sample, if in fact we know that there is 1 red out of 2500 in the population? This is hypergeometric. There are various online calculators but you can do it in Excel: =HYPGEOMDIST(0,90,1,2500) = 96.4%. In other words, it is very likely indeed that you will get no reds out of 90 when in fact they are present in the population, and consequently it is very likely that you will make a false claim that there are no reds in the 2500. Even if there are 10 reds in the 2500, there is about a 70% chance that you will make a false claim of cleanliness: =HYPGEOMDIST(0,90,10,2500) = 69.2%.
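For anyone without Excel, the same check can be sketched in Python with only the standard library. The function names are hypothetical; the binomial and Poisson versions are included only to illustrate the approximations mentioned earlier in the thread.

```python
from math import comb, exp


def p_zero_reds_hypergeom(population, reds, sample):
    """P(0 reds in the sample) when drawing without replacement.

    Same quantity as Excel's HYPGEOMDIST(0, sample, reds, population).
    """
    return comb(population - reds, sample) / comb(population, sample)


def p_zero_reds_binomial(population, reds, sample):
    """Binomial approximation: treat each draw as independent."""
    return (1 - reds / population) ** sample


def p_zero_reds_poisson(population, reds, sample):
    """Poisson approximation with mean sample * reds / population."""
    return exp(-sample * reds / population)


for reds in (1, 10):
    h = p_zero_reds_hypergeom(2500, reds, 90)
    b = p_zero_reds_binomial(2500, reds, 90)
    p = p_zero_reds_poisson(2500, reds, 90)
    print(f"reds={reds:2d}  hypergeometric={h:.4f}  binomial={b:.4f}  Poisson={p:.4f}")
```

With a population of 2500 and a sample of 90, the three models give nearly the same answer, which is why the simpler approximations are commonly used.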
Here is my suggestion. Accept the fact that there are red balls occasionally and calculate a 95% confidence interval for the long term rate.
Look at the data you quoted 2 out of 20, 0 out of 20, and (let's say to be encouraging) 0 out of 20. That's a total of 2 out of 60.
Find an online calculator for a binomial confidence interval https://www.danielsoper.com/statcalc/calculator.aspx?id=85 for example.
Enter 2 out of 60. The 95% confidence interval is 0.00406 ≤ p ≤ 0.11528. Or scaling up to 2500, between 10 and 288 per day.
Discouraging, isn't it. kat
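The interval above matches what an exact (Clopper-Pearson) binomial interval gives. Assuming that is the method the online calculator uses, here is a rough standard-library sketch that finds the limits by bisection. The helper names are made up for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, lo=0.0, hi=1.0, iters=100):
    """Bisection for f(p) = target, where f decreases as p increases."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, confidence=0.95):
    """Exact two-sided confidence interval for a binomial proportion."""
    alpha = 1 - confidence
    # Lower limit: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


low, high = clopper_pearson(2, 60)
print(f"{low:.5f} <= p <= {high:.5f}")
print(f"scaled to 2500: about {low * 2500:.0f} to {high * 2500:.0f} per day")
```

The same function can be rerun as the tally changes, for example `clopper_pearson(3, 60)` for the 3 reds reported at the start of the thread.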

#### Miner

##### TS Contributor
"As well, on some days, due to type of work, more people are required to work near the hazard. This could run for several weeks, maybe even 4 months, and then the work changes and the number of workers in proximity to the hazard drops. But at all times the hazard is always present. These campaigns could increase the potential for worker exposure, but again, workers following rules might not be exposed."
It sounds like the hazard is location specific, so could you not eliminate that portion of the population that is not working near the hazard from your calculations and sample from those that do work near the hazard?

#### Tugboat

##### New Member
Miner, thanks for the question, but no, the hazard is everywhere and it is possible for all 2500 to come in contact - it’s just what it is. Although, some are closer to it than others and so their potential to be affected is higher. Consequently all 2500 are involved.

Arguably, it was much easier to ignore the red balls and claim that they are not there because they’ve never seen them.
However, to me that argument died the minute the first red ball appeared.

Basically, management doesn’t want to sample all the balls, nor do they want to randomly sample them on an ongoing basis, because those options come at a cost (time, money, whatever). As well, they firmly believe that the engineered and administrative barriers will restrict the number of red balls to perhaps single digits, so sampling is not an advantageous expenditure of resources (time, money, whatever). This exercise is an effort to prove that.

(Note that as of today there are only 3 red balls out of 55 picks - everything seems to be coming up black except for the original 3 reds, but then no one has really been working near the hazard, so not really all that surprising - again, due to engineered and administrative barriers everyone expects the number of red balls to be small.)

So, now I need to explain to management that even if they draw 270 samples (their chosen number), of which 267 come up black, there exist 3 red balls in that lot. And those red balls, to me, represent the fact that in the 2500 total balls there may be others. Not to forget that those 2500 balls go home today, and come back tomorrow and start all over again. So, what do those 3 red balls drawn on day 1 tell us about the 2500 balls on day 2, or on day 3, or ... particularly if we stop drawing balls once we hit 270?

Kat, you appear to have well understood my scenario, and I tend to agree with your statement: “... it is very likely indeed that you will get no reds out of 90 when in fact they are present in the population and consequently it is very likely that you will make a false claim that there are no reds in the 2500.”

As I mentioned, there is a desire to make a statement on probability of occurrence to justify not looking for them (i.e., management might say that there may be some red balls, but the probability of their number exceeding a certain percentage is acceptably low, and they will accept the risk of failing to identify them). It seems such a statement could be made, but the uncertainty in the claim could be significant.

I’m realizing I need to learn a lot more about statistics...

#### katxt

##### Member
Management really needs to say something like "we consider the situation acceptable if we can be 95% sure that the population has less than 1 red in 1000". Then you can design a sampling regime that will test for this. In the meantime, the union might have its own views as to what is acceptable.
The main lesson for management is that measuring and estimating low rates takes much more effort than you might expect. kat
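To put a number on that effort, here is a hedged sketch of how many consecutive all-black draws would be needed to support a statement like "95% sure the rate is less than 1 in 1000". It assumes independent draws from a long-term process with a constant red rate, not a fixed bag of 2500, and the function name is made up for illustration.

```python
from math import ceil, log


def zero_red_sample_size(max_rate, confidence=0.95):
    """Smallest n where n all-black draws support 'red rate < max_rate'.

    Solves (1 - max_rate)**n <= 1 - confidence for n (binomial model).
    A single red ball in the sample invalidates the claim.
    """
    return ceil(log(1 - confidence) / log(1 - max_rate))


# Target from the post above: 95% sure of fewer than 1 red in 1000.
print(zero_red_sample_size(1 / 1000, 0.95))

# Target rate of 1 in 2500, for comparison with the rule of three.
print(zero_red_sample_size(1 / 2500, 0.95))

# One possible reading of the memo: 99% sure the rate is below 5%.
print(zero_red_sample_size(0.05, 0.99))
```

Under this reading, the memo's figure of 90 comes out of the last call, but that is a statement about the rate being below 5% (up to 125 reds in 2500), not about there being no red balls at all. That reading is an assumption; the memo's actual derivation is not shown in the thread.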