Statistical justification - number of retests to overturn a failure

In the Pharma industry, labs have to deal with out-of-specification (OOS) events. Typical examples:

Assay for a drug substance must be 98.0-102.0%. Result = 97.6%.
Water content must be <= 0.20%. Result = 0.23%.

As part of these investigations the sample is retested, and sometimes the retests show that the sample actually meets the requirement. This leads to what is termed an 'inconclusive OOS': a failing result followed by passing result(s), but no root cause to explain the original failure (e.g., had we talked to the lab tech and learned of a sample prep error, that would have been a root cause; here we find nothing of the kind).

The question I have is how to provide a statistical justification for the number of passing retests needed to overturn the original failing result. We currently apply an '8/9' rule, meaning that after the first 97.6% assay result we would need to collect eight consecutive in-spec results in order to overturn or exclude the original failing result. This approach was drawn largely from the feeling that eight results seem like plenty, but we have not derived a statistical argument for why the number ought to be 8 rather than 5, 7, or 10.
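To make the kind of justification I'm after concrete, here is a rough sketch (Python) of one argument I could imagine, using only numbers from my own example: if the true assay value really were 97.6% and the method scatter were about 2% relative (so roughly 2.0 absolute at this level), how likely would n consecutive in-spec retests be purely by chance? The normal model and the SD value are my assumptions, not anything taken from our procedures or from Hofer's paper.

```python
# Rough sketch, not a validated procedure: if the true assay really were
# 97.6% and the method SD were ~2.0 (absolute), how likely are n
# consecutive in-spec (98.0-102.0%) retests anyway?
from scipy.stats import norm

true_mean = 97.6        # assume the original OOS value is the truth
sd = 2.0                # assumed method SD (2% RSD at ~100% assay)
lo, hi = 98.0, 102.0    # specification limits

# probability that a single retest lands in spec under this assumption
p_pass = norm.cdf(hi, true_mean, sd) - norm.cdf(lo, true_mean, sd)

for n in (5, 7, 8, 10):
    print(f"n = {n:2d}: P(all {n} retests in spec) = {p_pass**n:.2e}")
```

Something along those lines would at least let us pick n to hit a stated risk level, but I don't know whether it is a sound way to frame the problem.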

I have a paper by Hofer that speaks to the concept,

http://files.pharmtech.com/alfresco...60fc-4fe4-9deb-d7b6a8737140/article-75176.pdf

but there is an aspect that escapes me. He defines a proportion of suspect results, p, a value between 0 and 1. We don't make widgets, nor do we make aspirin 200 times a year, so assigning a value to p would be pretty much guesswork, which means I cannot properly apply his methodology.

So, in the absence of that, back to the original example: if we had an assay result of 97.6% and then started running retests, how many are enough to claim that the 97.6% is erroneous and that the set of in-spec retest values represents the truth? Is there a suitable statistical test that would let us arrive at a number?

And this may not be one size fits all. There are two fairly big variables at work, as far as I can see:

1. Test precision. It's not the same as citing p, but for most tests we have a precision requirement that must be met for the test to be considered valid. For typical assays, we permit a relative standard deviation of 2% on replicate measurements of a single solution. For a limit test like water, where we are measuring a much smaller quantity of analyte, we might allow 10%. This speaks to general method variability.
2. Difference between the first result and the retests. I expect the problem looks somewhat different depending on how far from the original the retest results are, e.g., a water initial result of 0.23% with retests of 0.18, 0.19, 0.19, 0.18... vs. 0.23% followed by 0.13, 0.10, 0.14, 0.13... (a quick illustration of what I mean follows this list).
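For concreteness, here is a small sketch (Python, standard library only) of the kind of comparison I have in mind for point 2: how far the original 0.23% sits from the retest values, measured in units of the retests' own scatter. The two retest sets are just my made-up numbers from above, and I'm not claiming this particular statistic is the right one.

```python
# Quick illustration, not a validated procedure: distance of the original
# OOS water result from the retest values, in units of the retest scatter.
from statistics import mean, stdev

initial = 0.23  # original water result, %

scenarios = {
    "retests just inside spec": [0.18, 0.19, 0.19, 0.18],
    "retests well inside spec": [0.13, 0.10, 0.14, 0.13],
}

for label, retests in scenarios.items():
    m = mean(retests)
    s = stdev(retests)          # sample SD of the retests
    gap = (initial - m) / s     # distance of the OOS result in SD units
    print(f"{label}: mean = {m:.3f}%, sd = {s:.3f}%, "
          f"original 0.23% sits {gap:.1f} SDs above the retest mean")
```

What I can't see is how to turn a statistic like this (or the method RSD from point 1) into a defensible "n retests are enough" rule.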

Apologies for the long post. Thanks in advance for any insight.