Alternative to chi-squared test needed

#1
Hi all.

I need a goodness of fit test for a situation in which a chi-squared test isn't appropriate - where the sum of the expected values does not necessarily equal the sum of the observed values.

For example, I have 5 cells each with the same expected value of 10. The observations for the five cells might then be: 8, 10, 9, 12 and 11. I haven't calculated the chi-squared value for this particular example but the observations seem reasonable and I assume a standard test would not reject the null hypothesis. However, in this situation I might also have observations of 12, 10, 11, 12 and 11. These observations each have the same absolute deviation from the expected value of 10 so would produce the same chi-squared test result. However, they are clearly skewed and an appropriate test should lead to a rejection of the null hypothesis.

I know that the chi-squared test is not appropriate in this scenario. I'm also fairly sure that the appropriate methodology was covered at some point in my academic past but it's just not coming to mind.

The scenarios arise in the context of Bernoulli trials. The example above could be restated: I'm carrying out 5 sets of 50 independent Bernoulli trials. If the null hypothesis is that the probability of success in each set of trials is 0.2, what is the goodness of fit of the model if the outcomes of the trials are 12, 10, 11, 12 and 11 successes?
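To make the problem concrete, here is a minimal Python sketch (standard library only; the chi-squared tail is computed by simple numeric integration, and the variable names are my own). It standardises each cell by the binomial variance 50·0.2·0.8 = 8; the plain Pearson (O−E)²/E form behaves the same way. Both samples give exactly the same statistic, and hence the same p-value, even though one sits entirely at or above expectation.

```python
import math

def chi2_sf(x, k, steps=100000, upper=200.0):
    """P(X >= x) for a chi-squared variable with k degrees of freedom,
    by midpoint-rule integration of the density (standard library only)."""
    norm = 2 ** (k / 2) * math.gamma(k / 2)
    h = (upper - x) / steps
    total = 0.0
    for i in range(steps):
        t = x + (i + 0.5) * h
        total += t ** (k / 2 - 1) * math.exp(-t / 2)
    return total * h / norm

# 5 sets of 50 Bernoulli(0.2) trials: expected successes 10, variance 50*0.2*0.8 = 8
expected, var = 10.0, 8.0
balanced = [8, 10, 9, 12, 11]    # deviations -2, 0, -1, +2, +1
skewed   = [12, 10, 11, 12, 11]  # deviations +2, 0, +1, +2, +1 (none negative)

def sum_sq(obs):
    """Sum of squared standardised deviations, approximately chi-squared (5 df)."""
    return sum((o - expected) ** 2 / var for o in obs)

print(sum_sq(balanced), sum_sq(skewed))  # identical: 1.25 for both samples
print(chi2_sf(sum_sq(skewed), 5))        # so both samples get the same p-value (~0.94)
```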

Many thanks in advance for any help.

James.
 
#2
No responses so far, but I’ve been reflecting on my own question further and thought it would be worth sharing an update. I’m afraid I don’t have the time to present a rigorous treatment but hopefully the line of thinking is sound. However, I’m conscious that this pokes at an apparent ‘hole’ in conventional statistics, so I’m quite prepared to hear back that this is handled by an existing distribution that I’m not familiar with, or that there’s simply a flaw in my argument.

It still seems correct to me that when considering goodness of fit for N independent standard normal random variables the sum of the squares cannot be a sufficient statistic for hypothesis testing. Each observation will be positive or negative, and the total number of positive and negative observations is also relevant. Specifically, if we let ‘n’ be the greater of (a) the total number of positive observations and (b) the total number of negative observations, then the value of n should also be considered.

For example, if k (degrees of freedom) = 5 and we have a sample where x (the sum of squares) = 7, the chi-squared distribution tells us that a sum of squares as extreme as this will be seen about 22% of the time. In a traditional chi-squared test we extend the inference to be: “a sample as extreme as this will be seen about 22% of the time”, and in this example we would rule that there is insufficient evidence to reject the null hypothesis.

In practice this extension is usually reasonable because we often derive our N independent standard normal random variables from a set of observed values and a set of expected values, where the expected values are determined by allocating the total number of observations across k cells of expected values according to some hypothesised distribution. In most cases this reduces the likelihood of there being an extreme imbalance between the number of positive and negative values of ‘observed minus expected’.

However, in an application where the expected values are fixed independently of the observations – such as the scenario in my earlier question – or more generally where we are truly considering independent standard normal random variables, extending our inference from ‘the sum of squares is not extreme enough to reject the null hypothesis’ to ‘the sample is not extreme enough to reject the null hypothesis’ is inappropriate. Building on the above example, if we take all possible samples of 5 observations where x (the sum of squares) is approximately 7, the binomial distribution tells us that 62.5% of these samples will have n=3 (either 3 positive and 2 negative values, or 3 negative and 2 positive values), 31.25% of the samples will have n=4, and 6.25% of the samples will have n=5. So we will see samples with n=5 much less frequently than samples with n=3. In other words, for a given sum of squares, when n=5 the observations do not fit the hypothesised distribution as well as when n=3. My conclusion is that a test of goodness of fit should incorporate both x and n, and when n is larger the null hypothesis will be rejected for smaller values of x.
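The sign-count probabilities quoted above follow from a Binomial(5, 1/2) on the number of positive observations; a short sketch (function name `prob_n` is my own):

```python
from math import comb

def prob_n(n, k=5):
    """P(the majority sign count equals n), where n = max(#positive, #negative)
    and each of k independent signs is +/- with probability 1/2."""
    p = comb(k, n) / 2 ** k        # P(exactly n positive observations)
    return 2 * p if n > k / 2 else p  # fold: n positives OR n negatives

print(prob_n(3), prob_n(4), prob_n(5))  # 0.625, 0.3125, 0.0625
```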

If this is correct, it suggests an alternative distribution for testing goodness of fit for a sample of independent standard normal random variables. Like the standard chi-squared distribution this would have k degrees of freedom, but would be a function of x and n as defined above.

I’m making no attempt to rigorously derive a pdf and cdf – not least because I’m sure this distribution must exist somewhere – but I believe (hope!) that the following procedure is correct and can be applied more generally. For the sake of maintaining reasonably plain language I’m continuing to rely heavily on the concept of samples being ‘extreme’, by which I mean samples being unlikely to occur under the null hypothesis when compared to samples that are less ‘extreme’. Please consider this concept equivalent to the p-value as it is conventionally defined; a sample with a test statistic that has a smaller p-value is considered to be more extreme than one with a test statistic that has a larger p-value.

The procedure:

Continuing to build on the above example, where k (degrees of freedom) = 5 and x (sum of squares) = 7, suppose that n=5 (all five observations in the sample share the same sign).
We see x >= 7 approximately 22.06% of the time (by the chi-squared distribution), and 6.25% of these samples have n=5 (by the binomial distribution), so under the null hypothesis we see samples as extreme as this with n=5 only 1.38% of the time. Additional samples that are at least as extreme with n=4 also exist (albeit with higher values of x); given that the definition of ‘extreme’ is based on frequency of occurrence, there must also be 1.38% of such samples (a statement which relies only on there being at least 1.38% of samples with n=4). Similarly, there are samples that are at least as extreme with n=3, and there are also 1.38% of these. Therefore, when k = 5 degrees of freedom, the p-value for x=7 and n=5 is approximately 0.0138 + 0.0138 + 0.0138 = 0.0414.

For n=4 (either 4 positive and 1 negative value, or 1 positive and 4 negative values).

We see x >= 7 approximately 22.06% of the time, and 31.25% of these samples have n=4, so under the null hypothesis we see samples as extreme as this with n=4 only 6.90% of the time. As before, additional samples that are at least as extreme with n=3 also exist, and there are 6.90% of such samples (relying only on at least 6.90% of all samples having n=3). However, given that only 6.25% of all samples have n=5, by definition all samples with n=5 are more extreme. Therefore, when k = 5 degrees of freedom, the p-value for x=7 and n=4 is approximately 0.0690 + 0.0690 + 0.0625 = 0.2005.

For n=3 (either 3 positive and 2 negative values, or 2 positive and 3 negative values).

We see x >= 7 approximately 22.06% of the time, and 62.5% of these samples have n=3, so under the null hypothesis we see samples as extreme as this with n=3 only 13.79% of the time. Additional samples that are at least as extreme with n=4 exist, and there are 13.79% of such samples (because 31.25% of samples have n=4, and it should now be evident that we are taking the smaller of 13.79% and 31.25%). Continuing in the same vein, only 6.25% of all samples have n=5, so all of these samples are more extreme. Therefore, when k = 5 degrees of freedom, the p-value for x=7 and n=3 is approximately 0.1379 + 0.1379 + 0.0625 = 0.3383.

If this procedure is correct then samples with k=5 and x=7 (or greater), which the chi-squared distribution tells us occur 22.06% of the time, may or may not be as common under the null hypothesis as a standard chi-squared test tells us. The alternative test statistic (based on x and n) has a p-value of 0.3383 when n=3, and a p-value of 0.2005 when n=4. Such samples would not lead to rejection of the null hypothesis. However, when n=5 we have a p-value of 0.0414 and we would reject the null hypothesis.
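For anyone wanting to reproduce the arithmetic, the procedure can be sketched in Python (standard library only; the chi-squared tail is computed by numeric integration, and `prob_n` and `p_value` are my own names for the quantities described above). This is purely an illustration of the calculation as stated, not a validated test.

```python
import math
from math import comb

def chi2_sf(x, k, steps=100000, upper=200.0):
    """P(X >= x) for chi-squared with k df, via midpoint-rule integration."""
    norm = 2 ** (k / 2) * math.gamma(k / 2)
    h = (upper - x) / steps
    total = 0.0
    for i in range(steps):
        t = x + (i + 0.5) * h
        total += t ** (k / 2 - 1) * math.exp(-t / 2)
    return total * h / norm

def prob_n(n, k=5):
    """P(majority sign count = n) when each of k signs is +/- with probability 1/2."""
    p = comb(k, n) / 2 ** k
    return 2 * p if n > k / 2 else p

def p_value(x, n_obs, k=5):
    """The procedure above: the joint tail probability for the observed majority
    count, capped at each count's total probability and summed over all counts
    (k assumed odd, so majority counts run from (k+1)/2 to k)."""
    base = chi2_sf(x, k) * prob_n(n_obs, k)
    return sum(min(base, prob_n(m, k)) for m in range((k + 1) // 2, k + 1))

for n in (5, 4, 3):
    print(n, round(p_value(7.0, n), 4))
# ~0.0414, 0.2004, 0.3383 (the 0.2005 in the text comes from rounding 6.90% first)
```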

That was a longer posting than I thought it would be so thank you to anybody who bothered to read this far. So, do you agree that the traditional chi-squared test fails to make use of valuable information as described above? Have I unwittingly described an existing distribution? Is my procedure for calculating p-values correct or flawed? Is such a procedure the best way to calculate p-values or can the pdf and cdf be rigorously defined?

JB.
 

BGM

TS Contributor
#3
If you assume that the trials in each set are mutually independent,
then the total \( 5 \times 50 = 250 \) trials are independent,
and you just need a simple test for the binomial proportion \( p = 0.2 \)
with observed number of successes \( = 12 + 10 + 11 + 12 + 11 = 56 \)
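This pooled test can be sketched with the normal approximation to the binomial (an exact binomial test is also an option; variable names here are illustrative):

```python
import math

# Pooled test of p = 0.2 over 250 independent trials with 56 observed successes.
n, p0 = 250, 0.2
successes = 12 + 10 + 11 + 12 + 11        # = 56
mean = n * p0                             # 50
sd = math.sqrt(n * p0 * (1 - p0))         # sqrt(40)
z = (successes - mean) / sd
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal-approx p-value
print(z, p_two_sided)  # z ~ 0.95, p ~ 0.34: no evidence against p = 0.2
```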
 
#4
Thanks BGM.

You're quite right, but I wasn't clear in my problem statement that while the 5 samples are mutually independent they are not necessarily identically distributed.

I should have stated that the 5 samples are from 5 distributions with probabilities p1 to p5 (which may or may not be identical). The null hypothesis is that p1=p2=p3=p4=p5=0.2.

Where this is the case it's not sufficient to pool all of the samples into a single binomial test.

JB.
 

BGM

TS Contributor
#5
You may express your null hypothesis as the intersection of 5 simpler hypotheses:

\( H_0: p_1 = p_2 = p_3 = p_4 = p_5 = 0.2 \) is equivalent to \( \bigcap_{i=1}^5 H_{0i} \triangleq \bigcap_{i=1}^5 \{p_i = 0.2\} \)

and you reject \( H_0 \) if you reject any one of the \( H_{0i} \),
and each test is just a simple binomial proportion test.
 
#6
Thanks - that makes sense. I suppose it's worth being clear that the critical values used to test the simplified hypotheses would have to be adjusted to ensure that the overall test has the desired significance.

If a standard critical value based on 5% significance was used for each of the 5 tests this approach would reject a true overall null hypothesis 22.6% of the time (calculated 1 - 0.95^5). I would have to use a critical value based on 1.02% significance for each of the 5 tests to achieve an overall test significance of 5%.
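That adjustment (the Šidák correction, which this amounts to for independent sub-tests) is a one-liner:

```python
# Sidak adjustment for 5 independent binomial sub-tests: choose the per-test
# significance level so the family-wise error rate is 5% when all nulls are true.
alpha_family = 0.05
m = 5
naive_fwer = 1 - (1 - alpha_family) ** m            # ~0.2262 if 5% is used per test
alpha_per_test = 1 - (1 - alpha_family) ** (1 / m)  # ~0.0102 per test for 5% overall
print(naive_fwer, alpha_per_test)
```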

Although this does address the problem as I described it, I can now better articulate that what I'm looking for is a normality test that will determine the likelihood that a sample follows a normal distribution when the mean and variance have been specified (rather than when the mean and variance are estimated from the observed data). The best resource I've found is the Wikipedia page for 'Normality test', which provides a good overview and has links to a number of tests including D'Agostino's K-squared test, the Jarque–Bera test, the Anderson–Darling test, the Cramér–von Mises criterion, the Lilliefors test for normality and the Shapiro–Wilk test. (Not all of these are appropriate when the mean and variance are specified but I've found them all worthwhile reading.)
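Of the standard options, the one-sample Kolmogorov–Smirnov test applies directly when the mean and variance are fully specified in advance (Lilliefors is the variant for estimated parameters). A rough stdlib sketch, using the asymptotic Kolmogorov distribution with Stephens' small-sample correction for the p-value; the sample data and function names are illustrative only:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))

def ks_test_normal(sample, mu, sigma):
    """One-sample KS test against a fully specified N(mu, sigma^2).
    Returns (D, approximate p-value); the p-value uses the asymptotic
    Kolmogorov distribution, so it is only rough for small samples."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        d = max(d, f - i / n, (i + 1) / n - f)  # sup distance, both one-sided gaps
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d  # Stephens' correction
    p = 2 * sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, 101))
    return d, max(0.0, min(1.0, p))

sample = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.1, -0.7, 0.5, -0.2, 0.9]
d, p = ks_test_normal(sample, 0.0, 1.0)
print(d, p)  # this well-behaved sample is consistent with N(0, 1)
```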

It appears that there isn't an existing test based on the sum of squares and the number of positive/negatives - the idea I explored in my second post above. I've since confirmed that the procedure I described to calculate p-values is incorrect, but I have determined the correct calculations and will post them in case anybody is interested. In comparison with the alternative methodologies I think these test statistics are very easy to calculate and there is quite strong appeal due to the use of simple, well-understood distributions (chi-squared and binomial).

JB.