No responses so far, but I’ve been reflecting on my own question further and thought it would be worth sharing an update. I’m afraid I don’t have the time to present a rigorous treatment but hopefully the line of thinking is sound. However, I’m rather conscious that this appears to be poking at an apparent ‘hole’ in conventional statistics so I’m quite prepared to hear back that this is handled by an existing distribution that I’m not familiar with, or that there’s simply a flaw in my argument.

It still seems correct to me that, when considering goodness of fit for N independent standard normal random variables, the sum of the squares cannot be a sufficient statistic for hypothesis testing. Each observation will be positive or negative, and the total number of positive and negative observations is also relevant. Specifically, if we let ‘n’ be the greater of (a) the total number of positive observations and (b) the total number of negative observations, then the value of n should also be considered.

For example, if k (degrees of freedom) = 5 and we have a sample where x (the sum of squares) = 7, the chi-squared distribution tells us that a sum of squares at least this extreme will be seen about 22% of the time (P(X >= 7) is approximately 0.2206 for 5 degrees of freedom). In a traditional chi-squared test we extend the inference to be: “a sample as extreme as this will be seen about 22% of the time”, and in this example we would rule that there is insufficient evidence to reject the null hypothesis.
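As a quick check of that 22% figure, here is a small Python sketch (my own illustration, not taken from any statistics package) that computes the chi-squared survival function for integer degrees of freedom using only the standard library, via the usual recurrence between the k and k+2 tail probabilities:

```python
import math

def chi2_sf(x, k):
    """P(X >= x) for a chi-squared variable with integer df k.

    Uses the standard recurrence
        sf_{k+2}(x) = sf_k(x) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1),
    starting from sf_1(x) = erfc(sqrt(x/2)) and sf_2(x) = exp(-x/2).
    """
    if k % 2:                        # odd df: start from the k = 1 base case
        sf, j = math.erfc(math.sqrt(x / 2)), 1
    else:                            # even df: start from the k = 2 base case
        sf, j = math.exp(-x / 2), 2
    while j < k:
        sf += (x / 2) ** (j / 2) * math.exp(-x / 2) / math.gamma(j / 2 + 1)
        j += 2
    return sf

print(chi2_sf(7, 5))  # ~0.2206
```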

In practice this extension is usually reasonable because we often derive our N independent standard normal random variables from a set of observed values and a set of expected values, where the expected values are determined by allocating the total number of observations across k cells of expected values according to some hypothesised distribution. In most cases this reduces the likelihood of there being an extreme imbalance between the number of positive and negative values of ‘observed minus expected’.

However, in an application where the expected values are fixed independently of the observations – such as the scenario in my earlier question – or more generally where we are truly considering independent standard normal random variables, extending our inference from ‘the sum of squares is not extreme enough to reject the null hypothesis’ to ‘the sample is not extreme enough to reject the null hypothesis’ is inappropriate. Building on the above example, if we take all possible samples of 5 observations where x (the sum of squares) is approximately 7, the binomial distribution tells us that 62.5% of these samples will have n=3 (either 3 positive and 2 negative values, or 3 negative and 2 positive values), 31.25% of the samples will have n=4, and 6.25% of the samples will have n=5. (Because the observations are standard normal, their signs are independent of their magnitudes, so the signs remain independent fair coin flips even after conditioning on the sum of squares.) So we will see samples with n=5 much less frequently than samples with n=3. In other words, for a given sum of squares, when n=5 the observations do not fit the hypothesised distribution as well as when n=3. My conclusion is that a test of goodness of fit should incorporate both x and n, and when n is larger the null hypothesis will be rejected for smaller values of x.
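Those binomial figures are easy to verify by counting over the 2^5 equally likely sign patterns. A short Python sketch (illustrative only; the variable names are mine):

```python
from math import comb

k = 5  # number of observations in the sample
# n = max(#positive, #negative); under the null hypothesis each sign is an
# independent fair coin flip, regardless of the observations' magnitudes
probs = {}
for pos in range(k + 1):
    n = max(pos, k - pos)
    probs[n] = probs.get(n, 0.0) + comb(k, pos) / 2 ** k

print(probs)  # {5: 0.0625, 4: 0.3125, 3: 0.625}
```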

If this is correct, it suggests an alternative distribution for testing goodness of fit for a sample of independent standard normal random variables. Like the standard chi-squared distribution this would have k degrees of freedom, but would be a function of x and n as defined above.

I’m making no attempt to rigorously derive a pdf and cdf – not least because I’m sure this distribution must exist somewhere – but I believe (hope!) that the following procedure is correct and can be applied more generally. For the sake of maintaining reasonably plain language I’m continuing to rely heavily on the concept of samples being ‘extreme’, by which I mean samples being unlikely to occur under the null hypothesis when compared to samples that are less ‘extreme’. Please consider this concept equivalent to the p-value as it is conventionally defined; a sample with a test statistic that has a smaller p-value is considered to be more extreme than one with a test statistic that has a larger p-value.

The procedure:

Continuing to build on the above example, where k (degrees of freedom) = 5 and x (sum of squares) = 7, suppose that n=5 (all five observations in the sample have the same sign, whether positive or negative).

We see x >= 7 approximately 22.06% of the time (by the chi-squared distribution), and 6.25% of these samples have n=5 (by the binomial distribution), so under the null hypothesis we see samples as extreme as this with n=5 only 0.2206 × 0.0625 ≈ 1.38% of the time. Additional samples that are at least as extreme with n=4 also exist (albeit with higher values of x); given that the definition of ‘extreme’ is based on frequency of occurrence, there must also be 1.38% of such samples (a statement which relies only on there being at least 1.38% of samples with n=4). Similarly, there are samples that are at least as extreme with n=3, and there are also 1.38% of these. Therefore, when k = 5 degrees of freedom, the p-value for x=7 and n=5 is approximately 0.0138 + 0.0138 + 0.0138 = 0.0414.

For n=4 (either 4 positive values and 1 negative value, or 1 positive value and 4 negative values).

We see x >= 7 approximately 22.06% of the time, and 31.25% of these samples have n=4, so under the null hypothesis we see samples as extreme as this with n=4 only 0.2206 × 0.3125 ≈ 6.90% of the time. As before, additional samples that are at least as extreme with n=3 also exist, and there are 6.90% of such samples (relying only on at least 6.90% of all samples having n=3). However, given that only 6.25% of all samples have n=5, by definition all samples with n=5 are more extreme. Therefore, when k = 5 degrees of freedom, the p-value for x=7 and n=4 is approximately 0.0690 + 0.0690 + 0.0625 = 0.2005.

For n=3 (either 3 positive and 2 negative values, or 2 positive and 3 negative values).

We see x >= 7 approximately 22.06% of the time, and 62.5% of these samples have n=3, so under the null hypothesis we see samples as extreme as this with n=3 only 0.2206 × 0.625 ≈ 13.79% of the time. Additional samples that are at least as extreme with n=4 exist, and there are 13.79% of such samples (because 31.25% of samples have n=4, and it should now be evident that we are taking the smaller of 13.79% and 31.25%). Continuing in the same vein, only 6.25% of all samples have n=5, so all of these samples are more extreme. Therefore, when k = 5 degrees of freedom, the p-value for x=7 and n=3 is approximately 0.1379 + 0.1379 + 0.0625 = 0.3383.

If this procedure is correct then samples with k=5 and x=7 (or greater), which the chi-squared distribution tells us occur 22.06% of the time, may or may not be as common under the null hypothesis as a standard chi-squared test tells us. The alternative test statistic (based on x and n) has a p-value of 0.3383 when n=3, and a p-value of 0.2005 when n=4. Such samples would not lead to rejection of the null hypothesis. However, when n=5 we have a p-value of 0.0414 and we would reject the null hypothesis.
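If I have understood my own rule correctly, the whole procedure reduces to: take c = P(X >= x) × P(N = observed n), then sum min(c, P(N = m)) over every sign class m. Here is a self-contained Python sketch of that calculation (my own illustration; the function names are mine). Note that to four decimal places the results come out as 0.3383, 0.2004 and 0.0414 – the 0.2005 above is an artefact of rounding 6.895% to 6.90% before adding:

```python
import math
from math import comb

def chi2_sf(x, k):
    # chi-squared survival function P(X >= x) for integer df k, via the
    # standard recurrence between the k and k+2 tail probabilities
    sf, j = (math.erfc(math.sqrt(x / 2)), 1) if k % 2 else (math.exp(-x / 2), 2)
    while j < k:
        sf += (x / 2) ** (j / 2) * math.exp(-x / 2) / math.gamma(j / 2 + 1)
        j += 2
    return sf

def sign_class_probs(k):
    # P(N = n) where N = max(#positive, #negative) among k fair signs
    probs = {}
    for pos in range(k + 1):
        n = max(pos, k - pos)
        probs[n] = probs.get(n, 0.0) + comb(k, pos) / 2 ** k
    return probs

def p_value(x, k, n_obs):
    # mass of samples at least as extreme within the observed sign class,
    # plus, for every other class, the smaller of that mass and the
    # class's total probability (the procedure described above)
    pn = sign_class_probs(k)
    c = chi2_sf(x, k) * pn[n_obs]
    return sum(min(c, pn[m]) for m in pn)

for n in (3, 4, 5):
    print(n, round(p_value(7, 5, n), 4))  # 0.3383, 0.2004, 0.0414
```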

That was a longer posting than I thought it would be, so thank you to anybody who bothered to read this far. Do you agree that the traditional chi-squared test fails to make use of valuable information as described above? Have I unwittingly described an existing distribution? Is my procedure for calculating p-values correct or flawed? Is such a procedure the best way to calculate p-values, or can the pdf and cdf be rigorously defined?

JB.