Fisher's exact test as a goodness of fit test

Gaba

Hello everybody,

I'm a beginner when it comes to statistics, and I would very much appreciate any help with this problem.

I have a 10000x2 table (see below; the leftmost column is just the row index). Column A holds binned counts of observations, dataset A, and column B holds binned counts over the same range produced by a model, B (you can check here to see where this question is coming from).

Code:
        A   B
1       1   0
2       0   0
3       1   0
4       0   1
5       1   0
6       2   0
7       1   0
8       3   0
9       0   0
10      0   0
11      5   2
12      1   1
13      0   1
(...)
I need to check the goodness of fit of model B with respect to observations A (i.e., I need a measure of how well model B represents the data in A), and I've been advised that Fisher's exact test is what I should apply.

If I use R's implementation of Fisher's test (fisher.test) with the Monte Carlo simulation option (simulate.p.value=TRUE), I get this:

Code:
> test_grid <- read.table("test_grid")
> test_grid
        A   B
1       1   0
2       0   0
3       1   0
4       0   1
5       1   0
6       2   0
7       1   0
8       3   0
9       0   0
10      0   0
11      5   2
12      1   1
13      0   1
(...)
> fisher.test(test_grid, simulate.p.value = TRUE, B = 10000)

    Fisher's Exact Test for Count Data with simulated p-value (based on
    10000 replicates)

data:  test_grid 
p-value = 9.999e-05
alternative hypothesis: two.sided
As I understand it, with simulate.p.value=TRUE what this function does is:

  1. Calculate the probability of the observed table, conditional on its fixed row and column margins (Fisher's fixed-margin condition)
  2. Generate B random tables with those same margins and calculate the probability of each
  3. Report as the p-value the fraction of simulated tables whose probability is less than or equal to that of the original table, with a +1 added to the numerator and denominator, which is why the smallest possible result is 1/(B+1) = 9.999e-05 here (see the sketch after this list)
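
For concreteness, here is a minimal sketch of that simulated procedure, assuming the counts are already loaded in test_grid as above. It mimics what fisher.test reports (drawing fixed-margin tables with base R's r2dtable), but it is not the exact internal implementation.

Code:
## Minimal sketch of the simulated p-value; `test_grid` holds the A/B
## counts as above. Not the exact fisher.test implementation.
set.seed(1)
x <- as.matrix(test_grid)
B <- 10000

## Log-probability of a table under independence, conditional on its
## row and column margins (multivariate hypergeometric distribution).
log_prob <- function(m) {
  sum(lfactorial(rowSums(m))) + sum(lfactorial(colSums(m))) -
    lfactorial(sum(m)) - sum(lfactorial(m))
}
obs <- log_prob(x)

## Draw B random tables with the same margins, then count how many are
## at most as probable as the observed table (i.e., at least as extreme);
## the small tolerance allows for floating-point ties.
sims <- r2dtable(B, rowSums(x), colSums(x))
extreme <- sum(vapply(sims, log_prob, numeric(1)) <= obs + 1e-7)

## The +1 terms mean the smallest reportable p-value is 1/(B + 1).
(1 + extreme) / (B + 1)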

Now my questions are:

  1. What is the meaning of the final p-value in this context? How would I describe it?
  2. What would the null hypothesis look like for this problem?

I've been told that in this context a higher p-value is what I want: the null hypothesis of independence means that whether the counts come from A or B, you observe a similar distribution across the bins, so the higher the p-value, the better the fit. I think I understand this, but I still have trouble phrasing the two points above.
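
To check my intuition I also tried a toy example (made-up data, not my real table): if both columns are drawn from the same bin distribution, the simulated p-value should usually come out large.

Code:
## Hypothetical sanity check: two samples from the same bin distribution
## should usually give a large simulated p-value.
set.seed(42)
probs <- rep(1 / 50, 50)                     # 50 bins, uniform for simplicity
same <- cbind(A = rmultinom(1, 200, probs),  # 200 total counts per column
              B = rmultinom(1, 200, probs))
fisher.test(same, simulate.p.value = TRUE, B = 10000)$p.value
## Typically well above 0.05: no evidence against the null that A and B
## share the same distribution across bins.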

Any help would be very much appreciated.

Cheers!