Need to aggregate t-test stats

#1
Here's the setup: my main goal is to "bless" a new database (db), where
"blessing" means that the new database isn't "statistically different"
from the old db. One way I thought of doing this is to compare the
string length (i.e. number of letters) of a given field across the two
dbs to see if the lengths are the same. However, I know the database
was created by combining data from multiple different sources, and each
record in the db carries its source-of-origin info. So, at least to me,
it makes sense to build a stronger test based on checking the
consistency of the two dbs' string lengths for each source separately.
That is to say, I am partitioning the two samples into multiple
sub-samples, where each sub-sample contains all the records from one
source.

Rather than continue describing this in painfully abstract terms, I'll
make it concrete:
Imagine that both databases have a "name" field. The names are
collected from different countries (i.e. "sources"). It may be that
one country messed up its data entry, so I want to test the "name"
lengths for each country. So, for instance, both databases could have
some names from England, China, France, etc. (hundreds of countries),
and let's say that just the French source screwed up its data entry.
I can do a two-sample t-test to compare the lengths of English names
in db1 against db2, then another t-test for the Chinese names, and
likewise for the French, etc. Now I have hundreds of t-test p-values,
but I need (I think?) to combine the results to give a final answer.
The "French t-test" has a low p-value but hey, there are hundreds of
tests, so that could be a coincidence rather than anything
statistically significant. Note that the numbers of records from a
given country and from each db can differ, so each t-test has a
different t-distribution (different sample sizes ==> different d.f.).
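
To make the per-country testing concrete, here is a minimal sketch (the
country list and data are made up). scipy's ttest_ind with
equal_var=False gives Welch's t-test, which is the safer choice given
the unequal sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in data: name lengths keyed by country, one dict per db.
countries = ["England", "China", "France"]
db1 = {c: rng.poisson(6, size=int(rng.integers(50, 500))) for c in countries}
db2 = {c: rng.poisson(6, size=int(rng.integers(50, 500))) for c in countries}

for c in countries:
    # Welch's t-test: does not assume equal variances or equal sample sizes.
    t, p = stats.ttest_ind(db1[c], db2[c], equal_var=False)
    print(f"{c}: t = {t:.2f}, p = {p:.3f} (n1 = {len(db1[c])}, n2 = {len(db2[c])})")
```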

(Given the multiple t-test stuff, perhaps ANOVA would be right, but it doesn't seem to quite gel... see the end.)

If these tests were all homogeneous (measuring the same quantity) I
might use Bonferroni or the like, but that's not the case.
Furthermore, I simply don't like the idea of lowering the alphas on
each t-test when, if the dbs were the same (i.e. my overall null
hypothesis Ho), I would expect a distribution of results where most of
the tests have high (safely-Ho) p-values; lowering the alphas seems
like a great way to get false negatives. In fact, following that
logic, my intuition suggests that the "right" thing to do is to take
the set of t-statistics obtained from all the different (source) tests
and compare it to a t-distribution; if it fits, then even if some
p-values violate a typical per-t-test confidence limit I wouldn't care
(e.g. if you run a million tests, a single p-value of 0.01 doesn't say
anything by itself, but if most of them were 0.01...). Anyway, the
above is a non-starter because, as mentioned, t-distributions are a
function of sample size/degrees of freedom and those are different for
each test (country). So I can't form a single sampled t-distribution.
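
A sketch of one standard workaround for the differing-d.f. problem (not
something I've tried; the numbers below are made up): map each
t-statistic through the CDF of its own t-distribution. Under Ho the
transformed values are all Uniform(0,1), so the whole set can be
compared to a single reference distribution, e.g. with a
Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy import stats

# Hypothetical per-country results: (t-statistic, Welch degrees of freedom).
t_stats = np.array([0.4, -1.1, 2.5, 0.2, -0.7])
dfs = np.array([93.2, 410.8, 57.5, 220.1, 130.4])  # Welch d.f. need not be integers

# Probability integral transform: under Ho each value is Uniform(0, 1),
# even though every test has its own t-distribution.
u = stats.t.cdf(t_stats, df=dfs)

# Compare the whole collection against Uniform(0, 1) in one shot.
ks_stat, ks_p = stats.kstest(u, "uniform")
print(f"KS stat = {ks_stat:.3f}, p = {ks_p:.3f}")
```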

Another possibility that intuitively feels promising is to somehow
work with the final p-values. After all, regardless of the underlying
statistic (t vs. <put your stat here>), given a certain number of
tests there should only be so many low p-values. Unfortunately, I
don't yet see how to put that into practice (assuming it's even
sensible!).
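
For what it's worth, one standard way to make that concrete is Fisher's
method for combining independent p-values: under Ho, -2 * sum(ln p_i)
follows a chi-squared distribution with 2k d.f. for k tests, so "too
many low p-values" shows up as an inflated chi-squared statistic. A
minimal sketch with made-up p-values:

```python
import numpy as np
from scipy import stats

# Hypothetical p-values from the per-country t-tests.
pvals = np.array([0.64, 0.31, 0.008, 0.85, 0.47])

# Fisher's method: under Ho, -2 * sum(ln p) ~ chi-squared with 2k d.f.
chi2_stat, combined_p = stats.combine_pvalues(pvals, method="fisher")
print(f"chi2 = {chi2_stat:.2f}, combined p = {combined_p:.3f}")
```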

====== ANOVA? ===========
1) ANOVA wants individual measurements vs. the summary statistics that suffice for my t-tests.
2) I don't think Model I (fixed-effects) ANOVA fits the bill anyway because, I think, my conditions would be the two databases and the rows would be the measurements without respect to source, so what's left is a simple single t-test that may not find a discrepancy in a couple of sources (e.g. say two small sources/countries messed up their data entry in the 2nd db; their contribution to the increased variance could be lost as noise among all the other sources' contributions).
3) Model II (random-effects) ANOVA: pretty much the same problem as the above.
4) Two-way ANOVA: the problem I have here is that this seems to suggest the rows of the table are the sources (e.g. countries), the two columns are the two dbs, and each source has its own possible bias (to be factored out) on the measurements. But I don't think that corresponds to my situation, because a particular source could be fine for one db and bad for the other, if someone from that source happened to mess up data entry for only one of the dbs. My understanding of ANOVA is that an issue with a source would be expected to bias both db measurements in the same way (it's the source that's the factor, not someone who messed it up on one occasion). (See the sketch after the next paragraph.)

Anyway, I've totally skirted the issue that I would need one row per source, which means forming an aggregate stat (e.g. a mean) for each cell (i.e. each source/db element in the table). I'm not sure that's OK, but I think it is. Granted, that would obliterate the small weighting (i.e. influence on the stats) of small sources (countries), but catching problems in small sources is something I really want (and it's consistent with the multiple t-test concept above).
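
For reference, here is a sketch of how the two-way layout could be fit on individual records, if they were available, using statsmodels (all data and column names below are hypothetical). The source:db interaction term is the part of a two-way ANOVA that would flag a source going wrong in only one db:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Hypothetical long-format data: one row per record.
n = 600
df = pd.DataFrame({
    "length": rng.poisson(6, size=n),
    "source": rng.choice(["England", "China", "France"], size=n),
    "db": rng.choice(["db1", "db2"], size=n),
})

# Two-way ANOVA with interaction: main effects for source and db,
# plus a source:db term for source-specific differences between the dbs.
model = ols("length ~ C(source) * C(db)", data=df).fit()
print(anova_lm(model, typ=2))
```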
 

Xenu

New Member
#2
I am not sure I understand the problem, but I will try to give an answer anyway.

If these tests were all homogeneous (measuring the same quantity) I
might use Bonferroni or the like, but that's not the case.
I don't see why you couldn't use something like Bonferroni. It doesn't 'care' about the quantity measured, as it only affects your p-values.

As the tests for each country are probably independent, you can simply use the Šidák relation:

Alpha = 1 - (1 - FWER)^(1/n)

where FWER is the desired familywise error rate, n is the number of tests, and Alpha is the individual (per-test) error rate to use.
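
For example, with a hypothetical 200 independent tests and a target FWER of 0.05, the per-test alpha works out to roughly 0.000256:

```python
# Šidák per-test alpha for a desired familywise error rate.
fwer = 0.05  # desired familywise error rate (made-up target)
n = 200      # number of independent tests (hypothetical count)
alpha = 1 - (1 - fwer) ** (1 / n)
print(f"per-test alpha = {alpha:.6f}")  # ~0.000256
```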

Of course, correcting for multiple comparisons will increase the risk of false negatives in the individual cases, but sadly I don't see a way to get around that problem without increasing the risk of a false positive.