that "blessing" means that the new database isn't "statistically different" from the old db. One way I thought of doing this is to compare the string length (i.e. number of letters) for a given field in each db to see whether they are the same. However, I know the database was created by combining data from multiple different sources, and each record in the db has the source-origin info. So, at least to me, it makes sense to have a stronger test based on testing the consistency of the two dbs' string lengths for each source. That is to say, I am partitioning the two samples into multiple sub-samples, where each sub-sample has all the records from a different source.

Rather than continue describing this in painfully abstract terms, I will make this concrete:

Imagine that both databases have a "name" field. The names are collected from different countries (i.e. the "source"). It may be that one country messed up their data entry, so I want to test the "name" lengths for each country. For instance, both databases could have some names from England, China, France, etc. (hundreds of countries), and let's say that just the French source screwed up their data entry.

I can do a two-sample t-test to compare the lengths of English names in db1 against db2, then another t-test for the Chinese names, and likewise for the French, etc. Now I have hundreds of t-test p-values, but I need (I think?) to combine the results to give a final answer. Say the "French t-test" has a low p-value, but hey, there are hundreds of tests, so that could be a (statistically insignificant) coincidence.

Note that the number of records from a given country and from each db can differ (and so each t-test has a different t-distribution, given that different sample sizes ==> different d.f.).
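
To make the per-source testing concrete, here is a minimal sketch (Python with SciPy; the sources, the Poisson name-length model, and all of the data are made up purely for illustration) of running one Welch t-test per source:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up name-length samples per source; only "France" differs between
# the dbs, mimicking one source messing up its data entry in db2.
sources = ["England", "China", "France"]
db1 = {s: rng.poisson(6, size=200) for s in sources}
db2 = {s: rng.poisson(6, size=150) for s in sources}
db2["France"] = rng.poisson(9, size=150)  # the corrupted source

# One Welch two-sample t-test per source.  Welch's version does not assume
# equal variances and computes its own effective d.f. for each test, which
# handles the unequal per-country sample sizes.
pvals = {}
for s in sources:
    t_stat, p = stats.ttest_ind(db1[s], db2[s], equal_var=False)
    pvals[s] = p

for s in sources:
    print(f"{s}: p = {pvals[s]:.3g}")
```

With hundreds of countries the loop is the same; the open question is then how to combine the resulting collection of p-values.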

(Given the multiple-t-test stuff, perhaps ANOVA would be right, but it doesn't seem to quite gel... see the end.)

If these tests were all homogeneous (measuring the same quantity) I might use Bonferroni or the like, but that's not the case.
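
For reference, Bonferroni in this setting would just mean testing each per-source p-value at alpha / k; a tiny sketch (with made-up p-values) of what that looks like:

```python
import numpy as np

rng = np.random.default_rng(1)

# 299 made-up "null" p-values (uniform under H0) plus one genuinely tiny
# one, standing in for the corrupted French source.
pvals = np.append(rng.uniform(size=299), 1e-8)

alpha = 0.05
k = len(pvals)
# Bonferroni: reject an individual test only if its p-value is below
# alpha / k, which keeps the family-wise error rate at most alpha.
reject = pvals < alpha / k
print(int(reject.sum()), "of", k, "tests rejected")
```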

Furthermore, I simply don't like the idea of lowering the alphas on each t-test when, if the dbs were the same (i.e. my overall null hypothesis H0), I would expect a distribution of values where most of them would have high (safely H0) p-values (it seems like a great way to get false negatives). In fact, following that logic, my intuition suggests that the "right" thing to do is to take the set of t-statistics obtained from all the different (source) tests and compare that to a t-distribution; if it fits, then even if some p-values violate a typical per-test confidence limit I wouldn't care (e.g. if you run a million tests, a single p-value of 0.01 doesn't say anything by itself, but if most of them were 0.01...).

Anyway, the above is a non-starter because, as mentioned above, t-distributions are a function of sample size/degrees of freedom, and those are different for each test (country). So I can't form a single sampled t-distribution.
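
One possible way around the differing-d.f. obstacle (my suggestion, not something the question itself claims works): under each per-source null, mapping the t-statistic through its own t CDF and then through the standard-normal inverse CDF puts every test on a common N(0,1) scale, so the whole collection can be checked against a single reference distribution. A sketch of this under the null, with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulate the null case: each "source" yields a t-statistic with its own
# degrees of freedom, so the raw statistics share no single distribution.
dfs = rng.integers(5, 300, size=500)
t_stats = rng.standard_t(dfs)

# Probit transform: t -> its own t CDF -> standard-normal inverse CDF.
# Under the null, every transformed value is N(0, 1) regardless of d.f.
z = stats.norm.ppf(stats.t.cdf(t_stats, dfs))

# Now a single goodness-of-fit test against N(0, 1) is meaningful.
ks = stats.kstest(z, "norm")
print(f"KS p-value under the null: {ks.pvalue:.3f}")
```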

Another possibility that intuitively feels promising is to somehow work with the final p-values. After all, regardless of the underlying statistic (t vs. <put your stat here>), given a certain number of tests there should only be so many low p-values. Unfortunately, I don't yet see how to put that into practice (assuming it's even sensible!).
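
For what it's worth, combining a set of independent p-values into one overall p-value is a classic problem, and Fisher's method (under the overall null, -2 * sum(log p) is chi-squared with 2k d.f.) is one standard answer that is built into SciPy. A sketch with made-up p-values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 100 made-up "null" p-values plus one extremely small one, standing in
# for the single corrupted source among many clean ones.
pvals = np.append(rng.uniform(size=100), 1e-30)

# Fisher's method: under the overall null (every individual null true),
# -2 * sum(log p) follows a chi-squared distribution with 2k d.f.
fisher_stat, fisher_p = stats.combine_pvalues(pvals, method="fisher")
print(f"combined p-value: {fisher_p:.3g}")
```

One caveat: Fisher's statistic sums over all k tests, so a single small p-value gets diluted as k grows; whether it catches the "only France is off" scenario depends on how small that one p-value is relative to the number of tests.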

====== ANOVA? ===========

1) ANOVA wants individual measurements, whereas I only have the summary statistics that suffice for t-tests.

2) I don't think Model-1 ANOVA fits the bill anyway because, I think, my conditions would be the two databases but the rows would be the measurements without respect to source, so what's left is a simple single t-test that may not find a discrepancy in a couple of sources (e.g. if two small sources/countries messed up their data entry in the 2nd db, their contributions to increased variance could be noise in the context of all the other sources' contributions).

3) Model-2 ANOVA: pretty much the same as the above.

4) Two-way ANOVA: the problem I have here is that this seems to suggest the rows of the table are the sources (e.g. countries) and the two columns are the two dbs, with each source having its own possible bias (to be factored out) in the measurements. But I don't think that corresponds to the situation, because a particular source could be fine for one db and bad for the other: someone from the source happened to mess up data entry for only one of the dbs. My understanding of ANOVA is that an issue with a source would be expected to bias both dbs' measurements in the same way (it's the source that's the factor, not someone who messed it up on one occasion).

Anyway, I've totally skirted the fact that I would need one row per source, which means I would need to form an aggregate statistic (e.g. mean) for each cell (i.e. each source/db element in the table). I'm not sure that's OK, but I think it is. After all, that would obliterate the small weighting (i.e. influence on the stats) of small sources (countries), but that's something I really want to know (and it is consistent with the multiple-t-test concept above).