probability of catching duplicates when consolidating databases

#1
Hi everyone

My statistics skills have gotten a bit rusty, so please forgive me if I do not formulate everything properly.

I am currently facing the following problem: I am working on a central database for 10 shops. Each shop has 200 client records.

These 2000 client records (200 records * 10 shops) belong to 1000 unique clients, implying an average of 2 records per client. I know that every shop holds only unique client records, so there is no duplication within the 200 records of any single shop.

We will now onboard the shops one after another to the central database, and I am trying to show how the probability of "drawing" a duplicate changes over time.

My first attempt looks as follows, but I'm not sure whether the math is correct:

1st shop onboarding:

Probability of a record being new to the database: 100%
Probability of a record being a duplicate: 0%

--> unique records on average: 200
--> duplicate records on average: 0

2nd shop onboarding:

Probability of a record being new to the database: (1000-200)/1000 = 80%
Probability of a record being a duplicate: 1 - 80% = 20%

--> unique records on average: 200 + 160 = 360
--> duplicate records on average: 0 + 40 = 40

Variance = (20% * 80%)/1000 = 0.00016

3rd shop onboarding:

Probability of a record being new to the database: (1000-360)/1000 = 64%
Probability of a record being a duplicate: 1 - 64% = 36%

--> unique records on average: 200 + 160 + 128 = 488
--> duplicate records on average: 0 + 40 + 72 = 112

Variance = (36% * 64%)/1000 = 0.00023

and so on...
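To sanity-check the arithmetic, here is a minimal Python sketch of the recursion the steps above imply, under the simplifying assumption that each shop's 200 records are a uniform random draw from the 1000 unique clients (the function and variable names are just illustrative):

```python
def expected_onboarding(n_clients=1000, per_shop=200, n_shops=10):
    """Expected new vs. duplicate records per onboarding step, assuming
    each shop's records are a uniform random sample of all clients."""
    unique = 0.0  # expected unique clients in the central database
    dups = 0.0    # expected duplicate records accumulated so far
    for shop in range(1, n_shops + 1):
        p_new = (n_clients - unique) / n_clients  # chance a record is new
        new = per_shop * p_new
        unique += new
        dups += per_shop - new
        print(f"shop {shop}: p_new = {p_new:.2f}, "
              f"unique = {unique:.0f}, duplicates = {dups:.0f}")

expected_onboarding()
```

Under that assumption the expected unique count after k shops has the closed form 1000 * (1 - 0.8^k), which reproduces the 200 / 360 / 488 figures above. Note that treating each record as an independent uniform draw is a simplification of the "exactly 1000 unique clients, 2 records each on average" setup.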

I am sure this isn't exactly a tricky problem. However, I would highly appreciate it if someone could have a quick look and let me know whether I'm on the right track or if there are any major mistakes in my calculations.

Many thanks in advance.

Best

Handrix

obh
#2
Hi Handrix,

What is a duplicate: the same customer in more than one shop, or duplicate database rows?

If you want to check every shop separately, why do you add data from 2 shops?

Do you want the variance of the "probability of a duplicate", pq?
Or the variance of the "number of duplicates", npq?
Anyway, I assume it is easier to use fractions like 0.64 rather than percentages like 64%.
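For concreteness, here is a quick numeric sketch of that distinction, plugging in the shop 2 figures from post #1 (p = 0.2, n = 200 records per shop; these numbers are carried over as assumptions, not a fixed answer):

```python
p, q, n = 0.2, 0.8, 200      # shop 2: duplicate probability, complement, records

var_indicator = p * q        # variance of a single record being a duplicate: 0.16
var_count = n * p * q        # variance of the number of duplicates:          32.0
var_proportion = p * q / n   # variance of the duplicate proportion:          0.0008

print(var_indicator, var_count, var_proportion)
```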
 
#3
Hi OBH

Many thanks for your questions, to which I reply as follows:

What is a duplicate: the same customer in more than one shop, or duplicate database rows?
A duplicate occurs whenever a customer exists in more than one shop. Today each shop has an individual database. We want to consolidate those databases and get rid of the duplication. To do so, we will integrate one shop after another, and I am trying to understand how the number of duplicates will develop over time.

If you want to check every shop separately, why do you add data from 2 shops?
I am adding data from 2 shops because we will add each shop's individual database to our central database one after another. So, the more shops we add to the central database, the higher the probability that the data we add from the next shop already exists in it.

Do you want the variance of the "probability of a duplicate", pq?
Or the variance of the "number of duplicates", npq?
I want the variance of the number of duplicates.

Anyway, I assume it is easier to use fractions like 0.64 rather than percentages like 64%.
Thanks, will do so.

obh

#4
Hi Handrix,

I would just calculate p = total duplicates / total customers.

What else would you want to achieve?

PS: Of course, if you start from store 1, there are no duplicates; then, going to store 2, there are duplicates.
But if you do the opposite, there won't be duplicates when starting with store 2; then you will find duplicates when moving to store 1.
So the order matters.

Do you want to check the proportion of duplicates per number of stores?
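One rough way to check that proportion empirically is a Monte Carlo sketch. Here I assume each client appears in exactly two randomly chosen shops, which is one way to realize the "average of 2 records per client, no within-shop duplicates" setup (shop sizes then come out near 200 rather than exactly 200); the function name and the exactly-two-shops assumption are only illustrative:

```python
import random

def duplicate_proportions(n_clients=1000, n_shops=10, trials=500):
    """Average cumulative proportion of duplicate records after each
    store is onboarded, over randomly ordered onboardings."""
    totals = [0.0] * n_shops
    for _ in range(trials):
        # each client gets a record in exactly two distinct random shops
        shops = [set() for _ in range(n_shops)]
        for client in range(n_clients):
            a, b = random.sample(range(n_shops), 2)
            shops[a].add(client)
            shops[b].add(client)
        seen, records, dups = set(), 0, 0
        # onboard the stores in a random order, tracking duplicates
        for step, s in enumerate(random.sample(range(n_shops), n_shops)):
            for client in shops[s]:
                records += 1
                if client in seen:
                    dups += 1
                else:
                    seen.add(client)
            totals[step] += dups / records
        # after all stores: 2000 records, 1000 duplicates -> proportion 0.5
    return [t / trials for t in totals]

for k, prop in enumerate(duplicate_proportions(), start=1):
    print(f"after {k} stores: duplicate proportion ~ {prop:.3f}")
```

By symmetry the random ordering averages out here, but for any one concrete order the per-step counts will differ, which is exactly the point about order above.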