# Determining which data set is most like another

#### EMStelley

##### New Member
I am new to this forum, so let me apologize in advance if this is the wrong location. Given a specific multivariate data set and numerous other data sets, I want to find the one most "like" the original. For example (simplified for the sake of space), which set, B or C, is most like Set A? I am having trouble finding any information on how to do this, though I feel it should be straightforward. I have run a chi-squared test on (A, B) and on (A, C) and have the p-values from both, but I'm unclear on how to compare the two, or whether they can be compared at all. Any help would be greatly appreciated!

Set A
50 hot dogs
30 hamburgers
20 veggies

Set B
55 hot dogs
25 hamburgers
20 veggies

Set C
110 hot dogs
50 hamburgers
40 veggies
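
For reference, the two chi-squared comparisons mentioned above can be reproduced with a short sketch. It assumes a Pearson chi-squared test of homogeneity on each 2 x 3 table of counts, which may not be exactly the test that was run; the function name `chi2_homogeneity` is made up for illustration. With three categories there are 2 degrees of freedom, and for that case the p-value has the closed form exp(-statistic / 2), so only the standard library is needed.

```python
import math

def chi2_homogeneity(counts_x, counts_y):
    """Pearson chi-squared statistic and p-value for a 2 x 3 table of counts.

    Hypothetical helper: the closed-form p-value used here is only valid
    for 2 degrees of freedom, i.e. exactly three categories.
    """
    if len(counts_x) != 3 or len(counts_y) != 3:
        raise ValueError("this sketch only handles three categories (df = 2)")
    total_x, total_y = sum(counts_x), sum(counts_y)
    grand_total = total_x + total_y
    stat = 0.0
    for x, y in zip(counts_x, counts_y):
        column_total = x + y
        for observed, row_total in ((x, total_x), (y, total_y)):
            # Expected count if both sets shared the same proportions.
            expected = row_total * column_total / grand_total
            stat += (observed - expected) ** 2 / expected
    # Survival function of a chi-squared distribution with df = 2.
    p_value = math.exp(-stat / 2)
    return stat, p_value

set_a = [50, 30, 20]
set_b = [55, 25, 20]
set_c = [110, 50, 40]

stat_ab, p_ab = chi2_homogeneity(set_a, set_b)
stat_ac, p_ac = chi2_homogeneity(set_a, set_c)
print(f"A vs B: statistic={stat_ab:.4f}, p={p_ab:.4f}")
print(f"A vs C: statistic={stat_ac:.4f}, p={p_ac:.4f}")
```

Note that B and C have identical proportions (55%, 25%, 20%), so any difference between the two p-values here comes from the sample sizes alone. A p-value measures evidence against "same proportions"; it is not a similarity score, which is part of why the two are hard to compare directly.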

#### Karabiner

##### TS Contributor
What about calculating squared Euclidean distances?
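
A minimal sketch of that idea, applied to the raw counts from the example (the helper name `squared_euclidean` is just for illustration):

```python
def squared_euclidean(x, y):
    """Sum of squared differences between two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y))

set_a = [50, 30, 20]    # hot dogs, hamburgers, veggies
set_b = [55, 25, 20]
set_c = [110, 50, 40]

dist_ab = squared_euclidean(set_a, set_b)  # 25 + 25 + 0 = 50
dist_ac = squared_euclidean(set_a, set_c)  # 3600 + 400 + 400 = 4400
print(dist_ab, dist_ac)
```

On raw counts, B comes out far closer to A than C does, mostly because C is twice the size.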

Kind regards

K.

#### EMStelley

##### New Member
Of course I wasn't thinking simply enough! Thank you! But squared Euclidean distances don't take into account the different sizes of the data sets.

#### EMStelley

##### New Member
I'm still struggling with this problem because of the differing total sizes of the data sets. Is there no way to say something like "A is an 80% match, B is a 77% match," and so on? I really feel like I need to include a statistical comparison.

#### Dason

To deal with the sample size issue you could just normalize all of your data so you're looking at the proportion of the total instead of the raw counts.

So instead of having A = {50, 30, 20}, convert it to {50/100, 30/100, 20/100} = {0.5, 0.3, 0.2}, so that all the elements sum to 1.
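
As a rough sketch of that normalization, combined with the squared Euclidean distance suggested earlier in the thread (the helper names `to_proportions` and `squared_euclidean` are illustrative, not from any library):

```python
def to_proportions(counts):
    """Divide each count by the set's total so the values sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an empty data set")
    return [c / total for c in counts]

def squared_euclidean(x, y):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

set_a = [50, 30, 20]
set_b = [55, 25, 20]
set_c = [110, 50, 40]

prop_a = to_proportions(set_a)
prop_b = to_proportions(set_b)
prop_c = to_proportions(set_c)

dist_ab = squared_euclidean(prop_a, prop_b)
dist_ac = squared_euclidean(prop_a, prop_c)
print(prop_a, prop_b, prop_c)
print(dist_ab, dist_ac)
```

After normalizing, B and C have the same proportions, so they end up at the same distance from A (about 0.005). The size difference between them is removed entirely.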

#### EMStelley

##### New Member
I started out using percentages, but then you lose the fact that one data set is much larger than the others, which in this case matters.