# Analyze the similarity among groups

#### callisto

##### New Member
Recently, I have needed to analyze the similarity among groups. Here is my problem.

1. There are 20 independent groups. (A1, A2, ..., A20).
2. Each group has 20000 samples. (s1, s2, ..., s20000).
3. Each sample has 7 features. (x1, x2, ..., x7).

If two samples from different groups have similar features, i.e. |x1-x1'|<0.01, ..., |x7-x7'|<0.01, we consider these two groups to have similar behavior at that point.

After studying some basic statistics textbooks, I know that if each sample has only one feature, I can use a hypothesis test to compare two groups. But I don't know how to run such a test on samples that have 7 features. The only idea I can come up with is to compare each pair of groups. For example, if 4000 of the samples in A1 are similar to samples in A2, then I would say the similarity between A1 and A2 is 4000/20000 = 0.2.
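
To make the pairwise idea concrete, here is a minimal sketch of how that fraction could be computed with NumPy. It assumes each group is stored as an array of shape (n_samples, 7); the function name `match_fraction` and the `chunk` parameter are hypothetical, chosen only for this illustration, and the loop is plain brute force rather than anything optimized.

```python
import numpy as np

def match_fraction(a, b, tol=0.01, chunk=100):
    """Fraction of rows in `a` that have at least one row in `b`
    whose features all differ by less than `tol`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    matched = 0
    # Process `a` in chunks so the (chunk, len(b), n_features) array stays small.
    for start in range(0, len(a), chunk):
        block = a[start:start + chunk]
        diff = np.abs(block[:, None, :] - b[None, :, :])
        close = (diff < tol).all(axis=2)   # True where every feature is within tol
        matched += int(close.any(axis=1).sum())
    return matched / len(a)

# Small synthetic example (made-up data, not real measurements).
rng = np.random.default_rng(0)
A1 = rng.uniform(0.0, 0.1, size=(500, 7))
A2 = rng.uniform(0.0, 0.1, size=(500, 7))
print(match_fraction(A1, A2), match_fraction(A2, A1))
```

Note that this measure is not symmetric in general: the fraction of A1 samples with a match in A2 need not equal the fraction of A2 samples with a match in A1. For the full 20000 x 20000 comparison over all pairs of groups, a spatial index such as `scipy.spatial.cKDTree` would probably be faster than this loop.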

I know some people in my field use the mean values of the 20000 samples, so each group is reduced to only 7 mean values. Then they apply PCA and a K-NN clustering algorithm. If two groups fall in the same cluster, they consider these two groups similar. But this gives a very rough result. I think similarity can be defined more precisely in this problem. I need some suggestions on how to define and compute the similarity.
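
For reference, the mean-based approach can be sketched roughly as follows. This is only my reading of the first two steps (group means, then PCA), done with a plain SVD in NumPy; the function name `pca_of_group_means` is hypothetical, the data are synthetic, and the clustering step is left out.

```python
import numpy as np

def pca_of_group_means(groups, n_components=2):
    """Reduce each group to its feature means, then project the
    means onto their first principal components."""
    # One row of 7 mean values per group: shape (n_groups, n_features).
    means = np.stack([np.asarray(g, dtype=float).mean(axis=0) for g in groups])
    centered = means - means.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-in for the 20 groups (smaller sample count for speed).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=rng.uniform(-1, 1, 7), scale=1.0, size=(200, 7))
          for _ in range(20)]
coords = pca_of_group_means(groups)
print(coords.shape)
```

Because each group is collapsed to a single point, two groups with the same means but very different spreads would land on the same coordinates, which is one reason the result is so rough.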