# Testing for similarity vs dissimilarity

#### Qroid_mtl

##### New Member
Hi all,

I'm a grad student who uses stats a lot but who has very little formal training. Not sure if this is in the right forum, so feel free to move it if that sort of thing is done here.

Here's a question that comes up a lot and I've never had answered very well. Basically, I often have two distributions and want to assess whether they're similar. Say they're normally distributed and a suitable test for dissimilarity would be a t-test.

Now, is it acceptable to simply perform a t-test and then use a p-value > .05 as a criterion for similarity? This is usually my first reaction. But in practice this seems way too liberal, since p > .05 corresponds to 95% of the distribution, whereas p < .05 corresponds to 5%. Thus testing for p > .05 tends to admit way too much 'noise', and my numbers are usually hopelessly skewed upwards.

If what I want is to select the distribution pairs (normally distributed) that are similar, what's an appropriate test?

#### Qroid_mtl

##### New Member
I sometimes calculate the area under the ROC curve (AUC) as a similarity measure. But I find it difficult to go from an AUC value to the statement "these two distributions are similar", since the threshold value seems to be arbitrary/depend on the data.

#### hlsmith

##### Not a robit
Are you familiar with the Kolmogorov-Smirnov goodness-of-fit test (see below)?

http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test

AUC provides model discrimination, so I am not sure what you are getting at with your second post. Trying to compare categorical (binary) data? If so, you may want to look at chi-square goodness of fit.

#### Qroid_mtl

##### New Member
I wouldn't say I was familiar with it. Thanks very much.

Just so I'm sure I understand: performing a Kolmogorov-Smirnov test will give me a p-value. If p > .05, the two distributions are similar. This leaves me with my initial problem, although maybe that wasn't a problem at all and I was mistaken. So, is it kosher to test for similarity using the criterion p > .05 (compared to dissimilarity with p < .05)? Also, if the data is normal, why not just use a t-test and compare that p-value?

#### Dason

1) A K-S test can only tell you if you have evidence that the samples don't come from the same distribution. If you fail to reject (p > .05) that doesn't mean that the distributions the samples came from are the same - just that you don't have enough evidence to say that they are different.

2) If the data is normal then a t-test will tell you if there is a difference in the means of the distributions. The standard deviations of the two distributions could be wildly different and as long as the means are the same a t-test doesn't care.
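
To make point 2 concrete, here is a minimal sketch using only the Python standard library. The numbers are made up for illustration (they are not anyone's real data), and `welch_t` is just a helper name chosen for this example. Both samples have a mean of 10, so the t statistic is exactly 0 even though one sample is far more spread out than the other.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    # Standard error of the difference, without assuming equal variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Made-up samples: identical means, very different spread.
y1 = [8, 9, 10, 11, 12]
y2 = [0, 5, 10, 15, 20]

t_stat = welch_t(y1, y2)
var_ratio = variance(y2) / variance(y1)

print(f"t = {t_stat:.3f}, variance ratio = {var_ratio:.1f}")
```

A t statistic of 0 gives no evidence of a difference in means, yet the sample variances differ by a large factor. If the spread matters to you, it has to be compared separately.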

#### Qroid_mtl

##### New Member
Hey guys - first of all, thank you so much. This is very helpful. It's really difficult to get this information other places, so don't mind me if I ask a few more questions.

As opposed to testing for a difference, what's the correct way to test for similarity between two normal distributions? I'm a little confused. Is KS p>.05 a good test in this situation? If not, what is a good way to test this conclusion statistically? I've been told that qualitatively it's not correct to argue from p>.05.

#### Dason

Well you would need to define what you mean by similar. But in general testing "similarity" is a lot harder than testing if things are dissimilar.
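
As one illustration of what "defining similar" could look like: an equivalence test (two one-sided tests, often called TOST) asks whether the difference in means is smaller than a margin that you pick in advance. The sketch below is not the only way to do this and rests on assumptions: the margin is a hypothetical value you would have to justify on scientific grounds, the data are made up, and the p-values use a normal approximation where a t distribution would be more accurate for small samples. `tost_means` is a name invented for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_means(a, b, margin, alpha=0.05):
    """Two one-sided tests for |mean(a) - mean(b)| < margin.

    Assumption: normal approximation to the test statistics.
    Returns (p_value, equivalent).
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist()
    # Null 1: diff <= -margin. Small p means diff is clearly above -margin.
    p_lower = 1 - z.cdf((diff + margin) / se)
    # Null 2: diff >= +margin. Small p means diff is clearly below +margin.
    p_upper = z.cdf((diff - margin) / se)
    # Equivalence is claimed only if BOTH nulls are rejected.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Made-up samples of 40 values each; the means differ by exactly 1.
y1 = [18, 19, 20, 21, 22] * 8
y2 = [19, 20, 21, 22, 23] * 8

p_wide, ok_wide = tost_means(y1, y2, margin=3)      # generous margin
p_tight, ok_tight = tost_means(y1, y2, margin=1)    # margin equal to the observed gap

print(ok_wide, ok_tight)
```

Here a small p-value is evidence *for* similarity (within the margin), which is the reverse of the usual setup. The answer depends entirely on the margin, which is why "similar" has to be defined first.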

#### Qroid_mtl

##### New Member
Okay, briefly, I have a random variable that's the number of action potentials a neuron fires in one of two conditions. These conditions are each repeated 30-40 times to give distributions y1 and y2. It's meaningful if y1 and y2 are "the same", and for this I think it's sufficient to say their means and variances are similar. y1 and y2 are usually normal.

I'm not sure how I'd do it in this case, but can tests for similarity be reformulated as tests for dissimilarity? Is that one way around it?

Thanks so much.

#### hlsmith

##### Not a robit
Do all the observations in y1 and y2 come from the same individual?

#### Qroid_mtl

##### New Member
Yes, the same individual, and all were recorded within the same experiment.

#### hlsmith

##### Not a robit
I think in your case you should overlay the two distributions on a graph, run the Kolmogorov-Smirnov test, and, if the data are normally distributed, perform a t-test. This should suffice. You can also wait and see if anyone else proposes other options.
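
For the Kolmogorov-Smirnov step, a rough standard-library sketch is below. It computes the two-sample statistic D (the largest gap between the two empirical CDFs) and an approximate p-value from the asymptotic Kolmogorov series. The function names and the data are made up for illustration, and the p-value is only an approximation: it is least reliable for small samples and for discrete data with ties, which spike counts will have. A statistics package would normally be used instead.

```python
from math import exp, sqrt


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    points = sorted(set(a) | set(b))
    return max(
        abs(sum(x <= p for x in a) / len(a) - sum(x <= p for x in b) / len(b))
        for p in points
    )


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    lam = sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series converges poorly near zero; the p-value is ~1 here.
        return 1.0
    total = 2 * sum(
        (-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(max(total, 0.0), 1.0)


# Made-up spike counts, 40 trials per condition.
y1 = [18, 19, 20, 21, 22] * 8
y2 = [19, 20, 21, 22, 23] * 8

d = ks_statistic(y1, y2)
p = ks_pvalue_asymptotic(d, len(y1), len(y2))
print(f"D = {d:.3f}, approximate p = {p:.3f}")
```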