# Beginner question: similarity between samples

#### Zmicier

##### New Member
Hello forum,
First, I want to apologize for asking what may be a simple, straightforward question; I don't have much background in statistics.
Let's say I have several hundred records (islands) that are described by certain variables (compactness of shape, elongation, fractal dimension of the coastline, roundness, etc.).
I then ran a PCA in R, and now I have just three variables (because the first three principal components explained over 90% of the total variance).
Now I want to measure how similar my islands are to each other.
How should I do this? The most obvious solution that came to my mind is to measure the Euclidean distance between pairs of islands $i$ and $j$:
$d_{ij} = \sqrt{(\mathrm{PCA1}_i - \mathrm{PCA1}_j)^2 + (\mathrm{PCA2}_i - \mathrm{PCA2}_j)^2 + (\mathrm{PCA3}_i - \mathrm{PCA3}_j)^2}$
Are there perhaps better ways to measure this similarity? I will be grateful for your advice.
best regards,
zmicier
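
For readers who want to see the distance from the post above spelled out, here is a minimal sketch. It is written in Python rather than R purely for illustration, and the score values are made up; `pairwise_pc_distances` is a hypothetical helper name, not something from the original post.

```python
import math
from itertools import combinations


def pairwise_pc_distances(scores):
    """Return {(i, j): Euclidean distance} for every pair of rows in `scores`.

    `scores` is a list of (PCA1, PCA2, PCA3) tuples, one tuple per island.
    """
    return {
        (i, j): math.dist(scores[i], scores[j])
        for i, j in combinations(range(len(scores)), 2)
    }


# Made-up scores for three islands, chosen so the distances are easy to check.
scores = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
distances = pairwise_pc_distances(scores)

for pair, d in distances.items():
    print(pair, d)
```

Note that raw PCA scores usually have different variances per component, so an unweighted distance like this one gives the first component the most influence. Whether that is desirable depends on the application.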

#### Zmicier

##### New Member
Maybe I should also add what I expect from the measure. I want a coefficient between 0 and 1 if the records are very similar to each other, and a coefficient greater than 1 if the records are very different.
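
One possible way to get a coefficient with that behaviour, offered only as a sketch: divide every pairwise distance by a typical distance, here the median. Pairs closer than typical then fall below 1 and pairs further apart than typical fall above 1. The choice of the median is an assumption made for this illustration, not something stated in the thread, and `relative_dissimilarity` is a hypothetical name.

```python
from statistics import median


def relative_dissimilarity(distances):
    """Scale pairwise distances by their median.

    `distances` maps an (i, j) pair to a non-negative distance.
    Values below 1 mean "closer than a typical pair", above 1 "further apart".
    """
    scale = median(distances.values())
    if scale == 0:
        raise ValueError("median distance is zero; cannot rescale")
    return {pair: d / scale for pair, d in distances.items()}


# Made-up distances between three islands.
distances = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 4.0}
relative = relative_dissimilarity(distances)
print(relative)
```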

#### Miner

##### TS Contributor
Cluster analysis will provide a number of different measures of similarity, including Euclidean distance.

#### Zmicier

##### New Member
hello, thanks a lot for the reply.

I am afraid that in my case I need to think about some other method for this similarity measure.

Maybe I need to describe in more detail what I am doing. I am writing a semester project at uni. The topic is the identification of archipelagos (groups of islands). When grouping islands, I want to take into account not just the spatial proximity between islands but also some characteristics of their shape (area, elongation, fractal dimension of the coastline, compactness, concavity, ratio between the minor and major axes of the ellipse hull, etc.).
1. First, I measured all these parameters.
2. Second, I computed a PCA based on these parameters.
3. Third, I calculated the similarity between islands (the Euclidean distance between islands on the PCA plot).
4. Fourth, I multiplied the real distance between islands by that similarity coefficient (the "PCA distance") and built a distance matrix from the result. The idea is that more similar islands (those with a smaller "PCA distance") should end up closer together, and those with more distinct PCA scores further apart.
5. Fifth, I ran hierarchical agglomerative clustering with the average-link method. This gives me archipelagos (groups of islands that are not too far from each other and are similar in shape).

The results are not bad; they are similar to what I expected.

I just have two problems. First, I am not sure whether there is a better method to calculate similarities between islands than the distance between their PCA scores.
Second, I don't really know where to cut the hierarchical tree automatically. Of course, I can look at the dendrogram manually and guess by intuition at which height it would make sense to cut it.
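
The third to fifth steps described in this post can be sketched roughly as follows. This is a naive pure-Python illustration with made-up numbers, not the poster's actual R code; `combined_distance` and `average_linkage` are hypothetical helper names, and the shape coefficients are assumed to be already scaled so that similar pairs are below 1.

```python
def combined_distance(geo, shape):
    """Element-wise product of two square distance matrices (lists of lists)."""
    n = len(geo)
    return [[geo[i][j] * shape[i][j] for j in range(n)] for i in range(n)]


def average_linkage(dist):
    """Naive agglomerative clustering with average linkage.

    Returns the merges in order as (members_a, members_b, height) tuples,
    where height is the mean distance between the two merged groups.
    """
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    merges = []

    def mean_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ids = list(clusters)
        # Find the pair of clusters with the smallest average distance.
        height, p, q = min(
            (mean_distance(clusters[p], clusters[q]), p, q)
            for k, p in enumerate(ids)
            for q in ids[k + 1:]
        )
        a, b = clusters.pop(p), clusters.pop(q)
        merges.append((a, b, height))
        clusters[next_id] = a + b
        next_id += 1
    return merges


# Made-up example: islands 0 and 1 are near each other and alike in shape,
# and so are islands 2 and 3.
geo = [[0, 1, 10, 11], [1, 0, 10, 11], [10, 10, 0, 2], [11, 11, 2, 0]]
shape = [[0, 0.5, 2, 2], [0.5, 0, 2, 2], [2, 2, 0, 0.5], [2, 2, 0.5, 0]]

merges = average_linkage(combined_distance(geo, shape))
for a, b, height in merges:
    print(a, "+", b, "at height", height)
```

One property of the multiplicative combination worth keeping in mind: two islands with identical shape scores get a combined distance of zero no matter how far apart they are geographically.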

#### gianmarco

##### TS Contributor
> I just have two problems. First, I am not sure whether there is a better method to calculate similarities between islands than the distance between their PCA scores.
> Second, I don't really know where to cut the hierarchical tree automatically. Of course, I can look at the dendrogram manually and guess by intuition at which height it would make sense to cut it.

Have you tried to use the package 'FactoMineR' to perform PCA and Hierarchical Clustering in R?

If you haven't, I suggest using it for a number of reasons, among which:
- there is good documentation in the form of both journal articles and books on its use (link, link, link)
- there are a number of YouTube video tutorials made by Prof. Husson on the use of the package; the videos are available on his YouTube channel
- there is an active User Group
- the package can automatically cut the tree, suggesting an optimal partition

Hope this helps
Cheers
Gm
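
To give a feel for what cutting a tree automatically can mean, here is a simple heuristic sketched in Python: cut in the middle of the largest gap between successive merge heights. This is a generic rule of thumb chosen for illustration; it is not the criterion FactoMineR uses, and `cut_at_largest_gap` is a hypothetical name. The heights below are made up.

```python
def cut_at_largest_gap(heights):
    """Pick a cut height in the middle of the largest jump between merges.

    `heights` holds the merge heights of a hierarchical clustering
    (n - 1 values for n observations). Returns (cut_height, n_clusters).
    """
    heights = sorted(heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    gaps = [upper - lower for lower, upper in zip(heights, heights[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    cut_height = (heights[k] + heights[k + 1]) / 2
    # Merges 0..k lie below the cut, so k + 1 merges are kept out of
    # len(heights) + 1 observations.
    n_clusters = len(heights) - k
    return cut_height, n_clusters


# Made-up merge heights for four observations.
cut_height, n_clusters = cut_at_largest_gap([0.5, 1.0, 21.0])
print(cut_height, n_clusters)
```

With these heights the largest jump is between 1.0 and 21.0, so the cut falls at 11.0 and leaves two groups.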

#### Zmicier

##### New Member
Thank you for the suggestion, Gianmarco. I will definitely take a look at the documentation. I have already used this package to perform the PCA (I needed to rescale the variances of some variables, and this package had that functionality).
So I will take a look at its tree-cutting ability.

#### cocoonnelly

##### New Member
I am writing a semestral work at uni.
