Determination of the optimal number of clusters (stopping rule) with a similarity measure (Jaccard coefficient)

#1
Hello,

unfortunately I have not been able to find an answer to my question, nor a solution here in the forum. My problem:

I am performing a cluster analysis of binary-scaled data (yes/no). For this I use the Jaccard coefficient, which measures the similarity between objects, together with the average linkage algorithm.
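For context, my setup looks roughly like the following minimal sketch (I use R here only for illustration; x stands for my binary 0/1 data matrix with objects in rows, and the names are just placeholders):

```r
# Minimal sketch of the setup: Jaccard similarity with average linkage
# (x is assumed to be a binary 0/1 matrix, objects in rows, variables in columns)
d  <- dist(x, method = "binary")     # "binary" = 1 - Jaccard similarity coefficient
hc <- hclust(d, method = "average")  # average linkage (UPGMA)
plot(hc)                             # dendrogram; my question is where to cut it
```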

How can I determine the optimal number of clusters? Unfortunately, I have not yet found a suitable stopping rule. The Calinski/Harabasz index, for example, is only suitable for metric data. The Mojena criterion was also recommended to me, but because I use a similarity measure the fusion level decreases continuously, so the cut-off value to be determined according to Mojena decreases continuously as well.

I would be grateful for any hint on how to determine an optimal number of clusters computationally!

Best regards from Aachen (Germany)!
 

gianmarco

TS Contributor
#2
Hello,
It's been a while since I last dealt with cluster analysis, so please take my words with a grain of salt.

As far as I recall, there is a large literature on how one should decide where to "cut the tree" and, consequently, how many clusters can be read off the dendrogram.

What I have found interesting is the use of the silhouette plot, which (given a partition into a certain number of clusters) measures how "nicely" the observations fall into each group. Obviously, this entails deciding beforehand how many groups one wants. BUT, what I have done in the past (and implemented in some R functions in one of my packages) was to calculate the average silhouette value (a metric that tells you how well the observations fall into their groups) for partitions ranging from 2 up to the maximum number of clusters supported by the data. The "optimal" partition would then be the one that produces the largest average silhouette value.
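Just to make that idea concrete, here is a minimal sketch of the procedure in plain R with the cluster package (not the code from my own package; I am assuming a binary data matrix x and the Jaccard / average-linkage setup you described, and the range 2:10 is only an example):

```r
library(cluster)  # for silhouette()

d  <- dist(x, method = "binary")      # Jaccard distance on the binary data
hc <- hclust(d, method = "average")   # average linkage

# average silhouette width for partitions with 2 .. 10 clusters
ks      <- 2:10
avg_sil <- sapply(ks, function(k) {
  cl <- cutree(hc, k = k)                  # cut the tree into k clusters
  mean(silhouette(cl, d)[, "sil_width"])   # average silhouette width of that partition
})

best_k <- ks[which.max(avg_sil)]      # partition with the largest average silhouette
plot(ks, avg_sil, type = "b",
     xlab = "number of clusters", ylab = "average silhouette width")
```

I would still look at the full silhouette plot for the chosen k (plot(silhouette(cutree(hc, best_k), d))) rather than relying on the average alone.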

In the past, I benefited from reading the following:
Rousseeuw P J. 1987. "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics 20, 53-65 (http://www.sciencedirect.com/science/article/pii/0377042787901257)

There are some YouTube videos about the silhouette plot:
THIS one, for example (in the context of a free stats program)

Hope this helps
Gm
 

Miner

TS Contributor
#3
I use the method described in this post. You always need to make sure that your number of clusters makes practical sense. You may find value in clustering at more than one level. For example, if you were to perform a cluster analysis using data on vehicles, you might cluster at a high level and define cars, SUVs, and trucks as your cluster names. Cluster at a deeper level and cars might break out into clusters of hybrid, compact, and midsize.
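If you want to look at the tree at more than one level, you can cut the same dendrogram at several depths; here is a rough sketch (sticking with the R objects from the earlier posts, and with 3 and 8 clusters chosen purely for illustration):

```r
# cut the same average-linkage tree at a coarse and a finer level
# (hc is the hclust object from the Jaccard / average-linkage example above;
#  the values 3 and 8 are only illustrative)
cuts <- cutree(hc, k = c(3, 8))                  # matrix with one column per requested k
table(coarse = cuts[, "3"], fine = cuts[, "8"])  # how the finer clusters nest inside the coarse ones
```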