K-means clustering

#4
Instead of an algorithm, I prefer this:

Try to name the clusters for each reasonable level of k. If your colleagues say "Yeah! That's right!" then you have a good clustering. If they scratch their heads and say "well... I dunno...." then you still have work to do.
 

TheEcologist

Global Moderator
#5
... the level of k was chosen so as to maximize satisfaction among my colleagues ...
 

Dason

Ambassador to the humans
#6
"After considerable consideration we all agreed that k can be whatever I choose as long as I buy the next round"
 

bryangoodrich

Probably A Mammal
#9
Choosing the number of clusters in an unsupervised learning (clustering) model has no general approach. The reason may be that your choice of k depends on data beyond the features (variables, columns) used as input to the model. For instance, if you want to do well at capturing some structure of a feature Y not used in the model, then you might cluster your features X into k clusters. Upon completion, you then review how Y breaks down across the k clusters found for X. Using something like cross-validation (as discussed in the link Lazar mentioned), you can find the k that best relates the clustering of X to Y. This, however, is not unlike TE's choice of "the level of k ... so as to maximize satisfaction among my colleagues." Since the method is unsupervised, it will only do what it was designed to do: find linear separations in the feature space defined by X such that the within-cluster deviation from each center is minimized. That objective may not meet the objective of the application. Thus, the choice of k, or of any learning model, should be determined by the business objective for using the model in the first place.
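
A minimal in-sample sketch of that idea in R (the data, the range of k, and the use of eta-squared as the "how well do the clusters explain Y" measure are just my illustrative assumptions; the cross-validation mentioned above would wrap this in train/test splits):

```r
# Cluster X for several k, then see how much variance of an external Y each clustering explains.
set.seed(42)
X <- scale(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])  # clustering features
Y <- iris$Petal.Width                                                 # external variable, not in X

eta_sq <- sapply(2:8, function(k) {
  cl <- factor(kmeans(X, centers = k, nstart = 25)$cluster)
  ss <- summary(aov(Y ~ cl))[[1]][["Sum Sq"]]
  ss[1] / sum(ss)   # share of Y's variance accounted for by the k clusters
})
names(eta_sq) <- 2:8
round(eta_sq, 3)    # look for the k past which this stops improving meaningfully
```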

There are also many other methods, with different parameter options, to consider. Some of those methods do have ways of picking out the number of clusters, but they also aim at different objectives than k-means does. I'd recommend checking out Coursera for their machine learning or clustering courses. At the very least, they're informative.

For doing clustering, I've found the flexclust package to be particularly well designed and very flexible. Something more advanced might be kernel methods, for which kernlab has a function for kernel k-means.
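
For what it's worth, the basic calls look something like this (a hedged sketch assuming both packages are installed; the iris data and k = 3 are arbitrary choices of mine):

```r
library(flexclust)   # kcca(): k-centroids clustering with pluggable families
library(kernlab)     # kkmeans(): kernel k-means

X <- scale(iris[, 1:4])

# flexclust: ordinary k-means expressed through the kcca interface
fc <- kcca(X, k = 3, family = kccaFamily("kmeans"))
table(clusters(fc))          # cluster sizes

# kernlab: kernel k-means with a Gaussian (RBF) kernel
kk <- kkmeans(as.matrix(X), centers = 3, kernel = "rbfdot")
kk                           # printing shows the cluster assignments
```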
 

Miner

TS Contributor
#10
My understanding of cluster analysis is that if the clusters are unknown, you use hierarchical cluster analysis, whereas k-means cluster analysis is used to assign items to pre-known or theorized clusters.
 

bryangoodrich

Probably A Mammal
#11
That's a bit of a stretch. The two are doing completely different things. Hierarchical clustering creates a tree structure out of the distances, building it up from individual data points by single-linking them into groups based on nearest neighbors. Cluster assignment at that point is a matter of selecting how many of those linked groups you want as you move down the tree. Thus, based on your distance measure, you're guaranteed to have things grouped nearest to each other in this nested fashion, up to the number of clusters you want.
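
In R terms, roughly (single linkage and k = 3 picked purely for illustration):

```r
X   <- scale(iris[, 1:4])
hc  <- hclust(dist(X), method = "single")  # build the tree bottom-up from pairwise distances
grp <- cutree(hc, k = 3)                   # cut the tree into however many linked groups you want
table(grp)
```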

By contrast, k-means detects representatives (the means or medians, say) for the selected number of groups, and assigns each data point to a representative based on its nearness to that center. The distance measure is implicit in the feature space. Notice that in this case the clustering is based on the similarity of the data points to a center, whereas hierarchical clustering creates clusters based on the nested linkage nearness among the data points themselves. The results differ greatly in what a "cluster" means.
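
And the k-means side of the same contrast, on the same arbitrary data:

```r
X  <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 25)
km$centers          # the "representatives" (cluster means in the feature space)
table(km$cluster)   # each point assigned to its nearest center
```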

There are also other issues, such as the fact that k-means (implicitly) seeks to linearly separate the feature space into partitions around those centers. There is no such division in hierarchical clustering. This also demonstrates a great weakness of k-means: what if the data points in this feature space are not linearly separable? This is where other methods or data transformations become relevant. For instance, kernel k-means applies a kernel transformation to your data, so you're essentially running k-means on the similarities the kernel computes, to better get at the underlying clusters in the data. In either case, however, you're supposing there actually are clusters. The fact is, you can take random data and impose clusters using either method.
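
A quick simulated illustration of both points (the two-ring data, the RBF kernel, and its automatic bandwidth are my own choices here, so treat this as a sketch rather than a guarantee):

```r
library(kernlab)
set.seed(1)

# Two concentric rings: real structure, but not linearly separable.
theta <- runif(400, 0, 2 * pi)
r     <- rep(c(1, 4), each = 200) + rnorm(400, sd = 0.1)
rings <- cbind(r * cos(theta), r * sin(theta))
truth <- rep(1:2, each = 200)

km <- kmeans(rings, centers = 2, nstart = 25)         # cuts straight across both rings
kk <- kkmeans(rings, centers = 2, kernel = "rbfdot")  # can follow the rings, given a decent kernel width

table(truth, kmeans = km$cluster)
table(truth, kernel_kmeans = kk@.Data)                # cluster assignments sit in the object's .Data slot

# And the caveat: feed either method pure noise and it still returns "clusters".
kmeans(matrix(rnorm(200), ncol = 2), centers = 3)$size
```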