- Thread starter: Chirantha89
- Tags: clustering, R programming

Have you looked at the function kmeans (package stats) in R?
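A minimal sketch of what that looks like, using the built-in iris data (the choice of data, k = 3, and seed are just for illustration):

```r
# k-means from the stats package on the iris measurements
data(iris)
x <- scale(iris[, 1:4])            # standardize features before clustering
set.seed(42)                       # kmeans uses random starting centers
fit <- kmeans(x, centers = 3, nstart = 25)  # nstart > 1 avoids bad local optima
fit$size                           # points per cluster
table(fit$cluster, iris$Species)   # compare clusters to the known species
```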

Instead of an algorithm, I prefer this:

Try to name the clusters for each reasonable level of k. If your colleagues say "Yeah! That's right!" then you have a good clustering. If they scratch their heads and say "well... I dunno...." then you still have work to do.

There are also many other methods, with different parameter options, to consider. Some of those methods do have ways of picking out the number of clusters, but they also aim at different objectives than k-means. I'd recommend checking out Coursera's machine learning or clustering courses. At the very least, they are informative.

For clustering, I've found the flexclust package to be particularly well designed and very flexible. Something more advanced would be kernel methods, for which kernlab has a function for kernel k-means.
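A hedged sketch of both suggestions: flexclust's `kcca()` (its k-centroids interface, with the "kmeans" family) and kernlab's `kkmeans()` (kernel k-means with an RBF kernel). Both packages must be installed; the guards below skip whichever is missing.

```r
# Assumes install.packages(c("flexclust", "kernlab")) has been run if you
# want both branches to execute.
x <- scale(as.matrix(iris[, 1:4]))
set.seed(1)
if (requireNamespace("flexclust", quietly = TRUE)) {
  # kcca generalizes k-centroids clustering; the "kmeans" family reproduces
  # ordinary k-means but plugs into flexclust's plotting/prediction tools
  fit_flex <- flexclust::kcca(x, k = 3,
                              family = flexclust::kccaFamily("kmeans"))
  print(table(flexclust::clusters(fit_flex)))
}
if (requireNamespace("kernlab", quietly = TRUE)) {
  # kernel k-means: clusters in an implicit feature space via an RBF kernel
  fit_kern <- kernlab::kkmeans(x, centers = 3, kernel = "rbfdot")
  print(fit_kern)
}
```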

My understanding of cluster analysis is that if the number of clusters is unknown, you use hierarchical cluster analysis; k-means requires you to specify the number of clusters up front, so it fits cases where the groups are pre-known or theorized.

By contrast, k-means detects representatives (the means or medians, say) for the selected number of groups and assigns each data point to the nearest representative. The distance measure is implicit in the feature space. Notice that in this case the clustering is based on the similarity of the data points to the centers, whereas hierarchical clustering builds clusters from the nested linkage nearness among the data points themselves. The results differ greatly in what a cluster means.
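The contrast above can be seen directly by running both on the same data (iris again, with average linkage and k = 3 chosen just for illustration) and cross-tabulating the two partitions:

```r
# Same data, two notions of "cluster"
x <- scale(iris[, 1:4])
set.seed(7)
km <- kmeans(x, centers = 3, nstart = 25)   # centroid-based partition
hc <- hclust(dist(x), method = "average")   # linkage-based nested tree
hc_cut <- cutree(hc, k = 3)                 # cut the dendrogram into 3 groups
table(km$cluster, hc_cut)                   # the two partitions need not agree
```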

There are also other issues, such as the fact that k-means implicitly partitions the feature space with linear boundaries around those centers. There is no such division in hierarchical clustering. This also exposes a great weakness in k-means: what if the data points in this feature space are not linearly separable into the clusters you care about?
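A quick synthetic illustration of that weakness (my own toy example, not from the thread): two concentric rings are clearly two groups, but no placement of two centers can separate them with the straight boundary k-means draws, so each k-means cluster ends up mixing both rings.

```r
# Two concentric noisy rings: a nonconvex clustering problem
set.seed(3)
t1 <- runif(100, 0, 2 * pi)
t2 <- runif(100, 0, 2 * pi)
inner <- cbind(cos(t1), sin(t1))         + matrix(rnorm(200, sd = 0.1), ncol = 2)
outer <- cbind(3 * cos(t2), 3 * sin(t2)) + matrix(rnorm(200, sd = 0.1), ncol = 2)
x <- rbind(inner, outer)
truth <- rep(1:2, each = 100)            # ring membership
km <- kmeans(x, centers = 2, nstart = 25)
# k-means splits the plane with a straight boundary, mixing the rings:
table(km$cluster, truth)
```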