hclust function-cluster analysis-text/document-function creation

#1
Hi guys

I'm working on a text mining/clustering project and am trying to create a table with one row per number of clusters and 6 columns for the following metrics: max.diameter, min.separation, average.within, average.between, avg.silwidth, dunn.

I need to create the tables for 3 methods - kmeans, pam and hclust.

I was able to create something for kmeans

Code:
library(fpc)  # provides cluster.stats()

dtm0.90Dist = dist(dtm0.90)

# Return the six validation metrics for a k-means solution with k clusters
foreachcluster = function(k) {
  kmeans.result = kmeans(dtm0.90, k)
  kmeans.stats = cluster.stats(dtm0.90Dist, kmeans.result$cluster)
  # The element is spelled average.between. A misspelling such as
  # avearge.between returns NULL, which c() silently drops; that is
  # why the output below shows only five columns.
  c(kmeans.stats$min.separation, kmeans.stats$max.diameter,
    kmeans.stats$average.within, kmeans.stats$average.between,
    kmeans.stats$avg.silwidth, kmeans.stats$dunn)
}
# One row per k, for k = 2..8
do.call(rbind, lapply(2:8, foreachcluster))
OUTPUT
Code:
    [,1]     [,2]     [,3]      [,4]       [,5]
[1,] 3.162278 30.19934 5.831550 0.5403872 0.10471348
[2,] 2.236068 28.37252 5.006058 0.3923446 0.07881104
[3,] 1.000000 28.37252 4.995478 0.2496066 0.03524537
[4,] 1.000000 26.40076 4.387212 0.2633338 0.03787770
[5,] 1.000000 26.40076 4.353248 0.2681947 0.03787770
[6,] 1.000000 26.40076 4.163757 0.1633954 0.03787770
[7,] 1.000000 26.40076 4.128927 0.2676423 0.03787770
OUTPUT END
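
The output above has five columns although six metrics were requested. In R, `$` on a list returns NULL for a name that does not exist (for example a misspelled one), and c() silently drops NULL. A small base-R sketch of that behaviour, using a made-up list `stats` with invented numbers rather than real cluster.stats output:

```r
# Hypothetical list standing in for a cluster.stats() result (values invented)
stats <- list(min.separation = 3.2, average.between = 9.5, dunn = 0.1)

# A misspelled name yields NULL instead of an error
is.null(stats$avearge.between)

# c() drops the NULL, so one value goes missing without any warning
misspelled <- c(stats$min.separation, stats$avearge.between, stats$dunn)
length(misspelled)

# Selecting by a character vector keeps the names and makes gaps visible
wanted <- c("min.separation", "average.between", "dunn")
picked <- unlist(stats[wanted])
print(picked)
```

Keeping the names on the result also gives the final table column headers, which makes a missing metric easy to spot.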

I need similar output for the hclust and pam methods, but for the life of me I can't get the same function to work for either of them.

OK, so I was able to make the function for hclust:

Code:
forhclust = function(k) {
  dfDist = dist(dtm0.90)
  hclust.result = hclust(dfDist)
  hclust.cluster = cutree(hclust.result, k)
  cluster.stats(dfDist, hclust.cluster)
  c(cluster.stats$min.separation)
}
But I get an error when I run this:

Error in cluster.stats$min.separation :
object of type 'closure' is not subsettable


What I need is for it to print the "min.separation" value and the other 5 measures, like in the kmeans code.

I would really appreciate all the help and perhaps some guidance in understanding why my approach is failing in hclust.
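
A likely explanation for the error above: inside forhclust the value returned by cluster.stats(dfDist, hclust.cluster) is never stored, so cluster.stats$min.separation applies `$` to the function cluster.stats itself. R calls functions "closures", hence "object of type 'closure' is not subsettable". A minimal sketch of a fix, assuming the fpc package is installed and using a small random matrix `toy` as a hypothetical stand-in for dtm0.90:

```r
library(fpc)  # cluster.stats()

set.seed(1)
# Hypothetical stand-in for dtm0.90: 40 "documents" x 5 "terms"
toy <- matrix(rnorm(200), nrow = 40)
toy.dist <- dist(toy)

forhclust <- function(k) {
  hclust.result <- hclust(toy.dist)
  hclust.cluster <- cutree(hclust.result, k)
  # Store the returned list; cluster.stats on its own is just the function
  hclust.stats <- cluster.stats(toy.dist, hclust.cluster)
  c(min.separation  = hclust.stats$min.separation,
    max.diameter    = hclust.stats$max.diameter,
    average.within  = hclust.stats$average.within,
    average.between = hclust.stats$average.between,
    avg.silwidth    = hclust.stats$avg.silwidth,
    dunn            = hclust.stats$dunn)
}

# One row per number of clusters, as in the kmeans table
hclust.table <- do.call(rbind, lapply(2:8, forhclust))
rownames(hclust.table) <- paste0("k=", 2:8)
print(round(hclust.table, 3))
```

The only real change from the failing version is the assignment to hclust.stats; the kmeans function worked because it stored the result in kmeans.stats before using `$`.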

Also, is there a good source that can explain the functioning and application of these methods, step by step, in detail?

Thank You
 

trinker

ggplot2orBust
#2
#3
Thank you, Trinker. I'll take care of the code formatting in my posts in the future.

I ran the code in R and it works, so I'm not sure what you mean by reproducible. Basically, an XML file is converted to a corpus, then to a data frame that R can read, and then cleaned (sparse words removed, etc.). A document term matrix is created from that, and the kmeans, hclust and pam methods are applied to it. That is what I have followed so far to get that output.

Is it possible to write a function that picks out certain values from the list that "cluster.stats" returns when it is used with hclust?
For example, when I used kmeans I was able to specify
Code:
c(kmeans.stats$min.separation, kmeans.stats$max.diameter,
  kmeans.stats$average.within, kmeans.stats$average.between,
  kmeans.stats$avg.silwidth, kmeans.stats$dunn)
to pick the 6 values I needed from the cluster.stats result.
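
One possible way to do this for all three methods at once is sketched below. cluster.stats only needs a distance object and a vector of cluster labels, so the same picking function can be reused for kmeans, pam and hclust. This assumes the fpc and cluster packages are installed; `toy`, `pick_metrics` and `label_funs` are hypothetical names introduced for illustration, and `toy` is random data standing in for dtm0.90.

```r
library(fpc)      # cluster.stats()
library(cluster)  # pam()

set.seed(1)
toy <- matrix(rnorm(200), nrow = 40)  # hypothetical stand-in for dtm0.90
toy.dist <- dist(toy)

wanted <- c("max.diameter", "min.separation", "average.within",
            "average.between", "avg.silwidth", "dunn")

# Hypothetical helper: pick named metrics for any vector of cluster labels
pick_metrics <- function(d, clustering, fields = wanted) {
  stats <- cluster.stats(d, clustering)
  absent <- setdiff(fields, names(stats))
  if (length(absent) > 0) {
    stop("unknown metric(s): ", paste(absent, collapse = ", "))
  }
  unlist(stats[fields])
}

# One function per method; each returns cluster labels for a given k
label_funs <- list(
  kmeans = function(k) kmeans(toy, centers = k)$cluster,
  pam    = function(k) pam(toy.dist, k)$clustering,
  hclust = function(k) cutree(hclust(toy.dist), k)
)

ks <- 2:8
tables <- lapply(label_funs, function(label_fun) {
  tab <- t(sapply(ks, function(k) pick_metrics(toy.dist, label_fun(k))))
  rownames(tab) <- paste0("k=", ks)
  tab
})
print(round(tables$pam, 3))
```

Selecting with a character vector instead of repeated `$` calls means a misspelled metric name stops with an error rather than quietly disappearing from the table.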


EDIT: I have attached the XML file and a text file containing the code I have so far. Perhaps that is what you were referring to when you said "reproducible".
 