How to find Optimal Clusters in Hierarchical Clustering (SPSS)

#1
Hi
I am a linguistics researcher trying to use cluster analysis in SPSS. After reading some tutorials, I have found that it is best to determine the number of clusters with the hierarchical method before moving on to, for example, the K-means method. Now I am trying to find the cut-off point in the SPSS output table. You can see the agglomeration schedule produced by SPSS below. I copied it to Excel and added another column (the last one), which shows the differences between consecutive coefficients. I am confused about where to cut off, i.e. at which stage the difference in coefficients makes a "big" leap. Lastly, if I decide that a stage, say Stage 199, shows a large difference, how can I determine the number of clusters at that stage (199)?
I would highly appreciate any help, as I am completely new to statistics.
Regards
Code:
Agglomeration Schedule							
Stage	Cluster Combined		Coefficients	Stage Cluster First Appears		Next Stage	Difference
	Cluster 1	Cluster 2		Cluster 1	Cluster 2		
1	132	204	.000	0	0	196	
2	39	80	.081	0	0	154	0
3	102	123	1.422	0	0	53	1
4	226	234	3.212	0	0	8	2
5	172	225	7.081	0	0	85	4
6	7	9	10.950	0	0	13	4
7	87	182	14.907	0	0	96	4
8	211	226	19.072	0	4	19	4
9	125	174	23.782	0	0	57	5
10	34	246	29.269	0	0	49	5
11	98	99	36.593	0	0	170	7
12	35	37	44.355	0	0	67	8
13	7	11	52.432	6	0	148	8
14	93	94	60.562	0	0	48	8
15	117	203	69.211	0	0	81	9
16	86	218	78.062	0	0	109	9
17	74	181	87.186	0	0	136	9
18	78	200	96.355	0	0	148	9
19	65	211	105.807	0	8	108	9
20	194	221	115.461	0	0	168	10
21	175	243	125.486	0	0	76	10
22	105	144	135.528	0	0	85	10
23	72	170	145.572	0	0	31	10
24	17	201	155.863	0	0	90	10
25	135	244	166.156	0	0	151	10
26	84	240	176.495	0	0	64	10
27	13	18	187.153	0	0	61	11
28	119	166	197.890	0	0	72	11
29	141	162	208.676	0	0	121	11
30	16	222	219.712	0	0	143	11
31	25	72	230.862	0	23	164	11
32	3	54	242.084	0	0	115	11
33	20	43	253.384	0	0	82	11
34	116	241	264.686	0	0	162	11
35	97	247	276.097	0	0	97	11
36	138	193	287.632	0	0	89	12
37	59	206	299.337	0	0	133	12
38	45	150	311.061	0	0	134	12
39	26	90	323.003	0	0	89	12
40	67	115	335.093	0	0	159	12
41	2	83	347.312	0	0	56	12
42	63	233	359.569	0	0	133	12
43	29	41	372.394	0	0	111	13
44	44	53	385.712	0	0	115	13
45	49	245	399.084	0	0	165	13
46	156	238	412.818	0	0	110	14
47	15	191	426.790	0	0	104	14
48	93	228	441.028	14	0	122	14
49	34	88	455.567	10	0	158	15
50	19	160	470.517	0	0	100	15
51	76	210	486.300	0	0	112	16
52	30	198	502.136	0	0	79	16
53	102	232	518.108	3	0	105	16
54	23	242	534.088	0	0	182	16
55	209	236	550.277	0	0	122	16
56	1	2	566.700	0	41	110	16
57	125	229	583.221	9	0	179	17
58	6	199	600.280	0	0	130	17
59	202	219	617.589	0	0	145	17
60	89	146	635.199	0	0	170	18
61	13	136	653.043	27	0	71	18
62	61	235	671.175	0	0	113	18
63	104	122	689.443	0	0	163	18
64	81	84	708.054	0	26	109	19
65	69	205	726.702	0	0	156	19
66	71	216	745.416	0	0	124	19
67	35	251	764.384	12	0	116	19
68	24	215	783.576	0	0	183	19
69	70	145	802.819	0	0	84	19
70	62	137	822.121	0	0	149	19
71	13	75	841.859	61	0	175	20
72	119	227	861.899	28	0	141	20
73	28	231	881.987	0	0	108	20
74	95	250	902.260	0	0	137	20
75	31	66	922.551	0	0	190	20
76	175	195	942.946	21	0	159	20
77	173	177	964.615	0	0	180	22
78	68	230	986.762	0	0	166	22
79	30	64	1009.487	52	0	131	23
80	189	217	1032.650	0	0	93	23
81	12	117	1055.970	0	15	125	23
82	20	50	1079.351	33	0	130	23
83	55	214	1102.824	0	0	160	23
84	70	113	1127.488	69	0	128	25
85	105	172	1152.242	22	5	138	25
86	36	143	1177.334	0	0	154	25
87	46	164	1202.760	0	0	201	25
88	73	223	1228.376	0	0	150	26
89	26	138	1254.488	39	36	126	26
90	14	17	1280.741	0	24	158	26
91	139	159	1307.073	0	0	155	26
92	106	239	1333.607	0	0	129	27
93	151	189	1360.257	0	80	165	27
94	77	176	1387.192	0	0	176	27
95	126	130	1414.358	0	0	161	27
96	87	185	1442.350	7	0	169	28
97	97	120	1470.625	35	0	119	28
98	165	169	1499.225	0	0	153	29
99	109	142	1528.449	0	0	120	29
100	19	40	1558.509	50	0	163	30
101	22	58	1588.830	0	0	135	30
102	52	56	1619.163	0	0	195	30
103	107	111	1649.604	0	0	203	30
104	15	212	1680.150	47	0	174	31
105	102	167	1710.700	53	0	175	31
106	118	249	1741.840	0	0	149	31
107	60	158	1774.096	0	0	167	32
108	28	65	1806.720	73	19	151	33
109	81	86	1839.773	64	16	153	33
110	1	156	1874.120	56	46	147	34
111	29	186	1908.891	43	0	174	35
112	38	76	1944.417	0	51	181	36
113	61	108	1980.960	62	0	136	37
114	8	33	2017.576	0	0	178	37
115	3	44	2054.430	32	44	142	37
116	35	91	2091.439	67	0	142	37
117	4	124	2128.598	0	0	173	37
118	154	178	2165.976	0	0	164	37
119	97	171	2203.967	97	0	131	38
120	109	127	2242.212	99	0	172	38
121	141	163	2281.072	29	0	140	39
122	93	209	2320.377	48	55	189	39
123	168	190	2360.785	0	0	155	40
124	71	196	2402.654	66	0	146	42
125	12	51	2444.797	81	0	183	42
126	26	133	2487.275	89	0	198	42
127	197	208	2531.873	0	0	150	45
128	70	152	2576.701	84	0	160	45
129	100	106	2622.001	0	92	206	45
130	6	20	2667.346	58	82	215	45
131	30	97	2712.885	79	119	143	46
132	103	192	2759.645	0	0	187	47
133	59	63	2807.498	37	42	188	48
134	45	128	2855.791	38	0	181	48
135	21	22	2906.152	0	101	162	50
136	61	74	2957.467	113	17	206	51
137	48	95	3013.472	0	74	171	56
138	105	248	3070.732	85	0	176	57
139	79	96	3128.181	0	0	197	57
140	141	157	3185.914	121	0	214	58
141	57	119	3244.371	0	72	156	58
142	3	35	3302.958	115	116	195	59
143	16	30	3361.574	30	131	211	59
144	149	207	3420.342	0	0	208	59
145	179	202	3479.416	0	59	167	59
146	71	92	3540.555	124	0	185	61
147	1	47	3601.847	110	0	168	61
148	7	78	3663.581	13	18	207	62
149	62	118	3725.668	70	106	197	62
150	73	197	3787.755	88	127	223	62
151	28	135	3850.916	108	25	202	63
152	147	188	3914.467	0	0	193	64
153	81	165	3978.443	109	98	216	64
154	36	39	4042.711	86	2	192	64
155	139	168	4107.901	91	123	184	65
156	57	69	4173.533	141	65	221	66
157	10	161	4239.819	0	0	213	66
158	14	34	4308.913	90	49	202	69
159	67	175	4378.138	40	76	226	69
160	55	70	4447.954	83	128	205	70
161	126	129	4517.832	95	0	192	70
162	21	116	4589.479	135	34	190	72
163	19	104	4662.135	100	63	200	73
164	25	154	4736.806	31	118	185	75
165	49	151	4812.742	45	93	188	76
166	68	110	4888.898	78	0	210	76
167	60	179	4965.092	107	145	204	76
168	1	194	5041.373	147	20	222	76
169	87	183	5122.155	96	0	189	81
170	89	98	5203.044	60	11	205	81
171	48	224	5285.457	137	0	209	82
172	27	109	5369.126	0	120	182	84
173	4	134	5455.352	117	0	177	86
174	15	29	5547.681	104	111	199	92
175	13	102	5642.581	71	105	218	95
176	77	105	5737.799	94	138	198	95
177	4	112	5833.600	173	0	201	96
178	8	32	5930.962	114	0	213	97
179	125	184	6030.260	57	0	212	99
180	101	173	6132.209	0	77	216	102
181	38	45	6238.498	112	134	209	106
182	23	27	6348.739	54	172	228	110
183	12	24	6459.193	125	68	200	110
184	82	139	6571.531	0	155	217	112
185	25	71	6685.123	164	146	212	114
186	153	155	6798.949	0	0	204	114
187	103	187	6917.682	132	0	211	119
188	49	59	7036.590	165	133	207	119
189	87	93	7157.793	169	122	231	121
190	21	31	7282.194	162	75	229	124
191	180	220	7409.656	0	0	210	127
192	36	126	7538.215	154	161	225	129
193	121	147	7669.288	0	152	217	131
194	85	237	7804.019	0	0	208	135
195	3	52	7950.627	142	102	227	147
196	42	132	8098.301	0	1	236	148
197	62	79	8252.568	149	139	199	154
198	26	77	8410.775	126	176	222	158
199	15	62	8586.288	174	197	232	176
200	12	19	8765.548	183	163	223	179
201	4	46	8949.229	177	87	234	184
202	14	28	9140.998	158	151	227	192
203	107	114	9335.794	103	0	225	195
204	60	153	9531.717	167	186	233	196
205	55	89	9730.627	160	170	221	199
206	61	100	9932.205	136	129	220	202
207	7	49	10137.439	148	188	215	205
208	85	149	10344.672	194	144	219	207
209	38	48	10558.159	181	171	226	213
210	68	180	10783.738	166	191	220	226
211	16	103	11014.013	143	187	218	230
212	25	125	11270.451	185	179	240	256
213	8	10	11533.443	178	157	228	263
214	141	148	11809.224	140	0	224	276
215	6	7	12098.615	130	207	231	289
216	81	101	12403.101	153	180	230	304
217	82	121	12711.649	184	193	238	309
218	13	16	13037.099	175	211	239	325
219	85	131	13375.251	208	0	237	338
220	61	68	13713.968	206	210	230	339
221	55	57	14074.292	205	156	242	360
222	1	26	14435.258	168	198	233	361
223	12	73	14821.239	200	150	229	386
224	141	213	15229.240	214	0	235	408
225	36	107	15648.504	192	203	239	419
226	38	67	16091.583	209	159	241	443
227	3	14	16561.556	195	202	242	470
228	8	23	17075.567	213	182	244	514
229	12	21	17592.328	223	190	236	517
230	61	81	18128.157	220	216	237	536
231	6	87	18716.822	215	189	232	589
232	6	15	19340.140	231	199	245	623
233	1	60	19984.877	222	204	240	645
234	4	5	20741.203	201	0	243	756
235	140	141	21531.661	0	224	238	790
236	12	42	22442.789	229	196	241	911
237	61	85	23434.693	230	219	249	992
238	82	140	24458.315	217	235	244	1024
239	13	36	25504.442	218	225	246	1046
240	1	25	26565.764	233	212	243	1061
241	12	38	27754.707	236	226	247	1189
242	3	55	29191.887	227	221	245	1437
243	1	4	30978.196	240	234	246	1786
244	8	82	33014.873	228	238	248	2037
245	3	6	35359.216	242	232	247	2344
246	1	13	38305.928	243	239	248	2947
247	3	12	42204.358	245	241	250	3898
248	1	8	46950.857	246	244	249	4746
249	1	61	52598.569	248	237	250	5648
250	1	3	64439.238	249	247	0	11841
 
#2
Hi True_Friend,

Have you tried plotting the differences at each stage in a scatter plot and looking for the "elbow"? Big jumps can be easier to identify visually, but I must admit that there isn't always a clear answer on where to cut. Since you have copied the table into Excel and computed the differences, you could also look at, e.g., the five biggest jumps and then assess whether the number of clusters at any of those stages seems sensible for your analysis. To determine how many clusters there are at each stage, you can work back from the last stage: the last stage leaves one cluster (all cases in one cluster), the second-last leaves two clusters, and so on. In general, with n cases the number of clusters left after a stage is n minus the stage number.
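A minimal sketch of that bookkeeping in Python, using the last six stages of the schedule posted above. It assumes 251 cases (inferred from the 250 stages, since an agglomerative schedule has one stage fewer than there are cases); the function names are made up for illustration.

```python
# Tail of the agglomeration schedule posted above: (stage, coefficient).
schedule_tail = [
    (245, 35359.216),
    (246, 38305.928),
    (247, 42204.358),
    (248, 46950.857),
    (249, 52598.569),
    (250, 64439.238),
]

N_CASES = 251  # assumption: 250 stages imply 251 cases


def clusters_after_stage(stage, n_cases=N_CASES):
    """Number of clusters that remain once `stage` has been carried out."""
    return n_cases - stage


def largest_jumps(rows, top=3):
    """Return (stage, jump) pairs sorted by jump size, biggest first."""
    jumps = [
        (stage, round(coef - prev_coef, 3))
        for (_, prev_coef), (stage, coef) in zip(rows, rows[1:])
    ]
    return sorted(jumps, key=lambda pair: pair[1], reverse=True)[:top]


for stage, jump in largest_jumps(schedule_tail):
    # A big jump at `stage` means that merge was costly, so the solution
    # just before it (one cluster more) is the candidate to inspect.
    candidate = clusters_after_stage(stage) + 1
    print(f"stage {stage}: jump {jump} -> inspect {candidate} clusters")
```

For the stage asked about in the first post, `clusters_after_stage(199)` gives 52 under the same assumption.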

Hope it helps,
Londoner
 
#3
Hi
I have tried to use the dendrogram to determine the number of clusters. This presentation helped me:
http://www.slideshare.net/jewelmrefran/cluster-analysis-15529464
The dendrogram shows three main clusters (with sub-clusters within). It looks like I may be interpreting the dendrogram wrongly when trying to find the optimal number of clusters.
Can you help me understand how Table 23.2 on slide 60 of the presentation above is constructed?
PS: I made a line graph of the coefficient values against the stages, and there does appear to be a steep upward line at the end. Can you help me place a cut on this? Additionally, after deciding on the cut, how do I calculate the clusters? I am confused about the procedure. I do not understand the internals of cluster analysis, so this is difficult for me.

Regards
 
#4
I have found out that the agglomeration table should be interpreted from bottom to top. Here are the last 13 stages. I have renumbered them so that the last stage is 1, the second-last is 2, and so on; each number is therefore the number of clusters left after that stage.
Can you help me decide which is optimal, 2 clusters or 3 clusters? The biggest jump is at 2-to-1 (up to 64439.238), while the jump from 3-to-2 is smaller but still bigger than the one before it. The earlier differences are smaller and grow more gradually. So it seems I need to have 3 clusters?
Stage Coefficients
13 24458.315
12 25504.442
11 26565.764
10 27754.707
9 29191.887
8 30978.196
7 33014.873
6 35359.216
5 38305.928
4 42204.358
3 46950.857
2 52598.569
1 64439.238
 
#5
Again, it is not always so easy to pick the "perfect" point. I had a look at your last few stages (between 1 cluster and 20 clusters), and it looks like each time the coefficients are increasing smoothly (i.e. at each stage the jump in coefficient is slightly higher than at the prior stage). I agree that the jump from 2 clusters to 1 cluster (stage 249 to 250) is by far the highest (in fact double the prior jump), so 2 clusters could make sense in my view. It really depends on your analysis as well though. What are you actually trying to cluster?
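One way to put a number on "double the prior jump" is to divide each jump by the one before it. The sketch below does this for the 13 coefficients quoted in post #4; the helper names are hypothetical, and the ratio is only a rough heuristic, not a formal stopping rule.

```python
# Coefficients from post #4, keyed by the number of clusters left after the merge.
coefficients = {
    13: 24458.315, 12: 25504.442, 11: 26565.764, 10: 27754.707,
    9: 29191.887, 8: 30978.196, 7: 33014.873, 6: 35359.216,
    5: 38305.928, 4: 42204.358, 3: 46950.857, 2: 52598.569,
    1: 64439.238,
}


def jumps_by_cluster_count(coefs):
    """Map k -> rise in the coefficient when k clusters are merged into k - 1."""
    return {k: coefs[k - 1] - coefs[k] for k in sorted(coefs) if k - 1 in coefs}


def relative_jumps(jumps):
    """Map k -> (jump for k -> k - 1) divided by (jump for k + 1 -> k)."""
    return {k: jumps[k] / jumps[k + 1] for k in jumps if k + 1 in jumps}


jumps = jumps_by_cluster_count(coefficients)
ratios = relative_jumps(jumps)

# The k whose merge is most out of line with the preceding merge.
best_k = max(ratios, key=ratios.get)
print(best_k, round(ratios[best_k], 2))
```

On these figures the largest ratio is at k = 2 (about 2.1), while every other ratio stays below 1.4, which matches the reading that the earlier increases are smooth.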
 

Miner

#6
As stated, it is not always easy to determine the "best" number of clusters from the numbers. There is no specific threshold (e.g., p-value ≤ 0.05), so you must combine the guidance typically given in texts with some conceptual theory and then make an educated guess.

I ran across a dataset on motorcycles that illustrated this perfectly. There were two or three logical places where you could cut the dendrogram. If you cut it at 3 clusters, you got obvious clusters for the big bikes (e.g., Honda Goldwing), street bikes and crotch rockets. If you cut it at 5 clusters, there were logical subdivisions of two of these groups. Both options made perfect sense, so which do you choose? It depends on your needs/goals. If you wanted to identify market segments for a targeted ad campaign, 3 clusters is probably sufficient.
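To see how one tree yields both a coarse and a finer solution, here is a minimal sketch that replays an agglomeration schedule and stops once a chosen number of clusters is left. The six-case schedule is made up for illustration (it is not the motorcycle data or the schedule from this thread), `cut_schedule` is a hypothetical helper, and it assumes each cluster in the schedule is named by one of its member cases, as in the "Cluster Combined" columns.

```python
def cut_schedule(merges, n_cases, n_clusters):
    """Replay (cluster_1, cluster_2) merges until `n_clusters` remain.

    Cases are numbered from 1; returns the clusters as sorted lists of cases.
    """
    parent = list(range(n_cases + 1))  # index 0 is unused

    def find(case):
        # Follow parent links to the representative of the case's cluster.
        while parent[case] != case:
            parent[case] = parent[parent[case]]  # path halving
            case = parent[case]
        return case

    # n_cases - n_clusters merges leave exactly n_clusters clusters.
    for a, b in merges[: n_cases - n_clusters]:
        root_a, root_b = find(a), find(b)
        parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for case in range(1, n_cases + 1):
        groups.setdefault(find(case), []).append(case)
    return sorted(groups.values())


# Hypothetical schedule for six cases, in stage order.
toy_merges = [(1, 2), (3, 4), (5, 6), (3, 5), (1, 3)]

print(cut_schedule(toy_merges, n_cases=6, n_clusters=2))
print(cut_schedule(toy_merges, n_cases=6, n_clusters=3))
```

The 3-cluster result is the 2-cluster result with one group subdivided, which is the nesting described above: cutting lower never reshuffles cases across the higher-level groups.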

So, back to your problem. What are your needs, and how many clusters does it take to meet those requirements? Since you are in linguistics, a possible scenario that you might encounter is that one cluster solution identifies major language groups, while a solution with more clusters identifies subtypes, or even regional dialects.
 
#7
I am trying to classify text types (e.g. business letters can vary in topic, length and linguistic features). After forming 3 clusters, I apply a one-way ANOVA to see which attributes contribute most significantly to separating the clusters. Three out of six are significant (p-value of 0.0000). When I go for 4 clusters instead, all the attributes show a significance of 0.0000 in the one-way ANOVA.
Should I take this phenomenon into account as well?
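For reference, the per-attribute comparison described here boils down to a one-way ANOVA F statistic computed across the clusters. The sketch below uses plain Python and invented attribute values (they are not the poster's data), and `one_way_anova_f` is a hypothetical helper. Because the clusters were built from these same attributes, the resulting F values are probably best read descriptively.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    # Variation of the cluster means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, means)
    )
    # Variation of the cases around their own cluster mean.
    ss_within = sum(
        (x - mean) ** 2 for group, mean in zip(groups, means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Invented values of two attributes, split by a 3-cluster solution.
separating_attribute = [[10, 12, 11], [20, 22, 21], [30, 29, 31]]
overlapping_attribute = [[10, 20, 30], [11, 21, 29], [12, 22, 31]]

f_strong, _, _ = one_way_anova_f(separating_attribute)
f_weak, _, _ = one_way_anova_f(overlapping_attribute)
print(f_strong, f_weak)
```

An attribute whose values differ sharply between clusters gives a large F, while one whose values overlap across clusters gives an F near zero; rerunning this for a 4-cluster solution shows how the picture changes per attribute.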