Help with clustering based on shopping category transaction freq

hey there.

Just wondering if anyone could lend their expertise on cluster analysis.

Basically I am trying to see if I can find 5-10 customer segments based on shopping habits.

I have a table (single customer view) of 10,000 randomly selected customers along with a column for each category containing a value between 1 and 4, denoting they have transacted in that category 1, 2, 3 or 4+ times. The cut off of point at 4 is employed in order to eliminate outliers and I have normalised the values by converting them into z-scores.

I'm looking to use a two-step approach to effectively cluster this large data set. First I am partitioning using k-means into "sub-clusters" and then using agglomerative hierarchical clustering to cluster the clusters.

Problem I have is that every time I cluster using k-means I get a massively different result, especially when I randomly sort the data set before each run, and so by the time the sub-clusters are being inputted into the AHC the damage is already done so-to-speak.

Half of the issue is potentially that I don't know how many sub-clusters to create. Is there any best practice for number of sub-clusters e.g. if you have X observations then created X/10 sub-clusters?

Also does anyone have any general advice on the approach? I'm quite new to modelling and I feel like I've read every post on every forum, lots of academic papers and web resources but am still no better off.

If my data set is fraught with things that will cause issues then how do I know? Can certain data sets not be clustered? Are there any other alternative techniques I could use?

I've attached the first 1k rows (of 10k total) from my original dataset (normalised "z-score" variables are at the right) in tab delimited format

Any advice would be really appreciated!

Last edited:
Why are you using two step clustering? Do you have both continuous and categorical variables? ( I am asking because I am not able to open the dataset)
In that case as an alternative you can use the Gower distance for the hierarchical clustering.
Also, for the problem of selecting the optimal number of clusters, you can use the scree plot and identify the knee point.
I can explain these things in detail if you want but would be better if you search for these things (+ I am feeling lazy).
One method I have used in deciding the number of clusters is to randomly split the dataset in half and run the cluster analysis for a given number of clusters (e.g. 2) on one half and assign the other half to the nearest neighbor, repeat this several times (e.g. 10) for given range of cluster numbers (e.g. 2 through 8). Then you can calculate some agreement statistics (e.g. hubert and arabie's adjusted rand index, kappa, etc) for each cluster number to see if there is a certain number of clusters that might be optimal. A paper using this technique can be found here:
Shahnawaz, 2 step was recommended to me as a good way to get a more consistent result but there's nothing to suggest this was good advice! :)

I have continuous and categorical, where the latter are represented as numbers (e.g. male 1, female 2). Actually though I have tried removing the categorical variables to see if that helps (data set I uploaded here is post-removing them) but it hasn't helped, in part due to the likely significance of those variables.

Thanks a lot for the tip. I will Google the things you mention and revert if I need any other pointers.

ps. I have re-uploaded the dataset as a tab delimited txt file

Last edited:


TS Contributor
May be you could look into Correspondence Analysis (CA), which is an exploratory technique frequently used in marketing research and market segmentation (LINK).

It looks to me as a natural choice for you since you have a number of units (customers, in your case) whose purchases are tabulated (with counts) against a number of categories.

Now, CA would allow you to explore the patterns of association between customers and purchased item classes. Moreover, it would also allow to cluster both customers and purchased item classes.

I cannot go into details here. A good intro to CA has been written by Bendixen (LINK). Another useful source could be my Website on CA, where you can find info about the clustering procedure and rationale.

The only problem I envision with your data is that: 10,000 rows are quite a lot and it would be perhaps better to draw a more manageable sample.
Secondly, your data looks rather sparse. It would be better, in order to perform CA, to meaningfully collapse some column categories.

GM this is really helpful, thank you. Funnily enough I had just begun to explore other options on Friday and CA was one of the things I short-listed for research.

You are right that there is a lot of zero values and as you say it would be best to get rid of certain categories altogether. Perhaps I'll also reduce the sample size to 1k.

Your help is much appreciated!