Question About K-Means

Kam36

New Member
#1
I'm currently working on a project that looks at clustering retail stores on the basis of their sales performance by item class.
Items that are sold in these retail stores are classified across 18 groupings.
In short, I'm looking at clustering stores that share similar sales patterns across these classes.
I have been looking into using a K-means clustering algorithm, but I'm not sure if I should use Principal Component Analysis (PCA) to reduce my 18 variables to 2-5 components first.
The research I have done has given me mixed results, and I have not been able to find anything specific. When I cluster the stores with PCA and without it, my results vary greatly. Also, I am not sure what the best way of evaluating my results is. How can I tell which process is giving me the best result?
 
#2
Hehe, I ran into this on a midterm take-home project once.

There is a really good chance that the reason your k-means results vary wildly between the PCs and the natural variables is that the PCs are standardized and you didn't standardize the natural observations.

I had a situation where I ran k-means on the full PC projections and on the natural observations, and I was getting wildly different results. I thought: why should this be, if the PCs are just a rotation in space? When I thought to standardize the natural variables, I had my answer.
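
To make the rotation point concrete, here is a minimal sketch in NumPy. The data is made up (50 hypothetical stores, 4 item classes on very different scales), and it only checks Euclidean distances, which is what k-means works with. Keeping all the components without scaling should leave the distances between stores unchanged, while standardizing the columns changes them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 stores x 4 item classes, each class on a different scale
X = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

def pairwise_dist(A):
    """Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Full PCA projection of the centered data (no scaling): a pure rotation
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# Standardized copy: each column rescaled to unit variance
Xz = Xc / X.std(axis=0, ddof=1)

rotation_preserves = np.allclose(pairwise_dist(Xc), pairwise_dist(scores))
scaling_preserves = np.allclose(pairwise_dist(Xc), pairwise_dist(Xz))
print(rotation_preserves, scaling_preserves)
```

If the two k-means runs disagree, a distance check like this is a quick way to see whether scaling (rather than the rotation itself) is the likely cause.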
 

Kam36

New Member
#3
I had no idea!!! Your point was VERY helpful!!

I've been using % of total sales instead of the actual sales, since I don't want larger-volume stores to be clustered together. For example, Store X sells 5 products of class 1, 2 products of class 2 and 3 products of class 3. In my dataset, I've turned that into "50% of class 1, 20% of class 2 and 30% of class 3".
If I standardize the actual dollar amounts, will that essentially do the same thing? Or will it standardize by product?
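
For reference, a small sketch of the two transformations on made-up numbers (the stores and values are hypothetical, apart from the 5/2/3 example above). It assumes "standardize" means the usual column-wise z-score, which is computed per item class across stores; percent of total is computed per store across item classes.

```python
import numpy as np

# Hypothetical sales: rows are stores, columns are item classes.
# Store B has the same mix as store A at 10x the volume.
sales = np.array([
    [5.0, 2.0, 3.0],     # store A (the 5/2/3 example)
    [50.0, 20.0, 30.0],  # store B
    [1.0, 8.0, 1.0],     # store C, a different mix
])

# Row-wise: each store's share of its own total sales
shares = sales / sales.sum(axis=1, keepdims=True)

# Column-wise z-scores: each item class centered and scaled across stores
zscores = (sales - sales.mean(axis=0)) / sales.std(axis=0, ddof=1)

print(shares)
print(zscores)
```

In this sketch stores A and B get identical share rows, while their z-score rows still differ because the volume difference is not removed by scaling each column.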


 
#4
Well, point one is that you don't ~have~ to standardize to use PCA. It is just that it is sometimes done by default. You can run it without the standardization and repeat to see if that was the issue.
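
A rough way to see the difference, sketched with NumPy on made-up data (200 hypothetical stores, 3 item classes): PCA without standardizing amounts to an eigendecomposition of the covariance matrix, and PCA after standardizing amounts to the same thing on the correlation matrix. The helper name `explained_ratio` is just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 stores x 3 item classes on very different scales
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

def explained_ratio(M):
    """Share of total variance carried by each principal component of M."""
    eigvals = np.linalg.eigvalsh(M)[::-1]  # sorted largest first
    return eigvals / eigvals.sum()

# PCA without standardizing: eigenvalues of the covariance matrix
cov_ratio = explained_ratio(np.cov(X, rowvar=False))
# PCA after standardizing: eigenvalues of the correlation matrix
corr_ratio = explained_ratio(np.corrcoef(X, rowvar=False))

print(cov_ratio)
print(corr_ratio)
```

With scales this uneven, the unstandardized version is dominated by the largest-variance column, while the standardized version spreads the variance much more evenly. Which one your software does by default is worth checking in its documentation.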

But whether you should standardize is still a question to consider.

So are all your variables total sales in Product A through Z, or are there other types of measurements in there too? Interesting. I don't have an answer.
 

Kam36

New Member
#5
So running PCA doesn't always standardize the dataset?
The only type of measurement I'm using is sales by category.
As for standardizing the dataset, do you suggest doing it? As mentioned in my previous post, I'm worried that large-volume stores will be clustered together. I'm more interested in clustering stores that have similar sales patterns than in clustering stores with large sales volumes together.
Do you have any ideas or suggestions as to how I should continue?
Thanks!!
