K-means cluster for skewed dataset

sesh

Hi,
I have 7 columns (variables) and about 14K rows; the percentiles for each column are shown below. I've tried to build k-means clusters on these 14K observations of 7 variables.

A-G are products, and the numbers are the turnover for each product.

As the percentile table shows, a large share of the rows have zero turnover for most products: B, C, D, E and F are all zero up to at least the 60th percentile.

Does it make sense to build clusters on data distributed like this, given the huge skew, which also distorts the mean and standard deviation?

I've tried building clusters, but the resulting segments are not very useful.

Two questions:

Does it make sense to build clusters for the dataset below? If so, what approach would you recommend?

I'm planning to exclude outliers from all the columns and then run k-means again.

Does that make sense? I'd look at the maximum value of each column and roughly remove the customers whose values are extremely high.
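
To make the outlier idea concrete, here is a minimal sketch of one way it could be done, assuming "extremely high" means "above a high percentile of that column". The function name `drop_extreme_rows` and the 99th percentile cutoff are assumptions made for illustration; the post does not fix a threshold.

```python
from statistics import quantiles

def drop_extreme_rows(rows, pct=99):
    """Keep only rows whose every value is at or below its column's pct-th percentile.

    `pct` (1..99) is an assumed cutoff, not something stated in the post.
    """
    columns = list(zip(*rows))
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    cutoffs = [quantiles(col, n=100, method="inclusive")[pct - 1] for col in columns]
    # A row survives only if none of its columns exceeds that column's cutoff.
    return [row for row in rows if all(v <= c for v, c in zip(row, cutoffs))]
```

Because a row is dropped as soon as any one of its 7 columns is extreme, a 99th percentile cutoff could remove up to roughly 7% of the customers, so it is worth checking how many rows remain afterwards.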

My approach so far was as follows.

I took the raw data for all the columns, converted each column to z-scores, and then ran k-means with 5 to 8 clusters and 30 iterations.
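
The steps just described (z-scores, then k-means for 5 to 8 clusters with 30 iterations) could be sketched roughly as below. This is a plain standard-library version of Lloyd's algorithm, not the tool actually used for the post; the function names, the random initialisation and the `seed` parameter are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardise each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]  # constant column: avoid dividing by 0
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, k, iterations=30, seed=0):
    """Lloyd's algorithm with a fixed iteration count; returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k randomly chosen rows
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(mean(col) for col in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

def inertia_by_k(rows, ks=range(5, 9), iterations=30):
    """Run k-means on z-scored rows for each k and collect the within-cluster sum of squares."""
    z = zscore_columns(rows)
    return {k: kmeans(z, k, iterations)[2] for k in ks}
```

On 14K rows this pure-Python loop would be slow; it is only meant to show the steps. Note also that with so many all-zero rows, random initialisation can pick identical starting centroids, which leaves some clusters empty.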

The first table shows the percentiles; the second shows the mean, standard deviation, skewness, range, minimum and maximum.

Percentile         A         B         C         D         E         F         G
5               0.00      0.00      0.00      0.00      0.00      0.00      0.00
10              0.00      0.00      0.00      0.00      0.00      0.00      5.00
15              2.00      0.00      0.00      0.00      0.00      0.00     10.00
20              5.00      0.00      0.00      0.00      0.00      0.00     19.00
25              8.50      0.00      0.00      0.00      0.00      0.00     20.00
30             10.00      0.00      0.00      0.00      0.00      0.00     25.00
35             15.00      0.00      0.00      0.00      0.00      0.00     40.00
40             20.00      0.00      0.00      0.00      0.00      0.00     59.14
45             27.00      0.00      0.00      0.00      0.00      0.00     91.76
50             36.00      0.00      0.00      0.00      0.00      0.00    134.41
55             50.00      0.00      0.00      0.00      0.00      0.00    216.01
60             67.00      0.00      0.00      0.00      0.00      0.00    342.23
65             95.20      0.00      0.00      0.00      0.69      0.00    541.96
70            137.15      0.00      2.25      0.00      4.00      0.00    877.71
75            204.00     35.48     13.00      0.00     14.40      0.00   1474.91
80            321.50    101.84     56.24      0.00     50.52      0.00   2492.92
85            554.49    284.33    218.54      0.00    187.17      0.00   4355.42
90           1070.82    836.64    946.31      2.00    728.33      0.00   8392.30
95           2536.78   2887.65   5237.67    175.63   3571.01     12.99  20901.80
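
For reference, a table like the one above (5th to 95th percentile in steps of 5) can be computed per column with Python's standard library. This is only a sketch: the helper names are made up for illustration, and the `inclusive` interpolation method is an assumption, so the exact values may differ slightly from those produced by the tool used for the post.

```python
from statistics import quantiles

def percentile_table(values):
    """Return {5: p5, 10: p10, ..., 95: p95} for one column of turnover values."""
    cuts = quantiles(values, n=20, method="inclusive")  # 19 cut points
    return {5 * (i + 1): cut for i, cut in enumerate(cuts)}

def first_nonzero_percentile(values):
    """Smallest listed percentile whose value is above zero, or None if there is none."""
    for pct, value in percentile_table(values).items():
        if value > 0:
            return pct
    return None
```

Applied to each of the columns A-G, `first_nonzero_percentile` gives a quick view of how much of each column is zero.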



Statistic                 A           B           C           D           E           F           G
Record count          14168       14168       14168       14168       14168       14168       14168
Mean                 908.33      1131.4     3447.63      714.71     1541.42       465.1     5979.68
Std. Deviation     24338.19     11710.0    81949.65      9065.3     15599.0    11625.03    63834.93
Skewness             107.95       36.78       87.73       25.11       43.03       53.96       49.20
Range            2798373.20   788183.75  8606816.05   411374.22  1223201.92   816560.80  4732292.61
Minimum                0.00        0.00        0.00        0.00        0.00        0.00        0.00
Maximum          2798373.20   788183.75  8606816.05   411374.22  1223201.92   816560.80  4732292.61
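
The summary statistics above, plus the share of zero values per column, could be reproduced with something like the sketch below. The skewness here is the simple moment-based definition (third central moment divided by the cubed population standard deviation), and the standard deviation is the population version; both are assumptions, since the post does not say which definitions the original tool uses, so the numbers may not match exactly.

```python
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed population SD."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return 0.0  # a constant column has no skew
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def summarize(values):
    """Summary statistics for one column, including the fraction of exact zeros."""
    return {
        "count": len(values),
        "mean": mean(values),
        "sd": pstdev(values),
        "skewness": skewness(values),
        "min": min(values),
        "max": max(values),
        "zero_share": sum(1 for v in values if v == 0) / len(values),
    }
```

A large positive skewness together with a high `zero_share` is the pattern visible in the tables: most customers have no turnover and a few have a very large one.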

I'm new to stats and data mining :)


Thanks a lot,
Seshi