Clustering of categorical & quantitative data

mahoo

New Member
#1
Hello,

I'm struggling a bit here, and I hope I chose the right forum section.

I am trying to cluster cells (1x1km) over a specific area. Each cell is composed of various habitats defined by a code. (Each habitat consists of 3 parameters, so a habitat code looks like e.g. 1-3-15. There are around 100 different habitats, i.e. max number of combinations of these 3 parameters).

I am now trying to define clusters according to:

- Number of habitats per cell
- 1st largest habitat covering the cell
- 2nd largest habitat covering the cell and
- Rarest habitat in the cell (I developed a global rarity index for each habitat, and the habitat with the lowest index is chosen per cell).

So the last 3 in the list each refer to a specific habitat per cell, and the first is more of a quantitative measure of how many habitats are in each cell.

Based on this they'd like to perform a cluster analysis, but I am wondering whether there is a mix-up between categorical and quantitative data.

Could anyone please give some advice on what method to use, e.g. in R?

Thanks,
Ralph
 

bryangoodrich

Probably A Mammal
#2
Typically you'd look at k-means clustering or hierarchical clustering. (k-nearest neighbors, KNN, often gets mentioned alongside them, but it is a supervised classifier that needs known group labels, so it doesn't apply here.) If you're looking to group things that are similar, k-means is the way to go, because it partitions your data into a number of groups you choose in advance. If instead you are interested in how individual things are similar to each other, hierarchical clustering is the way to go, because it builds a dendrogram of how similar things are. From that you can form groups or investigate the dendrogram.

As used below, both take quantitative variables, so what you have to do is convert your qualitative variables to a meaningful quantitative scale. There are many ways to do this. For instance, if you have a binary variable, you might encode it as 0 and 1 or as -1 and 1. If the variable has 3 levels, maybe -1, 0, 1 or 0, 0.5, 1, though that only makes sense if the levels have a natural order. You can also create dummy (indicator) variables, where a variable is 1 if a given level occurs and 0 otherwise. Thus, you'd have 2 such binary variables for a 3-level variable, with the remaining level as the reference. All of the quantitative variables should be normalized so the scale of one variable doesn't dominate the results.
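
The dummy-variable encoding and the normalization described above could look roughly like this in R. This is only a minimal sketch: the `cells` data frame and its `landuse` and `n_habitats` columns are made up for illustration and are not from the original data.

```r
# Hypothetical toy data: one 3-level categorical variable and one count
cells <- data.frame(
  landuse    = factor(c("forest", "crop", "urban", "forest", "crop")),
  n_habitats = c(4, 7, 2, 5, 9)
)

# Treatment contrasts (R's default for unordered factors):
# 3 levels -> 2 binary columns, with the first level ("crop") as reference.
# Dropping column 1 removes the intercept.
dummies <- model.matrix(~ landuse, data = cells)[, -1, drop = FALSE]

# Combine with the quantitative variable and standardize every column
x <- scale(cbind(dummies, n_habitats = cells$n_habitats))

colnames(x)
```

The resulting matrix `x` is all numeric, so it can be passed to `kmeans` or `dist` directly.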

Using either of these methods is pretty straightforward. See kmeans and hclust, respectively. For hclust, you have to call dist on your matrix of variables to create the distance matrix the clustering algorithm uses. The catch, at least in R, is that this gets costly quickly: the distance matrix holds n(n-1)/2 values for n observations, so be aware that the size of the data matters here.
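
To get a feel for how fast the distance matrix grows, here is a rough back-of-the-envelope calculation. The cell counts are hypothetical, and the memory figure assumes 8 bytes per stored distance and ignores any working copies R makes.

```r
# Number of pairwise distances stored by dist() for n observations
n_pairs <- function(n) n * (n - 1) / 2

# Approximate memory in gigabytes, assuming 8 bytes per stored double
dist_gb <- function(n) n_pairs(n) * 8 / 1024^3

n <- c(1000, 10000, 50000)   # hypothetical numbers of cells
data.frame(cells = n, pairs = n_pairs(n), approx_gb = round(dist_gb(n), 2))
```

Since kmeans works on the data matrix itself and never builds this pairwise matrix, it is usually the cheaper of the two for large n.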

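Code:
# assuming x is your data frame of numeric values
xs <- scale(x)                              # standardize so no variable dominates
xd <- dist(xs)                              # Euclidean distance matrix for hclust
set.seed(1)                                 # k-means uses random starts
kc <- kmeans(xs, centers = 5, nstart = 50)  # 5 clusters, best of 50 random starts
hc <- hclust(xd)                            # complete linkage by default
table(kc$cluster, cutree(hc, k = 5))        # compare cluster results by method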
 

mahoo

New Member
#3
Great, thanks for the reply. That already gives me a way forward.

Regarding transforming the categorical data:

For my cell dataset, if the habitat variable consists of, for example, 4 geology types, 5 landuse types and 2 soil classes (i.e. 40 possible combinations or habitat codes), would I compose a presence/absence matrix for the first criterion, "largest habitat in the cell", like this?


       geol_1 geol_2 geol_3 geol_4 lu_1 lu_2 lu_3 lu_4 lu_5 soil_1 soil_2
cell1       1      0      0      0    0    1    0    0    0      1      0
cell2       0      0      1      0    1    0    0    0    0      0      1
...


This would then address one criterion, and the others would be similar, I presume, but how do I combine it all (with all criteria) in a data frame?

Thanks.
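
One possible way to build the presence/absence layout shown above for several criteria and bind it to the habitat count is sketched below. Everything here is an assumption for illustration: the `cells` table, the `h1_*`/`h2_*` column names (largest and second-largest habitat, split into their three component codes) and the `indicator` helper are hypothetical. The rarest habitat would be added the same way.

```r
# Hypothetical per-cell table; cell1 and cell2 match the h1 rows shown above
cells <- data.frame(
  cell       = c("cell1", "cell2", "cell3"),
  n_habitats = c(6, 3, 9),
  h1_geol = factor(c(1, 3, 2), levels = 1:4),
  h1_lu   = factor(c(2, 1, 5), levels = 1:5),
  h1_soil = factor(c(1, 2, 1), levels = 1:2),
  h2_geol = factor(c(2, 3, 4), levels = 1:4),
  h2_lu   = factor(c(2, 4, 5), levels = 1:5),
  h2_soil = factor(c(2, 2, 1), levels = 1:2)
)

# Hypothetical helper: one 0/1 column per factor level (full presence/absence)
indicator <- function(f, prefix) {
  m <- outer(as.integer(f), seq_along(levels(f)), "==") * 1L
  colnames(m) <- paste0(prefix, "_", levels(f))
  m
}

# Build one indicator block per categorical column, then bind everything
factor_cols <- setdiff(names(cells), c("cell", "n_habitats"))
blocks <- lapply(factor_cols, function(nm) indicator(cells[[nm]], nm))
x_raw  <- cbind(n_habitats = cells$n_habitats, do.call(cbind, blocks))
rownames(x_raw) <- cells$cell

# Drop columns that never vary (scale() would turn them into NaN), then scale
x <- scale(x_raw[, apply(x_raw, 2, sd) > 0, drop = FALSE])
```

Note that each categorical criterion contributes several columns while the habitat count contributes one, so the criteria do not automatically carry equal weight in the distance. Whether that needs rebalancing depends on the analysis.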