
I am using a k-modes model (mymodel) built from a data frame mydf1. I want to assign each row of a new data frame mydf2 to its nearest cluster of mymodel. Similar to this question - just with k-modes instead of k-means. The predict function of the flexclust package only works with numeric data, not categorical data.

A short example:

require(klaR)
set.seed(100)
mydf1 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mydf2 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mymodel <- klaR::kmodes(mydf1, modes = 5)
# Get mode centers
mycenters <- mymodel$modes
# Now I would want to predict which of the 5 clusters each row 
# of mydf2 would be closest to, e.g.:
# cluster2 <- predict(mycenters, mydf2)

Is there already a function which can predict with a k-modes model or what would be the simplest way to do that? Thanks!

sh_student

1 Answer


We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.

## From klaR::kmodes

distance <- function(mode, obj, weights) {
  # Unweighted case: simple matching distance, i.e. the number of
  # positions in which the object differs from the mode
  if (is.null(weights)) 
    return(sum(mode != obj))
  obj <- as.character(obj)
  mode <- as.character(mode)
  different <- which(mode != obj)
  n_mode <- n_obj <- numeric(length(different))
  for (i in seq(along = different)) {
    # Look up the frequency weight of each mismatching category
    weight <- weights[[different[i]]]
    names <- names(weight)
    n_mode[i] <- weight[which(names == mode[different[i]])]
    n_obj[i] <- weight[which(names == obj[different[i]])]
  }
  # (n_mode + n_obj)/(n_mode * n_obj) = 1/n_mode + 1/n_obj, so mismatches
  # involving frequent categories contribute less to the distance
  dist <- sum((n_mode + n_obj)/(n_mode * n_obj))
  return(dist)
}
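
For intuition (this small check is not part of the original answer): with weights = NULL the distance is just the number of positions in which a row and a mode disagree, i.e. the simple matching distance:

```r
# Simple matching distance between one mode and one observation:
# count the positions where the two character vectors differ
mode <- c(var1 = "3", var2 = "7", var3 = "7")
obj  <- c(var1 = "3", var2 = "5", var3 = "9")
sum(mode != obj)  # 2: var2 and var3 differ
```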

AssignCluster <- function(df, kmodesObj) {
  # For every row of df, compute its distance to each mode
  # and return the index of the closest mode
  apply(
    apply(df, 1, function(obj) {
      apply(kmodesObj$modes, 1, distance, obj, NULL)
    }),
    2, which.min)
}

AssignCluster(mydf2,mymodel)

[1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1

Please note that many rows will likely be equally far away from several clusters; which.min then returns the first of the tied clusters, i.e. the one with the lowest number.
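
To see how often that happens, a small self-contained sketch (count_ties is a hypothetical helper, not part of klaR) can count, for each row, how many modes tie at the minimal simple-matching distance; any count above 1 means which.min broke a tie:

```r
# For each row of df, count how many modes attain the minimal
# simple-matching distance (unweighted case only)
count_ties <- function(df, modes) {
  apply(df, 1, function(obj) {
    d <- apply(modes, 1, function(mode) sum(as.character(mode) != as.character(obj)))
    sum(d == min(d))
  })
}

# Tiny illustration with two modes: rows 1 and 2 each match one mode
# exactly, while row 3 is equally far from both modes
modes  <- data.frame(v1 = c("a", "b"), v2 = c("x", "y"))
newdat <- data.frame(v1 = c("a", "b", "c"), v2 = c("x", "y", "z"))
count_ties(newdat, modes)  # 1, 1, 2 (named by row)
```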

Julian_Hn
  • Thanks! When I use the `AssignCluster` function on my actual data frame (around 6000 rows), it returns cluster `1` for each row. Does that mean cluster `1` always minimizes the distance (maybe some other clusters do as well), but as `1` is the first cluster, it is always returned? I am a bit surprised by that, as the clusters are quite different, so I'm wondering how cluster `1` can always minimize the distance. – sh_student Sep 29 '20 at 09:25
  • 1
    I can't really see for your data, but for the synthetic data in your example all distances were very close for all clusters. So it might actually be the case, that they are all the same. I unfortunately am no expert in k-modes clustering. Another approach would actually be to train a classifier on the clustered data and use that to assign new data to the respective clusters. – Julian_Hn Sep 29 '20 at 09:27
  • Ahh ok, thanks for the info. Am I understanding correctly that another approach would be to train e.g. a random forest model on `mydf1` and the clusters for `mydf1`, and then use the random forest model to predict clusters for `mydf2`? – sh_student Sep 29 '20 at 09:30
  • 1
    Yes. That would be the idea. – Julian_Hn Sep 29 '20 at 09:31
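
The classifier idea from the last two comments could be sketched like this. As assumptions: nnet::multinom (which ships with R) stands in for a random forest, the cluster labels are simulated here rather than taken from a real kmodes fit (with the real model you would use factor(mymodel$cluster)), and the variables are built as factors over a fixed level set so the new data cannot contain levels unseen during training:

```r
library(nnet)  # multinom(); a random forest would be used the same way
set.seed(100)
lv <- as.character(1:20)
# Recreate the question's data, but as factors with a fixed level set
mydf1 <- data.frame(var1 = factor(sample(lv, 50, replace = TRUE), levels = lv),
                    var2 = factor(sample(lv, 50, replace = TRUE), levels = lv),
                    var3 = factor(sample(lv, 50, replace = TRUE), levels = lv))
mydf2 <- data.frame(var1 = factor(sample(lv, 50, replace = TRUE), levels = lv),
                    var2 = factor(sample(lv, 50, replace = TRUE), levels = lv),
                    var3 = factor(sample(lv, 50, replace = TRUE), levels = lv))
# Stand-in labels; with the real model this would be factor(mymodel$cluster)
cluster1 <- factor(sample(1:5, 50, replace = TRUE))

# Train the classifier on the clustered data ...
fit <- nnet::multinom(cluster1 ~ ., data = cbind(mydf1, cluster1), trace = FALSE)
# ... and use it to assign new rows to clusters
cluster2 <- predict(fit, newdata = mydf2)  # one predicted cluster per row
length(cluster2)  # 50
```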