Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the optimal number of clusters from the clusGap function as an output?

I have a data frame with 2 variables and I want to use the clusGap function to find the number of clusters that would be optimal to use. This code has a similar result:

library(cluster)    
x <- as.vector(runif(100, 0, 1))
y <- as.vector(runif(100, 0, 1))
df <- data.frame(x, y)
gap_stat <- clusGap(df, FUN = kmeans, nstart = n,
                    K.max = 10, B = 50)
gap_stat

Result:

Clustering Gap statistic ["clusGap"] from call:
clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = n)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 1
          logW   E.logW           gap     SE.sim
 [1,] 2.569315 2.584217  0.0149021144 0.03210076
 [2,] 2.285049 2.284537 -0.0005116382 0.03231529
 [3,] 2.053193 2.033653 -0.0195399122 0.03282376
 [4,] 1.839085 1.835590 -0.0034952935 0.03443303
 [5,] 1.691219 1.708479  0.0172603348 0.03419994
 [6,] 1.585084 1.597277  0.0121935992 0.03440672
 [7,] 1.504763 1.496853 -0.0079104306 0.03422321
 [8,] 1.416176 1.405903 -0.0102731340 0.03371149
 [9,] 1.333721 1.323658 -0.0100626869 0.03245958
[10,] 1.253199 1.250366 -0.0028330498 0.03034140

As you can see in line 4, the optimal number of clusters is 1. I would like the function to have 1 as an output. I need the optimal number of outputs to be an object in the environment, such as n is 1.

like image 934
Marco Pastor Mayo Avatar asked Oct 25 '25 02:10

Marco Pastor Mayo


2 Answers

Typically such information is somewhere directly inside the object, like gap_stat$nc. To look for it str(gap_stat) would typically suffice.

In this case, however, the above strategy isn't enough. But the fact that you can see your number of interest in the output, means that print.clusGap (because the class of gap_stat is clusGap) will show how to obtain this number. So, inspecting cluster:::print.clusGap leads to

maxSE(f = gap_stat$Tab[, "gap"], SE.f = gap_stat$Tab[, "SE.sim"])
# [1] 1
like image 61
Julius Vainora Avatar answered Oct 26 '25 18:10

Julius Vainora


This may have been less transparent in the past, but you can actually specify the method directly:

nc <- maxSE(f         = gap_stat$Tab[,"gap"],
            SE.f      = gap_stat$Tab[,"SE.sim"],
            method    = "firstSEmax",
            SE.factor = 1)