
R scale function with character variable

I'm relatively new to R, and I'm having trouble figuring out how to scale a dataset that contains a character variable.

When I try to use the scale function to create a data frame, I get an error:

 df<-scale(USArrests)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Is there a way to create a dataframe with a character variable to later use it in a cluster analysis?

km.res<-kmeans(df,4,nstart=10)
MJ_vdH asked Jun 13 '26 18:06



1 Answer

?scale says that scale() is designed to center and/or scale the columns of a numeric matrix; see the help entry for further details. To store the built-in dataset as an object named df, df <- USArrests is sufficient (the object then appears in your environment). Compare the following:

df <- USArrests
# compare
head(df, n=5)
# to 
df1 <- scale(df)
head(df1, n=5)

As you can see, all numeric columns are now scaled, while the row names (Alabama, ..., Wyoming) do not change because they are labels and not a column of the data. To check the class of every variable, you can use lapply(df, class).
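If your own data does contain a character column, one possible approach is to scale only the numeric columns and keep the labels as row names. The sketch below is an assumption about what such data might look like: df_chr and its State column are hypothetical and are built from USArrests purely for illustration.

```r
# Hypothetical data frame: USArrests plus a character column with the state name
df_chr <- data.frame(State = rownames(USArrests), USArrests,
                     stringsAsFactors = FALSE)

# scale(df_chr) would fail here because State is not numeric,
# so first find out which columns are numeric
num_cols <- vapply(df_chr, is.numeric, logical(1))

# Scale only the numeric columns; the result is a numeric matrix
scaled <- scale(df_chr[, num_cols])

# Keep the labels as row names so they are preserved but not used in the computation
rownames(scaled) <- df_chr$State

head(scaled, n = 5)
```

The resulting matrix can be passed to kmeans() in the same way as df1, and the state names remain available through rownames(scaled).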

You should then be able to call km.res <- kmeans(df1, 4, nstart = 10) without problems. To inspect the resulting object, type km.res.
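Besides printing the whole object, you can look at its individual components. A minimal sketch, assuming the scaled USArrests data and the 4 clusters from the question:

```r
# Scale the built-in data and run k-means with 4 centers and 10 random starts
df1 <- scale(USArrests)
km.res <- kmeans(df1, centers = 4, nstart = 10)

km.res$size           # number of states in each cluster
km.res$centers        # cluster centers, in scaled units
head(km.res$cluster)  # cluster assignment for the first few states
```

The cluster labels themselves (1 to 4) are arbitrary and may be numbered differently from run to run unless a seed is set.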

Before running kmeans(), I would recommend having another look at its help page (e.g. help(kmeans)) to get familiar with the arguments centers, iter.max, nstart, and so on. It is also worth investigating whether or not the data should be standardized in the previous step; note that scale() by default both centers each column and divides it by its standard deviation. In any case, it is possible to run kmeans() with scaled (df1) and unscaled (df) data. Understanding why one of those alternatives is more appropriate for your analysis is of major importance.
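To get a feeling for how much this choice matters, you could compare the two clusterings directly. This is only a sketch: the seed value and the object names km.raw and km.std are arbitrary choices made for illustration.

```r
# Variances of the raw columns: Assault is far larger than the others,
# so it dominates Euclidean distances on the unscaled data
col_var <- sapply(USArrests, var)
round(col_var, 1)

set.seed(1)  # arbitrary seed, only so that the comparison is repeatable

# k-means on the raw data and on the standardized data
km.raw <- kmeans(USArrests, centers = 4, nstart = 10)
km.std <- kmeans(scale(USArrests), centers = 4, nstart = 10)

# Cross-tabulate the two assignments to see where they disagree
table(raw = km.raw$cluster, scaled = km.std$cluster)
```

If every row of the table had a single non-zero cell, the two solutions would be the same up to relabeling; otherwise scaling has changed which states are grouped together.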

EDIT: It is recommended to set a seed (e.g. set.seed(09102021)) before running the algorithm. By doing so you ensure the reproducibility of results.
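A quick way to convince yourself of this is to run the algorithm twice with the same seed and compare the results. A minimal sketch (the names km.a and km.b are only illustrative; R reads 09102021 as the number 9102021, so the leading zero has no effect):

```r
df1 <- scale(USArrests)

# Same seed before each run, so the random starting centers are the same
set.seed(9102021)
km.a <- kmeans(df1, centers = 4, nstart = 10)

set.seed(9102021)
km.b <- kmeans(df1, centers = 4, nstart = 10)

# TRUE when both runs assign every state to the same cluster
identical(km.a$cluster, km.b$cluster)
```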

