R function categorize by column?

Tags: function, r

I would like to write a function that takes a data frame, checks which columns hold a non-zero value in each row, and assigns each row a "Category" built from the names of those columns.

Taking this df as an example:

df <- data.frame(k1 = c(0,0,3,4,5,1), 
                 k2 = c(1,0,0,4,5,0), 
                 k3 = c(0,0,0,8,0,0), 
                 k4 = c(2,5,0,3,4,5))

I'd like the output to look like this:

df.final<-data.frame(k1 = c(0,0,3,4,5,1), 
                     k2 = c(1,0,0,4,5,0), 
                     k3 = c(0,0,0,8,0,0), 
                     k4 = c(2,5,0,3,4,5), 
                     Category = c("k2_k4","k4","k1","k1_k2_k3_k4","k1_k2_k4","k1_k4"))

Of course, my actual data has many more rows, and I'm hoping this function can be used on data frames with any number of columns. I'm just not sure how to write it. I'm new to writing functions!

shu251 asked Feb 04 '26 00:02


2 Answers

You can use data.table::transpose() to turn each row into a vector, then use sapply() to loop over the resulting list and paste together the names of the columns whose values are not zero:

df$category = sapply(data.table::transpose(df), 
                     function(r) paste0(names(df)[r != 0], collapse = "_"))

df
#  k1 k2 k3 k4    category
#1  0  1  0  2       k2_k4
#2  0  0  0  5          k4
#3  3  0  0  0          k1
#4  4  4  8  3 k1_k2_k3_k4
#5  5  5  0  4    k1_k2_k4
#6  1  0  0  5       k1_k4
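
To make the anonymous function above easier to follow, here is a minimal base R walk-through of the same logic for a single row. It is only a sketch: it uses unlist() on one row instead of data.table::transpose(), and the variable names r, mask and label are illustrative, not part of the original answer.

```r
df <- data.frame(k1 = c(0, 0, 3, 4, 5, 1),
                 k2 = c(1, 0, 0, 4, 5, 0),
                 k3 = c(0, 0, 0, 8, 0, 0),
                 k4 = c(2, 5, 0, 3, 4, 5))

# Take the first row as a plain numeric vector (the role `r` plays in sapply)
r <- unlist(df[1, ])

# Logical mask: TRUE wherever the value is non-zero
mask <- r != 0

# Keep the column names picked out by the mask and join them with "_"
label <- paste0(names(df)[mask], collapse = "_")
label
```

sapply() simply repeats these three steps for every row and collects the resulting strings into one character vector.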
Psidom answered Feb 05 '26 14:02


In base R, there are a lot of options. One (note that df > 0 flags only positive values; use df != 0 if the data can contain negatives):

df$Category <- apply(df > 0, 1, function(x){toString(names(df)[x])})

df
##   k1 k2 k3 k4       Category
## 1  0  1  0  2         k2, k4
## 2  0  0  0  5             k4
## 3  3  0  0  0             k1
## 4  4  4  8  3 k1, k2, k3, k4
## 5  5  5  0  4     k1, k2, k4
## 6  1  0  0  5         k1, k4

or to use underscores,

df$Category <- apply(df > 0, 1, function(x){paste(names(df)[x], collapse = '_')})

df
##   k1 k2 k3 k4    Category
## 1  0  1  0  2       k2_k4
## 2  0  0  0  5          k4
## 3  3  0  0  0          k1
## 4  4  4  8  3 k1_k2_k3_k4
## 5  5  5  0  4    k1_k2_k4
## 6  1  0  0  5       k1_k4
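
Since the question asks for a reusable function, the apply() approach could be wrapped roughly as follows. This is a sketch under stated assumptions: the name add_category and its arguments cols, sep and new_col are hypothetical, it tests for != 0 rather than > 0, and it treats NA as "not present".

```r
# Hypothetical helper; the name and arguments are illustrative only.
add_category <- function(data, cols = names(data), sep = "_",
                         new_col = "Category") {
  # Look only at the chosen columns, so an existing label column is ignored
  nonzero <- as.matrix(data[cols]) != 0
  # Assumption: a missing value counts as "not present"
  nonzero[is.na(nonzero)] <- FALSE
  # For each row, join the names of the columns flagged TRUE
  labels <- apply(nonzero, 1, function(x) paste(cols[x], collapse = sep))
  data[[new_col]] <- unname(labels)
  data
}

df <- data.frame(k1 = c(0, 0, 3, 4, 5, 1),
                 k2 = c(1, 0, 0, 4, 5, 0),
                 k3 = c(0, 0, 0, 8, 0, 0),
                 k4 = c(2, 5, 0, 3, 4, 5))

result <- add_category(df)
result$Category
```

Passing cols explicitly, for example add_category(df, cols = c("k1", "k2")), restricts the check to those columns, which helps when the data frame also holds ID or text columns.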

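A sort of interesting alternative is by_row(), which was originally part of purrr and has since been moved to the purrrlyr package:

library(purrr)     # provides %>%
library(purrrlyr)  # by_row() lives here in current releases

df %>% by_row(~toString(names(.)[.x > 0]), .collate = 'cols', .to = 'Category')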

## # A tibble: 6 × 5
##      k1    k2    k3    k4       Category
##   <dbl> <dbl> <dbl> <dbl>          <chr>
## 1     0     1     0     2         k2, k4
## 2     0     0     0     5             k4
## 3     3     0     0     0             k1
## 4     4     4     8     3 k1, k2, k3, k4
## 5     5     5     0     4     k1, k2, k4
## 6     1     0     0     5         k1, k4
alistaire answered Feb 05 '26 14:02