
undo (flatten) tabulation

Tags: r

I have a large data.frame that actually is a table containing each factor combination with count per row. Here is a playground example:

> z <- data.frame(a=factor(c("x","x","x","y","y","y")),
                  b=factor(c("a","b","c","a","b","c")),
                  count=c(2,5,1,4,5,1))
> z
  a b count
1 x a     2
2 x b     5
3 x c     1
4 y a     4
5 y b     5
6 y c     1

In order to use the function DescTools::Lambda(), I must undo the tabulation and repeat each combination by the number of count. The rep function, however, produces an error:

> rep(z[,1:2], z$count)
Error in rep(z[, 1:2], z$count) : invalid 'times' argument

Can someone please suggest a correct way to achieve this?

asked Oct 29 '25 05:10 by cdalitz


2 Answers

You don't need to expand your table in the way you described: doing so may be computationally intensive for large data, and DescTools::Lambda() accepts tables. An easier way is to create a table with xtabs() and feed it into DescTools::Lambda().

Your example data were great, but they return a lambda of 0, so here I am using the example data provided in ?DescTools::Lambda to demonstrate that it works:

# data copied from ?DescTools::Lambda()
m <- as.table(cbind(c(1768,946,115), c(807,1387,438), c(189,746,288), c(47,53,16)))

# your data structure
z <- setNames(as.data.frame(m), c("a", "b", "count"))


DescTools::Lambda(xtabs(count ~ a + b, data = z))
#[1] 0.2076188

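If you did want to expand, the trick is to repeat the row numbers instead of the data and then use them to index the data.frame. (rep() treats a data.frame as a list of its columns, so z[,1:2] has length 2 and a 'times' vector of length 6 is invalid.) You could do that by: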

z[rep(seq_len(nrow(z)), z$count), c("a","b")]
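
As a rough check of the row-index approach, here is a minimal base R sketch using the example data from the question. The names idx and expanded are only illustrative and do not come from the original post. It expands the rows and then re-tabulates them to confirm that the original counts come back.

```r
# Example data from the question
z <- data.frame(a = factor(c("x", "x", "x", "y", "y", "y")),
                b = factor(c("a", "b", "c", "a", "b", "c")),
                count = c(2, 5, 1, 4, 5, 1))

# Repeat each row number as often as its count, then index the rows
idx <- rep(seq_len(nrow(z)), times = z$count)
expanded <- z[idx, c("a", "b")]
rownames(expanded) <- NULL  # drop the duplicated row labels (1, 1.1, ...)

# One row per observation, so the row count equals the total count
nrow(expanded) == sum(z$count)

# Re-tabulating the expanded data should reproduce the original table
all(table(expanded) == xtabs(count ~ a + b, data = z))
```

Both comparisons should evaluate to TRUE. The expanded data.frame can then be passed on as two factor vectors, for example expanded$a and expanded$b.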

answered Oct 31 '25 18:10 by jpsmith


Two solutions that are more explicit about the columns involved:

library(dplyr)
z |> 
  reframe(across(c(a, b), ~ rep(.x, count)))

library(data.table)
z |> 
  as.data.table() |> 
  _[,
    lapply(.SD, \(xs) rep(xs, count)),
    .SDcols = c("a", "b")
  ]

Note the speed differences, though:

Unit: microseconds
      expr      min        lq       mean    median        uq      max neval
      base   32.676   43.2020   62.08967   67.7315   74.9065  132.756   500
 datatable  382.196  438.1415  490.71282  483.7960  522.4790 1278.699   500
      tidy 1380.911 1455.9130 1617.94013 1508.0150 1586.0380 8495.531   500

(version "base" being @Roland's solution)

answered Oct 31 '25 18:10 by I_O


