Counting combinations without destroying type

Question

I wonder whether someone has an idea for how to count combinations like the following in a better way than I've thought of.

> library(lubridate)
> df <- data.frame(x=sample(now()+hours(1:3), 100, T), y=sample(1:4, 100, T))
> with(df, as.data.frame(table(x, y)))
                     x y Freq
1  2012-06-15 00:10:18 1    5
2  2012-06-15 01:10:18 1    9
3  2012-06-15 02:10:18 1    8
4  2012-06-15 00:10:18 2    9
5  2012-06-15 01:10:18 2   10
6  2012-06-15 02:10:18 2   12
7  2012-06-15 00:10:18 3    7
8  2012-06-15 01:10:18 3    9
9  2012-06-15 02:10:18 3    6
10 2012-06-15 00:10:18 4    5
11 2012-06-15 01:10:18 4   14
12 2012-06-15 02:10:18 4    6

I like that format, but unfortunately when we ran x and y through table(), they got converted to factors. In the final output they can exist quite nicely as their original type, but getting there seems problematic. Currently I just manually fix all the types afterward, which is really messy because I have to re-set the timezone, and look up the percent-codes for the default date format, etc. etc.

It seems like an efficient solution would involve hashing the objects, or otherwise mapping integers to the unique values of x and y so we can use tabulate(), then mapping back.
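For what it's worth, the integer-mapping idea above could be sketched in base R roughly as follows. This is only an illustration, not a tuned solution: the fixed seed and timestamp, and the names `ux`, `uy`, `code`, and `out`, are assumptions made up for the example.

```r
# Sketch only: integer-code each column, count with tabulate(), then map back.
set.seed(42)  # assumption: fixed seed and timestamp so the example is reproducible
t0 <- as.POSIXct("2012-06-15 00:10:18", tz = "UTC")
df <- data.frame(x = sample(t0 + 3600 * (0:2), 100, TRUE),
                 y = sample(1:4, 100, TRUE))

# Unique values of each column, kept in their original classes
ux <- sort(unique(df$x))
uy <- sort(unique(df$y))

# Integer position of every observation within the unique values
ix <- match(df$x, ux)
iy <- match(df$y, uy)

# One integer code per (x, y) combination, with x varying fastest
code <- (iy - 1L) * length(ux) + ix
freq <- tabulate(code, nbins = length(ux) * length(uy))

# Map the codes back to the original values; expand.grid() also varies
# its first argument fastest, so row k corresponds to code k
grid <- expand.grid(ix = seq_along(ux), iy = seq_along(uy))
out <- data.frame(x = ux[grid$ix], y = uy[grid$iy], Freq = freq)
```

Because `ux` and `uy` are only ever indexed, never converted, `out$x` stays POSIXct (time zone included) and `out$y` stays integer.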

Ideas?

Josh O'Brien · Accepted Answer

Here's a data.table version that preserves the column classes:

library(data.table)

dt <- data.table(df, key=c("x", "y"))
dt[, .N, by=key(dt)]
#                       x y  N
#  1: 2012-06-14 18:10:22 1  8
#  2: 2012-06-14 18:10:22 2 10
#  3: 2012-06-14 18:10:22 3  8
#  4: 2012-06-14 18:10:22 4  8
#  5: 2012-06-14 19:10:22 1  6
#  6: 2012-06-14 19:10:22 2  8
#  7: 2012-06-14 19:10:22 3  6
#  8: 2012-06-14 19:10:22 4  6
#  9: 2012-06-14 20:10:22 1 15
# 10: 2012-06-14 20:10:22 2  5
# 11: 2012-06-14 20:10:22 3 12
# 12: 2012-06-14 20:10:22 4  8

str(dt[, .N, by=key(dt)])
# Classes ‘data.table’ and 'data.frame':  12 obs. of  3 variables:
#  $ x: POSIXct, format: "2012-06-14 18:10:22" "2012-06-14 18:10:22" ...
#  $ y: int  1 2 3 4 1 2 3 4 1 2 ...
#  $ N: int  8 10 8 8 6 8 6 6 15 5 ...
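
If you'd rather not set a key, grouping on the columns directly with `by` should give the same counts, just in order of first appearance instead of sorted. A minimal sketch with made-up data (`dt2`, `x0`, and the values are hypothetical, chosen only so the result is easy to check by eye):

```r
library(data.table)

# Hypothetical toy data: two timestamps, three distinct (x, y) combinations
x0 <- as.POSIXct("2012-06-14 18:10:22", tz = "UTC")
dt2 <- data.table(x = x0 + 3600 * c(0, 0, 1, 1, 1),
                  y = c(1L, 1L, 2L, 2L, 3L))

# Count rows per (x, y) group without keying the table first
counts <- dt2[, .N, by = .(x, y)]
```

Here `counts` has one row per observed combination, and `x` and `y` keep their POSIXct and integer classes.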

Edit in response to follow-up question

To count the number of appearances of all possible combinations of the observed values (including combinations that don't appear in the data), you can do something like the following:

dt <- dt[1:30, ]  # Make a subset of dt in which some combinations don't appear

ii <- do.call("CJ", lapply(dt, unique))  # CJ() is similar to expand.grid()
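# In data.table 1.9.4 and later, by=.EACHI is needed to count per row of ii;
# older versions did this implicitly with dt[ii, .N].
dt[ii, .N, by=.EACHI]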
#                      x y N
# 1: 2012-06-14 22:53:05 1 8
# 2: 2012-06-14 22:53:05 2 7
# 3: 2012-06-14 22:53:05 3 9
# 4: 2012-06-14 22:53:05 4 5
# 5: 2012-06-14 23:53:05 1 1
# 6: 2012-06-14 23:53:05 2 0
# 7: 2012-06-14 23:53:05 3 0
# 8: 2012-06-14 23:53:05 4 0

Tags: r, data.table, combinations, factors

Ken Williams
