Let's say I want to find, for each purchase category, the top 3 most frequently occurring zip codes. In this example, the categories are home, townhouse, and condo. I have transactional data like:
set.seed(1234)
d <- data.table(purch_id = 1:3e6,
purch_cat = sample(x = c('home','townhouse','condo'),
size = 3e6, replace=TRUE),
purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE),
width = 5, format = "d", flag = "0") )
I know that I can do this:
# there has to be a better way...
d[,list(purch_count = length(purch_id)),
by=list(purch_cat, purch_zip)][, purch_rank := rank(-purch_count, ties.method='min'),
by=purch_cat][purch_rank<=3,][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
But this can't be the most elegant data.table approach, and it seems kind of slow.
Any suggestions to reduce the number of passes? Maybe something using table()? TYVM!
One approach would be
d[ , .N, by=.(purch_cat, purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ]
,by=purch_cat]
which gives
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: townhouse 83099 28
5: townhouse 33518 28
6: townhouse 59347 28
7: townhouse 22402 28
8: condo 39169 32
9: condo 15725 31
10: condo 75768 30
11: condo 72023 30
12: home 71294 30
13: home 56053 30
14: home 57971 29
15: home 77521 29
16: home 70124 29
17: home 25302 29
18: home 65292 29
19: home 39488 29
20: home 81754 28
21: home 43426 28
22: home 16943 28
23: home 88978 28
24: home 43003 28
25: home 76501 28
purch_cat purch_zip N
To achieve the OP's tie-breaking rule, one could do
d[ , .N, by=.(purch_cat,purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ][
.N - frank(N, ties.method='max') < 3 ]
, by=purch_cat]
which gives
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: condo 39169 32
5: condo 15725 31
6: condo 75768 30
7: condo 72023 30
8: home 71294 30
9: home 56053 30
10: home 57971 29
11: home 77521 29
12: home 70124 29
13: home 25302 29
14: home 65292 29
15: home 39488 29
Following @MichaelChirico's answer, this approach adds an frank step.
I think you were exactly on the right track. One key thing you're missing, however, is the function frank, which has been optimized and should speed up your code considerably (runs almost instantaneously on your 3m row sample data):
d[ , .(purch_count = .N),
by = .(purch_cat, purch_zip)
][, purch_rank := frank(-purch_count, ties.method = 'min'),
keyby = purch_cat
][purch_rank <= 3,
][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
table (slow):Yes, one way involves using table.
d[ , {x <- table(purch_zip)
x <- x[order(-x)]
names(x[x %in% unique(x)[1:3]])
}, keyby = purch_cat]
purch_cat V1
1: condo 39169
2: condo 15725
3: condo 72023
4: condo 75768
5: home 56053
6: home 71294
7: home 25302
8: home 39488
9: home 57971
10: home 65292
11: home 70124
12: home 77521
13: home 16943
14: home 43003
15: home 43426
16: home 76501
17: home 81754
18: home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
purch_cat V1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With