I have a dataframe with 120000 records and 19 variables, 2 of which are state and MonthlyIncome.
I have to create a new dataset with top 10(in terms of monthly income) customers from each state.
I tried a lot of options including data.table and dplyr and base but there is always something missing.
data.table :
x <- customer_any_360[,order(-dense_rank(MonthlyIncome))[1:10], by = state]
--- example I tried
Could someone please help, am still new to R and really struggling with this problem. Thanks in advance!!
If you want to use rank functions, one option is frank from data.table where you can specify the option in ties.method.
library(data.table)#v1.9.5+
setDT(customer_any_360)[, .SD[frank(-MonthlyIncome,
ties.method='dense') %in% 1:10], by = state]
Or even just order is sufficient
setDT(customer_any_360)[order(-MonthlyIncome), .SD[1:10], by = state]
Using dplyr, there are mutliple options, dense_rank, min_rank, top_n depending on what you wanted. Also, for filtering, slice or filter can be used.
library(dplyr)
customer_any_360 %>%
group_by(state) %>%
slice(dense_rank(-MonthlyIncome)[1:10])
Or using sqldf
library(sqldf)
sqldf('select * from customer_any_360 i
where rowid in
(select rowid from customer_any_360
where state = i.state
order by MonthlyIncome desc
limit 10)
order by i.state, i.MonthlyIncome desc')
Or using ave from base R
indx <- with(customer_any_360, ave(-MonthlyIncome, state,
FUN=function(x) rank(x, ties.method='first')) %in% 1:10)
customer_any_360[indx,]
EDIT: The frank option was edited with suggestions from @Arun
set.seed(24)
customer_any_360 <- data.frame(cust=1:120000, state= sample(state.abb,
120000, replace=TRUE), MonthlyIncome= sample(1000:6500, 120000,
replace=TRUE), stringsAsFactors=FALSE)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With