How can I improve the performance of my data cleaning code that currently uses ddply by using data.table?

Question

I am trying to clean data using ddply but it is running very slowly on 1.3M rows.

Sample code:

#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T), 
                Consumption=sample(-20:20, num_rows, replace=T), 
                StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative? 
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)

I have written a function to remove rows where there is a consumption value in one row that is identical but negative to a consumption value in another row (for the same id).

#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one. 
clean_negatives <- function(x3){
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) <0, -1, 1) 
  x3 <- x3[0:copies,]
  x3$Consumption <- sgn*x3$Consumption
  x3$Neg <- NULL
  x3}

I then use ddply to apply that function to remove these erroneous rows in the data

ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
                    function(x){clean_negatives(x)})
proc.time() - ptm

I was hoping I could use data.table to make this go faster but I couldn't work out how to employ data.table to help.

With 1.3M rows, so far it is taking my desktop all day to compute and still hasn't finished.

Arun · Accepted Answer

Your question asks about data.table implementation. So, I've shown it here. Your function could be drastically simplified as well. You can first get the sign by summing up Neg and then filter the table and then multiply Consumption by sign (as shown below).

require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented after Statquant's comment
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))

Benchmarking

The data with million rows:

#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T), 
                Consumption=sample(-20:20, num_rows, replace=T), 
                StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)

The data.table function:

FUN.DT <- function() {
    require(data.table)
    dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
    dt <- dt[, sign := sign(sum(Neg)), 
               by = c("id", "StartDate", "EndDate", "Consumption")]
    dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], 
               by=c("id", "StartDate", "EndDate", "Consumption")]
    dt.fil[, Consumption := (Consumption*sign)]
    dt.fil <- subset(dt.fil, select=-c(Neg, sign))
}

Your function with ddply

FUN.PLYR <- function() {
    require(plyr)
    clean_negatives <- function(x3) {
        copies <- abs(sum(x3$Neg))
        sgn <- ifelse(sum(x3$Neg) <0, -1, 1) 
        x3 <- x3[0:copies,]
        x3$Consumption <- sgn*x3$Consumption
        x3$Neg <- NULL
        x3
    }
    df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption), 
                           function(x) clean_negatives(x))
}

Benchmarking with rbenchmark (with 1 run only)

require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")

        test replications elapsed relative user.self sys.self user.child sys.child
1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
2 FUN.PLYR()            1 242.268   39.477   152.855    82.881         0         0

My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).

Note: I loaded the packages within the function in order to obtain the complete time to obtain the result. Also, for the same reason I converted the data.frame to data.table with keys inside the benchmarking function. This is therefore the minimum speed-up.

How can I improve the performance of my data cleaning code that currently uses ddply by using data.table?

Tags:

performance

r

data.table

plyr

Oscar Smith

1 Answers

Arun

Recent Activity

Donate For Us

How can I improve the performance of my data cleaning code that currently uses ddply by using data.table?

Tags:

performance

r

data.table

plyr

Oscar Smith

1 Answers

Arun

Related questions

Recent Activity

Donate For Us