Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I improve the performance of my data cleaning code that currently uses ddply by using data.table?

I am trying to clean data using ddply but it is running very slowly on 1.3M rows.

Sample code:

#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T), 
                Consumption=sample(-20:20, num_rows, replace=T), 
                StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative? 
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)

I have written a function to remove rows where there is a consumption value in one row that is identical but negative to a consumption value in another row (for the same id).

#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one. 
clean_negatives <- function(x3){
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) <0, -1, 1) 
  x3 <- x3[0:copies,]
  x3$Consumption <- sgn*x3$Consumption
  x3$Neg <- NULL
  x3}

I then use ddply to apply that function to remove these erroneous rows in the data

ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
                    function(x){clean_negatives(x)})
proc.time() - ptm

I was hoping I could use data.table to make this go faster but I couldn't work out how to employ data.table to help.

With 1.3M rows, so far it is taking my desktop all day to compute and still hasn't finished.

like image 226
Oscar Smith Avatar asked Dec 12 '25 14:12

Oscar Smith


1 Answers

Your question asks about data.table implementation. So, I've shown it here. Your function could be drastically simplified as well. You can first get the sign by summing up Neg and then filter the table and then multiply Consumption by sign (as shown below).

require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented after Statquant's comment
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))

Benchmarking

  • The data with million rows:

    #Create Sample Data Frame
    num_rows <- 1e6
    df <- data.frame(id=sample(1:20, num_rows, replace=T), 
                    Consumption=sample(-20:20, num_rows, replace=T), 
                    StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
    df$EndDate <- df$StartDate + 90
    df$Neg <- ifelse(df$Consumption < 0, -1, 1)
    df$Consumption <- abs(df$Consumption)
    
  • The data.table function:

    FUN.DT <- function() {
        require(data.table)
        dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
        dt <- dt[, sign := sign(sum(Neg)), 
                   by = c("id", "StartDate", "EndDate", "Consumption")]
        dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], 
                   by=c("id", "StartDate", "EndDate", "Consumption")]
        dt.fil[, Consumption := (Consumption*sign)]
        dt.fil <- subset(dt.fil, select=-c(Neg, sign))
    }
    
  • Your function with ddply

    FUN.PLYR <- function() {
        require(plyr)
        clean_negatives <- function(x3) {
            copies <- abs(sum(x3$Neg))
            sgn <- ifelse(sum(x3$Neg) <0, -1, 1) 
            x3 <- x3[0:copies,]
            x3$Consumption <- sgn*x3$Consumption
            x3$Neg <- NULL
            x3
        }
        df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption), 
                               function(x) clean_negatives(x))
    }
    
  • Benchmarking with rbenchmark (with 1 run only)

    require(rbenchmark)
    benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
    
            test replications elapsed relative user.self sys.self user.child sys.child
    1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
    2 FUN.PLYR()            1 242.268   39.477   152.855    82.881         0         0
    

My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).

Note: I loaded the packages within the function in order to obtain the complete time to obtain the result. Also, for the same reason I converted the data.frame to data.table with keys inside the benchmarking function. This is therefore the minimum speed-up.

like image 67
Arun Avatar answered Dec 14 '25 03:12

Arun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!