Speed up fill function R

Question

I have a dataframe with missing values that I've written a function to fill using R 3.3.2

pkgs <- c("dplyr", "ggplot2", "tidyr", 'data.table', 'lazyeval')
lapply(pkgs, require, character.only = TRUE)

UID <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C')
Col1 <- c(1, 0, 0, 0, 1, 0, 0, 0)
df <- data.frame(UID, Col1)

Function to fill in Col1:

AggregatedColumns <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))

  # Creating new columns 
  DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
  DF <- DF %>% group_by_("UID") %>% sort(DF[[columnToUse]], decreasing = TRUE) %>% fill_(NewCol1)
  DF <- DF %>% group_by_("UID") %>% sort(DF$columnToUse, decreasing = TRUE) %>% fill_(NewCol1, .direction = 'up')
  DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])

  DF
}

I've pulled out this part of the function since this is the piece that is slowing down the function. I'm very new to writing functions and any advice on how/if this can be sped up would be appreciated. I've isolated the speed issue down to the fill_ part of the function.

What I am trying to do is pass a dummy variable from Col1 to New_Column and then forward fills to other same ID's. For example:

UID             Col1
John Smith        1
John Smith        0

Should become

UID             Col1  New_Column
John Smith        1      1
John Smith        0      1

EDITED FUNCTION I edited the function to fit with @HubertL suggestion. The function is still fairly slow, but hopefully with these edits the example is reproducible.

AggregatedColumns <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))

  # Creating new columns 
  DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
    DF <- DF %>% group_by_("UID") %>% fill_(NewCol1) %>% fill_(NewCol1, .direction = 'up')
  DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])

  DF
}

Desired output:

UID Col1 New
A    1    1 
A    0    1
A    0    1 
B    0    1
B    1    1
B    0    1
C    0    0
C    0    0

David Arenburg · Accepted Answer

First of all, here are few points:

You are needlessly calling ifelse (twice) while this function is very inefficient
You needlessly using inefficient function (by group) from external package (also twice) when you could simply vectorize the process with just base R.

Here's a simple one-liner without using any external packages that enhances performance by a factor of x72 (and probably much more for bigger data sets) on a 5e7 data set

AggregatedColumns2 <- function(DF, columnToUse, NewCol1) {
    # Setting up column names to use
    columnToUse <- deparse(substitute(columnToUse))
    NewCol1 <- deparse(substitute(NewCol1))

    # Creating the new column (one simple line)
    DF[[NewCol1]] <- as.integer(DF$UID %in% DF$UID[DF[[columnToUse]] == 1])

    # returning new data set back
    DF
}

Benchmark

set.seed(123)
library(stringi)
N <- 5e7
UID <- stri_rand_strings(N, 2)
Col1 <- sample(0:1, N, replace = TRUE)
df <- data.frame(UID, Col1)


system.time(res <- AggregatedColumns(df, Col1, NewCol1))
#   user  system elapsed 
# 198.67    3.94  203.07 

system.time(res2 <- AggregatedColumns2(df, Col1, NewCol1))
# user  system elapsed 
# 2.82    0.00    2.82

Now in order to compare those I will reorder both and convert to a matrix, because Hadleyverses packages add tons of unnecessary attributes (compare the mess created in str(res) vs the simple structure in str(res2))

identical(arrange(res, UID) %>% as.matrix, arrange(res2, UID) %>% as.matrix)
## [1] TRUE

Uwe · Answer

If speed is a concern, you may try this with data.table and na.locf() from the zoo package. LOCF means last observation carried forward.

library(data.table)
setDT(df)[Col1 != 0, New := Col1 ][, New := zoo::na.locf(New), UID][is.na(New), New := 0][]
#   UID Col1 New
#1:   A    1   1
#2:   A    0   1
#3:   A    0   1
#4:   B    0   1
#5:   B    1   1
#6:   B    0   1
#7:   C    0   0
#8:   C    0   0

This is just to give an idea. It still needs to be wrapped in a function call.

It assumes that value 0 in Col1 is considered as missing.

Speed up fill function R

Tags:

performance

function

r

vectorization

vino88

2 Answers

David Arenburg

Uwe

Recent Activity

Donate For Us

Speed up fill function R

Tags:

performance

function

r

vectorization

vino88

2 Answers

David Arenburg

Uwe

Related questions

Recent Activity

Donate For Us