I built a script that works well with small datasets (<1 M rows) but performs very poorly with large ones. I've heard that data.table is more performant than tibbles. I'd like to learn about data.table as well as other speed optimizations.
I'll share a few commands from the script as examples. In each example, the datasets have 10 to 15 million rows and 10 to 15 columns.
      dataframe %>% 
      group_by(key_a, key_b, key_c,
               key_d, key_e, key_f,
               key_g, key_h, key_i) %>%
      summarize(min_date = min(date)) %>% 
      ungroup()
      merge(dataframe, 
          dataframe_two, 
          by = c("key_a", "key_b", "key_c",
               "key_d", "key_e", "key_f",
               "key_g", "key_h", "key_i"),
          all.x = TRUE) %>% 
      as_tibble()
      dataframe %>%
      left_join(dataframe_two, 
                  by = "key_a") %>%
      group_by(key_a, date.x) %>%
      summarise(key_z = key_z[which.min(abs(date.x - date.y))]) %>%
      arrange(date.x) %>%
      rename(day = date.x)
What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?
--
This is an example dataset
set.seed(1010)
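library("dplyr")      # provides %>% and mutate() used below
library("lubridate")  # provides ymd() and days() used below
library("conflicted")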
conflict_prefer("days", "lubridate")
bigint <- rep(
  sample(1238794320934:19082323109, 1*10^7)
)
key_a <-
  rep(c("green", "blue", "orange"), 1*10^7/2)
key_b <-
  rep(c("yellow", "purple", "red"), 1*10^7/2)
key_c <-
  rep(c("hazel", "pink", "lilac"), 1*10^7/2)
key_d <-
  rep(c("A", "B", "C"), 1*10^7/2)
key_e <-
  rep(c("D", "E", "F", "G", "H", "I"), 1*10^7/5)
key_f <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)
key_g <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)
key_h <-
  rep(c("tree", "plant", "animal", "forest"), 1*10^7/3)
key_i <-
  rep(c("up", "up", "left", "left", "right", "right"), 1*10^7/5)
sequence <- 
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "1 day")
date_sequence <-
  rep(sequence, 1*10^7/(length(sequence) - 1))
dataframe <-
  data.frame(
    bigint,
    date = date_sequence[1:(1*10^7)],
    key_a = key_a[1:(1*10^7)],
    key_b = key_b[1:(1*10^7)],
    key_c = key_c[1:(1*10^7)],
    key_d = key_d[1:(1*10^7)],
    key_e = key_e[1:(1*10^7)],
    key_f = key_f[1:(1*10^7)],
    key_g = key_g[1:(1*10^7)],
    key_h = key_h[1:(1*10^7)],
    key_i = key_i[1:(1*10^7)]
  )
dataframe_two <-
  dataframe %>%
      mutate(date_sequence = ymd(date_sequence) + days(1))
sequence_sixdays <-
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "6 days")
date_sequence <-
  rep(sequence_sixdays, 3*10^6/(length(sequence_sixdays) - 1))
key_z <-
  sample(1:10000000, 3*10^6)
dataframe_three <-
  data.frame(
    key_a = sample(key_a, 3*10^6),
    date = date_sequence[1:(3*10^6)],
    key_z = key_z[1:(3*10^6)]
  )
There are two options for processing very large datasets (> 10 GB) in R: use an integrated-environment package such as Rhipe to leverage the Hadoop MapReduce framework, or use RHadoop directly on a Hadoop distributed system.
data.table gets faster than dplyr as the number of groups and/or rows to group by increases. See the benchmarks by Matt on grouping 10 million to 2 billion rows (100 GB in RAM) with 100 to 10 million groups and varying grouping columns, which also compare against pandas.
Money-costing solution: one option is to buy a new computer with a more powerful CPU and more RAM, capable of handling the entire dataset. Alternatively, rent cloud resources and set up a cluster to handle the workload.
What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?
Use the data.table package:
library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)
Grouping by many columns is something data.table excels at.
See the bar chart at the very bottom of the second plot for a comparison against dplyr, Spark, and others for exactly this kind of grouping:
https://h2oai.github.io/db-benchmark
by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]
Note that I changed date to date_sequence; I think you meant that as a column name.
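The grouping call above can be tried on a toy table first. This is only a minimal sketch: it assumes two key columns instead of nine and uses made-up values, and the names `toy` and `toy_min` are hypothetical, not part of the original script.

```r
library(data.table)

# Toy stand-in for d1 (assumption: 2 key columns instead of 9, invented values)
toy = data.table(
  key_a = c("green", "green", "blue", "blue", "blue"),
  key_b = c("A", "A", "B", "B", "C"),
  date_sequence = as.Date(c("2010-01-05", "2010-01-02",
                            "2010-03-01", "2010-02-01", "2010-04-01"))
)

# Grouping columns passed as a character vector, as in the answer
by_cols = c("key_a", "key_b")

# One row per key combination, holding the earliest date in that group
toy_min = toy[, .(min_date = min(date_sequence)), by = by_cols]
print(toy_min)
```

Passing the grouping columns as a character vector keeps the call the same shape whether there are two keys or nine.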
It is unclear which fields you want to merge the tables on. dataframe_two does not have the specified fields, so the query is invalid.
Please clarify.
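Once the join fields are settled, a left join in data.table could look roughly like the sketch below. It assumes both tables share the key columns (only two are used here); the tables `left_dt` and `right_dt` and their values are hypothetical.

```r
library(data.table)

# Hypothetical tables sharing two key columns (assumption, not the real data)
left_dt  = data.table(key_a = c("green", "blue", "orange"),
                      key_b = c("A", "B", "C"),
                      x = 1:3)
right_dt = data.table(key_a = c("green", "blue"),
                      key_b = c("A", "B"),
                      y = c(10, 20))

join_cols = c("key_a", "key_b")

# merge() on data.tables; all.x = TRUE keeps every row of left_dt
m1 = merge(left_dt, right_dt, by = join_cols, all.x = TRUE)

# The same left join in data.table's own syntax: rows of left_dt looked up in right_dt
m2 = right_dt[left_dt, on = join_cols]

print(m1)
print(m2)
```

Both forms keep unmatched rows of `left_dt` with `NA` in `y`. `merge()` sorts the result by the join columns by default, while the `on=` form keeps the row order of `left_dt`.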
data.table has a very useful type of join called a rolling join, which does exactly what you need:
a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#   Join results in more than 2^31 rows (internal vecseq reached physical limit).
#   Very likely misspecified join. Check for duplicate key values in i each of
#   which join to the same group in x over and over again. If that's ok, try
#   by=.EACHI to run j for each group to avoid the large allocation. Otherwise,
#   please search for this error message in the FAQ, Wiki, Stack Overflow and
#   data.table issue tracker for advice.
It results in an error. The error is in fact very useful. On your real data the join may work perfectly fine, as the reason behind the error (the cardinality of matching rows) may be related to how the sample data was generated. It is very tricky to produce good dummy data for joins.
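To see what the rolling join does when each key and date pair is unique in the lookup table, here is a small sketch on made-up data. The tables `events` and `lookup` are hypothetical stand-ins for d1 and d2, and the `key_z` column is assumed to live in the lookup table.

```r
library(data.table)

# Hypothetical stand-in for d1: rows whose nearest match we want
events = data.table(
  key_a = c("green", "green", "blue"),
  date_sequence = as.Date(c("2010-01-03", "2010-01-09", "2010-01-05"))
)

# Hypothetical stand-in for d2: unique (key_a, date_sequence) pairs
lookup = data.table(
  key_a = c("green", "green", "blue", "blue"),
  date_sequence = as.Date(c("2010-01-01", "2010-01-10",
                            "2010-01-01", "2010-01-07")),
  key_z = c(1L, 2L, 3L, 4L)
)

# Exact match on key_a, nearest match on date_sequence (the last 'on' column)
res = lookup[events, on = c("key_a", "date_sequence"), roll = "nearest"]
print(res)
```

Each row of `events` gets the `key_z` of the closest lookup date within the same `key_a`, and the result has exactly as many rows as `events`, so there is no row explosion.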
If you get the same error on your real data, you may want to review the design of that query, as it attempts a many-to-many join that causes a row explosion, even when only a single date_sequence value is considered per match (taking roll into account). Strictly speaking, given the cardinalities of the join fields, I don't see this kind of question as valid for that data. You may want to introduce a data quality check layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.
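Such a duplicate check could be sketched as follows. The table `lookup_dt` and its values are hypothetical, and keeping the first row per key combination is only one possible resolution; the right rule depends on your data.

```r
library(data.table)

# Hypothetical lookup table with a duplicated (key_a, date_sequence) pair
lookup_dt = data.table(
  key_a = c("green", "green", "blue"),
  date_sequence = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01")),
  key_z = 1:3
)

join_cols = c("key_a", "date_sequence")

# TRUE if any combination of the join columns occurs more than once
has_dup = anyDuplicated(lookup_dt, by = join_cols) > 0L

# List the offending combinations with their counts
dup_keys = lookup_dt[, .N, by = join_cols][N > 1L]
print(dup_keys)

# One option (assumption): keep only the first row per combination
lookup_unique = unique(lookup_dt, by = join_cols)
```

Running a check like this before the join makes it clear whether the row explosion comes from the data or from the join specification.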