Combining two parquet datasets that are too large for memory

Say I have two datasets stored as parquets that I want to combine. I can read them in, rbind them, then spit them back out into a parquet, like so:

# Load libraries
library(arrow)
library(dplyr)

# Create dummy datasets
write_dataset(mtcars, path = "~/foo", format = "parquet")
write_dataset(mtcars, path = "~/bar", format = "parquet")

# Read, combine, and write datasets
open_dataset("~/foo") |> collect() -> foo
open_dataset("~/bar") |> collect() -> bar
rbind(foo, bar) |> write_dataset(path = "~/foobar", format = "parquet")

That's great! Now, imagine that these datasets are so large that I don't have enough memory to hold both datasets in my R session. How would I go about combining these datasets into one?

1 Answer

You might be able to use arrow's lower-level functions to take the data apart and iterate over it in chunks, but (1) I'm not certain that's possible, and (2) I don't think it's necessary.

You can combine them virtually by calling open_dataset on the files from both datasets at once:

# Write two small dummy datasets; each becomes a directory holding one parquet file
arrow::write_dataset(data.frame(A = 1:2), "A.pq")
arrow::write_dataset(data.frame(A = 3:4), "B.pq")

list.files(c("A.pq", "B.pq"), full.names = TRUE)
# [1] "A.pq/part-0.parquet" "B.pq/part-0.parquet"

# Open both files as one virtual dataset; nothing is read into memory yet
ds <- arrow::open_dataset(list.files(c("A.pq", "B.pq"), full.names = TRUE))
dplyr::collect(ds)
#   A
# 1 1
# 2 2
# 3 3
# 4 4

That is, open_dataset can take as its first argument (sources=) one of (from ?arrow::open_dataset):

  • a string path or URI to a directory containing data files

  • a FileSystem that references a directory containing data files (such as what is returned by 's3_bucket()')

  • a string path or URI to a single file

  • a character vector of paths or URIs to individual data files

  • a list of 'Dataset' objects as created by this function

  • a list of 'DatasetFactory' objects as created by 'dataset_factory()'.

and my method takes advantage of the fourth bullet. (It doesn't take a vector of multiple directories, which is why we need to intervene with list.files.)
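Because write_dataset() also accepts a Dataset (or a lazy query) as its input, you can stream the virtually-combined dataset straight back to disk, which is effectively the rbind-and-write from the question without ever collecting into R. A minimal sketch, reusing the question's ~/foo and ~/bar paths (the ~/foobar output path is just illustrative):

# Open both datasets as one virtual dataset, then write it out as a new
# parquet dataset; no collect() is needed, so the rows should not all be
# pulled into R at once
ds_all <- arrow::open_dataset(list.files(c("~/foo", "~/bar"), full.names = TRUE))
arrow::write_dataset(ds_all, path = "~/foobar", format = "parquet")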

Edit: arrow's lazy evaluation (with dplyr's help) lets us "load" all of the frames into an apparent object without actually reading the data into memory. For instance, if we wanted just the even rows, using dplyr we would do

library(dplyr)
filter(ds, A %% 2 == 0)
# FileSystemDataset (query)
# A: int32
# * Filter: (subtract_checked(A, multiply_checked(2, floor(divide(cast(A, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}))))) == 0)
# See $.data for the source Arrow object

This still has not loaded the data into memory. For that we use collect(), which is the first point at which the raw data in the parquet files is pulled into R.

filter(ds, A %% 2 == 0) %>%
  collect()
#   A
# 1 2
# 2 4
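The same laziness applies to aggregations: verbs like summarise() are translated into Arrow compute work, so only the small result ever comes back into R. A minimal sketch on the dummy ds from above (with A = 1:4, the even rows should give a count of 2 and a sum of 6; which aggregation functions are supported depends on your arrow version):

# Aggregate the even rows; arrow does the work, collect() returns one row
filter(ds, A %% 2 == 0) %>%
  summarise(n = n(), total = sum(A)) %>%
  collect()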