Say I have two datasets stored as parquets that I want to combine. I can read them in, rbind
them, then spit them back out into a parquet, like so:
# Load library
library(arrow)
# Create dummy datasets
write_dataset(mtcars, path = "~/foo", format = "parquet")
write_dataset(mtcars, path = "~/bar", format = "parquet")
# Read, combine, and write datasets
open_dataset("~/foo") |> collect() -> foo
open_dataset("~/bar") |> collect() -> bar
rbind(foo, bar) |> write_dataset(path = "~/foobar", format = "parquet")
That's great! Now, imagine that these datasets are so large that I don't have enough memory to hold both datasets in my R session. How would I go about combining these datasets into one?
You might be able to use the internal or granular functions in the arrow package to take apart and iterate over the data, but (1) I'm not certain it's possible, and (2) I don't think it's necessary. You can combine them virtually by calling open_dataset on the files from both directories:
# Create two small example datasets in separate directories
arrow::write_dataset(data.frame(A=1:2), "A.pq")
arrow::write_dataset(data.frame(A=3:4), "B.pq")
# List the parquet files inside both directories
list.files(c("A.pq", "B.pq"), full.names = TRUE)
# [1] "A.pq/part-0.parquet" "B.pq/part-0.parquet"
# Open all of those files as a single virtual dataset
ds <- arrow::open_dataset(list.files(c("A.pq", "B.pq"), full.names = TRUE))
collect(ds)
# A
# 1 1
# 2 2
# 3 3
# 4 4
That is, open_dataset can take as its first argument (sources=) one of the following (from ?arrow::open_dataset):
- a string path or URI to a directory containing data files
- a FileSystem that references a directory containing data files (such as what is returned by 's3_bucket()')
- a string path or URI to a single file
- a character vector of paths or URIs to individual data files
- a list of 'Dataset' objects as created by this function
- a list of 'DatasetFactory' objects as created by 'dataset_factory()'.
My method takes advantage of the fourth bullet. (open_dataset doesn't take a vector of multiple directories, which is why we need to intervene with list.files.)
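Alternatively, the fifth bullet suggests you can skip list.files entirely and pass a list of Dataset objects, opening each directory on its own and letting arrow union them. A minimal sketch, assuming the same A.pq and B.pq directories from above (and that their schemas are compatible):
# Open each directory as its own Dataset, then combine them into one
# virtual dataset via the "list of 'Dataset' objects" form of sources=
ds2 <- arrow::open_dataset(list(
  arrow::open_dataset("A.pq"),
  arrow::open_dataset("B.pq")
))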
Edit: the lazy evaluation of arrow (with dplyr's help) allows us to "load" all of the frames into one apparent object without pulling the data into memory. For instance, if we wanted just the even rows, using dplyr we would do
library(dplyr)
filter(ds, A %% 2 == 0)
# FileSystemDataset (query)
# A: int32
# * Filter: (subtract_checked(A, multiply_checked(2, floor(divide(cast(A, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}))))) == 0)
# See $.data for the source Arrow object
This still has not loaded the data into memory. For that we use collect(), which is the first time that the raw data in the parquet files is pulled into R.
filter(ds, A %% 2 == 0) %>%
collect()
# A
# 1 2
# 2 4
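And to write the combined data back out as one parquet dataset, it should not be necessary to collect() at all: write_dataset also accepts a Dataset (or a lazy query on one), so the rows can be streamed from the source files to the output files. A minimal sketch, where "foobar.pq" is just a made-up output directory:
# Stream the virtually-combined dataset to a new parquet dataset
# without collecting it into an R data.frame first
arrow::write_dataset(ds, path = "foobar.pq", format = "parquet")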