Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternatives for distinct(.keep_all = TRUE) in arrow?

I have a larger than memory arrow dataset created by open_dataset() from partitioned parquet files that I need to use distinct(.keep_all = TRUE) on. I need to keep the computation on disk, therefore I'm using arrow to speed things up and not crash my R session.

I want to keep rows that have distinct values in columns a and b.

Example

df <- tibble(a = c(1,1,2,2),
             b = c(1,1,2,1),
             c = c("x", "y", "z", "a")) %>%
  arrow_table()


df %>% 
  distinct(a, b, .keep_all = TRUE)

This results in: Error: distinct() with .keep_all = TRUE not supported in Arrow

Desired Output

An arrow dataset with the following values.

  a     b c    
  <dbl> <dbl> <chr>
1     1     1 x    
2     2     2 z    
3     2     1 a  

I see others have had similar questions but it doesn't seem like arrow plans to incorporate use of .keep_all (see closed issue).

base::duplicated() would also work but it's not supported by arrow either. Any thoughts on how to work around this without using collect() (which crashes my R session)? TIA!

like image 809
afro_omics Avatar asked Dec 04 '25 21:12

afro_omics


1 Answers

Back with an answer after many trials and tribulations. You can use to_duckdb() to use some dplyr verbs not supported by arrow.

df <- tibble(a = c(1,1,2,2),
             b = c(1,1,2,1),
             c = c("x", "y", "z", "a")) %>%
 arrow_table() %>%
 to_duckdb() %>% # swap to duckdb to use distinct
 distinct(a, b, .keep_all = TRUE) %>%
to_arrow() # back to arrow

# Check that this returns an arrow object
class(df)

[1] "RecordBatchReader" "ArrowObject"      
[3] "R6"  

# Print table to make sure that distinct(.keep_all = TRUE) worked
df %>% collect()

# A tibble: 3 × 3
      a     b c    
  <dbl> <dbl> <chr>
1     1     1 x    
2     2     1 a    
3     2     2 z  
like image 134
afro_omics Avatar answered Dec 07 '25 15:12

afro_omics



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!