Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging on columns with dask

I have a simple script currently written with pandas that I want to convert to dask dataframes.
In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask.

def merge_dfs(df1, df2, columns):
    merged = pd.merge(df1, df2, on=columns, how='inner')
...

How can I change this line to match to dask dataframes?

like image 459
Eliran Turgeman Avatar asked Oct 17 '25 19:10

Eliran Turgeman


1 Answers

The dask merge follows pandas syntax, so it's just substituting call to pandas with a call to dask.dataframe:

import dask.dataframe as dd

def merge_dfs(df1, df2, columns):
    merged = dd.merge(df1, df2, on=columns, how='inner')
# ...

The resulting dataframe, merged, will be a dask.dataframe and hence may need computation downstream. This will be done automatically if you are persisting the data to a file, e.g. with .to_csv or with .to_parquet.

If you will need the dataframe for some computation and if the data fits into memory, then calling .compute will create a pandas dataframe:

pandas_df = merged.compute()
like image 158
SultanOrazbayev Avatar answered Oct 20 '25 16:10

SultanOrazbayev