How to set up (calculate) divisions in dask dataframe?

Tags: python, dask

When loading data from parquet or CSV files, the resulting dataframe has its divisions set to None. The Dask docs have no information about how to set or calculate them.

How do I correctly set up and calculate the divisions of a Dask dataframe?

VadimCh asked Sep 05 '25 02:09


2 Answers

If you read from parquet, you can use infer_divisions=True, as in this example:

import dask.dataframe as dd
df = dd.read_parquet("file.parq", infer_divisions=True)

If you need to, you can also set an index directly while reading:

df = dd.read_parquet("file.parq", index="my_col",
                     infer_divisions=True)
rpanai answered Sep 07 '25 21:09


OK, I do:

divisions =[part_n for part_n in range(f.npartitions)]
f = f.set_index(f.index, divisions=divisions).persist()

Then I do:

f.groupby('userId').first().compute()

But the last operation is dramatically slow!

VadimCh answered Sep 07 '25 19:09