When loading data from parquet or CSV files, the divisions are None. The Dask docs have no information about how to set and calculate them.
How do I correctly set up and calculate the divisions of a Dask dataframe?
If you read from parquet, you can use infer_divisions=True, as in this example:
import dask.dataframe as dd
df = dd.read_parquet("file.parq", infer_divisions=True)
If you need to, you can also set an index directly while reading:
df = dd.read_parquet("file.parq", index="my_col",
infer_divisions=True)
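For data read from CSV there is nothing comparable to infer_divisions, because the CSV reader cannot know how the index values are distributed across files. A common approach is to set an index on a column after reading, which lets Dask compute the divisions for you. The sketch below uses hypothetical file and column names:

import dask.dataframe as dd

# CSV-loaded frames start with unknown divisions
df = dd.read_csv("file*.csv")
print(df.divisions)        # (None, None, ..., None)

# set_index sorts/shuffles the data on that column and computes divisions
df = df.set_index("my_col")
print(df.divisions)        # sorted boundary values, length npartitions + 1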
OK, I do:
divisions = [part_n for part_n in range(f.npartitions)]
f = f.set_index(f.index, divisions=divisions).persist()
Then I do:
f.groupby('userId').first().compute()
But the last operation is dramatically slow!
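Note that divisions are not partition numbers: they must be a sorted sequence of index values marking the partition boundaries, with npartitions + 1 entries. A minimal sketch of getting meaningful divisions, assuming 'userId' is a column of f (names are illustrative, not from the original post):

# setting the index to the column you later group on lets Dask compute
# real boundary values; this performs one shuffle
f = f.set_index("userId").persist()
print(f.divisions)      # sorted userId boundary values
print(f.npartitions)    # len(f.divisions) == f.npartitions + 1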