When loading data from parquet or CSV files, the divisions are None. The Dask docs have no information about how to set and calculate them.
How do I correctly set up and calculate the divisions of a Dask dataframe?
If you read from parquet, you can use infer_divisions=True, as in this example:
import dask.dataframe as dd
df = dd.read_parquet("file.parq", infer_divisions=True)
If you need to, you can also set an index directly while reading:
df = dd.read_parquet("file.parq", index="my_col",
infer_divisions=True)
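For data read from CSV there is nothing comparable to infer_divisions, because the CSV reader cannot know how the index values are distributed across files. A common approach is to set an index on a column after reading, which lets Dask compute the divisions for you. The sketch below uses hypothetical file and column names:

import dask.dataframe as dd

# CSV-loaded frames start with unknown divisions
df = dd.read_csv("file*.csv")
print(df.divisions)        # (None, None, ..., None)

# set_index sorts/shuffles the data on that column and computes divisions
df = df.set_index("my_col")
print(df.divisions)        # sorted boundary values, length npartitions + 1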
OK, I do:
divisions = [part_n for part_n in range(f.npartitions)]
f = f.set_index(f.index, divisions=divisions).persist()
Then I do:
f.groupby('userId').first().compute()
But the last operation is dramatically slow!
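Note that divisions are not partition numbers: they must be a sorted sequence of index values marking the partition boundaries, with npartitions + 1 entries. A minimal sketch of getting meaningful divisions, assuming 'userId' is a column of f (names are illustrative, not from the original post):

# setting the index to the column you later group on lets Dask compute
# real boundary values; this performs one shuffle
f = f.set_index("userId").persist()
print(f.divisions)      # sorted userId boundary values
print(f.npartitions)    # len(f.divisions) == f.npartitions + 1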