I need to find duplicates in a column of a dask DataFrame.
In pandas there is the duplicated() method for this, but dask does not support it.
Q: What is the best way of getting all duplicated values in dask?
My idea:
Make the column I'm checking the index, then drop_duplicates, and then join.
Is there any better solution?
For example:
import dask.dataframe
import pandas

df = pandas.DataFrame(
    [
        ['a'],
        ['b'],
        ['c'],
        ['a'],
    ],
    columns=['col'],
)
df_test = dask.dataframe.from_pandas(df, npartitions=2)
# Expected to get dataframe with value 'a', as it appears twice
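For reference, this is the plain-pandas behaviour I want to reproduce (a minimal sketch using pandas' duplicated() on the same small frame):

import pandas

df = pandas.DataFrame(
    [['a'], ['b'], ['c'], ['a']],
    columns=['col'],
)
# duplicated() marks the second and later occurrences of each value
print(df[df['col'].duplicated()])  # Prints the row containing 'a'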
I've come up with the following solution:
import dask.dataframe as dd
import pandas

if __name__ == '__main__':
    df = pandas.DataFrame(
        [
            ['a'],
            ['b'],
            ['c'],
            ['a'],
        ],
        columns=["col-a"],
    )
    ddf = dd.from_pandas(df, npartitions=2)

    # Apparently the code below will fail if the dask DataFrame is empty
    if ddf.index.size.compute() != 0:
        # Setting the index repartitions the data, so all duplicates of a value
        # end up in the same partition
        indexed_df = ddf.set_index('col-a', drop=False)
        # Mark duplicate values within each partition; dask DataFrame does not
        # support duplicated()
        dups = indexed_df.map_partitions(lambda d: d.duplicated())
        # Select the duplicated rows using the boolean mask from the previous step
        duplicates = indexed_df[dups].compute().index.tolist()
        print(duplicates)  # Prints: ['a']
Can this be further improved?
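One alternative I'm considering (just a sketch, not benchmarked, assuming the same ddf with a 'col-a' column as above) is counting occurrences with a groupby and keeping the values that appear more than once:

counts = ddf.groupby('col-a').size()
# Values whose count is greater than one are duplicated
duplicated_values = counts[counts > 1].compute().index.tolist()
print(duplicated_values)  # Expected: ['a']

I have not checked whether this is actually cheaper than the set_index shuffle.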