I need to find duplicates in a column of a dask DataFrame.
In pandas there is the duplicated() method for this, but dask does not support it.
Q: What is the best way of getting all duplicated values in dask?
My idea:
Make the column I'm checking the index, then drop_duplicates, and then join.
Is there any better solution?
For example:
import dask.dataframe
import pandas

df = pandas.DataFrame(
    [
        ['a'],
        ['b'],
        ['c'],
        ['a'],
    ],
    columns=['col'],
)
df_test = dask.dataframe.from_pandas(df, npartitions=2)
# Expected to get dataframe with value 'a', as it appears twice
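For reference, this is the plain-pandas behaviour I want to reproduce (a minimal sketch using pandas' duplicated() on the same small frame):

import pandas

df = pandas.DataFrame(
    [['a'], ['b'], ['c'], ['a']],
    columns=['col'],
)
# duplicated() marks the second and later occurrences of each value
print(df[df['col'].duplicated()])  # Prints the row containing 'a'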
I've come up with the following solution:
import dask.dataframe as dd
import pandas

if __name__ == '__main__':
    df = pandas.DataFrame(
        [
            ['a'],
            ['b'],
            ['c'],
            ['a'],
        ],
        columns=["col-a"],
    )
    ddf = dd.from_pandas(df, npartitions=2)

    # Apparently the code below will fail if the dask DataFrame is empty
    if ddf.index.size.compute() != 0:
        # Setting the index repartitions the data, so all duplicates of a value
        # end up in the same partition
        indexed_df = ddf.set_index('col-a', drop=False)
        # Mark duplicate values within each partition; dask DataFrame does not
        # support duplicated()
        dups = indexed_df.map_partitions(lambda d: d.duplicated())
        # Select the duplicated rows using the boolean mask from the previous step
        duplicates = indexed_df[dups].compute().index.tolist()
        print(duplicates)  # Prints: ['a']
Can this be further improved?
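One alternative I'm considering (just a sketch, not benchmarked, assuming the same ddf with a 'col-a' column as above) is counting occurrences with a groupby and keeping the values that appear more than once:

counts = ddf.groupby('col-a').size()
# Values whose count is greater than one are duplicated
duplicated_values = counts[counts > 1].compute().index.tolist()
print(duplicated_values)  # Expected: ['a']

I have not checked whether this is actually cheaper than the set_index shuffle.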