I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory.
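To give an idea, the coercion I mean looks roughly like this (a sketch; the column name and values are made up, not from my real data):

import pandas as pd

# Repetitive string columns compress well as categoricals: pandas stores
# integer codes plus one small lookup table of the unique strings.
df = pd.DataFrame({"A": ["cat", "dog", "cat"] * 3_000_000})
df["A"] = df["A"].astype("category")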
What I would like to do is compare each row with other rows and do a straight comparison of contents. For example, given
     A     B    C         D
0  cat  blue  old  Saturday
1  dog   red  old  Saturday
I would like to compute
        d_A    d_B   d_C   d_D
0, 0   True   True  True  True
0, 1  False  False  True  True
1, 0  False  False  True  True
1, 1   True   True  True  True
Obviously, combinatorial explosion precludes comparing every record with every other record. So we can instead use blocking, applying a groupby on, say, column A.
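To make the blocking concrete, here is a minimal sketch on a toy frame (the real data would be grouped the same way):

import pandas as pd

# Toy frame standing in for the real 9M-row data.
df = pd.DataFrame({"A": ["cat", "dog", "cat"],
                   "B": ["blue", "red", "green"]})

# Blocking: only rows sharing a value in column A are candidates for
# mutual comparison, so each block is far smaller than the full frame.
for key, block in df.groupby("A"):
    print(key, block.index.tolist())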
My question is: is there a way to do this in either pandas or dask that is faster than the following sequence:
For reference, assume I have access to a good number of cores (hundreds) and about 200 GB of memory.
The solution turned out to be using numpy in place of step 3). While we cannot create an outer comparison of every row with every other row at once, we can group by values in column A and do the outer comparison within each of those smaller groups.
The trick is then to use numpy.equal.outer(df1, df2).ravel(). When pandas objects are passed as inputs to a numpy ufunc in this way, the computation is vectorized and much faster (at least 30x in my case). For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
...                    "C": ["old", "old"], "D": ["Saturday", "Saturday"]})
>>> df
     A     B    C         D
0  cat  blue  old  Saturday
1  dog   red  old  Saturday
>>> result = pd.DataFrame(columns=["A", "B", "C", "D"],
...                       index=pd.MultiIndex.from_product([df.index, df.index]))
>>> result["A"] = np.equal.outer(df["A"], df["A"]).ravel()
>>> result
         A    B    C    D
0 0   True  NaN  NaN  NaN
  1  False  NaN  NaN  NaN
1 0  False  NaN  NaN  NaN
  1   True  NaN  NaN  NaN
You can repeat this for each column, or automate the process by looping over the columns of result, as in the sketch below.
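For instance, a minimal sketch of that loop (using .to_numpy() so the ufunc sees plain arrays; the frame layout is the toy example from above):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
                   "C": ["old", "old"], "D": ["Saturday", "Saturday"]})

result = pd.DataFrame(index=pd.MultiIndex.from_product([df.index, df.index]),
                      columns=df.columns)

# One vectorized outer comparison per column; ravel() flattens the 2x2
# boolean matrix in row-major order, matching the MultiIndex ordering.
for col in df.columns:
    values = df[col].to_numpy()
    result[col] = np.equal.outer(values, values).ravel()

print(result)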