I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory.
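To give an idea, the coercion I mean looks roughly like this (a sketch; the column name and values are made up, not from my real data):

import pandas as pd

# Repetitive string columns compress well as categoricals: pandas stores
# integer codes plus one small lookup table of the unique strings.
df = pd.DataFrame({"A": ["cat", "dog", "cat"] * 3_000_000})
df["A"] = df["A"].astype("category")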
What I would like to do is compare each row with other rows and do a straight comparison of contents. For example, given
     A     B    C         D
0  cat  blue  old  Saturday
1  dog   red  old  Saturday
I would like to compute
        d_A    d_B   d_C   d_D
0, 0   True   True  True  True
0, 1  False  False  True  True
1, 0  False  False  True  True
1, 1   True   True  True  True
Obviously, combinatorial explosion precludes comparing every record with every other record. So we can instead use blocking, applying a groupby on, say, column A.
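To make the blocking concrete, here is a minimal sketch on a toy frame (the real data would be grouped the same way):

import pandas as pd

# Toy frame standing in for the real 9M-row data.
df = pd.DataFrame({"A": ["cat", "dog", "cat"],
                   "B": ["blue", "red", "green"]})

# Blocking: only rows sharing a value in column A are candidates for
# mutual comparison, so each block is far smaller than the full frame.
for key, block in df.groupby("A"):
    print(key, block.index.tolist())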
My question is: is there a way to do this in either pandas or dask that is faster than the following sequence:
For reference, assume I have access to a good number of cores (hundreds) and about 200 GB of memory.
The solution turned out to be using numpy in place of step 3). While we cannot create an outer comparison of every row with every other row at once, we can group by values in column A and do the outer comparison within each of those smaller groups.
The trick is then to use numpy.equal.outer(df1, df2).ravel(). When pandas objects are passed as inputs to a numpy ufunc in this way, the computation is vectorized and much faster (at least 30x in my case). For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
...                    "C": ["old", "old"], "D": ["Saturday", "Saturday"]})
>>> df
     A     B    C         D
0  cat  blue  old  Saturday
1  dog   red  old  Saturday
>>> result = pd.DataFrame(columns=["A", "B", "C", "D"],
...                       index=pd.MultiIndex.from_product([df.index, df.index]))
>>> result["A"] = np.equal.outer(df["A"], df["A"]).ravel()
>>> result
         A    B    C    D
0 0   True  NaN  NaN  NaN
  1  False  NaN  NaN  NaN
1 0  False  NaN  NaN  NaN
  1   True  NaN  NaN  NaN
You can repeat this for each column, or automate the process by looping over the columns of result, as in the sketch below.
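For instance, a minimal sketch of that loop (using .to_numpy() so the ufunc sees plain arrays; the frame layout is the toy example from above):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
                   "C": ["old", "old"], "D": ["Saturday", "Saturday"]})

result = pd.DataFrame(index=pd.MultiIndex.from_product([df.index, df.index]),
                      columns=df.columns)

# One vectorized outer comparison per column; ravel() flattens the 2x2
# boolean matrix in row-major order, matching the MultiIndex ordering.
for col in df.columns:
    values = df[col].to_numpy()
    result[col] = np.equal.outer(values, values).ravel()

print(result)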