Pandas extractall merge

I'm not sure whether I should fix my regex pattern or post-process the result with pandas.

Here's a mock setup:

import re
import pandas as pd

regex = r"(?P<adv>This)|(?P<noun>test)"
texts = ["This is a test", "Random stuff with no match"]
series = pd.Series(texts)

I want to find all matches for the named groups (<adv>, <noun>; there are typically more than two). The groups are designed to be mutually exclusive, so I want a single row per input text that holds the captured string for each group, or NaN where the group did not match.

Current output: multi-index rows, only for texts that have a match

>>> print(series.str.extractall(regex))
          adv  noun
  match            
0 0      This   NaN
  1       NaN  test

Expected output: one row per input text, with the matches aggregated per group

          adv  noun
0        This  test
1         NaN   NaN

Any chance for a hand on this? Either fix the regex, or post-process with pandas. Thanks!

arnaud asked Jan 23 '26 21:01


1 Answer

You can try grouping on the first index level and taking the first non-null value per column:

series.str.extractall(regex).groupby(level=0).first()

    adv  noun
0  This  test
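
Note that the output above only contains texts with at least one match, whereas the question asks for one row per input text. A minimal sketch of one way to get the missing rows back, assuming the original index of series is what the result should follow, is to reindex against it (the names matches and result are only illustrative):

```python
import pandas as pd

regex = r"(?P<adv>This)|(?P<noun>test)"
texts = ["This is a test", "Random stuff with no match"]
series = pd.Series(texts)

# One row per (text, match) pair; texts without a match are absent.
matches = series.str.extractall(regex)

# first() keeps the first non-null value per column within each text,
# then reindex() restores the texts that had no match as all-NaN rows.
result = matches.groupby(level=0).first().reindex(series.index)
print(result)
```

Because first() keeps only the first non-null value per column, a group that matches several times in the same text would be reduced to its first capture; that seems acceptable here since the groups are described as exclusive.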
anky answered Jan 25 '26 10:01