Not sure whether I should fix my regex pattern or do more post-processing with pandas.
Here's a mock setup:
import re
import pandas as pd
regex = r"(?P<adv>This)|(?P<noun>test)"
texts = ["This is a test", "Random stuff with no match"]
series = pd.Series(texts)
I want to find all matches for the named groups (adv, noun; there are typically more than two). The groups are designed to be mutually exclusive, so I want exactly one row per input text, holding the captured string for each group, or NaN where a group did not match.
Current output: multi-index rows, only for texts that have a match
>>> print(series.str.extractall(regex))
          adv  noun
  match
0 0      This   NaN
  1       NaN  test
Expected output: one row per input text, with the matches aggregated per group
    adv  noun
0  This  test
1   NaN   NaN
Any chance for a hand on this? Either fix the regex, or post-process with pandas. Thanks!
You can try:
series.str.extractall(regex).groupby(level=0).first()
    adv  noun
0  This  test
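Note that extractall only returns rows for texts with at least one match, so the grouped result above has no row for the second text. A minimal sketch of one way to get the expected output, assuming the series has a unique index: reindex the grouped result against the original series index, so unmatched texts come back as all-NaN rows. The variable name result is only for illustration.

```python
import pandas as pd

regex = r"(?P<adv>This)|(?P<noun>test)"
texts = ["This is a test", "Random stuff with no match"]
series = pd.Series(texts)

result = (
    series.str.extractall(regex)  # one row per match, MultiIndex (text, match)
    .groupby(level=0)             # group the matches by original text index
    .first()                      # first non-null capture per group column
    .reindex(series.index)        # restore texts that had no match as NaN rows
)
print(result)
```

Because first() keeps only the first non-null value per column, this relies on each group matching at most once per text. If a group can match several times, something like .agg(lambda s: s.dropna().tolist()) in place of .first() would keep all captures instead.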