
Python - Extract multiple values from string in pandas df

I've searched for an answer to this question but haven't found one yet. I have a large dataset that looks like this small example:

df =

A  B
1  I bought 3 apples in 2013
3  I went to the store in 2020 and got milk
1  In 2015 and 2019 I went on holiday to Spain
2  When I was 17, in 2014 I got a new car
3  I got my present in 2018 and it broke down in 2019

What I would like is to extract all values greater than 1950 and get this as the end result:

A  B                                                    C
1  I bought 3 apples in 2013                            2013
3  I went to the store in 2020 and got milk             2020
1  In 2015 and 2019 I went on holiday to Spain          2015_2019
2  When I was 17, in 2014 I got a new car               2014
3  I got my present in 2018 and it broke down in 2019   2018_2019

I tried to extract values first, but didn't get further than:

df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())

But all I get are error messages (I only started with Python and text processing a few weeks ago). Could someone help me?

asked Oct 20 '25 12:10 by Lotw



2 Answers

Here's one way, using str.findall and then joining the items from each resulting list that are greater than 1950:

s = df["B"].str.findall(r'\d+')  # one list of digit strings per row
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))

   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019
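
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same findall-and-filter idea. The helper name `years_after` and its `threshold` parameter are made up for illustration; they are not part of pandas.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})


def years_after(numbers, threshold=1950):
    """Join the digit strings whose integer value exceeds threshold.

    Hypothetical helper, named here only for readability.
    """
    return "_".join(n for n in numbers if int(n) > threshold)


# str.findall returns one list of digit strings per row;
# the helper then keeps only the values above the threshold.
df["C"] = df["B"].str.findall(r"\d+").apply(years_after)
print(df["C"].tolist())
```

Note that a row containing no number above the threshold ends up with an empty string in C, since joining an empty sequence yields ''.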
answered Oct 23 '25 03:10 by yatu



With a single regex pattern (considering your comment "need the year it took place"):

In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')

In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))

In [270]: df
Out[270]: 
   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019
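
To sanity-check which numbers the pattern accepts, it can be tried on plain strings without pandas. The sketch below reuses the pattern from this answer as is; the stricter variant with a trailing `\b` and the helper name `join_years` are my own additions, assuming you don't want the first four digits of a longer number to be picked up as a year.

```python
import re

# Pattern from the answer: 1951-1999, or any four digits starting with 2-9
pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')

# Assumed variant: the trailing \b rejects a match that is immediately
# followed by another digit or letter (e.g. the "2020" inside "20201")
pat_strict = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')


def join_years(text, pattern=pat):
    """Return all matches of pattern in text joined by underscores (hypothetical helper)."""
    return '_'.join(pattern.findall(text))


print(join_years("When I was 17, in 2014 I got a new car"))
print(join_years("In 1950 and 1951"))
print(repr(join_years("Order 20201 shipped")))
print(repr(join_years("Order 20201 shipped", pat_strict)))
```

Which variant fits depends on the data: the original pattern is more permissive, while the strict one only accepts standalone four-digit numbers.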
answered Oct 23 '25 03:10 by RomanPerekhrest



