
Split multi-word strings into individual words for Pandas series containing list of strings

I have a Pandas DataFrame whose column values are lists of strings. Each list may contain one or more strings, and some strings contain more than one word. I'd like to split those multi-word strings so that every list contains only single words. In the following DataFrame, only the sent_tags column has lists containing multi-word strings.

DataFrame:

import pandas as pd    
pd.set_option('display.max_colwidth', None)  # None shows full cell contents; -1 is deprecated since pandas 1.0
df = pd.DataFrame({"fruit_tags": [["'apples'", "'oranges'", "'pears'"], ["'melons'", "'peaches'", "'kiwis'"]], "sent_tags":[["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"], ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})
print(df)  

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']

My attempt:

I decided to use word_tokenize from the NLTK library to break such strings into individual words. I can get the tokenized words for one selected element of each list, but I cannot combine the tokens from all elements back into a single list per row:

from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0    [sweeter, than, oranges]
1    [sweeter, than, peaches]
Name: sent_tags, dtype: object

Desired result:

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']
asked by amanb, Oct 28 '25 07:10


1 Answer

Use a nested list comprehension to flatten each list, applying the string methods strip, lower and split to every element:

s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])

Or, with a plain list comprehension over the column instead of apply:

s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]

df['sent_tags'] = s

print(df) 
                       fruit_tags  \
0  ['apples', 'oranges', 'pears']   
1  ['melons', 'peaches', 'kiwis']   

                                                        sent_tags  
0  [apples, sweeter, than, oranges, pears, sweeter, than, apples]  
1  [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]  
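
For readers who find the nested comprehension hard to follow, here is a minimal sketch of the same logic written as an explicit loop. The helper name split_tags is hypothetical (it is not part of the answer above), and the sketch assumes every element is a string wrapped in single quotes, as in the question's sample data:

```python
import pandas as pd


def split_tags(tags):
    """Flatten a list of quoted tag strings into lowercase single words.

    Hypothetical helper: equivalent to the nested list comprehension,
    written out step by step.
    """
    words = []
    for tag in tags:
        # Drop the surrounding single quotes, lowercase the text,
        # then split on whitespace and append each word.
        words.extend(tag.strip("'").lower().split())
    return words


# Sample data copied from the question (only the column being changed).
df = pd.DataFrame({
    "sent_tags": [
        ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"],
        ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"],
    ]
})

df["sent_tags"] = df["sent_tags"].apply(split_tags)
print(df["sent_tags"].tolist())
```

Because str.split() with no arguments splits on any run of whitespace, extra spaces inside a tag do not produce empty strings. Punctuation stays attached to the words, so word_tokenize may still be preferable if the tags contain punctuation.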
answered by jezrael, Oct 30 '25 23:10


