I have a pandas DataFrame whose columns hold lists of strings. Each list may contain one or more strings. Where a string contains more than one word, I'd like to split it into individual words, so that each list contains only single words. In the following DataFrame, only the sent_tags column has lists with multi-word strings.
DataFrame:
import pandas as pd
# None disables column-width truncation (-1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({"fruit_tags": [["'apples'", "'oranges'", "'pears'"], ["'melons'", "'peaches'", "'kiwis'"]], "sent_tags":[["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"], ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})
print(df)  
    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']
My attempt:
I decided to use word_tokenize from the NLTK library to break such strings into individual words. I do get the tokenized words for one selected position in each list, but I can't combine the tokens from every string back into a single list per row:
from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0    [sweeter, than, oranges]
1    [sweeter, than, peaches]
Name: sent_tags, dtype: object
Desired result:
    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']
Use a list comprehension that flattens each list while applying the string methods strip, lower and split:
s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])
Or:
s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]
df['sent_tags'] = s
print(df) 
                       fruit_tags  \
0  ['apples', 'oranges', 'pears']   
1  ['melons', 'peaches', 'kiwis']   
                                                        sent_tags  
0  [apples, sweeter, than, oranges, pears, sweeter, than, apples]  
1  [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]  
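If the nested comprehension is hard to read, the same flattening can be sketched with itertools.chain.from_iterable on a plain Python list. The helper name split_tags is hypothetical (not part of pandas or NLTK), and the sketch assumes every element is a string wrapped in literal single quotes, as in the sample data:

```python
from itertools import chain

def split_tags(tags):
    """Flatten a list of quoted phrases into one list of lowercase words.

    split_tags is a hypothetical helper name used only for illustration.
    """
    # For each phrase: drop the surrounding single quotes, lowercase it,
    # and split on whitespace. chain.from_iterable then concatenates the
    # per-phrase word lists into a single flat list.
    return list(chain.from_iterable(
        tag.strip("'").lower().split() for tag in tags
    ))

row = ["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"]
print(split_tags(row))
# ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
```

It should then be possible to apply it per row with df['sent_tags'].apply(split_tags). Note that str.split() only splits on whitespace, so punctuation stays attached to words; if that matters, word_tokenize could be swapped in for split().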