I have a dataframe df with 41,000 rows containing thousands of words, for example like this:
column1 column2
better spotted better rights rights rights fresh fresh rights rights 2015
better rights reserved 2016
better 2015
better horse 2014
and I computed the frequency of every word in df and stored it in df2 with this code:
df2 = df['column1'].str.split(expand=True).stack().value_counts()
and my df2 looks like this:
keywords counts
spotted 1
better 5
fresh 2
rights 6
horse 1
reserved 1
Then how do I remove from df all the keywords whose counts in df2 are below 5, so that df would look like this:
column1 column2
better better rights rights rights rights rights 2015
better rights 2016
better 2015
better 2014
My initial attempt was to make a list of keywords from df2 like this:
ListKeywords = ['spotted', 'fresh', 'horse', 'reserved']
and then simply remove every word in ListKeywords from df with this code:
df['column1'] = df['column1'].apply(lambda x: ' '.join([word for word in x.split() if word not in ListKeywords]))
Then I got stuck, because across all rows I have 15,000 keywords with counts below 5, which means I would have to put all 15,000 of them into ListKeywords, which is impractical.
Can anyone help me out of this frustration? Thank you.
Though you create df2, it's easier to work with the stacked Series prior to the value_counts. This allows you to filter it and then ' '.join back only the words you want to keep.
s = df['column1'].str.split(expand=True).stack()
# Keep only words with frequency above specified threshold
cutoff = 5
s = s[s.groupby(s).transform('size') >= cutoff]
# Alignment based on original Index
df['column1'] = s.groupby(level=0).agg(' '.join)
column1 column2
0 better better rights rights rights rights rights 2015
1 better rights 2016
2 better 2015
3 better 2014
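One thing to watch: in this sample every row keeps at least one word, but on 41,000 rows some rows may lose all of their words. Those rows are missing from the aggregated Series, so the assignment above leaves them as NaN; if you prefer empty strings there, a small follow-up like this should do it:
# Rows whose words all fell below the cutoff come back as NaN after the
# index alignment; replace them with empty strings.
df['column1'] = df['column1'].fillna('')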
For your own knowledge, if working with the value_counts Series, you can subset it and generate ListKeywords by specifying only the cutoff. However, we've already split the 'column1' Series to get the counts, so re-doing this calculation here is fairly inefficient.
df2 = df['column1'].str.split(expand=True).stack().value_counts()
cutoff = 5
ListKeywords = df2[df2 >= cutoff].index
#Index(['rights', 'better'], dtype='object')
df['column1'] = df['column1'].apply(lambda x: ' '.join([i for i in x.split() if i in ListKeywords]))
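If ListKeywords runs to ~15,000 entries, also note that membership tests against a pandas Index are slower than against a plain Python set, so converting it first usually speeds up the apply. A rough sketch of that variant (KeepWords is just an illustrative name):
# Convert the Index of frequent words to a set for fast membership tests.
KeepWords = set(df2[df2 >= cutoff].index)
df['column1'] = df['column1'].apply(lambda x: ' '.join(i for i in x.split() if i in KeepWords))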
Starting Data
import pandas as pd

df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})