I have a dataframe df with 41,000 rows containing thousands of words, for example like this:
column1 column2
better spotted better rights rights rights fresh fresh rights rights 2015
better rights reserved 2016
better 2015
better horse 2014
and I computed the frequency of every word in df and stored it in df2 with this code:
df2 = df['column1'].str.split(expand=True).stack().value_counts()
and my df2 looks like this:
keywords counts
spotted 1
better 5
fresh 2
rights 6
horse 1
reserved 1
Then how do I remove from df all the keywords whose counts in df2 are below 5, so that df would look like this:
column1 column2
better better rights rights rights rights rights 2015
better rights 2016
better 2015
better 2014
My initial attempt was to make a list of keywords from df2 like this:
ListKeywords = ['spotted', 'fresh', 'horse', 'reserved']
and then simply remove every word in ListKeywords from df with this code:
df['column1'] = df['column1'].apply(lambda x: ' '.join([word for word in x.split() if word not in ListKeywords]))
Then I got stuck, because across all rows I have 15,000 keywords with counts below 5, which means I would have to put all 15,000 of them into ListKeywords, which is impractical.
Can anyone help me out of this frustration? Thank you.
Though you create df2, it's easier to work with the stacked Series prior to the value_counts. This allows you to filter it and then ' '.join back only the words you want to keep.
s = df['column1'].str.split(expand=True).stack()
# Keep only words with frequency above specified threshold
cutoff = 5
s = s[s.groupby(s).transform('size') >= cutoff]
# Alignment based on original Index
df['column1'] = s.groupby(level=0).agg(' '.join)
column1 column2
0 better better rights rights rights rights rights 2015
1 better rights 2016
2 better 2015
3 better 2014
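One thing to watch: in this sample every row keeps at least one word, but on 41,000 rows some rows may lose all of their words. Those rows are missing from the aggregated Series, so the assignment above leaves them as NaN; if you prefer empty strings there, a small follow-up like this should do it:
# Rows whose words all fell below the cutoff come back as NaN after the
# index alignment; replace them with empty strings.
df['column1'] = df['column1'].fillna('')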
For your own knowledge, if working with the value_counts Series, you can subset it and generate ListKeywords by specifying only the cutoff. However, we've already split the 'column1' Series to get the counts, so re-doing this calculation here is fairly inefficient.
df2 = df['column1'].str.split(expand=True).stack().value_counts()
cutoff = 5
ListKeywords = df2[df2 >= cutoff].index
#Index(['rights', 'better'], dtype='object')
df['column1'] = df['column1'].apply(lambda x: ' '.join([i for i in x.split() if i in ListKeywords]))
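If ListKeywords runs to ~15,000 entries, also note that membership tests against a pandas Index are slower than against a plain Python set, so converting it first usually speeds up the apply. A rough sketch of that variant (KeepWords is just an illustrative name):
# Convert the Index of frequent words to a set for fast membership tests.
KeepWords = set(df2[df2 >= cutoff].index)
df['column1'] = df['column1'].apply(lambda x: ' '.join(i for i in x.split() if i in KeepWords))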
Starting Data
import pandas as pd

df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})