 

Remove keywords based on word counts from massive Python dataframe

I have a dataframe df with 41,000 rows containing thousands of words, for example like this:

column1                                                                   column2
better spotted better rights rights rights fresh fresh rights rights      2015
better rights reserved                                                    2016
better                                                                    2015
better horse                                                              2014

I computed the frequency of every word in df into df2 with this code:

df2 = df['column1'].str.split(expand=True).stack().value_counts()

and my df2 looks like this:

keywords  counts
spotted   1
better    5
fresh     2
rights    6
horse     1
reserved  1

How can I remove every keyword from df whose count in df2 is below 5, so that df looks like this:

column1                                           column2
better better rights rights rights rights rights  2015
better rights                                     2016
better                                            2015
better                                            2014

My initial attempt was to build a list of the keywords to remove from df2, like this:

ListKeywords = ['spotted', 'fresh', 'horse', 'reserved']

and then remove every word in ListKeywords from df with this code:

df['column1'] = df['column1'].apply(lambda x: ' '.join([word for word in x.split() if word not in ListKeywords]))

The problem is that I have 15,000 keywords with counts below 5. That means I would have to put all 15,000 of them into ListKeywords by hand, which is impractical.

Can anyone help me out of this frustration? Thank you.

Jack Zaki Zakiul Fahmi Jailani asked Feb 20 '26 07:02


1 Answer

Though you can create df2, it's easier to work with the stacked Series prior to the value_counts. This lets you filter it and then str.join back only the words you want to keep.

s = df['column1'].str.split(expand=True).stack()

# Keep only words with frequency above specified threshold
cutoff = 5
s = s[s.groupby(s).transform('size') >= cutoff]

# Alignment based on original Index
df['column1'] = s.groupby(level=0).agg(' '.join)

                                            column1  column2
0  better better rights rights rights rights rights     2015
1                                     better rights     2016
2                                            better     2015
3                                            better     2014
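One edge case to keep in mind (not covered above, and an assumption on my part): if every word in a row falls below the cutoff, that row drops out of the groupby entirely, and the assignment leaves NaN in 'column1'. A minimal self-contained sketch with toy data, using fillna('') to keep such rows as empty strings:

```python
import pandas as pd

# Toy data: 'banana' and 'cherry' each appear only once
df = pd.DataFrame({'column1': ['apple apple pear', 'apple banana',
                               'pear kiwi kiwi', 'cherry'],
                   'column2': [2015, 2016, 2015, 2014]})

# One word per row-position, indexed by the original row label
s = df['column1'].str.split(expand=True).stack()

# Keep only words whose total frequency meets the cutoff
cutoff = 2
s = s[s.groupby(s).transform('size') >= cutoff]

# Rejoin per original row; row 3 loses all its words and becomes NaN
df['column1'] = s.groupby(level=0).agg(' '.join)
df['column1'] = df['column1'].fillna('')
print(df)
```

Here row 3 ('cherry') ends up as an empty string rather than NaN, which keeps later string operations on 'column1' from failing.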

For your own knowledge: if you do work with the value_counts result, you can subset it and generate ListKeywords just by specifying the cutoff. However, since we already split the 'column1' Series to get the counts, re-doing that split here is fairly inefficient.

df2 = df['column1'].str.split(expand=True).stack().value_counts()

cutoff = 5
ListKeywords = df2[df2 >= cutoff].index
#Index(['rights', 'better'], dtype='object')

df['column1'].apply(lambda x: ' '.join([i for i in x.split(' ') if i in ListKeywords]))
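A small sketch of that same subset-and-filter idea, put together end to end. One detail worth noting (my addition, not from the answer above): membership tests against a plain list are O(n) per word, so converting the kept keywords to a Python set makes the apply much cheaper on a large dataframe:

```python
import pandas as pd

df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})

# Word frequencies across the whole column
df2 = df['column1'].str.split(expand=True).stack().value_counts()

# Keep only sufficiently frequent words; a set gives O(1) lookups
cutoff = 5
keep = set(df2[df2 >= cutoff].index)

df['column1'] = df['column1'].apply(
    lambda x: ' '.join(w for w in x.split() if w in keep))
print(df)
```

This produces the same filtered 'column1' as the groupby approach, at the cost of splitting the column twice.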

Starting Data

df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})
ALollz answered Feb 22 '26 21:02


