I have the following sample dataframe:
No category problem_definition
175 2521 ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438 ['galley', 'work', 'table', 'stuck']
912 2698 ['cloth', 'stuck']
572 2521 ['stuck', 'coffee']
The problem_definition field has already been tokenized, with stop words removed.
I want to create a frequency distribution that outputs another Pandas dataframe:
1) with the frequency occurrence of each word in problem_definition 2) with the frequency occurrence of each word in problem_definition by category field
Sample desired output below for case 1):
text count
coffee 2
maker 1
brewing 1
properly 1
2 1
420 3
stuck 3
galley 1
work 1
table 1
cloth 1
Sample desired output below for case 2):
category text count
2521 coffee 2
2521 maker 1
2521 brewing 1
2521 properly 1
2521 2 1
2521 420 3
2521 stuck 1
1438 galley 1
1438 work 1
1438 table 1
1438 stuck 1
2698 cloth 1
2698 stuck 1
I tried the following code to accomplish 1):
from nltk.probability import FreqDist
import pandas as pd
fdist = FreqDist(df['problem_definition_stopwords'])
But it raises:
TypeError: unhashable type: 'list'
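The error occurs because the Series holds Python lists, and a frequency counter tries to hash each element it is given, so it ends up hashing each whole list. Flattening the token lists first avoids this. A minimal sketch with the sample data, using collections.Counter (which nltk's FreqDist subclasses, so the same call works with FreqDist):

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# Flatten the lists of tokens into one stream of words, then count words
# (passing df['problem_definition'] directly would count lists, hence the TypeError)
fdist = Counter(chain.from_iterable(df['problem_definition']))
print(fdist['420'])    # 3
print(fdist['stuck'])  # 3
```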
I have no idea how to do 2)
Using unnesting, I will work through this type of problem step by step; for reference, I have linked the original unnesting question here.
unnesting(df,['problem_definition'])
Out[288]:
problem_definition No category
0 coffee 175 2521
0 maker 175 2521
0 brewing 175 2521
0 properly 175 2521
0 2 175 2521
0 420 175 2521
0 420 175 2521
0 420 175 2521
1 galley 211 1438
1 work 211 1438
1 table 211 1438
1 stuck 211 1438
2 cloth 912 2698
2 stuck 912 2698
3 stuck 572 2521
3 coffee 572 2521
Then just do a regular groupby + size for case 2:
unnesting(df,['problem_definition']).groupby(['category','problem_definition']).size()
Out[290]:
category problem_definition
1438 galley 1
stuck 1
table 1
work 1
2521 2 1
420 3
brewing 1
coffee 2
maker 1
properly 1
stuck 1
2698 cloth 1
stuck 1
dtype: int64
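Since the desired output for case 2 is a DataFrame rather than a Series, the groupby result can be converted with reset_index(name='count'). A sketch using pandas' built-in explode (available since pandas 0.25) in place of the unnesting helper, with the column renamed to text to match the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# One row per token, then count (category, word) pairs and flatten to a DataFrame
case2 = (df.explode('problem_definition')
           .groupby(['category', 'problem_definition'])
           .size()
           .reset_index(name='count')
           .rename(columns={'problem_definition': 'text'}))
print(case2)
```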
For case 1, use value_counts:
unnesting(df,['problem_definition'])['problem_definition'].value_counts()
Out[291]:
stuck 3
420 3
coffee 2
table 1
maker 1
2 1
brewing 1
galley 1
work 1
cloth 1
properly 1
Name: problem_definition, dtype: int64
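To turn the value_counts Series into a DataFrame with the text/count columns shown in the desired output, rename the index axis and reset it. A sketch, again using pandas' built-in explode instead of the unnesting helper:

```python
import pandas as pd

df = pd.DataFrame({
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# value_counts gives a Series indexed by word; rename_axis + reset_index
# turns it into the two-column DataFrame the question asks for
case1 = (df.explode('problem_definition')['problem_definition']
           .value_counts()
           .rename_axis('text')
           .reset_index(name='count'))
print(case1)
```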
The self-defined unnesting function:
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten each list column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining (non-exploded) columns
    return df1.join(df.drop(explode, axis=1), how='left')
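For completeness: since pandas 0.25, the same single-column unnest is available natively as DataFrame.explode, so no helper is needed for this case:

```python
import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# One row per token, with the other columns repeated --
# the same long shape that unnesting(df, ['problem_definition']) produces
out = df.explode('problem_definition')
print(len(out))  # 16 rows: one per token
```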