I have the following sample dataframe:
No category problem_definition
175 2521 ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438 ['galley', 'work', 'table', 'stuck']
912 2698 ['cloth', 'stuck']
572 2521 ['stuck', 'coffee']
The problem_definition field has already been tokenized, with stop words removed.
I want to create a frequency distribution that outputs another Pandas dataframe:
1) with the frequency occurrence of each word in problem_definition 2) with the frequency occurrence of each word in problem_definition by category field
Sample desired output below for case 1):
text count
coffee 2
maker 1
brewing 1
properly 1
2 1
420 3
stuck 3
galley 1
work 1
table 1
cloth 1
Sample desired output below for case 2):
category text count
2521 coffee 2
2521 maker 1
2521 brewing 1
2521 properly 1
2521 2 1
2521 420 3
2521 stuck 1
1438 galley 1
1438 work 1
1438 table 1
1438 stuck 1
2698 cloth 1
2698 stuck 1
I tried the following code to accomplish 1):
from nltk.probability import FreqDist
import pandas as pd
fdist = FreqDist(df['problem_definition_stopwords'])
But it raises:
TypeError: unhashable type: 'list'
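The error occurs because the Series holds Python lists, and a frequency counter tries to hash each element it is given, so it ends up hashing each whole list. Flattening the token lists first avoids this. A minimal sketch with the sample data, using collections.Counter (which nltk's FreqDist subclasses, so the same call works with FreqDist):

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# Flatten the lists of tokens into one stream of words, then count words
# (passing df['problem_definition'] directly would count lists, hence the TypeError)
fdist = Counter(chain.from_iterable(df['problem_definition']))
print(fdist['420'])    # 3
print(fdist['stuck'])  # 3
```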
I have no idea how to do 2)
Using unnesting, I will work through this type of problem step by step; for reference, I have linked the original unnesting question here.
unnesting(df,['problem_definition'])
Out[288]:
problem_definition No category
0 coffee 175 2521
0 maker 175 2521
0 brewing 175 2521
0 properly 175 2521
0 2 175 2521
0 420 175 2521
0 420 175 2521
0 420 175 2521
1 galley 211 1438
1 work 211 1438
1 table 211 1438
1 stuck 211 1438
2 cloth 912 2698
2 stuck 912 2698
3 stuck 572 2521
3 coffee 572 2521
Then just do a regular groupby + size for case 2:
unnesting(df,['problem_definition']).groupby(['category','problem_definition']).size()
Out[290]:
category problem_definition
1438 galley 1
stuck 1
table 1
work 1
2521 2 1
420 3
brewing 1
coffee 2
maker 1
properly 1
stuck 1
2698 cloth 1
stuck 1
dtype: int64
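Since the desired output for case 2 is a DataFrame rather than a Series, the groupby result can be converted with reset_index(name='count'). A sketch using pandas' built-in explode (available since pandas 0.25) in place of the unnesting helper, with the column renamed to text to match the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# One row per token, then count (category, word) pairs and flatten to a DataFrame
case2 = (df.explode('problem_definition')
           .groupby(['category', 'problem_definition'])
           .size()
           .reset_index(name='count')
           .rename(columns={'problem_definition': 'text'}))
print(case2)
```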
For case 1, use value_counts:
unnesting(df,['problem_definition'])['problem_definition'].value_counts()
Out[291]:
stuck 3
420 3
coffee 2
table 1
maker 1
2 1
brewing 1
galley 1
work 1
cloth 1
properly 1
Name: problem_definition, dtype: int64
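To turn the value_counts Series into a DataFrame with the text/count columns shown in the desired output, rename the index axis and reset it. A sketch, again using pandas' built-in explode instead of the unnesting helper:

```python
import pandas as pd

df = pd.DataFrame({
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# value_counts gives a Series indexed by word; rename_axis + reset_index
# turns it into the two-column DataFrame the question asks for
case1 = (df.explode('problem_definition')['problem_definition']
           .value_counts()
           .rename_axis('text')
           .reset_index(name='count'))
print(case1)
```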
The self-defined unnesting function:
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten each list column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining (non-exploded) columns
    return df1.join(df.drop(explode, axis=1), how='left')
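For completeness: since pandas 0.25, the same single-column unnest is available natively as DataFrame.explode, so no helper is needed for this case:

```python
import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# One row per token, with the other columns repeated --
# the same long shape that unnesting(df, ['problem_definition']) produces
out = df.explode('problem_definition')
print(len(out))  # 16 rows: one per token
```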