I have a pandas dataframe that contains queries and counts for a given time period, and I'm hoping to turn it into a count of unique words across all queries. For example, if the dataframe contained the below:
query          count
foo bar        10
super          8 
foo            4
super foo bar  2
I'm looking to receive the below dataframe, where e.g. the word 'foo' appears exactly 16 times within the table:
word    count
foo     16
bar     12
super   10
I'm working with the below function, but it hardly seems like the optimal way to do this and it ignores the total count for each row.
from collections import Counter
import re

def _words(df):
    return Counter(re.findall(r'\w+', ' '.join(df['query'])))
Any help will be greatly appreciated.
Thanks in advance!
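Setup
For copy-paste, a minimal reconstruction of the sample frame from the question:
import pandas as pd

df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2]
})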
Option 1
df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
bar      12
foo      16
super    10
dtype: int64
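get_dummies(sep=' ') builds a 0/1 indicator column per word, and the transposed dot product with count then sums each row's count into every word that row contains. The intermediate indicator matrix for the sample frame:
df['query'].str.get_dummies(sep=' ')

   bar  foo  super
0    1    1      0
1    0    0      1
2    0    1      0
3    1    1      1
(Option 2 below uses the same matrix, just multiplying row-wise before summing the columns.)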
Option 2
df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()
bar      12
foo      16
super    10
dtype: int64
Option 3
numpy.bincount + pd.factorize
also highlighting the use of cytoolz.mapcat, which maps a function over a sequence and returns an iterator over the concatenated results. That's cool! (There's a short illustration after the output below.)
import pandas as pd, numpy as np, cytoolz

q = df['query'].values
c = df['count'].values

# split every query into words and encode each word as an integer code
f, u = pd.factorize(list(cytoolz.mapcat(str.split, q.tolist())))
# words per query, used to repeat each row's count once per word
l = np.core.defchararray.count(q.astype(str), ' ') + 1
# sum the repeated counts per word code, then label with the unique words
pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)
foo      16
bar      12
super    10
dtype: int64
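If cytoolz.mapcat or weighted np.bincount are unfamiliar, here is a minimal illustration with the sample queries (imports as above):
# mapcat maps str.split over the queries and chains the pieces into one lazy iterator
list(cytoolz.mapcat(str.split, ['foo bar', 'super', 'foo', 'super foo bar']))
['foo', 'bar', 'super', 'foo', 'super', 'foo', 'bar']

# bincount sums the weights that fall into each integer bin
np.bincount([0, 1, 0, 2], weights=[10, 8, 4, 2])
array([14.,  8.,  2.])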
Option 4
Absurd use of stuff... just use option 1.  
pd.DataFrame(dict(
    query=' '.join(df['query']).split(),                      # one row per word
    count=df['count'].repeat(df['query'].str.count(' ') + 1)  # each row's count repeated once per word
)).groupby('query')['count'].sum()
query
bar      12
foo      16
super    10
Name: count, dtype: int64
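All of these return a Series indexed by word. If you want the exact word/count frame from the question, one way (shown here with the Option 1 result) is:
res = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
res.sort_values(ascending=False).rename_axis('word').reset_index(name='count')

    word  count
0    foo     16
1    bar     12
2  super     10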