
Using dictionaries to count word frequency in python dataframe

I have a dataframe composed of text job descriptions, and 3 empty columns

   index   job_description                 level_1      level_2        level_3
    0      this job requires masters in..    0             0              0
    1      bachelor degree needed for..      0             0              0
    2      ms is preferred or phd..          0             0              0

I'm trying to go through each job description string and count the frequency of each degree level that was mentioned in the job description. A sample output should look like this.

   index   job_description                 level_1      level_2        level_3
    0      this job requires masters in..    0             1              0
    1      bachelor degree needed for..      1             0              0
    2      ms is preferred or phd..          0             1              1

I created the dictionaries to do the comparison as seen below, but I'm not sure how to look for those words in the strings of the dataframe's "job_description" column and populate the level columns depending on whether the words exist or not.

my_dict_1 = dict.fromkeys(['bachelors', 'bachelor', 'ba', 'science degree',
                           'bs', 'engineering degree'], 1)
my_dict_2 = dict.fromkeys(['masters', 'ms', 'master'], 1)
my_dict_3 = dict.fromkeys(['phd','p.h.d'], 1)

I really appreciate the support on this..


1 Answer

How about something like this?

Since each of your three dictionaries corresponds to a different column you want to create, we can build another dictionary mapping the soon-to-be column names (as keys) to the strings to search for at each particular level (as values). Really, you don't even need dictionaries for the my_dict_<x> items - a set would work just as well - but it's not a huge deal:

>>> lookup = {'level_1': my_dict_1, 'level_2': my_dict_2, 'level_3': my_dict_3}
>>> lookup
{'level_1': {'bachelors': 1, 'bachelor': 1, 'ba': 1, 'science degree': 1, 'bs': 1, 'engineering degree': 1}, 'level_2': {'masters': 1, 'ms': 1, 'master': 1}, 'level_3': {'phd': 1, 'p.h.d': 1}}
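(As an aside, if you take the set route mentioned above, the same mapping can be built without the dummy 1 values, since they are only ever used for membership tests. A minimal equivalent:)

# Equivalent lookup built from plain sets -- the dict values were never
# used for anything except checking membership.
lookup = {
    'level_1': {'bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'},
    'level_2': {'masters', 'ms', 'master'},
    'level_3': {'phd', 'p.h.d'},
}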

Then go through each proposed column in the dictionary you just created and assign a new column with the output you want, checking per row whether at least one of the keywords in the corresponding my_dict_<x> object appears in the job description:

>>> for level, values in lookup.items():
...     df[level] = df['job_description'].apply(lambda x: 1 if any(v in x for v in values) else 0)
... 
>>> df
              job_description  level_1  level_2  level_3
0     masters degree required        0        1        0
1  bachelor's degree required        1        0        0
2    bachelor degree required        1        0        0
3                phd required        0        0        1
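(Side note, not part of the original answer: the same columns can be produced without apply by letting pandas do the substring matching, e.g. with str.contains and a regex alternation built from each level's keywords:)

import re

# Vectorised equivalent of the loop above: for each level, build a regex
# matching any of its keywords and flag rows that contain at least one.
for level, values in lookup.items():
    pattern = '|'.join(re.escape(v) for v in values)
    df[level] = df['job_description'].str.contains(pattern).astype(int)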

Another solution, using scikit-learn's CountVectorizer class, which counts the frequencies of tokens (words, basically) occurring in strings:

>>> from sklearn.feature_extraction.text import CountVectorizer

Specify a particular vocabulary - forget about all other words that aren't "academic credential" keywords:

>>> vec = CountVectorizer(vocabulary={value for level, values in lookup.items() for value in values})
>>> vec.vocabulary
{'master', 'p.h.d', 'ba', 'ms', 'engineering degree', 'masters', 'phd', 'bachelor', 'bachelors', 'bs', 'science degree'}

Fit that transformer to the text iterable, df['job_description']:

>>> result = vec.fit_transform(df['job_description'])

Taking a deeper look at the results:

>>> pd.DataFrame(result.toarray(), columns=vec.get_feature_names())
   ba  bachelor  bachelors  bs  engineering degree  master  masters  ms  p.h.d  phd  science degree
0   0         0          0   0                   0       0        1   0      0    0               0
1   0         1          0   0                   0       0        0   0      0    0               0
2   0         1          0   0                   0       0        0   0      0    0               0
3   0         0          0   0                   0       0        0   0      0    1               0
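One caveat worth flagging (my addition, not from the original answer): CountVectorizer's default tokenizer only produces single-word tokens, so multi-word vocabulary entries such as 'science degree' and 'engineering degree' will never be counted, and 'p.h.d' gets split apart on the dots. Widening ngram_range covers the two-word case; a minimal sketch under that assumption:

# Also generate two-word tokens so bigram vocabulary entries like
# 'science degree' can be matched; 'p.h.d' would still need a custom
# token_pattern or a preprocessing step.
vec = CountVectorizer(
    vocabulary={value for values in lookup.values() for value in values},
    ngram_range=(1, 2),
)
result = vec.fit_transform(df['job_description'])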

This last approach might require a bit more work if you want to get back to your level_<x> column structure, but I thought I'd just show it as a different way of thinking about encoding those datapoints.
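For completeness, here is one way (a sketch, reusing the lookup mapping from earlier) to fold those per-keyword counts back into the level_<x> columns:

# Rebuild the count matrix as a dataframe, then mark a level as present
# if any of its keywords got a nonzero count in that row.
# (On newer scikit-learn versions, use vec.get_feature_names_out() instead.)
counts = pd.DataFrame(result.toarray(), columns=vec.get_feature_names())
for level, values in lookup.items():
    df[level] = (counts[list(values)].sum(axis=1) > 0).astype(int)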


