 

tokenize string based on self-defined dictionary

I would like to tokenize a list of strings according to my self-defined dictionary.

The list of strings looks like this:

lst = ['vitamin c juice', 'organic supplement'] 

The self-defined dictionary:

dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

My expected result:

vitamin c juice    --> [(3, 1), (1, 1)]
organic supplement --> [(0, 1), (2, 1)]

My current code:

import gensim
import gensim.corpora as corpora
from gensim.utils import tokenize
dct = corpora.Dictionary([list(x) for x in tup_list]) 
corpus = [dct.doc2bow(text) for text in [s for s in lst]]

The error message I got is: TypeError: doc2bow expects an array of unicode tokens on input, not a single string. However, I do not want to simply tokenize "vitamin c" into vitamin and c. Instead, I want to tokenize based on the words already in my dct, so it should stay as vitamin c.
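
For reference, doc2bow accepts a list of token strings, and those tokens may themselves contain spaces, so a pre-split list such as ['vitamin c', 'juice'] is valid input. A minimal sketch of that call (note that gensim assigns its own integer ids, which will not necessarily match the ids in dct above):

import gensim.corpora as corpora

# pre-tokenized documents: multi-word terms kept as single tokens
texts  = [['vitamin c', 'juice'], ['organic', 'supplement']]
gdct   = corpora.Dictionary(texts)               # gensim picks its own ids
corpus = [gdct.doc2bow(text) for text in texts]
print(corpus)   # e.g. [[(0, 1), (1, 1)], [(2, 1), (3, 1)]]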

Asked May 10 '26 by Anita

1 Answer

You will first have to invert your dictionary so that the keywords become the keys. Then you can use a regular expression to break each list entry into keywords, and finally look each keyword up in the inverted dictionary to find the corresponding token index.

For example:

lst = ['vitamin c juice', 'organic supplement'] 
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

import re
from collections import Counter
keywords      = { keyword:token for token,keyword in dct.items() }  # inverted dictionary
sortedKw      = sorted(keywords,key=lambda x:-len(x))               # keywords in reverse order of size
pattern       = re.compile( "|".join(sortedKw) )                    # regular expression
lstKeywords   = [ pattern.findall(item) for item in lst ]           # list items --> keywords
tokenGroups   = [ [keywords[word] for word in words] for words in lstKeywords ]  # keyword lists to lists of indexes
result        = [ list(Counter(token).items()) for token in tokenGroups ] # lists of token indexes to (token,count) 
print(result) # [[(3, 1), (1, 1)], [(0, 1), (2, 1)]]
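
If any of the keywords could contain regex metacharacters (a hypothetical keyword like "vitamin c+", for instance), it would be safer to escape them before joining:

pattern = re.compile( "|".join(re.escape(kw) for kw in sortedKw) )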

The regular expression takes the form: keyword1|keyword2|keyword3

Because the "|" operator in a regular expression tries the alternatives from left to right and keeps the first one that matches (it is not greedy), longer keywords must appear first in the expression. This is the reason for sorting them by decreasing length before building it.
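
As a small illustration (using a hypothetical extra keyword "vitamin" that is not in the dictionary above), the ordering of the alternatives changes what findall() returns:

import re
print(re.findall("vitamin|vitamin c", "vitamin c juice"))  # ['vitamin']   - shorter alternative wins
print(re.findall("vitamin c|vitamin", "vitamin c juice"))  # ['vitamin c'] - longer first, as desired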

After that it is merely a matter of converting the list items to keyword lists (pattern.findall() does that) and then using the inverted dictionary to turn each keyword into a token index.

[UPDATE] In order to count the number of token occurrences, the list of keywords, converted to a list of token indexes, is loaded into a Counter object (from the collections module) that performs the counting in a specialized dictionary.
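
As a small sketch of that counting step, if an item happened to contain the same keyword twice, the Counter would report a count of 2 for that token index:

from collections import Counter
print(list(Counter([3, 1, 1]).items()))  # [(3, 1), (1, 2)]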

Answered May 12 '26 by Alain T.


