>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}

I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?
num_words is the maximum vocabulary size. Choose it carefully, because it affects the performance of the model. By default, num_words is None, which means every word is kept. To keep the full vocabulary with an explicit value, use len(tokenizer.word_index) + 1: index 0 is reserved, and only words with an index below num_words are kept.
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character).
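To make the punctuation rule concrete, here is a minimal pure-Python sketch of that splitting step. It is not the Keras code itself: `split_words` is a hypothetical helper, and `FILTERS` is assumed to match the Tokenizer's default filter string, which does not contain the apostrophe.

```python
# Assumed default filter characters; note that the apostrophe is absent.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=FILTERS):
    """Lowercase the text, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


# The first sentence from the question: the "&#$" run disappears.
words = split_words("Hello, World! This is so&#$ fantastic!")
print(words)

# Apostrophes survive, so contractions stay in one piece.
contraction = split_words("Don't stop, it's fine.")
print(contraction)
```

The filtered characters are replaced with spaces rather than deleted, so a run of punctuation between two words still separates them.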
Keras Tokenizer Class
The Tokenizer class of Keras is used for vectorizing a text corpus. To do this, each text input is converted either into a sequence of integers or into a vector that has a binary coefficient for each token.
There is nothing wrong with what you are doing. word_index is computed the same way no matter how many of the most frequent words you will use later (as you may see here). When you call any transformative method, the Tokenizer uses only the most common words (those with an index below num_words, so indices 1 and 2 when num_words=3), and at the same time it keeps the counter of all words, even when it is obvious that it will not use them later.
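The behaviour described in this answer can be reproduced without Keras. The sketch below is a simplified stand-in, not the library's implementation: `tokenize`, `build_word_index` and `to_sequences` are hypothetical helpers, and the tie-breaking rule (words with equal counts keep first-seen order) is an assumption chosen to match the output shown in the question.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenization: lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(texts):
    """Rank every word by frequency; num_words plays no role at this stage."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping any index that is not below num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in tokenize(text):
            i = word_index.get(word)
            if i is None or (num_words is not None and i >= num_words):
                continue
            seq.append(i)
        sequences.append(seq)
    return sequences


texts = ["Hello, World! This is so&#$ fantastic!",
         "There is no other world like this one"]

word_index = build_word_index(texts)
print(len(word_index))   # all 11 words are indexed

limited = to_sequences(texts, word_index, num_words=3)
print(limited)           # only indices 1 and 2 survive

full = to_sequences(texts, word_index, num_words=len(word_index) + 1)
print(full)              # nothing is dropped
```

The limit is applied only when texts are transformed, which is why the full dictionary is still visible after fitting.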