
Difference between Tokenizer and TextVectorization layer in tensorflow

I'm new to TensorFlow.

I've seen a couple of small NLP projects where people use tf.keras.preprocessing.text.Tokenizer to pre-process their text (link: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer )

In other cases, they add a tf.keras.layers.TextVectorization layer directly while building the model (link: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)

What's the difference between the two in terms of usage, and when should I choose which option?

Jay ra1 asked Oct 19 '25 05:10

1 Answer

The Tokenizer, as the name suggests, tokenizes the text. Tokenization is the process of splitting text into individual elements (characters, words, sentences, etc.).

tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
    document_count=0, **kwargs
)

A tf.keras.preprocessing.text.Tokenizer is first fit on a corpus with fit_on_texts; the fitted Tokenizer is then used to convert text to integer sequences with texts_to_sequences. Note that this all happens in plain Python, outside the model.
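A minimal sketch of that workflow (the toy corpus is my own example; assumes TensorFlow/Keras is installed):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat", "the dog ran"]

# Build the vocabulary from the corpus, reserving an out-of-vocabulary token.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# Convert each string to a list of integer word ids.
sequences = tokenizer.texts_to_sequences(texts)

print(tokenizer.word_index)  # word -> id mapping learned from the corpus
print(sequences)             # e.g. one id list per input string
```

Because this runs outside the model, you must remember to apply the same Tokenizer (and any padding, e.g. pad_sequences) at inference time.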

On the other hand, tf.keras.layers.TextVectorization is a Keras layer that converts text to integer sequences inside the model itself, so the preprocessing logic is saved and shipped together with the model.

tf.keras.layers.TextVectorization(
    max_tokens=None, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False, **kwargs
)
Asdoost answered Oct 21 '25 22:10