New to TensorFlow.
I've seen a couple of small NLP projects where people use 'tf.keras.preprocessing.text.Tokenizer' to preprocess their text (link: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).
In other cases, they add a 'tf.keras.layers.TextVectorization' layer directly while building the model (link: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization).
May I know what the difference is between the two in terms of usage, and when to choose which option?
The Tokenizer, as the name suggests, tokenizes text. Tokenization is the process of splitting text into individual elements (characters, words, sentences, etc.):
tf.keras.preprocessing.text.Tokenizer(
num_words=None,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True, split=' ', char_level=False, oov_token=None,
document_count=0, **kwargs
)
A Tokenizer is first fitted on a corpus with fit_on_texts, which builds its vocabulary; the fitted instance is then used to convert text to integer sequences with texts_to_sequences.
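Below is a minimal sketch of that two-step workflow; the corpus and the num_words and maxlen values are made up for illustration:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["the cat sat", "the dog barked"]  # illustrative corpus

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(corpus)                    # build the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(corpus)  # text -> lists of integers
padded = pad_sequences(sequences, maxlen=5)       # pad/truncate outside the model

print(tokenizer.word_index)  # e.g. {'<OOV>': 1, 'the': 2, ...}
print(padded)                # shape (2, 5) integer array

Note that padding is a separate step here: the Tokenizer only maps words to integers, and the resulting arrays are fed to the model as ordinary NumPy input.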
tf.keras.layers.TextVectorization, on the other hand, is a Keras layer that converts text to integer sequences in one step, so the preprocessing becomes part of the model itself:
tf.keras.layers.TextVectorization(
max_tokens=None, standardize='lower_and_strip_punctuation',
split='whitespace', ngrams=None, output_mode='int',
output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
idf_weights=None, sparse=False, ragged=False, **kwargs
)
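Here is a minimal sketch of the equivalent workflow with TextVectorization; the corpus and the max_tokens and output_sequence_length values are again illustrative:

import tensorflow as tf

corpus = tf.constant(["the cat sat", "the dog barked"])  # illustrative corpus

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,            # cap on vocabulary size
    output_mode='int',
    output_sequence_length=5,  # padding/truncation happens inside the layer
)
vectorizer.adapt(corpus)       # learn the vocabulary (the analogue of fit_on_texts)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorizer,                                    # raw strings go straight into the model
    tf.keras.layers.Embedding(input_dim=100, output_dim=8),
])

print(vectorizer(corpus))      # integer sequences, shape (2, 5)

Because the layer lives inside the model, the same preprocessing is saved with it and applied automatically at inference time. That is the usual reason to prefer TextVectorization in current TensorFlow code, while the standalone Tokenizer belongs to the older tf.keras.preprocessing API.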