I'm new to TensorFlow.
I saw a couple of small NLP projects where people use tf.keras.preprocessing.text.Tokenizer to preprocess their text (link: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).
In other cases, they add a tf.keras.layers.TextVectorization layer directly while building the model (link: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization).
What is the difference between the two in terms of usage, and when should I choose one over the other?
The Tokenizer, as the name suggests, tokenizes the text. Tokenization is the process of splitting text into individual elements (characters, words, sentences, etc.).
tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=' ',
    char_level=False,
    oov_token=None,
    document_count=0,
    **kwargs
)
A tf.keras.preprocessing.text.Tokenizer is first fit on a corpus with fit_on_texts to build the vocabulary; the fitted tokenizer is then used to convert texts to integer sequences with texts_to_sequences.
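A minimal sketch of that two-step workflow; the corpus and the num_words/oov_token values are made up for illustration:

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the cat sat on the mat", "the dog ate my homework"]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(corpus)        # builds the word index from the corpus
print(tokenizer.word_index)           # e.g. {'<OOV>': 1, 'the': 2, ...}

sequences = tokenizer.texts_to_sequences(corpus)
print(sequences)                      # lists of integer ids, one per input text

Note that this happens outside the model, in plain Python, so you have to apply the same tokenizer again at inference time.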
tf.keras.layers.TextVectorization, on the other hand, is a Keras layer that performs the whole pipeline in one step: it standardizes, tokenizes, and maps the text to integer sequences inside the model, so the preprocessing is saved and served together with the model. The TensorFlow docs recommend it over the preprocessing Tokenizer for new code.
tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
    **kwargs
)
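A minimal sketch of the layer-based workflow; the corpus, max_tokens, sequence length, and the small model around it are made up for illustration:

import tensorflow as tf

corpus = ["the cat sat on the mat", "the dog ate my homework"]

vectorize = tf.keras.layers.TextVectorization(
    max_tokens=100, output_mode='int', output_sequence_length=10)
vectorize.adapt(corpus)               # builds the vocabulary from the corpus

print(vectorize(corpus))              # integer sequences, padded to length 10

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize,                        # raw strings in, integer ids out
    tf.keras.layers.Embedding(100, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

In short: choose TextVectorization when you want preprocessing to live inside the model (e.g., so you can serve raw strings directly); the older Tokenizer still works for offline preprocessing in Python, but it is not recommended for new code.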