New to TensorFlow.
I've seen a couple of small NLP projects where people use 'tf.keras.preprocessing.text.Tokenizer' to preprocess their text (link: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).
In other cases, they add a 'tf.keras.layers.TextVectorization' layer directly while building the model (link: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization).
May I know what the difference is between the two in terms of usage, and when to choose which option?
The Tokenizer, as the name suggests, tokenizes text. Tokenization is the process of splitting text into individual elements (characters, words, sentences, etc.):
tf.keras.preprocessing.text.Tokenizer(
num_words=None,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True, split=' ', char_level=False, oov_token=None,
document_count=0, **kwargs
)
A Tokenizer is first fitted on a corpus with fit_on_texts, which builds its vocabulary; the fitted instance is then used to convert text to integer sequences with texts_to_sequences.
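Below is a minimal sketch of that two-step workflow; the corpus and the num_words and maxlen values are made up for illustration:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["the cat sat", "the dog barked"]  # illustrative corpus

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(corpus)                    # build the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(corpus)  # text -> lists of integers
padded = pad_sequences(sequences, maxlen=5)       # pad/truncate outside the model

print(tokenizer.word_index)  # e.g. {'<OOV>': 1, 'the': 2, ...}
print(padded)                # shape (2, 5) integer array

Note that padding is a separate step here: the Tokenizer only maps words to integers, and the resulting arrays are fed to the model as ordinary NumPy input.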
tf.keras.layers.TextVectorization, on the other hand, is a Keras layer that converts text to integer sequences in one step, so the preprocessing becomes part of the model itself:
tf.keras.layers.TextVectorization(
max_tokens=None, standardize='lower_and_strip_punctuation',
split='whitespace', ngrams=None, output_mode='int',
output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
idf_weights=None, sparse=False, ragged=False, **kwargs
)
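Here is a minimal sketch of the equivalent workflow with TextVectorization; the corpus and the max_tokens and output_sequence_length values are again illustrative:

import tensorflow as tf

corpus = tf.constant(["the cat sat", "the dog barked"])  # illustrative corpus

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,            # cap on vocabulary size
    output_mode='int',
    output_sequence_length=5,  # padding/truncation happens inside the layer
)
vectorizer.adapt(corpus)       # learn the vocabulary (the analogue of fit_on_texts)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorizer,                                    # raw strings go straight into the model
    tf.keras.layers.Embedding(input_dim=100, output_dim=8),
])

print(vectorizer(corpus))      # integer sequences, shape (2, 5)

Because the layer lives inside the model, the same preprocessing is saved with it and applied automatically at inference time. That is the usual reason to prefer TextVectorization in current TensorFlow code, while the standalone Tokenizer belongs to the older tf.keras.preprocessing API.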