
How to save a tokenizer after training it?

I have just followed this tutorial on how to train my own tokenizer.

Now that I have trained my tokenizer, I have wrapped it in a transformers object so that I can use it with the transformers library:

from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

Then, I try to save my tokenizer using this code:

tokenizer.save_pretrained('/content/drive/MyDrive/Tokenzier')

But I get this error:

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'save_pretrained'

Am I saving the tokenizer wrongly?

If so, what is the correct approach to save it to my local files, so that I can use it later?


1 Answer

If you are building a custom tokenizer, you can save & load it like this:

from tokenizers import Tokenizer

# Save
tokenizer.save('saved_tokenizer.json')

# Load
tokenizer = Tokenizer.from_file('saved_tokenizer.json')
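The reloaded Tokenizer can then be used directly through the tokenizers API; a quick sanity check (a minimal sketch, the sample sentence is just an illustration):

from tokenizers import Tokenizer

# Reload the saved tokenizer and encode a sample sentence
tokenizer = Tokenizer.from_file('saved_tokenizer.json')
encoding = tokenizer.encode("Hello, world!")
print(encoding.tokens)  # the produced subword tokens
print(encoding.ids)     # the corresponding token ids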

save_pretrained() only works if you train from a pre-trained tokenizer like this:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("the_pretrained_model_in_hf")
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), 52000)
tokenizer.save_pretrained("your-tokenizer")
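A directory written by save_pretrained() can later be reloaded from local files; a minimal sketch, assuming the "your-tokenizer" directory created above:

from transformers import AutoTokenizer

# Reload the tokenizer from the locally saved directory
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer")

# Quick sanity check on the reloaded tokenizer
print(tokenizer("Hello, world!").input_ids)

Also note that the question already wraps the raw tokenizer in BertTokenizerFast; calling save_pretrained() on that wrapper object (new_tokenizer in the question) rather than on the underlying tokenizers.Tokenizer should work as well, since the wrapper is a transformers tokenizer.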
Answered by Raptor