As the code snippet below shows, specifying max_length and truncation for a tokenizer cuts excess tokens from the right:
tokenizer("hello, my name", truncation=True, max_length=6).input_ids
> [0, 42891, 6, 127, 766, 2]
tokenizer("hello, my name", truncation=True, max_length=4).input_ids
> [0, 42891, 6, 2]
The different truncation strategies (only_second, only_first, longest_first) all seem to cut from the right. Is there a way to cut from the left, so that the tokens would be [0, 127, 766, 2] in the second example?
As pointed out by andrea in the comments, you can use truncation_side='left' when initialising the tokenizer. You can also set this attribute after tokenizer creation:
tokenizer.truncation_side = 'left'  # Default is 'right'
The tokenizer internally takes care of the rest and truncates based on the max_length argument. Alternatively, if you need to use a transformers version that does not have this feature, you can tokenize without truncation and apply the following custom logic as a postprocessing step (here tokens and attn_mask are the input IDs and attention mask of a single sequence, and max_len is the desired maximum length):
tokens = tokens[-max_len:]
attn_mask = attn_mask[-max_len:]
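Note that a plain slice like the one above also cuts off the leading special token (0 in the question's example), so it would give [6, 127, 766, 2] rather than [0, 127, 766, 2]. Below is a minimal sketch of a postprocessing helper that keeps the first token and takes the rest from the end. The name truncate_left and the keep_first flag are made up for illustration and are not part of transformers. The sketch assumes a single, unpadded sequence given as plain Python lists, with a special start token in first position.

```python
from typing import List, Tuple


def truncate_left(
    input_ids: List[int],
    attention_mask: List[int],
    max_len: int,
    keep_first: bool = True,
) -> Tuple[List[int], List[int]]:
    """Drop tokens from the left so that at most max_len tokens remain.

    Hypothetical helper, not a transformers API. With keep_first=True the
    first token (assumed to be a special start token) is preserved and the
    remaining slots are filled from the end of the sequence.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have equal length")

    # Nothing to cut: return copies so the caller's lists are not aliased.
    if len(input_ids) <= max_len:
        return list(input_ids), list(attention_mask)

    head = 1 if keep_first else 0
    # Index where the kept tail begins; a positive index avoids the
    # pitfall that seq[-0:] returns the whole list.
    start = len(input_ids) - (max_len - head)

    new_ids = input_ids[:head] + input_ids[start:]
    new_mask = attention_mask[:head] + attention_mask[start:]
    return new_ids, new_mask


ids = [0, 42891, 6, 127, 766, 2]   # token IDs from the question
mask = [1] * len(ids)
print(truncate_left(ids, mask, max_len=4))
# ([0, 127, 766, 2], [1, 1, 1, 1])
```

With keep_first=False the helper behaves like the plain slice above. For batched or padded inputs the same logic would have to be applied per sequence before padding.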