
PyTorch tokenizers: how to truncate tokens from left?

As the code snippet below shows, specifying max_length and truncation for a tokenizer cuts excess tokens from the right:

tokenizer("hello, my name", truncation=True, max_length=6).input_ids

> [0, 42891, 6, 127, 766, 2]

tokenizer("hello, my name", truncation=True, max_length=4).input_ids

> [0, 42891, 6, 2]

The different truncation strategies (only_second, only_first, longest_first) all seem to cut from the right. Is there a way to cut from the left, so that the tokens would be [0, 127, 766, 2] in the second example?

aayc asked Oct 25 '25 15:10



1 Answer

As pointed out by andrea in the comments, you can use truncation_side='left' when initialising the tokenizer. You can also set this attribute after tokenizer creation:

tokenizer.truncation_side = 'left'  # Default is 'right'

The tokenizer internally takes care of the rest and truncates based on the max_length argument. Alternatively, if you need to use a transformers version that does not have this feature, you can tokenize without truncation and apply the following custom logic as a post-processing step:

# For a single (unbatched) sequence; note that this also drops leading special tokens such as BOS
tokens = tokens[-max_len:]
attn_mask = attn_mask[-max_len:]
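
A plain slice keeps only the tail of the sequence, so the leading special token (id 0 in the question's output) is lost. If you want the result the question asks for, [0, 127, 766, 2], one option is to keep the prefix token and fill the remaining budget from the end. The following is only a minimal sketch, not part of transformers: truncate_left and its num_prefix_special parameter are hypothetical names, and it assumes a single unbatched sequence given as Python lists with exactly one special token at the start.

```python
from typing import List, Tuple


def truncate_left(
    input_ids: List[int],
    attention_mask: List[int],
    max_length: int,
    num_prefix_special: int = 1,
) -> Tuple[List[int], List[int]]:
    """Drop tokens from the left while keeping leading special tokens.

    num_prefix_special is a hypothetical parameter: the number of special
    tokens at the start of the sequence (assumed to be 1, e.g. one BOS/CLS).
    """
    # Nothing to do if the sequence already fits.
    if len(input_ids) <= max_length:
        return input_ids, attention_mask
    if max_length < num_prefix_special:
        raise ValueError("max_length is too small to keep the prefix tokens")

    # Number of tokens taken from the end of the sequence.
    keep_tail = max_length - num_prefix_special
    # Use an explicit start index: a slice like [-0:] would return everything.
    tail_start = len(input_ids) - keep_tail

    new_ids = input_ids[:num_prefix_special] + input_ids[tail_start:]
    new_mask = attention_mask[:num_prefix_special] + attention_mask[tail_start:]
    return new_ids, new_mask


# Token ids from the question's first example (max_length=6).
ids = [0, 42891, 6, 127, 766, 2]
mask = [1] * len(ids)

short_ids, short_mask = truncate_left(ids, mask, max_length=4)
print(short_ids)   # [0, 127, 766, 2]
print(short_mask)  # [1, 1, 1, 1]
```

For batched or tensor inputs the same idea applies per row, but the indexing would need to be adapted.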
Bas Krahmer answered Oct 28 '25 05:10
