PyTorch tokenizers: how to truncate tokens from left?

Question

As the code snippet below shows, specifying max_length and truncation=True for a tokenizer cuts excess tokens from the right:

tokenizer("hello, my name", truncation=True, max_length=6).input_ids

> [0, 42891, 6, 127, 766, 2]

tokenizer("hello, my name", truncation=True, max_length=4).input_ids

> [0, 42891, 6, 2]

The different truncation strategies (only_second, only_first, longest_first) all seem to cut from the right. Is there a way to cut from the left instead, so that the tokens in the second example would be [0, 127, 766, 2]?

Bas Krahmer · Accepted Answer

As pointed out by andrea in the comments, you can use truncation_side='left' when initialising the tokenizer. You can also set this attribute after tokenizer creation:

tokenizer.truncation_side = 'left'  # Default is 'right'
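To see the effect of this setting without downloading a checkpoint, here is a minimal sketch that builds a toy word-level tokenizer in memory. The five-word vocabulary and the names backend and tok are made up for illustration and are not from the question; the sketch also assumes a transformers/tokenizers install recent enough to support truncation_side, and the toy tokenizer adds no special tokens.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary, only for demonstration.
vocab = {"[UNK]": 0, "hello": 1, ",": 2, "my": 3, "name": 4}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()  # splits on whitespace and punctuation

# Pass truncation_side at construction time ...
tok = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    unk_token="[UNK]",
    truncation_side="left",
)
ids_left = tok("hello, my name", truncation=True, max_length=2).input_ids

# ... or change the attribute on an existing tokenizer.
tok.truncation_side = "right"
ids_right = tok("hello, my name", truncation=True, max_length=2).input_ids

print(ids_left)   # ids of the last two tokens: "my", "name"
print(ids_right)  # ids of the first two tokens: "hello", ","
```

With a pretrained tokenizer the same keyword argument can be passed to from_pretrained, and special tokens are then kept in place around the truncated text.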

The tokenizer then handles the rest internally and truncates to the max_length argument. Alternatively, if you need to use a transformers version that does not have this feature, you can tokenize without truncation and apply the following custom logic as a postprocessing step:

tokens = tokens[-max_len:]
attn_mask = attn_mask[-max_len:]
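Note that a plain slice also drops whatever sits at the start of the sequence. Applied to the ids from the question, tokens[-4:] gives [6, 127, 766, 2] rather than the desired [0, 127, 766, 2]. The following is a minimal sketch of a variant that keeps a fixed number of leading ids. It assumes the leading 0 in the question's output is a start-of-sequence special token, and the helper name truncate_left and its n_prefix parameter are hypothetical, not part of transformers.

```python
from typing import List, Tuple


def truncate_left(
    input_ids: List[int],
    attention_mask: List[int],
    max_len: int,
    n_prefix: int = 1,
) -> Tuple[List[int], List[int]]:
    """Keep the first n_prefix ids plus the last (max_len - n_prefix) ids.

    n_prefix is the assumed number of leading special tokens to preserve.
    """
    if max_len < n_prefix:
        raise ValueError("max_len must be at least n_prefix")
    if len(input_ids) <= max_len:
        # Nothing to cut.
        return list(input_ids), list(attention_mask)

    keep_tail = max_len - n_prefix
    # Index from the front so that keep_tail == 0 yields an empty tail.
    start = len(input_ids) - keep_tail
    new_ids = list(input_ids[:n_prefix]) + list(input_ids[start:])
    new_mask = list(attention_mask[:n_prefix]) + list(attention_mask[start:])
    return new_ids, new_mask


# Ids taken from the question's untruncated output.
ids = [0, 42891, 6, 127, 766, 2]
mask = [1] * len(ids)
new_ids, new_mask = truncate_left(ids, mask, max_len=4)
print(new_ids)   # [0, 127, 766, 2]
print(new_mask)  # [1, 1, 1, 1]
```

Setting n_prefix=0 reproduces the plain slice above.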
