How does SpaCy keep track of character and token offsets during tokenization?

Question

In SpaCy, there's a Span object that keeps the start and end offsets of the token/span: https://spacy.io/api/span#init

There's a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation is doing.

When looking at extraneous spaces, it's doing some smart alignment of the spans.

Does it recalculate after every regex execution? Does it keep track of each character's movement? Or does it do a span search after all the regexes have run?

MyNameIsCaleb · Accepted Answer

Summary:
The tokenizer's main loop, walked through below, is the part that keeps track of the character position and the offset during tokenization.

Simple answer: It goes character by character in the string.

TL;DR is at the bottom.

Explained chunk by chunk:

It takes in the string to be tokenized and starts iterating through it one letter/space at a time.

It is a simple for loop on the string where uc is the current character in the string.

for uc in string:

It first checks whether the current character is whitespace and compares the result with in_ws, the flag that records whether the loop is currently inside a run of whitespace. If the two agree, no boundary has been crossed, so it skips the block below and just increments i (i += 1).

in_ws is used to decide whether there is anything to process. The tokenizer needs to act on whitespace as well as on other characters, so it can't just test isspace() and operate only when it is False. Instead, before the loop starts, in_ws is set to string[0].isspace(), and each character's isspace() result is compared against it. For the first character the two are always equal, so the loop skips ahead, increments i (discussed later), and moves to the next uc until it reaches one whose isspace() result differs. In practice this lets it run through several spaces after having handled the first one, or through several non-space characters until it reaches the next whitespace boundary.

    if uc.isspace() != in_ws:

It will continue to go through characters until it reaches the next boundary, keeping the index of the current character as i.
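To make the boundary test concrete, here is a minimal pure-Python sketch that only records where the test fires. The function name boundary_indices is hypothetical, and the flag is flipped at each boundary the same way the loop does further down; nothing here is SpaCy's actual code.

```python
def boundary_indices(string):
    """Return every index i where uc.isspace() differs from the in_ws flag."""
    if not string:
        return []
    # Same starting value the walkthrough describes: the first character's status.
    in_ws = string[0].isspace()
    boundaries = []
    for i, uc in enumerate(string):
        if uc.isspace() != in_ws:
            boundaries.append(i)
            # Crossing a boundary flips the flag.
            in_ws = not in_ws
    return boundaries


print(boundary_indices("Hello  world !"))  # prints [5, 7, 12, 13]
```

The two consecutive spaces at indices 5 and 6 produce only one boundary, at 5, because the second space agrees with the flag.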

It tracks two index values: start and i. start is the start of the potential token that it is on, and i is the ending character it is looking at. When the script starts, start will be 0. After a cycle of this, start will be the index of the last space plus 1 which would make it the first letter of the current word.

It first checks whether start is less than i, which tells it whether there is a non-empty character sequence to look up in the cache and tokenize. This will make sense further down.

        if start < i:

span is the word that is currently being looked at for tokenization. It is the string sliced from the start index up to, but not including, index i.

            span = string[start:i]

It then hashes the word (start up to i) and checks the cache dictionary to see whether that word has already been processed. If it has not, it calls the _tokenize method on that portion of the string.

            key = hash_string(span)
            cache_hit = self._try_cache(key, doc)
            if not cache_hit:
                self._tokenize(doc, span, key)
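The hash-then-lookup pattern can be illustrated with an ordinary dictionary. This is a rough sketch, not SpaCy's implementation: process_span and the tokenize callable are hypothetical, Python's built-in hash() stands in for hash_string, and storing the result on a miss is an assumption about what a cache like this would do.

```python
def process_span(span, cache, tokenize):
    """Return (tokens, cache_hit) for span, tokenizing only on a cache miss."""
    key = hash(span)  # stand-in for hash_string(span)
    if key in cache:
        return cache[key], True
    tokens = tokenize(span)  # stand-in for self._tokenize(doc, span, key)
    cache[key] = tokens      # assumed: remember the result for next time
    return tokens, False


cache = {}
print(process_span("don't", cache, lambda s: s.split("'")))
print(process_span("don't", cache, lambda s: s.split("'")))
```

The second call finds the key and returns the stored tokens without calling tokenize again, which is the point of checking the cache before doing the more expensive work.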

Next it checks whether the current character uc is exactly a single space (' '). If it is, it marks the most recently added token as being followed by a space (the spacy flag) and resets start to i + 1, where i is the index of the current character.

        if uc == ' ':
            doc.c[doc.length - 1].spacy = True
            start = i + 1

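If the character is anything other than ' ' (a letter, or other whitespace such as a tab or newline), it sets start to the current character's index. In both cases it then flips in_ws, because the loop has just crossed a boundary between whitespace and non-whitespace.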

        else:
            start = i
        in_ws = not in_ws

And then it increases i += 1 and loops to the next character.

    i += 1
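Putting the pieces together, the loop described above can be approximated in plain Python. This is only a sketch under stated assumptions: split_chunks and its tuple layout are hypothetical, the cache lookup and _tokenize call are replaced by simply recording the chunk, and the flush after the loop is added here so the final chunk is not lost (the excerpt above does not show how the last word is handled).

```python
def split_chunks(string):
    """Return (start, end, text, followed_by_space) for each chunk found."""
    chunks = []
    if not string:
        return chunks
    i = 0
    start = 0
    in_ws = string[0].isspace()
    for uc in string:
        if uc.isspace() != in_ws:
            if start < i:
                # Stand-in for the cache check and _tokenize call.
                chunks.append([start, i, string[start:i], False])
            if uc == " ":
                # Mirrors doc.c[doc.length - 1].spacy = True
                if chunks:
                    chunks[-1][3] = True
                start = i + 1
            else:
                start = i
            in_ws = not in_ws
        i += 1
    if start < i:
        # Assumed final flush for whatever remains after the last boundary.
        chunks.append([start, i, string[start:i], False])
    return [tuple(c) for c in chunks]


for chunk in split_chunks("Hello  world !"):
    print(chunk)
```

With two spaces between the words, the first space is absorbed into the flag on "Hello" and the second becomes its own chunk at offset 6, which is how extraneous spaces end up with aligned offsets in this sketch. Concatenating each text plus a space wherever the flag is set gives back the original string.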

TL;DR
So all of that said, it keeps track of the character in the string that it is on using i, and it keeps the start of the current word using start. At each boundary between whitespace and non-whitespace, start is reset: to i + 1 if the boundary character is exactly ' ' (so the next word starts right after that space), otherwise to i.
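As a small illustration of why tracking a trailing-space flag is enough to recover character offsets, the sketch below accumulates them from (text, followed_by_space) pairs. The function char_offsets and the input layout are hypothetical, and this is not claimed to be how SpaCy stores or computes offsets internally.

```python
def char_offsets(tokens):
    """tokens: list of (text, followed_by_space). Return each start offset."""
    offsets = []
    pos = 0
    for text, followed_by_space in tokens:
        offsets.append(pos)
        # Advance past the token text, plus one if a single space follows it.
        pos += len(text) + (1 if followed_by_space else 0)
    return offsets


tokens = [("Hello", True), (" ", False), ("world", True), ("!", False)]
print(char_offsets(tokens))  # prints [0, 6, 7, 13]
```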

Tags: python, algorithm, nlp, cython, spacy

alvas