I am using spacy 2.0 and using a quoted string as input.
Example string
"The quoted text 'AA XX' should be tokenized"
and expecting to extract
[The, quoted, text, 'AA XX', should, be, tokenized]
I however get some strange results while experimenting. Noun chunks and ents looses one of the quote.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to address what I need
In Spacy, the process of tokenizing a text into segments of words and punctuation is done in various steps. It processes the text from left to right. First, the tokenizer split the text on whitespace similar to the split() function. Then the tokenizer checks whether the substring matches the tokenizer exception rules.
SpaCy automatically breaks your document into tokens when a document is created using the model.
dep_ property of each child token describes its relationship with its parent; for instance a dep_ of 'nsubj' means that a token is the nominal subject of its parent.
orth is simply an integer that indicates the index of the occurrence of the word that is kept in the spacy.
While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphanumeric characters:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])
def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc
nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see  the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions – but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With