Loading spacy models slows down running my unit tests. Is there a way to mock spacy models or Doc objects to speed up unit tests?
Example of a current slow test:
import spacy
nlp = spacy.load("en_core_web_sm")
def test_entities():
    text = u"Google is a company."
    doc = nlp(text)
    assert doc.ents[0].text == u"Google"
Based on the docs, my approach is to construct the Vocab and Doc manually and set the entities as tuples:
from spacy.vocab import Vocab
from spacy.tokens import Doc
def test():
    alphanum_words = u"Google Facebook are companies".split(" ")
    labels = [u"ORG"]
    words = alphanum_words + [u"."]
    spaces = len(words) * [True]
    spaces[-1] = False
    spaces[-2] = False
    vocab = Vocab(strings=(alphanum_words + labels))
    doc = Doc(vocab, words=words, spaces=spaces)
    def get_hash(text):
        return vocab.strings[text]
    entity_tuples = tuple([(get_hash(labels[0]), 0, 1)])
    doc.ents = entity_tuples
    assert doc.ents[0].text == u"Google"
Is there a cleaner, more Pythonic way to mock spaCy objects in unit tests for entities?
This is a great question actually! I'd say your instinct is definitely right: If all you need is a Doc object in a given state and with given annotations, always create it manually wherever possible. And unless you're explicitly testing a statistical model, avoid loading it in your unit tests. It makes the tests slow, and it introduces too much unnecessary variance. This is also very much in line with the philosophy of unit testing: you want to be writing independent tests for one thing at a time (not one thing plus a bunch of third-party library code plus a statistical model).
Some general tips and ideas:
- Always use the simplest possible way to construct a Doc manually. Avoid loading models or Language subclasses.
- Unless your test specifically needs doc.text, you do not have to set the spaces. In fact, I leave this out in about 80% of the tests I write, because it really only becomes relevant when you're putting the tokens back together.
- If you need to create a lot of Doc objects in your test suite, you could consider using a utility function, similar to the get_doc helper we use in the spaCy test suite. (That function also shows you how the individual annotations are set manually, in case you need it.)
- Use (session-scoped) fixtures for shared objects like the Vocab. Depending on what you're testing, you might want to explicitly use the English vocab. In the spaCy test suite, we do this by setting up an en_vocab fixture in the conftest.py.
- Instead of setting doc.ents to a list of tuples, you can also make it a list of Span objects. This looks a bit more straightforward, is easier to read, and in spaCy v2.1+, you can also pass a string as a label:

from spacy.tokens import Doc, Span

def test_entities(en_vocab):
    doc = Doc(en_vocab, words=["Hello", "world"])
    doc.ents = [Span(doc, 0, 1, label="ORG")]
    assert doc.ents[0].text == "Hello"
- If you do need to test a model or a language class like English, put them in a session-scoped fixture. This means that they'll only be loaded once per session instead of once per test. Language classes are lazy-loaded and may also take some time to load, depending on the data they contain. So you only want to do this once.

import pytest
import spacy

# Note: You probably don't have to do any of this, unless you're testing your
# own custom models or language classes.
@pytest.fixture(scope="session")
def en_core_web_sm():
    return spacy.load("en_core_web_sm")
@pytest.fixture(scope="session")
def en_lang_class():
    lang_cls = spacy.util.get_lang_class("en")
    return lang_cls()
def test(en_lang_class):
    doc = en_lang_class("Hello world")
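
To tie the tips above together, here is a minimal sketch of what a shared fixture plus a small Doc-building helper might look like. The names blank_vocab and make_doc are hypothetical, made up for illustration; this is not the get_doc helper from the spaCy test suite. It assumes spaCy v2.1+ (so Span accepts a string label) and uses a blank Vocab rather than the English one, which should be enough as long as the test doesn't rely on language-specific lexical attributes.

```python
import pytest
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


@pytest.fixture(scope="session")
def blank_vocab():
    # Hypothetical fixture name. A blank Vocab is created once per test
    # session and shared by every test that requests it.
    return Vocab()


def make_doc(vocab, words, ents=()):
    """Hypothetical helper: build a Doc from a list of words and
    (start, end, label) triples, without loading any model."""
    doc = Doc(vocab, words=words)
    # Token offsets are end-exclusive, like Python slices.
    doc.ents = [Span(doc, start, end, label=label) for start, end, label in ents]
    return doc


def test_entities(blank_vocab):
    doc = make_doc(
        blank_vocab,
        ["Google", "is", "a", "company", "."],
        ents=[(0, 1, "ORG")],
    )
    assert [(ent.text, ent.label_) for ent in doc.ents] == [("Google", "ORG")]
```

In a real project the fixture would probably live in conftest.py so that all test modules can request it. The spaces argument is left out on purpose here, since the test never looks at doc.text.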