I am trying to train a custom NER model in spaCy v3. Training has changed significantly in v3 compared to v2.
I am using the default config with en_core_web_lg. I have prepared the training data (training.spacy) using the convert command. However, the train command also needs a dev.spacy file.
I am not sure what data is expected in dev.spacy. Is it asking for a plain-text corpus corresponding to the training.spacy file? If so, is there a way to convert a plain-text file into the spaCy format?
Command from the spaCy site: python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
Can someone please explain how to prepare dev.spacy?
train.spacy is a placeholder for your collection of 'training' files: a directory of files, usually produced with the spaCy convert utility. dev.spacy is a placeholder for your collection of 'validation' files. These have the same format as the training files, but are used as a validation sample during training (for NER, to compute precision, recall and F-score after each training iteration). The commonly suggested size of the validation sample is between 10% and 20% of the training sample. I tend to use 20% because my data has a large variation, but a larger validation sample adds training overhead.
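To make the precision, recall and F-score mentioned above concrete, here is a minimal sketch of how such scores can be computed for entity predictions on a validation sample. It assumes an entity counts as correct only when its start, end and label all match exactly; the function name and the (start, end, label) tuple layout are illustrative and not part of spaCy's API, and this is a simplification rather than spaCy's actual scorer.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label), illustrative layout


def entity_prf(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_score) using exact span-and-label matching."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F-score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical annotations for one validation text.
gold = [(0, 5, "ORG"), (10, 15, "PERSON")]
pred = [(0, 5, "ORG"), (20, 25, "GPE")]
print(entity_prf(gold, pred))
```

Because the scores come from examples the model never trained on, they give a more honest picture than scores computed on the training files.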
The dev.spacy file should look exactly the same as the train.spacy file, but should contain new examples that the training process hasn't seen before, to get a realistic evaluation of the performance of your model.
To create this dev set, you can first split your original data into train/dev parts, and then run convert separately on each of them, calling the larger one train.spacy and the smaller one dev.spacy. As @mbrunecky suggests, an 80-20 split is usually good, but it depends on the dataset.
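The split described above could be sketched as follows. This is only an illustration under some assumptions: the annotated data is taken to be a list of (text, [(start, end, label), ...]) tuples, and the helper names split_train_dev and write_docbin are made up for this example. Instead of running convert on each part, the sketch writes the two files directly with spaCy's DocBin, which is the format behind .spacy files.

```python
import random
from typing import List, Sequence, Tuple


def split_train_dev(examples: Sequence, dev_fraction: float = 0.2, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the examples and return (train, dev) with no overlap."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def write_docbin(examples, path: str) -> None:
    """Serialize (text, entities) examples to a .spacy file (requires spaCy v3)."""
    # Imported here so the split helper above works without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # tokenizer only; assumes English data
    doc_bin = DocBin()
    for text, entities in examples:
        doc = nlp.make_doc(text)
        spans = [doc.char_span(start, end, label=label) for start, end, label in entities]
        # char_span returns None when offsets do not align with token boundaries.
        doc.ents = [span for span in spans if span is not None]
        doc_bin.add(doc)
    doc_bin.to_disk(path)


# Usage with your own annotated data (hypothetical variable name):
# train, dev = split_train_dev(annotated_examples, dev_fraction=0.2)
# write_docbin(train, "./train.spacy")
# write_docbin(dev, "./dev.spacy")
```

Shuffling before splitting matters when the original data is ordered by source or topic; otherwise the dev set may not resemble the training data.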