I have some training data for a new set of NER labels that are not covered by spaCy's default NER model. I have prepared a training_data.spacy file, which exclusively contains annotated examples with the new labels. I am able to train a blank model from scratch following the instructions listed here, basically using the GUI tool to create a basic_config.cfg and then filling it in to create a config.cfg.
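For reference, a .spacy file is a serialized DocBin; mine was built roughly like the sketch below, where the example text and the DRUG label are placeholders for my actual annotations:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank('en')
db = DocBin()

# placeholder example; the real data uses my own texts and new labels
text = 'Atorvastatin 20mg daily'
doc = nlp.make_doc(text)
span = doc.char_span(0, 12, label='DRUG')  # chars 0-12 cover 'Atorvastatin'
doc.ents = [span]
db.add(doc)

db.to_disk('./training_data.spacy')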
However, I am not sure how to fine-tune the NER component of an existing model while keeping all the other components intact. Basically, I would like to freeze all the other components during training. I tried something like the following:
import spacy

spacy.require_gpu()
nlp = spacy.load('en_core_web_sm')

# freeze everything except the NER component
frozen_components = [name for name in nlp.component_names if name not in ['ner']]

max_steps = 20000
eval_frequency = 200
patience = 1600

# start from the loaded pipeline's config and override the training settings
config = nlp.config
config['training']['max_steps'] = max_steps
config['training']['patience'] = patience
config['training']['eval_frequency'] = eval_frequency
config['training']['frozen_components'] = frozen_components
config['training']['annotating_components'] = nlp.component_names

with open('./ner_config.cfg', 'w') as f:
    f.write(config.to_str())
After this, I run
python -m spacy train ner_config.cfg --output ./output/$(date +%s) --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0
I get the following error:
✔ Created output directory: output/1647965025
ℹ Saving to output directory: output/1647965025
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2022-03-22 21:33:47,498] [INFO] Set up nlp object from config
[2022-03-22 21:33:47,511] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[2022-03-22 21:33:47,571] [INFO] Added vocab lookups: lexeme_norm
[2022-03-22 21:33:47,571] [INFO] Created vocabulary
[2022-03-22 21:33:47,572] [INFO] Finished initializing nlp object
[2022-03-22 21:34:04,376] [INFO] Initialized pipeline components: ['ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Frozen components: ['tok2vec', 'tagger', 'parser', 'senter',
'attribute_ruler', 'lemmatizer']
ℹ Set annotations on update for: ['tok2vec', 'tagger', 'parser',
'senter', 'attribute_ruler', 'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS NER TAG_ACC DEP_UAS DEP_LAS SENTS_F LEMMA_ACC ENTS_F ENTS_P ENTS_R SPEED SCORE
--- ------ -------- ------- ------- ------- ------- --------- ------ ------ ------ ------ ------
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("Parameter 'E' for model 'hashembed' has not been allocated yet.")
...
vectors = cast(Floats2d, model.get_param("E"))
File "/home/abhinav/miniconda3/envs/spacy/lib/python3.8/site-packages/thinc/model.py", line 216, in get_param
raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
What am I missing?
Thanks!
There is a demo project that shows how to do this:
https://github.com/explosion/projects/tree/v3/pipelines/ner_demo_update
The key point is that you need to source the components from en_core_web_sm in your config instead. As written, your config defines the components by their factories, so spacy train builds them from scratch and only initializes ner (as your log shows), which leaves the frozen tok2vec's parameters unallocated and triggers the KeyError for the 'E' parameter of hashembed. You also don't need any annotating components in this scenario.
The generic version looks like this (copied from a script in the project above):
from pathlib import Path

import spacy


def create_config(model_name: str, component_to_update: str, output_path: Path):
    nlp = spacy.load(model_name)
    # create a new config as a copy of the loaded pipeline's config
    config = nlp.config.copy()
    # revert most training settings to the current defaults
    default_config = spacy.blank(nlp.lang).config
    config["corpora"] = default_config["corpora"]
    config["training"]["logger"] = default_config["training"]["logger"]
    # copy tokenizer and vocab settings from the base model, which includes
    # lookups (lexeme_norm) and vectors, so they don't need to be copied or
    # initialized separately
    config["initialize"]["before_init"] = {
        "@callbacks": "spacy.copy_from_base_model.v1",
        "tokenizer": model_name,
        "vocab": model_name,
    }
    config["initialize"]["lookups"] = None
    config["initialize"]["vectors"] = None
    # source all components from the loaded pipeline and freeze all except the
    # component to update; replace the listener for the component that is
    # being updated so that it can be updated independently
    config["training"]["frozen_components"] = []
    for pipe_name in nlp.component_names:
        if pipe_name != component_to_update:
            config["components"][pipe_name] = {"source": model_name}
            config["training"]["frozen_components"].append(pipe_name)
        else:
            config["components"][pipe_name] = {
                "source": model_name,
                "replace_listeners": ["model.tok2vec"],
            }
    # save the config
    config.to_disk(output_path)
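A minimal usage sketch (the file name ner_config.cfg is illustrative); the generated config can then be passed to the same spacy train command as before:

from pathlib import Path

# source everything from en_core_web_sm and leave only 'ner' unfrozen
create_config("en_core_web_sm", "ner", Path("./ner_config.cfg"))

The replace_listeners setting is what makes this work with a frozen shared tok2vec: it gives the sourced ner component its own internal copy of the tok2vec layer, so ner can be updated without requiring or affecting the frozen tok2vec that the other components listen to.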