
Cross Validation with Spacy for Named Entity Recognition

I am trying to train a custom NER model on 50,000 samples. I am using minibatch with 20 iterations for training. I want to understand whether I should use cross-validation for a more accurate estimate of out-of-sample accuracy. If yes, where should the cross-validation step take place? If no, how do I split my data into training and test sets? Since I am using annotations with 6 custom entities, it is hard to keep track of the percentage of each annotated label in the training and test data and to distribute them evenly.

Here is the code I am using for training -

import random

import spacy
from spacy.util import minibatch, compounding


def train_spacy(data, iterations):
    TRAIN_DATA = data

    # create a blank English Language class
    nlp = spacy.blank('en')

    # create the built-in NER pipeline component and add it to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add the custom entity labels seen in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

    # only train the NER component
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer,
                           drop=0.20, losses=losses)
            print('Losses', losses)

    return nlp


if __name__ == "__main__":

    # `data` should hold the annotated examples in spaCy's training format
    model = train_spacy(data, 10)
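To get an out-of-sample accuracy estimate, a portion of the annotated data can be split off before calling train_spacy and used only for evaluation. A minimal sketch, assuming `data` is a list of tuples in spaCy's training format `(text, {"entities": [(start, end, label)]})`; the 80/20 ratio and the seed are arbitrary choices:

```python
import random

def train_dev_split(data, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the annotated examples and split off a dev set."""
    examples = list(data)
    random.Random(seed).shuffle(examples)
    n_dev = int(len(examples) * dev_fraction)
    # dev set is the first n_dev shuffled examples, the rest is training data
    return examples[n_dev:], examples[:n_dev]

# dummy annotations in spaCy's training format, for illustration only
data = [("text %d" % i, {"entities": [(0, 4, "LABEL")]}) for i in range(10)]
train_data, dev_data = train_dev_split(data)
print(len(train_data), len(dev_data))  # 8 2
```

The training code above would then be called as `train_spacy(train_data, 10)`, and the returned model evaluated on `dev_data` only.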

I think the cross-validation step should take place somewhere inside the iteration for loop, but I am not sure. Can someone shed some light on how to use cross-validation with spaCy NER, or whether it is needed at all?

iCHAIT asked Oct 24 '25 19:10

1 Answer

Ideally, you would split off a proportion of your training dataset as a "development set" and use all entities in that set to tune your hyperparameters.

If you select the proportion at random (making sure not to bias the split towards particular entity types, such as dates or names), you would expect the distributions of entities to be roughly the same in both sets. It's best not to over-engineer this split; just take a true random sample.
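To address the asker's worry about label percentages, it is easy to sanity-check that a random split keeps the proportions roughly even by counting entity labels per split. A stdlib-only sketch, assuming spaCy's annotation format `(text, {"entities": [(start, end, label)]})`; the two labels and the 3:1 ratio are made up for illustration:

```python
import random
from collections import Counter

def label_distribution(examples):
    """Return the fraction of each entity label across annotated examples."""
    counts = Counter()
    for _, annotations in examples:
        for _start, _end, label in annotations.get("entities", []):
            counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# dummy data with two labels in a 3:1 ratio (PERSON vs DATE)
data = [("a", {"entities": [(0, 1, "PERSON" if i % 4 else "DATE")]})
        for i in range(1000)]
random.Random(0).shuffle(data)
train, dev = data[200:], data[:200]

# both splits should show roughly the same 0.75 / 0.25 proportions
print(label_distribution(train))
print(label_distribution(dev))
```

With a sample of this size, a plain random split lands close to the overall label proportions in both sets, which is why a true random sample is usually sufficient.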

Sofie VL answered Oct 26 '25 10:10