How to properly extract entities like facilities and establishment from text using NLP and Entity recognition?

I need to identify all the establishments and facilities from a given text using natural language processing and NER.

Example text:

The government planned to build new parks, swimming pool and commercial complex for our town and improve existing housing complex, schools and townhouse.

Expected entities to be identified:

parks, swimming pool, commercial complex, housing complex, schools and townhouse

I explored some Python libraries like spaCy and NLTK, but the results were not great: only 2 entities were identified. I reckon the data needs to be pre-processed properly.

What should I do to improve the results? Are there other libraries/frameworks better suited to this use case? Is there any way to train a model using our existing database?

PradhanKamal asked Oct 22 '25

2 Answers

As @Sergey mentioned, you'd need a custom NER model, and spaCy really comes in handy for custom NER, given that you have the training data. Here's a straightforward way to do it, based on your example:

import spacy
import random
from tqdm import tqdm

# Training data format: (text, {'entities': [(start_char, end_char, label), ...]})
train_data = [
    ('Government built new parks', {
        'entities': [(0, 10, 'ORG'), (21, 26, 'FAC')]
    }),
]
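The (start, end) offsets in the annotations are character positions with an exclusive end, and misaligned spans are a common reason training quietly underperforms, since spaCy drops annotations that don't line up with token boundaries. A quick pure-Python sanity check:

```python
text = 'Government built new parks'

# End offsets are exclusive: text[0:10] covers characters 0..9.
print(text[0:10])   # the ORG span
print(text[21:26])  # the FAC span

# Each span must match the entity text exactly, with no stray whitespace,
# otherwise the annotation is skipped during training.
assert text[0:10] == 'Government'
assert text[21:26] == 'parks'
```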

Create a blank model and add the 'ner' pipe (spaCy 2.x API):

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')  # spaCy 2.x; in spaCy 3.x use nlp.add_pipe('ner')
nlp.add_pipe(ner, last=True)

Training Step

n_iter = 100

# Register every entity label present in the training data
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in tqdm(train_data):
            nlp.update(
                [text],
                [annotations],
                drop=0.25,  # dropout, so the model doesn't just memorise the examples
                sgd=optimizer,
                losses=losses)
        print(losses)

# Test
for text, _ in train_data:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
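Once the model returns (text, label) pairs like the ones printed by the test loop, narrowing them down to the facility entities the question asks about is a small post-processing step (a sketch, assuming the FAC label from the training data above):

```python
def facilities(entities):
    """Keep only entities tagged with the FAC (facility) label."""
    return [text for text, label in entities if label == 'FAC']

# Example output shape from the test loop:
predicted = [('Government', 'ORG'), ('parks', 'FAC')]
print(facilities(predicted))  # ['parks']
```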

Tune the hyperparameters and check what works best for you.

Other approaches to explore:

  1. Train a seq2seq model for custom NER (the Hugging Face transformers library might come in handy).
  2. Use unsupervised NER with BERT or other transformer models.
  3. Recently, large language models have provided state-of-the-art results for NER.
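If you go the seq2seq/transformer route, the character-span annotations shown earlier usually have to be converted to token-level tags first. A minimal whitespace-tokenized BIO conversion (a sketch, not tied to any particular library) might look like:

```python
def to_bio(text, entities):
    """Convert (start, end, label) char spans to per-token BIO tags.

    Tokenization is a naive whitespace split, purely for illustration;
    a real pipeline would use the model's own tokenizer.
    """
    tags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # char offset of this token
        end = start + len(token)
        pos = end
        tag = 'O'
        for s, e, label in entities:
            if start == s:
                tag = 'B-' + label      # token begins the entity
            elif s < start < e:
                tag = 'I-' + label      # token continues the entity
        tags.append(tag)
    return tags

print(to_bio('Government built new parks', [(0, 10, 'ORG'), (21, 26, 'FAC')]))
# ['B-ORG', 'O', 'O', 'B-FAC']
```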

Cheers!

Anant Kumar answered Oct 23 '25

In 2022 you don't necessarily need to train a new model for that.

Instead you can use a large language model like GPT-3, GPT-J, or GPT-NeoX, and perform entity extraction on any sort of complex entity, without even training a new model for it!

See how to install GPT-J and use it in Python here: https://github.com/kingoflolz/mesh-transformer-jax . If this model is too big for your machine, you can also use a smaller one, like this small version of OPT (by Facebook): https://huggingface.co/facebook/opt-125m

In order to understand how to use these models for NER, see this article about few-shot learning: https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html

And also see this TDS article about few-shot learning and NER: https://towardsdatascience.com/advanced-ner-with-gpt-3-and-gpt-j-ce43dc6cdb9c
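The few-shot approach described in those articles amounts to showing the model a handful of annotated examples before the text you want tagged. A minimal prompt builder (purely illustrative; the function name and prompt format are not part of any of the libraries above) could look like:

```python
def build_ner_prompt(examples, query):
    """Build a few-shot NER prompt from (text, entity_list) example pairs."""
    lines = []
    for text, entities in examples:
        lines.append(f'Text: {text}')
        lines.append('Facilities: ' + ', '.join(entities))
        lines.append('---')
    # The model is expected to complete the final 'Facilities:' line.
    lines.append(f'Text: {query}')
    lines.append('Facilities:')
    return '\n'.join(lines)

examples = [
    ('The city built a new stadium and a library.', ['stadium', 'library']),
]
print(build_ner_prompt(examples, 'The government planned new parks and a swimming pool.'))
```

The completion returned by the language model is then parsed back into a list of entities, so no training or fine-tuning is needed.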

Last of all, you might be interested in this video about NER with GPT-NeoX vs spaCy: https://www.youtube.com/watch?v=E-qZDwXpeY0

Julien Salinas answered Oct 23 '25

