I need to identify all the establishments and facilities from a given text using natural language processing and NER.
Example text:
The government planned to build new parks, swimming pool and commercial complex for our town and improve existing housing complex, schools and townhouse.
Expected entities to be identified:
parks, swimming pool, commercial complex, housing complex, school and townhouse
I explored some Python libraries such as spaCy and NLTK, but the results were not great: only 2 entities were identified. I suspect the data needs to be pre-processed properly.
What should I do to improve the results? Are there any other libraries or frameworks better suited to this use case? Is there any way to train our own model using the existing database?
As @Sergey mentioned, you'd need a custom NER model. spaCy comes in really handy for custom NER, provided you have the training data. Here's a straightforward way to do it, using your example:
import spacy
from tqdm import tqdm
import random
train_data = [
    ('Government built new parks', {
        'entities': [(0, 10, 'ORG'),(21, 26, 'FAC')]
    }),
]
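The entity tuples above are character offsets into the text, which are tedious to count by hand. If the facility names are already known (for example from an existing database), the offsets could be generated instead. The following is only a sketch under that assumption: `find_spans` and `make_example` are hypothetical helper names, not part of spaCy, and the plain substring search annotates only the first occurrence of each phrase and does not check word boundaries or overlapping phrases.

```python
def find_spans(text, phrases):
    """Return sorted (start, end, label) character spans for phrases found in text.

    `phrases` is a list of (phrase, label) pairs. Only the first occurrence
    of each phrase is annotated; phrases that do not occur are skipped.
    """
    spans = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start != -1:
            spans.append((start, start + len(phrase), label))
    return sorted(spans)


def make_example(text, phrases):
    """Build one (text, annotations) tuple in the train_data format."""
    return (text, {'entities': find_spans(text, phrases)})


example = make_example(
    'Government built new parks',
    [('Government', 'ORG'), ('parks', 'FAC')],
)
print(example)
```

For the sample sentence this reproduces the hand-written offsets `(0, 10, 'ORG')` and `(21, 26, 'FAC')`.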
Create a Blank Model & Add 'NER' pipe
# Note: this snippet uses the spaCy 2.x API
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner, last=True)
Training Step
n_iter = 100

# Register every entity label from the training data with the NER component
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

# Disable all other pipes so that only NER is trained
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in tqdm(train_data):
            nlp.update(
                [text],         # batch of texts
                [annotations],  # batch of annotations
                drop=0.25,      # dropout rate
                sgd=optimizer,
                losses=losses)
        print(losses)
 
# Test the trained model on the training texts
for text, _ in train_data:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
Tune the hyperparameters and check which settings work best for you.
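To compare settings you need a number to compare, ideally measured on sentences that were not used for training. A minimal sketch of such a check is below. `score_entities` is a hypothetical helper (not a spaCy function), and the predicted and expected lists are made-up values for illustration; in practice the predictions would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
def score_entities(predicted, expected):
    """Return (precision, recall, f1) for two collections of (text, label) pairs."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical model output and gold annotations for one sentence
predicted = [('parks', 'FAC'), ('Government', 'FAC')]
expected = [('parks', 'FAC'), ('Government', 'ORG')]

precision, recall, f1 = score_entities(predicted, expected)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

An entity counts as correct here only if both its text and its label match, so a right span with the wrong label is scored as a miss.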
Other Ways to explore -
Cheers !
In 2022 you don't necessarily need to train a new model for that.
Instead, you can use a large language model like GPT-3, GPT-J, or GPT-NeoX to perform entity extraction on any sort of complex entity.
See how to install GPT-J and use it in Python here: https://github.com/kingoflolz/mesh-transformer-jax . If this model is too big for your machine, you can also use a smaller one, like this small version of OPT (by Facebook): https://huggingface.co/facebook/opt-125m
In order to understand how to use these models for NER, see this article about few-shot learning: https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html
And also see this TDS article about few-shot learning and NER: https://towardsdatascience.com/advanced-ner-with-gpt-3-and-gpt-j-ce43dc6cdb9c
Last of all, you might be interested in this video about NER with GPT-NeoX vs spaCy: https://www.youtube.com/watch?v=E-qZDwXpeY0
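To give a rough idea of what few-shot entity extraction looks like in practice, here is a sketch that only builds the prompt and parses the answer. It does not call any model: how you send the prompt to GPT-J or another model depends on your setup. The function names, the `Text:` / `Facilities:` layout, the `###` separator and the example sentences are all assumptions chosen for illustration, not a format required by any of these models.

```python
def build_prompt(examples, text):
    """Build a few-shot prompt from (sentence, entities) examples plus the new text."""
    blocks = []
    for sentence, entities in examples:
        blocks.append(f"Text: {sentence}\nFacilities: {', '.join(entities)}")
    # Leave the last answer open so the model completes it
    blocks.append(f"Text: {text}\nFacilities:")
    return '\n###\n'.join(blocks)


def parse_completion(completion):
    """Turn the model's comma-separated completion into a list of entity strings."""
    first_line = completion.strip().split('\n')[0]
    return [item.strip() for item in first_line.split(',') if item.strip()]


# Hypothetical few-shot examples; real ones should come from your own data
examples = [
    ('The city opened a new library and a sports hall.', ['library', 'sports hall']),
    ('Residents asked for a hospital near the old market.', ['hospital', 'market']),
]
prompt = build_prompt(
    examples,
    'The government planned to build new parks and a swimming pool.',
)
print(prompt)

# A completion as a model might return it (made up here for illustration)
print(parse_completion(' parks, swimming pool\n###\nText: ...'))
```

Only the first line of the completion is kept, because generative models often continue past the answer and start inventing the next example.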