I want to extract all country and nationality mentions from text using nltk, I used POS tagging to extract all GPE labeled tokens but the results were not satisfying.
 abstract="Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "
  sent = nltk.tokenize.wordpunct_tokenize(abstract)
  pos_tag = nltk.pos_tag(sent)
  nes = nltk.ne_chunk(pos_tag)
  places = []
  for ne in nes:
      if type(ne) is nltk.tree.Tree:
         if (ne.label() == 'GPE'):
            places.append(u' '.join([i[0] for i in ne.leaves()]))
      if len(places) == 0:
          places.append("N/A")
The results obtained are :
['Thyroid', 'Australian', 'Caucasian', 'Graves']
Some are nationalities but others are just nouns.
So what am I doing wrong or is there another way to extract such info?
NLTK has already a pre-trained named entity chunker which can be used using ne_chunk() method in the nltk.chunk module. This method chunks a single sentence into a Tree. Code #1 : Using ne-chunk() on tagged sentence of the treebank_chunk corpus.
The GPE is a Tree object's label from the pre-trained ne_chunk model.
If you want the country names to be extracted, what you need is NER tagger, not POS tagger.
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Check out Stanford NER tagger!
from nltk.tag.stanford import NERTagger
import os
st = NERTagger('../ner-model.ser.gz','../stanford-ner.jar')
tagging = st.tag(text.split()) 
So after the fruitful comments, I digged deeper into different NER tools to find the best in recognizing nationalities and country mentions and found that SPACY has a NORP entity that extracts nationalities efficiently. https://spacy.io/docs/usage/entity-recognition
Here's geograpy that uses NLTK to perform entity extraction. It stores all places and locations as a gazetteer. It then performs a lookup on the gazetteer to fetch relevant places and locations. Look up the docs for more usage details -
from geograpy import extraction
e = extraction.Extractor(text="Thyroid-associated orbitopathy (TO) is an autoimmune-
mediated orbital inflammation that can lead to disfigurement and blindness. 
Multiple genetic loci have been associated with Graves' disease, but the genetic 
basis for TO is largely unknown. This study aimed to identify loci associated with 
TO in individuals with Graves' disease, using a genome-wide association scan 
(GWAS) for the first time to our knowledge in TO.Genome-wide association scan was 
performed on pooled DNA from an Australian Caucasian discovery cohort of 265 
participants with Graves' disease and TO (cases) and 147 patients with Graves' 
disease without TO (controls).")
e.find_entities()
print e.places()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With