What NLTK corpus should I use to identify POS tags for technology-related text?

Question

Hi, below is my code to remove stopwords and get the named entities for text that contains technology-related terms such as java, lan, port, socket, etc.

import nltk
from nltk.corpus import stopwords

def stop_final():
    text = "some technology related text"
    # Load the stopword list once instead of once per token
    stop_words = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    result = [word for word in tokens if word not in stop_words]

    # POS-tag the remaining tokens, then chunk them into named entities
    print(nltk.ne_chunk(nltk.pos_tag(result)))

stop_final()

With the above code I am getting PERSON entities for terms such as lan and socket, so the results are not accurate. How can I get correct named entities for my text?

Thanks

rand_acs · Accepted Answer

Late, but here goes. Also, this is not a solution, more an explanation of the problem and trying to point the reader in the right direction. Hope this helps someone.

NLTK's stopwords module is just a fixed word list, so it will not filter out everything you want to remove. You'll have to assign POS tags to your words and filter out the tag types that are irrelevant to your problem.
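As a rough illustration of filtering by POS tag, here is a minimal sketch. It assumes the input is a list of (word, tag) pairs with Penn Treebank tags, which is what nltk.pos_tag from the question's code returns. The sample pairs are written by hand for illustration, and the real tagger may assign different tags.

```python
# Penn Treebank noun tags: NN, NNS (common nouns) and NNP, NNPS (proper nouns).
NOUN_PREFIX = "NN"

def keep_nouns(tagged_tokens):
    """Return the words whose POS tag marks them as a common or proper noun."""
    return [word for word, tag in tagged_tokens if tag.startswith(NOUN_PREFIX)]

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
sample = [("open", "VB"), ("a", "DT"), ("socket", "NN"), ("on", "IN"),
          ("the", "DT"), ("lan", "NN"), ("using", "VBG"), ("Java", "NNP")]

print(keep_nouns(sample))
```

Note that this keeps every noun, technology related or not, which is exactly the limitation described next.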

However, you are looking for terms that can be both nouns and proper nouns. Therefore, both Java and Ian would get through. The problem is that POS tags do not carry the extra information you are looking for, i.e., that the words should be technology related.

This is an extremely difficult problem to solve with high accuracy, since you'll need to infer context from your text. It is a current research problem in the fields of Natural Language Processing (NLP) and Machine Learning.

Possible solutions may contain some of the following techniques.

You can start building your own stopwords list, adding words to it as you spot them (after filtering by POS tag). This is tedious and error prone, but simpler than the alternatives.
There are methods in NLP called named-entity recognition (NER) that you can look at.
Also, check out Google's Ngram Viewer. They did some interesting things with that.
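The first suggestion could look something like the sketch below. It is only a dictionary lookup, not named-entity recognition proper. CUSTOM_STOPWORDS, TECH_TERMS, label_tokens and the TECH/O labels are all hypothetical names made up for this example, and you would maintain both lists by hand for your own data.

```python
# Hypothetical, hand-maintained lists; extend them as you inspect your output.
CUSTOM_STOPWORDS = {"hi", "please", "thanks"}
TECH_TERMS = {"java", "lan", "port", "socket"}

def label_tokens(tokens):
    """Drop custom stopwords and label known technology terms as TECH."""
    labelled = []
    for token in tokens:
        lowered = token.lower()
        if lowered in CUSTOM_STOPWORDS:
            continue  # word we decided is noise for this problem
        label = "TECH" if lowered in TECH_TERMS else "O"
        labelled.append((token, label))
    return labelled

print(label_tokens(["Hi", "Java", "socket", "on", "LAN", "port"]))
```

The lookup is case-insensitive so that "LAN" and "lan" are treated alike. It cannot tell a technology term from an ordinary word with the same spelling, which is where the context problem described above comes back in.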

Tags:

python

nltk

named-entity-recognition

corpus

user2609542
