
Which NLTK corpus should I use to identify POS tags for technology-related text?

Hi, below is my code to remove stopwords and get the named entities for text that contains technology-related terms such as java, lan, port, socket, etc.

import nltk
from nltk.corpus import stopwords

def stop_final():
    result = []
    text = "some technology related text"
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    for word in nltk.word_tokenize(text):
        if word not in stop_words:
            result.append(word)

    # POS-tag the remaining tokens, then chunk them into named entities
    print(nltk.ne_chunk(nltk.pos_tag(result)))

stop_final()

With the above code I am getting PERSON entities for lan, socket, etc., so the results are not accurate. Please suggest how I can get correct named entities for my text.

Thanks

user2609542 Avatar asked Nov 30 '25 21:11



1 Answer

Late, but here goes. This is not a full solution, more an explanation of the problem and an attempt to point the reader in the right direction. Hope this helps someone.

NLTK's stopwords module uses a fixed word list, so it will not filter out everything you want removed. You'll have to assign POS tags to your words and filter out the tag types that are irrelevant to your problem.
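As a rough illustration of that filtering step, here is a minimal sketch that keeps only noun-tagged words. It is not the asker's code: the `NOUN_TAGS` set and the `keep_nouns` helper are names made up for this example, and the input is a hand-written sample in the `(word, tag)` shape that `nltk.pos_tag` returns, so the snippet runs without any NLTK data installed.

```python
# Penn Treebank tags for common and proper nouns (singular and plural).
# NOUN_TAGS and keep_nouns are hypothetical names used only in this sketch.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_tokens):
    """Return the words whose POS tag is one of the noun tags."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]


# Hand-written sample; in real use this would come from something like
# nltk.pos_tag(nltk.word_tokenize(text)).
tagged = [("The", "DT"), ("socket", "NN"), ("listens", "VBZ"),
          ("on", "IN"), ("a", "DT"), ("LAN", "NNP"), ("port", "NN")]

print(keep_nouns(tagged))  # ['socket', 'LAN', 'port']
```

Which tags count as "irrelevant" depends on your text, so treat the tag set as a starting point rather than a fixed rule.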

However, you are looking for terms that can be both nouns and proper nouns. Therefore, both Java and Ian would get through. The problem is that POS tags do not contain the extra information that you are looking for, i.e., that the words should be technology-related.

This is an extremely difficult problem to solve with high accuracy, since you'll need to infer context from your text. This is a current research problem in the fields of Natural Language Processing (NLP) and Machine Learning.

Possible solutions may involve some of the following techniques:

  • You can start building your own stopwords list, adding words to it as you spot them (after POS tag filtering). This is tedious and error-prone, but simpler than the alternatives.

  • There are NLP methods for named-entity recognition (NER) that you can look at.

  • Also, check out Google's Ngram Viewer. They did some interesting things with that.
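The hand-maintained stopwords list from the first bullet could look something like the sketch below. The words in `CUSTOM_STOPWORDS` and the `drop_custom_stopwords` helper are placeholders invented for this example; the real list would be whatever you collect from your own text.

```python
# Hypothetical, hand-maintained list; extend it as unwanted words show up
# in the output of your POS tag filtering.
CUSTOM_STOPWORDS = {"thing", "example", "way"}


def drop_custom_stopwords(words, stopwords=CUSTOM_STOPWORDS):
    """Remove words found in the custom list, comparing case-insensitively."""
    return [w for w in words if w.lower() not in stopwords]


# Candidate terms, e.g. the words that survived POS tag filtering.
candidates = ["socket", "Thing", "LAN", "example", "port"]

print(drop_custom_stopwords(candidates))  # ['socket', 'LAN', 'port']
```

Lowercasing before the lookup keeps the list short, at the cost of also dropping capitalized forms that might be genuine names.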

rand_acs Avatar answered Dec 02 '25 10:12



