
Determining what a word "is" - categorizing a token

I'm writing a bridge between the user and a search engine, not a search engine. Part of my value added will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide if the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this inference engine.

I'm writing a parser; it should take any given token and assign it a category. Here are some theoretical English examples:

  • "denver" is a USCITY and a PLACENAME
  • "aapl" is a NASDAQSYMBOL and a STOCKTICKERSYMBOL
  • "555 555 5555" is a USPHONENUMBER

I know that each of these cases will most likely require specific handling; however, I'm not sure where to start.

Ideally I'd end up with something simple like:

    queryCategory = magicCategoryFinder( query )

    >print queryCategory
    >"SOMECATEGORY or a list"
Art asked Jan 19 '26 06:01


2 Answers

Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider: "New York City" is a place, but it's three words, two of which ("new" and "city") have other meanings.
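To make the "groups of words" point concrete, here is a minimal sketch of greedy longest-match lookup against a phrase table. The `PLACES` table and the `group_phrases` function are hypothetical names with made-up illustrative data; a real system would load a proper gazetteer.

```python
# Hypothetical gazetteer: tuple of lowercase words -> category.
# Illustrative data only, not a real reference list.
PLACES = {
    ("new", "york", "city"): "PLACENAME",
    ("new", "york"): "PLACENAME",
    ("denver",): "PLACENAME",
}
MAX_LEN = max(len(key) for key in PLACES)


def group_phrases(query):
    """Greedy longest-match: return (phrase, category or None) pairs."""
    tokens = query.lower().split()
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "new york city" beats "new york".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + n])
            if window in PLACES:
                result.append((" ".join(window), PLACES[window]))
                i += n
                break
        else:
            # No phrase starts here; emit the single word uncategorized.
            result.append((tokens[i], None))
            i += 1
    return result


print(group_phrases("pizza in New York City"))
# [('pizza', None), ('in', None), ('new york city', 'PLACENAME')]
```

Note that "new" and "city" on their own stay uncategorized here; only the full phrase matches.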

Also, you have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) a stock symbol for Sun Microsystems. It's also a programming language, a place, and a word associated with coffee. How do you classify it? You'd need to know the context in which it was used.
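One way to live with the ambiguity is to return every candidate category rather than a single answer. The following is only a sketch under stated assumptions: the `LOOKUPS` tables, the `US_PHONE` pattern, and `candidate_categories` are hypothetical, and the category names simply follow the question's style.

```python
import re

# Hypothetical lookup tables; real ones would come from reference data.
LOOKUPS = {
    "STOCKTICKERSYMBOL": {"aapl", "java"},
    "PROGRAMMINGLANGUAGE": {"java", "python"},
    "PLACENAME": {"java", "denver"},
}

# Assumed, deliberately loose pattern for a 10-digit US phone number.
US_PHONE = re.compile(r"\d{3}[ .-]?\d{3}[ .-]?\d{4}")


def candidate_categories(query):
    """Return all categories the query could belong to, sorted by name."""
    q = query.strip().lower()
    found = [name for name, words in LOOKUPS.items() if q in words]
    if US_PHONE.fullmatch(q):
        found.append("USPHONENUMBER")
    return sorted(found)


print(candidate_categories("JAVA"))
# ['PLACENAME', 'PROGRAMMINGLANGUAGE', 'STOCKTICKERSYMBOL']
print(candidate_categories("555 555 5555"))
# ['USPHONENUMBER']
```

This only enumerates the possibilities. Picking the right one for "JAVA" still requires the context the query was made in.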

And if you can solve that problem reliably you can make yourself very wealthy.

What's all this in aid of anyway?

cletus answered Jan 20 '26 23:01


To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language Toolkit, is an excellent Python-based toolkit for experimentation and learning in the field of Natural Language Processing. Whether it's suitable for a given production application may be a different issue, especially if that application requires very high-speed processing on large volumes of data, but you have to walk before you can run!

Alex Martelli answered Jan 20 '26 22:01


