How to use the Stanford word tokenizer in NLTK?

I am looking for a way to use the Stanford word tokenizer in NLTK. I want to use it because the Stanford and NLTK word tokenizers give different results when I compare them. I assume there is a way to do this, just as the Stanford POS Tagger and NER can be used from NLTK.

Is it possible to use the Stanford tokenizer without running a server?

Thanks

Lucky asked Jan 24 '26 07:01

1 Answer

Note: This solution would only work for:

  • NLTK v3.2.5 (v3.2.6 would have an even simpler interface)

  • Stanford CoreNLP (version >= 2016-10-31)

First, make sure Java 8 is properly installed. Once Stanford CoreNLP works on the command line, the Stanford CoreNLP API in NLTK v3.2.5 can be used as follows.

Note: You have to start the CoreNLP server in a terminal BEFORE using the new CoreNLP API in NLTK.

On the terminal:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000
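
Before switching to Python, it can help to confirm that something is actually listening on the port. A minimal sketch using only the standard library, assuming the server runs on localhost and port 9000 as in the command above; `is_port_open` is a hypothetical helper name, not part of NLTK or CoreNLP:

```python
import socket


def is_port_open(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on port 9000.")
    else:
        print("Nothing is listening on port 9000; start the CoreNLP server first.")
```

An open port only shows that some process accepts connections there; it does not prove that the process is CoreNLP or that the models have finished loading.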

In Python:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> st = CoreNLPParser()
>>> tokenized_sent = list(st.tokenize('What is the airspeed of an unladen swallow ?'))
>>> tokenized_sent
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
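
The same tokenization can also be requested from the server directly over HTTP, which may help when checking whether a difference in tokens comes from CoreNLP itself or from the wrapper. A rough sketch using only the standard library, assuming the server from above listens on http://localhost:9000 and returns its usual JSON layout (a `sentences` list whose entries carry `tokens` with a `word` field); the helper names `build_url`, `extract_words`, and `tokenize_via_server` are made up for illustration:

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url="http://localhost:9000", annotators="tokenize,ssplit"):
    """Build the request URL; settings go in a 'properties' query parameter."""
    props = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return base_url + "/?" + query


def extract_words(response):
    """Flatten the parsed JSON response into a list of token strings."""
    return [
        token["word"]
        for sentence in response.get("sentences", [])
        for token in sentence["tokens"]
    ]


def tokenize_via_server(text, base_url="http://localhost:9000"):
    """POST the raw text to a running CoreNLP server and return its tokens."""
    request = urllib.request.Request(build_url(base_url), data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=15) as reply:
        return extract_words(json.loads(reply.read().decode("utf-8")))
```

With the server running, `tokenize_via_server('What is the airspeed of an unladen swallow ?')` should return a token list comparable to the one above. Like the NLTK route, this still needs the server to be running.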
alvas answered Jan 25 '26 19:01
