I am looking for a way to use the Stanford word tokenizer in NLTK. I want to use it because the Stanford and NLTK word tokenizers give different results when I compare them. I assume there is a way to use the Stanford tokenizer, just as we can use the Stanford POS Tagger and NER in NLTK.
Is it possible to use the Stanford tokenizer without running a server?
Thanks
Note: This solution would only work for:
NLTK v3.2.5 (v3.2.6 would have an even simpler interface)
Stanford CoreNLP (version >= 2016-10-31)
First, make sure Java 8 is properly installed. If Stanford CoreNLP works on the command line, the Stanford CoreNLP API in NLTK v3.2.5 can be used as follows.
Note: You have to start the CoreNLP server in a terminal BEFORE using the new CoreNLP API in NLTK.
On the terminal:
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000
In Python:
>>> from nltk.parse.corenlp import CoreNLPParser
>>> st = CoreNLPParser()
>>> tokenized_sent = list(st.tokenize('What is the airspeed of an unladen swallow ?'))
>>> tokenized_sent
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
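Because the tokenizer call only works while the server is running, it can help to check for the server first. The following is a minimal sketch, not part of NLTK: the helper names `corenlp_server_is_up` and `stanford_tokenize` are hypothetical, and the URL assumes the default port 9000 from the command above. The check only confirms that something accepts TCP connections at that address, not that it is actually CoreNLP.

```python
import socket
from urllib.parse import urlparse


def corenlp_server_is_up(url="http://localhost:9000", timeout=2.0):
    """Return True if something accepts TCP connections at the given URL.

    Hypothetical helper: it does not verify that the listener is CoreNLP.
    """
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or 9000  # assumed default, matching -port 9000 above
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stanford_tokenize(text, url="http://localhost:9000"):
    """Tokenize text via the CoreNLP server, failing early if it is unreachable."""
    if not corenlp_server_is_up(url):
        raise RuntimeError(
            "No server reachable at %s; start the CoreNLP server first." % url
        )
    # Imported lazily so the connectivity check also works without NLTK installed.
    from nltk.parse.corenlp import CoreNLPParser

    return list(CoreNLPParser(url=url).tokenize(text))
```

With the server started as shown above, stanford_tokenize('What is the airspeed of an unladen swallow ?') should go through the same CoreNLPParser.tokenize call as the interactive session; without the server it raises a RuntimeError instead of a lower-level connection error.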