I'm building an application using python which involves getting news articles from RSS feeds. As part of my project, I have decided to use boilerpipe in order to extract just the article content from the html page on which the article appears.
Although boilerpipe was originally written for java, it has been ported to python too. You can see its page on github here: https://github.com/misja/python-boilerpipe
The problem is that I get an exception when trying to import it using:
from boilerpipe.extract import Extractor
The error I get is:
Traceback (most recent call last):
File "", line 1, in
File "build\bdist.win32\egg\boilerpipe\extract__init__.py", line 12, in
File "C:\Python26\lib\site-packages\jpype_jclass.py", line 54, in JClass
raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.ExceptionPyRaisable: java.lang.Exception: Class
de.l3s.boilerpipe.sax.HTMLHighlighter not found
What might be causing this problem and how can I fix it?
This worked for me on Mac OS X 10.8.5 with Python 2.7.9.:
pip install JPype1 # to install https://pypi.python.org/pypi/JPype1
pip install charade
git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
sudo python setup.py install
Then you should be able to do in the python console
>>> from boilerpipe.extract import Extractor
>>> extractor = Extractor(extractor='ArticleExtractor', url="http://en.wikipedia.org/wiki/Main_Page")
>>> print extractor.getText()
You are missing boiler pipe java packages install, you can find it here - http://code.google.com/p/boilerpipe/downloads/list
you have only install python boilerpipe wrapper.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With