I am interested in web crawling and have been looking at Solr. Does Solr do web crawling itself? If not, what are the steps to do web crawling?



Does solr do web crawling?

2 Answers

Solr 5+ DOES in fact now do basic web crawling: its bundled post tool (bin/post) includes a simple crawler. See http://lucene.apache.org/solr/

Older Solr versions do not crawl the web on their own. Historically, Solr is a search server that provides full-text search capabilities, and it is built on top of Lucene.

If you need to crawl web pages with a separate project, you have a number of options, including:

Nutch - http://lucene.apache.org/nutch/
Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/
JSpider - http://j-spider.sourceforge.net/
Heritrix - http://crawler.archive.org/
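
For orientation, here is a minimal sketch of the fetch, parse and follow-links loop that crawlers like the ones above implement at much larger scale. It is not how any of those projects is actually written. The names `LinkExtractor`, `fetch`, `crawl` and `max_pages` are made up for illustration, and the sketch ignores robots.txt, politeness delays and content types, all of which a real crawler must handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each successfully fetched URL to its HTML.
    fetch_page is injectable so the loop can be tried without a network.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling `crawl("http://example.com/")` would fetch up to ten pages from that host. For anything beyond experiments, one of the dedicated crawlers listed above is the safer choice.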

If you want to make use of the search facilities provided by Lucene or Solr, you'll need to build indexes from the web crawl results.
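
As a rough sketch of that indexing step, the snippet below turns crawl results (a mapping of URL to page text, however you obtained it) into documents and posts them to Solr's JSON update handler. The base URL `http://localhost:8983/solr`, the core name `crawl`, and the field names `id` and `content` are assumptions for illustration. They must match your own installation and schema. The helper names are hypothetical as well.

```python
import json
from urllib.request import Request, urlopen


def to_solr_docs(pages):
    """Turn {url: text} crawl results into a list of Solr documents.

    The field names "id" and "content" are assumptions; they must match
    fields defined in the target core's schema.
    """
    return [{"id": url, "content": text} for url, text in pages.items()]


def build_update_request(solr_base, core, docs):
    """Build (but do not send) a JSON update request for the given core."""
    url = f"{solr_base}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_pages(solr_base, core, pages):
    """Send crawl results to Solr; requires a running Solr instance."""
    request = build_update_request(solr_base, core, to_solr_docs(pages))
    with urlopen(request, timeout=10) as response:
        return response.status
```

With a local Solr running, `index_pages("http://localhost:8983/solr", "crawl", pages)` would add the pages and commit. Nutch can do this hand-off to Solr for you, which is one reason it is usually paired with Solr.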

See also: Lucene crawler (it needs to build lucene index)


answered Sep 17 '22 12:09

Jon

Solr does not, in and of itself, have a web crawling feature.

Nutch is the "de-facto" crawler (and then some) for Solr.

answered Sep 17 '22 12:09

mjv


Tags:

solr

web-crawler

Murali Krishna Pinjala
