
Does Solr do web crawling?

I am interested in doing web crawling, and I was looking at Solr.

Does Solr do web crawling? If not, what are the steps to do web crawling?

Murali Krishna Pinjala asked Nov 23 '09 05:11



2 Answers

Solr 5+ does in fact do basic web crawling now (its bundled post tool can fetch and index pages starting from a URL): http://lucene.apache.org/solr/

Older Solr versions do not crawl the web on their own. Historically, Solr is a search server that provides full-text search capabilities, built on top of Lucene.

If you need to crawl web pages with a separate tool, you have a number of options, including:

  • Nutch - http://lucene.apache.org/nutch/
  • Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/
  • JSpider - http://j-spider.sourceforge.net/
  • Heritrix - http://crawler.archive.org/
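For a sense of what any of these crawlers does at its core, here is a minimal sketch in Python using only the standard library. It is not how Nutch or Heritrix are implemented; `LinkExtractor`, `fetch`, and `crawl` are hypothetical names made up for this illustration, and a real crawler would also honor robots.txt, throttle requests, and handle non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the seed's host; returns {url: html}."""
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling `crawl("http://example.com/")` would return a dict mapping each visited URL to its HTML. The `fetch_page` parameter is there so the download step can be swapped out, for example in tests.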

If you want to make use of the search facilities provided by Lucene or Solr, you'll need to build indexes from the web crawl results.
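As a rough sketch of that indexing step, the Python below turns crawled HTML into JSON documents and posts them to Solr's JSON update handler. The core name `webpages`, the host and port, and the field names `title` and `content` are assumptions for illustration and would have to match your own Solr setup and schema; `TextExtractor`, `to_solr_doc`, `build_update_request`, and `index_pages` are hypothetical helper names.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def to_solr_doc(url, html):
    """Build one Solr document; the field names are assumed, not required."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,  # use the URL as the unique key
        "title": "".join(parser.title_parts).strip(),
        "content": " ".join(parser.text_parts),
    }


def build_update_request(solr_update_url, docs):
    """Prepare a POST of a JSON array of documents, committing immediately."""
    return Request(
        solr_update_url + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_pages(solr_update_url, pages):
    """Send {url: html} crawl results to Solr; returns the HTTP status code."""
    docs = [to_solr_doc(url, html) for url, html in pages.items()]
    with urlopen(build_update_request(solr_update_url, docs), timeout=30) as response:
        return response.status
```

With a running Solr instance, this might be called as `index_pages("http://localhost:8983/solr/webpages/update", pages)`, where `pages` maps each crawled URL to its HTML.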

See also:

Lucene crawler (it needs to build lucene index)

Jon answered Sep 17 '22 12:09


Solr does not, in and of itself, have a web crawling feature.

Nutch is the de facto crawler (and then some) for Solr.

mjv answered Sep 17 '22 12:09