
org.apache.nutch.crawl.Crawl missing in nutch 1.9 on hadoop 1.2.1

Tags:

hadoop

nutch

I have installed a fully distributed Hadoop 1.2.1 cluster and was trying to integrate Nutch with it using the steps below:

  1. Download apache-nutch-1.9-src.zip
  2. Set the http.agent.name property in nutch-site.xml
  3. Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves into $NUTCH_HOME/conf
  4. Compile using ant runtime
  5. Create urls/seed.txt and put it on HDFS
  6. Edit $NUTCH_HOME/conf/regex-urlfilter.txt

I ran a test crawl with this command:

bin/hadoop jar nutch-1.9.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5

and got this error:

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:153)

I extracted nutch-1.9.job and could not find the Crawl class in org/apache/nutch/crawl.
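
The same check can be done without extracting the archive. The sketch below is only an illustration: `has_class` is a hypothetical helper name, and it assumes the archive listing arrives on standard input with one entry per line (for example from `unzip -Z1` or `jar tf`).

```shell
#!/bin/sh
# has_class (hypothetical helper): succeed if the listing on stdin contains
# the .class entry for the fully qualified class name given as $1.
has_class() {
  # org.apache.nutch.crawl.Crawl -> org/apache/nutch/crawl/Crawl.class
  class_path="$(printf '%s' "$1" | tr '.' '/').class"
  # -x: match the whole line, -F: treat the path as a fixed string
  grep -qxF "$class_path"
}

# Example use against the real job file (not run here):
#   unzip -Z1 nutch-1.9.job | has_class org.apache.nutch.crawl.Crawl \
#     && echo "class present" || echo "class missing"
```

A non-zero exit status means the class is not packaged in the job file, which matches the ClassNotFoundException above.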

Do I need to configure something?

kha asked Dec 04 '25 01:12


1 Answer

Crawl.java was removed in version 1.8. Use the crawl shell script (bin/crawl) for all crawling instead.

The deprecated class o.a.n.crawl.Crawler is still in the code base: https://issues.apache.org/jira/browse/NUTCH-1621
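
As a rough sketch of what the replacement invocation could look like: the snippet below only prints a candidate command line so it can be reviewed before running. It assumes the 1.8/1.9 script takes four positional arguments (seed directory, crawl directory, Solr URL, number of rounds); running `bin/crawl` with no arguments should print the actual usage for your build. `crawl_cmd`, the `crawl` output directory, and the Solr URL are hypothetical placeholders, not values from the question.

```shell
#!/bin/sh
# Assumed usage (verify by running bin/crawl without arguments):
#   crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

SEED_DIR="urls"                          # seed directory on HDFS, from the question
CRAWL_DIR="crawl"                        # hypothetical output directory
SOLR_URL="http://localhost:8983/solr/"   # hypothetical Solr endpoint
ROUNDS=1                                 # one round, in place of -depth 1

# crawl_cmd (hypothetical helper): print the command line without running it.
crawl_cmd() {
  printf '%s %s %s %s %s\n' "bin/crawl" "$1" "$2" "$3" "$4"
}

crawl_cmd "$SEED_DIR" "$CRAWL_DIR" "$SOLR_URL" "$ROUNDS"
```

Since `ant runtime` produces both `runtime/local` and `runtime/deploy`, the printed command would be run from `runtime/deploy` to target the Hadoop cluster.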

Talat answered Dec 07 '25 04:12


