user.nutch.apache.org
Subscription Options
- RSS or Atom: Read-only subscription using a browser or aggregator. This is the recommended option if you don't need to send messages to the list.
- Conventional: All messages are delivered to your mail address, and you can reply. To subscribe, send an email to the list's subscribe address with "subscribe" in the subject line.
- Moderate traffic list: up to 30 messages per day
- This list contains about 9,526 messages, beginning May 2010
- 8 messages added yesterday
Recent threads
Shah, Nishant — 25 May 2013
Hi everyone, I followed an online tutorial which helped me set up Nutch 2.1 with MySQL. I used blobs as storage mechanism for headers, metadata, parseP...
Lewis John Mcgibbney — 24 May 2013*
Hi Chris, Well yes you can set it in the log4j.properties file, however if you are working with anything older than 2.x HEAD then by default the loggin...
Bai Shen — 24 May 2013*
I'm running Nutch 2.1 using HBase. When I run readdb -stats I show that there are 15k unfetched urls. However, when I run generate -topN 1000 I ge...
Martin Aesch — 24 May 2013*
Dear nutchers, I extended the ParseFilter extension point public Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFr...
Lewis John Mcgibbney — 24 May 2013
Hi Kirby, I agree with this. It is not my goal to attack this head on but (I think) it is useful for us to know more about the different components of ...
Lewis John Mcgibbney — 24 May 2013*
Hi All, A really nice aspect of the regex (urlfilter-automaton and urlfilter-regex) plugin implementations in Nutch is that there is a small but v...
Adriana Farina — 23 May 2013*
Hi, I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with HBase 0.90.4 as database. I wrote a Java class from which I run the crawl...
Daniel Hüsch — 23 May 2013*
Hi, we used Nutch 1.x with an authentication for tomcat. We used this: <role rolename="solr_admin"/> <user username="USER...
Nicholas W — 23 May 2013
Dear List, I have been following the instructions at http://wiki.apache.org/nutch/Nutch2Tutorial to see if I can get a nutch installation running with...
Yves S. Garret — 22 May 2013*
Hello, I'm curious. If I wanted to store the URLs for Nutch (version 2.1) in Hive (version 0.9.0) and then store the output from Nutch in Hive, ho...
Christopher Gross — 20 May 2013*
I'm having trouble getting my nutch working. I had it on another server and it was working fine. I migrated it to a new server, and I've been ...
Christopher Gross — 20 May 2013*
I'm attempting to get a crawl working using scripts, but I've been getting a "Skipping <url>; different batch id (null)" error...
Lewis John Mcgibbney — 19 May 2013
Hi All, I submitted a patch to upgrade the Nutch 2.x Branch codebase to the newly released Gora 0.3. The patch can be found here [0]. It would be exce...
Chris Hairfield — 19 May 2013*
Hello everyone, I've been eagerly awaiting some of the functionality slated for 2.x, especially around your work integrating with Elasticsearch. I...
harsh yadav — 18 May 2013
Hello, I am running Nutch 1.6 with Hadoop 0.20.2 but am not able to crawl in Eclipse; every time I get this error: 2013-05-17 00:33:09,376 WARN crawl.Crawl (...
harsh yadav — 18 May 2013*
Hello, I am running Nutch 1.6 with Hadoop 0.20.2 but am not able to crawl in Eclipse; every time I get this error:...
Shah, Nishant — 18 May 2013
Hi everyone, This is my first post so apologies if this is not the correct question to ask. I have followed the wiki tutorial and I am getting the bel...
James Ford — 17 May 2013*
Hello! I am wondering if there is some example crawl script for Nutch 2.1? This includes the Inject/Generate/Fetch/Parse/Update/Index phases. Thanks...
Sourajit Basak — 17 May 2013*
In order to control the number of part files generated, we made a minor change to handle 'numFetchers' argument in the one-step crawl command....
Renato Marroquín Mogrovejo — 16 May 2013*
Hi all, I have been trying to fetch a query similar to: http://www.xyz.com/?page=1 But where the number can vary from 1 to 100. Inside the first page ...