user.nutch.apache.org
Subscription Options
- RSS or Atom: Read-only subscription using a browser or aggregator. This is the recommended way if you don't need to send messages to the list. You can learn more about feed syndication and clients here.
- Conventional: All messages are delivered to your mail address, and you can reply. To subscribe, send an email to the list's subscribe address with "subscribe" in the subject line.
- Moderate traffic list: up to 30 messages per day
- This list contains about 9,500 messages, beginning May 2010
- 3 messages added yesterday
May 2012 - page 1
Alex McLintock — 01 May 2012*
Hi Folks, This is not 100% a Nutch question... and I hate it when other people say "I know my question is off topic....." so why I am doing ...
ML mail — 03 May 2012*
Hi, I would like to index the typical description and keywords HTML meta tags using my stable installation of Nutch 1.4. For that, I have followed the...
Jim Chandler — 03 May 2012*
Greetings, Nutch, Solr, Lucene and everything else is very new to me. I am in the process of trying to change a plugin from an IndexingFilter to a Par...
Xiao Li — 05 May 2012*
Hi Nutch people, I am using Nutch to index a website. I notice that Nutch has crawled some junk webpages, such as http://**************/category/event...
Ali Safdar Kureishy — 06 May 2012*
Hi, I have attached a *Sequence* file with the following format: <url:Text> <data:CrawlDatum> (CrawlDatum is a custom Java type, that cont...
Roberto Gardenier — 07 May 2012*
Hello, I'm currently trying to crawl a site which uses hashtags in the URLs. I don't seem to get any results and I'm hoping I'm just overlooking something...
Mohammad wrk — 07 May 2012*
Hi, Can Nutch be configured to consider the url in the following html snippet as a link? <tr onclick="clickOnLink("http://www.example.com...
Benjamin Heilbrunn — 07 May 2012*
Hi, I'm trying to crawl an intranet platform, which uses https and client certificates for user authentication. Is it possible to configure nutch t...
slavo — 07 May 2012*
Hello, I would like to use nutch to crawl and index some sites in a local network, but the server requires a client certificate. How can I configure nutch ...
nutch.buddy — 08 May 2012*
In a previous discussion about handling of failures in nutch, it was mentioned that a broken segment cannot be fixed and its URLs should be re-cr...
nutch.buddy — 08 May 2012*
Hi, When I merge indexes using Nutch's IndexMerger, I give as input some folders that were created by the indexer and get as output the merged ind...
Ali Safdar Kureishy — 09 May 2012
Hi, I've included both the Nutch and Hadoop mailing lists, since I don't know which one of the two is the root cause for this issue, and it mi...
Tolga — 09 May 2012
Dear Lewis, Thanks a lot for your help. Now my crawler is indexing to Solr properly, as requested. What I did was forget about the other tutorial, and...
Tolga — 09 May 2012*
Hi, This is my very first post to the list. In fact, I heard of nutch only yesterday. Anyway, I'm trying to figure out what path to export CLASSPA...
Michael Erickson — 09 May 2012*
Hello all, I'd like to try to do a focused crawl [1][2] using Nutch. I have a classifier trained on a large corpus of hand-curated data. My goal i...
Stephan Kristyn — 09 May 2012*
Hi, after installing Nutch and Solr I get an HTTP ERROR 400 Problem accessing /solr/select/. Reason: undefined field text...
Markus Jelsma — 10 May 2012
Hi, This is not quite similar but there's a new parameter for the generator in Nutch 1.5 where you can restrict selection by status. Cheers Origin...
James Ford — 10 May 2012*
Hello, I am wondering how to crawl only the domains of an injected seed without adding external URLs to the database? Let's say I have 5k URLs in my see...
Vikas Hazrati — 10 May 2012*
Hi, A few days back there was a discussion on the way to extract data from raw html content ( http://lucene.472066.n3.nabble.com/Getting-th...) and ho...
Vijith — 12 May 2012*
Hi, How can I create a separate project-specific log in addition to the existing log? I am running nutch in deploy mode. Also I want some urls filtered...
nutch.buddy — 13 May 2012*
Hi, I'm running nutch 1.4 on a 5 node cluster. I try to crawl big xlsx files (~60mb). Every time I run nutch, I get an "Error: Java heap spac...
kh3rad — 14 May 2012*
Hi, I want to crawl a website which denies access to all crawlers. This site is one of the top sites in the Alexa ranking and it is a news site. These are my lo...
forwardswing — 15 May 2012*
When I use Nutch 1.4, this error always occurs: 2012-05-14 09:32:23,472 WARN mapred.LocalJobRunner - job_local_0011 java.lang.NullPointerException at ...
ramires — 15 May 2012
Hi, I tried to index a huge URL set with nutch-1.4 with hadoop-0.20. In the reduce part it throws an error like this. I think some character breaks the XML. Any idea...
LEVILLAIN Olivier — 15 May 2012*
Hi, Each time I try to include a word file in my fetch/parse list, I always get the following error: 2012-05-15 15:02:40,319 ERROR tika.TikaParser - E...
ML mail — 15 May 2012*
Hello, I am using Nutch 1.4 with Solr 3.6.0 and would like to get the HTML keywords and description metatags indexed into Solr. On the Nutch side I ha...
Tolga — 16 May 2012*
I'm going nuts. I issued the command bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5, went on to http://localhost:8983/sol...
Ramsel Ruiz — 16 May 2012*
I just checked out nutchgora on Wednesday, and I'm getting exceptions trying to run an initial crawl. I found some threads regarding this issue bu...
Florian Hartl — 16 May 2012*
Hi there, whenever I try to crawl something, I get the error message below. More info: The command "bin/nutch" triggers the desired return, ...
Tolga — 17 May 2012*
Hi, I have been trying for a week. I really want to get a start, so what should I use? curl or nutch? I want to be able to index pdf, xml etc. and sea...
forwardswing — 18 May 2012*
When I use Nutch 1.2, the following error always occurs: dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript main.js...
Matthias Paul — 18 May 2012*
How can I exclude certain mime-types from crawling, for example Word documents? If I have parse-tika in plugin.includes it will parse them. Do I have t...
Mattmann, Chris A (388J) — 19 May 2012*
Hi Folks, A candidate for the Nutch 1.5 release is available at: http://people.apache.org/~mattmann/apache-nut... The release candidate is a zip and t...
haochen — 21 May 2012
Hi all, Currently, there is no API for Google Groups. And now I want to get all the post information in one Google group that I am in. So I tried to us...
Tolga — 21 May 2012
Okay I'm coming to the end of my questions. Do I need to read http://wiki.apache.org/nutch/FAQ#How_do_I_ind... to index files as well on a web sit...
Lewis John Mcgibbney — 21 May 2012*
Hi, When working on some patches for both trunk and Nutchgora branch I ended up doing some code analysis of the generator mappers [0] & [1] respec...
Tolga — 22 May 2012*
Hi, I am crawling my website with this command: bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN...
blunderboy — 22 May 2012*
As I run Apache Nutch 1.4 crawler, I want to store some additional information. I want to store the parent of every URL. For example, I want to crawl ...
Caklovic, Nenad — 23 May 2012*
Hi. I am trying to import some Common Crawl dataset files into Nutch. Those files are in Arc file format. I tried using ArcSegmentCreator tool, but th...
Tolga — 23 May 2012*
Thank you all, especially Lewis, Markus, and whomever I might have forgotten! It is working; I can crawl, index and search. One last question though. ...
Tolga — 23 May 2012*
Hi, I put the lines <mimeType name="application/x-excel"> <plugin id="parse-tika" /> <plugin id="feed" /...
Dustine Rene Bernasor — 24 May 2012*
I have a 3-slave Hadoop cluster and I am performing a crawl on a single website. However, only 1 slave is performing fetching (though the other slave...
xiaodong.han — 24 May 2012
Sent from my Nokia phone -Original Message- From: remi tassing Sent: 2012/04/24 06:57:04 Subject: Re: Good workflow for a regular re-indexing j...
Tolga — 24 May 2012*
Hi, I am crawling a large website, which is our university's. From the logs and some grep'ing, I see that some pdf files were not crawled. Why...
Tolga — 24 May 2012*
Hi, This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTut...
Tolga — 24 May 2012*
Hi, Isn't tika responsible for XML parsing? Because I got this: parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.feed.FeedParse...
vlad.paunescu — 25 May 2012*
Hello, I am currently trying to use Nutch as a web site mirroring tool. To be more explicit, I only need to download the pages, not to index them (I d...
vlad.paunescu — 25 May 2012*
Hello, I am currently trying to use Nutch as a web site mirroring tool. To be more explicit, I only need to download the pages, not to index them (I d...
Dustine Rene Bernasor — 26 May 2012*
Hello I was wondering, would it be possible to run multiple nutch jobs on a single Hadoop cluster at the same time? Like I would perform two crawls at...
Lewis John Mcgibbney — 28 May 2012*
Good Evening Everyone, A candidate for the Apache Nutch 1.5 release is available at: http://people.apache.org/~lewismc/apache-gora... The release cand...