user.nutch.apache.org
- Moderate traffic list: up to 30 messages per day
- This list contains about 9,503 messages, beginning May 2010
- 3 messages added yesterday
September 2010 - page 1
jitendra rajput — 01 Sep 2010*
Hi, I have gone through the tutorial about writing a plugin in the Nutch source code itself. But I want to write a Nutch plugin in my own package with Nutch...
Nayanish Hinge — 02 Sep 2010*
Hi, I was wondering why Nutch has an option of parsing 1. right within the fetcher and 2. also as a separate map-reduce job. In Crawl.java, there is a...
Nayanish Hinge — 02 Sep 2010*
Hi, I have a question; I'm not sure if anybody has already thought about it. What if the Nutch crawler fails during its crawling cycles, could we restart the craw...
onlinespending — 02 Sep 2010*
Hi, I'd like to use Nutch to crawl a very limited set of pages. But as it's crawling I'd like for it to only fetch particular pages and fi...
Nemani, Raj — 02 Sep 2010
All, I am trying to compile Gora in order to compile the latest Nutch trunk. I am doing the following steps to compile Gora as mentioned by Chris Mattmann. 1. git...
Gingras Jean-François — 03 Sep 2010*
Hi, You may want to look for the db.max.outlinks.per.page property in your nutch-[default|site].xml configuration file. The default is 100 outlin...
Mike Pountney — 03 Sep 2010*
I'd like to refetch pages that I know change frequently more often. Does anyone know of a way to set a lower retry interval on a set of pages matc...
Nemani, Raj — 03 Sep 2010*
All, I am crawling a site that is heavy in rtf, txt and pdf documents in addition to pages that embed a lot of images. I am using Nutch 1.1 and runnin...
AJ Chen — 03 Sep 2010*
I'm setting up a small cluster for crawling 3000 domains: 1 master, 3 slaves. Using the default configs, each step (generate, fetch, updatedb) run...
jeff — 04 Sep 2010*
Hi, I asked the same questions once and have yet to receive any response, so I decided to give it another try. Does anyone know how Nutch decides to...
Nayanish Hinge — 05 Sep 2010
Hi, I wanted to understand and use the blockAddr functionality of Nutch lib-http (HttpBase.java), but recently found that the whole code has been removed. Se...
Nayanish Hinge — 05 Sep 2010
Hi, I am referring to Nutch 1.1. From the code it looks like this does not retry immediately, though the comment says so. Any idea? Fetcher.java: --- ca...
André Ricardo — 06 Sep 2010
Hello, I'm using Nutch 1.1 and I've successfully started indexing a field called "artist" from mp3s that contains the mp3 ID3 artist...
brad — 07 Sep 2010
When trying to create an index: bin/nutch index crawl_test/indexes crawl_test/crawldb crawl_test/linkdb crawl_test/segments/* I'm getting the error...
Markus Jelsma — 07 Sep 2010*
Hi, It seems the NUTCH-716 [1] patch does not really produce a multi valued field. Instead, it concatenates multiple subcollection definitions into a ...
Markus Jelsma — 07 Sep 2010*
Hi all, I've got the nutch-2010-07-07_04-49-04 nightly build in which the parser fails but keeps the process running forever! I've tried with ...
Mark Stephenson — 08 Sep 2010*
Hi, I am new to Nutch and I'm trying to understand how it handles redirects. Let's say I want to fetch the following article from the New York...
brad — 08 Sep 2010*
Chris, Thank you for the help. Based on what you said, I decided to install Tika 0.7 and try to parse the file using the Tika app to see what happens. java...
Thumuluri, Sai — 08 Sep 2010*
Hi - I am trying to crawl using Nutch and index content using Solr. I have some custom metadata in my html source files that I need to extract from Nu...
yi zhu — 08 Sep 2010*
I've been running a 2-datanode cluster for a crawling job; now I need to add one new node to the cluster without stopping the cluster. I add a new line in conf/...
Yavuz Selim YILMAZ — 08 Sep 2010*
Hi all, When I try to crawl under Cygwin with bin/nutch urls -dir crawl -depth -2, I get errors like this: Exception in thread "main" java.lang.NoCl...
Nemani, Raj — 08 Sep 2010*
Hi all, I am having a small issue with subcollection plugin. I am using 1.2 branch but I believe I have seen this in 1.1 also. I have the following XM...
Mike Baranczak — 09 Sep 2010
The impression that I got from reading the mailing lists is that the developers are slowly moving to deprecate all the parser plugins in favor of Tika...
Nayanish Hinge — 09 Sep 2010*
Hi, I have a specific use case where I need to know at which level (depth) I fetched the current url. Currently the depth could be figured out from th...
André Ricardo — 09 Sep 2010
Is it possible to search in Nutch by page url, like all the urls that end in .html? How can I search in Nutch by looking only in one field like for in...
Markus Jelsma — 09 Sep 2010*
Hi, I've got something weird again (using Nutch 1.2), a document fetched and parsed by Nutch doesn't comply with my Solr schema, it attempts t...
lonely Feb — 11 Sep 2010*
Hello~ I have just started deploying Nutch on my distributed machines, and an existing Hadoop system has already been deployed on these machines. I wonder how to s...
Nayanish Hinge — 12 Sep 2010*
Hi, Some websites return HTTP 503 when they throttle hits. I see that I need to re-implement HttpBase.java to handle this as a special case and put...
AJ Chen — 12 Sep 2010*
After the crawldb grows big, the percentage of invalid URLs in the generated segments becomes very high. Fetching invalid URLs is wasteful - reducing throug...
Markus Jelsma — 14 Sep 2010*
Hi, Well, today it happened again. I had quite a large fetch list and finally it all failed. I added a hadoop.tmp.dir setting to my nutch-site.xml fil...
— 14 Sep 2010*
hi... I have configured the |solrindex-mapping.xml| (nutch) and configured my solr |schema.xml| and |solrconfig.xml| too...
ramires — 15 Sep 2010
Hi, I use Nutch 1.2 rev 997274. My platform is Debian Lenny 64-bit. I use sun-java6-bin 6.21-1. When I try to fetch 18659 URLs the first cycle is good but s...
eric park — 16 Sep 2010*
Hello, guys I installed nutch-1.0 on a linux machine and everything worked fine. Recently, our system administrator reset linux user account and the f...
Thumuluri, Sai — 16 Sep 2010*
Hi, We are using Nutch to crawl URL entry points and index content using Solr. I have an entry point like http://urlentrypoint.com/searchLinks.aspx wh...
jitendra rajput — 16 Sep 2010*
Hi, Hadoop log file is not getting generated when I run nutch job jar on hadoop EC2 cluster. I am starting job with following command. hadoop jar -DHa...
Nemani, Raj — 16 Sep 2010*
All, Did anybody encounter the following error with parsing PDF files using Tika parser? Online search seems to indicate PDFBox should support this en...
Arkadi.Kosmynin — 17 Sep 2010
Hello, I am announcing the release of Arch 1.2, based on Nutch 1.2. Arch is an extension of Nutch. It is designed for indexing and search of intranets. Man...
ramires — 17 Sep 2010
When I try to fetch ~10k URLs, the second cycle always gets these kinds of messages. # # A fatal error has been detected by the Java Runtime Environment: # # ...
jitendra rajput — 17 Sep 2010*
Hi, I see this exception often in logs when I run job on hadoop EC2. Could anyone please tell me, what does it mean? Exception in thread "Timer t...
Mattmann, Chris A (388J) — 19 Sep 2010*
Hi Folks, I have posted a 2nd release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nut...
Mattmann, Chris A (388J) — 19 Sep 2010
Hi Folks, I have posted a 3rd release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nut...
Markus Jelsma — 20 Sep 2010*
Hi, I'm testing the index-more plug-in but, to my surprise, it is defined as a multi valued field in the shipped Solr schema configuration! Since ...
Campbell, John — 21 Sep 2010
We are running nutch 1.1 and are attempting to crawl pages that are behind Siteminder (NTLM). However, we're getting an error that we can't se...
Rida Benjelloun — 22 Sep 2010
The Constellio team is proud to announce the release of the first Open Source version of Constellio Enterprise Search. It is available for download at...
reinhard schwab — 22 Sep 2010*
I'm pretty surprised that the Solr wiki has set nofollow for robots. <meta name="robots" content="index,nofollow"> I'm curious abo...
Yavuz Selim YILMAZ — 23 Sep 2010*
Hi all, I make a crawl which has such links; .../+location.hostname+ .../</div> .../application/rss+xml .../</text/css .../application/text/c...
webdev1977 — 23 Sep 2010
I would appreciate any help anyone could lend. A very deep crawl of a file system using release candidate 1.2 #4 produces an OutOfMemory error after ab...
Mattmann, Chris A (388J) — 24 Sep 2010*
Hi Folks, I have posted a 4th release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nut...
Yavuz Selim YILMAZ — 24 Sep 2010*
During my crawl, after a depth of 2-3, my CPU runs at 100%. I have tried to understand why but can't find any solution. Any idea?...
AJ Chen — 24 Sep 2010
When fetching with a cluster, the data node fails to write a temp file. Any idea how to fix this? Thanks, -aj 2010-09-24 15:02:59,788 ERROR datanode.Dat...