Orangemail archive

user.nutch.apache.org


September 2010 - page 1
jitendra rajput 01 Sep 2010* Hi, I have gone through the tutorial about writing plugin in Nutch source code itself. But I want to write a nutch plugin in my own package with Nutch...
Nayanish Hinge 02 Sep 2010* Hi, I was wondering, why nutch has an option of parsing 1. right within the fetcher and 2. also as a separate map-reduce job In Crawl.java, There is a...
Nayanish Hinge 02 Sep 2010* Hi, I have a doubt, not sure if anybody has already thought about it What if nutch crawler fails during its crawling cycles, could we restart the craw...
onlinespending 02 Sep 2010* Hi, I'd like to use Nutch to crawl a very limited set of pages. But as it's crawling I'd like for it to only fetch particular pages and fi...
Nemani, Raj 02 Sep 2010 All, I am trying to compile Gora to compile the latest Nutch trunk. I am doing the following steps to compile Gora as mentioned by Chris Mattmann. 1. git...
Gingras Jean-François 03 Sep 2010* Hi, You may want to look for the db.max.outlinks.per.page property in your nutch-[default|site].xml configuration file. The default is 100 outlin...
Mike Pountney 03 Sep 2010* I'd like to refetch pages that I know change frequently more often. Does anyone know of a way to set a lower retry interval on a set of pages matc...
Nemani, Raj 03 Sep 2010* All, I am crawling a site that is heavy in rtf, txt and pdf documents in addition to pages that embed a lot of images. I am using Nutch 1.1 and runnin...
AJ Chen 03 Sep 2010* I'm setting up a small cluster for crawling 3000 domains: 1 master, 3 slaves. Using the default configs, each step (generate, fetch, updatedb) run...
jeff 04 Sep 2010* Hi, I have asked the same questions once and have not yet received any response. So I decided to give it another try. Does anyone know how nutch decides to...
Nayanish Hinge 05 Sep 2010 Hi, I wanted to understand and use the blockAddr functionality of nutch lib-http (HTTPBase.java) But recently found that the whole code is removed. Se...
Nayanish Hinge 05 Sep 2010 Hi, I am referring to Nutch 1.1 From the code it looks like this does not retry immediately though the comment says so. Any idea? Fetcher.java: --- ca...
André Ricardo 06 Sep 2010 Hello, I'm using Nutch 1.1 and I've successfully started indexing a field called "artist" from mp3s that contains the mp3 ID3 artist...
brad 07 Sep 2010 When trying to create a index: bin/nutch index crawl_test/indexes crawl_test/crawldb crawl_test/linkdb crawl_test/segments/* I'm getting the error...
Markus Jelsma 07 Sep 2010* Hi, It seems the NUTCH-716 [1] patch does not really produce a multi valued field. Instead, it concatenates multiple subcollection definitions into a ...
Markus Jelsma 07 Sep 2010* Hi all, I've got the nutch-2010-07-07_04-49-04 nightly build in which the parser fails but keeps the process running forever! I've tried with ...
Nutch redirects. (6 Replies)
Mark Stephenson 08 Sep 2010* Hi, I am new to Nutch and I'm trying to understand how it handles redirects. Let's say I want to fetch the following article from the New York...
brad 08 Sep 2010* Chris, Thank you for the help. Based on what you said, I decided to install Tika 0.7 and try to parse the file using tika app to see what happens. java...
Solr and Nutch (6 Replies)
Thumuluri, Sai 08 Sep 2010* Hi - I am trying to crawl using Nutch and index content using Solr. I have some custom metadata in my html source files that I need to extract from Nu...
yi zhu 08 Sep 2010* I've run a 2-datanode-cluster to do crawling job, now I need to add one new node to the cluster without stop the cluster I add a new line in conf/...
Cygwin (5 Replies)
Yavuz Selim YILMAZ 08 Sep 2010* Hi all, When I try to crawl with the cygwin bin/nutch urls -dir crawl -depth -2 I got such errors; Exception in thread "main" java.lang.NoCl...
Nemani, Raj 08 Sep 2010* Hi all, I am having a small issue with subcollection plugin. I am using 1.2 branch but I believe I have seen this in 1.1 also. I have the following XM...
Mike Baranczak 09 Sep 2010 The impression that I got from reading the mailing lists is that the developers are slowly moving to deprecate all the parser plugins in favor of Tika...
Nayanish Hinge 09 Sep 2010* Hi, I have a specific use case where I need to know at which level (depth) I fetched the current url. Currently the depth could be figured out from th...
André Ricardo 09 Sep 2010 Is it possible to search in Nutch by page url, like all the urls that end in .html? How can I search in Nutch by looking only in one field like for in...
Markus Jelsma 09 Sep 2010* Hi, I've got something weird again (using Nutch 1.2), a document fetched and parsed by Nutch doesn't comply with my Solr schema, it attempts t...
lonely Feb 11 Sep 2010* Hello~ I just start to deploy Nutch on my distributed machines, and an existing Hadoop system has already deployed on these machines, I wonder how to s...
Nayanish Hinge 12 Sep 2010* Hi, Some website return HTTP 503 when they throttle hits. I see that I need to re-implement the HttpBase.java to handle this as a special case and put...
AJ Chen 12 Sep 2010* After crawldb grows big, the percentage of invalid urls in the generated segments becomes very high. Fetching invalid urls is wasteful - reducing throug...
Markus Jelsma 14 Sep 2010* Hi, Well, today it happened again. I had quite a large fetch list and finally it all failed. I added a hadoop.tmp.dir setting to my nutch-site.xml fil...
14 Sep 2010* hi... I have configured the |solrindex-mapping.xml| (nutch) and configured my solr |schema.xml| and |solrconfig.xml| too...
ramires 15 Sep 2010 Hi, I use nutch 1.2 rev 997274. My platform is debian lenny 64-bit. I use sun-java6-bin 6.21-1. When I try to fetch 18659 url first cycle is good but s...
eric park 16 Sep 2010* Hello, guys I installed nutch-1.0 on a linux machine and everything worked fine. Recently, our system administrator reset linux user account and the f...
Crawl depth (1 Reply)
Thumuluri, Sai 16 Sep 2010* Hi, We are using Nutch to crawl URL entry points and index content using Solr. I have an entry point like http://urlentrypoint.com/searchLinks.aspx wh...
jitendra rajput 16 Sep 2010* Hi, Hadoop log file is not getting generated when I run nutch job jar on hadoop EC2 cluster. I am starting job with following command. hadoop jar -DHa...
Nemani, Raj 16 Sep 2010* All, Did anybody encounter the following error with parsing PDF files using Tika parser? Online search seems to indicate PDFBox should support this en...
<Arkadi.Kosmynin 17 Sep 2010 Hello, I am announcing release of Arch 1.2 based on Nutch 1.2. Arch is an extension of Nutch. It is designed for indexing and search of intranets. Man...
ramires 17 Sep 2010 when i try to fetch ~10k url second cycle always get these kind of messages. # # A fatal error has been detected by the Java Runtime Environment: # # ...
jitendra rajput 17 Sep 2010* Hi, I see this exception often in logs when I run job on hadoop EC2. Could anyone please tell me, what does it mean? Exception in thread "Timer t...
Mattmann, Chris A (388J) 19 Sep 2010* Hi Folks, I have posted a 2nd release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nut...
Mattmann, Chris A (388J) 19 Sep 2010 Hi Folks, I have posted a 3rd release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nut...
Markus Jelsma 20 Sep 2010* Hi, I'm testing the index-more plug-in but, to my surprise, it is defined as a multi valued field in the shipped Solr schema configuration! Since ...
Campbell, John 21 Sep 2010 We are running nutch 1.1 and are attempting to crawl pages that are behind Siteminder (NTLM). However, we're getting an error that we can't se...
Rida Benjelloun 22 Sep 2010 The Constellio team is proud to announce the release of the first Open Source version of Constellio Enterprise Search. It is available for download at...
solr wiki (1 Reply)
reinhard schwab 22 Sep 2010* im pretty surprised that solr wiki has set nofollow for robots. <meta name="robots" content="index,nofollow"> im curious abo...
Junk Links (4 Replies)
Yavuz Selim YILMAZ 23 Sep 2010* Hi all, I make a crawl which has such links; .../+location.hostname+ .../</div> .../application/rss+xml .../</text/css .../application/text/c...
webdev1977 23 Sep 2010 I would appreciate any help anyone could lend. A very deep crawl of a file system using release candidate 1.2 #4 produces an OutOfMemory error after ab...
Mattmann, Chris A (388J) 24 Sep 2010* Hi Folks, I have posted a 4th release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nut...
CPU %100 (3 Replies)
Yavuz Selim YILMAZ 24 Sep 2010* During my crawl, after 2-3 depth, my CPU runs %100 Try to understand but can't find any solution. Any idea?...
AJ Chen 24 Sep 2010 when fetching with a cluster, the data node fails to write temp file. any idea how to fix this? thanks, -aj 2010-09-24 15:02:59,788 ERROR datanode.Dat...
