user.nutch.apache.org
- Moderate traffic list: up to 30 messages per day
- This list contains about 9,526 messages, beginning May 2010
- 8 messages added yesterday
September 2011 - page 1
matty2012 — 01 Sep 2011*
I am a newbie to Nutch and Hadoop. I am trying to follow the tutorial here at http://wiki.apache.org/nutch/NutchHadoopTutor.... I got Nutch 1.3 relea...
John R. Brinkema — 01 Sep 2011*
Hi all, I am trying to use URLmeta to inject metadata into documents that I crawl and I am having some problems. First the context: Nutch 1.3 with Solr ...
alex — 01 Sep 2011*
hi all, I get multiple lines in log: 2011-09-01 12:47:52,351 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter ...
Marek Bachmann — 01 Sep 2011*
Hello, I ran into a bad situation. After crawling and parsing about 130k pages in multiple generate/fetch/parse/update cycles, today the parser crashed w...
alex — 01 Sep 2011*
hi all, I need to delete some common content from pages, like menus etc. Is there anything in 1.3 ready to use? what's the best way to handle thi...
alex — 01 Sep 2011*
hi all, what do I do if I need to reparse? is it ok just to delete parse_ directories or there is a better way? thanks....
webdev1977 — 01 Sep 2011*
Do I NEED SSHD for Nutch 1.3 in Pseudo Distributed mode? I am running on a Windows server using Cygwin (obviously :-) I cannot get hadoop/nutch to ru...
Marek Bachmann — 02 Sep 2011*
Hello again, As I ran into trouble with parsing again and again because there are so many strange file types around our university network, I am looking...
alex — 02 Sep 2011*
hi all, All pages which I need to crawl have the same html title tag, like "our products", and the actual title which I want to use is in <link r...
Kaiwii Ho — 03 Sep 2011*
For some questions, I'd like to contact the source code's author directly. Right now, I got his name in the author. But what's the next step...
Gabriele Kahlout — 05 Sep 2011*
Hi, I've just noticed that two search results of indexed data have the same url: http://www.atory.com/dupe_checker_pro/ http://www.atory.com/dupe_...
Elisabeth Adler — 05 Sep 2011*
Hi, I was wondering if anybody can help me on how to configure per-field boosting for documents on Nutch 1.3 and Solr 3.3.0. I'm not sure if this ...
Harris Rappaport — 05 Sep 2011*
Hi, On the wiki (here: https://wiki.apache.org/nutch/Features ) it says that special characters and punctuation are treated as spaces, but it doe...
Kaiwii Ho — 06 Sep 2011
Every time I come to the following code of the ScoringFilters: if (orderedFilters == null) { objectCache.setObject(ScoringFilter.class.getName(), filte...
Danicela nutch — 07 Sep 2011*
Hi, I'm trying to get search suggestions like Google 'Did you mean ?' with indexed data with Solr from Nutch. I added this to my schema.xm...
Danicela nutch — 07 Sep 2011
I tried http://localhost:8983/solr/select/?q=*:* and it returns <result name="response" numFound="7376" start="0"...
Dinçer Kavraal — 07 Sep 2011*
Hi, Is it possible to reject a page from being indexed during the parse operation? I don't even want it to be indexed as a no-content page without any text inf...
Ferdy Galema — 07 Sep 2011*
What is the current status of Nutch 2.0? How different is it from the current 1.x branch in terms of production stableness? We would very much like to...
Peter Harrington — 07 Sep 2011*
I run a Nutch 1.3 crawl with topN=5000 and depth=20. For the first two crawl cycles the Generator and CrawlDb Update phases take ~1 hour. Around the 3r...
aceyin — 08 Sep 2011*
Hi, I ran into a strange problem when I tried to use Nutch-1.3. I list what I did below; I hope someone can help me: 1. Operations A. I tried t...
Markus Jelsma — 08 Sep 2011*
Hi, Any idea why the reducer of the parse job is as slow as a snail taking a detour? There is no processing in the reducer; all it does is copy the keys a...
Joshua J Pavel — 08 Sep 2011*
I ask this from time to time, but I was wondering if anybody would have any insight on how I might be able to get the -stats information (nutch-1.2/bin/nut...
Joshua J Pavel — 09 Sep 2011*
Due to a unique configuration requirement, we move our crawl directories off of the node that generates them to the nodes that serve them. What is the...
Danicela nutch — 12 Sep 2011*
Hi, I'm making a plugin implementing ScoringFilter. I want to modify the fetch order of pages according to their URL. For that, I have to modify t...
Elisabeth Adler — 12 Sep 2011*
Hi, Since I'm relatively new to Nutch/Solr, I was wondering if the following would make sense: Headings in web pages (h1, h2, h3) should be more i...
Danicela nutch — 13 Sep 2011*
I want to prioritize URLs containing "compar", for example, to fetch them first. Maybe I didn't understand how this is intended to work, b...
dpt9876 — 13 Sep 2011*
Hi, the friendly guys at the Solr user group pointed me here. I am wondering if Nutch/Solr will do the following for a project I am working on. I want...
Ferdy Galema — 13 Sep 2011*
Please see the following exception. It looks like it is caused by the _SUCCESS file created by Hadoop when trying to open map files in a permission checke...
Yousef Ourabi — 14 Sep 2011*
Hello: I keep on running into the following exception on both Nutch 1.1 and the nightly build. I seem to get this after 3 or 4 iterations of the fetch...
Danicela nutch — 14 Sep 2011*
Hi, I want to set properties in nutch-site.xml that I can use in a plugin afterwards. For example, I would want to have var = 10 in this file, and then ret...
Markus Jelsma — 14 Sep 2011*
Hi, Would it not be a good idea to patch DomContentUtils with an option not to consider relative outlinks without a base url? This example [1] will cu...
Thomas B — 15 Sep 2011
I've run into a small issue with my deployment of Nutch. Some of the sites I crawl use characters such as æøå in their URLs, and these never se...
Luis Cappa Banda — 15 Sep 2011
Hello. I've downloaded Nutch-1.3 version via Subversion and modified some classes a little. My intention is to integrate with Maven the new artifa...
Anshuman Mor — 15 Sep 2011*
Hi All, I am using nutch 1.3/1.4 and trying to index one lithium based forum site, but I am getting http 302 (Moved temporarily). I have changed the v...
Germán Biozzoli — 15 Sep 2011*
Hi everybody I've a small batch nutch 1.2 based process that is crawling a site and after that insert data into a Solr instance. After updating to...
Luis Cappa Banda — 15 Sep 2011*
Hello. I've downloaded Nutch-1.3 version via Subversion and modified some classes a little. My intention is to integrate with Maven the new artifa...
Arcadius Ahouansou — 15 Sep 2011*
Hello. I am new to Nutch. I need to use Nutch to index data into Solr. Let's say I need to crawl some newspaper search pages and index any article rega...
Michael.Sulistijo — 16 Sep 2011
Hi, I'm fairly new to using Nutch. I have segments of 2,000 URLs, and I'm trying to index them to Solr, but it throws an error message like this: a...
Danicela nutch — 16 Sep 2011
Hi, I need to get the links of a page in a ParseFilter plugin. The parameter types are Content, ParseResult, HTMLMetaTags, DocumentFragment. Are the ...
Danicela nutch — 16 Sep 2011
I found it: parse.get(content.getUrl()).getData().getOutlinks() (parse is of type ParseResult) - Original message - From: Danicela nutch Sent...
Michael.Sulistijo — 16 Sep 2011*
Hi, I'm fairly new to using Nutch. I have segments of 2,000 URLs, and I'm trying to index them to Solr, but it throws an error message like this: a...
Mohammad Anbari — 16 Sep 2011*
I have some URLs that contain many PDF links and I want to index them, but when I start crawling with Nutch 1.3 no PDF link is fetched. Is there any config i...
Nutch User - 1 — 18 Sep 2011
Hi. I created a simple graph and crawled it with Nutch. The graph consists of three HTML files: A.html B/index.html C.html The link structure is as th...
webdev1977 — 19 Sep 2011*
I am having a hard time getting nutch 1.3 to run in a pseudo distributed mode on Windows Server 2008 sp2. I spent a week messing with hadoop version 0...
Jann Forrer — 19 Sep 2011*
Hi I tried to run nutch-1.3 together with solr 3.x according to http://wiki.apache.org/nutch/NutchTutorial. That worked as described but if I try to s...
Markus Jelsma — 19 Sep 2011*
Hi, Another complaint on Nutch's handling of outlinks. Since NUTCH-436 there is better support for embedded segment parameters. This exotic feature...
Markus Jelsma — 19 Sep 2011*
Hi, I sometimes come across relative outlinks in the source that are intended as absolute but where the webmaster or CMS omits the protocol scheme. Th...
alxsss — 21 Sep 2011*
Hello, I wondered if it is possible to restart a failed job in nutch-1.3 version. I have this error org.apache.hadoop.util.DiskChecker$DiskErrorExcept...
Marek Bachmann — 21 Sep 2011*
Hello list, I was wondering why I get these stats from readdb: ./nutch readdb /nutch/global-crawl/crawldb/ -stats CrawlDb statistics start: /nutch/globa...
Oleg Mürk — 21 Sep 2011*
Hello, When I fetch the following links with nutch 1.3: http://blog.mises.org/archives/010450.asp http://feedproxy.google.com/~r/readwriteweb/~... and...