Orangemail archive

user.nutch.apache.org



  • Moderate traffic list: up to 30 messages per day
  • This list contains about 9,526 messages, beginning May 2010
  • 8 messages added yesterday

September 2011 - page 1
matty2012 01 Sep 2011* I am an newbie to Nutch and Hadoop. I am trying to follow the tutorial here at http://wiki.apache.org/nutch/NutchHadoopTutor.... I got Nutch 1.3 relea...
John R. Brinkema 01 Sep 2011* Hi all, I am trying use URLmeta to inject meta data into documents that I crawl and I am having some problems. First the context: Nutch 1.3 with Solr ...
alex 01 Sep 2011* hi all, I get multiple lines in log: 2011-09-01 12:47:52,351 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter ...
Marek Bachmann 01 Sep 2011* Hello, I ran in a bad situation. After crawling and parsing about 130k pages in multiple generate/fetch/parse/update cycles today the parser crashed w...
common content... (1 Reply)
alex 01 Sep 2011* hi all, I need to delete some common content from pages , like menus etc. Is there anything in 1.3 ready to use? what's the best way to handle thi...
how to reparse? (2 Replies)
alex 01 Sep 2011* hi all, what do I do if I need to reparse? is it ok just to delete parse_ directories or there is a better way? thanks....
webdev1977 01 Sep 2011* Do I NEED SSHD for Nutch 1.3 in Pseudo Distributed mode? I am running on a windows server using cygwin (obviously :-) I can not get haddop/nutch to ru...
Marek Bachmann 02 Sep 2011* Hello again, As I ran in trouble with parsing again and again because there are so many strange file types around our university network, I am looking...
alex 02 Sep 2011* hi all, All pages which I need to crawl have same html title tag , like "our products" and actual title which I want to use is in <link r...
Kaiwii Ho 03 Sep 2011* For some questions,I'd like to contact directly to the Source-code's author. Right now,I got his name in the author.But what's the next step...
Gabriele Kahlout 05 Sep 2011* Hi, I've just noticed that two search results of indexed data have the same url: http://www.atory.com/dupe_checker_pro/ http://www.atory.com/dupe_...
Elisabeth Adler 05 Sep 2011* Hi, I was wondering if anybody can help me on how to configure per-field boosting for documents on Nutch 1.3 and Solr 3.3.0. I'm not sure if this ...
Harris Rappaport 05 Sep 2011* Hi, On your the wiki (here: https://wiki.apache.org/nutch/Features ) it says that special characters and punctuation are treated as spaces, but it doe...
Kaiwii Ho 06 Sep 2011 Everytime I come to the following code of the ScoringFilters: if (orderedFilters == null) { objectCache.setObject(ScoringFilter.class.getName(), filte...
Spellcheck with Solr (3 Replies)
Danicela nutch 07 Sep 2011* Hi, I'm trying to get search suggestions like Google 'Did you mean ?' with indexed data with Solr from Nutch. I added this to my schema.xm...
Danicela nutch 07 Sep 2011 I tried http://localhost:8983/solr/select/?q=*:* and it returns <result name="response" numFound="7376" start="0"...
Dinçer Kavraal 07 Sep 2011* Hi, Is it possible to reject a page to be indexed in parse operation? I even don't want it to be indexed as a no-content page without any text inf...
Ferdy Galema 07 Sep 2011* What is the current status of Nutch 2.0? How different is it from the current 1.x branch in terms of production stableness? We would very much like to...
Peter Harrington 07 Sep 2011* I run Nutch1.3 crawl with topN = 5000, and depth=20. For the first two crawl cycles the Generator and CrawlDb Update phases take ~1hour. Around the 3r...
aceyin 08 Sep 2011* Hi : I met some strange problem when i try to use Nutch-1.3 . i list what I did bellow , hope there is someone can help me : 1. Operations A.I tried t...
Markus Jelsma 08 Sep 2011* Hi, Any idea why the reducer of the parse job is as slow as a snail taking a detour? There is no processing in reducer; all it does it copy the keys a...
Joshua J Pavel 08 Sep 2011* I ask this time to time, but I was wondering if anybody would have any insight on how I might be able to get the -stats information (nutch-1.2/bin/nut...
Crawl Directories (1 Reply)
Joshua J Pavel 09 Sep 2011* Due to a unique configuration requirement, we move our crawl directories off of the node that generates them to the nodes that serve them. What is the...
Danicela nutch 12 Sep 2011* Hi, I'm making a plugin implementing ScoringFilter. I want to modify the fetch order of pages according to their URL. For that, I have to modify t...
Elisabeth Adler 12 Sep 2011* Hi, Since I'm relatively new to Nutch/Solr, I was wondering if the following would make sense: Headings in web pages (h1, h2, h3) should be more i...
Danicela nutch 13 Sep 2011* I want to prioritize URLs containing "compar" for exemple to fetch them first. Maybe I didn't understand how this is intended to work, b...
dpt9876 13 Sep 2011* Hi, the friendly guys at the Solr user group pointed me here. I am wondering if Nutch/Solr will do the following for a project I am working on. I want...
Ferdy Galema 13 Sep 2011* Please see following exception. It looks like it is caused by the _SUCCESS file created by Hadoop when trying to open map files in a permission checke...
Yousef Ourabi 14 Sep 2011* Hello: I keep on running into the following exception on both Nutch 1.1 and the nightly build. I seem to get this after 3 or 4 iterations of the fetch...
Danicela nutch 14 Sep 2011* Hi, I want to set properties in nutch-site.xml that I can use in a plugin after. For exemple, I would want to have var = 10 in this file, and then ret...
Markus Jelsma 14 Sep 2011* Hi, Would it not be a good idea to patch DomContentUtils with an option not to consider relative outlinks without a base url? This example [1] will cu...
Thomas B 15 Sep 2011 I've run into a small issue with my deployment of Nutch. Some of the sites I crawl use characters such as æøå in their URLs, and these never se...
Luis Cappa Banda 15 Sep 2011 Hello. I've downloaded Nutch-1.3 version via Subversion and modified some classes a little. My intention is to integrate with Maven the new artifa...
Anshuman Mor 15 Sep 2011* Hi All, I am using nutch 1.3/1.4 and trying to index one lithium based forum site, but I am getting http 302 (Moved temporarily). I have changed the v...
Germán Biozzoli 15 Sep 2011* Hi everybody I've a small batch nutch 1.2 based process that is crawling a site and after that insert data into a Solr instance. After updating to...
Luis Cappa Banda 15 Sep 2011* Hello. I've downloaded Nutch-1.3 version via Subversion and modified some classes a little. My intention is to integrate with Maven the new artifa...
Arcadius Ahouansou 15 Sep 2011* Hello. I am new to Nutch. I need to use Nutch to index data into Solr. Lets say I need to crawl some newspaper search pages and index any article rega...
Michael.Sulistijo 16 Sep 2011 Hi I'm fairly new to use Nutch, I have segments of 2,000 URLs, and I'm trying to index them to solr, but throws out error message like this: a...
Danicela nutch 16 Sep 2011 Hi, I need to get the links of a page in a ParseFilter plugin. The parameters types are Content, ParseResult, HTMLMetaTags, DocumentFragment. Are the ...
Danicela nutch 16 Sep 2011 I found it : parse.get(content.getUrl()).getData().getOutlinks() (parse is of type ParseResult) - Original message - From: Danicela nutch Sent...
Michael.Sulistijo 16 Sep 2011* Hi I'm fairly new to use Nutch, I have segments of 2,000 URLs, and I'm trying to index them to solr, but throws out error message like this: a...
Mohammad Anbari 16 Sep 2011* I have some urls that contain many pdf links and i want to index them but when i start crawling with nutch 1.3 no pdf link fetch,is there any config i...
Nutch User - 1 18 Sep 2011 Hi. I created a simple graph and crawled it with Nutch. The graph consists of three HTML files: A.html B/index.html C.html The link structure is as th...
webdev1977 19 Sep 2011* I am having a hard time getting nutch 1.3 to run in a pseudo distributed mode on Windows Server 2008 sp2. I spent a week messing with hadoop version 0...
Jann Forrer 19 Sep 2011* Hi I tried to run nutch-1.3 together with solr 3.x according to http://wiki.apache.org/nutch/NutchTutorial. That worked as described but if I try to s...
Markus Jelsma 19 Sep 2011* Hi, Another complaint on Nutch' handling of outlinks. Since NUTCH-436 there is better support for embedded segment parameters. This exotic feature...
Markus Jelsma 19 Sep 2011* Hi, I sometimes come across relative outlinks in the source that are intended as absolute but where the webmaster or CMS omits the protocol scheme. Th...
restart a failed job (2 Replies)
alxsss 21 Sep 2011* Hello, I wondered if it is possible to restart a failed job in nutch-1.3 version. I have this error org.apache.hadoop.util.DiskChecker$DiskErrorExcept...
retry count (1 Reply)
Marek Bachmann 21 Sep 2011* Hello list, I was wondering why I get this stats from readdb ./nutch readdb /nutch/global-crawl/crawldb/ -stats CrawlDb statistics start: /nutch/globa...
Oleg Mürk 21 Sep 2011* Hello, When I fetch the following links with nutch 1.3: http://blog.mises.org/archives/010450.asp http://feedproxy.google.com/~r/readwriteweb/~... and...
