user.nutch.apache.org


  • Moderate traffic list: up to 30 messages per day
  • This list contains about 9,498 messages, beginning May 2010

March 2012
nutch crawling (1 Reply)
sanjay87, 01 Mar 2012*: Hi Techies, I am having some queries related to Nutch- the web crawler. I am actually done with Crawling the website and indexing the same in SOLR, bu...
Frank Scholten, 01 Mar 2012: Hi all, I am looking into reusing some existing code for distributed indexing to test a Mahout tool I am working on https://issues.apache.org/jira/bro...
Stany Fargose, 02 Mar 2012*: Hi All, I am working on replacing our current site search with Nutch-Solr. I am very new to this technologies but I like what it's offering. I got...
http.redirect.max (18 Replies)
Rafael Pappert, 02 Mar 2012*: Hello List, is it possible to follow http 301 redirects immediately? I tried to set http.redirect.max to 3 but the page is still not indexed. readdb i...
Rafael Pappert, 02 Mar 2012: Hello List, how do I get the inlinks/outlinks/nodes from hdfs into a plain textfile? I created the webgraph with this command: nutch webgraph -segment...
James Ford, 02 Mar 2012*: Hello, I am having a problem getting nutch to crawl and fetch the initial seedlist only. It seems like nutch tend to skip some urls? Or it does not pa...
Jason Trost, 02 Mar 2012*: For anyone interested... Accumulo and Pig play together now: http://www.covert.io/post/18605091231/accumul... and https://github.com/jt6211/accumulo-p...
alxsss, 03 Mar 2012*: Hello, I need to have different fetch intervals for initial seed urls and urls extracted from them at depth 1. How this can be achieved. I tried -addd...
Markus Jelsma, 04 Mar 2012: If they cannot resolve must either check if they really exist, use some proxy or dns that can resolve or add them to a hosts file. Or don't try to...
dafna, 04 Mar 2012*: Hi, I have an index that was indexed with hadoop & nutch version 1.2. I saw that nutch 1.2 using lucene version 3.0.3. I'm trying to read from...
hadi, 05 Mar 2012*: I have one link with many external link inside it,when the fetching process start many external link failed with: java.net.UnknownHostException, i use...
Dayal, 07 Mar 2012*: Can nutch be used to crawl pages with javascripts in them.Will the nutch crawl efficiently executes the js script and fetch the html file after java s...
Daniel Rosher, 08 Mar 2012*: Hi, We want to have continuous indexing with NutchGora and are wondering what implementation others might already use? Our current thinking is along t...
Multiple parsers (5 Replies)
nutch.buddy, 08 Mar 2012*: Hi I've looked at nutch's code in ParseUtil and it seems that it was designed so only one parses is eventually activated on a single url. What...
webdev1977, 12 Mar 2012*: Is there a guide to optmizing nutch/hadoop for crawling intranet sites? Most of what I need to crawl are large stores of data (databases exposed throu...
webdev1977, 12 Mar 2012*: How would one go about changing the hostnames that a large number of urls point to in both the crawldb as well as the solr index? I tried running the ...
Daniel Bourrion, 13 Mar 2012*: Hi. I'm a french librarian (that explains the bad english coming now... :) ) Newbie on Nutch, that looks exactly what i'm searching for (an op...
HaYa aziz, 13 Mar 2012*: Dear all ,, we know that DOMContentUtils.java (in parse-html plug-in) extract the text from node and save it in sb (StringBuffer), then in will be sav...
kingping, 13 Mar 2012*: All, I have been working with Nutch 1.1 for quite some time now and everthing is working fine, until I came across a site that I am having a ton of tr...
Sudip Datta, 16 Mar 2012*: Hi, I am using Nutch 1.4 and trying to run CrawlDbReader (with -url argument) to find the status of specific urls. Run in isolation the code crashes w...
Mohammad Tambe, 19 Mar 2012*: Hi, I need to execute a custom indexing plugin after the index-more plugin has been executed . So that I can use the Last-Modified field set by index-...
Magnús Skúlason, 19 Mar 2012*: Hi, I have setup nutch to run on a Pseudo-Distributed hadoop cluster (i.e. all running on one machine). Everything works fine except that I only get t...
Rafael Pappert, 19 Mar 2012*: Hello, I'm running nutch 1.4 on an 3 Node Hadoop Cluster and from time to time i got an "alert" that 1 TaskTracker have been blacklisted...
blunderboy, 20 Mar 2012*: Hi all, After crawling the site, I want to create a solrIndex but I am getting the following error: *$ bin/nutch solrindex http://localhost:8983/solr/...
Nutch with Letor (12 Replies)
varunpandeyengg, 20 Mar 2012*: Hey Guys, I am new to Nutch. I am part of a IR research team & need to create a setup where in I need to crawl Microsoft's LETOR Dataset ( htt...
urls won't get crawled (12 Replies)
jepse, 20 Mar 2012*: Hi there, i'm despret with two urls: http://www.lequipe.fr/Football/ http://www.ostsee-zeitung.de/nachrichten Everything seems ok: robots.txt and ...
Meta Tags (11 Replies)
Marek Bachmann, 21 Mar 2012*: Hello again, I want to extract specific meta tag from HTML pages, like: <meta name="uniks-fb" value="fb16" /> But it seems t...
Milica Bogicevic, 21 Mar 2012: Hi, Is it possible to run Nutch crawler using Elastic Map Reduce? I'm running Nutch crawler 1.4 from Java by calling Crawler.java main method with...
webdev1977, 22 Mar 2012*: I have created an application that can detect when files are created/modified/deleted in one of our Windows Share drives and I would like to know if i...
Milica Bogicevic, 23 Mar 2012: Hi, I'm trying to save crawled data ona S3. I am using Nutch 1.4 and Hadoop 0.20.2 and everything works fine on my local machine. When I try to do...
Generator taking time (4 Replies)
James Ford, 23 Mar 2012*: Hello, I am having problems with the Generator step of my crawls. It takes a lot of time compared to indexing and fetching? Right now the generator st...
Jan Riewe, 26 Mar 2012: Hey there, currently i try to debug the dedup results from nutch. There is a page with is exactly the same (compared the HTML with a diff tool) as on ...
Vicente Canhoto, 26 Mar 2012*: Hi there, I'm trying to utilize in Nutch 1.4 a plugin that was made (not by me) for an older version, possibily 1.2 or 1.3. When i tried to build ...
Elisabeth Adler, 27 Mar 2012*: Hi, I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search capabilities to an Intranet. The bit that's indexed is fine, though most of ...
pepe3059, 27 Mar 2012*: Hello, i have some questions, sorry if i'm so noob Is there a way to divide "fetch process" between two or more computers using distinct...
webdev1977, 27 Mar 2012*: I was under the impression that setting topN for crawl cycles would limit the number of items each iteration of the crawl would fetch/parse. However, ...
Elisabeth Adler, 27 Mar 2012*: Hi, I'm using Nutch 1.3 to crawl dynamic pages (JSPs) and indexing them into Solr. With the same settings, I sometimes get more documents indexed,...
George, 27 Mar 2012*: Hello I.m using nutch 9.0 default installation single machine: 2x2.5 quad core 16 GB ram 6 x 1TB sata raid 1 Network 1 gbps. Not using any distributed...
James Ford, 27 Mar 2012*: Hello, I am trying to optimize my crawls as much as possible. The current bottleneck is the step after adding segments to the linkdb, where Nutch is t...
webdev1977, 28 Mar 2012*: I am seeing an issue with crawling html pages that have relative urls embedded in them. I know there is an ongoing issue related to relative urls that...
JohnRodey, 29 Mar 2012*: I am just doing a simple project for my Information Retrieval class. I am currently using nutch to get a bunch of pages and it is indexing and storing...
dspathis, 29 Mar 2012*: Hi, I'm having trouble with the following use case: I use Nutch to crawl a web site and index the pages. When a page is temporarily unavailable (4...
ashish vyas, 30 Mar 2012: Hi, I have setup hadoop cluster(2 nodes) and trying to run nutch crawl on it. Currently in our application we are running Nutch crawl without hadoop a...
