user.nutch.apache.org
Subscription Options
- RSS or Atom: Read-only subscription using a browser or aggregator. This is the recommended way if you don't need to send messages to the list.
- Conventional: All messages are delivered to your mail address, and you can reply. To subscribe, send an email to the list's subscribe address with "subscribe" in the subject line.
- Moderate traffic list: up to 30 messages per day
- This list contains about 9,498 messages, beginning May 2010
- 3 messages added yesterday
March 2012
sanjay87 — 01 Mar 2012*
Hi Techies, I have some questions about Nutch, the web crawler. I am done crawling the website and indexing it in Solr, bu...
Frank Scholten — 01 Mar 2012
Hi all, I am looking into reusing some existing code for distributed indexing to test a Mahout tool I am working on https://issues.apache.org/jira/bro...
Stany Fargose — 02 Mar 2012*
Hi All, I am working on replacing our current site search with Nutch-Solr. I am very new to these technologies but I like what it's offering. I got...
Rafael Pappert — 02 Mar 2012*
Hello List, is it possible to follow http 301 redirects immediately? I tried to set http.redirect.max to 3 but the page is still not indexed. readdb i...
Rafael Pappert — 02 Mar 2012
Hello List, how do I get the inlinks/outlinks/nodes from hdfs into a plain textfile? I created the webgraph with this command: nutch webgraph -segment...
James Ford — 02 Mar 2012*
Hello, I am having a problem getting Nutch to crawl and fetch the initial seed list only. It seems like Nutch tends to skip some URLs? Or it does not pa...
Jason Trost — 02 Mar 2012*
For anyone interested... Accumulo and Pig play together now: http://www.covert.io/post/18605091231/accumul... and https://github.com/jt6211/accumulo-p...
alxsss — 03 Mar 2012*
Hello, I need to have different fetch intervals for initial seed URLs and URLs extracted from them at depth 1. How can this be achieved? I tried -addd...
Markus Jelsma — 04 Mar 2012
If they cannot be resolved, you must either check whether they really exist, use a proxy or DNS that can resolve them, or add them to a hosts file. Or don't try to...
dafna — 04 Mar 2012*
Hi, I have an index that was built with Hadoop & Nutch version 1.2. I saw that Nutch 1.2 uses Lucene version 3.0.3. I'm trying to read from...
hadi — 05 Mar 2012*
I have one link with many external links inside it. When the fetching process starts, many external links fail with: java.net.UnknownHostException, i use...
Dayal — 07 Mar 2012*
Can Nutch be used to crawl pages with JavaScript in them? Will the Nutch crawl efficiently execute the JS script and fetch the HTML file after java s...
Daniel Rosher — 08 Mar 2012*
Hi, We want to have continuous indexing with NutchGora and are wondering what implementation others might already use? Our current thinking is along t...
nutch.buddy — 08 Mar 2012*
Hi, I've looked at Nutch's code in ParseUtil and it seems that it was designed so only one parser is eventually activated on a single URL. What...
webdev1977 — 12 Mar 2012*
Is there a guide to optimizing Nutch/Hadoop for crawling intranet sites? Most of what I need to crawl are large stores of data (databases exposed throu...
webdev1977 — 12 Mar 2012*
How would one go about changing the hostnames that a large number of urls point to in both the crawldb as well as the solr index? I tried running the ...
Daniel Bourrion — 13 Mar 2012*
Hi. I'm a french librarian (that explains the bad english coming now... :) ) Newbie on Nutch, that looks exactly what i'm searching for (an op...
HaYa aziz — 13 Mar 2012*
Dear all, we know that DOMContentUtils.java (in the parse-html plug-in) extracts the text from a node and saves it in sb (StringBuffer), then it will be sav...
kingping — 13 Mar 2012*
All, I have been working with Nutch 1.1 for quite some time now and everything is working fine, until I came across a site that I am having a ton of tr...
Sudip Datta — 16 Mar 2012*
Hi, I am using Nutch 1.4 and trying to run CrawlDbReader (with -url argument) to find the status of specific urls. Run in isolation the code crashes w...
Mohammad Tambe — 19 Mar 2012*
Hi, I need to execute a custom indexing plugin after the index-more plugin has been executed . So that I can use the Last-Modified field set by index-...
Magnús Skúlason — 19 Mar 2012*
Hi, I have setup nutch to run on a Pseudo-Distributed hadoop cluster (i.e. all running on one machine). Everything works fine except that I only get t...
Rafael Pappert — 19 Mar 2012*
Hello, I'm running Nutch 1.4 on a 3-node Hadoop cluster and from time to time I get an "alert" that 1 TaskTracker has been blacklisted...
blunderboy — 20 Mar 2012*
Hi all, After crawling the site, I want to create a solrIndex but I am getting the following error: *$ bin/nutch solrindex http://localhost:8983/solr/...
varunpandeyengg — 20 Mar 2012*
Hey Guys, I am new to Nutch. I am part of an IR research team & need to create a setup wherein I need to crawl Microsoft's LETOR Dataset ( htt...
jepse — 20 Mar 2012*
Hi there, I'm desperate with two URLs: http://www.lequipe.fr/Football/ http://www.ostsee-zeitung.de/nachrichten Everything seems ok: robots.txt and ...
Marek Bachmann — 21 Mar 2012*
Hello again, I want to extract specific meta tag from HTML pages, like: <meta name="uniks-fb" value="fb16" /> But it seems t...
Milica Bogicevic — 21 Mar 2012
Hi, Is it possible to run Nutch crawler using Elastic Map Reduce? I'm running Nutch crawler 1.4 from Java by calling Crawler.java main method with...
webdev1977 — 22 Mar 2012*
I have created an application that can detect when files are created/modified/deleted in one of our Windows Share drives and I would like to know if i...
Milica Bogicevic — 23 Mar 2012
Hi, I'm trying to save crawled data on S3. I am using Nutch 1.4 and Hadoop 0.20.2 and everything works fine on my local machine. When I try to do...
James Ford — 23 Mar 2012*
Hello, I am having problems with the Generator step of my crawls. It takes a lot of time compared to indexing and fetching? Right now the generator st...
Jan Riewe — 26 Mar 2012
Hey there, currently I am trying to debug the dedup results from Nutch. There is a page which is exactly the same (compared the HTML with a diff tool) as on ...
Vicente Canhoto — 26 Mar 2012*
Hi there, I'm trying to use in Nutch 1.4 a plugin that was made (not by me) for an older version, possibly 1.2 or 1.3. When I tried to build ...
Elisabeth Adler — 27 Mar 2012*
Hi, I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search capabilities to an Intranet. The bit that's indexed is fine, though most of ...
pepe3059 — 27 Mar 2012*
Hello, I have some questions, sorry if I'm such a noob. Is there a way to divide the "fetch process" between two or more computers using distinct...
webdev1977 — 27 Mar 2012*
I was under the impression that setting topN for crawl cycles would limit the number of items each iteration of the crawl would fetch/parse. However, ...
Elisabeth Adler — 27 Mar 2012*
Hi, I'm using Nutch 1.3 to crawl dynamic pages (JSPs) and indexing them into Solr. With the same settings, I sometimes get more documents indexed,...
George — 27 Mar 2012*
Hello, I'm using nutch 9.0 default installation single machine: 2x2.5 quad core 16 GB ram 6 x 1TB sata raid 1 Network 1 gbps. Not using any distributed...
James Ford — 27 Mar 2012*
Hello, I am trying to optimize my crawls as much as possible. The current bottleneck is the step after adding segments to the linkdb, where Nutch is t...
webdev1977 — 28 Mar 2012*
I am seeing an issue with crawling html pages that have relative urls embedded in them. I know there is an ongoing issue related to relative urls that...
JohnRodey — 29 Mar 2012*
I am just doing a simple project for my Information Retrieval class. I am currently using nutch to get a bunch of pages and it is indexing and storing...
dspathis — 29 Mar 2012*
Hi, I'm having trouble with the following use case: I use Nutch to crawl a web site and index the pages. When a page is temporarily unavailable (4...
ashish vyas — 30 Mar 2012
Hi, I have set up a Hadoop cluster (2 nodes) and am trying to run a Nutch crawl on it. Currently in our application we are running Nutch crawl without hadoop a...