
user.nutch.apache.org



  • Moderate traffic list: up to 30 messages per day
  • This list contains about 9,497 messages, beginning May 2010


January 2012 - page 1
contacts (02 Jan 2012): Hello everyone, I've just finished testing my plug-in 'language-id-filter' that is used to filter the indexing of documents by language id...
mina (02 Jan 2012): Hi, I set up Nutch 1.3 without Hadoop. When I crawl 4 sites with depth 10 and topN 10000, my /tmp fills up to 100% and the crawl fails. How ca...
contacts (03 Jan 2012): Hello, I spoke too fast. The filter works when I use our beta API for language detection (used for fast testing), but I want to use default Nutch infr...
Continuous Crawling (3 Replies)
Bai Shen (04 Jan 2012): Currently, I'm using a shell script to run my Nutch crawl. It seems to work okay, but it only generates one segment at a time. Does anybody have a...
Xiao Li (04 Jan 2012): Hi, I am debugging Nutch in Eclipse on the Ubuntu platform. I can run the crawler program smoothly. However, when it tries to parse a PDF file, I just get t...
Tim Pease (04 Jan 2012): I've noticed that the mirrors only contain downloadable assets for Nutch 1.4. Is there a location where older versions of Nutch can be downloaded?...
niviksha (05 Jan 2012): Hi all, this is my first post. I've used Lucene extensively in the past, but am just getting my feet wet with Nutch. The problem I have is to use ...
Eddie Drapkin (06 Jan 2012): Is there any way to disable the URL filter plugins in the parse step? I want to only filter at fetch generate time, but it seems that the parse proces...
Crawl only *.*.us (3 Replies)
Waleed (08 Jan 2012): Hello everyone, I am trying to crawl only .us. For example, I want all domains that are in com.us, net.us, etc. Of course I have it all in my seed l...
mina (08 Jan 2012): Hi Markus, I have a problem in Nutch. I want to use stopwords in Nutch. When I crawl sites and use Solr to index them, any word in stopwords.txt can is s...
crawl-javascript (2 Replies)
tahere ganjiyar (08 Jan 2012): I use Nutch 1.4. I want to crawl sites with their JavaScript files. How should I configure Nutch?...
mina (08 Jan 2012): I want to crawl .js files because in .js files I add some links to sites. How can I configure Nutch to crawl .js files? I use Nutch 1.4...
mina (08 Jan 2012): I use Nutch 1.4 and I want to parse .js files because some links are added to sites via .js files. Help me. How can I configure Nutch?...
contacts (09 Jan 2012): Hello, If I want to crawl a set A of pages, and a set B of pages, but using a config(A) for A, and a config(B) for B, which is the suggested 'best...
Dean Pullen (11 Jan 2012): Hi all, I'm upgrading from Nutch 1 to 1.4 and am having problems running invertlinks. Error: LinkDb: org.apache.hadoop.mapred.InvalidInputExceptio...
Elisabeth Adler (11 Jan 2012): Hi, I was wondering what is the best approach to process anchor elements that come with a custom keyword element like this one: <a href="mydoc...
shlomi java (11 Jan 2012): hi Hadoops & Nutchs, I'm trying to run Nutch 1.4 *locally*, on Windows 7, using Hadoop 0.20.203.0. I run with: fs.default.name = D:\fs hadoop....
Isabel Drost (13 Jan 2012): Call for Submission Berlin Buzzwords 2012 - Search, Store, Scale -- June 4 / 5. 2012 The event will comprise presentations on scalable data processing...
Bowen Masco (13 Jan 2012): So I recently did some work adding Elasticsearch indexing to Nutch and creating workflows to run Nutch tasks with Oozie. Our fork is on github: h...
Fetching large files (2 Replies)
Bai Shen (13 Jan 2012): I'm using Nutch in distributed mode. I'm crawling large files (a bunch of videos), and when the fetcher map job goes to merge the spill files i...
Matthew Slade (13 Jan 2012): Hi All, I've managed to set up a Hadoop cluster (0.19.0) using the guide provided at http://wiki.apache.org/hadoop/AmazonEC2 I then attempted to run...
Max Stricker (16 Jan 2012): Hi Mailinglist, I currently need to start the Nutch crawl process from Java, as it should be accessible through a WebApp. I figured out that calling C...
Wilson, Matt (16 Jan 2012): I am attempting to crawl a corporate intranet site and allow it to be searched in Solr. As part of the requirements I have to be able to index certain...
remi tassing (16 Jan 2012): Hello all and compliments of the new season! Apparently there is a bug with relative URLs not just with Nutch but also Sun's class URL: https://is...
remi tassing (16 Jan 2012): Hello all, one of the sites I'm crawling doesn't have the robots.txt file, so I decided to modify RobotRulesParser.java to give it default r...
Arkadi.Kosmynin (17 Jan 2012): Hi, I started having this problem recently. For some reason, I did not have it before, when working with Nutch 1.4 pre-release code. The stack trace w...
remi tassing (18 Jan 2012): Hello guys, After crawling with Nutch I tried pushing the index to Solr but it doesn't work. I'm using Nutch-1.2. Solr-3.4 & 3.5 don...
Cube Agen (18 Jan 2012): Nutch has an argument of urls to init the fetching sites. If I want to get the URLs from a database, how should I do it?...
Dennis Spathis (18 Jan 2012): Hi, The Nutch 1.4 distribution includes - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml) - xercesImpl-2.9.1.jar (under .../runtime...
Dan Cox (18 Jan 2012): Nutch Users, I'm attempting to run a crawl using Nutch 1.4 deployed to Hadoop 1.0 (single server for now). I'm able to crawl using the local r...
Dean Del Ponte (18 Jan 2012): I'm new to Nutch. Currently using version 1.4. I have some URLs which I would like excluded from the crawl. What's the best way to do this? Th...
Embedded Nutch API (3 Replies)
contacts (19 Jan 2012): Hello, We've finished plug-in development (filter by lang-id, it will be proposed on JIRA shortly), now we want to embed the Nutch control in a Ja...
Dean Del Ponte (19 Jan 2012): I have a website and the home page's URL is: http://www.homepage.com I would like to crawl all pages, EXCEPT the home page. What regex expression ...
Waleed (19 Jan 2012): Hello everyone, I have been using Nutch with Hadoop for a while with no problems, but after changing internal & external links in nutch-default.xml, when...
remi tassing (20 Jan 2012): Hi, Let's say my filters in regex-urlfilter.txt weren't well written and I crawled outside my wanted boundaries. Now I noticed it and want to ...
Marek Bachmann (20 Jan 2012): Hi all together, a short question: Does the field "Fetch time" in the crawldb stand for the time when the URL WAS fetched or when it has to be f...
Marek Bachmann (20 Jan 2012): Hello again, I was inspecting the generator because it doesn't deliver all URLs for the fetch list from the crawldb even if I set the addDays atr...
abhayd (21 Jan 2012): Hi, I am crawling some websites which have some meta tags in head. I would like to extract these from HTML and send them to Solr in certain fields. I see...
remi tassing (22 Jan 2012): Hi, Is it safe to run concurrent instances of Nutch on different machines and just merge the segments later on? I believe Hadoop is recommended for th...
José Ignacio Ortiz de Galisteo (23 Jan 2012): Hi. I'm trying to make a unit test of a custom parse plugin. When I load a fixture I have to create a new Content object in order to mock the beha...
Adriana Farina (23 Jan 2012): Hello, I have a problem I'm not able to solve though I've googled around. I have crawled a set of web pages containing documents of different ...
Sameendra Samarawickrama (23 Jan 2012): Hi, I am using Nutch to generate a small dataset of the web, on which I am planning to run a focused crawler later. I did a test crawl of and ...
Denis Sinner (24 Jan 2012): Hello, I have set up a Nutch crawler and am trying to index into a Solr core where information is written by other applications as well. The data gets indexed...
Following .axd urls (6 Replies)
Ian Piper (24 Jan 2012): Hi all, I'd appreciate some guidance... can't seem to find much useful stuff on the web on this. I have set up a Nutch and Solr service that i...
Danicela nutch (24 Jan 2012): Hi, I want to ban some pages from my Solr Index. But this shouldn't be a simple page delete from the index. I want to prevent this page URL from b...
Dan Volfman (24 Jan 2012): I want to crawl a full website in order to index specific file types. Assume I have a web site with many pages, some of them contain links to pdf file...
Markus Jelsma (24 Jan 2012): Hi, We read that "its benefit to cost ratio is very low" [1]. In our experience there is very little cost, so would the benefit be even lowe...
Michael Lissner (25 Jan 2012): Hi, I'm doing some research on what technologies various crawlers support for crawl exclusion. Without installing and figuring out Nutch, I can...
Sudip Datta (25 Jan 2012): Hi, I am using Nutch 1.4 and storing the index in Solr. For scoring, I use the OPIC Scoring filter. My queries are complex weighted queries involving ...
Denis Sinner (27 Jan 2012): Hello, I have set up a Nutch crawler and am trying to index into a Solr core where information is written by other applications as well. The data gets indexed...
