
user.nutch.apache.org



  • Moderate traffic list: up to 30 messages per day
  • This list contains about 9,500 messages, beginning May 2010
  • 3 messages added yesterday

May 2012 - page 1
Alex McLintock 133586682001 May 2012* Hi Folks, This is not 100% a Nutch question... and I hate it when other people say "I know my question is off topic....." so why I am doing ...
ML mail 133603696303 May 2012* Hi, I would like to index the typical description and keywords HTML meta tags using my stable installation of Nutch 1.4. For that, I have followed the...
Jim Chandler 133604996503 May 2012* Greetings, Nutch, Solr, Lucene and everything else is very new to me. I am in the process of trying to change a plugin from an IndexingFilter to a Par...
Xiao Li 133621660605 May 2012* Hi Nutch people, I am using Nutch to index a website. I notice that Nutch has crawled some junk webpages, such as http://**************/category/event...
Ali Safdar Kureishy 133630411006 May 2012* Hi, I have attached a *Sequence* file with the following format: <url:Text> <data:CrawlDatum> (CrawlDatum is a custom Java type, that cont...
Roberto Gardenier 133637641707 May 2012* Hello, I'm currently trying to crawl a site which uses hashtags in the URLs. I don't seem to get any results and I'm hoping I'm just overlooking something...
link without href (1 Reply)
Mohammad wrk 133639969507 May 2012* Hi, Can Nutch be configured to consider the url in the following html snippet as a link? <tr onclick="clickOnLink("http://www.example.com...
Benjamin Heilbrunn 133640032507 May 2012* Hi, I'm trying to crawl an intranet platform, which uses https and client certificates for user authentication. Is it possible to configure Nutch t...
https authentication (3 Replies)
slavo 133640032607 May 2012* Hello, I would like to use Nutch to crawl and index some sites in a local network, but the server requires a client certificate. How can I configure Nutch ...
nutch.buddy 133645756608 May 2012* In a previous discussion about handling of failures in Nutch, it was mentioned that a broken segment cannot be fixed and its URLs should be re-cr...
nutch.buddy 133647403308 May 2012* Hi, When I merge indexes using Nutch's IndexMerger, I give as input some folders that were created by the indexer and get as output the merged ind...
Ali Safdar Kureishy 133655056309 May 2012 Hi, I've included both the Nutch and Hadoop mailing lists, since I don't know which one of the two is the root cause for this issue, and it mi...
Tolga 133656756809 May 2012 Dear Lewis, Thanks a lot for your help. Now my crawler is indexing to Solr properly, as requested. What I did was forget about the other tutorial, and...
CLASSPATH (8 Replies)
Tolga 133657138409 May 2012* Hi, This is my very first post to the list. In fact, I heard of nutch only yesterday. Anyway, I'm trying to figure out what path to export CLASSPA...
Michael Erickson 133658974509 May 2012* Hello all, I'd like to try to do a focused crawl [1][2] using Nutch. I have a classifier trained on a large corpus of hand-curated data. My goal i...
HTTP ERROR 400 (14 Replies)
Stephan Kristyn 133659196109 May 2012* Hi, after installing Nutch and Solr I get a HTTP ERROR 400 Problem accessing /solr/select/. Reason: undefined field text...
Markus Jelsma 133664197610 May 2012 Hi, This is not quite similar but there's a new parameter for the generator in Nutch 1.5 where you can restrict selection by status. Cheers Origin...
James Ford 133664405510 May 2012* Hello, I am wondering how to only crawl the domains of an injected seed without adding external URLs to the database? Let's say I have 5k URLs in my see...
Vikas Hazrati 133664464210 May 2012* Hi, A few days back there was a discussion on the way to extract data from raw html content ( http://lucene.472066.n3.nabble.com/Getting-th...) and ho...
Vijith 133682368712 May 2012* Hi, How can I create a separate project-specific log in addition to the existing log? I am running Nutch in deploy mode. Also I want some URLs filtered...
nutch.buddy 133692484213 May 2012* Hi, I'm running Nutch 1.4 on a 5-node cluster. I try to crawl big xlsx files (~60 MB). Every time I run Nutch, I get an "Error: Java heap spac...
kh3rad 133701178514 May 2012* Hi, I want to crawl a website which denies access to all crawlers. This site is one of the top sites in the Alexa ranking and it is a news site. These are my lo...
forwardswing 133705760315 May 2012* When I use Nutch 1.4, this error always occurs: 2012-05-14 09:32:23,472 WARN mapred.LocalJobRunner - job_local_0011 java.lang.NullPointerException at ...
ramires 133706808615 May 2012 Hi, I tried to index a huge URL set with nutch-1.4 and hadoop-0.20. In the reduce phase it throws an error like this. I think some character breaks the XML. Any idea...
LEVILLAIN Olivier 133709326815 May 2012* Hi, Each time I try to include a word file in my fetch/parse list, I always get the following error: 2012-05-15 15:02:40,319 ERROR tika.TikaParser - E...
ML mail 133710529915 May 2012* Hello, I am using Nutch 1.4 with Solr 3.6.0 and would like to get the HTML keywords and description metatags indexed into Solr. On the Nutch side I ha...
solrindex (1 Reply)
Tolga 133716399316 May 2012* I'm going nuts. I issued the command bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5, went on to http://localhost:8983/sol...
Ramsel Ruiz 133720060116 May 2012* I just checked out nutchgora on Wednesday, and I'm getting exceptions trying to run an initial crawl. I found some threads regarding this issue bu...
Florian Hartl 133720084316 May 2012* Hi there, whenever I try to crawl something, I get the error message below. More info: The command "bin/nutch" triggers the desired return, ...
curl or nutch (1 Reply)
Tolga 133722041917 May 2012* Hi, I have been trying for a week. I really want to get a start, so what should I use? curl or nutch? I want to be able to index pdf, xml etc. and sea...
forwardswing 133733640018 May 2012* When I use Nutch 1.2, the following error always occurs: dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript main.js...
Matthias Paul 133734721218 May 2012* How can I exclude certain MIME types from crawling, for example Word documents? If I have parse-tika in plugin.includes it will parse them. Do I have t...
Mattmann, Chris A (388J) 133745242019 May 2012* Hi Folks, A candidate for the Nutch 1.5 release is available at: http://people.apache.org/~mattmann/apache-nut... The release candidate is a zip and t...
haochen 133758296021 May 2012 Hi all, Currently, there is no API for google group. And now I want to get all the post information in one google group that I am in. So I tried to us...
Tolga 133760134321 May 2012 Okay I'm coming to the end of my questions. Do I need to read http://wiki.apache.org/nutch/FAQ#How_do_I_ind... to index files as well on a web sit...
Lewis John Mcgibbney 133762877421 May 2012* Hi, When working on some patches for both trunk and Nutchgora branch I ended up doing some code analysis of the generator mappers [0] & [1] respec...
Tolga 133768667522 May 2012* Hi, I am crawling my website with this command: bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN...
blunderboy 133768776222 May 2012* As I run Apache Nutch 1.4 crawler, I want to store some additional information. I want to store the parent of every URL. For example, I want to crawl ...
Common Crawl dataset (2 Replies)
Caklovic, Nenad 133776140923 May 2012* Hi. I am trying to import some Common Crawl dataset files into Nutch. Those files are in Arc file format. I tried using ArcSegmentCreator tool, but th...
One last question (3 Replies)
Tolga 133776268423 May 2012* Thank you all, especially Lewis, Markus, and whomever I might have forgotten! It is working; I can crawl, index and search. One last question though. ...
Tolga 133777597423 May 2012* Hi, I put the lines <mimeType name="application/x-excel"> <plugin id="parse-tika" /> <plugin id="feed" /...
Dustine Rene Bernasor 133785608824 May 2012* I have a 3-slave Hadoop cluster and I am performing a crawl on a single website. However, only 1 slave is performing fetching (though the other slave...
xiaodong.han 133785732624 May 2012 Sent from my Nokia phone -Original message- From: remi tassing Sent: 2012/04/24 06:57:04 Subject: Re: Good workflow for a regular re-indexing j...
Tolga 133786146224 May 2012* Hi, I am crawling a large website, which is our university's. From the logs and some grep'ing, I see that some pdf files were not crawled. Why...
HTTP error 400 (19 Replies)
Tolga 133786654224 May 2012* Hi, This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTut...
XML parsing (1 Reply)
Tolga 133788770024 May 2012* Hi, Isn't tika responsible for XML parsing? Because I got this: parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.feed.FeedParse...
vlad.paunescu 133794882725 May 2012* Hello, I am currently trying to use Nutch as a web site mirroring tool. To be more explicit, I only need to download the pages, not to index them (I d...
vlad.paunescu 133795045025 May 2012* Hello, I am currently trying to use Nutch as a web site mirroring tool. To be more explicit, I only need to download the pages, not to index them (I d...
Dustine Rene Bernasor 133800838126 May 2012* Hello I was wondering, would it be possible to run multiple nutch jobs on a single Hadoop cluster at the same time? Like I would perform two crawls at...
Lewis John Mcgibbney 133820083028 May 2012* Good Evening Everyone, A candidate for the Apache Nutch 1.5 release is available at: http://people.apache.org/~lewismc/apache-gora... The release cand...
