Orangemail archive

user.nutch.apache.org


Subscription Options

  • RSS or Atom: Read-only subscription using a browser or aggregator. This is the recommended way if you don't need to send messages to the list.
  • Conventional: All messages are delivered to your mail address, and you can reply. To subscribe, send an email to the list's subscribe address with "subscribe" in the subject line.
  • Moderate traffic list: up to 30 messages per day
  • This list contains about 9,497 messages, beginning May 2010
  • 8 messages added yesterday

May 2010
Renaming segments? (3 Replies)
Joshua J Pavel 1273603805 11 May 2010* Hi everyone! I crawl often, and move my crawl to a different server to serve out the results, replacing the previous crawl's filesystem. This can ...
toocrazymail 1273673465 12 May 2010 hi :) i am trying to using nutch without bin/nutch from my own (java) mojarra 2.0.2 webapp... i am searching at google for examples, but there are no ...
toocrazymail 1273674491 12 May 2010 hi :) i am trying to using nutch without bin/nutch from my own (java) mojarra 2.0.2 webapp... i am searching at google for examples, but there are no ...
Andrzej Bialecki 1273681251 12 May 2010 Hi all, The Nutch site is being moved to a new address: http://nutch.apache.org Currently it's down, but should be up within the next 2 hours - so...
Ilya Kasnacheev 1273780570 13 May 2010 Addresses still point to the old address Which is unfortunate 'cause I'd want to unsubscribe. Never bothered to read the list anyway. How?...
Bradford Stephens 1273794456 13 May 2010 We've heard your feedback from the last meetup: we're having less speakers and more discussion. Yay! http://www.meetup.com/Seattle-Hadoop-HBas...
Hemanth Yamijala 1273826452 14 May 2010* Hi, I have a situation where we have data indexed from two different sources into different indexes. The nature of data indexed is roughly the same. F...
Julien Nioche 1274084394 17 May 2010 A message from the ApacheCon organizers, sorry for cross-posting....
Generating Segments (2 Replies)
Tom Landvoigt 1274106375 17 May 2010* Hi, I generated segments with -topN 1000 but why the fetcher fetches more than 1000 urls? Any ideas? nutch@blub:/nutch/search> ./bin/nutch readseg ...
Michela Becchi 1274128111 17 May 2010 Hello, I am performing a local file system crawling. My problem is the following: all files that contain some hexadecimal characters in the name do no...
Hokanson,Eric 1274129184 17 May 2010* Hello, I currently have a working Nutch 1.0 installation that crawls our website and then dumps the data into a Solr instance that we have. I decided ...
Stefano Cherchi 1274179405 18 May 2010* I'm running Nutch 1.0 on a 4-nodes Hadoop cluster, so all nutch data must reside on the hadoop distributed filesystem rather than on the local fs....
Stefano Cherchi 1274192800 18 May 2010 Thank you Andrzej,I suppose it does actually... I configured the filesystem-related parameters into conf/hadoop-site.xml. Most of the nutch subprocess...
Joshua J Pavel 1274263343 19 May 2010* I would like to recrawl a certain site I admin at a much smaller interval - say, every hour. I've specified my db.fetch.interval.default to be 120...
Regex urlfilter (5 Replies)
Tom Landvoigt 1274300179 19 May 2010* Hi, I have a little problem. In my crawldb are urls like http://blog2.de/fotos/tags/080807/photo/11501... but I don't want to crawl them. So I put...
Mayank Shrivastava 1274332484 20 May 2010* I have created an Index of some webpages using nutch. Now, I need to search this index using Lucene API. Even though I am able to do the search on the...
Stjepan Marjanović 1274333465 20 May 2010 ...
Faruk Berksöz 1274377146 20 May 2010 Hi Everyone, Is there any tool with that we can look at the inside of the crawldb,Segments or any other hadoop files without using readdb,readsegments...
Claus Daldorph Nielsen 1274377558 20 May 2010 Hi, I am new to Nutch and trying to get Nutch to index meta tags from html pages and store them for searching in Solr. The tags are on this form: ...
Michael 1274433244 21 May 2010 Hello, Please, can anyone post the workaround on debian lenny for the permission problem with tomcat. I got the following error from tomcat: Dec 21, 2...
Michael R. 1274560852 22 May 2010* Hello, Please, can anyone post the workaround on debian lenny for the permission problem with tomcat. I got the following error from tomcat: Dec 21, 2...
Artyom Shvedchikov 1274690961 24 May 2010* Hi Nutch community. We are trying to solve such task with the help of nutch: User give to us path on site and number of pages to grab. For example htt...
hareesh 1274765744 25 May 2010 Hi, I was running crawl from the last 5 days. Till yesterday there were no problem during the process. The fetching and updating was happening fine. w...
Hannes Carl Meyer 1274774985 25 May 2010* Hi, is it possible to run nutch in a single virtual machine for intranet crawling? Even inside a Java Application Server? Normally I'm using custo...
PhamDuyHung 1274790733 25 May 2010 Hi All, How can i monitor and run daemon crawling (crawl and recrawl) in all slaves host from master host ? Ex: I have 1 master (master01) and 30 slav...
radiohead118 1274790751 25 May 2010 I'm new to both Lucene and Nutch. I've been stuck on this problem for almost 2 weeks. Please help me. I have a my own search engine using Luce...
John Sherwood 1274790767 25 May 2010 Hi people, I've been banging my head against this problem for two days now. Simply, I want to add a field with the value of a given meta tag. I...
John Sherwood 1274792083 25 May 2010* Hi people, Basically, I want to parse the html docs I'm searching for a meta tag named foo and get its content. What's the best approach for t...
Michael R. 1274816278 25 May 2010* I double checked permissions and also there is no file logging.properties in this directory. I get another error when I add "permission java.secu...
Politeness policies (3 Replies)
Hemanth Yamijala 1274843698 26 May 2010* Hi, This question is about politeness policies. If I have understood correctly, Nutch adheres to politeness policies by ensuring a few things in its c...
Markus Jelsma 1274866827 26 May 2010* Hi, I've got a copy of the nutch-2010-05-11_04-34-41 nightly build because i need Tika to parse JPEG images and that would be in 1.1 as i read som...
Hemanth Yamijala 1274867149 26 May 2010* Hi, We are trying to crawl a site that requires cookies to be set. When this is tried from the browser, the original URL is redirected to a page that ...
Markus Jelsma 1274873477 26 May 2010* Hi, I've got a copy of the nutch-2010-05-11_04-34-41 nightly build because i need Tika to parse JPEG images and that would be in 1.1 as i read som...
Markus Jelsma 1274874357 26 May 2010 Hi, All works very nice but i'd wonder if there is a method to have information from the LinkDB indexed into Solr? It would be splendid if i could...
eric park 1274920241 27 May 2010* Hello guys, I'm trying to get rid of the url injection text file and read the starting urls from oracle database. It seems that nutch is integrate...
radiohead118 1274943290 27 May 2010 I want crawlers to index Thai language using nutch-1.0. (Thai has no space between words!) I looked at plugins/lib-lucene-analyzers. It contains ThaiA...
vinay vaish 1274948734 27 May 2010* Hi, I have been trying to download Nutch for the past one week but all mirror show an empty directory for Nutch. Is this a temporary problem or am i l...
Maxime CHEVALIER 1275053872 28 May 2010 Hi All, I'm trying to limit the bin/nutch generate for 1 Million of unfetched urls. It'is possible to do this with nutch? What's the best ...
Grant Ingersoll 1275074989 28 May 2010* If you are planning on submitting for ApacheCon, you have until Friday to do so See the CFP at http://blogs.apache.org/conferences/date/2010......
Claus Daldorph Nielsen 1275077145 28 May 2010* Hi, I am new to Nutch and trying to get Nutch to index meta tags from html pages and store them for searching in Solr. The tags are on this form: ...
