nutch-user.lucene.apache.org



nutch crawl issue

matthew a. grisius | Wed, 28 Apr 2010 04:40:42 +0000 (UTC)
using Nutch nightly build nutch-2010-04-27_04-00-28:

I am trying to bin/nutch crawl a single html file generated by javadoc
and no links are followed. I verified this with bin/nutch readdb and
bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
seed doc specified is processed.

I searched and reviewed the nutch-user archive and tried several
different settings but none of the settings appear to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and
produce tika-app-0.7.jar to command line extract information about the
html javadoc file. I am not familiar w/ tika but the command line
version doesn't return any metadata, e.g. no 'src=' links from the html
'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
nutch uses tika and maybe it's not related . . .

Has anyone crawled javadoc files or have any suggestions? Thanks.

-m.
matthew a. grisius | Wed, 28 Apr 2010 14:03:31 +0000 (UTC)
My subject should've been clearer, e.g. it should've read Nutch 1.1
nightly build crawl issue.

Also, I verified that Nutch 1.0 successfully crawls the javadoc html
file. The result can be checked with luke-1.0.1 and searched from the
command line with bin/nutch org.apache.nutch.searcher.NutchBean java
matthew a. grisius | Thu, 29 Apr 2010 16:04:12 +0000 (UTC)
In nutch-site.xml I modified plugin.includes:

parse-(html) works
parse-(tika) does not

I also need to parse PDFs, so I need both features. I tried
parse-(html|tika) to see if html would be selected before tika, and that
did not work.
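
For reference, plugin.includes is a regular expression over plugin ids, so as far as I know the alternation only decides which parser plugins get loaded, not which one is tried first for a given MIME type. A minimal sketch of the override in conf/nutch-site.xml follows. Only the parse-(...) group comes from this thread; the remaining entries in the value are assumed defaults and should be copied from your own nutch-default.xml instead.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Only the parse-(html|tika) group is taken from the thread; every
         other entry is an assumed default and may differ in your build. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Which of the loaded parsers handles text/html is decided separately, in conf/parse-plugins.xml.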
arpit khurdiya | Thu, 29 Apr 2010 16:28:10 +0000 (UTC)
If you are using the nightly build, did you make the same change in
parse-plugins.xml? Uncomment this:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

Hopefully this helps.
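
A sketch of how that fragment might sit inside conf/parse-plugins.xml. Only the text/html block comes from this thread. The catch-all mapping to parse-tika and the alias entries are assumptions about what the nightly build ships, so compare them against your own copy of the file before relying on them.

```xml
<parse-plugins>
  <!-- Assumed default: any MIME type without its own entry goes to parse-tika. -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- The block from this thread, uncommented so HTML goes to parse-html. -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <aliases>
    <!-- Assumed alias entries: they map plugin ids to parser extension ids. -->
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

The mapping only takes effect if parse-html is also matched by plugin.includes in nutch-site.xml.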
Julien Nioche | Thu, 29 Apr 2010 17:36:58 +0000 (UTC)
Hi Matthew,

There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-37...) that could explain the
differences between parse-html and parse-tika. Note that you can specify
*parse-(html|pdf)* in order to get both HTML and PDF files.

Could you please open an issue in JIRA
(https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks

Julien
matthew a. grisius | Sun, 02 May 2010 03:14:24 +0000 (UTC)
Hi Julien,

On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote:
> There is an open issue with Tika (e.g.
> https://issues.apache.org/jira/browse/TIKA-37...) that could explain the
> differences between parse-html and parse-tika. Note that you can specify
> *parse-(html|pdf)* in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
PDFs, but has problems with some html. Nutch 1.1 includes more current
PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

> Could you please open an issue in JIRA
> (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
> trying to process? I'll have a look and see if it is related to TIKA-379.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-8...
with the attached file.

Thanks.

-m.
Mattmann, Chris A (388J) | Sun, 02 May 2010 04:06:57 +0000 (UTC)
Hi Matthew,

> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> PDFs, but has problems with some html. Nutch 1.1 includes more current
> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

In conf/parse-plugins.xml.

* try your crawl again.

> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-8...
> with the attached file

Thanks! Let me know what happens after you uncomment the line above.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
matthew a. grisius | Mon, 03 May 2010 16:06:01 +0000 (UTC)
Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!

-m.
Mattmann, Chris A (388J) | Mon, 03 May 2010 16:24:46 +0000 (UTC)
Hi Matthew,

Awesome! Glad it worked. Now my next question: how often are you seeing
that parse-tika doesn't work on HTML files? Is it all HTML that you are
trying to process? Or just some of them? Or particular ones (categories of
them)? The reason I ask is that I'm trying to determine whether I should
commit the update below to 1.1 so it goes out with the 1.1 RC, and whether
it's a systematic thing versus an exception.

Let me know and thanks!

Cheers,
Chris
matthew a. grisius | Wed, 05 May 2010 04:08:38 +0000 (UTC)
Hi Chris,

It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
and/or javascript. Using the parse-html suggested work around I am able
to process my simple test cases such as javadoc which does include
simple embedded javascript (of course I can't verify that it is actually
parsing it though). I expanded my testing to include two more complex
examples that heavily use HTML FRAMESET/FRAME and more complex
javascript:

134 mb, 11,269 files
1.9 gb, 133,978 files

They both fail at the top level with similar errors such as:

fetching
http://192.168.1.101:8080/technical/general/C...
fetching
http://192.168.1.101:8080/technical/general/C...
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing:
http://192.168.1.101:8080/technical/general/C... UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue:
org.apache.nutch.fetcher.Fetcher$FetchItem@1532fc
fetch of
http://192.168.1.101:8080/technical/general/C... failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2

I tried several property settings to mimic the previous work around and
could not solve it. Any suggestions?

So, I'm not sure how to categorize the issues more accurately. I have
many javadoc sets and lots of simple HTML that will now parse, but I
have other examples such as the two mentioned above that won't parse and
therefore can't be crawled. It seems to me to be systematic rather than
exceptional. I cannot believe that I'm the only one who will experience
these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
for asking.

-m.
Mattmann, Chris A (388J) | Wed, 05 May 2010 04:51:02 +0000 (UTC)
Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
Julien’s patch and see if there is a way to get it committed sooner rather
than later.

One way to help me do that: since you already have an environment and set
of use cases where this is reproducible, can you apply TIKA-379 to a local
checkout of tika trunk (I’ll show you how) and then let me know if that
fixes parse-tika for you?

Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/t... ./tika
cd tika
wget "http://bit.ly/bXeLkf" (if you don't have SSL support, then manually
download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab the tika-core and tika-parsers jars out of their respective
tika-core/target and tika-parsers/target directories and drop them in your
parse-tika/lib folder, replacing the originals. Then, try your nutch crawl
again.

See if that works. In the meanwhile, I'll inspect Julien's patch.

Thanks!

Cheers,
Chris
matthew a. grisius | Thu, 06 May 2010 05:02:05 +0000 (UTC)
Hi Chris,

The 'mvn install package' step produced this error for each
target/maven-shared-archive-resources/... file:

...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/NOTICE~
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/DEPENDENCIES~
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/LICENSE~
[ERROR] Error(s) found in bundle configuration
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Error(s) found in bundle configuration

[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1 minute 24 seconds
[INFO] Finished at: Wed May 05 23:38:56 EDT 2010
[INFO] Final Memory: 40M/271M
[INFO] ------------------------------------------------------------------------

Assuming this was the right thing to do, I renamed each file to match
the missing filename, e.g. renamed "DEPENDENCIES" to "DEPENDENCIES~" (and
likewise NOTICE and LICENSE) in each 'target', and re-ran the build to
generate the new jars, which produced this:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Apache Tika parent .................................... SUCCESS [2.261s]
[INFO] Apache Tika core ...................................... SUCCESS [14.429s]
[INFO] Apache Tika parsers ................................... SUCCESS [32.370s]
[INFO] Apache Tika application ............................... SUCCESS [34.179s]
[INFO] Apache Tika OSGi bundle ............................... SUCCESS [16.081s]
[INFO] Apache Tika ........................................... SUCCESS [0.237s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO]
Julien Nioche | Wed, 05 May 2010 08:36:52 +0000 (UTC)
Hi Matthew,

As you can see from the error messages Tika does not know how to parse
javascript. There is a legacy javascript parser in Nutch which you can
activate in the usual way i.e. specify parse-js in plugin.includes. It
generates a lot of spurious URLs but you should give it a try and see if it
gives you the outlinks you expect. I think there have been quite a few
discussions about javascript processing in the nutch archives.
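
A minimal sketch of what "the usual way" could look like in conf/nutch-site.xml, assuming parse-js is simply added to the parser alternation next to the plugins already discussed in this thread. Everything in the value other than the parse-(...) group is an assumed default. Depending on the build, a mimeType entry for the JavaScript content type in conf/parse-plugins.xml may be needed as well.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-js added to the parser group; all other entries are assumed
         defaults and should be taken from your own nutch-default.xml. -->
    <value>protocol-http|urlfilter-regex|parse-(html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```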

BTW a good practice is to separate the fetching from the parsing step, so
that if the parsing fails you won't need to refetch the URLs. That can be
done if you call the fetch and parse commands (and not the all-in-one crawl
command) and specify -noparse while fetching.
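
The same separation can also be expressed in configuration. A minimal sketch, assuming the fetcher.parse property from nutch-default.xml is available in your build: with it set to false, the fetch step should store the raw content only, and the segment is parsed afterwards with the separate parse command.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- false: fetch without parsing, then parse the segment in its own step -->
    <value>false</value>
  </property>
</configuration>
```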

HTH

Julien



Phil Barnett | Sat, 01 May 2010 06:31:03 +0000 (UTC)
This sounds exactly like what I have been experiencing.