
solr-user.lucene.apache.org



EdgeNGramTokenFilter, term position?

Ryan McKinley, Sun, 16 Sep 2007 07:09:05 +0000 (UTC)
Should the EdgeNGramFilter use the same term position for the ngrams 
within a single token?

As is, the EdgeNGramTokenFilter increments the term position for each 
n-gram it emits.  In analysis.jsp, with the input "hello", I get:

term position 	1	2	3	4	5
term text 	h	he	hel	hell	hello
term type 	word	word	word	word	word
start,end 	0,1	0,2	0,3	0,4	0,5


I would expect something more like what is generated from SOLR-357:

term position 	1	1	1	1	1
term text 	hello	hell	hel	he	h
term type 	word	prefix	prefix	prefix	prefix
start,end 	0,5	0,4	0,3	0,2	0,1
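
To make the two layouts concrete, here is a minimal sketch in plain Python. It is not the Lucene or Solr implementation; `edge_ngrams`, `Token`, and the `stacked` flag are hypothetical names used only for illustration. It assumes front-edge grams of every length from 1 up to the token length, and it does not reproduce the "prefix" token type or the longest-first ordering shown for SOLR-357.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    position: int  # absolute term position, 1-based as shown in analysis.jsp
    start: int     # start offset in the original text
    end: int       # end offset in the original text


def edge_ngrams(token: str, position: int = 1, start_offset: int = 0,
                stacked: bool = False) -> List[Token]:
    """Return the front-edge n-grams of `token`.

    stacked=False: every gram advances the position by one
                   (the behaviour reported above).
    stacked=True:  every gram shares the position of the source token
                   (the behaviour asked for).
    """
    grams = []
    for n in range(1, len(token) + 1):
        pos = position if stacked else position + n - 1
        grams.append(Token(token[:n], pos, start_offset, start_offset + n))
    return grams


sequential = edge_ngrams("hello")
overlapping = edge_ngrams("hello", stacked=True)
```

Only the position differs between the two results; the term text and offsets are identical in both.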

This seems like it would affect slop queries, but I don't really 
understand them yet.
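
A rough way to see the effect on slop is to compare term positions for the text "hello world" under both layouts. The sketch below is plain Python, not Lucene code. It uses a simplified model in which two terms appearing in order need a slop equal to the number of positions between them; real phrase matching is more general. `index_positions` and `min_slop_in_order` are hypothetical helpers, and the sequential layout assumes the position keeps advancing by one per gram across word boundaries.

```python
from typing import Dict, List, Optional


def index_positions(words: List[str], stacked: bool) -> Dict[str, List[int]]:
    """Map each front-edge n-gram to the term positions where it occurs."""
    positions: Dict[str, List[int]] = {}
    next_pos = 1
    for word in words:
        for n in range(1, len(word) + 1):
            pos = next_pos if stacked else next_pos + n - 1
            positions.setdefault(word[:n], []).append(pos)
        # Stacked grams consume one position per word; sequential grams
        # consume one position per gram.
        next_pos += 1 if stacked else len(word)
    return positions


def min_slop_in_order(positions: Dict[str, List[int]],
                      first: str, second: str) -> Optional[int]:
    """Smallest slop that lets `first` be followed by `second` (simplified model)."""
    gaps = [b - a - 1
            for a in positions.get(first, [])
            for b in positions.get(second, [])
            if b > a]
    return min(gaps) if gaps else None


words = ["hello", "world"]
slop_sequential = min_slop_in_order(index_positions(words, stacked=False), "hello", "world")
slop_stacked = min_slop_in_order(index_positions(words, stacked=True), "hello", "world")
```

Under these assumptions, the sequential layout puts "hello" at position 5 and "world" at position 10, so the phrase "hello world" only matches with a slop of 4 or more. With stacked positions the two terms sit at 1 and 2 and match as an exact phrase.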

thanks
ryan
Chris Hostetter, Mon, 17 Sep 2007 20:36:22 +0000 (UTC)
: Should the EdgeNGramFilter use the same term position for the ngrams within a
: single token?

i can see the argument going both ways ... imagine a hypothetical 
CharSplitterTokenFilter that replaces each token in the stream with 
one token per character in the original token (ie: "hello" becomes 
h,e,l,l,o) ... should those tokens all have the same position?  they have a 
logical ordered flow to them, so in theory they are sequential ... but 
they did occupy the same "space" in the original token stream.

when in doubt: make it an option



-Hoss
Yonik Seeley, Mon, 17 Sep 2007 20:40:08 +0000 (UTC)
On 9/16/07, Ryan McKinley wrote:
> Should the EdgeNGramFilter use the same term position for the ngrams
> within a single token?

It feels like that is the right approach.  I don't see value in having
them sequential, and I can think of uses for having them overlap.

-Yonik