solr-user.lucene.apache.org


Need tokenization that finds part of stringvalue

PeterKerk Tue, 28 Feb 2012 21:05:39 +0000 (UTC)
I have the following in my schema.xml

<field name="title" type="text_ws" indexed="true" stored="true"/>
<field name="title_search" type="text" indexed="true" stored="true"/>


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


I want to search on field "title".
Now my field title holds the value "great smartphone".
If I search for "smartphone", the item is found. But I also want the item to
be found when I search for "great" or "phone", and that doesn't work.
I have been playing around with the tokenizer test function, but have failed
to find the definition for the "text" fieldtype I need.
Help? :)
Erick Erickson Thu, 01 Mar 2012 13:53:57 +0000 (UTC)
Right, there's nothing in Solr that I know of that'll help here. How would
a tokenizer understand that "smartphone" should be "smart" "phone"?
There's no general solution for this issue.

You can do domain-specific solutions with synonyms for instance, or
some other word list that contains terms you're interested in, entries
like smartphone => smart phone
but that has the obvious drawback of requiring that you know all the
terms that might be smashed together.
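
As a rough sketch of the word-list idea, assuming a made-up field type name and a made-up compounds.txt file (neither is in the original schema), the mapping could be applied at index time so the parts are indexed alongside the compound:

```xml
<!-- Sketch only: "text_compound" and compounds.txt are hypothetical names.
     compounds.txt would hold one line per known compound, for example:
       smartphone => smartphone, smart, phone
     Every compound you care about has to be listed by hand. -->
<fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- replaces each listed compound with the compound plus its parts -->
    <filter class="solr.SynonymFilterFactory" synonyms="compounds.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```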

You *might* be able to do something with shingles, but I'm a little unclear
on how.

Best
Erick
PeterKerk Thu, 01 Mar 2012 14:04:04 +0000 (UTC)
I think I didn't explain myself clearly: I need to be able to find substrings.
So, it's not that I'd expect Solr to find synonyms, but rather if a piece of
text contains the searched text, for example:

if title holds "smartphone" I want it to be found when someone types
"martph" or "smar" or "smart".

I think that is different from what you initially understood from my
explanation or....?
Ahmet Arslan Thu, 01 Mar 2012 16:13:58 +0000 (UTC)
> if title holds "smartphone" I want it to be found when
> someone types
> "martph" or "smar" or "smart".

Peter, so you want a beginsWith/startsWith type of search? You can use a wildcard search (with the star operator) for this, e.g. &q=smar*

Alternatively, if your index size is not huge, you can use EdgeNGramFilterFactory at index time along with normal queries. e.g. &q=smar

http://lucene.apache.org/solr/api/org/apache/...
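
A minimal sketch of what such a field type might look like. The type name and the gram sizes are assumptions, not taken from any shipped schema, so adjust them to your data:

```xml
<!-- Sketch only: "text_prefix" and the gram sizes are hypothetical choices. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "smartphone" is indexed as sm, sma, smar, smart, ... smartphone -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gram filter here, so q=smar is looked up as the single term "smar" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This only covers matches at the start of a word. A query shorter than minGramSize would not match anything.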
Message missing here (more information)
Ahmet Arslan Thu, 01 Mar 2012 21:20:06 +0000 (UTC)
--- On Thu, 3/1/12, PeterKerk  wrote:

> From: PeterKerk 
> Subject: Re: Need tokenization that finds part of stringvalue
> To: 
> Date: Thursday, March 1, 2012, 6:59 PM
> @iorixxx: yes, that is what I need.
> But also when its IN the text, not
> necessarily at the beginning.
> 
> So using the * character like: 
> q=smart* 
> the product is found, but when I do this: 
> q=*mart* 
> it isnt...why is that?

In the example schema.xml there is a field type named text_rev that makes use of ReversedWildcardFilterFactory. It is designed to enable the leading star operator, e.g. q=*mart

I haven't used it myself, but maybe you can use both leading and trailing wildcards (at the same time) with this type.
q=*mart*&df=title_search&defType=lucene
PeterKerk Thu, 01 Mar 2012 21:57:27 +0000 (UTC)
@iorixxx: Where can I find that example schema.xml?

I downloaded the latest version here:
ftp://apache.mirror.easycolocate.nl//lucene/solr/3.5.0
And checked \example\example-DIH\solr\db\conf\schema.xml
But no text_rev type is defined in there.

And when I find it, can I just make the title field which currently is of
"text" type then of "text_rev" type?

Thanks!
Erick Erickson Thu, 01 Mar 2012 22:20:44 +0000 (UTC)
One frequent method of doing leading and trailing wildcards
is to use ngrams (as distinct from edgengrams). That in
combination with phrase queries might work well in this case.

You also might be surprised at how little space bigrams take,
give it a test and see <G>..
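
To make the n-gram idea concrete, here is a rough sketch. Note that it is a simpler variant than bigrams plus phrase queries: it indexes every gram from 2 to 15 characters and leaves the query as a single term. The type name and gram sizes are assumptions:

```xml
<!-- Sketch only: "text_substring" and the gram sizes are hypothetical choices. -->
<fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes every substring of each word that is 2 to 15 characters long,
         so "smartphone" also yields "martph", "smar" and "smart" -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- the query is not n-grammed: q=martph is matched as one term -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is index size: a wide gram range stores many more terms than bigrams alone, so it is worth measuring on real data.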

Best
Erick
Ahmet Arslan Thu, 01 Mar 2012 22:42:50 +0000 (UTC)
> @iorixxx: Where can I find that
> example schema.xml?

Please find text_general_rev at
http://svn.apache.org/repos/asf/lucene/dev/tr...

> And when I find it, can I just make the title field which
> currently is of
> "text" type then of "text_rev" type?

Yes. You can also just add solr.ReversedWildcardFilterFactory to your index analyzer.
PeterKerk Sun, 04 Mar 2012 12:31:50 +0000 (UTC)
@iorixxx
I tried making my title_search of type text_rev and tried adding the
ReversedWildcardFilterFactory to my existing "text" type, but in both cases
no luck.

@Erick Erickson
"One frequent method of doing leading and trailing wildcards is to use ngrams
(as distinct from edgengrams). That in combination with phrase queries might
work well in this case. "

Do you perhaps have an example of that?
Ahmet Arslan Sun, 04 Mar 2012 13:09:24 +0000 (UTC)
> @iorixxx
> I tried making my title_search of type text_rev and tried
> adding the
> ReversedWildcardFilterFactory to my existing "text" type,
> but in both cases
> no luck.

I was able to perform *query* types of searches with the Solr 3.5 distro.
Here is what I did:

Download apache-solr-3.5.0
Edit schema.xml
make text_rev as stored="true" 
add <copyField source="features" dest="text_rev"/>
java -jar start.jar
java -jar post.jar 

http://localhost:8983/solr/select/?q=text_rev...

returns 7 docs, with the following snippets:

<em>SmartMedia</em>, <em>megapixel</em>, <em>document</em>, <em>time</em> etc.

Keep in mind that changes of this kind in schema.xml require re-indexing and restarting the Solr server.

Also, you need to be aware of http://wiki.apache.org/solr/MultitermQueryAna...

> @Erick Erickson
> "One frequent method of doing leading and trailing wildcards
> is to use ngrams
> (as distinct from edgengrams). That in combination with
> phrase queries might
> work well in this case. "

Erick's suggestion will work faster in terms of QTime (response time).
To get the idea, try "text_ngrm" field type in analysis.jsp and it will display generated tokens.

http://lucene.apache.org/solr/api/org/apache/...
PeterKerk Tue, 06 Mar 2012 20:56:26 +0000 (UTC)
@iorixxx: Sorry it took so long, had some difficulties upgrading to 3.5.0

It still doesn't work. Here's what I have now:

I copied text_general_rev from
http://svn.apache.org/repos/asf/lucene/dev/tr...
to my schema.xml:
    <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

To be complete, this is the definition of the title fieldtype:
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

<field name="title" type="text_ws" indexed="true" stored="true"/>
<field name="title_search" type="text_general_rev" indexed="true" stored="true"/>
<copyField source="title" dest="title_search"/>


title field value="Smartphone"

With this search query I don't get any results:
http://localhost:8983/solr/zz/select/?indent=...

What more can I do?
Thanks!
Ahmet Arslan Tue, 06 Mar 2012 21:13:24 +0000 (UTC)
> With this search query I don't get any results:
> http://localhost:8983/solr/zz/select/?indent=...
>
> What more can I do?
> Thanks!

The dismax query parser does not support wildcard queries. defType=edismax would work. Also, defType=lucene&df=title_search&q=*smart* should work too.
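
If you would rather not pass these parameters on every request, one option is to set them as handler defaults in solrconfig.xml. This is only a sketch: the handler name "/titlesearch" is made up, and it assumes title_search is the field you want to query:

```xml
<!-- Sketch only: "/titlesearch" is a hypothetical handler name. -->
<requestHandler name="/titlesearch" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- edismax accepts wildcard queries such as q=*mart* -->
    <str name="defType">edismax</str>
    <!-- field(s) that edismax searches when the query names none -->
    <str name="qf">title_search</str>
  </lst>
</requestHandler>
```

Parameters given on the request still override these defaults.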
PeterKerk Tue, 06 Mar 2012 21:20:34 +0000 (UTC)
edismax did the trick! Thanks!
Walter Underwood Thu, 01 Mar 2012 15:59:43 +0000 (UTC)
I once used a spell checker to break up compound words. It was slow, but worked pretty well.

wunder
Dyer, James Thu, 01 Mar 2012 16:07:42 +0000 (UTC)
Speaking of which, there is a spellchecker in jira that will detect word-break errors like this.  See "WordBreakSpellChecker" at https://issues.apache.org/jira/browse/LUCENE-... .

To use it with Solr, you'd also need to apply SOLR-2993 (https://issues.apache.org/jira/browse/SOLR-29...).  This Solr piece will take the results of your "normal" spellchecker and integrate them with the results from the WordBreakSpellChecker.  

These patches are for Trunk/4.x, and you'd have to apply them as described here:  http://wiki.apache.org/solr/HowToContribute#R...

I would appreciate it if you tried these out and provided feedback on the JIRA issues as to how they work for you and how they can be improved.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

