xml - Solr filters: are they good?
Do you think these filters are good for French search?
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         Add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
  </analyzer>
</fieldType>
I have a problem: "electricitré" returns 6 occurrences, whereas "electricite" returns 9 occurrences.
- You can use the Solr admin page (its analysis screen) to understand why electricitré and electricite don't give the same results. I suppose the difference here is due to a typo: electricitré instead of electricité, without the r?
Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (e.g. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:
- The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
- Phrase searching (e.g. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.
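As a rough illustration of the index-time approach, a field type could expand synonyms in the index analyzer and leave them out of the query analyzer. This is only a sketch: the field type name text_syn is hypothetical, the filter chain is cut down to the essentials, and synonyms.txt is assumed to contain a line such as "sea biscuit, sea biscit, seabiscuit".

```xml
<!-- Sketch only: "text_syn" is a made-up name, not part of the question's schema -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Expand synonyms while indexing, so every variant is stored in the index -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No synonym filter at query time: the index already holds all variants -->
  </analyzer>
</fieldType>
```

The trade-off is that any change to synonyms.txt only takes effect for documents that are reindexed afterwards.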
Even when you aren't worried about multi-word synonyms, idf differences still make index-time synonyms a good idea. Consider the following scenario:
- An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
- Many thousands of documents containing the term "text:tv"
- A few hundred documents containing the term "text:television"
A query for text:tv will expand into (text:tv text:television), and the lower docFreq for text:television will give the documents that match "television" a much higher score than comparable docs that match "tv", which may be somewhat counterintuitive to the client. Index-time expansion (or reduction) will result in the same idf for all documents, regardless of which term the original text contained.
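To put rough numbers on this scenario, here is a small worked example. It assumes Lucene's classic TF-IDF similarity, where idf is 1 + ln(N / (df + 1)); the collection size (1,000,000 documents) and the document frequencies (50,000 for tv, 300 for television) are invented purely for illustration. Other similarities use different formulas but show the same trend.

```latex
\begin{align*}
% Assumed: classic Lucene idf; N and df values are hypothetical
\mathrm{idf}(t) &= 1 + \ln\frac{N}{\mathrm{df}(t) + 1} \\
\mathrm{idf}(\text{tv}) &= 1 + \ln\frac{1\,000\,000}{50\,001} \approx 4.00 \\
\mathrm{idf}(\text{television}) &= 1 + \ln\frac{1\,000\,000}{301} \approx 9.11
\end{align*}
```

With query-time expansion, a match on "television" therefore carries more than twice the idf weight of a match on "tv". With index-time expansion, and assuming no document originally contained both words, each term appears in all 50,300 documents, so both end up with the same idf of about 3.99.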
Note: it is best to use the ElisionFilter before the WordDelimiterFilter. This will prevent very slow phrase queries.
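Applied to the index analyzer from the question, that advice might look like the sketch below. Only the position of the ElisionFilterFactory changes; every other filter and attribute is copied from the question, and the same reordering would apply to the query analyzer. The factory also accepts an articles attribute naming a word list of articles; the question omits it, so the sketch does too.

```xml
<!-- Sketch: same filters as in the question, with the elision filter moved up -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Strip elided articles first, e.g. l'électricité is meant to become électricité -->
  <filter class="solr.ElisionFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <!-- The word delimiter now sees tokens without the apostrophe -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

After changing the index analyzer, existing documents need to be reindexed for the new token stream to take effect.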