lucene - Stopwords not getting removed

- May 15, 2010

I am new to Solr and have defined the following schema:

<schema name="example" version="1.5">
  <fields>
    <field name="nodeid" type="string" indexed="true" stored="true" />
    <field name="_root_" type="string" indexed="true" stored="false" />
    <field name="datetime" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="epochsecs" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="subject" type="text_general" indexed="true" stored="true" />
    <field name="body" type="text_general" indexed="true" stored="true" />
    <field name="emailid" type="string" indexed="true" stored="true" />
    <field name="compliantflag" type="boolean" indexed="true" stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />
    <field name="ngrams" type="myngram" indexed="true" stored="false" required="false" />
  </fields>

  <uniqueKey>nodeid</uniqueKey>

  <copyField source="datetime" dest="text" />
  <copyField source="epochsecs" dest="text" />
  <copyField source="subject" dest="text" />
  <copyField source="body" dest="text" />
  <copyField source="emailid" dest="text" />
  <copyField source="compliantflag" dest="text" />
  <copyField source="text" dest="ngrams" />

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
        <filter class="solr.PorterStemFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.PorterStemFilterFactory" />
      </analyzer>
    </fieldType>
    <fieldType name="myngram" stored="false" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5" />
      </analyzer>
    </fieldType>
  </types>
</schema>

The stopwords are not getting removed from the "body" field when it is indexed.

Also, how can I remove special characters such as \n from the field below using Solr's analyzers:

\n \n\n\nthese numbers smurfit has.  \n\np

Any help is appreciated. Thanks.

StandardTokenizer should create tokens around newlines, spaces, etc., and your stopword filter looks, at a glance, like it should be working correctly. You should include a LowerCaseFilter above the StopFilter, though, to prevent matches from being case sensitive.
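
As a rough sketch of that suggestion, the index analyzer for text_general could look something like the fragment below. The LowerCaseFilterFactory line is the only change the answer actually calls for. The charFilter line is my own assumption, not part of the original schema or answer: it would only matter if the text contains the two literal characters backslash and n rather than real newlines, which the tokenizer already splits on.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumption: only needed if the body holds a literal backslash followed by "n".
         The regex \\n matches those two characters and replaces them with a space. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\n" replacement=" " />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- Lowercase tokens before stopword removal and stemming -->
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>
```

The query analyzer would presumably take the same LowerCaseFilterFactory line in the same position, so that indexed and queried terms line up. Since ignoreCase="true" is already set on the stop filter, the lowercase filter mostly affects how the remaining terms match. Documents need to be reindexed after a change like this.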

I wonder if the pertinent question might be: what do you mean by "removed"? Analysis affects the indexed representation of the field. It does not affect the stored version that you retrieve from the index in any way. It is meant to facilitate searching, not to transform the stored version of the text. If you remove the word "the" through a filter, you should no longer get hits on the word "the" while searching, but you will still see it when you retrieve the document from the index.
