lucene - Stopwords not getting removed - solr


I am new to Solr and have defined the following schema:

<schema name="example" version="1.5">
  <fields>
    <field name="nodeid" type="string" indexed="true" stored="true" />
    <field name="_root_" type="string" indexed="true" stored="false" />
    <field name="datetime" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="epochsecs" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="subject" type="text_general" indexed="true" stored="true" />
    <field name="body" type="text_general" indexed="true" stored="true" />
    <field name="emailid" type="string" indexed="true" stored="true" />
    <field name="compliantflag" type="boolean" indexed="true" stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />
    <field name="ngrams" type="myngram" indexed="true" stored="false" required="false" />
  </fields>
  <uniqueKey>nodeid</uniqueKey>
  <copyField source="datetime" dest="text" />
  <copyField source="epochsecs" dest="text" />
  <copyField source="subject" dest="text" />
  <copyField source="body" dest="text" />
  <copyField source="emailid" dest="text" />
  <copyField source="compliantflag" dest="text" />
  <copyField source="text" dest="ngrams" />
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
        <filter class="solr.PorterStemFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.PorterStemFilterFactory" />
      </analyzer>
    </fieldType>
    <fieldType name="myngram" stored="false" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5" />
      </analyzer>
    </fieldType>
  </types>
</schema>

The stopwords are not getting removed from the "body" field when it is indexed.

Also, how can I remove special characters such as \n from the field below using Solr's analysers:

\n \n\n\nthese numbers smurfit has.  \n\np 

Any help is appreciated. Thanks.

StandardTokenizer should create tokens around newlines, spaces, etc., and the stopword filter looks, at a glance, like it should be working correctly. You should include a LowerCaseFilter above the StopFilter, though, to prevent matches from being case sensitive.
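A sketch of that suggestion, applied to the index analyzer of the text_general type from the question. The only change is the added LowerCaseFilterFactory line; everything else is copied from the question's schema. I am assuming the query analyzer would get the same extra line, so that both sides produce terms in the same case.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <!-- Added line: lowercase every token before stopword removal and stemming -->
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
  <filter class="solr.PorterStemFilterFactory" />
</analyzer>
```

Changes to index-time analysis only apply to documents indexed after the change, so existing documents would need to be reindexed to pick this up.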

I wonder if the pertinent question might be: what do you mean by "removed"? Analysis affects the indexed representation of the field. It does not affect the stored version you retrieve from the index in any way. It is meant to facilitate searching, not to transform the stored version of the text. If you remove the word "the" through a filter, you should no longer get hits on the word "the" while searching, but you will still see it in the text when you retrieve the document from the index.
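In schema terms, the two representations correspond to the two attributes already set on the "body" field in the question. Here is that field definition again, with comments that are my own gloss on what each attribute does:

```xml
<!-- indexed="true": the text_general analyzer chain runs on the input, and only
     its output terms (stopwords dropped, words stemmed) go into the searchable index.
     stored="true": the original input is kept verbatim, stopwords and newlines
     included, and that copy is what comes back when the document is retrieved. -->
<field name="body" type="text_general" indexed="true" stored="true" />
```

To check what the analyzer actually produces, look at the indexed terms (for example on the Analysis screen of the Solr admin UI) rather than at the stored value returned in search results.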

