Machine learning - Stemming in Text Classification - Degrades Accuracy?


I am implementing a text classification system using Mahout. I have read that stop-word removal and stemming help improve the accuracy of text classification. In my case, removing stop words gives better accuracy, but stemming is not helping much. I found a 3-5% decrease in accuracy after applying a stemmer. I tried the Porter stemmer and k-stem, but got the same result in both cases.

I am using the Naive Bayes algorithm for classification.

Any help is appreciated. Thanks in advance.

First of all, you need to understand why stemming normally improves accuracy. Imagine the following sentence in a training set:

He played below-average football in 2013, but was viewed as an ascending player before that, and can play guard or center.

And the following in a test set:

We’re looking at a number of players, including Mark

The first sentence contains a number of words referring to sports, including the word "player". The second sentence, from the test set, also mentions a player, but, oh, it's in the plural - "players", not "player" - so for the classifier it is a distinct, unrelated variable.

Stemming tries to cut off the details of the exact form of a word and produce word bases as features for classification. In the example above, stemming could shorten both words to "player" (or even "play") and use them as the same feature, giving the classifier a better chance of assigning the second sentence to the "sports" class.
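As a rough illustration of that effect, here is a minimal Python sketch. The `toy_stem` function is a hypothetical suffix stripper written only for this example; it is not the Porter or k-stem algorithm, and the test sentence is shortened. It only shows how two sentences that share no raw tokens can end up sharing a feature after stemming.

```python
import re

def tokenize(text):
    # Lowercase and keep alphabetic runs only (digits and punctuation are dropped).
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(word):
    # Hypothetical toy stemmer: strip one common suffix if at least
    # three characters remain. Real stemmers are far more careful.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

train = "He played below-average football in 2013, but was viewed as an ascending player"
test = "We're looking at a number of players"

# Features shared by the two sentences, without and with stemming.
raw_overlap = set(tokenize(train)) & set(tokenize(test))
stem_overlap = {toy_stem(w) for w in tokenize(train)} & {toy_stem(w) for w in tokenize(test)}

print(raw_overlap)   # no shared raw tokens
print(stem_overlap)  # "players" and "player" now map to the same feature
```

With raw tokens the two sentences have nothing in common; after stemming they share the feature "player", which is the signal a classifier needs to connect them.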

Sometimes, however, these details play an important role by themselves. For example, the phrase "runs today" may refer to a runner, while "long running" may be about phone battery lifetime. In this case stemming makes classification worse, not better.

What you can do here is use additional features that help distinguish between different meanings of the same words/stems. Two popular approaches are n-grams (e.g. bigrams, features made of word pairs instead of individual words) and part-of-speech (POS) tags. You can try any combination of them, e.g. stems + bigrams of stems, or words + bigrams of words, or stems + POS tags, or stems, bigrams and POS tags, etc.
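The "stems + bigrams of stems" combination might be sketched like this in plain Python. Again, `toy_stem` and `features` are hypothetical helpers made up for this illustration, not part of Mahout or any stemming library, and the whitespace tokenization is deliberately naive.

```python
def toy_stem(word):
    # Hypothetical toy stemmer: strips "ing" or "s" and undoes a doubled
    # final consonant ("runn" -> "run"). Not a real stemming algorithm.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) > 3 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def features(text):
    # Unigram stems followed by bigrams of adjacent stems.
    stems = [toy_stem(w) for w in text.lower().split()]
    bigrams = [" ".join(pair) for pair in zip(stems, stems[1:])]
    return stems + bigrams

a = features("runs today")
b = features("long running")

print(a)  # ['run', 'today', 'run today']
print(b)  # ['long', 'run', 'long run']
```

Both phrases still share the stem "run", but the bigram features "run today" and "long run" differ, so the classifier gets back some of the context that stemming removed.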

Also, try out other algorithms. E.g. SVM uses a very different approach than Naive Bayes, so it can catch things in the data that NB ignores.

