A Fast and Simple Stemming Algorithm

Jörg Caumanns
Free University of Berlin, CeDiS
caumanns@wiwiss.fu-berlin.de

 

Abstract

 

 

In this paper I present a stemming algorithm for morphological complex languages like German or Dutch. The main idea is not to use stems as common forms in order to make the algorithm simple and fast. The algorithm consists of two steps: First certain characters and/or character sequences are substituted. This step takes linguistic rules and statistical heuristics into account. In a second step a very simple, context free suffix-stripping algorithm is applied. Three variations of the algorithm are described in this paper. The simplest one can easily be implemented with 50 lines of C++ code while the most complex one requires about 100 lines of code and a small wordlist. The algorithm is scalable in a way that linguistic rules and statistical heuristics can be added.

 

1 Introduction

Even in our times of Multi- and Hypermedia written text still is the preferred encoding of information. More than 80% of all documents found in the World Wide Web are textual [Bray96]. The largest search engine on the Web - Altavista - claims to have indexed over 100 million documents. These two numbers imply that anyone with access to the internet as well has access to about 80 million freely available textual documents.

As no one is able to keep track of this mass of information there is a strong demand for automatically indexing, searching, extracting, integrating, and restructuring information.

1.1 Stemming

One of the problems for any kind of automated text processing is the detection of different morphological variations of the same word. Especially for information retrieval, data mining, and automated text comparison these variations must be detected and mapped to a common form. Most algorithms for this conflation of terms use word stems as common forms. This is why conflation is usually just called stemming.

Stemming can be either weak or strong:

  1. In the easiest case only different declensions of a word are detected and conflated. E.g. "house" and "houses" are stemmed into "house" and "mouse" and "mice" into "mouse". This is called weak stemming.
  2. With strong stemming all suffixes (and sometimes even all prefixes) are removed. E.g. even a noun like "illness" is conflated to is adjective stem "ill".

For most applications weak stemming is both sufficient and desired, as in many cases prefixes and suffixes carry lexical and semantic meaning.

The main problems to be solved with weak stemming are:

For languages with just few morphological variants (e.g. English) the most common stemming method is suffix stripping. The idea of suffix stripping is to cut of all suffixes from a word. Suffix stripping usually is based on a dictionary of suffixes, a longest match algorithm, and some simple morphological rules. An examples for this strategy is [Porter80], which is used by most web-based search engines.

1.2 Stemming German Texts

Stemming for morphological more complex languages like Dutch and German is not that easy. The reasons why the combination of longest suffix matches and simple linguistics doesn't work properly are:

As far as I know there are currently no fast and powerful algorithms available for stemming German texts. Linguistic approaches have difficulties with compound words which must be decomposed before stemming and dictionary based approaches are slow because of their large word lists (about 50,000 entries seem to bee needed [Sheridan96, Piotrowski98]). This is the reason why German search engines on the web - e.g. fireball - do no stemming at all.

1.3 Overview

In this paper I will introduce a stemming algorithm based on context free suffix stripping. The core algorithm is very fast and very simple (less than 100 lines of code). It can handle an infinite number of words and gets even stronger the more compounding occurs. The only requirement for the algorithm is that all stop words - determiners and other function words - have been removed from the text to be stemmed.

The version of the algorithm given in this paper is designed for stemming German words but it can more or less easily be adapted for other languages. Currently the algorithm is used for keyword search and text comparison as part of an application in the area of dynamic documents [Caumanns98].

The next chapter provides a description of the algorithm and the ideas it is based on. The algorithm has been tested with a wordlist containing more than 75,000 terms. Some of the results of these tests are given in chapter 3.

It should be noticed that the algorithm is not better than existing approaches but it is much smaller and faster. Further on it can easily be extended by either linguistic rules, statistical heuristics, or wordlists to improve the quality for the price of speed.

2 The Substitute-and-Strip Algorithm

As mentioned above the purpose of stemming is to reduce different morphological forms of a word to a common form. Within this paper I will call this common form a word's discriminator. Discriminators should be unique in a way that an one-to-one mapping of words to discriminators is desired. Word stems are very good discriminators as they are usually unique (if homonyms are ignored) but they are not the only ones.

In practice the requirement of uniqueness can never be met. Morphological homonyms like "nuts" (Plural of "nut" and synonym for "crazy") cannot be detected without performing a semantic analysis of the text to be stemmed. This adds some noise to any stemming algorithm as there will always be a certain number of words not mapped onto an unique discriminator.

Another problem for any stemming algorithm are irregular declensions. Verb forms like "go - went - gone" or nouns like "vertex - vertices" are very hard to detect as common. Especially for languages like Dutch or German that make heavy use of compounding these irregular form might get buried deep inside a word. Stemming these words correctly not only requires a wordlist of all irregular forms but even very sophisticated decomposing and recomposing algorithms.

To sum up these considerations it can be said that any stemming algorithm that is not based on semantic text analysis produces some kind of errors. Even the best algorithms will never be able to stem all words correctly.

The idea behind this work is very simple: If one knows that stemming is never perfect, why should one try? For most applications it doesn't matter if the error rate caused by wrong stemming is 2.1% or 3.1%. In either case the application has to deal with some amount of uncertainty.

The algorithm introduced in this paper is based on a two-step strategy. In a first step certain characters and/or groups of characters are replaced within each word to be stemmed. In a second step suffixes are cut off in a context free manner. Both steps cause stemming errors in a sense that different words are mapped onto identical discriminators. The question of interest is if this additional error rate is low enough to be acceptable.

In the next section I will first describe the second part of the algorithm as it provides the motivation for substitutions.

2.1 Recursive Context Free Suffix-Stripping

One of the most common stemming methods is suffix stripping. With suffix stripping each word to be stemmed is matched against a list of suffixes. The longest suffix matching the word's end is cut off (E.g. "going" is stemmed to "go" by cutting of the inflectional suffix "ing"). If suffix stripping is just based on simple character matching it is called context free.

The idea of context-free suffix stripping is not new, but many people claim it to be a bad thought:

 

"Unfortunately, context free removal leads to a significant error rate. For example, we may well want UAL removed from FACTUAL but not from EQUAL." [Rijsbergen79]

 

The problem addressed within this quote is easy to understand and easy to accept, but it is irrelevant. Why should "ual" not be removed from "equal"? As long as there is no other word mapped onto "eq" it is no problem to stem "equal" to "eq". E.g. my dictionary lists nine words starting with "eq" that are not derived form "equal". None of these nine words would be stemmed to "eq" by a context-free suffix stripping algorithm. The wrong though behind this quote is that "eq" surely is not the stem of "equal" but nevertheless it is an unique discriminator for "equal".

In German there are seven declensional suffixes for nouns: -s, -es, -e, -en, -n, -er, and -ern, 16 for adjectives: -e, -er, -en, -em, -ere, -erer, -eren, -erem, -ste, -ster, -sten, and -stem, and 48 for verbs: -e, -est, -st, -et, -t, -en, -ete, -te, etest, -test, eten, -ten, -etet, tet, -end-, and -nd- (-end- and -nd- turn verbs into adverbs and can be followed by any of the adjective suffixes). Additionally -ge is either used as a prefix or an infix to denote participles.

These suffixes partly include each other which leads to the observation that any declensional suffix can be build up from a combination of "e", "s", "n", "t", "em", "er" and "nd". By giving up some amount of correctness for speed the stemming algorithm described in this paper recursively cuts of any occurrence of "e", "s", "n", "t", "em", "er" or "nd" from the end of the term to be stemmed. The additional error rate caused by this simplification is minimal but the gain in speed is significant as now only 8 different suffixes must be checked instead of 71.

After all suffixes have been removed any occurrence of "-ge-" either as prefix or suffix is cut off. By post processing terms in this way participles are reduced to their respective stem. This replacement is risky at it is the source for a number of 'different word - same discriminator' errors.

I previously wrote that the algorithm is context free. This is not completely true as there are four restrictions which take the length and the case of a term into account:

One of the effects of this restrictions is that 2-letter words are ignored by the algorithm. As there are just a handful of such words in German the effect of ignoring them tends to be zero.

The following table lists some terms and their respective discriminators as calculated by suffix stripping:

term

discriminator

 

term

discriminator

 

singt

sing

 

singen

sing

 

beliebt

belieb

 

beliebtester

belieb

 

stören

stö

 

stöhnen

stöh

 

Kuß

Kuß

 

Küsse

Küss

error!

Verlierer

Verli

 

Verlies

Verli

error!

Maus

Mau

 

Mauer

Mau

error!

stören

stö

 

Störsender

stö

error!

The top three rows show examples of correct stemming the last four rows examples of incorrect stemming. Stemming errors as given in row four and five are prevented by the first part of the stemming algorithm - substitution. Errors as in line six and seven are the price for the algorithm's speed and simplicity.

2.2 Substitution

Just doing suffix stripping without taking any contextual and linguistic information into account leads to two main stemming errors:

To prevent these stemming errors some substitutions are done before suffix stripping:

  1. Each Umlaut is replaced by its corresponding vowel and "ß" is substituted by "ss".
  2. The second character of each double character is replaced by "*"
  3. "sch", "ch", "ei", and "ie" are replaced by special characters ("$", "", "%", "&").

Purpose of the first substitution is to conflate all plural forms correctly. Drawback of this substitution is that different words like "Stück" and "Stuck" or "Eisbär" and "Eisbar" are mapped onto the same discriminator. As the number of these errors is very low (only 12 of these pairs were found in a list of more than 50,000 nouns), there is no effort done to handle it. The second and third replacements are mainly done to save some fixed character sequences from being split. Further on by replacing "sch" and "ch" by single characters some more characters are saved from splitting at words starting with "sch" or "ch".

The algorithm can be improved further by substituting other characters, too, e.g. "x" for "z" to handle irregular plural forms like "Matrix" and "Matrizen" (matrix and matrices). 

The only requirement for any substitution is that it should either result in another morphological form of the same word or produce a non-word. If this requirement is not met a stemming error will occur. E.g. substituting "ß" for "ss" in "Buße" leads to "Busse" which is the plural form of "Bus". For this reason "Bus" and "Buße" are both stemmed to the same discriminator "Bus".

Substitutions are a very powerful instrument as they can capture linguistic rules and statistical heuristics. If for example "ge" within the character sequence "ige" can never be an infix denoting a participle, "ig" should be substituted in order to save it from being split. Most of these substitutions are rather short and can be implemented efficiently using transition tables or finite state machines.

3 Evaluation

The algorithm described in this paper has been tested with a list of about 75,000 terms taken from the German version of the Linux spell checker. Additional tests have been performed with scientific texts and short stories

The two possible kinds of stemming errors are:

From the sum of all errors of either kind within a given text or wordlist the overall error rate of the stemming algorithm can be calculated.

3.1 Multiple Discriminators for the same Word

Even though the stemming algorithm tends to strip off more letters than necessary it may happen that not the whole declinational suffix of a term is removed. This kind of error occurs with three groups of words:

The second kind of error can easily be prevented by substituting each occurrence of "erinn" with "erin". With the Linux word list it would reduce the amount of stemming errors by 21 without causing any additional errors. In Franz Kafka's novell "Der Prozeß" only one female form of a profession is mentioned (only in singular), and within all scientific texts about computer science and linguistics that have been checked no word of this kind occurred. For this reason I decided not to do the "erinn" - "erin" replacement as the additional computational time is not justified by that minimum effect.

Errors of the first and third kind (irregular verbs and foreign words) can only be solved by using a dictionary.

In general it could be said that cases where two different morphological forms of the same word are stemmed to different discriminators are very rare for nouns and adjectives. For further calculation I will assume that about 0.1% of all nouns fall into this category. Giving an estimate for verbs is a bit difficult. The Duden just says that the vast majority of all verbs are regular without giving concrete numbers. But as irregular verbs are very common in spoken and written German even concrete numbers won't help. In a wordlist just 5% of all verbs may be irregular but in a single text their number may rise to 50% and more. Depending on the purpose the substitute-and-stem algorithm is used for irregular verbs may cause an error rate somewhere between 1% and 10% (about 20% of all German words are verbs). As this error rate may not be acceptable I'll give a solution for the 'irregular verb' problem at the end of this section.

3.2 Stemming Nouns

The crucial aspect for the quality of this simple stemming algorithm is how many different words are conflated onto the same discriminator. Taking the quote from van Rijsbergen one could think it must be quite a lot. .

The first test I did was just on nouns with only the suffixes "e", "n", "s", and "er" being cut off. The result was extremely positive: The Linux spell checker's wordlist contains 53300 nouns, some of them in both singular and plural forms. These terms were conflated into 50058 different discriminators by the stemming algorithm. Of these 50058 discriminators only 160 (0.32%) were not unique. Of the 53300 nouns only 325 (0.61%) were not stemmed to a unique discriminator. In one case 4 different words were stemmed to the same discriminator ("Buch", "Büchse", "Buchse", and "Büchner"), in 13 cases 3 different words were reduced to the same discriminator. All these numbers have been retrieved manually which means that they include part of the noise produced by morphological homonyms.

The 160 non-unique discriminators can be grouped as follows:

This statistic is interesting as it helps comparing suffix stripping with linguistic and dictionary based stemming algorithms. Errors of the first two kinds (Non-German words and names) could occur as well with these more sophisticated approaches. Overlapping stems can only be handled if the gender of a word is known. Detecting gender is very difficult for German nouns, so many of the 58 errors produced by suffix stripping would as well cause problems with any other algorithm. Only errors of the last three kinds - which are about 50% of all non-unique discriminators - are typical for suffix stripping.

None of the words stemmed to a non-unique discriminator was longer than twelve characters. More than 80% were even shorter than or equal to 6 characters. This proofs the assumption that the longer the words the fewer the errors caused by suffix stripping.

With most single documents stemmed with the algorithm the rate of non-unique discriminators for nouns was at about 0.5%. More than half of the non-unique stems were caused either by names or by derived forms of the same word. Once again the great majority of error-causing terms were shorter than seven characters.

3.3 Stemming Nouns, Verbs, and Adjectives

The second test was performed on the same set of documents. In this case all suffixes listed in the second chapter were removed. As weak stemming was desired identical discriminators were assumed to be different if their first letter was not of the same case (e.g. "kranker" is distinguishable from "Kranker"). As German sentences very seldom start with verbs or adjectives the amount of 'same word - different discriminator' errors only increased slightly. For the most representative test with the Linux wordlist the following results were obtained:

The size of the complete wordlist is about 75,000 terms of which about 70% are nouns. The stemming algorithm calculated 66959 different discriminators (50058 of them for nouns). The number of 'different word - same discriminator' errors raised from 160 non-unique discriminators for nouns to about 510 non-unique discriminators (0.76%) for the whole wordlist. About 1600 words were detected that were not conflated to an unique discriminator (2.8% of all words). For verbs and adjectives 1.6% of all discriminators were not unique and 6.7% of all verb and adjective forms were stemmed to non-unique discriminators.

In a third tests all words were converted into lowercase after stemming. This scenario - I'll call it medium stemming - is close to strong stemming as many derived forms are reduced to the same discriminator as their stem. Nevertheless it is assumed that still weak stemming is desired but that there is no problem with conflating closely related terms to the same discriminator. For most retrieval and mining applications this should be the case as many of these 'errors' could only have been detected by very time-consuming linguistic and sometimes even semantic analysis.

With medium stemming about 620 non-unique discriminators (0.93%) were detected for the whole wordlist. About 2100 words (2.1%) were not conflated to an unique discriminator. As with the previous test most of the errors were caused by verbs and adjectives. 

3.4 Improvements

With none of the tests the rate of words stemmed to a non-unique discriminator was higher than 3%. For most single texts it was at about 1%. With most tests the number of 'different words - single discriminator' errors was lower than the number of 'same word - different discriminator' errors. The reason for this unexpected behavior is the high frequency of irregular verbs in written texts.

Irregular verbs van very easily be handled by extending the substitution part of the algorithm. The idea is to reduce all morphological forms of an irregular verb to one of these forms by using a small word list.

The common form should be the most significant character sequence of all forms. E.g. the common form for "kommen - kam" is "komm" while the common form for "laufen - lief" is "lief". Substitution is done on a substring base in order to handle compound words, too. By doing this simple substitution the verb "ankam" would be changed into "ankomm" and "zugelaufen" would be changed into "zugeliefen". As replacements of irregular verb forms in most cases lead to non-words the number of additional 'different word - same discriminator' errors is very small.

By making use of a small wordlist the number of single words stemmed to multiple discriminators can be significantly decreased. The only tradeoff is a loss in time as about 200 additional substring comparisons have to be done for each term.

If hard stemming is required more suffixes can be cut off. Especially for the most common derivation suffixes ("-ung", "-heit", "-keit", etc.) the number of additional 'different words - single discriminator' errors is close to zero.

4 Conclusion

In this paper I described a stemming algorithm based on substring substitution and suffix stripping. Three variants of the algorithm have been tested:

  1. When only nouns were stemmed very few errors occurred ( < 1%). It can be assumed that this error rate is not worse than the error rates produced by more sophisticated algorithms based on linguistics or word lists. The advantage of substitute-and-strip algorithm (beside its speed) is that it gets the better the longer the terms to be stemmed are.
  2. When all kinds of words were stemmed the rate of terms mapped onto non-unique discriminators raised up to nearly 3% depending on the stemming semantics. There was a significant number of words that are stemmed onto more than one discriminator (mainly irregular verbs). This leads to an overall error rate between 4% and 15% depending on the terms to be stemmed. As irregular verbs are frequently used but small in number the error rate will be the lower the more terms are stemmed.
  3. By matching each term against a list of irregular verbs most errors originating in multiple discriminators for a single word can be prevented. As the number of 'multiple words - single stem' errors only barely increased the overall error rate was lower than 5%.

For most applications in the area of information retrieval a stemming error rate of about 5% is assumed to be acceptable [vanRijsbergen79]. This handicap can easily be met by the substitute-and-strip algorithm.

The low error rate of the algorithm is very astonishing as it is very small and fast. From practical experiences and the tests presented in this paper it looks as if this 100 lines-of-code algorithm is only hardly worse than other approaches which are either based on sophisticated linguistic methods or on wordlists with 50,000 and more entries.

Further on the algorithm is an ideal extension to any wordlist based approaches. Theses systems tend to have error rates close to zero for short terms while the substitute-and-stem algorithm in nearly error-free for long terms. By combining both approaches a wordlist containing less than 10,000 entries should be enough to get error rates of less than 1% (plus noise).

The only drawback of the algorithm is the relatively high error rate for verbs and adjectives. Nearly 7% of all terms within these groups were not stemmed to unique discriminators. For this reason I would not use it as a base for semantic analysis or other areas were verbs are extremely important. But for any other kind of applications (e.g. information retrieval or text comparisons) it is a good, small, and fast alternative to existing approaches. Further on simplified versions of the algorithm (e.g. without a wordlist) can easily be used for any kind of interactive systems where stemming never was considered because of the complexity of either algorithms or wordlists. But with this simple and fast algorithm there is no reason why users should not be allowed to type natural language sentences into entry fields and other dialog controls.

Literature

[Bray96]

Bray, T. Measuring the Web. Proc. Fifth International World Wide Web Conference WWW5. Paris. 1996.

[Caumanns98]

Caumanns, J. A Bottom-Up Approach to Multimedia Teachware. In: Goettl, B.P., Halff, H.M., Redfield, C.L., & Shute, V.J. (Eds.); Proceedings of the 4th International Conference on Intelligent Tutoring Systems, ITS-98. Berlin: Springer, 1998.

[Dijkstra93]

Dijkstra, T. & Kempen, G. Einführung in die Psycholinguistik. Bern, Göttingen: Verlag Hans Huber. 1993.

[Duden84]

Duden Grammatik der deutschen Gegenwartssprache. Mannheim: Bibliographisches Institut. 1984.

[Piotrowski98]

Piotrowski, M. NLP-Supported Full-Text Retrieval. Master's Thesis, Universität Erlangen-Nürnberg. 1998.

[Porter80]

Porter, M.F. An Algorithm for Suffix Stripping. Program 14(3), pp 130-137. 1980.

[vanRijsbergen79]

van Rijsbergen, C.J. Information Retrieval. London: Butterworth. 1979.

[Sheridan96]

Sheridan P. & Ballerini, J.P. Experiments in Multilingual Information Retrieval using the SPIDER System. Proc. 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR 96, pp 58-65. 1996.