Report B 99-16
In this report a stemming algorithm for morphological complex languages like German or Dutch is presented. The main idea is not to use stems as common forms in order to make the algorithm simple and fast. The algorithm consists of two steps: First certain characters and/or character sequences are substituted. This step takes linguistic rules and statistical heuristics into account. In a second step a very simple, context free suffix-stripping algorithm is applied. Three variations of the algorithm are described: The simplest one can easily be implemented with 50 lines of C++ code while the most complex one requires about 100 lines of code and a small wordlist. Speed and quality of the algorithm can be scaled by applying further linguistic rules and statistical heuristics.
Get the report here or by anonymous ftp: Server: fubinf.inf.fu-berlin.de File: pub/reports/tr-b-99-16.ps.gz