Preprocessing How-to

To preprocess files for information extraction, a patched version of the TreeTagger is required; preprocessing will only work on Unix-like systems (not on Windows). Note that the TreeTagger's license terms differ from those of TIE: the TreeTagger may be used for evaluation and research, but not commercially.

After downloading and installing the TreeTagger, add its bin and cmd subdirectories to your PATH. Then download the patch file and apply it by running the following command from within the cmd subdirectory of your TreeTagger installation:

    patch -p0 < treetagger-cmd.patch

The patch has been created and tested on Linux; it might also work for the Solaris and Mac versions, but most likely not for the Windows version.
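
As a minimal sketch of the PATH setup described above, the following lines could go into a shell startup file. The TREETAGGER_HOME variable and its default location ($HOME/treetagger) are assumptions made for illustration; substitute the directory where you actually installed the TreeTagger.

```shell
# Hypothetical install location; not prescribed by the TreeTagger or TIE.
TREETAGGER_HOME="${TREETAGGER_HOME:-$HOME/treetagger}"

# Prepend the bin and cmd subdirectories so they take precedence.
PATH="$TREETAGGER_HOME/bin:$TREETAGGER_HOME/cmd:$PATH"
export PATH
```

Prepending rather than appending means that the patched scripts in cmd are found first if another copy happens to be installed elsewhere.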

Once the TreeTagger is in the PATH and has been patched successfully, you should be ready to preprocess any files by invoking the "preprocess" goal. By default, texts are assumed to be in English; for German texts, specify the -lang=de parameter. Other languages are currently not supported.
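
Before invoking the goal, it can help to confirm that the TreeTagger is actually reachable from the PATH. The sketch below is one way to do this. The function name require_commands is made up for this example, and the command names tree-tagger and tree-tagger-english are those normally found in the bin and cmd directories of a TreeTagger distribution, so check them against your own installation.

```shell
# require_commands (hypothetical helper): succeed only if every named
# command can be found on the PATH; report each missing one on stderr.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the tagger binary and the English wrapper script.
require_commands tree-tagger tree-tagger-english \
    || echo "TreeTagger does not appear to be on the PATH" >&2
```

This only verifies that the commands can be found; it does not tell you whether the patch has been applied.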

For text (*.txt) files, you will also need to download and install txt2html, which is invoked to convert the input to (X)HTML. HTML and XML files can be preprocessed directly, without any additional tools.
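
The rule above can be sketched as a small pre-flight check that decides, from the file extension, whether txt2html has to be available. The function names needs_txt2html and check_input are hypothetical and exist only for this example; the command name txt2html is assumed to be the one your installation provides.

```shell
# needs_txt2html (hypothetical helper): return 0 if the file is a plain-text
# file that must first be converted to (X)HTML, 1 otherwise.
needs_txt2html() {
    case "$1" in
        *.txt) return 0 ;;
        *)     return 1 ;;
    esac
}

# check_input (hypothetical helper): fail for text inputs when txt2html
# cannot be found on the PATH; accept every other input unchanged.
check_input() {
    if needs_txt2html "$1" && ! command -v txt2html >/dev/null 2>&1; then
        echo "$1: txt2html is required but was not found on the PATH" >&2
        return 1
    fi
    return 0
}
```

Calling check_input on each input file before preprocessing surfaces a missing txt2html early instead of partway through a run.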

The resulting output files have the ".aug" extension, since they are "augmented" with the results of the linguistic analysis (shallow parsing and POS tagging) -- see schema/augment for an exact description.