To preprocess files for information extraction, a patched version of the TreeTagger is required; preprocessing will only work on Unix-like systems (not on Windows). Note that the license terms of the TreeTagger are different from those of TIE; it may be used for evaluation and research, but not commercially.
After downloading and installing the TreeTagger, you should add the
cmd subdirectories to your PATH. You then need to download the patch
file and apply it within the relevant subdirectory of your TreeTagger
installation by calling
patch -p0 < treetagger-cmd.patch
within that directory. The patch has been created and tested on Linux; it might also work for the Solaris and Mac versions, but most likely not for the Windows version.
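As a cautious variant of this step, the patch can first be tried with --dry-run, so that a mismatch is reported before any file is modified. The following is only a sketch: it assumes GNU patch (other patch implementations may lack --dry-run), and the function name apply_treetagger_patch and its two arguments are hypothetical. The first argument must be the directory in which the patch is meant to be applied.

```shell
#!/bin/sh
# Sketch only: apply_treetagger_patch is a hypothetical helper, not part of
# TIE or the TreeTagger. Assumes GNU patch, which provides --dry-run.
apply_treetagger_patch() {
    patch_dir=$1     # directory in which the patch must be applied
    patch_file=$2    # path to the downloaded treetagger-cmd.patch

    if [ ! -d "$patch_dir" ]; then
        echo "no such directory: $patch_dir" >&2
        return 1
    fi
    if [ ! -f "$patch_file" ]; then
        echo "no such patch file: $patch_file" >&2
        return 1
    fi

    # Make the patch path absolute, since we change directory below.
    patch_file=$(cd "$(dirname "$patch_file")" && pwd)/$(basename "$patch_file")

    # Run in a subshell so the caller's working directory is left unchanged.
    (
        cd "$patch_dir" || exit 1
        # Check that the patch applies cleanly before modifying anything.
        patch --dry-run -p0 < "$patch_file" >/dev/null || exit 1
        patch -p0 < "$patch_file"
    )
}
```

If the dry run fails, nothing is changed and the function returns a non-zero status, which makes it easier to tell a mismatched TreeTagger version from a partially applied patch.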
After that (assuming that TreeTagger is in the PATH and has been
successfully patched) you should be ready to preprocess any files by
invoking the "preprocess" goal. By default, texts are assumed to be in
English; if they are in German, you have to specify the corresponding
parameter (other languages are currently not supported).
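Before invoking the goal, it can help to confirm that the required commands are actually reachable through the PATH. The sketch below relies on the POSIX command -v builtin. check_on_path is a hypothetical helper, and the command names in the usage note (tree-tagger, tree-tagger-english, tree-tagger-german) are assumptions based on a standard TreeTagger installation; adjust them if yours differs.

```shell
#!/bin/sh
# Sketch only: check_on_path is a hypothetical helper, not part of TIE.
# Prints one line per command and returns non-zero if any is missing.
check_on_path() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return $status
}
```

A call such as check_on_path tree-tagger tree-tagger-english tree-tagger-german would then list which of the assumed commands are found. Note that this only checks the PATH; it cannot tell whether the patch has been applied.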
For text (*.txt) files, you will also need to download and install txt2html, which is invoked to convert the input to (X)HTML. HTML and XML files can be preprocessed directly, without any other tools.
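The distinction between input types can be summarised in a small sketch. The function name classify_input is hypothetical, and treating .htm and .xhtml like .html is an assumption, since only text, HTML and XML files are named here. It illustrates the rule as described and is not TIE's own logic.

```shell
#!/bin/sh
# Sketch only: classify_input is a hypothetical helper, not TIE's own logic.
# Prints "txt2html" for files that must be converted to (X)HTML first,
# "direct" for files that can be preprocessed as they are, and "unknown"
# for file types this document does not cover.
classify_input() {
    case $1 in
        *.txt)                        echo "txt2html" ;;
        *.html|*.htm|*.xhtml|*.xml)   echo "direct" ;;
        *)                            echo "unknown" ;;
    esac
}
```

For example, classify_input notes.txt prints txt2html, whereas classify_input page.html prints direct.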
The resulting output files have the ".aug" extension, since they are "augmented" with the results of the linguistic analysis (shallow parsing and POS tagging) -- see schema/augment for an exact description.