Trainable Information Extractor

The Trainable Information Extractor (TIE) is an incrementally trainable system for information extraction, text classification, and language engineering in general. It uses classification models to process text. Additional modules augment text with linguistic annotations (by delegating to external tools) and resolve nesting errors and other well-formedness violations in XML-like input.

Currently available goals include:

class-train:
Classifies a list of files, training the text classifier whenever it misclassifies one.
extract:
Extracts relevant information (as specified by a given target schema) from texts.
preprocess:
Preprocesses documents for information extraction by converting them to a suitable XML format and adding linguistic information.
adjust:
Tries to fix corrupt XML documents, especially documents containing nesting errors.
strip:
Strips all markup from an XML document and stores the resulting plain text.
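
The train-on-error idea behind the class-train goal can be illustrated with a minimal sketch. It is not TIE's actual classifier or API: it substitutes a plain perceptron over whitespace tokens, and every name in it (TrainOnErrorClassifier, class_train, the +1/-1 labels) is a hypothetical chosen for illustration. Each text is classified first, and the model is updated only when the prediction is wrong.

```python
def tokenize(text):
    """Split a text into lowercase whitespace-separated tokens."""
    return text.lower().split()


class TrainOnErrorClassifier:
    """Hypothetical binary perceptron (labels +1 / -1) updated only on mistakes."""

    def __init__(self):
        self.weights = {}   # token -> weight
        self.bias = 0.0

    def score(self, text):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokenize(text))

    def predict(self, text):
        return 1 if self.score(text) > 0 else -1

    def classify_and_train(self, text, label):
        """Classify one text; adjust the weights only if the prediction is wrong."""
        predicted = self.predict(text)
        if predicted != label:
            for t in tokenize(text):
                self.weights[t] = self.weights.get(t, 0.0) + label
            self.bias += label
        return predicted


def class_train(classifier, labelled_texts):
    """Run one pass over (text, label) pairs and return the number of errors."""
    errors = 0
    for text, label in labelled_texts:
        if classifier.classify_and_train(text, label) != label:
            errors += 1
    return errors
```

Calling class_train repeatedly on the same labelled texts lets the error count fall as the model adapts. Because correctly classified texts leave the model untouched, training can continue incrementally as new texts arrive.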

Use the navigation menu (left) to learn how to download, install, and use the software. Visit http://www.siefkes.net/ie/ for details on the theoretical background and the algorithms used.