TIE Configuration Parameters

| Name | Type | Optional? | List? | Value | Description |
| --- | --- | --- | --- | --- | --- |
| adjust.delete.control-chars | Boolean | false | false | `false` | Whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1) |
| adjust.delete.pseudo-tags | Boolean | false | false | `false` | Whether to delete "pseudo-tags" |
| adjust.delete.trailing-garbage | Boolean | false | false | `false` | Whether to delete trailing garbage (illegal content that occurs after the root tag has been closed) |
| adjust.emptiable-tags | String | true | true |  | Set of names of tags that can be converted to empty tags when required |
| adjust.escape.pseudo-entities | Boolean | false | false | `false` | Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped; all other "&" are always escaped) |
| adjust.missing-root | String | true | false | `document` | The name to use for the root element if missing |
| analyzer.aug-dir | String | false | false | `..` | Directory containing the augmented (preprocessed) input texts, relative to the current working directory |
| answers.attrib | String | true | false |  | Name of the attribute to read element types from if the element name shouldn't be used |
| answers.element | String | true | true |  | If not empty, answer types are read from the "answers.attrib" attribute of the elements specified in this list instead of using element names as types |
| charset | String | true | false |  | Character set to use when reading and writing local files. If omitted (default), the default charset of the current platform is used. |
| classes | String | false | true |  | Names of the classes to recognize (temporarily) |
| classifier | String | false | true | `[Multi, Winnow]` | The trainable classifier to use, either "Ext" for an external classifier; "Winnow" for the Winnow algorithm; "ucWinnow" for Ultraconservative Winnow; the qualified name of a TrainableClassifier subclass; or "Multi" or "OAR" followed by the name of |
| classifier.ext.classify | String | false | true | `[crm, -{ isolate (:stats:); classify <osb> (:*:_arg2:) (:stats:) /[[:graph:]]+/; output /:*:stats:/ }]` | Command name + arguments to call for classification (the list of possible target classes will be the last arg, the feature vector is passed on stdin) |
| classifier.ext.directory | String | true | false |  | The directory to run the classifier in (defaults to the current working directory) |
| classifier.ext.init | String | true | true | `[cssutil, -b, -r, -s, 94321]` | Command name + arguments to call for initialization (the class to initialize will be the last arg) |
| classifier.ext.regex | String | false | false | `\((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+)` | Regular expression to extract, for all classes or at least the best one (group 1), the probability (group 2) and optionally the pR = log(P / (1-P)) from the classifier's stdout |
| classifier.ext.reset | String | true | true | `[rm, -f]` | Command name + arguments to call for resetting the classifier by deleting the prediction model (the class to reset will be the last arg) |
| classifier.ext.suffix | String | true | false | `.css` | The suffix to append to classes for the classifier |
| classifier.ext.threshold.pR | Double | true | false | `20` | If specified, the classifier is trained if the pR is below this value as well as on errors ("thick threshold" heuristic) |
| classifier.ext.threshold.prob | Double | true | false |  | If specified, the classifier is trained if the probability is below this value (must be < 1.0) as well as on errors ("thick threshold" heuristic) |
| classifier.ext.train | String | false | true | `[crm, -{ learn <osb microgroom> (:*:_arg2:) /[[:graph:]]+/ }]` | Command name + arguments to call for training (the expected target class will be the last arg, the feature vector is passed on stdin) |
| classifier.file | String | false | false | `classifier.xsj` | Name of the file used for storing the classifier |
| classifier.meta.judge | String | false | true | `Winnow` | The specification of the judge classifiers used in the MetaClassifier (same syntax as the "classifier" parameter) |
| classifier.meta.layers | Integer | false | false | `2` | The number of layers to use in the MetaClassifier (at least 1, typically 2 or more) |
| classifier.re-use | Boolean | false | false | `true` | Whether to re-use classifiers between several runs (incl. classifiers stored in the classifier.file, if it exists) |
| classifier.store | Boolean | false | false | `false` | Whether to store the final classifier in the file specified by the classifier.file parameter |
| classifier.test-only | Boolean | false | false | `false` | If set to true, the classifier will be used only for prediction -- no training will take place |
| classifier.text | String | false | true | `[OAR, Winnow]` | The trainable classifier to use, either "Ext" for an external classifier; "Winnow" for the Winnow algorithm; "ucWinnow" for Ultraconservative Winnow; the qualified name of a TrainableClassifier subclass; or "Multi" or "OAR" followed by the name of (for text classification) |
| classifier.tie.layers | Integer | false | false | `2` | The number of layers to use in the TieClassifier (at least 1, typically 2 or more) |
| classifier.tie.threshold | Double | false | false | `0.99` | TieClassifier invokes the next layer if the relative probability of the second best prediction is above or equal to this threshold (must be in the 0 to 1 range) |
| classifier.train.all | Boolean | false | false | `false` | If true, the classifier considers all classes for error-driven training, not only the candidate classes |
| classifier.winnow.balanced | Boolean | false | false | `false` | Whether to use the Balanced Winnow or the standard Winnow algorithm; Balanced Winnow keeps two weights per feature and class, a positive and a negative one |
| classifier.winnow.demotion | Float | false | false | `0.83` | The demotion factor used by the Winnow classifier |
| classifier.winnow.features | Integer | false | false | `2000000` | The number of features to store |
| classifier.winnow.ignore.exponent | Integer | false | false | `1` | If the ignore parameter is true, all features in the range from demotion factor^exponent to |
| classifier.winnow.ignore.irrelevant | Boolean | false | false | `false` | Whether to ignore features within a certain range around the default weight for classification |
| classifier.winnow.promotion | Float | false | false | `1.23` | The promotion factor used by the Winnow classifier |
| classifier.winnow.shared-store | Boolean | false | false | `true` | Whether a shared feature store is used for all Winnow instances |
| classifier.winnow.threshold.thickness | Float | false | false | `0.05` | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise |
| classifier.winnow.threshold.thickness.uc | Float | false | false | `0.1` | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise (for the "ultraconservative" variant of the Winnow classifier) |
| combination.adapter | String | false | false |  | Mapping from a regular expression to a replacement text used to translate between an external labeling convention and the internal convention used by a combination strategy |
| combination.begin-end.level2 | Boolean | false | false | `false` | Whether the begin/end combination strategy should use a second level similar to the ELIE/2 system by Finn and Kushmerick |
| combination.strategy | String | false | false | `IOB2` | Strategy to combine extractions, either "BE" for the begin/end strategy which uses two classifiers, "BIA" for the begin/after (a.k.a. BIA) strategy, "BIE1" or "BIE2" for variations of the open/close (a.k.a. BIE) strategy, "IOB1" or "IOB2" for variations of the inside/outside strategy, "Triv" for the trivial strategy, or the qualified name of a CombinationStrategy subclass |
| compress.gzip | Boolean | false | false | `false` | Whether to compress your data in gzip format |
| compress.gzip.xsj | Boolean | false | false | `true` | Whether to compress your data in gzip format (for XML-serialized Java) |
| dsv.entry.separator.ws | Boolean | false | false | `false` | If set to true, any sequence of whitespace is assumed to separate DSV entries; otherwise only newlines are accepted as separators |
| dsv.field.separator | String | true | false | `\|` | Separator between fields in a DSV entry; a single space is used if the value is empty or missing |
| dsv.keys | String | true | true |  | The list of keys (header names) for DSV files; if none, the keys are read from the first line of a file |
| dsv2xml.attribs.omit | Boolean | false | false | `false` | Whether to omit all attributes when converting DSV to XML |
| dsv2xml.level2 | String | true | false |  | Optional name of second level elements to insert at the start of a document and whenever an empty line is encountered -- if given, all field maps will be appended as child elements (third level) to the preceding second level element |
| dsv2xml.root | String | false | false | `document` | The name used for the root element by the DSV-to-XML converter |
| eval.feedback | Boolean | false | false | `false` | If true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it |
| eval.match.all | Boolean | false | false | `true` | Whether to use "match-all" or "match-best" (more probably) as match mode |
| eval.match.pos | Boolean | false | false | `true` | If true, the positions of extraction and answer keys must match; otherwise only their contents must match (string compare) |
| eval.split.separator | String | true | false |  | If given, the specified string is used to separate the training from the testing section of the corpus (e.g. "---") |
| eval.test-split | Float | false | false | `-1` | The percentage of a corpus to use for testing (evaluation); if -1, all remaining documents (1 - eval.train-split) are used |
| eval.train-split | Float | false | false | `0.5` | The percentage of a corpus to use for training |
| eval.train-split.text | Float | false | false | `1` | The percentage of a corpus to use for training (for text classification) |
| eval.tune.each | Boolean | false | false | `false` | If true, evaluation results are measured after each training iteration (starting from eval.tune.since); otherwise only after the last one |
| eval.tune.list | Integer | true | true | `1` | A list of iterations after which to evaluate TUNE training in addition to the last one; ignored if eval.tune.each is true |
| eval.tune.since | Integer | false | false | `1` | The training iteration after which to evaluate results for the first time if eval.tune.each is enabled |
| ext.doc | String | false | false | `application/msword` | Maps a file extension to the MIME type contained in matching files (for DOC files) |
| ext.dot | String | false | false | `application/msword` | Maps a file extension to the MIME type contained in matching files (for DOT (Document Template) files) |
| ext.htm | String | false | false | `text/html` | Maps a file extension to the MIME type contained in matching files (for HTM files) |
| ext.html | String | false | false | `text/html` | Maps a file extension to the MIME type contained in matching files (for HTML files) |
| ext.pdf | String | false | false | `application/pdf` | Maps a file extension to the MIME type contained in matching files (for PDF files) |
| ext.rtf | String | false | false | `text/rtf` | Maps a file extension to the MIME type contained in matching files (for RTF (Rich Text Format) files) |
| ext.txt | String | false | false | `text/plain` | Maps a file extension to the MIME type contained in matching files (for TXT (plain text) files) |
| ext.uri | String | false | false | `text/uri-list` | Maps a file extension to the MIME type contained in matching files (for URI files) |
| ext.uris | String | false | false | `text/uri-list` | Maps a file extension to the MIME type contained in matching files (for URIS files) |
| ext.xhtml | String | false | false | `text/html` | Maps a file extension to the MIME type contained in matching files (for XHTML files) |
| ext.xml | String | false | false | `text/xml` | Maps a file extension to the MIME type contained in matching files (for XML files) |
| externalize.key | String | false | false | `File` | The name of the field to externalize |
| extract.bias | Double | true | false |  | Bias that reduces or increases the score calculated for a class |
| extract.evaluate | Boolean | false | false | `true` | Whether to evaluate predictions by comparing them to answer keys; otherwise predictions are stored without evaluating them |
| extract.pred.ext | String | false | false | `pred` | The extension used to store predictions (if the extract.evaluate option is set to false) |
| extract.pred.use-outdir | Boolean | false | false | `true` | Whether to write prediction files to the configured output directory or to the directory containing the input file |
| extract.punctuation.relevant | String | true | true | `[., )]` | A list of punctuation and symbol tokens that are considered relevant from the very start (other such tokens are added on demand) |
| feature.extractor | String | false | false | `de.fu_berlin.ties.text.TokenizingExtractor` | Qualified name of a de.fu_berlin.ties.classify.feature.FeatureExtractor instance to be used for converting text sequences into feature vectors |
| file.ext | String | true | false |  | The extension to append to file names (if any), currently used by the class-train goal |
| filter.avoid | String | true | true | `[pos, const, entity]` | List of elements that should be avoided (use the parent element instead) when filtering as the first step of a double classification approach |
| filter.elements | String | true | true |  | List of elements to filter as the first step of a double classification approach ("sentence filtering"); if none, no sentence filtering is used |
| goal.adjust | Class definition | false | false | `[de.fu_berlin.ties.xml.XMLAdjuster, xml]` | Tries to fix corrupt XML documents, especially documents containing nesting errors |
| goal.analyze | Class definition | false | false | `[de.fu_berlin.ties.eval.MistakeAnalyzer, mistakes]` | Analyses the types of prediction errors that occurred during a test run |
| goal.answers | Class definition | false | false | `[de.fu_berlin.ties.extract.AnswerBuilder, ans]` | Builds answer keys from an annotated text (in XML format) |
| goal.avg-length | Class definition | false | false | `[de.fu_berlin.ties.eval.AverageLength, avl]` | Calculates the average length for extractions of different types and evaluation statuses |
| goal.class-train | Class definition | false | false | `[de.fu_berlin.ties.classify.ClassTrain, cls]` | Classifies a list of files, training the text classifier on each error |
| goal.dsv2xml | Class definition | false | false | `[de.fu_berlin.ties.xml.convert.DSVtoXMLConverter, xml]` | Converts data in DSV format into XML |
| goal.eval-preds | Class definition | false | false | `de.fu_berlin.ties.eval.PredictionEvaluator` | Reads a set of files that must contain predictions and evaluates them against the corresponding answer keys (*.ans files) |
| goal.externalize | Class definition | false | false | `[de.fu_berlin.ties.io.Externalize, dsv]` | Externalizes the contents of a file in DSV format. For each entry, the contents of one specified field (read from the "externalize.key" configuration parameter) are stored in an external file whose name is stored in the output DSV file instead of its content. |
| goal.extract | Class definition | false | false | `[de.fu_berlin.ties.extract.Extractor, pred]` | Extracts relevant information from texts |
| goal.filter | Class definition | false | false | `de.fu_berlin.ties.classify.TextFilter` | A simple filter for classifying and/or training text files |
| goal.preprocess | Class definition | false | false | `[de.fu_berlin.ties.preprocess.PreProcessor, aug]` | Preprocesses documents by converting them to a suitable XML format and adding linguistic information |
| goal.re-eval | Class definition | false | false | `[de.fu_berlin.ties.eval.ReEvaluator, ext]` | Re-evaluates evaluated extractions (useful for switching the match mode -- eval.match.all) |
| goal.shuffle | Class definition | false | false | `de.fu_berlin.ties.eval.ShuffleGenerator` | Creates random "shuffles" of input arguments (e.g. files or URLs) |
| goal.shuffle-lines | Class definition | false | false | `[de.fu_berlin.ties.eval.LineShuffleGenerator, rand]` | Randomly reshuffles the lines in a file |
| goal.simple-quotes | Class definition | false | false | `[de.fu_berlin.ties.text.SimplifyQuotes, txt]` | Simplifies different kinds of quotes that can occur in text files |
| goal.split | Class definition | false | false | `de.fu_berlin.ties.io.Split` | Splits an input file into a series of output files |
| goal.strip | Class definition | false | false | `[de.fu_berlin.ties.xml.dom.XMLStripper, txt]` | Strips all markup from an XML document and stores the resulting plain text |
| goal.train | Class definition | false | false | `de.fu_berlin.ties.extract.Trainer` | Trains the classifier used to extract information |
| goal.train-eval | Class definition | false | false | `[de.fu_berlin.ties.extract.TrainEval, metrics]` | Trains an extractor and evaluates extraction quality |
| goal.unflatten | Class definition | false | false | `[de.fu_berlin.ties.xml.convert.AttributeUnflatten, xml]` | Unflattens an XML document, reading labels for a combination strategy from an XML attribute ("class" by default) |
| html-converter.command.text/plain | String | false | true | `[txt2html, --prebegin, 1, --preend, 1, --tables, --nounhyphenation, --xhtml, --doctype, , -8]` | External converter from the specified MIME type to XHTML (first element: command name, further elements: command-line arguments) (for plain text) |
| lang | String | false | false | `en` | Language of documents (ISO 639 language code) |
| lengthfilter.tolerance | Double | false | false | `1.05` | Length filter discards extractions longer than the longest trained extraction multiplied by this tolerance factor |
| log.log | String | false | false | `DEBUG` | Only messages with this priority or higher are logged: DEBUG, INFO, WARN, ERROR, FATAL_ERROR or NONE (ignoring case) |
| log.show | String | false | false | `INFO` | Only messages with this priority or higher are written to standard output (but only if covered by log.log) |
| mime.application/rtf | String | false | false | `text/rtf` | Maps an alternative MIME type to a main MIME type (for Rich Text Format) |
| mime.application/xhtml+xml | String | false | false | `text/html` | Maps an alternative MIME type to a main MIME type (for XHTML content) |
| mime.application/xml | String | false | false | `text/xml` | Maps an alternative MIME type to a main MIME type (for XML content) |
| outdir | String | true | false |  | Output files are written to this directory, if specified; otherwise they are written to the same directory as the corresponding input file, or to the working directory if there is no local input file; not relevant for log files, which are always written to the working directory |
| post | Class definition | false | false |  | Postprocessor for files with the given extension (must extend TextProcessor) |
| preprocess.tagger | String | true | true | `de.fu_berlin.ties.preprocess.TreeTagger` | A tagger (or a list of taggers) used to annotate a text, e.g. with linguistic information; each tagger must implement the TextProcessor interface and accept a string (the output extension) as single constructor argument |
| preprocess.text | Boolean | false | false | `true` | Whether plain text is preprocessed to recognize and reformat definition lists |
| prune.candidates | Integer | false | false | `1` | The number of candidates considered for each pruning operation; if 1, the feature store behaves like a standard LRU cache |
| prune.num | Integer | false | false | `1` | The number of candidates to prune by each pruning operation; must not be larger than prune.candidates |
| reestimator.chain | String | true | true | `de.fu_berlin.ties.extract.reestimate.LengthFilter` | List of re-estimators (fully specified names of de.fu_berlin.ties.extract.reestimate.Reestimator subclasses) used in a chain |
| representation.ancestor.num | Integer | false | false | `4` | Maximum number of ancestors to represent |
| representation.default.attribs | String | false | true | `[type, class]` | Local names of default attributes |
| representation.head.attrib | String | false | false | `normal` | Local name of the attribute to use for calculating head values |
| representation.head.element | String | false | false | `const` | Local name of the element to use for calculating head values |
| representation.prefix.maxlength | Integer | false | false | `4` | The maximum length of prefixes and suffixes |
| representation.recogn.detailed | Integer | false | false | `2` | The number of preceding recognitions to represent in detail |
| representation.recogn.num | Integer | false | false | `4` | The number of preceding recognitions to represent |
| representation.sensors | String | true | true | `de.fu_berlin.ties.context.sensor.ListSensor` | List of classes implementing the Sensor interface that are used to add e.g. semantic information to tokens (each class must provide a constructor that accepts a TiesConfiguration as single argument) |
| representation.sibling.num | Integer | false | false | `4` | Basic number of preceding and following siblings to represent |
| representation.split.maximum | Integer | false | false | `4` | Maximum number of subsequences to keep when a feature value must be split (at whitespace) |
| representation.store.nth | Integer | false | false | `0` | If > 0, every n-th (n=given value) context is stored for debugging and inspection purposes |
| rewriter.pred.classes | String | true | true |  | Names of the prediction classes to use -- all are used if this parameter is missing or empty |
| rewriter.pred.ext | String | false | false | `pred` | Extension of the files containing predictions |
| rewriter.pred.none | String | true | false |  | "None" marker to use for tokens that do not belong to any prediction -- if empty or missing, these tokens are not tagged |
| rewriters | String | true | true |  | List of classes extending DocumentRewriter that are used to add e.g. semantic information to documents (each class must provide a constructor that accepts a TiesConfiguration as single argument) |
| sensor.list.basepath | String | true | false | `${user.home}/lib/semantic` | Gazetteer files are resolved relative to this path, if given |
| sensor.list.case | Boolean | false | false | `false` | Gazetteer entries are looked up case-sensitively if set to true |
| sensor.list.map.dict | String | false | false | `american-english.gz` | Maps an identifier to a gazetteer file (American English dictionary) |
| sensor.list.map.lastname | String | false | false | `census.last.frequent.gz` | Maps an identifier to a gazetteer file (list of last names from US census: http://www.census.gov/genealogy/names/) |
| sensor.list.map.name-female | String | false | false | `census.female.first.gz` | Maps an identifier to a gazetteer file (list of female names from US census) |
| sensor.list.map.name-male | String | false | false | `census.male.first.gz` | Maps an identifier to a gazetteer file (list of male names from US census) |
| sensor.list.map.suffix | String | false | false | `usps.suffix.gz` | Maps an identifier to a gazetteer file (address suffixes from US Postal Service) |
| sensor.list.map.title | String | false | false | `wikipedia.titles.gz` | Maps an identifier to a gazetteer file (titles collected from Wikipedia: http://en.wikipedia.org/wiki/Title) |
| sensor.list.negative | Boolean | false | false | `true` | Whether to add a negative marker if there is no positive information for a token |
| sensor.list.negative.value | String | true | false | `none` | The key to use as negative marker if configured; if not set or empty, "false" is added for each gazetteer type if sensor.list.negative is true |
| sent.tune | Integer | false | false | `0` | For how many iterations to TUNE the sentence classifier; if 0 or negative, it is TUNE trained for all iterations |
| shuffle.ext | String | false | false | `list` | Suffix used for the generated random shuffle files |
| shuffle.first | Integer | false | false | `1` | The number used in the name of the first generated shuffle file |
| shuffle.lines.ignore-first | Integer | false | false | `1` | The number of lines at the start of a file that are ignored by the line shuffle generator |
| shuffle.num | Integer | false | false | `5` | The number of random shuffles to generate |
| shuffle.prefix | String | false | false | `run` | Prefix used for the generated random shuffle files |
| split.pattern | String | true | false |  | The default pattern used by the split goal to split input |
| strip.normalize | Boolean | false | false | `false` | Whether or not to normalize whitespace |
| strip.to-xml | Boolean | false | false | `false` | If this is set to true, the strip target generates an XML file instead of a plain text file, by preserving the root element (while still discarding all other elements + attributes) |
| textfilter.classes | String | false | true | `[spam, ham]` | Names of the classes used to filter text. The probability of the very first class will be returned as "score". |
| tokenizer.pattern | String | false | true | `[(?:\p{N}[.,:]\p{N}\|[\p{L}\p{M}\p{N}])+, \p{Sc}+(?:\p{N}[.,:]\p{N}\|\p{N})*, [\p{Sm}\p{Sk}\p{So}]+, (\p{P})\1*]` | List of regular expressions defining the token types accepted by the tokenizer |
| tokenizer.pattern.classifier | String | false | true | `[^\p{Z}\p{C}][/!?#]?[-\p{L}\p{M}\p{N}]*(?:["'=;]\|/?>\|:/*)?` | List of regular expressions defining the token types accepted by the tokenizer (for classification of whole texts) |
| tokenizer.whitespace | String | false | false | `[\p{Z}\p{C}]*` | Regular expression giving the whitespace accepted by the tokenizer |
| train.only-errors | Boolean | false | false | `true` | Whether to train only on errors (TOE mode, recommended) or on all instances (brute-force mode) |
| train.test-only | Boolean | false | false | `false` | If enabled, the trainer only checks whether all answer keys exist and can be located in the document (logging errors if they don't); it doesn't do any training |
| train.tune | Integer | false | false | `15` | The maximum number of iterations used for training the extractor; if 1, training is incremental |
| train.tune.stop | Integer | false | false | `1` | TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations |
| train.tune.text | Integer | false | false | `1` | The maximum number of iterations used for training the extractor; if 1, training is incremental (for text classification) |
| transformer.chain | String | true | true | `de.fu_berlin.ties.classify.feature.OSBTransformer` | List of feature transformers (fully specified names of de.fu_berlin.ties.classify.FeatureTransformer subclasses) used in a chain |
| transformer.osb.length | Integer | false | false | `5` | The window length used by the OSB (orthogonal sparse bigrams) transformer to generate joint features; minimum value is 2 |
| transformer.osb.preserve | Boolean | false | false | `false` | If true, the OSB transformer preserves the original features (unigrams); otherwise it discards them (using only the generated joint features) |
| transformer.osb.separator | String | true | false |  | Used by the OSB transformer to separate joint features; a space character is used if not specified |
| transformer.sbph.length | Integer | false | false | `5` | The window length used by the SBPH (sparse binary polynomial hashing) transformer to generate joint features |
| transformer.sbph.separator | String | true | false |  | Used by the SBPH transformer to separate joint features; a space character is used if not specified |
| treetagger.after-eos.en | String | true | true | `''` | List of POS tags still to include in a previous sentence (for English) |
| treetagger.command.de | String | false | false | `tagger-chunker-german` | Name of the TreeTagger command (for German) |
| treetagger.command.en | String | false | false | `tagger-chunker-english` | Name of the TreeTagger command (for English) |
| treetagger.eos.de | String | false | false | `\$\.` | Regular expression fragment: POS tag marking the end of a sentence (for German) |
| treetagger.eos.en | String | false | false | `SENT` | Regular expression fragment: POS tag marking the end of a sentence (for English) |
| treetagger.tag-sentences | Boolean | false | false | `true` | Whether to add XML tags around each sentence |
| unflatten.attribute | String | false | false | `class` | Name of the attribute used for unflattening |
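
The transformer.osb.* parameters listed above control how joint features are generated from a token sequence. The following is a minimal Python sketch of the general orthogonal sparse bigrams idea, not the actual OSBTransformer implementation: each token is paired with each of the up to (length - 1) tokens preceding it, and skipped positions are marked. The function name `osb_features`, the `<skip>` marker, and the exact output format are assumptions made for illustration; the three keyword arguments mirror transformer.osb.length, transformer.osb.preserve, and transformer.osb.separator.

```python
from typing import List


def osb_features(tokens: List[str], length: int = 5, preserve: bool = False,
                 separator: str = " ", skip: str = "<skip>") -> List[str]:
    """Generate OSB-style joint features (illustrative sketch only).

    The "<skip>" marker and the output layout are assumptions; the real
    transformer may encode skipped positions differently.
    """
    if length < 2:
        raise ValueError("window length must be at least 2")
    features: List[str] = []
    for i, current in enumerate(tokens):
        if preserve:
            # Keep the original unigram alongside the joint features.
            features.append(current)
        # Pair the current token with each of the preceding tokens
        # that still fall inside the window.
        for distance in range(1, length):
            j = i - distance
            if j < 0:
                break
            gap = [skip] * (distance - 1)  # one marker per skipped token
            features.append(separator.join([tokens[j], *gap, current]))
    return features


if __name__ == "__main__":
    for feature in osb_features(["a", "b", "c"], length=3):
        print(feature)
```

With a window length of 3, the token sequence a, b, c yields the pairs "a b", "b c", and "a <skip> c". Setting `preserve=True` additionally keeps each unigram, which corresponds to the behavior described for transformer.osb.preserve.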