TIE Configuration Parameters

Name | Type | Optional? | List? | Value | Description
adjust.delete.control-chars | Boolean | false | false | false | Whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1)
adjust.delete.pseudo-tags | Boolean | false | false | false | Whether to delete "pseudo-tags"
adjust.emptiable-tags | String | true | true | | Set of names of tags that can be converted to empty tags when required
adjust.escape.pseudo-entities | Boolean | false | false | false | Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped)
adjust.missing-root | String | true | false | document | The name to use for the root element if missing
charset | String | true | false | | Character set to use when reading and writing local files. If omitted (default), the default charset of the current platform is used.
classifier | String | false | true | [Multi, Winnow] | The trainable classifier to use, either "Ext" for an external classifier; "Winnow" for the Winnow algorithm;
classifier.ext.classify | String | false | true | [crm, -{ isolate (:stats:); classify <osb> (:*:_arg2:) (:stats:) /[[:graph:]]+/; output /:*:stats:/ }] | Command name + arguments to call for classification (list of possible target classes will be last arg, feature vector stdin)
classifier.ext.directory | String | true | false | | The directory to run the classifier in (defaults to current working directory)
classifier.ext.init | String | true | true | [cssutil, -b, -r, -s, 94321] | Command name + arguments to call for initialization (class to initialize will be last arg)
classifier.ext.regex | String | false | false | \((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+) | Regular expression to extract for all, or at least the best, classes (group 1) the probability (group 2) and optionally the pR = log(P / (1-P)) (group 3) from the classifier's stdout
classifier.ext.reset | String | true | true | [rm, -f] | Command name + arguments to call for resetting the classifier by deleting the prediction model (class to reset will be last arg)
classifier.ext.suffix | String | true | false | .css | The suffix to append to classes for the classifier
classifier.ext.threshold.pR | Double | true | false | 20 | If specified, the classifier is trained if the pR is below this value as well as on errors ("thick threshold" heuristic)
classifier.ext.threshold.prob | Double | true | false | | If specified, the classifier is trained if the probability is below this value (must be < 1.0) as well as on errors ("thick threshold" heuristic)
classifier.ext.train | String | false | true | [crm, -{ learn <osb microgroom> (:*:_arg2:) /[[:graph:]]+/ }] | Command name + arguments to call for training (expected target class will be last arg, feature vector stdin)
classifier.meta.judge | String | false | true | Winnow | The specification of the judge classifiers used in the MetaClassifier (same syntax as "classifier" parameter)
classifier.meta.layers | Integer | false | false | 2 | The number of layers to use in the MetaClassifier (at least 1, typically 2 or more)
classifier.train.all | Boolean | false | false | false | If true, the classifier considers all classes for error-driven training, not only the candidate classes
classifier.winnow.balanced | Boolean | false | false | false | Whether to use the Balanced Winnow or the standard Winnow algorithm; Balanced Winnow keeps two weights per feature and class, a positive and a negative one
classifier.winnow.demotion | Float | false | false | 0.83 | The demotion factor used by the Winnow classifier
classifier.winnow.features | Integer | false | false | 600000 | The number of features to store
classifier.winnow.promotion | Float | false | false | 1.23 | The promotion factor used by the Winnow classifier
classifier.winnow.strength.frequency | String | false | false | constant | How feature frequencies are considered when calculating strength values: one of "constant" (not at all), "log" (logarithmic), "sqrt" (square root), or "linear"
classifier.winnow.threshold.thickness | Float | false | false | 0.05 | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise
classifier.winnow.threshold.thickness.uc | Float | false | false | 0.1 | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise (for the "ultraconservative" variant of the Winnow classifier)
combination.strategy | String | false | false | IOB2 | Strategy to combine extractions, either "BE" for the begin/end strategy which uses two classifiers, "BIA" for the begin/after (a.k.a. BIA) strategy, "BIE" for open/close (a.k.a. BIE), "IOB1" or "IOB2" for variations of the inside/outside strategy, "Triv" for the trivial strategy, or the qualified name of a CombinationStrategy subclass
compress.gzip | Boolean | false | false | true | Whether to compress your data in gzip format
eval.feedback | Boolean | false | false | false | If true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it
eval.match.all | Boolean | false | false | true | Whether to use "match-all" or "match-best" (more probably) as match mode
eval.match.pos | Boolean | false | false | true | If true, the positions of extraction and answer keys must match; otherwise only their contents must match (string compare)
eval.test-split | Float | false | false | -1 | The fraction of a corpus to use for testing (evaluation); if -1, all remaining documents (1 - eval.train-split) are used
eval.train-split | Float | false | false | 0.5 | The fraction of a corpus to use for training
eval.tune.each | Boolean | false | false | false | If true, evaluation results are measured after each training iteration (starting from eval.tune.since); otherwise only after the last
eval.tune.list | Integer | true | true | 1 | A list of iterations after which to evaluate TUNE training in addition to the last one; ignored if eval.tune.each is true
eval.tune.since | Integer | false | false | 1 | The training iteration after which to evaluate results for the first time if eval.tune.each is enabled
ext.doc | String | false | false | application/msword | Maps a file extension to the MIME type contained in matching files (for DOC files)
ext.dot | String | false | false | application/msword | Maps a file extension to the MIME type contained in matching files (for DOT (Document Template) files)
ext.htm | String | false | false | text/html | Maps a file extension to the MIME type contained in matching files (for HTM files)
ext.html | String | false | false | text/html | Maps a file extension to the MIME type contained in matching files (for HTML files)
ext.pdf | String | false | false | application/pdf | Maps a file extension to the MIME type contained in matching files (for PDF files)
ext.rtf | String | false | false | text/rtf | Maps a file extension to the MIME type contained in matching files (for RTF (Rich Text Format) files)
ext.txt | String | false | false | text/plain | Maps a file extension to the MIME type contained in matching files (for TXT (plain text) files)
ext.uri | String | false | false | text/uri-list | Maps a file extension to the MIME type contained in matching files (for URI files)
ext.uris | String | false | false | text/uri-list | Maps a file extension to the MIME type contained in matching files (for URIS files)
ext.xhtml | String | false | false | text/html | Maps a file extension to the MIME type contained in matching files (for XHTML files)
ext.xml | String | false | false | text/xml | Maps a file extension to the MIME type contained in matching files (for XML files)
extract.bias | Double | true | false | | Bias that reduces or increases the score calculated for a class
extract.punctuation.relevant | String | true | true | [., )] | A list of punctuation and symbol tokens that are considered as relevant from the very start (other such tokens are added on demand)
file.ext | String | true | false | | The extension to append to file names (if any), currently used by the class-train goal
filter.avoid | String | true | true | [pos, const, entity] | List of elements that should be avoided (use parent element instead) when filtering as first step of a double classification approach
filter.elements | String | true | true | | List of elements to filter as the first step of a double classification approach ("sentence filtering"); if none, no sentence filtering is used
goal.adjust | Class definition | false | false | [de.fu_berlin.ties.xml.XMLAdjuster, xml] | Tries to fix corrupt XML documents, especially documents containing nesting errors
goal.answers | Class definition | false | false | [de.fu_berlin.ties.extract.AnswerBuilder, ans] | Builds answer keys from an annotated text (in XML format)
goal.class-train | Class definition | false | false | [de.fu_berlin.ties.classify.ClassTrain, cls] | Classifies a list of files, training the text classifier on each error
goal.extract | Class definition | false | false | [de.fu_berlin.ties.extract.Extractor, ext] | Extracts relevant information from texts
goal.preprocess | Class definition | false | false | [de.fu_berlin.ties.preprocess.PreProcessor, aug] | Preprocesses documents by converting them to a suitable XML format and adding linguistic information
goal.re-eval | Class definition | false | false | [de.fu_berlin.ties.eval.ReEvaluator, ext] | Re-evaluates evaluated extractions (useful for switching the match mode -- eval.match.all)
goal.shuffle | Class definition | false | false | de.fu_berlin.ties.eval.ShuffleGenerator | Creates random "shuffles" of input arguments (e.g. files or URLs)
goal.strip | Class definition | false | false | [de.fu_berlin.ties.xml.dom.XMLStripper, txt] | Strips all markup from an XML document and stores the resulting plain text
goal.train | Class definition | false | false | de.fu_berlin.ties.extract.Trainer | Trains the classifier used to extract information
goal.train-eval | Class definition | false | false | [de.fu_berlin.ties.extract.TrainEval, metrics] | Trains an extractor and evaluates extraction quality
html-converter.command.text/plain | String | false | true | [txt2html, --prebegin, 1, --preend, 1, --tables, --xhtml, --doctype, , -8] | External converter from specified MIME type to XHTML (first element: command name, further elements: command-line arguments) (for plain text)
lang | String | false | false | en | Language of documents (ISO 639 language code)
log.log | String | false | false | DEBUG | Only messages with this priority or higher are logged: DEBUG, INFO, WARN, ERROR, FATAL_ERROR or NONE (ignoring case)
log.show | String | false | false | INFO | Only messages with this priority or higher are written to standard output (but only if covered by log.log)
mime.application/rtf | String | false | false | text/rtf | Maps an alternative MIME type to a main MIME type (for Rich Text Format)
mime.application/xhtml+xml | String | false | false | text/html | Maps an alternative MIME type to a main MIME type (for XHTML content)
mime.application/xml | String | false | false | text/xml | Maps an alternative MIME type to a main MIME type (for XML content)
outdir | String | true | false | | Output files are written to this directory, if specified; otherwise they are written to the same directory as the corresponding input file, or to the working directory if there is no local input file; not relevant for log files, which are always written to the working directory
post | Class definition | false | false | | Postprocessor for files with the given extension (must extend TextProcessor)
preprocess.tagger | String | true | true | de.fu_berlin.ties.preprocess.TreeTagger | A tagger (or a list of taggers) used to annotate a text, e.g. with linguistic information; each tagger must implement the TextProcessor interface and accept a string (the output extension) as single constructor argument
preprocess.text | Boolean | false | false | true | Whether plain text is preprocessed to recognize and reformat definition lists
prune.candidates | Integer | false | false | 1 | The number of candidates considered for each pruning operation; if 1, the feature store behaves like a standard LRU cache
prune.num | Integer | false | false | 1 | The number of candidates to prune by each pruning operation; must not be larger than prune.candidates
representation.ancestor.num | Integer | false | false | 4 | Maximum number of ancestors to represent
representation.default.attribs | String | false | true | [type, class] | Local names of default attributes
representation.head.attrib | String | false | false | normal | Local name of the attribute to use for calculating head values
representation.head.element | String | false | false | const | Local name of the element to use for calculating head values
representation.prefix.maxlength | Integer | false | false | 4 | The maximum length of prefixes and suffixes
representation.recogn.detailed | Integer | false | false | 2 | The number of preceding recognitions to represent in detail
representation.recogn.num | Integer | false | false | 4 | The number of preceding recognitions to represent
representation.sensors | List | true | false | de.fu_berlin.ties.context.sensor.ListSensor | List of classes implementing the Sensor interface that are used to add e.g. semantic information to tokens (each class must provide a constructor that accepts a TiesConfiguration as single argument)
representation.sibling.num | Integer | false | false | 4 | Basic number of preceding and following siblings to represent
representation.split.maximum | Integer | false | false | 4 | Maximum number of subsequences to keep when a feature value must be split (at whitespace)
representation.store.nth | Integer | false | false | 0 | If > 0, every n-th (n=given value) context is stored for debugging and inspection purposes
sensor.list.basepath | String | true | false | ${user.home}/lib/semantic | Gazetteer files are resolved relative to this path, if given
sensor.list.case | Boolean | false | false | false | Gazetteer entries are looked up case-sensitively if set to true
sensor.list.map.dict | String | false | false | american-english.gz | Maps an identifier to a gazetteer file (American English dictionary)
sensor.list.map.lastname | String | false | false | census.last.frequent.gz | Maps an identifier to a gazetteer file (list of last names from US census: http://www.census.gov/genealogy/names/)
sensor.list.map.name-female | String | false | false | census.female.first.gz | Maps an identifier to a gazetteer file (list of female names from US census)
sensor.list.map.name-male | String | false | false | census.male.first.gz | Maps an identifier to a gazetteer file (list of male names from US census)
sensor.list.map.suffix | String | false | false | usps.suffix.gz | Maps an identifier to a gazetteer file (address suffixes from US Postal Service)
sensor.list.map.title | String | false | false | wikipedia.titles.gz | Maps an identifier to a gazetteer file (titles collected from Wikipedia: http://en.wikipedia.org/wiki/Title)
sensor.list.negative | Boolean | false | false | true | Whether to add a negative marker if there is no positive information for a token
sensor.list.negative.value | String | true | false | none | The key to use as negative marker if configured; if not set or empty, "false" is added for each gazetteer type if sensor.list.negative is true
sent.tune | Integer | false | false | 0 | For how many iterations to TUNE the sentence classifier; if 0 or negative, it is TUNE trained for all iterations
shuffle.ext | String | false | false | list | Suffix used for the generated random shuffle files
shuffle.first | Integer | false | false | 1 | The number used in the name of the first generated shuffle file
shuffle.num | Integer | false | false | 5 | The number of random shuffles to generate
shuffle.prefix | String | false | false | run | Prefix used for the generated random shuffle files
target.classes | String | false | true | | Names of the classes to recognize (temporarily)
tokenizer.pattern | String | false | true | [(?:\p{N}[.,:]\p{N}|[\p{L}\p{M}\p{N}])+, \p{Sc}+(?:\p{N}[.,:]\p{N}|\p{N})*, [\p{Sm}\p{Sk}\p{So}]+, (\p{P})\1*] | List of regular expressions defining the token types accepted by the tokenizer
tokenizer.pattern.classifier | String | false | true | [^\p{Z}\p{C}][/!?#]?[-\p{L}\p{M}\p{N}]*(?:["'=;]|/?>|:/*)? | List of regular expressions defining the token types accepted by the tokenizer (for classification of whole texts)
tokenizer.whitespace | String | false | false | [\p{Z}\p{C}]* | Regular expression giving the whitespace accepted by the tokenizer
train.only-errors | Boolean | false | false | true | Whether to train only errors (TOE mode, recommended) or all instances (brute-force mode)
train.test-only | Boolean | false | false | false | If enabled, the trainer only checks whether all answer keys exist and can be located in the document (logging errors if they don't); it doesn't do any training
train.tune | Integer | false | false | 15 | The maximum number of iterations used for training the extractor; if 1, training is incremental
train.tune.stop | Integer | false | false | 2 | TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations
transformer.chain | String | true | true | de.fu_berlin.ties.classify.feature.OSBTransformer | List of feature transformers (fully specified names of de.fu_berlin.ties.classify.FeatureTransformer subclasses) used in a chain
transformer.osb.length | Integer | false | false | 5 | The window length used by the OSB (orthogonal sparse bigrams) transformer to generate joint features; minimum value is 2
transformer.osb.preserve | Boolean | false | false | false | If true, the OSB transformer preserves the original features (unigrams); otherwise it discards them (using only the generated joint features)
transformer.osb.separator | String | true | false | | Used by the OSB transformer to separate joint features; a space character is used if not specified
transformer.osb.strength.unigram | Float | false | false | 1 | Strength value used for unigrams, if the transformer is configured to preserve them
transformer.osb.strengths | Float | false | true | 1 | List of strength values used for the different kinds of bigrams
transformer.sbph.length | Integer | false | false | 5 | The window length used by the SBPH (sparse binary polynomial hashing) transformer to generate joint features
transformer.sbph.separator | String | true | false | | Used by the SBPH transformer to separate joint features; a space character is used if not specified
treetagger.after-eos.en | String | true | true | '' | List of POS tags still to include in the previous sentence (for English)
treetagger.command.de | String | false | false | tagger-chunker-german | Name of the TreeTagger command (for German)
treetagger.command.en | String | false | false | tagger-chunker-english | Name of the TreeTagger command (for English)
treetagger.eos.de | String | false | false | \$\. | Regular expression fragment: POS tag marking the end of a sentence (for German)
treetagger.eos.en | String | false | false | SENT | Regular expression fragment: POS tag marking the end of a sentence (for English)
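
Example: how classifier.ext.regex and classifier.ext.threshold.pR could interact

The following is a minimal Python sketch, not the actual implementation, of how the default classifier.ext.regex might be applied to the stdout of an external classifier and how the "thick threshold" rule of classifier.ext.threshold.pR could then decide whether to train. The sample output lines, the function names parse_classifier_output and needs_training, and the choice of Python are assumptions made for illustration; only the regular expression and the default threshold of 20 are taken from the table.

```python
import re

# Default value of classifier.ext.regex, copied from the table.
# Group 1: class name, group 2: probability, group 3: pR.
EXT_REGEX = re.compile(r"\((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+)")


def parse_classifier_output(stdout):
    """Return {class_name: (probability, pR)} for every line that matches."""
    results = {}
    for line in stdout.splitlines():
        match = EXT_REGEX.search(line)
        if match:
            name, prob, pr = match.groups()
            results[name] = (float(prob), float(pr))
    return results


def needs_training(correct, pr, threshold_pr=20.0):
    """Thick-threshold rule: train on errors and whenever pR is below the threshold."""
    return (not correct) or pr < threshold_pr


# Hypothetical classifier output; the exact layout is an assumption.
SAMPLE = """\
#0 (spam.css): features: 1024, hits: 512, prob: 9.99e-01, pR: 3.12
#1 (ham.css): features: 1024, hits: 128, prob: 1.00e-03, pR: -3.12
"""

scores = parse_classifier_output(SAMPLE)
# Pick the class with the highest probability.
best = max(scores, key=lambda name: scores[name][0])
print(best, scores[best], needs_training(True, scores[best][1]))
```

In this sketch a correct prediction with a pR of 3.12 still triggers training because it falls below the default threshold of 20; a correct prediction with a pR of 20 or more would not.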