Name | Type | Optional? | List? | Value | Description |
---|---|---|---|---|---|
adjust.delete.control-chars | Boolean | false | false | false | Whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1) |
adjust.delete.pseudo-tags | Boolean | false | false | false | Whether to delete "pseudo-tags" |
adjust.emptiable-tags | String | true | true | | Set of names of tags that can be converted to empty tags when required |
adjust.escape.pseudo-entities | Boolean | false | false | false | Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped) |
adjust.missing-root | String | true | false | document | The name to use for the root element if missing |
charset | String | true | false | | Character set to use when reading and writing local files. If omitted (default), the default charset of the current platform is used. |
classifier | String | false | true | [Multi, Winnow] | The trainable classifier to use, either "Ext" for an external classifier; "Winnow" for the Winnow algorithm; |
classifier.ext.classify | String | false | true | [crm, -{ isolate (:stats:); classify <osb> (:*:_arg2:) (:stats:) /[[:graph:]]+/; output /:*:stats:/ }] | Command name + arguments to call for classification (list of possible target classes will be last arg, feature vector stdin) |
classifier.ext.directory | String | true | false | | The directory to run the classifier in (defaults to current working directory) |
classifier.ext.init | String | true | true | [cssutil, -b, -r, -s, 94321] | Command name + arguments to call for initialization (class to initialize will be last arg) |
classifier.ext.regex | String | false | false | \((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+) | Regular expression to extract for all, or at least the best, classes (group 1) the probability (group 2) and optionally the pR = log(P / (1-P)) from the classifier's stdout |
classifier.ext.reset | String | true | true | [rm, -f] | Command name + arguments to call for resetting the classifier by deleting the prediction model (class to reset will be last arg) |
classifier.ext.suffix | String | true | false | .css | The suffix to append to class names for the classifier |
classifier.ext.threshold.pR | Double | true | false | 20 | If specified the classifier is trained if the pR is below this value as well as on errors ("thick threshold" heuristic) |
classifier.ext.threshold.prob | Double | true | false | | If specified the classifier is trained if the probability is below this value (must be < 1.0) as well as on errors ("thick threshold" heuristic) |
classifier.ext.train | String | false | true | [crm, -{ learn <osb microgroom> (:*:_arg2:) /[[:graph:]]+/ }] | Command name + arguments to call for training (expected target class will be last arg, feature vector stdin) |
classifier.meta.judge | String | false | true | Winnow | The specification of the judge classifiers used in the MetaClassifier (same syntax as "classifier" parameter) |
classifier.meta.layers | Integer | false | false | 2 | The number of layers to use in the MetaClassifier (at least 1, typically 2 or more) |
classifier.train.all | Boolean | false | false | false | If true the classifier considers all classes for error-driven training, not only the candidate classes |
classifier.winnow.balanced | Boolean | false | false | false | Whether to use the Balanced Winnow or the standard Winnow algorithm; Balanced Winnow keeps two weights per feature and class, a positive and a negative one |
classifier.winnow.demotion | Float | false | false | 0.83 | The demotion factor used by the Winnow classifier |
classifier.winnow.features | Integer | false | false | 600000 | The number of features to store |
classifier.winnow.promotion | Float | false | false | 1.23 | The promotion factor used by the Winnow classifier |
classifier.winnow.strength.frequency | String | false | false | constant | How feature frequencies are considered when calculating strength values: one of "constant" (not at all), "log" (logarithmic), "sqrt" (square root), or "linear" |
classifier.winnow.threshold.thickness | Float | false | false | 0.05 | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise |
classifier.winnow.threshold.thickness.uc | Float | false | false | 0.1 | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise (for the "ultraconservative" variant of the Winnow classifier) |
combination.strategy | String | false | false | IOB2 | Strategy to combine extractions, either "BE" for the begin/end strategy which uses two classifiers, "BIA" for the begin/after (a.k.a. BIA) strategy, "BIE" for open/close (a.k.a. BIE), "IOB1" or "IOB2" for variations of the inside/outside strategy, "Triv" for the trivial strategy, or the qualified name of a CombinationStrategy subclass |
compress.gzip | Boolean | false | false | true | Whether to compress your data in gzip format |
eval.feedback | Boolean | false | false | false | If true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it |
eval.match.all | Boolean | false | false | true | Whether to use "match-all" or "match-best" (more probably) as match mode |
eval.match.pos | Boolean | false | false | true | If true, the positions of extraction and answer keys must match; otherwise only their contents must match (string compare) |
eval.test-split | Float | false | false | -1 | The percentage of a corpus to use for testing (evaluation); if -1, all remaining documents (1 - eval.train-split) are used |
eval.train-split | Float | false | false | 0.5 | The percentage of a corpus to use for training |
eval.tune.each | Boolean | false | false | false | If true, evaluation results are measured after each training iteration (starting from eval.tune.since); otherwise only after the last one |
eval.tune.list | Integer | true | true | 1 | A list of iterations after which to evaluate TUNE training in addition to the last one; ignored if eval.tune.each is true |
eval.tune.since | Integer | false | false | 1 | The training iteration after which to evaluate results for the first time if eval.tune.each is enabled |
ext.doc | String | false | false | application/msword | Maps a file extension to the MIME type contained in matching files (for DOC files) |
ext.dot | String | false | false | application/msword | Maps a file extension to the MIME type contained in matching files (for DOT (Document Template) files) |
ext.htm | String | false | false | text/html | Maps a file extension to the MIME type contained in matching files (for HTM files) |
ext.html | String | false | false | text/html | Maps a file extension to the MIME type contained in matching files (for HTML files) |
ext.pdf | String | false | false | application/pdf | Maps a file extension to the MIME type contained in matching files (for PDF files) |
ext.rtf | String | false | false | text/rtf | Maps a file extension to the MIME type contained in matching files (for RTF (Rich Text Format) files) |
ext.txt | String | false | false | text/plain | Maps a file extension to the MIME type contained in matching files (for TXT (plain text) files) |
ext.uri | String | false | false | text/uri-list | Maps a file extension to the MIME type contained in matching files (for URI files) |
ext.uris | String | false | false | text/uri-list | Maps a file extension to the MIME type contained in matching files (for URIS files) |
ext.xhtml | String | false | false | text/html | Maps a file extension to the MIME type contained in matching files (for XHTML files) |
ext.xml | String | false | false | text/xml | Maps a file extension to the MIME type contained in matching files (for XML files) |
extract.bias | Double | true | false | | Bias that reduces or increases the score calculated for a class |
extract.punctuation.relevant | String | true | true | [., )] | A list of punctuation and symbol tokens that are considered as relevant from the very start (other such tokens are added on demand) |
file.ext | String | true | false | | The extension to append to file names (if any); currently used by the class-train goal |
filter.avoid | String | true | true | [pos, const, entity] | List of elements that should be avoided (use parent element instead) when filtering as first step of a double classification approach |
filter.elements | String | true | true | | List of elements to filter as the first step of a double classification approach ("sentence filtering"); if none, no sentence filtering is used |
goal.adjust | Class definition | false | false | [de.fu_berlin.ties.xml.XMLAdjuster, xml] | Tries to fix corrupt XML documents, especially documents containing nesting errors |
goal.answers | Class definition | false | false | [de.fu_berlin.ties.extract.AnswerBuilder, ans] | Builds answer keys from an annotated text (in XML format) |
goal.class-train | Class definition | false | false | [de.fu_berlin.ties.classify.ClassTrain, cls] | Classifies a list of files, training the text classifier on each error |
goal.extract | Class definition | false | false | [de.fu_berlin.ties.extract.Extractor, ext] | Extracts relevant information from texts |
goal.preprocess | Class definition | false | false | [de.fu_berlin.ties.preprocess.PreProcessor, aug] | Preprocesses documents by converting them to a suitable XML format and adding linguistic information |
goal.re-eval | Class definition | false | false | [de.fu_berlin.ties.eval.ReEvaluator, ext] | Re-evaluates evaluated extractions (useful for switching the match mode -- eval.match.all) |
goal.shuffle | Class definition | false | false | de.fu_berlin.ties.eval.ShuffleGenerator | Creates random "shuffles" of input arguments (e.g. files or URLs) |
goal.strip | Class definition | false | false | [de.fu_berlin.ties.xml.dom.XMLStripper, txt] | Strips all markup from an XML document and stores the resulting plain text |
goal.train | Class definition | false | false | de.fu_berlin.ties.extract.Trainer | Trains the classifier used to extract information |
goal.train-eval | Class definition | false | false | [de.fu_berlin.ties.extract.TrainEval, metrics] | Trains an extractor and evaluates extraction quality |
html-converter.command.text/plain | String | false | true | [txt2html, --prebegin, 1, --preend, 1, --tables, --xhtml, --doctype, , -8] | External converter from specified MIME type to XHTML (first element: command name, further elements: command-line arguments) (for plain text) |
lang | String | false | false | en | Language of documents (ISO 639 language code) |
log.log | String | false | false | DEBUG | Only messages with this priority or higher are logged: DEBUG, INFO, WARN, ERROR, FATAL_ERROR or NONE (ignoring case) |
log.show | String | false | false | INFO | Only messages with this priority or higher are written to standard output (but only if covered by log.log) |
mime.application/rtf | String | false | false | text/rtf | Maps an alternative MIME type to a main MIME type (for Rich Text Format) |
mime.application/xhtml+xml | String | false | false | text/html | Maps an alternative MIME type to a main MIME type (for XHTML content) |
mime.application/xml | String | false | false | text/xml | Maps an alternative MIME type to a main MIME type (for XML content) |
outdir | String | true | false | | Output files are written to this directory, if specified; otherwise they are written to the same directory as the corresponding input file or, if there is no local input file, to the working directory; not relevant for log files, which are always written to the working directory |
post | Class definition | false | false | | Postprocessor for files with the given extension (must extend TextProcessor) |
preprocess.tagger | String | true | true | de.fu_berlin.ties.preprocess.TreeTagger | A tagger (or a list of taggers) used to annotate a text e.g. with linguistic information; each tagger must implement the TextProcessor interface and accept a string (the output extension) as single constructor argument |
preprocess.text | Boolean | false | false | true | Whether plain text is preprocessed to recognize and reformat definition lists |
prune.candidates | Integer | false | false | 1 | The number of candidates considered for each pruning operation; if 1, the feature store behaves like a standard LRU cache |
prune.num | Integer | false | false | 1 | The number of candidates to prune by each pruning operation; must not be larger than prune.candidates |
representation.ancestor.num | Integer | false | false | 4 | Maximum number of ancestors to represent |
representation.default.attribs | String | false | true | [type, class] | Local names of default attributes |
representation.head.attrib | String | false | false | normal | Local name of the attribute to use for calculating head values |
representation.head.element | String | false | false | const | Local name of the element to use for calculating head values |
representation.prefix.maxlength | Integer | false | false | 4 | The maximum length of prefixes and suffixes |
representation.recogn.detailed | Integer | false | false | 2 | The number of preceding recognitions to represent in detail |
representation.recogn.num | Integer | false | false | 4 | The number of preceding recognitions to represent |
representation.sensors | List | true | false | de.fu_berlin.ties.context.sensor.ListSensor | List of classes implementing the Sensor interface that are used to add e.g. semantic information to tokens (each class must provide a constructor that accepts a TiesConfiguration as single argument) |
representation.sibling.num | Integer | false | false | 4 | Basic number of preceding and following siblings to represent |
representation.split.maximum | Integer | false | false | 4 | Maximum number of subsequences to keep when a feature value must be split (at whitespace) |
representation.store.nth | Integer | false | false | 0 | If > 0, every n-th (n=given value) context is stored for debugging and inspection purposes |
sensor.list.basepath | String | true | false | ${user.home}/lib/semantic | Gazetteer files are resolved relative to this path, if given |
sensor.list.case | Boolean | false | false | false | Gazetteer entries are looked up case-sensitive if set to true |
sensor.list.map.dict | String | false | false | american-english.gz | Maps an identifier to a gazetteer file (American English dictionary) |
sensor.list.map.lastname | String | false | false | census.last.frequent.gz | Maps an identifier to a gazetteer file (list of last names from US census: http://www.census.gov/genealogy/names/) |
sensor.list.map.name-female | String | false | false | census.female.first.gz | Maps an identifier to a gazetteer file (list of female names from US census) |
sensor.list.map.name-male | String | false | false | census.male.first.gz | Maps an identifier to a gazetteer file (list of male names from US census) |
sensor.list.map.suffix | String | false | false | usps.suffix.gz | Maps an identifier to a gazetteer file (address suffixes from US Postal Service) |
sensor.list.map.title | String | false | false | wikipedia.titles.gz | Maps an identifier to a gazetteer file (titles collected from Wikipedia: http://en.wikipedia.org/wiki/Title) |
sensor.list.negative | Boolean | false | false | true | Whether to add a negative marker if there is no positive information for a token |
sensor.list.negative.value | String | true | false | none | The key to use as negative marker if configured; if not set or empty, "false" is added for each gazetteer type if sensor.list.negative is true |
sent.tune | Integer | false | false | 0 | For how many iterations to TUNE the sentence classifier; if 0 or negative, it is TUNE trained for all iterations |
shuffle.ext | String | false | false | list | Suffix used for the generated random shuffle files |
shuffle.first | Integer | false | false | 1 | The number used in the name of the first generated shuffle file |
shuffle.num | Integer | false | false | 5 | The number of random shuffles to generate |
shuffle.prefix | String | false | false | run | Prefix used for the generated random shuffle files |
target.classes | String | false | true | | Names of the classes to recognize (temporarily) |
tokenizer.pattern | String | false | true | [(?:\p{N}[.,:]\p{N}|[\p{L}\p{M}\p{N}])+, \p{Sc}+(?:\p{N}[.,:]\p{N}|\p{N})*, [\p{Sm}\p{Sk}\p{So}]+, (\p{P})\1*] | List of regular expressions defining the token types accepted by the tokenizer |
tokenizer.pattern.classifier | String | false | true | [^\p{Z}\p{C}][/!?#]?[-\p{L}\p{M}\p{N}]*(?:["'=;]|/?>|:/*)? | List of regular expressions defining the token types accepted by the tokenizer (for classification of whole texts) |
tokenizer.whitespace | String | false | false | [\p{Z}\p{C}]* | Regular expression giving the whitespace accepted by the tokenizer |
train.only-errors | Boolean | false | false | true | Whether to train only errors (TOE mode, recommended) or all instances (brute-force mode) |
train.test-only | Boolean | false | false | false | If enabled the trainer only checks whether all answer keys exist and can be located in the document (logging errors if they don't), it doesn't do any training |
train.tune | Integer | false | false | 15 | The maximum number of iterations used for training the extractor; if 1, training is incremental |
train.tune.stop | Integer | false | false | 2 | TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations |
transformer.chain | String | true | true | de.fu_berlin.ties.classify.feature.OSBTransformer | List of feature transformers (fully specified names of de.fu_berlin.ties.classify.FeatureTransformer subclasses) used in a chain |
transformer.osb.length | Integer | false | false | 5 | The window length used by the OSB (orthogonal sparse bigrams) transformer to generate joint features; minimum value is 2 |
transformer.osb.preserve | Boolean | false | false | false | If true the OSB transformer preserves the original features (unigrams); otherwise it discards them (using only the generated joint features) |
transformer.osb.separator | String | true | false | | Used by the OSB transformer to separate joint features; a space character is used if not specified |
transformer.osb.strength.unigram | Float | false | false | 1 | Strength value used for unigrams, if the transformer is configured to preserve them |
transformer.osb.strengths | Float | false | true | 1 | List of strength values used for the different kinds of bigrams |
transformer.sbph.length | Integer | false | false | 5 | The window length used by the SBPH (sparse binary polynomial hashing) transformer to generate joint features |
transformer.sbph.separator | String | true | false | | Used by the SBPH transformer to separate joint features; a space character is used if not specified |
treetagger.after-eos.en | String | true | true | '' | List of POS tags still to include in the previous sentence (for English) |
treetagger.command.de | String | false | false | tagger-chunker-german | Name of the TreeTagger command (for German) |
treetagger.command.en | String | false | false | tagger-chunker-english | Name of the TreeTagger command (for English) |
treetagger.eos.de | String | false | false | \$\. | Regular expression fragment: POS tag marking the end of a sentence (for German) |
treetagger.eos.en | String | false | false | SENT | Regular expression fragment: POS tag marking the end of a sentence (for English) |
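
The default `classifier.ext.regex` above can be tried out in isolation. The following is a minimal Python sketch, not the actual TIES code: it applies the default pattern to each line of classifier output and then applies the "thick threshold" rule described for `classifier.ext.threshold.pR` (train on errors and whenever the pR is below the threshold). The sample output lines and the helper names `parse_classifier_output` and `should_train` are hypothetical; the real output format depends on the external classifier being called.

```python
import re

# Default value of classifier.ext.regex from the table above:
# group 1 = class, group 2 = probability, group 3 = pR.
EXT_REGEX = re.compile(r"\((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+)")


def parse_classifier_output(stdout_text):
    """Return one (class_name, probability, pR) tuple per matching line."""
    results = []
    for line in stdout_text.splitlines():
        match = EXT_REGEX.search(line)
        if match:
            name, prob, pr = match.groups()
            results.append((name, float(prob), float(pr)))
    return results


def should_train(predicted, expected, pr, pr_threshold=20.0):
    """Thick-threshold heuristic: train on errors and on low-pR predictions."""
    return predicted != expected or pr < pr_threshold


# Hypothetical classifier output, made up for illustration only.
SAMPLE = """\
#0 (spam.css): features: 100, hits: 50, prob: 9.50e-01, pR: 1.28
#1 (ham.css): features: 100, hits: 10, prob: 5.00e-02, pR: -1.28
"""

parsed = parse_classifier_output(SAMPLE)
best = max(parsed, key=lambda item: item[1])  # class with the highest probability
print(best)
print(should_train(best[0], "spam.css", best[2]))
```

Note that group 1 still carries the `classifier.ext.suffix` (here `.css`) in this sample, so a caller would likely strip it before comparing class names.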
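
To illustrate how the `transformer.osb.*` settings interact, here is a rough Python sketch of orthogonal sparse bigram generation. It is an illustration under assumptions rather than the behaviour of `OSBTransformer` itself: the `<skip>` placeholder, the order in which features are emitted, and the function name `osb_features` are all hypothetical. Only the roles of the window length (minimum 2), the separator (a space by default), and the option to preserve unigrams are taken from the table above.

```python
def osb_features(tokens, length=5, preserve=False, separator=" ", skip="<skip>"):
    """Pair each token with each of the up to (length - 1) tokens before it.

    The distance between the two tokens is encoded by inserting skip
    markers between them (an assumed encoding, chosen for readability).
    """
    if length < 2:
        raise ValueError("window length must be at least 2")
    features = []
    for i, right in enumerate(tokens):
        if preserve:
            # Keep the original unigram alongside the joint features.
            features.append(right)
        for distance in range(1, length):
            j = i - distance
            if j < 0:
                break  # no more preceding tokens inside the window
            gap = [skip] * (distance - 1)
            features.append(separator.join([tokens[j], *gap, right]))
    return features


print(osb_features(["a", "b", "c"], length=3))
print(osb_features(["a", "b", "c"], length=3, preserve=True))
```

With a window length of 3, the token `c` is combined with both `b` (adjacent) and `a` (one token skipped), which is why larger values of `transformer.osb.length` produce more joint features per token.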