Name | Type | Optional? | List? | Value | Description |
---|---|---|---|---|---|
adjust.delete.control-chars | Boolean | false | false | false | Whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1) |
adjust.delete.pseudo-tags | Boolean | false | false | false | Whether to delete "pseudo-tags" |
adjust.delete.trailing-garbage | Boolean | false | false | false | Whether to delete trailing garbage (illegal content that occurs after the root tag has been closed) |
adjust.emptiable-tags | String | true | true | | Set of names of tags that can be converted to empty tags when required |
adjust.escape.pseudo-entities | Boolean | false | false | false | Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped) |
adjust.missing-root | String | true | false | document | The name to use for the root element if missing |
analyzer.aug-dir | String | false | false | .. | Directory containing the augmented (preprocessed) input texts, relative to the current working directory |
answers.attrib | String | true | false | | Name of attribute to read element types from if the element name shouldn't be used |
answers.element | String | true | true | | If not empty, answer types are read from the "answers.attrib" attribute of the elements specified in this list instead of using element names as types |
charset | String | true | false | | Character set to use when reading and writing local files. If omitted (default), the default charset of the current platform is used. |
classes | String | false | true | | Names of the classes to recognize (temporarily) |
classifier | String | false | true | [Multi, Winnow] | The trainable classifier to use, either "Ext" for an external classifier; "Winnow" for the Winnow algorithm; "ucWinnow" for Ultraconservative Winnow; the qualified name of a TrainableClassifier subclass; or "Multi" resp. "OAR" followed by the name of |
classifier.ext.classify | String | false | true | [crm, -{ isolate (:stats:); classify <osb> (:*:_arg2:) (:stats:) /[[:graph:]]+/; output /:*:stats:/ }] | Command name + arguments to call for classification (list of possible target classes will be last arg, feature vector stdin) |
classifier.ext.directory | String | true | false | | The directory to run the classifier in (defaults to current working directory) |
classifier.ext.init | String | true | true | [cssutil, -b, -r, -s, 94321] | Command name + arguments to call for initialization (class to initialize will be last arg) |
classifier.ext.regex | String | false | false | \((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+) | Regular expression to extract for all, or at least the best, classes (group 1) the probability (group 2) and optionally the pR = log(P / (1-P)) from the classifier's stdout |
classifier.ext.reset | String | true | true | [rm, -f] | Command name + arguments to call for resetting the classifier by deleting the prediction model (class to reset will be last arg) |
classifier.ext.suffix | String | true | false | .css | The suffix to append to classes for the classifier |
classifier.ext.threshold.pR | Double | true | false | 20 | If specified the classifier is trained if the pR is below this value as well as on errors ("thick threshold" heuristic) |
classifier.ext.threshold.prob | Double | true | false | | If specified the classifier is trained if the probability is below this value (must be < 1.0) as well as on errors ("thick threshold" heuristic) |
classifier.ext.train | String | false | true | [crm, -{ learn <osb microgroom> (:*:_arg2:) /[[:graph:]]+/ }] | Command name + arguments to call for training (expected target class will be last arg, feature vector stdin) |
classifier.file | String | false | false | classifier.xsj | Name of the file used for storing the classifier |
classifier.meta.judge | String | false | true | Winnow | The specification of the judge classifiers used in the MetaClassifier (same syntax as "classifier" parameter) |
classifier.meta.layers | Integer | false | false | 2 | The number of layers to use in the MetaClassifier (at least 1, typically 2 or more) |
classifier.re-use | Boolean | false | false | true | Whether to re-use classifiers between several runs (incl. classifiers stored in the classifier.file, if exists) |
classifier.store | Boolean | false | false | false | Whether to store the final classifier in the file specified by the classifier.file parameter |
classifier.test-only | Boolean | false | false | false | If set to true, the classifier will be used only for prediction -- no training will take place |
classifier.text | String | false | true | [OAR, Winnow] | The trainable classifier to use, either "Ext" for an external classifier; "Winnow" for the Winnow algorithm; "ucWinnow" for Ultraconservative Winnow; the qualified name of a TrainableClassifier subclass; or "Multi" resp. "OAR" followed by the name of (for text classification) |
classifier.tie.layers | Integer | false | false | 2 | The number of layers to use in the TieClassifier (at least 1, typically 2 or more) |
classifier.tie.threshold | Double | false | false | 0.99 | TieClassifier invokes the next layer if the relative probability of the second best prediction is above or equal to this threshold (must be in the 0 to 1 range) |
classifier.train.all | Boolean | false | false | false | If true the classifier considers all classes for error-driven training, not only the candidate classes |
classifier.winnow.balanced | Boolean | false | false | false | Whether to use the Balanced Winnow or the standard Winnow algorithm; Balanced Winnow keeps two weights per feature and class, a positive and a negative one |
classifier.winnow.demotion | Float | false | false | 0.83 | The demotion factor used by the Winnow classifier |
classifier.winnow.features | Integer | false | false | 2000000 | The number of features to store |
classifier.winnow.ignore.exponent | Integer | false | false | 1 | If the ignore parameter is true, all features in the range from demotion factor^exponent to |
classifier.winnow.ignore.irrelevant | Boolean | false | false | false | Whether to ignore features within a certain range around the default weight for classification |
classifier.winnow.promotion | Float | false | false | 1.23 | The promotion factor used by the Winnow classifier |
classifier.winnow.shared-store | Boolean | false | false | true | Whether a shared feature store is used for all Winnow instances |
classifier.winnow.threshold.thickness | Float | false | false | 0.05 | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise |
classifier.winnow.threshold.thickness.uc | Float | false | false | 0.1 | The thickness of the threshold if the "thick threshold" heuristic is used (must be < 1.0), 0.0 otherwise (for the "ultraconservative" variant of the Winnow classifier) |
combination.adapter | String | false | false | | Mapping from a regular expression to a replacement text used to translate between an external labeling convention and the internal convention used by a combination strategy |
combination.begin-end.level2 | Boolean | false | false | false | Whether the begin/end combination strategy should use a second level similar to the ELIE/2 system by Finn and Kushmerick |
combination.strategy | String | false | false | IOB2 | Strategy to combine extractions, either "BE" for the begin/end strategy which uses two classifiers, "BIA" for the begin/after (a.k.a. BIA) strategy, "BIE1" or "BIE2" for variations of the open/close (a.k.a. BIE) strategy, "IOB1" or "IOB2" for variations of the inside/outside strategy, "Triv" for the trivial strategy, or the qualified name of a CombinationStrategy subclass |
compress.gzip | Boolean | false | false | false | Whether to compress your data in gzip format |
compress.gzip.xsj | Boolean | false | false | true | Whether to compress your data in gzip format (for XML-serialized Java) |
dsv.entry.separator.ws | Boolean | false | false | false | If set to true, any sequence of whitespace is assumed to separate DSV entries; otherwise only newlines are accepted as separators |
dsv.field.separator | String | true | false | \| | Separator between fields in a DSV entry; a single space is used if the value is empty or missing |
dsv.keys | String | true | true | | The list of keys (header names) for DSV files; if none, the keys are read from the first line of a file |
dsv2xml.attribs.omit | Boolean | false | false | false | Whether to omit all attributes when converting DSV to XML |
dsv2xml.level2 | String | true | false | | Optional name of second level elements to insert at the start of a document and whenever an empty line is encountered -- if given, all field maps will be appended as child elements (third level) to the preceding second level element |
dsv2xml.root | String | false | false | document | The name used for the root element by the DSV-to-XML converter |
eval.feedback | Boolean | false | false | false | If true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it |
eval.match.all | Boolean | false | false | true | Whether to use "match-all" or "match-best" (more probably) as match mode |
eval.match.pos | Boolean | false | false | true | If true, the positions of extraction and answer keys must match; otherwise only their contents must match (string compare) |
eval.split.separator | String | true | false | | If given, the specified string is used to separate the training from the testing section of the corpus (e.g. "---") |
eval.test-split | Float | false | false | -1 | The percentage of a corpus to use for testing (evaluation); if -1, all remaining documents (1 - eval.train-split) are used |
eval.train-split | Float | false | false | 0.5 | The percentage of a corpus to use for training |
eval.train-split.text | Float | false | false | 1 | The percentage of a corpus to use for training (for text classification) |
eval.tune.each | Boolean | false | false | false | If true, evaluation results are measured after each training iteration (starting from eval.tune.since); otherwise after the last one |
eval.tune.list | Integer | true | true | 1 | A list of iterations after which to evaluate TUNE training in addition to the last one; ignored if eval.tune.each is true |
eval.tune.since | Integer | false | false | 1 | The training iteration after which to evaluate results for the first time if eval.tune.each is enabled |
ext.doc | String | false | false | application/msword | Maps a file extension to a MIME type contained in matching files (for DOC files) |
ext.dot | String | false | false | application/msword | Maps a file extension to a MIME type contained in matching files (for DOT (Document Template) files) |
ext.htm | String | false | false | text/html | Maps a file extension to a MIME type contained in matching files (for HTM files) |
ext.html | String | false | false | text/html | Maps a file extension to a MIME type contained in matching files (for HTML files) |
ext.pdf | String | false | false | application/pdf | Maps a file extension to a MIME type contained in matching files (for PDF files) |
ext.rtf | String | false | false | text/rtf | Maps a file extension to a MIME type contained in matching files (for RTF (Rich Text Format) files) |
ext.txt | String | false | false | text/plain | Maps a file extension to a MIME type contained in matching files (for TXT (plain text) files) |
ext.uri | String | false | false | text/uri-list | Maps a file extension to a MIME type contained in matching files (for URI files) |
ext.uris | String | false | false | text/uri-list | Maps a file extension to a MIME type contained in matching files (for URIS files) |
ext.xhtml | String | false | false | text/html | Maps a file extension to a MIME type contained in matching files (for XHTML files) |
ext.xml | String | false | false | text/xml | Maps a file extension to a MIME type contained in matching files (for XML files) |
externalize.key | String | false | false | File | The name of the field to externalize |
extract.bias | Double | true | false | | Bias that reduces or increases the score calculated for a class |
extract.evaluate | Boolean | false | false | true | Whether to evaluate predictions by comparing them to answer keys, otherwise predictions are stored without evaluating them |
extract.pred.ext | String | false | false | pred | The extension used to store predictions (if the extract.evaluate option is set to false) |
extract.pred.use-outdir | Boolean | false | false | true | Whether to write prediction files to the configured output directory or the directory containing the input file |
extract.punctuation.relevant | String | true | true | [., )] | A list of punctuation and symbol tokens that are considered as relevant from the very start (other such tokens are added on demand) |
feature.extractor | String | false | false | de.fu_berlin.ties.text.TokenizingExtractor | Qualified name of a de.fu_berlin.ties.classify.feature.FeatureExtractor instance to be used for converting text sequences into feature vectors |
file.ext | String | true | false | | The extension to append to file names (if any); currently used by the class-train goal |
filter.avoid | String | true | true | [pos, const, entity] | List of elements that should be avoided (use parent element instead) when filtering as first step of a double classification approach |
filter.elements | String | true | true | | List of elements to filter as the first step of a double classification approach ("sentence filtering"); if none, no sentence filtering is used |
goal.adjust | Class definition | false | false | [de.fu_berlin.ties.xml.XMLAdjuster, xml] | Tries to fix corrupt XML documents, especially documents containing nesting errors |
goal.analyze | Class definition | false | false | [de.fu_berlin.ties.eval.MistakeAnalyzer, mistakes] | Analyses the types of prediction errors that occurred during a test run |
goal.answers | Class definition | false | false | [de.fu_berlin.ties.extract.AnswerBuilder, ans] | Builds answer keys from an annotated text (in XML format) |
goal.avg-length | Class definition | false | false | [de.fu_berlin.ties.eval.AverageLength, avl] | Calculates the average length for extractions of different types and evaluation statuses |
goal.class-train | Class definition | false | false | [de.fu_berlin.ties.classify.ClassTrain, cls] | Classifies a list of files, training the text classifier on each error |
goal.dsv2xml | Class definition | false | false | [de.fu_berlin.ties.xml.convert.DSVtoXMLConverter, xml] | Converts data in DSV format into XML |
goal.eval-preds | Class definition | false | false | de.fu_berlin.ties.eval.PredictionEvaluator | Reads a set of files that must contain predictions and evaluates them against the corresponding answer keys (*.ans files) |
goal.externalize | Class definition | false | false | [de.fu_berlin.ties.io.Externalize, dsv] | Externalizes the contents of a file in DSV format. For each entry, the contents of one specified field (read from the "externalize.key" configuration parameter) are stored in an external file whose name is stored in the output DSV file instead of its content. |
goal.extract | Class definition | false | false | [de.fu_berlin.ties.extract.Extractor, pred] | Extracts relevant information from texts |
goal.filter | Class definition | false | false | de.fu_berlin.ties.classify.TextFilter | A simple filter for classifying and/or training text files |
goal.preprocess | Class definition | false | false | [de.fu_berlin.ties.preprocess.PreProcessor, aug] | Preprocesses documents by converting them to a suitable XML format and adding linguistic information |
goal.re-eval | Class definition | false | false | [de.fu_berlin.ties.eval.ReEvaluator, ext] | Re-evaluates evaluated extractions (useful for switching the match mode -- eval.match.all) |
goal.shuffle | Class definition | false | false | de.fu_berlin.ties.eval.ShuffleGenerator | Creates random "shuffles" of input arguments (e.g. files or URLs) |
goal.shuffle-lines | Class definition | false | false | [de.fu_berlin.ties.eval.LineShuffleGenerator, rand] | Randomly reshuffles the lines in a file |
goal.simple-quotes | Class definition | false | false | [de.fu_berlin.ties.text.SimplifyQuotes, txt] | Simplifies different kinds of quotes that can occur in text files |
goal.split | Class definition | false | false | de.fu_berlin.ties.io.Split | Splits an input file into a series of output files |
goal.strip | Class definition | false | false | [de.fu_berlin.ties.xml.dom.XMLStripper, txt] | Strips all markup from an XML document and stores the resulting plain text |
goal.train | Class definition | false | false | de.fu_berlin.ties.extract.Trainer | Trains the classifier used to extract information |
goal.train-eval | Class definition | false | false | [de.fu_berlin.ties.extract.TrainEval, metrics] | Trains an extractor and evaluates extraction quality |
goal.unflatten | Class definition | false | false | [de.fu_berlin.ties.xml.convert.AttributeUnflatten, xml] | Unflattens an XML document, reading labels for a combination strategy from an XML attribute ("class" by default) |
html-converter.command.text/plain | String | false | true | [txt2html, --prebegin, 1, --preend, 1, --tables, --nounhyphenation, --xhtml, --doctype, , -8] | External converter from specified MIME type to XHTML (first element: command name, further elements: command-line arguments) (for plain text) |
lang | String | false | false | en | Language of documents (ISO 639 language code) |
lengthfilter.tolerance | Double | false | false | 1.05 | Length filter discards extractions longer than the longest trained extraction multiplied by this tolerance factor |
log.log | String | false | false | DEBUG | Only messages with this priority or higher are logged: DEBUG, INFO, WARN, ERROR, FATAL_ERROR or NONE (ignoring case) |
log.show | String | false | false | INFO | Only messages with this priority or higher are written to standard output (but only if covered by log.log) |
mime.application/rtf | String | false | false | text/rtf | Maps an alternative MIME type to a main MIME type (for Rich Text Format) |
mime.application/xhtml+xml | String | false | false | text/html | Maps an alternative MIME type to a main MIME type (for XHTML content) |
mime.application/xml | String | false | false | text/xml | Maps an alternative MIME type to a main MIME type (for XML content) |
outdir | String | true | false | | Output files are written to this directory, if specified; otherwise they are written to the same directory as the corresponding input file, or to the working directory if there is no local input file; not relevant for log files, which are always written to the working directory |
post | Class definition | false | false | | Postprocessor for files with the given extension (must extend TextProcessor) |
preprocess.tagger | String | true | true | de.fu_berlin.ties.preprocess.TreeTagger | A tagger (or a list of taggers) used to annotate a text e.g. with linguistic information; each tagger must implement the TextProcessor interface and accept a string (the output extension) as single constructor argument |
preprocess.text | Boolean | false | false | true | Whether plain text is preprocessed to recognize and reformat definition lists |
prune.candidates | Integer | false | false | 1 | The number of candidates considered for each pruning operation; if 1, the feature store behaves like a standard LRU cache |
prune.num | Integer | false | false | 1 | The number of candidates to prune by each pruning operation; must not be larger than prune.candidates |
reestimator.chain | String | true | true | de.fu_berlin.ties.extract.reestimate.LengthFilter | List of re-estimators (fully specified names of de.fu_berlin.ties.extract.reestimate.Reestimator subclasses) used in a chain |
representation.ancestor.num | Integer | false | false | 4 | Maximum number of ancestors to represent |
representation.default.attribs | String | false | true | [type, class] | Local names of default attributes |
representation.head.attrib | String | false | false | normal | Local name of the attribute to use for calculating head values |
representation.head.element | String | false | false | const | Local name of the element to use for calculating head values |
representation.prefix.maxlength | Integer | false | false | 4 | The maximum length of prefixes and suffixes |
representation.recogn.detailed | Integer | false | false | 2 | The number of preceding recognitions to represent in detail |
representation.recogn.num | Integer | false | false | 4 | The number of preceding recognitions to represent |
representation.sensors | String | true | true | de.fu_berlin.ties.context.sensor.ListSensor | List of classes implementing the Sensor interface that are used to add e.g. semantic information to tokens (each class must provide a constructor that accepts a TiesConfiguration as single argument) |
representation.sibling.num | Integer | false | false | 4 | Basic number of preceding and following siblings to represent |
representation.split.maximum | Integer | false | false | 4 | Maximum number of subsequences to keep when a feature value must be split (at whitespace) |
representation.store.nth | Integer | false | false | 0 | If > 0, every n-th (n=given value) context is stored for debugging and inspection purposes |
rewriter.pred.classes | String | true | true | | Names of the prediction classes to use -- all are used if this parameter is missing or empty |
rewriter.pred.ext | String | false | false | pred | Extension of the files containing predictions |
rewriter.pred.none | String | true | false | | "None" marker to use for tokens that do not belong to any prediction -- if empty or missing, these tokens are not tagged |
rewriters | String | true | true | | List of classes extending DocumentRewriter that are used to add e.g. semantic information to documents (each class must provide a constructor that accepts a TiesConfiguration as single argument) |
sensor.list.basepath | String | true | false | ${user.home}/lib/semantic | Gazetteer files are resolved relative to this path, if given |
sensor.list.case | Boolean | false | false | false | Gazetteer entries are looked up case-sensitively if set to true |
sensor.list.map.dict | String | false | false | american-english.gz | Maps an identifier to a gazetteer file (American English dictionary) |
sensor.list.map.lastname | String | false | false | census.last.frequent.gz | Maps an identifier to a gazetteer file (list of last names from US census: http://www.census.gov/genealogy/names/) |
sensor.list.map.name-female | String | false | false | census.female.first.gz | Maps an identifier to a gazetteer file (list of female names from US census) |
sensor.list.map.name-male | String | false | false | census.male.first.gz | Maps an identifier to a gazetteer file (list of male names from US census) |
sensor.list.map.suffix | String | false | false | usps.suffix.gz | Maps an identifier to a gazetteer file (address suffixes from US Postal Service) |
sensor.list.map.title | String | false | false | wikipedia.titles.gz | Maps an identifier to a gazetteer file (titles collected from Wikipedia: http://en.wikipedia.org/wiki/Title) |
sensor.list.negative | Boolean | false | false | true | Whether to add a negative marker if there is no positive information for a token |
sensor.list.negative.value | String | true | false | none | The key to use as negative marker if configured; if not set or empty, "false" is added for each gazetteer type if sensor.list.negative is true |
sent.tune | Integer | false | false | 0 | For how many iterations to TUNE the sentence classifier; if 0 or negative, it is TUNE trained for all iterations |
shuffle.ext | String | false | false | list | Suffix used for the generated random shuffle files |
shuffle.first | Integer | false | false | 1 | The number used in the name of the first generated shuffle file |
shuffle.lines.ignore-first | Integer | false | false | 1 | The number of lines at the start of a file that are ignored by the line shuffle generator |
shuffle.num | Integer | false | false | 5 | The number of random shuffles to generate |
shuffle.prefix | String | false | false | run | Prefix used for the generated random shuffle files |
split.pattern | String | true | false | | The default pattern used by the split goal to split input |
strip.normalize | Boolean | false | false | false | Whether or not to normalize whitespace |
strip.to-xml | Boolean | false | false | false | If this is set to true, the strip target generates an XML file instead of a plain text file, by preserving the root element (while still discarding all other elements + attributes) |
textfilter.classes | String | false | true | [spam, ham] | Names of the classes used to filter text. The probability of the very first class will be returned as "score". |
tokenizer.pattern | String | false | true | [(?:\p{N}[.,:]\p{N}|[\p{L}\p{M}\p{N}])+, \p{Sc}+(?:\p{N}[.,:]\p{N}|\p{N})*, [\p{Sm}\p{Sk}\p{So}]+, (\p{P})\1*] | List of regular expressions defining the token types accepted by the tokenizer |
tokenizer.pattern.classifier | String | false | true | [^\p{Z}\p{C}][/!?#]?[-\p{L}\p{M}\p{N}]*(?:["'=;]|/?>|:/*)? | List of regular expressions defining the token types accepted by the tokenizer (for classification of whole texts) |
tokenizer.whitespace | String | false | false | [\p{Z}\p{C}]* | Regular expression giving the whitespace accepted by the tokenizer |
train.only-errors | Boolean | false | false | true | Whether to train only on errors (TOE mode, recommended) or on all instances (brute-force mode) |
train.test-only | Boolean | false | false | false | If enabled the trainer only checks whether all answer keys exist and can be located in the document (logging errors if they don't), it doesn't do any training |
train.tune | Integer | false | false | 15 | The maximum number of iterations used for training the extractor; if 1, training is incremental |
train.tune.stop | Integer | false | false | 1 | TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations |
train.tune.text | Integer | false | false | 1 | The maximum number of iterations used for training the extractor; if 1, training is incremental (for text classification) |
transformer.chain | String | true | true | de.fu_berlin.ties.classify.feature.OSBTransformer | List of feature transformers (fully specified names of de.fu_berlin.ties.classify.FeatureTransformer subclasses) used in a chain |
transformer.osb.length | Integer | false | false | 5 | The window length used by the OSB (orthogonal sparse bigrams) transformer to generate joint features; minimum value is 2 |
transformer.osb.preserve | Boolean | false | false | false | If true the OSB transformer preserves the original features (unigrams); otherwise it discards them (using only the generated joint features) |
transformer.osb.separator | String | true | false | | Used by the OSB transformer to separate joint features; a space character is used if not specified |
transformer.sbph.length | Integer | false | false | 5 | The window length used by the SBPH (sparse binary polynomial hashing) transformer to generate joint features |
transformer.sbph.separator | String | true | false | | Used by the SBPH transformer to separate joint features; a space character is used if not specified |
treetagger.after-eos.en | String | true | true | '' | List of POS tags still to include in a previous sentence (for English) |
treetagger.command.de | String | false | false | tagger-chunker-german | Name of the TreeTagger command (for German) |
treetagger.command.en | String | false | false | tagger-chunker-english | Name of the TreeTagger command (for English) |
treetagger.eos.de | String | false | false | \$\. | Regular expression fragment: POS tag marking the end of a sentence (for German) |
treetagger.eos.en | String | false | false | SENT | Regular expression fragment: POS tag marking the end of a sentence (for English) |
treetagger.tag-sentences | Boolean | false | false | true | Whether to add XML tags around each sentence |
unflatten.attribute | String | false | false | class | Name of the attribute used for unflattening |
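
The interplay of classifier.ext.regex and classifier.ext.threshold.pR can be illustrated with a small sketch. It is written in Python purely for illustration and is not taken from the TIES code base: the function names `parse_scores` and `should_train` are hypothetical, the sample classifier output is invented, and applying the threshold to the pR of the expected class is an assumption about how the "thick threshold" heuristic is evaluated.

```python
import re

# Default value of classifier.ext.regex, copied from the table above.
EXT_REGEX = re.compile(r"\((.*?)\).*?prob:\s+(.*?)[,\s]\s*pR:\s+(\S+)")


def parse_scores(stdout_text):
    """Return {class_name: (probability, pR)} for every match in the output."""
    scores = {}
    for match in EXT_REGEX.finditer(stdout_text):
        name, prob, p_r = match.groups()  # groups 1, 2 and 3 of the regex
        scores[name] = (float(prob), float(p_r))
    return scores


def should_train(scores, expected, threshold_pr=20.0):
    """Thick-threshold check (assumed reading of classifier.ext.threshold.pR).

    Train on errors, and also when the prediction is correct but the
    pR of the expected class is below the threshold.
    """
    best = max(scores, key=lambda name: scores[name][0])
    if best != expected:
        return True  # misclassification: always train
    return scores[expected][1] < threshold_pr


# Hypothetical classifier output, invented for this example only.
SAMPLE = (
    "#0 (spam.css): prob: 1.00e+00, pR: 25.3\n"
    "#1 (ham.css): prob: 5.01e-26, pR: -25.3\n"
)

if __name__ == "__main__":
    parsed = parse_scores(SAMPLE)
    print(parsed)
    print(should_train(parsed, "spam.css"))
```

With the default threshold of 20, the correct prediction in the sample would not trigger training because its pR lies above the threshold; raising the threshold, or passing a different expected class, would.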