de.fu_berlin.ties.extract
Class Trainer

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.DocumentReader
              extended by de.fu_berlin.ties.extract.ExtractorBase
                  extended by de.fu_berlin.ties.extract.Trainer
All Implemented Interfaces:
Oracle, SkipHandler, Processor, TokenProcessor

public class Trainer
extends ExtractorBase
implements Oracle

A trainer trains a local Classifier to be used for extraction.

Instances of this class are not thread-safe and cannot handle training on several documents in parallel.

Version:
$Revision: 1.49 $, $Date: 2004/12/07 12:01:48 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String CONFIG_TEST_ONLY
          Configuration key determining whether the trainer only ensures that all answer keys exist and can be located in the document instead of doing any training.
static String CONFIG_TOE
          Configuration key for determining the training mode (isTrainingOnlyErrors()).
static String PREFIX_GLOBAL_ACC
          Prefix used for serializing the global (overall) accuracy.
static String PREFIX_LOCAL_ACC
          Prefix used for serializing the local (document-specific) accuracy.
 
Fields inherited from class de.fu_berlin.ties.extract.ExtractorBase
CONFIG_AVOID, CONFIG_ELEMENTS, CONFIG_RELEVANT_PUNCTUATION, CONFIG_SENTENCE
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
Trainer()
          Creates a new instance without specifying an output extension (which isn't needed anyway, because this class doesn't produce output).
Trainer(String outExt)
          Creates a new instance.
Trainer(String outExt, File runDirectory, TiesConfiguration config)
          Creates a new instance.
Trainer(String outExt, TargetStructure targetStruct, TrainableClassifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, TokenizerFactory tFactory, TrainableFilter sentFilter)
          Creates a new instance, using the standard configuration to configure the training mode and the superclasses.
Trainer(String outExt, TargetStructure targetStruct, TrainableClassifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, TokenizerFactory tFactory, TrainableFilter sentFilter, Set<String> relevantPunct, boolean trainOnlyErrors, boolean testOnly, TiesConfiguration config)
          Creates a new instance.
Trainer(String outExt, TiesConfiguration config)
          Creates a new instance.
 
Method Summary
protected  FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
          Creates a filtering token walker to be used for walking through a document and sentence classification if a double classification approach is used.
 void disableSentenceTraining()
          Disables training the embedded sentence filter, if sentence filtering is used.
 void enableSentenceTraining()
          Re-enables training the embedded filter, if sentence filtering is used.
 FMetricsView evaluateSentenceFiltering()
          Evaluates precision and recall for sentence filtering on the last processed document.
 boolean isTestingOnly()
          If true, the trainer only ensures that all answer keys exist and can be located in the document instead of doing any training.
 boolean isTrainingOnlyErrors()
          Whether to train only errors (TOE mode, recommended) or to train all instances (brute-force mode).
 void process(Document document, Writer writer, ContextMap context)
          Trains the local classifier with the correct extractions of an XML document, using the provided context representation.
 void processToken(Element element, String left, TokenDetails details, String right, ContextMap context)
          Processes a token in an XML element, optionally modifying the element or the document it is part of.
 void reset()
          Resets the internal classifier, completely deleting the prediction model.
 void resetGlobalAccuracy()
          Resets the global (overall) accuracies measured so far by each classifier.
protected  void resetStrategy()
          Resets the combination strategy, logging a warning if the strategy reports that the last extraction must be discarded.
 boolean shouldMatch(Element element)
          Decides whether an element should be accepted by filters.
 String toString()
          Returns a string representation of this object.
 Accuracy[] train(Document document, ExtractionContainer correctExtractions)
          Trains the local classifier with the correct extractions of an XML document, using the provided context representation.
 AccuracyView[] viewGlobalAccuracy()
          Returns a view on the global (overall) accuracies measured so far (or after the last call to resetGlobalAccuracy()) by each classifier.
 
Methods inherited from class de.fu_berlin.ties.extract.ExtractorBase
createSentenceFilter, evaluateSentenceFiltering, getActiveClasses, getClassifiers, getFactory, getFeatureCount, getFeatures, getPriorRecognitions, getRepresentation, getSentenceFilter, getStrategy, getTargetStructure, getWalker, initFields, isRelevant, isSentenceFiltering, markRelevant, skip, updateState, viewFeatureCount, viewRelevantPunctuation
 
Methods inherited from class de.fu_berlin.ties.DocumentReader
doProcess
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CONFIG_TOE

public static final String CONFIG_TOE
Configuration key for determining the training mode (isTrainingOnlyErrors()).

See Also:
Constant Field Values

CONFIG_TEST_ONLY

public static final String CONFIG_TEST_ONLY
Configuration key determining whether the trainer only ensures that all answer keys exist and can be located in the document instead of doing any training.

See Also:
Constant Field Values

PREFIX_GLOBAL_ACC

public static final String PREFIX_GLOBAL_ACC
Prefix used for serializing the global (overall) accuracy.

See Also:
Constant Field Values

PREFIX_LOCAL_ACC

public static final String PREFIX_LOCAL_ACC
Prefix used for serializing the local (document-specific) accuracy.

See Also:
Constant Field Values
Constructor Detail

Trainer

public Trainer()
        throws IllegalArgumentException,
               ProcessingException
Creates a new instance without specifying an output extension (which isn't needed anyway, because this class doesn't produce output). Delegates to Trainer(String) using a dummy extension.

Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Trainer

public Trainer(String outExt)
        throws IllegalArgumentException,
               ProcessingException
Creates a new instance. Delegates to Trainer(String, TiesConfiguration) using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Trainer

public Trainer(String outExt,
               TiesConfiguration config)
        throws IllegalArgumentException,
               ProcessingException
Creates a new instance. Delegates to the Trainer(String, File, TiesConfiguration) constructor without specifying a runDirectory.

Parameters:
outExt - the extension to use for output files
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Trainer

public Trainer(String outExt,
               File runDirectory,
               TiesConfiguration config)
        throws IllegalArgumentException,
               ProcessingException
Creates a new instance. Sets the training mode (isTrainingOnlyErrors()) to the value of the CONFIG_TOE configuration key in the provided configuration and delegates to the corresponding super constructor to configure the fields.

Parameters:
outExt - the extension to use for output files
runDirectory - the directory to run the classifier in; used instead of the configured directory if not null
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization
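
The configuration-based constructors above are documented as delegating to one another, with the three-argument variant doing the actual setup. The following is a minimal, self-contained sketch of that delegation pattern only. It is not the TIES source: `TrainerChainSketch`, `DUMMY_EXT`, the key string `sketch.toe`, and the use of `java.util.Properties` in place of `TiesConfiguration` are all hypothetical stand-ins chosen for illustration.

```java
import java.io.File;
import java.util.Properties;

// Hypothetical stand-in that mimics only the documented delegation chain.
final class TrainerChainSketch {
    // Assumed placeholder values; the real dummy extension and key are not given in this document.
    static final String DUMMY_EXT = "dummy";
    static final String CONFIG_TOE = "sketch.toe";

    final String outExt;
    final File runDirectory;
    final boolean trainOnlyErrors;

    // No-argument variant: delegates with a dummy extension.
    TrainerChainSketch() {
        this(DUMMY_EXT);
    }

    // Delegates using a standard configuration.
    TrainerChainSketch(String outExt) {
        this(outExt, standardConfig());
    }

    // Delegates without specifying a run directory.
    TrainerChainSketch(String outExt, Properties config) {
        this(outExt, null, config);
    }

    // Does the real work: reads the training mode from the configuration.
    TrainerChainSketch(String outExt, File runDirectory, Properties config) {
        this.outExt = outExt;
        this.runDirectory = runDirectory;
        this.trainOnlyErrors = Boolean.parseBoolean(config.getProperty(CONFIG_TOE));
    }

    // Assumed default: TOE mode enabled, since the documentation recommends it.
    static Properties standardConfig() {
        Properties p = new Properties();
        p.setProperty(CONFIG_TOE, "true");
        return p;
    }

    public static void main(String[] args) {
        TrainerChainSketch t = new TrainerChainSketch();
        System.out.println(t.outExt + " " + t.runDirectory + " " + t.trainOnlyErrors);
    }
}
```

Funnelling every variant into one constructor keeps the training-mode lookup in a single place, so the shorter constructors cannot drift out of sync with it.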

Trainer

public Trainer(String outExt,
               TargetStructure targetStruct,
               TrainableClassifier[] theClassifiers,
               Representation theRepresentation,
               CombinationStrategy combiStrat,
               TokenizerFactory tFactory,
               TrainableFilter sentFilter)
Creates a new instance, using the standard configuration to configure the training mode and the superclasses.

Parameters:
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifiers - the array of classifiers to train
theRepresentation - the context representation to use for training
combiStrat - the combination strategy to use
tFactory - used to instantiate tokenizers
sentFilter - the filter used in the first step of a double classification approach ("sentence filtering"); if null, no sentence filtering is used

Trainer

public Trainer(String outExt,
               TargetStructure targetStruct,
               TrainableClassifier[] theClassifiers,
               Representation theRepresentation,
               CombinationStrategy combiStrat,
               TokenizerFactory tFactory,
               TrainableFilter sentFilter,
               Set<String> relevantPunct,
               boolean trainOnlyErrors,
               boolean testOnly,
               TiesConfiguration config)
Creates a new instance.

Parameters:
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifiers - the array of classifiers to train
theRepresentation - the context representation to use for training
combiStrat - the combination strategy to use
tFactory - used to instantiate tokenizers
sentFilter - the filter used in the first step of a double classification approach ("sentence filtering"); if null, no sentence filtering is used
relevantPunct - a set of punctuation tokens that have been found to be relevant for token classification; might be empty but not null
trainOnlyErrors - whether to train only errors (TOE mode, recommended) or to train all instances (brute-force mode)
testOnly - if true, the trainer only ensures that all answer keys exist and can be located in the document instead of doing any training
config - used to configure superclasses; if null, the standard configuration is used
Method Detail

createFilteringTokenWalker

protected FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
Creates a filtering token walker to be used for walking through a document and sentence classification if a double classification approach is used.

Specified by:
createFilteringTokenWalker in class ExtractorBase
Parameters:
repFilter - the trainable filter to use
Returns:
the created walker

disableSentenceTraining

public void disableSentenceTraining()
Disables training the embedded sentence filter, if sentence filtering is used.


enableSentenceTraining

public void enableSentenceTraining()
Re-enables training the embedded filter, if sentence filtering is used.


evaluateSentenceFiltering

public FMetricsView evaluateSentenceFiltering()
Evaluates precision and recall for sentence filtering on the last processed document.

Returns:
the calculated statistics for sentence filtering on the last document; null if sentence filtering is disabled
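
For readers unfamiliar with the metrics, the sketch below shows how precision, recall, and the balanced F-measure are conventionally computed from counts. It is an illustration under stated assumptions, not the `FMetricsView` API: the class `FMetricsSketch` and its method names are hypothetical, and the counts in `main` are made up.

```java
// Hypothetical helper; shows the standard formulas only.
final class FMetricsSketch {
    private final int truePositives;   // relevant sentences that were kept
    private final int falsePositives;  // irrelevant sentences that were kept
    private final int falseNegatives;  // relevant sentences that were dropped

    FMetricsSketch(int truePositives, int falsePositives, int falseNegatives) {
        this.truePositives = truePositives;
        this.falsePositives = falsePositives;
        this.falseNegatives = falseNegatives;
    }

    // Fraction of kept sentences that were actually relevant.
    double precision() {
        int kept = truePositives + falsePositives;
        return kept == 0 ? 0.0 : (double) truePositives / kept;
    }

    // Fraction of relevant sentences that were kept.
    double recall() {
        int relevant = truePositives + falseNegatives;
        return relevant == 0 ? 0.0 : (double) truePositives / relevant;
    }

    // Harmonic mean of precision and recall.
    double f1() {
        double p = precision();
        double r = recall();
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        FMetricsSketch m = new FMetricsSketch(6, 2, 4);
        System.out.println("precision=" + m.precision()
                + " recall=" + m.recall() + " f1=" + m.f1());
    }
}
```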

isTestingOnly

public boolean isTestingOnly()
If true, the trainer only ensures that all answer keys exist and can be located in the document instead of doing any training.

Returns:
the value of the attribute

isTrainingOnlyErrors

public boolean isTrainingOnlyErrors()
Whether to train only errors (TOE mode, recommended) or to train all instances (brute-force mode).

Returns:
the value of the attribute
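
The difference between the two modes can be illustrated with a toy online learner. This is a minimal sketch that assumes TOE means "classify each instance first and update the model only when the prediction is wrong". The perceptron-style `ToeSketch` class is hypothetical and is not the classifier used by TIES.

```java
// Hypothetical toy learner contrasting TOE and brute-force training.
final class ToeSketch {
    private final double[] weights;
    int updates = 0; // number of model updates performed so far

    ToeSketch(int dimensions) {
        weights = new double[dimensions];
    }

    // Predicts +1 or -1 from the sign of the weighted sum.
    int predict(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum >= 0 ? 1 : -1;
    }

    // Perceptron-style update towards the given label.
    private void learn(double[] x, int label) {
        for (int i = 0; i < weights.length; i++) {
            weights[i] += label * x[i];
        }
        updates++;
    }

    // One pass over the data. In TOE mode, returns the number of instances
    // predicted correctly before any update; in brute-force mode nothing is
    // predicted, so -1 is returned.
    int trainPass(double[][] xs, int[] labels, boolean trainOnlyErrors) {
        int correct = 0;
        for (int i = 0; i < xs.length; i++) {
            if (trainOnlyErrors && predict(xs[i]) == labels[i]) {
                correct++;   // already right: skip the update
                continue;
            }
            learn(xs[i], labels[i]);
        }
        return trainOnlyErrors ? correct : -1;
    }

    public static void main(String[] args) {
        double[][] xs = {{1, 0}, {0, 1}, {1, 0}, {0, 1}};
        int[] labels = {1, -1, 1, -1};
        ToeSketch toe = new ToeSketch(2);
        int correct = toe.trainPass(xs, labels, true);
        ToeSketch brute = new ToeSketch(2);
        brute.trainPass(xs, labels, false);
        System.out.println("TOE updates: " + toe.updates + ", correct: " + correct);
        System.out.println("Brute-force updates: " + brute.updates);
    }
}
```

Because the TOE pass has to classify every instance before deciding whether to train on it, accuracy figures fall out of training for free, which fits the documented behaviour that accuracies are only reported in TOE mode.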

process

public void process(Document document,
                    Writer writer,
                    ContextMap context)
             throws IOException,
                    ProcessingException
Trains the local classifier with the correct extractions of an XML document, using the provided context representation. In TOE mode, training statistics are serialized to the writer. The answer keys must be in a corresponding file ending in AnswerBuilder.EXT_ANSWERS in the same directory (when processing a local file) or in the current working directory (when processing a URL).

Specified by:
process in class DocumentReader
Parameters:
document - the document to read
writer - ignored by this method
context - a map of objects that are made available for processing
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing

processToken

public void processToken(Element element,
                         String left,
                         TokenDetails details,
                         String right,
                         ContextMap context)
                  throws ProcessingException
Processes a token in an XML element, optionally modifying the element or the document it is part of.

Specified by:
processToken in interface TokenProcessor
Parameters:
element - the element containing the token
left - the textual contents of the element to the left of the token (in case of mixed contents, only up to the last preceding child element, if any)
details - details about the token to process
right - the textual contents of the element to the right of the token (in case of mixed contents, only up to the next following child element, if any)
context - a map of objects that are made available for processing
Throws:
ProcessingException - if an error occurs during processing

reset

public void reset()
           throws ProcessingException
Resets the internal classifier, completely deleting the prediction model.

Throws:
ProcessingException - if an error occurs during reset

resetGlobalAccuracy

public void resetGlobalAccuracy()
Resets the global (overall) accuracies measured so far by each classifier. This can be used to restart accuracy measurements after each round (iteration) of TUNE training, for example. This method is only relevant in TOE (training-only/mainly-errors) mode; otherwise it does nothing.


resetStrategy

protected void resetStrategy()
Resets the combination strategy, logging a warning if the strategy reports that the last extraction must be discarded.

Specified by:
resetStrategy in class ExtractorBase

shouldMatch

public boolean shouldMatch(Element element)
Decides whether an element should be accepted by filters.

Specified by:
shouldMatch in interface Oracle
Parameters:
element - the element to test
Returns:
true if filters should accept the element; false otherwise

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class ExtractorBase
Returns:
a textual representation

train

public Accuracy[] train(Document document,
                        ExtractionContainer correctExtractions)
                 throws IOException,
                        ProcessingException
Trains the local classifier with the correct extractions of an XML document, using the provided context representation.

Parameters:
document - a document whose contents should be classified
correctExtractions - a container of all correct extractions for the document
Returns:
The token accuracies of each classifier of the trained document if in TOE mode; null otherwise
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing

viewGlobalAccuracy

public AccuracyView[] viewGlobalAccuracy()
Returns a view on the global (overall) accuracies measured so far (or after the last call to resetGlobalAccuracy()) by each classifier. This is not a snapshot but will change whenever the underlying values are changed.

Returns:
A view on the global accuracies measured so far if in TOE mode; null otherwise
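
The distinction between a live view and a snapshot can be sketched as follows. The types `AccuracyViewSketch` and `AccuracySketch` are hypothetical stand-ins, not the TIES `AccuracyView` and `Accuracy` classes, whose methods are not listed in this document. The example only shows the general pattern of a read-only wrapper over mutable counters.

```java
// Hypothetical read-only interface, standing in for a "view".
interface AccuracyViewSketch {
    long total();
    double accuracy();
}

// Hypothetical mutable accuracy counter that hands out live views.
final class AccuracySketch {
    private long correct;
    private long total;

    void record(boolean wasCorrect) {
        total++;
        if (wasCorrect) {
            correct++;
        }
    }

    // Restarts the measurement, e.g. after a training round.
    void reset() {
        correct = 0;
        total = 0;
    }

    // Returns a wrapper that reads the current counters on every call,
    // so it reflects later changes and resets.
    AccuracyViewSketch view() {
        return new AccuracyViewSketch() {
            public long total() {
                return AccuracySketch.this.total;
            }

            public double accuracy() {
                long t = AccuracySketch.this.total;
                return t == 0 ? 0.0 : (double) AccuracySketch.this.correct / t;
            }
        };
    }

    public static void main(String[] args) {
        AccuracySketch acc = new AccuracySketch();
        AccuracyViewSketch view = acc.view();
        acc.record(true);
        double snapshot = view.accuracy(); // copied value, frozen from here on
        acc.record(false);
        System.out.println("snapshot=" + snapshot + " view=" + view.accuracy());
        acc.reset();
        System.out.println("after reset: total=" + view.total());
    }
}
```

A caller that needs stable numbers, for example to compare one training round with the next, would copy the values out of the view before resetting.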


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.