de.fu_berlin.ties.extract
Class TrainEval

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.extract.TrainEval
All Implemented Interfaces:
Closeable, Processor

public class TrainEval
extends TextProcessor
implements Closeable

Trains an extractor and evaluates extraction quality. Processes shuffle files (as generated by ShuffleGenerator) that contain the files to use for training and evaluation. For each of these files, a corresponding answer key (*.ans) must exist.

Instances of this class are not thread-safe.

Version:
$Revision: 1.67 $, $Date: 2004/11/17 09:17:04 $, $Author: siefkes $
Author:
Christian Siefkes

Nested Class Summary
static class TrainEval.Results
          An inner class wrapping the results of a training + evaluation run.
 
Field Summary
static String CONFIG_FEEDBACK
          Configuration key: If true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it.
static String CONFIG_SENTENCE_TUNE
          Configuration key: The maximum number of iterations used for TUNE training the sentence classifier; if 0 or negative, the value of CONFIG_TUNE is used.
static String CONFIG_TEST_SPLIT
          Configuration key: The percentage of a corpus to use for testing (evaluation).
static String CONFIG_TRAIN_SPLIT
          Configuration key: The percentage of a corpus to use for training.
static String CONFIG_TUNE
          Configuration key: The maximum number of iterations used for TUNE (train until no error) training; if 1, training is incremental.
static String CONFIG_TUNE_EACH
          Configuration key: Whether to measure results after each TUNE iteration or only at the end of training.
static String CONFIG_TUNE_SINCE
          Configuration key: The training iteration after which to evaluate results for the first time if CONFIG_TUNE_EACH is enabled.
static String CONFIG_TUNE_STOP
          Configuration key: TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations.
static String KEY_ITERATION
          Serialization key for the number of the iteration (when TUNE training).
static String KEY_RUN
          Serialization key for the number of the run.
static String KEY_TYPE
          Serialization key for the type (either "Train" or "Eval").
static String TYPE_EVAL
          Serialization value for the "Eval" type.
static String TYPE_TRAIN
          Serialization value for the "Train" type.
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
TrainEval()
          Creates a new instance, using a default extension and the standard configuration.
TrainEval(String outExt)
          Creates a new instance, using the standard configuration.
TrainEval(String outExt, float trainingSplit, float testingSplit, int tuneRuns, int tuneStopAfter, boolean measureEachTUNE, int startMeasureTUNE, List tuneEvalList, int sentenceTUNE, boolean giveFeedback, TiesConfiguration config)
          Creates a new instance.
TrainEval(String outExt, TiesConfiguration config)
          Creates a new instance.
 
Method Summary
 void close(int errorCount)
          Closes this instance, releasing all resources and stopping any background threads.
protected  void doProcess(Reader reader, Writer writer, ContextMap context)
          Processes the contents of a reader, writing a modified version to a writer.
 float getTestSplit()
          Returns the percentage of a corpus to use for testing (evaluation).
 float getTrainSplit()
          Returns the percentage of a corpus to use for training; the remaining documents (1-x) are used for evaluation.
protected  Extractor initExtractor(Trainer trainer)
          Creates and initializes an extractor to use for an evaluation run, re-using the components of the provided trainer.
protected  Trainer initTrainer(File runDirectory)
          Creates and initializes a trainer to use for an evaluation run, configured from the stored configuration.
 String toString()
          Returns a string representation of this object.
 TrainEval.Results trainAndEval(String[] files, File inDirectory, File outDirectory, String baseName, Writer writer)
          Processes an array of files.
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CONFIG_TRAIN_SPLIT

public static final String CONFIG_TRAIN_SPLIT
Configuration key: The percentage of a corpus to use for training.

See Also:
Constant Field Values

CONFIG_TEST_SPLIT

public static final String CONFIG_TEST_SPLIT
Configuration key: The percentage of a corpus to use for testing (evaluation).

See Also:
Constant Field Values

CONFIG_FEEDBACK

public static final String CONFIG_FEEDBACK
Configuration key: If true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it.

See Also:
Constant Field Values
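
The incremental feedback setup described for CONFIG_FEEDBACK can be pictured roughly as follows. This is only a sketch: FeedbackModel and FeedbackSketch are hypothetical stand-ins invented for illustration and are not part of the TIES Extractor or Trainer API. The point is the ordering: each document is extracted from before the model is trained on it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an extractor/trainer pair; NOT the TIES API. */
interface FeedbackModel {
    /** Predicts on a document using only what has been learned so far. */
    String extract(String document);

    /** Updates the model with a document (in practice, with its answer key). */
    void train(String document);
}

final class FeedbackSketch {
    /**
     * Extract-then-train loop: returns the prediction made for each document
     * before that document was used for training.
     */
    static List<String> run(FeedbackModel model, List<String> documents) {
        List<String> predictions = new ArrayList<>();
        for (String doc : documents) {
            predictions.add(model.extract(doc)); // evaluate on the still-unseen document
            model.train(doc);                    // then feed it back for training
        }
        return predictions;
    }
}
```

Because every prediction is made before the corresponding training step, the evaluation in this sketch never sees a document the model has already learned from.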

CONFIG_TUNE

public static final String CONFIG_TUNE
Configuration key: The maximum number of iterations used for TUNE (train until no error) training; if 1, training is incremental.

See Also:
Constant Field Values

CONFIG_TUNE_STOP

public static final String CONFIG_TUNE_STOP
Configuration key: TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations.

See Also:
Constant Field Values
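
The interplay of CONFIG_TUNE and CONFIG_TUNE_STOP might look like the loop below. It is a minimal sketch under stated assumptions, not the class's actual implementation: TuneLoopSketch and the trainIteration callback are hypothetical, and "no error" is assumed to mean a training accuracy of 1.0.

```java
import java.util.function.IntToDoubleFunction;

final class TuneLoopSketch {
    /**
     * Runs up to maxIterations training passes and returns how many were executed.
     * trainIteration is a hypothetical callback: it performs pass i (1-based)
     * and returns the training accuracy in [0, 1] observed during that pass.
     */
    static int tune(int maxIterations, int stopAfter, IntToDoubleFunction trainIteration) {
        if (maxIterations <= 0) {
            throw new IllegalArgumentException("maxIterations must be positive");
        }
        double best = -1.0;       // best training accuracy seen so far
        int sinceImprovement = 0; // consecutive passes without a new best
        for (int i = 1; i <= maxIterations; i++) {
            double accuracy = trainIteration.applyAsDouble(i);
            if (accuracy >= 1.0) {
                return i; // assumption: accuracy 1.0 means no training errors remain
            }
            if (accuracy > best) {
                best = accuracy;
                sinceImprovement = 0;
            } else if (++sinceImprovement >= stopAfter) {
                return i; // accuracy stalled for stopAfter passes
            }
        }
        return maxIterations;
    }
}
```

With maxIterations set to 1 the loop performs a single pass, which corresponds to the incremental case mentioned for CONFIG_TUNE.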

CONFIG_SENTENCE_TUNE

public static final String CONFIG_SENTENCE_TUNE
Configuration key: The maximum number of iterations used for TUNE training the sentence classifier; if 0 or negative, the value of CONFIG_TUNE is used.

See Also:
Constant Field Values

CONFIG_TUNE_EACH

public static final String CONFIG_TUNE_EACH
Configuration key: Whether to measure results after each TUNE iteration or only at the end of training.

See Also:
Constant Field Values

CONFIG_TUNE_SINCE

public static final String CONFIG_TUNE_SINCE
Configuration key: The training iteration after which to evaluate results for the first time if CONFIG_TUNE_EACH is enabled.

See Also:
Constant Field Values
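
How CONFIG_TUNE_EACH and CONFIG_TUNE_SINCE could combine is sketched below. TuneEvalScheduleSketch and shouldMeasure are hypothetical names, and treating the configured iteration as inclusive is an assumption; the description above does not say whether the boundary itself is measured.

```java
final class TuneEvalScheduleSketch {
    /**
     * Decides whether results are measured after the given 1-based iteration.
     * The final iteration is always measured; earlier ones only when
     * measureEach is set and the iteration has reached measureSince
     * (inclusive boundary is an assumption of this sketch).
     */
    static boolean shouldMeasure(int iteration, int lastIteration,
                                 boolean measureEach, int measureSince) {
        if (iteration == lastIteration) {
            return true; // results at the end of training are always measured
        }
        return measureEach && iteration >= measureSince;
    }
}
```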

KEY_ITERATION

public static final String KEY_ITERATION
Serialization key for the number of the iteration (when TUNE training).

See Also:
Constant Field Values

KEY_RUN

public static final String KEY_RUN
Serialization key for the number of the run.

See Also:
Constant Field Values

KEY_TYPE

public static final String KEY_TYPE
Serialization key for the type (either "Train" or "Eval").

See Also:
Constant Field Values

TYPE_TRAIN

public static final String TYPE_TRAIN
Serialization value for the "Train" type.

See Also:
Constant Field Values

TYPE_EVAL

public static final String TYPE_EVAL
Serialization value for the "Eval" type.

See Also:
Constant Field Values
Constructor Detail

TrainEval

public TrainEval()
          throws IllegalArgumentException,
                 ClassCastException,
                 NoSuchElementException
Creates a new instance, using a default extension and the standard configuration.

Throws:
IllegalArgumentException - if the configured values are outside the allowed ranges
ClassCastException - if the configured numeric values cannot be parsed
NoSuchElementException - if one of the required values is missing from the configuration

TrainEval

public TrainEval(String outExt)
          throws IllegalArgumentException,
                 ClassCastException,
                 NoSuchElementException
Creates a new instance, using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the configured values are outside the allowed ranges
ClassCastException - if the configured numeric values cannot be parsed
NoSuchElementException - if one of the required values is missing from the configuration

TrainEval

public TrainEval(String outExt,
                 TiesConfiguration config)
          throws IllegalArgumentException,
                 ClassCastException,
                 NoSuchElementException
Creates a new instance.

Parameters:
outExt - the extension to use for output files
config - used to configure this instance
Throws:
IllegalArgumentException - if the configured values are outside the allowed ranges
ClassCastException - if the configured numeric values cannot be parsed
NoSuchElementException - if one of the required values is missing from the configuration

TrainEval

public TrainEval(String outExt,
                 float trainingSplit,
                 float testingSplit,
                 int tuneRuns,
                 int tuneStopAfter,
                 boolean measureEachTUNE,
                 int startMeasureTUNE,
                 List tuneEvalList,
                 int sentenceTUNE,
                 boolean giveFeedback,
                 TiesConfiguration config)
          throws IllegalArgumentException
Creates a new instance.

Parameters:
outExt - the extension to use for output files
trainingSplit - the percentage of a corpus to use for training
testingSplit - the percentage of a corpus to use for testing (evaluation); if -1, all remaining documents (1 - trainingSplit) are used
tuneRuns - the maximum number of iterations used for TUNE (train until no error) training; if 1, training is incremental
tuneStopAfter - TUNE training is stopped if the training accuracy didn't improve for the specified number of iterations
measureEachTUNE - whether to measure results after each TUNE iteration or only at the end of training
startMeasureTUNE - the training iteration after which to evaluate results for the first time if measureEachTUNE is enabled (ignored otherwise)
sentenceTUNE - the maximum number of iterations used for TUNE training the sentence classifier (if used); if 0 or negative, the value of tuneRuns is used
tuneEvalList - a list of Integers or int Strings specifying iterations after which to evaluate TUNE training in addition to the last one; ignored if measureEachTUNE is true
giveFeedback - if true, a fully incremental setup is used where the trainer is trained on each document after the extractor processed it; setting both this and measureEachTUNE to true is not allowed when training for several tuneRuns, because that would mean evaluating on the training set
config - used to configure superclasses, trainer, and extractor; if null, the standard configuration is used
Throws:
IllegalArgumentException - if trainingSplit is not a percentage (larger than 1 or smaller than 0) or if tuneRuns is non-positive
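
The argument constraints listed above can be summarized in a small check. This is a sketch only: TrainEvalArgsSketch is a hypothetical helper, and throwing IllegalArgumentException for the giveFeedback/measureEachTUNE conflict is an assumption, since the documentation says the combination is not allowed but does not name the exception.

```java
final class TrainEvalArgsSketch {
    /** Rejects argument combinations that the constructor documentation rules out. */
    static void validate(float trainingSplit, int tuneRuns,
                         boolean measureEachTUNE, boolean giveFeedback) {
        // Written as a negated range test so that NaN is rejected as well.
        if (!(trainingSplit >= 0f && trainingSplit <= 1f)) {
            throw new IllegalArgumentException(
                "trainingSplit must be between 0 and 1: " + trainingSplit);
        }
        if (tuneRuns <= 0) {
            throw new IllegalArgumentException("tuneRuns must be positive: " + tuneRuns);
        }
        // Assumption: the disallowed combination is reported the same way.
        if (giveFeedback && measureEachTUNE && tuneRuns > 1) {
            throw new IllegalArgumentException(
                "giveFeedback and measureEachTUNE cannot both be true for several tuneRuns");
        }
    }
}
```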
Method Detail

close

public void close(int errorCount)
           throws IOException,
                  ProcessingException
Closes this instance, releasing all resources and stopping any background threads.

Specified by:
close in interface Closeable
Parameters:
errorCount - the number of errors (exceptions) that occurred during calls to this instance (0 if none)
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing any remaining input

doProcess

protected void doProcess(Reader reader,
                         Writer writer,
                         ContextMap context)
                  throws IOException,
                         ProcessingException
Processes the contents of a reader, writing a modified version to a writer.

Specified by:
doProcess in class TextProcessor
Parameters:
reader - reader containing the text to process; should not be closed by this method
writer - the writer to write the processed text to; might be flushed but not closed by this method; if this method does not use the writer, the underlying file will be deleted afterwards
context - a map of objects that are made available for processing; when called from the implemented process methods in this class, it will contain mappings from IOUtils.KEY_LOCAL_CHARSET to the character set of the output writer; from ContentType.KEY_MIME_TYPE to the document's MIME type; from TextProcessor.KEY_LOCAL_NAME to the local name (String); and either from TextProcessor.KEY_DIRECTORY to the directory (File), in the case of a local file, or from TextProcessor.KEY_URL to the URL (otherwise) of the processed document
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing

getTestSplit

public float getTestSplit()
Returns the percentage of a corpus to use for testing (evaluation).

Returns:
the percentage to use for evaluation; if negative, all remaining documents (1 - getTrainSplit()) are used for evaluation

getTrainSplit

public float getTrainSplit()
Returns the percentage of a corpus to use for training; the remaining documents (1-x) are used for evaluation.

Returns:
the percentage to use for training
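
As an illustration of how the two split values could translate into document counts, consider the sketch below. SplitSketch is a hypothetical helper; the rounding rule and the capping of the test portion at the documents left over after training are assumptions, not documented behavior. Both splits are treated as fractions between 0 and 1, with a negative test split meaning "all remaining documents".

```java
final class SplitSketch {
    /** Returns {trainCount, testCount} for a corpus of the given size. */
    static int[] counts(int total, float trainSplit, float testSplit) {
        int train = Math.round(total * trainSplit); // assumption: round to nearest
        int remaining = total - train;
        int test = testSplit < 0f
            ? remaining                                              // use everything left
            : Math.min(remaining, Math.round(total * testSplit));    // never overlap training
        return new int[] {train, test};
    }
}
```

For example, a corpus of 10 documents with a training split of 0.5 and a test split of -1 yields 5 training and 5 evaluation documents in this sketch.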

initExtractor

protected Extractor initExtractor(Trainer trainer)
Creates and initializes an extractor to use for an evaluation run, re-using the components of the provided trainer. Subclasses can override this method to provide a different extractor.

Parameters:
trainer - trainer whose components should be re-used
Returns:
the created extractor

initTrainer

protected Trainer initTrainer(File runDirectory)
                       throws ProcessingException
Creates and initializes a trainer to use for an evaluation run, configured from the stored configuration. Subclasses can override this method to provide a different trainer.

Parameters:
runDirectory - directory used to run the classifier
Returns:
the created trainer
Throws:
ProcessingException - if an error occurs during initialization

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class TextProcessor
Returns:
a textual representation

trainAndEval

public TrainEval.Results trainAndEval(String[] files,
                                      File inDirectory,
                                      File outDirectory,
                                      String baseName,
                                      Writer writer)
                               throws IOException,
                                      ProcessingException
Processes an array of files. For each file, a corresponding answer key (*.ans) must exist.

Parameters:
files - the array of file names to process (relative to the inDirectory)
inDirectory - directory containing the files to process
outDirectory - directory used to do this run and store the results
baseName - the base name of the files to use for storing all extractions and training statistics
writer - used to serialize the calculated metrics
Returns:
a wrapper of the results of this run
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing
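
Since every file needs a matching answer key, a caller might want to check for missing keys before starting a run. The sketch below is hypothetical throughout: AnswerKeySketch is not part of TIES, and the naming rule (replace the last file extension with ".ans", or append it when there is none) is an assumption, because this page only states that a corresponding *.ans file must exist.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class AnswerKeySketch {
    /** Assumed naming rule: replace the last extension with ".ans" (append if there is none). */
    static String answerKeyName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        // Only treat the dot as an extension separator if it follows a non-empty base name.
        String base = dot > slash + 1 ? fileName.substring(0, dot) : fileName;
        return base + ".ans";
    }

    /** Returns the names (relative to inDirectory) whose assumed answer key is missing. */
    static List<String> missingAnswerKeys(String[] files, File inDirectory) {
        List<String> missing = new ArrayList<>();
        for (String name : files) {
            if (!new File(inDirectory, answerKeyName(name)).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```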


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.