de.fu_berlin.ties.classify
Class ClassTrain

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.classify.ClassTrain
All Implemented Interfaces:
Closeable, Processor

public class ClassTrain
extends TextProcessor
implements Closeable

Classifies a list of files, training the classifier on each error if the true class is provided. See classifyAndTrain(FieldContainer, File, String, String) for a description of input and output formats.

This class does not calculate statistics itself. To get the number of errors during the last 500 classifications, you can run e.g. tail -q --lines 500 FILENAME|grep -v "|+"|wc on the output serialized in DelimSepValues format; the first number printed by wc (the line count) is the error count. This assumes that class names do not start with a "+" and that the true class is known for all files.
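The shell pipeline above can also be approximated in Java. The following is a minimal sketch, not part of this class: the class name RecentErrorCount, the method countRecentErrors and the sample line layout (file, true class, classification, separated by "|") are assumptions made for illustration. Only the "+" marker for correct predictions comes from CORRECT_CLASS.

```java
import java.util.List;

public class RecentErrorCount {
    /** Substring that marks a correct prediction, as matched by grep -v "|+". */
    static final String CORRECT_MARKER = "|+";

    /**
     * Counts the lines among the last {@code window} lines that do not
     * contain the marker for a correct prediction.
     */
    static long countRecentErrors(List<String> lines, int window) {
        // Equivalent of "tail --lines window": keep only the trailing lines
        int from = Math.max(0, lines.size() - window);
        return lines.subList(from, lines.size()).stream()
                // Equivalent of grep -v "|+": drop correct predictions
                .filter(line -> !line.contains(CORRECT_MARKER))
                // Equivalent of the line count reported by wc
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; the real DelimSepValues layout may differ
        List<String> sample = List.of(
                "a.txt|spam|+",
                "b.txt|ham|spam",
                "c.txt|ham|+",
                "d.txt|spam|ham");
        System.out.println(countRecentErrors(sample, 3)); // prints 2
    }
}
```

Like the shell version, this counts every line without the marker as an error, so header lines or entries without a known true class would distort the result.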

Instances of this class are not thread-safe and must be synchronized externally, if required.
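External synchronization could look roughly like the sketch below. It does not use ClassTrain itself: UnsafeProcessor is a hypothetical stand-in for any non-thread-safe instance, and the lock object and method names are assumptions made for illustration.

```java
public class ExternalSyncSketch {
    /** Hypothetical stand-in for a processor that is not thread-safe. */
    static class UnsafeProcessor {
        private int calls;

        void process() {
            calls++; // unsynchronized read-modify-write
        }

        int calls() {
            return calls;
        }
    }

    /** Shares one instance between several threads, guarding every call with one lock. */
    static int runConcurrently(int threads, int callsPerThread) {
        UnsafeProcessor shared = new UnsafeProcessor();
        Object lock = new Object();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < callsPerThread; j++) {
                    // All access to the shared instance goes through the same lock
                    synchronized (lock) {
                        shared.process();
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        synchronized (lock) {
            return shared.calls();
        }
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(4, 1000)); // prints 4000
    }
}
```

An alternative that avoids locking is to give each thread its own instance, provided the instances do not share a classifier file.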

Version:
$Revision: 1.35 $, $Date: 2006/10/21 16:03:54 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String CONFIG_FILE_EXT
          Configuration key: The extension to append to file names given via the File key (if any).
static String CONFIG_SUFFIX_TEXT
          Configuration suffix used for settings specific to text classification.
static String CORRECT_CLASS
          Value of the KEY_CLASSIFICATION field for correct predictions: "+".
static String KEY_CLASS
          Serialization key for the correct class.
static String KEY_CLASSIFICATION
          Serialization key for the result of the classification: either CORRECT_CLASS if the correct class was predicted or the wrongly predicted class in case of an error.
static String KEY_FILE
          Serialization key for the name of the file to classify.
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
ClassTrain()
          Creates a new instance using a default extension and the standard configuration.
ClassTrain(String outExt)
          Creates a new instance using the standard configuration.
ClassTrain(String outExt, TiesConfiguration conf)
          Creates a new instance from the provided configuration.
ClassTrain(String outExt, TiesConfiguration conf, FeatureExtractor featureExt, Tuner myTuner, String fileExt, String classifierFile, boolean doReUse, boolean doStore, boolean doTestOnly)
          Creates a new instance.
 
Method Summary
 FieldContainer classifyAndTrain(FieldContainer filesToClassify, File directory, String baseName, String charset)
          Classifies a list of files, training the classifier on each error if the true class is known.
 void close(int errorCount)
          Closes this instance, releasing all resources and stopping any background threads.
protected  void doProcess(Reader reader, Writer writer, ContextMap context)
          Delegates to classifyAndTrain(FieldContainer, File, String, String).
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process, toString
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CONFIG_FILE_EXT

public static final String CONFIG_FILE_EXT
Configuration key: The extension to append to file names given via the File key (if any).

See Also:
Constant Field Values

CONFIG_SUFFIX_TEXT

public static final String CONFIG_SUFFIX_TEXT
Configuration suffix used for settings specific to text classification.

See Also:
Constant Field Values

KEY_FILE

public static final String KEY_FILE
Serialization key for the name of the file to classify.

See Also:
Constant Field Values

KEY_CLASS

public static final String KEY_CLASS
Serialization key for the correct class.

See Also:
Constant Field Values

KEY_CLASSIFICATION

public static final String KEY_CLASSIFICATION
Serialization key for the result of the classification: either CORRECT_CLASS if the correct class was predicted or the wrongly predicted class in case of an error.

See Also:
Constant Field Values

CORRECT_CLASS

public static final String CORRECT_CLASS
Value of the KEY_CLASSIFICATION field for correct predictions: "+".

See Also:
Constant Field Values
Constructor Detail

ClassTrain

public ClassTrain()
           throws ProcessingException
Creates a new instance using a default extension and the standard configuration.

Throws:
ProcessingException - if an error occurs while initializing this instance

ClassTrain

public ClassTrain(String outExt)
           throws ProcessingException
Creates a new instance using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
ProcessingException - if an error occurs while initializing this instance

ClassTrain

public ClassTrain(String outExt,
                  TiesConfiguration conf)
           throws ProcessingException
Creates a new instance from the provided configuration.

Parameters:
outExt - the extension to use for output files
conf - used to configure this instance; if null, the standard configuration is used
Throws:
ProcessingException - if an error occurs while initializing this instance

ClassTrain

public ClassTrain(String outExt,
                  TiesConfiguration conf,
                  FeatureExtractor featureExt,
                  Tuner myTuner,
                  String fileExt,
                  String classifierFile,
                  boolean doReUse,
                  boolean doStore,
                  boolean doTestOnly)
Creates a new instance.

Parameters:
outExt - the extension to use for output files
conf - used to configure this instance; if null, the standard configuration is used
featureExt - used to convert texts into feature vectors
myTuner - used to control TUNE training (iterative training)
fileExt - the extension to append to file names given via the File key; null or the empty string if none should be appended
classifierFile - name of the file used for storing the classifier
doReUse - whether to re-use classifiers between several runs (including a classifier stored in the classifierFile, if it exists)
doStore - whether to store the final classifier in the classifierFile
doTestOnly - if true, the classifier is used only for prediction and no training takes place
Method Detail

classifyAndTrain

public FieldContainer classifyAndTrain(FieldContainer filesToClassify,
                                       File directory,
                                       String baseName,
                                       String charset)
                                throws IOException,
                                       ProcessingException
Classifies a list of files, training the classifier on each error if the true class is known.

Parameters:
filesToClassify - a field container of the files to process; each entry must contain a KEY_FILE field giving the name of the file to classify; if it also contains a KEY_CLASS field giving the true class of the file, the classifier is trained in case of an error
directory - file names are relative to this directory; if null they are relative to the working directory
baseName - the base name of the file listing the files to classify
charset - the character set of the files to process
Returns:
a field container of the classification results; in addition to the input fields, each entry contains a KEY_CLASSIFICATION field holding the classification result: CORRECT_CLASS if the prediction is known to be correct (which requires the true class to be given in the KEY_CLASS field), or the name of the predicted class otherwise
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing
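
The convention for the KEY_CLASSIFICATION field described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the class and method names are hypothetical, and a null trueClass stands for an entry without a KEY_CLASS field. The "+" value is taken from the CORRECT_CLASS documentation.

```java
public class ClassificationResultSketch {
    /** Value documented for CORRECT_CLASS. */
    static final String CORRECT_CLASS = "+";

    /**
     * Returns the value to store in the classification field: "+" if the
     * prediction is known to be correct, the predicted class otherwise.
     */
    static String classificationField(String trueClass, String predictedClass) {
        if (trueClass != null && trueClass.equals(predictedClass)) {
            return CORRECT_CLASS;
        }
        // Either a wrong prediction or the true class is unknown
        return predictedClass;
    }

    /** Training happens only on errors, which requires a known true class. */
    static boolean isKnownError(String trueClass, String predictedClass) {
        return trueClass != null && !trueClass.equals(predictedClass);
    }

    public static void main(String[] args) {
        System.out.println(classificationField("spam", "spam")); // prints +
        System.out.println(classificationField("spam", "ham"));  // prints ham
        System.out.println(classificationField(null, "ham"));    // prints ham
    }
}
```

Note that a predicted class name appears in the result both for a known error and for an entry whose true class is unknown, so the two cases can only be told apart by checking for the KEY_CLASS field.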

close

public void close(int errorCount)
           throws IOException
Closes this instance, releasing all resources and stopping any background threads.

Specified by:
close in interface Closeable
Parameters:
errorCount - the number of errors (exceptions) that occurred during calls to this instance (0 if none)
Throws:
IOException - if an I/O error occurs

doProcess

protected void doProcess(Reader reader,
                         Writer writer,
                         ContextMap context)
                  throws IOException,
                         ProcessingException
Delegates to classifyAndTrain(FieldContainer, File, String, String).

Specified by:
doProcess in class TextProcessor
Parameters:
reader - the FieldContainer of files to classify is read from this reader; not closed by this method
writer - the resulting FieldContainer containing classification results is serialized to this writer; not closed by this method
context - a map of objects that are made available for processing; the IOUtils.KEY_LOCAL_CHARSET is used to determine the character set of the listed files; the TextProcessor.KEY_DIRECTORY File determines the source of relative file names, if given (otherwise the current working directory is used)
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing


Copyright © 2003-2007 Christian Siefkes. All Rights Reserved.