de.fu_berlin.ties.extract
Class Extractor

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.DocumentReader
              extended by de.fu_berlin.ties.extract.ExtractorBase
                  extended by de.fu_berlin.ties.extract.Extractor
All Implemented Interfaces:
Processor, TokenProcessor

public class Extractor
extends ExtractorBase
implements TokenProcessor

An extractor runs a local Classifier on each item/node in a list and combines the individual classification results using a CombinationStrategy.
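
To illustrate what combining local classification results can look like, here is a minimal sketch. It is not TIES code. It assumes a hypothetical labeling scheme in which each token receives either a class name or the label "O" (outside), and adjacent tokens with the same class are merged into one extraction. The class SpanCombiner, the label "O", and the merge rule are assumptions made for illustration; the actual CombinationStrategy implementations may work differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of TIES: merges per-token labels into spans.
class SpanCombiner {
    // Label assumed to mark tokens that belong to no extraction.
    static final String OUTSIDE = "O";

    // Returns strings of the form "type:token token ..." in document order.
    static List<String> combine(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels differ in length");
        }
        List<String> extractions = new ArrayList<>();
        StringBuilder current = null;   // text of the span being built, if any
        String currentType = OUTSIDE;   // label of the previous token
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            if (!label.equals(currentType)) {
                // The label changed: close the open span and possibly start a new one.
                if (current != null) {
                    extractions.add(currentType + ":" + current);
                }
                current = OUTSIDE.equals(label) ? null : new StringBuilder(tokens.get(i));
                currentType = label;
            } else if (current != null) {
                // Same class as the previous token: extend the open span.
                current.append(' ').append(tokens.get(i));
            }
        }
        if (current != null) {
            extractions.add(currentType + ":" + current);
        }
        return extractions;
    }
}
```

With the tokens "Talk by Ada Lovelace at noon" and the labels O, O, speaker, speaker, O, time, this sketch yields the two extractions "speaker:Ada Lovelace" and "time:noon".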

Instances of this class are not thread-safe: a single instance cannot extract from several documents in parallel.
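
One common way to work with a class that is not thread-safe is to give each worker thread its own instance. The sketch below shows that pattern with a hypothetical stand-in class, StatefulProcessor, so that it runs without the TIES library; PerThreadProcessing and processAll are likewise illustrative names. Whether creating one Extractor per thread is practical (for example, whether the underlying classifier can be shared or duplicated) is not stated here and would need to be checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a stateful, non-thread-safe processor.
class StatefulProcessor {
    private final StringBuilder buffer = new StringBuilder(); // per-document state

    String process(String document) {
        buffer.setLength(0); // discard state left over from the previous document
        buffer.append(document.trim().toUpperCase(Locale.ROOT));
        return buffer.toString();
    }
}

class PerThreadProcessing {
    // Each worker thread lazily creates its own instance and then reuses it.
    private static final ThreadLocal<StatefulProcessor> PROCESSOR =
            ThreadLocal.withInitial(StatefulProcessor::new);

    static List<String> processAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<String> task = () -> PROCESSOR.get().process(doc);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // collected in input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

No instance is ever used by two threads at once, so the per-document state in each instance cannot be corrupted by concurrent calls.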

Version:
$Revision: 1.22 $, $Date: 2004/04/13 07:08:30 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String EXT_EXTRACTIONS
          The recommended file extension to use for storing extractions.
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
Extractor(String outExt)
          Creates a new instance using the standard configuration.
Extractor(String outExt, File runDirectory, TiesConfiguration config)
          Creates a new instance using the given configuration and, if not null, the given run directory.
Extractor(String outExt, TargetStructure targetStruct, Classifier theClassifier, Representation theRepresentation, CombinationStrategy combiStrat, TokenizerFactory tFactory, TiesConfiguration config)
          Creates a new instance from the given components.
Extractor(String outExt, TiesConfiguration config)
          Creates a new instance using the given configuration.
Extractor(String outExt, Trainer trainer)
          Creates a new instance, re-using the components from the provided trainer.
 
Method Summary
 ExtractionContainer extract(Document document)
          Extracts items of interest from the contents of an XML document, based on context representation and local classifier.
protected  ExtractionContainer getPredictedExtractions()
          Returns the extraction container used for storing the predicted extractions.
 void process(Document document, Writer writer, ContextMap context)
          Extracts items of interest from the contents of an XML document and serializes the extractions.
 void processToken(Element element, String left, String token, String right, int tokenRep, boolean whitespaceBefore, ContextMap context)
          Classifies a token in an XML document, building features and delegating to the classifier.
 String toString()
          Returns a string representation of this object.
 
Methods inherited from class de.fu_berlin.ties.extract.ExtractorBase
getActiveClasses, getClassifier, getFactory, getFeatureCount, getFeatures, getPriorRecognitions, getRepresentation, getStrategy, getTargetStructure, initFields, updateState, viewFeatureCount
 
Methods inherited from class de.fu_berlin.ties.DocumentReader
doProcess
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

EXT_EXTRACTIONS

public static final String EXT_EXTRACTIONS
The recommended file extension to use for storing extractions.

See Also:
Constant Field Values
Constructor Detail

Extractor

public Extractor(String outExt)
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance. Delegates to Extractor(String, TiesConfiguration) using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt,
                 TiesConfiguration config)
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance. Delegates to the corresponding super constructor to configure the fields.

Parameters:
outExt - the extension to use for output files
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt,
                 File runDirectory,
                 TiesConfiguration config)
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance. Delegates to the corresponding super constructor to configure the fields.

Parameters:
outExt - the extension to use for output files
runDirectory - the directory to run the classifier in; used instead of the configured directory if not null
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt,
                 Trainer trainer)
Creates a new instance, re-using the components from the provided trainer.

Parameters:
outExt - the extension to use for output files
trainer - trainer whose components should be re-used

Extractor

public Extractor(String outExt,
                 TargetStructure targetStruct,
                 Classifier theClassifier,
                 Representation theRepresentation,
                 CombinationStrategy combiStrat,
                 TokenizerFactory tFactory,
                 TiesConfiguration config)
Creates a new instance.

Parameters:
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifier - the classifier to use for the local classification decisions
theRepresentation - the context representation to use for local classifications
combiStrat - the combination strategy to use
tFactory - used to instantiate tokenizers
config - used to configure superclasses; if null, the standard configuration is used
Method Detail

getPredictedExtractions

protected ExtractionContainer getPredictedExtractions()
Returns the extraction container used for storing the predicted extractions.

Returns:
the extraction container

extract

public ExtractionContainer extract(Document document)
                            throws IOException,
                                   ProcessingException
Extracts items of interest from the contents of an XML document, based on context representation and local classifier.

Parameters:
document - a document whose contents should be classified
Returns:
a container of all extractions from the document, in document order
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing

process

public void process(Document document,
                    Writer writer,
                    ContextMap context)
             throws IOException,
                    ProcessingException
Extracts items of interest from the contents of an XML document and serializes the extractions.

Specified by:
process in class DocumentReader
Parameters:
document - the document to read
writer - the writer to write the extracted items to; flushed but not closed by this method
context - a map of objects that are made available for processing
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing
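
The "flushed but not closed" contract for the writer means the caller keeps ownership of it and may pass the same writer to several calls before closing it. The following minimal sketch shows that division of responsibility. FlushOnlySerializer and its methods are hypothetical stand-ins, not TIES code, and the one-item-per-line output is an assumption that does not reflect the actual serialization format.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical stand-in, not TIES code: mirrors a "flushed but not closed" contract.
class FlushOnlySerializer {
    // Writes one item per line and flushes; closing is left to the caller.
    static void serialize(List<String> items, Writer writer) throws IOException {
        for (String item : items) {
            writer.write(item);
            writer.write('\n');
        }
        writer.flush();
    }

    // Serializes two batches to the same writer, which the caller closes once.
    static String serializeTwice(List<String> first, List<String> second) {
        try (StringWriter writer = new StringWriter()) {
            serialize(first, writer);
            serialize(second, writer); // still open: the first call did not close it
            return writer.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the serializing method never closes the writer, the try-with-resources block in the caller is the only place where it is closed.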

processToken

public void processToken(Element element,
                         String left,
                         String token,
                         String right,
                         int tokenRep,
                         boolean whitespaceBefore,
                         ContextMap context)
                  throws ProcessingException
Classifies a token in an XML document, building features and delegating to the classifier.

Specified by:
processToken in interface TokenProcessor
Parameters:
element - the element containing the token
left - the textual contents of the element to the left of the token (in case of mixed contents, only up to the last preceding child element, if any)
token - the token to process
right - the textual contents of the element to the right of the token (in case of mixed contents, only up to the next following child element, if any)
tokenRep - the repetition count of the token in the document (counting starts at 0, so the first occurrence is the "0th repetition")
whitespaceBefore - whether there is whitespace before the main token (either at the end of left or in the preceding element)
context - a map of objects that are made available for processing; ignored by this method
Throws:
ProcessingException - if an error occurs during classification
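
The zero-based counting used for tokenRep can be sketched as follows. TokenRepetitions is a hypothetical helper, not TIES code, and it assumes that two tokens count as the same when their strings are exactly equal; how TIES itself compares tokens is not specified here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not TIES code: computes a zero-based repetition count per token.
class TokenRepetitions {
    static List<Integer> repetitions(List<String> tokens) {
        Map<String, Integer> seen = new HashMap<>(); // occurrences seen so far
        List<Integer> result = new ArrayList<>();
        for (String token : tokens) {
            int rep = seen.getOrDefault(token, 0); // 0 for the first occurrence
            result.add(rep);
            seen.put(token, rep + 1);
        }
        return result;
    }
}
```

For the token sequence "the", "cat", "the", "the" this gives the repetition counts 0, 0, 1, 2.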

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class ExtractorBase
Returns:
a textual representation


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.