de.fu_berlin.ties.extract
Class Extractor

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.DocumentReader
              extended by de.fu_berlin.ties.extract.ExtractorBase
                  extended by de.fu_berlin.ties.extract.Extractor
All Implemented Interfaces:
SkipHandler, Processor, TokenProcessor

public class Extractor
extends ExtractorBase

An extractor runs a local Classifier on a list of items/nodes and combines their results using a CombinationStrategy.

Instances of this class are not thread-safe and cannot extract from several documents in parallel.
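
Because instances are not thread-safe, one way to process several documents concurrently is to give every worker thread its own instance. The following is a minimal sketch of that pattern only: StubExtractor is a hypothetical stand-in (it is not part of this package), and a real application would construct an Extractor per thread instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a stateful, non-thread-safe extractor. */
final class StubExtractor {
    // Mutable per-document state: the reason an instance must not be shared.
    private final List<String> predicted = new ArrayList<>();

    List<String> extract(String document) {
        predicted.clear();
        for (String token : document.split("\\s+")) {
            // Toy rule (assumption): capitalized tokens count as extractions.
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                predicted.add(token);
            }
        }
        return new ArrayList<>(predicted);
    }
}

final class PerThreadExtraction {
    // Each worker thread lazily creates and keeps its own instance.
    private static final ThreadLocal<StubExtractor> LOCAL =
        ThreadLocal.withInitial(StubExtractor::new);

    /** Extracts from all documents in parallel; results keep input order. */
    static List<List<String>> extractAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<List<String>> task = () -> LOCAL.get().extract(doc);
                futures.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

No instance is ever touched by two threads, so no locking is needed; the cost is one set of classifier resources per worker thread.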

Version:
$Revision: 1.41 $, $Date: 2004/12/06 09:21:06 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String EXT_EXTRACTIONS
          The recommended file extension to use for storing extractions.
 
Fields inherited from class de.fu_berlin.ties.extract.ExtractorBase
CONFIG_AVOID, CONFIG_ELEMENTS, CONFIG_RELEVANT_PUNCTUATION, CONFIG_SENTENCE
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
Extractor()
          Creates a new instance using a default extension.
Extractor(String outExt)
          Creates a new instance.
Extractor(String outExt, File runDirectory, TiesConfiguration config)
          Creates a new instance.
Extractor(String outExt, TargetStructure targetStruct, Classifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, TokenizerFactory tFactory, TrainableFilter sentFilter, Reranker rerank, Set<String> relevantPunct, TiesConfiguration config)
          Creates a new instance.
Extractor(String outExt, TiesConfiguration config)
          Creates a new instance.
Extractor(String outExt, Trainer trainer)
          Creates a new instance, re-using the components from the provided trainer.
 
Method Summary
protected  void addPunctuationDetails(TokenDetails details)
          Adds an element to the collected punctuation details.
protected  void appendPunctuation(Extraction ext)
          Appends the collected punctuation details (if any) to the provided extraction.
protected  void clearPunctuation()
          Clears the collected punctuation details.
protected  FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
          Creates a filtering token walker used to walk through a document and classify sentences when a double classification approach is used.
 FMetricsView evaluateSentenceFiltering(ExtractionContainer correctExtractions)
          Evaluates precision and recall for sentence filtering on the last processed document.
 ExtractionContainer extract(Document document)
          Extracts items of interest from the contents of an XML document, based on context representation and local classifier.
protected  ExtractionContainer getPredictedExtractions()
          Returns the extraction container used for storing the predicted extractions.
 void process(Document document, Writer writer, ContextMap context)
          Extracts items of interest from the contents of an XML document and serializes the extractions.
 void processToken(Element element, String left, TokenDetails details, String right, ContextMap context)
          Processes a token in an XML element, optionally modifying the element or the document it is part of.
protected  void resetStrategy()
          Resets the strategy and discards the last predicted extraction, if requested.
 String toString()
          Returns a string representation of this object.
 
Methods inherited from class de.fu_berlin.ties.extract.ExtractorBase
createSentenceFilter, evaluateSentenceFiltering, getActiveClasses, getClassifiers, getFactory, getFeatureCount, getFeatures, getPriorRecognitions, getRepresentation, getSentenceFilter, getStrategy, getTargetStructure, getWalker, initFields, isRelevant, isSentenceFiltering, markRelevant, skip, updateState, viewFeatureCount, viewRelevantPunctuation
 
Methods inherited from class de.fu_berlin.ties.DocumentReader
doProcess
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

EXT_EXTRACTIONS

public static final String EXT_EXTRACTIONS
The recommended file extension to use for storing extractions.

See Also:
Constant Field Values
Constructor Detail

Extractor

public Extractor()
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance using a default extension. Delegates to Extractor(String, TiesConfiguration) using the standard configuration.

Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt)
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance. Delegates to Extractor(String, TiesConfiguration) using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt,
                 TiesConfiguration config)
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance. Delegates to the corresponding super constructor to configure the fields.

Parameters:
outExt - the extension to use for output files
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt,
                 File runDirectory,
                 TiesConfiguration config)
          throws IllegalArgumentException,
                 ProcessingException
Creates a new instance. Delegates to the corresponding super constructor to configure the fields.

Parameters:
outExt - the extension to use for output files
runDirectory - the directory to run the classifier in; used instead of the configured directory if not null
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

Extractor

public Extractor(String outExt,
                 Trainer trainer)
Creates a new instance, re-using the components from the provided trainer.

Parameters:
outExt - the extension to use for output files
trainer - trainer whose components should be re-used

Extractor

public Extractor(String outExt,
                 TargetStructure targetStruct,
                 Classifier[] theClassifiers,
                 Representation theRepresentation,
                 CombinationStrategy combiStrat,
                 TokenizerFactory tFactory,
                 TrainableFilter sentFilter,
                 Reranker rerank,
                 Set<String> relevantPunct,
                 TiesConfiguration config)
Creates a new instance.

Parameters:
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifiers - the classifiers to use for the local classification decisions
theRepresentation - the context representation to use for local classifications
combiStrat - the combination strategy to use
tFactory - used to instantiate tokenizers
sentFilter - the filter used in the first step of a double classification approach ("sentence filtering"); if null, no sentence filtering is used
rerank - a reranker that recalculates probabilities to introduce a bias (can be used to favor recall over precision, by setting a bias < 1 for the background class, etc.); must not be null
relevantPunct - a set of punctuation tokens that have been found to be relevant for token classification; might be empty but not null
config - used to configure superclasses; if null, the standard configuration is used
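
The bias mentioned for the rerank parameter can be pictured as a per-class multiplier followed by renormalization. The sketch below is an assumption about how such a bias might work, not the Reranker implementation; the class name BiasRerankSketch and the labels used in it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BiasRerankSketch {
    /**
     * Multiplies each class probability by its bias (1.0 if none is given)
     * and renormalizes so that the results sum to 1 again.
     */
    static Map<String, Double> rerank(Map<String, Double> probs,
                                      Map<String, Double> biases) {
        Map<String, Double> scaled = new LinkedHashMap<>();
        double sum = 0.0;
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            double bias = biases.getOrDefault(e.getKey(), 1.0);
            double value = e.getValue() * bias;
            scaled.put(e.getKey(), value);
            sum += value;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("no probability mass left");
        }
        // Updating values in place is safe: the key set does not change.
        for (Map.Entry<String, Double> e : scaled.entrySet()) {
            e.setValue(e.getValue() / sum);
        }
        return scaled;
    }
}
```

With a bias below 1 for the background class, borderline tokens move toward the non-background classes, which tends to raise recall at the expense of precision.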
Method Detail

addPunctuationDetails

protected void addPunctuationDetails(TokenDetails details)
Adds an element to the collected punctuation details.

Parameters:
details - the element to add

appendPunctuation

protected void appendPunctuation(Extraction ext)
Appends the collected punctuation details (if any) to the provided extraction. Finally delegates to clearPunctuation() to delete the processed punctuation.

Parameters:
ext - the extraction to append to

clearPunctuation

protected void clearPunctuation()
Clears the collected punctuation details.
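
The three punctuation methods form a collect, append, clear cycle. The sketch below illustrates that cycle with plain strings; it is a simplified assumption, since the real methods work on TokenDetails and Extraction objects, and the class PunctuationBufferSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PunctuationBufferSketch {
    // Punctuation seen since the last append or clear.
    private final List<String> pending = new ArrayList<>();

    /** Collects one punctuation token for later use. */
    void addPunctuation(String token) {
        pending.add(token);
    }

    /** Appends all collected punctuation, then empties the buffer. */
    void appendPunctuation(StringBuilder extraction) {
        for (String token : pending) {
            extraction.append(token);
        }
        clearPunctuation();
    }

    /** Discards the collected punctuation without using it. */
    void clearPunctuation() {
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Clearing inside the append step means a punctuation token can never be attached to two different extractions.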


createFilteringTokenWalker

protected FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
Creates a filtering token walker used to walk through a document and classify sentences when a double classification approach is used.

Specified by:
createFilteringTokenWalker in class ExtractorBase
Parameters:
repFilter - the trainable filter to use
Returns:
the created walker

extract

public ExtractionContainer extract(Document document)
                            throws IOException,
                                   ProcessingException
Extracts items of interest from the contents of an XML document, based on context representation and local classifier.

Parameters:
document - a document whose contents should be classified
Returns:
a container of all extractions from the document, in document order
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing
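
To illustrate how local per-token decisions can be combined into extractions, the sketch below merges consecutive tokens that received the same non-background label. This is only one simple scheme, chosen for illustration; the CombinationStrategy actually configured may use a different tagging scheme. The class CombineSketch and the background label "O" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CombineSketch {
    static final String BACKGROUND = "O"; // hypothetical background label

    /** Merges runs of equally labeled tokens into "label: text" strings. */
    static List<String> combine(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("one label per token required");
        }
        List<String> extractions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = BACKGROUND;
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            if (!label.equals(currentLabel)) {
                // The label changed: close the run collected so far.
                flush(extractions, currentLabel, current);
                currentLabel = label;
            }
            if (!BACKGROUND.equals(label)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens.get(i));
            }
        }
        flush(extractions, currentLabel, current);
        return extractions;
    }

    private static void flush(List<String> out, String label,
                              StringBuilder text) {
        if (!BACKGROUND.equals(label) && text.length() > 0) {
            out.add(label + ": " + text);
        }
        text.setLength(0);
    }
}
```

For the tokens "Talk by John Smith in Room 5" with labels O, O, speaker, speaker, O, location, location, the sketch yields "speaker: John Smith" followed by "location: Room 5", that is, the extractions in document order.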

evaluateSentenceFiltering

public FMetricsView evaluateSentenceFiltering(ExtractionContainer correctExtractions)
Evaluates precision and recall for sentence filtering on the last processed document.

Parameters:
correctExtractions - a container of all correct extractions for the document
Returns:
the calculated statistics for sentence filtering on the last document; null if sentence filtering is disabled
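
Precision and recall for sentence filtering can be computed by comparing the sentences the filter kept with the sentences that actually contain a correct extraction. The sketch below shows the standard formulas on sets of sentence indices; how FMetricsView stores these values is not shown here, and the class SentenceFilterMetricsSketch is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class SentenceFilterMetricsSketch {
    /**
     * Returns {precision, recall, F1}.
     * kept: indices of sentences the filter let through.
     * relevant: indices of sentences containing a correct extraction.
     */
    static double[] evaluate(Set<Integer> kept, Set<Integer> relevant) {
        Set<Integer> truePositives = new HashSet<>(kept);
        truePositives.retainAll(relevant);
        double tp = truePositives.size();
        // Empty denominators are reported as 0.0 rather than NaN.
        double precision = kept.isEmpty() ? 0.0 : tp / kept.size();
        double recall = relevant.isEmpty() ? 0.0 : tp / relevant.size();
        double f1 = (precision + recall == 0.0)
            ? 0.0
            : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }
}
```

For a sentence filter, recall usually matters more than precision: a relevant sentence that is dropped in the first step can no longer yield an extraction in the second.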

getPredictedExtractions

protected ExtractionContainer getPredictedExtractions()
Returns the extraction container used for storing the predicted extractions.

Returns:
the extraction container

process

public void process(Document document,
                    Writer writer,
                    ContextMap context)
             throws IOException,
                    ProcessingException
Extracts items of interest from the contents of an XML document and serializes the extractions.

Specified by:
process in class DocumentReader
Parameters:
document - the document to read
writer - the writer to write the extracted items to; flushed but not closed by this method
context - a map of objects that are made available for processing
Throws:
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing
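
Since the writer is flushed but not closed, the caller keeps ownership of it and can reuse it for several documents before closing it. The sketch below shows this contract with a hypothetical serialize method standing in for process; the one-line-per-extraction output format is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

final class FlushNotCloseSketch {
    /** Writes one extraction per line; flushes but does not close. */
    static void serialize(List<String> extractions, Writer writer)
            throws IOException {
        for (String extraction : extractions) {
            writer.write(extraction);
            writer.write('\n');
        }
        writer.flush(); // the caller decides when to close the writer
    }

    /** Writes the results of two documents to the same writer. */
    static String twoDocuments() {
        StringWriter buffer = new StringWriter();
        // try-with-resources closes the writer once, after both calls.
        try (Writer out = buffer) {
            serialize(Arrays.asList("speaker: Smith"), out);
            serialize(Arrays.asList("location: Room 5"), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }
}
```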

processToken

public void processToken(Element element,
                         String left,
                         TokenDetails details,
                         String right,
                         ContextMap context)
                  throws ProcessingException
Processes a token in an XML element, optionally modifying the element or the document it is part of.

Parameters:
element - the element containing the token
left - the textual contents of the element to the left of the token (in case of mixed contents, only up to the last preceding child element, if any)
details - details about the token to process
right - the textual contents of the element to the right of the token (in case of mixed contents, only up to the next following child element, if any)
context - a map of objects that are made available for processing
Throws:
ProcessingException - if an error occurs during processing

resetStrategy

protected void resetStrategy()
Resets the strategy and discards the last predicted extraction, if requested.

Specified by:
resetStrategy in class ExtractorBase

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class ExtractorBase
Returns:
a textual representation


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.