de.fu_berlin.ties.extract
Class ExtractorBase

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.DocumentReader
              extended by de.fu_berlin.ties.extract.ExtractorBase
All Implemented Interfaces:
SkipHandler, Processor, TokenProcessor
Direct Known Subclasses:
Extractor, Trainer

public abstract class ExtractorBase
extends DocumentReader
implements SkipHandler, TokenProcessor

Common code base shared by Extractor and Trainer.

Instances of subclasses are not thread-safe and cannot process several documents in parallel.

Version:
$Revision: 1.54 $, $Date: 2006/10/21 16:04:13 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String CONFIG_AVOID
          Configuration key: List of elements that should be avoided when filtering (using parent element instead).
static String CONFIG_ELEMENTS
          Configuration key: List of elements to filter.
static String CONFIG_RELEVANT_PUNCTUATION
          Configuration key: list of punctuation and symbol tokens that are considered relevant from the very start.
static String CONFIG_SENTENCE
          Configuration suffix/prefix used for sentence filtering.
static String CONFIG_SUFFIX_IE
          Configuration suffix used for information extraction-specific settings.
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
ExtractorBase(String outExt)
          Creates a new instance.
ExtractorBase(String outExt, File runDirectory, TiesConfiguration config)
          Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter, combination strategy and tokenizer factory from the provided configuration.
ExtractorBase(String outExt, TargetStructure targetStruct, Classifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, FinalReextractor reextract, TokenizerFactory tFactory, Reestimator estimator, DocumentRewriter[] docFilters, TrainableFilter sentFilter, Set<String> relevantPunct, TiesConfiguration config)
          Creates a new instance.
ExtractorBase(String outExt, TiesConfiguration config)
          Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter and combination strategy from the provided configuration.
 
Method Summary
protected  void addContextDetails(ContextDetails details)
          Adds an element to the collected context details.
protected static DocumentRewriter[] createDocumentRewriters(TiesConfiguration conf)
          Initializes the list of document rewriters.
protected abstract  FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
          Creates a filtering token walker used for walking through a document and classifying sentences when a double classification approach is used.
protected  FMetricsView evaluateSentenceFiltering(EmbeddingElements embeddingElements)
          Evaluates precision and recall for sentence filtering on the last processed document.
protected  Document filterDocument(Document orgDocument, File filename)
          Runs a document through the list of document filters (if any) to modify it.
protected  Set[] getActiveClasses()
          Returns the set of candidate classes to consider for the current element for each classifier.
 Classifier[] getClassifiers()
          Returns the array of classifiers used for the local classification decisions.
protected  List<ContextDetails> getContextDetails()
          Returns the list of context details representing all tokens in the current document.
protected  DocumentRewriter[] getDocumentRewriters()
          Returns the list of document processors that are invoked to modify the XML representations of the documents to process, e.g. by adding semantic information such as named-entity predictions.
 TokenizerFactory getFactory()
          Returns the factory used to instantiate tokenizers.
protected  FeatureVector getFeatures()
          Returns the vector of features representing the currently processed element.
 PriorRecognitions getPriorRecognitions()
          Returns the buffer of preceding Recognitions from the current document.
protected  Reestimator getReestimator()
          Returns the re-estimator chain.
protected  FinalReextractor getReextractor()
          Returns an optional re-extractor that can modify extractions in any suitable way.
 Representation getRepresentation()
          Returns the context representation used for local classifications.
protected  TrainableFilter getSentenceFilter()
          Returns the filter used in the first step of a double classification approach ("sentence filtering").
protected  CombinationStrategy getStrategy()
          Returns the combination strategy used.
 TargetStructure getTargetStructure()
          Returns the target structure specifying the classes to recognize.
protected  TokenWalker getWalker()
          Returns the token walker used to walk through documents.
protected  void initFields(File filename)
          Initializes the fields used for processing a document (feature cache, buffer of prior recognitions, token walker, and statistics) and resets the combination strategy.
protected  boolean isRelevant(String token)
          Checks whether a token is relevant for training and extraction.
 boolean isSentenceFiltering()
          Whether this instance uses sentence filtering (classification of relevant versus irrelevant sentences in a double classification approach).
protected  void markRelevant(String token)
          Marks a punctuation token as relevant for classification (because it occurred as the first or last token of an extraction).
protected abstract  void resetStrategy()
          Resets the combination strategy, handling the boolean result value in an appropriate way.
 void skip()
          This method is called by FilteringTokenWalker whenever some tokens are skipped.
 String toString()
          Returns a string representation of this object.
protected  void updateState(Element element, String leftText, String mainText, String rightText)
          Helper that builds the features and determines the active classes for an element.
 Set<String> viewRelevantPunctuation()
          Returns a read-only view on the set of punctuation tokens that have been found to be relevant for token classification (because they sometimes occur as the first or last token of an extraction).
 
Methods inherited from class de.fu_berlin.ties.DocumentReader
doProcess, process
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface de.fu_berlin.ties.xml.dom.TokenProcessor
processToken
 

Field Detail

CONFIG_ELEMENTS

public static final String CONFIG_ELEMENTS
Configuration key: List of elements to filter.

See Also:
Constant Field Values

CONFIG_AVOID

public static final String CONFIG_AVOID
Configuration key: List of elements that should be avoided when filtering (using parent element instead).

See Also:
Constant Field Values

CONFIG_RELEVANT_PUNCTUATION

public static final String CONFIG_RELEVANT_PUNCTUATION
Configuration key: list of punctuation and symbol tokens that are considered relevant from the very start.

See Also:
Constant Field Values

CONFIG_SENTENCE

public static final String CONFIG_SENTENCE
Configuration suffix/prefix used for sentence filtering.

See Also:
Constant Field Values

CONFIG_SUFFIX_IE

public static final String CONFIG_SUFFIX_IE
Configuration suffix used for information extraction-specific settings.

See Also:
Constant Field Values
Constructor Detail

ExtractorBase

public ExtractorBase(String outExt)
              throws IllegalArgumentException,
                     ProcessingException
Creates a new instance. Delegates to ExtractorBase(String, TiesConfiguration) using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

ExtractorBase

public ExtractorBase(String outExt,
                     TiesConfiguration config)
              throws IllegalArgumentException,
                     ProcessingException
Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter and combination strategy from the provided configuration.

Parameters:
outExt - the extension to use for output files
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

ExtractorBase

public ExtractorBase(String outExt,
                     File runDirectory,
                     TiesConfiguration config)
              throws IllegalArgumentException,
                     ProcessingException
Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter, combination strategy and tokenizer factory from the provided configuration.

Parameters:
outExt - the extension to use for output files
runDirectory - the directory to run the classifier in; used instead of the configured directory if not null
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

ExtractorBase

public ExtractorBase(String outExt,
                     TargetStructure targetStruct,
                     Classifier[] theClassifiers,
                     Representation theRepresentation,
                     CombinationStrategy combiStrat,
                     FinalReextractor reextract,
                     TokenizerFactory tFactory,
                     Reestimator estimator,
                     DocumentRewriter[] docFilters,
                     TrainableFilter sentFilter,
                     Set<String> relevantPunct,
                     TiesConfiguration config)
Creates a new instance.

Parameters:
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifiers - the array of classifiers to use for the local classification decisions
theRepresentation - the context representation to use for local classifications
combiStrat - the combination strategy to use
reextract - an optional re-extractor that can modify extractions in any suitable way
tFactory - used to instantiate tokenizers
estimator - the last element of the re-estimator chain, or null if the chain is empty
docFilters - a list (possibly empty) of document processors that are invoked to modify the XML representations of the documents to process
sentFilter - the filter used in the first step of a double classification approach ("sentence filtering"); if null, no sentence filtering is used
relevantPunct - a set of punctuation tokens that have been found to be relevant for token classification; might be empty but not null
config - used to configure superclasses; if null, the standard configuration is used
Method Detail

createDocumentRewriters

protected static DocumentRewriter[] createDocumentRewriters(TiesConfiguration conf)
                                                     throws ProcessingException
Initializes the list of document rewriters.

Parameters:
conf - the filters are initialized from the optional "rewriters" parameter in this configuration
Returns:
the created list of filters, might be empty but not null
Throws:
ProcessingException - if an error occurred while creating the rewriters

addContextDetails

protected void addContextDetails(ContextDetails details)
Adds an element to the collected context details.

Parameters:
details - the element to add

createFilteringTokenWalker

protected abstract FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
Creates a filtering token walker used for walking through a document and classifying sentences when a double classification approach is used.

Parameters:
repFilter - the trainable filter to use
Returns:
the created walker

evaluateSentenceFiltering

protected FMetricsView evaluateSentenceFiltering(EmbeddingElements embeddingElements)
Evaluates precision and recall for sentence filtering on the last processed document.

Parameters:
embeddingElements - the correct set of embedding elements
Returns:
the calculated statistics for sentence filtering on the last document; null if sentence filtering is disabled

filterDocument

protected Document filterDocument(Document orgDocument,
                                  File filename)
                           throws IOException,
                                  ProcessingException
Runs a document through the list of document filters (if any) to modify it.

Parameters:
orgDocument - the original document
filename - the file name of the document
Returns:
the resulting document as modified by the filters
Throws:
IOException - if an I/O error occurs during filtering
ProcessingException - if a processing error occurs during filtering
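The filter chain described above follows a simple sequential-pipeline pattern: each rewriter receives the output of its predecessor. A minimal sketch of that loop, modeling documents as plain strings instead of the XML Document objects the real method passes (class and parameter names here are hypothetical, not the TIES implementation):

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class FilterChain {

    /**
     * Runs a document through a list of filters in order; each filter
     * receives the result of the previous one.
     */
    public static String filterDocument(String doc,
            List<UnaryOperator<String>> filters) {
        String result = doc;
        for (UnaryOperator<String> filter : filters) {
            result = filter.apply(result); // each filter may modify the document
        }
        return result;
    }
}
```

If the filter list is empty, the document is returned unchanged, which matches the "(if any)" wording above.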

getActiveClasses

protected Set[] getActiveClasses()
Returns the set of candidate classes to consider for the current element for each classifier.

Returns:
the value of the attribute

getClassifiers

public Classifier[] getClassifiers()
Returns the array of classifiers used for the local classification decisions.

Returns:
the local classifier

getContextDetails

protected List<ContextDetails> getContextDetails()
Returns the list of context details representing all tokens in the current document.

Returns:
the list of context details

getDocumentRewriters

protected DocumentRewriter[] getDocumentRewriters()
Returns the list of document processors that are invoked to modify the XML representations of the documents to process, e.g. by adding semantic information such as named-entity predictions.

Returns:
the list of filters used, might be empty

getFactory

public TokenizerFactory getFactory()
Returns the factory used to instantiate tokenizers.

Returns:
the value of the attribute

getFeatures

protected FeatureVector getFeatures()
Returns the vector of features representing the currently processed element.

Returns:
the value of the attribute

getPriorRecognitions

public PriorRecognitions getPriorRecognitions()
Returns the buffer of preceding Recognitions from the current document.

Returns:
the buffer

getReestimator

protected Reestimator getReestimator()
Returns the re-estimator chain.

Returns:
the last element of the re-estimator chain, or null if the chain is empty

getReextractor

protected FinalReextractor getReextractor()
Returns an optional re-extractor that can modify extractions in any suitable way.

Returns:
the re-extractor used; may be null

getRepresentation

public Representation getRepresentation()
Returns the context representation used for local classifications.

Returns:
the context representation

getSentenceFilter

protected TrainableFilter getSentenceFilter()
Returns the filter used in the first step of a double classification approach ("sentence filtering").

Returns:
the node filter, or null if no sentence filtering is used

getStrategy

protected CombinationStrategy getStrategy()
Returns the combination strategy used.

Returns:
the combination strategy

getTargetStructure

public TargetStructure getTargetStructure()
Returns the target structure specifying the classes to recognize.

Returns:
the used target structure

getWalker

protected TokenWalker getWalker()
Returns the token walker used to walk through documents.

Returns:
the token walker

initFields

protected void initFields(File filename)
                   throws ProcessingException,
                          IOException
Initializes the fields used for processing a document (feature cache, buffer of prior recognitions, token walker, and statistics) and resets the combination strategy.

Parameters:
filename - the name of the document
Throws:
ProcessingException - if an error occurs while initializing
IOException - if an I/O error occurs

isRelevant

protected boolean isRelevant(String token)
Checks whether a token is relevant for training and extraction. Tokens containing only punctuation or symbol characters are considered irrelevant unless they have been marked as relevant.

Parameters:
token - the token to check
Returns:
true if the token is relevant for training and extraction; false if it can be ignored

isSentenceFiltering

public boolean isSentenceFiltering()
Whether this instance uses sentence filtering (classification of relevant versus irrelevant sentences in a double classification approach).

Returns:
true if sentence filtering is used
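The double classification approach works in two stages: a first-stage sentence filter discards irrelevant sentences, and only the surviving sentences reach the second-stage token classification. A minimal sketch of the first stage (all names are hypothetical; the real implementation walks XML documents via a FilteringTokenWalker rather than a string list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class DoubleClassification {

    /**
     * First stage of a double classification approach: keep only the
     * sentences the filter judges relevant; the rest are skipped.
     */
    public static List<String> filterSentences(List<String> sentences,
            Predicate<String> sentenceFilter) {
        List<String> relevant = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentenceFilter.test(sentence)) {
                relevant.add(sentence); // passed on to token classification
            }
            // otherwise the sentence is skipped (cf. SkipHandler.skip())
        }
        return relevant;
    }
}
```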

markRelevant

protected void markRelevant(String token)
Marks a punctuation token as relevant for classification (because it occurred as the first or last token of an extraction).

Parameters:
token - the token to mark as relevant
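The interplay between isRelevant and markRelevant can be sketched as follows: punctuation-only tokens are irrelevant by default, but once marked (because they occurred at the edge of an extraction) they become relevant. This is a standalone illustration of the described behavior using the Unicode character categories from java.lang.Character, not the actual TIES code:

```java
import java.util.HashSet;
import java.util.Set;

public class RelevanceSketch {
    /** Punctuation tokens that have been marked as relevant. */
    private final Set<String> relevantPunct = new HashSet<>();

    /** Marks a punctuation token as relevant (cf. markRelevant). */
    public void markRelevant(String token) {
        relevantPunct.add(token);
    }

    /**
     * Checks whether a token is relevant (cf. isRelevant): tokens made up
     * entirely of punctuation or symbol characters are irrelevant unless
     * they have been marked.
     */
    public boolean isRelevant(String token) {
        boolean punctOnly = token.chars().allMatch(c -> {
            int type = Character.getType(c);
            return type == Character.DASH_PUNCTUATION
                || type == Character.START_PUNCTUATION
                || type == Character.END_PUNCTUATION
                || type == Character.CONNECTOR_PUNCTUATION
                || type == Character.OTHER_PUNCTUATION
                || type == Character.INITIAL_QUOTE_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION
                || type == Character.MATH_SYMBOL
                || type == Character.CURRENCY_SYMBOL
                || type == Character.MODIFIER_SYMBOL
                || type == Character.OTHER_SYMBOL;
        });
        return !punctOnly || relevantPunct.contains(token);
    }
}
```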

resetStrategy

protected abstract void resetStrategy()
Reset the combination strategy, handling the boolean result value in an appropriate way.


skip

public void skip()
This method is called by FilteringTokenWalker whenever some tokens are skipped.

Specified by:
skip in interface SkipHandler

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class TextProcessor
Returns:
a textual representation

updateState

protected void updateState(Element element,
                           String leftText,
                           String mainText,
                           String rightText)
Helper that builds the features and determines the active classes for an element.

Parameters:
element - the element to process
leftText - textual content to the left of (preceding) mainText, might be empty
mainText - the main textual content to represent, might be empty
rightText - textual content to the right of (following) mainText, might be empty

viewRelevantPunctuation

public Set<String> viewRelevantPunctuation()
Returns a read-only view on the set of punctuation tokens that have been found to be relevant for token classification (because they sometimes occur as the first or last token of an extraction).

Returns:
a read-only view on the relevant punctuation
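A "read-only view" of this kind is typically exposed via Collections.unmodifiableSet, so callers can inspect the current state without being able to modify it. A sketch of that idiom (field and method names are illustrative, mirroring the documented API rather than reproducing the actual implementation):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class PunctuationView {
    /** Internal, mutable set of relevant punctuation tokens. */
    private final Set<String> relevantPunctuation = new HashSet<>();

    /** Internal mutation stays possible through the owning class. */
    void markRelevant(String token) {
        relevantPunctuation.add(token);
    }

    /**
     * Returns a read-only view: reads see the live contents, but any
     * mutation attempt throws UnsupportedOperationException.
     */
    public Set<String> viewRelevantPunctuation() {
        return Collections.unmodifiableSet(relevantPunctuation);
    }
}
```

The view is live: tokens marked after the view was obtained are still visible through it.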


Copyright © 2003-2007 Christian Siefkes. All Rights Reserved.