java.lang.Object
de.fu_berlin.ties.ConfigurableProcessor
de.fu_berlin.ties.TextProcessor
de.fu_berlin.ties.DocumentReader
de.fu_berlin.ties.extract.ExtractorBase
public abstract class ExtractorBase
Common code base shared by Extractor and Trainer.
Instances of subclasses are not thread-safe and cannot process several documents in parallel.
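Since instances are not thread-safe, one way to handle several documents concurrently is to give every worker thread its own instance. The following is a minimal sketch of that pattern only: `StubProcessor` and `PerThreadProcessing` are hypothetical stand-ins invented for illustration and are not part of this package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a stateful processor that is not thread-safe. */
final class StubProcessor {
    // Per-document state that would be corrupted by concurrent use.
    private final StringBuilder buffer = new StringBuilder();

    String process(String document) {
        buffer.setLength(0); // discard state left over from the previous document
        buffer.append(document.trim().toUpperCase(Locale.ROOT));
        return buffer.toString();
    }
}

/** Hypothetical driver: one processor instance per worker thread. */
final class PerThreadProcessing {
    // Each thread lazily creates its own instance and reuses it across documents.
    private static final ThreadLocal<StubProcessor> PROCESSOR =
            ThreadLocal.withInitial(StubProcessor::new);

    static List<String> processAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String doc : documents) {
                pending.add(pool.submit(() -> PROCESSOR.get().process(doc)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // results are collected in input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A single instance is still used sequentially within its thread, so any per-document state has to be reset before each document, as `process` does here.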
Field Summary | |
---|---|
static String |
CONFIG_AVOID
Configuration key: List of elements that should be avoided when filtering (using parent element instead). |
static String |
CONFIG_ELEMENTS
Configuration key: List of elements to filter. |
static String |
CONFIG_RELEVANT_PUNCTUATION
Configuration key: list of punctuation and symbol tokens that are considered as relevant from the very start. |
static String |
CONFIG_SENTENCE
Configuration suffix/prefix used for sentence filtering. |
Fields inherited from class de.fu_berlin.ties.TextProcessor |
---|
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL |
Constructor Summary | |
---|---|
ExtractorBase(String outExt)
Creates a new instance. |
|
ExtractorBase(String outExt,
File runDirectory,
TiesConfiguration config)
Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter, combination strategy and tokenizer factory from the provided configuration. |
|
ExtractorBase(String outExt,
TargetStructure targetStruct,
Classifier[] theClassifiers,
Representation theRepresentation,
CombinationStrategy combiStrat,
TokenizerFactory tFactory,
TrainableFilter sentFilter,
Set<String> relevantPunct,
TiesConfiguration config)
Creates a new instance. |
|
ExtractorBase(String outExt,
TiesConfiguration config)
Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter and combination strategy from the provided configuration. |
Method Summary | |
---|---|
protected abstract FilteringTokenWalker |
createFilteringTokenWalker(TrainableFilter repFilter)
Creates a filtering token walker to be used for walking through a document and sentence classification if a double classification approach is used. |
static TrainableFilter |
createSentenceFilter(TiesConfiguration conf,
Representation representation)
Helper method that initializes the filter to be used for the first step of a double classification approach ("sentence filtering"). |
protected FMetricsView |
evaluateSentenceFiltering(EmbeddingElements embeddingElements)
Evaluates precision and recall for sentence filtering on the last processed document. |
protected Set[] |
getActiveClasses()
Returns the set of candidate classes to consider for the current element for each classifier. |
Classifier[] |
getClassifiers()
Returns the array of classifiers used for the local classification decisions. |
TokenizerFactory |
getFactory()
Returns the factory used to instantiate tokenizers. |
FeatureCount |
getFeatureCount()
Returns the object used to count documents, contexts, and features and to calculate averages. |
protected FeatureVector |
getFeatures()
Returns vector of features representing the currently processed element. |
PriorRecognitions |
getPriorRecognitions()
Returns the buffer of preceding Recognitions from the current document. |
Representation |
getRepresentation()
Returns the context representation used for local classifications. |
protected TrainableFilter |
getSentenceFilter()
Returns the filter used in the first step of a double classification approach ("sentence filtering"). |
protected CombinationStrategy |
getStrategy()
Returns the combination strategy used. |
TargetStructure |
getTargetStructure()
Returns the target structure specifying the classes to recognize. |
protected TokenWalker |
getWalker()
Returns the token walker used to walk through documents. |
protected void |
initFields()
Initializes the fields used for processing a document (feature cache, buffer of prior recognitions, token walker, and statistics) and resets the combination strategy. |
protected boolean |
isRelevant(String token)
Checks whether a token is relevant for training and extraction. |
boolean |
isSentenceFiltering()
Whether this instance uses sentence filtering (classification of relevant versus irrelevant sentences in a double classification approach). |
protected void |
markRelevant(String token)
Marks a punctuation token as relevant for classification (because it occurred as the first or last token of an extraction). |
protected abstract void |
resetStrategy()
Reset the combination strategy, handling the boolean result value in an appropriate way. |
void |
skip()
This method is called by FilteringTokenWalker whenever some tokens are skipped. |
String |
toString()
Returns a string representation of this object. |
protected void |
updateState(Element element,
String leftText,
String mainText,
String rightText)
Helper that builds the features and determines the active classes for an element. |
FeatureCountView |
viewFeatureCount()
Returns a read-only view on the counted documents, contexts, and features and the calculated averages. |
Set<String> |
viewRelevantPunctuation()
Returns a read-only view on the set of punctuation tokens that have been found to be relevant for token classification (because they sometimes occur as the first or last token of an extraction). |
Methods inherited from class de.fu_berlin.ties.DocumentReader |
---|
doProcess, process |
Methods inherited from class de.fu_berlin.ties.TextProcessor |
---|
getOutFileExt, process, process, process, process, process, process |
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor |
---|
getConfig |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Methods inherited from interface de.fu_berlin.ties.xml.dom.TokenProcessor |
---|
processToken |
Field Detail |
---|
public static final String CONFIG_ELEMENTS
public static final String CONFIG_AVOID
public static final String CONFIG_RELEVANT_PUNCTUATION
public static final String CONFIG_SENTENCE
Constructor Detail |
---|
public ExtractorBase(String outExt) throws IllegalArgumentException, ProcessingException
Creates a new instance, delegating to ExtractorBase(String, TiesConfiguration) using the standard configuration.
Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization
public ExtractorBase(String outExt, TiesConfiguration config) throws IllegalArgumentException, ProcessingException
Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter and combination strategy from the provided configuration.
Parameters:
outExt - the extension to use for output files
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization
public ExtractorBase(String outExt, File runDirectory, TiesConfiguration config) throws IllegalArgumentException, ProcessingException
Creates a new instance, configuring target structure, classifier, DefaultRepresentation, node filter, combination strategy and tokenizer factory from the provided configuration.
Parameters:
outExt - the extension to use for output files
runDirectory - the directory to run the classifier in; used instead of the configured directory if not null
config - the configuration to use
Throws:
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization
public ExtractorBase(String outExt, TargetStructure targetStruct, Classifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, TokenizerFactory tFactory, TrainableFilter sentFilter, Set<String> relevantPunct, TiesConfiguration config)
Creates a new instance.
Parameters:
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifiers - the array of classifiers to use for the local classification decisions
theRepresentation - the context representation to use for local classifications
combiStrat - the combination strategy to use
tFactory - used to instantiate tokenizers
sentFilter - the filter used in the first step of a double classification approach ("sentence filtering"); if null, no sentence filtering is used
relevantPunct - a set of punctuation tokens that have been found to be relevant for token classification; might be empty but not null
config - used to configure superclasses; if null, the standard configuration is used
Method Detail |
---|
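public static TrainableFilter createSentenceFilter(TiesConfiguration conf, Representation representation) throws ProcessingException
Helper method that initializes the filter to be used for the first step of a double classification approach ("sentence filtering").
Parameters:
conf - the filter is initialized from the "filter" parameters in this configuration
representation - the representation to use
Returns:
the created filter, or null if no sentence filtering should be used
Throws:
ProcessingException - if an error occurs while creating the filter
protected abstract FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
Creates a filtering token walker to be used for walking through a document and sentence classification if a double classification approach is used.
Parameters:
repFilter - the trainable filter to use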
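protected FMetricsView evaluateSentenceFiltering(EmbeddingElements embeddingElements)
Evaluates precision and recall for sentence filtering on the last processed document.
Parameters:
embeddingElements - the correct set of embedding elements
Returns:
the evaluated metrics, or null if sentence filtering is disabled
protected Set[] getActiveClasses()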
public Classifier[] getClassifiers()
public TokenizerFactory getFactory()
public FeatureCount getFeatureCount()
protected FeatureVector getFeatures()
public PriorRecognitions getPriorRecognitions()
Returns the buffer of preceding Recognitions from the current document.
public Representation getRepresentation()
protected TrainableFilter getSentenceFilter()
Returns:
the sentence filter, or null if no sentence filtering is used
protected CombinationStrategy getStrategy()
public TargetStructure getTargetStructure()
protected TokenWalker getWalker()
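protected void initFields() throws ProcessingException
Initializes the fields used for processing a document (feature cache, buffer of prior recognitions, token walker, and statistics) and resets the combination strategy.
Throws:
ProcessingException - if an error occurs while initializing
protected boolean isRelevant(String token)
Checks whether a token is relevant for training and extraction.
Parameters:
token - the token to check
Returns:
true if the token is relevant for training and extraction; false if it can be ignored
public boolean isSentenceFiltering()
Whether this instance uses sentence filtering (classification of relevant versus irrelevant sentences in a double classification approach).
Returns:
true if sentence filtering is used
protected void markRelevant(String token)
Marks a punctuation token as relevant for classification (because it occurred as the first or last token of an extraction).
Parameters:
token - the token to mark as relevant
protected abstract void resetStrategy()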
public void skip()
This method is called by FilteringTokenWalker whenever some tokens are skipped.
Specified by:
skip in interface SkipHandler
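The double classification approach mentioned throughout this page (a sentence filter first, token classification second, with a notification when tokens are skipped) can be illustrated roughly as follows. This is a simplified sketch under stated assumptions, not the behaviour of FilteringTokenWalker: the `Callbacks` interface, the `walk` method, the whitespace tokenization and the `Recorder` class are all hypothetical and only mimic the idea of a token callback paired with a skip callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class SentenceFilterSketch {
    /** Hypothetical callbacks: one per token that is kept, one per skipped sentence. */
    interface Callbacks {
        void processToken(String token);
        void skip();
    }

    /**
     * Step 1: the sentence filter decides whether a sentence is relevant.
     * Step 2: only tokens of relevant sentences reach the token callback.
     */
    static void walk(List<String> sentences, Predicate<String> sentenceFilter,
                     Callbacks callbacks) {
        for (String sentence : sentences) {
            if (!sentenceFilter.test(sentence)) {
                callbacks.skip(); // tokens of this sentence are never classified
                continue;
            }
            // Assumption: plain whitespace tokenization, for illustration only.
            for (String token : sentence.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    callbacks.processToken(token);
                }
            }
        }
    }

    /** Records what a walk reports so the effect of filtering can be inspected. */
    static final class Recorder implements Callbacks {
        final List<String> tokens = new ArrayList<>();
        int skipped;

        @Override
        public void processToken(String token) {
            tokens.add(token);
        }

        @Override
        public void skip() {
            skipped++;
        }
    }
}
```

In this sketch a skip notification gives the handler a chance to reset any state that assumes consecutive tokens, which is one plausible reason for such a callback to exist.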
public String toString()
Returns a string representation of this object.
Overrides:
toString in class TextProcessor
protected void updateState(Element element, String leftText, String mainText, String rightText)
Helper that builds the features and determines the active classes for an element.
Parameters:
element - the element to process
leftText - textual content to the left of (preceding) mainText, might be empty
mainText - the main textual content to represent, might be empty
rightText - textual content to the right of (following) mainText, might be empty
public FeatureCountView viewFeatureCount()
public Set<String> viewRelevantPunctuation()
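The interplay of markRelevant, isRelevant and the read-only view returned by viewRelevantPunctuation might look roughly like the sketch below. It is not the class's actual implementation: `RelevantPunctuationSketch` is a hypothetical stand-in, and both the punctuation test and the rule that non-punctuation tokens are always relevant are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class RelevantPunctuationSketch {
    // Punctuation tokens seen as the first or last token of an extraction.
    private final Set<String> relevantPunctuation = new HashSet<>();

    /** Assumption: a token is punctuation if it contains no letter or digit. */
    static boolean isPunctuation(String token) {
        return !token.isEmpty()
                && token.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /** Remembers a punctuation token as relevant from now on. */
    void markRelevant(String token) {
        relevantPunctuation.add(token);
    }

    /**
     * Assumption: tokens that are not punctuation are always relevant, while
     * punctuation is relevant only after it has been marked.
     */
    boolean isRelevant(String token) {
        return !isPunctuation(token) || relevantPunctuation.contains(token);
    }

    /** Read-only view: callers can inspect the set but cannot modify it. */
    Set<String> viewRelevantPunctuation() {
        return Collections.unmodifiableSet(relevantPunctuation);
    }
}
```

Because the view wraps the live set, tokens marked later become visible through a view obtained earlier, while any attempt to modify the view throws UnsupportedOperationException.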