java.lang.Object
de.fu_berlin.ties.ConfigurableProcessor
de.fu_berlin.ties.TextProcessor
de.fu_berlin.ties.DocumentReader
de.fu_berlin.ties.extract.ExtractorBase
de.fu_berlin.ties.extract.Extractor
public class Extractor
An extractor runs a local Classifier on a list of items/nodes and combines their results using a CombinationStrategy.
Instances of this class are not thread-safe and cannot extract from several documents in parallel.
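Because instances are not thread-safe, one common way to process several documents concurrently is to give each worker thread its own instance. The following is a minimal sketch of that pattern only: `StatefulWorker` is a hypothetical stand-in for any non-thread-safe processor and is not part of TIES, and this documentation does not say whether several extractors may safely share the same underlying classifier resources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Thread-confinement sketch: every worker thread gets its own stateful instance. */
final class PerThreadInstanceSketch {

    /** Hypothetical stand-in for a processor that keeps per-document state. */
    static final class StatefulWorker {
        private final StringBuilder buffer = new StringBuilder(); // unsynchronized state

        String process(String document) {
            buffer.setLength(0);            // discard state left over from the previous document
            buffer.append(document.trim());
            return buffer.reverse().toString();
        }
    }

    // One instance per thread; instances are never shared between threads.
    private static final ThreadLocal<StatefulWorker> WORKER =
            ThreadLocal.withInitial(StatefulWorker::new);

    static List<String> processAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : documents) {
                futures.add(pool.submit(() -> WORKER.get().process(doc)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());       // collected in input order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("abc", " xy "), 2));
    }
}
```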
Field Summary | |
---|---|
static String | EXT_EXTRACTIONS: The recommended file extension to use for storing extractions. |
Fields inherited from class de.fu_berlin.ties.extract.ExtractorBase |
---|
CONFIG_AVOID, CONFIG_ELEMENTS, CONFIG_RELEVANT_PUNCTUATION, CONFIG_SENTENCE, CONFIG_SUFFIX_IE |
Fields inherited from class de.fu_berlin.ties.TextProcessor |
---|
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL |
Constructor Summary | |
---|---|
Extractor() | Creates a new instance using a default extension. |
Extractor(String outExt) | Creates a new instance. |
Extractor(String outExt, File runDirectory, TiesConfiguration config) | Creates a new instance. |
Extractor(String outExt, TargetStructure targetStruct, Classifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, FinalReextractor reextract, TokenizerFactory tFactory, Reestimator estimator, DocumentRewriter[] docFilters, TrainableFilter sentFilter, Reranker rerank, Set<String> relevantPunct, TiesConfiguration config) | Creates a new instance. |
Extractor(String outExt, TiesConfiguration config) | Creates a new instance. |
Extractor(String outExt, Trainer trainer) | Creates a new instance, re-using the components from the provided trainer. |
Method Summary | |
---|---|
protected void | addPunctuationDetails(TokenDetails details): Adds an element to the collected punctuation details. |
protected void | appendPunctuation(Extraction ext): Appends the collected punctuation details (if any) to the provided extraction. |
protected void | clearPunctuation(): Clears the collected punctuation details. |
protected FilteringTokenWalker | createFilteringTokenWalker(TrainableFilter repFilter): Creates a filtering token walker to be used for walking through a document and sentence classification if a double classification approach is used. |
void | destroy(): Destroys the internal classifiers. |
FMetricsView | evaluateSentenceFiltering(ExtractionContainer correctExtractions): Evaluates precision and recall for sentence filtering on the last processed document. |
ExtractionContainer | extract(Document doc, File filename): Extracts items of interest from the contents of an XML document, based on context representation and local classifier. |
protected ExtractionContainer | getPredictedExtractions(): Returns the extraction container used for storing the predicted extractions. |
void | process(Document document, Writer writer, ContextMap context): Extracts items of interest from the contents of an XML document and serializes the extractions. |
void | processToken(Element element, String left, TokenDetails details, String right, ContextMap context): Processes a token in an XML element, optionally modifying the element or the document it is part of. |
protected void | resetStrategy(): Resets the strategy and discards the last prediction extraction if requested. |
void | serializeExtractions(ExtractionContainer extractions, Writer writer): Helper method that serializes the content of an extraction container to a writer. |
Methods inherited from class de.fu_berlin.ties.extract.ExtractorBase |
---|
addContextDetails, createDocumentRewriters, evaluateSentenceFiltering, filterDocument, getActiveClasses, getClassifiers, getContextDetails, getDocumentRewriters, getFactory, getFeatures, getPriorRecognitions, getReestimator, getReextractor, getRepresentation, getSentenceFilter, getStrategy, getTargetStructure, getWalker, initFields, isRelevant, isSentenceFiltering, markRelevant, skip, toString, updateState, viewRelevantPunctuation |
Methods inherited from class de.fu_berlin.ties.DocumentReader |
---|
doProcess |
Methods inherited from class de.fu_berlin.ties.TextProcessor |
---|
getOutFileExt, process, process, process, process, process, process |
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor |
---|
getConfig |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String EXT_EXTRACTIONS
Constructor Detail |
---|
public Extractor() throws IllegalArgumentException, ProcessingException
Creates a new instance using a default extension. Delegates to Extractor(String, TiesConfiguration) using the standard configuration.
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

public Extractor(String outExt) throws IllegalArgumentException, ProcessingException
Creates a new instance. Delegates to Extractor(String, TiesConfiguration) using the standard configuration.
outExt - the extension to use for output files
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

public Extractor(String outExt, TiesConfiguration config) throws IllegalArgumentException, ProcessingException
Creates a new instance.
outExt - the extension to use for output files
config - the configuration to use
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

public Extractor(String outExt, File runDirectory, TiesConfiguration config) throws IllegalArgumentException, ProcessingException
Creates a new instance.
outExt - the extension to use for output files
runDirectory - the directory to run the classifier in; used instead of the configured directory if not null
config - the configuration to use
IllegalArgumentException - if the combination strategy cannot be initialized (cf. CombinationStrategy.createStrategy(java.util.Set, TiesConfiguration))
ProcessingException - if an error occurs during initialization

public Extractor(String outExt, Trainer trainer)
Creates a new instance, re-using the components from the provided trainer.
outExt - the extension to use for output files
trainer - trainer whose components should be re-used

public Extractor(String outExt, TargetStructure targetStruct, Classifier[] theClassifiers, Representation theRepresentation, CombinationStrategy combiStrat, FinalReextractor reextract, TokenizerFactory tFactory, Reestimator estimator, DocumentRewriter[] docFilters, TrainableFilter sentFilter, Reranker rerank, Set<String> relevantPunct, TiesConfiguration config)
Creates a new instance.
outExt - the extension to use for output files
targetStruct - the target structure specifying the classes to recognize
theClassifiers - the classifiers to use for the local classification decisions
theRepresentation - the context representation to use for local classifications
combiStrat - the combination strategy to use
reextract - an optional re-extractor that can modify extractions in any suitable way
tFactory - used to instantiate tokenizers
estimator - the last element of the re-estimator chain, or null if the chain is empty
docFilters - a list (possibly empty) of document processors that are invoked to modify the XML representations of the documents to process
sentFilter - the filter used in the first step of a double classification approach ("sentence filtering"); if null, no sentence filtering is used
rerank - a reranker that recalculates probabilities to introduce a bias (can be used to favor recall over precision, by setting a bias < 1 for the background class, etc.); must not be null
relevantPunct - a set of punctuation tokens that have been found to be relevant for token classification; might be empty but not null
config - used to configure superclasses; if null, the standard configuration is used

Method Detail |
---|
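The rerank parameter of the full constructor (see Constructor Detail) is described as recalculating probabilities to introduce a bias. As a rough illustration of that idea, and not of the actual Reranker implementation in TIES, the sketch below multiplies the probability of one class by a bias factor and renormalizes. The class name, the method name, and the multiply-then-renormalize scheme are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of biasing class probabilities; not the TIES Reranker. */
final class BiasRerankSketch {

    /**
     * Scales the probability of one class by the given bias and renormalizes
     * so that all probabilities sum to 1 again.
     */
    static Map<String, Double> rerank(Map<String, Double> probs, String biasedClass, double bias) {
        if (bias <= 0.0) {
            throw new IllegalArgumentException("bias must be positive");
        }
        Map<String, Double> scaled = new LinkedHashMap<>();
        double sum = 0.0;
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            double p = e.getKey().equals(biasedClass) ? e.getValue() * bias : e.getValue();
            scaled.put(e.getKey(), p);
            sum += p;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("probabilities must have a positive sum");
        }
        for (Map.Entry<String, Double> e : scaled.entrySet()) {
            e.setValue(e.getValue() / sum); // renormalize in place
        }
        return scaled;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("background", 0.6);
        probs.put("speaker", 0.4);
        // A bias below 1 on the background class shifts mass to the other classes.
        System.out.println(rerank(probs, "background", 0.5));
    }
}
```

In this example a token that was narrowly classified as background (0.6 against 0.4) ends up with a higher probability for the other class, which is how a bias below 1 on the background class trades precision for recall.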
protected void addPunctuationDetails(TokenDetails details)
Adds an element to the collected punctuation details.
details - the element to add

protected void appendPunctuation(Extraction ext)
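Appends the collected punctuation details (if any) to the provided extraction. clearPunctuation() deletes the processed punctuation.
ext - the extraction to append to

protected void clearPunctuation()
Clears the collected punctuation details.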
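
protected FilteringTokenWalker createFilteringTokenWalker(TrainableFilter repFilter)
Creates a filtering token walker to be used for walking through a document and sentence classification if a double classification approach is used.
createFilteringTokenWalker in class ExtractorBase
repFilter - the trainable filter to use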

public void destroy() throws ProcessingException
Destroys the internal classifiers.
ProcessingException - if an error occurs while the classifiers are being destroyed

public ExtractionContainer extract(Document doc, File filename) throws IOException, ProcessingException
Extracts items of interest from the contents of an XML document, based on context representation and local classifier.
doc - a document whose contents should be classified
filename - the name of the document
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing

public FMetricsView evaluateSentenceFiltering(ExtractionContainer correctExtractions)
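Evaluates precision and recall for sentence filtering on the last processed document.
correctExtractions - a container of all correct extractions for the document
Returns null if sentence filtering is disabled.

protected ExtractionContainer getPredictedExtractions()
Returns the extraction container used for storing the predicted extractions.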

public void process(Document document, Writer writer, ContextMap context) throws IOException, ProcessingException
Extracts items of interest from the contents of an XML document and serializes the extractions.
process in class DocumentReader
document - the document to read
writer - the writer to write the extracted items to; flushed but not closed by this method
context - a map of objects that are made available for processing
IOException - if an I/O error occurs
ProcessingException - if an error occurs during processing

public void processToken(Element element, String left, TokenDetails details, String right, ContextMap context) throws ProcessingException
Processes a token in an XML element, optionally modifying the element or the document it is part of.
element - the element containing the token
left - the textual contents of the element to the left of the token (in case of mixed contents, only up to the last preceding child element, if any)
details - details about the token to process
right - the textual contents of the element to the right of the token (in case of mixed contents, only up to the next following child element, if any)
context - a map of objects that are made available for processing
ProcessingException - if an error occurs during processing

protected void resetStrategy()
Resets the strategy and discards the last prediction extraction if requested.
resetStrategy in class ExtractorBase

public void serializeExtractions(ExtractionContainer extractions, Writer writer) throws IOException
Helper method that serializes the content of an extraction container to a writer.
extractions - the extraction container to serialize
writer - the writer to write to; will be flushed but not closed by this method
IOException - if an I/O error occurs
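
Both process and serializeExtractions flush the writer they receive but do not close it, so the caller keeps ownership of the writer. A minimal sketch of that contract from the caller's side follows. Here `writeItems` is a hypothetical stand-in for such a method; it does not reproduce the TIES serialization format or call any TIES class.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

/** Sketch of the "flushed but not closed" writer contract, seen from the caller. */
final class FlushNotCloseSketch {

    /** Hypothetical stand-in: writes one item per line, flushes, leaves the writer open. */
    static void writeItems(List<String> items, Writer writer) throws IOException {
        for (String item : items) {
            writer.write(item);
            writer.write('\n');
        }
        writer.flush(); // flushed, but deliberately not closed
    }

    /** The caller owns the writer, so it can write several batches and then close it. */
    static String writeTwoBatches(List<String> first, List<String> second) {
        try (StringWriter writer = new StringWriter()) { // closed here, by the caller
            writeItems(first, writer);
            writeItems(second, writer); // the writer is still usable after the first call
            return writer.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(writeTwoBatches(List.of("a", "b"), List.of("c")));
    }
}
```

Leaving the writer open lets a caller append the output for several documents to the same stream and close it once at the end.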