de.fu_berlin.ties.context
Class DefaultRepresentation

java.lang.Object
  extended by de.fu_berlin.ties.context.Representation
      extended by de.fu_berlin.ties.context.AbstractRepresentation
          extended by de.fu_berlin.ties.context.DefaultRepresentation

public class DefaultRepresentation
extends AbstractRepresentation

The context representation used by default. This class is thread-safe.

Version:
$Revision: 1.60 $, $Date: 2004/10/12 15:55:47 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
protected static String AXIS_ANCESTOR
          Ancestor axis.
protected static String AXIS_DESC_OR_SELF
          Descendant-or-self axis.
protected static String AXIS_FOLLOW_SIBLING
          Following sibling axis.
protected static String AXIS_PREC_SIBLING
          Preceding sibling axis.
protected static String AXIS_PRIOR
          The pseudo-axis of prior recognitions.
protected static org.apache.commons.collections.map.LinkedMap TOKEN_TYPE_PATTERNS
          An ordered map used by calculateValuesFromText(String, String, List) to determine the "tokenType" value.
 
Fields inherited from class de.fu_berlin.ties.context.AbstractRepresentation
CONFIG_RECOGN_NUM, CONFIG_SPLIT_MAXIMUM, CONFIG_STORE_NTH
 
Constructor Summary
DefaultRepresentation()
          Creates a new instance based on the standard configuration.
DefaultRepresentation(int recogNum, int detailedRecogs, int numberOfAncestors, int numberOfSiblings, int splitMax, int prefixMax, String headElementName, String headAttribName, String[] defaultAttribs, int n, String outCharset, String[] sensorNames, TiesConfiguration config)
          Creates a new instance.
DefaultRepresentation(TiesConfiguration config)
          Creates a new instance based on the provided configuration.
 
Method Summary
protected  void buildFeatures(String axisName, Element element, ElementPosition position, boolean recurseInsteadOfText, LinkedList<Feature> featureList, boolean addAtEnd, Map<Element,List<LocalFeature>> cache)
          Builds the features of an element and appends them to the specified featureList.
protected  List<LocalFeature> buildLocalFeatures(Element element, ElementPosition position, boolean ignoreText)
          Builds the local features of an element.
protected  List<Feature> buildPrior(PriorRecognitions priorRecognitions)
          Builds the pseudo-axis of prior recognitions.
protected  void buildTextFeatures(String axisName, Element element, String trimmedLeft, String trimmedMain, String trimmedRight, LinkedList<Feature> featureList)
          Builds the context representation of text in an element, differentiating between three kinds of textual contents: a left part, a main part, and a right part.
protected  void calculateHeadValues(Element element, List<LocalFeature> values)
          Creates values that depend on "head" children of an element, if the element contains any of them.
protected  void calculatePositionalValues(String elementName, ElementPosition position, List<LocalFeature> values)
          Calculates values that depend on the position of an element within its parent.
protected  void calculateValuesFromText(String elementName, String trimmedText, List<LocalFeature> values)
          Calculates values that depend on the textual content of an element: prefixes, suffixes, length data, and token type.
protected  String determineHeadValue(Element element)
          Helper method for determining the head value for an element of type getHeadElement().
protected  String determineRoughPosition(int position, int elementCount)
          Helper method called by calculatePositionalValues(String, ElementPosition, List) to collapse a position into one of five values.
protected  FeatureVector doBuildContext(Element element, String leftText, String mainText, String rightText, PriorRecognitions priorRecognitions, Map<Element,List<LocalFeature>> featureCache, String logPurpose)
          Builds the context representation of text in an element.
protected  List<Feature> filterRepresentation(FeatureVector originalRep)
          Creates a filtered view of a context representation.
 int getAncestorNumber()
          Returns the maximum number of ancestors to include in the context representation.
 Set getDefaultAttributes()
          Returns the unmodifiable set of names of default attributes.
 int getDetailedRecognitions()
          Returns the number of preceding recognitions to represent in detail.
 String getHeadAttribute()
          Returns the name of the attribute to use for calculating head values.
 String getHeadElement()
          Returns the name of the element to use for calculating head values.
 int getSiblingNumber()
          Returns the basic number of preceding and following siblings to include in the context representation.
protected  void handleAncestors(Element element, int ancestorsToAdd, int ancestorSiblingsToAdd, LinkedList<Feature> ancestorFeatures, LinkedList<Feature> ancestorSiblingFeatures, Bag processedAncestorNames, Map<Element,List<LocalFeature>> cache)
          Handles ancestors and ancestor siblings of an element.
protected  ElementPosition handleSiblings(String axisPrefix, Element element, int baseNumber, LinkedList<Feature> precedingFeatures, LinkedList<Feature> followingFeatures, Map<Element,List<LocalFeature>> cache)
          Adds the preceding and following siblings of an element.
protected  void removeExtraMarkers(List features)
          Modifies a list of GlobalFeatures to remove extraneous FeatureType.MARKER features.
protected  List<Element> selectFollowingSiblings(Element mainElement, LinkedList<Element> allFollowingSiblings, int baseNumber)
          Selects the siblings to keep among all following siblings.
protected  List<Element> selectPrecedingSiblings(Element mainElement, LinkedList<Element> allPrecedingSiblings, int baseNumber)
          Selects the siblings to keep among all preceding siblings.
 String toString()
          Returns a string representation of this object.
 
Methods inherited from class de.fu_berlin.ties.context.AbstractRepresentation
buildContext, getSplitMaximum, getStoreN
 
Methods inherited from class de.fu_berlin.ties.context.Representation
buildContext, buildContext, createRecognitionBuffer, getRecognitionNumber
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

AXIS_ANCESTOR

protected static final String AXIS_ANCESTOR
Ancestor axis.

See Also:
Constant Field Values

AXIS_DESC_OR_SELF

protected static final String AXIS_DESC_OR_SELF
Descendant-or-self axis.

See Also:
Constant Field Values

AXIS_FOLLOW_SIBLING

protected static final String AXIS_FOLLOW_SIBLING
Following sibling axis.

See Also:
Constant Field Values

AXIS_PREC_SIBLING

protected static final String AXIS_PREC_SIBLING
Preceding sibling axis.

See Also:
Constant Field Values

AXIS_PRIOR

protected static final String AXIS_PRIOR
The pseudo-axis of prior recognitions.

See Also:
Constant Field Values

TOKEN_TYPE_PATTERNS

protected static final org.apache.commons.collections.map.LinkedMap TOKEN_TYPE_PATTERNS
An ordered map used by calculateValuesFromText(String, String, List) to determine the "tokenType" value. Maps description strings to the Patterns to match. The description string of the first matching pattern is used. Initialized in the static constructor. The value "mixed" is reserved for tokens that are not matched by any of the patterns.
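
The first-match lookup described above can be sketched as follows. This is a minimal illustration, not the class's actual code: it substitutes java.util.LinkedHashMap for the Commons Collections LinkedMap, the four description/pattern entries are hypothetical, and it assumes a pattern must match the whole token.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class TokenTypeSketch {
    // Insertion-ordered map: description -> pattern (entries are hypothetical).
    static final Map<String, Pattern> TOKEN_TYPE_PATTERNS = new LinkedHashMap<>();

    static {
        TOKEN_TYPE_PATTERNS.put("digits", Pattern.compile("\\d+"));
        TOKEN_TYPE_PATTERNS.put("lowercase", Pattern.compile("\\p{Ll}+"));
        TOKEN_TYPE_PATTERNS.put("capitalized", Pattern.compile("\\p{Lu}\\p{Ll}+"));
        TOKEN_TYPE_PATTERNS.put("uppercase", Pattern.compile("\\p{Lu}+"));
    }

    // Returns the description of the first pattern matching the whole token,
    // or the reserved value "mixed" if no pattern matches.
    static String tokenType(String token) {
        for (Map.Entry<String, Pattern> entry : TOKEN_TYPE_PATTERNS.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                return entry.getKey();
            }
        }
        return "mixed";
    }

    public static void main(String[] args) {
        System.out.println(tokenType("Berlin"));
        System.out.println(tokenType("x86"));
    }
}
```

Because the map is iterated in insertion order, more specific patterns should be registered before more general ones.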

Constructor Detail

DefaultRepresentation

public DefaultRepresentation()
                      throws ProcessingException
Creates a new instance based on the standard configuration.

Throws:
ProcessingException - if an error occurs while initializing this instance

DefaultRepresentation

public DefaultRepresentation(TiesConfiguration config)
                      throws ProcessingException
Creates a new instance based on the provided configuration.

Parameters:
config - used to configure this instance
Throws:
ProcessingException - if an error occurs while initializing this instance

DefaultRepresentation

public DefaultRepresentation(int recogNum,
                             int detailedRecogs,
                             int numberOfAncestors,
                             int numberOfSiblings,
                             int splitMax,
                             int prefixMax,
                             String headElementName,
                             String headAttribName,
                             String[] defaultAttribs,
                             int n,
                             String outCharset,
                             String[] sensorNames,
                             TiesConfiguration config)
                      throws ProcessingException
Creates a new instance. All element/attribute names must be compatible with DOMUtils.name(Element).

Parameters:
recogNum - the number of preceding recognitions to represent
detailedRecogs - the number of preceding recognitions to represent in detail
numberOfAncestors - the maximum number of ancestors to include in the context representation
numberOfSiblings - the basic number of preceding and following siblings to include
splitMax - the maximum number of subsequences to keep when a feature value must be split (at whitespace).
prefixMax - the maximum length of prefixes and suffixes
headElementName - the name of the element to use for calculating head values (cf. calculateHeadValues(Element, List))
headAttribName - the name of the attribute to use for calculating head values
defaultAttribs - array of names of default attributes
n - if > 0, every n-th context representation is stored; otherwise no representation is stored
outCharset - the output character set to use (only used to store some configurations for inspection purposes, if n > 0); if null, the default charset of the current platform is used
sensorNames - array of fully specified names of classes implementing the Sensor interface to be used to look up semantic information
config - used to configure the sensors
Throws:
ProcessingException - if an error occurs while initializing this instance
Method Detail

doBuildContext

protected FeatureVector doBuildContext(Element element,
                                       String leftText,
                                       String mainText,
                                       String rightText,
                                       PriorRecognitions priorRecognitions,
                                       Map<Element,List<LocalFeature>> featureCache,
                                       String logPurpose)
                                throws ClassCastException,
                                       IllegalArgumentException
Builds the context representation of text in an element. Returns a feature vector of all context features considered relevant for representation.

Specified by:
doBuildContext in class AbstractRepresentation
Parameters:
element - the element whose context should be represented
leftText - textual content to the left of (preceding) mainText, might be empty
mainText - the main textual content to represent, might be empty
rightText - textual content to the right of (following) mainText, might be empty
priorRecognitions - a buffer of the last Recognitions from the document, created by calling Representation.createRecognitionBuffer(); might be null
featureCache - a cache of (local) features; should be re-used between all calls for the nodes in a single document (but must not be re-used when building the context of nodes in different documents!)
logPurpose - the type of contexts of main interest to the caller (e.g. "Token" or "Sentence"), used for logging
Returns:
a vector of features considered relevant for representation
Throws:
ClassCastException - if the priorRecognitions buffer contains objects that aren't Recognitions
IllegalArgumentException - if the specified node is of an unsupported type

buildFeatures

protected void buildFeatures(String axisName,
                             Element element,
                             ElementPosition position,
                             boolean recurseInsteadOfText,
                             LinkedList<Feature> featureList,
                             boolean addAtEnd,
                             Map<Element,List<LocalFeature>> cache)
Builds the features of an element and appends them to the specified featureList. Handles attributes and calculated values and the element itself. Child elements are only handled when recurseInsteadOfText is true -- the axis name is not changed for child elements.

Parameters:
axisName - the name of the axis, used to start each feature
element - the element to process
position - wraps the position of the element within its parent element and related data, used for calculating positional features; if null, no positional features are calculated
recurseInsteadOfText - if true child elements are recursively processed, but no values from text are calculated; otherwise text is processed but child elements are not
featureList - the list of GlobalFeatures to add features to
addAtEnd - whether to add the new features at the end or at the beginning of the featureList
cache - a cache of local features, mapping Elements to Lists of LocalFeatures

buildLocalFeatures

protected List<LocalFeature> buildLocalFeatures(Element element,
                                                ElementPosition position,
                                                boolean ignoreText)
Builds the local features of an element. Handles attributes and calculated values and the element itself, but no child elements. The LocalFeatures can be stored in a cache for re-use; they must be combined with an axis name for getting a global feature.

Parameters:
element - the element to process
position - wraps the position of the element within its parent element and related data, used for calculating positional features; if null, no positional features are calculated
ignoreText - if true, no values from text or head values are calculated
Returns:
a List of the created LocalFeatures

buildPrior

protected List<Feature> buildPrior(PriorRecognitions priorRecognitions)
                            throws ClassCastException
Builds the pseudo-axis of prior recognitions.

Parameters:
priorRecognitions - a buffer of the last Recognitions from the document, created by calling Representation.createRecognitionBuffer()
Returns:
a list of GlobalFeatures representing prior recognitions
Throws:
ClassCastException - if the priorRecognitions buffer contains objects that aren't Recognitions

buildTextFeatures

protected void buildTextFeatures(String axisName,
                                 Element element,
                                 String trimmedLeft,
                                 String trimmedMain,
                                 String trimmedRight,
                                 LinkedList<Feature> featureList)
Builds the context representation of text in an element, differentiating between three kinds of textual contents: a left part, a main part, and a right part. Each text part is optional (if all of them are empty, this method does nothing).

Parameters:
axisName - the name of the axis, used to start each feature
element - the element whose context should be represented
trimmedLeft - trimmed textual content to the left of (preceding) trimmedMain, might be empty
trimmedMain - trimmed main textual content to represent, might be empty
trimmedRight - trimmed textual content to the right of (following) trimmedMain, might be empty
featureList - a list of GlobalFeatures to add the values to

calculateHeadValues

protected void calculateHeadValues(Element element,
                                   List<LocalFeature> values)
Creates values that depend on "head" children of an element, if the element contains any of them. This method is meant for elements that don't contain any textual content, so their children are checked instead. This implementation proceeds as follows:

If a value contains whitespace, only the final subsequence following all whitespace is preserved.

Parameters:
element - the element to process
values - a list of LocalFeatures to add the calculated values to
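
The whitespace rule stated above (only the final subsequence following all whitespace is preserved) might look like this in isolation. The helper name lastSubsequence is hypothetical, and the sketch assumes the value is trimmed first so that trailing whitespace does not yield an empty result.

```java
final class HeadValueSketch {
    // Keeps only the last whitespace-separated subsequence of a value.
    // Assumption: the value is trimmed before splitting.
    static String lastSubsequence(String value) {
        String[] parts = value.trim().split("\\s+");
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastSubsequence("Freie Universitaet Berlin"));
    }
}
```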

calculatePositionalValues

protected void calculatePositionalValues(String elementName,
                                         ElementPosition position,
                                         List<LocalFeature> values)
Calculates values that depend on the position of an element within its parent.

Parameters:
elementName - the name of the element to process, as returned by DOMUtils.name(Element)
position - wraps the position of the element within its parent element and related data, must not be null
values - a list to add the calculated values to

calculateValuesFromText

protected void calculateValuesFromText(String elementName,
                                       String trimmedText,
                                       List<LocalFeature> values)
                                throws IllegalArgumentException
Calculates values that depend on the textual content of an element: prefixes, suffixes, length data, and token type. Also includes the text itself.

Parameters:
elementName - the name of the element to process, as returned by DOMUtils.name(Element)
trimmedText - the trimmed textual content of the element to process, must not be empty
values - a list of LocalFeatures to add the calculated values to
Throws:
IllegalArgumentException - if the empty string was given as trimmedText
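
As a rough illustration of the prefix, suffix, and length values mentioned above, consider the sketch below. The value names ("prefix1", "suffix1", "length") and the restriction to proper prefixes and suffixes are assumptions made for this example; the actual feature naming of this class is not documented here. The prefixMax parameter mirrors the constructor argument of the same name.

```java
import java.util.ArrayList;
import java.util.List;

final class AffixSketch {
    // Collects prefixes and suffixes of length 1..prefixMax (shorter than the
    // whole text) plus the text length. Value names are hypothetical.
    static List<String> affixes(String trimmedText, int prefixMax) {
        if (trimmedText.isEmpty()) {
            throw new IllegalArgumentException("trimmedText must not be empty");
        }
        List<String> values = new ArrayList<>();
        int length = trimmedText.length();
        int max = Math.min(prefixMax, length - 1);
        for (int i = 1; i <= max; i++) {
            values.add("prefix" + i + "=" + trimmedText.substring(0, i));
            values.add("suffix" + i + "=" + trimmedText.substring(length - i));
        }
        values.add("length=" + length);
        return values;
    }

    public static void main(String[] args) {
        System.out.println(affixes("Berlin", 2));
    }
}
```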

determineHeadValue

protected String determineHeadValue(Element element)
Helper method for determining the head value for an element of type getHeadElement(). See calculateHeadValues(Element, List) for a description of the algorithm.

Parameters:
element - the element to process, must be of type getHeadElement()
Returns:
the calculated head value, or the empty string if no value could be calculated (the specified element doesn't have any children, or the right-most child contains neither getHeadAttribute() nor textual content)

determineRoughPosition

protected String determineRoughPosition(int position,
                                        int elementCount)
Helper method called by calculatePositionalValues(String, ElementPosition, List) to collapse a position into one of five values.
first - for the first element
early - if position is within the first third of all elements (but not the first one), upper limit included
middle - if position is within the second third of all elements, limits excluded
late - if position is within the last third of all elements (but not the last one), lower limit included
last - for the last element

Parameters:
position - the position counted from 0, should be non-negative and smaller than elementCount (otherwise the results are undefined)
elementCount - the number of elements
Returns:
a String containing one of the five values described above
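
One way to implement the five-value mapping is sketched below. It assumes the thirds are measured against the last index (elementCount - 1) and that "first" takes precedence when there is only one element; the documentation does not state either detail, so treat both as assumptions.

```java
final class RoughPositionSketch {
    // Collapses a 0-based position into first/early/middle/late/last.
    // Assumption: thirds are computed relative to the last index.
    static String determineRoughPosition(int position, int elementCount) {
        int last = elementCount - 1;
        if (position == 0) {
            return "first";
        }
        if (position == last) {
            return "last";
        }
        if (3 * position <= last) {
            return "early";   // first third, upper limit included
        }
        if (3 * position >= 2 * last) {
            return "late";    // last third, lower limit included
        }
        return "middle";      // strictly between the two limits
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            System.out.println(i + " -> " + determineRoughPosition(i, 10));
        }
    }
}
```

Integer arithmetic is used for the comparisons so that the inclusive limits are not affected by floating-point rounding.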

filterRepresentation

protected List<Feature> filterRepresentation(FeatureVector originalRep)
Creates a filtered view of a context representation. A filtered representation is created for each type of features that occurs at least twice; attributes and calculated values occurring only once are omitted. For textual content, another filtered representation is created.

Features representing markers (FeatureType.MARKER), stand-alone elements (FeatureType.ELEMENT) and default attributes (getDefaultAttributes()) are included in all filtered representations. Comment-only features are ignored.

Parameters:
originalRep - a feature vector containing the representation to filter
Returns:
a list of Features combining the filtered representations created by this method

getAncestorNumber

public int getAncestorNumber()
Returns the maximum number of ancestors to include in the context representation.

Returns:
the number of ancestors to include

getDefaultAttributes

public Set getDefaultAttributes()
Returns the unmodifiable set of names of default attributes. The contained names are Strings in a format compatible with DOMUtils.name(Attribute).

Returns:
the set of default attributes

getDetailedRecognitions

public int getDetailedRecognitions()
Returns the number of preceding recognitions to represent in detail.

Returns:
the number of preceding recognitions represented in detail

getHeadAttribute

public String getHeadAttribute()
Returns the name of the attribute to use for calculating head values.

Returns:
the name of the attribute
See Also:
calculateHeadValues(Element, List)

getHeadElement

public String getHeadElement()
Returns the name of the element to use for calculating head values.

Returns:
the name of the element
See Also:
calculateHeadValues(Element, List)

getSiblingNumber

public int getSiblingNumber()
Returns the basic number of preceding and following siblings to include in the context representation.

Returns:
the number of siblings to include

handleAncestors

protected void handleAncestors(Element element,
                               int ancestorsToAdd,
                               int ancestorSiblingsToAdd,
                               LinkedList<Feature> ancestorFeatures,
                               LinkedList<Feature> ancestorSiblingFeatures,
                               Bag processedAncestorNames,
                               Map<Element,List<LocalFeature>> cache)
                        throws IllegalArgumentException
Handles ancestors and ancestor siblings of an element.

Parameters:
element - the element to process
ancestorsToAdd - the number of ancestors to add, must be > 0; if > 1, this method calls itself recursively, decreasing the number by 1
ancestorSiblingsToAdd - the number of preceding/following siblings of the current ancestors to add; if 0 or negative, no siblings are added; a recursive call decreases this parameter by 1 if any ancestor siblings were found (if no siblings were found, the number passes unchanged)
ancestorFeatures - the list of GlobalFeatures to prepend the features on the ancestors to
ancestorSiblingFeatures - the list of GlobalFeatures to append the features on the ancestors' siblings to
processedAncestorNames - a bag that will typically be empty when first calling this method (will be filled by recursive calls)
cache - a cache of local features, mapping Elements to Lists of LocalFeatures
Throws:
IllegalArgumentException - if ancestorsToAdd is 0 or negative

handleSiblings

protected ElementPosition handleSiblings(String axisPrefix,
                                         Element element,
                                         int baseNumber,
                                         LinkedList<Feature> precedingFeatures,
                                         LinkedList<Feature> followingFeatures,
                                         Map<Element,List<LocalFeature>> cache)
Adds the preceding and following siblings of an element.

Parameters:
axisPrefix - the prefix of the axis name, used to start each feature; specify the empty string if no prefix should be used
element - the element to process
baseNumber - the basic number of siblings to keep; the actual number of siblings kept might vary
precedingFeatures - the list of GlobalFeatures to prepend the features on the preceding siblings to
followingFeatures - the list of GlobalFeatures to append the features on the following siblings to
cache - a cache of local features, mapping Elements to Lists of LocalFeatures
Returns:
a wrapper of the position of the element within its parent element and related data, required by calculatePositionalValues(String, ElementPosition, List), or null if there is no parent element

removeExtraMarkers

protected void removeExtraMarkers(List features)
Modifies a list of GlobalFeatures to remove extraneous FeatureType.MARKER features. Keeps only the last one of several sequential marker features; trailing marker features are removed as well.

Parameters:
features - the list of features to modify
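
The marker clean-up rule can be sketched generically as follows. The predicate isMarker stands in for a check against FeatureType.MARKER and is an assumption of this example, as is the use of plain strings as features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class MarkerSketch {
    // Drops every marker that is directly followed by another marker and
    // every marker at the end of the list; all other items are kept in order.
    static <T> void removeExtraMarkers(List<T> features, Predicate<? super T> isMarker) {
        List<T> kept = new ArrayList<>();
        for (int i = 0; i < features.size(); i++) {
            T current = features.get(i);
            boolean lastItem = (i == features.size() - 1);
            boolean nextIsMarker = !lastItem && isMarker.test(features.get(i + 1));
            if (isMarker.test(current) && (lastItem || nextIsMarker)) {
                continue; // extraneous marker
            }
            kept.add(current);
        }
        features.clear();
        features.addAll(kept);
    }

    public static void main(String[] args) {
        List<String> features = new ArrayList<>(Arrays.asList("a", "|", "|", "b", "|", "|"));
        removeExtraMarkers(features, "|"::equals);
        System.out.println(features);
    }
}
```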

selectFollowingSiblings

protected List<Element> selectFollowingSiblings(Element mainElement,
                                                LinkedList<Element> allFollowingSiblings,
                                                int baseNumber)
Selects the siblings to keep among all following siblings. This implementation always keeps the baseNumber first siblings. One of the selected siblings may have a different type (name as returned by DOMUtils.name(Element)) than the main element -- if there are more with different types, they are skipped.

Parameters:
mainElement - the element whose siblings should be selected
allFollowingSiblings - the list of all following siblings
baseNumber - the basic number of siblings to keep; the actual number of siblings kept might vary
Returns:
the subset of following siblings to keep; an empty list if there are no siblings to keep
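
The selection rule leaves some room for interpretation. The sketch below shows one possible reading, using element names (plain strings) instead of Elements: siblings are visited in document order until baseNumber of them are kept, at most one sibling of a different type is accepted, and further siblings of a different type are skipped without being counted. This is an assumption, not a description of the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SiblingSketch {
    // Selects up to baseNumber following siblings by name, accepting at most
    // one whose name differs from the main element's name.
    static List<String> selectFollowing(String mainName, List<String> following, int baseNumber) {
        List<String> kept = new ArrayList<>();
        boolean differentTypeKept = false;
        for (String name : following) {
            if (kept.size() >= baseNumber) {
                break;
            }
            if (!name.equals(mainName)) {
                if (differentTypeKept) {
                    continue; // skip additional siblings of a different type
                }
                differentTypeKept = true;
            }
            kept.add(name);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(selectFollowing("td", Arrays.asList("td", "th", "tr", "td", "td"), 3));
    }
}
```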

selectPrecedingSiblings

protected List<Element> selectPrecedingSiblings(Element mainElement,
                                                LinkedList<Element> allPrecedingSiblings,
                                                int baseNumber)
Selects the siblings to keep among all preceding siblings. This implementation always keeps the first sibling and the baseNumber last siblings. One of the last siblings may have a different type (name as returned by DOMUtils.name(Element)) than the main element -- if there are more with different types, they are skipped. If none of the baseNumber last siblings has a different type, the last sibling with a different type (name) is also kept.

Parameters:
mainElement - the element whose siblings should be selected
allPrecedingSiblings - the list of all preceding siblings
baseNumber - the basic number of siblings to keep; the actual number of siblings kept might vary
Returns:
the subset of preceding siblings to keep; an empty list if there are no siblings to keep

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class AbstractRepresentation
Returns:
a textual representation


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.