de.fu_berlin.ties.text
Class FieldTokenizingExtractor

java.lang.Object
  extended by de.fu_berlin.ties.text.TokenizingExtractor
      extended by de.fu_berlin.ties.text.FieldTokenizingExtractor
All Implemented Interfaces:
FeatureExtractor

public class FieldTokenizingExtractor
extends TokenizingExtractor

A tokenizing extractor that prepends the field name to each token. The default implementation is meant for e-mail (or newsgroup) messages: each e-mail header is converted into a field (using the header name as prefix, e.g. "Subject:"); the whole body (including attachments) is treated as a single field (using the empty string as prefix). The first token is also considered the beginning of a new field (e.g. start of the "From " line).
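
The prefixing scheme described above can be illustrated with a minimal, self-contained sketch. It is not the TIES implementation: the class name FieldPrefixSketch, the method prefixTokens, the header regex, and the whitespace-based tokenization are all assumptions made for illustration. The sketch also assumes that the separator is inserted even when the field name is empty, and it does not model the special handling of the first token (the "From " line).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical stand-in, not part of the TIES API. */
final class FieldPrefixSketch {
    // Assumed header-name pattern: printable ASCII without colons or spaces,
    // terminated by a colon, at the very start of a line.
    private static final Pattern HEADER_NAME = Pattern.compile("^([!-9;-~]+:)");

    // Separator between field name and token: '_' is character code 95.
    private static final char SEP = '_';

    /** Splits a message into whitespace-delimited tokens, each prefixed by its field. */
    static List<String> prefixTokens(String message) {
        List<String> features = new ArrayList<>();
        String field = "";      // current field name; "" once the body starts
        boolean inBody = false;
        for (String line : message.split("\r?\n", -1)) {
            if (!inBody) {
                if (line.isEmpty()) {
                    // Empty line = two consecutive line breaks: the body begins.
                    inBody = true;
                    field = "";
                    continue;
                }
                Matcher m = HEADER_NAME.matcher(line);
                if (m.find()) {
                    // A new header starts; the rest of the line is its content.
                    field = m.group(1);
                    line = line.substring(m.end());
                }
                // Otherwise: a folded continuation line that keeps the current field.
            }
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    features.add(field + SEP + token);
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String msg = "Subject: Hello world\nTo: bob\n\nBody text";
        // Prints [Subject:_Hello, Subject:_world, To:_bob, _Body, _text]
        System.out.println(prefixTokens(msg));
    }
}
```

Under these assumptions the same word yields different features depending on where it occurs: "Hello" in the subject line becomes "Subject:_Hello", while "Hello" in the body becomes "_Hello".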

Version:
$Revision: 1.6 $, $Date: 2006/10/21 16:04:25 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
protected static Pattern END_OF_FIELDS_WS
          Pattern matching the whitespace that marks the end of all regular fields (i.e. the beginning of the message body): must contain two consecutive line breaks.
protected static char FIELD_SEP
          Separator character inserted between field name and actual token: '_' (character code 95).
protected static Pattern FIELDNAME
          Pattern matching an RFC2822-style header name: a sequence of printable characters terminated by a colon and not containing any other colons.
protected static String FINAL_FIELDNAME
          The name of the final field (used after END_OF_FIELDS_WS has matched): "" (the empty string).
protected static Pattern PRE_FIELDNAME_WS
          Pattern matching the whitespace that must occur in front of an RFC2822-style header name: must end in a line break.
 
Constructor Summary
FieldTokenizingExtractor(TiesConfiguration conf, String suffix)
          Creates a new instance.
 
Method Summary
 FeatureVector buildFeatures(Reader reader)
          Extracts a vector of relevant features from a text sequence.
 
Methods inherited from class de.fu_berlin.ties.text.TokenizingExtractor
getTokenizer, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

FIELDNAME

protected static final Pattern FIELDNAME
Pattern matching an RFC2822-style header name: a sequence of printable characters terminated by a colon and not containing any other colons.
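
The actual regular expression is not shown on this page. As a rough sketch of what such a pattern could look like, the following uses a hypothetical expression (FIELDNAME_GUESS, an assumption, not the value of the TIES constant) that accepts printable ASCII characters other than colon and space, followed by a single terminating colon.

```java
import java.util.regex.Pattern;

/** Illustrative only: a guess at a header-name pattern, not the TIES constant. */
final class FieldNameSketch {
    // '!'..'9' and ';'..'~' cover printable ASCII (codes 33-126) except ':' (code 58).
    static final Pattern FIELDNAME_GUESS = Pattern.compile("[!-9;-~]+:");

    /** Returns true if the whole string looks like a header name such as "Subject:". */
    static boolean isFieldName(String s) {
        return FIELDNAME_GUESS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFieldName("Subject:"));  // true
        System.out.println(isFieldName("X-Mailer:")); // true
        System.out.println(isFieldName("Subject"));   // false: no terminating colon
        System.out.println(isFieldName("a:b:"));      // false: contains another colon
    }
}
```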


PRE_FIELDNAME_WS

protected static final Pattern PRE_FIELDNAME_WS
Pattern matching the whitespace that must occur in front of an RFC2822-style header name: must end in a line break.


END_OF_FIELDS_WS

protected static final Pattern END_OF_FIELDS_WS
Pattern matching the whitespace that marks the end of all regular fields (i.e. the beginning of the message body): must contain two consecutive line breaks.
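
To make the difference between the two whitespace patterns concrete, the sketch below defines hypothetical equivalents of PRE_FIELDNAME_WS and END_OF_FIELDS_WS. Both regular expressions are assumptions derived only from the descriptions on this page ("must end in a line break" and "must contain two consecutive line breaks"); the real constants may differ.

```java
import java.util.regex.Pattern;

/** Illustrative only: guessed equivalents of the two whitespace patterns. */
final class HeaderWhitespaceSketch {
    // Assumed: a whitespace run whose last character is a line feed.
    static final Pattern PRE_FIELDNAME_WS_GUESS = Pattern.compile("\\s*\\n");

    // Assumed: a whitespace run containing two line breaks (LF or CRLF)
    // that directly follow each other.
    static final Pattern END_OF_FIELDS_WS_GUESS =
            Pattern.compile("\\s*(?:\\r?\\n){2}\\s*");

    /** True if a header name may follow this whitespace. */
    static boolean precedesFieldName(String ws) {
        return PRE_FIELDNAME_WS_GUESS.matcher(ws).matches();
    }

    /** True if this whitespace separates the headers from the body. */
    static boolean endsFields(String ws) {
        return END_OF_FIELDS_WS_GUESS.matcher(ws).matches();
    }

    public static void main(String[] args) {
        System.out.println(precedesFieldName("\r\n")); // true
        System.out.println(precedesFieldName(" "));    // false: no line break
        System.out.println(endsFields("\r\n\r\n"));    // true
        System.out.println(endsFields("\n \n"));       // false: breaks not consecutive
    }
}
```

With these guessed expressions, whitespace such as "\n\n" satisfies both patterns, so code built on them would need to test for the end of the fields before testing for a new header name.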


FINAL_FIELDNAME

protected static final String FINAL_FIELDNAME
The name of the final field (used after END_OF_FIELDS_WS has matched): "" (the empty string).

See Also:
Constant Field Values

FIELD_SEP

protected static final char FIELD_SEP
Separator character inserted between field name and actual token: '_' (character code 95).

See Also:
Constant Field Values

Constructor Detail

FieldTokenizingExtractor

public FieldTokenizingExtractor(TiesConfiguration conf,
                                String suffix)
Creates a new instance.

Parameters:
conf - used to configure this instance
suffix - optional suffix used to adapt configuration keys (applied only if not null)

Method Detail

buildFeatures

public FeatureVector buildFeatures(Reader reader)
                            throws IOException
Extracts a vector of relevant features from a text sequence.

Specified by:
buildFeatures in interface FeatureExtractor
Overrides:
buildFeatures in class TokenizingExtractor
Parameters:
reader - a reader containing the text to represent
Returns:
a feature vector representing the input text sequence
Throws:
IOException - if an I/O error occurs while reading the input


Copyright © 2003-2007 Christian Siefkes. All Rights Reserved.