java.lang.Object
de.fu_berlin.ties.text.TokenizingExtractor
de.fu_berlin.ties.text.FieldTokenizingExtractor
public class FieldTokenizingExtractor extends TokenizingExtractor
A tokenizing extractor that prepends the field name to each token. The default implementation is meant for e-mail (or newsgroup) messages: each e-mail header is converted into a field (using the header name as prefix, e.g. "Subject:"); the whole body (including attachments) is treated as a single field (using the empty string as prefix). The very first token is also considered the beginning of a new field (e.g. the start of the "From " line).
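The prefixing scheme described above can be illustrated with a minimal, self-contained sketch. It does not use the TIES classes and is not the library's implementation: the class `FieldPrefixSketch`, the helpers `isHeaderName` and `prefixTokens`, the whitespace tokenization, and the decisions to consume the header name itself and to insert the separator even for the empty body prefix are all assumptions made for illustration. Only the separator value (character 95, the underscore) and the empty final field name come from this page.

```java
import java.util.ArrayList;
import java.util.List;

class FieldPrefixSketch {
    // Mirrors the documented FIELD_SEP value: character 95, the underscore.
    static final char FIELD_SEP = '_';
    // Mirrors the documented FINAL_FIELDNAME: the empty string, used for the body.
    static final String FINAL_FIELDNAME = "";

    // Hypothetical check: a header name ends in a colon and contains no other colon.
    static boolean isHeaderName(String token) {
        return token.length() > 1 && token.indexOf(':') == token.length() - 1;
    }

    // Splits on whitespace (an assumption) and prefixes each token with its field name.
    static List<String> prefixTokens(String message) {
        List<String> out = new ArrayList<>();
        // Two consecutive line breaks separate the headers from the body.
        String[] parts = message.split("\n\n", 2);
        String field = "";
        boolean first = true;
        for (String line : parts[0].split("\n")) {
            // Indented lines continue the previous header.
            boolean continuation = line.startsWith(" ") || line.startsWith("\t");
            String[] tokens = line.trim().split("\\s+");
            int start = 0;
            // The very first token always starts a field (e.g. the "From " line).
            if (first || (!continuation && isHeaderName(tokens[0]))) {
                field = tokens[0];
                start = 1; // assumption: the field name is not emitted as a token
            }
            first = false;
            for (int i = start; i < tokens.length; i++) {
                if (!tokens[i].isEmpty()) out.add(field + FIELD_SEP + tokens[i]);
            }
        }
        if (parts.length > 1) {
            // The whole body is a single field with the empty prefix.
            for (String token : parts[1].trim().split("\\s+")) {
                if (!token.isEmpty()) out.add(FINAL_FIELDNAME + FIELD_SEP + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixTokens(
            "From alice@example.org\nSubject: Hello world\n\nHi there"));
    }
}
```

Running `main` prints `[From_alice@example.org, Subject:_Hello, Subject:_world, _Hi, _there]`. The same word therefore yields different features depending on whether it occurs in a header or in the body.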
Field Summary | |
---|---|
protected static Pattern | END_OF_FIELDS_WS: Pattern matching the whitespace that marks the end of all regular field names (i.e. the beginning of the message body): must contain two consecutive line breaks. |
protected static char | FIELD_SEP: Separator character inserted between the field name and the actual token: 95 (the character code of '_', the underscore). |
protected static Pattern | FIELDNAME: Pattern matching an RFC2822-style header name: a sequence of printable characters terminated by a colon and not containing any other colons. |
protected static String | FINAL_FIELDNAME: The name of the final field (used after END_OF_FIELDS_WS has matched): "" (the empty string). |
protected static Pattern | PRE_FIELDNAME_WS: Pattern matching the whitespace that must occur in front of an RFC2822-style header name: must end in a line break. |
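The regular expressions behind these constants are not shown on this page. The sketch below gives one plausible reading of the three descriptions using `java.util.regex`. The class `HeaderPatternSketch`, its helper methods, and all three expressions are hypothetical stand-ins and may differ from the patterns TIES actually uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderPatternSketch {
    // Assumed: visible characters other than ':', terminated by a single colon.
    static final Pattern FIELDNAME = Pattern.compile("[\\p{Graph}&&[^:]]+:");
    // Assumed: any run of whitespace whose last character is a line break.
    static final Pattern PRE_FIELDNAME_WS = Pattern.compile("\\s*[\\r\\n]");
    // Assumed: whitespace containing two consecutive line breaks.
    static final Pattern END_OF_FIELDS_WS = Pattern.compile("\\s*?(?:\\r?\\n){2}\\s*");

    // True if the whole token looks like a header name such as "Subject:".
    static boolean isFieldName(String token) {
        return FIELDNAME.matcher(token).matches();
    }

    // True if the whitespace run could precede a header name.
    static boolean mayPrecedeFieldName(String whitespace) {
        return PRE_FIELDNAME_WS.matcher(whitespace).matches();
    }

    // Index at which the message body starts, or -1 if no header/body boundary exists.
    static int bodyStart(String message) {
        Matcher m = END_OF_FIELDS_WS.matcher(message);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String message = "Subject: Hi\n\nBody";
        System.out.println(isFieldName("Subject:"));
        System.out.println(message.substring(bodyStart(message)));
    }
}
```

Keeping the three patterns separate, as the field list suggests, lets a subclass adjust header detection without changing how the body boundary is found.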
Constructor Summary | |
---|---|
FieldTokenizingExtractor(TiesConfiguration conf, String suffix) | Creates a new instance. |
Method Summary | |
---|---|
FeatureVector | buildFeatures(Reader reader): Extracts a vector of relevant features from a text sequence. |
Methods inherited from class de.fu_berlin.ties.text.TokenizingExtractor |
---|
getTokenizer, toString |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
protected static final Pattern FIELDNAME
protected static final Pattern PRE_FIELDNAME_WS
protected static final Pattern END_OF_FIELDS_WS
protected static final String FINAL_FIELDNAME
The name of the final field (used after END_OF_FIELDS_WS has matched): "" (the empty string).
protected static final char FIELD_SEP
Constructor Detail |
---|
public FieldTokenizingExtractor(TiesConfiguration conf, String suffix)
conf - used to configure this instance
suffix - optional suffix for adapting configuration keys, if not null
Method Detail |
---|
public FeatureVector buildFeatures(Reader reader) throws IOException
buildFeatures in interface FeatureExtractor
buildFeatures in class TokenizingExtractor
reader - a reader containing the text to represent
IOException - if an I/O error occurs while reading the input