java.lang.Object
  de.fu_berlin.ties.text.TokenizerFactory
Factory for creating TextTokenizers of different types.
Field Summary
static String | CONFIG_TOKEN_PATTERNS | Configuration key for the array of regular expressions defining the token types accepted by the tokenizer.
static String | CONFIG_WHITESPACE_PATTERN | Configuration key for the regular expression giving the whitespace accepted by the tokenizer.
static String | WHITESPACE_CONTROL_OTHER | Pattern string capturing whitespace and control/other characters.
Constructor Summary
TokenizerFactory(TiesConfiguration config) | Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration.
TokenizerFactory(TiesConfiguration config, String suffix) | Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration, adapted by appending the given suffix to each key.
Method Summary
static TextTokenizer | createAlnumTokenizer(CharSequence text) | Static factory method to create an instance for tokenizing alphanumeric sequences, symbol sequences, and punctuation.
static TextTokenizer | createCategoryTokenizer(CharSequence text) | Static factory method to create an instance for tokenizing according to Unicode categories.
static TextTokenizer | createThoroughTokenizer(CharSequence text) | Static factory method to create an instance that uses the "thorough" patterns.
TextTokenizer | createTokenizer(CharSequence text) | Factory method to create an instance from the configured token and whitespace patterns.
String | toString() | Returns a string representation of this object.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
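Field Detail
public static final String CONFIG_TOKEN_PATTERNS
Configuration key for the array of regular expressions defining the token types accepted by the tokenizer.
public static final String CONFIG_WHITESPACE_PATTERN
Configuration key for the regular expression giving the whitespace accepted by the tokenizer.
public static final String WHITESPACE_CONTROL_OTHER
Pattern string capturing whitespace and control/other characters.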
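The value of WHITESPACE_CONTROL_OTHER is not shown on this page. As a rough illustration only, the sketch below assumes it behaves like a character class over the Unicode "Z" (separator) and "C" (control/other) categories and uses plain java.util.regex to split text on such runs. The class name WhitespaceSketch and the constant WHITESPACE_SKETCH are hypothetical and not part of TIES.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WhitespaceSketch {
    // Assumed stand-in for the real constant, whose value is not documented here:
    // one or more characters from the Unicode "Z" or "C" categories.
    static final String WHITESPACE_SKETCH = "[\\p{Z}\\p{C}]+";

    /** Splits the text on runs of separator and control/other characters. */
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(Pattern.compile(WHITESPACE_SKETCH).split(text));
    }

    public static void main(String[] args) {
        // Tab and newline are control characters (Cc); U+00A0 is a space separator (Zs).
        System.out.println(splitOnWhitespace("one\ttwo\n three\u00A0four"));
    }
}
```

Note that tab and newline fall under "C" rather than "Z", which is why a whitespace pattern limited to separators would not be enough.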
Constructor Detail
public TokenizerFactory(TiesConfiguration config)
Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration.
Parameters:
config - the configuration to use
public TokenizerFactory(TiesConfiguration config, String suffix)
Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration, adapted by appending the given suffix to each key.
Parameters:
config - the configuration to use
suffix - the suffix to append to the keys
Method Detail
public static TextTokenizer createAlnumTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing alphanumeric sequences, symbol sequences, and punctuation.
When you are only interested in words and numbers (e.g. for indexing), you can use the captured text (see TextTokenizer.capturedText()): it contains the full token for alphanumeric sequences and is empty for symbols and punctuation.
The whitespace pattern matches a sequence of whitespace and control/other characters (Unicode categories "Z" and "C").
Parameters:
text - the text to tokenize
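The captured-text idea can be sketched with plain java.util.regex: a capturing group wraps only the alphanumeric alternative, so symbol and punctuation tokens match without capturing anything. The pattern and the class name CapturedTextSketch below are assumptions made for illustration; they are not the patterns TIES uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapturedTextSketch {
    // Hypothetical token pattern (an assumption, not the TIES pattern):
    // group 1 captures alphanumeric runs; symbol runs and single
    // punctuation characters match without any capturing group.
    static final Pattern TOKEN =
        Pattern.compile("([\\p{L}\\p{N}]+)|\\p{S}+|\\p{P}");

    /** Returns every token found in the text, in order. */
    static List<String> tokens(CharSequence text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    /** Returns only the non-empty captured texts, i.e. words and numbers. */
    static List<String> capturedWords(CharSequence text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String captured = m.group(1); // null when group 1 did not take part in the match
            if (captured != null && !captured.isEmpty()) {
                result.add(captured);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Price: 42 EUR (+3%)";
        System.out.println(tokens(text));
        System.out.println(capturedWords(text));
    }
}
```

An indexer would consume only the second list, while a full tokenization pass would use the first.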
public static TextTokenizer createCategoryTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing according to Unicode categories.
When you are only interested in words and numbers (e.g. for indexing), you can use the captured text (see TextTokenizer.capturedText()): it contains the full token for letter and digit sequences and is empty for symbols and punctuation.
The whitespace pattern matches a sequence of whitespace and control/other characters (Unicode categories "Z" and "C").
Parameters:
text - the text to tokenize
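To illustrate what tokenizing by Unicode category might look like, the sketch below assumes that each token is a run of characters from a single major category, so letters and digits are separated even when adjacent. This reading, the pattern, and the class name CategoryTokensSketch are assumptions for illustration, not the TIES implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryTokensSketch {
    // Hypothetical pattern (an assumption, not the TIES pattern): group 1
    // captures a run of letters or a run of numbers; punctuation runs and
    // symbol runs match without a capturing group.
    static final Pattern TOKEN =
        Pattern.compile("(\\p{L}+|\\p{N}+)|\\p{P}+|\\p{S}+");

    /** Returns every token found in the text, in order. */
    static List<String> tokens(CharSequence text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "R2" splits into a letter token and a number token.
        System.out.println(tokens("R2-D2 costs $10"));
    }
}
```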
public static TextTokenizer createThoroughTokenizer(CharSequence text)
Static factory method to create an instance that uses the "thorough" patterns.
These patterns don't contain any useful information for TextTokenizer.capturedText().
The whitespace pattern matches a sequence of whitespace and control/other characters.
Parameters:
text - the text to tokenize
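public TextTokenizer createTokenizer(CharSequence text)
Factory method to create an instance from the configured token and whitespace patterns.
Parameters:
text - the text to tokenize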
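The split between construction and per-text creation can be sketched as follows: the patterns are compiled once, then reused for every text. This is a simplified stand-in and not the TIES classes. The class FactorySketch, its constructor arguments (standing in for the values read from the configuration keys), and the rule that token patterns are tried in array order are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FactorySketch {
    private final Pattern[] tokenPatterns;
    private final Pattern whitespace;

    /** Compiles the patterns once; stands in for reading them from a configuration. */
    FactorySketch(String[] tokenRegexes, String whitespaceRegex) {
        tokenPatterns = new Pattern[tokenRegexes.length];
        for (int i = 0; i < tokenRegexes.length; i++) {
            tokenPatterns[i] = Pattern.compile(tokenRegexes[i]);
        }
        whitespace = Pattern.compile(whitespaceRegex);
    }

    /**
     * Tokenizes the text: skips whitespace, then takes the first token
     * pattern (in array order, an assumption) that matches at the
     * current position. Characters no pattern accepts are skipped.
     */
    List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            Matcher ws = whitespace.matcher(text).region(pos, text.length());
            if (ws.lookingAt() && ws.end() > pos) {
                pos = ws.end();
                continue;
            }
            boolean matched = false;
            for (Pattern p : tokenPatterns) {
                Matcher m = p.matcher(text).region(pos, text.length());
                if (m.lookingAt() && m.end() > pos) {
                    tokens.add(m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Advance by one full code point so surrogate pairs stay intact.
                pos += Character.charCount(Character.codePointAt(text, pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        FactorySketch factory =
            new FactorySketch(new String[] {"\\p{L}+", "\\p{N}+"}, "\\s+");
        System.out.println(factory.tokenize("abc 123 !x"));
    }
}
```

One factory instance can serve many texts, which is the usual reason for separating pattern configuration from tokenizer creation.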
public String toString()
Returns a string representation of this object.