de.fu_berlin.ties.text
Class TokenizerFactory

java.lang.Object
  extended by de.fu_berlin.ties.text.TokenizerFactory

public class TokenizerFactory
extends Object

Factory for creating TextTokenizers of different types.

Version:
$Revision: 1.5 $, $Date: 2004/04/13 07:08:35 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String CONFIG_TOKEN_PATTERNS
          Configuration key for the array of regular expressions defining the token types accepted by the tokenizer.
static String CONFIG_WHITESPACE_PATTERN
          Configuration key for the regular expression giving the whitespace accepted by the tokenizer.
static String WHITESPACE_CONTROL_OTHER
          Pattern string capturing whitespace and control/other characters.
 
Constructor Summary
TokenizerFactory(TiesConfiguration config)
          Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration.
TokenizerFactory(TiesConfiguration config, String suffix)
          Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration, adapted by appending the suffix.
 
Method Summary
static TextTokenizer createAlnumTokenizer(CharSequence text)
          Static factory method to create an instance for tokenizing alphanumeric and symbol sequences and punctuation.
static TextTokenizer createCategoryTokenizer(CharSequence text)
          Static factory method to create an instance for tokenizing according to Unicode categories.
static TextTokenizer createThoroughTokenizer(CharSequence text)
          Static factory method to create an instance that uses the "thorough" patterns listed below.
 TextTokenizer createTokenizer(CharSequence text)
          Factory method to create an instance from the configured token and whitespace patterns.
 String toString()
          Returns a string representation of this object.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CONFIG_TOKEN_PATTERNS

public static final String CONFIG_TOKEN_PATTERNS
Configuration key for the array of regular expressions defining the token types accepted by the tokenizer.

See Also:
Constant Field Values

CONFIG_WHITESPACE_PATTERN

public static final String CONFIG_WHITESPACE_PATTERN
Configuration key for the regular expression giving the whitespace accepted by the tokenizer.

See Also:
Constant Field Values

WHITESPACE_CONTROL_OTHER

public static final String WHITESPACE_CONTROL_OTHER
Pattern string capturing whitespace and control/other characters.

See Also:
Constant Field Values

Constructor Detail

TokenizerFactory

public TokenizerFactory(TiesConfiguration config)
Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration.

Parameters:
config - the configuration to use

TokenizerFactory

public TokenizerFactory(TiesConfiguration config,
                        String suffix)
Creates a new instance from the CONFIG_TOKEN_PATTERNS and CONFIG_WHITESPACE_PATTERN keys of the provided configuration, adapted by appending the suffix.
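The suffix mechanism described above could look roughly like the following sketch. It does not use the TIES classes: java.util.Properties stands in for TiesConfiguration, the key name "tokenizer.patterns" is hypothetical, and joining key and suffix with a dot is an assumption, since the text above only says that the suffix is appended.

```java
import java.util.Properties;

class SuffixedKeySketch {
    // Hypothetical base key; the real value of CONFIG_TOKEN_PATTERNS is not given here.
    static final String TOKEN_PATTERNS_KEY = "tokenizer.patterns";

    /** Appends the suffix to the key; the "." separator is an assumption. */
    static String adaptKey(String baseKey, String suffix) {
        if (suffix == null || suffix.isEmpty()) {
            return baseKey; // no suffix: use the plain key
        }
        return baseKey + "." + suffix;
    }

    /** Looks up the value stored under the suffixed key, or null if absent. */
    static String lookup(Properties config, String baseKey, String suffix) {
        return config.getProperty(adaptKey(baseKey, suffix));
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("tokenizer.patterns", "\\p{L}+");
        config.setProperty("tokenizer.patterns.title", "\\p{Lu}\\p{L}*");
        System.out.println(lookup(config, TOKEN_PATTERNS_KEY, null));
        System.out.println(lookup(config, TOKEN_PATTERNS_KEY, "title"));
    }
}
```

A suffix of this kind would let several differently configured factories share one configuration object, each reading its own variant of the two keys.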

Parameters:
config - the configuration to use
suffix - the suffix to append to the keys

Method Detail

createAlnumTokenizer

public static TextTokenizer createAlnumTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing alphanumeric and symbol sequences and punctuation. Token types:

When you are only interested in words and numbers (e.g. for indexing), you can use the captured text: it contains the full token for alphanumeric sequences and is empty for symbols and punctuation.

The whitespace pattern comprises a sequence of whitespace and control/other characters ("C" and "Z" categories).
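The captured-text behaviour described above can be illustrated with plain java.util.regex. This is a minimal sketch, not the TIES implementation: the two patterns are assumptions chosen to match the prose (the actual token patterns are not reproduced on this page), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlnumCaptureSketch {
    // Assumed token pattern: group 1 captures alphanumeric runs only;
    // symbol runs and single punctuation marks match without a capture.
    private static final Pattern TOKEN =
        Pattern.compile("([\\p{L}\\p{N}]+)|\\p{S}+|\\p{P}");
    // Assumed whitespace pattern: Unicode "Z" (separator) and "C" (other) categories.
    private static final Pattern WHITESPACE = Pattern.compile("[\\p{Z}\\p{C}]+");

    /** Returns {token, capturedText} pairs; capturedText is "" for non-alphanumeric tokens. */
    static List<String[]> tokenize(CharSequence text) {
        List<String[]> result = new ArrayList<>();
        Matcher token = TOKEN.matcher(text);
        Matcher space = WHITESPACE.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            space.region(pos, text.length());
            if (space.lookingAt()) {          // skip whitespace between tokens
                pos = space.end();
                continue;
            }
            token.region(pos, text.length());
            if (!token.lookingAt()) {         // neither pattern accepts this character
                pos += Character.charCount(Character.codePointAt(text, pos));
                continue;
            }
            String captured = token.group(1) == null ? "" : token.group(1);
            result.add(new String[] {token.group(), captured});
            pos = token.end();
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : tokenize("Price: 42 USD!")) {
            System.out.println("token=[" + pair[0] + "] captured=[" + pair[1] + "]");
        }
    }
}
```

For the input "Price: 42 USD!" the sketch yields five tokens; "Price", "42" and "USD" carry their full text as captured text, while ":" and "!" have empty captured text, so an indexer could simply skip tokens whose captured text is empty.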

Parameters:
text - the text to tokenize
Returns:
the created tokenizer

createCategoryTokenizer

public static TextTokenizer createCategoryTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing according to Unicode categories. Token types:

When you are only interested in words and numbers (e.g. for indexing), you can use the captured text: it contains the full token for letter and digit sequences and is empty for symbols and punctuation.

The whitespace pattern comprises a sequence of whitespace and control/other characters ("C" and "Z" categories).
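Tokenizing by Unicode category can be sketched with an ordered list of regular expressions, one per token type, tried in turn at the current position. The sketch below uses only java.util.regex; the type names and patterns are assumptions for illustration and are not the patterns TIES uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CategoryTokenizerSketch {
    // Hypothetical token types; insertion order determines matching priority.
    private static final Map<String, Pattern> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("letters", Pattern.compile("\\p{L}+"));
        TYPES.put("digits", Pattern.compile("\\p{Nd}+"));
        TYPES.put("symbol", Pattern.compile("\\p{S}"));
        TYPES.put("punctuation", Pattern.compile("\\p{P}"));
    }
    // Assumed whitespace pattern: Unicode "Z" and "C" categories.
    private static final Pattern WHITESPACE = Pattern.compile("[\\p{Z}\\p{C}]+");

    /** Returns entries of the form "type:token" in text order. */
    static List<String> tokenize(CharSequence text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            Matcher ws = WHITESPACE.matcher(text).region(pos, text.length());
            if (ws.lookingAt()) {             // skip whitespace between tokens
                pos = ws.end();
                continue;
            }
            boolean matched = false;
            for (Map.Entry<String, Pattern> type : TYPES.entrySet()) {
                Matcher m = type.getValue().matcher(text).region(pos, text.length());
                if (m.lookingAt()) {          // first type that matches here wins
                    out.add(type.getKey() + ":" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {                   // character outside all listed categories
                pos += Character.charCount(Character.codePointAt(text, pos));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc123 x+1."));
    }
}
```

Unlike an alphanumeric pattern, this sketch splits "abc123" into a letter token and a digit token, because each token stays within a single category.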

Parameters:
text - the text to tokenize
Returns:
the created tokenizer

createThoroughTokenizer

public static TextTokenizer createThoroughTokenizer(CharSequence text)
Static factory method to create an instance that uses the "thorough" patterns listed below.

These patterns don't contain any useful information for TextTokenizer.capturedText().

The whitespace pattern comprises a sequence of whitespace and control/other characters.

Parameters:
text - the text to tokenize
Returns:
the created tokenizer

createTokenizer

public TextTokenizer createTokenizer(CharSequence text)
Factory method to create an instance from the configured token and whitespace patterns.

Parameters:
text - the text to tokenize
Returns:
the created tokenizer

toString

public String toString()
Returns a string representation of this object.

Returns:
a textual representation


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.