de.fu_berlin.ties.text
Class TextTokenizer

java.lang.Object
  extended by de.fu_berlin.ties.text.TextTokenizer

public class TextTokenizer
extends Object

Splits a text into a sequence of tokens.

This class is not thread-safe. To share a tokenizer between threads, you must ensure adequate synchronization.
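
Because several methods (for example precedingWhitespace()) refer to the token returned by the last call to nextToken(), synchronizing each call separately is probably not enough: the token and its dependent state should be read in one critical section. The following is a minimal sketch of that idea. SharedTokenSource and its nested TokenSource interface are hypothetical stand-ins introduced for illustration; they are not part of this package.

```java
/** Hypothetical wrapper; not part of de.fu_berlin.ties.text. */
final class SharedTokenSource {

    /** Hypothetical stand-in for the two documented methods used here. */
    interface TokenSource {
        String nextToken();
        String precedingWhitespace();
    }

    private final TokenSource source;

    SharedTokenSource(TokenSource source) {
        this.source = source;
    }

    /**
     * Reads a token and the whitespace before it under a single lock, so
     * no other thread can advance the tokenizer between the two calls.
     *
     * @return {token, precedingWhitespace}, or null if no tokens are left
     */
    synchronized String[] next() {
        String token = source.nextToken();
        if (token == null) {
            return null;
        }
        return new String[] {token, source.precedingWhitespace()};
    }
}
```

A simpler alternative, where it is feasible, is to give each thread its own tokenizer instance.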

Version:
$Revision: 1.7 $, $Date: 2004/04/15 09:45:53 $, $Author: siefkes $
Author:
Christian Siefkes

Constructor Summary
TextTokenizer(String[] patterns, String whitespacePattern, CharSequence text)
          Creates a new instance.
 
Method Summary
 String capturedText()
          Returns the text captured within "capturing groups" in the last token.
 String getNormalizedWhitespace()
          Returns the normalized whitespace representation prepended if isNormalizedWhitespacePrepended() is true.
 boolean hasPrecedingWhitespace()
          Whether the token returned by the last call to nextToken() is preceded by whitespace (i.e., text not matched by any token).
 int initialWhitespaceCount(String text)
          Convenience method that counts the number of whitespace characters at the beginning of a string, according to the defined whitespace pattern.
 boolean isNormalizedWhitespacePrepended()
          Returns whether whitespace is prepended in a normalized form (getNormalizedWhitespace()) to those tokens where hasPrecedingWhitespace() would return true.
 boolean isValidWhitespace(String text)
          Convenience method that checks whether a string matches the defined whitespace pattern.
 boolean isWhitespacePatternEnsured()
          Whether whitespace (the text between patterns) is checked to ensure that the defined whitespace pattern is matched.
 CharSequence leftText()
          Returns the complete text to the left of (preceding) the token returned by the last call to nextToken().
 String nextToken()
          Returns the next token, or null if there are no more tokens left in the provided text.
 String precedingWhitespace()
          Returns the whitespace (i.e., text not matched by any token) preceding the token returned by the last call to nextToken().
 boolean precedingWhitespaceIsValid()
          Checks whether the whitespace (i.e., text not matched by any token) preceding the token returned by the last call to nextToken() matches the defined whitespace pattern.
 void reset()
          Resets this tokenizer, so it will restart at the beginning of the current text.
 void reset(CharSequence newText)
          Resets this tokenizer, so it will restart at the beginning of the provided text.
 CharSequence rightText()
          Returns the complete text to the right of (following) the token returned by the last call to nextToken().
 void setNormalizedWhitespace(String newValue)
          Changes the normalized whitespace representation prepended if isNormalizedWhitespacePrepended() is true.
 void setNormalizedWhitespacePrepended(boolean newValue)
          Changes whether whitespace is prepended in a normalized form (getNormalizedWhitespace()) to those tokens where hasPrecedingWhitespace() would return true.
 void setWhitespacePatternEnsured(boolean ensured)
          Specifies whether whitespace (the text between patterns) is checked to ensure that the defined whitespace pattern is matched.
 String toString()
          Returns a string representation of this object.
 int trailingWhitespaceCount(String text)
          Convenience method that counts the number of whitespace characters at the end of a string, according to the defined whitespace pattern.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TextTokenizer

public TextTokenizer(String[] patterns,
                     String whitespacePattern,
                     CharSequence text)
              throws PatternSyntaxException
Creates a new instance. Only use this constructor if you know what you are doing! Usually it should be sufficient to use one of the factory methods provided by TokenizerFactory.

Parameters:
patterns - a list of patterns to accept as tokens; the patterns are joined and compiled with the Pattern.DOTALL flag activated
whitespacePattern - a pattern that should match all text between tokens ("whitespace"), to ensure that no text is left out by mistake; the pattern is compiled with the Pattern.DOTALL flag activated
text - the text to tokenize
Throws:
PatternSyntaxException - if the syntax of the provided patterns is invalid
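
To illustrate what joining the patterns and compiling them with Pattern.DOTALL can look like, here is a minimal sketch based on java.util.regex. It assumes the patterns are combined as alternatives with "|"; the documentation does not state how the joining is actually done, and JoinedPatternSketch is a hypothetical helper, not part of this package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of de.fu_berlin.ties.text. */
final class JoinedPatternSketch {

    /** Joins the token patterns into one alternation, compiled with DOTALL. */
    static Pattern join(String[] patterns) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < patterns.length; i++) {
            if (i > 0) {
                joined.append('|');
            }
            // Assumption: each pattern is wrapped in a non-capturing group
            // so that the alternatives cannot interfere with each other.
            joined.append("(?:").append(patterns[i]).append(')');
        }
        // Throws PatternSyntaxException if the joined expression is invalid.
        return Pattern.compile(joined.toString(), Pattern.DOTALL);
    }

    /** Collects every match of the joined pattern, in order of appearance. */
    static List<String> tokens(String[] patterns, CharSequence text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = join(patterns).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}
```

With the patterns "[A-Za-z]+" and "[0-9]+", the text "abc 123, x" yields the tokens abc, 123 and x. Because of Pattern.DOTALL, a "." in a pattern also matches line terminators, so a token may span several lines.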
Method Detail

capturedText

public final String capturedText()
Returns the text captured within "capturing groups" in the last token. All captured text sequences are joined into a single string. If there were no capturing groups involved in the last match, the empty string is returned.

Returns:
the joined text matched within capturing groups in the last token match
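
The joining of captured groups described above can be sketched with a plain java.util.regex.Matcher. CapturedTextSketch is a hypothetical helper written for illustration; it is an assumption about how such joining could work, not the class's actual code.

```java
import java.util.regex.Matcher;

/** Hypothetical helper; not part of de.fu_berlin.ties.text. */
final class CapturedTextSketch {

    /**
     * Concatenates the text of all capturing groups that took part in the
     * matcher's current match. Must be called after a successful find().
     */
    static String captured(Matcher matcher) {
        StringBuilder joined = new StringBuilder();
        for (int i = 1; i <= matcher.groupCount(); i++) {
            String group = matcher.group(i);
            // Groups inside alternatives that did not match are null.
            if (group != null) {
                joined.append(group);
            }
        }
        return joined.toString();
    }
}
```

For the pattern "<(\w+)>|\w+" and the text "<b> text", the first match gives "b" and the second match gives the empty string, since no capturing group was involved.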

getNormalizedWhitespace

public final String getNormalizedWhitespace()
Returns the normalized whitespace representation prepended if isNormalizedWhitespacePrepended() is true. Defaults to a space character.

Returns:
the normalized representation

hasPrecedingWhitespace

public final boolean hasPrecedingWhitespace()
                                     throws IllegalStateException,
                                            IllegalArgumentException
Whether the token returned by the last call to nextToken() is preceded by whitespace (i.e., text not matched by any token). If the end of the text to tokenize has been reached (the last call to nextToken() returned null), this refers to the whitespace between the last existing token and the end of the text.

Returns:
whether the last token is preceded by whitespace
Throws:
IllegalStateException - if this method is called without a prior call to nextToken()
IllegalArgumentException - if isWhitespacePatternEnsured() is true and the whitespace preceding the last read token does not match the defined whitespace pattern

initialWhitespaceCount

public int initialWhitespaceCount(String text)
Convenience method that counts the number of whitespace characters at the beginning of a string, according to the defined whitespace pattern.

Parameters:
text - the text to check
Returns:
the number of whitespace characters at the beginning, or 0 if there are none
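
One plausible way to count leading whitespace against a pattern is Matcher.lookingAt(), which matches only at the start of the input. The sketch below assumes that the count is the length of the longest prefix accepted by the whitespace pattern; the documentation does not spell this out, and LeadingWhitespaceSketch is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of de.fu_berlin.ties.text. */
final class LeadingWhitespaceSketch {

    /** Returns the length of the prefix matched by the whitespace pattern, or 0. */
    static int leadingCount(Pattern whitespace, String text) {
        Matcher matcher = whitespace.matcher(text);
        // lookingAt() anchors the match at the start but not at the end.
        return matcher.lookingAt() ? matcher.end() : 0;
    }
}
```

With the whitespace pattern "\s+", the string "  ab" gives 2 and the string "ab" gives 0.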

isNormalizedWhitespacePrepended

public final boolean isNormalizedWhitespacePrepended()
Returns whether whitespace is prepended in a normalized form (getNormalizedWhitespace()) to those tokens where hasPrecedingWhitespace() would return true. Defaults to false.

Returns:
whether whitespace is prepended

isValidWhitespace

public boolean isValidWhitespace(String text)
Convenience method that checks whether a string matches the defined whitespace pattern.

Parameters:
text - the text to match
Returns:
true iff the given text matches the defined whitespace pattern or is the empty string
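
The documented return value corresponds to a full match with a special case for the empty string. A minimal sketch, assuming the whitespace pattern is available as a compiled java.util.regex.Pattern (WhitespaceCheckSketch is a hypothetical helper):

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not part of de.fu_berlin.ties.text. */
final class WhitespaceCheckSketch {

    /** True if the text is empty or is matched completely by the pattern. */
    static boolean isValid(Pattern whitespace, String text) {
        // matches() requires the whole input to match, unlike find().
        return text.isEmpty() || whitespace.matcher(text).matches();
    }
}
```

With the pattern "\s+", both the empty string and " \n" are valid, while " x " is not.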

isWhitespacePatternEnsured

public final boolean isWhitespacePatternEnsured()
Whether whitespace (the text between patterns) is checked to ensure that the defined whitespace pattern is matched. If set to true (default), a call to hasPrecedingWhitespace() or precedingWhitespace() will throw an IllegalArgumentException if the whitespace preceding the last read token does not match.

Returns:
the value of this property

leftText

public CharSequence leftText()
                      throws IllegalStateException
Returns the complete text to the left of (preceding) the token returned by the last call to nextToken(). This includes any precedingWhitespace().

Returns:
the complete text to the left of the last token
Throws:
IllegalStateException - if this method is called without a prior call to nextToken()

nextToken

public final String nextToken()
                       throws IllegalArgumentException
Returns the next token, or null if there are no more tokens left in the provided text. Once the tokenizer has reached the end of the text, all subsequent calls to this method return null until you call one of the reset() methods. If the token is preceded by whitespace and isNormalizedWhitespacePrepended() is true, the returned token will start with the normalized whitespace representation (getNormalizedWhitespace()).

Returns:
the next token read from the provided text (with or without prepended whitespace), or null if no tokens are left
Throws:
IllegalArgumentException - if isWhitespacePatternEnsured() and isNormalizedWhitespacePrepended() are true and the whitespace preceding this token does not match the defined whitespace pattern
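
The contract described above (null at the end of the text, optional normalized whitespace in front of a token) can be illustrated with a small stand-in built on java.util.regex. NextTokenSketch is hypothetical: it is not the real TextTokenizer, it omits whitespace validation and reset(), and it hard-codes a single space as the normalized whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in that mimics the documented nextToken() contract. */
final class NextTokenSketch {
    private final Matcher matcher;
    private final boolean prependWhitespace;
    private final String normalizedWhitespace = " ";
    private int previousEnd = 0;
    private boolean exhausted = false;

    NextTokenSketch(Pattern tokenPattern, CharSequence text,
            boolean prependWhitespace) {
        this.matcher = tokenPattern.matcher(text);
        this.prependWhitespace = prependWhitespace;
    }

    /** Returns the next token, or null once the text is exhausted. */
    String nextToken() {
        // A failed find() makes the next find() restart from the beginning,
        // so the exhausted flag keeps returning null afterwards.
        if (exhausted || !matcher.find()) {
            exhausted = true;
            return null;
        }
        // Unmatched text between the previous token and this one counts
        // as preceding whitespace.
        boolean precededByWhitespace = matcher.start() > previousEnd;
        previousEnd = matcher.end();
        String token = matcher.group();
        return (prependWhitespace && precededByWhitespace)
                ? normalizedWhitespace + token
                : token;
    }
}
```

For the token pattern "\w+|," and the text "foo  bar,baz" with prepending enabled, successive calls return "foo", " bar", "," and "baz", followed by null on every further call.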

precedingWhitespace

public final String precedingWhitespace()
                                 throws IllegalStateException,
                                        IllegalArgumentException
Returns the whitespace (i.e., text not matched by any token) preceding the token returned by the last call to nextToken(). If the end of the text to tokenize has been reached (the last call to nextToken() returned null), this is the whitespace between the last existing token and the end of the text.

Returns:
the whitespace preceding the last token, or the empty string if there is no preceding whitespace (i.e. hasPrecedingWhitespace() would return false)
Throws:
IllegalStateException - if this method is called without a prior call to nextToken()
IllegalArgumentException - if isWhitespacePatternEnsured() is true and the whitespace preceding the last read token does not match the defined whitespace pattern
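
"Whitespace" in this sense is simply the text lying between two token matches, plus whatever remains after the last token. The following sketch makes that explicit using match offsets; GapSketch is a hypothetical helper and does not check the gaps against a whitespace pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of de.fu_berlin.ties.text. */
final class GapSketch {

    /**
     * Returns the unmatched text before each token, in order. The final
     * entry is the text between the last token and the end of the input.
     */
    static List<String> gaps(Pattern tokenPattern, CharSequence text) {
        List<String> gaps = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(text);
        int previousEnd = 0;
        while (matcher.find()) {
            gaps.add(text.subSequence(previousEnd, matcher.start()).toString());
            previousEnd = matcher.end();
        }
        gaps.add(text.subSequence(previousEnd, text.length()).toString());
        return gaps;
    }
}
```

For the token pattern "\w+" and the text "a  b!", the gaps are the empty string (before "a"), two spaces (before "b") and "!" (after the last token). The last gap is the kind of text a whitespace pattern such as "\s+" would reject.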

precedingWhitespaceIsValid

public boolean precedingWhitespaceIsValid()
                                   throws IllegalStateException
Checks whether the whitespace (i.e., text not matched by any token) preceding the token returned by the last call to nextToken() matches the defined whitespace pattern. This method is called automatically if isWhitespacePatternEnsured() is true. Otherwise it can be called externally to check whether the whitespace is valid and take appropriate action if required.

Returns:
true iff the preceding whitespace matches the specified whitespace pattern or if there is no preceding whitespace
Throws:
IllegalStateException - if this method is called without a prior call to nextToken()

reset

public final void reset()
Resets this tokenizer, so it will restart at the beginning of the current text.


reset

public final void reset(CharSequence newText)
Resets this tokenizer, so it will restart at the beginning of the provided text.

Parameters:
newText - the new text to tokenize

rightText

public CharSequence rightText()
                       throws IllegalStateException
Returns the complete text to the right of (following) the token returned by the last call to nextToken(). This includes any following whitespace.

Returns:
the complete text to the right of the last token
Throws:
IllegalStateException - if this method is called without a prior call to nextToken()

setNormalizedWhitespace

public final void setNormalizedWhitespace(String newValue)
Changes the normalized whitespace representation prepended if isNormalizedWhitespacePrepended() is true.

Parameters:
newValue - the new value

setNormalizedWhitespacePrepended

public final void setNormalizedWhitespacePrepended(boolean newValue)
Changes whether whitespace is prepended in a normalized form (getNormalizedWhitespace()) to those tokens where hasPrecedingWhitespace() would return true.

Parameters:
newValue - the new value

setWhitespacePatternEnsured

public final void setWhitespacePatternEnsured(boolean ensured)
Specifies whether whitespace (the text between patterns) is checked to ensure that the defined whitespace pattern is matched. If set to true (default), a call to hasPrecedingWhitespace() or precedingWhitespace() will throw an IllegalArgumentException if the whitespace preceding the last read token does not match.

Parameters:
ensured - the new value of this property

toString

public String toString()
Returns a string representation of this object.

Returns:
a textual representation

trailingWhitespaceCount

public int trailingWhitespaceCount(String text)
Convenience method that counts the number of whitespace characters at the end of a string, according to the defined whitespace pattern.

Parameters:
text - the text to check
Returns:
the number of whitespace characters at the end, or 0 if there are none
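
Counting trailing whitespace against an arbitrary pattern is less direct than the leading case, because java.util.regex has no "match at the end only" call. One possible approach, shown here under the assumption that the count is the length of the longest suffix fully matched by the whitespace pattern, is to try successively shorter suffixes. TrailingWhitespaceSketch is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of de.fu_berlin.ties.text. */
final class TrailingWhitespaceSketch {

    /** Returns the length of the longest suffix matched by the pattern, or 0. */
    static int trailingCount(Pattern whitespace, String text) {
        Matcher matcher = whitespace.matcher(text);
        for (int start = 0; start < text.length(); start++) {
            // Restrict the matcher to the suffix and require a full match.
            if (matcher.region(start, text.length()).matches()) {
                return text.length() - start;
            }
        }
        return 0;
    }
}
```

With the pattern "\s+", the string "ab \t" gives 2 and the string "ab" gives 0. The loop is quadratic in the worst case, which is acceptable for short strings but worth revisiting for long inputs.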


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.