de.fu_berlin.ties.xml
Class XMLTokenizerFactory

java.lang.Object
  extended by de.fu_berlin.ties.xml.XMLTokenizerFactory

public final class XMLTokenizerFactory
extends Object

Static factory for creating TextTokenizers for XML-like input.

Version:
$Revision: 1.2 $, $Date: 2004/04/08 16:38:26 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String MARKUP_DECL
          Pattern string specifying a markup declaration within a doctype declaration.
static String PE_REFERENCE
          Pattern string specifying a PE reference within a doctype declaration.
static String XML_ATTRIBUTE
          Pattern string specifying an XML attribute (name = quoted-value pair).
static String XML_CDATA_SECTION
          Pattern string specifying a CDATA section in an XML document.
static String XML_CDATA_TOKEN
          Pattern string for a visible textual token in XML documents (contains neither whitespace nor markup).
static String XML_COMMENT
          Pattern string specifying an XML comment.
static String XML_DOCTYPE
          Pattern string specifying an XML document type declaration.
static String XML_END_TAG
          Pattern string specifying an XML end tag.
static String XML_EQUAL_SIGN
          Pattern string for the '=' sign, optionally surrounded by whitespace.
static String XML_NAME
          Pattern string for XML names (according to XML 1.1).
static String XML_NAME_START_CHAR
          Pattern string specifying the class of valid start characters of XML names.
static String XML_OPT_WHITESPACE
          Pattern specifying optional whitespace in an XML document (zero or more whitespace characters).
static String[] XML_PATTERNS
          The array of patterns used for shallow XML parsing.
static String XML_PROLOG_OR_PI
          Pattern string specifying an XML prolog or processing instruction.
static String XML_QUOTED_STRING
          Pattern string for strings enclosed in double or single quotes, e.g. XML attribute values.
static String XML_START_OR_EMPTY_TAG
          Pattern string specifying an XML start or empty tag (combined into a single pattern to avoid unnecessary backtracking).
static String XML_TEXTUAL_CONTENT
          Pattern string specifying textual content (character data) in an XML document.
static String XML_WHITESPACE
          Pattern specifying whitespace in an XML document (one or more whitespace characters).
static String XML_WHITESPACE_CHARS
          Pattern fragment listing allowed whitespace characters in an XML document.
 
Method Summary
static TextTokenizer createXMLTokenizer(CharSequence text, boolean ensureWhitespace)
          Factory method to create an instance for parsing files in XML syntax.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

XML_WHITESPACE_CHARS

public static final String XML_WHITESPACE_CHARS
Pattern fragment listing allowed whitespace characters in an XML document. Space, tab, line feed, carriage return, and some other line-ending characters (according to XML 1.1) are allowed.

See Also:
Constant Field Values

XML_WHITESPACE

public static final String XML_WHITESPACE
Pattern specifying whitespace in an XML document (one or more whitespace characters).

See Also:
Constant Field Values

XML_OPT_WHITESPACE

public static final String XML_OPT_WHITESPACE
Pattern specifying optional whitespace in an XML document (zero or more whitespace characters).

See Also:
Constant Field Values
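
The three whitespace fields build on one another. The sketch below shows one way such fragments could be composed. The constant values are assumptions for illustration, not the values actually defined in this class; in particular, the extra line-ending characters are assumed to be NEL (U+0085) and LS (U+2028), which XML 1.1 treats as line ends.

```java
import java.util.regex.Pattern;

/** Illustrative stand-ins only; not the actual constant values of XMLTokenizerFactory. */
final class WhitespaceSketch {
    // Assumed character set: space, tab, LF, CR, NEL (U+0085) and LS (U+2028).
    static final String WHITESPACE_CHARS = " \\t\\n\\r\\u0085\\u2028";

    // One or more whitespace characters.
    static final String WHITESPACE = "[" + WHITESPACE_CHARS + "]+";

    // Zero or more whitespace characters.
    static final String OPT_WHITESPACE = "[" + WHITESPACE_CHARS + "]*";

    /** True if the whole input consists of at least one whitespace character. */
    static boolean isWhitespace(CharSequence s) {
        return Pattern.matches(WHITESPACE, s);
    }

    /** True if the whole input is empty or consists only of whitespace characters. */
    static boolean isOptionalWhitespace(CharSequence s) {
        return Pattern.matches(OPT_WHITESPACE, s);
    }
}
```

Keeping the character list in a separate fragment lets both quantified patterns share a single definition.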

XML_NAME_START_CHAR

public static final String XML_NAME_START_CHAR
Pattern string specifying the class of valid start characters of XML names.

See Also:
Constant Field Values

XML_NAME

public static final String XML_NAME
Pattern string for XML names (according to XML 1.1).

See Also:
Constant Field Values

XML_QUOTED_STRING

public static final String XML_QUOTED_STRING
Pattern string for strings enclosed in double or single quotes, e.g. XML attribute values.

See Also:
Constant Field Values

XML_EQUAL_SIGN

public static final String XML_EQUAL_SIGN
Pattern string for the '=' sign, optionally surrounded by whitespace.

See Also:
Constant Field Values

XML_ATTRIBUTE

public static final String XML_ATTRIBUTE
Pattern string specifying an XML attribute (name = quoted-value pair).

See Also:
Constant Field Values
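
As a rough illustration of how an attribute pattern can be assembled from the name, equal-sign, and quoted-string fragments, consider the sketch below. All fragment values are simplified ASCII-only assumptions (the real XML 1.1 name classes are much larger), and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-ins only; not the actual constant values of XMLTokenizerFactory. */
final class AttributePatternSketch {
    // Simplified, ASCII-only fragments named after the fields they imitate.
    static final String OPT_WHITESPACE = "[ \\t\\n\\r]*";
    static final String NAME_START_CHAR = "[A-Za-z_:]";
    static final String NAME = NAME_START_CHAR + "[A-Za-z0-9_:.\\-]*";
    static final String QUOTED_STRING = "(?:\"[^\"]*\"|'[^']*')";
    static final String EQUAL_SIGN = OPT_WHITESPACE + "=" + OPT_WHITESPACE;

    // Group 1 captures the attribute name, group 2 the value including its quotes.
    static final String ATTRIBUTE =
        "(" + NAME + ")" + EQUAL_SIGN + "(" + QUOTED_STRING + ")";

    /** Returns {name, quotedValue} for the first attribute found, or null if there is none. */
    static String[] firstAttribute(CharSequence input) {
        Matcher m = Pattern.compile(ATTRIBUTE).matcher(input);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }
}
```

For example, calling `AttributePatternSketch.firstAttribute("<a href = 'x.html'>")` finds the `href` attribute together with its single-quoted value.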

XML_CDATA_TOKEN

public static final String XML_CDATA_TOKEN
Pattern string for a visible textual token in XML documents (contains neither whitespace nor markup).

See Also:
Constant Field Values

XML_START_OR_EMPTY_TAG

public static final String XML_START_OR_EMPTY_TAG
Pattern string specifying an XML start or empty tag (combined into a single pattern to avoid unnecessary backtracking).

See Also:
Constant Field Values
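
One way to handle start tags and empty tags with a single pattern is to make the closing slash an optional capturing group, so the regex engine does not have to rescan the tag with a second pattern. The sketch below assumes simplified ASCII-only names and attributes; the pattern and the class name are hypothetical and do not reproduce the actual value of this field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-in only; not the actual value of XML_START_OR_EMPTY_TAG. */
final class StartOrEmptyTagSketch {
    private static final String NAME = "[A-Za-z_:][A-Za-z0-9_:.\\-]*";

    // Group 1 = tag name; group 2 = "/" for an empty tag, "" for a start tag.
    static final Pattern START_OR_EMPTY_TAG = Pattern.compile(
        "<(" + NAME + ")"
        + "(?:\\s+" + NAME + "\\s*=\\s*(?:\"[^\"]*\"|'[^']*'))*"  // zero or more attributes
        + "\\s*(/?)>");

    /** Classifies a single tag as "start:name", "empty:name", or "no match". */
    static String describe(CharSequence tag) {
        Matcher m = START_OR_EMPTY_TAG.matcher(tag);
        if (!m.matches()) {
            return "no match";
        }
        return (m.group(2).isEmpty() ? "start:" : "empty:") + m.group(1);
    }
}
```

The caller inspects the second group to tell the two tag kinds apart, instead of trying a start-tag pattern and an empty-tag pattern in turn.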

XML_END_TAG

public static final String XML_END_TAG
Pattern string specifying an XML end tag.

See Also:
Constant Field Values

XML_PROLOG_OR_PI

public static final String XML_PROLOG_OR_PI
Pattern string specifying an XML prolog or processing instruction.

See Also:
Constant Field Values

XML_COMMENT

public static final String XML_COMMENT
Pattern string specifying an XML comment.

See Also:
Constant Field Values

PE_REFERENCE

public static final String PE_REFERENCE
Pattern string specifying a PE reference within a doctype declaration.

See Also:
Constant Field Values

MARKUP_DECL

public static final String MARKUP_DECL
Pattern string specifying a markup declaration within a doctype declaration. A markup declaration either declares an entity, element, attribute, or notation; or it is a processing instruction or a comment.

See Also:
Constant Field Values

XML_DOCTYPE

public static final String XML_DOCTYPE
Pattern string specifying an XML document type declaration.

See Also:
Constant Field Values

XML_CDATA_SECTION

public static final String XML_CDATA_SECTION
Pattern string specifying a CDATA section in an XML document.

See Also:
Constant Field Values

XML_TEXTUAL_CONTENT

public static final String XML_TEXTUAL_CONTENT
Pattern string specifying textual content (character data) in an XML document. Leading and trailing whitespace is not included.

See Also:
Constant Field Values

XML_PATTERNS

public static final String[] XML_PATTERNS
The array of patterns used for shallow XML parsing.

Method Detail

createXMLTokenizer

public static TextTokenizer createXMLTokenizer(CharSequence text,
                                               boolean ensureWhitespace)
Factory method to create an instance for parsing files in XML syntax. Creates a shallow XML parser that splits XML input into a series of tags and textual content. The main difference from regular XML parsers is that tags can occur in any order; nesting constraints are not enforced. This can be useful for repairing XML-like files.

The type of the returned token can be determined by calling TextTokenizer.capturedText():

tagname
for start tags of type tagname
/tagname
for end tags
tagname//
for empty tags
!DOCTYPE
for doctype declarations
?targetname
for prolog ("?xml") and processing instructions
<!--
for comments
[CDATA
for CDATA sections
"" (empty string)
for textual content (character data)

Whitespace between tags and before and after textual content can be retrieved using the TextTokenizer.precedingWhitespace() method.

Parameters:
text - the text to tokenize
ensureWhitespace - whether to validate whitespace (TextTokenizer.isWhitespacePatternEnsured()), throwing an exception if a document contains serious errors (e.g. an unescaped "<" within textual content); if false, the caller is responsible for validating whitespace
Returns:
a new instance suitable for parsing XML
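
The idea of shallow XML parsing can be sketched without the TIES classes. The example below is a standalone approximation built on java.util.regex: it splits input into tokens and labels each one in the style of the list above, without checking nesting. It does not use TextTokenizer or the patterns in XML_PATTERNS; the class name, the method name, and all patterns are hypothetical and heavily simplified (ASCII names only, no internal DTD subsets, no ">" inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone approximation of shallow XML tokenizing; not the TIES implementation. */
final class ShallowXmlSketch {
    private static final String NAME = "[A-Za-z_:][A-Za-z0-9_:.\\-]*";

    // Alternatives are tried in order; named groups record which one matched.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<comment><!--.*?-->)"
        + "|(?<cdata><!\\[CDATA\\[.*?\\]\\]>)"
        + "|(?<doctype><!DOCTYPE[^>]*>)"
        + "|<\\?(?<pi>" + NAME + ").*?\\?>"
        + "|</(?<end>" + NAME + ")\\s*>"
        + "|<(?<start>" + NAME + ")(?:\\s+[^<>]*?)?(?<empty>/?)>"
        // Character data without leading or trailing whitespace.
        + "|(?<text>[^<\\s](?:[^<]*[^<\\s])?)",
        Pattern.DOTALL);

    /** Returns one type label per token; nesting is deliberately not checked. */
    static List<String> tokenTypes(CharSequence xml) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        // find() silently skips whitespace and anything no alternative matches.
        while (m.find()) {
            if (m.group("comment") != null) {
                types.add("<!--");
            } else if (m.group("cdata") != null) {
                types.add("[CDATA");
            } else if (m.group("doctype") != null) {
                types.add("!DOCTYPE");
            } else if (m.group("pi") != null) {
                types.add("?" + m.group("pi"));
            } else if (m.group("end") != null) {
                types.add("/" + m.group("end"));
            } else if (m.group("start") != null) {
                // Empty-tag label written as documented in the list above.
                types.add(m.group("start") + (m.group("empty").isEmpty() ? "" : "//"));
            } else {
                types.add("");  // textual content
            }
        }
        return types;
    }
}
```

Because nothing tracks open elements, input such as `</b><a>` still yields two tokens (`/b` and `a`), which is the property that makes this style of parsing usable on malformed files. Unlike the factory method with ensureWhitespace set to true, this sketch reports no errors and simply skips stray characters.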


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.