de.fu_berlin.ties.xml
Class XMLAdjuster

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.xml.XMLAdjuster
All Implemented Interfaces:
Processor

public class XMLAdjuster
extends TextProcessor

This class tries to fix corrupt XML documents, especially documents containing nesting errors. Instances of this class are thread-safe and can fix several documents in parallel.

Version:
$Revision: 1.24 $, $Date: 2006/10/21 16:04:29 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String CONFIG_DELETE_CONTROL_CHARS
          Configuration key: whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1).
static String CONFIG_DELETE_PSEUDO_TAGS
          Configuration key: whether to delete "pseudo-tags".
static String CONFIG_DELETE_TRAILING_GARBAGE
          Configuration key: whether to delete trailing garbage (illegal content that occurs after the root tag has been closed).
static String CONFIG_EMPTIABLE_TAGS
          Configuration key: Set of names of tags that can be converted to empty tags when required.
static String CONFIG_ESCAPE_PSEUDO_ENTITIES
          Configuration key: whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped).
static String CONFIG_MISSING_ROOT
          Configuration key: the name to use for the root element if missing.
static Pattern CONTROL_CHARS
          Pattern specifying sequences of control characters (character codes below the space character, except tab, line feed and carriage return).
static String ESCAPED_AMP
          Escape sequence for the "&" character.
protected static String EVENT_CONVERTED_TO_EMPTY_TAG
          Event constant: Converted to empty tag.
protected static String EVENT_DELETED_CONTROL_CHARS
          Event constant: Deleted control characters.
protected static String EVENT_DELETED_PSEUDO_TAG
          Event constant: Deleted pseudo-tag.
protected static String EVENT_DELETED_TRAILING_GARBAGE
          Event constant: Deleted trailing garbage.
protected static String EVENT_ESCAPED_CHARS
          Event constant: Escaped characters that are illegal or unwanted.
protected static String EVENT_INSERTED_MISSING_END_TAG
          Event constant: Inserted missing end tag.
protected static String EVENT_INSERTED_MISSING_ROOT_ELEMENT
          Event constant: Inserted missing root element.
protected static String EVENT_INSERTED_MISSING_START_TAG
          Event constant: Inserted missing start tag.
protected static String EVENT_MOVED_END_TAG_UP
          Event constant: Moved end tag up.
protected static String EVENT_MOVED_START_TAG_DOWN
          Event constant: Moved start tag down.
protected static String EVENT_QUOTED_ATTRIBUTE_VALUES
          Event constant: Quoted attribute values.
protected static String EVENT_SPLIT_TAG
          Event constant: Split tag.
static Pattern LAX_START_OR_EMPTY_TAG
          Pattern specifying a "lax" XML start or empty tag that can contain unquoted (invalid) attributes (combined into a single pattern to avoid unnecessary backtracking).
static Pattern PSEUDO_AMP
          A "&" that is not the start of a predefined entity reference or a character reference and thus should be escaped if isEscapingPseudoEntities() is true.
static Pattern SPURIOUS_AMP
          A "&" that is not the start of an entity and thus must be escaped.
static String UNQUOTED_ATTRIB_CHARS
          Pattern string specifying characters that can occur at the start or end of an unquoted attribute value: everything except '<', '>', '=' and whitespace (whitespace is also allowed, but only in the middle of a value).
static String UNQUOTED_ATTRIBUTE
          Pattern string specifying an XML attribute without proper quotes.
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
XMLAdjuster()
          Creates a new instance using a default extension and the standard configuration.
XMLAdjuster(String outExt)
          Creates a new instance, configured from the standard configuration.
XMLAdjuster(String outExt, String missingRoot, Set<String> emptiableTagSet, boolean deleteControlChars, boolean deletePseudoTags, boolean deleteTrailingGarbage, boolean escapePseudoEntities, TiesConfiguration config)
          Creates a new instance.
XMLAdjuster(String outExt, TiesConfiguration config)
          Creates a new instance from the provided configuration.
 
Method Summary
 void adjust(CharSequence input, Writer out)
          Tries to fix corrupt XML documents, especially documents containing nesting errors.
 void adjust(Reader in, Writer out)
          Tries to fix corrupt XML documents, especially documents containing nesting errors.
protected  void checkEvent(String eventType)
          Method called by the logEvent(String, String) methods whenever an event occurred to ensure the event is acceptable.
protected  void doProcess(Reader reader, Writer writer, ContextMap context)
          Tries to fix corrupt XML documents, especially documents containing nesting errors.
 XMLConstituent fixedConstituents(CharSequence input)
          Returns the constituents of an XML-like document after fixing possible nesting errors etc.
protected  void handleEndTag(TagConstituent endTag, OpenTags openTags, UnprocessedTags unprocessedTags)
          Helper method for handling an end tag.
protected  void handleEOF(XMLConstituent lastConst, OpenTags openTags, UnprocessedTags unprocessedTags, boolean insertedMissingRoot)
          Helper method for handling the end of a file.
protected  boolean isAnEmptiableTag(String tag)
          Whether the specified tag is one of the tags that can be converted to empty tags when required for fixing a document.
 boolean isDeletingControlChars()
          Whether control characters are deleted (these characters are not allowed in XML 1.0 and discouraged in XML 1.1).
 boolean isDeletingPseudoTags()
          Whether "pseudo-tags" are deleted, i.e., sequences that cannot be parsed as tags but look similar to them.
 boolean isEscapingPseudoEntities()
          Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped).
protected  void logEvent(String eventType, String details)
          Logs the occurrence of an event necessary for fixing a document.
protected  void logEvent(String eventType, TagConstituent tag)
          Logs the occurrence of an event necessary for fixing a document.
 XMLConstituent rawConstituents(CharSequence input, boolean fixCharacterErrors)
          Returns the raw constituents of an XML-like document.
protected  XMLConstituent rawConstituents(CharSequence input, boolean fixCharacterErrors, UnprocessedTags startAndEndTags)
          Returns the raw constituents of an XML-like document.
 String toString()
          Returns a string representation of this object.
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CONFIG_MISSING_ROOT

public static final String CONFIG_MISSING_ROOT
Configuration key: the name to use for the root element if missing.

See Also:
Constant Field Values

CONFIG_EMPTIABLE_TAGS

public static final String CONFIG_EMPTIABLE_TAGS
Configuration key: Set of names of tags that can be converted to empty tags when required.

See Also:
Constant Field Values

CONFIG_DELETE_CONTROL_CHARS

public static final String CONFIG_DELETE_CONTROL_CHARS
Configuration key: whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1).

See Also:
Constant Field Values

CONFIG_DELETE_PSEUDO_TAGS

public static final String CONFIG_DELETE_PSEUDO_TAGS
Configuration key: whether to delete "pseudo-tags".

See Also:
Constant Field Values

CONFIG_DELETE_TRAILING_GARBAGE

public static final String CONFIG_DELETE_TRAILING_GARBAGE
Configuration key: whether to delete trailing garbage (illegal content that occurs after the root tag has been closed).

See Also:
Constant Field Values

CONFIG_ESCAPE_PSEUDO_ENTITIES

public static final String CONFIG_ESCAPE_PSEUDO_ENTITIES
Configuration key: whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped).

See Also:
Constant Field Values

UNQUOTED_ATTRIB_CHARS

public static final String UNQUOTED_ATTRIB_CHARS
Pattern string specifying characters that can occur at the start or end of an unquoted attribute value: everything except '<', '>', '=' and whitespace (whitespace is also allowed, but only in the middle of a value). Evaluated lazily (reluctant) to avoid missing the "/>" at the end of an empty tag.

See Also:
Constant Field Values

UNQUOTED_ATTRIBUTE

public static final String UNQUOTED_ATTRIBUTE
Pattern string specifying an XML attribute without proper quotes. Equal sign and value are captured in groups; the value is matched lazily (reluctant) so we won't miss the start of the next attribute.

See Also:
Constant Field Values

LAX_START_OR_EMPTY_TAG

public static final Pattern LAX_START_OR_EMPTY_TAG
Pattern specifying a "lax" XML start or empty tag that can contain unquoted (invalid) attributes (combined into a single pattern to avoid unnecessary backtracking). Element name, equal signs and values of the (last) unquoted attribute, and '/' (for empty tags) are captured.
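
The reluctant matching of unquoted attribute values described above can be illustrated with a small standalone sketch. It does not use the library's actual pattern strings (their values are not shown on this page); the class name, the regular expression and the helper method are assumptions made for illustration. For simplicity, the sketch does not allow whitespace inside an unquoted value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedAttributeSketch {
    // Hypothetical pattern, not the library's. Group 1 is the attribute name
    // plus equals sign; group 2 is a quoted value or an unquoted value.
    // The unquoted alternative is reluctant and must be followed by
    // whitespace, ">" or "/>", so the "/" of an empty tag is left alone.
    static final Pattern ATTRIBUTE = Pattern.compile(
        "(\\s+[A-Za-z_:][\\w.:-]*\\s*=\\s*)"
        + "(\"[^\"]*\"|'[^']*'|[^\\s\"'<>=]+?(?=\\s|/?>))");

    // Adds double quotes around every unquoted attribute value in a tag.
    static String quoteAttributes(String tag) {
        Matcher m = ATTRIBUTE.matcher(tag);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = m.group(2);
            boolean quoted = value.startsWith("\"") || value.startsWith("'");
            String fixed = quoted ? value : "\"" + value + "\"";
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + fixed));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteAttributes("<br clear=all/>"));
        System.out.println(quoteAttributes("<a href=foo.html class='x'>"));
    }
}
```

Matching quoted values in the same alternation keeps the sketch from touching text that merely looks like an attribute inside an already quoted value.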


PSEUDO_AMP

public static final Pattern PSEUDO_AMP
A "&" that is not the start of a predefined entity reference or a character reference and thus should be escaped if isEscapingPseudoEntities() is true. (A pattern matching the rest of a predefined entity or character reference is included via negative lookahead.)


SPURIOUS_AMP

public static final Pattern SPURIOUS_AMP
A "&" that is not the start of an entity and thus must be escaped. (A pattern matching the rest of a legal entity or character reference is included via negative lookahead.)
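
The difference between PSEUDO_AMP and SPURIOUS_AMP can be sketched with two negative-lookahead patterns. The regular expressions below are assumptions written for illustration, not the values of the library's fields; in particular, the entity-name syntax is simplified to ASCII names.

```java
import java.util.regex.Pattern;

class AmpersandSketch {
    // Tail of a decimal or hexadecimal character reference (after the "&").
    static final String CHAR_REF_TAIL = "#[0-9]+;|#x[0-9A-Fa-f]+;";

    // "&" NOT followed by the rest of one of the 5 predefined entity
    // references or of a character reference (hypothetical stand-in).
    static final Pattern PSEUDO_AMP_SKETCH = Pattern.compile(
        "&(?!(?:amp|lt|gt|apos|quot);|" + CHAR_REF_TAIL + ")");

    // "&" NOT followed by the rest of any syntactically legal entity or
    // character reference (hypothetical stand-in, ASCII names only).
    static final Pattern SPURIOUS_AMP_SKETCH = Pattern.compile(
        "&(?![A-Za-z_:][\\w.:-]*;|" + CHAR_REF_TAIL + ")");

    // Replaces every "&" matched by the given pattern with "&amp;".
    static String escape(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String text = "Q&A &nbsp; &lt; &#38;";
        System.out.println(escape(PSEUDO_AMP_SKETCH, text));
        System.out.println(escape(SPURIOUS_AMP_SKETCH, text));
    }
}
```

With the first pattern the nonstandard reference &nbsp; is escaped as well; with the second only the bare "&" in "Q&A" is.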


ESCAPED_AMP

public static final String ESCAPED_AMP
Escape sequence for the "&" character.

See Also:
Constant Field Values

CONTROL_CHARS

public static final Pattern CONTROL_CHARS
Pattern specifying sequences of control characters (character codes below the space character, except tab, line feed and carriage return).
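
A pattern of this kind could look like the following minimal sketch. The character class is derived from the description above (codes below the space character, except tab, line feed and carriage return); the class and method names are assumptions, and the library's own pattern may be written differently.

```java
import java.util.regex.Pattern;

class ControlCharSketch {
    // Code points below U+0020, excluding tab (0x09), line feed (0x0A)
    // and carriage return (0x0D); "+" matches whole sequences at once.
    static final Pattern CONTROL_CHARS_SKETCH =
        Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]+");

    // Removes all matched control character sequences from the text.
    static String deleteControlChars(CharSequence text) {
        return CONTROL_CHARS_SKETCH.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The bell (U+0007) and unit separator (U+001F) are removed,
        // the tab and the line feed are kept.
        System.out.println(deleteControlChars("a\u0007\u001Fb\tc\n"));
    }
}
```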


EVENT_CONVERTED_TO_EMPTY_TAG

protected static final String EVENT_CONVERTED_TO_EMPTY_TAG
Event constant: Converted to empty tag.

See Also:
Constant Field Values

EVENT_INSERTED_MISSING_END_TAG

protected static final String EVENT_INSERTED_MISSING_END_TAG
Event constant: Inserted missing end tag.

See Also:
Constant Field Values

EVENT_INSERTED_MISSING_ROOT_ELEMENT

protected static final String EVENT_INSERTED_MISSING_ROOT_ELEMENT
Event constant: Inserted missing root element.

See Also:
Constant Field Values

EVENT_INSERTED_MISSING_START_TAG

protected static final String EVENT_INSERTED_MISSING_START_TAG
Event constant: Inserted missing start tag.

See Also:
Constant Field Values

EVENT_MOVED_END_TAG_UP

protected static final String EVENT_MOVED_END_TAG_UP
Event constant: Moved end tag up.

See Also:
Constant Field Values

EVENT_MOVED_START_TAG_DOWN

protected static final String EVENT_MOVED_START_TAG_DOWN
Event constant: Moved start tag down.

See Also:
Constant Field Values

EVENT_SPLIT_TAG

protected static final String EVENT_SPLIT_TAG
Event constant: Split tag.

See Also:
Constant Field Values

EVENT_DELETED_CONTROL_CHARS

protected static final String EVENT_DELETED_CONTROL_CHARS
Event constant: Deleted control characters.

See Also:
Constant Field Values

EVENT_DELETED_PSEUDO_TAG

protected static final String EVENT_DELETED_PSEUDO_TAG
Event constant: Deleted pseudo-tag.

See Also:
Constant Field Values

EVENT_ESCAPED_CHARS

protected static final String EVENT_ESCAPED_CHARS
Event constant: Escaped characters that are illegal or unwanted.

See Also:
Constant Field Values

EVENT_QUOTED_ATTRIBUTE_VALUES

protected static final String EVENT_QUOTED_ATTRIBUTE_VALUES
Event constant: Quoted attribute values.

See Also:
Constant Field Values

EVENT_DELETED_TRAILING_GARBAGE

protected static final String EVENT_DELETED_TRAILING_GARBAGE
Event constant: Deleted trailing garbage.

See Also:
Constant Field Values

Constructor Detail

XMLAdjuster

public XMLAdjuster()
Creates a new instance using a default extension and the standard configuration.


XMLAdjuster

public XMLAdjuster(String outExt)
Creates a new instance, configured from the standard configuration.

Parameters:
outExt - the extension to use for output files

XMLAdjuster

public XMLAdjuster(String outExt,
                   TiesConfiguration config)
Creates a new instance from the provided configuration.

Parameters:
outExt - the extension to use for output files
config - used to configure this instance

XMLAdjuster

public XMLAdjuster(String outExt,
                   String missingRoot,
                   Set<String> emptiableTagSet,
                   boolean deleteControlChars,
                   boolean deletePseudoTags,
                   boolean deleteTrailingGarbage,
                   boolean escapePseudoEntities,
                   TiesConfiguration config)
Creates a new instance.

Parameters:
outExt - the extension to use for output files
missingRoot - the name to use for the root element if missing, i.e. if not all elements and textual content are enclosed within a single element (the root); if null, processing stops with an exception if the root element is missing
emptiableTagSet - contains the names (Strings) of tags that can be converted to empty tags when required for fixing a document (e.g. "br" when <br> may be converted to <br/> during repair); might be null if there are none
deleteControlChars - whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1)
deletePseudoTags - whether to delete "pseudo-tags"
deleteTrailingGarbage - whether to delete trailing garbage (illegal content that occurs after the root tag has been closed)
escapePseudoEntities - whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped)
config - used to configure superclasses; if null, the standard configuration is used

Method Detail

adjust

public final void adjust(CharSequence input,
                         Writer out)
                  throws IOException,
                         ParsingException
Tries to fix corrupt XML documents, especially documents containing nesting errors. Delegates to fixedConstituents(CharSequence) and writes the result to the specified writer.

Parameters:
input - the corrupt XML document
out - the writer to write the corrected XML document to; flushed but not closed by this method
Throws:
IOException - if an I/O error occurs while using the writer
ParsingException - if the XML input contains an uncorrectable error

adjust

public final void adjust(Reader in,
                         Writer out)
                  throws IOException,
                         ParsingException
Tries to fix corrupt XML documents, especially documents containing nesting errors. Delegates to adjust(CharSequence, Writer).

Parameters:
in - the reader to read the corrupt XML document from; not closed by this method
out - the writer to write the corrected XML document to; flushed but not closed by this method
Throws:
IOException - if an I/O error occurs while using the reader or writer
ParsingException - if the XML input contains an uncorrectable error

checkEvent

protected void checkEvent(String eventType)
                   throws ParsingException
Method called by the logEvent(String, String) methods whenever an event occurred to ensure the event is acceptable. Subclasses that want to prevent certain events can override this method and throw an exception if an "illegal" event is encountered. This implementation does nothing, letting all events pass.

Parameters:
eventType - the event that occurred; should be one of the EVENT constants defined in this class.
Throws:
ParsingException - could be thrown by subclasses if the event is considered illicit
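
The override hook described above can be sketched without the library. All classes below are hypothetical stand-ins (they are not XMLAdjuster or ParsingException), the event names are made-up strings rather than the values of the EVENT constants, and the assumption that the check runs before the event is recorded is ours.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class EventVetoSketch {
    // Stand-in for the library's ParsingException (hypothetical).
    static class SketchParsingException extends Exception {
        SketchParsingException(String message) { super(message); }
    }

    // Stand-in for the base class: every logged event is checked first.
    static class LenientFixer {
        int logged = 0;

        protected void checkEvent(String eventType) throws SketchParsingException {
            // Does nothing, letting all events pass.
        }

        protected void logEvent(String eventType, String details)
                throws SketchParsingException {
            checkEvent(eventType);  // may veto the event
            logged++;               // a real implementation would log here
        }
    }

    // Subclass that treats selected event types as uncorrectable errors.
    static class StrictFixer extends LenientFixer {
        private final Set<String> forbidden;

        StrictFixer(String... forbiddenEvents) {
            forbidden = new HashSet<>(Arrays.asList(forbiddenEvents));
        }

        @Override
        protected void checkEvent(String eventType) throws SketchParsingException {
            if (forbidden.contains(eventType)) {
                throw new SketchParsingException("Event not allowed: " + eventType);
            }
        }
    }
}
```

A subclass written this way turns selected repairs into hard failures while leaving all other repairs enabled.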

doProcess

protected void doProcess(Reader reader,
                         Writer writer,
                         ContextMap context)
                  throws IOException,
                         ParsingException
Tries to fix corrupt XML documents, especially documents containing nesting errors. Delegates to adjust(Reader, Writer), ignoring the context.

Specified by:
doProcess in class TextProcessor
Parameters:
reader - reader containing the text to process; should not be closed by this method
writer - the writer to write the processed text to; might be flushed but not closed by this method
context - a map of objects that are made available for processing
Throws:
IOException - if an I/O error occurs while using the reader or writer
ParsingException - if the XML input contains an uncorrectable error

fixedConstituents

public final XMLConstituent fixedConstituents(CharSequence input)
                                       throws ParsingException
Returns the constituents of an XML-like document after fixing possible nesting errors etc.

Parameters:
input - the XML-like input data
Returns:
a reference to the first contained constituent; the list of constituents can be traversed by calling XMLConstituent.nextConstituent() on each constituent until null is returned
Throws:
ParsingException - if the XML input contains an uncorrectable error
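
The traversal described under "Returns" follows the usual linked-list idiom. The sketch below uses a hypothetical stand-in interface that exposes only the nextConstituent() method named above; it is not the library's XMLConstituent type, and the Piece class exists only to build a test chain.

```java
import java.util.ArrayList;
import java.util.List;

class ConstituentTraversalSketch {
    // Hypothetical stand-in with the single method named in the text.
    interface Constituent {
        Constituent nextConstituent();
    }

    // Minimal implementation used to build a chain for demonstration.
    static final class Piece implements Constituent {
        private final String text;
        private final Constituent next;

        Piece(String text, Constituent next) {
            this.text = text;
            this.next = next;
        }

        public Constituent nextConstituent() { return next; }

        @Override
        public String toString() { return text; }
    }

    // Walks the chain from the first constituent until null is returned.
    static List<String> collect(Constituent first) {
        List<String> result = new ArrayList<>();
        for (Constituent c = first; c != null; c = c.nextConstituent()) {
            result.add(c.toString());
        }
        return result;
    }
}
```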

handleEndTag

protected void handleEndTag(TagConstituent endTag,
                            OpenTags openTags,
                            UnprocessedTags unprocessedTags)
                     throws ParsingException
Helper method for handling an end tag. Here the actual work of the algorithm takes place, because we have to ensure a match.

Parameters:
endTag - the end tag to handle
openTags - must contain all currently open tags
unprocessedTags - must contain all unprocessed start and end tags
Throws:
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if an "illicit" event occurred

handleEOF

protected void handleEOF(XMLConstituent lastConst,
                         OpenTags openTags,
                         UnprocessedTags unprocessedTags,
                         boolean insertedMissingRoot)
                  throws ParsingException
Helper method for handling the end of a file. Suitable end tags are created to close any left-over open tags.

Parameters:
lastConst - the last constituent in the original input (must not have a successor)
openTags - must contain all currently open tags
unprocessedTags - must contain all unprocessed start and end tags; should ideally be empty
insertedMissingRoot - whether the start tag of a missing root element was created (in this case we'll insert the corresponding end tag without logging another event)
Throws:
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if an "illicit" event occurred
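
The general idea behind end-tag and end-of-file handling can be sketched with a plain stack of open element names. This is a simplified illustration under our own assumptions, not the algorithm of this class: it only inserts missing end tags, whereas the event constants above show that the class can also move tags, split tags and insert start tags. The class and method signatures below are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class NestingSketch {
    // Handles an end tag whose start tag is still open: elements opened
    // after it receive an inserted end tag first. Returns false (writing
    // nothing) if no matching start tag is open; that case is not covered.
    static boolean handleEndTag(String name, Deque<String> openTags,
                                StringBuilder out) {
        if (!openTags.contains(name)) {
            return false;
        }
        while (!openTags.peek().equals(name)) {
            // inserted missing end tag for an inner element
            out.append("</").append(openTags.pop()).append('>');
        }
        // the end tag that was actually present in the input
        out.append("</").append(openTags.pop()).append('>');
        return true;
    }

    // At the end of the input, closes all left-over open tags, innermost first.
    static void handleEOF(Deque<String> openTags, StringBuilder out) {
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append('>');
        }
    }

    public static void main(String[] args) {
        Deque<String> open = new ArrayDeque<>();
        StringBuilder out = new StringBuilder("<root><a><b>text");
        open.push("root");
        open.push("a");
        open.push("b");
        handleEndTag("a", open, out);  // input continues with "</a>"
        handleEOF(open, out);          // input ends without "</root>"
        System.out.println(out);
    }
}
```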

logEvent

protected void logEvent(String eventType,
                        String details)
                 throws ParsingException
Logs the occurrence of an event necessary for fixing a document. This method variant is used for character errors.

Parameters:
eventType - the event that occurred; should be one of the EVENT constants defined in this class.
details - a detailed description of the event
Throws:
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if the event is considered illicit

logEvent

protected void logEvent(String eventType,
                        TagConstituent tag)
                 throws ParsingException
Logs the occurrence of an event necessary for fixing a document. This method variant is used for nesting errors and missing root elements.

Parameters:
eventType - the event that occurred; should be one of the EVENT constants defined in this class.
tag - the involved tag
Throws:
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if the event is considered illicit

isAnEmptiableTag

protected boolean isAnEmptiableTag(String tag)
Whether the specified tag is one of the tags that can be converted to empty tags when required for fixing a document. For example, "br" when <br> may be converted to <br/> during repair.

Parameters:
tag - the name of the tag to look up
Returns:
true iff this tag is contained in the set of emptiable tags

isDeletingControlChars

public boolean isDeletingControlChars()
Whether control characters are deleted (these characters are not allowed in XML 1.0 and discouraged in XML 1.1).

Returns:
true iff control characters are deleted

isDeletingPseudoTags

public boolean isDeletingPseudoTags()
Whether "pseudo-tags" are deleted, i.e., sequences that cannot be parsed as tags but look similar to them. "Pseudo-tags" start with '<' and end with '>', contain a printable character after the initial '<', and do not contain any inner '<' or '>'. Disabled by default.

Returns:
true iff pseudo-tags are deleted (otherwise the starting '<' is escaped)
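
The two ways of treating a pseudo-tag (deleting it, or escaping its starting '<') can be sketched as follows. The pattern is our own reading of the description above, with "printable" taken to mean neither whitespace nor an ASCII control character; it is not the library's pattern. The sketch also assumes its input contains no real tags, since the pattern alone cannot tell a pseudo-tag from a parsable one.

```java
import java.util.regex.Pattern;

class PseudoTagSketch {
    // '<', then one printable character that is not '<' or '>', then any
    // characters except '<' and '>', then '>'. Group 1 holds everything
    // after the initial '<' (hypothetical pattern).
    static final Pattern PSEUDO_TAG =
        Pattern.compile("<([^\\s<>\\p{Cntrl}][^<>]*>)");

    // Deletes pseudo-tags, or escapes their initial '<' as "&lt;".
    static String fix(String text, boolean deletePseudoTags) {
        String replacement = deletePseudoTags ? "" : "&lt;$1";
        return PSEUDO_TAG.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(fix("x <- y -> z", true));
        System.out.println(fix("x <- y -> z", false));
    }
}
```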

isEscapingPseudoEntities

public boolean isEscapingPseudoEntities()
Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped).

Returns:
true iff "pseudo entities" are escaped

rawConstituents

public final XMLConstituent rawConstituents(CharSequence input,
                                            boolean fixCharacterErrors)
                                     throws ParsingException
Returns the raw constituents of an XML-like document. The constituents are returned "raw" as they occur in the input, without fixing possible nesting errors etc.

Parameters:
input - the XML-like input data
fixCharacterErrors - whether to try to fix character errors, i.e. unescaped "<" and "&" and tags with unquoted attribute values; if false, unescaped "<" in textual content and unquoted attribute values will yield an exception, while any unescaped "&" and unescaped "<" in attribute values will be ignored
Returns:
a reference to the first contained constituent; the list of constituents can be traversed by calling XMLConstituent.nextConstituent() on each constituent until null is returned
Throws:
ParsingException - if the XML input contains an uncorrectable error

rawConstituents

protected final XMLConstituent rawConstituents(CharSequence input,
                                               boolean fixCharacterErrors,
                                               UnprocessedTags startAndEndTags)
                                        throws ParsingException
Returns the raw constituents of an XML-like document. The constituents are returned "raw" as they occur in the input, without fixing possible nesting errors etc.

Parameters:
input - the XML-like input data
fixCharacterErrors - whether to try to fix character errors, i.e. unescaped "<" and "&" and tags with unquoted attribute values; if false, unescaped "<" in textual content and unquoted attribute values will yield an exception, while any unescaped "&" and unescaped "<" in attribute values will be ignored
startAndEndTags - all start and end tags are added to this container, if it isn't null
Returns:
a reference to the first contained constituent; the list of constituents can be traversed by calling XMLConstituent.nextConstituent() on each constituent until null is returned
Throws:
ParsingException - if the XML input contains an uncorrectable error

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class TextProcessor
Returns:
a textual representation


Copyright © 2003-2007 Christian Siefkes. All Rights Reserved.