java.lang.Object
de.fu_berlin.ties.ConfigurableProcessor
de.fu_berlin.ties.TextProcessor
de.fu_berlin.ties.xml.XMLAdjuster
public class XMLAdjuster
This class tries to fix corrupt XML documents, especially documents containing nesting errors. Instances of this class are thread-safe and can fix several documents in parallel.
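To illustrate what fixing a nesting error can mean, here is a minimal, self-contained sketch that keeps a stack of open tags and inserts missing end tags. It is not the implementation of XMLAdjuster: the class name NestingRepairSketch, the simplified tag regex, and the decision to drop stray end tags are assumptions made for this example only. The events listed in the field summary suggest that the real class also moves and splits tags and inserts missing start tags.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the TIES library.
final class NestingRepairSketch {
    // Simplified tag pattern: group 1 is "/" for end tags, group 2 is the
    // tag name, group 3 is "/" for empty tags. Attributes are skipped, not parsed.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][\\w.-]*)[^<>]*?(/?)>");

    static String repair(String input) {
        Deque<String> open = new ArrayDeque<>();
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(input);
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start()); // copy text before the tag
            pos = m.end();
            String name = m.group(2);
            if (!m.group(1).isEmpty()) {
                // End tag.
                if (!open.contains(name)) {
                    continue; // stray end tag: dropped in this sketch
                }
                // Close inner elements that were left open.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(m.group());
            } else {
                // Start tag or empty tag; only start tags stay open.
                if (m.group(3).isEmpty()) {
                    open.push(name);
                }
                out.append(m.group());
            }
        }
        out.append(input, pos, input.length());
        // End of input: close everything that is still open.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Calling NestingRepairSketch.repair("<a><b>text</a>") closes the inner element before the outer end tag is written.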
Field Summary | |
---|---|
static String |
CONFIG_DELETE_CONTROL_CHARS
Configuration key: whether to delete control characters (which are not allowed in
XML 1.0 and discouraged in XML 1.1). |
static String |
CONFIG_DELETE_PSEUDO_TAGS
Configuration key: whether to delete "pseudo-tags". |
static String |
CONFIG_DELETE_TRAILING_GARBAGE
Configuration key: whether to delete trailing garbage (illegal content that occurs after the root tag has been closed). |
static String |
CONFIG_EMPTIABLE_TAGS
Configuration key: set of names of tags that can be converted to empty tags when required. |
static String |
CONFIG_ESCAPE_PSEUDO_ENTITIES
Configuration key: whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped). |
static String |
CONFIG_MISSING_ROOT
Configuration key: the name to use for the root element if missing. |
static Pattern |
CONTROL_CHARS
Pattern specifying sequences of control characters (character codes below the space character, except tab, line feed and carriage return). |
static String |
ESCAPED_AMP
Escape sequence for the "&" character. |
protected static String |
EVENT_CONVERTED_TO_EMPTY_TAG
Event constant: Converted to empty tag. |
protected static String |
EVENT_DELETED_CONTROL_CHARS
Event constant: Deleted control characters. |
protected static String |
EVENT_DELETED_PSEUDO_TAG
Event constant: Deleted pseudo-tag. |
protected static String |
EVENT_DELETED_TRAILING_GARBAGE
Event constant: Deleted trailing garbage. |
protected static String |
EVENT_ESCAPED_CHARS
Event constant: Escaped characters that are illegal or unwanted. |
protected static String |
EVENT_INSERTED_MISSING_END_TAG
Event constant: Inserted missing end tag. |
protected static String |
EVENT_INSERTED_MISSING_ROOT_ELEMENT
Event constant: Inserted missing root element. |
protected static String |
EVENT_INSERTED_MISSING_START_TAG
Event constant: Inserted missing start tag. |
protected static String |
EVENT_MOVED_END_TAG_UP
Event constant: Moved end tag up. |
protected static String |
EVENT_MOVED_START_TAG_DOWN
Event constant: Moved start tag down. |
protected static String |
EVENT_QUOTED_ATTRIBUTE_VALUES
Event constant: Quoted attribute values. |
protected static String |
EVENT_SPLIT_TAG
Event constant: Split tag. |
static Pattern |
LAX_START_OR_EMPTY_TAG
Pattern specifying a "lax" XML start or empty tag that can contain unquoted (invalid) attributes (combined into a single pattern to avoid unnecessary backtracking). |
static Pattern |
PSEUDO_AMP
A "&" that is not the start of a predefined entity reference or a character reference and thus should be escaped if isEscapingPseudoEntities() is true. |
static Pattern |
SPURIOUS_AMP
A "&" that is not the start of an entity and thus must be escaped. |
static String |
UNQUOTED_ATTRIB_CHARS
Pattern string specifying characters that can occur at the start or end of an unquoted attribute value: everything except '<', '>', '=' and whitespace (whitespace is also allowed, but only in the middle of a value). |
static String |
UNQUOTED_ATTRIBUTE
Pattern string specifying an XML attribute without proper quotes. |
Fields inherited from class de.fu_berlin.ties.TextProcessor |
---|
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL |
Constructor Summary | |
---|---|
XMLAdjuster()
Creates a new instance using a default extension and the standard configuration. |
|
XMLAdjuster(String outExt)
Creates a new instance, configured from the standard configuration. |
|
XMLAdjuster(String outExt,
String missingRoot,
Set<String> emptiableTagSet,
boolean deleteControlChars,
boolean deletePseudoTags,
boolean deleteTrailingGarbage,
boolean escapePseudoEntities,
TiesConfiguration config)
Creates a new instance. |
|
XMLAdjuster(String outExt,
TiesConfiguration config)
Creates a new instance from the provided configuration. |
Method Summary | |
---|---|
void |
adjust(CharSequence input,
Writer out)
Tries to fix corrupt XML documents, especially documents containing nesting errors. |
void |
adjust(Reader in,
Writer out)
Tries to fix corrupt XML documents, especially documents containing nesting errors. |
protected void |
checkEvent(String eventType)
Method called by the logEvent(String, String) methods whenever
an event occurred to ensure the event is acceptable. |
protected void |
doProcess(Reader reader,
Writer writer,
ContextMap context)
Tries to fix corrupt XML documents, especially documents containing nesting errors. |
XMLConstituent |
fixedConstituents(CharSequence input)
Returns the constituents of an XML-like document after fixing possible nesting errors etc. |
protected void |
handleEndTag(TagConstituent endTag,
OpenTags openTags,
UnprocessedTags unprocessedTags)
Helper method for handling an end tag. |
protected void |
handleEOF(XMLConstituent lastConst,
OpenTags openTags,
UnprocessedTags unprocessedTags,
boolean insertedMissingRoot)
Helper method for handling the end of a file. |
protected boolean |
isAnEmptiableTag(String tag)
Whether the specified tag is one of the tags that can be converted to empty tags when required for fixing a document. |
boolean |
isDeletingControlChars()
Whether control characters are deleted (these
characters are not allowed in XML 1.0 and discouraged in XML 1.1). |
boolean |
isDeletingPseudoTags()
Whether "pseudo-tags" are deleted, i.e., sequences that cannot be parsed as tags but look similar to them. |
boolean |
isEscapingPseudoEntities()
Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped). |
protected void |
logEvent(String eventType,
String details)
Logs the occurrence of an event necessary for fixing a document. |
protected void |
logEvent(String eventType,
TagConstituent tag)
Logs the occurrence of an event necessary for fixing a document. |
XMLConstituent |
rawConstituents(CharSequence input,
boolean fixCharacterErrors)
Returns the raw constituents of an XML-like document. |
protected XMLConstituent |
rawConstituents(CharSequence input,
boolean fixCharacterErrors,
UnprocessedTags startAndEndTags)
Returns the raw constituents of an XML-like document. |
String |
toString()
Returns a string representation of this object. |
Methods inherited from class de.fu_berlin.ties.TextProcessor |
---|
getOutFileExt, process, process, process, process, process, process |
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor |
---|
getConfig |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String CONFIG_MISSING_ROOT
public static final String CONFIG_EMPTIABLE_TAGS
public static final String CONFIG_DELETE_CONTROL_CHARS
Configuration key: whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1).
public static final String CONFIG_DELETE_PSEUDO_TAGS
public static final String CONFIG_DELETE_TRAILING_GARBAGE
public static final String CONFIG_ESCAPE_PSEUDO_ENTITIES
public static final String UNQUOTED_ATTRIB_CHARS
public static final String UNQUOTED_ATTRIBUTE
public static final Pattern LAX_START_OR_EMPTY_TAG
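The field summary describes UNQUOTED_ATTRIBUTE and LAX_START_OR_EMPTY_TAG as matching attributes without proper quotes. The following sketch shows one way such values could be detected and quoted. The class name AttributeQuotingSketch and its regex are assumptions for illustration, not the patterns defined by this class. Unlike the documented pattern string, this simplified version does not allow whitespace in the middle of an unquoted value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the TIES library.
final class AttributeQuotingSketch {
    // name = value, where the value is double-quoted, single-quoted, or
    // unquoted. Quoted alternatives come first so that their content is
    // consumed as a whole and never re-scanned as an attribute.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([\\w:.-]+)\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s\"'<>=]+)");

    static String quoteAttributes(String tag) {
        Matcher m = ATTRIBUTE.matcher(tag);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = m.group(2);
            boolean quoted = value.startsWith("\"") || value.startsWith("'");
            String fixed = m.group(1) + "="
                + (quoted ? value : "\"" + value + "\"");
            // quoteReplacement protects "$" and "\" inside attribute values.
            m.appendReplacement(sb, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Already quoted values pass through unchanged, so the method can be applied to any start or empty tag.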
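public static final Pattern PSEUDO_AMP
A "&" that is not the start of a predefined entity reference or a character reference and thus should be escaped if isEscapingPseudoEntities() is true. (A pattern matching the rest of a predefined entity or character reference is included via negative lookahead.)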
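The negative-lookahead idea can be sketched as follows. The regular expressions below are stand-ins written for this example and are not the actual values of PSEUDO_AMP or SPURIOUS_AMP. The class name AmpEscapingSketch and the simplified notion of an entity name are likewise assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the TIES library.
final class AmpEscapingSketch {
    private static final String CHAR_REF = "#[0-9]+;|#x[0-9A-Fa-f]+;";

    // "&" not followed by the rest of a predefined entity or character reference.
    static final Pattern PSEUDO_AMP = Pattern.compile(
        "&(?!(?:amp|lt|gt|apos|quot);|" + CHAR_REF + ")");

    // "&" not followed by anything that looks like an entity or character reference.
    static final Pattern SPURIOUS_AMP = Pattern.compile(
        "&(?![A-Za-z_][\\w.-]*;|" + CHAR_REF + ")");

    static String escape(String text, boolean escapePseudoEntities) {
        // With the flag set, nonstandard references such as &nbsp; are escaped too.
        Pattern p = escapePseudoEntities ? PSEUDO_AMP : SPURIOUS_AMP;
        return p.matcher(text).replaceAll("&amp;");
    }
}
```

With the flag set to false, only ampersands that cannot start any entity are escaped, which mirrors the distinction drawn between the two documented patterns.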
public static final Pattern SPURIOUS_AMP
public static final String ESCAPED_AMP
public static final Pattern CONTROL_CHARS
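The field summary describes CONTROL_CHARS as matching sequences of character codes below the space character, except tab, line feed and carriage return. A pattern with that meaning could look like the sketch below. The regex and the class name ControlCharSketch are assumptions, not the value actually used by this class.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the TIES library.
final class ControlCharSketch {
    // Codes below U+0020 except tab (0x09), line feed (0x0A) and
    // carriage return (0x0D).
    static final Pattern CONTROL_CHARS =
        Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]+");

    static String deleteControlChars(String text) {
        return CONTROL_CHARS.matcher(text).replaceAll("");
    }
}
```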
protected static final String EVENT_CONVERTED_TO_EMPTY_TAG
protected static final String EVENT_INSERTED_MISSING_END_TAG
protected static final String EVENT_INSERTED_MISSING_ROOT_ELEMENT
protected static final String EVENT_INSERTED_MISSING_START_TAG
protected static final String EVENT_MOVED_END_TAG_UP
protected static final String EVENT_MOVED_START_TAG_DOWN
protected static final String EVENT_SPLIT_TAG
protected static final String EVENT_DELETED_CONTROL_CHARS
protected static final String EVENT_DELETED_PSEUDO_TAG
protected static final String EVENT_ESCAPED_CHARS
protected static final String EVENT_QUOTED_ATTRIBUTE_VALUES
protected static final String EVENT_DELETED_TRAILING_GARBAGE
Constructor Detail |
---|
public XMLAdjuster()
public XMLAdjuster(String outExt)
outExt - the extension to use for output files
public XMLAdjuster(String outExt, TiesConfiguration config)
outExt - the extension to use for output files
config - used to configure this instance
public XMLAdjuster(String outExt, String missingRoot, Set<String> emptiableTagSet, boolean deleteControlChars, boolean deletePseudoTags, boolean deleteTrailingGarbage, boolean escapePseudoEntities, TiesConfiguration config)
outExt - the extension to use for output files
missingRoot - the name to use for the root element if missing, i.e. if not all elements and textual content are enclosed within a single element (the root); if null, processing stops with an exception if the root element is missing
emptiableTagSet - contains the names (Strings) of tags that can be converted to empty tags when required for fixing a document (e.g. "br" when <br> may be converted to <br/> during repair); might be null if there are none
deleteControlChars - whether to delete control characters (which are not allowed in XML 1.0 and discouraged in XML 1.1)
deletePseudoTags - whether to delete "pseudo-tags"
deleteTrailingGarbage - whether to delete trailing garbage (illegal content that occurs after the root tag has been closed)
escapePseudoEntities - whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped)
config - used to configure superclasses; if null, the standard configuration is used
Method Detail |
---|
public final void adjust(CharSequence input, Writer out) throws IOException, ParsingException
Tries to fix corrupt XML documents, especially documents containing nesting errors. Delegates to fixedConstituents(CharSequence) and writes the result to the specified writer.
input - the corrupt XML document
out - the writer to write the corrected XML document to; flushed but not closed by this method
IOException - if an I/O error occurs while using the writer
ParsingException - if the XML input contains an uncorrectable error
public final void adjust(Reader in, Writer out) throws IOException, ParsingException
Tries to fix corrupt XML documents, especially documents containing nesting errors. Delegates to adjust(CharSequence, Writer).
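The stream contract documented for the two adjust variants (the reader is not closed, the writer is flushed but not closed) can be sketched as below. This is an assumption about how such a delegation is typically written, not the code of this class. The class name ReaderAdapterSketch is hypothetical, and the repair step is replaced by a placeholder that only trims whitespace.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch, not part of the TIES library.
final class ReaderAdapterSketch {
    // Placeholder for the real repair step: it only trims the input.
    static void adjust(CharSequence input, Writer out) throws IOException {
        out.write(input.toString().trim());
        out.flush(); // flushed, but deliberately not closed
    }

    // Reads everything from the reader, then delegates to the
    // CharSequence variant. Neither stream is closed here.
    static void adjust(Reader in, Writer out) throws IOException {
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.append(chunk, 0, n);
        }
        adjust(buffer, out);
    }

    // Convenience wrapper for in-memory use.
    static String adjustToString(String input) {
        StringWriter out = new StringWriter();
        try {
            adjust(new StringReader(input), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Leaving both streams open lets the caller decide when to close them, for example with try-with-resources around the call.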
in - the reader to read the corrupt XML document from; not closed by this method
out - the writer to write the corrected XML document to; flushed but not closed by this method
IOException - if an I/O error occurs while using the reader or writer
ParsingException - if the XML input contains an uncorrectable error
protected void checkEvent(String eventType) throws ParsingException
Method called by the logEvent(String, String) methods whenever an event occurred to ensure the event is acceptable. Subclasses that want to prevent certain events can override this method and throw an exception if an "illegal" event is encountered. This implementation does nothing, letting all events pass.
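The hook pattern described here can be sketched with two stand-in classes. Everything in the sketch is hypothetical: the class names, the string values of the event constants, and the use of an unchecked IllegalStateException where the documented signature declares ParsingException. It only illustrates how a subclass could veto an event before it is logged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the TIES library.
class EventHookSketch {
    // The constant values are made up for this example.
    static final String EVENT_DELETED_PSEUDO_TAG = "deleted pseudo-tag";
    static final String EVENT_INSERTED_MISSING_END_TAG = "inserted missing end tag";

    final List<String> log = new ArrayList<>();

    // Hook: the default implementation accepts every event.
    protected void checkEvent(String eventType) {
    }

    // Checks the event first, then records it.
    protected void logEvent(String eventType, String details) {
        checkEvent(eventType);
        log.add(eventType + ": " + details);
    }
}

// A stricter variant that refuses to delete pseudo-tags.
class StrictHookSketch extends EventHookSketch {
    @Override
    protected void checkEvent(String eventType) {
        if (EVENT_DELETED_PSEUDO_TAG.equals(eventType)) {
            throw new IllegalStateException("event not allowed: " + eventType);
        }
    }
}
```

Because the check runs before the event is recorded, a vetoed event never reaches the log.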
eventType - the event that occurred; should be one of the EVENT constants defined in this class
ParsingException - could be thrown by subclasses if the event is considered illicit
protected void doProcess(Reader reader, Writer writer, ContextMap context) throws IOException, ParsingException
Tries to fix corrupt XML documents, especially documents containing nesting errors. Delegates to adjust(Reader, Writer), ignoring the context.
doProcess in class TextProcessor
reader - reader containing the text to process; should not be closed by this method
writer - the writer to write the processed text to; might be flushed but not closed by this method
context - a map of objects that are made available for processing
IOException - if an I/O error occurs while using the reader or writer
ParsingException - if the XML input contains an uncorrectable error
public final XMLConstituent fixedConstituents(CharSequence input) throws ParsingException
Returns the constituents of an XML-like document after fixing possible nesting errors etc.
input - the XML-like input data
a reference to the first constituent; the other constituents can be reached by calling XMLConstituent.nextConstituent() on each constituent until null is returned
ParsingException - if the XML input contains an uncorrectable error
protected void handleEndTag(TagConstituent endTag, OpenTags openTags, UnprocessedTags unprocessedTags) throws ParsingException
Helper method for handling an end tag.
endTag - the end tag to handle
openTags - must contain all currently open tags
unprocessedTags - must contain all unprocessed start and end tags
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if an "illicit" event occurred
protected void handleEOF(XMLConstituent lastConst, OpenTags openTags, UnprocessedTags unprocessedTags, boolean insertedMissingRoot) throws ParsingException
Helper method for handling the end of a file.
lastConst - the last constituent in the original input (must not have a successor)
openTags - must contain all currently open tags
unprocessedTags - must contain all unprocessed start and end tags; should ideally be empty
insertedMissingRoot - whether the start tag of a missing root element was created (in this case we'll insert the corresponding end tag without logging another event)
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if an "illicit" event occurred
protected void logEvent(String eventType, String details) throws ParsingException
Logs the occurrence of an event necessary for fixing a document.
eventType - the event that occurred; should be one of the EVENT constants defined in this class
details - a detailed description of the event
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if the event is considered illicit
protected void logEvent(String eventType, TagConstituent tag) throws ParsingException
Logs the occurrence of an event necessary for fixing a document.
eventType - the event that occurred; should be one of the EVENT constants defined in this class
tag - the involved tag
ParsingException - might be thrown by checkEvent(String) implementations in subclasses if the event is considered illicit
protected boolean isAnEmptiableTag(String tag)
Whether the specified tag is one of the tags that can be converted to empty tags when required for fixing a document. For example, <br> may be converted to <br/> during repair.
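A lookup of this kind, together with the conversion it enables, might look like the following sketch. The class EmptiableTagSketch and its method toEmptyTagIfAllowed are hypothetical and exist only for this example. The sketch converts whenever the name is in the set, whereas the documentation says the conversion happens only when required for fixing a document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of the TIES library.
final class EmptiableTagSketch {
    private final Set<String> emptiableTags;

    EmptiableTagSketch(String... names) {
        emptiableTags = new HashSet<>(Arrays.asList(names));
    }

    boolean isAnEmptiableTag(String tag) {
        return emptiableTags.contains(tag);
    }

    // Rewrites a start tag such as <br> into the empty tag <br/>, but only
    // for names in the emptiable set. Other input is returned unchanged.
    String toEmptyTagIfAllowed(String startTag, String name) {
        boolean isStartTag = startTag.startsWith("<" + name)
            && startTag.endsWith(">")
            && !startTag.endsWith("/>");
        if (!isStartTag || !isAnEmptiableTag(name)) {
            return startTag;
        }
        return startTag.substring(0, startTag.length() - 1) + "/>";
    }
}
```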
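tag - the name of the tag to look up
true iff this tag is contained in the set of emptiable tags
public boolean isDeletingControlChars()
Whether control characters are deleted (these characters are not allowed in XML 1.0 and discouraged in XML 1.1).
true iff control characters are deleted
public boolean isDeletingPseudoTags()
Whether "pseudo-tags" are deleted, i.e., sequences that cannot be parsed as tags but look similar to them.
true iff pseudo-tags are deleted (otherwise the starting '<' is escaped)
public boolean isEscapingPseudoEntities()
Whether to escape "&" starting a possible nonstandard entity reference ("&" at the start of one of the 5 predefined entity references or a character reference is never escaped, all other "&" are always escaped).
true iff "pseudo-entities" are escaped
public final XMLConstituent rawConstituents(CharSequence input, boolean fixCharacterErrors) throws ParsingException
Returns the raw constituents of an XML-like document.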
input - the XML-like input data
fixCharacterErrors - whether to try to fix character errors, i.e. unescaped "<" and "&" and tags with unquoted attribute values; if false, unescaped "<" in textual content and unquoted attribute values will yield an exception, while any unescaped "&" and unescaped "<" in attribute values will be ignored
a reference to the first constituent; the other constituents can be reached by calling XMLConstituent.nextConstituent() on each constituent until null is returned
ParsingException - if the XML input contains an uncorrectable error
protected final XMLConstituent rawConstituents(CharSequence input, boolean fixCharacterErrors, UnprocessedTags startAndEndTags) throws ParsingException
Returns the raw constituents of an XML-like document.
input - the XML-like input data
fixCharacterErrors - whether to try to fix character errors, i.e. unescaped "<" and "&" and tags with unquoted attribute values; if false, unescaped "<" in textual content and unquoted attribute values will yield an exception, while any unescaped "&" and unescaped "<" in attribute values will be ignored
startAndEndTags - all start and end tags are added to this container, if it isn't null
a reference to the first constituent; the other constituents can be reached by calling XMLConstituent.nextConstituent() on each constituent until null is returned
ParsingException - if the XML input contains an uncorrectable error
public String toString()
Returns a string representation of this object.
toString in class TextProcessor