

5b. DTD's Effect on the Nonvalidating Parser

In the last section, you defined a rudimentary document type and used it in your XML file. In this section, you'll use the Echo program to see how the data appears to the SAX parser when the DTD is included.

Note:
The output shown in this section is contained in Echo07-05.log.

Running the Echo program on your latest version of slideSample.xml shows that many of the superfluous calls to the characters method have now disappeared:

ELEMENT: <slideshow
   ATTR: ...
>
PROCESS: ...
    ELEMENT: <slide
       ATTR: ...
    >
        ELEMENT: <title>
        CHARS:   Wake up to ...
        END_ELM: </title>
    END_ELM: </slide>
    ELEMENT: <slide
        ATTR: ...
    >
    ...

The whitespace characters that were formerly echoed around the slide elements no longer appear, because the DTD declares that slideshow consists solely of slide elements:

<!ELEMENT slideshow (slide+)>

Tracking Ignorable Whitespace

Now that the DTD is present, the parser is no longer calling the characters method with whitespace that it knows to be irrelevant. From the standpoint of an application that is only interested in processing the XML data, that is great. The application is never bothered with whitespace that exists purely to make the XML file readable.

On the other hand, if you were writing an application that was filtering an XML data file, and you wanted to output an equally readable version of the file, then that whitespace would no longer be irrelevant -- it would be essential. To get those characters, you need to add the ignorableWhitespace method to your application. You'll do that next.

Note:
The code written in this section is contained in Echo08.java. The output is in Echo08-05.log.

To process the (generally) ignorable whitespace that the parser is seeing, add the ignorableWhitespace event handler shown below to your version of the Echo program, between the existing characters and processingInstruction methods:

    public void characters (char buf [], int offset, int len)
      ...     
    }


    public void ignorableWhitespace (char buf [], int offset, int len)
    throws SAXException
    {
        nl(); emit("IGNORABLE");
    }
  
    public void processingInstruction (String target, String data)

This code simply generates a message to let you know that ignorable whitespace was seen.
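
For the filtering case described earlier, a message is not enough: the handler needs the whitespace itself. The following is a minimal sketch of that idea, not part of the Echo program. The class name WhitespaceFilter, its copy method, and the cut-down sample document with an internal DTD are all assumptions made for illustration. It relies on ignorableWhitespace receiving the same buffer, offset, and length arguments as characters, and it leaves out attributes and character escaping to stay short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical filter sketch (not part of the Echo program): copies element
// tags and text to a buffer, optionally keeping ignorable whitespace.
// Attributes and character escaping are omitted to keep the sketch short.
class WhitespaceFilter extends DefaultHandler {
    // Assumed sample: a cut-down slideshow with an internal DTD subset.
    static final String BODY =
        "<slideshow>\n  <slide>\n    <title>Wake up to ...</title>\n  </slide>\n</slideshow>";
    static final String SAMPLE =
        "<!DOCTYPE slideshow [\n"
        + "<!ELEMENT slideshow (slide+)>\n"
        + "<!ELEMENT slide (title)>\n"
        + "<!ELEMENT title (#PCDATA)>\n"
        + "]>\n" + BODY;

    private final StringBuilder out = new StringBuilder();
    private final boolean keepIgnorable;

    WhitespaceFilter(boolean keepIgnorable) {
        this.keepIgnorable = keepIgnorable;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        out.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        out.append(buf, offset, len);
    }

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int len) {
        // Only the slice [offset, offset + len) of buf belongs to this event.
        if (keepIgnorable) {
            out.append(buf, offset, len);
        }
    }

    // Parses the XML text and returns the copy built by the handler.
    static String copy(String xml, boolean keepIgnorable) {
        WhitespaceFilter handler = new WhitespaceFilter(keepIgnorable);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(copy(SAMPLE, true));   // layout preserved
        System.out.println(copy(SAMPLE, false));  // ignorable whitespace dropped
    }
}
```

On a parser that reports ignorable whitespace, the first copy should keep the line breaks and indentation of the original, while the second should run all the tags together on one line.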

Note:
Again, not all parsers are created equal. The SAX specification does not require this method to be invoked. The Java XML implementation does so whenever the DTD makes it possible.

When you run the Echo application now, your output looks like this:

ELEMENT: <slideshow
   ATTR: ...
>
IGNORABLE
IGNORABLE
PROCESS: ...
IGNORABLE
IGNORABLE
    ELEMENT: <slide
       ATTR: ...
    >
    IGNORABLE
        ELEMENT: <title>
        CHARS:   Wake up to ...
        END_ELM: </title>
    IGNORABLE
    END_ELM: </slide>
IGNORABLE
IGNORABLE
    ELEMENT: <slide
       ATTR: ...
    >
    ...

Here, it is apparent that the ignorableWhitespace method is being invoked before and after comments and slide elements, where characters was being invoked before there was a DTD.

Cleanup

Now that you have seen ignorable whitespace echoed, remove that code from your version of the Echo program -- you won't be needing it any more in the exercises ahead.

Note:
That change has been made in Echo09.java.

Documents and Data

Earlier, you learned that one reason you hear about XML documents, on the one hand, and XML data, on the other, is that XML handles both comfortably, depending on whether text is or is not allowed between elements in the structure.

In the sample file you have been working with, the slideshow element is an example of a data element -- it contains only subelements with no intervening text. The item element, on the other hand, might be termed a document element, because it is defined to include both text and subelements.
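
The distinction shows up directly in which SAX callback receives the whitespace. The sketch below is one way to observe it, under stated assumptions: the class name ContentModelTally, the tally method, and the small inline DTD (including the em element and the mixed-content declaration for item) are invented for illustration and are not taken from slideSample.xml. It counts whitespace characters delivered through characters separately from those delivered through ignorableWhitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: tallies where whitespace is reported for data
// elements (element content) versus document elements (mixed content).
class ContentModelTally extends DefaultHandler {
    // Assumed DTD: slideshow and slide hold only subelements; item is mixed.
    static final String SAMPLE =
        "<!DOCTYPE slideshow [\n"
        + "<!ELEMENT slideshow (slide+)>\n"
        + "<!ELEMENT slide (item+)>\n"
        + "<!ELEMENT item (#PCDATA | em)*>\n"
        + "<!ELEMENT em (#PCDATA)>\n"
        + "]>\n"
        + "<slideshow>\n<slide>\n<item> <em>now</em> </item>\n</slide>\n</slideshow>";

    int whitespaceViaCharacters;
    int whitespaceViaIgnorable;

    @Override
    public void characters(char[] buf, int offset, int len) {
        // Count only the whitespace characters inside this event's slice.
        for (int i = offset; i < offset + len; i++) {
            if (Character.isWhitespace(buf[i])) {
                whitespaceViaCharacters++;
            }
        }
    }

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int len) {
        whitespaceViaIgnorable += len;
    }

    // Returns { whitespace seen by characters, whitespace seen as ignorable }.
    static int[] tally(String xml) {
        ContentModelTally handler = new ContentModelTally();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return new int[] { handler.whitespaceViaCharacters, handler.whitespaceViaIgnorable };
    }

    public static void main(String[] args) {
        int[] counts = tally(SAMPLE);
        System.out.println("via characters:          " + counts[0]);
        System.out.println("via ignorableWhitespace: " + counts[1]);
    }
}
```

In this sample, the four line breaks inside slideshow and slide sit in element content, so a parser that reports ignorable whitespace should hand them to ignorableWhitespace. The two spaces inside item are part of mixed content, so they are real data and arrive through characters.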

As you work through this tutorial, you will see how to expand the definition of the title element to include HTML-style markup, which will turn it into a document element as well.

Empty Elements, Revisited

Now that you understand how certain instances of whitespace can be ignorable, it is time to revise the definition of an "empty" element. That definition can now be expanded to include

<foo>   </foo>

where there is whitespace between the tags and the DTD defines that whitespace as ignorable.
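
As a rough illustration of that expanded definition, the sketch below treats the root element as empty when the parser reports neither child elements nor characters events for it. The names EmptyElementCheck and rootLooksEmpty are hypothetical, and so is the DTD: it assumes foo is declared with an element-only content model, (bar*), which is what makes the whitespace between its tags ignorable. The handler does not override ignorableWhitespace, so the inherited DefaultHandler method silently discards it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: reports whether the root element looks empty to an
// application that pays no attention to ignorable whitespace.
class EmptyElementCheck extends DefaultHandler {
    // Assumed DTD: foo has element content, so whitespace inside it is ignorable.
    static final String WITH_DTD =
        "<!DOCTYPE foo [<!ELEMENT foo (bar*)><!ELEMENT bar EMPTY>]><foo>   </foo>";
    static final String WITHOUT_DTD = "<foo>   </foo>";

    private int depth;
    private boolean sawContent;

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if (depth > 0) {
            sawContent = true;  // a child element counts as content
        }
        depth++;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        depth--;
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        if (len > 0) {
            sawContent = true;  // any reported character data counts as content
        }
    }

    // ignorableWhitespace is deliberately not overridden.

    static boolean rootLooksEmpty(String xml) {
        EmptyElementCheck handler = new EmptyElementCheck();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return !handler.sawContent;
    }

    public static void main(String[] args) {
        System.out.println("with DTD:    " + rootLooksEmpty(WITH_DTD));
        System.out.println("without DTD: " + rootLooksEmpty(WITHOUT_DTD));
    }
}
```

Without the DTD, the same three spaces are ordinary character data, so the element no longer looks empty to this handler.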

