Substituting and Inserting Text

The Java^TM Web Services Tutorial

The next thing we want to do with the parser is to customize it a bit, so you can see how to get information it usually ignores. But before we can do that, you're going to need to learn a few more important XML concepts. In this section, you'll learn about:

Handling Special Characters ("<", "&", and so on)
Handling Text with XML-Style Syntax

Handling Special Characters

In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, surround the entity name with an ampersand and a semicolon, like this:
  &entityName;
 
Later, when you learn how to write a DTD, you'll see that you can define your own entities, so that &yourEntityName; expands to all the text you defined for that entity. For now, though, we'll focus on the predefined entities and character references that don't require any special definitions.

Predefined Entities

An entity reference like &amp; contains a name (in this case, "amp") between the start and end delimiters. The text it refers to (&) is substituted for the reference, like a macro in a C or C++ program. Table 6-1 shows the predefined entities for special characters.

Table 6-1 Predefined Entities
Character    Reference
&            &amp;
<            &lt;
>            &gt;
"            &quot;
'            &apos;

Character References

A character reference like &#65; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter "A", 8220 for the left curly quote, or 8221 for the right curly quote. In this case, the "name" of the entity is the hash mark followed by the digits that identify the character.

Note: A number written directly after the hash mark is read as decimal. However, the Unicode charts at http://www.unicode.org/charts/ specify values in hexadecimal. So you'll need either to convert the value to decimal or to write it in hexadecimal form with an "x" after the hash mark, as in &#x41; for the letter "A".
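The conversion from a chart value to a decimal character reference can be sketched in a few lines of Java. This is only an illustration: the class name CharRefDemo and the method decimalReference are hypothetical and are not part of the tutorial's sample code.

```java
final class CharRefDemo {
    /**
     * Builds a decimal character reference from a hexadecimal code point
     * as printed in the Unicode charts, for example "201C" for the left
     * curly quote. (Hypothetical helper, for illustration only.)
     */
    static String decimalReference(String hexCodePoint) {
        // Radix 16 tells parseInt to read the digits as hexadecimal.
        int codePoint = Integer.parseInt(hexCodePoint, 16);
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        System.out.println(decimalReference("41"));   // prints &#65;
        System.out.println(decimalReference("201C")); // prints &#8220;
        System.out.println(decimalReference("201D")); // prints &#8221;
    }
}
```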

Using an Entity Reference in an XML Document

Suppose you wanted to insert a line like this in your XML document:
 Market Size < predicted
 
The problem with putting that line into an XML file directly is that when the parser sees the left-angle bracket (<), it starts looking for a tag name, which throws off the parse. To get around that problem, you put &lt; in the file instead of "<".

Note: The results of the modifications below are contained in slideSample03.xml. The results of processing it are shown in Echo07-03.txt. (The browsable versions are slideSample03-xml.html and Echo07-03.html.)

If you are following the programming tutorial, add the text highlighted below to your slideSample.xml file:
  
  <slide type="all">
    <title>Overview</title>
    ...
  </slide>
 
  <slide type="exec">
    <title>Financial Forecast</title>
    <item>Market Size &lt; predicted</item>
    <item>Anticipated Penetration</item>
    <item>Expected Revenues</item>
    <item>Profit Margin </item>
  </slide>
 
</slideshow>
 
When you run the Echo program on your XML file, you see the following output:
ELEMENT:        <item>
CHARS:        Market Size < predicted
END_ELM:        </item>
 
The parser replaced the entity reference with the character it represents, and passed that character to the application as ordinary text.
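If you want to check this behavior outside the Echo program, the following minimal sketch parses a one-line document with the standard JAXP SAX parser and collects the character data it reports. It is not the tutorial's Echo application; the class name EntityEchoSketch and the method textOf are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EntityEchoSketch {
    /** Parses the XML string and returns all character data the parser reports. */
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks, so accumulate it.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            throw new IllegalArgumentException("Document is not well-formed", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The application sees the "<" character, not the &lt; reference.
        System.out.println(textOf("<item>Market Size &lt; predicted</item>"));
    }
}
```

Running main should print the line with a literal left-angle bracket, matching the CHARS line in the Echo output above.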

Handling Text with XML-Style Syntax

When you are handling large blocks of XML or HTML that include many of the special characters, it would be inconvenient to replace each of them with the appropriate entity reference. For those situations, you can use a CDATA section.

Note: The results of the modifications below are contained in slideSample04.xml. The results of processing it are shown in Echo07-04.txt. (The browsable versions are slideSample04-xml.html and Echo07-04.html.)

A CDATA section works like <pre>...</pre> in HTML, only more so--all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>. Add the text highlighted below to your slideSample.xml file to define a CDATA section for a fictitious technical slide:
   ...
  <slide type="tech">
    <title>How it Works</title>
    <item>First we fozzle the frobmorten</item>
    <item>Then we framboze the staten</item>
    <item>Finally, we frenzle the fuznaten</item>
    <item><![CDATA[Diagram:
      frobmorten <--------------- fuznaten
          |          <3>              ^
          | <1>                       |   <1> = fozzle
          V                           |   <2> = framboze
        staten------------------------+   <3> = frenzle
                     <2>
    ]]></item>
  </slide>
</slideshow>
 
When you run the Echo program on the new file, you see the following output:
  ELEMENT: <item>
  CHARS:   Diagram:

frobmorten <--------------- fuznaten
    |          <3>              ^
    | <1>                       |   <1> = fozzle
    V                           |   <2> = framboze
  staten------------------------+   <3> = frenzle
               <2>

  END_ELM: </item>
 
You can see here that the text in the CDATA section arrived as it was written. Since the parser didn't treat the angle brackets as XML, they didn't generate the fatal errors they would otherwise cause. Outside a CDATA section, those same angle brackets would make the document not well-formed.
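The contrast between the two cases can be sketched with the standard JAXP SAX parser. The sketch below is an illustration under stated assumptions, not the tutorial's Echo program: the class CdataSketch and its methods textOf and isWellFormed are hypothetical names, and the sample strings are shortened stand-ins for the slide diagram.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class CdataSketch {
    /** Returns the character data reported for the document. */
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // text may arrive in chunks
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // A fatal parse error surfaces here as a SAXException.
            throw new IllegalArgumentException("Document is not well-formed", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    /** Returns false if the parser reports a fatal error for the document. */
    static boolean isWellFormed(String xml) {
        try {
            textOf(xml);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Inside CDATA, the angle brackets are passed through as text.
        System.out.println(textOf("<item><![CDATA[a <1> b]]></item>"));
        // Outside CDATA, "<1>" is not a legal tag, so parsing fails.
        System.out.println(isWellFormed("<item>a <1> b</item>"));
    }
}
```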

Handling CDATA and Other Characters

The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important. The other characters will be interpreted properly without misleading the parser.)

But if the output text is in a CDATA section, then the substitutions should not occur, to produce text like that in the example above. In a simple program like our Echo application, it's not a big deal. But many XML-filtering applications will want to keep track of whether the text appears in a CDATA section, in order to treat special characters properly.

One other area to watch for is attributes. The text of an attribute value could also contain angle brackets, ampersands, and quote characters that need to be replaced by entity references. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
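The substitutions described in this section can be sketched as a pair of small helpers. This is one possible approach, not code from the Echo application: the class XmlEscaper and the methods escapeText and escapeAttribute are hypothetical names, and a real filter would call escapeText only for text that is not inside a CDATA section.

```java
final class XmlEscaper {
    /** Escapes character data that will be written outside a CDATA section. */
    static String escapeText(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Escapes an attribute value, including both kinds of quote character. */
    static String escapeAttribute(String s) {
        // Ampersands are handled first by escapeText, so the references
        // added here are not escaped a second time.
        return escapeText(s).replace("\"", "&quot;").replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(escapeText("Market Size < predicted"));
        System.out.println(escapeAttribute("R&D \"exec\" slides"));
    }
}
```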

Later in this tutorial, you will see how to use a LexicalHandler to find out whether or not you are processing a CDATA section. Next, though, you will see how to define a DTD.

This tutorial contains information on the 1.0 version of the Java Web Services Developer Pack.

All of the material in The Java Web Services Tutorial is copyright-protected and may not be published in other works without express written permission from Sun Microsystems.