Troubleshooters.Com and
The March 2001 Troubleshooting Professional Magazine Present

XML Coding Exercises
Copyright (C) 2001 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Troubleshooting Professional Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.

IDL code snippets and other information from the DOM specification are copied from http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/, Copyright © 1998 World Wide Web Consortium (Massachusetts Institute of Technology, Institut National de Recherche en Informatique et en Automatique, Keio University). All Rights Reserved. Status of this document is a W3C Recommendation.

A Hello World XML App in Java

Making a DOM Walker Program

Building a DOM Document From Scratch

Writing an XML File From a DOM Document

Accessing DOM Elements and Attributes by Name

SAX

DTD's

Resume reading the 3/2001 Troubleshooting Professional

A Hello World XML App in Java

By Steve Litt

In this Article You Will Learn

How to write, compile and run a proof of concept XML app.
How to make a script to speed your XML programming.
How to download, install and use the Xerces XML tools from Apache Software Foundation.

In this exercise, you'll build a trivially simple XML app that does nothing but report the XML file's document element, which is the top level element in the XML file.

Because this demo is being done in Java, we will start with a Hello World Java application. Then we will download the necessary XML tools from the Apache website, install them in the proper directory, and configure your $CLASSPATH.

Building and Running Your Hello World Java App

The following code is a Hello World Java application that you should be able to compile and run. Save it as Hello.java. (The uppercase H in the filename must match the uppercase H in the class name in the code!)

    class Hello {
        public static void main(String[] args) {
            System.out.println("Hello World\n");
        }
    }

Try compiling it with the following command:

$ javac Hello.java

It should compile with no output, simply returning to the command prompt. If there are error messages, make sure you typed in the Java source code exactly as shown in the box above. Make sure you entered the command precisely as shown in the preceding command. If still no joy, suspect the installation of your Java compiler, or your $CLASSPATH variable. Troubleshoot accordingly.

Once you can compile it, try running it. It probably won't work, but instead will error out as shown below:

    $ java Hello
    Internal error: caught an unexpected exception.
    Please check your CLASSPATH and your installation.
    java/lang/ClassNotFoundException: Hello
            at java.lang.Class.forName(Class.java:native)
            at java.lang.Class.forName(Class.java:52)
    Aborted
    $

The preceding error probably indicates that your newly compiled Hello.class program is not on your $CLASSPATH. Fix it with the following command:

$ CLASSPATH=$CLASSPATH:.

The preceding command appends the current directory to the $CLASSPATH. Perform an ls command to verify that Hello.class really exists, and then run your program again. The following is the correct result:

    $ java Hello
    Hello World

    $

Note that the $CLASSPATH fix is good only for the current shell session. To compile and run your Hello.java app, create the following script, which we will name jj:

    rm Hello.class
    CLASSPATH=$CLASSPATH:.
    javac Hello.java
    java Hello $@

Next, use an import statement and make use of a command line argument, as shown in the next version of Hello.java:

    import java.io.IOException;

    class Hello {
        public static void main(String[] args) {
            System.out.println("Hello " + args[0] + "!");
        }
    }

Run the following command:

    $ ./jj one two three
    Hello one!
    $

The preceding did just what it was supposed to do -- compiled and ran Hello.java, which printed the word Hello and the first argument. Now you're ready to add XML to your app.

Building your Hello World XML App

Your Java implementation probably doesn't come with XML capabilities. To add XML capabilities you need to download the proper libraries. Many people recommend the Xerces Java XML library from the Apache Foundation as the Cadillac of the industry, so that's what we'll use here. From http://xml.apache.org/dist/xerces-j/, download either Xerces-J-bin.1.3.0.zip or Xerces-J-bin.1.3.0.tar.gz (the latter is about a third smaller for some reason). Note that as time goes on, you'll need a later version than 1.3.0, which is the latest stable version as of this writing.

Anyway, extract the files from the archive. You'll get lots of files and directories. There's extensive documentation in html format -- a good thing. But what you want is the file called xerces.jar, which is located in the root of the new tree created when you extracted files from the archive. Copy xerces.jar to a directory in which you want to put Java tools. In my case I put it in /usr/jre-blackdown1.2.2/lib. Once you have it where you want it, you need to add it to your $CLASSPATH. There are many ways to do that, but I chose to modify my jj script to accomplish it:

    rm Hello.class
    CLASSPATH=$CLASSPATH:/usr/jre-blackdown1.2.2/lib/xerces.jar:.
    javac Hello.java
    java Hello $@

Now you should be able to add an import statement to import the Xerces DOMParser. The following is Hello.java after adding the import statement. If you've done everything correctly, this program should compile and act just like it acted before you added the import statement. If not, troubleshoot:

    import java.io.IOException;
    import org.apache.xerces.parsers.DOMParser;

    class Hello {
        public static void main(String[] args) {
            System.out.println("Hello " + args[0] + "!");
        }
    }

If the preceding compiled and ran, it means you correctly installed and utilized xerces.jar, and you're ready for your first XML program. The following program parses an XML file into a DOM document. Finally, the program outputs the name of the top level element in the file.

    import java.io.IOException;                     // Exception handling
    import org.w3c.dom.*;                           // DOM interface
    import org.apache.xerces.parsers.DOMParser;     // Parser (to DOM)

    class Hello {
        public static void main(String[] args) {
            String filename = args[0];
            System.out.print("The document element of " + filename + " is ... ");
            try {
                DOMParser dp = new DOMParser();
                dp.parse(filename);
                Document doc = dp.getDocument();
                Element docElm = doc.getDocumentElement();
                System.out.println(docElm.getNodeName() + ".");
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

Let's examine the preceding code. org.w3c.dom and org.apache.xerces.parsers.DOMParser are both contained in the xerces.jar that you downloaded. org.w3c.dom contains the entire DOM interface, while org.apache.xerces.parsers.DOMParser is actually a SAX parser to load a DOM document. If this means nothing to you, refer to articles Anatomy of an XML App and Simplified Explanation of the DOM API earlier in this magazine (use your back button to come back).

The preceding code uses the first argument as a filename, and uses a DOMParser object to parse that file into a DOM document. Once it's a Document object, you have it in DOM form, and have no further use for the parser. Next you obtain the Document Element, which is the single top level element in an XML file. Finally, you get Document Element's name and print it.

The try{}catch(){} structure is error handling by exception. The e.getMessage() delivers a clear message about what went wrong, and is excellent for troubleshooting.
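Current JDKs ship a DOM parser as part of the platform (the javax.xml.parsers package), so the same proof of concept can be tried without xerces.jar. The following is a minimal sketch under that assumption, not a replacement for the Xerces code above. The class name RootName and its rootName() helper are made up for illustration, and the XML is parsed from a string rather than a file so the example runs anywhere:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RootName {
    // Hypothetical helper: parse XML text and return the document element's name.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            // Same idea as the article: surface the parser's own message.
            throw new IllegalStateException("Error: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<diagram><layer/></diagram>"));
    }
}
```

Run as written, this prints diagram, the name of the top level element in the sample string.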

Run this program against the blank.xml file you created in the Dia diagramming tool exercises. The result should look something like this:

    $ ./jj blank.xml
    The document element of blank.xml is ... diagram.
    $

But what happens if the file doesn't exist, or if the file is not XML? See below:

    $ cp /etc/fstab fstab.txt
    $ ./jj blank.xml
    The document element of blank.xml is ... diagram.
    $ ./jj nonexist.txt
    The document element of nonexist.txt is ...
    Error: File "file:///home/slitt/nonexist.txt" not found.
    $ ./jj fstab.txt
    The document element of fstab.txt is ...
    Error: The markup in the document preceding the root element must be well-formed.
    $

By the way, the jj script compiles and runs. If you don't want to compile, make another file (call it ss) that doesn't delete Hello.class or do the compile step (javac). Be careful though. I often make the mistake of changing my program, forgetting to compile it, running it, and wondering why my change made no difference. Or even worse, figuring what I changed had nothing to do with the problem, and needlessly troubleshooting further.

Have you gotten the preceding code to run? Congratulations! You've written an XML program. Now it's time to do something more substantial...

In this Article You Have Learned

How to write, compile and run a proof of concept XML app.
How to make a script to speed your XML programming.
How to download, install and use the Xerces XML tools from Apache Software Foundation.

Steve Litt is the documentor of the Universal Troubleshooting Process. He can be reached at slitt@troubleshooters.com.

Making a DOM Walker Program

By Steve Litt

In this Article You Will Learn

The algorithm to navigate a DOM document without recursion
Why you must have a boolean ascending variable.
How to code the algorithm in Java
How to delete empty text elements.
How to recover your "position" after the deletion.
How to retrieve the list of an element's attributes, and iterate through the list.

A DOM walker program "walks" the DOM hierarchy, reporting on every text node, every element, and every attribute of each element. The concept is similar to walking any type of tree -- a recursive directory listing comes to mind.

And speaking of recursion, it's the standard algorithm for walking trees. But it isn't used in this program. That's because the DOM API bestows methods crafted to walk non-recursively -- getFirstChild(), getNextSibling(), and getParentNode(). The algorithm is simple if you think of a checker.

A checker is that black or red circular piece of plastic used in the game called checkers. In the game they're each used to mark a position. In this program, you can imagine a single checker being moved from node to node. The "current node" is covered by the "checker".

Although trees are typically walked recursively, recursion is often too memory intensive to be practical in DOM apps. You can walk the DOM hierarchy iteratively (in a loop with no recursion) using the following algorithm:

If you can move your checker down, move it down to the first child,
otherwise, if you can move your checker to the right, move it to the right,
otherwise, if you can move your checker up, move it up,
otherwise, the reason you can't move your checker up is because you've once again ascended to the single top level node, so you're done.

The preceding algorithm is complicated slightly by one fact: if you've just moved your checker up, you've moved it to a node you already visited and moved down from. You don't want to take action on that node again, and you don't want to move down from it again, because that would create an infinite loop. The solution is simple enough. Implement a boolean variable called ascending, setting it true when the checker moves up, but false when it moves down or right. Take action only when ascending is false, and go down only when it's false (and you can go down).

The preceding description is the explanation of the DOMwalker class in the following code. A description of the remainder of the code follows the code itself. And please remember that the following is a simplified DOM walker that doesn't print attributes, and also doesn't delete extraneous text nodes caused by XML formatting. Those functionalities will be addressed later in this article. The following is the simplified iterative DOM walker:

    import java.io.IOException;                     // Exception handling
    import org.w3c.dom.*;                           // DOM interface
    import org.apache.xerces.parsers.DOMParser;     // Parser (to DOM)

    /**************************************
    class DocumentMaker encapsulates all parser dependent code.
    If you change XML parsers, only this class and the parser's
    import statement need be modified.
    As written, DocumentMaker uses DOMParser from Apache.
    **************************************/
    class DocumentMaker {
        private Document doc;
        public Document getDocument() { return(doc); }
        public DocumentMaker(String filename) {
            try {
                DOMParser dp = new DOMParser();
                dp.parse(filename);
                doc = dp.getDocument();
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

    /**************************************
    class NodeTypes encapsulates text names for the various
    node types. Its asAscii method returns those strings
    according to its nodeTypeNumber argument.
    It's a number to string translator.
    **************************************/
    class NodeTypes {
        private static String[] nodenames = {
            "ELEMENT", "ATTRIBUTE", "TEXT",
            "CDATA_SECTION", "ENTITY_REFERENCE",
            "ENTITY", "PROCESSING_INSTRUCTION",
            "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
            "DOCUMENT_FRAGMENT", "NOTATION"};
        public static String asAscii(int nodeTypeNumber) {
            return(nodenames[nodeTypeNumber-1]);
        }
    }

    /**************************************
    class DOMwalker's job is to walk the DOM document and
    print out each node's type, its name, and its value
    (null for Elements). This simplified version does not
    yet print attributes. The tree is walked non-recursively
    using standard DOM traversal methods.
    **************************************/
    class DOMwalker {
        private Node checker;   // like a checker that gets moved from
                                // square to square in a checkers game
                                // points to "current" node

        private void indentToLevel(int level) {
            for(int n=0; n < level; n++) {
                System.out.print(" ");
            }
        }

        private void printNodeInfo(Node thisNode) {
            System.out.print(NodeTypes.asAscii(thisNode.getNodeType()) + " : " +
                thisNode.getNodeName() + " : " +
                thisNode.getNodeValue() + " : ");
            System.out.println();
        }

        public DOMwalker(Document doc) {
            boolean ascending = false;
            int level = 1;
            System.out.println();
            try {
                checker=doc.getDocumentElement();
                while (true) {
                    //*** TAKE ACTION ON NODE WITH CHECKER ***
                    if (!ascending) {
                        indentToLevel(level);
                        printNodeInfo(checker);
                    }

                    //*** GO DOWN IF YOU CAN ***
                    if ((checker.hasChildNodes()) && (!ascending)) {
                        checker=checker.getFirstChild();
                        ascending = false;
                        level++;
                    }
                    //*** OTHERWISE GO RIGHT IF YOU CAN ***
                    else if (checker.getNextSibling() != null) {
                        checker=checker.getNextSibling();
                        ascending = false;
                    }
                    //*** OTHERWISE GO UP IF YOU CAN ***
                    else if (checker.getParentNode() != null) {
                        checker=checker.getParentNode();
                        ascending = true;
                        level--;
                    }
                    //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                    //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                    else {
                        break;
                    }
                }
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

    /**************************************
    Class Hello is the repository of this program's main routine.
    It walks the DOM tree and prints out the DOM tree's info.
    **************************************/
    class Hello {
        public static void main(String[] args) {
            String filename = args[0];
            System.out.print("Walking XML file " + filename + " ... ");
            DocumentMaker docMaker = new DocumentMaker(filename);
            Document doc = docMaker.getDocument();
            try {
                DOMwalker walker = new DOMwalker(doc);
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

In the preceding code, class DocumentMaker encapsulates all parser dependent code except the parser's import statement. What this means is you can place this class in a separate file, along with its parser dependent import statement. If you change parsers, all you need to change is that separate file. Class DocumentMaker delivers a parser independent DOM Document object.

Class NodeTypes implements a simple number to text name lookup on node types.
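The nodeTypeNumber-1 index in asAscii() works because the DOM Level 1 node type constants are the consecutive values 1 through 12, starting with Node.ELEMENT_NODE. The following sketch checks a few of them against the same name table. The class name NodeTypeNumbers and the method nameOf() are invented for this illustration:

```java
import org.w3c.dom.Node;

class NodeTypeNumbers {
    // Same names, in the same order, as the article's NodeTypes table.
    private static final String[] NAMES = {
        "ELEMENT", "ATTRIBUTE", "TEXT",
        "CDATA_SECTION", "ENTITY_REFERENCE",
        "ENTITY", "PROCESSING_INSTRUCTION",
        "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
        "DOCUMENT_FRAGMENT", "NOTATION"};

    // Node type constants start at 1, array indexes at 0, hence the -1.
    static String nameOf(int nodeType) {
        return NAMES[nodeType - 1];
    }

    public static void main(String[] args) {
        System.out.println(Node.ELEMENT_NODE + " -> " + nameOf(Node.ELEMENT_NODE));
        System.out.println(Node.TEXT_NODE + " -> " + nameOf(Node.TEXT_NODE));
        System.out.println(Node.DOCUMENT_TYPE_NODE + " -> " + nameOf(Node.DOCUMENT_TYPE_NODE));
        System.out.println(Node.NOTATION_NODE + " -> " + nameOf(Node.NOTATION_NODE));
    }
}
```

The four lines printed are 1 -> ELEMENT, 3 -> TEXT, 10 -> DOCUMENT_TYPE and 12 -> NOTATION.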

The logic of class DOMwalker was discussed extensively before the preceding code. Its constructor (DOMwalker()) is simply the implementation of the down if you can, else right if you can, else up if you can, else done algorithm. A Node object called checker is continually moved through that algorithm, thus defining the current position.

The level variable keeps track of the level in the hierarchy, and is passed to method indentToLevel(), which outputs the proper indentation for the level.

The printNodeInfo() method prints information about the node passed as its only argument. In this simplified program it prints only the node type, the node name, and the node value.

Class Hello implements the main logic of the program, using the DocumentMaker object to create a Document object, which it passes to the DOMwalker object.

To test the preceding code, use your favorite text editor to create the following XML file, called simple.xml:

    <?xml version="1.0"?>
    <toplevel>
      <secondlevel>
        This is a text node within the second level.
      </secondlevel>
    </toplevel>

Now test your program. The following is the command and the output:

    $ ./jj simple.xml
    Walking XML file simple.xml ...
     ELEMENT : toplevel : null :
      TEXT : #text : :
      ELEMENT : secondlevel : null :
       TEXT : #text : This is a text node within the second level. :
      TEXT : #text : :
     DOCUMENT_TYPE : toplevel : null :
    $

The preceding output is pretty ugly, isn't it? Why are there three text nodes instead of just the one with text? The answer is that the newlines and formatting spaces between tags are interpreted by many parsers as text nodes. You can verify that by creating a single line version of simple.xml (here called simp.xml) with no spaces between tags, and the result looks like what you expect:

    $ ./jj simp.xml
    Walking XML file simp.xml ...
     ELEMENT : toplevel : null :
      ELEMENT : secondlevel : null :
       TEXT : #text : This is a text node within the second level. :
     DOCUMENT_TYPE : toplevel : null :
    $

Note that because there's no DTD, the DOCUMENT_TYPE node is empty.
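The effect of formatting whitespace can also be checked without any files. The following sketch counts the children of the top level element for an indented and a compressed version of the same XML. It assumes the JDK's built-in javax.xml.parsers parser in its default (non-validating) configuration rather than xerces.jar, and the class name WhitespaceNodes and method childCount() are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceNodes {
    // Hypothetical helper: number of direct children of the document element.
    static int childCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Error: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String formatted =
            "<toplevel>\n  <secondlevel>text</secondlevel>\n</toplevel>";
        String compressed =
            "<toplevel><secondlevel>text</secondlevel></toplevel>";
        // The newline-plus-indent runs on either side of <secondlevel>
        // each become a text node, so the formatted version has more children.
        System.out.println("formatted:  " + childCount(formatted));
        System.out.println("compressed: " + childCount(compressed));
    }
}
```

With the default parser settings the formatted string yields 3 children (text, element, text) and the compressed string yields 1.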

The `WhiteSpaceKiller` object

Of course, we can't expect everyone to write single line, space compressed XML for our convenience. So we must create a WhiteSpaceKiller object whose job is to remove all empty text nodes. Note this action is done on the DOM Document, not in the parsing process. If you want to do it in the parsing process, please defer that desire until our discussion of SAX.

The WhiteSpaceKiller object is basically a DOM walker whose action on the current node is to delete it if it's a blank text node. Once again, we use the down if possible, right if possible, up if possible, done algorithm. With a twist...

When you delete the current node, you can't very well look for its first child, its next sibling, or its parent. It's gone: once removed from the tree, it no longer has a parent or siblings to move to. There are many ways of handling this, but I picked the method that seemed the cleanest to me. I keep track of where the checker was before it arrived "here", and if a deletion takes place, I move the checker back to its previous location. The iteration then selects the node after the one that was just deleted. By moving to the previous checker location, our down/right/up algorithm goes exactly where it would have gone from the text node if we hadn't deleted it.
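To see why the checker cannot simply stay on a deleted node, the following sketch removes the first child of the top level element and then asks the removed node for its parent and next sibling. It is an illustration only: it assumes the JDK's built-in javax.xml.parsers parser, and the class name DetachedNode and method isStrandedAfterRemoval() are made up:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DetachedNode {
    // Hypothetical helper: remove the root's first child, then report whether
    // the removed node has lost both its parent and its next sibling.
    static boolean isStrandedAfterRemoval(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node first = doc.getDocumentElement().getFirstChild();
            Node removed = first.getParentNode().removeChild(first);
            // With no parent and no sibling, the walker has nowhere to go
            // from here, which is why it steps back to the previous node.
            return removed.getParentNode() == null
                && removed.getNextSibling() == null;
        } catch (Exception e) {
            throw new IllegalStateException("Error: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // The blank text node before <b/> is the node that gets removed.
        System.out.println(isStrandedAfterRemoval("<a> <b/></a>"));
    }
}
```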

The following is the WhiteSpaceKiller class code:

    /**************************************
    class WhiteSpaceKiller's job is to walk the DOM Document and
    delete any empty text nodes. The tree is walked non-recursively
    using standard DOM traversal methods. Once an empty text node
    is deleted, the "checker" is moved back to the previous node
    to avoid attempts at calling DOM traversal methods on the
    removed node.
    **************************************/
    class WhiteSpaceKiller {
        private Node checker;   // like a checker that gets moved from
                                // square to square in a checkers game
                                // points to "current" node

        WhiteSpaceKiller(Document doc) {
            boolean ascending = false;
            Node previousChecker = null;
            try {
                checker=doc.getDocumentElement();
                while (true) {
                    //*** TAKE ACTION ON NODE WITH CHECKER ***
                    if ((!ascending) && (checker.getNodeType() == Node.TEXT_NODE)) {
                        String trimmedText = checker.getNodeValue().trim();
                        // Test the content, not object identity (== "")
                        if (trimmedText.length() == 0) {
                            checker.getParentNode().removeChild(checker);
                            checker=previousChecker;  //back to undeleted node
                        }
                    }
                    previousChecker=checker;

                    //*** GO DOWN IF YOU CAN ***
                    if ((checker.hasChildNodes()) && (!ascending)) {
                        checker=checker.getFirstChild();
                        ascending = false;
                    }
                    //*** OTHERWISE GO RIGHT IF YOU CAN ***
                    else if (checker.getNextSibling() != null) {
                        checker=checker.getNextSibling();
                        ascending = false;
                    }
                    //*** OTHERWISE GO UP IF YOU CAN ***
                    else if (checker.getParentNode() != null) {
                        checker=checker.getParentNode();
                        ascending = true;
                    }
                    //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                    //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                    else {
                        break;
                    }
                }
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

In the preceding, all the work is done in the constructor. The algorithm is almost identical to the DOM walker code discussed earlier, except for moving back one step upon deletion.

This object is called upon to do its work in class Hello, just before the DOMwalker object is created to print the hierarchy. The following code shows class Hello with the WhiteSpaceKiller invocation added:

    class Hello {
        public static void main(String[] args) {
            String filename = args[0];
            System.out.print("Walking XML file " + filename + " ... ");
            DocumentMaker docMaker = new DocumentMaker(filename);
            Document doc = docMaker.getDocument();
            try {
                WhiteSpaceKiller wpc = new WhiteSpaceKiller(doc);  // <== added invocation
                DOMwalker walker = new DOMwalker(doc);
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

Try compiling and running the new code against simple.xml (the one with the newlines and indenting), and note that there are no extraneous text nodes:

    [slitt@mydesk slitt]$ vi jj
    [slitt@mydesk slitt]$ ./jj simple.xml
    Walking XML file simple.xml ...
     ELEMENT : toplevel : null :
      ELEMENT : secondlevel : null :
       TEXT : #text : This is a text node within the second level. :
     DOCUMENT_TYPE : toplevel : null :
    [slitt@mydesk slitt]$

We're almost there. The remaining task is to print the attributes of each element. We'll print a comma separated list of attributes in parentheses to the right of the node value. Each attribute will have the attribute's name followed by an equal sign followed by the attribute's value in double quotes.

Give the DOMwalker class a new method called printAttributes(), which is called from printNodeInfo() on any node of type Node.ELEMENT_NODE. The printAttributes() method uses DOM methods rather than native Java language to work its magic.

    private void indentToLevel(int level) {
        for(int n=0; n < level; n++) {
            System.out.print(" ");
        }
    }

    private void printAttributes(Node thisNode) {
        System.out.print("(");
        NamedNodeMap attribs = thisNode.getAttributes();
        int numAttribs = attribs.getLength();
        for(int i=0; i < numAttribs; i++) {
            Node attrib = attribs.item(i);
            if(i>0) { System.out.print(","); }
            System.out.print(attrib.getNodeName());
            System.out.print("=\"");
            System.out.print(attrib.getNodeValue());
            System.out.print("\"");
        }
        System.out.print(")");
    }

    private void printNodeInfo(Node thisNode) {
        System.out.print(NodeTypes.asAscii(thisNode.getNodeType()) + " : " +
            thisNode.getNodeName() + " : " +
            thisNode.getNodeValue() + " : ");
        if(thisNode.getNodeType() == Node.ELEMENT_NODE) {
            printAttributes(thisNode);
        }
        System.out.println();
    }

The preceding is straight out of the DOM spec. getAttributes() returns a NamedNodeMap object with methods getLength() to return the number of items, and item() to return a single item by index. Then it's just a matter of iterating through them. Note that neither XML nor DOM specifies that the order of returned attributes is the same as in the file, so applications cannot assume anything concerning the order of returned attributes.
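Because attribute order is unspecified, code that needs a stable order has to impose one itself. The following sketch uses the same getAttributes(), getLength() and item() calls, but copies the results into a sorted map. It is one possible approach, not part of the DOM walker: it assumes the JDK's built-in javax.xml.parsers parser, and the class name AttributeList, the method attributesOf() and the sample item element are all invented for the example:

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeList {
    // Hypothetical helper: the document element's attributes, sorted by name.
    static Map<String, String> attributesOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NamedNodeMap attribs = doc.getDocumentElement().getAttributes();
            Map<String, String> sorted = new TreeMap<String, String>();
            for (int i = 0; i < attribs.getLength(); i++) {
                Node attrib = attribs.item(i);
                // TreeMap orders by key, so the result no longer depends
                // on whatever order the parser happened to return.
                sorted.put(attrib.getNodeName(), attrib.getNodeValue());
            }
            return sorted;
        } catch (Exception e) {
            throw new IllegalStateException("Error: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOf("<item name=\"widget\" id=\"7\"/>"));
    }
}
```

The map prints as {id=7, name=widget} regardless of the order in which the attributes appear in the source.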

Just in case you've gotten out of sync with these exercises, the following source code listing is the complete listing for our DOM walker, complete with blank text deletion and attribute listing:

    /*
     * Copyright (C) 2001 by Steve Litt
     *
     * COMPLETE DOM WALKER
     *
     */

    import java.io.IOException;                     // Exception handling
    import org.w3c.dom.*;                           // DOM interface
    import org.apache.xerces.parsers.DOMParser;     // Parser (to DOM)

    /**************************************
    class DocumentMaker encapsulates all parser dependent code.
    If you change XML parsers, only this class and the parser's
    import statement need be modified.
    As written, DocumentMaker uses DOMParser from Apache.
    **************************************/
    class DocumentMaker {
        private Document doc;
        public Document getDocument() { return(doc); }
        public DocumentMaker(String filename) {
            try {
                DOMParser dp = new DOMParser();
                dp.parse(filename);
                doc = dp.getDocument();
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

    /**************************************
    class NodeTypes encapsulates text names for the various
    node types. Its asAscii method returns those strings
    according to its nodeTypeNumber argument.
    It's a number to string translator.
    **************************************/
    class NodeTypes {
        private static String[] nodenames = {
            "ELEMENT", "ATTRIBUTE", "TEXT",
            "CDATA_SECTION", "ENTITY_REFERENCE",
            "ENTITY", "PROCESSING_INSTRUCTION",
            "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
            "DOCUMENT_FRAGMENT", "NOTATION"};
        public static String asAscii(int nodeTypeNumber) {
            return(nodenames[nodeTypeNumber-1]);
        }
    }

    /**************************************
    class WhiteSpaceKiller's job is to walk the DOM Document and
    delete any empty text nodes. The tree is walked non-recursively
    using standard DOM traversal methods. Once an empty text node
    is deleted, the "checker" is moved back to the previous node
    to avoid attempts at calling DOM traversal methods on the
    removed node.
    **************************************/
    class WhiteSpaceKiller {
        private Node checker;   // like a checker that gets moved from
                                // square to square in a checkers game
                                // points to "current" node

        WhiteSpaceKiller(Document doc) {
            boolean ascending = false;
            Node previousChecker = null;
            try {
                checker=doc.getDocumentElement();
                while (true) {
                    //*** TAKE ACTION ON NODE WITH CHECKER ***
                    if ((!ascending) && (checker.getNodeType() == Node.TEXT_NODE)) {
                        String trimmedText = checker.getNodeValue().trim();
                        // Test the content, not object identity (== "")
                        if (trimmedText.length() == 0) {
                            checker.getParentNode().removeChild(checker);
                            checker=previousChecker;  //back to undeleted node
                        }
                    }
                    previousChecker=checker;

                    //*** GO DOWN IF YOU CAN ***
                    if ((checker.hasChildNodes()) && (!ascending)) {
                        checker=checker.getFirstChild();
                        ascending = false;
                    }
                    //*** OTHERWISE GO RIGHT IF YOU CAN ***
                    else if (checker.getNextSibling() != null) {
                        checker=checker.getNextSibling();
                        ascending = false;
                    }
                    //*** OTHERWISE GO UP IF YOU CAN ***
                    else if (checker.getParentNode() != null) {
                        checker=checker.getParentNode();
                        ascending = true;
                    }
                    //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                    //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                    else {
                        break;
                    }
                }
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

    /**************************************
    class DOMwalker's job is to walk the DOM Document and
    print out each node's type, its name, its value
    (null for Elements), and in the case of elements,
    its attributes in parentheses. The tree is walked
    non-recursively using standard DOM traversal methods.
    **************************************/
    class DOMwalker {
        private Node checker;   // like a checker that gets moved from
                                // square to square in a checkers game
                                // points to "current" node

        private void indentToLevel(int level) {
            for(int n=0; n < level; n++) {
                System.out.print(" ");
            }
        }

        private void printAttributes(Node thisNode) {
            System.out.print("(");
            NamedNodeMap attribs = thisNode.getAttributes();
            int numAttribs = attribs.getLength();
            for(int i=0; i < numAttribs; i++) {
                Node attrib = attribs.item(i);
                if(i>0) { System.out.print(","); }
                System.out.print(attrib.getNodeName());
                System.out.print("=\"");
                System.out.print(attrib.getNodeValue());
                System.out.print("\"");
            }
            System.out.print(")");
        }

        private void printNodeInfo(Node thisNode) {
            System.out.print(NodeTypes.asAscii(thisNode.getNodeType()) + " : " +
                thisNode.getNodeName() + " : " +
                thisNode.getNodeValue() + " : ");
            if(thisNode.getNodeType() == Node.ELEMENT_NODE) {
                printAttributes(thisNode);
            }
            System.out.println();
        }

        public DOMwalker(Document doc) {
            boolean ascending = false;
            int level = 1;
            System.out.println();
            try {
                checker=doc.getDocumentElement();
                while (true) {
                    //*** TAKE ACTION ON NODE WITH CHECKER ***
                    if (!ascending) {
                        indentToLevel(level);
                        printNodeInfo(checker);
                    }

                    //*** GO DOWN IF YOU CAN ***
                    if ((checker.hasChildNodes()) && (!ascending)) {
                        checker=checker.getFirstChild();
                        ascending = false;
                        level++;
                    }
                    //*** OTHERWISE GO RIGHT IF YOU CAN ***
                    else if (checker.getNextSibling() != null) {
                        checker=checker.getNextSibling();
                        ascending = false;
                    }
                    //*** OTHERWISE GO UP IF YOU CAN ***
                    else if (checker.getParentNode() != null) {
                        checker=checker.getParentNode();
                        ascending = true;
                        level--;
                    }
                    //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                    //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                    else {
                        break;
                    }
                }
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

    /**************************************
    Class Hello is the repository of this program's main routine.
    It removes empty text nodes, then walks the DOM tree and
    prints out the DOM tree's info.
    **************************************/
    class Hello {
        public static void main(String[] args) {
            String filename = args[0];
            System.out.print("Walking XML file " + filename + " ... ");
            DocumentMaker docMaker = new DocumentMaker(filename);
            Document doc = docMaker.getDocument();
            try {
                WhiteSpaceKiller wpc = new WhiteSpaceKiller(doc);
                DOMwalker walker = new DOMwalker(doc);
            }
            catch (Exception e) {
                System.out.println("\nError: " + e.getMessage());
            }
        }
    }

In this Article You Have Learned

Using the "down if you can, otherwise right if you can, otherwise up if you can, otherwise done" (DRUD) algorithm to navigate a DOM document without recursion.
Using the ascending boolean variable to prevent infinite loops and excess operations.
Coding the DRUD algorithm in Java using the getFirstChild(), getNextSibling(), and getParentNode() methods.
Using the checker.getParentNode().removeChild(checker) syntax to delete the current node.
Saving the previous node to recover your "position" after the deletion.
Retrieving an element's attribute list with thisNode.getAttributes(), and iterating through the list using attribs.getLength() and attribs.item(index).

Steve Litt is the author of Troubleshooting Techniques of the Successful Technologist. He can be reached at slitt@troubleshooters.com.

Building a DOM Document From Scratch

By Steve Litt

In this Article You Will Learn

How to create an empty DOM Document.
Which new classes must be imported to create an empty DOM Document.
How to use DOM methods to fill the empty Document with the proper hierarchy of information.

We've built a DOM Document object from a file, parsed it, deleted blank nodes, and basically had our way with it. The one thing we haven't done is build one in memory from scratch. From-scratch building is necessary in order to save data out to XML, and also to use DOM as a tool for remembering out of order data in SAX apps. SAX apps are discussed in a later article.

Building a DOM Document from scratch isn't rocket science. Here are the steps:

1. Build a class or method to create an empty Document object
2. Build another class or method to fill the Document object with the desired elements with the desired attributes, and the desired text nodes.
3. Use the Document object the way you would any other DOM Document object.

In our case, #3 will simply be to walk the DOM Document.

#1: Build a class or method to create an empty Document object

!! CAUTION !!

Begin this exercise with an empty Hello.java file, or you'll paint yourself into a corner. If you want to save your current Hello.java, back it up before emptying it.

Start with an empty Hello.java file, and code the following class, which delivers a completely empty (not even a document Element) Document via its getDocument() method:

/**************************************
class EmptyDocumentMaker creates an
empty document
**************************************/
class EmptyDocumentMaker {
  private Document doc;
  public Document getDocument () {return(doc);}

  public EmptyDocumentMaker () {
    try {
      DocumentBuilderFactory docBuilderFactory =
        DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder =
        docBuilderFactory.newDocumentBuilder();
      doc = docBuilder.newDocument();
    } catch (Exception e) {
      System.out.println("\nError: " + e.getMessage());
    }
  }
}

The preceding code uses javax.xml.parsers.DocumentBuilderFactory and javax.xml.parsers.DocumentBuilder to create an empty document. Please remember where you saw this code, as it's extremely difficult to find sample code to create an empty DOM document.

#2: Build another class or method to fill the Document object with the desired elements with the desired attributes, and the desired text nodes.

The following code builds a Document in memory, using only those methods described in the actual DOM specification. Our main tools here are from the Document, Element and Node interfaces of the DOM spec: Node.appendChild(), Element.setAttribute(), Document.createElement(), Document.createTextNode(), Node.getParentNode(), and Document.getDocumentElement().

All elements are instantiated by the Document.createElement() method, and all text nodes are instantiated by the Document.createTextNode() method. Elements and text nodes are appended by the Node.appendChild() method. Note that the document element is added by running Node.appendChild() on the document itself. This works because the Document interface extends Node, so a Document can be thought of as a special kind of node.

We're creating a sort of data file of political figures, consisting of two "records" -- Alan Greenspan and George Bush. Each record has the first and last names as attributes of its element. Each "record" element has subelements called job and party, each of which contains a text node to describe the job or party, as appropriate.

All work is done with an Element object. In cases where methods return a Node object, we typecast them to Element. Of course, this works only if we're sure the nodes returned really are Element objects. Because we're building it ourselves, we have that assurance, assuming we've coded correctly.

There are better and more readable ways to accomplish what's done by the code below. Like everything in this Java/XML tutorial, there are Javaesque methods that look more like native Java and are much easier. Once again, I want to familiarize you with the use of the DOM API, not the Java language. A couple of weeks doing this stuff, combined with reading some good XML/Java books, will allow you to code this more efficiently and readably. So the following is the code for class DomFiller, which builds the desired hierarchy inside a formerly empty DOM Document object. Place this code below the code for the EmptyDocumentMaker class:

/**************************************
Class DomFiller takes an empty DOM
Document and fills it as a
demonstration of building a DOM
Document in memory.
**************************************/
class DomFiller {
  public DomFiller(Document doc) {
    try {
      //*** CREATE THE DOCUMENT ELEMENT ***
      doc.appendChild(doc.createElement("mytoplevel"));

      //*** CREATE THE FIRST PERSON RECORD ***
      Element elm = doc.getDocumentElement();  //Get to a known state
      elm = (Element)elm.appendChild(doc.createElement("person"));
      elm.setAttribute("fname","Alan");
      elm.setAttribute("lname","Greenspan");
      elm = (Element)elm.appendChild(doc.createElement("job"));
      elm.appendChild(doc.createTextNode("Federal Reserve Chairman"));
      elm = (Element)elm.getParentNode().appendChild(
        doc.createElement("party"));
      elm.appendChild(doc.createTextNode("Libertarian"));

      //*** CREATE THE SECOND PERSON RECORD ***
      elm = doc.getDocumentElement();  //Get to a known state
      elm = (Element)elm.appendChild(doc.createElement("person"));
      elm.setAttribute("lname", "Bush");
      elm.setAttribute("fname", "George");
      elm = (Element)elm.appendChild(doc.createElement("job"));
      elm.appendChild(doc.createTextNode("President"));
      elm = (Element)elm.getParentNode().appendChild(
        doc.createElement("party"));
      elm.appendChild(doc.createTextNode("Republican"));
    } catch (Exception e) {
      System.out.print("DomFiller: " + e.getMessage());
    }
  }
}

#3: Use the Document the way you would any other DOM Document

Copy the complete DOM walker program into your current Hello.java containing the EmptyDocumentMaker and DomFiller classes, above those two classes. The complete DOM walker is the one shown previously in this tutorial, with a top comment like this:

/*
 * Copyright (C) 2001 by Steve Litt
 *
 * COMPLETE DOM WALKER
 *
 */

Now add these two import statements to the program's list of import statements:

//*** NEXT 2 STATEMENTS CREATE EMPTY DOCUMENT ***
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

Next, delete the Hello class, go to the bottom, and insert the following version of the Hello class:

/**************************************
Class Hello is the repository of this
program's main routine. Its purpose is
to showcase building a DOM Document
object in memory. After doing that, it
invokes a DOM walker to prove that the
DOM Document contains the desired
material in the desired organization.
**************************************/
class Hello {
  public static void main(String[] args) {
    try {
      EmptyDocumentMaker emptyDocMaker = new EmptyDocumentMaker();
      Document doc = emptyDocMaker.getDocument();
      DomFiller df = new DomFiller(doc);
      DOMwalker walker = new DOMwalker(doc);
    } catch (Exception e) {
      System.out.println("\nError: " + e.getMessage());
    }
  }
}

The preceding code instantiates an EmptyDocumentMaker to make an empty document, then it instantiates a DomFiller, which fills the empty document with the desired political figures and their information. Finally, it instantiates a DOMwalker to walk the Document object. Here's what you get when you compile and run it:

$ ./jj
 ELEMENT : mytoplevel : null : ()
  ELEMENT : person : null : (fname="Alan",lname="Greenspan")
   ELEMENT : job : null : ()
    TEXT : #text : Federal Reserve Chairman :
   ELEMENT : party : null : ()
    TEXT : #text : Libertarian :
  ELEMENT : person : null : (fname="George",lname="Bush")
   ELEMENT : job : null : ()
    TEXT : #text : President :
   ELEMENT : party : null : ()
    TEXT : #text : Republican :
$

CAUTION!

Be sure to save your code once it works. You'll need it in the next exercise!

The preceding is pretty much what you'd expect. There's a root level element called <mytoplevel>, containing two person elements, one named Alan Greenspan and one named George Bush, as evidenced by their fname and lname attributes. Each person element contains a job element and a party element, and those two subelements contain a text node with the proper information.

Once again, there are easier, more readable, and more natively Java methods of accomplishing the preceding, but this shows off the DOM API documented by the W3C. You'll be using this technique later when you use SAX to process files whose information is ordered differently than the desired output.
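
To give a feel for what "more natively Java" might look like, here is a minimal sketch that builds the same hierarchy with two small helper methods. The class name PersonBuilderSketch and the helpers addTextChild() and addPerson() are made up for this illustration; they are not part of the DOM specification or of the article's code. Only the DOM calls already used above (createElement(), createTextNode(), setAttribute(), appendChild()) do the real work.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch class, not part of the article's Hello.java
class PersonBuilderSketch {
    // Hypothetical helper: appends a child element holding a single text node.
    static Element addTextChild(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
        return child;
    }

    // Hypothetical helper: appends one person "record" under the given parent.
    static Element addPerson(Document doc, Element parent,
                             String fname, String lname, String job, String party) {
        Element person = doc.createElement("person");
        person.setAttribute("fname", fname);
        person.setAttribute("lname", lname);
        addTextChild(doc, person, "job", job);
        addTextChild(doc, person, "party", party);
        parent.appendChild(person);
        return person;
    }

    // Builds the two-record document from this article in memory.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element top = doc.createElement("mytoplevel");
            doc.appendChild(top);
            addPerson(doc, top, "Alan", "Greenspan", "Federal Reserve Chairman", "Libertarian");
            addPerson(doc, top, "George", "Bush", "President", "Republican");
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.getDocumentElement().getChildNodes().getLength()
                + " person records");
    }
}
```

Keeping each new element in its own local variable avoids the repeated typecasts and getParentNode() hops of the DomFiller version, at the cost of hiding some of the raw DOM navigation this exercise was meant to show.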

In this Article You Have Learned

Using DocumentBuilderFactory and DocumentBuilder to create an empty DOM Document.
DocumentBuilderFactory and DocumentBuilder must be imported from the javax.xml.parsers package.

Node.appendChild(Node), Document.createElement(String), Document.createTextNode(String), and Element.setAttribute(String, String) are the main tools of the trade for building a DOM Document, in memory, with the proper hierarchy of information.

Steve Litt is the author of Rapid Learning: Secret Weapon of the Successful Technologist. He can be reached at slitt@troubleshooters.com.

Writing an XML File From a DOM Document

By Steve Litt

In this Article You Will Learn

How to modify typical DOMwalker logic to write an XML file from a DOM Document.
The element end tags are handled when the checker ascends back up to an element.

Start with the Hello.java from the preceding article. We're going to change class DOMwalker to print out an XML file instead of an informative outline. We'll display the code first, and discuss it following the code. The changes from the preceding article's code are marked in bold red. Here's the code:

class DOMwalker {
  private Node checker;  // like a checker that gets moved from
                         // square to square in a checkers game
                         // points to "current" node

  private void indentToLevel(int level) {
    System.out.println();
    for(int n=0; n < level; n++) {
      System.out.print(" ");
    }
  }

  private void printAttributes(Node thisNode) {
    NamedNodeMap attribs = thisNode.getAttributes();
    int numAttribs = attribs.getLength();
    for(int i=0; i < numAttribs; i++){
      Node attrib = attribs.item(i);
      System.out.print(" ");
      System.out.print(attrib.getNodeName());
      System.out.print("=\"");
      System.out.print(attrib.getNodeValue());
      System.out.print("\"");
    }
  }

  private void printNodeInfo(Node thisNode) {
    int nodeType = thisNode.getNodeType();
    if(nodeType == Node.ELEMENT_NODE) {
      System.out.print("<" + thisNode.getNodeName().trim());
      printAttributes(thisNode);
      System.out.print(">");
    } else {
      System.out.print(thisNode.getNodeValue());
    }
  }

  private void printEndTag(Node thisNode) {
    System.out.print("</" + thisNode.getNodeName() + ">");
  }

  public DOMwalker(Document doc) {
    boolean ascending = false;
    int level = 0;
    // System.out.println();
    try {
      checker=doc.getDocumentElement();
      System.out.println("<?xml version=\"1.0\"?>");
      while (true) {
        //*** TAKE ACTION ON NODE WITH CHECKER ***
        if (!ascending) {
          indentToLevel(level);
          printNodeInfo(checker);
        } else {
          if (checker.getNodeType() == Node.ELEMENT_NODE) {
            indentToLevel(level);
            printEndTag(checker);
          }
        }

        //*** GO DOWN IF YOU CAN ***
        if ((checker.hasChildNodes()) && (!ascending)) {
          checker=checker.getFirstChild();
          ascending = false;
          level++;
        }
        //*** OTHERWISE GO RIGHT IF YOU CAN ***
        else if (checker.getNextSibling() != null) {
          checker=checker.getNextSibling();
          ascending = false;
        }
        //*** OTHERWISE GO UP IF YOU CAN ***
        else if (checker.getParentNode() != null) {
          checker=checker.getParentNode();
          ascending = true;
          level--;
        }
        //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
        //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
        else {
          break;
        }
      }
      System.out.println();
    } catch (Exception e) {
      System.out.println("\nError: " + e.getMessage());
    }
  }
}

The preceding prints the DOM object as XML, nicely indented and formatted.

The only change to indentToLevel(int level) is that we print a linefeed before indenting. This is a handy way to make sure that all start tags, end tags, and text nodes are on their own line.

printAttributes(Node thisNode) is changed to the extent that there are no commas, and a couple other formatting changes. The original logic remains intact.

printNodeInfo(Node thisNode) has been changed so that it doesn't print extraneous information. On elements it prints the starting angle bracket, the node name, and then calls printAttributes() to print the attributes, and finally prints the closing angle bracket. On text nodes it simply prints the node's value. The logic is substantially unchanged.

DOMwalker(Document doc) has several minor changes. The level variable starts at 0 instead of 1, <?xml version="1.0"?> is printed at the top of the document, and a linefeed is printed at the bottom. And it has one major change -- the logic of the action. It now also takes action when the checker returns up to an element. The action, as you might expect, is to print a closing tag for the element. This is a slick way to avoid introducing a Stack to the logic :-).

The following shows what happens when you run this program:

$ ./jj
<?xml version="1.0"?>

<mytoplevel>
 <person fname="Alan" lname="Greenspan">
  <job>
   Federal Reserve Chairman
  </job>
  <party>
   Libertarian
  </party>
 </person>
 <person fname="George" lname="Bush">
  <job>
   President
  </job>
  <party>
   Republican
  </party>
 </person>
</mytoplevel>
$

That's it. You just wrote an XML file from a DOM Document.

Don't do this at work. The preceding example handles only elements and text nodes -- hardly the entirety of the XML specification. There are all sorts of pre-made classes to write XML files correctly, including DTD's. The preceding was simply intended to show that writing XML from a DOM Document isn't rocket science.
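
As one example of such a pre-made class, the JAXP transformation API (package javax.xml.transform) can serialize a DOM Document with an "identity" Transformer. The following is a minimal sketch, assuming a JDK or JAXP release that includes javax.xml.transform; the class name TransformerWriterSketch and the sample job text "R&D Director" are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch class, not part of the article's Hello.java
class TransformerWriterSketch {
    // Serializes any DOM Document to a String with an identity transform.
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> declaration to keep the output short
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds a tiny made-up document whose text contains an ampersand.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element top = doc.createElement("mytoplevel");
            doc.appendChild(top);
            Element job = doc.createElement("job");
            job.appendChild(doc.createTextNode("R&D Director"));
            top.appendChild(job);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

Note that the serializer escapes the ampersand as &amp;amp; in the output, one of the many details the hand-rolled DOMwalker version does not handle.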

In this Article You Have Learned

How to modify typical DOMwalker logic to write an XML file from a DOM Document.
The element end tags are handled when the checker ascends back up to an element.

Steve Litt is the developer of The Universal Troubleshooting Process troubleshooting courseware. He can be reached at slitt@troubleshooters.com.

Accessing DOM Elements and Attributes by Name

By Steve Litt

In this Article You Will Learn

How to get the list of all subelements with a particular name.
How to iterate through that subelement list.
How to get an attribute by name.

DOM walking is an impressive demonstration, but in reality yields very little power. The average app knows something about what it's looking for. It needs a specific piece of info, and walking the DOM to find it would be folly. So the DOM specification gives methods to find elements and attributes with specific names. Element method getElementsByTagName(String name) yields a NodeList containing all descendant elements (at any depth, not just immediate children) with the specified tag name. For a given element, getAttribute(String name) returns the value of the attribute whose name is the argument. getAttributeNode(String name) does the same thing, except it delivers the whole Attr object instead of just the value.

Start by creating the following XML file, which has been crafted just for this example (and is probably lousy XML because contractors, employees and partners are all people). Name the following XML file workers.xml:

<?xml version="1.0"?>
<workers>
  <contractor>
    <info lname="albertson" fname="albert" ssno="123456789"/>
    <job>C++ programmer</job>
    <hiredate>1/1/1999</hiredate>
  </contractor>
  <employee>
    <info lname="bartholemew" fname="bart" ssno="223456789"/>
    <job>Technology Director</job>
    <hiredate>1/1/2000</hiredate>
    <firedate>1/11/2000</firedate>
  </employee>
  <partner>
    <info lname="carlson" fname="carl" ssno="323456789"/>
    <job>labor law</job>
    <hiredate>10/1/1979</hiredate>
  </partner>
  <contractor>
    <info lname="denby" fname="dennis" ssno="423456789"/>
    <job>cobol programmer</job>
    <hiredate>1/1/1959</hiredate>
  </contractor>
  <employee>
    <info lname="edwards" fname="eddie" ssno="523456789"/>
    <job>project manager</job>
    <hiredate>4/4/1996</hiredate>
  </employee>
  <partner>
    <info lname="fredericks" fname="fred" ssno="623456789"/>
    <job>intellectual property law</job>
    <hiredate>10/1/1991</hiredate>
  </partner>
</workers>

Let's say you want to print out the last name of the contractors. Start by copying the complete DOM walker program to Hello.java. The complete DOM walker is the one shown previously in this tutorial, with a top comment like this:

/*
 * Copyright (C) 2001 by Steve Litt
 *
 * COMPLETE DOM WALKER
 *
 */

Next, create the following ContractorLastNamePrinter class:

/**************************************
class ContractorLastNamePrinter prints
the last names of contractors only.
It must be run on the workers.xml
example file.
**************************************/
class ContractorLastNamePrinter {
  ContractorLastNamePrinter(Document doc) {
    System.out.println();
    try {
      //*** GET DOCUMENT ELEMENT BY NAME ***
      NodeList nodelist = doc.getElementsByTagName("workers");
      Element elm = (Element) nodelist.item(0);

      //*** GET ALL contractors BELOW workers ***
      NodeList contractors = elm.getElementsByTagName("contractor");
      for(int i = 0; i < contractors.getLength(); i++) {
        Element contractor = (Element) contractors.item(i);

        //*** NO NEED TO ITERATE info ELEMENTS, ***
        //*** WE KNOW THERE'S ONLY ONE ***
        Element info =
          (Element)contractor.getElementsByTagName("info").item(0);
        System.out.println(
          "Contractor last name is " + info.getAttribute("lname"));
      }
    } catch (Exception e) {
      System.out.println(
        "ContractorLastNamePrinter() error: " + e.getMessage());
    }
  }
}

In the preceding code, elements are looked up by name, and the list of elements is iterated through.

Your final step is to instantiate ContractorLastNamePrinter instead of DOMwalker, as shown in the code below:

/**************************************
Class Hello is the repository of this
program's main routine. It removes
empty text nodes, then walks the DOM
tree and prints out the DOM tree's
info.
**************************************/
class Hello {
  public static void main(String[] args) {
    String filename = args[0];
    System.out.print("Walking XML file " + filename + " ... ");
    DocumentMaker docMaker = new DocumentMaker(filename);
    Document doc = docMaker.getDocument();
    try {
      WhiteSpaceKiller wpc = new WhiteSpaceKiller(doc);
      // DOMwalker walker = new DOMwalker(doc);
      ContractorLastNamePrinter cPrinter =
        new ContractorLastNamePrinter(doc);
    } catch (Exception e) {
      System.out.println("\nError: " + e.getMessage());
    }
  }
}

Now run the program against workers.xml, and as predicted, it prints the last name of the contractors and nobody else.

$ ./jj workers.xml
Walking XML file workers.xml ...
Contractor last name is albertson
Contractor last name is denby
$

Obviously the preceding was written as an illustration rather than as real Java code. It instantiates too many objects, and it doesn't combine steps that could have been combined. But it's pretty clear how the preceding code uses getElementsByTagName() and getAttribute() to print only the desired information.

You can optimize your code as appropriate. The important thing is that you understand the workings of getElementsByTagName(), getAttributeNode() and getAttribute().
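
The ContractorLastNamePrinter example never calls getAttributeNode(), so here is a minimal sketch contrasting it with getAttribute(). The class name LookupSketch, its helper methods, and the shortened two-worker XML string are made up for this illustration, and the sketch assumes the JAXP parser bundled with a current JDK. It also shows that getElementsByTagName() finds the info elements even though they are grandchildren of workers.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class, not part of the article's Hello.java
class LookupSketch {
    // A cut-down stand-in for workers.xml
    static final String XML =
        "<workers>"
      + "<contractor><info lname=\"albertson\" fname=\"albert\"/></contractor>"
      + "<employee><info lname=\"bartholemew\" fname=\"bart\"/></employee>"
      + "</workers>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the first info element found anywhere below the document element.
    static Element firstInfo(Document doc) {
        NodeList infos = doc.getDocumentElement().getElementsByTagName("info");
        return (Element) infos.item(0);
    }

    public static void main(String[] args) {
        Document doc = parse(XML);

        // Both info elements are found, although neither is a direct child of workers
        NodeList infos = doc.getDocumentElement().getElementsByTagName("info");
        System.out.println("info elements found: " + infos.getLength());

        Element info = firstInfo(doc);

        // getAttribute() hands back just the value as a String
        System.out.println("value: " + info.getAttribute("lname"));

        // getAttributeNode() hands back the Attr object, which knows its own name
        Attr attr = info.getAttributeNode("lname");
        System.out.println("node:  " + attr.getName() + "=" + attr.getValue());

        // For a missing attribute, getAttribute() returns an empty string
        // while getAttributeNode() returns null
        System.out.println("missing value is empty: " + info.getAttribute("ssno").isEmpty());
        System.out.println("missing node is null:   " + (info.getAttributeNode("ssno") == null));
    }
}
```

The last two lines matter in practice: an empty string from getAttribute() cannot distinguish "attribute absent" from "attribute present but empty", whereas a null check on getAttributeNode() can.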

In this Article You Have Learned

myElement.getElementsByTagName("contractor") returns a list (NodeList) of all elements named "contractor" anywhere below myElement.
Use NodeList.getLength() and NodeList.item(int) to iterate through the list of same named elements.
Use myElement.getAttribute("lname") to get the value of the lname attribute of myElement. Use getAttributeNode("lname") to get the attribute object itself.

Steve Litt is the main author of Samba Unleashed. He can be reached at slitt@troubleshooters.com.

SAX

By Steve Litt

In this Article You Will Learn

What SAX is.
SAX is an alternative to DOM.
When to use SAX and when to use DOM.
To preserve the node hierarchy in a SAX app, use a Stack.
Classes that must be imported to use SAX.
How SAX uses callbacks to parse an XML file.
The ContentHandler and ErrorHandler interfaces, and the callbacks they contain.
Using per-record DOM Documents to achieve the best of both worlds.

As a junior programmer in the days when 128 kilobytes was the most RAM you could expect on the company minicomputer, I got a specification change report for our batch insurance printing program. It sounded tiny to the DP manager, but it resulted in major structural changes. This program printed the forms you get after going to the doctor. "All we want," said the DP manager, "is to have the total at the top of the page instead of the bottom."

I laughed hysterically. "How do you expect me," I exclaimed, "to know the total before reading the line items?" I asked her if she had some tea leaves I could read to predict the future.

I finally did it with a sortable intermediate file. Each sheet's line items remained in order, but each page's total sorted to the top of its sheet. I printed straight off the intermediate file. In the days when you had to fight for every kilobyte, that was probably the best solution.

Of course, I could have kept all line-item info for the page in memory, with break logic triggering a complete calculation and print. I could have simulated the 80 characters per line and 66 lines as a 5280-byte two-dimensional array, in which case I could have "moved the paper backwards". But back then, RAM was too dear. Anything more than a small array was written out to temporary files.

This story introduces the difference between DOM and SAX. DOM keeps the entire XML file in memory, ready for instant and random access. As in all other computations, when you can afford the RAM, keeping information in RAM makes your programming task much easier. But if your XML file is a gigabyte long, DOM isn't an option. Additionally, if you don't know how big the XML file will eventually be, DOM is a bad idea. Use DOM when the likelihood of memory exhaustion is nil.

SAX is a parsing methodology, plain and simple. A SAX parser reads an XML file, and every time it runs across an XML tag or other entity, it reports it.

How does SAX report the tag or other entity? It calls a callback routine supplied by the application programmer. The programmer loads the callback routine with code to process the information. For instance, the callback routine for an element's start tag would inquire about the element's attributes. In the callback routines, the programmer saves what must be saved, and keeps track of the hierarchical nature of the XML file. For instance, three element start tags without an element end tag means the three are at descending levels.

This is a lot of busywork for the programmer, so the obvious question is "why not use DOM?". The answer is usually "we can't afford the memory", or "we don't know how big this thing will end up being". SAX stores nothing. It can parse a terabyte file a little bit at a time. You can do anything with SAX, but DOM is reserved for known-resource, small footprint XML hierarchies.
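
To make the hierarchy bookkeeping concrete, here is a minimal sketch of a handler that keeps the currently open elements on a stack: push on each start tag, pop on each end tag. The class name PathHandler and its pathsOf() helper are made up for illustration. The sketch extends the standard org.xml.sax.helpers.DefaultHandler so only two callbacks need coding, uses java.util.ArrayDeque in the role of a Stack, and assumes the SAX parser bundled with a current JDK (obtained through javax.xml.parsers.SAXParserFactory).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch class: records the full path of every element seen
class PathHandler extends DefaultHandler {
    // Names of the elements that have been started but not yet ended
    private final Deque<String> open = new ArrayDeque<String>();
    final List<String> paths = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);                        // one level deeper
        paths.add("/" + String.join("/", open));    // root-first path to this element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();                          // back up one level
    }

    // Parses XML held in a string and returns the element paths in document order.
    static List<String> pathsOf(String xml) {
        try {
            PathHandler handler = new PathHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.paths;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints: [/a, /a/b, /a/b/c, /a/d]
        System.out.println(pathsOf("<a><b><c/></b><d/></a>"));
    }
}
```

At any moment the size of the stack is the current depth, and its contents are the ancestors of whatever the parser is reporting, which is exactly the context a DOM tree would otherwise have kept for you.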

SAX Hello World

Note: Much of this information is gleaned from the SAX 2.0 page at http://www.megginson.com/SAX/index.html. That site is a must-read for anyone doing serious SAX work.

So let's make a proof of concept SAX program to report the document element. The first step is to make sure we even have the SAX API for Java. Fortunately, that should have been included with the xerces.jar that we downloaded. To test, try this Hello.java:

import java.io.IOException;
import org.xml.sax.XMLReader;

class Hello {
  public static void main(String[] args) {
    System.out.println("Hello " + args[0] + "!");
  }
}

$ ./jj firstarg
Hello firstarg!
$

If you get error messages, investigate the $CLASSPATH setting in your jj script, and whether you really downloaded and properly extracted xerces.jar. Once you get the program running, it's time to try a ghost parser.

The ghost parser gets its name from the fact that it outputs nothing. Once again, it's a test to make sure your SAX API is downloaded and working. Code for the ghost parser follows:

import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.apache.xerces.parsers.SAXParser;

class Hello {
  public static void main(String[] args) {
    System.out.print("parsing " + args[0] + "... ");
    try {
      XMLReader parser = new SAXParser();
      parser.parse(args[0]);
    } catch (IOException e) {
      System.out.print("IO Exception: " + e.getMessage());
    } catch (SAXException e) {
      System.out.print("SAX Exception: " + e.getMessage());
    }
    System.out.println("Done!");
  }
}

Now compile and run it:

$ ./jj blank.xml
parsing blank.xml... Done!
$

The preceding parsed the file (we assume), but because there were no callbacks, it did nothing. The next step is to write a class that implements the ContentHandler interface to define callbacks, instantiate that class from the Hello class, and link it with the SAXParser. The following code implements such a class (called MyContentHandler in the code), and links the handler object to the parser object:

import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;

class MyContentHandler implements ContentHandler {
  // Receive notification of character data.
  public void characters(char[] ch, int start, int length) {
  }

  // Receive notification of the end of a document.
  public void endDocument() {
  }

  // Receive notification of the end of an element.
  public void endElement(java.lang.String namespaceURI,
      java.lang.String localName, java.lang.String qName) {
  }

  // End the scope of a prefix-URI mapping.
  public void endPrefixMapping(java.lang.String prefix) {
  }

  // Receive notification of ignorable whitespace in element content.
  public void ignorableWhitespace(char[] ch, int start, int length) {
  }

  // Receive notification of a processing instruction.
  public void processingInstruction(
      java.lang.String target, java.lang.String data ) {
  }

  // Receive an object for locating the origin of SAX document events.
  public void setDocumentLocator(Locator locator) {
  }

  // Receive notification of a skipped entity.
  public void skippedEntity(java.lang.String name) {
  }

  // Receive notification of the beginning of a document.
  public void startDocument() {
  }

  // Receive notification of the beginning of an element.
  public void startElement(java.lang.String namespaceURI,
      java.lang.String localName, java.lang.String qName,
      Attributes atts) {
    System.out.println(localName);  //<<=====PRINT ELEMENT NAME
  }

  // Begin the scope of a prefix-URI Namespace mapping.
  public void startPrefixMapping(java.lang.String prefix,
      java.lang.String uri) {
  }
}

class Hello {
  public static void main(String[] args) {
    System.out.print("parsing " + args[0] + "... ");
    try {
      XMLReader parser = new SAXParser();
      ContentHandler handler = new MyContentHandler();
      parser.setContentHandler(handler);
      parser.parse(args[0]);
    } catch (IOException e) {
      System.out.print("IO Exception: " + e.getMessage());
    } catch (SAXException e) {
      System.out.print("SAX Exception: " + e.getMessage());
    }
    System.out.println("Done!");
  }
}

In the preceding, the only parsing action implemented in MyContentHandler is to print each element's name (see the PRINT ELEMENT NAME comment in startElement()). All the rest of the methods are just stubs. The list of methods was taken directly from the authoritative documentation at http://www.megginson.com/SAX/Java/javadoc/org/xml/sax/ContentHandler.html. The MyContentHandler object is created in the main routine, and then linked to the parser via the parser.setContentHandler() method. Everything added to the previous program is marked in bold red for your understanding.

Create the following simple.xml to test this program:

<?xml version="1.0"?>
<toplevel>
  <secondlevel>
    This is a text node within the second level.
  </secondlevel>
</toplevel>

When you compile and run the program it should simply list the elements:

$ ./jj simple.xml
parsing simple.xml... toplevel
secondlevel
Done!
$

Notice that the printing of the element names occurs between the first parsing simple.xml... and the Done! prompts. The element printing occurs as part of the parsing process, in the callback routine. Now it's time to make a SAX tree walker...

A SAX Tree Walker

We actually have already done most of the work to make a SAX tree walker. All that's necessary is to add a private variable to record the level (for indentation), and to code some of the callback routines in MyContentHandler.

As far as level goes, it's incremented by the last statement in startElement() and decremented by endElement(). Even though characters() uses level to indent, characters() doesn't change it, but instead, indents one indent past the current level (text nodes are children of their parent elements).

The characters() callback tests for an all whitespace string, and if it isn't all whitespace, prints it. Note that characters() doesn't receive a String, but instead an array of characters with a start point and a length. The "string" in question consists of length characters beginning at index start of the array. SAX does this for performance reasons. It's the programmer's job to move those characters into a String object with the new String(characterArray, start, length) constructor.
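
One more point about characters(): the SAX documentation allows a parser to deliver a single run of text in several chunks, so real applications usually accumulate the chunks in a buffer and process them at the next start or end tag. The following is a minimal sketch of that idea, separate from the tree walker in this article. The class name TextCollector and its collect() helper are made up for illustration, and the sketch assumes the SAX parser bundled with a current JDK rather than the Xerces SAXParser class used elsewhere in this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch class: gathers each non-blank run of text, trimmed
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<String>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy only the slice we were given; the rest of ch is not ours to read
        buffer.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    // Stores the buffered text if it is more than whitespace, then clears the buffer.
    private void flush() {
        String s = buffer.toString().trim();
        if (s.length() > 0) {
            texts.add(s);
        }
        buffer.setLength(0);
    }

    // Parses XML held in a string and returns the collected text runs.
    static List<String> collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<toplevel>\n <secondlevel>\n  Profit &amp; loss\n </secondlevel>\n</toplevel>"));
    }
}
```

Whether the text around the &amp;amp; entity arrives in one chunk or three depends on the parser, which is why the sketch buffers until the next tag instead of treating each characters() call as a complete text node.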

The startElement() callback prints the proper indent, prints the type (which is always ELEMENT because this callback is called only for elements), then the name, then the value (which is always null, so hardcoded to novalue). Finally, startElement()'s atts argument is iterated, via the SAX Attributes interface methods, and printed.

The added and changed code is in bold red. The following is the SAX tree walker code:

import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;

class MyContentHandler implements ContentHandler {
  private int level = 0;

  // Receive notification of character data.
  public void characters(char[] ch, int start, int length) {
    String s = new String(ch, start, length);
    // Test the content; != "" would compare object references
    if (s.trim().length() > 0) {
      for(int i=0; i < level + 1; i++) {System.out.print(" ");}
      System.out.print("TEXT : noname : ");
      System.out.println(s);
    }
  }

  // Receive notification of the end of a document.
  public void endDocument() {
  }

  // Receive notification of the end of an element.
  public void endElement(java.lang.String namespaceURI,
      java.lang.String localName, java.lang.String qName) {
    level--;
  }

  // End the scope of a prefix-URI mapping.
  public void endPrefixMapping(java.lang.String prefix) {
  }

  // Receive notification of ignorable whitespace in element content.
  public void ignorableWhitespace(char[] ch, int start, int length) {
  }

  // Receive notification of a processing instruction.
  public void processingInstruction(
      java.lang.String target, java.lang.String data ) {
  }

  // Receive an object for locating the origin of SAX document events.
  public void setDocumentLocator(Locator locator) {
  }

  // Receive notification of a skipped entity.
  public void skippedEntity(java.lang.String name) {
  }

  // Receive notification of the beginning of a document.
  public void startDocument() {
  }

  // Receive notification of the beginning of an element.
  public void startElement(java.lang.String namespaceURI,
      java.lang.String localName, java.lang.String qName,
      Attributes atts) {
    for(int i=0; i < level; i++) {System.out.print(" ");}
    System.out.print("ELEMENT : ");
    System.out.print(localName);
    System.out.print(" : novalue : ");
    System.out.print("(");
    for(int i=0; i < atts.getLength(); i++) {
      if(i > 0) {System.out.print(",");}
      System.out.print(atts.getLocalName(i) + "=\"" +
        atts.getValue(i) + "\"");
    }
    System.out.print(")");
    System.out.println();
    level++;
  }

  // Begin the scope of a prefix-URI Namespace mapping.
  public void startPrefixMapping(java.lang.String prefix,
      java.lang.String uri) {
  }
}

class Hello {
  public static void main(String[] args) {
    System.out.println("parsing " + args[0] + "... ");
    try {
      XMLReader parser = new SAXParser();
      ContentHandler handler = new MyContentHandler();
      parser.setContentHandler(handler);
      parser.parse(args[0]);
    } catch (IOException e) {
      System.out.print("IO Exception: " + e.getMessage());
    } catch (SAXException e) {
      System.out.print("SAX Exception: " + e.getMessage());
    }
    System.out.println("Done!");
  }
}

Compile and run the program. On simple.xml it should look something like this:

$ ./jj simple.xml
parsing simple.xml...
ELEMENT : toplevel : novalue : ()
 ELEMENT : secondlevel : novalue : ()
   TEXT : noname :
    This is a text node within the second level.

Done!
$

Note that the actual text is a line below its node type and name. That's because it includes the newline and spaces of the original XML. If you don't like this, simply use the String.trim() method to clip border whitespace.

Try running this program on a longer file like the blank.xml file you made in the Dia exercises earlier in this magazine, or on some other complex XML file. Notice the attributes, and the faithful adherence to the XML file's hierarchy.

A SAX Explorer

As long as we've gone this far, let's code all the callbacks to see when they're called, and what they're passed. This gives us a SAX explorer to tell us about a file, and about SAX itself. The code follows, with added and changed code in bold red:

import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;

class MyContentHandler implements ContentHandler {
    private int level = 0;

    public static void indent(int llevel) {
        for(int i=0; i < llevel; i++) {System.out.print(" ");}
    }

    // Receive notification of character data.
    public void characters(char[] ch, int start, int length) {
        String s = new String(ch, start, length);
        // Test the trimmed length; comparing with != "" would compare
        // object references, not string contents.
        if (s.trim().length() > 0) {
            indent(level + 1);
            System.out.print("TEXT : noname : ");
            System.out.println(s);
        }
    }

    // Receive notification of the end of a document.
    public void endDocument() {
        indent(level);
        System.out.println("endDocument()");
    }

    // Receive notification of the end of an element.
    public void endElement(java.lang.String namespaceURI,
                           java.lang.String localName,
                           java.lang.String qName) {
        level--;
        indent(level);
        System.out.println("endElement(namespaceURI=" + namespaceURI +
            ",localName=" + localName + ",qName=" + qName +")");
    }

    // End the scope of a prefix-URI mapping.
    public void endPrefixMapping(java.lang.String prefix) {
        indent(level);
        System.out.println("endPrefixMapping(prefix=" + prefix +")");
    }

    // Receive notification of ignorable whitespace in element content.
    public void ignorableWhitespace(char[] ch, int start, int length) {
        indent(level);
        String s = new String(ch, start, length);
        System.out.println("ignorableWhitespace(string=" + s + ")");
    }

    // Receive notification of a processing instruction.
    public void processingInstruction(java.lang.String target,
                                      java.lang.String data) {
        indent(level);
        System.out.println("processingInstruction(target=" + target +
            ",data=" + data + ")");
    }

    // Receive an object for locating the origin of SAX document events.
    public void setDocumentLocator(Locator locator) {
        indent(level);
        System.out.println("setDocumentLocator()");
    }

    // Receive notification of a skipped entity.
    public void skippedEntity(java.lang.String name) {
        indent(level);
        System.out.println("skippedEntity(name=" + name + ")");
    }

    // Receive notification of the beginning of a document.
    public void startDocument() {
        indent(level);
        System.out.println("startDocument()");
    }

    // Receive notification of the beginning of an element.
    public void startElement(java.lang.String namespaceURI,
                             java.lang.String localName,
                             java.lang.String qName,
                             Attributes atts) {
        indent(level);
        System.out.print("ELEMENT : ");
        System.out.print(localName);
        System.out.print(" : novalue : ");
        System.out.print("(");
        for(int i=0; i < atts.getLength(); i++) {
            if(i > 0) {System.out.print(",");}
            System.out.print(atts.getLocalName(i) + "=\"" + atts.getValue(i) + "\"");
        }
        System.out.print(")");
        System.out.println();
        level++;
    }

    // Begin the scope of a prefix-URI Namespace mapping.
    public void startPrefixMapping(java.lang.String prefix,
                                   java.lang.String uri) {
        indent(level);
        System.out.println("startPrefixMapping(prefix=" + prefix +
            ",uri=" + uri + ")");
    }
}

class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}

Adding an Error Handler to Your SAX Explorer

Errors happen. In a production environment, parsing errors must be caught and handled. The SAX API defines an error handling interface, ErrorHandler, to catch them.

Start with the code for your SAX explorer in the preceding section of this article. Next, add the following to your import statements:

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

The preceding lines import the ErrorHandler interface and the SAXParseException passed to its callbacks. Next, code class MyErrorHandler, which prints a simple message for each of ErrorHandler's three callbacks, error(), fatalError(), and warning().

class MyErrorHandler implements ErrorHandler {
    public void error(SAXParseException exception) {
        System.out.println("SAX nonfatal error: " + exception.getMessage());
    }
    public void fatalError(SAXParseException exception) {
        System.out.println("SAX fatal error: " + exception.getMessage());
    }
    public void warning(SAXParseException exception) {
        System.out.println("SAX warning: " + exception.getMessage());
    }
}

Finally, "hook up" your new error handler by adding the two bolded (and brown if you're looking at a color browser) lines between the parser.setContentHandler(handler) statement and the parser.parse(args[0]) line.

class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            ErrorHandler errHandler = new MyErrorHandler();
            parser.setErrorHandler(errHandler);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}

Save, compile and run on blank.xml. You'll notice that the output is the same as before you added the ErrorHandler code. That's because there are no errors or warnings. But make a copy of blank.xml, remove the quotes around the value of any attribute, and run your code against that XML file. You'll see your fatal error message print just before the program aborts due to the fatal error.
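If you'd rather not edit a copy of blank.xml, the same experiment can be sketched with the XML held in a string. This is only a sketch under some assumptions: it obtains its XMLReader from the JDK's javax.xml.parsers.SAXParserFactory instead of the org.apache.xerces.parsers.SAXParser class used in this article, and the names RecordingErrorHandler and ErrorHandlerDemo are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variation on MyErrorHandler: it records each callback
// (with the line number) instead of printing it straight away.
class RecordingErrorHandler implements ErrorHandler {
    final List<String> events = new ArrayList<String>();

    public void warning(SAXParseException e) {
        events.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
    }
    public void error(SAXParseException e) {
        events.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
    }
    public void fatalError(SAXParseException e) {
        events.add("fatal at line " + e.getLineNumber() + ": " + e.getMessage());
    }
}

class ErrorHandlerDemo {
    // Parses XML held in a string and returns whatever the handler recorded.
    static List<String> parse(String xml) {
        RecordingErrorHandler errHandler = new RecordingErrorHandler();
        try {
            // Assumption: the JDK's JAXP factory stands in for "new SAXParser()".
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler());
            parser.setErrorHandler(errHandler);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // A fatal error ends the parse with an exception,
            // after the fatalError() callback has run.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errHandler.events;
    }

    public static void main(String[] args) {
        // The attribute value is missing its quotes, as in the experiment above.
        for (String event : parse("<doc>\n<item size=big></item>\n</doc>")) {
            System.out.println(event);
        }
    }
}
```

Recording the events rather than printing them makes it easy to decide afterwards whether the run should be treated as a failure.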

SAX with Per Record DOMs

Did you notice the SAX tree walker was easier to code than the DOM walker? Was I lying when I said DOM programming was easier than SAX programming?

No, I was telling the truth, but a hierarchy walker is one of those lucky programs where the output follows the input. Remember the discussion of the batch insurance program in which I had to put the totals on top of the page, before receiving the line items making up that total? Those are the types of circumstances in which SAX programming becomes a challenge. The programmer must find a way of "remembering" the former information until outputting the later information.

All the same techniques are available. An intermediate file can be created and then sorted in such a way that the information appears in the order needed. The file can be traversed multiple times. Memory can be set aside to hold essential data.

Typically, the reason XML files get too big for DOM is that they have many, many "records" (for want of a better word). Consider the following trivially simple invoice data, which you should save as invoices.xml:

<?xml version="1.0"?>
<invoices>
  <invoice>
    <lineitem item="widget" price="21.41" quantity="4"></lineitem>
    <lineitem item="mousetrap" price="2.11" quantity="14"></lineitem>
    <customer>Garcia, Maria</customer>
    <lineitem item="wrench" price="8.88" quantity="3"></lineitem>
  </invoice>
  <invoice>
    <lineitem item="mouse" price="7.41" quantity="84"></lineitem>
    <customer>Smith, John</customer>
    <lineitem item="mousepad" price="0.91" quantity="184"></lineitem>
  </invoice>
</invoices>

This would have been just perfect for a simple SAX app, except that the specifications call for the total to print first, then the customer name, and then the lineitems. Oh, and by the way, there's no telling whether the <customer> element will be before the lineitems, after them, or even tucked in between them.

This is seriously ugly, and obviously horrible XML design. But I designed it to showcase the use of per-record DOM documents. Obviously, a simple line for line loop won't work here.

Nor will placing the entire file in a DOM document. According to the specifications, there can be up to 100,000 invoices in a single invoices.xml file. It just so happens that this example has only 2 invoices.

The entire file might be too big for a DOM document, but the single invoices certainly are not. The ContentHandler implementation can repeatedly start with an empty DOM document, fill it with one invoice's data, total that invoice, and then print the lineitems and other information.

Before showing you the code, I'd like to explain the program at a high level. This is basically a SAX program, with an implementation of ContentHandler defining the necessary callbacks. Of all the ContentHandler callbacks, this code's MyContentHandler class uses only the following:

startDocument()
startElement()
characters()
endElement()

The MyContentHandler class contains a DOM document class variable to hold an invoice's information until it can be output, and it contains a Stack class variable to track which parent node each new element and text node should be inserted under. The DOM document class variable is emptied after the printing of each invoice, to make room for the info from the next invoice. The startDocument() callback, which happens before any element or text node callbacks, instantiates the DOM document and the Stack object.

Basically, at all levels at and below <invoice>, the elements and text nodes read by the SAX parser are inserted into a DOM document, in the proper order and hierarchy. The SAX parser loads the DOM document, but it "zeros out" the DOM document at every </invoice> tag, and starts building fresh with every <invoice> tag. Because the DOM document starts out empty, there's no special logic for the first time. As a matter of fact, the SAX API is very break logic friendly. Everything has a begin and an end, so there's no need for priming this or after-the-loop that, or keeping track of whether it's been through the loop before.

Here's the logic at the highest level:

startDocument

Instantiate the DOM document and the Stack object.

startElement

If the SAX callback has delivered an <invoice>, insert it in the new DOM doc and push it on the stack. Otherwise, insert it into the DOM doc below the last thing on the stack and then push it on the stack. All attributes passed to the SAX callback are added to the element.

characters

Insert it into the DOM doc below the last thing on the stack.

endElement

If it's an </invoice>, run the method to print the invoice that just ended, and then "zero out" the DOM doc. If it's not an </invoice>, pop it off the stack, because its days of parenthood have come to an end.

Let's talk about the Stack object. It helps faithfully represent the XML hierarchy. If a start tag for element Y appears between the start and end tag of element X, the XML spec says element Y is a child of element X, and should be inserted below element X in the DOM document. And in fact, when the start of Y is detected, X will be next to be popped off the stack, because the startElement() for X pushed it, but the endElement() for X has not been encountered yet, and therefore has not popped it.

Now suppose that the start tag of Y happens after the end tag of X. The endElement of X will have already popped X off the stack, so the next element to be popped won't be X, but in fact it will be the parent of X. What if X has no parents? In this program, that means that X is an <invoice> element, and it is treated specially so that we don't pop off an empty stack.

In summary, using a stack guarantees the XML hierarchy will be faithfully reproduced in the per-invoice DOM document.
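To see the stack idea on its own, before it gets mixed with DOM calls, here is a small sketch that only reports each element's parent. It is not the article's program: PathHandler and StackDemo are hypothetical names, the parser comes from the JDK's SAXParserFactory rather than the Xerces SAXParser class, and because that factory is not namespace-aware by default the sketch reads qName instead of localName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: the top of the stack is always the element
// whose start tag has been seen but whose end tag has not.
class PathHandler extends DefaultHandler {
    private final Stack<String> open = new Stack<String>();
    final List<String> report = new ArrayList<String>();

    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Whatever is on top of the stack right now is the parent.
        String parent = open.empty() ? "(none)" : open.peek();
        report.add(qName + " is a child of " + parent);
        open.push(qName);   // this element may become a parent itself
    }

    public void endElement(String uri, String localName, String qName) {
        open.pop();         // its days of parenthood have come to an end
    }
}

class StackDemo {
    static List<String> parents(String xml) {
        PathHandler handler = new PathHandler();
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.report;
    }

    public static void main(String[] args) {
        for (String line : parents("<top><x><y/></x><z/></top>")) {
            System.out.println(line);
        }
    }
}
```

In the sample input, z starts after the end tag of x, so by then x has already been popped and z is reported under top, which is the second case described above.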

I could go on talking about this for a long time, but instead I'll show you the code. As you look at the code, please keep in mind that it uses just the DOM methods and SAX API that we've already discussed, plus the Stack object, for which we use the push(), peek(), pop(), and empty() methods. empty() returns whether or not the stack is empty. This is important, because it's how we detect whether an element is above <invoice> in the hierarchy.

Here's the code:

import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;
import org.w3c.dom.*;                    // DOM interface
import java.util.Stack;                  // For Element Stack
import java.util.EmptyStackException;    // Stack exception

//*** NEXT 2 LINES NECESSARY TO CREATE EMPTY DOCUMENT ***
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

/**************************************
class EmptyDocumentMaker creates an empty document
**************************************/
class EmptyDocumentMaker {
    private Document doc;
    public Document getDocument () {return(doc);}
    public EmptyDocumentMaker () {
        try {
            DocumentBuilderFactory docBuilderFactory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder =
                docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.newDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

class MyContentHandler implements ContentHandler {
    private Document doc;
    private Stack elementStack;

    /**********************************
    Following methods are added to those of ContentHandler
    They implement intermediate DOM document handling,
    invoice printing, and the like.
    **********************************/
    private void createEmptyDocument() {
        try {
            DocumentBuilderFactory docBuilderFactory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder =
                docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.newDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }

    private void resetDocument() {
        try {
            Element docElm = doc.getDocumentElement();
            docElm.getParentNode().removeChild(docElm);
            docElm = null;
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }

    private void printInvoice() {
        NodeList lineItems = doc.getDocumentElement().
            getElementsByTagName("lineitem");

        //*** PRINT TOP SEPARATOR ***
        System.out.println();
        System.out.print("===============================================");
        System.out.println();

        //*** CALCULATE THE TOTAL ***
        float total = (float)0.0;
        for(int i = 0; i < lineItems.getLength(); i++) {
            Element lineItemElm = (Element)lineItems.item(i);
            int quantity = Integer.valueOf(
                lineItemElm.getAttribute("quantity")).intValue();
            float unitPrice = Float.valueOf(
                lineItemElm.getAttribute("price")).floatValue();
            total += ((float)quantity * unitPrice);
        }

        //*** PRINT THE TOTAL ***
        System.out.println("Invoice total=" + total);

        //*** PRINT THE CUSTOMER ***
        Element custElm = (Element)(doc.getDocumentElement().
            getElementsByTagName("customer").
            item(0));
        String custName = custElm.getFirstChild().getNodeValue();
        System.out.println("Customer=" + custName);

        //*** PRINT THE LINE ITEMS ***
        System.out.println();
        System.out.println("Items purchased...");
        for(int i = 0; i < lineItems.getLength(); i++) {
            Element lineItemElm = (Element)lineItems.item(i);
            // Pad the item name with 20 spaces, then cut it back to a
            // 20 character column. The pad must be at least 20 long or
            // substring(0,20) fails on short names.
            System.out.print(
                (lineItemElm.getAttribute("item") + "                    ").
                substring(0,20)
            );
            int quantity = Integer.valueOf(
                lineItemElm.getAttribute("quantity")).intValue();
            float itemPrice = Float.valueOf(
                lineItemElm.getAttribute("price")).floatValue();
            float itemTotal = (float)quantity * itemPrice;
            System.out.println(
                quantity + " @ " + itemPrice + " = " + itemTotal);
        }
        System.out.println();

        //*** PRINT BOTTOM SEPARATOR ***
        System.out.print("===============================================");
        System.out.println();
    }

    /**********************************
    Following methods modify those of ContentHandler
    **********************************/
    public void startDocument() {
        try {
            elementStack = new Stack();
            this.createEmptyDocument();
        } catch (Exception e) {
            System.out.println("startDocument error: " + e.getMessage());
        }
    }

    public void startElement(java.lang.String namespaceURI,
                             java.lang.String localName,
                             java.lang.String qName,
                             Attributes atts) {
        try {
            if(localName.equals("invoice")) {
                doc.appendChild(doc.createElement("invoice"));
                elementStack.push(doc.getDocumentElement());
            } else if(!elementStack.empty()) {
                Element parentElm = (Element)elementStack.peek();
                Element tempElm = (Element)
                    (parentElm.appendChild(doc.createElement(localName)));
                for(int i=0; i < atts.getLength(); i++){
                    tempElm.setAttribute(atts.getLocalName(i),
                                         atts.getValue(i));
                }
                elementStack.push(tempElm);
            }
        } catch (EmptyStackException e) {
            System.out.println("startElement stack error: " + e.getMessage());
            System.out.println("startElement localName=" + localName);
        } catch (Exception e) {
            System.out.println("startElement error: " + e.getMessage());
        }
    }

    public void characters(char[] ch, int start, int length) {
        try {
            if(!elementStack.empty()) {
                String s = new String(ch, start, length);
                Element parentElm = (Element)elementStack.peek();
                parentElm.appendChild(doc.createTextNode(s));
            }
        } catch (EmptyStackException e) {
            System.out.println("characters stack error: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("characters error: " + e.getMessage());
        }
    }

    public void endElement(java.lang.String namespaceURI,
                           java.lang.String localName,
                           java.lang.String qName) {
        try {
            if(localName.equals("invoice")) {
                printInvoice();
                resetDocument();
                // The finished <invoice> element is still on the stack.
                // Clear it, or every old invoice stays in memory.
                elementStack.clear();
            } else if(!elementStack.empty()) {
                elementStack.pop();
            }
        } catch (EmptyStackException e) {
            System.out.println("endElement stack error: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("endElement error: " + e.getMessage());
        }
    }

    /**********************************
    Following methods are empty stubs of ContentHandler methods
    This is necessary to compile.
    **********************************/
    public void endDocument() {}
    public void endPrefixMapping(java.lang.String prefix) {}
    public void ignorableWhitespace(char[] ch, int start, int length) {}
    public void processingInstruction(java.lang.String target,
                                      java.lang.String data) {}
    public void setDocumentLocator(Locator locator) {}
    public void skippedEntity(java.lang.String name) {}
    public void startPrefixMapping(java.lang.String prefix,
                                   java.lang.String uri) {}
}

class Hello {
    public static void main(String[] args) {
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            parser.parse("invoices.xml");
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}

Before compiling and running this code, BE SURE YOU HAVE SAVED THE invoices.xml FILE DESCRIBED EARLIER IN THIS ARTICLE. The preceding code expects XML with <invoice> under <invoices>, <customer> and multiple <lineitem> under <invoice>, and a text node describing the customer name under <customer>. Any other hierarchy will cause this program to fail. Therefore, I have hardcoded invoices.xml into the program as the file to be parsed.

If you've copied the invoices.xml file and the code correctly, when you compile and run it the result should look something like this:

$ ./jj

===============================================
Invoice total=141.82001
Customer=Garcia, Maria

Items purchased...
widget              4 @ 21.41 = 85.639999
mousetrap           14 @ 2.1099999 = 29.539999
wrench              3 @ 8.8800001 = 26.639999

===============================================

===============================================
Invoice total=789.88
Customer=Smith, John

Items purchased...
mouse               84 @ 7.4099998 = 622.44
mousepad            184 @ 0.91000003 = 167.44

===============================================
Done!
$

Obviously this was a case contrived to showcase the use of per-record DOM documents in an otherwise SAX app. And I didn't bother to make columns line up or decimals round to pennies. But it doesn't take a lot of imagination to see how this technique can make life much easier for those coding apps with too many records to put in a single DOM document, but in which each record is too complex or too out of order to print line for line from the SAX callbacks.
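For readers who do want the pennies to come out right, one possible approach is to do the arithmetic in java.math.BigDecimal instead of float. The sketch below is an assumption about how that could look, not a change to the program above; PennyDemo, lineTotal() and formatLine() are hypothetical names, and the column widths are arbitrary.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper: exact decimal arithmetic for the invoice lines.
class PennyDemo {
    // quantity * unit price, rounded to two decimal places.
    // Both arguments are the attribute strings as read from the XML.
    static BigDecimal lineTotal(String quantity, String price) {
        return new BigDecimal(price)
            .multiply(new BigDecimal(quantity))
            .setScale(2, RoundingMode.HALF_UP);
    }

    // One report line with fixed-width columns: item name left-justified,
    // numbers right-justified.
    static String formatLine(String item, String quantity, String price) {
        return String.format("%-20s%4s @ %8s = %10s",
            item, quantity, price,
            lineTotal(quantity, price).toPlainString());
    }

    public static void main(String[] args) {
        // The three line items of the first invoice in invoices.xml.
        String[][] items = {
            {"widget", "4", "21.41"},
            {"mousetrap", "14", "2.11"},
            {"wrench", "3", "8.88"}
        };
        BigDecimal total = BigDecimal.ZERO;
        for (String[] it : items) {
            System.out.println(formatLine(it[0], it[1], it[2]));
            total = total.add(lineTotal(it[1], it[2]));
        }
        System.out.println("Invoice total=" + total.toPlainString());
    }
}
```

Constructing BigDecimal from the attribute string, rather than from a float, is what keeps 21.41 from turning into 21.409999 along the way.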

This article illustrates one other thing. You've seen various parsers that parse straight to a DOM document, and you may have wondered how they work. Most of them do pretty much just what you did in this article -- they have their SAX callbacks call DOM methods to load a DOM document.

One more thing. There's a special kind of DOM object, called a DocumentFragment, especially created to be a "lightweight object", which is perfect for temporarily storing moderate amounts of data. You might be better off using a DocumentFragment for per-record DOM work, but I didn't have time to check it out.
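For anyone who wants to pick up where I left off, here is a rough sketch of what holding one record in a DocumentFragment might look like. It is untested against large files and makes no claim that a fragment is actually lighter than a Document in any particular parser. FragmentDemo and its methods are made-up names; createDocumentFragment() is a standard DOM Level 1 method of Document. Note that a DocumentFragment has no getElementsByTagName(), so the sketch walks the children directly.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: one invoice's lineitems held in a DocumentFragment.
class FragmentDemo {
    // A fragment still needs an owner Document to create its nodes.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static Element lineItem(Document doc, String item, String quantity) {
        Element elm = doc.createElement("lineitem");
        elm.setAttribute("item", item);
        elm.setAttribute("quantity", quantity);
        return elm;
    }

    // Builds a fragment with the first invoice's three lineitems.
    static DocumentFragment buildRecord(Document doc) {
        DocumentFragment record = doc.createDocumentFragment();
        record.appendChild(lineItem(doc, "widget", "4"));
        record.appendChild(lineItem(doc, "mousetrap", "14"));
        record.appendChild(lineItem(doc, "wrench", "3"));
        return record;
    }

    // Walks the fragment's children and adds up the quantities.
    static int totalQuantity(DocumentFragment record) {
        int total = 0;
        for (Node n = record.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && n.getNodeName().equals("lineitem")) {
                total += Integer.parseInt(((Element) n).getAttribute("quantity"));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        System.out.println("Total quantity=" + totalQuantity(buildRecord(doc)));
    }
}
```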

In this Article You Have Learned

SAX is a parsing methodology employing callback routines for events like startElement, endElement, and the like.
SAX is an alternative to DOM.
DOM keeps the entire hierarchy in memory, which makes it easier for the programmer but can deplete resources. SAX keeps only the current tag or text node in memory. Use SAX when working with large XML files, or working with XML files of unknown and possibly unlimited size.
To preserve the node hierarchy in a SAX app, use a Stack.
Depending on the functionality of the program, SAX may require the import of many classes. The classes imported in this article include:

import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;

Associating a ContentHandler and an ErrorHandler with the SAXParser enables the SAXParser to call callback routines from the ContentHandler and ErrorHandler. By placing the proper code in those callback routines, the programmer creates an app to do his bidding.
The ContentHandler and ErrorHandler interfaces contain the following callback routines:

ContentHandler

public void characters(char[] ch, int start, int length)
public void endDocument()
public void endElement(java.lang.String namespaceURI, java.lang.String localName, java.lang.String qName)
public void endPrefixMapping(java.lang.String prefix)
public void ignorableWhitespace(char[] ch, int start, int length)
public void processingInstruction(java.lang.String target, java.lang.String data)
public void setDocumentLocator(Locator locator)
public void skippedEntity(java.lang.String name)
public void startDocument()
public void startElement(java.lang.String namespaceURI, java.lang.String localName, java.lang.String qName, Attributes atts)
public void startPrefixMapping(java.lang.String prefix, java.lang.String uri)

ErrorHandler

public void error(SAXParseException exception)
public void fatalError(SAXParseException exception)
public void warning(SAXParseException exception)

If you can separate the XML into independent "records", you can build a (presumably small) in-memory DOM Document for each "record", thereby giving complete random access to the data at the "record" level, but maintaining a small footprint.

Steve Litt is the documentor of the Universal Troubleshooting Process. He can be reached at slitt@troubleshooters.com.

DTD's

By Steve Litt

In this Article You Will Learn

DTD stands for Document Type Definition.
DTD's define what combinations of elements, text and attributes are legal.
Validating Parsers are necessary to enforce a DTD.
A file's DTD can be internal to the file, or contained in another file.
The two most often used declarations in DTD's are ELEMENT and ATTLIST.
How to define a sequence of legal subelements.
How to define legal attributes.
How to declare default attribute values.
A process for deducing a proper DTD from an existing XML file.

DTD's are Document Type Definitions. They inform a validating parser about which combinations of elements and/or text nodes are legal, which attributes are legal for those elements, and what types of values those attributes can contain. The advantage of this is that an application can know what to expect, and can handle departures from the DTD with an error handler.

To repeat what was said much earlier in this issue of Troubleshooting Professional Magazine, to say an XML document is well formed means it conforms to XML syntax. To say a document is valid means the XML conforms to its DTD. Such validation can be done only by a validating parser. Non-validating parsers (I believe expat is non-validating) simply ignore DTD's, as long as the DTD's have proper XML syntax.

All the XML exercises in this TPM issue use the Apache Software Foundation's Xerces Java parser, which can run as a validating parser or a non-validating parser (default non-validating). This article will show you how to make a "Hello World" level DTD, how to turn on validation in the Xerces parser, and then expand from there. If you're using a different parser, you need to make adjustments. It might be better for the sake of these exercises to just use Xerces.

Because I ran out of time, the entire exploration of validation is done via SAX, which is very straightforward due to its ErrorHandler class. Feel free to do some research on validation with parsers dumping to a DOM document.

Handy Scripts

To make this tutorial go faster, review the jj script to compile and run Hello.java:

rm Hello.class
CLASSPATH=$CLASSPATH:/usr/jre-blackdown1.2.2/lib/xerces.jar:.
javac Hello.java
java Hello $@

But in this article, we'll usually be changing the XML or DTD but not the program. Therefore compiling wastes time. So create the following script called ss, for the purpose of running the already compiled program:

CLASSPATH=$CLASSPATH:/usr/jre-blackdown1.2.2/lib/xerces.jar:.
java Hello $@

Just remember that IF you change the Java program, you MUST use the jj script.

Now let's make and explore a Hello World DTD...

A Hello World DTD

For this article we're going to use the SAX explorer you built earlier, complete with the ErrorHandler class. If you no longer have it, copy it again from the "Adding an Error Handler to Your SAX Explorer" section of the SAX article and paste it into your editor.

Now create the following dtdtest.xml:

<?xml version="1.0"?>
<docelement>
</docelement>

Obviously, this is nothing but a document element. Compile and run against dtdtest.xml, and this should be the result:

$ ./jj dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

So far, so good. But of course, there was no DTD. Now add a DTD for which the XML is valid (take my word for it):

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
]>
<docelement>
</docelement>

In the preceding XML, the line starting with <!DOCTYPE declares the DTD. The opening square bracket says the DTD is internal to the document, and will be contained between the opening square bracket and a closing square bracket. Notice that the closing square bracket, and the angle bracket which completes the DOCTYPE declaration, are on their own line. Between the opening line and that closing line is the single <!ELEMENT line, saying that the one and only element allowed in this document is called docelement, and that it can contain only text (that's what #PCDATA means -- text). Later we'll validate against a DTD in another file, but for now let's work with the DTD and XML in the same file.

Now run your already compiled Java program against this new DTD equipped file, using the ss script, and watch what happens:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Nothing's changed. Of course, you'd expect that from a document that's valid according to its DTD. Now let's change the DTD to make an error occur:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT xdocelement (#PCDATA)>
]>
<docelement>
</docelement>

We placed the letter x before the word docelement in the <!ELEMENT> declaration. This makes the document invalid. Let's see what happens:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Oops! There are still no errors. What happened? The parser is running in non-validating mode. So we add the single line shown in bolded red (the parser.setFeature() call) in the new main() code that follows:

class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            ErrorHandler errHandler = new MyErrorHandler();
            parser.setErrorHandler(errHandler);
            parser.setFeature("http://xml.org/sax/features/validation", true);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}

That should turn validation on. So let's compile and run again, with the "bad" DTD. BE SURE TO USE jj instead of ss:

$ ./jj dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
SAX nonfatal error: Element type "docelement" must be declared.
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Good. The ErrorHandler's error() callback, the one for nonfatal errors, was fired. Now let's delete the offending x, so the xml file looks like this:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
]>
<docelement>
</docelement>

Run the program, and watch the error go away:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

We're validating!   *     *
                     \ o /
                      \|/
                       |               C O O L
                      / \  _
                     /   \/
                    /
                   -
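If you'd like to see the effect of the validation switch without editing dtdtest.xml back and forth, the following sketch runs the "bad" document with validation off and then on, and counts the error() callbacks. It is a sketch under assumptions: the XMLReader comes from the JDK's SAXParserFactory rather than the Xerces SAXParser class used in this article, and ValidationDemo and CountingHandler are invented names. The feature URI is the same one used in main() above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: DefaultHandler already implements ErrorHandler,
// so only the nonfatal error() callback needs overriding.
class CountingHandler extends DefaultHandler {
    int errors = 0;
    public void error(SAXParseException e) {
        errors++;
    }
}

class ValidationDemo {
    // The document from the text whose DTD declares xdocelement
    // instead of docelement.
    static final String BAD =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE docelement [\n"
      + "<!ELEMENT xdocelement (#PCDATA)>\n"
      + "]>\n"
      + "<docelement>\n"
      + "</docelement>\n";

    // The same document with the offending x removed.
    static final String GOOD = BAD.replace("xdocelement", "docelement");

    // Parses the XML and returns how many times error() was called.
    static int countErrors(String xml, boolean validate) {
        CountingHandler handler = new CountingHandler();
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setFeature("http://xml.org/sax/features/validation", validate);
            parser.setContentHandler(handler);
            parser.setErrorHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.errors;
    }

    public static void main(String[] args) {
        System.out.println("bad, not validating: " + countErrors(BAD, false));
        System.out.println("bad, validating:     " + countErrors(BAD, true));
        System.out.println("good, validating:    " + countErrors(GOOD, true));
    }
}
```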

Working with Attribute Lists

Now let's make it fail again by requiring that docelement have an attribute called lname:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement lname CDATA #REQUIRED>
]>
<docelement>
</docelement>

Run the program:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
SAX nonfatal error: Attribute "lname" is required and must be specified for element type "docelement".
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

So let's add an lname attribute and see if the parser likes it:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement lname CDATA #REQUIRED>
]>
<docelement lname="Litt">
</docelement>

Sure enough, the program now reports the last name as an attribute instead of erroring out, as the following output shows:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Next, let's give the element an attribute not declared in the DTD:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement lname CDATA #REQUIRED>
]>
<docelement lname="Litt" fname="Steve">
</docelement>

Run the program, and see what happens:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
SAX nonfatal error: Attribute "fname" must be declared for element type "docelement".
ELEMENT : docelement : novalue : (lname="Litt",fname="Steve")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Uh-oh, the parser reported an error, saying fname must be declared. Note the difference between this error message and the one where an attribute was declared required but not included. So let's fix this by declaring attribute fname, and let's declare it #IMPLIED, which means it can exist or not. We also reformat the ATTLIST into lines: one line for the start of the declaration, one line per attribute, and one line for the declaration's ending angle bracket:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA #IMPLIED
>
]>
<docelement lname="Litt" fname="Steve">
</docelement>

Run it:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt",fname="Steve")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Note that now the error goes away, and the fname attribute is reported. Remembering that #IMPLIED means optional, let's remove the fname attribute:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA #IMPLIED
>
]>
<docelement lname="Litt">
</docelement>

And run it:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

As expected, the preceding run produced no error, and no fname attribute was reported. The DTD allowed the fname attribute to be missing, but of course the parser then had no fname value to report. Wouldn't it be nice if you could declare a default fname in case the attribute was missing? Check out the following:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA "XMLBrain"
>
]>
<docelement lname="Litt">
</docelement>

In the preceding we simply replaced the word #IMPLIED with the desired default value, which in this case is "XMLBrain". Running it we see:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt",fname="XMLBrain")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

As expected, the preceding output reports attribute fname to be the default value, "XMLBrain". Of course, we can override it as follows:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA "XMLBrain"
>
]>
<docelement lname="Litt" fname="Steve">
</docelement>

Running the program against the preceding XML, we see in the following output that the default fname value has been replaced by the explicitly specified value, "Steve".

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt",fname="Steve")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Do you remember the getSpecified() method in the DOM specification for the Attr interface, the one that returns true or false? It exists to handle situations like the two previous examples. When you override the default, getSpecified() returns true. When you let the attribute default, getSpecified() returns false.
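If you'd like to see getSpecified() in action, here is a minimal sketch. It assumes the JAXP DOM parser bundled with the JDK rather than the ss script used in this article, and the class name SpecifiedDemo and its constants are made up for the illustration. It parses the two previous documents and asks the fname attribute whether it was specified.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the article's code.
class SpecifiedDemo {
    // Internal DTD from the article: fname defaults to "XMLBrain".
    static final String DTD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (#PCDATA)>\n"
        + "<!ATTLIST docelement\n"
        + "  lname CDATA #REQUIRED\n"
        + "  fname CDATA \"XMLBrain\"\n"
        + ">\n"
        + "]>\n";

    // fname is left out, so the parser supplies the default.
    static final String DEFAULTED =
        DTD + "<docelement lname=\"Litt\"></docelement>";

    // fname is written in the document, overriding the default.
    static final String OVERRIDDEN =
        DTD + "<docelement lname=\"Litt\" fname=\"Steve\"></docelement>";

    // Parses the XML text and returns the named attribute of the document element.
    static Attr attrOf(String xml, String attrName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttributeNode(attrName);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String xml : new String[] {DEFAULTED, OVERRIDDEN}) {
            Attr fname = attrOf(xml, "fname");
            System.out.println("fname=" + fname.getValue()
                + " specified=" + fname.getSpecified());
        }
    }
}
```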

Finally, let's explore attributes that can take only certain values. We do that by listing the allowed values separated by pipe symbols (|) and enclosed in parentheses, as shown below:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA "XMLBrain"
  employee_type (employee|contractor|partner) "partner"
>
]>
<docelement lname="Litt" fname="Steve">
</docelement>

In the preceding, notice that after the list of alternatives, there's a default value matching one of the alternatives. That's necessary. The absence of a default value produces a fatal error, while a default not listed in the alternatives yields a nonfatal error. If you don't like defaulting, you have two choices:

Add another alternative such as none_selected, and make it the default. (Each alternative must be a single name token, so it can't contain a space.)
Replace the default string with #IMPLIED, in which case a lack of the attribute is accepted and not defaulted.

Running the preceding xml code prints the default:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt",fname="Steve",employee_type="partner")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

As you can see, the default was printed. Let's rewrite the XML to use a disallowed value for employee_type, and see what happens:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA "XMLBrain"
  employee_type (employee|contractor|partner) "partner"
>
]>
<docelement lname="Litt" fname="Steve" employee_type="volunteer">
</docelement>

Running the preceding XML produces the following error:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
SAX nonfatal error: Attribute "employee_type" with value "volunteer" must have a value from the list "(employee|contractor|partner)".
ELEMENT : docelement : novalue : (lname="Litt",fname="Steve",employee_type="volunteer")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Of course, using a legitimate value overrides the default and delivers the proper attribute value to the application, as follows:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA "XMLBrain"
  employee_type (employee|contractor|partner) "partner"
>
]>
<docelement lname="Litt" fname="Steve" employee_type="contractor">
</docelement>

Running the preceding XML produces the following valid result:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : (lname="Litt",fname="Steve",employee_type="contractor")
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

There are other types of attributes. There are ID attributes, which specify that the value of the attribute must be unique document wide (is that cool or what?). There are IDREF attributes, whose value must match the ID attribute value of some element in the document.
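To make ID and IDREF concrete, here is a small sketch. It assumes the validating SAX parser bundled with the JDK (through JAXP) instead of the ss script, and the staff/person document, the empid and boss attributes, and the class name IdRefDemo are all invented for the illustration. It collects the nonfatal errors reported for a valid document, a document with a duplicated ID, and a document whose IDREF points at nothing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the element and attribute names are made up.
class IdRefDemo {
    static final String DTD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE staff [\n"
        + "<!ELEMENT staff (person+)>\n"
        + "<!ELEMENT person EMPTY>\n"
        + "<!ATTLIST person\n"
        + "  empid ID #REQUIRED\n"
        + "  boss IDREF #IMPLIED\n"
        + ">\n"
        + "]>\n";

    // e2's boss refers to the existing ID e1.
    static final String VALID = DTD
        + "<staff><person empid=\"e1\"/><person empid=\"e2\" boss=\"e1\"/></staff>";

    // Two elements share the ID e1.
    static final String DUPLICATE_ID = DTD
        + "<staff><person empid=\"e1\"/><person empid=\"e1\"/></staff>";

    // boss refers to e9, which no element carries as an ID.
    static final String DANGLING_REF = DTD
        + "<staff><person empid=\"e1\" boss=\"e9\"/></staff>";

    // Returns the messages of all nonfatal (validity) errors.
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // enforce the DTD
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            // Fatal errors (malformed XML) end up here.
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid:        " + validate(VALID));
        System.out.println("duplicate ID: " + validate(DUPLICATE_ID));
        System.out.println("dangling ref: " + validate(DANGLING_REF));
    }
}
```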

Attribute lists are powerful ways of specifying which attributes are legal and which aren't. They take the following form:

<!ATTLIST elementname
  attributename  DATATYPE  modifier
>

The data type is usually CDATA, but it can be ID, IDREF, and certain other types. The data type can also be replaced by a series of alternatives separated by pipe symbols (|) and surrounded by parentheses. The modifier can be a validation specifier like #REQUIRED or #IMPLIED, or it can be a default value. It can also be the specifier #FIXED followed by a value, in which case the attribute always has that value.

This discussion of alternatives is by no means exhaustive. Its purpose is only to get you to the point where you can experiment and research the building of <!ATTLIST> constructions.
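As a starting point for that experimentation, here is a sketch that runs the attribute experiments from this section in one program. It assumes the validating SAX parser that ships with the JDK (through JAXP) rather than the ss script, and the class name AttlistDemo, its constants, and the mname attribute are made up for the illustration. Each test document breaks exactly one rule of the ATTLIST.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the article's code.
class AttlistDemo {
    // Internal DTD taken from the employee_type example.
    static final String DTD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (#PCDATA)>\n"
        + "<!ATTLIST docelement\n"
        + "  lname CDATA #REQUIRED\n"
        + "  fname CDATA \"XMLBrain\"\n"
        + "  employee_type (employee|contractor|partner) \"partner\"\n"
        + ">\n"
        + "]>\n";

    static final String VALID = DTD + "<docelement lname=\"Litt\"/>";
    // lname is #REQUIRED but absent.
    static final String MISSING_REQUIRED = DTD + "<docelement fname=\"Steve\"/>";
    // mname is not declared in the ATTLIST.
    static final String UNDECLARED = DTD + "<docelement lname=\"Litt\" mname=\"X\"/>";
    // volunteer is not among the listed alternatives.
    static final String BAD_CHOICE =
        DTD + "<docelement lname=\"Litt\" employee_type=\"volunteer\"/>";

    // Returns the messages of all nonfatal (validity) errors.
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // without this the DTD is not enforced
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            // Fatal errors (malformed XML) end up here.
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String[] docs = {VALID, MISSING_REQUIRED, UNDECLARED, BAD_CHOICE};
        for (String xml : docs) {
            List<String> errors = validate(xml);
            System.out.println(errors.isEmpty() ? "valid" : "invalid: " + errors);
        }
    }
}
```

Returning the messages as a list, instead of printing them from the handler, makes it easy to try your own ATTLIST variations and compare the results.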

Validating Elements within Elements

As the DTD writer, you can specify what elements can be contained within a kind of element, and what kind of elements must be contained within an element. We'll do only the simple stuff, leaving it to your experimentation to do the complex kinds of specification. When specifying a DTD, always make sure to use good design principles. Remember the invoices.txt file we used earlier on a SAX/DOM program? The one with a <customer> element thrown in with a bunch of <invoice> elements? That's not good design. Good XML design is something like good, normalized database design.

Let's start with this simple XML file:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement EMPTY>
]>
<docelement></docelement>

It succeeds when run, as can be seen in the following output:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

But now place text inside, as follows:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement EMPTY>
]>
<docelement>This is some text</docelement>

Now run it to see the results:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
SAX nonfatal error: The content of element type "docelement" must match "EMPTY".
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

In the preceding example, the text node in the <docelement> element doesn't match the EMPTY declaration. Interestingly enough, the previous code would have produced the same error without text if the ending </docelement> tag had been on its own line. That's because the new line would have been considered a text node.
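The newline case is easy to check for yourself. The following sketch assumes the validating SAX parser bundled with the JDK (through JAXP) rather than the ss script, and the class name EmptyContentDemo is made up. It validates an EMPTY element with no content, with text, and with nothing but a newline between its tags.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the article's code.
class EmptyContentDemo {
    static final String DTD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement EMPTY>\n"
        + "]>\n";

    static final String NO_CONTENT = DTD + "<docelement></docelement>";
    static final String WITH_TEXT = DTD + "<docelement>This is some text</docelement>";
    // Only a newline between the start and end tags.
    static final String WITH_NEWLINE = DTD + "<docelement>\n</docelement>";

    // Returns the messages of all nonfatal (validity) errors.
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // enforce the DTD
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            // Fatal errors (malformed XML) end up here.
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("no content:   " + validate(NO_CONTENT));
        System.out.println("with text:    " + validate(WITH_TEXT));
        System.out.println("with newline: " + validate(WITH_NEWLINE));
    }
}
```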

This problem is simple enough to fix by changing the type from EMPTY to (#PCDATA):

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
]>
<docelement>This is some text</docelement>

The preceding XML produces the following output:

$ ./ss dtdtest.xml
parsing dtdtest.xml...
setDocumentLocator()
startDocument()
ELEMENT : docelement : novalue : ()
TEXT : noname : This is some text
endElement(namespaceURI=,localName=docelement,qName=docelement)
endDocument()
Done!
$

Now let's say that we want docelement to contain a single instance of subdocument, and nothing else. Further, subdocument may contain text:

<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (subdocument)>
<!ELEMENT subdocument (#PCDATA)>
]>
<docelement>
<subdocument>Subdocument's text</subdocument>
</docelement>

The preceding says docelement can contain one instance of subdocument, no more and no less. And no text. Will this succeed? Remember the whitespace problem with our DOM walker? Remember we needed to make a whitespace killer object to delete all the whitespace caused by formatting blanks and newlines? Let's see what happens when we run our program against it:

$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () ignorableWhitespace(string= ) ELEMENT : subdocument : novalue : () TEXT : noname : Subdocument's text endElement(namespaceURI=,localName=subdocument,qName=subdocument) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $

As you can see in the preceding, there are no error messages or warnings. The two elements are printed at their proper levels, and the text within subdocument is printed correctly. And if you look closely, you'll see that the ContentHandler's ignorableWhitespace() callback has fired twice -- once for the newline after <docelement> and once for the newline before </docelement>. When the DTD says text cannot appear in an element, and pure whitespace appears in the element, that whitespace is assumed to be ignorable. Cool!
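Here is a minimal sketch of how a ContentHandler can tell the two kinds of character data apart. It assumes the validating SAX parser bundled with the JDK (through JAXP) rather than the ss script, and the class names WhitespaceDemo and Collector are made up for the illustration. Real text arrives through characters(), while formatting whitespace in element-only content arrives through ignorableWhitespace().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the article's code.
class WhitespaceDemo {
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (subdocument)>\n"
        + "<!ELEMENT subdocument (#PCDATA)>\n"
        + "]>\n"
        + "<docelement>\n"
        + "<subdocument>Subdocument's text</subdocument>\n"
        + "</docelement>";

    // Keeps real text and ignorable whitespace in separate buckets.
    static class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int ignorableCalls = 0;
        int ignorableChars = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called more than once per text node, so append.
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            ignorableCalls++;
            ignorableChars += length;
        }

        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e; // treat validity errors as failures in this sketch
        }
    }

    static Collector parse(String xml) {
        Collector collector = new Collector();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // the DTD tells the parser what is ignorable
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return collector;
    }

    public static void main(String[] args) {
        Collector c = parse(XML);
        System.out.println("text: " + c.text);
        System.out.println("ignorableWhitespace calls: " + c.ignorableCalls);
    }
}
```

Because the whitespace never reaches characters(), an application built this way needs no whitespace killer for element-only content.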

Let's try a real example now. Remember the invoices.xml file? Let's make a DTD for it, except that we'll consider it mandatory to put the <customer> element before any <lineitem> elements:

<?xml version="1.0"?>
<!DOCTYPE invoices [
<!ELEMENT invoices (invoice+)>
<!ELEMENT invoice (customer,lineitem+)>
<!ELEMENT customer (#PCDATA)>
<!ELEMENT lineitem EMPTY>
<!ATTLIST lineitem
  item CDATA #REQUIRED
  price CDATA #REQUIRED
  quantity CDATA #REQUIRED
>
]>
<invoices>
  <invoice>
    <customer>Garcia, Maria</customer>
    <lineitem item="widget" price="21.41" quantity="4"></lineitem>
    <lineitem item="mousetrap" price="2.11" quantity="14"></lineitem>
    <lineitem item="wrench" price="8.88" quantity="3"></lineitem>
  </invoice>
  <invoice>
    <customer>Smith, John</customer>
    <lineitem item="mouse" price="7.41" quantity="84"></lineitem>
    <lineitem item="mousepad" price="0.91" quantity="184"></lineitem>
  </invoice>
</invoices>

Let's recite the DTD in English, starting with the document element:

<!ELEMENT invoices (invoice+)>

<invoices> contains one or more <invoice>. Because invoices has no ATTLIST, <invoices> can have no attributes.

<!ELEMENT invoice (customer,lineitem+)>

<invoice> contains exactly one <customer>, followed by one or more <lineitem>. Because invoice has no ATTLIST, <invoice> can have no attributes.

<!ELEMENT customer (#PCDATA)>

<customer> contains no subelements, but may contain text. Because customer has no ATTLIST, <customer> can have no attributes.

<!ELEMENT lineitem EMPTY>

<lineitem> cannot contain elements or text.

<!ATTLIST lineitem
  item CDATA #REQUIRED
  price CDATA #REQUIRED
  quantity CDATA #REQUIRED
>

<lineitem> must have attributes called item, price and quantity with a text value. It can have no other attributes.

Please review the preceding description and the code before it until you understand. Then run the program and verify that the DTD indeed validates the XML:

$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : invoices : novalue : () ignorableWhitespace(string= ) ELEMENT : invoice : novalue : () ignorableWhitespace(string= ) ELEMENT : customer : novalue : () TEXT : noname : Garcia, Maria endElement(namespaceURI=,localName=customer,qName=customer) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="widget",price="21.41",quantity="4") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="mousetrap",price="2.11",quantity="14") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="wrench",price="8.88",quantity="3") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=invoice,qName=invoice) ignorableWhitespace(string= ) ELEMENT : invoice : novalue : () ignorableWhitespace(string= ) ELEMENT : customer : novalue : () TEXT : noname : Smith, John endElement(namespaceURI=,localName=customer,qName=customer) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="mouse",price="7.41",quantity="84") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="mousepad",price="0.91",quantity="184") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=invoice,qName=invoice) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=invoices,qName=invoices) endDocument() Done! $

It worked. Remember, the declaration:

<!ELEMENT invoice (customer,lineitem+)>

means one <customer> followed by one or more lineitems. Verify this by placing a customer record between <lineitem> elements, and verify that it displays a nonfatal error. If you really wanted the ability to randomly place the <customer> among the <lineitem> elements, it can be done, but it's beyond the scope of this article. And in this case, it wouldn't be good program design to do so.

Remember, commas between the subelement types mean first a single <customer> element, followed by one or more <lineitem> elements. A specific order is enforced.

Let's discuss the plus sign you see in the preceding invoice element declaration. There are characters you can append to an element name to define how many times it occurs. Without any character appended, the element occurs exactly once, as customer does in the preceding definition of an <invoice> element. Here are the others:

? : 0 or 1

* : 0 or more

+ : 1 or more

no character : exactly 1
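To try all four occurrence rules, plus the ordering imposed by commas, here is a sketch. It assumes the validating SAX parser bundled with the JDK (through JAXP) rather than the ss script. The <note> and <remark> elements and the class name ContentModelDemo are invented for the illustration and are not part of the invoices example; they are there only so that ? and * have something to act on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: note and remark are made-up elements.
class ContentModelDemo {
    // One customer, an optional note, one or more lineitems, any number of remarks.
    static final String DTD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE invoice [\n"
        + "<!ELEMENT invoice (customer, note?, lineitem+, remark*)>\n"
        + "<!ELEMENT customer (#PCDATA)>\n"
        + "<!ELEMENT note (#PCDATA)>\n"
        + "<!ELEMENT lineitem EMPTY>\n"
        + "<!ELEMENT remark (#PCDATA)>\n"
        + "]>\n";

    static final String CUSTOMER = "<customer>Garcia, Maria</customer>";

    static final String MINIMAL = DTD
        + "<invoice>" + CUSTOMER + "<lineitem/></invoice>";
    static final String FULL = DTD
        + "<invoice>" + CUSTOMER + "<note>rush</note><lineitem/><lineitem/>"
        + "<remark>a</remark><remark>b</remark></invoice>";
    // note? allows at most one note.
    static final String TWO_NOTES = DTD
        + "<invoice>" + CUSTOMER + "<note>a</note><note>b</note><lineitem/></invoice>";
    // lineitem+ requires at least one lineitem.
    static final String NO_LINEITEM = DTD
        + "<invoice>" + CUSTOMER + "</invoice>";
    // The commas require customer to come first.
    static final String OUT_OF_ORDER = DTD
        + "<invoice><lineitem/>" + CUSTOMER + "<lineitem/></invoice>";

    // Returns the messages of all nonfatal (validity) errors.
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // enforce the DTD
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            // Fatal errors (malformed XML) end up here.
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String[] docs = {MINIMAL, FULL, TWO_NOTES, NO_LINEITEM, OUT_OF_ORDER};
        for (String xml : docs) {
            System.out.println(validate(xml).isEmpty() ? "valid" : "invalid");
        }
    }
}
```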

Writing a Substantial DTD

This exercise demonstrates the use of an external DTD file, and also demonstrates the methodology used to create a DTD from a substantial XML file.

As a final DTD exercise, let's make an external DTD for this monstrosity, which is one layer of a Dia file. Save this file as layer.xml:

<layer name="Background" visible="true">
  <group>
    <group>
      <object type="Standard - Ellipse" version="0" id="O0">
        <attribute name="obj_pos">
          <point val="7.55,5.85"/>
        </attribute>
        <attribute name="obj_bb">
          <rectangle val="7.5,5.8;13.3,7.8"/>
        </attribute>
        <attribute name="elem_corner">
          <point val="7.55,5.85"/>
        </attribute>
        <attribute name="elem_width">
          <real val="5.7"/>
        </attribute>
        <attribute name="elem_height">
          <real val="1.9"/>
        </attribute>
      </object>
      <object type="Standard - Box" version="0" id="O1">
        <attribute name="obj_pos">
          <point val="10.1,7.4"/>
        </attribute>
        <attribute name="obj_bb">
          <rectangle val="10.05,7.35;12.95,9.3"/>
        </attribute>
        <attribute name="elem_corner">
          <point val="10.1,7.4"/>
        </attribute>
        <attribute name="elem_width">
          <real val="2.8"/>
        </attribute>
        <attribute name="elem_height">
          <real val="1.85"/>
        </attribute>
        <attribute name="show_background">
          <boolean val="true"/>
        </attribute>
      </object>
    </group>
    <object type="Standard - Polygon" version="0" id="O2">
      <attribute name="obj_pos">
        <point val="7.8,3.45"/>
      </attribute>
      <attribute name="obj_bb">
        <rectangle val="7.75,3.4;10,5"/>
      </attribute>
      <attribute name="poly_points">
        <point val="7.8,3.45"/>
        <point val="8.8,3.45"/>
        <point val="9.95,4.95"/>
      </attribute>
      <attribute name="show_background">
        <boolean val="true"/>
      </attribute>
    </object>
  </group>
  <object type="Standard - Box" version="0" id="O3">
    <attribute name="obj_pos">
      <point val="14.6,3.7"/>
    </attribute>
    <attribute name="obj_bb">
      <rectangle val="14.55,3.65;17.15,5.5"/>
    </attribute>
    <attribute name="elem_corner">
      <point val="14.6,3.7"/>
    </attribute>
    <attribute name="elem_width">
      <real val="2.5"/>
    </attribute>
    <attribute name="elem_height">
      <real val="1.75"/>
    </attribute>
    <attribute name="show_background">
      <boolean val="true"/>
    </attribute>
  </object>
</layer>

Here's the English discussion:

A <layer> has mandatory attributes name and visible. visible is boolean.

<!ATTLIST layer
name CDATA #REQUIRED
visible (true | false) #REQUIRED
>

A <layer> can contain <object> and/or <group>, in any order and any quantity.

<!ELEMENT layer (group | object)*>

An <object> has mandatory attributes type, version and id.

<!ATTLIST object
type CDATA #REQUIRED
version CDATA #REQUIRED
id CDATA #REQUIRED
>

An <object> contains one or more <attribute> (Caution: Don't confuse the <attribute> element with the attributes of this or other elements).

<!ELEMENT object (attribute+)>

An <attribute> has mandatory attribute name.

<!ATTLIST attribute name CDATA #REQUIRED>

An <attribute> contains either exactly one <boolean>, <real> or <rectangle>, or else one or more <point>.

<!ELEMENT attribute (boolean | point+ | real | rectangle)>

Each of <point>, <real> and <rectangle> has mandatory attribute val, which is a string that can take any value.

<!ATTLIST point val CDATA #REQUIRED>

<!ATTLIST real val CDATA #REQUIRED>

<!ATTLIST rectangle val CDATA #REQUIRED>

A <boolean> has mandatory val attribute, which is a string that is either "true" or "false".

<!ATTLIST boolean val (true | false) #REQUIRED>

Each of <boolean>, <point>, <real> and <rectangle> contains no subelements.

<!ELEMENT boolean EMPTY>

<!ELEMENT point EMPTY>

<!ELEMENT real EMPTY>

<!ELEMENT rectangle EMPTY>

A <group> has no attributes.

So no ATTLIST is necessary for <group>

A <group> contains <object> and <group> elements in any quantities and in any order, except there must be at least one.

<!ELEMENT group (group | object)+>

Dump the preceding outline into a text editor, delete the non-code items, and you have the proper DTD, which you must save as layer.dtd, as follows:

<!ATTLIST layer
  name CDATA #REQUIRED
  visible (true | false) #REQUIRED
>
<!ELEMENT layer (group | object)*>
<!ATTLIST object
  type CDATA #REQUIRED
  version CDATA #REQUIRED
  id CDATA #REQUIRED
>
<!ELEMENT object (attribute+)>
<!ATTLIST attribute name CDATA #REQUIRED>
<!ELEMENT attribute (boolean | point+ | real | rectangle)>
<!ATTLIST point val CDATA #REQUIRED>
<!ATTLIST real val CDATA #REQUIRED>
<!ATTLIST rectangle val CDATA #REQUIRED>
<!ATTLIST boolean val (true | false) #REQUIRED>
<!ELEMENT boolean EMPTY>
<!ELEMENT point EMPTY>
<!ELEMENT real EMPTY>
<!ELEMENT rectangle EMPTY>
<!ELEMENT group (group | object)+>

By the way, the VI editor is a wonderful ally when trying to deduce what contains what. Try it.

Finally, place the following two lines at the top of layer.xml:

<?xml version="1.0"?>
<!DOCTYPE layer SYSTEM "layer.dtd">

Save everything and run it. To save bandwidth I won't show you the entire output, which is voluminous. I'll simply show you the output grepped to prove there are no errors or warnings:

$ ./ss layer.xml | grep -i error
$ ./ss layer.xml | grep -i warning
$

That's it. You've written a DTD to cover a large chunk of XML output from a professional program (Dia). If you can do this, you can do anything.
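If you'd rather check an external DTD from Java than grep the output, here is a sketch. It assumes the validating SAX parser bundled with the JDK (through JAXP) rather than the ss script, and it assumes the parser is allowed to read external DTD files, which is the usual default. The class name ExternalDtdDemo, the temporary directory, and the cut-down layer document are made up for the illustration. The DTD is the layer.dtd built above.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the article's code.
class ExternalDtdDemo {
    static final String DTD =
          "<!ATTLIST layer name CDATA #REQUIRED visible (true | false) #REQUIRED>\n"
        + "<!ELEMENT layer (group | object)*>\n"
        + "<!ATTLIST object type CDATA #REQUIRED version CDATA #REQUIRED id CDATA #REQUIRED>\n"
        + "<!ELEMENT object (attribute+)>\n"
        + "<!ATTLIST attribute name CDATA #REQUIRED>\n"
        + "<!ELEMENT attribute (boolean | point+ | real | rectangle)>\n"
        + "<!ATTLIST point val CDATA #REQUIRED>\n"
        + "<!ATTLIST real val CDATA #REQUIRED>\n"
        + "<!ATTLIST rectangle val CDATA #REQUIRED>\n"
        + "<!ATTLIST boolean val (true | false) #REQUIRED>\n"
        + "<!ELEMENT boolean EMPTY>\n"
        + "<!ELEMENT point EMPTY>\n"
        + "<!ELEMENT real EMPTY>\n"
        + "<!ELEMENT rectangle EMPTY>\n"
        + "<!ELEMENT group (group | object)+>\n";

    // A cut-down layer that refers to the DTD by relative file name.
    static final String VALID_XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE layer SYSTEM \"layer.dtd\">\n"
        + "<layer name=\"Background\" visible=\"true\">\n"
        + "  <group>\n"
        + "    <object type=\"Standard - Box\" version=\"0\" id=\"O0\">\n"
        + "      <attribute name=\"obj_pos\"><point val=\"14.6,3.7\"/></attribute>\n"
        + "      <attribute name=\"show_background\"><boolean val=\"true\"/></attribute>\n"
        + "    </object>\n"
        + "  </group>\n"
        + "</layer>\n";

    // Same document, but the boolean carries a value the DTD does not allow.
    static final String INVALID_XML =
        VALID_XML.replace("<boolean val=\"true\"/>", "<boolean val=\"maybe\"/>");

    // Writes layer.dtd and layer.xml side by side, then validates layer.xml.
    static List<String> validate(String dtd, String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            Path dir = Files.createTempDirectory("dtd-demo");
            Path dtdFile = dir.resolve("layer.dtd");
            Path xmlFile = dir.resolve("layer.xml");
            // Registered directory first: deletion on exit runs in reverse order.
            dir.toFile().deleteOnExit();
            dtdFile.toFile().deleteOnExit();
            xmlFile.toFile().deleteOnExit();
            Files.write(dtdFile, dtd.getBytes(StandardCharsets.UTF_8));
            Files.write(xmlFile, xml.getBytes(StandardCharsets.UTF_8));

            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // enforce the DTD
            File input = xmlFile.toFile();
            // Parsing a File lets the parser resolve "layer.dtd" relative to it.
            factory.newSAXParser().parse(input, new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("validation run failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid layer errors:   " + validate(DTD, VALID_XML).size());
        System.out.println("invalid layer errors: " + validate(DTD, INVALID_XML).size());
    }
}
```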

In this Article You Have Learned

DTD stands for Document Type Definition.
DTD's define what combinations of elements, text and attributes are legal.
Validating Parsers are necessary to enforce a DTD.
A file's DTD can be internal to the file, or contained in another file.
The two most often used declarations in DTD's are ELEMENT and ATTLIST.
How to define a sequence of legal subelements.
How to define legal attributes.
How to declare default attribute values.
How to describe element and attribute sequences in English, and then create a DTD by typing DTD code below each English statement.

Steve Litt is the author of Rapid Learning: Secret Weapon of the Successful Technologist. He can be reached at slitt@troubleshooters.com.

This Concludes the XML Coding Exercises. Click here to resume the 3/2001 Troubleshooting Professional where you left off.

