XML Coding Exercises
Copyright (C) 2001 by Steve Litt. All rights reserved.
Materials from guest authors copyrighted by them and licensed for perpetual
use to Troubleshooting Professional Magazine. All rights reserved to the
copyright holder, except for items specifically marked otherwise (certain
free software source code, GNU/GPL, etc.). All material herein provided
"As-Is". User assumes all risk and responsibility for any outcome.
IDL code snippets and other information from the DOM
specification are copied from http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/,
Copyright © 1998 World Wide Web Consortium (Massachusetts Institute
of Technology, Institut National de Recherche en Informatique et en Automatique,
Keio University). All Rights Reserved. The status of that
document is W3C Recommendation.
A Hello World XML App in Java
Making a DOM Walker Program
Building a DOM Document From Scratch
Writing an XML File From a DOM Document
Accessing DOM Elements and Attributes by Name
SAX
DTD's
Resume reading the 3/2001 Troubleshooting Professional
A Hello World XML App in Java
Because this demo is being done in Java, we will start with a Hello
World Java application. Then we will download the necessary XML tools from
the Apache website, install them in the proper directory, and configure
your $CLASSPATH.
class Hello {
    public static void main(String[] args) {
        System.out.println("Hello World\n");
    }
}
Try compiling it with the following command:
$ javac Hello.java
It should compile with no output, simply returning to the command prompt. If there are error messages, make sure you typed in the Java source code exactly as shown above, and that you entered the command precisely as shown. If still no joy, suspect the installation of your Java compiler, or your $CLASSPATH variable. Troubleshoot accordingly.
Once you can compile it, try running it. It probably won't work, but instead will error out as shown below:
$ java Hello
Internal error: caught an unexpected exception.
Please check your CLASSPATH and your installation.
java/lang/ClassNotFoundException: Hello
        at java.lang.Class.forName(Class.java:native)
        at java.lang.Class.forName(Class.java:52)
Aborted
$
The preceding error probably indicates that your newly compiled Hello.class program is not on your $CLASSPATH. Fix it with the following command:
$ CLASSPATH=$CLASSPATH:.
The preceding command appends the current directory to the $CLASSPATH. Perform an ls command to verify that Hello.class really exists, and then run your program again. The following is the correct result:
$ java Hello
Hello World

$
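If the class still isn't found, it can help to see which class path the Java runtime actually received. The following is a minimal sketch, not part of the original exercise; the class name ShowClassPath is hypothetical, and it assumes your Java launcher exposes the class path through the standard java.class.path system property:

```java
import java.io.File;

class ShowClassPath {
    // Splits a class path string into its entries using the
    // platform's separator (":" on Unix-like systems, ";" on Windows).
    static String[] entries(String classPath) {
        return classPath.split(File.pathSeparator);
    }

    public static void main(String[] args) {
        // java.class.path holds the class path the JVM was started with
        String classPath = System.getProperty("java.class.path");
        for (String entry : entries(classPath)) {
            System.out.println(entry);
        }
    }
}
```

If the current directory (.) does not show up in the list, the $CLASSPATH change probably did not reach the java command.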
Note that the $CLASSPATH fix is good only for the current shell session.
To compile and run your Hello.java app, create the following script,
which we will name jj:
rm Hello.class
CLASSPATH=$CLASSPATH:.
javac Hello.java
java Hello $@
Next, use an import statement and make use of a command line argument,
as shown in the next invocation of Hello.java:
import java.io.IOException;

class Hello {
    public static void main(String[] args) {
        System.out.println("Hello " + args[0] + "!");
    }
}
Run the following command:
$ ./jj one two three
Hello one!
$
The preceding did just what it was supposed to do -- compiled and ran Hello.java, which printed the word Hello and the first argument. Now you're ready to add XML to your app.
Anyway, extract the files from the archive. You'll get lots of files
and directories. There's extensive documentation in html format -- a good
thing. But what you want is the file called xerces.jar, which is located
in the root of the new tree created when you extracted files from the archive.
Copy xerces.jar to a directory in which you want to put Java tools.
In my case I put it in /usr/jre-blackdown1.2.2/lib. Once you have
it where you want it, you need to add it to your $CLASSPATH. There are
many ways to do that, but I chose to modify my jj script to accomplish
it:
rm Hello.class
CLASSPATH=$CLASSPATH:/usr/jre-blackdown1.2.2/lib/xerces.jar:.
javac Hello.java
java Hello $@
Now you should be able to add an import statement to import the Xerces
DOMParser. The following is Hello.java after adding the import statement.
If you've done everything correctly, this program should compile and act
just like it acted before you added the import statement. If not, troubleshoot:
import java.io.IOException;
import org.apache.xerces.parsers.DOMParser;

class Hello {
    public static void main(String[] args) {
        System.out.println("Hello " + args[0] + "!");
    }
}
If the preceding compiled and ran, it means you correctly installed
and utilized xerces.jar, and you're ready for your first XML program.
The following program parses an XML file into a DOM document. Finally,
the program outputs the name of the top level element in the file.
import java.io.IOException;                    // Exception handling
import org.w3c.dom.*;                          // DOM interface
import org.apache.xerces.parsers.DOMParser;    // Parser (to DOM)

class Hello {
    public static void main(String[] args) {
        String filename = args[0];
        System.out.print("The document element of " + filename + " is ... ");
        try {
            DOMParser dp = new DOMParser();
            dp.parse(filename);
            Document doc = dp.getDocument();
            Element docElm = doc.getDocumentElement();
            System.out.println(docElm.getNodeName() + ".");
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
Let's examine the preceding code. org.w3c.dom and org.apache.xerces.parsers.DOMParser are both contained in the xerces.jar that you downloaded. org.w3c.dom contains the entire DOM interface, while org.apache.xerces.parsers.DOMParser is actually a SAX parser to load a DOM document. If this means nothing to you, refer to articles Anatomy of an XML App and Simplified Explanation of the DOM API earlier in this magazine (use your back button to come back).
The preceding code uses the first argument as a filename, and uses a DOMParser object to parse that file into a DOM document. Once it's a Document object, you have it in DOM form, and have no further use for the parser. Next you obtain the Document Element, which is the single top level element in an XML file. Finally, you get Document Element's name and print it.
The try{}catch(){} structure is error handling by exception. The e.getMessage() delivers a clear message about what went wrong, and is excellent for troubleshooting.
Run this program against the blank.xml file you created in
the Dia diagramming tool exercises. The result should look something like
this:
$ ./jj blank.xml
The document element of blank.xml is ... diagram.
$
But what happens if the file doesn't exist, or if the file is not XML?
See below:
$ cp /etc/fstab fstab.txt
$ ./jj blank.xml
The document element of blank.xml is ... diagram.
$ ./jj nonexist.txt
The document element of nonexist.txt is ...
Error: File "file:///home/slitt/nonexist.txt" not found.
$ ./jj fstab.txt
The document element of fstab.txt is ...
Error: The markup in the document preceding the root element must be well-formed.
$
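A catch-all catch (Exception e) is fine for a first program, but a parse failure can also tell you where the file went wrong. The following is a hedged sketch rather than the article's code: to stay runnable on a stock JDK it uses the JAXP DocumentBuilder instead of the Xerces DOMParser class, the class name ParseReport is hypothetical, and the exact message text will differ from the Xerces messages shown above:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ParseReport {
    // Parses the given XML text and returns a one-line report.
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return "OK: " + doc.getDocumentElement().getNodeName();
        } catch (SAXParseException e) {
            // Well-formedness errors carry a line and column number
            return "Not well-formed at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            // I/O and parser configuration problems end up here
            return "Error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<diagram/>"));
        System.out.println(describe("/dev/hda1 / ext2 defaults 1 1"));
    }
}
```

Catching SAXParseException before the general Exception keeps the simple behavior for everything else while adding a position for the "this isn't XML" case.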
By the way, the jj script compiles and runs. If you don't want to compile, make another file (call it ss) that doesn't delete Hello.class or do the compile step (javac). Be careful though. I often make the mistake of changing my program, forgetting to compile it, running it, and wondering why my change made no difference. Or even worse, figuring what I changed had nothing to do with the problem, and needlessly troubleshooting further.
Have you gotten the preceding code to run? Congratulations! You've written an XML program. Now it's time to do something more substantial...
Making a DOM Walker Program
A DOM walker program "walks" the DOM hierarchy, reporting on every text node, every element, and every attribute of each element. The concept is similar to walking any type of tree -- a recursive directory listing comes to mind.
And speaking of recursion, it's the standard algorithm for walking trees. But it isn't used in this program. That's because the DOM API bestows methods crafted to walk non-recursively -- getFirstChild(), getNextSibling(), and getParentNode(). The algorithm is simple if you think of a checker.
A checker is that black or red circular piece of plastic used in the game called checkers. In the game they're each used to mark a position. In this program, you can imagine a single checker being moved from node to node. The "current node" is covered by the "checker".
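Although trees are typically walked recursively, recursion is often too memory intensive to be practical in DOM apps. You can walk the DOM hierarchy iteratively (in a loop with no recursion) using the following algorithm:
1. Take action on the node under the checker, unless you just ascended to it (you already handled it on the way down).
2. If the node has children and you did not just ascend to it, move the checker down to its first child.
3. Otherwise, if the node has a next sibling, move the checker right to that sibling.
4. Otherwise, if the node has a parent, move the checker up to the parent and remember that you are ascending.
5. Otherwise you're done.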
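The down/right/up walk can be tried in isolation before reading the full program. The following is a minimal sketch, not the article's DOMwalker: the class name CheckerWalk is hypothetical, it parses from a string with the JDK's built-in JAXP parser instead of the Xerces DOMParser class, and it stops when the checker returns to the node it started on rather than when it runs out of parents:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CheckerWalk {
    // Parses an XML string into a DOM Document (JAXP, an assumption).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Visits root and every node below it without recursion and
    // returns the node names in the order they were visited.
    static List<String> walk(Node root) {
        List<String> visited = new ArrayList<>();
        Node checker = root;          // the "current node"
        boolean ascending = false;    // true right after moving up
        while (true) {
            if (!ascending) {
                visited.add(checker.getNodeName());   // take action
            }
            if (!ascending && checker.hasChildNodes()) {
                checker = checker.getFirstChild();    // go down
            } else if (checker == root) {
                break;                                // back at the start
            } else if (checker.getNextSibling() != null) {
                checker = checker.getNextSibling();   // go right
                ascending = false;
            } else {
                checker = checker.getParentNode();    // go up
                ascending = true;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b>t</b><c/></a>");
        System.out.println(walk(doc.getDocumentElement())); // prints [a, b, #text, c]
    }
}
```

The ascending flag is what keeps the walk from descending into the same children twice: a node reached from below has already been handled.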
The algorithm described above is what the DOMwalker class in the
following code implements. A description of the remainder of the code follows
the code itself. Please remember that this is a simplified
DOM walker: it doesn't print attributes, and it doesn't delete the extraneous
text nodes caused by XML formatting. Both features are added
later in this article. The following is the simplified iterative DOM walker:
import java.io.IOException;                    // Exception handling
import org.w3c.dom.*;                          // DOM interface
import org.apache.xerces.parsers.DOMParser;    // Parser (to DOM)

/**************************************
class DocumentMaker encapsulates all parser dependent code.
If you change XML parsers, only this class and the parser's
import statement need be modified. As written, DocumentMaker
uses DOMParser from Apache.
**************************************/
class DocumentMaker {
    private Document doc;

    public Document getDocument () {return(doc);}

    public DocumentMaker (String filename) {
        try {
            DOMParser dp = new DOMParser();
            dp.parse(filename);
            doc = dp.getDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

/**************************************
class NodeTypes encapsulates text names for the various
node types. Its asAscii method returns those strings
according to its nodeTypeNumber argument. It's a number
to string translator.
**************************************/
class NodeTypes {
    private static String[] nodenames={"ELEMENT","ATTRIBUTE","TEXT",
                       "CDATA_SECTION","ENTITY_REFERENCE",
                       "ENTITY","PROCESSING_INSTRUCTION",
                       "COMMENT","DOCUMENT","DOCUMENT_TYPE",
                       "DOCUMENT_FRAGMENT","NOTATION"};

    public static String asAscii(int nodeTypeNumber) {
        return(nodenames[nodeTypeNumber-1]);
    }
}

/**************************************
class DOMwalker's job is to walk the DOM document and
print out each node's type, its name, its value (null
for Elements), and in the case of elements, its attributes
in parentheses. The tree is walked non-recursively using
standard DOM traversal methods.
**************************************/
class DOMwalker {
    private Node checker;   // like a checker that gets moved from
                            // square to square in a checkers game
                            // points to "current" node

    private void indentToLevel(int level) {
        for(int n=0; n < level; n++) {
            System.out.print(" ");
        }
    }

    private void printNodeInfo(Node thisNode) {
        System.out.print(NodeTypes.asAscii(thisNode.getNodeType()) + " : " +
                         thisNode.getNodeName() + " : " +
                         thisNode.getNodeValue() + " : ");
        System.out.println();
    }

    public DOMwalker(Document doc) {
        boolean ascending = false;
        int level = 1;
        System.out.println();
        try {
            checker=doc.getDocumentElement();
            while (true) {
                //*** TAKE ACTION ON NODE WITH CHECKER ***
                if (!ascending) {
                    indentToLevel(level);
                    printNodeInfo(checker);
                }

                //*** GO DOWN IF YOU CAN ***
                if ((checker.hasChildNodes()) && (!ascending)) {
                    checker=checker.getFirstChild();
                    ascending = false;
                    level++;
                }
                //*** OTHERWISE GO RIGHT IF YOU CAN ***
                else if (checker.getNextSibling() != null) {
                    checker=checker.getNextSibling();
                    ascending = false;
                }
                //*** OTHERWISE GO UP IF YOU CAN ***
                else if (checker.getParentNode() != null) {
                    checker=checker.getParentNode();
                    ascending = true;
                    level--;
                }
                //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                else {
                    break;
                }
            }
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

/**************************************
Class Hello is the repository of this program's main routine.
It removes empty text nodes, then walks the DOM tree and
prints out the DOM tree's info.
**************************************/
class Hello {
    public static void main(String[] args) {
        String filename = args[0];
        System.out.print("Walking XML file " + filename + " ... ");
        DocumentMaker docMaker = new DocumentMaker(filename);
        Document doc = docMaker.getDocument();
        try {
            DOMwalker walker = new DOMwalker(doc);
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
In the preceding code, class DocumentMaker encapsulates all parser dependent code except the parser's import statement. What this means is you can place this class in a separate file, along with its parser dependent import statement. If you change parsers, all you need to change is that separate file. Class DocumentMaker delivers a parser independent DOM Document object.
Class NodeTypes implements a simple number to text name lookup on node types.
The logic of class DOMwalker was discussed extensively before the preceding code. Its constructor (DOMwalker()) is simply the implementation of the down if you can, else right if you can, else up if you can, else done algorithm. A Node object called checker is continually moved through that algorithm, thus defining the current position.
The level variable keeps track of the level in the hierarchy, and is passed to method indentToLevel(), which outputs the proper indentation for the level.
The printNodeInfo() method prints information about the node passed as its only argument. In this simplified program it prints only the node type, the node name, and the node value.
Class Hello implements the main logic of the program, using the DocumentMaker object to create a Document object, which it passes to the DOMwalker object.
To test the preceding code, use your favorite text editor to create
the following XML file, called simple.xml:
<?xml version="1.0"?>
<toplevel>
  <secondlevel>
    This is a text node within the second level.
  </secondlevel>
</toplevel>
Now test your program. The following is the command and the output:
$ ./jj simple.xml
Walking XML file simple.xml ...
 ELEMENT : toplevel : null :
  TEXT : #text : :
  ELEMENT : secondlevel : null :
   TEXT : #text : This is a text node within the second level. :
  TEXT : #text : :
 DOCUMENT_TYPE : toplevel : null :
$
The preceding output is pretty ugly, isn't it? Why are there three text
nodes instead of just the one with text? The answer is that the newlines
and formatting spaces between tags are interpreted by many parsers as text
nodes. You can verify that by creating a single line version of simple.xml
(called simp.xml here) with no spaces between tags, and the result looks like what you expect:
$ ./jj simp.xml
Walking XML file simp.xml ...
 ELEMENT : toplevel : null :
  ELEMENT : secondlevel : null :
   TEXT : #text : This is a text node within the second level. :
 DOCUMENT_TYPE : toplevel : null :
$
Note that because there's no DTD, the DOCUMENT_TYPE node is empty.
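You can see the extra text nodes directly by counting the children of the top level element. The following is a small sketch under stated assumptions: the class name WhitespaceNodes is hypothetical, and it uses the JDK's default non-validating JAXP parser (not the Xerces DOMParser class), which keeps whitespace between tags as text nodes:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceNodes {
    // Returns how many child nodes the document element has.
    static int childCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String formatted =
            "<toplevel>\n  <secondlevel>text</secondlevel>\n</toplevel>";
        String oneLine =
            "<toplevel><secondlevel>text</secondlevel></toplevel>";
        // Formatted version: blank text, element, blank text
        System.out.println(childCount(formatted)); // prints 3
        // Single line version: just the element
        System.out.println(childCount(oneLine));   // prints 1
    }
}
```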
The WhiteSpaceKiller object is basically a DOM walker whose action on the current node is to delete it if it's a blank text node. Once again, we use the down if possible, right if possible, up if possible, done algorithm. With a twist...
When you delete the current node, you can't very well look for its first child, its next sibling, or its parent. It's gone. It's now null. There are many ways of handling this, but I picked the method that seemed the cleanest to me. I keep track of where the checker was before it arrived "here", and if a deletion takes place, I move the checker back to its previous location. So the iteration will select the node after the one that was just deleted. By moving to the previous checker location, our down/right/up algorithm goes exactly where we would have from the text node if we hadn't deleted it.
The following is the WhiteSpaceKiller class code:
/**************************************
class WhiteSpaceKiller's job is to walk the DOM Document
and delete any empty text nodes. The tree is walked
non-recursively using standard DOM traversal methods.
Once an empty text node is deleted, the "checker" is moved
back to the previous node to avoid attempts at calling
DOM traversal methods on a (now) null object.
**************************************/
class WhiteSpaceKiller {
    private Node checker;   // like a checker that gets moved from
                            // square to square in a checkers game
                            // points to "current" node

    WhiteSpaceKiller(Document doc) {
        boolean ascending = false;
        Node previousChecker = null;
        try {
            checker=doc.getDocumentElement();
            while (true) {
                //*** TAKE ACTION ON NODE WITH CHECKER ***
                if ((!ascending) && (checker.getNodeType() == Node.TEXT_NODE)) {
                    String trimmedText = checker.getNodeValue().trim();
                    // Test the content; == would compare object references
                    if (trimmedText.length() == 0) {
                        checker.getParentNode().removeChild(checker);
                        checker=previousChecker;   //back to undeleted node
                    }
                }
                previousChecker=checker;

                //*** GO DOWN IF YOU CAN ***
                if ((checker.hasChildNodes()) && (!ascending)) {
                    checker=checker.getFirstChild();
                    ascending = false;
                }
                //*** OTHERWISE GO RIGHT IF YOU CAN ***
                else if (checker.getNextSibling() != null) {
                    checker=checker.getNextSibling();
                    ascending = false;
                }
                //*** OTHERWISE GO UP IF YOU CAN ***
                else if (checker.getParentNode() != null) {
                    checker=checker.getParentNode();
                    ascending = true;
                }
                //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                else {
                    break;
                }
            }
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
In the preceding, all the work is done in the constructor. The algorithm is almost identical to the DOM walker code discussed earlier, except for moving back one step upon deletion.
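As noted, stepping back to the previous node is only one of many ways to survive a deletion. For comparison, here is a sketch of a different technique: remember the next sibling before removing anything. Unlike the article's WhiteSpaceKiller it is recursive, so it trades the iterative walk for brevity. The class name BlankTextStripper is hypothetical, and it parses with the JDK's JAXP parser rather than the Xerces DOMParser class:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BlankTextStripper {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Removes whitespace-only text nodes below parent and
    // returns how many nodes were removed.
    static int stripBlankText(Node parent) {
        int removed = 0;
        Node child = parent.getFirstChild();
        while (child != null) {
            // Grab the sibling first; after removeChild() the removed
            // node no longer knows its neighbors.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                parent.removeChild(child);
                removed++;
            } else {
                removed += stripBlankText(child);
            }
            child = next;
        }
        return removed;
    }

    public static void main(String[] args) {
        Document doc = parse("<a>\n <b> x </b>\n</a>");
        System.out.println(stripBlankText(doc.getDocumentElement())); // prints 2
    }
}
```

Text that contains anything besides whitespace, such as " x " above, is left untouched.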
The WhiteSpaceKiller object is called upon to do its work in the Hello object,
just before calling the DOMwalker object to print the hierarchy. The following
code shows the addition of the WhiteSpaceKiller invocation, with
that invocation marked by a comment:
class Hello {
    public static void main(String[] args) {
        String filename = args[0];
        System.out.print("Walking XML file " + filename + " ... ");
        DocumentMaker docMaker = new DocumentMaker(filename);
        Document doc = docMaker.getDocument();
        try {
            WhiteSpaceKiller wpc = new WhiteSpaceKiller(doc);   // NEW
            DOMwalker walker = new DOMwalker(doc);
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
Try compiling and running the new code against simple.xml (the one with the newlines and indenting), and note that there are no extraneous text nodes:
[slitt@mydesk slitt]$ vi jj
[slitt@mydesk slitt]$ ./jj simple.xml
Walking XML file simple.xml ...
 ELEMENT : toplevel : null :
  ELEMENT : secondlevel : null :
   TEXT : #text : This is a text node within the second level. :
 DOCUMENT_TYPE : toplevel : null :
[slitt@mydesk slitt]$
We're almost there. The remaining task is to print the attributes of each element. We'll print a comma separated list of attributes in parentheses to the right of the node value. Each attribute will have the attribute's name followed by an equal sign followed by the attribute's value in double quotes.
Give the DOMwalker class a new method called printAttributes(), which
is called from printNodeInfo() on any node of type Node.ELEMENT_NODE.
The printAttributes() method uses DOM methods rather than native
Java language to work its magic.
private void indentToLevel(int level) {
    for(int n=0; n < level; n++) {
        System.out.print(" ");
    }
}

private void printAttributes(Node thisNode) {
    System.out.print("(");
    NamedNodeMap attribs = thisNode.getAttributes();
    int numAttribs = attribs.getLength();
    for(int i=0; i < numAttribs; i++){
        Node attrib = attribs.item(i);
        if(i>0){System.out.print(",");}
        System.out.print(attrib.getNodeName());
        System.out.print("=\"");
        System.out.print(attrib.getNodeValue());
        System.out.print("\"");
    }
    System.out.print(")");
}

private void printNodeInfo(Node thisNode) {
    System.out.print(NodeTypes.asAscii(thisNode.getNodeType()) + " : " +
                     thisNode.getNodeName() + " : " +
                     thisNode.getNodeValue() + " : ");
    if(thisNode.getNodeType() == Node.ELEMENT_NODE) {
        printAttributes(thisNode);
    }
    System.out.println();
}
The preceding is straight out of the DOM spec. getAttributes() returns a NamedNodeMap object with methods getLength() to return the number of items, and item() to return a single item by index. Then it's just a matter of iterating through them. Note that neither XML nor DOM specifies that the order of returned attributes is the same as in the file, so applications cannot assume anything concerning the order of returned attributes.
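Because attribute order isn't guaranteed, a program that needs repeatable output has to impose an order itself. The following is one possible sketch, not part of the article's walker: it collects the name="value" pairs and sorts them before printing. The class name SortedAttributes and its helper methods are hypothetical, and it uses JAXP to create the document:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class SortedAttributes {
    // Formats an element's attributes as (a="1",b="2"), sorted by name.
    static String format(Element elm) {
        NamedNodeMap attribs = elm.getAttributes();
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < attribs.getLength(); i++) {
            Node attrib = attribs.item(i);
            pairs.add(attrib.getNodeName() + "=\"" + attrib.getNodeValue() + "\"");
        }
        Collections.sort(pairs);   // impose our own, repeatable order
        return "(" + String.join(",", pairs) + ")";
    }

    // Builds a sample element whose attributes are set "out of order".
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element person = doc.createElement("person");
            person.setAttribute("lname", "Bush");
            person.setAttribute("fname", "George");
            return person;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(sample())); // prints (fname="George",lname="Bush")
    }
}
```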
Just in case you've gotten out of sync with these exercises, the following
source code listing is the complete listing for our DOM walker, complete
with blank text deletion and attribute listing:
/*
 * Copyright (C) 2001 by Steve Litt
 *
 * COMPLETE DOM WALKER
 *
 */
import java.io.IOException;                    // Exception handling
import org.w3c.dom.*;                          // DOM interface
import org.apache.xerces.parsers.DOMParser;    // Parser (to DOM)

/**************************************
class DocumentMaker encapsulates all parser dependent code.
If you change XML parsers, only this class and the parser's
import statement need be modified. As written, DocumentMaker
uses DOMParser from Apache.
**************************************/
class DocumentMaker {
    private Document doc;

    public Document getDocument () {return(doc);}

    public DocumentMaker (String filename) {
        try {
            DOMParser dp = new DOMParser();
            dp.parse(filename);
            doc = dp.getDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

/**************************************
class NodeTypes encapsulates text names for the various
node types. Its asAscii method returns those strings
according to its nodeTypeNumber argument. It's a number
to string translator.
**************************************/
class NodeTypes {
    private static String[] nodenames={"ELEMENT","ATTRIBUTE","TEXT",
                       "CDATA_SECTION","ENTITY_REFERENCE",
                       "ENTITY","PROCESSING_INSTRUCTION",
                       "COMMENT","DOCUMENT","DOCUMENT_TYPE",
                       "DOCUMENT_FRAGMENT","NOTATION"};

    public static String asAscii(int nodeTypeNumber) {
        return(nodenames[nodeTypeNumber-1]);
    }
}

/**************************************
class WhiteSpaceKiller's job is to walk the DOM Document
and delete any empty text nodes. The tree is walked
non-recursively using standard DOM traversal methods.
Once an empty text node is deleted, the "checker" is moved
back to the previous node to avoid attempts at calling
DOM traversal methods on a (now) null object.
**************************************/
class WhiteSpaceKiller {
    private Node checker;   // like a checker that gets moved from
                            // square to square in a checkers game
                            // points to "current" node

    WhiteSpaceKiller(Document doc) {
        boolean ascending = false;
        Node previousChecker = null;
        try {
            checker=doc.getDocumentElement();
            while (true) {
                //*** TAKE ACTION ON NODE WITH CHECKER ***
                if ((!ascending) && (checker.getNodeType() == Node.TEXT_NODE)) {
                    String trimmedText = checker.getNodeValue().trim();
                    // Test the content; == would compare object references
                    if (trimmedText.length() == 0) {
                        checker.getParentNode().removeChild(checker);
                        checker=previousChecker;   //back to undeleted node
                    }
                }
                previousChecker=checker;

                //*** GO DOWN IF YOU CAN ***
                if ((checker.hasChildNodes()) && (!ascending)) {
                    checker=checker.getFirstChild();
                    ascending = false;
                }
                //*** OTHERWISE GO RIGHT IF YOU CAN ***
                else if (checker.getNextSibling() != null) {
                    checker=checker.getNextSibling();
                    ascending = false;
                }
                //*** OTHERWISE GO UP IF YOU CAN ***
                else if (checker.getParentNode() != null) {
                    checker=checker.getParentNode();
                    ascending = true;
                }
                //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                else {
                    break;
                }
            }
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

/**************************************
class DOMwalker's job is to walk the DOM Document and
print out each node's type, its name, its value (null
for Elements), and in the case of elements, its attributes
in parentheses. The tree is walked non-recursively using
standard DOM traversal methods.
**************************************/
class DOMwalker {
    private Node checker;   // like a checker that gets moved from
                            // square to square in a checkers game
                            // points to "current" node

    private void indentToLevel(int level) {
        for(int n=0; n < level; n++) {
            System.out.print(" ");
        }
    }

    private void printAttributes(Node thisNode) {
        System.out.print("(");
        NamedNodeMap attribs = thisNode.getAttributes();
        int numAttribs = attribs.getLength();
        for(int i=0; i < numAttribs; i++){
            Node attrib = attribs.item(i);
            if(i>0){System.out.print(",");}
            System.out.print(attrib.getNodeName());
            System.out.print("=\"");
            System.out.print(attrib.getNodeValue());
            System.out.print("\"");
        }
        System.out.print(")");
    }

    private void printNodeInfo(Node thisNode) {
        System.out.print(NodeTypes.asAscii(thisNode.getNodeType()) + " : " +
                         thisNode.getNodeName() + " : " +
                         thisNode.getNodeValue() + " : ");
        if(thisNode.getNodeType() == Node.ELEMENT_NODE) {
            printAttributes(thisNode);
        }
        System.out.println();
    }

    public DOMwalker(Document doc) {
        boolean ascending = false;
        int level = 1;
        System.out.println();
        try {
            checker=doc.getDocumentElement();
            while (true) {
                //*** TAKE ACTION ON NODE WITH CHECKER ***
                if (!ascending) {
                    indentToLevel(level);
                    printNodeInfo(checker);
                }

                //*** GO DOWN IF YOU CAN ***
                if ((checker.hasChildNodes()) && (!ascending)) {
                    checker=checker.getFirstChild();
                    ascending = false;
                    level++;
                }
                //*** OTHERWISE GO RIGHT IF YOU CAN ***
                else if (checker.getNextSibling() != null) {
                    checker=checker.getNextSibling();
                    ascending = false;
                }
                //*** OTHERWISE GO UP IF YOU CAN ***
                else if (checker.getParentNode() != null) {
                    checker=checker.getParentNode();
                    ascending = true;
                    level--;
                }
                //*** OTHERWISE YOU'VE ASCENDED BACK TO ***
                //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE ***
                else {
                    break;
                }
            }
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

/**************************************
Class Hello is the repository of this program's main routine.
It removes empty text nodes, then walks the DOM tree and
prints out the DOM tree's info.
**************************************/
class Hello {
    public static void main(String[] args) {
        String filename = args[0];
        System.out.print("Walking XML file " + filename + " ... ");
        DocumentMaker docMaker = new DocumentMaker(filename);
        Document doc = docMaker.getDocument();
        try {
            WhiteSpaceKiller wpc = new WhiteSpaceKiller(doc);
            DOMwalker walker = new DOMwalker(doc);
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
Building a DOM Document From Scratch
We've built a DOM Document object from a file, parsed it, deleted blank nodes, and basically had our way with it. The one thing we haven't done is built one in memory from scratch. From-scratch building is necessary in order to save data out to XML, and also to use DOM as a tool for remembering out of order data in SAX apps. SAX apps are discussed in a later article.
Building a DOM Document from scratch isn't rocket science. Here are the steps:
Begin this exercise with an empty Hello.java file, or you'll paint yourself into a corner. If you want to save your current Hello.java, back it up before emptying it.
Start with an empty Hello.java file, and code the following class, which delivers a completely empty (not even a document Element) Document via its getDocument() method:
/**************************************
class EmptyDocumentMaker creates an empty document
**************************************/
class EmptyDocumentMaker {
    private Document doc;

    public Document getDocument () {return(doc);}

    public EmptyDocumentMaker () {
        try {
            DocumentBuilderFactory docBuilderFactory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder =
                docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.newDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
The preceding code uses javax.xml.parsers.DocumentBuilderFactory and javax.xml.parsers.DocumentBuilder to create an empty document. Please remember where you saw this code, as it's extremely difficult to find sample code to create an empty DOM document.
All elements are instantiated by the Document.createElement() method, and all text nodes are instantiated by the Document.createTextNode() method. Elements and text nodes are appended by the Node.appendChild() method. Note that the document element is added by running Node.appendChild() on the document itself. Documents can be thought of as a special kind of node.
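Before tackling the full example, it may help to see those three calls in the smallest possible document. This is a minimal sketch, not the article's code; the class name TinyBuild and the element names note and body are made up for illustration. It relies on the fact that appendChild() returns the node it just appended:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TinyBuild {
    // Builds <note><body>hello</body></note> in memory.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            // The document element is appended to the Document itself
            Element root = (Element) doc.appendChild(doc.createElement("note"));
            // appendChild() hands back the appended node, so one statement
            // both attaches the new element and gives us a reference to it
            Element body = (Element) root.appendChild(doc.createElement("body"));
            body.appendChild(doc.createTextNode("hello"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName());                               // prints note
        System.out.println(root.getFirstChild().getFirstChild().getNodeValue()); // prints hello
    }
}
```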
We're creating a sort of data file of political figures, consisting of two "records" -- Alan Greenspan and George Bush. Each record has the first and last names as attributes of its element. Each "record" element has subelements called job and party, each of which contain a text node to describe the job or party, as appropriate.
All work is done with an Element object. In cases where methods return a Node object, we typecast them to Element. Of course, this works only if we're sure the nodes returned really are Element objects. Because we're building it ourselves, we have that assurance, assuming we've coded correctly.
There are better and more readable ways to accomplish what's done by
the code below. Like everything in this Java/XML tutorial, there are Javaesque
methods that look more like native Java and are much easier. Once again, I want
to familiarize you with the use of the DOM API, not the Java language.
A couple of weeks doing this stuff, combined with reading some good XML/Java
books, will allow you to code this more efficiently and readably. So the
following is the code for class DomFiller, which builds the desired hierarchy
inside a formerly empty DOM Document object. Place this code below the
code for the EmptyDocumentMaker class:
/**************************************
Class DomFiller takes an empty DOM Document and fills it
as a demonstration of building a DOM Document in memory.
**************************************/
class DomFiller {
    public DomFiller(Document doc) {
        try {
            //*** CREATE THE DOCUMENT ELEMENT ***
            doc.appendChild(doc.createElement("mytoplevel"));

            //*** CREATE THE FIRST PERSON RECORD ***
            Element elm = doc.getDocumentElement();   //Get to a known state
            elm = (Element)elm.appendChild(doc.createElement("person"));
            elm.setAttribute("fname","Alan");
            elm.setAttribute("lname","Greenspan");
            elm = (Element)elm.appendChild(doc.createElement("job"));
            elm.appendChild(doc.createTextNode("Federal Reserve Chairman"));
            elm = (Element)elm.getParentNode().appendChild(
                      doc.createElement("party"));
            elm.appendChild(doc.createTextNode("Libertarian"));

            //*** CREATE THE SECOND PERSON RECORD ***
            elm = doc.getDocumentElement();           //Get to a known state
            elm = (Element)elm.appendChild(doc.createElement("person"));
            elm.setAttribute("lname", "Bush");
            elm.setAttribute("fname", "George");
            elm = (Element)elm.appendChild(doc.createElement("job"));
            elm.appendChild(doc.createTextNode("President"));
            elm = (Element)elm.getParentNode().appendChild(
                      doc.createElement("party"));
            elm.appendChild(doc.createTextNode("Republican"));
        } catch (Exception e) {
            System.out.print("DomFiller: " + e.getMessage());
        }
    }
}
/*
 * Copyright (C) 2001 by Steve Litt
 *
 * COMPLETE DOM WALKER
 *
 */
Now add these two import statements to the program's list of import
statements:
//*** NEXT 2 STATEMENTS CREATE EMPTY DOCUMENT ***
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
Next, delete the Hello class, go to the bottom, and insert the following
version of the Hello class:
/**************************************
Class Hello is the repository of this program's main routine.
Its purpose is to showcase building a DOM Document object
in memory. After doing that, it invokes a DOM walker to
prove that the DOM Document contains the desired material
in the desired organization.
**************************************/
class Hello {
    public static void main(String[] args) {
        try {
            EmptyDocumentMaker emptyDocMaker = new EmptyDocumentMaker();
            Document doc = emptyDocMaker.getDocument();
            DomFiller df = new DomFiller(doc);
            DOMwalker walker = new DOMwalker(doc);
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}
The preceding code instantiates an EmptyDocumentMaker to make
an empty document, then it instantiates a DomFiller, which fills
the empty document with the desired political figures and their information.
Finally, it instantiates a DOMwalker to walk the Document object.
Here's what you get when you compile and run it:
$ ./jj ELEMENT : mytoplevel : null : () ELEMENT : person : null : (fname="Alan",lname="Greenspan") ELEMENT : job : null : () TEXT : #text : Federal Reserve Chairman : ELEMENT : party : null : () TEXT : #text : Libertarian : ELEMENT : person : null : (fname="George",lname="Bush") ELEMENT : job : null : () TEXT : #text : President : ELEMENT : party : null : () TEXT : #text : Republican : $ |
Be sure to save your code once it works. You'll need it in the next exercise! |
The preceding is pretty much what you'd expect. There's a root level element called <mytoplevel>, containing two person elements, one named Alan Greenspan and one named George Bush, as evidenced by their fname and lname attributes. Each person element contains a job element and a party element, and those two subelements contain a text node with the proper information.
Once again, there are easier, more readable, and more natively Java
methods of accomplishing the preceding, but this shows off the DOM API
documented by the W3C. You'll be using this technique later when you use
SAX to process files whose information is ordered differently than the
desired output.
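As one illustration of the "more readable" route, the repeated createElement/appendChild/setAttribute calls can be wrapped in small helper methods. The following is only a sketch: the class name PersonDomSketch and the helpers addTextChild() and addPerson() are hypothetical names invented for this example, and the empty Document comes straight from JAXP rather than from EmptyDocumentMaker. It uses the same W3C DOM calls as DomFiller.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PersonDomSketch {
    // Hypothetical helper: appends <tag>text</tag> under parent.
    static Element addTextChild(Document doc, Element parent,
                                String tag, String text) {
        Element child = doc.createElement(tag);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
        return child;
    }

    // Hypothetical helper: appends one <person> record to the document element.
    static Element addPerson(Document doc, String fname, String lname,
                             String job, String party) {
        Element person = doc.createElement("person");
        person.setAttribute("fname", fname);
        person.setAttribute("lname", lname);
        addTextChild(doc, person, "job", job);
        addTextChild(doc, person, "party", party);
        doc.getDocumentElement().appendChild(person);
        return person;
    }

    // Builds the same two-person document that DomFiller builds.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("mytoplevel"));
            addPerson(doc, "Alan", "Greenspan",
                      "Federal Reserve Chairman", "Libertarian");
            addPerson(doc, "George", "Bush", "President", "Republican");
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println("person elements: "
            + doc.getElementsByTagName("person").getLength());
    }
}
```

Because each helper starts from an explicit parent, there is no need to keep reassigning a single elm variable and climbing back up with getParentNode().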
|
|
Start with the Hello.java from the preceding article. We're going to
change class DOMwalker to print out an XML file instead of an informative
outline. We'll display the code first, and discuss it following the code.
The changes from the preceding article's code are marked in bold red. Here's
the code:
class DOMwalker { private Node checker; // like a checker that gets moved from // square to square in a checkers game // points to "current" node private void indentToLevel(int level) { System.out.println(); for(int n=0; n < level; n++) { System.out.print(" "); } } private void printAttributes(Node thisNode) { NamedNodeMap attribs = thisNode.getAttributes(); int numAttribs = attribs.getLength(); for(int i=0; i < attribs.getLength(); i++){ Node attrib = attribs.item(i); System.out.print(" "); System.out.print(attrib.getNodeName()); System.out.print("=\""); System.out.print(attrib.getNodeValue()); System.out.print("\""); } } private void printNodeInfo(Node thisNode) { int nodeType = thisNode.getNodeType(); if(nodeType == Node.ELEMENT_NODE) { System.out.print("<" + thisNode.getNodeName().trim()); printAttributes(thisNode); System.out.print(">"); } else { System.out.print(thisNode.getNodeValue()); } } private void printEndTag(Node thisNode) { System.out.print("</" + thisNode.getNodeName() + ">"); } public DOMwalker(Document doc) { boolean ascending = false; int level = 0; // System.out.println(); try { checker=doc.getDocumentElement(); System.out.println("<?xml version=\"1.0\"?>"); while (true) { //*** TAKE ACTION ON NODE WITH CHECKER *** if (!ascending) { indentToLevel(level); printNodeInfo(checker); } else { if (checker.getNodeType() == Node.ELEMENT_NODE) { indentToLevel(level); printEndTag(checker); } } //*** GO DOWN IF YOU CAN *** if ((checker.hasChildNodes()) && (!ascending)) { checker=checker.getFirstChild(); ascending = false; level++; } //*** OTHERWISE GO RIGHT IF YOU CAN *** else if (checker.getNextSibling() != null) { checker=checker.getNextSibling(); ascending = false; } //*** OTHERWISE GO UP IF YOU CAN *** else if (checker.getParentNode() != null) { checker=checker.getParentNode(); ascending = true; level--; } //*** OTHERWISE YOU'VE ASCENDED BACK TO *** //*** THE DOCUMENT ELEMENT, SO YOU'RE DONE *** else { break; } } System.out.println(); } catch (Exception 
e) { System.out.println("\nError: " + e.getMessage()); } } } |
The preceding prints the DOM object as XML, nicely indented and formatted.
The only change to indentToLevel(int level) is that we print a linefeed before indenting. This is a handy way to make sure that all start tags, end tags, and text nodes are on their own line.
printAttributes(Node thisNode) is changed to the extent that there are no commas, and a couple other formatting changes. The original logic remains intact.
printNodeInfo(Node thisNode) has been changed so that it doesn't print extraneous information. On elements it prints the starting angle bracket, the node name, and then calls printAttributes() to print the attributes, and finally prints the closing angle bracket. On text nodes it simply prints the node's value. The logic is substantially unchanged.
DOMwalker(Document doc) has several minor changes. The level variable starts at 0 instead of 1, <?xml version="1.0"?> is printed at the top of the document, and a linefeed is printed at the bottom. It also has one major change: the logic of the action. It now takes action when the checker returns up to an element. The action, as you might expect, is to print a closing tag for the element. This is a slick way to avoid introducing a Stack to the logic :-).
The following shows what happens when you run this program:
$ ./jj <?xml version="1.0"?> <mytoplevel> <person fname="Alan" lname="Greenspan"> <job> Federal Reserve Chairman </job> <party> Libertarian </party> </person> <person fname="George" lname="Bush"> <job> President </job> <party> Republican </party> </person> </mytoplevel> $ |
That's it. You just wrote an XML file from a DOM Document.
Don't do this at work. The preceding example handles only elements and
text nodes -- hardly the entirety of the XML specification. There are all
sorts of pre-made classes to write XML files correctly, including DTD's.
The preceding was simply intended to show that writing XML from a DOM Document
isn't rocket science.
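One of those pre-made routes is the JAXP transformation API that ships with current JDKs. The sketch below assumes that API is available (it was not part of the download described in this article), and the class name DomToXmlSketch and its sample document are made up for illustration. An "identity" Transformer copies a DOM Document to a character stream and takes care of escaping characters such as & and < that the hand-written walker prints verbatim.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToXmlSketch {
    // Serializes a DOM Document to a String with an identity transform.
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> declaration to keep the output short.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical one-element document whose text needs escaping.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element job = doc.createElement("job");
            job.appendChild(doc.createTextNode("R&D <lead>"));
            doc.appendChild(job);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The ampersand and the less-than sign come out escaped.
        System.out.println(toXml(sample()));
    }
}
```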
|
|
DOM walking is an impressive demonstration, but in reality it yields very little power. The average app knows something about what it's looking for. It needs a specific piece of info, and walking the DOM to find it would be folly. So the DOM specification gives methods to find elements and attributes with specific names. Element method getElementsByTagName(String name) yields a NodeList containing all descendant elements (not just immediate subelements) with the specified tag, in document order. For a given element, getAttribute(String name) returns the value of the attribute whose name is the argument. getAttributeNode(String name) finds the same attribute, except it delivers the whole Attr object instead of just the value.
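Here is a small sketch of those three methods in action. The class name TagLookupSketch and the two-worker XML string are invented for this example, and the document is parsed from a String with the JAXP DocumentBuilder rather than from a file. Note that the info elements are grandchildren of the document element, yet getElementsByTagName() called on the document element still finds them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagLookupSketch {
    // Made-up sample data in the spirit of a workers file.
    static final String XML =
        "<workers>"
        + "<contractor><info lname=\"albertson\"/></contractor>"
        + "<employee><info lname=\"bartholemew\"/></employee>"
        + "</workers>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = parse(XML).getDocumentElement();

        // Finds <info> at any depth below root, in document order.
        NodeList infos = root.getElementsByTagName("info");
        System.out.println("info elements found: " + infos.getLength());

        Element first = (Element) infos.item(0);
        String value = first.getAttribute("lname");     // just the value
        Attr attr = first.getAttributeNode("lname");    // the whole Attr node
        System.out.println(value + " / " + attr.getName() + "=" + attr.getValue());

        // For a missing attribute, getAttribute() gives an empty string
        // and getAttributeNode() gives null.
        System.out.println(first.getAttribute("nosuch").length());
        System.out.println(first.getAttributeNode("nosuch") == null);
    }
}
```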
Start by creating the following XML file, which has been crafted just
for this example (and is probably lousy XML because contractors, employees
and partners are all people). Name the following XML file workers.xml:
<?xml version="1.0"?>
<workers>
  <contractor>
    <info lname="albertson" fname="albert" ssno="123456789"/>
    <job>C++ programmer</job>
    <hiredate>1/1/1999</hiredate>
  </contractor>
  <employee>
    <info lname="bartholemew" fname="bart" ssno="223456789"/>
    <job>Technology Director</job>
    <hiredate>1/1/2000</hiredate>
    <firedate>1/11/2000</firedate>
  </employee>
  <partner>
    <info lname="carlson" fname="carl" ssno="323456789"/>
    <job>labor law</job>
    <hiredate>10/1/1979</hiredate>
  </partner>
  <contractor>
    <info lname="denby" fname="dennis" ssno="423456789"/>
    <job>cobol programmer</job>
    <hiredate>1/1/1959</hiredate>
  </contractor>
  <employee>
    <info lname="edwards" fname="eddie" ssno="523456789"/>
    <job>project manager</job>
    <hiredate>4/4/1996</hiredate>
  </employee>
  <partner>
    <info lname="fredericks" fname="fred" ssno="623456789"/>
    <job>intellectual property law</job>
    <hiredate>10/1/1991</hiredate>
  </partner>
</workers>
Let's say you want to print out the last names of the contractors. Start
by copying the complete DOM walker
program to Hello.java. The complete DOM walker is the one shown previously
in this tutorial, with a top comment like this:
/* * Copyright (C) 2001 by Steve Litt * * COMPLETE DOM WALKER * */ |
Next, create the following ContractorLastNamePrinter class:
/************************************** class ContractorLastNamePrinter prints the last names of contractors only. It must be run on the workers.xml example file. **************************************/ class ContractorLastNamePrinter { ContractorLastNamePrinter(Document doc) { System.out.println(); try { //*** GET DOCUMENT ELEMENT BY NAME *** NodeList nodelist = doc.getElementsByTagName("workers"); Element elm = (Element) nodelist.item(0); //*** GET ALL contractors BELOW workers *** NodeList contractors = elm.getElementsByTagName("contractor"); for(int i = 0; i < contractors.getLength(); i++) { Element contractor = (Element) contractors.item(i); //*** NO NEED TO ITERATE info ELEMENTS, *** //*** WE KNOW THERE'S ONLY ONE *** Element info = (Element)contractor.getElementsByTagName("info").item(0); System.out.println( "Contractor last name is " + info.getAttribute("lname")); } } catch (Exception e) { System.out.println( "ContractorLastNamePrinter() error: " + e.getMessage()); } } } |
In the preceding code, elements are looked up by name, and the list of elements is iterated through.
Your final step is to instantiate ContractorLastNamePrinter instead
of DOMwalker, as shown in the code below:
/************************************** Class Hello is the repository of this program's main routine. It removes empty text nodes, then walks the DOM tree and prints out the DOM tree's info. **************************************/ class Hello { public static void main(String[] args) { String filename = args[0]; System.out.print("Walking XML file " + filename + " ... "); DocumentMaker docMaker = new DocumentMaker(filename); Document doc = docMaker.getDocument(); try { WhiteSpaceKiller wpc = new WhiteSpaceKiller(doc); // DOMwalker walker = new DOMwalker(doc); ContractorLastNamePrinter cPrinter = new ContractorLastNamePrinter(doc); } catch (Exception e) { System.out.println("\nError: " + e.getMessage()); } } } |
Now run the program against workers.xml, and as predicted,
it prints the last name of the contractors and nobody else.
$ ./jj workers.xml Walking XML file workers.xml ... Contractor last name is albertson Contractor last name is denby [slitt@mydesk slitt]$ |
Obviously the preceding was written as an illustration rather than as real Java code. It instantiates too many objects, and it doesn't combine steps that could have been combined. But it's pretty clear how the preceding code uses getElementsByTagName() and getAttribute() to print only the desired information.
You can optimize your code as appropriate. The important thing is that you understand the workings of getElementsByTagName(), getAttributeNode() and getAttribute().
|
|
As a junior programmer in the days when 128 kilobytes was the most RAM you could expect on the company minicomputer, I got a specification change report for our batch insurance printing program. It sounded tiny to the DP manager, but it resulted in major structural changes. This program printed the forms you get after going to the doctor. "All we want," said the DP manager, "is to have the total at the top of the page instead of the bottom."
I laughed hysterically. "How do you expect me," I exclaimed, "to know the total before reading the line items?" I asked her if she had some tea leaves I could read to predict the future.
I finally did it with a sortable intermediate file. Each sheet's line items remained in order, but each page's total sorted to the top of its sheet. I printed straight off the intermediate file. In the days when you had to fight for every kilobyte, that was probably the best solution.
Of course, I could have kept all lineitem info for the page in memory, with break logic triggering a complete calculation and print. I could have simulated the 80 characters per line and 66 lines as a 5280 byte 2 dimensional array, in which case I could have "moved the paper backwards". But back then, RAM was too dear. Anything more than a small array was written out to temporary files.
This story introduces the difference between DOM and SAX. DOM keeps the entire XML file in memory, ready for instant and random access. As in all other computations, when you can afford the RAM, keeping information in RAM makes your programming task much easier. But if your XML file is a gigabyte long, DOM isn't an option. Additionally, if you don't know how big the XML file will eventually be, DOM is a bad idea. Use DOM when the likelihood of memory exhaustion is nil.
SAX is a parsing methodology, plain and simple. A SAX parser reads an XML file, and every time it runs across a start tag, an end tag, a run of text, or another piece of markup, it reports it.
How does SAX report what it found? It calls a callback routine supplied by the application programmer. The programmer loads the callback routine with code to process the information. For instance, the callback routine for an element's start tag would inquire about the element's attributes. In the callback routines, the programmer saves what must be saved, and keeps track of the hierarchical nature of the XML file. For instance, three element start tags without an element end tag means the three are at descending levels.
This is a lot of busywork for the programmer, so the obvious question is "why not use DOM?". The answer is usually "we can't afford the memory", or "we don't know how big this thing will end up being". SAX stores nothing. It can parse a terabyte file a little bit at a time. You can do anything with SAX, but DOM is reserved for known-resource, small footprint XML hierarchies.
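To make the "start tags without end tags mean descending levels" idea concrete, here is a minimal sketch that tracks depth with nothing but a counter. It rests on a few assumptions: it uses the SAX parser bundled with the JDK through JAXP's SAXParserFactory rather than the Xerces class used in this article, it extends the SAX2 convenience class DefaultHandler so that only two callbacks need to be written, and the name DepthSketch is invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DepthSketch extends DefaultHandler {
    private int depth = 0;   // how many elements are currently open
    int maxDepth = 0;        // deepest nesting seen so far
    int elementCount = 0;    // total start tags seen

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        depth++;             // a start tag takes us one level down
        elementCount++;
        if (depth > maxDepth) { maxDepth = depth; }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;             // an end tag takes us back up
    }

    // Parses an XML string and returns the handler holding the counts.
    static DepthSketch scan(String xml) {
        try {
            DepthSketch handler = new DepthSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DepthSketch result = scan("<a><b><c/></b><d/></a>");
        System.out.println("elements=" + result.elementCount
            + " maxDepth=" + result.maxDepth);
    }
}
```

Nothing from the document is stored except three integers, which is why this style of processing scales to files of any size.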
So let's make a proof of concept SAX program to report the document element. The first step is to make sure we even have the SAX API for Java. Fortunately, that should have been included with the xerces.jar that we downloaded. To test, try this Hello.java:
import java.io.IOException;
import org.xml.sax.XMLReader;

class Hello {
    public static void main(String[] args) {
        System.out.println("Hello " + args[0] + "!");
    }
}
$ ./jj firstarg Hello firstarg! $ |
If you get error messages, investigate the $CLASSPATH setting in your jj script, and whether you really downloaded and properly extracted xerces.jar. Once you get the program running, it's time to try a ghost parser.
The ghost parser gets its name from the fact that it outputs nothing.
Once again, it's a test to make sure your SAX API is downloaded and working.
Code for the ghost parser follows:
import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.apache.xerces.parsers.SAXParser;

class Hello {
    public static void main(String[] args) {
        System.out.print("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}
Now compile and run it:
$ ./jj blank.xml parsing blank.xml... Done! $ |
The preceding parsed the file (we assume), but because there were no callbacks, it did nothing. ContentHandler is an interface, so the next step is to write a class that implements ContentHandler to define the callbacks, instantiate that class from the Hello class, and link it with the SAXParser. The following code implements such a class (called MyContentHandler in the code), and links the handler object to the parser object:
import java.io.IOException; import org.xml.sax.XMLReader; import org.xml.sax.SAXException; import org.xml.sax.ContentHandler; import org.xml.sax.Locator; import org.xml.sax.Attributes; import org.apache.xerces.parsers.SAXParser; class MyContentHandler implements ContentHandler { // Receive notification of character data. public void characters(char[] ch, int start, int length) { } // Receive notification of the end of a document. public void endDocument() { } // Receive notification of the end of an element. public void endElement(java.lang.String namespaceURI, java.lang.String localName, java.lang.String qName) { } // End the scope of a prefix-URI mapping. public void endPrefixMapping(java.lang.String prefix) { } // Receive notification of ignorable whitespace in element content. public void ignorableWhitespace(char[] ch, int start, int length) { } // Receive notification of a processing instruction. public void processingInstruction( java.lang.String target, java.lang.String data ) { } // Receive an object for locating the origin of SAX document events. public void setDocumentLocator(Locator locator) { } // Receive notification of a skipped entity. public void skippedEntity(java.lang.String name) { } // Receive notification of the beginning of a document. public void startDocument() { } // Receive notification of the beginning of an element. public void startElement(java.lang.String namespaceURI, java.lang.String localName, java.lang.String qName, Attributes atts) { System.out.println(localName); //<<=====PRINT ELEMENT NAME } // Begin the scope of a prefix-URI Namespace mapping. public void startPrefixMapping(java.lang.String prefix, java.lang.String uri) { } } class Hello { public static void main(String[] args) { System.out.print("parsing " + args[0] + "... 
"); try { XMLReader parser = new SAXParser(); ContentHandler handler = new MyContentHandler(); parser.setContentHandler(handler); parser.parse(args[0]); } catch (IOException e) { System.out.print("IO Exception: " + e.getMessage()); } catch (SAXException e) { System.out.print("SAX Exception: " + e.getMessage()); } System.out.println("Done!"); } } |
In the preceding, the only parsing action implemented in MyContentHandler is to print each element's name (see the large, bold, green comment). All the rest of the methods are just stubs. The list of methods was taken directly from the authoritative documentation at http://www.megginson.com/SAX/Java/javadoc/org/xml/sax/ContentHandler.html. The MyContentHandler object is created in the main routine, and then linked to the parser via the parser.setContentHandler() method. Everything added to the previous program is marked in bold red for your understanding.
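One detail worth knowing when you print localName: its content depends on whether the parser is doing namespace processing. The sketch below is an assumption-laden illustration, not the code from this article. It uses the JDK's JAXP SAXParserFactory (whose parsers are not namespace aware unless asked, so localName may come back empty), extends the SAX2 helper class DefaultHandler instead of stubbing every ContentHandler method, and falls back to qName when localName is empty. The class name ElementNameSketch is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ElementNameSketch extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Prefer localName, but fall back to qName if the parser left it empty.
        boolean noLocal = (localName == null || localName.isEmpty());
        names.add(noLocal ? qName : localName);
    }

    // Returns the element names in the order their start tags were seen.
    static List<String> namesOf(String xml, boolean namespaceAware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            ElementNameSketch handler = new ElementNameSketch();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<toplevel><secondlevel>text</secondlevel></toplevel>";
        System.out.println(namesOf(xml, true));
        System.out.println(namesOf(xml, false));
    }
}
```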
Create the following simple.xml to test this program:
<?xml version="1.0"?>
<toplevel>
  <secondlevel>
    This is a text node within the second level.
  </secondlevel>
</toplevel>
When you compile and run the program it should simply list the elements:
$ ./jj simple.xml parsing simple.xml... toplevel secondlevel Done! $ |
Notice that the printing of the element names occurs between the first parsing simple.xml... and the Done! prompts. The element printing occurs as part of the parsing process, in the callback routine. Now it's time to make a SAX tree walker...
As far as level goes, it's incremented by the last statement in startElement() and decremented by endElement(). Even though characters() uses level to indent, characters() doesn't change it, but instead, indents one indent past the current level (text nodes are children of their parent elements).
The characters() callback tests for an all whitespace string, and if it isn't all whitespace, prints it. Note that characters() doesn't receive a String, but instead an array of characters with a start point and a length. The "string" in question is the length characters beginning at the start point. SAX does this for performance reasons. It's the programmer's job to move those characters into a String object with the new String(characterArray, start, length) constructor.
The startElement() callback prints the proper indent, prints the type (which is always ELEMENT because this callback is called only by elements), then the name, then the value (which is always null, so hardcoded to novalue). Finally, startElement()'s atts argument is iterated, via the SAX Attributes interface methods, and printed.
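A caution about characters(): the SAX specification allows a parser to deliver one run of text in several chunks, so a single text node can trigger more than one characters() call. A common defensive pattern, sketched below, is to append each chunk to a buffer and only look at the text when the next start or end tag arrives. As before, this sketch assumes the JDK's JAXP parser and the DefaultHandler helper class, and the names TextBufferSketch and flush() are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextBufferSketch extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    // Record the buffered text if it isn't all whitespace, then clear it.
    private void flush() {
        String s = buffer.toString().trim();
        if (s.length() > 0) { texts.add(s); }
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);   // may be only part of the text
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flush();   // text before a child element is complete
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();   // text before an end tag is complete
    }

    // Returns the non-whitespace text runs found in the XML string.
    static List<String> textsOf(String xml) {
        try {
            TextBufferSketch handler = new TextBufferSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textsOf(
            "<toplevel>\n <secondlevel>\n  Text &amp; more text.\n"
            + " </secondlevel>\n</toplevel>"));
    }
}
```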
The added and changed code is in bold red. The following is the SAX
tree walker code:
import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;

class MyContentHandler implements ContentHandler {
    private int level = 0;

    // Receive notification of character data.
    public void characters(char[] ch, int start, int length) {
        String s = new String(ch, start, length);
        // Test the trimmed length; != on Strings compares object
        // references, not contents, so it can't detect all-whitespace text.
        if (s.trim().length() > 0) {
            for(int i=0; i < level + 1; i++) {System.out.print(" ");}
            System.out.print("TEXT : noname : ");
            System.out.println(s);
        }
    }

    // Receive notification of the end of a document.
    public void endDocument() { }

    // Receive notification of the end of an element.
    public void endElement(java.lang.String namespaceURI,
            java.lang.String localName, java.lang.String qName) {
        level--;
    }

    // End the scope of a prefix-URI mapping.
    public void endPrefixMapping(java.lang.String prefix) { }

    // Receive notification of ignorable whitespace in element content.
    public void ignorableWhitespace(char[] ch, int start, int length) { }

    // Receive notification of a processing instruction.
    public void processingInstruction(
            java.lang.String target, java.lang.String data ) { }

    // Receive an object for locating the origin of SAX document events.
    public void setDocumentLocator(Locator locator) { }

    // Receive notification of a skipped entity.
    public void skippedEntity(java.lang.String name) { }

    // Receive notification of the beginning of a document.
    public void startDocument() { }

    // Receive notification of the beginning of an element.
    public void startElement(java.lang.String namespaceURI,
            java.lang.String localName, java.lang.String qName,
            Attributes atts) {
        for(int i=0; i < level; i++) {System.out.print(" ");}
        System.out.print("ELEMENT : ");
        System.out.print(localName);
        System.out.print(" : novalue : ");
        System.out.print("(");
        for(int i=0; i < atts.getLength(); i++) {
            if(i > 0) {System.out.print(",");}
            System.out.print(atts.getLocalName(i) + "=\"" +
                atts.getValue(i) + "\"");
        }
        System.out.print(")");
        System.out.println();
        level++;
    }

    // Begin the scope of a prefix-URI Namespace mapping.
    public void startPrefixMapping(java.lang.String prefix,
            java.lang.String uri) { }
}

class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}
Compile and run the program. On simple.xml it should look something
like this:
$ ./jj simple.xml parsing simple.xml... ELEMENT : toplevel : novalue : () ELEMENT : secondlevel : novalue : () TEXT : noname : This is a text node within the second level. Done! $ |
Note that the actual text is a line below its node type and name. That's because it includes the newline and spaces of the original XML. If you don't like this, simply use the String.trim() method to clip border whitespace.
Try running this program on a longer file like the blank.xml file you made in the Dia exercises earlier in this magazine, or on some other complex XML file. Notice the attributes, and the faithful adherence to the XML file's hierarchy.
import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;

class MyContentHandler implements ContentHandler {
    private int level = 0;

    public static void indent(int llevel) {
        for(int i=0; i < llevel; i++) {System.out.print(" ");}
    }

    // Receive notification of character data.
    public void characters(char[] ch, int start, int length) {
        String s = new String(ch, start, length);
        // Test the trimmed length; != on Strings compares object
        // references, not contents, so it can't detect all-whitespace text.
        if (s.trim().length() > 0) {
            indent(level + 1);
            System.out.print("TEXT : noname : ");
            System.out.println(s);
        }
    }

    // Receive notification of the end of a document.
    public void endDocument() {
        indent(level);
        System.out.println("endDocument()");
    }

    // Receive notification of the end of an element.
    public void endElement(java.lang.String namespaceURI,
            java.lang.String localName, java.lang.String qName) {
        level--;
        indent(level);
        System.out.println("endElement(namespaceURI=" + namespaceURI +
            ",localName=" + localName + ",qName=" + qName +")");
    }

    // End the scope of a prefix-URI mapping.
    public void endPrefixMapping(java.lang.String prefix) {
        indent(level);
        System.out.println("endPrefixMapping(prefix=" + prefix +")");
    }

    // Receive notification of ignorable whitespace in element content.
    public void ignorableWhitespace(char[] ch, int start, int length) {
        indent(level);
        String s = new String(ch, start, length);
        System.out.println("ignorableWhitespace(string=" + s + ")");
    }

    // Receive notification of a processing instruction.
    public void processingInstruction(
            java.lang.String target, java.lang.String data ) {
        indent(level);
        System.out.println("processingInstruction(target=" + target +
            ",data=" + data + ")");
    }

    // Receive an object for locating the origin of SAX document events.
    public void setDocumentLocator(Locator locator) {
        indent(level);
        System.out.println("setDocumentLocator()");
    }

    // Receive notification of a skipped entity.
    public void skippedEntity(java.lang.String name) {
        indent(level);
        System.out.println("skippedEntity(name=" + name + ")");
    }

    // Receive notification of the beginning of a document.
    public void startDocument() {
        indent(level);
        System.out.println("startDocument()");
    }

    // Receive notification of the beginning of an element.
    public void startElement(java.lang.String namespaceURI,
            java.lang.String localName, java.lang.String qName,
            Attributes atts) {
        indent(level);
        System.out.print("ELEMENT : ");
        System.out.print(localName);
        System.out.print(" : novalue : ");
        System.out.print("(");
        for(int i=0; i < atts.getLength(); i++) {
            if(i > 0) {System.out.print(",");}
            System.out.print(atts.getLocalName(i) + "=\"" +
                atts.getValue(i) + "\"");
        }
        System.out.print(")");
        System.out.println();
        level++;
    }

    // Begin the scope of a prefix-URI Namespace mapping.
    public void startPrefixMapping(java.lang.String prefix,
            java.lang.String uri) {
        indent(level);
        System.out.println("startPrefixMapping(prefix=" + prefix +
            ",uri=" + uri + ")");
    }
}

class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}
Start with the code for your SAX explorer in the preceding section of
this article. Next, add the following to your import statements:
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;
The preceding import the ErrorHandler interface and the exception type
passed to its callbacks. Next, code class MyErrorHandler, which prints simple messages
for each of ErrorHandler's three callbacks, error(),
fatalError(), and warning().
class MyErrorHandler implements ErrorHandler {
    public void error(SAXParseException exception) {
        System.out.println("SAX nonfatal error: " + exception.getMessage());
    }
    public void fatalError(SAXParseException exception) {
        System.out.println("SAX fatal error: " + exception.getMessage());
    }
    public void warning(SAXParseException exception) {
        System.out.println("SAX warning: " + exception.getMessage());
    }
}
Finally, "hook up" your new error handler by adding the two bolded (and
brown if you're looking at a color browser) lines between the parser.setContentHandler(handler)
statement and the parser.parse(args[0]) line.
class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            ErrorHandler errHandler = new MyErrorHandler();
            parser.setErrorHandler(errHandler);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}
Save, compile and run on blank.xml. You'll notice that the output is the same as before you added the ErrorHandler code. That's because there are no errors or warnings. But make a copy of blank.xml, remove the quotes around the value of any attribute, and run your code against that XML file. You'll see your fatal error message print just before the program aborts due to the fatal error.
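If you'd like to try the unquoted attribute experiment without editing files, the following sketch feeds XML strings to a parser and records what the error handler hears. It is a sketch under assumptions: it uses the JDK's bundled parser via JAXP rather than the Xerces SAXParser class, and the class name ErrorLogSketch and its check() method are invented for the example. The handler collects messages in a list instead of printing them, and the SAXException that the parser raises after a fatal error is caught so the list can be returned.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorLogSketch implements ErrorHandler {
    final List<String> log = new ArrayList<>();

    public void warning(SAXParseException e) {
        log.add("warning: " + e.getMessage());
    }
    public void error(SAXParseException e) {
        log.add("error: " + e.getMessage());
    }
    public void fatalError(SAXParseException e) {
        log.add("fatal: " + e.getMessage());
    }

    // Parses the string and returns whatever the error handler was told.
    static List<String> check(String xml) {
        ErrorLogSketch handler = new ErrorLogSketch();
        try {
            XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser gave up (for instance after a fatal error);
            // the handler has already logged the details.
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return handler.log;
    }

    public static void main(String[] args) {
        System.out.println(check("<a b=\"c\"/>"));   // well formed
        System.out.println(check("<a b=c/>"));       // attribute value not quoted
    }
}
```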
No, I was telling the truth, but a hierarchy walker is one of those lucky programs where the output follows the input. Remember the discussion of the batch insurance program in which I had to put the totals on top of the page, before receiving the line items making up that total? Those are the types of circumstances in which SAX programming becomes a challenge. The programmer must find a way of "remembering" the former information until outputting the later information.
All the same techniques are available. An intermediate file can be created and then sorted in such a way that the information appears in the order needed. The file can be traversed multiple times. Memory can be set aside to hold essential data.
Typically, the reason XML files get too big for DOM is that they have
many, many "records" (for want of a better word). Consider the following
trivially simple invoice data, which you should save as invoices.xml:
<?xml version="1.0"?>
<invoices>
  <invoice>
    <lineitem item="widget" price="21.41" quantity="4"></lineitem>
    <lineitem item="mousetrap" price="2.11" quantity="14"></lineitem>
    <customer>Garcia, Maria</customer>
    <lineitem item="wrench" price="8.88" quantity="3"></lineitem>
  </invoice>
  <invoice>
    <lineitem item="mouse" price="7.41" quantity="84"></lineitem>
    <customer>Smith, John</customer>
    <lineitem item="mousepad" price="0.91" quantity="184"></lineitem>
  </invoice>
</invoices>
This would have been just perfect for a simple SAX app, except that the specifications call for the total to print first, then the customer name, and then the lineitems. Oh, and by the way, there's no telling whether the <customer> element will be before the lineitems, after them, or even tucked in between them.
This is seriously ugly, and obviously horrible XML design. But I designed it to showcase the use of per-record DOM documents. Obviously, a simple line for line loop won't work here.
Nor will placing the entire file in a DOM document. According to the specifications, there can be up to 100,000 invoices in a single invoices.xml file. It just so happens that this example has only 2 invoices.
The entire file might be too big for a DOM document, but the single invoices certainly are not. The ContentHandler implementation can repeatedly instantiate empty DOM documents, fill them with data, total the invoices, and then print the lineitems and other information.
Before showing you the code, I'd like to explain the program at a high level. This is basically a SAX program, with a subclass of ContentHandler defining the necessary callbacks. Of all the ContentHandler callbacks, this code's MyContentHandler class uses only the following:
Basically, at all levels at and below <invoice>, the elements and text nodes read by the SAX parser are inserted into a DOM document, in the proper order and hierarchy. The SAX parser loads the DOM document, but it "zeros out" the DOM document at every </invoice> tag, and starts building fresh with every <invoice> tag. Because the DOM document starts out empty, there's no special logic for the first time. As a matter of fact, the SAX API is very break logic friendly. Everything has a begin and an end, so there's no need for priming this or after-the-loop that, or keeping track of whether it's been through the loop before.
Here's the logic at the highest level:
Now suppose that the start tag of Y happens after the end tag of X. The endElement of X will have already popped X off the stack, so the next element to be popped won't be X, but in fact it will be the parent of X. What if X has no parents? In this program, that means that X is an <invoice> element, and it is treated specially so that we don't pop off an empty stack.
In summary, using a stack guarantees the XML hierarchy will be faithfully reproduced in the per-invoice DOM document.
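The stack bookkeeping can be seen in isolation with a small sketch. It is not the article's program: it assumes a modern JDK and uses the SAX parser bundled with it through JAXP, and the names StackDemo and pathsOf are made up for illustration. Each startElement pushes the element name and each endElement pops it, so the top of the stack is always the parent of whatever comes next:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StackDemo {
    /** Returns "parent>child" for every element, using a stack to find the parent. */
    static List<String> pathsOf(String xml) {
        final Stack<String> stack = new Stack<>();
        final List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Whatever is on top of the stack is the element still open.
                String parent = stack.empty() ? "(none)" : stack.peek();
                result.add(parent + ">" + qName);
                stack.push(qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                stack.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<invoice><customer>Garcia, Maria</customer>"
                   + "<lineitem item=\"widget\"/></invoice>";
        for (String path : pathsOf(xml)) {
            System.out.println(path);
        }
    }
}
```

The per-invoice program does the same thing, except that it pushes DOM Element objects instead of names, so it can call appendChild() on the parent.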
I could go on talking about this for a long time, but instead I'll show you the code. As you look at the code, please keep in mind that it uses just the DOM methods and SAX API that we've already discussed, plus the Stack object, for which we use the push(), peek(), pop(), and empty() methods. empty() returns whether or not the stack is empty. This is important, because it's how we detect whether an element is above <invoice> in the hierarchy.
Here's the code:
import java.io.IOException;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.apache.xerces.parsers.SAXParser;
import org.w3c.dom.*;                  // DOM interface
import java.util.Stack;                // For Element Stack
import java.util.EmptyStackException;  // Stack exception

//*** NEXT 2 LINES NECESSARY CREATE EMPTY DOCUMENT ***
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

/**************************************
class EmptyDocumentMaker creates an empty document
**************************************/
class EmptyDocumentMaker {
    private Document doc;
    public Document getDocument () {return(doc);}
    public EmptyDocumentMaker () {
        try {
            DocumentBuilderFactory docBuilderFactory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder =
                docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.newDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }
}

class MyContentHandler implements ContentHandler {
    private Document doc;
    private Stack elementStack;

    /**********************************
    Following methods are added to those of ContentHandler
    They implement intermediate DOM document handling,
    invoice printing, and the like.
    **********************************/
    private void createEmptyDocument() {
        try {
            DocumentBuilderFactory docBuilderFactory =
                DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder =
                docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.newDocument();
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }

    private void resetDocument() {
        try {
            Element docElm = doc.getDocumentElement();
            docElm.getParentNode().removeChild(docElm);
            docElm = null;
        } catch (Exception e) {
            System.out.println("\nError: " + e.getMessage());
        }
    }

    private void printInvoice() {
        NodeList lineItems = doc.getDocumentElement().
            getElementsByTagName("lineitem");

        //*** PRINT TOP SEPARATOR ***
        System.out.println();
        System.out.print("===============================================");
        System.out.println();

        //*** CALCULATE THE TOTAL ***
        float total = (float)0.0;
        for(int i = 0; i < lineItems.getLength(); i++) {
            Element lineItemElm = (Element)lineItems.item(i);
            int quantity = Integer.valueOf(
                lineItemElm.getAttribute("quantity")).intValue();
            float unitPrice = Float.valueOf(
                lineItemElm.getAttribute("price")).floatValue();
            total += ((float)quantity * unitPrice);
        }

        //*** PRINT THE TOTAL ***
        System.out.println("Invoice total=" + total);

        //*** PRINT THE CUSTOMER ***
        Element custElm = (Element)(doc.getDocumentElement().
            getElementsByTagName("customer").
            item(0));
        String custName = custElm.getFirstChild().getNodeValue();
        System.out.println("Customer=" + custName);

        //*** PRINT THE LINE ITEMS ***
        System.out.println();
        System.out.println("Items purchased...");
        for(int i = 0; i < lineItems.getLength(); i++) {
            Element lineItemElm = (Element)lineItems.item(i);
            // Pad the item name with 20 blanks, then cut to a 20 character column
            System.out.print(
                (lineItemElm.getAttribute("item") + "                    ").
                substring(0,20)
                );
            int quantity = Integer.valueOf(
                lineItemElm.getAttribute("quantity")).intValue();
            float itemPrice = Float.valueOf(
                lineItemElm.getAttribute("price")).floatValue();
            float itemTotal = (float)quantity * itemPrice;
            System.out.println(
                quantity + " @ " + itemPrice + " = " + itemTotal);
        }
        System.out.println();

        //*** PRINT BOTTOM SEPARATOR ***
        System.out.print("===============================================");
        System.out.println();
    }

    /**********************************
    Following methods modify those of ContentHandler
    **********************************/
    public void startDocument() {
        try {
            elementStack = new Stack();
            this.createEmptyDocument();
        } catch (Exception e) {
            System.out.println("startDocument error: " + e.getMessage());
        }
    }

    public void startElement(java.lang.String namespaceURI,
                             java.lang.String localName,
                             java.lang.String qName,
                             Attributes atts) {
        try {
            if(localName.equals("invoice")) {
                doc.appendChild(doc.createElement("invoice"));
                elementStack.push(doc.getDocumentElement());
            } else if(!elementStack.empty()) {
                Element parentElm = (Element)elementStack.peek();
                Element tempElm = (Element)
                    (parentElm.appendChild(doc.createElement(localName)));
                for(int i=0; i < atts.getLength(); i++){
                    tempElm.setAttribute(atts.getLocalName(i), atts.getValue(i));
                }
                elementStack.push(tempElm);
            }
        } catch (EmptyStackException e) {
            System.out.println("startElement stack error: " + e.getMessage());
            System.out.println("startElement localName=" + localName);
        } catch (Exception e) {
            System.out.println("startElement error: " + e.getMessage());
        }
    }

    public void characters(char[] ch, int start, int length) {
        try {
            if(!elementStack.empty()) {
                String s = new String(ch, start, length);
                Element parentElm = (Element)elementStack.peek();
                parentElm.appendChild(doc.createTextNode(s));
            }
        } catch (EmptyStackException e) {
            System.out.println("characters stack error: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("characters error: " + e.getMessage());
        }
    }

    public void endElement(java.lang.String namespaceURI,
                           java.lang.String localName,
                           java.lang.String qName) {
        try {
            if(localName.equals("invoice")) {
                printInvoice();
                resetDocument();
                // Also empty the stack, so the finished invoice's nodes
                // are no longer referenced and can be garbage collected
                elementStack.clear();
            } else if(!elementStack.empty()) {
                elementStack.pop();
            }
        } catch (EmptyStackException e) {
            System.out.println("endElement stack error: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("endElement error: " + e.getMessage());
        }
    }

    /**********************************
    Following methods are empty stubs of ContentHandler methods
    This is necessary to compile.
    **********************************/
    public void endDocument() {}
    public void endPrefixMapping(java.lang.String prefix) {}
    public void ignorableWhitespace(char[] ch, int start, int length) {}
    public void processingInstruction(java.lang.String target,
                                      java.lang.String data) {}
    public void setDocumentLocator(Locator locator) {}
    public void skippedEntity(java.lang.String name) {}
    public void startPrefixMapping(java.lang.String prefix,
                                   java.lang.String uri) {}
}

class Hello {
    public static void main(String[] args) {
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            parser.parse("invoices.xml");
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}
Before compiling and running this code, BE SURE YOU HAVE SAVED THE invoices.xml FILE DESCRIBED EARLIER IN THIS ARTICLE. The preceding code expects XML with <invoice> under <invoices>, <customer> and multiple <lineitem> under <invoice>, and a text node describing the customer name under <customer>. Any other hierarchy will cause this program to fail. Therefore, I have hardcoded invoices.xml into the program as the file to be parsed.
If you've copied the invoices.xml file and the code correctly,
when you compile and run it the result should look something like this:
$ ./jj

===============================================
Invoice total=141.82001
Customer=Garcia, Maria

Items purchased...
widget              4 @ 21.41 = 85.639999
mousetrap           14 @ 2.1099999 = 29.539999
wrench              3 @ 8.8800001 = 26.639999

===============================================

===============================================
Invoice total=789.88
Customer=Smith, John

Items purchased...
mouse               84 @ 7.4099998 = 622.44
mousepad            184 @ 0.91000003 = 167.44

===============================================
Done!
$
Obviously this was a case contrived to showcase the use of per-record DOM documents in an otherwise SAX app. And I didn't bother to make columns line up or decimals round to pennies. But it doesn't take a lot of imagination to see how this technique can make life much easier for those coding apps with too many records to put in a single DOM document, but where each record is too complex or too out of order to print line for line from the SAX callbacks.
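The stray digits in the output (85.639999 rather than 85.64) come from doing the arithmetic in float. One possible cure, sketched here with a hypothetical helper called lineTotal that is not part of the program above, is to parse the price attribute straight into a java.math.BigDecimal:

```java
import java.math.BigDecimal;

class PennyMath {
    /** Multiplies a price string such as "21.41" by a quantity, with exact decimal math. */
    static BigDecimal lineTotal(String price, int quantity) {
        return new BigDecimal(price).multiply(BigDecimal.valueOf(quantity));
    }

    public static void main(String[] args) {
        // The three line items of the first invoice (Garcia, Maria).
        BigDecimal total = lineTotal("21.41", 4)
            .add(lineTotal("2.11", 14))
            .add(lineTotal("8.88", 3));
        System.out.println("Invoice total=" + total);   // prints Invoice total=141.82
    }
}
```

Because the price never passes through a float, the two decimal places in the XML attribute survive all the way to the printed total.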
This article illustrates one other thing. You've seen various parsers that parse straight to a DOM document, and you may have wondered how they work. Most of them do pretty much just what you did in this article -- they have their SAX callbacks call DOM methods to load a DOM document.
One more thing. There's a special kind of DOM object, called a DocumentFragment,
especially created to be a "lightweight object", which is perfect for temporarily
storing moderate amounts of data. You might be better off using a DocumentFragment
for per-record DOM work, but I didn't have time to check it out.
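For anyone who wants to check it out, here is a rough sketch of what holding one record in a DocumentFragment might look like. It is an illustration only, not a drop-in replacement for the program above, and the class and method names are invented. One thing it shows: a DocumentFragment can hold several top-level nodes, but it has no getElementsByTagName(), so the children have to be walked by hand.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class FragmentDemo {
    /** Stores one lineitem per quantity in a fragment, then sums the quantities back out. */
    static int totalQuantity(int[] quantities) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // The fragment is created by, and belongs to, the owning document.
        DocumentFragment frag = doc.createDocumentFragment();
        for (int q : quantities) {
            Element item = doc.createElement("lineitem");
            item.setAttribute("quantity", String.valueOf(q));
            frag.appendChild(item);
        }
        // No getElementsByTagName() on a fragment, so walk the sibling chain.
        int total = 0;
        for (Node n = frag.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                total += Integer.parseInt(((Element) n).getAttribute("quantity"));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Total quantity=" + totalQuantity(new int[] {4, 14, 3}));
    }
}
```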
DTD's are Document Type Definitions. They inform a validating parser about which combinations of elements and/or text nodes are legal, which attributes are legal for those elements, and what types of values those attributes can contain. The advantage of this is that an application can know what to expect, and can handle departures from the DTD with an error handler.
To repeat what was said much earlier in this issue of Troubleshooting Professional Magazine, to say an XML document is well formed means it conforms to XML syntax. To say a document is valid means the XML conforms to its DTD. Such validation can be done only by a validating parser. Non-validating parsers (I believe expat is non-validating) don't check the document against its DTD, as long as the DTD itself has proper syntax, although they may still read the DTD for things like default attribute values.
All the XML exercises in this TPM issue use the Apache Software Foundation's Xerces Java parser, which can run as a validating parser or a non-validating parser (default non-validating). This article will show you how to make a "Hello World" level DTD, how to turn on validation in the Xerces parser, and then expand from there. If you're using a different parser, you need to make adjustments. It might be better for the sake of these exercises to just use Xerces.
Because I ran out of time, the entire exploration of validation is done via SAX, which is very straightforward due to its ErrorHandler interface. Feel free to do some research on validation with parsers dumping to a DOM document.
rm Hello.class
CLASSPATH=$CLASSPATH:/usr/jre-blackdown1.2.2/lib/xerces.jar:.
javac Hello.java
java Hello $@
But in this article, we'll usually be changing the XML or DTD but not
the program. Therefore compiling wastes time. So create the following script
called ss, for the purpose of running the already compiled program:
CLASSPATH=$CLASSPATH:/usr/jre-blackdown1.2.2/lib/xerces.jar:.
java Hello $@
Just remember that IF you change the Java program, you MUST use the jj script.
Now let's make and explore a Hello World DTD...
Now create the following dtdtest.xml:
<?xml version="1.0"?> <docelement> </docelement> |
Obviously, this is nothing but a document element. Compile and run against
dtdtest.xml,
and this should be the result:
$ ./jj dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
So far, so good. But of course, there was no DTD. Now add a DTD for which the XML is valid (take my word for it):
<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
]>
<docelement>
</docelement>
In the preceding XML, the line starting with <!DOCTYPE declares the DTD. The opening square bracket says the DTD is internal to the document, and will be contained between the opening square bracket and a closing square bracket. Notice that the closing square bracket, and the angle bracket which completes the DOCTYPE declaration, are together on their own line. Between the declaration and the closing square and angle brackets is the single line saying the one and only element allowed in this document is called docelement, and that it can contain only text (that's what #PCDATA means -- text). Later we'll validate against a DTD in another file, but for now let's work with the DTD and XML in the same file.
Now run your already compiled Java program against this new DTD equipped
file, using the ss script, and watch what happens:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Nothing's changed. Of course, you'd expect that from a document that's valid according to its DTD. Now let's change the DTD to make an error occur:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT xdocelement (#PCDATA)> ]> <docelement> </docelement> |
We placed the letter x before the word docelement
in the <!ELEMENT> declaration. This makes the document invalid.
Let's see what happens:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Oops! There are still no errors. What happened? The parser is
running in non-validating mode. So we add the single line that calls
parser.setFeature() in the new main() code that follows:
class Hello {
    public static void main(String[] args) {
        System.out.println("parsing " + args[0] + "... ");
        try {
            XMLReader parser = new SAXParser();
            ContentHandler handler = new MyContentHandler();
            parser.setContentHandler(handler);
            ErrorHandler errHandler = new MyErrorHandler();
            parser.setErrorHandler(errHandler);
            //*** THE ADDED LINE: TURN VALIDATION ON ***
            parser.setFeature("http://xml.org/sax/features/validation", true);
            parser.parse(args[0]);
        } catch (IOException e) {
            System.out.print("IO Exception: " + e.getMessage());
        } catch (SAXException e) {
            System.out.print("SAX Exception: " + e.getMessage());
        }
        System.out.println("Done!");
    }
}
That should turn validation on. So let's compile and run again, with
the "bad" DTD. BE SURE TO USE jj instead of ss:
$ ./jj dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() SAX nonfatal error: Element type "docelement" must be declared. ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Good. The ErrorHandler's error() callback, the one that reports nonfatal
errors, was fired. Now let's delete the offending x, so the XML file looks like this:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> ]> <docelement> </docelement> |
Run the program, and watch the error go away:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
We're validating! * *
\ o /
\|/
| C O O L
/ \ _
/ \/
/
-
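The good-DTD and bad-DTD experiment can also be packed into one self-contained sketch. This is not the article's Hello program: it assumes a modern JDK and goes through JAXP's SAXParserFactory, whose setValidating(true) switches on the same validation feature as the setFeature() call above. The class name ValidateDemo is made up, and the exact error wording may differ from parser to parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidateDemo {
    static final String GOOD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (#PCDATA)>\n"
        + "]>\n"
        + "<docelement>\n"
        + "</docelement>\n";
    // Same document, but the DTD declares xdocelement instead of docelement.
    static final String BAD =
        GOOD.replace("<!ELEMENT docelement", "<!ELEMENT xdocelement");

    /** Parses with validation on and returns the nonfatal (validity) error messages. */
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("good DTD errors: " + validate(GOOD));
        System.out.println("bad DTD errors:  " + validate(BAD));
    }
}
```

Note that a validity error does not stop the parse. The error() callback is simply told about it, which is why the article's output keeps going after the "SAX nonfatal error" line.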
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED> ]> <docelement> </docelement> |
Run the program:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() SAX nonfatal error: Attribute "lname" is required and must be specified for element type "docelement". ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
So let's add an lname attribute and see if the parser likes it:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED> ]> <docelement lname="Litt"> </docelement> |
Sure enough, the program now reports the last name as an attribute instead of erroring
out, as the following output shows:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Next, let's give the element an attribute not declared in the DTD:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED> ]> <docelement lname="Litt" fname="Steve"> </docelement> |
Run the program, and see what happens:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() SAX nonfatal error: Attribute "fname" must be declared for element type "docelement". ELEMENT : docelement : novalue : (lname="Litt",fname="Steve") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Oh Oh, it errored out, saying fname must be declared. Note the difference
between this error message and the one where an attribute was declared
required but not included. So let's fix this by declaring attribute fname,
and let's declare it #IMPLIED, which means it can exist or not. We also
reformat it into lines, one line per attribute, one line for the declaration,
and one line for the declaration's ending angle bracket:
<?xml version="1.0"?>
<!DOCTYPE docelement [
<!ELEMENT docelement (#PCDATA)>
<!ATTLIST docelement
  lname CDATA #REQUIRED
  fname CDATA #IMPLIED
>
]>
<docelement lname="Litt" fname="Steve">
</docelement>
Run it:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt",fname="Steve") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Note that now the error goes away, and the fname attribute is reported.
Remembering that #IMPLIED means optional, let's remove
the fname attribute:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED fname CDATA #IMPLIED > ]> <docelement lname="Litt"> </docelement> |
And run it:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
As expected, the preceding exhibited no error, and no fname
attribute was reported. The DTD allowed the fname attribute to be missing,
but of course the application then saw no fname at all. Wouldn't it be nice if you could
declare a default fname in case the attribute was missing? Check out the
following:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED fname CDATA "XMLBrain" > ]> <docelement lname="Litt"> </docelement> |
In the preceding we simply replaced the word #IMPLIED with the desired
default value, which in this case is "XMLBrain". Running it we see:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt",fname="XMLBrain") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
As expected, the preceding output reports attribute fname to be the default value, "XMLBrain". Of course, we can override it as follows:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED fname CDATA "XMLBrain" > ]> <docelement lname="Litt" fname="Steve"> </docelement> |
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt",fname="Steve") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Do you remember the getSpecified() method in the DOM specification's Attr interface, the one that returns true or false? It's there to handle situations like the two previous examples. When you override the default, getSpecified() returns true. When you let it default, getSpecified() returns false.
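Here is a small sketch of that getSpecified() behavior, run against the same DTD. It assumes a modern JDK and its bundled JAXP DOM parser rather than the setup used in this article, and the SpecifiedDemo class and fnameOf helper are invented for the illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class SpecifiedDemo {
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (#PCDATA)>\n"
        + "<!ATTLIST docelement\n"
        + "  lname CDATA #REQUIRED\n"
        + "  fname CDATA \"XMLBrain\"\n"
        + ">\n"
        + "]>\n";

    /** Parses PROLOG + startTag + end tag, and returns the root's fname attribute node. */
    static Attr fnameOf(String startTag) {
        try {
            String xml = PROLOG + startTag + "</docelement>";
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return root.getAttributeNode("fname");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Attr defaulted = fnameOf("<docelement lname=\"Litt\">");
        Attr overridden = fnameOf("<docelement lname=\"Litt\" fname=\"Steve\">");
        System.out.println(defaulted.getValue() + " specified=" + defaulted.getSpecified());
        System.out.println(overridden.getValue() + " specified=" + overridden.getSpecified());
    }
}
```

The first line printed should show XMLBrain with specified=false, and the second Steve with specified=true.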
Finally, let's explore attributes that can take only certain values.
We do that by placing the values between pipe symbols (|) and enclosed
in parentheses, as shown below:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED fname CDATA "XMLBrain" employee_type (employee|contractor|partner) "partner" > ]> <docelement lname="Litt" fname="Steve"> </docelement> |
In the preceding, notice that after the list of alternatives, there's a default value matching one of the alternatives. That's necessary. The absence of a default value produces a fatal error, while a default not listed in the alternatives yields a nonfatal error. If you don't like defaulting, you have two choices:
Running the preceding xml code prints the default:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt",fname="Steve",employee_type="partner") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
As you can see, the default was printed. Let's rewrite the XML to use
a disallowed value for employee_type, and see what happens:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED fname CDATA "XMLBrain" employee_type (employee|contractor|partner) "partner" > ]> <docelement lname="Litt" fname="Steve" employee_type="volunteer"> </docelement> |
Running the preceding XML produces the following error:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() SAX nonfatal error: Attribute "employee_type" with value "volunteer" must have a value from the list "(employee|contractor|partner)". ELEMENT : docelement : novalue : (lname="Litt",fname="Steve",employee_type="volunteer") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Of course, using a legitimate value overrides the default and delivers
the proper attribute value to the application, as follows:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> <!ATTLIST docelement lname CDATA #REQUIRED fname CDATA "XMLBrain" employee_type (employee|contractor|partner) "partner" > ]> <docelement lname="Litt" fname="Steve" employee_type="contractor"> </docelement> |
Running the preceding XML produces the following valid result:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : (lname="Litt",fname="Steve",employee_type="contractor") endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
There are other types of attributes. There are ID attributes, which specify that the value of the attribute must be unique document wide (is that cool or what). There are IDREF attributes, whose value must match the ID attribute of an element that exists in the document.
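A quick sketch of ID and IDREF in action follows. The staff and person vocabulary is invented purely for this example, and the code assumes a modern JDK with its bundled validating parser, so treat it as an illustration rather than part of the exercises:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class IdDemo {
    // Hypothetical DTD: every person has a unique id and may point at a boss.
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE staff [\n"
        + "<!ELEMENT staff (person+)>\n"
        + "<!ELEMENT person EMPTY>\n"
        + "<!ATTLIST person\n"
        + "  id   ID    #REQUIRED\n"
        + "  boss IDREF #IMPLIED\n"
        + ">\n"
        + "]>\n";

    /** Validates PROLOG + body and returns the nonfatal error messages. */
    static List<String> validate(String body) {
        final List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(PROLOG + body)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // p2's boss refers to an id that exists: no errors expected.
        System.out.println(validate(
            "<staff><person id=\"p1\"/><person id=\"p2\" boss=\"p1\"/></staff>"));
        // The same id used twice.
        System.out.println(validate(
            "<staff><person id=\"p1\"/><person id=\"p1\"/></staff>"));
        // A boss that refers to an id nobody has.
        System.out.println(validate(
            "<staff><person id=\"p1\" boss=\"p9\"/></staff>"));
    }
}
```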
Attribute lists are powerful ways of specifying which attributes are legal and which aren't. They take the following form:
<!ATTLIST elementname
  attributename DATATYPE modifier
>
The data type is usually CDATA, but it can be ID, IDREF, and certain other types. The data type can also be replaced by a series of alternatives separated by pipe symbols (|) and surrounded by parentheses. The modifier can be either a validation specifier like #REQUIRED or #IMPLIED, or a default value. In one case, #FIXED, it is a validation specifier followed by a default.
This discussion of alternatives is by no means exhaustive. Its purpose is only to get you to the point where you can experiment and research the building of <!ATTLIST> constructions.
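As one starting point for that experimenting, here is a sketch of #FIXED, the specifier that is followed by a value. The version attribute is hypothetical (it does not appear in the exercises above), and the code assumes a modern JDK with its bundled validating parser:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class FixedDemo {
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (#PCDATA)>\n"
        + "<!ATTLIST docelement\n"
        + "  lname   CDATA #REQUIRED\n"
        + "  version CDATA #FIXED \"1.0\"\n"   // hypothetical attribute
        + ">\n"
        + "]>\n";

    /** Validates PROLOG + startTag + end tag and returns the nonfatal error messages. */
    static List<String> validate(String startTag) {
        final List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(PROLOG + startTag + "</docelement>")),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Leaving version out is fine: the fixed value is supplied as a default.
        System.out.println(validate("<docelement lname=\"Litt\">"));
        // Any value other than the fixed one is a validity error.
        System.out.println(validate("<docelement lname=\"Litt\" version=\"2.0\">"));
    }
}
```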
Let's start with this simple XML file:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement EMPTY> ]> <docelement></docelement> |
It succeeds when run, as can be seen in the following output:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
But now place text inside, as follows:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement EMPTY> ]> <docelement>This is some text</docelement> |
Now run it to see the results:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () SAX nonfatal error: The content of element type "docelement" must match "EMPTY" .endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
In the preceding example, the text node in the <docelement> element doesn't match the EMPTY declaration. Interestingly enough, the previous code would have produced the same error without text if the ending </docelement> tag had been on its own line. That's because the new line would have been considered a text node.
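The newline claim is easy to check with a sketch like the following, which validates the same EMPTY declaration against three different contents. As with the other sketches, it assumes a modern JDK and its bundled validating parser, and the names EmptyDemo and validate are made up:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class EmptyDemo {
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement EMPTY>\n"
        + "]>\n";

    /** Validates a docelement holding the given content; returns nonfatal error messages. */
    static List<String> validate(String content) {
        final List<String> errors = new ArrayList<>();
        String xml = PROLOG + "<docelement>" + content + "</docelement>";
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("nothing inside: " + validate(""));
        System.out.println("newline inside: " + validate("\n"));
        System.out.println("text inside:    " + validate("This is some text"));
    }
}
```

Only the first case should come back with an empty error list.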
This problem is simple enough to fix by changing the type from EMPTY
to (#PCDATA):
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (#PCDATA)> ]> <docelement>This is some text</docelement> |
The preceding XML produces the following output:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () TEXT : noname : This is some text endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
Now let's say that we want docelement to contain a single instance
of subdocument, and nothing else. Further, subdocument may contain
text:
<?xml version="1.0"?> <!DOCTYPE docelement [ <!ELEMENT docelement (subdocument)> <!ELEMENT subdocument (#PCDATA)> ]> <docelement> <subdocument>Subdocument's text</subdocument> </docelement> |
The preceding says docelement can contain one instance of subdocument,
no more and no less. And no text. Will this succeed? Remember the whitespace
problem with our DOM walker ? Remember we needed to make a whitespace killer
object to delete all the whitespace caused by formatting blanks and newlines?
Let's see what happens when we run our program against it:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : docelement : novalue : () ignorableWhitespace(string= ) ELEMENT : subdocument : novalue : () TEXT : noname : Subdocument's text endElement(namespaceURI=,localName=subdocument,qName=subdocument) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=docelement,qName=docelement) endDocument() Done! $ |
As you can see in the preceding, there are no error messages or warnings. The two elements are printed at their proper levels, and the text within subdocument is printed correctly. And if you look closely, you'll see that the ContentHandler's ignorableWhitespace() callback has fired twice -- once for the newline after <docelement> and once for the newline before </docelement>. When the DTD says text cannot appear in an element, and pure whitespace appears in the element, that whitespace is assumed to be ignorable. Cool!
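To watch the two kinds of whitespace get sorted out, a handler only needs to count one callback and collect the other. The following is a sketch under the usual assumptions (modern JDK, bundled validating parser, invented class names WsHandler and WhitespaceDemo):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Counts ignorableWhitespace() calls and collects everything sent to characters(). */
class WsHandler extends DefaultHandler {
    int ignorableCalls = 0;
    final StringBuilder text = new StringBuilder();

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorableCalls++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class WhitespaceDemo {
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE docelement [\n"
        + "<!ELEMENT docelement (subdocument)>\n"
        + "<!ELEMENT subdocument (#PCDATA)>\n"
        + "]>\n"
        + "<docelement>\n"
        + "<subdocument>Subdocument's text</subdocument>\n"
        + "</docelement>\n";

    /** Parses with validation on and returns the handler holding the tallies. */
    static WsHandler parse(String xml) {
        WsHandler handler = new WsHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        WsHandler h = parse(XML);
        System.out.println("ignorableWhitespace calls: " + h.ignorableCalls);
        System.out.println("characters text: [" + h.text + "]");
    }
}
```

The formatting newlines go to ignorableWhitespace(), so the text collected by characters() is just the contents of subdocument, with no stray newlines to strip.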
Let's try a real example now. Remember the invoices.xml file?
Let's make a DTD for it, except that we'll consider it mandatory to put
the <customer> element before any <lineitem> elements:
<?xml version="1.0"?>
<!DOCTYPE invoices [
<!ELEMENT invoices (invoice+)>
<!ELEMENT invoice (customer,lineitem+)>
<!ELEMENT customer (#PCDATA)>
<!ELEMENT lineitem EMPTY>
<!ATTLIST lineitem
  item CDATA #REQUIRED
  price CDATA #REQUIRED
  quantity CDATA #REQUIRED
>
]>
<invoices>
  <invoice>
    <customer>Garcia, Maria</customer>
    <lineitem item="widget" price="21.41" quantity="4"></lineitem>
    <lineitem item="mousetrap" price="2.11" quantity="14"></lineitem>
    <lineitem item="wrench" price="8.88" quantity="3"></lineitem>
  </invoice>
  <invoice>
    <customer>Smith, John</customer>
    <lineitem item="mouse" price="7.41" quantity="84"></lineitem>
    <lineitem item="mousepad" price="0.91" quantity="184"></lineitem>
  </invoice>
</invoices>
Let's recite the DTD in English, starting with the document element:
$ ./ss dtdtest.xml parsing dtdtest.xml... setDocumentLocator() startDocument() ELEMENT : invoices : novalue : () ignorableWhitespace(string= ) ELEMENT : invoice : novalue : () ignorableWhitespace(string= ) ELEMENT : customer : novalue : () TEXT : noname : Garcia, Maria endElement(namespaceURI=,localName=customer,qName=customer) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="widget",price="21.41",quantity="4") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="mousetrap",price="2.11",quantity="14") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="wrench",price="8.88",quantity="3") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=invoice,qName=invoice) ignorableWhitespace(string= ) ELEMENT : invoice : novalue : () ignorableWhitespace(string= ) ELEMENT : customer : novalue : () TEXT : noname : Smith, John endElement(namespaceURI=,localName=customer,qName=customer) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="mouse",price="7.41",quantity="84") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) ELEMENT : lineitem : novalue : (item="mousepad",price="0.91",quantity="184") endElement(namespaceURI=,localName=lineitem,qName=lineitem) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=invoice,qName=invoice) ignorableWhitespace(string= ) endElement(namespaceURI=,localName=invoices,qName=invoices) endDocument() Done! $ |
<!ELEMENT invoice (customer,lineitem+)>
means one <customer> followed by one or more lineitems. Verify this by placing a customer record between <lineitem> elements, and confirming that the parser reports a nonfatal error. If you really wanted the ability to randomly place the <customer> among the <lineitem> elements, it can be done, but it's beyond the scope of this article. And in this case, it wouldn't be good program design to do so.
Remember, commas between the subelement types mean first a single <customer> element, followed by one or more <lineitem> elements. A specific order is enforced.
Let's discuss the plus sign you see in the preceding invoice element
declaration. There are characters you can append to an element name to
define how many times it occurs. Without any character appended, the element
happens exactly once, as in customer in the preceding definition of an
<invoice> element. Here are the others:
?             :  0 or 1
*             :  0 or more
+             :  1 or more
no character  :  exactly 1
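The ordering and counting rules of (customer,lineitem+) can be tried out with a sketch like this one, which validates a single invoice body against the DTD above. It assumes a modern JDK and its bundled validating parser, and OrderDemo and its helper are invented names:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class OrderDemo {
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE invoices [\n"
        + "<!ELEMENT invoices (invoice+)>\n"
        + "<!ELEMENT invoice (customer,lineitem+)>\n"
        + "<!ELEMENT customer (#PCDATA)>\n"
        + "<!ELEMENT lineitem EMPTY>\n"
        + "<!ATTLIST lineitem\n"
        + "  item CDATA #REQUIRED\n"
        + "  price CDATA #REQUIRED\n"
        + "  quantity CDATA #REQUIRED\n"
        + ">\n"
        + "]>\n";
    static final String CUSTOMER = "<customer>Garcia, Maria</customer>";
    static final String ITEM =
        "<lineitem item=\"widget\" price=\"21.41\" quantity=\"4\"></lineitem>";

    /** Wraps the body in one invoice, validates it, and returns nonfatal error messages. */
    static List<String> validate(String invoiceBody) {
        final List<String> errors = new ArrayList<>();
        String xml = PROLOG + "<invoices><invoice>" + invoiceBody + "</invoice></invoices>";
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("customer first:   " + validate(CUSTOMER + ITEM + ITEM));
        System.out.println("customer between: " + validate(ITEM + CUSTOMER + ITEM));
        System.out.println("no lineitems:     " + validate(CUSTOMER));
    }
}
```

Only the first arrangement satisfies both the comma (order) and the plus sign (at least one lineitem).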
As a final DTD exercise, let's make an external DTD for this monstrosity,
which is one layer of a Dia file. Save this file as layer.xml:
<layer name="Background" visible="true"> <group> <group> <object type="Standard - Ellipse" version="0" id="O0"> <attribute name="obj_pos"> <point val="7.55,5.85"/> </attribute> <attribute name="obj_bb"> <rectangle val="7.5,5.8;13.3,7.8"/> </attribute> <attribute name="elem_corner"> <point val="7.55,5.85"/> </attribute> <attribute name="elem_width"> <real val="5.7"/> </attribute> <attribute name="elem_height"> <real val="1.9"/> </attribute> </object> <object type="Standard - Box" version="0" id="O1"> <attribute name="obj_pos"> <point val="10.1,7.4"/> </attribute> <attribute name="obj_bb"> <rectangle val="10.05,7.35;12.95,9.3"/> </attribute> <attribute name="elem_corner"> <point val="10.1,7.4"/> </attribute> <attribute name="elem_width"> <real val="2.8"/> </attribute> <attribute name="elem_height"> <real val="1.85"/> </attribute> <attribute name="show_background"> <boolean val="true"/> </attribute> </object> </group> <object type="Standard - Polygon" version="0" id="O2"> <attribute name="obj_pos"> <point val="7.8,3.45"/> </attribute> <attribute name="obj_bb"> <rectangle val="7.75,3.4;10,5"/> </attribute> <attribute name="poly_points"> <point val="7.8,3.45"/> <point val="8.8,3.45"/> <point val="9.95,4.95"/> </attribute> <attribute name="show_background"> <boolean val="true"/> </attribute> </object> </group> <object type="Standard - Box" version="0" id="O3"> <attribute name="obj_pos"> <point val="14.6,3.7"/> </attribute> <attribute name="obj_bb"> <rectangle val="14.55,3.65;17.15,5.5"/> </attribute> <attribute name="elem_corner"> <point val="14.6,3.7"/> </attribute> <attribute name="elem_width"> <real val="2.5"/> </attribute> <attribute name="elem_height"> <real val="1.75"/> </attribute> <attribute name="show_background"> <boolean val="true"/> </attribute> </object> </layer> |
Here's the English discussion:
<!ATTLIST layer
  name CDATA #REQUIRED
  visible (true|false) #REQUIRED
>
<!ELEMENT layer (group | object)*>
<!ATTLIST object
  type CDATA #REQUIRED
  version CDATA #REQUIRED
  id CDATA #REQUIRED
>
<!ELEMENT object (attribute+)>
<!ATTLIST attribute name CDATA #REQUIRED>
<!ELEMENT attribute (boolean | point+ | real | rectangle)>
<!ATTLIST point val CDATA #REQUIRED>
<!ATTLIST real val CDATA #REQUIRED>
<!ATTLIST rectangle val CDATA #REQUIRED>
<!ATTLIST boolean val (true | false) #REQUIRED>
<!ELEMENT boolean EMPTY>
<!ELEMENT point EMPTY>
<!ELEMENT real EMPTY>
<!ELEMENT rectangle EMPTY>
<!ELEMENT group (group | object)*>
By the way, the VI editor is a wonderful ally when trying to deduce what contains what. Try it.
Finally, place the following two lines at the top of layer.xml:
<?xml version="1.0"?>
<!DOCTYPE layer SYSTEM "layer.dtd">
Save everything and run it. To save bandwidth I won't show you the entire
output, which is voluminous. I'll simply show you the output grepped to
prove there are no errors or warnings:
$ ./ss layer.xml | grep -i error
$ ./ss layer.xml | grep -i warning
$
That's it. You've written a DTD to cover a large chunk of XML output
from a professional program (Dia). If you can do this, you can do anything.
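For experimenting with external DTDs without juggling files, one option is to let an entity resolver hand the parser the DTD from memory. The sketch below does that with a deliberately cut-down stand-in for layer.dtd (just layer and object, which is an assumption made to keep the example short). It relies on a modern JDK's bundled parser, and the class name ExternalDtdDemo is invented:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ExternalDtdDemo {
    // Cut-down stand-in for layer.dtd, kept in a string so no file is needed.
    static final String DTD =
          "<!ELEMENT layer (object)*>\n"
        + "<!ATTLIST layer\n"
        + "  name CDATA #REQUIRED\n"
        + "  visible (true|false) #REQUIRED\n"
        + ">\n"
        + "<!ELEMENT object EMPTY>\n"
        + "<!ATTLIST object id CDATA #REQUIRED>\n";

    /** Validates layerXml against DTD, which is served in place of "layer.dtd". */
    static List<String> validate(String layerXml) {
        final List<String> errors = new ArrayList<>();
        String xml = "<?xml version=\"1.0\"?>\n"
                   + "<!DOCTYPE layer SYSTEM \"layer.dtd\">\n"
                   + layerXml;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public InputSource resolveEntity(String publicId, String systemId) {
                        // The only external entity here is the DTD, so always
                        // answer with the in-memory copy instead of reading disk.
                        return new InputSource(new StringReader(DTD));
                    }
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate(
            "<layer name=\"Background\" visible=\"true\"><object id=\"O0\"/></layer>"));
        System.out.println(validate(
            "<layer name=\"Background\" visible=\"maybe\"><object id=\"O0\"/></layer>"));
    }
}
```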