4.1 XML

Finally, you have reached the meat and potatoes of this book: XML. Although SGML and HTML had the potential to be very useful, their limitations drove the W3C to produce XML as a W3C Recommendation. A recommendation is a specification that the W3C recommends developers treat as a standard, although the W3C lacks any specific authority to declare one (by contrast with ANSI and ISO). That is why the term recommendation is used. W3C recommendations may not carry the same accreditation as standards from those organizations, but on the web they are very widely accepted and implemented.

One of the reasons these recommendations are so pervasive is that they are completely free. They can be accessed from the W3C website free of charge 24 hours a day, in stark contrast with the SGML ISO standard, which you must purchase for $180. These free standards are compatible with open-source software, such as the browser Mozilla Firefox, which parses XML. Although Mozilla Firefox uses its own public license, many other programs use the GNU General Public License, including operating systems that use the Linux kernel. Licenses of this kind require that the software be open source, and that any modifications or enhancements to the software continue to be open source. Technologies that cost money to obtain are at odds with this philosophy, since proprietary code may not be released together with an open-source program. One example of this is the patented LZW compression algorithm used in the GIF (Graphics Interchange Format) file format. GIF images are a popular format on the internet due to their small file size, but they could not be processed by open-source software unless a separate binary plug-in was loaded. To get around this, the PNG (Portable Network Graphics) file format was developed and published as a W3C Recommendation. PNG files are usually smaller than the equivalent GIF, and the format has more features. The same is true of XML: it is less cumbersome than SGML, and much better geared toward use on the internet.

XML was developed under the W3C in 1996 with a list of ten particular goals for the project. Those goals were as follows:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

These goals, the XML Working Group asserted, were not met by SGML. The group went on to release version 1.0 of the XML Recommendation in 1998, followed by version 1.1 in 2004. Both recommendations are online at w3.org.

Why did I cover SGML and HTML first? Basically, you already know XML. Because XML is a subset of SGML, much of the syntax is the same. The main purpose of XML was to streamline SGML by removing little-used parts of the specification and focusing on the main uses of SGML that would benefit the internet. However, XML is much stricter about syntax: in accordance with the goal of making XML easier to process, a parser rejects any document that breaks the syntax rules, so more error handling is done automatically and you must now follow those rules. As long as you do that, you can create XML vocabularies, or sets of elements and attributes, as freely as you would like.

There are a few main rules that are important to remember when formatting XML. These go in addition to the SGML rules you already know, such as having only one root element and nesting elements within each other properly. You must also observe these new rules, many of which I warned you about in chapter 2:

These rules make it much easier for programs to parse the resulting document, since they do not need as much error-trapping to catch malformed syntax. Ready-made XML processors will catch malformed syntax before the program accesses the data. If an XML document follows these rules, it is said to be well-formed. Well-formed XML is so much easier to process that it can be handled by portable devices that could not handle SGML or HTML. A great example of this is WML (Wireless Markup Language), which is the portable device equivalent of HTML. As PDAs and mobile phones become more powerful, they are beginning to support HTML or at least a version of XHTML. However, many devices use the WML vocabulary because its strict XML syntax is much easier to process.
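As a minimal sketch of what a well-formed document looks like, consider the made-up vocabulary below. The element and attribute names (library, book, cover, and so on) are invented for illustration and do not come from any published specification. The comments point out well-formedness rules of XML itself:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Exactly one root element wraps everything else. -->
<library>
  <!-- Attribute values are always quoted, and every attribute has a value. -->
  <book id="b1" lent="lent">
    <!-- Every start tag has a matching end tag, spelled in the same case. -->
    <title>A Tale of Two Cities</title>
    <!-- An element with no content can be written as a self-closing tag. -->
    <cover src="tale.png" />
  </book>
  <!-- Elements nest properly: the inner element closes before the outer one. -->
  <book id="b2">
    <title>Some <em>Other</em> Book</title>
  </book>
</library>
```

Break any one of these rules (drop a quote, close title as TITLE, or leave cover open) and an XML parser should refuse the whole document rather than guess at what you meant.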

The XML declaration tag comes before the DOCTYPE tag. It is very similar in its design, although this one uses real attribute and value pairs unlike the funky DOCTYPE tag:

<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

The <?xml ?> syntax is that of a processing instruction: it sends a command to a parser somewhere along the line to handle it. It is not rendered by a browser. (Strictly speaking, the XML specification treats the XML declaration as a construct of its own, but it is written just like a processing instruction aimed at the XML parser.) Processing instructions are a feature that came from SGML, and they are also used in the processing of PHP (originally Personal Home Page, now PHP: Hypertext Preprocessor) commands. The PHP language begins parsing commands where it sees a processing instruction formatted as <?php ?>. Contained within the tag are the processing directives, and in the case of the XML declaration tag, there are two that should be present: a version, reflecting the XML version the document uses, and a character encoding. The version is required; the encoding is optional, but it is a good habit to state it. Both need to be quoted, just like you should be doing for all your attribute values now.
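To see the XML declaration next to an ordinary processing instruction, here is a small sketch. The xml-stylesheet instruction is defined by a separate W3C Recommendation for associating style sheets with XML documents. The file name menu.css and the menu and food elements are placeholders made up for this example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="menu.css"?>
<menu>
  <food>Club Sandwich</food>
</menu>
```

The first line must be the very first thing in the file, with nothing before it, not even a blank line or a comment. The second line is aimed at whatever program displays the document; a program that does not recognize the xml-stylesheet target is free to ignore it.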

XML also has a number of tools available to make any XML document more powerful. Anyone wanting to create their own XML vocabulary could expect it to be adopted much more easily than an SGML vocabulary. In the chapters that follow, you will learn about Cascading Style Sheets (for styling), Extensible Stylesheet Language (XSL) and XSL Transformations, and Document Type Definition files for extended validation. You can use tools that are widely available to process your XML documents and convert them to other XML formats, or convert them to HTML, or for that matter any other sequential file format.

4.2 Namespaces

One handy feature of XML is that XML documents can be embedded within other XML documents. For example, if you have an XHTML webpage, and you want to include a Scalable Vector Graphics image, you can just embed the image within the same XHTML document, and have one XML file containing two different formats of XML. However, with this convenience come complications. What if your SVG has a title element inside it? Is this an HTML title element or an SVG title element?

To solve this problem, the W3C created the XML namespace. An XML namespace ties each element name to one unique XML vocabulary. For example, let's say you have this document:

<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
 <head>
  <title>Linked Image</title>
 </head>
 <body>
  <img src="svgimage.svg" />
 </body>
</html>

Note the mandatory self-closing img tag. This method of loading an SVG image is fine; however, as small as the document is, maybe it would be worth the trouble to embed the SVG image within the document. To do so, you could simply copy the root element of the SVG image in place of the img element:

...
  <svg version="1.1" xmlns="http://www.w3.org/2000/svg" ...>
   <title>My SVG Image</title>
   <rect ... />
   ...
  </svg>
...

Does the browser confuse the title element in the SVG code with the title element in the XHTML code? The answer is no. The xmlns attribute declares an XML namespace, which is a string of characters that uniquely identifies an XML vocabulary. The SVG vocabulary is uniquely identified by the URL to the W3C site, which, if opened, says "This is an XML namespace..." The svg element has a namespace applied to it, and that namespace applies to all of its unprefixed child elements as well. This is called namespace defaulting. Therefore, the title element that is a child of the svg element is treated as SVG code.

The namespace does not have to be a URL that leads anywhere, because a parser only compares the string and never opens it. The specification expects a URI, but in practice it could be your full name, or it could be a bunch of letters you get when you pound the keyboard. The problem arises when someone else defines a namespace, and they have the same full name, or they hit the same keys with their fist. Suddenly you have a duplicate namespace, and there is no way to determine which came first or which is correct. By using a URL to a website you control, you can post your own XML specification there and be sure that the URL uniquely identifies your XML code.
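As a sketch, suppose you own the domain example.com (a placeholder here, as are the path and the whole menu vocabulary). You could identify your vocabulary like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<menu xmlns="http://example.com/ns/menu">
  <food>
    <name>Club Sandwich</name>
    <price>5.00</price>
  </food>
</menu>
```

Nothing has to exist at that address for the document to parse. The string only needs to be one that nobody else will pick, and a domain you control gives you that.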

Namespace defaulting becomes a problem when you have a document with 100 SVG images embedded. Although the argument could be made that you should not embed so many SVG images directly in your XML, if you ever did encounter such a situation, you need to know how to handle it. It would be ridiculous to put the namespace URL on every single svg element. Instead of doing that, you can add a namespace prefix to an element to associate it with a namespace. The resulting syntax is known as a qualified name (abbreviated QName), which is the combination of prefix and element name (the element name is also known as the local part). To bind a prefix to a namespace, write the xmlns attribute with a colon followed by the prefix name you want to use:

...
  <svg:svg version="1.1" xmlns:svg="http://www.w3.org/2000/svg" ...>
   <svg:title>My SVG Image</svg:title>
   <svg:rect ... />
   ...
  </svg:svg>
...

Notice how I also have added the prefix, plus a colon, to all the SVG-related tags. These tags are now explicitly bound to that namespace. However, the scope of the namespace ends at the end of the svg element on which it was defined, so in the following example, the second svg would not be bound even though the prefix is there:

BAD EXAMPLE
...
   <svg:svg version="1.1" xmlns:svg="http://www.w3.org/2000/svg" ...>
    <svg:title>My SVG Image</svg:title>
    ...
   </svg:svg>
   <svg:svg version="1.1" ...>
    <svg:title>Error! This title is not in the correct namespace.</svg:title>
    ...
   </svg:svg>
...

If your XML parser is paying attention, it should alert you to an undeclared prefix upon reaching the second svg element. To fix this problem, simply declare the prefix with an xmlns:svg attribute in the html start tag. Not to worry, it will not change the namespace of any of the HTML elements, since they do not have the svg: prefix.

...
  <html xmlns:svg="http://www.w3.org/2000/svg" ...>
...

There is also the possibility that your document might be imported into another XML document. Would its elements then be confused for the parent document's elements? It is completely possible, so to prevent this from happening you should give your document a default namespace. For a vocabulary of your own, just make up a URL under a domain that you control. For a published vocabulary, use the namespace its specification defines. In the XHTML example I gave, you would do this:

...
  <html xmlns="http://www.w3.org/1999/xhtml"
xmlns:svg="http://www.w3.org/2000/svg" ...>
...

The namespace for your document is now, by default, XHTML. However, any elements prefixed with svg: will be treated as SVG. Now there is no excuse for an XML parser to be confused.
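Putting the pieces together, a complete document might look like the sketch below. The sizes, coordinates, and fill color are arbitrary values chosen for illustration. XML 1.0 is declared because it is the version browsers handle most reliably, and the DOCTYPE is left out since the point here is only well-formedness and namespaces:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:svg="http://www.w3.org/2000/svg">
 <head>
  <!-- No prefix: this title is in the default (XHTML) namespace. -->
  <title>Embedded Image</title>
 </head>
 <body>
  <svg:svg version="1.1" width="120" height="80">
   <!-- svg: prefix: this title is in the SVG namespace. -->
   <svg:title>My SVG Image</svg:title>
   <svg:rect x="10" y="10" width="100" height="60" fill="navy" />
  </svg:svg>
 </body>
</html>
```

Both title elements have the same local name, but a namespace-aware parser sees two different names because each is paired with a different namespace.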

Things get tricky when you look at the attributes, however. Namespace declarations apply to element names, not to attribute names. An attribute with no prefix is in no namespace at all: it does not pick up the default namespace, and it does not inherit the namespace of a prefixed element either (as I know you were all thinking until I said that). It simply belongs to the element it appears on, and that element's vocabulary decides what it means.

...
  <html xmlns="http://www.w3.org/1999/xhtml"
xmlns:svg="http://www.w3.org/2000/svg" ...>
...
   <svg:svg version="1.1" ...>
    <svg:title>My SVG Image</svg:title>
    <svg:rect x="1cm" ... />
   </svg:svg>
   <svg:svg version="1.1" ...>
    <svg:title>My Other SVG Image</svg:title>
    ...
   </svg:svg>
...

In the above example, I should first point out that all the elements prefixed with svg: are now associated with the SVG namespace. Good job! The version and x attributes have no prefix, so they are in no namespace, and that is exactly what SVG expects: its specification defines them as plain, unprefixed attributes. You might be tempted to add the svg: prefix to the attributes as well, but that changes their meaning:

BAD EXAMPLE
...
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:svg="http://www.w3.org/2000/svg" ...>
...
   <svg:svg svg:version="1.1" ...>
    <svg:title>My SVG Image</svg:title>
    <svg:rect svg:x="1cm" ... />
   </svg:svg>
...

To a namespace-aware processor, svg:x is a different attribute from x, and the SVG vocabulary does not define it, so the rectangle loses its x coordinate. Put a prefix on an attribute only when its vocabulary defines that attribute in a namespace, as SVG 1.1 does with xlink:href. Otherwise, leave the attributes unprefixed, as in the first example.

You now have a document that is free of ambiguity. Because of this, there is absolutely no excuse to use foolish names to try to be unique. Choose element names like name and address, not QBGCustName and QBGCustAddr. Never forget that one of the goals of XML is for it to be human-legible. You can also still use the same kind of comment tags you used in SGML in your documents where necessary. Also, all the design suggestions from SGML still apply: Make sure your elements' parent-child relationships make sense. Use attributes for behind-the-scenes data, and use character data for visible text. Make your XML so obvious to understand that it becomes second nature to maintain your XML documents.

Also, to close off the chapter, when working in XML, you can check your work in Internet Explorer or Mozilla Firefox. Both include a default XSLT stylesheet that will display your XML document in pretty-print format. As an added benefit, both browsers will check your XML code to ensure that it is well-formed, and if there are any syntax errors, you will be alerted to them. Note, however, that the browser will not catch logic errors in a well-formed document. For example, if your XML vocabulary requires that element a be a child of element b, but you accidentally make element a a sibling of element b, the document is still well-formed. You can test for proper use of your XML vocabulary when DTDs are introduced in chapter 7. Also, your browser may have trouble parsing an XML 1.1 document, so change the version to XML 1.0 if necessary.
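To make the distinction concrete, here is a sketch using the hypothetical elements a and b from the paragraph above, wrapped in a made-up root element. Assume the vocabulary says a must be a child of b. The first b follows that rule and the second does not, yet a browser will accept the whole document because it is well-formed:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <!-- Follows the vocabulary: a is a child of b. -->
  <b>
    <a>nested correctly</a>
  </b>
  <!-- Breaks the vocabulary: a is a sibling of b, but the syntax is still fine. -->
  <b></b>
  <a>misplaced</a>
</root>
```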

4.3 Chapter Review & Exercises

You have learned in this chapter why XML was created, and why it has surpassed SGML in its popularity. You know what you need to do to produce a well-formed document, and how to define namespaces. You also should understand how embedding works. You should know what a vocabulary is, and you should understand the syntax for self-closing tags.

  1. Fred is converting his system from SGML to XML, and has found that whoever last revised the code discovered just how much SGML would let him get away with. The current menu is not even close to being well-formed. Can you fix it without changing any of the new information? The result must be valid XML. Hint: You will need to add a Document Type Declaration and an XML declaration. You may consider this to be a system file.

    <menugroup>
     <restaurant name="Fred's Restaurant" ID="FREDS">
      <menu LUNCH>
       <food>
        <name>Club Sandwich</name>
        <price>5.00</price>
       </food>
       <FOOD>
        <NAME>Turkey Sandwich</NAME>
        <PRICE>4.75</PRICE>
       </FOOD>
       <food>
        <name>Soup du Jour
        <price>2.00
       <food>
        <name>Soup and Half Sandwich
        <price>4.50
      </menu>
      <menu DINNER>
       <food>
        <name>Pepperoni Pizza</name>
        <price>8.99</price>
        <icon smile>
       </food>
       <food>
        <name>Other Toppings</name>
        <price>0.50</price>
       </food>
       <food>
        <name>Double Cheeseburger
        <price>7.50
     </restaurant></food>
    
     <restaurant name="Fred's China Town" ID="CHITO">
      <food>
       <Name>Lunch Buffet</Name>
       <Price>5.99</Price>
      </food>
      <food>
       <name>Spicy Chicken</name>
       <price>5.50</price>
       <icon chili>
      </food>
     </restaurant>
     <restaurant name="Fred's Little Italy" ID=LITTL>
      <food>
       <name>Lasagna</name>
       <price>7.99</price>
      </food>
     </restaurant>
    </menugroup>
    
  2. Update your computer lab system from chapter 2 to well-formed XML. There is no W3C XML validator to check for well-formed XML, but there are numerous tools that can be found through Google search or you can simply test in a web browser. Do not worry about a DOCTYPE tag.

  3. Design an XML vocabulary to keep track of items in a shop's inventory. Do not include quantities on hand or anything of that sort, only product information. You must include the product description, UPC number (this is shown to users and searchable, and is 12 digits long), product ID number (users never see this), price per unit, wholesale price per unit, item shipping weight, and give the item a category. Also include front and rear photographs of the item, both optional. Create some imaginary products (at least seven of them) with varying characteristics and populate an XML document with the data for those items. Do not worry about a DOCTYPE tag.