Chapter 6 - XHTML - XML: A Deeper Understanding by John Shirrell

6.1 XHTML

You should be able to guess that XHTML stands for Extensible Hypertext Markup Language, since it is basically the combination of HTML and XML. XHTML documents are well-formed XML documents, and that is basically the only difference between XHTML and HTML. If you remember how to make a well-formed XML document, you will not have any problem converting HTML to XHTML. However, there are a few tricky things that you need to remember when making the conversion. Since XHTML has no new syntax over HTML, and since an XHTML reference is also freely accessible as the XHTML 1.1 Specification, this chapter is dedicated solely to the issues involved in upgrading a website from HTML to XHTML. In addition to RSS and AJAX, the process of modernizing HTML code to be well-formed XHTML is one of the big industries created by XML.

I have had a difficult time accepting XHTML as a standard. I have resisted it for years, ever since it first became a W3C recommendation in 2000. It seemed silly to create a new standard for HTML when the current standard works well, and ten years' worth of the internet will always be HTML documents that need to remain accessible, regardless of how many currently maintained webpages are coded in XHTML. For this reason, we will always have browsers like Internet Explorer, which is descended from Mosaic (by way of Spyglass), the browser that first made the graphical web popular.

This leads one to ask, if the code is the same (except for changes to make the document well-formed), and the results for the end user are the same, why bother to update to XHTML? What does XHTML add to HTML? Well, the answer is in the name: XHTML makes HTML extensible. Do you remember the example from chapter 5 of extending an RSS file's capabilities by adding tags from another namespace, as Apple and LiveJournal did? The same can be done with XHTML documents. You can embed math expressions using MathML, you can embed Scalable Vector Graphics, and you can embed any other format that uses XML. If you run a Google search for XHTML documents with these other XML formats embedded inline, you should come across several testcase files, which are simple XHTML documents that demonstrate that your browser can, or can't, handle the technology. Internet Explorer cannot even handle valid XHTML files to begin with (I will talk about this later in the chapter), but Firefox has XHTML and MathML and SVG all included in its main distribution, so the testcases work flawlessly.

What does this mean for the web as a whole? Well, if you remember the idea of a Web 2.0, there is a driving force behind changing the internet from a document-oriented interface to an application-oriented interface. It is no longer exciting to be able to have tables and bold text and images, so it is time to add new features to the internet. Although it is already possible to add a limited amount of functionality to the internet through browser plug-ins, such as the Macromedia Flash Player, those act as an object affixed to the page (using the object element) which has little connection with the surrounding HTML document. With XML and namespaces, it is possible to mix elements from XHTML with elements from another XML vocabulary, and blend them together however suits your project. You can even mix attributes from one vocabulary with attributes from another by prefixing them properly.

The other advantage that XHTML has over HTML, due to its rigid, well-formed structure, is the ability to programmatically access any element or attribute in the document and modify it dynamically. JavaScript is an interpreted (not compiled) programming language that is embedded in HTML or XHTML documents and processed by the browser on the end user's computer. HTML, which is affectionately called "tag soup" by XHTML proponents, cannot be reliably treated and manipulated as a hierarchy, because very few HTML documents are well-formed enough for the JavaScript engine to have a prayer of making sense of them. The traditional way to modify an HTML document using JavaScript is to call the document.write function, which adds text at the point in the document where the command is found. Once the document has finished loading in the browser, no further writing can be performed without replacing the page. Unfortunately for people who have become used to this command, document.write cannot be used in a document that the browser parses as XHTML.

Instead, XHTML is manipulated using the Document Object Model (DOM), which is (gasp) another W3C recommendation. Basically, by using the Document Object Model, you can manipulate any element at any location in the XHTML file in any way you would ever want. Instead of writing plain text containing tags, you create a new element and fill it with contents. Of course, this can be a hassle if you are writing a block of text with numerous hyperlinks, because each a element must be added individually, but it is the only way to maintain the XHTML document's well-formed structure. The DOM includes many features for handling namespaces, as well. This allows you to add elements under any namespace at any time.
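As a rough illustration of creating an element and filling it with contents, the following JavaScript sketch builds a hyperlink as nodes rather than as a string of markup. The helper name appendLink and its parameters are made up for this example; createElementNS, createTextNode, setAttribute, and appendChild are standard DOM methods. The document object is passed in as a parameter so the function does not rely on a global.

```javascript
// Namespace URI that identifies XHTML elements.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: create an XHTML "a" element and append it to parent.
// doc is a DOM Document (in a browser, the global document object).
function appendLink(doc, parent, href, label) {
  const link = doc.createElementNS(XHTML_NS, "a"); // element in the XHTML namespace
  link.setAttribute("href", href);                  // the value is set as data, not markup
  link.appendChild(doc.createTextNode(label));      // text node, so "<" and "&" need no escaping
  parent.appendChild(link);
  return link;
}
```

In a browser, a call such as appendLink(document, document.body, "next.html", "Next page") would add the link at the end of the page; the file name here is only a placeholder.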

As a side note, it is possible to perform DOM commands against HTML documents. I do not intend to mislead you: it can be done with good old HTML 4.01. However, when a browser is instructed to perform a DOM command against markup that is not well-formed, the browser may perform that command very differently than expected. This problem is compounded by differences in how gracefully each browser handles the situation. Many HTML authors are never aware that their HTML documents are not well-formed, because the browser they test with handles the code gracefully enough to make it seem as though it received well-formed code. To be safe, reserve the use of DOM commands for XHTML documents. DOM programming is very complicated and could fill a 500-page book by itself. I will not cover the DOM, but it is available for reference from the W3C.

Aside from the future benefits of XHTML, there is one present benefit: By being strict about well-formed tags, XHTML requires less processing power and less complicated algorithms to render. This means that XHTML can be displayed by devices with less processing power, such as mobile phones. The catch is that XHTML is so new and uncommon that very few of these devices will use an XML parser to read webpages anyway. By the time XHTML has proliferated enough to justify that, HTML browsers will be made leaner and portable devices will have more power. As an example, the Sony Playstation Portable, a handheld gaming system, has a wonderful browser that is downloaded into the system's firmware as part of an update. The browser has better compatibility both with tag soup HTML pages and with W3C standard webpages than Internet Explorer, and runs on a palm-size system with only a 266MHz processor.

One final benefit of XHTML: being XML, it can be styled using either Cascading Style Sheets (chapter 8) or Extensible Stylesheet Language Transformations (chapter 9). I will explain those as they come along, but the idea of the latter is to be able to take any input XML document and transform it into any desired format, whether XML, HTML, or plain text. One could even have a webpage that can be styled and output as a printer-friendly PDF (Portable Document Format) file! The possibilities are literally endless. HTML has reached the end of the line as far as innovation, so it may be time to put it to rest.

6.2 Switching to XHTML

XHTML is easy to learn, because it is nearly identical to HTML. The only problem is that, if you have been working with HTML for a long time, you may discover that what you have actually been writing was not valid HTML at all. There are also a few elements that have been completely removed. I am using and will discuss XHTML 1.1, which is a very strict implementation that does not include any elements that have a better alternative using other elements or CSS. XHTML 1.1 is built on "modularization," which divides the vocabulary into separate modules and leaves out deprecated elements in the hope of making it leaner and more portable. Although you could simply convert your documents to XHTML 1.0 Transitional, which is the same as HTML 4.01 Transitional except for requiring that your document be well-formed, that seems like a rather trivial step forward in modernizing your code. I originally designed this book's website using HTML 4.01 Transitional, but I decided to convert everything to cutting-edge XHTML 1.1 (strict, which is the only document type available in XHTML 1.1). Since I already had a large amount of content, it was a bit of a hassle. It also helped me discover a few problems in converting from HTML to XHTML that did not appear in my research.

The main problem I discovered was the loss of several attributes which I held dear to my heart, but were phased out in XHTML 1.1. An example of this is the name attribute on the a element. An anchor can be either a source (as in a hyperlink) or a destination. By adding a destination anchor to a point in the middle of a document, you can then reference it in a mid-page link like so:

<a name="middle"></a>

This is a neat trick, and you may have noticed that it works for skipping around sections in the online version of this book. You simply link to #middle, and the browser will skip ahead to the location of that anchor. However, this syntax is not valid in XHTML 1.1. You see, in HTML and in XML, there are classes and there are identifiers. A class can apply to many items. An identifier can only apply to one item, because it identifies it. Now, by giving an element a name, you do not prohibit other elements from having the same name. The name attribute could be the same for two different a elements, and the browser would have to decide which one to select. There is no name attribute for the a element in XHTML 1.1, probably for this reason. Fortunately, version 5 and later browsers treat the id attribute as a link destination in the same way.

<a id="middle"></a>

There is just one problem with this. XHTML also has rules governing the use of identifier names. With the name attribute on an anchor you could make them up however you wanted, but an id value must be a valid XML name, which means it cannot begin with a number. This forced me to change my identifier names, as I was using numbers only.
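A quick way to find identifiers that break this rule is to test them against a pattern. The sketch below is a deliberately simplified check, not the full XML name rule: it assumes ASCII-only identifiers, and the function names isSafeId and prefixNumericId are invented for this example.

```javascript
// Simplified pattern (an assumption of this sketch): an ASCII letter or
// underscore first, then letters, digits, underscores, periods, or hyphens.
// The real XML rule also permits many non-ASCII characters.
const ASCII_ID_PATTERN = /^[A-Za-z_][A-Za-z0-9_.-]*$/;

// True when the value is usable as an id under the simplified pattern.
function isSafeId(value) {
  return ASCII_ID_PATTERN.test(value);
}

// Hypothetical fix-up: put a prefix in front of ids that start with a digit.
function prefixNumericId(value, prefix) {
  return /^[0-9]/.test(value) ? prefix + value : value;
}
```

With this, an identifier such as 42 fails the check, while prefixNumericId("42", "sec") produces sec42, which passes. Remember that any links pointing at the old identifier need the same change.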

Another problem I encountered involved the use of CSS. I found that styles on the body element, particularly background colors and images, would not cover the entire surface of the page. This is because the body is a box within the document, and any space outside that box was not styled (e.g. no background). To work around this problem, I moved all of my background styles to the html element. This too was sufficiently backward compatible for me to be satisfied.

HTML has many minimized attributes, particularly in forms. One popular use of a minimized attribute was to check a checkbox, which is most often done to have an irritating newsletter sign-up option be checked by default:

<input type="checkbox" name="signup" value="yes" checked />

This is not well-formed XML. If you remember the rules, minimized attributes are no longer allowed in XML. To rewrite any such attribute from HTML, just set the value to the attribute name:

<input type="checkbox" name="signup" value="yes" checked="checked" />

This resolves the problem and still works in old browsers. The same process can be done to any minimized attribute from HTML (as long as that attribute still exists in the XHTML 1.1 vocabulary).
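If you have many tags to convert, the expansion can be automated. The following JavaScript sketch rewrites minimized attributes inside a single start tag. It is only an illustration: the function name expandMinimized is made up, the attribute list covers just a few common cases, and the pattern assumes that every other attribute value in the tag is already quoted.

```javascript
// Matches either a quoted attribute value (left untouched) or one of a few
// minimized attribute names that is not followed by an equals sign.
const MINIMIZED_PATTERN =
  /("[^"]*"|'[^']*')|\b(checked|disabled|readonly|selected|multiple)\b(?!\s*=)/gi;

// Expand name into name="name" within one start tag.
function expandMinimized(tag) {
  return tag.replace(MINIMIZED_PATTERN, function (match, quoted, name) {
    if (quoted) {
      return match; // text inside quotes is an attribute value, so keep it
    }
    const lower = name.toLowerCase(); // XHTML attribute names are lowercase
    return lower + '="' + lower + '"';
  });
}
```

Given the minimized checkbox tag shown earlier, this returns the expanded form with checked="checked", and a tag that is already expanded is returned unchanged.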

Another problem arises when embedding JavaScript code or CSS within an XHTML document. In HTML, the contents of the script and style elements are defined as raw character data, so the browser hands them to the script or style engine without looking for markup inside them. This meant that an author could use greater-than and less-than symbols for comparisons, or ampersands as Boolean operators. In XHTML, as with any XML, the parser inspects the document first and builds the element structure. It flags reserved characters like these as syntax errors, because it treats a less-than symbol as the start of a tag and an ampersand as the start of an entity reference. This is because, by default, element contents are treated as Parsed Character Data (abbreviated PCDATA), and the contents are parsed to check for child elements. Although the script and style elements, which are used for JavaScript and CSS, are not defined to have any child elements, the XHTML DTDs still declare their contents as PCDATA. There is, however, a workaround.

XML has a construct to define a block of text as ordinary Character Data (or CDATA) that will not be parsed. To use it, wrap the text in a CDATA section:

<script type="text/javascript">
<![CDATA[
function notParsed() {
...
}
]]>
</script>

The beginning of the CDATA section is marked with the characters <![CDATA[, and the end is marked with ]]>. An XML parser will pass everything in between along as plain character data. However, this poses a problem for old browsers. An old browser treats XHTML as regular HTML and does not parse it as XML, so it hands the CDATA markers to the script engine, which reports them as a syntax error. To prevent this, comment out both parts of the CDATA section tag in your script:

<script type="text/javascript">
/*<![CDATA[*/
function notParsed() {
...
}
/*]]>*/
</script>

This method works for both JavaScript and CSS, since both use this comment syntax. The XML parser does not understand JavaScript-style comments, so it simply treats the first /* and the last */ as character data. The script engine then sees two empty comments, /**/, which it ignores.
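The wrapping step is mechanical, so it can be sketched as a pair of small JavaScript functions. The names needsCdata and wrapInCdata are invented for this example. One detail worth knowing is that a CDATA section cannot contain the sequence ]]>, because that sequence always ends the section.

```javascript
// True when the text contains characters an XML parser would read as markup.
function needsCdata(code) {
  return /[<&]/.test(code);
}

// Wrap script or style text in a commented-out CDATA section.
function wrapInCdata(code) {
  if (code.includes("]]>")) {
    // "]]>" would close the section early, so the code must be rewritten first.
    throw new Error('code contains "]]>" and cannot go in a single CDATA section');
  }
  return "/*<![CDATA[*/\n" + code + "\n/*]]>*/";
}
```

A build script could call needsCdata on each embedded block and wrap only those that need it, leaving the rest untouched.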

There is one other important thing to mention when talking about script blocks, which applies to people who have developed the habit of enclosing their entire JavaScript or CSS blocks in comment tags. This was done back in the early days of scripting and stylesheets, when old browsers that did not recognize the script and style elements would simply dump the entire block of text onto the screen as if it was character data for the parent element. This has remained the trend for many years, even as browsers that did not recognize those elements have long since gone extinct, because it was trivial to insert the extra comment tags to be "better safe than sorry." Unfortunately, when you convert these HTML webpages to XHTML, you will find yourself more sorry than safe, because the XML parser will disregard the comments before the browser has an opportunity to read the code. This will result in the script disappearing from the page. You should consider doing away with the comment tags on scripts anyway. Every browser available today, including browsers for portable devices and television set-top boxes, either knows well enough to disregard the content or is actually capable of understanding a limited amount of JavaScript and CSS. If you want to be completely safe, and avoid both XHTML parsing issues and problems with (very) old browsers, simply save the script or style sheet as an external file and reference it from within the XHTML document.

One last issue I encountered when converting to XHTML was that my webpage was rendered as a naked XML document. Without a namespace, the browser has no way to know that the elements in an XML document are XHTML, and I had not defined one. To resolve this, I added the namespace to the html element, thus making it the default:

<html xmlns="http://www.w3.org/1999/xhtml">

As a quick summary of XHTML conversion issues, here are some things you need to remember:

All elements must have a start tag and end tag. For empty elements, use a self-closing tag, e.g. <br />.
Expand minimized attributes, e.g. attribute="attribute".
All attribute values must be contained in quote marks.
Convert all element and attribute names to lowercase. All XHTML element and attribute names are lowercase, including script event attributes, e.g. onclick rather than onClick.
Include an XML declaration and a DOCTYPE declaration.
Declare the XHTML namespace as the default for the document.
Eliminate use of name attribute on elements (except form objects).
Eliminate use of deprecated elements that have been removed from the XHTML vocabulary.
Do not comment out script or style sections; use CDATA sections instead.
The body element is a box; move background styles to the html element.

There is just one more problem with the conversion from HTML to XHTML. How do you know if you have a well-formed document at the end? Obviously, a good place to start is the W3C Validator. However, the validator does not, by default, tell you if your document is actually being sent to the browser as XHTML. When you test your XHTML webpage, you may very well be testing it as a regular HTML webpage.

6.3 The XHTML MIME Type

There is one thing you need to remember about XHTML documents. The MIME type, or Content Type, of XHTML is application/xhtml+xml. Since XHTML is also XML, you could substitute text/xml, but the XHTML MIME type is more specific and a better choice. You should use a proxy or CGI script to check the headers being sent by your XHTML webpage to see if the MIME type is correct. If your webpage still works in Internet Explorer, the MIME type is being incorrectly sent as text/html. Your XHTML webpage might look like a perfectly fine HTML webpage, and the browser will never know the difference. However, once you try to use an XML feature in your XHTML document, you will discover that it doesn't work, since it is not being loaded as XHTML. You will then fix the MIME type only to discover that you never had valid XHTML at all.

The problem lies in backward compatibility. After all, what good is a webpage if it cannot be viewed in Internet Explorer at all? This leads us to a subject of debate in the XHTML world: Should you report your webpage as XHTML or as HTML? The correct answer is both, and neither.

To allow your webpage to degrade gracefully, you need to use a bit of server-side scripting to try to guess what the browser wants. With an HTTP request, a browser is supposed to tell the server what MIME types it will accept. Internet Explorer does not provide this information in any useful detail; instead, it accepts */* (if you guessed that this is a wildcard to cover any MIME type, you would be correct). On the other hand, the Mozilla Firefox browser includes application/xhtml+xml in its accept list, because it is capable of parsing it as XML as intended. The goal is to send Firefox the real XHTML document, and to lie to Internet Explorer and tell it your XHTML document is an ordinary HTML document.

I accomplished this using the PHP scripting language, which I feel is the simplest way to handle the problem. However, most servers running the Apache web server have configuration files, called .htaccess files, which can contain instructions to the server's URL Rewriting Engine. This method allows you to switch the MIME type of static, non-scripted XHTML files on the fly. Since I already was using PHP, I decided to stick to that method.

The way HTTP works is as follows: First, the browser on the client side sends a request to the server, containing the URL being requested, the name of the browser being used, the version of HTTP being used, the referring page, and the accepted content types. This "Accept" header is the one that will be checked. The stristr function performs a case-insensitive search and returns a value that evaluates as true if the second string is found within the first. To check whether the browser accepts XHTML, the first string is the value of the Accept header, and the second is the MIME type being sought, application/xhtml+xml.

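// Treat a missing Accept header as "does not accept XHTML".
$accept = isset($_SERVER["HTTP_ACCEPT"]) ? $_SERVER["HTTP_ACCEPT"] : "";
// Tell caches that the response depends on the request's Accept header.
header("Vary: Accept");
if (stristr($accept, "application/xhtml+xml")) {
  header("Content-Type: application/xhtml+xml; charset=ISO-8859-1");
}
else {
  header("Content-Type: text/html; charset=ISO-8859-1");
}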

If the Accept header contains the XHTML MIME type, it is assumed that the browser accepts XHTML, and the proper Content-Type header for XHTML is sent to the browser. If the browser does not specifically state that it accepts XHTML, it is assumed that the browser would not be able to accept it, and it is sent the HTML MIME type instead. The document's contents are exactly the same; the only thing being changed is the HTTP header that the browser sees when it receives the XHTML webpage.
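A plain substring search has one blind spot: a browser may list a MIME type with a quality value of zero (q=0), which means it does not want that type. The JavaScript sketch below shows one way to make the same decision while respecting that case. It is not the code used on this book's website; the function name chooseContentType is invented, and the function returns only the MIME type, leaving the charset to the caller.

```javascript
const XHTML_TYPE = "application/xhtml+xml";

// Return the MIME type to send for a given Accept header value.
function chooseContentType(acceptHeader) {
  // A missing header is treated the same as an empty one.
  const entries = (acceptHeader || "").toLowerCase().split(",");
  for (const entry of entries) {
    const parts = entry.split(";").map(function (part) { return part.trim(); });
    if (parts[0] !== XHTML_TYPE) {
      continue; // wildcards such as */* do not count as explicit XHTML support
    }
    // Look for an explicit quality value; when absent it defaults to 1.
    const q = parts.slice(1).find(function (p) { return p.startsWith("q="); });
    const quality = q === undefined ? 1 : parseFloat(q.slice(2));
    if (quality > 0) {
      return XHTML_TYPE;
    }
  }
  return "text/html";
}
```

An Accept value that lists application/xhtml+xml yields the XHTML type, while a bare */* or a missing header falls back to text/html, matching the intent described above.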

I also found that PHP is confused by the XML declaration tag, since both use the same processing instruction syntax (PHP uses <?php ?>, XML uses <?xml ?>). When PHP's short_open_tag setting is enabled, PHP treats any <? as the start of PHP code, even if a different application is indicated at the beginning of the tag. To get around this, I added an echo instruction within the PHP code:

echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n";

I only needed to use a few PHP escape sequences (the same style as C++) for the quote marks and a newline character at the end of the XML declaration tag to prevent a syntax error. If you are presenting XHTML to the whole internet, negotiating the MIME type like this is, in my view, the only acceptable way to do so. Sending XHTML as HTML to an XHTML-aware browser is a waste, and sending XHTML as XHTML to a browser that cannot handle it does not degrade gracefully. There are many solutions to this problem posted on the internet, so it is worth looking for one that works best for your particular server setup. As a worst-case scenario, if you do not have access to run server-side programs, simply upload two different versions of your document, one with the .xhtml extension and one with the .html extension. On most server configurations they will be sent as their respective MIME types automatically. If you are feeling particularly mean, you could also make the HTML version a text-only webpage with a link to download the Firefox browser at the top. This would not be a good idea in a corporate setting, since some customers access the internet from computers that do not allow the installation of new browsers. It would be a good idea for a personal webpage, if you want to make a point with your Internet Explorer viewers in an effectively disruptive way.

6.4 Chapter Review & Exercises

In this chapter, you learned how XHTML differs from HTML, some common problems that arise when converting from HTML to XHTML, and how to use the correct MIME type for XHTML and make it degrade gracefully. You should know how to escape character data and when it is necessary to do so. You should also have a basic understanding of HTTP.

1. Convert your webpage from chapter 3 to valid XHTML 1.1, as verified by the W3C Validator. Write a response containing an explanation of some of the things you needed to change to make your page validate (especially if you failed the first try at the validator). Test your results in Firefox, ensuring that Firefox is parsing the document as XHTML (View Page Info should show Type: application/xhtml+xml). The file extension .xhtml should trigger this.
2. Repeat exercise 1 with another webpage you find on the internet. It may be a simple page, but it must be an HTML webpage and it must not be a text-only page.
3. Add a standalone JavaScript to either of the two webpages you converted to XHTML. This can be a JavaScript downloaded from a free JavaScript exchange site like DynamicDrive.com, but ensure that it is one that ordinarily works in both Internet Explorer and Firefox. Insert it into your XHTML document using the script element and do not use any external files. Also remember, the script that you choose cannot use the document.write function.