3.1 HTML

HTML is the best-known application of SGML, and one of the more important markup languages to know today. HTML coding involves a different mode of thinking from SGML, although since many programmers learned HTML before learning SGML or XML, to them it is the more traditional, "old-fashioned" uses of markup that seem different. HTML was invented in the early 1990s by Tim Berners-Lee in order to make webpages which could link to one another and have a limited amount of formatting applied to the text. Text that links to other text in this way is called "hypertext" (a term coined by Ted Nelson in the 1960s), and this gives HTML its name: HyperText Markup Language. If you are reading this book online, you are reading it presented in XHTML, a variant of HTML which will be covered later.

Berners-Lee did not only invent HTML; he also invented HTTP, which stands for HyperText Transfer Protocol, and in effect he invented the World Wide Web. He set up the world's first web server, a NeXTcube system, and affixed a sticker to the front on which he scribbled: "This machine is a server. DO NOT POWER IT DOWN!!" This image is online at Wikipedia (and if you are reading the HTML version of the book, this appears as a hyperlink, which you can click on and be immediately taken to the destination page). HTML was later standardized by two big names in the standardization of internet protocols: The IETF, or Internet Engineering Task Force, published HTML 2.0 as one of its thousands of RFCs or Requests For Comments (RFC 1866, in 1995), and in 1994 Berners-Lee founded the World Wide Web Consortium, or W3C for short, which should by the end of this book be very important to you. The W3C is responsible for HTML versions 3.2 and 4, XML, XHTML, CSS, and pretty much everything else in this book other than SGML.

I will not cover HTML very thoroughly, since there are thousands of books, a great many websites, classes at almost every college and high school, and the W3C standard that can be referenced for further learning. I will only cover enough HTML to facilitate your understanding of SGML a little better. However, after reading this, you should be well equipped to make your very own website, which is a collection of several HTML documents that are posted online.

3.2 Structure

As I stated earlier, HTML is a very different approach from any other use of SGML or, as you will see, XML. Rather than containing a hierarchical structure of data and representing it by relationship, HTML is simply plain text with segments of the text encapsulated in an element to represent visual formatting. This approach is a very different way to look at SGML, but it is perfectly valid. Although it might seem like the order of elements would not inherently matter in SGML, that is not a rule of SGML. Like many other aspects of SGML, whether the order of elements matters depends on the implementation. This will become obvious as we go on, but if the order did not matter for elements in an HTML document, you might have one block of bold text substituted for another. All of the bold keywords in this book would appear in random places, and you might see in the above paragraph, "You should be well equipped to make your very own IETF" instead of website. Obviously in the case of HTML, order does matter, and in fact most web browsers process an HTML document sequentially, rather than looking at the page as a whole. This method allows the browser to begin displaying the page immediately rather than wait for it to download completely.

There are some pitfalls to this approach. Because HTML tags are assigned a distinct visual meaning, it becomes difficult to change the visual appearance of a website when there may be dozens or even hundreds of elements that need to be changed. Cascading Style Sheets (chapter 8) can be applied to HTML documents and instruct the browser to change the visual appearance of certain elements, but an even more maintainable approach is to create a webpage using XML and then process the XML to convert it to HTML when it is accessed by the user. This will be covered in due time, but in order to be able to convert XML to HTML, you must know how to code HTML.

To continue the Fred's Restaurant application, Fred has decided he would like to post a website containing his menu. Fred does not change his menu very often yet, so he will just use HTML and update his website manually when he does. Fred begins, as any astute web designer would do, by drawing up a visual representation of how he would like his site to look:

Fred's Restaurant

Open Monday-Friday 10 AM to 10 PM, Saturday-Sunday 10 AM to 12 Midnight

Menu:
Lunch                            Dinner
Club Sandwich, $5.00             Pepperoni Pizza, $8.99
Turkey Sandwich, $4.75           Other toppings, add $0.50
Soup du Jour, $2.00              Double Cheeseburger, $7.50
Soup and Half Sandwich, $4.50

E-mail Fred's Restaurant

Fred, of course, expects this to be easy, since it is easy to make a document like this in a word processor (except for the e-mail hyperlink, of course). He soon realizes, however, that this is actually a fairly complicated webpage to put together (let's say this is the early 1990s, when there were no WYSIWYG, or What You See Is What You Get, HTML editors). Still, it will benefit Fred greatly to learn HTML, since he may one day decide to integrate his point-of-sale and menu printing systems with the website and have the HTML generated automatically, which cannot be done with WYSIWYG editors.

An HTML document is an SGML document, and as such it must follow SGML conventions. There are a few that I did not mention in chapter 1, but now that you understand the mechanics of SGML, you can learn a small technical detail about it. Every SGML document should have a Document Type Declaration (the DOCTYPE tag), which names the Document Type Definition (DTD) that the document follows, and the same is true for XML. Is it absolutely necessary that you include one? Usually, the answer is no. Most web browsers assume that any webpage is going to be HTML, and that any RSS stream that is referenced will be in RSS format. Also, very few end-user applications actually check the document against the DTD you supply, as that would be time-consuming. Instead, the browser uses its internal rules for handling the document, which is all it would be able to process anyway. But just to be a good sport, Fred is going to include the DOCTYPE tag and avoid a warning from the W3C Validator (which will be discussed later):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

The DOCTYPE tag is not an element, so the rules that apply to elements do not apply to it. The DOCTYPE tag is never closed. There are no attributes, only values whose meaning is defined by their order in the tag. The first value, html, is the name of the root element. In case-sensitive markup languages, like XML, it must be capitalized the same way as the actual root element. Since HTML element names are not case-sensitive, you can mix capitalization and it won't matter. As a side note, all my HTML examples will be in lowercase to be consistent with XHTML and XML conventions. However, I generally prefer uppercase HTML tags to help them stand out from character data when working with regular HTML.

The PUBLIC defines the usage of the markup language you are using. If this were an XML language you designed by yourself for use inside a system, you would use SYSTEM in place of PUBLIC. Any W3C standard is of course going to be PUBLIC. The next item in the tag is a big quoted item that describes the standard being used. This standard definition is a sort of list that is delimited by two forward slashes. The first item in the list is a minus sign. The next item is the organization that created the standard, W3C. If the standard was created by an ISO registered organization, the minus that came before would be a plus instead. Minus means that the organization is not ISO registered, and the W3C is not.
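As a rough sketch of the difference, here are two declarations side by side. Each would appear at the top of its own document; they are shown together only for comparison. The root element name menu and the file name menu.dtd are hypothetical, invented for this illustration.

```html
<!-- Hypothetical private language: SYSTEM, then the location of the DTD file -->
<!DOCTYPE menu SYSTEM "menu.dtd">

<!-- Public W3C standard: PUBLIC, then the quoted identifier, then the DTD URL -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
```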

The next item is the document type in use. The first word is always DTD for any document that uses a DTD file, and all the standards in this book do. After that comes the name of the standard. HTML 4.01 Transitional is a document type that allows for all the old tags we love to use so much, to make text underlined or centered, for example. The HTML 4.01 (also known as Strict) document type was created by the W3C to forbid the use of those tags, because they are deprecated or basically obsolete. Although Cascading Style Sheets are a valid alternative to using an underline element, for a small webpage it can be monstrously inconvenient when the deprecated element for an underline is simply <u>Underline</u>. The u element is much easier, and the Transitional document type allows it to be used. After that comes the EN, which indicates that the DTD itself is written in English. The next item is a quoted URL (Uniform Resource Locator, basically a web address to a resource) to a DTD file containing all of the rules for the document type.

This is an important note about DOCTYPE tags in HTML: Since the meaning of certain tags has changed from past versions of HTML, version 5 and higher browsers test the DOCTYPE tag to choose how to handle the page. Often the way it works is, if no DOCTYPE tag is present, or if a Transitional DOCTYPE tag is present but missing the DTD file URL, the page is rendered by the browser in quirks mode, and all of the new features of HTML and CSS that are seen by the browser developers as a conflict are turned off for compatibility. To turn them back on, give either a Strict DOCTYPE tag (with or without the URL), or a Transitional DOCTYPE with the URL. Your page will then be rendered in standards mode. These sorts of arbitrary ways that browsers look at your HTML are a solid reminder that when publishing an HTML document online, it is important to test in many different browsers to make sure it is displayed the way you intended.
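The rule of thumb above can be sketched with three declarations. The modes noted in the comments simply restate the description in the previous paragraph; actual behavior varies from browser to browser, so treat this as a guide rather than a guarantee. Only one declaration would appear in a real document.

```html
<!-- Strict, with the DTD URL: expected to render in standards mode -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

<!-- Transitional, with the DTD URL: expected to render in standards mode -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

<!-- Transitional, without the DTD URL: expected to render in quirks mode -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
```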

Next comes the root element of the HTML document, which is, simply enough, html:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
</html>

HTML documents have two main parts: The header and the body. Each can only exist once. The header contains information about the document to identify and describe it, including the title and some metadata about the document. The header can also be used to include JavaScript or Cascading Style Sheets, or to link RSS documents. Basically, the header is where anything that can't be seen is placed. The body contains character data and elements that add special formatting to text or insert images.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <title>Fred's Restaurant</title>
 </head>
 <body>
  <center>
   <h1>Fred's Restaurant</h1>
   <br>
   <br>
   Open Monday-Friday 10 AM to 10 PM, Saturday-Sunday 10 AM to 12 Midnight
  </center>
  <h3>Menu:</h3><br><br>

  <u>Lunch</u><br><br>

  Club Sandwich, <b>$5.00</b><br>
  Turkey Sandwich, <b>$4.75</b><br>
  Soup du Jour, <b>$2.00</b><br>
  Soup and Half Sandwich, <b>$4.50</b><br><br>

  <u>Dinner</u><br><br>

  Pepperoni Pizza, <b>$8.99</b><br>
  Other toppings, add <b>$0.50</b><br>
  Double Cheeseburger, <b>$7.50</b><br><br>

  <a href="mailto:freds@restaurant.net">E-mail Fred's Restaurant</a>
 </body>
</html>

The head element contains the header, and the body element contains the body. Let's review the other new tags.

title – Gives the document a title. This appears in search engines and the browser's title bar.
h1 and h3 – Heading text (not to be confused with the header, or head, of the document). The browser usually draws this as big, bold text. The tags range from h1 to h6, h1 being the largest, h6 being the smallest. For quick and dirty webpages, it is convenient to use this element, but the exact size and formatting is left completely to the browser's discretion. In chapter 8 I will show you how to use Cascading Style Sheets to give the browser more specific formatting instructions.
center – Center alignment for text and images. The default, without this tag, would be left alignment. This is a deprecated element; the HTML 4.01 specification treats it as shorthand for <div align="center">. The align attribute also accepts left, right, and justify, giving you a few more alignment options (it is itself deprecated in favor of CSS, but the Transitional document type allows it). I will explain the div element later.
u, b, i, s – These one-letter elements represent Underline, Bold, Italic, and Strikethrough, respectively. I only used two of those, but I'm listing them all because they are so simple.
br – This one is tricky. br represents a line break in the final formatted document. It will send following text to the next line. Line breaks in your code do not display as line breaks in the final document! In HTML, as in any SGML or XML, line breaks are treated as whitespace (in other words, space characters). Don't worry about having multiple spaces appear in your final document if you have a lot of line breaks in your code, because the browser will collapse each run of whitespace into just one space character when the page is rendered.

Also note that br is one of those empty elements I warned you about in chapter 2. I did not close it, however, because the W3C forbids an end tag, and in fact when I tried closing a br element, the browser treated the start and end tags as two separate brs. However, if it makes you uncomfortable to leave a tag unclosed, you may use the XML notation for an empty tag: <br />. By adding a slash at the end of the start tag, it becomes a self-closing tag, which is something I will discuss in chapter 4. It is very important that you include a space before the forward slash, because if you do not, some older browsers will think the element name is br/ instead of br and ignore it.
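To make the whitespace rule concrete, here is a small fragment meant to sit inside a body element. It reuses two items from Fred's menu; the odd spacing is deliberate and only for illustration.

```html
<!-- The line break and extra spaces in this source collapse to single spaces -->
Club     Sandwich,
<b>$5.00</b>
<!-- Only the br element forces a new line in the rendered page -->
<br>
Turkey Sandwich, <b>$4.75</b>
```

A browser would be expected to show this as two lines: the club sandwich with its price on the first, and the turkey sandwich with its price on the second.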

As a side note, many HTML authors have developed a bad habit of using the p or paragraph element as a double line break. While the p element does leave a gap much like two line breaks, it is also a block-level container, which means it is meant to enclose a paragraph and be closed with an end tag. The W3C specification technically leaves the end tag optional, but it discourages this use of p as well. I will explain the proper use of the p element shortly. That's two block-level containers that I owe you an explanation about, p and div. You will note that in Fred's example, a double line break is formed using two br elements.

a – I've saved the best for last. The a element is used to format hyperlinks, which are one of HTML's key features. The letter a stands for anchor, which is a rather confusing mnemonic for a hyperlink. This is the only element in this document that has an attribute, because an anchor without an attribute would be nothing. The href attribute is a hypertext reference, which can contain any URL. In this case it is set to "mailto:freds@restaurant.net", which is a type of URL. The mailto: is a scheme (not a protocol, since there is no such protocol as mailto), which directs your web browser to open your e-mail program and start a new e-mail to the e-mail address that follows. A scheme is the part at the beginning of a URL, followed by a colon. If you type an address into the address bar without a scheme, most web browsers will assume you mean http (inside an href, a URL without a scheme is instead treated as relative to the current document). A protocol is a standardized method of transmitting data over the internet. Protocols are a kind of scheme: for example, http and ftp (File Transfer Protocol) are schemes and they are also protocols, but mailto is not a protocol because it does not involve any network communication. It is an instruction for your web browser to follow. The href attribute could instead contain a link to another website beginning with http://. The character data contained in the a element appears to the user as underlined text that, when clicked, will take the user to the resource referenced by the href attribute.
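As a short sketch, here are two anchors that differ only in the scheme of their href value. The first uses the http scheme to link to another website (the W3C address is used purely as an example destination); the second is Fred's mailto link.

```html
<!-- http is both a scheme and a protocol: the browser fetches the page -->
<a href="http://www.w3.org/">World Wide Web Consortium</a>
<br>
<!-- mailto is only a scheme: the browser hands off to the e-mail program -->
<a href="mailto:freds@restaurant.net">E-mail Fred's Restaurant</a>
```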

Fred has not yet completed his webpage. Currently his page is very drab and disorganized because all of the menu items are flush with the left side of the page:

Fred's Restaurant

Open Monday-Friday 10 AM to 10 PM, Saturday-Sunday 10 AM to 12 Midnight

To organize the menu, Fred places it in a table, as in the final source code at the end of this section. How was this accomplished? First, there is the table element, which contains the entire table. Within every table is a set of table rows, represented by the tr element, and within every table row is a set of table data cells, represented by the td element. The reason it is td and not simply tc is that there are also th, or table header, cells. Table header cells are better used in traditional spreadsheet-style tables, and they are usually styled differently by the browser (commonly bold text). The W3C directs HTML developers to just use td in the absence of headers. This is one of the few good examples of nesting in HTML.
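For comparison, here is a minimal sketch of a spreadsheet-style table that does use header cells. It is not Fred's layout; the Item and Price headings are invented for this illustration. Each tr element is one row, and the th and td elements are the cells within that row.

```html
<table border="1">
 <tr>
  <!-- Header cells: usually drawn in bold by the browser -->
  <th>Item</th>
  <th>Price</th>
 </tr>
 <tr>
  <!-- Ordinary data cells -->
  <td>Club Sandwich</td>
  <td>$5.00</td>
 </tr>
 <tr>
  <td>Turkey Sandwich</td>
  <td>$4.75</td>
 </tr>
</table>
```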

Table cells may contain column span or row span attributes, which are colspan and rowspan, respectively. Just in case you are not familiar with spreadsheet terminology, columns are vertical and rows are horizontal. To remember, think of columns in a fancy courthouse holding up the ceiling that go from top to bottom, and think of rows of crops in a field that go from side to side. The top cell in Fred's table occupies two columns, so it has a column span of 2, coded as colspan="2".
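Since Fred's table only uses colspan, here is a hedged sketch that shows both attributes together. The arrangement is made up for illustration and is not part of Fred's site: the Lunch cell spans two rows, and the last cell spans both columns.

```html
<table border="1">
 <tr>
  <!-- This cell is two rows tall, so the next row supplies only one cell -->
  <td rowspan="2">Lunch</td>
  <td>Club Sandwich, $5.00</td>
 </tr>
 <tr>
  <td>Turkey Sandwich, $4.75</td>
 </tr>
 <tr>
  <!-- This cell is two columns wide, so it fills the whole row -->
  <td colspan="2">Soup du Jour, $2.00</td>
 </tr>
</table>
```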

Finally, Fred really wants his e-mail link to be right justified. This is where a block-level container is used. A block-level container basically puts the contents into a box, stopping it from flowing with the rest of the document. You can move the box around, you can draw borders on it, you can align the text inside it, you can change the style of text inside it, and many other things. The only thing you cannot do with a block-level container is make it flow, since that is the opposite of the definition of a block-level container (HTML has an inline container, the span element).

Here I am keeping my promise to explain p and div. The p element is a block-level container that contains a paragraph of text, and the div element is a block-level container that contains anything else. They behave in much the same way (although a p element may contain only text and inline elements, not other blocks), but it is easier to keep organized if you use p for paragraphs only. To move Fred's e-mail to the right side of the page, it is placed in a block-level container and that container is then right-aligned:

...
  <div align="right">
   <a href="mailto:freds@restaurant.net">E-mail Fred's Restaurant</a>
  </div>
 </body>
</html>
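The difference between block-level and inline containers can be sketched with a fragment like the following, meant to sit inside a body element. The sentences are placeholders written for this illustration.

```html
<!-- p is a block: it starts on its own line and is set apart from other text -->
<p>
 Fred's Restaurant is open seven days a week.
 <!-- span is inline: it flows along with the sentence around it -->
 <span>Reservations are recommended</span> on weekends.
</p>
<!-- div is a block that can hold anything; here its contents are right-aligned -->
<div align="right">
 Last updated by Fred
</div>
```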

Fred's website now looks as he initially planned, but the header is still very boring. Fred could draw up his own logo and insert it in place of the header text. To do this, he would upload the image to his web server in the same directory as his HTML document. He would then place a relative URL, which is a URL of a document in relation to the current document, into an img element:

...
  <center>
   <img src="filename.jpg" alt="Fred's Restaurant">
   <br>
   <br>
   Open Monday-Friday 10 AM to 10 PM, Saturday-Sunday 10 AM to 12 Midnight
  </center>
...

The img element has two attributes that are required. The first, src, is the source of the image given as a URL. Why isn't src used for hyperlinks? Because a hyperlink isn't a source; it is a reference to a destination, a hypertext reference, which is shortened to href. Do not mix the two up. If you are a C++ programmer, consider the difference between pointers and includes. Later in the book, we will be using the link element, which also uses href. This may seem confusing, since the link element appears to be more similar to an include than a pointer. link is used to open an external resource, such as a CSS file, an RSS feed, or something else like that to enhance the document. However, that external resource is not pulled into the document; it stays out in that external file where it existed from the beginning. The browser goes out to look at it and comes back to the HTML document empty-handed.
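Seen side by side, the three elements look like this. This is only a sketch: the stylesheet name style.css is hypothetical, and the W3C address is used purely as an example destination.

```html
<head>
 <title>Fred's Restaurant</title>
 <!-- link uses href: the stylesheet is referenced, not pulled into the document -->
 <link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
 <!-- img uses src: the image is the source of what gets displayed here -->
 <img src="filename.jpg" alt="Fred's Restaurant">
 <br>
 <!-- a uses href: the URL is a destination the user may choose to visit -->
 <a href="http://www.w3.org/">World Wide Web Consortium</a>
</body>
```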

The second attribute for the img element, alt, specifies alternate text to display in case the image does not load. This is the case for screen readers for the blind, which do not load images. This is also the case for search engines. Neither of those can understand images, so you must duplicate any text that appears in the image in the alt attribute. Also, this is another empty tag, so you may convert this into a self-closing tag if you would prefer. Make sure there is a space between the last attribute and the forward slash. Just like with br, end tags are forbidden by the HTML specification.

The final source code for Fred's site would look like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <title>Fred's Restaurant</title>
 </head>
 <body>
  <center>
   <img src="filename.jpg" alt="Fred's Restaurant">
   <br>
   <br>
   Open Monday-Friday 10 AM to 10 PM, Saturday-Sunday 10 AM to 12 Midnight
  </center>
  <table align="center" border="1">
   <tr>
    <td colspan="2">
     <h3>Menu:</h3>
    </td>
   </tr>
   <tr>
    <td>
     <u>Lunch</u><br><br>
     Club Sandwich, <b>$5.00</b><br>
     Turkey Sandwich, <b>$4.75</b><br>
     Soup du Jour, <b>$2.00</b><br>
     Soup and Half Sandwich, <b>$4.50</b>
    </td>
    <td>
     <u>Dinner</u><br><br>
     Pepperoni Pizza, <b>$8.99</b><br>
     Other toppings, add <b>$0.50</b><br>
     Double Cheeseburger, <b>$7.50</b>
    </td>
   </tr>
  </table>
  <div align="right">
   <a href="mailto:freds@restaurant.net">E-mail Fred's Restaurant</a>
  </div>
 </body>
</html>

You might be wondering, what if I want to make an HTML page to demonstrate HTML? Is it possible to escape characters in HTML documents so they are not processed? The answer is yes, and it is done with a nice feature of SGML called entities. Entities are aliases, introduced by an ampersand and terminated by a semicolon, that are replaced with a document-specified string. Later on you will learn how to make your own entities. The entities you would need to escape HTML characters are &lt; for the less-than sign, &gt; for the greater-than sign, &quot; for the double-quote mark, and &amp; for the ampersand itself. A full list of entities can be found at the Visibone site.
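As a brief sketch, a page that wants to show HTML code to the reader would spell out the special characters with the entities &lt;, &gt;, &quot;, and &amp;. The sentences below are invented for this illustration and are meant to sit inside a body element.

```html
<!-- Displays the tags as text instead of making the words bold -->
To make text bold, write &lt;b&gt;bold text&lt;/b&gt;.
<br>
<!-- The quote marks around the attribute value are escaped as well -->
To span two columns, write &lt;td colspan=&quot;2&quot;&gt;.
<br>
<!-- A literal ampersand is written as an entity too -->
Soup &amp; Half Sandwich
```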

As a final note for this chapter, I know you may be wondering if it is possible to change fonts, colors, widths, heights, and those things. I could tell you the old way to do it in HTML, but I will consciously leave that information out. Those methods are very cumbersome and unpredictable compared with the CSS method, which will be covered in chapter 8. If you really want to use the HTML methods to change the appearance of a document, you can look them up in the HTML specification at the W3C site.

3.3 Chapter Review & Exercises

You should now know what HTML and HTTP are, and what purpose they were designed to serve. You now know who IETF and W3C are, and their roles in the development of HTML. You need to know how to form the header section and body section, and how to make header text, format text in bold/underline/etc., align text using a block, and make tables. You should understand entities, and know the difference between src and href and when to use them.

1. Determine if the schemes provided below are protocols or just schemes. Note: Using Google to find the answer is a bad idea, as many sites erroneously list all of these schemes as protocols. However, using it to research the scheme may help you decide whether it is a protocol or not.
   1. telnet:
   2. view-source:
   3. javascript:
   4. irc:
   5. aim:
   6. nntp:
   7. news:
2. Create a webpage using the information you have learned in chapters 2 and 3. Follow SGML and HTML rules. If you are unsure about something, check the rules at the W3C website. Use all the elements used in the Fred's Restaurant example website at least once. Test your webpage in a web browser, and then use the W3C Validator to check your work. As long as you follow the HTML specification you indicate in your Document Type Declaration, you should be able to pass the validation step.
3. Change the HTML document from step 2 to contain invalid HTML code that causes the page to fail W3C Validation. (Be careful, because HTML is an application of SGML, and many end tags are considered optional!) Write a response explaining why the change was invalid HTML.