2.1 SGML

Without SGML, there would not be any XML. Many XML books devote about two sentences out of the entire book to SGML. However, XML and SGML are so similar that it is necessary to look at SGML to understand where XML came from. The Standard Generalized Markup Language began the whole movement toward a structured markup language that is human-readable and self-documenting.

SGML is a standardized variant of its original form, which was just Generalized Markup Language (GML). Its creators were Charles Goldfarb, Edward Mosher and Raymond Lorie (last names beginning with the letters G, M, and L, respectively). Like so many technologies of old, GML was conceived at IBM, in this case for use in law office information systems. In 1969, these three created GML to address a problem with data storage: how do you keep your data consistent on every platform, without loss of formatting? After all, in those days there was not the oligopoly of computer brands there is today; there were many different breeds of computer, and none played nicely with any other. GML approached this issue by discarding arbitrary data structures in favor of a flexible, self-documenting markup language. Eventually, this language grew into SGML and became an ANSI (American National Standards Institute) standard. Later, the International Organization for Standardization adopted SGML as a standard, ISO 8879:1986. You can go to the ISO website and purchase the documentation for this standard for a meager $180.00. Later in this book, when we get to XML, I will talk about free standards: standards that are published and accessible free of charge.

2.2 Structure

The whole point of SGML is for a formatted document to be structured hierarchically, so that portions of data are contained within elements. These elements have no built-in meaning; in SGML you give the element a name, and then you decide in your program what you want to do with that element. The set of all the names and attributes used in an SGML format is known as an SGML vocabulary. For example, let's say there is a man named Fred, who owns a restaurant, Fred's Restaurant. Fred wants to update his menu every week. There are three dishes for sale: Pepperoni Pizza (8.99), Double Cheeseburger (7.50), and Club Sandwich (5.00).

If Fred's prices and specials change often, it makes sense to use a computer program to keep track of the menu and print off new ones with the formatting already applied. (Of course, when we get into XML and styling, we can look at some even more exciting possibilities, such as making the menu appear on the web or creating a point-of-sale system with this data!) Now, with an existing format, you might have special characters for bold, italic, large fonts, and copy and paste the data into that format or write a program for manual entry of data. That is not elegant or efficient. However, if you have a text document that is written in SGML, you can represent the data with elements, like so:

<menu>
 <food>
  <name>Pepperoni Pizza</name>
  <price>8.99</price>
 </food>
 <food>
  <name>Double Cheeseburger</name>
  <price>7.50</price>
 </food>
 <food>
  <name>Club Sandwich</name>
  <price>5.00</price>
 </food>
</menu>

Is this a database you would be willing to update? As you can see, a well-designed SGML document is very self-explanatory. Documentation is not a standard practice in the world of SGML or any of its children, so it is very important to choose obvious element names. In the example above, you can see that the elements have a start tag and an end tag. Both are enclosed in angle brackets <> to distinguish them from the element's contents, the regular character data contained in the element. In SGML the end tag begins with a forward slash character, /, to mark the end of the container. Without the end tag, the element could go on forever. The act of placing an end tag at the end of your element is called closing the tag, or in my book, it is called a good idea. Although SGML and HTML are designed to have exceptions to the rule of end tags, I tend to shy away from them, as XML does not have exceptions like that. In XML, every element has a start tag and an end tag.
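
To make the idea of "a program that prints the menu" concrete, here is a minimal sketch in Python. Python's standard library does not include an SGML parser, so this leans on the fact that the menu above closes every tag, which also makes it readable by the XML parser in xml.etree.ElementTree. The function names read_menu and format_menu and the column widths are my own inventions for illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET

# The menu from the example above, with every end tag present.
MENU = """<menu>
 <food>
  <name>Pepperoni Pizza</name>
  <price>8.99</price>
 </food>
 <food>
  <name>Double Cheeseburger</name>
  <price>7.50</price>
 </food>
 <food>
  <name>Club Sandwich</name>
  <price>5.00</price>
 </food>
</menu>"""


def read_menu(text):
    """Return a list of (name, price) pairs, one per food element."""
    root = ET.fromstring(text)
    return [(food.findtext("name"), food.findtext("price"))
            for food in root.findall("food")]


def format_menu(items):
    """Lay the items out as text lines: name on the left, price on the right."""
    return "\n".join(f"{name:<24}{price:>6}" for name, price in items)


print(format_menu(read_menu(MENU)))
```

Notice that the program never mentions bold, italic, or font sizes. The document only says what each piece of data is; how it looks is decided entirely in format_menu, and could be changed without touching the menu file.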

Just to demonstrate how one might live recklessly without the use of end tags, here is a sample of the same menu being made without end tags, assuming the document has been defined in such a way that the end tags are optional. (I will discuss definitions later.) The root element, menu, must always have an end tag, no matter what. However, if the food element is not defined to have any other food elements nested below it, the parser could assume that once it reaches a new food element, the current one has ended and it may begin the new one. Likewise, if name and price cannot contain themselves or each other, those can be assumed to have ended once a name or price start tag is found. As complicated as all of that explanation is, the change to the code hardly seems worth it:

<menu>
 <food>
  <name>Pepperoni Pizza
  <price>8.99
 <food>
  <name>Double Cheeseburger
  <price>7.50
 <food>
  <name>Club Sandwich
  <price>5.00
</menu>

If you had to write a program to parse this SGML data and produce a menu, which style would you prefer? Would you rather write a program that stops reading character data when the tag is closed, or would you rather read the next tag, then check all the rules in the definition for the nesting of tags, and determine if you should stop reading character data based on all those rules?
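
To give a feel for the second option, here is a rough sketch in Python. It uses html.parser from the standard library as a stand-in for an SGML parser, because it hands over tags one at a time and does not complain about missing end tags. The nesting rules are hard-coded from the assumptions described above (a food cannot contain a food; name and price cannot contain themselves or each other); a real SGML parser would read them from the document's definition. The class name LooseMenuParser is hypothetical.

```python
from html.parser import HTMLParser

# The menu with the optional end tags left out.
LOOSE = """<menu>
 <food>
  <name>Pepperoni Pizza
  <price>8.99
 <food>
  <name>Double Cheeseburger
  <price>7.50
 <food>
  <name>Club Sandwich
  <price>5.00
</menu>"""

# Assumed rule: these elements hold character data and nothing else.
FIELDS = {"name", "price"}


class LooseMenuParser(HTMLParser):
    """Collect food items from a menu whose inner end tags may be omitted."""

    def __init__(self):
        super().__init__()
        self.items = []    # one dict per food, in document order
        self.field = None  # the name or price element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "food":
            # A new food start tag implies the previous food has ended.
            self.items.append({})
            self.field = None
        elif tag in FIELDS and self.items:
            # A new name or price start tag implies the previous one has ended.
            self.field = tag
            self.items[-1][tag] = ""

    def handle_endtag(self, tag):
        # Explicit end tags are still honored when they are present.
        if tag == self.field or tag in ("food", "menu"):
            self.field = None

    def handle_data(self, data):
        if self.field is not None:
            self.items[-1][self.field] += data


def parse_menu(text):
    parser = LooseMenuParser()
    parser.feed(text)
    parser.close()
    # Whitespace before the next tag is layout, not part of the value.
    return [{key: value.strip() for key, value in item.items()}
            for item in parser.items]


print(parse_menu(LOOSE))
```

Every line of handle_starttag is a nesting rule that the program has to know in advance. With end tags in place, none of that knowledge is needed.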

The lesson I hope this teaches you is that end tags are your friend. You must never forget them. There is also the occasional need for a tag which contains no data, but is left empty. An empty tag, according to the intuition of an SGML writer, has no need for an end tag. However, once again, XML requires the end tag even for an empty tag. Since SGML does not specifically prohibit an end tag, you would be doing yourself a favor to include one.

Why would anyone ever use an empty tag? In some cases, information needs to be stored in a document that will never be read in the final production. This makes the most sense in a displayed medium; one who uses XML as a database would probably want all data to be plain character data. However, for Fred's menu, he might want to place a smiling face next to menu items that are a favorite among customers. Rather than resort to a pitiful-looking emoticon, he can add an empty element to flag these items:

<menu>
 <food>
  <name>Pepperoni Pizza</name>
  <price>8.99</price>
  <icon smile="yes"></icon>
 </food>
...

The pizza is now flagged. The element name is the first word in the tag, icon. After the element name and a space can come one or more attributes: invisible data that further defines the element. The attribute named smile has a value of yes. Perhaps Fred's Double Cheeseburger is very spicy, and he needs to designate it with a chili pepper. He can add another attribute to his icon:

...
 <food>
  <name>Double Cheeseburger</name>
  <price>7.50</price>
  <icon chili="yes"></icon>
 </food>
...

Fred could even have both smile="yes" and chili="yes" on his Double Cheeseburger at the same time:

...
  <icon smile="yes" chili="yes"></icon>
...

There is no limit to the number of attributes. As a rule, you should always put double quotation marks around the value. First of all, this makes it easier to keep track of the value. Second, it prevents the parser from becoming confused if your value contains spaces. Third, and most importantly, you are required to do it in XML anyway, so get used to it. The good news is XML has a shorthand for empty tags, so you will not have to keep using the </icon> end tag for long. That syntax would be invalid SGML, though, so be patient.
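
Here is a small sketch of how an application program might read those attributes, again in Python with xml.etree.ElementTree (which works here because the example closes every tag and quotes every value). The function name icon_flags is hypothetical, and treating any value other than yes as "off" is my own assumption about how Fred's program would behave.

```python
import xml.etree.ElementTree as ET

FOOD = """<food>
  <name>Double Cheeseburger</name>
  <price>7.50</price>
  <icon smile="yes" chili="yes"></icon>
</food>"""


def icon_flags(food):
    """Return the set of icon names switched on for one food element."""
    icon = food.find("icon")
    if icon is None:
        return set()  # no icon element at all, so nothing is flagged
    # icon.attrib is a dictionary of attribute name -> attribute value.
    return {name for name, value in icon.attrib.items() if value == "yes"}


print(sorted(icon_flags(ET.fromstring(FOOD))))
```

The attribute values never show up in the element's contents; the program has to ask for them, which is what makes them "invisible" data.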

Fred could have omitted the ="yes" portion of the smile and chili attributes. He could have just left them as smile and chili:

...
  <icon smile chili></icon>
...

This would be valid SGML. SGML allows attributes to be left without values, and instead they are either set or unset depending on whether the attribute is present. These are called minimized attributes. This is another practice I will tell you to shy away from, because it is another thing you cannot do in XML. XML requires every attribute to have a value.
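
If you want to see the difference for yourself, the following Python sketch feeds the minimized form to two parsers from the standard library. Neither is an SGML parser: html.parser is an HTML parser that happens to tolerate the same shorthand, and xml.etree.ElementTree is a strict XML parser. The helper names lenient_attributes and is_well_formed_xml are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MINIMIZED = "<icon smile chili></icon>"
FULL = '<icon smile="yes" chili="yes"></icon>'


class AttributeCollector(HTMLParser):
    """Record the attributes of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value is None when
        # the attribute is written without an ="..." part.
        self.seen.append((tag, dict(attrs)))


def lenient_attributes(text):
    """Parse with the forgiving HTML parser and return what it found."""
    collector = AttributeCollector()
    collector.feed(text)
    collector.close()
    return collector.seen


def is_well_formed_xml(text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(lenient_attributes(MINIMIZED))
print(is_well_formed_xml(MINIMIZED), is_well_formed_xml(FULL))
```

The lenient parser reports both attributes as present with no value, while the XML parser rejects the minimized form outright and accepts the fully written one.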

It is possible to add comments to an SGML document. This comment syntax is compatible with every SGML descendant in this book, including HTML, XML, and all the derivative document types. A comment looks sort of like a tag, but because of the way it is formed, it can contain other tags without them being processed. To begin a comment tag, you use this syntax: <!--. That's an exclamation point and two dashes at the beginning of the tag. To end a comment, you again use two dashes but not another exclamation point: -->. Here is an example of a comment that might be seen in an SGML file:

<menu>
 <food>
  <!-- Pepperoni Pizza is reduced to 6.99 week of April 5th -->
  <name>Pepperoni Pizza</name>
  <price>8.99</price>
  <icon smile="yes"></icon>
 </food>
...

Although, as I noted above, SGML is fairly self-documenting, it is sometimes important to include further documentation in the file. For example, someone adding new items to the menu might not know how to add icons. Fred could write a big manual detailing everything about this system, but for a quick update that would consume too much time. Instead, Fred should insert a comment like this:

<menu>
 <!-- Possible icons are smile="yes" and chili="yes"
      Example: <icon smile="yes" chili="yes">
      Default value for both icons is no, just omit the attribute
      if unwanted.-->
 <food>
...
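
As a quick check that tags inside a comment really are left unprocessed, here is a Python sketch using xml.etree.ElementTree, whose default parser discards comments. The document is a completed version of the fragment above: I have filled in one food item from the earlier examples so that it parses. The function name element_names is hypothetical.

```python
import xml.etree.ElementTree as ET

COMMENTED = """<menu>
 <!-- Possible icons are smile="yes" and chili="yes"
      Example: <icon smile="yes" chili="yes">
      Default value for both icons is no, just omit the attribute
      if unwanted.-->
 <food>
  <name>Pepperoni Pizza</name>
  <price>8.99</price>
  <icon smile="yes"></icon>
 </food>
</menu>"""


def element_names(text):
    """List every element name in document order; comments are skipped."""
    return [element.tag for element in ET.fromstring(text).iter()]


print(element_names(COMMENTED))
```

Only one icon element is reported, the real one inside food. The icon written inside the comment, which does not even have an end tag, is never seen by the program.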

2.3 Hierarchy

By now, you should be noticing something about the way tags are nested. Until XML, there was not nearly as much emphasis on the nesting of elements, but it was always a part of SGML. As I mentioned in 2.2, all elements in a document form a hierarchy. Any element could be defined to have a parent and a child. (Note: these relationships span only one level. A parent's parent is a grandparent, not a parent, and a child's child is a grandchild, not a child.) The root element, the element at the very top of the tree (or bottom, depending on how you look at it), cannot have any parents. Also, the root element cannot have siblings, meaning there can only be one root element and nothing else at the root level in the hierarchy. Other elements could have siblings, either of the same element or other elements.
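
These family relationships can be explored in a program. The sketch below uses Python's xml.etree.ElementTree on a shortened version of Fred's menu. ElementTree elements only know their children, so the sketch builds a child-to-parent lookup first; the names parent_of, siblings, and grandparent are my own.

```python
import xml.etree.ElementTree as ET

MENU = """<menu>
 <food>
  <name>Pepperoni Pizza</name>
  <price>8.99</price>
 </food>
 <food>
  <name>Club Sandwich</name>
  <price>5.00</price>
 </food>
</menu>"""

root = ET.fromstring(MENU)

# Map every child element to its parent. The root never appears as a key,
# because nothing contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}


def siblings(element):
    """Other elements that share this element's parent (none for the root)."""
    parent = parent_of.get(element)
    return [] if parent is None else [e for e in parent if e is not element]


def grandparent(element):
    """The parent's parent, or None if there is no such element."""
    parent = parent_of.get(element)
    return None if parent is None else parent_of.get(parent)


first_name = root.find("food/name")
print(parent_of[first_name].tag, grandparent(first_name).tag)
print([s.tag for s in siblings(first_name)])
```

For the first name element, the parent is food, the grandparent is menu, and the only sibling is price. Asking for the root's parent or siblings comes back empty, just as described above.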

Some elements will be defined to never have any children. For example, why might someone ever nest another element as a child of an icon? The icon element would probably be defined to have no children. Although it may seem very unlikely, perhaps even ridiculous, as the system is expanded it is always possible that the definition for the element could change to allow a child.

As it might turn out, perhaps many years after implementing and expanding this system, Fred decides he would like for the icon to appear in both his menu and his point-of-sale system. His reason for this change is he would like for new employees taking delivery orders to notify the customer of the spicy items before placing the order. The problem is that the program he uses to produce his print menu takes SVG (Scalable Vector Graphics) format, but his point-of-sale system can only display PNG (Portable Network Graphics) images.

By the way, Scalable Vector Graphics is one of the applications of XML! More information will be provided about SVG later on.

To handle this situation, Fred might add the following children to the icon element:

...
 <food>
  <name>Double Cheeseburger</name>
  <price>7.50</price>
  <icon chili="yes">
   <posicon file="chili.png">
   <menuicon file="chili.svg">
  </icon>
 </food>
...

Fred's colleague Angela points out that he should just hard-code the chili images into each respective system, since the picture is the same for every chili. Fred agrees that that would make more sense, but unfortunately, SGML does not have an easy way to handle that—the change would have to be made to the application program. In the XML world, there are two much better ways of handling this situation that will be discussed in this book: Cascading Style Sheets (CSS) and eXtensible Stylesheet Language (XSL). Fred holds off on the icons and starts evaluating the possibility of changing his system over to XML.

Meanwhile, Fred and Angela acquire two other restaurants, and all three have different menus. Fred would like to keep all of his menus in one SGML file. How does he do this? He simply changes the root element menu so its child is not the food element, but instead a new restaurant element.

<menu>
 <restaurant name="Fred's Restaurant">
  <food>
   <name>Pepperoni Pizza</name>
   <price>8.99</price>
  </food>
  ...
 </restaurant>
 <restaurant name="Fred's China Town">
  <food>
   <name>Lunch Buffet</name>
   <price>5.99</price>
  </food>
  ...
 </restaurant>
 <restaurant name="Fred's Little Italy">
  <food>
   <name>Lasagna</name>
   <price>7.99</price>
  </food>
  ...
 </restaurant>
</menu>

As you can see, the food elements are now the children of restaurants. This ties each food item to the menu of the restaurant that contains it. By doing it this way, Fred can take delivery orders for all three restaurants using one point-of-sale system accessing one SGML file. If he wanted to do so, he could even write a program to increase the prices of all the menu items at all his restaurants in one sweep. In many cases, it is ideal to have one document contain information spanning multiple entities, as SGML and XML processing can in some cases be faster than file system processing.
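
The price increase "in one sweep" could look something like this Python sketch. It uses xml.etree.ElementTree (the file closes every tag, so an XML parser can read it) and the decimal module so that money is not rounded by floating point. Only the items actually shown above are included. The function name raise_prices, the percentage interface, and rounding half-up to the nearest cent are all my own assumptions.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, ROUND_HALF_UP

MENUS = """<menu>
 <restaurant name="Fred's Restaurant">
  <food>
   <name>Pepperoni Pizza</name>
   <price>8.99</price>
  </food>
 </restaurant>
 <restaurant name="Fred's China Town">
  <food>
   <name>Lunch Buffet</name>
   <price>5.99</price>
  </food>
 </restaurant>
 <restaurant name="Fred's Little Italy">
  <food>
   <name>Lasagna</name>
   <price>7.99</price>
  </food>
 </restaurant>
</menu>"""

CENT = Decimal("0.01")


def raise_prices(root, percent):
    """Raise every price in the document by the given percentage, in place."""
    factor = 1 + Decimal(str(percent)) / 100
    # iter() visits price elements at any depth, so every restaurant is covered.
    for price in root.iter("price"):
        raised = Decimal(price.text) * factor
        price.text = str(raised.quantize(CENT, rounding=ROUND_HALF_UP))


root = ET.fromstring(MENUS)
raise_prices(root, 10)
print(ET.tostring(root, encoding="unicode"))
```

Because the program looks for price elements wherever they occur, it does not care how many restaurants there are or what they are called.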

When designing a document type in SGML or XML, it is important to think about the relationship between the data items when nesting them. Do not nest one element as a child of another just because it looks nice. For an element to have children, you imply that those child elements could not exist without the parent. For example, the name and price of a food could not exist without that food existing. However, this is not always a valid test. Could the restaurants exist without a menu? Probably not, but does it make sense for them to be children? If Fred had decided to create separate SGML files for each restaurant, he might have decided to make the root element be restaurant and then have either menu or food elements as children. However, if he did the same thing with the one SGML file, in other words, had restaurant as the root and menu or food elements as children, that would not make sense. SGML only allows one instance of the root element. In that case, you would have one restaurant with three menus, which is not an accurate representation of the data: Fred owns three restaurants, and each has just one menu. A good way to check to see if your hierarchical relationships make sense is to draw a tree of all the elements in your document.
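
If drawing by hand gets tedious, a few lines of Python can produce a rough text version of the tree. This is only a sketch, using xml.etree.ElementTree on a trimmed-down copy of the three-restaurant file; the function name outline is hypothetical, and indentation stands in for the branches of the tree.

```python
import xml.etree.ElementTree as ET

MENUS = """<menu>
 <restaurant name="Fred's Restaurant">
  <food><name>Pepperoni Pizza</name><price>8.99</price></food>
 </restaurant>
 <restaurant name="Fred's China Town">
  <food><name>Lunch Buffet</name><price>5.99</price></food>
 </restaurant>
</menu>"""


def outline(element, depth=0):
    """Return one line per element, indented two spaces per nesting level."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ET.fromstring(MENUS))))
```

Reading the outline from top to bottom, each indented line should make sense as "a part of" the line it sits under. If one does not, that is a hint the nesting deserves a second look.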

One way to interpret the system that is implemented in the example above is to say that each restaurant is a part of the menu—the part for that restaurant. Another more accurate way to describe it is that not all parent-child relationships make perfect sense from a logical standpoint, but it makes sense to code it that way. One alternative would be to change the root element to menugroup, then make menu a child of each restaurant. However, if each restaurant has only one menu, this would be wasteful. You would have a restaurant tag and a menu tag for every restaurant. If there were multiple menus for each restaurant, this would be an ideal solution.

After Fred and Angela debate about this matter all night, they compromise and code menugroup as the root element, and restaurant as the child of menugroup. When the day comes that they create separate lunch and dinner menus for a restaurant, they will add menu elements as children of each restaurant. Until then, they just leave food elements as children of restaurants:

<menugroup>
 <restaurant name="Fred's Restaurant">
  <food>
   <name>Pepperoni Pizza</name>
   <price>8.99</price>
  </food>
  ...
 </restaurant>
 ...
</menugroup>
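
One nice side effect of this compromise is that a program can be written today that keeps working after menu elements are added. The Python sketch below, using xml.etree.ElementTree, searches for food elements at any depth below each restaurant. The second document is purely hypothetical: the menu element's name attribute and the lunch and dinner split are my guess at what Fred and Angela might do later. The function name foods_by_restaurant is also my own.

```python
import xml.etree.ElementTree as ET

# The layout Fred and Angela agreed on: food directly under restaurant.
TODAY = """<menugroup>
 <restaurant name="Fred's Restaurant">
  <food><name>Pepperoni Pizza</name><price>8.99</price></food>
 </restaurant>
</menugroup>"""

# Hypothetical future layout with a menu layer in between.
LATER = """<menugroup>
 <restaurant name="Fred's Restaurant">
  <menu name="lunch">
   <food><name>Club Sandwich</name><price>5.00</price></food>
  </menu>
  <menu name="dinner">
   <food><name>Pepperoni Pizza</name><price>8.99</price></food>
  </menu>
 </restaurant>
</menugroup>"""


def foods_by_restaurant(text):
    """Map each restaurant's name to the names of all the foods inside it."""
    root = ET.fromstring(text)
    return {
        # iter() finds food elements whether or not a menu element sits
        # between the restaurant and the food.
        restaurant.get("name"): [food.findtext("name")
                                 for food in restaurant.iter("food")]
        for restaurant in root.findall("restaurant")
    }


print(foods_by_restaurant(TODAY))
print(foods_by_restaurant(LATER))
```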

It makes the most sense, when designing a system in SGML or XML, to make your root element descriptive of the document, and not any tangible entity in the outside (or inside) world. For example, if you were making an SGML file containing information about a baseball team, you could name your root element team, but this would cause problems just as soon as you decided to cover more than one team. However, if you made your root element teamdoc, a shorthand for team document, you would be encapsulating your SGML file containing a team, or teams, in a bubble that will (probably) never get any bigger. It would not make sense to have two teamdocs, because if it is data that could not possibly be contained in one teamdoc, you would need to create a whole separate SGML file anyway. Under teamdoc you can place any element that belongs in this document: teams, freeagents, commissioners, sponsors, and so on.

2.4 Chapter Review & Exercises

You should now know what an element is. An element has a start tag and end tag. Each tag has angle brackets <> on either side to separate it from text. You should be able to identify the element name, attributes, and values, as well as its contents, parents, and children. You should know that element contents are usually used for printable data, and attributes are used for behind-the-scenes information.

Here are a few exercises you should try to test your understanding of the section:

  1. Design your own SGML system. The application is a list of computer labs at a university. You must make up all of the information; do not use any real information in your assignment. All the information should be fictitious.

    For each computer lab, you need to specify all of the following information: Lab building and room number, phone number, directions to the lab, number of computers, software programs available, printers available (black and white or color?), private or public access, and the hours open for all seven days of the week. You must also add one other element of your own choosing. If any default values are invoked by omitting an element or attribute, you must leave a comment noting the default value that is being used.

    All possible values must be used for each element, so for example, you must have labs where there is black and white, color, both, or neither kinds of printing available, and you must have a 24-hour lab and a lab that is closed on the weekend. Use attributes, element contents, empty tags, etc. appropriately for the way the data is likely to be handled by an application program.

    Remember that the rules for SGML do allow optional end tags and unquoted attribute values, so you may choose to take my advice or not regarding those two things. Also, SGML is not case-sensitive, so you can use capital or lowercase letters for element names and attributes or whatever combination thereof you like.

  2. Pick any element (or two or three) in the document below and identify its element name, start tag, end tag, attributes, attribute values, parents, children, grandparents, grandchildren, siblings, contents, and whether or not it is an empty tag. For hierarchical relationships, you only need to identify element names (multiple times for multiples of the same element name). The document is valid SGML.

    <hotelnetwork>
     <branding code=doz>
      <hotel name="Doz-E Inn Tonville" number=25>
       <location>
        <street>1002 E Hotel St</street>
        <city>Tonville</city>
        <postalcode>48404</postalcode>
        <map id=25 filename="25-doz.png">
       </location>
       <rooms>
        <roomtype single units=25 price=35.99>Queen</roomtype>
        <roomtype king units=10 price=41.99>King</roomtype>
        <roomtype double units=50 price=50.99>Double</roomtype>
       </rooms>
      </hotel> 
     <branding code=nit>
      <hotel name="Nite-time Suites Edge Canyon" number=80>
       <location>
        <street>132 Canyon Rd</street>
        <city>Edge Canyon</city>
        <postalcode>25599</postalcode>
        <map id=80 filename="80-nit.png">
       </location>
       <rooms>
        <roomtype kingsuite units=100 price=95.99>King Suite</roomtype>
        <roomtype doublesuite units=100 price=105.99>Double Suite</roomtype>
        <roomtype king units=50 price=55.99>King</roomtype>
        <roomtype double units=50 price=57.99>Double</roomtype>
       </rooms>
      </hotel>
      <hotel name="Nite-time Suites Fairview" number=81>
       <location>
        <street>8820 Fairview Crossing</street>
        <city>Fairview</city>
        <postalcode>25578</postalcode>
        <map id=81 filename="81-nit.png">
       </location>
       <rooms>
        <roomtype kingsuite units=100 price=95.99>King Suite</roomtype>
        <roomtype doublesuite units=100 price=105.99>Double Suite</roomtype>
        <roomtype king units=50 price=55.99>King</roomtype>
        <roomtype double units=50 price=57.99>Double</roomtype>
       </rooms>
      </hotel>
    </hotelnetwork>
    
  3. Draw a tree representing the hierarchy of the above SGML document.

  4. Although the above SGML document breaks many of my style rules for XML preparation, there are a few other problems with the way elements and attributes are laid out. Find ways to improve this document's structure into a form that makes more sense based on what you learned in this chapter. Remember the rules about printable vs. invisible data and smart hierarchy.