Shame and War Revisited

Adding Semantic Markup to HTML

Philip Greenspun
Laboratory for Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Abstract

"HTML represents the worst of two worlds. We could have taken a formatting language and added hypertext anchors so that users had beautifully designed documents on their desktops. We could have developed a powerful document structure language so that browsers could automatically do intelligent things with Web documents. What we have got with HTML is ugly documents without formatting or structural information." I wrote that in August 1994. In the intervening 18 months, style sheets have substantially enhanced HTML's formatting capabilities, but no progress has been made on the structure problem. I propose a class-based semantic markup system compatible with existing browsers and HTTP servers.

Introduction

"Owing to the neglect of our defences and the mishandling of the German problem in the last five years, we seem to be very near the bleak choice between War and Shame. My feeling is that we shall choose Shame, and then have War thrown in a little later, on even more adverse terms than at present."

Winston Churchill in a letter to Lord Moyne, 1938 [Gilbert 1991]

If you asked a naive user what the Web would do for them, they'd probably say "I could ask my computer to find me the cheapest pair of blue jeans being sold on the Internet and 10 seconds later, I'd be staring at a photo of the product and being asked to confirm the purchase. I'd see an announcement for a concert and click a button on my Web browser to add the date to my calendar; the information would get transferred automatically."

We computer scientists know that the Web doesn't actually work this way for naive users. Of course, armed with 20 years of Internet experience and the latest in equipment and software, we computer scientists go out into the Web and... fall into exactly the same morass. When we find a conference announcement, we can't click the mouse and watch entries show up in our electronic calendars. We will have to wait for computers to develop natural language understanding and common sense reasoning. That doesn't seem like such a long way off until one reflects that, given the ability to understand language and reason a bit, the computer could go to college for four years and come back capable of taking over our jobs.

Recently adopted HTML style sheets offer us a glimmer of hope on the formatting front. It may yet be possible to render a novel readably in HTML. However, style sheets can't fix all of HTML's formatting deficiencies and certainly don't accomplish anything on the semantic tagging front.

Fixing the formatting problem; frames are not the answer (but maybe style sheets are)

When a reader connects to a $100,000 Web server via a $10,000/month T3 line, his first thought is likely to be "Wow, this document looks almost as good as it would if it had been hastily printed out from a simple word processor." His second thought is likely to be "Wow, this document looks only almost as good as it would if it had been hastily printed out from a simple word processor."

Increased formatting capabilities are fundamentally beneficial. It is more efficient for one person to spend a few days formatting a document well than for 20 million users to each spend five minutes formatting a document badly. Yet the original Web model was the latter. Users would edit resource files on the Unix machines or dialog boxes on their Macintosh to choose the fonts, sizes, and colors that best suited their hardware and taste. Still, when they were all done, Travels with Samantha and the Bible ended up looking more or less the same.

The Netscape extensions ushered in a new era of professional document design, but it hasn't all been for the best, especially the round introduced with Netscape 2.0. HTML documents may have looked clunky back in 1993 but at least they all worked the same. Users knew that if they saw something black, they should read it. If they saw something gray, that would be the background. If they saw something blue, they should click on it. Unlike CD-ROMs, web sites did not have sui generis navigation tools or colors that took a few minutes to learn. Web sites had user interface stability, the same thing that made the Macintosh's pull-down menus so successful (because the print command was always in the same place, even if it was sometimes grayed-out).

Netscape 1.1 allowed publishers to play with the background, text, link, and visited link colors. Oftentimes, a graphic designer would note that most of the text on a page was hyperlinks and therefore just make all the text black. Alternatively, he or she would choose a funky color for a background and then three more funky colors for text, link, and visited link. Either way, users have no way of knowing what is a hyperlink and what isn't. Oftentimes, designers get bored and change these colors even within a site.

Very creative publishers managed to use the Netscape 1.1 extensions to create Web documents that looked like book or magazine pages. They did this by dropping in thousands of references to transparent GIFs, painful for them but even more painful for the non-Netscape-enhanced user.

Frames, introduced with Netscape 2.0, give the user the coldest plunge into unfamiliar user interface yet. The "Back" button no longer undoes the last mouse click; it exits the site altogether. The space bar no longer scrolls down; the user has to first click the mouse in the frame containing the scroll bar. Screen space, the user's most precious resource, is wasted with ads, navigation "aids" that he has never seen before, and other items extraneous to the requested document.

Thanks to all of these Netscape extensions, the Web abounds with multi-frame, multi-color, multi-interfaced sites. Unfortunately, it still isn't possible to format a novel readably. I'll use The English Patient [Ondaatje 1992] as an example. Although its narrative style is about as unconventional as you'd expect for a Booker Prize winner, it is formatted very typically for a modern novel.

Sections are introduced with a substantial amount of whitespace (3 cm), a large capital letter about twice the height of the normal font, and the first few words in small caps. Paragraphs are typically separated by their first line being indented about three characters. Chronological or thematic breaks are denoted by vertical whitespace between paragraphs, anywhere from one line's worth to a couple of centimeters. If the thematic break has been large, it gets a lot of whitespace and the first line of the next paragraph is not indented. If the thematic break is small, it gets only a line of whitespace and the first line of the next paragraph is indented. So the "author's intent" needs to be expressed with tags like <small-thematic-break>. The "designer's intent" needs to be expressed with equations like small-thematic-break = one line of whitespace.

Style sheets, officially adopted as a standard on March 5, 1996 by most browser makers, make this possible in almost the manner I've described. I asked Hakon W. Lie, one of the authors of the style sheet proposal, for the most tasteful way to format The English Patient. He came back with the following:


<STYLE>
  P { text-indent : 3em }
  P.stb { margin-top: 12pt }
  P.mtb { margin-top: 24pt; text-indent : 0em}
  P.ltb { margin-top: 36pt; text-indent : 0em}
</STYLE>

<P CLASS=stb>Sample of small thematic break
<P>just an ordinary paragraph
<P CLASS=mtb>Sample of medium thematic break
<P CLASS=ltb>Sample of large thematic break

The cascading style sheet proposal that was ultimately successful rejected the idea of new tags because a document marked up with such tags would not have been valid under the HTML document type definition (DTD).

Is the formatting problem solved? I begged for style sheets in my August 1994 paper and now we have them, much better thought-out and more powerful than I envisioned. The author/designer intent split is captured nicely. So what is left to do? Style sheets don't let one publish mathematics, figures with captions, or dozens of other things facilitated by old languages like LaTeX or newer systems like Microsoft Word.

We could just add hyperlinks to LaTeX. This is more or less what a group of people at Los Alamos National Labs did a few years ago. I don't think there are really sound intellectual arguments against this approach, but sentiment seems to be on the side of keeping HTML. If we are indeed stuck with HTML, though, perhaps there is a better way to extend it.

Our methodology for extending HTML seems to be the following:

sit down with a few formatting languages in common use
argue about which are the most commonly needed commands
argue about whether Web browser programmers are really up to the task of writing the code that implements those commands

At this rate, it will be the year 2000 before HTML is really powerful enough for most people, by which time it may have been replaced with Adobe's Portable Document Format (PDF).

I'd like to suggest an alternative approach:

choose a set of 100 documents that represent the spectrum of things we'd like to see on the Web
come up with a language capable of expressing the author's intent in 98 of those documents
come up with a language capable of expressing the designer's intent in 98 of those documents
add to HTML the semantics of the languages developed in the preceding two steps

Fixing the Structure Problem

Can the same approach solve the structure problem? What if we locked a bunch of librarians and a handful of programmers in a room together and made them think up every possible semantic slot that any Web document could ever want to fill. They'd come out with a list of thousands of fields, each one appropriate to at least a small class of documents.

An obvious reason why this wouldn't work is that the committee could never think of all the useful fields. Five years from now, people are going to want to do new, different, and unenvisioned things with the Web and Web clients. Thus, a decentralized revision and extension mechanism is essential for a structure system to be useful.

A deeper reason why this wouldn't work is that nobody would be able to write parsers and user interfaces for it. If a user is developing a Web document, does he want to see a flat list of 10,000 fields and go through each one to decide which is relevant? If you are programming a parser to do something interesting with Web documents, do you want to deal with arbitrary combinations of 10,000 fields?

Malone's Work on Semistructured Messages

Back in the early 1980s, Tom Malone and his collaborators at MIT developed the Information Lens, a system for sharing information within an organization. He demonstrated how classifying messages into a kind-of hierarchy facilitated the development of user interfaces. Figure 1 shows one of Malone's example hierarchies [**** insert Figure 6 from Malone's paper]. For each message type, there is an associated list of fields, some of which are inherited from superclasses. Consider the class meeting-announcement. Fields such as to, from, cc, and subject are inherited from the base class message. Fields such as meeting-place are associated with the class meeting-announcement itself.

Each message type also has an associated list of suggested types for a reply message. For example, the suggested reply type for meeting-announcement is request-for-information. Most importantly, the decomposition of message types into a kind-of hierarchy allows the automatic generation of helpful user interfaces. For example, once the system knows that the user is writing a lens-meeting-announcement, that determines which fields are offered for filling and what defaults are presented. Fields having to do with software bugs or New York Times articles are not presented and fields such as place and time may be helpfully defaulted with the usual room and time.
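
The field inheritance described above can be sketched in a few lines of Python. This is only an illustration, not Malone's implementation: the table layout, the function names, the meeting-time field, and the default values are assumptions made for the example, while the type names and the fields to, from, cc, subject, and meeting-place come from the text.

```python
# Hypothetical kind-of hierarchy: each type names its parent, the fields it
# contributes, and any defaults it supplies. Layout and defaults are invented.
MESSAGE_TYPES = {
    "message": {
        "parent": None,
        "fields": ["to", "from", "cc", "subject"],
        "defaults": {},
    },
    "meeting-announcement": {
        "parent": "message",
        "fields": ["meeting-place", "meeting-time"],  # meeting-time is assumed
        "defaults": {},
    },
    "lens-meeting-announcement": {
        "parent": "meeting-announcement",
        "fields": [],
        "defaults": {"meeting-place": "the usual room",
                     "meeting-time": "the usual time"},
    },
}


def ancestry(type_name):
    """Return the chain of type names from the base class down to type_name."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = MESSAGE_TYPES[type_name]["parent"]
    return list(reversed(chain))


def form_for(type_name):
    """Build the blank form an editor would offer for a message type."""
    form = {}
    # Collect fields, inherited ones first.
    for name in ancestry(type_name):
        for field in MESSAGE_TYPES[name]["fields"]:
            form[field] = ""
    # Apply defaults, letting more specific types override more general ones.
    for name in ancestry(type_name):
        form.update(MESSAGE_TYPES[name]["defaults"])
    return form
```

Calling form_for("lens-meeting-announcement") offers only the message and meeting fields, with place and time already filled in; fields belonging to unrelated types never appear.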

What did Malone's team learn from this?

That a very wide range of messages could be processed automatically. It was convenient for users to fill in lots of fields so messages typically had enough structure to enable fairly sophisticated automatic processing.
That by not forcing users to fill out every field and by allowing users to insert arbitrary text in some fields, unusual situations could be handled gracefully.
That making message types explicit facilitated the development of rules for automated processing. For example, a few lines of code sufficed to delete every New York Times article whose article date was prior to today.
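
To give a sense of how short such a rule can be once types are explicit, here is a sketch in Python. It is not the Information Lens rule language; the type name nyt-article, the field name article-date, and the representation of messages as dictionaries are assumptions made for the example.

```python
from datetime import date


def delete_old_articles(messages, today):
    """Keep every message except New York Times articles dated before today."""
    return [
        m for m in messages
        if not (m["type"] == "nyt-article" and m["article-date"] < today)
    ]


# Hypothetical inbox: two articles and one message of another type.
INBOX = [
    {"type": "nyt-article", "article-date": date(1996, 3, 1), "subject": "old news"},
    {"type": "nyt-article", "article-date": date(1996, 3, 5), "subject": "today's news"},
    {"type": "meeting-announcement", "subject": "group meeting"},
]
```

Because the rule tests the message type first, it never needs to guess whether a piece of free text happens to be a newspaper article.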

Adapting Malone's Work to the Web

Where do we put the fields?

First of all, if we are not to break current clients, we need a place to put fields in an HTML document such that they won't be user-visible. Fortunately, the HTML level 2 specification provides just such a place in the form of the META element. META tags go in the head of an HTML document and include information about the document as a whole. For example


    <meta name="type" content="conference-announcement">
    <meta name="conference-name" content="WebNet-96">
    <meta name="conference-location-brief" content="San Francisco">
    <meta name="conference-location-full" content="Holiday Inn Golden Gateway Hotel, San Francisco, California, USA">
    <meta name="conference-date-start" content="16 October 1996">
    <meta name="conference-date-end" content="19 October 1996">
    <meta name="conference-papers-deadline" content="15 March 1996">
    <meta name="conference-camera-ready-copy-deadline" content="1 August 1996">

would be part of the description for our conference and provides enough information for entries to be made automatically in a user's calendar.
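
As a rough illustration of what a client could do with these fields, the following Python sketch reads the META elements with the standard library's html.parser module and builds a calendar entry. The class and function names, the shape of the returned entry, and the assumption that dates are always written as "16 October 1996" are all choices made for the example, not part of any specification.

```python
from datetime import datetime
from html.parser import HTMLParser

SAMPLE_PAGE = """<html><head>
<meta name="type" content="conference-announcement">
<meta name="conference-name" content="WebNet-96">
<meta name="conference-location-brief" content="San Francisco">
<meta name="conference-date-start" content="16 October 1996">
<meta name="conference-date-end" content="19 October 1996">
</head><body>Ordinary human-readable text goes here.</body></html>"""


class MetaFieldParser(HTMLParser):
    """Collect name/content pairs from META elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.fields[attributes["name"]] = attributes["content"]


def parse_date(text):
    """Parse a date in the assumed day-month-year form, e.g. 16 October 1996."""
    return datetime.strptime(text, "%d %B %Y").date()


def calendar_entry(html_text):
    """Return a calendar entry for a conference announcement, else None."""
    parser = MetaFieldParser()
    parser.feed(html_text)
    fields = parser.fields
    if fields.get("type") != "conference-announcement":
        return None
    return {
        "title": fields["conference-name"],
        "where": fields["conference-location-brief"],
        "start": parse_date(fields["conference-date-start"]),
        "end": parse_date(fields["conference-date-end"]),
    }
```

A real client would also need an agreed date format; the fields shown in the text do not specify one, which is why the format here is flagged as an assumption.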

It might not be pretty. It might not be compact. But it will work without causing any HTML level 2 client to choke.

There are a few obvious objections to this mechanism. The most serious objection is that duplicate information must be maintained consistently in two places. For example, if the conference organizers decide to change the papers deadline from 15 March to 20 March, they'll have to make that change both in the META element in the HEAD and in some human-readable area of the BODY.

An obvious solution is to expose the field names and contents to the reader directly, as is typically done with electronic mail and as is done in [Malone 1987]. When Malone added semiformal structure to hypertext [Malone 1989], he opted to continue exposing field names directly to users. However, that is not in the spirit of the Web; stylistically, the best Web documents are supposed to read like ordinary text.

A better long-term solution is a smart editor for authors that presents a form full of the relevant fields for the document type and from those fields generates human-readable text in the BODY of the document. When the author changes a field, the text in the BODY changes automatically. Thus, no human is ordinarily relied upon to maintain duplicate data.
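
The core of such an editor is simply that both the HEAD and the BODY are generated from one set of fields. A minimal Python sketch, assuming a hypothetical render_announcement function and an invented sentence template:

```python
from html import escape

# Field names follow the conference example; values are held in one place only.
ANNOUNCEMENT = {
    "type": "conference-announcement",
    "conference-name": "WebNet-96",
    "conference-location-brief": "San Francisco",
    "conference-date-start": "16 October 1996",
    "conference-date-end": "19 October 1996",
    "conference-papers-deadline": "15 March 1996",
}


def render_announcement(fields):
    """Generate the META elements and the readable BODY text from the same fields."""
    meta_lines = [
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for name, value in fields.items()
    ]
    # The sentence template is invented for this sketch.
    sentence = (
        f'{fields["conference-name"]} will be held in '
        f'{fields["conference-location-brief"]} from '
        f'{fields["conference-date-start"]} to {fields["conference-date-end"]}. '
        f'Papers are due {fields["conference-papers-deadline"]}.'
    )
    return (
        "<html>\n<head>\n" + "\n".join(meta_lines) + "\n</head>\n"
        "<body>\n<p>" + escape(sentence) + "</p>\n</body>\n</html>"
    )
```

Changing the papers deadline in the field set and rendering again updates both copies of the date at once, so the two can never disagree.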

How do we maintain the document type hierarchy?

Malone unfortunately cannot give us any guidance for maintaining a type hierarchy over a wide area network. He envisioned a system restricted to one organization. His object-oriented approach can give us some inspiration, however. Malone reports that a small amount of user-level programming sufficed to turn his structure-augmented hypertext system into a rather nice argument maintenance tool, complete with user-interface for both display and input [Malone 1989].

Whatever mechanism we propose, therefore, had better allow for an organization to develop further specialized types that facilitate clever processing and presentation. At the same time, should one of these hyperspecialized documents be let loose on the wider Internet, it should carry some type information understandable to unsuspecting clients. One mechanism for doing this is the inclusion of an extra type specification:


    <meta name="type" content="lanl-acl-conference-announcement">
    <meta name="most-specific-public-type" content="conference-announcement">

In this case, the Los Alamos National Laboratory's Advanced Computing Laboratory has concocted a highly specialized type of conference announcement that permits extensive automated processing by Web clients throughout Los Alamos. However, should someone at MIT be looking at the conference announcement, his Web client would fail to recognize the type lanl-acl-conference-announcement and look at the most-specific-public-type field. As conference-announcement is a superclass of lanl-acl-conference-announcement, all the things that the MIT user's client is accustomed to doing with conference announcements should work with this one.
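
The fallback a client performs here is a two-step lookup. A sketch in Python, where the function name and the representation of the META fields as a dictionary are assumptions for illustration:

```python
def effective_type(meta_fields, known_types):
    """Pick the type a client should process a document as, or None."""
    declared = meta_fields.get("type")
    if declared in known_types:
        return declared
    # Unrecognized specialized type: fall back to the public superclass.
    public = meta_fields.get("most-specific-public-type")
    if public in known_types:
        return public
    return None


DOCUMENT = {
    "type": "lanl-acl-conference-announcement",
    "most-specific-public-type": "conference-announcement",
}
# Hypothetical sets of types understood by a client at each site.
MIT_CLIENT_TYPES = {"conference-announcement"}
LANL_CLIENT_TYPES = {"conference-announcement", "lanl-acl-conference-announcement"}
```

With these values the Los Alamos client processes the document as the specialized type, while the MIT client treats it as an ordinary conference-announcement.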

Nonhierarchical inheritance (also known as "multiple inheritance") is also important so that duplicate type hierarchies are not spawned. For example, the fact that a document is restricted to a group or company might possibly apply to any type of document. Should there be two identical trees, one rooted at basic-document and the other at basic-internal-document? Then we might imagine documents for which there is an access charge. Now we just need four identical trees, rooted at basic-free-document, basic-metered-document, basic-internal-free-document, basic-internal-metered-document. There is a better way and it was demonstrated in the MIT Lisp Machine Flavors system (a Smalltalk-inspired object system grafted onto Lisp around 1978): mixins. Mixins are orthogonal classes that can be combined in any order and with any of the classes in the standard kind-of hierarchy. Here are some example mixin classes:

Class Name	Fields Contributed	Comments
draft-mixin	draft-expected-completion-date, draft-address-for-comments, draft-version-number, draft-previous-version-url	User Agent displays "****DRAFT****" prominently, offers to look up previous version and show change bars.
restricted-mixin	restricted-current-release-terms, restricted-person-authorized-to-release, restricted-expected-release-date, restricted-authorized-access-specification (explains who can access, possibly a domain name or list of networks)	HTTP server watches for documents whose type inherits from this class and only delivers them to authorized users; non-authorized users sent an explanation with the name of a person who could authorize release.

If there are N mixins recognized in the public type registry, we might have to have 2^N classes for every class in the old kind-of hierarchy. That's one for every possible subset of mixins, so we'd have classes like travel-magazine, travel-magazine-restricted, travel-magazine-draft, travel-magazine-draft-restricted, etc. This doesn't seem like a great improvement on the 2^N identical trees situation.

However, if we allow documents to specify multiple types

    <meta name="types" content="travel-magazine restricted-mixin draft-mixin">

and build the final composite type at runtime in the content editor, HTTP server, and Web user agent, then we need only have one hierarchy plus a collection of independent orthogonal mixins. This presents no problem for programmers using modern computer languages such as Smalltalk and Common Lisp. These allow new type definitions at run-time and have had multiple inheritance for over a decade. A program implemented in a language that has purely static types, e.g., C++ or Java, is going to need to include its own dynamic type system, built from scratch and not based on the underlying language's type system.
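
Python is another language that allows new classes to be defined at run-time, so it can serve to sketch the idea. In the example below the registry, the function names, and the fields given to basic-document and travel-magazine are assumptions; the mixin field names are taken from the table of example mixins.

```python
class BasicDocument:
    fields = ("title",)  # hypothetical field


class TravelMagazine(BasicDocument):
    fields = ("destination",)  # hypothetical field


class DraftMixin:
    fields = ("draft-expected-completion-date", "draft-address-for-comments",
              "draft-version-number", "draft-previous-version-url")


class RestrictedMixin:
    fields = ("restricted-current-release-terms",
              "restricted-person-authorized-to-release",
              "restricted-expected-release-date",
              "restricted-authorized-access-specification")


# Hypothetical registry mapping public type names to classes.
REGISTRY = {
    "travel-magazine": TravelMagazine,
    "draft-mixin": DraftMixin,
    "restricted-mixin": RestrictedMixin,
}


def composite_type(types_content):
    """Build one class at run-time from a space-separated list of type names."""
    names = types_content.split()
    bases = tuple(REGISTRY[name] for name in names)
    return type("+".join(names), bases, {})


def all_fields(cls):
    """Collect the fields contributed by every class the composite inherits from."""
    collected = []
    for klass in reversed(cls.__mro__):
        collected.extend(vars(klass).get("fields", ()))
    return collected
```

For example, composite_type("travel-magazine restricted-mixin draft-mixin") yields a class that inherits from all three, so no travel-magazine-draft-restricted class ever has to be registered in advance.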

We have established, then, that we need multiple inheritance and distributed extensibility. A standard Internet approach to distributed maintenance of a hierarchy is found in the Domain Name System (DNS), where authority for a zone is parcelled out and that authority includes the ability to parcel out subzones [Stevens 1994; Mockapetris 1987a; Mockapetris 1987b].

A DNS-style type definition service might seem like overkill initially and would result in delays for pioneer users of document types. Without a substantial local cache, document type queries would have to be sent across the Internet for practically every Web document viewed. An alternative would be to have documents include their type definition code at the top or reference a URL where such a definition might be found. This is how it is done with style sheets.
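
The role of the local cache can be sketched as follows. This is an assumption-laden illustration rather than a proposed protocol: the class name, the fetch callable standing in for a query to a remote type definition service, and the shape of a type definition are all hypothetical.

```python
class TypeDefinitionCache:
    """Remember type definitions so each one is fetched remotely at most once."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable: type name -> definition
        self._definitions = {}
        self.remote_lookups = 0      # counts queries that left the local machine

    def lookup(self, type_name):
        if type_name not in self._definitions:
            self.remote_lookups += 1
            self._definitions[type_name] = self._fetch(type_name)
        return self._definitions[type_name]


def fake_remote_service(type_name):
    """Stand-in for a network query; a real client would contact a server here."""
    return {"name": type_name, "fields": []}
```

Viewing a hundred conference announcements through such a cache costs one remote query rather than a hundred; the delay falls only on the first document of each type.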

Regardless of how the hierarchy is maintained, developing the initial core taxonomy is a daunting task. The taxonomies developed by librarians are only a partial solution because they do not generally concern themselves with the sorts of ephemera that constitute the bulk of Internet traffic. If we don't get the core taxonomy right, we won't reap the benefits of useful standard software.

Conclusions

Measured against the yardstick of "how well does this Internet thing work?", HTML is an underachiever. It lacks sufficient structural and formatting tags to render many documents comprehensible, much less aesthetic, even with the addition of style sheets. The META tag can be exploited to implement a document typing system. We need to develop a hierarchy of document types to facilitate implementation of programs that automatically process Web documents. This type system must support multiple inheritance. If we fail to develop some kind of semantic tagging system, computers will be unable to render us any useful assistance with Web documents until the dawn of Artificial Intelligence, i.e., natural language understanding and common sense reasoning.

References

Gilbert, Martin 1991. Churchill: A Life. Henry Holt & Company, New York, page 595

Lie, H.W., Bos, B. 1996. "Cascading Style Sheets, W3C Working Draft" (http://www.w3.org/pub/WWW/TR/WD-css1.html)

Malone, Thomas W., Grant, Kenneth R., Lai, Kum-Yew, Rao, Ramana, and Rosenblitt, David 1987. "Semistructured Messages are Surprisingly Useful for Computer-Supported Coordination." ACM Transactions on Office Information Systems, 5, 2, pp. 115-131.

Malone, Thomas W., Yu, Keh-Chiang, Lee, Jintae 1989. What Good are Semistructured Objects? Adding Semiformal Structure to Hypertext. Center for Coordination Science Technical Report #102. M.I.T. Sloan School of Management, Cambridge, MA

Mockapetris, P.V. 1987a. "Domain Names: Concepts and Facilities," RFC 1034

Mockapetris, P.V. 1987b. "Domain Names: Implementation and Specification," RFC 1035

Ondaatje, Michael 1992. The English Patient. Vintage International, New York

Stevens, W. Richard 1994. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, Massachusetts