HTML5 — XML’s Stealth Weapon

Even after the death of XHTML2, the syntax debate still dominates the day. Here is my contribution.

The XML story

In the beginning was SGML. There is a lot to be said about SGML so I won't. HTML was specified to be an application of SGML, but that never happened in practice. Among browsers Opera kept the pretence of supporting SGML for the longest time, causing us a lot of trouble because Opera behaved differently from every other browser. DocBook is another known SGML application, but in general SGML was no success.

About a decade ago a small group of people started a reformulation of the old SGML standard. First they did it outside of the W3C and later, when the success became apparent, within the W3C. The story of this simplified SGML, now known as XML, may be best told via the annotated XML, by Tim Bray, one of the principal authors. Essentially XML is angle brackets and a number of production rules on top of Unicode (for a fuller description see Comparison of SGML and XML). …

One of the design decisions was that XML is case sensitive. The case for case is argued above, primarily that being case insensitive is too hard to do outside the range of US ASCII. From a usability point of view that is too bad (especially in HTML where all elements and attributes are in US ASCII), but it was among the sacrifices made for simplicity. In SGML you could specify whether elements, attributes, and attribute values were case sensitive or not. Even in SGML-based HTML some components were case sensitive, like the contents of the 'id' attribute.
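The difference is easy to see in practice. Here is a minimal sketch in Python, assuming the standard library's html.parser as a stand-in for a case-insensitive HTML parser (it is not a browser parser) and xml.etree.ElementTree as the case-sensitive XML parser. The helper names html_tag_names and xml_is_well_formed are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the start-tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tag_names(markup):
    """Return the start-tag names seen by the lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # The HTML parser folds every spelling of a tag name to lower case.
    print(html_tag_names("<HTML><P>text</p></HtMl>"))
    # In XML, <P> and </p> are different names, so this is a mismatched tag.
    print(xml_is_well_formed("<P>text</p>"))
    print(xml_is_well_formed("<p>text</p>"))
```

The HTML side normalises the names before anyone sees them, while the XML side simply treats P and p as two unrelated elements.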

HTML also used some SGML shorthands which by design were not included in the XML spec. This included allowing authors to drop quotes on unambiguous attributes (rel=back), drop attribute values when identical to the attribute (hr noshade), and having no end tag when unambiguous (p and br). More controversially, also in the group itself, XML included draconian error handling in direct violation of the Postel principle that made the Internet a success story. With hindsight I rather wish it hadn't, but XML had data formats to consider.
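The shorthands and the draconian error handling can be sketched side by side. This is only an illustration in Python, assuming html.parser as a lenient stand-in for an HTML parser and xml.etree.ElementTree as the strict XML parser; the names lenient_events, strict_error, HTML_SHORTHAND and XML_LONGHAND are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shorthand HTML: omitted </p>, valueless noshade, unquoted rel=back.
HTML_SHORTHAND = '<p>First<p>Second<hr noshade><a rel=back href="/">up</a>'

# The same content spelled out the way XML requires.
XML_LONGHAND = (
    '<body><p>First</p><p>Second</p>'
    '<hr noshade="noshade"/><a rel="back" href="/">up</a></body>'
)


class EventCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every start tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((tag, attrs))


def lenient_events(markup):
    """Parse with the forgiving HTML parser and return what it saw."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


def strict_error(markup):
    """Return the XML parser's error message, or None if it is well-formed."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as exc:
        return str(exc)


if __name__ == "__main__":
    # The HTML parser copes with every shorthand and reports four start tags.
    for event in lenient_events(HTML_SHORTHAND):
        print(event)
    # The XML parser stops at the first violation instead of recovering.
    print(strict_error("<body>" + HTML_SHORTHAND + "</body>"))
    print(strict_error(XML_LONGHAND))
```

The lenient parser reports the valueless noshade with a value of None and the unquoted rel as plain "back"; the strict parser refuses the whole document at the first shorthand it meets.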

The XHTML story

Like other late converts, when the W3C did convert to XML it did so with unstoppable evangelism. By the time I joined the W3C everything that could be seen as a problem was to be solved by adding angle brackets. The HTML Working Group caught the euphoria. The Browser Ragnarok was over, now to build the new markup world, all new and well-formed. As HTML4 formally was an SGML application and XML an SGML restriction, all that was needed was to reformulate HTML4 within those restrictions, and so they did.

All the choices made were based on this, not as a matter of "improving" HTML. Well-formedness and draconian error handling were seen as fringe benefits, but other changes were by necessity. As such <html> isn't inherently superior to <HTML>, though arguably better than <HtMl>, but they had to pick one. After a lot of arguments they ended up with everything lower-case. For people like me who were used to lower-case anyway this was no big deal, but notably DOM and JavaScript, like XML, are case sensitive, and in their case they ended up with UPPER-CASE element and attribute names.

Likewise, letting end tags such as the p end tag be optional, or dropping superfluous attribute values, would have been preferable, but wasn't possible within XML. The group fought a long and losing battle to support the character reference entities in XHTML.

The upshot was that the XHTML1 language was marginally less powerful than the HTML4 language, and its only killer feature was that it was XML as well. This turned out to be much less of a feature than was expected. There were processes that benefited, but for most users there was no apparent advantage to this except that XHTML was claimed to be "the future", so the uptake was very slow. The other XML formats that were to show the advantage of the extensible XML framework never materialised. You won't find many MathML pages out there, and there were cultural collisions with the SVG people as well.

The hope lay in making better forms; the HTML2 forms were underpowered and messy, as were the somewhat better HTML4 forms. The resulting model XML citizen, XForms, completely failed to garner the attention or enthusiasm of the Web developers. If it had, there would have been no HTML5 today.

By then HTML had been essentially in suspended animation; a modularisation of XHTML and a few minor features like Ruby didn't add up to much. Thus the plan to make XHTML2, as mentioned. As the years moved on the lack of developer interest in the specification was near-complete, as was the lack of interest in more obscure satellite specs.

Furthermore the "future compliant" argument wore down with age. Great hopes were attached to the then new mobile devices and their phone browsers, which were expected to use the relatively simpler XHTML formats, particularly variants of the reduced XHTML Basic vocabulary. If you wrote your web pages in HTML, the argument went, these new browsers wouldn't be able to present them unless you upgraded them to the more modern XHTML family of formats.

This never happened. Yes, within W3C, OMA, NTT DoCoMo, and elsewhere a flurry of mobile XML-based specifications were made, and there were a number of "XHTML" phone browsers around. The problem was that these new browsers were horribly buggy, on a level Netscape or IE never descended to. Not one of these XHTML browsers implemented XHTML correctly, nor could they handle XML. One problem was that proper XML processing was relatively processing intensive, and draconian error handling was costly, so these browsers tried to handle the XHTML and ignored the XML. Better browsers that could handle XHTML and XML, like Opera and later the Webkit-based Nokia and iPhone browsers, could also handle HTML. Even phone browsers with ambitions, like the NetFront browsers, opted for HTML support as soon as they could, as this was what the Web was made of.

Real XML tools, like XSLT processors, could use HTML as an input format as well, reducing the need for XHTML as an intermediate step.

XHTML was becoming no more an XML success story than HTML had been an SGML success story.

Yes, by effectively obsoleting anything HTML and proclaiming XML and XHTML to be the future, there is a sluggish drift towards XHTML, far more by doctype than content type. But the enthusiasm isn't there; to the developer question "What is in it for me?" the honest answer would be "Not much." If something nicer and shinier comes along, say a better Flash, XHTML and HTML with it could be abandoned.

Enter HTML5

To the chagrin of some, HTML5 doesn't try to kill HTML in order to improve it. To the annoyance of others, many of its proponents actively flaunt non-XML practices and don't hide their disdain for XHTML (though not XML). To the relief of many it actually tries to define the HTML document format instead of wishing it is adequately described in some SGML DTD.

However HTML5 supports the HTML and XHTML serialisations in parallel, not in sequence like HTML4 and XHTML1 did. It adds new sought-after features, unlike XHTML1 and for the most part XHTML2. It defines the processing of non-valid HTML. This makes it possible to serve as XHTML5 not only HTML5 but also in theory any HTML4, to the extent that HTML5 is backwards compatible with HTML4. This matters as most HTML4 documents on the Web aren't valid. HTML5 doesn't require XML with a "you must use XML because I tell you to", but by allowing XML and by defining the mappings between the serialisations, it may actually be more successful in triggering a migration to XML than XHTML 1&2 ever would.
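What "mapping between the serialisations" means can be sketched with a toy example. The Python below is emphatically not the HTML5 parsing algorithm: it is a tiny tree builder on top of the standard library's html.parser that knows only two rules (a handful of void elements, and a new p closing an open p), and it ignores the XHTML namespace. The names TinyTreeBuilder, html_to_xml, xml_roundtrip, HTML_FORM and XHTML_FORM are all made up for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A few elements that never have content or an end tag (assumed subset).
VOID = {"br", "hr", "img", "input", "link", "meta"}

HTML_FORM = "<p>One<br>two<p class=note>Three"
XHTML_FORM = '<body><p>One<br/>two</p><p class="note">Three</p></body>'


class TinyTreeBuilder(HTMLParser):
    """Builds an ElementTree from lenient HTML using two toy rules."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("body")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Toy rule 1: a new <p> implicitly closes a currently open <p>.
        if tag == "p" and self.stack[-1].tag == "p":
            self.stack.pop()
        # Valueless attributes arrive as None; store them as empty strings.
        attributes = {k: (v if v is not None else "") for k, v in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        # Toy rule 2: void elements are never pushed, so nothing nests in them.
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; never pop the root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_xml(markup):
    """Parse the HTML serialisation and write the tree back out as XML."""
    builder = TinyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


def xml_roundtrip(markup):
    """Parse the XML serialisation strictly and write it back out."""
    return ET.tostring(ET.fromstring(markup), encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml(HTML_FORM))
    print(xml_roundtrip(XHTML_FORM))
```

Both inputs end up as the same tree, and so as the same XML output. A real HTML5 parser does this for every document, valid or not, which is what makes serving old HTML as XHTML5 conceivable at all.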

This can make HTML5 the stealth weapon of XML as hinted in the title, by providing a progress path that actually progresses. This is assuming that XML is the better document serialisation, because if it is not, a switch to XML would not be progress but regress, and there is nothing anyone in W3C could do about that.

Join the Conversation

  1. Marcos Caceres writes: Still seems that the serialization is mostly irrelevant and what matters is the generated DOM. I imagine in the near future, developers will be taught to think more like the (HTML5) parser in conjunction to markup: if you understand what tree is being created, then you can understand what structure it has, what the content is, how it can be styled, and where the behaviours are. Developers need to understand that there is little correlation between serialization and the generated tree. For example:

  2. "there were a number of “XHTML” phone browsers around. The problem was that these new browsers were horribly buggy, on a level Netscape or IE never descended to. Not one of these XHTML browsers implemented XHTML correctly, nor could they handle XML. One problem was that proper XML processing was relatively processing intensive, and draconian error handling was costly, so these browsers tried to handle the XHTML and ignored the XML. Better browsers that could handle XHTML and XML, like Opera and later the Webkit-based Nokia and iPhone browsers, could also handle HTML." I came across a test of many of these browsers. The conclusion is the same. Originally posted by Simon Pieters: "The conclusion I can draw from this research is that the claim that XHTML would be needed for mobile devices is simply a myth." That the Nokia browser was hacked to not support XHTML also rings true. In the phone world with horribly broken documents to follow horribly broken browsers actually supporting XHTML would be a commercial disadvantage (and if I remember correctly often was a requirement from phone vendors).

  3. Originally posted by Marcos Caceres: "Still seems that the serialization is mostly irrelevant and what matters is the generated DOM. I imagine in the near future, developers will be taught to think more like the (HTML5) parser in conjunction to markup: if you understand what tree is being created, then you can understand what structure it has, what the content is, how it can be styled, and where the behaviours are." I agree, the DOM (or the infoset or the document tree, pick a term) is where it is at. The serialisation, whether XML, HTML, or other, is just a way to marshal and propagate the document over the internet. When the document is parsed on the receiving end its job is done. I guess the reason why it gets such disproportionate attention is that this is what we write. If we only used visual tools to distance ourselves from our code we wouldn’t have cared. Though for all the distracting syntax wars we are subjected to, it is fortunate that we aren’t alienated from the code. By requiring the specs to support handcoding, it keeps them simple, honest, and clean. Most of all it allows a third party to easily use the source code to enhance the document and create new services without prior agreements with the author. If HTML5, 6, 7 (or the XHTML serialisation of it) was generated by a descendant of NetObjects Fusion (an early visual editor that generated some of the most horrible and bloated HTML known to mankind), the HTML or XML could just as well be a JPEG image. Though parsers might create the DOM, the resulting document would be devoid of meaning or structure.

  4. Oh, by the way is there any human readable document about HTML5 with all tags defined and their properties? Reading W3C documents is only possible when you know the case and just forgot some tiny piece of information.

  5. Rhyaniwyn writes: I think your assessment is fair and jibes with my research & observations over the years. It’s also a good jumping off point to correct some misinterpretations floating around. But at least some designers/developers were excited about XML especially as an extensible format. I’ve talked to people about it over the years. I played around with XML, XSLT, XPath, writing doctypes, & vaguely started on schemas. Never got to XForms, through laziness. But I was one of those who didn’t really use it in production. My XHTML was served as HTML. It wasn’t that I didn’t like XML-based HTML, it was that it seemed to me that the necessary support wasn’t there. I couldn’t serve as XML to IE. Most of the extra X bits weren’t complete recommendations; and I don’t follow candidates that closely (maybe they got finished and I just didn’t hear). Neither doctypes nor schemas provided a way to create a profile alongside the definition for an XML module–to create a glossary that would provide meaning to the content you marked up with the custom content. I always felt an extension should also function as a taxonomy. But I remembered how slow CSS had been and figured it would come in time. Now I find that belief was very misguided. And I’ve buckled down and started reading the HTML 5 spec. There’s some stuff in there I like, some stuff in there I think is ok, and some stuff in there I think is absolute nonsense. And personally I don’t agree with about 80% of the philosophy outlined at the beginning of the spec. I’m not really totally pleased with it, basically. But obviously I wasn’t totally pleased with XHTML either.

  6. There was a time when prefixing a spec with X was the recipe for success, but these days a spec would have to compete on merits. Some have succeeded, like XSLT/XPath, others have failed, like XLink. But as said above the XML syntax (or lack thereof) matters little up until the point it is parsed by the recipient processor, and not at all thereafter. What we should care about is the enhancements (or lack thereof) that the HTML5 vocabulary offers. The entire XHTML1-XHTML2 trail is fine, but it didn’t offer much in the terms of enhancements and spec updating.

  7. Originally posted by Aux: "Oh, by the way is there any human readable document about HTML5 with all tags defined and their properties?" One article has collated 70 HTML5 and CSS3 resources. For an overview you could use this cheat sheet, ironically in PDF.
