Minimal Markup

I have earlier proclaimed markup a [necessary] evil. A more constructive way of putting it is to say that markup should always be minimal. You should use as much markup as you need, and no more. Markup is something we add to aid machines. Too much or wrong markup can do more damage than too little or too vague markup.

This design principle determines how to standardize markup. Unless the author knows something the user doesn't, the markup should not be there.

This principle obviously caters to the author's laziness, the admirable human trait not to do more than necessary. It is less obvious, but no less important, that it also empowers the user. More minimal markup means more flexible and accessible markup, assuming that the user agents do their job and actually act on their users' behalf. …

Approaching the most minimal markup

The most minimal markup is no markup.

More words…

Judging by some approaches to markup, you could wonder how people managed to communicate before the invention of the angle bracket. <sentence type="declarative"><word>we</word><word expanded="do not">don't</word><word>need</word><word>to</word><word>use</word><word expanded="Extensible Markup Language">XML</word><word>syntax</word><word>for</word><word>machines</word><word>to</word><word>read</word><word>what</word><word expanded="we are">we're</word><word>writing</word></sentence> We would rather write this as "We don't need to use XML syntax for machines to read what we're writing." In most cases machines can comprehend this markup-free language pretty well.

In English, words are marked up with spaces (the word space is a fairly recent, but highly appreciated, invention in Latin-based writing), and sentences with '.', '?', '!' and a few other punctuation marks (Spanish also has the "¿" and "¡" start tags). There are ambiguous cases with basic typography, but as a rule it is clear enough that even machines can understand it. Is "white space" one word or two? What about "white-space" or "whitespace"? For convenience we generally apply a syntactic rule to decide: it is two words because of the space between "white" and "space". In a few cases we use semantic rules based on the role of words in a sentence. As the rules and heuristics for when to use a space, when to use a hyphen, and when to use neither are complex and changing, the number of words in English changes with them. Spaces are collapsible, so "white space" written with two spaces in between is still two words, not three ("white", "", and "space").
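
The collapsing rule is easy to check in code. This is a minimal sketch in Python, assuming the purely syntactic rule that a word is any run of non-whitespace characters; the function name count_words is made up for illustration.

```python
def count_words(text):
    """Count words, treating any run of whitespace as a single separator."""
    # str.split() with no argument collapses runs of whitespace,
    # so a doubled space does not produce an empty "word".
    return len(text.split())


print(count_words("white space"))   # 2
print(count_words("white  space"))  # 2: the doubled space collapses
print(count_words("white-space"))   # 1
print(count_words("whitespace"))    # 1

# Splitting on a single literal space does not collapse anything:
print("white  space".split(" "))    # ['white', '', 'space']
```

The syntactic rule gives a consistent answer every time, which is about all a machine needs from it.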

In other languages words can be harder to determine. Should 北京 be transliterated into Běi Jīng, Běi-Jīng or Běijīng (or Beijing if transliterated into US-ASCII)? The name means North(ern) Capital; the equivalent phrase would be written as two words in English and as one word in Norwegian. Should the city be written as Bei Jing in English and Beijing in Norwegian?

For any application, except word counters (does "the" count as a word?) and similar, this doesn't really matter. We pick a representation that is more or less consistent with other words and stick with it. This makes it easier for processes to find occurrences of "Beijing" in a text. This is a good approach when conversing with machines. Don't bother with distinctions that don't matter, and try to be consistent.

…less math

This applies to a more machine-friendly language, mathematics, as well. Take the two syntaxes for representing "12+12" in the MathML markup language, presentational:

<math>
  <mrow><mn>12</mn><mo>+</mo><mn>12</mn></mrow>
</math>

and content:

<math>
  <apply><plus/><cn>12</cn><cn>12</cn></apply>
</math>

I am not out to lampoon MathML, and there may be math applications where this markup is useful, but for most purposes the string "12+12" is a far better way to represent "12+12". The code points for "1" and "2" indicate digits and the code point for "+" indicates an operator. The explicit markup telling that "12" is a number or "+" an operator is redundant. Admittedly, the string "12+12" can be ambiguous. In most cases it would resolve to the number 24, but if the number "12" uses octal representation the expression would resolve to 20 (in decimal). Similarly, if "+" represents the string concatenation operator it would resolve to "1212" instead.
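
Both claims can be checked without any markup at all. This is a small Python sketch using only the standard library; the radix and the meaning of "+" are the context that the bare string leaves open.

```python
import unicodedata

# The code points already say what kind of character each one is.
print(unicodedata.category("1"))  # Nd: decimal digit
print(unicodedata.category("+"))  # Sm: math symbol

left, _, right = "12+12".partition("+")

print(int(left, 10) + int(right, 10))  # 24: decimal addition
print(int(left, 8) + int(right, 8))    # 20: "12" read as octal is 10
print(left + right)                    # 1212: "+" as string concatenation
```

The string itself is the same in all three cases. What differs is the context the reader agrees on, and that is the only part that might deserve markup.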

Few of us need to present mathematical expressions, and thus few of us need MathML, but the real problem isn't MathML. It is a serious shortcoming in CSS. If you want to style "12+12" nicely in CSS you actually need the presentational markup above. CSS only operates on elements, not strings, so if you want to make a syntax highlighter using different colours for numbers, operators, keywords and so on, you need those redundant elements. To show a highlighted "12+12" in HTML you need to code something like:

<style>
  .expr .num {color: blue}
  .expr .op {color: green}
</style>

<span class="expr">
  <span class="num">12</span><span class="op">+</span><span class="num">12</span>
</span>

This is in no way better than the MathML examples; if anything it is worse. The shortcoming lies in CSS3 Selectors. They only operate on elements (here the added span elements), so you have to add elements where none are strictly needed.
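
In practice the redundant elements are rarely typed by hand; a script generates them from the plain string. This is a minimal sketch of such a generator in Python, assuming the num and op class names from the example above. The function name highlight and the token pattern are illustrative, not part of any standard.

```python
import re
from html import escape

# One named group per token class; the group name doubles as the CSS class.
TOKEN = re.compile(r"(?P<num>\d+)|(?P<op>[+*/-])")


def highlight(expr):
    """Wrap numbers and operators in the span elements that CSS needs."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(expr):
        # Text between tokens is passed through unstyled.
        parts.append(escape(expr[pos:match.start()]))
        parts.append(
            f'<span class="{match.lastgroup}">{escape(match.group())}</span>'
        )
        pos = match.end()
    parts.append(escape(expr[pos:]))
    return f'<span class="expr">{"".join(parts)}</span>'


print(highlight("12+12"))
```

All the information the script uses is already in the string "12+12". The spans add nothing a machine could not work out for itself, which is exactly the complaint.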

An elementary flaw: What CSS3 Selectors cannot select

CSS3 Selectors can generally select most node sets as long as the nodes are elements. CSS also has a limited number of pseudo-element selectors. These "elements" don't exist in the markup, but CSS treats them as if they did.

I have regularly come across the need to style a substring of an element, something which would need extensions like ::word (notwithstanding the problem of defining a word as discussed above) and ::char. Given:

<style>
  .example::word(3) {font-weight: bold}
  .example::word(2:4) {color: red}
  .example::char(-1) {background: pink}
</style>
<span class="example">A sentence. Another one.</span>

The result should be:

A sentence. Another one.
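
The styling does not survive in plain text, so here is a rough Python sketch of which substrings these hypothetical selectors would pick. It rests on assumptions the proposal above does not spell out: words are whitespace-separated and counted from 1, 2:4 is an inclusive range, and a negative index counts from the end. The function names simply mirror the made-up pseudo-elements.

```python
text = "A sentence. Another one."
words = text.split()


def word(n):
    """Hypothetical ::word(n): the n-th word, counting from 1."""
    return words[n - 1]


def word_range(first, last):
    """Hypothetical ::word(first:last): an inclusive range of words."""
    return words[first - 1:last]


def char(n):
    """Hypothetical ::char(n): 1-based, negative values count from the end."""
    return text[n] if n < 0 else text[n - 1]


print(word(3))           # Another
print(word_range(2, 4))  # ['sentence.', 'Another', 'one.']
print(char(-1))          # .
```

Under these assumptions "Another" would be bold, everything after "A" would be red, and the final full stop would get the pink background.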

This extension would be nice, but not sufficient to do syntax highlighting as above. To do that we would need some sort of pattern matching, for instance with a regexp-style syntax:

.expr ::match(\d+)    /* Numbers */   {color: blue}
.expr ::match([+*/-]) /* Operators */ {color: green}

String selectors in CSS3 Selectors would be very nice, but are unlikely to happen.

XPath to the rescue?

As part of the modularisation of CSS3, the Selectors spec is defined without CSS dependencies so that Selectors can be used in non-CSS languages as well. The flip side of this is that CSS could opt for a different selector language to ornament the document tree. The obvious other candidate would be XPath. It could even be possible to mix the two selector languages if desired. Among other things XPath can match text strings, and from version 2.0 it has support for regular expressions. On the other hand CSS Selectors are designed to be more efficient, especially over a slow network, and complex XPath expressions can look quite frightening (then again, so can complex CSS3 selectors).
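
To give a feel for matching on text rather than on element names, here is a small Python sketch using the limited XPath subset in the standard library's xml.etree.ElementTree. It assumes Python 3.7 or later for the [.='text'] predicate. That subset has no regular expressions, so the pattern step is done in Python as a stand-in for what a full XPath 2.0 engine could express. The sample markup is made up for the example.

```python
import re
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<p class="expr"><span>12</span><span>+</span><span>12</span></p>'
)

# XPath can compare an element's text content to a literal,
# something CSS3 Selectors have no way of expressing.
twelves = doc.findall(".//span[.='12']")
print(len(twelves))  # 2

# ElementTree's subset stops short of regular expressions,
# so the pattern matching happens in Python instead.
numbers = [el.text for el in doc.iter("span")
           if re.fullmatch(r"\d+", el.text or "")]
operators = [el.text for el in doc.iter("span")
             if re.fullmatch(r"[+*/-]", el.text or "")]
print(numbers)    # ['12', '12']
print(operators)  # ['+']
```

Note that the spans here carry no class attributes; the content alone is enough to tell numbers from operators.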

Don't repeat yourself

Redundant markup is the opposite of minimal markup, and should be avoided. This is similar to the relational database principle that you code information once, and update it in one place. When you copy code from one part of the document into another part of the document, it is an indicator that you or the spec designer has done something wrong (though sometimes minimality may come out the loser in design trade-offs).

Data URLs considered harmful

Data URLs are used as a mechanism to embed external resources like images into a single file. They have a range of problems, one of which is that you can't refer to them, so if you use an image more than once you have to encode and embed that image every time. Data URLs are not minimal.
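
The cost of the repetition is easy to measure. This is a minimal Python sketch, assuming base64 encoding and using placeholder bytes in place of a real image; the helper name data_url and the file name logo.png are made up for the example.

```python
import base64


def data_url(payload, mime="image/png"):
    """Build a base64 data URL for the given bytes."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# 1024 placeholder bytes standing in for the contents of an image file.
image = bytes(range(256)) * 4

embedded = f'<img src="{data_url(image)}">'
linked = '<img src="logo.png">'

# Using the image twice means carrying the whole payload twice.
print(len(embedded * 2))  # grows with the size of the image
print(len(linked * 2))    # stays the same however large the image is
```

Base64 also inflates the payload by about a third, so each repetition costs more than the image it repeats.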

Quoting

One way of repeating yourself is to quote yourself. One of these structures is the pull quote. The HTML5 aside element is proposed to be used to mark up pull quotes.

<aside><q>One way of repeating yourself is to quote yourself.
         One of these structures is the pull quote.</q></aside>

This is nice, but it is a duplication of information. It would be better to do something like this:

<pullquote ref="sourceID"/>

Assuming the quoted text is an element that has an id="sourceID" attribute, the pull quote would generate the same content that was added manually to the aside/q combination. If the quote is edited, the edited version could be the pullquote content:

<pullquote ref="sourceID">
  One way of repeating yourself is to quote
  yourself [with] the pull quote.
</pullquote>

The ref attribute would still be useful to establish the link back to the quoted text even if a different text was displayed. XPath could again help if the quoted text isn't neatly constrained by a single element.
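
What a user agent would have to do with such an element is not complicated. This is a rough Python sketch of the resolution step, assuming the hypothetical pullquote element and ref attribute proposed above. The surrounding article markup and the function name resolve_pullquotes are invented for the example.

```python
import xml.etree.ElementTree as ET

SOURCE = """<article>
  <p id="sourceID">One way of repeating yourself is to quote yourself.</p>
  <pullquote ref="sourceID"/>
</article>"""


def resolve_pullquotes(root):
    """Fill each empty pullquote with the text of the element it refers to."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    for quote in root.iter("pullquote"):
        has_own_text = bool((quote.text or "").strip())
        source = by_id.get(quote.get("ref"))
        # An edited quote keeps its own content; the ref is then only a link.
        if source is not None and not has_own_text:
            quote.text = "".join(source.itertext())
    return root


root = resolve_pullquotes(ET.fromstring(SOURCE))
print(root.find("pullquote").text)
```

The quoted sentence is stored once, and a correction to the source paragraph shows up in the pull quote without anyone having to remember to copy it.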

Adding a pullquote element to HTML5 is unlikely to happen; keeping the number of elements in a language low is also a way to minimize markup, even though it is of lesser importance. However, the q element could have the same functionality, if desired.

Content in context

Code that is context-free, that represents the same data in any context, is often desirable as it makes it easier to move from one context to another. However, contextual content is far more efficient when the author and the user [agent] can agree on a context. It should be possible to store context with content, e.g. for archiving purposes, but a conversation shouldn't have to be context-free.

Marking time

HTML5 has added a time element. An element like this is needed as the normal representation of time is ambiguous. A string like "12/3" can resolve to 4, or to "12 March" or "3 December" depending on context. Even a string identified as a time element can be ambiguous. The date "6/7/8" can be "8 July 2006", "6 July 2008", "7 June 2008" or conceivably any of the other three day/month/year combinations. We need a way to find out which is which.
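
The six readings of "6/7/8" can be enumerated mechanically. This is a minimal Python sketch, assuming two-digit years belong to the 2000s as in the examples above; the function name interpretations and the century parameter are illustrative.

```python
from datetime import date
from itertools import permutations


def interpretations(text, century=2000):
    """Return every valid date a slash-separated triple could stand for."""
    parts = [int(p) for p in text.split("/")]
    found = set()
    for day, month, year in permutations(parts):
        try:
            found.add(date(century + year, month, day))
        except ValueError:
            # Not a real date, e.g. month 13 or day 31 in a short month.
            pass
    return sorted(found)


for candidate in interpretations("6/7/8"):
    print(candidate.isoformat())
```

Without context all six are equally plausible, which is exactly what an unambiguous machine-readable value on the time element is for.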

Sensible defaults

Ambiguity aside, only surprising or unusual information needs markup. If the language is English we can assume that the decimal separator is "." and the thousands separator is ",", and vice versa if the language is Norwegian. If we know nothing else, it is better to assume that the text is in English than that it is in Norwegian. Markup should override such heuristics, not preclude them.
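
A default that markup can override might look like this in code. This is a minimal Python sketch using the separator conventions described above; the table SEPARATORS, the function parse_number and its lang parameter are made up for illustration, with lang standing in for whatever language information the markup provides.

```python
# (decimal separator, thousands separator) per language, as described above.
SEPARATORS = {"en": (".", ","), "no": (",", ".")}
DEFAULT_LANG = "en"  # assumed fallback when nothing else is known


def parse_number(text, lang=None):
    """Parse a number using the separators implied by the language."""
    decimal, thousands = SEPARATORS.get(
        lang or DEFAULT_LANG, SEPARATORS[DEFAULT_LANG]
    )
    return float(text.replace(thousands, "").replace(decimal, "."))


print(parse_number("1,234.5"))             # 1234.5
print(parse_number("1.234,5", lang="no"))  # 1234.5
```

The common case needs no markup at all, and the unusual case needs exactly one piece of information.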

Keep your markup clean

Only code what the document is. Styles, scripts, transforms, metadata, and other components to enhance the document should augment the markup, but not be embedded in it. Every level of markup makes it harder for the other components to interact with it, and makes the document harder to maintain.

The world can be complex, but your markup needn't be. Where complexity is required, put it in those auxiliary files.

Join the Conversation

Anonymous says:

August 25, 2009 at 1:08 am

You’re right about the ambiguity of the time elements/types in HTML5. Whenever I think about that, I keep remembering this article. Makes me jump out of my skin.

Anonymous says:

August 28, 2009 at 11:08 pm

That is kind of awkward. Worse, negative dates (BC/BCE) don’t seem to be supported.

Anonymous says:

August 29, 2009 at 12:08 am

And the time element doesn’t seem to differentiate between 24-hour time and AM/PM time. That was a big problem I had.
