There has been a lot of talk about HTML5 video, codecs, containers, and the lot. That certainly matters, but it isn’t what I care about. If the browsers could agree on some standard media codec plug-in interface, as they have done before, browsers needn’t be different from any other media player like VLC. That way it wouldn’t be major work to update the browsers and the spec itself to new formats. Problem solved. …
The licensing problem wouldn’t go away, but it would be moved out of the domain of the browsers (or other media players). If a royalty-free codec like Theora were shown to be torpedoed by one of those media patents, and we had to use some Plan TheorB, it would be a matter of discovering how to evade the patent in question and distributing the new patent-proof plug-in, instead of going through a number of browser upgrade cycles. It also moves the patent risk from the browser companies, both huge lawsuit targets like Microsoft, Google, and Apple and smaller ones like Opera and Mozilla, to plug-in developers so small and fleet as not to be viable targets.
I care about a much simpler issue: subtitles, those little blocks of text that put movies into writing. For all the controversy over the video element, the design goals have been pretty modest, essentially recreating YouTube without using Flash. But HTML5 doesn’t “natively” support YouTube’s captions, annotations, and subtitles. Of course it doesn’t have to; anything you can do, you can do with JavaScript. However, it would be a missed opportunity.
Why care about subtitles?
Subtitles are not that popular in mostly monolingual countries like the USA, or in countries with a tradition of dubbing foreign videos; they can be considered an acquired taste. They are still superior to dubbing, and crucially subtitles are better adapted to the Internet age: they are searchable and accessible as well.
There are different types of subtitles, as the three YouTube variants (captions, annotations, and subtitles) imply. For the moment I will consider the traditional subtitles.
Subtitles are a force for good. They improve literacy rates, and are one of the best ways to acquire a foreign language. This is why I care about subtitles. I am learning Chinese and am not fully fluent in Czech either. Furthermore, Chinese can be written with either Chinese characters or the Latin alphabet using Pinyin. For that reason it can be useful to have a double subtitle track. Indeed, a double subtitle track is useful when learning any new language: one track in the language actually spoken and one in a language you are familiar with.
A better example than YouTube of using subtitling for good is TED talks. The format is very simple. The talks are (so far) invariably in English, the subtitles in a variety of languages, community translated. The different translations are automatically available for each talk, and it is possible to select talks based on a given subtitle language.
What exists already?
There are a number of pre-existing subtitle formats for TV, DVD players and media players.
One simple example would be the SubRip (.srt) format. It could hardly be easier: a numbered sequence of timestamped text entries. It can be edited in any text editor and read inline. Unfortunately it lacks even the most basic metadata, like the character encoding or the language(s) the subtitles are in.
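To show how little there is to the format, here is a minimal sketch of an SRT reader. The function names `srtTimeToSeconds` and `parseSrt` are made up for this post, and the sketch assumes well-formed input: entries separated by blank lines, a counter line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the text. It also assumes the text has already been decoded correctly, which is exactly the part SRT gives no help with.

```javascript
// Hypothetical helper: convert an SRT timestamp "HH:MM:SS,mmm" to seconds.
function srtTimeToSeconds(stamp) {
  const match = /^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/.exec(stamp.trim());
  if (!match) throw new Error("Bad SRT timestamp: " + stamp);
  const [, h, m, s, ms] = match;
  // Sum in whole milliseconds first, then divide once.
  const totalMs =
    Number(h) * 3600000 + Number(m) * 60000 + Number(s) * 1000 + Number(ms);
  return totalMs / 1000;
}

// Hypothetical helper: turn SRT text into an array of
// { index, startTime, endTime, text } objects.
function parseSrt(text) {
  return text
    .replace(/\r\n?/g, "\n") // normalise line endings
    .split(/\n{2,}/) // entries are separated by blank lines
    .filter((block) => block.trim() !== "")
    .map((block) => {
      const lines = block.trim().split("\n");
      const [start, end] = lines[1].split("-->");
      return {
        index: Number(lines[0]),
        startTime: srtTimeToSeconds(start),
        endTime: srtTimeToSeconds(end),
        text: lines.slice(2).join("\n"),
      };
    });
}
```

Note that nothing in the parsed result says which language the text is in; that would have to come from somewhere outside the file.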
Other formats go into the deep end. This would include the W3C effort, with the mnemonic name DFXP. It is an interesting undertaking, a should-read spec for the issues it uncovers and solves, but unfortunately it doesn’t play along that well with HTML5, at least in my opinion.
What about Karaoke?
Karaoke support is about as complex as subtitles go. Annotations could get more complex, possibly including features like SVG animations, but subtitles are just a timed text track. Karaoke isn’t among the requirements for DFXP, though it could probably simulate it. Other formats that consider karaoke do so by custom scripting, which makes them less applicable for general consumption.
The animation effects of karaoke can easily be achieved by SVG, JavaScript, or simply CSS styling. Synchronising the words or syllables to the audio track can be a bigger challenge.
Subtitles and HTML
DFXP has many useful features, like allowing multiple simultaneous subtitle tracks, but it reinvents other features like layout, which is unnecessary. The box model should take care of how the boxes visible at a given time should be rendered. A subtitle text would consist of a number of boxes with display: none except during their allotted time. In fact when the text is not timed, it should be displayed as any other text would.
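As a rough sketch of that idea, the function below hides every timed box outside its allotted time and leaves untimed boxes alone. The `data-start` and `data-end` attributes (times in seconds) are my own assumption for illustration; no spec defines them.

```javascript
// Assumption: each timed subtitle box carries hypothetical data-start and
// data-end attributes, in seconds. Boxes without them are untimed text.
function syncSubtitleBoxes(boxes, currentTime) {
  for (const box of boxes) {
    if (box.dataset.start === undefined || box.dataset.end === undefined) {
      continue; // untimed: displayed as any other text would be
    }
    const start = Number(box.dataset.start);
    const end = Number(box.dataset.end);
    const active = currentTime >= start && currentTime <= end;
    // "" clears the inline override so the box model renders the box
    // normally; "none" hides it outside its allotted time.
    box.style.display = active ? "" : "none";
  }
}

// In a browser this could be driven by the media element, for example:
// video.addEventListener("timeupdate", () =>
//   syncSubtitleBoxes(document.querySelectorAll("[data-start]"), video.currentTime));
```

The point of doing it this way is that the subtitles stay ordinary elements in the document, so styling, selection, and search keep working on them.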
Apart from timing, which might be handled by some CSS transform, the subtitles would have to be associated with the video or audio in question in a predictable manner. It should be easy for a community to add or edit subtitles, and for the user agent to pick applicable languages from user preferences, or let the user toggle the subtitles at will.
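Picking a language from user preferences could be as simple as the sketch below. The track objects with a `lang` property and the function name `pickSubtitleTrack` are assumptions of mine; the preference list could come from something like the browser's `navigator.languages`.

```javascript
// Hypothetical helper: return the first available track matching the user's
// language preferences, in order. An exact tag match wins; otherwise fall
// back to a match on the primary subtag ("zh-CN" matches "zh-Hans").
function pickSubtitleTrack(tracks, preferredLanguages) {
  for (const wanted of preferredLanguages) {
    const want = wanted.toLowerCase();
    const exact = tracks.find((t) => t.lang.toLowerCase() === want);
    if (exact) return exact;
    const primary = want.split("-")[0];
    const loose = tracks.find(
      (t) => t.lang.toLowerCase().split("-")[0] === primary
    );
    if (loose) return loose;
  }
  return null; // nothing applicable: show no subtitles by default
}
```

For the double subtitle track mentioned earlier, the same function could simply be called twice with two different preference lists.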
Any added animation instructions, metadata, or other auxiliary information shouldn’t make the subtitle markup more complex, or make assistive tools, search engines, or other automated enhancing functions work harder than necessary to dig out the information they need.
There have been proposals to add external subtitle files to the video element. The BBC has also experimented with HTML5 video and subtitles. One of the advantages of this approach is that the subtitles are normal, selectable text. As with the other proposal, the actual subtitles are not directly a part of the document, but of a JS structure. This limits the operations that can be done on the subtitles.
Originally posted by BBC:
"#subtitle": [
{ startTime: 31.78, endTime: 35.291, html: "It is half past 9 and we've just passed Sheffield" },
{ startTime: 35.292, endTime: 37.43, html: "and we're coming home from Maker Faire in Newcastle" },
{ startTime: 54.594, endTime: 58.932, html: "I'm here with BBC R&D at Maker Faire UK Newcastle 2009" },
{ startTime: 58.933, endTime: 61.221, html: "and we have some demos we're also making some stuff" },
{ startTime: 61.222, endTime: 63.049, html: "but we have some demos too." },
{ startTime: 63.05, endTime: 65.354, html: "What we have here is a webcam in a cardboard box" },
{ startTime: 65.355, endTime: 67.583, html: "with a picture frame on top of it and we're using this" },
{ startTime: 67.584, endTime: 70.294, html: "to prototype the next generation of computer interaction." },
{ startTime: 70.295, endTime: 73.455, html: "It's getting these very, very, very wobbly little cams" },
{ startTime: 73.456, endTime: 75.94, html: "Something like this, you could, well, we have actually" },
{ startTime: 75.941, endTime: 78.562, html: "strapped it to someone's head, like, uh, so they can" },
{ startTime: 78.563, endTime: 79.823, html: "go like this and then" },
{ startTime: 79.824, endTime: 82.603, html: "the cunning thing we do then is we press this button here" },
{ startTime: 84.231, endTime: 86.389, html: "is it F? Oh god no that isn't the one" },
{ startTime: 86.39, endTime: 87.39, html: "<LAUGHS>" },
{ startTime: 87.391, endTime: 91.167, html: "Right, er, yeah OK" },
{ startTime: 91.168, endTime: 93.258, html: "there we go, it's steadied" },
The HTML WG accessibility task force has been discussing this stuff, trying to come up with a syntax to associate captions/subtitles with video and also a format for it. I think the syntax will be something like this:
Some people like SRT and some people like DFXP. We’ll see what happens with that 🙂
SRT has the advantage of simplicity. It does one task, and does that very straightforwardly. If you want to do anything more fancy than that, you would have to build that on top of it. I could like that, but “anything more fancy” even includes basic text functionality like character encoding. That can be very frustrating with a language like Chinese where you don’t know the encoding used, and programs like VLC are not good at guessing either. As an XML format there wouldn’t be encoding issues with DFXP, but there are other concerns. I would expect that a constraint for any subtitle language is that it should be predictable enough to be playable with standalone players.
That said, the layout component of DFXP gave me a bad case of déjà vu. In the 90s the mobile phone industry created a suite of specifications similar but not identical to the W3C standards. For historical and political reasons HTML and SVG lived in their own bubbles, creating specs that were subtly, and often not so subtly, incompatible with each other. The TV industry also created some HTML-like specs of their own. Later it was slow and pretty messy to reconcile all the minor inconsistencies resulting from that. The last half decade or so SVG ❤ HTML, but it will still be a long time before the two specs play along seamlessly. I think the timed text community would be better off not maintaining a layout module of their own.
Those spec gripes out of the way (somebody else can care about that), a track element seems a simple and neat solution for subtitles/captions, though a little anaemic for annotations. Would that include audio tracks as well as text tracks? And how would they be included in the DOM? Given a subtitle track like the BBC demo above, would it be possible to find whether the subtitle has the string “cunning” inside it and when?
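With the subtitles held in a JS structure, that last question can at least be answered in script. Here is a minimal sketch, using three entries copied from the BBC demo; the function name `findInSubtitles` is my own, and whether a track element would expose anything similar through the DOM is exactly the open question.

```javascript
// Three entries copied from the BBC demo structure quoted earlier.
const subtitles = [
  { startTime: 78.563, endTime: 79.823, html: "go like this and then" },
  { startTime: 79.824, endTime: 82.603, html: "the cunning thing we do then is we press this button here" },
  { startTime: 84.231, endTime: 86.389, html: "is it F? Oh god no that isn't the one" },
];

// Hypothetical helper: return every cue whose text contains the search
// string (case-insensitive), so the caller learns both whether it occurs
// and when.
function findInSubtitles(cues, needle) {
  const lowered = needle.toLowerCase();
  return cues.filter((cue) => cue.html.toLowerCase().includes(lowered));
}

const hits = findInSubtitles(subtitles, "cunning");
for (const hit of hits) {
  console.log(hit.startTime + "s to " + hit.endTime + "s: " + hit.html);
}
```

This searches the raw html strings, so any markup inside a cue would be searched too; a real implementation would want to search the rendered text instead.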