Ross McNab

Random thoughts and code from a
.Net software developer.

HTML5 Loves Tag Soup

There's an HTML5 feature that I haven't heard many people talking about - probably because it's entirely invisible to users and web developers. It's a feature for browser developers.

The HTML5 spec is the first to include parsing rules for generating a DOM tree from HTML text. You can read the full mind-numbing details of the state machine in section 8.2 of the spec.
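To give a feel for what "state machine" means here, this is a toy tokenizer of my own, not the spec's algorithm. It borrows three state names from the spec ("data", "tag open", "tag name") and adds a made-up "after tag name" state that simply skips attributes. The function name `tokenize` and the token tuples are my own invention for illustration.

```python
def tokenize(text):
    """Toy HTML tokenizer: returns ('text', s), ('start', name), ('end', name) tuples."""
    tokens = []
    state = "data"
    buf = ""          # text or tag name collected so far
    is_end = False    # True while reading a closing tag such as </b>

    def emit_tag():
        tokens.append(("end" if is_end else "start", buf))

    for ch in text:
        if state == "data":
            if ch == "<":
                if buf:
                    tokens.append(("text", buf))
                buf = ""
                state = "tag open"
            else:
                buf += ch
        elif state == "tag open":
            if ch == "/":
                is_end = True
                state = "tag name"
            elif ch.isalpha():
                is_end = False
                buf = ch.lower()
                state = "tag name"
            else:
                # Not a tag after all: recover by treating '<' as literal text.
                buf = "<" + ch
                state = "data"
        elif state == "tag name":
            if ch == ">":
                emit_tag()
                buf = ""
                state = "data"
            elif ch.isspace():
                state = "after tag name"
            else:
                buf += ch.lower()
        elif state == "after tag name":
            # Simplification: attributes are skipped, not parsed.
            if ch == ">":
                emit_tag()
                buf = ""
                state = "data"

    # End of input: flush pending text; a half-finished tag is silently dropped.
    if state == "data" and buf:
        tokens.append(("text", buf))
    return tokens
```

The point to notice is that no input makes it raise an error: every character in every state has a defined next step, which is the property the spec nails down in far more detail.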

Postel's Law

In order to understand why this is a Good Thing™ we must go back in time to the salad days of the web, when the world was a simpler place. A time before Chrome, Firefox, and IE, a time even before Netscape, when there was only one web browser - NCSA Mosaic [1].

The developers of Mosaic were great engineers, and followed the robustness principle known as Postel's Law:

"Be liberal in what you accept, and conservative in what you send"
Jon Postel, RFC 1122, October 1989

In practical terms, this meant that it didn't matter what half-arsed malformed HTML you threw at Mosaic, it would do its best to interpret it and display something to the user. Lesser engineers would have thrown up their hands, implemented a parser that bailed out at the first hint of trouble, and "Unrecoverable Parser Error" would have become as well known as "404 Not Found".
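The difference between the two attitudes is easy to show with Python's standard library. This is only a sketch: I'm using the strict XML parser in `xml.etree.ElementTree` as a stand-in for the hypothetical "bail out" browser, and the forgiving `html.parser.HTMLParser` as a stand-in for Mosaic's approach. The names `SOUP`, `strict_parse`, `TextCollector`, and `lenient_parse` are mine.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOUP = "<p>Unclosed paragraph<br><b>bold text"


def strict_parse(markup):
    """Bail-out approach: give up at the first malformed construct."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as exc:
        return f"Unrecoverable Parser Error: {exc}"


class TextCollector(HTMLParser):
    """Liberal approach: ignore structural problems and keep the text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_parse(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(c.strip() for c in parser.chunks if c.strip())


print(strict_parse(SOUP))
print(lenient_parse(SOUP))
```

The strict version reports an error and shows the user nothing; the lenient version still recovers the text "Unclosed paragraph bold text" from the same input.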

But the diligence of the Mosaic engineers caused problems for the browser developers that came after them. They had to ensure that their new browsers interpreted the nascent web's tag soup in the same way as Mosaic. It was a strange mirror image of the modern world, a world where browser developers were beholden to web authors, implementing workarounds in their browser to deal with the crazy range of inconsistent HTML.

The future

This state of affairs has persisted for 20 years. The web is still a wild, untamed place, with unclosed tags, half-quoted attributes, and <tr>s nested in <td>s. Every major browser has its own proprietary HTML parser, and a suite of unit tests that tries to ensure its workarounds match every other browser's workarounds (without ever quite succeeding).

Mere mortals like myself rely on libraries like Beautiful Soup and HTML Agility Pack to do the dirty work of HTML parsing. These libraries attempt to replicate the workarounds of the major browsers, but again never quite manage to exactly match any one browser's behaviour.

But finally, the HTML5 spec should change all that. There's an open source Python implementation of the parser rules, called html5lib, which has been around since 2009 and can be used as a replacement for Beautiful Soup. It's been ported to PHP, Ruby, JavaScript, and Java.
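For the curious, here is roughly what using html5lib looks like. Treat it as a sketch under a few assumptions: html5lib is a third-party package (installed with `pip install html5lib`), so I've guarded the import, and `parse_soup` is just my own wrapper name around `html5lib.parse`.

```python
try:
    import html5lib  # third-party package: pip install html5lib
except ImportError:  # keep this sketch loadable where html5lib is missing
    html5lib = None


def parse_soup(markup):
    """Parse possibly malformed HTML and return the <html> ElementTree element."""
    if html5lib is None:
        raise RuntimeError("html5lib is not installed")
    # namespaceHTMLElements=False gives plain tag names such as "body"
    # rather than "{http://www.w3.org/1999/xhtml}body".
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


if html5lib is not None:
    # Neither <p> nor <b> is closed, and there is no <html>, <head> or <body>.
    root = parse_soup("<p>Hello<b>world")
    print([element.tag for element in root.iter()])
```

The parser fills in the missing document structure and closes the dangling tags according to the spec's rules, so any other spec-compliant parser should build the same tree from the same soup.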

[1] OK, I know Mosaic wasn't strictly the first web browser, but it was the first mass-market graphical browser.