Implicit Evaluation with PHP

29 August 2006

PHP Parsing HTML

One of the higher-level tasks I deal with frequently is HTML parsing. The primary reason HTML gets chosen as a data format over something like CSV is that it’s easy to style. CSV lacks any formatting at all. XLS is beautiful, but essentially impossible to parse without a COM-friendly language and platform and an extra Excel license lying around. HTML offers nearly all of the control that XLS does, lacking only dynamic formula support.

Another time HTML interaction comes up is while rolling out new systems. Often the old system was static HTML, or HTML is the only format in which the original system can reproduce its data.

So whatever the reason, consuming HTML is a high-level task that comes up more often than its scale would suggest. What’s troubling is the lack of HTML libraries when working with PHP. XML parsing is covered by xquery, and XML can be output relatively easily. But HTML seems to be treated as output-only.

There are two ways to begin HTML parsing. Deciding on a method depends on how the HTML will be used. In Fortitude Forms, for instance, HTML parsing is relatively primitive. Every Fortitude tag begins with <fort:. Once that’s located in the code, another string search finds the next > after <fort:. That region is then extracted and operated on. The method is relatively fast, but it only works because there’s such a limited amount of text we’re interested in.
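
The string-search approach can be sketched in a few lines. This is only an illustration of the idea, not the Fortitude Forms code: the function name findFortTags and the tag and attribute names in the sample string are made up, and the sketch assumes modern PHP (7.0 or later).

```php
<?php
// Hypothetical helper: returns the text between "<" and ">" for every
// tag that starts with "<fort:".
function findFortTags(string $html): array
{
    $tags = [];
    $offset = 0;
    // strpos returns false once no further "<fort:" marker exists.
    while (($start = strpos($html, '<fort:', $offset)) !== false) {
        $end = strpos($html, '>', $start);
        if ($end === false) {
            break; // unterminated tag: stop rather than guess
        }
        // Keep everything between the angle brackets.
        $tags[] = substr($html, $start + 1, $end - $start - 1);
        $offset = $end + 1;
    }
    return $tags;
}

// Sample markup; the attribute names are invented for this example.
$html = '<p>Name: <fort:text name="first"> Email: <fort:text name="email" /></p>';
$found = findFortTags($html);
print_r($found);
```

Everything outside the <fort: tags is never examined, which is why this stays fast. The obvious weakness is that a > inside an attribute value would end the tag early.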

The second way is a full-blown HTML parser, and it is significantly more difficult. It is interesting, though, because the diminishing returns between the amount of code and its effect set in even faster than usual. To begin with, read about preg_split, which splits a string by a regular expression. A very basic parser looks like this:

$dom = preg_split('/(<(?:[^<>]+(?:"[^"]*"|\'[^\']*\')?)+>)/',
    $strHTML, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($dom as $index => $content) {
    // data is returned in [text][tag] pairs: even indexes are text, odd are tags
    if ($index % 2 == 0) {
        renderContent(html_entity_decode(trim($content)));
    } else {
        // strip out the surrounding < and >
        processTag(substr($content, 1, strlen($content) - 2));
    }
}

Of course, you need additional code to implement renderContent and processTag. But still, the entirety of my parsing class is 129 lines, whitespace, comments and special cases included.
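
To make the [text][tag] pairing concrete, here is a small sketch of what preg_split returns for a short fragment. The pattern is my reading of what the expression above is meant to be (a single capture group that matches one complete tag), and the sample HTML is made up.

```php
<?php
// Assumed pattern: one capture group matching a whole tag.
$pattern = '/(<(?:[^<>]+(?:"[^"]*"|\'[^\']*\')?)+>)/';

// PREG_SPLIT_DELIM_CAPTURE keeps the matched tags in the result
// instead of discarding them as delimiters.
$pieces = preg_split($pattern, '<p>Hello <b>world</b></p>', -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($pieces as $index => $piece) {
    $kind = ($index % 2 === 0) ? 'text' : 'tag';
    printf("%d %-4s [%s]\n", $index, $kind, $piece);
}
```

The result has nine elements: text at the even indexes and tags at the odd ones. Wherever two tags touch, or the string starts or ends with a tag, the text slot is an empty string. Those empty strings are what keep the even/odd alternation intact, so PREG_SPLIT_NO_EMPTY should not be added here.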

Special cases are what make this difficult, and why web browsers weigh so much and take so long to develop. There are many, many HTML documents out there, and unfortunately many do not conform to the easily parseable XHTML standard. Instead, many use HTML 3.2 or HTML 4.0, which have special cases for tags that do not take a closing tag or, even worse, make the closing tag optional. But the absolute worst thing about parsing HTML is the documents that don’t even conform to HTML’s optional rules. Some have simply forgotten a close tag, some deemed it unnecessary, and some were written by programmers who don’t care how it works and stick with the first version of the code that renders correctly. These are the special cases, and they must be dealt with inconveniently early in the parsing cycle. A DOM cannot be established without knowing the rules for each special case.
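
One way to encode the optional-closing-tag rules is a lookup table consulted whenever a tag opens. The sketch below is an assumption about how that could look, not a complete rule set: the IMPLIED_CLOSE table and the openTag function are hypothetical names, the table covers only a handful of HTML 4 elements, and the code assumes PHP 7.0 or later.

```php
<?php
// Assumed, deliberately incomplete rule table: opening the key tag
// implicitly closes any of the listed tags that is innermost on the stack.
const IMPLIED_CLOSE = [
    'li'     => ['li'],
    'p'      => ['p'],
    'option' => ['option'],
    'td'     => ['td', 'th'],
    'th'     => ['td', 'th'],
    'tr'     => ['td', 'th', 'tr'],
];

// Takes the stack of currently open tag names and returns the stack
// as it should look after $tag has been opened.
function openTag(array $stack, string $tag): array
{
    $closes = IMPLIED_CLOSE[$tag] ?? [];
    // Pop while the innermost open tag is one this tag implicitly closes.
    while ($stack && in_array(end($stack), $closes, true)) {
        array_pop($stack);
    }
    $stack[] = $tag;
    return $stack;
}

// <ul><li>one<li>two<li>three : no </li> anywhere in the source.
$stack = [];
foreach (['ul', 'li', 'li', 'li'] as $tag) {
    $stack = openTag($stack, $tag);
}
print_r($stack);
```

After the loop only ul and the last li remain open, because each new li closed the previous one. Which rules belong in the table depends on the HTML you expect to consume.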

Therefore, any parser will need to be tweaked for each and every project. As the writer of the parser, you must know what kind of HTML you will be consuming and adapt your parser to it. I will include pseudocode for a generic parser.

class HTMLDom {
    var tagStack = NULL
    var dom[] = NULL

    function parse (html) {
        tokens = // split html into tags and content
        foreach (tokens as index => content) {
            // data is returned in [text][tag] pairs
            if (index % 2 == 0) {
                renderContent (html_entity_decode(trim(content)))
            } else {
                // strip out the surrounding < and >
                processTag (substr (content, 1, strlen(content)-2))
            }
        }
    }

    function renderContent (content) {
        // push content onto dom stack
    }

    function processTag (content) {
        // determine if open or close tag and call appropriate function
    }

    function processOpenTag (content) {
        // establish tag
        // set tag name from content
        // set self-closing tag (tags like br)
        //      from knowledge at design-time
        // set tag's attributes
        // push tag onto stack
        // push tag onto dom stack
    }

    function processCloseTag (content) {
        // grab tag off stack
        // if this tag closes it, remove it from the stack
        // handle dom stack as needed
    }
}

The non-obvious trick is that there are two stacks to be maintained: the DOM stack, which is what your code will use, and the tag stack, which the parser relies on internally for proper parsing. Combining the two makes parsing much harder, if it is possible at all.
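
Here is one way the two structures could look in runnable PHP. Treat it as a sketch under stated assumptions rather than the parser described above: the names Node, HtmlDomSketch and VOID_TAGS are mine, the DOM is kept as a tree of Node objects, the split pattern is simplified and assumes no > inside attribute values, valueless attributes are ignored, and it needs PHP 7.1 or later.

```php
<?php
// Assumption: tags known at design time to never take a closing tag.
const VOID_TAGS = ['br', 'hr', 'img', 'input', 'link', 'meta'];

class Node
{
    public $name;
    public $attributes;
    public $children = []; // child Node objects and plain text strings

    public function __construct(string $name, array $attributes = [])
    {
        $this->name = $name;
        $this->attributes = $attributes;
    }
}

class HtmlDomSketch
{
    private $open = []; // tag stack: elements still waiting for a close tag

    public function parse(string $html): Node
    {
        $root = new Node('#root'); // the DOM handed back to calling code
        $this->open = [$root];
        // Simplified split: assumes no ">" inside attribute values.
        $tokens = preg_split('/(<[^<>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($tokens as $index => $token) {
            if ($index % 2 === 0) {
                $this->renderContent(html_entity_decode(trim($token)));
            } else {
                $this->processTag(substr($token, 1, -1));
            }
        }
        return $root;
    }

    private function top(): Node
    {
        return $this->open[count($this->open) - 1];
    }

    private function renderContent(string $text): void
    {
        if ($text !== '') {
            $this->top()->children[] = $text;
        }
    }

    private function processTag(string $content): void
    {
        if ($content[0] === '!' || $content[0] === '?') {
            return; // doctype, comment or processing instruction: skipped here
        }
        if ($content[0] === '/') {
            $this->processCloseTag(strtolower(trim(substr($content, 1))));
        } else {
            $this->processOpenTag($content);
        }
    }

    private function processOpenTag(string $content): void
    {
        if (!preg_match('/^([a-zA-Z][^\s\/]*)(.*)$/s', $content, $m)) {
            return; // not a recognisable tag name
        }
        $name = strtolower($m[1]);
        $attributes = [];
        preg_match_all('/([\w:-]+)\s*=\s*("[^"]*"|\'[^\']*\'|[^\s"\']+)/', $m[2], $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $attributes[strtolower($pair[1])] = html_entity_decode(trim($pair[2], "\"'"));
        }
        $node = new Node($name, $attributes);
        $this->top()->children[] = $node; // DOM side
        $selfClosing = substr($content, -1) === '/' || in_array($name, VOID_TAGS, true);
        if (!$selfClosing) {
            $this->open[] = $node; // tag-stack side
        }
    }

    private function processCloseTag(string $name): void
    {
        // Search down the tag stack for the nearest matching open element.
        for ($i = count($this->open) - 1; $i > 0; $i--) {
            if ($this->open[$i]->name === $name) {
                array_splice($this->open, $i); // also drops unclosed tags inside it
                return;
            }
        }
        // No matching open tag: a stray close tag, ignored in this sketch.
    }
}

// The <b> is never closed; </p> closes it along with the paragraph.
$dom = (new HtmlDomSketch())->parse('<div><p>Hello <b>world</p><br>Bye</div>');
$div = $dom->children[0];
```

In the sample, the div ends up with three children: the p element, the br element and the text "Bye". The forgotten closing b tag does no damage because the close tag for p removes everything above p from the tag stack, while the tree that calling code reads is left untouched.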

Once you have a structured, JavaScript-like DOM representing your HTML document, querying and processing it become much easier. But once you’ve dealt with all the bad HTML people write, you’ll strive to write better HTML yourself, and have a newfound respect for everyone from the kids at Firefox to the Microsoft devs who do Trident.
