html-parsing: Permissive Parsing of HTML to SXML/xexp in Racket

Version: 0.2

Neil Van Dyke

License: LGPL 3
Web: http://www.neilvandyke.org/racket-html-parsing/

(require (planet neil/html-parsing:1:=1))

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software agents that extract information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML, even when invalid, may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing extends this by attempting to recover syntactic structure.
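To illustrate what processing the emitted SXML/xexp can look like, here is a minimal sketch that walks an xexp and collects link targets. It does not call the parser: the input is a hand-copied fragment of the xexp shown in this documentation's example, and collect-hrefs is a hypothetical helper written for this illustration, not part of html-parsing. SXPath would be the conventional tool for this job; a hand-written walker is used here only to avoid assuming a particular SXPath package.

```racket
#lang racket/base
(require racket/list)

;; A fragment of the xexp from this documentation's html->xexp example.
(define doc
  '(*TOP* (html (head (title) (title "whatever"))
                (body "\n"
                      (a (@ (href "url")) "link")
                      (p " But not done yet...")))))

;; collect-hrefs : xexp -> (listof string)
;; Hypothetical helper: gathers the href value of every `a` element.
(define (collect-hrefs x)
  (cond
    [(not (pair? x)) '()]      ; strings and other atoms carry no links
    [(eq? (car x) '@) '()]     ; do not descend into attribute lists
    [else
     ;; href values attached directly to this element, if it is an `a`
     (define here
       (if (eq? (car x) 'a)
           (for*/list ([child (in-list (cdr x))]
                       #:when (and (pair? child) (eq? (car child) '@))
                       [attr (in-list (cdr child))]
                       #:when (and (pair? attr)
                                   (eq? (car attr) 'href)
                                   (pair? (cdr attr))))
             (cadr attr))
           '()))
     ;; plus whatever the child nodes contribute
     (append here (append-map collect-hrefs (cdr x)))]))

(collect-hrefs doc) ; => '("url")
```

The walker skips attribute lists when recursing, so an attribute that happens to be named like an element is not mistaken for one.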

The parsing behavior of html-parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
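As a rough sketch of that permissiveness, the following feeds deliberately malformed markup to the parser. The input string is a hypothetical example, and the require form is the one given at the top of this documentation; substitute whatever form matches your installation. The exact shape of the result is not asserted, only that a list comes back instead of a parse error.

```racket
#lang racket/base
;; Require form copied from this documentation; adjust if the
;; package was installed some other way.
(require (planet neil/html-parsing:1:=1))

;; Hypothetical malformed input: an unquoted attribute value,
;; unclosed <b> and <li> elements, and a stray </i> end tag.
(define broken "<ul><li class=x>one <b>two<li>three</i></ul>")

;; The parse is expected to return an xexp rather than raise an error.
(define result (html->xexp broken))

;; An xexp document is a list, so this should hold for any input.
(pair? result)
```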

html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.

This package obsoletes HtmlPrag.

2 Interface

(html->xexp input) → any/c
input : any/c

Permissively parse HTML from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:

(html->xexp
 "<html><head><title></title><title>whatever</title></head><body>\n<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">\n<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>\nstill &lt; bold </b></body><P> But not done yet...")

==>
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))

Note that, in the emitted SXML/xexp, the text token "still < bold" is not inside the b element, which represents an unfortunate failure to emulate all the quirks-handling behavior of some popular Web browsers.
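Since input may be either a string or an input port, the two call styles can be sketched side by side. The sample markup is a hypothetical input, and the require form is the one given at the top of this documentation. The comparison at the end reflects the expectation that both input kinds describe the same character stream; it is an assumption of this sketch, not a documented guarantee.

```racket
#lang racket/base
;; Require form copied from this documentation; adjust if the
;; package was installed some other way.
(require (planet neil/html-parsing:1:=1))

;; Hypothetical sample markup.
(define sample "<p>Hello <b>world</b></p>")

;; Parse directly from a string.
(define from-string (html->xexp sample))

;; Parse the same markup from an input port.
(define from-port (html->xexp (open-input-string sample)))

;; Both calls are expected to produce the same xexp.
(equal? from-string from-port)
```

For a file on disk, the same call can be wrapped in call-with-input-file from racket/base, which passes an open input port to a procedure and closes it afterwards.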

3 History

Version 0.2 — 2011-08-27 - PLaneT (1 1)
Fixed embarrassing bug due to code tidying. Thanks to Danny Yoo for reporting.
Version 0.1 — 2011-08-21 - PLaneT (1 0)
Part of forked development from HtmlPrag.

4 Legal

Copyright (c) 2003–2011 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License (LGPL 3), or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.

Standard Documentation Format Note: The API signatures in this documentation are likely incorrect in some regards, such as indicating type any/c for things that are not, and not indicating when arguments are optional. This is due to an ongoing transition from the Texinfo documentation format to Scribble, which the author intends to finish someday.