3:0
html-parsing: Permissive Parsing of HTML to SXML/xexp in Racket
(require (planet neil/html-parsing:3:0))    package: base
1 Introduction
The html-parsing library provides a permissive HTML parser. The parser is useful
for software agent extraction of information from Web pages, for
programmatically transforming HTML files, and for implementing interactive Web
browsers. html-parsing emits SXML/xexp, so that conventional HTML, including
invalid HTML, may be processed with XML tools such as SXPath. Like Oleg
Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive
tokenizer, but html-parsing goes further by attempting to recover syntactic
structure.
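Because SXML/xexp is ordinary Racket list data, it can also be traversed with plain Racket code, without any XML tooling. The following is a minimal sketch and not part of html-parsing: collect-hrefs is a hypothetical helper, and sample is a hand-written xexp in the shape the parser emits, so the snippet runs without the library installed.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical helper (not provided by html-parsing): collect every
;; href attribute value found anywhere in an SXML/xexp tree.
(define (collect-hrefs xexp)
  (match xexp
    ;; An attribute list such as (@ (href "url") (compact)):
    ;; keep only href entries that actually carry a value.
    [(list '@ attrs ...)
     (for/list ([a (in-list attrs)]
                #:when (and (pair? a)
                            (eq? (car a) 'href)
                            (pair? (cdr a))))
       (cadr a))]
    ;; Any other element, including *TOP*: recur into the children.
    [(list (? symbol?) children ...)
     (apply append (map collect-hrefs children))]
    ;; Strings and anything else carry no attributes.
    [_ '()]))

;; Hand-written sample data, assumed to resemble parser output.
(define sample
  '(*TOP* (html (body (a (@ (href "url")) "link")
                      (p (a (@ (href "other")) "more"))))))

(collect-hrefs sample) ; => '("url" "other")
```

For larger queries, a path language such as SXPath is likely the better fit; the hand-rolled walk is only meant to show that the output is plain data.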
html-parsing’s parsing behavior is permissive in that it accepts erroneous HTML,
handling several classes of HTML syntax errors gracefully, without yielding a
parse error. This is crucial for parsing arbitrary real-world Web pages, since
many pages actually contain syntax errors that would defeat a strict or
validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web
browsers’ interpretation of the structure of erroneous HTML. We
euphemistically term this kind of parse “pragmatic.”
html-parsing also has some support for XHTML: XML namespace qualifiers are
accepted but stripped from the resulting SXML/xexp. Note that valid XHTML input
might be better handled by a validating XML parser like Kiselyov’s SSAX.
This package obsoletes HtmlPrag.
2 Interface
The html->xexp procedure parses HTML permissively from input, which is either
an input port or a string, and emits an SXML/xexp equivalent or approximation.
To borrow and slightly modify an example from Kiselyov’s discussion of his HTML
parser:
> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still < bold </b>"
    "</body><P> But not done yet..."))
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))
Note that, in the emitted SXML/xexp, the text token "still < bold " is not
inside the b element. This is an unfortunate failure to emulate all the
quirks-handling behavior of some popular Web browsers.
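Since input may be either a string or an input port, the same markup can be handed over in both forms. The sketch below assumes the library is installed and reachable as (require html-parsing); installations that use the PLaneT form given at the top of this document would substitute that require line. The markup string is made up for illustration, and no particular parse result is claimed here beyond the two calls being compared with each other.

```racket
#lang racket/base
;; Assumption: html-parsing is installed as a package. With PLaneT,
;; use (require (planet neil/html-parsing:3:0)) instead.
(require html-parsing)

;; Illustrative markup with an unclosed <b> element.
(define markup "<p>Hello <b>world</p>")

;; Parse from a string ...
(define from-string (html->xexp markup))

;; ... and from an input port over the same text.
(define from-port (html->xexp (open-input-string markup)))

;; Compare the two results.
(equal? from-string from-port)
```

Passing a port is the more natural choice when the HTML comes from a file or a network connection, since the whole document need not be read into a string first.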
3 History
- PLaneT 3:0 (2015-04-24): Numeric character entities now parse to Racket strings instead of Racket characters, to bring SXML/xexp back closer to SXML. Thanks to John Clements for reporting.
- PLaneT 2:0 (2012-06-13): Converted to McFly.
- Version 0.3, PLaneT 1:2 (2011-08-27): Converted test suite from Testeez to Overeasy.
- Version 0.2, PLaneT 1:1 (2011-08-27): Fixed embarrassing bug due to code tidying. Thanks to Danny Yoo for reporting.
- Version 0.1, PLaneT 1:0 (2011-08-21): Part of forked development from HtmlPrag.
4 Legal
Copyright 2003 – 2015 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.