html-parsing: Permissive Parsing of HTML to SXML/xexp in Racket
 (require (planet neil/html-parsing:3:0))    package: base
The html-parsing library provides a permissive HTML parser. The parser is useful
for software agent extraction of information from Web pages, for
programmatically transforming HTML files, and for implementing interactive Web
browsers. html-parsing emits SXML/xexp, so that conventional invalid HTML may be
processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML
parser, html-parsing provides a permissive tokenizer, but html-parsing
extends this by attempting to recover syntactic structure.
The html-parsing parsing behavior is permissive in that it accepts erroneous HTML,
handling several classes of HTML syntax errors gracefully, without yielding a
parse error. This is crucial for parsing arbitrary real-world Web pages, since
many pages actually contain syntax errors that would defeat a strict or
validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web
browsers’ interpretation of the structure of erroneous HTML. We
euphemistically term this kind of parse “pragmatic.”
html-parsing also has some support for XHTML: XML namespace qualifiers
are accepted, but they are stripped from the resulting SXML/xexp. Note that valid
XHTML input might be better handled by a validating XML parser
like Kiselyov’s SSAX.
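To make the namespace behavior concrete, here is a minimal sketch of what stripping a qualifier from an element name means. The helper strip-qualifier is hypothetical and written only for illustration. It is not part of html-parsing, and the library's own implementation may differ.

```racket
#lang racket/base

;; Hypothetical helper, not part of html-parsing: drop everything up to
;; and including the last colon in a qualified element name.
(define (strip-qualifier name)
  (string->symbol
   (regexp-replace #rx"^.*:" (symbol->string name) "")))

(strip-qualifier 'xhtml:p) ; => 'p
(strip-qualifier 'p)       ; => 'p (names without a qualifier are unchanged)
```

Under this assumption, an element written as xhtml:p in the input would appear simply as p in the emitted SXML/xexp.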
(html->xexp input) → xexp
  input : (or/c input-port? string?)
Parse HTML permissively from input, which is either an input port or a string, and emit an
SXML/xexp equivalent or approximation. To borrow and slightly modify an
example from Kiselyov’s discussion of his HTML parser:
  (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still < bold </b>"
    "</body><P> But not done yet..."))
Should produce:
  (*TOP* (html (head (title) (title "whatever"))
               (body "\n"
                     (a (@ (href "url")) "link")
                     (p (@ (align "center"))
                        (ul (@ (compact) (style "aa")) "\n"))
                     (p "BLah"
                        (*COMMENT* " comment <comment> ")
                        " "
                        (i " italic " (b " bold " (tt " ened")))
                        "still < bold "))
               (p " But not done yet...")))
Note that, in the emitted SXML/xexp, the text token "still < bold " is not
inside the b element, which represents an unfortunate failure to emulate all
the quirks-handling behavior of some popular Web browsers.
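Because the result is ordinary Racket list data, it can be walked with standard tools. The following is a minimal sketch, assuming only racket/list and racket/match. It operates on a quoted fragment of the output shown above rather than calling the parser, and collect-hrefs is a hypothetical helper, not part of html-parsing.

```racket
#lang racket/base
(require racket/list racket/match)

;; A fragment of the xexp shown above, quoted as plain Racket data.
(define doc
  '(*TOP* (html (head (title) (title "whatever"))
                (body "\n"
                      (a (@ (href "url")) "link")
                      (p (@ (align "center"))
                         (ul (@ (compact) (style "aa")) "\n")))
                (p " But not done yet..."))))

;; Hypothetical helper: return the href value of every `a` element,
;; in document order.
(define (collect-hrefs x)
  (match x
    ;; An `a` element with an attribute list: keep its href values,
    ;; then continue into its children.
    [(list 'a (list '@ attrs ...) kids ...)
     (append (for/list ([attr (in-list attrs)]
                        #:when (and (pair? attr)
                                    (eq? (car attr) 'href)
                                    (pair? (cdr attr))))
               (cadr attr))
             (append-map collect-hrefs kids))]
    ;; Attribute lists of other elements contain no links.
    [(list '@ _ ...) '()]
    ;; Any other element, including *TOP*: recurse into its children.
    [(list (? symbol?) kids ...) (append-map collect-hrefs kids)]
    ;; Strings and anything else.
    [_ '()]))

(collect-hrefs doc) ; => '("url")
```

The same traversal could be written with an XML-oriented tool, but plain pattern matching keeps the sketch free of additional dependencies.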
PLaneT 3:0 — 2015-04-24
Numeric character entities now parse to Racket strings instead
of Racket characters, to bring SXML/xexp back closer to SXML. Thanks to John
Clements for reporting.
PLaneT 2:0 — 2012-06-13
Converted to McFly.
Version 0.3 — PLaneT 1:2 — 2011-08-27
Converted test suite from Testeez to Overeasy.
Version 0.2 — PLaneT 1:1 — 2011-08-27
Fixed embarrassing bug due to code tidying. Thanks to Danny
Yoo for reporting.
Version 0.1 — PLaneT 1:0 — 2011-08-21
Part of forked development from HtmlPrag.
Copyright 2003 – 2015 Neil Van Dyke. This program is Free Software; you
can redistribute it and/or modify it under the terms of the GNU Lesser General
Public License as published by the Free Software Foundation; either version 3
of the License, or (at your option) any later version. This program is
distributed in the hope that it will be useful, but without any warranty;
without even the implied warranty of merchantability or fitness for a
particular purpose. See http://www.gnu.org/licenses/ for details. For other
licenses and consulting, please contact the author.