html-parsing: Permissive Parsing of HTML to SXML/xexp in Racket

3:0

Neil Van Dyke

License: LGPLv3
Web: http://www.neilvandyke.org/racket-html-parsing/

(require (planet neil/html-parsing:3:0))

package: base

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software-agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML, including invalid markup, may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing goes further by also attempting to recover syntactic structure.
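As a minimal sketch of processing the parser's output with SXPath, the following collects link targets from a small page. It assumes the third-party sxml package is installed and that its sxpath procedure is available via (require sxml); that package is separate from html-parsing. The names doc and hrefs are illustrative only. On installations that use the package catalog rather than PLaneT, the require form may instead be (require html-parsing), which is likewise an assumption.

```racket
#lang racket/base
;; Assumption: the separate `sxml` package is installed and exports `sxpath`.
(require (planet neil/html-parsing:3:0)
         sxml)

;; Sloppy but typical HTML: the <li> and <p> elements are never closed.
(define doc
  (html->xexp
   (string-append
    "<ul><li><a href=\"one.html\">One</a>"
    "<li><a href=\"two.html\">Two</a></ul><p>tail")))

;; Select the text of the href attribute of every <a> element,
;; at any depth below the *TOP* node.
(define hrefs ((sxpath '(// a @ href *text*)) doc))

;; Print the collected link targets.
hrefs
```

Because the result of html->xexp is ordinary list data, the same query style applies whether or not the input was valid HTML.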

The parsing behavior of html-parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages contain syntax errors that would defeat a strict or validating parser. html-parsing’s error handling is intended to emulate, in general, how popular Web browsers interpret the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
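To illustrate the permissive behavior, the sketch below feeds a few malformed fragments to html->xexp and prints whatever structure comes back. The fragments and the name broken-fragments are hypothetical, chosen only for illustration. The exact shape of each result depends on the parser's recovery rules, so none is asserted here.

```racket
#lang racket/base
(require (planet neil/html-parsing:3:0))

;; Hypothetical malformed inputs, for illustration only.
(define broken-fragments
  (list "<p>unclosed paragraph"
        "<b><i>mis-nested</b></i>"
        "<a href=\"x\">stray close tag</a></div>"
        "<ul><li>one<li>two"))

;; Each call is expected to return an xexp rather than raise a parse error.
(for ([html (in-list broken-fragments)])
  (define xexp (html->xexp html))
  (printf "~s\n  => ~s\n" html xexp))
```

A strict XML parser would typically reject every one of these inputs, which is why a permissive parser is the better fit for real-world pages.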

html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.

This package obsoletes HtmlPrag.

2 Interface

procedure
(html->xexp input) → xexp
input : (or/c input-port? string?)

Parse HTML permissively from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:

> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still &lt; bold </b>"
    "</body><P> But not done yet..."))

(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))

Note that, in the emitted SXML/xexp, the text token still < bold is not inside the b element, which represents an unfortunate failure to emulate all the quirks-handling behavior of some popular Web browsers.
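Since input may be either a string or an input port, the same document can be parsed by either route. The following is a minimal sketch; the sample markup and the names html, from-string, and from-port are illustrative. A string port stands in for a port opened on a file or network connection.

```racket
#lang racket/base
(require (planet neil/html-parsing:3:0))

;; Illustrative sample markup.
(define html "<html><body><p>Hello, <b>world</b></p></body></html>")

;; Parse directly from a string.
(define from-string (html->xexp html))

;; Parse the same characters from an input port.
(define from-port (html->xexp (open-input-string html)))

;; Both routes are expected to describe the same document.
(equal? from-string from-port)
```

Parsing from a port avoids reading a large document into a string first.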

3 History

PLaneT 3:0 — 2015-04-24
Numeric character entities now parse to Racket strings instead of Racket characters, to bring SXML/xexp back closer to SXML. Thanks to John Clements for reporting.
PLaneT 2:0 — 2012-06-13
Converted to McFly.
Version 0.3 — PLaneT 1:2 — 2011-08-27
Converted test suite from Testeez to Overeasy.
Version 0.2 — PLaneT 1:1 — 2011-08-27
Fixed embarrassing bug due to code tidying. Thanks to Danny Yoo for reporting.
Version 0.1 — PLaneT 1:0 — 2011-08-21
Part of forked development from HtmlPrag.

4 Legal

Copyright 2003 – 2015 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.