Version: 0.1

html-parsing: Permissive Parsing of HTML to SXML/xexp in Racket

Neil Van Dyke

License: LGPL 3   Web: http://www.neilvandyke.org/racket-html-parsing/

 (require (planet neil/html-parsing:1:=0))

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML, including invalid HTML, may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing goes further by attempting to recover syntactic structure.
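Because SXML/xexp is an ordinary S-expression, simple extraction can also be sketched with plain list traversal instead of SXPath. The following is a minimal sketch, not part of html-parsing: collect-hrefs and collect-hrefs* are hypothetical helpers, and sample is a made-up xexp written by hand in the shape that the parser emits, so the snippet runs without the library installed.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical helper (not provided by html-parsing): walk an xexp and
;; return the href values of all `a` elements, in document order.
(define (collect-hrefs xexp)
  (match xexp
    ;; An `a` element with an attribute list: take its href values,
    ;; then keep walking its children.
    [(list 'a (list '@ attrs ...) children ...)
     (append (for/list ([attr (in-list attrs)]
                        #:when (and (pair? attr)
                                    (eq? (car attr) 'href)
                                    (pair? (cdr attr))))
               (cadr attr))
             (collect-hrefs* children))]
    ;; Any other list headed by a symbol (an element, *TOP*, and so on):
    ;; walk everything after the head symbol.
    [(list (? symbol?) rest ...) (collect-hrefs* rest)]
    ;; Strings and other atoms contribute nothing.
    [_ '()]))

;; Hypothetical helper: apply collect-hrefs to each node and concatenate.
(define (collect-hrefs* nodes)
  (apply append (map collect-hrefs nodes)))

;; Hand-written sample xexp (an assumption, not actual parser output).
(define sample
  '(*TOP* (html (body (a (@ (href "url")) "link")
                      (p "see " (a (@ (href "other")) "this"))))))

(collect-hrefs sample)
```

For anything beyond simple cases like this, an XML tool such as SXPath is likely the better choice, since it already handles element and attribute selection.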

html-parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully rather than yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”

html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.

This package is intended to replace HtmlPrag.

2 Interface

(html->xexp input)  any/c
  input : any/c

Permissively parse HTML from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:

 (html->xexp
  "<html><head><title></title><title>whatever</title></head><body>\n<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">\n<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>\nstill &lt; bold </b></body><P> But not done yet...")
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "still < bold "))
             (p " But not done yet...")))

Note that, in the emitted SXML/xexp, the text token "still < bold" is not inside the b element, which represents an unfortunate failure to emulate all the quirks-handling behavior of some popular Web browsers.
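As described above, input may be either a string or an input port. The following minimal sketch parses the same markup both ways and compares the results. It assumes the library is installed from the Racket package catalog under the name html-parsing; with a PLaneT installation, the require form shown at the top of this document would be used instead. The names markup, from-string, and from-port are illustrative only.

```racket
#lang racket/base
;; Assumption: the package-catalog release of the library, required as
;; `html-parsing`. For PLaneT, substitute the require form given at the
;; top of this document.
(require html-parsing)

;; Illustrative input; any HTML string would do.
(define markup "<p>Hello, <b>world</b></p>")

;; Parse directly from a string.
(define from-string (html->xexp markup))

;; Parse the same markup from an input port.
(define from-port (html->xexp (open-input-string markup)))

;; Both results are xexps headed by *TOP*, and they should be equal.
(equal? from-string from-port)
```

Passing a port is convenient when the HTML comes from a file or a network connection, since the whole document does not have to be read into a string first.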

3 History

4 Legal

Copyright (c) 2003–2011 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License (LGPL 3), or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.

Standard Documentation Format Note: The API signatures in this documentation are likely incorrect in some regards, such as indicating type any/c for things that are not, and not indicating when arguments are optional. This is due to an ongoing transition from the Texinfo documentation format to Scribble, which the author intends to finish someday.