#lang scribble/doc
@; THIS FILE IS GENERATED
@(require scribble/manual)
@(require (for-label (planet neil/html-parsing:1:=0)))
@(require (for-label racket))
@title[#:version "0.1"]{@bold{html-parsing}: Permissive Parsing of HTML to SXML/xexp in Racket}
@author{Neil Van Dyke}

License: @seclink["Legal" #:underline? #f]{LGPL 3} @(hspace 1) Web: @link["http://www.neilvandyke.org/racket-html-parsing/" #:underline? #f]{http://www.neilvandyke.org/racket-html-parsing/}

@defmodule[(planet neil/html-parsing:1:=0)]

@section{Introduction}

The @tt{html-parsing} library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. @tt{html-parsing} emits @link["http://www.neilvandyke.org/racket-xexp/"]{SXML/@emph{xexp}}, so that conventional invalid HTML may be processed with XML tools such as @link["http://pair.com/lisovsky/query/sxpath/"]{SXPath}. Like Oleg Kiselyov's @link["http://pobox.com/~oleg/ftp/Scheme/xml.html#HTML-parser"]{SSAX-based HTML parser}, @tt{html-parsing} provides a permissive tokenizer, but @tt{html-parsing} extends this by attempting to recover syntactic structure.

The @tt{html-parsing} parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. @tt{html-parsing}'s handling of errors is intended to generally emulate popular Web browsers' interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse ``pragmatic.''

@tt{html-parsing} also has some support for XHTML, though XML namespace qualifiers are accepted and then stripped from the resulting SXML/@emph{xexp}.
Note that @emph{valid} XHTML input might be better handled by a validating XML parser like Kiselyov's @link["http://pobox.com/~oleg/ftp/Scheme/xml.html#XML-parser"]{SSAX}.

This package will be replacing @link["http://www.neilvandyke.org/racket-xexp/"]{HtmlPrag}.

@section{Interface}

@defproc[(html->xexp (input any/c)) any/c]{
Permissively parse HTML from @schemevarfont{input}, which is either an input port or a string, and emit an SXML/@emph{xexp} equivalent or approximation. To borrow and slightly modify an example from Kiselyov's discussion of his HTML parser:

@SCHEMEBLOCK[
(html->xexp "<html><head><title></title><title>whatever</title></head><body>
<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
still &lt; bold </b></body><P> But not done yet...")
==>
(*TOP* (html (head (title) (title "whatever"))
(body "\n"
(a (\@ (href "url")) "link")
(p (\@ (align "center"))
(ul (\@ (compact) (style "aa")) "\n"))
(p "BLah"
(*COMMENT* " comment