#lang scribble/doc
@; THIS FILE IS GENERATED
@(require scribble/manual)
@(require (for-label (planet neil/htmlprag:1:6)))

@title[#:version "0.19"]{@bold{HtmlPrag}: Pragmatic Parsing and Emitting of HTML using SXML and SHTML}
@author{Neil Van Dyke}

License: @seclink["Legal" #:underline? #f]{LGPL 3} @(hspace 1) Web: @link["http://www.neilvandyke.org/htmlprag/" #:underline? #f]{http://www.neilvandyke.org/htmlprag/}

@defmodule[(planet neil/htmlprag:1:6)]

@section{Introduction}

HtmlPrag provides permissive HTML parsing and emitting capability to Scheme programs. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. HtmlPrag emits ``SHTML,'' which is an encoding of HTML in @link["http://pobox.com/~oleg/ftp/Scheme/SXML.html"]{SXML}, so that conventional HTML may be processed with XML tools such as @link["http://pair.com/lisovsky/query/sxpath/"]{SXPath}. Like Oleg Kiselyov's @link["http://pobox.com/~oleg/ftp/Scheme/xml.html#HTML-parser"]{SSAX-based HTML parser}, HtmlPrag provides a permissive tokenizer, but also attempts to recover structure. HtmlPrag also includes procedures for encoding SHTML in HTML syntax.

The HtmlPrag parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. HtmlPrag's handling of errors is intended to generally emulate popular Web browsers' interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse ``pragmatic.''

HtmlPrag also has some support for XHTML, although XML namespace qualifiers are currently accepted but stripped from the resulting SHTML. Note that valid XHTML input is of course better handled by a strict XML parser.
HtmlPrag requires R5RS, SRFI-6, and SRFI-23. This version of HtmlPrag is specific to PLT Scheme, due to a transition period in how portability is handled, but the exceedingly portable version 0.16 is available at: @link["http://www.neilvandyke.org/htmlprag/htmlprag-0-16.scm"]{http://www.neilvandyke.org/htmlprag/htmlprag-0-16.scm}

@section{SHTML and SXML}

SHTML is a variant of SXML, with two minor but useful extensions:

@itemize[

@item{
The SXML keyword symbols, such as @tt{*TOP*}, are defined to be in all uppercase, regardless of the case-sensitivity of the reader of the hosting Scheme implementation in any context. This avoids several pitfalls.
}

@item{
Since not all character entity references used in HTML can be converted to Scheme characters in all R5RS Scheme implementations, nor represented in conventional text files or other common external text formats to which one might wish to write SHTML, SHTML adds a special @tt{&} syntax for non-ASCII (or non-Extended-ASCII) characters. The syntax is @tt{(& @schemevarfont{val})}, where @schemevarfont{val} is a symbol or string with the symbolic name of the character, or an integer with the numeric value of the character.
}

]

@defthing[shtml-comment-symbol any/c]{}
@defthing[shtml-decl-symbol any/c]{}
@defthing[shtml-empty-symbol any/c]{}
@defthing[shtml-end-symbol any/c]{}
@defthing[shtml-entity-symbol any/c]{}
@defthing[shtml-pi-symbol any/c]{}
@defthing[shtml-start-symbol any/c]{}
@defthing[shtml-text-symbol any/c]{}
@defthing[shtml-top-symbol any/c]{
These variables are bound to the following case-sensitive symbols used in SHTML, respectively: @tt{*COMMENT*}, @tt{*DECL*}, @tt{*EMPTY*}, @tt{*END*}, @tt{*ENTITY*}, @tt{*PI*}, @tt{*START*}, @tt{*TEXT*}, and @tt{*TOP*}. These can be used in lieu of the literal symbols in programs read by a case-insensitive Scheme reader.
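
As a minimal sketch of that use, the predicate below tests whether an SHTML node is a comment by comparing against @tt{shtml-comment-symbol} rather than a literal symbol. The name @tt{shtml-comment?} is hypothetical and is not part of HtmlPrag; the node passed to it is built by hand for illustration only.

@SCHEMEBLOCK[
;; shtml-comment? is a hypothetical helper, not an HtmlPrag procedure.
;; It never writes the literal symbol, so reader case folding is harmless.
(define (shtml-comment? node)
  (and (pair? node)
       (eq? (car node) shtml-comment-symbol)))
(shtml-comment? (list shtml-comment-symbol " a comment ")) ==> #t
(shtml-comment? "text")                                    ==> #f
]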
}

@defthing[shtml-named-char-id any/c]{}
@defthing[shtml-numeric-char-id any/c]{
These variables are bound to the SHTML entity public identifier strings used in SHTML @tt{*ENTITY*} named and numeric character entity references.
}

@defform[(make-shtml-entity val)]{
Yields an SHTML character entity reference for @schemevarfont{val}. For example:

@SCHEMEBLOCK[
(make-shtml-entity "rArr")                  ==> (& rArr)
(make-shtml-entity (string->symbol "rArr")) ==> (& rArr)
(make-shtml-entity 151)                     ==> (& 151)
]
}

@defform[(shtml-entity-value obj)]{
Yields the value for the SHTML entity @schemevarfont{obj}, or @tt{#f} if @schemevarfont{obj} is not a recognized entity. Values of named entities are symbols, and values of numeric entities are numbers. An error may be raised if @schemevarfont{obj} is an entity with a system ID inconsistent with its public ID. For example:

@SCHEMEBLOCK[
(define (f s) (shtml-entity-value (cadr (html->shtml s))))
(f "&nbsp;")  ==> nbsp
(f "&#2000;") ==> 2000
]
}

@section{Tokenizing}

The tokenizer is used by the higher-level structural parser, but can also be called directly for debugging purposes or unusual applications. Some of the list structure of tokens, such as for start tag tokens, is mutated and incorporated into the SHTML list structure emitted by the parser.

@defform[(make-html-tokenizer in normalized?)]{
Constructs an HTML tokenizer procedure on input port @schemevarfont{in}. If boolean @schemevarfont{normalized?} is true, then tokens will be in a format conducive to use with a parser emitting normalized SXML. Each call to the resulting procedure yields a successive token from the input. When the tokens have been exhausted, the procedure returns the null list.
For example:

@SCHEMEBLOCK[
(define input (open-input-string "<a href=\"foo\">bar</a>"))
(define next  (make-html-tokenizer input #f))
(next) ==> (a (\@ (href "foo")))
(next) ==> "bar"
(next) ==> (*END* a)
(next) ==> ()
(next) ==> ()
]
}

@defform[(tokenize-html in normalized?)]{
Returns a list of tokens from input port @schemevarfont{in}, normalizing according to boolean @schemevarfont{normalized?}. This is probably most useful as a debugging convenience. For example:

@SCHEMEBLOCK[
(tokenize-html (open-input-string "<a href=\"foo\">bar</a>") #f)
==> ((a (\@ (href "foo"))) "bar" (*END* a))
]
}

@defform[(shtml-token-kind token)]{
Returns a symbol indicating the kind of tokenizer @schemevarfont{token}: @tt{*COMMENT*}, @tt{*DECL*}, @tt{*EMPTY*}, @tt{*END*}, @tt{*ENTITY*}, @tt{*PI*}, @tt{*START*}, or @tt{*TEXT*}. This is used by higher-level parsing code. For example:

@SCHEMEBLOCK[
(map shtml-token-kind
     (tokenize-html (open-input-string "<a<b>><c</</c") #f))
==> (*START* *START* *TEXT* *START* *END* *END*)
]
}

@section{Parsing}

Most applications will call a parser procedure such as @tt{html->shtml} rather than calling the tokenizer directly.

@defform[(parse-html/tokenizer tokenizer normalized?)]{
Emits a parse tree like @tt{html->shtml} and related procedures, except using @schemevarfont{tokenizer} as a source of tokens, rather than tokenizing from an input port. This procedure is used internally, and generally should not be called directly.
}

@defform[(html->sxml-0nf input)]{}
@defform[(html->sxml-1nf input)]{}
@defform[(html->sxml-2nf input)]{}
@defform[(html->sxml input)]{}
@defform[(html->shtml input)]{
Permissively parses HTML from @schemevarfont{input}, which is either an input port or a string, and emits an SHTML equivalent or approximation. To borrow and slightly modify an example from Kiselyov's discussion of his HTML parser:

@SCHEMEBLOCK[
(html->shtml "whatever link