Version 0.16, 2005-12-18, http://www.neilvandyke.org/htmlprag/
by Neil W. Van Dyke <neil@neilvandyke.org>
Copyright © 2003 - 2005 Neil W. Van Dyke. This program is Free
Software; you can redistribute it and/or modify it under the terms of the
GNU Lesser General Public License as published by the Free Software
Foundation; either version 2.1 of the License, or (at your option) any
later version. This program is distributed in the hope that it will be
useful, but without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose. See
<http://www.gnu.org/copyleft/lesser.html> for details. For other license
options and consulting, contact the author.
HtmlPrag provides permissive HTML parsing and emitting capability to Scheme programs. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. HtmlPrag emits “SHTML,” which is an encoding of HTML in SXML, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov's SSAX-based HTML parser, HtmlPrag provides a permissive tokenizer, but also attempts to recover structure. HtmlPrag also includes procedures for encoding SHTML in HTML syntax.
The HtmlPrag parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. HtmlPrag's handling of errors is intended to generally emulate popular Web browsers' interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
HtmlPrag also has some support for XHTML, although XML namespace qualifiers are currently accepted but stripped from the resulting SHTML. Note that valid XHTML input is of course better handled by a validating XML parser like Kiselyov's SSAX.
HtmlPrag requires R5RS, SRFI-6, and SRFI-23.
SHTML is a variant of SXML, with two minor but useful extensions:
Symbols such as *TOP* are defined to be in all uppercase, regardless of the case-sensitivity of the reader of the hosting Scheme implementation in any context. This avoids several pitfalls.
There is a special & syntax for non-ASCII (or non-Extended-ASCII) characters. The syntax is (& val), where val is a symbol or string giving the symbolic name of the character, or an integer giving the numeric value of the character.
These variables are bound to the following case-sensitive symbols used in SHTML, respectively: *COMMENT*, *DECL*, *EMPTY*, *END*, *ENTITY*, *PI*, *START*, *TEXT*, and *TOP*. These can be used in lieu of the literal symbols in programs read by a case-insensitive Scheme reader.[1]
These variables are bound to the SHTML entity public identifier strings used in SHTML *ENTITY* named and numeric character entity references.
Yields an SHTML character entity reference for val. For example:
(make-shtml-entity "rArr")                  => (& rArr)
(make-shtml-entity (string->symbol "rArr")) => (& rArr)
(make-shtml-entity 151)                     => (& 151)
Yields the value for the SHTML entity obj, or #f if obj is not a recognized entity. Values of named entities are symbols, and values of numeric entities are numbers. An error may be raised if obj is an entity with a system ID inconsistent with its public ID. For example:
(define (f s) (shtml-entity-value (cadr (html->shtml s))))
(f "&nbsp;")  => nbsp
(f "&#2000;") => 2000
The tokenizer is used by the higher-level structural parser, but can also be called directly for debugging purposes or unusual applications. Some of the list structure of tokens, such as for start tag tokens, is mutated and incorporated into the SHTML list structure emitted by the parser.
Constructs an HTML tokenizer procedure on input port in. If boolean normalized? is true, then tokens will be in a format conducive to use with a parser emitting normalized SXML. Each call to the resulting procedure yields a successive token from the input. When the tokens have been exhausted, the procedure returns the null list. For example:
(define input (open-input-string "<a href=\"foo\">bar</a>"))
(define next (make-html-tokenizer input #f))
(next) => (a (@ (href "foo")))
(next) => "bar"
(next) => (*END* a)
(next) => ()
(next) => ()
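The calling protocol described above (call the procedure repeatedly until it yields the null list) can be sketched as a small loop. The names drain-tokenizer and make-stub-tokenizer are hypothetical and not part of HtmlPrag; the stub only stands in for a procedure returned by make-html-tokenizer so that the sketch runs without the library loaded.

```scheme
;; Hypothetical helper (not part of HtmlPrag): call a tokenizer procedure
;; until it returns the null list, collecting the tokens in order.
(define (drain-tokenizer next)
  (let loop ((tokens '()))
    (let ((token (next)))
      (if (null? token)
          (reverse tokens)
          (loop (cons token tokens))))))

;; Hypothetical stand-in for a procedure made by make-html-tokenizer.
;; It replays a fixed list of tokens, then keeps returning the null list.
(define (make-stub-tokenizer tokens)
  (lambda ()
    (if (null? tokens)
        '()
        (let ((token (car tokens)))
          (set! tokens (cdr tokens))
          token))))

(define stub-next
  (make-stub-tokenizer '((a (@ (href "foo"))) "bar" (*END* a))))

(drain-tokenizer stub-next)
;; => ((a (@ (href "foo"))) "bar" (*END* a))
```

With HtmlPrag loaded, the same loop should work on a real tokenizer procedure in place of stub-next, since it relies only on the null-list end marker documented above.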
Returns a list of tokens from input port in, normalizing according to boolean normalized?. This is probably most useful as a debugging convenience. For example:
(tokenize-html (open-input-string "<a href=\"foo\">bar</a>") #f)
=> ((a (@ (href "foo"))) "bar" (*END* a))
Returns a symbol indicating the kind of tokenizer token: *COMMENT*, *DECL*, *EMPTY*, *END*, *ENTITY*, *PI*, *START*, *TEXT*. This is used by higher-level parsing code. For example:
(map shtml-token-kind
     (tokenize-html (open-input-string "<a<b>><c</</c") #f))
=> (*START* *START* *TEXT* *START* *END* *END*)
Most applications will call a parser procedure such as html->shtml rather than calling the tokenizer directly.
Emits a parse tree like html->shtml and related procedures, except using tokenizer as a source of tokens, rather than tokenizing from an input port. This procedure is used internally, and generally should not be called directly.
Permissively parse HTML from input, which is either an input port or a string, and emit an SHTML equivalent or approximation. To borrow and slightly modify an example from Kiselyov's discussion of his HTML parser:
(html->shtml
 "<html><head><title></title><title>whatever</title></head><body>
<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
still &lt; bold </b></body><P> But not done yet...")
=>
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))
Note that in the emitted SHTML the text token "still < bold" is not inside the b element, which represents an unfortunate failure to emulate all the quirks-handling behavior of some popular Web browsers.
The procedures html->sxml-nnf, for n from 0 through 2, correspond to the 0th through 2nd normal forms of SXML as specified in the SXML specification, and indicate the minimal requirements of the emitted SXML.
html->sxml and html->shtml are currently aliases for html->sxml-0nf, and can be used in scripts and interactively, when terseness is important and any normal form of SXML would suffice.
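Because the emitted SHTML is ordinary list structure, simple extraction tasks can be done with plain list operations. The following is a minimal sketch that collects href attribute values from a tree shaped like the html->shtml output shown above. The procedure name shtml-hrefs and the sample data are hypothetical and not part of HtmlPrag, and the code assumes a reader that can read @ as a symbol. For anything more involved, an XML tool such as SXPath is probably the better choice.

```scheme
;; Hypothetical helper (not part of HtmlPrag): collect every href
;; attribute value found in an SHTML tree, in document order.
(define (shtml-hrefs node)
  (cond ((not (pair? node)) '())        ; strings and other atoms
        ((eq? (car node) '@)
         ;; Attribute list of the form (@ (name value) ...).
         (let loop ((attrs (cdr node)) (found '()))
           (cond ((null? attrs) (reverse found))
                 ((and (pair? (car attrs))
                       (eq? (caar attrs) 'href)
                       (pair? (cdar attrs)))
                  (loop (cdr attrs) (cons (cadar attrs) found)))
                 (else (loop (cdr attrs) found)))))
        (else
         ;; Element, *TOP*, or other form: visit each child in order.
         (let loop ((kids (cdr node)) (found '()))
           (if (null? kids)
               found
               (loop (cdr kids)
                     (append found (shtml-hrefs (car kids)))))))))

;; Hand-written sample data in the shape emitted by the parser.
(define sample-shtml
  '(*TOP* (html (body (a (@ (href "url")) "link")
                      (p (@ (align "center")) "text")))))

(shtml-hrefs sample-shtml)
;; => ("url")
```

With HtmlPrag loaded, the result of html->shtml could be passed in place of sample-shtml.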
Two procedures encode the SHTML representation as conventional HTML:
write-shtml-as-html and shtml->html. These are perhaps most useful for
emitting the result of parsed and transformed input HTML. They can also be
used for emitting HTML from generated or handwritten SHTML.
Writes a conventional HTML transliteration of the SHTML shtml to output port out. If out is not specified, the default is the current output port. HTML elements of types that are always empty are written using HTML4-compatible XHTML tag syntax.
If foreign-filter is specified, it is a procedure of two arguments that is applied to any non-SHTML (“foreign”) object encountered in shtml, and should yield SHTML. The first argument is the object, and the second argument is a boolean for whether or not the object is part of an attribute value.
No inter-tag whitespace or line breaks are emitted unless they are explicit in shtml. The shtml should normally include a newline at the end of the document. For example:
(write-shtml-as-html
 '((html (head (title "My Title"))
         (body (@ (bgcolor "white"))
               (h1 "My Heading")
               (p "This is a paragraph.")
               (p "This is another paragraph.")))))
-| <html><head><title>My Title</title></head><body bgcolor="whi
-| te"><h1>My Heading</h1><p>This is a paragraph.</p><p>This is
-|  another paragraph.</p></body></html>
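The foreign-filter argument described above takes the foreign object and the attribute-value flag, and returns SHTML. A minimal sketch of such a procedure follows. The name stringify-foreign is hypothetical, and the sketch assumes that numbers and booleans count as foreign objects and that a string is acceptable SHTML for the filter to return; neither assumption is confirmed by this documentation.

```scheme
;; Hypothetical foreign-filter (not part of HtmlPrag). The first argument
;; is the foreign object; the second is a boolean saying whether the
;; object occurs inside an attribute value. This sketch ignores the flag
;; and always returns a string.
(define (stringify-foreign obj in-attribute?)
  (cond ((number? obj) (number->string obj))
        ((boolean? obj) (if obj "true" "false"))
        (else "")))

(stringify-foreign 42 #f)
;; => "42"

;; Intended use, assuming HtmlPrag is loaded (not run here):
;; (write-shtml-as-html '(p "Total: " 42)
;;                      (current-output-port)
;;                      stringify-foreign)
```

Returning the empty string for unrecognized objects silently drops them; a filter that signals an error instead may be preferable while debugging.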
Yields an HTML encoding of SHTML shtml as a string. For example:
(shtml->html
 (html->shtml
  "<P>This is<br<b<I>bold </foo>italic</ b > text.</p>"))
=> "<p>This is<br /><b><i>bold italic</i></b> text.</p>"
Note that, since this procedure constructs a string, it should normally only be used when the HTML is relatively small. When encoding HTML documents of conventional size and larger, write-shtml-as-html is much more efficient.
The HtmlPrag test suite can be enabled by editing the source code file and loading Testeez.
div is now always permitted as a parent, as a stopgap measure until substantial time can be spent reworking the algorithm to better support div (bug reported by Corey Sweeney and Jepri). Also, HTML numeric character references with values above 126 are no longer converted to Scheme characters, to avoid a Unicode problem with PLT 299/300 (bug reported by Corey Sweeney).
sxml->html and write-sxml-html have been removed. Minor documentation changes.
syntax-rules, and a reader that can read @ as a symbol. SHTML now has a special & element for character entities, and it is emitted by the parser rather than the old *ENTITY* kludge. shtml-entity-value supports both the new and the old character entity representations. shtml-entity-value now yields #f on invalid SHTML entity, rather than raising an error. write-shtml-as-html now has a third argument, foreign-filter. write-shtml-as-html now emits SHTML & entity references. Changed shtml-named-char-id and shtml-numeric-char-id, as previously warned. Testeez is now used for the test suite. Test procedure is now the internal %htmlprag:test. Documentation changes. Notably, much documentation about using HtmlPrag under various particular Scheme implementations has been removed.
xml: is now preserved as a namespace qualifier (thanks to Peter Barabas for reporting). Output port term of write-shtml-as-html is now optional. Began documenting loading for particular implementation-specific packagings.
sxml-x-symbol to shtml-x-symbol, sxml-html-x to shtml-x, and sxml-token-kind to shtml-token-kind. html->shtml, shtml->html, and write-shtml-as-html have been added as names. Considered deprecated but still defined (see the “Deprecated” section of this documentation) are sxml->html and write-sxml-html. The growing pains should now be all but over. Internally, htmlprag-internal:error introduced for Bigloo portability. SISC returned to the test list; thanks to Scott G. Miller for his help. Fixed a new character eq? bug, thanks to SISC.
“htmlprag:” prefix. The portability identifiers have been renamed to begin with an htmlprag-internal: prefix, are now considered strictly internal-use-only, and have otherwise been changed. parse-html and always-empty-html-elements are no longer public. test-htmlprag now tests html->sxml rather than parse-html. SISC temporarily removed from the test list, until an open source Java that works correctly is found.
htmlprag:sxml-html-entity-value. Upper-case X in hexadecimal character entities is now parsed, in addition to lower-case x. Added htmlprag:always-empty-html-elements. Added additional portability bindings. Added more test cases.
*ENTITY* SXML. SXML symbols like *TOP* are now always upper-case, regardless of the Scheme implementation. Identifiers such as htmlprag:sxml-top-symbol are bound to the upper-case symbols. Procedures htmlprag:html->sxml-0nf, htmlprag:html->sxml-1nf, and htmlprag:html->sxml-2nf have been added. htmlprag:html->sxml is now an alias for htmlprag:html->sxml-0nf. htmlprag:parse has been refashioned as htmlprag:parse-html and should no longer be called directly. A number of identifiers have been renamed to be more appropriate when the htmlprag: prefix is dropped in some implementation-specific packagings of HtmlPrag: htmlprag:make-tokenizer to htmlprag:make-html-tokenizer, htmlprag:parse/tokenizer to htmlprag:parse-html/tokenizer, htmlprag:html->token-list to htmlprag:tokenize-html, htmlprag:token-kind to htmlprag:sxml-token-kind, and htmlprag:test to htmlprag:test-htmlprag. Verbatim elements with empty-element tag syntax are handled correctly. New versions of Bigloo and RScheme tested.
script and xmp are now parsed correctly. Two Scheme implementations have temporarily been dropped from regression testing: Kawa, due to a Java bytecode verifier error likely due to a Java installation problem on the test machine; and SXM 1.1, due to hitting a limit on the number of literals late in the test suite code. Tested newer versions of Bigloo, Chicken, Gauche, Guile, MIT Scheme, PLT MzScheme, RScheme, SISC, and STklos. RScheme no longer requires the “(define get-output-string close-output-port)” workaround.
eq? in character comparisons, thanks to Scott G. Miller. Added htmlprag:html->normalized-sxml and htmlprag:html->nonnormalized-sxml. Started to add close-output-port to uses of output strings, then reverted due to bug in one of the supported dialects. Tested newer versions of Bigloo, Gauche, PLT MzScheme, RScheme.
call-with-values. Re-ordered top-level definitions, for portability. Now tests under Kawa 1.6.99, RScheme 0.7.3.2, Scheme 48 0.57, SISC 1.7.4, STklos 0.54, and SXM 1.1.
@ as a symbol (although those implementations tend to present other portability issues, as yet unresolved).
colgroup, tbody, and thead elements. Erroneous input, including invalid hexadecimal entity reference syntax and extraneous double quotes in element tags, is now parsed better.
htmlprag:token-kind emits symbols more consistent with SXML.
[1] Scheme implementors who have not yet made read case-sensitive by default are encouraged to do so.