Version: 4.1.4

html-parser

 (require (planet ashinn/html-parser/html-parser))

An SSAX-like tree-folding HTML parser

1 Copyright

Copyright (c) 2003-2008 Alex Shinn. All rights reserved. BSD-style license: http://synthcode.com/license.txt

2 About

This is intended as a permissive HTML parser for people who prefer the scalable interface described in Oleg Kiselyov’s SSAX parser, as well as providing simple convenience utilities. It correctly handles all invalid HTML, inserting "virtual" starting and closing tags as needed to maintain the proper tree structure needed for the foldts down/up logic. A major goal of this parser is bug-for-bug compatibility with the way common web browsers parse HTML.

3 API

(make-html-parser [keys])  any
  keys : any = ...

 Procedure: make-html-parser . keys
   Returns a procedure of two arguments, an initial seed and an
   optional input port, which parses the HTML document from the port
   with the callbacks specified in the plist KEYS (using normal,
   quoted symbols, for portability and to avoid making this a
   macro).  The following callbacks are recognized:
   START: TAG ATTRS SEED VIRTUAL?
       fdown in foldts, called when a start-tag is encountered.
     TAG:         tag name
     ATTRS:       tag attributes as an alist
     SEED:        current seed value
     VIRTUAL?:    #t iff this start tag was inserted to fix the HTML tree
   END: TAG ATTRS PARENT-SEED SEED VIRTUAL?
       fup in foldts, called when an end-tag is encountered.
     TAG:         tag name
     ATTRS:       tag attributes of the corresponding start tag
     PARENT-SEED: parent seed value (i.e. seed passed to the start tag)
     SEED:        current seed value
     VIRTUAL?:    #t iff this end tag was inserted to fix the HTML tree
   TEXT: TEXT SEED
       fhere in foldts, called when any text is encountered.  May be
       called multiple times between a start and end tag, so you need
       to string-append yourself if desired.
     TEXT:        entity-decoded text
     SEED:        current seed value
   COMMENT: TEXT SEED
       fhere on comment data
   DECL: NAME ATTRS SEED
       fhere on declaration data
   PROCESS: LIST SEED
       fhere on process-instruction data
   In addition, entity mappings may be overridden with the ENTITIES:
   keyword.
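
As a minimal sketch of the seed-threading described above, the following module collects text chunks in the seed rather than printing them. The module name text-chunks and the helpers html->text-chunks and html->text are hypothetical names introduced for illustration; the sketch assumes that every chunk passed to the text callback is a string and that omitted callbacks fall back to defaults that pass the seed through.

```scheme
(module text-chunks scheme
  (require (planet ashinn/html-parser/html-parser))

  (provide html->text-chunks html->text)

  ;; The seed is a list of text chunks, most recent first.
  ;; start and end leave it unchanged, so only text contributes.
  (define chunk-parser
    (make-html-parser
     'start: (lambda (tag attrs seed virtual?) seed)
     'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
     'text:  (lambda (text seed) (cons text seed))))

  ;; Hypothetical helper: parse STR and return its text chunks
  ;; in document order.
  (define (html->text-chunks str)
    (reverse (chunk-parser '() (open-input-string str))))

  ;; Hypothetical helper: join the chunks, since the text callback
  ;; may fire several times between a start and an end tag.
  ;; Assumes every chunk is a string.
  (define (html->text str)
    (apply string-append (html->text-chunks str))))
```

Because the end callback returns SEED rather than PARENT-SEED, text found inside nested elements stays in the accumulated list. Returning PARENT-SEED instead would discard whatever the children collected.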

(html->sxml [port])  list?
  port : (list/c port?) = ...

Returns the SXML representation of the document from port, using the default parsing options.
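
A short sketch of calling html->sxml on an in-memory string rather than a file. The module name sxml-demo and the helper string->sxml are hypothetical and exist only for this illustration.

```scheme
(module sxml-demo scheme
  (require (planet ashinn/html-parser/html-parser))

  (provide string->sxml)

  ;; Hypothetical helper: wrap STR in a string port and hand it
  ;; to html->sxml, which reads the document from that port.
  (define (string->sxml str)
    (html->sxml (open-input-string str))))
```

Calling (string->sxml "<p>Hello</p>") should return a list describing the document tree. The exact shape depends on the virtual tags the parser inserts, so inspect the result interactively before relying on it.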

(html-strip [port])  string?
  port : (list/c port?) = ...

Returns a string representation of the document from port with all tags removed. No whitespace reduction or other rendering is done.
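
The same pattern works for html-strip. This sketch assumes only the signature given above; the module name strip-demo and the helper strip-string are hypothetical.

```scheme
(module strip-demo scheme
  (require (planet ashinn/html-parser/html-parser))

  (provide strip-string)

  ;; Hypothetical helper: strip the tags from the HTML in STR.
  ;; Whitespace in the source (newlines, indentation) is kept
  ;; as-is, because html-strip does no whitespace reduction.
  (define (strip-string str)
    (html-strip (open-input-string str))))
```

Since no rendering is done, text from block-level elements is not separated by added line breaks. Any normalization of the returned string is left to the caller.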

4 Examples

For the sake of experimentation:

  (module examples scheme
    (require scheme/port
             (planet ashinn/html-parser/html-parser))
  
    (provide (all-defined-out))
  
    (define (file->string f)
      (let ((sp (open-output-string)))
        (call-with-input-file f
          (λ (fp) (copy-port fp sp)))
        (get-output-string sp)))
  
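    ;; A hand-rolled counterpart to html-strip: ignore tags and write
    ;; each text chunk to the current output port.
    (define html-strip-parser
      (make-html-parser
       'start: (lambda (tag attrs seed virtual?) seed)
       'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
       ;; Return the seed so that it is threaded through unchanged.
       'text:  (lambda (text seed) (display text) seed)))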
  
    (define f-string (file->string "/tmp/index.html"))
    (define f-sxml (html->sxml (open-input-string f-string)))
    (define f-stripped (html-strip (open-input-string f-string)))
    (define f-stripped-custom
      (call-with-output-string
       (lambda (p)
         (parameterize ([current-output-port p])
           (html-strip-parser (cons #f #f) (open-input-string f-string))))))
    (printf "html-strip and html-strip-parser produce the same result? ~a~n"
            (equal? f-stripped f-stripped-custom)))