xml-pull: pull-style parsing for large xml files

Quick Example

    > (require (planet "" ("dyoo" "xml-pull.plt" 1 0)))

    > (define a-taffy
        (start-xml-pull (open-input-string #<<EOF
    <test-xml>
    <person><name>Sue Rhee</name></person>
    <person><name>Dan Garcia</name></person>
    <person><name>Mike Clancy</name></person>
    </test-xml>
    EOF
    )))

We can consume this XML structure by morsels:

    > (pull-morsel a-taffy)
    #3(struct:start-element test-xml ())
    > (pull-morsel a-taffy)
    #3(struct:characters "\n" "")
    > (pull-morsel a-taffy)
    #3(struct:start-element person ())

At this point, we are at the start-element of a person.  When we
see a start-element that is interesting to us, we can _pull-sexp_
the rest of that element as a normalized SXML fragment:

    > (pull-sexp a-taffy)
    (person (@) (name (@) "Sue Rhee"))

What's nice about this is that we only consume as much of the XML
from our input-stream as we need, and moreover, memory usage is
bounded to the amount of memory needed to represent the elements
we pull.


There are two structures in this module: taffy and morsel.

    * taffy

      A _taffy_ is a core structure that maintains the state of
      the XML parse.  Conceptually, a taffy is an iterator of morsels
      and SXML fragments.

    * morsel

      A _morsel_ is one of the following:

      * (make-start-element name attributes)
         where name is a symbol and attributes is
         a (listof (list symbol string))
      * (make-end-element name attributes)
         where name is a symbol and attributes is
         a (listof (list symbol string))
      * (make-characters s1 s2)
         where s1 and s2 are strings
      * (make-exhausted)

      Most of these are self-explanatory.  We produce an _exhausted_
      structure when there are no more elements in the xml
      to parse.

The expected predicates and selectors are also available:    

> taffy? : any -> boolean

> morsel? : any -> boolean

> start-element? : any -> boolean

> end-element? : any -> boolean

> characters? : any -> boolean

> exhausted? : any -> boolean

> start-element-name : start-element -> symbol

> start-element-attributes : start-element -> (listof (list symbol string))

> end-element-name : end-element -> symbol

> end-element-attributes : end-element -> (listof (list symbol string))
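
Since attributes are represented as a (listof (list symbol string)),
an ordinary association-list lookup is enough to read one.  The helper
below is only a sketch: attribute-ref is a hypothetical name and is not
part of this module.

    ;; attribute-ref: start-element symbol -> (union string false)
    ;; Hypothetical helper (not provided by xml-pull).
    ;; Returns the value of the named attribute, or #f if the
    ;; start-element has no such attribute.
    (define (attribute-ref a-start-element name)
      (let ([entry (assq name (start-element-attributes a-start-element))])
        (and entry (cadr entry))))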


> start-xml-pull: input-port -> taffy

Given an input-port, starts the XML parse and returns a taffy.

> pull-morsel: taffy -> morsel

Takes a taffy and rips off a morsel.
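
As an illustration of how pull-morsel combines with the predicates
above, here is a small sketch that drains a taffy and counts the
start-elements it sees.  count-start-elements is a hypothetical name
chosen for this example and is not exported by the module.

    ;; count-start-elements: taffy -> number
    ;; Hypothetical helper (not provided by xml-pull).
    ;; Pulls morsels until the taffy is exhausted, counting the
    ;; start-elements seen along the way.
    (define (count-start-elements a-taffy)
      (let loop ([count 0])
        (let ([m (pull-morsel a-taffy)])
          (cond
            [(exhausted? m) count]
            [(start-element? m) (loop (add1 count))]
            [else (loop count)]))))

Because each morsel is discarded as soon as it has been examined, a
loop like this does not need to hold the whole document in memory.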

> pull-sexp: taffy -> (union sexp exhausted)

Assuming that the last morsel pulled off was a start-element, pulls
enough morsels to reconstruct that element as an SXML fragment.  If the
last morsel was not a start-element, raises an error.
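
A common pattern is to skip ahead to the next element of interest and
then pull it whole.  The following is a minimal sketch of that idea;
pull-next-named is a hypothetical name, and returning #f when the input
runs out is a choice made for this example rather than something the
module does itself.

    ;; pull-next-named: taffy symbol -> (union sexp false)
    ;; Hypothetical helper (not provided by xml-pull).
    ;; Skips morsels until a start-element with the given name
    ;; appears, then pulls that whole element as SXML.
    ;; Returns #f if the taffy is exhausted first.
    (define (pull-next-named a-taffy name)
      (let loop ()
        (let ([m (pull-morsel a-taffy)])
          (cond
            [(exhausted? m) #f]
            [(and (start-element? m)
                  (eq? (start-element-name m) name))
             (pull-sexp a-taffy)]
            [else (loop)]))))

Calling pull-sexp only right after a start-element has been pulled
respects the precondition described above.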

> pull-sexps/g: taffy symbol -> (generatorof sexp)

The result is a _generator_ whose elements are s-expressions whose
names match the given input symbol.

See for more details.
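
As a sketch of how such a generator might be consumed, the helper below
collects the next n elements into a list.  It assumes that
generator-next from the generator.plt package is in scope, and that the
generator still has at least n elements left; what happens when a
generator runs dry is not covered here.  take-sexps is a hypothetical
name.

    ;; take-sexps: (generatorof sexp) number -> (listof sexp)
    ;; Hypothetical helper (not provided by xml-pull).
    ;; Collects the next n elements of the generator, in order.
    ;; Assumes the generator can supply at least n more elements.
    (define (take-sexps a-generator n)
      (let loop ([i 0] [acc '()])
        (if (< i n)
            (loop (add1 i) (cons (generator-next a-generator) acc))
            (reverse acc))))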


> current-namespace-translate: symbol -> symbol

If provided, this is used to translate the namespace portion of
element names in an XML document.  By default, this is bound to the
identity function.  (This is experimental --- I might remove this in
a later release of this software in favor of a simpler substitution
map similar to what ssax:xml->sxml takes in.)

More extensive example

Here is code that takes a large XML document --- the collection of common ontology
terms used in bioinformatics --- and prints out the first hundred terms:

(module test-xml-pull-2 mzscheme
  (require (lib "" "net")
           (lib "")
           (lib "")
           (planet "" ("dyoo" "xml-pull.plt" 1 0))
           (planet "" ("dyoo" "generator.plt" 2 0)))

  ;; wrap-gunzip: input-port -> input-port
  ;; Wraps an uncompressor around the input stream.
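  (define (wrap-gunzip original-ip)
    (define-values (ip op) (make-pipe 32768))
    ;; Decompress in a background thread, closing the pipe's output
    ;; end when done so that readers of ip eventually see eof.
    (thread (lambda ()
              (gunzip-through-ports original-ip op)
              (close-output-port op)))
    ip)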

  (define my-url
    (string->url ""))

  (define my-input-port (wrap-gunzip (get-pure-port my-url)))
  (define my-taffy (start-xml-pull my-input-port))
  (define generated-terms (pull-sexps/g my-taffy '
  ;; pretty-print the first 100 terms in the Gene Ontology
  (let loop ([i 0])
    (when (< i 100)
      (pretty-print (generator-next generated-terms))
      (loop (add1 i)))))


Thanks to the PLT folks for writing tools that are very enjoyable
to play with.  Special thanks to the bioinformaticians at
TAIR, who taught me to appreciate
very large XML datasets.




About Pulldom and Minidom