xml-pull: pull-style parsing for large xml files


Quick Example
-------------

    > (require (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0)))

    > (define a-taffy
        (start-xml-pull (open-input-string #<<EOF
<test-xml>
<person>
    <name>Sue Rhee</name>
</person>
<person>
    <name>Dan Garcia</name>
</person>
<person>
    <name>Mike Clancy</name>
</person>
</test-xml>
EOF
                                           )))

We can consume this XML structure by morsels:

    > (pull-morsel a-taffy)
    #3(struct:start-element test-xml ())
    > (pull-morsel a-taffy)
    #3(struct:characters "\n" "")
    > (pull-morsel a-taffy)
    #3(struct:start-element person ())

At this point, we are at the start-element of a person.  When we
see a start-element that is interesting to us, we can _pull-sexp_
the rest of that element as a normalized SXML fragment:

    > (pull-sexp a-taffy)
    (person (@) (name (@) "Sue Rhee"))


What's nice about this is that we only consume as much of the XML
from our input-stream as we need, and moreover, memory usage is
bounded to the amount of memory needed to represent the
fragment.
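
As a minimal sketch of that pattern, we can skip morsels until an
interesting start-element shows up and then pull it as SXML.  The helper
name next-element is hypothetical (it is not part of this module), and
the sketch assumes element names carry no namespace prefix:

    ;; next-element: taffy symbol -> (union sexp false)
    ;; Skips morsels until a start-element with the given name appears,
    ;; then returns that element as SXML.  Returns #f if the taffy is
    ;; exhausted first.
    (define (next-element a-taffy name)
      (let loop ([m (pull-morsel a-taffy)])
        (cond
          [(exhausted? m) #f]
          [(and (start-element? m)
                (eq? (start-element-name m) name))
           (pull-sexp a-taffy)]
          [else (loop (pull-morsel a-taffy))])))

Only the matching element is ever built as a list; everything in
between is discarded one morsel at a time.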
    

Structures
----------

There are two structures in this module: taffy and morsel.

    * taffy

      A _taffy_ is a core structure that maintains the state of
      the XML parse.  Conceptually, a taffy is an iterator of morsels
      and SXML fragments.


    * morsel

      A _morsel_ is one of the following:

      * (make-start-element name attributes)
         where name is a symbol and attributes is
         a (listof (list symbol string))
    
      * (make-end-element name attributes)
         where name is a symbol and attributes is
         a (listof (list symbol string))

      * (make-characters s1 s2)
         where s1 and s2 are strings
       
      * (make-exhausted)

      Most of these are self-explanatory.  We produce an _exhausted_
      structure when there are no more elements in the XML
      to parse.
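
      Since attributes are a list of (list symbol string) pairs, a plain
      assq lookup is enough to read one.  This is only a sketch; the
      helper name attribute-ref is hypothetical and not provided by
      this module:

          ;; attribute-ref: start-element symbol -> (union string false)
          ;; Returns the value of the named attribute, or #f if the
          ;; element has no such attribute.
          (define (attribute-ref elt key)
            (let ([entry (assq key (start-element-attributes elt))])
              (and entry (cadr entry))))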

The expected predicates and selectors are also available:    

> taffy? : any -> boolean

> morsel? : any -> boolean

> start-element? : any -> boolean

> end-element? : any -> boolean

> characters? : any -> boolean

> exhausted? : any -> boolean

> start-element-name : start-element -> symbol

> start-element-attributes : start-element -> (listof (list symbol string))

> end-element-name : end-element -> symbol

> end-element-attributes : end-element -> (listof (list symbol string))
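
One way to use the predicates and selectors together is a cond that
dispatches on the kind of morsel.  The function below is a hypothetical
illustration, not part of this module; it touches only the selectors
listed above:

    ;; morsel->summary: morsel -> sexp
    ;; Produces a small s-expression describing the given morsel.
    (define (morsel->summary m)
      (cond
        [(start-element? m)
         (list 'start
               (start-element-name m)
               (start-element-attributes m))]
        [(end-element? m)
         (list 'end (end-element-name m))]
        [(characters? m) 'characters]
        [(exhausted? m) 'exhausted]))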


Functions
---------

> start-xml-pull: input-port -> taffy

Given an input-port, starts the XML parse and returns a taffy.


> pull-morsel: taffy -> morsel

Takes a taffy and returns its next morsel.  Once there are no more
elements in the XML to parse, the result is an exhausted structure.
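
Because each morsel is handed over one at a time, a loop over
pull-morsel can summarize a document without building it in memory.
Here is a rough sketch (max-depth is a hypothetical name, not part of
this module) that measures how deeply elements nest:

    ;; max-depth: taffy -> number
    ;; Consumes the whole taffy and returns the deepest element nesting
    ;; seen.  Only two counters are kept, never the document itself.
    (define (max-depth a-taffy)
      (let loop ([depth 0] [deepest 0])
        (let ([m (pull-morsel a-taffy)])
          (cond
            [(exhausted? m) deepest]
            [(start-element? m)
             (loop (add1 depth) (max deepest (add1 depth)))]
            [(end-element? m)
             (loop (sub1 depth) deepest)]
            [else (loop depth deepest)]))))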


> pull-sexp: taffy -> (union sexp exhausted)

Assuming that the most recently pulled morsel is a start-element, pulls
enough morsels to reconstruct that element as a normalized SXML fragment.
If the last morsel is not a start-element, raises an error.


> pull-sexps/g: taffy symbol -> (generatorof sexp)

The result is a _generator_ whose elements are the s-expressions whose
names match the given input symbol.

See http://planet.plt-scheme.org/#generator.plt2.0 for more details.
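
A small sketch of how this might look on a tiny document.  The input
string and the names people-taffy and people are made up for
illustration; generator-next comes from generator.plt:

    (require (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0))
             (planet "generator.ss" ("dyoo" "generator.plt" 2 0)))

    (define people-taffy
      (start-xml-pull
       (open-input-string
        "<test-xml><person><name>Sue Rhee</name></person></test-xml>")))

    ;; Ask for every person element; nothing is parsed until we
    ;; request a value from the generator.
    (define people (pull-sexps/g people-taffy 'person))

    ;; Pull the first matching element as an s-expression.
    (generator-next people)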


Parameters
----------

> current-namespace-translate: symbol -> symbol

If provided, this is used to translate the namespace portion of
element names in an XML document.  By default, this is bound to the
identity function.  (This is experimental --- I might remove this in
a later release of this software in favor of a simpler substitution
map similar to what ssax:xml->sxml takes in.)
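
Here is a tentative sketch of shortening a long namespace.  It rests on
assumptions this document does not confirm: that the translator receives
the namespace URI as a symbol, and that the parameter should be in effect
for both start-xml-pull and the pulls that follow.  The names
go-namespace and shorten-go are hypothetical:

    (require (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0)))

    (define go-namespace
      (string->symbol "http://www.geneontology.org/dtds/go.dtd#"))

    ;; shorten-go: symbol -> symbol
    ;; Maps the Gene Ontology namespace to the short symbol go and
    ;; leaves every other namespace alone.
    (define (shorten-go ns)
      (if (eq? ns go-namespace) 'go ns))

    ;; Keep the whole parse inside the parameterize (an assumption
    ;; about when the translator is consulted).
    (parameterize ([current-namespace-translate shorten-go])
      (let ([a-taffy
             (start-xml-pull
              (open-input-string
               (string-append
                "<go:term xmlns:go=\""
                "http://www.geneontology.org/dtds/go.dtd#"
                "\"></go:term>")))])
        ;; Assumes the first morsel is the start-element of go:term.
        (start-element-name (pull-morsel a-taffy))))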



More extensive example
----------------------

Here is code that takes a large XML document --- the collection of common ontology
terms used in bioinformatics --- and prints out the first hundred terms:


(module test-xml-pull-2 mzscheme
  (require (lib "url.ss" "net")
           (lib "inflate.ss")
           (lib "pretty.ss")
           (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0))
           (planet "generator.ss" ("dyoo" "generator.plt" 2 0)))

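  ;; wrap-gunzip: input-port -> input-port
  ;; Wraps an uncompressor around the input stream.
  (define (wrap-gunzip original-ip)
    (define-values (ip op) (make-pipe 32768))
    ;; Decompress in a background thread.  Close the pipe's output end
    ;; when done so that readers see EOF instead of blocking.
    (thread (lambda ()
              (gunzip-through-ports original-ip op)
              (close-output-port op)))
    ip)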

  (define my-url
    (string->url "http://archive.godatabase.org/latest-termdb/go_daily-termdb.rdf-xml.gz"))

  (define my-input-port (wrap-gunzip (get-pure-port my-url)))
  
  (define my-taffy (start-xml-pull my-input-port))
  
  (define generated-terms (pull-sexps/g my-taffy 'http://www.geneontology.org/dtds/go.dtd#:term))
  
  ;; pretty-print the first 100 terms in the Gene Ontology
  (let loop ([i 0])
    (when (< i 100)
      (pretty-print (generator-next generated-terms))
      (loop (add1 i)))))




Thanks
------

Thanks to the PLT folks for writing tools that are very enjoyable
to play with.  Special thanks to the bioinformaticians at
TAIR (http://arabidopsis.org) who taught me to appreciate
very large XML datasets.



References
----------

SSAX (http://ssax.sourceforge.net/)

SXML (http://okmij.org/ftp/Scheme/SXML.html)

About Pulldom and Minidom (http://www.prescod.net/python/pulldom.html)