WebScraperHelper: Simple Generation of SXPath Queries from SXML Examples
************************************************************************

Version 0.3, 2005-07-04, `http://www.neilvandyke.org/webscraperhelper/'

by Neil W. Van Dyke <neil@neilvandyke.org>

     Copyright (C) 2004 - 2005 Neil W. Van Dyke.  This program is Free
     Software; you can redistribute it and/or modify it under the terms
     of the GNU Lesser General Public License as published by the Free
     Software Foundation; either version 2.1 of the License, or (at
     your option) any later version.  This program is distributed in
     the hope that it will be useful, but without any warranty; without
     even the implied warranty of merchantability or fitness for a
     particular purpose.  See <http://www.gnu.org/copyleft/lesser.html>
     for details.  For other license options and consulting, contact
     the author.

Introduction
************

WebScraperHelper is intended as a programmer's aid for crafting SXPath
(http://pair.com/lisovsky/query/sxpath/) queries to extract information
(e.g., news items, prices) from HTML Web pages that have been parsed by
HtmlPrag (http://www.neilvandyke.org/htmlprag/).  The current version
of WebScraperHelper accepts an example SXML
(http://pobox.com/~oleg/ftp/Scheme/SXML.html) (or SHTML) document and
an example "goal" subtree of the document, and yields up to three
different SXPath queries.  A generated query can often be incorporated
into a Web-scraping program as-is, for extracting information from
documents with very similar formatting.  Generated queries can also be
used as starting points for hand-crafted queries.

   For example, given the SXML document DOC:

     (define doc
       '(*TOP* (html (head (title "My Title"))
                     (body (@ (bgcolor "white"))
                           (p "Summary: This is a document.")
                           (div (@ (id "ResultsSection"))
                                (h2 "Results")
                                (p "These are the results.")
                                (table (@ (id "ResultTable"))
                                       (tr (td (b "Input:"))
                                           (td "2 + 2"))
                                       (tr (td (b "Output:"))
                                           (td "Four")))
                                (p "Lookin' good!"))))))

evaluating the expression

     (webscraperhelper '(td "Four") doc)

will display generated queries like:

     Absolute SXPath:           (html body div table (tr 2) (td 2))
     Absolute SXPath with IDs:  (html body
                                 (div (@ (equal? (id "ResultsSection"))))
                                 (table (@ (equal? (id "ResultTable"))))
                                 (tr 2) (td 2))
     Relative SXPath with IDs:  (// (table (@ (equal? (id "ResultTable"))))
                                 (tr 2) (td 2))

The queries can then be compiled with the `sxpath' procedure of the
SXPath library:

     (define query
       (sxpath '(// (table (@ (equal? (id "ResultTable"))))
                    (tr 2) (td 2))))

     (query doc) => ((td "Four"))
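
   As a rough sketch of how such a compiled query might be used inside
a scraping program, the hypothetical helper below (FIRST-NODE-TEXT is
not part of WebScraperHelper or SXPath) applies a query procedure to a
document and returns the first string child of the first matching node,
or `#f' if nothing matches.  It assumes only that the query is a
procedure from a document to a list of nodes, as in the example above.

```scheme
;; Hypothetical helper, not part of WebScraperHelper or SXPath.
;; QUERY is assumed to be a procedure that maps a document to a list
;; of SXML nodes, such as one returned by `sxpath'.
(define (first-node-text query doc)
  (let ((nodes (query doc)))
    (and (pair? nodes)
         ;; Scan the children of the first matching node for a string,
         ;; skipping attribute lists and nested elements.
         (let loop ((kids (cdr (car nodes))))
           (cond ((null? kids) #f)
                 ((string? (car kids)) (car kids))
                 (else (loop (cdr kids))))))))
```

   With QUERY and DOC defined as above, `(first-node-text query doc)'
would be expected to yield `"Four"'.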

   This version of WebScraperHelper requires R5RS, SRFI-11, and SRFI-16.

   WebScraperHelper also comes with an advertising jingle (with
apologies to greasy ground bovine additive Americana):

     WebScraperHelper
     helps a programmer
     scrape the
     Web a great deal!

Interactive Interface
*********************

In this version, the "interactive" interface is a procedure intended to
be invoked manually from a REPL.

> (webscraperhelper goal sxml [ids])
     Displays some SXPath queries yielding SXML GOAL from document SXML.

     GOAL is the desired SXML element node.

     SXML is the document in SXML First Normal Form (1NF).  Some nested
     nodelists emitted by SXML transformation tools, such as `(a ((@ (x
     "y"))))' instead of `(a (@ (x "y")))', are not permitted.

     The optional IDS is a list of name symbols for element attributes
     that can be treated as unique identifiers.  If IDS is not given,
     then the default is `'(id)'.  (Note: Since some Scheme
     implementations have case-insensitive readers, but SXML is
     case-sensitive, you may have to use `(list (string->symbol "foo")
     (string->symbol "bar"))' instead of `'(foo bar)'.)
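
     As a small sketch of the note above, the following builds an IDS
     list whose symbols keep their exact case regardless of the reader.
     The name MY-IDS and the attribute name `itemID' are hypothetical,
     chosen only for illustration.

```scheme
;; MY-IDS and the attribute name "itemID" are hypothetical.
;; `string->symbol' preserves the case of its argument, so the symbols
;; match case-sensitive SXML attribute names even if the reader would
;; fold the case of a quoted symbol.
(define my-ids
  (list (string->symbol "id")
        (string->symbol "itemID")))
```

     With WebScraperHelper loaded, the list would be passed as the
     optional third argument, as in `(webscraperhelper goal sxml
     my-ids)'.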

Programmatic Interface
**********************

The programmatic interface, such as it is, will likely change
substantially in a future version, as new ways of generating queries
are implemented.  The following procedures are therefore exposed only
for tinkering, and are not really documented.

> (find-wsh-path goal sxml)
     Yields a "wsh-path" to GOAL within SXML, or `#f' if no path could
     be found.  The yielded path might share structure with SXML.

> (wsh-path->sxpath-abs path)
> (wsh-path->sxpath-absids+relids path ids)
> (wsh-path->sxpath-abs+absids+relids path ids)
     Translate a "wsh-path" to various SXPath queries.  The yielded
     SXPath query lists should be considered immutable, as they might
     share structure with the original SXML from which PATH was
     generated, or multiple queries might share structure with each
     other.
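
   For tinkering, the two steps might be combined roughly as follows.
This is only a sketch under stated assumptions: it assumes that
`wsh-path->sxpath-abs' yields a single query list, which the
description above does not confirm, and both COPY-TREE and
GOAL->ABSOLUTE-SXPATH are hypothetical names that are not part of
WebScraperHelper.  The copy is made because the yielded query should be
treated as immutable.

```scheme
;; Hypothetical helper: copies the pairs of a list structure so that
;; the copy can be modified without affecting the original.  Strings
;; and other atoms are still shared with the original.
(define (copy-tree x)
  (if (pair? x)
      (cons (copy-tree (car x)) (copy-tree (cdr x)))
      x))

;; Hypothetical helper: yields a fresh absolute SXPath query for GOAL
;; within SXML, or #f if no path could be found.
;; ASSUMPTION: `wsh-path->sxpath-abs' yields a single value.
(define (goal->absolute-sxpath goal sxml)
  (let ((path (find-wsh-path goal sxml)))
    (and path
         (copy-tree (wsh-path->sxpath-abs path)))))
```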

History
*******

Version 0.3 -- 2005-07-04
     Documentation update, plus get it into PLaneT 299/3xx.

Version 0.2 -- 2004-08-16
     Corrected typographical error in attributions.

Version 0.1 -- 2004-07-31
     Initial version.