#lang scribble/doc
@; THIS FILE IS GENERATED
@(require scribble/manual)
@(require (for-label (planet neil/webscraperhelper:1:2)))
@title[#:version "0.5"]{@bold{WebScraperHelper}: Simple Generation of SXPath Queries from SXML Examples}
@author{Neil Van Dyke}

License: @seclink["Legal" #:underline? #f]{LGPL 3} @(hspace 1) Web: @link["http://www.neilvandyke.org/webscraperhelper/" #:underline? #f]{http://www.neilvandyke.org/webscraperhelper/}

@defmodule[(planet neil/webscraperhelper:1:2)]

@section{Introduction}

WebScraperHelper is intended as a programmer's aid for crafting
@link["http://pair.com/lisovsky/query/sxpath/"]{SXPath}
queries to extract information (e.g., news items, prices) from HTML Web pages
that have been parsed by
@link["http://www.neilvandyke.org/htmlprag/"]{HtmlPrag}.
The current version of WebScraperHelper accepts an example
@link["http://pobox.com/~oleg/ftp/Scheme/SXML.html"]{SXML}
(or @link["http://www.neilvandyke.org/shtml/"]{SHTML})
document and an example ``goal'' subtree of the document, and yields up to
three different SXPath queries. A generated query can often be incorporated
into a Web-scraping program as-is, for extracting information from documents
with very similar formatting. Generated queries can also be used as starting
points for hand-crafted queries.

For example, given the SXML document @schemevarfont{doc}:

@SCHEMEBLOCK[
(define doc
  '(*TOP*
    (html (head (title "My Title"))
          (body (\@ (bgcolor "white"))
                (p "Summary: This is a document.")
                (div (\@ (id "ResultsSection"))
                     (h2 "Results")
                     (p "These are the results.")
                     (table (\@ (id "ResultTable"))
                            (tr (td (b "Input:")) (td "2 + 2"))
                            (tr (td (b "Output:")) (td "Four")))
                     (p "Lookin' good!"))))))
]

evaluating the expression

@SCHEMEBLOCK[
(webscraperhelper '(td "Four") doc)
]

will display generated queries like:

@verbatim["Absolute SXPath: (html body div table (tr 2) (td 2))\nAbsolute SXPath with IDs: (html body\n (div (@ (equal? (id \"ResultsSection\"))))\n (table (@ (equal? (id \"ResultTable\"))))\n (tr 2) (td 2))\nRelative SXPath with IDs: (// (table (@ (equal? (id \"ResultTable\"))))\n (tr 2) (td 2))"]

The queries can then be compiled with the @tt{sxpath} procedure of the SXPath
library:

@SCHEMEBLOCK[
(define query
  (sxpath '(// (table (\@ (equal? (id "ResultTable"))))
               (tr 2) (td 2))))
(query doc) ==> ((td "Four"))
]

This version of WebScraperHelper requires R5RS, SRFI-11, and SRFI-16.

WebScraperHelper also comes with an advertising jingle (with apologies to
greasy ground bovine additive Americana):

@verbatim["WebScraperHelper\nhelps a programmer\nscrape the\nWeb a great deal!"]

@section{Interactive Interface}

In this version, the ``interactive'' interface is a procedure intended to be
invoked manually from a REPL.

@defproc[(webscraperhelper (goal any/c) (sxml any/c) (ids any/c)) any/c]{
Displays some SXPath queries yielding SXML @schemevarfont{goal} from document
@schemevarfont{sxml}. @schemevarfont{goal} is the desired SXML element node.
@schemevarfont{sxml} is the document in SXML First Normal Form (1NF). Some
nested nodelists emitted by SXML transformation tools, such as attributes
nested in extra list levels, are not permitted.

The optional @schemevarfont{ids} is a list of name symbols for element
attributes that can be treated as unique identifiers. If @schemevarfont{ids}
is not given, then the default is @tt{'(id)}. (Note: Since some Scheme
implementations have case-insensitive readers, but SXML is case-sensitive, you
may have to use @tt{(list (string->symbol "foo") (string->symbol "bar"))}
instead of @tt{'(foo bar)}.)
}

@section{Programmatic Interface}

The programmatic interface, such as it is, will likely change substantially in
a future version, as new ways of generating queries are implemented. The
following procedures are therefore exposed only for tinkering, and are not
really documented.
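
As a rough sketch only, the procedures documented below might be combined as
follows. This assumes that @schemevarfont{doc} is the example document from the
Introduction, and that the query list yielded by @tt{wsh-path->sxpath-abs} can
be handed to @tt{sxpath} unchanged; the latter is an assumption, since the
@italic{wsh-path} format is not documented. The names @schemevarfont{path} and
@schemevarfont{abs-query} are illustrative and are not part of the library.

@SCHEMEBLOCK[
; Sketch only: doc is the example document from the Introduction.
; find-wsh-path yields a wsh-path, or #f if no path could be found.
(define path (find-wsh-path '(td "Four") doc))

; Translate the path only when one was found.
(define abs-query
  (and path (wsh-path->sxpath-abs path)))

; Assumption: abs-query is a query list that sxpath accepts as-is.
(if abs-query
    ((sxpath abs-query) doc)
    '())
]

Guarding on @tt{#f} keeps the sketch from passing a missing path to the
translation procedure when the goal subtree does not occur in the document.
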
@defproc[(find-wsh-path (goal any/c) (sxml any/c)) any/c]{
Yields a @italic{wsh-path} to @schemevarfont{goal} within
@schemevarfont{sxml}, or @tt{#f} if no path could be found. The yielded path
might share structure with @schemevarfont{sxml}.
}

@defproc[(wsh-path->sxpath-abs (path any/c)) any/c]{}
@defproc[(wsh-path->sxpath-absids+relids (path any/c) (ids any/c)) any/c]{}
@defproc[(wsh-path->sxpath-abs+absids+relids (path any/c) (ids any/c)) any/c]{
Translate a @italic{wsh-path} to various SXPath queries. The yielded SXPath
query lists should be considered immutable, as they might share structure with
the original SXML from which @schemevarfont{path} was generated, or multiple
queries might share structure with each other.
}

@section{History}

@itemize[

@item{Version 0.5 --- 2009-03-14 --- PLaneT @tt{(1 2)}

Minor documentation change.
}

@item{Version 0.4 --- 2009-02-24 --- PLaneT @tt{(1 1)}

License now LGPL 3. Converted to author's new Scheme administration system.
}

@item{Version 0.3 --- 2005-07-04 --- PLaneT @tt{(1 0)}

Documentation update, plus get it into PLaneT 299/3xx.
}

@item{Version 0.2 --- 2004-08-16

Corrected typographical error in attributions.
}

@item{Version 0.1 --- 2004-07-31

Initial version.
}

]

@section[#:tag "Legal"]{Legal}

Copyright (c) 2004--2009 Neil Van Dyke. This program is Free Software; you can
redistribute it and/or modify it under the terms of the GNU Lesser General
Public License as published by the Free Software Foundation; either version 3
of the License (LGPL 3), or (at your option) any later version. This program
is distributed in the hope that it will be useful, but without any warranty;
without even the implied warranty of merchantability or fitness for a
particular purpose. See http://www.gnu.org/licenses/ for details. For other
licenses and consulting, please contact the author.