#lang scribble/doc
@(require scribble/manual
          "util.rkt"
          scribble/racket
          (for-syntax racket/base)
          (for-label (this-package-in main)))

@title[#:tag "sxpath"]{Search (SXPath)}

@(define-syntax-rule (rewrite-table [lhs -> rhs] ...)
   (tabular (list (list @racketblock[lhs]
                        @elem{⇒}
                        @racketblock[rhs])
                  ...)))

@defproc[(sxpath [path abbr-sxpath?]
                 [ns-binding ns-binding? '()])
         (-> sxml? (listof sxml?))]{

Given a path, produces a procedure that accepts an sxml document and
returns a list of matches.
@;{ docs previously said:
Note that the @racket[*TOP*] node of the document is required.
But it isn't! }

AbbrPath is a list. It is translated to the full SXPath according to
the following rewriting rules:

@(let (;; shadow the names so we don't get annoying undefined tag warnings
       [sxml:descendant-or-self #f]
       [select-kids #f]
       [sxml:node? #f]
       [node-equal? #f]
       [ntype?? #f]
       [ntype-names?? #f]
       [ntype-namespace-id?? #f]
       [node-join #f]
       [node-reduce #f]
       [node-pos #f]
       [node-eq? #f]
       [sxml:complement #f]
       [filter #f])
   (define-syntax-rule (defmetas id ...)
     (begin (define-syntax id
              (make-element-id-transformer
               (lambda _ #'(racketvarfont (symbol->string 'id)))))
            ...))
   (defmetas pc0 pc x p path symbol reducer)
   (rewrite-table
    [(sxpath '())
     -> (node-join)]
    [(sxpath '(pc0 pc ...))
     -> (node-join (sxpath1 pc0) (sxpath '(pc ...)))]
    [(sxpath1 '//)
     -> (sxml:descendant-or-self sxml:node?)]
    [(sxpath1 '(equal? x))
     -> (select-kids (node-equal? _x))]
    [(sxpath1 '(eq? x))
     -> (select-kids (node-eq? _x))]
    [(sxpath1 '(*or* p ...))
     -> (select-kids (ntype-names?? '(p ...)))]
    [(sxpath1 '(*not* p ...))
     -> (select-kids (sxml:complement (ntype-names?? '(p ...))))]
    [(sxpath1 '(ns-id:* x))
     -> (select-kids (ntype-namespace-id?? x))]
    [(sxpath1 _symbol)
     -> (select-kids (ntype?? _symbol))]
    [(sxpath1 _string)
     -> (txpath _string)]
    [(sxpath1 _procedure)
     -> _procedure]
    [(sxpath1 '(@#,racket[symbol] ...))
     -> (sxpath1 '((@#,racket[symbol]) ...))]
    [(sxpath1 '(@#,racket[path] reducer ...))
     -> (node-reduce (sxpath path) (sxpathr reducer) ...)]
    [(sxpathr _number)
     -> (node-pos _number)]
    [(sxpathr _path-filter)
     -> (filter (sxpath _path-filter))]))

To extract all cells from an html table:

@interaction[#:eval the-eval
(define table
  `(*TOP*
    (table
     (tr (td "a") (td "b"))
     (tr (td "c") (td "d")))))
((sxpath '(table tr td)) table)
#| should produce '((td "a") (td "b") (td "c") (td "d")) |#
]

To extract all cells anywhere in a document:

@interaction[#:eval the-eval
(define table
  `(*TOP*
    (div
     (p (table
         (tr (td "a") (td "b"))
         (tr (td "c") (td "d"))))
     (table
      (tr (td "e"))))))
((sxpath '(// td)) table)
#| should produce '((td "a") (td "b") (td "c") (td "d") (td "e")) |#
]

One result may be nested in another one:

@interaction[#:eval the-eval
(define doc
  `(*TOP*
    (div
     (p (div "3")
        (div (div "4"))))))
((sxpath '(// div)) doc)
#| should produce '((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4")) |#
]

There's also a string-based syntax, @racket[txpath]. As shown in the
rewriting rules above, @racket[sxpath] assumes that any strings in the
path are expressed using the @racket[txpath] syntax.

So, for instance, the prior example could be rewritten using a string:

@interaction[#:eval the-eval
(define doc
  `(*TOP*
    (div
     (p (div "3")
        (div (div "4"))))))
((sxpath "//div") doc)
#| should produce '((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4")) |#
]

More generally, lists in the s-expression syntax correspond to string
concatenation in the txpath syntax. So, to find all italics that
appear at top level within a paragraph:

@interaction[#:eval the-eval
(define doc
  `(*TOP*
    (div
     (p (i "3")
        (froogy (i "4"))))))
((sxpath "//p/i") doc)
#| should produce '((i "3")) |#
]

Handling of namespaces in @racket[sxpath] is a bit surprising.
In particular, @racket[sxpath] appears to expect namespaces to be
fully expanded in the element names of the matched source. For instance:

@interaction[#:eval the-eval
((sxpath "//ns:p" `((ns . "http://example.com")))
 '(*TOP*
   (html (|http://example.com:body|
          (|http://example.com:p| "first para")
          (|http://example.com:p|
           "second para containing"
           (|http://example.com:p| "third para")
           "inside it")))))
#| should produce '((|http://example.com:p| "first para") (|http://example.com:p| "second para containing" (|http://example.com:p| "third para") "inside it") (|http://example.com:p| "third para")) |#
]

But the corresponding example, where the source document declares a
namespace shortcut and uses it in its element names, does not match in
the same way. That is:

@interaction[#:eval the-eval
((sxpath "//ns:p" `((ns . "http://example.com")))
 '(*TOP*
   (|@| (*NAMESPACES* (ns "http://example.com")))
   (html (ns:body
          (ns:p "first para")
          (ns:p "second para containing"
                (ns:p "third para")
                "inside it")))))
]

It produces the empty list. Instead, the namespace binding passed to
@racket[sxpath] must map the prefix to the shortcut itself, as though
the shortcut were the namespace. Thus:

@interaction[#:eval the-eval
((sxpath "//ns:p" `((ns . "ns")))
 '(*TOP*
   (|@| (*NAMESPACES* (ns "http://example.com")))
   (html (ns:body
          (ns:p "first para")
          (ns:p "second para containing"
                (ns:p "third para")
                "inside it")))))
#| should produce '((ns:p "first para") (ns:p "second para containing" (ns:p "third para") "inside it") (ns:p "third para")) |#
]

}
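
The two reducer rules at the end of the rewriting table for
@racket[sxpath] (a number becomes @racket[node-pos], a nested path
becomes @racket[filter]) can be illustrated with a minimal sketch. It
assumes the same @racket[the-eval] evaluator used in the examples
above; the document @racket[grid] is hypothetical and introduced only
for illustration. The first query is expected to select the header
cells of the first row, and the second to keep only the rows that have
a @tt{td} child:

@interaction[#:eval the-eval
#| grid is a hypothetical document, introduced only for this sketch |#
(define grid
  `(*TOP*
    (table
     (tr (th "h1") (th "h2"))
     (tr (td "a") (td "b"))
     (tr (td "c") (td "d")))))
#| positional reducer: the first tr child of the table, then its th children |#
((sxpath '(table (tr 1) th)) grid)
#| path filter: keep only the tr elements that have at least one td child |#
((sxpath '(table (tr (td)))) grid)
]

In both queries the reducer narrows the node set chosen by the step it
is attached to, so the later steps of the path see only the surviving
nodes.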