#lang scribble/doc @begin[(require scribble/manual) (require scribble/eval) (require scribble/basic) (require (for-label "peg.ss"))] @(define the-eval (let ([the-eval (make-base-eval)]) (the-eval '(require "peg.ss")) the-eval)) @title[#:tag "top"]{@bold{PEG}: Parsing Expression Grammar} by Jon Rafkind (@tt{rafkind at cs dot utah dot edu}) This package provides a syntactic form for writing a BNF which will act as a parsing expression grammar. The result is a packrat parser as described by Bryan Ford in @link["http://pdos.csail.mit.edu/~baford/packrat/thesis/"]{Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking}. @table-of-contents[] @section[#:tag "intro"]{Example: Reading a's and b's} In this short example a peg is defined that will parse a string which can contain any number of a's and any number of b's in any order. That is, it is equivalent to the regular expression (a|b)*. @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((* a-or-b)) _)) (a-or-b (("a") 'the-a) (("b") 'the-b))))) (parse parser "aabba")] @section[#:tag "provides"]{Provided functions and syntax} @defform[#:id peg (peg (start s) (grammar (nonterminal ((pattern ...) action) ...) ...))]{ Defines a peg that will start parsing using the nonterminal @scheme[s]. Patterns and actions are described in the @secref{patterns} section. The result is a function of the following type} @defproc[(peg-result [input (lambda/c)]) (any/c)]{input :: @scheme[(lambda (column) ...)] = a function that accepts one argument: an integer representing the current position in the input stream. The peg will start parsing from column 0.} @scheme[nonterminal] can also take arguments. To do this use the following form @defform[(nonterminal args ...)] So inside a grammar it would look like @scheme[(peg (start s) (grammar ((s a b) ((pattern ...) action))))] Where @scheme[a] and @scheme[b] are bound within @scheme[pattern] and @scheme[action]. Nonterminals defined in this way can only be used with the @scheme[apply] pattern. @defproc[(parse [parser parser?] [obj (or/c string? path?)]) (any/c)]{Parse @scheme[obj] using @scheme[parser] which should be created using the @scheme[peg] syntax. If @scheme[obj] is a string then the parser will retrieve the data from the @scheme[obj] itself. If @scheme[obj] is a path then the parser will retrieve data from the file specified by @scheme[obj].} @section[#:tag "patterns"]{Patterns} @scheme[pattern] in the @scheme[peg] form can be one of the following @itemize[@item[@defform[#:id literal literal]{Matches literal data, either a string or a char.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (("ab") 'a) ((#\c) 'c))))) (parse parser "ab") (parse parser "c") (parse parser "d")]] @item[@defform[#:id any _]{Matches any single character.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s ((_) _))))) (parse parser "a") (parse parser "b") (parse parser "c")]] @item[@defform[#:id id id]{Jump to nonterminal @scheme[id].} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s ((nt1) 'ok)) (nt1 ((nt2 nt3) _)) (nt2 (("a") _)) (nt3 (("b") _))))) (parse parser "ab")]] @item[@defform[#:id eof eof]{Only recognizes the end of stream.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (("a" eof) "complete") (("a") "incomplete"))))) ;; after parsing the first "a" the parser will accept (parse parser "aa") ;; but this one can succesfully match "a" and the end of input (parse parser "a")] ] @item[@defform[#:id bind (bind variable pattern ...)]{Binds @scheme[variable] to the result of @scheme[pattern]. @scheme[variable] is bound in the scope of the rest of the patterns that come after the @scheme[bind].} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((bind c x) (bind d y)) (string-append c d))) (x (("x") "a")) (y (("y") "b"))))) (parse parser "xy")] ] @item[@defform[#:id * (* pattern ...)]{Will match @scheme[pattern] zero or more times until it doesn't match anymore. The result is always a @scheme[list].} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (("x" (* "a" "b")) _) (("y" (* "a")) _))))) (parse parser "yaaaa") (parse parser "y") (parse parser "xabab")] ] @item[@defform[#:id + (+ pattern ...)]{Will match @scheme[pattern] one or more times until it doesn't match anymore. The result is always a @scheme[list] or @scheme[#f].} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (("f" (+ "a" "b")) _))))) (parse parser "f") (parse parser "fabab")] ] @item[@defform[#:id ? (? pattern)]{Will match @scheme[pattern] zero or one times. The result is always a list, @scheme['()] if @scheme[pattern] didn't match and @scheme[(list result-of-pattern)] if it did.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (("f" (? "a" "b") eof) 'ok))))) (parse parser "f") (parse parser "fab") (parse parser "fabab")] ] @item[@defform[#:id or (or pattern ...)]{Will match the first pattern in @scheme[pattern ...], trying each pattern in sequence if the previous patterns failed.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((or "a" "b" "c")) _))))) (parse parser "a") (parse parser "b") (parse parser "c")] ] @item[@defform[#:id foreign (foreign peg nonterminal)]{Use a different peg starting at nonterminal @scheme[nonterminal].} @defexamples[#:eval the-eval (define a-parser (peg (start s) (grammar (s ((a-s) _)) (a-s (((+ "a")) _))))) (define parser (peg (start s) (grammar (s (((* "b") (foreign a-parser s)) _))))) (parse parser "bbbaa") (parse parser "aa") (parse parser "bb")] ] @item[@defform[#:id predicate (predicate code)]{If @scheme[code] evaluates to anything besides @scheme[#f] then the parse will continue, otherwise if the result is @scheme[#f] the current pattern will fail. This does not consume any input.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((bind a (or "a" "b")) (predicate (string=? "a" a))) a))))) (parse parser "a") (parse parser "b")] ] @item[@defform[#:id apply (apply nonterminal args ...)]{Pass @scheme[args ...] to @scheme[nonterminal] and then jump to that nonterminal.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((apply foos #\a) (apply foos #\b)) 'ok)) ((foos x) (((bind c _) (predicate (if (char=? c x) x #f))) _))))) (parse parser "ab") (parse parser "aa") (parse parser "bb")] ] @item[@defform[#:id not (not pattern ...)]{The parse will continue only if @scheme[pattern] fails. This does not consume any input.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((not "a") "b") 'no-a) (("a" "b") 'yes-a))))) (parse parser "ab") (parse parser "b")] ] @item[@defform[#:id ensure (ensure pattern ...)]{The parse will continue only if @scheme[pattern] succeeds. In other words, exactly the opposite of @scheme[not]. This does not consume any input.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (("f" "b" (ensure "a") "a") 'yes-a) (("f" "b") 'no-a))))) (parse parser "fba") (parse parser "fb")] ] @item[@defform[#:id except (except pattern ...)]{Matches one character only if @scheme[pattern] does not succeed. Only consumes the one character if the pattern succeeds.} @defexamples[#:eval the-eval (define parser (peg (start s) (grammar (s (((except "french")) _))))) (parse parser "french") (parse parser "b")] ] ] @section[#:tag "actions"]{Actions} @scheme[action] can be an arbitrary piece of code or the special @scheme[_] syntax. When @scheme[_] is used the result from the last pattern in the production is used as the return value. @defexamples[#:eval the-eval (define (parser) (peg (start s) (grammar (s ((a b c) _) ((c b a) _) ((a c b) 'happy-birthday)) (a (("a") 1)) (b (("b") 2)) (c (("c") 3))))) (parse (parser) "abc") (parse (parser) "cba") (parse (parser) "acb") ] @section[#:tag "parsing"]{Parsing} The PEG parser will start at the nonterminal given by @scheme[(start s)]. Each production of a nonterminal will be tried in sequence until one of the productions succeeds or all the productions fail. The result of parsing with a nonterminal is either @scheme[#f] or an arbitrary datum as the result of @scheme[action]. The result of a nonterminal is stored in a table for future lookups so that if the same nonterminal is used to parse the input stream at the same position the result that was computed for the first time will be returned. This situation occurs frequently in a PEG parser. Consider the following nonterminal. @schemeblock[ (foo (("hello" " " world " " "bob") "bob's world") (("hello" " " world " " "george") "george's world")) (world (("world") "world")) ] If this set of productions is tried on the input string "hello world george", the first production of foo will not match because "bob" does not match "george" in the input stream. @scheme[world] already matched, however, so in the second production of @scheme[foo], @scheme[world] will not have to be recomputed because at position 6 it is known to parse correctly and return the result "world". @section[#:tag "best-practices"]{Best practices} @itemize[@item{Generating a new parser for each object you would like to parse reduces unused memory. The PEG memoizes intermediate results which can go unused if the same object is not parsed twice. Consider the following. @schemeblock[ (define my-parser (peg ...)) (let loop () (parse my-parser "blah blah blah") (loop)) ] Will slowly use up all available memory. This table summarizes the memory usage of this program over a few seconds. @verbatim{ mzscheme 15m mzscheme 28m mzscheme 34m mzscheme 47m mzscheme 58m} However if the parser is created when it is needed memory stays constant. @schemeblock[ (define (my-parser) (peg ...)) (let loop () (parse (my-parser) "blah blah blah") (loop)) ] @verbatim{ mzscheme 15m mzscheme 15m mzscheme 15m mzscheme 15m} } @item{Use @scheme[not] and @scheme[ensure] to force the parser to choose one production over another. In the following example we see a case where the parser will match one nonterminal when it should have matched another but by that time it is too late. We can use @scheme[not] to be sure that the parser won't match a production that matches similar text as the proper nonterminal. @defexamples[#:eval the-eval (define (parser) (peg (start s) (grammar (s (((bind s stmt) eof) s)) (stmt (((bind f function) (not "(")) 'no-args) (((bind f function) "(" (* arg (? ",")) ")") 'args)) (function (("x") _)) (arg (("y") _))))) (parse (parser) "x") (parse (parser) "x(y,y)") ]} ]