1 Example: Reading a’s and b’s
2 Provided functions and syntax
3 Patterns
4 Actions
5 Parsing
6 Best practices
Version: 4.1.3.2

PEG: Parsing Expression Grammar

by Jon Rafkind (rafkind at cs dot utah dot edu)

This package provides a syntactic form for writing a BNF which will act as a parsing expression grammar. The result is a packrat parser as described by Bryan Ford in Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking.

    1 Example: Reading a’s and b’s

    2 Provided functions and syntax

    3 Patterns

    4 Actions

    5 Parsing

    6 Best practices

1 Example: Reading a’s and b’s

In this short example a peg is defined that will parse a string which can contain any number of a’s and any number of b’s in any order. That is, it is equivalent to the regular expression (a|b)*.

Examples:

  (define parser
    (peg
      (start s)
      (grammar
        (s (((* a-or-b)) $))
        (a-or-b (("a") 'the-a)
                (("b") 'the-b)))))
  > (parse parser "aabba")

  (the-a the-a the-b the-b the-a)

2 Provided functions and syntax

(peg (start s) (grammar (nonterminal ((pattern ...) action) ...) ...))

Defines a peg that will start parsing using the nonterminal s. Patterns and actions are described in the Patterns section. The result is a function of the following type

(peg-result input)  (any/c)
  input : (lambda/c)

input :: (lambda (column) ...) = a function that accepts one argument: an integer representing the current position in the input stream. The peg will start parsing from column 0.

nonterminal can also take arguments. To do this use the following form

(nonterminal args ...)

So inside a grammar it would look like

(peg (start s) (grammar ((s a b) ((pattern ...) action))))

Where a and b are bound within pattern and action. Nonterminals defined in this way can only be used with the apply pattern.

(parse parser obj)  (any/c)
  parser : parser?
  obj : (or/c string? path?)

Parse obj using parser which should be created using the peg syntax. If obj is a string then the parser will retrieve the data from the obj itself. If obj is a path then the parser will retrieve data from the file specified by obj.

3 Patterns

pattern in the peg form can be one of the following

4 Actions

action can be an arbitrary piece of code or special identifiers, $, $position, $span. When _ is used the result from the last pattern in the production is used as the return value. $position is the current index in the input stream. $span is the number of characters the production consumed. Note that lookahead patterns like not do not consume any input.

Examples:

  (define (parser)
    (peg
      (start s)
      (grammar
        (s ((a b c) (list $ $position $span))
           ((c b s) (list $ $position $span))
           ((a c b) 'happy-birthday))
        (a (("a") 1))
        (b (("b") 2))
        (c (("c") 3)))))
  > (parse (parser) "abc")

  (3 0 3)

  > (parse (parser) "cbabc")

  ((3 2 3) 0 5)

  > (parse (parser) "acb")

  happy-birthday

5 Parsing

The PEG parser will start at the nonterminal given by (start s). Each production of a nonterminal will be tried in sequence until one of the productions succeeds or all the productions fail. The result of parsing with a nonterminal is either #f or an arbitrary datum as the result of action.

The result of a nonterminal is stored in a table for future lookups so that if the same nonterminal is used to parse the input stream at the same position the result that was computed for the first time will be returned. This situation occurs frequently in a PEG parser. Consider the following nonterminal.

  (foo (("hello" " " world " " "bob") "bob's world")
       (("hello" " " world " " "george") "george's world"))
  (world (("world") "world"))

If this set of productions is tried on the input string "hello world george", the first production of foo will not match because "bob" does not match "george" in the input stream. world already matched, however, so in the second production of foo, world will not have to be recomputed because at position 6 it is known to parse correctly and return the result "world".

6 Best practices