On this page:
7.1 Parser
7.1.1 Grammar Definitions
7.1.2 Grammar Invariants
7.1.3 Syntactic Contexts
lexer-state
parser-state
7.1.4 Lexer and Parser Invariants
7.2 Header Compilation
Version: 4.1.5.3

7 Internals

This section provides some information about the implementation of this package. It is not required at all to understand how to use any of the C Metaprogramming Utilities. None of the modules documented in this section should be used externally, and are subject to incompatible changes at any time.

7.1 Parser

C is not an easy language to parse. The grammar is essentially LALR(1), except for one nasty wrinkle: there are two lexical classes that are considered distinct in the grammar but whose lexical definitions are identical: identifier and typedef_name.

The language distinguish these two classes by their bindings, so parsers must maintain an environment and make the environment available to the lexer. However, because the grammar is LALR(1), the lexer may sometimes look ahead of the current production(s) being parsed. Consequently, the lexer must also make sure that it does not look ahead to an apparent identifier before committing all necessary updates to the environment.

For example, in a program such as:

typedef int T;
T a;

the second occurrence of T might be tokenized during lookahead before the parser has had a chance to update the the environment from the first declaration. As a result, the lexer must be careful whenever it reaches the end of a potential declaration to commit any outstanding updates to the environment.

The following articles were useful in developing the parser:

7.1.1 Grammar Definitions

A declarator terminator is a token of token-name 'COMMA, '=, or 'SEMI_COLON.

A declarator id is the value of the DeclaratorId attribute of a ‹Declarator› node, using the following attribution rules for the language grammar:

 

DeclaratorX

 ::= 

[‹Pointer›] DirectDeclaratorX

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

DirectDeclaratorX

 ::= 

X

 

     

   $0.DeclaratorId  $1

 

  |  

"(" Declarator ")"

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" TypeQualifier›* [‹AssignmentExpression›] "]"

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" "static" TypeQualifier›* AssignmentExpression "]"

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" TypeQualifier+ "static" AssignmentExpression "]"

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" TypeQualifier›* "*" "]"

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "(" ParameterTypeList ")"

 

     

   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "(" [‹ListIdentifier] ")"

 

     

   $0.DeclaratorId  $1.DeclaratorId

7.1.2 Grammar Invariants

Following are some invariants – essentially, informal lemmas – about the implemented grammar:

7.1.3 Syntactic Contexts

The internal module (planet dherman/c:3:1/private/syntactic-context) defines structures that are used internally to maintain information about parsing and lexing state. It could be accessed via

 (require (planet dherman/c:3:1/private/syntactic-context))

but generally shouldn’t be used.

(struct lexer-state (read?
    source
    offset
    declarators
    brace-depth
    parenthesis-depth
    previous-token))
  read? : boolean?
  source : any
  offset : (or/c position? #f)
  declarators : (listof (box/c (or/c symbol? #f)))
  brace-depth : exact-nonnegative-integer?
  parenthesis-depth : exact-nonnegative-integer?
  previous-token : (or/c token? #f)

State pertaining to the lexer.

The read? field is used by the lexer for implementing a C reader. When read? is #t, the lexer reports an EOF token when it reaches a terminating delimiter.

The source field is used to identify the input source, such as with a path. The offset is used as a base offset for generating source location information. When offset is #f the base location (make-position 1 1 0) is used.

The declarators stack contains the cache for the current declarator id if the current context is a declaration context. Both the lexer state and parser state maintain views of the declarators stack, which can be compared to determine whether the lexer is currently looking ahead. The car of the stack is the current declarator cache.

The brace-depth tracks the nesting depth of curly-brace delimiters, and the parenthesis-depth tracks the nesting depth of parenthesis delimiters.

The previous-token stores the previously lexed token or #f if no tokens have been read yet.

A major context is one of

A minor context is one of

(struct parser-state (env declarators major-context minor-context))
  env : (listof (listof (cons symbol (or/c 'var 'type))))
  declarators : (listof (box/c (or/c symbol? #f)))
  major-context : (listof major context)
  minor-context : (listof minor context)

State pertaining to the parser. The env field is the current environment.

The declarators stack contains the parser’s view of the same cache as the lexer-state-declarators stack.

The major-context stack tracks the major context. The car of the stack is the current major context. The current major context is a declaration context if it is one of 'block, 'formals, or 'preamble.

The minor-context stack tracks the minor context. The car of the stack is the current minor context. A minor context is a typedef context if its value is 'typedef.

7.1.4 Lexer and Parser Invariants

Following are some invariants about the state of the lexer and parser.

7.2 Header Compilation

The compilation of headers uses a curious computational idiom of two separate monads in two distinct stages. The first monadic pass (“precompilation”) generates a computation in the second monadic pass (“compilation”).

More to follow.