Version: 5.2.1

python-tokenizer: a translation of Python’s tokenize.py library for Racket

Danny Yoo <hashcollision.org>

This is a fairly close translation of the tokenize.py library from Python.

The main function, generate-tokens, consumes an input port and produces a sequence of tokens.

For example:

> (require (planet dyoo/python-tokenizer))
> (define sample-input (open-input-string "def d22(a, b, c=2, d=2, *k): pass"))
> (define tokens
    (generate-tokens sample-input))
> (for ([t tokens])
    (printf "~s ~s ~s ~s\n" (first t) (second t) (third t) (fourth t)))

NAME "def" (1 0) (1 3)
NAME "d22" (1 4) (1 7)
OP "(" (1 7) (1 8)
NAME "a" (1 8) (1 9)
OP "," (1 9) (1 10)
NAME "b" (1 11) (1 12)
OP "," (1 12) (1 13)
NAME "c" (1 14) (1 15)
OP "=" (1 15) (1 16)
NUMBER "2" (1 16) (1 17)
OP "," (1 17) (1 18)
NAME "d" (1 19) (1 20)
OP "=" (1 20) (1 21)
NUMBER "2" (1 21) (1 22)
OP "," (1 22) (1 23)
OP "*" (1 24) (1 25)
NAME "k" (1 25) (1 26)
OP ")" (1 26) (1 27)
OP ":" (1 27) (1 28)
NAME "pass" (1 29) (1 33)
ENDMARKER "" (2 0) (2 0)

1 API

 (require (planet dyoo/python-tokenizer:1:=0))

(generate-tokens inp)
 → (sequenceof (list/c symbol? string? (list/c number? number?) (list/c number? number?) string?))
  inp : input-port
Consumes an input port and produces a sequence of tokens.

Each token is a 5-tuple consisting of:
  1. token-type: one of the following symbols: 'NAME, 'NUMBER, 'STRING, 'OP, 'COMMENT, 'NL, 'NEWLINE, 'DEDENT, 'INDENT, 'ERRORTOKEN, or 'ENDMARKER. The only difference between 'NEWLINE and 'NL is that 'NEWLINE occurs only when the indentation level is 0.

  2. text: the string content of the token.

  3. start-pos: the line and column where the token starts, as a list of two numbers.

  4. end-pos: the line and column where the token ends, as a list of two numbers.

  5. current-line: the current line that the tokenizer is on.

The last token produced, under normal circumstances, will be 'ENDMARKER.
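
As a rough illustration of the 5-tuple layout described above, the following sketch unpacks tokens with match. It runs on hand-written sample data rather than on the tokenizer itself: the first four fields are copied from the example output earlier on this page, and the fifth field is assumed to hold the text of the source line. The helpers token-summary and names-in are hypothetical and not part of this library.

```racket
#lang racket

;; Hand-written sample tokens in the documented 5-tuple shape.
;; The fifth field (current-line) is assumed to be the source line text.
(define sample-line "def d22(a, b, c=2, d=2, *k): pass")
(define sample-tokens
  (list (list 'NAME "def" '(1 0) '(1 3) sample-line)
        (list 'NAME "d22" '(1 4) '(1 7) sample-line)
        (list 'OP "(" '(1 7) '(1 8) sample-line)
        (list 'ENDMARKER "" '(2 0) '(2 0) "")))

;; Hypothetical helper: unpack one token and format its type, text, and span.
(define (token-summary tok)
  (match tok
    [(list type text (list start-line start-col) (list end-line end-col) _line)
     (format "~a ~s ~a:~a-~a:~a"
             type text start-line start-col end-line end-col)]))

;; Hypothetical helper: collect the text of every 'NAME token.
;; Works on any sequence of tokens, not only on lists.
(define (names-in tokens)
  (for/list ([tok tokens]
             #:when (eq? (first tok) 'NAME))
    (second tok)))

(for ([tok sample-tokens])
  (displayln (token-summary tok)))
```

Since both helpers only iterate over their argument, they should apply equally to the sequence returned by generate-tokens.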

If a recoverable error occurs, generate-tokens will produce single-character tokens with the 'ERRORTOKEN type until it can recover.

Unrecoverable errors occur when the tokenizer encounters eof in the middle of a multi-line string or statement, or when an indentation level is inconsistent. On an unrecoverable error, an exn:fail:token or exn:fail:indentation exception is raised.

(struct exn:fail:token exn:fail (loc)
  #:extra-constructor-name make-exn:fail:token)
  loc : (list/c number? number?)
Raised when eof is unexpectedly encountered. exn:fail:token-loc holds the start position.

(struct exn:fail:indentation exn:fail (loc)
  #:extra-constructor-name make-exn:fail:indentation)
  loc : (list/c number? number?)
Raised when the indentation is inconsistent. exn:fail:indentation-loc holds the start position.
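
The two exceptions can be caught with with-handlers. The sketch below is only an illustration under stated assumptions: it defines stand-in structs that mirror the definitions documented above so that it runs without the PLaneT package, and it simulates a failure instead of calling the tokenizer. The helper call-with-tokenizer-errors, the message text, and the location are all made up for the example.

```racket
#lang racket

;; Stand-in definitions mirroring the documented structs, so this sketch runs
;; without the PLaneT package. With the package installed, delete these two
;; forms and use (require (planet dyoo/python-tokenizer)) instead.
(struct exn:fail:token exn:fail (loc))
(struct exn:fail:indentation exn:fail (loc))

;; Hypothetical helper, not part of the library: run thunk and turn either
;; tokenizer exception into a tagged list of (kind message loc).
(define (call-with-tokenizer-errors thunk)
  (with-handlers ([exn:fail:token?
                   (lambda (e)
                     (list 'token-error
                           (exn-message e)
                           (exn:fail:token-loc e)))]
                  [exn:fail:indentation?
                   (lambda (e)
                     (list 'indentation-error
                           (exn-message e)
                           (exn:fail:indentation-loc e)))])
    (thunk)))

;; Simulated failure; the message and location are invented for illustration.
(define result
  (call-with-tokenizer-errors
   (lambda ()
     (raise (exn:fail:token "EOF in multi-line string"
                            (current-continuation-marks)
                            '(1 4))))))

(displayln result)
```

With the real library, the thunk would presumably iterate over the result of generate-tokens, for example by collecting it with for/list. If the sequence is produced lazily, the exception would surface during iteration rather than at the call to generate-tokens, so the iteration itself should happen inside the thunk.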

2 Translator Comments

The translation is a fairly direct one; I wrote an auxiliary package to deal with the while loops, which proved invaluable during the translation of the code. It may be instructive to compare the source here to that of tokenize.py.

Here are some points I observed while doing the translation: