#lang scribble/manual
@(require planet/scribble
          scribble/eval
          (for-label (this-package-in main)
                     racket/base
                     racket/sequence))

@(define myeval (make-base-eval))
@(myeval '(require racket/sequence))
@(myeval '(require racket/list))

@title{python-tokenizer: a translation of Python's @tt{tokenize.py} library for Racket}
@author+email["Danny Yoo" "hashcollision.org"]

This is a fairly close translation of the
@link["http://hg.python.org/cpython/file/2.7/Lib/tokenize.py"]{@tt{tokenize.py}}
library from @link["http://python.org"]{Python}.

The main function, @racket[generate-tokens], consumes an input port
and produces a sequence of tokens.

For example:

@interaction[#:eval myeval
(require (planet dyoo/python-tokenizer))
(define sample-input (open-input-string "def d22(a, b, c=2, d=2, *k): pass"))
(define tokens 
  (generate-tokens sample-input))
(for ([t tokens])
  (printf "~s ~s ~s ~s\n" (first t) (second t) (third t) (fourth t)))
]

@section{API}
@defmodule/this-package[main]

@defproc[(generate-tokens [inp input-port]) (sequenceof (list/c symbol? string? (list/c number? number?) (list/c number? number?) string?))]{
Consumes an input port and produces a sequence of tokens.

Each token is a 5-tuple consisting of:
@itemize[#:style 'ordered

@item{token-type: one of the following symbols: 
@racket['NAME], @racket['NUMBER], @racket['STRING],
@racket['OP], @racket['COMMENT], @racket['NL],
@racket['NEWLINE], @racket['DEDENT], @racket['INDENT],
@racket['ERRORTOKEN], or @racket['ENDMARKER].  The only difference between
@racket['NEWLINE] and @racket['NL] is that @racket['NEWLINE] will only occurs
if the indentation level is at @racket[0].}

@item{text: the string content of the token.}

@item{start-pos: the line and column as a list of two numbers}

@item{end-pos: the line and column as a list of two numbers}

@item{current-line: the current line that the tokenizer is on}
]

The last token produced, under normal circumstances, will be
@racket['ENDMARKER].


If a recoverable error occurs, @racket[generate-tokens] will produce
single-character tokens with the @racket['ERRORTOKEN] type until it
can recover.

Unrecoverable errors occur when the tokenizer encounters @racket[eof]
in the middle of a multi-line string or statement, or if an
indentation level is inconsistent.  On an unrecoverable error,
@racket[generate-tokesn] will raise an @racket[exn:fail:token] or
@racket[exn:fail:indentation] error.


@defstruct[(exn:fail:token exn:fail) ([loc (list/c number number)])]{
Raised when @racket[eof] is unexpectedly encounted.
@racket[exn:fail:token-loc] holds the start position.
}

@defstruct[(exn:fail:indentation exn:fail) ([loc (list/c number number)])]{
Raised when the indentation is inconsistent.
@racket[exn:fail:indentation-loc] holds the start position.
}

}


@section{Translator Comments}

The translation is a fairly direct one; I wrote an
@link["https://github.com/dyoo/while-loop"]{auxiliary package} to deal
with the @racket[while] loops, which proved invaluable during the
translation of the code.  It may be instructive to compare the
@link["https://github.com/dyoo/python-tokenizer/blob/master/python-tokenizer.rkt"]{source}
here to that of
@link["http://hg.python.org/cpython/file/2.7/Lib/tokenize.py"]{tokenize.py}.


Here are some points I observed while doing the translation:

@itemize[

@item{Mutation pervades the entirety of the tokenizer's main loop.
The main reason is because @racket[while] has no return type and
doesn't carry variables around; the @racket[while] loop communicates
values from one part of the code to others through mutation, often in
wildly distant locations.}


@item{Racket makes a syntactic distinction between variable definition
(@racket[define]) and mutation (@racket[set!]).  I've had to deduce
which variables were intended to be temporaries, and hopefully I
haven't induced any errors along the way.}

@item{In some cases, Racket has finer-grained type distinctions than
Python.  Python does not use a separate type to represent individual
characters, and instead uses a length-1 string.  In this translation,
I've used characters where I think they're appropriate.}

@item{Most uses of raw strings in Python can be translated to
uses of the
@link["http://docs.racket-lang.org/scribble/reader-internals.html#(mod-path._at-exp)"]{at-exp}
reader.}


@item{Generators in Racket and in Python are pretty similar, though
the Racket documentation can do a better job in documenting them.

When dealing with generators in Racket, what one really wants to
usually produce is a generic sequence.  For that reason, the
Racket documentation really needs to place more emphasis in 
@racket[in-generator], not the raw @racket[generator] form.}


@item{Python heavily overloads the @tt{in} operator.  Its expressivity
makes it easy to write code with it.  On the flip side, its
flexibility makes it a little harder to know what it actually means.}

@item{Regular expressions, on the whole, match
@; Yeah, that's a pun.  I had to get that in somewhere... :)
well between the two
languages.  Minor differences in the syntax are potholes: Racket's
regular expression matcher does not have an implicit @emph{begin}
anchor, and Racket's regexps are more sensitive to escape characters.

Python's regexp engine returns a single match object that can support
different operators.  Racket, on the other hand, requires the user to
select between getting the position of the match, with
@racket[regexp-match-positions], or getting the textual content with
@racket[regexp-match].}
]