Tuesday, December 27, 2011

Handwritten Scanners and C Streams

Writing a scanner is simple, or at least it ought to be. Lexing usually involves writing down the regular expressions for the tokens and implementing them, either with a tool like Flex or by hand. The usual recommendation is to use a lexing tool, though there are sometimes reasons not to: portability, context-aware lexing of mixed languages, lexing very large files, Unicode, building reentrant scanners, and sometimes speed. GCC, for example, has a handwritten lexer.

Since regular expressions correspond to deterministic finite automata, LL(1) techniques should normally be enough, and C streams, which allow one byte of lookahead through ungetc, should do the trick. Provided you use a lexing tool, that is.
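To make that one byte of lookahead concrete, here is a minimal sketch of a peek on a C stream; it uses nothing beyond getc and ungetc, and ISO C does guarantee a single character of pushback:

    #include <stdio.h>

    /* Peek at the next character of a stream by reading it and
       pushing it back. ISO C guarantees that one ungetc after a
       read succeeds, so this is safe. */
    int peek(FILE *fp)
    {
        int c = getc(fp);
        if (c != EOF)
            ungetc(c, fp);   /* a single pushback always works */
        return c;
    }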

When handwriting a lexer for more complex expressions, things get bizarre. What to do with 3.14e+a? Is it a floating-point number lacking digits in the exponent, an error, or a floating-point number followed by the identifier 'e'? In a case like this one would really like more than one character of lookahead, but C streams don't guarantee that.
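To make the problem concrete, here is a hypothetical fragment of such a lexer at exactly the awkward spot; assume it has already consumed "3.14e":

    #include <ctype.h>
    #include <stdio.h>

    /* The lexer has consumed "3.14e" and now looks for an exponent. */
    void exponent_trouble(FILE *fp)
    {
        int sign = getc(fp);   /* '+' */
        int next = getc(fp);   /* 'a': the exponent has no digits */
        if (!isdigit(next)) {
            /* The input was the number 3.14, the identifier 'e', a
               '+', and the identifier 'a' after all, so both
               characters (and the 'e' consumed earlier) have to be
               handed back. ISO C guarantees only ONE ungetc. */
            ungetc(next, fp);
            ungetc(sign, fp);  /* not guaranteed to succeed */
        }
    }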

First: two random thoughts, and requirements, on lexing that I want to share, for no particular reason.

Most programming examples for handwritten lexers use 'get' (get a token) and 'push' (push a token back onto the stream) as primitives; I prefer 'look' (look at the current token) and 'skip' (remove it, i.e., go to the next token), because that is cleaner and doesn't modify the input source.[1]
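For illustration, a hedged sketch of what a look/skip interface could look like in C; the names and types are mine, not an established API:

    #include <stdio.h>

    typedef enum { TOK_EOF, TOK_NUMBER, TOK_IDENT } token_kind_t;

    typedef struct {
        token_kind_t kind;
        char         text[64];    /* the lexeme */
    } token_t;

    typedef struct {
        FILE    *fp;        /* underlying character stream */
        token_t  current;   /* the token 'look' returns */
    } token_stream_t;

    token_t next_token(FILE *fp);   /* the scanner proper, elsewhere */

    /* 'look' inspects the current token; no side effects. */
    token_t look(const token_stream_t *ts) { return ts->current; }

    /* 'skip' is the only operation that advances the input. */
    void skip(token_stream_t *ts) { ts->current = next_token(ts->fp); }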

One usual requirement on scanners is that the lexer should read just enough of the input stream to recognize a single token and present that to the parser. The reason: very large files. Whatever one thinks of it, reading a whole source file up front for the lexing phase is simply not done, and I'll abide by this rule for no other reason than that the Gods of programming determined it should be so.

Now: the current question of writing a scanner in C.

I am stuck with this question: can you write a lexer which doesn't push back characters? And if so, how?


And the answer is: of course you can. Most lexing problems reduce to deterministic finite automata, and you can use the state space of the automaton to 'remember' the characters read. But it won't be reentrant, since the current state of the scanner then also consists of the memorized characters. This is probably why C has pushback on its streams, though a single character of pushback helps little when recognizing complex regular expressions.
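As an illustration of remembering characters in the state space, here is a hypothetical fragment of states for the 3.14e+a case; the state names are mine:

    /* Letting the automaton's state remember what was read, instead
       of pushing characters back. */
    enum state {
        S_NUMBER,         /* inside the digits of a number          */
        S_NUMBER_E,       /* number followed by 'e'                 */
        S_NUMBER_E_SIGN   /* number followed by 'e' and '+' or '-'  */
    };

    /* In S_NUMBER_E_SIGN a digit completes a float literal; anything
       else means: emit the number, then replay the remembered "e+"
       as the start of the next tokens. No pushback is needed, since
       the state itself encodes which characters were consumed. */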

The best way to model a handwritten scanner, I think, is therefore to look at it as some kind of Mealy machine, or some kind of push automaton. It recognizes characters until it has gobbled up enough to determine it has recognized a token, emits that token, and saves the scanned-but-unused characters in its state, together with 'meta' state information about what state it should be in given the characters already read.

And presto, we have a design. My lexer should have the following minimal state space: position information; a meta state encoding the currently recognized token and the fact of having read a number of characters ahead; the characters read; and the number of characters to emit. Though the last could be seen as part of the meta state.
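As a sketch only, here is one way that state could be laid out in C; every field name and size here is a guess of mine, not part of the design above:

    #define MAX_AHEAD 8    /* longest lookahead any token rule needs */

    typedef struct {
        /* position information */
        const char *filename;
        int         line, column;

        /* meta state: which token has been recognized so far, and how
           far the automaton has read ahead past its end */
        int         meta;

        /* characters read but not yet consumed by an emitted token */
        char        ahead[MAX_AHEAD];
        int         n_ahead;

        /* how many of the buffered characters make up the token to
           emit; arguably part of the meta state, as noted above */
        int         n_emit;
    } scanner_t;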

For people who want to know more: read the Dragon book. This is all information I reverse engineered by thinking about it, but it should be known and written down in compiler books, though the peculiar interaction of lexing and C streams might not be.

[1] I am guessing at this. But I think the historic rationale is that in very resource-constrained environments it is best to have essentially one primitive, 'get', and never do any pushback; i.e., it saves one call. But I have the feeling that 'look' and 'skip' are academically nicer. Unsure about this.
