Wednesday, April 25, 2012


Cubic rubbish. As stated, I've been looking at doing a reimplementation in C, as an interpreter, and, for the last few months, have mostly been doing other stuff. Like games, or watching movies.

Now I am coding, I am engineering a lexer implementation, and please ignore the last post since it is, well, rubbish. Since I want Unicode support, I've been reading up on C wide characters, and assume for the moment that wide characters are Unicode characters. All nice and dandy.

Now writing a lexer for Unicode most likely means writing a hand coded lexer since the state space for automatically generated scanning automata usually explodes. Also nice and dandy, I've written various small toy languages in my life, and hand coding a lexer shouldn't be that hard.

Rubbish, hand coding a lexer is hard. In the past, I have never written lexers for floating point literals, and floating point scanning breaks the usual one or two symbols lookahead invariant. Moreover, I've often written lexers in functional languages, usually employing my own set of parser combinators meanwhile employing the benefits of functional transparency, curried functions, closures, and garbage collection giving me arbitrary lookahead. With the prime benefit that every step one would advance automagically would be undone for each failure to recognize (part of) a token. Now, I need primitives to back up to a recognized portion of the input in case some alternative fails.

So, stuff I needed to consider when writing a lexer in C: no parser combinators, no curried applications, no referential transparency, no (automatic) state handling, no garbage collection, can't use lexer generators, can't automagically rewind portions of the input.

So I have been looking at other interpreters with hand coded lexers, and what do I find for GCC, Lua, and Python? Rubbish. Okay, be nice: Elaborate rubbish.

I have no clue how people managed to write programs in the past. Except for rubbish.