Tuesday, December 27, 2011

Handwritten Scanners and C Streams

Writing a scanner is simple, or at least it ought to be. Lexing usually involves writing down the regular expressions and coding them with a tool like Flex, or writing them by hand. The usual recommendation is to use a lexing tool, though there are sometimes reasons not to: portability, context-aware lexing of mixed languages, lexing very large files, Unicode, building reentrant scanners, and sometimes speed. GCC, for example, has a handwritten lexer.

Since regular expressions correspond to deterministic finite automata, LL(1) parsing should normally be enough, and C streams -which allow one byte of lookahead- should do the trick. Given lexing tools, that is.

When handwriting a lexer for more complex expressions, things get bizarre. What to do with 3.14e+a? Is it a floating point number lacking digits in the exponent, an error, or is it a floating point number followed by an identifier 'e'? In this case, one would really like to have more than one character of lookahead, but that is not guaranteed when using C streams.
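
To make the problem concrete, here is a rough sketch in plain C; the function name is made up and this is only meant to show the pushback limit, not a complete lexer.

    #include <ctype.h>
    #include <stdio.h>

    /* We have consumed "3.14e" and must decide whether an exponent follows.
       If it does not, we would like to undo two reads, but ungetc() only
       guarantees a single character of pushback. */
    static int scan_exponent(FILE *in)
    {
        int c1 = fgetc(in);
        if (isdigit(c1))
            return 1;                 /* "3.14e5": keep scanning digits */
        if (c1 == '+' || c1 == '-') {
            int c2 = fgetc(in);
            if (isdigit(c2))
                return 1;             /* "3.14e+5": keep scanning digits */
            ungetc(c2, in);           /* first pushback: guaranteed */
            ungetc(c1, in);           /* second pushback: may fail (EOF) */
            return 0;                 /* fall back to "3.14" + identifier */
        }
        ungetc(c1, in);
        return 0;
    }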

First: Two random thoughts, and requirements, on lexing I want to share, for no particular reason.

Most programming examples for handwritten lexers use 'get' -get a token- and 'push' -push back a token to the stream- as primitives; I prefer to use 'look' -look at the token- and 'skip' -remove/go to the next token- because it is cleaner and doesn't modify the input source.[1]
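
As an illustration, a minimal 'look'/'skip' pair over a C stream might look like the sketch below; the names and the struct are mine, and the same two primitives lift from characters to tokens.

    #include <stdio.h>

    /* One character of lookahead, held in the state instead of pushed back. */
    typedef struct {
        FILE *in;
        int   ahead;
    } charstream_t;

    static void cs_init(charstream_t *cs, FILE *in)
    {
        cs->in = in;
        cs->ahead = fgetc(in);          /* prime the lookahead */
    }

    static int cs_look(charstream_t *cs)    /* inspect, don't consume */
    {
        return cs->ahead;
    }

    static void cs_skip(charstream_t *cs)   /* advance to the next character */
    {
        cs->ahead = fgetc(cs->in);
    }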

One requirement on scanners is usually that a lexer should read (just) enough of the input stream to recognize one token and present that to the parser. The reason: very large files. Despite everything, reading a whole source code file just for the lexing phase is simply not done, and I'll abide by this one rule for no other reason than that the Gods of programming determined it should be so.

Now: The current question of writing a scanner in C.

I am stuck with this question: Can you write a lexer which doesn't put back tokens? And if so, how?


And the answer is: of course you can. Most lexing problems reduce to deterministic finite automata, and you can use the state space of the automaton to 'remember' the characters read. But it won't be reentrant, since the current state of the scanner then also includes the memorized characters. This is probably why C has pushback streams, though that helps little when recognizing complex regular expressions.

The best manner to model a handwritten scanner, I think, is therefore to look at it as some kind of Mealy machine, or some kind of push automaton. It reads characters until it has gobbled up enough to determine that it has recognized a token, emits that token, and saves the scanned but unused characters in its state, together with 'meta' state information about what state it should be in given the characters already read.

And presto, we have a design. My lexer should have the following minimal state: position information, a meta state encoding the currently recognized token and the fact of having scanned a number of characters ahead, the characters read, and the number of characters to emit. Though the last could be seen as part of the meta state.
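
Spelled out as a C struct, that state could look roughly like this; the field names and the fixed-size buffer are assumptions of mine, not a settled design.

    #define SCAN_AHEAD_MAX 32              /* longest partial token kept around */

    typedef struct {
        /* position information */
        const char *file;
        int line, column;

        /* meta state: which token is being recognized and how far we got */
        int state;

        /* characters scanned ahead but not yet handed to the parser */
        char ahead[SCAN_AHEAD_MAX];
        int  read;                         /* number of characters read ahead  */
        int  emit;                         /* how many of them form the token  */
    } scanner_t;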

For people who want to know more: Read the Dragon book. This is all information which I reverse engineered from thinking about it, but it should be known and written down in compiler books, though the peculiar interaction of lexing and C streams might not be.

[1] I am guessing at this, but I think the historic rationale is that in very resource-minimal environments it is best to have basically one primitive, 'get', and never do any pushback; i.e., it saves one call. But I have the feeling that 'look' and 'skip' are academically nicer. Unsure about this.

Monday, December 26, 2011

Unicode Decisions

Strings are peculiar things these days. Since I am now aiming at a high-level language with a fast interpreter, I need to decide what to do with character encodings and string representations.

The best I can think of for now is that characters should be represented internally as wide characters and strings as multibyte sequences.

Which breaks binding to C, since wide characters are much larger than the char C expects. Unless I think of something smart, I'll use this representation anyway, which also means that I'll need a handwritten lexer and parser, since generated tables explode with wide-character entries...
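
For what it's worth, the standard C calls for that split already exist. A small sketch of walking a multibyte string as wide characters; the literal and the locale handling are only illustrative and assume a UTF-8 environment.

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");            /* use the environment's encoding */

        const char *text = "h\xc3\xa9llo";  /* multibyte (UTF-8) string */
        mbstate_t st = {0};
        const char *p = text;
        wchar_t wc;
        size_t n;

        /* decode one wide character at a time out of the multibyte string */
        while ((n = mbrtowc(&wc, p, MB_CUR_MAX, &st)) != 0
                && n != (size_t)-1 && n != (size_t)-2) {
            printf("U+%04X\n", (unsigned)wc);
            p += n;
        }
        return 0;
    }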

No libffi, no lex, no yacc. No libffi also implies glue code or no cheap manner of reusing C libraries. Hmm.... Whatever?

Prototype ****s. Starting on an interpreter.

I installed the latest Fedora on my MacBook, and am now happily programming in C again. Programming is a blast given that I know how most of it should work.

That did it. I am going to work towards the fastest, smallest, and easiest interpreter in vanilla C while using the bootstrap compiler as a design.

It could take a year, of course...

Saturday, December 24, 2011

Bit Twiddling Logic on 64 bits

I decided on 64-bit headers. Since they will be used a lot, I want a fast scheme for packing and extracting the fields. A problem which occurred to me: if you design an elaborate bit-field scheme on 64-bit registers, you're bound to do exotic stuff like checking the 51st bit. If one isn't careful about it, that can generate lots of instructions with large 64-bit constants, or references to 64-bit constants.

So, 64-bit fields need a different approach than 8-bit fields.

Luckily, one can mostly rely on the facts that smaller integer fields are cheaper, that comparing to zero is normally cheap, and that negating an integer is cheap too. I am not sure about taking the absolute value without branching.

I need the following:

  • a bit to check whether a field is a constant series of bits,
  • an integer field mapping to a record table (of strings),
  • an integer field mapping to a type table (of strings),
  • an integer describing the full size of the record.

I decided on distributing them over 1, 15, 16, and 32 bits, without a real rationale other than that 64K of types is probably enough, and that a large record-size field can be abused to implement arrays.
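
A sketch of packing and extraction under that 1/15/16/32 split; the field order is my assumption. After shifting, only small masks remain, so no large 64-bit constants show up at the use sites.

    #include <stdint.h>

    typedef uint64_t header_t;

    /* assumed layout: [63] flag | [62..48] record | [47..32] type | [31..0] size */
    static inline header_t hdr_pack(unsigned flag, unsigned record,
                                    unsigned type, uint32_t size)
    {
        return ((header_t)(flag   & 0x1)    << 63)
             | ((header_t)(record & 0x7FFF) << 48)
             | ((header_t)(type   & 0xFFFF) << 32)
             |  (header_t)size;
    }

    static inline unsigned hdr_flag(header_t h)   { return (unsigned)(h >> 63); }
    static inline unsigned hdr_record(header_t h) { return (unsigned)(h >> 48) & 0x7FFF; }
    static inline unsigned hdr_type(header_t h)   { return (unsigned)(h >> 32) & 0xFFFF; }
    static inline uint32_t hdr_size(header_t h)   { return (uint32_t)h; }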

So I am thinking of using the signs of the fields for bit twiddling; or not even that, but just to store information. For example, a negative record-table value indicates a primitive value, and zero is reserved for thunks.
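
Reading the 15-bit record field back as a signed value then gives cheap tests. Again only a sketch of the convention, using the same assumed layout as above:

    /* sign-extend the 15-bit record field (bits 62..48); bit 14 of the field
       is its sign bit */
    static inline int hdr_record_signed(uint64_t h)
    {
        unsigned raw = (unsigned)(h >> 48) & 0x7FFF;
        return (raw & 0x4000) ? (int)raw - 0x8000 : (int)raw;
    }

    static inline int hdr_is_primitive(uint64_t h) { return hdr_record_signed(h) < 0; }
    static inline int hdr_is_thunk(uint64_t h)     { return hdr_record_signed(h) == 0; }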

Friday, December 23, 2011

Lessons learned

Going old-skool and bootstrapping a language is a great way to learn about compiler technology since most books touch on the what of compiler construction, and hardly on the why. I.e., one usually learns next to nothing from reading a book or copying a design. Having said that, implementing a language from scratch gives you a lot of why answers, but since you get some of the hows wrong, you end up implementing the whole enchilada again.

So the lessons learned:
  • Old-skool bootstrapping is a lousy idea, given that writing an interpreter is a lot easier and gives you a lot more time to experiment. Hugs, or was it Gofer, probably existed for a decade before people began building performant Haskell compilers, and I should have done the same. The bootstrap leap is just too big unless you copy a design - which I find an uninteresting exercise.
  • Lisp days are over. On 64-bit machines you want to use a packed text representation, since a list-of-characters representation consumes an awful lot of memory. This is both a problem of the compiler -had it been implemented similarly in Haskell, it would have the same memory consumption (2.4GB)- and of the naive unpacked 64-bit representation of the memory model. I'll change this.
  • The FFI scheme is dynamic, based on libffi, reasonably portable, but can't handle corner cases. Moreover, it adds a dependency and vanilla C is just so much nicer. I'll change this.
  • The combinator scheme used is about on par with an (old) JavaScript or Lisp compiler/interpreter. It's going to be at least one order of magnitude, possibly two, slower than C. Which is still an order of magnitude faster than Python, so I am not sure how to count that. There are three main reasons for the lagging performance:
    1. No stack, all thunks are allocated in the heap. A stack based runtime will always beat that, but programs will segfault from time to time because people will continue writing recursive functions which will consume the C stack. I won't change this.
    2. Lack of 'lets' or ANF form. Now all local computations are lifted out, leading to an explosion of combinators, compile time, and thunks allocated at runtime. I am not sure how much this costs, but I am assuming that I'll need to implement this. I'll change this.
    3. The compiler generates an unhealthily sized binary executable. I have no idea why, except for some hunches: a) it generates 64-bit assembly from C code, b) native assembly is more wordy than a concise minimal VM (32 or 16 bit) instruction set, and c) compiling to a 'concrete' machine is more costly than compiling to a SECD, or other lambda-calculus-inspired, machine. Having said that, Haskell binaries -despite the VM assembly- seem to be as bloated, so it might be unavoidable in compiling functional languages. I won't change this.
  • The compiler uses the concrete C bit words internally for primitive values. This leads to a very small compiler since there is essentially one primitive value type (to optimize on), but I actually grew tired of all the a priori word manipulations. Moreover, it leads to a dependency on C primitive value representations and, consequently, on the machine architecture. Nice for toy compilers, but actually just a lousy idea, so the alternative of elaborating all primitive types appeals now. I'll change this.
  • The compiler uses texts internally a lot for identification, which is lousy given how they are represented, but -when switching over to native strings- is probably as costly in memory consumption as integers but very costly in runtime performance. I won't change this.
  • The compiler has no explicit symbol table, which is again novel; it doesn't seem to matter much. What is far worse is probably the performance of the lookup. I won't change this now (see purity).
  • The compiler employs novel AST data traversal combinators which are good for implementing concise algorithms, since they mimic an intermediate between attribute-based compilation and direct traversal techniques, but they probably cost a factor of two in compilation speed. I won't change this.
  • The compiler uses a novel bottom-up constraint gatherer and a separate type inferencer, an idea copied from the Helium Haskell compiler, which was rushed; I am not sure of all the inference rules I use. The bottom-up part is academically nice, but, even with unfinished inference rules, I can see that no production compiler will ever want to take the associated performance hit. So, old-skool ML-like direct AST typing it is. I'll change this.
  • I ended up with a pure compiler (except for C calls), which I think is a bad idea. It doesn't use a symbol table at the moment, but a symbol table will probably come in handy when more information is needed for a more elaborate target VM. I personally loathe monads, so I should make the language impure. Hello ML? I'll change this.
I'll start with a new runtime and leave the thought on how to target the new runtime for a while. I blathered a bit with some C programmers about it, and by nature they'll always target an OCaml-like runtime or minimal design -I don't like that particular VM much- but my nature is different. The language should be robust, but not low-level, so I'll go in the direction of a Java-esque design, which means 64-bit headers for everything.

It should be fun, and I don't feel like working on it full time. So it'll take some time, again.

Wednesday, December 21, 2011

hi-language.org expired

Oh great, I didn't look at it for a while, now the domain expired and they deleted the whole site... Fine, the DNS entry expires, okay, but why delete a perfectly functioning Google Site? (Oh great, I found it again.)

Anyway, I decided I'd start the whole project again. I learned enough from the prototype to know what works, and what doesn't, so whatever. Think I'll just start a new site again; hopefully, most of the bits are still on this laptop.

Great, when a Google Sites domain expires, the automatic interaction between Google Apps and the hosting provider breaks. Two options: reregister with the hosting provider, reinstall the DNS records, and hope stuff will work from that point on, or -what I think is the least error-prone- register a new site...