Monday, November 29, 2010

Design Decisions

In the rather minute space I have to move around in the operational semantics, one thing is clear: I need a symbol table. Not in the traditional sense of mapping symbols to meaning, but just as a hashmap mapping strings to integers so that the compiler can compare faster.

Now, there seem to be three ways of going about this: A) Add a purely functional map, so that not only an AST is passed around but also the symbol table. B) Add an impure map in a module by extending the language semantics with support for updateable arrays and strings. C) Add an impure map in a module by interfacing with C.

Option C has the least impact on the language...
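For option C, the interface can be tiny: a single foreign call that interns a string and returns an integer, so symbol comparison becomes an integer compare. A minimal sketch of the C side, assuming a fixed-size open-addressing table; all names here (symtab_intern, SYMTAB_SIZE) are hypothetical:

    /* Hypothetical option C: a string-to-int intern table on the C side. */
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>

    #define SYMTAB_SIZE 65536            /* power of two, assumed big enough */

    static char *symtab_keys[SYMTAB_SIZE];

    static uint64_t symtab_hash(const char *s) {   /* FNV-1a */
        uint64_t h = 14695981039346656037ULL;
        while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
        return h;
    }

    /* Equal strings always map to the same integer (their slot index). */
    int symtab_intern(const char *s) {
        uint64_t i = symtab_hash(s) & (SYMTAB_SIZE - 1);
        while (symtab_keys[i]) {
            if (strcmp(symtab_keys[i], s) == 0) return (int)i;
            i = (i + 1) & (SYMTAB_SIZE - 1);
        }
        symtab_keys[i] = strdup(s);
        return (int)i;
    }

After that, the language only ever sees the returned integers.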

Tuesday, November 23, 2010

15M * 80 bytes * 2

Something I missed. Right in the middle of a self-compile, a whopping 15M chars are allocated in the heap (must be the result of an int_to_char conversion). That means about 32B * 15M = 480MB of data is just sitting there, whereas it should have been constants outside of the heap.

Moreover, almost every char is part of a list, which roughly means 48 bytes extra per char, and stop-and-copy means a factor-two overhead on top of that. So the full formula is 2 * (32B + 48B) * 15M = 2.4GB...

Of course, this is bad news. The good news is that there should be an easy fix.
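If the fix is what the 'constants outside of the heap' remark suggests, it might be as simple as a static table of the 256 possible character records, handed out by reference so int_to_char never allocates. A hypothetical sketch against the record scheme described in the Bit Packing post below; the 'char' type tag and the names are assumptions:

    /* Hypothetical fix: one shared, statically allocated record per char
       value, so none of the 15M chars ever hits the heap. */
    #include <stdint.h>
    #include <string.h>

    static int64_t char_consts[256][4];

    int64_t *char_const(int c) {
        int64_t *r = char_consts[c & 0xff];
        if (r[0] == 0) {                 /* fill lazily on first use */
            r[0] = 4;                    /* size in cells */
            r[1] = 0;                    /* constant record, no pointers */
            memcpy(&r[2], "char", 5);    /* assumed type tag, inlined */
            r[3] = c & 0xff;             /* the character value */
        }
        return r;                        /* shared, never collected */
    }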

(Updated. I missed the line with 15MB of characters.)

Sunday, November 14, 2010

Bit Packing

The performance of a functional language is, to some extent at least, linearly dependent on the rate of allocation. The garbage collector used at the moment is a variant of Cheney's, and pretty fast. All in all, the perceived sluggishness of the resulting bootstrapped compiler is more a result of the compilation scheme and the routines in the implementation than anything else.

Anyway, I still shot myself in the foot by going for a bootstrapped language, for I end up paying with a slow compiler, even if the resulting programs are factors faster than Python or Perl. For the next incarnation of the runtime, I'll keep some of the current decisions, push thunks into a store, but bit-pack to get the allocation rate down.

At the moment, the memory model consists of size-tagged records of cells. First a size is given; the second cell in the header is either zero, in which case only integers are stored subsequently, or something else, in which case only pointers are stored. I'll give three examples of the current scheme.

Storing an int: An int '5' is stored as [4, 0, 'int\0', 5], four cells of 64 bits. The first cell gives the size, the second says it is a constant, the third holds the 64-bit coding of the string 'int' (it is not a pointer but the actual character array; you can store up to 7 bytes excluding the '\0' end marker in 64 bits), and lastly, the fourth cell holds the value 5.

Storing a record: A cons record 'cons 5 rest' is stored as [6, 1, *"cons", *"list", *5, *rest]. First the size, then a 1 tag stating that it is a record of pointers, a pointer to the record information, a pointer to the type information, a pointer to the record holding the constant 5, and a pointer to the tail of the list.

Storing a thunk: A thunk f x y is stored as [7, *F, *k, *rt, *rti, *x, *y]: the size of the thunk, a pointer to the evaluation routine, a pointer to the continuation, a pointer to the record where the result should be placed, a pointer to an index into that record, a pointer to the first argument, and lastly, a pointer to the second argument.
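In C terms, the current scheme amounts to untyped arrays of cells. A sketch of the int and cons layouts above, assuming a bump allocator heap_alloc (both function names here are hypothetical):

    #include <stdint.h>
    #include <string.h>

    typedef int64_t cell;                /* everything is a 64-bit cell */

    extern cell *heap_alloc(int n);      /* assumed Cheney-heap bump allocator */

    /* [4, 0, 'int\0', 5] -- a constant record holding an int */
    cell *alloc_int(int64_t v) {
        cell *r = heap_alloc(4);
        r[0] = 4;                        /* size in cells */
        r[1] = 0;                        /* zero: remaining cells are ints */
        memcpy(&r[2], "int", 4);         /* type string inlined in one cell */
        r[3] = v;
        return r;
    }

    /* [6, 1, *"cons", *"list", *head, *tail] -- a record of pointers */
    cell *alloc_cons(cell *info, cell *type, cell *head, cell *tail) {
        cell *r = heap_alloc(6);
        r[0] = 6;                        /* size in cells */
        r[1] = 1;                        /* nonzero: remaining cells are pointers */
        r[2] = (cell)info;               /* record information */
        r[3] = (cell)type;               /* type information */
        r[4] = (cell)head;
        r[5] = (cell)tail;
        return r;
    }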

It's a wasteful scheme, but it works, at least for a bootstrap. A lot of the pointers are to constants, which are never allocated in the heap. I thought that would be enough to take the pressure off the collector. Turned out I was wrong.

The point of bit packing should be to compress all static data, preferably into a header which is statically determinable at compile time. So, it makes sense to bit-pack the size, a tag for whether it is a constant record, and the RTTI into one header, and leave the rest. It makes sense to go with 64-bit headers, so I think I should use 32 bits for the size field, including one bit for tagging whether a constant record follows, and use the other 32 bits as indices into an RTTI structure. I guess I should split those 32 bits into two 16-bit fields, since I want to be able to dynamically check (fast) what the type of a value is.

Storing an int: An int '5' is stored as [0 + i'int' + 0 + 2, 5], two cells of 64 bits. First, a header in which are stored an index into the RTTI array denoting the string 'int', a zero bit indicating that this is a constant record, and a size of two describing the full size of the record. Second, 64 bits denoting the value 5.

Storing a record: A cons record 'cons 5 rest' is stored as [i'cons' + i'list' + 1 + 3, *5, *rest].

Storing a thunk: A thunk f x y is stored as [0 + i'func' + 1 + 7, *F, *k, *rt, *rti, *x, *y].
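The packed header then fits in a single cell. A sketch of one possible packing, assuming the split described above (a 16-bit record-info index, a 16-bit type index, one pointer-record bit, and a 31-bit size); the exact bit positions are my assumption:

    #include <stdint.h>

    typedef uint64_t header;

    /* assumed layout: [63..48] record-info index, [47..32] type index,
       [31] pointer-record bit, [30..0] size in cells                   */
    #define HDR(rec, ty, ptr, size) \
        (((header)(rec) << 48) | ((header)(ty) << 32) | \
         ((header)(ptr) << 31) | ((header)(size) & 0x7fffffff))

    #define HDR_REC(h)   ((unsigned)((h) >> 48))
    #define HDR_TYPE(h)  ((unsigned)(((h) >> 32) & 0xffff))
    #define HDR_PTR(h)   ((unsigned)(((h) >> 31) & 1))
    #define HDR_SIZE(h)  ((unsigned)((h) & 0x7fffffff))

    /* the examples above, with i_int etc. as RTTI indices:
       int 5: { HDR(0, i_int, 0, 2), 5 }
       cons:  { HDR(i_cons, i_list, 1, 3), (header)five, (header)rest } */

The fast dynamic type check is then a shift and a mask via HDR_TYPE.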

It gives a maximum of 31 bits for the size of records of cells, and a maximum of 32 bits for all declarations. It would halve the memory used (the int record above shrinks from four cells to two, the cons record from six to three), the number of live thunks is usually negligible (although a lot of them are allocated, of course), and it should more than double the performance.

Looks doable, but I still would have a slow compiler... (I need mutable state, array and string support, which are language extensions.)

Friday, November 5, 2010

It's all about Push

I thought a lot more, more on my bike than anything formal. But I puzzled it out. Thing is, I explicitly reserve space in a heap which behaves, more or less, like a stack where intermediates may become garbage. Thus a combinator will call the allocation routine a few times and fill in the contents of the reserved thunks. I.e., you get pseudo-code like:

   r0 = alloc(3);   /* reserve three cells */
   r0[0] = ...
   r0[1] = ...
   r0[2] = ...
   r1 = alloc(2);   /* reserve two more */
   r1[0] = ...
   r1[1] = &r0;     /* link to the first record */

Entirely unnecessary. It is a stack, not a heap; I should treat it like a stack, get rid of all the alloc calls, and just push values into it. For the example, that would mean two allocs less and half of the arguments gone (since movs are replaced by pushes), at the expense of some stack calculations.
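Under a push discipline, the example above could compile to something like this (a hypothetical sketch; sp is the top of the heap treated as a stack, growing upward):

    #include <stdint.h>

    typedef int64_t cell;

    static cell heap[1 << 20];           /* the heap, treated as a stack */
    static cell *sp = heap;              /* allocation/stack pointer */

    #define PUSH(v) (*sp++ = (cell)(v))

    void example(void) {
        cell *r0 = sp;                   /* r0 starts at the current top */
        PUSH(0); PUSH(0); PUSH(0);       /* the three cells (values elided above) */
        cell *r1 = sp;                   /* r1 follows immediately */
        PUSH(0);
        PUSH(r0);                        /* r1[1] = &r0 is now just a push */
    }

Both allocs disappear; every field write becomes a single push through sp.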

I could, of course, make a double-stack machine, where one stack is reserved for pushing closures, which are just popped in order, and the second for pushing data. It would, of course, mean that sometimes closures would have to be copied to the data stack. I'd end up with something akin to Clean's ABC code?

Problem: this only affects performance, i.e., the number of instructions generated. So it might be faster (if I can map to native push instructions), but it won't affect the space consumption of the generated code when running. That is entirely a matter of making the data representation more compact, generating more compact representations for data structures, and, for compiler performance, changing a number of routines in the compiler so that the code generated for them is less memory intensive.

For the sake of simplicity, I think I'd rather go in the direction of a one-stack machine with an incremental garbage collector to clear away generations of thunks and data pushed.