The performance of a functional language is, to some extent, at least linearly dependent on the rate of allocation. The garbage collector used at the moment is a variant of Cheney's copying collector, and pretty fast. All in all, the perceived sluggishness of the resulting bootstrapped compiler is more a result of the compilation scheme and the routines in the implementation than anything else.
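(For reference, the copy step of a Cheney-style collector over the size-tagged records described below looks roughly like the sketch underneath. The names are my own, not the runtime's; constants outside the heap, and thunks, which carry no tag cell, would need their own handling.)

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t cell;                      /* one 64-bit cell */

    static cell *to_space, *scan, *free_ptr;    /* to-space and the two Cheney pointers */

    /* Copy one record into to-space, or return its forwarding pointer if it
       was copied already. */
    static cell *forward(cell *obj) {
        if (obj[1] == UINT64_MAX)               /* already copied: cell 0 holds the forwarding pointer */
            return (cell *)obj[0];
        uint64_t size = obj[0];
        cell *copy = free_ptr;
        memcpy(copy, obj, size * sizeof(cell));
        free_ptr += size;
        obj[0] = (cell)copy;                    /* install the forwarding pointer */
        obj[1] = UINT64_MAX;                    /* mark the old record as forwarded */
        return copy;
    }

    static void collect(cell **roots, int nroots) {
        scan = free_ptr = to_space;
        for (int i = 0; i < nroots; i++)
            roots[i] = forward(roots[i]);
        while (scan < free_ptr) {               /* breadth-first scan of the copied records */
            uint64_t size = scan[0], tag = scan[1];
            if (tag != 0)                       /* pointer record: forward every field */
                for (uint64_t i = 2; i < size; i++)
                    scan[i] = (cell)forward((cell *)scan[i]);
            scan += size;
        }
    }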
Anyway, I still shot myself in the foot by going for a bootstrapped language, since I end up paying with a slow compiler, even if the resulting programs are factors faster than Python or Perl. For the next incarnation of the runtime, I'll leave some of the current decisions as they are and keep pushing thunks into a store, but bit-pack to get the allocation rate down.
At the moment, the memory model consists of size-tagged records of cells. First comes the size; the second cell in the header is either zero, in which case only integers are stored in the remaining cells, or something else, in which case only pointers are stored. I'll give three examples of the current scheme.
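In C-like pseudocode, the scheme comes down to the sketch below; heap_alloc and alloc_record are hypothetical names of my own, not the actual runtime routines.

    #include <stdint.h>

    typedef uint64_t cell;

    /* A record is an array of 64-bit cells, [size, tag, payload...]; tag 0
       means the payload holds plain values (a constant record), anything
       else means it holds pointers. */
    extern cell *heap_alloc(uint64_t ncells);   /* assumed bump allocator */

    static cell *alloc_record(uint64_t npayload, uint64_t tag) {
        cell *r = heap_alloc(npayload + 2);     /* two header cells plus the payload */
        r[0] = npayload + 2;                    /* first cell: total size in cells */
        r[1] = tag;                             /* second cell: 0 = constants, non-zero = pointers */
        return r;
    }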
Storing an int: An int '5' is stored as [4, 0, 'int\0', 5], four cells of 64 bits. The first cell gives the size, the second says it is a constant record, the third holds the 64-bit coding of the string 'int' (it is not a pointer but the actual character array; you can store up to 7 bytes excluding the '\0' end marker in 64 bits), and lastly, the fourth cell holds the value 5.
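The 7-character trick and the int example, in the same sketchy style and building on the helpers above; pack_name and alloc_int are my own names.

    #include <string.h>

    /* Pack a name of at most 7 characters directly into one 64-bit cell,
       '\0'-padded, so the name needs no pointer and no heap allocation. */
    static cell pack_name(const char *s) {
        cell c = 0;
        strncpy((char *)&c, s, 7);              /* at most 7 bytes; the zeroed cell supplies the '\0' */
        return c;
    }

    /* The int '5' from the example: [4, 0, pack_name("int"), 5].
       (Literal constants would normally live outside the heap, but the
       layout is the same.) */
    static cell *alloc_int(int64_t v) {
        cell *r = alloc_record(2, 0);           /* constant record, two payload cells */
        r[2] = pack_name("int");
        r[3] = (cell)v;
        return r;
    }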
Storing a record: A cons record 'cons 5 rest' is stored as [6, 1, *"cons", *"list", *5, *rest]. First the size, then a 1 tag stating that it is a record of pointers, a pointer to the record information, a pointer to the type information, a pointer to the record holding the constant 5, and a pointer to the tail of the list.
Storing a thunk: A thunk 'f x y' is stored as [7, *F, *k, *rt, *rti, *x, *y]. The size of the thunk, a pointer to the evaluation routine, a pointer to the continuation, a pointer to the record where this result should be placed, a pointer to an index into that record, a pointer to the first argument, and lastly, a pointer to the second argument.
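The other two examples in the same style, again with hypothetical helper names on top of the sketches above.

    /* The cons record from the example: [6, 1, *"cons", *"list", *5, *rest]. */
    static cell *alloc_cons(cell *rec_info, cell *type_info, cell *head, cell *tail) {
        cell *r = alloc_record(4, 1);           /* pointer record, four payload cells */
        r[2] = (cell)rec_info;                  /* record information */
        r[3] = (cell)type_info;                 /* type information */
        r[4] = (cell)head;                      /* the record holding the constant 5 */
        r[5] = (cell)tail;                      /* the tail of the list */
        return r;
    }

    /* The thunk from the example: [7, *F, *k, *rt, *rti, *x, *y]; its header
       carries only the size. */
    typedef cell *(*eval_fn)(cell *thunk);
    static cell *alloc_thunk(eval_fn f, cell *k, cell *rt, cell *rti, cell *x, cell *y) {
        cell *r = heap_alloc(7);
        r[0] = 7;                               /* size */
        r[1] = (cell)f;                         /* evaluation routine */
        r[2] = (cell)k;                         /* continuation */
        r[3] = (cell)rt;                        /* record the result should be placed in */
        r[4] = (cell)rti;                       /* index into that record */
        r[5] = (cell)x;                          /* first argument */
        r[6] = (cell)y;                          /* second argument */
        return r;
    }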
It's a wasteful scheme, but it works, at least for a bootstrap. A lot of pointers will be to constants, which are never allocated in the heap. I thought that would be enough to take the pressure off the collector. Turned out I was wrong.
The point of bit packing should be to compress all static data, preferably into a header which is statically determinable at compile time. So it makes sense to bit-pack the size, a tag saying whether it is a constant record, and the RTTI into one header, and leave the rest alone. It makes sense to stick with 64-bit headers, so I think I should go with: 32 bits for the size field, including one bit tagging whether a constant record follows, and the other 32 bits as indices into an RTTI structure. I guess I should split those 32 bits into two times 16 bits, since I want to be able to check dynamically (and fast) what the type of a value is.
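Roughly, I picture the packed header like the sketch below; only the field widths follow from the paragraph above, the exact bit positions are my own choice here.

    #include <stdint.h>

    /* 64-bit header: the low 32 bits hold the size, with the top bit of
       those 32 reused as the constant/pointer flag, leaving 31 bits for the
       size itself; the high 32 bits are two 16-bit indices into the RTTI
       tables. */
    typedef uint64_t header;

    #define HDR_SIZE(h)    ((uint32_t)(h) & 0x7fffffffu)   /* 31-bit size in cells */
    #define HDR_PTRS(h)    (((h) >> 31) & 1u)              /* 0 = constant record, 1 = pointers */
    #define HDR_TYPE(h)    ((uint16_t)((h) >> 32))         /* 16-bit type index, for fast type checks */
    #define HDR_RECORD(h)  ((uint16_t)((h) >> 48))         /* 16-bit record/declaration index */

    static header mk_header(uint32_t size, unsigned ptrs,
                            uint16_t type_idx, uint16_t record_idx) {
        return  (header)(size & 0x7fffffffu)
              | ((header)(ptrs & 1u) << 31)
              | ((header)type_idx << 32)
              | ((header)record_idx << 48);
    }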
Storing an int: An int '5' is stored as [0+i'int'+0+2, 5], two cells of 64 bits. First, a header in which are packed an index into the RTTI array denoting the string 'int', a zero bit stating that this is a constant record, and a size of two describing the full size of the record. Second, 64 bits denoting the value 5.
Storing a record: A cons record 'cons 5 rest' is stored as [i'cons'+i'list'+1+3, *5, *rest].
Storing a thunk: A thunk f x y is stored as [0+i'func'+1+7, *F, *k, *rt, *rti, *x, *y].
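The three examples again as a sketch on top of the helpers above; I_INT, I_CONS, I_LIST and I_FUNC are placeholders for the compile-time RTTI indices written i'int', i'cons', i'list' and i'func' in the text.

    enum { I_INT = 1, I_CONS = 2, I_LIST = 3, I_FUNC = 4 };   /* placeholder RTTI indices */

    static cell *alloc_int2(int64_t v) {
        cell *r = heap_alloc(2);
        r[0] = mk_header(2, 0, I_INT, 0);       /* [0+i'int'+0+2, 5] */
        r[1] = (cell)v;
        return r;
    }

    static cell *alloc_cons2(cell *head, cell *tail) {
        cell *r = heap_alloc(3);
        r[0] = mk_header(3, 1, I_LIST, I_CONS); /* [i'cons'+i'list'+1+3, *5, *rest] */
        r[1] = (cell)head;
        r[2] = (cell)tail;
        return r;
    }

    static cell *alloc_thunk2(eval_fn f, cell *k, cell *rt, cell *rti, cell *x, cell *y) {
        cell *r = heap_alloc(7);
        r[0] = mk_header(7, 1, I_FUNC, 0);      /* [0+i'func'+1+7, ...] */
        r[1] = (cell)f;  r[2] = (cell)k;  r[3] = (cell)rt;
        r[4] = (cell)rti; r[5] = (cell)x; r[6] = (cell)y;
        return r;
    }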
It gives a maximum of 31 bits for the size of a record of cells, and 32 bits, split into two 16-bit indices, for all declarations. It would roughly halve the memory used (an int drops from four cells to two, a cons record from six to three; a thunk stays at seven cells, but the number of live thunks is usually negligible, although a lot of them are allocated, of course), and more than double the performance.
Looks doable, but I would still have a slow compiler... (I need mutable state, array, and string support, which are language extensions.)