I changed the runtime a bit. FFI calls are memoized, memory pools are cached. The Ackermann test (optimized without overloading) now runs in a bit more than a second on a 2.26 GHz machine, and in three seconds on a 798 MHz one. This should mean a faked entry of 4.3 seconds on the Windows benchmark shootout, which is almost Lua speed and faster than the slowest Lisp entries. Some optimization (local computations, inlining) and I should beat Lua speed, and then I am done with that.
(I normalized to 550 MHz. The result remains skewed, of course. The biggest question is what the difference between 32 and 64 bits means. A 32-bit machine would halve the Hi runtime cell size, so on 64 bits you need double the memory bandwidth. No idea whether a Pentium has a 32-bit or a 64-bit bus.)
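For the record, the normalization is nothing more than linear clock scaling. A minimal sketch, assuming runtime is inversely proportional to clock frequency (which ignores memory bandwidth, cache sizes and word width, hence the skew); the function name `normalize_seconds` is made up for illustration:

```c
#include <stdio.h>

/* Scale a measured runtime to a reference clock.
 * Assumption: runtime is inversely proportional to clock frequency. */
static double normalize_seconds(double seconds, double measured_mhz,
                                double reference_mhz)
{
    return seconds * measured_mhz / reference_mhz;
}

/* Helper that prints the figure for the 798 MHz measurement. */
static void print_normalized(void)
{
    printf("%.2f s\n", normalize_seconds(3.0, 798.0, 550.0));
}
```

With the 798 MHz measurement of three seconds this gives 3 × 798 / 550, which is about 4.35 seconds at 550 MHz.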
A fast compile of the sources now takes 20 minutes and 3GB of memory. Still, better than the 15 hours it took with the OCaml-based AST interpreter. Guess I'll need support for proper literals next; 3GB is a bit over the top. (Which means a 1GB from-region: I double each time and keep the last 4 freed regions in cache.)
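Roughly, the region policy looks like the sketch below. This is not the actual runtime code, just an illustration of "double each time, keep the last 4 freed regions"; the names `region_t`, `region_acquire`, `region_release` and `region_grow` are hypothetical, and the copying of live data is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 4               /* keep the last 4 freed regions */

typedef struct {
    void  *base;                    /* start of the region, NULL if empty */
    size_t size;                    /* size in bytes */
} region_t;

/* Slot 0 holds the most recently released region; zero-initialised. */
static region_t cache[CACHE_SLOTS];

/* Return a region of at least `size` bytes, reusing a cached one if any fits. */
static region_t region_acquire(size_t size)
{
    assert(size > 0);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].base != NULL && cache[i].size >= size) {
            region_t r = cache[i];
            cache[i].base = NULL;   /* slot is free again */
            cache[i].size = 0;
            return r;
        }
    }
    region_t r = { malloc(size), size };
    return r;                       /* r.base is NULL if malloc failed */
}

/* Put a region into the cache; the oldest slot is dropped to make room. */
static void region_release(region_t r)
{
    free(cache[CACHE_SLOTS - 1].base);          /* free(NULL) is a no-op */
    for (int i = CACHE_SLOTS - 1; i > 0; i--)
        cache[i] = cache[i - 1];
    cache[0] = r;
}

/* Growth policy: the next region is twice the size of the old one. */
static region_t region_grow(region_t old)
{
    region_t bigger = region_acquire(old.size * 2);
    /* A real collector would copy the live data from old to bigger here. */
    region_release(old);
    return bigger;
}
```

The point of the cache is that a collection which needs a region of a size it has used before does not go back to the system allocator. The price is that up to four dead regions stay resident, which is part of why the memory figure looks so large.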
This is fine-tuning, actually. A generational garbage collector would improve speed, but it might also help if, instead of processing all lambda terms to combinators, I just translated each term to a combinator and then compiled that to C directly. There is a spot where it just blows up, and it doesn't make sense to keep all that information in the heap.
(Hmm, there is of course the question of how long values within a function's body are retained. I should check that all top-level references are actually thrown away after their use. It may well be that it keeps the AST, the lambda terms, the combinators, and the intermediate code in memory.)