So, for several months I have been postponing a decision on what to change to solve my compiler's performance problems. I keep posting my thoughts, often rephrasing them, but I have the feeling I am converging.
As stated, there are two serious problems with memory usage: A) An uncanny number of chars, 15 million of them, which leads to 2.4GB of used heap space. B) During the final phase, a few million lines of assembly are generated, which hog roughly another 1GB of memory.
Problem A is solvable with a symbol table mapping texts to ints, and by not using texts internally for most symbolic reasoning, such as generating unique identifiers represented by texts on which to unify, or storing each and every fully qualified identifier by its very long text representation. Problem B is trivially solvable by emitting the compiled procedures to the generated C file one by one instead of first keeping them all in memory. So B is solvable in a few hours, and I don't think I need to give it much attention.
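A minimal interning sketch in C, assuming a hash table with chained buckets; the names `intern`, `sym_t`, and `SYMTAB_SIZE` are placeholders, not what the compiler actually uses:

```c
#include <stdlib.h>
#include <string.h>

#define SYMTAB_SIZE 4096            /* fixed bucket count, for the sketch */

typedef struct sym {
    struct sym *next;               /* hash-bucket chain */
    int         id;                 /* the int the compiler works with */
    char        text[];             /* the original spelling, stored once */
} sym_t;

static sym_t *buckets[SYMTAB_SIZE];
static int    next_id = 0;

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % SYMTAB_SIZE;
}

/* Map a text to a small int; the same text always yields the same int,
 * so all later symbolic reasoning can compare ints instead of texts. */
int intern(const char *text) {
    unsigned h = hash(text);
    for (sym_t *s = buckets[h]; s; s = s->next)
        if (strcmp(s->text, text) == 0) return s->id;
    sym_t *s = malloc(sizeof(sym_t) + strlen(text) + 1);
    strcpy(s->text, text);
    s->id   = next_id++;
    s->next = buckets[h];
    buckets[h] = s;
    return s->id;
}
```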
So, I need a symbol table. I don't wish to pass monads around, nor do I feel like cheating by implementing the symbol table in C, so I'll need a mutable global variable encapsulated in a module. I don't really like mutable state, so I don't think I am going to add language support for it; instead I am just going to add primitive routines to the runtime to expose, and assign to, fresh 'root' pointers into the heap. The downside of this solution is that, pure or impure, I am going to reason modulo a functional data structure for the symbol table, which might just be slow.
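In runtime terms, I imagine something like the following C sketch, where the collector treats a small array of registered roots as part of its root set; all names here are hypothetical:

```c
typedef void *cell_t;               /* heap reference, assumed type */

#define MAX_ROOTS 64
static cell_t roots[MAX_ROOTS];     /* scanned by the collector as roots */
static int    root_count = 0;

/* Reserve a fresh root slot; returns its index, or -1 when full. */
int root_alloc(void) {
    return (root_count < MAX_ROOTS) ? root_count - root_count + root_count++ : -1;
}

/* The two primitives: expose a root, and assign to it. A symbol table
 * module would hide one root index behind its interface. */
cell_t root_get(int r)           { return roots[r]; }
void   root_set(int r, cell_t v) { roots[r] = v; }
```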
Of course, say I implement a symbol table, it might hold several tens of thousands of characters which, stored as character lists, come to around a megabyte of heap space. Compressing that further is possible by having a native string representation, which would also mean an end to the rather slow character-list comparisons and a nicer manner of interfacing with C.
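A native string could be as simple as a length-prefixed byte array in the heap; a sketch, with hypothetical names, of what that buys for comparisons:

```c
#include <string.h>

/* One contiguous heap object instead of a cons list of characters. */
typedef struct {
    size_t length;                  /* byte count, no trailing NUL needed */
    char   bytes[];                 /* characters stored contiguously */
} hstring_t;

/* Comparison becomes one length check plus one memcmp, instead of
 * walking two character lists cell by cell. */
int hstring_eq(const hstring_t *a, const hstring_t *b) {
    return a->length == b->length
        && memcmp(a->bytes, b->bytes, a->length) == 0;
}
```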
But changing to strings opens up two cans of worms: 1) Allocating texts in the heap during a rewrite means you can run out of heap space in the middle of a combinator rewrite. 2) I heavily abuse libffi for calling native routines and use the native libffi structures for storing the calling convention, and those structures have no provision for discriminating between regular pointers and pointers to character arrays.
Problem 1 is addressable by tracking the registers used in a combinator rewrite and translating them to temporary root pointers into the heap, though such an approach will incur substantial overhead. Problem 2 is solvable by not using the libffi structures but writing my own FFI routines, or by getting rid of FFI altogether and simply compiling and linking against imported libraries, generating calls with the right conversions.
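Since the compiler already emits C, that last route could just generate one typed wrapper per import; a sketch under that assumption, with `cell_t` and the boxing helpers standing in for the real runtime representation:

```c
#include <math.h>

/* For the sketch, a cell is just a box around a double; the real
 * runtime's representation would differ. */
typedef struct { double d; } cell_t;

static double cell_to_double(cell_t c) { return c.d; }
static cell_t double_to_cell(double d) { cell_t c = { d }; return c; }

/* Generated wrapper for an imported `double pow(double, double)`:
 * unbox the arguments, call directly with the right C types, box the
 * result. No libffi calling-convention structure is consulted. */
cell_t ffi_pow(cell_t x, cell_t y) {
    return double_to_cell(pow(cell_to_double(x), cell_to_double(y)));
}
```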
Solving this in the general manner means replacing what I have now with a proper VM. Which is possible, but it also means that at some point you end up measuring the throughput of interpreted instructions in GB/s, and I am not sure I really want to go that way. On the other hand, I feel there is no other viable path in the long run.
Then there are two other points: A) the compiler generates pretty fat binaries, and B) at some point I want to be able to support multi-core processing.
Problem A is (probably) a side effect of compiling everything down to combinators without support for, for instance, local lets or local chaining of instructions; i.e., all sequential steps are represented by combinator applications, which are translated to C functions. Moreover, the SSA form used in the compiler might not be optimal: a push takes half the number of arguments of an assignment, and most of the assignments generated could be represented by pushes. Problem B I don't dare to think about anymore.
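To make the operand-count point concrete, a hypothetical encoding where an assignment names both a destination and a source, while a push names only a source and implies its destination:

```c
/* Illustrative instruction encodings; not the compiler's actual SSA. */
typedef enum { OP_ASSIGN, OP_PUSH } opcode_t;

typedef struct {
    opcode_t op;
    int      dst;       /* register written; unused for OP_PUSH */
    int      src;       /* register read */
} instr_t;

/* r5 := r3  versus  push r3: the push carries half the operands. */
instr_t as_assign = { OP_ASSIGN, 5, 3 };
instr_t as_push   = { OP_PUSH,   0, 3 };   /* dst implied by the stack */
```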
Maybe I should have a good look at Caml Light again? I guess the ML system gives at least one trivial answer: going with a lightweight VM is just worth it.