Monday, December 27, 2010

Push-Enter vs Eval-Apply

I use a simple 'push-enter' method where the callee determines, given the arity of the combinator and the number of arguments passed, how to rewrite a closure. Most optimizing functional compilers do arity-analysis and use an 'eval-apply' compilation scheme where the caller generates the code for the arity known at compile-time.

I can see most of the benefits; primarily, it makes it possible to compile to C using its normal calling convention. I.e., instead of trampolining thunks in the heap, most calls are translated to regular calls using the stack (the machine's calling convention), and closures are built in the heap on demand.
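The contrast can be sketched in C. This is only an illustration of the two calling conventions, not my actual runtime; all the names (`closure`, `enter`, `apply2`) are made up for the sketch:

```c
#include <assert.h>

/* Hypothetical closure: a combinator with a known arity and the
   arguments collected so far. Illustrative only. */
typedef struct closure {
    int arity;            /* arguments the combinator needs   */
    int count;            /* arguments supplied so far        */
    long args[4];         /* small fixed store for the sketch */
    long (*code)(long *); /* the compiled combinator body     */
} closure;

/* Push-enter: the caller just pushes; the callee decides whether
   to rewrite or to remain a partial application in the heap. */
long enter(closure *c, long arg) {
    c->args[c->count++] = arg;
    if (c->count == c->arity)   /* saturated: rewrite (call) */
        return c->code(c->args);
    return 0;                   /* under-saturated: stay a closure */
}

/* Eval-apply: the caller knows the arity at compile time and emits
   a direct call using the machine's own stack-based convention. */
long apply2(long (*code)(long *), long x, long y) {
    long args[2] = { x, y };
    return code(args);
}

static long add_body(long *a) { return a[0] + a[1]; }
```

In the push-enter case the arity test happens at run time in the callee; in the eval-apply case the compiler has already resolved it, which is what makes the direct C call possible.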

Translating to an abstract machine which uses a stack and a heap, instead of just a heap, isn't a trivial move. Theoretically, combinators translate easily to a stack/heap machine, but at the moment I fail to understand all the ramifications.

Is it true, and why would this be the case, that Clean's ABC machine is as fast as optimized native code?

Wednesday, December 22, 2010

What are Type Classes?

I joined the Haskell-cafe mailing list out of curiosity, and looked at some posts with some type-hacks. One of the questions it immediately threw up is: "What is a type-class, exactly?"

One of the reasons Hi has simple interfaces on typed values is that there is no short, direct answer to the above in plain English. Well, that and the fact that there are programs for which Haskell has, or had, problems deriving the type, or acts counter-intuitively. (Don't ask me for examples, it was just stuff I found in certain papers.)

So, since I am bored for the moment, I think I'll spend some time seeing whether there is an easy, neat answer to that question.

[ Sterling was so kind to start a Reddit thread on it, see the replies to this post. I don't like the current answers yet since I want a non-phenomenological explanation; I believe a construct in a programming language should have a clear description in plain English which precisely describes the operational behavior of constructs such that programmers do not experience surprises in using them. Most phenomenological explanations don't do it for me, neither does a precise mathematical explanation since you really don't want to understand code in terms of the typing algorithm, or a specification of that.]

Friday, December 17, 2010

SECD, G-Machines, Combinatorial Machines and DOTs

I am wondering whether I should sit down and write down my musings on DOT evaluation, which I once dubbed the C-Machine, as a pun on the G-Machine and my target language C. Thing is, I devised it myself and don't really know what to think about it.

On first glance, it is just a reverse polish term rewrite system with eager semantics which is compiled down to a machine. It doesn't really make sense to see the DOT notation as a machine like SECD; for example, there is no code stack. On the other hand, it's not very complex, but there are a bunch of restrictions, invariants, and extra rules, which make it somewhat baroque for a notation.

Still, it is the simplest I could come up with, and it is a lot simpler than SECD or ZINC. It would make a lot more sense for people to, for instance, verify this simple scheme than a SECD machine, which has its roots in Lisp and for which there is no real design rationale except for the fact that it works.

I also wonder whether SECD is faster by design, since lambda terms are mapped to it trivially (I guess), and I need the extra overhead of compiling down to combinators.

Moreover, why bother? I thought it might be fast, but at the moment, I overshot three performance metrics (code size, running time, and memory consumption), each by a factor of ten. Still, I am not sure that is the result of the scheme used. Or rather, the explosion seems to occur in the translation from combinators in DOT notation to abstract assembly, so I probably messed up there. Does GHC still compile down to combinators internally?

Thursday, December 16, 2010

Log 121610

Changed the runtime to use the Hi FFI type structure as the normative reference instead of the libffi FFI type structure. Will need to extensively unit test this. Another feature: by seeing character data as values I'll end up copying to the heap, which is slow, and should probably automatically free whatever C returns, which may not always play well with the intention of C routines. Hey! It's a feature, not a bug.

Wednesday, December 15, 2010

The Woes of Change

To get to native string support, I ran into a set of mundane problems, the most pressing being libffi support.

Since my language is bootstrapped, moving to string support is a non-trivial exercise since it changes a fundamental assumption on how data is represented.

In order to get it to work, I need to advance by going through the following steps:

  1. Devise a runtime which supports native strings
  2. Build a compiler which targets that runtime
  3. Build a compiler which uses that runtime
  4. Compile 2 with the regular compiler
  5. Compile 3 with 2
  6. Compile 3 with 3
Whoever taught me that bootstrapped languages are easily extended?

One problem is in the change of the runtime. To keep the implementation minimal, all Hi FFI type structures are compiled down to libffi type structures, and all subsequent data conversions are done with the libffi type structure as the normative reference. The latter choice I made because working with the libffi type structure is a bit faster and more easily expressible in C than working with the Hi FFI type structure; it also keeps the number of invariants down.

Problem now is that since I want to treat character data as values, but must pass a pointer to libffi, I want to discriminate between ordinary pointers and pointers to character data. Woe on me, I need an extra type tag in libffi types, which seems to be impossible. I ran into the expression problem in C.

There are two options:
  1. A quick-and-dirty fix: encode character pointers as something else, such as a structure containing just one pointer, or an unsigned integer of pointer size. Not nice, and it will break if an ABI treats such an encoding differently from a pointer.
  2. Take a step back, and use the Hi FFI type structure as the basis of all data conversions, which is a bit less nice, since I need to write some extra code for that, and it is probably somewhat slower.
Ah well. The better solution is to take a step back.
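The step back amounts to keeping the distinction on the Hi side, where I control the type structure. A minimal sketch of that idea in C, with made-up tag names (the real Hi FFI structure is richer than this):

```c
#include <assert.h>

/* A Hi-side FFI type tag that keeps the distinction libffi
   cannot express: ordinary pointers vs pointers to character
   data. Tag names are illustrative. */
typedef enum { HI_FFI_INT, HI_FFI_POINTER, HI_FFI_CHARPTR } hi_ffi_tag;

/* Both pointer tags map to a plain machine pointer for the
   actual libffi call... */
int is_pointer_like(hi_ffi_tag t) {
    return t == HI_FFI_POINTER || t == HI_FFI_CHARPTR;
}

/* ...but the data-conversion code can branch on the Hi tag,
   e.g. to copy character data into the heap as a value. */
int needs_copy_to_heap(hi_ffi_tag t) {
    return t == HI_FFI_CHARPTR;
}
```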

Tuesday, December 14, 2010

Log 121410

I did some programming, and threw all the code away again. I am gonna cheat a bit and represent integers, floats, and other literals by their text representation, and write conversion routines for them to native words. It keeps the number of AST nodes small, since it doesn't introduce a separate case for each literal (like int, long, float, double), at the expense of inspecting the text representation from time to time.
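The idea fits in a few lines of C; the names are illustrative, not the compiler's actual AST:

```c
#include <assert.h>
#include <stdlib.h>

/* One literal node holding the source text, rather than one AST
   case per literal type. Conversion to a native word happens only
   when the value is actually needed. */
typedef struct { const char *text; } ast_literal;

long   literal_to_long(ast_literal l)   { return strtol(l.text, NULL, 10); }
double literal_to_double(ast_literal l) { return strtod(l.text, NULL); }
```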

Monday, December 13, 2010

Log 121310

After the joy of bootstrapping my language, and the subsequent tedious analysis of performance problems, I am programming again!

I failed to generate any interest in my language, which is somewhat expected. Programmers want completed tools, and other language writers are more interested in their own projects, so I guess all interest will be suspended until I create a performant front-end to gcc or llvm. Which may never happen. It is questionable whether the simple rewrite system employed can be exploited to generate fast binaries. The DOT notation is a straight intermediate between a SECD machine and a TRS/GRS. Possibly, instead of inheriting the best of both worlds, it inherits the worst of both worlds performance-wise.

But, ah well, it beats playing Sokoban. Changes I am now concentrating on are getting literal support (including strings) right before, possibly, changing to a ref-counting model.

Refcounting support isn't hard, but implementing it will probably mean wasting a few weeks on debugging. Still unsure about it. I was worried about running out of scratch space during a rewrite; now I get the feeling supporting strings may mean I need less scratch space. Simple examples like [F s -> cons 'x' (cons 'y' (cons 'z' s))] actually need more scratch space than [F s -> append "xyz" s] in case the text s is small.

Now doing a bit of boring work; the number of literals just means a lot more cases on the AST. It would be nicer if I could somehow factor literal support out with direct general support in the language, but it seems that would imply writing a lot of code.

Thursday, December 9, 2010

On 64 Bits, A Text Symbol is a Small Constant

In my mind, I've been going over how to improve the compiler performance. Now, I don't keep a symbol table at the moment, and I am thinking about supporting strings instead of character lists in the compiler.

I could, of course, also implement a symbol table just to make all strings unique and abstractly calculate with integers mapping into the symbol table. But say I implemented native strings: then most local variables (like 'x' or 'x#30') in the symbol table will be less than eight characters long, and most fully qualified identifiers (like 'system.print' or 'map.foldr') will be less than sixteen characters long.

That means most symbols are just one or two word sizes long!


I.e., it makes as much sense to use the native string representation of a symbol instead of an abstract integer mapping into a symbol table, since the native string symbol will most likely take as much heap space as an integer.

For my compiler, due to the slowness of term rewriting, a text symbol should be seen as a constant, just like an integer, where the abstract costs of operations are O(1), just like on integers.
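A back-of-the-envelope sketch of the claim in C, assuming a 64-bit word; `pack_symbol` is a hypothetical helper, only there to show that a short symbol occupies exactly one word and compares like an integer:

```c
#include <assert.h>
#include <stdint.h>

/* Pack up to eight characters of a symbol into one 64-bit word.
   Equality on such symbols is then a single integer comparison.
   Illustrative only; longer symbols would need a second word or
   a fallback representation. */
uint64_t pack_symbol(const char *s) {
    uint64_t w = 0;
    for (int i = 0; i < 8 && s[i]; i++)
        ((char *)&w)[i] = s[i];
    return w;
}
```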

Tuesday, December 7, 2010

Implementing a Symbol Table, Without a Symbol Table

The Hi compiler, at the moment, doesn't implement a symbol table. It is implemented as a series of transformations on the AST which may carry extra information, like the bindings of symbols to definitions. If a transformation needs those bindings, it is preceded by a run over the AST which builds a map holding the current bindings.

Building a map is a cheap operation; it is hardly noticeable during the execution of the compiler, and it has the extra nice property that I don't need to keep a symbol table in sync with the AST, which may go wrong when definitions are added and removed.

The heart of the memory consumption problem stems from the naive manner in which I deal with symbols: they are all represented as text. All identifiers are elaborated during identification to their fully qualified names. I.e., a function 'bar' in namespace 'foo' is mapped onto the text 'foo.bar'. Moreover, texts are not unique in their memory representation, and symbols generated during semantic analysis are represented as texts. Hence, a compiler which wastes memory and, moreover, is slow on look-up, since text equality is implemented on lists of characters.

(Yeah, I knew it would be slow and waste memory. I thought it would take a few MBs, little did I know it would mean consuming 2.4GB.)

Moving to string support means getting rid of the expensive check for equality and not using as much memory.

Being a bit smarter with the representation of symbols, i.e. representing a qualified symbol as a list of texts instead of a list of chars, would also help.

Mapping symbols to integers in a symbol table would also be helpful, i.e. text comparisons are replaced by integer comparisons at the expense of another global data structure. Question is: Can I do without? Abstractly, I'll need to change the internal representation of an identifier from text to some abstract value of type symbol.

Let's say I choose this representation:

    type symbol = [ raw text
                  | qualified (list symbol)
                  | abstract int ]
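Rendered as a C tagged union, the same representation would look roughly like this (a sketch; the compiler's real data structures may differ):

```c
#include <assert.h>

/* The three symbol cases as a tagged union. */
typedef enum { SYM_RAW, SYM_QUALIFIED, SYM_ABSTRACT } sym_tag;

typedef struct symbol {
    sym_tag tag;
    union {
        const char *raw;                                  /* raw text     */
        struct { struct symbol **parts; int n; } qualified; /* list symbol */
        int abstract;                                     /* abstract int */
    } u;
} symbol;
```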

Is it worth it to go for a distinct data structure which holds all symbols? Another manner of keeping memory consumption low is just traversing the AST with an extra pass which makes all 'raw' texts unique and maps local, not global, identifiers to integers. A bit slower on comparison, sure, but I keep the advantage of not needing a separate symbol table.

(I thought about it. I'll implement a symbol module since it is orthogonal to implementing texts as native strings, or reference counting the allocated data, and makes explicit that reasoning over symbols is best done abstractly and factored out of the code. Possibly, I'll add string support, maybe reference counting, later.)

(And then again, sigh, life is never easy. If I go for native text support, flattening a piece of text or comparison might take less memory and be faster... A 64 bit model means you can cram 8 bytes in the size of a pointer, and that is not including the overhead of 'boxing the pointer' in a cons cell. Implementing a symbol abstraction might just not be worth it.)

Not Convinced

I want string support. The language needs a collector which doesn't run out of space during a rewrite. One manner of achieving that is going for reference counting. Other manners are just increasing the amount of free heap space before a rewrite, or possibly even a try-and-retry implementation where a rewrite is retried after a garbage collect if it ran out of heap space during rewriting.

Reference counting has some good properties, like not using more memory than needed by the program; it gets rid of the factor 2.0 needed by stop-and-copy. Also, for future extensions, it is nice that it collects dead data immediately if coupled with references and finalizers attached to system resources. For example, you never keep files open longer than you need them.

Going for reference counting means:

  • A more difficult runtime where a count is held in the header of a node.
  • Implementing ALLOC, INCREF and DECREF operations. In the scheme I envision, ALLOC becomes a call, INCREF a series of instructions, DECREF another call. Currently, ALLOC is just returning a pointer followed by an increment.
  • Major problems are throughput and memory fragmentation.
Question remains: Is a lot of calls plus INCREF/DECREF operations cheap enough?
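The scheme I have in mind can be sketched in C; the node layout and names are illustrative, not the actual runtime:

```c
#include <assert.h>
#include <stdlib.h>

/* A count in the header of each node, as envisioned. */
typedef struct node {
    int refcount;
    struct node *kids[2];
} node;

/* ALLOC becomes a call. */
node *node_alloc(void) {
    node *n = calloc(1, sizeof(node));
    n->refcount = 1;
    return n;
}

/* INCREF is a few inline instructions. */
void incref(node *n) { n->refcount++; }

/* DECREF is another call: at zero, release the children and
   return the node's memory. */
void decref(node *n) {
    if (n && --n->refcount == 0) {
        for (int i = 0; i < 2; i++) decref(n->kids[i]);
        free(n);
    }
}
```

The open question from above is exactly the cost of all those calls on every rewrite step, compared to the current bump-pointer ALLOC.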

(Great. Most current reference counting literature is on Java/C#. So far: RC in Java has a low factor overhead; RC in C# makes the code grind to a halt. But there are complex RC implementations for both Java, by Bacon and Petrank, and C#, by Joisha, which claim good performance. Also, reference counting Lisp?)

Thursday, December 2, 2010

Or Reference Counting?

CDL3 is a compiler compiler I worked with almost twenty years ago. One of the things I noticed is that, to some extent, the operational model of Hi is close to it, though it has a very different implementation.

I discarded reference counting and went for a Cheney-style collector, since reference counting is not 'in fashion' anymore, as it has two problems: 1) you touch an enormous number of objects for which you need to increase the counter, possibly leading to slow performance; 2) it cannot handle cyclic structures without another garbage collector on top.

However, one of the ugly invariants I need to maintain is that I either never run out of heap space during a combinator rewrite, or I make it possible to garbage collect intermittently by tracking the registers in the body of a rewrite.

But, there might be another manner, which I think is possibly used in the implementation of CDL3. Clever buggers, them.

The solution might be just to not have any heap space whatsoever, but to use C's malloc/free for every allocation, thereby ensuring you can't run out of heap space anywhere, use reference counting, which is possible since Hi doesn't allow cyclic structures, and move garbage to a free list. This has the advantage over Cheney that I don't need an extra to-space (halving memory consumption) and don't need to schedule global garbage collects (leading to more deterministic behavior). I end up with a scheme where I can still compile to C, and allocate any number of intermediate cells when necessary.

The disadvantage is that it may be slow. But maybe I should believe them clever buggers. It's been working for them for twenty years...
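A minimal sketch of the free-list idea in C, assuming cells of one uniform size (the real CDL3 implementation I can only guess at):

```c
#include <assert.h>
#include <stdlib.h>

/* Every cell comes from a free list backed by malloc; dead cells
   go back on the list instead of waiting for a global collect,
   so a rewrite can never run out of a preallocated heap. */
typedef struct cell { struct cell *next; long payload; } cell;

static cell *free_list = NULL;

cell *cell_alloc(void) {
    if (free_list) {                 /* reuse a collected cell */
        cell *c = free_list;
        free_list = c->next;
        return c;
    }
    return malloc(sizeof(cell));     /* otherwise grow via malloc */
}

void cell_free(cell *c) {            /* refcount hit zero */
    c->next = free_list;
    free_list = c;
}
```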

Wednesday, December 1, 2010

Converging

So, for several months I have been postponing deciding what to change to solve my compiler performance problems. I keep posting my thoughts, often rephrasing them, but I have the feeling I am converging.

As stated, there are two serious problems with memory usage: A) an uncanny number of chars, 15 million of them, which lead to 2.4GB of used heap space; B) during the final phase, a few million lines of assembly are generated which also hog approximately 1GB of memory.

Problem A is solvable with a symbol table mapping texts to ints, and by not using texts internally for most symbolic reasoning, like generating unique identifiers represented by texts on which to unify, or storing each and every fully qualified identifier by its very long text representation. Problem B is trivially solvable by emitting the compiled procedures to the generated C file one by one instead of first keeping them all in memory. So B is solvable in a few hours, and I don't think I need to give it much attention.

So, I need a symbol table. I don't wish to pass monads around, neither do I feel like cheating by implementing the symbol table in C, so I'll need a mutable global variable encapsulated in a module. I don't really like mutable state, so I don't think I am going to add language support for it; instead, I am just going to add primitive routines to the runtime to expose and assign to fresh 'root' pointers into the heap. The bad point of this solution is that, pure or impure, I am going to reason modulo a functional data structure for a symbol table, which might just be slow.

Of course, say I implement a symbol table; it might hold several tens of thousands of characters, or around a megabyte of heap space. Further compressing it is possible by having a native string representation, which would also mean an end to the rather slow character-list comparisons and a 'nicer' manner of interfacing with C.

But changing to strings opens up two cans of worms: 1) Allocating texts in the heap during a rewrite means you can run out of heap space in the middle of a combinator rewrite. 2) I heavily abuse libffi for calling native routines and use the native libffi structures for storing the calling convention which doesn't have provisions for discriminating between regular pointers and pointers to character arrays.

Problem 1 is addressable by tracking the registers used in a combinator rewrite, translating them to temporary root pointers into the heap. Such an approach will incur a substantial overhead. Problem 2 is solvable by not using libffi structures but writing my own FFI routines, or getting rid of that altogether and just compiling and linking against imported libraries by generating calls with the right conversions.
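The temporary-root idea for problem 1 can be sketched in C; the macros are illustrative, not the Hi runtime's actual interface:

```c
#include <assert.h>

/* Registers used in a rewrite live in a root array the collector
   can scan, so a garbage collect in the middle of a combinator
   rewrite cannot lose live intermediate values. */
#define MAX_ROOTS 64
static void *roots[MAX_ROOTS];
static int nroots = 0;

#define PUSH_ROOT(p) (roots[nroots++] = (void *)(p))
#define POP_ROOTS(n) (nroots -= (n))
```

The overhead is visible right in the sketch: every register of every rewrite gains a push on entry and a pop on exit.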

Solving this in the general manner means adapting what I have now into a proper VM. Which is possible, but it also means at some point you end up measuring the throughput of interpreted instructions in GB/s; I am not sure I really want to go that way. On the other hand, I feel like there's no other possible path in the long run.

Then there are two other points: A) the compiler generates pretty fat binaries, and B) at some point I want to be able to support multi-core processing.

Problem A is (probably) a side effect of compiling everything down to combinators without support for, for instance, local lets or local chaining of instructions; i.e., all sequential steps are represented by combinator applications which are translated to C functions. Moreover, the SSA forms used in the compiler might not be optimal: a push takes half the number of arguments of an assignment, and most of the assignments generated could be represented by pushes. Problem B I don't dare to think about anymore.

Maybe I should have a good look at Caml light again? I guess the ML system gives at least one trivial answer: going with a light-weight VM is just worth it.

A Trip Down Memory Lane

All compilers go through a quantitative analysis phase where performance is studied. I liked this paper: A Trip Down Memory Lane in Haskell, Neil C. C. Brown, Adam T. Sampson.

Their experience is with a language called Tock.

(I need to get rid of two spikes: 15M (2.4GB) of character data, 30M (>1.5GB) of generated instructions.)