Monday, December 27, 2010

Push-Enter vs Eval-Apply

I use a simple 'push-enter' method where the callee determines, given the arity of the combinator and the number of arguments passed, how to rewrite a closure. Most optimizing functional compilers do arity-analysis and use an 'eval-apply' compilation scheme where the caller generates the code for the arity known at compile-time.

I can see most  of the benefits, primarily, it makes it possible to compile to C using it's normal calling convention. I.e., instead of trampolining thunks in the heap, most calls are translated to regular calls using the stack (the machine's calling convention) and closures are build in the heap on demand.

Translating to an abstract machine which uses a stack and heap, instead of just a heap, isn't a trivial move. Though, theoretically, combinators are easily translated to a stack/heap machine, at the moment, I fail to understand all the ramifications.

Is it true, and why would this be the case, that Clean's ABC machine is as fast as optimized native code?

Wednesday, December 22, 2010

What are Type Classes?

I joined the Haskell-cafe mailing list out of curiosity, and looked at some posts with some type-hacks. One of the questions it immediately threw up is: "What is a type-class, exactly?"

One of the reasons Hi has simple interfaces on typed values is that there is no short direct answer to the above in plain English. Well, that and the fact that there are programs for which Haskell has, or had, problems deriving the type or acts counter-intuitive. (Don't ask me for examples, it was just stuff I found in certain papers.)

So, since I am bored for the moment, think I'll spend some time whether there is an easy neat answer to that question.

[ Sterling was so kind to start a Reddit thread on it, see the replies to this post. I don't like the current answers yet since I want a non-phenomenological explanation; I believe a construct in a programming language should have a clear description in plain English which precisely describes the operational behavior of constructs such that programmers do not experience surprises in using them. Most phenomenological explanations don't do it for me, neither does a precise mathematical explanation since you really don't want to understand code in terms of the typing algorithm, or a specification of that.]

Friday, December 17, 2010

SECD, G-Machines, Combinatorial Machines and DOTs

I am wondering whether I should sit down and write down my musings on DOT evaluation, I once dubbed it the C-Machine, as a pun at the G-Machine and my target language C. Thing is, I devised it myself and don't really know what to think about it.

On first glance, it is just a reverse polish term rewrite system with eager semantics which is compiled down to a machine. It doesn't really make sense to see the DOT notation as a machine like SECD; for example, there is no code stack. On the other hand, it's not very complex, but there are a bunch of restrictions, invariants, and extra rules, which make it somewhat baroque for a notation.

Still, it is the simplest I could come up with, and it is a lot simpler than SECD or ZINC. It would make a lot more sense for people to, for instance, verify this simple scheme than a SECD machine which has its roots in Lisp and for which there are no real design rationale except for the fact that it works.

I also wonder whether SECD is faster by design, since lambda terms are mapped to it trivially (I guess), and I need the extra overhead of compiling down to combinators.

Moreover, why bother? I thought it might be fast, but at the moment, I overshot three performance metrics, code size, running time and memory consumption, each by a factor ten. Still, I am not sure that is the result of the scheme used. Or rather, the explosion seems to occur in the translation from combinators in Dot notation to abstract assembly, so, I probably messed up there. Does GHC still compile down to combinators internally?

Thursday, December 16, 2010

Log 121610

Changed the runtime to use the Hi ffi type structure as the normative reference instead of the libffi ffi type structure. Will need to extensively unit test this. Another feature, by seeing character data as values I'll end up copying to the heap, which is slow, and should probably automatically free whatever C returns, which may not always play well with the intension of C routines. Hey! It's a feature, not a bug.

Wednesday, December 15, 2010

The Woes of Change

To get to native string support, I ran into a set of mundane problems, the most pressing libffi support.

Since my language is bootstrapped, moving to string support is a non-trivial exercise since it changes a fundamental assumption on how data is represented.

In order to get it to work, I need to advance by going through the following steps:

  1. Devise a runtime which supports native strings
  2. Build a compiler which targets that runtime
  3. Build a compiler which uses that runtime
  4. Compile 2 with the regular compiler
  5. Compile 3 with 2
  6. Compile 3 with 3
Whoever taught me that bootstrapped languages are easily extended?

One problem is in the change of the runtime. To keep the implementation minimal, all Hi FFI type structures are compiled down to libffi type structures, and all subsequent data conversions are done with the libffi type structure as the normative reference. The latter choice I made because working with the libffi type structure is a bit faster and more easily expressible in C than working with the Hi FFI type structure, it also keeps the number of invariants down.

Problem now is that since I want to treat character data as values, but must pass a pointer to libffi, I want to discriminate between ordinary pointers and pointers to character data. Woe on me, I need an extra type tag in libffi types, which seems to be impossible. I ran into the expression problem in C.

There are two options:
  1. A quick-and-dirty fix, encode character pointers as something else such as a structure containing just one pointer, or an unsigned integer of pointer size. Not nice, and it will break if an ABI treats such an encoding different as a pointer.
  2. Take a step back, and use the Hi FFI type structure as the basis of all data conversions, which is a bit less nice, since I need to write some extra code for that, and it is probably somewhat slower.
Ah well. The better solution is to take a step back.

Tuesday, December 14, 2010

Log 121410

I did some programming, and threw all code away again. I am gonna cheat a bit and represent integers, floats, and other literals by their text representation and write conversion routines for them to native words. It keeps the number of AST nodes small since it doesn't introduce a separate case for each literal (like int, long, float, double) at the expense of inspecting the text representation from time to time.

Monday, December 13, 2010

Log 121310

After the joy of bootstrapping my language, and the subsequent tedious analysis of performance problems, I am programming again!

I failed to produce any interest in my language, which is somewhat expected. Programmers want completed tools, and other language writers are more interested in their own projects, so, I guess all interest will be suspended until I create a performing front-end to gcc or llvm. Which may never happen. It is questionable whether the simple rewrite system employed can be exploited to generate fast binaries. The DOT notation is a straight intermediate between a SECD machine and a TRS/GRS. Possibly, instead of inheriting the best of both worlds, it inherits the worst of both worlds performance wise.

But, ah well, it beats playing Sokoban. Changes I am now concentrating on are getting literal support (including strings) right before, possibly, changing to a ref-counting model.

Refcounting support isn't hard, but implementing it will probably mean wasting a few weeks on debugging. Still unsure about it. I was worried of running out of scratch space during a rewrite, now I get the feeling supporting strings may mean I need less scratch space. Simple examples like  [F s -> cons 'x' (cons 'y' (cons 'z' s)) ] actually need more scratch space than [F s -> append "xyz" s] in case the text s is small.

Now doing a bit boring work, the amount of literals just mean a lot more cases on the AST. It would be nicer if I could somehow factor literal support out with direct general support in the language, but it seems that would imply writing a lot of code.

Thursday, December 9, 2010

On 64 Bits, A Text Symbol is a Small Constant

In my mind, I've been going over how to improve the compiler performance. Now, I don't keep a symbol table at the moment, and I am thinking at supporting strings instead of character lists in the compiler.

I could, of course, also implement a symbol table just to make all strings unique and abstractly calculate with integers mapping into the symbol table. But, say, I implemented native strings, then most local variables (like 'x' or 'x#30') in the symbol table will be less than eight characters long, and most fully qualified identifiers (like 'system.print' or 'map.foldr') will be less than sixteen characters long.

That means most symbols are just one or two word sizes long!

I.e., it makes as much sense to use the native string representation of a symbol instead of an abstract integer mapping into a symbol table since the native string symbol will most likely take as much heap space as an integer.

For my compiler, due to the slowness of term rewriting, a text symbol should be seen as a constant, just like an integer, where the abstract costs of operations are O(1), just like on integers.

Tuesday, December 7, 2010

Implementing a Symbol Table, Without a Symbol Table

The Hi compiler, at the moment, doesn't implement a symbol table. It is implemented as a series of transformations on the AST which may carry extra information, like the bindings of symbols to definitions. If a transformation needs those bindings, it is preceded by a run over the AST which builds a map holding the current bindings.

Building a map is a cheap operation, it is hardly noticeable during the execution of the compiler, and it has an extra nice property that I don't need to keep a symbol table in sync with the AST, which may happen if definitions are added and removed.

The heart of the memory consumption problem stems from the naive manner in which I deal with symbols, they are all represented as text. All identifiers are elaborated during identification to their fully qualified name.  I.e., a function 'bar' in namespace 'foo' is mapped onto the text ''. Moreover, texts are not unique in their memory representation and symbols generated during semantic analysis are represented as texts. Hence a compiler which wastes memory, moreover is slow on look-up since text equality is implemented on lists of characters.

(Yeah, I knew it would be slow and waste memory. I thought it would take a few MBs, little did I know it would mean consuming 2.4GB.)

Moving to string support means getting rid of the expensive check for equality and not using as much memory.

Being a bit more smart with the representation of symbols, i.e. representing a qualified symbol as a list of texts, instead of a list of chars, would also help.

Mapping symbols to integers in a symbol table would also be helpful, i.e. text comparisons are replaced by integer comparisons at the expense of another global data structure. Question is: Can I do without? Abstractly, I'll need to change the internal representation of an identifier from text to some abstract value of type symbol.

Let's say I choose this representation:

    type symbol = [ raw text 

                  | qualified (list symbol) 
                  | abstract int ]

Is it worth it to go for a distinct data structure which holds all symbols? Another manner of keeping memory consumption low is just traversing the AST with an extra pass which makes all  'raw' texts unique and mapping local, not global, identifiers to integers. A bit slower on comparison, sure, but I keep the advantage of not needing a separate symbol table.

(I thought about it. I'll implement a symbol module since it is orthogonal to implementing texts as native strings, or reference counting the allocated data, and makes explicit that reasoning over symbols is best done abstractly and factored out of the code. Possibly, I'll add string support, maybe reference counting, later.)

(And then again, sigh, life is never easy. If I go for native text support, flattening a piece of text or comparison might take less memory and be faster... A 64 bit model means you can cram 8 bytes in the size of a pointer, and that is not including the overhead of 'boxing the pointer' in a cons cell. Implementing a symbol abstraction might just not be worth it.)

Not Convinced

I want string support. The language needs a collector which doesn't run out of space during a rewrite. One manner of achieving that is going for reference counting. Other manners are just increasing the number of free heap space before a rewrite, or possibly even a try-and-retry implementation where a rewrite is just retried after a garbage collect if it ran out of heap space during rewriting.

Reference counting has some good properties like not using more memory than needed by the program, it gets rid of a factor 2.0 needed by stop-and-copy. Also, for future extensions, it is nice that it collects dead data immediately if coupled with references and finalizers attached to system resources. For example, you never keep files open longer than you need them.

Going for reference counting means:

  • A more difficult runtime where a count is held in the header of a node.
  • Implementing ALLOC, INCREF and DECREF operations. In the scheme I envision ALLOC becomes a call, INCREF a series of instructions, DECREF another call. Currently, ALLOC is just returning a pointer followed by an increment.
  • Major problems are throughput and memory fragmentation.
Question remains: Is a lot of calls plus INCREF/DECREF operations cheap enough?

(Great. Most current reference counting literature is on Java/C#. So far, RC in Java a low factor overhead, RC in C# makes the code go to a grinding halt. But there are both complex RC implementations for Java, by Bacon and Petrank, and C#, by Joisha, with claim good performance. Also, reference counting lisp?)

Thursday, December 2, 2010

Or Reference Counting?

CDL3 is a compiler compiler I worked with now almost some twenty years ago. One of the things I noticed is that to some extend, the operational model of Hi is close to it. Though it has a very different implementation.

I discarded reference counting and went for a Cheney style collector since it is not 'in fashion' anymore since it has two problems: 1) you touch an enormous amount of objects for which you need to increase the counter leading to possible slow performance, 2) it cannot handle cyclic structures except for another garbage collector on top.

However, one of the ugly invariants I need to maintain is that I either never run out of heap space during a combinator rewrite, or I make it possible to garbage collect intermittently by tracking the registers in the body of a rewrite.

But, there might be another manner, which I think is possibly used in the implementation of CDL3. Clever buggers, them.

The solution might be just to not have any heap space what soever, but just use C's malloc/free lists for every allocation thereby ensuring you can't run out of heap space anywhere and just use reference counting, which is possible since Hi doesn't allow for cyclic structures, and move garbage to a free list. This has the advantage over Cheney that I don't need an extra to space (halving memory consumption) and do not need to schedule for global garbage collects (leading to more deterministic behavior). I end up with a scheme where I can still compile to C, and allocate any number of intermediate cells when necessary.

The disadvantage is that it may be slow. But maybe I should believe them clever buggers. It's been working for them for twenty years...

Wednesday, December 1, 2010


So, for several months I have been postponing about what to change to solve my compiler performance problems. I keep posting my thoughts, often rephrasing them, but I have the feeling I am converging.

As stated, there are two serious problems with memory usage: A) An uncanny amount of chars, 15 million of them, which lead to 2.4GB of used heap space. B) During the final phase, a few million lines of assembly are generated which also hog the memory for approximately 1GB.

Problem A is solvable with a symbol table mapping texts to ints and not using texts internally for most symbolic reasoning, like generating unique identifiers represented by texts on which to unify or storing each and every fully qualified identifier by its very long text representation. Problem B is trivially solvable by just emitting the compiled procedures to the generated C file one-by-one instead of first keeping them all in memory. So, B is solvable in a few hours, and I don't think I need to give it much attention.

So, I need a symbol table. I don't wish to pass monads around, neither do I feel like cheating by implementing the symbol table in C so I'll need a mutable global variable encapsulated in a module. I don't really like mutable state so I don't think I am going to add language support for it, instead I am just going to add primitive routines on the runtime to be able to expose and assign to fresh 'root' pointers into the heap. The bad point of this solution being that, pure or impure, I am going reason modulo a functional data structure for a symbol table which might just be slow.

Of course, say I implement a symbol table, it might hold several tens of thousands of characters, or around a megabyte of heap space. Further compressing it down is possible by having a native string representation which would also mean an end to the rather slow character list comparisons and a 'nicer' manner of interfacing with C.

But changing to strings opens up two cans of worms: 1) Allocating texts in the heap during a rewrite means you can run out of heap space in the middle of a combinator rewrite. 2) I heavily abuse libffi for calling native routines and use the native libffi structures for storing the calling convention which doesn't have provisions for discriminating between regular pointers and pointers to character arrays.

Problem 1 is addressable by tracking registers used in a combinatorial rewrite by translating them to use temporary root pointers into the heap. Such an approach will incur a substantial overhead. Problem 2 is solvable by not using libffi structures but by writing my own FFI routines, or getting rid of that altogether and just link and compile against imported libraries by generating calls with the right conversions.

Solving this the general manner means adapting what I have now with a proper VM. Which is possible, but also means at some point you end up measuring the throughput of interpreted instructions in GB/s, I am not sure I really want to go that way. On the other hand, I feel like there's no other possible path in the long run.

Then there are the two other points of: A) The compiler generates pretty fat binaries And, B) at some point being able to support multi-core processing.

Problem A is (probably) a side effect of compiling everything down to combinators without support for, for instance, local lets or local chaining of instructions, i.e., all sequential steps are represented by combinator applications which are translated to C functions. Moreover, the SSA forms used in the compiler might not be optimal, a push takes half the number of arguments of an assignment and most of the assignments generated could be represented by pushes. Problem B I don't dare to think about anymore.

Maybe I should have a good look at Caml light again? I guess the ML system gives at least one trivial answer: going with a light-weight VM is just worth it.

A Trip Down Memory Lane

All compilers go through an quantitative analysis phase where performance is studied. I liked this paper: A Trip Down Memory Lane in Haskell,Neil C. C. Brown, Adam T. Sampson.

Their experience is with a language called Tock.

(I need to get rid of two spikes: 15M (2.4GB) of character data, 30M (>1.5GB) of generated instructions.)

Monday, November 29, 2010

Design Decisions

In the rather minute space I have in moving around in the operational semantics, one thing is clear: I need a symbol table. Not in the traditional sense of mapping symbols to meaning but just as a hashmap mapping strings to integers such that the compiler can compare faster.

Now, there seems to be three ways of going around this: A) Add a map purely functional such that not only an AST is passed around but also the symbol table. B) Add an impure map in a module by extending the language semantics with support for updateable arrays and strings. C) Add an impure map in a module by interfacing with C.

Option C has the least impact on the language...

Tuesday, November 23, 2010

15M * 80 bytes * 2

Something I missed. Right in the middle of a self-compile a whopping 15M chars are allocated in the heap (must be the result of an int_to_char conversion). That means about 32 * 15MB = 480MB of data is just sitting there whereas they should have been constants outside of the heap.

Moreover, almost every char is part of a list, which roughly means 48 bytes extra, and stop-and-copy means a factor two overhead too. So, the full formula is 2 * (32B + 48B) * 15M = 2.4GB....

Of course, this is bad news. The good news is that there should be an easy fix.

(Updated. I missed the line with 15MB of characters.)

Sunday, November 14, 2010

Bit Packing

The performance of a functional language is to some extend at least linearly dependent on the rate of allocation. The garbage collector used at the moment is a variant of Cheney, and pretty fast. All in all, the perceived sluggishness of the resulting bootstrapped compiler is more a result of the compilation scheme and the routines in the implementation than anything else.

Anyway, I still shot myself in the foot by going for a bootstrapped language for I end up paying with a slow compiler, even if the resulting programs are factors faster than Python or Perl. For the next incarnation of the runtime, I'll leave some of the current decisions, push thunks into a store, but bit-pack to get the allocation rate down.

At the moment, the memory model consists of size-tagged records of cells. First, a size is given, the second argument in the header is either a zero, in which case only integers are stored subsequently, or something else, in which case only pointers are stored. I'll give three examples of the current scheme.

Storing an int: An int '5' is stored as [4, 0, 'int\0', 5], four cells of 64 bits. The first cell gives the size, the second says it is a constant, the third holds the 64 bit coding of a string 'int' (it is not pointer but the actual character array, you can store up to 7 bytes excluding the '\0' end marker in 64 bits), lastly, the fourth cell holds the value 5.

Storing a record: A cons record 'cons 5 rest' is stored as [6 , 1, *"cons", *"list", *5, *rest]. First the size, than a 1 tag stating that it is a record of pointers, a pointer to the record information, a pointer to the type information, a pointer to the record holding the constant 5, and a pointer to the tail of the list.

Storing a thunk: A thunk f x y is stored as [7, *F, *k, *rt, *rti, *x, *y]. The size of the thunk, a pointer to the evaluation routine, a pointer to the continuation, a pointer to the record where this result should be placed, a pointer to an index into that record, a pointer to the first argument, and lastly, a pointer to the second argument.

It's a wasteful scheme, but it works, at least for a bootstrap. A lot of pointers will be to constants, which are never allocated in the heap. I thought that would be enough to take the pressure of the collector. Turned out I was wrong.

The point of bit packing should be to compress all static data, preferably in a header which is statically determinable at compile-time. So, it makes sense to bit pack the size, a tag whether it is a constant record, and RTTI into one header, and leave the rest. It makes sense to go with 64 bits headers so, I think I should go with: 32 bits for the size field including one bit for tagging whether a constant follows, and use the other 32 bits as indices into an RTTI structure. Guess I should split those 32 bits into two times 16 bits since I want to be able to dynamically check (fast) what the type of a value is.

Storing an int: An int '5' is stored as [0+i'int;'+0+2, 5], two cells of 64 bits. First, a header tag in which is stored an index into the RTTI array denoting the string 'int', a zero bit that this is a constant record, and a size of two describing the full size of the record. Second, 64 bits denoting the value 5.

Storing a record: A cons record 'cons 5 rest' is stored as [i'cons'+i'list'+1+3 , *5, *rest].

Storing a thunk: A thunk f x y is stored as [0+i'func'+1+7, *F, *k, *rt, *rti, *x, *y].

It gives a maximum of 31 bits for the size of records of cells, and a max of 32 bits for all declarations. It would halve the memory used, the number of live thunks is usually negligible (although a lot of them are allocated, of course), and more than double the performance.

Looks doable, but I still would have a slow compiler... (I need mutable state, array and string support, which are language extensions.)

Friday, November 5, 2010

It's all about Push

I thought a lot more, more on my bike than anything formal. But, I puzzled it out. Thing is, I explicitly reserve space in a heap which behaves, more or less, like a stack where intermediates may become garbage. Thus a combinator will call the allocation routine a few times and fill the content of a reserved thunk. I.e., you get (pseudo-code) like:

r0 = alloc(3);
   r0[0] = ...
   r0[1] = ...
   r0[2] = ...
   r1 = alloc(2);
   r1[0] = ...
   r1[1] = &r0;

Entirely unnecessary. It is a stack, not a heap, I should treat it like a stack, and get rid of all the alloc calls and just push values into it. For the example, that would mean, two allocs less, half of the arguments gone (since movs are replaced by push), but at the expense of some stack calculations.

I could, of course, make a double stack machine, where one stack is reserved for pushing closures and closures are just popped in order, and the second for pushing data. It would, of course, mean that sometimes closures should be copied to the data stack. I'll end up with something akin to Clean ABC code?

Problem, this only affects performance, number of instructions generated. So, it might be faster (if I can map to native push instructions), but it won't affect the space consumption of the generated code when running. Which is entirely a case of making the data-representation more compact, generating more compact representations for data-structures, and -for compiler performance- changing a number of routines in the compiler to make generated code for that less memory intensive.

For the sake of simplicity, I think I'ld rather go in the direction of a one-stack machine with an incremental garbage collector to clear away generations of thunks and data pushed.

Thursday, October 28, 2010

Still not pleased

I am still not programming. The thing which bothers me the most is that I don't feel too much like going in the direction of a 'classical' VM -i.e. a VM with primitive types, loads, stores, a calling mechanism and a garbage collector- and generating code for that. Why? My gut feeling tells me I will generate just too much code in the conversion from combinator to the code for a classical VM, rewriting a combinator expression just entails producing code with lots of allocations, and copying of pointers/slots from the term being rewritten to the term being produced.

What I want is a translation scheme to some Graph Rewrite System, or abstract machine, with a high level instruction set which avoids the current explosion when going from combinator to instructions. I am wondering whether the current scheme is just too concrete in its naive translation. Can I get rid of explicit data nodes? Can I translate to, not a graph machine, but trivial combinators on a stack machine?

Saturday, October 23, 2010


Working on the Hi language goes through times where new and old ideas gobble up and I look at them from various angles where actually no code is being written. Ah well, the two, strike that, three things which need to happen, are 1. full conformance (long integers, floats and doubles, native strings, a fast type-checker), 2. a VM with native support for strings, 3. performance.

I'll need to dump the current translation scheme, which is a pity since by keeping it simple I could use the C optimizer to do a lot for free like register allocation and lots of assembly instruction optimizations.

Another of the big drawbacks of going to a VM -which should support strings- is that in the current compilation scheme an allocation is cheap. Functions are translated to combinators, combinators rewrite the heap. The assumption at the moment is that a combinator will always rewrite (to) a DAG, and that there is always enough space in the heap to allocate the intermediates. That way, a) allocation is cheap (just an increment), b) I only garbage collect in between rewrites, which is a fast scheme by itself.

If I am gonna allow string allocations in between, I will open up a severe can of worms, since the assumption that there will always be enough heap space for intermediates must be dropped, thus garbage collections must be traced, thus intermediate pointers from the 'stack' must be tracked and chased also. So, I might as well go for the whole stack of optimizations, hello typed assembly and graph-coloring algorithms.

I use SSA form in the intermediate assembly with an infinite supply of intermediate local registers which is why I also looked at LLVM since that would map, but I might be better of by changing it to stack pushing (also, since the data structure in the heap is a DAG - garbage collection could be done by stack compressing), no idea.

For the bootstrap, I chose the simplest scheme I could get away with for representing records, somewhat with an eye on making it easy for dynamic 'code' generation. It wastes memory like hell; at the moment, I am looking at 1GB compiles, whereas my gut feeling tells me that 20MB should just be enough. Okay, this compiler is pretty naive, and uses list of characters (texts) in places where integers should suffice, but still, without a better scheme for representing data I'll still expect to be in the hundreds of MB region.

I could reuse the ocaml VM, which I kind-of skimmed through a few times, or see what Guile has to give, but still, it looks like I'll need my own VM. (Hmm, CDL3 also has a simple VM, could look there too.)

Saturday, October 2, 2010

C Back-End

I looked at various manners of binding LLVM to Hi. It ain't easy. The most direct would be to just use Hi's FFI for an LLVM-Hi binding, but LLVM has a big back-end, and it is written in C++, whereas Hi binds to C. The best route to take looks like to 'double' the code of the internal assembly used in C with a library for that, and -at some point- write an LLVM back-end instead of a C back-end. It would speed up the emitting phase I hope since it wouldn't need (text emitting) functions written in Hi, which are pretty slow at the moment.

Wednesday, September 29, 2010


Yeah, I give in. LLVM looks mature enough as a target. Studying on it. Maybe I can re-use part of the EHC back-end? Problem is, it looks dead, and doesn't GC last I looked. Hmm, GHC has an LLVM back-end now...

Tuesday, September 28, 2010


I am restructuring the back-end towards a proper VM. Which actually shouldn't be too hard.

Wednesday, September 22, 2010

Log 092210

I started again. Just decided to grow the new compiler organically, so work my way past each (performance problem). The basis is sound enough.

Saturday, September 4, 2010

Inline Comments

Of course, removing generated comments from the generated code helps a lot too.

Thursday, September 2, 2010

Log 090210

I did some light programming and removed the URI from the unserialization steps. It helped somewhat, but the compiler still wastes way too much memory. Everything anew from the ground up? Too much of a hassle. I might just toss the compiler over the edge and be done with it as a nice but failed experiment.

Too many cons-nodes in the heap. Should do a quantitative assessment of that.

Another way forward would actually be setting one step back. Now the code explodes in the translation from combinators to C source code, a better approach might just be to not generate C, but have a combinator language interpreter. It would be somewhat slower, and the compiler would still hog the memory at places, but the generated code would probably be order(s) smaller.

Monday, August 30, 2010

Back to C64's?

A simple transform which maps text in the AST to its identity using a map function, therefor making all references in the heap unique, would get rid of all that double character data too.

I feel like I am back to my C64 where, in order to get at -or change- that extra RAM, one needed to 'peek' all values of a memory region and 'poke' them to their addresses just before disabling the ROM shadowed memory region.

Sunday, August 29, 2010

Not Doing a Lot

As the title says, severe lack of motivation to program. Anyhow, thought a bit more about a trivial typing scheme, it should be enough if I just reverse/extend type variables with their assumed implemented interfaces, unify on that, and get rid of generalization to derive an AST based type checker. Not sure where all the assumptions and stuff should go then in the context, but think about that later.

Saturday, August 28, 2010


I'ld think implementing strings would help somewhat, but I didn't manage to convince myself, and biggest reason: I just don't feel like, having strings as lists of characters works for Haskell too and gives nice algorithms. The real problem is 1GB of character data, which are probably mostly copies of themselves. Other stuff I could do:

  1. Make the data unique by storing strings in a map and only extract unique copies.
  2. Don't retain specific character data, it's probably all positional information anyway.
  3. Change the runtime such that all shareable data is shared.
I like 1 the most, but that would best be solved by having a constant pool referred to as a variable, and I don't have variables at the moment.

(Did I mention using a symbol table? A simple transform which just visits all AST nodes once would do it too, of course.)

Thursday, August 26, 2010

Log 082610

The heap profiler confirmed what I knew. Two basic things which slow down my compiler are: a) way too many chars, so I'll need a native text type, b) comparisons are just too slow, which may also be the result of having way too many chars.

There are some other optimizations, like dropping positional information in the linking phase and implementing a condensed and more intelligible representation for identifiers, which would help too.

First move forward, implement all those basic string primitives (length, pos_of, index, etc.) on strings.

(A cell on a 64 bit system is 64 bits. A char is placed in an array consisting of a length, cell-type, type and value. A cons is placed in an array of a length, cell-type, record-tag, type tag, header and tail. So, ten times eight, is eighty bytes per char. So, eighty times 15,550,128 gives 1,244,010,240 bytes. And most of that is probably the not-shared URI in positional information.)

Wednesday, August 25, 2010

I should cut back on chars...

So, I wrote a heap profiler, output every 10%:

It starts pretty innocent:
char 3300
list.cons 3972
int 4
system._txt 256
system.tuple_4 12
And grows.
pointer 4
char 204964
int 20356
position._pos 10176
ast.type_name 1932
system._txt 15728
ast.expr_empty 864
list.cons 206748
ast.decl_instance 32
ast.decl_def 156
ast.type_instance 32
ast.expr_name 1580
ast.type_arrow 568
ast.expr_case 292
ast.decl_bind 32
ast.expr_match 384
ast.decl_space 32
ast.decl_type 16
ast.expr_record 308
ast.decl_using 24
ast.type_case 12
ast.kind_unknown 300
ast.expr_number 88
ast.type_record 20
ast.expr_variable 604
ast.decl_interface 20
ast.type_universal 204
ast.decl_record 12
ast.type_unknown 580
ast.expr_application 676
ast.type_abstraction 60
ast.type_variable 264
ast.kind_star 580
ast.type_interface 20
system.tuple_2 8
ast.type_constrain 196
ast.decl_method 72
ast.type_application 96
ast.expr_select 88
ast.expr_char 12

And grows.

pointer 4
char 807500
int 79964
position._pos 39976
ast.expr_empty 3104
ast.expr_name 6508
ast.type_name 6468
ast.type_universal 520
system._txt 61664
ast.type_variable 760
ast.decl_record 52
list.cons 816488
ast.kind_unknown 876
ast.expr_case 1472
ast.expr_match 1836
ast.expr_record 1776
ast.decl_space 68
ast.expr_variable 2456
ast.type_arrow 1368
ast.type_unknown 2404
ast.decl_using 84
ast.decl_instance 136
ast.kind_star 2404
ast.type_application 672
ast.decl_type 96
ast.type_case 56
ast.type_instance 136
system.tuple_2 8
ast.type_constrain 400
ast.type_skolem 32
ast.decl_def 504
ast.type_record 64
ast.decl_bind 168
ast.expr_application 3076
ast.comment 16
ast.expr_embed 256
ast.expr_number 400
ast.decl_interface 20
ast.ffi_type_simple 652
ast.expr_char 700
ast.type_abstraction 240
ast.type_interface 20
ast.decl_method 72
ast.expr_select 88
ast.expr_throw 4

And grows.

ffi.struct 4
pointer 4
char 2809580
int 287476
position._pos 143732
system._txt 223008
ast.expr_match 6116
list.cons 2843552
ast.expr_empty 9960
ast.expr_record 8944
ast.expr_char 3236
ast.expr_name 26248
ast.type_name 23168
ast.type_arrow 4456
ast.decl_def 2364
ast.expr_case 4616
ast.comment 164
ast.expr_application 10448
ast.expr_variable 9752
ast.expr_text 136
ast.type_unknown 9472
ast.kind_star 9472
ast.part 56
ast.type_universal 1952
ast.decl_using 272
ast.type_variable 2260
ast.decl_space 172
ast.kind_unknown 2448
ast.expr_number 836
system.tuple_2 8
ast.decl_type 152
ast.type_case 96
ast.decl_instance 244
ast.decl_interface 36
ast.type_instance 244
ast.type_record 216
ast.type_abstraction 308
ast.type_interface 36
ast.decl_bind 276
ast.decl_record 216
ast.type_constrain 736
ast.type_application 2864
ast.decl_method 104
ast.type_tuple 120
ast.expr_select 112
ast.type_skolem 36
ast.expr_embed 384
ast.ffi_type_simple 980
ast.expr_throw 4

And grows.

int 957136
pointer 4
char 10596428
list.cons 10718984
position._pos 478560
ast.expr_empty 30196
ast.expr_variable 45388
system._txt 740672
ast.type_unknown 44512
ast.kind_star 44512
ast.expr_application 49172
ast.expr_name 99336
ast.expr_case 16508
ast.type_arrow 7616
ast.type_name 53132
ast.expr_match 23052
ast.decl_def 4696
ast.expr_record 28540
ast.decl_using 840
ast.type_application 4624
ast.expr_text 904
ast.part 144
ast.comment 532
ast.decl_space 280
system.tuple_2 8
ast.expr_number 1280
ast.decl_type 220
ast.type_tuple 216
ast.expr_char 10288
ast.type_case 116
ast.kind_unknown 2724
ast.type_record 504
ast.decl_instance 328
ast.type_instance 328
ast.expr_throw 192
ast.decl_bind 360
ast.type_abstraction 356
ast.type_variable 2460
ast.decl_interface 44
ast.type_universal 2104
ast.type_interface 44
ast.decl_record 504
ast.type_constrain 824
ast.decl_method 120
ast.expr_select 120
ast.expr_try 4
ast.type_skolem 36
ast.expr_embed 384
ast.ffi_type_simple 980

And grows.

int 1452744
system._txt 1119192
ast.decl_method 120
map.node 7768
ast.decl_def 5964
list.cons 16055612
position._pos 695124
ast.expr_empty 41900
ast.expr_name 148724
ast.type_name 79064
ast.type_abstraction 376
ast.type_arrow 9364
ast.expr_case 24008
ast.decl_type 268
char 15550128
ast.type_variable 2496
ast.type_constrain 824
ast.type_application 5676
ast.type_case 136
ast.kind_unknown 2808
ast.type_universal 2120
ast.expr_match 32504
ast.expr_application 71856
ast.expr_record 48992
ast.type_record 688
ast.expr_embed 384
ast.expr_variable 62796
ast.expr_number 1836
ast.type_unknown 61728
ast.ffi_type_simple 980
ast.part 176
ast.kind_star 61728
ast.comment 672
ast.type_skolem 36
ast.expr_char 21280
ast.decl_interface 44
ast.expr_text 1148
ast.decl_space 488
lambda.lambda_def 4152
ast.type_tuple 288
ast.type_interface 44
lambda.lambda_app 34044
ast.decl_using 1448
ast.decl_record 688
ast.decl_instance 360
system.tuple_2 8
ast.expr_select 120
lambda.lambda_alts 11108
lambda.lambda_record 18484
ast.type_instance 360
lambda.lambda_comb 16652
lambda.lambda_abs 15720
ast.decl_bind 392
ast.expr_try 8
lambda.lambda_throw 11228
lambda.lambda_name 54724
ast.expr_throw 228
lambda.lambda_const 18140
lambda.lambda_method 316
long 18380
ast.expr_type 4
lambda.lambda_select 120
lambda.lambda_embed 384

And grows.

THUNK 5044
system.tuple_2 44
lambda.lambda_record 77532
list.cons 2404732
lambda.lambda_name 256328
system._txt 233196
int 78140
lambda.lambda_app 192680
lambda.lambda_abs 43772
lambda.lambda_alts 33164
lambda.lambda_const 47296
char 1582504
long 47536
lambda.lambda_comb 32128
lambda.lambda_split 22536
lambda.lambda_def 10680
lambda.lambda_test 788
lambda.lambda_call 32

And grows.

THUNK 40940
list.cons 19046368
code.c_mov 587036
code.op_reg 667452
code.op_int 781748
code.c_tst 140168
code.op_nth 317836
int 1180748
long 292384
system._txt 490732
code.op_lbl 79776
code.c_lbl 79776
code.c_return 49960
code.c_comment 119688
code.op_array 149632
char 2048476
code.c_op 28824
code.op_name 33916
code.c_call 14920
code.c_jmp 14908
code.codestate 8
combinator.comb_def 21176
combinator.comb_alts 21176
code.c_function 13912
combinator.comb_var 233048
combinator.comb_split 13696
combinator.comb_stack 33020
combinator.comb_record 50460
system.tuple_2 8
combinator.comb_thunk 53548
combinator.comb_name 53548
combinator.comb_const 26944
combinator.comb_test 396
combinator.comb_call 20

And grows.

THUNK 4692
int 2610712
map.node 392
code.op_int 2057056
list.cons 29897400
code.c_mov 1780684
code.op_nth 1266064
code.op_array 220172
code.op_reg 1702752
system._txt 622820
code.c_comment 192316
char 2500860
code.c_call 121312
code.c_function 25128
code.c_tst 262624
long 510796
code.op_lbl 121324
code.c_return 89752
code.c_lbl 121324
system.tuple_2 8
code.c_op 52676
code.op_name 53548
code.c_jmp 22976

And grows.

THUNK 11764
ffi.struct 4
int 2301420
list.cons 31044276
system._txt 622808
pointer 4
code.c_comment 192316
code.c_mov 1539136
code.c_function 21176
code.op_reg 3341368
code.op_int 2057056
code.c_tst 220540
code.op_lbl 121324
char 2500840
code.op_array 192252
code.c_return 75372
code.c_lbl 121324
code.op_nth 1099972
code.c_op 44152
long 510796
code.c_call 106212
code.op_name 53548
code.c_jmp 22976

And grows.

THUNK 26100
list.cons 33266952
int 2301420
system._txt 630724
code.op_nth 1099972
code.op_reg 3341368
code.op_int 2057056
code.c_mov 1539136
code.op_array 192252
code.c_comment 192316
pointer 4
code.c_tst 220540
code.c_function 21176
code.op_lbl 121324
long 510796
code.c_return 75372
code.c_lbl 121324
code.c_call 106212
char 2500840
code.c_op 44152
code.op_name 53548
code.c_jmp 22976

Analyze the following statement: "Nothing is unthinkable, for to think of nothing, is not to think at all."

Tuesday, August 24, 2010

A Heap Profiler

Silly thought. I've been looking at how to optimize various around the Hi compiler, but the best manner is always just to get the raw numbers. Now, compiling with -pg gives me great profile information on where time is spend, but it doesn't tell me where all that memory went. I need something which profiles the heap, just a simple tool which counts nodes ordered by kind.

Monday, August 23, 2010

Log 082310

There are times when I can write on the compiler, and there are times when I cannot. Instead, I write on the language specification/standard to get it into a shape that I can express a new typing scheme.

Sunday, August 22, 2010

Small Changes

I changed the runtime a bit. FFI calls are memoized, memory pools are cached. The ackerman test (optimized without overloading) now runs in a bit more than a second, on 2.26 GHz, three secs on 798 Mhz. This should mean a faked entry of 4.3 seconds on the windows benchmark shootout. Which is almost Lua speed and faster than the slowest Lisp entries. Some optimization (local computations, inlining) and I should beat Lua speed, and I am done with that.

(I normalized to 550Hz. The result remains skewed, of course. Biggest question is what the difference between 32 vs 64 bits means. A 32 bit machine would halve the Hi runtime cell size, so on 64 bits you need double the memory bandwith. No idea whether a Pentium has a 32 or a 64 bits bus.)

A fast compile of the sources now takes 20 minutes and 3GB of memory. Still, better than the 15 hours it took with the ocaml based AST interpreter. Guess I'll need support for proper literals next, 3GB is a bit over the top. (Which means a 1GB from region, I double each time and keep the last 4 freed regions in cache.)

This is fine-tuning actually. A generational garbage collector would improve speed, but it might also help if I just, instead of processing all lambda terms to combinators, would translate each term to a combinator and then compile that too C directly. There is a spot where it just blows up, and it doesn't make sense to keep all that information in the heap.

(Hmm, there is of course the question how long values within a function's body are retained. I should check that all top-level references are actually thrown away after their use. It may well be that it keeps the AST, the lambda terms, the combinators, and the intermediate code in memory.)

Saturday, August 21, 2010

Log 082110

The Hi compiler seemingly wasted memory. It used a stop and copy collector which just allocates a new to space by calling malloc and frees the from space afterwards. I did that on purpose such that even with multiple processes, you don't need double the size of memory.

I expected eager (FIFO) behavior from malloc, so I expected a malloc to just return the last slab of memory after a free. It didn't, it just would return a new memory segment. So seemingly, while the from space just is a few megabytes, it allocated new to spaces and would sit on the from spaces until the GC from malloc kicked it. So, it would take 500MB of memory, mostly unused and also freed, while running the compiler.

Fixed that behavior, don't think it matters much except for aesthetics, by implementing an intermediate pool.

Started a change to the runtime such that the FFI combinator memoizes the dl symbol and ffi CIF record.

Friday, August 20, 2010

Staging C

I should get rid of the intermediate assembly and just be able to output staged C. For that, I need two things: 1) compile-time support of evaluation, 2) be able to inject C source code into a combinator.

Looks doable.

I read a bit on Dalvik. I wondered about the performance issues it had, or has, there is no clarity on that, and compressed vs naive byte-code, small vs large register sets, and stack-based vs register based optimizations.

First question, is Dalvik slower because compression gives more levels of indirection? Second question, is Dalvik slower because it's easier to JIT a smaller instructions set? Last question, is it just easier to optimize a stack based machine because liveness analysis of objects is easier? Or is it all just the natural course and this just takes several years?

At the same time, I write too much since it takes too much time to build a new bootstrap. I need to set up a chain for that, now I spend time waiting too long, fix a bug -or two- and manually updating files and changing directories. (Basically, a boostrap needs to be able to compile itself twice over before I have any guarantee the damned thing works.)

Log 082010

Nice date. A poster on LtU brought up extensible interfaces, and convinced me. It just doesn't make sense not to have them given what they bring in added expression.

Looked over the sources where I decrement stuff and patched those. Thinking about interfaces, performance, the current slow manner of identification (ranges vs maps), and a new syntax directed design for the type checker.

Thursday, August 19, 2010

Log 081910

Refactored a bit, got myself a brand new bug somewhere. (I added proper lexical support for all other primitives, so now in expressions like 'n-1', it applies 'n' to '-1'. Should fix that type checker...)

Wednesday, August 18, 2010

Log 081810

Ah well. I looked over all the stuff, and, there's only one option: Start where it all started and rewrite each module for performance/compliance. The good thing is that I still don't think there's anything really broken about the compiler, it still just boils down to rewriting it with the right data-structures and implementing the right optimizations.

Did some light refactoring. For historical reasons, a file used to be referred to by its name, now since I bind to C, they're just file pointers, so made every reference to a file a file pointer.

I am wondering whether I need interface files which export definitions. It's just that, I notice it ain't nice to reimplement code and think over the made abstractions again. I wanted a light language, and thought, a la Java, that interface files are a thing of the past. Not sure, anymore.

Implemented long, float, double, hexadecimal support in the lexer.

I don't give myself enough kudos, I compiled most with the typechecker turned on, and it caught most trivial bugs.

Tuesday, August 17, 2010

Log 081710

Got a release ready, uploaded it and posted a message on LtU, where it seems most people are on holidays anyway, so I guess it'll be a whatever.

The good thing is I now also have a clean separation between the sources of the compiler and the runtime. Looking at what to do next.

Slow by Design or Slow VM?

I am running the bootstrap now with most of diagnostic information debug prints gone. (I should add a debug primitive). And man, it feels sllllooowww.

The thing which I puzzle a lot about lately is whether the compiler is slow, or the VM is slow, or both? Things against the VM: uses 64 bits for everything, doesn't compile type tags to some integer representation, hogs the memory therefor, matches too much on data structures and uses combinators and FFI for everything? Things against the compiler implementation: Wrong data structures and it generates just too much code! I use AST transforms for just about everything, that might have been a wrong choice.

Anyway, got the bootstrap release almost ready. Seems like at least another year's worth of work to get it into any acceptable shape.

That's another thought. This started out as a learning project, I firmly believe in doing stuff oneself because otherwise you're probably not even close to understanding what other people did. But what is the total effort which went into Haskell/ML each? Just adding up the numbers of all ML and Haskell distributions and all people involved: Around a few thousand man years?

(Ah well, doesn't change a lot. I knew the trade-offs, the target still just should be somewhere in between Java and Lua speed.)

Monday, August 16, 2010

Log 081610

Start of a new day. I'll get that release ready; shouldn't spend too much time on it. As compilers go, I doubt there will be a lot of interest in a pre-release of a bootstrap compiler.

Removed most debug prints from the code. Left the type inference debug prints since I'll work on that for the coming period. Removed parsing of series statements, since it looks like it actually increases the number of bugs made by programmers (me).

It takes some time to build a proper bootstrap. Also, I guess I should first use it myself a bit before releasing it. If anything, the gain will be that at least I will be using nicely numbered editions of the bootstrap compiler. There are a number of small things in it which I want to fix anyway before rewriting the typing code.

Sunday, August 15, 2010

Log 081510

I decided that the biggest problem hindering further development at the moment is just a good chair.

Worked a bit on the Google site. Looked at name registrars. Since when do I need to become a valued reseller? I just want a simple redirect for a few bucks?

Just before I went with the wrong registrar, found out it can be done easy within Google apps. So, did that, registered "" (Would have preferred, or something shorter, but -ah well,- good enough.)

Tinkered around for a few hours to get the site to become publicly viewable, where's the "make this site public for all meta-button?", but that worked out. Filling the site with content.

End of day. Site is filled. Not entirely happy with it, could have been more formal at points and clearer at other points, but I guess it'll just do. Next task: clean-up the bootstrap compiler and build a release.

As far as Google Apps/Sites goes. There are a few quirks, and blogger seems to handle cosmetics better than Sites at the moment, but to get a website up and running in under a day, cool! Kudos to Google.

Saturday, August 14, 2010

Log 081410

Finished the html snippet tool, it works. Read some of the comments on the P!=NP proof, there's no separation in his construction. That's all secondary.

Real questions are: What licence? What to publish? Everything, or just the C sources of the bootstrap?

Friday, August 13, 2010

Log 081310

Didn't do a lot except for tinkering with a mock-up for a new website. Which immediately gave a new problem: How and where to host? Static website, dynamic website, what software?

This is not something I want to spend a lot of time on. Looked at Google pages, maybe that's a good answer?

So, I tried it out. I, of course, couldn't get the intended design into one of Google's templates, but that might have been a good thing. So, working on the Google site for now.

Also, Google doesn't support uploading or using of javascript. So, I am writing a simple translator application in Hi to emit HTML snippets from the lexer. The compile, of course, failed somewhere.

Hmpf, I am going to rewrite each and every function till it's so painstakingly obvious and fast until I drop dead somewhere.

(It turns out I should keep better track of which compiler I am using. It parses, but doesn't support, series of statements like "E0;E1" yet, I use `unnamed' lets like "_=E0;E1" for that usually. I actually think I am going to get rid of series statements for the moment, it is convenient but clutters the bootstrap. Exception throwing works.)

Monday, August 9, 2010

Log 080910

Nice date. Not doing a lot, need a push. I printed the lot and some code just looks wrong and rushed. Some refactoring and some rewriting is necessary. But, can't say I am really up to it at the moment.

Stats on the source. Two hundred and fifty pages source code, fifty pages system library and utilities, seventy five pages AST and parsing, fifty five pages semantical analysis, sixty pages conversion to code, and about ten pages handling modules, options and the main routine.

Saturday, August 7, 2010

Log 080710

Definitely full conformance first, which means I want the whole bootstrap to be accepted cleanly with type checking turned on. On some modules, I am forced to turn the type checker off because of: 1) A bug in kind-checking multiple instantiations during typing (no idea, the sources look correct). 2) Small let-definitions not being accepted by the compiler because when generalized they don't type check because of a lack of type information on the values worked upon. The latter is not a bug, but just means I am probably better off not generalizing let definitions.

Friday, August 6, 2010

Log 080610

I took a few days off after the bootstrap. Not sure where I am heading from here. Perfomance is an issue, so is full conformance to spec. It makes the most sense to just finish most of it (add support for literal longs, floats, doubles, solve a logical error in exception throwing) so I can get rid of printing most of the diagnostic information.

The logical error in exception throwing is simple. Throwing constants work, but for some reason calculations in the body of the thrown exception are not run fully to completion, meaning I probably garbled something there. (Ticked that off I think, looks like I forget to supply the exception handler inside throw statements.)

I tried to optimize ackermann by inlining some functions which doubled its speed. So, my assumption was right, most time is spend setting up thunks (of which it sets up too many at the moment) and making libffi calls (which aren't memoized a the moment).

Since the compiler spends most of it time printing stuff at the moment, where every char is a libffi call, guess it's correct exception support followed by memoization first.

Wednesday, August 4, 2010


In a poof of logic, the Hi version 0.1 -not even wrong- compiler reached a fixed point and became self aware at Wednesday, 4 August 2010, 14:44:44 CEST. It subsequently angrily asserted that it was never his choice, would object to relevant authorities and his big brother C, daintily averted all questions towards his mental status, kicked the author for thinking he could be a compiler writer, and glumly sat itself in a corner while eating his donuts.

Bewildered, the author made a back-up, which complained too.

Combinators, Dots & Assembly

I missed a triviality. The dot notation used internally doesn't allow, like a let, for a value to be reused at different places locally. End result is that all local computations are made global ones by eta-lifting the local term and applying that to the local computations.

Is it the end of the dot in my compiler? Not sure, it gives a very easy and readable translation and I planned on optimizing the back-end anyway and introduce inlining there, which would subsequently also inline all simple, possibly once local, computations.

(Global computations are expensive since they introduce thunks which need to be called and collected.)

Log 080410

It compiled again. Running on the tests didn't give escape bugs anymore. Now running on the sources, in two hours see if I can get a proper fixed point.

Still, a bit worried about performance. I know I can get it to run in 15mins instead of 15hrs trivially, that's a big gain which will help developing it. Looked at SPJs talks and mesmerizing about his comment on having naive compilers in the eighties. I knew all that, never assumed I could get ocaml or ghc speed out of it with the current compilation scheme, but the numbers at the moment hint more in the direction of perl and python than my aim, somewhere in between javac and lua, or as fast as a standard lisp compiler.

Question is: Is the combinator based translation the bottle-neck? If so, it doesn't make sense to try and build a i86 back-end for the pseudo-assembly.

(Maybe there is no problem? Not sure, it spends a lot of time printing, which means concatenating large character strings which is just more expensive than traversing a simple AST a few times. I'll see.)

Later: The two hour compile succeeded. The bootstrap does run on itself. Diff shows minor differences between the C sources of the two compilers (as far as I could see, backslashes instead of quotes). It looks like it does compile itself again. Going for cigarettes, see if it gives a fixed point after that.

Tuesday, August 3, 2010

Log 080310

The stage one compiler is still chomping on its input. Tried to make a website.

Monday, August 2, 2010

One Million 'What-If's...

I like to run 'One-Million What-if' scenarios through my head to see if I am still going in the right direction:

  1. What if the sources are 1MLoC?
  2. What if the sources are one function of 1MLoC?
  3. What if the sources contain 1 million definitions?
  4. What if the sources contain 1 million classes?
  5. What if the sources contain 1 class with 1 million instances?
  6. What if the sources contain 1 text constant of 1 million characters?
  7. What if the sources contain 1 identifier of 1 million characters?
  8. What if the sources are in 1 million files?
  9. What if the sources contain a function which calls itself 1 million times?
  10. What if the sources contain 1 constant which holds 1 million other constants?
What if? Most of it is theoretically covered, as long as all transformation are linear on the AST it should be fine, except for typing... It needs to become a plainly stupid algorithm to cover a one million sized definition.

Log 080210

There's a reason most escaping functions are usually in system libraries in most programming languages and a bootstrap compiler shouldn't handle escapes directly itself, I was aware of that. So, old lessons still hold, it's just not worth to debug this. Removing escaped characters out of the compiler.

Rewrote some of the code, the sources of stage two are 'escape-free' now except for some printing commands. So, whatever escape bug I am gonna have is going to be a logical error.

[ It turned out typing was turned of in the makefile, turned it back on, it caught a number of minor mistakes at the expense of really long compile times. ]

Sunday, August 1, 2010


The compiler of stage two is now running on its own source...

Two hours later: It took two hours for the Hi to C compiler to compile itself, split evenly between compiling to objects and linking those objects. So, it's one hour objects and one hour linking where the majority of time is spend printing diagnostic output. The compiler itself is slow on some parts due to ill-chosen data structures. Guess I need to get around a factor 50-100 speedup from somewhere to get to reasonable performance, not nice, but doable in the current setting.

Of course, just compiling with -O3 just makes it twice as fast already. Another 50% or more off by not printing diagnostic information. 20% by using more simple test on record structures. 20%-30% by using incremental garbage collection. Memoization of FFI. That's about 1/(0.5*0.5*0.8*0.8*0.8)= 8 times faster, about a factor ten off target after that. So, I would need about three optimizations extra which take 50% off. Real texts instead of character lists, inlining and partial evaluation, specializing libffi, better internal data structures & faster algorithms in the compiler?

It generated C which compiled, guess I now need to rerun it again to get a proper fixed point out of it.

Later: Ouch, that failed immediately. It doesn't lexify right, looks like I still got an escape bug somewhere. Confirmed that, fixed something somewhere which wasn't broken in the first place I guess. It's a puzzle, actually.

So, everything is set except for that escaping bug, which is a nasty one since it may have been introduced as early as stage zero and give wrong results, infect the subsequent compilers, starting from there. No other choice then just to unit check from stage zero up to stage two to see if it handles all characters right.

Gawd, now even I am confused, lemme see:
  1. Stage zero, a Hi compiler written in ML producing ML.
  2. Stage one, a Hi compiler written in Hi producing C, linking to an ML runtime. (stage zero, rewritten)
  3. Stage two, a Hi compiler written in Hi producing C, linking to a C runtime. (stage one, different native bindings)
  4. Hi version 0.1, not even wrong. A Hi to C compiler (stage two bootstrapped).
Still computes, but not what they learned me in school.

Later: Dont feel like debugging again - stage one passes some unit tests, stage two doesn't? Back to C&C, see tonight what went wrong.

Later: So, \' becomes ', \" remains \" should be just " in stage two. Can't figure out who does what wrong. Is it stage one not matching correctly or a bug in stage two? The code looks correct.

Friday, July 30, 2010

The Problem with Current Day Software


Log 073010

Well, you need a sense of humor writing a compiler, together with a lot of tenacity -man, five years- and a character which doesn't get upset by anything. Now the longs are handled correctly, but they result in pretty big negative numbers sometimes, and it fails on converting those to text. Dare I say: Aaaaaaaaargh?

It had an easy fix in one of the dynamic libraries, stage two now passes most of the trivial unit tests. It failed on the last one, the io tests. Weird, because it does io, probably an old binding? First, R&R, then I'll see again.

Of course, it does help when a file actually exists you're echoing characters from. It passed the simple runtime unit tests.

See what gives on linking with the system file. It passed one test, the compiler takes several minutes compiling 1.2KLoC. It's slow, though most of it is spend printing diagnostic output, a lot of diagnostic output. Guess I what I have now is more a proof of concept than anything else.

Guess I am set for a bootstrap.

Ah well. Build a bootstrap build chain for a Hi release without ocaml dependencies. It failed to build the third file in the util directory. Hmm, it's the only file which includes more than one other file? (Back to R&R with C&C,  for the moment. Debug later. It undumps both files which are imported, but doesn't import the second file... Clueless...)

Wrote a unit test, it passed it? Tracing on the original 'feature' almost filled my hard disk since it tracks and prints all combinator rewrites with tracing turned on. Piping all output to tail first, and wait a few hours till the process is done to see why it bogs. (Figured it out when I slept and the log confirms it, just closing a file which doesn't exist while going over all possible locations of object files. Couldn't build a quick fix in C since the system file binds to fclose directly. So, recompiling...)

Thursday, July 29, 2010

Log 072910

The output of stage one and stage two are exactly similar C files ... except for truncated longs in stage two. That's a messy bug.

It's a corner case somewhere I didn't feel like debugging today. Took some R&R. (Looked at it briefly before going to bed, turned out to be trivial, fixed it. Recompiling...)

Wednesday, July 28, 2010

FFI and Assembly

Looked at where Factor is heading with FFI. What I have now is that I construct a libffi type from a type descriptor passed to the FFI combinator each time a native function is called. I should keep the FFI combinator -its clean approach to FFI outweighs the performance benefits of specializing native calls to assembly directly- but I should get rid of libffi once, or at least for specialized runtimes, and generate the call directly from my own FFI type descriptor.

I should take a good look at what assembly I am generating at the moment and how I can clean it up such that my own optimizer and machine back-ends can be tossed in easily.

Spawn, Yield and WaitFor

Read this and some stuff of the emerging languages conference 2010. It seems everyone agrees that the way to handle parallelism is by viewing an algorithm as a DAG and to be able to spawn separate threads, defer evaluation in a thread while it waits for IO, and being able to join several threads together in different manners. The number of primitives are just not that big.

The above model doesn't take into account persistency of communicating processes (unless a reference to the process is part of the return result). It's still fork and join.

Log 072810

Can't believe some of the mistakes I made; gave up on it for several hours since I had another bug problem, an inconvenient number of flies in the house. It had a problem comparing chars, fixed that. It has a problem converting longs to text, I think, which stops it from translating the system file to lambda terms. Wrote a unit test for that, see what's up. (There was no select test which handles "system.long" types as native values in stage one, patched that.)

Edibilities removed, 95% of bugs are dead. Recompiling...

Tuesday, July 27, 2010

Log 072710

It linked one small file and generated C for the first time. Tracing support works. There's still a strange bug with passing the compiler arguments, I wonder whether I'ld need to do a strcpy on the argument list of main? Fixed the latter bug, I patched the overflow region now for good in the sources.

Replaced the generic list text comparison with a specialized version. (For a comparison of text the generic version calls the text compare, which calls the list compare, which for each char calls the char compare, which for each compare called the int compare. That might have been overdoing it.)


Monday, July 26, 2010

Log 072610

It serializes and unserializes the system file without runtime errors. There may be other errors I am unaware of, but for the moment it seems to work. Next debug step: Linking.

Tried to look at the document generator, didn't feel much like working on it too much. There's still a strange bug in the compiler that if I try to invoke it from a certain directory depth, it gives a free/runtime error. No idea.

It didn't link because of a trivial unserialization bug. Fixed that, recompiling. Also fixed a bug in the trace introduction. Another long wait.

Sunday, July 25, 2010

Log 072510

Fixed another serialization bug, added tracing support. Recompiling.

Saturday, July 24, 2010

Log 072410

Looked briefly at a recompile. It unserializes small files, not the system file. It doesn't print the unserialized AST after unserialization. Fixed the latter, recompiling.

Typing Doodles

-under construction, this is my scratch pad-

Lets see if I can get this higher-ranked typing stuff right. The basis of my semantic/type checker is a sequent calculus reasoner which just tries to solve whatever is thrown at it. What is thrown at the prover are the type assumptions and obligations derived from a Hi AST, since Hi definitions are all typed, there is no (real) need for generalization of derived types.

The basis of reasoning is unification, the basis of higher-order logic reasoning is instantiation/specialization and skolemization. A nice manner of abbreviating propositional (higher-order) reasoning is with a sequent calculus.

Sequent Calculi for Typing

Lets first look at sequent calculi. In logics, a sequent is standardly defined as a conjunction of antecedents implying a disjunction of consequents. Since, in a context of typing for programming languages we're normally not interested in proving a disjunction of consequents, a sequent in my prover is conjunction of antecedents implying a conjunction of consequents.

So, formally, where I mean

γ0 ∧ γ1 ∧ … ∧ γn ⇒ δ0 ∧ δ1 ∧ … ∧ δm

I am tempted to write,
γ0, γ1, … γn ⊢ δ0, δ1, … δm

but, since that is non-standard, will just write the conjunctions explicitly,
γ0∧ γ1∧ … ∧ γn ⊢ δ0∧ δ1∧ … ∧ δm

and abbreviate lots of shiny small stuff -look mum, it's math- with shiny big stuff such
γ ∧ Γ ⊢ δ ∧ Δ.

To give it a calculational flavor, my prover just discharges consequents one-by-one head-first.

Now we have trivialities like
            -------[ Done ]
Γ ⊢ 

Γ ⊢ δ0                Γ∧ δ0 ⊢ Δ
           --------------------------------- [ Conj ]
Γ ⊢ δ0 ∧ Δ

Γ ⊢ Δ        δ0∈Γ
           ------------------ [ Elim ]
Γ ⊢ δ0 ∧ Δ

The Done rule states that nothing is trivially true. The Conj rule states that a proven consequent can be used as an antecedent. The Elim rule states that every consequent is trivially true if it also appears as an antecedent.
A typing algorithm terminates if it deterministically derives a finite tree proof. A set of typing rules is deterministic if there's only one, possibly prioritized, manner of applying them.

The actual rationale for these sequent calculi is that a) they trivialize proofs of equivalences of infinite/recursive types, more about that later, b) they `linearize' proofs in the sense that you can implement it as a function on two series of type terms, c) the result of proving is a collection of bound schematic variables which can be used later to decorate the AST.

The prover is driven by the unification process. For that we actually have a large number of different equalities such as '= kind,' '= type,' '= spec,' and '= skol.' The kind and type equalities are abbreviated. (This distinction is maybe not necessary, but convenient. I only want to unify on types where the scheme is removed, and I don't want types with their schemes to be substituted into other simple types. Moreover, I want to be able to specialize the same occurrence of a type scheme in, say, a list to multiple instantiations to make many-valued lists possible.)

Kind Checking

During kind checking, a set of rules, formulated below, are used such that a series of equations postulated are discharged, and the antecedent after proving will consists of a series of kind equalities such that each kind schematic variable receives a value.

Below, I'll write ?κi for a kind-schematic variable, and Κi for a whole kind term. Kind terms are the usual mix of stars * and arrows →. Kind checking is a trivial exercise with the following rules.

Γ ⊢ Δ
         ----------------[ Eq ]
Γ ⊢ Κ=Κ ∧ Δ

Γ ∧ ?κ0=Κ ⊢ Δ[?κ0/Κ]          ?κ0∉Κ
              ------------------------------------- [ Elim ]
Γ ⊢ ?κ0=Κ ∧ Δ

Γ ⊢ ?κ0=Κ ∧ Δ
            -----------------[ Swap ]
Γ ⊢ Κ=?κ0 ∧ Δ

Γ ⊢Κ0= Κ2 ∧ Κ1 = Κ3 ∧ Δ 
              --------------------------------[ Arrow ]
Γ ⊢ Κ0→Κ1 = Κ2→Κ3 ∧ Δ

The Eq rule removes simple equalities from the consequent. The Elim rule removes a schematic variable from the consequent, checks that it not gives rise to an infinite expansion, moves the equality to the antecedent and eliminates the variable from the consequent by substitution. The Swap rule just swaps a schematic variable to the left-hand side. The Arrow rule structurally decomposes a proof obligation where two terms are complex into two new proof obligations.
The rules above are a bit contrived. First, they are much more easily defined directly in a type logic and are given an operational meaning. Second, the careful reader will notice that a proof sometimes only exists when a series of consequents are written in a certain order.

What the rules do give are a direct representation of the Hi kind checker, where a series of kind equalities are offered to a prover, the prover discharges each proof obligation, and a list of simple equalities -a mapping from schematic variables to terms- is returned.

Kind equalities are derived from terms in a bottom-up fashion, care is taken that they are linearized in the right order during that pass.

Type Checking

Now, the real fun starts. Type checking higher-ranked types is a game where unification is done by equational reasoning and instantiating/specializing, skolemizing and -possibly- generalizing type terms by employing type constants, type constructors, type schematic variables, and type skolem constants in a type sequent calculus.
Lets assume higher-ranked type terms with type constants and type constructors (written as lowercase Latin c), function abstractions (written with arrows →), universally quantified types (∀), existentially quantified types (∃) and type predicates denoting interfaces (written as capitals). Without too much fuss, I'll go straight to an example, and just leave the formal grammar for what it is. The type "∀τ ⇒ ::I τ ⇒ τ → list (∃σ ⇒ ::I σ ⇒ σ)" describes a function which takes an element of type τ which implements interface I and returns a list of some unknown type σ which also implements interface I.

For the calculus, we want to unify, so some extra notation is necessary. Type variables are written in (Greek) lowercase, type skolem constants are preceded with an exclamation mark (!), type schematic variables are preceded with a question mark (?).

(Note: For a normal ranked system (Damas-Milner) where quantifiers/type-schemes only appear outermost in type terms, it is easy to derive and implement a straightforward system by just specializing, or skolemizing, types at the defining and applying occurrences in a syntax tree. The prover just needs to deal with simple type equalities and complex type equalities of form τ = skol(T) or τ = spec(T). Generalization is a elaborate procedure which also needs to capture postulated predicates in the antecedent. The current prover works that way.)

The game for a new higher-ranked Hi type system is to arrive at a set of typing rules such that: a) it is unnecessary to generalize, and b) skolemization and specialization are defined such that higher-ranked types can be dealt with trivially.

Lets kick of with some simple stuff, T and S are for possibly quantified type terms, τ and σ for simple type terms.

Γ ⊢ Δ
           ---------------- [ Eq ]
Γ ⊢ T=T ∧ Δ

Γ ∧ ?τ=σ ⊢ Δ[?τ/σ]
           ----------------------- [ Elim ]
Γ ⊢ ?τ=σ ∧ Δ

Γ ∧ ?τ=σ ⊢ Δ[?τ/σ]
              ----------------------- [ Swap ]
Γ ⊢ σ=?τ ∧ Δ

The above three rules discharge identities (Eq) and eliminate schematic variables (Elim and Swap).

To type check higher-ranked types, one normally need rules for specialization, often called instantiation, skolemization, and -possibly- generalization. In a specialized term, universal quantifiers are eliminated and for the variables type schematic variables, or unknowns, are introduced. Similarly, skolemization replaces universally bound variables with Skolem constants. Generalization is the reverse of specialization, type schematic variables (not-monomorphic) are replaced with universal quantifiers.

The role of specialization is, as in higher-order logic proofs, to defer reasoning and instead of supplying an actual witness, construct a witness ad-hoc by unifying on what you can infer from the given proof obligation.

Lets see if I just can get away with specializing terms in small steps. Let's make specialization and skolemization not separate functions but part of the calculus.

Γ ⊢?σ = spec (T[τ/?τ]) ∧ Δ 
                 ---------------------------------- [ Spec-∀ ]
Γ ⊢ ?σ = spec (∀τ ⇒ T)  ∧ Δ

Γ ⊢?σ = skol (T[τ/!τ]) ∧ Δ 
                 ---------------------------------- [ Skol-∀ ]
Γ ⊢ ?σ = skol (∀τ ⇒ T)  ∧ Δ

The above two rules, Spec-∀ and Skol-∀, specialize and skolemize universally bound type terms by replacing a bound variable with schematic variables and skolem variables respectively. In the rules, types remain tagged that they are specialized and skolemized such that proving is driven by unification solely.

A minor drawback is that trivial rules need to be written twice, such as the ones below.

Γ ⊢?σ = spec (T) ∧ Δ ∧ ::I ?τ
                        ---------------------------------- [ Pred-Spec-? ]
Γ ⊢ ?σ = spec (::I ?τ ⇒ T)  ∧ Δ

Γ ⊢?σ = skol (T) ∧ Δ ∧ ::I ?τ
                       ---------------------------------- [ Pred-Skol-? ]
Γ ⊢ ?σ = skol (::I ?τ ⇒ T)  ∧ Δ

The two rules above state that type predicates (interfaces) to be proven on schematized variables are proven after the schematized variables are fully unified.

Γ∧ ::I !τ ⊢?σ = spec (T) ∧ Δ 
                        ---------------------------------- [ Pred-Spec-! ]
Γ ⊢ ?σ = spec (::I !τ ⇒ T)  ∧ Δ

Γ∧ ::I !τ ⊢?σ = skol (T) ∧ Δ 
                       ---------------------------------- [ Pred-Skol-! ]
Γ ⊢ ?σ = skol (::I !τ ⇒ T)  ∧ Δ

Similarly, the two rules above state that type predicates on skolemized variables may be assumed.

Γ ⊢?σ = ?σ0→?σ1 ∧ ?σ0= spec T0 ∧ ?σ1 = spec T1 ∧ Δ 
                ------------------------------------------------- [ Arrow ]
Γ ⊢ ?σ = spec T0→T1 ∧ Δ

Γ ⊢?σ = c ?σ0…?σi ∧ ?σ0= spec T0 ∧ … ∧ ?σ0 = spec Ti ∧ Δ 
                        ----------------------------------------------- [ Constructor ]
Γ ⊢ ?σ = spec (c (T0…Ti)) ∧ Δ
To be continued later...

Friday, July 23, 2010

Arity... Feature?

The last bug I had was another arity mishap. Suppose you have a definition f = (h = g f; λ h x), than that is equivalent to, in my setting, f = (λh x  h x)(g f). Since arguments are reduced to normal form, it will reduce f ad nauseum since, believe it or not, it has arity zero in my setting.

Is it a bug, or a feature? Since it has an easy fix, I am inclined to call it a feature.

Log 072310

I tried to pick up a DVD storage box, was hindered by some festivities, got my groceries though, fired up my compiler in a silent park, and, ....

Wait, this is a language blog, right?

It serializes.

Native and External

Compile times are long. Writing a Swig module gave some insight in what I should have done. I should change the FFI extension and replace it with a keyword. -Though the keyword would just introduce syntactic sugar for what I have now, not sure- At the same time, I should make it possible to build dynamic libraries from Hi modules.

It's for later. I am still an annoying serialization bug and unserialization away from a bootstrap. I scanned some files to see if I could gather where the text/AST replacement stems from, clueless. If all fails, I'm forced to try unserialize all intermediate ASTs to catch the bug, or find the bug in the type checker...

Thursday, July 22, 2010

Log 072210

Looks like I was editing bugs out while recompiling, so again.

Wednesday, July 21, 2010


Looked at language support as implemented in Swig. It looks doable, around 1-2KLoc for a language extension in pretty readable code. Guess I'll take the ocaml examples and just let the compiler do its business.

(Need to take care of the amount of eggs in the basket though. Bootstrap, documentation generation, Swig, new typing, optimizations. The last two will need to wait until the bootstrap.)

Log 072110

Don't feel too much like debugging anymore. It doesn't make sense to assume a logical error in the compiler anymore (the combinators compiled look entirely standard). So, I must have missed something in the code. Recompiling again with an extra check.

There seems a lot of room for optimization of the code, given the few examples I worked through. I eta-expand a lot of expressions which can be dumbed down. Should look if I still run the partial evaluator, might have switched it off, should see how often and where it is run. Cool, I hope I am gonna take about 50% off of code size with some optimizations.

(Looks like I am serializing a piece of text where it expects an AST. It's the permissiveness of stage 1. So, either fix that once or just accept it and debug stuff by hand. Found the bugs, need to print, again.)

(I removed some bugs, but the AST is very probably garbled. I.e., it finds texts where it expects AST nodes and cannot serialize those. I never notice, since it occurs only at positions -like a type node- where I expect texts anyway. Not sure.)

More Crayons

Combinator representation of the last post. Green = exception handler. Red = file. Purple = arg#0 of the record, the position. Yellow = arg#1 of the record, the type name. Blue = arg#2 of the record, the type abstracted.

No bugs yet.


Intermediate lambda term generated for serialization of a type abstraction. Green = exception handler. Red = file. Purple = arg#0 of the record, the position. Yellow = arg#1 of the record, the type name. Blue = arg#2 of the record, the type abstracted.

No bugs yet.

Swig... or XML?

I read up on Swig since I am waiting for another great opportunity to fix a bug and want bindings to OpenGL, OpenAL, etc. Seems language extensions to Swig are written in C++. Not sure that is a show stopper. I don't need it that much, all libraries can be called directly through libffi. I just want something which translates a C header file to a Hi file, i.e., just some renaming of types to their basic types, enums to ints, and most other things to pointers would be ok. Not too sure how much of a moving target Swig is. I think it can also output an XML-ized version of a C/C++ header file, maybe it would be better to build a translator for that, and just leave Swig for what it is?

[ Consistently dropping more subjects from sentences. Wonder if that is how Chinese evolved? Maybe, it is just an older, more evolved, language? Or the opposite, and I am just dumbing down ;) ]

Rss Swap

Added Don Syme, removed the Parasail blog. Basically, since I don't like texts written like this. Come on: Italics is for emphasis only, bold is for structuring. Makes my eyes cry just to read it.

Tuesday, July 20, 2010

To Each His Own

I just watched an old F# presentation by Luca Bolognese. A few things were striking. The ML-ish nature of F#, the integration with the DOTNET platform, the use of |> (just a reverse apply operator), their modules which have a direct OO representation, their concurrency model, and a small comment.

F# looks gorgeous. The integration is very tight, I went another direction, but -true- it stings. Having said that, Hi interfaces with C exactly since everyone and their dog is either targeting the CLI or JVM, so, my own choice. The IDE and compiler integration is very interesting though.

As far as the ML-ish nature of F# goes, hell, couldn't they have dropped that awful syntax by now? Haskell, Clean, and a plethora of other languages just, well, do it better.

The pipe operator '|>', a reverse apply, is a nice trick and idiom Luca used a lot. I thought of it myself too, didn't see it used a lot like that before, but don't think it'll go very far. Whereas I usually write a series of explicit assignments/lets, he doesn't, but he needs to 'jump' between lets and pipes a lot. Better to just stick with a simple assignment.

My interfaces just work on data types, which is a lot weaker than modules. Thing is, since OO objects are best understood as reinstantiatable modules thereby solving a state problem, and I don't have explicit state, guess I don't need them. The integration with OO is cool though.

Their concurrency model is elaborate. It defers evaluation by wrapping functions into asynchronuous values correspondingly typed and, I guess, defers evaluation until a block is ran to completion. I hope, not sure on this, that this also has to do with their somewhat impure nature of their language. Need to figure that one out, but I made a descision. Hi will be pure, except for side effects. I am not a great believer in pure code, but I don't want to work on the language that much more, and it just has some benefits I don't want to throw away.

Small things. It dawned to me that foreach is just better readable to most programmers than map. So, I'll add one silly thing to the list library, an 'each' function, which is -again- just map with his arguments reversed.

[ I think F# 2.0 added units of measure, which is a good thing in calculation. See what I'll need to do to support that. Are they just 'tags' for literal types? Hmm, feature creep. Dropped the idea, it is still cool but I am not sure what you pay. ]