Tuesday, November 24, 2009

Log 112409

First compile of new runtime. Runtime now measures 1.3k fully documented lines of code, which is a thad too much for my taste. Passing around an environment pointer explicitly to make it easier to migrate to thread-safe code. Removed GC_alloc, etc, so Boehm is out at the moment. Thinking over the libffi hack. Start of unit tests on all code.

Moved to another Machine. Again...

My Macbook Air's HD crashed leaving me with not a lot except for the sources which I saved just in time. Instead of building an SSD into the Air, one of the finest laptops I ever had except for overheating, I bought a Macbook Pro Aluminium.

I now own: an Asus S5000 (battery dead), a Dell latitude D420 (battery dead), a Dell XPS (fails to load the battery), a Macbook Air (HD crash), a Macbook Pro Aluminium.

I installed Fedora Core 12, which is almost running smoothly (no key backlights, no sound) in about a day. It took some time for it to boot nicely from the Unix partition after the pre-install. I got the feeling it found the bootloader correctly, but subsequently failed to find the linux partition, not sure what was up with that. The HD uses a weird double partitioning scheme which seems to mess up the bootloader. Refit also seems to have some small bug in it, it stalls sometime on reboot. I use a Mobile Broadband dongle which only works when I first boot into MacOs, so there seems to be something fishy with NetworkManager's initialization scripts. Anyway, guess I'll just wait for that and applesmc to be updated.

The Macbook Pro is a fine machine, but somewhat too sturdy if you're used to an Air. Hope it ain't gonna develop problems again, it might be that Fedora doesn't handle Apple's hardware well...

Monday, November 23, 2009

Log 112309

Reasonably clear. Writing unit tests for what should hopefully be the last runtime...

Tuesday, November 17, 2009

Very, Very, Very Fast 2.5-Phase Compacting

I skipped from Boehm/Demers to a (fast) variant of Cheney. Cheney has one serious drawback, it is a copying collector so it wastes memory. I guess I might go for a compacting algorithm for multicore machines. Here a minor variant on compacting.

I assume all nodes only refer to older nodes. For simplicity, I assume here that records only hold references. Nodes also don't point into other nodes.

Assume each node has a meta-field in which is stored the size of the node and an empty field used during compaction.

A running example:

A: B: C: D: E:
[4,#|E,B] ..n.. [3,#|C] [4,#|D,E] ..m.. [3,#|E] ..k.. [2,#|]

For simplicity, the nodes only refer to other nodes. A is a node [4,#|E,B], the first two fields are meta-information, the latter fields the actual content. A has size 4, the second field is not used. The content holds two fields which are references to nodes E and B. ..n.. denotes a space of size n filled with nodes which will be reclaimed.

The collection algorithm:

  1. Nodes are processed from left to right, the first node is marked life. When a marked life node is processed, its children are marked life. In a life node being processed the empty space preceding it is stored, i.e. the sum of the size of all unmarked nodes. For each unmarked node, we add its size to the unused space, and advance to the next node.

    A: B: C: D: E:
    [4,0|E,B] ..n.. [3,n|C] [4,n|D,E] ..m.. [3,n+m|E] ..k.. [2,n+m+k|]

  2. Slide the nodes from left to right, for each field, look up in the referenced field how much that node will be moved and recalc the correct new pointer.

    So, halfway after sliding A and B the heap will look like:

    A: B: C: D: E:
    [4,0|E-n-m-k,B-n] [3,n|C-n] ..n.. [4,n|D,E] ..m.. [3,n+m|E] ..k.. [2,n+m+k|]

  3. Major disadvantage, it slides to the left, where we want to compact to the right. So, that could be solved with an extra phase which moves the memory n+m+k to the right.

    A: B: C: D: E:
    ..n+m+k.. [4,0|E,B+m+k] [3,n|C+m+k] [4,n|D+k,E] [3,n+m|E] [2,n+m+k|]


Room for improvement? The correct final offsets could be calculated in the second phase, leaving a simple copy in the last phase. Nodes to be reclaimed could be merged in the first phase by setting the length field of the head node to the sum of all to be reclaimed nodes following it.

Benefits, ok-ish performance expected since it only has 'two and a half' phases and reasonable cache coherence if most nodes are relatively close. It also keeps the age order correct, which should give better performance than a Cheney collector.

Additional notes:

A disadvantage seems to be that we visit all nodes in the heap during a linear scan, instead of just life nodes during the mark phase. But given modern CPUs, we can expect that to be a minor overhead since for each non-marked node the pointer is just advanced with its size, and often nodes are expected to be in the chache anyway - which gives very good locality, and nothing is written to memory. The algorithm has two linear scans, where other algorithms might use some data structure to maintain information to not visit all nodes. The second phase could be made faster by storing forwarding pointers to the next life node, or adding them to a stack - then again, not sure that is worth the effort. (A slight adaption, in the first phase, let each dead node 'eat' the next dead node by setting it's size to the sum of both nodes.)

For a functional language we expect a large number of dead objects, hence the last phase, a simple copy of a small number of objects, should be relatively light. Its a time/space trade-off, expected running time is probably slower than Cheney, since the whole heap is visited (twice), but it halves the memory space used.

Then again, its my first try at compacting. Maybe a good book?

Something itches in the back of my mind, an old compiler building joke: "Nothing beats free lists."

11/19/2009: (Checked it against 'A Compacting Garbage Collector for Unidirectional Heaps (1998)' by Kent Boortz, Dan Sahlin which I just scanned for basic complexity. It has roughly the same overall complexity, but this algorithm is way simpler, so I guess I am save.)

Thursday, November 12, 2009

Running Time

Did some on and off programming at snail speed. Back-end is starting to look pretty slick. Compiled out the method calls, compiled out the try/catch statements, compiled out the FFI, reduced almost everything to graph rewriting, generating pretty clean C from intermediate, now fixing the runtime.

(Oh yeah, now aiming multicore.)