Tuesday, December 7, 2010

Implementing a Symbol Table, Without a Symbol Table

The Hi compiler, at the moment, doesn't implement a symbol table. It is implemented as a series of transformations on the AST which may carry extra information, like the bindings of symbols to definitions. If a transformation needs those bindings, it is preceded by a run over the AST which builds a map holding the current bindings.

Building a map is a cheap operation, it is hardly noticeable during the execution of the compiler, and it has an extra nice property that I don't need to keep a symbol table in sync with the AST, which may happen if definitions are added and removed.

The heart of the memory consumption problem stems from the naive manner in which I deal with symbols, they are all represented as text. All identifiers are elaborated during identification to their fully qualified name.  I.e., a function 'bar' in namespace 'foo' is mapped onto the text 'foo.bar'. Moreover, texts are not unique in their memory representation and symbols generated during semantic analysis are represented as texts. Hence a compiler which wastes memory, moreover is slow on look-up since text equality is implemented on lists of characters.

(Yeah, I knew it would be slow and waste memory. I thought it would take a few MBs, little did I know it would mean consuming 2.4GB.)

Moving to string support means getting rid of the expensive check for equality and not using as much memory.

Being a bit more smart with the representation of symbols, i.e. representing a qualified symbol as a list of texts, instead of a list of chars, would also help.

Mapping symbols to integers in a symbol table would also be helpful, i.e. text comparisons are replaced by integer comparisons at the expense of another global data structure. Question is: Can I do without? Abstractly, I'll need to change the internal representation of an identifier from text to some abstract value of type symbol.

Let's say I choose this representation:

    type symbol = [ raw text 

                  | qualified (list symbol) 
                  | abstract int ]

Is it worth it to go for a distinct data structure which holds all symbols? Another manner of keeping memory consumption low is just traversing the AST with an extra pass which makes all  'raw' texts unique and mapping local, not global, identifiers to integers. A bit slower on comparison, sure, but I keep the advantage of not needing a separate symbol table.

(I thought about it. I'll implement a symbol module since it is orthogonal to implementing texts as native strings, or reference counting the allocated data, and makes explicit that reasoning over symbols is best done abstractly and factored out of the code. Possibly, I'll add string support, maybe reference counting, later.)

(And then again, sigh, life is never easy. If I go for native text support, flattening a piece of text or comparison might take less memory and be faster... A 64 bit model means you can cram 8 bytes in the size of a pointer, and that is not including the overhead of 'boxing the pointer' in a cons cell. Implementing a symbol abstraction might just not be worth it.)

No comments:

Post a Comment