Tuesday, December 27, 2011

Handwritten Scanners and C Streams

Writing a scanner is simple, at least it ought to be. Lexing usually involves writing down the regular expressions and coding them with a tool like Flex, or writing the scanner by hand. Using a lexing tool is usually recommended, though there are sometimes reasons not to: portability, context-aware lexing of mixed languages, lexing very large files, Unicode, building reentrant scanners, and sometimes speed. GCC, for example, has a handwritten lexer.

Since regular expressions correspond to deterministic finite automata, LL(1) parsing should normally be enough, and C streams - which allow one byte of lookahead - should do the trick. When using lexing tools, that is.

When handwriting a lexer for more complex expressions, stuff gets bizarre. What to do with 3.14e+a? Is it a floating point number lacking digits in the exponent, an error, or a floating point number followed by an identifier 'e'? In such cases, one would really like more than one character of lookahead, but that is not guaranteed when using C streams.

First: Two random thoughts, and requirements, on lexing I want to share, for no particular reason.

Most programming examples for handwritten lexers use 'get' - get a token - and 'push' - push a token back onto the stream - as primitives; I prefer 'look' - look at the current token - and 'skip' - remove it/go to the next token - because it is cleaner and doesn't modify the input source.[1]

One usual requirement on scanners is that a lexer should read just enough of the input stream to recognize one token and present that to the parser. The reason: very large files. Despite everything, reading a whole source file for the lexing phase is not done, and I'll abide by this rule for no other reason than that the Gods of programming determined it should be so.

Now: the current question of writing a scanner in C.

I am stuck with this question: Can you write a lexer which doesn't put back tokens? And if so, how?

And the answer is: Of course you can. Most lexing problems reduce to deterministic finite automata, and you can use the state space of the automaton to 'remember' the characters read. But it won't be reentrant, since the current state of the scanner then also consists of the memorized characters. This is probably why C has pushback streams, though that helps little when recognizing complex regular expressions.

The best manner to model a handwritten scanner, I think, is therefore to look at it as some kind of Mealy machine, or some kind of pushdown automaton. It reads characters until it has gobbled up enough to determine it has recognized a token, emits that token, and saves the scanned but unused characters in its state, together with 'meta' state information about what state it should be in given the characters already read.

And presto, we have a design. My lexer should have the following minimal state: position information, a meta state encoding the currently recognized token and the state of having recognized a number of characters ahead, the characters read, and the number of characters to emit. Though the last could be seen as part of the meta state.

For people who want to know more: read the Dragon book. This is all information I reverse engineered by thinking about it, but it should be known and written down in compiler books, though the peculiar interaction of lexing and C streams might not be.

[1] I am guessing at this. But I think the historic rationale is that in very resource-constrained environments it is best to have essentially one primitive, 'get', and never do any pushback; i.e., it saves one call. But I have the feeling that 'look' and 'skip' are academically nicer. Unsure about this.

Monday, December 26, 2011

Unicode Decisions

Strings are peculiar things these days. Since I am now aiming at a high-level language with a fast interpreter, I need to solve what to do with character encodings and string representations.

The best I can think of for now is that characters should be represented internally as wide characters and strings as multibyte sequences.

Which breaks binding to C, since normal characters become much larger than C expects. Unless I think of something smart I'll go with this representation anyway, which also means I'll need a handwritten lexer and parser, since tables explode with wide character entries...

No libffi, no lex, no yacc. No libffi also implies glue code or no cheap manner of reusing C libraries. Hmm.... Whatever?

Prototype ****s. Starting on an interpreter.

I installed the latest Fedora on my Macbook, and am now happily programming in C again. Programming is a blast given that I know how most of it should work.

That did it. I am going to work towards the fastest, smallest, and easiest interpreter in vanilla C while using the bootstrap compiler as a design.

It could take a year, of course...

Saturday, December 24, 2011

Bit Twiddling Logic on 64 bits

I decided on 64 bit headers. Since they will be used a lot, I want a fast scheme for compacting and extracting the fields. A problem that occurred to me: say you design an elaborate bit field scheme on 64-bit registers; then you're bound to do exotic stuff like checking the 51st bit. If one isn't careful about it, that can generate lots of instructions with large 64-bit constants, or references to 64-bit constants.

So, 64 bit fields need a different approach than 8 bit fields.

Luckily, one can mostly rely on the facts that smaller integer fields are cheaper, that comparing to zero is normally cheap, and that negating an integer is cheap too. I am not sure about taking the absolute value without branching.

I need the following:

  • a bit to check whether a field is a constant series of bits,
  • an integer field mapping to a record table (of strings),
  • an integer field mapping to a type table (of strings),
  • an integer describing the full size of the record.

I decided on distributing them over 1, 15, 16, and 32 bits, without a real rationale other than that 64K of types is probably enough, and a large record size field can be abused to implement arrays.

So I am thinking of using the signs of the fields for bit twiddling; or not even that, but just to store information. Like: a negative record table value indicates a primitive value, and zero is reserved for thunks.

Friday, December 23, 2011

Lessons learned

Going old-skool and bootstrapping a language is a great way to learn about compiler technology, since most books touch on the what of compiler construction and hardly on the why. I.e., one usually learns next to nothing from reading a book or copying a design. Having said that, implementing a language from scratch gives you a lot of why answers, but since you get some of the hows wrong, you end up implementing the whole enchilada again.

So the lessons learned:
  • Old-skool bootstrapping is a lousy idea given that writing an interpreter is a lot easier and gives you a lot more time to experiment. Hugs, or was it Gofer, probably existed for a decade before people began building performing Haskell compilers, and I should have done the same. The bootstrap leap is just too big unless you copy a design - which I find an uninteresting exercise.
  • Lisp days are over. On 64-bit machines you want to use a packed text representation, since a list-of-characters representation consumes an awful lot of memory. This is both a problem of the compiler - had it been implemented similarly in Haskell, it would have the same memory consumption (2.4GB) - and of the naive unpacked 64-bit representation of the memory model. I'll change this.
  • The FFI scheme is dynamic, based on libffi, reasonably portable, but can't handle corner cases. Moreover, it adds a dependency and vanilla C is just so much nicer. I'll change this.
  • The combinator scheme used is about on par with an (old) JavaScript or Lisp compiler/interpreter. It's going to be at least one order of magnitude, possibly two, slower than C. Which is still an order faster than Python, so I am not sure how to count that. There are three main reasons for the creeping performance:
    1. No stack, all thunks are allocated in the heap. A stack based runtime will always beat that, but programs will segfault from time to time because people will continue writing recursive functions which will consume the C stack. I won't change this.
    2. Lack of 'lets' or ANF form. Now all local computations are lifted out, leading to an explosion of combinators, compile time, and thunks allocated at runtime. I am not sure how much this costs, but I assume I'll need to implement this. I'll change this.
    3. The compiler generates an unhealthily sized binary executable. I have no idea why, except for some hunches: a) it generates 64-bit assembly from C code, b) native assembly is wordier than a concise minimal VM (32 or 16 bit) instruction set, and c) compiling to a 'concrete' machine is more costly than compiling to a SECD, or other lambda calculus inspired, machine. Having said that, Haskell binaries - despite the VM assembly - seem to be as bloated, so it might be unavoidable when compiling functional languages. I won't change this.
  • The compiler uses the concrete C bit words internally for primitive values. This leads to a very small compiler, since there is essentially one primitive value type (to optimize on), but I actually grew tired of all the a priori word manipulations. Moreover, it leads to a dependency on C primitive value representations and, consequently, the machine architecture. Nice for toy compilers, but actually just a lousy idea, so the alternative of elaborating all primitive types appeals now. I'll change this.
  • The compiler uses texts internally a lot for identification, which is lousy given how they are represented, but - when switching over to native strings - is probably as costly in memory consumption as integers, yet very costly in runtime performance. I won't change this.
  • The compiler has no explicit symbol table, which is again novel; it doesn't seem to matter much. What is far worse is probably the performance of the lookup. I won't change this now (see purity).
  • The compiler employs novel AST data traversal combinators which are good for implementing concise algorithms since they mimic an intermediate between attribute based compilation and direct traversal techniques, but it costs probably a factor two in compilation speed. I won't change this.
  • The compiler uses a novel bottom-up constraint gatherer and a separate type inferencer, an idea copied from the Helium Haskell compiler; it was rushed, and I am not sure of all the inference rules I use. The bottom-up part is academically nice, but, even with unfinished inference rules, I can see that no production compiler will ever want to take the associated performance hit. So, old-skool ML-like direct AST typing it is. I'll change this.
  • I ended up with a pure compiler (except for C calls), which I think is a bad idea. It doesn't use a symbol table at the moment, but a symbol table will probably come in handy when more information is needed due to a more elaborate VM targeted. I personally loathe monads, so I should make the language impure. Hello ML? I'll change this.
I'll start with a new runtime and leave the thought on how to target it for a while. I blathered a bit with some C programmers about it, and by nature they'll always target an OCaml-like runtime or minimal design - I don't like that particular VM much - but my nature is different. The language should be robust, but not low-level, so I'll go in the direction of a Java-esque design, which means 64-bit headers for everything.

It should be fun, and I don't feel like working on it full time. So it'll take some time, again.

Wednesday, December 21, 2011

hi-language.org expired

Oh great, I didn't look at it for a while, and now the domain expired and they deleted the whole site... Fine, that the DNS entry expires is okay, but why delete a perfectly functioning Google Site? (Oh great, I found it again.)

Anyway, I decided I'd start the whole project again. I learned enough from the prototype to know what works and what doesn't, so whatever. Think I'll just start a new site; hopefully, most of the bits are still on this laptop.

Great: when a Google Sites domain expires, the automatic interaction between Google Apps and the hosting provider breaks. Two options: reregister with the hosting provider, reinstall the DNS records, and hope stuff will work from that point on; or - what I think is the least error-prone - register a new site...

Wednesday, July 20, 2011

Leaky Shift

Sometimes it helps to bounce ideas against other people, so thanks to LtU/MattM for getting something right. I hacked together a solution once to deal with the translation of exceptions, try/catch and throw blocks. I looked a bit at shift/reset, found it too difficult to implement, and tried to implement something more trivial with two combinators.

  R f = f S (remember the current context in S)

  S x = x   (but also, whenever S evaluates it replaces the original R f term)

An example:

  1 + R (\s -> 1 + s 3) = 1 + (\s -> 1 + s 3) S = 1 + (1 + S 3) = 1 + 3 = 4

The problem, of course, is that you don't want an S combinator to leak outside an R term. I was aware of that, but - rather unconvinced - thought it might go right.
Try/catch blocks are translated to a somewhat difficult lambda term employing R. The general thought - more hoped than proven - is that either an exception is thrown (and an S is eaten), or a value is produced (and the introduced S's are lost). I.e., no leaks.

A post-mortem analysis of what I did:

It works, but it also dawned on me that it may only work at the moment because I only compile straightforward code in the bootstrap compiler. The problem occurs when translating try/catch blocks which return functions. A contrived example:

  f = try [ x -> x / 0 ] catch [ NaN -> throw "division by zero" ];
  f 3

I didn't try it yet in real code, but I am pretty sure the result would be that the thunk which holds f would be overwritten with the thrown exception, instead of an exception being thrown. Fortunately, it also looks fixable.

One solution would be to remove exception handlers from curried thunks and reinstall the right ones from the context with employed @ combinators, but it would be difficult to get all the invariants right.

Another, better, solution would be a somewhat different compilation scheme where the curried thunk f is guaranteed not to also contain the exception handler, and f 3 is translated to f (exception-handler) 3, which would be more straightforward. I should check whether I didn't translate to that by accident. Ah well, back to the drawing board.

A last, best, solution would be to assign to f not the lambda abstraction, but the whole try-catch block. That should do nicely. And I may even have done that by accident? Again, I should look at the code.

Wednesday, May 25, 2011


Aliasing describes a situation in which a data location in memory can be accessed through different symbolic names in a program. I use aliasing a lot in the garbage collector; that rested on a wrong assumption on my part about the semantics of the C language. It turns out aliasing is not supported by default, so that certain optimizations work.

The problem is that I need aliasing. Thunks are supposed to consist of memory cells which mostly hold integers or pointers, but sometimes they hold boxed character lists, floats, doubles, or whatever one would like to store in a constant.

I am not really writing on the compiler at the moment, but - as I am dropping the dependency on libffi - I would like to drop the compiler switch which turns strict aliasing off (and thereby makes aliasing possible)...

I am not sure there's even a compatible manner in which to support this.

Update: Strange - the standard makes an exception for char pointers, so I guess I'd need to use that. Can't I use this in reverse, using a char pointer as an intermediate?

Update: It may be impossible if I read this rant.

Update: Sometimes a question and its answer aren't well defined; the guys at comp.lang.c helped me figure it out. If I want to compile with strict aliasing, I just need to make sure that pointers of different types refer to different values. (Or, stated differently, it's a rather silly idea to want to alias stuff where strict aliasing tells you that you cannot.) So, wherever I cast/alias stuff now, I'll need memcpy. Hope that is sufficient.

Saturday, April 30, 2011

Log 043011

I removed the embedding functions and introduced a native keyword into the compiler. Now, like in most other languages, external definitions are just given, for example: native inc: int -> int. I.e., the name and type are given but the body isn't.

I now need to build a new runtime which makes use of this.

Monday, April 25, 2011

Dropping libFFI

I don't like the dependency on libffi anymore. The prime reason is that it makes it near impossible to treat strings as character arrays. The second reason: I expect problems with it when compiling on different platforms - dropping it removes a dependency. The last reason: although everything - especially the definitions of external terms - becomes more bloated, it will simplify the compiler and make native function calls a lot faster.

So long, C...

MLs Value Restriction

Don't generalize mutable type unknowns

Sunday, April 24, 2011

Hi Gambit

I am thinking of changing the back-end to Gambit...

And decided against it.

Friday, April 22, 2011

Compiler with Sources now for Grabs

Get the compiler sources from http://www.hi-language.org/ if you care. I probably should have cleaned up the source code, but then again, why bother at the moment?

Hacked A Compiler Reference Document

As the title says, I hacked something together, probably should have spell-checked it first. Ah well, read it here.

Wednesday, April 20, 2011

Going Open Source

I lost track of where I am and decided to open source the compiler since it doesn't seem to be going anywhere at the moment. As a preview, read the source of the compiler in printable format here on Google Docs.


I am moving to a new home. I added the OCaml bootstrap compiler to the distribution list of hi-language.org for historic reasons.

Wednesday, February 16, 2011

Typing Problems

The question is what problem I am solving by looking at other type inference algorithms.
  1. Traditional HM style inference allows for the introduction of non-constructive types.
  2. Traditional HM style inference leads to strange awkward fixes such as the monomorphism and value restrictions.
  3. Impredicative HM style inference is awkward, I am not convinced that such modalities are needed.
  4. Impredicative HM style inference in the general case is undecidable.
I need to elaborate on these points with examples.

1) might be minor. The introduction of non-constructive types might not be a problem since the general excuse for that could be that they cannot be instantiated except with non-terminating terms. Still, I guess the theoretical consequence is that such a type system is inconsistent. I need examples for that.

2) is major. I don't dig the value restriction and think it is nonsense; or rather, a poorly fixed bug of a typing system. Again, I need examples.

3) is just a conjecture.

4) is a major problem though some claim to have solved it.

The question is if, and how, a shift to type constructors and constructive types would allow a fix for this. Generalizing what I am doing: I am going for a type system in which universally quantified types like ∀α.∀β. α → β are forbidden, but constructive types such as ∀α.∃f. α → f(α) are allowed, possibly interpreted as ∀α.∃f.(Λβ → f(β)) α, in the hope that only type arguments need be derived from the term, and most 'logical' quantifiers can be used without a real restriction on where they occur in the type and can be dropped/made implicit.

Not sure I am solving any problem here.

Friday, February 11, 2011

Awkward Types

As I stated in the previous post, I am rethinking some assumptions of HM-style inference. Below: an example of how I want to fix HM, and another awkward example.

A main thought behind a new typing system is that I think some of the problems stem from not differentiating between argument types and result types. I would like to see a bigger role for type constructors in the type system for the Hi language.

An example: the type σ → τ can be understood as a propositional formula according to the constructive interpretation of types. For such a type, in the simply typed lambda calculus, a term should be constructed or assumed which constructs values of type τ from values of type σ. Dropping the propositional interpretation, I would rather see such a type interpreted, possibly denoted, as σ → f(σ); instead of moving the blame for proving the validity of a type to the term, I would like to work with types which are correct from a set-theoretic assumption, concentrate on what they denote, and type-check and derive properties accordingly.

It still raises the question whether there is anything wrong with naive HM inference and whether this small shift in interpretation would change anything fundamentally. A non-trivial example type which exposes what would change:

(σ → γ) → σ → (τ → γ) → τ → γ

Let's look at the above type constructively. The propositional interpretation is that we should be able to construct a value of type γ out of the antecedent. Concretely, we have the choice to construct that value by combining the function of type σ → γ with a value of type σ, or similarly combining terms with types τ → γ and τ.

Let's change our view to the type-constructor interpretation of the above, making the constructors the type denotes explicit:

(σ → f(σ)) → σ → (τ → g(τ)) → τ → h where h = f(σ) = g(τ)

Here we now have a number of surprising choices.

One, we could choose to reject this type, since we cannot assume that a general constructor can be found which generates an equivalent type h = f(σ) = g(τ) for arbitrary σ and τ. Another possible interpretation is that the proof - the choice which would be made at the term level in a constructive calculus - hasn't been made explicit in the type.

Second, we could derive σ = τ and f = g, and interpret that type as

(σ → f(σ)) → σ → (σ → f(σ)) → σ → f(σ)

Lastly, another choice is to derive that f and g cannot depend on σ and τ; therefore they are Skolem constants and the valid interpretation of the type is

(σ → f()) → σ → (τ → f()) → τ → f()

(I would prefer this. It feels like a restricted approach which exposes that type variables which are normally universally quantified can be approximated with argument types, and existential types can be approximated with unknown Skolem functions. Not sure.)

Beyond awkward.

(Possibly, it is sufficient to forbid type variables at result positions, have the programmer make type constructors explicit, and check the validity of those? In the end, I guess the original example type is an example of a type which should be rejected, or made impossible to denote, by a type system which makes the constructors explicit. I.e., there should not be a valid translation for that type when made explicit since there is no interpretation which validates that type according to a type-constructor semantics.)

(I could have written this post differently and just expose that, hey, look there are types which make sense from a constructive point of view, but possibly do not make sense from a set-theoretic/type-constructor point of view. Moreover, by making the type constructors explicit, the proof burden associated with a classical propositional type is moved to the type - what remains is a validity check of the term.)

(Making sense of it all. I should rewrite this post, make clear I want to type-check with type constructors, show what the difference is w.r.t. traditional HM-inference, what problems that approach solves, and make explicit which choices one can make in the typing rules. My assumption at the moment is that type checking with type constructors could be a good alternative to impredicative typing.)

Monday, February 7, 2011

What's Wrong with Hindley-Milner?

Almost all students of CS learn basic type theory. In that setting, usually type systems are studied for lambda calculi where the starting point is the simply typed lambda calculus. Type systems are usually developed with an emphasis on a number of academic invariants, such as normalization, and the Curry-Howard correspondence.

I sometimes doubt the assumptions used and the usefulness of some of the type systems developed - especially within the context of the Hi language. As far as I am concerned, such a thing as normalization is only interesting in the context of long-term studies where programming languages are developed where well-typed programs terminate, a thing I am not very interested in within the pragmatic choices of a functional programming language which should work today. Moreover, from a programmer's point of view, as far as I am concerned, the only interesting property of the Curry-Howard correspondence is that it shows that you need a prover, or 'proving power,' to show that programs are well-typed.

Given those observations, I have a certain amount of distrust towards impredicative type systems which were originally developed within the context of types for logical statements, but now are implemented in functional programming languages. Are programming language developers progressing in the right direction, or are they progressing in a certain direction because of a lack of imagination at developing more pragmatic typing systems?

In that context, I would like to rethink the assumptions of Hindley-Milner based typing systems and go back to the only interesting property for type systems for programming languages: A well-typed program labels terms such that all values compute correctly.

But there are an enormous number of typing systems. Only experts know how to value them; I am not that big an expert, so the only option I have is to go back to simple examples and hope that I can make some sense of them and show where the theoretical pain lies.

A simple example:

      def f: (α → β) → α = f

Note that most Hindley-Milner based systems for functional programming languages will accept the above function.

I am interested in this particular example since it shows three things.
  1. One, normalization is a non-issue; in that respect the term is well-typed. A pragmatic type system should not care about normalization.
  2. However, it shows a type for a function which I think should be rejected, since one cannot construct a value from the domain of a function; a naive Hindley-Milner type derivation of the term can trick the type system into believing that all is okay.
  3. Lastly, it shows that this is not a problem of impredicativeness, but more a problem on what we believe a type denotes, and how to manage the unification of type variables.
The problem at the moment is that I have ideas on how to tackle this, but lack the oversight and vocabulary to correctly work towards a solution.

Not sure anymore. Possibly it is sufficient to keep HM style inference, but restrict the declared types to constructive types with a simple validity check.