counting words in test_print_tuple.s:

  add      : 6855
  and      : 372
  call     : 472
  cltq     : 299
  cmov     : 302
  cmp      : 6595
  ja       : 2390
  jb       : 1812
  je       : 4936
  jg       : 304
  jle      : 305
  jmp      : 1431
  jne      : 718
  lea      : 8010
  mov      : 27955
  pop      : 4181
  push     : 1797
  pxor     : 297
  ret      : 870
  salq     : 609
  shr      : 1190
  sub      : 729
  test     : 2203
  xorl     : 1943
  \.string : 574
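A tally like the one above can be produced with a short script. This is my reconstruction, not the actual command used: it treats each tallied name as a prefix of the instruction mnemonic (so "mov" also counts "movq", "movl", and so on), which matches how suffixed AT&T mnemonics would fold into these buckets.

```python
from collections import Counter

# The names tallied above; each is treated as a prefix of the first token
# on a line of assembly. The longest matching prefix wins, so e.g. "jne"
# is not double-counted under a shorter pattern.
MNEMONICS = ["add", "and", "call", "cltq", "cmov", "cmp", "ja", "jb", "je",
             "jg", "jle", "jmp", "jne", "lea", "mov", "pop", "push", "pxor",
             "ret", "salq", "shr", "sub", "test", "xorl", ".string"]

def count_mnemonics(asm_text):
    counts = Counter()
    for line in asm_text.splitlines():
        toks = line.split()
        if not toks:
            continue
        op = toks[0]
        for m in sorted(MNEMONICS, key=len, reverse=True):
            if op.startswith(m):
                counts[m] += 1
                break
    return counts
```

Run against test_print_tuple.s this should reproduce the numbers above, modulo whatever the original counting method was.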
Given everything, a factor of 3-4 bigger in size than OCaml isn't even that bad for a first version which does no optimization and compiles to C. I don't have a stack, so every basic assignment translates to roughly two moves instead of a move and a relatively light push. I don't factor out string constants, so there are spurious calls to strneq. And at the moment, each pattern match does an extra structure test on a record, for safety.
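To make the string-constant point concrete, here is a sketch of what "factoring out" would buy: intern each literal once, and a pattern-match tag test becomes a single integer comparison instead of a character-by-character compare. The names here are illustrative, not taken from the compiler.

```python
# Interning sketch: each distinct literal gets one small id, assigned on
# first sight. Tag tests then compare ids, with no string scan at runtime.
class ConstantPool:
    def __init__(self):
        self._ids = {}

    def intern(self, s):
        # same literal text -> same id
        return self._ids.setdefault(s, len(self._ids))

pool = ConstantPool()
CONS = pool.intern("cons")
NIL = pool.intern("nil")

def is_cons(tag_id):
    return tag_id == CONS  # one compare instead of a strneq-style call
```

The cost, as noted above, is that every piece of generated code now depends on a shared pool, which complicates generating thunks on the fly.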
It's hard not to optimize. Factoring out strings is easy, and so is moving to bytecode. But factoring out strings will make it harder to generate thunks 'on the fly,' which I want for a REPL, and moving to bytecode means writing my own optimizer, whereas I now leave that to C, which compiles to way more code than I expected. The usual route for bytecode is that performance drives implementers to optimizations such as hotspot compilation, which means dynamically ending up with the same amount of machine code in memory, if not more. Still, bytecode interpretation often remains faster than naive native code due to fewer cache misses, so it's still a good option.
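For contrast with the instruction counts above, the core of a bytecode interpreter is tiny: one dispatch loop that fits in a few cache lines, whatever the program size. A minimal stack-based sketch, with invented opcodes:

```python
# Minimal stack-machine dispatch loop. PUSH takes an inline operand; the
# arithmetic ops pop two values and push one; RET returns the top of stack.
PUSH, ADD, MUL, RET = range(4)

def run(code):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == RET:
            return stack.pop()

# (2 + 3) * 4
run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, RET])  # -> 20
```

The trade-off discussed above is exactly this: the loop is compact and cache-friendly, but per-instruction dispatch is where the interpretation overhead lives, which is what pushes implementers toward hotspot compilation.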
Still, with some small changes I might end up at 1.5x with respect to OCaml? And 27k moves for 1kLoC of source? It might pay off to write my own optimizer on the generated intermediate assembly, since the compiler knows best that the generated code is basically just shifting constants around.
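A first cut at such an optimizer would be a peephole pass over the generated moves: collapse "mov a, t; mov t, b" into "mov a, b" when t is a compiler temporary. A sketch, using an invented three-tuple instruction form; a real pass would also need a liveness check to confirm the temporary is not read again, which is omitted here for brevity.

```python
# Peephole pass: forward a value through a move chain, eliminating the
# intermediate temporary. Instructions are ("mov", src, dst) tuples.
def peephole(instrs):
    out = []
    for ins in instrs:
        if (out and ins[0] == "mov" and out[-1][0] == "mov"
                and out[-1][2] == ins[1]):
            # previous mov wrote exactly what this one reads:
            # collapse the pair into a single mov
            prev = out.pop()
            out.append(("mov", prev[1], ins[2]))
        else:
            out.append(ins)
    return out
```

On move-heavy output like the 27k-mov file above, even this one rewrite should shave off a noticeable fraction of the instructions.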
I need a comparison between the assembly that is generated and the assembly that should be generated.
Note: It's all a first glance.
I have no humour