Ekaitz's tech blog:
I make stuff at ElenQ Technology and I talk about it

Understanding GNU Mes’ performance

From the series: GNU Mes interpreter speedup

Before1 starting our work on the project we should, first, make sure our hypothesis is true and GNU Mes could indeed be faster and, second, try to understand where we could improve the execution speed.

The benchmarking tool

In order to check how fast GNU Mes is, I created a very simple benchmarking tool. The tool just runs GNU Mes under time (the command, not the built-in), and creates a report in a bunch of text files. Of course, time is not a great way to benchmark, but my hypothesis was that mes (remember, the scheme interpreter in GNU Mes) could be way faster than it is, so, if that’s true, time should be more than enough for the comparison.

This is done in the simple.make makefile, which has been revived to serve this development period. The simple.make makefile is important because normally GNU Mes is built from scratch, including the meslibc (remember, the C standard library in GNU Mes), which takes a long time. The simple.make file builds just mes, the interpreter, with GCC or M2-Planet, so the iteration time during development is acceptable: it's almost instantaneous.

So, in summary, this:

make -f simple.make benchmark

would run the benchmark for the default mes, the one built by GCC (which is, of course, built first if needed). But we also have a benchmark-m2 target for the mes built by M2-Planet.

The benchmark at this very moment runs three cases:

  • fib: A simple scheme program that finds the 35th element of the Fibonacci sequence.
  • mescc-hello: A mescc call that compiles a simple hello-world program written in C.
  • mescc-mes: A mescc call that compiles mes' source code.

These three cases cover very basic operations but let us know if our changes are relevant in mes' main job: compiling some C code. They also give us some interesting feedback about the evaluation speed in the fib check, especially for a very recursive program. The mescc-hello case is a minimal test that shows us the startup speed of the mescc compiler.

However, that alone doesn’t give us a lot of information, because we don’t have anything to compare with. The Scheme interpreter in GNU Mes is compatible with Guile, meaning it could be replaced with Guile at any moment (maybe not the other way around). There we have our great candidate for comparison.

The data

The following table shows the wall-time results for all the benchmarks, in seconds, written as Average ±StdDev. Each benchmark was run 10 times on my laptop (i7-10510U).

             mes-gcc        mes-m2         guile         guile†        guile†‡
Fib          21.39 ±0.66    49.13 ±0.11    0.27 ±0.00    3.57 ±0.27    8.67 ±0.38
Mescc-hello   1.38 ±0.04     2.69 ±0.01    0.41 ±0.01    0.45 ±0.03    1.03 ±0.04
Mescc-mes   103.39 ±3.03   214.88 ±0.34   17.03 ±0.30   18.70 ±0.95   27.34 ±1.09

† With GUILE_AUTO_COMPILE=0
‡ With GUILE_JIT_THRESHOLD=-1

From left to right: mes-gcc means mes built with default GCC (including optimizations), mes-m2 means mes built with M2-Planet, and guile means running guile (v3.0.9) instead of mes.

The first conclusion is pretty obvious:

Guile is way faster than Mes when compiling C code, even with all the optimizations disabled.

Unfair!

But that conclusion may look unfair to most of you, especially because Guile is a pretty large and powerful Scheme implementation, while Mes is a simple Scheme interpreter designed to be as minimal as possible.

Chibi-Scheme is a very interesting small Scheme interpreter that happens to be somewhat slow compared with other major Scheme implementations. In terms of lines of code, chibi-scheme is around twice the size of mes, which is still a manageable size. Chibi-Scheme is not compatible with Mes as Guile is, so we can only run the fib benchmark with it. The following table shows a comparison between the best Mes scenario and Chibi-Scheme.

                 Mes            Chibi-Scheme
Lines of Code†   5838           13135
Fib (s)‡         21.39 ±0.66    1.77 ±0.03

† Only counts the lines in .c files
‡ Both built with GCC

That’s quite an improvement for the size!

Conclusion from the benchmark

Before drawing any conclusions, we should make sure our assumption holds and the differences are large enough to be captured by the time command. Looking at the orders of magnitude we are dealing with here, I would say the data we obtained is a valid reference point: the differences are large enough to be properly captured by time, and our goal is a noticeable performance improvement.

From there, we understand the obvious: Guile is way faster than Mes, even with compilation disabled, and by a huge margin. Chibi-Scheme is also way faster, but not as fast as Guile with all optimizations enabled.

When it comes to the compiler used to build Mes, mes compiled with M2-Planet is twice as slow as with GCC. This is a very relevant detail to keep in mind during development. GCC’s optimizations account for a huge performance improvement in mes, and we cannot rely on them during the bootstrapping process.

Profiling

Still, we are far from justifying a full rewrite of the evaluator, and even further from rewriting mes as a bytecode interpreter, as proposed in the project presentation. That’s why I decided to run a profiler, again, the simplest possible, so I could get some insights from the execution of the program.

I ran gprof on mes for both the fib and mescc-mes cases2 and I summarize the output of the flat profile below, right after a little reminder of what all the columns in the table mean.

Just to clarify, gprof requires the program to be built by GCC with the -pg flag; we don’t have profilers for the program built with M2-Planet, but I think the insights obtained from this can be extrapolated to mes-m2 in most cases. You’ll see why in a minute.

Legend
% time
the percentage of the total running time of the program used by this function.
cumulative seconds
a running sum of the number of seconds accounted for by this function and those listed above it.
self seconds
the number of seconds accounted for by this function alone. This is the major sort for this listing.
calls
the number of times this function was invoked, if this function is profiled, else blank.
self ms/call
the average number of milliseconds spent in this function per call, if this function is profiled, else blank.
total ms/call
the average number of milliseconds spent in this function and its descendents per call, if this function is profiled, else blank.
name
the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed.
Gprof flat-profile for fib
%time cumulative seconds self seconds calls self ms/call total ms/call name
20.06 4.01 4.01 37 108.38 416.96 eval_apply
11.26 6.26 2.25 600202232 0.00 0.00 gc_push_frame
7.10 7.68 1.42 600202232 0.00 0.00 gc_peek_frame
6.83 9.05 1.37 481504722 0.00 0.00 make_cell
6.03 10.25 1.21 600200793 0.00 0.00 push_cc
5.48 11.35 1.10 300124024 0.00 0.00 struct_ref_
4.70 12.29 0.94 _init
3.70 13.03 0.74 269868352 0.00 0.00 make_value_cell
2.35 13.50 0.47 600202232 0.00 0.00 gc_pop_frame
2.15 13.93 0.43 406099172 0.00 0.00 cons
2.15 14.36 0.43 74983231 0.00 0.00 apply_builtin
2.00 14.76 0.40 300279513 0.00 0.00 cell_ref
1.95 15.15 0.39 300124024 0.00 0.00 assert_struct
1.85 15.52 0.37 135158343 0.00 0.00 length__
1.60 15.84 0.32 29862855 0.00 0.00 minus
1.38 16.11 0.28 300124024 0.00 0.00 assert_range
1.38 16.39 0.28 75315259 0.00 0.00 assq
1.18 16.62 0.24 75280752 0.00 0.00 lookup_binding
1.08 16.84 0.22 75280689 0.00 0.00 make_binding_
1.05 17.05 0.21 269866060 0.00 0.00 make_number
1.05 17.26 0.21 173 1.21 1.21 macro_set_x
1.00 17.46 0.20 135178741 0.00 0.00 gc_check
0.90 17.64 0.18 30093052 0.00 0.00 pairlis
0.85 17.81 0.17 105069896 0.00 0.00 check_formals
0.83 17.97 0.17 29861090 0.00 0.00 less_p
0.00 19.99 0.00 1 0.00 0.00 open_boot
0.00 19.99 0.00 1 0.00 3.09 read_boot
0.00 19.99 0.00 1 0.00 0.00 reader_read_octal
0.00 19.99 0.00 1 0.00 0.00 try_open_boot
Gprof flat-profile for mescc-mes
%time cumulative seconds self seconds calls self s/call total s/call name
31.64 58.01 58.01 81 0.72 1.91 eval_apply
7.53 71.82 13.81 428598051 0.00 0.00 assq
6.11 83.03 11.21 2789709236 0.00 0.00 make_cell
5.96 93.95 10.92 2109168010 0.00 0.00 gc_push_frame
4.88 102.89 8.94 607787117 0.00 0.00 length__
4.11 110.43 7.53 2109168010 0.00 0.00 gc_peek_frame
3.61 117.04 6.61 891765473 0.00 0.00 struct_ref_
3.39 123.25 6.22 2109164356 0.00 0.00 push_cc
3.39 129.46 6.21 _init
2.27 133.63 4.17 1742421 0.00 0.00 assoc_string
1.96 137.22 3.59 2235192336 0.00 0.00 cons
1.79 140.50 3.28 215232553 0.00 0.00 apply_builtin
1.54 143.33 2.83 679409558 0.00 0.00 make_value_cell
1.49 146.06 2.73 200300883 0.00 0.00 pairlis
1.18 148.22 2.16 2109168010 0.00 0.00 gc_pop_frame
1.13 150.28 2.07 925236148 0.00 0.00 cell_ref
1.10 152.30 2.02 196474232 0.00 0.00 gc_copy
1.07 154.26 1.96 419658520 0.00 0.00 lookup_value
1.04 156.17 1.91 411189650 0.00 0.00 check_formals
1.02 158.03 1.87 423486766 0.00 0.00 lookup_binding
0.98 159.82 1.79 183 0.01 0.01 gc_cellcpy
0.91 161.50 1.68 891765473 0.00 0.00 assert_struct
0.86 163.08 1.58 423486608 0.00 0.00 make_binding_
0.78 164.51 1.43 183 0.01 0.03 gc_loop
0.78 165.94 1.43 635403488 0.00 0.00 gc_check
0.00 183.37 0.00 1 0.00 0.00 open_boot
0.00 183.37 0.00 1 0.00 0.00 open_output_file
0.00 183.37 0.00 1 0.00 0.00 read_boot
0.00 183.37 0.00 1 0.00 0.00 reader_read_octal

The profiling data in context

The functions at the top are mostly the same in both cases: eval_apply, gc_push_frame, push_cc, gc_peek_frame and make_cell. They represent around 50% of the execution time (51.21% in fib and 51.28% in mescc-mes) and they are closely related.

Before trying to explain why, we could spend a minute checking how they are eating that amount of time. eval_apply is executed a few times but it takes a long time to run, while the rest of them are very fast functions (0.00 ms!) but they are executed millions of times per second.

The reason for that is that eval_apply is a long-running loop that is calling them. It’s calling them all the time.

Deep in eval_apply

As I already explained in earlier episodes, mes is a tree-walk interpreter, but that doesn’t say much about how things work.

The current design of mes is very interesting. I didn’t ask the author about it but it feels like it was written in Scheme first and translated to C later. I say this because the eval_apply function, the core of the program, that expands the macros, traverses the tree, and does all the evaling and the applying, is written as it would be written in Scheme: abusing the tail-call optimization3 with mutually recursive functions.

That’s pretty understandable: if you are writing a Scheme interpreter you’d like to write Scheme, so you Schemeify your C as much as you can. Also, it happens that Scheme can be very elegantly described in Scheme, as the GNU Mes manual mentions, and most of the literature about Scheme interpreters uses Scheme. This tradition goes back to SICP, and even further, to the invention (discovery?) of LISP itself.

But C is not very compatible with that programming style. The C stack would overflow due to the lack of tail-call optimization.

GNU Mes overcomes that limitation by manually manipulating a stack, allocated by its memory manager, and emulating function calls with gotos. Arguments are passed to these “functions” through global variables that act as “registers” in this “virtual machine”.

Consider this piece of code:

evlis:
  if (R1 == cell_nil)
    goto vm_return;
  if (R1->type != TPAIR)
    goto eval;
  push_cc (R1->car, R1, R0, cell_vm_evlis2);
  goto eval;
evlis2:
  push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
  goto evlis;
evlis3:
  R1 = cons (R2, R1);
  goto vm_return;

The “registers” are those R1, R0 and so on, and we don’t really need to see what they mean to understand how this whole thing works.

Imagine we enter evlis with an R1 that is a list: it’s not cell_nil and its type is TPAIR. So we skip both if statements and reach the push_cc. This call pushes our current position in the code, cell_vm_evlis2, to the stack, and then we jump to the eval label. eval eventually ends with a goto vm_return, which pops the current frame and jumps to the top of the eval_apply function.

vm_return:
  x = R1;         // This saves the return value of the "call"
  gc_pop_frame ();
  R1 = x;         // and puts it back, after it has been overwritten by the pop
  goto eval_apply;

And, what do we find there?

A huge if cascade that brings us back to the label where we left off:

eval_apply:
  if (R3 == cell_vm_evlis2)
    goto evlis2;     // Bingo!
  else if (R3 == cell_vm_evlis3)
    goto evlis3;
  else if (R3 == cell_vm_eval_check_func)
    goto eval_check_func;
  else if (R3 == cell_vm_eval2)
    goto eval2;
 // There are quite a few more below.

Didn’t we just emulate a “function call”?

I’m sure you can continue with the execution from evlis2 by yourselves, just with the code I pasted here, and understand how a list is evaluated in mes.

It might feel crazy, but that’s because during the explanation the goal of this approach was left aside for a minute. I didn’t mention that the eval label we jumped to might be jumping to many other places, including apply, which would jump to eval again, which, as we mentioned before, would blow up our stack if we were doing this with C functions. Doing it this (weird) way lets us control the stack and manually apply the tail-call optimization: a tail call is just a goto without the push_cc.

This whole thing is looping all the time, doing goto eval_apply over and over again, and that’s why the actual eval_apply function takes that long to run. It’s the main loop of the interpreter.

Now we are in a great position to understand why those other functions were called so many times. push_cc is the easiest one: it’s what prepares a call to one of those internal “functions” that require pushing to the stack. gc_push_frame is called by push_cc and is the function that actually does the pushing. gc_peek_frame is also called many times, but its caller doesn’t appear in the listings I shared: gc_peek_frame is called by gc_pop_frame, exactly what the vm_return block does.

The make_cell case is slightly different but also related. It’s called mostly from cons, and its sibling make_value_cell is also called many times. These are the functions that allocate Scheme values. Of course, we cannot simply eliminate the calls to these functions, because a Scheme interpreter will always need to create values, but my bet is that these functions, especially cons, are called more often than strictly needed because of the evaluation strategy I just described.

Conclusions

There are many conclusions to obtain from all this data and the design of the evaluator.

The first and most obvious one is that the project proposal makes sense: we could make mes way faster, as Guile is, even while keeping the interpreter small, as Chibi-Scheme shows. Of course, I don’t expect to obtain a 10x performance improvement in the first iteration, but I believe we can do a lot by “starting with the simplest thing that could possibly work4” if we keep a non-pessimization philosophy.

The question now is, what is that thing?

I proposed to turn mes into a bytecode interpreter. This alone might sound like overkill, but of all the other options I considered it sounded like the simplest, and there is evidence that it could possibly work: Guile and Chibi-Scheme are bytecode compiler-interpreters, and the benchmarks showed they can be very fast compared to GNU Mes’ Scheme interpreter.

Other possibilities, like the tree traversal memoization applied in older versions of Guile, didn’t feel like they would have such a profound impact on the evaluation process, and I felt they were going to be harder for me to implement.

But still, there’s a lot more we can obtain from the data.

When comparing mes-gcc and mes-m2, we could see that small tricks will not work if we want to improve the speed of the bootstrapping process. It’s not that hard to write some code in a different way so GCC does its magic and optimizes for a 10% improvement, but when it comes to M2-Planet we are alone: there’s no compiler optimization there to help us, and we must be smart and use optimizations that go deeper, those that happen in the CPU. Bytecode could give us some of that data locality we need.

We must remember the profiler was run on mes-gcc, but the information we obtained from it can be extrapolated to the mes-m2 case: it is the current implementation of mes that is inefficient by design, as it was designed to be simple, not fast. All the pushing and popping of Scheme values on that internal managed stack happens too often.

An evaluator that runs an already processed program is, potentially, way faster than one that has to traverse the tree over and over again while doing all this goto soup, which is also pretty hard to read. More specifically, a bytecode interpreter would reduce the execution time not only of those functions at the top of the profiler’s list, but also of many others related to them in the very long tail of function calls that I didn’t bother to paste in this post. This cascade effect could add up to a performance improvement that goes beyond the 50% of execution time we found in those 5 functions at the top.

Everything that can be processed ahead of time should be done (macro expansion, lambda hoisting, argument checking, compilation…) first and only once if possible, so the evaluator can focus on its main goal: running the code, as fast as possible.

What I didn’t mention, though, are all the functions that don’t look relevant in the profiling: they are just good. The garbage collector and allocator in mes simply work and go unnoticed in the profiling, with around 1% of execution time in the gc_copy function and a very fast allocation that only becomes noticeable when a huge number (hundreds of millions) of values is allocated.

When it comes to readability and auditability, one of the goals of the project, I personally consider the eval_apply function very hard to follow. There are still many tricks I don’t fully understand, and I have already spent several weeks reading and hacking on the interpreter, which isn’t that large after all. Rewriting it is a chance to make it easier to read.

Work done

Understanding the interpreter internals isn’t an immediate task, so in order to get used to it I decided to go for some low-hanging fruit first.

First, I moved some operations to builtins and even created some new ones. The performance gain I saw was simply zero, probably because the benchmark is not designed to detect small improvements and because the main performance bottleneck is in the eval_apply function.

The same thing happened when I updated M2-Planet and added switch statements to replace the chained ifs that were there because previous versions of M2-Planet didn’t support switch. The change affected mes-gcc’s performance slightly, but not enough to consider it a big win; mes-m2 remained mostly the same, as it doesn’t apply any kind of optimization to switch statements. I still consider this change important for readability, so much so that it helped me detect a bug in the string reader (the newline escaping was broken), which I also fixed.

The eval_apply function has been a puzzle to me for a long time, to the point that I started to look for any possibility to make it shorter, simpler, or anything that could help me understand it. In that process I found some very specific cases in that goto soup that mentioned pmatch. These were the result of a hack that avoided a hygiene issue in the pmatch macro. I changed the variable names in the macro and removed those special cases, slightly reducing the complexity of the eval_apply function and bringing attention to the lack of hygiene in our macro expander, which, by the way, is also intertwined with the eval_apply function.

Well, and I wrote the benchmark tool, which happens to be very useful as a test for all these small changes, too.

You can find all the changes in the faster branch of the following repo5:

http://github.com/ekaitz-zarraga/mes

Next steps

None of the work done really tackles the root issue but hacking around is important to familiarize yourself with the code, and I’m the kind of person that requires quite a bit of context before getting into action.

I think now I’m ready for the real deal: the evaluator swallows most of the time, and we need to change it. We just need a plan for it.

First I have to prepare for the big change. Removing the macro expansion from the famous eval_apply function and reducing it to a plain Scheme evaluator (renamed to eval_core) that can only work with the basic forms and the builtins would help me on two fronts: working on the macro expansion alone, opening the door for writing a fully hygienic macro expander from scratch; and reducing the eval_core function even more so it can be replaced later.

As I write these lines, I have just finished the removal. I can git show and see all the macro-expansion-related functionality in red.

Now I can get to work. I have to write a macro expander. I have yet to decide whether I should write it first in Scheme (without using macros) and translate it later, or write it directly in C. In any case, the macro expansion step will be separated from the evaluation.

The separation is not absolute, because the macro expander would make use of the brand new eval_core function to run the macros, but at least we would have the two concerns clearly split in the code. Consider this step 0.

With the split done, it would be time to start thinking about compilation and evaluation. At first I didn’t know very well what to do, because if the evaluator works on bytecode, how would I expand the macros?

I gave this too much thought, but it wasn’t that difficult, really. If the macro expander is not only a macro expander but also a compiler, the bodies of the macros can be compiled as they come. The macro expansion would still use the eval_core function, the same as before, but now it wouldn’t evaluate Scheme code, but bytecode.

Even more, we could compile all the Scheme code as we read it for macro expansion, so we could use helper functions from the current file in the macro expansion step, and we would traverse the tree only once.

Time for writing a macro expander, then.

Another thing I never did before. I’m sure it will be a lot of fun.


  1. Hypothetically speaking, because the work has already started. But ssssh… 

  2. See how the benchmark is useful? 

  3. Way before all the Scheme, compilers, interpreters and Guix, I wrote a post about it. 

  4. I’m quoting Janneke, the author of GNU Mes, who has a very pragmatic approach to programming. 

  5. What you cannot see is all the changes I tried to do and I regretted later.