Before starting our work on the project¹ we should, first, make sure that our hypothesis is true and GNU Mes could indeed be faster and, second, try to understand where we could improve the execution speed.
The benchmarking tool
In order to check how fast GNU Mes is, I created a very simple benchmarking tool. The tool just runs GNU Mes under `time` (the command, not the shell built-in) and creates a report in a bunch of text files. Of course, `time` is not a great way to benchmark, but my hypothesis was that `mes` (remember, the Scheme interpreter in GNU Mes) could be way faster than it is, so, if that's true, `time` should be more than enough for the comparison.
This is done in the `simple.make` makefile, which has been revived to serve this development period. The `simple.make` makefile is important because normally GNU Mes is built from scratch, including the meslibc (remember, the C standard library in GNU Mes), which takes a long time. The `simple.make` file builds just `mes`, the interpreter, with GCC or M2-Planet, so the iteration time during development is acceptable, meaning it's almost instantaneous.
So, in summary, this:

```
make -f simple.make benchmark
```

would run the benchmark for the default `mes`, the one built by GCC (which is, of course, built first if needed). But we also have `benchmark-m2`, for the `mes` built by M2-Planet.
The benchmark at this very moment runs three cases:

- `fib`: A simple Scheme program that finds the 35th element of the Fibonacci sequence.
- `mescc-hello`: A `mescc` call that compiles a simple hello-world program written in C.
- `mescc-mes`: A `mescc` call that compiles `mes`'s source code.
These three cases cover very basic operations but let us know if our changes are relevant to `mes`'s main job: compiling some C code. They also give us some interesting feedback about the evaluation speed in the `fib` check, especially for a very recursive program. The `mescc-hello` case is a minimal test that shows us the startup speed of the `mescc` compiler.
However, that alone doesn't give us a lot of information, because we don't have anything to compare with. The Scheme interpreter in GNU Mes is compatible with Guile, meaning it could be replaced with Guile at any moment (though maybe not the other way around). There we have our great candidate for comparison.
The data
The following table shows the wall-time results of all the benchmarks, written in `Average ±StdDev` form. Each benchmark is run 10 times on my laptop (i7-10510U).
| | mes-gcc | mes-m2 | guile | guile† | guile†‡ |
|---|---|---|---|---|---|
| Fib | 21.39 ±0.66 | 49.13 ±0.11 | 0.27 ±0.00 | 3.57 ±0.27 | 8.67 ±0.38 |
| Mescc-hello | 1.38 ±0.04 | 2.69 ±0.01 | 0.41 ±0.01 | 0.45 ±0.03 | 1.03 ±0.04 |
| Mescc-mes | 103.39 ±3.03 | 214.88 ±0.34 | 17.03 ±0.30 | 18.70 ±0.95 | 27.34 ±1.09 |
- † With `GUILE_AUTO_COMPILE=0`
- ‡ With `GUILE_JIT_THRESHOLD=-1`
From left to right: mes-gcc means `mes` built with default GCC (including optimizations), mes-m2 means `mes` built with M2-Planet, and guile means running `guile` (v3.0.9) instead of `mes`.
The first conclusion is pretty obvious:
Guile is way faster than Mes when compiling C code, even with all the optimizations disabled.
Unfair!
But that conclusion may look unfair to most of you, especially because Guile is a pretty large and powerful Scheme implementation, while Mes is a simple Scheme interpreter designed to be as minimal as possible.
Chibi-Scheme is a very interesting small Scheme interpreter that happens to be somewhat slow compared with other major Scheme implementations. In terms of lines of code, `chibi-scheme` is around twice the size of `mes`, which is still a manageable size. Chibi-Scheme is not compatible with Mes the way Guile is, so we can only run the `fib` benchmark with it. The following table shows a comparison between the best Mes scenario and Chibi-Scheme.
| | Mes | Chibi-Scheme |
|---|---|---|
| Lines of Code† | 5838 | 13135 |
| Fib‡ (s) | 21.39 ±0.66 | 1.77 ±0.03 |
- † Only counts the lines in `.c` files
- ‡ Both built with GCC
That’s quite an improvement for the size!
Conclusion from the benchmark
Before drawing any conclusions, we should make sure our assumption is true and the differences are large enough to be captured by the `time` command. Looking at the orders of magnitude we are dealing with here, I would say the data we obtained is a valid reference point, as the differences are large enough to be properly captured by `time`, and our goal is a noticeable performance improvement.
From there, we understand the obvious: Guile is way faster than Mes, even with compilation disabled, and by a huge difference. Chibi-Scheme is also way faster, but not as fast as Guile with all optimizations enabled.
When it comes to the compiler used to build Mes, `mes` compiled with M2-Planet is about twice as slow as with GCC. This is a very relevant detail to keep in mind during development. GCC's optimizations account for a huge performance improvement in `mes`, and we cannot rely on them during the bootstrapping process.
Profiling
Still, we are far from justifying a full rewrite of the evaluator, and even further from rewriting `mes` as a bytecode interpreter, as proposed in the project presentation. That's why I decided to run a profiler, again the simplest one possible, so I could get some insights from the execution of the program.
I ran `gprof` on `mes` for both the `fib` and `mescc-mes` cases² and I summarize the output of the flat profile below, right after a little reminder of what all the columns in the table mean.

Just to clarify, `gprof` requires the program to be built by GCC with the `-pg` flag; we don't have profilers for the program built with M2-Planet, but I think the insights obtained here can be extrapolated to `mes-m2` in most cases. You'll see why in a minute.
Legend
- `% time`: the percentage of the total running time of the program used by this function.
- `cumulative seconds`: a running sum of the number of seconds accounted for by this function and those listed above it.
- `self seconds`: the number of seconds accounted for by this function alone. This is the major sort for this listing.
- `calls`: the number of times this function was invoked, if this function is profiled, else blank.
- `self ms/call`: the average number of milliseconds spent in this function per call, if this function is profiled, else blank.
- `total ms/call`: the average number of milliseconds spent in this function and its descendents per call, if this function is profiled, else blank.
- `name`: the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed.
Gprof flat-profile for fib
%time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
20.06 | 4.01 | 4.01 | 37 | 108.38 | 416.96 | eval_apply |
11.26 | 6.26 | 2.25 | 600202232 | 0.00 | 0.00 | gc_push_frame |
7.10 | 7.68 | 1.42 | 600202232 | 0.00 | 0.00 | gc_peek_frame |
6.83 | 9.05 | 1.37 | 481504722 | 0.00 | 0.00 | make_cell |
6.03 | 10.25 | 1.21 | 600200793 | 0.00 | 0.00 | push_cc |
5.48 | 11.35 | 1.10 | 300124024 | 0.00 | 0.00 | struct_ref_ |
4.70 | 12.29 | 0.94 | | | | _init |
3.70 | 13.03 | 0.74 | 269868352 | 0.00 | 0.00 | make_value_cell |
2.35 | 13.50 | 0.47 | 600202232 | 0.00 | 0.00 | gc_pop_frame |
2.15 | 13.93 | 0.43 | 406099172 | 0.00 | 0.00 | cons |
2.15 | 14.36 | 0.43 | 74983231 | 0.00 | 0.00 | apply_builtin |
2.00 | 14.76 | 0.40 | 300279513 | 0.00 | 0.00 | cell_ref |
1.95 | 15.15 | 0.39 | 300124024 | 0.00 | 0.00 | assert_struct |
1.85 | 15.52 | 0.37 | 135158343 | 0.00 | 0.00 | length__ |
1.60 | 15.84 | 0.32 | 29862855 | 0.00 | 0.00 | minus |
1.38 | 16.11 | 0.28 | 300124024 | 0.00 | 0.00 | assert_range |
1.38 | 16.39 | 0.28 | 75315259 | 0.00 | 0.00 | assq |
1.18 | 16.62 | 0.24 | 75280752 | 0.00 | 0.00 | lookup_binding |
1.08 | 16.84 | 0.22 | 75280689 | 0.00 | 0.00 | make_binding_ |
1.05 | 17.05 | 0.21 | 269866060 | 0.00 | 0.00 | make_number |
1.05 | 17.26 | 0.21 | 173 | 1.21 | 1.21 | macro_set_x |
1.00 | 17.46 | 0.20 | 135178741 | 0.00 | 0.00 | gc_check |
0.90 | 17.64 | 0.18 | 30093052 | 0.00 | 0.00 | pairlis |
0.85 | 17.81 | 0.17 | 105069896 | 0.00 | 0.00 | check_formals |
0.83 | 17.97 | 0.17 | 29861090 | 0.00 | 0.00 | less_p |
… | … | … | … | … | … | … |
0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | open_boot |
0.00 | 19.99 | 0.00 | 1 | 0.00 | 3.09 | read_boot |
0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | reader_read_octal |
0.00 | 19.99 | 0.00 | 1 | 0.00 | 0.00 | try_open_boot |
Gprof flat-profile for mescc-mes
%time | cumulative seconds | self seconds | calls | self s/call | total s/call | name |
---|---|---|---|---|---|---|
31.64 | 58.01 | 58.01 | 81 | 0.72 | 1.91 | eval_apply |
7.53 | 71.82 | 13.81 | 428598051 | 0.00 | 0.00 | assq |
6.11 | 83.03 | 11.21 | 2789709236 | 0.00 | 0.00 | make_cell |
5.96 | 93.95 | 10.92 | 2109168010 | 0.00 | 0.00 | gc_push_frame |
4.88 | 102.89 | 8.94 | 607787117 | 0.00 | 0.00 | length__ |
4.11 | 110.43 | 7.53 | 2109168010 | 0.00 | 0.00 | gc_peek_frame |
3.61 | 117.04 | 6.61 | 891765473 | 0.00 | 0.00 | struct_ref_ |
3.39 | 123.25 | 6.22 | 2109164356 | 0.00 | 0.00 | push_cc |
3.39 | 129.46 | 6.21 | | | | _init |
2.27 | 133.63 | 4.17 | 1742421 | 0.00 | 0.00 | assoc_string |
1.96 | 137.22 | 3.59 | 2235192336 | 0.00 | 0.00 | cons |
1.79 | 140.50 | 3.28 | 215232553 | 0.00 | 0.00 | apply_builtin |
1.54 | 143.33 | 2.83 | 679409558 | 0.00 | 0.00 | make_value_cell |
1.49 | 146.06 | 2.73 | 200300883 | 0.00 | 0.00 | pairlis |
1.18 | 148.22 | 2.16 | 2109168010 | 0.00 | 0.00 | gc_pop_frame |
1.13 | 150.28 | 2.07 | 925236148 | 0.00 | 0.00 | cell_ref |
1.10 | 152.30 | 2.02 | 196474232 | 0.00 | 0.00 | gc_copy |
1.07 | 154.26 | 1.96 | 419658520 | 0.00 | 0.00 | lookup_value |
1.04 | 156.17 | 1.91 | 411189650 | 0.00 | 0.00 | check_formals |
1.02 | 158.03 | 1.87 | 423486766 | 0.00 | 0.00 | lookup_binding |
0.98 | 159.82 | 1.79 | 183 | 0.01 | 0.01 | gc_cellcpy |
0.91 | 161.50 | 1.68 | 891765473 | 0.00 | 0.00 | assert_struct |
0.86 | 163.08 | 1.58 | 423486608 | 0.00 | 0.00 | make_binding_ |
0.78 | 164.51 | 1.43 | 183 | 0.01 | 0.03 | gc_loop |
0.78 | 165.94 | 1.43 | 635403488 | 0.00 | 0.00 | gc_check |
… | … | … | … | … | … | … |
0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | open_boot |
0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | open_output_file |
0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | read_boot |
0.00 | 183.37 | 0.00 | 1 | 0.00 | 0.00 | reader_read_octal |
The profiling data in context
The functions at the top are mostly the same in both cases: `eval_apply`, `gc_push_frame`, `push_cc`, `gc_peek_frame` and `make_cell`. They represent around 50% of the execution time (51.21% in `fib` and 51.28% in `mescc-mes`) and they are closely related.
Before trying to explain why, we could spend a minute checking how they eat that amount of time. `eval_apply` is executed only a few times but takes a long time to run, while the rest of them are very fast functions (0.00 ms!) that are executed millions of times per second.
The reason is that `eval_apply` is a long-running loop that is calling them. It's calling them all the time.
Deep in `eval_apply`
As I already explained in earlier episodes, `mes` is a tree-walking interpreter, but that doesn't say much about how things work.
The current design of `mes` is very interesting. I didn't ask the author about it, but it feels like it was written in Scheme first and translated to C later. I say this because the `eval_apply` function, the core of the program, the one that expands the macros, traverses the tree, and does all the `eval`ing and the `apply`ing, is written as it would be written in Scheme: abusing tail-call optimization³ with mutually recursive functions.
That’s pretty understandable, if you are writing a Scheme interpreter you’d like to write Scheme, so you Schemeify your C as much as you can. Also, it happens that Scheme can be very elegantly described in Scheme, as the GNU Mes manual mentions, and most of the literature about Scheme interpreters uses Scheme. This tradition goes back to the SICP, and even further to the invention (discovery?) of LISP itself.
But C is not very compatible with that programming style. The C stack would overflow due to the lack of tail-call optimization.
GNU Mes overcomes that limitation by manually manipulating a stack, allocated by its own memory manager, and emulating function calls with `goto`. Arguments are passed to these "functions" through some global variables that are understood as "registers" in this "virtual machine".
Consider this piece of code:
```c
evlis:
  if (R1 == cell_nil)
    goto vm_return;
  if (R1->type != TPAIR)
    goto eval;
  push_cc (R1->car, R1, R0, cell_vm_evlis2);
  goto eval;
evlis2:
  push_cc (R2->cdr, R1, R0, cell_vm_evlis3);
  goto evlis;
evlis3:
  R1 = cons (R2, R1);
  goto vm_return;
```
The "registers" are those `R1`, `R0` and so on, and we don't really need to know what they mean to understand how this whole thing works.
Imagine we enter `evlis` with an `R1` that is a list: it's not `cell_nil` and its type is `TPAIR`. So we skip both `if` statements and reach the `push_cc`. This call pushes our current position in the code, `cell_vm_evlis2`, to the stack, and then we jump to the `eval` "function". `eval` eventually ends with a `goto vm_return`, which pops the current frame and jumps to the top of the `eval_apply` function.
```c
vm_return:
  x = R1;          // This saves the return value of the "call"
  gc_pop_frame ();
  R1 = x;          // and puts it back, after it has been overwritten by the pop
  goto eval_apply;
```
And what do we find there? A huge `if` cascade that brings us back to the label where we left off:
```c
eval_apply:
  if (R3 == cell_vm_evlis2)
    goto evlis2;        // Bingo!
  else if (R3 == cell_vm_evlis3)
    goto evlis3;
  else if (R3 == cell_vm_eval_check_func)
    goto eval_check_func;
  else if (R3 == cell_vm_eval2)
    goto eval2;
  // There are quite a few more below.
```
Didn’t we just emulate a “function call”?
I'm sure you can continue the execution from `evlis2` by yourselves, just with the code I pasted here, and understand how a list is evaluated in `mes`.
It might feel crazy, but that's because during the explanation the goal of this approach was left aside for a minute. I didn't mention that the `eval` label we jumped to might itself jump to many other places, including `apply`, which would jump to `eval` again, which, as we mentioned before, would blow up our stack if we were doing this with C functions. Doing it this (weird) way lets you control the stack and manually apply the tail-call optimization, i.e. just use `goto` without the `push_cc`.
This whole thing loops all the time, doing `goto eval_apply` over and over again, and that's why the actual `eval_apply` function takes so long to run. It's the main loop of the interpreter.
Now we are in a great position to understand why those other functions were called that many times. `push_cc` is the easiest one, as it's what prepares a call to one of those internal "functions" that require pushing to the stack. `gc_push_frame` is called by `push_cc` and is the function that actually does the pushing. `gc_peek_frame` is also called many times, though the listings I shared don't show why: `gc_peek_frame` is called by `gc_pop_frame`, which is exactly what the `vm_return` block does.
The `make_cell` case is slightly different but also related. It's called mostly from `cons`, and its sibling `make_value_cell` is also called many times. These are the functions that allocate Scheme values. Of course, we cannot simply eliminate the calls to these functions, because a Scheme interpreter always needs to create values, but my bet is that these functions, especially `cons`, are called more often than strictly needed because of the evaluation strategy I just described.
Conclusions
There are many conclusions to draw from all this data and the design of the evaluator.
The first and most obvious one is that the project proposal makes sense: we could make `mes` way faster, as Guile is, while keeping the interpreter small, as Chibi-Scheme shows. Of course, I don't expect to obtain a 10x performance improvement in the first iteration, but I believe we can do a lot "starting with the simplest thing that could possibly work⁴" if we keep a non-pessimization philosophy.
The question now is, what is that thing?
I proposed to turn `mes` into a bytecode interpreter. This alone might sound like overkill, but of all the other options I considered it sounded like the simplest, and there is evidence that it could possibly work: Guile and Chibi-Scheme are bytecode compiler-interpreters, and thanks to the benchmarks we saw they can be very fast compared to GNU Mes' Scheme interpreter.
Other possibilities, like the tree-traversal memoization applied in older versions of Guile, didn't feel like they would have such a profound impact on the evaluation process, and I felt they were going to be harder for me to implement.
But still, there’s a lot more we can obtain from the data.
When comparing `mes-gcc` and `mes-m2`, we could see that small tricks will not work if we want to improve the speed of the bootstrapping process. It's not that hard to write some code in a different way so GCC does its magic and optimizes for a 10% improvement, but when it comes to M2-Planet we are alone: there's no compiler optimization to help us, and we must be smart and use optimizations that go deeper, those that happen in the CPU. Bytecode could give us some of the data locality we need.
We must remember the profiler was run on `mes-gcc`, but the information we obtained can be extrapolated to the `mes-m2` case: it is the current implementation of `mes` that is inefficient by design, as it was designed to be simple, not fast. All the `push`ing and `pop`ing of Scheme values to that internal managed stack happens too often.
An evaluator that is able to run an already processed program is potentially way faster than one that has to traverse the tree over and over again while it does all this `goto` soup, which is also pretty hard to read. More specifically, a bytecode interpreter would not only reduce the execution time of those functions at the top of the profiler's list, but also of many others related to them in the very long tail of function calls that I didn't bother to paste in this post. This cascade effect could add up to a performance improvement far beyond the 50% of execution time we found in those five functions at the top.
Everything that can be processed ahead of time (macro expansion, lambda hoisting, argument checking, compilation…) should be done first, and only once if possible, so the evaluator can focus on its main goal: running the code as fast as possible.
What I didn't mention, though, are all the functions that don't look relevant in the profiling: they are simply good. The garbage collector and allocator in `mes` just work and go unnoticed in the profiling, with about 1% of execution time in the `gc_copy` function and a very fast allocation that only becomes noticeable when a huge number (hundreds of millions) of values are allocated.
When it comes to readability and auditability, one of the goals of the project, I personally find the `eval_apply` function very hard to follow. There are still many tricks I don't fully understand, and I have already spent several weeks reading and hacking on the interpreter, which isn't that large after all. Rewriting it is a chance to make it easier to read.
Work done
Understanding the interpreter internals isn’t an immediate task, so in order to get used to it I decided to go for some low-hanging fruit first.
First, I moved some operations to builtins and even created some new ones. The performance gain I saw was simply zero, probably because the benchmark is not designed to detect small improvements and because the main performance bottleneck is in the `eval_apply` function.
The same thing happened when I updated M2-Planet and added `switch` statements to replace chained `if`s that were there because previous versions of M2-Planet didn't support `switch` statements. The change affected `mes-gcc`'s performance slightly, but not enough to consider it a big win. `mes-m2` remained mostly the same, as M2-Planet doesn't apply any kind of optimization to `switch` statements. I still consider this change important for readability, so much so that it helped me detect a bug in the string reader (the newline escaping was broken), which I also fixed.
The `eval_apply` function has been a puzzle to me for a long time, to the point that I started to look for any possibility to make it shorter, simpler, or anything that could help me understand it. In that process I found some very specific cases in that `goto` soup that mentioned `pmatch`. These were the result of a hack that avoided a hygiene issue in the `pmatch` macro. I changed the variable names in the macro and removed those special cases, slightly reducing the complexity of the `eval_apply` function and bringing attention to the lack of hygiene in our macro expander, which, by the way, is also intertwined in the `eval_apply` function.
Well, and I wrote the benchmark tool, which happens to be very useful as a test for all these small changes, too.
You can find all the changes in the `faster` branch of the following repo⁵:

http://github.com/ekaitz-zarraga/mes
Next steps
None of the work done really tackles the root issue, but hacking around is important to familiarize yourself with the code, and I'm the kind of person that requires quite a bit of context before getting into action.
I think now I’m ready for the real deal: the evaluator swallows most of the time, and we need to change it. We just need a plan for it.
First I have to prepare for the big change. Removing the macro expansion from the famous `eval_apply` function and reducing it to a plain Scheme evaluator (renamed to `eval_core`) that can only work with the basic forms and the builtins would help me on two fronts: working on the macro expansion alone, opening the door to writing a fully hygienic macro expander from scratch; and reducing the `eval_core` function even more so it can be replaced later.
Currently, as I write these lines, I have just finished the removal. I can `git show` and see all the macro-expansion-related functionality in red.
Now I can get to work. I have to write a macro expander. I'm yet to decide if I should write it first in Scheme (without macros) and translate it later or if I should write it directly in C. In any case, the macro expansion step would be separated from the evaluation.
The separation is not absolute, because the macro expander makes use of the brand new `eval_core` function to run the macros, but at least we would have the two concerns clearly split in the code. Consider this step 0.
With the split, it would be time to start thinking about the compilation and evaluation. At first I didn't know very well what to do, because if the evaluator works on bytecode, how would I expand the macros?
I gave this too much thought, but it wasn't that difficult, really. If the macro expander is not only a macro expander but also a compiler, the bodies of the macros can be compiled as they come. The macro expansion would still work through the `eval_core` function, the same as it did, but now it wouldn't `eval` Scheme code, but bytecode.
Even more, we could compile all the Scheme code as we read it for macro expansion, so we could use helper functions from the current file in the macro expansion step, and we would traverse the tree only once.
Time for writing a macro expander, then.
Another thing I never did before. I’m sure it will be a lot of fun.
1. Hypothetically speaking, because the work has already started. But ssssh… ↩
2. See how the benchmark is useful? ↩
3. Way before all the Scheme, compilers, interpreters and Guix, I wrote a post about it. ↩
4. I'm quoting Janneke, the author of GNU Mes, who has a very pragmatic approach to programming. ↩
5. What you cannot see is all the changes I tried and regretted later. ↩