Stage0 is a crazy project that is pretty well aligned with our vision of trust, bootstrappable software and whatnot.
During the last two weeks we have been working on the port of Stage0 to RISC-V, providing the very first step of the process, so we came here to talk about it, including a fantastic software necromancy moment you are going to enjoy.
The origin of the times
Once upon a time, software was written in machine code. Directly expressing the machine instructions by the hands of the programmers. That was long, long, time ago.
One day some programmer decided to write a translator that mapped that machine code to something more human readable and created what we call assembly language today. It gained popularity and programmers decided to add more and more functionalities to the assembly language until the point that what they created was not a one-to-one mapping with machine code anymore.
That’s how the first programming languages were born.
Their power was so immense that programmers decided to rewrite all their tools using the new programming languages, they even wrote newer programming languages with them.
But power corrupts the mind of the fool. Blinded by the power of programming languages, most of the programmers forgot the origin of the times, and forgetting the history is always a mistake.
Epic music starts…
The problem
Warning: I oversimplified during the beginning of the post, but now… Oh boy! I’m going to flatten this shit.
Well, we need auditable software. I don’t think anyone can deny that fact.
But what does “auditable software” mean? Isn’t free software enough?
It is true that the best way to audit stuff is to read the code of the programs. It’s the classic way we had to know if a program is doing what we want it to. But, how can you be really sure the code you are reading is the one that is shipping with your program?
You can’t! In general it’s impossible to know. There are many reasons, but I will oversimplify and give you just some thoughts and let the people from bootstrappable do the dirty job.
-
The compilation process is not reproducible, so the same source can result in different binaries. You can’t just compare different binaries to make sure your compiler is compiled correctly.
-
We have no way to solve the chicken-egg problem. The recipe to build the compiler version X is to get the sources of the compiler version X and compile them with the compiler version X-1. But how do you get the compiler version X-1? Rinse and repeat.
Also… Where’s the first version of your compiler? Does it run in modern machines with modern operating systems? -
As there’s no real way to get your compilers compiled by yourself, there’s no real way to be sure that the compilers are emitting the code they are supposed to. You have to trust them, and you can’t audit what you need to trust.
The first point is where projects like Nix and Guix have sense: they try to create reproducible stuff. Not only for compilation processes but also for scientific studies (that have to be reproduced by other people, because… that’s how science works, isn’t it?) and other things. Being able to create identical environments where you can ensure that their (compilation) output is going to be identical is extremely important, but I’ll leave that for now.
The second and third point are two different problems but they result in the same: software distributions ship huge binary blobs (hundreds of megabytes) as an starting point (bash, gcc…) so the users have no chance to check if those binaries are corrupt.
GNU Mes is a Scheme interpreter and C compiler that was designed to reduce the size of the binaries you need to ship with your distro. Mes has successfully reduced the size of the binaries that need to be shipped with distros like Guix, but the project is more ambitious that that.
Full-source bootstrap
Stage-0 is a project that is tackling the same “trusting trust” problem but from the opposite perspective, starting in the low-level, rather than the high-level approach that GNU Mes uses.
Both projects work together to provide a greater goal: the full-source bootstrap. The whole bootstrap is started from source, with no binaries involved, so the distros don’t need to ship binaries anymore.
But how is that?
Hex0
Stage0 starts in the low level, the lowest possible, and builds more complex programs from there, step by step.
The first step, Hex0, is a self-hosting “assembler”. I quote the word assembler there because I think it’s a very strong word for this: it’s an ELF file written in hexadecimal, with extra comments.
Hex0 is able to compile itself to a binary ELF file, converting the Hexadecimal values to the binary values and stripping the comments.
We still have to compile the first Hex0 with something but that’s not as difficult as compiling the first GCC because basically Hex0 can be compiled by very simple programs or even by hand, because it contains literally what is going to be written in the final ELF.
The comments on the Hex0 files describe the instructions on each of the lines of the ELF file so the resulting files can be audited, instruction by instruction, with the manual of the ISA as a reference.
This starting point is more than enough to build on top of. We just need to add more functionalities to the next steps: labels, constants… Until we are able to compile a simple C compiler, Mes or anything.
It’s a clever solution for a crazy problem.
Hex0 in RISC-V: my experience
So this is my blog and I come here to talk about myself!1
Some weeks ago I had the chance to make that first step, Hex0, for RISC-V (64 bit). You can take a look to the code here.
There you can see I added three files: the assembly file, the hex0 file and the binary of the compiled hex0. They are basically the same thing, but they are included for readability.
The assembly
The first step for me was to write the assembly file. It’s easy once you know how to make system calls in POSIX.
POSIX system calls in RISC-V are pretty easy:
- Load the arguments for the call in registers
a0
,a1
… - Load the syscall number in
a7
- Run
ecall
The result of the system call comes in a0
.
Input arguments are also important, because we need to be able to tell to Hex0 which is the file we want to compile, and where to put its output.
That’s pretty easy, input arguments are inserted in the stack, so we can load them by pop-ing them. As in any C program, the first element we get is the amount of arguments and the rest of them are the arguments themselves.
Putting all together, if you take a look to the first block of the program:
_start:
ld a0, 0(sp) # Get number of the args
ld a1, 8(sp) # Get program name
ld a2, 16(sp) # Input file name
...
# Open input file and store FD in s2
li a7, 56 # sys_openat
li a0, -100 # AT_FDCWD
mv a1, a2 # input file
li a2, 0 # read only
ecall
mv s2, a0 # Save fd in for later
In the first chunk we are reading the stack, element by element, and in the second we are opening the input file, using the filename we just obtained from the stack.
Simple.
As a note, I’d like to remind you to finish your program, because if you don’t it will continue to execute the memory after it and it’ll explode in your face:
terminate:
# Terminate program with 0 return code
li a7, 93 # sys_exit
li a0, 0 # Return code 0
ecall
This tells the OS to finish the execution.
The internals of the assembly file are simple, I won’t explain them in detail. It basically iterates character by character, removing the comments and converting the hex value couples to a byte.
Read it and tell me if you need help understanding it!2
The conversion to the Hex0
Hex0, as I said, is an ELF file, written in hexadecimal, so we need to compile our assembly file to binary and represent each of the instructions in hexadecimal. And we need to resolve all the labels to final addresses.
There’s no easy way to do it. I started doing it by hand, reading the RISC-V spec and converting the instructions one by one. But I tackled several difficulties doing that.
Pseudoinstructions are expanded to more than one instruction so we need to be careful in the comments and explain that correctly. Also, we need to resolve the addresses accordingly. For example:
la a1, buffer
This is a pseudoinstruction, and we need to resolve it to:
auipc a1, 0
addi a1, a1, $OFFSET
Where $OFFSET
is the offset from that instruction to the label buffer
.
These kind of expansions change our perception of the amount of instructions we have and we have to be extremely careful. I didn’t even mention the case where the offset is very large! That’s another story (thankfully we don’t have to deal with yet).
Once the pseudoinstructions are expanded we need to convert them to the hex value, and I swear it’s the most boring task I ever made in my life. Basically because the RISC-V instructions are not easy to map to their binary (for reasons related with the hardware implementation).
The eagles are coming!
But I had a trick, a deus ex machina that would safe my life. During the last months I’ve been randomly working on a Scheme compiler for RISC-V assembly and that made me start making a RISC-V assembly interpreter and compiler in python. It’s still an early WIP, and was almost abandoned, but it has the basic machinery that lets me compile simple instructions to hex.
With this dirty glue code I was able to compile the instructions one by one:
from registers.RV32I import *
from InstructionSets.RV64I import *
Regs = RegistersRV32I()
def x(registerName):
return Regs.getPos(registerName)
def compile(instruction):
hexrepr = hex(instruction.compile().value)
hexval = hexrepr[2:]
if len(hexval) < 8:
hexval = "0" * (8 - len(hexval)) + hexval
final = ""
for i in range(0,8,2):
final += hexval[i:i+2] + " "
final = final.rstrip().upper()
return " ".join(reversed(final.split(" ")))
I just needed to open a python shell and write something like:
compile( addi(x("a0"), x("a1"), 12) )
And that would compile that instruction for me, giving me the output in a beautiful hexadecimal format.
13 85 C5 00
Not the best UX but usable enough for a small file like this.
The addresses
The addresses are still something to solve.
I’m an idiot so I counted the instructions by hand and then realized I had to expand some pseudoinstructions I forgot, so all the branch instructions were broken. Yes, I’m like that.
Try to be smarter than I am. Use this trick:
Leave all the instructions that use addresses set to a wrong address, like 0 or something, until you converted the whole file. Once you have that, resolve the addresses. That way you’ll make sure every pseudoinstruction is expanded and you’ll be able to use tools that will help you to choose addresses correctly.
The trick I used was to add the ELF header, compile the file and then inspect the resulting binary.
For the compilation there are two choices: we can use the assembly file we wrote previously simply assembling it to a binary and use it as the hex0 assembler; or we can use the high level C prototype that Stage0 provides. In any case, they have to give the same result.
I still don’t know why objdump
is unable to process the binaries of the hex0
files but GDB is able to do it so… Launch a GDB as I explained in the
previous post and disassemble the
whole file3.
It’ll look like this:
0x0000000000600078: ld a0,0(sp)
0x000000000060007c: ld a1,8(sp)
0x0000000000600080: ld a2,16(sp)
0x0000000000600084: li s4,0
0x0000000000600088: li s5,0
0x000000000060008c: li a7,56
0x0000000000600090: li a0,-100
0x0000000000600094: mv a1,a2
0x0000000000600098: li a2,0
0x000000000060009c: ecall
...
With that dissasembled result, we can literally make some math with the addresses and fix all the instructions. We just need to substract the target address from the current address in the branches, so we get the offset.
NOTE: Be careful with the
la
pseudoinstruction (auipc + addi
). The base address here is the one of theauipc
, not the one of theaddi
.
The ELF header
If we don’t know how to make the ELF header, we can’t make the previous step, so better if we mention something about it.
Other files of the project are hex0 files too, so they also have a heavily commented ELF header we can use as a reference. Also, wikipedia has a great explanation of it.
The main point we need to change is the e_machine
field. We need to set it to
0xF3
, indicating RISC-V. Also we need to make sure the flag of 64 bit system
has to be set for RV64 and remember to check the endianness.
NOTE: Big Endian it’s the most natural way to write the file by hand. If you want to go for Little Endian this might get weird to write. The python script above uses Little Endian, watch the
reversed
call on it.
Debugging
Once you have everything ready you need to make sure it’s doing what it’s supposed to.
My first working program was failing to use an output file. Someone in the
#bootstrappable
IRC channel (I’m sorry, I can’t remember who was) told me to
strace
the program to see what was going on, and with that and some debugging
with GDB’s layout asm
, I was able to figure out one instruction was using a
wrong register.
These tools are important because the all the process is done by hand so there are many chances to screw up somewhere.
strace
is extremely handy in this specific program because most of the
functionality it has is based on system calls. If you clean the output
correctly you can see everything the program does accurately.
If you want to encourage my free software work you can support me on Liberapay.
Final thoughts
This contribution have been a lot of fun. It let me understand a little bit more about the ecosystem around the full-source bootstrap, which is kinda complex and includes some other stuff I didn’t even mention.
I learned a lot from this, now I have a deeper understanding of the instruction formats on RISC-V and I learned some cool GDB tricks that are always useful.
Stage0 has a really interesting approach for auditability that’s worth thinking about. They build everything from a commented binary file (that’s basically what hex0 is) that acts like a seed, so we can audit everything, including the very first step. The solution of having the contents of the ELF file directly written in hexadecimal is enough to ensure we can certify the contents are what we expect, and having every instruction commented with its assembly counterpart gives us the chance to go to the ISA and check if that’s actually what it is supposed to be. Perfectly auditable.
Using this first step as a building block for anything else ensures that we never need to rely on a binary file we can’t know where does it come from.
Really interesting stuff.
Now we have this very first step ported to an open instruction set, what also opens the door to the auditability from the hardware perspective. Now we can start thinking about having an auditable software stack in a device we designed ourselves, so we can audit it too. This is huge.
Now, we need to keep pushing in this direction, porting all the rest of the steps of Stage0, Mes and many other projects, if we want to reach a full RISC-V support. This is just one small step in that direction.
Hey! I almost forgot! And thanks to this I had the chance to work a little bit more on my assembly interpreter, and recover it from the darkness. That’s also great. Isn’t it?
Well, so we learned some things today, but the most important is that all the stupid things we do, all the random projects we work on, all the experiences we have in life are not just a waste of time. They may appear to be useful in the future, but you don’t know when…
What is sure here is if you don’t stay creative and active, you’ll never have any experience to learn from.
-
Not really, in fact, I use my experience as a vehicle to introduce you to great projects and interesting pieces of knowledge. But ssssh, don’t tell anyone. ↩
-
Contact me, seriously. I have my contact info here ↩
-
If you want to know where to start to disassemble, you can just ask
where
to GDB. ↩