Ekaitz's tech blog:
I make stuff at ElenQ Technology and I talk about it

Lessons learned on machine code generation

Machine code generation sounded like a weird magic to me half a year ago, I swear, but now it doesn’t look so disturbingly complicated. Nothing in computer science is that complicated, after all.

  1. Basics
    1. Machine code is numbers
      1. Demonstration
    2. Calling convention
    3. Memory protections
    4. Just-in-Time Compilation
      1. Example: Lightening: Guile’s machine code generation library
  2. Lessons learned
    1. Problem: Large immediates
      1. Solution: multiple instruction expansion
      2. Solution: constant insertion
    2. Problem: Unknown addresses and offsets
      1. Solution: relocations
        1. Example: C compilers
        2. Example: Lightening
    3. Problem: Long jumps
      1. Solution: always insert the largest jump possible
        1. Optimization: pointer relaxation
        2. Example: relaxed global variable access in C compilers
      2. Solution: Veneers
        1. Example: Lightening’s veneer system
    4. Problem: Register Access
      1. Solution: use the stack
      2. Solution: controlled register access
  3. Final thoughts

Basics

There are many contexts where you may need to generate machine code. If you are writing a compiler, an assembler, a jit compiler… In the last months I’ve been working on Lightening, a machine code generation library that powers Guile’s JIT Compilation, a RISC-V assembler and interpreter and Hex0, which was introduced in the previous post, where I needed to assemble a file by hand.

All of those cases result in the same thing, even if they have different conditions: we are generating machine code.

In this post I’ll try to talk about some issues that are generic and apply to all the cases and others that are more specific to some of the projects I mention.

But first we need to clarify some stuff just in case.

Machine code is numbers

Machine code is what people from the electronics world call “code”.

I know you know it, but let’s refresh some things about computing we may have forgotten thanks to all the efforts that hide the complexity of our everyday business.

Machine code instructions are basically blocks of bits your processor is reading and interpreting. Those bit blocks encode all the information the processor needs: the identifier of the instruction and its arguments.

The identifier is normally known as opcode. The arguments can have many different meanings, depending on the instruction so we are not getting into that. The instructions normally alter the values of registers, so they need to have identifiers for the source and destination registers, or literal values that are introduced literally inside of the instruction (they are called immediates).

Let’s put a simple RISC-V example here. Consider this assembly instruction:

addi a0, zero, 56

This thing you interpret as some assembly instruction that adds 56 to the zero register and stores the result in the a0 register, has to be encoded in a way that the machine is able to understand. Better said, it is encoded in a way that you can understand! The real instruction is a bunch of bits that represent the same thing.

RISC-V base ISA has a various instruction formats which depend on the goal of the instruction. This one is from the I format, because it includes an immediate. Read it and compare with the following:

  • First the opcode, addi for you, has a binary counterpart: 0010011. 7 bits for this instruction format.
  • Then the destination register, a0, has a binary representation: 01010. There are 32 registers in RISC-V so each of them are represented by a 5 bit value.
  • There’s some extra space for an opcode-like field called funct3: 000
  • Then there the source register, zero, which is: 00000. Again 5 bits.
  • And the immediate you are adding, 56, which is just the binary representation of 56: 000000111000. It’s 12 bits wide for this instruction format.

Putting all together:

000000111000 | 00000 | 000 | 01010 | 0010011

So this forms the following binary value:

00000011100000000000010100010011

Or in hex:

0x3800513

Just for if you didn’t realize, you just assembled an instruction by hand.

That instruction we just created is going to be processed by the machine, reading each of the fields and activating its circuits as it needs according to the voltage levels those values represent.

In this case, it’s going to activate the ALU to add the numbers and all that kind of things, but in other cases it may just change the value of the program counter or whatever. All this is executed by the circuitry of the device, right after it loads the instruction.

That’s for the machine, but for us, from the perspective of a programmer, instructions are just numbers, as we just saw.

Demonstration

I purposely used machine to refer to the device that runs our instructions, but we have to be more specific about it now.

I’m going to talk specifically about modern (and common) microprocessors, because other devices may have peculiarities that can sidetrack us too hard1.

In our modern and common microprocessor, instructions are located in the memory. But that’s nothing we didn’t know! If we run a binary it’s loaded in the memory and executed from there. We all know that!

But you can be surprised to certain level if we stretch that a little bit.

Well, we know from the previous part that instructions are just numbers, and we know that they loaded from the memory so let’s do some C black magic and see what happens:

#include<stdint.h>
#include<stdio.h>

typedef int f0(void);

int main(int argc, char* argv[]){
    uint32_t instructions[2];

    instructions[0] = 0x03800513; // addi a0, zero, 56
    instructions[1] = 0x00008067; // jalr zero, ra, 0

    f0 *load_56 = (f0*) instructions; // Reinterpret the array address
                                      // as a function
    int a = load_56();
    printf("%d\n", a);
}

In that example we build an array of two values. The first one corresponds to the instruction we encoded by hand before and the second corresponds to jalr zero, ra, 0, the return instruction, which you can encode yourself.

After that we convert the address of the array to a function that returns and integer and… Boom! We execute the array of numbers.

The code only works on RISC-V, but don’t worry, I can tell you that it prints 56.

So it was true that the machine can execute stuff from the memory, but what we may not know is that for the machine there’s no actual distinction between instructions and data2. We just executed an array of numbers!

The machine doesn’t care. If it looks like instructions it executes.

You can try to put random values in the array and try to execute them, too. An Illegal instruction error is going to happen, probably. If you are lucky you may execute something by accident, who knows.

But how did this thing work that well? Why did it return the value correctly and all that?

Calling convention

The code worked because we are following the RISC-V ABI, the same that C is following in the example. It tells us how do we need to pass arguments to functions and return and all that. This part of the ABI that defines how to call and return from functions is called calling convention.

I’m not going to extend a lot talking about this, but I will just say that RISC-V has some registers to pass arguments on: a0, a1a7. And those registers are also used for return values.

In the example we are not taking any argument so we don’t need to read from any but we return one value, just writing it in a0.

With what you know, you can now create a function that gets an input argument and adds an immediate to it. Why don’t you try?

On the other hand. RISC-V ABI defines there’s a register called ra that contains the Return Address, so we need to jump to it if we want to finish our function execution.

There are many things more you can read about, but this is enough to begin.

Memory protections

The C example where we executed an array is correct, it runs and all that, but the reality is that memory has different kinds of permissions for each part of it.

Code in memory is normally read-only and executable, and data can be read-only or not, depending on the goal it has (constant or variable).

If you think about the example above, once the array is set, we can overwrite it later, or even write it from the instructions we inserted on it. This could lead to security issues or unexpected results. That’s why code is normally read only and any attempt to write it will raise an exception to the kernel.

There are several ways to identify a memory block as code: the RISC-V assembly (and many others) uses the .text directive which automatically sets the block as a read-only block that can be executed; the mmap Linux system call needs some flags to indicate the protections on the memory block (PROT_EXEC, PROT_READ, PROT_WRITE…); etc.

Just-in-Time Compilation

Just-in-time (JIT) Compilation is a way to execute programs that involve a compilation step at runtime. Typically this happens on interpreted programs, where the interpreter consumes part of the execution time. An interpreter with a JIT Compilation feature is able to compile parts of the code it’s going to run to machine code and speed up the execution of those parts.

Clever interpreters are able to predict if the time they need to compile and execute the JIT Compiled parts is less than the time they need to interpret them, so they can decide if it’s worth the effort.

Normally, the JIT Compilation is more effective in pieces of code that are executed many times because the code only needs to be compiled once and the speed increase is going to be obtained in every execution. But many algorithms may be defined, and parts of the code may be recompiled looking for different optimizations while the interpreter collects data about the performance of the program.

Explained like this it looks like it’s a complex thing to do (and it is) but with the previously mentioned points we can imagine a simple JIT machine code generation library. We “just” need to:

  • Know what code to generate (choose a function to compile, this step may need some code analysis).
  • Reserve some space (malloc, mmap…)
  • Fill the space with numbers (the machine code instructions resulting from the compilation of the function).
  • Next time the program wants to call the function we compiled, call the numbers instead (as we did in the demonstration).
Example: Lightening, Guile’s machine code generation library

The just-in-time compilation process in Guile is simple, but effective3. Guile uses a library called Lightening for it. Lightening is a template-like library that defines a virtual instruction set. That instruction set is translated by the library to the instruction set of the actual machine.

Implementing support for another architecture is as simple as implementing all the translation code for the new architecture. That’s what I’ve been doing these days.

Guile’s JIT compiler only needs to call the instructions of the library and they will generate actual machine code by themselves, packaged in a function the interpreter will be able to call later.

Lightening is simple because it doesn’t need to compile from source code, or make code analysis to find which part of the code does it need to compile. It just exposes an API that looks like an instruction set and that’s the thing that we translate to the machine code.

The JIT is going to call the API of Lightening, creating more complex operations by combining Lightening’s instructions and Lightening is going to convert those operations to their machine code by a simple translation, filling the array of numbers and returning its address as a function pointer we can call later.

Of course, it is much more complex than that because it needs to solve many other problems we are talking about later but that’s the idea. And the idea doesn’t sound too difficult, once you have in mind what we talked about previously.

Lessons learned

There are many problems that a machine code generation library like that can encounter, but they are not exclusive for those libraries. This kind of problems can also appear in compilers, assemblers and many other things.

The lessons I learned come as problems I encountered during these days of digging and implementing and some possible solutions or thoughts about them.

Problem: Large immediates

Large immediates are one of the most obvious but boring issues in this world, and they apply to many cases.

In the example above we encoded an addi instruction that added 56, an immediate, to a register, and we said the immediate had a 12 bit space in the instruction. Registers in RISC-V are 32 bit (in RV32) or 64 bit (in RV64) wide, so we can work with larger values, but we are limited to use 12 bit immediates in addi and all the other I-type instructions.

Why is that? Well, RISC-V instructions are 32 bit and they need to be able to pack much more information than the immediate they use, so the immediates can’t be as large as we want. The fixed instruction size is a design decision that keeps the processor simple, but other processors have other design decisions4 around this.

Solution: multiple instruction expansion

There are several solutions for this, but the most obvious one is to use more than one instruction to operate in an immediate.

If we want to load a 64 bit value, we can add, rotate left, add, rotate left, add… Until we fill a whole register with the value we were looking for.

This means a simple addition can be expanded to many instructions. In some cases they are going to be just a few, but as the immediates get big we may need more than eight instructions, very well encoded, to write the immediate to a register and be able to operate with it.

Solution: constant insertion

This is not a solution we can take everywhere, but we can take in the context we are right now (code is stored in the memory and all that, remember). Consider this RV64 code:

auipc t0, 0             // x[t0] = PC + 0
ld t0, 12(t0)           // x[t0] = mem[ x[t0] + 12 ]
jal zero, 3             // PC    = PC + 3
0xAAAAAAAAAAAAAAAA      // This is a 64 bit literal
addi t0, t0, 1          // x[t0] = x[t0] + 1
// What's the value of t0 here?

The code has some comments on the right that I’m going to use through the whole post, so get used to them. The x means register access (base registers are called X registers in RISC-V), and mem is memory, PC (program counter) is written in uppercase and not as if it was a register because it’s not accessible by the programmer so we need to treat it as a global variable we can only set using jumps or get using auipc.

RISC-V instructions are 32 bit long (4 bytes), so you can get what the offset in the ld instruction does, right?

Basically we are loading a doubleword (ld) at the position of the 0xAAAAAAAAAAAAAAAA in the t0 register and adding 1 to it. So the answer to the question is 0xAAAAAAAAAAAAAAAB.

But can you see the trick we are using?

The jal instruction is jumping over the constant so we can’t execute it by accident (which would cause an Illegal Instruction error), and using the ld instruction we are able to load a big constant to a register. A constant which is mixed with the code, as any immediate would be, but without being associated with any instruction.

If we know the code we are generating is a function, we can always wait until the return instruction and insert all the constants after it, so they are perfectly separated and we don’t insert jumps to avoid executing the constants by accident. For that case, we need to change the values of the auipc and the ld accordingly, making them point to the correct address, which has some associated issues we need to talk about now.


Keep in mind you can hire ElenQ Technology if you like this kind of material.
We teach with this mixture of passion and awkward charisma. We also code and research.

Problem: Unknown addresses and offsets

Addresses and offsets are a pain in the ass because you may don’t know them when you expect to.

Let’s consider an unconditional jump like the one of the previous example. The number we introduce is the amount of instructions to jump from the program counter: an offset. The immediate offset can be positive, for forward jumps, or negative, for backward jumps.

jal zero, 3             // PC    = PC + 3

Generating this jump supposes that you know where you need to jump: you want to jump 3 instructions to the future.

But imagine you are assembling a file, a real assembly file that is not an oversimplification of the assembly, like what we did in the previous example. A real assembly file with labels:

add a0, a0, t1     // I don't care about this instruction
j end              // Unconditional jump to `end`

// Some code here

end:               // Label `end`
    ret            // return

If you are assembling this file line by line, you can actually assemble the add in the first line, because you know everything from it, but you are unable to emit the j end because you don’t know where end is yet.

If this assembly is written in a file you can always preprocess the whole file, get the labels, associate them with their addresses and then assemble the whole thing, but you are not always in this situation.

Lightening, for instance, generates the code as you call the API, so it doesn’t know where your jump points to until you call the API for the label later.

Compilers may encounter this issue too, when they are using separate compilation and linking steps. You must be able to compile one source file by your own but you may not know where do global variables appear, because they might be in a different file, and you only know those at link time.

Solution: relocations

There’s one simple way to solve it: introduce a fake offset or address and patch it later, when we know the position of the symbol. That’s what relocations do.

Example: C compilers

The relocations are a mechanism to pass information between the compiler and the linker, you can actually see them in the object files generated by your compiler. Make a simple file with a global variable and compile it. Something like this:

int global_symbol;
int main(int argc, char* argv[]){
    return global_symbol !=0;
}

If you compile it with gcc -c, you can inspect relocations in the result with objdump, using the -r flag alongside with -d for disassemble. In RISC-V you’ll find things like R_RISCV_HI20 or R_RISCV_LO12 where the relocations are located. They are ways to encode immediates in U-type instructions and I-type instructions respectively. In my case I get something like this (it’s not the full result):

   6:       00000797                auipc   a5,0x0
            6: R_RISCV_PCREL_HI20   global_symbol
            6: R_RISCV_RELAX        *ABS*
   a:       00078793                addi    a5,a5,0x0
            a: R_RISCV_PCREL_LO12_I .L0 
            a: R_RISCV_RELAX        *ABS*
   e:       639c                    ld      a5,0(a5)

There are two types of relocations but we are going to talk about the R_RISCV_RELAX later. You see my relocations have PCREL in the middle, but just to mention they are relative to the program counter.

If you just inspect the binary with the -d you won’t see the relocations and the result will look like nonsense code5:

    6:       00000797   auipc   a5,0x0         // x[a5] = PC + 0
    a:       00078793   addi    a5,a5,0x0      // x[a5] = x[a5] + 0
    e:       639c       ld      a5,0(a5)       // x[a5] = mem[ x[a5] + 0 ]

This adds 0 to program counter and stores the result in a5, then adds 0 to a5, and loads a doubleword to a5 from the address at a5. But the address at a5 at the moment of the load is nothing but the program counter at the auipc instruction. Weird.

The relocation is going to point to the auipc and the addi, and tell the linker it has to replace the zeros by other value. Which one? The address of the global variable. If we replace the zeros by a combination that is able to load the address of the global variable the code will work. That’s what the relocation does here.

So, as we don’t know where to point, we insert anything (zeros) and we fix the instructions when we know where do they need to point to.

Example: Lightening

The same approach is followed in Lightening, and you can follow in your assembler, library or anything that has a similar problem. Let’s consider some code using Lightening (obtained from tests/beqr.c, comments added by me):

// Make a function that loads two arguments
jit_load_args_2(j, jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R0),
                jit_operand_gpr (JIT_OPERAND_ABI_WORD, JIT_R1));

jit_reloc_t r = jit_beqr(j, JIT_R0, JIT_R1); // branch if equal registers
jit_leave_jit_abi(j, 0, 0, align);           // end ABI context
jit_reti(j, 0);                              // return 0
jit_patch_here(j, r);                        // make the branch jump here
jit_leave_jit_abi(j, 0, 0, align);           // end ABI context
jit_reti(j, 1);                              // return 1

// Obtain the function we created
jit_word_t (*f)(jit_word_t, jit_word_t) = jit_end(j, NULL);

// Test if it works
ASSERT(f(0, 0) == 1);       // 0 == 0 so it jumps        -> returns 1
ASSERT(f(0, 1) == 0);       // 0 != 1 so it doesn't jump -> returns 0

In this example we see how we generate machine code statement by statement, so there’s no way to know where does the beqr need to jump until we generated all the code for it.

You see the beqr function doesn’t receive the target address or offset as an argument, but it returns a jit_reloc_t, which other functions like reti don’t return.

That jit_reloc_t is what we are patching later with the jit_patch_here indicating where does it need to jump. The jit_patch_here function is going to correct the bits we set to zero because we didn’t know the target at that moment.

There are different kinds of relocations, as it happened in the previous example with the compilers, because different instruction formats need to be patched in different ways. In the case of Lightening, the relocation has a type associated with it, so we can check and act accordingly.

Problem: Long jumps

As we saw, some jumps encode the target as an immediate. This has a couple of implications that we described previously:

  • The jump target could be larger than the space we have for the immediate.
  • Sometimes we can’t know the target until we reach the position where the jump points to.

Both issues can be combined together in a killer combo. Consider this code:

j label     // jump to label

// A lot of instructions here

label:
    // this is the target of the jump

In RISC-V the j pseudoinstruction is resolved to jal, that has a 21 bit (signed) space for the jump target. If we had a hella lot of instructions between the jump and the target we may need more bits for the jump than the space we actually have.

Again, in the case were we can preprocess everything there’s no problem, but if we are assembling the instructions as they come we are going to have issues. We realize we can’t jump that far too late, because we already inserted a 21 bit jump and too many instructions when we reach the label. Patching the jump is not enough, because we didn’t leave enough space to insert the offset we need.

Solution: always insert the largest jump possible

There’s an obvious solution: always insert the largest possible jump and patch the whole jump later.

In RISC-V jalr jumps to the absolute address that is stored on a register with an optional 12 bit (signed) offset. Combined with the auipc (add upper immediate to program counter) it lets us make 32 bit relative jumps in just 2 instructions. Let’s explain that in code just in case:

auipc t0, offset_hi         // x[t0] = PC + (offset_hi<<12)
jalr zero, offset_lo(t0)    // PC    = x[t0] + offset_lo

If we consider offset as a value we know, we can split it in two blocks: the highest 20 bits as offset_hi and the lowest 12 bits as offset_low and use them to jump to any address in the 32 range from the current position, using just 2 instructions.

In 32 bit machines, this jump is the largest jump possible, because the machine can only address 32 bits, so we will be sure that any relative (or absolute, using lui instead of auipc) jump we want to make can fit in place. The only thing we have to take in account is to patch both instructions when we find the targets, not only one.

Optimization: pointer relaxation

But using the largest possible jumps can lead to inefficiencies because we use two instructions for jumps that can potentially fit in just one.

We can use something we saw before for that: relocations. More specifically, in the case of the GCC toolchain, we can use the R_RISCV_RELAX that appeared before.

The relaxation relocation is going to tell the next step, which can be the linker or anything else depending on the context we are working on, that the pointer can be relaxed. In the case of the auipc + jalr, possibly by replacing both instructions by a 21 bit jump like jal.

So we start with the longest jump possible, but when we actually know the target of the jump, we can reduce it to something smaller that needs fewer instructions.

Example: relaxed global variable access in C compilers

Global variables, as we saw before, are some of those points where compilers need to use relocations and let the linker clean the result.

Global variables don’t necessarily involve jumps but they do involve pointers for the loads and stores needed to operate with them. In the final executables, global variables are part of the .data segment, because they are known at compilation time, so we can exploit that fact a little and relax our weird auipc + load/store combos.

RISC-V has many registers, so we can use them for things that may not be the norm in other platforms where registers are scarce. In this case, we can exploit the gp (global pointer) register on RISC-V to improve the access to the global variables. We can cache the address of the .data segment of the program in the gp register so, as we know most of the global variables are going to be near (12 bit offset) to the beginning of the .data segment, we are probably going to be able to remove some of the auipcs we inserted before.

So a simple load of a global 64bit variable to a register:

auipc t0, offset_to_global_hi  // x[t0] = PC + offset_to_global_hi << 12
ld t0, offset_to_global_lo(t0) // x[t0] = mem[ x[t0] + offset_to_global_lo ]

Is optimized to this:

ld t0, offset_from_data(gp)    // x[t0] = mem[ x[gp] + offset_from_data ]

Of course, the offsets have to be calculated and all that, but this not that difficult.

Solution: Veneers

There are other solutions that don’t involve messing around with the code we generated earlier in an aggressive way like removing instructions, which can be pretty bad because you have to shift the array of instructions you generated to clean the gaps the pointer relaxation leaves.

Veneers are non destructive, and they involve no instruction reorganization, so they are interesting for those cases where you need to generate the code as you go.

Let’s explain them with an example:

beq a0, a1, branch // Jump to `branch` if x[a0] == x[a1]

// Instructions...

branch:
// Branch target

As we saw previously, if we insert too many instructions in between the jump and the target we screw it. What we didn’t mention is that as we go assembling instructions one by one we can keep a track of the amount of instructions we are inserting.

Having that in mind, we can take decisions in time, right before it’s too late. We can combine that knowledge with the constant insertion method introduced before to insert full-range jumps if needed, right before we exhaust the possible offset of the original instruction.

Of course, we need to patch the original instruction to jump to the code we are just going to insert, and we need to add some protections around the veneer to make it only accessible to the original jump.

beq a0, a1, veneer      // Jump to `veneer` if x[a0] == x[a1]

// Many instructions, but not too many!

// Here we realize we are running out of offset range so we insert a helper
// block that lets us jump further.
j avoid                 // Jump to `avoid` so the normal execution flow
                        // doesn't fall in the veneer
veneer:
    auipc t0,0          // x[t0] = PC + 0
    ld t0,12(t0)        // x[t0] = mem[ x[t0] + 12 ]
    jalr zero,0(t0)     // PC    = x[t0]
    ADDRESS(branch)     // Literal address of `branch` label
avoid:

// As many instructions as we want

branch:
// Branch target

As it happened with constant insertion, there are positions where the veneer insertion can be optimized a little, like right after a return or an unconditional jump, so we don’t need the protection (j avoid in the example).

The bad thing about veneers is they insert a bunch of instructions in the cases that are not in range and the jumps are done in two steps, that has a negative effect on the performance because they drop the pre-processed instructions in the pipeline.

Of course, the veneers themselves have to be patched too, because we won’t know the target (branch in the example) until we reach it. But, in the case of the veneer we can be 100% sure that we are going to be able to point to the target.

Example: Lightening’s constant pools

Lightening uses veneers for the jumps6, but they are part of Lightening’s constant pool mechanism. Constant pools work the same for the constant insertion than for veneers, because veneers are basically constants. Remember, code is numbers!

Basically anything that might be inserted as a constant, which can be a veneer or just a number or whatever, is queued to the constant pool. The library is going to emit instructions and check on each instruction if it needs to emit any of the constants of the pool.

The constant pool and each of the entries on the pool have associated information that tells the emitter if they need to be emitted now or if they can wait for later so the emitter can decide to insert them.

The literal pool entries have, of course, an associated relocation that contains information of the original jump or load instructions we may need to patch, as we already saw. So, in the case of a veneer emission, we need to patch the original jump to the veneer and remember the veneer needs to be patched later, when we find its target.

The mechanism is not complex, but it’s not simple neither. There are several kinds of relocations, depending on what we want to do with them, different kind of patches we need to do, address calculations and all those things that require a level of attention to the detail I’m not prepared to talk about.

Problem: Register access

You may have seen a problematic point in some of the solutions we worked with: we are using registers.

It’s not a problem by itself, but using registers might be really problematic if we are inserting code between the instructions someone else wrote because we can’t control the register use the original program did and we might be changing the values inside of the registers in the magic tricks we sneakily inserted.

Imagine we use, say, t0 register in our veneer but the original program uses that register for something else. That’s a problem. We are messing with the value in the register and potentially (surely) breaking the program.

Solution: use the stack

The most obvious solution you can think of is to use the stack. We can surround our veneers or code insertions with some protection code that saves the values of the registers on the stack and restores them when we finished.

It’s a simple solution in your mind, but if you need to deal with the jumps it can get messy. You may need to restore the register far away in the code and keeping track of everything. It can be complicated.

On the other hand, memory access is slow and boring and we don like that kind of things in our lives. We need more dynamite.

Solution: controlled register access

The other solution we can provide is to keep control of the registers that are being accessed and use others for our intervention.

A simple way to do this is to provide functions to get and release temporary registers, instead of letting the programmers do whatever they want. This makes sure that all the register access is controlled and we are not changing the values of any register in use.

The main problem we can have comes when the programmer needs all the register for their things and then we can’t really use any for our magic tricks. But we can always keep at least one register for us and only for us (throwing an error to the programmer when they use it) or even combine the use of the stack with this solution.

If we are directly working with assembly code, where we can’t force the programmer to use the interface we want, we can choose the solutions that don’t involve register access so we don’t need to analyze the code to deduce if the programmer is using the registers or not. Avoiding the problem is sometimes the best solution.

In the case of libraries like Lightening register access control is a must because Lightening can’t control how its (virtual-) instructions are translated to machine code instructions: each machine has its own peculiarities and details. In many cases they need to make use of temporary registers and, as the instructions are built incrementally, preventing instructions from peeing on each other is important.


Please, consider supporting me on Liberapay to encourage my free software work.

Final thoughts

I know these are just a few things, but they are enough to let you make your first program that involves machine code generation to certain level.

I’m not a computer scientist but a telecommunication engineer7, so I may put the focus on things that are obvious for the average reader of these kind of posts, but at the same time I may be flying over things that I consider basic due to my studies but the average reader doesn’t. In any case, feel free to contact me if you have questions or corrections.

Some of the tricks and lessons I included here are more important than others, but the most important thing is to start thinking in these terms. Try to understand the problems you face when you have separate compilation, assume the fact that you can’t know the future… The mindset is the most important point of all this and, once you have it, everything comes easier.

It’s also a lot of fun to realize code is just numbers in memory you can mess around with. I hope you keep it in your brain forever.

I hope this post throws some light on this dark hole that machine code generation is and makes you try to make your own findings on this beautiful area of compilers, machines and extremely long blog entries.


  1. One of those peculiarities is the Harvard Architecture that is not going to let us make the fantastic trick I’m going to show you now. Harvard Architecture is popular on microcontrollers. 

  2. LISPers are always right. 

  3. You can read more about how it works here

  4. In x86 all the instructions don’t have the same length and some can encode larger immediates

  5. addi a5,a5, 0x0 is adding 0 to a5 and storing it in a5 so it’s just moving it. RISC-V has a pseudoinstruction for that: mv a5,a5, which is expanded to the addi. objdump is going to write mv in its output, because it tries to be clever, but that is not the goal of the instruction we have. I changed it to the actual instruction so we can understand this better. 

  6. Only in the architectures that need them. x86 does not need constant pools or veneers because the ISA is complex enough to handle the problematic cases adding levels of complexity ISAs like RISC-V or ARM didn’t want to deal with. RISC vs CISC, y’know… 

  7. So, for all that software developers that write blog posts like “Are we really engineers?” or stuff like that: I am, thanks for the interest. LOL