Ekaitz's tech blog:
I make stuff at ElenQ Technology and I talk about it

GCC internals — From a porting perspective

In the previous post of the series the problem of the GCC bootstrapping was introduced. In this post we’ll describe how GCC works, from the perspective of someone who wants to port it so we understand what’s the job we have to do.

  1. Disclaimer
  2. Overview
    1. The compiler generation framework
    2. GCC as a coordinator
  3. Source code parsing
    1. GENERIC
  4. GIMPLE
  5. Register Transfer Language
    1. Target-dependent code
    2. Machine description files
      1. Machine modes
      2. RTL Templates
    3. Target description macros and functions
  6. Assembly code generation
  7. Summary
  8. My job in the backport
  9. Last words
  10. Learn more

Disclaimer

  • This post may be only valid for old GCC versions, like 4.something, because that’s the one I’m interested in. More recent versions may have different details, but I don’t expect them to be very different to what is described here. More specifically: I’m working on GCC 4.6.4, and the first GCC with RISC-V support is GCC 7.0.0.
  • This post will focus on how GCC compiles C programs because that’s the part we care about. Some other languages have differences on how they are treated but that’s not very relevant for us, as it has no implications on the back-end.

Both of these points will get clearer later.

Overview

GCC is structured as a pipeline of several steps that run one after the other.

  1. Source code parsing
  2. GIMPLE IR generation (target-independent)
  3. Some GIMPLE tree optimizations
  4. RTL IR generation (target-dependent)
  5. RTL optimizer
  6. Assembly code generator

Before starting to analyze each of the steps independently there are a couple of things to clarify.

The Compiler Generation Framework

An important point to note is GCC is a compiler collection meaning that it is able to compile code from many high level languages (HLL) and for many different targets. This has implications on how some steps are mapped to GCC’s source code.

The most important thing of all this is to differentiate between GCC’s code and an actual gcc executable. The key point here is that GCC’s codebase includes what is called CGF (Compiler Generator Framework) that can generate gcc executables from GCC’s code. The CGF generates gcc executables according to the input (target machine, host machine…) we give it, but the generated gcc executables may differ one from another even if they were generated from the same codebase.

Any gcc executable is able to compile any input HLL1 (C, C++, Objective-C, Ada, Fortran and Go2), so GCC’s code must include parsers for each of these languages.

On the other hand, gcc executables are only able to generate code for one target (x86, MIPS, ARM, RISC-V…), that must be chosen when GCC is compiled. In order to make the porting efforts easier, GCC has a set of tools that generate the target-dependent code from some configuration files called Machine Descriptions (MD).

Putting all this together, source code parsing and AST generation depend on the input HLL, and the code that runs for each HLL is selected when gcc runs (steps 1). The intermediate representation, GIMPLE, is target-independent so everything related with that is copied inside the final gcc executable (steps 2 and 3). The RTL (Register Transfer Language) representation and assembly code generation are target-dependent and the code related to that is generated from MD files when GCC is compiled (steps 4, 5 and 6)3.

This all means if we want to be able to read the source code of GCC we have to have clear in mind how the source code maps to the actual executable, i.e. if we generate a gcc executable for x86 it won’t contain the code for other architectures and it won’t even check if it was correctly programmed, because it’s not going to compile it.

GCC as a coordinator

Many GCC users or C programmers (or me, not that long ago) might think there is something missing on the list of steps we recently reviewed. The normal usecase for calling gcc like

$ gcc -o helloworld helloworld.c

does several steps internally that we need to separate:

  1. Preprocessing: the resolution of preprocessor macros like #define and stuff like that.
  2. Compiling to assembly: the generation of assembly code files per compilation unit (a file that is the output of the preprocessor).
  3. Assembly: the conversion from an assembly file to ELF object file.
  4. Linking: the executable or library generation from the ELF object files created in the previous step.

The reality is there’s more than one program involved here and gcc is just a coordinator that makes other programs run if needed.

The preprocessor is called cpp and it is generated from the GCC codebase. The compiler is gcc itself but the assembler and the linker are generally obtained from GNU Binutils’ as and ld respectively.

So, one of the most important things to understand is GCC only generates assembly, but it looks like it doesn’t4.

This means we need a proper support for our architecture on the assembler and the linker too. But we’ll keep that story for another day5.


Souce code parsing

HLL dependent. Generates appropriate IR

So the first step of the compiler is to process the input text and convert it to the appropriate Intermediate Representation. The most used intermediate representation is GENERIC, that was designed for C but fits other procedural languages pretty well6.

This parsing process not really relevant for us, as we want to add a new target, but it’s interesting to note because it gives shape to the codebase. GCC splits the code for the different input languages in folders named like: gcc/$LANGUAGE.

GENERIC

GENERIC is just a representation, we don’t need to care that much about it but a word is not going to hurt anyone. GENERIC is a tree representation: a set of nodes with some extra common information. Those nodes can be read at gcc/tree.def.

A simple example of this could be a function declaration, that would take the a node of type FUNCTION_DECL that has some sub-nodes: one for the return type, another for the body of the function and another for the arguments of the function.

It’s a simple AST you could come up with yourselves, except the fact that it is pretty complex. 😅

GIMPLE

HLL- and Target- independent representation

The next step is called Gimplification (see gimplify.c), the process of converting to GIMPLE. Normally, representing the AST as GIMPLE is too complex to be done in one step, so the GENERIC (+ some extensions) is used as a previous step that is easier to create.

GIMPLE is the central internal representation of GCC. It’s target-independent and High-Level-Language-independent. At this point some optimizations can be applied, those related with the structure of the source code, like loop unfolding or dead code elimination.

From the porting perspective, this representation is important, as it’s the border line between the front-end and the back-end, and we are interested in the latter. A really interesting part is to understand is how is this converted to the next representation, RTL.

Register Transfer Language (RTL)

Target-dependent low level representation

The next part of the compiler work is done using the RTL intermediate representation. The RTL representation is based on LISP, so we have a reason to love it, and it serves two purposes:

  1. Specify target properties via the Machine Descriptor files. These Machine Descriptor files are text files that look like LISP and are processed at compilation time.
  2. Represent a compilation. Meaning that the RTL is also an intermediate representation, a low-level one, that represents sets of instructions.

GCC does not make any distinction between the first and the second purpose, calling both RTL, but there some difference on the purpose and the shape of the RTL. RTL has both, an internal form represented by structures (case 2) and an external form represented as a text file (case 1).

The RTL is formed by a set of objects: expression, integers, wide integers, strings or vectors. In the textual form they are represented like in LISP, using double quotes for strings, brackets for vectors… and a lot of parenthesis. The internal representation you can imagine, structures for expressions, integer types for integers, char* for strings, etc.

The most interesting RTL objects are expressions, aka RTX, that are just a name (an expression code) plus the amount of possible arguments.

This is how a piece of RTL may look, it represents an instruction that sets the register 0 to the result of the addition of the register 1 and the constant integer 10 (see rtl.def for more information):

(set (reg  0)
     (plus (reg 1)
           (const_int 10)))

In the example the only things that are not expressions are the numbers (0, 1 and 10), all the rest you can find in rtl.def and see what they mean.

From GIMPLE, there are two steps left to reach our target, assembly code, and both involve RTL. The first maps the GIMPLE nodes to pattern names in a target-independent way, generating a list of RTL insns. The second matches those insn lists to RTL templates described in Machine Description files and uses those matches to generate the final assembly code.

Those insns are objects that represent code in RTL. Each function is described with a doubly-linked list of insns. You can think about them as instructions in the RTL world.

In the first step, the RTL insn generation step, only the names matter (and they are hardcoded in the compiler), while in the second the structure of the insn is going to be analyzed as we’ll see later.

Target-dependent code

As we previously said, target-dependent steps are generated at compile time and then inserted in the final gcc executable. All this code is located in one folder per target, under gcc/config/$TARGET, so the CFG is able to load the target we choose at compile time (using --target=) and insert that in the final executable.

That is done in different ways depending on the type of file we are working with: Machine Description files are processed by the programs (gencodes, gentags…) that generate C code files from them, while target description macros and functions, which are C files, are inserted in the building process as any other C file.

I’d like to insist here on the fact that the --target is the only one to be processed and loaded and the other possible targets are going to be ignored. The build process is not going to complain if a target is broken or anything like that if it isn’t the target we chose. It just doesn’t care.

Machine Description files

Machine Description files (.md extension) let us define insn patterns, which are incomplete RTL expressions that can be matched against the insn list generated from the GIMPLE, attributes and other interesting things we may not try to decipher here.

define_insn is a RTX we can use to define new insn patterns. It receives four or five operands:

  1. An optional name. It’s going to be used to match against GIMPLE.
  2. An RTL template. A vector of incomplete RTL expressions which describe how should the instruction look like. Incomplete in this context means it uses expressions like match_operand or match_operator which are designed to match against the RTL insn list and see if they are compatible or not.
  3. A condition. A final condition to say if the insn matches this pattern or not.
  4. An output template. A string that contains the output assembly code for this insn. The string can contain special characters like % to define where should the arguments be inserted. If the output is very complex we can write C code on this field too.
  5. An optional list of attributes.

This is an actual example from the RISC-V code we are backporting:

(define_insn "adddi3"
  [(set (match_operand:DI 0 "register_operand" "=r,r")
        (plus:DI (match_operand:DI 1 "register_operand" "r,r")
                 (match_operand:DI 2 "arith_operand"    "r,I")))]
  "TARGET_64BIT"
  "add\t%0,%1,%2"
  [(set_attr "type" "arith")
   (set_attr "mode" "DI")])

You can see the name adddi3 is something like: add + di + 3. This means it’s the add instruction with the di mode and 3 input arguments. That’s the way things are named.

Next block is a vector with the RTL template. If you try to ignore match_operand expressions you can see the template is not very different to the RTL example we gave before. In this case it’s something like:

(set (reg 0)
     (plus (reg 1)
           (reg 2)))

It’s basically storing in the first register the result of the addition of the other two.

The next field is the condition. In this case it needs to have TARGET_64BIT defined in order to work because the machine mode is DI (we’ll explain that soon).

The output code is simple, just a RISC-V add instruction:

add %0,%1,%2

Where %N is going to be replaced by the register numbers used as arguments for this instruction.

The last field are the attributes, which can be used to define the instruction size and other kind of things. We are not going to focus on them today.

Machine modes

Machine modes are a way to describe the size of a data object and its representation.

  • QI: quarter integer
  • HI: half integer
  • SI: single integer
  • DI: double integer
  • SF: single floating
  • DF: double floating

And so on.

The standard insn names include machine modes to describe what kind of instruction they are. The example above is addddi3, meaning it uses di machine mode: double integer. That’s why it needs the target to be a 64 bit RISC-V machine.

Machine modes also appear in some RTL expressions like plus or match_operand meaning that they operate in that machine mode, that is, with that data size and representation. For example (plus:SI ...).

RTL Templates

match_* expressions are what make RTL expressions incomplete, because they are designed to be compared against the insn list that comes from the previous step.

In the example above we had:

(set (match_operand:DI 0 "register_operand" "=r,r")
     (plus:DI (match_operand:DI 1 "register_operand" "r,r")
              (match_operand:DI 2 "arith_operand" "r,I")))

(match_operand N predicate constraint) is a placeholder for an operand number N of the insn. When the insn is constructed, the match_operand will be replaced by the corresponding operand of the insn. When the template is trying to match an insn the match_operand forces the operand number N to match the predicate in order to make the insn match the template. The match_* expressions are what defines how insns should be.

The predicate is a function name to be called. The function receives two input arguments: an expression and a machine mode. If the function returns 0 the function does not match.

predicates can also be combined in Machine Description files like this:

(define_predicate "arith_operand"
  (ior (match_operand 0 "const_arith_operand")
       (match_operand 0 "register_operand")))

So the arith_operand shown in the example above can be a const_arith_operand or (that’s what ior means) a register_operand. They can be more complex but this is more than enough to understand how they are built. In the end, they always check against C functions, but you can combine them with the convenience of the Machine Description files.

The constraint allows to fine-tune the matching. They define if the argument is in a register or the memory and stuff like that. r, for example, means the operand comes from a register.

There are other matching expressions too, but match_operand is the most used one and it’s the one that explains this concept of incomplete expressions the best.

Target description macros and functions

Apart from the machine descriptor files, there are other files involved. For example, the constraints defined above need to be defined in code somewhere.

The most important of these are the target description macros and functions, normally defined in gcc/config/$TARGET/$TARGET?(.h|.c). The .c should initialize the targetm variable, which contains all the machine information relevant to the compiler. It is initialized like this:

struct gcc_target targetm = TARGET_INITIALIZER;

That TARGET_INITIALIZER is a huge macro, defined in gcc/target.h, that initializes the targetm structure. This macro is split in smaller macros with reasonable defaults that may be overwritten by pieces. Each target should have a file that includes both target.h and target-def.h and overwrites any inappropriate default by redefining new macros and ends with the initialization line we just introduced. This is normally done in gcc/config/$TARGET/$TARGET.c, while the .h is normally used to define some macros that are needed in the .c file.

As a reference, the RISC-V code we need to backport (see gcc/config/riscv/riscv.c) uses the file to introduce the amount of registers, the type, the size, and that kind of things, and many others.

All this information contained in targetm is used by the compiler to decide how registers have to be allocated, which ones have preference, the cost of them, and many other things.

Assembly code generation

Having the previous step clear is enough to understand how does the assembly generation work. Each of the insns in the list obtained from GIMPLE is going to be compared against the RTL templates and the best match is going to be chosen. Once the match is chosen, the corresponding assembly is going to be generated from the corresponding field of the define_insn RTL expression.

As simple as that, but also that complex.

Why do I say it’s complex? Because many things have to be considered and GCC does consider them. Each instruction has a size, that has to be considered to calculate addresses, but also they have some execution time associated and GCC calculates the best matches to make the final assembly file as optimum as possible.

The RTL step has a lot of optimization passes, too. It’s a complex step but it’s not really important for us because we just need to make a temporary compiler that lets us compile a better one. It doesn’t really matter if it’s not perfect, at least at this point.

Summary

So, in summary, the process is the following:

  1. The HLL language is parsed to a tree, normally GENERIC.
  2. GENERIC is converted to GIMPLE.
  3. GIMPLE optimizations are applied.
  4. GIMPLE is matched to a insn list using pattern names.
  5. The insn list is matched against the RTL templates defined in the Machine Description files.
  6. RTL optimizations are applied.
  7. The matches convert the RTL to assembly code also taking in account the information obtained from the target definition macros and functions.

From our perspective, the most important things to remember are these:

  • The front-end is not very relevant for us, from the parsing to GIMPLE we can ignore for the moment.
  • RTL step is pretty complex, and the GIMPLE->RTL conversion is too.
  • GCC is a compiler collection that has a very powerful compilation process, the Compiler Generator Framework (CFG), in order to modularize the code and make it easier to port.
  • The machine description files and the target definition macros and functions are designed to make the porting process simpler. Those are the only files we need to touch.

If you like my job here, consider hiring ElenQ Technology.
Even if I’m busy with this, I still have some time slots available.

My job in the backport

With all the process more or less clear, we can be more specific on the job I need to do. I share some specifics in this section, so if you like reading code you are going to have some fun7.

First, I need to make sure all the used RTL expressions are compatible with the old version of the compiler. If they are not, I have to translate them to the old way to make them. Some examples of this are iterators like (define_int_iterator ...) which is not available in old GCC versions, so I need to unfold a couple of loops by hand and make them only use the old constructs.

Second, I need to convert the target description macros and functions to the old internal C-based API instead of the more modern C++-based one, as the recent port uses. These changes involve many layers and I didn’t yet analyze this in detail. They can be simple like converting from rtx_insn class to rtx, the older way to do this. But they can also be complex, like removing the 40% of the #include directives from riscv.c, which has many that were not available in the past. It’s going to be a lot of fun I predict.

Third, as this whole compilation process is complex, I decided to make it as accessible as possible, so other people can audit and replicate my work. For that I’m using Guix, my package manager of choice. I added a guix.scm and channels.scm file to the repository so my work can be replicated precisely by myself in the future, or by others8.

The Guix package also provides a better interaction with the building process of GCC, letting us replace inputs in a very simple way. I’m thinking about the next steps of the project here, when we need to compile my backported compiler with TinyCC, test if it works and then patch TinyCC until it does. Having the guix.scm file makes it easy to replace the current compiler with a patched TinyCC and ensures nothing is interfering in the compilation process because the compilation is done in an isolated container.

That’s mostly the job I need to do in the backport.

Something to keep in mind is that we don’t need to make it perfect, we just need it to work. The backported GCC is not going to be used as a production compiler, but just as a bridge with the next GCC version, so there’s only one program it needs to be able to compile correctly: GCC 7. Once we make the bridge with the next version, we can use that to compile anything we want.

Last words

I know this post is long, and the lack proper diagrams make everything a little bit hard to understand. That’s exactly how I felt reading about GCC, but the difference was I had to read some documentation that is… About 100 times longer than this post (see Learn more below). Not that bad after all.

There are many things I decided to leave out, like peephole optimizations, instruction attributes, and some other constructs that are not that important from my perspective. You may want to make your research on those on your own.

In any case, if you have any question you can always contact me9 and ask me any questions you have or send me some words of support.

In the next post I’ll describe a little bit about ELF, the executable and linkable format, just the bare minimum to understand the format, as it will be relevant for us in the future. And you might be thinking, why is it relevant if GCC compiles to assembly? Well, that’s one of the questions that we will be answering in the next post.

Now I leave you with a couple of interesting links on the next section.

Good luck with your ports!

Learn more

  • The GCC internals documentation: if you are interested on my work you should read an older version of the documentation. See Disclaimer.
  • The GCC source code: of course, this has everything you need to understand GCC, but the problem is that GCC is a huge codebase, you probably need to spend months reading it in order to understand everything. That’s why I think posts like this one are interesting, they help you focus on the parts you are interested in.

  1. When calling gcc you can choose which language you are compiling using -x language option or you can let gcc guess from the extension. 

  2. And also Java in the past! 

  3. The RTL optimizer contains many steps, most of them being target independent. That doesn’t really matter here, but those are not generated but copied from GCC’s source, as GIMPLE is. 

  4. Other compilers have different approaches for this. For example, TinyCC generates machine code directly, without the intermediate assembly file generation step, and is also able to link the files by itself. 

  5. This post is already long enough and we only made the introduction. 

  6. The case of FORTRAN is a little bit weird, as it generates its own representation that is later converted to GENERIC, we don’t really care about this at this point. 

  7. I’ll link to some examples of the code on RISC-V’s GitHub account. This code is already merged in GCC

  8. I’m hosting this on GitHub at the moment because the repository is huge. I’ll probably move all this to my server and edit the post after that. 

  9. You can find my contact info in the About page