In the previous post of the series I introduced the problem of bootstrapping GCC. In this post we'll describe how GCC works, from the perspective of someone who wants to port it, so we understand what the job ahead of us looks like.
- Disclaimer
- Overview
- Source code parsing
- GIMPLE
- Register Transfer Language
- Assembly code generation
- Summary
- My job in the backport
- Last words
- Learn more
Disclaimer
- This post may only be valid for old GCC versions, like 4.something, because that's the one I'm interested in. More recent versions may differ in the details, but I don't expect them to be very different from what is described here. More specifically: I'm working on GCC 4.6.4, and the first GCC with RISC-V support is GCC 7.0.0.
- This post will focus on how GCC compiles C programs, because that's the part we care about. Some other languages are treated slightly differently, but that's not very relevant for us, as it has no implications for the back-end.
Both of these points will get clearer later.
Overview
GCC is structured as a pipeline of several steps that run one after the other.
- Source code parsing
- GIMPLE IR generation (target-independent)
- Some GIMPLE tree optimizations
- RTL IR generation (target-dependent)
- RTL optimizer
- Assembly code generator
Before starting to analyze each of the steps independently there are a couple of things to clarify.
The Compiler Generation Framework
An important point to note is that GCC is a compiler collection, meaning it is able to compile code from many high-level languages (HLLs) and for many different targets. This has implications on how some steps are mapped to GCC's source code.
The most important thing in all this is to differentiate between GCC's code and an actual gcc executable. The key point is that GCC's codebase includes what is called the Compiler Generation Framework (CGF), which can generate gcc executables from GCC's code. The CGF generates gcc executables according to the input we give it (target machine, host machine…), so the generated gcc executables may differ from one another even if they were generated from the same codebase.
Any gcc executable is able to compile any input HLL1 (C, C++, Objective-C, Ada, Fortran and Go2), so GCC's code must include parsers for each of these languages.
On the other hand, gcc executables are only able to generate code for one target (x86, MIPS, ARM, RISC-V…), which must be chosen when GCC is compiled. In order to make porting efforts easier, GCC has a set of tools that generate the target-dependent code from configuration files called Machine Descriptions (MD).
Putting all this together: source code parsing and AST generation depend on the input HLL, and the code that runs for each HLL is selected when gcc runs (step 1). The intermediate representation, GIMPLE, is target-independent, so everything related to it is built into every final gcc executable (steps 2 and 3). The RTL (Register Transfer Language) representation and assembly code generation are target-dependent, and the code related to them is generated from MD files when GCC itself is compiled (steps 4, 5 and 6)3.
All this means that if we want to read GCC's source code we have to have a clear idea of how the source code maps to the actual executable: if we generate a gcc executable for x86, it won't contain the code for the other architectures, and the build won't even check whether that other code is correctly written, because it's not going to compile it.
GCC as a coordinator
Many GCC users or C programmers (or me, not that long ago) might think there is something missing from the list of steps we just reviewed. The normal use case of calling gcc like
$ gcc -o helloworld helloworld.c
does several steps internally that we need to separate:
- Preprocessing: the resolution of preprocessor macros like #define and similar directives.
- Compiling to assembly: the generation of an assembly file per compilation unit (a compilation unit being a file that is the output of the preprocessor).
- Assembling: the conversion of each assembly file to an ELF object file.
- Linking: the generation of the executable or library from the ELF object files created in the previous step.
The reality is that there's more than one program involved here, and gcc is just a coordinator that makes the other programs run as needed. The preprocessor is called cpp and it is generated from the GCC codebase. The compiler is gcc itself, but the assembler and the linker are generally obtained from GNU Binutils: as and ld respectively.
So, one of the most important things to understand is that GCC itself only generates assembly, even though it looks like it does more than that4.
This means we also need proper support for our architecture in the assembler and the linker. But we'll keep that story for another day5.
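If you want to see this split with your own eyes, you can run the steps one at a time. A minimal sketch (file names are illustrative, but the flags and tools are the standard ones):

$ cpp helloworld.c -o helloworld.i      # preprocess (same as: gcc -E)
$ gcc -S helloworld.i -o helloworld.s   # compile the preprocessed unit to assembly
$ as helloworld.s -o helloworld.o       # assemble with Binutils' as
$ gcc helloworld.o -o helloworld        # link; gcc drives ld with the right startup objects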
Source code parsing
HLL-dependent. Generates the appropriate IR
So the first step of the compiler is to process the input text and convert it to the appropriate Intermediate Representation. The most used intermediate representation is GENERIC, which was designed for C but fits other procedural languages pretty well6.
This parsing process is not really relevant for us, as we want to add a new target, but it's worth noting because it gives shape to the codebase: GCC splits the code for the different input languages into folders named gcc/$LANGUAGE.
GENERIC
GENERIC is just a representation; we don't need to care that much about it, but a few words won't hurt anyone. GENERIC is a tree representation: a set of nodes with some extra common information. The node types can be read in gcc/tree.def.
A simple example of this could be a function declaration, which would take a node of type FUNCTION_DECL with some sub-nodes: one for the return type, another for the body of the function and another for its arguments.
It's the kind of AST you could come up with yourselves, except for the fact that it is pretty complex. 😅
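To make this a little more concrete, here is a small C function and a hand-drawn sketch (not an actual GCC dump, just my reading of the node types in gcc/tree.def) of roughly how its GENERIC tree is shaped:

int add_ten (int x)
{
  return x + 10;
}

/* Roughly:
   FUNCTION_DECL "add_ten"
     arguments: PARM_DECL "x"   (integer type)
     result:    RESULT_DECL     (integer type)
     body:      RETURN_EXPR
                  MODIFY_EXPR   (assign the value to the RESULT_DECL)
                    PLUS_EXPR
                      PARM_DECL "x"
                      INTEGER_CST 10   */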
GIMPLE
HLL- and target-independent representation
The next step is called gimplification (see gimplify.c), the process of converting to GIMPLE. Representing the AST directly as GIMPLE is normally too complex to do in one step, so GENERIC (plus some extensions) is used as an intermediate step that is easier to create.
GIMPLE is the central internal representation of GCC. It's target-independent and high-level-language-independent. At this point some optimizations can be applied: those related to the structure of the source code, like loop unrolling or dead code elimination.
From the porting perspective, this representation is important, as it's the border line between the front-end and the back-end, and we are interested in the latter. A really interesting part to understand is how this is converted to the next representation, RTL.
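If you want to see GIMPLE with your own eyes, GCC can dump it while compiling. A sketch, assuming a file square.c with a trivial squaring function (the exact dump file name and the D.NNNN temporary numbers vary between versions and runs):

$ cat square.c
int square (int x) { return x * x; }
$ gcc -S -fdump-tree-gimple square.c
$ cat square.c.*.gimple
square (int x)
{
  int D.1234;

  D.1234 = x * x;
  return D.1234;
}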
Register Transfer Language (RTL)
Target-dependent, low-level representation
The next part of the compiler work is done using the RTL intermediate representation. The RTL representation is based on LISP, so we have a reason to love it, and it serves two purposes:
- Specify target properties via the Machine Description files. These are text files that look like LISP and are processed when GCC is compiled.
- Represent a compilation. That is, RTL is also an intermediate representation, a low-level one, that represents sets of instructions.
GCC does not make any distinction between the two purposes, calling both RTL, but there are some differences in their purpose and shape: RTL has both an internal form, represented by structures (case 2), and an external form, represented as a text file (case 1).
RTL is formed by a set of objects: expressions, integers, wide integers, strings or vectors. In the textual form they are represented as in LISP, using double quotes for strings, brackets for vectors… and a lot of parentheses. The internal representation is what you would imagine: structures for expressions, integer types for integers, char* for strings, etc.
The most interesting RTL objects are expressions, a.k.a. RTX, which are just a name (an expression code) plus its arguments.
This is how a piece of RTL may look; it represents an instruction that sets register 0 to the result of adding register 1 and the constant integer 10 (see rtl.def for more information):
(set (reg 0)
     (plus (reg 1)
           (const_int 10)))
In the example, the only things that are not expressions are the numbers (0, 1 and 10); everything else can be found in rtl.def, where you can see what they mean.
From GIMPLE, there are two steps left to reach our target, assembly code, and both involve RTL. The first maps the GIMPLE nodes to pattern names in a target-independent way, generating a list of RTL insns. The second matches those insn lists against the RTL templates described in the Machine Description files and uses those matches to generate the final assembly code.
Those insns are objects that represent code in RTL. Each function is described with a doubly-linked list of insns. You can think of them as instructions in the RTL world.
In the first step, the RTL insn generation step, only the names matter (and they are hardcoded in the compiler), while in the second the structure of each insn is going to be analyzed, as we'll see later.
Target-dependent code
As we previously said, the target-dependent steps are generated at compile time and then inserted in the final gcc executable. All this code is located in one folder per target, under gcc/config/$TARGET, so the CGF is able to load the target we choose at compile time (using --target=) and insert it in the final executable.
That is done in different ways depending on the type of file we are working with: Machine Description files are processed by programs (gencodes, genrecog…) that generate C code files from them, while the target description macros and functions, which are C files, are added to the build process like any other C file.
I'd like to insist here on the fact that only the --target we choose is processed and loaded, and the other possible targets are ignored. The build process is not going to complain if some other target is broken or anything like that, as long as it isn't the target we chose. It just doesn't care.
Machine Description files
Machine Description files (.md extension) let us define insn patterns, which are incomplete RTL expressions that can be matched against the insn list generated from the GIMPLE, plus attributes and other interesting things we won't try to decipher here.
define_insn is an RTX we can use to define new insn patterns. It takes four or five operands:
- An optional name. It's going to be used to match against GIMPLE.
- An RTL template. A vector of incomplete RTL expressions which describes what the instruction should look like. Incomplete in this context means it uses expressions like match_operand or match_operator, which are designed to be matched against the RTL insn list to see whether they are compatible or not.
- A condition. A final condition that says whether the insn matches this pattern or not.
- An output template. A string that contains the output assembly code for this insn. The string can contain special characters like % to define where the arguments should be inserted. If the output is very complex we can write C code in this field too.
- An optional list of attributes.
This is an actual example from the RISC-V code we are backporting:
(define_insn "adddi3"
  [(set (match_operand:DI 0 "register_operand" "=r,r")
        (plus:DI (match_operand:DI 1 "register_operand" "r,r")
                 (match_operand:DI 2 "arith_operand" "r,I")))]
  "TARGET_64BIT"
  "add\t%0,%1,%2"
  [(set_attr "type" "arith")
   (set_attr "mode" "DI")])
You can see the name adddi3 is something like: add + di + 3. This means it's the add instruction, in the di machine mode, with 3 operands (one output and two inputs). That's the way these patterns are named.
The next block is a vector with the RTL template. If you ignore the match_operand expressions you can see the template is not very different from the RTL example we gave before. In this case it's something like:
(set (reg 0)
     (plus (reg 1)
           (reg 2)))
It’s basically storing in the first register the result of the addition of the other two.
The next field is the condition. In this case TARGET_64BIT needs to be true for the pattern to apply, because the machine mode is DI (we'll explain that soon).
The output code is simple, just a RISC-V add instruction:
add %0,%1,%2
Where %N is going to be replaced by the registers used as arguments for this instruction.
The last field is the list of attributes, which can be used to define the instruction size and other kinds of things. We are not going to focus on them today.
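To tie it together, here is a hypothetical example of C code that would end up matching this pattern on a 64-bit RISC-V target (the register names in the comment are illustrative; the register allocator decides the real ones):

/* With both operands in registers, the addition matches the first
   alternative ("r,r") of the adddi3 pattern above and emits a single
   instruction, roughly:  add  a0, a0, a1  */
long add (long a, long b)
{
  return a + b;
}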
Machine modes
Machine modes are a way to describe the size of a data object and its representation.
- QI: quarter integer (1 byte)
- HI: half integer (2 bytes)
- SI: single integer (4 bytes)
- DI: double integer (8 bytes)
- SF: single floating (4 bytes)
- DF: double floating (8 bytes)
And so on.
The standard insn names include machine modes to describe what kind of instruction they are. The example above is adddi3, meaning it uses the di machine mode: double integer. That's why it needs the target to be a 64-bit RISC-V machine.
Machine modes also appear in some RTL expressions, like plus or match_operand, meaning that they operate in that machine mode, that is, with that data size and representation. For example (plus:SI ...).
RTL Templates
match_* expressions are what make RTL expressions incomplete, because they are designed to be compared against the insn list that comes from the previous step.
In the example above we had:
(set (match_operand:DI 0 "register_operand" "=r,r")
     (plus:DI (match_operand:DI 1 "register_operand" "r,r")
              (match_operand:DI 2 "arith_operand" "r,I")))
(match_operand N predicate constraint) is a placeholder for operand number N of the insn. When the insn is constructed, the match_operand will be replaced by the corresponding operand of the insn. When the template is trying to match an insn, the match_operand forces operand number N to match the predicate in order for the insn to match the template.
The match_* expressions are what define how insns should look.
The predicate is the name of a function to be called. The function receives two input arguments, an expression and a machine mode, and if it returns 0 the operand does not match.
Predicates can also be combined in Machine Description files, like this:
(define_predicate "arith_operand"
  (ior (match_operand 0 "const_arith_operand")
       (match_operand 0 "register_operand")))
So the arith_operand shown in the example above can be a const_arith_operand or (that's what ior means) a register_operand.
They can be more complex, but this is more than enough to understand how they are built. In the end they always check against C functions, but you can combine them with the convenience of the Machine Description files.
The constraint allows us to fine-tune the matching. Constraints define whether the argument lives in a register or in memory, and things like that. r, for example, means the operand comes from a register; in the RISC-V port, I means an immediate that fits in an I-type instruction (a 12-bit signed immediate), which is why the adddi3 example above accepts "r,I" for its second operand.
There are other matching expressions too, but match_operand is the most used one and the one that best illustrates this concept of incomplete expressions.
Target description macros and functions
Apart from the Machine Description files, there are other files involved. For example, the constraints and predicates used above need to be defined in code somewhere.
The most important of these are the target description macros and functions, normally defined in gcc/config/$TARGET/$TARGET.c and gcc/config/$TARGET/$TARGET.h. The .c file should initialize the targetm variable, which contains all the machine information relevant to the compiler. It is initialized like this:
struct gcc_target targetm = TARGET_INITIALIZER;
That TARGET_INITIALIZER is a huge macro, defined via gcc/target-def.h, that initializes the targetm structure. It is split into smaller macros with reasonable defaults that can be overridden piece by piece. Each target should have a file that includes both target.h and target-def.h, overrides any inappropriate default by redefining the corresponding macros, and ends with the initialization line we just introduced. This is normally done in gcc/config/$TARGET/$TARGET.c, while the .h file is normally used to define macros that are needed in the .c file.
As a reference, the RISC-V code we need to backport (see gcc/config/riscv/riscv.c) uses this file to describe things like the number of registers, their types and sizes, and many other details.
All this information contained in targetm is used by the compiler to decide how registers have to be allocated, which ones take preference, their cost, and many other things.
Assembly code generation
Having the previous step clear is enough to understand how assembly generation works. Each of the insns in the list obtained from GIMPLE is compared against the RTL templates and the best match is chosen. Once the match is chosen, the corresponding assembly is generated from the output template field of the define_insn RTL expression.
As simple as that, but also that complex.
Why do I say it's complex? Because many things have to be considered, and GCC does consider them. Each instruction has a size, which has to be taken into account to calculate addresses, and also an associated execution time, and GCC chooses the best matches to make the final assembly file as good as possible.
The RTL step has a lot of optimization passes, too. It’s a complex step but it’s not really important for us because we just need to make a temporary compiler that lets us compile a better one. It doesn’t really matter if it’s not perfect, at least at this point.
Summary
So, in summary, the process is the following:
- The HLL is parsed to a tree, normally GENERIC.
- GENERIC is converted to GIMPLE.
- GIMPLE optimizations are applied.
- GIMPLE is mapped to an insn list using pattern names.
- The insn list is matched against the RTL templates defined in the Machine Description files.
- RTL optimizations are applied.
- The matches convert the RTL to assembly code, also taking into account the information obtained from the target description macros and functions.
From our perspective, the most important things to remember are these:
- The front-end is not very relevant for us; everything from parsing to GIMPLE can be ignored for the moment.
- The RTL step is pretty complex, and so is the GIMPLE-to-RTL conversion.
- GCC is a compiler collection with a very powerful build machinery, the Compiler Generation Framework (CGF), that modularizes the code and makes it easier to port.
- The Machine Description files and the target description macros and functions are designed to make the porting process simpler. Those are the only files we need to touch.
Even if I’m busy with this, I still have some time slots available.
My job in the backport
With the whole process more or less clear, we can be more specific about the job I need to do. I share some specifics in this section, so if you like reading code you are going to have some fun7.
First, I need to make sure all the RTL expressions used are compatible with the old version of the compiler. If they are not, I have to translate them to the old way of doing things. Some examples of this are iterators like (define_int_iterator ...), which are not available in old GCC versions, so I need to unfold a couple of loops by hand and rewrite them using only the old constructs.
Second, I need to convert the target description macros and functions from the modern C++-based API the recent port uses to the old internal C-based API. These changes involve many layers and I haven't analyzed them in detail yet. They can be simple, like converting from the rtx_insn class to rtx, the older way to do this. But they can also be complex, like removing around 40% of the #include directives from riscv.c, since many of the headers it uses were not available in the past. It's going to be a lot of fun, I predict.
Third, as this whole compilation process is complex, I decided to make it as accessible as possible, so other people can audit and replicate my work. For that I'm using Guix, my package manager of choice. I added a guix.scm and a channels.scm file to the repository, so my work can be replicated precisely by myself in the future, or by others8.
The Guix package also provides better interaction with GCC's build process, letting us replace inputs in a very simple way. I'm thinking about the next steps of the project here, when we need to compile my backported compiler with TinyCC, test whether it works, and then patch TinyCC until it does. Having the guix.scm file makes it easy to replace the current compiler with a patched TinyCC, and it ensures nothing interferes with the compilation process because the build happens in an isolated container.
That’s mostly the job I need to do in the backport.
Something to keep in mind is that we don't need to make it perfect; we just need it to work. The backported GCC is not going to be used as a production compiler, but just as a bridge to the next GCC version, so there's only one program it needs to be able to compile correctly: GCC 7. Once we have that bridge to the next version, we can use it to compile anything we want.
Last words
I know this post is long, and the lack of proper diagrams makes everything a little bit harder to understand. That's exactly how I felt reading about GCC, but the difference was that I had to read documentation that is… about 100 times longer than this post (see Learn more below). Not that bad after all.
There are many things I decided to leave out, like peephole optimizations, instruction attributes, and some other constructs that are not that important from my perspective. You may want to research those on your own.
In any case, if you have any questions you can always contact me9 to ask whatever you want or to send me some words of support.
In the next post I'll describe a little bit about ELF, the Executable and Linkable Format: just the bare minimum to understand the format, as it will be relevant for us in the future. You might be thinking: why is it relevant, if GCC compiles to assembly? Well, that's one of the questions we will answer in the next post.
Now I leave you with a couple of interesting links in the next section.
Good luck with your ports!
Learn more
- The GCC internals documentation: if you are interested in my work you should read an older version of the documentation. See Disclaimer.
- The GCC source code: of course, this has everything you need to understand GCC, but the problem is that GCC is a huge codebase; you would probably need to spend months reading it to understand everything. That's why I think posts like this one are interesting: they help you focus on the parts you are interested in.
1. When calling gcc you can choose which language you are compiling using the -x language option, or you can let gcc guess from the file extension. ↩
2. And also Java in the past! ↩
3. The RTL optimizer contains many steps, most of them target-independent. That doesn't really matter here, but that code is not generated: it is copied from GCC's source, as the GIMPLE code is. ↩
4. Other compilers have different approaches to this. For example, TinyCC generates machine code directly, without the intermediate assembly file generation step, and it is also able to link the files by itself. ↩
5. This post is already long enough and we only made it through the introduction. ↩
6. The case of Fortran is a little bit weird, as it generates its own representation that is later converted to GENERIC; we don't really care about that at this point. ↩
7. I'll link to some examples of the code on RISC-V's GitHub account. This code is already merged in GCC. ↩
8. I'm hosting this on GitHub at the moment because the repository is huge. I'll probably move all this to my server and edit the post after that. ↩
9. You can find my contact info in the About page. ↩