Ekaitz's tech blog:
I make stuff at ElenQ Technology and I talk about it

TinyCC to GCC gap is slowly closing

From the series: Bootstrapping GCC in RISC-V

In previous episodes we talked about getting sidetracked and mentioned that we needed to build Musl because of limitations in our standard library. We didn’t explain those limitations in detail, and I think now is the moment to do so, as many of the changes we proposed back then have been tested and upstreamed. I also want to explain the ramifications that process had.

Symptoms

TinyCC and our MeslibC are powerful enough to build Binutils, but not powerful enough to make some of its programs, like GNU As, work.

MeslibC is supersimple, meaning it doesn’t really implement some of the things you might consider obvious. One of the best examples is fopen. Instead of returning a fresh FILE structure, in MeslibC fopen simply returns the underlying file descriptor, as returned by the kernel’s open call. This is not a big problem, as the fread and fclose provided with MeslibC are compatible with this behaviour, but there’s a very specific case where this is a problem. In GNU As, if no file is given as an input, it just tries to read from standard input, and it fails, saying there was no valid file descriptor. Why? Let’s read the code GNU As uses to read files (gas/input-file.c):

/* Open the specified file, "" means stdin.  Filename must not be null.  */

void
input_file_open (const char *filename,
         int pre)
{
  int c;
  char buf[80];

  preprocess = pre;

  gas_assert (filename != 0);   /* Filename may not be NULL.  */
  if (filename[0])
    {
      f_in = fopen (filename, FOPEN_RT);
      file_name = filename;
    }
  else
    {
      /* Use stdin for the input file.  */
      f_in = stdin;
      /* For error messages.  */
      file_name = _("{standard input}");
    }

  if (f_in == NULL)
    {
      as_bad (_("can't open %s for reading: %s"),
          file_name, xstrerror (errno));
      return;
    }

  c = getc (f_in);
  /* ... Continues ...*/

If MeslibC uses file descriptor integers as FILE structures, the problem in the example is not hard to spot. When the filename is empty (no file to read from), filename[0] is false (the \0 character) and f_in is set to stdin. In a regular C library that is a pointer to a FILE structure whose internal file descriptor is 0, the one corresponding to standard input. As that pointer is not NULL, the error message below doesn’t trigger. But, as I just explained, MeslibC uses the kernel’s file descriptors instead of FILE structures, so stdin in MeslibC is just 0, which the compiler considers equal to NULL. The error message is triggered and execution stops.

MeslibC’s clever solution for files fails simply because C has no error types, and the standard library signals errors using NULL.
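The collision can be reproduced outside MeslibC with a minimal sketch. The names below (toy_file, toy_stream_from_fd, toy_open_failed) are hypothetical and this is not MeslibC’s actual code; the sketch only assumes, as described above, that the stream handle is the file descriptor itself cast to a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a libc whose stream handle is just the file
   descriptor cast to a pointer. This is NOT MeslibC's actual code. */
typedef struct toy_file toy_file;

/* Build a "stream" out of a file descriptor without allocating anything. */
static toy_file *toy_stream_from_fd(int fd)
{
    assert(fd >= 0);
    /* On the usual flat-memory targets, integer 0 becomes the null pointer. */
    return (toy_file *)(intptr_t)fd;
}

/* The same test input_file_open performs: NULL is read as "open failed". */
static int toy_open_failed(const toy_file *f)
{
    return f == NULL;
}
```

With these definitions, toy_open_failed(toy_stream_from_fd(0)) is true even though descriptor 0 is perfectly valid, while any other descriptor passes the check.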

This is just a simple case to exemplify how MeslibC affects our bootstrapping chain, but there are others. For example, MeslibC can’t ungetc more than once: that was enough for the bootstrapping as it was designed for x86, but once we moved to a more recent Binutils version (the first one supporting RISC-V) it became an obstacle that prevents us from running GNU As.
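To make the ungetc limitation concrete, here is a minimal sketch of a stream that accepts several pushed-back characters. Everything in it (struct toy_stream, toy_getc, toy_ungetc, the capacity of 8) is made up for illustration, and it reads from an in-memory string instead of a file descriptor. A library that keeps a single pushback slot would behave like this sketch with a capacity of 1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PUSHBACK_MAX 8   /* arbitrary capacity chosen for the sketch */

/* Hypothetical input stream reading from an in-memory string. */
struct toy_stream {
    const char *data;                 /* remaining unread input */
    int pushed[TOY_PUSHBACK_MAX];     /* stack of pushed-back characters */
    size_t npushed;                   /* how many are currently stored */
};

/* Return the next character, preferring pushed-back ones; -1 at the end. */
static int toy_getc(struct toy_stream *s)
{
    assert(s != NULL);
    if (s->npushed > 0)
        return s->pushed[--s->npushed];   /* most recent pushback first */
    if (*s->data == '\0')
        return -1;                        /* end of input, like EOF */
    return (unsigned char)*s->data++;
}

/* Push a character back; fails with -1 when the stack is full. */
static int toy_ungetc(int c, struct toy_stream *s)
{
    assert(s != NULL);
    if (c == -1 || s->npushed == TOY_PUSHBACK_MAX)
        return -1;
    s->pushed[s->npushed++] = c;
    return c;
}
```

Characters come back in reverse order of pushing, which is what a caller that peeks two characters ahead and then rewinds expects.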

Of course, all of these problems could be fixed in MeslibC, but in the end the goal of MeslibC is not to be a proper C standard library implementation, but a helper for the bootstrapping of more powerful standard libraries that already exist. These problems, and some others we also found, just draw the line of when we need to jump to a more mature C standard library in our chain. It looks like Binutils is where that line is drawn.

Musl

The bootstrapping chain as conceived in Guix uses GLibC, as Guix is a GNU project, but we found Musl to be a more suitable C standard library for these initial steps, as it is simple and easy to build while keeping all the functionality you might expect from a proper C standard library.

We ran into some issues though, as upstream TinyCC’s RISC-V backend wasn’t ready to build it.

First of all, TinyCC’s RISC-V backend had no support for Extended ASM, so I implemented it and sent it upstream.

Once I did that we built Musl and realized we had issues in some functions. The problem was that the Extended Asm implementation did not understand the constraints properly, and operands marked as both read and written were not handled correctly. I talked with Michael, the author of that piece of code, because I didn’t understand the behaviour well. He guided me a little and I proceeded to fix it in all architectures.
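As an illustration of what a read and write operand is, the following sketch uses the "+r" constraint of GCC-style Extended Asm. The function name toy_increment is hypothetical, and the code is not taken from Musl or TinyCC. The RISC-V path is guarded so the sketch still builds on other machines, where it falls back to plain C.

```c
#include <assert.h>
#include <limits.h>

/* Increment a value in place. On RISC-V this goes through Extended Asm
   with a read-write operand; elsewhere it falls back to plain C. */
static long toy_increment(long x)
{
    assert(x < LONG_MAX);   /* keep the addition free of overflow */
#if defined(__riscv)
    /* "+r": the same register is read as input AND written as output.
       A compiler that treated it as write-only ("=r") would not be
       obliged to load x into the register before the instruction runs. */
    __asm__ ("addi %0, %0, 1" : "+r"(x));
#else
    x = x + 1;
#endif
    return x;
}
```

If the "+" modifier is ignored, the instruction may start from whatever happened to be in the register, which is the kind of silent misbehaviour that only shows up at run time.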

Still, we couldn’t build Musl because it was using some atomic instructions that were not implemented in TinyCC’s RISC-V assembler. At first we decided to avoid them by patching around them in Musl, but they happened to be important for memory allocation (LOL), so I decided to implement them in TinyCC’s assembler and push the changes upstream. I implemented lr (load reserved) and sc (store conditional), and extended the fence behaviour to match what the GNU Assembler (the reference RISC-V assembler) would do.
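To show why lr and sc matter to a C library, here is a sketch of a compare-and-swap loop built on them, the kind of primitive a memory allocator relies on for locking. It is written for illustration and is not copied from Musl; the name toy_cas is hypothetical. The RISC-V path assumes a toolchain with the atomic (A) extension enabled, and other architectures fall back to the GCC/Clang __atomic builtin so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>

/* Atomically replace *p with desired if it currently holds expected.
   Returns the value that was found in *p. */
static int toy_cas(volatile int *p, int expected, int desired)
{
    assert(p != NULL);
#if defined(__riscv)
    int old, failed;
    __asm__ __volatile__ (
        "1: lr.w.aqrl %0, (%2)\n"      /* load and reserve the word      */
        "   bne %0, %3, 2f\n"          /* not the expected value: leave  */
        "   sc.w.aqrl %1, %4, (%2)\n"  /* try to store, %1 = 0 on success */
        "   bnez %1, 1b\n"             /* reservation lost: retry        */
        "2:"
        : "=&r"(old), "=&r"(failed)
        : "r"(p), "r"((long)expected), "r"((long)desired)
        : "memory");
    return old;
#else
    /* On failure the builtin stores the current value into expected,
       on success expected already equals the old value. */
    __atomic_compare_exchange_n(p, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
#endif
}
```

A caller checks whether the returned value equals the expected one to know if the swap happened.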

Still, this wasn’t enough for Musl to build properly, as TinyCC’s RISC-V backend was not implemented as a proper assembly but as instructions in human readable text. RISC-V is a RISC architecture and makes heavy use of pseudoinstructions to ease the development of assembly programs. Before all this work, TinyCC only implemented simple instructions and almost no pseudoinstruction expansion.

Also, its architecture couples argument parsing with relocation generation, which doesn’t really help when implementing pseudoinstructions with a variable argument count or default values. I added enough code to work around the problems this design decision caused and pushed everything upstream. The list includes support for many pseudoinstructions, proper relocation use for several instruction families like jal and branches, and some other things. In the end, we do not have a fully featured assembler yet, but we do have enough to build the simple code we find in a C standard library like Musl. In fact, it even uses the syntax that any RISC-V assembler would expect, as I explained in more detail here.
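To give an idea of what pseudoinstruction expansion means, the sketch below rewrites a handful of well-known RISC-V pseudoinstructions into the base instructions they stand for, working on text only. It is a toy: the names (struct toy_pseudo, toy_expand) are hypothetical, it handles no relocations, and it has nothing to do with how TinyCC’s assembler is actually structured.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One rewrite rule: "name args" becomes prefix + args + suffix. */
struct toy_pseudo {
    const char *name;
    const char *prefix;
    const char *suffix;
};

/* A few expansions from the standard RISC-V pseudoinstruction list. */
static const struct toy_pseudo toy_pseudos[] = {
    { "nop", "addi x0, x0, 0", ""     },  /* no arguments at all      */
    { "ret", "jalr x0, 0(x1)", ""     },  /* implicit return register */
    { "mv",  "addi ",          ", 0"  },  /* mv rd, rs                */
    { "not", "xori ",          ", -1" },  /* not rd, rs               */
    { "j",   "jal x0, ",       ""     },  /* implicit rd = x0         */
    { "jr",  "jalr x0, 0(",    ")"    },  /* jr rs                    */
};

/* Write the expansion of "name args" into out. Returns 1 when a
   pseudoinstruction was expanded, 0 when the input was copied as-is. */
static int toy_expand(const char *name, const char *args,
                      char *out, size_t outsz)
{
    size_t i;
    assert(out != NULL && outsz > 0);
    for (i = 0; i < sizeof toy_pseudos / sizeof toy_pseudos[0]; i++) {
        if (strcmp(name, toy_pseudos[i].name) == 0) {
            snprintf(out, outsz, "%s%s%s",
                     toy_pseudos[i].prefix, args, toy_pseudos[i].suffix);
            return 1;
        }
    }
    snprintf(out, outsz, "%s %s", name, args);
    return 0;
}
```

Note how ret and j already show the two difficulties mentioned above: one takes no arguments at all, and the other fills in a default destination register.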

MeslibC

Once all those changes are finally applied to TinyCC, we can remove the weird split we needed to do in MeslibC to make it match the TinyCC assembly syntax, so I did that. Less code, less problems.

Also, my colleague Andrius added a realpath stub, so that we can build upstream TinyCC without having to patch the places where it uses realpath. realpath is not a simple function to implement, and it’s not critical in TinyCC. Again, MeslibC doesn’t need to be perfect, it only needs to let us start building everything.

TinyCC

With all those changes coming to MeslibC and the ones we upstreamed, we now don’t need to patch on top of upstream TinyCC, so all our small changes on top of it are dropped now. Less code, less problems.

We could have kept these changes to ourselves, but sharing them is not only easier, but also better for everyone. TinyCC is a project that we are not really part of, but this is what we do and what we believe in. The following is the complete list of changes I upstreamed to it:

  • 0aca8611 fixup! riscv: Implement large addend for global address
  • 8baadb3b riscv: asm: implement j offset
  • 15977630 riscv: asm: Add branch to label
  • 671d03f9 riscv: Add full fence instruction support
  • c9940681 riscv: asm: Add load-reserved and store-conditional
  • 0703df1a Fix Extended Asm ignored constraints
  • 6b3cfdd0 riscv: Add extended assembly support
  • e02eec6b riscv: fix jal: fix reloc and parsing
  • 02391334 fixup! riscv: Add .option assembly directive (unimp)
  • cbe70fa6 riscv: Add .option assembly directive (unimp)
  • 618c1734 riscv: libtcc1.c support some builtins for __riscv
  • 3782da8d riscv: Support $ in identifiers in extended asm.
  • e2d8eb3d riscv: jal: Add pseudo instruction support
  • 409007c9 riscv: jalr: implement pseudo and parse like GAS
  • 8bfef6ab riscv: Add pseudoinstructions
  • 8cbbd2b8 riscv: Use GAS syntax for loads/stores:
  • 019d10fc riscv: Move operand parsing to a separate function
  • 7bc0cb5b riscv: Implement large addend for global address

Bootstrappable TinyCC

During the bootstrapping process we detected new issues and one of them was so deep it took pretty long to detect and solve.

Most of the programs we were building with our Bootstrappable TinyCC worked: GZip, Make… But we reached a point where we needed to rebuild upstream TinyCC with Musl, in order to start using Musl to build the next programs. It didn’t work.

We had a really hard time finding the problem behind this because it appeared too far in the chain to be easy. The process goes like this.

We use Mes to build our very first Bootstrappable TinyCC, which compiles itself several times (6) until it reaches its final state. That one then builds upstream TinyCC, and with that we build TinyCC again, this time using Musl as its standard library. We found this last one was unable to build simple files, and we started digging.

We realized TinyCC was applying sign extension to unsigned values, and that was messing up the next TinyCC, making it unable to build programs correctly. Researching this more deeply, we found the problem was in the load function of TinyCC, but a TinyCC built with GCC didn’t have this problem. The only explanation was that the Bootstrappable TinyCC had a bug that was later affecting the compilers compiled with it.
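The symptom is easy to show in isolation. The sketch below is only an illustration of what sign-extending an unsigned value does when it is widened to 64 bits; it is not the code from TinyCC’s load function, and the function names are made up. It assumes the usual two’s complement behaviour of GCC and Clang for the conversion to int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Correct widening of an unsigned 32-bit value: upper bits become zero. */
static uint64_t toy_zero_extend(uint32_t v)
{
    return (uint64_t)v;
}

/* What a faulty load does: it treats the value as signed, so bit 31 is
   copied into all the upper bits. */
static uint64_t toy_sign_extend(uint32_t v)
{
    return (uint64_t)(int64_t)(int32_t)v;
}

/* Sanity check of both behaviours on a value with bit 31 set. */
static void toy_demo(void)
{
    assert(toy_zero_extend(0x80000000u) == 0x0000000080000000ull);
    assert(toy_sign_extend(0x80000000u) == 0xffffffff80000000ull);
}
```

Values below 0x80000000 come out identical in both functions, which helps explain why small programs kept working while a larger one, like the compiler itself, broke.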

Digging a little bit further, I found the casts in Bootstrappable TinyCC had some missing cases that I didn’t backport properly. As I wasn’t able to understand them very well, I decided to backport the full gen_cast function from upstream to the Bootstrappable TinyCC. With that, the errors from TinyCC were gone.

It feels like an accidental trusting trust attack, yes. This is the kind of thing we have to deal with, and these problems are pretty tiring and frustrating to find.

The new Bootstrapping chain

So, all of this brings us to the new bootstrapping chain. We need to do things very differently from the way Guix does them right now, because we are skipping many steps (GCC 2.95, now we need Musl for Binutils…), so I started a project to track how we move forward in the bootstrapping chain (it’s just a WIP for our tests, take that into account).

We had good and bad news in that regard. At the moment of writing we managed to build up to the GCC 4.6.4 I added RISC-V support to, but the compiler is faulty and it’s unable to build itself again with the C++ support.

I’m using non-bootstrapped versions of flex and bison, but those shouldn’t be hard to bootstrap either. I just didn’t have the time to make them from scratch. And I’m using Bash instead of Gash because we found a blocking error in Gash that is not letting us continue forward from Binutils.

In any case, this means we are close to the next milestone: building GCC 4.6.4 with TinyCC. And as we described in the previous post, we already built GCC 7.5 from GCC 4.6.4, so the milestone after that is already solved.

After those, we would need to clean this new bootstrapping chain and talk with Guix for its inclusion in there. I hope we can finish all this before hitting the deadline that is silently approaching…