[compiler-construction] Writing a compiler in its own language


Answers

I recall listening to a Software Engineering Radio podcast wherein Dick Gabriel spoke about bootstrapping the original LISP interpreter by writing a bare-bones version in LISP on paper and hand assembling it into machine code. From then on, the rest of the LISP features were both written in and interpreted with LISP.

Question

Intuitively, it would seems that a compiler for language Foo cannot itself be written in Foo. More specifically, the first compiler for language Foo cannot be written in Foo, but any subsequent compiler could be written for Foo.

But is this actually true? I have some very vague recollection of reading about a language whose first compiler was written in "itself". Is this possible, and if so how?




Generally, you need to have a working (if primative) cut of the compiler working first - then you can start thinking about making it self-hosting. This is actually considered an important milestone in some langauges.

From what I remember from "mono", it is likely they will need to add a few things to reflection to get it working: the mono team keep pointing out that some things simply aren't possible with Reflection.Emit; of course, the MS team might prove them wrong.

This has a few real advantages: it is a fairly good unit test, for starters! And you only have one language to worry about (i.e. it is possible a C# expert might not know much C++; but now thy can fix the C# compiler). But I wonder if there isn't an amount of professional pride at work here: they simply want it to be self-hosting.

Not quite a compiler, but I've recently been working on a system that is self hosting; the code generator is used to generate the code generator... so if the schema changes I simply run it on itself : new version. If there is a bug, I just go back to an earlier version and try again. Very convenient, and very easy to maintain.


Update 1

I've just watched this video of Anders at PDC, and (about an hour in) he does give some much more valid reasons - all about the compiler as a service. Just for the record.




Maybe you can write a BNF describing BNF.




When you write your first compiler for C, you write it in some other language. Now, you have a compiler for C in, say, assembler. Eventually, you will come to the place where you have to parse strings, specifically escape sequences. You will write code to convert \n to the character with the decimal code 10 (and \r to 13, etc).

After that compiler is ready, you will start to reimplement it in C. This process is called "bootstrapping".

The string parsing code will become:

...
if (c == 92) { // backslash
    c = getc();
    if (c == 110) { // n
        return 10;
    } else if (c == 92) { // another backslash
        return 92;
    } else {
        ...
    }
}
...

When this compiles, you have a binary which understands '\n'. This means you can change the source code:

...
if (c == '\\') {
    c = getc();
    if (c == 'n') {
        return '\n';
    } else if (c == '\\') {
        return '\\';
    } else {
        ...
    }
}
...

So where is the information that '\n' is the code for 13? It's in the binary! It's like DNA: Compiling C source code with this binary will inherit this information. If the compiler compiles itself, it will pass this knowledge on to its offspring. From this point on, there is no way to see from the source alone what the compiler will do.

If you want to hide a virus in the source of some program, you can do it like this: Get the source of a compiler, find the function which compiles functions and replace it with this one:

void compileFunction(char * name, char * filename, char * code) {
    if (strcmp("compileFunction", name) == 0 && strcmp("compile.c", filename) == 0) {
        code = A;
    } else if (strcmp("xxx", name) == 0 && strcmp("yyy.c", filename) == 0) {
        code = B;
    }

    ... code to compile the function body from the string in "code" ...
}

The interesting parts are A and B. A is the source code for compileFunction including the virus, probably encrypted in some way so it's not obvious from searching the resulting binary. This makes sure that compiling to compiler with itself will preserve the virus injection code.

B is the same for the function we want to replace with our virus. For example, it could be the function "login" in the source file "login.c" which is probably from the Linux kernel. We could replace it with a version that will accept the password "joshua" for the root account in addition to the normal password.

If you compile that and spread it as a binary, there will be no way to find the virus by looking at the source.

The original source of the idea: http://cm.bell-labs.com/who/ken/trust.html




Actually, most compilers are written in the language they compile, for the reasons stated above.

The first bootstrap compiler is usually written in C, C++ or Assembly.




Here's a dump (difficult topic to search on, actually):

This is also the idea of PyPy and Rubinius:

(I think this might also apply to Forth, but I don't know anything about Forth.)




Related