c++ linker - How does the compilation/linking process work?







On the standard front:

  • a translation unit is the combination of a source file and its included headers and source files, less any source lines skipped by conditional inclusion preprocessor directives.

  • the standard defines 9 phases in the translation. The first four correspond to preprocessing, the next three are the compilation, the next one is the instantiation of templates (producing instantiation units) and the last one is the linking.

In practice the eighth phase (the instantiation of templates) is often done during the compilation process, but some compilers delay it to the linking phase and some split it between the two.

How does the compilation and linking process work?

(Note: This is meant to be an entry to Stack Overflow's C++ FAQ. If you want to critique the idea of providing an FAQ in this form, then the posting on meta that started all this would be the place to do that. Answers to that question are monitored in the C++ chatroom, where the FAQ idea started out in the first place, so your answer is very likely to get read by those who came up with the idea.)




The compilation of a C++ program involves three steps:

  1. Preprocessing: the preprocessor takes a C++ source code file and deals with the #includes, #defines and other preprocessor directives. The output of this step is a "pure" C++ file without pre-processor directives.

  2. Compilation: the compiler takes the pre-processor's output and produces an object file from it.

  3. Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.

Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), replacing macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.

The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the operator ## enables merging two tokens when it makes sense).

After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.

Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure C++ source code (now without any preprocessor directives) and converts it into assembly code. It then invokes the underlying back end (the assembler in the toolchain), which assembles that code into machine code, producing an actual binary file in some format (ELF, COFF, a.out, ...). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don't provide a definition for it. The compiler doesn't mind this, and will happily produce the object file as long as the source code is well-formed.

Compilers usually let you stop compilation at this point. This is very useful because with it you can compile each source code file separately. The advantage this provides is that you don't need to recompile everything if you only change a single file.

The produced object files can be put in special archives called static libraries, for easier reusing later on.

It's at this stage that "regular" compiler errors, like syntax errors or failed overload resolution errors, are reported.

Linking

The linker is what produces the final compilation output from the object files the compiler produced. This output can be either a shared (or dynamic) library (and while the name is similar, they haven't got much in common with static libraries mentioned earlier) or an executable.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don't exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.







This topic is discussed at CProgramming.com:
https://www.cprogramming.com/compilingandlinking.html

Here is what the author there wrote:

Compiling isn't quite the same as creating an executable file! Instead, creating an executable is a multistage process divided into two components: compilation and linking. In reality, even if a program "compiles fine" it might not actually work because of errors during the linking phase. The total process of going from source code files to an executable might better be referred to as a build.

Compilation

Compilation refers to the processing of source code files (.c, .cc, or .cpp) and the creation of an 'object' file. This step doesn't create anything the user can actually run. Instead, the compiler merely produces the machine language instructions that correspond to the source code file that was compiled. For instance, if you compile (but don't link) three separate files, you will have three object files created as output, each with the name .o or .obj (the extension will depend on your compiler). Each of these files contains a translation of your source code file into a machine language file -- but you can't run them yet! You need to turn them into executables your operating system can use. That's where the linker comes in.

Linking

Linking refers to the creation of a single executable file from multiple object files. In this step, it is common that the linker will complain about undefined functions (commonly, main itself). During compilation, if the compiler could not find the definition for a particular function, it would just assume that the function was defined in another file. If this isn't the case, there's no way the compiler would know -- it doesn't look at the contents of more than one file at a time. The linker, on the other hand, may look at multiple files and try to find references for the functions that weren't mentioned.

You might ask why there are separate compilation and linking steps. First, it's probably easier to implement things that way. The compiler does its thing, and the linker does its thing -- by keeping the functions separate, the complexity of the program is reduced. Another (more obvious) advantage is that this allows the creation of large programs without having to redo the compilation step every time a file is changed. Instead, using so called "conditional compilation", it is necessary to compile only those source files that have changed; for the rest, the object files are sufficient input for the linker. Finally, this makes it simple to implement libraries of pre-compiled code: just create object files and link them just like any other object file. (The fact that each file is compiled separately from information contained in other files, incidentally, is called the "separate compilation model".)

To get the full benefits of conditional compilation, it's probably easier to get a program to help you than to try and remember which files you've changed since you last compiled. (You could, of course, just recompile every file that has a timestamp greater than the timestamp of the corresponding object file.) If you're working with an integrated development environment (IDE) it may already take care of this for you. If you're using command line tools, there's a nifty utility called make that comes with most *nix distributions. Along with conditional compilation, it has several other nice features for programming, such as allowing different compilations of your program -- for instance, if you have a version producing verbose output for debugging.

Knowing the difference between the compilation phase and the link phase can make it easier to hunt for bugs. Compiler errors are usually syntactic in nature -- a missing semicolon, an extra parenthesis. Linking errors usually have to do with missing or multiple definitions. If you get an error that a function or variable is defined multiple times from the linker, that's a good indication that the error is that two of your source code files have the same function or variable.




GCC compiles a C/C++ program into an executable in 4 steps.

For example, a "gcc -o hello.exe hello.c" is carried out as follows:

1. Pre-processing

Preprocessing via the GNU C Preprocessor (cpp.exe), which includes the headers (#include) and expands the macros (#define).

cpp hello.c > hello.i

The resultant intermediate file "hello.i" contains the expanded source code.

2. Compilation

The compiler compiles the pre-processed source code into assembly code for a specific processor.

gcc -S hello.i

The -S option specifies to produce assembly code, instead of object code. The resultant assembly file is "hello.s".

3. Assembly

The assembler (as.exe) converts the assembly code into machine code in the object file "hello.o".

as -o hello.o hello.s

4. Linker

Finally, the linker (ld.exe) links the object code with the library code to produce an executable file "hello.exe".

ld -o hello.exe hello.o ...libraries...




The destructor of an object is called automatically when the object's lifespan ends and it is destroyed. You should not usually call it manually.

We will use this object as an example:

#include <iostream>

class Test
{
    public:
        Test()                           { std::cout << "Created    " << this << "\n";}
        ~Test()                          { std::cout << "Destroyed  " << this << "\n";}
        Test(Test const& rhs)            { std::cout << "Copied     " << this << "\n";}
        Test& operator=(Test const& rhs) { std::cout << "Assigned   " << this << "\n"; return *this;}
};

There are three (four in C++11) distinct types of object in C++ and the type of the object defines the object's lifespan.

  • Static Storage duration objects
  • Automatic Storage duration objects
  • Dynamic Storage duration objects
  • (In C++11) Thread Storage duration objects

Static Storage duration objects

These are the simplest and equate to global variables. The lifespan of these objects is (usually) the length of the application. These are (usually) constructed before main is entered and destroyed (in the reverse order of being created) after we exit main.

Test  global;
int main()
{
    std::cout << "Main\n";
}

> ./a.out
Created    0x10fbb80b0
Main
Destroyed  0x10fbb80b0

Note 1: There are two other types of static storage duration object.

static member variables of a class.

To all intents and purposes these are the same as global variables in terms of lifespan.

static variables inside a function.

These are lazily created static storage duration objects. They are created on first use (in a thread-safe manner since C++11). Just like other static storage duration objects they are destroyed when the application ends.

Order of construction/destruction

  • The order of construction within a compilation unit is well defined and the same as declaration.
  • The order of construction between compilation units is not specified.
  • The order of destruction is the exact inverse of the order of construction.

Automatic Storage duration objects

These are the most common type of objects and what you should be using 99% of the time.

There are three main types of automatic variables:

  • local variables inside a function/block
  • member variables inside a class/array.
  • temporary variables.

Local Variables

When a function/block is exited all variables declared inside that function/block will be destroyed (in the reverse order of creation).

int main()
{
     std::cout << "Main() START\n";
     Test   scope1;
     Test   scope2;
     std::cout << "Main Variables Created\n";


     {
           std::cout << "\nblock 1 Entered\n";
           Test blockScope;
           std::cout << "block 1 about to leave\n";
     } // blockScope is destroyed here

     {
           std::cout << "\nblock 2 Entered\n";
           Test blockScope;
           std::cout << "block 2 about to leave\n";
     } // blockScope is destroyed here

     std::cout << "\nMain() END\n";
}// All variables from main destroyed here.

> ./a.out
Main() START
Created    0x7fff6488d938
Created    0x7fff6488d930
Main Variables Created

block 1 Entered
Created    0x7fff6488d928
block 1 about to leave
Destroyed  0x7fff6488d928

block 2 Entered
Created    0x7fff6488d918
block 2 about to leave
Destroyed  0x7fff6488d918

Main() END
Destroyed  0x7fff6488d930
Destroyed  0x7fff6488d938

member variables

The lifespan of a member variable is bound to the object that owns it. When an owner's lifespan ends, the lifespan of all its members also ends. So you need to look at the lifetime of the owner, which obeys the same rules.

Note: Members are always destroyed as part of destroying the owner, in reverse order of creation; their destructors run after the body of the owner's destructor.

  • Thus for class members they are created in the order of declaration
    and destroyed in the reverse order of declaration
  • Thus for array members they are created in order 0-->top
    and destroyed in the reverse order top-->0

temporary variables

These are objects that are created as the result of an expression but are not assigned to a variable. Temporary variables are destroyed just like other automatic variables. It is just that the end of their scope is the end of the statement in which they are created (this is usually the ';').

std::string   data("Text.");

std::cout << (data + "1"); // Here we create a temporary object.
                           // Which is a std::string with "1" appended to "Text."
                           // This object is streamed to the output
                           // Once the statement has finished it is destroyed.
                           // So the temporary no longer exists after the ';'

Note: There are situations where the life of a temporary can be extended, but this is not relevant to this simple discussion. By the time you understand why, this document will be second nature to you; before then, extending the life of a temporary is not something you want to do.

Dynamic Storage duration objects

These objects have a dynamic lifespan and are created with new and destroyed with a call to delete.

int main()
{
    std::cout << "Main()\n";
    Test*  ptr = new Test();
    delete ptr;
    std::cout << "Main Done\n";
}

> ./a.out
Main()
Created    0x1083008e0
Destroyed  0x1083008e0
Main Done

For devs that come from garbage collected languages this can seem strange (managing the lifespan of your object). But the problem is not as bad as it seems. It is unusual in C++ to use dynamically allocated objects directly. We have management objects to control their lifespan.

The closest thing to an object in most garbage-collected languages is std::shared_ptr. It keeps track of the number of users of a dynamically created object and, when all of them are gone, calls delete automatically (I think of this as a better version of a normal Java object).

int main()
{
    std::cout << "Main Start\n";
    std::shared_ptr<Test>  smartPtr(new Test());
    std::cout << "Main End\n";
} // smartPtr goes out of scope here.
  // As there are no other copies it will automatically call delete on the object
  // it is holding.

> ./a.out
Main Start
Created    0x1083008e0
Main End
Destroyed  0x1083008e0

Thread Storage duration objects

These were added to the language in C++11. They are very much like static storage duration objects, but rather than living as long as the application they live as long as the thread of execution they are associated with.






