c++ - Compiler optimization: g++ slower than intel





It looks like you're using OpenMP, and so I suspect the difference is in the OpenMP implementation, not just the quality of the optimized code.

Intel's OpenMP runtime is known to be quite high performance, and GCC's is good but not great.

OpenMP programs have very different performance characteristics; their speed doesn't depend only on how well the compiler can optimize loops or inline function calls. The implementation of the OpenMP runtime matters a lot, as does the OS implementation of threads and synchronization primitives, which are quite different between Windows and GNU/Linux.
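To separate OpenMP runtime overhead from code-generation quality, you can compile the same tiny kernel with both compilers and time it under each runtime. A minimal sketch (assumes OpenMP is enabled with `-fopenmp` on g++ or `/openmp` on icl; the function also works serially if the pragma is ignored):

```cpp
#include <vector>

// Sum a vector with an OpenMP reduction. Timing this under each
// compiler isolates the runtime's fork/join and reduction overhead
// from the rest of the program.
double parallel_sum(const std::vector<double>& v) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(v.size()); i++)
        sum += v[i];
    return sum;
}
```

Wrapping the call in `omp_get_wtime()` and varying `OMP_NUM_THREADS` then gives a direct runtime-vs-runtime comparison.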

I recently acquired a dual-boot computer to code in C++. On Windows I use the Intel C++ compiler, and on Linux I use g++. My programs consist mostly of computation (a fixed-point iteration algorithm with numerical integration, etc.).
I thought I could get performance on Linux close to what I get on Windows, but so far I don't: for the exact same code, the program compiled with g++ is about 2 times slower than the one compiled with icc. From what I've read, icc can be faster, with gains of maybe up to 20-30%, but I haven't read anything about it being twice as fast (and in general I've actually read that the two should be roughly equivalent).

At first I was using flags which are approximately equivalent:

icl /openmp /I "C:\boost_1_61_0" /fast program.cpp

and

g++ -o program program.cpp -std=c++11 -fopenmp -O3 -ffast-math

Following advice from several other threads, I tried adding/replacing several other flags, such as -funsafe-math-optimizations, -march=native, -fwhole-program, -Ofast, etc., with only slight (or no) performance gains.

Is icc really faster, or am I missing something? I'm fairly new to Linux, so I don't know; maybe I forgot to install something properly (a driver?), or to change some option in g++. I have no idea whether this situation is normal, which is why I'm asking. Ideally I'd prefer to code on Linux, so I'd like it to be up to speed.

EDIT: I decided to install the latest Intel compiler (Intel C++ Compiler 17, update 4) on Linux to check. I ended up with mixed results: it does NOT do better than gcc (in fact it does worse). I ran a cross comparison Linux/Windows, icc/gcc, parallelized or not, using the flags mentioned earlier in the post (to make direct comparisons). Here are my results (time to run 1 iteration, measured in ms):

  1. Plain loop, no parallelization:

    • Windows:
      gcc = 122074 ; icc = 68799
    • Linux:
      gcc = 91042 ; icc = 92102
  2. Parallelized version:

    • Windows:
      gcc = 27457 ; icc = 19800
    • Linux:
      gcc = 27000 ; icc = 30000

To sum up: it's a bit of a mess. On Linux, gcc seems to always be faster than icc, especially when parallelization is involved (I ran it on a longer program; the difference is much larger than the one shown here).
On Windows it's the opposite, and icc clearly dominates gcc, especially when there is no parallelization (in which case gcc also takes a really long time to compile).

The fastest runs come from the parallelized icc build on Windows. I don't understand why I cannot replicate this on Linux. Is there anything I need to do (Ubuntu 16.04) to speed things up?
The other difference is that on Windows I use an older Intel Composer (Composer XE 2013) and call 'ia32' instead of intel64 (which is the one I should be using), while on Linux I use the latest version, which I installed yesterday. Also, on Linux the Intel Compiler 17 folder is on my second HDD (not the SSD on which Linux is installed); I don't know whether this might slow things down too.
Any idea where the problem may come from?

Edit: Exact hardware: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz, 8 logical CPUs (4 cores, 2 threads per core), architecture x86_64. Linux: Ubuntu 16.04 with gcc 5.4.1 and Intel Compiler 17 (update 4). Windows: Windows 8.1 with Intel Composer 2013.

Edit: The code is very long; here is the form of the loop that I'm testing (i.e., just one iteration of my fixed-point iteration). It's fairly classic, I guess; not sure it adds anything to the topic.

// initialization of all the objects...
// length_grid1 is about 2000
vector<double> V_NEXT(length_grid1), PRICE_NEXT(length_grid1);
double V_min, price_min;
#pragma omp parallel
{
#pragma omp for private(V_min, price_min, i, indexcurrent, alpha, beta)
    for (i = 0; i < length_grid1; i++) {
        indexcurrent = indexsum[i];
        V_min = V_compute(&price_min, indexcurrent, ...);
        V_NEXT[indexcurrent] = V_min;
        PRICE_NEXT[indexcurrent] = price_min;
    }
} // end parallel
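As an aside, declaring the per-iteration variables inside the loop body makes them private automatically, so there is no `private` clause to maintain and no risk of accidental sharing. A minimal self-contained sketch of that shape (`stub_compute` is a hypothetical stand-in for the real `V_compute`, whose extra arguments are elided in the question):

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for V_compute: returns a value and writes
// the minimizer through xmin, mirroring the real function's shape.
static double stub_compute(double* xmin, int row) {
    *xmin = row * 0.5;
    return std::sqrt(static_cast<double>(row));
}

std::vector<double> run_iteration(int length_grid1) {
    std::vector<double> V_NEXT(length_grid1), PRICE_NEXT(length_grid1);
#pragma omp parallel for
    for (int i = 0; i < length_grid1; i++) {
        double price_min;  // private by construction: declared in the loop body
        double V_min = stub_compute(&price_min, i);
        V_NEXT[i] = V_min;
        PRICE_NEXT[i] = price_min;
    }
    return V_NEXT;
}
```

The combined `parallel for` pragma also replaces the separate `parallel` and `for` pragmas without changing behavior.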

where the V_compute function is a classic, simple optimization algorithm (a customized golden-section search) returning the optimal value and its argument:

double V_compute(double *xmin, int row_index, ...) {
    double x1, x2, f1, f2, fxmin;
    // golden_ratio = 0.61803399;
    x1 = upper_bound - golden_ratio * (upper_bound - lower_bound);
    x2 = lower_bound + golden_ratio * (upper_bound - lower_bound);

    // Evaluate the function at the test points
    f1 = intra_value(x1, row_index, ...);
    f2 = intra_value(x2, row_index, ...);

    while (fabs(upper_bound - lower_bound) > tolerance) {
        if (f2 > f1) {
            upper_bound = x2; x2 = x1; f2 = f1;
            x1 = upper_bound - golden_ratio * (upper_bound - lower_bound);
            f1 = intra_value(x1, row_index, ...);
        } else {
            lower_bound = x1; x1 = x2; f1 = f2;
            x2 = lower_bound + golden_ratio * (upper_bound - lower_bound);
            f2 = intra_value(x2, row_index, ...);
        }
    }
    // Estimated minimizer = (lower bound + upper bound) / 2
    *xmin = (lower_bound + upper_bound) / 2;
    fxmin = intra_value(*xmin, row_index, ...);
    return -fxmin;
}
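For reference, here is a self-contained, runnable version of the same golden-section step, minimizing an arbitrary callable instead of intra_value (the function name and signature are my own; the bracketing logic follows the code above):

```cpp
#include <cmath>

// Golden-section minimizer: shrinks [lo, hi] around the minimum of a
// unimodal function f until the bracket is narrower than tol.
template <typename F>
double golden_min(F f, double lo, double hi, double tol, double* xmin) {
    const double golden_ratio = 0.61803399;
    double x1 = hi - golden_ratio * (hi - lo);
    double x2 = lo + golden_ratio * (hi - lo);
    double f1 = f(x1), f2 = f(x2);
    while (std::fabs(hi - lo) > tol) {
        if (f2 > f1) {  // minimum lies in [lo, x2]
            hi = x2; x2 = x1; f2 = f1;
            x1 = hi - golden_ratio * (hi - lo);
            f1 = f(x1);
        } else {        // minimum lies in [x1, hi]
            lo = x1; x1 = x2; f1 = f2;
            x2 = lo + golden_ratio * (hi - lo);
            f2 = f(x2);
        }
    }
    *xmin = (lo + hi) / 2.0;  // midpoint of the final bracket
    return f(*xmin);
}
```

Each iteration shrinks the bracket by a factor of about 0.618 and costs one evaluation of f, so with an expensive f (like intra_value here) almost all the time goes into those calls, not the search logic itself.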

The function being optimized (intra_value) is quite heavy computationally: it picks a grid point (row_index) from a precomputed grid, then involves a lot of numerical integration, etc.


Note that -ffast-math breaks some language rules to get fast code and may produce incorrect results in some cases.

Also note that -O3 is not guaranteed to be faster than -O2 or any of the other optimization levels (it depends on your code) - you should test multiple versions.

You may also want to pass -Wl,-O1; the linker can do some optimizations too.

You may also want to try building with LTO (link time optimization) - it can often yield significant improvements.

I realize this does not answer your question as such. But it should give you some things to play with :-)

Also, gcc is improving pretty fast. You may want to try a newer version if you are not already on 7.1. Also, try Clang for a third data point. Additionally, you can use icc on Linux if you want to.





compiler-optimization