c++ - what - wiki denormal




Why does changing 0.1f to 0 slow down performance by 10x? (4)

Dan Neely's comment ought to be expanded into an answer:

It is not the zero constant 0.0f that is denormalized or causes a slow down, it is the values that approach zero each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)

The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating point operations on y[i] remain fast because they aren't denormalized.

Why is the extra precision lost when you add 0.1f? Because floating point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.

In short, y[i]=y[i]+0.1f; y[i]=y[i]-0.1f; isn't the no-op you might think it is.

Mystical said this as well: the content of the floats matters, not just the assembly code.

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1. (I haven't tested with other compilers.)


Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

int main() {

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.


To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.


It's due to denormalized floating-point use. How to get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these three methods that may work best in different environments:

  • Might not work in some GCC environments:

    // Requires #include <fenv.h>
    fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
    
  • Might not work in some Visual Studio environments: 1

    // Requires #include <xmmintrin.h>
    _mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
    // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
    // You might also want to use the underflow mask (1<<11)
    
  • Appears to work in both GCC and Visual Studio:

    // Requires #include <xmmintrin.h>
    // Requires #include <pmmintrin.h>
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    
  • The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here

  • Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast but I haven't been able to confirm whether this also disables denormals.1


Using gcc and applying a diff to the generated assembly yields only this difference:

73c68,69
<   movss   LCPI1_0(%rip), %xmm1
---
>   movabsq $0, %rcx
>   cvtsi2ssq   %rcx, %xmm1
81d76
<   subss   %xmm1, %xmm0

The cvtsi2ssq one being 10 times slower indeed.

Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

Update

Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once eliminated (using a int ai=0;float a=ai; and using a instead of 0), the speed difference remains. So @Mysticial is right, the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loops suddenly takes 10 times as long.

Update<<1

A small visualisation of this interesting phenomenon:

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms

An equivalent discussion about ARM can be found in Stack Overflow question Denormalized floating point in Objective-C?.





floating-point