[C++] Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?


Answers

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake, and especially in this case. A human can always make better code than the compiler can, and this particular situation is a good illustration of that claim.

The time difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.

(The following code is 32-bit, but can easily be converted to 64-bit.)

For example, the sequence function can be optimized to only 5 instructions:

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1
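
For readers more comfortable in C++, here is a rough model of what those 5 instructions do (a sketch; the helper name collatz_steps is mine, and this is not compiler output). The trick is that shr writes the shifted-out low bit into CF, so cmovc picks 3n+1 exactly when n was odd, and jnz tests the ZF that shr set from the shifted result:

    #include <cstdint>

    // Rough C++ model of the 5-instruction loop (a sketch, not what a
    // compiler emits). The exit test models ZF from shr: (n >> 1) == 0
    // only when n == 1, which is exactly when jnz falls through.
    unsigned collatz_steps(uint32_t n) {
        unsigned count = 0;
        for (;;) {
            ++count;                           // inc esi
            uint32_t odd_next = 3 * n + 1;     // lea edx, [3*eax+1]
            uint32_t shifted  = n >> 1;        // shr eax, 1 (CF = old bit 0)
            bool was_odd      = n & 1;
            n = was_odd ? odd_next : shifted;  // cmovc eax, edx
            if (shifted == 0)                  // jnz .seq while result nonzero
                break;
        }
        return count;
    }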

The whole code looks like this:

include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"

start:
        InitializeAll
        mov ecx, 999999
        xor edi, edi        ; max
        xor ebx, ebx        ; max i

    .main_loop:

        xor     esi, esi
        mov     eax, ecx

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

        cmp     edi, esi
        cmovb   edi, esi
        cmovb   ebx, ecx

        dec     ecx
        jnz     .main_loop

        OutputValue "Max sequence: ", edi, 10, -1
        OutputValue "Max index: ", ebx, 10, -1

        FinalizeAll
        stdcall TerminateAll, 0

To compile this code, FreshLib is needed.

In my tests (1 GHz AMD A4-1200 processor), the above code is about four times faster than the C++ code from the question (when it is compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.

The output of both programs is the same: max sequence = 525 on i = 837799.

Question

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They take the same brute-force approach to testing the Collatz conjecture. The assembly solution was assembled with

nasm -felf64 p14.asm && gcc p14.o -o p14

The C++ was compiled with

g++ p14.cpp -o p14

Assembly, p14.asm

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C++, p14.cpp

#include <iostream>

using namespace std;

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;

        ++count;
    }

    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }

    cout << maxi << endl;
}

I know of compiler optimizations to improve speed and all that, but I don't see many ways to optimize my assembly solution further (speaking programmatically, not mathematically).

The C++ code computes a modulus on every term and a division on every even term, whereas the assembly does only a single division per even term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I'm asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on a 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).




Even without looking at the assembly, the most obvious reason is that /= 2 is probably optimized as >>= 1, and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, integer division is faster than floating-point division.

Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that modern processors have prioritized optimizing FP division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then the compiler optimizing /= 2 as >>= 1 would be the best first place to look.
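
To illustrate (a minimal sketch; the helper names are mine, and this is not any particular compiler's output): for unsigned operands, dividing by 2 is exactly a right shift, so the compiler can emit a single shr; for signed operands the division must round toward zero, which costs a couple of fix-up instructions for negative inputs.

    #include <cstdint>

    // Sketch (helper names are mine): why /= 2 can become >>= 1.
    // For unsigned types the two forms are identical bit-for-bit.
    uint64_t half_unsigned(uint64_t n) {
        return n / 2;               // same as n >> 1; typically one shr
    }

    // For signed types, n / 2 rounds toward zero while n >> 1 rounds toward
    // negative infinity, so the compiler effectively adds (n < 0) first
    // (assuming the usual arithmetic shift on negative two's-complement values).
    int64_t half_signed(int64_t n) {
        return (n + (n < 0)) >> 1;  // what n / 2 boils down to
    }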

On an unrelated note: if n is odd, the expression n*3+1 will always be even, so there is no need to check. You can change that branch to

{
   n = (n*3+1) >> 1;
   count += 2;
}

So the whole statement would then be

if (n & 1)
{
    n = (n*3 + 1) >> 1;
    count += 2;
}
else
{
    n >>= 1;
    ++count;
}



On a rather unrelated note: more performance hacks!

  • [the first «conjecture» has been finally debunked by @ShreevatsaR; removed]

  • When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):

    1. [even] [odd]
    2. [odd] [even]
    3. [even] [even]

    To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.

    Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.

    Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most- to least-significant, then a = 1, and the following holds:

    (N << 1) + N + 1:     (N >> 1) + N + 1:
    
            b10                    b1
             b1                     b
           +  1                   + 1
           ----                   ---
           bBb0                   bBb
    

    where B = !b . Right-shifting the first result gives us exactly what we want.

    QED: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.

    As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction. (Quick sanity check: for N = 3 the next two elements are 10 and 5, and (3 >> 1) + 3 + 1 = 5; for N = 4 they are 2 and 1, and 4 >> 2 = 1.)

The resulting algorithm looks like this:

#include <stdio.h>   /* printf (the original listing omitted the includes) */
#include <stdint.h>  /* uint64_t */

uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;

    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main() {
    uint64_t maxi, maxc;

    maxi = sequence(1000000, &maxc);
    printf("%llu, %llu\n", maxi, maxc);
    return 0;
}

Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.

[EDIT:]

Let's translate this into assembly!

MOV RCX, 1000000;



DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;

@main:
  XOR RSI, RSI;
  LEA RDI, [RCX + 1];

  @loop:
    ADD RSI, 2;
    LEA RDX, [RDI + RDI*2 + 2];
    SHR RDX, 1;
    SHRD RDI, RDI, 2;    ; ror rdi, 2 would do the same thing
    CMOVL RDI, RDX;      ; note: SHRD leaves OF undefined with count > 1, so this doesn't work on all CPUs
    CMOVS RDI, RDX;
    CMP RDI, 2;
  JA @loop;

  LEA RDX, [RSI + 1];
  CMOVE RSI, RDX;

  CMP RAX, RSI;
  CMOVB RAX, RSI;
  CMOVB RBX, RCX;

  SUB RCX, 2;
JA @main;



MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;

@itoa:
  XOR RDX, RDX;
  DIV RCX;
  ADD RDX, '0';
  PUSH RDX;
  TEST RAX, RAX;
JNE @itoa;

  PUSH RCX;
  LEA RAX, [RBX + 1];
  TEST RBX, RBX;
  MOV RBX, RDI;
JNE @itoa;

POP RCX;
INC RDI;
MOV RDX, RDI;

@outp:
  MOV RSI, RSP;
  MOV RAX, RDI;
  SYSCALL;
  POP RAX;
  TEST RAX, RAX;
JNE @outp;

LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;

Use these commands to compile:

nasm -f elf64 file.asm
ld -o file file.o

See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (Editor's note: sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)




For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.

Example Python implementation:

import sys

inner_loop = 0

def collatz_sequence(N, cache):
    global inner_loop

    l = []
    stop = False
    n = N

    tails = []

    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True  
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        tails.append((tmp, len(l)))

    for key, offset in tails:
        if key not in cache:
            cache[key] = l[offset:]

    return l

def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            return  # PEP 479: raising StopIteration inside a generator is a RuntimeError in Python 3.7+

if __name__ == "__main__":
    le_cache = {}

    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))

    print("inner_loop = {}".format(inner_loop))



From comments:

But, this code never stops (because of integer overflow)!?! – Yves Daoust

For many numbers it will not overflow.

If it does overflow for one of those unlucky initial seeds, the overflowed number will very likely converge toward 1 without another overflow.

Still, this poses an interesting question: is there some overflow-cyclic seed number?

Any simple finally-converging series starts with a power-of-two value (obvious enough?).

2^64 will overflow to zero, and according to the algorithm (which ends only at 1) a zero value is an undefined infinite loop; but the most optimal solution in the answer above will finish anyway, due to shr rax producing ZF=1.

Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number, so the next number is 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. That's theoretically an undefined state of the algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in an infinite loop (QED variant 1, "cheapo" through the undefined 0 number).
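
That wraparound is easy to demonstrate (a tiny sketch of mine; the constant is the one from the text above):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t n = 0x5555555555555555;  // odd, and equal to (2^64 - 1) / 3
        uint64_t next = 3 * n + 1;        // 3n = 0xFFFFFFFFFFFFFFFF; +1 wraps to 0
        printf("3n+1 = %llu\n", (unsigned long long)next);  // prints 0
        return 0;
    }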

How about some more complex number, one that would create a cycle without 0? Frankly, I'm not sure; my math theory is too hazy to get any serious idea of how to deal with it in a serious way. But intuitively I would say the series will converge to 1 for every number 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of the original number (or an intermediate) into some power of 2, sooner or later. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.

So I just put a few numbers into a spreadsheet and took a look at 8-bit truncated numbers.
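
That spreadsheet experiment is easy to reproduce in code; here is a sketch of mine that iterates the Collatz rule in uint8_t arithmetic (so 3n+1 wraps mod 256) and reports the seeds that reach 0:

    #include <cstdint>
    #include <cstdio>

    int main() {
        for (unsigned seed = 2; seed < 256; ++seed) {
            uint8_t n = (uint8_t)seed;
            // cap the walk so a hypothetical cycle can't hang the test
            for (int step = 0; step < 1000 && n > 1; ++step)
                n = (n & 1) ? (uint8_t)(3 * n + 1) : (uint8_t)(n >> 1);
            if (n == 0)
                printf("%u overflows to 0\n", seed);  // 85, 170 and 227, as noted below
        }
        return 0;
    }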

There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing toward 85).

But there's no value creating a cyclic overflow seed.

Funnily enough, I did check which is the first number to suffer from 8-bit truncation, and already 27 is affected! It reaches the value 9232 in the proper non-truncated series (the first truncated value is 322, in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself); the maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to be counted, etc...).

Interestingly enough (for me), the number 9232 is the maximum for many other source numbers; what's so special about it? :-O 9232 = 0x2410 ... hmmm... no idea.

Unfortunately I can't get any deep grasp of this series: why does it converge, and what are the implications of truncating it to k bits? But with the cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop, with a particular input value ending as 0 after truncation.

But the value 27 overflowing in the 8-bit case is sort of alarming: it looks like if you count the number of steps to reach the value 1, you will get a wrong result for the majority of numbers from the total k-bit set of integers. For 8-bit integers, 146 of the 256 numbers have their series affected by truncation (some of them may still hit the correct number of steps by accident, maybe; I'm too lazy to check).
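
A sketch of how one might verify that last count (my own code, not from the original): flag a seed as affected as soon as its true, untruncated series leaves the 8-bit range, since from that point on the truncated series must differ.

    #include <cstdint>
    #include <cstdio>

    // Count 8-bit seeds whose true Collatz series ever exceeds 255; those
    // are the series that 8-bit truncation distorts. The text above reports
    // 146 of 256 such numbers.
    int main() {
        int affected = 0;
        for (uint64_t seed = 2; seed < 256; ++seed) {
            for (uint64_t n = seed; n != 1; n = (n & 1) ? 3 * n + 1 : n / 2) {
                if (n > 255) { ++affected; break; }
            }
        }
        printf("affected seeds: %d\n", affected);
        return 0;
    }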