[C++] Collatz 추측을 테스트하기 위해이 C ++ 코드가 필자의 손으로 쓰는 어셈블리보다 더 빠른 이유는 무엇입니까?


Answers

유능한 어셈블리 언어 프로그래머보다 C ++ 컴파일러가 최적의 코드를 생성 할 수 있다고 주장하는 것은 매우 나쁜 실수입니다. 특히이 경우에. 인간은 항상 컴파일러가 할 수있는 코드를 더 잘 만들 수 있습니다.이 특별한 상황은이 주장을 잘 보여줍니다.

여러분이보고있는 타이밍 차이는 질문에있는 어셈블리 코드가 내부 루프에서 아주 멀리 떨어져 있기 때문입니다.

(아래의 코드는 32 비트이지만 쉽게 64 비트로 변환 할 수 있습니다)

예를 들어 시퀀스 함수는 5 개의 명령어로만 최적화 할 수 있습니다.

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

전체 코드는 다음과 같습니다.

include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"

start:
        InitializeAll
        mov ecx, 999999
        xor edi, edi        ; max
        xor ebx, ebx        ; max i

    .main_loop:

        xor     esi, esi
        mov     eax, ecx

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

        cmp     edi, esi
        cmovb   edi, esi
        cmovb   ebx, ecx

        dec     ecx
        jnz     .main_loop

        OutputValue "Max sequence: ", edi, 10, -1
        OutputValue "Max index: ", ebx, 10, -1

        FinalizeAll
        stdcall TerminateAll, 0

이 코드를 컴파일하려면 FreshLib 가 필요합니다.

내 테스트 (1GHz AMD A4-1200 프로세서)에서 위의 코드는 질문의 C ++ 코드보다 약 4 배 빠릅니다 ( -O0 : 430ms 대 1900ms로 컴파일 될 때). 그리고 두 배 이상 빨라졌습니다 (430 ms 대 830 ms) C ++ 코드가 -O3 으로 컴파일됩니다.

두 프로그램의 출력은 동일합니다 : max sequence = 525 on i = 837799.

Question

어셈블리와 C ++에서 Project Euler Q14에 대한이 두 가지 솔루션을 작성했습니다. 그것들은 Collatz 추측 을 테스트하기위한 똑같은 무차별 대항 접근법입니다. 조립 용액을

nasm -felf64 p14.asm && gcc p14.o -o p14

C ++는 다음과 같이 컴파일되었습니다.

g++ p14.cpp -o p14

어셈블리, p14.asm

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C ++, p14.cpp

#include <iostream>

using namespace std;

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;

        ++count;
    }

    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }

    cout << maxi << endl;
}

속도와 모든 것을 향상시키는 컴파일러 최적화에 대해 알고 있지만, 어셈블리 솔루션을 더 최적화하는 방법은 여러 가지가 없습니다 (수학적으로 프로그래밍 방식으로 말하기).

C ++ 코드는 모든 텀 (term)과 눗셈 (division)마다 모듈을 가지고 있습니다.

그러나이 어셈블리는 C ++ 솔루션보다 평균 1 초 더 오래 걸립니다. 왜 이런거야? 주로 호기심을 묻고 있습니다.

실행 시간

내 시스템 : 1.4 GHz Intel Celeron 2955U (Haswell 마이크로 아키텍처)에서 64 비트 Linux.

  • g++ (최적화되지 않음) : 평균 1272 ms

  • g++ -O3 평균 578 밀리 초

  • 원본 asm (div) 평균 2650ms

  • Asm (shr) 평균 679 MS

  • @johnfound asm , nasm avg와 조립 501 ms

  • @hidefromkgb asm avg 200 ms

  • @hidefromkgb asm @Peter Cordes에 최적화 됨 평균 사용 시간 145 ms

  • @Veedrac C ++ avg 81 ms with -O3 , 305 ms with -O0




Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.

Edit: your milage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing /=2 as >>=1 would be the best 1st place to look.

On an unrelated note , if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to

{
   n = (n*3+1) >> 1;
   count += 2;
}

So the whole statement would then be

if (n & 1)
{
    n = (n*3 + 1) >> 1;
    count += 2;
}
else
{
    n >>= 1;
    ++count;
}



For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization ( https://en.wikipedia.org/wiki/Memoization ). You could also look into dynamic programming solutions for other time/memory trade-offs.

Example python implementation:

import sys

inner_loop = 0

def collatz_sequence(N, cache):
    global inner_loop

    l = [ ]
    stop = False
    n = N

    tails = [ ]

    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True  
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        tails.append((tmp, len(l)))

    for key, offset in tails:
        if not key in cache:
            cache[key] = l[offset:]

    return l

def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            raise StopIteration

if __name__ == "__main__":
    le_cache = {}

    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))

    print("inner_loop = {}".format(inner_loop))



From comments:

But, this code never stops (because of integer overflow) !?! Yves Daoust

For many numbers it will not overflow.

If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.

Still this poses interesting question, is there some overflow-cyclic seed number?

Any simple final converging series starts with power of two value (obvious enough?).

2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to shr rax producing ZF=1.

Can we produce 2^64? If the starting number is 0x5555555555555555 , it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0 . Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).

How about some more complex number, which will create cycle without 0 ? Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.

So I just put few numbers into sheet and took a look on 8 bit truncated numbers.

There are three values overflowing to 0 : 227 , 170 and 85 ( 85 going directly to 0 , other two progressing toward 85 ).

But there's no value creating cyclic overflow seed.

Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).

Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.

Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to k bits, but with cmp number,1 terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as 0 after truncation.

But the value 27 overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value 1 , you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).




On a rather unrelated note: more performance hacks!

  • [the first «conjecture» has been finally debunked by @ShreevatsaR; removed]

  • When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):

    1. [even] [odd]
    2. [odd] [even]
    3. [even] [even]

    To leap past these 2 elements means to compute (N >> 1) + N + 1 , ((N << 1) + N + 1) >> 1 and N >> 2 , respectively.

    Let`s prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1 .

    Case (1) is obvious. Case (2) implies (N & 1) == 1 , so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1 , and the following holds:

    (N << 1) + N + 1:     (N >> 1) + N + 1:
    
            b10                    b1
             b1                     b
           +  1                   + 1
           ----                   ---
           bBb0                   bBb
    

    where B = !b . Right-shifting the first result gives us exactly what we want.

    QED: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1 .

    As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.

The resulting algorithm looks like this:

uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;

    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main() {
    uint64_t maxi, maxc;

    maxi = sequence(1000000, &maxc);
    printf("%llu, %llu\n", maxi, maxc);
    return 0;
}

Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.

[EDIT:]

Let`s translate this into assembly!

MOV RCX, 1000000;



DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;

@main:
  XOR RSI, RSI;
  LEA RDI, [RCX + 1];

  @loop:
    ADD RSI, 2;
    LEA RDX, [RDI + RDI*2 + 2];
    SHR RDX, 1;
    SHRD RDI, RDI, 2;    ror rdi,2   would do the same thing
    CMOVL RDI, RDX;      Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
    CMOVS RDI, RDX;
    CMP RDI, 2;
  JA @loop;

  LEA RDX, [RSI + 1];
  CMOVE RSI, RDX;

  CMP RAX, RSI;
  CMOVB RAX, RSI;
  CMOVB RBX, RCX;

  SUB RCX, 2;
JA @main;



MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;

@itoa:
  XOR RDX, RDX;
  DIV RCX;
  ADD RDX, '0';
  PUSH RDX;
  TEST RAX, RAX;
JNE @itoa;

  PUSH RCX;
  LEA RAX, [RBX + 1];
  TEST RBX, RBX;
  MOV RBX, RDI;
JNE @itoa;

POP RCX;
INC RDI;
MOV RDX, RDI;

@outp:
  MOV RSI, RSP;
  MOV RAX, RDI;
  SYSCALL;
  POP RAX;
  TEST RAX, RAX;
JNE @outp;

LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;

Use these commands to compile:

nasm -f elf64 file.asm
ld -o file file.o

See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt . (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)




Links