# [c++] 为什么这个C ++代码比测试Collat z猜想的手写组件更快？

声称C ++编译器可以产生比合格的汇编语言程序员更优化的代码是一个非常糟糕的错误。 特别是在这种情况下。 人们总是可以使代码更好，编译器可以，而这种特殊情况就是对这种说法的很好说明。

你看到的时间差异是因为问题中的汇编代码在内部循环中非常不理想。

（以下代码是32位，但可以轻松转换为64位）

例如，序列函数可以优化为只有5条指令：

```
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
```

整个代码如下所示：

```
include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
```

为了编译这段代码，需要FreshLib 。

在我的测试中（1 GHz AMD A4-1200处理器），上述代码比问题中的C ++代码快约四倍（用`-O0`

编译时：430毫秒比1900毫秒），速度提高了两倍以上（430毫秒比830毫秒），当C ++代码用`-O3`

编译时。

两个程序的输出相同：i = 837799上的最大序列= 525。

我在汇编和C ++中为Project Euler Q14编写了这两个解决方案。 它们是测试Collatz猜想的完全相同的蛮力方法。 该组装解决方案与组装

```
nasm -felf64 p14.asm && gcc p14.o -o p14
```

C ++编译时使用

```
g++ p14.cpp -o p14
```

大会， `p14.asm`

```
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
```

C ++，p14.cpp

```
#include <iostream>
using namespace std;
int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = n*3 + 1;
++count;
}
return count;
}
int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}
cout << maxi << endl;
}
```

我知道编译器优化以提高速度和一切，但我没有看到很多方法来进一步优化我的汇编解决方案（以编程方式不以数学方式说）。

C ++代码每个术语和每个阶段都有模数，每个阶段的汇编只有一个分区。

但是这个装配平均比C ++解决方案平均长1秒。 为什么是这样？ 主要是好奇心。

### 执行时间

我的系统：1.4 GHz Intel Celeron 2955U（Haswell微架构）上的64位Linux。

`g++`

（未优化）：平均1272毫秒`g++ -O3`

avg 578 ms原始asm（div）平均2650毫秒

`Asm (shr)`

avg 679 ms@johnfound asm ，与nasm avg 501 ms组装

@hidefromkgb asm avg 200 ms

@hidefromkgb asm由@Peter Cordes avg 145 ms 优化

@Veedrac C ++平均值为81毫秒，

`-O3`

305毫秒，`-O0`

平均值

Even without looking at assembly, the most obvious reason is that `/= 2`

is probably optimized as `>>=1`

and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.

**Edit:** your milage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing `/=2`

as `>>=1`

would be the best 1st place to look.

On an *unrelated note* , if `n`

is odd, the expression `n*3+1`

will always be even. So there is no need to check. You can change that branch to

```
{
n = (n*3+1) >> 1;
count += 2;
}
```

So the whole statement would then be

```
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}
```

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization ( https://en.wikipedia.org/wiki/Memoization ). You could also look into dynamic programming solutions for other time/memory trade-offs.

Example python implementation:

```
import sys
inner_loop = 0
def collatz_sequence(N, cache):
global inner_loop
l = [ ]
stop = False
n = N
tails = [ ]
while not stop:
inner_loop += 1
tmp = n
l.append(n)
if n <= 1:
stop = True
elif n in cache:
stop = True
elif n % 2:
n = 3*n + 1
else:
n = n // 2
tails.append((tmp, len(l)))
for key, offset in tails:
if not key in cache:
cache[key] = l[offset:]
return l
def gen_sequence(l, cache):
for elem in l:
yield elem
if elem in cache:
yield from gen_sequence(cache[elem], cache)
raise StopIteration
if __name__ == "__main__":
le_cache = {}
for n in range(1, 4711, 5):
l = collatz_sequence(n, le_cache)
print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
print("inner_loop = {}".format(inner_loop))
```

On a rather unrelated note: more performance hacks!

### [the first «conjecture» has been finally debunked by @ShreevatsaR; removed]

When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element

`N`

(shown first):- [even] [odd]
- [odd] [even]
- [even] [even]

To leap past these 2 elements means to compute

`(N >> 1) + N + 1`

,`((N << 1) + N + 1) >> 1`

and`N >> 2`

, respectively.Let`s prove that for both cases (1) and (2) it is possible to use the first formula,

`(N >> 1) + N + 1`

.Case (1) is obvious. Case (2) implies

`(N & 1) == 1`

, so if we assume (without loss of generality) that N is 2-bit long and its bits are`ba`

from most- to least-significant, then`a = 1`

, and the following holds:`(N << 1) + N + 1: (N >> 1) + N + 1: b10 b1 b1 b + 1 + 1 ---- --- bBb0 bBb`

where

`B = !b`

. Right-shifting the first result gives us exactly what we want.QED:

`(N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1`

.As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.

The resulting algorithm looks like this:

```
uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", maxi, maxc);
return 0;
}
```

Here we compare `n > 2`

because the process may stop at 2 instead of 1 if the total length of the sequence is odd.

### [EDIT:]

Let`s translate this into assembly!

```
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
@main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
@loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA @loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA @main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
@itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE @itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE @itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
@outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE @outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
```

Use these commands to compile:

```
nasm -f elf64 file.asm
ld -o file file.o
```

See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt . (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

From comments:

But, this code never stops (because of integer overflow) !?! Yves Daoust

For many numbers it will *not* overflow.

If it *will* overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.

Still this poses interesting question, is there some overflow-cyclic seed number?

Any simple final converging series starts with power of two value (obvious enough?).

2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to `shr rax`

producing ZF=1.

Can we produce 2^64? If the starting number is `0x5555555555555555`

, it's odd number, next number is then 3n+1, which is `0xFFFFFFFFFFFFFFFF + 1`

= `0`

. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The `cmp rax,1`

of Peter Cordes **will end in infinite loop** (QED variant 1, "cheapo" through undefined `0`

number).

How about some more complex number, which will create cycle without `0`

? Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.

So I just put few numbers into sheet and took a look on 8 bit truncated numbers.

There are three values overflowing to `0`

: `227`

, `170`

and `85`

( `85`

going directly to `0`

, other two progressing toward `85`

).

But there's no value creating cyclic overflow seed.

Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already `27`

is affected! It does reach value `9232`

in proper non-truncated series (first truncated value is `322`

in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is `13120`

(for the `255`

itself), maximum number of steps to converge to `1`

is about `128`

(+-2, not sure if "1" is to count, etc...).

Interestingly enough (for me) the number `9232`

is maximum for many other source numbers, what's so special about it? :-O `9232`

= `0x2410`

... hmmm.. no idea.

Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to *k* bits, but with `cmp number,1`

terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as `0`

after truncation.

But the value `27`

overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value `1`

, you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).