When is assembly faster than C?





One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that performs better than what you would write in a high-level language, C in particular. However, I've also heard it said many times that, although this isn't entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions are machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric they are, since it seems to be a point of some contention.




Given the right programmer, Assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to create a C program where you couldn't take out at least one instruction of the Assembler.


How about creating machine code at run-time?

My brother once (around 2000) implemented an extremely fast real-time ray-tracer by generating code at run-time. I can't remember the details, but there was some kind of main module which looped through objects, then prepared and executed some machine code which was specific to each object.

However, over time, this method was overtaken by new graphics hardware, and it became useless.

Today, I think that possibly some operations on big-data (millions of records) like pivot tables, drilling, calculations on-the-fly, etc. could be optimized with this method. The question is: is the effort worth it?
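As a minimal sketch of the run-time code generation idea (assuming Linux/x86-64 and the POSIX mmap API; the ray-tracer above would have used a far more elaborate scheme), here is a tiny function emitted as raw bytes and then called:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 machine code for: mov eax, imm32; ret */
    uint8_t code[] = { 0xB8, 0, 0, 0, 0,    /* mov eax, imm32 (immediate patched below) */
                       0xC3 };              /* ret */
    int32_t value = 42;
    memcpy(code + 1, &value, sizeof value); /* patch the 32-bit immediate (little-endian) */

    /* Note: hardened systems enforcing W^X may reject a writable+executable
       mapping; real JIT engines map writable first and then flip to executable. */
    void *page = mmap(NULL, sizeof code,
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return 1;
    memcpy(page, code, sizeof code);

    int (*fn)(void) = (int (*)(void))page;  /* treat the page as a function */
    printf("%d\n", fn());                   /* prints 42 */
    munmap(page, sizeof code);
    return 0;
}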


I have a bit-transposition operation that needs to be done on 192 or 256 bits every interrupt, which happens every 50 microseconds.

It happens by a fixed map (hardware constraints). Using C, it took around 10 microseconds. When I translated it to assembler, taking into account the specific features of this map, specific register caching, and bit-oriented operations, it took less than 3.5 microseconds to perform.
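For context, a hedged sketch of what the C baseline for such a fixed-map transposition might look like; the bit-reversal map below is purely illustrative, since the real map comes from the hardware:

#include <stdint.h>
#include <string.h>

#define NBITS 192
static uint8_t bit_map[NBITS];  /* bit_map[i] = source bit index for output bit i */

static void init_map(void)
{
    for (int i = 0; i < NBITS; i++)
        bit_map[i] = (uint8_t)(NBITS - 1 - i);  /* placeholder mapping, not the real one */
}

static void transpose_bits(const uint8_t *src, uint8_t *dst)
{
    memset(dst, 0, NBITS / 8);
    for (int i = 0; i < NBITS; i++) {
        int s = bit_map[i];
        if ((src[s >> 3] >> (s & 7)) & 1)       /* read mapped source bit */
            dst[i >> 3] |= (uint8_t)(1u << (i & 7));  /* set output bit */
    }
}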


I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The reason assembly may be slower is that people who write such slow assembly haven't read the Optimization Manual.

In the good old days of the Intel 80286, each instruction executed in a fixed number of CPU cycles. But since the Pentium Pro, released in 1995, Intel processors have been superscalar, using complex pipelining: out-of-order execution and register renaming. Before that, on the Pentium, produced in 1993, there were U and V pipelines: dual pipelines that could execute two simple instructions in one clock cycle if they didn't depend on one another. But that was nothing compared to the out-of-order execution and register renaming that appeared in the Pentium Pro and have remained largely unchanged to this day.

To explain it in a few words: the fastest code is code in which instructions do not depend on previous results. For example, you should always clear whole registers (with movzx) and use add rax, 1 instead of inc rax to remove the dependency on the previous state of the flags, and so on.

You can read more on out-of-order execution and register renaming if time permits; there is plenty of information available on the Internet.

There are also other important issues like branch prediction, the number of load and store units, the number of ports that execute micro-ops, and so on, but the most important thing to consider is out-of-order execution.

Most people are simply not aware of out-of-order execution, so they write their assembly programs as if for the 80286, expecting each instruction to take a fixed time to execute regardless of context. Meanwhile, C compilers are aware of out-of-order execution and generate code accordingly. That's why the code of such unaware people is slower; but if you become aware, your code will be faster.
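To make the dependency point concrete, here is a minimal C sketch: both functions compute the same sum, but the second splits the single dependency chain into four independent accumulators that an out-of-order core can execute in parallel:

#include <stddef.h>

double sum_chained(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                 /* every add waits on the previous one */
    return s;
}

double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* four independent dependency chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)             /* handle the remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}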


It all depends on your workload.

For day-to-day operations, C and C++ are just fine, but there are certain workloads (any transform involving video - compression, decompression, image effects, etc.) that pretty much require assembly to be performant.

They also usually involve using CPU-specific instruction set extensions (MME/MMX/SSE/whatever) that are tuned for those kinds of operations.


Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to the code, spending your time allocating registers, optimizing a few spills away, and whatnot, the compiler will win every single time. You make your modification to the code, recompile, and measure. Repeat if necessary.

Also, you can do a lot on the high-level side. And inspecting the resulting assembly may give the IMPRESSION that the code is crap when in practice it will run faster than what you think would be quicker. Example:

int y = data[i];
// do some stuff here..
call_function(y, ...);

The compiler will read the data, push it to the stack (spill), and later read it back from the stack and pass it as an argument. Sounds shite? It might actually be very effective latency compensation and result in a faster runtime.

// optimized version
call_function(data[i], ...); // not so optimized after all..

The idea behind the optimized version was that we reduce register pressure and avoid spilling. But in truth, the "shitty" version was faster!

Looking at assembly code, just counting instructions and concluding "more instructions, slower" would be a misjudgment.

The thing to pay attention to here is that many assembly experts think they know a lot but know very little. The rules also change from one architecture to the next. There is no silver-bullet x86 code, for example, that is always the fastest. These days it's better to go by rules of thumb:

  • memory is slow
  • cache is fast
  • try to use the cache better
  • how often are you going to miss? do you have a latency-compensation strategy?
  • you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
  • application architecture is important..
  • .. but it doesn't help when the problem isn't in the architecture

Also, trusting the compiler to magically transform poorly-thought-out C/C++ code into "theoretically optimal" code is wishful thinking. You have to know the compiler and toolchain you use if you care about "performance" at this low level.

For starters, compilers for C/C++ are generally not very good at reordering sub-expressions, because functions can have side effects. Functional languages don't suffer from this caveat, but they don't fit the current ecosystem that well. There are compiler options that allow relaxed precision rules, letting the order of operations be changed by the compiler/linker/code generator.

This topic is a bit of a dead end: for most people it's not relevant, and the rest already know what they are doing anyway.

It all boils down to this: "understanding what you are doing" is a bit different from knowing what you are doing.


More often than you'd think, C needs to do things that seem unnecessary from an assembly coder's point of view, just because the C standard says so.

Integer promotion, for example. If you want to shift a char variable in C, one would usually expect the code to do in fact just that: a single bit shift.

The standard, however, requires the compiler to sign-extend to int before the shift and truncate the result to char afterwards, which can complicate the code depending on the target processor's architecture.
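A small illustration (the values are easy to check by hand):

#include <stdio.h>

int main(void)
{
    unsigned char c = 0x80;
    int wide = c << 1;                              /* c is promoted to int first: 0x100 */
    unsigned char narrow = (unsigned char)(c << 1); /* truncated back to char: 0 */
    printf("%#x %#x\n", wide, narrow);              /* prints 0x100 0 */
    return 0;
}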


One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (explained in detail here):

add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps

Nowadays most compilers expose advanced CPU-specific instructions as intrinsics, i.e., functions that compile down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform-specific instructions. Visual C++ can also take advantage of the actual architecture you are targeting with the appropriate /ARCH setting.
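For example, a minimal sketch in the intrinsics style described above, using the SSE intrinsics from <xmmintrin.h> (available in MSVC, GCC, and Clang); n being a multiple of 4 is an assumption to keep the loop short:

#include <xmmintrin.h>

void add_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {            /* 4 floats per 128-bit operation */
        __m128 va = _mm_loadu_ps(a + i);        /* unaligned load of a[i..i+3] */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
}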


The simple answer... One who knows assembly well (i.e., has the reference beside him and takes advantage of every little processor cache and pipeline feature, etc.) is guaranteed to be capable of producing much faster code than any compiler.

However the difference these days just doesn't matter in the typical application.


Tight loops, as when playing with images, since an image may consist of millions of pixels. Sitting down and figuring out how to make the best use of the limited number of processor registers can make a difference. Here's a real-life sample:

http://danbystrom.se/2008/12/22/optimizing-away-ii/

Then, processors often have esoteric instructions that are too specialized for a compiler to bother with, but that an assembly programmer can occasionally put to good use. Take the XLAT instruction, for example: really great if you need to do table look-ups in a loop and the table is limited to 256 bytes! (A C sketch of the equivalent loop is shown below.)
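For reference, a minimal C sketch of the XLAT-style loop; whether the compiler actually emits xlat is up to it, and the table contents are whatever your application needs:

#include <stdint.h>
#include <stddef.h>

void translate(uint8_t *buf, size_t n, const uint8_t table[256])
{
    for (size_t i = 0; i < n; i++)
        buf[i] = table[buf[i]];    /* one 256-byte table look-up per byte */
}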

Updated: Oh, and come to think of what's most crucial when we speak of loops in general: the compiler often has no clue how many iterations will be the common case! Only the programmer knows whether a loop will be iterated MANY times, so that it's beneficial to prepare for it with some extra work, or whether it will be iterated so few times that the set-up actually takes longer than the iterations save.


gcc has become a widely used compiler. Its optimizations in general are not that good: far better than the average programmer writing assembler, but for real performance, not that good. There are compilers that are simply incredible in the code they produce. So, as a general answer, there are going to be many places where you can go into the compiler's output and tweak the assembler for performance, and/or simply rewrite the routine from scratch.


A use case that may no longer apply, but for your nerdy pleasure: on the Amiga, the CPU and the graphics/audio chips would fight over access to a certain area of RAM (the first 2MB of RAM, specifically). So when you had only 2MB of RAM (or less), displaying complex graphics and playing sound would kill the CPU's performance.

In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e., when the bus was free). So, by reordering your instructions and making clever use of the CPU cache and the bus timing, you could achieve effects that were simply not possible in any higher-level language, because you had to time every single command, even inserting NOPs here and there to keep the various chips out of each other's way.

Which is another reason why the CPU's NOP (no operation - do nothing) instruction can actually make your whole application run faster.

[Edit] Of course, this technique depends on a specific hardware setup. Which is the main reason why many Amiga games couldn't cope with faster CPUs: the timing of the instructions was off.


Only when using certain special-purpose instruction sets that the compiler doesn't support.

To maximize the computing power of a modern CPU with multiple pipelines and predictive branching, you need to structure the assembly program in a way that makes it a) almost impossible for a human to write and b) even harder to maintain.

Also, better algorithms, data structures, and memory management will give you more performance than any micro-optimization you can do in assembly.


In my job, there are three reasons for me to know and use assembly. In order of importance:

  1. Debugging - I often receive library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#; looking at the assembly gets past that.

  2. Optimizing - the compiler does fairly well at optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code like this:

    for (int y=0; y < imageHeight; y++) {
        for (int x=0; x < imageWidth; x++) {
           // do something
        }
    }
    

    The "do something" part typically happens on the order of several million times (i.e., between 3 and 30 million). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C so it's naturally better (a better algorithm, less load in the loop, etc.). I usually need to read assembly to see what's going on, and rarely need to write it. I do this maybe every two or three months.

  3. Doing something the language won't let me do. These include getting at the processor architecture and specific processor features, accessing flags inside the CPU (man, I really wish C gave you access to the carry flag), and so on. I do this maybe once every year or two.


Although C is "close" to low-level manipulation of 8-, 16-, 32-, and 64-bit data, there are a few mathematical operations that C doesn't support which can be performed elegantly in certain assembly instruction sets:

  1. Fixed-point multiplication: the product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number - the lower half in both cases. If you want the upper half of a 16x16 multiply or of a 32x32 multiply, you have to play games with the compiler. The general approach is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:

    int16_t x, y;
    // int16_t is a typedef for "short"
    // set x and y to something
    int16_t prod = (int16_t)(((int32_t)x*y)>>16);
    

    In this case the compiler may be smart enough to know that you're really just trying to get the upper half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's overkill because you only need 16 bits of the product - but the C standard doesn't give you any way to express yourself.

  2. Certain shifting operations (rotations/carries):

    // 256-bit array shifted right in its entirety:
    uint8_t x[32];
    for (int i = 32; --i > 0; )
    {
       x[i] = (x[i] >> 1) | (x[i-1] << 7);
    }
    x[0] >>= 1;
    

    This isn't too inelegant in C, but unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets let you rotate or shift left/right with the result going through the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit rotates-through-carry, using auto-increment on the pointer.

    For another example, there are linear feedback shift registers (LFSRs) that are performed elegantly in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), then if the resulting carry is 1, XOR in the bit pattern that represents the polynomial. (A minimal C sketch of that step follows this list.)
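A minimal C sketch of that LFSR step for a single 32-bit word; the polynomial below is the reversed CRC-32 polynomial, used here only as a familiar example:

#include <stdint.h>

uint32_t lfsr_step(uint32_t state)
{
    uint32_t carry = state & 1u;   /* bit shifted out on the right */
    state >>= 1;
    if (carry)
        state ^= 0xEDB88320u;      /* XOR in the polynomial when the carry is 1 */
    return state;
}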

Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with serious costs.

Edit: 3. Overflow detection is possible in assembly (you can't really do it in C), which makes some algorithms much easier.
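For what it's worth, GCC and Clang nowadays expose checked arithmetic as builtins, which recovers some of this without dropping to assembly; a minimal sketch using __builtin_add_overflow:

#include <stdio.h>

int main(void)
{
    int sum;
    /* returns nonzero if the addition overflowed int */
    if (__builtin_add_overflow(2000000000, 2000000000, &sum))
        puts("overflow detected");
    else
        printf("%d\n", sum);
    return 0;
}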


I'm surprised no one has said this: the strlen() function is much faster if written in assembly! In C, the best you can do is

int c;
for(c = 0; str[c] != '\0'; c++) {}

while in assembly you can speed it up considerably:

mov esi, offset string  ; esi = current position
mov edi, esi            ; edi = start of string
xor ecx, ecx            ; cl = 0, the terminator to compare against

lp:
mov ax, word ptr [esi]      ; load bytes n and n+1
cmp al, cl                  ; byte n == 0?
je  end_1
cmp ah, cl                  ; byte n+1 == 0?
je end_2
mov bx, word ptr [esi + 2]  ; load bytes n+2 and n+3
cmp bl, cl                  ; byte n+2 == 0?
je end_3
cmp bh, cl                  ; byte n+3 == 0?
je end_4
add esi, 4
jmp lp

end_4:
inc esi

end_3:
inc esi

end_2:
inc esi

end_1:
mov ecx, esi
sub ecx, edi            ; length = address of terminator - start

The length is in ecx. This compares 4 characters at a time, so it's about 4 times faster. And consider that by using the high-order words of eax and ebx, it could become 8 times faster than the previous C routine.


The short answer? Sometimes.

Technically speaking, every abstraction has a cost, and a programming language is an abstraction of how the CPU works. C, however, is very close. Years ago, I remember laughing out loud when I logged into my UNIX account and got the following fortune message (back when such things were popular):

The C Programming Language - A language which combines the flexibility of assembly language with the power of assembly language.

It's funny because it's true: C is like a portable assembly language.

It's worth noting that assembly language just runs however you write it. There is, however, a compiler between C and the assembly it generates, and that is extremely important, because how fast your C code is has an awful lot to do with how good your compiler is.

When gcc came on the scene, one of the things that made it so popular was that it was often much better than the C compilers that shipped with many commercial UNIX flavors. Not only was it ANSI C (none of that K&R C rubbish), it was more robust and typically produced better (faster) code. Not always, but often.

I tell you all this because there is no blanket rule about the relative speed of C and assembler, since there is no objective standard for C.

Likewise, assembler varies a lot depending on the processor you're running, your system spec, the instruction set you're using, and so on. Historically, there have been two families of CPU architecture: CISC and RISC. The biggest player in CISC is still the Intel x86 architecture (and instruction set). RISC dominated the UNIX world (MIPS6000, Alpha, Sparc, and so on). CISC won the battle for hearts and minds.

Anyway, when I was a younger developer, the popular wisdom was that hand-written x86 could often be faster than C, because of the way the architecture worked: its complexity benefited from having a human do the job. RISC, on the other hand, seemed designed for compilers, so nobody (that I knew of) wrote, say, Sparc assembler. I'm sure such people existed, but no doubt they've since gone insane and been institutionalized.

Instruction sets are an important point even within the same processor family. Certain Intel processors have extensions ranging from SSE through SSE4. AMD has its own SIMD instructions. The benefit of a programming language like C is that someone can write a library optimized for whichever processor you're running on; that is hard work in assembler. (A sketch of that kind of per-processor dispatch follows.)
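A minimal sketch of that kind of per-processor dispatch, assuming GCC/Clang's __builtin_cpu_supports; the function names here are hypothetical:

static void scale_scalar(float *d, int n)   /* portable fallback */
{
    for (int i = 0; i < n; i++) d[i] *= 2.0f;
}

static void scale_sse4(float *d, int n)     /* where an SSE4-tuned body would go (stubbed here) */
{
    for (int i = 0; i < n; i++) d[i] *= 2.0f;
}

typedef void (*scale_fn)(float *, int);

scale_fn pick_scale(void)
{
    if (__builtin_cpu_supports("sse4.2"))   /* runtime CPUID check */
        return scale_sse4;
    return scale_scalar;
}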

There are still optimizations an assembly programmer can make that no compiler could, and a well-written assembly routine will be as fast as or faster than its C equivalent. The bigger question is: is it worth it?

Ultimately, though, assembler was a product of its time, more popular when CPU cycles were expensive. Nowadays a CPU that costs $5-10 to manufacture (the Intel Atom) can do pretty much anything anyone could want. The only real reason to write assembler these days is for low-level things like certain parts of an operating system (even so, the vast majority of the Linux kernel is written in C), device drivers, possibly embedded devices (although C tends to dominate there too), and so on. Or just for kicks (which is somewhat masochistic).


Many years ago, I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides, etc. I showed him how to recast the problem using bit shifts, and the processing time came down to about 30 seconds on the non-optimizing compiler he had. I had just gotten an optimizing compiler, and the same code rotated the graphic in under 5 seconds. I looked at the assembly code the compiler was generating and decided then and there that my days of writing assembler were over.




