Why does the System V / AMD64 ABI mandate a 16-byte stack alignment?
I've read in different places that it is done for "performance reasons", but I still wonder what the particular cases are where performance gets improved by this 16-byte alignment. Or, in any case, what the reasons were for choosing it.
edit: I think I wrote the question in a misleading way. I wasn't asking why the processor does things faster with 16-byte aligned memory; that is explained everywhere in the docs. What I wanted to know instead is how the enforced 16-byte alignment is better than just letting programmers align the stack themselves when needed. I'm asking because, from my experience with assembly, the stack enforcement has two problems: it is only useful to less than 1% of the code that is executed (so in the other 99% it is just overhead); and it is also a very common source of bugs. So I wonder how it really pays off in the end. While I'm still in doubt about this, I'm accepting Peter's answer as it contains the most detailed answer to my original question.
Note that the current version of the i386 System V ABI used on Linux also requires 16-byte stack alignment (footnote 1). See https://sourceforge.net/p/fbc/bugs/659/ for some history.
SSE2 is baseline for x86-64, and making the ABI efficient for types like __m128, and for compiler auto-vectorization, was one of the design goals, I think. The ABI has to define how such types are passed as function args, or by reference.
16-byte alignment is sometimes useful for local variables on the stack (especially arrays), and guaranteeing 16-byte alignment means compilers can get it for free whenever it's useful, even if the source doesn't explicitly request it.
If the stack alignment relative to a 16-byte boundary wasn't known, every function that wanted an aligned local would need an and rsp, -16, and extra instructions to save/restore rsp after applying an unknown offset to it, e.g. using up rbp for a frame pointer.
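As a rough sketch of what that costs (my own illustration, not from the ABI text; the exact instructions depend on the compiler and options, and the names f and use are made up), compare what a compiler can do when it knows the incoming alignment versus when it can't:

    /* With the ABI guarantee, rsp % 16 == 8 on function entry (the call
     * pushed an 8-byte return address), so a fixed adjustment such as
     * "sub rsp, 24" is enough to give buf 16-byte alignment and keep the
     * stack aligned for the call to use().  Without the guarantee, the
     * compiler would have to emit something like
     *     push rbp
     *     mov  rbp, rsp      ; keep the old rsp reachable
     *     and  rsp, -16      ; force alignment
     * and restore rsp through rbp before returning. */
    #include <stdalign.h>

    void use(float *p);          /* some function that wants an aligned buffer */

    void f(void) {
        alignas(16) float buf[4];
        use(buf);
    }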
Without AVX, memory source operands have to be 16-byte aligned, e.g. paddd xmm0, [rsp+rdi] faults if the memory operand is misaligned. So if alignment isn't known, you'd have to either use movups xmm1, [rsp+rdi] / paddd xmm0, xmm1, or write a loop prologue / epilogue to handle the misaligned elements. For local arrays that the compiler wants to auto-vectorize over, it can simply choose to align them by 16.
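As a concrete sketch with SSE2 intrinsics (my own example; the function names are made up): with a pointer the compiler knows is 16-byte aligned it can use aligned loads, which it may fold straight into the ALU instruction, while an unknown-alignment pointer forces separate movups-style loads into a register first.

    #include <emmintrin.h>   /* SSE2 */

    /* p points to n 16-byte vectors, guaranteed 16-byte aligned: the aligned
     * load can be folded into a paddd-with-memory-operand. */
    __m128i sum_aligned(const __m128i *p, int n) {
        __m128i acc = _mm_setzero_si128();
        for (int i = 0; i < n; i++)
            acc = _mm_add_epi32(acc, _mm_load_si128(&p[i]));
        return acc;
    }

    /* Same reduction over n ints with no alignment guarantee: each load goes
     * through an unaligned load (_mm_loadu_si128 / movups) first. */
    __m128i sum_unaligned(const int *p, int n) {
        __m128i acc = _mm_setzero_si128();
        for (int i = 0; i + 4 <= n; i += 4)
            acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + i)));
        return acc;
    }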
Also note that early x86 CPUs (before Nehalem / Bulldozer) had a movups instruction that's slower than movaps even when the pointer does turn out to be aligned (i.e. unaligned loads/stores on aligned data were extra slow, as well as preventing the load from being folded into an ALU instruction). (See Agner Fog's optimization guides, microarch guide, and instruction tables for more about all of the above.)
These factors are why a guarantee is more useful than just "usually" keeping the stack aligned. Being allowed to make code which actually faults on a misaligned stack allows more optimization opportunities.
Aligned arrays also speed up vectorized functions like memcpy / strcmp that can't assume alignment, but instead check for it and can jump straight to their whole-vector loops.
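That check is cheap. As a sketch (my own example, not from the original answer; copy16 is a made-up name), a memcpy-style routine can test the low address bits and go straight to the whole-vector loop when both pointers are 16-byte aligned:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    void copy16(char *dst, const char *src, size_t n) {
        size_t i = 0;
        /* If both pointers are 16-byte aligned, use whole 16-byte vectors. */
        if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0) {
            for (; i + 16 <= n; i += 16) {
                __m128i v = _mm_load_si128((const __m128i *)(src + i));
                _mm_store_si128((__m128i *)(dst + i), v);
            }
        }
        /* Misaligned case, and the tail of the aligned case. */
        for (; i < n; i++)
            dst[i] = src[i];
    }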
From the x86-64 System V ABI:

An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes. [4]

[4] The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.
This is a bit aggressive, and mostly only helps when functions that auto-vectorize can be inlined, but usually there are other locals the compiler can stuff into any gaps so it doesn't waste stack space. And it doesn't waste instructions as long as there's a known stack alignment. (Obviously the ABI designers could have left this out if they'd decided not to require 16-byte stack alignment.)
Of course, it makes it free to do alignas(16) char buf; or any other case where the source explicitly requests 16-byte alignment.

And there are also __m128 locals. The compiler may not be able to keep all vector locals in registers (e.g. spilled across a function call, or not enough registers), so it needs to be able to spill/reload them with movaps, or use them as memory source operands for ALU instructions, for the efficiency reasons discussed above.
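For example (a sketch of my own; opaque and keep_across_call are made-up names), a __m128 value that has to survive a function call gets spilled to a stack slot and reloaded, and the known stack alignment lets the compiler use movaps for that:

    #include <xmmintrin.h>

    void opaque(void);   /* the call clobbers all xmm registers in the SysV ABI */

    __m128 keep_across_call(__m128 v) {
        /* v must survive the call, so the compiler spills it to a
         * 16-byte-aligned stack slot (movaps store) and reloads it after. */
        opaque();
        return _mm_add_ps(v, v);
    }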
Loads/stores that actually are split across a cache-line boundary (64 bytes) have significant latency penalties, and also minor throughput penalties on modern CPUs. The load needs data from 2 separate cache lines, so it takes two accesses to the cache. (And potentially 2 cache misses, but that's rare for stack memory).
movups already had that cost baked in for vectors on older CPUs where it's expensive, but it still sucks. Spanning a 4k page boundary is much worse (on CPUs before Skylake), with a load or store taking ~100 cycles if it touches bytes on both sides of a 4k boundary. (It also needs 2 TLB checks.) Natural alignment makes splits across any wider boundary impossible: a 16-byte object at a 16-byte-aligned address can never straddle a cache-line or page boundary, so 16-byte alignment was sufficient for everything you can do with SSE2.
max_align_t has 16-byte alignment in the x86-64 System V ABI, because of long double (10-byte/80-bit x87). It's defined as padded to 16 bytes for some weird reason, unlike in 32-bit code where sizeof(long double) == 12. x87 10-byte load/store is quite slow anyway (like 1/3rd the load throughput of double or float on Core2, 1/6th on P4, or 1/8th on K8), but maybe cache-line and page split penalties were so bad on older CPUs that they decided to define it that way. I think on modern CPUs (maybe even Core2) looping over an array of long double would be no slower with packed 10-byte elements, because the fld m80 itself would be a bigger bottleneck than a cache-line split every ~6.4 elements.
Actually, the ABI was defined before silicon was available to benchmark on (back in ~2000), but those K8 numbers are the same as K7 (32-bit / 64-bit mode is irrelevant here). Making long double 16 bytes does make it possible to copy a single one with movaps, even though you can't do anything with it in XMM registers. (Except manipulate the sign bit with bitwise ops like xorps.)
This max_align_t definition also means that malloc always returns 16-byte aligned memory in x86-64 code. This lets you get away with using it for SSE aligned loads like _mm_load_ps, but such code can break when compiled for 32-bit where alignof(max_align_t) is only 8. (Use aligned_alloc or similar instead.)
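A small sketch of the portable approach (my own example; make_vec_buffer and load_first are made-up names): explicitly request the alignment instead of relying on malloc's max_align_t guarantee, so the aligned load is safe on 32-bit targets too.

    #include <stdlib.h>
    #include <xmmintrin.h>

    float *make_vec_buffer(size_t nfloats) {
        /* C11 aligned_alloc wants the size to be a multiple of the
         * alignment, so round up to a multiple of 16 bytes. */
        size_t bytes = (nfloats * sizeof(float) + 15) & ~(size_t)15;
        return aligned_alloc(16, bytes);
    }

    __m128 load_first(const float *buf) {
        return _mm_load_ps(buf);   /* aligned load: needs 16-byte alignment */
    }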
Other ABI factors
One factor is passing __m128 values on the stack (after xmm0-7 have the first 8 float / vector args). It makes sense to require 16-byte alignment for vectors in memory, so they can be used efficiently by the callee, and stored efficiently by the caller. Maintaining 16-byte stack alignment at all times makes it easy for functions that need to align some arg-passing space by 16.
There are also types like __m128 that the ABI guarantees have 16-byte alignment. If you define a local of such a type and take its address, and pass that pointer to some other function, the local needs to be sufficiently aligned. So maintaining 16-byte stack alignment goes hand in hand with giving some types 16-byte alignment, which is obviously a good idea.
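As a quick sketch (my own example; sum4 and caller are made-up names): the callee is entitled to assume the pointed-to __m128 is 16-byte aligned, so the caller's local really does need an aligned stack slot.

    #include <xmmintrin.h>

    float sum4(const __m128 *p) {
        __m128 v = *p;              /* dereference may compile to movaps */
        float out[4];
        _mm_storeu_ps(out, v);
        return out[0] + out[1] + out[2] + out[3];
    }

    float caller(void) {
        __m128 local = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* 16-byte-aligned local */
        return sum4(&local);
    }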
These days, it's nice that 16-byte atomic objects (e.g. atomic<struct_of_16_bytes>) can cheaply get 16-byte alignment, so a lock cmpxchg16b never crosses a cache-line boundary. That only matters for the really rare case where you have an atomic local with automatic storage, and you pass pointers to it to multiple threads...
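A minimal sketch of that in C11 (my own example; struct pair, shared, and publish are made-up names, and with gcc you may need -mcx16 and/or -latomic for a 16-byte atomic):

    #include <stdatomic.h>
    #include <stdint.h>

    /* A 16-byte payload; compilers typically raise the alignment of the
     * atomic object to 16 so a lock cmpxchg16b never straddles a cache line. */
    struct pair { uint64_t lo, hi; };
    _Static_assert(sizeof(struct pair) == 16, "16-byte payload");

    _Atomic struct pair shared;

    void publish(uint64_t a, uint64_t b) {
        struct pair p = { a, b };
        atomic_store(&shared, p);   /* may compile to a cmpxchg16b loop */
    }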
Footnote 1: 32-bit Linux
Not all 32-bit platforms broke backwards compatibility with existing binaries and hand-written asm the way Linux did; some like i386 NetBSD still only use the historical 4-byte stack alignment requirement from the original version of the i386 SysV ABI.
The historical 4-byte stack alignment was also insufficient for efficient access to 8-byte double locals on modern CPUs. Unaligned 8-byte loads/stores are generally efficient except when they cross a cache-line boundary (like other loads/stores), so it's not horrible, but naturally-aligned is nice.
Even before 16-byte alignment was officially part of the ABI, GCC used to enable -mpreferred-stack-boundary=4 (2^4 = 16 bytes) on 32-bit. This currently assumes the incoming stack alignment is 16 bytes (even for cases that will fault if it's not), as well as preserving that alignment. I'm not sure if historical gcc versions used to try to preserve stack alignment without depending on it for correctness of SSE code-gen or alignas(16) objects.
ffmpeg is one well-known example that depends on the compiler to give it stack alignment (see: what is "stack alignment"?), e.g. on 32-bit Windows.
Modern gcc still emits code at the top of main to align the stack by 16 (even on Linux where the ABI guarantees that the kernel starts the process with an aligned stack), but not at the top of any other function. You could use -mincoming-stack-boundary to tell gcc how aligned it should assume the stack is when generating code.
Ancient gcc4.1 didn't seem to really respect __attribute__((aligned(16))) for automatic storage, i.e. it doesn't bother aligning the stack any extra in this example on Godbolt, so old gcc has kind of a checkered past when it comes to stack alignment. I think the change of the official Linux ABI to 16-byte alignment happened as a de-facto change first, not a well-planned change. I haven't turned up anything official on when the change happened, but it was somewhere between 2005 and 2010, I think, after x86-64 became popular and the x86-64 System V ABI's 16-byte stack alignment proved useful.
At first it was a change to GCC's code-gen to use more alignment than the ABI required (i.e. using a stricter ABI for gcc-compiled code), but later it was written in to the version of the i386 System V ABI maintained at https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI (which is official for Linux at least).
@MichaelPetch and @ThomasJager report that gcc4.5 may have been the first version to default to -mpreferred-stack-boundary=4 for 32-bit as well as 64-bit. gcc4.1.2 and gcc4.4.7 on Godbolt appear to behave that way, so maybe the change was backported, or Matt Godbolt configured old gcc with a more modern config.