what - x86-64 assembly cheat sheet

Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI? (2)

As EOF's comment indicates, the compiler can't assume that the upper 32 bits of a 64-bit register used to pass a 32-bit argument have any particular value. That makes the sign or zero extension necessary.

The only way to prevent this would be to use a 64-bit type for the argument, but this moves the requirement to extend the value to the caller, which may not be an improvement. I wouldn't worry too much about the size of register spills, though. The way you're doing it now, the original value will probably be dead after the extension, so it's the 64-bit extended value that gets spilled. Even if it's not dead, the compiler may still prefer to spill the 64-bit value.
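To make the trade-off concrete, here is a minimal sketch of the two signatures. The names store_narrow, store_wide, and caller are made up for illustration and are not from the question; the comments describe what compilers typically emit, not a guarantee.

```c
#include <stddef.h>

/* 32-bit offset: the callee has to sign-extend the offset before it can
   form a 64-bit address (typically a movslq on x86-64 SysV). */
void store_narrow(int *out, int offset)
{
    out[offset] = 1;
}

/* 64-bit offset: the callee can use the argument register directly. */
void store_wide(int *out, ptrdiff_t offset)
{
    out[offset] = 1;
}

/* A caller that only has an int now performs the extension itself:
   the implicit int -> ptrdiff_t conversion happens at the call site. */
void caller(int *buf, int i)
{
    store_wide(buf, i);
}
```

The total amount of work is the same; the 64-bit signature only changes which side of the call pays for the extension.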

If you're really concerned about your memory footprint and you don't need the larger 64-bit address space you might look at the x32 ABI which uses the ILP32 types but supports the full 64-bit instruction set.

Summary: I was looking at assembly code to guide my optimizations and see lots of sign or zero extensions when adding int32 to a pointer.

void Test(int *out, int offset)
{
    out[offset] = 1;
}
movslq  %esi, %rsi
movl    $1, (%rdi,%rsi,4)

At first, I thought my compiler was challenged at adding 32bit to 64bit integers, but I've confirmed this behavior with Intel ICC 11, ICC 14, and GCC 5.3.

This thread confirms my findings, but it's not clear if the sign or zero extension is necessary. This sign/zero extension would only be necessary if the upper 32bits aren't already set. But wouldn't the x86-64 ABI be smart enough to require that?

I'm kind of reluctant to change all my pointer offsets to ssize_t because register spills will increase the cache footprint of the code.

Yes, you have to assume that the high 32 bits of an arg or return-value register contain garbage. On the flip side, you are allowed to leave garbage in the high 32 when calling or returning yourself. That is, the burden is on the receiving side to ignore the high bits, not on the passing side to clean the high bits.

You need to sign or zero extend to 64 bits to use the value in a 64-bit effective address. In the x32 ABI, gcc frequently uses 32-bit effective addresses instead of using 64-bit operand-size for every instruction modifying a potentially-negative integer used as an array index.
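As a rough illustration of why the receiving side must extend, the following sketch models a 64-bit argument register as a uint64_t whose upper half holds garbage. The function names are hypothetical, and the narrowing casts rely on the two's-complement wraparound that GCC and Clang document for x86-64 (strictly, the result of those conversions is implementation-defined in C).

```c
#include <stdint.h>

/* Models movslq %esi, %rsi: ignore bits 63..32, sign-extend bits 31..0. */
int64_t sign_extend_low32(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}

/* Models movl %esi, %esi: ignore bits 63..32, zero-extend bits 31..0. */
uint64_t zero_extend_low32(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg;
}
```

With a register image of 0xDEADBEEFFFFFFFFD (garbage on top, -3 in the low 32 bits), sign_extend_low32 recovers -3 and zero_extend_low32 yields 4294967293. Using the raw 64-bit value as an index would give neither.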

The standard:

The x86-64 SysV ABI only says anything about which parts of a register are zeroed for _Bool (aka bool). Page 20:

When a value of type _Bool is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero (footnote 14: Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit)

The only other related rule is the one for varargs functions: %al, not the whole %rax, holds the number of FP register args.
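The _Bool rule quoted above can be modeled the same way: treat the register as a uint64_t and look only at the low 8 bits. This is just a sketch with made-up function names, not something taken from the ABI document.

```c
#include <stdint.h>

/* Correct consumer: truncate to 8 bits. The ABI promises bits 1..7 are
   zero, so the result is exactly 0 or 1. */
uint8_t bool_low8(uint64_t reg)
{
    return (uint8_t)reg;
}

/* Incorrect consumer: tests the whole register, so unspecified upper
   bits can turn a false value into a true one. */
int bool_whole_reg(uint64_t reg)
{
    return reg != 0;
}
```

For a register image of 0xABCDEF0000000000 (false, with garbage above bit 7), bool_low8 returns 0 while bool_whole_reg returns 1.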

There's an open GitHub issue about this exact question on the GitHub page for the x32 and x86-64 ABI documents.

The ABI doesn't place any further requirements or guarantees on the contents of the high parts of integer or vector registers holding args or return values, so there aren't any. I have confirmation of this fact via email from Michael Matz (one of the ABI maintainers): "Generally, if the ABI doesn't say something is specified, you cannot rely on it."

He also confirmed that e.g. clang >= 3.6's use of an addps that could slow down or raise extra FP exceptions with garbage in high elements is a bug (which reminds me I should report that). He adds that this was an issue once with an AMD implementation of a glibc math function. Normal C code can leave garbage in high elements of vector regs when passing scalar double or float args.

Actual behaviour which is not (yet) documented in the standard:

Narrow function arguments, even _Bool / bool, are sign or zero-extended to 32 bits. clang even makes code that depends on this behaviour (since 2007, apparently). ICC17 doesn't do it, so ICC and clang are not ABI-compatible, even for C. Don't call clang-compiled functions from ICC-compiled code for the x86-64 SysV ABI, if any of the first 6 integer args are narrower than 32-bit.
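The incompatibility can be sketched by modeling the low 32 bits of the argument register as a uint32_t that is supposed to carry an unsigned short. Both function names are hypothetical; one stands in for a callee that trusts the caller's extension, the other for a callee that redoes it.

```c
#include <stdint.h>

/* Trusting callee: assumes the caller already zero-extended the
   unsigned short to 32 bits, so it uses the register as-is. */
uint32_t add_trusting(uint32_t edi)
{
    return edi + 1234u;
}

/* Defensive callee: re-extends from 16 bits, so garbage in bits 31..16
   cannot affect the result. */
uint32_t add_defensive(uint32_t edi)
{
    return (uint16_t)edi + 1234u;
}
```

If the caller extended the value (edi = 5), both return 1239. If the caller left garbage in the upper half (edi = 0xABCD0005), only add_defensive still returns 1239.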

This doesn't apply to return values, only args: gcc and clang both assume that return-values they receive only have valid data up to the width of the type. gcc will make functions returning char that leave garbage in the high 24 bits of %eax, for example.

A recent thread on the ABI discussion group was a proposal to clarify the rules for extending 8 and 16-bit args to 32 bits, and maybe actually modify the ABI to require this. The major compilers (except ICC) already do it, but it would be a change to the contract between callers and callees.

Here's an example (check it out with other compilers or tweak the code on the Godbolt Compiler Explorer, where I've included many simple examples that each demonstrate only one piece of the puzzle, as well as this one, which demonstrates a lot):

extern short fshort(short a);
extern unsigned fuint(unsigned int a);

extern unsigned short array_us[];
unsigned short lookupu(unsigned short a) {
  unsigned int a_int = a + 1234;
  a_int += fshort(a);                 // NOTE: not the same calls as the signed lookup
  return array_us[a + fuint(a_int)];
}

# clang-3.8 -O3  for x86-64.    arg in %rdi.  (Actually in %di, zero-extended to %edi by our caller)
lookupu(unsigned short):
    pushq   %rbx                      # save a call-preserved reg for our own use.  (Also aligns the stack for another call)
    movl    %edi, %ebx                # If we didn't assume our arg was already zero-extended, this would be a movzwl (aka movzx)
    movswl  %bx, %edi                 # sign-extend to call a function that takes signed short instead of unsigned short.
    callq   fshort(short)
    cwtl                              # Don't trust the upper bits of the return value.  (This is cwde in Intel syntax.  eax = sign_extend(ax))
    leal    1234(%rbx,%rax), %edi     # this is the point where we'd get a wrong answer if our arg wasn't zero-extended.  gcc doesn't assume this, but clang does.
    callq   fuint(unsigned int)
    addl    %ebx, %eax                # zero-extends eax to 64bits
    movzwl  array_us(%rax,%rax), %eax # This zero-extension (instead of just writing ax) is *not* for correctness, just for performance: avoid partial-register slowdowns if the caller reads eax
    popq    %rbx
    retq

Note: movzwl array_us(,%rax,2) would be equivalent, but no smaller. If we could depend on the high bits of %rax being zeroed in fuint()'s return value, the compiler could have used array_us(%rbx, %rax, 2) instead of using the add insn.

Performance implications

Leaving the high32 undefined is intentional, and I think it's a good design decision.

Ignoring the high 32 is free when doing 32-bit ops. A 32-bit operation zero-extends its result to 64-bit for free, so you only need an extra mov edx, edi or something if you could have used the reg directly in a 64-bit addressing mode or 64-bit operation.

Some functions won't save any insns from having their args already extended to 64-bit, so it's a potential waste for callers to always have to do it. Some functions use their args in a way that requires the opposite extension from the signedness of the arg, so leaving it up to the callee to decide what to do works well.

Zero-extending to 64-bit regardless of signedness would be free for most callers, though, and might have been a good ABI design choice. Since arg regs are clobbered anyway, the caller already needs to do something extra if it wants to keep a full 64-bit value across a call where it only passes the low 32. Thus it usually only costs extra when you need a 64-bit result for something before the call, and then pass a truncated version to a function. In x86-64 SysV, you can generate your result in RDI and use it, and then call foo which will only look at EDI.

16-bit and 8-bit operand-sizes often lead to false dependencies (AMD, P4, or Silvermont, and later SnB-family), or partial-register stalls (pre SnB) or minor slowdowns (Sandybridge), so the undocumented behaviour of requiring 8 and 16b types to be extended to 32b for arg-passing makes some sense. See Why doesn't GCC use partial registers? for more details on those microarchitectures.

This is probably not a big deal for code-size in real code, since tiny functions are / should be static inline, and arg-handling insns are a small part of bigger functions. Inter-procedural optimization can remove overhead between calls when the compiler can see both definitions, even without inlining. (IDK how well compilers do at this in practice.)

I'm not sure whether changing function signatures to use uintptr_t will help or hurt overall performance with 64-bit pointers. I wouldn't worry about stack space for scalars. In most functions, the compiler pushes/pops enough call-preserved registers (like %rbx and %rbp) to keep its own variables live in registers. A tiny bit of extra space for 8B spills instead of 4B is negligible.

As far as code-size goes, working with 64-bit values requires a REX prefix on some insns that wouldn't have otherwise needed one. Zero-extending to 64-bit happens for free if any operations are required on a 32-bit value before it gets used as an array index. Sign-extension always takes an extra instruction if it's required. But compilers can sign-extend and work with it as a 64-bit signed value from the start to save instructions, at the cost of needing more REX prefixes. (Signed overflow is UB, not defined to wrap around, so compilers can often avoid redoing sign-extension inside a loop with an int i that uses arr[i].)
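A small sketch of the loop case mentioned in the parenthetical, with hypothetical function names. What the compiler actually emits depends on the compiler and flags, so treat the comments as the typical outcome rather than a promise.

```c
#include <stddef.h>

/* int counter: overflowing i would be undefined behaviour, so the
   compiler is free to keep i in a 64-bit register (or turn the loop
   into a pointer increment) instead of sign-extending every iteration. */
long sum_int_index(const int *arr, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}

/* size_t counter: already pointer-width, so no extension is needed at
   all, at the cost of 64-bit operand size for the counter. */
long sum_size_index(const int *arr, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}
```

Both versions compute the same result; comparing their output on the Godbolt Compiler Explorer is the easiest way to see whether a given compiler hoists the extension out of the loop.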

Modern CPUs usually care more about insn count than insn size, within reason. Hot code will often be running from the uop cache in CPUs that have them. Still, smaller code can improve density in the uop cache. If you can save code size without using more or slower insns, then it's a win, but not usually worth sacrificing anything else for unless it's a lot of code size.

For example, it may be worth one extra LEA instruction to allow [reg + disp8] addressing for a dozen later instructions, instead of disp32. Or xor eax,eax before multiple mov [rdi+n], 0 instructions to replace the imm32=0 with a register source. (Especially if that allows micro-fusion where it wouldn't be possible with a RIP-relative + immediate, because what really matters is front-end uop count, not instruction count.)