performance - tables - Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
hardware performance counters windows (2)
I think you're measuring accurately, and the explanation is microarchitectural, not any kind of measurement error.
I think your results for mid to low T support the conclusion that
stops the front-end from even issuing past the
until all earlier instructions retire
, rather than having all the uops from both chains already issued and just waiting for
to flip a switch and let multiplies from each chain start to dispatch on alternating cycles.
(port1 would get edx,eax,empty,edx,eax,empty,... for Skylake's 3c latency / 1c throughput multiplier right away, if
didn't block the front-end, and overhead wouldn't scale with T.)
throughput when only uops from the first chain are in the scheduler because the front-end hasn't chewed through the
and loop branch yet. And for the same number of cycles at the end of the window when the pipeline is mostly drained and only uops from the 2nd chain are left.
The overhead delta looks linear up to about T=60. I didn't run the numbers, but the slope up to there looks reasonable for
T * 0.25
clocks to issue the first chain vs. 3c-latency execution bottleneck. i.e.
delta growing maybe 1/12th as fast as total no-lfence cycles
So (given the
overhead I measured below), with T<60:
no_lfence cycles/iter ~= 3T # OoO exec finds all the parallelism lfence cycles/iter ~= 3T + T/4 + 9.3 # lfence constant + front-end delay delta ~= T/4 + 9.3
@Margaret reports that
is a better fit than
2*T / 4
, but I would have expected T/4 at both the start and end, for a total of 2T/4 slope of the delta.
After about T=60, delta grows much more quickly (but still linearly), with a slope about equal to the total no-lfence cycles, thus about 3c per T. I think at that point, the scheduler (Reservation Station) size is limiting the out-of-order window. You probably tested on a Haswell or Sandybridge/IvyBridge, ( which have a 60-entry or 54-entry scheduler respectively . Skylake's is 97 entry.
The RS tracks un-executed uops. Each RS entry holds 1 unfused-domain uop that's waiting for its inputs to be ready, and its execution port, before it can dispatch and leave the RS 1 .
, the front-end issues at 4 per clock while the back-end executes at 1 per 3 clocks, issuing 60 uops in ~15 cycles, during which time only 5
instructions from the
chain have executed. (There's no load or store micro-fusion here, so every fused-domain uop from the front-end is still only 1 unfused-domain uop in the RS
For large T the RS quickly fills up, at which point the front-end can only make progress at the speed of the back-end. (For small T, we hit the next iteration's
before that happens, and that's what stalls the front-end).
When T > RS_size
, the back-end can't see any of the uops from the
imul chain until enough back-end progress through the
chain has made room in the RS. At that point, one
from each chain can dispatch every 3 cycles, instead of just the 1st or 2nd chain.
Remember from the first section that time spent just after
only executing the first chain = time just before
executing only the second chain. That applies here, too.
We get some of this effect even with no
, for T > RS_size
, but there's opportunity for overlap on both sides of a long chain. The ROB is at least twice the size of the RS, so the out-of-order window when not stalled by
should be able to keep both chains in flight constantly even when T is somewhat larger than the scheduler capacity. (Remember that uops leave the RS as soon as they've executed. I'm not sure if that means they have to
executing and forward their result, or merely start executing, but that's a minor difference here for short ALU instructions. Once they're done, only the ROB is holding onto them until they retire, in program order.)
The ROB and register-file shouldn't be limiting the out-of-order window size ( http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ ) in this hypothetical situation, or in your real situation. They should both be plenty big.
Blocking the front-end is an implementation detail of
on Intel's uarches
. The manual only says that later instructions can't
. That wording would allow the front-end to issue/rename them all into the scheduler (Reservation Station) and ROB while
is still waiting, as long as none are dispatched to an execution unit.
So a weaker
would maybe have flat overhead up to T=RS_size, then the same slope as you see now for T>60.
(And the constant part of the overhead might be lower.)
Note that guarantees about speculative execution of conditional/indirect branches after
, not (as far as I know) to code-fetch. Merely triggering code-fetch is not (AFAIK) useful to a Spectre or Meltdown attack. Possibly a timing side-channel to detect how it decodes could tell you something about the fetched code...
I think AMD's LFENCE is at least as strong on actual AMD CPUs, when the relevant MSR is enabled. ( Is LFENCE serializing on AMD processors? ).
Your results are interesting, but it doesn't surprise me at all that there's significant constant overhead from
itself (for small T), as well as the component that scales with T.
doesn't allow later instructions to start until earlier instructions have
. This is probably at least a couple cycles / pipeline-stages later than when their results are ready for bypass-fowarding to other execution units (i.e. the normal latency).
So for small T, it's definitely significant that you add extra latency into the chain by requiring the result to not only be ready, but also written back to the register file.
It probably takes an extra cycle or so for
to allow the issue/rename stage to start operating again after detecting retirement of the last instruction before it. The issue/rename process takes multiple stages (cycles), and maybe lfence blocks at the
of this, instead of in the very last step before uops are added into the OoO part of the core.
itself has 4 cycle throughput on SnB-family, according to Agner Fog's testing.
Agner Fog reports
2 fused-domain uops (no unfused), but on Skylake I measure it at 6 fused-domain (still no unfused) if I only have 1
. But with more
back-to-back, it's fewer uops! Down to ~2 uops per
with many back-to-back, which is how Agner measures.
(a tight loop with no work) runs at 1 iteration per ~10 cycles on SKL, so that might give us an idea of the real extra latency that
adds to the dep chains even without the front-end and RS-full bottlenecks.
overhead with only
, OoO exec being irrelevant:
.loop: ;mfence ; mfence here: ~62.3c (with no lfence) lfence ; lfence here: ~39.3c times 10 imul eax,eax ; with no lfence: 30.0c ; lfence ; lfence here: ~39.6c dec ecx jnz .loop
, runs at the expected 30.0c per iter. With
, runs at ~39.3c per iter, so
effectively added ~9.3c of "extra latency" to the critical path dep chain. (And 6 extra fused-domain uops).
after the imul chain, right before the loop-branch, it's slightly slower. But not a whole cycle slower, so that would indicate that the front-end is issuing the loop-branch + and imul in a single issue-group after
allows execution to resume. That being the case, IDK why it's slower. It's not from branch misses.
Getting the behaviour you were expecting:
Interleave the chains in program order, like @BeeOnRope suggests in comments, doesn't require out-of-order execution to exploit the ILP, so it's pretty trivial:
.loop: lfence ; at the top of the loop is the lowest-overhead place. %rep T imul eax,eax imul edx,edx %endrep dec ecx jnz .loop
You could put pairs of short
times 8 imul
chains inside a
to let OoO exec have an easy time.
Footnote 1: How the front-end / RS / ROB interact
My mental model is that the issue/rename/allocate stages in the front-end add new uops to both the RS and the ROB at the same time.
Uops leave the RS after executing, but stay in the ROB until in-order retirement. The ROB can be large because it's never scanned out-of-order to find the first-ready uop, only scanned in-order to check if the oldest uop(s) have finished executing and thus are ready to retire.
(I assume the ROB is physically a circular buffer with start/end indices, not a queue which actually copies uops to the right every cycle. But just think of it as a queue / list with a fixed max size, where the front-end adds uops at the front, and the retirement logic retires/commits uops from the end as long as they're fully executed, up to some per-cycle per-hyperthread retirement limit which is not usually a bottleneck. Skylake did increase it for better Hyperthreading, maybe to 8 per clock per logical thread. Perhaps retirement also means freeing physical registers which helps HT, because the ROB itself is statically partitioned when both threads are active. That's why retirement limits are per logical thread.)
, which are handled in the front-end (don't need any execution units on any ports) are added
to the ROB, in an already-executed state. (A ROB entry presumably has a bit that marks it as ready to retire vs. still waiting for execution to complete. This is the state I'm talking about. For uops that
need an execution port, I assume the ROB bit is set via a
from the execution unit. And that the same completion-port signal frees its RS entry.)
Uops stay in the ROB from issue to retirement .
Uops stay in the RS from issue to execution . The RS can replay uops in a few cases , e.g. for the other half of a cache-line-split load , or if it was dispatched in anticipation of load data arriving, but in fact it didn't. (Cache miss or other conflicts like Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up? .) Or when a load port speculates that it can bypass the AGU before starting a TLB lookup to shorten pointer-chasing latency with small offsets - Is there a penalty when base+offset is in a different page than the base?
So we know that the RS can't remove a uop right as it dispatches, because it might need to be replayed. (Can happen even to non-load uops that consume load data.) But any speculation that needs replays is short-range, not through a chain of uops, so once a result comes out the other end of an execution unit, the uop can be removed from the RS. Probably this is part of what a completion port does, along with putting the result on the bypass forwarding network.
Footnote 2: How many RS entries does a micro-fused uop take?
TL:DR: P6-family: RS is fused, SnB-family: RS is unfused.
A micro-fused uop is issued to two separate RS entries in Sandybridge-family , but only 1 ROB entry. (Assuming it isn't un-laminated before issue, see section 2.3.5 for HSW or section 126.96.36.199 for SnB of Intel's optimization manual, and Micro fusion and addressing modes . Sandybridge-family's more compact uop format can't represent indexed addressing modes in the ROB in all cases.)
The load can dispatch independently, ahead of the other operand for the ALU uop being ready. (Or for micro-fused stores, either of the store-address or store-data uops can dispatch when its input is ready, without waiting for both.)
I used the two-dep-chain method from the question to experimentally test this on Skylake (RS size = 97)
, with micro-fused
or edi, [rdi]
, and another dep chain in
Full test code, NASM syntax on Godbolt
; loop body %rep T %if FUSE or edi, [rdi] ; static buffers are in the low 32 bits of address space, in non-PIE %else mov eax, [rdi] or edi, eax %endif %endrep %rep T %if FUSE or esi, [rsi] %else mov eax, [rsi] or esi, eax %endif %endrep
(unfused-domain) per cycle (or per second which
calculates for us), we can see a throughput number that doesn't depend on separate vs. folded loads.
With small T (T=30), all the ILP can be exploited, and we get ~0.67 uops per clock with or without micro-fusion. (I'm ignoring the small bias of 1 extra uop per loop iteration from dec/jnz. It's negligible compared to the effect we'd see if micro-fused uops only used 1 RS entry)
Remember that load+
is 2 uops, and we have 2 dep chains in flight, so this is 4/6, because
or edi, [rdi]
has 6 cycle latency. (Not 5, which is surprising, see below.)
At T=60, we still have about 0.66 unfused uops executed per clock for FUSE=0, and 0.64 for FUSE=1. We can still find basically all the ILP, but it's just barely starting to dip, as the two dep chains are 120 uops long (vs. a RS size of 97).
At T=120, we have 0.45 unfused uops per clock for FUSE=0, and 0.44 for FUSE=1. We're definitely past the knee here, but still finding some of the ILP.
If a micro-fused uop took only 1 RS entry, FUSE=1 T=120 should be about the same speed as FUSE=0 T=60, but that's not the case . Instead, FUSE=0 or 1 makes nearly no difference at any T. (Including larger ones like T=200: FUSE=0: 0.395 uops/clock, FUSE=1: 0.391 uops/clock). We'd have to go to very large T before we start for the time with 1 dep-chain in flight to totally dominate the time with 2 in flight, and get down to 0.33 uops / clock (2/6).
Oddity: We have such a small but still measurable difference in throughput for fused vs. unfused, with separate
loads being faster.
Other oddities: the total
lower for FUSE=0 at any given T. Like 2,418,826,591 vs. 2,419,020,155 for T=60. This difference was repeatable down to +- 60k out of 2.4G, plenty precise enough. FUSE=1 is slower in total clock cycles, but most of the difference comes from lower uops per clock, not from more uops.
Simple addressing modes like
are supposed to only have 4 cycle latency, so load + ALU should be only 5 cycle. But I measure 6 cycle latency for the load-use latency of
or rdi, [rdi]
, or with a separate MOV-load, or with any other ALU instruction I can never get the load part to be 4c.
A complex addressing mode like
[rdi + rbx + 2064]
has the same latency when there's an ALU instruction in the dep chain, so it appears that Intel's 4c latency for simple addressing modes
applies when a load is forwarding to the base register of another load (with up to a +0..2047 displacement and no index).
Pointer-chasing is common enough that this is a useful optimization, but we need to think of it as a special load-load forwarding fast-path, not as a general data ready sooner for use by ALU instructions.
P6-family is different: an RS entry holds a fused-domain uop.
@Hadi found an Intel patent from 2002 , where Figure 12 shows the RS in the fused domain.
Experimental testing on a Conroe (first gen Core2Duo, E6600) shows that there's a large difference between FUSE=0 and FUSE=1 for T=50. ( The RS size is 32 entries ).
- T=50 FUSE=1: total time of 2.346G cycles (0.44IPC)
T=50 FUSE=0: total time of 3.272G cycles (0.62IPC = 0.31 load+OR per clock). (
ocperf.pydoesn't have events for
uops_executedon uarches before Nehalem or so, and I don't have
oprofileinstalled on that machine.)
T=24 there's a negligible difference between FUSE=0 and FUSE=1, around 0.47 IPC vs 0.9 IPC (~0.45 load+OR per clock).
T=24 is still over 96 bytes of code in the loop, too big for Core 2's 64-byte (pre-decode) loop buffer, so it's not faster because of fitting in a loop buffer. Without a uop-cache, we have to be worried about the front-end, but I think we're fine because I'm exclusively using 2-byte single-uop instructions that should easily decode at 4 fused-domain uops per clock.
I was playing with the code in this answer , slightly modifying it:
BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 1000000 .loop: ;T is a symbol defined with the CLI (-DT=...) TIMES T imul eax, eax lfence TIMES T imul edx, edx dec ecx jnz .loop mov eax, 60 ;sys_exit xor edi, edi syscall
I the results I get are consistent with the static analysis in that answer.
When I introduce a
I'd expect the CPU to execute the
imul edx, edx
sequence of the
iteration in parallel with the
imul eax, eax
sequence of the next (
Something like this (calling A the
imul eax, eax
imul edx, edx
| | A | D A | D A | D A | ... | D A | D | V time
Taking more or less the same number of cycles but for one unpaired parallel execution.
When I measure the number of cycles, for the original and modified version, with
taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T
in the range below I get
T Cycles:u Cycles:u Delta lfence no lfence 10 42047564 30039060 12008504 15 58561018 45058832 13502186 20 75096403 60078056 15018347 25 91397069 75116661 16280408 30 108032041 90103844 17928197 35 124663013 105155678 19507335 40 140145764 120146110 19999654 45 156721111 135158434 21562677 50 172001996 150181473 21820523 55 191229173 165196260 26032913 60 221881438 180170249 41711189 65 250983063 195306576 55676487 70 281102683 210255704 70846979 75 312319626 225314892 87004734 80 339836648 240320162 99516486 85 372344426 255358484 116985942 90 401630332 270320076 131310256 95 431465386 285955731 145509655 100 460786274 305050719 155735555
How can the values of
I would have expected them to be similar to those of
Cycles:u no lfence
since a single
should prevent only the first iteration from being executed in parallel for the two blocks.
I don't think it's due to the
overhead as I believe that should be constant for all
I'd like to fix what's wrong with my forma mentis when dealing with the static analysis of code.
I'll present an analysis for the case where T = 1 for both codes (with and without
). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the frontend will only stall if the backend stalled. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued from the IDQ (instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entires of the scheduler. Each
is decoded into a single uop that cannot be fused. The instructions
get macrofused in the frontend to a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that are not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be separately dispatched to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch, see Footnote 2 in @Peter's answer).
is decoded into 6 uops. Recognizing microfusion only matters in the backend, and in this case, there is no microfusion in the loop.
Since the loop branch is easily predictable and since the number of iterations is relatively large, we can just assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no micorfusion, each uop will be dispatched as a single uop.
can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the
uops is to dispatch them to port 1. In Haswell, the Slow Int is nicely pipelined so that a single
can be dispatched per cycle. But it takes three cycles for the result of the multiplication be available for any instruction that requires (the writeback stage is the third cycle from the dispatch stage of the pipeline). So for each dependence chain, at most one
can be dispatched per 3 cycles.
is predicted taken, the only execution unit that can execute it is Primary Branch on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let's examine the loop without lfence:
imul eax, eax imul edx, edx dec ecx/jnz .loop (macrofused)
There are two possibilities:
imuls from the same iteration, one
imulfrom a neighboring iteration, and one
dec/jnzfrom one of those two iterations.
dec/jnzfrom one iteration, two
imuls from the next iteration, and one
dec/jnzfrom the same iteration.
So at the beginning of any cycle, the RS will receive at least one
and at least one
from each chain. At the same time, in the same cycle and from those uops that are already there in the RS, the scheduler will do one of two actions:
Dispatch the oldest
dec/jnzto port 6 and dispatch the oldest
imulthat is ready to port 1. That's a total of 2 uops.
Because the Slow Int has a latency of 3 cycles but there are only two chains, for each cycle of 3 cycles, no
imulin the RS will be ready for execution. However, there is always at least one
dec/jnzin the RS. So the scheduler can dispatch that. That's a total of 1 uop.
Now we can calculate the expected number of uops in the RS, X N , at the end of any given cycle N:
+ (the number of uops to be allocated in the RS at the beginning of cycle N) - (the expected number of uops that will be dispatched at the beginning of cycle N)
= X N-1 + 4 - ((0+1)*1/3 + (1+1)*2/3)
= X N-1 + 12/3 - 5/3
= X N-1 + 7/3 for all N > 0
The initial condition for the recurrence is X 0 = 4. This is a simple recurrence that can be solved by unfolding X N-1 .
X N = 4 + 2.3 * N for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
The scheduler dispatches instructions at the following rates every cycle (as discussed above):
1 2 2 1 2 2 1 2 . .
But the allocator will not allocate any uops in the RS unless there are at least 4 available entries. Otherwise, it will not waste power on issuing uops at a sub-optimal throughput. However, it is only at the beginning of every 4th cycle are there at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to get stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops that can be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycles) fairly easily now. The ROB entries look something like this:
imul eax, eax - N imul edx, edx - N + 1 dec ecx/jnz .loop - M imul eax, eax - N + 3 imul edx, edx - N + 4 dec ecx/jnz .loop - M + 1
The column to the right shows the cycles in which the instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain have the same path length and so both constitute two equal critical paths of length 3 cycles. So every 3 cycles, 4 instructions can be retired. So the IPC is 4/3 = 1.3 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in-order, the retirement behavior will be the same.
We can check our analysis using both
and IACA. I'll discuss
. I've a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence Performance counter stats for './main-1-nolfence' (10 runs): 30,01,556 cycles:u ( +- 0.00% ) 40,00,005 instructions:u # 1.33 insns per cycle ( +- 0.00% ) 0 RESOURCE_STALLS.ROB 23,42,246 UOPS_ISSUED.ANY ( +- 0.26% ) 22,49,892 RESOURCE_STALLS.RS ( +- 0.00% ) 0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations each takes about 3 cycles. Each iteration contains 4 instructions and the IPC is 1.33.
shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens.
can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (no specific reason). The first is straightforward (not shown in the
output); 1 million * 3 = 3 million + small noise. The latter is much more interesting. It shows that about 73% of all time the allocator stalled due to a full RS, which matches our analysis.
counts the number of cycles in which the allocator was stalled due to a full RS. This is close to
because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason, I'll have to see the results for T>1).
The analysis of the code without
can be extended to determine what happens if an
was added between the two
s. Let's check out the
results first (IACA unfortunately does not support
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence Performance counter stats for './main-1-lfence' (10 runs): 1,32,55,451 cycles:u ( +- 0.01% ) 50,00,007 instructions:u # 0.38 insns per cycle ( +- 0.00% ) 0 RESOURCE_STALLS.ROB 1,03,84,640 UOPS_ISSUED.ANY ( +- 0.04% ) 0 RESOURCE_STALLS.RS 0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles does not tell us much. The number of retired instruction has increased by a million, which is expected. We already know that the
will not make instruction complete any faster, so
should not change.
are particularly interesting. In this output,
counts cycles, not uops. The number of uops can also be counted (using
) and has increased by 6 uops per iteration (no fusion). This means that an
that was placed between two
s was decoded into 6 uops. The one million dollar question is now what these uops do and how they move around in the pipe.
is zero. What does that mean? This indicates that the allocator, when it sees an
in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an
retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will be always almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an
requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, when it see there is an
uops at the front of any of the structures of the IDQ, it suspends allocation from that structure until the ROB is empty. So different uops are usd with different hardware structures.
shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of
is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of
. For sufficiently large T, and by measuring
, we can determine that it takes about 4 cycles to retire each
. That's because
will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another
(it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or few more cycle to retire depending on the instruction. IACA always assume that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax lfence imul edx, edx dec ecx jnz .loop
At any cycle at the
boundary, the ROB will contain the following instructions starting from the top of the ROB (the oldest instruction):
imul edx, edx - N dec ecx/jnz .loop - N imul eax, eax - N+1
Where N denotes the cycle number at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is
imul eax, eax
. and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles, N+1, N+2, N+3, and N+4. However it will about 5 more cycles until
imul eax, eax
retires. In addition, after it retires, the allocator needs to clean up the
uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The
output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the
) for 10 out of these 13 cycles.
The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.