6

I was reading about pipeline optimizations recently, and I wanted to ask whether I understand correctly how a processor handles pipelining.

Here is the C++ code for a simple test program:

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> vec(10000u);
    std::fill(vec.begin(), vec.end(), 0);
    for (unsigned i = 0u; i < vec.size(); ++i)
    {
        vec[i] = 5;
    }

    return 0;
}

And here is part of the assembly produced for the for loop:

...
00007FF6A4521080  inc         edx  
    {
        vec[i] = 5;
00007FF6A4521082  mov         dword ptr [rcx+rax*4],5  
00007FF6A4521089  mov         eax,edx  
00007FF6A452108B  cmp         rax,r9  
00007FF6A452108E  jb          main+80h (07FF6A4521080h)  
    }
...

In the program, the vector "vec" is allocated with a constant size and filled with zeros. The important "work" happens in the for loop, where every element of the vector is assigned 5 (just an arbitrary value).

I want to ask whether this assembly causes a stall in the pipeline. The reason would be that all the instructions are correlated and work on the same registers. For example, wouldn't the pipeline have to hold the cmp rax, r9 instruction until mov eax, edx has actually written the value into eax/rax?

Looping 10000 times is where branch prediction should come into play. The jb instruction jumps 10000 times and only falls through at the very end, so the branch predictor should easily predict that the jump is taken most of the time. However, from my perspective this optimization would be pointless if the code itself stalls inside the loop.


My target architecture is Skylake (i5-6400).

  • Well, you have a memory operand, so the mov of an immediate to a memory location is going to cause some sort of stall; granted, it is a write, so it can be quicker than a read. Branch prediction is just going to do premature fetches; the compare is the instruction before, so is there a prefetch gain? Why the move to eax is there, other than to cause a stall, I don't know, but yes, from a textbook approach the move has to execute to some level before the compare can start. It doesn't necessarily have to hit the register file, but the compare depends on the result of the mov. – old_timer Sep 11 '18 at 17:20
  • But that's from a textbook approach, which doesn't mean it is invalid or irrelevant, certainly not, but this is x86, so you likely have what, decades now of performance improvements over an elementary pipeline, plus it is microcoded, so you have nuances that are not visible at the x86 ISA level. Plus you don't specify the specific chip, stepping, and patches needed to know how that processor will behave (if there were some assumption that that was possible anyway). – old_timer Sep 11 '18 at 17:22
  • Register names are not strictly tied to specific gates on modern x86 chips. There is nothing stopping an x86 instruction set computer from having a dozen registers all in use as eax at the same time, so long as it keeps track of dependencies between reads and writes on them. (I don't know if existing hardware goes as far as a dozen.) – Yakk - Adam Nevraumont Sep 11 '18 at 17:22
  • Yes, if the result of one instruction is used as an operand of a following instruction, the prior instruction has to complete first in a basic textbook pipeline situation. Then said textbook should talk about solutions that might help that stall. This is the face-value instruction set perspective. The actual implementation of that instruction set, as commented by others here, means it shouldn't stall. – old_timer Sep 11 '18 at 17:26
  • There are no stalls here. – harold Sep 11 '18 at 17:26
  • Although you have created something that looks, textbook-wise, like something interesting, you can run it a billion times in a loop and your results are not going to show you what you think you are going to find. There are piles of infrastructure in an x86 design to make these things not stall. It's a combination of slowness due to the bulk, which results in speed by keeping the processor more evenly fed, combined with stuff to avoid stalls like this; or, in the real implementation, the stall isn't there. – old_timer Sep 11 '18 at 17:33
  • On some platforms you can see the branch prediction, but here again I don't expect you will be able to isolate it. – old_timer Sep 11 '18 at 17:35
  • Maybe a dupe of this? Not exactly, but half of the answer also answers this question: https://stackoverflow.com/q/37105230/555045 – harold Sep 11 '18 at 17:37
  • [Register renaming](https://en.m.wikipedia.org/wiki/Register_renaming). – Jesper Juhl Sep 11 '18 at 18:17
  • @JesperJuhl Ah yes, that is what it is called – Yakk - Adam Nevraumont Sep 11 '18 at 18:28
  • @Yakk-AdamNevraumont: Haswell's physical register file has 168 integer entries, and 168 vector entries. (https://www.realworldtech.com/haswell-cpu/3/). A dozen is chump change. :P Every architectural register needs an entry, but the rest can all be in use for the same architectural register after different instructions. You could probably maximize that with a sequence like `imul eax, ecx, 3` / `imul eax, ecx, 5` / etc. that has repeated independent writes to EAX. (`add eax, eax` would also work, but the dependency means reg renaming isn't helping you avoid any stalls.) – Peter Cordes Sep 11 '18 at 19:01
  • See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for how register-file size (instead of instruction ReOrder Buffer size) can limit the out-of-order instruction window. – Peter Cordes Sep 11 '18 at 19:01
  • You don't need to use std::fill after constructing, because the default value will be assigned by the count constructor if a value isn't explicitly given: `explicit vector( size_type count, const T& value = T(), const Allocator& alloc = Allocator());` – phuclv Sep 13 '18 at 01:30

3 Answers

6

TL;DR:

Case 1: A buffer that fits in the L1D. The vector constructor or the call to std::fill will place the buffer fully in the L1D. In this case, the 1 store per cycle throughput of the pipeline and the L1D cache is the bottleneck.

Case 2: A buffer that fits in the L2. The vector constructor or the call to std::fill will place the buffer fully in the L2. However, the L1 has to write dirty lines back to the L2 and there is only one port between the L1D and the L2. In addition, the lines have to be fetched from the L2 to the L1D. The 64B/cycle bandwidth between the L1D and L2 should be able to easily handle that, perhaps with occasional contention (see below for more details). So overall the bottleneck is the same as in Case 1. The particular buffer size you used, about 40KB, doesn't fit in the L1D of Intel and recent AMD processors, but fits in the L2. Although in the case of Simultaneous Multi-Threading (SMT), there might be some additional contention from the other logical core.

Case 3: A buffer that doesn't fit in the L2. The lines need to be fetched from the L3 or memory. The L2 DPL prefetcher can track the stores and prefetch the buffer into the L2, thereby alleviating the long latency. The single L2 port, together with the L1 writeback and fill buffers, is the bottleneck, and it is severe, especially when the buffer doesn't fit in the L3, where the interconnect can also be on the critical path. A throughput of 1 store per cycle is too much for the cache subsystem to handle. The two most relevant performance counters are L1D_PEND_MISS.REQUEST_FB_FULL and RESOURCE_STALLS.SB.
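
(As a quick aside, the size arithmetic behind the claim that this particular buffer skips Case 1 and lands in Case 2 can be sketched in a few lines. The cache sizes below are assumptions for a typical Skylake client core, 32 KiB L1D and 256 KiB L2 per core; adjust them for your exact model.)

#include <cstddef>
#include <cstdio>

int main()
{
    // Assumed cache sizes for a Skylake client core; check your exact model.
    constexpr std::size_t l1d_bytes = 32 * 1024;   // per-core L1 data cache
    constexpr std::size_t l2_bytes  = 256 * 1024;  // per-core L2

    constexpr std::size_t elements    = 10000;                  // from the question
    constexpr std::size_t buffer_size = elements * sizeof(int); // 40000 bytes

    std::printf("buffer: %zu bytes\n", buffer_size);
    std::printf("fits in L1D: %s\n", buffer_size <= l1d_bytes ? "yes" : "no"); // no  -> not Case 1
    std::printf("fits in L2:  %s\n", buffer_size <= l2_bytes  ? "yes" : "no"); // yes -> Case 2
    return 0;
}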


First, note that the constructor of vector (which will probably get inlined) itself initializes the elements to zero by calling memset internally. memset basically does the same thing as your loop, but it is highly optimized. In other words, in terms of big-O notation, both are linear in the number of elements, but memset has a smaller constant factor. In addition, std::fill also internally calls memset to set all elements to zero (again). std::fill will also probably get inlined (with proper optimizations enabled). Therefore, you really have three loops in that piece of code. It would be more efficient to initialize your vector using std::vector<int> vec(10000u, 5). Now let's get to the microarchitectural analysis of the loop. I'll only discuss what I expect to happen on modern Intel processors, specifically Haswell and Skylake (1).

Let's examine the code carefully:

00007FF6A4521080  inc         edx
00007FF6A4521082  mov         dword ptr [rcx+rax*4],5  
00007FF6A4521089  mov         eax,edx  
00007FF6A452108B  cmp         rax,r9  
00007FF6A452108E  jb          main+80h (07FF6A4521080h) 

The first instruction will be decoded into a single uop. The second instruction will be decoded into two uops that are fused in the frontend. The third instruction is a register-to-register move and is a candidate for move elimination at the register renaming stage. It's hard to know for sure whether the move will get eliminated without running the code (3). But even if it does not get eliminated, the instructions will be dispatched as follows (2):

               dispatch cycle                            |         allocate cycle

cmp         rax,r9                           macro-fused | inc         edx                           (iteration J+3)
jb          main+80h (07FF6A4521080h)     (iteration J)  | mov         dword ptr [rcx+rax*4],5       (iteration J+3)
mov         dword ptr [rcx+rax*4],5       (iteration J+1)| mov         eax,edx                       (iteration J+3)
mov         eax,edx                       (iteration J+1)| cmp         rax,r9                            macro-fused
inc         edx                           (iteration J+2)| jb          main+80h (07FF6A4521080h)     (iteration J+3)
---------------------------------------------------------|---------------------------------------------------------
cmp         rax,r9                           macro-fused | inc         edx                           (iteration J+4)
jb          main+80h (07FF6A4521080h)     (iteration J+1)| mov         dword ptr [rcx+rax*4],5       (iteration J+4)
mov         dword ptr [rcx+rax*4],5       (iteration J+2)| mov         eax,edx                       (iteration J+4)
mov         eax,edx                       (iteration J+2)| cmp         rax,r9                            macro-fused
inc         edx                           (iteration J+3)| jb          main+80h (07FF6A4521080h)     (iteration J+4)

The cmp and jb instructions will get macrofused into a single uop. So the total number of uops is 4 in the fused domain and 5 in the unfused domain. There is exactly one jump among them. Therefore, a single loop iteration can be issued per cycle.

Because of the dependency between inc and the mov-store, these two instructions cannot be dispatched in the same cycle. Nonetheless, the inc of one iteration can be dispatched in the same cycle as uops from the previous iteration, as the schedule above shows.

There are four ports (p0, p1, p5, p6) to which inc and the register mov can be dispatched. There is exactly one port, p6, for the predicted-taken cmp/jb. There are three ports (p2, p3, p7) for the STA uop of mov dword ptr [rcx+rax*4],5 and one port, p4, for the STD uop. (Although p7 cannot handle the specified addressing mode.) Because there is only one port each for the store-data uop and the taken branch, the maximum execution throughput that can be achieved is 1 iteration per cycle.
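
(To put that ceiling into perspective, here is a back-of-the-envelope sketch; the clock frequency is an assumption, not something taken from the analysis above.)

#include <cstdio>

int main()
{
    // Best case implied by the port analysis: 1 iteration per cycle.
    constexpr double iterations      = 10000.0; // loop trip count from the question
    constexpr double cycles_per_iter = 1.0;     // one store-data uop and one taken branch per cycle
    constexpr double clock_ghz       = 3.3;     // assumed clock; substitute what your CPU actually runs at

    const double cycles  = iterations * cycles_per_iter;
    const double seconds = cycles / (clock_ghz * 1e9);
    std::printf("lower bound: %.0f cycles, about %.2f microseconds\n", cycles, seconds * 1e6);
    return 0;
}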

Unfortunately, the throughput will be worse; many stores will miss in the L1D. The L1D prefetchers are not capable of prefetching lines in the Exclusive coherence state and don't track store requests. Fortunately, though, many stores will be combined. Successive stores in the loop target sequential locations in the virtual address space. Since a line is 64 bytes in size and each store is 4 bytes in size, every 16 consecutive stores go to the same cache line. These stores could be combined in the store buffer, but they won't be, because the stores retire as early as possible once they reach the top of the ROB. The loop body is pretty small, so it's very unlikely that more than a few of the 16 stores will get combined in the store buffer. However, when the combined store request gets issued to the L1D, it will miss and an LFB will be allocated, which also supports combining stores. The L2 cache DPL prefetcher is capable of tracking RFO requests, so hopefully we will almost always hit in the L2. But it will take at least 10-15 cycles to get the line from the L2 to the L1. The RFO might be sent early, though, before the store actually gets committed. At the same time, most likely a dirty line needs to be evicted from the L1 to make room for the incoming line. The evicted line will be written into a writeback buffer.

It's hard to predict what the overall effect will be without running the code. The two most relevant performance counters are L1D_PEND_MISS.REQUEST_FB_FULL and RESOURCE_STALLS.SB.

The L1D has only one store port, which is 16 bytes, 32 bytes, and 64 bytes wide on Ivy Bridge, Haswell, and Skylake, respectively. So stores will be committed at those granularities. But a single LFB can always hold a full 64-byte cache line.

The total number of fused store uops is equal to the number of elements (10000 in this case). To get the number of LFBs required, divide by 16, giving 625 LFBs, which is also the number of RFO requests to the L2. It will take 16 cycles before another LFB is required because only one store can be dispatched per cycle. As long as the L2 can deliver the target line within 16 cycles, we will never block on the LFBs and the achieved throughput will be close to 1 iteration per cycle, or in terms of IPC, 5 instructions per cycle. This is only possible if we almost always hit in the L2 in a timely manner. Any consistent delay in cache or memory will significantly reduce the throughput below that. It might go something like this: a burst of 16 iterations executes quickly, then the pipe stalls on the LFBs for some number of cycles. If this number is equal to the L3 latency (about 48 cycles), then the throughput would be around 1 iteration per 3 cycles (16 iterations per 48 cycles).
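
(That burst-then-stall arithmetic can be written down as a small sketch; the refill latency is an assumed parameter here, not a measured value.)

#include <cstdio>

int main()
{
    constexpr unsigned elements        = 10000;                      // vector size from the question
    constexpr unsigned stores_per_line = 64 / sizeof(int);           // 16 stores share one 64-byte line
    constexpr unsigned lines           = elements / stores_per_line; // 625 LFB allocations / RFOs to the L2

    // Simple amortized model: 16 stores issue back-to-back (1 per cycle), then the loop
    // may have to wait for the next line. The refill latency is an assumed knob.
    constexpr double refill_cycles = 48.0; // e.g. an L3-ish latency; ~10-15 for an L2 hit

    const double cycles_per_iter = (refill_cycles <= stores_per_line)
                                       ? 1.0
                                       : refill_cycles / stores_per_line;

    std::printf("%u lines touched, ~%.1f cycles per iteration under this model\n",
                lines, cycles_per_iter);
    return 0;
}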

The L1D has a limited number (6?) of writeback buffers to hold evicted lines. In addition, the L2 has only a single 64-byte port that is used for all communication between the L1D and the L2, including writebacks and RFOs. The availability of writeback buffers may be on the critical path as well. In that case, the number of LFBs may also become a bottleneck, because an LFB will not be written into the cache until a writeback buffer is available. If one isn't available, the LFBs will fill up quickly, especially if the L2 DPL prefetcher is able to deliver the lines in a timely manner. Obviously, streaming cacheable WB stores into the L1D is very inefficient.

If you do run the code, you need to also account for the two calls to memset.
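
(For reference, a minimal sketch of the single-pass initialization suggested at the start of this answer; it avoids both hidden memset passes and the explicit store loop entirely.)

#include <vector>

int main()
{
    // One initialization pass instead of zero-construction + std::fill + the explicit store loop.
    std::vector<int> vec(10000u, 5);
    return 0;
}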


(1) On Sandy Bridge and Ivy Bridge, the instruction mov dword ptr [rcx+rax*4],5 will get unlaminated, resulting in 5 uops per iteration in the fused domain. So the frontend may be on the critical path.

(2) Or something like that, depending on whether the first instruction of the first iteration of the loop gets the first slot of the allocator. If not, the iteration numbers shown need to be shifted accordingly.

(3) @PeterCordes found that move elimination does occur most of the time on Skylake. I can also confirm this on Haswell.

Hadi Brais
  • Not a comment on any DV, but I still don't think case 2 is correct (probably also case 3, but I'll focus on 2). The L2 write port, L1 writeback and FB aren't the bottlenecks here. When the source is in L2, this code executes at ~1 cycle per iteration, meaning that only _one 64-byte line is filled every 16 cycles._ So all the infrastructure you mention has a leisurely 16 cycles to move one line from the L2 to the L1, and write back one line from the L1 to the L2. In reality it can do this in as little as 2 cycles, so there is no bottleneck there. The bottleneck is more or less the same ... – BeeOnRope Sep 13 '18 at 02:45
  • ... as the L1 case, which is why its performance is more or less the same (hint: if you have two quite different explanations for the bottlenecks for two scenarios, running the same code, yet the performance is about the same, you would do well to seriously consider the idea that bottlenecks are the same in both scenarios). – BeeOnRope Sep 13 '18 at 02:46
  • @BeeOnRope I think I agree. Don't hesitate to tell me what you think about Case 3. Is there another way to explain the high `L1D_PEND_MISS.REQUEST_FB_FULL` counts in that case? We can only consider the 1 store per cycle throughput the bottleneck if it means that performance can be improved by increasing it. Now I'm thinking that in Case 1, there is only one store that can be allocated per cycle, so the bottleneck seems to include the allocation width as well. – Hadi Brais Sep 13 '18 at 02:54
  • Thanks for the answer, which takes a close look at the optimizations that take effect in the CPU. My misconception was that the pipeline length is really long (14 stages?) so every stall would mean 14 cycles of waiting. Meanwhile, there is like 1 cycle of a stall at most. – Paweł Kozłowski Sep 15 '18 at 18:43
  • The vector indeed does not fit into the L1d cache, but it wasn't actually my goal to analyze the memory load bottleneck in the program. – Paweł Kozłowski Sep 15 '18 at 18:49
  • @PawełKozłowski There are no other bottlenecks in the pipeline. It's the memory subsystem. It might be beneficial to inline the code into some other function so that there is more compute work between the stores. – Hadi Brais Sep 15 '18 at 19:59
  • About (case 3) I'm not totally sure, it depends on the measurements. You were seeing something like 1 store every 1.3 cycles? So it's still "close" to the theoretical optimum of 1 every 1.0 cycles, and it isn't clear if the store limit (either store port or L1D write ports) is still the primary bottleneck combined with another bottleneck that happens sometimes, or if something else is the bottleneck all the time in a steady state and it works out to 1.3 cycles. You can calculate from known values the expected performance ... – BeeOnRope Sep 21 '18 at 20:22
  • ... for various bottlenecks. For example (for loads) the limit of 10 fill buffers combined with an L3 latency of 40 cycles (for example), means you can only load one line every 40 / 10 = 4 cycles from the L3 (ignoring prefetching, which makes it a bit faster). Also the L3 bandwidth is IIRC reported to be around 16 bytes/cycle, which happens to work out to around the same limit (1 line / 4 cycles). Those are both much faster than the observed performance of 1.3 cycles (1 line / 20ish cycles) though. However, that's for loads! I don't actually fully understand the path for stores. – BeeOnRope Sep 21 '18 at 20:24
  • As you mention, there may be fewer writeback buffers than fill buffers (are these the same thing or are they shared?). There are some interesting performance events there too. If you see the FB buffer event with a lot of counts it could indicate that FBs are a bottleneck also: the RFO has to use those to bring in the line, but when they are the main bottleneck their count is usually around the total number of cycles (in fact for loads, often around 1.5x the number of cycles, perhaps because there are 2 load ports; how this event works isn't clear to me). – BeeOnRope Sep 21 '18 at 20:26
  • You can use the two FB pending-cycles events to measure the average occupancy of the FB, which gives a really good idea of whether it is close to being the bottleneck (if it's close to 10), or whether you are just getting transient FB-full events, e.g., due to the occasional hiccup, cache miss, etc. – BeeOnRope Sep 21 '18 at 20:27
4

In a classic textbook pipeline sense, yes, this appears to be a stall situation, because you have the result of one operation being used as an operand of the next operation. But even in a textbook you will see possible solutions to this.

And in the actual implementation of an x86, in multiple ways, this is not going to have the performance hit that the face-value assembly language might imply.

The same goes for branch prediction on this loop. Branch prediction can come in different forms, even at the same time. The first thing you would think of is logic that somehow pre-computes the result so that a fetch can start early (all branch prediction does is toss out an additional fetch, which by the way can have a negative effect, some number of clock cycles earlier than the normal fetch would happen). Or it may not bother with pre-computing and simply toss out fetches for the alternate path just in case, allowing normal fetching to cover the condition-not-met case. Another solution you can/will see implemented is a simple cache, be it a deep one or a short one: "last time I was near 00007FF6A452108E there was a branch instruction, so let's toss out an early fetch and not bother waiting to see whether the condition passes." Some predictors may only remember the last few branches, some may remember more; for a simple loop like this, whether you run it 10 times or 10 billion, you won't necessarily see the branch prediction.

For many reasons I don't expect you to be able to create something in which you can really see the difference compared to simple noise. First and foremost, you are probably running this on an operating system and asking the operating system, via layers of code, what the time is in order to time this loop. I wouldn't expect you to be able to isolate what you are trying to do here from the noise of the operating system. Running DOS and disabling interrupts would be a start, but I still don't think you will see anything beyond the noise of the processor/system.
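
(To make the timing point concrete, here is a minimal sketch of the kind of coarse, through-the-OS measurement being described, using std::chrono; as noted above, at this granularity the result is dominated by scheduler and timer noise rather than by any individual pipeline stall.)

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> vec(10000u);

    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0u; i < vec.size(); ++i)
    {
        vec[i] = 5;
    }
    const auto stop = std::chrono::steady_clock::now();

    // Interrupts, scheduling and timer resolution all land in this number.
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("loop took ~%lld us (vec[0] = %d)\n", static_cast<long long>(us), vec[0]);
    return 0;
}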

You will want to choose a different processor and system if you want to experiment with or see these kinds of effects. Or you need to study the Intel (or AMD) documentation for the specific chip, stepping, and firmware patch you are using, and then you should be able to craft some instruction sequences that you can detect performing differently compared to functionally identical sequences.

There is a lot of work put into making code perform reasonably well on an x86; that's what the high cost and power consumption is all about. Many of the classic performance traps have been smoothed out, and where you end up finding them is not necessarily apparent from the x86 ISA view (you have to look at the implementation level to see the traps, if any).

old_timer
  • Won't the fact that modern x86 CPUs have a larger register file than what the ISA exposes, plus register renaming, impact this as well? – Jesper Juhl Sep 11 '18 at 18:12
  • Yep, that falls under "implementation"; too many details to list, and historically the list of different Intel implementations over time is long, although the OP did use RAX so that shortens the list. Yes, the result is an operand for the next instruction, textbook-wise, but x86 is far from textbook; look elsewhere. – old_timer Sep 11 '18 at 22:22
1

Out-of-order execution hides the latency of the inc feeding the store addressing mode, as Hadi explained.

The store can't execute until the cycle after the inc from that iteration executes, but inc only has 1-cycle latency on most uarches so there's not much latency for out-of-order execution to hide.


The reason the compiler emits that inefficient loop with an extra mov eax,edx is that you used an unsigned (32-bit) loop counter with a 64-bit size_t upper bound.

unsigned types in C++ have well-defined overflow behaviour (wraparound) that the compiler has to implement (unlike signed overflow being UB). So as written, the loop is infinite if vec.size() > UINT_MAX, and gcc has to make code that matches the abstract-machine's behaviour for that case. This stops your compiler from auto-vectorizing.

(And compilers generally don't get aggressive about infinite loops being UB, even though ISO C++ says they are if they contain no volatile or atomic operations, or library calls.)

If you'd used int i, you wouldn't have this problem. Signed overflow is UB, so the compiler can assume it doesn't happen and promote i to the width of size_t and pointers. Or better, use size_t i; that's what it's for. Either way, hopefully the compiler can convert the loop to a pointer-increment and use a simple addressing mode, and auto-vectorize with SSE or AVX to do 16 or 32-byte stores.
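
(A minimal sketch of that fix: with size_t the counter already matches the type of vec.size(), so the compiler can use a simple addressing mode and, with optimizations enabled, will typically auto-vectorize the fill.)

#include <cstddef>
#include <vector>

int main()
{
    std::vector<int> vec(10000u);

    // size_t matches the type of vec.size(): no zero-extension dance, no wraparound corner case.
    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        vec[i] = 5;
    }

    // A range-based loop (or std::fill) expresses the same thing and is just as easy to vectorize:
    // for (int &x : vec) x = 5;

    return 0;
}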


The extra mov eax,edx is 100% redundant, though. i is already correctly zero-extended into RDX, so the compiler could use inc edx / cmp rdx, r9. This is a missed optimization in whatever compiler you're using.

Peter Cordes