3

Consider the following x86 assembly:

; something that sets rax
mov rcx, [rdi]
xor rax, rcx
xor rax, rcx

At the end of the sequence, rax has the same value as it had on entry, but from the point of view of the CPU its value depends on the value loaded from memory into rcx. In particular, subsequent use of rax won't start until that load and the two xor instructions complete.

Is there any way to achieve this effect more efficiently than the two-xor sequence, e.g., with a single one-uop one-cycle-latency instruction? It's OK if some constant value needs to be set up once before the sequence (e.g., having a zeroed register).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • 2
    Maybe `shld rax, rcx, 0`? Or `imul rax, rcx, 1`? – Ross Ridge Aug 02 '18 at 06:28
  • @RossRidge, ah, those are both nice. Too bad they are three cycles latency though (in my particular case that happens to matter more than the uop count). I could have made it clearer in the question... – BeeOnRope Aug 02 '18 at 06:34
  • why do you need to xor 2 times like that? Are you using the value of rax after the first xor? – phuclv Aug 02 '18 at 06:40
  • How does `bt rax,rcx` fare in terms of performance? And does it create dependency like you want? (will also modify CF, which may be OK, as your `xor` does modify it too, although in fixed CF=0 way). (edit: wait, `bt` doesn't modify `rax`, so in case it's implemented like that, the further use of `rax` may be done ahead ... hm :/ doesn't seem good) – Ped7g Aug 02 '18 at 06:43
  • 2
    @phuclv 2 times to revert `rax` to original value, i.e. creating "nop" operation, but with dependency on `rcx` value, which is loaded from memory. BeeOnRope is working on some kind of micro benchmarks, exercising all kind of internal constraints of modern x86 architecture, so his code sometimes looks a bit weird, it's just artificial benchmarking piece of code, not "real" code doing something "meaningful". – Ped7g Aug 02 '18 at 06:45
  • 2
    @RossRidge `imul rax, rcx, 1` will move `rcx` into `rax`, destroying `rax` content? ... maybe `imul rcx, rax` would work as desired, if `rcx` may be destroyed. = nope, that makes `rax` independent, ups. Too early for me... – Ped7g Aug 02 '18 at 06:50
  • @Ped7g Oh, right yah that won't work. – Ross Ridge Aug 02 '18 at 06:56
  • Then some kind of `rol/ror rax,cl` is maybe another option, if the `rcx` contains value divisible by 64 (either the memory is set up like that, or another `and rcx,-64` after load), but I think those are a bit slow too, so like on par with the suggested `shld` one. – Ped7g Aug 02 '18 at 06:59
  • 2
    I was thinking along the lines of `xor ecx, ecx ; add rcx, [edi] ; adc rax, 0`. – David Wohlferd Aug 02 '18 at 07:11
  • @BeeOnRope: updated my answer to fix the latencies; I had time to check Agner Fog's tables and `shld` is 3c latency, unfortunately. – Peter Cordes Aug 02 '18 at 19:15
  • It sounds like you want to depend on the value from memory. Are you sure this is best rather than a memory barrier? – dave Aug 03 '18 at 01:03
  • @dave - I'm just trying to set up an OoO dependency between two registers. Yes, in this example `rcx` came from memory, but not always. In any case it is not clear how a memory barrier would help even for this example (even if it could somehow create the dependency, it would be terribly slow). – BeeOnRope Aug 03 '18 at 01:09

1 Answers1

5

With only 1 uop / 1c latency on the critical path of the target register:

# target=rax  extra source=rcx
mov  edx, ecx    ; no latency
and  edx, 0      ; BMI1  ANDN could mov+and in 1 uop, port 1 or 5 only on SnB-family (Ryzen: any)

or   rax, rdx

AND with zero is not special-cased as a dep-breaking zeroing idiom on any CPUs, AFAIK.

front-end uops: 3 (or 2 with BMI1). Latency:

  • from rcx to rax: 2c (with mov-elimination or BMI1).
  • from rax(input) to rax(output): 1c

With a zeroed register, if it's ok to couple all the dep chains into that one register (unlike the ANDN version that only reads an all-ones register):

and   edx, ecx         # 0 &= ecx
or    rax, rdx         # rax |= 0

To test a function's latency (not throughput) but still feed it the same input repeatedly:

.loop:
    call  func        ; arg in RDI, return in RAX
    mov   rdi, rbx    ; arg for next iter, off the critical path

    and   eax, 0      ; 1c latency
    or    rdi, rax    ; 1c latency

   jmp   .loop

If the function is pure, we can do 1c / 1uop

Actually it just has to return a known value for a given input. This also works if its impurities are limited to having other side-effects / outputs.

Instead of XOR twice after getting the result, set things up so we already have an XOR that we can unscramble with only one more XOR. Or use addition because LEA lets us copy-and-add in one instruction, saving a mov that wouldn't be on the critical path.

    mov   rdi, rbx        ; original input
    call  func
    sub   rbx, rax        ; RBX = input - output

.loop:
    call  func
    lea   rdi, [rbx + rax]   ; RDI = (input-output) + output = input
    jmp  .loop

@RossRidge's suggestion is only 1 uop on SnB-family CPUs, but only runs on port 1:

shld rax, rcx, 0

3c latency, 1 uop for port 1 on HSW/SKL. Agner Fog reports 1c latency for IvB, but 3c for HSW/BDW/SKL.

shld r,r,i is 2 uops on older Intel, and significantly slower on AMD, like 6 uops / 3c latency on Piledriver / Ryzen.

Note that instlatx64 reports 1c latency / 0.5c throughput for shld/shrd on Haswell/Skylake (like single-register shifts), but I tested myself and it's definitely 3c latency / 1c throughput. Reported as an instlatx64 bug on their github page.

SHLD could also be interesting for copying a 32-bit register with a dependency on another. e.g. @BeeOnRope describes wanting to call a function repeatedly with the same input value in RDI, but with a dependency on the result in RAX. If we only care about EDI, then

; RBX = input<<32
call  func
mov   edi, eax         ; 0 latency with mov-elimination
shld  rdi, rbx, 32     ; EDI = the high 32 bits of RBX, high bits of RDI = old EDI.

Of course that's pointless vs. this, which doesn't need mov-elimination

call   func
mov    rdi, rbx        ; off critical path
shld   rdi, rax, 0     ; possibly 1c latency on SnB / IvB.  3 on HSW/SKL

A modification of @DavidWholford's suggestion works, too:

test ecx,ecx     ; CF=0, with a false dependency on RCX
adc  rax, 0      ; dependent on CF

2 uops on Haswell/Broadwell/Skylake and AMD. 3 uops on Intel P6-family, and maybe SnB/IvB. Latency:

  • from rcx to rax: 2c on HSW and later, 3 with 2-uop adc
  • from rax to rax: 1c on HSW and later, 2 with 2-uop adc

ADC on Haswell and earlier is normally 2 uops, but adc with immediate 0 is special-cased on Haswell to only be 1 uop / 1c. adc eax,0 is always 2c latency on Core 2. The first uarch with this optimization might be SnB, but hopefully we get an answer on Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

test clears CF regardless of the value, but I think (untested) that CF still carries a dependency on the source register. If not, then perhaps use TEST / ADOX could be useful on Broadwell and later. (Because CF is renamed separately on most CPUs, but OF might only be part of the same bundle as ZF / SF and other flags that do depend on the AND result.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Do you mind helping me understand the first part? Particularly the "from rax to rax" latency? Thank you! – Margaret Bloom Aug 02 '18 at 16:42
  • @MargaretBloom: RAX(input) to RAX(output), i.e. before/after the `or rax,rdx`, because that's the only instruction that touches RAX. The instructions setting up RDX are off the critical path, if the critical path is the dep chain through RAX. – Peter Cordes Aug 02 '18 at 19:03
  • @BeeOnRope: yeah. You don't happen to have a HSW to test that, do you? My HSW test box is out of commission: I lent the RAM to upgrade my brother's girlfriend's old SnB macbook pro laptop, but my HSW laptop won't boot with regular DDR3, it needs DDR3L :/ – Peter Cordes Aug 02 '18 at 23:07
  • @PeterCordes - I do have access to one (BTW places like TravisCI often use Haswell-era servers, so if you have a little benchmark suite that you integrate with TravisCI you can see the results on Haswell every time you make a change). In practice results seem good even on these VMs (you can tell what approximate arch you are running on by the `--test-name=basic/*` tests in uarch-bench for example). So you want me to run a chain of `adc rax, 0` type things? What about a throughput test? Not obvious how to do throughput because each `adc` depends on the former through `CF`, right? – BeeOnRope Aug 02 '18 at 23:13
  • @BeeOnRope: yeah, the hypothesis is that `adc rax, 0` is 1 uop / 1c, while `adc rax, 1` or `adc rax, rdx` is 2 uops / 2c. i.e. that the decoders notice that the imm8 is 0 and leave out the uop that gets that input operand. A dep chain through RAX or flags is fine. – Peter Cordes Aug 02 '18 at 23:15
  • @PeterCordes - [here are the tests](https://github.com/travisdowns/uarch-bench/commit/872e2a5bc84faa9bbf0e1252c186d053ccc7540d#diff-8485929ddd759f9862831b87d83018c6). I'll run them on Haswell now but I don't have root so can't see the performance counters. They report identical 1 cycle latency and 0.5 cycle recip tput on Skylake (including the `xor eax, eax` to break the dep chain in the tput case). – BeeOnRope Aug 02 '18 at 23:35
  • 2
    @PeterCordes - the rumors are true! `adc rax, 0` has 1 cycle latency, `adc rax, 1` has 2 cycle latency on Haswell. Recip througput is 0.5 vs 0.76 (keep in mind there are 2 xors in there too, so 0.75 is the max possible rtput if `adc` is 2 uops). – BeeOnRope Aug 02 '18 at 23:39
  • @PeterCordes - [link to the results](https://travis-ci.org/travisdowns/uarch-bench/jobs/411529902#L926). I also added it to [Intel Performance Quirks](https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks). – BeeOnRope Aug 03 '18 at 00:06
  • @BeeOnRope: typo in the results: `Inependent push/pop chain` missing a `d` in "indep". Very cool, thanks for checking this. The question now is how far back that goes. I'll test on Core2 in a few minutes. – Peter Cordes Aug 03 '18 at 00:08
  • Yeah that typo has been there forever, and I feel like you even pointed it out before (or someone else did), so I finally fixed it :). – BeeOnRope Aug 03 '18 at 00:11
  • @BeeOnRope: It's *not* supported on Core 2. My best guess is that it's probably new with SnB (which redesigned the decoders), but I asked [Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?](https://stackoverflow.com/q/51664369) with my test program and results, and a link to uarch-bench. – Peter Cordes Aug 03 '18 at 02:08
  • While I'm giving you the upvote (after all, you did all the work), I'm going to (mentally) take partial credit for bumping into something interesting here. `adc rax, 0` - who knew? – David Wohlferd Aug 03 '18 at 03:54
  • @DavidWohlferd: heh, thanks. I'd been curious about it for a while, glad we finally got that sorted out. I hadn't been thinking of a dependency through flags at all until I saw your comment, so you can definitely take credit for that idea. – Peter Cordes Aug 03 '18 at 03:56
  • 1
    I think maybe I specified the question a little bit wrong because the 1-cycle latency idea (the `mov;and;or` one) doesn't really work in most of real scenarios. I think the real scenario is something like a 1-arg function (arg in `rdi`) which returns 1 result (in `rax`). So I need to make `rdi` for the next call depend on `rax`. In practice `rdi` also needs to have its value reset since it is callee clobbered, let's say from `rbx`. So I guess I get the 2 cycle "rcx to rax" cost in practice. I didn't see a way to make it 1 cycle. – BeeOnRope Aug 07 '18 at 00:36
  • @BeeOnRope: on IvB with 1c `shld` (if that's accurate), we can achieve the magical 1c critical path latency for 32-bit inputs. See my updated answer (which also added a code block for the AND/OR method for the use-case you describe. It's actually *easier* there, and doesn't depend on mov-elimination.) – Peter Cordes Aug 07 '18 at 01:50
  • 1
    @BeeOnRope: For pure functions, keep `input - output` in a register, and regenerate `input` with LEA. Updated my answer again. – Peter Cordes Aug 07 '18 at 05:59