sha256rnds2 implicit register xmm0

Question

According to [1], the sha256rnds2 instruction has an implicit third operand that uses register xmm0. This is what prevents me from efficiently computing SHA-256 over multiple buffers simultaneously, which I hoped would let me fully utilize the CPU's execution pipelines.

Other multibuffer implementations (e.g. [2], [3]) use two different techniques to overcome this:

1. Compute rounds sequentially
2. Partially utilize parallelization when it's possible

My question is why this instruction was designed this way: with an implicit operand that acts as a barrier, preventing us from utilizing multiple execution pipelines or from effectively issuing two back-to-back instructions given the reciprocal throughput.

I see three possible reasons:

1. Initially SHA-NI was considered an extension for low-performance CPUs, and no one thought that it would become popular in high-performance CPUs, hence no support for multiple pipelines.
2. There is a limit on the instruction encoding/decoding side: there are not enough bits to encode a third register, which is why it's hardcoded.
3. sha256rnds2 has tremendous energy consumption, and this is why it's not possible to have multiple execution pipelines for it.

Links:

With register renaming, the fixed implicit operand shouldn't really interfere with simultaneous execution. In other words, if you write `sha256rnds2 xmm1, xmm2 ; movdqa xmm0, xmm3 ; sha256rnds2 xmm4, xmm5` then nothing prevents the two `sha256rnds2`s from executing simultaneously in separate pipelines, as they have no dependencies. The architectural `xmm0` would be renamed to different internal registers for the different instructions. — Nate Eldredge, Dec 01 '21 at 19:08
So the underlying reason is probably #2, but its impact is not as much as you think. Of course, due to #1 and #3, any given CPU may or may not actually have more than one pipeline that can execute this instruction - but if it does, there's no reason you can't use them all. — Nate Eldredge, Dec 01 '21 at 19:09
@NateEldredge: Looks like they wanted to avoid a VEX encoding, so they could provide SHA extensions on low-power Silvermont-family CPUs that don't have AVX/BMI instructions. (Where it's most useful.) So (1) led to (2), but not because it isn't pipelined. According to https://uops.info/ and https://agner.org/optimize/, Ice Lake has one execution unit for `SHA256RNDS2` on port 5, with 6 cycle latency but pipelined at 3c throughput. So 2 can be in flight at once. Not close to a front-end bottleneck with an extra `movdqa`. — Peter Cordes, Dec 02 '21 at 02:20
It's equally pipelined in Goldmont, with SHA256RNDS2 as 3 uops, 8c latency, 4c throughput. While SHA1 is better pipelined (1 uop, 5c lat, 2c tput). Zen2 also has one pipelined execution unit; Zen3 has two units, 4c latency 2c throughput for SHA256. — Peter Cordes, Dec 02 '21 at 02:27
Swapping xmm0, besides `movdqa`, requires stores/loads from memory - 7 xmm registers are used per buffer: two for states and five for msgtmps. For two buffers I need 14 registers + 1 xmm0. The last register might be used either for SHUF_MASK or as a scratch for xmm0. In either case there is register spilling. — , Dec 02 '21 at 07:38
Ah, I see. Fortunately L1d cache is fast, so you're probably fine reloading XMM0 (or some other read-only vector constant) from stack space (or .rodata if it's a true constant, not just loop-invariant). You'd want to *avoid* store/reload if you can, as that potentially puts store-forwarding latency on the critical path. — Peter Cordes, Dec 02 '21 at 11:06

score 1 · Accepted Answer · answered Dec 11 '21 at 18:48

Register renaming makes this a non-problem for the back-end. (See "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" for info on how register renaming hides write-after-write and write-after-read hazards.)

At worst this costs you an extra movdqa xmm0, whatever or vmovdqa instruction before some or all of your sha256rnds2 instructions, costing a small amount of front-end throughput. Or I guess if you're out of registers, then maybe an extra load, or even a store/reload.
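As a rough illustration of that extra `movdqa`, here is a minimal Intel-syntax sketch of two rounds from each of two independent buffers. The register assignments (xmm1..xmm6) and the stack slot are hypothetical and not taken from any of the implementations mentioned in the question. Your assembler may accept, or expect, the implicit operand written out as a third operand (`sha256rnds2 xmm1, xmm2, xmm0`).

```nasm
; Hypothetical register layout, for illustration only:
;   buffer A: state in xmm1 (CDGH) and xmm2 (ABEF), W+K words in xmm3
;   buffer B: state in xmm4 (CDGH) and xmm5 (ABEF), W+K words in xmm6

    movdqa      xmm0, xmm3      ; implicit operand for buffer A
    sha256rnds2 xmm1, xmm2      ; 2 rounds of A; also reads xmm0[63:0]
    movdqa      xmm0, xmm6      ; this write to xmm0 is renamed to a new physical register
    sha256rnds2 xmm4, xmm5      ; 2 rounds of B; no dependency on A's instruction

; If no spare register holds the W+K words, a reload can replace the copy:
;   movdqa      xmm0, [rsp + 16] ; hypothetical 16-byte-aligned stack slot
```

The second `sha256rnds2` depends only on xmm4, xmm5, and the most recent write to xmm0, so it does not have to wait for the first one to finish.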

Looks like they wanted to avoid a VEX encoding, so they could provide SHA extensions on low-power Silvermont-family CPUs that don't have AVX/BMI instructions. (Where it's most useful because the CPU is slower relative to the amount of data it's throwing around.) So yes, only 2 explicit operands could be encoded via the normal ModRM mechanism in x86 machine code. x86 does three-register instructions with VEX prefixes, which provide a new field for another 4-bit register number. (vpblendvb has 4 explicit operands, with the 4th register number encoded in an immediate byte, but that's crazy and requires special decoder support.)

So (1) led to (2), but not because of any lack of pipelining.

According to https://uops.info/ and https://agner.org/optimize/, the SHA256RNDS2 instruction is at least partially pipelined on all CPUs that support it. Ice Lake has one execution unit for SHA256RNDS2 on port 5, with 6 cycle latency but pipelined at 3c throughput. So 2 can be in flight at once (6c latency / 3c throughput), as long as they are independent of each other. That's not close to a front-end bottleneck, even with an extra movdqa.
