According to [1], the sha256rnds2 instruction has an implicit third operand that is hardcoded to register xmm0. This is what prevents me from efficiently computing SHA-256 over multiple buffers simultaneously, and thus from fully utilizing the CPU's execution pipelines.
Other multibuffer implementations (e.g. [2], [3]) use two different techniques to overcome this:
- Compute rounds sequentially
- Partially utilize parallelization when it's possible
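To make the xmm0 dependency concrete, here is a minimal software model of sha256rnds2, based on my reading of the pseudocode in the instruction reference. It is only a sketch: the 128-bit lane layout is abstracted into plain tuples, and the names sha256rnds2, compress and sha256_one_block are mine, not part of any library. The point is that the implicit operand carries a fresh W+K pair on every invocation, so every stream has to reload xmm0 right before its round instruction.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

def icbrt(n):
    # Integer cube root by binary search.
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Round constants and initial state derived as in FIPS 180-4
# (fractional parts of cube/square roots of the first primes).
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in first_primes(8)]

def sha256rnds2(cdgh, abef, xmm0):
    """Model of two SHA-256 rounds.

    cdgh models the destination/first source register, abef the second
    source, and xmm0 the implicit operand holding (W[t]+K[t], W[t+1]+K[t+1]).
    Returns the new (A, B, E, F); the new (C, D, G, H) is the old abef.
    """
    c, d, g, h = cdgh
    a, b, e, f = abef
    for wk in xmm0[:2]:
        ch = (e & f) ^ (~e & g)
        maj = (a & b) ^ (a & c) ^ (b & c)
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        t1 = (h + s1 + ch + wk) & MASK
        a, b, c, d, e, f, g, h = (
            (t1 + s0 + maj) & MASK, a, b, c, (d + t1) & MASK, e, f, g)
    return (a, b, e, f)

def compress(state, block):
    # Standard message schedule expansion.
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    abef, cdgh = (a, b, e, f), (c, d, g, h)
    for t in range(0, 64, 2):
        # This pair changes every call, so "xmm0" must be rewritten
        # before each of the 32 invocations.
        xmm0 = ((w[t] + K[t]) & MASK, (w[t + 1] + K[t + 1]) & MASK)
        abef, cdgh = sha256rnds2(cdgh, abef, xmm0), abef
    a, b, e, f = abef
    c, d, g, h = cdgh
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_one_block(msg):
    # Single-block messages only (up to 55 bytes), to keep the sketch short.
    assert len(msg) <= 55
    block = (msg + b"\x80" + b"\x00" * (55 - len(msg))
             + struct.pack(">Q", 8 * len(msg)))
    return struct.pack(">8I", *compress(H0, block))

if __name__ == "__main__":
    print(sha256_one_block(b"abc") == hashlib.sha256(b"abc").digest())
```

In a multi-buffer loop, each buffer has its own xmm0 pair per step, so the writes to the single architectural register xmm0 have to be interleaved with the round instructions of every stream.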
My question is: why was this instruction designed this way, with an implicit operand that acts as a barrier and prevents us from utilizing multiple execution pipelines, or from efficiently issuing two such instructions back to back given the instruction's reciprocal throughput?
I see three possible reasons:
- Initially, SHA-NI was intended as an extension for low-performance CPUs, and no one expected it to become popular in high-performance CPUs, hence no support for multiple pipelines.
- There is a limit on the instruction encoding/decoding side: there are not enough bits to encode a third register, which is why it is hardcoded.
- sha256rnds2 has tremendous energy consumption, which is why it is not possible to have multiple execution pipelines for it.
Links: