
I am contemplating an implementation of SHA3 in pure assembly. SHA3's internal state is 25 64-bit unsigned integers (of which 17 form the rate for SHA3-256), but because of the transformations it uses, the best case would be if I had 44 such integers available in registers, plus possibly one scratch register. In that case, I would be able to do the entire transform in the registers.

But this is unrealistic, and optimisation is possible all the way down to even just a few registers. Still, more is potentially better, depending on the answer to this question.

I am thinking of using the MMX registers for fast storage at least, even if I'll need to swap into other registers for computation. But I'm concerned about that being ancient architecture.

Is data transfer between an MMX register and, say, RAX going to be faster than indexing u64s on the stack and accessing them from what's likely to be L1 cache? Or even if so, are there hidden pitfalls besides considerations of speed I should watch for? I am interested in the general case, so even if one was faster than the other on my computer, it might still be inconclusive.

Peter Cordes
WDS
  • What architecture do you target? x86 or x86-64? Why not use SSE registers instead? Using MMX is generally not a good idea. Refer to Agner Fog's tables for the actual performance of instructions. Fetching data from the stack is generally pretty fast. – fuz Dec 08 '18 at 12:54
  • X86-64. It's just a hobby thing, like someone else trying to make his car as fast as possible. I have a pure Rust implementation that is faster than Sha3Sum on Linux, but thinking Asm will torch them both. Thank you for the XMM suggestion. I planned to try them too but left them out here to keep the question simplest. Definitely appreciate your advice on avoiding MMX. Will check Agner Fog's tables now. – WDS Dec 08 '18 at 13:01
  • You might even want to use AVX2, which allows you to put the entire state into registers. – fuz Dec 08 '18 at 13:05
  • if you're using x86-64 then there's no need to worry "about that being ancient architecture", because x86-64 implies at least SSE2. The use of MMX is generally discouraged because the calling convention for x87 is not very good – phuclv Dec 08 '18 at 15:22

1 Answer


Using YMM registers as a "memory-like" storage location isn't a win for performance, and MMX wouldn't be either. The use-case for that trick is completely avoiding memory accesses that might disturb a micro-benchmark.

Efficient store-forwarding and fast L1d cache hits make using regular RAM a very good option. x86 allows memory operands, like add eax, [rdi], which modern CPUs decode to a single uop.

With MMX you'd need 2 instructions, like movd edx, mm0 / add eax, edx. So that's more uops, and more latency. movd or movq latency to or from MMX or XMM registers is worse than the 3-to-5-cycle store-forwarding latency on typical modern CPUs.

But if you don't need to move data back and forth often, you might be able to usefully keep some of your data in MMX / XMM registers and use pxor mm0, mm1 and so on.

If you can schedule your algorithm so that using movd/movq (int<->XMM or int<->MMX) and movq2dq/movdq2q (MMX<->XMM) instructions instead of stores, loads, and memory operands gives you fewer total instructions / uops, then it might be a win.

But on Intel before Haswell there are only 3 ALU execution ports, so if you leave the store/load ports idle, the 4-wide superscalar pipeline can bottleneck on ALU throughput before it bottlenecks on the front-end.

(See https://agner.org/optimize/ and other performance links in the x86 tag wiki.)

Peter Cordes
  • A lot of operations will need to happen on each u64 relative to the number of times the subset of such operations occurs that cannot be done in place -- the subset that will force me out of GP registers and into something else. Based on responses I received, I think I'll stay in gp regs and the stack at least for now. That wiki you linked at the bottom is outstanding. I have a long flight tomorrow, and now I have the intel docs on my kindle to pass the time. – WDS Dec 09 '18 at 09:10
  • @WDS: I'd definitely recommend starting with Agner Fog's asm optimization guide and/or microarch pdf. Intel's optimization manual has some pretty scattered advice in some sections, some of which dates from Pentium 4 and is no longer relevant. (Having looked over Agner Fog's microarch guide for Haswell vs. Core 2 will help you understand what's still relevant.) – Peter Cordes Dec 09 '18 at 09:56
  • I tested using xmm registers instead of memory, and indeed, a simple loop adding them up in a GPR is about 50% slower than having the same data in memory. It is a pity. – Marcel Hendrix Feb 26 '22 at 23:18