I am contemplating an implementation of SHA3 in pure assembly. SHA3 has an internal state of 25 64-bit unsigned integers (1600 bits), but because of the transformations it uses, the best case would be having 44 such integers available in registers, plus possibly one scratch register. In that case, I would be able to do the entire transform in registers.
But this is unrealistic, and optimisation is possible all the way down to even just a few registers. Still, more is potentially better, depending on the answer to this question.
I am thinking of using the MMX registers for fast storage at least, even if I'll need to swap into other registers for computation. But I'm concerned about that being ancient architecture.
Is data transfer between an MMX register and, say, RAX going to be faster than indexing u64s on the stack and accessing them from what's likely to be L1 cache? And even if it is, are there hidden pitfalls beyond speed that I should watch for? I am interested in the general case, so a result showing that one is faster than the other on my computer might still be inconclusive.
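To make the comparison concrete, here is a rough C sketch of the two options, assuming GCC or Clang on x86-64. The function names `park_via_mmx` and `park_via_stack` are hypothetical, invented only for illustration. The intrinsics `_mm_cvtsi64_m64`, `_mm_cvtm64_si64` and `_mm_empty` come from `<mmintrin.h>`; on other targets the sketch falls back to a plain copy. It is not a benchmark, and the compiler is free to generate different code than the comments suggest, so hand-written assembly would still need to be measured separately.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__MMX__)
#include <mmintrin.h>
#define HAVE_MMX_PARKING 1
#else
#define HAVE_MMX_PARKING 0   /* assumption: no MMX, use a plain copy instead */
#endif

/* Hypothetical helper: park one 64-bit lane in an MMX register, then
 * move it back to a general-purpose register. */
uint64_t park_via_mmx(uint64_t lane)
{
#if HAVE_MMX_PARKING
    __m64 parked = _mm_cvtsi64_m64((long long)lane);    /* typically movq mm, r64 */
    uint64_t back = (uint64_t)_mm_cvtm64_si64(parked);  /* typically movq r64, mm */
    _mm_empty();  /* emms: MMX registers alias the x87 FPU registers */
    return back;
#else
    return lane;
#endif
}

/* Hypothetical helper: spill the same lane to a stack slot and reload it.
 * `volatile` only keeps the compiler from optimising the store/load away. */
uint64_t park_via_stack(uint64_t lane)
{
    volatile uint64_t slot = lane;
    return slot;
}
```

Both helpers should return their argument unchanged, for example `park_via_mmx(0x0123456789abcdefULL)`. One non-speed pitfall is already visible here: because the MMX registers alias the x87 register file, `emms` has to run before any x87 floating-point code executes after MMX use.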