
I am using an SSE intrinsic with one of the arguments being a memory location (_mm_mul_ps(xmm1, mem)).

I am not sure which of the following will be faster:

xmm1 = _mm_mul_ps(xmm0,mem)  // mem is 16 byte aligned

or:

xmm0 = _mm_load_ps(mem);
xmm1 = _mm_mul_ps(xmm1,xmm0);

Is there a way to specify alignment with the _mm_mul_ps() intrinsic?

tonyjames
  • It probably doesn't make any difference - let the compiler take care of the heavy lifting and don't worry too much about minor details - and of course you should benchmark/profile your code to see what really matters. – Paul R Jul 09 '15 at 12:21

1 Answer


There is no _mm_mul_ps(reg, mem) form, even though the mulps reg, mem instruction form exists - https://msdn.microsoft.com/en-us/library/22kbk6t9(v=vs.90).aspx

What you can do is _mm_mul_ps(reg, _mm_load_ps(mem)), and it is going to compile to exactly the same code as writing it in two lines.
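For example, something like this (a minimal sketch with made-up function names; mem is assumed to be 16-byte aligned) should compile to the same movaps/mulps sequence either way once optimization is enabled:

#include <xmmintrin.h>

__m128 scale_one_line(__m128 v, const float *mem)   /* mem assumed 16-byte aligned */
{
    return _mm_mul_ps(v, _mm_load_ps(mem));         /* load folded into the multiply */
}

__m128 scale_two_lines(__m128 v, const float *mem)
{
    __m128 m = _mm_load_ps(mem);                    /* explicit aligned load */
    return _mm_mul_ps(v, m);                        /* multiply on registers */
}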

You can use _mm_load_ps and _mm_loadu_ps to specify whether you expect your data to be aligned. BTW, there is no penalty for doing unaligned loads on aligned data starting from the Haswell microarchitecture.
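As a quick sketch (the buffer names and values are made up), the aligned intrinsic maps to movaps and the unaligned one to movups:

#include <xmmintrin.h>
#include <stdalign.h>                               /* C11 alignas */

/* Hypothetical buffers, only to illustrate the two load intrinsics. */
static alignas(16) float aligned_buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};
static float plain_buf[5]               = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};

__m128 demo_loads(void)
{
    __m128 a = _mm_load_ps(aligned_buf);        /* movaps: requires 16-byte alignment */
    __m128 u = _mm_loadu_ps(&plain_buf[1]);     /* movups: works for any alignment */
    return _mm_mul_ps(a, u);
}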

On the other hand, the compiler should be smart enough to figure out whether it is better to do the load first and then the multiply, or to do the multiply straight from memory.

In some cases it might make sense to do the load a bit in advance to improve software pipelining, but usually this is going to be the next level of optimization.
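A rough sketch of that idea (the function name is made up, and it assumes n is a positive multiple of 4 and src/dst are 16-byte aligned): the load for the next iteration is issued before the current result is consumed:

#include <xmmintrin.h>

void scale_array(float *dst, const float *src, int n, float k)
{
    const __m128 vk = _mm_set1_ps(k);
    __m128 next = _mm_load_ps(src);                 /* first load issued ahead of the loop */
    for (int i = 0; i < n; i += 4) {
        __m128 cur = next;
        if (i + 4 < n)
            next = _mm_load_ps(src + i + 4);        /* start the next load before cur is consumed */
        _mm_store_ps(dst + i, _mm_mul_ps(cur, vk)); /* multiply and store the current block */
    }
}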

Elalfer
  • gcc accepts code that uses arbitrary expressions of `__m128` type as args to intrinsics, including array lookups or pointer dereferences. You only need the load intrinsics if you want to tell the compiler that unaligned loads are needed, or something. Do other compilers have problems with something like `_mm_mul_ps(var, vec_array[i])`? – Peter Cordes Jul 09 '15 at 20:42
  • None that I know of, so it should be legal to do, but it is not a `mem` form per se. Another drawback is that the compiler is most likely going to use an unaligned load if it can't figure out the buffer alignment (which is quite a hard thing to do in the general case). – Elalfer Jul 09 '15 at 21:17
  • Yeah, or it will put a scalar prologue that runs until reaching an aligned point. My point was that you don't need an intrinsic that has a `mem` form, because you can just use the pointer-dereference operator, `*`, on any vector in memory. If you're compiling for an AVX target, then a compiler can generate VEX-encoded instructions from the intrinsics, removing alignment requirements. But otherwise I agree, there is some use to explicitly using aligned load instructions, because that should tell the compiler that segfault on unaligned is what you want. – Peter Cordes Jul 10 '15 at 00:52
  • I agree that there is no need for `mem` forms. But: 1. The compiler doesn't generate a prologue when one uses intrinsics; the user must take care of it. That happens only in autovectorization cases. 2. VEX encoding doesn't remove alignment requirements for aligned loads, but Haswell removed the penalty for using unaligned loads on aligned data, so the compiler (at least ICC) is always going to generate unaligned loads when compiling for AVX2. – Elalfer Jul 10 '15 at 03:47
  • 1. yeah derp, I was thinking about autovectorization. 2. VEX means the compiler can fold unaligned loads into non-`mov` instructions as memory operands. – Peter Cordes Jul 10 '15 at 03:54
  • Agner Fog's insn tables show that it was Nehalem that made `movdqu` the same cost as `movdqa` (barring the small penalties for cases when the data actually *is* misaligned.) On Core2/Penryn, `movdqu` was 4 uops. 2 p0/5, 2 p2. – Peter Cordes Jul 10 '15 at 03:58
  • 2. Ok, now I see what you mean. I'd say it's a relaxed requirement on memory-reference alignment in the AVX instruction set, and VEX is just a prefix to enable the 3rd operand. – Elalfer Jul 10 '15 at 04:04
  • `mulps xmm0, [unaligned]` will fault on Sandybridge, `vmulps xmm0, xmm0, [unaligned]` won't. Using the VEX encoding, and using the AVX version, is synonymous. VEX is how the CPU knows to not enforce alignment, as well as that it's 3-operand. – Peter Cordes Jul 10 '15 at 04:17
  • That should be correct as far as I'm concerned. VEX is also used in AVX2 and AVX512. So I'm just trying to be a bit more precise and not mix up encodings and instruction sets. But otherwise, I think we are on the same page here ;) – Elalfer Jul 10 '15 at 04:17
  • IIRC, AVX512 actually uses EVEX, a different encoding scheme. >.< You have a good point that there are VEX-encoded instructions that aren't part of AVX itself (as in, what you are guaranteed to have if that CPUID feature bit is set). AVX2, FMA, and AVX+AES are examples, so good point. But using the VEX-encoded version of an SSE* instruction is a well-defined thing, and only applies to intrinsics / mnemonics that have non-VEX versions, which is why I said it that way. (i.e. no source change, just compiling with AVX support, lets the compiler fold more loads.) Anyway, yeah we agree. :) – Peter Cordes Jul 10 '15 at 04:23
  • @PeterCordes, [yes it was Nehalem that made `movdqu` and `movdqa` have the same throughput and latency.](http://stackoverflow.com/a/20654007/2542702). – Z boson Jul 10 '15 at 06:48
  • There is no difference in penalty for unaligned loads on aligned data since Nehalem, but that's not the same thing as saying that _mm_load_ps and _mm_loadu_ps are equivalent on aligned data. The reason is that these are intrinsics and the compiler may not map them directly to movdqa and movdqu. – Z boson Jul 10 '15 at 08:20
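For illustration, a minimal sketch of the pointer-dereference form discussed in the comments above (the function and parameter names are made up; whether vec_array is 16-byte aligned is up to the caller). The compiler is free to fold each element access into the multiply as a memory operand when alignment allows:

#include <xmmintrin.h>

__m128 sum_of_products(const __m128 *vec_array, __m128 v, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; ++i)
        acc = _mm_add_ps(acc, _mm_mul_ps(v, vec_array[i]));  /* no explicit load intrinsic */
    return acc;
}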