
Before I ask my question, just a little background information.

In C-family languages, when you assign to a variable, you can conceptually assume you just modified a little piece of memory in RAM.

int a = rand(); // conceptually, you created the variable "a" and assigned it a value in RAM

In assembly language, to do the same thing, you essentially need the result of rand() in a register, plus the address of "a". You would then execute a store instruction to get the register contents into RAM.

When you program in C++, for example, and assign and manipulate value-type objects, you usually don't even have to think about their addresses, or about how or when they will be stored in registers.

Using SSE intrinsics is strange because they appear to sit somewhere in between coding in C and assembly, in terms of the conceptual memory model.

You can call load/store functions, and they return objects. A math operation like `_mm_add_ps` will return an object, yet it's unclear to me whether the result will actually be stored in the object unless you call `_mm_store_ps`.

Consider the following example:

inline void block(float* y, const float* x) const {
    // load 4 data elements at a time
    __m128 X = _mm_loadu_ps(x);
    __m128 Y = _mm_loadu_ps(y);
    // do the computations
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
    // store the results
    _mm_storeu_ps(y, result);
}

There are a lot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntax sugar for calling assembly instructions in a C-like way? What happens if, instead of doing the store command at the end, you just kept the result — would the result then be more than syntax sugar, and actually hold data?

TL;DR: How am I supposed to think about memory when using SSE intrinsics?

Thomas

2 Answers


An `__m128` variable may be in a register and/or memory. It's much the same as with simple `float` or `int` variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables, so that a register may be used for more than one variable within a block.

As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have, i.e. 8 XMM registers in 32-bit mode and 16 in 64-bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.

Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.

Paul R
  • If I'm not supposed to worry about it, then why are there the load and store functions? Especially load - seems like something the compiler should do automatically. – Thomas Mar 19 '15 at 15:37
  • Well you don't actually have to - you can use pointer or array syntax if you like, but you'll still get `_mm_load_xxx`/`_mm_store_xxx` instructions generated under the hood. But the point of intrinsics is that in most cases they map 1:1 to actual machine instructions, so using `_mm_load_xxx`/`_mm_store_xxx` is clearer when mixed with other SSE intrinsics. It also gives you explicit control over which instructions you actually use (e.g. aligned versus unaligned, and various other special cases which have no equivalent in C). – Paul R Mar 19 '15 at 15:40
  • it's difficult for me to wrap my head around it being optimized into instructions like regular C, while at the same time you can control loads / stores. Is load just a hint for optimization? – Thomas Mar 19 '15 at 15:51
  • Not really - it just takes you closer to the machine level, in that you are explicitly loading and storing data rather than letting the compiler manage it. It lets you write code which is closer to assembler than C, while still giving you some of the labour-saving benefits of a compiler (register allocation, instruction scheduling, peephole optimisation, etc). It does mean though that you need to be straddling the HLL/assembler fence when you program - you need to think like an assembler programmer and a C programmer at the same time. – Paul R Mar 19 '15 at 15:59
  • Tell me if you agree, "the optimizer does the best it can given your forced load / stores", I can kind of understand it if I can say that – Thomas Mar 19 '15 at 16:06
  • 2
    Yes, close enough. You might find it helpful/instructive to write some simple SIMD code with intrinsics and then look at the code generated with `gcc -O3 -S ...`. – Paul R Mar 19 '15 at 16:14

Vector variables aren't special. They will be spilled to memory and re-loaded when needed later, if the compiler runs out of registers when optimizing a loop (or across a function call to a function the compiler can't "see" to know that it doesn't touch the vector regs).

gcc -O0 does actually tend to store `__m128i` variables to RAM when you assign them, instead of keeping them only in registers, IIRC.

You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)

Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.

The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as a m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx lists its operand as a m128.

Anyway, using load instead of loadu tells the compiler that it can fold the load into being a memory operand for another instruction, even if it can't otherwise prove that it comes from an aligned address.

Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.

This came up in comments on How to specify alignment with _mm_mul_ps.

The store intrinsics apparently have two purposes:

  1. To tell the compiler whether it should use the aligned or unaligned asm instruction.
  2. To remove the need for a cast from __m128d to double * (doesn't apply to the integer case).

Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.

Peter Cordes