So I want to get the value or state of specific xmm registers. This is primarily for a crash log or just to see the state of the registers for debugging. I tried this, but it doesn't seem to work:

#include <x86intrin.h>
#include <stdio.h>

int main(void) {

     register __m128i my_val __asm__("xmm0");
     __asm__ ("" :"=r"(my_val));
     printf("%llu %llu\n", my_val & 0xFFFFFFFFFFFFFFFF, my_val << 63);
  return 0;
}

As far as I know, the store related intrinsics would not treat the __m128i as a POD data type but rather as a reference to one of the xmm registers.

How do I get and access the bits stored in the __m128i as 64 bit integers? Or does my __asm__ above work?

Peter Cordes
Josh Weinstein

2 Answers

How do I get and access the bits stored in the __m128i as 64 bit integers?

You will have to convert the __m128i vector to a pair of uint64_t variables. You can do that with conversion intrinsics:

uint64_t lo = _mm_cvtsi128_si64(my_val);
uint64_t hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(my_val, my_val));

...or through memory:

uint64_t buf[2];
_mm_storeu_si128((__m128i*)buf, my_val);
uint64_t lo = buf[0];
uint64_t hi = buf[1];

The latter may be worse in terms of performance, but if you intend to use it only for debugging, it would do. It is also trivial to adapt to differently sized elements, if you need that.
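As a minimal sketch of adapting the memory route to a different element size, the following stores the vector into four 32-bit lanes instead of two 64-bit ones. The helper names `xmm_to_u32` and `print_xmm_u32` are hypothetical, chosen only for illustration; only SSE2 intrinsics are assumed.

```c
#include <emmintrin.h>   // SSE2: __m128i, _mm_storeu_si128
#include <inttypes.h>    // PRIx32
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: copy the four 32-bit lanes of v into out[0..3],
// lowest lane first (out[0] holds bits 0..31 of the register).
static void xmm_to_u32(__m128i v, uint32_t out[4]) {
    // Unaligned store, so out needs no special alignment.
    _mm_storeu_si128((__m128i *)out, v);
}

// Hypothetical helper: print the lanes in hex, most significant lane first.
static void print_xmm_u32(__m128i v) {
    uint32_t lanes[4];
    xmm_to_u32(v, lanes);
    printf("%08" PRIx32 " %08" PRIx32 " %08" PRIx32 " %08" PRIx32 "\n",
           lanes[3], lanes[2], lanes[1], lanes[0]);
}
```

Changing the element type of the buffer (for example to `uint16_t[8]` or `uint8_t[16]`) is the only change needed for other sizes, since the store always writes the same 16 bytes.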

Or does my __asm__ above work?

No, it doesn't. The "=r" output constraint only allows general-purpose registers, not vector registers such as xmm0, which you pass as the output. No general-purpose register is 128 bits wide, so that asm statement makes no sense.

Also, I should note that my_val << 63 shifts the value in the wrong direction. If you wanted to output the high half of the hypothetical 128-bit value, you should have shifted right (by 64 bits), not left. Besides that, shifts on vectors are either not implemented or act on each element of the vector rather than on the vector as a whole, depending on the compiler. But this part is moot, as with the code above you don't need any shifts to output the two halves.
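To illustrate the difference between a per-element shift and a whole-register shift, here is a small sketch using SSE2 intrinsics. The function names are hypothetical; `_mm_srli_si128` shifts the full 128 bits right by a byte count, while `_mm_srli_epi64` shifts each 64-bit element independently by a bit count.

```c
#include <emmintrin.h>   // SSE2: _mm_srli_si128, _mm_srli_epi64, _mm_cvtsi128_si64
#include <stdint.h>

// Hypothetical helper: high 64 bits of v, obtained with a whole-register
// right shift by 8 bytes (zeros are shifted in at the top).
static uint64_t xmm_high64(__m128i v) {
    __m128i shifted = _mm_srli_si128(v, 8);
    return (uint64_t)_mm_cvtsi128_si64(shifted);
}

// Hypothetical helper: per-element shift. Each 64-bit lane is shifted right
// by 63 bits on its own, so the low lane ends up holding only its own top
// bit; nothing crosses over from the high lane.
static uint64_t xmm_low_lane_top_bit(__m128i v) {
    __m128i shifted = _mm_srli_epi64(v, 63);
    return (uint64_t)_mm_cvtsi128_si64(shifted);
}
```

Like `_mm_cvtsi128_si64` in the answer above, this assumes a 64-bit x86 target.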

Andrey Semashev
  • On 64-bit targets, `unsigned __int128` is as wide as an XMM, so a type-pun or memcpy or `_mm_store_si128` from an `"=x"(m128i)` output could work. – Peter Cordes Aug 05 '20 at 14:32
  • More importantly, register-asm local vars are a flaky way to read incoming regs of a function, and are only guaranteed to make an `asm` statement's `"r"` or `"x"` operand pick a specific register; the old behaviour usually works but is now undocumented = unsupported (https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html). Your answer seems to suggest omitting the asm statement. This will probably break with clang. – Peter Cordes Aug 05 '20 at 14:34
  • @PeterCordes `__int128` is a gcc-specific extension, it's not universally available. Re asm statements, I intentionally did not provide any guidance as to how to capture the values of registers, as that was not the question. It is not obvious that this is needed in the first place. The asm statement that contains "=r" output constraint indeed does not make sense and should be either fixed or removed. – Andrey Semashev Aug 05 '20 at 19:31
  • Not GCC-specific, specific to the GNU dialect of C. Supported by GCC, clang, and ICC at least. Just like `register __m128i my_val __asm__("xmm0");` which the OP is already using. As far as I'm concerned, the hard part (that made this question worth answering at all, instead of closing as a duplicate of [print a \_\_m128i variable](https://stackoverflow.com/a/46752535)) is capturing the value of an XMM register. As I pointed out in my answer, you can't safely remove the `asm()` statement. – Peter Cordes Aug 05 '20 at 19:35
  • @PeterCordes As far as capturing goes, I wouldn't use that syntax anyway because you have no control over the code that the compiler generates around the asm statement. The register may be clobbered by the compiler, and the code will print garbage. The best you could do is to use inline asm to save the registers and thus minimize the possibility of losing their values. But again, it isn't obvious that this is needed, as the author may just want to print a `__m128i` value that he has in his code. – Andrey Semashev Aug 05 '20 at 19:42
  • @PeterCordes BTW, `__int128` is not normally stored in vector registers, at least not on x86. It is usually stored in a GPR pair. I'm not sure using "=x" constraint to save to a `__int128` variable is correct. – Andrey Semashev Aug 05 '20 at 19:51
  • That's correct, you'd need `_mm_storeu_si128(&my_int128, my_m128i)` or any other way of type-punning. I only suggested that because glibc printf might be able to format it as a single integer. (Actually, turns out gcc does accept `"=x"` for an `__int128`; clang doesn't though. https://godbolt.org/z/83o3h1) – Peter Cordes Aug 05 '20 at 19:59
  • IDK if `asm volatile("movaps %%xmm0, %0" : "=m"(buf));` is any more reliable than an empty template with a register output; GCC could step on XMM0 before that, too. The whole idea is garbage and not anything you should ever rely on for correctness; I added some more warning about that to my answer. My reading of the question is that it very much includes capturing a hard *register*, not just printing an existing `__m128i` C variable. To do that reliably, you have to write the function in stand-alone asm, not inline. – Peter Cordes Aug 05 '20 at 20:00
0

If you really want to know about register values, rather than __m128i C variable values, I'd suggest using a debugger like GDB. For example, print /x $xmm0.v2_int64 shows the register as two 64-bit elements when stopped at a breakpoint.

Capturing a register at the top of a function is a pretty flaky and unreliable thing to attempt (smells like you've already gone down the wrong design path; see footnote 1). But you're on the right track with a register-asm local var. However, xmm0 can't match an "=r" constraint, only "=x". See Reading a register value into a C variable for more about using an empty asm template to tell the compiler you want a C variable to be what was in a register.

You do need the asm volatile("" : "=x"(var)); statement, though; GNU C register-asm local vars have no guarantees whatsoever except when used as operands to asm statements. (GCC will often keep your var in that register anyway, but IIRC clang won't.)

There's not a lot of guarantee about where this will be ordered wrt. other code (asm volatile may help some, or for stronger ordering also use a "memory" clobber). Also no guarantee that GCC won't use the register for something else first. (Especially a call-clobbered register like any xmm reg.) But it does at least happen to work in the version I tested.

print a __m128i variable shows how to print a __m128i as two 64-bit halves once you have it, or as other element sizes. The compiler will often optimize _mm_store_si128 / reload into shuffles, and this is for printing anyway so keep it simple.

Using an unsigned __int128 tmp; would also be an option in GNU C on x86-64.
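A minimal sketch of the unsigned __int128 route, assuming GNU C on x86-64 (GCC or clang). The helper name `xmm_to_u128` is hypothetical. It copies the bytes with memcpy rather than relying on an asm constraint, and the halves are then recovered with ordinary integer shifts:

```c
#include <emmintrin.h>   // SSE2: __m128i
#include <stdint.h>
#include <string.h>      // memcpy

// Hypothetical helper: reinterpret the 16 bytes of v as one 128-bit integer.
// On little-endian x86-64 the low 64-bit element becomes the low 64 bits.
static unsigned __int128 xmm_to_u128(__m128i v) {
    unsigned __int128 tmp;
    memcpy(&tmp, &v, sizeof tmp);   // both types are 16 bytes wide
    return tmp;
}

// Hypothetical helpers: split the 128-bit integer back into 64-bit halves,
// for example to pass them to printf with PRIx64.
static uint64_t u128_high(unsigned __int128 x) { return (uint64_t)(x >> 64); }
static uint64_t u128_low(unsigned __int128 x)  { return (uint64_t)x; }
```

ISO C printf has no conversion for a 128-bit integer, so printing generally still goes through the two halves.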


#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif

// If you need this, you're probably doing something wrong.
// There's no guarantee about what a compiler will have in XMM0 at any point
void foo() {
    register __m128i xmm0 __asm__("xmm0");
    __asm__ volatile ("" :"=x"(xmm0));

    alignas(16) uint64_t buf[2];
    _mm_store_si128((__m128i*)buf, xmm0);
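    // uint64_t is not necessarily unsigned long long, so cast to match %llu.
    printf("%llu %llu\n", (unsigned long long)buf[1], (unsigned long long)buf[0]);   // I'd normally use hex, like %#llx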
}

This prints the high half first (most significant), so reading left to right across both elements we get each byte in descending order of memory address within buf.

It compiles to the asm we want with both GCC and clang (Godbolt), not stepping on xmm0 before reading it.

# GCC10.2 -O3
foo:
        movhlps xmm1, xmm0
        movq    rdx, xmm0                 # low half -> RDX
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        movq    rsi, xmm1                 # high half -> RSI
        jmp     printf

Footnote 1:

If you make sure your function doesn't inline, you could take advantage of the calling convention to get the incoming values of xmm0..7 (for x86-64 System V), or xmm0..3 if you have no integer args (Windows x64).

__attribute__((noinline))
void foo(__m128i xmm0, __m128i xmm1, __m128i xmm2 /* , ... and so on up to xmm7 */) {
  // do whatever you want with the xmm0..7 args
}

If you want to provide a different prototype for the function for callers to use (which omits the __m128i args), that can maybe work. It's of course Undefined Behaviour in ISO C, but if you truly stop inlining, the effects depend on the calling convention. As long as you make sure it's noinline so link-time optimization doesn't do cross-file inlining.

Of course, the mere fact of inserting a function call will change register allocation in the caller, so this only helps for a function you were going to call anyway.
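As a rough sketch of the calling-convention approach, assuming x86-64 System V and GNU C: a non-inlined function that receives two vectors as ordinary arguments and stores them for later printing. The names `xmm_snapshot` and `capture_xmm_args` are hypothetical. Called with a matching prototype, as here, this is well-defined C and simply records whatever the caller passes.

```c
#include <emmintrin.h>   // SSE2: __m128i, _mm_storeu_si128
#include <stdint.h>

// Hypothetical record of one 128-bit value as two 64-bit halves.
typedef struct {
    uint64_t lo;
    uint64_t hi;
} xmm_snapshot;

// noinline keeps this a real call, so under x86-64 System V the two vector
// arguments arrive in xmm0 and xmm1.
__attribute__((noinline))
void capture_xmm_args(__m128i xmm0, __m128i xmm1, xmm_snapshot out[2]) {
    // Each store writes 16 bytes: lo first, then hi.
    _mm_storeu_si128((__m128i *)&out[0], xmm0);
    _mm_storeu_si128((__m128i *)&out[1], xmm1);
}
```

Extending it to more arguments follows the same pattern, one parameter and one store per register.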

Peter Cordes
  • Is it safe to also just access the `__m128i` directly? Like `xmm0[0]` from the above instead of the store intrinsic? – Josh Weinstein Aug 06 '20 at 00:20
  • 1
    @JoshWeinstein: In GNU C, yes, `__m128i` happens to be defined as `typedef long long __m128i __attribute__((vector_size(16), may_alias))`, so yes, `xmm0[0]` is the low element of that GNU C native vector of `long long`s. Since you want 64-bit element size anyway, and signed<->unsigned conversion is a no-op on 2's complement machines, yes you can do that. Related: [why does "+=" gives me unexpected result in SSE instrinsic](https://stackoverflow.com/q/56572357) / [\`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type?](https://stackoverflow.com/q/52112605) – Peter Cordes Aug 06 '20 at 00:25
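The direct subscripting described in the comment above can be sketched as follows. This assumes GNU C (GCC or clang), where `__m128i` is a native vector of two `long long` elements; it is not portable to compilers that define `__m128i` differently. The helper names are hypothetical.

```c
#include <emmintrin.h>   // SSE2: __m128i
#include <stdint.h>

// GNU C only: index the vector like an array of two long long elements.
// Hypothetical helper: element 0 is the low 64 bits.
static uint64_t xmm_low_elem(__m128i v) {
    return (uint64_t)v[0];
}

// Hypothetical helper: element 1 is the high 64 bits.
static uint64_t xmm_high_elem(__m128i v) {
    return (uint64_t)v[1];
}
```

The cast from `long long` to `uint64_t` keeps the bit pattern unchanged, so negative element values show up as large unsigned numbers.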