Is it possible to compute char arrays with Intel SSE intrinsics? My attempt so far:

void load_and_print( char arr[], size_t l ){
    __m128i __attribute__((aligned(16))) x_reg = _mm_load_si128((const __m128i *) arr);
    for (int i = 0; i < l; ++i) {
        unsigned short v = _mm_cvtss_f32(x_reg+i);
        printf("%d ",v);
    }
}

This does not work because _mm_cvtss_f32 operates on floats, but I cannot find a way to use chars. Do I have to use bitmasks?

EDIT

The example function should load a char array into an xmm register and then print the values from that register. It is just an attempt to load an array into an xmm register and retrieve it again.
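For reference, here is a minimal sketch of what such a load-and-retrieve round trip could look like, assuming SSE2 and an array that may be shorter than 16 bytes and not 16-byte aligned. The helper name `roundtrip_xmm` is made up for illustration; the intrinsics used are `_mm_loadu_si128` and `_mm_storeu_si128`. It stores the register back to memory and loops over the bytes rather than extracting them one at a time.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: moves up to 16 bytes of arr through an XMM register
   into out (which must hold 16 bytes). Returns the number of valid bytes. */
size_t roundtrip_xmm(const char *arr, size_t l, unsigned char out[16])
{
    unsigned char padded[16] = {0};
    size_t n = l < 16 ? l : 16;
    memcpy(padded, arr, n);  /* zero-pad so a short array is never over-read */

    __m128i x_reg = _mm_loadu_si128((const __m128i *)padded); /* unaligned load */
    _mm_storeu_si128((__m128i *)out, x_reg);                  /* unaligned store */
    return n;
}

/* Prints the decimal value of each byte that went through the register. */
void load_and_print(const char *arr, size_t l)
{
    unsigned char bytes[16];
    size_t n = roundtrip_xmm(arr, l, bytes);
    for (size_t i = 0; i < n; ++i)
        printf("%d ", bytes[i]);
    printf("\n");
}
```

Calling `load_and_print("abc", 3)` would go through the register once and print the three byte values. Arrays longer than 16 bytes would need a loop over 16-byte chunks, which this sketch leaves out.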

  • What do you mean by “compute” a `char` array? Do you want to print the characters represented by the bytes in the array? Do you want to print a decimal numeral showing the value of each byte? Do you want to group the bytes into `unsigned short` objects and print decimal numerals for their values? Or something else? – Eric Postpischil Jun 10 '23 at 13:01
  • Use _mm_lddqu_si128 for an unaligned load, _mm_extract_epi8 to return the 8-bit elements. – Simon Goater Jun 10 '23 at 14:00
  • @SimonGoater: `lddqu` runs the same as `movdqu` on all CPUs except Pentium 4. Just use `_mm_loadu_si128`. Also note that `_mm_extract_epi8` requires a compile-time constant index (and requires SSE4.1). If you actually wanted to extract each character 1 at a time, it's usually more efficient in asm to store to memory and loop over the bytes, because there are so many of them for `char` elements. As in [print a \_\_m128i variable](https://stackoverflow.com/q/13257166) . (In this artificial case, it would be more efficient to just loop over the original `arr`.) – Peter Cordes Jun 10 '23 at 15:50
  • @PeterCordes There would be a lot of repetition using extract, you're right. There are numerous issues with the OP's code though and I just offered a quick suggestion. There's an aliasing violation on the load, so probably should use memcpy, but then to memcpy back doesn't use intrinsics at all. It would be nice to have an unaligned load from char* and runtime extract but hey. I assume that specifying 16 byte alignment for stack allocation of __m128i is redundant no? – Simon Goater Jun 11 '23 at 09:11
  • @PeterCordes I tried a similar example with gcc -Wall -fstrict-aliasing and it didn't complain about _mm_loadu_si128((const __m128i *)arr); Since arr isn't even necessarily aligned to 16 bytes, surely this should be an aliasing violation, no? – Simon Goater Jun 11 '23 at 09:19
  • @SimonGoater: Yes, `alignof(__m128i) == 16` so the compiler does that on its own. Re: whether it's safe to use load intrinsics: [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) - it's fine, the intrinsics API requires that you create misaligned pointers (which compilers have to support), but the deref happens inside the wrapper function, not by simple deref of a `__m128i*`. And they're defined as `__attribute__((may_alias))` so it's neither a strict-aliasing violation nor alignment UB. – Peter Cordes Jun 11 '23 at 15:17
  • @SimonGoater: Until AVX-512, Intel's intrinsics API was badly designed to take pointer types other than `void*`, requiring extra casting in the source. And until very recently didn't portably provide a `movd` 32-bit load / store intrinsic, like they expected you to use `*(int*)ptr` with a `cvt` intrinsic, which could indeed be aliasing and/or alignment UB, unless you instead use `memcpy`. – Peter Cordes Jun 11 '23 at 15:20
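The `memcpy` alternative mentioned in the last comment can be sketched roughly as follows, assuming SSE2. The helper names `load_u32_bytes` and `store_u32_bytes` are hypothetical; the `cvt` intrinsics used are `_mm_cvtsi32_si128` and `_mm_cvtsi128_si32`. Going through `memcpy` instead of `*(int*)ptr` avoids both the strict-aliasing and the alignment problem, and compilers typically turn it into a single `movd`.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi32_si128, _mm_cvtsi128_si32 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: loads 4 bytes from any address into the low 32 bits
   of an XMM register. The upper 96 bits are zeroed. */
__m128i load_u32_bytes(const void *p)
{
    int32_t tmp;
    memcpy(&tmp, p, sizeof tmp);   /* no aliasing or alignment UB */
    return _mm_cvtsi32_si128(tmp);
}

/* Hypothetical helper: stores the low 32 bits of v to any address. */
void store_u32_bytes(void *p, __m128i v)
{
    int32_t tmp = _mm_cvtsi128_si32(v);
    memcpy(p, &tmp, sizeof tmp);
}
```

With these, something like `store_u32_bytes(dst, load_u32_bytes(src + 1))` copies four `char` elements through a register even when `src + 1` is misaligned.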

0 Answers