6

Here is a code snippet that computes the square root of the values in a float array, taken from http://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/:

#include <xmmintrin.h>  // SSE intrinsics

void sse(float* a, int N)
{
    // We assume N % 4 == 0.
    int nb_iters = N / 4;
    __m128* ptr = (__m128*)a;

    for (int i = 0; i < nb_iters; ++i, ++ptr, a += 4) {
        _mm_store_ps(a, _mm_sqrt_ps(*ptr));
    }
}

When I disassemble this code, I see only one xmm register (xmm0) being used. I assumed unrolling the loop would give the compiler a cue that more xmm registers could be used, so I modified the code as follows:

void sse3(float* a, int N)
{
    __m128* ptr = (__m128*)a;

    for (int i = 0; i < N; i += 32) {
        _mm_store_ps(a + i,      _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i +  4, _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i +  8, _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i + 12, _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i + 16, _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i + 20, _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i + 24, _mm_sqrt_ps(*ptr)); ptr++;
        _mm_store_ps(a + i + 28, _mm_sqrt_ps(*ptr)); ptr++;
    }
}

N needs to be a multiple of 32 in this case. However, I still don't see more than one xmm register being used. Why can't the compiler assign more than one xmm register?

My understanding is that computations on xmm0, xmm1, xmm2, ..., xmm7 are independent and can proceed in parallel on modern superscalar architectures. On a 4-way superscalar machine, the second code snippet should therefore give a theoretical speed-up of 4 (if 4 different xmm registers are assigned, for example).

PS: The second code snippet is consistently a little faster:

sse3: 18809 microseconds
sse: 20543 microseconds
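
(For reference, the numbers above come from a driver along the lines of this minimal sketch; it is not my exact harness, and the buffer size and fill values are placeholders. It assumes the function definitions above are in the same file; gcc 4.6 needs -std=c++0x for <chrono>.)

#include <chrono>
#include <cstdio>
// plus <xmmintrin.h> and the sse/sse3 definitions shown above

int main()
{
    const int N = 1 << 24;                                 // placeholder size, a multiple of 32
    float* a = (float*)_mm_malloc(N * sizeof(float), 16);  // 16-byte aligned, required by movaps
    for (int i = 0; i < N; ++i)
        a[i] = 1.0f + i;                                   // arbitrary positive data

    auto t0 = std::chrono::high_resolution_clock::now();
    sse3(a, N);                                            // or sse(a, N)
    auto t1 = std::chrono::high_resolution_clock::now();

    std::printf("%lld microseconds\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());

    _mm_free(a);
    return 0;
}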

UPDATED

Used the -O3 flag as suggested.
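
(In case it is useful: a gcc command roughly like the one below generates this kind of interleaved source/assembly listing; the extra flags and the file names are just an example, not necessarily the exact invocation used here.)

g++ -O3 -g -fverbose-asm -c ssetest.cpp -Wa,-ahln=ssetest.lst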

Here is the disassembly for Ben Voigt's answer. Note that I changed the name of the function to sse4.

147:ssetest.cpp   **** void sse4(float* a, int N)
148:ssetest.cpp   **** {
2076                    .loc 8 148 0
2077                    .cfi_startproc
2078                .LVL173:
2079                .LBB5900:
2080                .LBB5901:
149:ssetest.cpp   ****    __m128 b, c, d, e;
150:ssetest.cpp   ****
151:ssetest.cpp   ****    for (int i = 0; i < N; i += 16) {
2081                    .loc 8 151 0
2082 0320 85F6          testl   %esi, %esi  # N
2083 0322 7E4C          jle .L106   #,
147:ssetest.cpp   **** void sse4(float* a, int N)
2084                    .loc 8 147 0
2085 0324 8D56FF        leal    -1(%rsi), %edx  #, tmp104
2086                .LBE5901:
2087                .LBE5900:
2088 0327 31C0          xorl    %eax, %eax  # ivtmp.1046
2089                .LBB5925:
2090                .LBB5924:
2091 0329 C1EA04        shrl    $4, %edx    #,
2092 032c 4883C201      addq    $1, %rdx    #, D.189746
2093 0330 48C1E206      salq    $6, %rdx    #, D.189746
2094                .LVL174:
2095                    .p2align 4,,10
2096 0334 0F1F4000      .p2align 3
2097                .L108:
2098                .LBB5902:
2099                .LBB5903:
899:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) *(__v4sf *)__P;
2100                    .loc 9 899 0 discriminator 2
2101 0338 0F285407      movaps  16(%rdi,%rax), %xmm2    # MEM[base: a_7(D), index: ivtmp.1046_85, offset: 16B], c
2101      10
2102                .LVL175:
2103                .LBE5903:
2104                .LBE5902:
2105                .LBB5904:
2106                .LBB5905:
182:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
2107                    .loc 9 182 0 discriminator 2
2108 033d 0F511C07      sqrtps  (%rdi,%rax), %xmm3  # MEM[base: a_7(D), index: ivtmp.1046_85, offset: 0B], tmp107
2109                .LBE5905:
2110                .LBE5904:
2111                .LBB5906:
2112                .LBB5907:
899:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) *(__v4sf *)__P;
2113                    .loc 9 899 0 discriminator 2
2114 0341 0F284C07      movaps  32(%rdi,%rax), %xmm1    # MEM[base: a_7(D), index: ivtmp.1046_85, offset: 32B], d
2114      20
2115                .LVL176:
2116                .LBE5907:
2117                .LBE5906:
2118                .LBB5908:
2119                .LBB5909:
182:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
2120                    .loc 9 182 0 discriminator 2
2121 0346 0F51D2        sqrtps  %xmm2, %xmm2    # c, tmp109
2122                .LBE5909:
2123                .LBE5908:
2124                .LBB5910:
2125                .LBB5911:
899:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) *(__v4sf *)__P;
2126                    .loc 9 899 0 discriminator 2
2127 0349 0F284407      movaps  48(%rdi,%rax), %xmm0    # MEM[base: a_7(D), index: ivtmp.1046_85, offset: 48B], e
2127      30
2128                .LVL177:
2129                .LBE5911:
2130                .LBE5910:
2131                .LBB5912:
2132                .LBB5913:
182:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
2133                    .loc 9 182 0 discriminator 2
2134 034e 0F51C9        sqrtps  %xmm1, %xmm1    # d, tmp111
2135                .LBE5913:
2136                .LBE5912:
2137                .LBB5914:
2138                .LBB5915:
2139                    .loc 9 948 0 discriminator 2
2140 0351 0F291C07      movaps  %xmm3, (%rdi,%rax)  # tmp107, MEM[base: a_7(D), index: ivtmp.1046_85, offset: 0B]
2141                .LVL178:
2142                .LBE5915:
2143                .LBE5914:
2144                .LBB5916:
2145                .LBB5917:
182:/usr/lib/gcc/x86_64-linux-gnu/4.6/include/xmmintrin.h ****   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
2146                    .loc 9 182 0 discriminator 2
2147 0355 0F51C0        sqrtps  %xmm0, %xmm0    # e, tmp113
2148                .LBE5917:
2149                .LBE5916:
2150                .LBB5918:
2151                .LBB5919:
2152                    .loc 9 948 0 discriminator 2
2153 0358 0F295407      movaps  %xmm2, 16(%rdi,%rax)    # tmp109, MEM[base: a_7(D), index: ivtmp.1046_85, offset: 16B]
2153      10
2154                .LVL179:
2155                .LBE5919:
2156                .LBE5918:
2157                .LBB5920:
2158                .LBB5921:
2159 035d 0F294C07      movaps  %xmm1, 32(%rdi,%rax)    # tmp111, MEM[base: a_7(D), index: ivtmp.1046_85, offset: 32B]
2159      20
2160                .LVL180:
2161                .LBE5921:
2162                .LBE5920:
2163                .LBB5922:
2164                .LBB5923:
2165 0362 0F294407      movaps  %xmm0, 48(%rdi,%rax)    # tmp113, MEM[base: a_7(D), index: ivtmp.1046_85, offset: 48B]
2165      30
2166 0367 4883C040      addq    $64, %rax   #, ivtmp.1046
2167                .LVL181:
2168                .LBE5923:
2169                .LBE5922:
2170                    .loc 8 151 0 discriminator 2
2171 036b 4839D0        cmpq    %rdx, %rax  # D.189746, ivtmp.1046
2172 036e 75C8          jne .L108   #,
2173                .LVL182:
2174                .L106:
2175 0370 F3            rep
2176 0371 C3            ret
2177                .LBE5924:
2178                .LBE5925:
2179                    .cfi_endproc
2180                .LFE7998:
2182 0372 66666666      .p2align 4,,15
2182      662E0F1F
2182      84000000
2182      0000
2183                    .globl  _Z6normalPfi
2185                _Z6normalPfi:
2186                .LFB7999:
152:ssetest.cpp   ****       b = _mm_load_ps(a + i);
153:ssetest.cpp   ****       c = _mm_load_ps(a + i +  4);
154:ssetest.cpp   ****       d = _mm_load_ps(a + i +  8);
155:ssetest.cpp   ****       e = _mm_load_ps(a + i + 12);
156:ssetest.cpp   ****       _mm_store_ps(a + i,      _mm_sqrt_ps(b));
157:ssetest.cpp   ****       _mm_store_ps(a + i +  4, _mm_sqrt_ps(c));
158:ssetest.cpp   ****       _mm_store_ps(a + i +  8, _mm_sqrt_ps(d));
159:ssetest.cpp   ****       _mm_store_ps(a + i + 12, _mm_sqrt_ps(e));
160:ssetest.cpp   ****    }
161:ssetest.cpp   **** }

Strangely, sse and sse4 have almost the same performance, and sse3 now performs the worst (even though part of its loop is unrolled).

Mathai
  • 839
  • 1
  • 12
  • 24
  • 2
    Take a look at [Register Renaming](http://en.wikipedia.org/wiki/Register_renaming). Because of that, the compiler simply doesn't need to use more than 1 register... – Mysticial Nov 23 '13 at 06:20
  • 4
    According to Agner Fog's well-known manuals (http://www.agner.org/optimize/), for Wolfdale CPUs the instruction SQRTPS has 6-13 cycles latency, 5-12 cycles reciprocal throughput, and can only use one execution port, p0. Translated for you, only one SQRTPS instruction can be in flight and one can only be started every 5-12 cycles. There is therefore zero opportunity for concurrent execution of SQRTPS instructions, and they can take a godawful long time to finish compared to an addition or multiplication. The compiler can't win by clobbering more registers here. – Iwillnotexist Idonotexist Nov 23 '13 at 06:24
  • @Mysticial There is a Read-After-Write dependency in the assembled code: xmm0 always has to wait on a load from memory after the previous store to memory. – Mathai Nov 23 '13 at 15:51
  • @IwillnotexistIdonotexist great info. time to read that doc :) thanks – Mathai Nov 23 '13 at 17:17
  • @Mathai I haven't seen the assembly, so I wouldn't know exactly what the compiler was doing anyway. – Mysticial Nov 23 '13 at 17:54
  • 1
    That compiler output is absolutely horrid. Check your compiler options. You are using optimization, right? – Ben Voigt Nov 23 '13 at 19:32
  • @BenVoigt Sorry, I am a newbie at SSE and optimization. I am not doing any optimization. How should I compile? I am using gcc version 4.6.3. – Mathai Nov 23 '13 at 21:58
  • Add `-Os` or `-O3` to your gcc compile command. – Ben Voigt Nov 23 '13 at 22:07
  • @BenVoigt When I use the -O3 and -g (for debugging) switches together, the breakpoint in the function is never hit. I am not able to see the assembly, but the code is definitely running faster (about 2x). Now the fastest is usually the original code (without any loop unrolling) and my modified code is the slowest. The code you posted takes more or less the same time as the original code in the link. – Mathai Nov 24 '13 at 00:52
  • @Mathai: Why can't you see the assembly? See http://stackoverflow.com/questions/137038/how-do-you-get-assembler-output-from-c-c-source-in-gcc (Use `-S` and `-O3` together) – Ben Voigt Nov 24 '13 at 00:59
  • @BenVoigt posted. I find it strange that sse performs the same as sse4 (your function). sse3 performs the worst now. I see xmm0 - xmm3 being used in your function now. sse uses only xmm0, but gives the same performance. I don't understand why. – Mathai Nov 24 '13 at 03:05

2 Answers

4

I think what you're seeing here is the compiler protecting against aliasing. You're reading and writing through pointers, so the compiler has to assume that each write through a float* could change the next read through a float*. Try using __restrict__ on your pointers to tell the compiler you're not doing that. You can also rewrite your loop to compute from one array (a global is ideal, because then no aliasing can be involved) into another.
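
For example, something along these lines (an untested sketch; sqrt_into is just an illustrative name, and the two-array version assumes in and out never overlap):

void sse3(float* __restrict__ a, int N)
{
    // Promise the compiler that nothing else aliases the memory behind 'a'.
    __m128* __restrict__ ptr = (__m128*)a;
    for (int i = 0; i < N; i += 4, ++ptr)
        _mm_store_ps(a + i, _mm_sqrt_ps(*ptr));
}

// Or read from one array and write into another, so no store can
// possibly feed a later load:
void sqrt_into(const float* __restrict__ in, float* __restrict__ out, int N)
{
    for (int i = 0; i < N; i += 4)
        _mm_store_ps(out + i, _mm_sqrt_ps(_mm_load_ps(in + i)));
}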

Ben Jackson
  • 90,079
  • 9
  • 98
  • 150
  • He's not actually reading through `float*` anywhere. Getting rid of the aliasing between `a` and `ptr` would certainly be good, though. – Ben Voigt Nov 23 '13 at 06:18
  • @Ben Jackson, I am not familiar with __restrict__. Do I change it to void sse3(float* __restrict__ a, int N)? I tried this, and I still see only xmm0 in the disassembly. – Mathai Nov 23 '13 at 17:27
  • GCC doesn't seem to care about aliasing between `float` and `__m128` as @BenVoigt pointed out. I get multiple registers with `-funroll-all-loops` (it even deals with all the cases where `nb_iters` is not a multiple of the unroll.) – Ben Jackson Nov 23 '13 at 18:51
4

How about:

void sse3(float* a, int N)
{
   __m128 b, c, d, e;

   for (int i = 0; i < N; i += 16) {
      b = _mm_load_ps(a + i);
      c = _mm_load_ps(a + i +  4);
      d = _mm_load_ps(a + i +  8);
      e = _mm_load_ps(a + i + 12);
      _mm_store_ps(a + i,      _mm_sqrt_ps(b));
      _mm_store_ps(a + i +  4, _mm_sqrt_ps(c));
      _mm_store_ps(a + i +  8, _mm_sqrt_ps(d));
      _mm_store_ps(a + i + 12, _mm_sqrt_ps(e));
   }
}

NB: Using the same XMM register for multiple computations is one thing that can kill pipelining, but it's not the only thing. Operations on different registers can proceed independently only if other resources are available in sufficient quantity. There isn't an entire SIMD computation unit dedicated to each register, as your comment suggests you believe.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • Thanks for your reply. I still see only xmm0 in the disassembly. There is no difference in speed. – Mathai Nov 23 '13 at 17:30
  • @Mathai: Can you post the disassembly? I'm not terribly surprised that the speed is unchanged, but it should be quite different for the compiler to use only one XMM register for all four named variables, considering their lifetimes overlap. Did it do a lot of instruction reordering? – Ben Voigt Nov 23 '13 at 18:30
  • posted. I am not sure what instruction reordering is. – Mathai Nov 23 '13 at 19:26
  • @Ben Voigt, my theory on why this is faster is that, regardless of parallel execution and pipelining of _mm_sqrt_ps, the loads and stores can happen in parallel. So the bottleneck is not so much parallel-execution processor resources, but rather memory. Do you know if there is validity to this theory? – Apriori Mar 04 '14 at 18:52