double diff, dsq = 0;
double *descr1, *descr2;
int i, d;

for (i = 0; i < d; ++i)
{
  diff = descr1[i] - descr2[i];
  dsq += diff * diff;
}
return dsq;

I want to optimize this section of code, which takes the most time in my program. If the double multiplication could be performed in an optimized way, my program would run much faster. Are there other ways to do the multiplication, instead of using the `*` operator, that would make the program run faster? Thanks a lot.

  • Can you switch to integers? Which range of value will be covered by the content of `descr1/2`? – alk Sep 08 '13 at 10:38
  • You have not initialised d! – Ed Heal Sep 08 '13 at 10:39
  • if dependencies permit you could run the kernel on a GPU – jev Sep 08 '13 at 10:44
  • You can use left shift operator to multiply by 2. – mawia Sep 08 '13 at 10:46
  • You're asking if there's a better operator than multiplication to perform multiplication? If there was, it would be at the assembly level and the compiler's optimizer would put it in for you. There isn't a general solution that will be faster (short of specialized hardware as @jev suggests), but if your data is known in advance, you could produce a lookup table of answers that have been precomputed. – mah Sep 08 '13 at 10:46
  • @mawia he could also use a simple instruction to multiply by 0 or 1, but since he's using double values, I suspect neither 0, 1, nor 2 will be expected. – mah Sep 08 '13 at 10:48
  • The least you can do is move the declaration of `diff` to inside the loop; it needs to be optimised out anyway. (and maybe use unsigned types for i and d) – wildplasser Sep 08 '13 at 10:48
  • @alk If a processor has hardware floating-point at all, and the OP does not say that it doesn't, double-precision multiplication is usually faster than 64-bit integer multiplication, and single-precision multiplication faster than 32-bit integer multiplication. – Pascal Cuoq Sep 08 '13 at 10:58
  • @PascalCuoq: Fair enough. Probably my obviously wrong assumption dates back to the times when my optimisation instincts were imprinted, as there were extra processors needed to do floating-point math ... ;-) – alk Sep 08 '13 at 11:07
  • You say this code is taking the most time so presumably you have profiled it? If so then I'd drill down further and find which lines of code are the bottleneck e.g. is it the multiply or the array read? Also is this to be cross-platform or can it be optimised for a specific chipset etc? – zebrabox Sep 08 '13 at 11:11
  • @alk Yeah, it is counter-intuitive when you have known processors without FPU, but the hard part of a double-precision multiplication (of normal arguments giving a normal result) is multiplying 53-bit significands, which is easier than multiplying 64-bit integers. – Pascal Cuoq Sep 08 '13 at 13:35
  • @user2758590 can you please state the hardware that you run (or want to run) on (e.g. Intel i7-2600). Can we assume SSE/AVX can be used? – Z boson Sep 08 '13 at 14:36
  • @mah: I completely disagree with your claim that "there isn't a general solution that will be faster". Loop unrolling and allowing the compiler to parallelize operations will offer BIG improvements, without any specialized hardware. – abelenky Sep 08 '13 at 23:52
  • @mah: since each of the floating-point operations requires one clock cycle (the multiply sometimes two) ANY lookup table logic will cause performance to deteriorate. The floating-point code given - and variations on the theme - generates in-line code without jumps. Lookup table logic on the other hand will make the code longer by introducing tests, conditional jumps, a lookup table and extra memory accesses. – Olof Forshell Sep 09 '13 at 09:47

9 Answers


This is definitely a case for Duff's Device.

Here's my implementation, based on Duff's Device.
(Note: only lightly tested... step through it in a debugger to verify correct behavior.)

#include <stdlib.h>  // for rand()

void fnc(void)
{
    double dsq = 0.0;
    double diff[8] = {0.0};
    double descr1[115];
    double descr2[115];
    double* pD1 = descr1;
    double* pD2 = descr2;
    int d = 115;

    //Fill with random data for testing
    for(int i=0; i<d; ++i)
    {
        descr1[i] = (double)rand() / (double)rand();
        descr2[i] = (double)rand() / (double)rand();
    }

    // Duff's Device: Step through this in a debugger, it's AMAZING.
    int c = (d + 7) / 8;
    switch(d % 8) {
    case 0: do {    diff[0] = *pD1++ - *pD2++; diff[0] *= diff[0];
    case 7:         diff[7] = *pD1++ - *pD2++; diff[7] *= diff[7];
    case 6:         diff[6] = *pD1++ - *pD2++; diff[6] *= diff[6];
    case 5:         diff[5] = *pD1++ - *pD2++; diff[5] *= diff[5];
    case 4:         diff[4] = *pD1++ - *pD2++; diff[4] *= diff[4];
    case 3:         diff[3] = *pD1++ - *pD2++; diff[3] *= diff[3];
    case 2:         diff[2] = *pD1++ - *pD2++; diff[2] *= diff[2];
    case 1:         diff[1] = *pD1++ - *pD2++; diff[1] *= diff[1];
                    dsq += diff[0] + diff[1] + diff[2] + diff[3] + diff[4] + diff[5] + diff[6] + diff[7]; 
               } while(--c > 0);
    }
}


Explanation
As others have said, there is little you can do to optimize the floating-point operations themselves. However, in your original code, the program spent a lot of time checking the value of i.

The execution steps were roughly:

Is i < d? ==> Yes
Do some math.
Is i < d? ==> Yes
Do some math.
Is i < d? ==> Yes
Do some math.
Is i < d? ==> Yes
Do some math.

You can see every other step is checking i.

With Duff's Device, you get eight operations before checking the counter (c in this case).

Now the execution steps are roughly:

Is c > 0? ==> Yes
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Is c > 0? ==> Yes
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Do some math.
Is c > 0? ==> Yes
[...]

In other words, you spend about eight times more CPU time actually accomplishing work, and far less time checking the value of your counter. That is a BIG win.

I suspect you could even unroll the loop further to 16 or 32 operations for an even bigger win. It really depends on the likely values of d in your code.
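For illustration, here is a sketch of the 16-way variant (untested, with the same caveats as the 8-way code above; it reuses pD1, pD2 and dsq from that listing, and only the counter math and the number of cases change):

double diff16[16] = {0.0};
int c16 = (d + 15) / 16;
switch(d % 16) {
case 0:  do {   diff16[0]  = *pD1++ - *pD2++; diff16[0]  *= diff16[0];
case 15:        diff16[15] = *pD1++ - *pD2++; diff16[15] *= diff16[15];
case 14:        diff16[14] = *pD1++ - *pD2++; diff16[14] *= diff16[14];
case 13:        diff16[13] = *pD1++ - *pD2++; diff16[13] *= diff16[13];
case 12:        diff16[12] = *pD1++ - *pD2++; diff16[12] *= diff16[12];
case 11:        diff16[11] = *pD1++ - *pD2++; diff16[11] *= diff16[11];
case 10:        diff16[10] = *pD1++ - *pD2++; diff16[10] *= diff16[10];
case 9:         diff16[9]  = *pD1++ - *pD2++; diff16[9]  *= diff16[9];
case 8:         diff16[8]  = *pD1++ - *pD2++; diff16[8]  *= diff16[8];
case 7:         diff16[7]  = *pD1++ - *pD2++; diff16[7]  *= diff16[7];
case 6:         diff16[6]  = *pD1++ - *pD2++; diff16[6]  *= diff16[6];
case 5:         diff16[5]  = *pD1++ - *pD2++; diff16[5]  *= diff16[5];
case 4:         diff16[4]  = *pD1++ - *pD2++; diff16[4]  *= diff16[4];
case 3:         diff16[3]  = *pD1++ - *pD2++; diff16[3]  *= diff16[3];
case 2:         diff16[2]  = *pD1++ - *pD2++; diff16[2]  *= diff16[2];
case 1:         diff16[1]  = *pD1++ - *pD2++; diff16[1]  *= diff16[1];
                dsq += diff16[0]  + diff16[1]  + diff16[2]  + diff16[3]
                     + diff16[4]  + diff16[5]  + diff16[6]  + diff16[7]
                     + diff16[8]  + diff16[9]  + diff16[10] + diff16[11]
                     + diff16[12] + diff16[13] + diff16[14] + diff16[15];
           } while(--c16 > 0);
}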

Please test and profile this code, and let me know how it works out for you.
I have a strong feeling that this will be a big improvement.

abelenky
  • Duff's device was apparently implemented to copy from one memory location to another which is a much simpler exercise optimization-wise than the arithmetic in the OP's question. In addition D's d was invented in 1983 when pipe-lining was in its infancy on the 80286 and non-existent(?) on the 68k. So the "definitely" was relevant in 1983 but perhaps less so today given 30 years of HW development. C compilers are smarter today if you serve up an optimization scenario that they recognize but otherwise they're just less dumb. – Olof Forshell Sep 09 '13 at 09:37

You could help the compiler a bit with its aliasing analysis:

double calc_ssq(double *restrict descr1, double *restrict descr2, size_t count)
{
    double ssq = 0.0;

    for ( ; count; count--) {
        double diff = *descr1++ - *descr2++;
        ssq += diff * diff;
    }
    return ssq;
}
wildplasser
  • Not really, there is no writing access via any pointer in the code, so aliasing cannot stop the compiler from optimizing the original code. If this has a performance impact, it only indicates that the compiler is not smart enough. – cmaster - reinstate monica Sep 08 '13 at 13:17
  • In the original code snippet there's not, but it's possible that the actual code is complex enough that aliasing analysis can't be done. – Sneftel Sep 08 '13 at 13:59
  • I'm sorry for this contribution, it was only intended to get the discussion going. Maybe `const` would do the same here, as the arrays are never written anyway. I am not an expert on this kind of thing, but I have the feeling that this kind of mechanism was the driving force for introducing this feature in C99. (insert dmr quote here ...) BTW: inlining would probably help, too. – wildplasser Sep 08 '13 at 16:36

If you do not really need double precision for your calculation, you can try casting the values to single precision and multiplying those.

I suppose that single-precision multiplication will be faster than double-precision multiplication on a 32-bit processor, as a regular float needs only one processor register while a double needs two.

I am not sure whether the casts would eat up all of the speed improvement you would get from single-precision multiplication.
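For illustration, a minimal single-precision sketch of the original loop (the ssq_float name is mine; it assumes the data already lives in float arrays, so no per-element casting is needed in the hot loop):

float ssq_float(const float *descr1, const float *descr2, int d)
{
    float diff, dsq = 0.0f;
    int i;

    /* Same algorithm as the original, but every load, subtract,
     * multiply and add is single precision. */
    for (i = 0; i < d; ++i) {
        diff = descr1[i] - descr2[i];
        dsq += diff * diff;
    }
    return dsq;
}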

Alex
  • Casting won't eat any speed improvement, because floats are always represented as `long double` in the floating point registers, the cast is a noop. The difference, is that the compiler will use the single precision variants of the arithmetic instructions instead of the double precision ones, which usually execute twice as fast. – cmaster - reinstate monica Sep 08 '13 at 13:23
  • @cmaster did not know that about `long double`. Thanks. – Alex Sep 08 '13 at 15:23

If d is evenly divisible by 2 I'd try something like this:

double diff0, diff1;

for (i = 0; i < d; i += 2)
{
  diff0 = descr1[i]   - descr2[i];
  diff1 = descr1[i+1] - descr2[i+1];
  dsq += diff0 * diff0 + diff1 * diff1;
}

This hints to the optimizer that it is possible to interleave the six operations. Even if d were odd, you could append a 0.0 value to the end of each vector (giving an even number of values), since it would make no difference to the result given the operations involved.

The next step might be padding the vectors to a length evenly divisible by four, doing four subtractions, four multiplications and four additions before iterating with i += 4, as sketched below.
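A minimal sketch of that four-way version (the diff2/diff3 temporaries are additions of mine, following the pattern of the two-way version):

double diff0, diff1, diff2, diff3;  /* assumes d has been padded to a multiple of 4 */

for (i = 0; i < d; i += 4)
{
  diff0 = descr1[i]   - descr2[i];
  diff1 = descr1[i+1] - descr2[i+1];
  diff2 = descr1[i+2] - descr2[i+2];
  diff3 = descr1[i+3] - descr2[i+3];
  dsq += diff0 * diff0 + diff1 * diff1
       + diff2 * diff2 + diff3 * diff3;
}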

Padding to a length evenly divisible by eight allows the vectors to exactly fit the 64-byte cache line size (eight 8-byte doubles).

Floating-point multiplications require just a clock cycle or two to complete, as do additions and subtractions (according to Agner Fog's instruction tables). So for your example, reducing the iteration overhead should speed things up.

Olof Forshell
  • I gotta say, if this code sped up your program, then the Duff's Device (in my answer) will offer at least a 4-fold increase. You should really give that a try. – abelenky Sep 08 '13 at 22:25
  • @abelenky: I'm not convinced because the optimizer may or (probably) may not understand what kind of interleaving of operations is possible. Essentially each case needs to be atomic so case 7 cannot interleave with optimizations from case 0, case 6 from 7 and so on. Every intermediate value will require storage of its own until the summing step. As I see it you are essentially replacing loop overhead with switch-case overhead. – Olof Forshell Sep 09 '13 at 08:11
  • I gotta say, if this code sped up your program, then using SSE/AVX and multiple cores (along with unrolling the loop which this answer did) will speed it up quite a bit more. – Z boson Sep 09 '13 at 18:09
  • If you had checked cmaster's answer (below) perhaps you wouldn't be so sure. Multiple cores? I think not. The memory system will probably be saturated by the first core so adding another will slow things down by - what? 50 to 150%? You also forgot to mention that my answer enables instruction reordering as well. – Olof Forshell Sep 09 '13 at 20:11

All in all, you have an extremely tight loop that accesses a lot of data. Loop unrolling may help to hide the latency, but on modern hardware, loops like these are bounded by memory bandwidth, not by computational power.

So, the only real hope for optimization that you have is: a) use arrays of float instead of arrays of double to cut the amount of data loaded from memory in half, and b) avoid calling this code as much as possible.

Here are some numbers:

You have three double-precision arithmetic instructions in your inner loop; that's roughly 6 cycles. These need 16 bytes of data. On a 3 GHz processor, that's 8 GB/s of memory bandwidth. A DDR3-1066 module delivers 8.5 GB/s. So, even if you use SSE and the like, you'll not get much faster unless you switch to using float.
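To spell out that arithmetic: each iteration loads two 8-byte doubles, so 16 bytes per ~6 cycles; at 3 GHz that is (16 / 6) × 3×10^9 ≈ 8×10^9 bytes per second, hence the 8 GB/s figure.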

cmaster - reinstate monica
  • A conclusion to be drawn from your comment is to make the algorithm cache-friendly to make the greatest possible use of the memory system bandwidth. It would be interesting to know how many clock cycles each loop in the example requires. – Olof Forshell Sep 09 '13 at 09:56

Assuming you're using a modern Intel/AMD processor (with AVX) and you want to keep the same algorithm, you can try the following code. It uses AVX for vectorization and OpenMP for parallelization. Compile with `gcc foo.c -mavx -fopenmp -O3`. If you don't want to use OpenMP, just comment out the two #pragma statements.

The speed is going to depend on the array sizes and the cache sizes. For arrays that fit in the L1 cache you can expect about a 6x speedup (you should disable OpenMP then, due to its overhead). The boost keeps dropping with each cache level. Even when it gets to system memory it still gets a boost (running over 10M doubles (2*80MB) is still over 70% faster on my two-core Ivy Bridge system).

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <omp.h>

double foo_avx_omp(const double *descr1, const double *descr2, const int d) {
    double diff, dsq = 0;
    int i;
    int range;
    __m256d diff4_v1, diff4_v2, dsq4_v1, dsq4_v2, t1, l1, l2, l3, l4;
    __m128d t2, t3;
    range = (d-1) & -8; //round down to multiple of 8
    #pragma omp parallel private(i,l1,l2,l3,l4,t1,t2,t3,dsq4_v1,dsq4_v2,diff4_v1,diff4_v2) \
    reduction(+:dsq)
    {
        dsq4_v1 = _mm256_set1_pd(0.0);
        dsq4_v2 = _mm256_set1_pd(0.0); //two sums to unroll the loop once

        #pragma omp for
        for(i=0; i<(range/8); i++) {
            //load one cache line of descr1
            l1 = _mm256_load_pd(&descr1[8*i]);
            l3 = _mm256_load_pd(&descr1[8*i+4]);
             //load one cache line of descr2
            l2 = _mm256_load_pd(&descr2[8*i]);
            l4 = _mm256_load_pd(&descr2[8*i+4]);
            diff4_v1 = _mm256_sub_pd(l1, l2);
            diff4_v2 = _mm256_sub_pd(l3, l4);
            dsq4_v1 = _mm256_add_pd(dsq4_v1, _mm256_mul_pd(diff4_v1, diff4_v1));
            dsq4_v2 = _mm256_add_pd(dsq4_v2, _mm256_mul_pd(diff4_v2, diff4_v2));
        }
        dsq4_v1 = _mm256_add_pd(dsq4_v1, dsq4_v2);
        t1 = _mm256_hadd_pd(dsq4_v1,dsq4_v1);
        t2 = _mm256_extractf128_pd(t1,1);
        t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
        dsq += _mm_cvtsd_f64(t3);
    }

    //finish remaining elements if d was not a multiple of 8
    for (i=range; i < d; ++i) {
      diff = descr1[i] - descr2[i];
      dsq += diff * diff;
    }
    return dsq;
}

double foo(double *descr1, double *descr2, int d) {
    double diff, dsq = 0;
    int i;

    for (i = 0; i < d; ++i)
    {
      diff = descr1[i] - descr2[i];
      dsq += diff * diff;
    }
    return dsq;
}

int main(void)
{
    double result1, dtime;
    double *descr1, *descr2;
    const int n = 2000000;
    int i;
    int repeat = 1000;

    descr1 = _mm_malloc(sizeof(double)*n, 64); //align to a cache line 
    descr2 = _mm_malloc(sizeof(double)*n, 64); //align to a cache line

    for(i=0; i<n; i++) {
        descr1[i] = 1.0*rand()/RAND_MAX;
        descr2[i] = 1.0*rand()/RAND_MAX;
    }
    dtime = omp_get_wtime();
    for(i=0; i<repeat; i++) {
        result1 = foo(descr1, descr2, n);
    }
    dtime = omp_get_wtime() - dtime;
    printf("foo %f, time %f\n", result1, dtime);

    dtime = omp_get_wtime();
    for(i=0; i<repeat; i++) {
        result1 = foo_avx_omp(descr1, descr2, n);
    }
    dtime = omp_get_wtime() - dtime;
    printf("foo_avx_omp %f, time %f\n", result1, dtime);

    return 0;
}
Z boson

It looks like you're calculating the sum of squared differences (the squared Euclidean distance) of two vectors.

Use BLAS and you'll be able to take advantage of hand-optimized code that is far more efficient than any of us would ever write.
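A minimal sketch of that idea using the standard CBLAS level-1 routines (the ssq_blas name and the temporary work buffer are mine; link against a BLAS implementation such as OpenBLAS):

#include <cblas.h>
#include <stdlib.h>

double ssq_blas(const double *descr1, const double *descr2, int d)
{
    double *work = malloc(sizeof(double) * d);
    double dsq;

    cblas_dcopy(d, descr2, 1, work, 1);        /* work = descr2          */
    cblas_daxpy(d, -1.0, descr1, 1, work, 1);  /* work = descr2 - descr1 */
    dsq = cblas_ddot(d, work, 1, work, 1);     /* dsq  = work . work     */

    free(work);
    return dsq;
}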

Kennet Belenky
  • Check cmaster's answer. You must load two 8-byte entities per loop. PC3-12800 (cmaster uses PC3-8500) will read the equivalent of 200 million cache lines per second. So that's 1600 million 8-byte entities per second sustained through RAM and caches. You need two for each iteration and do three operations (subtract, multiply and add) on them. 800 million iterations per second doesn't seem impossible to me. I'm not convinced that hand-crafted code would be "far more efficient." But I guess it depends on how you define "far more." – Olof Forshell Sep 09 '13 at 20:30
  • Sure. Feel free to take issue with the "far more efficient." My original point stands that the OP should take advantage of the work that has come before. Don't reinvent the wheel, unless of course you actually want to become a wheelwright. – Kennet Belenky Sep 11 '13 at 17:10
  • Why not "reinvent the wheel?" Code sequences are reinvented all the time often due to external factors allowing more data to be processed, new algorithms or sometimes just for the plain, good old-fashioned fun of learning how something ticks. – Olof Forshell Sep 12 '13 at 07:46

Place the corresponding doubles of descr1 and descr2 next to each other in a structure so that each pair is adjacent in memory. This would improve cache usage and memory access patterns.

Also use `register` for diff and dsq.
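A minimal sketch of that interleaved layout (the pair_t and ssq_pairs names are illustrative; the data would have to be repacked into this form first):

typedef struct {
    double d1;  /* value from descr1 */
    double d2;  /* value from descr2 */
} pair_t;

double ssq_pairs(const pair_t *pairs, int d)
{
    register double diff, dsq = 0.0;
    int i;

    /* Both values of each pair now arrive on the same cache line. */
    for (i = 0; i < d; ++i) {
        diff = pairs[i].d1 - pairs[i].d2;
        dsq += diff * diff;
    }
    return dsq;
}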

Ed Heal
  • using register is useless, compilers ignore it – Gregory Pakosz Sep 08 '13 at 10:59
  • I would say the same thing slightly differently: if you have the sort of compiler that needs `register` hints, the largest speed boost you can get for free is upgrading to a state-of-the-art compiler. – Pascal Cuoq Sep 08 '13 at 11:00
  • @GregoryPakosz - It is a hint for the compiler. Any reference to explain that it is ignored by all compilers? – Ed Heal Sep 08 '13 at 11:02
  • @EdHeal sure: http://stackoverflow.com/a/10675111/216063 -- so yeah try it, add `register` and look whether the generated code changes – Gregory Pakosz Sep 08 '13 at 11:03
  • If you read your reference it says (and I quote) "The **hint can** be ignored and in **most** implementations it will be ignored" (my emphasis). i.e. not always and not in all implementations – Ed Heal Sep 08 '13 at 11:06
  • @EdHeal it's exactly the same argument I just had with @mah on the now deleted answer... The *mythical compiler* that optimizes things based on usage of the `register` keyword just doesn't exist... Try as much as you want. It's 2013. I'm doing lots of cross platform optimizing (Mac, Windows, Linux, iOS, Android). It involves using various versions of Clang, GCC and MSVC. Sure I once tried to use `register` to see how it changes the generated assembly. Guess what? It doesn't change anything! – Gregory Pakosz Sep 08 '13 at 11:21
  • @mah Sorry for the misunderstanding in the deleted answer, I got lost in the comments flow, and I see now (thanks also to KingsIndian) the meaning of the dispute. Still, I'm not convinced about the fact that a compliant compiler could generate different code for array subscripting and pointer dereference. From the quote I cited I understand that the parser must treat them as the same thing at the syntax level, so the code generator won't even see the difference (but I'm not a compiler expert so I may have missed something). – LorenzoDonati4Ukraine-OnStrike Sep 08 '13 at 11:23
  • @EdHeal if you can point a compiler that makes good use of `register` I'm all ears, it would be something worth remembering. – Gregory Pakosz Sep 08 '13 at 11:24
  • @GregoryPakosz - As it is a hint it does no harm - but possibly on some compilers and in some circumstances it may improve performance. So why not just put it in just in case when the performance is an issue. – Ed Heal Sep 08 '13 at 11:25
  • The only real effect of the `register` keyword is that it inhibits that the address of the object is taken. – Jens Gustedt Sep 08 '13 at 11:26
  • @EdHeal when performance is an issue, I also sacrifice a goat just in case. See the idea? :) – Gregory Pakosz Sep 08 '13 at 11:26
  • Anyways, I put enough energy in the pointless dispute. The real optimization is to go SSE and subtract doubles 2 by 2 and multiply-accumulate diff into dsq – Gregory Pakosz Sep 08 '13 at 11:28

Warning: Untested code below.

If both your hardware and compiler support them, you may wish to parallelize some of the operations using vectors. I've used something similar to the following in the past on a GCC 4.6.x compiler (x86-64 Ubuntu machine). Some of the syntax may be off slightly or may vary if you are using a different compiler/architecture. However, I hope it should be close enough to get you a little further towards your goal.

typedef double v2d_t __attribute__((vector_size (16)));

typedef union {
    v2d_t    vector;
    double   d[2];
} v2d_u;

v2d_u    vdsq = { .vector = {0.0, 0.0} };  /* sum of squares of differences */
v2d_u    vdiff;              /* difference */
v2d_t *  vdescr1;            /* pointer to array of aligned vector of doubles */
v2d_t *  vdescr2;            /* pointer to array of aligned vector of doubles */
int      i;                  /* index into array of aligned vector of doubles */
int      d;                  /* # of two-double vectors in each array */

/*
 * ...
 * Assuming that <d> is getting initialized appropriately somewhere
 * ...
 */

for (i = 0; i < d; i++) {
    vdiff.vector = vdescr1[i] - vdescr2[i];
    vdsq.vector += vdiff.vector * vdiff.vector;
}

return vdsq.d[0] + vdsq.d[1];

The above can probably be tweaked some more to get better performance, perhaps with some loop unrolling. Alternatively, if you can exploit 256-bit vectors (such as the YMM registers on some x86 processors) instead of the 128-bit vectors that this sample uses, that too may speed things up (some tweaking to the code would be required).
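In that spirit, a hypothetical 256-bit typedef might look like the following (untested; the loop bounds and the final horizontal sum would need adjusting to four doubles per vector):

typedef double v4d_t __attribute__((vector_size (32)));

typedef union {
    v4d_t    vector;
    double   d[4];
} v4d_u;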

Hope this helps.

Sparky