Why don't double elements from arrays load into FP registers?

Question

So, I have a function like this in C++:

double fun_w22(double* xx, double* z, int N) {
    double result= 0.0;

    for (int i = 0; i < N; i++) {
        result+= xx[i] * z[i];
    }

    return result;
}

And the same function using the x86 FPU (x87):

double fun_w22_asm(double* xx, double* z, int N) {
    double result= 0.0;
    __asm {
        mov ecx, N
        mov esi, xx
        mov edi, z 
        fld [esi]
        fld [edi]
        fmul [esi]  
        dec ecx 
    petla:
        add esi, 8
        add edi, 8
        fld[esi]
        fmul[edi]
        fadd
        dec ecx
    jnz petla
        fstp result
    }

    return result;
}

The problem is that the elements of the arrays do not load into the registers, so the assembly function does not produce the right result. Here is the main function:

#include <iostream>

int main() {

    int N=3;
    
    double xx[] = { 1.0, 2.0, 3.0 }; // Sample array x
    double z[] = { 4.0, 5.0, 6.0 }; // Sample array z

    // Call the function to calculate the result
    double result = fun_w22(xx, z, N);
    double result1 = fun_w22_asm(xx, z, N);
    // Display the result
    std::cout << "Result: " << result << std::endl;
    std::cout << "Result ASM: " << result1 << std::endl;

    return 0;
}

I was expecting the same result, but the assembly version runs and prints 0. When I open the disassembly view in VS, I see random numbers in the registers instead of the expected values. I don't know whether I need something extra in my code or some setting turned on in VS. I really need to understand this.

Make sure your floating point instructions use the proper size. I don't know what the default is. Try inserting `qword ptr` such as `fld qword ptr [esi]` to force double precision. — Jester, Aug 29 '23 at 12:51
Be wary ... your asm and C++ code aren't the same. Consider when `N` is `0` or `1`. PS: `dec ecx; jnz petla` is equivalent to just `loop petla` ... Also, you don't use `result` in the assembly version (or is that `wynik`, which was just missed when translating?) — ChrisMM, Aug 29 '23 at 12:55
It was only missed when translating; I changed it. I know it is equivalent, but at university we write loops the way I did, so I'll stick with it. Apart from this, is there any issue with the code that I don't see? — Biskopt, Aug 29 '23 at 13:17
I mean, the first `fld [esi]` is logically pointless as far as I can tell, you don't handle `N==0` correctly, and it fails in practice. [Here](https://godbolt.org/z/oM6eGnnxa) is my attempt to debug the C++, the ASM, and ASM-translated-back-to-C++. I don't see anything obvious, but I am far from an ASM expert. — Yakk - Adam Nevraumont, Aug 29 '23 at 14:11
Deleting `fld [esi]` does not change anything; the asm result is still 0. I'm out of ideas. I have code like this in my lectures, so I guess it should work... but it doesn't. I don't get it. So we are still at the same point. — Biskopt, Aug 29 '23 at 14:54
My debugger shows correct values after returning from the function calls: `result = 32, result1 = 32`. An exception is thrown when the program tries to print the result messages. — Nassau, Aug 29 '23 at 16:47
So the code is computing it correctly. Interesting, and that is also good news. So there is some other problem I don't know about, maybe with my debugger. — Biskopt, Aug 29 '23 at 17:28
@ChrisMM: I hope you're not recommending someone actually use [the slow `loop` instruction](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently) outside of code-golf or tuning for original 8086. `dec/jnz` is a good way to write the bottom of a `do{}while()` loop for modern CPUs, so that part is idiomatic. You're right about the `N<=1` problems, though; they should probably just not peel the first iteration so N==1 Just Works (after fixing the FP stack balancing), and if `N==0` is possible, then `test ecx,ecx` / `jz done` — Peter Cordes, Aug 29 '23 at 17:38
`fldz` could be used instead of a peeled first iteration, or just do extra branching. But I think the biggest problem (other than operand-size) is that the x87 stack grows by 1 for every iteration of the inner loop because this uses `fadd` instead of `faddp`. — Peter Cordes, Aug 29 '23 at 17:50
@PeterCordes, honestly, didn't realize that `loop` was different than `dec; jnz`, figured it was just the more convenient way to write it. So, definitely my mistake there. — ChrisMM, Aug 29 '23 at 18:26
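
The operand-size suggestion in the comments can be illustrated in portable C++ rather than assembly. This is only a sketch under an assumption the thread does not confirm: that the unsized `fld [esi]` is assembled as a 32-bit (float) load. In that case each load would see only the low 4 bytes of an 8-byte double. For the sample values 1.0 through 6.0 those low 32 bits are all zero, so every product would be 0, which would be consistent with the reported output. The helper names `low_dword_as_float` and `dot_with_dword_loads` are hypothetical and exist only to model that assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit double");
static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Hypothetical helper: reinterpret the low 32 bits of a double as a float.
// On little-endian x86 these are the 4 bytes a dword-sized load from the
// double's address would read.
float low_dword_as_float(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    std::uint32_t low = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    float f;
    std::memcpy(&f, &low, sizeof f);
    return f;
}

// Hypothetical model of the dot product as it would come out if every
// element were loaded with a 32-bit operand but the pointers still
// advanced by 8 bytes per element.
double dot_with_dword_loads(const double* xx, const double* z, int n) {
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += static_cast<double>(low_dword_as_float(xx[i])) *
               static_cast<double>(low_dword_as_float(z[i]));
    }
    return sum;
}
```

Calling `dot_with_dword_loads` on the arrays from `main` shows what the mis-sized loads would produce, which is why writing `qword ptr` explicitly on every memory operand is worth trying first.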

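The structural suggestions from the comments (start from zero instead of peeling the first iteration, guard against an empty array, keep `dec`/`jnz` at the bottom of the loop, and pop after each add) can be sketched as plain C++ with the corresponding x87 instructions shown in comments. The function name `dot_like_fixed_asm` is hypothetical, the `n <= 0` guard is slightly broader than the `jz` mentioned in the thread, and the instruction comments are an untested illustration, not a verified MSVC inline-asm listing.

```cpp
#include <cassert>

// Hypothetical C++ mirror of the restructured assembly loop.
double dot_like_fixed_asm(const double* xx, const double* z, int n) {
    double acc = 0.0;              // fldz              ; st(0) = 0.0
    if (n <= 0) {                  // test ecx, ecx / jle done
        return acc;
    }
    assert(xx != nullptr && z != nullptr);
    do {
        double prod = *xx * *z;    // fld  qword ptr [esi]
                                   // fmul qword ptr [edi]
        acc += prod;               // faddp st(1), st   ; add and pop
        ++xx;                      // add esi, 8
        ++z;                       // add edi, 8
    } while (--n != 0);            // dec ecx / jnz petla
    return acc;                    // fstp qword ptr result ; stack empty again
}
```

With the arrays from `main` this returns the same value as `fun_w22`, and it also handles `N == 1` and `N == 0` without a special first iteration. In this shape every value pushed onto the x87 stack is popped again before the function returns.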