
I have a performance-critical piece of code that, in the real program, works with uint8_t variables (guaranteed not to overflow). They are incremented from other uint8_t values and also used as parts of array indexing and other 64-bit address calculations.

I have looked at the disassembly of MSVC's output with full optimizations and, annoyingly, there are plenty of excessive movzx and other unnecessary instructions for converting between 8-bit and 64-bit operations, no matter how I write the code. If I use an 8-bit variable, address calculations perform additional zero extensions through temporary registers, and if I use a 64-bit variable, there are similar additional operations for the 8-bit values that are added to it.

If this were written in assembly there would be no problem, as the value could be accessed simultaneously through, e.g., the rax and al registers as needed. Is there some way to access the low byte (especially for performing additions) of a uint64_t variable in C++ so that MSVC would be smart enough to compile it to a simple direct al register access (e.g. add al, other_uint8_var) while the full variable lives in the rax register?

I have tried several alternatives: bit-masking the high/low parts to emulate a low-byte change, aliasing through a union of 8-bit and 64-bit values, aliasing the 64-bit value through a temporary 8-bit reference, etc. All of them just led to worse results, often with the variable being spilled from a register to a temporary memory location in order to perform the change.


Simplest example:

#include <stdint.h>
unsigned idx(unsigned *ptr, uint8_t *v)
{
    uint8_t tmp = v[1] + v[2];       // truncate the sum to 8-bit
    return ptr[tmp];                 // before using with a 64-bit pointer
}

All compilers (Godbolt: GCC11/clang14/MSVC19.31/ICC19.01) do a bad job, wasting a movzx eax,al that can't even benefit from mov-elimination for zero latency because they widen within the same register. MSVC 19.31 -O2 compiles to:

unsigned int idx(unsigned int *,unsigned char *) PROC                                ; idx, COMDAT
        movzx   eax, BYTE PTR [rdx+2]
        add     al, BYTE PTR [rdx+1]
        movzx   eax, al                       ;; fully redundant, value already truncated to 8-bit and zero-extended to 64
        mov     eax, DWORD PTR [rcx+rax*4]
        ret     0

Clang/LLVM actually does even worse, starting with a mov al, [mem] load that has a false dependency on the old value of RAX (on CPUs other than P6-family and first-generation Sandy Bridge). But it saves one byte of machine-code size.
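
To make the mov-elimination point concrete (hand-written comparison, not compiler output):

        movzx   eax, al     ; in-place widening: always an executed uop with a cycle of latency
        movzx   ecx, al     ; widening into a different register: eligible for mov-elimination
                            ; (zero latency) on CPUs that support it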


Further examples of how MSVC compiles the alternatives I have tried:

The following runnable program generates three random integers which are summed to form an array index. The interesting part is placed in the test() function to force the compiler to use the desired argument types and to keep that part separated from the otherwise inlined code for easy viewing of the assembly.

#include <iostream>
#include <cstdlib>
#include <cstdint>

static constexpr uint64_t array[30]{ 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 };

struct Test {
    __declspec(noinline) static uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        uint64_t index = base;
        index += added1;
        index += added2;
        return array[index];
    }
};

int main()
{
    uint64_t base = rand() % 10;
    uint8_t added1 = rand() % 10;
    uint8_t added2 = rand() % 10;

    uint64_t result = Test::test(base, added1, added2);

    std::cout << "array[" << base << "+" << (uint64_t)added1 << "+" << (uint64_t)added2 << "]=" << result << std::endl;
    return 0;
}

The above test function, with a uint64_t base index, compiles to:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    movzx       edx,dl  
    add         rcx,rdx  
    movzx       eax,r8b  
    add         rax,rcx  
    lea         rcx,[array (07FF74FD63340h)]  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

The compiler has assigned rcx=base, dl=added1, r8b=added2. The uint8_t values are separately zero-extended before summing.

Changing the base index to uint8_t (source followed by the generated assembly):

uint64_t test(uint8_t base, uint8_t added1, uint8_t added2) {
    uint8_t index = base;
    index += added1;
    index += added2;
    return array[index];
}

uint64_t test(uint8_t base, uint8_t added1, uint8_t added2) {
    add         cl,dl  
    add         cl,r8b  
    movzx       eax,cl  
    lea         rcx,[array (07FF6287C3340h)]  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

So now the compiler is happy to do the math with 8-bit registers but needs to separately zero extend the result for addressing.

What I would like to get is basically the above without the movzx, so that rcx would be used directly as the array offset, since I know all except the lowest byte are already zero. Obviously the compiler cannot do that automatically, since unlike me, it doesn't know the additions can't cause overflow. So what I'm missing is a way to tell it just that.
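
For concreteness, this is roughly the output I am hoping for (hand-written, not actual compiler output, and only valid because in my real code the upper bytes are known to be zero already):

    add     cl, dl
    add     cl, r8b
    lea     rax, [array]
    mov     rax, qword ptr [rax+rcx*8]    ; rcx used directly, no movzx
    ret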

If I try to cast the target as 8-bit through a reference (which tends to work for read operations), or use a union to model the variable like the rcx/cl register pair, the compiler does it through a stack variable:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    reinterpret_cast<uint8_t&>(index) += added1;
    reinterpret_cast<uint8_t&>(index) += added2;
    return array[index];
}

OR:

union Register {
    uint64_t u64;
    uint8_t u8;
};

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    Register index;
    index.u64 = base;
    index.u8 += added1;
    index.u8 += added2;
    return array[index.u64];
}

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    mov         qword ptr [rsp+8],rcx  
    add         cl,dl  
    add         cl,r8b  
    mov         byte ptr [index],cl  
    lea         rcx,[array (07FF798B53340h)]  
    mov         rax,qword ptr [index]  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

Attempting to indicate my intentions through some bit masking compiles to:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    index = (index & 0xffffffffffffff00) | (uint8_t)(index + added1);
    index = (index & 0xffffffffffffff00) | (uint8_t)(index + added2);
    return array[index];
}

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    lea         eax,[rdx+rcx]  
    and         rcx,0FFFFFFFFFFFFFF00h  
    movzx       edx,al  
    or          rdx,rcx  
    lea         rcx,[array (07FF70F4B3340h)]  
    lea         eax,[r8+rdx]  
    and         rdx,0FFFFFFFFFFFFFF00h  
    movzx       eax,al  
    or          rax,rdx  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

I believe a similar pattern works on some compilers for overwriting (instead of adding to) the low byte (a sketch of that pattern is at the end of the question), but here the compiler clearly failed to recognize it. Another similar pattern generates this:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    index = (index & 0xffffffffffffff00) | (((index & 0xff) + added1) & 0xff);
    index = (index & 0xffffffffffffff00) | (((index & 0xff) + added2) & 0xff);
    return array[index];
}

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    movzx       eax,dl  
    add         rax,rcx  
    xor         rax,rcx  
    movzx       edx,al  
    xor         rdx,rcx  
    movzx       eax,r8b  
    add         rax,rdx  
    lea         rcx,[array (07FF6859F3340h)]  
    xor         rax,rdx  
    movzx       eax,al  
    xor         rax,rdx  
    mov         rax,qword ptr [rcx+rax*8]  
    ret
}
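
For reference, this is the low-byte overwrite pattern mentioned above (replacing rather than adding to the low byte). I believe some compilers can turn it into a plain write to the byte register, but I have not verified which ones do:

uint64_t set_low_byte(uint64_t x, uint8_t b) {
    return (x & 0xffffffffffffff00) | b;    // keep the high bytes, replace the low byte
}
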
JohnDoeNot
    Compilers are generally not very smart about when they can safely use partial registers without performance disaster :/ Especially if you don't care about P6-family ([Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502)) – Peter Cordes Apr 07 '22 at 23:18
  • Since you are doing memory accesses I doubt a couple of zero extensions could cause significant performance problems. Did you benchmark it? – Jester Apr 07 '22 at 23:22
  • @PeterCordes Perhaps most annoyingly I can see the compiler using partial registers in various other places of the disassembly, just not on the most obvious places. – JohnDoeNot Apr 07 '22 at 23:49
  • 3
    Can you update your question with a [mcve], please? We *might* be able to help you come up with just the right pattern of C++ code that will cause MSVC to produce the assembly output you want. I've had some success with that in the past. Of course, it's brittle, and may not survive an upgrade to the next version of the compiler, but it's a way better option than trying to write assembly (there's no way that'll be faster). Speaking in concrete terms about a specific piece of code is the difference between this being an answerable question, and this being one (as it is now) that's too broad. – Cody Gray - on strike Apr 07 '22 at 23:52
  • @Jester: I'm currently trying to benchmark how to squeeze out some additional performance and that's exactly why I'm trying to make the compiler remove those extra ops to see the effects, as there's plenty of those. Plan B is to test it somehow directly on assembly files (since inline assembly is not supported for x64). But then if it helps, I would still have this same issue of somehow doing the same on the C++ code. – JohnDoeNot Apr 07 '22 at 23:54
  • 1
    One approach to see if it's significant would be to compile to asm, edit that asm by hand, then assemble + link into a program you can benchmark. MSVC unfortunately doesn't make clean asm output that you can necessarily assemble, the way GCC and clang `-S` do; it can have extra definitions of functions floating around that will conflict. But you should be able to isolate just a definition for a function or two you want, then comment it out in the C++ source and just prototype it. It seems to omit stack-unwind metadata directives, so the result won't be exception / SEH safe, but benchmarkable – Peter Cordes Apr 08 '22 at 02:09
  • 1
    @Jester: if the memory working-set is small and fits in L1d, extra latency in address calculations or loop critical paths could matter. Or just front-end throughput. – Peter Cordes Apr 08 '22 at 02:13
  • 1
    @PeterCordes: While experimenting with different variable sizes and comparing results, I also noticed how those small extra ops sometimes led the compiler to generate further overhead. For example, in some cases, extraneous zero extension through some temp register seemed to lead to it running out of registers, making it convert some register usage to stack variables, or at least preserve and reload some values through the stack. And all that obviously makes the generated assembly harder to follow and identify other issues. – JohnDoeNot Apr 08 '22 at 10:14
  • 1
    *although it still seems to utilize some knowledge of higher bytes being already zeroed when called* - huh? No, the Windows x64 calling convention allows high garbage in registers when narrow args are at the bottom, and the code-gen doesn't depend on any bytes of registers outside those holding narrow args. – Peter Cordes Apr 08 '22 at 12:15
  • 3
    Making it a non-inline function *requires* the compiler to emit those `movzx` instructions in your first `test()`, because you didn't do anything to promise that `added1 + added2` could be done without carry into higher bits (or that you want to discard such carry if it does happen). e.g. `index += (uint8_t) (added1 + added2);` would allow `add dl, r8b` / `movzx` / 64-bit add, so failure to do that would be a missed optimization (a sketch of that variant follows after these comments). There are no missed optimizations in your first `test()`, except for MSVC picking `movzx edx,dl` to zero-extend in-place instead of into a new reg for mov-elim – Peter Cordes Apr 08 '22 at 12:16
  • 3
    *Compiler has assigned rcx=base, dl=added1, r8b=added2* - That's the Windows x64 calling convention. https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170 The compiler didn't have a choice. (Since it's not a `static` private function, it couldn't have invented a custom calling convention even if it wanted to.) – Peter Cordes Apr 08 '22 at 12:20
  • @PeterCordes: Thank you for pointing out those details I'm not really familiar with! – JohnDoeNot Apr 08 '22 at 12:35
  • @PeterCordes: About those higher bytes being zeroed, I somehow temporarily forgot that writing to the lower 32-bits clears the upper 32-bits on registers, and that fooled me. Edited that out. – JohnDoeNot Apr 08 '22 at 12:44
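
For reference, a sketch of the variant Peter Cordes describes in the comments above (grouping the 8-bit additions and truncating them explicitly, so that only one widening add remains); I have not listed MSVC's exact output for it:

__declspec(noinline) static uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    // explicitly discard any carry out of the 8-bit sum, then widen once
    uint64_t index = base + uint8_t(added1 + added2);
    return array[index];
}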

1 Answer


Why not simply write inline assembler code? For really performance-critical code, it may be the only solution... I had to do that for reading from a CompactFlash on a bare-metal platform, for example: C or C++ code wasn't even able to meet the required timings...

Also, benchmarking this assembler code may help you see if you really get more performance - or not! - and maybe give you a goal and/or an indication of what you can expect to reach.

If you need portability, you can still use conditional compilation, with optimized inline assembler for all known targets, and leave the best C/C++ code you have for every other case - it won't be perfect, but not all CPUs have partial registers anyway, so generic C code could do the trick for those platforms.
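
A minimal sketch of what that could look like here (with MSVC x64 this has to be a separate MASM module rather than inline asm; the function name, file name and build flag are made up for illustration, and the register usage assumes the Windows x64 calling convention):

; add_low_byte.asm - hypothetical helper, assembled with ml64 /c
.code
add_low_byte PROC                 ; rcx = 64-bit accumulator, dl = 8-bit addend
    mov     rax, rcx
    add     al, dl                ; touches only the low byte, no zero extension
    ret
add_low_byte ENDP
END

On the C++ side, conditional compilation keeps a portable fallback for other targets:

#ifdef USE_MASM_HELPER            // hypothetical build option
extern "C" uint64_t add_low_byte(uint64_t acc, uint8_t v);
#else
inline uint64_t add_low_byte(uint64_t acc, uint8_t v) {
    // portable fallback: same result as the asm version
    return (acc & ~uint64_t{ 0xff }) | uint8_t(uint8_t(acc) + v);
}
#endif

Whether the saved movzx outweighs the call overhead of an out-of-line helper is exactly what the benchmark would have to show.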

Wisblade
  • Inline assembly is not supported for x64 in Visual Studio as far as I know and as documented here: https://learn.microsoft.com/en-us/cpp/assembler/inline/inline-assembler?view=msvc-170 – JohnDoeNot Apr 07 '22 at 23:40
  • Then do a full assembler module: inline ASM is more convenient, for sure, but you can still compile a pure ASM function with MASM and link it to your C++ code... See [this page](https://learn.microsoft.com/en-us/cpp/assembler/masm/masm-for-x64-ml64-exe?view=msvc-170). It doesn't change the conditional compilation part: you should NEVER link to ASM code without such a precaution anyway. – Wisblade Apr 07 '22 at 23:49
  • @JohnDoeNot it doesn’t support _inline_ assembly. You can still add an asm file to your project and call it that way – Taekahn Apr 07 '22 at 23:49
  • 1
    Inline assembly (or even out-of-line assembly) to avoid some zero extensions is absolutely *not* going to be faster than what the compiler generates. – Cody Gray - on strike Apr 07 '22 at 23:49
  • @CodyGray But at least, once the OP has benchmarked it, it can ease his mind about that... And if it happens to be faster, it solves his problem. Win-win situation. – Wisblade Apr 07 '22 at 23:51
  • I don't really understand. You are suggesting to do extra work, trying something that can be self-evidently dismissed as not a valid solution with a bit of understanding of the realities of compiler code-generation, inlining, relative instruction timing, etc., just to ease someone's mind? I don't see how that provides any peace of mind. There's no win to be had here. – Cody Gray - on strike Apr 07 '22 at 23:53
  • @CodyGray It's not _your_ time, or _your_ project, and his management can require testing every potential solution. I face that constantly: they aren't technical people, they don't understand, and they want to "see". After all, it's their money, so... And I haven't even spoken about getting a hardware platform designed by electronics engineers who never thought of asking the software engineers whether it would be suitable - and once PCBs are produced, you have to deal with it and make them work. Sometimes we need to do some realpolitik at work, even if it seems (or is) stupid. – Wisblade Apr 07 '22 at 23:57
  • Yeah, I complain about these same things at work to non-technical people who want to "see". I fixed the problem of electrical engineers designing an unsuitable hardware platform by supervising their work closely. I guess it didn't achieve my overall goal of reducing workload very well, though. :-) – Cody Gray - on strike Apr 08 '22 at 00:34
  • 3
    @JohnDoeNot: Note that only MSVC is crippled like that. GCC and clang support GNU C inline asm which can give you a low-overhead way to wrap an instruction or two. With clang-cl, it's somewhat of a drop-in replacement for MSVC. However, I would *not* recommend inline asm if you can avoid it. (https://gcc.gnu.org/wiki/DontUseInlineAsm). It's a black-box for the optimizer and will prevent auto-vectorization and constant propagation, among other things. – Peter Cordes Apr 08 '22 at 01:43
  • 1
    @CodyGray: Writing a whole loop in asm is viable as a last resort, and is doable even with MSVC + MASM. But before that, wrapping an operation that adds an 8-bit input to a uint64_t input is something you can do with GNU C inline asm (a sketch of such a wrapper is at the end of this page). Ideally you'd want to hand-hold the compiler with pure C, but they don't even try to keep an 8-bit value zero-extended to 64 by just using 8-bit operand-size in the most trivial example: https://godbolt.org/z/5sfPxn9he . Looks like it's time to file missed-optimization bugs on GCC and clang. (And MSVC if anyone cares. ICC is dead-ended for LLVM-based ICX.) – Peter Cordes Apr 08 '22 at 01:52
  • 1
    @CodyGray: I added a [mcve] to the question where GCC/clang/MSVC and pre-LLVM ICC all fail. IDK if it's exactly what the OP had in mind, but it seems to me the simplest / clearest way to express the logic to the compiler. We'll see if they edit to confirm this is the kind of thing they want MSVC to optimize, or something else. – Peter Cordes Apr 08 '22 at 02:11
  • @PeterCordes: Your edit certainly illustrates the issue nicely. Thank you! I began writing a similar example yesterday but didn't manage to finish it in time. I don't know if it's worth it to add it as another example anymore (to the original question or as a separate answer). – JohnDoeNot Apr 08 '22 at 09:58
  • 1
    @JohnDoeNot: Definitely not as an answer, unless it gets some compiler to avoid movzx. Unless your idea (and your real use-case) had any significant differences, you should just edit the question to remove the (editor's note) paragraph at the bottom. If they did, you could adapt my example or add another one. – Peter Cordes Apr 08 '22 at 10:03
  • @PeterCordes: I added my example how the compiler treats the various alternatives I have tried. – JohnDoeNot Apr 08 '22 at 12:08
  • @JohnDoeNot I'll repeat myself, but rather than trying to twist the compiler's output, **try what you want** with a MASM module, and benchmark it to simply _SEE_ if it is as good as you expect... If it isn't, why bother getting a given compiler output once you've proved it was inefficient? And if it _IS_ better, then you have your solution... – Wisblade Apr 08 '22 at 12:47
  • @Wisblade: I believe the arguments given against separate assembly modules given above are valid, as it removes other optimization opportunities from the compiler. So while it could provide benchmarking results, it wouldn't provide a result I could easily use or even reuse in other places if desired. In my mind the ability to utilize partial registers with sort of C++ semi-assembly also seems like such a simple need that it can and will arise in other cases too. So I would really like to know if there's a simple pattern that could be used for that. I certainly expected there to be one. – JohnDoeNot Apr 08 '22 at 13:04
  • I'm especially curious why this simple solution from above isn't optimized as expected, or if there's some clear reason why the compiler couldn't turn it into partial register write: reinterpret_cast(index) += added1. Because of possible endianness issues? – JohnDoeNot Apr 08 '22 at 13:07
  • @JohnDoeNot It removes optimizations at function level, yeah. But let's put it another way: unless you modify the compiler, you won't get what you want. Try it, shield it with conditional compilation in the caller, and benchmark it. **THEN**, if you get faster code, you can wonder why the compiler didn't achieve that... In the other case, the compiler was right to output ASM this way. Reduction to absurdity, a good way of proving something elegantly. What _SEEMS_ logical for a human isn't always logical for a machine... Because you don't know _all_ the hardware implications behind it. – Wisblade Apr 08 '22 at 13:35
  • @JohnDoeNot If you do that, and depending on the results you get, we can then have another discussion, in another question, that will probably be far more interesting about compiler optimizations. Currently, I could summarize it as "I think the compiler should do that, but I didn't try the solution I believe is better, and I want it to output my solution before testing it"... You've spent more time in these comments than would have been necessary to test it, and worse, your problem hasn't progressed at all despite these discussions. That's why I encourage you to test it. – Wisblade Apr 08 '22 at 13:40
  • @Wisblade: I'm not familiar with the details on writing and integrating ASM modules to my cmake project, so it will take me longer than these comments. But yes, I will do some reading and trying as time permits. – JohnDoeNot Apr 08 '22 at 14:06
  • @JohnDoeNot Hmm? I linked you the page on "how to do it" in the 2nd comment... The procedure is detailed step by step... Did you miss the link? – Wisblade Apr 08 '22 at 16:05
  • @Wisblade: It's just not that simple when dealing with existing code and unfamiliar details. It was relatively easy to add a dummy conditional NASM module to the build but it immediately revealed the first issue with lost optimization, as it seems ASM functions cannot be inlined even with whole program optimization, adding call overhead. Then the piece that needs to be converted contains dependencies to other C++ classes, lookup tables generated there etc. Losing other compiler optimizations, I also need to understand instruction level changes better to avoid apples to oranges comparisons etc. – JohnDoeNot Apr 08 '22 at 19:47
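
For reference, a sketch of the GNU C inline asm wrapper Peter Cordes mentions above (GCC/clang only, not MSVC; the function name is made up for illustration and this is untested):

static inline uint64_t add_u8_to_low_byte(uint64_t acc, uint8_t v) {
    // add v into the low byte of acc only; %b0 / %b1 select the 8-bit parts of the registers
    asm("add %b1, %b0"
        : "+r"(acc)
        : "r"(v)
        : "cc");
    return acc;
}

As the same comments point out, inline asm is a black box for the optimizer, so this is a last resort rather than a recommendation.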