I have a performance-critical piece of code that works with uint8_t variables (guaranteed not to overflow) which are incremented from other uint8_t values and also used in array indexing and other 64-bit address calculations.
I have looked at the disassembly produced by MSVC with full optimizations, and annoyingly there are plenty of excessive movzx and other unnecessary instructions for converting between 8-bit and 64-bit operations, no matter how I write it. If I use an 8-bit variable, the address calculations perform additional zero extensions through temporary registers; if I use a 64-bit variable, there are similar extra operations for the other 8-bit values that are added to it.
In assembly this would be no problem, since the value could be accessed simultaneously through e.g. the rax and al registers as needed. Is there some way in C++ to access the low byte (especially for performing additions) of a uint64_t variable so that MSVC is smart enough to compile it into a simple direct al register access (e.g. add al, other_uint8_var) while the full variable lives in rax?
I tried several alternatives: bit masking the high/low parts to emulate the low-byte change, aliasing with a union of 8-bit and 64-bit members, aliasing the 64-bit value through a temporary 8-bit reference, etc. All of them just led to worse results, often to the point that the variable was spilled from the register to a temporary memory location to perform the change.
Simplest example:
#include <stdint.h>

unsigned idx(unsigned *ptr, uint8_t *v)
{
    uint8_t tmp = v[1] + v[2]; // truncate the sum to 8-bit
    return ptr[tmp];           // before using with a 64-bit pointer
}
All compilers (Godbolt: GCC11/clang14/MSVC19.31/ICC19.01) do a bad job, wasting a movzx eax,al that can't even benefit from mov-elimination for zero latency because they widen within the same register. MSVC 19.31 -O2 compiles to:
unsigned int idx(unsigned int *,unsigned char *) PROC    ; idx, COMDAT
        movzx   eax, BYTE PTR [rdx+2]
        add     al, BYTE PTR [rdx+1]
        movzx   eax, al        ;; fully redundant, value already truncated to 8-bit and zero-extended to 64
        mov     eax, DWORD PTR [rcx+rax*4]
        ret     0
Clang/LLVM actually does even worse, starting with a mov al, [mem] load that has a false dependency on the old value of RAX (on CPUs other than P6-family and first-generation Sandy Bridge). But it saves one byte of machine-code size.
Further examples of how MSVC compiles:
The following runnable program generates 3 random integers whose sum is used as an array index. The interesting part is placed in the test() function to force the compiler to use the desired argument types, and to keep it separate from the otherwise inlined code for easy viewing of the assembly.
#include <iostream>
#include <cstdint> // for uint64_t/uint8_t
#include <cstdlib>

static constexpr uint64_t array[30]{ 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 };
struct Test {
    __declspec(noinline) static uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        uint64_t index = base;
        index += added1;
        index += added2;
        return array[index];
    }
};
int main()
{
    uint64_t base = rand() % 10;
    uint8_t added1 = rand() % 10;
    uint8_t added2 = rand() % 10;
    uint64_t result = Test::test(base, added1, added2);
    std::cout << "array[" << base << "+" << (uint64_t)added1 << "+" << (uint64_t)added2 << "]=" << result << std::endl;
    return 0;
}
The above test() function with the uint64_t base index compiles to:
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        movzx   edx,dl
        add     rcx,rdx
        movzx   eax,r8b
        add     rax,rcx
        lea     rcx,[array (07FF74FD63340h)]
        mov     rax,qword ptr [rcx+rax*8]
        ret
}
The compiler has assigned rcx=base, dl=added1, r8b=added2. The uint8_t values are separately zero-extended before summing.
Changing the base index to uint8_t compiles as follows:
uint64_t test(uint8_t base, uint8_t added1, uint8_t added2) {
    uint8_t index = base;
    index += added1;
    index += added2;
    return array[index];
}
uint64_t test(uint8_t base, uint8_t added1, uint8_t added2) {
        add     cl,dl
        add     cl,r8b
        movzx   eax,cl
        lea     rcx,[array (07FF6287C3340h)]
        mov     rax,qword ptr [rcx+rax*8]
        ret
}
So now the compiler is happy to do the math in 8-bit registers, but needs to separately zero-extend the result for addressing.
What I would like to get is basically the above without the movzx, so that rcx would be used directly as the array offset, since I know all bytes except the lowest are already zero. Obviously the compiler cannot do that automatically: unlike me, it doesn't know the additions can't cause overflow. So what I'm missing is a way to tell it just that.
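For completeness, the generic way to state such an invariant is an assumption intrinsic: __assume on MSVC, __builtin_unreachable on GCC/Clang (the ASSUME wrapper below is my own naming). This tells the optimizer the 8-bit additions cannot carry out of the low byte, though merely stating it does not by itself force the register use I'm after:

```cpp
#include <cstdint>

// ASSUME is a hypothetical wrapper of my own, mapping to the
// compiler-specific ways of stating a value-range assumption.
// Violating the stated condition is undefined behavior.
#if defined(_MSC_VER) && !defined(__clang__)
  #define ASSUME(cond) __assume(cond)
#else
  #define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

static constexpr uint64_t array[30]{ 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,
                                     15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 };

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    // Promise: the additions never carry out of the low byte.
    ASSUME((base & 0xff) + added1 + added2 <= 0xff);
    uint64_t index = base;
    index += added1;
    index += added2;
    return array[index];
}
```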
If I try to cast the target variable to 8-bit (which tends to work for read operations), or use a union to model the variable like the rcx/cl register pair, the compiler goes through a stack variable:
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    reinterpret_cast<uint8_t&>(index) += added1;
    reinterpret_cast<uint8_t&>(index) += added2;
    return array[index];
}
OR:
union Register {
    uint64_t u64;
    uint8_t u8;
};

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    Register index;
    index.u64 = base;
    index.u8 += added1;
    index.u8 += added2;
    return array[index.u64];
}
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        mov     qword ptr [rsp+8],rcx
        add     cl,dl
        add     cl,r8b
        mov     byte ptr [index],cl
        lea     rcx,[array (07FF798B53340h)]
        mov     rax,qword ptr [index]
        mov     rax,qword ptr [rcx+rax*8]
        ret
}
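Incidentally, reading the inactive union member is formally UB in ISO C++ (MSVC documents it as working); the standard-conforming way to express the same low-byte splice is a one-byte memcpy, sketched below (byte 0 is the low byte on a little-endian target like x86-64). I'd expect the same store/reload problem, but it at least rules out aliasing rules as the culprit:

```cpp
#include <cstdint>
#include <cstring>

// Well-defined low-byte update via memcpy; compilers normally lower a
// one-byte memcpy to register moves, not actual memory traffic.
static uint64_t add_to_low_byte(uint64_t x, uint8_t add) {
    uint8_t lo;
    std::memcpy(&lo, &x, 1);   // read the low byte (little-endian assumption)
    lo = uint8_t(lo + add);    // 8-bit wrapping add
    std::memcpy(&x, &lo, 1);   // splice it back
    return x;
}
```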
Attempting to indicate my intentions through some bit masking compiles to:
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    index = (index & 0xffffffffffffff00) | (uint8_t)(index + added1);
    index = (index & 0xffffffffffffff00) | (uint8_t)(index + added2);
    return array[index];
}
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        lea     eax,[rdx+rcx]
        and     rcx,0FFFFFFFFFFFFFF00h
        movzx   edx,al
        or      rdx,rcx
        lea     rcx,[array (07FF70F4B3340h)]
        lea     eax,[r8+rdx]
        and     rdx,0FFFFFFFFFFFFFF00h
        movzx   eax,al
        or      rax,rdx
        mov     rax,qword ptr [rcx+rax*8]
        ret
}
I believe a similar pattern works on some compilers for overwriting (rather than adding to) the low byte, but here the compiler clearly failed to recognize it. Another similar pattern generates this:
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    index = (index & 0xffffffffffffff00) | (((index & 0xff) + added1) & 0xff);
    index = (index & 0xffffffffffffff00) | (((index & 0xff) + added2) & 0xff);
    return array[index];
}
uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        movzx   eax,dl
        add     rax,rcx
        xor     rax,rcx
        movzx   edx,al
        xor     rdx,rcx
        movzx   eax,r8b
        add     rax,rdx
        lea     rcx,[array (07FF6859F3340h)]
        xor     rax,rdx
        movzx   eax,al
        xor     rax,rdx
        mov     rax,qword ptr [rcx+rax*8]
        ret
}
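For reference, the low-byte overwrite pattern I mean is the one below; the masking attempts above were my (failed) tries at bending it into an addition:

```cpp
#include <cstdint>

// Replace (not add to) the low byte of x. I believe some compilers
// pattern-match this into a single byte-register move (e.g. mov al, dl),
// which is what I was hoping the addition variants would turn into.
static uint64_t set_low_byte(uint64_t x, uint8_t b) {
    return (x & ~uint64_t(0xff)) | b;
}
```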