
Original C source code:

#include <stdint.h>

#define BITSET_WORDIDX(idx) (idx >> 6)
#define BITSET_WORD_BITIDX(idx) (idx & 0x3F)
static inline __attribute__((always_inline, regparm(2))) unsigned int test_and_set_bit_nonatomic(
    uint64_t *restrict bitset, const unsigned int bitset_idx)
{
    bitset += BITSET_WORDIDX(bitset_idx);
    register uint64_t u64_bitset = *bitset;
    register const uint64_t u64_idx = BITSET_WORD_BITIDX(bitset_idx);
    register unsigned int cf_copy;

    asm volatile("btsq %2, %1"
                 : "=@ccc"(cf_copy), "+r"(u64_bitset)
                 : "r"(u64_idx));
    *bitset = u64_bitset;
    return cf_copy;
}
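For comparison, here is a hypothetical pure-C counterpart with no inline asm (my own naming and sketch, not code from the question): the shift/OR form expresses the same test-and-set, and leaving register allocation entirely to the compiler often avoids exactly the kind of extra `mov` being asked about.

```c
#include <stdint.h>

#define BITSET_WORDIDX(idx) ((idx) >> 6)
#define BITSET_WORD_BITIDX(idx) ((idx) & 0x3F)

/* Sketch: same semantics as the asm version above, but the compiler is
 * free to choose registers and addressing modes itself. GCC/clang will
 * often emit a bts-based sequence for this pattern at -O2 and above. */
static inline unsigned int test_and_set_bit_c(uint64_t *restrict bitset,
                                              const unsigned int bitset_idx)
{
    uint64_t *const word = bitset + BITSET_WORDIDX(bitset_idx);
    const uint64_t mask = (uint64_t)1 << BITSET_WORD_BITIDX(bitset_idx);
    const unsigned int was_set = (*word & mask) != 0;

    *word |= mask;
    return was_set;
}
```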

Compiled assembly for the function call, built with gcc -Ofast -march=native -S -fverbose-asm (I rewrote the comments for readability):

# "%eax" in the lines below indicates the "bitset_idx" parameter is a register variable
# from the enclosing scope (since it's an "always_inline" function).
# "%eax" is never written to in this scope.
    movl    %eax, %edx  # (temp. register #0)bitset_idx = bitset_idx;
    movl    %eax, %ebp  # (temp. register #1)bitset_idx = bitset_idx;
    shrl    $6, %edx    # (temp. register #0)bitset_idx >>= 6;
    andl    $63, %ebp   # (temp. register #1)bitset_idx &= 0x3F;
# "%r13" in the line below indicates the "bitset" parameter is a register variable
# from the enclosing scope (since it's an "always_inline" function).
# "%r13" is never written to in this scope.
    leaq    0(%r13,%rdx,8), %rsi    # (void *)bitset += ((temp. register #0)bitset_idx * 8)
    movl    %ebp, %edi  # u64_idx = (temp. register #1)bitset_idx
    movq    (%rsi), %rdx    # u64_bitset = *bitset
#APP
    btsq %rdi, %rdx
#NO_APP
    movq    %rdx, (%rsi)    # *bitset = u64_bitset

My question is: why does GCC insert `movl %ebp, %edi  # u64_idx = (temp. register #1)bitset_idx`? Since the EBP register already holds that temporary value, wouldn't it be more efficient to simply treat EBP as "u64_idx" from that point on?

    FYI, `bts` masks the count for you, your C `&` is a missed optimization. See my answer on [What is the best way to store a large list of Boolean values in x86 assembly?](https://stackoverflow.com/q/68451133) (especially the fact that narrower accesses can be more efficient, allowing more instruction-level parallelism if two updates are in separate 16-bit words inside the same qword.) – Peter Cordes Jun 28 '22 at 10:12
    Looks like a missed optimization. Those are unfortunately not rare in GCC. Wasted `mov` instructions specifically are something GCC tends to have more than clang, but clang does sometimes have other missed optimizations. – Peter Cordes Jun 28 '22 at 10:15
  • @PeterCordes "bts masks the count for you, your C & is a missed optimization." -> Thanks! I really didn't know about that until now. – hurryman2212 Jun 28 '22 at 10:17
  • @PeterCordes "especially the fact that narrower accesses can be more efficient, allowing more instruction-level parallelism if two updates are in separate 16-bit words inside the same qword." -> Does that mean I should narrow the element type of "bitset"? I understand that QWORD is excessive (I actually missed this because I reuse the baseline source for traversing the set with "bsfq"), but isn't it a bad idea to go below DWORD? (According to "https://www.agner.org/optimize/instruction_tables.pdf", modern microarchitectures show a latency penalty for "mov r8/r16,m"?) – hurryman2212 Jun 28 '22 at 10:20
  • @PeterCordes So the solution here is to write the "u64_idx"-related part in inline assembly as well? (in addition to the points in the comments above) – hurryman2212 Jun 28 '22 at 10:22
    Read my linked answer; you want to load with `movzx eax, word [rdi]` or similar, which is what GCC actually does. If GCC makes it inconvenient to get ideal asm for WORD size (like in my answer and the linked Sieve of Eratosthenes implementations using that in the inner loop), then use DWORD operand-size. Of course C makes it inconvenient to access the same data with different widths; if you're searching for the first non-zero bit from a given start point, you *do* want QWORD loads. So you'd need to express that with `memcpy` if the underlying type isn't 64 bits wide. – Peter Cordes Jun 28 '22 at 10:30
    And yes, hand-writing more code in (inline) asm is one workaround for the compiler doing a bad job and wasting instructions. Another is to report the bug on [GCC's bugzilla](https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc) with the missed-optimization keyword, and hope it gets fixed within a few months or years. Also try a different compiler, like clang. (But note that clang likes to save code-size at the risk of partial-register penalties, so it might well `mov r16, m` for a 16-bit load, especially if not inlined into a loop where it could see a loop-carried false dependency .) – Peter Cordes Jun 28 '22 at 10:37
