2

Is there a branchless way to clear a 32-bit register depending on the state of the status register? It can be done with an additional cleared register and CMOVcc, but that is too expensive for me on x86 in 32-bit mode. Sadly, CMOVcc has no form with an immediate operand. Reading from memory is also a bad option.

There is SETcc (though its operand is only 1 byte), but no "CLEARcc" instruction on x86.

Tomilov Anatoliy
  • Expensive how? Because of register pressure? The cmov itself isn't any slower in 32-bit mode. (http://agner.org/optimize/). See my comments on Aki's SBB/AND answer: xor-zeroing a register ahead of flag setting is cheaper than SBB/AND if you can spare a register. – Peter Cordes Feb 11 '18 at 03:05
  • Expensive, because GCC says "asm operand has impossible constraints" due to the lack of spare registers. – Tomilov Anatoliy Feb 11 '18 at 07:30
  • @PeterCordes Are your comments still in force for *Sandy Bridge* arch? – Tomilov Anatoliy Feb 11 '18 at 07:33
  • Wait what? You're using *inline* asm? Is this part of a giant inline-asm block? Maybe write a whole function so you can spill/reload as needed. Or if this is just a small snippet, then https://gcc.gnu.org/wiki/DontUseInlineAsm: use a C `? :` ternary to encourage gcc to go branchless (see the sketch after these comments). And yes, [`xor`-zeroing is fantastically cheap on Sandybridge](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and), as efficient as NOP: just 1 uop for the front-end (fused domain), with no execution unit needed in the unfused domain. – Peter Cordes Feb 11 '18 at 07:45
  • @PeterCordes My case is special: I was solving a programming contest task from an archive. The rules require submitting a single source file, so I can't split it into separate `.S` and `.cpp` files. – Tomilov Anatoliy Feb 11 '18 at 08:48
  • @PeterCordes quote from your link: "6) Performance may not be what you expect." It turns out that inline asm delivers the goods: it is about 6x faster than using AVX intrinsics from the host language. Funnily enough, changing `inc %%ecx;` to `lea 1(%%ecx), %%ecx;` shaves off about 100ms (off a 1600ms total, which is a decent win). – Tomilov Anatoliy Feb 11 '18 at 08:57
  • Ok, as a one-off for a specific compiler with specific surrounding code (when you don't care about future maintainability), then yeah many of the downsides of inline asm aren't applicable. 6 times, though? Are you sure you used the *right* intrinsics, and compiled with optimization enabled? Are you sure you're running on Sandybridge, and not silvermont or something where `inc` is actually slower? Or was that `inc` before / after a shift? https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter. – Peter Cordes Feb 11 '18 at 09:57
  • And BTW, you *can* write whole functions inside GNU C "basic" asm statements at the global scope. e.g. outside any function: `asm(".globl func\n\t"` `"func:\n\t"` ... `);` Then you have total control over register allocation, and can use the stack (which isn't safe in x86-64 inline asm inside a function: https://stackoverflow.com/questions/34520013/using-base-pointer-register-in-c-inline-asm/34522750). – Peter Cordes Feb 11 '18 at 10:01
  • But of course you have to write a whole loop inside the function to minimize call overhead, and be ABI-compliant, so inline-asm with the right constraints can still be even better, if you don't need to do any tricks like putting a block of your code after the `ret` of the function (e.g. an unlikely branch target, or just one that you want to take out of line). – Peter Cordes Feb 11 '18 at 10:03
  • @PeterCordes I am sure about *Sandy Bridge* (Xeon E5-2665). To be fair, it is 3x for the naive algorithm; the further improvement is algorithmic. – Tomilov Anatoliy Feb 11 '18 at 10:44
  • Ok, maybe inc -> lea was a win because of code alignment or something. Did you try `add $1, %ecx`? That's also 3 bytes, and the normal choice if you're avoiding `inc`, unless you want to leave FLAGS unmodified. I still think it's really weird that your hand-written asm is 3x or 6x faster than the compiler. Can you link the source for both versions (e.g. on http://gcc.godbolt.org/)? Might be worth filing a missed-optimization gcc bug report, unless your pure C version was compiled without `-O3`. If it's branch misses that hurt gcc, then it should have chosen branchless. – Peter Cordes Feb 11 '18 at 10:50
  • @PeterCordes The rules don't allow me to share solution code publicly, but I can share it in private. – Tomilov Anatoliy Feb 11 '18 at 10:56
  • My email is `peter@cordes.ca`. – Peter Cordes Feb 11 '18 at 10:57
  • It may not be fair to say "GCC should have chosen branchless" - in the absence of programmer provided feedback gcc simply has to guess at the branch probabilities and most branches are predictable so branches are often a good choice. A person writing assembly can try both and pick the best, but that doesn't make the compiler behavior wrong (it doesn't have enough info). – BeeOnRope Feb 13 '18 at 01:40
  • @PeterCordes Did you receive my emails? – Tomilov Anatoliy Feb 13 '18 at 05:35
  • yeah, haven't looked at it yet. Been watching the winter olympics :) – Peter Cordes Feb 13 '18 at 06:50
  • @PeterCordes It is sad, but my country is banned, so there is nothing interesting in it for me. – Tomilov Anatoliy Feb 13 '18 at 07:32
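A minimal sketch of the `? :` approach suggested in the comments (the function name and the unsigned `a < b` condition are just an example, not from the thread): with optimization enabled, gcc/clang normally compile this to a branchless cmp/cmov sequence, with no inline asm at all.

```c
#include <stdint.h>

/* Clear x when a < b (unsigned), keep it otherwise.
 * At -O2 the compiler typically xor-zeroes a register, does the cmp,
 * and uses cmov, i.e. exactly the branchless pattern discussed here. */
static inline uint32_t clear_if_below(uint32_t x, uint32_t a, uint32_t b)
{
    return (a < b) ? 0u : x;
}
```

Whether the compiler actually picks cmov over a branch is a heuristic decision (see BeeOnRope's comment above), so it is worth checking the generated asm.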

2 Answers

4

This may disappoint you, but CMOVcc is actually quite good in this regard. Using it with a variable ddZERO holding the value 0 is not that bad, especially in a loop.

CMOVcc rTarget, ddZERO

resets the rTarget register to zero if the cc condition is met.
Alternatively, you can invert the scenario and use CMOVcc with the NOT-matching condition. Which choice is better depends on how frequently each case occurs.

If you have a register holding the value 0, you should use that instead. But if you can't spare a register, using a (cached) memory location is not that bad either. This estimate is based on experience, and IIRC a constant in an L1-cached memory location adds practically negligible latency in a loop.
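A minimal GNU C sketch of the memory-source variant (AT&T syntax; the function name, the `zero` variable, and the equality condition are illustrative assumptions, not part of this answer):

```c
#include <stdint.h>

static const uint32_t zero = 0;   /* the 0 constant lives in (L1-cached) memory */

/* Clear x if a == b, leave it unchanged otherwise, without a branch
 * and without tying up an extra register for the zero. */
static inline uint32_t clear_if_equal(uint32_t x, uint32_t a, uint32_t b)
{
    asm("cmpl  %[b], %[a]\n\t"    /* set FLAGS from a - b            */
        "cmove %[z], %[x]"        /* x = 0 if ZF (a == b), else keep */
        : [x] "+r" (x)
        : [a] "r" (a), [b] "r" (b), [z] "m" (zero)
        : "cc");
    return x;
}
```

If a register can be spared, replacing the `"m"` constraint with a register holding 0 (xor-zeroed before the flag-setting instruction) is the cheaper variant discussed in the comments.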

zx485
  • Yup, a micro-fused load of `0` from memory should be fine unless your code is bottlenecked on load uops / L1D throughput. A memory operand for `cmov` is read unconditionally, and starts loading as soon as the address is ready, before the flags and destination register are ready, so normally the memory-source input is ready by the time the other operands are. – Peter Cordes Feb 11 '18 at 03:07
3

There is essentially one generic method in most ISAs for branchless setting or clearing of a register: generating an all-zeros or all-ones mask from the carry flag. `sbb reg,reg` produces a zero mask when carry is clear and an all-ones mask when carry is set; a following `and dst, reg` then either clears the destination register or leaves it unchanged.

One can invert the condition by toggling the mask or by inverting the carry flag (`cmc`). A test for zero can be done either by subtracting one from the register under test, or by subtracting the register under test from zero (`neg`): the first sets carry iff the register was zero; the second sets carry iff it was non-zero.
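A minimal GNU C sketch of the SBB/AND pattern (AT&T syntax; the function name and the unsigned `a < b` condition are just an example): the destination is kept when carry is set and cleared when carry is clear, exactly as described above.

```c
#include <stdint.h>

/* Keep x only when a < b (unsigned), clear it otherwise. */
static inline uint32_t keep_if_below(uint32_t x, uint32_t a, uint32_t b)
{
    uint32_t mask;
    asm("cmpl %[b], %[a]\n\t"   /* CF = (a < b), unsigned                */
        "sbbl %[m], %[m]\n\t"   /* mask = CF ? 0xFFFFFFFF : 0            */
        "andl %[m], %[x]"       /* x &= mask: kept if CF, cleared if not */
        : [x] "+r" (x), [m] "=&r" (mask)
        : [a] "r" (a), [b] "r" (b)
        : "cc");
    return x;
}
```

To clear on the opposite condition, invert the mask with a `not` before the `and` (or invert the carry with `cmc`), at the cost of one more instruction on the critical path.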

Aki Suihkonen
  • If `xor`-zero / `cmov` is too expensive, `sbb` / `and` is hardly better. `sbb` is 2 uops on Intel pre-Broadwell (same as cmov) and has a false dependency on the old value of the register, except on AMD Bulldozer-family (and Ryzen?) where `sbb same,same` is recognized as only dependent on CF. Also, sbb/and has a 3-cycle total latency from flags to the result, because the AND is on the critical path. The only advantage is that the extra register isn't needed until *after* the instruction that sets CF; xor/cmov needs the xor-zeroing before the flag-setting instruction. – Peter Cordes Feb 11 '18 at 03:03
  • If you want to invert the mask, maybe use `xor eax,eax` / set flags / `setc al` / `dec eax`. Although the extra xor-zeroing sucks, too. Depends on the uarch you're targeting, but if you wanted to invert the condition, that tips things even farther in favour of `cmov` instead of creating a mask and using `and`. (With BMI1, you can use `andn` instead of `and` to invert + and in one step, though, so that's quite good on Ryzen / Excavator and maybe not bad on Broadwell / Skylake.) – Peter Cordes Feb 11 '18 at 07:51
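A minimal GNU C sketch of the inverted-mask idea from the comment above (`xor`-zero / set flags / `setc` / `dec`), in AT&T syntax; the names and the unsigned `a < b` condition are chosen only for illustration. It clears the value when the condition is true, i.e. the opposite sense of the plain SBB/AND mask.

```c
#include <stdint.h>

/* Clear x when a < b (unsigned), keep it otherwise. */
static inline uint32_t clear_if_below_mask(uint32_t x, uint32_t a, uint32_t b)
{
    uint32_t mask = 0;              /* normally xor-zeroed by the compiler       */
    asm("cmpl %[b], %[a]\n\t"       /* CF = (a < b), unsigned                    */
        "setc %b[m]\n\t"            /* low byte of mask = CF (0 or 1)            */
        "decl %[m]\n\t"             /* mask = CF ? 0 : 0xFFFFFFFF                */
        "andl %[m], %[x]"           /* x cleared when CF set, kept otherwise     */
        : [x] "+r" (x), [m] "+q" (mask)
        : [a] "r" (a), [b] "r" (b)
        : "cc");
    return x;
}
```

The `"q"` constraint keeps `mask` in a byte-addressable register so that `setc` also works in 32-bit mode.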