
What's faster:

 add    DWORD PTR [rbp-0x4],1

or

 mov    eax,DWORD PTR [rbp-0x4]
 add    eax,1
 mov    DWORD PTR [rbp-0x4],eax

I've seen a compiler generate the second sequence, so maybe doing the add on a register is much faster?


2 Answers


They both decode to the same number of back-end uops, but the memory-destination add gets them through the front-end in fewer fused-domain uops on modern Intel and AMD CPUs.

On Intel CPUs, add [mem], imm decodes to a micro-fused load+add and a micro-fused store-address+store-data, so 2 total fused-domain uops for the front-end. AMD CPUs always keep memory operands grouped with the ALU operation; they don't call it "micro-fusion", it's just how they've always worked. (See https://agner.org/optimize/ and INC instruction vs ADD 1: Does it matter?)
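
As a rough sketch of how the two forms might decode on a recent Intel core (typical counts, not a guarantee for every microarchitecture, and assuming a simple addressing mode like [rdi]):

 add    DWORD PTR [rdi], 1       ; load+add micro-fused, store-address+store-data micro-fused
                                 ; = 2 fused-domain uops, 4 back-end uops

 mov    eax, DWORD PTR [rdi]     ; 1 fused-domain uop: load
 add    eax, 1                   ; 1 fused-domain uop: ALU add
 mov    DWORD PTR [rdi], eax     ; 1 fused-domain uop: micro-fused store
                                 ; = 3 fused-domain uops, 4 back-end uops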


The first way doesn't leave the value in a register, so you couldn't use it as part of ++a if the value of the expression was used; it's only good for the side effect on memory.
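
For example, a tiny hypothetical function int preinc(int *p) { return ++*p; } would typically compile (x86-64 System V, gcc/clang -O2 style output) to something like:

 mov    eax, DWORD PTR [rdi]     ; load *p
 add    eax, 1                   ; increment in a register
 mov    DWORD PTR [rdi], eax     ; store the new value back
 ret                             ; result is already live in eax for the caller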


Using [rbp - 4] and incrementing a local in memory smells like un-optimized / debug-mode code, which you shouldn't be looking at to judge what's efficient. Optimized code typically addresses locals relative to rsp (e.g. [rsp + constant]) and (unless the variable is volatile) wouldn't just store the value back into memory again right away.

See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? Compiling in debug mode, aka -O0 (the default), compiles each C statement separately and treats every variable sort of like volatile, which is totally horrible.
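
That's exactly where asm like the second sequence in the question comes from. A sketch of typical -O0 output for a statement like a++; on a local (exact output varies by compiler and version):

 mov    eax, DWORD PTR [rbp-0x4] ; reload a from its stack slot
 add    eax, 1
 mov    DWORD PTR [rbp-0x4], eax ; store it right back so memory always matches, e.g. for a debugger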

See How to remove "noise" from GCC/clang assembly output? for how to get compilers to make asm that's interesting to look at. Write a function that takes args and returns a value, so it does something real instead of optimizing away or constant-propagating into a single mov eax, constant_result.
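
For instance, a minimal test function like int add_one(int x) { return x + 1; } (hypothetical; exact asm depends on compiler and options) typically compiles at -O2 to just:

 lea    eax, [rdi+1]             ; compute x+1 without touching memory at all
 ret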

Peter Cordes

Adding to a register is probably faster in isolation (registers are on-chip) but, since you have to load and store the data anyway, you're unlikely to see an improvement.

The long-winded approach might even be slower, since the CPU may have opportunities to optimise the shorter code. In addition, the shorter code may give you atomicity for the read/modify/write, depending on how you code it. It certainly won't waste the eax register.
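
On the atomicity point, as the comments below discuss, a plain memory-destination add is atomic only with respect to interrupts on the same core; to make the read/modify/write atomic across cores you'd need a lock prefix, e.g.:

 lock add DWORD PTR [rdi], 1     ; atomic RMW visible to all cores
                                 ; (roughly what an atomic fetch_add of 1 compiles to)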

Bottom line, the longer code is unlikely to be enough of an improvement (if any) to justify the readability hit.

But you don't have to guess (or even ask us) - the chip manufacturers provide copious details on the timings of instructions. For example, Intel's optimisation manual.

paxdiablo
  • Re: atomicity: only with respect to interrupts, therefore only wrt. other code running *on that core*. [Taking a semaphore must be atomic. Is it?](https://stackoverflow.com/a/39358907). In C++ we could say it's atomic wrt. a signal handler in the same thread. – Peter Cordes Apr 23 '20 at 00:48
  • @Peter, that was why I included the weasel words "depending on how you code it" :-) My understanding is that the `lock` prefix on an `inc` opcode on memory will buslock so that other cores don't try to use the value mid change. I could be wrong, of course, it's been a while since I coded that close to the metal. – paxdiablo Apr 23 '20 at 05:40
  • Right, to make it atomic wrt. other cores you need `lock inc [mem]` or `lock add [mem], 1`. If the memory isn't split across cache lines, though, the core only needs to hang onto MESI exclusive ownership of that one line (a "cache lock") from the load to the store; no need to bother other cores that are accessing different lines. [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850) has an overview of details. – Peter Cordes Apr 23 '20 at 05:45