What's faster:
add DWORD PTR [rbp-0x4],1
or
mov eax,DWORD PTR [rbp-0x4]
add eax,1
mov DWORD PTR [rbp-0x4],eax
I've seen the second code generated by a compiler, so maybe calling add on a register is much faster?
They both decode to the same number of back-end uops, but the memory-destination add gets those uops through the front-end in fewer fused-domain uops on modern Intel/AMD CPUs.
On Intel CPUs, add [mem], imm decodes to a micro-fused load+add and a micro-fused store-address+store-data, so 2 total fused-domain uops for the front-end. AMD CPUs always keep memory operands grouped with the ALU operation without calling it "micro-fusion"; that's just how they've always worked.
(https://agner.org/optimize/ and INC instruction vs ADD 1: Does it matter?).
The first way doesn't leave the value in a register, so it only works for the side effect on memory; you couldn't use it to implement ++a if the value of the expression is used.
Using [rbp - 4] and incrementing a local in memory smells like un-optimized / debug-mode code, which you should not be looking at for what's efficient. Optimized code typically uses [rsp +- constant] to address locals, and (unless the variable is volatile) wouldn't be just storing it back into memory again right away.
See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? - compiling in debug mode, aka -O0 (the default), compiles each C statement separately and treats every variable somewhat like volatile, which is terrible for performance.
See How to remove "noise" from GCC/clang assembly output? for how to get compilers to make asm that's interesting to look at. Write a function that takes args and returns a value so it can do something without optimizing away or propagating constants into mov eax, constant_result.
Adding to a register probably is faster (since the registers are on-chip) but, since you have to load and store the data anyway, you're unlikely to see an improvement.
The long-winded approach might even be slower, since there may be opportunities for the CPU to optimise the shorter code. In addition, the shorter code can be made an atomic read/modify/write (with a lock prefix), which the three-instruction sequence can't. It certainly won't waste the eax register.
Bottom line, the longer code is unlikely to be enough of an improvement (if any) to justify the readability hit.
But you don't have to guess (or even ask us) - the chip manufacturers provide copious detail on instruction timings; see, for example, Intel's optimisation manual.