Which sequence of instructions has better performance for zeroing one register or another?

Question

I had an assignment from my professor and one part of it sparked a conversation on branchless programming. The goal was to convert this C code to MIPS assembly (assuming a and b were in registers $s0 and $s1, respectively) using the slt instruction.

if (a <= b)
  a = 0;
else
  b = 0;

The expected response was:

        slt $t0,$s1,$s0    ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
        bne $t0,$0,else    ; Branch on not-equal (jump to `else` if $t0 != $0)
        add $s0,$0,$0      ; $s0 = $0 + $0, aka $s0 = $0 * 2
        j   done           ; Unconditional jump to `done`
else:   sub $s1,$0,$0      ; $s1 = $0 - $0, aka `$s1 = 0`
done:

However, I submitted the following branchless solution (from what I understand, branchless programming is preferred in terms of performance):

        slt  $t0,$s1,$s0    ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
        mul  $s0,$s0,$t0    ; $s0 = $s0 * $t0 (truncated to 32 bits)
        xori $t0,$t0,0x1    ; $t0 = XOR( $t0, 1 )
        mul  $s1,$s1,$t0    ; $s1 = $s1 * $t0 (truncated to 32 bits)

I understand this is a small case wherein any performance gain would be negligible, but I am wondering if the line of thinking I used was on the right track.

My professor pointed out that the mul instruction is complex (read: hardware intensive), thus (if implemented in hardware) any performance gains made by using less instructions would ultimately be lost due to that complexity. Is this true?

You may need to check MIPS reference to get the clock count for each case (besides, instead of multiplying with 0/1, you can AND with 0/-1 which is "less complex" and that night be what compiler do) — user202729, Sep 30 '21 at 01:23
Have you **actually profiled** your code? If not, then you cannot say (nor speculate) otherwise. I assume you used Godbolt? — Dai, Sep 30 '21 at 01:23
Profile is also possible, but only if you have some actual hardware. — user202729, Sep 30 '21 at 01:24
I do agree with your prof that `mul` (especially `div`) tends to be _relatively_ expensive in pretty-much every ISA out there (except for specific operands, that tend to be powers-of-2). — Dai, Sep 30 '21 at 01:24
Have you read these other QAs btw? https://stackoverflow.com/q/22736836/159145 https://stackoverflow.com/q/15183346/159145 — Dai, Sep 30 '21 at 01:28
@Dai They... looks irrelevant? OP already know how to compare and branch, and these doesn't concern the performance at all. — user202729, Sep 30 '21 at 01:30
@user202729 Yes, I agree they don't focus on performance, but they do touch upon some relevant things, like the branch-delay-slot (which is actually **very** relevant to this problem). — Dai, Sep 30 '21 at 01:32
**OP ( @user3670473 )** - Is there a reason why you're apparently mixing temporary (`$tN`) and subprogram (`$sN`) registers? Is your entire block of assembly intended to be a subprogram? — Dai, Sep 30 '21 at 02:03
@Dai It looks like the correct one? Usually (and "usually given in assignments") variables (a, b) are stored in $s and temporary are stored in $t. — user202729, Sep 30 '21 at 02:08
@Dai we have not entirely focused on, that I'm aware, the specific general uses for the registers (although I know the difference through my own reading). Also, the question presumed use of the `$sN` registers (it was a simple "convert this C code to MIPS" question). Regardless, I will look into the questions you sent. — user3670473, Sep 30 '21 at 02:37

score 3 · Accepted Answer · answered Sep 30 '21 at 02:38

Its impossible to say as you haven't specified a MIPS implementation, which there have been a lot of over the years.

However, multiplication can be more expensive, especially on earlier MIPS processors, where each could cost two (or three) cycles.

An alternative along the lines of unconditional control flow is to subtract one from the slt and use and instead of mul, later the xor with -1. That will cost one extra instruction over the mul version, so unclear as to which will be faster, but does involve "simpler" instructions.

    slt  $t0,$s1,$s0    ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
    subi  $t0, $t0, 1   ; 1/true->0, 0/false->-1
    and  $s1,$s1,$t0    ; $s1 &= $t0-1        mask is either 0 or -1
    xori $t0,$t0,-1     ; $t0 = ~ $t0
    and  $s0,$s0,$t0    ; $s1 &= ~ (t0-1)     mask is either -1 or 0

Whether this is faster than the conditional logic depends on the processor, and the data set, assuming this is in a loop (if not in a loop at some level, then the question is of lessor importance).

The data set will determine whether the branch prediction is effective or not. Each time the branch prediction is wrong, it will cost a couple of cycles (of course depending on the processor implementation). If the branch prediction is 100% effective, then the conditional logic would probably be best. But if the data set defies branch prediction then several cycles of overhead per execution might be incurred, so the unconditional logic might be best.

It would be nice if processors had a few extra opcodes so that the subtraction of 1 was unnecessary, then also the inversion for the else part. In such hypothetical case, we would have:

slt $t0, $s1, $s0
sameIfTrueOrZero  $s0, $s0, $t0   # $s0 = ($t0 ? $s0 : 0)
sameIfFalseOrZero $s1, $s1, $t0   # $s1 = ($t0 ? 0 : $s1)

These would be simple instructions for hardware to implement — they would not tax the ALU, or instruction and operand encodings.

Some more advanced processors may be able to fuse operations, or dual issue, on some of the above so that the execution cost might come down.

Thank you for this information! As far as I'm aware, for the course I am taking we are only merely emulating MIPS using the MARS program. We are not working with any actual hardware, only merely discussing it. I guess my question here is if I am going down the right track at all by considering the effects of branches in my assembly code, or if I should save those worries for only certain circumstances (and if so, what circumstances those would be)? — user3670473, Sep 30 '21 at 02:48
Conditional logic is problematic for modern processor designs. Game programmers, for example, will keep separate lists so that all object on a given list are subject to the same logic without condition (no virtual methods that choose different functions each time, no if then else that sometimes then and sometimes else). (of course, the same object can be on multiple lists..) Search terms are: "Entity Component System" and "Data Oriented Design", among others. — Erik Eidt, Sep 30 '21 at 02:57
*It would be nice if processors had a few extra opcodes* - Real MIPS since [MIPS IV](https://en.wikipedia.org/wiki/MIPS_architecture#MIPS_IV) in [1994](https://en.wikipedia.org/wiki/List_of_MIPS_architecture_processors) *does* have conditional move instructions: see how `gcc -O3 -march=mips32` uses `movn` and `movz` from `$zero`, controlled by the `slt` result, to implement this logic ahead of a tailcall in https://godbolt.org/z/MsoqTj5ro (just like clang does for x86-64). — Peter Cordes, Apr 12 '22 at 01:42

score 1 · Answer 2 · answered Apr 12 '22 at 07:58

Interesting and clever use of mul by 0 or 1, but if you have MIPS32 extensions like mul (not classic MIPS mult/mflo) then you also have movz / movn conditional moves. In fact those were introduced with MIPS IV (first commercial release in 1994, R8000), earlier than 1999's MIPS 4K (MIPS32 ISA) MIPS 5K (MIPS64 ISA).

movn and movz are very much like x86 cmov, AArch64 csel, and equivalent instructions on other ISAs. Since MIPS has a zero register, you can condition-move from it to conditionally zero something based on another register being zero (movz) or non-zero (movn).

That allows GCC to compile your logic into 3 instructions, when I put it in front of a tailcall that uses the values in their incoming registers.

int bar(int,int);

int foo(int a, int b){
    if (a <= b)
      a = 0;
    else
      b = 0;
    return bar(a, b);
}

Godbolt MIPS GCC 11.2 -O3 -march=mips4 -fno-delayed-branch

foo:
        slt     $2,$5,$4
        movn    $5,$0,$2
        movz    $4,$0,$2
        j       bar
        nop

This is pretty much unbeatable.

Note that in your proposed version, the dynamic instruction count is at worst 4 (no path of execution runs both all 5), and at best 3 (if the bne is taken). But it avoid branching which is good, and does save static code size.

(If you assemble for a real MIPS with an assembler that tries but fails to fill branch-delay slots here, the branchy version might get an extra NOP after one or both branches, widening the advantage of branchless.)

On a fast wide machine (like 4-wide MIPS R8000 or R10000 although those are MIPS IV CPUs with movn/z but not mul) it's certainly interesting to consider mul if you don't do this often; the front-end can decode these instructions and then get more into the pipeline while it chews on these. Modern x86 CPUs have fully pipelined integer multiply units with 3 cycle latency, 1 cycle throughput (so they can start a new multiply every clock cycle). But much older MIPS CPUs almost certainly didn't, likely not even fully pipelined, and probably higher latency. (This is why until MIPS32/64 they only had mult, which wrote the result to a special hi/lo registers, to decouple that logic from the regular integer pipeline.)

I don't know where the tradeoff might be in terms of how predictable the branch would have to be (by the older simpler predictors in old MIPS CPUs) before that would be better than mul, for some realistic historical mul latency/throughput numbers. MIPS pipelines are shorter than modern CPUs; it helps some that the ISA is literally designed from the ground up to pipeline easily. But the branch delay slot only really helps scalar CPUs.

With a slow enough mul (or mult / mfo), even worst-case branch prediction might be better.

See also Modern Microprocessors A 90-Minute Guide!

I didn't look at details of MIPS III CPUs to see if any had longer pipelines / superscalar or out-of-order exec which might favour less branchy code even without movn / movz

But as Erik showed in his answer, it's possible to just use bitwise operations even if you don't have MIPS IV movn/movz, so this consideration of mul only goes from purely hypothetical on MIPS IV to basically hypothetical on MIPS III CPUs.

(Or I guess if you have code that needs to be able to run on old MIPS CPUs, but is being run on a newer MIPS.)

Which sequence of instructions has better performance for zeroing one register or another?

2 Answers2