
I want to use a 64-bit register to control two 32-bit counters in a nested loop

I'm trying to control the counters with rotate instructions plus some XORs, but my problem is that when I subtract from ECX, the upper half of RCX becomes 0, and my outer counter lives in the upper half. I've also tried DEC on CL, but when the low byte reaches 0, the next DEC turns it into 0xFF.

xor rcx, rcx ; i and j
mov ecx, 1000 ; i

for_ext:
    rol rcx, 32 ; j
    or rcx, 1000
    for_int:

        <some code>

    ; dec ecx ; <- this zeroes the upper half
    ; sub cl, 1 ; <- this only partially works
    ; jnz for_int
    ; loop for_int ; <- this tests all of RCX, so it doesn't work

    rol rcx, 32
loop for_ext

Is there a way to do a DEC on ECX that doesn't affect the upper half?

Nefisto

3 Answers


This works:

                                        ;mov ecx,... clears upper bits of rcx
        mov     ecx,000000200h          ;run outer loop 200h times
main0:  rol     rcx,32
        or      rcx,000001000h          ;run inner loop 1000h times
main1:  nop
        dec     rcx
        test    ecx,ecx
        jnz     main1
        rol     rcx,32
        dec     rcx                     ;faster than loop
        jnz     main0
rcgldr
  • `loop` is quite slow (except on AMD Bulldozer/Ryzen), only use it if optimizing for code-size not speed. [Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?](//stackoverflow.com/q/35742570). Other than that, this looks "good" (but of course worse than using 2 registers, or memory for the outer counter, because it costs an extra uop inside the inner loop on SnB-family, which can macro-fuse dec/jnz into a single uop. It's not much worse on AMD where `dec` / `jnz` or `dec` / `test+jnz` are both 2 uops: fusion only for test/cmp + jcc) – Peter Cordes Apr 18 '19 at 00:26
  • @PeterCordes - I used a single register and loop trying to duplicate OP's question and example code as much as possible. It is possible that the inner loop is using all of the other registers and cache lines, which would mean using memory for the outer loop would involve a cache miss, but in all but these rare cases, your suggestions would be faster. – rcgldr Apr 18 '19 at 01:00
  • My main point was that a complete answer to the question shouldn't leave out mentioning that you normally *don't* want to do this, outside of silly computer tricks. I did come up with an answer that only has 1 extra instruction of overhead for the outer loop on top of a normal `sub/jcc`, though, vs. your 2x `ror`. – Peter Cordes Apr 18 '19 at 13:30
  • you just need `mov ecx, 000000200h` instead of the longer version with rcx. Same to `or` – phuclv Apr 18 '19 at 14:28
  • @phuclv - `or` uses 32 bit immediate. `mov` can use 64 bit immediate, but I thought it could also use 32 bit immediate. – rcgldr Apr 18 '19 at 15:36
  • NASM will optimize `mov rcx, 200h` to 5-byte `mov ecx, 200h`. But YASM won't modify the operand-size, and will emit 7-byte `mov r/m64, sign_extended_imm32`. Not 10-byte `mov r64, imm64`, but YASM and GAS need the programmer to know what operand-size to use for maximum efficiency. You do of course need 64-bit operand size for `or r/m64, imm32`, otherwise it would truncate RCX to ECX, @phuclv. – Peter Cordes Apr 19 '19 at 14:32

Thanks to @Jester and others, I arrived at this code:

segment .data
z dq 0

segment .text
global main

main:

xor rax, rax ; res
xor rcx, rcx ; i and y
mov ecx, 1000 ; i

for_ext:
    rol rcx, 32 ; y
    or rcx, 1000 ; avoid zeroing the upper half
    for_int:

        <some code>

    dec rcx
    cmp ecx, 0
    jnz for_int

    rol rcx, 32
loop for_ext

ret
Nefisto
  • Why would you use `loop`, though? Are you optimizing for code-size over speed? [Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?](//stackoverflow.com/q/35742570). Also, this is @rcgldr's answer plus some redundant init stuff (like a useless `xor rcx,rcx` before `mov ecx, 1000`) – Peter Cordes Apr 18 '19 at 12:35

Writing a 32-bit register always zeroes the upper 32 bits of the full 64-bit register. You could do this trick more easily with 16-bit halves of a 32-bit register, or especially the low 8 bits.

(In a code-golf problem, I once had a constant I only needed outside of loops, and its low 8 bits were all zero. I used ebx=-1024 outside of inner loops, and bl as my loop counter inside loops, ending with bl=0.)
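As a rough sketch of the 16-bit-halves idea, the program below keeps the outer counter in bits 31:16 of ECX and uses CX as the inner counter. The names `inner_count` and `outer_count`, the EAX tally, and the exit-status check are illustration-only assumptions, and the build line assumes NASM and `ld` on x86-64 Linux.

```nasm
; Assumed build: nasm -felf64 halves16.asm && ld -o halves16 halves16.o
; (halves16 is a hypothetical file name)
inner_count equ 300          ; hypothetical value, must be 1..65535
outer_count equ 20           ; hypothetical value, must be 1..65535

global _start
section .text
_start:
    xor   eax, eax                   ; tally of inner-loop iterations
    mov   ecx, outer_count << 16     ; outer counter in bits 31:16, CX = 0
.outer:
    mov   cx, inner_count            ; 16-bit write: bits 31:16 of ECX are preserved
.inner:
    inc   eax                        ; stand-in for the inner loop body
    dec   cx                         ; 16-bit dec: ZF comes from CX only
    jnz   .inner
    sub   ecx, 1 << 16               ; CX is 0 here, so ZF reflects the outer counter alone
    jnz   .outer

    xor   edi, edi
    cmp   eax, inner_count * outer_count
    setne dil                        ; exit status 0 if the tally matches, 1 otherwise
    mov   eax, 231                   ; Linux exit_group
    syscall
```

Because 16-bit writes merge into the old register value instead of zero-extending, no rotate is needed here. The trade-off is that both counters are limited to 16 bits and every 16-bit instruction carries an operand-size prefix.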

But normally it's better to just use another register, or keep the outer loop counter in stack memory. (Or spill some other rarely-used value, especially if it's read-mostly so you can just use it as a memory source operand.)
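For comparison, here is a minimal sketch of the conventional approach with the outer counter spilled to stack memory, so ECX is free to be a plain inner counter. The counts, the EAX tally, and the exit-status check are assumptions made up for the example, and the build line assumes NASM and `ld` on x86-64 Linux.

```nasm
; Assumed build: nasm -felf64 spill.asm && ld -o spill spill.o
; (spill is a hypothetical file name)
inner_count equ 1000         ; hypothetical values
outer_count equ 1000

global _start
section .text
_start:
    xor   eax, eax               ; tally of inner-loop iterations
    push  outer_count            ; outer counter lives in an 8-byte stack slot
.outer:
    mov   ecx, inner_count       ; fresh inner counter; writing ECX cannot disturb the slot
.inner:
    inc   eax                    ; stand-in for the inner loop body
    dec   ecx
    jnz   .inner
    dec   qword [rsp]            ; outer-- directly in memory
    jnz   .outer
    pop   rcx                    ; release the stack slot

    xor   edi, edi
    cmp   eax, inner_count * outer_count
    setne dil                    ; exit status 0 if the tally matches, 1 otherwise
    mov   eax, 231               ; Linux exit_group
    syscall
```

The memory operand is only touched once per outer iteration, so the inner loop stays a plain `dec` / `jnz` pair. If a spare register is available, replacing the stack slot with that register gives the same structure without the memory access.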

As Jester suggests, test the low 32 bits separately for the inner loop condition. (This costs 1 extra uop on Intel Sandybridge-family, where dec/jnz could macro-fuse. But 0 extra uops on AMD, or other Intel, where dec/jnz can't fuse but test/jnz can.)

For the outer loop, rcgldr has already proposed rotate before/after to swap 32-bit halves. (With an unfortunate choice of the slow loop instruction for no good reason.)

But we can reduce that to only 1 instruction of overhead beyond a sub/jcc that you'd normally have. If we treat the outer counter as signed 32-bit, and check for it becoming negative, we can do that check at the same time as re-creating the inner loop counter in ECX with the same sub rcx. (This means the initial counter value needs to be 1 lower, because we effectively stop at -1 instead of 0.)

A 32-bit sign-extended immediate isn't big enough to sub rcx, 1<<32, and (unless you need that constant for something else), if you're using 2 registers you're much better off using separate regs for separate counters. But with 2 subtracts, or actually an add of -(2^31), we can wrap the low 32 almost all the way around, subtracting 1 from the high half and leaving the count for the next inner loop in ECX.

inner_count equ 0x5678
outer_count equ 0x1234

global _start
_start:
    xor   eax, eax
    xor   edx, edx                ; test counters to prove this loops the right number of times


    mov   rcx,  ((outer_count-1)<<32) + inner_count

.outer:
 .inner:                ; do {
      ; ...   inner loop body
            inc  rax         ; instrumentation: inner++
    dec   rcx             ; rcx--
    test  ecx,ecx
    jnz   .inner        ; }while(ecx)
  ; ecx=0.  rcx=outer count << 32

    ;... outer loop body
            inc  rdx         ; instrumentation: outer++

    add   rcx, -1<<31   ; largest magnitude 32-bit immediate is INT_MIN, 0xFFFFFFFF80000000
    sub   rcx,  (1<<31) - inner_count    ; re-create the inner loop counter from 0 + INT_MIN
    jge   .outer

.end:   ; set a breakpoint on _start.end and look at registers


    mov   eax, 231
    syscall          ; Linux sys_exit_group(edi=0)

Final state: rdx = 0x1234, rax = 0x6260060 = 0x1234 * 0x5678, so these loops ran the correct number of times.

On Sandybridge-family, sub/jge can macro-fuse into a single uop. Even so, I think this has worse code size (2x sub r64, imm32), and ror rcx,32 is a single-uop instruction on Sandybridge-family and AMD (https://agner.org/optimize/). If your outer counter was in RAX, the short-form encoding with no ModRM byte could help.

This works for any unsigned inner count, from 1 to 2^32 - 1, and for any signed-positive outer count, from 1 to 2^31 - 1.

The inner count can't be 0 = 2^32, because that would require 2x add rcx, 0xFFFFFFFF80000000 to wrap all the way around. With one of the instructions being sub rcx,imm32, the largest positive number (without the high bits being set) we can subtract is 0x7fffffff.

This might also work with jnc if we use wrapping the top half as the loop-exit condition, allowing a full 2^32-1 range for the upper counter.


31-bit counter in the bottom of RCX, 33-bit counter in the top

The inner loop test becomes

dec   rcx
test  ecx, (1<<31)-1      ; test the low 31 bits for non-zero
jnz  .inner

The advantage here is that a single sub imm32 can wrap the inner counter to where we need it:

sub   rcx,  (1<<31) - inner_count    ; outer-- and re-create the inner loop counter
jnc   .outer

We still can't use jnz, because re-creating the inner count at the same time means the whole register won't be zero. So we have to branch on it becoming negative or having unsigned wraparound.
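Putting the two fragments above together, a complete test program might look like the sketch below. It assumes the inner counter occupies exactly the bits selected by the `(1<<31)-1` mask (bits 30:0), with the outer counter starting at bit 31. The counts are reused from the earlier example. The tallies in RAX/RDX and the exit-status check are illustration-only additions, and the build line assumes NASM and `ld` on x86-64 Linux.

```nasm
; Assumed build: nasm -felf64 split31.asm && ld -o split31 split31.o
; (split31 is a hypothetical file name)
inner_count equ 0x5678       ; must be 1 .. (1<<31)-1
outer_count equ 0x1234       ; must be >= 1

global _start
section .text
_start:
    xor   eax, eax                   ; inner-iteration tally
    xor   edx, edx                   ; outer-iteration tally
    mov   rcx, ((outer_count-1) << 31) + inner_count
.outer:
.inner:
    inc   rax                        ; stand-in for the inner loop body
    dec   rcx
    test  ecx, (1<<31)-1             ; are bits 30:0 non-zero?
    jnz   .inner
    inc   rdx                        ; stand-in for the outer loop body
    sub   rcx, (1<<31) - inner_count ; outer-- and reload the inner counter in one step
    jnc   .outer                     ; a borrow (CF=1) means the outer counter went below 0

    xor   edi, edi
    cmp   rax, inner_count * outer_count
    setne dil                        ; 1 if the inner tally is wrong
    cmp   rdx, outer_count
    setne sil                        ; 1 if the outer tally is wrong
    or    dil, sil                   ; exit status 0 only if both tallies match
    mov   eax, 231                   ; Linux exit_group
    syscall
```

As in the 32/32 version, the initial outer value is one less than the desired count, because the loop exits on the subtraction that borrows rather than on reaching zero.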

Peter Cordes