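Writing a 32-bit register always zeroes the upper 32 bits of the full 64-bit register, so you can't update the low half in place without destroying the high half. You could do this trick more easily with 16-bit halves of a 32-bit register, or especially the low 8 bits, because writing an 8-bit or 16-bit partial register leaves the rest of the register unmodified.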
(In a code-golf problem, I once had a constant I only needed outside of loops, and its low 8 bits were all zero. I used ebx=-1024 outside of inner loops, and bl as my loop counter inside loops, ending with bl=0.)
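As a rough illustration of why the narrow registers make this easy, here's a small Python model of the partial-register write rules. The helper names (write_r32, write_r16, write_r8, count_with_bl) are made up for this sketch, and plain integers stand in for registers; it only models the merge/zero-extend behaviour, not performance.

```python
MASK64 = (1 << 64) - 1

def write_r32(reg64, value):
    """Model a write to e.g. EBX: the old upper 32 bits of RBX are discarded."""
    return value & 0xFFFFFFFF

def write_r16(reg64, value):
    """Model a write to e.g. BX: bits 16..63 of RBX are preserved."""
    return (reg64 & ~0xFFFF & MASK64) | (value & 0xFFFF)

def write_r8(reg64, value):
    """Model a write to e.g. BL: bits 8..63 of RBX are preserved."""
    return (reg64 & ~0xFF & MASK64) | (value & 0xFF)

def count_with_bl(n):
    """Keep the constant -1024 in EBX and borrow BL as a down-counter (1 <= n <= 255)."""
    rbx = write_r32(0, -1024)        # mov ebx, -1024  (low 8 bits are zero)
    rbx = write_r8(rbx, n)           # mov bl, n
    iterations = 0
    while True:
        iterations += 1
        bl = (rbx - 1) & 0xFF        # dec bl
        rbx = write_r8(rbx, bl)
        if bl == 0:                  # jnz falls through when bl reaches 0
            break
    return rbx, iterations           # EBX is back to -1024 once bl == 0
```

Calling count_with_bl(5) in this model runs 5 iterations and leaves the register holding 0xFFFFFC00 again, which is the point of the trick: the constant is intact outside the loop.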
But normally it's better to just use another register, or keep the outer loop counter in stack memory. (Or spill some other rarely-used value, especially if it's read-mostly so you can just use it as a memory source operand.)
As Jester suggests, test the low 32 bits separately for the inner loop condition. (This costs 1 extra uop on Intel Sandybridge-family, where dec/jnz could macro-fuse. But 0 extra uops on AMD, or other Intel, where dec/jnz can't fuse but test/jnz can.)
For the outer loop, rcgldr has already proposed rotate before/after to swap 32-bit halves. (With an unfortunate choice of the loop instruction, which is slow on Intel CPUs, for no good reason.)
But we can reduce that to only 1 instruction of overhead beyond a sub/jcc that you'd normally have. If we treat the outer counter as signed 32-bit, and check for it becoming negative, we can do that check at the same time as re-creating the inner loop counter in ECX with the same sub rcx. (This means the initial counter value needs to be 1 lower, because we effectively stop at -1 instead of 0.)
A sign-extended 32-bit immediate isn't big enough for sub rcx, 1<<32, and putting that constant in a second register defeats the purpose: unless you need the constant for something else, if you're using 2 registers you're much better off using separate regs for separate counters. But with 2 subtracts, or actually an add of -(2^31) followed by a sub, we can wrap the low 32 almost all the way around, subtracting 1 from the high half and leaving the count for the next inner loop in ECX.
inner_count equ 0x5678
outer_count equ 0x1234
global _start
_start:
xor eax, eax
xor edx, edx ; test counters to prove this loops the right number of times
mov rcx, ((outer_count-1)<<32) + inner_count
.outer:
.inner: ; do {
; ... inner loop body
inc rax ; instrumentation: inner++
dec rcx ; rcx--
test ecx,ecx
jnz .inner ; }while(ecx)
; ecx=0. rcx=outer count << 32
;... outer loop body
inc rdx ; instrumentation: outer++
add rcx, -1<<31 ; largest magnitude 32-bit immediate is INT_MIN, sign-extended to 0xFFFFFFFF80000000
sub rcx, (1<<31) - inner_count ; re-create the inner loop counter from 0 + INT_MIN
jge .outer
.end: ; set a breakpoint on _start.end and look at registers
mov eax, 231
syscall ; Linux sys_exit_group(edi=0)
Final state: rdx = 0x1234, rax = 0x6260060 = 0x1234 * 0x5678, so these loops ran the correct number of times.
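If you want to check the arithmetic without single-stepping in a debugger, here's a minimal Python model of the same loop structure. It's only a sketch: run_packed_loops and to_signed64 are names invented here, the register is a Python int masked to 64 bits, and jge is modelled as a signed compare of the value before the final sub against its immediate. Use small counts, since simulating 0x1234 * 0x5678 iterations in Python would be slow.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """Reinterpret a 64-bit pattern as a signed two's-complement integer."""
    return x - (1 << 64) if x & (1 << 63) else x

def run_packed_loops(outer_count, inner_count):
    """Return (outer iterations, inner iterations) for counters packed into one 64-bit value."""
    rcx = (((outer_count - 1) << 32) + inner_count) & MASK64   # mov rcx, ...
    outer_iters = inner_iters = 0
    while True:
        while True:
            inner_iters += 1
            rcx = (rcx - 1) & MASK64                 # dec rcx
            if (rcx & 0xFFFFFFFF) == 0:              # test ecx,ecx / jnz .inner
                break
        outer_iters += 1
        rcx = (rcx - (1 << 31)) & MASK64             # add rcx, -1<<31
        imm = (1 << 31) - inner_count                # immediate of the final sub
        taken = to_signed64(rcx) >= imm              # jge: signed "before >= imm"
        rcx = (rcx - imm) & MASK64                   # sub rcx, (1<<31) - inner_count
        if not taken:
            break
    return outer_iters, inner_iters
```

For example, run_packed_loops(3, 5) gives 3 outer and 15 inner iterations in this model, matching what two separate counters would do.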
On Sandybridge-family, sub/jge can macro-fuse into a single uop. Even so, I think this has worse code-size (2x add/sub r64, imm32) than the rotate version, and ror rcx,32 is a single-uop instruction on Sandybridge-family and AMD (https://agner.org/optimize/). If your counter was in RAX, the short-form encoding with no ModRM byte could help.
This works for any unsigned inner count, from 1 to 2^32 - 1, and for any signed-positive outer count, from 1 to 2^31 - 1.
The inner count can't be 0 = 2^32, because that would require 2x add rcx, 0xFFFFFFFF80000000 to wrap all the way around. With one of the instructions being sub rcx,imm32, the largest positive number (without the high bits being set) we can subtract is 0x7fffffff.
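The range limits come straight from what fits in a sign-extended imm32. A quick way to convince yourself, as a sketch with made-up helper names (fits_imm32, reload_immediate):

```python
def fits_imm32(value):
    """True if value is encodable as a sign-extended 32-bit immediate."""
    return -(1 << 31) <= value <= (1 << 31) - 1

def reload_immediate(inner_count):
    """Immediate used by the final `sub rcx, (1<<31) - inner_count`."""
    return (1 << 31) - inner_count

# The two ends of the supported inner-count range both encode:
smallest = reload_immediate(1)              # 0x7fffffff, the largest positive imm32
largest = reload_immediate((1 << 32) - 1)   # -0x7fffffff, i.e. an add in disguise
# An inner count of 0 (meaning 2^32) would need to subtract another full 2^31:
wraps_fully = reload_immediate(0)           # 0x80000000, not a positive imm32
# And a single-instruction version would need this, which is far too big:
one_shot = 1 << 32
```

Here fits_imm32 is true for smallest and largest, and false for wraps_fully and one_shot, which is why the two-instruction add/sub pair covers inner counts 1 to 2^32 - 1 but not 0.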
This might also work with jnc if we use wrapping the top half as the loop-exit condition, allowing a full 2^32-1 range for the upper counter.
31-bit counter in the bottom of RCX, 33-bit counter in the top
The inner loop test becomes
dec rcx
test ecx, (1<<31)-1 ; test the low 31 bits for non-zero
jnz .inner
The advantage here is that a single sub imm32 can wrap the inner counter to where we need it:
sub rcx, (1<<31) - inner_count ; outer-- and re-create the inner loop counter
jnc .outer
We still can't use jnz, because re-creating the inner count at the same time means the whole register won't be zero. So we have to branch on it becoming negative or having unsigned wraparound.
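Here's a hedged Python model of this split-field variant, taking the field width from the (1<<31) used in the snippets above. The function name run_split_counter is invented for this sketch, and the initial value ((outer_count-1) << low_bits) + inner_count is my assumption by analogy with the first version, since it isn't spelled out. jnc is modelled as "no unsigned borrow", i.e. the value before the sub was at least as large as the immediate.

```python
MASK64 = (1 << 64) - 1

def run_split_counter(outer_count, inner_count, low_bits=31):
    """Return (outer iterations, inner iterations) for a low_bits-wide inner field."""
    low_mask = (1 << low_bits) - 1
    # Assumed initial value: outer count minus 1 above the inner field.
    rcx = (((outer_count - 1) << low_bits) + inner_count) & MASK64
    outer_iters = inner_iters = 0
    while True:
        while True:
            inner_iters += 1
            rcx = (rcx - 1) & MASK64            # dec rcx
            if (rcx & low_mask) == 0:           # test ecx, (1<<31)-1 / jnz .inner
                break
        outer_iters += 1
        imm = (1 << low_bits) - inner_count     # sub rcx, (1<<31) - inner_count
        no_borrow = rcx >= imm                  # CF=0 means the outer field didn't wrap
        rcx = (rcx - imm) & MASK64
        if not no_borrow:                       # jnc .outer falls through on borrow
            break
    return outer_iters, inner_iters
```

With run_split_counter(3, 5) the model runs 3 outer and 15 inner iterations. Note that the register is non-zero after every non-final reload (it holds inner_count in the low field), which is the reason jnz on the whole register can't be the exit test.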