Your current code implicitly zero-extends: it's equivalent to add (%ebx,%esi,4), %eax / adc $0, %edx, but what you need to add to the upper half is 0 or -1, depending on the sign of the low half (i.e. 32 copies of the sign bit; see Sep's answer).
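For reference, a scalar version of that fix might look something like this (just a sketch; I've moved the 64-bit sum into %edi:%ebp and put the element count in %ecx, because cdq clobbers %edx:%eax, with %esi counting up from 0):

.scalar_loop:
mov (%ebx,%esi,4), %eax   # load one 32-bit element
cdq                       # sign-extend %eax into %edx:%eax
add %eax, %ebp            # add to the low half of the running sum
adc %edx, %edi            # add to the high half, plus the carry-out from the low half
inc %esi
cmp %ecx, %esi            # %ecx = number of elements (my choice of register)
jb .scalar_loop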
32-bit x86 can do 64-bit integer math directly using SSE2/AVX2/AVX512 paddq. (All 64-bit-capable CPUs support SSE2, so it's a reasonable baseline these days).
(The MMX form of paddq works on 64-bit mm registers, but it was also introduced with SSE2, so it doesn't help on pre-SSE2 CPUs like Pentium-MMX through Pentium III / AMD Athlon-XP.)
SSE4.1 makes sign-extending to 64-bit cheap.
pmovsxdq (%ebx), %xmm1 # load 2x 32-bit (Dword) elements, sign-extending into Qword elements
paddq %xmm1, %xmm0
add $8, %ebx
cmp / jb # loop while %ebx is below an end-pointer.
# preferably unroll by 2 so there's less loop overhead,
# and so it can run at 2 vectors per clock on SnB and Ryzen. (Multiple shuffle units and load ports)
# horizontal sum
pshufd $0b11101110, %xmm0, %xmm1 # xmm1 = [ hi | hi ]
paddq %xmm1, %xmm0 # xmm0 = [ lo + hi | hi + hi=garbage ]
# extract to integer registers or do a 64-bit store to memory.
movq %xmm0, (result)
I avoided an indexed addressing mode so the load can stay micro-fused with pmovsxdq on Sandybridge. Indexed is fine on Nehalem, Haswell or later, or on AMD.
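Unrolled by 2 with two accumulators, the inner loop might look something like this (a sketch: %xmm0 and %xmm3 are assumed zeroed before the loop, and the end-pointer register is my choice):

.loop:
pmovsxdq (%ebx), %xmm1    # first 2 elements, sign-extended to qwords
pmovsxdq 8(%ebx), %xmm2   # next 2 elements
paddq %xmm1, %xmm0        # two separate accumulators keep the
paddq %xmm2, %xmm3        #  loop-carried dependency chains short
add $16, %ebx
cmp %edx, %ebx            # %edx = end pointer (my choice of register)
jb .loop
paddq %xmm3, %xmm0        # combine accumulators, then do the horizontal sum as above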
Unfortunately there are CPUs without SSE4.1 still in service. In that case you might want to just use scalar, but you can also sign-extend manually with plain SSE2. There is no packed 64-bit arithmetic right shift, though (only logical shifts at 64-bit element size), but you can emulate cdq by making a copy, using a 32-bit arithmetic shift (psrad $31) to broadcast the sign bit across one of the two, then interleaving them with punpckldq/punpckhdq.
# prefer running this on aligned memory
# Most CPUs without SSE4.1 have slow movdqu
.loop:
movdqa (%ebx, %esi, 1), %xmm1 # 4x 32-bit elements
movdqa %xmm1, %xmm2
psrad $31, %xmm1 # xmm1 = high halves (broadcast sign bit to all bits with an arithmetic shift)
movdqa %xmm2, %xmm3 # copy the original elements again, before punpckldq overwrites %xmm2.
punpckldq %xmm1, %xmm2 # interleave low 2 elements -> sign-extended 64-bit
paddq %xmm2, %xmm0
punpckhdq %xmm1, %xmm3 # interleave hi 2 elements -> sign-extended 64-bit
paddq %xmm3, %xmm0
add $16, %esi
jnc .loop # %esi is a negative byte offset counting up toward zero, with %ebx pointing to the end of the array.
# end of loop body: 16 bytes (4 elements) per iteration
(Using two separate vector accumulators would likely be better than using two paddq into xmm0, to keep dependency chains shorter.)
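Here's a sketch of that variant; the only changes are the second accumulator (%xmm4 here, an arbitrary choice, zeroed before the loop along with %xmm0) and one extra paddq after the loop:

.loop:
movdqa (%ebx, %esi, 1), %xmm1
movdqa %xmm1, %xmm2
psrad $31, %xmm1            # sign words
movdqa %xmm2, %xmm3
punpckldq %xmm1, %xmm2      # low 2 elements, sign-extended to 64-bit
paddq %xmm2, %xmm0          # accumulator 1
punpckhdq %xmm1, %xmm3      # high 2 elements, sign-extended to 64-bit
paddq %xmm3, %xmm4          # accumulator 2: independent loop-carried chain
add $16, %esi
jnc .loop
paddq %xmm4, %xmm0          # combine the two accumulators after the loop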
This is more instructions than the SSE4.1 version, but it handles twice as many elements per iteration. It's still more instructions per paddq, but it's probably still better than scalar, especially on Intel CPUs before Broadwell where adc is 2 uops (because it has 3 inputs: 2 registers + EFLAGS).
It might be better to just copy %xmm1 twice before the first psrad. In the version above, on CPUs where movdqa has non-zero latency, I copied and then shifted the original, to shorten the critical path so out-of-order execution has less latency to hide. But that means the last punpck is reading the result of a chain of two movdqa register copies. That might be worse on CPUs whose mov-elimination doesn't work 100% of the time (Intel): a chain of register copies is one of the cases where mov-elimination doesn't work perfectly, so the copy might need a vector ALU uop after all.
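That alternative ordering would change just the start of the loop body, something like:

movdqa (%ebx, %esi, 1), %xmm1
movdqa %xmm1, %xmm2         # both copies come straight from the load result,
movdqa %xmm1, %xmm3         #  so neither punpck depends on a copy of a copy
psrad $31, %xmm1            # then turn the original into the sign words
punpckldq %xmm1, %xmm2
punpckhdq %xmm1, %xmm3
# ... paddq / index increment / jnc as before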