Your current code implicitly zero-extends: it's equivalent to add (%ebx,%esi,4), %eax / adc $0, %edx, but what you need to add to the upper half is 0 or -1, depending on the sign of the low half (i.e. 32 copies of the sign bit; see Sep's answer).
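For reference, a scalar version of that fix might look something like this (just a sketch; I've moved the 64-bit sum into %edi:%ebp and put the element count in %ecx, because cdq clobbers %edx:%eax, with %esi counting up from 0):

.scalar_loop:
mov (%ebx,%esi,4), %eax   # load one 32-bit element
cdq                       # sign-extend %eax into %edx:%eax
add %eax, %ebp            # add to the low half of the running sum
adc %edx, %edi            # add to the high half, plus the carry-out from the low half
inc %esi
cmp %ecx, %esi            # %ecx = number of elements (my choice of register)
jb .scalar_loop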
32-bit x86 can do 64-bit integer math directly using SSE2/AVX2/AVX512 paddq. (All 64-bit-capable CPUs support SSE2, so it's a reasonable baseline these days).
(The MMX form of paddq works on 64-bit mm registers, but it was also introduced with SSE2, so it doesn't help on pre-SSE2 CPUs like Pentium-MMX through Pentium III / AMD Athlon-XP.)
SSE4.1 makes sign-extending to 64-bit cheap.
pmovsxdq (%ebx), %xmm1 # load 2x 32-bit (Dword) elements, sign-extending into Qword elements
paddq %xmm1, %xmm0
add $8, %ebx
cmp / jb # loop while %ebx is below an end-pointer.
# preferably unroll by 2 so there's less loop overhead,
# and so it can run at 2 vectors per clock on SnB and Ryzen. (Multiple shuffle units and load ports)
# horizontal sum
pshufd $0b11101110, %xmm0, %xmm1 # xmm1 = [ hi | hi ]
paddq %xmm1, %xmm0 # xmm0 = [ lo + hi | hi + hi=garbage ]
# extract to integer registers or do a 64-bit store to memory.
movq %xmm0, (result)
I avoided an indexed addressing mode so the load can stay micro-fused with pmovsxdq on Sandybridge. Indexed is fine on Nehalem, Haswell or later, or on AMD.
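Unrolled by 2 with two accumulators, the inner loop might look something like this (a sketch: %xmm0 and %xmm3 are assumed zeroed before the loop, and the end-pointer register is my choice):

.loop:
pmovsxdq (%ebx), %xmm1    # first 2 elements, sign-extended to qwords
pmovsxdq 8(%ebx), %xmm2   # next 2 elements
paddq %xmm1, %xmm0        # two separate accumulators keep the
paddq %xmm2, %xmm3        #  loop-carried dependency chains short
add $16, %ebx
cmp %edx, %ebx            # %edx = end pointer (my choice of register)
jb .loop
paddq %xmm3, %xmm0        # combine accumulators, then do the horizontal sum as above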
Unfortunately there are CPUs without SSE4.1 still in service. In that case you might want to just use scalar, but you can also sign-extend manually with plain SSE2. There is no packed 64-bit arithmetic right shift, though (only logical shifts at 64-bit element size), but you can emulate cdq by making a copy, using a 32-bit arithmetic shift (psrad $31) to broadcast the sign bit across one of the two, then interleaving them with punpckldq/punpckhdq.
# prefer running this on aligned memory
# Most CPUs without SSE4.1 have slow movdqu
.loop:
movdqa (%ebx, %esi, 1), %xmm1 # 4x 32-bit elements
movdqa %xmm1, %xmm2
psrad $31, %xmm1 # xmm1 = high halves (broadcast sign bit to all bits with an arithmetic shift)
movdqa %xmm2, %xmm3 # copy the original elements again, before punpckldq overwrites %xmm2.
punpckldq %xmm1, %xmm2 # interleave low 2 elements -> sign-extended 64-bit
paddq %xmm2, %xmm0
punpckhdq %xmm1, %xmm3 # interleave hi 2 elements -> sign-extended 64-bit
paddq %xmm3, %xmm0
add $16, %esi
jnc .loop # %esi is a negative byte offset counting up toward zero, with %ebx pointing to the end of the array.
# end of loop body: 16 bytes (4 elements) per iteration
(Using two separate vector accumulators would likely be better than using two paddq into xmm0, to keep dependency chains shorter.)
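Here's a sketch of that variant; the only changes are the second accumulator (%xmm4 here, an arbitrary choice, zeroed before the loop along with %xmm0) and one extra paddq after the loop:

.loop:
movdqa (%ebx, %esi, 1), %xmm1
movdqa %xmm1, %xmm2
psrad $31, %xmm1            # sign words
movdqa %xmm2, %xmm3
punpckldq %xmm1, %xmm2      # low 2 elements, sign-extended to 64-bit
paddq %xmm2, %xmm0          # accumulator 1
punpckhdq %xmm1, %xmm3      # high 2 elements, sign-extended to 64-bit
paddq %xmm3, %xmm4          # accumulator 2: independent loop-carried chain
add $16, %esi
jnc .loop
paddq %xmm4, %xmm0          # combine the two accumulators after the loop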
This is more instructions than the SSE4.1 version, but it handles twice as many elements per iteration. It's still more instructions per paddq, but it's probably still better than scalar, especially on Intel CPUs before Broadwell where adc is 2 uops (because it has 3 inputs: 2 registers + EFLAGS).
It might be better to just copy %xmm1 twice before the first psrad. In the version above, on CPUs where movdqa has non-zero latency, I copied and then shifted the original, to shorten the critical path so out-of-order execution has less latency to hide. But that means the last punpck is reading the result of a chain of two movdqa register copies. That might be worse on CPUs whose mov-elimination doesn't work 100% of the time (Intel): a chain of register copies is one of the cases where mov-elimination doesn't work perfectly, so the copy might need a vector ALU uop after all.
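That alternative ordering would change just the start of the loop body, something like:

movdqa (%ebx, %esi, 1), %xmm1
movdqa %xmm1, %xmm2         # both copies come straight from the load result,
movdqa %xmm1, %xmm3         #  so neither punpck depends on a copy of a copy
psrad $31, %xmm1            # then turn the original into the sign words
punpckldq %xmm1, %xmm2
punpckhdq %xmm1, %xmm3
# ... paddq / index increment / jnc as before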