Just for fun, here are a couple other sequences for storing a hex string from a register. Printing the buffer is not the interesting part, IMO; copy that part from Michael's excellent answer if needed.
I tested some of these. I've included a main that calls one of these functions and then uses printf("%s\n%lx\n", result, test_value); to make it easy to spot problems.
Test main():
extern printf
global main
main:
push rbx
mov rdi, 0x1230ff56dcba9911
mov rbx, rdi
sub rsp, 32
mov rsi, rsp
mov byte [rsi+16], 0
call register_to_hex_ssse3
mov rdx, rbx
mov edi, fmt
mov rsi, rsp
xor eax,eax
call printf
add rsp, 32
pop rbx
ret
section .rodata
fmt: db `%s\n%lx\n`, 0 ; YASM doesn't support `string with escapes`, so this only assembles with NASM.
; NASM needs
; %use smartalign
; ALIGNMODE p6, 32
; or similar, to stop it using braindead repeated single-byte NOPs for ALIGN
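To build and run the test, something like nasm -felf64 hex.asm && gcc -no-pie -o hex hex.o && ./hex should work (hex.asm is a placeholder filename; -no-pie is needed because the test main uses 32-bit absolute addresses like mov edi, fmt).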
SSSE3 pshufb for the LUT
This version doesn't need a loop, but the code size is much larger than the rotate-loop versions because SSE instructions are longer.
section .rodata
ALIGN 16
hex_digits:
hex_xlat: db "0123456789abcdef"
section .text
;; rdi = val rsi = buffer
ALIGN 16
global register_to_hex_ssse3
register_to_hex_ssse3: ;;;; 0x39 bytes of code
;; use PSHUFB to do 16 nibble->ASCII LUT lookups in parallel
movaps xmm5, [rel hex_digits]
;; x86 is little-endian, but we want the hex digit for the high nibble to be the first character in the string
;; so reverse the bytes, and later unpack nibbles like [ LO HI ... LO HI ]
bswap rdi
movq xmm1, rdi
;; generate a constant on the fly, rather than loading
;; this is a bit silly: since we already load the LUT, we might as well load another 16B constant from the same cache line and use it as a memory operand for PAND, since we only use the mask once
pcmpeqw xmm4,xmm4
psrlw xmm4, 12
packuswb xmm4,xmm4 ; [ 0x0f 0x0f 0x0f ... ] mask for low-nibble of each byte
movdqa xmm0, xmm1 ; xmm0 = low nibbles at the bottom of each byte
psrlw xmm1, 4 ; xmm1 = high nibbles at the bottom of each byte (with garbage from next byte)
punpcklbw xmm1, xmm0 ; unpacked nibbles (with garbage in the high 4b of some bytes)
pand xmm1, xmm4 ; mask off the garbage bits: pshufb zeroes an output byte when the MSB of its index byte is set. Delaying this until after interleaving the hi and lo nibbles means we only need one pand
pshufb xmm5, xmm1 ; xmm5 = the hex digit for each nibble index in xmm1
movups [rsi], xmm5
ret
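(For comparison, the memory-operand version of that mask could look something like this untested sketch: put a hypothetical low_nibble_mask right after hex_xlat in .rodata and drop the three constant-generation instructions.)
low_nibble_mask: times 16 db 0x0f ; in .rodata, right after hex_xlat
...
pand xmm1, [rel low_nibble_mask] ; instead of pcmpeqw/psrlw/packuswb + pand xmm1, xmm4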
AVX2: you can do two integers at once, with something like
int64x2_to_hex_avx2: ; (char buf[32], uint64_t first, uint64_t second)
bswap rsi ; We could replace the two bswaps with one 256b vpshufb, but that would require a mask
vmovq xmm1, rsi
bswap rdx
vpinsrq xmm1, xmm1, rdx, 1
vpmovzxbw ymm1, xmm1 ; upper lane = rdx, lower lane = rsi, with each byte zero-extended to a word element
vpsllw ymm0, ymm1, 12 ; shift the high nibbles out, leaving the low nibbles at the top of each word
vpor ymm0, ymm0, ymm1 ; merge while hi and lo elements both need the same shift
vpsrlw ymm0, ymm0, 4 ; low nibbles in elems 1, 3, 5, ...
; high nibbles in elems 0, 2, 4, ...
vbroadcasti128 ymm5, [rel hex_digits] ; the 16B LUT in both lanes, for a per-lane 256b vpshufb
vpshufb ymm0, ymm5, ymm0 ; nibble indices -> ASCII hex digits
vmovups [rdi], ymm0 ; 32 chars: digits of first, then second
vzeroupper
ret
Using pmovzx and shifts to avoid the pand is a win compared to generating the constant on the fly, I think, but probably not compared to just loading it. It takes 2 extra shifts and a por. It's an option for the 16B non-AVX version too, but pmovzxbw is SSE4.1 (the pshufb version above only needs SSSE3).
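For reference, a minimal untested sketch of that SSE4.1 variant (hypothetical name register_to_hex_sse41; same args as before: rdi = val, rsi = buffer; reuses hex_xlat):
global register_to_hex_sse41
register_to_hex_sse41:
bswap rdi ; most-significant byte first, as in the SSSE3 version
movq xmm1, rdi
pmovzxbw xmm1, xmm1 ; SSE4.1: 8 bytes -> 8 zero-extended words, each 0x00HL
movdqa xmm0, xmm1
psllw xmm0, 12 ; low nibble to the top of each word (high nibble shifted out): 0xL000
por xmm0, xmm1 ; word = 0xL0HL: both nibbles now need the same shift
psrlw xmm0, 4 ; word = 0x0L0H: high-nibble index in the even byte, low-nibble index in the odd byte
movdqa xmm5, [rel hex_xlat]
pshufb xmm5, xmm0 ; nibble indices -> ASCII hex digits
movups [rsi], xmm5
ret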
Optimized for code-size (fits in 32 (0x20) bytes)
(Derived from Frank's loop)
Using cmov instead of the LUT to handle 0-9 vs. a-f might take fewer than 16B of extra code size. That might be fun: edits welcome.
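(For the record, here's a rough, untested sketch of the cmov idea, converting a nibble already isolated in eax into ASCII without the LUT; see the next paragraph for getting the nibble there.)
lea edx, [rax + 'a' - 10] ; candidate digit if the nibble is a-f (3B)
or al, '0' ; candidate digit if it's 0-9 (2B)
cmp al, '9' ; above '9' means the nibble was >= 10 (2B)
cmova eax, edx ; pick the a-f encoding in that case (3B)
stosb
If I've counted the encodings right, that's 10B in the loop body instead of the 3B table load, but it drops the 7B lea and the 16B table in .rodata.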
The ways to get a nibble from the bottom of rsi into an otherwise-zeroed rax include:
mov al, sil (3B (REX required for sil)) / and al, 0x0f (2B special encoding for and al, imm8).
mov eax, esi (2B) / and eax, 0x0f (3B): same size and doesn't require an xor beforehand to zero the upper bytes of rax.
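(For reference, these are the encodings NASM should produce for the two sequences:)
mov al, sil ; 40 88 f0 (REX prefix needed to address sil)
and al, 0x0f ; 24 0f (the short AL, imm8 form)
; vs.
mov eax, esi ; 89 f0
and eax, 0x0f ; 83 e0 0f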
The function would be smaller if the args were reversed, so the dest buffer was already in rdi. stosb is a tiny instruction (but slower than mov [rdi], al / inc rdi), so it actually saved bytes overall to use xchg rdi, rsi to set up for it. Changing the function signature could save 5 bytes: void reg_to_hex(char buf[16], uint64_t val) would save two bytes from not having to return buf in rax, and 3 bytes from dropping the xchg. The caller probably keeps the buffer in 16B of stack space anyway, and mov rdx, rsp is no bigger than mov rdx, rax before calling another function / making a syscall on the buffer, so returning the pointer in rax doesn't really buy the caller anything.
The next function is probably going to ALIGN 16, though, so shrinking the function to even smaller than 32B isn't as useful as getting it inside half a cache-line.
Absolute addressing for the LUT (hex_xlat) would save a few bytes
(use mov al, byte [hex_xlat + rax] instead of needing the lea).
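(Roughly: the RIP-relative setup is a 7B lea plus a 3B load per use, while the absolute form is a single 7B instruction with no setup:)
lea rdx, [rel hex_xlat] ; 48 8d 15 rel32 = 7B, once
mov al, byte [rdx+rax] ; 8a 04 02 = 3B per use
; vs. non-PIC:
mov al, byte [hex_xlat + rax] ; 8a 04 05 disp32 = 7B, no setup needed
Anyway, here it is: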
global register_to_hex_size
register_to_hex_size:
push rsi ; pushing/popping return value (instead of mov rax, rsi) frees up rax for stosb
xchg rdi, rsi ; allows stosb. Better: remove this and change the function signature
mov cl, 16 ; 3B shorter than mov ecx, 16
lea rdx, [rel hex_xlat]
;ALIGN 16
.loop:
rol rsi, 4
mov eax, esi ; mov al, sil (to allow the 2B and al, 0xf) would need a 2B xor eax,eax first
and eax, 0x0f
mov al, byte [rdx+rax]
stosb
;; loop .loop ; setting up ecx instead of cl takes more bytes than loop saves
dec cl
jne .loop
pop rax ; get the return value back off the stack
ret
Using xlat costs 2B (to save/restore rbx), but saves 3B, for a net savings of 1B. It's a 3-uop instruction, with 7c latency, one per 2c throughput (Intel Skylake). The latency and throughput aren't a problem here, since each iteration is a separate dependency chain, and there's too much overhead for this to run at one clock per iteration anyway. So the main problem is that it's 3 uops, making it less uop-cache-friendly. With xlat, the loop becomes 10 uops instead of 8 (using stosb), so that sucks.
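For reference, a full untested sketch of the xlat variant (hypothetical name register_to_hex_xlat); its loop body is what disassembles to the first listing below:
global register_to_hex_xlat
register_to_hex_xlat:
push rbx ; xlat needs the table pointer in rbx, which is call-preserved
push rsi ; return value, as before
xchg rdi, rsi
mov cl, 16
lea rbx, [rel hex_xlat]
.loop:
rol rsi, 4
mov eax, esi
and al, 0x0f ; the 2B form is fine here: xlat only reads al as the index
xlatb ; al = [rbx + al]
stosb
dec cl
jne .loop
pop rax
pop rbx
ret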
112: 89 f0 mov eax,esi
114: 24 0f and al,0xf
116: d7 xlat BYTE PTR ds:[rbx]
117: aa stos BYTE PTR es:[rdi],al
vs.
f1: 89 f0 mov eax,esi
f3: 83 e0 0f and eax,0xf
f6: 8a 04 02 mov al,BYTE PTR [rdx+rax*1]
f9: aa stos BYTE PTR es:[rdi],al
Interestingly, this still has no partial-register stalls, because we never read a wide register after writing only part of it. mov eax, esi is write-only, so it cleans up the partial-reg-ness from the load into al. So there would be no advantage to using movzx eax, byte [rdx+rax]. Even when we return to the caller, the pop rax doesn't leave the caller susceptible to partial-reg problems.
(If we don't bother returning the input pointer in rax, then the caller could have a problem. Except in that case it shouldn't be reading rax at all. Usually it only matters if you call with call-preserved registers in a partial-reg state, because the called function might push them. Or more obviously, with arg-passing / return-value registers.)
Efficient version (uop-cache friendly)
Looping backwards didn't turn out to save any instructions or bytes, but I've included this version because it differs more from the version in Frank's answer.
ALIGN 16
global register_to_hex_countdown
register_to_hex_countdown:
;;; work backwards in the buffer, starting with the least-significant nibble as the last char
mov rax, rsi ; return value, and loop bound
add rsi, 15 ; last char of the buffer
lea rcx, [rel hex_xlat] ; position-independent code
ALIGN 16
.loop:
mov edx, edi
and edx, 0x0f ; isolate low nibble
mov dl, byte [rcx+rdx] ; look up the ascii encoding for the hex digit
; rdx is an 'index' with range 0x0 - 0xf
; non-PIC version: mov dl, [hex_digits + rdx]
mov byte [rsi], dl
shr rdi, 4
dec rsi
cmp rsi, rax
jae .loop ; rsi counts backwards down to its initial value
ret
The whole thing is only 12 insns (11 uops with macro-fusion, or 12 including the NOP for alignment). Some CPUs can fuse cmp/jcc but not dec/jcc (e.g. AMD, and Intel Nehalem).
Another option for looping backwards was mov ecx, 15, and store with mov [rsi+rcx], dl, but two-register addressing modes can't micro-fuse. Still, that would only bring the loop up to 8 uops, so it would be fine.
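That alternative might look something like this untested sketch (the LUT pointer moves to r8 since rcx is now the loop counter):
global register_to_hex_index
register_to_hex_index:
mov rax, rsi ; return value
mov ecx, 15 ; index of the last char
lea r8, [rel hex_xlat]
.loop:
mov edx, edi
and edx, 0x0f
mov dl, byte [r8+rdx]
mov byte [rsi+rcx], dl ; indexed store: can't stay micro-fused on SnB-family
shr rdi, 4
dec ecx
jns .loop ; 15 down to 0, exit when ecx goes negative
ret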
Instead of always storing 16 digits, the countdown version could use rdi becoming zero as the loop condition to avoid printing leading zeros. i.e.
add rsi, 16
...
.loop:
...
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop
; no lea rax, [rsi+1] correction needed: rsi is decremented before the store, so it already points at the most-significant digit stored
mov rax, rsi
ret
Printing from rax to the end of the buffer gives just the significant digits of the integer.
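For reference, here's that variant written out in full as an untested sketch (same args as before: rdi = val, rsi = buffer; hypothetical name register_to_hex_nolz):
global register_to_hex_nolz
register_to_hex_nolz:
add rsi, 16 ; start one past the end of the buffer
lea rcx, [rel hex_xlat]
.loop:
mov edx, edi
and edx, 0x0f
mov dl, byte [rcx+rdx]
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop ; stop once the remaining value is zero (a zero input still stores one '0')
mov rax, rsi ; rax = first significant digit; the buffer ends at the original rsi + 16
ret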