Just for fun, here are a couple other sequences for storing a hex string from a register. Printing the buffer is not the interesting part, IMO; copy that part from Michael's excellent answer if needed.
I tested some of these. I've included a main that calls one of these functions and then uses printf("%s\n%lx\n", result, test_value); to make it easy to spot problems.
Test main():
extern printf
global main
main:
push rbx
mov rdi, 0x1230ff56dcba9911
mov rbx, rdi
sub rsp, 32
mov rsi, rsp
mov byte [rsi+16], 0
call register_to_hex_ssse3
mov rdx, rbx
mov edi, fmt
mov rsi, rsp
xor eax,eax
call printf
add rsp, 32
pop rbx
ret
section .rodata
fmt: db `%s\n%lx\n`, 0 ; YASM doesn't support `string with escapes`, so this only assembles with NASM.
; NASM needs
; %use smartalign
; ALIGNMODE p6, 32
; or similar, to stop it using braindead repeated single-byte NOPs for ALIGN
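To build and run the test, something like nasm -felf64 hex.asm && gcc -no-pie -o hex hex.o && ./hex should work (hex.asm is a placeholder filename; -no-pie is needed because the test main uses 32-bit absolute addresses like mov edi, fmt).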
SSSE3 pshufb for the LUT
This version doesn't need a loop, but the code size is much larger than the rotate-loop versions because SSE instructions are longer.
section .rodata
ALIGN 16
hex_digits:
hex_xlat: db "0123456789abcdef"
section .text
;; rdi = val rsi = buffer
ALIGN 16
global register_to_hex_ssse3
register_to_hex_ssse3: ;;;; 0x39 bytes of code
;; use PSHUFB to do 16 nibble->ASCII LUT lookups in parallel
movaps xmm5, [rel hex_digits]
;; x86 is little-endian, but we want the hex digit for the high nibble to be the first character in the string
;; so reverse the bytes, and later unpack nibbles like [ LO HI ... LO HI ]
bswap rdi
movq xmm1, rdi
;; generate a constant on the fly, rather than loading
;; this is a bit silly: since we already load the LUT, we might as well load another 16B constant from the same cache line and use it as a memory operand for PAND, since we only use the mask once
pcmpeqw xmm4,xmm4
psrlw xmm4, 12
packuswb xmm4,xmm4 ; [ 0x0f 0x0f 0x0f ... ] mask for low-nibble of each byte
movdqa xmm0, xmm1 ; xmm0 = low nibbles at the bottom of each byte
psrlw xmm1, 4 ; xmm1 = high nibbles at the bottom of each byte (with garbage from next byte)
punpcklbw xmm1, xmm0 ; unpacked nibbles (with garbage in the high 4b of some bytes)
pand xmm1, xmm4 ; mask off the garbage bits: pshufb zeroes an output byte when the MSB of its index byte is set. Delaying this until after interleaving the hi and lo nibbles means we only need one pand
pshufb xmm5, xmm1 ; xmm5 = the hex digit for each nibble index in xmm1
movups [rsi], xmm5
ret
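(For comparison, the memory-operand version of that mask could look something like this untested sketch: put a hypothetical low_nibble_mask right after hex_xlat in .rodata and drop the three constant-generation instructions.)
low_nibble_mask: times 16 db 0x0f ; in .rodata, right after hex_xlat
...
pand xmm1, [rel low_nibble_mask] ; instead of pcmpeqw/psrlw/packuswb + pand xmm1, xmm4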
AVX2: you can do two integers at once, with something like
int64x2_to_hex_avx2: ; (char buf[32], uint64_t first, uint64_t second)
bswap rsi ; We could replace the two bswaps with one 256b vpshufb, but that would require a mask
vmovq xmm1, rsi
bswap rdx
vpinsrq xmm1, xmm1, rdx, 1
vpmovzxbw ymm1, xmm1 ; upper lane = rdx, lower lane = rsi, with each byte zero-extended to a word element
vpsllw ymm0, ymm1, 12 ; shift the high nibbles out, leaving the low nibbles at the top of each word
vpor ymm0, ymm0, ymm1 ; merge while hi and lo elements both need the same shift
vpsrlw ymm0, ymm0, 4 ; low nibbles in elems 1, 3, 5, ...
; high nibbles in elems 0, 2, 4, ...
vbroadcasti128 ymm5, [rel hex_digits] ; the 16B LUT in both lanes, for a per-lane 256b vpshufb
vpshufb ymm0, ymm5, ymm0 ; nibble indices -> ASCII hex digits
vmovups [rdi], ymm0 ; 32 chars: digits of first, then second
vzeroupper
ret
Using pmovzx and shifts to avoid the pand is a win compared to generating the constant on the fly, I think, but probably not compared to just loading it. It takes 2 extra shifts and a por. It's an option for the 16B non-AVX version too, but pmovzxbw is SSE4.1 (the pshufb version above only needs SSSE3).
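For reference, a minimal untested sketch of that SSE4.1 variant (hypothetical name register_to_hex_sse41; same args as before: rdi = val, rsi = buffer; reuses hex_xlat):
global register_to_hex_sse41
register_to_hex_sse41:
bswap rdi ; most-significant byte first, as in the SSSE3 version
movq xmm1, rdi
pmovzxbw xmm1, xmm1 ; SSE4.1: 8 bytes -> 8 zero-extended words, each 0x00HL
movdqa xmm0, xmm1
psllw xmm0, 12 ; low nibble to the top of each word (high nibble shifted out): 0xL000
por xmm0, xmm1 ; word = 0xL0HL: both nibbles now need the same shift
psrlw xmm0, 4 ; word = 0x0L0H: high-nibble index in the even byte, low-nibble index in the odd byte
movdqa xmm5, [rel hex_xlat]
pshufb xmm5, xmm0 ; nibble indices -> ASCII hex digits
movups [rsi], xmm5
ret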
Optimized for code-size (fits in 32 (0x20) bytes)
(Derived from Frank's loop)
Using cmov instead of the LUT to handle 0-9 vs. a-f might take fewer than 16B of extra code size. That might be fun: edits welcome.
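(For the record, here's a rough, untested sketch of the cmov idea, converting a nibble already isolated in eax into ASCII without the LUT; see the next paragraph for getting the nibble there.)
lea edx, [rax + 'a' - 10] ; candidate digit if the nibble is a-f (3B)
or al, '0' ; candidate digit if it's 0-9 (2B)
cmp al, '9' ; above '9' means the nibble was >= 10 (2B)
cmova eax, edx ; pick the a-f encoding in that case (3B)
stosb
If I've counted the encodings right, that's 10B in the loop body instead of the 3B table load, but it drops the 7B lea and the 16B table in .rodata.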
The ways to get a nibble from the bottom of rsi into an otherwise-zeroed rax include:
mov al, sil (3B (REX required for sil)) / and al, 0x0f (2B special encoding for and al, imm8).
mov eax, esi (2B) / and eax, 0x0f (3B): same size and doesn't require an xor beforehand to zero the upper bytes of rax.
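(For reference, these are the encodings NASM should produce for the two sequences:)
mov al, sil ; 40 88 f0 (REX prefix needed to address sil)
and al, 0x0f ; 24 0f (the short AL, imm8 form)
; vs.
mov eax, esi ; 89 f0
and eax, 0x0f ; 83 e0 0f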
The function would be smaller if the args were reversed, so the dest buffer was already in rdi. stosb is a tiny instruction (but slower than mov [rdi], al / inc rdi), so it actually saved bytes overall to use xchg rdi, rsi to set up for it. Changing the function signature could save 5 bytes: void reg_to_hex(char buf[16], uint64_t val) would save two bytes from not having to return buf in rax, and 3 bytes from dropping the xchg. The caller probably keeps the buffer in 16B of stack space anyway, and mov rdx, rsp is no bigger than mov rdx, rax before calling another function / making a syscall on the buffer, so returning the pointer in rax doesn't really buy the caller anything.
The next function is probably going to ALIGN 16, though, so shrinking the function to even smaller than 32B isn't as useful as getting it inside half a cache-line.
Absolute addressing for the LUT (hex_xlat) would save a few bytes
(use mov al, byte [hex_xlat + rax] instead of needing the lea).
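(Roughly: the RIP-relative setup is a 7B lea plus a 3B load per use, while the absolute form is a single 7B instruction with no setup:)
lea rdx, [rel hex_xlat] ; 48 8d 15 rel32 = 7B, once
mov al, byte [rdx+rax] ; 8a 04 02 = 3B per use
; vs. non-PIC:
mov al, byte [hex_xlat + rax] ; 8a 04 05 disp32 = 7B, no setup needed
Anyway, here it is: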
global register_to_hex_size
register_to_hex_size:
push rsi ; pushing/popping return value (instead of mov rax, rsi) frees up rax for stosb
xchg rdi, rsi ; allows stosb. Better: remove this and change the function signature
mov cl, 16 ; 3B shorter than mov ecx, 16
lea rdx, [rel hex_xlat]
;ALIGN 16
.loop:
rol rsi, 4
mov eax, esi ; mov al, sil (to allow the 2B and al, 0xf) would need a 2B xor eax,eax first
and eax, 0x0f
mov al, byte [rdx+rax]
stosb
;; loop .loop ; setting up ecx instead of cl takes more bytes than loop saves
dec cl
jne .loop
pop rax ; get the return value back off the stack
ret
Using xlat costs 2B (to save/restore rbx), but saves 3B, for a net savings of 1B. It's a 3-uop instruction, with 7c latency, one per 2c throughput (Intel Skylake). The latency and throughput aren't a problem here, since each iteration is a separate dependency chain, and there's too much overhead for this to run at one clock per iteration anyway. So the main problem is that it's 3 uops, making it less uop-cache-friendly. With xlat, the loop becomes 10 uops instead of 8 (using stosb), so that sucks.
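For reference, a full untested sketch of the xlat variant (hypothetical name register_to_hex_xlat); its loop body is what disassembles to the first listing below:
global register_to_hex_xlat
register_to_hex_xlat:
push rbx ; xlat needs the table pointer in rbx, which is call-preserved
push rsi ; return value, as before
xchg rdi, rsi
mov cl, 16
lea rbx, [rel hex_xlat]
.loop:
rol rsi, 4
mov eax, esi
and al, 0x0f ; the 2B form is fine here: xlat only reads al as the index
xlatb ; al = [rbx + al]
stosb
dec cl
jne .loop
pop rax
pop rbx
ret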
112: 89 f0 mov eax,esi
114: 24 0f and al,0xf
116: d7 xlat BYTE PTR ds:[rbx]
117: aa stos BYTE PTR es:[rdi],al
vs.
f1: 89 f0 mov eax,esi
f3: 83 e0 0f and eax,0xf
f6: 8a 04 02 mov al,BYTE PTR [rdx+rax*1]
f9: aa stos BYTE PTR es:[rdi],al
Interestingly, this still has no partial-register stalls, because we never read a wide register after writing only part of it. mov eax, esi is write-only, so it cleans up the partial-reg-ness from the load into al. So there would be no advantage to using movzx eax, byte [rdx+rax]. Even when we return to the caller, the pop rax doesn't leave the caller susceptible to partial-reg problems.
(If we don't bother returning the input pointer in rax, then the caller could have a problem. Except in that case it shouldn't be reading rax at all. Usually it only matters if you call with call-preserved registers in a partial-reg state, because the called function might push them. Or more obviously, with arg-passing / return-value registers.)
Efficient version (uop-cache friendly)
Looping backwards didn't turn out to save any instructions or bytes, but I've included this version because it differs more from the version in Frank's answer.
ALIGN 16
global register_to_hex_countdown
register_to_hex_countdown:
;;; work backwards in the buffer, starting with the least-significant nibble as the last char
mov rax, rsi ; return value, and loop bound
add rsi, 15 ; last char of the buffer
lea rcx, [rel hex_xlat] ; position-independent code
ALIGN 16
.loop:
mov edx, edi
and edx, 0x0f ; isolate low nibble
mov dl, byte [rcx+rdx] ; look up the ascii encoding for the hex digit
; rdx is an 'index' with range 0x0 - 0xf
; non-PIC version: mov dl, [hex_digits + rdx]
mov byte [rsi], dl
shr rdi, 4
dec rsi
cmp rsi, rax
jae .loop ; rsi counts backwards down to its initial value
ret
The whole thing is only 12 insns (11 uops with macro-fusion, or 12 including the NOP for alignment). Some CPUs can fuse cmp/jcc but not dec/jcc (e.g. AMD, and Intel Nehalem).
Another option for looping backwards was mov ecx, 15, and store with mov [rsi+rcx], dl, but two-register addressing modes can't micro-fuse. Still, that would only bring the loop up to 8 uops, so it would be fine.
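That alternative might look something like this untested sketch (the LUT pointer moves to r8 since rcx is now the loop counter):
global register_to_hex_index
register_to_hex_index:
mov rax, rsi ; return value
mov ecx, 15 ; index of the last char
lea r8, [rel hex_xlat]
.loop:
mov edx, edi
and edx, 0x0f
mov dl, byte [r8+rdx]
mov byte [rsi+rcx], dl ; indexed store: can't stay micro-fused on SnB-family
shr rdi, 4
dec ecx
jns .loop ; 15 down to 0, exit when ecx goes negative
ret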
Instead of always storing 16 digits, the countdown version could use rdi becoming zero as the loop condition to avoid printing leading zeros. i.e.
add rsi, 16
...
.loop:
...
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop
; no lea rax, [rsi+1] correction needed: rsi is decremented before the store, so it already points at the most-significant digit stored
mov rax, rsi
ret
Printing from rax to the end of the buffer gives just the significant digits of the integer.
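For reference, here's that variant written out in full as an untested sketch (same args as before: rdi = val, rsi = buffer; hypothetical name register_to_hex_nolz):
global register_to_hex_nolz
register_to_hex_nolz:
add rsi, 16 ; start one past the end of the buffer
lea rcx, [rel hex_xlat]
.loop:
mov edx, edi
and edx, 0x0f
mov dl, byte [rcx+rdx]
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop ; stop once the remaining value is zero (a zero input still stores one '0')
mov rax, rsi ; rax = first significant digit; the buffer ends at the original rsi + 16
ret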