
I'm storing a UTF-8 character in eax and later on, in processing, I need to know how many bytes make up the character.

I've narrowed this down to an approach that minimizes shifts and masks, and wanted to know: am I missing some neat trick somewhere?

Option 1: Brute Force

    mov     r11, 4      ;   Maximum bytes
    bt      eax, 31     ;   Test MSB of byte 4
    jc      .exit
    dec     r11         ;   Let's try 3
    bt      eax, 23     ;   Test MSB of byte 3
    jc      .exit
    dec     r11         ;   Let's try 2
    bt      eax, 15     ;   Test MSB of byte 2
    jc      .exit
    dec     r11         ;   It's plain ASCII (1 byte)
.exit:
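For reference, the same brute-force test can be sketched in C (a sketch, not the asker's actual code), assuming the question's packing where the leading byte ends up in the highest occupied byte of the 32-bit value (e.g. `É` = `c3 89` stored as `0x0000C389`):

```c
#include <stdint.h>

/* Brute-force UTF-8 length check, mirroring the assembly above.
 * Assumes the leading byte sits in the highest occupied byte of ch. */
static int utf8_len_bruteforce(uint32_t ch)
{
    if (ch & (1u << 31)) return 4;  /* MSB of byte 4 set */
    if (ch & (1u << 23)) return 3;  /* MSB of byte 3 set */
    if (ch & (1u << 15)) return 2;  /* MSB of byte 2 set */
    return 1;                       /* plain ASCII */
}
```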

Note:

  1. I had the accumulation in the eax register wrong as pointed out by, well, everyone.
  2. Both Margaret and Ped7g provided solutions and I learnt even more than expected.
Peter Cordes

Frank C.
    AFAIK you should be checking from the other direction, but it's not exactly clear how your character is in eax to start with. – Jester Dec 21 '16 at 14:04
  • @Jester when reading in the UTF-8 (or ascii) I shift in character order so if it is UTF-8 the control byte (e.g. 0x11......) is in the higher position followed by the continuation (e.g. 0x10......). If I get your meaning, it would be more efficient to check low order first for ascii or least costly UTF-8? – Frank C. Dec 21 '16 at 14:06
  • Wait, so what is in `eax` then, can you put some examples? I did expect something like `0x89c389c3` for `ÉÉ` (one character `É` with start of next character, which happens to be also `É` and fits into remaining two bytes of `eax`). What is read by your code in such case? (text has bytes `c3 89 c3 89`) If you would provide example for each length, it would be best to verify any suggestions on those. (plus how tight you are on spare registers) – Ped7g Dec 21 '16 at 14:13
  • @Ped7g - I am going at the character level, you have two (e.g. `É` followed by second `É`). So for one (1) characer: `rax = 0x000000000000c3a1` – Frank C. Dec 21 '16 at 14:17
  • 2
    Is `c3a1` (`á`) typo on your side, or I'm still missing how your value conversions works? If you are going per character, something already parsed that character. BTW, why something isn't providing also length? You are "parsing" it twice then. If you don't mind, then why are you bothering with performance of second parse, as you don't mind performance? I'm a bit confused by this. – Ped7g Dec 21 '16 at 14:20
  • @Ped7g - Yes, typo.. `rax = 0x000000000000c389` – Frank C. Dec 21 '16 at 14:24
  • Since you're coding this in assembly you should probably be leaving the UTF-8 characters unpacked because that's the fast way to process them. – Ross Ridge Dec 21 '16 at 18:36
  • See also the comments on [this question](http://stackoverflow.com/questions/25477582/how-do-i-decode-a-utf-8-character) for some links to other SO questions about decoding UTF-8. (algorithms in C are easy to implement in asm). – Peter Cordes Dec 23 '16 at 05:02

2 Answers


If you can assume a correctly encoded character, you can simply check where the highest zero bit is in the first code unit (thanks to the self-synchronizing property of UTF-8).

The catch is that for code points of one code unit the highest zero is bit 7, while for code points of n code units it is bit 7 - n (note the "discontinuity").

Assuming the first code-unit is in al.

movzx eax, al          ;Zero-extend: BSR and CMOVcc have no 8-bit forms
xor   al, 0xff         ;Transform highest 0 into highest 1
bsr   eax, eax         ;Find the index (from bit 0) of the highest 1
xor   eax, 7           ;Perform 7 - index (sets ZF when index was 7)
                       ;This gives 0 for single code unit code points
mov   ecx, 1
cmovz eax, ecx         ;Change that 0 back to 1

Note that bsr is not defined for an input of 0, but that can only happen for an invalid leading code-unit (of value 11111111b).

You can detect the invalid 0xff code-unit with a jz <error handler> after the bsr instruction.
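The same highest-zero trick can be sketched in C (an illustrative sketch, assuming a GCC/Clang-style compiler for the `__builtin_clz` intrinsic, which plays the role of `bsr`):

```c
#include <assert.h>

/* Length of a UTF-8 sequence from its leading byte: invert the byte so
 * the highest 0 becomes the highest 1, find that bit's index, and map
 * index 7 (ASCII) back to length 1. Leading byte must not be 0xFF. */
static int utf8_len_from_lead(unsigned char lead)
{
    unsigned inv = (unsigned char)~lead;   /* highest 0 -> highest 1 */
    assert(inv != 0);                      /* 0xFF is an invalid lead */
    int idx = 31 - __builtin_clz(inv);     /* like BSR on the byte */
    int n = 7 - idx;
    return n == 0 ? 1 : n;                 /* single code unit -> 1 */
}
```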

Thanks to @CodyGray for pointing out a bug on the original version.
Thanks to @PeterCordes for pointing out the XOR trick to do 7 - AL.

Margaret Bloom
  • OK, yes... I am not encoding it correctly. So when I do, the first byte (if not straight ascii) will have the control bits which also encodes the length in bits (0 based) 4-6 potentially (for 2 to 4 byte count). Is there an instruction for finding a 0 bit from msb? If so I can find that, I can toggle bit 7, shift by index bit to right and just use the value as count would remain? – Frank C. Dec 21 '16 at 14:58
  • @FrankC. I cannot find an instruction for finding the highest zero bit (the first zero from MSb). I solved by not-ing the code-unit first and using `bsr`. Regarding your second question, you need to take into account that there is a "discontinuity" in the position of the highest zero (index 7 for single code-unit chars, index 5 for double code-unit chars, index 4 and lower for subsequent lengthed chars) – Margaret Bloom Dec 21 '16 at 15:24
  • Agreed and thank you. I marked your's as the solution. – Frank C. Dec 21 '16 at 15:26
  • Instead of `neg al` + `add al, 7` + `setz al`, why not just `cmp al, 14` + `setz al`? – Cody Gray - on strike Dec 21 '16 at 16:08
  • 1
    @CodyGray I have to compute the function f(i) such that: f(7) -> 1, f(5) -> 2, f(4) -> 3, f(3) -> 4. So I do `7-i == 0 ? 1 : 7-i`. I'm not following the solution you proposed. BTW, Thanks for the edit! – Margaret Bloom Dec 21 '16 at 16:16
  • Then you can't use `setz` because that unconditionally clobbers `al`. It will either be set to 0 or 1, depending on ZF. – Cody Gray - on strike Dec 21 '16 at 16:17
  • @CodyGray Damn... you are right! I'm fixing it. Thanks again! – Margaret Bloom Dec 21 '16 at 17:28
  • I think you can `xor al, 7` to do the `al = 7-al`. gcc and clang use this 2's complement trick when compiling a leading-zeros function with BSR (i.e. when LZCNT is unavailable). https://godbolt.org/g/lDfjNL. – Peter Cordes Dec 23 '16 at 04:52
  • You could avoid destroying `eax` by using bmi2 [`rorx edx, eax, 8`](http://felixcloutier.com/x86/RORX.html) to copy and rotate the low bits to the top of a 32-bit register to set up for LZCNT (which isn't available with an 8-bit operand-size). I got this idea from trying to code it in C, where leading-zero-count functions don't come in small sizes, so I ended up using `__builtin_clz(~(foo<<24))`. gcc and clang just do the shift instead of using 8-bit operand size. – Peter Cordes Dec 23 '16 at 04:54
  • Thanks @PeterCordes, what a clever use of the xor! – Margaret Bloom Dec 23 '16 at 07:44

If you insist on the reversed order of bytes (for whatever weird reason), you can still simply scan for the highest bit set to 1, divide by 8, and add 1 to get the number of bytes.

GetReversedShiftedUtf8BytesCount:
    ; eax = UTF-8 code in reversed byte order, last byte in LSB
    ; 'É' (c3 89) => eax = 0x0000c389
    bsr ecx,eax
    cmovz ecx,eax   ; needed only for eax = 0
      ; ^ if eax is never 0 on input, this "cmovz" can be removed
    shr ecx,3
    inc ecx
    ret

Because you put the first byte of the char into the most significant position, bsr will return bit index 15, 23 or 31 for multi-byte chars, and anything from 0 to 6 for 7-bit ASCII. The "div 8" maps all of these to the right byte count either way; it doesn't care.

This routine should actually work with valid, normally packed UTF-8 codes as well.

For an invalid UTF-8 code ending with a zero byte it will return the wrong number of bytes (not counting the zero ones).
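A C sketch of the same bit-scan routine (an illustration only, using the GCC/Clang `__builtin_clz` intrinsic in place of `bsr`, and assuming the reversed packing above):

```c
#include <stdint.h>

/* Highest set bit / 8 + 1, mirroring the BSR routine: assumes the
 * first (leading) byte of the char sits in the highest occupied byte,
 * e.g. 'É' (c3 89) packed as 0x0000c389. */
static int utf8_len_bitscan(uint32_t ch)
{
    if (ch == 0) return 1;             /* mirrors the cmovz guard */
    int idx = 31 - __builtin_clz(ch);  /* like BSR: highest set bit */
    return idx / 8 + 1;
}
```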


There is of course, as always, also a LUT solution:

    movzx  ecx,al
    shr    ecx,3
    movzx  ecx,byte [utf8lengthLUT + ecx]  ; +rcx for 64b
    ; ecx = number of bytes or 0 for invalid leading byte value
    ...

utf8lengthLUT:                     ; 32B look-up table for upper 5b of 1st byte
    db     1, 1, 1, 1, 1, 1, 1, 1  ; 00000 - 00111 ; single byte
    db     1, 1, 1, 1, 1, 1, 1, 1  ; 01000 - 01111 ; single byte
    db     0, 0, 0, 0, 0, 0, 0, 0  ; 10000 - 10111 ; not valid leading byte
    db     2, 2, 2, 2              ; 11000 - 11011 ; two bytes code point
    db     3, 3                    ; 11100 - 11101 ; three bytes code point
    db     4                       ; 11110         ; four bytes code point
    db     0                       ; 11111         ; not valid leading byte

I didn't debug it, just ran it through nasm for a syntax check. Nor did I profile it, of course. :) Given how short that bsr variant is, I doubt this will be seriously faster even on CPUs where bsr hurts.

But this one treats invalid UTF-8 codes differently: instead of detecting the highest non-zero byte and returning its position + 1 (insensitive to the leading byte's content), it decodes the leading byte properly and returns 0 when the leading bits are wrong. A correct leading byte with an incorrect 2nd+ byte (like c3 00) will still return 2, though, while the first variant returns 1 in that case.

(It's possible to get by with only a 16B LUT if you don't care about the invalid 11111 leading byte and are willing to treat it as a 4-byte code point.)
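The same table translates directly to C (a sketch mirroring the 32-byte LUT above, keyed on the top 5 bits of the leading byte, with 0 marking an invalid lead):

```c
/* 32-entry length table indexed by the top 5 bits of the leading byte;
 * 0 marks an invalid leading byte, matching the NASM table above. */
static const unsigned char utf8lengthLUT[32] = {
    1,1,1,1,1,1,1,1,   /* 00000 - 00111: single byte */
    1,1,1,1,1,1,1,1,   /* 01000 - 01111: single byte */
    0,0,0,0,0,0,0,0,   /* 10000 - 10111: not a valid leading byte */
    2,2,2,2,           /* 11000 - 11011: two-byte code point */
    3,3,               /* 11100 - 11101: three-byte code point */
    4,                 /* 11110        : four-byte code point */
    0                  /* 11111        : not a valid leading byte */
};

static int utf8_len_lut(unsigned char lead)
{
    return utf8lengthLUT[lead >> 3];
}
```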


BTW, there are open-source i18n libraries doing all these things: validating UTF-8 input, fixing invalid sequences, counting chars, etc. Some of them have been around for more than a decade... and they still get bug reports and fixes. Which is a subtle hint at how difficult it is to write this stuff correctly (without exposing the app to some input-data exploit). :)

(Plus consider how many fixing edits these two answers have received... :) )

And one more off-topic piece of advice: if you ever try to write something in PHP that processes UTF-8 input (not just untrusted input, but even input from a trusted source), especially data coming from a GET/POST request... just don't do it on your own. Never. Get a framework for that. :)

Ped7g
  • The order was an error on my part. Assume that `eax = 0x000089c3` – Frank C. Dec 21 '16 at 15:03
  • @FrankC. Then I believe Margaret's solution will work too, but also this one should for valid UTF-8 codes, because the MSB still can't be zero, so *some* bit will be set in MSB, the "div 8" will get correct result [0,1,2,3] anyway. ... This one may fail for invalid UTF8 codes (with zero byte inside (at end)). Make sure your code is **not** exploitable in such case. – Ped7g Dec 21 '16 at 15:05
  • Good (+1) point on the exploit. I'll add more rigor to the intake. Thank you! – Frank C. Dec 21 '16 at 15:08
  • @FrankC.: BTW this of course works only because you already parsed the character to proper length. So if you are really into performance, that first parser already has the length. Just take it from it. This would make sense if you had in eax the UTF-8 char + undefined mess after it, so this would be the first instance of handling the char. But then the code would have to be different. Or Margaret's one (which is undefined for `FF` ... sooo... still some work to do, if you need to be bullet proof :) ). – Ped7g Dec 21 '16 at 15:09
  • I don't have the luxury of changing the "char" structure/object meta data to include length which is why I have the parse happening at one spot and the usage that can occur 'at some later point' in the future (such as printing, etc.) – Frank C. Dec 21 '16 at 15:11
  • hm... The utf-8 data can be stream of bytes (as in utf-8 text file). So `char []` encodes that all, including length of codes and even invalid ones.. :) and with minimal footprint. Parsing that stream to single chars just ahead of working with single chars is usually the good-enough, but if you are really intensive over individual chars, you may want to store them separately out of the stream, wasting more memory and reparsing them several times. But the original utf-8 encoding is actually quite handy for most of the tasks, so you are doing probably something special. :) – Ped7g Dec 21 '16 at 15:14
  • Generally speaking, this is oldschool brute force which MSB byte has some bit set. Margaret's solution is actually parsing UTF8 control byte data. So I think both answers give some insight and may be preferred in different situations. Yay. :) – Ped7g Dec 21 '16 at 15:23
  • @Ped7g Got it, I do like not destroying `eax` and using `ecx` the way you've coded. – Frank C. Dec 21 '16 at 15:31