Using SSE intrinsics, one could code this like:
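    /* requires #include <emmintrin.h> (SSE2 intrinsics) */
    char in[2];
    char string[16];  /* 16 characters, not NUL-terminated */
    __m128i zeroes = _mm_set1_epi8('0');
    __m128i ones = _mm_set1_epi8('1');
    /* _mm_set_epi8() takes byte 15 first and byte 0 last */
    __m128i mask = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
        (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);
    __m128i val = _mm_set_epi8(
        in[1], in[1], in[1], in[1], in[1], in[1], in[1], in[1],
        in[0], in[0], in[0], in[0], in[0], in[0], in[0], in[0]);
    /* 0xff in each byte whose bit is set, 0x00 otherwise; an equality compare
       is used because a signed lower-than compare mishandles the 0x80 byte */
    val = _mm_cmpeq_epi8(_mm_and_si128(val, mask), mask);
    /* select '1' where the bit is set, '0' where it is clear */
    val = _mm_or_si128(_mm_and_si128(val, ones), _mm_andnot_si128(val, zeroes));
    _mm_storeu_si128((__m128i *)string, val);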
The code performs the following steps:
- replicate each of the two input bytes across eight bytes of the XMM register, using _mm_set_epi8()
- create a mask that selects a different bit from each byte
- extract the bits with a parallel AND
- compare the extracted bit with the mask. The result is an array of bytes, each either 0xff or 0x00 depending on whether the bit was set or clear.
- select the '0' and '1' characters using that compare result and combine them
- write the resulting byte array out
This does away with shift-and-test sequences, but at the price of the _mm_set*() calls, each of which expands into a sequence of a few SSE instructions. It's still faster than 128 iterations of a bit-test loop.
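If the cost of the 16-argument _mm_set_epi8() on the input matters, one possible variation (a sketch, not part of the code above) replicates the two bytes with three unpack instructions instead, and turns the compare result into characters with a single subtract. The function names bits16_to_chars_scalar and bits16_to_chars_sse2 are made up for this illustration; the scalar version is only there as a reference to check the SIMD output against. Both assume the same output order as above: bit 0 of in[0] first.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference: one bit test per output character,
   starting with bit 0 of in[0]. Hypothetical helper name. */
static void bits16_to_chars_scalar(const unsigned char in[2], char out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = ((in[i / 8] >> (i % 8)) & 1) ? '1' : '0';
}

/* SSE2 sketch: replicate the input with unpacks instead of _mm_set_epi8().
   Hypothetical helper name. */
static void bits16_to_chars_sse2(const unsigned char in[2], char out[16])
{
    const __m128i zeroes = _mm_set1_epi8('0');
    /* byte i of the mask holds 1 << (i % 8) */
    const __m128i mask = _mm_set_epi32((int)0x80402010, 0x08040201,
                                       (int)0x80402010, 0x08040201);

    /* bytes 0 and 1 of v are in[0] and in[1], the rest is zero */
    __m128i v = _mm_cvtsi32_si128(in[0] | (in[1] << 8));
    v = _mm_unpacklo_epi8(v, v);   /* in0 in0 in1 in1 0 ...        */
    v = _mm_unpacklo_epi16(v, v);  /* in0 x4, in1 x4, 0 ...        */
    v = _mm_unpacklo_epi32(v, v);  /* in0 x8 (low), in1 x8 (high)  */

    /* 0xff where the selected bit is set, 0x00 where it is clear */
    __m128i set = _mm_cmpeq_epi8(_mm_and_si128(v, mask), mask);

    /* 0xff is -1 as a byte, so '0' - (-1) yields '1' */
    __m128i chars = _mm_sub_epi8(zeroes, set);
    _mm_storeu_si128((__m128i *)out, chars);
}
```

For example, the input bytes 0xA5, 0x80 should come out as the 16 characters 1010010100000001 (least significant bit first, no terminating NUL). Whether this actually beats the _mm_set_epi8() version depends on the compiler and CPU, so it is worth measuring.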