How appropriate is it to use a union to simulate CPU registers in an emulator?

Question

So I've seen it said before that to use unions in this manner is a bad idea. I know that it's technically undefined behavior. However, if I am using C++11 (and therefor only new-ish compilers) then honestly how bad could the following code really be? Is this actually likely to blow up on me? Could it be improved?

union registers_t
{
    struct [[packed]]
    {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        uint8_t F, A, C, B, E, D, L, H, R, I;
#else
        uint8_t A, F, B, C, D, E, H, L, I, R;
#endif
        uint16_t SP, PC;
    };
    struct [[packed]]
    {
        uint16_t AF, BC, DE, HL;
    };
};

Like I said, I know this is UB in C++ so there is no reason to point this out. My question is does that actually matter in this case.

This code lacks usage examples. If you imply that you want to assign `AF` and then read `A` then yes, it will be UB. But there is no need to mess with union here at all, just write some getters / setters methods. — user7860670, Oct 27 '18 at 10:12
@VTT This method produces cleaner code that is easier to read. i.e. reg.BC++ vs reg.setBC(reg.getBC + 1). — Chris_F, Oct 27 '18 at 10:15
This method produces code that smells. Accessing class fields directly binds user code to the implementation details. Today you store values in union, tomorrow you want to start using variant but it turns out that existing code base will be broken. Also i would probably even write `reg.Get_MutableBC().Increment();` or `reg.Increment()` hiding `uint16_t`. — user7860670, Oct 27 '18 at 10:27
You've got my blessing. But you'll get hassled about strict aliasing, just ensure you disable the optimization that makes that byte. I don't know of a compiler where that actually matters, Darwin sorted that out. — Hans Passant, Oct 27 '18 at 12:51

Pezo · Accepted Answer · 2018-10-27T15:51:44.563

1

If I understand correctly this is only for an emulator, so you don't actually need your 8 and 16 bit values to occupy the same storage, you just want different access to the same data.

My suggestion would be to encapsulate your registers in a class (for example a register_t class that stores a single 16/32/64 value, which you can then access as 8/16/32 bits as well). Depending on your needs you might even return a proxy object as a sort of reference that allows you to assign a part of your register (e.g. to be able to do reg.l() += 1).

A simple example might look like this: 16 bit register_t with proxy reference to constituent bytes (other operators, bit widths, and const correctness are left as an exercise to the reader). Another advantage of doing it this way (besides it not being UB) is that you don't need to care about endianness.

If you want to have the right names for your registers as well, you could then encapsulate a number of those registers in another class that provides access to named registers via appropriately named member functions.

edited Oct 27 '18 at 15:51

answered Oct 27 '18 at 11:27

Pezo

1,458
9
15

I tried this approach using T& instead of T* which doesn't work and I am not entirely sure why. – Chris_F Nov 02 '18 at 16:53
How do you mean, storing `T&` instead of `T*` in your `byte_ref` class? – Pezo Nov 02 '18 at 16:56
I don't really like storing references as members. Why do you not want to store a pointer, and how does it not work? – Pezo Nov 02 '18 at 20:59
I can't replicate the behavior I was experiencing previously. I guess my instinct is to prefer a reference because null pointers are not needed and the added dereference is no longer needed. Is there a reason you would prefer a pointer? – Chris_F Nov 02 '18 at 22:13
Well there's one reason and one gut feeling. The reason is that you gain nothing by using a reference instead of a pointer, but you lose the ability to reassign (i.e. types with reference members are not copy- or move-assignable). My gut feeling is that a member is not part of the interface and shouldn't be misused to ensure invariants of the class (e.g. that it can't be null), I do take a reference in the constructor for that (so I know the pointer is not null). Have a look [at this answer](https://stackoverflow.com/a/892303/1849769) as well. – Pezo Nov 02 '18 at 22:28
I will just say that using the latest version of Clang with -O3 flag this method is producing a little over 2x the number of instructions as the union method. So avoiding UB definitely come at a performance cost... – Chris_F Nov 03 '18 at 15:23
Not only avoiding UB, but also portability. This method is correct on every standards-conforming compiler. If you only care about one specific platform then you can just look at whether your compiler vendor supports what you want to do. If they don't then you're out of luck and your code may break at any time, if they do, great. – Pezo Nov 03 '18 at 15:44

score 0 · Answer 2 · answered Oct 27 '18 at 10:44

0

The behaviour is undefined if you write to a member of a union, and then access the same data as a different member of the union, unless one of the types is a char type.

So this code seems to be fine. If you tried the same thing with 32 bit registers (which would be incorrect if you are trying to simulate x86 registers, but that's not the question), then writing to a 16 bit element and reading back as a 32 bit element would be undefined behaviour.

BTW. memcpy, memset and memmove are accessing data as bytes. Officially.

answered Oct 27 '18 at 10:44

gnasher729

51,477
5
75
98

*unless one of the types is a char type* - where did you get this from? Maybe you are mixing unions with byte representation access `reinterpret_cast(&whatever);`? The only special treatment is used for common initial sequence of standard-layout structs. – user7860670 Oct 27 '18 at 10:55

How appropriate is it to use a union to simulate CPU registers in an emulator?

2 Answers2