Reliable busy-wait for a register in C++

Question

See edit at end

I'm writing code for an STM32F4 processor, and I've come upon a situation where the compiler apparently reorders my register reads/writes, ruining a busy wait loop. Specifically, when I write

void writeLedSPI(uint16_t data) {
    using namespace Hardware;

    led_latch::clear();

    leds_spi::write(&data, 1);
    leds_spi::wait_tx_completed();

    led_latch::set();

    //The result should appear in the LED shift registers
}

where

inline static void SPI::write(const uint16_t *data, std::size_t length) {

    for (unsigned int i = 0; i < length; i++){
        while (!tx_buffer_empty());
        regs()->DR = *(data++);
    }

}

inline static void SPI::wait_tx_completed() {
    while (regs()->SR & SPI_SR_BSY);
}

and

inline static void GPIO::set() {
    regs()->BSRR = bit;
}
inline static void GPIO::clear() {
    regs()->BSRR = bit << 16;
}

what happens under -O3 is that the GPIO pin led_latch gets cleared before the SPI write is complete (looking at the scope trace, actually about by the time the SPI has clocked out the first bit of 16). So it appears that led_latch::set() gets moved before the busy wait loop, or the loop is optimized entirely away.

That's actually a bit confusing for me, since all of the fields coming from any of the regs() structs are volatile, so I'd have imagined that the compiler is not allowed to move volatile reads/writes past each other? Maybe I'm missing something?

Anyway, I tried to use the technique in this answer, i.e. insert various instantiations of DoNotOptimize, but the problem is that the dependence I need get is an implicit one between the read of the BSY flag that returns false from the SPI and the write of the BSRR register, and I can't seem to get this working.

Funny enough, inserting a busy wait loop that just waits for the debug counter does the trick:

const uint32_t start = DWT->CYCCNT;
while ((DWT->CYCCNT - start) < ticks);

where ticks is just enough for the SPI to transfer the data. Obviously though this is a hack, especially in more involved code.

So: how do I keep the compiler from moving led_latch::set() before the busy wait loop in leds_spi::wait_tx_completed() at any optimization level? A solution where any nasty implementation defined behaviour could be hidden in the SPI driver, i.e. wouldn't be visible in writeLedSPI would be ideal.

Edit:

The problem was not compiler optimizations per se, rather, with optimizations on, the delay from writing to the SPI data register to reading the SPI_SR_BSY flag was just one clock cycle (see Godbolt), by which time the flag hadn't yet turned on. Changing the code to

leds_spi::write(&data, 1);
leds_spi::wait_tx_buffer_empty();
leds_spi::wait_tx_completed();

makes it work. I'm still open for answers on what synchronization I could do to make this work without the "you just have to know to wait for the tx_empty first" -trick, if possible.

What compiler are you using? All `volatile` accesses are supposed to happen in the order they would in the C++ abstract machine. They're not ordered wrt. non-volatile accesses, but are ordered wrt. each other. If that's happening, I think that's a compiler bug. You might be able to work around it with a compiler memory barrier like `asm("":::"memory")` for compilers that understand GNU C inline asm syntax. — Peter Cordes, Mar 17 '20 at 14:27
Have you checked the asm output to see if stores are actually getting statically reordered? If not, maybe you need an ARM barrier instruction like `dsb sys` to make sure a store fully flushes before a later load instruction? I don't know ARM device driver writing details, or whether you have the memory region containing your MMIO registers set to uncacheable. — Peter Cordes, Mar 17 '20 at 14:31
The Cortex M4 doesn't have a cache, so that isn't the problem. I'll check the asm output as soon as I get back to this. — Timo, Mar 17 '20 at 14:41
It still has a pipeline and possibly a store buffer, though. If it does mis-compile, can you reduce this to a tighter and more complete [mcve] that you could actually (mis)-compile on the Godbolt compiler explorer (https://godbolt.org/) with ARM gcc there? If you can demo the wrong asm, you can report it as a GCC bug on their bugzilla. — Peter Cordes, Mar 17 '20 at 14:42
Show us the declaration of `regs` plz. Also what's that `template` line doing in your snippet? Are you sure it compiles? Cause it looks ill-formed to me. — Red.Wave, Mar 17 '20 at 15:15
@red.wave the template is copy-paste error (I simplified the code a bit), I'll fix it. Thanks for pointing that out! — Timo, Mar 17 '20 at 15:17
If you just want a memory order you can try with std::atomic_signal_fence — Surt, Mar 17 '20 at 15:22
could it be related to this? https://stackoverflow.com/questions/59925618/how-do-i-make-an-infinite-empty-loop-that-wont-be-optimized-away — bracco23, Mar 17 '20 at 15:40
@Red.Wave I'm not at work anymore, I'll get to that tomorrow. In the meantime, the struct definitions are directly the ST headers (STM32F4xx.h) — Timo, Mar 17 '20 at 15:47
@Timo `regs` is a function that I guess is not part of the micro's driver lib. If things have not changed since 10yrs ago, micro still relies on C rather than C++. I think the declaration of `regs` has sth to tell. — Red.Wave, Mar 17 '20 at 19:59
@Red.Wave I copy-pasted enough (I'm skipping setups, namespaces etc etc) to compile in godbolt, see https://godbolt.org/z/gjJ54M . Indeed, I'm not getting statically reordered! I also found out what was the actual problem, see my edit in a while. — Timo, Mar 18 '20 at 07:22
@PeterCordes You're right, I'm not getting reordered, the problem was elsewhere! See my edit in a while. You can check out the minimal version in Godbolt: https://godbolt.org/z/gjJ54M — Timo, Mar 18 '20 at 07:24
I think it's pretty normal for device drivers to poll the HW until it responds to a command, or to use a known-time delay loop to give the HW some known necessary time to respond. — Peter Cordes, Mar 18 '20 at 11:22

Reliable busy-wait for a register in C++

0 Answers0