See edit at end
I'm writing code for an STM32F4 processor, and I've come upon a situation where the compiler apparently reorders my register reads/writes, ruining a busy wait loop. Specifically, when I write
void writeLedSPI(uint16_t data) {
    using namespace Hardware;
    led_latch::clear();
    leds_spi::write(&data, 1);
    leds_spi::wait_tx_completed();
    led_latch::set();
    // The result should appear in the LED shift registers
}
where
inline static void SPI::write(const uint16_t *data, std::size_t length) {
    for (unsigned int i = 0; i < length; i++) {
        while (!tx_buffer_empty());
        regs()->DR = *(data++);
    }
}
inline static void SPI::wait_tx_completed() {
    while (regs()->SR & SPI_SR_BSY);
}
and
inline static void GPIO::set() {
    regs()->BSRR = bit;
}
inline static void GPIO::clear() {
    regs()->BSRR = bit << 16;
}
what happens under -O3 is that the GPIO pin led_latch gets set before the SPI write is complete (on the scope trace, at about the time the SPI has clocked out the first of the 16 bits). So it appears that led_latch::set() gets moved before the busy wait loop, or that the loop is optimized away entirely.
That's a bit confusing to me, since all of the fields coming from any of the regs() structs are volatile, so I'd have thought the compiler is not allowed to move volatile reads/writes past each other. Maybe I'm missing something?
Anyway, I tried to use the technique in this answer, i.e. inserting various instantiations of DoNotOptimize, but the problem is that the dependence I need to express is an implicit one, between the read of the BSY flag that returns false from the SPI and the write to the BSRR register, and I can't seem to get this working.
Funnily enough, inserting a busy wait loop that just waits on the debug cycle counter does the trick:
const uint32_t start = DWT->CYCCNT;
while ((DWT->CYCCNT - start) < ticks);
where ticks is just enough for the SPI to transfer the data. Obviously, though, this is a hack, especially in more involved code.
So: how do I keep the compiler from moving led_latch::set() before the busy wait loop in leds_spi::wait_tx_completed() at any optimization level? A solution where any nasty implementation-defined behaviour is hidden in the SPI driver, i.e. not visible in writeLedSPI, would be ideal.
Edit:
The problem was not compiler optimizations per se. Rather, with optimizations on, the delay between writing the SPI data register and reading the SPI_SR_BSY flag was just one clock cycle (see Godbolt), by which time the flag had not yet been set, so the busy wait loop exited immediately. Changing the code to
leds_spi::write(&data, 1);
leds_spi::wait_tx_buffer_empty();
leds_spi::wait_tx_completed();
makes it work. I'm still open to answers on what synchronization I could use to make this work without the "you just have to know to wait for tx_empty first" trick, if possible.
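For reference, one way to keep that knowledge out of writeLedSPI might be to fold the TXE wait into wait_tx_completed() itself. Below is a minimal host-side sketch of the idea, not the real driver: MockSpi, write_dr(), read_sr(), wait_bsy_only() and the read-count thresholds are all hypothetical stand-ins that only model the window in which TXE and BSY are both still 0 right after a write to DR. The bit positions are those of the STM32F4 SPI_SR register.

```cpp
#include <cstdint>

// Bit masks of the STM32F4 SPI status register (SPI_SR).
constexpr std::uint32_t SR_TXE = 1u << 1;  // transmit buffer empty
constexpr std::uint32_t SR_BSY = 1u << 7;  // transfer in progress

// Hypothetical stand-in for the peripheral so the sketch runs on a host.
// Time advances by one step per status-register read (an assumption made
// purely for illustration; real timing depends on the SPI clock).
struct MockSpi {
    static constexpr int kShiftStart = 2;  // reads before data leaves DR
    static constexpr int kShiftEnd   = 6;  // reads before the frame is out
    int  reads   = 0;
    bool started = false;

    void write_dr(std::uint16_t) { started = true; reads = 0; }

    std::uint32_t read_sr() {
        if (!started) return SR_TXE;                       // idle
        ++reads;
        if (reads <= kShiftStart) return 0;                // TXE=0, BSY not set yet
        if (reads <= kShiftEnd)   return SR_TXE | SR_BSY;  // shifting out
        return SR_TXE;                                     // finished
    }

    bool done() const { return started && reads > kShiftEnd; }
};

// Original behaviour: polls BSY only, so it can fall through before BSY rises.
template <typename Spi>
void wait_bsy_only(Spi& spi) {
    while (spi.read_sr() & SR_BSY) {}
}

// Sketch of the combined wait: first TXE (data has left DR), then BSY.
template <typename Spi>
void wait_tx_completed(Spi& spi) {
    while (!(spi.read_sr() & SR_TXE)) {}
    while (spi.read_sr() & SR_BSY) {}
}

// True if the BSY-only wait returns while the mock transfer is still pending.
bool bsy_only_returns_early() {
    MockSpi spi;
    spi.write_dr(0xABCD);
    wait_bsy_only(spi);
    return !spi.done();
}

// True if the combined wait returns only after the mock transfer finished.
bool combined_wait_is_complete() {
    MockSpi spi;
    spi.write_dr(0xABCD);
    wait_tx_completed(spi);
    return spi.done();
}
```

With the real registers, the two loops would presumably read regs()->SR directly, and writeLedSPI would stay unchanged. Whether this fully answers the synchronization question is still open.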