20

Cache-coherent SMP (symmetric, or shared-memory, multiprocessing) systems can provide various grades of memory ordering guarantees; the stronger ones are more expensive to implement but make it easier to write correct code: https://en.wikipedia.org/wiki/Memory_ordering

Modern x86 CPUs provide Total Store Ordering (TSO), which is the strongest guarantee (that is used in actual SMP systems).

I'm interested in what the situation was in the early 90s. At that time, a typical x86 SMP system would have a pair of 486 CPUs, each with a memory controller on a separate chip. Did those systems also provide TSO?
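To make the property concrete, here is the classic store-buffer litmus test, written as a modern C++ sketch of my own just to pin down what TSO permits (this is not code from any of the systems in question): under sequential consistency at least one thread must observe the other's store, while under TSO both loads may return 0 because each store can still be sitting in its CPU's store buffer. (A serious litmus harness would keep two threads spinning and synchronize their start; this minimal version only shows the shape of the test.)

```cpp
// Store-buffer ("Dekker") litmus test: the canonical place where TSO and
// sequential consistency differ.  Each thread stores to its own flag, then
// loads the other's.  Under sequential consistency at least one thread must
// see the other's store; under TSO both loads may return 0, because each
// store can still be sitting in its CPU's store buffer when the other CPU's
// load executes (StoreLoad reordering).
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    long both_zero = 0;
    for (long i = 0; i < 100000; ++i) {
        x.store(0); y.store(0);
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);     // store my flag...
            r1 = y.load(std::memory_order_relaxed);    // ...then load the other
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;   // impossible under SC, allowed under TSO
    }
    std::printf("r1==0 && r2==0 seen %ld times\n", both_zero);
}
```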

dave
  • 35,301
  • 3
  • 80
  • 160
rwallace
  • 60,953
  • 17
  • 229
  • 552
  • 9
    I think the question could benefit from a paragraph explaining what 'Total Store Ordering' is, for the general reader. Though I am familiar with memory models (had to be, having written a little Alpha code :-)), the term TSO is not a familiar one to me. It is not actually defined at the linked Wikipedia page; it appears only as the name of an operational mode for RISC-V and SPARC. – dave Feb 01 '24 at 11:22
  • Do you have an example of a dual CPU 486 board ? I didn't think dual cpu was even supported until the pentium pro. Multi threaded software on x86 was non existent. – camelccc Feb 01 '24 at 13:08
  • @camelccc there were quite a few multiple CPU 486 systems in the high-performance world; I worked on a 4-CPU 486 in the mid-90s. In fact Compaq SystemPro was available with two 386s in late 1989, albeit using asymmetric multi-processing. – Stephen Kitt Feb 01 '24 at 13:12
  • @camelccc see this PC Magazine article for details of a few symmetric multi-486 systems. – Stephen Kitt Feb 01 '24 at 13:23
  • @StephenKitt this article is about asymmetric multiprocessing servers, where the memory is not shared, so the CPUs can't even access the same address space. This is little different from 2 computers in 1 box – camelccc Feb 01 '24 at 13:27
  • @camelccc the article I’m looking at discusses a couple of AMP systems and several SMP systems with shared memory. The 433MP System Overview gives more details of one of them; it mentions cache coherency etc. All the SMP systems discussed could run SCO Unix with MPX, which requires a symmetric system with a shared address space. – Stephen Kitt Feb 01 '24 at 13:41
  • OK, we have here an example of a machine that was doing something the 486 was never designed to do. The answer to your question must be yes as the 486 can't perform instruction re-ordering and there are no barrier instructions anyway. This answer is with the caveat that without hardware cache coherency support from the 486, there must have been restrictions on writing to addresses from different threads that are in the same cache line, which are going to be specific to DEC's implementation. – camelccc Feb 01 '24 at 14:01
  • 5
    @camelccc Intel CPUs support SMP for all x86. In fact they do all the way back to 8080 times. Multibus (note the name) was quite common with professional 8086 to 486 systems. Multibus allows various degrees of multiprocessing, from loose coupled dedicated systems over shared I/O and shared memory all the way to SMP. Implementation is based on Intel's DMA protocol introduced with the 8080, available and used since 1974. – Raffzahn Feb 01 '24 at 14:47
  • 3
    which is the strongest guarantee (that is used in actual SMP systems) - Historically there have been sequentially-consistent SMP systems, which is even stronger: even StoreLoad reordering doesn't happen (e.g. hardware with no store buffer or pipelining, each instruction fully finishes before starting the next.) Probably some of the machines in https://en.wikipedia.org/wiki/Symmetric_multiprocessing#History are like that. (386 was like this. But 386 SMP, unlike 486, predates the rules that modern x86 follows.) – Peter Cordes Feb 01 '24 at 21:03
  • 1
    @camelccc: lock add dword [esp], 0 is a full memory barrier with no other side effects. (And ironically is more efficient than mfence on modern x86, so it's what compilers use even on x86-64, with RSP of course. locked instructions may not be 100% ordered wrt. SSE4.1 NT loads from WC memory, but mfence is; I think this is why mfence is slower. See Does lock xchg have the same behavior as mfence? - on Skylake, a microcode update made mfence drain the ROB (reorder buffer) like lfence, as well as the store buffer.) – Peter Cordes Feb 01 '24 at 21:11
  • 5
    "Modern x86 CPUs provide Total Store Ordering (TSO), which is the strongest guarantee (that is used in actual SMP systems)." -- I am not sure, but I have been told / taught at computer architecture conferences that at least some IBM systems, mainframes probably, were actually sequentially consistent. – Krazy Glew Feb 01 '24 at 23:40

2 Answers

26

BRIEF: most i486 multi-processors were probably TSO. But system vendors could definitely build non-TSO systems, and were definitely considering doing so.

---+ DETAIL

---++ SMP Marketplace ~1991 (not just Intel)

During the Intel P6 design effort (~1991-1996) we had to consider

  • SMP systems like Sequent, who started out as a bunch of ex-Intel engineers building NS32032 based systems, then '386, '486, and '586 systems. Wikipedia doesn't mention Sequent shipping a P6 based system, but we surely talked to them, especially because they were in a neighboring suburb of Portland Oregon.

  • NCR's 486 based SMP systems.

  • A scattering of other x86 based multiprocessor systems that we would have been reluctant to break compatibility with

  • As well as companies who were moving to x86 from other processor families, who we sought to encourage. It's fair to say that many companies were considering building x86 multi-processors at that time.

---++ Memory Models on Our Mind

P6 spent quite some time thinking seriously about implementing sequential consistency, because there were some older systems (Intel and non-Intel based) that had such strong memory ordering models. However, I clearly remember when 'NS' said something like "Why are we bothering with strong ordering? The i486 allows loads to bypass stores in its write buffer between CPU and L1 cache." Since we were running short on time, we switched to the weaker memory ordering model.

By the way, lest people think that Intel started off in the strong memory ordering / SC camp, and then only reluctantly relaxed, let me say that one of my first assignments at Intel was to participate in the "Weak Ordering Task Force". This petered out after a while, even before we realized we could achieve higher performance using a stronger memory ordering model and speculation. Weak ordering did not raise its head again until Itanium. Nevertheless, I still think that Intel should have defined a "store-release" instruction, as that is required by so many weaker memory ordering models. Even if those models have slightly different definitions of store-release, supporting the instruction would have allowed us to implement one of the weaker models. I think that LOCK MOV to memory, i.e. the x86 LOCK prefix in front of a store instruction, would have been easy to do. IIRC the LOCK prefix on stores was ignored at this time. But I emphasize: weaker memory ordering models with store-release instructions were probably lower performance on the shared-bus multiprocessor systems that Intel was anticipating in the near future at the time the P6 effort started in 1991. I wanted store-release for both higher-end HPC systems and lower-end embedded systems that might have wanted to avoid the implementation complexity of large-scale speculative consistency maintenance.
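To pin down what a "store-release" buys you, here is a minimal modern sketch (my illustration in C++ terms, not anything Intel defined back then): the producer's release store pairs with the consumer's acquire load, so a consumer that sees the flag is guaranteed to also see the payload. On today's x86 the release store compiles to an ordinary MOV, because every x86 store already has at least release semantics; on weaker models it needs a dedicated instruction, e.g. AArch64's STLR (whose exact definition differs a bit, as I note further down).

```cpp
// Message passing with release/acquire: the pattern a "store-release"
// instruction exists to make cheap on weakly ordered machines.
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // ordinary, non-atomic data
std::atomic<bool> ready{false};     // flag published with release semantics

void producer() {
    payload = 42;                                   // 1: write the data
    ready.store(true, std::memory_order_release);   // 2: store-release the flag
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // acquire pairs with the release
    assert(payload == 42);   // guaranteed: seeing the flag implies seeing the data
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```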

Note that the i486 only allowed loads to bypass stores with non-overlapping addresses, in a 4-deep store buffer between processor and the L1 cache, whereas P6 provided store-to-load forwarding out of its much deeper store buffer. 16 entries? 32 entries? Not determined at that time, but we knew it was going to be much deeper. Not only is there a slight difference between bypass and forwarding store buffers with respect to the memory model, but the different depths were possibly a greater concern. We already knew that some code could detect the difference between i386, i486, and Pentium/P5 based on timing and estimation of the store buffer and instruction prefetch queue depths, and sometimes depended on particular timings and depths. However, we had already decided that it was a fool's errand to try to keep code running that depended excessively on instruction timing and particular microarchitecture buffer depths, etc.

Overall we felt more comfortable having P6 implement more conservative architecture policies than previous processors, rather than less conservative. E.g. for self-modifying code P6 let a write to the instruction stream affect the very next instruction, rather than expose the actual microarchitecture instruction prefetch queue depth, which was certainly deeper than that of previous processors. Code for previous processors "knew", for example, that a write more than, say, 4*16 bytes' worth of instruction fetches ahead would take effect before execution reached it. P5 had only just introduced the notion of instruction stream serializing instructions. However, the memory ordering model was one place where legacy software that depended on an N<=4 deep store buffer would break on P6.

---++ System Vendors and Memory Ordering

At that time, ~1991-6, Intel had nowhere near as much control of the systems architecture as it does now. Loads bypassing the store buffer are the key element that keeps you from being strongly ordered, but whether you end up PC or TSO is a function of the system architecture, which was not controlled by Intel.

Terminology: SC = Sequential Consistency. PC = Processor Consistency. TSO = so-called Total Store Order. See below for more terminology commentary.

Quick reminder: TSO, Total Store Ordering, is IMHO totally misnamed. It is not a total store ordering. It corresponds roughly to having a forwarding, or at least bypassing, store buffer between every processor and a "global ordering point". So there is not a single store ordering for the entire system that is seen identically by all processors. Rather, there is a single store ordering for the entire system, but each processor may see its own stores "earlier" than they occur in the total store order seen by other processors. TSO is arguably the simplest memory model that supports Causal Consistency, but is a bit stronger.
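A litmus-test view of "sees its own stores early" (an illustrative modern sketch, not code from the period): each thread stores to its own variable, reads it back, then reads the other thread's variable. Under TSO the read-back can be satisfied from the local store buffer before the store is visible to anyone else, so the outcome flagged in the code comments is allowed; under sequential consistency it is impossible.

```cpp
// Each thread stores to its own variable, reads it back, then reads the
// other's.  Under TSO the outcome  a==1 && b==0 && c==1 && d==0  is allowed:
// each read-back is satisfied from the local store buffer while the store is
// not yet visible to the other processor.  Under SC that outcome is impossible.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int a, b, c, d;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        a = x.load(std::memory_order_relaxed);   // own store, may be seen early
        b = y.load(std::memory_order_relaxed);   // other thread's store, maybe not yet
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        c = y.load(std::memory_order_relaxed);
        d = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    std::printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);  // a=1, c=1 always; b=d=0 allowed under TSO
}
```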

The main other alternative system architecture, typically associated with PC "processor consistency", can be imagined as a tree of store buffers. E.g. processors P1 and P2 each have their own private store buffers, and then their traffic is merged into store buffer SB12. Processors P3 and P4 have their own private store buffers, and their traffic is merged into store buffer SB34. The global observation point may lie beyond SB12 and SB34. P1 can see not only its own stores earlier than in the so-called global order, but it can also see P2's stores earlier. And so on.
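The standard litmus test separating this tree-of-store-buffers behavior from TSO is IRIW, "independent reads of independent writes" (again an illustrative modern sketch of mine, not code from the period): with shared buffers like SB12 and SB34, the two readers can disagree about which of the two independent writes happened first. TSO, and x86 in practice, forbids that outcome; machines like IBM POWER allow it.

```cpp
// IRIW litmus test.  Two writers store to two independent locations; two
// readers each read both locations, in opposite orders.  The outcome
//   r1==1 && r2==0 && r3==1 && r4==0
// means the readers disagreed about which write happened first.  TSO (and
// x86) forbid it; a "tree of store buffers" style machine may allow it.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

int main() {
    std::thread w1([] { x.store(1, std::memory_order_relaxed); });
    std::thread w2([] { y.store(1, std::memory_order_relaxed); });
    std::thread rd1([] {
        r1 = x.load(std::memory_order_relaxed);   // reads x first...
        r2 = y.load(std::memory_order_relaxed);   // ...then y
    });
    std::thread rd2([] {
        r3 = y.load(std::memory_order_relaxed);   // reads y first...
        r4 = x.load(std::memory_order_relaxed);   // ...then x
    });
    w1.join(); w2.join(); rd1.join(); rd2.join();
    std::printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
}
```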

(Actually, multipath systems like meshes should really be a major alternative system topology. But with writeback caching you can emulate TSO or PC. It is multipath systems like meshes with caches that create the most "interesting" memory model considerations. That, and trying to milk the highest performance out of a mesh, even if you have caches. But the guys who go that far usually aren't HW cache coherent anyway.)

Intel's 486 (and Pentium) implemented the private store buffers. But it was up to the system vendor to develop the system architecture.

During the development of P6 there was at least one system vendor that was planning to build a non-TSO system. It's been a long time, but my memory is that they had a smaller SMP, let's say 4 processors, and they interconnected two of these clusters with bidirectional store buffers. I think their 486 era implementation was writethrough caching only (i486 itself was writethrough only, but some system implementers built their own writeback caches). Of course we wanted and expected everybody to eventually go to writeback caches, but supporting older configurations always has some value, and at the time there was some serious debate about whether WT was occasionally higher performance.

IIRC this was the "smoking gun" or at least "last straw" that resulted in the official Intel P6 memory ordering model being processor consistency and not TSO. There were probably some other companies, but I remember this one in particular because it was a very clean example. SAD BUT TRUE: many years later I met the platform architect for that company, who had since joined Intel. He told me that the non-TSO version of that platform never shipped. If we had known this, Intel might very well have gone for TSO way back then.

... Except: there were various implementation bugs that made various systems not only fail TSO, but also processor consistency. Sometimes the recommended workarounds essentially involved adding enough fence type operations to make the code almost be weakly ordered tolerant.

@PeterCordes links to a discussion of such errata that caused messy ifdef code in Linux: https://lkml.iu.edu/hypermail/linux/kernel/1803.2/02384.html
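For readers who have not seen that code, this is roughly the shape it took (an illustrative sketch, not verbatim kernel source): the same ordering macro either collapses to a compiler-only barrier when the hardware really is PC/TSO, or expands to a real fence - here the `lock add` idiom mentioned in the comments above - when the platform might reorder stores.

```cpp
// Illustrative sketch only (my reconstruction of the *shape* of the code,
// not verbatim kernel source).  On a genuinely TSO/PC x86, SMP ordering
// macros only need to stop the compiler from reordering; on the handful of
// systems that broke those guarantees, the same macros had to expand to a
// real, much slower barrier.  The lock-add form below is the 32-bit idiom
// (%esp); on x86-64 it would reference %rsp instead.
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")
#define full_barrier()     __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc")

#ifdef CONFIG_X86_PPRO_FENCE             /* work around non-TSO/non-PC systems */
# define smp_wmb() full_barrier()
#else                                    /* ordering already guaranteed by the hardware */
# define smp_wmb() compiler_barrier()
#endif

int shared_data, shared_flag;

void publish(void) {
    shared_data = 42;
    smp_wmb();          /* make sure the data is visible before the flag */
    shared_flag = 1;
}

int main() { publish(); return 0; }
```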

---++ Intel's not really official memory models

In the architectural documents I called the official Intel P6 memory ordering model something like "speculative processor consistency", where the speculative part was to emphasize that if you placed what RISC-V now calls non-idempotent MMIO devices in WB or WT speculatable memory types, you could not rely on MMIO side effects being done in any particular order. There were some awesome and terrifying systems that did things like map disk controller MMIO addresses to every address of a 4K page, and we just gave up on being compatible with that.

For many years Intel's internal memory ordering model target was stronger than we were willing to guarantee to the outside world. Not just fearing Intel implementation bugs, but also system vendor issues. Eventually, once the multiprocessor wild wild West settled down, more and more features of the internal memory ordering target model were disclosed in subsequent releases or updates of the memory ordering specification.

One of Intel's leading architects was vociferously against officially publishing any memory ordering model, whether processor consistency or the eventual TSO, for fear of such "minor" implementation bugs requiring a recall.

BTW, I believe that one of Intel's first multicore chips - multiple processors on the same chip - broke TSO by allowing forwarding of stores from one processor to the other, before globally observed. Easily fixed by comparing processor IDs. Not quite sure where or how it happened, or even if.

---++ Memory Ordering and Implementations

IMHO full SC probably could have been implemented reasonably, using the speculation techniques that allow loads to pass other loads and stores. Particularly in systems that only contained memory types WB and UC.

(WB = Write-Back cacheable, WT=Write-Through cacheable, UC=Uncacheable)
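To make "speculation techniques" slightly more concrete, a toy sketch (mine, not a description of real P6 internals): a load is allowed to execute early and out of order, but before it retires the core checks whether its cache line was snooped away in the meantime; if it was, the early result might expose an ordering the model forbids, so the machine squashes and replays from that load. On P6's descendants such squashes are what the machine_clears.memory_ordering event mentioned in a comment below counts.

```cpp
// Toy model of "speculate the load early, validate at retirement".
#include <cstdint>

struct InFlightLoad {
    uint64_t line_address;      // cache line the load read from
    uint64_t value;             // value obtained speculatively, out of order
    bool     snooped_away;      // a snoop/invalidate hit the line before retirement
};

enum class RetireAction { Commit, SquashAndReplay };

// Decide at retirement whether the early result may be kept.
RetireAction retire(const InFlightLoad& ld) {
    return ld.snooped_away ? RetireAction::SquashAndReplay : RetireAction::Commit;
}

int main() {
    InFlightLoad ok{0x1000, 42, false};
    InFlightLoad bad{0x2000, 7, true};
    return (retire(ok) == RetireAction::Commit &&
            retire(bad) == RetireAction::SquashAndReplay) ? 0 : 1;
}
```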

It is my understanding, based on some presentations and tutorials on memory ordering presented by IBM people at computer architecture conferences, that IBM mainframes do this, essentially by requiring a processor/cache to get ownership of any line before any access - whether writeback, writethrough, or uncached. Sounds good - but this amounts to building a cache directory or snoop filter to track such ownership even if you have no cache in your system.

Actually, it's not WB (write-back) per se that makes it easier to implement memory ordering. AFAICT it is any protocol that allows only a single writer to a memory location at any time, like the MESI and MOESI WB cache protocols. But not necessarily Write Update, which may or may not be write-back. Nor WT: Write-Through did not require ownership. Hence my surprise when IBM told me that they had to acquire ownership before doing a writethrough. Nor protocols that try to dynamically decide whether to do write-back via read-for-ownership vs. write-through, depending on which is faster for particular workloads. Nor write-back protocols with optimizations that eliminate read-for-ownerships. And certainly not some of the GPU caches that have a dirty bit per byte, which allows simultaneous writes to the same or different bytes in a cache line without the "write back exposing stale data" issues that arise when you have a single dirty bit for an entire cache line.
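As a toy illustration of that "single writer" property (my sketch, not any particular product's protocol): in a MESI-style protocol a cache may only write a line it holds in Modified or Exclusive state; from Shared or Invalid it must first obtain ownership, and that ownership acquisition is what serializes all writers to a line into a single order.

```cpp
// Toy sketch of the single-writer rule in a MESI-style write-back protocol.
#include <cstdio>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

enum class BusAction { None, ReadForOwnership, Upgrade /* invalidate other sharers */ };

// What must happen before this cache is allowed to perform a store to a line.
BusAction before_store(Mesi state) {
    switch (state) {
        case Mesi::Modified:
        case Mesi::Exclusive: return BusAction::None;              // already the single writer
        case Mesi::Shared:    return BusAction::Upgrade;           // kick out other sharers first
        case Mesi::Invalid:   return BusAction::ReadForOwnership;  // fetch the line and own it
    }
    return BusAction::ReadForOwnership;
}

int main() {
    std::printf("store to a Shared line requires bus action %d\n",
                static_cast<int>(before_store(Mesi::Shared)));
}
```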

Historically, certainly prior to P6 and probably more recently, low end x86 systems were "store and forget". As were many other not so low end non-x86 systems. A processor sent a writethrough store into an external store buffer, the store buffer said OK, and from the processor's point of view it was complete. No need to track "ownership" for WT or UC memory. With "store-and-forget" the external system must be involved in implementing the memory ordering model, and can easily break any supposed guarantees that a processor company tries to provide.

There are other ways of implementing stronger or strong-ish memory ordering models. E.g. it may be required that store-completion-acknowledgements be sent back to the processor only when the store has completely or finally been done. Then a processor can wait for store-completion before sending any order dependent request. Sounds good... except that my experience at that time was that system vendors often lied, sending store-completion earlier for improved performance. Early store completion is actually PC correct (not necessarily TSO) at almost any place in a system topology that you can guarantee all traffic goes through. But early store-completion is problematic when there is more than one path to a node.

I think this single path consideration is one reason why PCIe tends to have a tree structure, and modern x86 MP systems tend to be hierarchical. Mesh systems tend to be multipath. AMD's Opteron caches look like they should be multipath, but I learned that they were always configured with routing tables that are single path.

I clearly remember discussing "store-and-forget" versus "store-completion" with a former coworker who came from ARM. He conceded the possible performance advantages of store-and-forget (for WT and UC systems, which were ARM's original bread and butter), but silenced me by drawing K2,2 on my whiteboard: 2 processors, connected to two different I/O devices, each by point-to-point links that might have store buffers. That's a really simple example of a multipath system. If you want to do store-and-forget there, you have to provide store-buffer draining.

Sorry to have digressed.

---++ BOTTOM LINE

Most 486 systems were probably TSO. At least to writeback cacheable memory. But Intel customers were definitely thinking about building non-TSO systems.

And in aggressive I/O systems prior to PCI it could be hard to find any sort of memory model. Certainly, you had to "flush the buffers" or the like using different techniques for different paths. (Which, BTW, is still true for some systems today. Even non-Intel SOC / NOCs.)

---++ Terminology Notes

When I say "SMP" I mean "shared memory multiprocessing", and nearly always some form of "HW cache coherent SMP". Not necessarily 100% symmetric in terms of access time; e.g. SMP with locally coupled, faster memory regions is OK.

I should probably clarify some handwaved or obscured terms like SC, PC, PSO, and TSO.

SC (Sequential Consistency) is one of several very strongly ordered memory models. Some folks get religious about the distinctions between SC and strong ordering in general or particular.

TSO is fairly well defined. Note that for a while Intel systems were officially CC (Causally Consistent), but not necessarily TSO.

As I mentioned above, the original P6 memory ordering model was speculative processor consistency. Unfortunately, as far as I can tell, the more formal parts of this model were never made public.

I'm not going to discuss SUN SPARC PSO Partial Store Order.

I like the store buffer topology model as a way of thinking about the difference between TSO and "processor consistency". TSO=private store buffers per processor, global ordering. "Processor consistency" allows multilevel trees of store buffers. But there are differences between these physical models and the abstract memory ordering models.

I'm fairly sure that I should not be saying PC (Processor Consistency) for the "tree of store buffers" model. I think that's a special type of PC, but not the only type.

PC Processor Consistency is definitely not necessarily the same as "program order".

There are different weak ordering models. Some people get religious about the distinction between RC Release Consistency and WC Weak Consistency. If you bind your ordering operations to slightly different instructions... although some instruction sets have load-acquire and store-release, e.g. RISC-V and ARM (I think), their definitions of load-acquire and store-release are not necessarily the definitions that similarly named instructions had in the classic memory ordering models like RC and WC.


Krazy Glew
  • 583
  • 3
  • 10
  • Thanks for all these historical details! I edited to add definitions and wikipedia links for acronyms like PSO and PC closer to where you first used them. Hopefully my simplistic summaries like PC = release/acquire (like x86 all the time, or like AArch64 stlr / ldapr) don't conflict with what you meant. – Peter Cordes Feb 03 '24 at 02:00
  • one of Intel's first multicore chips ... by allowing forwarding of stores from one processor to the other, before globally observed. - That would be a lot more plausible for SMT logical cores of the same physical core, like perhaps an early P4 with Hyperthreading rather than multi-core? Unless there was a shared store buffer. SMT store-forwarding of graduated stores is the mechanism for IRIW reordering on IBM POWER hardware, where the memory model does allow that. – Peter Cordes Feb 03 '24 at 02:06
  • ... P6 there was at least one system vendor that was planning to build a non-TSO / much stronger than we were willing to guarantee to the outside world - Unfortunately this led to some wasted(?) developer effort and slower-than-necessary Linux kernel SMP memory barriers until 2007 (https://lwn.net/Articles/252110/) when they started taking full advantage of x86's strong memory ordering for inter-core communication (as opposed to for MMIO). Before that I think they might have been using sfence instead of just a compile-time barrier for smp_wmb (write memory barrier = release fence). – Peter Cordes Feb 03 '24 at 02:28
  • There was also a config option to work around non-TSO x86 by using full barriers instead, CONFIG_X86_PPRO_FENCE which people didn't enable by default but cluttered up some of that code with ifdefs. Fully removed in 2018: https://lkml.iu.edu/hypermail/linux/kernel/1803.2/02384.html (The removed code in the diff includes some comments and documentation on why Linux thought this might be necessary.) – Peter Cordes Feb 03 '24 at 02:28
  • SC ... by requiring a processor/cache to get ownership of any line before any access - Including reads? So multiple readers would contend with each other, like if MESI didn't have the Shared state? That would be pretty bad for modern OSes, like defeating RCU (read-copy-update) and SeqLock patterns that make readers truly read-only, not even contending to take a readers/writers lock. I don't know much about the software on IBM Z-series mainframes. – Peter Cordes Feb 03 '24 at 03:53
  • 1
    Don't need @PeterCordes because we are the only people discussing ... // My only significant objection to your edits was saying "PC = release/acquire", without the qualification "like x86 all the time", so I deleted that. // Although this is a good example of how the load-acquire or store-release in models like WC or RC, or which we considered for P6, is not necessarily the same as the store-release in ARM or RISC-V. Some flavors of store-release were (or are) globally fencing, not the pairwise. I sometimes wish that Sarita had used different words for her pairwise memory models. – Krazy Glew Feb 03 '24 at 22:53
  • 1
    Yes, as I mentioned the multicore issue that I remembered, I couldn't figure out how it would have been a problem. Hence my waffling. I'm sure you are 100% correct, forwarding between hyper threads in a shared store buffer is much more likely to have been the problem. – Krazy Glew Feb 03 '24 at 22:55
  • 3
    Thanks for the clarification. BTW, in Stack Overflow markdown you can get actual headers of different sizes like ### header 3 / #### header 4 which you might want to use instead of ---+. – Peter Cordes Feb 03 '24 at 23:05
  • 1
    My bad, don't need exclusive access to location before reading. But may need to guarantee that nobody else has exclusive access and is writing. No WT cache I knew did anything special before writing - up until I heard about the IBM systems. Let alone UC. // Q: are there other systems like IBM with ownership for UC and WT? // Elsewhere have talked about sender and receiver implementations of memory ordering. Nice thing about special -acquire and -release instructions, whether loads, stores, or fences, is that they allow implementation choice. – Krazy Glew Feb 03 '24 at 23:08
  • Thanks. I forgot about ### etc. because I've been jumping around between Markdown systems. But also because I find it very hard to distinguish the different header depths based purely on font. Also when copy/pasting. So I often use the twiki style ---+++ anyway. // In fact in my own tooling for wiki/docs such "section header level visibility" is a user option. Implicit when using dot decimal section numbering like 1.2.3. Trying to come up with alternatives that are less obnoxious than twiki ---++++, but still more visible than what most Markdown do. – Krazy Glew Feb 03 '24 at 23:19
  • One of my 1st interactions with Linus was telling him that he should use LOCKed instructions as acquire operations, albeit rmw load-acquire/store-release. Unfortunately, at that time I could not safely say ordinary stores are releases on x86. They are/were not. // BTW, exactly what did the excessively slow code look like? That would be a good Q&A you can ask and answer yourself. – Krazy Glew Feb 03 '24 at 23:35
18

The first standard for x86-based multi-processing systems was Intel’s MultiProcessor Specification, first published in 1994 – so there weren’t really any “typical” 486 SMP systems before then. (There were 486-based SMP systems, and Intel provided components such as the 82489DX APIC that were expected to be used when building such systems, but there was no standard for them to adhere to.) Version 1.4 of that specification explicitly requires the following:

  • Maintaining cache coherency—When one processor accesses data cached in another processor’s cache, it must not receive incorrect data. If it modifies data, all other processors that access that data also must not receive stale data. External caches must maintain coherency among themselves, and with the main memory, internal caches, and other bus master DMA devices.
  • Cache flushing—The processor can generate special flush and write-back bus cycles that must be used by external caches in a manner that maintains cache coherency. The actual responses are implementation-specific and may vary from design to design. A program can initiate hardware cache flushing by executing a WBINVD instruction. This instruction is only guaranteed to flush the caches of the local processor. See Appendix B for system-wide flushing mechanisms. Given that cache coherency is maintained by hardware, there is no need for software to issue cache flush instructions under normal circumstances.
  • Reliable communication—All processors must be able to communicate with each other in a way that eliminates interference when more than one processor accesses the same area in memory simultaneously. The processor uses the LOCK# signal for this purpose. External caches must ensure that all locked operations are visible to other processors.
  • Write ordering—In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed. External write buffers must maintain the write ordering of the processor.

Given that the 486 is a strictly in-order CPU, the cache coherency and memory locking requirements described above provide strong ordering guarantees across the whole system.
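The "reliable communication" and "write ordering" requirements are what today's lock-free code leans on. As a minimal sketch of the locked-operation part (mine, not from the MPS document): a counter incremented from several threads stays exact only because the increment is an atomic read-modify-write, which on a 486 asserts the LOCK# signal the specification refers to (modern CPUs usually achieve the same effect with a cache lock).

```cpp
// Shared counter incremented from multiple threads.  fetch_add compiles to a
// LOCK-prefixed read-modify-write on x86; a plain ++ on a non-atomic int
// would lose updates, because the load / increment / store sequences of
// different CPUs can interleave.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> counter{0};

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 100000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // LOCK ADD / LOCK XADD
        });
    for (auto& th : threads) th.join();
    std::printf("counter = %ld (expected 400000)\n", counter.load());
}
```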

Pre-MPS symmetric multi-processor systems might not have implemented such strict guarantees, but I imagine most if not all of them did in practice. For example, the 1991 Digital 433MP System Overview explains:

Cache coherency guarantees that no processor loses track of whether its cached data is up-to-date or not. The cache coherency hardware in the applicationDEC system prevents a processor from changing data without informing other processors. Whenever a processor caches system memory it marks the memory as taken. If it changes or knows that it will have to change the data, it marks the data so that other processors are prevented from copying the data until the current operations on the cache are complete. The cache coherency scheme implemented in the applicationDEC system ensures that when one processor caches data, the other processors are not affected.

Stephen Kitt
  • 121,835
  • 17
  • 505
  • 462
  • 1
    The NCR Voyager is another interesting x86 SMP implementation, dating back at least to the 486, and supported for a few years by Linux in the mid- to late noughts. – Stephen Kitt Feb 01 '24 at 15:30
  • 2
    There were even some 386 SMP systems, with sequentially-consistent memory ordering (like TSO but without a store buffer or instruction pipelining, so no StoreLoad reordering either.) e.g. https://cs.hac.ac.il/staff/martin/Micro_Modern/slide08.pdf has a photo of a "Makbilan Parallel Computer" from 1989: sixteen 386 single-board computers with a shared address-space. Also https://www.os2museum.com/wp/386-cache-coherency/ - 386's bus was designed to support cache coherency for external caches, and for non-Intel chips with cache. – Peter Cordes Feb 01 '24 at 21:35
  • 2
    Note that modern x86's TSO memory model is what you get from coherent caches and in-order execution with a store buffer + store forwarding, like 486 and Pentium (with in-order commit from the store buffer, not looking past the oldest entry on cache miss stores.) PPro was the first x86 Intel CPU that had to intentionally limit what memory-ordering it could make visible, by speculatively loading early but rolling back (machine_clears.memory_ordering perf event) if the cache line was invalidated before the load was architecturally allowed to happen. – Peter Cordes Feb 01 '24 at 21:42