
I am thinking about the problem described in 'Minimizing page faults (and TLB faults) while "walking" a large graph'.

'How to know whether a pointer is in physical memory or it will trigger a Page Fault?' is a related question that looks at the problem from the other side, but it does not have a solution.

I wish to be able to load some data from memory into a register, but have the load abort rather than trigger a page fault if the memory is currently paged out. I need the code to work in user space on both Windows and Linux without needing any non-standard permissions.

(Ideally, I would also like to abort on a TLB fault.)

Peter Cordes
Ian Ringrose
  • The load does actually abort with an exception. The OS will then load the page and let your program redo the load, so it's OS-dependent. Maybe `verr` (https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf#G9.260937) does the job of checking whether the load would fail or not, but I'm not sure about this one. – Domso Sep 07 '18 at 12:02
  • [`verr`](http://felixcloutier.com/x86/VERR:VERW.html) is useless for this: it only checks segment privs given a 16-bit segment selector, not an address. You'd use it like `mov eax, ds` / `verr ax` to ask if the data segment is readable. Spoiler alert: it is. – Peter Cordes Sep 07 '18 at 18:37

2 Answers


The RTM (Restricted Transactional Memory) part of the TSX feature allows suppressing exceptions:

Any fault or trap in a transactional region that must be exposed to software will be suppressed. Transactional execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never occurred.
[...]
Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XM, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally, and require a non-transactional execution. These events are suppressed as if they had never occurred.

I've never used RTM but it should work something like this:

xbegin fallback

  ; Don't fault here

xend

; Somewhere else
fallback:
  ; Retry non-transactionally

Note that a transaction can be aborted for many reasons, see chapter 16.8.3.2 of the Intel manual volume 1. Also note that RTM is not ubiquitous.

Besides RTM I cannot think of another way to suppress a load since it must return a value or eventually signal an abort condition (which would be the same as a #PF).

Margaret Bloom
  • I wish to stop the OS seeing the #PF, hence a fault I could handle in user space would also solve the problem. – Ian Ringrose Sep 07 '18 at 12:11
  • If RTM were more common it would be a great solution, as it would also make my thread locking easier. – Ian Ringrose Sep 07 '18 at 12:12
  • @IanRingrose RTM suppresses the #PF and it's the only thing I came up with. I can't think of another mechanism to abort a load; the x86 arch is not really built around the concept of "load abortion". Let's see if anyone has some other thoughts :) – Margaret Bloom Sep 07 '18 at 12:15
  • @IanRingrose It still has some issues; for instance, not all x86 cores support TSX. Maybe with C++20 and support in a higher-level language, it will become more common. (https://en.cppreference.com/w/cpp/language/transactional_memory) – Domso Sep 07 '18 at 12:18
  • Nice idea! @IanRingrose: there's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86. It's possible some other ISA has an instruction for that, but I'm not aware of any that do. As a design idea, that would only be useful for performance, not correctness, because there'd always be a gap between querying and using. A try_load insn that also sets/clears flags instead of raising #PF could avoid the race condition, but no ISA I know of has that either. – Peter Cordes Sep 07 '18 at 18:11
  • @PeterCordes One problem, to be useful, it has to be VERY fast in the case that the page is in the TLB (or memory). – Ian Ringrose Sep 07 '18 at 18:24
  • A TryLoad that aborted if the item was not in the CPU cache (level 3) would also be very nice... – Ian Ringrose Sep 07 '18 at 18:26
  • @IanRingrose: Being fast is not a problem; I don't see why a hypothetical `queryTLB` instruction would be any slower than a load that hits in L1dTLB and L1d cache, if the page is already hot. It would just run on a load port and produce an integer result. Intel load ports already know how to run broadcast-loads, zero-extending loads, and even `vmovddup ymm, m256` (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem. Same for a load that set flags according to whether or not we aborted the load instead of #PF. – Peter Cordes Sep 07 '18 at 18:32

There's unfortunately no instruction that just queries the TLB or the current page table with the result in a register, on x86 (or any other ISA I know of). Maybe there should be, because it could be implemented very cheaply.

(For querying virtual memory for pages being paged out or not, there is the Linux system call mincore(2), which produces a bitmap of present/absent flags for a range of pages (given as void *start / size_t length). That's maybe similar to the HW page tables so it probably could let you avoid page faults until after you've touched memory, but it's unrelated to TLB or cache. And maybe it doesn't rule out soft page faults, only hard. And of course that's only the current situation: pages could be evicted between query and access.)


Would a CPU feature like this be useful? Probably yes, for a few cases

Such a thing would be hard to use in a way that paid off, because every "false" attempt is CPU time / instructions that didn't accomplish any useful work. But a case like this could possibly be a win, when you don't care what order you traverse a tree / graph in, and some nodes might be hot in cache, TLB, or even just RAM while others are cold or even paged out to disk.

When memory is tight, touching a cold page could even evict a currently-hot page before you get to it.

Normal CPUs (like modern x86) can do speculative / out-of-order page walks (to fill TLB entries), and definitely speculative loads into cache, but not page faults. Page faults are handled in software by the kernel. Taking a page-fault can't happen speculatively, and is serializing. (CPUs don't rename the privilege level.)

So software prefetch can cheaply get the hardware to fill TLB and cache while you touch other memory, if the one you're going to touch 2nd was cold. If it was hot and you touch the cold side first, that's unfortunate. If there was a cheap way to check hot/cold, it might be worth using it to always go the right way (at least on the first step) in traversal order when one pointer is hot and the other is cold. Unless a read-only transaction is quite cheap, it's probably not worth actually using Margaret's clever answer.

If you have 2 pointers you will eventually dereference, and one of them points to a page that's been paged out while the other is hot, the best case would be to somehow detect this and get the OS to start paging in one page from disk in the background while you traverse the side that's already in RAM. (e.g. with Windows PrefetchVirtualMemory or Linux madvise(MADV_WILLNEED). See answers on the OP's other question: Minimizing page faults (and TLB faults) while "walking" a large graph)

This requires a system call, though, and system calls are expensive and pollute caches + TLBs, especially on current x86 where Spectre + Meltdown mitigation adds thousands of clock cycles. So it's not worth it to make a VM prefetch system call for one of every pair of pointers in a tree. You'd get a massive slowdown for cases when all the pointers were in RAM.


CPU design possibilities

Like I said, I don't think any current ISAs have this, but it would I think be easy to support in hardware with instructions that run kind of like load instructions, but produce a result based on the TLB lookup instead of fetching data from L1d cache.

There are a couple possibilities that come to mind:

  • a queryTLB m8 instruction that writes flags (e.g. CF=1 for present) according to whether the memory operand is currently hot in TLB (including 2nd-level TLB), never doing a page walk. And a querypage m8 that will do a page walk on a TLB miss, and sets flags according to whether there's a page table entry. Putting the result in a r32 integer reg you could test/jcc on would also be an option.

  • a try_load r32, r/m32 instruction that does a normal load if possible, but sets flags instead of taking a page fault if a page walk finds no valid entry for the virtual address. (e.g. CF=1 for valid, CF=0 for abort with integer result = 0, like rdrand. It could make itself useful and set other flags (SF/ZF/PF) according to the value, if there is one.)

The query idea would only be useful for performance, not correctness, because there'd always be a gap between querying and using during which the page could be unmapped. (Like the IsBadXxxPtr Windows system call, except that that probably checks the logical memory map, not the hardware page tables.)

A try_load insn that also sets/clears flags instead of raising #PF could avoid the race condition. You could have different versions of it, or it could take an immediate to choose the abort condition (e.g. TLB miss without attempting a page walk).

These instructions could easily decode to a load uop, probably just one. The load ports on modern x86 already support normal loads, software prefetch, broadcast loads, zero or sign-extending loads (movsx r32, m8 is a single uop for a load port on Intel), and even vmovddup ymm, m256 (two in-lane broadcasts) for some reason, so adding another kind of load uop doesn't seem like a problem.

Loads that hit a TLB entry they don't have permission for (kernel-only mapping) do currently behave specially on some x86 uarches (the ones that aren't vulnerable to Meltdown). See The Microarchitecture Behind Meltdown on Henry Wong's blog (stuffedcow.net). According to his testing, some CPUs produce a zero for speculative execution of later instructions after a TLB/page miss (entry not present). So we already know that doing something with a TLB hit/miss result should be able to affect the integer result of a load. (Of course, a TLB miss is different from a hit on a privileged entry.)

Setting flags from a load is not something that ever normally happens on x86 (only from micro-fused load+alu), so maybe it would be implemented with an ALU uop as well, if Intel ever did implement this idea.

Aborting on a condition other than TLB/page miss or L1d miss would require outer levels of cache to also support this special request, though. A try_load that runs if it hits L3 cache but aborts on L3 miss would need support from the L3 cache. I think we could do without that, though.

The low-hanging fruit for this CPU-architecture idea is reducing page faults and maybe page walks, which are significantly more expensive than L3 cache misses.

I suspect that trying to branch on L3 cache misses would cost you too much in branch misses for it to really be worth it vs. just letting out-of-order exec do its thing. Especially if you have hyperthreading so this latency-bound process can happen on one logical core of a CPU that's also doing something else.

Peter Cordes
  • It's not obvious whether the results of `queryTLB` and `try_load` can be used efficiently. If new flags need to be added, then we need new instructions to check these flags. Either way, an instruction needs to be added to check the results. Maybe a jump instruction? But then that would impact branch prediction. If the presence of TLB entries or data cache line is highly predictable, then why not just use software prefetching since we know which entries are most likely to be cold. If it is not highly predictable, then these additional jumps will cause a lot of mispredicts, degrading perf. – Hadi Brais Sep 16 '18 at 00:01
  • Software prefetching is useful when you know the order in which memory locations are accessed and you know none of them are hot, so they will have to be fetched. `queryTLB` and `try_load` might be useful in situations where that is not the case, perhaps to help get software prefetching to work. Although it's not clear to me whether such situations are rare or not. I can imagine that in graph traversals where it's possible to access a node from multiple paths, we indeed may not know whether a node is hot or cold in the cache. That makes sense. – Hadi Brais Sep 16 '18 at 00:05
  • Note that if we are going to use a form of `jcc` to check the flags, then we need such an instruction after every `queryTLB`/`try_load`. I'm not sure if the overall impact on perf would be positive. Consider case where we have 10 memory locations to access. Should we probe each one of them? That sounds like a lot of overhead and would make the code complicated. Using more clever algorithms or data structures might be better. – Hadi Brais Sep 16 '18 at 00:09
  • @HadiBrais: Huh, why would new flags be needed? CF or OF would be fine, and maybe set ZF/SF/PF according to the value as well. There's precedent for CF from instructions like `rdrand` setting CF on failure, and in case you want to do anything like `adc` to count not-present pages, CF is the special flag. – Peter Cordes Sep 16 '18 at 03:38
  • @HadiBrais: yes, this is probably only useful for optimizing an "unusual" traversal like GC, that walks the data structure in an order different from its normal use pattern. And yes, that many `jc` instruction is only justified if it saves a significant number of hard page faults (sleeps waiting for IO to page in a page, especially if it evicts a page that was another one of the 10 pointers.) But maybe even saving some TLB misses, if for some reason hugeTLB isn't sufficient. – Peter Cordes Sep 16 '18 at 03:44
  • I think reusing existing flags should be fine as long as register renaming for the flags is supported. The surrounding code might also use the same flags for other purposes. – Hadi Brais Sep 16 '18 at 03:49
  • @HadiBrais: yes, of course these hypothetical instructions would be write-only for flags, same as [`blsi eax, [rdi]`](http://felixcloutier.com/x86/BLSI.html) for example. Writing flags+an integer register is a totally solved problem internally. (From an ALU port at least, maybe not a load port.) IIRC, you or BeeOnRope found an Intel patent about using an integer PRF entry to hold a FLAGS result that went with an integer result. We know for sure that x86 CPUs rename FLAGS, because they have better than 1 per clock throughput for independent `add`. – Peter Cordes Sep 16 '18 at 03:55
  • Yes that was Bee. He posted that in some comment section. Maybe [this](https://patents.google.com/patent/US6047369A/en) one. – Hadi Brais Sep 16 '18 at 03:59