
I read a number of articles and S/O answers saying that (on Linux x86_64) FS (or GS in some variants) references a thread-specific page table entry, which then gives an array of pointers to the actual data, which lives in shareable memory. When threads are swapped, all the registers are switched over, so the thread's base page changes with them. Thread-local variables are accessed by name with just one extra pointer hop, and the referenced values can still be shared with other threads. All good and plausible.
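
To make that model concrete, here is a sketch of what I understand the compiler to emit for an initial-exec TLS access (my reading only, e.g. gcc -O2 -ftls-model=initial-exec; the variable name is made up):

extern __thread int counter;   // defined elsewhere, one copy per thread

int get(void)
{
    return counter;
    // compiles to roughly:
    //     mov  counter@gottpoff(%rip), %rax   // hop 1: counter's offset, from the GOT
    //     mov  %fs:(%rax), %eax               // FS base + offset = this thread's copy
}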

Indeed, if you look at the code for __errno_location(void), the function behind errno, you find something like this (this is from musl, but glibc is not much different):

static inline struct pthread *__pthread_self()
{
    struct pthread *self;
    __asm__ __volatile__ ("mov %%fs:0,%0" : "=r" (self) );
    return self;
}

And from glibc:

=> 0x7ffff6efb4c0 <__errno_location>:    endbr64
   0x7ffff6efb4c4 <__errno_location+4>:  mov    0x6add(%rip),%rax        # 0x7ffff6f01fa8
   0x7ffff6efb4cb <__errno_location+11>: add    %fs:0x0,%rax
   0x7ffff6efb4d4 <__errno_location+20>: retq

So my expectation was that the actual value of FS would change for each thread. E.g. under the debugger (gdb: info registers, or p $fs) I would see a different value of FS in different threads. But no: ds, es, fs and gs are all zero, all the time.

In my own code, I write something like the below and get the same: FS is unchanged, but the TLS variable "works":

#include <ostream>  // for std::ostream

struct Segregs
{
    unsigned short int  cs, ss, ds, es, fs, gs;
    friend std::ostream& operator << (std::ostream& str, const Segregs& sr)
    {
        str << "[cs:" << sr.cs << ",ss:" << sr.ss << ",ds:" << sr.ds
            << ",es:" << sr.es << ",fs:" << sr.fs << ",gs:" << sr.gs << "]";
        return str;
    }
};

Segregs GetSegRegs()
{
    unsigned short int  r_cs, r_ss, r_ds, r_es, r_fs, r_gs;
    __asm__ __volatile__ ("mov %%cs,%0" : "=r" (r_cs) );
    __asm__ __volatile__ ("mov %%ss,%0" : "=r" (r_ss) );
    __asm__ __volatile__ ("mov %%ds,%0" : "=r" (r_ds) );
    __asm__ __volatile__ ("mov %%es,%0" : "=r" (r_es) );
    __asm__ __volatile__ ("mov %%fs,%0" : "=r" (r_fs) );
    __asm__ __volatile__ ("mov %%gs,%0" : "=r" (r_gs) );
    return {r_cs, r_ss, r_ds, r_es, r_fs, r_gs};
}

But the output?

Main: Seg regs : [cs:51,ss:43,ds:0,es:0,fs:0,gs:0]
Main:    tls    @0x7ffff699307c=0
Main:    static @0x96996c=0
 Modified to 1234
Main:    tls    @0x7ffff699307c=1234
Main:    static @0x96996c=1234

 Async thread
[New Thread 0x7ffff695e700 (LWP 3335119)]
Thread: Seg regs : [cs:51,ss:43,ds:0,es:0,fs:0,gs:0]
Thread:  tls    @0x7ffff695e6fc=0
Thread:  static @0x96996c=1234

So something else must actually be going on. What extra trickery is happening, and why add the complication?

For context I'm trying to do something "funky with forks", so I would like to know the gory detail.

  • GDB can show you the segment register values; you don't need to write inline asm. But your way does give nice compact output, good for posting. – Peter Cordes Dec 16 '20 at 19:31
  • @PeterCordes Indeed it does, but I was getting to the point that I didn't trust it and wanted to see for myself :-) – Gem Taylor Dec 18 '20 at 12:15

1 Answer


In 64-bit mode, the actual contents of the 16-bit FS and GS segment registers are normally the "null selector" (0), because other mechanisms are used to set the segment bases with full 64-bit values: a write to an MSR, or the wrfsbase instruction.

As in protected mode, there are separate "FSBASE" and "GSBASE" registers within the CPU, and when you apply, say, an FS segment override to an instruction, the base address from the FSBASE register is added to the operand's effective address to produce the actual linear address to be accessed.
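
That addition is easy to see with a single FS-override load. A minimal sketch (GNU C, x86-64; it leans on the x86-64 TLS ABI convention that the word at offset 0 of the thread control block is a pointer to the block itself, which is exactly what the musl snippet in the question reads):

static inline void *thread_pointer(void)
{
    void *tp;
    // effective address 0 + hidden FSBASE = linear address FSBASE,
    // so this loads the TCB's self-pointer: FSBASE's own value
    __asm__ __volatile__ ("mov %%fs:0, %0" : "=r" (tp));
    return tp;
}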

The kernel's context structure for each thread stores a copy of its FSBASE and GSBASE registers, and they are reloaded appropriately on each context switch.

So what actually happens is that each thread sets its FSBASE register to point to its own thread-local storage. (Depending on the CPU features and OS design, this may only be possible for privileged code, so a system call may be required.) Then instructions with an FS segment override can be used to access an object with a given offset in the thread-local storage block, as you've seen.
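
You can watch the per-thread base from user space. A minimal sketch (Linux x86-64, gcc -pthread; ARCH_GET_FS via the arch_prctl syscall, which the comments below also mention, reads the calling thread's FSBASE):

#define _GNU_SOURCE
#include <asm/prctl.h>    // ARCH_GET_FS
#include <sys/syscall.h>  // SYS_arch_prctl
#include <unistd.h>       // syscall
#include <pthread.h>
#include <stdio.h>

static __thread int tls_var;  // lives in the FS-based TLS block

static void *show(void *name)
{
    unsigned long base = 0;
    syscall(SYS_arch_prctl, ARCH_GET_FS, &base);  // kernel returns this thread's FSBASE
    printf("%s: FS base = %#lx, &tls_var = %p\n",
           (const char *)name, base, (void *)&tls_var);
    return NULL;
}

int main(void)
{
    pthread_t t;
    show("main");
    pthread_create(&t, NULL, show, "child");
    pthread_join(t, NULL);
    // Each thread should report a different base, with its own tls_var
    // a fixed offset from it, even though `p $fs` in gdb shows 0 for both.
    return 0;
}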


In 32-bit mode, on the other hand, the values in FS and GS do have more meaning; they are segment selectors which are used to index into a descriptor table maintained by the kernel. The descriptor table holds the actual segment info, including its base address, and you could use a system call to ask the kernel to modify it. Each thread would have its own local descriptor table, so you wouldn't necessarily see different selectors in FS for different threads, but it would still be the case that FS-override instructions from different threads would result in accesses to different linear addresses.

(Or a 32-bit kernel could write into a GDT entry and mov a constant from a register into fs or gs to get it to reload that newly-written GDT entry. That way it would only need a GDT per logical core instead of an LDT per process. The CPU never reloads a segment descriptor on its own, although with a per-core GDT the entry would still match the current task if you had separate entries for FS and GS. So user-space might not break itself with mov eax,gs / mov gs,eax.)
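
For completeness, a sketch of that 32-bit descriptor dance (assumptions: gcc -m32 on Linux, using the set_thread_area syscall; i386 Linux uses GS for libc's own TLS, so the old selector is saved and restored before calling back into libc):

#define _GNU_SOURCE
#include <asm/ldt.h>      // struct user_desc
#include <sys/syscall.h>  // SYS_set_thread_area
#include <unistd.h>
#include <stdio.h>

static char tls_block[4096];

int main(void)
{
    struct user_desc d = {
        .entry_number = -1,                       // -1: kernel picks a free GDT slot
        .base_addr    = (unsigned long)tls_block, // segment base = our block
        .limit        = sizeof(tls_block) - 1,    // byte-granular limit
        .seg_32bit    = 1,
        .useable      = 1,
    };
    if (syscall(SYS_set_thread_area, &d) != 0)
        return 1;

    unsigned short old_gs, sel = (d.entry_number << 3) | 3;  // GDT, RPL 3
    __asm__ __volatile__ ("mov %%gs, %0" : "=r" (old_gs));
    __asm__ __volatile__ ("mov %0, %%gs" : : "r" (sel));     // CPU reads our descriptor
    __asm__ __volatile__ ("movl $42, %%gs:0" ::: "memory");  // store at base + 0
    __asm__ __volatile__ ("mov %0, %%gs" : : "r" (old_gs));  // give libc its TLS back

    printf("tls_block starts with %d\n", *(int *)tls_block); // 42
    return 0;
}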

Anyway, the 32-bit mechanism really just reflects the lack of a convenient MSR or wrfsbase way to set a segment base separately from having mov Sreg, r/m trigger the CPU to load a descriptor. In either protected or long mode, the segment register value does need to be valid (including null = 0), and moving some random value into it will likely fault.
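
The experiment Peter Cordes describes in the comments (in NASM) can also be done from C; a sketch (x86-64 Linux; the valid FS load clobbers the hidden base and thus glibc's TLS, so the base is saved and restored with arch_prctl, and nothing TLS-dependent runs in between):

#define _GNU_SOURCE
#include <asm/prctl.h>    // ARCH_GET_FS / ARCH_SET_FS
#include <sys/syscall.h>  // SYS_arch_prctl
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    unsigned long fsbase;
    unsigned short ds_val, junk = 12345;
    syscall(SYS_arch_prctl, ARCH_GET_FS, &fsbase);           // save the hidden base

    __asm__ __volatile__ ("mov %%ds, %0" : "=r" (ds_val));   // 0 on x86-64 Linux
    __asm__ __volatile__ ("mov %0, %%fs" : : "r" (ds_val));  // OK: null selector is legal
    syscall(SYS_arch_prctl, ARCH_SET_FS, fsbase);            // restore base; TLS works again

    puts("survived the valid selector");
    __asm__ __volatile__ ("mov %0, %%fs" : : "r" (junk));    // #GP -> SIGSEGV: bogus selector
    puts("never reached");
    return 0;
}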

  • Ah that makes more sense! I think it is still worth saying that this gets over the 64K counting limit that segment registers have. I guess for 32-bit mode 64K threads were considered adequate, but not with 64-bit. – Gem Taylor Dec 16 '20 at 18:03
  • @GemTaylor: I guess whether there was a 64K limit in 32-bit mode would depend on the threading implementation: whether there was a separate LDT for each thread, or only one per process. If you were doing "lightweight" threads in userspace, then maybe you'd only have one LDT, and need a different selector for each thread. So then you only have to reload FS when switching threads. For kernel-supported threads, you could have an LDT per thread; there'd be no need for unique selectors so you'd never run out. – Nate Eldredge Dec 16 '20 at 18:10
  • @GemTaylor: For 386, yes (or use LDT). But modern 32-bit kernels use the same mechanism as 64-bit kernels to read or write FS.base or GS.base: `rdmsr` and `wrmsr` using MSR number `C000_0100h` for FS (MSR_FS_BASE), or on even newer CPUs, `wrfsbase`/`rdfsbase` which even lets the kernel allow user-space to modify the segment bases. ([When the CPU is in kernel mode, can it read and write to any register?](https://stackoverflow.com/q/55746156)). – Peter Cordes Dec 16 '20 at 19:21
  • The legacy mechanism of actually writing FS or GS to trigger a GDT or LDT read works in 64 or 32-bit mode I think, but is avoided in 32-bit mode because it's slower and can't read back the current bases to save across context switches. (And can only set a 32 bit base, so not even fully usable for 64-bit kernels. The MSRs are guaranteed available for x86-64 so a 64-bit kernel doesn't need a fallback for 386 compat.) At least that's my understanding, but I'm pretty sure a 32-bit Linux kernel would still have the same FS value for every thread. Not just 32-bit user space under 64 (compat mode) – Peter Cordes Dec 16 '20 at 19:21
  • And BTW, **FS and GS *do* have meaning in 64-bit mode**. 64-bit code can crash itself (by raising an exception) by moving a bad value into FS or GS. But copying DS to FS doesn't crash because it's a valid selector. Try it with `mov eax, ds` / `mov fs, eax` / `mov edx, 12345` / `mov fs, edx` built as a 64-bit static executable for Linux. (nasm and ld). Segfault on the `mov fs, edx` but not `mov fs, eax`. Normally a 64-bit kernel leaves them with the null selector value (`0`) because it can (in long mode and compat mode), not because no other value would have meaning. – Peter Cordes Dec 16 '20 at 19:27
  • Thanks! I have shown in a quick test that, in principle, copying this FS base value with arch_prctl will give me the behaviours I need. – Gem Taylor Dec 18 '20 at 12:19