How is the ‘Coprocessor segment overrun’ exception supposed to be handled?

Question

The Intel 80386 CPU didn't have an on-board x87 FPU (maybe with the exception of some non-Intel clones). It was, however, able to use either a 80287 or 80387 as an external FPU. When the x87 FPU accesses memory, the CPU makes the necessary privilege checks and generates an exception if the access is illegal (not enough privilege, page/segment not present etc.).

On CPUs with an on-board FPU, the exception generated is the General Protection Fault (interrupt 13), just like all other non-FPU illegal accesses.

On CPUs without an on-board FPU (like the 80386), the exception generated is the Coprocessor Segment Overrun Exception (interrupt 9), unlike all other memory accesses which use the GPF.

The Intel i386 manual states the following about how INT9 should be handled (emphasis is mine):

The addresses of the failed numeric instruction and its operand may be lost; a FSTENV instruction does not return reliable numeric coprocessor state information. The coprocessor-segment-overrun exception should be handled by executing a FNINIT instruction (i.e., a FINIT instruction without a preceding WAIT instruction). The return address on the stack might not point to either the failed numeric instruction or the instruction following the failed numeric instruction. The failed numeric instruction is not restartable; however, the interrupted task may be restartable if it did not contain the failed numeric instruction.

What I understand from this is that if the FPU makes an illegal memory access then its entire state cannot be recovered. Am I actually getting this right?

What happens if the illegal access happened, for example, because the memory was swapped to disk? Normally the OS would load the missing page to memory, put it back in the virtual address space and continue execution. However, with this behavior, it is impossible to do that because ‘The failed numeric instruction is not restartable’. You cannot get the FPU state either, all that can be done is to FNINIT and kill the innocent application.

I also don't understand this statement very well: ‘The return address on the stack might not point to either the failed numeric instruction or the instruction following the failed numeric instruction’. Does this mean that the return address is undefined and all the OS can do is kill the application and return to some other task by crafting a return address on the stack?

Is a segment overrun exception generated on a page fault, or merely in the particular case of a segment overrun? The 80286 segment model was based on the assumption that entire segments would be swapped in and out as a unit, so there was no need to handle the possibility that an access intended for the FPU might straddle the end of a segment. — supercat, Oct 22 '21 at 20:49
I understand that yes, they do generate on page faults. But I might have understood the clumsy wording wrong, so here's the manual (page 253 in the PDF). — DarkAtom, Oct 22 '21 at 20:58
Based on my reading of that section, the processor checks that the first and last bytes of the operand are within addressable memory before telling the FPU to expect data. If neither is within addressable storage, a normal fault would occur. The error that triggers an unrecoverable segment overrun fault can only happen under weird (almost certainly contrived) scenarios where the first and last bytes of the operand would have valid addresses but some bytes in the middle would not. Ordinary page faults that could hit a middle byte would also hit on a first or last byte, allowing normal handling. — supercat, Oct 22 '21 at 21:06
Yeah, reading again with your argument in mind makes it very clear. Basically, OSes which use flat 4GB descriptors don't have to worry about this exception at all and, for other OSes, they can simply treat this as a normal illegal memory access and kill the application. — DarkAtom, Oct 22 '21 at 21:41
So basically, an obscure corner case that they weren't able to design a good solution for, so they decided to essentially just have the CPU give up, and they documented that. The message between the lines in the documentation is "This can only happen if your OS manages memory in a dumb way, so don't do that". — Nate Eldredge, Oct 23 '21 at 18:35
I have the 1986 edition which gives more explicit advice: align segment starts on page boundaries, and don't make segments that are just slightly less than maximum size. It then says "If neither software system design constraint is acceptable, the exception handler should execute FNINIT and should probably terminate the task." What they seem to really mean is "If neither software system design constraint is acceptable, then your design is stupid and we can't reasonably be expected to support it." — Nate Eldredge, Oct 23 '21 at 18:41
I am still wondering why it has to be a different exception just for this. What constraint stopped Intel from making this a GPF? Why would the FPU care if it's this specific case of an illegal memory access and not other cases? — DarkAtom, Oct 23 '21 at 18:51
@DarkAtom: If a GPF occurs, it would be common to reconfigure memory and then restart the instruction where the fault occurred. Any time an instruction fails in such a way that reconfiguring memory would not allow a smooth restart, a different interrupt should be used to indicate that. The problem here I think is that some of the side effects associated with FPU instructions may occur before the FPU has received all of the associated data, and precisely which side effects those are might vary depending upon the exact type of chip one is using, or maybe even the die revision. — supercat, Oct 24 '21 at 04:30
@DarkAtom: Further, at least on the 8086 the way the FPU worked, there was a special family of opcodes that would cause the CPU to put data for the FPU on the data bus while using some wires dedicated to that purpose to indicate when the FPU should look there. Once the FPU starts receiving some of the data, I don't know that there is any signaling protocol to indicate that it's not going to receive all of the data it should expect and that it thus needs to abort the operation. Maybe the 80286 and 80386 added such functionality, but I wouldn't be surprised if they didn't. — supercat, Oct 24 '21 at 04:37
@supercat When a GPF occurs, the OS checks the cause of the fault. If the cause is that the user tried to execute cli or some other illegal operation then there's no point in restarting the operation and the application must be killed (or maybe there are some signal handlers involved). This exception is no different, it's just a special case of the GPF. But I guess it makes sense, a GPF might be restartable, while this one will never be. — DarkAtom, Oct 24 '21 at 08:27
@DarkAtom: The general nature of a GPF is that if the state of the universe changes before a GPF handler returns, whatever operation had been attempted will behave as though the state of the universe had magically changed just before it was attempted the first time. If a segment-bounds fault occurred in the middle of most operations, adjusting the segment bound and returning would result in the operation executing as though the segment bound had held the correct value from the start, but in the coprocessor overrun case it cannot. — supercat, Oct 28 '21 at 16:28
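
The reading discussed in these comments (first and last bytes checked up front, middle bytes only discovered during the transfer) can be illustrated with a small model. This is only a sketch under assumptions: `classify_operand` and `offset_ok` are hypothetical names, the check is byte-granular, the segment is expand-up, and a violation at either end of the operand is assumed to be caught before the transfer starts. It is not a description of the actual 80386 limit-check logic.

```python
def offset_ok(offset, limit):
    """True if a byte offset lies inside an expand-up segment with this limit."""
    return offset <= limit


def classify_operand(start, size, limit, addr_bits=16):
    """Classify an FPU memory operand by which of its bytes violate the limit.

    Offsets wrap around at 2**addr_bits, which is how the first and last
    bytes can both be valid while bytes in the middle are not.
    """
    mask = (1 << addr_bits) - 1
    offsets = [(start + i) & mask for i in range(size)]
    first_ok = offset_ok(offsets[0], limit)
    last_ok = offset_ok(offsets[-1], limit)
    middle_ok = all(offset_ok(o, limit) for o in offsets[1:-1])
    if not (first_ok and last_ok):
        # Assumed to be detected before any data moves: a normal fault.
        return "ordinary fault"
    if not middle_ok:
        # Only discovered partway through the transfer: not restartable.
        return "coprocessor segment overrun"
    return "ok"


# A segment whose limit is just below the 64 KiB maximum.
print(classify_operand(0x1000, 8, 0xFFFD))  # operand well inside the segment
print(classify_operand(0xFFF8, 8, 0xFFFD))  # last byte past the limit
print(classify_operand(0xFFFC, 8, 0xFFFD))  # wraps: ends valid, middle invalid
```

The third call shows why the 1986 manual's advice against segments "just slightly less than maximum size" removes the problem in this model: with a limit of 0xFFFF there are no invalid offsets for a wrapped operand to cross.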

score 6 · Answer 1 · edited Nov 25 '23 at 14:43

The exception dates back to the 80286/80287. Intel 80286 CPU: Real Mode Emulation says that

The current case that cannot be restarted in general is any floating point operand reference where the second or subsequent word exceeded a segment limit. The exception 9 handler must execute FNINIT before any other WAIT or ESC instruction. The internal status of the 80287 cannot be read until it is forced idle by FNINIT. The FNINIT instruction will mark all floating point data registers as empty, set top of stack to 0, and mask all errors. The numeric instruction and data addresses stored in the 80287 will correctly point at the failing instruction. If the 80286 program interrupted by the math address error is not the program that executed the failed ESC instruction, then that program can be restarted.
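
The handling policy in that passage can be sketched as a toy model. Everything here is illustrative: `FakeFpu`, `handle_int9`, and the idea that the OS records an `fpu_owner` task are assumptions made for the example, and terminating the owner follows the "should probably terminate the task" advice quoted in the comments rather than anything mandated by the hardware.

```python
class FakeFpu:
    """Toy stand-in for the coprocessor state that FNINIT resets."""

    def __init__(self):
        self.busy = True                 # still waiting for data after the overrun
        self.tags = ["valid"] * 8        # data registers hold values
        self.top = 3                     # arbitrary top-of-stack position
        self.errors_masked = False

    def fninit(self):
        """Effects named in the quoted text: idle, registers empty, TOP = 0, errors masked."""
        self.busy = False
        self.tags = ["empty"] * 8
        self.top = 0
        self.errors_masked = True


def handle_int9(fpu, interrupted_task, fpu_owner):
    """Return (tasks_to_terminate, tasks_to_resume) for exception 9.

    fpu_owner is a hypothetical OS-side record of which task issued the
    failed ESC instruction.
    """
    fpu.fninit()  # must come before any WAIT or ESC instruction
    terminate = [fpu_owner]
    # Only a task that did not execute the failed instruction can be resumed.
    resume = [] if interrupted_task == fpu_owner else [interrupted_task]
    return terminate, resume


fpu = FakeFpu()
print(handle_int9(fpu, interrupted_task="editor", fpu_owner="solver"))
print(fpu.busy, fpu.top, fpu.tags[0])
```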

score 5 · Answer 2 · edited Feb 12 '23 at 22:07


The Intel 80386 CPU [...] use[d] either a 80287 or 80387 as an external FPU. When the x87 FPU accesses memory,

No, unlike the 8087 which was a real coprocessor, 287 and later are I/O devices. They do not make any memory access on their own. All access is handled by the 286/386.

the CPU makes the necessary privilege checks and generates an exception if the access is illegal (not enough privilege, page/segment not present etc.).

For what it can check.

On CPUs without an on-board FPU (like the 80386), the exception generated is the Coprocessor Segment Overrun Exception (interrupt 9), unlike all other memory accesses which use the GPF.

Because GPF can only be raised if the CPU knows ahead of time that an access will be problematic.

What I understand from this is that if the FPU makes an illegal memory access then its entire state cannot be recovered. Am I actually getting this right?

Yes ... err ... no. The FPU does not make any access; the CPU does. For all instructions that move a defined amount of data, the CPU can check the whole operand up front and raise the usual exception. But instructions like FSTENV handle an amount of data that is unknown to the CPU, so the CPU can only check the start address of that transfer beforehand. After that, a DMA-like transfer follows:

1. The CPU prepares an address pointer using the address given in the instruction
2. Check address for validity
3. If not -> GPF
4. CPU checks for BUSY active
5. If BUSY is inactive, transfer is ended -> exit loop
6. CPU takes a 16-bit (287) or 32-bit (387) word from the FPU's data port (800000FCh)
7. CPU checks for address being a valid one
8. If not -> Coprocessor Segment Overrun Exception
9. CPU stores 16/32 bit word at the address pointer
10. Address pointer gets incremented by 2/4 depending on FPU
11. Go to step 4

(see here for a description of the interface)
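
The loop described in the steps above can be modelled in a few lines. This is a minimal sketch of the description in this answer, not of the real bus protocol: `store_fpu_block` is a hypothetical name, the validity check is reduced to a simple expand-up segment limit, and the end of the `words` sequence stands in for the FPU dropping BUSY.

```python
def store_fpu_block(words, start, limit, step=2):
    """Simulate the CPU-driven store loop; returns (outcome, memory_written).

    words: what the FPU hands over, one word per iteration. The CPU does
           not know its length in advance.
    step:  2 for a 287-style 16-bit transfer, 4 for a 387-style 32-bit one.
    """
    memory = {}
    pointer = start                        # step 1
    if pointer + step - 1 > limit:         # steps 2-3: checked before any data moves
        return "GPF", memory
    for word in words:                     # steps 4-6: loop until the FPU is done
        if pointer + step - 1 > limit:     # step 7
            # Step 8: the word has already left the FPU, so it is lost.
            return "coprocessor segment overrun", memory
        memory[pointer] = word             # step 9
        pointer += step                    # step 10
    return "done", memory


print(store_fpu_block([1, 2, 3], start=0x0100, limit=0xFFFF))
print(store_fpu_block([1, 2, 3], start=0xFFFC, limit=0xFFFD))
print(store_fpu_block([1], start=0xFFFE, limit=0xFFFD))
```

In the second call the first word is stored and the second is taken from the FPU before the bad address is noticed, which is the model's version of why the instruction cannot simply be restarted.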

What happens if the illegal access happened, for example, because the memory was swapped to disk? Normally the OS would load the missing page to memory, put it back in the virtual address space and continue execution. However, with this behavior, it is impossible to do that because ‘The failed numeric instruction is not restartable’. You cannot get the FPU state either, all that can be done is to FNINIT and kill the innocent application.

Exactly that. Although the innocence might be debatable.

The "DMA" protocol used does not allow any restart, as it does not include any signalling to tell the 287/387 to interrupt the ongoing transfer.

I also don't understand this statement very well: ‘The return address on the stack might not point to either the failed numeric instruction or the instruction following the failed numeric instruction’. Does this mean that the return address is undefined and all the OS can do is kill the application

Abort is the only solution.

Also, think about it: even if the address could be recovered, the only useful action would still be an abort, as the FPU state after that is undefined. That means it is, for all practical purposes, corrupted, so any continuation from that point is useless at best and harmful at worst.

Of course, if OS and environment provide a user side abort handler, more sophisticated applications get a chance to restart from some save point.

edited Feb 12 '23 at 22:07 by Sep Roland
answered Feb 12 '23 at 17:32 by Raffzahn

One sentence ended with "or" ("... to interrupt the ongoing transfer or"). I removed it in my edit, but maybe you were planning on including something additional there? – Sep Roland Feb 12 '23 at 20:55
Oh ... well, short term memory doesn't help, so let's just drop it :)) – Raffzahn Feb 12 '23 at 21:04
In fact, the documentation says the only option is an FNINIT. I wonder why other instructions like FNSAVE can't be used. – Yuhong Bao Feb 13 '23 at 02:39
This also reminds me that the 80386 started checking the end of the operand address so page fault for example can be raised instead of this exception. I assume the 80286 only checked the beginning, right? – Yuhong Bao Feb 13 '23 at 02:56
Of course I assume the 80386 began waiting until the data transfers complete before the next instruction, right? – Yuhong Bao Feb 13 '23 at 03:19
I wonder if the reason why FNINIT was chosen is that it is one of the few instructions that don't take a memory operand, so can be used to abort the data transfer. – Yuhong Bao Feb 13 '23 at 03:42
@YuhongBao Any can be used to abort. It's simply that FNINIT will always work and, more importantly, always gives a dedicated safe state to restart from. So why add more information? This isn't about some research topic, but about delivering a product that (hopefully) works and does what is described. You may of course experiment with other combinations, but what are they good for? Also, are you sure the 386 does check the end address? If yes, how? – Raffzahn Feb 13 '23 at 03:55
I was thinking that this exception always happens when the FPU is busy with a data transfer, and you will notice that the documentation is clear that even WAIT can't be used. – Yuhong Bao Feb 13 '23 at 05:56
http://bitsavers.trailing-edge.com/components/intel/80286/210498-005_80286_and_80287_Programmers_Reference_Manual_1987.pdf "The interrupt signals that the processor extension is requesting an invalid data transfer. The processor extension will always be busy when waiting on data...." – Yuhong Bao Feb 13 '23 at 06:04
This also reminds me of https://groups.google.com/g/comp.os.os2.misc/c/q0qTQNxCHgc/m/LsBaxzuVB1wJ – Yuhong Bao Feb 13 '23 at 06:55
Page faults would not work very well if the 80386 did not check the end address. Even the 80386 manual did not make claims otherwise. – Yuhong Bao Feb 13 '23 at 07:12
@YuhongBao again, how can the 286 check for the end address before it knows the end address? – Raffzahn Feb 13 '23 at 13:55
Note that I am talking about the 386 and not the 286 here, because the 286 probably didn't regardless of whether it is possible. – Yuhong Bao Feb 13 '23 at 13:56
That being said, you do have a good point here, especially since the 386 had to be backward compatible with the 287. – Yuhong Bao Feb 13 '23 at 14:11
http://bitsavers.trailing-edge.com/components/intel/_dataBooks/1986_80386_Hardware_Reference_Manual.pdf "The 80287 Processor Extension Acknowledge (PEACK#) input is pulled high. In an 80286 system, the 80286 generates PEACK# to disable the PEREQ output of the 80287 so that extra data is not transferred. Because the 80386 knows the length of the operand and will not transfer extra data, PEACK# is not needed or used in 80386 systems. " – Yuhong Bao Feb 13 '23 at 14:24
@YuhongBao you're aware that comments can be edited? Flooding this isn't exactly a good idea. Further, throwing in a 200+ page PDF without mentioning the page isn't helpful. I have all the original manuals in paper right beside me (did use 287 with 68k). Last, I would suggest to simply build a 287 system if you want to try and learn about such fringe cases. – Raffzahn Feb 13 '23 at 14:55
It is in "CHAPTER 5 COPROCESSOR HARDWARE INTERFACE" if you are interested. – Yuhong Bao Feb 13 '23 at 15:03
