
I installed Windows NT 3.1 on a Compaq ProSignia 3080 system for several reasons: I know that this machine was running Windows NT 3.1 when it was in productive use, and I think it was one of the machines Microsoft explicitly targeted with Windows NT. For improved performance, I maxed out the RAM at 128MB and replaced the socketed Intel 486DX-33 with an AMD enhanced 486DX4-SV8B (write-back cache and SMM) in a voltage adapter socket. I postponed dealing with BIOS support for 486DX4 processors and jumpered the processor to a 2x multiplier. Without chipset support for L1 write-back and with a 2x multiplier, the processor is supposed to be software-compatible with the Intel 80486DX2-66, which is a supported option for that system.

Installing Windows NT 3.1 worked perfectly, but I really like to tinker with my retro stuff. The Windows NT 3.1 CD comes with the full set of debugging symbols, I'm curious why NetDDE logs an error in the event log, and the system crashes with a specific EISA Ethernet card (which might be due to faulty hardware), so I decided to dive into kernel debugging. Setting up kernel debugging is straightforward once you realize you should use the i386kd executable supplied with Windows NT 3.1 instead of kd/ntkd from the current Windows 10 development kit.

As soon as I try to break in (using Ctrl-C in i386kd), the target machine reboots instead of providing a kd> prompt.

I already tested the following:

  • Memory in that system is OK
  • The system files are not corrupted
  • There is no hardware watchdog active that reboots the machine when the kernel is interrupted for debugging
  • The USB-to-serial adapter I use (which seems to be a counterfeit PL2303) in the host communicates properly. It's not mis-sending some debugger command as a "reboot system" command (the KD protocol provides one, though!)
  • It's not related to some remote management or alerting options provided by the mainboard.
user3840170
Michael Karcher

1 Answer


Short explanation

The Windows NT 3.1 kernel is incompatible with enhanced 486 processors. Specifically, it is incompatible with 486 processors that provide the CPUID instruction. Kernel debugging works fine with the 486DX-33 that was originally installed in the machine, and with the older non-enhanced core in a write-through Am486DX4-NV8T without SMM.

If your goal is just toying around with NT 3.1 kernel debugging, you might want to use a processor that is compatible with Windows NT 3.1 out-of-the-box. If you are as curious as me, you might want to fix Windows NT. Keep reading in this case.

The underlying issue

The incompatibility is due to a bug in KiSaveProcessorControlState (and a similar bug in the counterpart KiRestoreProcessorControlState), which is called from three locations inside NTOSKRNL.EXE:

  1. When an exception is reflected to the kernel debugger using KdpTrap (If I use Ctrl-C to break into the kernel debugger, a breakpoint exception is raised from the break-in polling functionality in the timer tick interrupt)
  2. When KeBugCheckEx is called (i.e. the "blue screen")
  3. When KiSaveProcessorState is invoked. This appears to never happen, as this function is neither exported nor called from inside NTOSKRNL if the control flow analysis by IDA in NTOSKRNL.EXE is exhaustive.

This function is supposed to save the processor control registers into an extended CONTEXT structure. Its disassembly looks like this:

.text:80106740 ; __stdcall KiSaveProcessorControlState(x)
.text:80106740                 public _KiSaveProcessorControlState@4
.text:80106740 _KiSaveProcessorControlState@4 proc near
.text:80106740
.text:80106740 dest            = dword ptr  4
.text:80106740
.text:80106740                 mov     edx, [esp+dest]
.text:80106744                 xor     ecx, ecx
.text:80106746                 mov     eax, cr0
.text:80106749                 mov     [edx+0CCh], eax
.text:8010674F                 mov     eax, cr2
.text:80106752                 mov     [edx+0D0h], eax
.text:80106758                 mov     eax, cr3
.text:8010675B                 mov     [edx+0D4h], eax
.text:80106761                 mov     [edx+0D8h], ecx
.text:80106767                 cmp     ds:word_FFDFF138, 5
.text:8010676F                 jb      short @@before_pentium
.text:80106771                 mov     eax, cr4
.text:80106774                 mov     [edx+0D8h], eax
.text:8010677A @@before_pentium:
.text:8010677A                 mov     eax, dr0
.text:8010677D                 mov     [edx+0DCh], eax
.text:80106783                 mov     eax, dr1
.text:80106786                 mov     [edx+0E0h], eax
.text:8010678C                 mov     eax, dr2
.text:8010678F                 mov     [edx+0E4h], eax
.text:80106795                 mov     eax, dr3
.text:80106798                 mov     [edx+0E8h], eax
.text:8010679E                 mov     eax, dr6
.text:801067A1                 mov     [edx+0ECh], eax
.text:801067A7                 mov     eax, dr7
.text:801067AA                 mov     dr7, ecx
.text:801067AD                 mov     [edx+0F0h], eax
.text:801067B3                 sgdt    fword ptr [edx+0F6h]
.text:801067BA                 sidt    fword ptr [edx+0FEh]
.text:801067C1                 str     word ptr [edx+104h]
.text:801067C8                 sldt    word ptr [edx+106h]
.text:801067CF                 retn    4
.text:801067CF _KiSaveProcessorControlState@4 endp

This function is supposed to save all control registers (CR0, CR2, CR3, and CR4 on Pentium and later processors), all debug registers (DR0-DR3, DR6, DR7) and various global protected mode settings (the address of the GDT, the address of the IDT, the selector of the active TSS and the selector of the LDT). To detect the processor type, it uses a value from the KPRCB (Kernel Processor Control Block). The KPRCB is part of the KPCR (Kernel Processor Control Region). The KPRCB for the boot processor (or the only processor on uniprocessor systems) is located at virtual address FFDFF120, which is hard-coded into this function. Geoff Chappell writes this about the relevant part of the KPRCB in NT 3.1:

+018   CHAR CpuType;
+019   CHAR CpuID;
+01A   USHORT CpuStep;

These members of the KPRCB are initialized by KiSetProcessorType, which identifies the relevant processors correctly (but be aware that it mistrusts processors that report a CPUID feature level above 3 and treats them as "generic non-CPUID capable 586 compatible processors"). The byte at offset 18h is set to 4 for 486 processors, 5 for Pentium processors and 6 for Pentium Pro and Pentium II/III processors. The byte at offset 19h is a boolean flag that indicates whether the processor supports CPUID and behaves "reasonably".

A very attentive reader might already have noticed the bug: The CMP instruction uses the word at address FFDFF138 (which is 18h bytes into the KPRCB), instead of the byte at that address. This means the byte at offset 19h in the KPRCB is treated as the high byte of the processor type. If a processor supports CPUID, its type is taken to be 256 larger than it actually is. Windows NT 3.1 therefore treats a CPUID-capable 80-4-86 processor as an 80-260-86 processor. And as 260 is way larger than 5 (Pentium), that processor had better have CR4. On a 486 without CR4, the mov eax, cr4 instruction raises an invalid-opcode exception in the very code path that is supposed to enter the debugger.
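The effect of the wrong operand size can be reproduced outside the kernel. The following is a minimal Python sketch, not code from NTOSKRNL.EXE: it packs the three KPRCB members quoted above into a little-endian buffer and reads the processor type once as a byte and once as a word. The helper names cpu_type_fields and takes_cr4_path are made up for this illustration, and the CpuID flag is assumed to be stored as 1 when set.

```python
import struct

KPRCB_BASE = 0xFFDFF120    # boot-processor KPRCB address quoted in the answer
CPU_TYPE_OFFSET = 0x18     # CpuType (byte); CpuID (byte) follows at 0x19


def cpu_type_fields(cpu_type, cpuid_flag, cpu_step=0):
    """Pack CpuType, CpuID and CpuStep in the quoted order (little-endian).

    Hypothetical helper; cpuid_flag is assumed to be 1 when CPUID is usable.
    """
    return struct.pack("<BBH", cpu_type, cpuid_flag, cpu_step)


def takes_cr4_path(fields, width):
    """Mimic 'cmp [CpuType], 5' followed by 'jb @@before_pentium'.

    Returns (True, value) when the jump is NOT taken, i.e. CR4 is accessed.
    width is 8 for a byte compare and 16 for the buggy word compare.
    """
    fmt = "<B" if width == 8 else "<H"
    (value,) = struct.unpack_from(fmt, fields, 0)
    return value >= 5, value


# The compare targets base + offset, matching the address in the disassembly.
assert KPRCB_BASE + CPU_TYPE_OFFSET == 0xFFDFF138

enhanced_486 = cpu_type_fields(cpu_type=4, cpuid_flag=1)
print(takes_cr4_path(enhanced_486, 16))   # word compare, as shipped
print(takes_cr4_path(enhanced_486, 8))    # byte compare, as intended
```

With a CPUID-capable 486, the word read yields 260 and the CR4 path is taken, while the byte read yields 4 and the path is skipped. A 486 without CPUID gives 4 in both cases, which matches the observation that the original 486DX-33 works.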

The fix

The fix is obvious once the bug is identified. The instruction cmp ds:word_FFDFF138, 5 appears only twice in NTOSKRNL.EXE, specifically in KiSaveProcessorControlState and KiRestoreProcessorControlState, and it needs to be patched to be a byte compare instead of a word compare. Use your favorite hex editor to patch 66 83 3D 38 F1 DF FF 05 to 90 80 3D 38 F1 DF FF 05, two times. This fix applies to both the NTOSKRNL.EXE from the original NT 3.1 Advanced Server distribution and the one from NT 3.1 SP3.
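If you would rather script the patch than use a hex editor, the same replacement can be sketched in Python. This is only an illustration under assumptions: the function name patch_kernel_image is made up, the image is assumed to be read into memory as bytes, and the sketch does nothing beyond the byte replacement described above. It refuses to patch unless the pattern occurs exactly twice.

```python
# Byte patterns taken from the answer text.
OLD = bytes.fromhex("66 83 3D 38 F1 DF FF 05")  # cmp word ptr ds:[FFDFF138h], 5
NEW = bytes.fromhex("90 80 3D 38 F1 DF FF 05")  # nop / cmp byte ptr ds:[FFDFF138h], 5


def patch_kernel_image(image: bytes, expected: int = 2) -> bytes:
    """Replace the word compare by a byte compare (hypothetical helper).

    Raises ValueError unless the pattern occurs exactly `expected` times,
    so a different or already patched kernel build is left untouched.
    """
    found = image.count(OLD)
    if found != expected:
        raise ValueError(f"expected {expected} occurrences, found {found}")
    return image.replace(OLD, NEW)


# Bytes 3..6 of the pattern are the little-endian address of CpuType.
assert int.from_bytes(OLD[3:7], "little") == 0xFFDFF138

# Demonstration on a synthetic buffer instead of a real NTOSKRNL.EXE.
fake_image = b"\x00" * 4 + OLD + b"\xCC" * 3 + OLD + b"\x00"
patched = patch_kernel_image(fake_image)
print(len(patched) == len(fake_image), patched.count(NEW))
```

For a real file you would read NTOSKRNL.EXE in binary mode, pass its contents through the function and write the result to a copy, keeping the original as a backup.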

Michael Karcher
  • 36
    This is a great example of a self-answered question done right. – Criggie Apr 20 '21 at 11:06
  • 16
    @Vilx- Free time built civilization. – J... Apr 20 '21 at 14:10
  • 14
    @J... and on the other hand: civilisation built more free time! – John Keates Apr 20 '21 at 17:04
  • 35
    I see your NOP (90h), and raise you an explicit DS prefix (3Eh). It's all about the style points! :-) Any instruction with a memory operand can have a segment prefix. In this case, the DS prefix is implicit/implied, but it can be explicitly specified without changing the meaning of the instruction. Both ways work to pad the extra leftover byte of space, but the explicit DS prefix does not change the instruction's execution speed, whereas the NOP actually takes 1 cycle of time to execute (plus possible decoding). – Cody Gray - on strike Apr 21 '21 at 00:34
  • 5
    Why do I keep seeing the image of some old retired Microsoft programmer reading this and going "Dam smart-azz kids I thought I safely swept that under the rug with the 10,000 other bugs I left in that code" – Ted Mittelstaedt Apr 22 '21 at 09:31
  • 24
    The really interesting question is, how on earth did you get the idea to look at _KiSaveProcessorControlState in the first place, when all you had was a rebooting PC without a usable debugger? – Guntram Blohm Apr 22 '21 at 11:06
  • 1
    @JohnKeates Civilization finds that you are not performing your maximally possible allocation of profitable work. Civilization says get back to work, slave. You are not meant to have any free time. – user253751 Apr 22 '21 at 14:17
  • 21
    @GuntramBlohm NTOSKRNL.EXE + debug symbols + IDA helped me understand how the remote break-in is supposed to work. I knew that something in the remote break-in code path before the first debug packet is sent is going to reboot my machine. So I patched "JMP SHORT $" instructions into the relevant code-path. If I placed it before the crash point, the machine hangs. If I placed it after the crash point, the machine reboots. This allowed me to "bisect" where the crash is happening. BTW: Good question, but I don't think the debugging story fits the Q/A format well, so I left it out. – Michael Karcher Apr 22 '21 at 20:23
  • 9
    @CodyGray If you apply the style metric by Raymond Chen (number of bytes modified), you should neither put a NOP nor a DS prefix there, but leave the operand size prefix (66) as is. It has no effect on 8-bit instructions, and you get down to patching just one byte instead of two. – Michael Karcher Apr 22 '21 at 20:25
  • 10
    "If you are as curious as me, you might want to fix Windows NT." Phrases like that make me want to break out the popcorn while simultaneously regretting my life decisions that led to this point. – Cort Ammon Apr 23 '21 at 02:23
  • 6
    BTW: Good question, but I don't think the debugging story fits the Q/A format well, so I left it out. — @MichaelKarcher: This just surfaced on HN (https://news.ycombinator.com/item?id=37684986) and I just wanted to say, there's a pretty sizeable niche of people who will be very interested to hear more about everything :) – i336_ Sep 28 '23 at 05:25
  • 1
    @i336_ It surfaced hacker news again. It hit hacker news first some days after posting: https://news.ycombinator.com/item?id=26898639 – Michael Karcher Sep 28 '23 at 21:52