How reliable are BSOD?

Question

I searched Super User for this question, but no one has asked it, so here it is: do BSODs give us a 100% accurate error?

yes, they will always show you the error that crashed the kernel. That said, that error may not always be the first error to occur (a poorly handled userland error can crash the kernel in rare cases), but yes, the BSOD will always show you the fatal error. — Frank Thomas, Nov 06 '15 at 05:10
An example of user mode code resulting in a kernel crash is if a protected system process, like winlogon.exe or lsass.exe, hits an unhandled exception and crashes. That's a "should never happen" condition, and there is no way to correctly restore state and keep going. So if any protected system process fails, that's considered a fatal system error, and a BSOD ensues. — Jamie Hanrahan, Nov 06 '15 at 15:56

score 7 · Accepted Answer · edited Mar 20 '17 at 10:16

The BSOD codes are exactly the parameters that were passed to KeBugCheckEx. The first such parameter is called the "bugcheck code". It is translated to the message you see on the BSOD. For example, bugcheck code 0x50 is PAGE_FAULT_IN_NONPAGED_AREA, 0x44 is MULTIPLE_IRP_COMPLETE_REQUESTS.

The other four parameters' meanings are specific to the particular bugcheck code. For example in the case of PAGE_FAULT_IN_NONPAGED_AREA one of the other parameters will indicate the virtual address that faulted. For MULTIPLE_IRP_COMPLETE_REQUESTS one of the parameters indicates the address of the IRP (I/O Request Packet).
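To make the "code plus four parameters" idea concrete, here is a minimal sketch of a decoder. The numeric values are the documented ones for the bugcheck names mentioned in this answer; the function name `describe_bugcheck` and the two tables are made up for illustration, and the "parameter 1" meanings shown are only the two cases discussed above (check the Bug Check Code Reference for any other code).

```python
# Documented numeric values for the bugcheck names used in this answer.
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x35: "NO_MORE_IRP_STACK_LOCATIONS",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x44: "MULTIPLE_IRP_COMPLETE_REQUESTS",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x7E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

# What parameter 1 means for the two codes discussed above (illustrative table).
PARAM1_MEANING = {
    0x50: "virtual address that was referenced",
    0x44: "address of the IRP",
}


def describe_bugcheck(code, params):
    """Return a one-line summary of a bugcheck code and its four parameters."""
    if len(params) != 4:
        raise ValueError("a bugcheck carries exactly four parameters")
    name = BUGCHECK_NAMES.get(code, "UNKNOWN_BUGCHECK")
    summary = f"{name} (0x{code:08X})"
    meaning = PARAM1_MEANING.get(code)
    if meaning is not None:
        # The interpretation of each parameter depends on the bugcheck code.
        summary += f": parameter 1 = 0x{params[0]:016X}, {meaning}"
    return summary
```

For instance, `describe_bugcheck(0x50, [0xFFFFF80012345678, 0, 0, 0])` names the bugcheck and reports the faulting address, which is exactly as much as the blue screen itself tells you.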

However: A phrase we use often in kernel mode debugging is "the victim is not always the culprit". That is, the code that was made to crash (the victim) is not always the code that created the circumstances that caused the crash (the culprit). The BSOD only identifies the victim. Even a minidump usually doesn't have enough info to go beyond that.

There are two broad categories of BSOD codes: Those that indicate an "assertion failure", and those that are due to an unhandled or unhandleable exception that was raised in kernel mode. (The debugger documentation does not clearly distinguish between these, although you can usually figure it out from the description of each bugcheck code.)

An "assertion failure" is similar to the way the "assert" macro is commonly used in C programming (although Windows doesn't use the "assert" macro in kernel mode). It is an "inline" test for a "shouldn't happen" condition. For example, NO_MORE_IRP_STACK_LOCATIONS means that someone created an IRP with too few "stack locations" (n.b.: This is not the same as the "stack" that's used for return addresses, local variables, etc.) for the number of layered drivers that exist for a given device (or "DevNode").

An "exception" is something that happens as a side effect of executing an instruction. Some exceptions can be "handled". For example, a page fault is an exception. Under most conditions (when you're either in user mode, or in kernel mode at IRQL < 2, and the virtual address being referenced is properly defined and accessible in the current access mode) the OS's pager can take care of page faults.

But if any of those conditions are not met, the page fault can't be resolved. In user mode this usually results in the process crashing. In kernel mode it will result in a BSOD, with any of several bugcheck codes depending on the exact circumstances. Common ones are:

IRQL_NOT_LESS_OR_EQUAL (the page fault occurred at IRQL 2 or above)
DRIVER_IRQL_NOT_LESS_OR_EQUAL (similar, but KeBugCheckEx figured out that the page fault was raised inside a driver and changed the bugcheck code to indicate that fact)
KMODE_EXCEPTION_NOT_HANDLED (it was at IRQL 0 or 1 but it couldn't be resolved for some other reason)
SYSTEM_SERVICE_EXCEPTION (also at IRQL 0 or 1, but in a kernel mode routine that was invoked from user mode; this has nothing to do with "service" processes)
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (also at IRQL 0 or 1, but the problem occurred in a thread in the "system" process)
...etc.
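The list above can be restated as a small decision function. This is only a toy restatement of that list, not how the kernel actually picks a code; the function name and its boolean flags are hypothetical, and real crashes involve more cases than these five.

```python
DISPATCH_LEVEL = 2  # page faults cannot be resolved at IRQL 2 or above


def classify_unresolvable_page_fault(irql, in_driver=False,
                                     called_from_user_mode=False,
                                     in_system_thread=False):
    """Map the circumstances of an unresolvable kernel page fault to a bugcheck name.

    Mirrors the list in this answer only; the flags are illustrative inputs.
    """
    if irql >= DISPATCH_LEVEL:
        # Elevated IRQL: the driver-specific code is used when a driver is identified.
        if in_driver:
            return "DRIVER_IRQL_NOT_LESS_OR_EQUAL"
        return "IRQL_NOT_LESS_OR_EQUAL"
    if called_from_user_mode:
        return "SYSTEM_SERVICE_EXCEPTION"
    if in_system_thread:
        return "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED"
    return "KMODE_EXCEPTION_NOT_HANDLED"
```

Note that every branch describes where and how the fault was detected. None of them says anything about who produced the bad address in the first place.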

Now, here's the rub:

The bugcheck code and other information always precisely indicate the circumstances at the point where the problem was detected. But they don't necessarily - in fact they often do not - indicate the true cause of the problem.

For example, the most common cause of an unhandleable page fault in kernel mode is simply that the address being referenced is incorrect. Suppose I call ExAllocatePool (the k-mode equivalent, roughly, of malloc) and it can't allocate what I want. In that case it will return to me not the address of an allocated block but zero - the 'null pointer'. Now suppose I store that zero where a pointer should be. Later, some other piece of code, in the OS and not in my code, tries to use that pointer. BSOD! The obvious information on the BSOD and the minidump will point to the code that tried to use the pointer. But the real culprit is my code, which failed to check for a zero return from ExAllocatePool and stored that as a "pointer". And by that time my code could be long gone, i.e. not executing any more.
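The same pattern can be shown in user-mode Python as an analogy. Everything here is hypothetical: `fake_allocate` stands in for an allocator that returns a null pointer on failure, and a Python traceback stands in for the stack recorded in a crash dump.

```python
import traceback

shared_table = {}  # stands in for a data structure shared between components


def fake_allocate(size, pool_exhausted):
    """Hypothetical allocator: returns None (the 'null pointer') on failure."""
    return None if pool_exhausted else bytearray(size)


def culprit_driver_init():
    # Bug: the result is stored without checking whether the allocation failed.
    shared_table["buffer"] = fake_allocate(64, pool_exhausted=True)


def victim_os_routine():
    # Reasonable code that trusts the stored pointer.
    shared_table["buffer"][0] = 0xFF


def crash_and_report():
    """Run both routines; return the function names on the stack at the fault."""
    culprit_driver_init()  # returns normally, so it is off the stack afterwards
    try:
        victim_os_routine()
    except TypeError as exc:
        return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]
    return []
```

The list returned by `crash_and_report()` ends with `victim_os_routine` and does not contain `culprit_driver_init` at all, which is the situation a BSOD or minidump leaves you in.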

Another example: Suppose I successfully allocate the pool (heap) that I need, but while I allocate 120 bytes, my code mistakenly writes 140 bytes starting at the address that's returned to me. I have just corrupted the pool metadata for the next block of pool, and if that block is in use, I've also corrupted the data that belongs to whoever owns that block. This might not cause a problem for some time. It won't immediately cause a problem for me! But eventually, when whoever owns that block tries to use their data, they'll have problems (could be a page fault, could be lots of things). Or if a request to free or allocate pool happens to hit the corrupted metadata, that'll likely raise some sort of exception, resulting in a BSOD. And, again, I, the culprit, will likely not be anywhere evident.
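Here is a toy model of that overrun. The pool layout is invented for illustration (an 8-byte header made of a 4-byte tag and a 4-byte size in front of each block); the real pool headers are different, and `ToyPool` is not any Windows API.

```python
import struct

HEADER = struct.Struct("<4sI")  # toy header: 4-byte tag + 4-byte block size
TAG = b"POOL"


class ToyPool:
    """A flat byte array carved into header + data blocks, one after another."""

    def __init__(self, capacity):
        self.mem = bytearray(capacity)
        self.next_free = 0

    def allocate(self, size):
        """Write a header, then return the offset of the usable bytes."""
        start = self.next_free
        HEADER.pack_into(self.mem, start, TAG, size)
        self.next_free = start + HEADER.size + size
        return start + HEADER.size

    def write(self, offset, data):
        # As in C, nothing checks the write against the block size.
        self.mem[offset:offset + len(data)] = data

    def free(self, offset):
        """Validate the header; a damaged one is this model's 'bugcheck'."""
        tag, _size = HEADER.unpack_from(self.mem, offset - HEADER.size)
        if tag != TAG:
            raise RuntimeError("corrupted pool header detected during free")


def overrun_demo():
    """Overrun one block, then let the neighbouring block's owner hit the damage."""
    pool = ToyPool(1024)
    mine = pool.allocate(120)
    theirs = pool.allocate(64)
    pool.write(mine, b"\xAA" * 140)  # 20 bytes too many; no error is raised here
    try:
        pool.free(theirs)            # the innocent owner trips over the damage
    except RuntimeError as exc:
        return str(exc)
    return "no corruption detected"
```

The write that does the damage succeeds silently. The error only surfaces later, inside a `free` issued on behalf of a different owner, so the failure is reported far from the code that caused it.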

In debugging these you have to figure out where the bad data - usually a pointer - came from, not just who tried to use it.

Similarly, the NO_MORE_IRP_STACK_LOCATIONS bugcheck is never the fault of the code in IoCallDriver that detects it. A simple look at a minidump (and certainly that "whocrashed" thing people keep posting the output of) would quickly conclude "the problem is in ntoskrnl", because it's IoCallDriver that calls KeBugCheckEx in this path, and IoCallDriver is in ntoskrnl. But the real problem in such cases is invariably that some driver set up the layering of device objects improperly (or it was set up properly, but corrupted later). Which driver code did the foul deed will not likely be obvious in the minidump file.
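A toy model of this "assertion failure" case is sketched below. The classes and functions are hypothetical simplifications: `kernel_forward` stands in for the kernel routine that hands an IRP to the next driver, and one stack location is consumed per driver in the device stack.

```python
class BugCheck(Exception):
    """Stands in for the system halting with a bugcheck."""


class ToyIrp:
    def __init__(self, stack_locations):
        # Chosen by whoever builds the IRP; it should match the device stack depth.
        self.remaining = stack_locations


def kernel_forward(irp):
    """Toy kernel routine: hands the IRP to the next driver down the stack."""
    if irp.remaining == 0:
        # The inline "shouldn't happen" test. This routine only detects the problem.
        raise BugCheck("NO_MORE_IRP_STACK_LOCATIONS")
    irp.remaining -= 1


def send_down(irp, device_stack):
    """Pass the IRP through every driver in the stack, top to bottom."""
    for _driver in device_stack:
        kernel_forward(irp)
    return "completed"
```

With a three-driver stack, `send_down(ToyIrp(3), stack)` completes, while `send_down(ToyIrp(2), stack)` raises `BugCheck` from inside `kernel_forward`. The routine that raises is correct; the mistake was made earlier by whoever decided that two locations were enough.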

It's sometimes possible to figure out from a kernel dump which driver was the actual culprit, but the BSOD will almost never tell you, and a minidump usually doesn't have enough info to tell you.

In a few cases the bugcheck code and other info on the BSOD seem pretty widely separated from the actual problem. For example, UNEXPECTED_KERNEL_MODE_TRAP, with parameters that indicated a "double fault", used to be something we'd see moderately often, as BSODs go. (It was particularly common if you were using a particular sound card which I will not identify here; the product and even the chipset it used are long obsolete, so it doesn't matter now.) Many of these were actually caused by code that used too much kernel stack space. There's nothing about "UNEXPECTED_KERNEL_MODE_TRAP" or even "double fault" that tells you that. (After some time, the debugger doc was updated to include a suggestion that a "double fault" could be caused by a kernel stack overflow.)

For more information on all the possible bugcheck codes, see the help that comes with WinDbg, section "Bug Check Code Reference". If you're not already familiar with the material in Windows Internals by Solomon, Russinovich, and Ionescu, you'll likely need to refer to that to understand many of the descriptions. For perhaps more explanation of "why the OS can't just fix things up and go on" instead of crashing, see my answer here.

score 0 · Answer 2 · answered Nov 06 '15 at 05:41


No. The error code might not be the root cause, but many times it is related to it. For example, a power loss might show up as an HDD failure, because the system crashed when I/O to the disk failed (generally before the CPU itself lost power). Or your PSU might be faulty and unexpectedly cut power to the HDD.

The BSOD codes are just pointers to why Windows crashed in the current or previous session.

Roh_mish

Can you provide relevant sources for your conclusion? It goes against everything I have read about Windows and experienced personally over 20 years of working with Windows. – Ramhound Nov 06 '15 at 12:08
@Ramhound I agree with Roh_mish although I agree it wasn't very clear. I think what Roh_mish is basically saying is that the BSOD can be misleading. For example, a BSOD might seem to complain that there is a problem with a driver, when the actual issue is with hardware. Or maybe vice-versa. So the BSODs are not always very clear about identifying the actual "root cause" of an issue, which may lead some people to consider them to be less than fully accurate/reliable. In fact, the BSODs should be presenting some information very accurately, but interpreting them right isn't always very easy – TOOGAM Nov 06 '15 at 17:43
