Basic Blue Screen Troubleshooting

Doug Allen: Hi and I'd like to welcome everyone to today's presentation of Basic Blue Screen Troubleshooting. My name is Doug Allen and I'm a Support Professional with Microsoft Premier Product Support Services Setup Team here in North Carolina. I'd like to thank everyone for joining me today. And with that I'd like to go ahead and jump into what we're here to learn about, which are those stubborn blue screens.

Now to understand STOP screens, as you see on slide 2, and how to recover from them, we must dig into the architecture of the operating system. The operating system architecture is divided into two main sections: kernel mode and user mode. Kernel mode is high-privilege, with direct access to hardware and memory, and it is where the hardware abstraction layer (HAL), the Microkernel, and all the other Windows NT® Executive Services live.

User mode is low-privilege, with no direct access to hardware. It uses APIs to request system resources and environment variables, and it is where the integrated subsystems are located, for example, the POSIX subsystem and the OS/2 subsystem. User mode is also where all the access violations and your Dr. Watson errors occur.

And things did move between builds. Going from Windows NT 3.51 to Windows NT 4.0, the developers chose to move, for instance, the video portion from user mode to kernel mode, to promote stability and also allow it to run faster.

And, if you notice, slide 3 is a simplified diagram of how user mode and kernel mode are related to each other, and how that relationship makes Windows NT and Windows® 2000 such a supportable and stable operating system.

Now on slide 4, I'm sure we are all wondering why these STOP screens happen. There are five common categories of causes. They range from software (system services, applications, and faulty or incompatible device drivers), to hardware problems, disk or file system corruption, and firmware or BIOS that is either outdated or incompatible with the operating system or certain files. Viruses can also cause STOP screens. And a given STOP screen can come from one of these causes or from a combination of them.

Now on slide 5, the times when STOP screens can happen are broken up into four main categories. The first is the short startup period of the Executive in the boot sequence (phase four, near the end of the boot process before it gets to GUI mode, meaning the desktop CTRL+ALT+DELETE logon screen). They can also happen when a software condition is detected by the processor, or when a hardware malfunction is detected by the processor, and the final category encompasses all the rest of the STOP screen codes.

On slide 6, I break down the STOP screen that you're used to seeing into five main sections. Now if you notice, this particular breakdown is for Windows NT 4.0. (In Windows 2000, it's broken up a little differently.) Section 1 is the debug port status info: if you actually have a kernel debugger hooked up, or if you're in debug mode, it shows you the COM port status information. Section 2 gives you the actual bug check information, which is where you will see the STOP code, for instance STOP 0x0000001E, and then the description that goes with it, for instance KMODE_EXCEPTION_NOT_HANDLED, IRQL_NOT_LESS_OR_EQUAL, etc.

Now section 3 has the driver information: it lists all of the drivers that were loaded in memory at the time of the blue screen. Section 4 contains the kernel build number and a dump of the stack trace, which is relative to the actual build of Ntoskrnl.exe. And then the last section lists the debug port info, which also tells you whether it is writing a Memory.dmp file or anything else like that.

I apologize that I do not have a screen capture of a Windows NT 4.0 blue screen; the graphic would be too small to see. So I would refer you to the Windows NT 4.0 Resource Kit for an actual image of the blue screen, so that you can study one without having to see it in production.

And just a little bit more information here on slide 7 about descriptions of what the different sections are. If you notice under the driver information that's loaded into memory, it's separated into three main columns. Your first column is your load base address. Second column is the time and date stamp of the actual file (but those are in hex). And the third column actually names all the drivers that are in the set.
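
Since the second column is in hex, a small script can turn one of those stamps into a readable date. The sketch below assumes the stamp is the usual link timestamp stored in the driver file, that is, the number of seconds since January 1, 1970 (UTC); the function name is made up for illustration.

```python
from datetime import datetime, timezone


def decode_driver_timestamp(hex_stamp: str) -> datetime:
    """Convert a hex time-date stamp to a UTC datetime.

    Assumption: the stamp counts seconds since 1970-01-01 00:00:00 UTC.
    Accepts the value with or without a leading "0x".
    """
    seconds = int(hex_stamp, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # 0x3B9ACA00 is 1,000,000,000 seconds after the epoch.
    print(decode_driver_timestamp("3B9ACA00").isoformat())
    # prints 2001-09-09T01:46:40+00:00
```

Comparing the decoded dates is a quick way to spot a driver that is much older or newer than the rest of the set.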

(Slide 8) And then continuing on, we see more information about those sections: the kernel build number and the debug port info, which I have already covered.

(Slide 9) Now as I mentioned earlier, the Windows 2000 STOP screens are a little different. They only contain three sections. Section 1 is the bug check info (which is the STOP code, the four parameters that go with it, and its description). Section 2 is new; to make room for it, they removed some of the previous sections from the NT 4.0 blue screen. Section 2 is a recommended user action for that particular blue screen. So whenever you get a particular blue screen, it will give you a recommended action to help you recover from it. And then section 3 (which is the same as section 5 of the Windows NT 4.0 blue screen) displays the debug port information.

Now in slide 10 we break those three sections down. I've already covered some of this, and of course the bug check info is all in hex, so if you ever want to do anything with those numbers, you will need to convert them from hex to decimal. As for the debug port information, when you're using a kernel debugger, it acts like the Snd/Rcv lights of a modem that blink back and forth when it's transmitting and receiving data.
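
To make the hex conversion concrete, here is a minimal sketch that pulls the STOP code and its four parameters out of a bug check line and converts them to integers. The sample line is invented for illustration, and the exact layout of the line on a real screen may differ from what this regular expression expects.

```python
import re
from typing import NamedTuple, Tuple


class BugCheck(NamedTuple):
    code: int                # the STOP code as an integer
    params: Tuple[int, ...]  # the parameters in parentheses, in order


# Matches text such as "STOP: 0x0000000A (0x..., 0x..., 0x..., 0x...)".
_STOP_RE = re.compile(r"STOP:?\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line: str) -> BugCheck:
    """Extract the STOP code and parameters from one bug check line."""
    match = _STOP_RE.search(line)
    if match is None:
        raise ValueError("no STOP code found in line")
    code = int(match.group(1), 16)
    params = tuple(
        int(part.strip(), 16)
        for part in match.group(2).split(",")
        if part.strip()
    )
    return BugCheck(code, params)


if __name__ == "__main__":
    # Hypothetical bug check line, not taken from a real crash.
    sample = "*** STOP: 0x0000000A (0x00000000,0x00000002,0x00000000,0x8013C4B2)"
    check = parse_stop_line(sample)
    print(check.code, [hex(p) for p in check.params])
```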

(Slide 11) Moving on, I'd like to talk about the actual Memory.dmp file itself, what it contains, what it's used for, etc. Basically, the Memory.dmp file contains information about the computer at the time of the crash. So the actual state of your machine, all files that were loaded into memory, and files that were paged out through the actual Pagefile.sys, are all contained in the Memory.dmp file. The Memory.dmp file is created every time a blue screen happens, if your machine is configured to do so.

A Memory.dmp file is very useful during the debugging process to determine the root cause of your crash. And you would verify the integrity of your Memory.dmp with a utility called Dumpchk with NT 4.0, or Dumpchk 2000 in Windows 2000. Both these utilities can be found respectively on the Windows NT or Windows 2000 retail compact discs.

In Windows 2000, on slide 12, you see that there were some changes to the Memory.dmp file between Windows NT 4.0 and Windows 2000. There are now three different types of memory dumps that can be created. The first type is a mini dump, which is only 64 KB in size. Now a mini dump is usually very small and tells you very little about the crash. Its header will only contain a very small picture of the state of your system at the time of the crash. And I will say that the mini dump is not very useful to a support professional in determining the root cause of the crash.

They also added another type of dump, the kernel-only dump, which is a little more useful in debugging and determining the root cause of the crash. It contains everything that the mini dump contains, plus everything that was loaded into kernel-mode memory at the time of the crash. We always recommend using a complete dump in Windows 2000 (which is equivalent to the Windows NT 4.0 Memory.dmp file), which contains a complete dump of everything that was loaded into memory at the time of the crash.

(Slide 13) Now there are two main conditions that must be met for a Memory.dmp file to be created. The first is that a valid pagefile must exist on the system root partition (which is the same partition where the WinNT or Windows 2000 system files are installed). And it must be at least the size of the physical random access memory (RAM) you have installed in your system, plus approximately 12 megabytes (MB).

There have been arguments about exactly how much over the physical RAM you should have, but a general rule is at least 12 MB. The only other condition is that whatever drive you choose to store your Memory.dmp file on must have enough space to actually contain the Memory.dmp file.
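
As a quick worked example of those two conditions, the sketch below checks a configuration against the RAM-plus-12-MB rule of thumb. The function names are hypothetical, and treating a complete dump as roughly the size of physical RAM is an assumption made for the free-space check.

```python
def minimum_pagefile_mb(physical_ram_mb: int, overhead_mb: int = 12) -> int:
    """Smallest pagefile on the system root partition under the rule of thumb."""
    return physical_ram_mb + overhead_mb


def can_write_complete_dump(
    physical_ram_mb: int,
    pagefile_mb: int,
    free_space_on_dump_drive_mb: int,
    overhead_mb: int = 12,
) -> bool:
    """Check both conditions for writing a complete Memory.dmp.

    Assumption: a complete dump needs about as much disk space as the
    amount of physical RAM installed.
    """
    pagefile_ok = pagefile_mb >= minimum_pagefile_mb(physical_ram_mb, overhead_mb)
    space_ok = free_space_on_dump_drive_mb >= physical_ram_mb
    return pagefile_ok and space_ok


if __name__ == "__main__":
    # A machine with 256 MB of RAM needs a pagefile of at least 268 MB.
    print(minimum_pagefile_mb(256))                 # 268
    print(can_write_complete_dump(256, 268, 300))   # True
    print(can_write_complete_dump(256, 256, 300))   # False: pagefile too small
```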

(Slide 14) And on the next slide you will see an actual screen shot from Windows 2000. This one shows the Startup and Recovery options, which is where you configure your Memory.dmp creation and the other recovery settings.

(Slide 15) From that screen shot you can see that, to obtain a Memory.dmp file, the system must be configured to actually write the dump file. This can be done a couple of different ways: through the actual GUI (that you see on the previous slide), or through the registry under either HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager or HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl.

If the system stops responding (that is, it hangs: you lose your mouse functionality, sometimes you lose your keyboard functionality, and the system appears to be frozen), you can force creation of the Memory.dmp file even though you don't see the actual STOP screen itself. This can only be done on Windows 2000, and I outline the conditions for it on slide 16.

There are two main changes that must be made to force the creation of the Memory.dmp file (if the system is hung and does not actually give the blue screen). The first one is that the Memory.dmp file must be configured to be written in that same Startup and Recovery options GUI. And the second is that a registry change must be made under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\i8042prt\Parameters: create a value named CrashOnCtrlScroll, make it of type REG_DWORD, and set it to 0x1. (So actually, it's the value of 1, but that's in hex.)
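
One way to apply that registry change is with a .reg file that you import on the target machine. The sketch below only generates the text of such a file; it does not touch the registry itself. The helper name is made up, and the header line assumes the Windows 2000 version of the Registry Editor file format.

```python
def crash_on_ctrl_scroll_reg(enabled: bool = True) -> str:
    """Build the text of a .reg file that sets CrashOnCtrlScroll.

    Assumption: the file is imported with the Windows 2000 Registry
    Editor, which uses the "Version 5.00" header.
    """
    key = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\i8042prt\Parameters"
    value = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # REG_DWORD values are written as eight hex digits.
        f'"CrashOnCtrlScroll"=dword:{value:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(crash_on_ctrl_scroll_reg())
```

Review the generated text before importing it, and test it on a lab machine first, as with any registry change.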

(Slide 16) Once you've set those settings, the way you create the dump when the machine hangs is to hold down the right CTRL key on the keyboard and press the SCROLL LOCK key twice. That will force the creation of the Memory.dmp file, so you can get it to a Support Professional for debugging to determine the root cause of the crash. That's a lot more useful than just saying that when it's hung it's usually a user-mode error (meaning it would give a Dr. Watson or something similar to an access violation, where a User.dmp would be retrieved). In a lot of cases, a Memory.dmp would be much more valuable in determining the cause of the crash.

Now from slide 17 to 26, I went through and collected the most common STOP screens that we see at Microsoft Product Support Services in our troubleshooting. The most common STOP screen that we see is STOP A, which is IRQL_NOT_LESS_OR_EQUAL. Now a STOP A is caused by a kernel-mode process that tried to access a portion of memory at an IRQL that was too high for what the system will allow.

The most important of the four parameters on the STOP A is the fourth parameter. The fourth parameter is the address at which the blue screen occurred. With that parameter, you can at least determine what driver was loaded into that section of memory at the time of the crash, which gives you a hint as to the cause. Usually STOP As are caused by buggy device drivers, or by services from backup utilities or virus scanners. These types of applications use device drivers and filter drivers that show up either in the Services applet or the Devices applet under the Control Panel in Windows NT 4.0, or get loaded as hidden devices in Windows 2000 under Device Manager.
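
The lookup from a fault address to a driver can be sketched in a few lines if you have the load base addresses from the driver list. The driver names and addresses below are invented for illustration. Because this sketch only knows base addresses and not module sizes, an address beyond the end of a module is still attributed to that module, so treat the answer as a hint.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def driver_for_address(address: int, drivers: List[Tuple[int, str]]) -> Optional[str]:
    """Return the driver with the highest load base at or below the address.

    `drivers` is a list of (load_base_address, name) pairs in any order.
    Returns None when the address is below every load base.
    """
    ordered = sorted(drivers)
    bases = [base for base, _ in ordered]
    index = bisect_right(bases, address) - 1
    if index < 0:
        return None
    return ordered[index][1]


if __name__ == "__main__":
    # Hypothetical load bases, as they might be read off a driver list.
    loaded = [
        (0x80100000, "ntoskrnl.exe"),
        (0x80010000, "hal.dll"),
        (0xF9F00000, "example.sys"),
    ]
    print(driver_for_address(0x8013C4B2, loaded))  # ntoskrnl.exe
```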

(Slide 18) The next most common STOP code is a STOP 1E, which is KMODE_EXCEPTION_NOT_HANDLED. What this STOP code means is that a kernel-mode process tried to execute an illegal or unknown processor instruction; basically, the application or faulty driver tried to issue an unknown command or a system API call that the system did not understand. On a STOP 1E, the second parameter is the most important. The second parameter is the address where the exception or crash occurred. And just on a side note, a lot of times, just to the right of the bug check and the four STOP code parameters, it might also list a reference driver with the STOP code. If this reference driver is Win32k.sys, then the most likely cause is a third-party remote control application, for instance, pcAnywhere, VNC (Virtual Network Computing), applications like that.

(Slide 19) The next most common STOP code is a STOP 24, which has the description NTFS_FILE_SYSTEM. As its description denotes, it is caused by a problem that occurred in the file system driver, Ntfs.sys. Now with this blue screen, the first parameter is the most important. A STOP 24 is almost always caused by either disk corruption or a disk defragmenter (for instance, an application like Diskeeper). In some very rare cases, it will occur when you are creating a partition larger than 7 GB on a volume that is a Services For Macintosh volume and contains a large number of files.

(Slide 20) Now the next most common STOP code is a STOP 2E, which is a DATA_BUS_ERROR. This error is almost always caused by a parity error in the system's physical RAM. Since it's in physical RAM, it's always related to either defective hardware or some form of configuration issue or incompatible hardware. For instance, if you're mixing parity and EDO-type RAM or other types of incompatible RAM (depending on the actual hardware), that can cause the error.

If you receive this STOP code right after adding or changing the configuration of your physical RAM, go ahead and put it back the way it was before and see if that makes it go away. If the error still persists, another thing you can try is disabling the memory cache in your motherboard BIOS.

(Slide 21) Another common STOP code is a STOP 50, which is PAGE_FAULT_IN_NONPAGED_AREA. A lot of times, you will see these on a Windows NT 4.0 Terminal Server install, and there you can almost always link it back to an incompatible or faulty printer driver, be it from HP, Lexmark, or whoever wrote the printer driver. This is mainly caused when requested data is not found in memory: the system goes to check the page file, but the missing data is identified as unable to be written to the page file. The data then has nowhere to go, which causes the blue screen. With this STOP code, the first parameter is the most important; it indicates the actual virtual address that caused the fault.

(Slide 22) Now another common STOP code is a STOP 7B, which is INACCESSIBLE_BOOT_DEVICE. This is caused when Windows loses access to the system partition during the startup process. Usually, debugging this or getting a memory dump is useless to us. It's almost always caused by some form of problem with a SCSI device driver, a RAID driver, or a particular UDMA IDE controller driver. An incorrect ARC path in your Boot.ini file will also cause this, as will a hardware failure on the drive that contains the system partition. If you're actually installing Windows NT or 2000, near the very beginning of the install it will give you a prompt to press the F6 key to install third-party mass storage device drivers, which is how you supply the correct controller driver.
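
When checking the ARC path in Boot.ini, it can help to split the path into its parts so that each number can be compared against the actual disk layout. The sketch below handles only the common multi(...) and scsi(...) forms and makes no attempt at other syntaxes; the sample entry and all names in the code are illustrative.

```python
import re
from typing import NamedTuple


class ArcPath(NamedTuple):
    adapter_type: str  # "multi" or "scsi"
    adapter: int
    disk: int
    rdisk: int
    partition: int     # partition numbers start at 1
    directory: str     # for example "\\WINNT"


_ARC_RE = re.compile(
    r"^(multi|scsi)\((\d+)\)disk\((\d+)\)rdisk\((\d+)\)partition\((\d+)\)(\\[^=]*)?",
    re.IGNORECASE,
)


def parse_arc_path(entry: str) -> ArcPath:
    """Split the ARC path at the start of a Boot.ini entry into its fields."""
    match = _ARC_RE.match(entry.strip())
    if match is None:
        raise ValueError("not a multi()/scsi() ARC path")
    kind, adapter, disk, rdisk, partition, directory = match.groups()
    return ArcPath(
        kind.lower(), int(adapter), int(disk), int(rdisk), int(partition),
        (directory or "").strip(),
    )


if __name__ == "__main__":
    # Hypothetical Boot.ini entry.
    line = r'multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows 2000" /fastdetect'
    print(parse_arc_path(line))
```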

(Slide 23) Now another common STOP code is a STOP 7F, which is an UNEXPECTED_KERNEL_MODE_TRAP. It occurs when the CPU generates an error that the kernel does not catch (it crashed or it died such that the processor was not able to catch the crash). In this one, the first parameter is the most important. There are about ten common status codes for the first parameter, and those are outlined in the Microsoft Knowledge Base article Q137539, "General Causes of STOP 0x0000007F Errors"; you can look at that Q article for more details about that first parameter and what the codes mean.
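
As a rough illustration of reading that first parameter, the sketch below maps a few values to processor exception names. The table is taken from the standard x86 exception numbers, not from the slides or from the Knowledge Base article, so check the Q article for the authoritative list; the function name is made up.

```python
# A few x86 processor exception numbers (assumption: the first parameter
# of a STOP 7F is the processor trap number).
X86_TRAPS = {
    0x00: "Divide error",
    0x04: "Overflow",
    0x05: "BOUND range exceeded",
    0x06: "Invalid opcode",
    0x08: "Double fault",
    0x0C: "Stack fault",
    0x0D: "General protection fault",
}


def describe_trap(first_parameter: int) -> str:
    """Return a short description for the first parameter of a STOP 7F."""
    name = X86_TRAPS.get(first_parameter)
    if name is None:
        return f"0x{first_parameter:08X}: not in this table, see Q137539"
    return f"0x{first_parameter:08X}: {name}"


if __name__ == "__main__":
    print(describe_trap(0x8))  # 0x00000008: Double fault
```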

A lot of times, we are able to confirm that a STOP 7F is caused by hardware, especially RAM. One troubleshooting step that you can try is disabling sync negotiation in the BIOS of your SCSI adapter. You will also need to check the termination on the cables attached to your SCSI devices. Also, for all of you over-clockers out there, be aware that over-clocking your CPU to a higher frequency than its rated processor speed can cause the STOP 7F.

(Slide 24) Related to that, the next common STOP code is a STOP 9F, which is a DRIVER_POWER_STATE_FAILURE and, as its name indicates, it's caused when drivers do not handle a power state transition request properly. For instance, it would happen whenever you are coming out of a power save mode (be it standby, hibernate, etc.). You would probably see these more on laptops or on computers that have power save profiles set up. Most commonly we see this kind of STOP code when you are shutting down your system or, like I mentioned, coming out of either standby or hibernation. As for other common causes, you would definitely need to check whether you have any compact disc writing software, be it Nero or Easy CD Creator. Those are common causes. So are applications that attempt to catch crashes; Norton (Symantec), I believe, has some utilities like this, and there are other similar applications. You will also need to check your power management capability and settings, for instance, whether your system is APM (Advanced Power Management) or ACPI compatible.

(Slide 25) The next STOP code is a STOP D1, which is DRIVER_IRQL_NOT_LESS_OR_EQUAL. This STOP code is virtually identical to the very first STOP code that we talked about, the STOP A. It has pretty much the same cause as a STOP A: it occurs when the system attempts to access pageable memory at a process IRQL that is too high. With this STOP code, the fourth parameter, which contains the address that referenced the memory where it failed, is the most important. You would follow the same troubleshooting steps on a STOP D1 as you would on a STOP A.

(Slide 26) The next of the most common STOP codes is a STOP C000021A, which is STATUS_SYSTEM_PROCESS_TERMINATED. This is one of the few STOP codes that is actually a user-mode crash. It's caused when a user-mode subsystem (either Winlogon or CSRSS) is fatally compromised and the security of the operating system can no longer be guaranteed. This is one of the only user-mode errors that can actually bring down a machine. The cause is almost always a third-party application or mismatched system files. For instance, say you were installing a service pack and the system was turned off by accident or crashed partway through; when you rebooted, half your files were at the new service pack level and the others were at the previous level. A troubleshooting step in Windows 2000, if you believe you have mismatched system files, is to go to Start, Run (or a command prompt) and type sfc /scannow. Sfc is the System File Checker, which is part of the Windows File Protection technology of Windows 2000. It will scan all your system files and make sure they are all the correct versions, and if they're mismatched, it will replace them from the Dllcache subdirectory.
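
The STOP codes from slides 17 to 26 can be collected into a small lookup table that records, for each code, its description and which parameter this talk called out as the most important. This is only a summary of the discussion above in code form; the structure and names are made up, and a missing key parameter means the talk did not single one out.

```python
from typing import Dict, NamedTuple, Optional


class StopInfo(NamedTuple):
    name: str
    key_parameter: Optional[int]  # 1-based; None if the talk named none


COMMON_STOPS: Dict[int, StopInfo] = {
    0x0000000A: StopInfo("IRQL_NOT_LESS_OR_EQUAL", 4),
    0x0000001E: StopInfo("KMODE_EXCEPTION_NOT_HANDLED", 2),
    0x00000024: StopInfo("NTFS_FILE_SYSTEM", 1),
    0x0000002E: StopInfo("DATA_BUS_ERROR", None),
    0x00000050: StopInfo("PAGE_FAULT_IN_NONPAGED_AREA", 1),
    0x0000007B: StopInfo("INACCESSIBLE_BOOT_DEVICE", None),
    0x0000007F: StopInfo("UNEXPECTED_KERNEL_MODE_TRAP", 1),
    0x0000009F: StopInfo("DRIVER_POWER_STATE_FAILURE", None),
    0x000000D1: StopInfo("DRIVER_IRQL_NOT_LESS_OR_EQUAL", 4),
    0xC000021A: StopInfo("STATUS_SYSTEM_PROCESS_TERMINATED", None),
}


def describe_stop(code: int) -> str:
    """Summarize a STOP code using the table above."""
    info = COMMON_STOPS.get(code)
    if info is None:
        return f"STOP 0x{code:08X}: not one of the common codes covered here"
    if info.key_parameter is None:
        return f"STOP 0x{code:08X}: {info.name}"
    return f"STOP 0x{code:08X}: {info.name} (look at parameter {info.key_parameter} first)"


if __name__ == "__main__":
    print(describe_stop(0x1E))
```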

Now that we've covered all of the most common STOP codes I'm sure we're all wondering how can we prevent these, or what do we do when they occur?

Starting with slide 27, I've outlined a few different tools and methods that you can use to troubleshoot and attempt to recover from these blue screens. The first one is the ERD, which is the Emergency Repair Disk. All of us fall short of always updating or keeping our Emergency Repair Disks, but I cannot stress this enough: your ERD will become invaluable if you hit certain STOP codes that affect your system's registry, your default user profiles, the boot sector on your hard drive, or other system files.

Another good thing to have is a Windows NT boot disk. The Knowledge Base article Q301680, "HOW TO: Create a Boot Disk for an NTFS or FAT Partition," outlines exactly what this is, how to create it, and how to use it. A Windows NT boot disk is the equivalent of, for instance, a Windows 98 boot disk, in that it loads your boot files from the floppy disk. To create one for any Windows NT or 2000 system, put a floppy disk in your A drive and use Windows NT Explorer to format it. Then, from a system similar to the one that you want to boot, copy over three files from the root of the C drive: Ntldr, Ntdetect.com, and Boot.ini. That will help you troubleshoot any kind of problem with missing or corrupted copies of those three files, or with the master boot record or partition tables on your hard drive, that prevents the OS from booting.

Another very useful tool that we always have customers try is a parallel installation of the operating system (OS). If your existing operating system does not boot, you can lay down a parallel install just by going through the normal setup procedure for the operating system, and make sure you install it into a different directory than your existing operating system install. That way, if you need to access your production machine's registry or do some file copy operations to replace mismatched system files, you're able to do that.

Now these next two methods apply to both Windows NT 4.0 and Windows 2000. They are: try booting into VGA mode, and try the last known good configuration. I'm sure that everyone out there has tried, at one point, using the last known good configuration, and it has never resolved your issue and you've wondered what exactly the last known good configuration does.

How the last known good configuration works is this: when your machine boots up and the Windows NT hardware detection runs, it goes out into your system, queries and enumerates all of your hardware, and writes all of that configuration into your system's registry, so that information is built dynamically on every boot. All this information is stored under HKEY_LOCAL_MACHINE in the \SYSTEM hive, and under that you will see approximately three different entries called ControlSet001, ControlSet002, maybe ControlSet003, and then the last one that you will see there is CurrentControlSet.

Now on a normally booting machine, you will log in and it will use the current control set, that is, the configuration that is in use while your machine is running. Now when you make a system change, be it a device driver or an application install, it might make some changes to your system hive, and on the next reboot the CurrentControlSet registry key is built from ControlSet001. If those changes are incorrect or incompatible, the machine may not boot. The last known good has a separate control set that keeps a picture of your system before those changes were applied. That way, if you need to go back to a different video driver or roll back the version of a driver, you can usually do that and last known good will work.
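
One detail not covered in the talk: the mapping between roles and numbered control sets is recorded in the values Current, Default, Failed, and LastKnownGood under HKEY_LOCAL_MACHINE\SYSTEM\Select. The sketch below turns those numbers into key names, assuming you have already read the four values (for example, from a parallel installation); the sample numbers are invented.

```python
from typing import Dict


def control_set_names(select_values: Dict[str, int]) -> Dict[str, str]:
    """Map each Select value to the control set key it points to.

    A value of 0 (typical for Failed on a healthy machine) means no
    control set is assigned to that role, so it is left out.
    """
    return {
        role: f"ControlSet{number:03d}"
        for role, number in select_values.items()
        if number > 0
    }


if __name__ == "__main__":
    # Hypothetical values as they might appear under the Select key.
    select = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 2}
    for role, key in control_set_names(select).items():
        print(f"{role}: HKEY_LOCAL_MACHINE\\SYSTEM\\{key}")
```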

Now Windows 2000 has some extra enhancements and other technologies that were not available in Windows NT 4.0. These are Safe mode (which you may know from Windows 95 or Windows 98, the Windows 9x family of operating systems) and a new technology called the Recovery Console. The Recovery Console basically allows you to boot and log in as administrator to the operating system via a command line. This gives you command-line file write access to NTFS partitions, which is very useful: you no longer have to keep your system partition on a FAT partition just so that you can use a 98 or 95 boot floppy to access it.

The Recovery Console is also very useful in doing other things. And I would definitely recommend everyone look into this Recovery Console and read more about it. It's a very powerful and very useful tool.

Now on slide 28, there are also some other things that we can do when troubleshooting STOP screens. You definitely always want to check your system and application event logs. We tend to overlook those, but they can contain vital data in terms of pointing to the cause of your crash. Another thing that you would want to do is verify the latest service pack that you have installed, which you can do by running the Winver command from either Start, Run or a command prompt.

I would also recommend that you virus check the system with the latest virus definitions for your antivirus application, and make sure that your system is free of any boot sector or master boot record viruses, or any type of Trojan. Another thing that you would want to do, if it's related to file system corruption, is run the command chkdsk /f /r. Because chkdsk can't run on that partition while your OS is up and running, it will prompt you with the question, "Do you want to have check disk run on your next reboot?" You would say Yes. What the /f and /r switches do is, if chkdsk runs and detects any errors in your file system (or your table indexes or anything else like that on the partition), it will go ahead and fix them.

If you've ever called in a support incident with us at PSS about blue screens, you may be familiar with this last utility, called MPS Reports (for Microsoft Product Support Reporting Tool). This utility is very useful. It goes out on your machine, it's non-intrusive, and it only gathers information; it does not inhibit your system in any way. It gathers a dump of all three of your event logs in EVT and TXT format, and it gets a Pstat, which is an output of all running processes on your machine, plus numerous other listings of different directories, all loaded driver files, stuff like that. If it's setup related, it will also gather all of your setup-related logs, for instance your Setupapi.log, Setupact.log, and so on. So, the next time that you speak with a Microsoft Support Professional, if you don't already have it, request this utility. It's very useful in troubleshooting.

Now on slide 29 I want to give you a brief overview of the actual Recovery Console. Like I mentioned before, it basically allows you command-line access to the boot partition or simple volume of your file system. And, for those of you who are familiar with the Windows NT 4.0 or 2000 technology Sysprep (which stands for System Preparation), you cannot pre-stage the Recovery Console with Sysprep. (The Recovery Console can be installed locally, or just run using the repair process from the 2000 Setup CD-ROM.) When it installs, it installs dynamically and is unique to every install, because it goes out and reads the disk drive geometry of your machine. That's why you cannot pre-stage it using Sysprep.

It's very useful: you can just type the command enable or disable for any system services or devices; you can replace files; you can display or modify disk partition information; and you can replace your master boot record or your boot sector with known good copies. For more information on the Recovery Console, I recommend that you see the Q article Q229716, "Description of the Windows 2000 Recovery Console," in the Microsoft Knowledge Base.

(Slide 30) There are some things that you can do to prevent the STOP screen. The first one, I cannot stress enough: it's always, always, always test your drivers before installing them in a production environment – either on a test machine, in a lab, or test environment (depending on what you're deploying or installing or updating). I can't stress it enough: test your drivers or whatever you're doing – your procedure – before putting into production.

I would also suggest that you check the Hardware Compatibility List (HCL) that's also on Microsoft.com before you install any new hardware, to verify compatibility with either Windows NT or Windows 2000. If you can, purchase hardware or software that is on the HCL; that will help ensure you can avoid any future problems with this hardware or software.

If you're running Windows 2000, whenever you do install new device drivers, I recommend that you try to install digitally signed drivers whenever possible. Windows 2000 always prefers to have digitally signed drivers. A signature means that the vendor has contacted us and has gone through a certain process that verifies that their drivers are of a certain stability and will work in all given situations, and that just promotes stability.

I would always recommend that, before and after you make any system change, you update your Emergency Repair Disk, and that you keep revisions of your ERD so that, in case you have to bring your system back to an earlier state or replace part of your system's registry with a copy from six months to a year ago, you are able to do that.

On slide 31, I outlined a little bit more information on kernel-mode and user-mode debugging. This is not usually something that most people get into, but if you are familiar with Visual C or Visual C++® code, or you're a developer and this is interesting to you, then this lists out some more information about it. Basically, debugging is the process that we use to determine the root cause of a crash; in all honesty, it should really be reserved for more advanced users.

Now when you go to debug a Memory.dmp file or user dump, you will need to have symbols for the operating system and its respective service pack, and you will also need the debugging tools, which can be downloaded from http://www.microsoft.com/ddk/debugging/ (the DDK standing for device driver development kit). The symbols can be downloaded from that same site, or they can be retrieved from the retail operating system compact disc; if you have the service pack, the symbols for that service pack are included with it. And I reference article Q148658, "How to Load Windows NT MEMORY.DMP File Using I386KD.EXE," for more information about debugging.

On slide 32, I just wanted to list some additional resources on STOP screens, how to help resolve them, information like that. I would highly recommend that you see the Windows NT 4.0 and Windows 2000 resource kits. The Windows NT 4.0 resource kit is very useful and goes into great detail about debugging and determining the root causes of your crashes. I would also recommend the Web site http://www.microsoft.com/ddk/.

Also, the Resource Kit Web resources are located at the Web address http://www.microsoft.com/windows2000/techinfo/reskit/Webresources/default.asp. And like I mentioned earlier, I highly recommend that at some point you go and visit the Hardware Compatibility List to make sure that all of your hardware is compatible. There is also the subscription service Microsoft TechNet, which is delivered on a monthly basis and has a copy of all the resource kits, downloadable service pack files, and a copy of the Microsoft Knowledge Base. There is also a subscription service for the Microsoft Developer Network (MSDN®) that contains more detailed information; especially if you are into developing or writing code, you'll find MSDN very useful. And, of course, there is the Microsoft Knowledge Base at support.microsoft.com, where you can find all of the Q articles referenced in this WebCast.

(Slide 33) That concludes our presentation today on Basic Windows NT and 2000 Blue Screen Troubleshooting. I'd like to thank everyone for joining us today as we've looked at how to prevent, and what we can do to try and recover from, these blue screens. If you follow these steps, hopefully they will assist you in preventing any further blue screens, and with that I'd like to turn it back over to Jason.

Jason Bennet: Great. Thanks for that presentation, Doug. Just a couple of quick notes before we move on to the Q&A portion of this Support WebCast. If you'd like to have a copy of those PowerPoint® slides, be sure that you download the file from our Web site. To access all information on all upcoming Support WebCasts, and our archive content and the PowerPoint slides from all these past WebCasts, an easy-to-remember URL is http://support.microsoft.com/webcasts/.

The Q&A portion of this Support WebCast is intended to encourage further discussion of the Support WebCast topic. One-on-one product support issues are outside the scope of the Support WebCast and if you do need technical assistance, please submit an incident on the Web or call Microsoft Product Support Services and speak to a Support Professional.

First question, "After running PSTAT to determine which driver threw the exception, Ntoskrnl.exe is almost always the culprit instead of a third-party driver. Does this tell me that the core motherboard drivers are to blame for the blue screen?"

Doug: That's a very good question. No, it does not. A crash that occurred in Ntoskrnl.exe almost never means that the Ntoskrnl file itself is to blame. If you remember from our previous discussion of kernel mode and user mode, Ntoskrnl is the main kernel-mode component. What happens is that any kernel-mode drivers load alongside Ntoskrnl and sit in the memory range that begins at the starting address of Ntoskrnl. A PSTAT, at the very bottom, will list the actual load addresses of everything that was loaded in the system when the PSTAT was taken.

Jason: Where do you configure the Memory.dmp file?

Doug: There are two different procedures, one for Windows NT and one for Windows 2000, but they are related. You configure it under the Startup and Recovery settings in System Properties. If you're running Windows 2000, right-click My Computer, click Properties, click the Advanced tab, and then click the Startup and Recovery button. For Windows NT 4.0, right-click My Computer, click Properties, click the Startup/Shutdown tab, and select the Write debugging information to check box; there you type the path and file name where you want to save your Memory.dmp file.

Jason: Great. Can the dump be placed on another drive?

Doug: Yes, it can. Like I mentioned earlier about the creation conditions for the Memory.dmp file: only the pagefile has to reside on the system partition. In that same place that I've just instructed you to go, you can type in a different drive letter and path name for the Memory.dmp file. So if your main system partition (for instance, your C drive) is only 2-4 GB in size and you have 4 GB of physical memory, then obviously you would never have enough free space to write the dump there. What you can do is delete the default percent-sign variable (%SystemRoot%) and type in the path where you'd like to store the Memory.dmp file, including the actual file name, which by default is Memory.dmp.

Jason: Okay. Good. Next question, Can you still get the dump file if you have it set to auto reboot on system failure?

Doug: Yes, you can. There are two different levels in terms of auto rebooting after a crash. The first is built into the OS, on the same Startup and Recovery configuration page. There is a check box, included with the others, that says to automatically reboot after a STOP error occurs. If you have that checked, then when your system blue screens it will write a Memory.dmp before it reboots only if the check box to write the Memory.dmp is also checked. So if you have your system set to auto reboot but not to create the Memory.dmp then, no, it will not create the Memory.dmp; it will just display the blue screen for a few seconds and reboot. I definitely recommend keeping the first check box, Write an event to the system log, checked.

Now also, if you're running a Compaq server (I believe HP has them as well), there is a specific Compaq driver utility called ASR, which is Automatic Server Recovery, that will also reboot the machine automatically when it detects a system failure.

But to answer your question directly, yes, as long as your machine is configured to write the Memory.dmp via that check box, and with the corresponding correct path, and your system is also configured to auto reboot, before it reboots it will write the complete full Memory.dmp.

Jason: What is the fourth parameter of the STOP code? Can you please give an example?

Doug: I'm going to assume we're speaking about the most common one, a STOP A code, for example, where I mentioned that the fourth parameter is the most important. The four parameters are all in hex, and each takes the form 0x followed by eight hexadecimal digits.

For instance, for STOP A, the code will be 0x0000000A, and in the parentheses it will have the four parameters, which can be, for instance, 0xC005, then 0x000 (all zeros), then the third can be a 0 or a 1 (which indicates either a read or a write operation), and the fourth parameter is very different. It will be, for instance, something like 0xFE8592F0. That is the actual memory address at which the crash occurred.

Combined with a PSTAT (which gives you a list of all loaded drivers and their memory load addresses), and using the scientific calculator that's included with the Windows OS, you can convert those hex numbers, work out each driver's load range, and then try to determine which driver was actually referenced at the time of the crash. I hope that answers that person's question.
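
The lookup Doug describes can be sketched in a few lines of Python rather than done by hand with a calculator. This is only an illustration: the module names, load addresses, and sizes below are made up, and it assumes you have already pulled a base address and a total size for each module out of your own PSTAT output.

```python
from typing import List, Optional, Tuple

# Hypothetical module list in the spirit of PSTAT output:
# (module name, load address, size in bytes). Values are invented for illustration.
MODULES: List[Tuple[str, int, int]] = [
    ("hal.dll",        0x80010000, 0x00010000),
    ("ntoskrnl.exe",   0x80100000, 0x000D0000),
    ("exampledrv.sys", 0xFE850000, 0x00010000),
]

def find_module(address: int, modules=MODULES) -> Optional[str]:
    """Return the module whose load range [base, base + size) contains address."""
    for name, base, size in modules:
        if base <= address < base + size:
            return name
    return None  # address is outside every listed module

# The fourth parameter from the example above, converted from hex.
fourth_parameter = int("0xFE8592F0", 16)
print(find_module(fourth_parameter))
```

With this invented table the address falls inside the range of the hypothetical exampledrv.sys. On a real system, the result only tells you which driver was referenced, not necessarily which one is at fault.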

Jason: Okay. Great. For floppy-less work stations what is the work around for having an ERD for them?

Doug: That's a very good question. I know a lot of newer machines like, for instance, the Compaq iPAQ machine, don't contain any legacy devices. Or you might also see what looks like a terminal server client; these will not have floppy drives, or may have the floppy drive disabled. The repair process does differ between 4.0 and 2000. In Windows NT 4.0 you can kick off the repair process by clicking Start, clicking Run, and typing the command rdisk. If you're on a domain controller, whether a backup or primary domain controller, and you also want to update and back up all of your user and group accounts and everything in your directory database, you would need to add the /s switch to the rdisk command.

Now when you run that on Windows NT 4.0, it will come up and ask whether you want to create an ERD or update your repair information. By default, Windows NT 4.0 stores all of the files that are copied to your ERD floppy under the system folder, which is by default WinNT, in the Repair folder beneath it. So if you look in the WinNT\Repair folder you will see the exact same files that you would see after you've actually created an ERD floppy.

When you run Rdisk, just say that you want to update your repair information, which copies new, updated versions of all the files that it backs up into the WinNT\Repair folder. Then you can either copy them to a network location or to some form of removable drive, or you can just leave them there.

When you run the emergency repair process, you would just specify that you do not have an ERD, and by default it will search the WinNT\Repair folder on your hard drive for the same repair files. If it finds them there, it will run the repair from there, without needing a 3.5-inch ERD floppy disk. So that's about the only thing that I can say for floppy-less machines.

Jason: What causes a Winsrv.dll blue screen, and what are the solutions?

Doug: I believe you are referring to either a STOP 0x00000135 in Winsrv.dll, which is usually caused by a corrupted or invalid software registry hive, or a STOP 0xC0000135 (unable to locate DLL), which is usually caused by a missing or corrupt Winsrv.dll file. To resolve the second STOP, perform a parallel install of Windows NT or 2000 and copy over Kernel32.dll, Ntdll.dll, Win32k.sys, User32.dll, and Winsrv.dll. For more information, please see Knowledge Base article Q173309 "Blue Screen STOP Message C0000135 Appears at Startup".

Jason: Okay. Great. The next question, Will a memory dump still occur if a page file is on a drive other than the system root?

Doug: No, it will not. It might try to, depending on whether you have a pagefile on the system drive at all.

Windows does require you to have at least one pagefile on the SystemRoot drive, but you can set it to the minimum, which is only 2 MB. So if you have a very small C drive that holds only a 5 or 10 MB pagefile out of necessity, with the main pagefile on another drive (be it D, E, F, or whatever partition has much more space), and you want to write a Memory.dmp file, then no, it will not. To write the dump, the system has to have a pagefile on your C partition that is at least the same size as the amount of physical RAM in your machine. For instance, if you have a machine with 1 GB of RAM, you must have a pagefile that is at least 1 GB in size (and I would increase it to at least 12 MB over 1 GB; usually, I would make it approximately a hundred to a couple hundred megabytes above that).

So to answer your question, no it would not. It has to be on C.
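
As a rough illustration of the sizing rule above, the check can be written out in Python. The function name is made up for this sketch, and the 12 MB default margin simply encodes the headroom Doug recommends; the hard requirement he states is a system-drive pagefile at least as large as physical RAM.

```python
MB = 1024 * 1024
GB = 1024 * MB

def pagefile_ok_for_full_dump(ram_bytes: int,
                              system_pagefile_bytes: int,
                              margin_bytes: int = 12 * MB) -> bool:
    """Hypothetical helper: True if the pagefile on the system drive is at
    least as large as physical RAM plus the chosen safety margin."""
    return system_pagefile_bytes >= ram_bytes + margin_bytes

# 1 GB of RAM with a pagefile of exactly 1 GB: meets the bare minimum
# (margin 0) but not the recommended 12 MB of headroom.
print(pagefile_ok_for_full_dump(1 * GB, 1 * GB, margin_bytes=0))
print(pagefile_ok_for_full_dump(1 * GB, 1 * GB))
print(pagefile_ok_for_full_dump(1 * GB, 1 * GB + 100 * MB))
```

Pagefiles on other drives do not count toward this check, which is why only the system-drive pagefile size is passed in.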

Jason: Excellent. We are very interested in your feedback regarding this WebCast program, and you can send us your comments and suggestions using the e-mail alias feedback@microsoft.com. If you use that alias please be sure to include "Support WebCasts" in the subject line.

Jason: And moving on to the next question. Is the blue screen error the same as the "Blue Screen of Death" error?

Doug: Yes (chuckle). The "Blue Screen of Death" has become a common slang term for what I've been talking about today, which is technically what we call a STOP screen or just a blue screen. Because of the nature of these STOP screens, it denotes that the system had a fatal crash, which brings the machine down and renders it inoperable. So that's why, in the IT world, it has come to be known as the "Blue Screen of Death". So, yes, those two are the same thing.

Jason: Okay. Next question, Could you delete ControlSet003 in the registry and expect it to roll back?

Doug: Bear in mind that it's not always going to be ControlSet003. What you need to do is look at the same level as those registry keys, where you will also see a key called Select. Highlight that key; you will have approximately four entries in there, and if you look at the one that says Last Known Good, it will give you a hex value of 0x followed by a number (1, 2, 3, or whatever). If you double-click it, it will show you just the straight number. It will never be 001, because your normal boot mode, which has the updated changes that you made, will be ControlSet001, which becomes CurrentControlSet.

So if you want it to roll back, you would need to find out which ControlSet is slated as your last known good. For instance, say your system is set for ControlSet001 (which is your normal boot mode), you also have ControlSet003, and the Select key at the same level tells you that the last known good is set to 0x3. That means that the last known good configuration is stored under ControlSet003. So, if you do what this person suggests and delete that ControlSet003 key, then no, you would not be able to roll back to a previous state. If you still tried to run the last known good configuration process, you would probably get a hardware profile error.
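
The mapping from the Select key to a control set name can be sketched as follows. This is a minimal illustration in Python, not a registry tool: the value names Current, Default, Failed, and LastKnownGood are assumed to be the "approximately four entries" Doug mentions, the sample numbers are invented, and the helper names are hypothetical.

```python
def control_set_key(select_value: int) -> str:
    """Turn a Select value such as 0x3 into its key name, ControlSet003."""
    return f"ControlSet{select_value:03d}"

def safe_to_delete(candidate: str, select: dict) -> bool:
    """A control set is unsafe to delete if any Select entry points at it.
    A value of 0 is treated here as 'not set' (an assumption for this sketch)."""
    in_use = {control_set_key(v) for v in select.values() if v != 0}
    return candidate not in in_use

# Invented sample of what the Select key might hold.
select = {"Current": 0x1, "Default": 0x1, "Failed": 0x0, "LastKnownGood": 0x3}

print(control_set_key(select["LastKnownGood"]))
print(safe_to_delete("ControlSet003", select))
```

In this sample the last known good configuration lives in ControlSet003, so deleting that key would remove the only state the machine could roll back to.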

Jason: Okay. Another question concerning last known good (LKG). In NT4.0, if you do multiple reboots, how does this affect the LKG? Usually a reboot is the first attempt to bringing the system back, so could you be overriding an LKG that may have been good?

Doug: That's a good question, and that all depends on how far you let your system boot up. What your system considers a "good" configuration is if you actually get to the login screen, press CTRL+ALT+DELETE, enter a valid user account, and actually get to a desktop. Once you get to a desktop, that's when the operating system considers that a "good" booting configuration and writes the changes to the last known good control set.

So, say that you are rebooting your machine multiple times. You let your machine go through the post operation and maybe just get to the boot loader screen, and then you decide to either shut down or restart for some reason, then no, that will not override the previous known good configuration. You would have to actually get the machine booted up, logged in, and to a desktop for your machine to consider it a "good" configuration and override the previous last known good configuration data.

Jason: The Recovery Console does not appear to allow you to access all files and directories on the system partition as well as other partitions. What are its limitations?

Doug: If you look within that article (Q229716 "Description of the Windows 2000 Recovery Console") it does mention that, but from what I understand, it is limited to the drive where your system root is, i.e. your WinNT directory, and you cannot browse freely at the root level of the drive. You can only see the root of your WinNT directory and the subdirectories under it. So you cannot get access to some other directory that is outside of WinNT. That's pretty much the main restriction of the Recovery Console in terms of file data access. It is meant only for recovering the system, so that's why they limited it that way.

Jason: We may have just covered this in another question concerning Last Known Good. If you log on the machine after an incident and the logon freezes or the machine crashes again, then the last known good configuration is no good because you have overridden the current control set. Is this correct?

Doug: No, it's not, because (like I said just before) it has not saved that configuration yet. But it all depends on exactly what state the machine was in. Now, I'm not a domain user profile specialist, but basically the way it works is that when you press CTRL+ALT+DELETE and type your username and password, you might be running a logon script. From there it will kick off Explorer.exe, which will then spawn the GDI and the different graphical displays that launch and contain the shell for your particular user profile. It's at that point that it's loading your user profile and applying all of your settings. From what I understand, it does not save those changes until all the running processes have completed, so it will not overwrite the last known good until you're at your desktop and the desktop has completed loading.

Jason: Where can you go to get the tools to read your own DMP files?

Doug: Like I said before, that is at the URL http://www.microsoft.com/ddk/debugging/. There you would download the debugging tools for Windows NT or Windows 2000, which will install all the tools that you need to debug your own Memory.dmp file.

Jason: Excellent. Is there any way of setting chkdsk /f /r even if you can't get the OS to load?

Doug: Yes. Using the Recovery Console, you would run the chkdsk command with the /p and /r switches, which is the same as running chkdsk /f /r under the OS. Please see Knowledge Base article Q229716 "Description of the Windows 2000 Recovery Console" for more information.

Jason: What is the maximum memory dump in Win 2000?

Doug: Windows 2000, I believe, is limited to 4 GB of RAM, the same as Windows NT 4.0.

Now, you can get around that, but only under very specific circumstances. You can have memory dumps that go up to 8 GB in size, but only if you're running either Windows 2000 Advanced Server with what's called the /PAE switch, or if you're using, for instance, Datacenter Server (and some form of either cluster or whatever, using the PAE functionality). And that should apply to very few people. But for all intents and purposes, Windows 2000 and Windows NT 4.0 are both stuck at 4 GB for Memory.dmp file.

Jason: Okay, great. The next question we have: Is it possible to clone a PC with Recovery Console installed?

Doug: No. That is what I was referring to when I mentioned Sysprep. When you Sysprep a machine, you configure it exactly how you want it; you run Sysprep, which strips the machine of any uniquely identifying indicators, and then you clone that machine. After the cloning is done, the target machine runs the mini Setup wizard to uniquely identify that machine with its machine name and identifiers such as SIDs or GUIDs. And when you install the Recovery Console, it reads the disk and drive configuration information every time it installs. So when you clone your machine, if the target is not exactly identical, the Recovery Console will not work (and machines are almost always different). When I say "identical" I don't mean just the partition sizes, but also the physical size of the hard drives. So, no, you cannot pre-stage or pre-install the Recovery Console. I wish you could. It would make life easier. But unfortunately, it was not written that way.

Jason: Okay. How can I identify the correct set of symbols, and what is the best way to make them available to the kernel debugger?

Doug: The symbol files for the base OS, whether the base install of NT or of 2000, can be obtained from either the retail CD-ROM or the URL that I mentioned before. You can then download and install the particular symbol files for whatever service pack or hot fixes you have installed on your machine. There are some debugging commands that you can use to verify the installation of your current symbol files.

The way that you tell the Windows kernel debugger about the symbol files and where they're located is through a system environment variable. Before you run the kernel debugger, there are approximately two to four system environment variables that have to be specified before you can run the kernel debugger and have it initialize successfully.

But I would see the Q article that I mentioned before for more information on debugging, and that will outline all the processes and commands that you can use for checking the symbol file status and that they're installed correctly.

Jason: Okay. Does each service pack contain updated symbols for debugging?

Doug: Yes. Each service pack will have its own updated symbol file.

Jason: If you haven't had a chance to submit us some feedback, we would really be interested in your feedback regarding today's session. Again, you can always send them to the Microsoft alias, feedback@microsoft.com. And make sure that you include "Support WebCast" in the subject line.

Our next question, MPS Reports utility, is it available for download?

Doug: I wish it was, but, no, it is not. The MPS Reports utility is only available from a Microsoft Support Professional.

Jason: Okay. Next question, "What are the steps in maintenance to test the drivers before they are loaded? Can you briefly explain a process of how to do this?"

Doug: Basically, what I mean by "test what you're going to be upgrading/installing" would be to "mirror" your machine's configuration in your test or lab environment: install all the software that's on your production machine and put it on the same type of network. (This might be constrained by the hardware you have available or the budgets that are available to different companies.) But if you can, try to mirror your production environment as much as possible so that you get a better test. Then, to actually test it, go ahead and upgrade the device driver in your test environment, which should not be connected to, or in any way, shape, or form able to reach, your production environment; it should be its own little stand-alone network or machine. If you want to test a procedure like a disaster recovery procedure, you then have a test machine that you can use, so you can go ahead and test it, break it, and attempt recovery. Or, if you have an updated driver that you want to test, you can make sure that it will work on your machine. That way, if it doesn't work, there's no harm done and you know not to install it on your production machine or in your production environment.

Jason: "Can you explain the structure of the parameters that are of importance for each STOP code: format, etc.?"

Doug: That's a very broad question. If you look back through the previous slides, I do tell you what the most important parameters are, but that's way more information than I can go through now. But, basically, every single STOP code is going to say STOP <0x and then an eight-digit hexadecimal status code> (for instance, STOP A, STOP 1E), and then it will say open parenthesis [(], and then it will list four more of these bug check parameters (each will say 0x and then eight hexadecimal digits; there will be four of those). And then it will be closed parenthesis [)], and then below that it will give you a STOP code description, which will be the actual text you will see. For instance, KMODE_EXCEPTION_NOT_HANDLED, IRQL_NOT_LESS_OR_EQUAL, stuff like that.

But for what each parameter means for every different STOP code, you would need to see the Microsoft Knowledge Base and the resource kit, which will tell you exactly what every parameter is and what it means, and explain in more detail every parameter for all of the most common STOP codes.
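
The format Doug describes (a STOP code followed by four parameters in parentheses, each written as 0x plus eight hex digits) can be picked apart with a regular expression. This is a sketch under assumptions: the sample line is invented, and real screens may differ in spacing or surrounding text.

```python
import re

HEX8 = r"0x[0-9A-Fa-f]{8}"

# STOP, an optional colon, the code, then four comma-separated parameters.
STOP_LINE = re.compile(
    rf"STOP:?\s*({HEX8})\s*\(\s*({HEX8})\s*,\s*({HEX8})\s*,\s*({HEX8})\s*,\s*({HEX8})\s*\)"
)

def parse_stop_line(line: str):
    """Return (stop_code, [p1, p2, p3, p4]) as integers, or None if no match."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    code, *params = (int(group, 16) for group in match.groups())
    return code, params

# Invented sample line for illustration only.
sample = "*** STOP: 0x0000000A (0x00000004,0x00000002,0x00000000,0xFE8592F0)"
code, params = parse_stop_line(sample)
print(f"STOP {code:X}, fourth parameter 0x{params[3]:08X}")
```

Parsing only recovers the numbers. What each parameter means still has to be looked up per STOP code in the Knowledge Base or the resource kit.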

Jason: I'm usually seeing Dr. Watson application error, referencing STOP code 005. Can you please explain?

Doug: That's a user-mode error. The 0xC0000005 is what's called an Access Violation. I never really get into user-mode debugging. That's really outside the scope of a STOP error, because technically that's a Dr. Watson user-mode error. That is not an actual blue screen STOP error.

But, like I said, a 005 is an access violation, which is usually because the particular application somewhere in its code tried to do something (process an API or do a system call) that was either not allowed by the operating system or was incompatible with the currently installed other software on the machine, or is in conflict with device drivers or other third-party applications. But for all intents and purposes, those are really outside the scope of the actual blue screen STOP errors.

Jason: Okay, great. Why did MS remove the stack flow list from the Win 2000 STOP screen?

Doug: Since I wasn't here in the development meetings when they were coding Windows 2000, I cannot answer that. However, I can say that probably 95 percent of the time that we get a Memory.dmp file, someone who is proficient in kernel-mode debugging has to take a look at it to discern what the cause of the crash is. With a stack trace, nobody wanted to sit there and write out all the numerous columns of codes; users didn't know what they were or which codes were important, so they had to copy the entire screen. It was probably much more useful to users to list the parameters most important to support professionals and then give some suggested resolutions, instead of just listing a stack trace. So, I think that was probably deemed more valuable to the user community than the stack trace, since so few people besides Microsoft Support Professionals actually do kernel-mode debugging.

Jason: Do any or all of the methods in today's presentation apply to Windows Millennium?

Doug: Windows Millennium does not give STOP screens in the traditional sense that we're talking about in Windows NT and 2000. So, the generic processes, in terms of testing and checking the Hardware Compatibility List, things like that, are applicable. But all the rest of them, for instance, the ERD (the Emergency Repair Disk) or different things like that, do not apply to Windows Me, Windows 98 or Windows 95.

You can try a parallel install of the OS, but Windows 98 and Windows Me also provide other troubleshooting tools, for instance, Winver and Sysinfo. If you right-click My Computer and click Properties (I forget exactly where), I believe you can go under the system configuration and select your startup and hardware configuration parameters; you can choose what options and what part of the system startup environment you want to load, for instance, all the entries in your System.ini, Win.ini, Autoexec.bat, and Config.sys. So, I would recommend, for Windows Me, using those types of troubleshooting tools.

Also in Windows 98 and Windows Millennium, you can use the System Restore utility, which will basically roll your machine back. The system automatically – every day, every week, whatever is specified outside of default – makes backup configurations of your system. So, that way, if you want to roll back your machine to when it was working two weeks ago or a week ago, you can run that and it will restore your system. So, it's another troubleshooting step for Windows 98 or Windows Millennium. But that's technically out of the scope of today's topic.

Jason: The next question we have, How can you get the dump and look at it if you were never able to get into the OS? That is, the system keeps cycling through and crashes and tries to come up.

Doug: The way that you would get around that, whether it's Windows NT or 2000, is that you would never actually look at the dump on the problem system. What you would do is get the Memory.dmp file off the machine and either onto another machine for analysis, or up to a Microsoft Support Professional for analysis.

You would install a parallel installation of your operating system, be it Windows NT or 2000, and then from there you can either copy it off to a removable type storage device or just install networking support in the parallel (just making sure that when you install the parallel you give your machine name a different name from your production install). Then you can get it on the network within your domain and you're able to copy the Memory.dmp file either to another machine or to a network share location. From there, you can either analyze it yourself or get it up to a Microsoft Support Professional for analysis.

Jason: Great. Is there a Web site from which to download the earlier mentioned Microsoft utility that will gather the PSTAT, event log, and other information?

Doug: Unfortunately, no, there's not. It is only obtained, just like the slide denotes, from a Microsoft Support Professional when you call in an incident (when you have an issue with your machine).

It is usually one of the first troubleshooting steps we take to try to discern the causes of STOP codes or of other problems, such as printing problems. There are different versions of the utility for different aspects or topics of problems with your machine.

Jason: I know you can script the ERD creation in NT 4.0. Can you do the same with a system state in 2000?

Doug: You can with a system state, and by default it does update the RegBack folder under the \Repair folder in 2000. You don't have an actual floppy in 2000; well, you do, but it only contains about two files, which basically contain pointers to the system state, which is usually a couple hundred MB in size on your machine's hard drive under the \Repair folder.

You can script the repair process. You can script the system state backup, and in effect have the system state backup stand in for the ERD. But there is no command-line equivalent (in 2000) of going into the 2000 backup tool and choosing Create Emergency Repair Disk. There is no switch for the Ntbackup.exe command line to create an Emergency Repair Disk, short of backing up the system state.

Jason: Where can I find more information about the parameters of the various STOP codes?

Doug: You can find more information in different Q articles in the Microsoft Knowledge Base that are specific to each particular blue screen. A much better, and easier, reference will be the Microsoft Windows NT 4.0 and Windows 2000 Resource Kits that are available either by download or through the TechNet subscription. You can also buy them in hard copy from any local IT bookseller, or through Microsoft Press.

Jason: Can I differentiate between a hardware or software crash from the error number?

Doug: Not with 100 percent certainty. It usually depends on the actual STOP code and its parameters, but for all intents and purposes, no, you cannot.

The only one for which you really can determine the nature of the crash is a STOP 2E, which basically is a problem with the hardware memory. Other than that, no, there's not a certain number or parameter that will tell you whether it's a hardware or a software problem. That can only really be done through the debugging process.

You can make an "educated guess" (and probably a Microsoft Support Professional or a very seasoned support person could), just from your previous experiences with the different OSs and different problems. Or, you can guess that it will be hardware if you just made a recent hardware change. Or, if you just changed out the processor, just added memory, just did something with the partitions or disks, then that might lead you to believe that it might be a hardware-related issue. But there is no 100% guaranteed way to discern whether the nature of the crash is either a hardware or a software issue, short of debugging.

Jason: What is the best application to use to view a full dump file?

Doug: There are different ones, and it's really up to the person doing the debugging. The main one is I386kd.exe, which is just a kernel-mode debugger. There's also a graphical utility, which is probably more popular, called WinDBG; it does user-mode and kernel-mode debugging in a graphical window, unlike I386kd, which is command line only.

Jason: Could you clarify what a machine check exception, BSOD, is pointing to as the probable cause? It appears to be a hardware failure. Is this correct?

Doug: I think I understand this person's question: "When you receive a STOP screen, do the parameters or different character descriptions (for instance, EXCEPTION_NOT_HANDLED), point to a probable cause, for instance, a hardware issue?"

Not really. It kind of falls under the same umbrella that I just gave in my last answer; there's no real way to tell whether you're getting a hardware error or a software error. It doesn't list any different descriptions.

You can search the Microsoft Knowledge Base with your particular parameters, usually leaving out a specific memory address reference, because the memory address is going to be different for every machine on every boot; it's going to load in a different memory range. So, when you query it in Microsoft TechNet or the Microsoft Knowledge Base, query on the major parameters without the actual memory address and you might be able to find some more information on what are probable causes.

I went over, in the previous slides, what are the common STOP codes, the major things to check, and the major causes. So, that question is really hard to answer for certain, but that's what I would recommend doing.

Jason: Next question, Is the memory.dmp file valid only for a blue screen incident, or can it be manually generated and used to troubleshoot any type of fault? If so, where do we go for information on how to do that?

Doug: I covered this earlier in my presentation. Yes, you can. I'm not certain that it would be relevant for every type of error that you're having, for instance, printing or application problems, and so on (because I'm not an expert on all forms of errors).

So, I can't say for certain, but you can force the creation of a Memory.dmp (only in 2000) by following slide 16 of my presentation. There are two registry keys that need to be set. There's also one other check box that needs to be set. Look on slide 16 for the two configuration options, and then you would hold down the right CTRL key and press the SCROLL LOCK key twice to generate a Memory.dmp file.

Jason: We are at the last question in the queue. You might not be able to answer this, Doug, because this product has not been released yet, but, "Does this information apply to Windows XP?"

Doug: Yes, it does. Between Windows 2000 and Windows XP, they really haven't made any changes in terms of the memory dump creation process. It would pretty much be the same as with Windows 2000. So, just to answer that question directly without going into too much detail (because it's still in beta and things might change), as far as it stands right now I do not know of any major changes to the Startup and Recovery options. So, yes, to answer your question, those options and different recovery steps would still apply.

Jason: That does wrap up all of the questions we have in the queue for today. I want to thank everyone for joining us. I do hope the information was useful to you. And again, we are very interested in your feedback. If you want to send us some feedback after the broadcast is over, send it to the e-mail alias, feedback@microsoft.com, and include "Support WebCast" in the subject line.

Thanks again for joining us, and good-bye.

Source: http://support.microsoft.com/