Monday, November 03, 2014

Debugging Bug Check (BSOD) 0x101 CLOCK_WATCHDOG_TIMEOUT in a Hypervisor/VMM Environment

I was planning to write a post on debugging Bug Check 0x101 (CLOCK_WATCHDOG_TIMEOUT) on Windows, but I happened to find the blog post "Debugging a CLOCK_WATCHDOG_TIMEOUT Bugcheck" from the MSFT debugger team, which explains it in great detail. However, the issue we met is slightly different from the one the MSFT team was debugging. We are working in a virtualization/hypervisor environment, with Windows (7+) running as the primary guest OS.

Basically, according to MSFT, bugcheck 0x101 occurs when the clock interrupt (IRQL 28 on x86) has not been processed by every processor within a timeout. The clock interrupt sits quite high in the x86 IRQL table; the inter-processor interrupt (IPI, IRQL 29) is one level above it.

In our case, however, things are a little different, although I won't go into all the details here.

In uniprocessor mode, the system simply hangs when the issue occurs. In SMP mode, the system usually shows the 0x101 BSOD after a very short freeze, but sometimes it hangs as well.

The root cause I eventually found is an infinite loop (deadloop) in the hypervisor. When this loop occurs on the BSP (the CPU spins endlessly in host VMX root mode), the symptom is a hang of the guest Windows OS without the 0x101 CLOCK_WATCHDOG_TIMEOUT BSOD. When it occurs on any one of the APs (Application Processors), the clock watchdog timeout Blue Screen of Death appears very soon.

This is because when one of the processors runs endlessly in VMX root mode, it never returns to the guest, so the clock interrupt (IRQL 28) is not processed by that processor within the timeout. The BSP then initiates the 0x101 bugcheck and sends IPIs to the other processors so that the system state can be dumped and the processors put into a shutdown state.
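The behaviour described above can be illustrated with a small toy model. This is only a sketch of my explanation, not how the Windows kernel actually implements its watchdog: the names `Cpu`, `simulate`, `stuck_in_vmx_root` and the tick counts are made up for illustration, and it assumes that processor 0 is the BSP and that only the BSP performs the check.

```python
from dataclasses import dataclass

CLOCK_WATCHDOG_TIMEOUT = 0x101
BSP = 0  # assumption: processor 0 is the bootstrap processor


@dataclass
class Cpu:
    index: int
    stuck_in_vmx_root: bool = False  # hypothetical flag: hypervisor never resumes the guest
    missed_ticks: int = 0


def simulate(num_cpus, stuck_cpu=None, timeout_ticks=4, max_ticks=20):
    """Toy model of the symptom: returns a string describing what the guest shows."""
    cpus = [Cpu(i, stuck_in_vmx_root=(i == stuck_cpu)) for i in range(num_cpus)]
    for tick in range(1, max_ticks + 1):
        for cpu in cpus:
            if cpu.stuck_in_vmx_root:
                cpu.missed_ticks += 1   # clock interrupt is never serviced in the guest
            else:
                cpu.missed_ticks = 0    # clock interrupt handled normally
        # Assumption from the post: the BSP runs the check. If the BSP itself
        # is stuck in VMX root mode, nobody is left to raise the bugcheck.
        if cpus[BSP].stuck_in_vmx_root:
            continue
        for cpu in cpus:
            if cpu.missed_ticks >= timeout_ticks:
                return (f"bugcheck 0x{CLOCK_WATCHDOG_TIMEOUT:X} "
                        f"(CPU {cpu.index} unresponsive after {tick} ticks)")
    if any(cpu.stuck_in_vmx_root for cpu in cpus):
        return "hang (no bugcheck raised)"
    return "healthy"


if __name__ == "__main__":
    print(simulate(4, stuck_cpu=2))  # deadloop on an AP
    print(simulate(4, stuck_cpu=0))  # deadloop on the BSP
    print(simulate(1, stuck_cpu=0))  # uniprocessor case
```

With a stuck AP the model reports the 0x101 bugcheck after the timeout, while a stuck BSP (including the uniprocessor case, where the only CPU is the BSP) ends in a silent hang, which matches the symptoms we observed.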
