Digging Into A macOS Kernel Panic

TL;DR (2021-06-09)

AppleSMC is a kernel module that communicates with the SMC (System Management Controller). The SMC is basically an Apple co-processor with its own firmware, used to manage the upper-layer system, including disk encryption, etc.

I was able to dig into AppleSMC's backtrace and locate the spot where it causes the watchdog to time out.

In the backtrace, AppleSMC is stuck in a dead loop, which eventually causes the watchdog to time out and throw a panic at you.

The panic itself doesn't mean much: it's just the watchdog saying something went wrong, and that "something" comes from AppleSMC.

AppleSMC just enters the dead loop and waits for the watchdog to time out. I don't know what error it receives from the SMC firmware.

background

i bought a macbook pro (16", 2019) in May after my XPS died (motherboard fried, maybe?). personally i don't like apple at all, but fuck it, i had never used a mac before, so why not? i really love trying new things

overall i'm satisfied with my macbook, except for some stupid issues, one of which is what i'm going to dig into in this article

shutdown/restart hang (caused by a kernel panic)

i found many users have encountered (almost) the same panic: https://forums.macrumors.com/threads/constant-kernel-panics-userspace-watchdog-timeout-no-successful-checkins-from-com-apple-windowserver.2222878/

also on reddit: it's exactly the same panic

if you look at the posts you'll find they are actually having different problems, except that all of them lead to the same watchdog timeout

my panic report looks like this:

panic(cpu 2 caller 0xffffff7f881a1aae): watchdog timeout: no checkins from watchdogd in 302 seconds (11 totalcheckins since monitoring last enabled), shutdown in progress
Backtrace (CPU 2), Frame : Return Address
0xffffff83b0783c40 : 0xffffff800771f5cd
0xffffff83b0783c90 : 0xffffff8007858b05
0xffffff83b0783cd0 : 0xffffff800784a68e
0xffffff83b0783d20 : 0xffffff80076c5a40
0xffffff83b0783d40 : 0xffffff800771ec97
0xffffff83b0783e40 : 0xffffff800771f087
0xffffff83b0783e90 : 0xffffff8007ec2838
0xffffff83b0783f00 : 0xffffff7f881a1aae
0xffffff83b0783f10 : 0xffffff7f881a1486
0xffffff83b0783f50 : 0xffffff7f881b6d9c
0xffffff83b0783fa0 : 0xffffff80076c513e
      Kernel Extensions in backtrace:
         com.apple.driver.watchdog(1.0)[B435C72B-B311-3C67-8AA1-1D5CE0FAD429]@0xffffff7f881a0000->0xffffff7f881a8fff
         com.apple.driver.AppleSMC(3.1.9)[4589419D-7CCC-39A9-9E2F-F73FE42DD902]@0xffffff7f881a9000->0xffffff7f881c7fff
            dependency: com.apple.driver.watchdog(1)[B435C72B-B311-3C67-8AA1-1D5CE0FAD429]@0xffffff7f881a0000
            dependency: com.apple.iokit.IOACPIFamily(1.4)[0A7D7382-66FE-391B-9F93-97A996256C25]@0xffffff7f88109000
            dependency: com.apple.iokit.IOPCIFamily(2.9)[BE052F4D-9B80-3FCD-B36D-BACB7DEE0DF2]@0xffffff7f88112000

every time the hang happens, it eventually gives me a backtrace like this
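
to see which frames belong to which kext, you can check each return address against the load ranges listed under "Kernel Extensions in backtrace". here is a minimal python sketch of that check. the function names (parse_kext_ranges, owner_of) are made up for illustration, and treating every address outside the listed ranges as "kernel" is my assumption.

```python
import re

# kext load ranges, copied from the "Kernel Extensions in backtrace" section
KEXT_LINES = """
com.apple.driver.watchdog(1.0)[B435C72B-B311-3C67-8AA1-1D5CE0FAD429]@0xffffff7f881a0000->0xffffff7f881a8fff
com.apple.driver.AppleSMC(3.1.9)[4589419D-7CCC-39A9-9E2F-F73FE42DD902]@0xffffff7f881a9000->0xffffff7f881c7fff
"""

# return addresses copied from the backtrace, first listed frame first
RETURN_ADDRESSES = [
    0xffffff800771f5cd, 0xffffff8007858b05, 0xffffff800784a68e,
    0xffffff80076c5a40, 0xffffff800771ec97, 0xffffff800771f087,
    0xffffff8007ec2838, 0xffffff7f881a1aae, 0xffffff7f881a1486,
    0xffffff7f881b6d9c, 0xffffff80076c513e,
]

# name(version)[UUID]@start->end
KEXT_RE = re.compile(
    r"^(?P<name>[\w.]+)\([^)]*\)\[[0-9A-F-]+\]"
    r"@(?P<start>0x[0-9a-f]+)->(?P<end>0x[0-9a-f]+)$"
)


def parse_kext_ranges(text):
    """Return a list of (name, start, end) tuples parsed from the report lines."""
    ranges = []
    for line in text.splitlines():
        match = KEXT_RE.match(line.strip())
        if match:
            ranges.append(
                (match["name"], int(match["start"], 16), int(match["end"], 16))
            )
    return ranges


def owner_of(address, ranges):
    """Name of the kext whose range contains the address, else "kernel"."""
    for name, start, end in ranges:
        if start <= address <= end:
            return name
    return "kernel"  # assumption: not in a listed kext means the kernel itself


if __name__ == "__main__":
    kext_ranges = parse_kext_ranges(KEXT_LINES)
    for addr in RETURN_ADDRESSES:
        print(f"{addr:#018x}  {owner_of(addr, kext_ranges)}")
```

with the report above, the two 0xffffff7f881a1... frames land in com.apple.driver.watchdog and 0xffffff7f881b6d9c lands in com.apple.driver.AppleSMC. everything else falls outside both ranges.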

i called customer service, and the friendly tech support suggested that i reinstall the system (since i couldn't figure out how to reproduce the panic). so i erased the partition table and started from scratch, and the hang happened again after a few shutdowns

i was really upset, but since i can't reproduce the panic, i can't even ask apple to fix it

then i decided to look into the panic myself

the backtrace

at first glance the panic report tells me something might be wrong with the SMC. i followed apple's guide to reset the SMC, without luck

apple diagnostics reports no issue

so it must be in the kernel, right?

let's start reading the backtrace

reading up on the call stack will give you a basic understanding of how kernel subroutines get executed

so how is it executed? just read the backtrace from bottom to top

first things first: the panic report only gives me the related kernel extensions. to find out exactly what code is causing the panic, we have to get symbols from the kernel (and its extensions)

we don't need to download anything to debug a problem like this. the kernel is already installed on our machine, so just copy it, along with all the related extensions:

/System/Library/Kernels/kernel # kernel
/System/Library/Extensions # extensions

follow https://www.repleo.nl/wordpress/symbolicate-os-x-kernel-panics-using-lldb/ to locate symbols

the kernel slide and the AppleSMC text base address can be found in the panic report, like this:

Kernel slide:     0x0000000007400000

com.apple.driver.AppleSMC(3.1.9)[4589419D-7CCC-39A9-9E2F-F73FE42DD902]@0xffffff7f881a9000->0xffffff7f881c7fff
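
the arithmetic behind those two numbers is simple: subtract the kernel slide from a kernel return address to get the address in the on-disk kernel, and subtract the kext load address from a kext return address to get its offset inside the kext. a small python sketch of that, using addresses from my panic report. the helper names (unslide, kext_offset) are mine and not part of lldb, and whether the offset maps one-to-one onto the kext file depends on the binary's own base address, which i'm assuming here.

```python
KERNEL_SLIDE = 0x0000000007400000    # "Kernel slide" from the panic report
APPLESMC_BASE = 0xffffff7f881a9000   # AppleSMC load address from the panic report


def unslide(address, slide=KERNEL_SLIDE):
    """Runtime kernel address -> address in the on-disk (unslid) kernel."""
    return address - slide


def kext_offset(address, base=APPLESMC_BASE):
    """Runtime address inside a kext -> offset from the kext's load address."""
    return address - base


# first kernel frame in the backtrace
print(hex(unslide(0xffffff800771f5cd)))      # 0xffffff800031f5cd
# the AppleSMC frame in the backtrace
print(hex(kext_offset(0xffffff7f881b6d9c)))  # 0xdd9c
```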

[screenshot: symbolicated backtrace in lldb]

in this screenshot we can see more clearly what's going on: the backtrace is now symbolicated

let's reverse some code

as far as i know the AppleSMC kext is proprietary, so the only way to know what it does is to reverse it

from lldb, we don't see anything worth reversing, it's just a watchdog timeout call

anyway, the AppleSMC module seems interesting, so let's reverse it

after loading the AppleSMC binary, we have to rebase it so it matches the runtime address: Edit > Segments > Rebase program

press G, then paste the address 0xffffff7f881b6d9c, and we get:

__int64 __fastcall SMCWatchDogTimer::watchdogThread(SMCWatchDogTimer *this)
{
  char v1; // al
  int v2; // ecx
  thread_act_t v3; // eax
  unsigned int v4; // eax
  uint64_t v5; // rbx
  SMCWatchDogTimer *v6; // rdi
  integer_t policy_info; // [rsp+8h] [rbp-38h]
  int v9; // [rsp+Ch] [rbp-34h]
  int v10; // [rsp+10h] [rbp-30h]
  int v11; // [rsp+14h] [rbp-2Ch]
  char v12; // [rsp+18h] [rbp-28h]
  __int64 v13; // [rsp+20h] [rbp-20h]

  v1 = IOWatchdogmacOS::check_coprocessor_system(this, 0LL); // maybe it's related to T2 chip?
  v2 = 30;
  if ( v1 )
    v2 = 60;
  *((_DWORD *)this + 63) = v2;
  nanoseconds_to_absolutetime(1000000LL, &v12);
  clock_interval_to_absolutetime_interval(10LL, 1000000000LL, &v13);
  policy_info = v13;
  v10 = v13;
  v9 = *(_DWORD *)&v12;
  v11 = 1;
  v3 = current_thread();
  v4 = thread_policy_set(v3, 2u, &policy_info, 4u);
  if ( !v4 )
  {
    while ( 1 )
    {
      IOLockLock(*((_QWORD *)this + 29));
      v5 = mach_absolute_time();
      if ( !*((_BYTE *)this + 248) )
        IOWatchdog::checkWatchdog(this);
      IOLockUnlock(*((_QWORD *)this + 29)); // `0xffffff7f881b6d9c`
      clock_delay_until(v13 + v5);
    }
  }
  v6 = (SMCWatchDogTimer *)v4;
  SMCWatchDogTimer::watchdogThread(v4);
  return SMCWatchDogTimer::extendWatchdog(v6);
}

I don't think the while (1) loop is ever going to end. When thread_policy_set succeeds (it returns 0, i.e. KERN_SUCCESS), it enters the loop and never breaks out.

v13 represents 10 sec I think (an interval of 10 scaled by 1000000000 ns). Basically this loop takes an IOLock mutex, feeds the watchdog (unless the byte at the timer pointer plus 248 is set), releases the lock, and finally sleeps for that interval before repeating.
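
A quick sanity check of the two time values in the pseudocode, reading the arguments of clock_interval_to_absolutetime_interval as (interval, scale factor in nanoseconds, result). This is only a Python model of the arithmetic. It assumes one absolute-time unit equals one nanosecond, which I have not verified on this machine, and the helper name interval_to_ns is mine.

```python
NSEC_PER_SEC = 1_000_000_000


def interval_to_ns(interval, scale_factor_ns):
    """Model of clock_interval_to_absolutetime_interval(interval, scale, &out).

    Assumption: absolute-time units are nanoseconds, so no timebase conversion.
    """
    return interval * scale_factor_ns


# nanoseconds_to_absolutetime(1000000LL, &v12);
v12_ns = 1_000_000
# clock_interval_to_absolutetime_interval(10LL, 1000000000LL, &v13);
v13_ns = interval_to_ns(10, 1_000_000_000)

print(f"v12 = {v12_ns / NSEC_PER_SEC} s")  # 1 millisecond
print(f"v13 = {v13_ns / NSEC_PER_SEC} s")  # the delay added in clock_delay_until(v13 + v5)
```

Under that assumption v12 is 1 ms and v13 is 10 s, so the loop wakes up roughly every 10 seconds.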

if (!*((_BYTE*)this + 248))
    IOWatchdog::checkWatchdog(this);
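
The casts in the pseudocode hide the real byte offsets, because pointer arithmetic is scaled by the size of the cast type. A small Python sketch that converts the three expressions used in watchdogThread into byte offsets from the start of the object. The field descriptions are my guesses from how the values are used, not names from Apple.

```python
# element sizes of the decompiler's cast types, in bytes
CAST_SIZES = {"_BYTE": 1, "_WORD": 2, "_DWORD": 4, "_QWORD": 8}


def byte_offset(cast_type, index):
    """Byte offset of *((cast_type *)this + index) from the start of the object."""
    return CAST_SIZES[cast_type] * index


# (expression in the pseudocode, cast type, index, guess at the field's role)
FIELDS = [
    ("*((_QWORD *)this + 29)", "_QWORD", 29, "pointer passed to IOLockLock"),
    ("*((_BYTE *)this + 248)", "_BYTE", 248, "flag that skips checkWatchdog"),
    ("*((_DWORD *)this + 63)", "_DWORD", 63, "value set to 30 or 60"),
]

for expr, cast_type, index, guess in FIELDS:
    offset = byte_offset(cast_type, index)
    print(f"{expr:<26} -> offset {offset} ({offset:#x}): {guess}")
```

So the lock pointer sits at offset 232, the flag byte at 248, and the 30/60 value right after it at 252.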

When it stops feeding the watchdog, the watchdog times out after about 5 min (302 seconds in the panic report above), hence the panic.
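
To make the feed-or-panic idea concrete, here is a toy simulation in Python. It is not how the real watchdog kext works internally (I don't know that). The 302 second timeout comes from the panic message, the 10 second feed interval is my reading of the decompiled loop, and the class and function names are invented for this sketch.

```python
class Watchdog:
    """Toy watchdog: fires if nobody checks in for `timeout` seconds."""

    def __init__(self, timeout, now=0):
        self.timeout = timeout
        self.last_checkin = now

    def checkin(self, now):
        self.last_checkin = now

    def expired(self, now):
        return now - self.last_checkin >= self.timeout


def simulate(timeout, feed_interval, stop_feeding_at, horizon):
    """Return the second at which the watchdog fires, or None if it never does.

    The feeder checks in every `feed_interval` seconds until `stop_feeding_at`,
    then goes quiet (standing in for the feeding thread no longer doing its job).
    """
    dog = Watchdog(timeout)
    for now in range(horizon + 1):
        if now % feed_interval == 0 and now < stop_feeding_at:
            dog.checkin(now)
        if dog.expired(now):
            return now
    return None


# feeder goes quiet after 60 s: last check-in at t=50, panic 302 s later
print(simulate(timeout=302, feed_interval=10, stop_feeding_at=60, horizon=1000))
# feeder never stops within the simulated window: no panic
print(simulate(timeout=302, feed_interval=10, stop_feeding_at=10_000, horizon=1000))
```

The first call returns 352 and the second returns None: as long as the check-ins keep coming the timer never fires, and once they stop the panic is only a matter of waiting out the timeout.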

what's a watchdog

from man watchdogd

watchdogd ensures that the system is healthy and able to make forward progress throughout the system lifecycle. If watchdogd or the Watchdog KEXT determine that the system is unhealthy they will attempt to take corrective action and ultimately may panic the system to get it back to a usable state.

from https://en.wikipedia.org/wiki/Watchdog_timer

For example, in the case of the Linux operating system, a user-space watchdog daemon may simply kick the watchdog periodically without performing any tests. As long as the daemon runs normally, the system will be protected against serious system crashes such as a kernel panic. To detect less severe faults, the daemon can be configured to perform tests that cover resource availability (e.g., sufficient memory and file handles, reasonable CPU time), evidence of expected process activity (e.g., system daemons running, specific files being present or updated), overheating, and network activity, and system-specific test scripts or programs may also be run.

Upon discovery of a failed test, the Linux watchdog daemon may attempt to perform a software-initiated restart, which can be preferable to a hardware reset as the file systems will be safely unmounted and fault information will be logged. However it is essential to have the insurance of the hardware timer as a software restart can fail under a number of fault conditions. In effect, this is a dual-stage watchdog with the software restart comprising the first stage and the hardware reset the second stage.

from my understanding, SMCWatchDogTimer somehow stopped kicking the watchdog (at IOLockUnlock()). the watchdog then decides AppleSMC might be dead, initiates a panic, reboots, and shows the panic report to the user

jm33_ng