TL;DR (2021-06-09)
AppleSMC is a kernel module that communicates with the SMC (System Management Controller). The SMC is basically an Apple co-processor with its own firmware, used to manage the upper layer system, including disk encryption, etc.
I was able to dig into AppleSMC's backtrace and locate the spot where it causes the watchdog to time out.
In the backtrace AppleSMC is stuck in an infinite loop, which eventually causes the watchdog to time out and throw you a panic.
The panic itself doesn't mean anything, it's just the watchdog saying something went wrong; the "something" is from AppleSMC.
AppleSMC just enters the infinite loop, apparently waiting for the watchdog to time out. I don't know what error it receives from the SMC firmware.
background
i bought a macbook pro (16", 2019) in May as my XPS died (motherboard fried maybe?). personally i dont like apple at all, but fuck it, i had never used a mac before, so why not? i really love to try new things
from my experience i am satisfied with my macbook, except for some stupid issues, one of which is what im going to dig into in this article
shutdown/restart hang (caused by a kernel panic)
i found many users have encountered (almost) the same panic: https://forums.macrumors.com/threads/constant-kernel-panics-userspace-watchdog-timeout-no-successful-checkins-from-com-apple-windowserver.2222878/
Also on reddit: it's the exact same panic
if you look at the posts you will find they are actually having different problems, except all the problems lead to the same watchdog timeout
my panic report looks like this:
panic(cpu 2 caller 0xffffff7f881a1aae): watchdog timeout: no checkins from watchdogd in 302 seconds (11 totalcheckins since monitoring last enabled), shutdown in progress
Backtrace (CPU 2), Frame : Return Address
0xffffff83b0783c40 : 0xffffff800771f5cd
0xffffff83b0783c90 : 0xffffff8007858b05
0xffffff83b0783cd0 : 0xffffff800784a68e
0xffffff83b0783d20 : 0xffffff80076c5a40
0xffffff83b0783d40 : 0xffffff800771ec97
0xffffff83b0783e40 : 0xffffff800771f087
0xffffff83b0783e90 : 0xffffff8007ec2838
0xffffff83b0783f00 : 0xffffff7f881a1aae
0xffffff83b0783f10 : 0xffffff7f881a1486
0xffffff83b0783f50 : 0xffffff7f881b6d9c
0xffffff83b0783fa0 : 0xffffff80076c513e
Kernel Extensions in backtrace:
com.apple.driver.watchdog(1.0)[B435C72B-B311-3C67-8AA1-1D5CE0FAD429]@0xffffff7f881a0000->0xffffff7f881a8fff
com.apple.driver.AppleSMC(3.1.9)[4589419D-7CCC-39A9-9E2F-F73FE42DD902]@0xffffff7f881a9000->0xffffff7f881c7fff
dependency: com.apple.driver.watchdog(1)[B435C72B-B311-3C67-8AA1-1D5CE0FAD429]@0xffffff7f881a0000
dependency: com.apple.iokit.IOACPIFamily(1.4)[0A7D7382-66FE-391B-9F93-97A996256C25]@0xffffff7f88109000
dependency: com.apple.iokit.IOPCIFamily(2.9)[BE052F4D-9B80-3FCD-B36D-BACB7DEE0DF2]@0xffffff7f88112000
every time the hang happens, it eventually gives me a backtrace like this
i called customer service, the friendly tech support suggested that i should reinstall the system (if i cant figure out how to reproduce the panic), so i erased the partition table and restarted from scratch, then the hang happened again after a few shutdowns
i was really upset, but as i cant reproduce the panic, i cant even ask apple to fix it
then i decided to look into the panic myself
the backtrace
at first glance the panic report tells me something might be wrong with the SMC, so i followed apple's guide to reset the SMC, without luck
apple diagnostics reports no issue
so it must be in the kernel right?
lets start reading the backtrace
by reading about the Call Stack you will get a basic understanding of how kernel subroutines get executed
so how is it executed? just read the backtrace from bottom to top
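Reading bottom to top gets easier once each return address is matched to the module it falls in. Here is a minimal sketch in Python that uses only the addresses and kext ranges printed in my panic report. The helper names `owner` and `call_order` are made up for illustration, and anything outside the two listed kext ranges is simply assumed to belong to the kernel itself.

```python
# Return addresses from the panic report, in the order they are printed.
RETURN_ADDRESSES = [
    0xffffff800771f5cd, 0xffffff8007858b05, 0xffffff800784a68e,
    0xffffff80076c5a40, 0xffffff800771ec97, 0xffffff800771f087,
    0xffffff8007ec2838, 0xffffff7f881a1aae, 0xffffff7f881a1486,
    0xffffff7f881b6d9c, 0xffffff80076c513e,
]

# Load ranges from the "Kernel Extensions in backtrace" part of the report.
KEXT_RANGES = {
    "com.apple.driver.watchdog": (0xffffff7f881a0000, 0xffffff7f881a8fff),
    "com.apple.driver.AppleSMC": (0xffffff7f881a9000, 0xffffff7f881c7fff),
}


def owner(address):
    """Name the kext whose range contains address (hypothetical helper)."""
    for name, (start, end) in KEXT_RANGES.items():
        if start <= address <= end:
            return name
    return "kernel"  # assumption: everything else belongs to the kernel


def call_order():
    """Frames in bottom-to-top order, each paired with its owning module."""
    return [(addr, owner(addr)) for addr in reversed(RETURN_ADDRESSES)]


if __name__ == "__main__":
    for addr, name in call_order():
        print(f"{addr:#018x}  {name}")
```

On my report this puts one kernel frame at the bottom, then the AppleSMC frame, then two frames in the watchdog kext, and the rest in the kernel.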
first things first: the panic report only lists the related kernel extensions. to find out exactly what code is causing the panic, we have to get symbols from the kernel (and its extensions)
we dont need to download anything to debug such a problem, the kernel is installed on our machine, just copy it, along with all the related extensions:
/System/Library/Kernels/kernel # kernel
/System/Library/Extensions # extensions
follow https://www.repleo.nl/wordpress/symbolicate-os-x-kernel-panics-using-lldb/ to locate symbols
the kernel slide and the AppleSMC load address can be found in the panic report, like this:
Kernel slide: 0x0000000007400000
com.apple.driver.AppleSMC(3.1.9)[4589419D-7CCC-39A9-9E2F-F73FE42DD902]@0xffffff7f881a9000->0xffffff7f881c7fff
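To look an address up in the binaries on disk, the runtime offsets have to be removed first. A rough sketch of that arithmetic in Python, assuming a kernel address is simply shifted by the kernel slide and a kext address is looked up as an offset from the kext's load address. The function names are hypothetical, the constants come straight from the panic report.

```python
KERNEL_SLIDE = 0x0000000007400000   # "Kernel slide" from the panic report
APPLESMC_BASE = 0xffffff7f881a9000  # AppleSMC load address from the report


def unslide_kernel(address, slide=KERNEL_SLIDE):
    """Address in the on-disk kernel (assumed to be runtime minus slide)."""
    return address - slide


def kext_offset(address, base=APPLESMC_BASE):
    """Offset of a runtime address from the start of the loaded kext."""
    return address - base


if __name__ == "__main__":
    print(hex(unslide_kernel(0xffffff80076c513e)))  # bottom kernel frame
    print(hex(kext_offset(0xffffff7f881b6d9c)))     # the AppleSMC frame
```

For the AppleSMC frame this gives an offset of 0xdd9c into the kext.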
from this screenshot we can see more clearly what's going on, now that the backtrace is symbolicated
lets reverse some code
as far as i know AppleSMC kext is proprietary, the only way to know what it does is to reverse it
from lldb, we dont see anything worthy of reversing, its just a watchdog timeout call
anyways, AppleSMC module seems interesting, lets reverse it
after loading the AppleSMC binary, we have to rebase it so its addresses match the ones at runtime: Edit → Segments → Rebase program
press G, then paste the address 0xffffff7f881b6d9c, and we get:
__int64 __fastcall SMCWatchDogTimer::watchdogThread(SMCWatchDogTimer *this)
{
  char v1; // al
  int v2; // ecx
  thread_act_t v3; // eax
  unsigned int v4; // eax
  uint64_t v5; // rbx
  SMCWatchDogTimer *v6; // rdi
  integer_t policy_info; // [rsp+8h] [rbp-38h]
  int v9; // [rsp+Ch] [rbp-34h]
  int v10; // [rsp+10h] [rbp-30h]
  int v11; // [rsp+14h] [rbp-2Ch]
  char v12; // [rsp+18h] [rbp-28h]
  __int64 v13; // [rsp+20h] [rbp-20h]

  v1 = IOWatchdogmacOS::check_coprocessor_system(this, 0LL); // maybe it's related to T2 chip?
  v2 = 30;
  if ( v1 )
    v2 = 60;
  *((_DWORD *)this + 63) = v2;
  nanoseconds_to_absolutetime(1000000LL, &v12);
  clock_interval_to_absolutetime_interval(10LL, 1000000000LL, &v13);
  policy_info = v13;
  v10 = v13;
  v9 = *(_DWORD *)&v12;
  v11 = 1;
  v3 = current_thread();
  v4 = thread_policy_set(v3, 2u, &policy_info, 4u);
  if ( !v4 )
  {
    while ( 1 )
    {
      IOLockLock(*((_QWORD *)this + 29));
      v5 = mach_absolute_time();
      if ( !*((_BYTE *)this + 248) )
        IOWatchdog::checkWatchdog(this);
      IOLockUnlock(*((_QWORD *)this + 29)); // `0xffffff7f881b6d9c`
      clock_delay_until(v13 + v5);
    }
  }
  v6 = (SMCWatchDogTimer *)v4;
  SMCWatchDogTimer::watchdogThread(v4);
  return SMCWatchDogTimer::extendWatchdog(v6);
}
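I don't think the while (1) loop is ever going to end. When thread_policy_set succeeds (it returns 0, i.e. KERN_SUCCESS, so !v4 is true), it enters the loop and never breaks out.
v13 represents 10 sec I think, since clock_interval_to_absolutetime_interval is called with an interval of 10 and a scale factor of 1000000000 ns. Basically this loop takes an IOLock mutex, then feeds the watchdog (when the byte at the timer pointer plus 248 is zero or whatever), releases the lock, and finally sleeps for that interval before repeating the loop.
if (!*((_BYTE*)this + 248))
  IOWatchdog::checkWatchdog(this);
When it stops feeding the watchdog, the watchdog times out after about 5 min (302 seconds in my panic report), hence the panic.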
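To make the timing concrete, here is a small simulation in Python of how I read the loop. It is not Apple's implementation, just a sketch under stated assumptions: the period is taken from the arguments of clock_interval_to_absolutetime_interval(10LL, 1000000000LL, ...) read as 10 × 1000000000 ns, the timeout is the 302 seconds from the panic message, and `simulate` and `stop_feeding_at` are invented names standing in for whatever really goes wrong.

```python
NS_PER_S = 1_000_000_000
PERIOD_S = 10 * 1_000_000_000 // NS_PER_S  # interval * scale factor, in seconds
TIMEOUT_S = 302  # "no checkins ... in 302 seconds" from the panic message


def simulate(stop_feeding_at, period=PERIOD_S, timeout=TIMEOUT_S, limit=3600):
    """Return the simulated second at which the watchdog would panic.

    stop_feeding_at is the time from which the loop no longer feeds the
    watchdog (hypothetical parameter). Returns None if nothing times out
    within `limit` seconds. The timeout is only checked once per period.
    """
    last_checkin = 0
    now = 0
    while now <= limit:
        if now < stop_feeding_at:
            last_checkin = now   # this iteration feeds the watchdog
        elif now - last_checkin >= timeout:
            return now           # watchdog gives up: panic
        now += period            # stands in for clock_delay_until()
    return None


if __name__ == "__main__":
    print(simulate(stop_feeding_at=60))
```

With feeding stopped at 60 s, the last check-in happens at 50 s and the simulated panic lands at 360 s, the first tick that is at least 302 s later. As long as the loop keeps feeding, no timeout occurs.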
whats watchdog
from man watchdogd
watchdogd ensures that the system is healthy and able to make forward progress throughout the system lifecycle. If watchdogd or the Watchdog KEXT determine that the system is unhealthy they will attempt to take corrective action and ultimately may panic the system to get it back to a usable state.
from https://en.wikipedia.org/wiki/Watchdog_timer
For example, in the case of the Linux operating system, a user-space watchdog daemon may simply kick the watchdog periodically without performing any tests. As long as the daemon runs normally, the system will be protected against serious system crashes such as a kernel panic. To detect less severe faults, the daemon[4] can be configured to perform tests that cover resource availability (e.g., sufficient memory and file handles, reasonable CPU time), evidence of expected process activity (e.g., system daemons running, specific files being present or updated), overheating, and network activity, and system-specific test scripts or programs may also be run.[5]
Upon discovery of a failed test, the Linux watchdog daemon may attempt to perform a software-initiated restart, which can be preferable to a hardware reset as the file systems will be safely unmounted and fault information will be logged. However it is essential to have the insurance of the hardware timer as a software restart can fail under a number of fault conditions. In effect, this is a dual-stage watchdog with the software restart comprising the first stage and the hardware reset the second stage.
from my understanding, SMCWatchDogTimer stopped kicking the watchdog somehow (at IOLockUnlock()), then the watchdog realizes AppleSMC might be dead, so it initiates a panic, reboots and throws the panic report to the user