NUC crashes on Debian 11 - How I fixed it

I recently installed Debian Bullseye on an old Intel NUC6CAYH mini PC I had lying around. It’s a great little device for a home server: it’s very cheap, takes 16 GB of memory, and with 4 mini-cores it’s no slouch.

The first install attempt didn’t go well: missing firmware for the NIC made the boot hang for a couple of minutes. This happens quite a bit because of Debian’s hard-line stance on binary blobs, so I re-installed with the non-free install media.

After a couple hours, the machine locked up again. It seemed I had more problems to solve…

TLDR/Spoiler: You have to turn off PXE boot, or the system randomly crashes.

Perhaps the NIC driver?

First place to look was the crappy onboard Realtek NIC. These are known for acting strangely, so I figured it was responsible either for the link going down or for the entire system locking up.

# lspci | grep Eth

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

So, I installed the DKMS package with Realtek’s out-of-tree r8168 driver for the NIC:

sudo apt install r8168-dkms
Backing up initrd.img-5.10.0-9-amd64 to /boot/initrd.img-5.10.0-9-amd64.old-dkms
Making new initrd.img-5.10.0-9-amd64
(If next boot fails, revert to initrd.img-5.10.0-9-amd64.old-dkms image)

After a reboot, the NIC came up fine, with the same gigabit link speed and more or less the same behaviour as before. I was hopeful this would fix it, but alas, the machine crashed again about 20 minutes later.
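If you want to confirm which driver the NIC actually ended up with after a change like this, sysfs will tell you. Here is a minimal sketch, assuming the interface is called enp3s0; the `nic_driver` function name and its optional second argument (an alternative sysfs root, only useful for dry runs) are my own invention. On this hardware I'd expect it to print either r8169 (the in-kernel module) or r8168 (the DKMS one).

```shell
# Print the kernel driver bound to a network interface by resolving the
# "driver" symlink under sysfs.
# $1: interface name, e.g. enp3s0
# $2: sysfs root (hypothetical override for testing, defaults to /sys)
nic_driver() {
    iface="$1"
    sysroot="${2:-/sys}"
    link="$sysroot/class/net/$iface/device/driver"
    if [ ! -e "$link" ]; then
        echo "no driver found for $iface" >&2
        return 1
    fi
    # The symlink points at the driver's directory; its last path
    # component is the driver name.
    basename "$(readlink -f "$link")"
}
```

Usage: `nic_driver enp3s0`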

Microcode maybe?

So, I looked at the dmesg output. First thing that jumped out was this section:

[    1.787738] BERT: Error records from previous boot:
[    1.787742] [Hardware Error]: event severity: fatal
[    1.787744] [Hardware Error]:  Error 0, type: fatal
[    1.787746] [Hardware Error]:   section_type: Firmware Error Record Reference
[    1.787747] [Hardware Error]:   Firmware Error Record Type: SOC Firmware Error Record Type1 (Legacy CrashLog Support)
[    1.787749] [Hardware Error]:   Revision: 0
[    1.787750] [Hardware Error]:   Record Identifier: 2000200000000
[    1.787754] [Hardware Error]:   00000000: 00020002 00000001 0000031a 00000000  ................
[    1.787756] [Hardware Error]:   00000010: 00000000 00000000 00000000 00000000  ................
[    1.787758] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
[    1.787760] [Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
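Since I ended up rebooting a lot, it helps to have a quick way to tell whether these records show up on a given boot. A rough sketch, nothing official: the `count_hw_errors` helper is a name I made up, and it simply counts lines containing the `[Hardware Error]` tag in whatever kernel log text you pipe into it.

```shell
# Count lines tagged "[Hardware Error]" in kernel log text read from stdin.
# grep -c prints the count but exits non-zero when the count is 0, so the
# "|| true" keeps the function usable in scripts that stop on errors.
count_hw_errors() {
    grep -c -F '[Hardware Error]' || true
}
```

Usage: `dmesg | count_hw_errors` (run it as root if reading the kernel log is restricted on your system).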

After some DuckDuckGo’ing, it seemed some people with old-fashioned Sandy Bridge boards had this issue on UEFI boot, and the cure was a microcode update.

I already had non-free sources enabled, so I decided to install it.

apt install intel-microcode
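Whether the package actually takes effect can be checked on the next boot by comparing the microcode revision the kernel reports before and after. A small sketch, assuming the usual `microcode : 0x...` lines in /proc/cpuinfo; the `microcode_revisions` name and the optional file argument are mine, added so it can be pointed at a saved copy.

```shell
# Print the distinct microcode revisions reported for the CPU cores.
# $1: cpuinfo file to read (defaults to /proc/cpuinfo)
microcode_revisions() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Lines look like "microcode<TAB>: 0x..."; print the value after the colon.
    awk -F': *' '/^microcode/ { print $2 }' "$cpuinfo" | sort -u
}
```

Usage: run `microcode_revisions` before installing, reboot, and run it again. If the value is unchanged, the package had nothing newer than what the BIOS already loaded.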

After a reboot, the kernel hardware errors were gone!

The crashes however, weren’t.

BIOS? Boot ROMs?

So I did something I almost never do… consult the useless vendor documentation.

After a bit, I found this page which advises:

If Network Boot is enabled in BIOS, random restarts or blue screen errors can occur.

  1. Press F2 during start to enter BIOS Setup.
  2. Go to Advanced > Boot > Boot Configuration.
  3. Disable Network Boot - OR - enable Boot Network Devices Last.
  4. Press F10 to save and exit BIOS.

I tried it and… Tada!! No more crashes!

I also noticed that the interface name changed from enp3s0 to enp2s0, which is sort of suspicious. Turning off PXE absolutely shouldn’t have that effect on the system…
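For what it's worth, as far as I understand systemd's predictable naming, names like these are built from the PCI bus and slot numbers, so enp3s0 matches the 03:00.0 address lspci showed earlier, and enp2s0 would mean the NIC now sits on bus 02. Here is a simplified sketch of that mapping; the `pci_to_ifname` helper is hypothetical, and it ignores PCI domains, multi-function devices and the onboard/hotplug-slot naming variants.

```shell
# Derive the path-based interface name for an Ethernet device from its
# PCI address as printed by lspci (bus:slot.function, all in hex).
# Simplified: covers only the plain enp<bus>s<slot> form.
pci_to_ifname() {
    addr="$1"            # e.g. 03:00.0
    bus="${addr%%:*}"    # hex bus number, "03"
    rest="${addr#*:}"    # "00.0"
    slot="${rest%%.*}"   # hex slot number, "00"
    # The interface name uses decimal numbers, so convert from hex.
    printf 'enp%ds%d\n' "$((0x$bus))" "$((0x$slot))"
}
```

Usage: `pci_to_ifname 03:00.0` gives enp3s0. Running `lspci | grep Eth` again after the BIOS change would show whether the bus number really moved.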

I’m not going to loosen my tinfoil hat though.