Recovering MySQL/MariaDB after a nasty crash

Database recovery is tricky at the best of times, and it never seems to happen when and how you expect. In this case, my self-hosted Zoneminder server crashed hard and came up broken. It appears to be related to some index or table in the database getting corrupted, but I’m not enough of a “database surgeon” to say with any certainty.

When this crash happened, the first sign was that the mariadb service was crashlooping. This is never a good sign.

Worse still was that my backup had stopped working the night before:

...
-rw------- 1 root root 410987932 Jul 25 00:03 zoneminder-2024-07-25-00-00-01.sql.gz
-rw------- 1 root root 410959586 Jul 26 00:03 zoneminder-2024-07-26-00-00-01.sql.gz
-rw------- 1 root root         0 Jul 27 00:03 zoneminder-2024-07-27-00-00-01.sql.gz
-rw------- 1 root root         0 Jul 28 00:03 zoneminder-2024-07-28-00-00-01.sql.gz

So, whatever had corrupted my database had also broken the backups! Even worse, the failure happened in such a way that my backup alerts did not trigger. Fortunately, I caught this before it had overwritten all of the backups!

The ideal outcome of this is:

  • Recovering the database
  • Dump & restore to clear up any index corruption - this isn’t strictly necessary; I just don’t know how to fix this specific issue, so this is my “big hammer” for clearing things up with minimal effort
  • Fix the backup script

Recovering mariadb

The first step was to get it to stop crashing. I stopped all non-essential services and began the damage assessment.

In the logs, I was greeted with this:

2024-07-28  9:24:25 0 [Note] Starting MariaDB 10.5.23-MariaDB-0+deb11u1 source revision 6cfd2ba397b0ca689d8ff1bdb9fc4a4dc516a5eb as process 7453
2024-07-28  9:24:25 0 [Note] InnoDB: Uses event mutexes
2024-07-28  9:24:25 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-07-28  9:24:25 0 [Note] InnoDB: Number of pools: 1
2024-07-28  9:24:25 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-07-28  9:24:25 0 [Note] InnoDB: Using Linux native AIO
2024-07-28  9:24:25 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2024-07-28  9:24:25 0 [Note] InnoDB: Completed initialization of buffer pool
2024-07-28  9:24:25 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=37416822856,37416822856
2024-07-28  9:24:25 0 [Note] InnoDB: Starting final batch to recover 459 pages from redo log.
2024-07-28  9:24:26 0 [Note] InnoDB: 128 rollback segments are active.
2024-07-28  9:24:26 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2024-07-28  9:24:26 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2024-07-28  9:24:26 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2024-07-28  9:24:26 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2024-07-28  9:24:26 0 [Note] InnoDB: 10.5.23 started; log sequence number 37423522023; transaction id 104591640
2024-07-28  9:24:26 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2024-07-28  9:24:26 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-07-28 09:24:26 0x7f0fb47f8700  InnoDB: Assertion failure in file ./storage/innobase/btr/btr0cur.cc line 336
InnoDB: Failing assertion: btr_page_get_prev(get_block->frame) == block->page.id().page_no()
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
240728  9:24:26 [ERROR] mysqld got signal 6 ;
Sorry, we probably made a mistake, and this is a bug.
Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.23-MariaDB-0+deb11u1 source revision: 6cfd2ba397b0ca689d8ff1bdb9fc4a4dc516a5eb
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467891 K  bytes of memory
...

Whether this is a symptom or a root cause is hard to say. My theory is that the Zoneminder service itself crashed or exited in an unclean way and caused a partial write into the database, but that shouldn’t really be possible. Likewise, a storage fault is possible, though this disk is fairly new and has no SMART errors (yet).

The first step was to boot mysql/mariadb in forced-recovery mode. InnoDB has several recovery levels, each imposing progressively more restrictions on what the engine will do at startup. Typically you start at 1 and work your way up towards 6 until recovery succeeds, then set it back to 0 for normal operation.

/etc/mysql/mariadb.conf.d/50-server.cnf

[mysqld]

# Emergency recovery
innodb_force_recovery = 2
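
After each change to the recovery level, restart the service and keep an eye on the logs to see whether it stays up. That loop is just standard systemd fare, something like:

# Restart with the new innodb_force_recovery value, then follow the logs
sudo systemctl restart mariadb
sudo journalctl -u mariadb -f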

In my case, it took level 2 to boot up the service.

Once it was up, the first order of business was to take a backup:

sudo mysqldump zm | gzip > emerg_backup.gz

Rebuilding the database

After a quick examination with zcat/less to check that the data was correct, it was time to rebuild the database.
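
The examination itself was nothing more elaborate than paging through the dump, roughly:

# Spot-check the emergency dump without decompressing it to disk
zcat emerg_backup.gz | less

With the dump looking sane, the corrupted database could be dropped: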

DROP DATABASE zm;

At this point I switched recovery mode off and restarted the database to ensure everything came up cleanly.

innodb_force_recovery = 0
sudo systemctl restart mariadb

The database was then recreated in the mysql client:

CREATE DATABASE zm;

Then, the data could be restored.

cat emerg_backup.gz | gunzip | sudo mysql zm

After about half an hour, it was back up and running.

What about the backup script?

Edited for brevity, this was the backup script at the time of the incident:

#!/bin/bash

SERVER="nas.local"

## Backup the database
NOW=$(date +"%Y-%m-%d-%H-%M-%S")
mysqldump zm | gzip > /opt/mysql-backups/zoneminder-"$NOW".sql.gz
find /opt/mysql-backups/ -mtime +7 -delete

# Sync database backups to server
rsync -azh --delete /opt/mysql-backups/ zmbackup@"$SERVER":/storage/zoneminder/database/

It’s pretty simple, and it’s worked great historically. However, it’s missing one critical line near the top:

set -euo pipefail

To expand on this:

  • -e – Terminate on all errors (non-zero exit codes)
  • -u – Treat all unset variables as errors
  • -o pipefail – Examine ALL commands in a pipeline for error codes, not just the last one

Without this critical line, bash will helpfully chug along and silently ignore any errors. In this case, even if I had set -e, this would still have issues, since the mysqldump command’s exit code would be masked by the successful exit of gzip!
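
You can see this behaviour with a throwaway pipeline, using false as a stand-in for a failing mysqldump:

# Without pipefail, the pipeline's status is gzip's, not false's
false | gzip > /dev/null; echo $?   # prints 0 - the failure is swallowed
set -o pipefail
false | gzip > /dev/null; echo $?   # prints 1 - the failure now surfaces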

So, it’s imperative that any backup script can catch and respond to errors correctly. In this case it was just a dumb little database that stores metadata for CCTV recordings, but all data should be protected and nurtured with the utmost responsibility and diligence.

How to prevent something like this?

Simple - use shellcheck. I can’t recommend it enough. Since I started “linting” my bash scripts, I’ve had far fewer issues like this. Bash is a tricky language, chock-full of weird behaviours, bad defaults, and landmines galore. Having a tool in your back pocket to catch these sorts of errors before they become problems is a huge advantage.
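
It’s packaged just about everywhere; on a Debian-based system, installing it and linting a script (the filename below is just a placeholder) is as simple as:

sudo apt install shellcheck
shellcheck zoneminder-backup.sh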

A better backup script

By adding some error checking, we now have a much more robust and safer backup script. While this doesn’t address the root cause of the mariadb/mysql crash, it makes recovery much less stressful.

#!/bin/bash
set -euo pipefail

SERVER="nas.local"

## Backup the database
NOW=$(date +"%Y-%m-%d-%H-%M-%S")
BACKUP_FILE_PATH="/opt/mysql-backups/zoneminder-${NOW}.sql.gz"
mysqldump zm | gzip > "$BACKUP_FILE_PATH"

# Check DB backup
if [ ! -s "$BACKUP_FILE_PATH" ]; then
    echo "[!!] Error backing up database. File is 0kb or does not exist"
    exit 1
fi

# Cleanup old backups
find /opt/mysql-backups/ -mtime +7 -delete

# Sync database backups to server
rsync -azh --delete /opt/mysql-backups/ zmbackup@"$SERVER":/storage/zoneminder/database/
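
One extra check worth considering - it isn’t in the script above - is asking gzip to verify that the archive it just wrote decompresses cleanly:

# Optional: confirm the compressed dump is intact before it gets synced off-site
gzip -t "$BACKUP_FILE_PATH"

Placed next to the size check, this would abort the script on a corrupt archive (thanks to set -e) before the old backups get rotated out.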

Zoneminder… oh Zoneminder…

Meanwhile, I’ve shortened my recording retention for now to ease the load on the server. I still don’t entirely know the root cause, and may never, but there was a positive outcome in the form of detecting a backup problem early.

And I can’t say this enough: check your backups, folks.