What happened?
We took the MariaDB database down to stop the shared InnoDB tablespace (the ibdata file) from growing without bound. The only way to do that is to dump all databases, delete the ibdata and log files, update the server configuration, and restore the data.
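For context, the overall procedure looks roughly like the sketch below. This is an illustration under assumptions, not our exact runbook: the paths, the service name, and the innodb_file_per_table setting are assumed for the example.

```python
import subprocess

DATADIR = "/var/lib/mysql"           # assumed data directory
DUMP = "/backups/all-databases.sql"  # assumed dump location

# 1. Dump every database into one monolithic file.
with open(DUMP, "w") as out:
    subprocess.run(["mysqldump", "--all-databases", "--single-transaction"],
                   stdout=out, check=True)

# 2. Stop the server, then remove the shared tablespace and its log files.
subprocess.run(["systemctl", "stop", "mariadb"], check=True)
subprocess.run(f"rm -f {DATADIR}/ibdata1 {DATADIR}/ib_logfile*",
               shell=True, check=True)

# 3. Update the configuration before starting the server again; setting
#    innodb_file_per_table=1 in my.cnf (an assumed change) gives each table
#    its own tablespace file so the shared ibdata file cannot balloon again.
subprocess.run(["systemctl", "start", "mariadb"], check=True)

# 4. Restore all databases from the dump.
with open(DUMP) as src:
    subprocess.run(["mysql"], stdin=src, check=True)
```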
The dump completed without any errors, as did the cleanup and the server reconfiguration. The restore is where things failed.
Because of errors in one or more databases, transactions failed to complete, and the restore could never get past a certain point. The amount of data in the mysql data directory would climb to that point, then drop by half before the cycle repeated. After running the restore twice and hitting errors in different spots, we decided to extract each database individually from the monolithic backup we had taken earlier. Why a monolithic backup? It's usually faster to both dump and restore, except in this case.
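Pulling a single database out of an --all-databases dump leans on the `-- Current Database:` marker lines that mysqldump writes between sections. A minimal sketch of that idea (the file names and helper function are hypothetical):

```python
import re

MARKER = re.compile(r"^-- Current Database: `(.+)`")

def extract_database(dump_path: str, name: str, out_path: str) -> None:
    """Copy one database's section out of a monolithic mysqldump file.

    A section runs from that database's '-- Current Database:' marker
    up to the next database's marker (or the end of the file).
    """
    inside = False
    with open(dump_path) as src, open(out_path, "w") as out:
        for line in src:
            match = MARKER.match(line)
            if match:
                inside = (match.group(1) == name)
            if inside:
                out.write(line)

# Example: extract_database("all-databases.sql", "app_db", "app_db.sql")
```

Note that each extraction pass like this has to read through the entire dump file.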
The extraction process is painfully slow compared to a dump restore that simply works. That's why this process is taking so long.
What didn't we do?
We could have restored from the latest backup, which would have been no more than 24 hours old; however, that option carries a significant risk of data loss. Rather than risk losing up to 24 hours of data, we went with the slower, safer process.
The current restore process for each server is now past the previous failure point.
We sincerely apologize for the downtime this has caused for your apps that use MariaDB. In the tests we performed before the actual event, we did not run into any of these errors.