Timeline:
Root Cause Analysis:
A review of our server monitoring data established that the Galera database service had paused replication between nodes and was failing to handle new database queries within a reasonable time. Further investigation showed that the WP Offload Media plugin was generating large numbers of “ALTER TABLE” queries. While a table is being altered, Galera prevents normal queries (both reads and writes) from being served in order to prevent data loss. As a result, all three Galera nodes were unable to serve normal traffic for a substantial period of time.
Once these queries had finished running and all changes had been made to the relevant tables, normal service should have resumed. However, the WP Offload Media plugin continued to attempt to update the relevant tables, which caused the database cluster to continue failing to serve queries. To restore full functionality, the WP Offload Media plugin was reverted to a previous version.
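As a rough illustration of how this kind of blocking could be spotted from monitoring data, the sketch below filters a snapshot of the database process list for ALTER TABLE statements that have been running longer than a threshold. The row layout mirrors the `Id`, `Time` and `Info` columns of MySQL/MariaDB `SHOW FULL PROCESSLIST`; the function name, the 30-second threshold, the table names and the sample rows are all hypothetical and are not taken from the incident data.

```python
import re

# Matches statements that begin with ALTER TABLE, allowing the optional
# ONLINE / IGNORE keywords, in any letter case.
ALTER_TABLE = re.compile(
    r"^\s*ALTER\s+(?:ONLINE\s+)?(?:IGNORE\s+)?TABLE\b", re.IGNORECASE
)


def find_long_running_alters(processlist, min_seconds=30):
    """Return process-list rows that are ALTER TABLE statements
    running for at least min_seconds (hypothetical threshold)."""
    flagged = []
    for row in processlist:
        # Info is NULL (None in Python) for idle connections.
        statement = row.get("Info") or ""
        if ALTER_TABLE.match(statement) and row.get("Time", 0) >= min_seconds:
            flagged.append(row)
    return flagged


# Hypothetical snapshot, shaped like rows from SHOW FULL PROCESSLIST.
sample = [
    {"Id": 11, "Time": 2, "Info": "SELECT * FROM wp_posts LIMIT 10"},
    {"Id": 12, "Time": 95, "Info": "ALTER TABLE wp_example ADD COLUMN note TEXT"},
    {"Id": 13, "Time": 1, "Info": None},
    {"Id": 14, "Time": 5, "Info": "alter table wp_example ADD INDEX (note)"},
]

flagged_ids = [row["Id"] for row in find_long_running_alters(sample)]
```

In practice the rows would come from whichever database client or monitoring agent is already in use; the check itself is independent of how the snapshot is collected.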
Mitigation and Preventative Measures:
To avoid future disruption, we will accelerate our migration away from Galera to a managed database solution with support for “Non-Blocking Operations”, so that tables that are not currently being modified remain accessible.
Changes should also be made to the plugin update process to involve a QA resource.
A support ticket has also been raised with the creators of the WP Offload Media plugin to ensure that this issue is resolved ASAP.