WordPress 502 Errors
Incident Report for Engineering
Postmortem

Timeline:

  • 11:18am - WP Offload Media plugin update deployed
  • 11:20am - DevSecOps Engineer begins to investigate potential disruption
  • 11:40am - DevSecOps Engineer identifies root cause of disruption
  • 11:44am - Incident established and Status Page updated
  • 1:30pm - Partial Service Restored
  • 2:13pm - Front End Service Fully Restored
  • 2:43pm - Service Fully Restored

Root Cause Analysis:

After assessing our server monitoring solution, it was established that the Galera Database Service had paused replication between nodes and was failing to handle new database queries in a reasonable period of time. It was then established that the WP Offload Media plugin was generating large numbers of “ALTER” table queries. While a table is being altered, Galera prevents normal queries (both reads and writes) from being served in order to prevent data loss. This left all three Galera nodes unable to serve normal traffic for a substantial period of time.

After these queries finished running and all changes had been made to the relevant tables, normal service should have resumed. However, the WP Offload Media plugin continued to attempt to update the relevant tables, which caused the database cluster to continue failing to serve queries. In order to restore full functionality, the WP Offload Media plugin was reverted to a previous version.
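
As an illustration of how this kind of condition could be spotted earlier, the sketch below flags long-running schema-change statements in the output of SHOW FULL PROCESSLIST. It is a minimal example rather than the tooling used during this incident: it assumes the process list has already been fetched as a list of dictionaries keyed by the standard column names (Id, Time, Info), and the function name, the 30 second threshold and the table name in the sample data are hypothetical.

```python
import re

# Statement prefixes treated as schema changes (DDL). This list is an
# assumption for the sketch and may need extending.
DDL_PATTERN = re.compile(r"^\s*(ALTER|CREATE|DROP|RENAME|TRUNCATE)\s+", re.IGNORECASE)


def find_long_running_ddl(processlist, max_seconds=30):
    """Return process list rows whose statement is DDL and has run
    for longer than max_seconds (hypothetical threshold)."""
    flagged = []
    for row in processlist:
        # Info is NULL (None) for idle connections.
        statement = row.get("Info") or ""
        if DDL_PATTERN.match(statement) and int(row.get("Time", 0)) > max_seconds:
            flagged.append(row)
    return flagged


# Sample rows in the shape returned by SHOW FULL PROCESSLIST
# (table name is hypothetical).
sample = [
    {"Id": 1, "Time": 2, "Info": "SELECT * FROM wp_posts LIMIT 10"},
    {"Id": 2, "Time": 640, "Info": "ALTER TABLE wp_example_items ADD COLUMN extra TEXT"},
    {"Id": 3, "Time": 900, "Info": None},
    {"Id": 4, "Time": 5, "Info": "alter table wp_example_items ADD INDEX (extra)"},
]

flagged_ids = [row["Id"] for row in find_long_running_ddl(sample)]
print(flagged_ids)
```

A check along these lines could feed an alert, so that a burst of table alterations is noticed before the cluster stops serving normal traffic.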

Mitigation and Preventative Measures:

In order to avoid future disruption, steps will be taken to accelerate our migration away from Galera to a managed database solution with support for “Non-Blocking Operations”, so that access can be maintained to tables that are not currently being modified.

The plugin update process should also be changed to involve QA resource.

A support ticket has also been raised with the creators of the WP Offload Media plugin to ensure that this issue is resolved ASAP.

Posted Nov 24, 2022 - 15:16 GMT

Resolved
All long-running database queries have now completed.
Continued disruption was seen due to a separate fault within the object storage plugin update, which continually attempted to re-run the queries. Although the repeated queries resulted in no changes, they caused the server to disable any writes to the database.

In order to prevent this, we have now rolled back the update to the object storage plugin.
Now that these long-running queries have completed, we should not see any further disruption related to these changes when we are able to apply the update.

Corrections have been made to restart any Cavalcade (cron) jobs that have been failing as a result of these issues.

All services should now be operating normally.

A full postmortem will be made available shortly.
Posted Nov 23, 2022 - 14:43 GMT
Update
After disabling non-essential queries, front end sites appear to be operating normally. We have now downgraded these to degraded performance and will continue to monitor.

The admin servers are still experiencing intermittent timeouts and 502 errors. We are currently trying to mitigate this and will continue to provide updates wherever possible.
Posted Nov 23, 2022 - 14:13 GMT
Update
While a large volume of long-running database jobs is still under way, we have made several changes to block non-essential database queries from taking place. This has restored some accessibility to the front end of all WPC sites and to the WordPress admin dashboard area.

We are continuing to look for ways to further improve performance while the long-running database queries complete.
Posted Nov 23, 2022 - 13:33 GMT
Monitoring
A WordPress plugin update deployed earlier this morning at 11:20 has caused an outage across the WordPress cluster. This update was necessary, although the impact on site performance was not anticipated. The update has triggered a large number of database operations which need to be allowed to complete. At this stage we are monitoring the progress of those jobs.
Posted Nov 23, 2022 - 11:44 GMT
This incident affected: WordPress Infrastructure (Admin Server, Web Server(s)) and Websites.