At approximately 14:55 UTC, we began a routine upgrade of an AWS RDS Aurora database cluster. We had already upgraded 8 other clusters.
At 15:04:05 UTC, the first reader in the cluster restarted to perform the upgrade. At 15:04:28 the reader began accepting connections. At 15:04:30 the reader experienced an internal error and the mysql process aborted. The reader restarted another 126 times, until a final restart at 15:45:19 that stayed running.
At 15:05:07 UTC, AWS simultaneously restarted the remaining nodes in the cluster. At this point the cluster was unavailable to service requests. AWS continued to restart the remaining readers and the writer for approximately 40 minutes.
Readers were successfully started at 15:45:02, 15:45:19, and 15:45:41. The writer was successfully started at 15:45:41.
When the livecomments service did not recover on its own, we temporarily deprovisioned it at 15:53 UTC, waited a few minutes for the cluster to stabilize, and redeployed livecomments at 16:00 UTC.
Service was partially restored at 16:02 UTC and fully restored at 16:06 UTC.
At 17:11 UTC, the upgrade finished.
An AWS RDS Aurora database cluster experienced repeated restart failures during an upgrade.
The livecomments service was unavailable to all customers for 62 minutes.