Partial Outage - Engagement Tools
Incident Report for Viafoura
Postmortem

Event Details

At approximately 14:55 UTC, we began a routine upgrade of an AWS RDS Aurora database cluster. We had already upgraded 8 other clusters.

At 15:04:05 UTC, the first reader in the cluster restarted to perform the upgrade. At 15:04:28, the reader was ready to serve connections. At 15:04:30, it experienced an internal error and aborted the mysql process. The reader was restarted another 126 times before a final restart at 15:45:19, after which it stayed running.

At 15:05:07 UTC, AWS simultaneously restarted the remaining nodes in the cluster. From this point the cluster was unable to service requests. AWS continued to restart the other readers and the writer for approximately 40 minutes.

Readers were successfully started at 15:45:02, 15:45:19, and 15:45:41. The writer was successfully started at 15:45:41.

Because we did not see the livecomments service recover, we temporarily deprovisioned it at 15:53 UTC. We waited a few minutes for the cluster to stabilize, then redeployed livecomments at 16:00 UTC.

Service was partially restored at 16:02 and fully restored at 16:06 UTC.

At 17:11 UTC, the upgrade finished.

Incident Root Cause

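An AWS RDS Aurora database cluster became unavailable during a routine upgrade. The first reader hit an internal error shortly after restarting, and AWS then repeatedly restarted every node in the cluster for approximately 40 minutes.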

Incident Impact

The livecomments service was unavailable to all customers for 62 minutes.
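
The 62-minute figure can be cross-checked against the timeline above. The sketch below assumes the outage window runs from the first reader restart (15:04 UTC) to full restoration (16:06 UTC) on Jan 18, 2021; the choice of those two boundaries is an assumption made for illustration, not something this report states.

```python
from datetime import datetime, timezone

# Timestamps taken from the timeline in this report (UTC, Jan 18, 2021).
# Treating these two events as the outage boundaries is an assumption.
first_reader_restart = datetime(2021, 1, 18, 15, 4, tzinfo=timezone.utc)
full_restoration = datetime(2021, 1, 18, 16, 6, tzinfo=timezone.utc)

# Subtracting two datetimes yields a timedelta; convert it to whole minutes.
outage = full_restoration - first_reader_restart
outage_minutes = int(outage.total_seconds() // 60)

print(outage_minutes)  # 62
```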

Posted Jan 27, 2021 - 15:06 EST

Resolved
The issue has been resolved, and we are monitoring our tools' performance. Thank you for your patience.
Posted Jan 18, 2021 - 11:43 EST
Identified
We have identified the issue, and partially restored commenting functionality.
Posted Jan 18, 2021 - 11:09 EST
Investigating
We are experiencing a partial outage (live comments are not available) and are investigating the cause of this issue.
Posted Jan 18, 2021 - 10:40 EST
This incident affected: Engagement tools.