Partial Outage - Engagement tools
Incident Report for Viafoura
Postmortem

Event Details

On Saturday, April 10th at 8:55 AM EST, reports started coming in that Viafoura widgets and the tray were no longer loading. After triaging, we discovered that the disk partitions on the storage nodes were full.

Incident Root Cause

As part of an upgrade of our data pipeline storage nodes prior to the incident, we had created new nodes and transferred data logs from our old storage nodes to the new ones. The transfer unintentionally updated the log timestamps, so the logs were not purged correctly and eventually exhausted the log data storage.
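To illustrate this kind of failure, here is a minimal sketch. It assumes the purge is age-based and keyed on file modification time, which this report does not state. The helper names (`is_purgeable`, `demo`) and the 7-day retention window are hypothetical. The sketch shows how a copy that resets the modification time makes an old log look new, while a metadata-preserving copy keeps it eligible for purging.

```python
import os
import shutil
import tempfile
import time

DAY = 86400  # seconds


def is_purgeable(path, max_age_seconds, now=None):
    """Return True if the file's modification time is older than the retention window."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age_seconds


def demo():
    """Copy a 10-day-old log two ways and check each copy against a 7-day window."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "old.log")
        with open(src, "w") as f:
            f.write("entry\n")

        # Backdate the source log so it is older than the retention window.
        ten_days_ago = time.time() - 10 * DAY
        os.utime(src, (ten_days_ago, ten_days_ago))

        plain = os.path.join(tmp, "plain.log")
        kept = os.path.join(tmp, "kept.log")
        shutil.copy(src, plain)   # copies data and mode; mtime becomes "now"
        shutil.copy2(src, kept)   # also copies metadata, including mtime

        retention = 7 * DAY  # hypothetical retention window
        return is_purgeable(plain, retention), is_purgeable(kept, retention)


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, the plain copy is skipped by the purge even though its contents are ten days old, which is the same pattern as logs accumulating after a transfer that rewrites timestamps.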

Incident Impact

All widgets and trays for all users were down during the incident.

Resolution Details

We determined that a 3x increase in our physical storage nodes would be sufficient to handle future spikes in traffic, and the volumes were adjusted accordingly.

Incident Prevention Action(s)

We have updated our disk usage monitoring so that, before disk space is exhausted, the system warns us at 75% usage and alerts at 80% usage.
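The two thresholds above can be sketched as a simple check. This is an illustrative example only, not our monitoring configuration: the function names (`classify_usage`, `check_path`) and the severity labels are hypothetical, and the check uses Python's standard `shutil.disk_usage` rather than any particular monitoring product.

```python
import shutil

WARN_PERCENT = 75.0   # warn threshold from the report
ALERT_PERCENT = 80.0  # alert threshold from the report


def classify_usage(percent_used):
    """Map a disk usage percentage to a severity label (labels are hypothetical)."""
    if percent_used >= ALERT_PERCENT:
        return "alert"
    if percent_used >= WARN_PERCENT:
        return "warn"
    return "ok"


def check_path(path):
    """Return (percent_used, severity) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    return percent_used, classify_usage(percent_used)


if __name__ == "__main__":
    percent, severity = check_path("/")
    print(f"{percent:.1f}% used: {severity}")
```

Keeping the warning threshold a few points below the alert threshold leaves time to react, for example by purging or resizing, before the partition fills.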

Posted Apr 12, 2021 - 11:24 EDT

Resolved
This incident has been resolved.
Posted Apr 10, 2021 - 09:22 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 10, 2021 - 09:08 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 10, 2021 - 09:04 EDT
Investigating
We are currently investigating this issue.
Posted Apr 10, 2021 - 08:51 EDT
This incident affected: Management Portal and Engagement tools.