On Saturday, April 10th at 8:55 AM EST, reports started coming in that Viafoura widgets and the tray were no longer loading. After triaging, we discovered that the disk partitions on the storage nodes were full.
As part of an upgrade to our data pipeline storage nodes prior to the incident, we had created new nodes and transferred data logs from our old storage nodes to the new instances. An unintentional update to the log timestamps during the transfer meant the logs were not purged correctly, which exhausted the log data storage.
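To illustrate how a timestamp change during a transfer can defeat an age-based purge, here is a minimal sketch. It assumes the purge decides by file modification time and uses a 7-day retention window; the actual transfer tool, purge mechanism, and retention period are not stated in this report, and the names `is_purgeable` and `demo` are hypothetical.

```python
import os
import shutil
import tempfile
import time

DAY = 24 * 60 * 60  # seconds in a day


def is_purgeable(path, max_age_days, now=None):
    """Return True if the file's modification time is older than the retention window."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age_days * DAY


def demo():
    """Copy a 30-day-old log two ways and report which copies a 7-day purge would remove."""
    with tempfile.TemporaryDirectory() as tmp:
        old_log = os.path.join(tmp, "old.log")
        with open(old_log, "w") as f:
            f.write("log line\n")

        # Backdate the file so it looks 30 days old (assumed age, for illustration only).
        thirty_days_ago = time.time() - 30 * DAY
        os.utime(old_log, (thirty_days_ago, thirty_days_ago))

        plain = os.path.join(tmp, "plain_copy.log")
        preserved = os.path.join(tmp, "preserved_copy.log")
        shutil.copy(old_log, plain)       # copies content and mode; mtime becomes "now"
        shutil.copy2(old_log, preserved)  # also copies metadata, including mtime

        return {
            "original": is_purgeable(old_log, 7),
            "plain_copy": is_purgeable(plain, 7),
            "preserved_copy": is_purgeable(preserved, 7),
        }


if __name__ == "__main__":
    print(demo())
```

In this sketch the plain copy looks brand new to the purge check, so it would be retained for a full retention window again, while the metadata-preserving copy is still recognized as old.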
All widgets and trays for all users were down during the incident.
We determined that a 3x increase in our physical storage nodes is sufficient to handle future spikes in traffic, and the volumes were adjusted accordingly.
We have updated our disk usage monitoring to warn us at 75% usage and to alert at 80% usage, before disk space is exhausted.
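A two-level disk usage check of this kind could be sketched as follows. This is only an illustration using the 75% and 80% thresholds stated above; the function names `classify_usage` and `check_partition` are hypothetical, and the monitoring system actually in use is not described in this report.

```python
import shutil

WARN_THRESHOLD = 0.75   # warn at 75% usage
ALERT_THRESHOLD = 0.80  # alert at 80% usage


def classify_usage(used_bytes, total_bytes):
    """Map a used/total pair to 'ok', 'warn', or 'alert' using the thresholds above."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= ALERT_THRESHOLD:
        return "alert"
    if ratio >= WARN_THRESHOLD:
        return "warn"
    return "ok"


def check_partition(path):
    """Classify the partition that contains `path` based on its current usage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    print(check_partition("/"))
```

Keeping the classification separate from the measurement makes the thresholds easy to test without needing a disk that is actually full.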