On August 21st, starting at 6:30am EDT, the V3 Conversation services began returning degraded API responses and shortly afterward became unresponsive. As a result, the Settings service and the admin console dashboard also became inaccessible because users could not log in.
The outage was triggered by an unscheduled traffic spike on our comments services, 15 times the normal average. A previous product update had bypassed caching for the comments count API. Because the spike exceeded the 5x autoscaling we had provisioned, some commenting and settings services were overloaded.
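To make the arithmetic behind the overload concrete, here is a minimal sketch that compares the spike with the autoscaling ceiling. The function name `overload_factor` is hypothetical and only illustrates the two figures quoted above (15x traffic, 5x provisioned autoscaling); it does not describe our actual capacity tooling.

```python
def overload_factor(spike_multiple: float, autoscale_ceiling: float) -> float:
    """Return how many times the incoming traffic exceeds the maximum
    capacity that autoscaling can provide.

    Both arguments are multiples of normal average traffic.
    """
    if autoscale_ceiling <= 0:
        raise ValueError("autoscale_ceiling must be positive")
    return spike_multiple / autoscale_ceiling


# Figures from the report: a 15x spike against 5x provisioned autoscaling.
factor = overload_factor(15.0, 5.0)
print(factor)  # 3.0
```

A factor above 1 means that even a fully scaled-out fleet receives more traffic than it was provisioned for. With caching bypassed, that excess reached the data stores directly.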
From 6:30am until 10:00am EDT, the comments widget intermittently failed to load. While we were correcting this, an additional outage affected our admin console from approximately 9:10am until 9:40am, during which time it was intermittently inaccessible.
The compute capacity was increased and the database size was quadrupled to handle the traffic.
We have made significant improvements to the comments count widget to prevent this kind of service exhaustion. These include reducing the number of API calls the widget makes by 50% and caching the API responses to significantly reduce load on the data stores. We have also established new procedures to pre-scale our infrastructure when significant spikes are expected.
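As an illustration of how caching a count API can shield the data stores, the following is a minimal sketch of a time-to-live (TTL) cache. It is not our production implementation: the class `TTLCountCache`, the `fake_count_lookup` function, the 30-second TTL, and the `article-123` identifier are all assumptions chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCountCache:
    """Serve comment counts from memory, refreshing each entry at most
    once per TTL window (illustrative sketch only)."""

    def __init__(
        self,
        fetch: Callable[[str], int],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical data-store lookup
        self._ttl = ttl_seconds  # assumed freshness window
        self._clock = clock
        self._entries: Dict[str, Tuple[float, int]] = {}

    def get(self, conversation_id: str) -> int:
        now = self._clock()
        entry = self._entries.get(conversation_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh entry: no data-store read
        count = self._fetch(conversation_id)
        self._entries[conversation_id] = (now, count)
        return count


# Usage: count how often the (fake) data store is actually read.
calls = []


def fake_count_lookup(conversation_id: str) -> int:
    calls.append(conversation_id)
    return 42


cache = TTLCountCache(fake_count_lookup, ttl_seconds=30.0)
for _ in range(1000):
    cache.get("article-123")
print(len(calls))  # 1
```

In this sketch, 1,000 requests for the same count within the TTL window produce a single data-store read. The trade-off is that a displayed count can be stale by up to the TTL.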
We verified the stability of our services, and the effect of the above improvements on our scalability, by extensively stress testing all services beyond the loads that caused the outage.