Comments and Console Outage Incident Report

Event Details

On August 21st, starting at 6:30am EDT, the V3 Conversation services began returning degraded API responses and shortly afterward became unresponsive. As a result, the Settings service and the admin console dashboard also became inaccessible because users were unable to log in.

Incident Root Cause

The outage was triggered by an unscheduled traffic spike on our comments services that reached 15 times the normal average. A previous product update had bypassed caching for the comments count API, so requests to that API were no longer served from cache. Because the spike exceeded the provisioned 5x autoscaling limit, some commenting and settings services were overloaded.
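The gap between the 15x spike and the 5x autoscaling limit can be illustrated with some rough arithmetic. The sketch below is a simplified model, not a description of the actual system: it assumes backend load scales linearly with traffic, that only cache misses reach the backend, and that capacity is expressed as a multiple of normal traffic. The function names are hypothetical.

```python
def backend_load(traffic_multiple: float, cache_hit_ratio: float) -> float:
    """Backend load as a multiple of normal traffic.

    Simplifying assumption: only cache misses reach the backend,
    and load grows linearly with request volume.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return traffic_multiple * (1.0 - cache_hit_ratio)


def min_hit_ratio(traffic_multiple: float, capacity_multiple: float) -> float:
    """Smallest cache hit ratio that keeps backend load within capacity."""
    if traffic_multiple <= capacity_multiple:
        return 0.0  # capacity already covers the traffic without caching
    return 1.0 - capacity_multiple / traffic_multiple


SPIKE = 15.0             # traffic spike from the report (15x normal)
AUTOSCALE_CEILING = 5.0  # provisioned autoscaling limit from the report (5x)

if __name__ == "__main__":
    # With caching bypassed (hit ratio 0), the full spike reaches the backend.
    uncached = backend_load(SPIKE, 0.0)
    print(f"uncached load: {uncached:.1f}x, "
          f"{uncached / AUTOSCALE_CEILING:.1f} times the autoscaling limit")
    # Hit ratio needed under this model to stay within the limit.
    print(f"required hit ratio: {min_hit_ratio(SPIKE, AUTOSCALE_CEILING):.2f}")
```

Under these assumptions, an uncached 15x spike is three times what 5x autoscaling can absorb, and roughly two thirds of requests would need to be served from cache to stay within the limit.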

Incident Impact

From 6:30am until 10:00am EDT, the comments widget intermittently failed to load. While we were correcting this, an additional outage affected our admin console from approximately 9:10am until 9:40am, during which time it was intermittently inaccessible.

Resolution Details

The compute capacity was increased and the database size was quadrupled to handle the traffic.

Incident Prevention Action(s)

We have made significant improvements to the comments count widget to prevent this kind of service exhaustion. These improvements include optimizing the API calls the tool makes (reduced by 50%) and caching the APIs to significantly reduce load on the data stores. We have also established new procedures to pre-scale our infrastructure when significant spikes are expected.
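As an illustration of how caching a count API can reduce load on a data store, here is a minimal time-to-live (TTL) cache sketch. It is not the implementation used in our services: the class name, the fetch callback, and the 30-second TTL are all hypothetical choices made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCountCache:
    """Caches comment counts per conversation for a fixed time-to-live.

    Hypothetical sketch: `fetch` stands in for whatever call reads the
    count from the data store.
    """

    def __init__(
        self,
        fetch: Callable[[str], int],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # conversation id -> (cached count, expiry time)
        self._entries: Dict[str, Tuple[int, float]] = {}

    def get(self, conversation_id: str) -> int:
        now = self._clock()
        entry = self._entries.get(conversation_id)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no data store call
        # Missing or expired: read from the data store and refresh the entry.
        value = self._fetch(conversation_id)
        self._entries[conversation_id] = (value, now + self._ttl)
        return value
```

With a cache like this in front of the data store, repeated requests for the same conversation within the TTL result in a single read, at the cost of counts being up to one TTL out of date.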

We verified the stability of our services, and the effect of the above improvements on our scalability, by extensively stress testing all services beyond the loads that caused the outage.

‌


Posted Sep 08, 2020 - 10:49 EDT

Resolved

The incident has been resolved. All systems are fully operational.

Posted Aug 21, 2020 - 11:51 EDT

Monitoring

A new fix has been implemented to restore systems to full operations. Our team is monitoring the results.

Posted Aug 21, 2020 - 10:30 EDT

Investigating

We are still experiencing degraded performance. Our team is investigating and working to restore our systems to full operations. Users may encounter issues trying to log in or engage in the conversations widget (V3).

Posted Aug 21, 2020 - 09:47 EDT

Update

We are continuing to monitor for any further issues.

Posted Aug 21, 2020 - 09:41 EDT

Monitoring

A fix has been implemented and all systems are restored to full operations. Our team is monitoring the results.

Posted Aug 21, 2020 - 08:41 EDT

Investigating

We are experiencing unexpected degraded performance and we are working to restore our systems to full operations. Users may encounter issues trying to log in or engage in the conversations widget (V3).

Posted Aug 21, 2020 - 07:48 EDT

This incident affected: API, Management Portal, and Engagement tools.