Intermittent performance issues with engagement tools
Incident Report for Viafoura
Postmortem

V3 Comments Degradation Incident Report

Event Details

On August 17th, starting at 6:30am EDT, the V3 comments API began to show failures. At 1pm the degradation began causing intermittent failures when loading V3 comment widgets across all sites. Investigation continued until 3pm, when the issue was resolved.

Incident Root Cause

Between August 10th and 17th, V3 comment widget usage grew to more than 4x its normal average. Viafoura's infrastructure autoscaling is triggered by CPU usage on compute resources, and it scaled those resources accordingly. However, the database connection pool size and application heap size became bottlenecks during scaling: they needed to grow at a higher rate than the compute resources did. Because the database connection pool was not scaled to match, comments API responses eventually degraded.

Incident Impact

From 6am until 1pm on August 17th, users experienced slower load times for comment widgets. From 1pm to 3pm, users experienced API timeouts and failures when loading comment widgets. The incident affected only V3 comment widgets and the loading of V3 comments in the moderation console. V2 comment APIs and widgets were not affected.

Resolution Details

The performance bottleneck was identified and corrected by increasing the database connection pool size. The application memory allocated to each compute node was also adjusted to support more concurrent connections. By 3pm on August 17th, following these adjustments, all services had returned to normal operation.

Incident Prevention Action(s)

After the increased connection pool sizes were applied, we observed CPU-based autoscaling working as expected. This adjustment will prevent the connection pool size from becoming a bottleneck in future autoscaling events.

Posted Sep 08, 2020 - 10:38 EDT

Resolved
The intermittent issue with conversation widgets has been resolved. Thank you for your patience.
Posted Aug 17, 2020 - 20:42 EDT
Investigating
We are experiencing unexpected, intermittent performance degradation isolated to V3 widgets, and we are working to restore our systems to full operation. Some users may encounter issues loading the conversations widget on article pages.
Posted Aug 17, 2020 - 15:13 EDT
This incident affected: Engagement tools.