Site slowness
Incident Report for Clockwise MD
Postmortem

Summary

The maintenance window on the night of 2/23 included several changes, one of which was a web server switch from Unicorn to Puma. This change moves the application from a process-per-request servicing model to multithreaded request handling. The change had been running in staging for 6 months. It’s also a change many other companies have made in the last two years as Puma has become the default web server for Rails applications.

The web server change caused an unexpected slowdown in application performance under heavy load, a condition that had not surfaced in previous testing.

Cause

On Sunday afternoon (2/24 from 1615 to 1705 UTC), we saw slow response times and, upon investigation, database connection errors. With the new multithreaded configuration, the database connection pool size needed to be increased, but that change had not been rolled out. Once it was rolled out, everything stabilized as Sunday went on.
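
For context, the sizing relationship looks roughly like the sketch below. This is a minimal, illustrative Puma configuration, not our production file; the worker and thread counts are assumptions used only to show why the connection pool had to grow.

    # config/puma.rb -- illustrative sketch only; counts are not our production values
    workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))
    max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 16))
    threads max_threads, max_threads
    preload_app!

    # Under Unicorn, each worker is a single-threaded process and needs one
    # database connection. Under Puma, every thread in a worker can check out
    # a connection at the same time, so the ActiveRecord pool set in
    # config/database.yml must be at least the per-process thread count, or
    # requests queue up waiting for a connection -- the slowdown seen Sunday.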

Monday morning (2/25 from 1605 to 1620 UTC) showed that the problem was not limited to the database connection pool: the application servers themselves were performing worse, handling fewer concurrent web requests under the multithreaded model than they had under the process model. The decision was made to move back to the known good configuration using Unicorn.

The change was first applied to 2 servers to confirm it would work, and it did. To finish the rollout, a base configuration value was changed. The expectation was that this change would only take effect when a “configure” step was run against each server individually. Instead, the change was picked up by Chef and applied to multiple servers at once, and those servers dropped out of the pool. This reduced available server capacity to 20% of its normal size, at which point the remaining servers could not keep up with requests and the site became partially unavailable. Once the affected servers were issued a reconfigure command, they returned to the pool.
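
The Chef behavior described above amounts to something like the sketch below. The attribute name is hypothetical and this is only an illustration of the failure mode, not our actual cookbook.

    # attributes/default.rb -- hypothetical attribute name, sketch only
    # A change to a base attribute like this is converged by every node the
    # next time chef-client runs on its regular interval, not only when an
    # operator runs the manual "configure" step against one server at a time.
    default['app']['web_server'] = 'unicorn'   # was 'puma'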

Resolution

The maintenance window did allow several infrastructure enhancements to be rolled out. Unfortunately, the change in web server design caused stability issues on Sunday and Monday. As of 1620 UTC on 2/25, all web servers are back on their previous, known good configuration.

Postmortem

During Sunday’s work to stabilize the system, we determined that we could run individual servers with different configurations fairly easily. That method was used again on Monday to test the previous web server configuration. For future maintenance cycles, the decision was made to adjust a single canary server first and run it in production for an extended period to gather more information, then follow with a maintenance window to adjust all remaining servers accordingly.

Posted Feb 25, 2019 - 21:16 EST

Resolved
Partial site availability issues and site slowness.
Posted Feb 25, 2019 - 10:10 EST