Elevated latencies with Real-time Service
Incident Report for Clockwise MD
Postmortem

Summary

Clockwise.MD experienced a service interruption with our third-party vendor Ably that affected all customers: real-time screen updates were delayed, and background jobs ran longer, which in turn affected the integration engine. The application still functioned, but changes made by one user that could affect another required a screen refresh or took noticeably longer to appear. As the background job system slowed, other jobs were delayed, including integrations.

We scaled up the number of background job servers to process more jobs and Ably resolved their service interruption.

Cause

Clockwise.MD uses a background job processor that runs work in priority queues. Some queues have higher priority than others, so lower-priority jobs can take longer to process. On a typical day, all jobs are processed within a couple of minutes as workloads ebb and flow. Priority jobs are processed within milliseconds and include updating/refreshing the screen as actions are taken by users.
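As a rough sketch of strict-priority polling (the queue names below are invented for illustration and are not our actual configuration), workers always drain higher-priority queues before picking up lower-priority work:

```python
from collections import deque

# Hypothetical queues, highest priority first. Screen-refresh jobs sit in
# "realtime" and are normally processed within milliseconds; integration
# jobs sit in a lower-priority queue and may wait behind other work.
QUEUES = {
    "realtime": deque(),
    "default": deque(),
    "integrations": deque(),
}

def next_job():
    """Return the next (queue, job) pair, always preferring higher-priority queues."""
    for name in ("realtime", "default", "integrations"):
        if QUEUES[name]:
            return name, QUEUES[name].popleft()
    return None, None

QUEUES["integrations"].append("sync_integration_feed")
QUEUES["realtime"].append("refresh_wait_room")
print(next_job())  # the realtime job is picked first, even though it was enqueued later
```

The important property is that lower-priority queues only drain when the higher-priority queues are empty, which is why a flood of slow high-priority jobs can starve everything else.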

Clockwise.MD uses a third-party vendor, Ably, for WebSocket communication with our servers, so that changes made by one user are propagated to others. An example of this is a patient interacting with the kiosk and seeing changes on the wait room monitor. When a patient checks in on the kiosk, a message is sent to the wait room screen that there has been a change, and the wait room page calls back to Clockwise.MD to update itself. When this issue started, those updates took longer.
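The kiosk-to-monitor flow described above is a publish/subscribe pattern. The following is a minimal in-process sketch of that pattern; in production the messages travel over Ably channels, and the channel and event names here are invented:

```python
# Hypothetical in-process pub/sub sketch of the kiosk -> wait-room flow.
subscribers = {}

def subscribe(channel, callback):
    """Register a callback to run whenever a message is published to a channel."""
    subscribers.setdefault(channel, []).append(callback)

def publish(channel, event):
    """Notify every subscriber on the channel that something changed."""
    for callback in subscribers.get(channel, []):
        callback(event)

refreshes = []
# The wait-room monitor listens for change notifications; on receipt it
# would call back to Clockwise.MD to re-render its page.
subscribe("clinic-42", lambda event: refreshes.append(event))

# A patient checks in at the kiosk: publish a change notification.
publish("clinic-42", "patient_checked_in")
print(refreshes)  # ['patient_checked_in']
```

Note that the message itself carries no page content; it only tells subscribers that a change occurred, and each screen fetches fresh data on its own.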

Ably had an incident that affected all Clockwise customers. When Clockwise.MD's background jobs attempted to connect, each attempt blocked until it timed out, then retried multiple times before failing. This meant that background worker processes that normally complete in a matter of milliseconds were occupied for 15+ seconds at a time and were unavailable to take on other background jobs.
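The arithmetic behind the 15+ seconds can be sketched as a blocking connect-with-retries loop. The timeout and retry counts below are assumptions for illustration (e.g. 3 attempts at a 5-second timeout each), not our actual settings:

```python
import time

CONNECT_TIMEOUT = 5.0   # assumed per-attempt timeout, for illustration
MAX_RETRIES = 3         # assumed retry count: 3 * 5s = 15s per failing job

def publish_update(message, connect):
    """Publish via the real-time service, retrying on connection failure.

    While the service is down, every attempt blocks for the full timeout,
    so a single failing job holds its worker for timeout * retries seconds.
    """
    for attempt in range(MAX_RETRIES):
        try:
            conn = connect(timeout=CONNECT_TIMEOUT)
            conn.publish(message)
            return True
        except TimeoutError:
            continue  # retry; the worker stays busy the whole time
    return False  # all retries exhausted

def failing_connect(timeout):
    """Stand-in for a connect call during the outage (sleeps briefly, then fails)."""
    time.sleep(0.01)  # stand-in for the real multi-second blocking wait
    raise TimeoutError

print(publish_update("patient_checked_in", failing_connect))  # False
```

Because each worker is synchronous, the retry loop does not just delay the one message; it removes that worker from the pool for the entire retry window.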

This then spread to our other background jobs: queues grew, and there weren't enough free workers to process integration jobs.

Resolution

We scaled up the number of active workers in an attempt to stop the bleeding, but given the large job volume tied to real-time updates, we weren't able to throw enough resources at the problem to process all the jobs.

Engineering started working on a hotfix to address the issue, but it wasn't ready for release until after Ably had already started to recover.

Postmortem

To be ready for any similar problem on Ably's end in the future, we added the ability to stop attempting connections entirely (or to time out much faster), reducing the overall number of long-running jobs and their impact on other jobs, including integrations.
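The "stop trying to connect" behavior described above is commonly implemented as a circuit breaker. The following is a minimal sketch of that pattern, not our actual implementation; the threshold and cool-down values are invented for illustration:

```python
import time

class CircuitBreaker:
    """After a few consecutive failures, skip connection attempts entirely
    for a cool-down period, so jobs fail fast instead of blocking workers."""

    def __init__(self, failure_threshold=3, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # set when the breaker trips open

    def allow(self):
        """Return True if a connection attempt should be made."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # cool-down over: allow a fresh attempt
            self.failures = 0
            return True
        return False  # circuit open: fail fast and free the worker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()
for _ in range(3):
    breaker.record_failure()   # three timeouts in a row trip the breaker
print(breaker.allow())  # False: further attempts are skipped immediately
```

With a breaker in place, jobs that would have blocked for 15+ seconds instead return in microseconds while the vendor is down, keeping workers free for unaffected queues.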

That functionality was released this morning on 1/4/2019.

Jan 04, 2019 - 11:59 EST

Resolved
This incident has been resolved.
Jan 03, 2019 - 15:46 EST
Update
We are continuing to monitor for any further issues.
Jan 03, 2019 - 13:30 EST
Monitoring
Our third-party real-time messaging service is recovering and we're processing through a backlog of messages.
Jan 03, 2019 - 12:41 EST
Investigating
We are experiencing elevated latencies with our real-time messaging service. The service provider is investigating the issue and working to resolve it. There is no estimate on resolution at this time.
Jan 03, 2019 - 11:05 EST
This incident affected: Web Site and Infrastructure (Ably Publish/Subscribe).