Clockwise.MD experienced a service interruption with our third-party vendor, Ably, that impacted all customers: real-time screen updates were delayed, background jobs ran longer, and the integration engine was affected as a result. The application still functioned, but changes made by one user that could affect another required a screen refresh or a longer delay to appear. As the background job system took longer to process its work, other jobs, including integrations, were delayed.
We scaled up the number of background job servers to process more jobs, and Ably resolved their service interruption.
Clockwise.MD uses a background job processor that runs work in priority queues. Some queues have higher priority than others, so lower-priority jobs may wait longer to be processed. On a typical day, all jobs complete within a couple of minutes as workloads ebb and flow throughout the day. High-priority jobs, which include updating/refreshing the screen as users take actions, are processed within milliseconds.
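The report does not name the job processor, so as a minimal sketch of the priority-queue model described above, here is an illustration in Python; the priority levels, queue contents, and job names are hypothetical, and a real job processor would run workers across multiple servers rather than threads in one process.

```python
import queue
import threading
import time

# Hypothetical priorities: lower number = higher priority. The actual queue
# names and priorities used by Clockwise.MD are not specified in this report.
REALTIME = 0      # e.g. screen refresh notifications, expected in milliseconds
INTEGRATION = 5   # e.g. integration engine work, may wait behind realtime jobs

jobs = queue.PriorityQueue()

def worker():
    while True:
        priority, name, task = jobs.get()
        try:
            print(f"processing {name} at priority {priority}")
            task()  # run the job; a slow task blocks this worker entirely
        finally:
            jobs.task_done()

# A small pool of workers, analogous to the background job servers.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

jobs.put((REALTIME, "refresh-wait-room", lambda: time.sleep(0.01)))
jobs.put((INTEGRATION, "sync-ehr-appointments", lambda: time.sleep(1.0)))
jobs.join()
```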
Clockwise.MD uses a third-party vendor, Ably, to handle WebSocket communication with our servers so that changes made by users are exchanged in real time. An example of this is a patient interacting with the kiosk and the waiting room monitor reflecting the change. When a patient checks in on the kiosk, a message is sent to the waiting room screen that there has been a change, and the waiting room page calls back to Clockwise.MD to update. When this issue started, those updates took longer.
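As an illustration of that flow, the sketch below shows a server-side job publishing a "something changed" notification via Ably's REST publish endpoint; the channel name, event name, and payload are hypothetical, and Clockwise.MD's actual channel layout and client library are not described in this report.

```python
import requests

ABLY_API_KEY = "appId.keyId:keySecret"   # placeholder credentials
key_name, key_secret = ABLY_API_KEY.split(":")

def notify_wait_room(location_id: int) -> None:
    """Tell the wait room screen for a location that it should refresh."""
    channel = f"wait-room-{location_id}"          # hypothetical channel scheme
    resp = requests.post(
        f"https://rest.ably.io/channels/{channel}/messages",
        auth=(key_name, key_secret),
        json={"name": "patient-checked-in", "data": {"location_id": location_id}},
        timeout=5,  # bound how long a worker can be stuck on this call
    )
    resp.raise_for_status()

# On receipt, the wait room page calls back to Clockwise.MD to re-render.
notify_wait_room(42)
```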
Ably had an incident that impacted all Clockwise customers. When Clockwise.MD background jobs attempted to connect, each attempt hung until it timed out, then retried multiple times before failing. This meant that background worker processes that normally complete in a matter of milliseconds were occupied for 15+ seconds at a time and were unavailable to take on other background jobs.
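A rough sketch of why a single job ballooned from milliseconds to 15+ seconds is shown below; the timeout and retry counts are assumptions for illustration, not the exact values used in production.

```python
import requests

CONNECT_TIMEOUT = 5   # seconds each attempt waits before giving up (assumed)
MAX_RETRIES = 3       # attempts before the job is marked failed (assumed)

def publish_with_retries(url: str, payload: dict) -> bool:
    for attempt in range(MAX_RETRIES):
        try:
            requests.post(url, json=payload, timeout=CONNECT_TIMEOUT)
            return True
        except requests.RequestException:
            continue  # the worker is blocked for up to 5s on each attempt
    return False  # 3 attempts x 5s timeout ~= 15+ seconds of worker time
```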
This backlog then impacted our other background jobs: queues grew, and there weren't enough workers available to process integration jobs.
We scaled up the number of active workers in an attempt to stop the bleeding, but given the large job volume tied to real-time updates, we weren't able to throw enough resources at the problem to process all the jobs.
Engineering started working on a hot fix to address the issue, but that development wasn't ready for release until after Ably had already started to recover.
To be ready for any similar problem on Ably's part in the future, we added the ability to stop trying to connect entirely (or to time out much faster), reducing the overall number of long-running jobs and their impact on other jobs, including integrations.
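A minimal sketch of that kind of mitigation is below, assuming a simple environment-variable kill switch and a much shorter timeout; the flag name, timeout value, and fallback behavior are hypothetical, since the actual implementation details are not described in this report.

```python
import os
import requests

# Hypothetical kill switch and fast timeout for realtime publishing.
ABLY_ENABLED = os.environ.get("ABLY_PUBLISH_ENABLED", "true") == "true"
FAST_TIMEOUT = 1  # seconds; fail fast instead of holding a worker for 15+ seconds

def publish_realtime_update(url: str, payload: dict) -> None:
    if not ABLY_ENABLED:
        return  # skip realtime publishing entirely during a vendor outage
    try:
        requests.post(url, json=payload, timeout=FAST_TIMEOUT)
    except requests.RequestException:
        pass  # degrade gracefully; screens fall back to manual refresh
```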
That functionality was released on the morning of 1/4/2019.