Delay in Processing Inbound Messages from DocuTAP
Incident Report for Clockwise MD
Postmortem

Summary

The delay in processing inbound messages was the result of our applications generating valid interface traffic faster than we could process it. Over Christmas and New Year's, visit volumes were significantly higher than usual, and message volume rose with them.

We’ve added resources and, over the coming weeks, will adjust our system to rely more on our background processors to do the bulk of the work.

Cause

The Clockwise.MD integration engine currently takes just under 200 ms to process a typical message to completion, meaning it can process just over 5 messages/second. It has a separate job queue for background jobs, but that queue isn’t used for HL7 communication; it is used primarily for API requests to other EHR systems.
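As a rough illustration of that arithmetic, the sketch below converts a per-message latency into a throughput ceiling for a single serial processor and estimates how a backlog grows once arrivals exceed that ceiling. The 200 ms figure comes from the paragraph above; the arrival rate of 8 messages/second and the function names are hypothetical and are not measurements from the incident.

```python
def max_throughput(latency_seconds: float) -> float:
    """Messages per second that one serial processor can complete."""
    return 1.0 / latency_seconds


def backlog_after(arrival_rate: float, latency_seconds: float, seconds: float) -> float:
    """Messages still waiting after `seconds` of sustained arrivals."""
    surplus = arrival_rate - max_throughput(latency_seconds)
    # A processor that keeps up with arrivals never builds a backlog.
    return max(0.0, surplus * seconds)


# 200 ms per message (figure from the report) caps throughput at 5 messages/second.
print(max_throughput(0.200))

# Hypothetical load: 8 messages/second sustained for one minute.
print(backlog_after(8.0, 0.200, 60.0))
```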

Resolution

As an immediate resolution, we added more API servers to handle the inbound traffic. With Clockwise.MD, we can autoscale web servers based on latency. Unfortunately, that process doesn’t work for the integration engine: its latency is variable, its request count isn’t consistent, and external HL7 processors can also back off and adjust the amount of traffic they send us.

Postmortem

To mitigate the impact of higher overall visit volume, we'll be working over the coming weeks to relieve the bottleneck in the Clockwise.MD integration engine. The plan is to receive messages from the interface with minimal processing (write the contents to a database and return an OK status) and then do the actual processing in a background job.
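The receive-then-process split described above might look roughly like the following. This is a minimal sketch, not the Clockwise.MD implementation: it assumes SQLite as the store, and the table name `inbound_messages` and the functions `open_store`, `receive`, and `process_pending` are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the message store (hypothetical schema; SQLite used for illustration)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbound_messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " body TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def receive(conn, raw_message):
    """Fast path: persist the raw message and acknowledge immediately."""
    with conn:  # commits the insert on success
        conn.execute("INSERT INTO inbound_messages (body) VALUES (?)", (raw_message,))
    return "OK"


def process_pending(conn, handler):
    """Slow path, run from a background job: handle stored messages in arrival order."""
    rows = conn.execute(
        "SELECT id, body FROM inbound_messages WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for message_id, body in rows:
        handler(body)  # the expensive per-message work happens here
        with conn:
            conn.execute(
                "UPDATE inbound_messages SET status = 'done' WHERE id = ?",
                (message_id,),
            )
    return len(rows)
```

With this shape, the sender only waits for a single insert, and the time spent in `handler` no longer limits how quickly messages can be accepted.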

The integration engine has access to a background job worker pool but currently uses it only for outbound API requests, not for inbound message processing.

With a whole fleet of background workers running at any given time, the integration engine will (a) be able to process those incoming messages in parallel and (b) be able to scale horizontally when load demands it.
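To illustrate point (a), here is a small sketch of several workers handling stored messages concurrently. It assumes a thread pool standing in for the background worker fleet; `process_message` is a hypothetical placeholder for the real per-message work, and the sketch ignores any ordering requirements between related messages.

```python
from concurrent.futures import ThreadPoolExecutor


def process_message(body: str) -> str:
    """Hypothetical stand-in for the real per-message processing."""
    return body.upper()


def process_in_parallel(messages, worker_count):
    """Process messages on `worker_count` workers; results come back in input order."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        return list(pool.map(process_message, messages))


print(process_in_parallel(["msg-a", "msg-b", "msg-c"], worker_count=3))
```

Under the same assumptions, adding workers raises the throughput ceiling roughly in proportion to the worker count, as long as the database and downstream systems keep up.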

In the future, resources can then auto-scale based on the number of jobs in the queue rather than on difficult-to-meter API metrics from long-running requests.
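A queue-depth scaling rule could be as simple as the sketch below. Every parameter here is an assumption made for illustration (the per-worker rate, the target drain time, and the worker limits); it is not the scaling policy actually used.

```python
import math


def desired_workers(queue_depth, per_worker_rate, target_drain_seconds,
                    min_workers=1, max_workers=20):
    """Workers needed to drain `queue_depth` jobs within `target_drain_seconds`.

    per_worker_rate is jobs/second for one worker (hypothetical value supplied
    by the caller); the result is clamped to [min_workers, max_workers].
    """
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))


# Hypothetical example: 3000 queued jobs, 5 jobs/second per worker, drain in 60 s.
print(desired_workers(3000, 5.0, 60))
```

Queue depth is a direct measure of unfinished work, so a rule like this does not depend on request latency or request counts.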

Posted 3 months ago. Jan 04, 2019 - 11:02 EST

Resolved
Inbound messages are now steadily processing in near-real time.
Posted 3 months ago. Dec 31, 2018 - 16:03 EST
Monitoring
The delay in processing inbound messages from DocuTAP is resolving, and messages are now being processed in less than one minute of being received. We will continue to monitor for any further issues.
Posted 3 months ago. Dec 31, 2018 - 15:39 EST
Investigating
We are experiencing a delay between receiving and enqueuing messages from DocuTAP and completing processing of those same messages. We are continuing to monitor and investigate this issue and will post an update when available.
Posted 3 months ago. Dec 31, 2018 - 14:09 EST
This incident affected: EHR Integration Engine.