The performance issues with processing inbound messages were the result of our applications generating valid interface traffic faster than we could process it. Over Christmas and New Year’s, visit volumes were significantly higher than usual, and the resulting message volume outpaced our processing capacity.
We’ve added resources and, over the coming weeks, will adjust our system to rely more on our background processors to do the bulk of the work.
The Clockwise.MD integration engine currently takes just under 200 ms to process a typical message to completion, so, handling messages one at a time, it can process just over 5 messages/second. It has a separate job queue for background jobs, but that queue isn’t used for HL7 communication; it is used primarily for API requests to other EHR systems.
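To make that arithmetic concrete, here is a minimal sketch of the throughput ceiling and of how quickly a backlog builds once arrivals exceed it. The function names and the 8 messages/second arrival rate are hypothetical and chosen only for illustration; the 200 ms latency is the rounded figure quoted above.

```python
def max_throughput(latency_s: float, workers: int = 1) -> float:
    """Messages per second when each worker handles one message at a time."""
    return workers / latency_s


def backlog_after(arrival_rate: float, latency_s: float, seconds: float,
                  workers: int = 1) -> float:
    """Messages left waiting after `seconds` of steady arrivals."""
    surplus = arrival_rate - max_throughput(latency_s, workers)
    return max(0.0, surplus * seconds)


# 200 ms per message, processed one at a time
print(max_throughput(0.2))

# Hypothetical burst (assumed rate): 8 messages/second for one minute
print(backlog_after(8.0, 0.2, 60.0))
```

With those assumed numbers, the queue grows by roughly 3 messages every second, or about 180 messages after a single minute, which is why a sustained surge turns into visible delays so quickly.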
The immediate resolution was to add more API servers to handle the additional inbound traffic. With Clockwise.MD, we can autoscale web servers based on latency. Unfortunately, that approach doesn’t work for the integration engine: its latency is variable, its request count isn’t consistent, and external HL7 processors can back off and adjust the amount of traffic they send us.
To mitigate the impact of higher overall visit volume, we'll be working over the coming weeks to relieve the bottleneck in the Clockwise.MD integration engine. The plan is to receive messages from the interface with minimal processing (write the contents to a database and return an OK status) and then do the actual processing in a background job.
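The receive-then-process split described above can be sketched roughly as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the table name, function names, and status values are all assumptions rather than the integration engine's actual schema or code.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) table that holds raw inbound messages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbound_messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " body TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def receive_message(conn: sqlite3.Connection, body: str) -> str:
    """Fast path: persist the raw message and acknowledge immediately."""
    conn.execute("INSERT INTO inbound_messages (body) VALUES (?)", (body,))
    conn.commit()
    return "OK"


def process_pending(conn: sqlite3.Connection, handler) -> int:
    """Background path: run the slow handler over every pending message."""
    rows = conn.execute(
        "SELECT id, body FROM inbound_messages"
        " WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for msg_id, body in rows:
        handler(body)  # the expensive work happens here, off the request path
        conn.execute(
            "UPDATE inbound_messages SET status = 'done' WHERE id = ?",
            (msg_id,),
        )
    conn.commit()
    return len(rows)
```

The point of the design is that the sender only waits for a single insert, so the time to acknowledge a message no longer depends on how long the message takes to process.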
The integration engine already has access to a background job worker pool, but today it uses that pool only for outbound API requests, not for inbound message processing.
With a whole fleet of background workers running at any given time, the integration engine will (a) be able to process those incoming messages in parallel and (b) be able to scale horizontally when load demands it.
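As a rough illustration of point (a), the sketch below runs a stand-in handler across a small pool of workers. It uses a thread pool from Python's standard library purely for demonstration; the real workers would presumably be separate processes or hosts pulling from the job queue, and the handler name, batch size, and worker count here are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle(message: str, latency_s: float = 0.2) -> str:
    """Stand-in for real message processing; sleeps to mimic ~200 ms of work."""
    time.sleep(latency_s)
    return f"processed:{message}"


def process_in_parallel(messages, workers: int, latency_s: float = 0.2):
    """Run `handle` across a pool of workers, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda m: handle(m, latency_s), messages))


if __name__ == "__main__":
    batch = [f"msg-{i}" for i in range(10)]
    start = time.perf_counter()
    process_in_parallel(batch, workers=5)
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Handled one at a time, ten such messages would take about two seconds; spread across five workers, the same batch should finish in a fraction of that, and adding workers raises the ceiling further.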
In the future, resources can then auto-scale based on the number of jobs in the queue rather than on difficult-to-meter API metrics from long-running requests.
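One simple way to turn queue depth into a scaling decision is sketched below. Everything here is an assumption made for illustration: the per-worker rate of 5 messages/second comes from the figure quoted earlier, while the 60-second drain target and the worker limits are hypothetical values, not settings from our infrastructure.

```python
import math


def desired_workers(queue_depth: int,
                    per_worker_rate: float = 5.0,   # messages/second per worker
                    target_drain_s: float = 60.0,   # assumed time budget to clear the queue
                    min_workers: int = 1,
                    max_workers: int = 20) -> int:
    """Workers needed to drain the current queue within the target time."""
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_s))
    # Clamp to the allowed range so we never scale to zero or without bound.
    return max(min_workers, min(max_workers, needed))
```

Queue depth is a direct measure of unfinished work, so a rule like this reacts to the actual backlog instead of inferring load from request latency.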