On Sunday, June 3, 2018, the web site began showing increased latency around 1400 UTC (10:00 am ET). The latency grew slowly until around 1600 UTC (noon ET), at which point the site became increasingly difficult to reach. Between 1600 UTC and 1755 UTC (noon until 1:55 pm ET), the web site dropped 2/3 of all web requests, and the remaining 1/3 took longer than normal to respond.
Clockwise.MD operates on a variable number of servers throughout the day based on customer volume. Overnight, Clockwise.MD runs on 1/3 of the servers it needs during the day. As the day begins, additional servers are started to handle the additional customers coming online. Our server load follows our customers' hours across the United States, starting with the east coast at 8am ET and the west coast around 11am ET. The largest volume of requests comes during the overlap between both coasts, from 10am to 5pm ET, when we run the full set of servers. Additional servers also start up automatically if latency hits a certain threshold.
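The scaling behavior described above can be sketched as a small function. This is only an illustration: the fleet size, the shoulder-hours figure, the latency threshold, and the step size are hypothetical values, not Clockwise.MD's actual configuration.

```python
def desired_servers(hour_et, latency_ms, full_fleet=12,
                    latency_threshold_ms=500.0, extra_step=2):
    """Return how many servers should be running (illustrative values only)."""
    if 10 <= hour_et < 17:
        # Overlap of both coasts: run the full set of servers.
        baseline = full_fleet
    elif 8 <= hour_et < 10 or 17 <= hour_et < 20:
        # Assumed ramp-up and ramp-down hours; the real schedule is not published.
        baseline = 2 * full_fleet // 3
    else:
        # Overnight: one third of the daytime fleet.
        baseline = full_fleet // 3
    if latency_ms >= latency_threshold_ms:
        # Latency-triggered scale-out on top of the schedule.
        baseline += extra_step
    return baseline
```

Every server added by either path has to complete its startup steps before it can take traffic, which is why a failing startup step limits capacity no matter how many servers the scaler requests.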
As servers start up for the day, they run specific setup steps to get ready for their work day. One of those steps is to verify that dependencies are installed. One of those dependencies comes from an external open source vendor that manages libraries used to create assets.
The vendor's SSL certificate had expired, so newly started servers could not connect to the vendor's web site and therefore could not resolve their dependencies.
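An expiring certificate on a third-party host can be spotted ahead of time with a simple check. The sketch below uses Python's standard ssl module; the function names are made up for illustration, and it is not the check Clockwise.MD runs.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter field of the certificate presented by host."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # If the certificate has already expired, this handshake raises
        # ssl.SSLCertVerificationError, the same class of failure the
        # newly started servers ran into.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Running days_until_expiry(fetch_not_after(host)) on a schedule and alerting when the result falls below a chosen number of days would give warning before a vendor's certificate lapses.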
On-call engineers determined the root cause and patched the dependency check to use the non-HTTPS URLs of the vendor's web site so that the dependency check could pass.
After the hotfix was released, the servers could start normally and services were fully restored.
The production servers did not need the asset dependency at startup. Assets are built in an earlier process (i.e. continuous integration), so the dependency is not necessary on production servers.
A patch in tomorrow's release (June 5, 2018) will remove the asset-building dependency from production servers, so that further issues with this vendor cannot block server startup.
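In outline, the change amounts to skipping the asset-related check when a server boots in production. The step names and the environment label below are hypothetical, chosen only to show the shape of the change, and are not taken from the actual patch.

```python
def startup_steps(environment):
    """List the setup steps a server runs at boot (hypothetical step names)."""
    steps = ["verify_runtime_dependencies"]
    # Assets are prebuilt in continuous integration, so only non-production
    # environments need the asset-building libraries at startup.
    if environment != "production":
        steps.append("verify_asset_build_dependencies")
    steps.append("start_web_server")
    return steps
```

With this arrangement, a production boot never contacts the asset vendor, while development and build environments still verify the libraries they use.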
Additionally, we will review the number of permanent 24/7 hosts so that a higher minimum capacity is always available to handle load.
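As a rough aid for that review, the effect of a fixed minimum fleet can be estimated with a simplified model. It assumes capacity scales linearly with the number of servers, which is an assumption made here for illustration and not a measured property of the system.

```python
def dropped_fraction(running_servers, servers_needed):
    """Rough share of requests dropped, assuming capacity scales linearly."""
    if servers_needed <= 0:
        return 0.0
    served = min(1.0, running_servers / servers_needed)
    return 1.0 - served
```

Under this model, a fleet stuck at one third of the daytime size drops about two thirds of requests, which is in line with what was observed on June 3. Raising the permanent minimum to half of the daytime size would bring the estimate down to one half.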