The software that manages our virtual machines received a buggy update, which made Groupdesk unavailable for parts of Saturday and Sunday. On top of that, on Friday the machines came under very heavy load from someone attempting to gain unauthorized access, which also disrupted our service.
We run Groupdesk on a cluster of virtual machines. The software that manages this cluster is called Kubernetes and we use an enterprise version of Kubernetes created and maintained by a company called CoreOS.
Nov 20th: A very dangerous vulnerability was discovered in Kubernetes. Patches were released by many of the companies that offer Kubernetes distributions.
Dec 7th: Our version of Kubernetes was updated with a patch to resolve the vulnerability.
Almost immediately afterwards we noticed something wasn’t quite right. We’ve set up a number of alarms designed to trigger when either the virtual machines that run Groupdesk or the virtual machines that manage them behave strangely. We noticed that the part of Kubernetes responsible for authentication was failing intermittently. Just to make the distinction clear: Groupdesk authenticates users and travelers, and that was working just fine; it was Kubernetes, the software that manages the virtual machines Groupdesk runs on, that was suffering occasional authentication issues. In practice this meant that we developers sometimes had to wait a few minutes to log into the management software to spin up new virtual machines, for example.
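For the curious, the kind of alarm logic described above can be sketched in a few lines. This is illustrative only: the function name, threshold, and window size are invented for this example and are not taken from our actual monitoring setup.

```python
from collections import deque
import time


def make_auth_alarm(threshold=3, window_seconds=600):
    """Return a callable that records one authentication-check failure
    per call and reports whether the number of failures inside the
    sliding time window has reached the alarm threshold.

    (Hypothetical sketch; real monitoring systems do this with
    dedicated alerting rules rather than hand-rolled code.)
    """
    failures = deque()

    def record_failure(now=None):
        now = time.time() if now is None else now
        failures.append(now)
        # Drop failures that have aged out of the window.
        while failures and now - failures[0] > window_seconds:
            failures.popleft()
        # True means: enough recent failures to raise the alarm.
        return len(failures) >= threshold

    return record_failure
```

The idea is simply that a single intermittent failure is noise, while several failures within a short window indicate something is genuinely wrong.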
Dec 7th - 14th: We reported the issue to CoreOS and followed up with them, hoping that a new patch would be out soon.
Dec 14th: Around 5:15 we began investigating an issue that was causing users to be logged out of Groupdesk and unable to log back in. About 30 minutes later we identified the source of the problem: the log files for the Kubernetes virtual machines were full of failed authentication attempts. Someone, or some group of people, was attempting to brute-force their way into the machines that manage Groupdesk, though not the Groupdesk machines themselves. They were attempting to log into the servers at an incredible rate, and the resulting network congestion was interrupting Groupdesk service. To resolve this we reconfigured the firewalls around the cluster of virtual machines and then rebooted the whole cluster. Bringing the cluster back up took an additional hour. As a final note about this day: this event appears to be completely independent of the issues discussed above and below. An unusual coincidence.
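A brute-force attempt like this shows up in logs as a flood of failed logins from a small number of sources. A minimal sketch of how one might surface that pattern is below; the log line format and the threshold are invented for illustration, as real system logs vary.

```python
import re
from collections import Counter

# Hypothetical log format; real auth logs differ in layout.
FAILED_LOGIN = re.compile(r"Failed password .* from (\d+\.\d+\.\d+\.\d+)")


def suspicious_sources(log_lines, min_attempts=100):
    """Count failed login attempts per source IP and return the IPs
    at or above min_attempts, most aggressive first."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return [(ip, n) for ip, n in counts.most_common() if n >= min_attempts]
```

In an incident like this one, a legitimate user might fail a handful of times, while an attacker's address racks up thousands of failures, which is what makes it easy to single out and block at the firewall.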
Dec 15th: At 1:45 we noticed alarms indicating an issue with one of the Kubernetes machines. However, as designed, the machine restored itself within a few minutes and there was no disruption of service. The same thing happened again at 3:00pm, and again caused no disruption of service, as the issue was within Kubernetes, not Groupdesk. Both of these failures were related to the original issue we noticed and reported to CoreOS.
Dec 15th: Around 10:30pm we received reports of disruption of service. We checked for indications that the cluster was experiencing the same problems as the previous day, but found none. Instead, the Kubernetes virtual machines had failed for reasons similar to the initial issue, but on a much larger scale, causing networking to the Groupdesk machines to fail as well. We rebooted the cluster to resolve the issue and discussed long-term solutions. It seemed clear that the patch created by CoreOS was causing increasing instability in the cluster, so we considered abandoning the CoreOS version of Kubernetes and setting up a new Kubernetes cluster created by Amazon. We resolved that on the 16th we would set up that new cluster on the side as a trial. By midnight the old cluster had finished rebooting and Groupdesk was back to full service.
Dec 16th: At 10:00am we noticed services were down again. This time we did no investigation; instead we immediately rebooted the old cluster and started work on getting a new cluster running on Amazon Web Services. While work progressed on the new cluster, the old cluster failed to reboot, twice. It had crashed in a completely unexpected and thorough way. We decided to cut our losses and focus on getting the new cluster up. It took a while, but we managed to restore service by 2:00pm.
Dec 17th: We spent the day getting the new cluster set up with all the monitoring and alarms we had used on the old cluster. We also spent time researching the issue with the CoreOS patch and discovered we were not the only ones to experience this dramatic failure of an otherwise very stable product.
To summarize, Groupdesk experienced 2 hours of service disruption on Friday due to a failed hacking attempt, 1.5 hours of disruption on Saturday due to an unusual failure in the machines that manage Groupdesk machines, and 4 hours of disruption on Sunday due to that same issue causing the cluster to dramatically tear itself apart.