So on Sunday 17th December I was going about my normal daily duties when I noticed an e-mail from one of the security monitoring systems (GravityScan) that periodically scan my site, reporting that the site could not be reached. So, like all website administrators, I immediately went to check, and I did indeed find my beloved project’s website to be offline! Firstly, sorry to any users who found the site in that condition, especially users of Zorin OS who are sent here to help verify their downloads of the Zorin OS ISO. This website gets a lot of traffic from Zorin, so I tried my best to get it back online ASAP.
The website is hosted on an AWS EC2 micro instance which, after I logged into the dashboard, I found to be online and seemingly fine. I could SSH into it and its status was reported as OK. So, like all good IT professionals, I turned the instance off and on again as a quick, easy fix, but it didn’t fix it…at all.
So I had to dig deep into the AWS settings again, which is always daunting. It’s worth bearing in mind that I’d made no AWS changes at all for several weeks, so I had no idea what had caused the problem. Having checked the instance, the next most likely culprit was the load balancer. In my case the load balancer isn’t there to balance load: QuickHash is hardly on the level of Microsoft or Netflix, and a t2.micro instance easily handles the roughly 1K visitors a day that the site receives. It’s there to manage the SSL certificate and to provide secure HTTPS connections, which are more necessary these days than they used to be, most notably because modern web browsers warn about, or even refuse to load, plain HTTP. Load balancers, though, are not only quite complicated but also very expensive! I pay about £25 a month, just for the load balancer!
Anyway, I had a look at that and remembered I’d only set up a Classic Load Balancer, a type that has been on the way out for a while, and it was showing as “Out Of Service”. Why was it out of service? I did not have, and still do not have, the faintest idea! I tried tweaking the Health Check settings, which didn’t seem to help, and I checked the target groups which, at first glance, seemed in order. However, they were reporting that they were “unused”, which seemed odd, but I have other AWS instances using target groups that reported the same thing, yet they were online and working fine.
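For anyone who hits the same “Out Of Service” status, one thing that may help is asking AWS for the reason it attaches to each instance behind the load balancer. The following is only a rough sketch of how that could look in Python, not what I actually ran. It assumes the third-party boto3 library is installed and credentials are configured, and the load balancer name "my-classic-lb" and the function names are made up for illustration.

```python
def summarise_health(response):
    """Turn a describe_instance_health response into
    (instance_id, state, description) tuples."""
    return [
        (state["InstanceId"], state["State"], state.get("Description", ""))
        for state in response.get("InstanceStates", [])
    ]


def fetch_health(load_balancer_name):
    """Ask AWS for the health of each instance behind a Classic Load Balancer."""
    import boto3  # third-party; imported here so the summariser works without it

    elb = boto3.client("elb")  # "elb" is the Classic Load Balancer API
    return elb.describe_instance_health(LoadBalancerName=load_balancer_name)


# Hypothetical usage (needs AWS credentials and a real load balancer name):
#   for instance_id, state, why in summarise_health(fetch_health("my-classic-lb")):
#       print(instance_id, state, why)
```

The description text is where AWS explains, in its own words, why it considers an instance unhealthy, which can be more informative than the status badge in the console.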
So I tried migrating the Classic Load Balancer (CLB) to a new Application Load Balancer (ALB). That failed to complete. By now, I was beginning to panic a little because I could see no reason for the problem. I tried removing and re-adding the QuickHash website instance to the CLB; still no change.
So I checked the Route 53 settings. Again, all was in order, with alias records set for the naked domain and the www portion, both pointing to the Classic Load Balancer.
After several hours of staring at settings, I sought some advice from a friend who is in fact a certified AWS professional. We noticed that accessing the instance’s IP address directly loaded the website, so we knew the problem had to be related to either Route 53 or the load balancer. He and I agreed that the best way forward might simply be to delete the CLB and replace it with an ALB. Having done that, re-pointed the target groups, redirected the Route 53 records to the new ALB, and waited some minutes, the website came back online.
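The reasoning we used there (the site loads via the IP address but not via the domain) can be sketched as a small Python check. This is only an illustration under assumptions: the IP address and domain in the usage comment are placeholders, not my real ones, and the function names are invented for the example.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout and so on


def is_ok(status):
    """Treat any 2xx or 3xx answer as 'the site is being served'."""
    return status is not None and 200 <= status < 400


def diagnose(direct_status, domain_status):
    """Compare a direct-to-instance request with one via the public domain."""
    if is_ok(domain_status):
        return "site reachable via domain"
    if is_ok(direct_status):
        return "instance is serving; suspect DNS or the load balancer"
    return "instance itself is not serving"


# Hypothetical usage with placeholder addresses:
#   print(diagnose(probe("http://203.0.113.10/"), probe("https://www.example.org/")))
```

If the direct request succeeds while the domain request fails or returns something like a 503, the web server is fine and the fault lies somewhere between DNS and the load balancer, which is the conclusion we reached by hand.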
Lessons learned? None really. AWS is a beast, and can be quite daunting for the faint-hearted. I still have no idea what caused the initial outage because all the alarms and monitoring suggested no real issues…no CPU throttling and no network traffic out of the ordinary. But I guess, if nothing else, the website is now running on a new Application Load Balancer instead of the older and (I assume) soon-to-be-deprecated Classic Load Balancer. And that can’t be a bad thing. I just hope it is not more expensive!
If you are reading this, then the website is online, of course. So hooray!