Addressing the Recent Downtime and Emergency Maintenance on Cloud Infrastructure
During the past week, our clients experienced several periods of service interruption caused by server reboots, kernel updates, and unscheduled maintenance. Our community manager and customer service staff are working around the clock to keep our clients informed and updated about our maintenance schedules and to minimize the negative impact as much as possible.
Your Protection is Our Mission
As posted on our blog, we are performing multiple rounds of reboots and emergency maintenance to address the Meltdown vulnerability. Yet, I feel a more non-technical follow-up is needed to reassure our clients and partners that all actions taken so far, and the ones that will follow, have one goal and one goal only – to protect our clients from a possible security issue related to the Meltdown and Spectre vulnerabilities.
It is important to look at the situation at hand with a clear mind – we are facing an unprecedented vulnerability which is affecting an enormous percentage of the server infrastructure worldwide. All cloud hosting providers now face these same difficulties and may choose to enact different action plans in response to this threat.
Our goal is to mitigate these vulnerabilities as soon as humanly possible and prevent as much harm to our customers as we can. Your decision to entrust us with your websites, business, and online presence is something we cannot neglect, and the actions we took are the path we found adequate to repay that trust.
Although technology is used to help solve problems such as convenience issues and budget constraints and to increase business effectiveness, we live in a world so dependent on technology that such unprecedented vulnerabilities can disrupt our day-to-day operations for days, weeks or even months. Unfortunately, this is the reality we have to face and this is the reason why a company such as ours exists in the first place – to deal with such situations in a swift and efficient way for you, in order to continue to support you, our clients, in your online business and ventures.
We are working closely with all partners and upstream providers and the information available changes by the hour. We are committed to keeping the information flowing and to giving advance notice to affected clients as quickly as possible. Yet, if technical challenges dictate immediate actions, we may perform unscheduled emergency maintenance or reboots on different parts of our infrastructure.
Contact Us for Assistance
As always, feel free to contact our technical support and customer service if you have any questions, but please understand that we will likely not be able to accommodate requests for changed timelines or modified reboot schedules.
Last but not least, on behalf of my team, I would like to thank you for your patience and understanding during this period. We appreciate your trust in us to do the right thing.
Comments (11)
It has been several weeks since I brought the malfunction to your attention. The first week I was told that it must be a mistake with my connection. Then suddenly it became an emergency. Now, several weeks later, I am still told to be patient, while watching my website go down up to 5 times a day with almost half an hour of downtime. Patience is ok for a few hours, but several weeks of this is a bit much.
Up until now you’ve been brilliant in many ways. This is my first complaint but it’s a big one. Surely there are at least temporary measures you can take to avoid the downtime?
From this post it seems that it’s a security threat and all hosting providers are affected. But who is affected and how long is it going to take to fix? What are my options besides watching my ranking drop every time?
Dear Alex,
We are sorry to hear you have not been satisfied with our services lately. We have tried to search through both our Ticketing and LiveChat databases by the name “Alex” used on Disqus, but to no avail. We cannot identify your account and address in detail the situations you mention.
Please note that this blog post dates back to January 12, 2018 and a lot has changed since then.
We would like to look into your case in detail and address your concerns, so please provide us with your domain name or ticket ID. We will forward the case to our senior support supervisors for review. We are sure that any misunderstanding or issue can and will be carefully investigated, explained, and resolved as soon as possible!
Best,
FastComet Team
I’m having the same experience as Alex and my site has been down for nearly 24 hours now. I also reported the issue when it was going down for short periods of time last week, and again yesterday when it went down completely, and was told I need to be patient and there is no estimated time for the issue to be fixed. Funny that you don’t mention anywhere on your blog or social media that there is a massive issue affecting many users. I expect some downtime once in a while, but I never experienced such long downtime with any other host before. This is a major inconvenience which I’ll consider at renewal time. I realise this is an old blog post; perhaps you should make a new post to keep the users informed on the current situation.
Dear Pat,
Again, I am not able to locate your account in our system, so I cannot check your case in particular.
Generally speaking, there has indeed been an issue with the service, and our Technical support team has been working for the past few days to resolve it completely. The issue is caused by a kernel bug that triggers an RCU stall on the entire physical host where your hosting server is probably located. You can read more about the RCU Stall Detector here:
https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt
This is a bug affecting AMD CPUs only. When this happens, the system becomes unresponsive and the only way to bring it back up is a reboot of the server itself. This generally resolves the CPU contention and resumes the normal operation of all services, allowing full access to the data on the server. However, rebooting is not a permanent solution and not something we can tolerate, so a few days ago we escalated the issue to our Upstream provider for a permanent fix. The report contains detailed information about the issue, as well as the behavior of the hardware host at those times.
Our upstream provider has confirmed the bug and is currently working closely with AMD's technical team to provide a permanent fix. While we cannot provide an exact ETA for completely resolving the issue at hand, we can assure you that multiple teams have been assigned to the task.
On behalf of our DevOps and Escalations teams, I can assure you that we are doing everything within our reach to resolve the current matter as quickly as possible, so you can continue working on your projects without having to worry about the underlying infrastructure used for them.
Best,
FastComet Team
I’ve already received this generic reply on my ticket, so it is not very helpful, as the site is still down and there is no estimate of when the users will be able to access it.
Hello Elena,
So what you are saying is you’ve known of this problem since the 12th of January, and now, 5 weeks later, you are still telling customers that you are working on it with no ETA???
And all this while allowing your clients to go with hours of downtime every day?
Actually, I just realised that the date you mentioned was January 12, 2018.
This makes your comment “a lot has changed since then” somewhat questionable!
My main point is that for a company that claims to care about its customers, in reality you deem it acceptable to allow hours of downtime every day for weeks.
So far all I am getting is apologies and requests for patience, while the site that I’ve worked on for a year is going down!
I find it hard to believe that there’s absolutely no solution, even a temporary one, to keep the site afloat while fixing the issue!
My ticket #171608
Dear Alex,
I believe there is a lot of misunderstanding regarding the “January 12, 2018” timing. “January 12, 2018” is the date on which the blog post you are commenting under was published. That blog post concerned the Spectre and Meltdown vulnerabilities affecting Intel CPUs. As of February 2020, we have switched to AMD EPYC CPUs.
Honestly, without any additional information that would allow us to track your account, and with the comment posted under a blog post which is 2 years old, we were truly confused about why you chose to comment here. As much as we would like to assist, the best way to do so remains our Ticketing system, where our support team will be able to provide you with frequent updates on the current matter.
Please note that in no way do I aim to put you at fault for the situation. I just wanted to explain the cause of the misunderstanding.
Best,
FastComet Team
If you were to actually read my messages, you’d see that I provided a ticket number at the bottom!!!
Ok, you switched to some other system, and 2 years later here we are, having the same problem! How is that an improvement!??
Dear Alex,
I did check the ticket ID and I was able to see that you have already listed in it all of the concerns you mention here. What is more, our Technical support has already addressed them. I do understand your frustration; however, please understand that fixing a major bug in the kernel is hard work and entails many difficulties. Kernel bugs can be seriously hard to find and fix, and multiple teams have been assigned to the task. The kernel itself is a highly modular body of code with a large development community. Since the issue was discovered, our upstream provider has been working with AMD so that they may develop a patch for it. Unfortunately, as in every DevOps process, time and resource constraints have to be overcome before results can be delivered.
Best,
FastComet Team
Dear Elena and the FastComet Team,
Regarding your response:
“our Technical support has already addressed them”
– Are you referring to your team’s messages telling me to be patient?
Correct me if I’m wrong, but I always thought that the whole point of having a host for my website was so that I wouldn’t have to know things like AMD CPUs and kernel bugs!
Sounds like you are making your problem, my problem.
When my site was down for the 1st week – ok, you were amazing until now, things happen.
2nd week of downtime – my business is significantly affected.
3rd week of downtime – give me a refund and compensation for the damages.
My first 3 requests for the refund were completely ignored.
Now, I received this:
“We are always striving to provide high quality of services and keep the 99.9% uptime of each server. That is why, the case was reviewed with highest priority by our system administrations and technical support, in order to resolve the issue and find a permanent solution, so that it does not happen again.”
– Here’s the actual reality, ONLY FOR TODAY SO FAR
The monitor is back UP (HTTP 200 – OK) (It was down for 2 minutes and 8 seconds).
The monitor is back UP (HTTP 200 – OK) (It was down for 15 minutes and 2 seconds).
The monitor is back UP (HTTP 200 – OK) (It was down for 32 minutes and 5 seconds).
The monitor is back UP (HTTP 200 – OK) (It was down for 11 minutes and 7 seconds).
It’s been like this for the past 3 weeks (that I know of since I started to monitor).
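For scale, the four outages in the monitor alerts above can be added up and compared with the downtime a 99.9% uptime target allows. This is only a rough sketch: the durations are copied from the alerts quoted in this comment, while the function names and the choice of a 1-day and a 30-day measurement window are assumptions made for illustration, not the terms of any actual service agreement.

```python
# Durations copied from the four monitor alerts quoted above: (minutes, seconds).
outages = [(2, 8), (15, 2), (32, 5), (11, 7)]


def total_downtime_seconds(outages):
    """Sum a list of (minutes, seconds) outages into seconds."""
    return sum(minutes * 60 + seconds for minutes, seconds in outages)


def allowed_downtime_seconds(uptime_target, window_days):
    """Downtime budget in seconds for a given uptime target over a window.

    Both the target and the window are assumptions for this sketch;
    real agreements define their own measurement period.
    """
    window_seconds = window_days * 24 * 60 * 60
    return round(window_seconds * (1 - uptime_target), 1)


total = total_downtime_seconds(outages)
daily_budget = allowed_downtime_seconds(0.999, window_days=1)
monthly_budget = allowed_downtime_seconds(0.999, window_days=30)

print(f"Downtime reported for the day: {total} s ({total // 60} min {total % 60} s)")
print(f"99.9% budget over 1 day:   {daily_budget} s")
print(f"99.9% budget over 30 days: {monthly_budget} s")
```

Under these assumptions, the single day's reported downtime (a little over an hour) is larger than the whole 30-day allowance at 99.9%, which is roughly 43 minutes.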
Next actually shows how much you “care” for your clients:
“Anyway, we do understand that this has caused you inconvenience and if cancellation is your final decision, we will comply with it.
Thus we are happy to issue a pro-rated refund for the remaining time of your hosting package and as a form of compensation, we will refund you 1 month of the already used period”
– So your idea of “compensation” is to refund me the 1 month when you actually failed to provide the claimed 99.9% uptime?