They suffered a massive power failure today which meant that a large number of their customers’ sites were unavailable for around four hours. Right now, their status blog entry detailing this problem (and how the repairs are coming along) has 159 comments.
Most of these comments are of the frustrated-yet-understanding variety. A worrying number of them are terrifyingly puffed-up with their own sense of self-importance. And far too many are threatening to move their operations to another hosting provider.
Having worked as a system/network administrator for a while, I know exactly what Ed and the guys at Hosting365 are going through, so I sympathise completely. I’ve had those awful days where the worst thing that could possibly happen actually happens, and you’ve got angry customers demanding a full report on how the problem happened, what steps you will be taking to fix it and how you will prevent it happening in the future, all while you’re focusing every effort on just restoring a basic level of service. Horrible days, to be sure, but they have their uses.
To those people who are thinking of moving away from Hosting365 I say: stop. If I were using Hosting365, I would not switch to Blacknight now precisely because Blacknight haven’t suffered from something like this — yet. Whereas, I’ll bet you €100 that, after today, Hosting365 will be putting all of their attention into their reliability, focusing on how to make sure that something like this never happens again.
And to those people who are complaining about their mission-critical services running on Hosting365, I say: well, I don’t know what to say without sounding rude. I’ll just say that if I were a reseller and it was my ass on the line, I’d make sure that my ass was covered. From a business perspective, a secondary server (from a different hosting company) is cheap as chips and worth its weight in gold when your primary server suffers extended downtime.
So you’re saying that we should have a massive outage before people consider moving to us?
I think I understand where part of your logic comes from, but I’d have to disagree with it.
“a secondary server”. Interesting concept…and after today one that many of us resellers will be investigating. How would this work in practice when even nameservers are down?
@Michele: No, that’s not what I was saying at all, and I’m sorry if it came across that way in my entry.
What I was saying, or trying to say, was that switching to Blacknight after Hosting365’s (single, massive) power outage rests, in my view, on faulty logic, because if I were a betting man, I would put money on you guys being the next to suffer a catastrophic failure.
This isn’t their first major outage. They had a similar issue last year as well.
We have built our entire network with levels of redundancy in mind specifically to avoid this type of incident. Our DNS servers, for example, are physically separated by several hundred kilometers.
While I can appreciate that it might seem likely that we would be the next to tumble, I’d also like to think that we’ve been avoiding these issues precisely because our setup is so different.
@Dotty: Having control of your DNS, and a short timeout (TTL) on lookups, is essential to making a secondary server take over properly in an emergency. This is why it’s important to have your DNS hosted in more than one place. One of the things Hosting365 was criticised for (and rightly so) was having both name servers for all of their customers on the one subnet.
If my website was mission-critical I’d have a mirror of web, mysql and mail running on a secondary server, and have my DNS hosted on my primary server AND secondary server.
But that’s just me.
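For illustration, the primary/secondary arrangement described above boils down to a very small failover policy. This is only a sketch in Python under assumed names — `primary.example.com` and `secondary.example.com` are placeholders, not real Hosting365 or Blacknight machines: probe both servers, prefer the primary while it answers, and fall back to the secondary when it doesn’t.

```python
import socket

# Placeholder hostnames -- stand-ins for a primary server and a
# secondary server hosted with a *different* company.
PRIMARY = ("primary.example.com", 80)
SECONDARY = ("secondary.example.com", 80)


def is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(primary_up, secondary_up):
    """Failover policy: use the primary while it answers, fall back
    to the secondary, and report a total outage otherwise."""
    if primary_up:
        return "primary"
    if secondary_up:
        return "secondary"
    return "down"
```

With short DNS TTLs, a monitoring job running a check like this can repoint a record at the secondary within minutes of the primary going dark, rather than waiting out a cached lookup.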
@Michele: I’m glad to hear that, and I hope that you don’t think I’m singling you guys out or suggesting that Blacknight are in any way incompetent (in fact, I have a huge amount of respect for you guys after the Tom Raftery incident). I have no direct experience of either Blacknight’s or Hosting365’s services.
Believe me, I’m knocking on wood right now, hoping nothing bad happens to either yourselves or Hosting365.
In an ISP-off, my experience would lean towards Blacknight. I’ve had some really bad experiences with Irish ISPs (especially ieinternet and, less especially, H365), and of all the ISPs I would only recommend Blacknight. In 3 years I have never had a reliability issue, and response times to any self-induced issues have been really snappy. I think H365 have had it too easy for far too long; let the companies that care (Blacknight) have a chance.
I second Trevor there. I’ve been with H365 since the start, before they got big. I got training on the control panel from Stephen himself for about an hour, if I remember correctly. The company back then was small and hungry for the business and regarded my business as loyal. I see those characteristics in Blacknight now, and I’ve made this point to Michele on a number of occasions: you shouldn’t grow too quickly and forget the customers that got you there. I can’t see Blacknight having the same attitude to customers. My loyalty dropped over a couple of issues last year, especially the last “blackout”, and now I’ll feel a lot better elsewhere!

John, you say, “Whereas, I’ll bet you €100 that, after today, Hosting365 will be putting all of their attention into their reliability, focusing on how to make sure that something like this never happens again.” Why didn’t they do this when the ESB substation blew last year? It’s the exact same issue… power loss. They got away with it then. Didn’t put a good enough solution in place in case it happened again. And now, should they be allowed to get away with it? I still haven’t had any contact from them with any sort of explanation. This should have been a priority once things were up and running again. Rant over!!! 😀
A secondary server is not a bad idea… though you’re talking log shipping and a warm standby rather than real fail-over, because of the physical separation of the servers.
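To put a shape on the log-shipping arrangement mentioned above: the primary appends every change to a log, and the warm standby periodically pulls across the records it hasn’t yet applied, so after a crash it is at most one shipping interval behind. A minimal sketch in Python, with plain lists standing in for the real transaction logs (in practice these would be something like MySQL binlogs shipped over the network):

```python
def ship_logs(primary_log, standby_log, shipped_upto):
    """Copy any log records the standby has not yet applied and
    return the new high-water mark (count of records shipped)."""
    new_records = primary_log[shipped_upto:]
    standby_log.extend(new_records)
    return shipped_upto + len(new_records)


# Each shipping run brings the warm standby up to date; if the
# primary dies, the standby holds everything up to the last run.
primary = ["txn-1", "txn-2", "txn-3"]
standby = []
mark = ship_logs(primary, standby, 0)

primary.append("txn-4")                    # new writes on the primary...
mark = ship_logs(primary, standby, mark)   # ...shipped on the next run
```

The gap between runs is your maximum data loss, which is the trade-off against real synchronous fail-over.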
Of course, your ISP running tests to make sure that even if you lose power, servers and systems reboot properly might be a good idea too. I can understand 50 minutes being lost because someone pulled the wrong cable during preplanned maintenance (I’m not saying I think it’s acceptable, just that I could understand that sometimes human error just happens, even to the best of us). But not having tested to make sure that recovering from that wouldn’t take days is as unforgivable as not having sat down and worked out a DR plan for power failures after spending lots on new – and therefore untested – redundant power supplies. H365 just took a big kick in the pants because they thought that spending lots on an obvious solution would cover every eventuality and that therefore no further work was needed.
And from what I’ve seen from their posts on their blog and boards.ie and elsewhere, that doesn’t seem to be a nettle they’re grasping. And until they do, I can’t recommend to my boss that we stay with them at this point, not given their preliminary incident report.
JK, you thought you’d just make a random, chatty observation on people getting all knicker-twisty about something that (i) can happen in the best of circs and, once it does, is unlikely to happen again and (ii) shouldn’t have been such a problem for people if they’d even thought to back up/failsafe properly. And then you get lots of serious comments disagreeing – wouldn’t happen on TwentyMajor, I can tell ya!
One man’s fluffy is another man’s lead balloon baby.
@LegalEagle: Thanks! I’d completely forgotten about this and forgotten what my original point was in the first place.
@$*: Remember: This is not about H365’s past sins.
H365 were unfortunate that in the last two years, they have had two major issues connected with power but in my mind, neither of these could have been expected or properly planned for. From personal experience, issues relating to power are the most difficult to anticipate or to provide ‘soft’ failovers for. They also tend to have horrific knock-on effects of random equipment failure.
Let me illustrate this with an example: in the last place I worked, the NOC was conducting a standard UPS test. It’s the everyday kind of thing that’s supposed to be done once a month. However, this was after the NOC had received a bunch of new, rather large tenants. The extra equipment exposed a problem: when it was built, the NOC was wired badly, and during this UPS test, the entire NOC was being run off ‘dirty’ electricity (straight from the ESB), which caused a massive power spike and blew out most of the circuits. I had my head in a cabinet at the time this test was done and watched as flames SHOT OUT of the Cisco router right in front of me. Six months after the incident, we were still experiencing equipment failure (usually in the form of dead power supplies).
Yes, we were angry – we lost a day’s work and a lot of equipment. But, apart from the electrician who wired the NOC in the first place, there was no one to blame. The NOC management company had done everything they could.
And I’ll tell you what – the NOC management company were infinitely more professional in their dealings with us after the incident. It scared them straight, as I’m betting Friday’s incident will do for H365.