The entire wyday.com domain was down for the past 24 hours.
We lost customers and we lost time supporting existing customers. What can and will be done to regain trust in your service?
This also completely screwed us. Having to go back and manually fulfill many orders has been extremely destructive. Today was the launch of a brand new product!
I'm truly sorry for the downtime and how much extra work this has caused you both.
I just posted a blog covering the broad details here: A day of unplanned downtime 7/27 into morning of 7/28
And we'll be addressing this problem holistically (including modifying how our failover systems work) in the background.
May I know which datacenter it was?
24 hours of downtime, and the failover server was in the same datacenter. Unbelievable!
You put us in an extremely difficult position with our clients. We trusted your infrastructure to be reliable, and yet there was no proper redundancy or disaster recovery plan in place.
yet there was no proper redundancy or disaster recovery plan in place.
There was, but it was inadequate. The failover mechanism, as designed, has worked well for many other types of outages over the decades. This failure type was unique to us, and our failover mechanisms themselves weren't up to the challenge.
This was addressed in the blog post and is something we're working on. Obviously it's a failure in our design that has to be addressed.
You have our emails. We should not have to go to the help site or the blog to learn about this incident. If you can email us when we get charged, you can email us after a major incident.
We're still working our way through the list of people who contacted us during the outage.
The blog post came first because it is the most reliable method to reach as many customers as possible as quickly as possible.
I believe Jason's point was that as soon as the downtime started, you could have reached out to everyone to let us know of what was happening and that you were working on it. Being offline for over 24h is bad, but not communicating made it unnecessarily worse.
The blog post talks about this. Namely, our mail servers were also down (atypically).
[…] During this downtime our services were unavailable. Unfortunately (and atypically for us) this also meant our mail servers were also down.
For LimeLM customers this downtime meant the following: […]
When we were able to communicate via an error message on the website, we did so. We also answered questions by phone and by Mastodon.
We're not on Twitter anymore. It's filled with fascists, Nazis, and porn-bots.
Yes, it's easier to reach us via the website (which was down) or email (which was also down), but we were available by other means to answer questions as best we could, and more importantly we were working on the problems.
I'm not looking for excuses, but reassurance. Such as “by end of August, we will have incorporated real-time messaging, such as an independent status page, and mechanisms to automatically message customers whenever there will be or is downtime alongside an estimate on how long we expect it to last.”
Our business currently depends on your service being available. Mistakes do happen, technology does sometimes fail. But it's the handling of that failure that determines whether or not trust can and should be restored.
If your position is that you did what you could, that Twitter isn't to your liking and that your customers should have known to look for some service called “mastodon”, then I'm afraid that is poor communication and not reassuring enough.
I also can't help but think that this datacenter you speak of is a computer in your basement. More details on this would help, such as whether it is on AWS or Hetzner or something else. At that point, one could correlate the downtime to regain more trust. This is very much about proving competence moving forward.
More details on this would help, such as whether it is on AWS or Hetzner or something else.
Currently Akamai. However, that is changing. We've been finding we need greater control over the hardware, which includes the ability for our employees to drive to the datacenter and put hands on equipment. That means ditching Akamai.
We don't and won't share internal roadmaps. It's a priority that we're working on. Already obliquely referenced in the blog post.
If your position is that you did what you could, that Twitter isn't to your liking and that your customers should have known to look for some service called “mastodon”, then I'm afraid that is poor communication and not reassuring enough.
Phone was available. We answered phone calls when we weren't actively working on alternative recovery methods.
Another thing that hasn't changed in the twenty years we've been in business is our phone number. We also don't outsource our phone. So you'll be getting one of us in charge (usually me).
I'm not looking for excuses, but reassurance. Such as “by end of August, we will have incorporated real-time messaging, such as an independent status page, and mechanisms to automatically message customers whenever there will be or is downtime alongside an estimate on how long we expect it to last.”
A status page is not a priority. Will it be done eventually? Yes. Is it something we're actively putting money and time toward? No. Phone, email, and social media are the best way to reach us during downtime.
But trust me, everyone here with power to restore services is notified immediately of downtime and is working on restoring services as soon as possible.
In the more than 2 decades we've been in business this is the longest downtime (planned or unplanned) we've had by an order of magnitude. The longest downtime previously was about 2 hours.
And the reason it was so long, as already stated in the blog post, is that the failover mechanisms in place didn't account for the whole datacenter becoming unavailable. This is an extreme case that hadn't happened to us in 2 decades. But now that it has, we're adjusting how we handle failovers.
So, as far as priorities go, the number one priority is redesigning and testing our failover mechanisms to handle the failure of a complete datacenter.
Yes, we had another datacenter in another state ready to handle the failover, but again (see the blog post) it was a failure in how we designed the failover mechanisms.
So, in short, the blog post already covered this. This is now a weakness on our radar that must be addressed. And that we're addressing.
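To make the distinction concrete, here's a rough sketch (not our actual implementation; the hostnames and the update_dns() helper are placeholders) of the difference between failing over individual hosts and failing over a whole datacenter:

```python
# Illustrative sketch only -- not production code. Hostnames and the
# update_dns() callback are placeholders.
import socket

PRIMARY_DC = ["app1.dc1.example.com", "app2.dc1.example.com"]
STANDBY_DC = ["app1.dc2.example.com", "app2.dc2.example.com"]

def host_is_up(host, port=443, timeout=5):
    """Basic TCP reachability check for a single host."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_failover(update_dns):
    # Host-level failover alone only shuffles traffic between hosts inside
    # one datacenter; if *every* host there is unreachable, the whole site
    # has to fail over to the standby datacenter instead.
    if any(host_is_up(h) for h in PRIMARY_DC):
        return "primary"        # at least one primary host is still alive
    if any(host_is_up(h) for h in STANDBY_DC):
        update_dns(STANDBY_DC)  # whole-datacenter outage: point DNS at standby
        return "standby"
    return "both-down"          # page a human either way
```

The hard part isn't the check itself; it's keeping the standby datacenter's data in sync and making sure the thing doing the checking isn't sitting inside the datacenter that just went down.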
Long story short: this downtime was shitty for everyone involved. I get that. You don't want it to happen again. We don't want it to happen again. And we're working to make sure it doesn't happen again.
> But trust me, everyone here with power to restore services is notified immediately of downtime and is working on restoring services as soon as possible.
Why did everyone need to be notified immediately?
I believe this is the point we are trying to make. *We* need to be notified immediately too. Since we depend on you, our service is down when yours is, and *we* also need to act quickly to restore confidence in *our* userbase, just like you.
Because we were not informed, our users found out first and then informed us. It is bad business and it could have been avoided.
It would help tremendously to know: the next time this happens, will we be notified immediately?
Btw, Akamai does have a status page, and it reports 0 incidents on or around the time of this event.
And hence one of the huge problems with status pages… they're just as useful as social media. You need to know where to look to even get the status you want. You thought you found their status page (and you did, you found one of their status pages), but it doesn't account for the fact that Akamai is a huge company that has swallowed up hundreds of other companies, each with their own status pages. And apparently they don't aggregate the statuses into one location.
And with status pages you *still* wouldn't be push-notified of problems (unless all of the services that do the push notifying are up and running).
Look: I get that this is frustrating. And we're actively working on solutions to uptime problems, not status/notification problems.
Even trillion dollar companies get status pages wrong (Amazon and Google have had fairly high-profile cases just in the last couple of years where their status pages showed everything as being 100% available even several hours after it was obviously not).
These aren't easy problems to solve. These are social problems (where does one even find the status page if the “main site” is down and how do they verify the source is legitimate and not a scam?) and technical problems (how to maintain a second set of servers with high availability that is secure but also completely separate from the first set of servers).
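To give a sense of the technical half (a hand-wavy sketch, not something we run): even a bare-bones independent status monitor is a second system that has to be hosted, secured, and kept running completely outside the infrastructure it's watching:

```python
# Hand-wavy sketch of an out-of-band status monitor -- not something we run.
# It only means anything if it's hosted completely separately from the
# infrastructure it watches, which is exactly the hard part.
import json, time, urllib.request

SITE = "https://wyday.com"     # the thing being watched
STATUS_FILE = "status.json"    # would have to live on separate, independent hosting

def probe(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

while True:
    result = {"site": SITE, "up": probe(SITE), "checked_at": int(time.time())}
    with open(STATUS_FILE, "w") as f:
        json.dump(result, f)
    # Publishing this somewhere customers can find it and trust it -- and
    # pushing notifications to them -- is the social problem described above.
    time.sleep(60)
```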
Trust me, these are hard problems. The fact that you confidently tried to call me out for lying (why would I lie?) about Akamai downtime proves how hard the social problem is (where do you even look for status and verify it's actually a legitimate source, and … and … and …). You had (a) the company name, (b) when the event happened, and (c) the benefit of not being stressed because of the downtime, and it was still hard for you to find the actual status page. Which is my point. This is not a you problem. This is a problem inherent in external status pages. And that's just a tiny slice of the social problems of status pages (completely ignoring the big, hairy technical issues of status pages).
So, no. Status pages are not a priority. Not this year. Not next year. Maybe down the road. But our time and money is better spent elsewhere right now.
Yes, you'll be notified of the problem (hence the blog posts, and before that status was directly on our website). But our priority during downtime was, and will remain, getting back up as fast as possible.