You're right. There is no excuse for the downtime or for having no staff reachable. We'll be sad to see you go.
I've been sending a variation of the following message to customers, but I should post it here as well:
I truly apologize for the downtime. Over the past several months we've been working to ensure that no single part of the infrastructure is make or break. We've been successfully chipping away at each corner of this infrastructure.
We did have a single load balancer server that was make or break, however, and we hadn't yet designed an automated solution to this problem. It just so happened that a couple of hours ago the people in the datacenter decided to reboot a bunch of machines (including this one), and the routers were not correctly assigning IP addresses to the machines. We're still looking into why the reboots occurred in the first place (likely operator error). Because the correct IP addresses were not assigned to this entry point, the load balancing machine was alive and running but unreachable.
It goes without saying that this is embarrassing, and we're going to make our systems more robust to handle human error. This should not happen, especially considering the engineering effort we've devoted to making sure it doesn't. And if it does happen, our automated systems have to be smarter about fixing it. This time we failed at that, and for that I apologize.
- Wyatt O'Day, Founder & CEO of wyDay