As most of you are aware, on Monday night and Tuesday afternoon our services were temporarily unavailable. Now that everything is stable, we’d like to take some time to explain what we know about what happened and answer the questions that came up most often during the incident. Before we get to the technical details, the most important message we would like to convey is that we understand how serious this downtime was, and we offer our sincerest apologies to you and your users for what you had to go through. We know many of you run important registrations, campaigns, contests and services through Wufoo and, for some, the timing couldn’t have been worse. We completely empathize, and we hope the following will help you understand our position and our options for the future, and restore some of that lost confidence.
So, what happened exactly?
During both outages, the problem was the same: the data center lost all power, which took down every site and service hosted there, not just ours. A power loss on this scale affects all core-level services, which significantly increased the time needed to get every server back online. Networking on critical servers had to be brought up first, and then all application-level servers had to go through a crash recovery process.
We’re still working on getting a better understanding from our data center as to how and why this happened, and how they’re going to make sure this isn’t going to happen again. When we picked this data center, their power system was one of the key criteria. The main elements of their power equipment (UPS, generator and power control) are all good systems with ample capacity. We’re still trying to understand what it was about how these are put together, or how they’ve been managed, that led to the failure that we had.
What triggered the power outage?
In both cases, there was a significant reduction in the voltage to the building from the local electric utility.
Don’t Wufoo’s servers have a backup power supply?
We have redundant circuits on two separate power systems. Additionally, our UPS systems turned on as expected and provided power for another hour after the outage occurred. The third and final resort is to use a generator when a failure like this happens. Once an outage gets to the level that we had earlier this week, we must rely on the company that runs the data center to get emergency systems like a generator up and operational. We don’t have the details yet, but there was apparently some difficulty in switching over to generator power, so the UPS reserves on our systems ran out. Again, we’re seeking more details about this.
Once power came back, why was Wufoo down for so long?
It is tough to prepare for a worst-case scenario, and we can assure you everyone at BitPusher (our server management team) was moving as fast as they could. Realistically, recovery from a situation like this will always take at least an hour, since all core servers need to be restored and all application servers have individual crash recovery processes that have to complete. On Tuesday, the process went more smoothly and took approximately an hour from when power was restored.
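To illustrate why recovery is staged rather than instantaneous, here is a minimal sketch of dependency-ordered startup. It is not our actual tooling: the function `recovery_stages` and the server names are hypothetical, and the only idea taken from the description above is that core servers must be restored before application servers can begin crash recovery.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def recovery_stages(depends_on):
    """Group servers into stages; each stage can start only after the previous one.

    depends_on maps a server name to the servers that must be up before it.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        # Everything whose dependencies are already up can recover in parallel.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical layout: networking first, then a core server, then app servers.
example = {
    "core-network": [],
    "database": ["core-network"],
    "app-1": ["database"],
    "app-2": ["database"],
}
stages = recovery_stages(example)
```

Servers within a stage can recover at the same time, but each stage has to wait for the one before it, so the total time is roughly the sum of the slowest recovery in each stage.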
What improvements are you going to make?
Continuing in the current environment without significant changes that ensure this won’t happen again is unacceptable. We are examining three different directions: better power in the existing facility, moving our infrastructure to a different facility, and working with third-party hosting partners.
One thing we’re also working on is better ways of communicating with our users during such incidents. While we responded (as always) to every single email sent in during the outage with updates as they came in, we also want you to be able to follow what we’re doing behind the scenes without having to write in.
Many of you followed our updates on our Twitter page and our Wufoo Status blog, and we think everyone liked how that worked out. Currently, we’re working on ways to enhance the Wufoo Status blog so that it provides daily updates, a feed on our uptime status and additional information. More on that to come as we enhance those processes.
Why was there no downtime page?
Normally, Wufoo has a styled downtime page that appears when a form is unavailable. These pages make the embedded forms look more professional. During these outages, there was no downtime page and every request to our servers timed out. This is because our current downtime system relies on the load balancer. Since the load balancer also lost power, we had no downtime page to serve. We’re currently looking at our options for serving the downtime page from another facility.
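As a rough illustration of what serving the downtime page from another facility could involve, here is a minimal sketch of a health check that would run outside the primary data center. This is an assumption about one possible design, not a description of our system: `tcp_probe`, `FallbackDecider` and the threshold of three failures are all hypothetical, and how traffic would actually be redirected to the fallback page is outside the scope of the sketch.

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the primary site succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


class FallbackDecider:
    """Switch to the downtime page only after several consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], failures_needed: int = 3):
        self.probe = probe
        self.failures_needed = failures_needed
        self.failures = 0

    def check(self) -> str:
        if self.probe():
            self.failures = 0  # primary is reachable again, reset the count
        else:
            self.failures += 1
        if self.failures >= self.failures_needed:
            return "downtime_page"
        return "primary"
```

A monitor at the second facility might construct this as `FallbackDecider(lambda: tcp_probe("forms.example.com", 443))` (the hostname is a placeholder) and call `check()` on a timer. Requiring several consecutive failures avoids showing the downtime page because of a single dropped connection.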
Is my data safe?
Yes. There was no data loss during the power outage and all servers successfully completed crash recovery without issue.
This experience hurt our company. We would like a refund.
We completely understand, and again we apologize for the inconvenience. If you feel that these outages directly interfered with your primary purpose for using Wufoo, please contact us via support and we would be happy to give you a refund for this month’s services.
End Note
During the outage, many of you were very supportive, and we’re extremely grateful for that understanding. We’ve always believed that we have the best kinds of people using Wufoo, and many of your actions this week were a testament to that. Our entire team sincerely thanks all of you for your patience, and we hope to live up to such good treatment by making incidents like these few and far between.