
Building a Resilient Infrastructure to Overcome Outages at Scale: Introducing Reactive Production

Updated: Dec 26, 2021

Your entire house relies on electricity in one way or another. Let’s face it - without it, things pretty much just stop. And it is not an exaggeration to say that the millions of services used by people across the globe similarly rely on cloud providers for all of their hardware needs - websites, online tools, mobile apps and games, you name it. So when, last week, one of them, Amazon Web Services (AWS), had an incident in which an entire region of its servers went dark, the impact was overwhelming.




Yet not a single Wix user or visitor of a Wix-powered website noticed a thing - despite Wix relying to an extent on Amazon services, including the impacted region, US-East-1. In this article we share the details of how we built a system with resilience and flexibility at its core, and how we are now able to move resources and data between servers thousands of miles apart at the snap of a finger.


BTW - while we were writing this blog post, a similar outage event occurred.



Geographical distribution


Wix is very user-centric. In production engineering, that means we have to live by the axiom that “everything that can go wrong will go wrong” and work to prevent things from going wrong as much as we can. So we make sure that everything our users build is served and available to their users (something we refer to as “users of users”) at all times - meaning we can’t afford to just call it a day if and when our infrastructure partner has issues and we are taken offline in some geographical location.


With that in mind, we build all of the infrastructure running Wix services in a way that, whenever something goes wrong, we have the appropriate mitigation steps to make sure our users aren’t affected. One thing that is unique about our approach is taking the high availability offered by cloud providers out of the box beyond the obvious - we look at high availability through a multi-geographical lens instead of focusing on HA zones within the same region.


We start by utilizing everything that high-availability zones have to offer, but we also go beyond that. Remember, we simply cannot rely on any kind of shared infrastructure that can take an entire zone down. So we do geographically distributed high availability - we take highly available zones spread across different geographical locations thousands of miles apart (on the U.S. East Coast, in Europe, in Asia), 15 locations overall, and utilize all of them at all times.


All of those locations are scaled according to the amount of traffic they receive and serve. So instead of having one very big location serving everybody, we have 15 “small” locations - if we divide things evenly, each is roughly 6.6% of the original size.


Now imagine one of the locations goes down for whatever reason. The remaining 14 take over all of its load, usually distributing it between the 2-3 closest locations. To make this possible on one hand, and to be economically efficient on the other, we have built several algorithms that make sure it all works properly.
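
As a thought experiment, here is a minimal sketch (in Python) of what handing a failed location’s traffic to its nearest neighbors could look like. The location names, distances and the “closest three” rule below are illustrative assumptions, not our actual topology or algorithms:

```python
# Illustrative only: hand a failed location's traffic share to its closest
# healthy neighbors. Names, distances and the "spread over 3" rule are
# assumptions for the example, not Wix's real topology or algorithm.

DISTANCES = {
    ("us-east", "us-west"): 4, ("us-east", "eu-west"): 6, ("us-east", "asia-east"): 11,
    ("us-west", "eu-west"): 9, ("us-west", "asia-east"): 8, ("eu-west", "asia-east"): 9,
}

def distance(a: str, b: str) -> int:
    return DISTANCES.get((a, b)) or DISTANCES.get((b, a)) or 0

def redistribute(shares: dict[str, float], failed: str, spread: int = 3) -> dict[str, float]:
    """Split the failed location's share evenly across its `spread` closest peers."""
    orphaned = shares.pop(failed)
    neighbors = sorted(shares, key=lambda loc: distance(failed, loc))[:spread]
    for loc in neighbors:
        shares[loc] += orphaned / len(neighbors)
    return shares

# Example: "us-east" goes dark and its 25% share is absorbed by the rest.
print(redistribute({"us-east": 0.25, "us-west": 0.25, "eu-west": 0.25, "asia-east": 0.25},
                   failed="us-east"))
```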


Being cost-efficient also means that, when things are moving along normally, all of our locations are serving traffic at all times - none of them just stand idly by.



Crisis averted


The underlying principle for everything is monitoring. We actively monitor all of our locations, from the standpoint of a “user of a Wix user” all the way down to the last service running in any given location. This high visibility gives us a very in-depth understanding of any given location’s status - to the degree that we can actually say whether a location, at any given time, is “good enough” to serve traffic in the manner we expect it to. We don’t need to wait for a disaster to strike an entire location - as soon as we see indications of one, we don’t sit around waiting to see how things unfold and then react - we can actually be proactive!
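
To give a feel for what “good enough to serve traffic” can mean in practice, here is a simplified, hypothetical health roll-up in Python. The signal names and thresholds are assumptions for illustration, not our real monitoring rules:

```python
# Illustrative only: roll per-service health signals up into a per-location
# "good enough to serve traffic" verdict. Thresholds and fields are assumptions.
from dataclasses import dataclass

@dataclass
class ServiceHealth:
    name: str
    error_rate: float      # fraction of failing requests
    latency_p99_ms: float  # 99th-percentile latency

def location_is_healthy(services: list[ServiceHealth],
                        max_error_rate: float = 0.01,
                        max_p99_ms: float = 800.0,
                        min_healthy_fraction: float = 0.95) -> bool:
    """A location is 'good enough' only if nearly all of its services look healthy."""
    healthy = [s for s in services
               if s.error_rate <= max_error_rate and s.latency_p99_ms <= max_p99_ms]
    return len(healthy) / len(services) >= min_healthy_fraction

# Example: one shaky service out of three is enough to flag the whole location.
us_east = [ServiceHealth("checkout", 0.002, 210),
           ServiceHealth("media", 0.001, 350),
           ServiceHealth("sessions", 0.08, 1900)]
print(location_is_healthy(us_east))  # False -> candidate for a traffic switch
```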


This brings us to the latest Amazon outage mentioned above. We started to see a few issues with small services that rely on DynamoDB. Simply put, the situation became “shaky”, with several of our services telling the system that they weren’t feeling that well anymore - a major oversimplification, of course, but one which helps illustrate the point. At the time Amazon hadn’t posted anything, and nobody was even aware of any issues. Despite that, we felt certain enough to make the decision to “move out” of that location, so we stopped routing user traffic to US-EAST. The end result was that our provider did end up having a failure - and our users didn’t notice anything being wrong!



What’s under the hood


How are we able to have a system in place that allows us to shift resources someplace else and then calmly investigate what’s going on without risking our users’ experience? Here’s what we have set up, in simplified terms. We:

  • Monitor the entire stack and understand where, across the entire stack of technologies, we are having problems

  • Have the ability to serve traffic from different locations (using DNS providers to shift traffic between a variety of data centers - a simplified sketch of this follows the list)

  • Have the ability to scale our resources automatically and very quickly to cope with an immediate change in demand

  • Have a data layer which is able to receive a request on one side of the world and understand that it is a continuation of another request that was started on the other side of the world
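
As referenced in the list above, here is a simplified sketch of the DNS-based traffic shift idea. The `apply_weights` function below is a hypothetical stand-in for a real DNS provider’s API, not an actual client library:

```python
# Illustrative only: shift traffic between data centers by adjusting DNS
# weights. `apply_weights` is a hypothetical stand-in for a DNS provider API.

def drain_location(weights: dict[str, int], drained: str) -> dict[str, int]:
    """Zero out the drained location and spread its weight over the rest."""
    remaining = [loc for loc in weights if loc != drained]
    extra = weights[drained] // len(remaining)
    new_weights = {loc: weights[loc] + extra for loc in remaining}
    new_weights[drained] = 0
    return new_weights

def apply_weights(record: str, weights: dict[str, int]) -> None:
    # Placeholder: in reality this would call the DNS provider's API.
    print(f"{record}: {weights}")

# Example: stop routing public traffic to "us-east" while keeping overall capacity.
current = {"us-east": 100, "us-west": 100, "eu-west": 100}
apply_weights("www.example.com", drain_location(current, "us-east"))
```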


Besides all of the above, our entire infrastructure, although geographically spread, needs to be very much aware of what a request is, how it comes in, and how it needs to be processed as it moves between different geographical locations.



In this graph you can see one location spiking as a result of the error while, simultaneously, another takes “ownership” of our users-of-Wix-users traffic - we call it the PUBLIC traffic switch.




In this graph you can see Houston, our command-and-control bot, reporting in Slack that it detected an error and is automatically performing the traffic shift across locations.



This means that, for example, when a checkout transaction happens on one of our users’ websites, behind that transaction there needs to be a fault-tolerant infrastructure which is able to connect the dots when that transaction starts on a set of servers in one location but continues in another - seamlessly and continuously. And we are talking about distribution across thousands of physical miles, of course.
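
A toy illustration of that “connect the dots” behavior: a transaction ID issued in one location is enough for another location to pick up where the first left off, as long as the data layer behind them is shared. The in-memory dictionary below stands in for a globally replicated store and is, of course, a simplification:

```python
# Illustrative only: a checkout that starts in one location and continues in
# another. The dict below stands in for a globally replicated data layer.
import uuid

REPLICATED_STORE: dict[str, dict] = {}  # pretend every location sees this view

def start_checkout(region: str, cart: list[str]) -> str:
    txn_id = str(uuid.uuid4())
    REPLICATED_STORE[txn_id] = {"region_started": region, "cart": cart, "step": "started"}
    return txn_id

def continue_checkout(region: str, txn_id: str) -> dict:
    txn = REPLICATED_STORE[txn_id]  # found even though it began elsewhere
    txn.update(step="payment", region_continued=region)
    return txn

# Example: the transaction begins in us-east and is completed from eu-west.
txn_id = start_checkout("us-east", ["domain", "premium-plan"])
print(continue_checkout("eu-west", txn_id))
```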


To be able to “move out” instantly, our entire system needs to know how to grow in order to accommodate the change. And so it auto-scales. Beyond that, it always keeps a buffer of resources to make sure that even if a switch happens during one of the usual seasonal spikes, we still stay below our capacity limits.
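
As a rough, back-of-the-envelope illustration of the buffer idea: each location needs to fit a seasonal peak plus a failed-over share under its capacity, with headroom to spare. The numbers below are made up for the example:

```python
# Illustrative only: check whether a location can absorb a failed-over share
# on top of a seasonal peak while keeping a safety buffer. Numbers are made up.

def has_enough_headroom(capacity_rps: float,
                        seasonal_peak_rps: float,
                        failover_share_rps: float,
                        buffer_fraction: float = 0.2) -> bool:
    """True if peak traffic plus an inherited share still fits under capacity
    with `buffer_fraction` of it left untouched."""
    worst_case = seasonal_peak_rps + failover_share_rps
    return worst_case <= capacity_rps * (1 - buffer_fraction)

# Example: a location peaking at 10k rps that may inherit another 4k rps.
print(has_enough_headroom(capacity_rps=20_000,
                          seasonal_peak_rps=10_000,
                          failover_share_rps=4_000))  # True -> safe to absorb
```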


This design allows us to:

  1. Never be in a single region geographically

  2. Keep track of every request between any and all of those locations seamlessly

  3. Be certain that each location can auto-scale as needed, when needed

  4. Move small chunks around instead of migrating the whole system


All of the above enables us to react very quickly to any event and to have confidence in our system. And that confidence is what became the backbone of the system we refer to as Reactive Production.


As we said before - what can fail will fail - and the Reactive Production system allows us to move things around when we need to, before serious problems even have time to manifest in any significant manner. In essence, we can catch potential problems with a cloud provider’s infrastructure and feel comfortable that the Reactive Production system will act faster than a human and automatically redistribute traffic at a moment’s notice.
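
In spirit, the automated part can be pictured as a small control loop: keep checking every location’s health and shift traffic away the moment one looks unhealthy. The hooks below (`check_health`, `shift_traffic_away`, `notify`) are hypothetical placeholders for the real monitoring, DNS and alerting layers, not actual Wix components:

```python
# Illustrative control-loop sketch: watch location health and shift traffic
# automatically, without waiting for a human. The three callables are
# hypothetical hooks, not real Wix components.
import time
from typing import Callable

def reactive_loop(locations: list[str],
                  check_health: Callable[[str], bool],
                  shift_traffic_away: Callable[[str], None],
                  notify: Callable[[str], None],
                  poll_seconds: int = 30) -> None:
    serving = set(locations)
    while True:
        for location in sorted(serving):
            if not check_health(location):
                shift_traffic_away(location)   # e.g. zero out its DNS weight
                serving.discard(location)      # stop considering it until recovery
                notify(f"{location} degraded - traffic shifted automatically")
        time.sleep(poll_seconds)
```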

For the last two and a half years we have been testing this system and practicing to make sure we are ready. And during the last outage we moved quickly, automatically, and before our provider even understood what their problem was.



Summary


In production engineering we have to live by the axiom that “everything that can go wrong will go wrong”, and work to prevent things from going wrong, as much as we can.


That means building all of the infrastructure running Wix services with resilience and flexibility in mind. It means looking at high availability through a multi-geographical lens instead of focusing on HA zones alone. And recently, when an entire region of Amazon-hosted servers, US-East-1, had an outage, we proved our approach works - we and all of our clients stayed online.


 


This post was written by Jonathan Ginzburg


 
