Updated: Aug 17, 2021
Every year, on the Friday after Thanksgiving, traffic to online shopping sites spikes three, four, five times over. Websites built to accommodate a certain amount of demand can easily be overwhelmed by a stampede of customers they don’t see at any other time of the year.
How, then, is it possible to avoid system breakdowns? Jonathan Ginzburg, Head of Production at Wix, has one solution: break them down yourself.
In this episode, we follow Jonathan, Shahar Zur and Fabio Furiosi as they conduct a multi-continental test of just how far they can push their system in production.
INTRO TO BLACK FRIDAY
Hi, I’m Ran Levi, welcome to The Wix Engineering Podcast.
What does Black Friday mean for someone in your line of work?
Jonathan: Wow. Well, let’s start.
Black Friday is a time for holiday shopping. We flock to websites for 40% off deals on Christmas cat pajamas, air fryers and impractically large televisions.
But for web engineers, Black Friday means something much different.
Fabio: Black Friday is extremely important because it is the most important weekend for our users.
Shahar: Yeah, and for us it’s also kind of a test for our systems because of course we always need to be super available, highly available all the time. But in Black Friday, it’s extra important.
Jonathan: To give a simple example - think of a big mall, right?
Like, think of yourself going to the mall on Black Friday and there’s no electricity or whatever other problem, you would probably not stay too long and not buy proper stuff. So the virtual space mall is a bit different, but still, the concept is the same.
Wix holds a lot of stores in it. And all of these stores are in the same virtual mall.
So we are preparing a lot to make sure our mall is fine and shiny before these days are coming so that none of our users will ever experience any kind of problem, at least not any expected one.
For the people who make the web tick--who keep websites running, and running well, so that the rest of us can do our shopping--Black Friday is a tsunami that comes every year.
Jonathan: That one weekend can have more traffic than the entire month.
It’s quite normal for websites to experience ten times their usual traffic during Black Friday weekend. And, somehow, it was even worse in 2020 than ever before.
Fabio: More than ever during Corona a lot of businesses shut down, they moved to be online only.
Black Friday 2020 was possibly the single most popular day for online shopping in the history of online shopping.
Fabio: We reached I think more than one million requests per minute. So it was extremely, extremely a lot of traffic to be handled.
And it’s precisely because Black Friday is so popular that the cost of error, for the businesses who rely on that income, is incredibly high.
Fabio: The income of these people was relying on us - and we couldn’t betray them.
INTRO TO THE GUESTS
In this episode, to find out how the web stays afloat during the busiest time of the year, we’re going behind the scenes with three people who spent almost the entirety of last fall making sure that happened.
Shahar: I am Shahar. I am a backend developer at Wix. I am also the owner of Site Assets which is the service that is responsible for calculating all the data that is required for a Wix site to render.
Fabio: My name is Fabio. I’m working in the production team as an operation manager. Our team is responsible to – as an umbrella to look at the stability and resilience of the company and to improve it day by day.
Jonathan: I’m Jonathan Ginzburg, the title is Head of Production Platform, but that actually means that I’m in charge of providing all of the backbone infrastructure and services that Wix products run on.
Whether you own a physical or online business, Black Friday is something that requires time to prepare for.
For folks like Jonathan, Shahar and Fabio, Black Friday isn’t just something to be prepared for--it’s a whole ordeal. In their own ways, each of them works with the systems supporting websites, to make sure they operate properly at max capacity. That kind of work requires a great amount of time and investment.
Fabio: The investment in terms of people is huge. I think that more than 100 people were working on that in different parts of the company.
More than a hundred Wix employees worked on building a better Black Friday. As an operations manager, Fabio’s task was to oversee and coordinate the lot of it.
Fabio: Many, many things were done. We started approximately one and a half months before Black Friday for the preparation of it.
A lot of services were added, a lot of parts, a lot of activities were done in order to optimize the queries from – moved from the masters to slaves, and on top of that, of course other verticals as well, events, blogs, bookings. We checked all the flow to see that everything was correctly dimensioned. Security was involved to check all the PCI, together with system.
Everybody knew exactly who was responsible for what during the long weekend of the Black Friday. Channels were opened long before. So any kind of channel, Slack, WhatsApp, Zoom, in order not to lose time in case something is happening.
There were far too many projects going on--too many people and processes--to cover in just one podcast. But as just one example, we can look to Shahar’s team, which began building a new Wix Viewer all the way back in the summer.
Shahar: The Wix viewer is basically the application that is rendering the Wix websites. And the new Wix viewer is a project in which we moved most of the application code from our users’ devices to Site Assets.
So it is running on our servers basically. And the meaning of this is that if Site Assets is down or is responding slowly for some reason, then our users can potentially get hurt.
The new Wix Viewer was needed regardless of Black Friday, but weaknesses in the old system were more likely to be exposed by high Black Friday traffic. So rather than expose customers to potential failure, they decided to rebuild the whole thing from the ground up.
Shahar: And that’s what we did during the past few months before Black Friday.
The re-architecture of the Wix viewer is a huge project which involved more than 100 people. So I cannot talk about all of the aspects of it. But the main idea was to move the maximum amount of code that we can to the server. Like to stop calculating stuff on our users’ devices and to do these calculations on our servers instead and also caching aggressively the results.
We basically split the service into two microservices. One of them was responsible for routing requests and the other for executing them and by implementing smart routing, we were able to separate calculations that were very different from one another. And to get a much better performance and to be able to handle a higher scale.
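As a rough illustration of the split Shahar describes, here is a minimal Python sketch with one component that routes requests and separate pools that execute them, plus a cache for results. This is not Wix's code: the class names, the request kinds ("layout", "styles") and the in-memory cache are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Executor:
    """Hypothetical pool that runs one kind of calculation."""
    name: str
    compute: Callable[[str], str]
    calls: int = 0  # how many times this pool actually did work

    def run(self, site_id: str) -> str:
        self.calls += 1
        return self.compute(site_id)


@dataclass
class Router:
    """Hypothetical router: picks a pool by request kind, caches results."""
    pools: Dict[str, Executor]
    cache: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def handle(self, kind: str, site_id: str) -> str:
        key = (kind, site_id)
        if key not in self.cache:
            # Different kinds of work go to different pools, so a slow
            # calculation of one kind does not hold up the others.
            self.cache[key] = self.pools[kind].run(site_id)
        return self.cache[key]


# Made-up request kinds and site id, for illustration only.
router = Router(pools={
    "layout": Executor("layout-pool", lambda s: f"layout:{s}"),
    "styles": Executor("styles-pool", lambda s: f"styles:{s}"),
})
first = router.handle("layout", "site-1")
second = router.handle("layout", "site-1")  # answered from the cache
```

In this toy version the second request never reaches an executor, which is the effect aggressive caching is meant to have under heavy traffic.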
Amid everything happening in those months leading up to Black Friday, one particular project stood out from the rest. A project so massive that it involved a not-insignificant percentage of the entire world’s internet traffic.
Fabio: Infrastructure in Wix is huge.
We have main data centers spread around worldwide and a lot of micro pops in order to be as close as possible to the users.
When you visit an e-commerce site, everything you see was put there by the company you’re buying from. But the site itself, and its many processes, are supported by a web infrastructure company like Wix. Wix servers sit in massive data centers around the world, where internet traffic is being processed and routed up to a million times per second.
Jonathan: And our entire system is built in such a way that we are built to cope with the traffic.
The problem, for providers like Wix, is that for 360 days a year, web traffic tends to be predictable and relatively steady. Then Black Friday weekend comes around, and everything goes into hyperdrive.
Jonathan: So expecting a one month traffic in 24 hours or even more than a month’s traffic in 24 hours and holding that amount of stress for a full long weekend can take a lot of toll on your system.
On any given Black Friday, if not properly accounted for, too many people might query the same web servers, causing overloads and crashes. In the worst case scenario--for a provider like Wix, which services millions of websites all at once--we’re talking about large swaths of the internet slowing down or completely crashing, all at once. A nightmare scenario.
Jonathan: And we know that for our users in Wix, it is extremely important that during this crucial time of Black Friday and Cyber Monday, when they’re doing sales and stuff like that, they want the mall to operate properly.
To protect against a potential Black Friday collapse, Jonathan and his team prepared for an extraordinary step: they decided to collapse their own systems, intentionally.
It doesn’t make much sense, at first, that they would purposely do to their systems the exact thing they really didn’t want to happen. But there’s a method to it. In fact there’s an entire sub-field of engineering dedicated to this practice: it’s called “chaos engineering” or “chaos testing”.
The point of chaos testing is to experiment in a controlled environment--to introduce unexpected conditions and trigger failures in order to better understand how your systems will respond when things go wrong in the real world. It’s a practice that was originally developed at Netflix a decade ago, and it’s part of the reason why you rarely get outages during your Queen’s Gambit binge sessions.
Jonathan: And the idea is to really build this resilience. And each time, make sure that this resilience, once it was built and you have the relationships and you know what is going on in case of an emergency.
To determine whether their data centers would be able to handle Black Friday traffic under extreme stress, Jonathan and his team prepared a chaos test--or, perhaps more accurately, a stress test. They would purposely collapse their own systems to see what would happen.
And it’s important to note: this wasn’t going to occur in some simulated environment. The production team was prepared to cripple their real-life, active system in production. While all the Wix websites were running, in real time, conducting actual business.
When I imagine this, I think there must be a million risks to doing this. What if your customers lose service before Black Friday? What if something really important breaks while you’re trying others out?
Jonathan: So one, yes, definitely. But on the other hand, you prefer testing it before Black Friday than failing on Black Friday. So it’s much better to be prepared.
But even before the test started, things were already going wrong.
Jonathan: As we approached the date, we understood we were just not going to pass it. So about three weeks before Black Friday, a week before the event, we’re like, there’s no chance that we can pass it. We cannot go into Black Friday like this.
In its current state, Wix was about to have a very bad Black Friday weekend.
Jonathan: We understood there were gaps in our own infrastructure that we really wanted to make sure that we don’t have.
So these gaps are basically in three different realms. Let’s put it like that. One of the realms is around automatic processes. So we want our production to be self-healing. We want it to understand what is going on and if what is going on is OK and is not OK.
The second thing is you don’t want to run on infrastructure which is out of date. You want to make sure that all of the systems which are relevant to Black Friday are really renewed properly and are not running on any legacy.
The third problem has to do with scale. Our scale of production is huge. Really, really huge. When you start stress testing it you actually find all of the different bugs of the scale. This is like an ongoing process. And this is why during this ongoing process, three weeks before Black Friday to – a week before the event of the test – we found ourselves in a situation in which we’re not quite there yet because we still have a lot of stuff to do.
One by one, the production team addressed the potential sore points in their system. The test itself, which they’d originally planned to do weeks in advance of Black Friday, had still not happened.
Jonathan: And a week before Black Friday, we kind of – or not kind of, we actually did close all of these gaps and we’re starting to ask ourselves, “Well, are we ready to stress test?” “Are we actually going to go through with this?”
It was now or never, so they finally got the green light. It was the Tuesday before Black Friday and Wix’s entire global infrastructure, servicing millions upon millions of websites, was about to be dropped to its knees.
DESIGN OF THE TEST
The exact design of the stress test was pretty simple, but to understand how it worked we first need to explain how the infrastructure itself is set up.
Jonathan: So, important to understand, our platform in Wix is highly available in multiple places. So, we can serve everything from a lot of different areas.
Wix data centers are spread out across the world partly because there are too many machines to fit in one place, but also as a means of fault tolerance. If a data center on one continent fails for whatever reason, the failure is unlikely to spread to data centers elsewhere around the globe.
Jonathan: So we’re usually not dependent on any special geography or any special area in the world and for whatever reason, one of the areas fails, doesn’t matter why, local ISP, security bugs, it can be whatever, it’s lots. We can always move traffic to a different area. And we use the system to do maintenance in specific areas. So we’re like, move out the traffic from this specific area, do this maintenance, finish, move back to traffic, see that everything’s all right, and then proceed.
As long as most of these data centers are up and running, the Wix system overall should be fine. But, hypothetically, what would happen if multiple data centers failed at the same time?
Jonathan: Our worst case scenario is collapsing - all of these areas to a single area, single location. It’s like a doomsday scenario. But we really need to make sure that we can handle this doomsday scenario.
The likelihood of multiple data centers on multiple continents all failing at the same time is extraordinarily low. But simulating that scenario is a pretty good way of figuring out just how far you can stretch the system without complete failure.
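The doomsday scenario can be reasoned about with simple arithmetic: drain one area at a time into the surviving location and check that it still has headroom. The Python sketch below does only that. The region names, traffic figures and capacity are invented for illustration and are not Wix's numbers, and a real test would watch live metrics rather than a dictionary.

```python
from typing import Dict, List, Tuple


def drain_into(
    target: str,
    load: Dict[str, float],
    capacity: float,
    order: List[str],
) -> List[Tuple[str, float]]:
    """Move each region's traffic into `target`, one region at a time.

    Returns a list of (drained_region, target_utilization) steps and
    raises if the target would be pushed past its capacity.
    """
    load = dict(load)  # work on a copy; leave the caller's data alone
    steps = []
    for region in order:
        moved = load[region]
        if load[target] + moved > capacity:
            raise RuntimeError(f"{target} cannot absorb traffic from {region}")
        load[target] += moved
        load[region] = 0.0
        steps.append((region, load[target] / capacity))
    return steps


# Hypothetical traffic per region, in arbitrary units.
load = {"us-east": 40.0, "europe": 35.0, "asia": 15.0}

# Drain the quietest area first, then the busier one.
steps = drain_into("us-east", load, capacity=120.0, order=["asia", "europe"])
for region, utilization in steps:
    print(f"after draining {region}: target at {utilization:.0%} of capacity")
```

With these made-up numbers the single location ends up at 75% of its capacity. If it could not absorb the next area, the function raises before moving anything further, which mirrors the point of doing the move in small steps.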
Jonathan: So we decided on the strategic location in which we made all of the latest upgrades and everything and we really felt that we are safe there and said, “OK, let’s go ahead and let’s do that.” And that one specifically was AWS US East-1.
That is Amazon’s US-East-1 region, located in northern Virginia.
DAY OF THE TEST
With their target data center and their plan ready, on the Tuesday before Black Friday, Jonathan and his colleagues gathered together to attempt the most daring stress test in the company’s history.
Jonathan: We started at the beginning of the day, everybody was super stressed. Important to say, beginning of the day, Israeli time. So we had like 12 hours until we get heavy traffic from the US. And we’re starting to move the bits and pieces.
As dawn turned to morning in the Western hemisphere, entire longitudes-worth of internet traffic began to roll in.
Jonathan: We’re starting with Asia, who’s about to sleep, and then we’re moving to Europe which is about… it’s heavy traffic already. And we’re all at the edge of our seats and we’re seeing that everything is expanding and, obviously, you get – during this day, all of these bugs and all of these small processes that don’t run properly and all of the indications that you’re missing and everybody’s working around it to make sure that everything works and everything is like super stressful.
Everybody was locked to their computers--constantly checking things, making adjustments, and taking swigs of coffee.
Jonathan: And bits and pieces over a period of almost nine hours, each time, we’re taking another global distributed service and putting it into the one DC.
Finally, after nine hours of taking down data centers and monitoring the results, all of Wix’s traffic was routing through US-East-1 alone.
Jonathan: Everybody is sitting in front of their keyboards and everybody is just looking at it and everybody is just looking at the screen, seeing the graphs, don’t believe that everything is actually holding in a single location.
They went over all the metrics, double- and triple-checking to make sure everything was still up and running. Amazingly, everything seemed to be...just fine.
Jonathan: We’re running from a single DC for a few good hours to make sure that everything is stable - and everybody is extremely excited.
It worked. For three hours, the entire company’s global internet traffic filtered through US-East-1 without any major outages, and only minor, predictable slowdowns. The results came out so good, in fact, that they decided to repeat the whole process all over again.
Jonathan: That was Tuesday.
And then Wednesday morning, we really felt that we did it so well in that specific location that we feel comfortable that we need to do it again.
But let’s do it again in a different location, get the assurance that we can do it in another location, not just this one single location.
And this time with less problems and with more knowledge, because we got a lot of experience from the day before, we’re doing it and over a period of three hours.
The production team stress-tested once more, this time in just three hours instead of nine, and on a different, less resilient data center. And it worked again.
Jonathan: We felt safe, we felt – we got – like we got all of the relevant numbers we needed to make sure that we have enough resources and made sure that everything is shiny and tidy for the coming Black Friday.
By the time Black Friday actually came around, Jonathan’s systems were a go, ready for whatever hit them, Shahar’s new Wix Viewer was live, and all the other operations Fabio had overseen in those months leading up to this particular weekend had been smoothed out.
Fabio: We were constantly monitoring the network, looking for alerts, checking what was going on, if we had some kind of a problem, a spike up or down. Nothing happened and that was amazing. So it was four long days of nothing - and that made us extremely, extremely happy.
Shahar: I was a bit stressed. I took my laptop anywhere I went. But eventually it was – yeah, nothing happened. It was even kind of boring I would say.
Fabio and Shahar had a boring Black Friday weekend--the thing they most wanted, and least expected.
Fabio: We all enjoyed a nice glass of whisky and we stopped the day. So we decided to take a half a day off.
Shahar: I went on a vacation. I went up north for three days and I almost succeeded not working at all.
For Jonathan and the production team, things weren’t quite as boring.
Jonathan: We’re done with the stress test and we’re feeling really comfortable and we’re saying, “OK, let’s…” We’re feeling comfortable, we’re feeling prepared, this is what we want it to be. Indeed, it was a stressful time with everything and all of the last-minute changes and stuff like that, but we tested it twice. Everybody felt very comfortable.
And that evening, one of the areas, the US East-1 had – Amazon, had a major incident in their data center.
Two days before Black Friday--just hours after their second stress test completed--US-East-1 went down. With it, web services from Roku to Flickr, Glassdoor, 1Password and Coinbase were interrupted or slowed to a crawl. Nobody knew if service would resume in time for Friday.
So, basically, the very scenario Wix’s production team had imagined was unfolding, in real life, before their eyes.
Jonathan: It’s like well-known, it’s all over the news and that’s one of the areas we’re serving from.
In any other situation, we would all just sit on the verge of a heart attack, really, with AWS crashing, saying, “Oh, what are we going to do with this? How would this affect us? How can we get into Black Friday like that?” and trying to mitigate it all over. Instead of this, we had just this comfortable meeting saying, “OK. So that’s the situation. Well, we did the test there but we also tested it in a different location and we are pretty certain that we can withstand the entire Black Friday without this area, without US East.”
US-East-1 came back up later that day, but it hardly mattered. Wix’s system would have remained afloat regardless--they knew it, because they’d already tested it.
Jonathan: And thank god for that, but – or thank the engineers for that.