Updated: Jul 4
Every year, on the Friday after Thanksgiving, traffic to online shopping sites spikes three, four, five times over. Websites built to accommodate a certain amount of demand can easily be overwhelmed by a stampede of customers they don’t see at any other time of the year.
How, then, is it possible to avoid system breakdowns? Jonathan Ginzburg, Head of Production at Wix, has one solution: break them down yourself.
In this episode, we follow Jonathan, Shahar Zur and Fabio Furiosi as they conduct a multi-continental test of just how far they can push their system in production.
INTRO TO BLACK FRIDAY
Hi, I’m Ran Levi, welcome to The Wix Engineering Podcast.
What does Black Friday mean for someone in your line of work?
Jonathan: Wow. Well, let’s start.
Black Friday is a time for holiday shopping. We flock to websites for 40% off deals on Christmas cat pajamas, air fryers and impractically large televisions.
But for web engineers, Black Friday means something much different.
Fabio: Black Friday is extremely important because it is the most important weekend for our users.
Shahar: Yeah, and for us it’s also kind of a test for our systems because of course we always need to be super available, highly available all the time. But on Black Friday, it’s extra important.
Jonathan: To give a simple example - think of a big mall, right?
Like, think of yourself going to the mall on Black Friday and there’s no electricity or whatever other problem, you would probably not stay too long and not buy proper stuff. So the virtual space mall is a bit different, but still, the concept is the same.
Wix holds a lot of stores in it. And all of these stores are in the same virtual mall.
So we are preparing a lot to make sure our mall is fine and shiny before these days are coming so that none of our users will ever experience any kind of problem, at least not any expected one.
For the people who make the web tick--who keep websites running, and running well, so that the rest of us can do our shopping--Black Friday is a tsunami that comes every year.
Jonathan: That one weekend can have more traffic than the entire month.
It’s quite normal for websites to experience ten times their usual traffic during Black Friday weekend. And, somehow, it was even worse in 2020 than ever before.
Fabio: More than ever during Corona a lot of businesses shut down, they moved to be online only.
Black Friday 2020 was possibly the single most popular day for online shopping in the history of online shopping.
Fabio: We reached I think more than one million requests per minute. So it was extremely, extremely a lot of traffic to be handled.
And it’s precisely because Black Friday is so popular that the cost of error, for the businesses who rely on that income, is incredibly high.
Fabio: The income of these people was relying on us - and we couldn’t betray them.
INTRO TO THE GUESTS
In this episode, to find out how the web stays afloat during the busiest time of the year, we’re going behind the scenes with three people who spent almost the entirety of last Fall making sure that happened.
Shahar: I am Shahar. I am a backend developer at Wix. I am also the owner of Site Assets which is the service that is responsible for calculating all the data that is required for a Wix site to render.
Fabio: My name is Fabio. I’m working in the production team as an operation manager. Our team is responsible to – as an umbrella to look at the stability and resilience of the company and to improve it day by day.
Jonathan: I’m Jonathan Ginzburg, the title is Head of Production Platform, but that actually means that I’m in charge of providing all of the backbone infrastructure and services that Wix products run on.
Whether you own a physical or online business, Black Friday is something that requires time to prepare for.
For folks like Jonathan, Shahar and Fabio, Black Friday isn’t just something to be prepared for--it’s a whole ordeal. In their own ways, each of them works with the systems supporting websites, to make sure they operate properly at max capacity. That kind of work requires a great amount of time and investment.
Fabio: The investment in terms of people is huge. I think that more than 100 people were working on that in different parts of the company.
More than a hundred Wix employees worked on building a better Black Friday. As an operations manager, Fabio’s task was to oversee and coordinate the lot of it.
Fabio: Many, many things were done. We started preparing approximately one and a half months before Black Friday.
A lot of services were added, a lot of parts, and a lot of work was done to optimize the queries, which were moved from the masters to the slaves. On top of that, of course, there were the other verticals as well: events, blogs, bookings. We checked every flow to see that everything was correctly dimensioned. Security was involved to check all the PCI, together with system.
Everybody knew exactly who was responsible for what during the long Black Friday weekend. Channels were opened long before. Any kind of channel, Slack, WhatsApp, Zoom, in order not to lose time in case something happened.
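The query optimization Fabio mentions, moving reads off the masters, can be pictured with a small sketch. It is only an illustration under stated assumptions, not Wix's actual setup: the `QueryRouter` class, the server names and the rule "anything starting with SELECT is a read" are all hypothetical, and the sketch uses the term "replicas" for the secondary databases.

```python
import itertools


class QueryRouter:
    """Hypothetical read/write splitter: reads go to replicas, writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Rotate through the replicas so reads are spread evenly (round-robin).
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql):
        words = sql.split()
        # Assumption: a statement is a read if and only if it starts with SELECT.
        is_read = bool(words) and words[0].upper() == "SELECT"
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = QueryRouter("primary-db", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM orders"))         # a read, sent to a replica
print(router.target_for("UPDATE orders SET paid = 1"))   # a write, sent to the primary
```

A real system would also have to deal with replication lag and with reads that must see the latest write; the sketch ignores both.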
There were far too many projects going on--too many people and processes--to cover in just one podcast. But as just one example, we can look to Shahar’s team, which began building a new Wix Viewer all the way back in the summer.
Shahar: The Wix Viewer is basically the application that is rendering the Wix websites. And the new Wix Viewer is a project in which we moved most of the application code from our users’ devices to Site Assets.
So it is running on our servers basically. And the meaning of this is that if Site Assets is down or is responding slowly for some reason, then our users can potentially get hurt.
The new Wix Viewer was necessary regardless of Black Friday, but weaknesses in the old system were more likely to be exposed by high Black Friday traffic. So rather than expose customers to potential failure, they decided to simply rebuild the whole thing from the beginning.
Shahar: And that’s what we did during the past few months before Black Friday.
The re-architecture of the Wix Viewer is a huge project which involved more than 100 people. So I cannot talk about all of the aspects of it. But the main idea was to move the maximum amount of code that we can to the server: to stop calculating stuff on our users’ devices, to do these calculations on our servers instead, and also to cache the results aggressively.
We basically split the service into two microservices. One of them was responsible for routing requests and the other for executing them. By implementing smart routing, we were able to separate calculations that were very different from one another, to get much better performance, and to handle a higher scale.
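The split Shahar describes, one service that routes and another that executes, with results cached aggressively, can be sketched roughly as follows. This is a minimal illustration and not the Site Assets implementation: the `Router` and `Executor` classes, the "heavy" and "light" pools and the module names are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Executor:
    """Stands in for one pool of execution servers (hypothetical)."""
    name: str
    calls: int = 0

    def run(self, site_id: str, module: str) -> str:
        self.calls += 1
        return f"{module} for {site_id} computed by {self.name}"


@dataclass
class Router:
    """Sends each request to a pool chosen by workload type and caches the result."""
    pools: Dict[str, Executor]
    heavy_modules: frozenset
    cache: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def pick_pool(self, module: str) -> Executor:
        # Assumption: calculations are classified only by module name.
        kind = "heavy" if module in self.heavy_modules else "light"
        return self.pools[kind]

    def handle(self, site_id: str, module: str) -> str:
        key = (site_id, module)
        if key not in self.cache:
            # Cache miss: execute once on the matching pool, then remember the result.
            self.cache[key] = self.pick_pool(module).run(site_id, module)
        return self.cache[key]


light, heavy = Executor("light-pool"), Executor("heavy-pool")
router = Router({"light": light, "heavy": heavy}, frozenset({"layout"}))
router.handle("site-1", "layout")   # computed on the heavy pool
router.handle("site-1", "layout")   # repeated request, served from the cache
router.handle("site-1", "styles")   # computed on the light pool
print(heavy.calls, light.calls)
```

Keeping very different calculations in separate pools means a burst of expensive requests cannot starve the cheap ones, and the cache means a repeated request never reaches an executor at all.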
Amid everything happening in those months leading up to Black Friday, one particular project stood out from the rest. A project so massive that it involved a not-insignificant percentage of the entire world’s internet traffic.
Fabio: Infrastructure in Wix is huge.
We have main data centers spread around the world and a lot of micro PoPs in order to be as close as possible to the users.
When you visit an e-commerce site, everything you see was put there by the company you’re buying from. But the site itself, and its many processes, are supported by a web infrastructure company like Wix. Wix servers sit in massive data centers around the world, where internet traffic is being processed and routed up to a million times per second.
Jonathan: And our entire system is built in such a way that we can cope with the traffic.
The problem, for providers like Wix, is that for 360 days a year, web traffic tends to be predictable and relatively steady. Then Black Friday weekend comes around, and everything goes into hyperdrive.
Jonathan: So expecting a month’s traffic in 24 hours, or even more than a month’s traffic in 24 hours, and holding that amount of stress for a full long weekend can take quite a toll on your system.
On any given Black Friday, if the surge is not properly accounted for, too many people might query the same web servers, causing overloads and crashes. In the worst-case scenario--for a provider like Wix, which serves millions of websites--we’re talking about large swaths of the internet slowing down or crashing completely, all at once. A nightmare scenario.
Jonathan: And we know that for our users in Wix, it is extremely important that during this crucial time of Black Friday and Cyber Monday, when they’re doing sales and stuff like that, they want the mall to operate properly.
To protect against a potential Black Friday collapse, Jonathan and his team prepared for an extraordinary step: they decided to collapse their own systems, intentionally.
It doesn’t make much sense, at first, that they would purposely do to their systems the exact thing they really didn’t want to happen. But there’s a method to it. In fact there’s an entire sub-field of engineering dedicated to this practice: it’s called “chaos engineering” or “chaos testing”.
The point of chaos testing is to experiment in a controlled environment--to introduce unexpected conditions and trigger failures in order to better understand how your systems will respond when things go wrong in the real world. It’s a practice that was originally developed at Netflix a decade ago, and it’s part of the reason why you rarely get outages during your Queen’s Gambit bi