Growing Pains, E02: Full Transcript

Updated: Apr 21

Ittai Zeidman, our Backend Engineering Lead, was in the hospital with his wife and newborn, when he got an urgent call from the company’s VP of R&D. A crisis was unfolding: the build system was broken, leaving hundreds of developers unable to do their work.

This crisis wasn’t an isolated incident: it was the outcome of a series of problems stemming from the company’s success and fast growth. Ittai and his peers faced a serious challenge - but they knew they weren't the only ones: Google, Facebook, and Twitter were also serving millions of users.

Utilizing and learning from the experience of these companies, Ittai helped transform Wix’s build system from the ground up with Bazel, an open-source build tool. He and his team tackled these significant scaling issues, and managed to implement company-wide infrastructure successfully.

Hear all about it directly from Ittai Zeidman and Or Shachar on our 2nd podcast episode. Plus, Natan Silnitsky will introduce Exodus, our open-source tool that can easily migrate JVM code from Maven to Bazel.

You can also listen to this episode on Apple Podcasts, Spotify, Google or on the Wix Engineering site. And you can also read the full interview here:

Hi, I’m Ran Levi, and this is The Wix Engineering Podcast.

Ittai: So there I am at the hospital with my newborn, my third one.

That’s Ittai Zeidman, lead backend engineer at Wix. He spoke to Nate Nelson, our senior producer.

Ittai: So I also have my two other kids with me. And I’ve been getting some texts over the past few days, I know that there are problems... And we are one day after the delivery, my wife is there, completely beat, she’s trying to recuperate and I get a call from my manager.

Ittai had just experienced the birth of his child. He was with his wife and his kids in the hospital - nothing could be better. That’s when he got a phone call.

Ittai: The call was about basically our CI system grinding to a halt. We recently made a big upgrade of the build server and things started to decline. And at that point, I think it was a day after the birth, the things just stopped working and we needed to understand how we’re basically - letting people work again because they were unable to deliver at all.

Now, listeners, I must admit: if my boss called me the day after my child’s birth, I’d probably just not pick up the phone. Fortunately for his coworkers, Ittai did pick up. Unfortunately for Ittai, not 24 hours after watching his wife push a full human being out of her body, he now had another crisis to deal with.

Ittai: He’s like, “So, how is it going? How is the baby?” and, you know, small talking. Like OK, OK. What do you need? Because we texted already. He said his congratulations before and he’s like “Listen, things are really bad. We don’t understand what’s happening and people are stuck for hours. We really need you to come in”. And I was like, okay... I’m literally holding the baby in my hands, my eldest is holding the phone next to my ear and then I was like, “OK. Let’s see how I can juggle this,” and then basically I called my parents, they came, they spent some time there and I went in the afternoon to the offices and then we did a brainstorm.

The VP engineering and VP R&D and the person leading the CI at the time and myself, we were trying to brainstorm how are we going to get out of this mud. It was very alarming to know that people are just waiting.

In this episode of our show, Ittai and his engineering team attempt to shift their entire company onto a new build system - to take an old system, incapable of handling the full scope of the company’s needs, and replace it with a new, better, bigger one from scratch. It’s a very complicated, difficult maneuver.

More importantly, though, Ittai’s story is a case study in how to effectively handle the lowest levels of a major organization’s IT infrastructure. Making any significant changes to a system that supports thousands of engineers every day is, you’ll see, akin to trying to pull a tablecloth out from under a set of fine china without breaking anything. Or maybe it’s like giving birth. Except, you know, instead of a baby you get a build server.

Okay, scratch that. It’s not at all like giving birth.

Ittai Zeidman

Interviewer: So you had to leave the hospital - how did your wife feel about that?

Ittai: It’s going to sound cliché but I have an amazing wife. Basically she understands. She understands and the fact is that I’m there a lot. I leave work twice a week to be with my kids. I try to be very present and so when I need to go, it’s for a concrete reason. This is the partnership that we have.

Or: I’m Or. I’ve worked at Wix for a little over three years now. I manage the build and CI team for the dev experience group.

Or Shachar is a close colleague of Ittai’s - part of the team whose job it was to deal with the Wix “build system.”

What is a ‘build system’? When a number of developers are working on the same project, each one is responsible for certain bits and pieces. Those bits and pieces, collectively, make up the final software product they intend to deliver. A build system’s purpose is to take these individual code pieces, verify that their dependencies are met, compile them if needed and run automated tests.

Or: A build system is basically something that takes all the code of the developers and its purpose is to provide feedback to the developers on whether their code is OK and doesn’t break anything.
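To make that description concrete, here is a minimal Python sketch of the steps just listed: check that dependencies exist, order the modules, build and test each one, and report feedback. The module names and the `compile_and_test` callback are hypothetical stand-ins for illustration, not Wix's actual setup or any real tool's API.

```python
from graphlib import TopologicalSorter

# Hypothetical modules: name -> set of modules it depends on.
MODULES = {
    "core": set(),
    "payments": {"core"},
    "editor": {"core"},
    "site": {"editor", "payments"},
}


def check_dependencies(modules):
    """Return the dependencies that no module in the project defines."""
    known = set(modules)
    return {dep for deps in modules.values() for dep in deps} - known


def build_order(modules):
    """Order modules so each one comes after everything it depends on."""
    return list(TopologicalSorter(modules).static_order())


def run_build(modules, compile_and_test):
    """Build every module in dependency order and collect feedback.

    compile_and_test is an assumed callback: module name -> True if the
    module compiled and its tests passed.
    """
    missing = check_dependencies(modules)
    if missing:
        raise ValueError(f"unknown dependencies: {sorted(missing)}")
    results = {}
    for name in build_order(modules):
        # A module cannot be verified if anything upstream is not green.
        if any(results[dep] != "green" for dep in modules[name]):
            results[name] = "blocked"
        else:
            results[name] = "green" if compile_and_test(name) else "red"
    return results
```

Calling `run_build(MODULES, lambda name: name != "editor")` simulates a failure in `editor`: that module is reported as red, and `site`, which depends on it, is reported as blocked rather than built.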

A build system is especially important in a Continuous Integration environment, or CI for short. In a Continuous Integration environment, several developers who work on individual code pieces need to integrate their code into a single shared repository. In Wix’s case, hundreds of developers could be working on the same code base - which is why for such an integration to work, the code needs to be stable and error-free. A build system is necessary to this process, as it keeps everybody on the same page and prevents one person’s code from disagreeing with the rest. Think of it like a traffic cop - waving cars through when the lane is clear, stopping them when there’s a threat of a crash.

Or: When I want to change my code or when I want to change the company code, the organization code, I want to assume that the code that I’m starting with is stable, is green. So the goal is to keep it that way.

As a backend infrastructure developer, Natan Silnitsky is an expert in this very technology. He’s the final member of the build team we’ll meet in this episode.

Natan: When I heard that I had a possibility to work with Ittai Zeidman, I was thrilled because from the time I got to know him with my previous role, I thought that it was the perfect combination of all at once in one package.

Natan’s been at his job for half a decade. In his first days, the company was using a common yet aging build infrastructure.

Natan: When I got to Wix and up until we actually did the switch over to the new system, the basic system was comprised of Maven as a build tool and dependency management and the build server environment was TeamCity.

The exact roles that Maven and TeamCity played in Wix’s build infrastructure are not particularly important to our story - but for the sake of clarity, I’ll just say that Maven is the automated build tool that does the compiling, testing, etc., while TeamCity is the actual server on which these processes are done.

This frees the developers from running the build processes on their own local computers - and has the added benefit that the build server can be configured to be identical to the production servers, which means there are fewer chances for bugs that stem from differences between the configuration of the developer’s own machine and that of the production server.

For a while, pairing Apache’s Maven build tool with the TeamCity continuous integration server worked very well. It allowed for a quick, continuous delivery of code to the company’s wider code base.

But then problems began to arise. What changed? Not Maven - it was the same as ever.

Natan: So when I started at Wix’s backend guild in Tel Aviv, we were like, I don’t know, 50 people, even less I think. And you can really easily talk to one another if you needed something or... the scale wasn’t that big.

As an employee of five years, Natan saw his company grow substantially. It was a good thing, but growth came with its own problems. The solutions that’d worked for years before could no longer accommodate such scale.

Natan: As more and more products were created and more and more teams were created, it means that there were more interactions, dependencies between these teams, between these products, and the dependency trees of the code got bigger and bigger and bigger. So it really was a big problem - and how to deal with all this huge dependency tree of this code.

Or Shachar

Interviewer: What sorts of complaints were the engineers having about the old system?

Or: It went from “I just pushed my code and I need to get my deployable and my build is in the queue, stuck in the queue for hours and nothing happens”. Or something in the upstream - “My module is currently blocked because some module in the upstream that I depend on transitively is currently broken, so can anyone please fix it?”

Ittai: We were also hitting a lot of Maven limitations. We just want to be able to work and the fact that we – our feedback is now broken, it’s something that we can’t live with. You want us to deliver fast - please help us to deliver fast.

All these problems, individually, weren’t the end of the world. Taken together, however, they made one thing clear: the build system was broken.

Ittai: So actually usually when people say the build was broken, it means that – for example, we are unable to assert that the software still works. But what we actually had was that the build server itself was just not working. So it was like the build might have been working, but no one could actually get feedback. So that’s like the worst case scenario because no one can get feedback whereas sometimes some people can get some of the feedback in.

To be clear: Maven wasn’t in itself bad software. It simply became an emblem of the growing need for the company to adapt to new solutions that fit their growing size. Average runs were taking 45 minutes. For a company that manages thousands of builds every day, this was unacceptable. Productivity was down. Developers were frustrated.

Natan: So levels of frustration build up and – well, I think the guild management understood, the server guild management understood that developers are hurting and development velocity is hurting. And the decision was made to see what kind of alternative we can find for Maven.

A change in the build systems had to be made. But finding a replacement wouldn’t be easy. Few organizations have quite so many developers working together, and separately, on so many different projects as Wix.

Ittai: I think it actually affects a great deal because at a certain size you can’t take regular off-the-shelf software. You have a lot of workflows that you want to cater to and also we work in a continuous integration mode.

The number of companies that have gone through an infrastructure shift of such a scale as we’re describing in this episode is small. Therefore, the number of available solutions for such a project is quite small.

But, after testing a few of the options, one stood out above the rest.

Natan: I remember some guy started checking if Bazel can work for us.

Or: So Bazel is an open source build tool coming out from Google. It used to be an internal tool, it is an internal tool at Google called internally Blaze, they open sourced it in 2014 - they open sourced the core component of it.

Bazel is the publicly available iteration of a build tool - Blaze - which was designed by Google.

Ittai: Bazel has a strong extension story that allows you to use the core but still add your own ecosystem, and I think the proof is that Bazel has many, many external plugins maintained by the community.

Or: Unlike Maven, this tool supports fine-grain build units, source dependency, correct incrementality, massive parallelism. You can parallelize the work you do locally, you can also use remote execution to have many more parallel tasks running at the same time.
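The idea behind "correct incrementality" can be sketched in a few lines of Python: rebuild a target only when its own sources, or anything it depends on, have changed. This is a simplified illustration under stated assumptions, not how Bazel is actually implemented; the function names, the in-memory `cache`, and the target names in the usage note are all hypothetical.

```python
import hashlib


def action_key(name, sources, deps, memo):
    """Hash a target's own source text together with the keys of its deps.

    Because dependency keys feed into the hash, a change anywhere upstream
    changes the key of every target downstream of it.
    """
    if name not in memo:
        digest = hashlib.sha256(sources[name].encode())
        for dep in sorted(deps[name]):
            digest.update(action_key(dep, sources, deps, memo).encode())
        memo[name] = digest.hexdigest()
    return memo[name]


def incremental_build(sources, deps, cache):
    """Return the targets that had to be rebuilt and update cache in place.

    sources: target name -> source text (a stand-in for real files)
    deps:    target name -> set of targets it depends on
    cache:   target name -> key recorded by the previous build
    """
    memo, rebuilt = {}, []
    for name in sorted(sources):
        key = action_key(name, sources, deps, memo)
        if cache.get(name) != key:
            rebuilt.append(name)  # a real tool would compile and test here
            cache[name] = key
    return rebuilt
```

With three targets where `editor` depends on `core` and `payments` stands alone, a first call rebuilds everything, a second call with unchanged sources rebuilds nothing, and editing `core` rebuilds only `core` and `editor`.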

Perhaps more important than any specific feature was that Bazel was designed not just by, but for Google. So it was pre-made to be able to handle very large projects: Google-sized projects - from Gmail to Google Search itself - which must be constantly updated and highly fault-tolerant.

Ittai: This tool is what powers the day-to-day of tens of thousands of engineers inside of Google. You know, building two billion lines of code, running – I don’t know, I’m guessing tens of millions, hundreds of millions of tests every day. So this is – in a lot of aspects - this is a very battle-tested tool and they have been working on it in various versions for more than a decade. So we were really excited to be able to take something that’s been distilled over a lot of time and dev years.

The decision to use Bazel was finalized. But all that was just step one. Actually implementing the new system would be much more difficult than simply choosing the platform.

Ittai: Each department uses software of other departments via versions and this has a host of other problems, because basically you get drift of versions, and tech debt, and blames between the companies, which we didn’t want to get to.

Or: Another thing is that we didn’t want to lose any correctness of our build. We didn’t want to have this trade-off of maybe just running just a subset of the test or a subset of the modules or not waiting for other modules to integrate in. We wanted to keep the build correct as well.

Getting hundreds of developers to all follow one company-wide process change would be like trying to herd all the wildlife in a jungle into one zoo. Different people on different teams in different departments in different locations around the world have different requirements.

The one thing all of the developers had in common was that they needed the freedom to temporarily continue their work on ongoing projects via the old system, even while coworkers gradually moved on to the new one.