Ittai Zeidman, our Backend Engineering Lead, was in the hospital with his wife and newborn, when he got an urgent call from the company’s VP of R&D. A crisis was unfolding: the build system was broken, leaving hundreds of developers unable to do their work.
This crisis wasn’t an isolated incident: it was the culmination of a series of problems that stemmed from the company’s success and fast growth. Ittai and his peers faced a serious challenge - but they knew they weren’t the only ones: Google, Facebook, and Twitter were also serving millions of users.
Drawing on the experience of these companies, Ittai helped transform Wix’s build system from the ground up with Bazel, an open-source build tool. He and his team tackled these significant scaling issues and managed to implement company-wide infrastructure successfully.
Hear all about it directly from Ittai Zeidman and Or Shachar on our 2nd podcast episode. Plus, Natan Silnitsky introduces Exodus, our open-source tool that can easily migrate JVM code from Maven to Bazel.
Hi, I’m Ran Levi, and this is The Wix Engineering Podcast.
Ittai: So there I am at the hospital with my newborn, my third one.
That’s Ittai Zeidman, lead backend engineer at Wix. He spoke to Nate Nelson, our senior producer.
Ittai: So I also have my two other kids with me. And I’ve been getting some texts over the past few days, I know that there are problems... And we are one day after the delivery, my wife is there, completely beat, she’s trying to recuperate and I get a call from my manager.
Ittai had just experienced the birth of his child. He was with his wife and his kids in the hospital - nothing could be better. That’s when he got a phone call.
Ittai: The call was about basically our CI system grinding to a halt. We recently made a big upgrade of the build server and things started to decline. And at that point, I think it was a day after the birth, the things just stopped working and we needed to understand how we’re basically - letting people work again because they were unable to deliver at all.
Now, listeners, I must admit: if my boss called me the day after my child’s birth, I’d probably just not pick up the phone. Fortunately for his coworkers, Ittai did pick up. Unfortunately for Ittai, not 24 hours after watching his wife push a full human being out of her body, he now had another crisis to deal with.
Ittai: He’s like, “So, how is it going? How is the baby?” and, you know, small talking. Like OK, OK. What do you need? Because we texted already. He said his congratulations before and he’s like “Listen, things are really bad. We don’t understand what’s happening and people are stuck for hours. We really need you to come in”. And I was like, okay... I’m literally holding the baby in my hands, my eldest is holding the phone next to my ear and then I was like, “OK. Let’s see how I can juggle this,” and then basically I called my parents, they came, they spent some time there and I went in the afternoon to the offices and then we did a brainstorm.
The VP engineering and VP R&D and the person leading the CI at the time and myself, we were trying to brainstorm how are we going to get out of this mud. It was very alarming to know that people are just waiting.
In this episode of our show, Ittai and his engineering team attempt to shift their entire company onto a new build system - to take an old system, incapable of handling the full scope of the company’s needs, and replace it with a new, better, bigger one from scratch. It’s a very complicated, difficult maneuver.
More importantly, though, Ittai’s story is a case study in how to effectively handle the lowest levels of a major organization’s IT infrastructure. Making any significant changes to a system that supports thousands of engineers every day is, you’ll see, akin to trying to pull a tablecloth out from under a set of fine china without breaking anything. Or maybe it’s like giving birth. Except, you know, instead of a baby you get a build server.
Okay, scratch that. It’s not at all like giving birth.
Interviewer: So you had to leave the hospital - how did your wife feel about that?
Ittai: It’s going to sound cliché but I have an amazing wife. Basically she understands. She understands and the fact is that I’m there a lot. I leave work twice a week to be with my kids. I try to be very present and so when I need to go, it’s for a concrete reason. This is the partnership that we have.
Or: I’m Or. I’ve worked at Wix for a little over three years now. I manage the build and CI team for the dev experience group.
Or Shachar is a close colleague of Ittai’s - part of the team whose job it was to deal with the Wix “build system.”
What is a ‘build system’? When a number of developers are working on the same project, each one is responsible for certain bits and pieces. Those bits and pieces, collectively, make up the final software product they intend to deliver. A build system’s purpose is to take these individual code pieces, verify that their dependencies are met, compile them if needed and run automated tests.
Or: A build system is basically something that takes all the code of the developers and its purpose is to provide feedback to the developers on whether their code is OK and doesn’t break anything.
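To make that description concrete, here is a minimal sketch in Python of what a build system does at its core: check that every declared dependency exists, work out an order in which each module is built after the modules it depends on, and stop at the first failure so the developer gets feedback. The module names and the `steps` callables are hypothetical, and this is an illustration of the idea rather than how Maven, TeamCity or Bazel are implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_order(deps):
    """Return modules ordered so each one comes after its dependencies.

    deps maps a module name to the list of modules it depends on.
    """
    # Verify that every declared dependency is a known module.
    for module, needed in deps.items():
        missing = [d for d in needed if d not in deps]
        if missing:
            raise ValueError(f"{module} depends on unknown modules: {missing}")
    # static_order() yields dependencies before the modules that use them.
    return list(TopologicalSorter(deps).static_order())


def run_build(deps, steps):
    """Run each module's build step in dependency order.

    steps maps a module name to a callable returning True on success.
    Stops at the first failure and returns the results gathered so far.
    """
    results = {}
    for module in build_order(deps):
        ok = steps[module]()
        results[module] = ok
        if not ok:
            break  # downstream modules cannot be trusted, so stop here
    return results
```

In this sketch a failing module simply halts the run, which is the "feedback" Or describes: the developer learns which module broke before anything that depends on it is built.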
A build system is especially important in a Continuous Integration environment, or CI for short. In a Continuous Integration environment, several developers who work on individual code pieces need to integrate their code into a single shared repository. In Wix’s case, hundreds of developers could be working on the same code base - which is why for such an integration to work, the code needs to be stable and error-free. A build system is necessary to this process, as it keeps everybody on the same page and prevents one person’s code from disagreeing with the rest. Think of it like a traffic cop - waving cars through when there’s a lane, stopping them when there’s threat of a crash.
Or: When I want to change my code or when I want to change the company code, the organization code, I want to assume that the code that I’m starting with is stable, is green. So the goal is to keep it that way.
As a backend infrastructure developer, Natan Silnitsky is an expert in this very technology. He’s the final member of the build team we’ll meet in this episode.
Natan: When I heard that I had a possibility to work with Ittai Zeidman, I was thrilled because from the time I got to know him with my previous role, I thought that it was the perfect combination of all at once in one package.
Natan’s been at his job for half a decade. In his first days, the company was using a common yet aging build infrastructure.
Natan: When I got to Wix and up until we actually did the switch over to the new system, the basic system was comprised of Maven as a build tool and dependency management and the build server environment was TeamCity.
The exact roles that Maven and TeamCity played in Wix’s build infrastructure are not particularly important to our story - but for the sake of clarity, I’ll just say that Maven is the automated build tool that does the compiling, testing, etc., while TeamCity is the actual server on which these processes are done.
This frees the developers from running the build processes on their own local computers - and has the added benefit that the build server can be configured to be identical to the production servers, which means there are fewer chances for bugs that stem from differences between the configuration of the developer’s own machine and that of the production server.
For a while, pairing Apache’s Maven build tool with the TeamCity continuous integration server worked very well. It allowed for a quick, continuous delivery of code to the company’s wider code base.
But then problems began to arise. What changed? Not Maven - it was the same as ever.
Natan: So when I started at Wix’s backend guild in Tel Aviv, we were like, I don’t know, 50 people, even less I think. And you can really easily talk to one another if you needed something or... the scale wasn’t that big.
As an employee of five years, Natan saw his company grow substantially. It was a good thing, but growth came with its own problems. The solutions that’d worked for years before could no longer accommodate such scale.
Natan: As more and more products were created and more and more teams were created, it means that there were more interactions, dependencies between these teams, between these products, and the dependency trees of the code got bigger and bigger and bigger. So it really was a big problem - and how to deal with all this huge dependency tree of this code.
Interviewer: What sorts of complaints were the engineers having about the old system?
Or: It went from “I just pushed my code and I need to get my deployable and my build is in the queue, stuck in the queue for hours and nothing happens”. Or something in the upstream - “My module is currently blocked because some module in the upstream that I depend on transitively is currently broken, so can anyone please fix it?”
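The second complaint, being blocked by a broken module somewhere upstream, is a property of the dependency graph itself. As a rough sketch (module names are hypothetical and this is not Wix's tooling), the set of modules blocked by one broken module can be found by walking the graph in the reverse direction:

```python
def blocked_by(broken, deps):
    """Return every module that depends, directly or transitively, on `broken`.

    deps maps a module name to the list of modules it depends on.
    """
    # Invert the graph: for each module, which modules depend on it directly?
    dependents = {}
    for module, needed in deps.items():
        for dep in needed:
            dependents.setdefault(dep, set()).add(module)

    # Walk outward from the broken module through its dependents.
    blocked, stack = set(), [broken]
    while stack:
        current = stack.pop()
        for module in dependents.get(current, ()):
            if module not in blocked:
                blocked.add(module)
                stack.append(module)
    return blocked
```

The larger and more interconnected the graph, the more modules a single upstream failure can reach, which is why the problem grew with the company.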
Ittai: We were also hitting a lot of Maven limitations. We just want to be able to work and the fact that we – our feedback is now broken, it’s something that we can’t live with. You want us to deliver fast - please help us to deliver fast.
All these problems, individually, weren’t the end of the world. Taken together, however, they made one thing clear: the build system was broken.
Ittai: So actually usually when people say the build was broken, it means that – for example, we are unable to assert that the software still works. But what we actually had was that the build server itself was just not working. So it was like the build might have been working, but no one could actually get feedback. So that’s like the worst case scenario because no one can get feedback whereas sometimes some people can get some of the feedback in.
To be clear: Maven wasn’t in itself bad software. It simply became an emblem of the growing need for the company to adapt to new solutions that fit their growing size. Average runs were taking 45 minutes. For a company that manages thousands of builds every day, this was unacceptable. Productivity was down. Developers were frustrated.
Natan: So levels of frustration build up and – well, I think the guild management understood, the server guild management understood that developers are hurting and development velocity is hurting. And the decision was made to see what kind of alternative we can find for Maven.
A change in the build systems had to be made. But finding a replacement wouldn’t be easy. Few organizations have quite so many developers working together, and separately, on so many different projects as Wix.
Ittai: I think it actually affects a great deal because at a certain size you can’t take regular off-the-shelf software. You have a lot of workflows that you want to cater to and also we work in a continuous integration mode.
Few companies have gone through an infrastructure shift on the scale we’re describing in this episode, so the number of available solutions for such a project is also quite small.
But, after testing a few of the options, one stood out above the rest.
Natan: I remember some guy started checking if Bazel can work for us.
Or: So Bazel is an open source build tool coming out from Google. It used to be an internal tool, it is an internal tool at Google called internally Blaze, they open sourced in 2014 - they open sourced the core component of it.
Bazel is the publicly available iteration of a build tool - Blaze - which was designed by Google.
Ittai: Bazel has a strong extension story that allows you to use the core but still add your own ecosystem, and I think the proof is that Bazel has many, many external plugins maintained by the community.
Or: Unlike Maven, this tool supports fine-grain build units, source dependency, correct incrementality, massive parallelism. You can parallelize the work you do locally, you can also use remote execution to have many more parallel tasks running at the same time.
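The idea behind "correct incrementality" can be sketched in a few lines of Python: give every build target a key derived from its own sources and from the keys of the targets it depends on, and rebuild only when that key changes. Everything here (the target names, the in-memory `cache`, the shape of `targets`) is an assumption made for illustration; it shows the general technique, not Bazel's actual implementation.

```python
import hashlib


def target_key(name, sources, deps, keys):
    """Hash a target's name, its source contents, and its dependencies' keys."""
    h = hashlib.sha256()
    h.update(name.encode())
    for path in sorted(sources):
        h.update(path.encode())
        h.update(sources[path].encode())
    for dep in sorted(deps):
        # Including dependency keys means a change upstream changes this key too.
        h.update(keys[dep].encode())
    return h.hexdigest()


def incremental_build(targets, cache):
    """Rebuild only the targets whose key differs from the cached one.

    targets maps name -> (sources dict, deps list) and must be listed in
    dependency order. cache maps name -> last built key and is updated in place.
    Returns the names that had to be rebuilt.
    """
    keys, rebuilt = {}, []
    for name, (sources, deps) in targets.items():
        keys[name] = target_key(name, sources, deps, keys)
        if cache.get(name) != keys[name]:
            rebuilt.append(name)  # a real tool would compile and test here
            cache[name] = keys[name]
    return rebuilt
```

With this scheme, editing one file rebuilds that target and whatever depends on it, and nothing else. Targets that do not depend on each other could also be built in parallel, which is the other property Or mentions.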
Perhaps more important than any specific feature was that Bazel was designed not just by, but for Google. So it was pre-made to be able to handle very large projects: Google-sized projects - from GMail to Google Search itself - which must be constantly updated and highly fault-tolerant.
Ittai: This tool is what powers the day-to-day of tens of thousands of engineers inside of Google. You know, building two billion lines of code, running – I don’t know, I’m guessing tens of millions, hundreds of millions of tests every day. So this is – in a lot of aspects - this is a very battle-tested tool and they have been working on it in various versions for more than a decade. So we were really excited to be able to take something that’s been distilled over a lot of time and dev years.
The decision to use Bazel was finalized. But all that was just step one. Actually implementing the new system would be much more difficult than simply choosing the platform.
Ittai: Each department uses software of other departments via versions and this has a host of other problems, because basically you get drift of versions, and tech debt, and blames between the companies, which we didn’t want to get to.
Or: Another thing is that we didn’t want to lose any correctness of our build. We didn’t want to have this trade-off of maybe just running just a subset of the test or a subset of the modules or not waiting for other modules to integrate in. We wanted to keep the build correct as well.
Getting hundreds of developers to all follow one company-wide process change would be like trying to herd all the wildlife in a jungle into one zoo. Different people on different teams in different departments in different locations around the world have different requirements.
The one thing all of the developers had in common was that they needed the freedom to temporarily continue their work on ongoing projects via the old system, even while coworkers gradually moved on to the new one.
Natan: Early on, the bigger challenge was how to do this transition over to Bazel in a sort of gradual manner while still keeping Maven and running in production.
Or: So there was a phase where our repositories had both the Maven build descriptor, the pom.xml files, and the Bazel build descriptor, the BUILD.bazel file, and it could be built by either build system and we actually had two parallel build systems.
Ittai: A big part of why we took so long is because we wanted to allow people to continue to deliver - or at least most people to deliver regularly - while we keep on adding more and more bits and basically polishing the experience. My guess is that we could have pulled the switch a lot sooner and just say, “OK, you know what? We will just have a few months of people trying to adjust.” But because of the velocity focus we really wanted to build a side-by-side solution.
This side-by-side solution was both the solution to the other problems, and its own kind of problem. Some portion of the company’s developers would move more quickly to Bazel, while others would stay behind longer. And yet, everyone’s code had to, ultimately, agree. Here’s Natan Silnitsky.
Natan: This interim period was quite hard because we still had a lot of Maven dependencies and it was kind of painful because you already also had the growing pains of working with Bazel. One of the developers at Wix compared it to changing the engines of airplanes while they’re in flight.
Ittai: I started alone in a small attic hidden away because basically I didn’t want people to find me so that I could focus on the actual project.
There were a few months when I was supposed to be on the project but I was sitting in the main offices and then people just took me and used me for help and consultations. So I just went away.
You know, I started alone for a few months, then Or joined and then a few months later another developer joined and then we grew and grew until we united with the existing CI team. And the whole group then grew some more. The guild system at Wix is very, very robust and the backend guild decided that this is part of the strategic abilities of Wix, being able to have like a really fast feedback loop. And so we put as many people as we could from other places to help with this effort. And some people returned to their companies and some people stayed because they love the challenge and the domain and the people and the group grew to something like 15, 20 people.
Even with 15 or 20 people, the transition from Maven to Bazel was a lot of complicated work. Maven and Bazel are fundamentally unlike one another.
Natan: I specifically at that point was in charge of all the third-party dependencies aspect of the migrator. Because with Bazel, the third-party dependency mechanism is completely different than the way it is with Maven, so there was the challenge of seeing how to do this translation correctly in a multi-repo environment.
The company had to migrate its entire code base from one to the other, but Bazel could not simply interpret what was designed in Maven. It would be like sending an English novel to someone who speaks Chinese - sure, both people in the interaction can understand language, but you need a mechanism for translating that language in the transition process.
Of course, translating a full novel - let alone an entire anthology, which might be a better analogy for the Wix code base - requires lots of tedious work. This is why a migrator tool was needed. Or Shachar:
Or: So it was very clear to us in a very early stage that we needed it to be done automatically.
We needed a tool to understand the relationship between the different files, the different source files in the repository and the external dependencies.
Natan: The automatic migration, migrator tool from Maven to Bazel, we decided to call it Exodus.
Exodus was the key that allowed the company to transition from Maven to Bazel.
Or: That’s Exodus. It saves you time to do stuff manually. And we tried to automate most of the work because if we needed to write this manually, it would take us - I don’t know how much.
Exodus had one simple function: to take the Wix code base as input, and output that information in the form of Bazel “build” scripts. It did so by first analyzing the Maven setup, then building a dependency graph, which it used to translate the Maven configuration into the Bazel configuration.
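As a very rough illustration of that input-to-output shape, the Python sketch below takes a hypothetical map of modules and their dependencies and renders one BUILD.bazel file per module using Bazel's `java_library` rule. The module paths, the source glob and the one-target-per-module layout are all assumptions made for the example; this is not how Exodus itself works, and it ignores third-party dependencies entirely.

```python
def to_bazel_label(module):
    """Map a hypothetical module path such as 'libs/core' to a Bazel label."""
    name = module.rsplit("/", 1)[-1]
    return f"//{module}:{name}"


def render_build_file(module, deps):
    """Render BUILD file text with a single java_library target for a module."""
    name = module.rsplit("/", 1)[-1]
    lines = [
        "java_library(",
        f'    name = "{name}",',
        # Assumes the conventional Maven source layout inside each module.
        '    srcs = glob(["src/main/java/**/*.java"]),',
    ]
    if deps:
        lines.append("    deps = [")
        lines += [f'        "{to_bazel_label(d)}",' for d in sorted(deps)]
        lines.append("    ],")
    lines.append(")")
    return "\n".join(lines) + "\n"


def migrate(maven_modules):
    """Return a map of BUILD.bazel path -> file text.

    maven_modules maps a module path to the module paths it depends on.
    """
    return {
        f"{module}/BUILD.bazel": render_build_file(module, deps)
        for module, deps in maven_modules.items()
    }
```

For example, `migrate({"libs/core": [], "services/site": ["libs/core"]})` returns two file bodies, with the second listing `//libs/core:core` under `deps`.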
Now the transition was finally in full swing. But problems still came in left, right and center. Like how, as effective as it was, Exodus wasn’t perfect. And not everybody found it easy to adapt to Bazel.
Ittai: The automatic process had limitations, right?
Natan: So they probably got like 80% of the build targets compiled pretty fast and it’s probably the same ratio for tests. But then you had the more exotic test cases, the more exotic compilation issues, stuff like using Spring XML beans that the migrator couldn’t really know about because it’s not part of the Maven dependency structure, so it was kind of blind to that. So there was more manual work in that area. So all kinds of small exotic places in the code that we needed to tackle manually.
Ittai: This wasn’t a very fun thing to do. And our hearts went out to those people, we tried to help them as much as we could, and for them it wasn’t that pain-free.
Natan: There are so many little different ways that people write code and do stuff. And it was really sometimes the Wild Wild West here with the way they created their test environments and run tests, that was quite a challenge.
Ittai: Yeah, that was a big challenge because people had – it was a big change. A lot of it again is tech debt, right? People didn’t know they had problems and most of the time, or in some instances had problems only rarely and they didn’t know about that.
It’s easy, in discussing the finer technical details of replacing build systems, to forget just how massive an undertaking this is. We’re not talking about something that occurs overnight, or over the course of a few weeks, or months.
We began when a phone call came to Ittai as he was in the hospital with his newborn baby. He spent the next few months working on the early stages of a system shift, mostly out of a small attic. The team expanded. They shopped and tested new build tools, and consulted with experts at other major companies. They chose a tool, and in order for it to work, actually developed their own application from scratch to translate their existing code base to the new format. Then they finally began moving the entire company’s code base, while still maintaining the old system so that developers could keep delivering.
Ittai: The project took – it depends on where you count, but it took something like three years, right? Like to move and put an end to our existing build tool for the entire backend developers, it took three years.
Three years! In the time it took Ittai’s team to transfer Wix to Bazel, Ittai’s newborn learned how to walk and process full sentences.
Ittai: We know of other places that it took a year, two years. It very much depends on what are the gains that you’re trying to achieve, how many developers you have, how much code you have. We have a lot of code, we have a lot of developers and we really wanted to make an investment that will pay off when Wix is five times and ten times bigger code-wise and developer-wise.
Part of what we did was not only change the build tool from Maven to Bazel. We actually built a complete new build system comprised of the build server, the deployment mechanism. Our previous solution actually made you only have one build on master at a time and so people had to wait. We were thinking, “OK, how can we build a solution that people don’t have to wait, where we don’t have any locks and people can just flow indefinitely.” This is why we said we are going for the full Monty. We’re trying to build a robust system that will take us forward.
Luckily, they went for the full Monty, and they pulled it off.
Natan: It changed everything. It’s like suddenly you get real deterministic builds across Wix which are very, very fast, much, much faster if you don’t have Maven involved.
The side-by-side, Maven-with-Bazel solution worked exactly as intended. Even with hiccups along the way, it meant that the important work Ittai and his team were doing hadn’t conflicted with the important work other employees of the company were doing elsewhere.
Ittai: One really strong feature that we had is once we started onboarding developers, we told them “you can decide where you deploy from”. You can deploy from the old system, from the new system and this is just a self-service toggle where you can toggle on and back off and people were very excited about this because it gave them the confidence to deploy whenever they needed. And if they saw problems, then they could just revert, not be blocked by the deliveries and go back to it again. We gave them a few months to do this iteration. This was really, really meaningful.
Exodus, despite its shortcomings, saved remarkable amounts of time.
Natan: So the migrator tool quite easily and successfully and quickly got the Bazel configuration output. But that part was pretty easy for most of the repositories.
In fact, it’s because Exodus was so useful that it’s now developed a second life in the wider developer community.
Natan: I spoke in quite a lot of conferences in 2019. I got a lot of reach-outs from people saying “I have 400 Maven modules, it’s really slowing me down, I’m interested”. And I did see quite a lot of people try out the migrator and we got some bug issues open which I’m glad to always help fix.
Interviewer: What lessons should we take away from your guys’ story about how developers can face serious issues of scaling and then implement company-wide infrastructure successfully?
Or: Wow. He’ll talk about it. But, no, the bottom line is - take it step by step, to not break the current system while building a new system. Think big, think outside of the box, outside of what you know. Explore, think about other companies that you want to grow into or what size you will become and do it right.
Ittai: Yeah. So I’ll agree and disagree with Or. So I think I’ll agree with Or about thinking outside the box. Talking to other companies, learning from them, while understanding that their solution is their solution and you need to find your own.
About “do it right”, I’ll say do it wrong - expect to fail a few times. A few more things: collaborate. First of all, inside the organization. If you can collaborate with people, get their buy-in, it will be much easier because a lot of times these journeys are hard. They take time. People suffer in the existing solution. If you have buy-in, people are much more inclined to wait, to help you. People actually gave their people to this effort so that it can come faster because they knew we were in the same boat together, because I gave them updates, because I was trying to be as transparent as possible. Sometimes I succeeded, sometimes I failed, but this was definitely an objective - to keep people in the loop.
Ittai Zeidman and the Wix backend engineering team succeeded, in the end, as a result of thousands of hours of dedicated research, problem solving and sheer manual labor. Had they not sought the help of those that came before - listening to the advice of engineers at Facebook, adopting Google’s build tool - it probably would’ve taken more than three years, with a lot more setbacks and difficult work.
If you’re an engineer facing a situation as monumental as this, hopefully our podcast has given you at least a little fuel to go on. Because while Ittai, Or and Natan all work for the same company, it’s not really out of a sense of duty to their managers or shareholders that they dedicated three years of their lives to this project.
It’s not because this is their job, it’s because this is what they do. These guys are engineers. They’re built to tackle hard problems and push the limits of our technology just a little bit further than it would’ve otherwise been.
That’s it for this episode, thank you for listening. The Wix Engineering Podcast is produced by PI Media. Written by Nate Nelson, produced by Guy Bin Noun and narrated and edited by me, Ran Levi. Special thanks to Moard Stern from Wix. See you again next episode, bye bye.