Backend Engineering Lead, Ittai Zeidman (@ittaiz), provides a sneak peek into how the build process is handled at Wix Engineering, where the already massive scale of production updates keeps growing by the day.
Ittai reveals the CI challenges we faced and how switching to the right build tool may help to increase CI performance by 90%. The quick preview of the auto-migration process and of the open-source tool that powers it (Exodus - GitHub, blog post) is the cherry on the cake.
What is CI and what was the motivation for building a next-gen one?
CI stands for Continuous Integration, and it ensures that our build system is fully equipped to compile the code, run the tests and successfully follow the build process all the way through to production. Make, Ant, Maven, Gradle, and Bazel are a few well-known examples of build tools, while Jenkins, TeamCity, and Travis are well-known examples of build servers.
Picking the right tools, servers and methodologies has direct implications on productivity and quality, and it has to take into consideration the requirements of multiple teams from all across Wix Engineering.
The motivation for a next-gen CI system came when engineers started reporting pain caused by drawbacks of the existing system. Most complaints revolved around insufficient feedback from the system and long build times, as there were often lots of dependencies between the changes. Before a change could be processed, the system had to handle all the preceding changes first, which increased the waiting time. This meant that it often took hours, and in bad cases 1-2 days, for a developer to get from commit to production, given broken dependencies and “build storms” caused by a lot of activity.
One of the all-time low points of the system was when we upgraded our build server and it went haywire, halting our development for around 2 days. I remember this vividly because I was called from the hospital to an urgent crisis room with the VP R&D after my wife gave birth to our youngest daughter.
Were those problems specific to Wix Engineering?
While these drawbacks might not be an issue for smaller organizations, they are significantly amplified at scale. Wix builds many millions of lines of code every day and runs tens of thousands of builds with millions of tests for its close to a thousand engineers.
Wix’s business grows rapidly, and R&D in general and CI specifically need to match that growth and more. This meant we needed a faster track to production, and we were aiming to reduce build time by 90% to accommodate that.
90% sounds staggering. What caused the existing system to underperform?
We were using Maven, which wasn’t built for continuous integration at scale. It only supported CI for small multi-module projects, and it was neither incremental nor really parallel.
Over the years we have built many systems to help us with the scale (code, modules, developers) and help the build system handle it.
One example is a system that was responsible for automatically creating, updating and deleting builds in the build server whenever developers added modules or made changes to them. This was needed so that the granularity of CI would be as fine-grained as the build tool allowed (as opposed to building the whole repository in a single build).
Another example is a system to detect dependencies between modules, including finding shortest paths, and then dynamically and in real time updating the dependency chain in the build server to ensure correct build triggering.
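To make the triggering idea concrete, here is a minimal sketch of how a change to one module can be turned into an ordered list of builds. It is not the system we ran at Wix; the function names and the example module names are hypothetical, and the only assumption is that each module declares the modules it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def affected_modules(depends_on, changed):
    """Return the changed module plus everything that depends on it, directly or transitively."""
    # Invert the edges: for each module, which modules depend on it directly?
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Walk downstream from the changed module.
    affected = {changed}
    stack = [changed]
    while stack:
        current = stack.pop()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    return affected


def trigger_order(depends_on, changed):
    """Order the affected modules so that each one is built after its dependencies."""
    affected = affected_modules(depends_on, changed)
    # Keep only edges between affected modules; unaffected dependencies need no rebuild.
    subgraph = {
        module: [dep for dep in depends_on.get(module, ()) if dep in affected]
        for module in affected
    }
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical modules: "app" depends on "service" and "utils", which both depend on "core".
example = {
    "app": ["service", "utils"],
    "service": ["core"],
    "utils": ["core"],
    "core": [],
    "docs": [],
}
order = trigger_order(example, "core")
```

In this example a change to `core` triggers `core` first and `app` last, while `docs` is left alone because nothing connects it to the change.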
Similar tools like Gradle and sbt proved somewhat more incremental and parallel, but they too could not support thousands of developers and a plenitude of modules and microservices. Another part of the equation is that we strive for full continuous integration, so all of our intra-Wix dependencies have no version and work like a logical monorepo. Tying scale, CI and many repositories together is a big challenge.
How did you resolve this?
We explored the practices used by large-scale companies like Google, Facebook, and Twitter, and noticed that all of them employed a similar concept: parallelize whatever you can and use aggressive and correct caching. Combining those with remote execution on hundreds of cores provides outstanding throughput and a significant acceleration compared to alternative methods.
During that process, we came to know Bazel, Google’s open-source build tool released four years ago, which has been used internally by Google for over a decade. About 40,000 engineers inside Google work with Bazel on 2 billion lines of code every day, and from our perspective, that was a remarkable demonstration of field-proven scalability.
What is the secret sauce that makes Bazel different from the rest?
Bazel is unique in being both fast and correct, as there is often a tradeoff between the two when handling CI. When I say correct, I mean that you can count on Bazel to rebuild whatever code was changed, directly and transitively, and ensure that the incremental output is something you can ship to production without a worry. With such a tool in your arsenal you can forget about full, clean builds and see tremendous performance wins over time.
Bazel does this by tracking the inputs and outputs of modules with declared effects (similar to a pure function), and whenever the same input data is identified, it fetches the relevant output entry from the cache to avoid repetitive processing. This tracking is possible because Bazel defines simple, low-level, language-agnostic constructs which mainly relate to the inputs and outputs of an action (compilation, code generation, test running, etc).
For example, when we need to compile some Java sources, the inputs to the javac compilation action are the sources, the jars of the dependencies on the compile classpath, the javac tool that is running the action and the various arguments passed to it, since a change in any of those can cause the action to return different outputs.
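The idea can be sketched in a few lines of Python. This is an illustration of input-keyed caching in general, not Bazel's implementation: `action_key` and `ActionCache` are hypothetical names, the tool is represented by a plain version string (a real system would also fingerprint the tool binary itself), and input files are passed as a dict of path to bytes.

```python
import hashlib


def action_key(tool, args, input_files):
    """Digest everything that can influence the action's output.

    tool        -- identifier of the tool, e.g. a version string (assumption)
    args        -- list of command-line arguments
    input_files -- dict mapping path -> file content as bytes
    """
    digest = hashlib.sha256()
    digest.update(tool.encode())
    for arg in args:
        digest.update(b"\0arg:" + arg.encode())
    # Sort the paths so the key does not depend on dict ordering.
    for path in sorted(input_files):
        digest.update(b"\0file:" + path.encode() + b"\0")
        digest.update(hashlib.sha256(input_files[path]).digest())
    return digest.hexdigest()


class ActionCache:
    """Runs an action only when no output is stored for the same inputs."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0  # how many times an action really ran

    def run(self, tool, args, input_files, execute):
        key = action_key(tool, args, input_files)
        if key not in self._outputs:
            self.executions += 1
            self._outputs[key] = execute(input_files)
        return self._outputs[key]
```

Calling `run` twice with identical sources, arguments and tool executes the action once; changing any one of them produces a different key and a fresh execution, which is the property that makes the cached output safe to reuse.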
On top of caching, and thanks to the strict requirements on the action graph we described above, Bazel knows exactly what depends on what and so can parallelize everything else. It’s important to clarify that Bazel itself doesn’t try to guess what depends on what; rather, the user needs to define the dependencies between modules (targets in Bazel’s semantics).
This is an advantage since Bazel is language agnostic, but it also places a burden on the end user. There are many initiatives in the ecosystem that try to ease this by building language- or stack-specific tools that automatically define the minimal dependencies and reduce the cost for the user.
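A small sketch shows why an explicit dependency graph is enough to parallelize safely. It is a toy scheduler, not how Bazel schedules actions: `build_in_parallel` and the `build` callback are hypothetical, and threads on one machine stand in for real workers.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_in_parallel(deps, build, max_workers=4):
    """Build every target, starting each one as soon as all of its dependencies are done.

    deps  -- dict mapping target -> collection of targets it depends on
    build -- callable invoked with a target name (hypothetical build step)
    Returns the targets in the order they finished.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> target
        while sorter.is_active():
            # Everything returned here has no unfinished dependencies,
            # so these targets can run side by side.
            for target in sorter.get_ready():
                running[pool.submit(build, target)] = target
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise a build failure, if any
                target = running.pop(future)
                finished.append(target)
                sorter.done(target)  # may unlock dependents
    return finished
```

With targets shaped like a diamond (`app` depends on `service` and `utils`, which both depend on `core`), `service` and `utils` are submitted together once `core` finishes, and `app` waits for both.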
Finally, Bazel doesn’t stop at parallelizing actions on the host machine but also supports remote execution, which means the machine, be it a CI worker or a dev machine, can use hundreds of other workers to get the job done faster and cheaper.
Combining caching, parallelism and remote execution provides impressive results.
Were there any striking challenges during the migration to Bazel?
There were quite a few challenges during that process, two of which were more substantial. The first was that we had to advance and extend the Bazel ecosystem, as it had some blind spots in areas that were not that relevant for Google. Bazel was developed and honed for Google’s needs, like having a monorepo and having no external binary dependencies (AFAIK all external dependencies are source versioned into google3).
The Bazel team did a lot of work to prepare and adjust Bazel for needs beyond Google’s, but there is a limit to how far one can go without feeling the day-to-day pains. This meant that we needed to iterate with the community on issues like handling many repositories, managing lots of external binary dependencies, and dealing with JVM interop.
The second challenge came once we could write a new project in Bazel: we then needed to work out how to move millions of lines of code and hundreds of developers to Bazel while integrating the new system with the existing one, which, as we mentioned, was quite complicated, with ~10 microservices of its own.
On the one hand, it was apparent that we couldn’t ask hundreds of engineers to stop their work for a couple of months until the integration was done.
On the other hand, having only part of the team move to a forked branch of their code with Bazel in it for testing involved a high risk of merge issues when going back to one main branch, as the two would likely be far apart at that point. We’ve all seen cases of teams forking out to a “some-system-ng” branch and then having to play catch-up with the “mainline” branch.
To overcome this challenge, we implemented a dedicated tool called Exodus (GitHub, blog post) that automates the migration from Maven to Bazel. The tool exports the Maven environment into the much finer granularity that Bazel supports, so we could harness its parallel graphs and superior caching. This conversion ensures that we don’t end up with a Bazel environment that actually behaves like Maven because of insufficient granularity. We realized it was quite unique when we presented the solution at the last Bazel conference and got a lot of requests to share it with the community.
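As a rough illustration of what finer granularity means, the sketch below splits the sources of one Maven-style module into one candidate target per package directory. This is a hypothetical simplification and not Exodus's actual algorithm; `targets_by_package` and the example paths are invented for the illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def targets_by_package(source_files):
    """Group source files by their directory, yielding one candidate target per package."""
    targets = defaultdict(list)
    for source in source_files:
        targets[str(PurePosixPath(source).parent)].append(source)
    # Sort for stable, reproducible output.
    return {package: sorted(files) for package, files in sorted(targets.items())}


# Hypothetical module with two packages.
module_sources = [
    "billing/src/main/java/com/example/invoice/Invoice.java",
    "billing/src/main/java/com/example/invoice/InvoiceService.java",
    "billing/src/main/java/com/example/tax/TaxCalculator.java",
]
targets = targets_by_package(module_sources)
```

Here a single module becomes two targets, so a change to `TaxCalculator.java` no longer invalidates the cached output of the `invoice` package, and the two packages can be built in parallel if neither depends on the other.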
Where are we now with our Bazel usage?
All of our backend developers are using Bazel as their source of truth, and we’re gearing up to shut down Maven. We’re very pleased with many aspects of Bazel, and developers now report very fast CI iterations. Current IDE support, however, is very challenging, and we’re hard at work figuring out how we can make a big impact there for Wix and probably for the community in general. Stay tuned!
Share with us a little about your background
Programming has always been a great passion of mine, and I’ve been doing it professionally for the past thirteen years. I’ve been with Wix for over six years now, and, for the past two and a half years as a Backend Engineering Lead, I’ve been working on the next-generation CI system. Other than that, my partner and I are the proud parents of three amazing daughters.