Comment: This article is the 1st part in a series of articles. You can find the next posts in this series here:
Auto Scaling CI Agents At Wix - Part II by Etamar Joseph Weinberg
How We Migrated to a New CI System with Zero Downtime - Part III by Shay Sofer
The importance of developer velocity cannot be overstated. We already know that. Whether it’s building & deploying a hotfix for an urgent production incident or your regular day-to-day development cycle, a CI/CD pipeline must provide a fast feedback loop.
But what happens when your company just keeps on growing and your existing build server is starting to cave in under pressure? What can you do if your users are getting frustrated, having critical builds queued for over an hour?
At Wix, we have over 2K microservice clusters in production with over 400 deployments/day. In the backend guild alone, we have around 500 developers and 8K builds/day.
Let’s meet Sarah. Sarah is an awesome developer. She’s responsible, an expert in her domain, and a very talented coder. Unfortunately, there’s a production bug that she needs to fix.
Sarah quickly troubleshoots, finds the root cause and has her fix ready in less than 15 minutes.
She opens a Pull Request, expecting a “Build Started” notification. And waits… and waits… and nothing happens.
The build is queued because the build server is not keeping up with the overall load. Slack messages are being exchanged and teams are scrambling to understand why the build is not starting.
You get the point.
The more our company grows, the bigger the problem becomes. A CI system that doesn't scale with your organization is a major failure point that must be addressed.
This is our first blog post in a series about how we built a scalable CI solution that can handle a massive number of builds.
In this post, we'll talk about building a scalable CI system using Buildkite - and how we reduced the time builds spend in queue during load from ~40 minutes to only seconds.
The next posts in the series will reveal the bits & bytes of our auto-scaler (read here), walk you through an overview of the migration process, and explain how we seamlessly migrated Wix’s backend builds to the new system. Stay tuned!
Wix’s CI System
The following is a simplified high-level architecture of our builds-triggering mechanism:
First, we are processing Github webhooks that are fired when a developer pushes code or opens a pull request.
We consume those messages in the Triggering Service and trigger one or more Bazel builds in the Build Server using its API.
(More on Bazel in “Continuous Integration on a Huge Scale Using Bazel” by Ittai Zeidman.)
We have a variety of build types that dramatically differ from each other in their importance (“Is the build in the critical path to GA?”) and burstiness (“How many builds are we triggering in bulk?”).
Pull Request builds or builds that run on the main branch are in the critical path. The feedback loop must be quick and we want those builds to run as fast as possible. Let’s refer to them as main and pr builds.
Builds that are triggered by a scheduler in the background aren’t as important, as they were not triggered by a user. Users are not following them and are not actively waiting for the outcome. Let’s refer to them as background builds.
Additionally, we have the unique capability of running additional builds on other repositories, as part of our PR checks. This guarantees that their main branch is compiling with the PR changes. This is great for finding out cross-repo breakages before they are merged. Infra developers usually choose to run those checks on all Wix repositories. Let’s refer to them as cross-repo builds. As you can imagine, those are very bursty as multiple builds are being triggered simultaneously. (More on “Cross-repo checks” in the “Virtual Monorepo For Bazel” post by Or Shachar).
So we have various build types with different characteristics. Some are not as important, and others are very, very bursty.
Pain points
Using Bazel, we reduced the overall build time by 90% (Again, a great read can be found on that topic here).
But you can’t leverage any of Bazel’s goodies if... the build does not start at all. It’s like building a spaceship that fails to launch off the ground.
Photo by Ricardo Gomez Angel on Unsplash
Imagine a scenario in which two infrastructure developers are working on two different PRs and both choose to run cross-repo checks on all of Wix, to make sure they are not breaking anyone. With around 60 repositories to check per PR, more than 120 builds are triggered simultaneously.
Add this to the regular workload of developers that are working and pushing changes.
And now, imagine that Sarah is trying to push her super-important changes to production.
Ideally, we wouldn’t want “less important” builds to interfere with Sarah’s urgent fixes. But for us, this was far from reality.
Our build server had a limit of ~200 concurrent builds and once we hit that threshold, builds just waited in a queue.
We often had builds waiting for 40-50 minutes in the Build Server's queue before they could start running.
During those incidents, our hands were pretty much tied and we could only resort to extreme measures such as turning on a killswitch to stop new builds from triggering or canceling builds manually.
So in short:
Poor scalability in our build server caused major problems for us.
Different build types require isolation and different SLAs. We had no way to achieve that.
Introducing… Buildkite!
So we came to the inevitable conclusion that we need to replace our weakest link.
The build server.
The first step was to identify what Wix needs in a build server. We needed the ability to run a massive number of builds concurrently, but we also had other requirements: dynamic pipelining, reliable triggering & notification mechanisms and UI flexibility, to name a few.
We conducted several POCs with multiple vendors and ultimately chose the one that met most of our requirements - Buildkite.
While building our new CI pipeline on top of Buildkite we ran into interesting design dilemmas:
How to represent Wix’s “Virtual Mono Repo” in Buildkite?
How can we dynamically load the steps we want to run? How can we override them for testing and debugging purposes?
How can Wix reliably trigger builds via an API on multiple build servers and consume builds notifications?
How to create the much needed isolation between the different build types Wix maintains?
How to properly scale-out/in according to our load?
After sharing this article Shay Sofer was invited to talk at Unblock Conference 2021. Here's his full talk - 6 Challenges Wix faced while building a super CI pipeline:
Challenge #1: How to represent Wix’s Virtual Mono Repo in Buildkite?
In Wix, we work in a Virtual Mono Repo environment. We currently have around 60 repositories.
Does it mean we should have a Buildkite Pipeline per Git repository? Or perhaps a single pipeline for all of the repositories?
But wait, what exactly are Buildkite Pipelines?
Quoting the docs, “A pipeline is a template of the steps you want to run. There are many types of steps, some run scripts, some define conditional logic, and others wait for user input. When you run a pipeline, a build is created”.
Multiple pipelines have the following advantages:
Buildkite leads us towards a pipeline-per-repository approach. For example, when creating a new pipeline you need to configure its Github repository URL.
Better out-of-the-box visibility. On Buildkite's main page there are success % and build duration per pipeline. So we can get that data, per repository, for free.
Easier pagination and filtering in the UI.
However, for us there were several cons to using multiple pipelines:
If the pipeline steps had to change, we would need to update all of the pipelines instead of updating just a single pipeline.
Our triggering service would need to hold a mapping of the repository to its pipeline name, also known as “slug”. For example, the repository infrastructure.git is mapped to the slug infrastructure.
Choosing multiple pipelines was the best choice for us. One pipeline per Wix Github repository. To overcome the cons we:
Created a script that can bulk-update the steps of all pipelines at once. In case we need to change the steps (that are stored in the Buildkite pipelines) for all of the repos, we can just run the script and update them all in one go.
Maintain a configuration file (that is pushed to git) that holds the repository => slug mapping.
So we have successfully represented Wix’s Virtual Mono Repo in Buildkite and have a mechanism to easily update the steps in case those will change in the future.
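To make the two workarounds more concrete, here is a minimal sketch of a repository => slug lookup and a planner for a bulk steps update. It is not our actual script: the organization slug, the second repository, and all function names are hypothetical, and the request shape reflects our reading of Buildkite's REST API for updating a pipeline, so it should be checked against the current documentation. The sketch only builds the requests and does not send them.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical values for illustration. In the post, the mapping lives in a
# configuration file that is pushed to git.
ORG_SLUG = "example-org"
REPO_TO_SLUG = {
    "infrastructure.git": "infrastructure",
    "payments.git": "payments",  # made-up repository
}


@dataclass(frozen=True)
class PipelineUpdate:
    """One planned API request that replaces a pipeline's steps."""
    method: str
    url: str
    body: dict


def slug_for(repository: str) -> str:
    """Look up the pipeline slug for a repository, failing loudly if unmapped."""
    if repository not in REPO_TO_SLUG:
        raise KeyError(f"no pipeline slug configured for {repository!r}")
    return REPO_TO_SLUG[repository]


def plan_bulk_update(steps_yaml: str) -> list[PipelineUpdate]:
    """Build one update request per pipeline, all carrying the same steps."""
    return [
        PipelineUpdate(
            method="PATCH",
            # Assumed endpoint shape; verify against Buildkite's REST API docs.
            url=f"https://api.buildkite.com/v2/organizations/{ORG_SLUG}/pipelines/{slug}",
            body={"configuration": steps_yaml},
        )
        for slug in sorted(set(REPO_TO_SLUG.values()))
    ]
```

Separating the planning from the sending makes it easy to review what is about to change across all pipelines before anything is actually updated.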
But what are the odds that those steps will change? Can we dramatically reduce them? Can we dynamically control which steps are being executed?
This is where “Dynamic Pipelining” comes into play.
Challenge #2: How to dynamically load build steps?
Since Wix has different build types - we cannot hardcode the steps for those pipelines. We need the ability to determine, at runtime, what steps will run according to the context we’re in.
We also require the ability to override the default steps. It can be quite useful for testing purposes.
So instead of hardcoding the steps in the pipelines we chose to use Dynamic Pipelining.
Each build consists of 2 steps:
Pipeline Reader step. This step dynamically understands the context of our build and loads the correct build commands to Buildkite.
We execute the commands that were loaded in step #1. The core part of it is bazel test //..., which runs a Bazel build in our working directory.
The Pipeline Reader steps are very concise in their purpose and are offloading the logic to an in-house Buildkite plugin we created.
This is to reduce the need for future changes to those steps.
The plugin downloads a very simple script that dynamically determines the build type (according to an environment variable we pass to the build) and loads the correct build commands to Buildkite using dynamic pipelining.
For example, if it wants to execute pipeline-pr.yml because we are in a Pull Request context, the script will run buildkite-agent pipeline upload pipeline-pr.yml.
This command will dynamically load the correct steps for the “Pull Request” build type, and those will be executed in the 2nd step.
Using this script, we can identify situations where we'd like to override the default steps we're using, and easily add code that loads a different pipeline, if certain conditions are met:
Using the mechanism described above, Wix was able to implement the extremely important capability of customizing builds under various conditions. We can now execute different steps for builds that are running on specific repositories, specific build types, or whether a feature toggle is enabled. We can safely test changes to pipelines and gradually roll them out.
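A minimal sketch of what such a selection script could look like is shown below. It is an illustration rather than our actual script: the environment variable names BUILD_TYPE and PIPELINE_OVERRIDE are hypothetical, and only pipeline-pr.yml is a file name taken from this post. The buildkite-agent pipeline upload command is the real one mentioned above.

```python
from __future__ import annotations

# Hypothetical mapping from build type to pipeline file. Only
# pipeline-pr.yml is named in the post; the other file names are made up.
DEFAULT_PIPELINES = {
    "pr": "pipeline-pr.yml",
    "main": "pipeline-main.yml",
    "background": "pipeline-background.yml",
    "cross-repo": "pipeline-cross-repo.yml",
}


def select_pipeline(env: dict[str, str]) -> str:
    """Pick the pipeline file for this build, honoring an explicit override."""
    # Hypothetical override hook, useful for testing a changed pipeline.
    override = env.get("PIPELINE_OVERRIDE")
    if override:
        return override
    build_type = env.get("BUILD_TYPE", "")
    if build_type not in DEFAULT_PIPELINES:
        raise ValueError(f"unknown build type: {build_type!r}")
    return DEFAULT_PIPELINES[build_type]


def upload_command(env: dict[str, str]) -> list[str]:
    """Return the agent command that loads the selected steps dynamically."""
    return ["buildkite-agent", "pipeline", "upload", select_pipeline(env)]
```

Inside a Pipeline Reader step, the returned command would be executed with the build's environment, for example via subprocess.run(upload_command(dict(os.environ)), check=True). Additional conditions (a specific repository, a feature toggle) would become extra branches in select_pipeline.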
Challenge #3: How can Wix trigger builds in multiple Build Servers?
We have our pipelines ready and waiting to trigger builds! Yet we still needed a way to trigger them - so we developed our own microservice that triggers builds in Buildkite via its API.
As previously mentioned, we had a triggering microservice that was tightly coupled to our legacy build server:
We chose to refactor the current state and ended up with 3 microservices that take part in the triggering flow:
A generic Triggering service - that decides what should be triggered by emitting a TriggerBuildCommand.
A Legacy Triggering service that triggers builds on our legacy build server.
A Buildkite Triggering service that triggers builds on Buildkite.
We gained:
Decoupling. We no longer have a single service. We now have a Triggering service that decides which build to trigger, and other services that encapsulate the specifics of each and every build server. That knowledge is hidden from our Triggering service.
Extensibility. We can easily trigger a single build in both build servers. We can also add an additional build server by creating a new service and having it consume TriggerBuildCommand as well.
We are also leveraging Greyhound, Wix’s open-sourced high-level SDK for Kafka, to retry failed API calls that otherwise would have been lost. (A great read on Greyhound can be found here)
That separation of concerns turned out to be very, very important for smooth migration between the build servers. We can now easily trigger all of the builds side-by-side on both systems, while having Buildkite builds run in dry-run mode without affecting any users.
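The shape of that decoupling can be sketched in a few lines. This is an in-memory illustration only, assuming a simple fan-out: the command fields, the BuildServerTrigger class, and the publish helper are hypothetical, and the real services communicate over Kafka rather than through direct calls.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TriggerBuildCommand:
    """What should be built; says nothing about which build server runs it."""
    repository: str
    commit: str
    build_type: str


@dataclass
class BuildServerTrigger:
    """Stands in for one per-server triggering service (hypothetical)."""
    name: str
    dry_run: bool = False
    triggered: list = field(default_factory=list)

    def handle(self, command: TriggerBuildCommand) -> None:
        # A real service would call its build server's API here. In dry-run
        # mode the build would run without reporting results to users.
        self.triggered.append((command, self.dry_run))


def publish(command: TriggerBuildCommand, consumers: list[BuildServerTrigger]) -> None:
    """Deliver the same command to every build-server-specific consumer."""
    for consumer in consumers:
        consumer.handle(command)
```

Adding another build server then means adding one more consumer, while the code that decides what to trigger stays untouched.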
Watch: How We Built and Migrated to a New Scalable CI, with Shay Sofer and Etamar Weinberg, Wix Engineering Conference 2021 -
Challenge #4: How can Wix reliably consume build results?
Great, so we’re triggering builds. How can we keep track of their results and report them to Github?
We asked ourselves two questions:
Buildkite already has a Github integration; should we use it or build our own?
If we build our own, how do we consume build notifications?
Wix has a pretty complex logic for reporting Github status checks.
Per commit, we are running multiple builds and reporting multiple Github status checks.
Some are as simple as a success or failed build result or just a link to a “history view” of builds for our commit.
Other checks are the result of a set of builds. As an example, the reported result for “Cross Repo Builds” is the aggregate result of the cross-repo builds we just triggered. We are also customizing the “Details” link in Github according to different status check types.
We decided that the answer to the first question was "no". In order to consume Buildkite's notifications for builds, we built a NotificationProcessor service. It processes them, and a downstream consumer reports to Github.
For the second question - we can use Buildkite’s Webhooks or AWS EventBridge integration.
Since we do not want to drop notifications in case our endpoint is unavailable, a retry mechanism is essential. Retries are not supported by Buildkite Webhooks, so we chose to integrate with AWS EventBridge, as it provides the extra resilience we require.
Messages in EventBridge can be easily routed to SQS/SNS/Lambda and others, if needed.
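As an illustration of the aggregated checks described above, the sketch below collapses the states of a set of builds into a single Github commit status state. The aggregation policy is an assumption made for this example, not our production logic, and the input state names ("passed", "failed", "canceled", "running") are assumed to follow the build states reported in the notifications. The output values are Github's commit status states.

```python
from __future__ import annotations


def aggregate_status(build_states: list[str]) -> str:
    """Collapse many build states into one Github commit status state."""
    if not build_states:
        # Nothing has reported yet, so the check is still pending.
        return "pending"
    # Assumed policy: any failed or canceled build fails the whole check,
    # even if other builds in the set are still running.
    if any(state in ("failed", "canceled") for state in build_states):
        return "failure"
    if all(state == "passed" for state in build_states):
        return "success"
    # Some builds are still scheduled or running.
    return "pending"
```

For a cross-repo check, the processor would call this each time a notification for one of the triggered builds arrives and report the result under a single status check.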
Challenge #5: How to achieve isolation between Wix’s different build types?
As we mentioned earlier, Wix has a variety of build types.
Some are very important and should be triggered ASAP, while others might not be on the critical path. How can we achieve isolation and prevent situations where a build storm in “less important” builds slows everyone down?
This is where Buildkite really shines for us.
Buildkite is a “Bring your own infrastructure” build server and it also provides the ability to create different agent pools.
We realized that we have:
Build types that are on the critical path and should have minimal queue time (main, pr)
Build types that were triggered by a background job rather than a user, so we can tolerate a longer queue time (background)
Build types that are very bursty (cross-repo)
So we ended up creating the following queues:
main-queue
pr-queue
cross-repo-queue
background-queue
We’re pinning each build to its dedicated pool using the agents property.
For example:
agents:
  queue: background-queue
We now have complete separation between the build types. It essentially means that we are guaranteeing isolation and preventing builds from slowing down our critical path.
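The routing itself can be sketched as a small lookup that stamps the agents property onto a step. This is a hypothetical helper for illustration, assuming the four build types and queue names listed above; the step is built as a plain dictionary with the same command and agents keys used in pipeline files.

```python
from __future__ import annotations

# Queue names as listed in the post; the build type keys are our shorthand.
QUEUE_BY_BUILD_TYPE = {
    "main": "main-queue",
    "pr": "pr-queue",
    "cross-repo": "cross-repo-queue",
    "background": "background-queue",
}


def command_step(build_type: str, command: str) -> dict:
    """Build a command step pinned to the queue of its build type."""
    if build_type not in QUEUE_BY_BUILD_TYPE:
        raise ValueError(f"unknown build type: {build_type!r}")
    return {
        "command": command,
        "agents": {"queue": QUEUE_BY_BUILD_TYPE[build_type]},
    }
```

Because an unknown build type raises instead of falling back to a default queue, a misconfigured build cannot silently land in a critical-path pool.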
Do you remember Sarah from the beginning of this post? She can now create a Pull Request for her production hotfix (pr-queue) and merge it to the main branch (main-queue), even while hundreds of other cross-repo builds are running (cross-repo-queue).
This solves a major problem for us. But there’s one more thing to address. How many agents are running and how do we scale them out?
Challenge #6: Building our very own Wix autoscaler?
We created isolation between different builds - but how do we scale automatically according to the load?
We chose a Kubernetes container-based solution with KEDA (Kubernetes Event Driven AutoScaling).
Builds are running on Buildkite agents, which are actually Kubernetes pods.
With KEDA, we can scale agents automatically based on load, or based on the time of day.
For example, for the very important main-queue, we make sure to have plenty of idle (pre-warmed) agents to guarantee that queue time is minimal.
For the background-queue, we can keep a minimal number of idle agents. That queue is optimized for cost and KEDA will scale out when needed. The tradeoff is a longer queue time for those builds, as it takes 50-60 seconds or sometimes even minutes to spawn new pods that are ready to run builds.
In order to further reduce costs, we drastically reduce the number of idle agents in all queues during off-hours.
It’s also important to have proper monitoring on the # of total agents, # of idle agents and # of busy agents. All of those are provided out of the box by buildkite-agent-metrics.
Here you can clearly see how many idle agents we have in each queue, how we scale-out under load (The green “Total agents” line is always above the yellow “Busy agents” line) and how the # of agents is reduced during off-hours in order to optimize cost.
All of those parameters are easily configurable, giving us the flexibility to decide for every queue:
Should we optimize for low queue time? (large buffer of idle agents. KEDA will continue to scale-out as needed)
Or should we optimize for cost and as a result builds will spend more time in the queue? (no buffer of idle agents, scaling-out will happen on demand)
This is extremely powerful.
We are no longer capped to a specific number of concurrent builds.
We have full control over the level of concurrency by configuring how many total pods are spawned.
We have full control over how fast builds will start, by routing them to queues and configuring the buffer of idle agents each queue will have.
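The tradeoff between queue time and cost can be expressed as a tiny target calculation. The sketch below is a simplification under stated assumptions and is not KEDA's algorithm or our autoscaler's code: the QueuePolicy fields, the desired_agents function, and the numbers used with it are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QueuePolicy:
    """Per-queue knobs (hypothetical names)."""
    idle_buffer: int            # pre-warmed idle agents during work hours
    off_hours_idle_buffer: int  # smaller buffer to save cost off-hours
    max_agents: int             # upper bound on total pods for the queue


def desired_agents(busy: int, waiting_jobs: int, policy: QueuePolicy,
                   off_hours: bool = False) -> int:
    """Target agent count: current demand plus the configured idle buffer."""
    buffer = policy.off_hours_idle_buffer if off_hours else policy.idle_buffer
    target = busy + waiting_jobs + buffer
    # Clamp so that a burst can never exceed the configured ceiling.
    return max(0, min(target, policy.max_agents))
```

A queue optimized for low queue time would use a large idle_buffer, while a cost-optimized queue would set it to zero and accept the pod start-up delay when demand arrives.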
Show me the numbers!
The migration was only completed a few weeks ago, but we have already seen an amazing improvement.
Under load, queues that are optimized for fast queue-time went from ~40 minutes of queue time to a low, consistent p90 of 10s.
As the different build types are isolated, a spike of background/cross-repo builds does not disrupt the mainstream build flow.
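For readers less familiar with percentiles: a p90 of 10s means that 90% of builds waited 10 seconds or less in the queue. A minimal sketch of the nearest-rank method is shown below; it is one common definition among several, and monitoring tools may interpolate differently. The function name is ours, and any sample values passed to it are up to the caller.

```python
from __future__ import annotations


def p90(samples: list[float]) -> float:
    """Nearest-rank 90th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("p90 of an empty sample set is undefined")
    ordered = sorted(samples)
    # Integer form of ceil(0.9 * n), avoiding floating-point rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]
```

Tracking a high percentile rather than the average matters here, because a handful of builds stuck in the queue is exactly what an average would hide.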
Here’s a graph of main-queue’s queue-time during work hours:
(*Note that off-hours builds are hidden from the graph. Obviously, during off-hours we optimize for cost so queue-time is a bit higher)
Summary
In this post we discussed how we built a scalable CI solution on top of Buildkite and went over some design dilemmas and how we:
Represented Wix Virtual Mono Repo in Buildkite.
Leveraged Dynamic Pipelining to our advantage, allowing us the flexibility to dynamically load pipelines and customize builds for testing and rolling out changes.
Developed a resilient mechanism to trigger builds and consume notifications.
Created isolation between Wix’s different build types using Buildkite’s queues.
Built an in-house autoscaler using K8S and KEDA.
Reduced the overall time builds spent in queue (under load) from 40 minutes to a few seconds.
Be sure to stay tuned for the next posts in the series where we'll dive into the bits and bytes of the autoscaler and how we actually performed the migration.
This article is Part I of the series. Part II, "Auto Scaling CI Agents At Wix" by Etamar Joseph Weinberg, can be read here.
This post was written by Shay Sofer
You can follow him on Twitter
Thanks Shay for the detailed blog post. Very useful information.
With your BuildKite agents, are they maintaining state between builds? Meaning, do they keep the Bazel repository cache and local disk cache around, or are they wiped clean at the start of each build?