Auto Scaling CI Agents At Wix

Wix Engineering
Feb 10, 2022

Updated: Jul 12, 2022

Comment: This article is the 2nd part in a series of articles.

This is the 1st part - 6 Challenges We Faced While Building a Super CI Pipeline - by Shay Sofer
And this is the 3rd part - How We Migrated to a New CI System with Zero Downtime - by Shay Sofer

In 1913, Henry Ford and his gang installed the first moving assembly line for the mass production of an entire automobile. A century later, software companies are investing their top resources in order to improve their “assembly line” - to test and build their code. In order to deliver products to users rapidly and with confidence, adoption of automated testing and integration patterns is needed.

At Wix, we are paying a lot of attention to the process our code goes through “from commit to production”. We run over 2000 microservices and deploy about 600 new versions each day. In order to deliver quality products to our customers at such a pace, we implemented an automated Continuous Integration workflow for each artifact.

Photo by Andrea Charlesta in Adobe Stock

Step 1: Change Infrastructure to enable Scale & Configuration

Wix has been practicing CI/CD since 2011. The system we had built back then served us for many years, but as the company continued to grow, scaling problems began to appear. One of the main pain points with our previous CI was the infrastructure of the platform. Running 10,000 jobs every day can be very challenging - and expensive! We had multiple types of build agents connected to 3 different vendors of build servers with no auto-scaling capabilities and with very limited customization options for our engineers. Not only that, agents were stateful for up to 24 hours. That meant that a CI job could affect the agent environment, and potentially, other jobs that would run on it.

We decided to redesign our CI platform infrastructure in order to solve those issues. Focusing on keeping our agents’ environment “clean” so that builds are more reproducible and hermetic, we now run them in containers.

We use Kubernetes in order to schedule and manage the lifecycle of those containers by using Jobs. Each Job creates one Pod that runs a CI agent. Every build starts on its own “fresh” Pod with all of the necessary dependencies in it. When the CI job is done, the agent Pod gets terminated, and the Kubernetes Job is marked as completed successfully.

As discussed earlier, running thousands of CI jobs each day can be very expensive. To optimize the cost, we defined an auto-scaling mechanism that lets us scale out and in based on the load on the CI platform.

Step 2: Automate Scaling

We auto-scale the number of CI agents we run (Kubernetes Jobs) using KEDA, which its docs describe as a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events that need to be processed.

How KEDA Works

KEDA creates Kubernetes Jobs based on a query it runs against Prometheus at a fixed polling interval, using the Prometheus scaler and the ScaledJob custom resource definition. We collect relevant data from our build servers (like ‘number of builds in queue’ and ‘number of running builds’) and store it in Prometheus. KEDA is configured to query Prometheus and create a number of Jobs based on the result of the specified query.

Scale out is performed exclusively by KEDA. It creates the number of Jobs needed to match the required agent capacity. We implemented the scale-in mechanism by starting each Job with a “max idle time”. If a Kubernetes Job has started and its agent has not been assigned a CI job within that time, it is stopped gracefully. If the agent does get assigned a CI job, it is terminated when the CI job finishes.
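The scale-in rule above can be sketched as a small agent loop. This is only an illustration under assumptions, not our actual agent code: `run_agent` and `poll_for_job` are hypothetical names, and the clock and sleep functions are injectable so the behaviour can be checked without waiting in real time.

```python
import time
from typing import Callable, Optional

# A CI job is modelled as a plain callable (assumption for this sketch).
CiJob = Callable[[], None]


def run_agent(
    poll_for_job: Callable[[], Optional[CiJob]],
    max_idle_seconds: float,
    poll_interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run at most one CI job, or give up after max_idle_seconds without one."""
    deadline = clock() + max_idle_seconds
    while clock() < deadline:
        job = poll_for_job()
        if job is not None:
            # One job per agent: run it and exit so the Pod can terminate.
            job()
            return "completed"
        sleep(poll_interval_seconds)
    # No job arrived in time: exit cleanly so the Kubernetes Job completes.
    return "idle-timeout"
```

Either return value corresponds to the process exiting normally, so in both cases the Kubernetes Job finishes and its capacity is released.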

Here’s an example of a ScaledJob yaml:
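A minimal sketch of what such a manifest could contain, built here as a Python dictionary. The field names follow the KEDA ScaledJob resource and its Prometheus trigger; every value (names, image, address, query, limits) is a placeholder assumption rather than our production configuration, and `scaled_job_manifest` is a hypothetical helper.

```python
import json


def scaled_job_manifest(
    queue: str,
    agent_image: str,
    prometheus_url: str,
    query: str,
    max_agents: int,
) -> dict:
    """Build a ScaledJob manifest for one CI queue (all values are illustrative)."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledJob",
        "metadata": {"name": f"ci-agent-{queue}"},
        "spec": {
            "pollingInterval": 30,          # seconds between Prometheus queries
            "maxReplicaCount": max_agents,  # upper bound on concurrent Jobs
            "jobTargetRef": {
                "backoffLimit": 0,          # do not retry a failed agent Pod
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [{"name": "agent", "image": agent_image}],
                    }
                },
            },
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        # A threshold of 1 asks for one Job per unit returned
                        # by the query. Older KEDA releases may also require
                        # a metricName entry here.
                        "threshold": "1",
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = scaled_job_manifest(
        queue="default",
        agent_image="example.registry/ci-agent:latest",
        prometheus_url="http://prometheus.example:9090",
        query="ceil(builds_queued + builds_running)",
        max_agents=100,
    )
    # JSON is valid YAML, so this output can be applied to a cluster as-is.
    print(json.dumps(manifest, indent=2))
```

Because the queue name, query and image are parameters, the same helper can emit one ScaledJob per queue with different scaling behaviour.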

Step 3: One CI server to queue them all

Our “next generation” CI server is Buildkite, a platform for running fast, secure, and scalable continuous integration pipelines on your own infrastructure. One of the key advantages of Buildkite is that it enables the creation of multiple queues for different CI job configurations. We create a unique ScaledJob (auto scaler) for each queue in Buildkite.

That way, we can set the auto-scaling query and the Kubernetes Job spec differently for each queue. This architecture is aligned with our R&D demand to set different SLOs for different CI builds. For instance, you can decide that master builds shouldn't be treated the same as pull request builds or bug-fix builds, and so on. Different projects in the business might also have different requirements.

The baseline for all of our auto scaling queries is the same:

ceil((builds_queued + builds_running) * X) + (Y and ON() (hour() > 6 < 23) OR Z)

X = the multiplier applied to the sum of the queued builds and the running builds (we suggest starting with 1.15 and tuning it based on queue time monitoring).

Y = a constant we add during working hours.

Z = a constant we add during off hours.

Y and Z can differ according to your needs and load patterns. We noticed that CI operations tend to burst at specific times of day, so we scale differently during working hours and off hours using the PromQL function hour(), which returns the hour of the day in UTC.
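To make the arithmetic concrete, here is a small Python sketch of the same calculation. It follows the definition of X as a multiplier on the sum of queued and running builds, and reads `hour() > 6 < 23` as "the UTC hour is strictly between 6 and 23". The function name and the sample numbers are illustrative assumptions only.

```python
import math


def desired_agents(
    builds_queued: int,
    builds_running: int,
    x: float,
    y: int,
    z: int,
    utc_hour: int,
) -> int:
    """Number of agents the baseline query would ask for at a given UTC hour."""
    # Headroom: scale the current demand by the multiplier X and round up.
    base = math.ceil((builds_queued + builds_running) * x)
    # Working hours add Y; every other hour adds Z.
    working_hours = 6 < utc_hour < 23
    return base + (y if working_hours else z)


if __name__ == "__main__":
    # 10 queued + 20 running with X = 1.15 gives a base of 35 agents.
    print(desired_agents(10, 20, x=1.15, y=5, z=1, utc_hour=10))  # 40
    print(desired_agents(10, 20, x=1.15, y=5, z=1, utc_hour=2))   # 36
```

With these sample values the buffer is five extra agents during the day and one at night, on top of the 15% headroom that X provides.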

Generic architecture

Most CI servers let you connect agents via push.

This behaviour allows us to keep the backbone of the architecture relevant while using multiple different CI servers, or if we switch providers. The pieces of the puzzle that change between different use cases are the agent image and the server that Prometheus scrapes in order to collect relevant metrics.

Step 4: Watch what Happens

Develop > Test > Release > Monitor

Even with all of the right building blocks in place, there’s no guarantee that the CI infrastructure will be efficient. To complete the picture, monitoring is mandatory.

Focus on developing the right “knobs” so that you can make the right tweaks quickly, based on data. There are many parameters to measure: time spent in queue, the auto-scaling formula, agent machine resource utilization, and so on.
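As one example of turning a measurement into a knob, the sketch below computes the 95th percentile of time spent in queue and nudges the scaling multiplier accordingly. It is a hypothetical heuristic for illustration: the function names, the step size, and the "half the target" rule are all assumptions, not a description of how we tune our queues.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_multiplier(
    current_x: float,
    queue_wait_seconds: Sequence[float],
    target_p95_seconds: float,
    step: float = 0.05,
) -> float:
    """Suggest a new multiplier based on observed queue times."""
    p95 = percentile(queue_wait_seconds, 95)
    if p95 > target_p95_seconds:
        # Builds wait too long: add headroom.
        return round(current_x + step, 2)
    if p95 < target_p95_seconds / 2:
        # Plenty of slack: trim headroom, but never go below 1.0.
        return max(1.0, round(current_x - step, 2))
    return current_x
```

A rule like this is only a starting point. The value of the measurement is that any change to the multiplier can be checked against queue time and cost afterwards.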

With the right data in place, it is possible to reach the sweet spot between cost and performance. Not only that, this kind of transparency will make the relationship of your team with the financial team much better and more productive.

Summary

In this short post we discussed what we did to make our CI infrastructure more configurable, scalable, and much more automated:

Identified bottlenecks and missing features on our legacy architecture.
Redesigned the Infrastructure to enable scale & configurability.
Used awesome open source projects in order to implement generic automated scaling: K8S, KEDA and Prometheus.
Described a multi-queue architecture from the infrastructure perspective while using Buildkite as a SaaS build server.
Designed a monitoring solution in order to make the right decisions and be cost-effective.

This article is the second part of the series that began with "6 Challenges We Faced While Building a Super CI Pipeline" by Shay Sofer.

This post was written by Etamar Joseph Weinberg
