Making Order in CI/CD Mess

Updated: Apr 12, 2020

In this post I will describe the motivation that led us at Wix Engineering to develop a CI portal that would truly facilitate developers’ independent work in production.

I guess many CI teams out there are contemplating these kinds of questions, and perhaps will benefit from our experience.



Nine years ago, when we started working on a CI solution for Wix, we had 5 backend monoliths, one Flash project, and around 50 developers. All 5 monoliths were built on a single TeamCity server (we’ll call it ‘CI TC’ from now on).


Deployment to production was a “simple” manual operation of downloading an artifact from TeamCity, FTPing it to production, and crossing our fingers. This clearly didn’t work well and caused instability in production.


The obvious next step was to build a state-of-the-art CI solution that would provide stability and clarity at each step of the process, from VCS to production.



The birth of Lifecycle


CI is a modern production-line operation that involves several entities. It starts with a build server that holds many build configurations. Each build configuration points to source code that produces one or more artifacts. These artifacts are stored in a binary repository system, from which they can later be consumed by other builds (if they are libraries) or deployed to production (if they are deployable artifacts).


Other entities are involved in the CI process as well: the developers who wrote the code, the physical servers where the artifacts are installed in production, the monitoring system for each artifact running in production, and more.

Connecting the data from all these entities and providing a coherent picture of build status is not a byproduct of the process. It is something that needs to be stitched together, and that’s what Lifecycle is designed to do.




Wix build process in a nutshell


Different organizations use different terms and approaches for their CI and release processes. Before diving into Lifecycle, here’s a short description of the Wix CI and release process, which will help in understanding the Lifecycle implementation.


An artifact version at Wix can be in one of the following stages:


  • SNAPSHOT (development)

  • RC - Release candidate

  • GA - General Availability (prod)


Snapshot:

A commit to the master branch triggers a build in dev CI. A successful build produces SNAPSHOT version artifacts that are stored in the Wix binary repository (Artifactory).


RC:

For a version to be ready for deployment, it first needs to be promoted from Snapshot to RC. In practice, this operation copies the latest snapshot version from the SNAPSHOT repository in Artifactory to the RC repository, and its version is updated from X.Y.Z-SNAPSHOT to a unique RC version.


GA:

Once an RC version is ready, the developer can deploy it to production by updating production to the new RC version.
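
To make the three stages concrete, here is a minimal Python sketch of the promotion path. It is only an illustration, not the actual Wix implementation: the names `Stage`, `NEXT_STAGE`, and `promote_to_rc` are made up for this example, and the `X.Y.Z-RC<n>` format is an assumption, since all we said above is that the RC version is unique.

```python
from enum import Enum


class Stage(Enum):
    SNAPSHOT = "SNAPSHOT"  # development build
    RC = "RC"              # release candidate
    GA = "GA"              # general availability (production)


# Allowed promotions: SNAPSHOT -> RC -> GA.
NEXT_STAGE = {Stage.SNAPSHOT: Stage.RC, Stage.RC: Stage.GA}


def promote_to_rc(snapshot_version: str, rc_number: int) -> str:
    """Turn 'X.Y.Z-SNAPSHOT' into a unique RC version string.

    The 'X.Y.Z-RC<n>' naming is hypothetical; any scheme that yields a
    unique version per promotion would serve the same purpose.
    """
    suffix = "-SNAPSHOT"
    if not snapshot_version.endswith(suffix):
        raise ValueError(f"not a snapshot version: {snapshot_version}")
    base = snapshot_version[: -len(suffix)]
    return f"{base}-RC{rc_number}"
```

For example, `promote_to_rc("1.4.0-SNAPSHOT", 7)` yields `"1.4.0-RC7"` under this assumed naming scheme.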



Lifecycle version 1



In the first version of Lifecycle, a developer could review this information:

  • A list of the builds configured in the build server

  • The build’s latest version in each phase (Dev, RC, and GA)

  • The last 20 global production events (RC / GA / Rollback)

  • The last 20 production events of a selected build (RC / GA / Rollback)

  • A New Relic performance graph of a selected build


It also included Deep Links to:

  • Each build in the build server

  • Latest artifact version in Artifactory

  • Service dashboard in New Relic


And had the following Build Actions:

  • Trigger RC

  • Trigger GA/Rollback


This worked well for our release operation: it was easy to trace what had changed in production and to roll back if the new version didn’t behave as expected. Nothing in production was done manually. Since we kept records for each operation, it was relatively easy to correlate changes in production with incidents in production.


As Wix kept growing, we had to provide a solution that supported a much larger number of developers and builds.


A CI system at a large scale has different requirements. In the beginning we could manage all the builds manually, but as we grew we had to make the system more self-service, with wider functionality and better ease of use.


That was a great opportunity to extend Lifecycle and add all the features needed to fit our growing, versatile organization.


Here’s a screenshot of the current version of ‘Lifecycle’:




Lifecycle - under the hood


Lifecycle is actually built from several microservices that use Kafka and RPC to communicate.



Services description


1. Git Event Listener - Produces Kafka messages for every commit pushed to any of the Wix repositories. These messages can be consumed by any service that wants to be notified of such events (not only Lifecycle services).

2. TeamCity - We extended TeamCity’s capabilities by adding a plugin that produces a Kafka event whenever a build starts / completes / fails.

3. Repo Descriptor Service - Since we use several build tools and technologies, we created a general entity called ‘Artifact’ to reflect the data we need:


  • Artifact Id.

  • Build tool (e.g., Maven, NPM).

  • Programming language (e.g., JS, Python, Scala).

  • Contributors.

  • Dependencies.

  • CI config settings - an internal CI configuration file used by Wix developers as an extension to base build configuration (not in the scope of this post).


This service consumes the messages produced by ‘Git Event Listener’ and refreshes the repository data by pulling the latest changes and analyzing them. Once done, the service notifies via Kafka that new data is available.
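
As a rough illustration, the ‘Artifact’ entity and the refresh-on-push flow could be sketched in Python as follows. This is an assumption-laden sketch, not the real service: the field names, the `RepoDescriptor` class, the event shape, and the `repo-descriptor-updated` topic are all hypothetical, and the `analyze` and `publish` callables stand in for the real repository analysis and Kafka producer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Artifact:
    artifact_id: str
    build_tool: str   # e.g. "Maven", "NPM"
    language: str     # e.g. "JS", "Python", "Scala"
    contributors: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    ci_config: Dict[str, str] = field(default_factory=dict)


class RepoDescriptor:
    """Keeps the latest artifact data per repository (in memory here)."""

    def __init__(
        self,
        analyze: Callable[[str], List[Artifact]],  # hypothetical: pull + analyze a repo
        publish: Callable[[str, dict], None],      # hypothetical: stand-in for a Kafka producer
    ) -> None:
        self._analyze = analyze
        self._publish = publish
        self._artifacts_by_repo: Dict[str, List[Artifact]] = {}

    def on_git_push(self, event: dict) -> None:
        # Refresh the repository data, then announce that new data is available.
        repo_url = event["repo_url"]
        self._artifacts_by_repo[repo_url] = self._analyze(repo_url)
        self._publish("repo-descriptor-updated", {"repo_url": repo_url})

    def artifacts_for(self, repo_url: str) -> List[Artifact]:
        return list(self._artifacts_by_repo.get(repo_url, []))
```

Injecting `analyze` and `publish` keeps the sketch runnable without a Git checkout or a Kafka broker; a real service would plug in actual clients instead.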

4. Build Descriptor Service - Consumes TeamCity messages and triggers a complete “build refresh” process.


The refresh process goes through the following steps:

  • It starts by getting the build data from TeamCity: Git sources URL, last build result, and last build revision.

  • With the Git URL, ‘Build Descriptor Service’ calls ‘Repo Descriptor Service’ and gets the list of artifacts represented by this build and the relevant data, such as the build tool and language.

  • For each artifact listed, ‘Build Descriptor Service’ queries Artifactory to get the latest dev version and the list of all RC versions.

Once ‘Build Descriptor Service’ completes its task, it produces a message letting all consumers know there’s new data.
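
The “build refresh” steps above can be sketched roughly like this in Python. It is a simplified illustration under stated assumptions: `refresh_build`, its parameters, the dictionary keys, and the `build-descriptor-updated` topic are hypothetical names, and the injected callables stand in for the TeamCity, ‘Repo Descriptor Service’, Artifactory, and Kafka clients rather than their real APIs.

```python
from typing import Callable, Dict, List


def refresh_build(
    build_id: str,
    get_build: Callable[[str], Dict[str, str]],       # stand-in for a TeamCity lookup
    get_artifact_ids: Callable[[str], List[str]],     # stand-in for the repo descriptor call
    latest_dev_version: Callable[[str], str],         # stand-in for an Artifactory query
    rc_versions: Callable[[str], List[str]],          # stand-in for an Artifactory query
    publish: Callable[[str, dict], None],             # stand-in for a Kafka producer
) -> dict:
    # Step 1: build data (Git URL, last result, last revision).
    build = get_build(build_id)

    # Step 2: artifacts represented by this build, looked up by Git URL.
    artifact_ids = get_artifact_ids(build["git_url"])

    # Step 3: latest dev version and all RC versions for each artifact.
    artifacts = {
        artifact_id: {
            "latest_dev": latest_dev_version(artifact_id),
            "rc_versions": rc_versions(artifact_id),
        }
        for artifact_id in artifact_ids
    }

    descriptor = {"build_id": build_id, "build": build, "artifacts": artifacts}

    # Finally, let all consumers know there is new data.
    publish("build-descriptor-updated", {"build_id": build_id})
    return descriptor
```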


5. Production Interface Service - Responsible for reflecting the status of an artifact in production. It gets the server name, version, and uptime of every server running the artifact in production, monitors deployment progress, and updates the developers when a deployment completes or fails.


6. RC service - Manages the RC process, validates that the build can be promoted to RC (no pending changes to build in CI TC and the CI build is green), and monitors the progress of the RC process.
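
The two RC preconditions mentioned above are simple enough to sketch in a few lines of Python. The function name `rc_blockers` and its inputs are hypothetical; how the pending-changes count and build status are actually obtained is not covered here.

```python
from typing import List


def rc_blockers(pending_changes: int, ci_build_green: bool) -> List[str]:
    """Return the reasons a build cannot be promoted to RC (empty if it can)."""
    blockers = []
    if pending_changes > 0:
        blockers.append(f"{pending_changes} pending change(s) not yet built in CI TC")
    if not ci_build_green:
        blockers.append("latest CI build is not green")
    return blockers
```

Returning the list of reasons, rather than a bare boolean, makes it easy to show the developer why an RC was refused.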


7. GA service - Manages the GA/Testbed/Rollback processes and stores the current and past GA versions of each build. It is also responsible for the complete process, including:

  • Update the system that GA was triggered.

  • Update ‘Production Interface Service’ that GA was triggered.

  • Update DB with the new version.
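
Keeping the current and past GA versions per build is what makes rollback straightforward. Below is a minimal in-memory Python sketch of that idea; the class name `GaHistory` and its methods are hypothetical, and treating "the previous GA version" as the rollback target is an assumption made for illustration, not a description of the real service or its DB schema.

```python
from typing import Dict, List, Optional


class GaHistory:
    """Stores GA versions per build, oldest first (in memory for this sketch)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def record_ga(self, build_id: str, version: str) -> None:
        # Corresponds to updating the DB with the new version.
        self._versions.setdefault(build_id, []).append(version)

    def current(self, build_id: str) -> Optional[str]:
        history = self._versions.get(build_id, [])
        return history[-1] if history else None

    def rollback_target(self, build_id: str) -> Optional[str]:
        # Assumption: a rollback goes back to the GA version before the current one.
        history = self._versions.get(build_id, [])
        return history[-2] if len(history) >= 2 else None
```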



Summary


‘Lifecycle’ fetches all the information needed from all of the above microservices and builds the complete build picture.


Using the ‘Lifecycle’ UI, developers have a single access point from which to drill down to every data source or service in the build process. For a fast-growing organization with ~1000 developers who work on thousands of artifacts, it’s crucial to have a CI portal where anyone involved in the process (including production and CI engineers) can get a clear picture of any project.





 

This post was written by Igal Harel

 
