
Graceful Worker Termination: How NOT To Do It



HTTP status 502. After that, everything continues to work. Then another one or two HTTP status 502s… This is the story of how we found and fixed a very elusive bug in our backend code scheduler. The bug was so elusive, in fact, that it accounted for an error rate of under 0.1% in our system - yet it did have an effect on users. We marked the worker shutdown and recycling sequences as the main suspects. During the investigation, our engineering teams gained valuable insights into how the Kubernetes deletion API and pod eviction sequence work, and how they interact with the NodeJS and Go runtimes. We hope the lessons we learned will be helpful to other teams looking to build a custom graceful shutdown mechanism as well.


But before digging further into the issue, let me first give a bit of background on how the Velo backend works.



Random workloads


Velo by Wix extends a Wix website: it allows users to write site- or project-specific backend code that runs on top of NodeJS. Since Wix has hundreds of millions of users, managing such a huge worker grid in a cost-effective way requires a unique architecture, driven by one intent: to create an environment for running user code that is both isolated and has a blazing-fast boot time.


On top of these requirements, the usage pattern of any given Wix website is extremely unpredictable. Some sites induce little to no traffic, others have constant massive traffic, while yet others might produce random bursts of traffic followed by no traffic at all. This basically means we can't assume anything about a site's usage pattern when it comes to its backend worker.



Designing an ephemeral worker grid


Handling the diversity of such traffic patterns in a scalable and cost-effective way led us to base our architecture around an ephemeral, single-tenant worker with a low cold start time. These workers spawn on demand and are ready to serve user requests almost immediately, in less than 100 milliseconds (eliminating the 0-to-1 startup time).


The ephemeral design of our workers helped us optimize allocated resources - we designed our workers to terminate after a certain predefined period. This gave us a few benefits:

  1. Keeping as few stale workers around as possible

  2. Being able to update the workers' software whenever we need to

But we couldn't just sporadically kill a running worker. We needed some sort of graceful shutdown sequence to make sure we weren't hurting requests that were already being handled.


A deterministic worker shutdown sequence required putting a bound on request length. We decided to configure a maximum backend request timeout of X seconds. With this timeout in place, if we direct new incoming requests to a different worker, we can safely terminate the current site's worker after these X seconds. To be on the safe side, we decided to give workers 4X seconds of grace before they terminate.



An (un)graceful shutdown mechanism


All our workers run on a proprietary worker grid called Kore. Kore (Kubernetes Orchestrated Runtime Engine) is a computational grid that’s responsible for the orchestration of backend code execution. Each of our runtime pods consists of a few containers, one of them being the user’s NodeJS runtime.
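For illustration, a runtime pod has roughly the following shape (a minimal sketch only; the container names and images are placeholders, not Kore's actual spec):

apiVersion: v1
kind: Pod
metadata:
  name: user-worker-12345          # one ephemeral, single-tenant worker
spec:
  containers:
    - name: user-nodejs-runtime    # runs the site's backend code
      image: example/velo-node-runtime
    - name: sidecar                # supporting container alongside the runtime
      image: example/worker-sidecar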


Since we use Kubernetes, we aspired to use its native capabilities as our building blocks as much as possible. When using Kubernetes' delete API to remove a pod, we can set graceful shutdown parameters (the same can be achieved by setting terminationGracePeriodSeconds in the yaml, but that value is immutable and we wanted the ability to have a "per tenant" grace configuration). Here's a rough demonstration of how we would delete a pod with "SOME_POD_ID" using a kubectl command:
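Something along these lines (the 60-second grace period is an illustrative value):

# Delete the pod and give it up to 60 seconds to shut down gracefully
kubectl delete pod SOME_POD_ID --grace-period=60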



Frontend code:
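A minimal sketch of the kind of page code involved, assuming a hypothetical backend web module ping.jsw (shown next) that the page calls periodically so that failing calls surface in site monitoring:

// Page code (hypothetical reconstruction - module and function names are illustrative)
import { ping } from 'backend/ping';

$w.onReady(function () {
  // Call the backend every few seconds; a worker that dies mid-request
  // will surface as a failed call (HTTP 502) in site monitoring.
  setInterval(async () => {
    try {
      const serverTime = await ping();
      console.log('backend responded at', serverTime);
    } catch (err) {
      console.error('backend call failed', err);
    }
  }, 5000);
});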



Backend code:
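And a matching backend web module - again a hypothetical sketch; a trivial exported function is enough to exercise the worker on every call:

// backend/ping.jsw (hypothetical web module)
export function ping() {
  return new Date().toISOString();
}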



Using Velo's site monitoring capability, we saw that every 5 minutes we got two requests that ended with 502 errors.

It was clear that workers were dying ungracefully - we had a bug in our graceful shutdown mechanism.


Let's go back to our design. If we use the --grace-period feature of the Kubernetes API, this is what happens (taken from the Kubernetes termination documentation):


1 - Pod is set to the “Terminating” State and removed from the endpoints list of all Services

2 - preStop Hook is executed (if specified)

3 - SIGTERM signal is sent to the pod

4 - Kubernetes waits for a grace period


Our findings suggested that the NodeJS process was somehow terminated instantly upon SIGTERM. After digging into our runtime code, we found the root cause: we had an exit hook handler that listened for SIGTERM, closed some sockets and eagerly invoked process.exit(). Removing it was complex, so we preferred to find a way to work around it.
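In simplified form, the handler looked something like this (a sketch, not the actual production code; the socket registry is hypothetical):

// Simplified sketch of the problematic exit hook
const openSockets = new Set(); // hypothetical registry of sockets kept open by the runtime

process.on('SIGTERM', () => {
  // Close whatever sockets are still open...
  for (const socket of openSockets) {
    socket.destroy();
  }
  // ...and exit eagerly. This is the bug: in-flight requests are dropped
  // instead of letting the Kubernetes grace period run its course.
  process.exit(0);
});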



Making the ungraceful - graceful


Now that we had found the root cause, we needed to resolve the issue for our users.

The solution was actually very simple. We added a preStop hook to our worker pod’s sidecar yaml that sleeps for 60 seconds.
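Roughly like this (a sketch of the relevant part of the sidecar's container spec; the container name and image are placeholders):

containers:
  - name: sidecar
    image: example/worker-sidecar
    lifecycle:
      preStop:
        exec:
          # Keep the container alive for 60 seconds before SIGTERM is sent
          command: ["/bin/sh", "-c", "sleep 60"]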



This means that now our worker termination sequence looks like this:

  1. The worker's termination time is reached

  2. We remove the worker from our discovery service - no new requests will reach that worker (important to note, we use our own proprietary discovery service and not Kubernetes' tooling; the reasons are beyond the scope of this article)

  3. We send a delete pod command via the Kubernetes API

  4. The Kubernetes preStop hook is triggered and sleeps for 60 seconds

    • This allows the worker to gracefully complete any in-flight requests.

  5. Kubernetes sends a SIGTERM to the worker




Conclusion


At Velo, we set out to create a reliable and fast ephemeral worker grid that is both cost-effective and transparent to users. This required a graceful shutdown mechanism - without one, we lose all of the above. As it turned out, such a mechanism is not as trivial to design and test as it might seem.

Eventually, we managed to implement an elegant solution to our problem using native tools from the available tech toolbox. For me, this is a handy compass for a solid architectural design, one which can be easily reasoned about.

We did conclude that we were missing end-to-end test coverage of how we decommission our workers. We also saw the value of investigating ongoing error patterns, even if they don't reach a critical mass.


If it’s consistent - it's interesting and might uncover application blind spots.


 

This post was written by Moshe Maman


 
