
How (Not) To Add Deadlock To Your Critical Flow - 3 Simple Steps




Intro


In this post, I will describe how a library I developed ended up causing a deadlock six months after it was released, how a feature recently developed at Wix combined with Java's ConcurrentHashMap can cause a deadlock, what I learned from it, and some practical advice.



Background


I'm on the Editor Server team at Wix.com, which is responsible for 100M+ Wix sites and serves tens of millions of requests each day. Our main challenges revolve around scale and resilience. One of the team's responsibilities is a library that renders scripts. The API looks like this:
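The API itself isn't reproduced here, but conceptually it boils down to something like this minimal sketch (the `TemplateRenderer` name and exact signature are assumptions, not the library's real API):

```java
import java.util.Map;

// Hypothetical sketch of the render API; the real library's names may differ.
public interface TemplateRenderer {
    // Renders the named template using the given context values.
    String render(String templateName, Map<String, Object> context);
}
```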



For example, let’s say there is a template Foo.mustach with the content:

Hi {{a}} I'm working!! {{b}}


So calling the library:
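Assuming the hypothetical `TemplateRenderer` sketched above, the call would look roughly like this:

```java
// Hypothetical usage; "renderer" is an instance of the sketched TemplateRenderer.
String result = renderer.render("Foo", Map.of("a", "man", "b", "Bye"));
```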


Will return “Hi man I’m working!! Bye”.

My task was to extend this library. Originally, all the templates were stored on disk as resource files. Now, I got some new requirements:

  • Support templates from another domain

  • Be agnostic about adding or updating a template

  • The render API is critical and should be optimized for performance

  • Templates should fit in the available memory


Long story short, to store the templates I decided to use a feature that was recently developed at Wix - the Greyhound In-Memory Key-Value Store, which provides a high-level abstraction for eventually consistent key-value stores on top of Kafka's compacted topics. Since the data is held in memory in the hosting service, reads are very fast, and it enables changing data and broadcasting it in a single transaction, but it does require you to verify that the data fits in your service's memory footprint. Using the store, developers from another domain were able to add and update templates without the library having to know about them. The store also took care of making the data available across different servers and different data centers, and allowed quick read access (since all of the data was stored in memory). Everyone was happy and the feature was released to the users.
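Greyhound's actual API isn't shown in this post; conceptually, though, the store behaves like this hypothetical interface (illustrative names only, not Greyhound's real API):

```java
import java.util.Optional;
import java.util.function.BiConsumer;

// Hypothetical abstraction of an eventually consistent, in-memory key-value
// view built on top of a Kafka compacted topic.
public interface KeyValueStore<K, V> {
    Optional<V> get(K key);                    // fast read from the in-memory copy
    void put(K key, V value);                  // write and broadcast via Kafka
    void onUpdate(BiConsumer<K, V> listener);  // notified whenever a key changes
}
```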



The Incident and the Investigation


About six months after the feature was released, we started getting alerts that processing a single message took more than an hour to complete in the Kafka consumer's message handler.


Alert’s Dashboard



Those message handlers were used by the Greyhound Key-Value Store in my library, which is why I was called in to help figure out what went wrong and fix it.


After a deeper investigation, we saw that those long executions occurred only during server startup and never completed until a restart. The first step was to take a thread dump from one of the problematic servers.



Looking into the thread dump, we saw a single thread blocked while trying to perform `ConcurrentHashMap.remove`, and lots of other threads blocked on `ConcurrentHashMap.computeIfAbsent`.


Using the call stack details, we went back to the source code and, by applying some common-sense logic, managed to understand the specific flow that would lead to a deadlock.

In order to explain it all, let’s dive deeper into the architecture.



The Architecture


When the library is called, it first needs to get the template, which can be found either in the resource files or in the KV store. After that, we need to do some processing on the raw template, and then we can render the context with the template. It looks something like this -



Actually, there's an in-memory cache of the processed templates, so first we try to get the processed template from the cache, and only if that fails do we process the template and add it to the cache.


When a template is updated in the KV store, the system removes the processed template from the cache so that it is reprocessed on the next request using the updated template.



The cache mechanism was implemented in a third-party library that we used, and was based on Java's ConcurrentHashMap as its data structure. For new requests, the `computeIfAbsent(key, getCacheValue)` method was used, which, when the key is absent, locks the cache until the `getCacheValue` execution is completed; we provided the implementation of `getCacheValue`. On template update, the `remove(key)` method was used.
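In ConcurrentHashMap terms, the cache behavior described above boils down to something like this (a simplified sketch, not the third-party library's actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the cache behavior described above.
class ProcessedTemplateCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Request path: computeIfAbsent locks the key's map bin for the whole
    // time getCacheValue (fetch + process) is running.
    String getOrProcess(String templateName) {
        return cache.computeIfAbsent(templateName, this::getCacheValue);
    }

    // Update path: when a template changes in the KV store, drop it from
    // the cache so the next request reprocesses the updated version.
    void onTemplateUpdated(String templateName) {
        cache.remove(templateName);
    }

    // Fetch the raw template (resource files or KV store) and process it.
    // In the real system the store read may block -- see below.
    private String getCacheValue(String templateName) {
        String raw = "raw template for " + templateName; // placeholder
        return "processed(" + raw + ")";                 // placeholder
    }
}
```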



In `getCacheValue`, the system would call the KV store in order to get the template; however, that call was blocked until the store was fully initialized. While the store is being initialized, `OnUpdate` is triggered synchronously for each item.


Only once `OnUpdate` has been triggered for every item is the store considered initialized. And once the store is initialized, it can't change its state back to not initialized.
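Put together, the store behavior described in the last two paragraphs looks roughly like this (a heavily simplified sketch with hypothetical names; Greyhound's real implementation differs):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.function.BiConsumer;

// Heavily simplified sketch of the store behavior described above.
class TemplateStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private final CountDownLatch initialized = new CountDownLatch(1);
    private final BiConsumer<String, String> onUpdate;

    TemplateStore(BiConsumer<String, String> onUpdate) {
        this.onUpdate = onUpdate;
    }

    // Reads block until the store has finished loading (this is the behavior
    // controlled by the configuration parameter discussed later in the post).
    String get(String key) throws InterruptedException {
        initialized.await();
        return data.get(key);
    }

    // Called once at startup with the existing items from the compacted topic.
    void initialize(Map<String, String> snapshot) {
        snapshot.forEach((k, v) -> {
            data.put(k, v);
            onUpdate.accept(k, v);  // OnUpdate is triggered synchronously per item
        });
        initialized.countDown();    // only now is the store "initialized"
    }
}
```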



The Deadlock


So, let's say we got a request for the template `Foo` while the store is still initializing. `Foo` is not in the cache and isn't in the store either. `getCacheValue` is executed, which locks the cache. In `getCacheValue` we try to read from the store, but we are blocked until the store is fully initialized. During the initialization process, `OnUpdate(Foo)` is called and we try to remove `Foo` from the cache.


However, we have to wait until the cache is released in order to remove the key and proceed with the initialization. Therefore the initialization never completes. And that, ladies and gentlemen, is what a production deadlock looks like.
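The same circular wait can be reproduced with ConcurrentHashMap alone. The following self-contained sketch (names are illustrative only) hangs forever when run:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Minimal reproduction of the circular wait: the "request" thread computes a
// value for "Foo" and, inside the computation, waits for "store initialization";
// initialization only finishes after OnUpdate("Foo") removes the key from the
// cache; the remove blocks on the map bin locked by computeIfAbsent.
public class DeadlockDemo {
    public static void main(String[] args) {
        var cache = new ConcurrentHashMap<String, String>();
        var computing = new CountDownLatch(1);        // "a request is inside getCacheValue"
        var storeInitialized = new CountDownLatch(1); // "the KV store finished initializing"

        Thread request = new Thread(() -> cache.computeIfAbsent("Foo", key -> {
            computing.countDown();
            await(storeInitialized);                  // store read blocks until initialization
            return "processed Foo";                   // never reached
        }), "request-thread");

        Thread storeInit = new Thread(() -> {
            await(computing);
            cache.remove("Foo");                      // OnUpdate("Foo") -> blocks on the locked bin
            storeInitialized.countDown();             // never reached
        }, "store-init-thread");

        request.start();
        storeInit.start();
    }

    private static void await(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```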




The Fix


The first fix was to add a “healthTest” which prevents any user traffic (requests) from reaching the server until the store has initialized. This feature was supported by our framework: on each service startup, all the healthTests are evaluated, and only once they all pass does the service start to receive traffic.
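The framework's health-test API isn't shown here, but the idea is roughly this (hypothetical names, assuming the store exposes an initialization flag):

```java
import java.util.function.BooleanSupplier;

// Hypothetical health test: the framework evaluates all registered health
// tests at startup and only routes traffic once every one of them passes.
class TemplateStoreHealthTest {
    private final BooleanSupplier storeIsInitialized;

    TemplateStoreHealthTest(BooleanSupplier storeIsInitialized) {
        this.storeIsInitialized = storeIsInitialized;
    }

    boolean isHealthy() {
        return storeIsInitialized.getAsBoolean();  // fail until the KV store finished loading
    }
}
```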


Another thing we did was to avoid calling potentially blocking code during the initialization process. Since a value that was just loaded can't appear in the cache yet, we used a parameter that indicates whether the `OnUpdate` invocation is part of the initialization process, and stopped calling 'remove from cache' during initialization.
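A sketch of that second fix, reusing the cache sketch from above (the `duringInitialization` parameter name is an assumption):

```java
// Skip cache eviction while the store is still loading: a value that is only
// now being loaded cannot be in the processed-template cache yet, and calling
// remove() at that point is exactly what participated in the deadlock.
class TemplateUpdateListener {
    private final ProcessedTemplateCache cache;

    TemplateUpdateListener(ProcessedTemplateCache cache) {
        this.cache = cache;
    }

    void onUpdate(String templateName, boolean duringInitialization) {
        if (duringInitialization) {
            return;                              // nothing to evict during startup replay
        }
        cache.onTemplateUpdated(templateName);   // evict so the next request reprocesses it
    }
}
```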


Also, a blocking API, such as reading from the store, should have a timeout. Such a timeout could have prevented the deadlock. However, we chose not to add a timeout for this call, since after adding the health test this call no longer blocks.
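For completeness, a sketch of what such a timeout could look like (this is not what we shipped; as noted, the call no longer blocks once the health test is in place):

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Wrap a potentially blocking read with a timeout so a stuck call fails fast
// instead of holding locks (such as a ConcurrentHashMap bin) indefinitely.
class TimeoutReads {
    static String readWithTimeout(Supplier<String> blockingRead, Duration timeout) throws Exception {
        return CompletableFuture
                .supplyAsync(blockingRead)
                .get(timeout.toMillis(), TimeUnit.MILLISECONDS); // throws TimeoutException on expiry
    }
}
```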


Another open question is why we started getting deadlocks only six months after this feature was deployed. The reason is that there is a parameter that determines whether a read from the store should block until the store is initialized. Initially, we didn't set this parameter, so we got the default behavior (that the read should not block). However, that default behavior was changed from false to true six months later.



Summary


As promised, the 3 simple steps to (not) adding a deadlock are:

  1. (Not) Relying on default values

  2. (Not) Serving traffic before the server is fully initialized

  3. (Not) Calling a blocking API within blocking code


Or, from a developer perspective:

  1. Whether default values should or shouldn't be used is a debatable topic that won't be settled in this post. However, when a specific value has a direct impact on your algorithm or business logic, it should be defined explicitly, especially when the default value is configured by an external library that can change it.

  2. Handling requests before the server is fully initialized is error-prone. In the good case, it can cause some requests to fail until the server is initialized (which can be quite a few under high throughput), and in the worst case it can even cause a deadlock. Defining what must be initialized before handling requests is a must for healthy software.

  3. Blocking APIs are something we try to avoid, but when we do use one, it's highly recommended to provide a timeout for the execution and to minimize the number of actions inside that scope.


I learned a lot from this bug, and I hope that now you have too!


 

This post was written by Guy Nahum


 
