Billing systems are complex. They handle perhaps the most unforgiving aspect of the relationship between the customer and a company.
We at Wix.com process hundreds of millions of dollars yearly. This means a lot of transactions running through the system, each of them representing a customer putting their trust in us. When customers press the Submit button on a purchase page, they want to be sure that what they’re paying for is indeed what they ordered. Similarly, every owner of a Wix-powered website expects our processing systems to just work and deliver the same result every time.
Anything other than that fractures the confidence of the customer that we are capable of providing a high-quality service. So, for example, when a customer order charge flow is being executed, we must make sure that we charge that customer successfully exactly once, even if they somehow managed to do a “double submit”.
Another issue we came across was that while migrating Kafka topics we could receive the same message several times, on different topics and in different formats, and we had to make sure we processed everything EXACTLY ONCE.
The number of use cases is practically unlimited, as we soon found out from the demand for our service all across Wix, especially when handling financial transactions.
We asked ourselves: how can we achieve this goal in a system that uses a version of Kafka for messaging that supports only AT LEAST ONCE semantics, or in a system that does not use Kafka at all?
So, what is idempotency anyway?
According to Wikipedia, “idempotency is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application.”
Or as Einstein put it: “The definition of insanity is doing the same thing over and over and expecting different results”. Well, he never did say that, but it could have been such a sweet tweet.
Implementing idempotency/exactly-once semantics could easily become a minefield.
To name just a few challenges:
How do you save the result of the initial invocation?
What if your service doesn’t even have a database?
How are you handling errors?
How do you differentiate between internal errors in the idempotency operation itself and business exceptions?
How are you handling timeouts?
How are you handling cross-DC synchronizations?
All of these would have to be solved again, over and over, each time.
These patterns repeat themselves in many other use cases across Wix. We wanted to devise a generic solution that would allow us to enrich any function/method with the idempotency/exactly-once property.
What did we do?
We decided to go with a fat client/slim server architecture. The code was first written by Maxim Zabuti, with later contributions by others (including myself). A library-only solution would have required each user to make modifications to their own database, or even create one if they didn’t have one already (this was actually the case in one of our use cases). We wanted to make it easy to use and fast to implement.
User Perspective (Scala code)
The myFunctionThatReturnsFutureOfInteger will be executed the first time and the result of it will be saved.
Further calls to myFunctionThatReturnsFutureOfInteger will return the saved result without executing the myFunctionThatReturnsFutureOfInteger.
Note that myFunctionThatReturnsFutureOfInteger need not worry about idempotency. In fact, it may also be invoked in a context where idempotency is completely irrelevant.
It is important to note that the operation may partially succeed and the function may still fail with an exception. In this case, either the function has to be able to handle the inconsistent state left by the previous invocation, or the caller should prevent subsequent invocations by setting allowRetryOnFailure to false.
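To make the behavior described above concrete, here is a minimal in-memory sketch. It is not the actual Wix client: the class InMemoryIdempotency, the method name withIdempotency, and the key "order-123" are assumptions made for illustration, and the real client stores results on a server rather than in a local map. Only myFunctionThatReturnsFutureOfInteger and allowRetryOnFailure are names taken from the text.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical in-memory stand-in for the idempotency client (assumption, not the Wix API).
class InMemoryIdempotency(implicit ec: ExecutionContext) {
  // key -> outcome of the first invocation for that key
  private val saved = TrieMap.empty[String, Future[Any]]

  def withIdempotency[T](key: String, allowRetryOnFailure: Boolean = true)(fn: => Future[T]): Future[T] = {
    val placeholder = Promise[Any]()
    saved.putIfAbsent(key, placeholder.future) match {
      case Some(previous) =>
        // Repeated invocation: replay the saved outcome without evaluating fn.
        previous.asInstanceOf[Future[T]]
      case None =>
        // First invocation: run fn; on failure, forget the key only if retries are allowed.
        val firstRun = Future(fn).flatten.transform { outcome =>
          if (outcome.isFailure && allowRetryOnFailure) saved.remove(key)
          outcome
        }
        placeholder.completeWith(firstRun)
        firstRun
    }
  }
}

object IdempotencyDemo {
  import ExecutionContext.Implicits.global

  private val client = new InMemoryIdempotency
  val executions = new AtomicInteger(0)

  // The business function knows nothing about idempotency.
  def myFunctionThatReturnsFutureOfInteger(): Future[Int] =
    Future { executions.incrementAndGet(); 42 }

  def chargeOnce(): Int =
    Await.result(client.withIdempotency("order-123")(myFunctionThatReturnsFutureOfInteger()), 5.seconds)
}
```

Calling IdempotencyDemo.chargeOnce() twice returns the same value both times, while the executions counter shows that the wrapped function ran only once. With allowRetryOnFailure set to false, a failed first run would be saved and replayed instead of being retried.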
Another very useful function is withExactlyOnce:
In this case, we told the idempotency client to throw an exception for multiple invocations with the same key. You may choose to do something with the exception or rethrow it. In this case, we are logging the error and continuing with the normal execution flow.
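A rough sketch of that flow might look as follows. Only the name withExactlyOnce comes from the text; the class InMemoryExactlyOnce, the exception type DuplicateInvocationException, and the message ids are hypothetical, and the duplicate is reported as a failed Future, which is an assumption about how the exception surfaces.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical exception type; the post does not name the real one.
final case class DuplicateInvocationException(key: String)
  extends RuntimeException(s"Operation with key '$key' was already invoked")

// Hypothetical in-memory stand-in for the exactly-once client.
class InMemoryExactlyOnce(implicit ec: ExecutionContext) {
  private val seenKeys = TrieMap.empty[String, Unit]

  def withExactlyOnce[T](key: String)(fn: => Future[T]): Future[T] =
    seenKeys.putIfAbsent(key, ()) match {
      case Some(_) => Future.failed(DuplicateInvocationException(key)) // key already used
      case None    => Future(fn).flatten                               // first use: run fn
    }
}

object ExactlyOnceDemo {
  import ExecutionContext.Implicits.global

  private val client = new InMemoryExactlyOnce

  // Process a message once; on a duplicate, log the error and carry on.
  def handle(messageId: String): String = {
    val outcome = client
      .withExactlyOnce(messageId)(Future.successful(s"processed $messageId"))
      .recover { case e: DuplicateInvocationException =>
        println(s"Ignoring duplicate: ${e.getMessage}")
        "skipped"
      }
    Await.result(outcome, 5.seconds)
  }
}
```

Here the first call to ExactlyOnceDemo.handle with a given id does the work, and any later call with the same id is logged and skipped. A caller that prefers to fail loudly could drop the recover step and let the exception propagate.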
The key parameter uniquely identifies the request, allowing the idempotency infrastructure to understand that two invocations are the same. Calculating the key can be tricky, as we found out the hard way. At some point, customers who created an order and then modified it ended up with the same idempotency key for both operations.
As a result, the system charged the first invoice, resulting in an erroneous charge and user complaints.
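The pitfall can be illustrated with a small sketch. The Order type, its revision field, and both key functions are hypothetical; the post does not say how the real key was computed or how it was fixed. The point is only that the key has to change whenever the request is meaningfully different.

```scala
// Hypothetical order model used only for illustration.
final case class Order(id: String, revision: Int, amountCents: Long)

object IdempotencyKeys {
  // Too coarse: an order and its modified version collide on the same key.
  def naiveKey(order: Order): String =
    s"charge:${order.id}"

  // Includes the revision, so a modified order counts as a new request,
  // while a double submit of the same revision still maps to one key.
  def revisionAwareKey(order: Order): String =
    s"charge:${order.id}:rev-${order.revision}"
}

object KeyDemo {
  val original = Order("order-123", revision = 1, amountCents = 1000L)
  val modified = original.copy(revision = 2, amountCents = 2500L)
}
```

With the naive key, KeyDemo.original and KeyDemo.modified are indistinguishable to the idempotency layer, so the second charge would replay the result of the first. With the revision-aware key they are treated as two separate requests.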
What happens behind the scenes?
The idempotency client performs the following steps (for the first invocation of fn, key):
Record on the server that an idempotent operation has begun for the key, with a default TTL of 5 minutes
Execute fn and save its result to the server with an infinite TTL, using JSON serialization
Return the result to the caller
On the second invocation
The server returns the stored result
The client deserializes* it back to the original object type
The client returns the results to the caller
* special handling for cases such as None, null, void. The JSON mapper is provided by the caller.
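The steps above can be sketched with an in-memory server. Everything named here (Codec, SlimServer, FatClient, withIdempotency, the Entry states) is a hypothetical stand-in: the real server is a separate service, and the real serialization goes through the caller's JSON mapper. Failing fast when another invocation is still in progress is also an assumption, since the post does not describe that case.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical codec standing in for the caller-provided JSON mapper.
trait Codec[T] { def write(value: T): String; def read(raw: String): T }

// States the hypothetical server keeps per key.
sealed trait Entry
final case class InProgress(expiresAtMillis: Long) extends Entry
final case class Completed(payload: String) extends Entry

class SlimServer {
  private val store = TrieMap.empty[String, Entry]

  // Returns the live entry for the key, or records "begun" and returns None.
  def begin(key: String, ttl: FiniteDuration, nowMillis: Long): Option[Entry] = {
    store.get(key).foreach {
      case InProgress(expiresAt) if expiresAt <= nowMillis => store.remove(key) // marker expired
      case _ => ()
    }
    store.putIfAbsent(key, InProgress(nowMillis + ttl.toMillis))
  }

  // Completed entries never expire (the "infinite TTL" of the result).
  def complete(key: String, payload: String): Unit = store.update(key, Completed(payload))
}

class FatClient(server: SlimServer, defaultTtl: FiniteDuration = 5.minutes)(implicit ec: ExecutionContext) {
  def withIdempotency[T](key: String)(fn: => Future[T])(implicit codec: Codec[T]): Future[T] =
    server.begin(key, defaultTtl, System.currentTimeMillis()) match {
      case Some(Completed(payload)) =>
        Future.successful(codec.read(payload)) // second invocation: replay stored result
      case Some(InProgress(_)) =>
        // Assumption: fail fast while another invocation holds the key.
        Future.failed(new IllegalStateException(s"Operation '$key' is still in progress"))
      case None =>
        Future(fn).flatten.map { result =>
          server.complete(key, codec.write(result)) // save the serialized result
          result                                    // return it to the caller
        }
    }
}

object ProtocolDemo {
  import ExecutionContext.Implicits.global

  // An Int's decimal form is already valid JSON, so this codec is trivially small.
  implicit val intCodec: Codec[Int] = new Codec[Int] {
    def write(value: Int): String = value.toString
    def read(raw: String): Int = raw.toInt
  }

  private val client = new FatClient(new SlimServer)
  val executions = new AtomicInteger(0)

  def run(): Int =
    Await.result(client.withIdempotency("invoice-7") {
      Future { executions.incrementAndGet(); 7 }
    }, 5.seconds)
}
```

In this sketch a failed fn leaves the in-progress marker in place until its TTL expires, after which the key can be used again. The sketch also ignores the races a real server would have to close, for example with a conditional insert in the database.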
The server has a very simple API for communicating with the fat client. The user is completely oblivious to that API. It uses a MySQL database, as this is the standard database for most services at Wix. This database replicates across DCs, making the idempotency service available cross-DC. With Wix infrastructure, this service has a very low latency of 5-10 ms.
Seeing the demand for this service from outside the billing group, it became clear that it is a prime candidate for being released as open source.
Hopefully, we will be able to release it this year.
Idempotency and exactly-once are problems we stumble upon in many areas, definitely in financial systems. The code that handles the implementation must remain separated from the business logic for simplicity. The code that handles the idempotency/exactly-once has its caveats to watch out for.
As we here at Wix realized, it is best to solve this problem Exactly-Once.
This post was written by Lior Asher