Updated: Feb 4
As a developer at Wix, microservices are my natural habitat. I truly appreciate the advantages that come with this architecture. I really do.
Scalability, separation of concerns, data separation, and above all - continuous integration and continuous deployment. These advantages make my work flexible, reduce redundant coordination between teams and create shorter development-to-production cycles.
Yet even after years of working in this environment, I can still be surprised by the overhead of its distributed nature. Such was the case with data aggregation.
The Use Case
I work at the “Premium and Services” team, which is responsible for user subscriptions.
We handle the complete lifespan of subscriptions - from initial configuration of the subscription offer, to the purchase, to delivery after payment, to management of subscriptions by users.
One of our services provides information for the subscriptions page - a page where a user can view all their premium Wix assets. This page uses information from 15 different services inside Wix, such as the billing system, subscriptions systems, site data information, domain information, and so on.
The synchronous model and its shortcomings
Our initial naïve implementation was simple - just call all sources, aggregate the data and show it to the user. After a while the number of users scaled, and so did the number of subscriptions per user, and we realised that our service had a lot of timeouts and a long response time.
If you think about it, this makes sense - when you call 15 services, even if you do it in parallel, the slowest service determines your response time. And if one of them is down, your whole service is either down or waiting for a retry.
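To make the latency argument concrete, here is a minimal sketch of a parallel fan-out. The service names and latencies are made up for illustration, and `call_service` is a stand-in for a real network call, not our production code.

```python
import asyncio
import time

# Hypothetical per-service latencies in seconds; real services will differ.
SERVICE_LATENCIES = {"billing": 0.01, "domains": 0.02, "site_data": 0.05}

async def call_service(name: str, latency: float) -> dict:
    """Stand-in for a network call to one source service."""
    await asyncio.sleep(latency)
    return {"source": name}

async def aggregate() -> list:
    """Call every source in parallel and wait for all of them."""
    calls = [call_service(name, latency) for name, latency in SERVICE_LATENCIES.items()]
    # gather() completes only when the slowest call completes,
    # and it raises if any single call raises.
    return await asyncio.gather(*calls)

def timed_aggregate() -> tuple:
    """Run the fan-out once and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(aggregate())
    return results, time.perf_counter() - start
```

Even though the three calls run concurrently, the elapsed time is bounded below by the slowest one, and a single failing call fails the whole aggregation.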
So we decided to move to an asynchronous CQRS model.
Advantages of the asynchronous CQRS (Command Query Responsibility Segregation) model
In an asynchronous aggregation model we collect our data offline and save it into a view table, so when a user requests the data it is instantly available. Keeping our own view table lets us store data in a non-normalised form, optimised for the needs of the service consumer. Having one table with one row per asset makes the read process faster and simpler.
Saving our own copy of the data enables us to separate the read and write paths into two modules, so we can have two different performance setups. In our case (which is the common case), the write process needs fewer servers than the read process.
Collecting data offline without a user waiting for the response enables us to use retries on network failures. And as I have mentioned above, reading from the DB and not making network calls makes the read process faster and more resilient to network failures.
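The split between a write module and a read module can be sketched roughly as follows. This uses SQLite only to stay self-contained; the table name `subscription_view`, its columns, and the function names are assumptions for illustration and not our actual schema.

```python
import json
import sqlite3

def create_view_table(conn: sqlite3.Connection) -> None:
    """One row per asset, holding a denormalised payload that is ready to serve."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS subscription_view (
               asset_id TEXT PRIMARY KEY,
               user_id  TEXT NOT NULL,
               payload  TEXT NOT NULL
           )"""
    )

def write_row(conn: sqlite3.Connection, asset_id: str, user_id: str, payload: dict) -> None:
    """Write side: called by the background collector, never by a user request."""
    conn.execute(
        "INSERT OR REPLACE INTO subscription_view (asset_id, user_id, payload) "
        "VALUES (?, ?, ?)",
        (asset_id, user_id, json.dumps(payload)),
    )

def read_assets(conn: sqlite3.Connection, user_id: str) -> list:
    """Read side: a single local query, with no network calls to source services."""
    rows = conn.execute(
        "SELECT payload FROM subscription_view WHERE user_id = ? ORDER BY asset_id",
        (user_id,),
    )
    return [json.loads(payload) for (payload,) in rows]
```

Because the two functions share nothing but the table, they can live in separately deployed modules and be scaled independently.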
Our services tolerate network partitions better and become more available. As the CAP theorem predicts, we should expect to pay the price in data consistency.
The challenges in aggregating data asynchronously
The major challenge is maintaining data consistency. Since we duplicate the data, we need to choose our data synchronization strategy. Our view DB can be updated in two ways:
Time-based refresh interval — refresh all records in a constant interval.
Update on change — update records upon change.
Time-based interval was not an option in our case, as we have hundreds of millions of users and most of them don’t change their assets frequently.
Bombarding our data-providing services with millions of requests would overload our systems for pretty much nothing. So we decided to go with the “update on change” option. In order to make it work we needed some kind of announcement for each change in the data source.
The update on change event challenge
Imagine we have a source service with an API, let’s call it API A, and we need some kind of event that tells us that the next call to the API will return different information than the prior call to the same API.
Now put yourself in the shoes of the source data service. You have been asked to publish an event each time the data behind API A changes. Do you have such an event? Most likely, if your service is not built as CQRS, you don’t have one.
Let’s take the billing system for example - billing information can be changed from numerous sources:
Actions by users, like purchase, or credit card changes
Actions done when finalising the transaction with the payment gateway
Refunds done by support agents using a UI
Putting an event on each flow is error-prone, since every newly developed flow will need to be aware of this event and publish it too. You might think of going one level lower and publishing events from the Data Access Objects.
But if you do not have a view table, you will soon find out that there are internal states in your DB that are not relevant to API A, or data migrations that are not relevant to API A.
And if you do not have a view table, most likely you have a few tables that are used for API A, like invoices and line-items inside invoices, so you don’t have one place to publish the API “on change” event.
In our case most of the source data services were not CQRS, and had no such events. We contributed code to other services and created these events in their business flows, taking the risk of missing events in the future.
Time-based data and external data providers
In some cases your data cannot be saved in a view table, because a particular value depends on the time of the read. For example, if we’re dealing with a trial period and need to return the remaining trial days, the value has to be calculated when the data is retrieved and cannot be stored in the DB. This might force you to determine the current trial duration yourself, which is really not your service’s concern.
Another problem is external data sources that do not provide events. For example, VAT percentage is updated at different times and at different geographic locations. For this reason VAT calculations are usually done using external providers and cannot be saved in a view table.
Sometimes in these cases you have no choice but to introduce online calculations during the data retrieval.
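As a minimal sketch of such an online calculation, assume the source service exposes a trial end date that we can store in the view row (the `trial_end` field and both function names are hypothetical). The time-dependent value is then derived on every read instead of being stored.

```python
from datetime import date

def remaining_trial_days(trial_end: date, today: date) -> int:
    """Days left in the trial, never negative once the trial has ended."""
    return max((trial_end - today).days, 0)

def read_asset(view_row: dict, today: date) -> dict:
    """Return a copy of the stored row with the time-dependent field added."""
    result = dict(view_row)
    result["trial_days_left"] = remaining_trial_days(view_row["trial_end"], today)
    return result
```

The stored row stays valid for as long as the trial end date does not change, while the answer returned to the user changes from day to day.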
If these data sources are not large, you can reduce network calls during the read by loading the data into an in-memory cache that is refreshed at a constant interval, and using the cached copy during retrieval. Unfortunately this makes the read module a bit more complicated.
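One way to sketch such a cache is shown below. Note that it refreshes lazily on the first read after the interval has elapsed rather than from a background thread, which is a simplification. The `RefreshingCache` class and its `loader` callback are hypothetical names, not an existing library.

```python
import time
from typing import Any, Callable, Dict, Optional

class RefreshingCache:
    """Keeps a small data set in memory and reloads it at a fixed interval."""

    def __init__(self, loader: Callable[[], Dict[str, Any]],
                 interval_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader            # fetches the full data set from the provider
        self._interval = interval_seconds
        self._clock = clock              # injectable so the cache can be tested
        self._data: Dict[str, Any] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Any:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._interval:
            try:
                self._data = self._loader()
                self._loaded_at = now
            except Exception:
                # Keep serving the previous copy if a refresh fails;
                # only fail when there is nothing cached at all.
                if self._loaded_at is None:
                    raise
        return self._data.get(key)
```

Reads between refreshes never touch the network, at the price of serving values that can be up to one interval old.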
Maintaining eventually consistent data
Using events exposes you to eventual consistency complications.
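When using an event-streaming platform like Kafka, ordering is guaranteed only within a single partition, so in most cases you cannot rely on events from different partitions or topics arriving in the order they were produced.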
In general, update messages can come in two major flavors:
Event containing entity data — in this case you can use the data from the message to update your view table.
Event containing the id of the entity — in this case after the event arrives you need to retrieve the last version of the entity from the source service.
Each of the flavors has its pros and cons.
Events with IDs are less vulnerable to ordering issues, since you always retrieve data from the origin server. However, due to the eventually consistent nature of the source data, you may sometimes retrieve data from the source service and then find out that it is older than the data in the view table. This can happen, for example, when you get the data from one DB instance, and on the next event get it from another DB instance which has not finished syncing its replicated data. Cases like that should be detected and handled.
Events containing entity data reduce the need to create additional network calls and potentially overload the source service with requests. Yet they are also more vulnerable to order issues.
In either case you need to save the version of the origin entity in your view table.
In the case of an event containing entity data - you need to ignore stale messages.
In the case of an event containing only the entity ID - when the data you retrieved from the origin server has a version that is lower than the version in your view table, you need to retry until you get a version that is newer than the version in your view table.
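Both rules can be sketched with a plain dictionary standing in for the view table. The function names, the `fetch` callback and the bounded retry count are assumptions for illustration. A real consumer would more likely re-queue the event than sleep in place.

```python
import time
from typing import Any, Callable, Dict

def apply_data_event(view: Dict[str, dict], entity_id: str,
                     version: int, data: Any) -> bool:
    """Event carries the entity data: ignore it unless it is newer than the view."""
    current = view.get(entity_id)
    if current is not None and version <= current["version"]:
        return False  # stale or duplicate message, keep the existing row
    view[entity_id] = {"version": version, "data": data}
    return True

def apply_id_event(view: Dict[str, dict], entity_id: str,
                   fetch: Callable[[str], dict],
                   max_attempts: int = 3, delay_seconds: float = 0.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Event carries only the ID: fetch from the source, retry on stale reads."""
    for attempt in range(max_attempts):
        entity = fetch(entity_id)  # expected shape: {"version": int, "data": ...}
        if apply_data_event(view, entity_id, entity["version"], entity["data"]):
            return True
        if attempt < max_attempts - 1:
            sleep(delay_seconds)  # the source replica may not have caught up yet
    return False  # give up for now and leave the decision to the caller
```

Both paths go through the same version comparison, which is why the version of the origin entity has to be stored in the view table in either case.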
In our case we used updates by entity ID, and in most cases on each update from one of the sources we retrieve data from all of the sources and rebuild our view table line. We’re also using version numbers when those are supplied by the source service.
Data consistency tools
Source data models and the view table itself can change over time. When such a change happens we use an internal migration tool and recollect data into our view table. In day-to-day operation, since both the collection process and the retries run in the background, we monitor retrieval failures and their causes.
For debugging and data validation purposes we developed an internal tool to compare records in the DB to the original source.
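The core of such a comparison can be as small as a field-by-field diff. The sketch below is not our internal tool, just an illustration of the idea with a hypothetical `diff_record` helper.

```python
def diff_record(view_row: dict, source_row: dict) -> dict:
    """Map each differing field to a (view_value, source_value) pair.

    A field missing on one side is reported as None on that side.
    """
    fields = sorted(set(view_row) | set(source_row))
    return {
        field: (view_row.get(field), source_row.get(field))
        for field in fields
        if view_row.get(field) != source_row.get(field)
    }
```

An empty result means the view row matches the source. Anything else points at the exact fields that drifted.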
Summary and lessons learned
Usually, when talking about challenges of data consistency in microservices, the discussion is about handling write transactions and patterns like the Saga Pattern. But even read-only data aggregation in microservice systems can pose some challenges.
When a system like that scales, moving into an asynchronous model is inevitable. And as I have illustrated above, this change has its costs.
The main cost is maintaining data consistency: finding the right place to announce an “on change” event and maintaining data integrity in an eventually consistent environment can both pose a challenge. Validating data consistency also has its day-to-day overhead.
In order to reduce these costs, services should be designed to support asynchronous consumption.
Two major design aspects can facilitate asynchronous consumption:
Have a clear notion of a view in your service and know when your view has changed. If your service holds simple data with simple CRUD operations, this is easy. If not, consider separating your view into a different table.
Have an incrementing version id of your entity to support stale data detection.
Making sure your service is asynchronous-consumption-ready will enable easier consumption by data aggregation services.
This post was written by Gal Morad