The quest of technological companies on the path of making their systems responsive, resilient, elastic, and message driven, requires a paradigm shift and is full of tradeoffs. In the last 5 years our asynchronous scale for both flows and events increased 30-fold. Not only was this a direct result of traffic scaling, but also of the change in the mindset of our engineers that happened over those critical years.
This is a story of a company’s evolution and trade-offs, of something that started as a “make it as simple as possible” to then evolve as scale and concerns changed, affecting the technology stack along the way to meet said concerns and trade-offs. In this post I tried to share Wix's journey into event driven and asynchronous flows from my own perspective as a backend guild manager.
Let’s Go Back In Time
Wix started as a website building platform. At the time most of the backend communication between services was synchronous in the request response style.
The year was 2014, Wix user base was growing quickly and rapidly, reaching 60M site owners at a rate of +1M new registered site owners each month and ~300M monthly visitors. We had around 250 microservices running in production serving about 700M daily HTTP requests. The vast majority of use cases were about in-session users triggering a request from a browser. So the default architecture that helped server engineers solve their day-to-day challenges looked as follows:
Figure 1: The common synchronous micro services architecture at Wix (2014-2018)
To better illustrate the challenges of this setup, let’s revisit one typical use case - Booking. Wix Bookings is a scheduling system that lets customers book online services. Our site owners (aka Wix Users) enable their users (aka Visitors/Users of Users) to book sessions of different types and sync it all with their Google calendars.
The flow here consists of two main user journeys:
(1) “List My Services” - returning all available services that site owners offer to their users (i.e House Call Services)
(2) “Book Now” - Sends customer booking details to the Database to reserve the unique spot within a given timespan.
Figure 2: Example of one of our Yoga’s teacher Booking Home Page
Under The Hood
The entire flow was implemented as a sequential and synchronous flow to each one of the different subsystems. In-session user triggered the flow and we then aimed for low latency flow - let’s see how things looked behind the scenes:
Figure 3: Booking Flow - Micro Services Architecture
Rendering “List My Services”.
During the site rendering flow, multiple API calls were triggered to the BookingService which would validate the request and perform a REST call to the CatalogService. The response was rendered as a catalog view.
“Book Now” on-click flow:
First a REST call to the UsersService which would correlate the userId within Wix Database against the current in-session customer for personalization reasons.
Then, BookingService would trigger several RPC Calls to enrichment servers (like getting user timezone, permissions and additional information). Then, a call would be triggered to the ScheduleService to find some available time slots.
Once a match was found, the server was redirected to the Checkout Service to handle the different payment methods.
Once that was done, the BookingService finally persisted the transaction to MySQL.
Advantages of the Synchronous Microservices Pattern
This kind of architecture is referred to as “Service Orchestration”, since there is one service to manage the flow and instruct other services to perform actions. In our example, the BookingService manages the flow and acts as the “orchestrator” for the flow. Although the BookingService, in its APIs, mostly used Threads Pool and Futures for all the IO work, the underline model itself was synchronous.
Synchronous flows are quite simple to understand, implement & test. You can just look at the code and pretty much completely understand the services you are communicating with. Although, the BookingService could fail if any of the internal systems downstream fails. When an error occurs, the flow fails-fast and propagates the failure back to the user. User immediately gets an error message and can click the retry button to try and book again.
Synchronous services are inherently more atomic. When the response is sent back to the user, we are 100% sure that the transaction is persistent and race conditions that could cause discrepancies such as double booking are significantly lowered. Moreover, since the Booking service was blocked on the IO call downstream, we gained strong consistency guarantee, but sacrificed the availability, since we want to make sure the transaction is persisted by both (or more) servers (aka the CAP Theorem).
The Pains and Stability Patterns
The synchronous pattern, albeit common and powerful, still suffers from many pitfalls. By early 2015 the Wix scale was constantly growing as more users were using the platform, so each server got more and more requests. This is where the problems started to surface. Let’s get a closer look at the pain points within 3 different aspects:
BookService, CatalogService, UsersService and all the services downstream were tightly coupled. Coupling is a term that refers to the degree of interdependence between system components. One fact you better remember is that although services can be deployed and can run on different hosts, services like Booking, Catalog, Users, and Schedules should all be available and running at the same time. We will soon see how that impacts the availability of services to users.
System failures start when cracks in the architecture appear. In our case it was due to two aspects:
Traffic Scaling. More and more users hit the “Book Now” Button. UsersService response time jumped from milliseconds to seconds, its memory consumption got higher and higher due to blocked threads, until each node would get OutOfMemoryError and fail to respond back.
Unbalanced Capacity. Anytime you have a “many-to-one” or “many-to-few” relationship, you can be hit by scaling effects when one side increases. We were adding more Booking nodes to the cluster to keep up with the growing booking requests, but it caused an opposite effect from what we wanted and made UsersService crash under high load.
Choosing the “Service Orchestration” pattern can potentially impact the dev-velocity of a team while growing & scaling. New feature requests from different stakeholders (such as advanced analytics or custom user notifications) required booking core team involvement and re-deployment of a service that triggers yet another end-point.
OK, But how can we mitigate these issues?
Use timeouts! Each integration point over the network must be configured with timeouts; beware of libraries which abstract the network call and can be the weak link in your service health.
Resilient Caching (Koboshi). The unique purpose of this cache is not just all about performance - it is also about helping with resiliency. It caches remote sources (i.e 3rd party data sources) and by eliminating network calls, helps reduce coupling, which increases availability over having up-to-date configuration.
Separating Reads and Writes - allows you to separate the load from reads and writes enabling you to scale each independently, hence applying different optimization strategies to the two sides. For Wix use cases this was a perfect match since we are inherently serving two different segments: our clients (~Millions) and their users (~ Billions).
RPC & Asynchronous IO - HTTP protocol is synchronous. Even though BookingService utilized threads, still the OS threads underneath were blocked until a response came back from all other services. Frameworks like Netty are not bound to HTTP and provide asynchronous non-blocking IO capabilities which are much better for scaling and can hold thousands of concurrent connections.
Limit your queues and create back pressure. Important law in queue theorem is Little Law. It explains why when the length of a queue reaches toward infinity, response time also leans toward infinity. We really don’t want unbounded queues in our systems.
The Need For A Paradigm Shift
During the recent years (2014-2020) Wix product blends got richer and richer. We were not just a platform for building websites, we now offered our clients an entire ecosystem for businesses, such as: managing online shops, advance scheduling, CRM services, financial transaction and payments integrations, even offering a business cloud solution (aka Velo). As a result, the traffic scaled dramatically and reached to about 500 billion HTTP requests per day. Our clusters expanded to more data centers around the globe and we have recently deployed over 2000 microservice clusters to production.
Figure 4: The evolution of new micro service clusters we managed in production, each cluster is visualised as a rectangle (2010-Now).
Moving to Consistent Asynchronous Communication
The simple synchronous paradigm didn’t hold up. When you build a platform for over 200 million site owners and billions of visitors hitting their pages. We needed a new tool in our engineering toolbox, one that would help make our distributed services more resilient and enable us to grow in a healthy manner both technologically and organizational-wise.
In 2015 Kafka was first introduced into the Wix backend platform, today we are managing 6 clusters in both GCP and AWS. Each Cluster holds more than 10k topics. Billions of daily events are being replicated.
Figure 5.1: Introducing Kafka into Wix Backend Architecture (2015-today)
Figure 5.2: Producers trends to Kafka, X40 growth. 3 more Data Centers were added (2016-2021)
From Orchestration to Choreography
Concepts like Choreography solve the above issue, which is the main challenge in the orchestration approach. It is more like the decentralized way of broadcasting data, known as events. Instead of instructing microservices what to do, the services which are interested in those “facts” will subscribe to queues and perform actions when events reach them. This is also known as reactive architecture, which helps to reduce coupling to the origin service (BookingService) which produced those messages.
Back to our example. During 2018 the Booking team refactored their entire architecture which allowed them to move quickly and meet the constant scale growth.
Figure 6 : Booking flow after the refactor, now BookingService produces events to Kafka
The main concept was to use Kafka as a message bus instead of one service orchestrating the entire flow. Now the Booking service would acknowledge the request and publish a CreateBookingEvent to a queue. Contact Service and other enrichment servers would listen to this queue, consume the events and publish enriched events back. When there was failure, Contact Service retried. Upon success, a CompletedBookingEvent was published. The Notification Service listened to that topic and notified our user about the status of the transaction.
Tradeoff In Moving to Asynchronous
Asynchronous systems tend to be significantly more Complex than synchronous ones, harder to trace and to manage. However, the complexity of the system, demands of performance and scale often justify the overhead when you design for scale. Let’s review some:
Consistency Guarantee and its impact on user experience. If it’s not managed properly at product level, users might find themselves wondering why their data got lost. Well it didn’t - it probably traveled between datacenters. At the p95 data replication should be transparent to the users, but you need to design your UX so your user will understand when there is some delay in his request. At Wix mainly we embrace the asynchronous approach with fail fast on writes, and reads are preferred from local replicas.
Ordering & Delivery guarantee. In a distributed messaging system, the machines that run the middleware can always fail independently of one another. An individual machine can crash, or a network failure can happen while a producer is in the process of sending a message to some queue. Depending on the action a producer takes to handle a failure like that, you can get different semantics which would also impact the order in which messages arrive to the queue.
Idempotency. As discussed above, when a producer sends messages to the queue, things can go wrong. The consumer service should be able to overcome any duplication and even if it encounters duplicate messages or order mismatch, it should be tolerant to that.
Backpressure. Commonly occurs in a scenario when one server is sending requests to another server faster than it can process them. The following use case helps to explain the challenge. CPU bound work is supposed to be much faster than IO bound operations. We saw earlier how two services that are coupled can easily cause cascading failures which might bring down the entire cluster. Adding a Message Queue between the two components helps to gain better control of the flow.
In figure 6, Service A is much faster and doing in-flight computation, working at throughput of 100 RPS, while Service B is a bit slower since it is doing a network call, its throughput is only 75 RPS. Let’s try to calculate the difference and its impact:
Figure 7: The lag is growing, but you will not have data lost or errors from the backend like 503
Difference (Service A - Service B) = 100 - 75 = 25 RPS
The Load Impact = 25 * 60 secs * 60 min = 90,000 RPH!
Without a message queue this entire workload will hit service B and will make it even slower. With message queue this extra capacity will be persistent in the message queue. Indeed you will still feel the pressure but this time you won’t have data loss.
Embrace Asynchronous Message-Passing. This was a key part of the shift we are going through. Asynchronous and consistent communication flow between other bounded contexts is necessary in order to decouple them, allowing concurrency in time and distribution and mobility in space. Remember, decoupling and isolation is a prerequisite for resilience and elasticity. (interesting related topics to explore: The Reactive Manifesto, Domain Events, CDC).
Extra Resiliency and Fault Tolerance
Retry Strategies. In a fully distributed ecosystem, retries are inevitable. Consistent microservice communication via message bus reduces coupling and opens new possibilities to improve your fault tolerance when a downstream service has problems to produce or consume messages. At Wix we have implemented different policies that can be configured - a mix between blocking and non-blocking retries.
Resilient Producer. In the asynchronous communication style, message bus is the backbone of the system. All services constantly produce to and consume from the message bus. This makes the message bus the Achilles heel of the system as it remains a central point of failure. Kafka is built to scale but it is still a network component. At Wix even in the 99th percentile we couldn’t afford losing a single message. This is why we implemented a unique producer which produces to the local disk before sending to Kafka, for extra resilience.