
AI for Customer Care Routing at Wix - Making The First Step of Reinforcement Learning

Updated: Nov 24




Please note: This post is part 2 of "AI for Revolutionizing Customer Care Routing System at Wix".


Call centers, like most queueing systems, are traditionally optimized to minimize customers' waiting time. In reality, however, additional aspects can also tilt the routing decision balance, such as Quality of Service (QoS), fairness, and workload distribution. To optimize the customer experience at Wix, we developed Expert Smart Routing: a data-driven, end-to-end Reinforcement Learning (RL) system that completely redesigns the way customers get served, resulting in a significant improvement in overall customer satisfaction.


In the first post, we walked you through the journey of developing and deploying this solution. We faced the complexities of balancing different objectives and hypothesized that focusing solely on waiting time would lead to suboptimal customer satisfaction. Our findings supported this hypothesis. Now, we are about to dive deeper into the first solution we developed, prior to the RL model, which we call “The Greedy.”


This post will explore the practical value of deploying an intermediate solution when working toward complex deep learning or reinforcement learning models. We will cover the motivations behind the Greedy model, along with its logic and algorithm. Furthermore, we will review the development process of this highly interpretable solution, which enabled us to deliver value quickly, test project assumptions, and gain valuable experience with real data.



Motivation - Why Not Jump Straight to Full-Blown RL?

While the RL model was our ultimate goal, jumping straight to it wasn't beneficial for several reasons. RL research and deployment take time, and the risks are high. We aimed to bring value to our customer care system as quickly as possible, rather than waiting for the full RL solution to mature. Moreover, we had no concrete way of knowing the performance gap between this preliminary model and the future RL model. This uncertainty called for a more immediate, understandable, and trustworthy solution.


Bringing Value Faster

The initial deployment of "The Greedy" model allowed us to start improving customer care operations much earlier. We were able to have an immediate impact on routing efficiency while continuing to develop and refine the RL model in the background.


Explainability and Trust

For AI systems to work effectively in real-time customer care environments, they must earn the trust of human operators, such as Real-Time Analysts (RTAs) and customer care managers. Trust is difficult to build with complex, opaque deep learning models. However, the Greedy model provided a transparent and easily interpretable solution, enabling us to clearly explain routing decisions.


Paving the Way for the RL Solution

Developing the Greedy model was not just a stopgap measure; it was a critical step in preparing for the RL model. The RL model requires mature settings for effective training: clean data, a well-calibrated reward function, and a reliable simulator that closely approximates real-life scenarios. This was even more important because we could not run an A/B test, only a full-exposure test followed by a release. The Greedy model allowed us to verify and refine these components, ensuring that the RL model would have a solid foundation to build upon. Since this model is much easier to debug, control, and adapt, we could translate these lessons into direct action items and iterate quickly on different parts of the system.



The Approach: Meet the Greedy

Reinforcement Learning is all about future-aware decisions. The Greedy model, in contrast, operates on a one-step, value-based logic, essentially making the best possible decision at the current moment without considering future implications, hence the name "Greedy." Given a ticket to route, it calculates the expected reward for each possible routing action (i.e., expert) based on the current state, and ultimately allocates the ticket to the expert with the highest estimated value.
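
In code terms, the decision rule is simply an argmax over a per-expert value estimate. The snippet below is a minimal sketch rather than Wix's production code: the `Expert` and `Ticket` fields, the weights, and the `estimate_reward` formula are illustrative assumptions (the real reward also covers waiting time, abandonment, and resolution, as described in the next section).

```python
from dataclasses import dataclass

@dataclass
class Expert:
    expert_id: str
    tier: int
    occupancy: float      # fraction of the recent window the expert was busy

@dataclass
class Ticket:
    ticket_id: str
    user_tier: int
    topic: str

def estimate_reward(ticket: Ticket, expert: Expert) -> float:
    """Hypothetical one-step reward combining tier matching and occupancy."""
    tier_match = 1.0 - abs(ticket.user_tier - expert.tier) / 3.0
    workload_balance = 1.0 - expert.occupancy
    return 0.7 * tier_match + 0.3 * workload_balance

def greedy_route(ticket: Ticket, candidates: list[Expert]) -> Expert:
    # "Greedy": pick the best expert for this ticket only, with no look-ahead.
    return max(candidates, key=lambda e: estimate_reward(ticket, e))
```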



How Does It Work?

Two of the five Key Performance Indicators (KPIs) we focused on, tiers matching and occupancy (as elaborated on in the previous post), are determined deterministically at routing-decision time, while the others can be estimated or learned in various ways:

  • Waiting time was statistically estimated as the sum of random variables with parameterized (exponential) distributions. Beyond the expected value, the model could also yield the variance, allowing us to assess the probability of breaching several service levels, a common need in the customer care industry (see the sketch after this list).

  • The abandonment rate, in the case of the callback channel, was well explained by a single-variable linear regression based on the expected waiting time.

  • The ticket resolution probability was modeled in a supervised manner, fitted on historical data.
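
To make these three estimators concrete, here is a hedged sketch. The data, rates, features, and model choices are illustrative assumptions, not the exact distributions or features used at Wix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# --- Waiting time: a sum of exponential service times for the tickets ahead.
def waiting_time_stats(service_rates, n_samples=20_000):
    """Return the mean, the variance, and a breach-probability function for the wait.
    For independent exponentials, E[W] = sum(1/rate_i) and Var[W] = sum(1/rate_i^2);
    the probability of breaching a service level is estimated by Monte Carlo."""
    rates = np.asarray(service_rates, dtype=float)
    mean, var = (1.0 / rates).sum(), (1.0 / rates ** 2).sum()
    samples = rng.exponential(1.0 / rates, size=(n_samples, len(rates))).sum(axis=1)

    def breach_prob(sla_minutes):
        return float((samples > sla_minutes).mean())

    return mean, var, breach_prob

# --- Abandonment rate (callback channel): single-variable linear regression on the wait.
wait_minutes = np.array([[2.0], [5.0], [10.0], [20.0], [40.0]])   # toy data
abandon_rate = np.array([0.01, 0.03, 0.08, 0.18, 0.35])           # toy data
abandon_model = LinearRegression().fit(wait_minutes, abandon_rate)

# --- Resolution probability: any supervised classifier fitted on historical tickets.
X_hist = rng.normal(size=(500, 4))       # e.g. expert skill, tenure, topic statistics
y_hist = rng.integers(0, 2, size=500)    # resolved / not resolved
resolution_model = LogisticRegression().fit(X_hist, y_hist)

mean, var, breach = waiting_time_stats([1 / 6, 1 / 8])  # two tickets ahead, ~6 and ~8 minutes each
print(f"expected wait = {mean:.1f} min, P(wait > 20 min) = {breach(20):.2f}")
```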




Pseudo-Code

The pseudo-code below describes how the top k experts are retrieved from a pool of candidate experts for a given ticket to route (and its associated user). An illustration of this process is given in Image 1.

[Image: pseudo-code for retrieving the top k experts]
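
Since the pseudo-code itself is shown as an image, the snippet below is only a hedged Python reconstruction of the flow it describes: filter the candidate pool, score each eligible expert with the one-step expected reward, and return the top k. The `score` and `eligible` callables are injected placeholders (for example, the toy `estimate_reward` from the earlier sketch), not Wix's actual implementations.

```python
import heapq

def top_k_experts(ticket, candidates, k, score, eligible):
    """Return the k candidate experts with the highest one-step value for `ticket`.

    score(ticket, expert)    -> float  (expected immediate reward)
    eligible(ticket, expert) -> bool   (availability, skills, channel constraints, ...)
    """
    pool = [expert for expert in candidates if eligible(ticket, expert)]
    return heapq.nlargest(k, pool, key=lambda expert: score(ticket, expert))

# Example usage with the toy reward from the earlier sketch:
# top3 = top_k_experts(ticket, experts, k=3,
#                      score=estimate_reward,
#                      eligible=lambda t, e: True)
```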

Calibrating the Reward Parameters

The Greedy model is tightly coupled with our reward scoring system, which allows us to fine-tune the reward from individual ticket outcomes to broader customer care KPIs. This alignment was crucial for daily operational adjustments, accounting for real-time variables such as expert availability and channel constraints. This tuning process was performed iteratively on the validation set, with the daily KPIs observed after each iteration (a sketch of such a calibration loop follows).
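
The sketch below illustrates one way such a loop could look. The KPI names, the weighted reward form, and the grid search are assumptions for illustration; the actual calibration at Wix was an iterative process guided by observing the simulated daily KPIs.

```python
from itertools import product

def ticket_reward(outcome: dict, weights: dict) -> float:
    """Per-ticket reward as a weighted mix of KPI contributions (used inside the
    simulated greedy policy). `wait` and `abandonment` enter as penalties."""
    return (weights["tier_match"] * outcome["tier_match"]
            + weights["occupancy"] * outcome["occupancy"]
            - weights["wait"] * outcome["wait"]
            - weights["abandonment"] * outcome["abandonment"]
            + weights["resolution"] * outcome["resolution"])

def calibrate_weights(simulate_day, validation_days, kpi_score, grid):
    """Grid-search the reward weights: simulate each validation day with the greedy
    policy under the candidate weights, then keep the combination whose simulated
    daily KPIs score best according to `kpi_score`."""
    best_weights, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        weights = dict(zip(grid.keys(), combo))
        daily_kpis = [simulate_day(day, weights) for day in validation_days]
        score = kpi_score(daily_kpis)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights
```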




Why is it suboptimal compared to RL?

The greedy nature of the algorithm makes it focus solely on the current routing decision. Overall, we learned that this behavior can be suboptimal in two main cases (contrasted more formally after the list below).

  1. It cannot consider long-term dynamics and trade-offs. For example, while a greedy model would assign an idle top-tier expert to a low-tier user rather than let them wait for a low-tier expert, an RL model trained on a longer horizon should save this precious resource for the high-tier users who are likely to arrive.

  2. A greedy model cannot make future-aware decisions or plan ahead. One case we noticed was its inferior ability to prevent early exhaustion of experts by accepting a slightly higher waiting time now in order to avoid much longer waiting times later, especially before peak times. A future-aware RL model can keep its experts "fresh" and balanced before the rush hours, making better use of them and avoiding extreme waits.
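
To state the distinction more formally (our notation, not taken from the original post): the Greedy policy maximizes only the immediate reward, while an RL policy maximizes the expected discounted return, which is exactly what lets it reserve experts for likely future arrivals.

```latex
a_{\text{greedy}} = \arg\max_{a} \, r(s, a)
\qquad \text{vs.} \qquad
a_{\text{RL}} = \arg\max_{a} \, Q(s, a),
\quad Q(s, a) = \mathbb{E}\big[\, r(s, a) + \gamma \max_{a'} Q(s', a') \,\big]
```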



The R&D Journey: Big Cycles, Small Cycles

We worked in four big cycles (aka milestones), each with a strict deadline, culminating in a deployed model or feature that brought tangible value to customer care. Each milestone was further broken into multiple small cycles (aka checkpoints) in which we developed, evaluated, and revisited progress and scope with the project stakeholders. These improvements could relate to the routing policy, the simulator, or the reward function.


Next, we will cover some lessons from the process, detailing a few challenges we had to tackle—lessons likely relevant to similar use cases.


Milestone #1: The Live Test

Our first milestone was by far the longest. It aimed to close the first cycle with real data and required two main efforts: (1) building a fully validated simulator of the customer care (CC) environment, and (2) developing a reward-based policy that we could test live. Most of the heavy lifting went into building the simulator and perfecting its approximation of real life. Once this level was achieved, we shifted to programming basic baselines, the current policy, and ultimately the Greedy policy. Lastly, we optimized it on a validation set, benchmarking it against the current policy of the CC system. Overall, this milestone required success in two successive evaluations (a sketch of both checks follows the list):


  1. Simulation evaluation: Estimating the simulator's gap from real life by comparing the simulated results of the current CC routing policy with the KPIs observed in production during the same period. We defined an acceptable error rate for each KPI as a clear definition of done for the task. Since we could not conduct an A/B test (only a live test), this step was fundamental for evaluating the system.

  2. Policy evaluation: Simulating and comparing the current policy against a candidate policy on the same test scenarios or dataset. Here, we required the candidate to outperform the current policy on some of the KPIs, such as tiers matching, knowingly sacrificing waiting time to some extent.
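
A hedged sketch of both checks is given below. The KPI names, tolerances, and thresholds are illustrative placeholders, not Wix's actual acceptance criteria.

```python
KPI_TOLERANCES = {                 # acceptable relative error per KPI ("definition of done")
    "avg_wait_minutes": 0.10,
    "tier_match_rate": 0.05,
    "abandonment_rate": 0.10,
}

def simulator_is_valid(simulated_kpis: dict, production_kpis: dict) -> bool:
    """Check 1: replaying the current policy in the simulator must reproduce the
    KPIs observed in production within the per-KPI tolerance."""
    return all(
        abs(simulated_kpis[k] - production_kpis[k]) / abs(production_kpis[k]) <= tol
        for k, tol in KPI_TOLERANCES.items()
    )

def policy_is_acceptable(candidate_kpis: dict, current_kpis: dict,
                         must_improve=("tier_match_rate",),
                         max_wait_regression=0.15) -> bool:
    """Check 2: the candidate policy must beat the current one on selected KPIs
    while keeping the waiting-time regression within an agreed budget."""
    improves = all(candidate_kpis[k] > current_kpis[k] for k in must_improve)
    wait_ok = (candidate_kpis["avg_wait_minutes"]
               <= current_kpis["avg_wait_minutes"] * (1 + max_wait_regression))
    return improves and wait_ok
```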


With a validated simulator and an offline-tested policy, we were ready for the production environment. The live test spread over two full weeks, balancing the need for enough data to learn from against organizational needs.


Milestone #2: The Full Release

The live test produced a number of significant action items we had to complete before going to full release. These tasks composed the second milestone. Interestingly, most of the value gained came from data and product improvements, not from better modeling.


  • Data improvements & maintenance: We revisited some of the parameters and assumptions we had made. For example, we initially assumed that experts who had been skilled in certain topics in the past, but had not practiced them recently, would retain their resolution rates. However, after live testing, we understood that due to rapid product changes this knowledge degrades quite fast without proper practice. Furthermore, we learned that statistics and data should be updated regularly to accommodate new and retired experts and trends in KPIs, which requires heavy validation for both integrity and data drift.

  • Product improvements: It is often overlooked that improving an ML model's impact can come from smarter usage of the model, namely how it is triggered and used, how its predictions are consumed, and what business-related pre/post-processing is applied. For example, we decided to exclude from the model flow one specific topic that is strictly limited to a small group of experts and focused on short response time. In addition, we adopted a fallback strategy to the old system if waiting times exceeded expectations, shifting our priority to serving as fast as possible (a sketch of these guardrails follows the list).
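
A hedged sketch of these two product-level guardrails is given below. The topic name, the threshold, and the policy/state interfaces are hypothetical.

```python
EXCLUDED_TOPICS = {"dedicated_team_topic"}    # hypothetical topic handled outside the model
MAX_EXPECTED_WAIT_MINUTES = 30                # hypothetical fallback threshold

def route_ticket(ticket, system_state, greedy_policy, legacy_policy):
    # Business pre-processing: some topics bypass the model entirely.
    if ticket.topic in EXCLUDED_TOPICS:
        return legacy_policy(ticket, system_state)
    # Fallback guardrail: if waiting times already exceed expectations,
    # revert to the old "serve as fast as possible" routing.
    if system_state.expected_wait_minutes > MAX_EXPECTED_WAIT_MINUTES:
        return legacy_policy(ticket, system_state)
    return greedy_policy(ticket, system_state)
```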


Once released, the model showed a significant increase in customer satisfaction (CSAT). That said, we identified a lead for a massive improvement of the system. We were able to simulate it quite quickly, and after verifying our assumptions, we decided to stray from the original plan and make another stop along the way: our third milestone.


Milestone #3: Chaotic System - Optimizing Previous Allocations

If the Greedy policy release allowed us to slice the cake better (achieving a better balance), the effort in this milestone made the whole cake bigger, improving all KPIs while barely sacrificing waiting time.


The production environment proved chaotic in ways that were hard to predict. For instance, the frequency, duration, and triggers of experts' spontaneous breaks, or call-duration outliers, were phenomena that could not be learned or predicted. We observed that dozens of minutes could pass between the prediction point and the start of service, a span in which a lot can change in the system. To reduce this uncertainty, we added an optimization step that revisits previous routings.


This new flow triggers any time an expert becomes idle without any waiting tickets in their queue. We then inspect all waiting tickets in the system, looking for the best ticket to allocate to that expert. A valid candidate for rerouting must have a positive expected reward gain: a higher expected reward (as described in the pseudo-code) if allocated to the new expert than if it remained in its current state. Both estimations are recalculated, ensuring we have an updated snapshot and thus reducing the uncertainty that accumulates as time passes.
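
A hedged sketch of this trigger is shown below. The `expected_reward` callable and the ticket's `current_assignment` field are stand-ins for the components described earlier, not Wix's actual interfaces.

```python
def reoptimize_on_idle(idle_expert, waiting_tickets, expected_reward):
    """When `idle_expert` has an empty queue, return the waiting ticket worth
    rerouting to them, or None if no move yields a positive expected gain."""
    best_ticket, best_gain = None, 0.0
    for ticket in waiting_tickets:
        # Both sides are recalculated now, on a fresh snapshot of the system,
        # which reduces the uncertainty that accumulates while tickets wait.
        gain = (expected_reward(ticket, idle_expert)
                - expected_reward(ticket, ticket.current_assignment))
        if gain > best_gain:          # a strictly positive gain is required
            best_ticket, best_gain = ticket, gain
    return best_ticket
```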




Milestone #4: RL Model

Armed with all the lessons learned along the way, we now have everything we need to develop the full RL solution. With a stable version in production that already shows massive value, we could use the Greedy model as a strong baseline. We will delve further into the RL model in a future post.


Summary 

The Greedy model provided a crucial stepping stone toward our ultimate goal of deploying a full RL solution. It allowed us to bring value quickly, test our assumptions, and lay the groundwork for more advanced models. As we continue to refine and develop our RL approach, the lessons learned from the Greedy model will guide us in creating even more sophisticated and effective customer care solutions.

The takeaways of this post can be summarized as follows:

  1. Having an explainable solution before the full RL solution was highly valuable: bringing value faster to the product, offering explainability and increased trust, and also assisting in the research efforts.

  2. Having a quality simulator is most of what you need, both for the Greedy model and for the RL one. It was vital for evaluation, an enabler for any model training, and a golden asset to be utilized across the customer care funnel.

  3. Much of the value gained from ongoing improvements came from the data and product perspectives rather than from modeling, that is, from better usage of the same model and higher-quality inputs.



 

This post was written by Ofir Magdaci

 

