Engineering Resilience: Behind the Scenes of Wix’s Global Platform

Hi, I’m Ben Chen, an Engineering Group Manager at Wix. Over the years, my team and I have faced countless challenges managing one of the most complex, globally distributed platforms in the world. 


Today, I want to share the story of how we transformed traffic management and production infrastructure at Wix, turning obstacles into opportunities with two new systems - Production State and Traffic Light.



The High-Stakes World of Global Platforms


Imagine being responsible for managing millions of requests per minute across a mix of cloud and physical data centers. Each click, scroll, and transaction depends on flawless execution.


Downtime is not an option. It’s like orchestrating a high-stakes performance where every move must be perfectly timed, where a single misstep can cause dissonance felt worldwide.


As we scaled, we grappled with pressing questions:


  • How do we maintain resilience during production issues?

  • Can we seamlessly shift traffic between data centers (DCs) or regions without causing downtime?

  • And most crucially, how do we ensure the target DC can handle the sudden influx of traffic?


Finding answers was no easy feat, but as with any good story, the struggle paved the way for innovation.



Building the Backbone: Production State


The journey began with a simple yet profound realization: we needed a single source of truth to navigate the complexity of our infrastructure. Enter Production State, the system that became the heartbeat of our traffic management strategy.


We started by developing a framework of sensors that standardize and collect data from various sources - such as Grafana alerts, HTTP endpoints, and static configurations. For example, a Grafana sensor is defined using a JSON structure that includes the alert ID, dashboard ID, and panel ID. Once configured, sensors detect state changes using polling or event-driven mechanisms, and automatically trigger notifications to Slack through our internal automation system.


These sensors delivered real-time views of system health with a binary output of "healthy" or "unhealthy," cutting through the noise to keep us focused on the critical insights.
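To make the sensor idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Wix's implementation: the JSON field names (`alert_id`, `dashboard_id`, `panel_id`), the `Sensor` class, and the `notify` callback standing in for the Slack automation are all hypothetical. The post only states that a Grafana sensor is defined in JSON with those three IDs and that state changes trigger notifications.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sensor definition; the real schema is not shown in the post.
GRAFANA_SENSOR_JSON = """
{
  "name": "checkout-latency",
  "type": "grafana",
  "alert_id": 101,
  "dashboard_id": 7,
  "panel_id": 3
}
"""

@dataclass
class Sensor:
    name: str
    read: Callable[[], bool]        # returns True when the source looks healthy
    notify: Callable[[str], None]   # stand-in for the Slack automation
    last_state: Optional[str] = None

    def poll(self) -> str:
        """Read the source once and notify only when the binary state changes."""
        state = "healthy" if self.read() else "unhealthy"
        if self.last_state is not None and state != self.last_state:
            self.notify(f"{self.name}: {self.last_state} -> {state}")
        self.last_state = state
        return state

config = json.loads(GRAFANA_SENSOR_JSON)
readings = iter([True, True, False])   # simulated alert evaluations
messages = []
sensor = Sensor(config["name"], read=lambda: next(readings), notify=messages.append)
states = [sensor.poll() for _ in range(3)]
```

In this simulated run the sensor reports healthy twice and then unhealthy, and only the single transition produces a notification.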


Next, the data collected by these sensors was aggregated into a unified state of production, providing a comprehensive snapshot stored in a central database. This centralized state became our compass, guiding every decision and giving us an accurate view of the infrastructure's health at any given moment.
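The aggregation step could look something like the sketch below. The `Reading` type, the DC names, and the rule that a DC counts as healthy only when all of its sensors are healthy are assumptions made for illustration; the post does not describe the actual aggregation logic or database schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    dc: str
    sensor: str
    healthy: bool

def aggregate(readings):
    """Collapse per-sensor readings into one healthy/unhealthy verdict per DC."""
    state = {}
    for r in readings:
        # Assumption: a DC is healthy only if every one of its sensors is healthy.
        state[r.dc] = state.get(r.dc, True) and r.healthy
    return {dc: "healthy" if ok else "unhealthy" for dc, ok in state.items()}

production_state = aggregate([
    Reading("dc-east", "grafana:latency", True),
    Reading("dc-east", "http:/health", False),
    Reading("dc-west", "grafana:latency", True),
])
```

Here `production_state` is a plain dictionary; in a real system it would be persisted to the central database the post mentions.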


Building on this foundation, we introduced actions and automations to respond to changes in the system. These actions, whether triggered manually or automatically, allowed us to validate the system state before making significant moves like shifting traffic.


For instance, if a DC faltered, the system would ensure the target DC was ready before initiating the shift, preventing unnecessary complications. Over time, predictable scenarios were automated, enabling us to respond quickly and confidently without sacrificing control.
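A guarded action of this kind can be sketched as a precondition check in front of the shift itself. The function and exception names below are hypothetical, and `do_shift` is a placeholder for whatever actually moves the traffic.

```python
class PreconditionFailed(Exception):
    """Raised when the production state does not allow the requested action."""

def shift_traffic(production_state, source, target, do_shift):
    """Run do_shift(source, target) only if the target DC is reported healthy."""
    if production_state.get(target) != "healthy":
        raise PreconditionFailed(f"{target} is not ready to receive traffic")
    do_shift(source, target)
    return f"shifted {source} -> {target}"

calls = []
state = {"dc-east": "unhealthy", "dc-west": "healthy"}
result = shift_traffic(state, "dc-east", "dc-west",
                       lambda s, t: calls.append((s, t)))
```

Because the check runs for manual and automated triggers alike, a shift toward an unhealthy DC fails early instead of making an incident worse.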


Together, these components formed the foundation for a resilient and adaptive traffic management system. By consolidating data, simplifying decision-making, and enabling rapid responses, Production State transformed how we approach challenges in our infrastructure, setting the stage for seamless traffic operations.



Enter Traffic Light: The Maestro of Traffic Shifts


As we refined Production State, a need emerged for a dedicated tool to manage the complexities of traffic transitions. Thus, Traffic Light was born, orchestrating everything from database remastering to seamless traffic moves.


Here are some of its star features:


  • Prewarm and Prewarm Monitor: Before moving traffic, Traffic Light calculates the necessary pod capacity at the target DC and preps it in advance. This ensures a controlled autoscaling event. The system waits until at least 90% of pods are ready before giving the green light, making transitions as smooth as silk.

  • DNS Updates: When a traffic move is necessary, Traffic Light updates DNS configurations with precision, rerouting traffic efficiently.

  • Artifact Pinning: To prevent latency issues, we "pin" certain services to their primary DCs, ensuring localized operations and avoiding unnecessary cross-region calls.

  • Database Remastering: Once a cumbersome and error-prone process, database remastering is now automated via AWS Step Functions, reducing execution time from hours to mere minutes.
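The prewarm gate described in the first bullet can be sketched as follows. The 90% threshold comes from the post; everything else is an assumption for illustration, in particular the capacity rule (the target must run its own pods plus the source's) and the polling helper, since the post does not say how capacity is actually calculated.

```python
def required_pods(target_current, source_current):
    # Assumption: the target must absorb the source's load on top of its own.
    return target_current + source_current

def prewarm_ready(ready, desired):
    """True once at least 90% of the desired pods are ready."""
    if desired <= 0:
        return True
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    return ready * 10 >= desired * 9

def wait_for_prewarm(get_ready_count, desired, max_polls):
    """Poll the ready-pod count up to max_polls times; report whether the gate opened."""
    for _ in range(max_polls):
        if prewarm_ready(get_ready_count(), desired):
            return True
    return False

desired = required_pods(target_current=40, source_current=60)
counts = iter([50, 80, 91])   # simulated ready-pod counts over three polls
green_light = wait_for_prewarm(lambda: next(counts), desired, max_polls=3)
```

A production monitor would sleep between polls and time out with an alert; both are omitted here to keep the sketch short.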



Lessons Learned and Challenges Overcome


Innovation rarely follows a straight path, and ours was no exception. Initially, we relied on user-defined reactions that were inconsistent and hard to maintain. The solution? Intents - predefined actions tied to specific objectives like shifting traffic or resolving crises. Intents brought consistency and made our systems more robust.
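One way to picture Intents is as a closed set of named objectives, each mapped to a fixed list of steps, so that ad hoc reactions are rejected rather than executed. The intent names and step lists below are hypothetical; the post does not enumerate the real ones.

```python
from enum import Enum

class Intent(Enum):
    SHIFT_TRAFFIC = "shift_traffic"
    REVERT_TRAFFIC = "revert_traffic"

# Hypothetical step lists; the actual steps and their order are not given in the post.
PLAYBOOKS = {
    Intent.SHIFT_TRAFFIC: ["prewarm_target", "wait_for_ready_pods", "update_dns"],
    Intent.REVERT_TRAFFIC: ["update_dns"],
}

def plan(intent_name):
    """Resolve a requested intent to its fixed list of steps; reject anything else."""
    try:
        intent = Intent(intent_name)
    except ValueError:
        raise ValueError(f"unknown intent: {intent_name!r}") from None
    return list(PLAYBOOKS[intent])

steps = plan("shift_traffic")
```

Keeping the set closed is what provides the consistency: every traffic shift runs the same reviewed steps, whoever or whatever triggers it.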


We also discovered the importance of keeping DCs "warm." After moving traffic, we keep the original DC operational for 30 minutes. This buffer allows us to quickly revert traffic if needed, eliminating the time-intensive prewarming step. It’s a simple practice with a profound impact.
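The keep-warm rule reduces to a simple time-window check. The 30-minute figure is from the post; the function name and the example timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

KEEP_WARM = timedelta(minutes=30)   # buffer stated in the post

def can_fast_revert(moved_at, now, keep_warm=KEEP_WARM):
    """True while the original DC is still warm, so a revert can skip prewarming."""
    return now - moved_at <= keep_warm

moved_at = datetime(2024, 1, 1, 12, 0)
within_window = can_fast_revert(moved_at, datetime(2024, 1, 1, 12, 20))
after_window = can_fast_revert(moved_at, datetime(2024, 1, 1, 12, 45))
```

Inside the window a revert can go straight to the traffic move; after it, the original DC has to be prewarmed again like any other target.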



The Payoff


Looking back, our journey to transform traffic management and infrastructure resilience has been nothing short of groundbreaking. By implementing Production State and Traffic Light, we’ve managed to redefine how we approach challenges in a global platform. The systems we built are more than tools - they are a testament to what is possible with a clear vision and relentless innovation.


These solutions have empowered us to respond to critical situations with speed and precision, drastically improving our ability to keep the platform running smoothly no matter the circumstances. Automation has played a key role in reducing human intervention, allowing our engineers to focus on what truly matters: solving problems creatively and driving innovation.


At its core, the real success of this transformation lies in building resilience—not just in our systems, but also in our approach to challenges. This resilience ensures that no matter what obstacles we face, we are equipped to adapt, recover, and excel without compromising the experience of our users.

Here are the results that highlight the impact of our efforts:


  • Reduced MTTR: We’ve reduced our mean time to recovery (MTTR) by 80%.

  • Increased Automation: Over 64% of traffic moves are now automated, requiring minimal human intervention.

  • Enhanced Velocity: Our infrastructure updates are faster and safer than ever.



Looking Ahead


Production State and Traffic Light are just the beginning. As we scale further, we’re exploring AI-driven predictive monitoring and advancing our serverless architecture. Our next big leap? Managing all infrastructure changes with the same gradual, controlled approach we use for traffic - like rolling out a new feature to a subset of users before a full release.



A Parting Thought


To my fellow engineers and leaders: Resilience isn’t just about tools. It’s about adopting a mindset of continuous learning, adaptation, and improvement.


I hope our journey inspires you to tackle your own challenges with confidence and creativity. Thank you for joining me on this journey. If you have questions or want to share your own stories, I’d love to hear from you!



This post was written by Ben Chen



