
How Wix Slashed Spark Costs by 50% and Migrated 5,000+ Daily Workflows from EMR to EMR on EKS



Introduction

At Wix, we are constantly evolving our data infrastructure to stay ahead of the curve. As our data workloads grew, we wanted a solution that was cost-efficient, scalable, future-proof and resilient. Our decision to migrate from AWS EMR (Elastic MapReduce) to EMR on EKS (Elastic Kubernetes Service) was driven by these needs, along with the increasing adoption of Kubernetes for big data workloads.


Many cutting-edge Apache Spark features—such as graceful decommissioning on spot instances—are only supported on Kubernetes. By leveraging Kubernetes' flexibility, we aimed to enhance resource utilization, reduce costs, and improve job scheduling efficiency while ensuring a seamless migration experience for our users.


For more background on our internal Spark platform, PlatySpark, and how we built it, check out our first articles on the topic (Introducing PlatySpark: How Wix Built the Ultimate Spark-as-a-Service Platform - Part 1 and Part 2). Let's go over our main motivation points.



Why We Migrated to EMR on EKS


1. Faster Spin-Up Times

One of the biggest operational challenges with EMR was long startup times. Launching new EMR clusters took around 10-15 minutes, significantly delaying workflows, especially in dynamic environments. With EMR on EKS:


  • Nodes spin up in 2-3 minutes.

  • Pods start in seconds.


This improvement enabled us to scale aggressively and respond quickly to workload changes, significantly reducing idle time and optimizing performance.


2. Cost Reduction and Resource Optimization

Running on EMR with YARN led to inefficient resource allocation and wasted capacity. EMR on EKS, combined with binpacking strategies, improved resource utilization by:


  • Packing multiple workloads onto fewer nodes, reducing waste.

  • Utilizing Kubernetes-native auto-scaling, which outperforms EMR's managed scaling on YARN.

  • Enabling dynamic scaling, allowing us to run jobs more efficiently (see the sketch after this list).
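
As an illustration of what the dynamic scaling above involves on the Spark side, here is a minimal sketch of the relevant Spark-on-Kubernetes settings. The values are hypothetical placeholders rather than our production tuning; shuffle tracking is needed because Kubernetes has no external shuffle service.

```python
# Minimal sketch: Spark-on-Kubernetes settings that enable dynamic scaling.
# The numbers below are illustrative placeholders, not production values.
dynamic_scaling_conf = {
    # Grow and shrink the executor fleet with the workload.
    "spark.dynamicAllocation.enabled": "true",
    # Kubernetes has no external shuffle service, so Spark tracks shuffle
    # data to decide which executors are safe to remove.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    # Release idle executors quickly so the cluster autoscaler can scale
    # nodes down and binpacking stays effective.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}
```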


Our goal was to achieve at least 20% cost reduction—we ended up reducing costs by 60% for our shared EMR cluster and 35%-50% for dedicated EMR clusters that were created and terminated for specific applications.


3. Multi-Spark Version Support

A major limitation of EMR was the inability to run multiple Spark versions simultaneously within the same cluster. This was particularly problematic when testing new versions or running applications with different dependencies. With EMR on EKS, we could:


  • Run multiple Spark versions in isolated environments within the same EKS cluster (illustrated in the sketch after this list).

  • Ensure stability while allowing teams to upgrade incrementally.
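
To make the version isolation concrete: on EMR on EKS, the Spark version is a property of each individual job run, chosen through its release label, rather than of the cluster. The job names below are hypothetical.

```python
# Illustrative only: each EMR on EKS job run selects its own release label,
# so different Spark versions coexist in the same EKS cluster.
release_label_per_job = {
    "legacy_daily_etl": "emr-6.10.0-latest",   # ships Spark 3.3.x
    "new_streaming_job": "emr-7.0.0-latest",   # ships Spark 3.5.x
}
```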


4. Improved Spot Instance Utilization

In AWS EMR, using spot instances was challenging due to potential interruptions and limited availability in a single Availability Zone (AZ). With EMR on EKS, we could:


  • Run each application in a dedicated AZ, increasing the probability of acquiring spot instances (see the sketch after this list).

  • Improve reliability while keeping costs low.
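
Here is a rough sketch of how pinning an application to a single AZ can be expressed with Spark-on-Kubernetes settings, together with Spark's graceful-decommissioning options for spot interruptions. The zone value and the use of a plain node selector are assumptions for the example, not a description of PlatySpark's internals.

```python
# Sketch only: keep all pods of one application in a single Availability Zone
# and migrate data off executors that are about to lose their spot node.
# The zone below is a placeholder.
spot_friendly_conf = {
    # Schedule driver and executors onto nodes in one AZ.
    "spark.kubernetes.node.selector.topology.kubernetes.io/zone": "us-east-1a",
    # Graceful decommissioning (Spark 3.1+): move shuffle and cached blocks
    # away from executors whose node received an interruption notice.
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
}
```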



Our Migration Goals

Before starting the migration, we set three primary goals:


  1. Achieve at least 20% cost reduction compared to EMR.

  2. Ensure performance remains on par with EMR, especially for shuffle-heavy workloads.

  3. Make the migration as transparent as possible for our users.


The Migration Process

Migrating from AWS EMR to EMR on EKS was a complex process that required careful planning and execution. Our goal was to ensure a smooth transition with minimal disruption to users.


We adopted a phased approach, gradually shifting workloads to the new environment while continuously monitoring performance and stability.


Here’s how we structured the migration:


Shared EMR Cluster Migration

Our shared EMR cluster is an environment running 3,500+ applications daily across multiple teams. To ensure a smooth migration:


  • We built a shared namespace with queues, mimicking the YARN queue structure.

  • Queue names and limitations remained unchanged.

  • We gradually migrated queues, ensuring stability and preventing failures.

  • 99% of jobs were migrated transparently, without requiring users to make any changes.

  • Job redirection was handled behind the scenes, eliminating disruptions.



Queue creation on the Apache YuniKorn scheduler
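
YuniKorn assigns pods to queues based on labels or annotations on the driver and executor pods; the exact key depends on the placement rules configured. The sketch below assumes a placement rule that reads a "queue" label, and the queue name is a hypothetical example of a YARN-like hierarchy.

```python
# Sketch, assuming a YuniKorn placement rule that reads the "queue" pod label.
# The queue name is a hypothetical example of a YARN-like hierarchy.
def queue_routing_conf(queue: str) -> dict:
    """Spark confs that label driver and executor pods with a YuniKorn queue."""
    return {
        "spark.kubernetes.driver.label.queue": queue,
        "spark.kubernetes.executor.label.queue": queue,
    }

conf = queue_routing_conf("root.analytics.reports")
```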


Dedicated EMR Cluster Migration

For workloads requiring dedicated clusters, the migration process was straightforward. Users needed only minimal changes, such as adding a few parameters to the Airflow operator, since the underlying infrastructure was handled by PlatySpark, our internal Spark platform.


Challenges and Solutions


1. Replacing Apache Livy for Job Submission


Initially, our PlatySpark platform used Apache Livy to submit jobs on EMR. AWS provides a Livy integration for EMR on EKS, so we planned to retain it. However, we faced unexpected issues:


  • Livy consumed excessive CPU and memory, making it a performance bottleneck.

  • A large, non-scalable Livy instance was required to handle job submissions.

  • We had to keep a very large Livy instance running to handle peak times, even though most of the day required a much smaller one.


To resolve this, we replaced Livy with direct interactions with the AWS EMR Containers API, allowing us to eliminate the bottleneck and improve efficiency.
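
For context, here is a minimal sketch of the kind of direct submission that replaces a Livy call, using the emr-containers client from boto3. The cluster ID, role ARN, S3 paths, and Spark parameters are placeholders; PlatySpark wraps additional logic (retries, monitoring, lineage) around this call.

```python
import boto3

# Sketch of a direct job submission through the EMR Containers API, replacing
# a Livy POST /batches call. All identifiers below are placeholders.
emr_containers = boto3.client("emr-containers", region_name="us-east-1")

response = emr_containers.start_job_run(
    name="daily-aggregation",
    virtualClusterId="<virtual-cluster-id>",
    executionRoleArn="arn:aws:iam::111122223333:role/emr-on-eks-job-role",
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/jobs/daily_aggregation.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.instances=10 "
                "--conf spark.executor.memory=8g "
                "--conf spark.executor.cores=4"
            ),
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://example-bucket/spark-logs/"
            }
        }
    },
)

print("Submitted job run:", response["id"])
```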


Java processes created on the Apache Livy instance when spinning up Spark drivers


HLD of our Spark platform for EMR


HLD of our Spark platform for EMR on EKS


2. Choosing the Right Kubernetes Scheduler

Scheduling is critical for Spark workloads. We evaluated several Kubernetes scheduling and autoscaling options:


  • Karpenter (AWS Native Auto-Scaler): Lacked essential features such as a UI, API for resource tracking, and internal queue management.

  • Volcano (CNCF project): Had sparse documentation and required S3 files for queue configurations, making it inconvenient. It has some nice features, but it still wasn't a good fit for us.

  • Apache YuniKorn on top of Karpenter (Final Choice):

    • Feature-rich: UI, API for resource tracking, and built-in queue management.

    • Easy configuration: Allowed defining internal queues and namespace-level limits.

    • Best fit for our Spark-on-Kubernetes workloads.


In our migration from AWS EMR to EMR on EKS, selecting an appropriate scheduler was crucial. After evaluating several options, we chose Apache YuniKorn for its comprehensive feature set that aligns with our multi-tenant, large-scale, and cloud-native environment.


With YuniKorn 1.6.0, we encountered issues related to orphan allocation on nodes. It turned out to be a known bug that was fixed in version 1.6.1, and after upgrading, these issues were fully resolved.


Beyond that, there were numerous DevOps-related tasks that we won't detail here, but it's important to highlight that a strong DevOps team is crucial when transitioning to EKS.


Key Features of Apache YuniKorn:

  1. App-Aware Scheduling: Unlike the default Kubernetes scheduler, which operates on a pod-by-pod basis without context, YuniKorn recognizes users, applications, and queues. This awareness allows for fine-grained control over resource quotas, fairness, and priorities, essential for multi-tenant systems.

  2. Hierarchical Resource Queues: YuniKorn supports hierarchical queues that can map logically to organizational structures, providing precise resource control for different tenants. The YuniKorn UI offers a centralized view to monitor resource usage across these queues.

  3. Gang Scheduling: This feature allows an application to request a set of resources (a gang) to be scheduled simultaneously. The application starts only when all requested resources are available, ensuring efficient resource utilization.

  4. Job Ordering and Queuing: YuniKorn enables applications to be queued with policies determining resource allocation order, such as FIFO, Fair, StateAware, or Priority-based. This predictability simplifies client-side operations.

  5. Resource Fairness: In multi-tenant environments, YuniKorn ensures fairness across users and teams, considering weights or priorities to allow more important applications to access additional resources when necessary.

  6. Resource Reservation: YuniKorn can reserve resources for pending requests, preventing starvation of larger or more specific pods in heterogeneous workloads.

  7. Preemption: To maintain fairness and adhere to resource guarantees, YuniKorn can preempt lower-priority tasks, reallocating resources to higher-priority ones when necessary.

By integrating Apache YuniKorn, we enhanced our scheduling capabilities, achieving efficient resource sharing and meeting the complex demands of our diverse workloads.
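
As one concrete example of how these capabilities surface to a Spark job, gang scheduling is requested through pod annotations, following the keys described in YuniKorn's gang-scheduling documentation. The task-group names, sizes, and resources below are illustrative assumptions, not our production values.

```python
import json

# Illustrative sketch: request gang scheduling for a Spark job by annotating
# its pods with YuniKorn task groups. Sizes and resources are placeholders.
task_groups = [
    {"name": "spark-driver", "minMember": 1,
     "minResource": {"cpu": "1", "memory": "2Gi"}},
    {"name": "spark-executors", "minMember": 10,
     "minResource": {"cpu": "4", "memory": "8Gi"}},
]

gang_scheduling_conf = {
    # The driver pod declares the task groups and its own group membership.
    "spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups":
        json.dumps(task_groups),
    "spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name":
        "spark-driver",
    # Executors join the executor task group; the job starts only once the
    # whole gang can be placed.
    "spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name":
        "spark-executors",
}
```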



Performance Findings


Optimizing Storage for Shuffle-Intensive Workloads

During our PoC, we found that using EBS volumes for shuffle-heavy workloads significantly impacted performance due to:


  • High network overhead, since EBS volumes are network-attached rather than local to the node.

  • Increased latency, leading to slower job execution.


To match EMR’s performance, we switched to instance types with local NVMe SSDs (m5d/m6d), which resulted in:


  • Performance parity with EMR on EC2.

  • Elimination of shuffle-related bottlenecks (see the configuration sketch after this list).
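
For reference, Spark on Kubernetes treats any mounted volume whose name starts with spark-local-dir- as scratch space for shuffle and spill data, which is how local NVMe disks can back shuffle-heavy jobs. The mount paths below are placeholder assumptions and presume the NVMe device is already formatted and mounted on the node.

```python
# Sketch, assuming the instance's NVMe disk is mounted on the node at /local1.
# Volumes named "spark-local-dir-*" are used by Spark for shuffle and spill
# data instead of emptyDir (or EBS-backed) storage.
nvme_shuffle_conf = {
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path":
        "/data/spark-local",
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly":
        "false",
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path":
        "/local1",
}
```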


Making the Migration Transparent for Users

Our internal Spark platform, PlatySpark, was enhanced to support Kubernetes natively. Users continued to experience:


  • Live logs (Airflow console and real-time logs).

  • Job monitoring with full visibility.

  • Spark History UI, maintaining up to five days of job history per namespace.

  • Live Spark UI links for debugging.

  • Detailed metrics and lineage tracking.



Final Thoughts

Migrating from AWS EMR to EMR on EKS was a significant undertaking, but the results exceeded our expectations:


  • 60% cost reduction for the shared cluster, 35%-50% for dedicated clusters.

  • Faster startup times, eliminating the need for bootstrap scripts.

  • Performance parity with EMR when using NVMe SSDs.

  • Seamless migration with minimal user impact.


For organizations considering a similar transition, the key takeaways are:


  1. Use Kubernetes-native scheduling solutions like Apache YuniKorn for better queue management and resource tracking.

  2. Replace Apache Livy with direct AWS API interactions to avoid bottlenecks.

  3. Optimize storage for shuffle-intensive workloads by using local NVMe SSDs instead of EBS.

  4. Plan a phased migration approach, ensuring a smooth transition with minimal disruption.


By embracing Kubernetes as the future of Spark, we’ve built a more scalable, cost-efficient, and resilient data platform. Our experience proves that the shift from EMR to EMR on EKS is not just viable—it’s transformational.



This post was written by Almog Gelber


