Introducing PlatySpark: How Wix Built the Ultimate Spark-as-a-Service Platform - Part 1
- Wix Engineering
- May 12
- 6 min read
Updated: May 28

The Challenge
Managing Apache Spark applications at scale can be a daunting task, especially for large organizations with multiple data engineers working on various projects.
At Wix, a leading platform for website creation and business solutions, we process vast amounts of data to support millions of users worldwide. Our data infrastructure handles approximately 50 TB of raw event data every day, capturing over 40 billion user interactions daily. With 5,000 Spark applications running across 60,000 production tables, we rely on Spark and Trino for data processing, orchestrated by Apache Airflow.
Running Spark at this scale is no simple feat. As an infrastructure team, our job is to centralize control over configurations, resource management, monitoring, and governance for all these thousands of jobs.
With over 120 data engineers spread across different teams and business units, managing Spark workloads efficiently is a critical challenge. While AWS EMR provides a solid foundation, we quickly realized that we needed something more streamlined and automated to manage clusters, deploy applications, and handle failures effectively.
This gap in functionality led us to create PlatySpark, our in-house Spark management platform.
PlatySpark simplifies Spark operations by providing a Spark-as-a-Service model, automating cluster management, and enabling seamless Spark job execution through APIs.
In this article, we’ll take you through the reasons we built PlatySpark, its architecture, and how it enhances the overall experience for Spark users in our organization.
Introducing PlatySpark
PlatySpark is a microservices-based platform that simplifies running and managing Spark applications at scale. It provides an API-driven experience for submitting, monitoring, and orchestrating Spark jobs and EMR clusters, integrating seamlessly with Airflow through its dedicated operator.
By automating cluster management, dependency handling, and job execution, PlatySpark enhances reliability, scalability, and ease of use for Spark workloads.
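To make the API-driven experience concrete, here is a minimal sketch of what submitting a job through such a service could look like. The endpoint, payload fields, and response shape below are illustrative assumptions, not PlatySpark's actual API.

import requests

# Hypothetical job submission payload; field names are assumptions.
payload = {
    "job_name": "daily_sessions_agg",
    "language": "python",            # or "scala"
    "job_type": "batch",             # or "streaming"
    "entry_point": "s3://bucket/jobs/daily_sessions_agg.py",
    "cluster_spec": {"instance_type": "r5.4xlarge", "workers": 10},
    "spark_conf": {"spark.sql.shuffle.partitions": "400"},
}

# Hypothetical internal endpoint.
resp = requests.post("https://platyspark.internal/api/v1/jobs", json=payload, timeout=30)
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll the job until it reaches a terminal state.
status = requests.get(f"https://platyspark.internal/api/v1/jobs/{job_id}", timeout=30).json()["state"]
print(job_id, status)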
Why Did We Build PlatySpark?
There are several key reasons why we decided to build our own Spark-as-a-Service platform. Let's break them down:
1. Simplifying Spark Deployment and Usage
Deploying Spark jobs on AWS EMR can feel like navigating a labyrinth. You need to manually configure and create clusters, set up authentication, submit jobs, and constantly monitor them for success, failure, and performance metrics. Not to mention, keeping track of the various configurations for each job can be overwhelming.
With PlatySpark, we wanted to eliminate these manual steps. It abstracts away the complexity of Spark deployment by providing a simple API for submitting jobs, and we even built an Airflow operator to automate this process further. The result? Spark jobs can now be executed seamlessly, saving time and reducing the risk of human error.
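For orchestration, the same submission is wrapped in an Airflow operator. The sketch below shows how a DAG might use it; the operator name, import path, and arguments are assumptions made for illustration, not the operator's real interface.

from datetime import datetime

from airflow import DAG
# Hypothetical import path for the in-house operator.
from platyspark.airflow import PlatySparkSubmitOperator

with DAG(
    dag_id="daily_sessions_agg",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    submit = PlatySparkSubmitOperator(
        task_id="run_sessions_agg",
        entry_point="s3://bucket/jobs/daily_sessions_agg.py",
        language="python",
        cluster_spec={"instance_type": "r5.4xlarge", "workers": 10},
        # The operator submits through the PlatySpark API and then polls the job
        # until it reaches a terminal state, so the DAG author never touches EMR directly.
    )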
2. Automatic Failure Handling
We all know that infrastructure failures are inevitable—clusters go down, applications crash, and things break. But rather than leaving these failures to be dealt with manually, PlatySpark handles them automatically.
We use the concept of DesiredState (what users want) and ActualState (what AWS reports) to track the status of clusters and applications. This ensures that everything is aligned and that there are no discrepancies between the desired configuration and the actual state of the system.
With PlatySpark’s Watchdog component, we can continuously monitor cluster and application health. If a failure occurs, the Watchdog automatically restarts failed streaming applications or clusters.
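A reconciliation loop in the spirit of the Watchdog could look roughly like the sketch below. The state store, helper functions, and restart logic are simplified assumptions; only the boto3 describe_cluster call reflects the real AWS API.

import time

import boto3

emr = boto3.client("emr")

def fetch_desired_state() -> dict:
    # Stand-in for PlatySpark's state store: {cluster_id: desired state}.
    return {}

def restart_cluster(cluster_id: str) -> None:
    # Placeholder: in practice this would recreate the cluster per its DesiredState,
    # resubmit streaming applications, and notify the owning team.
    ...

def reconcile_once(desired: dict) -> None:
    for cluster_id, wanted in desired.items():
        actual = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
        if wanted == "RUNNING" and actual in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
            # Discrepancy: the user wants the cluster up but AWS reports it down.
            restart_cluster(cluster_id)

while True:
    reconcile_once(fetch_desired_state())
    time.sleep(60)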
For batch jobs, we created an Airflow operator that manages edge cases like pod restarts or connection issues. Normally, when a pod restarts, a stateless operator loses job context, causing issues like duplicate runs. Our solution ensures job consistency, preventing issues like data duplication and unnecessary restarts.
PlatySpark also prevents duplicate clusters. For example, what happens if multiple jobs request a cluster with the same spec and name at the exact same millisecond?
PlatySpark ensures that only a single cluster is created, and all jobs—whether two or even dozens—will reuse the same cluster.
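One way to guarantee a single cluster per (name, spec) is an atomic get-or-create keyed by the cluster name and a hash of its spec. The sketch below uses an in-process lock and dict purely to keep the example self-contained; in a real service this would be a unique-constraint insert in a shared database, and the launch helper is a placeholder.

import hashlib
import json
import threading

_lock = threading.Lock()
_clusters: dict = {}   # spec key -> cluster id (stands in for a real database)

def spec_key(name: str, spec: dict) -> str:
    canonical = json.dumps(spec, sort_keys=True)
    return f"{name}:{hashlib.sha256(canonical.encode()).hexdigest()}"

def launch_emr_cluster(spec: dict) -> str:
    # Placeholder standing in for the actual cluster-creation call.
    return "j-EXAMPLE123"

def get_or_create_cluster(name: str, spec: dict) -> str:
    key = spec_key(name, spec)
    # Serialize concurrent requests for the same (name, spec): the first one
    # creates the cluster, every later one reuses it.
    with _lock:
        if key not in _clusters:
            _clusters[key] = launch_emr_cluster(spec)
        return _clusters[key]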
These are just a few of the many edge cases that PlatySpark handles automatically.
Example of an alert notifying the user that a streaming application was restarted by PlatySpark:

Example of an alert notifying the user that a streaming cluster was restarted by PlatySpark:

3. Supporting Multiple Languages & Processing Types
One size doesn’t fit all. While our data infra team primarily works in Scala, our data engineers prefer Python for their Spark workloads. We needed a platform that could accommodate both languages while also supporting both batch and streaming workloads.
PlatySpark is designed with flexibility in mind, supporting both Scala and Python, as well as different Spark job types. This allows our teams to work within the same platform, using their preferred languages, without worrying about compatibility issues.
4. Monitoring, Governance & Observability
Observability and governance are critical for managing Spark workloads at scale. We wanted to give users clear visibility into their jobs, track performance, enforce best practices, and prevent costly inefficiencies.
Real-time Metrics and Alerts
We integrated a Spark Listener that captures key performance metrics and sends them to Prometheus, with a Grafana dashboard providing real-time insights into Spark job execution. Additionally, these metrics are written to an Iceberg table, allowing us to run analytical queries on historical performance data.
Based on these metrics, we implemented an alerting system to notify teams when jobs perform full scans on partitioned tables—a major red flag for performance and cost efficiency. Users are also alerted when their jobs exhibit poor performance (like spill to disk), allowing them to optimize query plans and reduce unnecessary compute costs.
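As a hedged illustration of the kind of check such alerting could perform against the metrics Iceberg table, consider the query below. The table name, column names, and notification channel are invented for this sketch and are not PlatySpark's actual schema.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("full-scan-alert-sketch").getOrCreate()

# Hypothetical Iceberg table holding per-job metrics collected by the Spark Listener.
metrics = spark.read.table("observability.spark_job_metrics")

full_scans = (
    metrics
    .where(F.col("table_is_partitioned") & (F.col("partitions_read") == F.col("partitions_total")))
    .select("job_name", "owner", "table_name", "partitions_total")
)

for row in full_scans.collect():
    # Stand-in for whatever alerting channel is used (Slack, email, ...).
    print(f"ALERT: {row.job_name} ({row.owner}) fully scanned partitioned table {row.table_name}")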
PlatySpark performance alert system:

Iceberg table containing all Spark job data and metrics (actual names replaced):

Resource consumption logged for each Spark app running on the system, with relevant tags:

PlatySpark API Grafana dashboard:

Column-Level Lineage with OpenLineage
Understanding data movement is crucial for governance and debugging. We integrated OpenLineage alongside our custom lineage solution, which tracks how data flows through Spark transformations.
Lineage events are sent to Kafka, where they are consumed and stored in OpenMetadata and Iceberg tables.
This gives us end-to-end column-level lineage tracking, making it easy to audit data pipelines, debug transformations, and optimize queries.
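For reference, wiring the OpenLineage Spark integration into a session is mostly a matter of Spark configuration. The listener class and spark.openlineage.* keys below follow the OpenLineage Spark integration docs, but exact key names and package coordinates vary by version, and the package version, topic, and broker shown here are illustrative assumptions.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lineage-enabled-job")
    # Package version is a placeholder; check the OpenLineage docs for current coordinates.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "kafka")
    .config("spark.openlineage.transport.topicName", "openlineage.events")              # assumed topic
    .config("spark.openlineage.transport.properties.bootstrap.servers", "kafka:9092")   # assumed brokers
    .config("spark.openlineage.namespace", "platyspark")
    .getOrCreate()
)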

Enhancing the Spark UI with Dataflint and Third-Party Tools
While Spark’s built-in UI provides job execution details, it lacks advanced insights into performance bottlenecks. To bridge this gap, we integrated Dataflint, an open-source tool that extends the native Spark UI with additional monitoring capabilities.
Dataflint helps users identify job bottlenecks, making it easier to debug slow-running queries.
We also collaborated with the Dataflint team to add Iceberg-specific alerts and metrics, allowing users to optimize their write strategies and ensure jobs execute efficiently.
For more info: https://github.com/dataflint/spark
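Enabling Dataflint is essentially a Spark configuration change, per the project README linked above. The package version below is a placeholder; check the repository for the current coordinates.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataflint-enabled-job")
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.9")   # placeholder version
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    .getOrCreate()
)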
Enforcing Runtime Limits
For the shared EMR cluster, we enforce a 4-hour runtime limit on Spark jobs. Any job exceeding this limit is automatically canceled, and the user receives a notification. This prevents long-running jobs from consuming excessive resources and ensures fair usage across teams.
For dedicated clusters, users can configure their own runtime limit, which triggers alerts rather than cancellations.
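A simplified sketch of such a runtime-limit check is shown below. The shape of the running-job records, the cancel call, and the notification helper are assumptions for illustration.

from datetime import datetime, timedelta, timezone

SHARED_CLUSTER_LIMIT = timedelta(hours=4)

def cancel_job(job_id: str) -> None:
    # Placeholder for the PlatySpark API call that cancels a job.
    ...

def notify(owner: str, message: str) -> None:
    # Placeholder for the alerting channel.
    print(f"[to {owner}] {message}")

def check_runtime_limits(running_jobs) -> None:
    """running_jobs: iterable of dicts with job_id, owner, cluster_type, started_at, alert_limit."""
    now = datetime.now(timezone.utc)
    for job in running_jobs:
        elapsed = now - job["started_at"]
        if job["cluster_type"] == "shared" and elapsed > SHARED_CLUSTER_LIMIT:
            # Shared cluster: cancel the job and tell the owner.
            cancel_job(job["job_id"])
            notify(job["owner"], f"Job {job['job_id']} exceeded {SHARED_CLUSTER_LIMIT} and was canceled")
        elif job["cluster_type"] == "dedicated" and job.get("alert_limit") and elapsed > job["alert_limit"]:
            # Dedicated cluster: alert only, the job keeps running.
            notify(job["owner"], f"Job {job['job_id']} has been running for {elapsed}")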

Alerting on long-running EMRs
We also make sure that custom clusters do not run for too long (excluding Spark streaming clusters) and alert the users in such cases.

Importantly, every one of these tools comes out of the box with PlatySpark, with no integration work required from users.
5. Future-Proofing
We wanted to build PlatySpark in a way that allows easy integrations. With hundreds of data engineers using this framework, we need to avoid a full-blown migration for every code or infrastructure change.
In fact, one of the next steps that we’re taking is migrating to Kubernetes with PlatySpark. This change will allow us to run Spark on any infrastructure, without requiring changes to user code. Stay tuned for a detailed article on how we made this transition seamlessly!
Conclusion
Bottom line: as a large company, we need centralized infrastructure control.
As an infrastructure team in a big company, we aimed to build a centralized platform that streamlines resource management, enforces best practices, and provides visibility into Spark workloads. Managing EMR clusters at scale involves handling security configurations, optimizing performance settings, and ensuring compliance with organizational policies.
With PlatySpark, we have established a unified interface that abstracts away the complexities of cluster provisioning and Spark configuration. This allows teams to focus on their workloads rather than infrastructure concerns. PlatySpark ensures that every cluster adheres to default configurations for security, resource allocation, and performance tuning.
It also provides a consistent way to manage security groups, IAM roles, S3 access policies, and storage integrations, such as Hive and Iceberg.
Additionally, PlatySpark enhances governance by offering centralized monitoring and control over Spark workloads. By standardizing how clusters are created and managed, we improve operational efficiency, reduce misconfigurations, and ensure that resources are used effectively.
This approach not only simplifies infrastructure management but also empowers data teams to deploy Spark applications with confidence, knowing that best practices are automatically applied.
Part 2 is now available - read it here

This post was written by Almog Gelber
More of Wix Engineering's updates and insights:
Join our Telegram channel
Visit us on GitHub
Subscribe to our YouTube channel