
Introducing PlatySpark: How Wix Built the Ultimate Spark-as-a-Service Platform - Part 2


Wix Engineering

Introduction


In the first part of this article (Introducing PlatySpark: How Wix Built the Ultimate Spark-as-a-Service Platform - Part 1), we discussed the main reasons that led us to build PlatySpark, our "Spark-as-a-Service" platform, and its main benefits. Now, let's dive into the technical details of how PlatySpark is structured.


PlatySpark was designed to abstract away the complexities of running Spark across different infrastructures while providing a seamless user experience. Its architecture ensures scalability, observability, and efficient resource management, and it integrates deeply with both YARN- and Kubernetes-based deployments. NOTE: I will cover the EMR to EKS migration in an upcoming blog post.


In this section, we’ll break down the key components of PlatySpark, how they interact, and how they empower teams to run Spark workloads efficiently with minimal operational overhead.


PlatySpark Architecture: How It Works


At its core, PlatySpark is a collection of microservices working in tandem to deliver a unified experience for Spark users.



PlatySpark Architecture

The diagram shows a high-level overview of the components:


  • fw-spark-server: The main API server that handles the management of Spark clusters and applications.

  • spark-platform-app-builder: A service that builds Python virtual environments and prepares application artifacts for execution.

  • python-aws-manager: A Python package that manages communication with AWS services, such as EMR, ensuring smooth interactions.

  • python-livy-client: A wrapper around Apache Livy that allows for managing Spark jobs via a REST API.

  • PlatySpark Watchdog: An Airflow DAG that monitors cluster and application health, and ensures they stay operational.

  • PlatySpark Cleaner: Another Airflow DAG that cleans up terminated clusters and completed applications from our internal database.

  • PlatySpark Operator: An Airflow operator that uses the PlatySpark API to run batch jobs. The operator fully manages the application, reports its status, fetches logs, and provides a live link to the Spark UI and logs.


PlatySpark follows an API-first approach, ensuring that all data and actions are accessible via a REST API. Every component interacts through well-defined API endpoints, enabling seamless integration and automation.


The PlatySpark Operator in Airflow fully relies on these APIs to perform every action, including submitting Spark jobs, monitoring execution status, fetching logs, and accessing the Spark UI. This API-driven design makes it easy to integrate with other systems, enabling flexibility and scalability in Spark job management.


For example, listing the applications running on a cluster takes a single API call:

import requests

CLUSTER_NAME = 'stream-platform-emr'

# List all applications registered on the given cluster
res = requests.get(url='https://wix_serverless_url/fw-spark-server/list_apps',
                   params={'cluster_name': CLUSTER_NAME})

print(res.__dict__)     # full Response object, useful when debugging
print(res.status_code)
print(res.json())


Optimizing Cluster Management with the ClusterConfig Class


Managing clusters efficiently is a cornerstone of running Spark workloads at scale. As we rely heavily on AWS EMR for our Spark jobs, we needed a flexible way to configure and manage our clusters. This is where the ClusterConfig class in PlatySpark comes in.


The ClusterConfig class allows us to customize every aspect of an EMR cluster, from instance types to scaling policies, and even bootstrap actions. It’s designed to ensure that clusters are optimized for the specific needs of the workload at hand.


What EMR “Types” do we have at Wix?


Development vs Shared vs. Dedicated Clusters

PlatySpark provides three options for cluster configurations:

  • Development environment: A cluster that allows remote connections from your IDE, so you can live-debug your code before deploying to production.

  • Shared Cluster Configuration: If you are running a small- to medium-scale app, the SharedClusterConfig class makes it easy to use the existing shared cluster environment with minimal setup. We configure YARN for fair scheduling, setting resource limits per queue to ensure no single team monopolizes resources.




  • Dedicated Cluster Configuration: For high-scale workloads, or when you need full control over the cluster configuration, the ClusterConfig class lets us fully customize our clusters, including instance fleets, bootstrap actions, and scaling policies. PlatySpark provides default configurations for Hive, Iceberg, and other common Spark settings, which you can extend to tailor clusters to your team's needs.


Why did we create the ClusterConfig class?

Managing EMR clusters efficiently is crucial for optimizing performance and cost in Spark-based applications. The ClusterConfig class enables users to:

  • Define custom cluster configurations that inherit from our basic default configuration (see the sketch after this list).

  • Manage instance types, instance fleets, bootstrap actions, and other configurations through a simple interface.

  • Apply centralized control and validation over user configurations.
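
For instance, a team can capture its own defaults by inheriting from the base configuration. The following is only a sketch: the subclassing pattern, constructor arguments, and tag values are assumptions based on the ClusterConfig usage shown later in this post, not the actual internal implementation.

from emr_cluster import ClusterConfig

class TeamClusterConfig(ClusterConfig):
    """Hypothetical team-level config that extends the PlatySpark defaults."""

    def __init__(self, name: str):
        # Constructor arguments and with_tags follow the ClusterConfig example below;
        # the tag values here are illustrative.
        super().__init__(name=name, is_graviton_emr=True)
        self.with_tags(
            service_role='emr-cross-verticals',
            team='verticals-data',
            project='apps',
            business_unit='BI',
        )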



Shared vs. Dedicated Clusters


Using a Shared Cluster

PlatySpark provides a predefined SharedClusterConfig for the shared cluster environment.

We use it for small- to medium-scale applications, where creating a dedicated cluster for each one would be redundant and time-consuming (every cluster needs to be provisioned and bootstrapped).


The shared cluster runs about 3500 applications a day:

from wixflow.operators.platyspark_cluster_config import SharedClusterConfig 
shared_cluster_conf = SharedClusterConfig()

Configuring a Dedicated Cluster

Example of cluster config definition:


from emr_cluster import ClusterConfig, InstanceWeightRatio, LaunchSpecifications

# Define EMR cluster name
emr_name = "my-emr-cluster"

# Define custom Spark configurations
spark_configs = {
    "spark.executor.memory": "5g",
    "spark.driver.memory": "2g",
    "spark.executor.cores": "3",
}

cluster_config = (
    ClusterConfig(name=emr_name, is_graviton_emr=True)
    .overwrite('AutoTerminationPolicy', {'IdleTimeout': 900})
    .with_instance_fleets(
        volume_size_in_gb_master=400,
        volume_size_in_gb_worker=800,
        number_of_volumes_per_node=2,
        task_target_ondemand_capacity=1,
        core_target_ondemand_capacity=30,
        master_instances=["m5.4xlarge", "m5a.4xlarge"],
        # Set up core and task instance types and weights
        core_instances_and_weights=(
            InstanceWeightRatio()
            .add_instance(instance_type='m5a.16xlarge', weight=6)
            .add_instance(instance_type='m5a.8xlarge', weight=3)
            .add_instance(instance_type='m5.16xlarge', weight=6)
        ),
        task_instances_and_weights=(
            InstanceWeightRatio()
            .add_instance(instance_type='m5a.16xlarge', weight=6)
            .add_instance(instance_type='m5a.8xlarge', weight=3)
            .add_instance(instance_type='m5.16xlarge', weight=6)
        ),
        launch_specifications=LaunchSpecifications()
    )
    .with_tags(
        service_role='emr-cross-verticals',
        team='verticals-data',
        project='apps',
        business_unit='BI'
    )
    .with_cluster_configurations(
        classification="spark-defaults",
        config=spark_configs
    )
)

Providing this kind of interface greatly simplifies the complex JSON structures that Boto3 requires. It allows users to effortlessly create clusters without worrying about security groups or other common configurations, such as Hive settings, Iceberg configurations, S3 buckets, and various other necessary parameters.


Additionally, tagging plays a vital role in cost attribution and resource management. PlatySpark enforces a team tag to ensure every cluster is properly associated with a specific team and project, enabling clear visibility into usage and expenses across projects.



Building the PlatySpark Airflow Operator


As part of our journey to streamline Spark job execution within Airflow, we developed the PlatySpark Airflow Operator. This operator integrates seamlessly with PlatySpark’s API to manage the entire lifecycle of Spark applications.


Why we built it


Managing Spark workloads efficiently in Airflow requires handling cluster creation, job submission, and log retrieval while ensuring smooth monitoring. Instead of manually orchestrating these steps, the PlatySpark Airflow Operator provides an automated and consistent approach. 

It allows users to:


  • Automate EMR cluster management (create/terminate clusters as needed).

  • Seamlessly submit Spark applications via PlatySpark.

  • Fetch logs and expose live Spark UI links for real-time debugging.

  • Enforce governance by managing configurations and resource allocations.


How It Works


The PlatySpark Airflow Operator follows a structured execution flow:

  1. Cluster Management: It verifies if an EMR cluster is running. If not, it spins up a new cluster using the provided ClusterConfig.

  2. Application Packaging: If the Spark application is sourced locally, the operator compresses and uploads it to S3 to ensure accessibility.

  3. Environment Setup: If Python dependencies are specified, the operator dynamically creates a virtual environment or configures EMR bootstrap scripts to handle installations.

  4. Job Submission: The application is submitted to Livy via fw-spark-server, ensuring efficient execution.

  5. Monitoring & Logging: Execution logs are fetched and exposed in Airflow logs, along with direct links to the Spark UI and log locations for enhanced debugging (a rough API-level sketch of steps 4 and 5 follows this list).
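
Conceptually, steps 4 and 5 boil down to REST calls against fw-spark-server, in the same style as the list_apps example earlier. The sketch below is purely illustrative: the submit_app and app_status endpoints, their parameters, and the response fields are assumptions, not the documented API.

import time
import requests

BASE_URL = 'https://wix_serverless_url/fw-spark-server'

# Hypothetical submission call - endpoint name and payload are assumptions
res = requests.post(url=f'{BASE_URL}/submit_app',
                    json={'cluster_name': 'stream-platform-emr',
                          'app_path': 's3://my-bucket/apps/my_app.py'})
app_id = res.json()['app_id']  # assumed response field

# Poll until the application reaches a terminal state (endpoint name is an assumption)
while True:
    state = requests.get(url=f'{BASE_URL}/app_status',
                         params={'app_id': app_id}).json()['state']
    if state in ('success', 'dead', 'killed'):
        break
    time.sleep(30)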


Key Features

  • Supports Multiple Application Sources: Users can specify LocalLocation (Airflow filesystem) or S3Location to define where their Spark application resides.

  • Flexible Cluster Configuration: Users can choose a shared cluster (SharedClusterConfig) or define a custom ClusterConfig to meet specific requirements.

  • Seamless Integration with PlatySpark API: The operator interacts with PlatySpark APIs to submit applications and fetch statuses.

  • Enhanced Debugging: Each execution includes structured logs, error traces, and monitoring dashboards via Grafana links.


How to Use It

The PlatySpark Airflow Operator is available as different task types:

  • PlatySparkCreateClusterOperator: Provisions an EMR cluster.

  • PlatySparkTerminateClusterOperator: Shuts down an EMR cluster.

  • PlatySparkRunAppOperator: Creates or gets a running cluster and runs a Spark job.

Example usage in an Airflow DAG:
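
Here is a minimal sketch of such a DAG. Only the operator and config class names come from the sections above; the operator import path and parameter names (cluster_config, app_location, and so on) are illustrative assumptions rather than the exact signature.

from datetime import datetime
from airflow import DAG

from wixflow.operators.platyspark_cluster_config import SharedClusterConfig
# Assumed import path for the operator - adjust to the actual wixflow module
from wixflow.operators.platyspark_operator import PlatySparkRunAppOperator

with DAG(
    dag_id='platyspark_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:

    run_spark_app = PlatySparkRunAppOperator(
        task_id='run_spark_app',
        # Shared cluster for a small/medium app; pass a custom ClusterConfig for a dedicated cluster
        cluster_config=SharedClusterConfig(),
        # Hypothetical parameter: where the Spark application lives (LocalLocation or S3Location)
        app_location='s3://my-bucket/apps/my_app.py',
    )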



Troubleshooting and Monitoring

To assist with debugging and monitoring, the operator provides:

  • Structured Logs: Operator logs clearly mark each log type (Spark stdout, stderr).

  • Grafana Dashboards: Pre-configured dashboards track job metrics, execution details, and error traces.

  • PlatySpark API Request Tracing: The request ID (X-Wix-Request-Id) is logged to enable detailed tracking of API interactions (see the example below).
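
For example, when calling the API directly (as in the list_apps snippet earlier), the request ID can be captured for correlation with server-side logs. The assumption here is that the server returns it as a response header; only the header name comes from the operator's logging.

import requests

res = requests.get(url='https://wix_serverless_url/fw-spark-server/list_apps',
                   params={'cluster_name': 'stream-platform-emr'})

# Assumption: the request ID is echoed back as a response header
request_id = res.headers.get('X-Wix-Request-Id')
print(f'PlatySpark request id: {request_id}, status: {res.status_code}')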


Metrics are printed directly to the Airflow console, removing the need for costly count() operations and giving users full visibility.


Airflow console

Conclusion


In conclusion, our migration to PlatySpark has been a game-changer, enabling us to seamlessly transition over 5,000 Spark applications from our previous infrastructure, which was error-prone and caused significant stability issues.


The migration primarily involved replacing the existing Airflow operator with the PlatySpark operator, allowing our users to leverage the fully managed infrastructure without worrying about the complexities under the hood.


Today, PlatySpark successfully orchestrates and executes a vast number of jobs in production, demonstrating its reliability and efficiency at scale.


For more details on how we seamlessly migrated thousands of Spark applications from EMR to EMR on EKS using PlatySpark, check out my next article!



Almog Gelber

This post was written by Almog Gelber



