How Wix Cut 50% of Its Data Platform Costs - Without Sacrificing Performance (Part 1)
- Wix Engineering

In this post, we share the journey of optimizing the data platform at Wix, which led to a 50% reduction in monthly data platform costs. The goal was to implement a more cost-effective system while maintaining performance and scalability. The challenges were substantial, including the complexity of managing data across multiple cloud providers, tracking resource usage, and developing a unified pricing model for diverse data assets.
Background
We, the Data Infra group at Wix, build and maintain the infrastructure that allows the company to be data-driven and make smarter decisions. We are strong believers in self-service, so all of our tools are open to all Wix employees. Thus, Wix employees are free to create new data assets without ever talking to a member of the Data Infra group.
We provide tools to ingest different types of data via streaming and batch pipelines: user clickstream, A/B test events, production monitoring events, production DB replicas, domain events, exports from third-party providers, data from internal tools, and more. Our data is structured according to the medallion design pattern, so once raw data is ingested into the bronze layer of our data lake, data professionals keep building pipelines on top of it to populate the silver and gold layers.
To help with different types of data prep we maintain two query engines: Trino and Spark. Trino serves as a SQL gateway into the data platform for ad-hoc queries via IDEs such as Quix or Redash, as the main source of data for reporting tools such as Tableau or Power BI, and for many types of scheduled queries. Spark handles all the load that doesn't fit the memory and CPU limits of the Trino clusters, and allows experienced data engineers to create more complex jobs with custom resource requirements, complex orchestration patterns on top of Airflow, and custom Python dependencies.
Soon enough, the number of data assets and the total cost of the data platform skyrocketed, mainly due to the high number of data professionals, the ever-growing number of Wix users, and the fast pace at which new features are introduced to Wix.
That brings us to the main challenges we saw when starting the journey to a more cost-efficient data platform.
Main challenges
When we started digging into our cost-efficiency project, we quickly learned we only knew our overall monthly cloud spend. We were missing key details, like how many database tables each team had and what they cost, or how much Trino CPU usage each team was responsible for. So, first off, we needed to tackle the fact that we were using several cloud providers.
Challenge 1: Multi-cloud
While AWS is our primary cloud provider - powering most of our workloads on EC2 and storing the majority of our data lake in S3 - we also rely on a variety of other platforms. Confluent powers our real-time streaming, StarTree supports our real-time OLAP use cases, Snowflake is used for user analytics, and GCP hosts a few other workloads.
With costs distributed across multiple providers, and each data team potentially using a different mix of services, we needed a unified way to consolidate and analyze these expenses across the board.
Challenge 2: Cloud resources visibility
Each of our cloud providers offers some form of cost visibility - AWS has Cost Explorer, Confluent exposes a cost API, Snowflake provides account usage tables, and GCP comes with a variety of built-in billing reports.
But expecting data engineers to navigate multiple UIs and APIs just to track costs is neither scalable nor an effective use of their time.
Most major cloud platforms also support resource tagging, which allows for more granular cost breakdowns by team, project, or any custom dimension. While we had tagging in place for most AWS resources, there was no consistent naming convention or enforced structure for tag values - making the data unreliable for meaningful analysis.
To improve cost visibility, we needed to centralize and standardize this information and make it easily accessible for both data professionals and their stakeholders. Speaking in data lake terms, we needed to create a set of uniform tables that expose the granular cost data behind each resource we deployed in each cloud provider.
Challenge 3: Usage attribution
While resource tagging is a valuable tool, it's not sufficient for attributing usage within shared infrastructure - like a Trino cluster running on EC2, a multi-tenant EMR setup, or a shared S3 bucket holding data for the entire lake.
Historically, we tagged shared resources with the owning team label to indicate ownership of the infrastructure itself. But that approach didn’t help us understand which teams were actually using specific data assets within those shared environments. We needed a way to attribute usage more accurately - at the data level, not just the infrastructure level.
Technically, Amazon S3 supports object-level tagging, which could allow for file-level attribution. But maintaining those tags at scale is highly impractical. Our engineers use Spark and Trino to manage Iceberg tables, so tagging would require propagating team metadata through every write path, potentially modifying low-level S3 interactions, and ensuring tags persist through operations like compaction. The engineering overhead and complexity simply outweigh the benefits.
Challenge 4: Pricing model
Once we figured out how to attribute usage to specific teams, the next step was building a cost model around the most common data assets in Wix’s data platform. These include:
User clickstream events
Domain events
Data lake tables
Spark and Trino jobs
User models for personalization
Feature store for Wix’s ML platform
The main challenge was the diversity of technologies powering these assets, and finding a unified pricing model that could work across them all.
For example, user clickstream data flows through EC2, Confluent Kafka, EMR clusters, and S3. Spark jobs run on EMR, Trino queries run on EC2-based clusters, personalization models rely on EC2, Aerospike, and EMR, while the ML platform performs online inference via Aerospike and trains models using AWS SageMaker.
The complexity of this stack meant we couldn’t apply a one-size-fits-all pricing formula. Instead, we had to develop tailored models that reflect the actual resource consumption behind each type of asset.
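To make the shape of such a model concrete, here is a deliberately toy illustration of how the cost of a single data lake table could combine its storage with an attributed share of shared compute. It is only a sketch of the approach; the actual per-asset models (and the numbers behind them) are the subject of Part 2.

```python
# Toy illustration only: pricing a single lake table as its own storage plus a
# share of a shared engine's bill, attributed by the CPU time its reads and
# writes consumed. Real models are per asset type and use actual telemetry.

def table_monthly_cost(
    storage_gb_months: float,        # average GB stored over the month
    s3_price_per_gb_month: float,    # storage rate for the relevant tier/region
    cluster_monthly_cost: float,     # EC2/EMR bill of the shared engine
    table_cpu_seconds: float,        # CPU attributed to this table's jobs
    cluster_cpu_seconds: float,      # total CPU the cluster delivered that month
) -> float:
    storage = storage_gb_months * s3_price_per_gb_month
    compute = cluster_monthly_cost * (table_cpu_seconds / cluster_cpu_seconds)
    return storage + compute

# Example: 2 TB stored, $0.023/GB-month, 1% of a $40k/month cluster's CPU.
print(round(table_monthly_cost(2048, 0.023, 40_000, 36_000, 3_600_000), 2))  # 447.1
```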
How We Addressed the Challenges
Let’s walk through the key changes we made to lay the foundation for addressing the challenges outlined above.
Tags
To bring consistency to our AWS tagging strategy, we took several key steps:
Defined a standardized tagging policy with four levels of granularity: department, team, project and service_role
Worked with infrastructure teams to ensure all AWS resources were properly tagged
Migrated all cloud resources into Terraform modules for better tag enforcement and automation
For example, the Terraform module that deploys a Trino coordinator instance tags it as owned by the "data-engines" team under the "trino" project. In the same manner we tagged all the cloud resources belonging to the many other projects managed by "data-engines". Once this was completed, we could answer questions like "How much money do we pay for Trino?", "What are the most costly components behind Trino?", and so on.
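A tagging policy only stays useful if it is verified continuously. Below is a minimal sketch of an audit job that flags EC2 instances missing any of the four required tag keys; it assumes boto3 and read access to EC2, and everything else about it is illustrative.

```python
# Minimal sketch: report EC2 instances that violate the four-level tagging policy.
# Assumes boto3 with credentials that can describe EC2 instances.
import boto3

REQUIRED_TAGS = {"department", "team", "project", "service_role"}

def untagged_instances(region: str = "us-east-1") -> list[tuple[str, set]]:
    ec2 = boto3.client("ec2", region_name=region)
    offenders = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"] for t in instance.get("Tags", [])}
                missing = REQUIRED_TAGS - tags
                if missing:
                    offenders.append((instance["InstanceId"], missing))
    return offenders

if __name__ == "__main__":
    for instance_id, missing in untagged_instances():
        print(f"{instance_id} is missing tags: {sorted(missing)}")
```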

Unified cost data
Once we established a consistent tagging strategy in AWS, we moved on to consolidating cost data from all our cloud providers into our data lake. Here's how we sourced that data:
AWS Data Exports: detailed cost data exported to S3 in CSV or Parquet format
S3 Inventory: object-level metadata such as size, storage tier, replication status, and more
Confluent Cloud API: daily aggregated cost data via their REST API
Snowflake: detailed usage metrics exposed through a rich set of system tables
GCP data exports: billing data exported directly into a BigQuery dataset
Of course, each provider prefers to keep you within their own ecosystem. AWS stores costs in S3, Snowflake exposes usage through its own tables, and GCP pushes data to BigQuery. To bring everything together in our own data lake, we had to implement additional ETL steps to extract and standardize the data across platforms.
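As an illustration of one such ETL step, here is a minimal PySpark sketch that lands a day of AWS cost-export data in a uniform Iceberg cost table. It assumes a SparkSession with an Iceberg catalog configured; the bucket, table, and column names are illustrative, and the export column names depend on your cost-export settings.

```python
# Minimal sketch: normalize one day of AWS cost-export data into a shared cost table.
# Assumes Spark 3.x with an Iceberg catalog named "lake"; names below are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aws-cost-ingest").getOrCreate()

day = "2024-06-01"  # in practice injected by the orchestrator (Airflow)

raw = spark.read.parquet(f"s3://billing-exports/aws-cur/date={day}/")  # hypothetical path

daily_costs = (
    raw.select(
        F.col("line_item_usage_start_date").cast("date").alias("usage_date"),
        F.col("line_item_product_code").alias("service"),
        F.col("resource_tags_user_team").alias("team"),         # cost-allocation tags
        F.col("resource_tags_user_project").alias("project"),
        F.col("line_item_unblended_cost").cast("double").alias("cost_usd"),
    )
    .groupBy("usage_date", "service", "team", "project")
    .agg(F.sum("cost_usd").alias("cost_usd"))
    .withColumn("provider", F.lit("aws"))
)

# The target table is created once, partitioned by usage_date; overwriting the day's
# partition keeps the job idempotent. Equivalent jobs feed the same layout from
# Confluent, Snowflake, and GCP, so everything can be unioned downstream.
daily_costs.writeTo("lake.finops.cloud_costs_daily").overwritePartitions()
```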
The end result is a set of uniform cost tables in our data lake, exposing the granular cost data from each provider in one place.



Data-level attribution
Once we had consolidated cost data from all our cloud providers into the data lake, the next challenge was understanding who was responsible for using which resources. This might sound straightforward, but in a shared, multi-tenant infrastructure like ours, it's anything but.
Our infrastructure isn’t neatly siloed - multiple teams share Trino clusters, Spark clusters, and massive S3 buckets. So even though we had tags showing which team owned a resource (e.g., a Trino coordinator or an EMR cluster), that didn't tell us who was using it. And without accurate usage attribution, we couldn’t fairly charge teams or help them optimize their spend.
We realized we had to go beyond infrastructure-level ownership and develop data-level attribution.
Here’s how we began laying that groundwork:
Defined a usage attribution strategy: We started by defining the kinds of resources we wanted to attribute - Trino queries, Spark jobs, S3 datasets, clickstream events - and what metadata we would need to track.
Established metadata standards: For each data asset, we needed to enforce consistent metadata that tied back to the team. For example, DAGs in Airflow had to specify team ownership, and that ownership had to be passed downstream to the Spark jobs or Trino queries they executed.
Propagated metadata through the stack: We built conventions and automation to propagate this metadata from the orchestration layer (Airflow) down to query engines (Trino, Spark). This meant injecting ownership tags into SQL comments or Spark config parameters automatically, ensuring traceability from the job trigger all the way to resource usage.
Instrumented job-level telemetry: We began capturing execution logs enriched with ownership metadata and stored them as structured tables in our data lake. This allowed us to trace every CPU-second or GB of data scanned back to a specific team, even inside shared infrastructure.
This groundwork was essential. Without it, all the dashboards and cost reports in the world would have been meaningless - because we wouldn't know who to show them to, or who could take action.
Examples of ownership metadata across different internal and OSS tools:
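On the Airflow side, for instance, ownership can be declared directly on the DAG. The sketch below is illustrative (assuming Airflow 2.4+); the dag_id, tag format, and values are hypothetical rather than our exact convention.

```python
# Minimal sketch: declaring team ownership on an Airflow DAG so every task it
# triggers can be traced back to its owners. Names and tag format are illustrative.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sessions_daily_agg",                      # hypothetical pipeline
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["team:data-engines", "project:trino"],      # surfaces in the Airflow UI and metadata DB
    default_args={"owner": "data-engines"},           # attached to every task instance
) as dag:
    # Downstream operators (Spark submits, Trino queries) read the owner from the
    # DAG definition and forward it to the engine they call.
    run_aggregation = BashOperator(
        task_id="run_aggregation",
        bash_command="echo 'submitting job on behalf of data-engines'",
    )
```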


Examples of metadata that was propagated into Trino and Spark:
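To illustrate the propagation step, the helpers below sketch the general idea: the same ownership metadata is prepended as a comment to the SQL sent to Trino and exposed as Spark conf entries. The helper names and conf key namespace are illustrative; Trino's built-in client tags can serve the same purpose as the SQL comment.

```python
# Minimal sketch: carrying ownership metadata into Trino queries and Spark jobs.
# Helper names and the conf key namespace are illustrative, not our internal API.

OWNERSHIP = {"team": "data-engines", "project": "trino", "dag_id": "sessions_daily_agg"}

def tag_sql(sql: str, meta: dict) -> str:
    """Prepend an ownership comment so the query text itself carries attribution.

    The comment is preserved in Trino's query history and event logs, so CPU time
    and bytes scanned can later be grouped by team.
    """
    header = " ".join(f"{key}={value}" for key, value in meta.items())
    return f"-- owner: {header}\n{sql}"

def spark_owner_conf(meta: dict) -> dict:
    """Expose the same metadata as Spark conf entries (visible in the Spark UI and
    event logs), e.g. passed to spark-submit as --conf key=value pairs."""
    return {f"spark.metadata.owner.{key}": str(value) for key, value in meta.items()}

if __name__ == "__main__":
    print(tag_sql("SELECT site_id, count(*) FROM events.page_views GROUP BY 1", OWNERSHIP))
    print(spark_owner_conf(OWNERSHIP))
```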


Summary and next part
By the end of this foundational phase, we had achieved three critical things:
Visibility into what we were spending and where
Ownership attribution across cloud and data resources
A centralized, queryable view of costs inside our own data lake
These steps were essential - but they were just the beginning. In Part 2, we’ll show how we turned this visibility into action: building cost models, tracking efficiency, engaging data teams, and ultimately reducing our monthly data platform costs by 50%.

This post was written by Valery Frolov