How Wix Cut 50% of Its Data Platform Costs - Without Sacrificing Performance (Part 2)
- Wix Engineering

In Part 1, we shared how we built the foundations for cost-efficiency at Wix’s data platform: creating visibility into multi-cloud costs, enforcing ownership, and building a centralized data model for usage and spend.
Now, with that visibility in place, the real work began - translating insights into impact.
In this post, we’ll walk through how we:
Instrumented jobs and queries with team-level metadata
Defined pricing models and efficiency KPIs
Built dashboards and tooling to drive adoption
Engaged data teams in cost optimization
These efforts helped us cut monthly cloud spend by 50%, without compromising on performance or agility.
Main challenges (continued)
Job-level telemetry
Now that we had a method to propagate ownership metadata to each job and query - whether for Trino or Spark - the next step was capturing job execution events and storing them in the data lake.
For Trino, we utilized custom Event Listeners to track query execution. These listeners captured detailed query stats, including ownership metadata, and exposed this information as a table in the data lake.
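Our listeners themselves are custom plugins on the Trino side, but the propagation half is easy to picture: Trino lets clients attach arbitrary client tags to a query, and an Event Listener can read those tags from the query context when the query completes. A simplified sketch (not our production code) using the Python Trino client, where the "team:"/"pipeline:" tag convention, host, and names are illustrative placeholders:

```python
# A simplified sketch: attaching ownership metadata to a Trino query via
# client tags, which an Event Listener can later read from the query context.
# The "team:"/"pipeline:" convention, host, and names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",  # placeholder
    port=8080,
    user="etl-service",
    catalog="prod",
    schema="analytics",
    client_tags=["team:growth", "pipeline:daily_kpis"],  # ownership metadata
)

cur = conn.cursor()
cur.execute("SELECT count(*) FROM events WHERE ds = DATE '2024-01-01'")
print(cur.fetchall())
```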
For Spark, we relied on SparkListener and OpenLineage to collect metadata for Spark jobs triggered by Airflow, ensuring that we captured the necessary details for each job execution.
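To make this concrete, here is a simplified sketch of attaching the OpenLineage listener to a Spark session. The listener class and spark.openlineage.* keys follow the openlineage-spark integration; the URL, namespace, and the custom spark.app.team property are placeholders rather than our production configuration:

```python
# A simplified sketch of wiring the OpenLineage listener into a Spark session.
# The class and spark.openlineage.* keys follow the openlineage-spark
# integration; URL, namespace, and spark.app.team are placeholders. The
# OpenLineage jar must be on the classpath (e.g. via spark.jars.packages).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily_kpis")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://lineage.internal.example.com")
    .config("spark.openlineage.namespace", "prod")
    # Hypothetical: ownership metadata carried as a plain Spark property,
    # readable by custom listeners from the Spark configuration.
    .config("spark.app.team", "growth")
    .getOrCreate()
)
```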
Rolling out these changes to all Spark jobs was straightforward because we use an in-house framework called Platyspark. It consists of an Airflow operator that communicates with an API layer, which abstracts away the nitty-gritty details: attaching common configuration, managing EMR cluster statuses, routing jobs to the correct EMR cluster, streaming logs, and so on.
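Platyspark itself is internal and not publicly documented, so purely as an illustration, a stripped-down operator along these lines captures the idea - a thin Airflow operator delegating submission to the API layer. The operator name, endpoint, and payload fields below are all hypothetical:

```python
# Hypothetical sketch of a Platyspark-style Airflow operator. Operator name,
# API endpoint, and payload fields are invented for illustration only.
import requests
from airflow.models.baseoperator import BaseOperator

class PlatysparkOperator(BaseOperator):  # hypothetical name
    def __init__(self, job_name: str, team: str, entry_point: str, **kwargs):
        super().__init__(**kwargs)
        self.job_name = job_name
        self.team = team
        self.entry_point = entry_point

    def execute(self, context):
        # The API layer handles common config, EMR routing/statuses, and log
        # streaming; the operator only submits the job and returns its id.
        resp = requests.post(
            "http://platyspark-api.internal.example.com/v1/jobs",  # placeholder
            json={
                "job_name": self.job_name,
                "entry_point": self.entry_point,
                "metadata": {"team": self.team},  # ownership propagated here
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["job_id"]
```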
Data lake attribution
Our data lake includes two Iceberg catalogs used by data engineers: prod and sandbox. Over time, these catalogs have grown significantly, now containing hundreds of schemas and nearly 100,000 tables. These schemas serve different purposes - some are owned by specific teams, others are tied to cross-team projects, and a few are personal or experimental.
To streamline cost attribution, we assigned each schema to a single owning team and cleaned up the prod catalog by removing unused or personal schemas. This ownership mapping allowed us to group storage costs by team, making it possible to allocate costs more fairly and transparently.
Because we use OpenMetadata as our central data catalog, assigning ownership was straightforward.
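With that mapping in hand, attribution itself is a simple join of table sizes to schema owners. A toy sketch, where schemas, sizes, and the $/TB rate are made-up placeholders (in practice, table sizes come from Iceberg metadata):

```python
# A toy sketch of storage attribution once every schema maps to exactly one
# owning team. All values are placeholders for illustration.
schema_owners = {
    "growth_kpis": "growth",
    "payments_marts": "payments",
    "ml_features": "data-science",
}

table_sizes_tb = [  # (schema, table, size in TB)
    ("growth_kpis", "daily_active_users", 12.0),
    ("payments_marts", "transactions", 48.5),
    ("ml_features", "user_embeddings", 30.2),
]

STORAGE_PRICE_PER_TB_MONTH = 23.0  # placeholder rate

cost_by_team: dict[str, float] = {}
for schema, _table, size_tb in table_sizes_tb:
    team = schema_owners[schema]
    cost_by_team[team] = cost_by_team.get(team, 0.0) + size_tb * STORAGE_PRICE_PER_TB_MONTH

print(cost_by_team)  # {'growth': 276.0, 'payments': 1115.5, 'data-science': 694.6}
```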

Pricing model
To calculate the cost of each data asset, we rely on key metrics such as storage size for data lake tables, CPU milliseconds for Trino queries, and CPU seconds for Spark jobs.
For example, to calculate the cost of a single Trino CPU millisecond, we divide the total price of the Trino infrastructure by the sum of all CPU milliseconds consumed. The resulting rate can then be applied to every Trino query, which we can now attribute to a specific team, allowing us to split the bill between the teams that actually use Trino. So, if Trino infrastructure costs $10,000 for a given period and we measured 5 trillion CPU milliseconds, then the cost per CPU millisecond is $0.000000002 - or $2 per billion CPU milliseconds.
The formula can be generalised as:
Price of Cost Unit = Total Price of Infra / Total Number of Units
While this approach is straightforward and easy to explain, it only offers an approximation of the true cost. It doesn’t factor in variables like different storage tier pricing or the higher costs associated with frequently accessed data from object storage. Despite these limitations, this method has proven effective for our needs.
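Expressed in code, the model is just a division followed by a multiplication. A minimal sketch, with numbers matching the Trino example above:

```python
# The pricing model as code: price per cost unit is total infra spend divided
# by total units consumed over the same period.
def price_per_unit(total_infra_cost: float, total_units: float) -> float:
    return total_infra_cost / total_units

# $10,000 of Trino infrastructure over 5 trillion CPU milliseconds.
trino_rate = price_per_unit(10_000, 5_000_000_000_000)
print(f"${trino_rate:.12f} per CPU ms")  # $0.000000002000 per CPU ms

# Attributing a single query: one that burned 250M CPU ms cost about $0.50,
# billed to whichever team owns it.
print(f"${250_000_000 * trino_rate:.2f}")  # $0.50
```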
Efficient infrastructure
With our pricing models in place, we were able to introduce a meaningful efficiency KPI framework to track how infrastructure changes impacted costs over time. Rather than focusing on raw cloud spend, we defined cost-per-unit metrics for key workloads. For example:
Cost of processing 1B clickstream events
Cost of 100M CPU milliseconds for Trino
Cost of 100M CPU milliseconds for Spark
These KPIs gave us a normalized view of infrastructure efficiency that could be tracked daily and monthly. They allowed us to detect regressions and validate improvements following infrastructural changes. In future blog posts we will explore the changes we made to make our infrastructure more cost-efficient - data compaction and cleanup, different storage tiers for older data, TTLs for data assets, and more.
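Computing such a KPI is deliberately simple. A minimal sketch with placeholder numbers, assuming both series are read from the centralized cost and telemetry tables in the data lake:

```python
# A minimal sketch of a daily efficiency KPI: cost per 100M Trino CPU
# milliseconds. All numbers are placeholders.
UNIT = 100_000_000  # 100M CPU milliseconds

daily_trino_cost = {"2024-01-01": 330.0, "2024-01-02": 345.0}  # $ per day
daily_cpu_millis = {"2024-01-01": 160_000_000_000, "2024-01-02": 150_000_000_000}

for day in sorted(daily_trino_cost):
    kpi = daily_trino_cost[day] / (daily_cpu_millis[day] / UNIT)
    print(day, f"${kpi:.4f} per 100M CPU ms")
# Roughly $0.206, then $0.230: cost per unit rose while usage fell, which
# flags a regression; a falling KPI would validate an optimization.
```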
Exposing usage data
We created Grafana dashboards that provided both high-level management views and more detailed, team-specific insights. Each data team had a dedicated dashboard that displayed their daily usage costs for services like Trino, Spark, clickstream events, Snowflake, and more. These dashboards included deep-dive graphs for each category, highlighting the top data assets attributed to each team.
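Under the hood, panels like these are straightforward aggregations over the centralized cost model. A toy rollup of "top data assets for one team", with made-up records:

```python
# A toy rollup of the kind behind a team dashboard panel: top data assets by
# attributed cost for a single team. Records and names are made up.
from collections import defaultdict

daily_costs = [
    # (team, category, asset, attributed cost in $)
    ("growth", "spark", "sessionization_job", 87.5),
    ("growth", "trino", "daily_active_users_query", 41.0),
    ("growth", "storage", "growth_kpis.daily_active_users", 19.3),
    ("payments", "spark", "settlement_job", 64.0),
]

by_asset: dict[tuple[str, str], float] = defaultdict(float)
for team, category, asset, cost in daily_costs:
    if team == "growth":
        by_asset[(category, asset)] += cost

for (category, asset), cost in sorted(by_asset.items(), key=lambda kv: -kv[1]):
    print(f"{category:<8} {asset:<40} ${cost:.2f}")
```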


In addition, we created comprehensive documentation on how to navigate and utilize the dashboards. We also provided actionable "recipes" for each category, offering practical advice on how to optimize costs - such as reducing the cost of clickstream events or improving the efficiency of data lake tables. For example, for one job whose SQL recreated its table on every run, reducing the job frequency from hourly to daily decreased the underlying storage costs by 95%.
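One reason recreate-style jobs inflate storage is that in Iceberg, every rewrite creates a new snapshot whose files are retained until snapshots are expired - so fewer runs, combined with regular snapshot expiry, translate directly into less storage. A sketch with placeholder catalog and table names (expire_snapshots is a standard Iceberg Spark procedure):

```python
# Sketch: expiring old Iceberg snapshots so their data files can be deleted.
# Assumes an Iceberg catalog named 'prod' is configured and the Iceberg
# runtime jar is on the classpath; names and cutoff are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg_maintenance")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Drop snapshots older than the cutoff for a frequently rewritten table.
spark.sql("""
    CALL prod.system.expire_snapshots(
        table => 'growth_kpis.daily_active_users',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```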

Data Teams engagement
We were pleasantly surprised by the level of engagement and the initial cost improvements driven by the data teams. It turns out that providing data professionals with clear, actionable insights unlocks their ability to spot optimization opportunities.
At first, we focused on tackling the low-hanging fruit: optimizing CPU-intensive jobs, cleaning up large, forgotten tables, and addressing neglected BI events.
Once these easy wins were handled, the teams began investing time into more strategic improvements—refactoring jobs, trimming obsolete data lake tables, and introducing sampling for clickstream events where applicable.
Ultimately, it comes down to ROI. Each team must weigh whether the time spent optimizing costs would be better invested in creating new value for the company.
To keep the momentum going, we hold a monthly meeting to review the latest changes and identify emerging trends, and we publish a monthly cost report that highlights both improvements and regressions. Additionally, we plan on exposing the cost of assets within internal UIs to raise awareness.
Key takeaways
To summarise, here are the top takeaways from our journey:
Make cost data visible and accessible - Centralize billing and usage data across all providers into your data lake. Expose it through dashboards that are tailored for both high-level overviews and team-specific insights.
Enforce ownership across data assets - Define clear ownership at the schema, pipeline, and query/job level. Use metadata propagation to track and attribute costs accurately across shared infrastructure.
Build simple, explainable cost models - Even approximate models based on CPU time or storage size can be effective when they are consistent, transparent, and tied to real infrastructure usage.
Introduce efficiency KPIs - Track cost-per-unit metrics like “cost per 1B clickstream events” or “cost per 100M Trino CPU millis” to benchmark improvements and identify regressions over time.
Invest in infrastructure improvements - Tackle inefficiencies at the platform level: automated cleanup, compaction, better compression, tiered storage, and smarter autoscaling all contribute to lower costs.
Empower data teams with guidance - Provide clear documentation and optimization playbooks. Engage teams regularly to highlight cost drivers and celebrate meaningful improvements.
Make cost optimization a continuous process - Establish regular reviews and reporting. Foster a culture where cost awareness is part of day-to-day decision-making, not just a reactive effort during audits or budget cuts.
Conclusion
Our journey toward a more cost-effective data platform wasn’t about cutting corners - it was about building transparency, ownership, and efficiency into every layer of our data infrastructure. By combining a solid pricing model, clear usage attribution, and a culture of accountability, we were able to reduce our monthly cloud spend by 50% - without sacrificing performance or agility.
This transformation wouldn’t have been possible without the close collaboration between infrastructure teams and data professionals. Once the right data surfaced, teams were quick to respond, finding opportunities to trim waste, optimize pipelines, and make smarter decisions about their workloads.
Ultimately, cost optimization is not a one-time project - it’s an ongoing process. As our platform evolves, so will our tooling, processes, and practices. But with the right foundations in place, we’ve made cost awareness part of how we operate - not just a problem to fix when budgets get tight.

This post was written by Valery Frolov