
Building Smart ETLs Using Iceberg Metadata

The challenge


Business is dynamic and constantly changing, and so are the KPIs and logic that business analytics are based on. For data professionals, it can be hard to keep up with the pace of change. When business KPIs change, the data and its metadata (e.g., its schema and how each field is calculated) should change accordingly.




Ideally, the change should be effortless: data pipelines should be smart enough to support these changes without extra work from developers and data stakeholders. If they don’t, the result can be significant extra developer effort and data issues, as our team discovered and will describe in detail later.


To eliminate these data issues in advance, there should be two versions of the metadata:

  1. The required metadata - represents the most up-to-date business KPIs.

  2. The existing metadata - reflects how the existing version of the data warehouse tables was built. This metadata should always be consistent with the data itself, which can be tricky.


In this article, you will learn how to achieve such consistency with Iceberg field and table comments.

Here’s what we did


As part of Wix Data Engineering, my team works on creating hundreds of aggregated tables in an automated process, based on user configurations. Each configuration contains the required aggregations for each field in a table.


For example, one aggregation supplied by the user can be “the daily count of visitors to a site”.


This automated process runs daily, and the user can change the configurations from one daily run to the next. Therefore, in each run, the process has to calculate the difference between the required configurations and the configurations that the data is currently based on.


If the configuration did not change, the process simply extracts the previous day’s data, transforms it, and appends it to the final table. Now let’s say a user changed the configuration so that the daily count of visitors to a site should only include visitors from the US. In this case, the existing data (based on the aggregation “daily count of visitors to a site”) is no longer relevant, so the process has to rebuild this entire field from the full source data, filtering for US data only.
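A simplified sketch of that decision might look like this (the configuration structure and names here are illustrative, not our production code):

```python
# Hypothetical configurations: field name -> aggregation definition.
existing_config = {"visitors": "daily count of visitors to a site"}
required_config = {"visitors": "daily count of visitors to a site, US only"}


def fields_to_rebuild(required: dict, existing: dict) -> set:
    """Return the fields whose aggregation definition changed between runs."""
    return {
        field
        for field, definition in required.items()
        if existing.get(field) != definition
    }


changed = fields_to_rebuild(required_config, existing_config)
if changed:
    # The existing data for these fields is no longer relevant,
    # so they must be rebuilt from the full source data.
    print(f"rebuild from full source data: {changed}")
else:
    # Nothing changed: transform the previous day's data and append it.
    print("append the previous day's data")
```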


As you can see, these configurations and how they change are essential for building our data warehouse tables.


In the past, we stored these configurations in a separate metadata table. This worked for a long time, until one day a system outage occurred right after the data was written but before the metadata table was updated. As a consequence, when the next run tried to calculate the configuration change, it used an old version of the configurations (from two runs prior).


As you can imagine, this caused data issues that were hard to recover from, and it wasted human and compute resources on fixing huge data warehouse tables and informing downstream users.


We had to find a more bulletproof way to store metadata, one that would prevent this type of inconsistency between data and metadata in the future. To do that, we chose to stop separating the data from its metadata and to write them both in one atomic action. We implemented this by using Iceberg field comments to store the users’ configurations.



Iceberg table metadata to the rescue


Iceberg is an open-source table format built for huge tables, originally developed at Netflix to address performance and usability issues in the Hive table format.


An Iceberg table tracks its data files and the changes that happen to them in a highly scalable way. The state of a table is maintained in a table metadata file. So if, for example, the schema of a table is changed, a new metadata file replaces the existing one.

The table metadata file supports field-level comments, so each field can have an optional string associated with it (this feature was added in this PR).

In addition to field-level comments, Iceberg tables also support table-level comments which can also be used as a table-level configuration store.


Luckily for us, Spark also supports field-level metadata: each Spark column alias accepts a metadata attribute. Field- and table-level comments are highly useful for us, as they allow us to update a table’s data and metadata in one atomic action.



Practical example


If you don’t use Iceberg and Spark, you can test them out locally. One easy way to do so is with Docker, as described here by Tabular. Assuming you have a working setup of Spark and Iceberg, here is how you can write Iceberg field comments using PySpark.


Note that the field comment is written to the metadata attribute of the field alias, and has to contain a dictionary that maps the key “comment” to your custom string.
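Here is a minimal sketch of what that can look like (it assumes a Spark session with an Iceberg catalog already configured; the catalog, namespace, table, and comment strings are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("site_a", 120), ("site_b", 45)],
    ["site_id", "visitors"],
)

# Each comment goes into the metadata attribute of the column alias,
# as a dictionary mapping the key "comment" to a custom string.
df_with_comments = df.select(
    F.col("site_id").alias(
        "site_id", metadata={"comment": "unique identifier of a site"}
    ),
    F.col("visitors").alias(
        "visitors", metadata={"comment": "daily count of visitors to a site"}
    ),
)

# Writing the DataFrame to an Iceberg table commits the data and the
# field comments together, in a single atomic action.
df_with_comments.writeTo("demo.analytics.daily_visitors").createOrReplace()
```

The comments can then be read back from the table schema (each StructField carries a metadata dictionary), or by describing the table.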



Table level comments


In addition to field comments, table-level comments can also be useful. For example, you can keep the table’s required start_date or its type (e.g. dimension, scd) in a table comment, and then read it during the run to choose which kind of process to perform.
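For instance, assuming the table type was already stored as the table comment (as shown in the snippet after the next paragraph), a run could read it back and branch on it, roughly like this (table name and values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table-level properties and pick the build strategy from them.
props = {
    row["key"]: row["value"]
    for row in spark.sql(
        "SHOW TBLPROPERTIES demo.analytics.daily_visitors"
    ).collect()
}

table_type = props.get("comment", "dimension")
if table_type == "scd":
    print("running the SCD build")
else:
    print("running the dimension build")
```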


Here is how we write table comments using Spark SQL. Note that the comment is just one of many properties that can be set on Iceberg tables, in a similar way to Hive tables, by using “TBLPROPERTIES”.
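A minimal sketch of that, wrapped in PySpark (table name and property values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The table comment, like any other Iceberg table property, can be set with
# ALTER TABLE ... SET TBLPROPERTIES ('start_date' here is a custom property).
spark.sql("""
    ALTER TABLE demo.analytics.daily_visitors SET TBLPROPERTIES (
        'comment' = 'scd',
        'start_date' = '2023-01-01'
    )
""")
```

Both properties can then be read back with SHOW TBLPROPERTIES, as in the previous snippet.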



Final note


As we demonstrated here, writing the data and its metadata in one atomic action can be achieved using Iceberg field comments, enabling more fault-tolerant ETLs that can adapt to changing business definitions without extra developer effort.


We hope you found this article useful and that you will use Iceberg table metadata in a creative way in your data processes. Let us know in the comments!

 

This post was written by Itai Sevitt


Itai is part of the data engineering guild at Wix, where he has been working for three years. He is also active in the open source community, contributing to various projects.

 

