Migrations are hard, as most of you probably know. A startup faces many challenges, from recruiting the right people, to funding, to developing the product at a decent pace, to staying relevant in the market. Notice that technology migrations are usually not one of them.
Yet once a company and its product mature over several years, technology migrations become part of the norm. Companies that have been developing a product for years eventually have to migrate to newer technologies, even if they planned their initial development process carefully.
In the early days of Wix we invested a lot of effort in big data infrastructure. We were ahead of the curve and leveraged the tools available at the time. With Apache Spark we were early adopters, building our processes on top of the Hadoop ecosystem. Yet as time went by and technology progressed, we quickly found ourselves behind and needed to migrate to a new solution.

What to Focus on
Our experience in planning migrations at scale taught us that there are a few key criteria we need to focus on in order to make migrations successful:
The new system needs to include at least all the capabilities the old system had. This holds as long as those capabilities are not “harmful” to the users and the overall infrastructure; for instance, unlimited cluster usage that comes at the expense of other users.
The teams that maintain the old and new systems need to account for an overlap period during which both systems must be maintained. As a result, the migration effort can stand in the way of developing new products and features.
By default, you should aim for a migration that requires zero effort from your internal users, such as the developers using the infrastructure. Sometimes, though, the risks and complications of a migration require more control and involvement from users in order to reduce those risks.
The people who use the system need to be invested in the migration's outcome. As long as there is interest, the teams involved will stay engaged with the migration process.

For the migration to be considered successful, it is important that migration-related tasks are not pushed to the bottom of the engineers' priority lists, meaning that ideally the teams' day-to-day tasks and “external” tasks should be balanced.
Even when engineers' migration tasks do get decent priority, keep in mind that people have limited time and capacity, and that engineers will have less patience for any errors they are exposed to along the way. Even if it's an important migration, it's better to do as little “harm” to current team tasks as possible.
Therefore, the orchestrator of the migration (TL / team member / TPM / technical product manager) should pay attention to the teams' needs and requirements rather than simply pushing the migration forward.
Balancing responsiveness to those needs with the commitment to complete the migration is the most sensitive point; it requires patience, attention, and a focus on a single goal.
Frequent announcements about the migration's progress, and updates on the main technical obstacles along the way, shed light on how the migration is going and contribute to the teams' motivation to be part of a big and important project.
Our Use Case
At Wix, we run a significant part of our data operation using Spark. At the beginning of 2021, we had roughly 200 Spark jobs running over a single Cloudera Hadoop cluster, so we knew a migration of these workloads to a cloud-native solution like AWS EMR was at the top of our priorities.
We had three main triggers for the migration:
As part of our effort to move all our tables and data lake to Iceberg, we needed to run on a more recent Spark version (>3.0).
To get the benefits of EMR:
Faster upgrades to new Spark versions as they are released.
Better efficiency of resource usage, and therefore cost reduction.
Up until then, we ran on one Cloudera Hadoop cluster (both development and production) that had basically reached its end of life.
Planning the Migration
Once we'd made the decision to migrate, our planning phase focused on three key things:
Workload mapping
Desired high-level architecture
Cost planning
During the initial planning phase, we mapped all the Spark processes that were about to migrate to EMR. We also mapped the capabilities we wanted beyond simply running the jobs, such as a development environment with connections to external sources.
The main idea for production jobs was to have one shared EMR for all low-mid volume ETLs and dedicated clusters for our higher-volume ETLs.
The resources for each EMR cluster default to the shared configuration, but they can be overridden per cluster when a workload requires a different machine type, machine count, SSD capacity, etc.
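As a rough illustration of a default-plus-override scheme (not Wix's actual configuration; all names, machine types, and sizes below are assumptions), the idea could be expressed as a simple Python config:

```python
# Illustrative sketch only -- names, machine types, and sizes are assumptions,
# not Wix's actual configuration.

# Defaults used by the shared EMR cluster for low/mid-volume ETLs.
SHARED_CLUSTER_DEFAULTS = {
    "master_instance_type": "m5.xlarge",
    "core_instance_type": "m5.2xlarge",
    "core_instance_count": 4,
    "ebs_volume_size_gb": 128,
}

# A dedicated cluster for a high-volume ETL overrides only what it needs.
HEAVY_ETL_OVERRIDES = {
    "core_instance_type": "r5.4xlarge",
    "core_instance_count": 12,
    "ebs_volume_size_gb": 512,
}

def build_cluster_config(overrides=None):
    """Merge per-job overrides on top of the shared defaults."""
    config = dict(SHARED_CLUSTER_DEFAULTS)
    config.update(overrides or {})
    return config
```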
As for cost: before diving deep into the migration, consider the architecture, make a rough estimate of how much it is going to cost, and make sure you are not proposing a solution that is far above your budget the day before the migration starts. The cost of each machine type is fully described in the AWS documentation, so our estimate ended up being pretty accurate.
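A back-of-the-envelope estimate might look like the sketch below; the hourly rates and cluster sizes are placeholders, and the real numbers should come from the AWS EMR/EC2 pricing pages.

```python
# Back-of-the-envelope cost estimate. The hourly rates below are placeholders;
# use the current rates from the AWS EMR/EC2 pricing pages for a real estimate.
EC2_HOURLY = 0.40        # assumed blended EC2 price per instance-hour
EMR_SURCHARGE = 0.10     # assumed EMR fee per instance-hour

instances = 10           # average cluster size
hours_per_day = 8        # average daily uptime across jobs
days_per_month = 30

monthly_cost = (EC2_HOURLY + EMR_SURCHARGE) * instances * hours_per_day * days_per_month
print(f"Estimated monthly cost: ${monthly_cost:,.0f}")  # -> $1,200
```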
Open Communication Channels
To kick things off after the planning phase, we gave a presentation to all the data engineers to explain the purpose of the migration and how it was going to play out. We also set a reasonable timeline for the migration, with some buffer, so the teams could prepare and test their ETLs and report any unexpected behavior.
We set up a dedicated Slack channel, which was pretty active during the migration itself. We also wrote an internal knowledge base called ‘How to use an EMR’, covering topics such as how to develop on EMR, how to run a process in a local Airflow environment, what needs to be configured in order to run a process on the shared cluster, and how (and which) processes should run on a dedicated cluster. For the operational DBs, the knowledge base listed all the JARs and configurations that needed to be included for each database (MySQL / MongoDB / MS-SQL).
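To illustrate the kind of per-database configuration such a guide describes, here is a hedged sketch using publicly available Spark connector packages; the coordinates, versions, and helper function are assumptions, not the exact JARs from the internal knowledge base.

```python
# Illustrative only: connector coordinates and versions are examples, not the
# exact JARs listed in Wix's internal knowledge base.
DB_SPARK_PACKAGES = {
    "mysql": ["mysql:mysql-connector-java:8.0.28"],
    "mongodb": ["org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"],
    "mssql": ["com.microsoft.sqlserver:mssql-jdbc:9.4.1.jre8"],
}

def spark_submit_args(dbs):
    """Build the --packages flag for a job that reads from the given DBs."""
    packages = [pkg for db in dbs for pkg in DB_SPARK_PACKAGES[db]]
    return ["--packages", ",".join(packages)]

# Example: a job that reads from MySQL and MongoDB
print(spark_submit_args(["mysql", "mongodb"]))
```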
The Migration Itself
Once we finished putting together the plan, we got to work and actually started migrating everything. As we realized during that process, it required more work on our part than simply moving jobs around.
Since our scheduling system is based on Apache Airflow, our DevOps team implemented an EMR operator unique to Wix, and we needed to adjust all the surrounding extensions as part of the migration (a sketch of the resulting Airflow flow follows the list below), such as:
Connections to our operational databases: MySQL, MongoDB, MS-SQL, BigQuery, etc.
A development cluster (Devenv) for running jobs remotely.
The ability to run all of an ETL's pipelines in a local Airflow environment.
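Since the Wix operator is internal, the sketch below uses the open-source Amazon provider operators as a stand-in to show the same flow; the DAG name, cluster overrides, step definition, and S3 path are all illustrative assumptions, not Wix's actual setup.

```python
# A minimal sketch of an EMR-backed ETL in Airflow, using the open-source
# Amazon provider operators as a stand-in for Wix's internal EMR operator.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [{
    "Name": "daily_etl",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "--deploy-mode", "cluster",
                 "s3://example-bucket/jobs/daily_etl.py"],
    },
}]

with DAG("emr_daily_etl", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Launch a cluster; defaults come from the 'emr_default' connection and
    # can be overridden per DAG, as described above.
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "shared-etl-cluster"},
    )

    # Submit the Spark job as an EMR step.
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEP,
    )

    # Wait for the step to finish; the cluster itself stays alive so later
    # applications can reuse it (auto-termination handles idle clusters).
    wait_for_step = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
    )

    create_cluster >> add_step >> wait_for_step
```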
As for the development environment, we have one cluster dedicated to development in remote mode. To avoid reconfiguring for every development iteration (the IP changes every time a development cluster is terminated and relaunched), the data engineers set the DNS in their IDE once, and the only thing left to do is upload the relevant files for their development.
The EMR's IP can be retrieved with a simple command in a Slack app whenever it is needed for SSH access, the Spark UI, the history server, or any other case.
All EMR clusters were set to stay alive as long as they had a running application, but once a cluster had been idle for 60 minutes with no running application, it was terminated automatically. We chose 60 minutes so that processes sharing a cluster but running different applications at different times would not have to wait for resources (~20 minutes per new application).
These timeouts apply to all cluster types: production (shared and dedicated) and the development cluster. We also enabled the data engineers to run a command to terminate a cluster when they knew it would not be used until the next run (the day after) and no other apps would be waiting on the relaunched cluster.
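One way to express this behavior, assuming EMR's built-in auto-termination policy and the boto3 API (which may differ from how it was actually wired up at Wix), is sketched below; the cluster ID is a placeholder.

```python
# Sketch: a 60-minute idle timeout on a cluster, plus a manual "kill it now"
# helper like the Slack command described above. Cluster ID is a placeholder.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Terminate automatically after 60 minutes with no running application.
emr.put_auto_termination_policy(
    ClusterId="j-EXAMPLECLUSTERID",
    AutoTerminationPolicy={"IdleTimeout": 60 * 60},  # seconds
)

def terminate_now(cluster_id):
    """Manual termination when no further runs are expected until the next day."""
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```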
We wanted to give Wix's data engineers the flexibility to control their EMR compute resources, so they can self-serve and scale based on their needs. That is why the Amazon EMR managed scaling service was implemented as part of the cluster config, so data engineers can easily set the maximum and minimum number of machines (core & task nodes).
Managed scaling was not part of the first version of the operator; once it was added, the efficiency of EMR usage increased dramatically, especially for ETLs whose Spark tasks are spread out along the pipeline.
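A minimal sketch of attaching such a managed scaling policy with boto3 is shown below; the cluster ID and capacity limits are illustrative, not Wix's actual values.

```python
# Sketch: attaching an EMR managed scaling policy so data engineers can set
# min/max capacity themselves. Cluster ID and limits are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLECLUSTERID",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,       # smallest size the cluster can shrink to
            "MaximumCapacityUnits": 20,      # hard cap on core + task nodes
            "MaximumCoreCapacityUnits": 10,  # the remainder is provisioned as task nodes
        }
    },
)
```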
Ganglia, an open-source, scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance, was also easily added at the bootstrap stage. In the screenshot below, you can see the memory usage report for one of the clusters via Ganglia.
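For reference, one common way to get Ganglia onto a cluster is to request it as an EMR application at launch time (on EMR releases that still bundle it); the sketch below is illustrative and may differ from how it was actually added at Wix, and all values are placeholders.

```python
# Sketch: requesting Ganglia alongside Spark when launching a cluster.
# Release label, instance types, and roles are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="shared-etl-cluster",
    ReleaseLabel="emr-6.4.0",
    Applications=[{"Name": "Spark"}, {"Name": "Ganglia"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.2xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```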
Did It Work? The Results

Eventually, at the end of the migration, the resource cost for all the migrated jobs improved by more than 30%, which was not the main goal of the migration, but is definitely worth mentioning.
We can't say we hit the timeline that was set for the migration, but the motivation to solve all the obstacles was definitely strong, and we ended up not far from the initial plan. In the first stage, all the non-unique jobs migrated pretty easily, as planned. The main obstacles were the unique jobs, such as jobs with special keying requirements or jobs that needed specific Python packages; each unique case required its own solution.
In addition, we did encounter instability during the first stages, but each bug we fixed brought us closer to the stable environment we have today.
Summary
To summarize, there are many elements that need to be considered before, during, and after a migration. The value of the effort needs to be communicated to all the teams involved, and their requirements and risks must be weighed alongside the commitment to complete the migration successfully.
AWS EMR made it easy for us to migrate to a cloud-native solution and unlocked many new possibilities we didn't have before, which we are still leveraging to this day.

This post was written by Arik Sasson