- I am running around 60 microservices and managing them is getting difficult, any advice?
- Oh, it’s quite simple. Delete your integration and test environments first.
Sounds crazy? Or maybe brilliant? That's exactly the dilemma our Head of Backend Engineering, Yuval Perry, faced back in 2012.
He was then working for a company which was running more than 60 microservices - and managing things was getting difficult. He approached Aviran Mordo - Wix’s VP Engineering - for advice. That's how he got that "simple, start by deleting things" advice, which definitely wasn’t the answer Yuval was expecting…
In this episode Yuval explains how he had to step out of his comfort zone in a major way to ultimately come up with a solution he himself described as ‘either crazy or brilliant’. Listen now:
Hi everyone, welcome to the Wix Engineering podcast. I’m Ran Levi.
Yuval: My name is Yuval Perry. I’ve been in the world of engineering for 25 years. I started my first professional job when I was 18 back in the IDF and I think I’m one of those people who always knew what they wanted to do.
After 25 years of engineering, most new days in the office aren’t much of a surprise to Yuval. But a few years back, he ran into a challenge he simply was not prepared for.
In this episode, Yuval Perry, facing the total collapse of his company’s software environment, seeks advice from someone who knows what to do. The advice he receives is not what he expects, or wants to hear. But it’s what he needs to hear. And it’s a lesson to any engineer facing scaling issues at their own medium-to-large organization.
Yuval: So that’s a very interesting story.
The problem Yuval faced concerned little, highly-specialized pieces of software called “microservices”.
Yuval: So in the old ages, an application was a monolith, one big process that runs on a computer. And this fits a certain organization, let’s say 20 to 50, maybe 60 engineers. But right now, we’re dealing with companies that have 1 thousand, 100,000 even engineers, and you can’t write – not everybody can participate on the same code base. So you have to split the code base into several code bases in order to manage multiple work and also multiple computation and demands.
At large companies, having a single code base is in many cases unviable. You simply can’t have hundreds or thousands of engineers logging into the same software platform, all “shouting” over one another. It makes more sense to split up the software environment, siloing different people and teams.
Yuval: Back at the time, the company was an Ad tech company which had a huge monolith in production and we decided to go and rewrite in the microservice style.
We just started by building the CI/CD environment and preparing everything for the new microservices instead of ripping them out.
However, Yuval’s company was, perhaps, premature in moving to a microservice-oriented model. As the software added up, it became more difficult to control than they’d anticipated.
Yuval: In just a few years, we had more than 60 microservices in production. We were very proud of each and every one of them because we built a sophisticated CI/CD environment and everything was deployed behind a feature toggle and it was super eventual-consistent. So we thought back then that we did everything by the book.
But as we grew a little bit more, then each person that breaks the build or breaks the ongoing development would have affected and impacted every other person in the team and eventually every time something would break, it stopped the development of all of the company.
We started to feel the load of the management. I mean things started to fail – to break more than often and it got into a point where I couldn’t imagine adding 10 more.
At first glance, what Yuval is saying might not make much sense. After all, the main reason for breaking up the monolithic structure of the software was to allow each developer - or a small group of developers - to write code independently of other developers and groups - why would someone breaking their build affect the development of other microservices in the company? Isn’t preventing such things the whole reasoning behind microservices?
Well, let’s listen again to what Yuval just said.
Yuval: We were very proud of each and every one of them because we built a sophisticated CI/CD environment and everything was deployed behind a feature toggle and it was super eventual-consistent.
That CI/CD environment? That was the crux.
Yuval: Now when the integration environment is not stable, then the whole dev velocity becomes slower and slower each day.
In a sense, dividing the monolith into 60 or so microservices gave the developers greater independence, greater freedom. But at the same time, the integration environment - which was common to all microservices - made them codependent on each other, again. It was like a 100-meter race where all the participants sprint towards the finish line, only to queue up in the last few meters and cross the finish line one by one.
You might be asking yourself: if that was the case, why was the integration system used in the first place? The answer is simple: that CI/CD system was also part of the company’s Quality Assurance process, which made sure that no broken code found its way to production. And when it comes to software, QA is almost always taken very, very seriously.
Yuval: I think because back then we treated the integration environment as our last line of defense before production and we invested a lot of time resourcing it and still it was never stable.
Yuval didn’t know what to do to solve this problem. Maybe greater management and oversight was needed. Perhaps reverting back to the monolith paradigm would return everything back to normal? He decided to seek help.
Yuval: So since I’m a very curious guy, I started to investigate it and being Israeli, I started writing emails to super technical engineers from various companies, CTOs, chief architects, and I had a template that I used to do for the email. I made minor changes for each person and basically I listed challenges which I was sure we were struggling with, we were both struggling with, and I suggested that we would schedule a meeting where we can brainstorm.
To my surprise, 100 percent of the people that received an email wanted to meet right the following week. Super Israeli.
Of all the architects, engineers and CTOs Yuval spoke to, one person, in particular, turned out to be key. It was someone whom he’d never met before.
Yuval: So one of the people I contacted was Aviran Mordo from Wix.
When we met, I described the basic architecture and the design and then our challenges both human-related and tech-related and I remember I told him that if I add 10-20 more services, everything would collapse.
So right in the beginning of his answer, he told us -
Aviran: you should start by deleting your integration and testing environment.
Yuval: I remember that at that time I didn’t know if it was crazy or brilliant, but that’s because it was out of my comfort zone.
Deleting your integration and testing environment? It seemed like a joke. But Aviran was dead serious, and he had plenty of experience when it came to microservices.
Aviran: We started microservices in 2011. The word wasn’t even there, microservices. So we had to actually develop our own management systems, deployment system, monitoring systems.
So currently, well, nobody can really know but it’s about 1500 microservices clusters. Those are unique. So every microservice has at least three instances. Actually, in our case, it’s three instances per data center and we have like four plus. So it’s a lot of instances and a lot of unique microservices.
How did Wix manage to grow to such a massive scale, without a testing and integration environment?
Well, for Aviran, it started with his own personal experience as a developer.
Aviran: Actually, in my past experience, I almost never worked with QA. I had to rely on the quality of the code that I produced and then skip QA process in most cases, until I worked at some point in Lockheed, then in one other small company. I did have experience with QA. But my experience was that the QA process was very long and the impact on the quality of the code, and just speaking on my behalf, was not really big.
When Aviran joined Wix and helped the company grow its microservices infrastructure, they had at first a testing and integration system similar to the one Yuval had built in his company.
Aviran: So we started just like any other standard process. So we had code and we did testing – unit testing – and we had staging and we had QA, but one of the driving forces behind Wix engineering and Wix is “always progress faster”.
And at some point, we always have this process like every year, and Yuval can tell you that, about once a year we stop and we’re thinking, and this is something in our culture, OK, what are we doing wrong and how can we move faster? What are the things that hold us back? And at some point, when we did this stepback and thought about how can we move faster, we just realized that the cost of having a staging environment is really big and this is, let’s say, the biggest thing that holds us back from releasing faster to our customers.
Well, we started thinking about, OK, how – what are the best ways that we can eliminate this problem?
It turns out that there was a way to eliminate this problem, and that’s eliminating the staging and integration platform. But what about QA? Well, let’s again go back to what Yuval said earlier:
Yuval: I think because back then, we treated the integration environment as our last line of defense before production and we invested a lot of time resourcing it and still it was never stable.
“Last line of defense.” That kind of implies that there are other lines of defense - and there are. When practicing Test-Driven Development, a well known software development methodology, developers are expected to create testing environments on their local machines. If executed correctly, this local testing environment can be just as efficient and effective at rooting out bugs as any external QA platform. But - and this is a big but - you have to trust your developers to make their local testing environments as robust and complete as possible, so as to catch all the bugs before they get to production.
This is key. When he first said it, it sounded as if Aviran was saying that testing code isn’t important. But of course it is - that wasn’t what he actually meant.
The point was that Yuval and his company had to trust their developers to be responsible - to test and debug their own code without the centralized system.
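To make the idea of a local, test-first check a little more concrete, here is a minimal sketch in Python. It is only an illustration: the function apply_discount and its rules are hypothetical and are not taken from Yuval's or Wix's code. In Test-Driven Development the two tests would be written first, fail, and then drive the implementation.

```python
import unittest


def apply_discount(price_cents, percent):
    """Hypothetical business rule: price after a percentage discount, in whole cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    # In TDD these tests exist before the function does and fail until it is written.
    def test_ten_percent_off(self):
        self.assertEqual(apply_discount(2000, 10), 1800)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(2000, 150)


# Run the tests on the developer's own machine; no shared environment is involved.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Nothing here depends on a staging server or on another team: the developer who wrote the code is also the one who finds out, within seconds, whether it is broken.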
Yuval: And it’s kind of – you have to really trust your developers and your methodology of testing, of basically TDD, that you will be willing to release your code to production without the – let’s say the QA verification. And not many people or managers are willing to take that risk because, one, they’re not sure or don’t trust their own methodology, or the quality of the code that is produced by the developers.
So you really have to trust your methodology and processes in order to release good code to production and not impact your customers. And this, this leap, is something that engineering management have to do in their mind.
OK, now I put all my trust in my engineers, and in the culture, in my methodology, that I’m willing to take that risk and get rid of this QA staging part of the release process. You can still do QA and testing on the developer’s machine but I don’t need this whole integration environment.
For Aviran, this ‘leap of faith’ came - if not easily - then at least naturally.
Aviran: So Wix’s culture is a developer-centric culture. So we trust the developer to write code, to write their test and actually to release it to production. So once you take out all those protection barriers or gates, then the code quality becomes better because the developers know there is no QA, there is no one else that can check after me other than myself. So you have to write your code, you have to write your test, and you actually deploy it to production.
Aviran says that in order to enable your developers to justify the trust you put in them, you need to give them two things. The first is responsibility, or a sense of ownership of the code they write.
Aviran: We give a lot of power and responsibility to the engineers. So it is basically up to the engineer to write the code and run it on production. So there is no throwing of responsibility to someone else. You don’t have security gates that will check your work. It’s up to you and you are the one that is waking up at night if something happens.
The other is control - a way for developers to see what’s going on in production, and how their code actually behaves in the real world.
Aviran: So every one of the engineers have control. We give the dashboards, we give them control over the deployment and build, and they can do it on their own.
For Aviran and his team, this system of trust worked much better than having standard QA.
Aviran: So at first, we released once a day and then we released two times a day and three times a day, and we learned as we go what are the pain points, and now we’re releasing 500 times a day.
For Yuval, Aviran’s advice was a bit too drastic - and that’s understandable, because to make it work you need to make sure your developers’ local testing environment is as similar as possible to the production environment. That might require a serious investment in infrastructure.
Yuval: The solution has to fit your company’s scale.
I mean we couldn’t copy exactly what Wix suggested because it didn’t fit our scale. I wouldn’t recommend, for instance, a new startup to adopt Wix guidelines because they would need to build crazy infrastructure before the first line of code. So maybe a fresh startup should start with a monolith and rip microservices later. It depends.
But it appears Yuval liked what he heard from Aviran, because today these two work together as colleagues, and Yuval is practicing Aviran’s methodology on a daily basis.
Yuval: So what we try to do is to create an environment where you work only with your PC or laptop in production. There’s nothing in between, because you can’t copy all the microservices and duplicate a large scale system into another environment.
We develop a system that we – using TDD, we achieve high test coverage and then we create mocks for every service, and you really don’t have to speak to any other person in Wix in order to run the service that you are building. Because you can just manage integration on that scale.
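What does it look like to develop a service against a stand-in for its neighbor? The sketch below uses Python's standard unittest.mock module. Everything in it is assumed for illustration: CheckoutService, the pricing client and its get_price method are hypothetical names, not Wix's actual services or tooling.

```python
from unittest.mock import Mock


class CheckoutService:
    """Hypothetical service under development; it depends on a pricing service."""

    def __init__(self, pricing_client):
        self.pricing_client = pricing_client

    def total(self, item_ids):
        # In production each lookup would be a network call to another microservice.
        return sum(self.pricing_client.get_price(item_id) for item_id in item_ids)


def make_pricing_mock(prices):
    """Build a stand-in for the pricing service that answers from a local dict."""
    mock = Mock()
    mock.get_price.side_effect = lambda item_id: prices[item_id]
    return mock


def test_total_sums_prices():
    pricing = make_pricing_mock({"a": 5, "b": 7})
    service = CheckoutService(pricing)
    assert service.total(["a", "b"]) == 12
    # The mock also records how the neighboring service was called.
    assert pricing.get_price.call_count == 2


test_total_sums_prices()
```

The test runs entirely on the developer's laptop. The real pricing service, and the team that owns it, never have to be involved, which is the kind of independence Yuval describes.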
So how do you scale up a system from a few dozen to thousands of microservices? As we’ve seen, it’s not an easy thing to do. If you decide to stick with the good old practice of having a centralised QA system, that system needs to be robust and stable enough so as to not impede the pace of development. And if you decide to ditch the QA system for a more distributed system of local development tests running on developers’ machines - that, too, requires a considerable amount of investment in infrastructure.
But the challenge can be more than just a technical one. Trust is not something that you can code - although, you know, I haven’t checked, there’s probably some Python library out there that does that. But seriously, trust is something that is built over time, and to flourish, it needs to be ingrained in a company’s culture. So, does your company trust its developers to do the right thing? If it doesn't, perhaps it’s time to think things over. Perhaps it’s time to get out of your comfort zone.