FinOps to The Rescue: When Costs, Engineering and Cloud Clash, E17: Full Transcript

Updated: Dec 26, 2021




Every company, and everyone, uses the cloud. We use it because it’s easy, fast and cost effective. But, a lot of the time, we use it in a sub-optimal way. We could be doing better.


Listen to Dvir Mizrahi tell the story of FinOps - developing a smarter and better financial engineering culture.



You can also listen to this episode on Apple podcast, Spotify, Google or on Wix Engineering site. And you can also read the full episode here:

 

Hi and welcome to the Wix Engineering podcast. I’m Ran Levi.


Everybody uses the cloud. Everybody.


You and I use it in our daily lives, when we do things like watch Netflix and scroll through social media. Businesses use the cloud to provide their services to customers, or store data, or fulfill whatever other internet-based tasks they require.


It isn’t just that most people, and most businesses use the cloud--all of us do. Even three years ago, a SaaS company called RightScale found that a full 96% of companies relied on it for some reason or another. The cloud really is everywhere.


It’s everywhere because it works. Distributed computing, it turns out, is more effective at distributing resources and providing high-quality and low-latency service to internet users. And for companies that have been around a while--since before AWS burst onto the scene, a little over a decade ago--cloud is simply cheaper than having to run those old-fashioned data centers on your premises.


Maybe you know this already. What you may not know--what many of us overlook--is that even though the cloud is cost-effective, a lot of the time, it’s being used in a sub-optimal way. We could be doing better.


Dvir: My name is Dvir.


Dvir Mizrahi joined Wix five years ago.


Dvir: My background is technical. I came from the DevOps world and I used to be a developer, and when I joined Wix, it was to create some governance on a migration path that we had from an on-premises data center into the cloud.


DevOps--“development” plus “operations”, Dvir’s background--is a popular methodology that leverages the cloud to produce better software, faster. A lot of developers work within the theory and structures of DevOps, and companies these days love to talk about their DevOps “culture.” But when Dvir joined Wix, he wanted to solve a related, new kind of problem. One that most other developers and most other companies simply weren’t considering.


Dvir: So to answer that, we’ll have to go a bit back when we ran on an on-premises data center.


At any large company building, any time between...I don’t know...1970 and 2010, you were liable to find a room that housed the big, ugly computers that ran everyone’s computers. Aside from the technical considerations involved in running these mini data centers, there were matters of cost.


Dvir: So if I wanted to buy hardware in the old world, there was a procurement process that we actually had negotiations and pricing and we had budget approvals and there was a whole process of many teams related to the cost that governed the actual cost.


Basically, it was expensive to run mini data centers, but it was predictable and easy to track. If you need to buy a server, an ethernet cord, whatever, you go to a manufacturer and buy it. Simple. By contrast…


Dvir: If you want to purchase hardware or infrastructure on the cloud, all you have to do is click a button. And because of that, it creates a financial challenge, because there is no actual governance on how much cost or how much infrastructure - there is no limit on how much you can provision in the cloud. And therefore the margin for error is a lot larger than in an on-premises data center.


As an analogy, imagine yourself approaching a checkout counter at a store.


If you pull cash out of your wallet, it becomes very clear how much money you’re spending. With enough 20s in your hand, you might even start to rethink whether that new pair of shoes is actually worth it. But with a credit card? Heck, it’s no problem at all. Swipe or tap and you’re all set. It hardly feels like you did anything at all.


For businesses, procurement for on-premises data centers was like paying with cash. It took more consideration, it was more present. Procurement on the cloud can happen with a few clicks. It’s much more convenient, but it’s easier to get carried away and lose track--that is, until it’s too late.


Dvir: They definitely know it at the end of the month, right? When they get invoiced, they know how much they have to pay for the cloud.


Wix started to realize they had this problem about five years ago. But they didn’t really know what to do about it, or whom they might turn to for advice.


Dvir: There were some cloud analysts or cloud optimizers but it was usually companies that consulted to the people on the companies that hired them. And when Wix talked with other companies, there was no governance, actual governance on the cloud. What was happening is companies got invoiced by the end of the month and then started investigating where the cost came from.


Without any clear direction or guidance they could follow, the company realized they’d have to fix this problem themselves. So they hired a team to do it. That team included Dvir. They called it “FinOps.” Like DevOps but, you know, finance…


Dvir: It was a hell of a journey. And I think the biggest challenge when we started was that we had no one to compare to. We were alone in this defining of this new role and this new position and function in this organization, in Wix, and we didn’t have anyone to talk to, anyone to compare to and we didn’t know – I know it sounds silly but we didn’t know what we didn’t know.


Not knowing what they didn’t know, the new FinOps team went looking to see if anyone had tried anything like this before.


Dvir: The initial thought was that we don’t have to reinvent the wheel. And when we started checking different vendors at the time, I’m talking five years ago – not today, today there are many good vendors out there. But five years ago, when we looked around, there weren’t that many and their product wasn’t that advanced.


No cloud savings tools were really good enough to commit to, or cheap enough to test out. The only option left was to do everything from scratch. But where to start?


Dvir: We don’t know the cloud services, we don’t know the technology, and we don’t know how to optimize - or there were no best practices.


Their only good idea was to look at the data that AWS provides its customers. Presumably, there would be useful data to glean in those reports.


Dvir: We started looking at that report and at what kind of information the cloud offers us.


As they became more familiar with the information in the reports, a bigger picture began to emerge.


Dvir: We moved forward in this journey into understanding the pricing models, understanding our topology and architecture, understanding technically what is possible and what applies to financial KPI.


Soon, they were building little tools to aid themselves and experimenting with how they could make the data in the reports come out just a little bit nicer.


Dvir: It started like a test project. So we got the report, played around with it, created a small dashboard with the information that we got. And it just grew from there uncontrollably. Because we added another feature and then we said, you know what, let’s support that instead of looking for a vendor that sells me that and, it grew more and more and more features were added, managing reservations and monitoring for non-utilized workloads...


The project was taking shape. Now, with a better understanding of the cloud, the company’s infrastructure, and the associated costs, they came up with goals, KPIs, and a plan for the future. Broadly speaking, we can whittle it all down to three parts. Three pillars.


Dvir: The first one is visibility and monitoring. The second one would be design and optimization. And the third one would be education and mindset.


First, visibility. This was the problem we mentioned earlier. If the people making decisions about cloud resource usage at your company don’t comprehensively understand what’s going on under the hood, you’re going to have problems. It’s too easy to use more of the cloud than you realize, or more than you’re utilizing efficiently.


Dvir: So for example if I’m looking at five instances in the cloud, I just know as a business that I’m running five instances in the cloud. But a FinOps engineer can tell you, oh, those five instances are databases, for example.


Then when you apply the business logic, it creates a lot more business visibility on what you are running. So you’re able to answer business questions. For example, how much my databases cost, how much my Cassandra cluster costs or how much my agenda X is costing me every month, every day.


To create that visibility, it’s not enough to get the invoice. You have to drill down.
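As a rough illustration of what "applying the business logic" to a bill can look like, here is a minimal Python sketch. It assumes a simplified list of billing line items and a hand-maintained mapping from resource IDs to business roles. The field names, resource IDs, and costs are all hypothetical and are not taken from Wix's tooling or from any real billing export.

```python
from collections import defaultdict

# Hypothetical billing line items. The field names are illustrative only,
# not the schema of any real cloud billing report.
LINE_ITEMS = [
    {"resource_id": "i-001", "instance_type": "c5.9xlarge", "cost": 36.72},
    {"resource_id": "i-002", "instance_type": "c5.9xlarge", "cost": 36.72},
    {"resource_id": "i-003", "instance_type": "r5.2xlarge", "cost": 12.10},
    {"resource_id": "i-004", "instance_type": "r5.2xlarge", "cost": 12.10},
    {"resource_id": "i-005", "instance_type": "m5.large", "cost": 2.30},
]

# Hypothetical business mapping, maintained by someone who knows
# what each resource actually does.
RESOURCE_ROLES = {
    "i-001": "cassandra",
    "i-002": "cassandra",
    "i-003": "mysql",
    "i-004": "mysql",
}


def cost_by_role(line_items, roles, default="unmapped"):
    """Sum the cost of each line item under the business role of its resource."""
    totals = defaultdict(float)
    for item in line_items:
        role = roles.get(item["resource_id"], default)
        totals[role] += item["cost"]
    # Round for display; resources with no known role land in `default`.
    return {role: round(total, 2) for role, total in totals.items()}


if __name__ == "__main__":
    for role, total in sorted(cost_by_role(LINE_ITEMS, RESOURCE_ROLES).items()):
        print(f"{role}: ${total:.2f}")
```

The "unmapped" bucket is the useful part of the design: any spend that nobody can attribute to a business workload is exactly the gap in visibility that someone has to drill into.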


The key isn’t really the invoices, or the reports themselves, but who’s reading them in the first place.


Dvir: There are tools out of the box from the different clouds that will allow you to – will help you to gain that visibility. We have Cost Explorer on Amazon, we have the reports in Google Billing, and we have the cost subscription in Azure.


All of them can give you something like a dashboard and you can filter based on time period and different criteria. But eventually, if you don’t know what you’re reading, then you don’t really have that visibility and that’s why it’s important to have someone that’s actually coming from the engineering side and knows what c5.9xlarge means or what inter-AZ traffic means. It will help you map the workload, the business workload better.


Technical people can help finance people interpret the nuts and bolts of what they’re paying for. But being able to see what’s going on is only half the solution.


Dvir: When we’re talking about visibility and monitoring, the day to day would be to create that visibility on the cloud, what we’re paying for. But that’s the initial phase. What we want to do is to highlight anomalies and red flags.


In other words, not just seeing what’s been going on, but being able to keep track of any issues that may pop up down the line.


Dvir: We want to create that monitoring framework to let everyone know if there is a financial incident. And that’s how we categorize that, like an incident and not like someone wasting a lot of money. It’s an actual production incident.


When we have that framework of monitoring and we have those alerts and budget in place, so we will know that everything is aligned - that creates governance.
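One simple way to picture the alerts and budgets described here is the Python sketch below. It flags a day as a "financial incident" when its cost jumps well above the recent trailing average, and it separately checks whether month-to-date spend is on track to exceed a budget. The window size, threshold, and function names are assumptions chosen for illustration, not the method Wix uses.

```python
from statistics import mean, stdev


def find_cost_anomalies(daily_costs, window=7, threshold=3.0):
    """Return the indexes of days whose cost is far above the trailing window.

    A day is flagged when its cost exceeds the mean of the previous `window`
    days by more than `threshold` standard deviations.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        # Use at least 1% of the baseline as the spread so that a perfectly
        # flat history does not flag every tiny increase.
        spread = max(stdev(history), 0.01 * baseline)
        if daily_costs[i] > baseline + threshold * spread:
            anomalies.append(i)
    return anomalies


def over_budget(month_to_date, monthly_budget, day_of_month, days_in_month):
    """Project month-end spend linearly and compare it with the budget."""
    projected = month_to_date / day_of_month * days_in_month
    return projected > monthly_budget


if __name__ == "__main__":
    # Hypothetical daily spend in dollars; day 8 contains a spike.
    daily = [100, 102, 98, 101, 99, 100, 103, 100, 250, 101]
    print("anomalous days:", find_cost_anomalies(daily))
    print("on track to exceed budget:", over_budget(5000, 10000, 10, 30))
```

Treating the flagged day like a production incident, rather than discovering it on the invoice, is the point of the framework: the alert fires while the cause is still recent enough to trace.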


That leads us to the second pillar of FinOps: design and optimization. Not just seeing where the existing holes are, not just patching them, but going that one step further.


Dvir: The strength of the engineering side is to actually do a proactive approach and not a reactive approach. Don’t change your infrastructure after your provision date. But proactively design and be part of the architecture of the provisioning team and the way that your company is built on the cloud.


Proactively design your agendas on the cloud before provisioning them and create the proper gateways in order to get that financial benefit. So you don’t have to pay for something you don’t provision at all.


This could apply to really any function that occurs on the cloud, anywhere in an organization.


Dvir: For example, if we have 10 technical ways to solve a problem but only two actually keep that financial KPI of the company, then my job as a FinOps would be to highlight those two approaches.


After some time, let’s say three months, half a year or a year, we’re going to review everything, make sure that we’re still optimized, meaning that what was right a year ago, we want to make sure that it’s still right this year. And we want to make sure that we chose the right topology with the new releases and the new technology that came out and new services that are offered to us by the cloud.


Simply talking about design optimization can be easier said than done. If the problem for visibility--pillar one--was having people at the table who understood both finance and engineering, the problem with design optimization--pillar two--is that your finance and engineering people might not always agree with one another.


Interviewer: So when cost and performance are naturally going to be a little bit at odds, how do you guys solve those kinds of problems?


Dvir: The short answer would be to find that silver lining.


The silver lining has to be about overcoming each side’s individual interests, to think about the company as a whole.


Dvir: We don’t look to override the financial KPI on top of other criteria. For example, if there might be a chance that I will lose completely a redundancy topology, because of cost, we probably won’t do it. To be honest, everyone knows Wix and our company. We’re a public company. So downtime for Wix will cost a lot more than any optimization that we’ll do.


It's brand impact – you know, everything that comes with it.
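The two ideas above, highlighting the designs that keep the financial KPI while never letting cost override redundancy, can be sketched as a simple filter. In this Python sketch the `DesignOption` fields, the option names, and all the numbers are hypothetical. Redundancy and latency are treated as hard requirements, and cost is only used to choose among the designs that already satisfy them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignOption:
    name: str
    monthly_cost: float    # estimated monthly cost in dollars (hypothetical)
    p99_latency_ms: float  # estimated 99th-percentile latency
    redundant: bool        # does the design survive the loss of one site?


def options_within_kpi(options, max_monthly_cost, max_latency_ms):
    """Return redundant designs that meet latency and cost limits, cheapest first."""
    viable = [
        o for o in options
        if o.redundant                          # never trade redundancy for cost
        and o.p99_latency_ms <= max_latency_ms  # performance requirement
        and o.monthly_cost <= max_monthly_cost  # financial KPI
    ]
    return sorted(viable, key=lambda o: o.monthly_cost)


if __name__ == "__main__":
    candidates = [
        DesignOption("single-node", 1_000, 40, redundant=False),
        DesignOption("multi-az", 9_000, 45, redundant=True),
        DesignOption("multi-region", 4_000, 50, redundant=True),
        DesignOption("over-provisioned", 20_000, 30, redundant=True),
    ]
    for option in options_within_kpi(candidates, 10_000, 60):
        print(option.name, option.monthly_cost)
```

Note that the cheapest candidate is rejected outright because it is not redundant, which mirrors the point that downtime would cost more than the optimization saves.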


As an example of how this works in practice, consider the time Dvir and his team had to rethink Wix’s deployment across their cloud provider: AWS.


Dvir: We had a network topology that utilized the three availability zones in a region.


Now, those of you who don’t know, every region in the cloud consists of several availability zones for redundancy, meaning that if one building is – the connection is dead, then you will have other buildings in the same region where you can deploy your workload.


At the beginning when we started migrating to Amazon, we ran on a single region, in “US East”. And when we started creating our topology, we needed that redundancy and we provisioned our workload in the three availability zones.


Now, every call between one availability zone to the other costs money. When we created another region, anytime we had any issues with the availability zone in the same region, we didn’t rely on the other availability zones. We shifted traffic to a different region. And eventually when we looked at the architecture and we tried to find that financial KPI that we can apply on that, we understood that even if I have a simple issue with one availability zone, I’m shifting my traffic to a whole new region.


So I don’t really need the redundancy on the availability zone level. I can use the redundancy on the region level. And we shifted the architecture from a single-region multi-AZ architecture to a multi-region single-AZ architecture.


It sounds rather technical--almost like a minor definitional change. But that wasn’t it, at all.


Dvir: That shift alone reduced all the traffic between the availability zones and saved Wix about 85 percent of our traffic costs.


85% savings, with no downside.


Dvir: It was an amazing achievement from the financial side. We didn’t give up on redundancy and we didn’t impact our performance.
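The shape of that calculation can be sketched in a few lines of Python. Every number below is made up for illustration: the traffic volumes are not Wix's figures and the per-GB rates are placeholders, not real cloud prices. The sketch also assumes that traffic between regions stays the same before and after the change, so only the traffic between availability zones disappears.

```python
def monthly_traffic_cost(cross_az_gb, cross_region_gb,
                         cross_az_rate, cross_region_rate):
    """Monthly data-transfer cost from traffic volumes (GB) and per-GB rates."""
    return cross_az_gb * cross_az_rate + cross_region_gb * cross_region_rate


def savings_fraction(before, after):
    """Fraction of the original cost that the change removes."""
    return (before - after) / before


if __name__ == "__main__":
    # Placeholder rates in dollars per GB (assumptions, not real prices).
    CROSS_AZ_RATE = 0.02
    CROSS_REGION_RATE = 0.02

    # Single-region, multi-AZ: services constantly call across zones.
    before = monthly_traffic_cost(800_000, 150_000,
                                  CROSS_AZ_RATE, CROSS_REGION_RATE)
    # Multi-region, single-AZ: calls stay inside one zone per region,
    # so the cross-AZ term drops to zero.
    after = monthly_traffic_cost(0, 150_000,
                                 CROSS_AZ_RATE, CROSS_REGION_RATE)

    print(f"before: ${before:,.0f}  after: ${after:,.0f}")
    print(f"saved: {savings_fraction(before, after):.0%} of traffic cost")
```

The takeaway from the model is that when cross-AZ calls dominate the traffic bill, removing that one term accounts for most of the saving, while the redundancy moves up to the region level instead of being dropped.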


So now we have visibility, monitoring, and newly-designed and better-optimized infrastructure. Finally…


Dvir: And for me, the most important thing, a part of the FinOps culture, the thing that this function has to implement in the organization - is to change the engineering mindset.


Winning over hearts and minds.


This one might be the trickiest. Developers, in particular, are super smart and great at learning new things. But sometimes - like every other human being - they’re a little rigid in their preferences and habits. Getting them to change how they’re doing things can be tough. And who wants to think about money, anyway? That’s the job of the finance people.


Dvir: Now it’s something that I like to say in my lectures, but I know that the financial KPI doesn’t have to be in the top five KPIs of an engineer and that’s OK, right? We have dev velocity, performance, resilience, redundancy. We have a lot of KPIs that the engineer has to keep. Financial KPI is not really one of them. But it is still a KPI and when we’re talking to the engineering team…


To convince engineers to see things his way, Dvir makes sure to talk to them in their own language.


Dvir: ...we want to change the terminology and we’re not telling them, like, “you’re wasting money” or “you’re costing a lot”.


What we’re telling them is that they’re executing bad engineering. So you wouldn’t write wasteful code. Why would you run on a wasteful environment?


So those are the three pillars: visibility, design, and culture. FinOps, in a nutshell.


Dvir: And those three pillars actually create our day to day for any objective, for any assignment that we have, even if it’s reviewing a new agenda or creating a new topology or creating a new dashboard of visibility or adding another layer of financial KPI and business KPI. This is our day-to-day.



Interviewer: What is one thing that you would like people to take away from the story that you haven’t yet mentioned?


Dvir: This is a good question and I thought about this a lot. Like, what’s the message that I want to pass on in this podcast? And I think that it’s “don’t give up” - as a company or as an engineer that’s listening to this podcast. Don’t give up on that financial KPI.


I think that everyone is talking about the financial KPI or the cost side in some kind of a – oh, it’s a different department, I’m an engineer. But as I mentioned before, I think that the cost KPI, it’s an engineering KPI. I think that a good engineer will eventually... Like with DevOps as a methodology and culture - engineers actually adopted it, and you don’t have to be a DevOps engineer, you can have an engineer working on code and still apply DevOps methodologies. I think that will be the future for FinOps. I think that people and engineers specifically will apply that culture and methodologies to their day to day workloads because I believe this is good engineering to do so.


I think that if you have several ways to solve a problem but one of them is financially optimized, I think that this is the direction you want to go. And for that, that knowledge gap will be very soon shortened. I believe that people will be more aware of how much their actions on the cloud will cost and I believe that after listening to this podcast and understanding the potential of this function in an organization, I hope that they will apply those best practices and that culture and bring that with them. Because as it happened in Wix, it only took one man to convince everyone that we needed that function in this organization.


Interviewer: So sort of summing things up here, what is the scale of what we’re discussing here? By the end of this FinOps process, from the beginning of what we were describing earlier in the interview to this point now, are we talking about thin margins or medium, or are you guys making a major cut to Wix’s bottom line?


Dvir: It is a major, major, major cut to Wix’s bottom line. I can’t share the specific numbers but I can tell you that I have some job security here. [Laughing]



That’s it for this episode, thank you for listening. For a full list of our previous episodes, visit https://wix.engineering/podcast. The Wix Engineering podcast is produced by PI Media, written by Nate Nelson, produced by Yotam Halachmi and narrated and edited by me, Ran Levi. Special thanks to Morad Stern from Wix. See you in the next episode, bye bye.

 
