Wix Engineering Tech Interviews: Armon Dadgar, Matan Cohen and Ran Schneider

Wix Engineering
Jun 21, 2022
30 min read

We love telling stories behind the daily challenges we face and how we solve them. But we also love hearing about the insights, experiences and the lessons learned by prominent voices in the global community. In this series of tech interviews, our very own Wix Engineers talk directly with some of the most inspiring minds in the tech industry. From software engineering to architecture, open source, and personal growth.

This time, Matan Cohen and Ran Schneider talked with Armon Dadgar, Co-Founder and CTO of HashiCorp and an influential voice in the DevOps world, about the future of infrastructure management, how to manage infrastructure as code (with over two hundred production engineers!) and the vision of HashiCorp.

Oh, and don’t forget to check out the rest of the series:

https://www.youtube.com/watch?v=rCn80CQUEeY

Matan: My group is responsible for building products in the DevOps area - although on a smaller scale than HashiCorp, we are nevertheless building products for DevOps. Our group is called DevOps for Infrastructure and we are responsible for the dev experience and build products in order to make the life of production engineers easier.

Now, this is a very unique situation to be in - this field of product for DevOps, it's quite unique. Not all people have the skill set to do this kind of job.

For example, you need to have a product manager - well for our product, we are the product manager! You need to have programmers that know infrastructure and vice versa, DevOps that know how to build products…

Now, in HashiCorp, you've done it on such a large scale - how do you overcome those problems and issues, and in general, how do you build such a huge company working on DevOps products?

Armon: It's a great question. And it's definitely a tough challenge, right? Because as you mentioned, it's a pretty specialized skill set. I think for us, part of it is a bit of self selection. As you're hiring software engineers who want to work on infrastructure - you get a bit of a different “breed”.

There are a lot of folks who want to work on the end product, to do things that are in front of the customer, they like having that user interaction. And then there are folks who just really love the plumbing of it, where it's like, "I don't really want to deal with the user, I want to build the deploy system”, or “I want to build the infrastructure management”, things like that, the kind of core plumbing stuff.

Part of it is that you're hiring for a slightly different mentality, you're hiring folks who like that kind of plumbing and that infrastructure layer. And the other side of it is that a lot of engineers are just unfamiliar with it and it might seem more intimidating...

So part of it is just getting people exposed to it. And then you find that a lot of software engineers find that problem space exciting. For example, I can personally think back to before HashiCorp when I worked at an ad company. Back then I could have thought something like "Well, is an ad company going to be that interesting? You're just serving ads!". But when you start looking into the constraints and the challenges, you realize that you're working at an immense scale with hundreds of thousands of requests per second. You care deeply about performance - every millisecond matters, you have to have responses in under 50 milliseconds. And you're super conscious of your margins and your cost.

And when you start talking to people about those constraints, then their reaction is often to say that that's actually a pretty interesting problem space. So set aside the fact that it's serving ads, the technical problems are very interesting.

And I think infrastructure is sort of the same thing. If you just talk to people about the plumbing, it can get kind of boring. But then talk to them about the constraints of the problems you're helping solve. Say, how do you help make another 500 or 1,000 engineers productive? Or, say, if you solve some of these underlying infrastructure problems, this is how that’s going to improve the cost, or the agility, or the security posture, etc. And then people can get excited about those things as opposed to the plumbing of it.

So it's about a few of those different things. Find people who are excited about it. Do the right storytelling, speaking about what's interesting about the problem, not just describe the plumbing of it. And then just give people exposure so they don't feel scared of the word “infrastructure”.

Matan: Do you feel that’s the same for product managers as well? Of course, for developers, for DevOps, you can make them quite interested in the core functionality and thus build a product like that, but with product managers, sometimes they want to… be very close to the business. But sometimes, for some other companies the products of DevOps are kind of a gray zone…

Armon: It's a hard problem. I mean, we have that problem internally too. Even our own platform team, even within HashiCorp, even though we ourselves make platform software! So even our own platform team has this problem of finding the right PMs who want to fit into that. We have a few and they're great. The challenge here is that they are really, really hard to find. Finding PMs who want to work on internal platforms is really hard.

The flip side of this has been having our engineering managers take on a bit more of that kind of product management leader role plus supplementing it with technical product management, so TPM.

[...]

Speaking about where you can find those kinds of people - you are going to have to hire them from deep infrastructure companies and alike. They have to be coming out of storage, networking, compute… That's where you're going to find those PMs who have that deep familiarity. Because you just have to spend time in the deep plumbing for it to really make sense to you. Otherwise, you're just like, "Okay, I don't really know what we're talking about here."

Ran: We use a lot of HashiCorp products at Wix - like Vault, Consul, Terraform, etc. And we noticed that most of these products are built on top of Golang. And I know that nowadays a lot of infrastructure-related products are built using Golang. But back then, when you just started the company, it was an immature language without too much community around it. Kind of a risky decision to make back then, not to go with something like Java that was very popular back then and still is today. How did you make that decision?

Armon: It's a really good question and we spent a lot of time debating it. If we go back to around 2012, when we started the project, this was a big question for me. What language do we bet the company on? Me and Mitchell, we played with a bunch of different languages, and so coming in, it wasn't like we were particularly Go experts or something. Actually, most of our time would probably be spent with Python, in the previous company we'd done a lot in Erlang, reasonably comfortable in C/C++.

So I think there were a few different questions for us. One, how easy is it to package and distribute the software? Vagrant famously is in Ruby, and packaging and distributing a Ruby application is a complete nightmare because you need to ship the whole Ruby runtime, and it dynamically links to a bunch of libraries like OpenSSL and libxml and all of this stuff. And you have to manage all of that versioning super carefully, because if it links to the wrong version of libxml, then the library doesn't work and Vagrant crashes… It's this horrible nightmare when you have that kind of dynamic linking on a desktop product like that. We figured it out with Vagrant, but it was horrible and we didn’t want to do that again. We wanted something that can link into a static binary that's fully self-contained, sort of.

Now, there's a bunch of great choices for that, right? You can do that with C/C++, you can do that with Java - build a kind of a “fat” JAR. But you can't really do it with something like Erlang, since it’s kind of a super nightmare to distribute. It also has a billion files and a runtime and all of that kind of stuff. So we're like, "Okay, something like that's going to be challenging." That was one concern.

The second concern was deciding if a certain language was good for building high performance networking software. A lot of what we do - if you think about Vault, Consul, Nomad - these are systems that are designed to process tens of thousands, in some cases hundreds of thousands of transactions per second. So it had to be a relatively high performance language for what we were doing, especially the network-oriented stuff.

That kind of ruled out Ruby and ruled out Python - they're not particularly performant in multi-core and networking use cases. Obviously C++ was a good fit. Java was sort of in the middle. Anyone who's tried dealing with high performance Java knows the nightmare of tuning garbage collectors and heap sizes. It's just kind of a pain to deal with that when you care about performance. And so that was the second concern.

And then the third concern was hiring - how difficult is it to hire, onboard and train people for this? Ruby, Python - great, really easy, more or less. C/C++ would be really hard, not a lot of people are excited about writing C all day... Erlang - there's like 12 people in Sweden that know how to write in Erlang, so not super helpful (laughing).

And that’s when Go was showing up around the scene. This was before 1.0 even, this was Go 0.8 or something, and we were seeing Google do some stuff with Go.

It fit the bill on distribution, because it compiles down to a single static binary. We thought that was really nice. And then they've made a lot of great decisions around minimal runtime. Yes, it has a garbage collector, but it's very different from Java's, it doesn't have nearly the same level of garbage collection performance problems.

It's designed to be a systems language - so think high concurrency Goroutines, all that kind of stuff. Okay, now it fits the bill of being a potentially good language for high performance network software.

And then with the third thing, with hiring, what we really liked about the syntax was that although nobody really knew Go at the time, it isn’t that weird of a language. If you know Java, you can pick it up pretty easily. Same if you know C/C++. Even if you know Python or Ruby, coming to Go is not that weird.

So no matter what you know, you can kind of come to Go and it'll take you a week to three weeks to really get going. Yet it's not weird the way Erlang is weird, or hard the way C or C++ is hard where it takes you years to get good at it.

It fit all those boxes. The biggest risk was trying to think about what would happen if Google abandoned it. But we thought it still was a good bet. One, because in the worst case scenario it would still be open source, the compiler would still be around, it wasn’t going to go anywhere. And two, because the community was early, we thought that as HashiCorp we could become one of the places to go to if you wanted to write Go software, we could help define what the whole Go ecosystem looks like.

And it's fun for us, because almost any major Go project we see, there's a HashiCorp library in it. Whether it's a library that Mitchell wrote or I wrote, or it's a HashiCorp-maintained one, you see our stuff across the whole Go ecosystem. It was by virtue of being early in that community that we could help define it.

It ended up being a great bet. And Google has done a great job stewarding it and driving the development of it. And it's awesome to just watch every generation of it get better, the garbage collector get faster, etc. It's been a good choice.

Ran: When a new company is trying to find a solution for orchestrating containers, the go-to solution nowadays is Kubernetes. And in HashiCorp, you also have a product for scheduling containers and other services that’s called Nomad. Why do you think the majority of the community went with the Kubernetes approach? And do you think that there is maybe a better use case where one should choose Nomad over Kubernetes? For special use cases or something like that?

Armon: Another good question and one we spent a bunch of time thinking about - why did Kubernetes win the market, broadly speaking? I think ultimately the lesson for us comes down to the ecosystem. That’s the biggest difference, right? What Google did with Kubernetes, by virtue of putting it into CNCF and creating a foundation around it, was to set up this big tent around all these other vendors that were all pushing it. So it wasn't just Google talking about Kubernetes. Which they did, of course - they hired a hundred engineers and spent 10 million per year on marketing. But then they also had Red Hat contributing to it and IBM and Cisco and VMware and… pick your favorite, right?

Each of those companies was then involved with developing it, marketing it, advertising Kubernetes. And this creates this enormous surface area of a “Mindshare generation” for Kubernetes. And then there was helping educate users on it, and going to CNCF, and KubeCon and all of those things. Whereas with Nomad it was a much smaller group. It was just HashiCorp talking about it at our events and at HashiConfs, but it's a much smaller audience than what Google, IBM, Microsoft and Cisco and everyone else can advertise to.

Now, anyone who's tried to self-run Kubernetes realizes it's an operational challenge. And so there was a ton of pressure on the Cloud providers to provide and manage Kubernetes, because nobody wanted to run this thing. And so once they had GKE, and AKS, and EKS, then there was this easy button one could just press to make Amazon deal with it... I think that really helped them flip it, the dynamic would have been very different if everyone had to actually operate Kubernetes. Because then they'd be like, "Holy cow, this thing is way too much work".

Looking back, maybe some things we would've done differently, but overall it's that dynamic of having 10 of the world's biggest vendors all lean on the same thing being a very difficult thing to compete with.

Matan: Yeah, it makes sense. Cool. So as I said before, besides bringing tools and products, our group helps automate infrastructure processes at Wix. For example, let's say that a backend engineer needs a database. You need a flow for that, so sometimes you need to create a kind of Terraform as a service. Of course, DevOps engineers write Terraform all day, and this is their code. So two years ago, we needed to find a way to write it programmatically. We write in several languages, Node.js, Golang and many others, and we needed a way to generate Terraform programmatically from them.

Matan: And we were very glad to see that there is Terraform as JSON, so we didn't need templates with HCL and other crazy stuff. It was amazing, and we've created a bunch of abstractions on top of that. We are using [inaudible 00:19:50] and [inaudible 00:19:51] and other stuff in order to write Terraform as JSON. And we found out lately that there is a Terraform CDK, which also seems very promising. We saw that it supports Go, TypeScript and many other languages. And we want to understand your vision: what is the best approach, or when should you go with Terraform as JSON and when should you go with the CDK? It's still in beta, right? Or is the CDK out of beta?

Armon: Oh, that's a good question. I think it's out of beta. That's a good question.

Matan: Anyhow, regarding the vision.

Armon: Yeah, that's a good question. I know there's a whole team and that's all they work on, so I'm pretty sure it's out of beta, but yeah, it's a really good question. So I guess the way we think about it more broadly is that ultimately Terraform core, the Terraform engine, in some sense doesn't care what the input format is, right? At the end of the day, to use a bit of compiler language, all of them get translated into an intermediate representation (IR) for Terraform anyway, right? So there are many different language front doors for Terraform, and then all of it gets compiled into the IR, which is what Terraform actually uses internally to execute on and process the graph.

Armon: And so we thought about it and said, "Okay, great." Obviously you don't want humans to write the IR because that's terrible, right? So, what are the different front doors that actually make sense for people? And I think where we originally started was JSON, right? JSON was always the closest match, in some sense, to what the internal IR is. The problem with JSON is that it's kind of painful to author, right? It's pretty verbose. You don't want to actually sit down and write it in JSON. It becomes a nightmare to deal with once you have too many levels of nesting and things like that. So HCL was really born out of saying, "Hey, we think that 80% of Terraform usage will be humans sitting down and writing it, owning and curating what does my core VPC look like, right? What should this module be?"

Armon: So our expected use case is that 80% of the time it's hand curated. And so HCL was meant to be an easy to read, easy to write, way less verbose version of JSON. They've always had a one-to-one compatibility with one another, but one is obviously way less verbose, because that's what it was designed to solve. The JSON piece always filled that gap where people were like, "I want to auto generate Terraform and programmatically [inaudible 00:22:26]." That's the perfect fit for it, right? I think the evolution with things like the CDK has been people saying, "You know what? As I'm moving away from my ops person being the one responsible for Terraform, closer to my developer, the developers are more comfortable thinking less in terms of configuration and more in terms of whatever native programming language they're used to. And so can we have an alternative for them?" Which is still designed for when you're manually curating it, you're writing it. In the same way the operators are curating the HCL, your developers are curating, let's say, the TypeScript in this case.

Armon: That's really what I think the CDK is more meant to solve. I think if you're building it into a pipeline and programmatically generating and putting it behind things like REST APIs, the JSON format's probably still the best, because I think that's going to give you sort of the total flexibility to do whatever you want, or-

Matan: Whatever language you want.

Armon: Yeah, you can enforce different policies, you can put in your access controls, you can do all of that kind of stuff. And then you're just passing the JSON through the layer. Of course, you can still have users write in CDK, just have that emit the JSON and then you can pass the JSON further down the pipeline. But I think you have a little bit more flexibility with the raw JSON than you do even with the CDK, right? Because you can truly generate it however you want at that point, since it's closer to that raw IR within Terraform. So for us, ultimately our vision is that there are many different front doors that are optimized for different audiences, right? HCL we think will be the 80% use case. I think CDK is probably the 10% use case. And then JSON's probably the other 10% use case, where you're programmatically generating it. But our view is, all of those are valid and all of those are going to be things we support. And in fact, we're working on adding more languages to CDK. So our view is the more, the merrier.
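To make the "programmatically generate JSON behind a pipeline" idea concrete, here is a minimal sketch in Python. It assumes the standard Terraform JSON configuration layout (a top-level `resource` object keyed by resource type and then by local name) and uses the AWS provider's `aws_s3_bucket` resource as the example. The function names, the bucket names and the naming policy are hypothetical illustrations, not anything Wix or HashiCorp described.

```python
import json


def s3_bucket_config(local_name: str, bucket_name: str) -> dict:
    """Build a Terraform JSON configuration that declares one S3 bucket."""
    return {
        "resource": {
            "aws_s3_bucket": {
                local_name: {"bucket": bucket_name}
            }
        }
    }


def violates_naming_policy(config: dict, required_prefix: str) -> list:
    """Return the bucket names that do not start with the required prefix.

    This stands in for the kind of policy check a service could run on the
    generated JSON before handing it to Terraform (hypothetical policy).
    """
    buckets = config.get("resource", {}).get("aws_s3_bucket", {})
    return [
        body["bucket"]
        for body in buckets.values()
        if not body["bucket"].startswith(required_prefix)
    ]


def render(config: dict) -> str:
    """Serialize the configuration; the result could be saved as main.tf.json."""
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    config = s3_bucket_config("assets", "example-team-assets")
    assert not violates_naming_policy(config, "example-")
    print(render(config))
```

Because the configuration is plain data until it is serialized, checks like the naming policy above can run in any language before Terraform ever sees the file, which is the flexibility Armon attributes to the raw JSON path.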

Matan: Yeah. It was very interesting for us, because I don't know what we would have picked. If we'd had a mature CDK two years ago, in Golang or in another language, it's an interesting question which one we would have chosen. We have the JSON and it's very easy to manipulate, as you said, but writing in your own language is something that, as you said, is closer to what developers who tend to infrastructure are doing. So they would probably pick that.

Armon: Well, the risk that I find with them, this is always the challenge, right? We always had this trade off, even with HCL: the more power and flexibility you give people, the more they can shoot themselves in the foot, right? And this is the sort of trade off, right? And the perfect example of this is C and C++. You can do whatever you want, including blow your leg off, right? You want to access random memory space, you can do that. Go for it, right? And so you have this incredible power, but it's really easy to mess things up. Versus you think about a language like Java or Python, it's way more restricted. It's like you're wearing a straitjacket, right? You can't access arbitrary memory, right? It doesn't let you do that.

Armon: And so as a result, there's a lot less ways to blow your leg off, right? Not that you can't do it. And so I think that balance with the HCL was sort of the same thing. We wanted enough flexibility and power, but we purposely constrained the language in a way that we're like, "You can't do arbitrary, crazy things in HCL." And you can drop me into any HCL code base anywhere at any company. And you're like, within 30 seconds, I have a good sense of what's going on, right? Because you can only do so much weirdness. Versus the problem with the raw programming language, and you already see it with CDK is that you're only limited by your imagination, which could be a problem.

Armon: Because some people have a very active imagination. And so you start coming in and there's arbitrary Python, hierarchies of classes, and meta classes. And they're auto injecting things and it takes you five days just to figure out this S3 bucket, how did it even come to exist? Why does it have this name? Because it's inheriting all these properties in these weird ways, right? So that's the problem, I think, with a native programming language: it opens Pandora's box, doing whatever you want. And I think sometimes developers get a little overly creative on it. Where you're like, "How the hell is this infrastructure even being generated? It doesn't even make sense."

Matan: I actually saw a blog post about someone who manages his Spotify playlist with HashiCorp Terraform, which is pretty funny, and it's amazing how creative people are. You could use an SDK or something, but he was used to Terraform so he went for it, which is pretty amazing. And by the way, this actually leads me to our next topic, which I think is what has bothered us most over, I don't know, the last five or six months. We are thinking about how to treat infra code in general, and we are going to discuss Terraform because this is the main language that we're using. So on the one hand, Terraform is infrastructure as code. But on the other hand, it's a definition.

Matan: Now in the new versions of Terraform, there are for loops and other stuff, which makes it more like code in general. We came from the development experience. There is a Node.js application, a Golang application. You have a CI, you have a CD, you have a local dev experience. And for each one of them, you have a view of how to use it. So for example, for Node.js you will have a test framework; in the deployment you will have [inaudible 00:28:25], blue-green, and some other stuff; and then the CI, containers, images, et cetera. But we have so many production engineers and DevOps engineers at Wix who are writing Terraform on a daily basis, more than a hundred. It's like a smaller [inaudible 00:28:42]. It's crazy.

Matan: And we need to figure out how we can take infrastructure as code, and we need to understand whether we are going to treat it like code. And we decided, "Okay, we are going for it big time." We are going to invest everything in treating infrastructure as code like an application, from development to production. And this is where we want to share our thoughts with you and hear your opinion regarding Terraform. Because, as we said, we have a lot of friends in the industry and at other companies, and when we discuss Terraform solutions with them, each one of them is building an abstraction on top of it: Terraform remote execution, Terraform Cloud, Atlantis and many others-

Ran: Everyone has their own solution.

Matan: And yeah, everybody has their own solution. And we wanted to hear from you, what is, from your perspective, the best practices going from the development to the production in infrastructure in Terraform. It's a big question, but it's...

Armon: No, it's a really, really good question. And it's one that we think about in terms of how do we need to evolve Terraform? So maybe I'll answer it in a few different pieces, because I think there's a few different components to it. So I think the first piece of it is sort of testing. I think we often get this sort of question of how do you do testing with Terraform? And I think even if we sort of talk about testing, we're like, "Okay, well, what kind of testing are we talking about?" Because I think one type of testing is, I'll call it almost unit testing with Terraform. Which is, I asked Terraform for an S3 bucket, and then when I run it, did I get an S3 bucket? And at that level, our view is sort of, "Well, it almost doesn't make sense to test Terraform at that level for an end user."

Armon: It makes sense for us, but for you, what are you testing? Are you testing that when you asked for a VM it gave you a VM, or that when you asked for an S3 bucket, it gave you- Because you're like, "If it doesn't, that means Terraform is just fundamentally broken." Right?

Matan: Right.

Armon: So, in some sense it's like, are you just testing that I typed in the S3 bucket and that I'm validating that's what's in the file, right? Like, so our view is, okay, unit testing doesn't make a lot of sense because Terraform is declaring it. You're just declaring what you want and it's Terraform's job to make it happen and if it doesn't, that's a Terraform bug, it's not a bug with your code. So moving up a higher level, it's more about, okay, is it more like integration testing? Which is, you said, I want a VM and a load balancer and a DNS record, and then when I talk to the DNS record, I should get a 200 okay back, right?

Armon: The traffic should flow from the DNS to the load balancer to the VM. And so what I'm really trying to test is more of the integration testing or the functional testing, which is, once Terraform finished running, is it running the way I think it should be running, right? And then there's what I'll call almost the ongoing, what's the best way to put it? Assertions about my infrastructure: I expect these things to always be true about it, right? Otherwise, something is wrong. A good example is my SSL certificate should always be valid, right? I shouldn't have an expired certificate on my load balancer. That's a good example, right?

Armon: And so that's the kind of thing where you're like, "I don't want to deploy with an expired certificate and I want to know if it's going to expire in the future or if it has expired so I understand what's wrong with the infrastructure." So there's almost these different layers, which is unit testing where you're like, are you testing that Terraform works? And it shouldn't be the user's problem. Integration or functional testing, which is, does your infrastructure function the way you think it does? And then all the way up to kind of monitoring, blackbox sort of validation of, is it staying in that condition that I thought it was in? So as I kind of look at what the ecosystem has done around it, I think there's some people who get really caught up in the unit testing side of it, which feels kind of counterproductive honestly, right?
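The integration-level check described here (after Terraform finishes, does the DNS name actually answer with a 200 OK?) can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names and the URL in the usage note are hypothetical, and tools such as Terratest provide far more than this.

```python
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrives."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout and similar problems.
        return 0


def endpoint_is_healthy(url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """Integration-style check: the full DNS -> load balancer -> VM path returns 200."""
    return fetch(url) == 200
```

After an apply, a pipeline step could call `endpoint_is_healthy("https://app.example.test/")` and fail the build when it returns `False`. The `fetch` parameter exists only so the check can be exercised without real infrastructure.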

Armon: And I think at the integration testing level, there's been good stuff in the community where people have built their own. Terratest is a good example of... A lot of people do a bunch of stuff with Terratest and Test Kitchen at that kind of integration testing level. And then I think at that kind of continuous monitoring level, there's a bunch of tools around drift detection and things like that, which is like, "Hey, we ran Terraform, have things changed in the meantime? Have we drifted and there's something wrong?" So the way we're thinking about each of the layers is: great, unit testing, we will continue to improve that story at HashiCorp and people shouldn't have to think about it. At the integration level, some stuff you'll see actually from as early as this year, maybe as early as this quarter, is starting to bring in what we call pre and post conditions into the Terraform configuration.

Armon: So you can define, hey, the post condition of my DNS resource, for example, is that if I curl this resource, I should get a 200 OK back, right? So you're starting to be able to declare, in line within the Terraform code, hey, the pre-condition to the DNS record is that the load balancer should exist. So I can test that before I even bother with the DNS. The post condition is that after the DNS record's created, it should return a 200 OK. So you can start to bring that declarative definition in. So as Terraform is actually doing the execution, it's validating these pre and post conditions and assertions as it goes. And then ultimately we want to give you different failure modes.

Armon: So you might want to say, hey, I have- Let's say, a good example is I have an expensive RDS database, right? It might be some huge RDS database that's expensive. I might have a pre-condition which is, before I create that thing, make sure this is true. And if my pre-condition fails, halt execution. Don't create the expensive database because some important thing is wrong. But you might also say, actually, it's okay if my assertion is false, continue creating the resource, but return an error at the end of the Terraform [inaudible 00:34:41]. So you can start to have these different behaviors: do I just want to error the run? Do I want to abort the run? Is it just a warning? Right? Just return a warning message at the end that's like, "This thing failed. Your load balancer's not returning a 200 OK. That's probably bad." But don't error on that.
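The three failure behaviors mentioned here (abort the run, finish but report an error, or only warn) can be modeled with a small Python sketch. To be clear, this is not Terraform syntax and not how Terraform implements conditions; `OnFailure`, `Check`, `RunResult` and `run_checks` are hypothetical names used purely to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class OnFailure(Enum):
    ABORT = "abort"  # stop immediately, e.g. before creating an expensive database
    ERROR = "error"  # keep going, but mark the whole run as failed at the end
    WARN = "warn"    # keep going and only report a warning


@dataclass
class Check:
    description: str
    predicate: Callable[[], bool]
    on_failure: OnFailure


@dataclass
class RunResult:
    completed: bool       # False when an ABORT check stopped the run early
    errors: List[str]
    warnings: List[str]

    @property
    def succeeded(self) -> bool:
        return self.completed and not self.errors


def run_checks(checks: List[Check]) -> RunResult:
    """Evaluate checks in order, applying each check's failure behavior."""
    errors: List[str] = []
    warnings: List[str] = []
    for check in checks:
        if check.predicate():
            continue
        if check.on_failure is OnFailure.ABORT:
            errors.append(check.description)
            return RunResult(completed=False, errors=errors, warnings=warnings)
        if check.on_failure is OnFailure.ERROR:
            errors.append(check.description)
        else:
            warnings.append(check.description)
    return RunResult(completed=True, errors=errors, warnings=warnings)
```

In this model, a failing pre-condition on the expensive database would use `OnFailure.ABORT`, while "the load balancer is not returning a 200 OK" could be downgraded to `OnFailure.WARN`.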

Ran: So this basically will run during the apply or during the plan? I mean, there will be a post condition, and if the apply didn't return a 200 OK at the end, you have the choice, as you mentioned, to fail. But it leads us to another point, which is how to do rollbacks in the world of infrastructure, because-

Armon: Well, we'll come back to that one. We'll talk about rollback.

Ran: Okay. Okay.

Armon: So that's kind of the functional level, that middle layer of pre and post conditions that we're going to bring into the language natively. And then at the very top level, that blackbox level, we're thinking about a notion where, in the way resource and data are first class, there is a first class assertion. So you can put in this first class assertion, whether it's around a module, or resource or data, basically you can take any Terraform input and make an assertion about it and say, "Hey, my SSL certificate, fetch the certificate from the load balancer. That thing's time to live should always be at least 30 days." And so now at that top level, you can see how that'll get incorporated into things like Terraform Cloud, where we can say, "Hey, I want to just run the Terraform plan and drift detection every 10 minutes."

Armon: And so it can tell you, "Hey." For example, "You asked for five VMs. Now one of them is missing, there's drift." Or it can tell you, "Hey, all of a sudden this assertion that your certificate's valid has now failed." Because now that certificate's only valid for 29 days and not 30 days anymore, right? And so you might want to know about that assertion so you can go change it before you have an outage or whatever. So that's kind of the evolution of where I think testing is going: basically baking in pre and post conditions so the functional test is part of Terraform's execution, and then bringing assertions in so you can actually do that kind of blackbox testing on sort of a continuous basis, and really starting to do more continuous monitoring and drift detection as part of things like Terraform Cloud. So that's kind of the direction we're taking Terraform testing.
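Editor's sketch: the two continuous checks from Armon's examples, drift ("you asked for five VMs, one is missing") and a standing assertion ("the certificate must be valid for at least 30 days"), can be expressed as a small Python model. This is an assumption-laden illustration rather than how Terraform Cloud works internally; `detect_drift`, `certificate_valid_for`, and `health_report` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def detect_drift(desired: Set[str], actual: Set[str]) -> Dict[str, List[str]]:
    """Compare what the configuration asked for with what exists."""
    return {
        "missing": sorted(desired - actual),
        "unexpected": sorted(actual - desired),
    }


def certificate_valid_for(expires_at: datetime, now: datetime,
                          min_days: int = 30) -> bool:
    """True when the certificate has at least min_days of validity left."""
    return expires_at - now >= timedelta(days=min_days)


def health_report(desired: Set[str], actual: Set[str],
                  cert_expires_at: datetime, now: datetime,
                  min_days: int = 30) -> List[str]:
    """Collect drift findings and failed assertions for one periodic run."""
    findings: List[str] = []
    drift = detect_drift(desired, actual)
    for name in drift["missing"]:
        findings.append(f"drift: {name} is missing")
    for name in drift["unexpected"]:
        findings.append(f"drift: {name} is not in the configuration")
    if not certificate_valid_for(cert_expires_at, now, min_days):
        findings.append(
            f"assertion failed: certificate valid for less than {min_days} days"
        )
    return findings
```

A scheduler (every 10 minutes in Armon's example) would call `health_report` with freshly fetched data and alert on any non-empty result.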

Armon: Then I think we'll round out the story a little bit, right? Because I think today it's kind of like you have to use Terratest or Test Kitchen or whatever, and we'll just bring that natively into it. I think the second thing you brought up was the deployment experience of it, of blue-green and how do you go from dev, stage, prod, whatever. And so again, I think the ecosystem has done some good tooling here. Terragrunt is a good example: they have a notion of stacks and multiple workspaces and things like that. So we're really working on bringing the notion of multi-workspace as native to Terraform. So you might say, "Great, I have a dev workspace, staging, production workspace, and they have a relationship, right? I'm going to promote from one to the other." Or maybe I have five production environments, right, in different regions. I want to first deploy it to whatever, UK one, and then from there I deploy to France one, Germany one, so on and so forth, right? So you might have a relationship between these different workspaces.

Armon: And I think that notion of a stack or whatever we end up calling it, a multi workspace type thing, that's another big thing we're looking to solve this year to fix a bit of that deployment pipeline thing. Which is like, okay, you should be able to define what that pipeline looks like in Terraform and then it helps you manage the upgrades of the code version, or variables, or whatever it is that you need to flow through that kind of pipeline. So both of those, I think hopefully you'll see a much, much better story this year as we sort of add that to the Terraform core language and core run time, as opposed to needing to kind of piece together a bunch of community tools to do it.
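Editor's sketch: the promotion relationship between workspaces that Armon describes (dev, then staging, then production region by region) amounts to an ordered rollout that stops at the first unhealthy stage. The Python below is a minimal model under that assumption; `promote`, `deploy`, and `healthy` are hypothetical names, and the sketch deliberately does not attempt rollback.

```python
from typing import Callable, List, Optional, Tuple


def promote(
    version: str,
    workspaces: List[str],
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Roll a version through workspaces in order.

    Returns the workspaces that were promoted successfully and the
    workspace where the rollout stopped (None if all of them passed).
    Workspaces after the failing one are left untouched.
    """
    promoted: List[str] = []
    for workspace in workspaces:
        deploy(workspace, version)
        if not healthy(workspace):
            return promoted, workspace
        promoted.append(workspace)
    return promoted, None


# Example wiring with in-memory stand-ins for real deploys and checks.
versions = {name: "1.0.0" for name in
            ["dev", "staging", "prod-uk-1", "prod-france-1", "prod-germany-1"]}


def fake_deploy(workspace: str, version: str) -> None:
    versions[workspace] = version


def fake_healthy(workspace: str) -> bool:
    return workspace != "prod-france-1"  # simulate one failing region


promoted, failed_at = promote("1.1.0", list(versions), fake_deploy, fake_healthy)
```

Because the rollout halts at the failing region, the remaining regions keep running the previous version instead of receiving a change that is already known to be unhealthy.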

Ran: This is definitely a feature that we're looking forward to using. Right at this moment, we are thinking about how to develop this stuff ourselves. And it's very good to hear that you are working to release it as part of the core features of Terraform.

Matan: Yeah.

Ran: Very exciting.

Armon: It's definitely-

Matan: I mean, in companies that have just two regions, for example, as you said, like U.S. and maybe Europe, it's probably not a big deal to do a plan, apply, check the metrics, check the monitoring, and then change the modules. But when you have 20 data centers, for example, this is what we are solving at the moment. We need to solve multi-DC management. I mean, a DevOps engineer, for example, right now has to go and change the module version, just a tiny thing, but it needs to go tediously DC by DC: plan, apply, monitoring. And God forbid that you need to do a rollback in the 19th. This is crazy. So this is one of the main issues that we are facing today. And if you multiply it by the number of infrastructure people in our organization... there isn't a clear solution for that. So it's amazing to hear that.

Armon: On your point about things like the monitoring, and the fact that it's the platform team doing this in Terraform, I think where it gets really challenging is from the application team's perspective, where they might not even know what metrics to go look at. They're not the infrastructure experts. The platform team, yeah, they know what to go look at, but the app team doesn't really care. They're just like, "I want to deploy my app, whatever." And so I think that's where things like, "Hey, can the platform team put those assertions into the Terraform modules, into the Terraform code?" come in. So if I'm the app team, I push my deploy button and it's like, "Oh, this assertion failed because my Datadog metric is off now." Okay, that's a thing I know, now I can go look at that metric, but otherwise, the dev team might not even know what metric they should go look at.

Armon: So I think that's some of the stuff where it's like, can we close that loop and bring those metrics that you're looking at, put them in the Terraform code so the dev team doesn't have to know about it and then have an opinionated workflow on what that pipeline looks like, basically.

Matan: I want to ask another question regarding the CI. So we've seen that in most, let's say, infrastructure fields, there is a kind of artifact, right? Take a regular application, right? You're writing your code, doesn't matter, Node.js, Golang, et cetera. Then you probably have a PR for testing, linting, etc., and when the code is merged you have a process to create your artifact. And this artifact is, let's say, the end of the CI. And then you're taking this artifact to the CD. Now in Terraform, it's a little bit different, because as we see it, the closest thing to an artifact that we found is a module, kind of. Maybe a few modules as a collection. But the problem with the module is that it has variables. I mean, it's not a problem. This is what it's meant to be. It's meant to be like a reusable resource-

Armon: More like a library than an artifact.

Matan: Yeah.

Armon: Yeah.

Matan: Yeah, of course. So the problem is that if we want to treat Terraform like code, it's a little bit difficult to treat it like something that is sealed and closed. And maybe I can give an analogy from another area, for example, what Helm did to Kubernetes manifests. They took several Kubernetes manifests and gave them a revision, a chart. And then you can treat it like something that is sealed, as an artifact that we can take. So what I wanted to hear is your opinion: if we are taking Terraform as code, what do you think about an artifact of Terraform? Like a Terraform image, for example.

Armon: Yeah, it's a good question. And I guess there's almost two layers to it, which is the core... So I guess the artifact, as we think about it, the thing that should get passed down the CI/CD pipeline, is actually the Terraform plan, because the Terraform plan ends up being sort of the compilation, if you will, of-

Ran: We develop those.

Armon: Variable inputs, all of the sort of modules and sort of source code and state into one atomic unit. And when you feed the plan into apply, that's what it's using to say, "Hey, atomically execute against this thing." So I think if you're sort of compiling Terraform now, what you get is the plan output, right? So that's how I would think about the literal chunk that should be passed through the pipeline. And if you think about a system like Terraform Cloud, that is how it works.

Armon: When it generates a plan, when you hit apply, what it's done in the background is it saves that plan output, and then it's passing that through, and that's what it's executing to make sure that you don't have drift between "this is what the plan said it would do" and "this is what the apply actually did." So that's the low-level atomic unit. The higher-level one, if we think about it: most infrastructure is composed of more than one module, right? Modules tend to be lower-level building blocks. And I think the common pattern we've seen is this notion of a multi-module architecture, right? You might have a module for your load balancer and DNS and SSL record or whatever. So you have 10 different modules that you're composing to then say, this is a Java app, let's say.

Matan: Right.

Armon: And so that higher-level thing, the Java app, what's that? And I think that's where we're talking about what we're loosely calling, in our internal language, a blueprint, where we want to basically introduce this higher-level concept of a multi-module pattern that's more opinionated, it's structured, it pulls together these different things. And so I think you'll see some more from us this year around that as well.

Matan: Yeah. So regarding the... First of all, we're very glad to hear that.

Ran: Yeah, exciting news.

Matan: Yeah, exciting news for us, because we have had a lot of difficulty understanding the vision and thinking about it. Regarding the plan, I mean, it's atomic in a way, but from our experience with DevOps engineers, they're doing plan and apply so many times until they understand if the module is, how to say, the way they want it. I mean, although you have a plan, it doesn't mean from a DevOps perspective that it's finished, because maybe the apply was wrong. So this is why it was hard for us to think about the plan as, how to say, an atomic thing, like an artifact, for example. But for all the cases, of course, it makes sense, and this is the direction.

Armon: Yeah. I mean, the way I think about it is, almost like if we have like a CI pipeline for an app, every time I check in the app, the first thing it's going to do is compile it into the binary and then it's going to go run all the unit tests. And if the unit test sort of fails, well, I'm not going to upload that artifact to Artifactory or something, right? I'm just going to throw away the compiled binary. And so to me, that's almost the same thing with Terraform. You're going to run the Terraform plan, right? It's compiling it into that plan output, if you will, and then when you're sort of looking at it, you're like, "Does it do what I think it's going to do?" And you're like, "Oh no, something is wrong." You sort of throw that thing away in the same way you're kind of throwing the binary away and making a change again.

Armon: And then once you're like, "Oh my unit test passed. Okay, that's the thing I'm going to deploy." I take the compiled artifact through. Sort of the same thing. It's like, "Okay, once my plan looks right, then that's the plan output that I take and actually execute." So I think there's a close analogy if you think about it as the compiled binary where your CI is compiling it all the time, you're just throwing it away most of the time.
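Editor's sketch: on the command line, the "plan as artifact" workflow corresponds to saving a plan with `terraform plan -out=<file>` and later running `terraform apply <file>`. The Python below is only a toy model of the idea Armon describes: a plan "compiles" variables, configuration, and state into one unit, and applying it is only meaningful against the state it was compiled from. The names `make_plan`, `apply`, `Plan`, and `StalePlanError` are hypothetical, and the fingerprint check is an assumption made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict


def fingerprint(data: Dict[str, str]) -> str:
    """Stable hash of a dictionary, used to recognize the planned-against state."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()


@dataclass(frozen=True)
class Plan:
    state_fingerprint: str   # which state this plan was compiled against
    changes: Dict[str, str]  # resource name -> value to apply


class StalePlanError(Exception):
    """Raised when the state changed after the plan was created."""


def make_plan(config: Dict[str, str], variables: Dict[str, str],
              state: Dict[str, str]) -> Plan:
    """'Compile' config templates, variables, and state into one plan."""
    desired = {name: template.format(**variables)
               for name, template in config.items()}
    changes = {name: value for name, value in desired.items()
               if state.get(name) != value}
    return Plan(fingerprint(state), changes)


def apply(plan: Plan, state: Dict[str, str]) -> Dict[str, str]:
    """Execute exactly what the plan recorded, or refuse if state moved on."""
    if fingerprint(state) != plan.state_fingerprint:
        raise StalePlanError("state changed since the plan was created")
    new_state = dict(state)
    new_state.update(plan.changes)
    return new_state
```

As in the compiled-binary analogy, most plans produced during iteration are simply discarded; only the one that passes review is carried forward and applied.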

Matan: Yeah, cool. So last question. What is your next vision and thoughts of what HashiCorp is going to do?

Armon: Yeah. I mean, there's a bunch of the Terraform stuff that we talked about this year.

Matan: Of course, of course. It was amazing.

Armon: I think the biggest thing, right, for the company broadly is the shift for us: historically we've always been a desktop software company, right? We build open source stuff and then we have enterprise versions of it, but customers download it, run it, whatever. And now the big, big shift for us is really becoming sort of a Cloud-delivered business, right? And that's a totally different motion, right? We're used to having four-month release cycles for our software, right? It's almost like a waterfall planning methodology for us. Versus now, increasingly, we have Terraform Cloud, we have an HCP version of Vault and Consul. We have HCP Packer in GA, and Boundary, Waypoint, and Vagrant will all end up on our HashiCorp Cloud Platform this year as well. So it's really this shift towards all of these now being delivered as also a managed service solution.

Armon: And I think that'll start to bring a different set of capabilities for the products as well, right? Yes, there's still the core open source and you can have it just managed by us, but then it's going to be additional capabilities that are sort of Cloud delivered that might still work. So I think that kind of a blend for us of really being able to Cloud deliver the services and support those sort of hybrid modes is going to be super interesting for me. It's a whole new world of new features for us to deliver.

Matan: Thank you very much!

