We love telling stories behind the daily challenges we face and how we solve them. But we also love hearing about the insights, experiences and the lessons learned by prominent voices in the global community. In this series of tech interviews, our very own Wix Engineers talk directly with some of the most inspiring minds in the tech industry. From software engineering to architecture, open source, and personal growth.
This time our very own Matan Cohen and Ran Schneider talked with Armon Dadgar, Co-Founder and CTO of HashiCorp and a major influence in the DevOps world, about the future of infrastructure management, how to manage infrastructure as code (with over two hundred production engineers!) and the vision of HashiCorp.
Matan: My group is responsible for building products in the DevOps area - although on a smaller scale than HashiCorp, we are nevertheless building products for DevOps. Our group is called DevOps for Infrastructure and we are responsible for the dev experience and build products in order to make the life of production engineers easier.
Now, this is a very unique situation to be in - this field of product for DevOps, it's quite unique. Not all people have the skill set to do this kind of job.
For example, you need to have a product manager - well for our product, we are the product manager! You need to have programmers that know infrastructure and vice versa, DevOps that know how to build products…
Now, in HashiCorp, you've done it on such a large scale - how do you overcome those problems and issues, and in general, how do you build such a huge company working on DevOps products?
Armon: It's a great question. And it's definitely a tough challenge, right? Because as you mentioned, it's a pretty specialized skill set. I think for us, part of it is a bit of self selection. As you're hiring software engineers who want to work on infrastructure - you get a bit of a different “breed”.
There are a lot of folks who want to work on the end product, to do things that are in front of the customer, they like having that user interaction. And then there are folks who just really love the plumbing of it, where it's like, "I don't really want to deal with the user, I want to build the deploy system”, or “I want to build the infrastructure management”, things like that, the kind of core plumbing stuff.
Part of it is that you're hiring for a slightly different mentality, you're hiring folks who like that kind of plumbing and that infrastructure layer. And the other side of it is that a lot of engineers are just unfamiliar with it and it might seem more intimidating...
So part of it is just getting people exposed to it. And then you find that a lot of software engineers find that problem space exciting. For example, I can personally think back to before HashiCorp when I worked at an ad company. Then I could think something like "Well, is an ad company going to be that interesting? You're just serving ads!". But when you start looking into constraints and the challenges, you realize that you're working at an immense scale with hundreds of thousands of requests per second. You care deeply about performance - every millisecond matters, you have to have responses in under 50 milliseconds. And you're super conscious of your margins and your cost.
And when you start talking to people about those constraints, then their reaction is often to say that that's actually a pretty interesting problem space. So set aside the fact that it's serving ads, the technical problems are very interesting.
And I think infrastructure is sort of the same thing. If you just talk to people about the plumbing, it can get kind of boring. But then talk to them about the constraints of the problems you're helping solve. Say, how do you help make another 500 or 1,000 engineers productive? Or, say, if you solve some of these underlying infrastructure problems, this is how that’s going to improve the cost, or the agility, or the security posture, etc. And then people can get excited about those things as opposed to the plumbing of it.
So it's about a few of those different things. Find people who are excited about it. Do the right storytelling, speaking about what's interesting about the problem, not just describe the plumbing of it. And then just give people exposure so they don't feel scared of the word “infrastructure”.
Matan: Do you feel that’s the same for product managers as well? Of course, for developers, for DevOps, you can make them quite interested in the core functionality and thus build a product like that, but with product managers, sometimes they want to… be very close to the business. But sometimes, for some other companies the products of DevOps are kind of a gray zone…
Armon: It's a hard problem. I mean, we have that problem internally too. Even our own platform team, even within HashiCorp, even though we ourselves make platform software! So even our own platform team has this problem of finding the right PMs who want to fit into that. We have a few and they're great. The challenge here is that they are really, really hard to find. Finding PMs who want to work on internal platforms is really hard.
The flip side of this has been having our engineering managers take on a bit more of that kind of product management leader role plus supplementing it with technical product management, so TPM.
Speaking about where you can find those kinds of people - you are going to have to hire them from deep infrastructure companies and alike. They have to be coming out of storage, networking, compute… That's where you're going to find those PMs who have that deep familiarity. Because you just have to spend time in the deep plumbing for it to really make sense to you. Otherwise, you're just like, "Okay, I don't really know what we're talking about here."
Ran: We use a lot of HashiCorp products at Wix - like Vault, Consul, Terraform, etc. And we noticed that most of these products are built on top of Golang. And I know that nowadays a lot of infrastructure-related products are built using Golang. But back then, when you just started the company, it was an immature language without too much community around it. Kind of a risky decision to make back then, not to go with something like Java that was very popular back then and still is today. How did you make that decision?
Armon: It's a really good question and we spent a lot of time debating it. If we go back to around 2012, when we started the project, this was a big question for me. What language do we bet the company on? Me and Mitchell, we played with a bunch of different languages, and so coming in, it wasn't like we were particularly Go experts or something. Actually, most of our time would probably be spent with Python, in the previous company we'd done a lot in Erlang, reasonably comfortable in C/C++.
So I think there were a few different questions for us. One, how easy is it to package and distribute the software? Vagrant famously is in Ruby and packaging and distributing a Ruby application is a complete nightmare because you need to ship the whole Ruby runtime, it dynamically links to a bunch of libraries like OpenSSL and libxml and all of this stuff. And you have to manage all of that versioning super carefully, because if it links to the wrong version of libxml and then the library doesn't work, then Vagrant crashes… It's this horrible nightmare when you have that kind of dynamic linking on a desktop product like that. We figured it out with Vagrant, but it was horrible and we didn’t want to do that again. We wanted something that can link into a static binary that's fully self-contained, sort of.
Now, there's a bunch of great choices for that, right? You can do that with C/C++, you can do that with Java - build a kind of a “fat” JAR. But you can't really do it with something like Erlang, since it’s kind of a super nightmare to distribute. It also has a billion files and a runtime and all of that kind of stuff. So we're like, "Okay, something like that's going to be challenging." That was one concern.
The second concern was deciding if a certain language was good for building high performance networking software. A lot of what we do - if you think about Vault, Consul, Nomad - these are systems that are designed to process tens of thousands, in some cases hundreds of thousands of transactions per second. So it had to be a relatively high performance language for what we were doing, especially the network-oriented stuff.
That kind of ruled out Ruby and ruled out Python - they're not particularly performant in multi-core and networking use cases. Obviously C++ was a good fit. Java was sort of in the middle. Anyone who's tried dealing with high performance Java knows the nightmare of tuning garbage collectors and your heap sizes. It's just kind of a pain to deal with that when you care about the performance. And so that was the second concern.
And then the third concern was hiring - how difficult is it to hire, onboard and train people for this? Ruby, Python - great, really easy, more or less. C/C++ would be really hard, not a lot of people are excited about writing C all day... Erlang - there's like 12 people in Sweden that know how to write in Erlang, so not super helpful (laughing).
And that’s when Go was showing up around the scene. This was before 1.0 even, this was Go 0.8 or something, and we were seeing Google do some stuff with Go.
It fit the bill on distribution, because it compiles down to a single static binary. We thought that was really nice. And then they've made a lot of great decisions around minimal runtime. Yes, it has a garbage collector, but it's very different from Java's, it doesn't have nearly the same level of garbage collection performance problems.
It's designed to be a systems language - so think high concurrency Goroutines, all that kind of stuff. Okay, now it fits the bill of being a potentially good language for high performance network software.
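To make the goroutine point concrete, here is a minimal editorial sketch (not HashiCorp code) of fanning work out across goroutines and waiting for all of them to finish. The `handle` and `handleAll` functions are hypothetical placeholders for real request handling.

```go
package main

import (
	"fmt"
	"sync"
)

// handle stands in for per-request work; it is a hypothetical placeholder.
func handle(id int) string {
	return fmt.Sprintf("request %d handled", id)
}

// handleAll starts one goroutine per request and waits for all of them.
// Results are stored by index, so the returned slice is in request order.
func handleAll(n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = handle(i) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	for _, r := range handleAll(3) {
		fmt.Println(r)
	}
}
```

Starting a goroutine costs little more than a function call, which is part of why the language suits network servers that juggle many connections at once.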
And then with the third thing, with hiring, what we really liked about the syntax was that although nobody really knew Go at the time, it isn’t that weird of a language. If you know Java, you can pick it up pretty easily. Same if you know C/C++. Even if you know Python or Ruby, coming to Go is not that weird.
So no matter what you know, you can kind of come to Go and it'll take you a week to three weeks to really get going. Yet it's not weird the way Erlang is weird, or hard the way C or C++ is hard where it takes you years to get good at it.
It fit all those boxes. The biggest risk was trying to think about what would happen if Google abandoned it. But we thought it still was a good bet, because worst case scenario - it would still be open source, the compiler would still be around, it wasn’t going to go anywhere. And two, because the community is early, we thought that as HashiCorp we could become one of the places to go to if you wanted to write Go software, we could help define what the whole Go ecosystem looks like.
And it's fun for us, because almost any major Go project we see, there's a HashiCorp library in it. Whether it's a library that Mitchell wrote or I wrote, or it's a HashiCorp-maintained one, you see our stuff across the whole Go ecosystem. It was by virtue of being early in that community that we could help define it.
It ended up being a great bet. And Google has done a great job stewarding it and driving the development of it. And it's awesome to just watch every generation of it get better, the garbage collector get faster, etc. It's been a good choice.
Ran: When a new company is trying to find a solution for orchestrating containers, the go-to solution nowadays is Kubernetes. And in HashiCorp, you also have a product for scheduling containers and other services that’s called Nomad. Why do you think the majority of the community went with the Kubernetes approach? And do you think that there is maybe a better use case where one should choose Nomad over Kubernetes? For special use cases or something like that?
Armon: Another good question and one we spent a bunch of time thinking about - why did Kubernetes win the market, broadly speaking? I think ultimately the lesson for us comes down to the ecosystem. That’s the biggest difference, right? What Google did with Kubernetes by virtue of putting it into CNCF and creating a foundation around it, is to set this big tent around all these other vendors that were all pushing it. So it wasn't just Google talking about Kubernetes. Which they did, of course - they hired a hundred engineers and spent 10 million per year on marketing. But then they also had Red Hat contributing to it and IBM and Cisco and VMware and… pick your favorite, right?
Each of those companies was then involved with developing it, marketing it, advertising Kubernetes. And this creates this enormous surface area of a “Mindshare generation” for Kubernetes. And then there was helping educate users on it, and going to CNCF, and KubeCon and all of those things. Whereas with Nomad it was a much smaller group. It was just HashiCorp talking about it at our events and at HashiConfs, but it's a much smaller audience than what Google, IBM, Microsoft and Cisco and everyone else can advertise to.
Now, anyone who's tried to self-run Kubernetes realizes it's an operational challenge. And so there was a ton of pressure on the Cloud providers to provide and manage Kubernetes, because nobody wanted to run this thing. And so once they had GKE, and AKS, and EKS, then there was this easy button one could just press to make Amazon deal with it... I think that really helped them flip it, the dynamic would have been very different if everyone had to actually operate Kubernetes. Because then they'd be like, "Holy cow, this thing is way too much work".
Looking back, maybe some things we would've done differently, but overall it's that dynamic of having 10 of the world's biggest vendors all lean on the same thing being a very difficult thing to compete with.
Matan: Yeah, it makes sense. Cool. So as I said before, besides bringing in tools and products, our group helps automate infrastructure processes at Wix. Let's say, for example, that you want to create a database. You need a flow: a backend engineer needs the database and some other stuff. So sometimes you need to create a kind of Terraform as a service. Of course, DevOps engineers write Terraform all day, and this is their code. So two years ago, we needed to find a way to write it programmatically. We have our own languages, we are writing in OGS, and Golang, and many others, but we needed to find a way to write Terraform programmatically.
And we were very glad to see that there is Terraform as JSON, so we didn't need templates with HCL and some crazy stuff. It was amazing, and we've created a bunch of abstractions on top of it. We are using [inaudible 00:19:50] and [inaudible 00:19:51] and other stuff in order to write Terraform as JSON. Lately we found out that there is a Terraform CDK, which also seems very promising, and we saw that it supports Go, TypeScript and many other languages. We want to understand your vision: which is best, or when should you go with Terraform as JSON and when with the CDK? It's still in beta, right? Or is the CDK out of beta?
Armon: Oh, that's a good question. I think it's out of beta. That's a good question.
Matan: Anyhow, regarding the vision.
Armon: Yeah, that's a good question. I know there's a whole team that works only on that, so I'm pretty sure it's out of beta. More broadly, the way we think about it is that Terraform core, the Terraform engine, in some sense doesn't care what the input format is. At the end of the day, to use a bit of compiler language, all of these get translated into an intermediate representation for Terraform anyway. So there are many different language front doors for Terraform, and all of them get compiled into the IR, which is what Terraform actually uses internally to process and execute the graph.
So as we thought about it, we said, "Okay, great." Obviously you don't want humans to write the IR, because that's terrible. So what are the different front doors that actually make sense for people? Where we originally started was JSON. JSON was always the closest match, in some sense, to the internal IR. The problem with JSON is that it's painful to author. It's pretty verbose, you don't want to actually sit down and write it, and it becomes a nightmare once you have too many levels of nesting. So HCL was really born out of saying, "Hey, we think that 80% of Terraform usage will be humans sitting down and writing it, owning and curating what my core VPC looks like, what this module should be."
So our expected use case is that 80% of the time it's hand curated. HCL was meant to be an easy-to-read, easy-to-write, way less verbose version of JSON. The two have always had a one-to-one compatibility with one another, but one is obviously way less verbose, because that's what it was designed to solve. The JSON piece always filled the gap for people who said, "I want to auto-generate Terraform and programmatically [inaudible 00:22:26]." That's the perfect fit for it. I think the evolution with things like the CDK has come from people saying, "You know what? As I'm moving away from my ops person being the one responsible for Terraform, and closer to my developers, the developers are more comfortable thinking less in terms of configuration and more in terms of whatever native programming language they're used to. So can we have an alternative for them?" It's still designed for when you're manually curating and writing it. In the same way the operators are curating the HCL, your developers are curating, let's say, the TypeScript in this case.
That's really what I think the CDK is meant to solve. If you're building it into a pipeline, programmatically generating it and putting it behind things like REST APIs, the JSON format's probably still the best, because I think that's going to give you the total flexibility to do whatever you want, or-
Matan: Whatever language you want.
Armon: Yeah, you can enforce different policies, you can put in your access controls, you can do all of that kind of stuff. And then you're just passing the JSON through the layer. Of course, you can still have users write in CDK, have that emit the JSON, and then pass the JSON further down the pipeline. But I think you have a little bit more flexibility with the raw JSON than you do even with the CDK, because you can generate it however you want at that point. It's closer to that raw IR within Terraform. So for us, ultimately, our vision is that there are many different front doors that are optimized for different audiences. HCL, we think, will be the 80% use case. I think CDK is probably the 10% use case. And then JSON's probably the other 10% use case, where you're programmatically generating it. But our view is that all of those are valid and all of those are going to be things we support. In fact, we're working on adding more languages to CDK. So our view is the more, the merrier.
Matan: Yeah. It was very interesting for us because, I mean, I don't know which we would have picked. If we'd had a mature CDK two years ago, in Golang or in another language, it's very interesting what we would have picked. We have the JSON, and it's very easy to manipulate, as you said. But using it in your own language is, as you said, closer to what developers who tend to infrastructure are doing. So they would probably pick that.
Armon: Well, the risk that I find with them... this is always the challenge, right? We always had this trade-off, even with HCL: the more power and flexibility you give people, the more they can shoot themselves in the foot. The perfect example of this is C and C++. You can do whatever you want, including blow your leg off. You want to access random memory space? You can do that. Go for it. So you have this incredible power, but it's really easy to mess things up. Versus a language like Java or Python, which is way more restricted. It's like you're wearing a straitjacket. You can't access arbitrary memory, it doesn't let you do that.
As a result, there are far fewer ways to blow your leg off. Not that you can't do it. I think the balance with HCL was the same thing. We wanted enough flexibility and power, but we purposely constrained the language so that you can't do arbitrary, crazy things in HCL. You can drop me into any HCL code base anywhere, at any company, and within 30 seconds I have a good sense of what's going on, because you can only do so much weirdness. Versus the problem with a raw programming language, and you already see it with CDK, is that you're only limited by your imagination, which could be a problem.
Because some people have a very active imagination. So you come in and there's arbitrary Python, hierarchies of classes, and metaclasses. They're auto-injecting things, and it takes you five days just to figure out how this S3 bucket even came to exist. Why does it have this name? Because it's inheriting all these properties in these weird ways. So that's the problem, I think, with a native programming language: it opens Pandora's box of doing whatever you want. And I think sometimes developers get a little overly creative with it, where you're like, "How the hell is this infrastructure even being generated? It doesn't even make sense."