CI Police, E13: Full Transcript

Updated: Jul 4



How do you get all the divisions, teams, employees and projects in a company to follow a company-wide requirement? You can email everyone, but good luck getting them to read it. You can talk to everybody individually, if you have unlimited free time on your hands.


Last year, Roy Sommer and his team members decided they needed a better way. So they founded CI Police.


You can also listen to this episode on Apple Podcasts, Spotify, Google, or the Wix Engineering site. And you can read the full transcript here:

HACK I


On February 9th, 2021, a bug bounty hunter named Alex Birsan published an article online. Its title: “How I Hacked Into Apple, Microsoft and Dozens of Other Companies.”


And according to his article, it took less than a day for his first attacks to succeed.


Roy Sommer is a Development Team Lead at Wix. He came across the article for the same reason everybody else did: because it threatened his entire company. The hack Birsan was describing was tailor-made for software giants like Apple, Microsoft and Wix, and it would allow an attacker to…


Basically run any code on your machine - any code. Like if they wanted to copy files, they could copy files. If they wanted to encrypt or delete anything - they could do that. If they wanted to steal secrets from the RAM of your machine or your file system, your RSA identities, your SSH identities or any other secret or key, they could do that.


Basically, with this vulnerability, a hacker could do whatever they wanted to the biggest tech companies in the world.


Luckily, Birsan himself was a white hat. He had no intention of using his zero-day for malicious purposes. But that wasn’t going to stop any other hacker anywhere else in the world, who now had a blueprint for how to crack the most lucrative targets they could imagine.


And so a race started. In one lane: any hacker in the world with even a modest amount of technical skill. In the other lane: Wix and others like them, who had to fix all their systems, company-wide, before it was too late.


I know who I’d be betting on. Hackers are crude but they’re fast. By contrast, making any large-scale changes at a multinational corporation takes forever. Just think about the dozens, probably hundreds, of people you have to involve in the process.


Fortunately, Wix had one secret weapon in their arsenal. A weapon with black hair and green eyes, who’s never done a podcast before.


Roy: [Sigh] Sorry. I don’t know if you can tell, but I’m a little bit nervous. Just a little bit.



PROBLEM: WIX GROWTH


Months earlier, when Roy started working on a new project, he didn’t realize it was one day going to help the company avoid a major cybersecurity incident. Instead, he’d set out to solve a much broader issue.


Interviewer: What was it like to make a technical decision across a company with as many developers as Wix has?


Roy: It’s a very good question and Wix is a good case study for that. Because Wix is an example of a company that grew rapidly from a relatively small one just a few years back to what it is today. Which is a large company, a rather large company with about 6,000 employees. So making decisions back then was easier because you would just – you would know everyone. You would just go up to them and you would ask them to do certain things.


But then as the company grew, this method did not scale as well, because now relaying those messages would take longer and having many different teams means that each individual team has their own priorities and their own features that they want to implement.


Coordinating between developers and teams became prohibitively difficult. With so many things happening at once, even getting the right message to the right people wasn’t easy.


Roy: Basically, we would have to chase them and ask them repeatedly to do what we wanted them to do.


Sometimes, just to avoid the headache, it was simpler to blast emails to entire sectors of the company.


Roy: We would call them “action required” emails, where we would basically tell everyone, “Hey guys, if this does not concern you, you can move on. But if it does, then you have to do this and that.” And that’s how we would relay the message to everyone.




EXAMPLE: IE


As workflows and communication lines faltered, the products that were delivered to customers suffered. Take, for instance, the Wix Viewer - the thing that renders Wix websites in a browser.


Roy: And if you developed a project that should run on the Viewer, then you would have to support a large variety of browsers, including old ones such as Internet Explorer. So this would be rather hard because Internet Explorer is very old and so very limited by today’s standards as a browser. So you would have to run all kinds of transformations in order to make your code compatible with this browser.


So we would often run into issues where people – where groups that develop projects for the viewer - would just miss Internet Explorer. They would just – their projects, the best case scenario would be that they would break the sites completely, because then we would know about it. But in other cases, they would just make buttons disappear, make the loading experience bad, your sites loading slower, et cetera, et cetera.


And we’re talking about thousands of users. And you know this is really unacceptable because we aim to give the best service, the best experience to our users.


So even a downtime of a few minutes is unacceptable on our end. So we just decided that this can no longer… we can no longer continue working like this. We have to figure out a way to be able to introduce, enforce, and raise awareness of those bugs.



SOLUTION: CI POLICE


In other words, they needed to figure out how to make company-wide technical decisions that might affect hundreds of developers, and thousands of projects, without interrupting those developers or negatively affecting those projects.


Interviewer: So do other companies simply accept that big company-wide changes will take this long to implement, or are they equally good at it? Are there paradigms out there for tackling the same kind of problem?


Roy: So, that’s a great question. And this is something that we looked into. And what we basically found is that most companies, at least most companies who open-sourced their solutions, took an approach where they give you sufficient tools for searching the code base for issues, for looking up projects or broken dependencies, or even for statically deprecating specific functions inside your code. So you can actually just decorate a specific function and deprecate it, and then it will start alerting the people who are using it.


And those are all very good solutions and they hold really well. But we wanted something a little bit different. We wanted something that is more proactive and more loose than just static code analysis.


We try to give as much freedom to our developers as we can. So every team can make almost every decision regarding their code base and we didn’t want to limit that.


The alternative Roy and four of his colleagues came up with was to create, in essence, a bottleneck. A centralized policing system, for code. They named it CI Police. CI stands for continuous integration, because that is the point where the system interacts with your project and “police”, because it would enforce rules that all Wix code must abide by.


Roy: We collect the data, we collect the projects, we run your code, and if necessary, we also prevent specific projects from deploying.


The story from before that we were discussing about the browser compatibility problems - this is a good example of correct usage of the system.


This was actually one of the first rules, maybe even the first rule, if I remember correctly, that we implemented using the system, where basically it went like this: If your project is a public project that’s deploying to the site renderer, then we’re going to run a simple syntax check on the output of your project, what we call the bundle that your project creates. This is the actual code that the browser downloads from the internet. So we would run a very simple check that would ensure that you’re not using features that are not available on Internet Explorer. And while easy to notice and sometimes even easy to fix, this is something that people usually overlook. They miss it.


In many projects, at least some part of the code was not compatible. And we helped them fix that in just a couple of months.
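
To make the bundle check Roy describes more concrete, here is a minimal sketch of what such a rule might look like. It is not the CI Police implementation: the function name, the feature list, and the choice of regular expressions are all assumptions made for illustration. A production check would more likely parse the bundle with a JavaScript parser restricted to ES5, since plain text patterns can misfire inside strings and comments.

```typescript
// Hypothetical sketch; none of these names come from CI Police itself.
// Each entry pairs a human-readable feature name with a crude textual pattern
// for syntax that Internet Explorer cannot parse.
const UNSUPPORTED_SYNTAX: { feature: string; pattern: RegExp }[] = [
  { feature: "arrow function", pattern: /=>/ },
  { feature: "template literal", pattern: /`/ },
  { feature: "class declaration", pattern: /\bclass\s+[A-Za-z_$]/ },
  { feature: "async function", pattern: /\basync\s+function\b/ },
];

// Returns the names of unsupported features found in the bundle text.
// An empty array means this (very rough) check found nothing to report.
function findUnsupportedSyntax(bundle: string): string[] {
  return UNSUPPORTED_SYNTAX
    .filter((entry) => entry.pattern.test(bundle))
    .map((entry) => entry.feature);
}
```

A CI step built on something like this would fail the build, or only warn, whenever the returned list is non-empty.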



MATURING PROCESS


It worked like a charm. Before, when developers published code that crashed Internet Explorer, guys like Roy would have to send email blasts, or chase down individual developers - wherever they may be in the world - to try and get the problem retroactively fixed. If you’re wondering why they didn’t just open a Jira task for the teams - that’s because in organizations as large as Wix, it’s not always easy to know who is affected by what bug, and so they decided to notify everyone in R&D instead. Now, that same code would automatically be pulled over by CI Police, and sent to code jail until it fixed its compatibility issues.


As you might imagine, not everybody was happy with the new system.


Roy: Because many of them were failing this check, we actually broke quite a few builds at Wix. So people started coming up to us and they were telling us, “Hey guys, this came out of the blue. What can we do about it? Can you please stop it? Just stop it.”


A lot of the developers who didn’t realize they were committing crimes didn’t understand why their code was being put in jail.


Roy: So we started sending notifications and we’re using Slack for that. So we actually track the owners of the projects and we send them Slack notifications and we tell them, “Hey guys, you have this and that that you have to fix. Please do that either ASAP if it’s a critical problem, critical issue, or, you know, take your time but please do that when you have some free time.”


As time went on, more developers had more complaints.


Roy: Some people came up to us and they said, “Well, we cannot do this right now. We have deadlines. We have urgent matters to solve, to resolve. Please give us some more time.” So the next thing, the next step in the evolution was to give people this grace period to be able to actually schedule when we want this rule to start breaking builds and then people would get those notifications in advance and they will have a sufficient amount of time to prepare for this deprecation.


And finally, we had quite a few projects, quite a few project owners which could not – they could not fulfill the requirements by the deadline for justified reasons. So we actually have a system where we can cherry pick those specific projects and give them extensions.
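
The three stages Roy describes (notify, break builds after a scheduled date, and grant per-project extensions) can be sketched as a single decision function. This is only an illustration under stated assumptions: the `Rule` shape, the field names, and the three outcomes are hypothetical and are not taken from CI Police.

```typescript
// Hypothetical sketch; the types and names below are assumptions for illustration.
type Outcome = "pass" | "warn" | "fail";

interface Rule {
  id: string;
  // From this instant on, non-compliant projects have their builds broken.
  enforceFrom: Date;
  // Project name -> later, project-specific deadline (a granted extension).
  extensions: { [project: string]: Date | undefined };
}

function evaluate(rule: Rule, project: string, compliant: boolean, now: Date): Outcome {
  if (compliant) {
    return "pass";
  }
  // Only consult extensions that were explicitly granted to this project.
  const extension = Object.prototype.hasOwnProperty.call(rule.extensions, project)
    ? rule.extensions[project]
    : undefined;
  const deadline = extension ?? rule.enforceFrom;
  // During the grace period the owner is only notified; afterwards the build breaks.
  return now.getTime() < deadline.getTime() ? "warn" : "fail";
}
```

In this sketch a "warn" outcome would translate into a Slack notification to the project owner, while "fail" would stop the deployment.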


CI Police was still maturing when it was hit with its biggest task ever: to save the whole company from a devastating, public zero-day vulnerability.



HACK II


Normally, of course, the people who handle zero-day patching are cybersecurity professionals. The reason CI Police got involved has to do with the nature of the vulnerability. What Alex Birsan discovered was rooted in software registries, like npm.


Roy: npm is where we store our open source packages.


So basically, if you wanted to use React in your project, then you would run npm install react and then it would pull it from the registry and you can just use it after you install it. And this is all good and well, but in a company such as Wix or in other large companies, larger companies, we have our own private npm registries where we store our private packages. You know all those packages that you wouldn’t want just anyone from the internet to see or install.


This vulnerability among other things stated that under some circumstances, an attacker could switch a private package that was not published to the public registry with their own code.


How they would do that is that they would create a similarly named package - a package with the exact same name, not similar, the exact same name on the public registry.


But this package contains their malicious code. And then under some circumstances, npm would install this package instead of the private one.


Confusing npm into replacing a private package with a public, malicious package by giving the public package the same name. Birsan called it “dependency confusion”.


Roy: This essentially allows an attacker to inject code into either your workstation or even servers in case you run the installation process on your server instead.



FIX: SCOPE


Protecting against dependency confusion wouldn’t require any software patching or ordinary cybersecurity measures at all. It simply required that the names of Wix’s private packages couldn’t be copied. The tool for making that happen is called “scope” - kind of like trademarks, but for code packages.


Roy: The special thing about scopes is that they are owned by organizations. And we had the “wix” scope owned by us. So essentially, what we had to do was to rename all of our internal projects to use this scope.


Renaming files seems like the easiest thing in the world. But as we’ve learned: applying any technical decision across a large company is a headache.


Roy: Because here at Wix, we have over, I think, 4,000 active Node projects. And when we started rolling out this test to see how many of them are using what we call unscoped packages, we found that over 3,000 were. So we had to figure out, we had to fix 3,000 to 3,500 projects. And you know there are dozens of different teams working on them and hundreds of different developers developing them. And they have their own priorities and their own features, and their own deadlines.


Scoping over 3,000 packages would take a while, but the company couldn’t wait a while. The threat was imminent. The CI Police budgeted two weeks to fix everything.



IMPLEMENTING THE FIX

Roy: So what we did was to create a rule that tells you to use the scoped version of every package.
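
A rule like the one Roy mentions could, in rough outline, scan a project's manifest for internal packages that are still referenced by an unscoped name. The sketch below assumes a few things the episode does not spell out: that the rule has a list of internal package names to compare against (`internalNames`), and that the scoped name is simply the old name prefixed with `@wix/`. Both are illustrative guesses, as are the function and type names.

```typescript
// Hypothetical sketch; names and the renaming convention are assumptions.
interface PackageManifest {
  name: string;
  dependencies?: { [name: string]: string };
  devDependencies?: { [name: string]: string };
}

interface ScopeFinding {
  current: string;   // the unscoped name found in the manifest
  suggested: string; // the scoped name the owner is asked to switch to
}

function findUnscopedInternalDependencies(
  manifest: PackageManifest,
  internalNames: string[],
  scope: string = "@wix",
): ScopeFinding[] {
  const declared = Object.keys(manifest.dependencies || {})
    .concat(Object.keys(manifest.devDependencies || {}));
  return declared
    // Scoped names start with "@"; only unscoped internal names are a problem here.
    .filter((name) => name.charAt(0) !== "@" && internalNames.indexOf(name) !== -1)
    .map((name) => ({ current: name, suggested: scope + "/" + name }));
}
```

Public packages such as `react` are left alone in this sketch because they are not on the internal list; only internal packages that could be shadowed on the public registry are reported.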


Next, the CI Police started sending out alerts to developers in every corner of the company.


Roy: We would tell you, “Hey, this is the project. These are the dependencies that you have to fix.” And we opened a Slack channel, like a war room of sorts, for everyone to come and consult with us, and at any point in time we could see exactly how many projects were still not passing this test, this rule.


As the first week, and then the second week passed, hundreds of developers working across thousands of projects gradually made the shift.


Roy: When the deadline arrived after two weeks, 3,000 projects were already fixed without much hassle. So this was a huge win for us because even when we had CI Police at first, it took us two months to migrate just 30 projects. And here we had over 3,000 projects and in under two weeks, we were able to close this vulnerability, this very dangerous vulnerability in those projects.



CONCLUSION


The story of CI Police reminds me of being a dad.


If you’re a parent, listening to this, you know how impossible it can be to get one kid to do something. If you have two or three kids, even just figuring out a dinner that everybody can agree on takes forever. If you have more than three kids... Well, good luck with that.


Trying to apply a change across an entire company is really difficult. Developers are just big kids, and getting them all to agree on one thing can take weeks, even months to accomplish. That’s why it’s so important to have something like CI Police: to streamline the process. And it's really not that difficult to implement: Roy and his colleagues may have taken months to perfect their process but, remember, they were only five people in a company of thousands. A drop in the bucket compared to what they accomplished.


Interviewer: It seems like some kind of an outsize responsibility for just a few of you.


Roy: Well, you know, they say that if you took a developer and told them to switch a light bulb, they would write code and they would build a robot, and they would do everything that they had to do in order to automate the task and not change the bulb themselves. So this is essentially what we did here. We created a system of services, of tools, that in essence is not that complicated. But the ability that it gives you as an infrastructure developer and you know as essentially any developer at Wix, far outweighs our – the proportion of our team in the company.


So it’s really, I think it’s a good example of when you implement something right and you have a solid idea and a solid method to implement this idea, then you can make wonders even if you’re just a team of five.


That’s it for this episode, thanks for listening. For a full list of our previous episodes - visit wix.engineering/podacst. The Wix Engineering Podcast is produced by PI Media - written by Nate Nelson, produced by Guy Bin Noun and narrated and edited by me, Ran Levi. Special thanks to Moard Stern from Wix. See you again next episode, bye bye.



