This is the story of the culture, methods, and tools that allow us to develop features in high-traffic, highly sensitive components of a SaaS e-commerce product.
We all know that new features can also cause pain to our users, and of course we all try to minimize the number of critical bugs in production.
Avoiding bugs in production is hard, and trying to avoid all bugs might slow the development process. But I believe it’s possible to move forward quickly with new features while decreasing the number of bugs that reach production and minimizing the impact of those that do.
At Wix, our group develops the e-commerce solution for Wix sites - the platform that allows our users to build their online stores, manage their business, design their websites, and provide their store visitors with a professional and reliable e-commerce experience. When developing new features we must be very careful, since our users’ businesses depend on our platform. Yet we also need to be agile in order to keep up with the competition and quickly give our users the new functionality they need.
As a front-end developer, I work on some of the highest-traffic and most sensitive components of our e-commerce platform - the storefront pages and widgets, such as our product gallery, our Cart page, our Checkout page, and more. These components are quite sensitive, since a regression in them can break the purchase flow for site visitors, who are our users’ customers.
We deploy to production many times a day, and when a code change is deployed, it affects millions of online stores in components that have thousands of hits per minute.
Still, we don’t have pre-production or testing environments. We test our changes directly in the production environment, and when a developer pushes a code change, it can reach production within hours or even minutes, with no manual QA step in the process.
So how can it be done while also making sure we are being very cautious and responsible? Let’s see…
Minimizing regressions during development
During development we try to minimize the number of regressions, meaning we want to add new functionality while keeping the rest of the system working as before.
Of course, we also rely on manual testing by QA and other relevant stakeholders, such as product managers, designers, and UX, but I want to focus on some of our developer infrastructure.
We do that using several methods and tools:
High test coverage: We strive for very high test coverage. Most of the tests are integration tests that run our frontend application through different flows and expect it to behave the same way every time. Since we work with TDD (Test Driven Development), these tests are written along with the code itself. The tests are fast to run, but they use a relatively sterile and isolated environment: they rely on mock data and don’t run next to other components or under the full architecture of a site.
E2E production tests: Using an internal tool developed at Wix, each change in a component is applied to a real production site to check that the site still functions in the same way, both via logical checks and via screenshot comparisons. This tells us that our change has no bad effect in a real-world, non-sterile environment. In this case, the change runs alongside the rest of the components around it. The other components are the same as they are in production, and only the component under test differs from its production state. This helps ensure that our change causes no side effects, and it means a broken test is most likely caused by our change.
Performance testing: During development, using another tool developed at Wix, we continuously monitor our artifact bundle size to check that our change has no major negative effect on it. For example, we might add a new package to our package.json without realizing that it increases the bundle size enormously and causes a regression in the load time of the component. In addition, as with the visual E2E tests, we also test on production sites that the artifact from our change doesn’t cause a regression in the Google Lighthouse score.
Feature toggles: We use feature toggles a lot. Creating the configuration for a new feature toggle is probably the first thing we do when starting to develop a new feature or even a bug fix. The feature toggle allows us to hide our changes behind a wall, so the code ships with our change, but the change doesn’t affect users until we decide to open it. Feature toggles also give us the confidence to make small changes and continuously push them to production, even while the feature is still under development. This is much less risky than pushing one big change. When the feature or bug fix is ready, the feature toggle lets us expose it to users gradually, which also reduces the risk.
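To make the feature toggle idea concrete, here is a minimal TypeScript sketch. It is an illustration only, not the actual Wix tooling: the toggle names, the `ToggleState` shape, and the `renderCartSummary` function are all hypothetical.

```typescript
// Hypothetical toggle names; unknown toggles are rejected at compile time.
type ToggleName = "cart.newSummary" | "checkout.newAddressForm";

// Toggle state as it might arrive from a configuration service (assumed shape).
type ToggleState = Partial<Record<ToggleName, boolean>>;

// A toggle that is missing from the state is treated as closed, the safe default.
function isOpen(state: ToggleState, name: ToggleName): boolean {
  return state[name] === true;
}

// Old and new code paths live side by side; the toggle picks one at runtime.
function renderCartSummary(state: ToggleState, total: number): string {
  if (isOpen(state, "cart.newSummary")) {
    return `Order total: ${total.toFixed(2)} (new summary)`;
  }
  return `Total: ${total.toFixed(2)}`;
}

// Deployed but closed: users still get the old behavior.
console.log(renderCartSummary({}, 10));
// Opened: the same deployed code now serves the new behavior.
console.log(renderCartSummary({ "cart.newSummary": true }, 10));
```

Defaulting a missing toggle to closed is the design choice that makes it safe to push unfinished work: code that nobody has explicitly opened cannot reach users.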
Gradually rolling out a change to production
When development is finished we want to start rolling out our feature or bug fix.
Doing it gradually can help us minimize the effect of a bug in the deployed version.
We gradually expose a change in the product in two main ways:
We deploy our changes to production gradually: When a developer deploys a change, not all users get the new version immediately. An automatic mechanism gradually deploys the new version, serving it to more and more users while monitoring the main KPIs. If a KPI degrades, the process stops automatically and we roll back to the previous stable version.
We open features gradually using feature toggles: After a version containing the new feature or bug fix is deployed, we want to expose the change to users gradually. We usually first expose the feature internally to company employees, trying to reveal issues before they reach real users, and only then do we start the gradual exposure to real users. With sensitive components, like the Checkout page, we might open the feature to specific geos first, checking that there is no drop in the main KPIs and no complaints in our support channels.
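The two kinds of gradual exposure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the mechanism used at Wix: the exposure steps, the `readSuccessRate` callback, the success-rate threshold, and the hash-based bucketing are all hypothetical choices made for the example.

```typescript
// Hypothetical types; the real deployment mechanism is not specified in this post.
type RolloutResult =
  | { status: "completed" }
  | { status: "rolled-back"; failedAtPercent: number };

// Illustrative exposure steps, in percent of users served the new version.
const DEFAULT_STEPS: readonly number[] = [1, 5, 25, 50, 100];

// Walks through the exposure steps and stops at the first KPI degradation.
// `readSuccessRate` stands in for whatever KPI monitoring is available.
function runGradualRollout(
  readSuccessRate: (percent: number) => number,
  minSuccessRate: number,
  steps: readonly number[] = DEFAULT_STEPS,
): RolloutResult {
  for (const percent of steps) {
    if (readSuccessRate(percent) < minSuccessRate) {
      // KPI dropped below the threshold: stop and report where it failed,
      // so the caller can restore the previous stable version.
      return { status: "rolled-back", failedAtPercent: percent };
    }
  }
  return { status: "completed" };
}

// Maps a user ID to a stable bucket in [0, 100) with a simple string hash.
function bucketOf(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash % 100;
}

// A user who is exposed at 10% stays exposed at 25%, 50%, and so on.
function isExposed(userId: string, exposurePercent: number): boolean {
  return bucketOf(userId) < exposurePercent;
}
```

Deterministic bucketing keeps a given visitor on the same side of the toggle as exposure grows, so nobody flips back and forth between the old and new behavior during the roll-out.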
Monitoring during roll-out
As mentioned before, we usually roll out gradually while monitoring different aspects of the feature and its KPIs.
Some of the ways we do this:
1. Using a mechanism developed at Wix, we monitor front-end interactions, load time, and success rate. For each widget, we can tell how many hits per minute it gets and what its success rate is, both for loading the widget successfully in users’ browsers and for specific user actions that we monitor. The data is shown in a dashboard for each widget and can be filtered by geo, by device, and more. Most importantly, there is a layer of rules and alerts built on top of this data. If, for example, a broken version reaches production and the success rate falls below the pre-defined threshold, an alert is fired.
2. Another way we monitor client-side code in users’ browsers is by reporting errors to Sentry. Sentry is a tool for reporting errors that makes it possible to aggregate them by the widget they came from and to see the timeline of errors, the version of the deployed code, and more. With Sentry we monitor the error flows rather than the main success flows. It allows us to identify an increase in unexpected errors and quickly understand whether the root cause is a recently deployed version. Each project in Sentry (which usually represents a specific widget/component in the product) has a set of rules defining when to trigger an alert, such as the number of errors per minute we allow.
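As a rough illustration of a threshold rule built on success-rate data, consider the sketch below. The `MinuteStats` and `AlertRule` shapes, the `minHits` guard, and the numbers are assumptions made for the example. They do not describe the internal Wix mechanism or Sentry's alert rules.

```typescript
// Hypothetical shapes for per-minute widget statistics and an alert rule.
interface MinuteStats {
  widget: string;
  hits: number;      // load attempts in the last minute
  successes: number; // loads that completed successfully
}

interface AlertRule {
  widget: string;
  minSuccessRate: number; // fraction between 0 and 1
  minHits: number;        // ignore minutes with too little traffic to judge
}

// Returns an alert message when the rule is violated, otherwise null.
function evaluateRule(rule: AlertRule, stats: MinuteStats): string | null {
  if (stats.widget !== rule.widget) return null;
  if (stats.hits === 0 || stats.hits < rule.minHits) return null;
  const rate = stats.successes / stats.hits;
  if (rate >= rule.minSuccessRate) return null;
  const pct = (value: number) => `${(value * 100).toFixed(1)}%`;
  return `ALERT ${rule.widget}: success rate ${pct(rate)} is below ${pct(rule.minSuccessRate)}`;
}

const cartRule: AlertRule = { widget: "cart", minSuccessRate: 0.99, minHits: 100 };
// Healthy minute: no alert.
console.log(evaluateRule(cartRule, { widget: "cart", hits: 1000, successes: 998 }));
// Degraded minute: an alert message is produced.
console.log(evaluateRule(cartRule, { widget: "cart", hits: 1000, successes: 950 }));
```

The minimum-hits guard is one possible way to avoid noisy alerts on low-traffic minutes, where a handful of failures would otherwise swing the rate.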
Taking actions during production incidents
So we’re developing features and fixing bugs with a lot of responsibility and ownership, taking as many precautions as we can.
But in such a complicated system, while trying to move forward quickly, pushing many changes to production a few times a day, opening feature toggles and experiments, things might break.
Here is some of what we do to solve production issues as fast as we can:
On-call duty, 24/7, 365 days a year: At any moment we have an on-call person who monitors alerts, checks our urgent channels in Slack, and investigates whether an alert we got is a real production issue. The on-call gives the first emergency response and then involves relevant developers who might know a specific area of the product or a recent change.
Closing a feature toggle: If there is a real issue in production, the first thing we prefer to do, if possible, is to close the specific feature toggle that exposes the faulty behavior. This is a better solution than rolling back to a previous version, as it hides only a specific change without reverting the many other changes that were included in the last deployment.
Rolling back to the last stable version: If closing a feature toggle is not an option, we might roll back to a previously deployed version. This immediately serves users a version that does not contain the change that caused the issue.
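The order of preference between the two mitigations can be written down as a small decision function. This is only a sketch of the reasoning above. The `Incident` shape, the toggle name, and the version string are hypothetical, and a real incident involves human judgment that this does not capture.

```typescript
// Hypothetical description of an incident as seen by the on-call.
interface Incident {
  // Name of the toggle that exposes the faulty behavior, if one was identified.
  suspectedToggle?: string;
}

type Mitigation =
  | { action: "close-toggle"; toggle: string }
  | { action: "rollback"; toVersion: string };

// Prefers the narrowest fix: close one toggle if possible, otherwise roll back.
function chooseMitigation(
  incident: Incident,
  openToggles: ReadonlySet<string>,
  lastStableVersion: string,
): Mitigation {
  const toggle = incident.suspectedToggle;
  if (toggle !== undefined && openToggles.has(toggle)) {
    // Hides only the faulty change and keeps everything else that was deployed.
    return { action: "close-toggle", toggle };
  }
  // No toggle to close: serve the previously deployed stable version instead.
  return { action: "rollback", toVersion: lastStableVersion };
}

const currentlyOpen = new Set(["checkout.newAddressForm"]);
console.log(chooseMitigation({ suspectedToggle: "checkout.newAddressForm" }, currentlyOpen, "1.41.0"));
console.log(chooseMitigation({}, currentlyOpen, "1.41.0"));
```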
Investigating production incidents so they will not recur
When a production incident affects our users, we run a post-mortem process.
We try to learn from what happened so that the same issue will not happen again and so that we can prevent similar issues in the future.
The on-call and the responsible developer research the issue, looking for the root cause and suggesting actions we can take so that a similar issue will not happen again.
For example, the outcome might be a missing test that could have prevented the change from reaching production, or a faster and more precise alert that could have helped us catch the root cause sooner.
This process does not exist to find who is guilty or whom to blame. It is a process that should help us learn and improve.
To sum up
As developers working with such high-traffic components, in a very sensitive area of the product, we must be very careful and be responsible for what we do.
But in addition to being as responsible as we can, we must move forward fast and develop our features fearlessly.
This is part of the spirit of being a developer at Wix - own the feature from its initial design to full exposure to our users. And this is very exciting for a software developer, reaching so many people so quickly.
It might sound contradictory, but we believe that we can combine the two - being responsible and professional, while changing things quickly and fearlessly.
This is done both by the tools and methods we developed, and by the spirit of being a developer at Wix.
This post was written by Guy Segev