Updated: Apr 12, 2020
More and more times in the last few months our team started getting alerts during non-working hours about issues in our databases. What issues? Mostly related to inconsistencies or data lags between our databases across data centers. As our user base continues to grow enormously, we encounter difficulties with our databases` scaling. Our database is a pretty big MySQL DB with over 12TB of data.
We don’t like to be awake during non-working hours. Who does? So we figured it’s probably the right time to trigger a research in order to find a scalable storage solution. The first phase of the research was to define the prerequisites for such storage solutions and the criteria upon which we will compare the different storage solutions. We came up with 4 alternatives that we wanted to explore: Amazon S3, Cassandra, Google File Store, and DynamoDB. I thought it was an amazing research so why not share the details here?
Our current storage: Wix sites are built of pages. A page is nothing but a json file. Our users pages data is stored as jsons in MySQL DB in an immutable schema. In other words, every change made to a site page by the user implies a new page will be inserted to the DB with a new unique id. Users pages json size is around 10K. Basically, our database is used as a key value store where pageId is the key and the json is the value. In addition to the DB storage, all our users pages are uploaded to Amazon S3 asynchronously.
Our Network Topology: Wix editor servers currently run in data centers in USA east and west with a mySql master DB on each data center. We run in Active-Passive mode so that at any point in time there is only a single active DB, and the data is replicated from the active DB to the passive DB.
The scale problems: As our user base continues to grow enormously, we faced 2 major problems:
DB had to continuously purge old pages, so that it doesn’t grow to an enormous size which is not maintainable nor recommended. Purging was an offline process that had different issues. The DB size was already around 12 TB, and if the purge wasn’t working for some reason, it grew very fast. We had to keep purging so we don’t lose control of the DB size.
We started getting lags between the active DB and the passive DB. In other words, the passive DB is lagging some seconds after the primary DB. That implies that if there is a need to switch between the data centers so that the current primary DB becomes passive and the “old” passive DB becomes active, we can have a problem. Users might not see their latest site changes since the data was not yet propagated to the secondary DB.
So what do we do about the scale issues? We started with some short-term solutions that helped us significantly reduce the number and length of lags experienced. However, we knew it was all good for the short run. At one point or another, we will face scaling issues again and we better invest in research of a scalable data storage technology.
Data storage technology research: Even before starting any research, there were already a few people telling me about their own experiences and how they switched to this “amazingly best DB you’ve ever seen” which solved all their problems. I realized I need to be precise on what exactly it is that I am looking for in a DB. What are the prerequisites for any persistency solution to be even considered? Also our specific use case in this situation is that we basically need a place to store JSONs and not a general purpose database.
There was one persistency characteristic that was very clear. Whatever technology we eventually choose needs to free us from scaling issues “forever”. We should not invest now in switching from our mysql DB solution to some new technology only to find out in 2 years from now that it can not scale any further. It needs to be a technology that is scalable by design. As a guideline, we defined that the new persistency technology should be able to handle 10 times our current production traffic.
It is also clear that we need a mature technology. Wix editor is in the core of Wix functionality and is used by all Wix users. This is not the place to take excessive risks or experience with some new cutting edge fancy technology.
So we defined the following criteria for our next storage solution:
Must have criteria:
Scalable “forever” - as described above
Market Mature solution - as described above
Strong Read consistency (read after write) as users expect to be able to read their data right after it was written.
Can be replicated across remote Data Centers - as explained above
Other comparison criteria:
Minimum operational investment (Obviously ideal for cloud solutions)
Selected DB technologies: So what are the technologies that meet the above must have requirements? We found the following technologies as potential candidates:
Amazon S3. S3 scales “forever”. No operational investment required as it’s cloud based. It has no limit on the amount of files you can store on it and is relatively cheap. Also, our users page data is already stored there, so it only comes naturally that we should consider having our servers consume page data directly from S3.
Cassandra. Cassandra is mature, reliable and is highly scalable. This is not a cloud-based solution, meaning we have to invest in the operational side. However, it was designed for scalability and has built in network topology support. It supports strong read consistency and eventual consistency.
Google Cloud File store. Their file store interface is suppose to be fast, reliable and scalable by design. It is cloud based. We need to research it further.
Amazon DynamoDB. Dynamo should be able to scale easily. It is available for quite a few years already and since our app servers run on Amazon DCs, then the network latency is probably going to show low numbers.
“Production” stress test: So we got the 4 candidates defined. Now how should we compare them? It is very common to do a stress test on the DBs using one tool or another, but we thought of a live traffic alternative. We used a concept we called “Multiplication Factor”. What is it? Imagine you could clone Wix live production DB traffic to the alternative DB that you want to research. No simulations - Real traffic! Then you are able to measure and compare it.
How do you avoid hurting real users if there is an issue with the alternative DB? You first write to the major DB - our mySql and only then write to the alternative storage. So if it had an issue it would not impact the major DB data which was already written. Also, we added the ability to multiply writes to the alternative DB so you can define the ratio of the traffic that you want to use for the alternative DB. You can get only 10% of real traffic or 20% or whatever, BUT you can also multiply the traffic so you get 200% traffic or 10 X traffic. That gives you a much higher confidence whether your DB can handle live traffic 5 years from now, right?
A few ideas were raised and, eventually, we came up with the following list as comparison benchmarks:
Write response time during load scenario
Read response time during load scenario
Load scenario - Use multiplication factor to write X times our current traffic in prod to a storage POC solution. Storage POC should be pre-filled with pages to better simulate a real-life scenario.
Prepare for the “Production” stress test:
To be able to compare the solutions, they need to contain the same amount of data. Obviously an empty storage performance will be different from the performance of a storage filled with huge amount of data. So a duplicator service was created. The service copies data from the original DB into the selected storage solution. This might take time if you have a lot of data, but it allows you to fill the storage so that it can be compared to real production solution.
Once the data was filled, we could actually start multiplying the production traffic against the selected storage solution and measure our KPIs.
Running test on each storage
Running Test against MySql:
The first thing I did was to create the benchmark based on our current MySql DB. No changes in code were required. All I needed was to collect the performance numbers under actual traffic conditions over some time. We never use average on performance measurements. Instead we always use percentile results. We measure p50, p95 and p99. This is what I got:
Running Test against amazon S3:
Now that we had a benchmark we continued to the next challenger - Amazon S3. Currently all user pages are uploaded to Amazon S3 so to collect the write performance I didn’t need to make any changes to the code. I only collected the numbers. However, to get the read performance data I did use the multiplication mechanism, so that every time a user opens their site in Wix Editor the server downloads the site pages from Amazon S3.
The server does it in a different thread so that it doesn’t interfere with serving production traffic but with the actual load of production. This is the power of the multiplication factor. The results were:
Running Test against Cassandra:
Before starting a real test in Cassandra, you must create a suitable schema. By default Cassandra allows querying only based on the partition key. However, in our MySql implementation, we query for the pages either by the page id or by the site id in which case return multiple pages.
The schema partition that I created had site_id as the primary key then I also created a MATERIALIZED VIEW based on page_id. Materialized view in Cassandra allows querying based on a field which is different from the partition key. The view is maintained by Cassandra upon each insert done to the table and as such allows you to query based on it as if it is a “standard” index. We used network topology of LOCAL_QUORUM for read and write which guarantees we are strongly consistent.
The next step was to fill Cassandra with data, so that the multiplication tests happen in a DB that resembles actual production environment. As explained above, DB comparisons must happen when they are filled with data to make sense.
Once this data was ready I started measuring performance with 1:1 to actual production traffic.
Testing write requests was easy as I was simply inserting a page to Cassandra in parallel to each page inserted to Mysql. With read requests I could not use actual copies of production requests for page ids as it was not guaranteed that all pages are available in Cassandra. To address this I used a few thousands page ids that I pulled from Cassandra and used them randomly in each read request. Now each production request to a page results in a similar x pages request to Cassandra but with a different (random) page ids.
Running Test against DynamoDB:
I set up a table in Amazon and started learning how to use it. The feature set is amazing since it allows replicating across different regions using what Amazon calls “Global table”. It only requires a few clicks and it’s set. Also since our use case required querying both by page id and by site id I created the schema with site id as the primary key and with page id as a global index.
Once again I used 1:1 to our current production traffic for write. Every page inserted to mysql resulted in a page inserted to DynamoDB. With read requests I used the same random trick on DynamoDB as I used on Cassandra. Then I measured with 1:1 traffic.
Capacity Units and related exceptions in Dynamo: Unfortunately the simplicity of setting up a DynamoDB table across different regions with multiple global indexes comes with a cost. Part of setting up the DB requires defining read and write capacity units for the DB. Essentially, it tells Amazon how many actual read and write KB per second are going to be transferred from and to the table.
Higher numbers tell Amazon it needs to prepare for higher load and possibly increase the number of instances. What happens if the actual load is higher than the capacity defined unit numbers? Well, you get an exception in run time. Amazon provides a mechanism to support setting flexible limits based on actual traffic, but your code should still be prepared for run time exceptions.
After browsing the net and talking to people who use Dynamo in production, I learned that setting up capacity units makes Dynamo a less-flexible solution because no matter how high your thresholds are, you should still prepare for some exceptions. Taking these complexities into account led me to drop DynamoDB as a valid solution. I realized that choosing to work with DynamoDB implies that your code needs to handle capacity exceptions and I was not up to it.
Running Test against Google Cloud File Store:
Google File store is more like a file system. We used the same http upload and download mechanism we have for S3 in order to upload and download from Google File Store. Upload was not done by my team so I only collected the read performance. See below the data:
As you can see, the numbers were not really better than Amazon S3, so there was no good use case for us to seriously consider moving to Google File Store over Amazon S3 which we have already worked with.
We were done with performance testing using 100% of real production traffic with the different storage solutions. Below you can find the results.
The decision which storage is the right fit for us was now easier as we had hands on experience with each solution as well as actual performance numbers against our current production traffic. We learned that DynamoDB requires setting capacity units correctly and always prepare for exceptions in production as explained above, so DynamoDB was eliminated in spite of relatively good performance numbers.
Google file store was not performing better than the others, so we were left out with one decision - Amazon S3 vs. Cassandra.
Cassandra provided the best performance, however, the operational overhead of running your own Cassandra instances had to be on our internal DBA team. Amazon S3 provided less attractive latency numbers, BUT we figured, they were good enough given users total transactions expectations. On the bright side, selecting S3 meant we did not have to consider the operational overhead any more.
We would be able to scale without dealing any more with increasing DB size or lags between data centers. We thought it was the right choice for us and we decided to go ahead further with a full move from our current MySql to fully cloud Amazon S3 and further test the decision in fully working prod environment.
So that was it. We made our choice. We spent the next few months, in order to move all our persistency from mySql to Amazon S3. It was a major effort for the team, and it is now over. All our traffic is now served from Amazon S3 and so far things seems to work fine. We didn’t have any major issues up till now. Still a little early to be certain, but it looks like we were correct with our choice. To make sure our performance wouldn’t be impacted, we also added a “write through” cache layer. This layer performs really well with the active users. As a result, in spite of not choosing the fastest solution, we still get good performance.
If you followed me up till now, then I hope you found this post somewhat valuable. Below I have summarized the main points I tried to make for better clarity:
DB lags and increasing DB size concerns triggered a data storage research
Starting with a definition of must have characteristic for our next storage solution
Coming up with a list of possible storage solutions meeting the required characteristics
Defined the benchmark comparison criteria
Use actual production traffic on every storage solution that we decided to research to measure latency results for read and write of each solution
Summarize the latency as well as other points identified while hands on working with the solutions to make a decision where to invest further
Thank you for reading my post. I’ll be happy to answer any questions, or event better - to hear about your research or thoughts about different persistencies used in your firm.
This post was written by Eyal Malron
You can connect with him on LinkedIn
For more engineering updates and insights:
Visit us on GitHub
Subscribe to our YouTube channel