Updated: Dec 24, 2019
The following is based on a true story, some table names and data have been changed to protect the privacy of the ones concerned.
Let me tell you a story about an optimistic locking and the synchronization challenges databases faced in the highly-available environment.
Optimistic Locking is a strategy where, when you read a record, you also note the record version number (via a timestamp, auto-incremental ID, etc.) and before you write the changes to the record, you check that the version hasn't changed.
If the record has changed and is dirty, you abort the transaction and ask the user to repeat the action. For example:
Imagine two users who update Smith’s first name simultaneously via a User Interface.
12:00 - User A opens the user interface (UI), changes the “first name” to “Peter” BUT doesn’t click submit. It means the change hasn’t reached the database yet.
12:05 - User B opens the UI to see that Smith’s first name is John, and changes it to “Sarah”. User B does click the submit button, and the change lands in the database.
12:10 - User A finally clicks submit and receives an error message: “Data has been updated by another user!”
How does it work behind the scenes?
In Step 1, the client gets the data (John Smith), and the data version number (1542286500000) to.
Same happens on Step 2 and when user B submits changes to the data, the server uses the following query to push update to the database:
Update <table> set first_name = ‘Sarah’, set version = 1542286599999 where id =1 AND version = 1542286500000
As the version passed by the client matches the one in the database, the update is successful. However, on Step 3, when User A submits their change, the data version the client uses does NOT match the one in the database anymore (remember, it was updated on step 2). Server throws an exception.
This is the commonly expected behavior for optimistic locking.
Let’s scale and see if that is still true
Imagine a scenario where multiple MySQL databases reside on several hosts across two Data Centers (DC) with high availability configured in the Active-Passive mode.
It means we write only to the database in the Active DC and the data gets replicated to the passive DC. The passive DC is used in read-only mode.
It is important to note that this replication doesn’t occur instantly and might occur some delay.
In MySql, the two major ways of data replication are:
Statement-Based Replication: the “leader DB” (i.e., active database) writes SQL statements to the binary log and the SQL statements are executed on the replica (i.e., passive database).
Row-Based Replication: the “leader” writes events to the binary log that tracks changes to the individual table rows. These events are then copied to the replica.
What can go wrong?
Problems might arise when several MySQL servers are in Active-Active mode. In this case, both DCs will receive traffic simultaneously. This can happen, for example, if the transition period from Active to Passive DC takes too much time.
When the database servers are in this mode, and the same entity is being updated at the same time across different Data Centers, we are in trouble. The longer we are in the Active-Active mode, the higher is the chance of these conflicting updates.
In the real world, this may happen. Here is an example with an overlap period of a couple of hours:
Now, imagine the following:
DC 1 & 2 are in sync and the data is:
DC 1 is the active data center and the data stored there gets updated to:
As this change gets replicated to DC 2 with a “statement-based” replication, the same query that was executed over the data in DC 1 is now executed for DC 2:
Update <table> set first_name = ‘Peter’, set version = 1542286599999 where id =1 AND version = 1542286500000
Let’s imagine that at the same time DC 2 became active, and received a portion of traffic that updated the same record, for example:
In this case, the update from the replication statement above won’t succeed in DC 2 because of the version conflict. The version is now different between DC 1 and DC 2.
We are now in an undesirable situation of corrupt data, where the same row has different values across DC’s. From now on - because of optimistic locking - and “statement based” replication, the data will never be synchronized again.
This can have very negative effects. Depending from which database you read, you’ll get different data for the same entity!
This is the important part: Once the “version” value is different for the same entity across DC’s, the replication will never work for that entity. This is because the optimistic locking mechanism is using that “version” value in all the update queries and the “statement based” replication execute that same query across all DC’s. If the “version” value in the passive DC is different, the “where” clause (version = <version number>) will be the the reason why the entity won’t be updated.
How do we fix that?
The first solution would be to look for the latest update time instead of the exact match of the source record on the client and on the server (i.e. instead of version = <version number> use version >= <version number>). It will make the latest change apply on both DCs.
Another way to minimize this problem is to use “row-based” replication. We mentioned it before, but let’s recap: With “row-based” replication, the changes in the “leader” database are unconditionally propagated to the “replica” database.
By using “row-based” replication, we will still temporarily encounter the same problem of conflicting changes, however, once we go back to a normal situation (i.e. one “leader” database) the data will get replicated correctly from active to passive DC.
The “corrupt” data in the replicas will be overridden by the changes that occur in the “leader” DB when using “row-based” replication.
Another way to deal with this problem is to use event sourcing architecture. With event sourcing, you don’t update events, you only insert them. If we receive two concurrent events where the same record is updated, we can determine whether they are actually conflicting or not.
The application can replay events and use the self-healing mechanism to handle the cases that are considered corrupt and therefore reach a highly resilient system.
There is much more to say, but that is a part of another story…
This post was written by Laurent Gaertner in collaboration with Maxim Zabuti
For more engineering updates and insights:
Visit us on GitHub
Subscribe to our YouTube channel