MVCC stands for MultiVersion Concurrency Control. It’s a way to implement transactional isolation in any application that needs to manage the state of data items.
The traditional mechanism for state management in a database is concurrent control using read-write locks. However, this solution is based on blocking and can create contention. MVCC is not blocking. This means that multiple clients can read and write simultaneously - reads do not block other reads and writes do not block other writes or reads.
What is Isolation?
The four cardinal principles of Database transaction are expressed as the accronym ACID:
- Atomicity : all-or-nothing execution
- Consistency : constraints about acceptable post-transaction states
- Isolation : each transaction must appear to be executed as if no other transaction is executing as the same time
- Durability : the condition that once completed, the effect on the database of a transaction, must never be lost.
The principle of Isolation guides our efforts to guarantee that when a database client initiates a transaction, the data that the client sees will always be the same. Regardless of other ongoing transactions, including ones that are modifying database records.
A small example
Alice works at Transparent Inc., a progressive company that allows its employees to check everyone’s yearly bonus via an Internet browser. When Alice’s browser (client) loads the webpage sent from the company’s accounting server, she’s shocked and disappointed - “Where’s my freakin’ bonus?”
| LastUpdated | Employee | Bonus |
|---|---|---|
| 10:00 | Alice | $0 |
| 10:00 | Bob | $500 |
| 10:00 | David | $500 |
Feeling quite upset, Alice calls accounting:
“Accounting, you’re talking to Mallory. What can I help you with?”
“It’s Alice. I’m looking at my bonus for this year, it’s $0. What’s going on?”
“I just calculated your bonus and updated it just before you called.”
“Nope. All I see is $0.”
“Well, I’m looking at your bonus right now and it’s not $0.
Have you tried refreshing your browser?”.
Still annoyed, Alice clicks the refresh button on her browser. As her browser loads a new version of the page, Alice mood goes from happy to surprised.
| LastUpdated | Employee | Bonus |
|---|---|---|
| 10:15 | Alice | $500 |
| 10:00 | Bob | $500 |
| 10:15 | Celine | $3000 |
Alice smiles, but then reacts.
“Wait! What happened to David’s bonus, and who’s Celine?”
“The company CFO just called and instructed me to remove David’s bonus. Apparently, David is under police investigation for stealing company secrets. Celine is the newly applied company Ai. It’s been delivering amazing results.”
The example illustrates three expected behaviours of the database.
If Mallory at Accounting starts a transaction that:
- Modifies a record (Alice): Alice will still see her bonus as being $0, since Mallory’s update transaction had not completed by the time Alice’s browser sent a request for the page.
- Deletes a record (David): Alice will still see her bonus (until she refreshes her browser)
- Adds a record (Celine): Alice won’t see the effects until she refreshes her browser (sending a request for the new version of each record).
There are exception to these rules (skew).
The Read-Write Lock Challenge:
In the early days of databases, a common solution to handle concurrent readers and writers was to simply lock the entire database when someone wanted to read or write.
Some of the downsides of Locking:
- if we lock the whole database then readers and writers have to wait - this creates contention.
- if we do not lock the database, and allow writers to write when they want, then readers might get inconsitent data.
- our database become more durable if we can easilty abort the process and rollback any changes we intended to make.
Let’s introduce some terminology
There’s quite a lot terminology associated with MVCC - we certainly won’t have time to explore much of them here. For the interested reader the book is referenced in the MVCC wiki page.
In this post we’ll limit the scope to a few essential concepts that we can use to build rudimentary MVCC mechanisms.
Record
We need a term for the smallest unit of information that the MVCC will handle.
In a traditional relational database, a record in database terms refers to one single cell of data.
In this post we’re going to assume that a record refers to a single cell of data in a row in a database.
The fundamental assumption is that a record is discreet and cannot be simultaneously modified by multiple entities/ clients/transactions.
In a graph database we’re are going to define a record as a graph element (a node, or an edge).
Collection
A collection is a set of records that enable us to simultaneoulsy (in one transaction) modify (create, read, update, delete) a group of records.
A single transaction can include multiple collections. We can think of collection as a table in a database. All records that share the same Transaction Id are also part of the same collection.
The important thing here is that a transaction is not isolated to a single collection as long as each of the records share the same transaction IDs.
Transaction Identifier
Every transaction in the database is given a unique identifier, a transactionId.
The unique identifer could be a number but it doesn’t have to be.
A transactionId is the unique number for the transaction.
All records that have been modified under the same transaction can be saved or rolled back as one atomic operation, which is ultimately what we want.
There are multiple ways to identify transactions, here are three popular ways:
1. Incrementing number
For each new transaction, an atomic counter generates a nonce (number used once), and assigns it as the transactionId. This happens before any changes to the collection of records occur.
The transactionId represent all the record changes. Regardless if the record changes are committed or discarded, a nonce can only be used once for the same transactionId.
For in-memory mvcc, all transactionIds must be persisted to avoid collisions from a nonce being used more than once if the application needs to be restarted.
2. Timestamp
Each transactionId is assigned a high granularity timestamp. In order to avoid a a collision (two transactions with the same timestamp), the granularity needs to reflect the probability of two transactions being spawned close in time.
Timestamps enable the database system to recreate or restore the exact state of the collections to some point in time.
3. UUID, Universal Unique ID
UUID is a popular way to assign unique identifiers. It should be noted that UUID does theoretically include the risk of collisions, even though the risk is extremely small. is note Special cases where you are working with distributed databases or have other requirements that are not satisfied by the two types above. This may include UUID based algorithms and may not even be strictly a number.
In this post we’ll implement
TransactionIduseing anincrementing counter.
MVCC API
With the bare fundamentals in place we are now ready to start building a simple MVCC API. The interface that we want our API to provide is the following:
- Initiate a Transaction payload ?
- Creating a record
- Update a record
- Delete a record
- Commit Transaction Payload (changes)
- Rollback changes
Our intention is to use Rust’s powerful type system to create a generic MVCC protocol.
In order to produce new transactions we need:
- the next
transactionId(idea: use std::iter::Successor trait?)
- an array or set (HashSet?) of the current active transaction
// We initialize a varible that will be responsible for holding the next
// available transaction id number.
// Being treated like a global variable - TODO: find a better solution.
// TODO: look at using UUID
let mut next_xid:u32 = 1;
// We use a vector to hold the list of currently active transaction Ids.
// TODO: change to HashSet?
// TODO: change the type to one that ONLY takes TxId
// We use a BTreeSet to track the order of when a TxId is added
// According to data structure convention, it is a logic error for an item in a BTreeSet
// to be modified in such a way that the item’s ordering relative to any other item
// as determined by the Ord trait, changes while it is in the set.
// Sounds perfect, let's use that.
// TODO: We can also require elements to be of TxId type.
let mut active_xids:BTreeSet<u32> = BTreeSet::new();
// We initialize an empty vector that we'll use to hold records that we want to
// include in a transaction.
let records: BTreeMap<String, u32> = BTreeMap::new();
// new transaction will take next_xid by reference.
fn new_transaction(next_xid: &mut u32) -> Result<Transaction, Err> {
// We prime the next transaction payload by incrementing the atomic counter
// that is in charge of issuing transaction Ids.
next_xid = next_xid + 1;
// Our newly minted transaction is added to the list of active transactions.
// TODO: remove ops: *.contains(X), *.remove(X)
active_xids.insert(next_xid);
// Construct and return a new Transaction.
// If Transaction::new() fails then we bubble up the error.
Ok(Transaction::new().set_txid(next_xid)?)
}
/// Transaction is a builder responsible for creating transactions.
/// We can use to add records that are to be included in the transation.
/// Example
/// ```
/// let mut phonebook:HashMap<String, u8> = vec!({"Alice":123}, {"Bob":555});
/// // The Conductor is in charge of handling all transactions according to MVCC.
/// let c = Conductor::new(phonebook);
/// let mut t: Transaction = Transaction::new().set_txid(1);
/// t1.add_record({"Alice":321});
/// // The Conductor registers the commited transaction, making sure that it gets
/// // persisted to disk, and also marks older versions of a record for deletion.
/// c.register(t.commit());
struct Transaction{
// The unique id of the transaction.
xid:u32,
//TODO: look at the time of this vec, should it be Vec<u32>?
// A list of things that need to be undone
// in the event that the transaction is rolled back
// (meaning, it will be reverted/_not_ committed).
// needs to be ordered, changeable.
// TODO: should it allow duplicate members?
// TODO: decide on type here!
// We use a BTreeSet because we need to store
// in order.
// In the BTreeSet we (orderly) store HashMaps
// because we need keys associated with values
// Like, TransactionType::Add("add") = 1
// TODO: probably should change String into an
// enum.
rollback_actions: BTreeSet<BTreeMap<String, u32>>,
}
impl Transaction {
pub fn new(&self, xid:u32) -> Self {
Transaction {
self.xid = xid,
}
}
}
For each record we want to be able to track two pieces of metadata, that represent a TransactionId, but they are not required to have values:
created XIDexpired XID
If possible, we want to store these two as fields of the record type. Could be stored outside of
memory.
struct Record {
created_xid: Option<u32>,
expired_xid: Option<u32>,
}
Record Visibility and Locking
The visibility of a row depends on who is looking at it.
Every row has to be tested if it is visible from the point of view of the transaction looking at it:
A transaction, $T_i$, is only allowed to read or mutate any record $R_j$ that is visible to the transaction.
So, we have to test the visibility of each record.
// We need to check if the record, that was passed to the function, is already
// included in some other active/ongoing transaction, and if that active/ongoing
// transaction is the instance of Transaction to which record_is_viisible belongs
// or if that ongoing transaction is an other instance of Transaction.
// TODO: check type for record here
impl Transaction{
fn record_is_visible(
&self,
active_xids: &BTreeSet<u32>,
record: &BTreeMap<String, u32>
) -> bool {
// Case: The record was created in an other currently active transaction
// so the record is _not_ an element of this transaction's group of records.
// if active_xids.contains(record.get(&"created_xid")) && record.get(&"created_xid") != self.xid {
// If the xid is active but it does not belong
// to this Transaction we return false.
if active_xids.contains(record.get(&"created_xid")) {
if record.get(&"created_xid") != self.xid {
return false;
}
}
// Case: The record expired or an no transactions holds it that is our own
if record.get(&"expired_xid") != 0 {
// if !active_xids.contains(record.get(&"expired_xid")) || record.get(&"created_xid") == self.xid {
if active_xids.contains(record.get(&"expired_xid")) == false {
return false;
} else if record.get(&"created_xid") == self.xid {
return false;
}
}
// Our checks above indicate that the record is visible to our Transaction.
// So we can return true.
true
}
}
Collision
Simultaneous transactions cannot make a modification to the same record. There are two strategies we can use to handle conflicts // If this happens there are two ways to handle this:
- Rollback
We abort and rollback the transaction that tried to make the most recent change(s) (the tx with the most recent timestamp or highest xid counter value).
We can also propagate the error back to the original client. - Blocking
We await (block) the second transaction until the record that the transaction wants to modify becomes available.
This strategy has performance challenges and potential reread errors.
Rollback strategy is the safest one, so let’s use that.
// TODO: why is this step necessary?
Is the Row Locked?
When we do need to modify a record we need to check if it’s locked by another transaction with this:
// TODO: Should the record be a BTreeMap<String, u32>?
impl Transaction {
fn row_is_locked(
&self,
record: &HashMap<String, u32>
) -> bool {
if record.get(&"expired_xid") != 0 {
// if !active_xids.contains(record.get(&"expired_xid")) || record.get(&"created_xid") == self.xid {
if active_xids.contains(record.get(&"expired_xid")) == true {
return true;
}
return false;
} else {
return false;
}
}
CRUD Time!
Adding a Record
// TODO: why?
We set the created_xid to the current transactionId and expired_xid to 0:
So the record gets stamped with an id indicating
which TransactionId that it got included into.
We also set the expiration date of the record to 0.
impl Transaction {
// The record's attribute _created_xid_ gets the TransactionId value of the Transaction that included it in its collection of records that its handling.
// TODO: change "created_xid" into an enum
// Record::{TransactionCreattionId, TransactionExpirationId}
// record.get()
fn add_record(&self, &mut record: &mut HashMap<String, u32> ) -> Result<(), Err> {
record["created_xid"] = self.xid;
record["expired_xid"] = 0;
// rollback_actions: BTreeSet<BTreeMap<String, u32>>,
// Unclear if this should be:
// .insert or .append ... append adds the element to the end of the list
// insert adds the element to the specific position.
// We insert a "delete" operation at the correct index of the rollback set
// If we decide to rollback then the rollback function well see that a "delete"
// action was queued at this sequence slot of the transaction.
let mut action:BTreeMap<String, u32> = BTreeMap::new();
action.insert("delete".to_string(), records.len());
self.rollback_actions.insert(action);
Ok(records.append(record))
}
}
Deleting a Record
/////////////////////////////////////////////
In deleting a record, we need to handle two different scenarios:
-
- The record has an
expired_xidor $0$.
That would indicate that the record has never been deleted - not but the TransactionId that is checking, not by any other transaction. Changing theexpired_xidto theTransactionIdwould mark therecordas being eligable up for deletion (by a vacuum procees or something similar, see below.)
- The record has an
-
- If the
expired_xid!= $0$ thenexpired_xidis an active transaction. That would indicate that therecordhas been deleted by another active transaction. In that case we can’t modify therecordin any way - it’s locked.
- If the
Due to visibility constraints,
expired_xidhas to be an active transaction.
// We pass the identifier of the record we
// want to delete.
// We also need to pass, by reference, our records store.
impl Transactions {
fn delete_record(&self, id:u32, &records:HashMap<String,u32>) -> Result<(), Err> {
// We iterate over each record in our record store, and check if the id of that particular stored record matches the id of the record that we want to delete.
for (i,record) in records.iter().enumerate() {
if self.record_is_visible(&record) && record.get(&"id").unwrap() == &id {
if self.row_is_locked(&record) {
return Err("Row is locked by another transaction");
} else {
record["expired_xid"] = self.xid;
let new_rec: HashMap<String,u32> = HashMap::new();
new_rec["add"] = self.xid;
self.rollback_actions.append(new_rec);
// TODO: Should this really return ok?
return Ok(())
}
}
}
}
}
Updating a Record
To modify a specific record we delete the older version of the record, and add a new version of the record.
The existig record will still be visible by other transactions.
// TODO: verify logic of block from failed update.
If the call to fn update_record() fails then
its error will bubble upstream and block the subsequent call to fn add_record.
fn update_record(&self, id:u32, name:String) -> Result<(), Err> {
self.delete_record(id);
let new_record_version:HashMap<String,u32> = HashMap::new();
new_record_version.insert("id", id);
new_record_version.insert("name", name);
self.add_record(new_record_version);
}
Commit Changes
Once all the modifications have been made we need to commit all the changes so that future clients/transactions can see these new changes.
Once all the modifications to the records have been made it’s time to m commit the changes.
Once the changes have been commmited, other transactionId will be able to see the changes.
Clientes, like Alice browser in the earlier example will be able to see the changes
once the web page has been reloaded.
completedthe TransactionId has completed all its work (things like update, reading,
deleting records), it’s time commit the changes.
We do this by removing the TransactionId, from active_xid (the list of active transactions).
impl Transaction {
// We remove our TransactionId from the list of
// active transactions.
// TODO: Improve error handling here.
fn commit(&self, &mut active_xids) -> Result<(), bool> {
active_xids.remove(&self.xid)
}
}
The time complexity of `fn commit() is $O(n)$ so long running transactions with a large number of changes should be no problem.
Rollback Changes
If for some reason we have to abort the transaction then we can do so in multiple ways. Rollback means performing the modifications the transaction did, but in revers.
enum TransactionTypes {
Add("add"),
Delete("delete"),
}
impl Transaction {
// In order to walk back our intended modifications we iterate over rollback_actions in reverse.
// first element is the action "add", "delete"
// second element is the index of the record in question.
// records is a BTreeMap<String, u32>
fn rollback(&self, &mut records) {
for action in self.rollback_actions.iter().rev() {
// Each action that we get is a HashMap containing the action type
if let Some((key, value)) = action.first_key_value() {
match key.as_str() {
"add" => {
// Our action previously added the record to our `TransactionId`,
// so now we rollback by removing it from our `TransactionId`.
// By setting the expired_xid attribute of the record in question to 0
// we state that the record is no longer an element of an expired `TransactionId`.
// TODO: implement removal logic
},
"delete" => {
// Restore the deleted record
// TODO: implement restoration logic
},
_ => {
// Unknown action type
}
}
}
}
// Finally the TransactionId is set as being _inactive_ by removing it from the list of _active transactions.
active_xids.remove(&self.xid);
}
}
Recovery
Our fn rollback is dependent on two assumptions:
-
- In case of an application crash, any transactions that were in the process of being rolled back will continue to finalize that process once the application is restarted. Otherwise we would get a corrupt db. One strategy to mitigate this risk is to continuously persist ongoing TransactionIds to disk and then use that persisted data during application bootup to return the application to its correct state.
-
- We need to verify somehow that all modifications actually were rolled back. The application needs an efficient way to do this.
Vacuuming or Reclaiming Space
Since we are not actually deleting data, only marking some data for deletion, with time our in-memory database will grow to be very large.
We need to periodically purge dead transactions
by going through all records and delete the ones that are dead, in order to reclaim space in memory.
How do we distinguish between alive and dead records?
TODO: clarify rows or records below.
We can simply write all alive rows/records to a new file, drop all the rest from memory, and then read in the records from the file.
Death Row
Regarding row visibility:
Records that have an expired_xid != 0, and do not
appear in the active_xids, are to be considered
dead rows and should not appear when fn row_is_visible() is called.