Diffstat (limited to 'doc/features/bit-rot/object-versioning.txt')
-rw-r--r-- doc/features/bit-rot/object-versioning.txt 236
1 files changed, 0 insertions, 236 deletions
diff --git a/doc/features/bit-rot/object-versioning.txt b/doc/features/bit-rot/object-versioning.txt
deleted file mode 100644
index def901f0fc5..00000000000
--- a/doc/features/bit-rot/object-versioning.txt
+++ /dev/null
@@ -1,236 +0,0 @@
-Object versioning
-=================
-
- Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
- also known as the "object signature". An object is signed when there are no active
- file descriptors referring to its inode (i.e., upon the last close()). This is just a
- hint to initiate hash calculation (and therefore signing). There is absolutely no
- control over when clients can initiate modification operations on the object, so
- an object could be under modification while its hash computation is in progress.
- It would also be inappropriate to restrict access to such objects for the
- duration of signing.
-
- Object versioning is used as a mechanism to identify the staleness of an object's
- signature. This document does not just list the version update protocol, but
- also goes through the various factors that led to its design.
-
-NOTE: The word "object" is used to represent a "regular file" (in the Linux sense), and
- object versions are persisted in extended attributes of the object's inode.
- Signature calculation covers the object's data only (no metadata as of now).
-
-INDEX
-=====
- i. Version update protocol
- ii. Correctness guarantees
- iii. Implementation
- iv. Protocol enhancements
-
-i. Version update protocol
-==========================
- There are two types of versions associated with an object:
-
- a) Ongoing version: This version is incremented on the first open() [when
- the in-memory representation of the object (inode) is marked dirty]
- and synchronized to disk. When an object is created, a default ongoing
- version of one (1) is assigned. An object lookup() also assigns the
- default version if none is present. When a version is initialized upon
- a lookup() or creat() FOP, it need not be durable on disk and therefore
- can just be an extended attribute set without an expensive fsync()
- syscall.
-
- b) Signing version: This is the version against which an object is deemed
- to be signed. An object's signature is tied to a particular signing version.
- Since an object is a candidate for signing upon the last release() [last
- close()], the signing version is the "ongoing version" at that point in time.
-
- An object's signature is trustworthy when the version it was signed against
- matches the ongoing version, i.e., if the hash is calculated by hand and
- compared against the object signature, it *should* be a perfect match
- when the versions are equal. Otherwise, the signature is considered
- stale (it might or might not match the hash just calculated).
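
The trust rule above can be written down as a single comparison. The following is
a minimal sketch in Python, not the GlusterFS implementation; the names
ObjectVersions and signature_is_trustworthy are hypothetical and exist only for
illustration. The default values (ongoing = 1, signing = 0) are the ones this
document assigns on lookup() or creat().

```python
from dataclasses import dataclass

# Defaults taken from the document: ongoing version 1, signing version 0.
DEFAULT_ONGOING_VERSION = 1
DEFAULT_SIGNING_VERSION = 0


@dataclass
class ObjectVersions:
    """Hypothetical container for the two versions kept per object."""
    ongoing: int = DEFAULT_ONGOING_VERSION
    signing: int = DEFAULT_SIGNING_VERSION


def signature_is_trustworthy(versions: ObjectVersions) -> bool:
    """A signature is trusted only if it was made against the ongoing version."""
    return versions.signing == versions.ongoing


# A freshly initialized object has never been signed, so nothing can be trusted.
fresh = ObjectVersions()
# An object signed at version 3 and not opened since then.
signed = ObjectVersions(ongoing=3, signing=3)
# The same object after a first open() moved the ongoing version to 4.
reopened = ObjectVersions(ongoing=4, signing=3)
```

With these values, only `signed` passes the check; `fresh` and `reopened` are
treated as stale regardless of what their stored hash happens to be.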
-
- Initialization of object versions
- ---------------------------------
- An object that predates versioning is assigned the default versions
- upon lookup(). The protocol at this point expects "no" durability
- guarantees for the versions, i.e., extended attribute sets need not be
- followed by an explicit filesystem sync (fsync()). In case of a power
- outage or a crash, versions are re-initialized with defaults if found
- to be non-existent. The signing version is initialized with a default
- value of zero (0) and the ongoing version with one (1).
-
- [
- NOTE: If an object already has versions on disk, lookup() just brings
- the versions into memory. In this case the two versions may or may
- not match, depending on the state the object was left in.
- ]
-
-
- Increment of object versions
- ----------------------------
- During initial versioning, the in-memory representation of the object is
- marked dirty, so that the next modification operation on the object
- triggers a version synchronization to disk (extended attribute set).
- Moreover, this operation needs to be durable on disk for the protocol
- to be crash consistent.
-
- Let's illustrate the various version states after subsequent open()s.
- Not all modification operations need to increment the ongoing version;
- only the first operation needs to (subsequent operations are NO-OPs).
-
- NOTE: From here on, "[s]" depicts a durable filesystem operation and
- "*" depicts the inode as dirty. OV and SV stand for the ongoing and
- signing versions; "(m)" and "(d)" mark the in-memory and on-disk copies.
-
-
-              lookup()    open()      open()      open()
-    ===========================================================
-
-    OV(m):    1*          2           2           2
-    -----------------------------------------
-    OV(d):    1           2[s]        2           2
-    SV(d):    0           0           0           0
-
-
- Let's now illustrate the state when an already signed object undergoes
- file operations.
-
-    on-disk state:
-    OV(d):    3
-    SV(d):    3|<signature>
-
-
-              lookup()    open()      open()      open()
-    ===========================================================
-
-    OV(m):    3*          4           4           4
-    -----------------------------------------
-    OV(d):    3           4[s]        4           4
-    SV(d):    3           3           3           3
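
The two walkthroughs above can be replayed with a small model. This is a sketch
in Python under stated assumptions, not the bit-rot stub code: the class Inode,
its attribute names, and the dictionary standing in for extended attributes are
all hypothetical. It only models what the tables show: lookup() loads (or
initializes) the versions and marks the inode dirty, the first open() writes
ongoing + 1 durably, and later open()s are NO-OPs.

```python
class Inode:
    """Hypothetical model of one object's in-memory and on-disk versions."""

    def __init__(self, xattrs=None):
        # `xattrs` stands in for the on-disk extended attributes.
        self.xattrs = dict(xattrs or {})
        self.ov_mem = None       # OV(m)
        self.dirty = False       # shown as "*" in the tables
        self.durable_syncs = 0   # number of "[s]" operations performed

    def lookup(self):
        # Initialize defaults when versions are absent; no fsync() is needed.
        self.xattrs.setdefault("ongoing", 1)
        self.xattrs.setdefault("signing", 0)
        self.ov_mem = self.xattrs["ongoing"]
        self.dirty = True

    def open(self):
        if not self.dirty:
            return  # NO-OP: the version was already bumped by an earlier open()
        new_version = self.ov_mem + 1
        self.xattrs["ongoing"] = new_version  # durable write, "[s]"
        self.durable_syncs += 1
        # In-memory state changes only after the on-disk write has succeeded.
        self.ov_mem = new_version
        self.dirty = False


# First table: an object with no versions on disk.
new_obj = Inode()
new_obj.lookup()
for _ in range(3):
    new_obj.open()

# Second table: an object already signed at version 3.
signed_obj = Inode({"ongoing": 3, "signing": 3})
signed_obj.lookup()
for _ in range(3):
    signed_obj.open()
```

After the three open() calls, `new_obj` holds OV = 2 and SV = 0, and
`signed_obj` holds OV = 4 and SV = 3, each with exactly one durable write.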
-
- Signing process
- ---------------
- As per the above example, when the last open file descriptor is closed,
- signing needs to be performed. The protocol requires that the signature
- be attached to a version, which in this case is the in-memory value of
- the ongoing version. A release() also marks the inode dirty; therefore,
- the next open() does a durable version synchronization to disk.
-
- [carrying forward the versions from the earlier example]
-
-              close()     release()   open()      open()
-    ===========================================================
-
-    OV(m):    4           4*          5           5
-    -----------------------------------------
-    OV(d):    4           4           5[s]        5
-    SV(d):    3           3           3           3
-
- As shown above, a release() call triggers signing with the signing version
- set to OV(m), which in this case is 4. During signing, the object is signed
- with a signature attached to version 4 as shown below (continuing with
- the last open() call from above):
-
-              open()      sign(4, signature)
-    ===========================================================
-
-    OV(m):    5           5
-    -----------------------------------------
-    OV(d):    5           5
-    SV(d):    3           4:<signature>[s]
-
- A signature comparison at this point in time is untrustworthy because the
- versions mismatch (SV(d) is 4 while OV(d) is 5). This also protects against
- node crashes and hard reboots, thanks to the durability guarantee of the
- on-disk version on the first open().
-
-              close()     release()   open()
-    ===========================================================
-
-    OV(m):    4           4*          5
-    -------------------------------- CRASH
-    OV(d):    4           4           5[s]
-    SV(d):    3           3           3
-
- The protocol is immune to signing requests that arrive after a crash, due
- to the version synchronization performed on the first open(). A signing
- request for a version lower than the *current* ongoing version
- can be ignored. It is left to the implementation to either
- accept or ignore such signing request(s).
-
- [
- NOTE: An inode forget() causes a fresh lookup() to be triggered.
- Since a forget() call is received when there are no
- active references to an inode, the on-disk version is
- the latest and would be copied into memory on lookup().
- ]
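
The signing walkthrough in this section, including the choice between accepting
and ignoring a stale signing request, can be sketched as follows. This is an
illustrative Python model, not the GlusterFS code; ObjectState, release(),
open_object(), sign() and the accept_stale flag are hypothetical names. The
starting values are the ones used in the tables above (OV = 4, SV = 3).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectState:
    ov_mem: int                      # OV(m): in-memory ongoing version
    ov_disk: int                     # OV(d): on-disk ongoing version
    sv_disk: int                     # SV(d): on-disk signing version
    signature: Optional[str] = None
    dirty: bool = False


def release(obj: ObjectState) -> int:
    """Last close(): mark the inode dirty, return the version to sign against."""
    obj.dirty = True
    return obj.ov_mem


def open_object(obj: ObjectState) -> None:
    """The first open() on a dirty inode bumps the version; others are NO-OPs."""
    if obj.dirty:
        obj.ov_disk = obj.ov_mem + 1   # the durable write happens first
        obj.ov_mem = obj.ov_disk
        obj.dirty = False


def sign(obj: ObjectState, version: int, signature: str,
         accept_stale: bool = True) -> bool:
    """Record a signature for `version`; optionally ignore stale requests."""
    if version < obj.ov_disk and not accept_stale:
        return False
    obj.sv_disk, obj.signature = version, signature
    return True


def is_trustworthy(obj: ObjectState) -> bool:
    return obj.sv_disk == obj.ov_disk


obj = ObjectState(ov_mem=4, ov_disk=4, sv_disk=3)
pending = release(obj)               # signing is requested against version 4
open_object(obj)                     # a new open() moves the ongoing version to 5
accepted = sign(obj, pending, "<signature>")
```

Either policy is safe: if the stale request is accepted, SV(d) becomes 4 while
OV(d) is 5, so the signature is still not trusted; if it is ignored, SV(d)
simply stays at 3.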
-
-ii. Correctness Guarantees
-==========================
-
- Concurrent open()s
- ------------------
- When an inode is dirty (i.e., the very next operation would try to
- synchronize the version to disk), there can be multiple calls [say,
- open()] that find the inode state as dirty and try to write back
- the new version to disk. Also note that marking the inode as synced
- and updating the in-memory version is done *after* the new version
- is written to disk. This prevents the in-memory version from running
- ahead of the on-disk version if the version synchronization fails.
- Coming back to multiple open() calls on an object, each open() call
- tries to synchronize the new version to disk if the inode is marked
- as dirty. This is safe because each open() writes the same new
- version (ongoing version + 1), even if the updates are concurrent.
- The in-memory version is finally updated to reflect the new
- version and the inode is marked non-dirty. Again, this is done *only* if
- the inode is dirty; thereby, open() calls which updated the on-disk
- version but lost the race to update the in-memory version are
- NO-OPs.
-
-    on-disk state:
-    OV(d):    3
-    SV(d):    3|<signature>
-
-
-              lookup()    open()      open()'     open()'     open()
-    =============================================================
-
-    OV(m):    3*          3*          3*          4           NO-OP
-    --------------------------------------------------
-    OV(d):    3           4[s]        4[s]        4           4
-    SV(d):    3           3           3           3           3
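
The interleaving in the table above can be replayed deterministically by
splitting open() into its two steps. This is a sketch in Python rather than the
actual implementation; the class RacyInode and the methods sync_to_disk() and
commit_in_memory() are hypothetical names, and real code would perform the
second step under the inode lock.

```python
class RacyInode:
    """Hypothetical model that exposes the two phases of a versioning open()."""

    def __init__(self, version):
        self.ov_mem = version    # OV(m)
        self.ov_disk = version   # OV(d)
        self.dirty = True        # state right after lookup()
        self.disk_writes = 0

    def sync_to_disk(self):
        """Phase 1: write OV(m) + 1 to disk; returns None if the inode is clean."""
        if not self.dirty:
            return None
        target = self.ov_mem + 1
        self.ov_disk = target    # durable write; concurrent callers write the same value
        self.disk_writes += 1
        return target

    def commit_in_memory(self, target):
        """Phase 2: update OV(m) and clear dirty, only if still dirty."""
        if target is None or not self.dirty:
            return False         # lost the race: NO-OP
        self.ov_mem = target
        self.dirty = False
        return True


inode = RacyInode(3)
first = inode.sync_to_disk()      # open():  OV(d) 3 -> 4
second = inode.sync_to_disk()     # open()': OV(d) 4 -> 4, same value written again
won = inode.commit_in_memory(first)
lost = inode.commit_in_memory(second)
later = inode.sync_to_disk()      # a subsequent open() finds the inode clean
```

Both racing callers compute the same target (4) because neither has touched
OV(m) yet, so the duplicate disk write is harmless and only one of them
updates the in-memory state.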
-
-
- open()/release() race
- ---------------------
- This race can cause a release() [on the last close()] to pick up the
- ongoing version which was just incremented by a fresh open(). This
- leads to signing of the object with the same version as the
- ongoing version, so the signature looks valid even though the object
- may still be modified through the new file descriptor, and a hash
- calculated later would mismatch the signature.
- Another point worth mentioning here is that the open
- file descriptor is *attached* to its inode *after* version
- synchronization (and increment) is done. Hence, if a release()
- sneaks into this window, the file descriptor list for the given
- inode is still empty, and release() therefore considers itself
- the last close().
- To counter this, the protocol should track the open and release
- counts for file descriptors. A release() should only trigger a
- signing request when the file descriptor list for an inode is empty
- and the number of releases matches the number of opens. When an
- open() sneaks in and increments the ongoing version but the file
- descriptor is still not attached to the inode, the open and release
- counts mismatch, hence identifying an open() in progress.
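
The counting rule described above can be sketched as a small tracker. This is an
illustrative Python model, not the stub translator's code; FdTracker and its
method names are hypothetical, and it assumes the open count is incremented
when an open() starts, i.e. before the file descriptor is attached to the inode.

```python
class FdTracker:
    """Hypothetical per-inode bookkeeping for the open()/release() race."""

    def __init__(self):
        self.opens = 0            # counted when an open() begins (assumption)
        self.releases = 0
        self.attached = set()     # file descriptors attached to the inode

    def open_begin(self):
        self.opens += 1

    def open_attach(self, fd):
        self.attached.add(fd)

    def release(self, fd):
        """Return True only if this release() should trigger a signing request."""
        self.attached.discard(fd)
        self.releases += 1
        return not self.attached and self.opens == self.releases


tracker = FdTracker()
tracker.open_begin()
tracker.open_attach("fd1")

tracker.open_begin()                         # a second open() is in progress ...
racing_release = tracker.release("fd1")      # ... when the first fd is released

tracker.open_attach("fd2")
final_release = tracker.release("fd2")
```

The racing release() sees an empty file descriptor list but two opens against
one release, so it does not request signing; only the final release(), where
the counts match, does.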
-
-
-iii. Implementation
-===================
- Refer to: xlators/features/bit-rot/src/stub
-
-iv. Protocol enhancements
-=========================
-
- a) Delaying persisting on-disk versions till open()
- b) Lazy version update (until signing?)
- c) Protocol changes required to handle anonymous file
- descriptors in GlusterFS.