From cd3d34289c92f01843a866f4432bdd2da1ee59db Mon Sep 17 00:00:00 2001
From: Venky Shankar
Date: Wed, 18 Feb 2015 17:01:21 +0530
Subject: doc: document bit-rot feature

Change-Id: Ibad640d01975906b7642c76a1649e3e272f3a8bc
BUG: 1170075
Signed-off-by: Venky Shankar
Reviewed-on: http://review.gluster.org/9712
Tested-by: Gluster Build System
Reviewed-by: Vijay Bellur
---
 doc/features/bit-rot/00-INDEX              |   8 +
 doc/features/bit-rot/bitrot-docs.txt       |   5 +
 doc/features/bit-rot/memory-usage.txt      |  48 ++++++
 doc/features/bit-rot/object-versioning.txt | 236 +++++++++++++++++++++++++++++
 4 files changed, 297 insertions(+)
 create mode 100644 doc/features/bit-rot/00-INDEX
 create mode 100644 doc/features/bit-rot/bitrot-docs.txt
 create mode 100644 doc/features/bit-rot/memory-usage.txt
 create mode 100644 doc/features/bit-rot/object-versioning.txt

diff --git a/doc/features/bit-rot/00-INDEX b/doc/features/bit-rot/00-INDEX
new file mode 100644
index 00000000000..d351a1976ff
--- /dev/null
+++ b/doc/features/bit-rot/00-INDEX
@@ -0,0 +1,8 @@
+00-INDEX
+        - this file
+bitrot-docs.txt
+        - links to design, spec and feature page
+object-versioning.txt
+        - object versioning mechanism to track object signature
+memory-usage.txt
+        - memory usage during object expiry tracking
diff --git a/doc/features/bit-rot/bitrot-docs.txt b/doc/features/bit-rot/bitrot-docs.txt
new file mode 100644
index 00000000000..39cd491dbcd
--- /dev/null
+++ b/doc/features/bit-rot/bitrot-docs.txt
@@ -0,0 +1,5 @@
+* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot
+
+* Design: http://goo.gl/Mjy4mD
+
+* CLI specification: http://goo.gl/2o12Fn
diff --git a/doc/features/bit-rot/memory-usage.txt b/doc/features/bit-rot/memory-usage.txt
new file mode 100644
index 00000000000..5fe06d4a209
--- /dev/null
+++ b/doc/features/bit-rot/memory-usage.txt
@@ -0,0 +1,48 @@
+object expiry tracking memory usage
+===================================
+
+Bitrot daemon tracks objects for expiry (after which an object is
+signed) in a data structure known as a "timer-wheel". It's a well
+known data structure for tracking millions of objects for expiry.
+Let's see the memory usage involved when tracking 1 million
+objects (per brick).
+
+Bitrot daemon uses the "br_object" structure to hold information
+needed for signing. An instance of this structure is allocated
+for each object that needs to be signed.
+
+struct br_object {
+        xlator_t *this;
+
+        br_child_t *child;
+
+        void *data;
+        uuid_t gfid;
+        unsigned long signedversion;
+
+        struct list_head list;
+};
+
+Timer-wheel requires an instance of the structure below per
+object that needs to be tracked for expiry.
+
+struct gf_tw_timer_list {
+        void *data;
+        unsigned long expires;
+
+        /** callback routine */
+        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
+
+        struct list_head entry;
+};
+
+Structure sizes (on a 64-bit system):
+        sizeof (struct br_object): 64 bytes
+        sizeof (struct gf_tw_timer_list): 40 bytes
+
+Together, these structures take up 104 bytes. To track all 1 million objects
+at the same time, the amount of memory taken up would be:
+
+        1,000,000 * 104 bytes = 104,000,000 bytes (~100MB)
+
+Not so bad, I think.
diff --git a/doc/features/bit-rot/object-versioning.txt b/doc/features/bit-rot/object-versioning.txt
new file mode 100644
index 00000000000..def901f0fc5
--- /dev/null
+++ b/doc/features/bit-rot/object-versioning.txt
@@ -0,0 +1,236 @@
+Object versioning
+=================
+
+ Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
+ also known as "object signature". An object is signed when there are no active
+ file descriptors referring to its inode (i.e., upon last close()). This is just a
+ hint for the initiation of hash calculation (and therefore signing). There is
+ absolutely no control over when clients can initiate modification operations on
+ the object. An object could be under modification while its hash computation is
+ in progress. It would also be inappropriate to restrict access to such objects
+ for the duration of signing.
+
+ Object versioning is used as a mechanism to identify the staleness of an object's
+ signature. The document below does not just list the version update protocol,
+ but goes through the various factors that led to its design.
+
+NOTE: The word "object" is used to represent a "regular file" (in the Linux sense) and
+      object versions are persisted in extended attributes of the object's inode.
+      Signature calculation covers the object's data only (no metadata as of now).
+
+INDEX
+=====
+    i.   Version update protocol
+    ii.  Correctness guarantees
+    iii. Implementation
+    iv.  Protocol enhancements
+
+i. Version update protocol
+==========================
+    There are two types of versions associated with an object:
+
+    a) Ongoing version: This version is incremented on first open() [when
+       the in-memory representation of the object (inode) is marked dirty]
+       and synchronized to disk. When an object is created, a default ongoing
+       version of one (1) is assigned. An object lookup() too assigns the
+       default version if not present. When a version is initialized upon
+       lookup() or creat() FOP, it need not be durable on disk and therefore
+       can just be an extended attribute set without an expensive fsync()
+       syscall.
+
+    b) Signing version: This is the version against which an object is deemed
+       to be signed. An object's signature is tied to a particular signed version.
+       Since an object is a candidate for signing upon last release() [last
+       close()], the signing version is the "ongoing version" at that point in time.
+
+    An object's signature is trustable when the version it was signed against
+    matches the ongoing version, i.e., if the hash is calculated by hand and
+    compared against the object signature, it *should* be a perfect match if
+    and only if the versions are equal. Otherwise, the signature is
+    considered stale (it might or might not match the hash just calculated).
+
+    Initialization of object versions
+    ---------------------------------
+    An object that existed before versioning was introduced is assigned the
+    default versions upon lookup(). The protocol at this point expects "no"
+    durability guarantees of the versions, i.e., extended attribute sets
+    need not be followed by an explicit filesystem sync (fsync()). In case
+    of a power outage or a crash, versions are re-initialized with defaults
+    if found to be non-existent. The signing version is initialized with a
+    default value of zero (0) and the ongoing version as one (1).
+
+    [
+      NOTE: If an object already has versions on-disk, lookup() just brings
+            the versions into memory. In this case both versions may or may
+            not match depending on the state the object was left in.
+    ]
+
+
+    Increment of object versions
+    ----------------------------
+    During initial versioning, the in-memory representation of the object is
+    marked dirty, so that subsequent modification operations on the object
+    trigger a version synchronization to disk (extended attribute set).
+    Moreover, this operation needs to be durable on disk, for the protocol
+    to be crash consistent.
+
+    Let's picture the various version states after subsequent open()s.
+    Not all modification operations need to increment the ongoing version,
+    only the first operation needs to (subsequent operations are NO-OPs).
+
+    NOTE: From here on, "[s]" depicts a durable filesystem operation and
+          "*" depicts the inode as dirty.
+
+
+               lookup()       open()         open()         open()
+    ===========================================================
+
+    OV(m):       1*             2              2              2
+             -----------------------------------------
+    OV(d):       1              2[s]           2              2
+    SV(d):       0              0              0              0
+
+
+    Let's now picture the state when an already signed object undergoes
+    file operations.
+
+    on-disk state:
+    OV(d): 3
+    SV(d): 3
+
+
+               lookup()       open()         open()         open()
+    ===========================================================
+
+    OV(m):       3*             4              4              4
+             -----------------------------------------
+    OV(d):       3              4[s]           4              4
+    SV(d):       3              3              3              3
+
+    Signing process
+    ---------------
+    As per the above example, when the last open file descriptor is closed,
+    signing needs to be performed. The protocol requires that the signing
+    be attached to a version, which in this case is the in-memory
+    value of the ongoing version. A release() also marks the inode dirty,
+    therefore, the next open() does a durable version synchronization to
+    disk.
+
+    [carrying forward the versions from the earlier example]
+
+               close()        release()      open()         open()
+    ===========================================================
+
+    OV(m):       4              4*             5              5
+             -----------------------------------------
+    OV(d):       4              4              5[s]           5
+    SV(d):       3              3              3              3
+
+    As shown above, a release() call triggers signing with the signing version
+    set to OV(m), which in this case is 4. During signing, the object is signed
+    with a signature attached to version 4 as shown below (continuing with
+    the last open() call from above):
+
+               open()         sign(4, signature)
+    ===========================================================
+
+    OV(m):       5              5
+             -----------------------------------------
+    OV(d):       5              5
+    SV(d):       3              4[s]
+
+    A signature comparison at this point in time is not trustable due to
+    the version mismatch. This also protects against node crashes and hard
+    reboots, thanks to the durability guarantee of the on-disk version on
+    first open().
+
+               close()        release()      open()
+    ===========================================================
+
+    OV(m):       4              4*             5
+             -------------------------------- CRASH
+    OV(d):       4              4              5[s]
+    SV(d):       3              3              3
+
+    The protocol is immune to signing requests after crashes due to
+    the version synchronization performed on first open(). A signing
+    request for a version lower than the *current* ongoing version
+    can be ignored. It's left to the implementation to either
+    accept or ignore such signing request(s).
+
+    [
+      NOTE: Inode forget() causes a fresh lookup() to be triggered.
+            Since a forget() call is received when there are no
+            active references for an inode, the on-disk version is
+            the latest and would be copied in-memory on lookup().
+    ]
+
+ii. Correctness Guarantees
+==========================
+
+    Concurrent open()s
+    ------------------
+    When an inode is dirty (i.e., the very next operation would try to
+    synchronize the version to disk), there can be multiple calls [say,
+    open()] that would find the inode state as dirty and try to write back
+    the new version to disk. Also, note that marking the inode as synced
+    and updating the in-memory version is done *after* the new version
+    is written on disk. This is done to avoid the in-memory version
+    holding the updated value while the on-disk version is stale, in
+    case the version synchronization fails.
+    Coming back to multiple open() calls on an object, each open() call
+    tries to synchronize the new version to disk if the inode is marked
+    as dirty. This is safe as each open() would try to synchronize the
+    new version (ongoingversion + 1) even if the update is concurrent.
+    The in-memory version is finally updated to reflect the updated
+    version and mark the inode non-dirty. Again, this is done *only* if
+    the inode is dirty; open() calls which updated the on-disk version
+    but lost the race to update the in-memory version are therefore
+    NO-OPs.
+
+    on-disk state:
+    OV(d): 3
+    SV(d): 3
+
+
+               lookup()     open()     open()'    open()'    open()
+    =============================================================
+
+    OV(m):       3*           3*         3*         4        NO-OP
+           --------------------------------------------------
+    OV(d):       3            4[s]       4[s]       4          4
+    SV(d):       3            3          3          3          3
+
+
+    open()/release() race
+    ---------------------
+    This race can cause a release() [on last close()] to pick up the
+    ongoing version which was just incremented on fresh open(). This
+    leads to the object being signed with the same version as the
+    ongoing version, so the signature may not match when calculated.
+    Another point that's worth mentioning here is that the open
+    file descriptor is *attached* to its inode *after* it has done
+    version synchronization (and increment). Hence, if a release()
+    sneaks in this window, the file descriptor list for the given
+    inode is still empty, and release() therefore considers it a
+    last close().
+    To counter this, the protocol should track the open and release
+    counts for file descriptors. A release() should only trigger a
+    signing request when the file descriptor list for an inode is empty
+    and the number of releases matches the number of opens. When an
+    open() sneaks in and increments the ongoing version but the file
+    descriptor is still not attached to the inode, open and release
+    counts mismatch, hence identifying an open() in progress.
+
+
+iii. Implementation
+===================
+    Refer to: xlators/features/bit-rot/src/stub
+
+iv. Protocol enhancements
+=========================
+
+    a) Delaying persisting on-disk versions till open()
+    b) Lazy version update (until signing?)
+    c) Protocol changes required to handle anonymous file
+       descriptors in GlusterFS.