Diffstat (limited to 'doc/features/bit-rot')
| -rw-r--r-- | doc/features/bit-rot/00-INDEX | 8 |
| -rw-r--r-- | doc/features/bit-rot/bitrot-docs.txt | 5 |
| -rw-r--r-- | doc/features/bit-rot/memory-usage.txt | 48 |
| -rw-r--r-- | doc/features/bit-rot/object-versioning.txt | 236 |
4 files changed, 0 insertions, 297 deletions
diff --git a/doc/features/bit-rot/00-INDEX b/doc/features/bit-rot/00-INDEX
deleted file mode 100644
index d351a1976ff..00000000000
--- a/doc/features/bit-rot/00-INDEX
+++ /dev/null
@@ -1,8 +0,0 @@
-00-INDEX
-        - this file
-bitrot-docs.txt
-        - links to design, spec and feature page
-object-versioning.txt
-        - object versioning mechanism to track object signature
-memory-usage.txt
-        - memory usage during object expiry tracking
diff --git a/doc/features/bit-rot/bitrot-docs.txt b/doc/features/bit-rot/bitrot-docs.txt
deleted file mode 100644
index 39cd491dbcd..00000000000
--- a/doc/features/bit-rot/bitrot-docs.txt
+++ /dev/null
@@ -1,5 +0,0 @@
-* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot
-
-* Design: http://goo.gl/Mjy4mD
-
-* CLI specification: http://goo.gl/2o12Fn
diff --git a/doc/features/bit-rot/memory-usage.txt b/doc/features/bit-rot/memory-usage.txt
deleted file mode 100644
index 5fe06d4a209..00000000000
--- a/doc/features/bit-rot/memory-usage.txt
+++ /dev/null
@@ -1,48 +0,0 @@
-object expiry tracking memroy usage
-====================================
-
-Bitrot daemon tracks objects for expiry in a data structure known
-as "timer-wheel" (after which the object is signed). It's a well
-known data structure for tracking million of objects of expiry.
-Let's see the memory usage involved when tracking 1 million
-objects (per brick).
-
-Bitrot daemon uses "br_object" structure to hold information
-needed for signing. An instance of this structure is allocated
-for each object that needs to be signed.
-
-struct br_object {
-        xlator_t *this;
-
-        br_child_t *child;
-
-        void *data;
-        uuid_t gfid;
-        unsigned long signedversion;
-
-        struct list_head list;
-};
-
-Timer-wheel requires an instance of the structure below per
-object that needs to be tracked for expiry.
-
-struct gf_tw_timer_list {
-        void *data;
-        unsigned long expires;
-
-        /** callback routine */
-        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
-
-        struct list_head entry;
-};
-
-Structure sizes:
-  sizeof (struct br_object): 64 bytes
-  sizeof (struct gf_tw_timer_list): 40 bytes
-
-Together, these structures take up 104 bytes. To track all 1 million objects
-at the same time, the amount of memory taken up would be:
-
-  1,000,000 * 104 bytes: ~100MB
-
-Not so bad, I think.
diff --git a/doc/features/bit-rot/object-versioning.txt b/doc/features/bit-rot/object-versioning.txt
deleted file mode 100644
index def901f0fc5..00000000000
--- a/doc/features/bit-rot/object-versioning.txt
+++ /dev/null
@@ -1,236 +0,0 @@
-Object versioning
-=================
-
-  Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
-  also known as "object signature". An object is signed when there are no active
-  file desciptors referring to it's inode (i.e., upon last close()). This is just an
-  hint for the initiation of hash calculation (and therefore signing). There is
-  absolutely no control over when clients can initiate modification operations on
-  the object. An object could be under modification while it's hash computation is
-  under progress. It would also be in-appropriate to restrict access to such objects
-  during the time duration of signing.
-
-  Object versioning is used as a mechanism to identify the staleness of an objects
-  signature. The document below does not just list down the version update protocol,
-  but goes through various factors that led to its design.
-
-NOTE: The word "object" is used to represent a "regular file" (in linux sense) and
-      object versions are persisted in extended attributes of the object's inode.
-      Signature calculation includes object's data (no metadata as of now).
-
-INDEX
-=====
-  i.   Version updation protocol
-  ii.  Correctness guaraantees
-  iii. Implementation
-  iv.  Protocol enhancements
-
-i. Version updation protocol
-============================
-  There are two types of versions associated with an object:
-
-  a) Ongoing version: This version is incremented on first open() [when
-     the in-memory representation of the object (inode) is marked dirty
-     and synchronized to disk. When an object is created, a default ongoing
-     version of one (1) is assigned. An object lookup() too assigns the
-     default version if not present. When a version is initialized upon
-     lookup() or creat() FOP, it need to be durable on disk and therefore
-     can just be a extended attrbute set with out an expensive fsync()
-     syscall.
-
-  b) Signing version: This is the version against which an object is deemed
-     to be signed. An objects signature is tied to a particular signed version.
-     Since, an object is a candidate for signing upon last release() [last
-     close()], signing version is the "ongoing version" at that point of time
-
-  An object's signature is trustable when the version it was signed against
-  matches the ongoing version, i.e., if the hash is calculated by hand and
-  compared against the object signature, it *should* be a perfect match if
-  and only if the versions are equal. On the other hand, the signature is
-  considered stale (might or might not match the hash just calculated).
-
-  Initialization of object versions
-  ---------------------------------
-     An object that existed before the pre versioning days, is assigned the
-     default versions upon lookup(). The protocol at this point expects "no"
-     durability guarantess of the versions, i.e., extended attribute sets
-     need not be followed by an explicit filesystem sync (fsync()). In case
-     of a power outage or a crash, versions are re-initialized with defaults
-     if found to be non-existant. The signing version is initialized with a
-     deafault value of zero (0) and the ongoing version as one (1).
-
-     [
-       NOTE: If an object already has versions on-disk, lookup() just brings
-             the versions in memory. In this case both versions may or may
-             not match depending on state the object was left in.
-     ]
-
-
-  Increment of object versions
-  ----------------------------
-     During initial versioning, the in-memory representation of the object is
-     marked dirty, so that subsequent modification operations on the object
-     triggers a versiong synchronization to disk (extended attribute set).
-     Moreover, this operation needs to be durable on disk, for the protocol
-     to be crash consistent.
-
-     Let's picturize the various version states after subsequent open()s.
-     Not all modification operations need to increment the ongoing version,
-     only the first operations needs to (subsequent operations are NO-OPs).
-
-     NOTE: From here one "[s]" depicts a durable filesystem operation and
-           "*" depicts the inode as dirty.
-
-
-                       lookup()     open()    open()    open()
-            ===========================================================
-
-            OV(m):        1*          2         2         2
-                      -----------------------------------------
-            OV(d):        1           2[s]      2         2
-            SV(d):        0           0         0         0
-
-
-     Let's now picturize the state when an already signed object undergoes
-     file operations.
-
-     on-disk state:
-          OV(d): 3
-          SV(d): 3|<signature>
-
-
-                       lookup()     open()    open()    open()
-            ===========================================================
-
-            OV(m):        3*          4         4         4
-                      -----------------------------------------
-            OV(d):        3           4[s]      4         4
-            SV(d):        3           3         3         3
-
-  Signing process
-  ---------------
-     As per the above example, when the last open file descriptor is closed,
-     signing needs to be performed. The protocol restricts that the signing
-     needs to be attached to a version, which in this case is the in-memory
-     value of the ongoing version. A release() also marks the inode dirty,
-     therefore, the next open() does a durable version synchronization to
-     disk.
-
-     [carry forwarding the versions from earlier example]
-
-                       close()     release()  open()   open()
-            ===========================================================
-
-            OV(m):        4           4*        5         5
-                      -----------------------------------------
-            OV(d):        4           4         5[s]      5
-            SV(d):        3           3         3         3
-
-     As shown above, a relase() call triggers a signing with signing version
-     as OV(m): which in this case is 4. During signing, the object is signed
-     with a signature attached to version 4 as shown below (continuing with
-     the last open() call from above):
-
-                       open()           sign(4, signature)
-            ===========================================================
-
-            OV(m):        5                     5
-                      -----------------------------------------
-            OV(d):        5                     5
-            SV(d):        3               4:<signature>[s]
-
-     A signature comparison at this point of time is un-trustable due to
-     version mismatches. This also protects from node crashes and hard
-     reboots due to durability guarantee of on-disk version on first
-     open().
-
-                       close()     release()  open()
-            ===========================================================
-
-            OV(m):        4           4*        5
-                      --------------------------------  CRASH
-            OV(d):        4           4         5[s]
-            SV(d):        3           3         3
-
-     The protocol is immune to signing request after crashes due to
-     the version synchronization performed on first open(). Signing
-     request for a version lesser than the *current* ongoing version
-     can be ignored. It's left upon the implementation to either
-     accept or ignore such signing request(s).
-
-     [
-        NOTE: Inode forget() causes a fresh lookup() to be trigerred.
-              Since a forget() call is received when there are no
-              active references for an inode, the on-disk version is
-              the latest and would be copied in-memory on lookup().
-     ]
-
-ii. Correctness Guarantees
-==========================
-
-     Concurrent open()'s
-     -------------------
-     When an inode is dirty (i.e., the very next operations would try to
-     synchronize the version to disk), there can be multiple calls [say,
-     open()] that would find the inode state as dirty and try to writeback
-     the new version to disk. Also, note that, marking the inode as synced
-     and updating the in-memory version is done *after* the new version
-     is written on disk. This is done to avoid incorrect version stored
-     on-disk in case the version synchronization fails (but the in-memory
-     version still holding the updated value).
-     Coming back to multiple open() calls on an object, each open() call
-     tries to synchronize the new version to disk if the inode is marked
-     as dirty. This is safe as each open() would try to synchronize the
-     new version (ongoingversion + 1) even if the updation is concurrent.
-     The in-memory version is finally updated to reflect the updated
-     version and mark the inode non-dirty. Again this is done *only* if
-     the inode is dirty, thereby open() calls which updated the on-disk
-     version but lost the race to update the in-memory version result
-     are NO-OPs.
-
-     on-disk state:
-          OV(d): 3
-          SV(d): 3|<signature>
-
-
-                       lookup()     open()    open()'   open()'  open()
-            =============================================================
-
-            OV(m):        3*          3*        3*        4      NO-OP
-                      --------------------------------------------------
-            OV(d):        3           4[s]      4[s]      4        4
-            SV(d):        3           3         3         3        3
-
-
-     open()/release() race
-     ---------------------
-     This race can cause a release() [on last close()] to pick up the
-     ongoing version which was just incremented on fresh open(). This
-     leads to signing of the object with the same version as the
-     ongoing version, thereby, mismatching signatures when calculated.
-     Another point that's worth mentioning here is that the open
-     file descriptor is *attached* to it's inode *after* it's done
-     version synchronization (and increment). Hence, if a release()
-     sneaks in this window, the file desriptor list for the given
-     inode is still empty, therefore release() considering it as a
-     last close().
-     To counter this, the protocol should track the open and release
-     counts for file descriptors. A release() should only trigger a
-     signing request when the file desccriptor for an inode is empty
-     and the numbers of releases match the number of opens. When an
-     open() sneaks and increments the ongoing version but the file
-     descriptor is still not attached to the inode, open and release
-     counts mismatch, hence identifying an open() in progress.
-
-
-iii. Implementation
-===================
-     Refer to: xlators/feature/bit-rot/src/stub
-
-iv. Protocol enhancements
-=========================
-
-     a) Delaying persisting on-disk versions till open()
-     b) Lazy version updation (until signing?)
-     c) Protocol changes required to handle anonymous file
-        descriptors in GlusterFS.
