Diffstat (limited to 'doc/developer-guide/afr-locks-evolution.md')
-rw-r--r-- doc/developer-guide/afr-locks-evolution.md 91
1 files changed, 91 insertions, 0 deletions
diff --git a/doc/developer-guide/afr-locks-evolution.md b/doc/developer-guide/afr-locks-evolution.md
new file mode 100644
index 00000000000..2dabbcfeb13
--- /dev/null
+++ b/doc/developer-guide/afr-locks-evolution.md
@@ -0,0 +1,91 @@
+History of locking in AFR
+--------------------------
+
+GlusterFS has a **locks** translator that provides the internal locking operations `inodelk` and `entrylk`, which AFR uses to synchronize operations on files or directories that conflict with each other.
+
+`Inodelk` lets translators in GlusterFS obtain range locks (each denoted by a tuple of **offset** and **length**) in a given **domain** for an inode.
+A full-file lock is denoted by the tuple (offset: `0`, length: `0`), i.e. a length of `0` is treated as infinity.
+
+`Entrylk` lets translators in GlusterFS obtain locks on a `name` in a given **domain** for an inode, typically a directory.
+
+The **locks** translator provides both *blocking* and *nonblocking* variants of these locks.
+
+
+AFR makes extensive use of the locks xlator:
+
+1) For FOPs (from clients)
+-----------------------
+* Data transactions take inode locks on the data domain. Let's refer to this domain name as DATA_DOMAIN.
+
+ So locking for writes would be something like this: `inodelk(offset, length, DATA_DOMAIN)`.
+ For truncating a file to zero, it would be `inodelk(0, 0, DATA_DOMAIN)`.
+
+* Metadata transactions (chown/chmod) also take inode locks, but on a special range in the metadata domain,
+ i.e. `(LLONG_MAX-1, 0, METADATA_DOMAIN)`.
+
+* Entry transactions (create, mkdir, rmdir, unlink, symlink, link, rename) take entrylk on `(name, parent inode)`.
+
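The range and domain rules above can be sketched in a few lines of Python. This is a simplified model, not the locks xlator's code: the `InodeLk` class and `conflicts` function are hypothetical names, every lock is treated as exclusive, and locks in different domains are assumed never to conflict.

```python
from dataclasses import dataclass

LLONG_MAX = 2**63 - 1  # largest signed 64-bit value, as in C's <limits.h>


@dataclass(frozen=True)
class InodeLk:
    """Hypothetical model of one inodelk request (all locks exclusive)."""
    offset: int
    length: int  # 0 means "up to infinity", i.e. the rest of the file
    domain: str
    owner: str

    def end(self) -> float:
        # Exclusive end offset; a zero length never ends.
        return float("inf") if self.length == 0 else self.offset + self.length


def conflicts(a: InodeLk, b: InodeLk) -> bool:
    """Return True if the two locks cannot be held at the same time."""
    if a.domain != b.domain:
        return False  # assumption: each domain is an independent lock space
    if a.owner == b.owner:
        return False  # overlapping locks are allowed for the same lock owner
    return a.offset < b.end() and b.offset < a.end()


write = InodeLk(4096, 4096, "DATA_DOMAIN", "client-1")
truncate = InodeLk(0, 0, "DATA_DOMAIN", "client-2")
chmod = InodeLk(LLONG_MAX - 1, 0, "METADATA_DOMAIN", "client-3")

print(conflicts(write, truncate))  # overlapping ranges in the same domain
print(conflicts(chmod, truncate))  # different domains
```

A truncate to zero conflicts with every data lock held by another owner because its range covers the whole file, while the metadata lock never interferes with data locks because it lives in its own domain.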
+
+2) For self-heal:
+-------------
+* For metadata self-heal, the lock is the same as for metadata transactions, i.e. `inodelk(LLONG_MAX-1, 0, METADATA_DOMAIN)`.
+* For entry self-heal, it is `entrylk(NULL name, parent inode)`. Specifying NULL for the name takes a full lock on the directory referred to by the inode.
+* For data self-heal, there is a bit of history as to how locks evolved:
+
+### Initial version (say version 1):
+There was no concept of a self-heal daemon (shd). Only client lookups triggered heals, so AFR always took `inodelk(0,0,DATA_DOMAIN)` for healing. The issue with this approach was that while a heal was in progress, I/O from clients to that file was blocked.
+
+### version 2:
+shd was introduced. We needed to allow I/O to go through while a heal was in progress, provided the ranges did not overlap. To that end, the following approach was adopted:
+
++ 1. shd takes a full inodelk in DATA_DOMAIN. Thus client FOPs are blocked and cannot modify the changelog xattrs.
++ 2. shd inspects the xattrs to determine source/sink.
++ 3. shd takes a chunk inodelk (0-128kb), again in DATA_DOMAIN (the locks xlator allows overlapping locks if the lock owner is the same).
++ 4. shd unlocks the full lock.
++ 5. shd heals the chunk.
++ 6. shd takes the next chunk lock (128-256kb).
++ 7. shd unlocks the 1st chunk lock, heals the second chunk, and so on.
+
+
+Thus after step 4, any client FOP could write to regions that were not currently under heal. The exception was truncate (to size 0), because it needs a full-file lock and will always block: some chunk is always locked by the shd until the heal completes.
+
+Another issue was that 2 shds could run in parallel. Say SHD1 and SHD2 compete for step 1. Let SHD1 win. It proceeds and completes step 4. Now SHD2 also succeeds in step 1 and continues through all the steps. Thus at the end both shds will decrement the changelog, leading to negative values in it.
+
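The parallel-heal problem can be illustrated with a toy model. It assumes a single pending counter on the sink and assumes that each shd, after healing, subtracts the count it observed when it inspected the xattrs. The function `run_heals` and its parameters are hypothetical and only capture the ordering of "inspect" and "decrement".

```python
def run_heals(pending: int, shds: int, serialize: bool) -> int:
    """Return the sink's pending count after `shds` daemons attempt a heal.

    serialize=False: every shd inspects the xattrs before any other shd has
    finished, so all of them observe the same pending count (the race).
    serialize=True: a shd inspects the xattrs only after the previous shd
    has completely finished its heal.
    """
    if serialize:
        for _ in range(shds):
            observed = pending          # inspect xattrs
            if observed > 0:
                pending -= observed     # heal, then clear what was observed
    else:
        observations = [pending for _ in range(shds)]  # all inspect first
        for observed in observations:
            if observed > 0:
                pending -= observed     # each shd clears what it observed
    return pending


print(run_heals(pending=1, shds=2, serialize=False))  # -1
print(run_heals(pending=1, shds=2, serialize=True))   # 0
```

In the unserialized run both daemons act on the same stale observation, which is how the changelog ends up negative.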
+### version 3
+To prevent parallel self heals, another domain was introduced, let us call it SELF_HEAL_DOMAIN. With this domain, the following approach was adopted and is **the approach currently in use**:
+
++ 1. shd takes a full inodelk on SELF_HEAL_DOMAIN.
++ 2. shd takes a full inodelk on DATA_DOMAIN.
++ 3. shd inspects the xattrs to determine source/sink.
++ 4. shd unlocks the full lock on DATA_DOMAIN.
++ 5. shd takes a chunk lock (0-128kb) on DATA_DOMAIN.
++ 6. shd heals the chunk.
++ 7. shd takes the next chunk lock (128-256kb) on DATA_DOMAIN.
++ 8. shd unlocks the 1st chunk lock, heals the second chunk, and so on.
++ 9. Finally, shd releases the full lock on SELF_HEAL_DOMAIN.
+
+Thus until one shd completes step 9, another shd cannot start step 1, solving the problem of simultaneous heals.
+Note that the issue of the truncate (to zero) FOP hanging still remains.
+Also, this scheme involves multiple network calls: lock, heal (i.e. read + write) and unlock for each chunk, i.e. 4 calls per chunk.
+
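The version 3 sequence and its per-chunk cost can be written out as a list of operations. This is a sketch for illustration only: `v3_heal_ops` and `chunk_network_calls` are hypothetical helpers, the last chunk is not clipped to the file size, and a "heal" entry stands for one read plus one write.

```python
CHUNK = 128 * 1024  # heal chunk size in bytes, matching the 128kb chunks above


def v3_heal_ops(file_size: int, chunk: int = CHUNK) -> list:
    """List the operations of one version-3 data self-heal, in order."""
    ops = [
        ("lock", "SELF_HEAL_DOMAIN", 0, 0),   # step 1: full lock (length 0)
        ("lock", "DATA_DOMAIN", 0, 0),        # step 2
        ("inspect-xattrs",),                  # step 3
        ("unlock", "DATA_DOMAIN", 0, 0),      # step 4
    ]
    offsets = list(range(0, file_size, chunk))
    for i, off in enumerate(offsets):
        ops.append(("lock", "DATA_DOMAIN", off, chunk))
        if i > 0:
            # The previous chunk is released only after the next one is held.
            ops.append(("unlock", "DATA_DOMAIN", offsets[i - 1], chunk))
        ops.append(("heal", off, chunk))      # one read plus one write
    if offsets:
        ops.append(("unlock", "DATA_DOMAIN", offsets[-1], chunk))
    ops.append(("unlock", "SELF_HEAL_DOMAIN", 0, 0))  # step 9
    return ops


def chunk_network_calls(ops: list) -> int:
    """Count per-chunk calls: each chunk lock or unlock is 1, each heal is 2."""
    calls = 0
    for op in ops:
        if op[0] == "heal":
            calls += 2
        elif op[0] in ("lock", "unlock") and op[1] == "DATA_DOMAIN" and op[3] != 0:
            calls += 1
    return calls


ops = v3_heal_ops(2 * CHUNK)
for op in ops:
    print(op)
print(chunk_network_calls(ops) // 2, "calls per chunk")  # 4 calls per chunk
```

Because each chunk lock is taken before the previous one is dropped, there is no point during the chunk loop at which DATA_DOMAIN is free of shd locks, which is why a full-file truncate keeps waiting.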
+### version 4 (ToDo)
+Some improvements that need to be made in version 3:
+* Reduce network calls using piggybacking.
+* After taking a chunk lock and healing, we need to unlock it before locking the next chunk. This gives a window for any pending truncate FOPs to succeed. If a truncate succeeds, the heal of the next chunk will fail (read returns zero)
+and the heal is stopped. *BUT* there is **yet another** issue:
+
+* shd does steps 1 to 4. Let's assume the source is brick b1 and the sink is brick b2, i.e. the xattrs are (0,1) and (0,0) on b1 and b2 respectively. Now before shd takes the (0-128kb) lock, a client FOP takes it.
+It modifies data but the FOP succeeds only on brick b2. `writev` returns success, and the xattrs now read (0,1) and (1,0). shd takes over and heals. It had observed (0,1), (0,0) earlier
+and thus goes ahead and copies the stale 128kb from brick b1 to brick b2. Thus as far as the application is concerned, `writev` returned success but the bricks have stale data.
+What needs to be done is that `writev` must return success only if it succeeded on at least one source brick (brick b1 in this case). Otherwise the heal still happens in the reverse direction, but as far as the application is concerned, it received an error.
+
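The proposed rule for `writev` amounts to a set intersection. The sketch below is only a statement of that rule, with a hypothetical function name and brick labels taken from the example above; it does not reflect how AFR tracks sources internally.

```python
def writev_result(succeeded_on: set, sources: set) -> bool:
    """Report success only if the write landed on at least one source brick."""
    return bool(succeeded_on & sources)


# Scenario from the text: b1 is the source, the write lands only on sink b2.
print(writev_result(succeeded_on={"b2"}, sources={"b1"}))        # False
print(writev_result(succeeded_on={"b1", "b2"}, sources={"b1"}))  # True
```

With this rule the application sees an error in the problematic case, so the subsequent heal from b1 to b2 no longer contradicts what the application was told.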
+### Note on lock **domains**
+We have used conceptual names in this document like DATA_DOMAIN, METADATA_DOMAIN and SELF_HEAL_DOMAIN. In the code, these are mapped to strings that are based on the AFR xlator name like so:
+
+DATA_DOMAIN ---> "vol_name-replicate-n"
+
+METADATA_DOMAIN ---> "vol_name-replicate-n:metadata"
+
+SELF_HEAL_DOMAIN ---> "vol_name-replicate-n:self-heal"
+
+where vol_name is the name of the volume and 'n' is the replica subvolume index (starting from 0).
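As a quick illustration of the mapping, the following sketch builds the three domain strings from a volume name and replica subvolume index. The function `lock_domains` and the volume name `testvol` are hypothetical; only the string patterns come from the text above.

```python
def lock_domains(vol_name: str, n: int) -> dict:
    """Build the lock domain strings for replica subvolume `n` of a volume."""
    afr_name = f"{vol_name}-replicate-{n}"  # the AFR xlator name
    return {
        "DATA_DOMAIN": afr_name,
        "METADATA_DOMAIN": f"{afr_name}:metadata",
        "SELF_HEAL_DOMAIN": f"{afr_name}:self-heal",
    }


for concept, domain in lock_domains("testvol", 0).items():
    print(concept, "--->", domain)
```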