From 7cfb26296f9f123c586f7e767402b97c724dc510 Mon Sep 17 00:00:00 2001
From: Ravishankar N
Date: Fri, 6 Sep 2013 15:49:14 +0530
Subject: doc: Steps to recover from split-brain

BUG: 1006776
Change-Id: I6f4bffe992758b2889df9ebfe117834dabe5c1cf
Signed-off-by: Pranith Kumar K
Signed-off-by: Ravishankar N
Reviewed-on: http://review.gluster.org/4216
Reviewed-by: Kaleb KEITHLEY
Tested-by: Anand Avati
---
 doc/split-brain.md | 251 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 251 insertions(+)
 create mode 100644 doc/split-brain.md

diff --git a/doc/split-brain.md b/doc/split-brain.md
new file mode 100644
index 00000000..b0d938e2
--- /dev/null
+++ b/doc/split-brain.md
@@ -0,0 +1,251 @@
+Steps to recover from File split-brain.
+======================================
+
+Quick Start:
+============
+1. Get the path of the file that is in split-brain:
+> It can be obtained either by
+> a) Running the command `gluster volume heal info split-brain`.
+> b) Identifying the files for which file operations performed
+  from the client keep failing with an Input/Output error.
+
+2. Close the applications that opened this file from the mount point.
+Virtual machines using the file need to be powered off.
+
+3. Decide on the correct copy:
+> This is done by observing the afr changelog extended attributes of the file on
+the bricks using the getfattr command; then identifying the type of split-brain
+(data split-brain, metadata split-brain, entry split-brain or split-brain due to
+gfid-mismatch); and finally determining which of the bricks contains the 'good copy'
+of the file.
+> `getfattr -d -m . -e hex `.
+It is also possible that one brick might contain the correct data while the
+other might contain the correct metadata.
+
+4. Reset the relevant extended attribute on the brick(s) that contain the
+'bad copy' of the file data/metadata using the setfattr command.
+> `setfattr -n -v `
+
+5. Trigger self-heal on the file by performing a lookup from the client:
+> `ls -l `
+
+Detailed Instructions for steps 3 through 5:
+===========================================
+To understand how to resolve split-brain we need to know how to interpret the
+afr changelog extended attributes.
+
+Execute `getfattr -d -m . -e hex `
+
+* Example:
+[root@store3 ~]# getfattr -d -e hex -m. brick-a/file.txt
+\#file: brick-a/file.txt
+security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
+trusted.afr.vol-client-2=0x000000000000000000000000
+trusted.afr.vol-client-3=0x000000000200000000000000
+trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b
+
+The extended attributes with `trusted.afr.<volname>-client-<subvolume-index>`
+are used by afr to maintain changelog of the file. The values of the
+`trusted.afr.<volname>-client-<subvolume-index>` are calculated by the glusterfs
+client (fuse or nfs-server) processes. When the glusterfs client modifies a file
+or directory, the client contacts each brick and updates the changelog extended
+attribute according to the response of the brick.
+
+'subvolume-index' is simply (brick number - 1) in the
+`gluster volume info ` output.
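
As a minimal sketch of the naming rule above, the following Python helper builds the changelog attribute name from a volume name and a brick number. The function name `afr_xattr_name` and the assumption that brick numbers are counted from 1 in listing order are illustrative only; they are not part of GlusterFS.

```python
def afr_xattr_name(volname: str, brick_number: int) -> str:
    """Return the afr changelog xattr name for one brick.

    brick_number is assumed to be the 1-based position of the brick in
    the `gluster volume info` listing (an assumption of this sketch).
    """
    if brick_number < 1:
        raise ValueError("brick_number must be 1 or greater")
    # subvolume-index is (brick number - 1), as described in the text.
    return f"trusted.afr.{volname}-client-{brick_number - 1}"
```

For a volume named `vol`, brick number 4 maps to `trusted.afr.vol-client-3`, which matches the attribute names seen in the getfattr example.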
+
+* Example:
+[root@pranithk-laptop ~]# gluster volume info vol
+ Volume Name: vol
+ Type: Distributed-Replicate
+ Volume ID: 4f2d7849-fbd6-40a2-b346-d13420978a01
+ Status: Created
+ Number of Bricks: 4 x 2 = 8
+ Transport-type: tcp
+ Bricks:
+ brick-a: pranithk-laptop:/gfs/brick-a
+ brick-b: pranithk-laptop:/gfs/brick-b
+ brick-c: pranithk-laptop:/gfs/brick-c
+ brick-d: pranithk-laptop:/gfs/brick-d
+ brick-e: pranithk-laptop:/gfs/brick-e
+ brick-f: pranithk-laptop:/gfs/brick-f
+ brick-g: pranithk-laptop:/gfs/brick-g
+ brick-h: pranithk-laptop:/gfs/brick-h
+
+In the example above:
+```
+Brick             |    Replica set    |    Brick subvolume index
+----------------------------------------------------------------------------
+-/gfs/brick-a     |       0           |       0
+-/gfs/brick-b     |       0           |       1
+-/gfs/brick-c     |       1           |       2
+-/gfs/brick-d     |       1           |       3
+-/gfs/brick-e     |       2           |       4
+-/gfs/brick-f     |       2           |       5
+-/gfs/brick-g     |       3           |       6
+-/gfs/brick-h     |       3           |       7
+```
+
+Each file in a brick maintains the changelog of itself and that of the files
+present in all the other bricks in its replica set as seen by that brick.
+
+In the example volume given above, all files in brick-a will have 2 entries,
+one for itself and the other for the file present in its replica pair, i.e. brick-b:
+trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
+trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a
+
+Likewise, all files in brick-b will have:
+trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
+trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)
+
+The same can be extended for other replica pairs.
+
+Interpreting Changelog (roughly pending operation count) Value:
+Each extended attribute has a value which is 24 hexadecimal digits.
+The first 8 digits represent the changelog of data. The second 8 digits represent the
+changelog of metadata. The last 8 digits represent the changelog of directory entries.
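
The 24-digit layout just described can be decoded with a short Python sketch. The names `Changelog` and `parse_changelog` are hypothetical helpers introduced here for illustration; the input is assumed to be the hex string printed by `getfattr -e hex`, with or without the leading `0x`.

```python
from typing import NamedTuple


class Changelog(NamedTuple):
    """Pending-operation counters held in one afr changelog xattr."""
    data: int
    metadata: int
    entry: int


def parse_changelog(value: str) -> Changelog:
    """Split a 24-hex-digit changelog value into its three counters."""
    digits = value[2:] if value.lower().startswith("0x") else value
    if len(digits) != 24:
        raise ValueError("expected 24 hexadecimal digits")
    # int(..., 16) raises ValueError if a group is not valid hex.
    return Changelog(
        data=int(digits[0:8], 16),
        metadata=int(digits[8:16], 16),
        entry=int(digits[16:24], 16),
    )
```

For the value `0x000003d70000000100000000`, this yields a data count of 983 (hex 3d7), a metadata count of 1 and an entry count of 0.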
+
+Pictorially representing the same, we have:
+```
+0x 000003d7 00000001 00000000
+       |        |        |
+       |        |        \_ changelog of directory entries
+       |        \_ changelog of metadata
+       \_ changelog of data
+```
+
+
+For directories, metadata and entry changelogs are valid.
+For regular files, data and metadata changelogs are valid.
+For special files such as device files, the metadata changelog is valid.
+When a file split-brain happens it could be either data split-brain or
+metadata split-brain or both. When a split-brain happens the changelog of the
+file would be something like this:
+
+* Example: (Let's consider both data and metadata split-brain on the same file.)
+[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a
+getfattr: Removing leading '/' from absolute path names
+\#file: gfs/brick-a/a
+trusted.afr.vol-client-0=0x000000000000000000000000
+trusted.afr.vol-client-1=0x000003d70000000100000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+\#file: gfs/brick-b/a
+trusted.afr.vol-client-0=0x000003b00000000100000000
+trusted.afr.vol-client-1=0x000000000000000000000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+
+### Observations:
+
+#### According to changelog extended attributes on file /gfs/brick-a/a:
+The first 8 digits of trusted.afr.vol-client-0 are all
+zeros (0x00000000................), and the first 8 digits of
+trusted.afr.vol-client-1 are not all zeros (0x000003d7................).
+So the changelog on /gfs/brick-a/a implies that some data operations succeeded
+on itself but failed on /gfs/brick-b/a.
+
+The second 8 digits of trusted.afr.vol-client-0 are
+all zeros (0x........00000000........), and the second 8 digits of
+trusted.afr.vol-client-1 are not all zeros (0x........00000001........).
+So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded
+on itself but failed on /gfs/brick-b/a.
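
The reading of the attributes on /gfs/brick-a/a can be sketched in Python as follows. The function name `pending_against` and the dictionary layout are hypothetical, chosen only for illustration. For each afr changelog attribute with a non-zero counter, the sketch reports which kinds of operations this copy records as pending on the brick that attribute refers to.

```python
def pending_against(xattrs: dict) -> dict:
    """Map each afr changelog xattr to the operation kinds it marks pending."""
    kinds = ("data", "metadata", "entry")
    result = {}
    for name, value in xattrs.items():
        if not name.startswith("trusted.afr."):
            continue  # skip trusted.gfid and any other attributes
        digits = value[2:] if value.lower().startswith("0x") else value
        if len(digits) != 24:
            raise ValueError(f"{name}: expected 24 hexadecimal digits")
        # Three 8-digit counters: data, metadata, directory entries.
        counts = [int(digits[i:i + 8], 16) for i in (0, 8, 16)]
        pending = [kind for kind, count in zip(kinds, counts) if count != 0]
        if pending:
            result[name] = pending
    return result


# Attribute values copied from the getfattr output for gfs/brick-a/a.
brick_a = {
    "trusted.afr.vol-client-0": "0x000000000000000000000000",
    "trusted.afr.vol-client-1": "0x000003d70000000100000000",
    "trusted.gfid": "0x80acdbd886524f6fbefa21fc356fed57",
}
print(pending_against(brick_a))
```

With the brick-a values, only `trusted.afr.vol-client-1` is reported, with both data and metadata pending, which is the same conclusion reached in the prose.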
+
+#### According to changelog extended attributes on file /gfs/brick-b/a:
+The first 8 digits of trusted.afr.vol-client-0 are not all
+zeros (0x000003b0................), and the first 8 digits of
+trusted.afr.vol-client-1 are all zeros (0x00000000................).
+So the changelog on /gfs/brick-b/a implies that some data operations succeeded
+on itself but failed on /gfs/brick-a/a.
+
+The second 8 digits of trusted.afr.vol-client-0 are not
+all zeros (0x........00000001........), and the second 8 digits of
+trusted.afr.vol-client-1 are all zeros (0x........00000000........).
+So the changelog on /gfs/brick-b/a implies that some metadata operations succeeded
+on itself but failed on /gfs/brick-a/a.
+
+Since each copy has data and metadata changes that are not on the other
+copy, the file is in both data and metadata split-brain.
+
+Deciding on the correct copy:
+-----------------------------
+The user may have to inspect the stat and getfattr output of the files to decide which
+metadata to retain, and the contents of the file to decide which data to retain.
+Continuing with the example above, let's say we want to retain the data
+of /gfs/brick-a/a and the metadata of /gfs/brick-b/a.
+
+Resetting the relevant changelogs to resolve the split-brain:
+-------------------------------------------------------------
+For resolving data-split-brain:
+We need to change the changelog extended attributes on the files as if some data
+operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. But
+/gfs/brick-b/a should NOT have any changelog which says some data operations
+succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. We need to reset the
+data part of the changelog on trusted.afr.vol-client-0 of /gfs/brick-b/a.
+
+For resolving metadata-split-brain:
+We need to change the changelog extended attributes on the files as if some
+metadata operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a.
+But /gfs/brick-a/a should NOT have any changelog which says some metadata
+operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a.
+We need to reset the metadata part of the changelog on
+trusted.afr.vol-client-1 of /gfs/brick-a/a.
+
+So, the intended changes are:
+On /gfs/brick-b/a:
+For trusted.afr.vol-client-0
+0x000003b00000000100000000 to 0x000000000000000100000000
+(Note that the metadata part is still not all zeros)
+Hence execute
+`setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000 /gfs/brick-b/a`
+
+On /gfs/brick-a/a:
+For trusted.afr.vol-client-1
+0x000003d70000000100000000 to 0x000003d70000000000000000
+(Note that the data part is still not all zeros)
+Hence execute
+`setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a`
+
+Thus after the above operations are done, the changelogs look like this:
+[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a
+getfattr: Removing leading '/' from absolute path names
+\#file: gfs/brick-a/a
+trusted.afr.vol-client-0=0x000000000000000000000000
+trusted.afr.vol-client-1=0x000003d70000000000000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+
+\#file: gfs/brick-b/a
+trusted.afr.vol-client-0=0x000000000000000100000000
+trusted.afr.vol-client-1=0x000000000000000000000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+
+
+Triggering Self-heal:
+---------------------
+Perform `ls -l ` to trigger healing.
+
+Fixing Directory entry split-brain:
+----------------------------------
+Afr has the ability to conservatively merge different entries in the directories
+when there is a split-brain on a directory.
+If directory 'd' has entries '1', '2' on one brick and entries '3', '4' on
+the other brick, then afr will merge all of the entries so that the directory has
+entries '1', '2', '3', '4' on both bricks.
+(Note: this may cause deleted files to reappear if the split-brain
+happened because files were deleted from the directory.)
+Split-brain resolution needs human intervention when there is at least one entry
+which has the same file name but a different gfid in that directory.
+Example:
+On brick-a the directory has entries '1' (with gfid g1), '2' and on brick-b
+the directory has entries '1' (with gfid g2) and '3'.
+These kinds of directory split-brains need human intervention to resolve.
+The user needs to remove either file '1' on brick-a or the file '1' on brick-b
+to resolve the split-brain. In addition, the corresponding gfid-link file also
+needs to be removed. The gfid-link files are present in the .glusterfs folder
+in the top-level directory of the brick. If the gfid of the file is
+0x307a5c9efddd4e7c96e94fd4bcdcbd1b (the trusted.gfid extended attribute got
+from the getfattr command earlier), the gfid-link file can be found at
+> /gfs/brick-a/.glusterfs/30/7a/307a5c9e-fddd-4e7c-96e9-4fd4bcdcbd1b
+
+#### Word of caution:
+Before deleting the gfid-link, we have to ensure that there are no hard links
+to the file present on that brick. If hard links exist, they must be deleted as
+well.
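
The location of a gfid-link file can be derived from the `trusted.gfid` value with a small Python sketch. The function name `gfid_link_path` is hypothetical. The sketch assumes the link file is named with the hyphenated UUID form of the gfid and sits under two directory levels taken from its first four hex digits, which is how bricks normally lay out `.glusterfs`.

```python
import posixpath
import uuid


def gfid_link_path(brick_root: str, gfid_hex: str) -> str:
    """Return the expected gfid-link path for a gfid on a given brick."""
    digits = gfid_hex[2:] if gfid_hex.lower().startswith("0x") else gfid_hex
    # uuid.UUID validates the 32 hex digits and gives the hyphenated form.
    gfid = str(uuid.UUID(hex=digits))
    return posixpath.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


print(gfid_link_path("/gfs/brick-a", "0x307a5c9efddd4e7c96e94fd4bcdcbd1b"))
```

The sketch only computes a path. Checking for other hard links to the file, as the word of caution requires, still has to be done on the brick before anything is deleted.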