Steps to recover from File split-brain.
======================================

Quick Start:
============
1. Get the path of the file that is in split-brain:  
>  It can be obtained in either of two ways:  
>       a) Run the command `gluster volume heal <volname> info split-brain`.  
>       b) Identify the files for which file operations performed
           from the client keep failing with an Input/Output error.

2. Close the applications that opened this file from the mount point.
Virtual machines that use the file need to be powered off.

3. Decide on the correct copy:  
> This is done by observing the afr changelog extended attributes of the file on
the bricks using the getfattr command; then identifying the type of split-brain 
(data split-brain, metadata split-brain, entry split-brain or split-brain due to
gfid-mismatch); and finally determining which of the bricks contains the 'good copy'
of the file.  
>   `getfattr -d -m . -e hex <file-path-on-brick>`.  
It is also possible that one brick might contain the correct data while the
other might contain the correct metadata.

4. Reset the relevant extended attribute on the brick(s) that contain the
'bad copy' of the file data/metadata using the setfattr command.  
>   `setfattr -n <attribute-name> -v <attribute-value> <file-path-on-brick>`

5. Trigger self-heal on the file by performing lookup from the client:  
>   `ls -l <file-path-on-gluster-mount>`

Detailed Instructions for steps 3 through 5:  
===========================================
To understand how to resolve split-brain we need to know how to interpret the
afr changelog extended attributes.

Execute `getfattr -d -m . -e hex <file-path-on-brick>`

* Example:  
[root@store3 ~]# getfattr -d -e hex -m. brick-a/file.txt  
\#file: brick-a/file.txt  
security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000  
trusted.afr.vol-client-2=0x000000000000000000000000  
trusted.afr.vol-client-3=0x000000000200000000000000  
trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b  

The extended attributes named `trusted.afr.<volname>-client-<subvolume-index>`
are used by afr to maintain the changelog of the file. Their values are
calculated by the glusterfs client (fuse or nfs-server) processes. When the
glusterfs client modifies a file or directory, the client contacts each brick
and updates the changelog extended attribute according to the response of the
brick.
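
The attribute naming described above can be picked apart mechanically. The following is a minimal sketch, not part of glusterfs: the helper names `parse_getfattr_dump` and `afr_changelogs` are hypothetical, and the parser assumes the plain `name=value` dump format shown in the example above.

```python
import re

# Matches trusted.afr.<volname>-client-<subvolume-index>
AFR_KEY = re.compile(r"^trusted\.afr\.(?P<volume>.+)-client-(?P<index>\d+)$")


def parse_getfattr_dump(text):
    """Parse `getfattr -d -e hex` output into {file_path: {attr_name: hex_value}}."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# file:") or line.startswith("#file:"):
            # A new file section starts; the path follows the first colon.
            current = line.split(":", 1)[1].strip()
            result[current] = {}
        elif "=" in line and current is not None:
            name, value = line.split("=", 1)
            result[current][name] = value
    return result


def afr_changelogs(attrs):
    """Return {subvolume_index: hex_value} for the afr changelog attributes only."""
    logs = {}
    for name, value in attrs.items():
        match = AFR_KEY.match(name)
        if match:
            logs[int(match.group("index"))] = value
    return logs


sample = """\
# file: brick-a/file.txt
security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
trusted.afr.vol-client-2=0x000000000000000000000000
trusted.afr.vol-client-3=0x000000000200000000000000
trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b
"""
attrs = parse_getfattr_dump(sample)["brick-a/file.txt"]
print(afr_changelogs(attrs))
```

Only the `trusted.afr.*` attributes are returned by `afr_changelogs`; `trusted.gfid` and `security.selinux` stay available in the full `attrs` dictionary.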

'subvolume-index' is the brick number minus 1, where bricks are numbered from 1
in the order they are listed in the `gluster volume info <volname>` output.

* Example:  
[root@pranithk-laptop ~]# gluster volume info vol  
 Volume Name: vol  
 Type: Distributed-Replicate  
 Volume ID: 4f2d7849-fbd6-40a2-b346-d13420978a01  
 Status: Created  
 Number of Bricks: 4 x 2 = 8  
 Transport-type: tcp  
 Bricks:  
 brick-a: pranithk-laptop:/gfs/brick-a  
 brick-b: pranithk-laptop:/gfs/brick-b  
 brick-c: pranithk-laptop:/gfs/brick-c  
 brick-d: pranithk-laptop:/gfs/brick-d  
 brick-e: pranithk-laptop:/gfs/brick-e  
 brick-f: pranithk-laptop:/gfs/brick-f  
 brick-g: pranithk-laptop:/gfs/brick-g  
 brick-h: pranithk-laptop:/gfs/brick-h  

In the example above:  
```
Brick             |    Replica set        |    Brick subvolume index
----------------------------------------------------------------------------
-/gfs/brick-a     |       0               |       0
-/gfs/brick-b     |       0               |       1
-/gfs/brick-c     |       1               |       2
-/gfs/brick-d     |       1               |       3
-/gfs/brick-e     |       2               |       4
-/gfs/brick-f     |       2               |       5
-/gfs/brick-g     |       3               |       6
-/gfs/brick-h     |       3               |       7
```
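
The mapping in the table can be reproduced with simple integer arithmetic. This is a sketch only: `brick_layout` is a hypothetical helper, and it assumes that bricks are grouped into replica sets in the order they are listed, as in the example volume.

```python
def brick_layout(bricks, replica_count):
    """Return (brick, replica_set, subvolume_index) tuples in volume-info order."""
    return [
        (brick, index // replica_count, index)
        for index, brick in enumerate(bricks)
    ]


# The eight bricks of the example volume, replica count 2 (4 x 2 = 8).
bricks = ["/gfs/brick-" + letter for letter in "abcdefgh"]
for brick, replica_set, index in brick_layout(bricks, replica_count=2):
    print(f"{brick}  replica set {replica_set}  trusted.afr.vol-client-{index}")
```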

Each file in a brick maintains the changelog of itself and that of the files
present in all the other bricks in its replica set as seen by that brick.

In the example volume given above, all files in brick-a will have 2 entries,
one for itself and the other for the file present in its replica pair, i.e. brick-b:  
trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)  
trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a  

Likewise, all files in brick-b will have:  
trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b  
trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)  

The same can be extended for other replica pairs.  

Interpreting Changelog (roughly pending operation count) Value:  
Each extended attribute has a value which is 24 hexadecimal digits.  
The first 8 digits represent the changelog of data. The second 8 digits represent
the changelog of metadata. The last 8 digits represent the changelog of directory entries.  

Pictorially representing the same, we have:
```
0x 000003d7 00000001 00000000
       |        |        |
       |        |         \_ changelog of directory entries
       |         \_ changelog of metadata
        \_ changelog of data
```
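
The same split can be written as a few lines of Python. This is an illustrative sketch: the `Changelog` type and `parse_changelog` function are hypothetical names, not part of any glusterfs tooling.

```python
from typing import NamedTuple


class Changelog(NamedTuple):
    data: int      # first 8 hex digits
    metadata: int  # second 8 hex digits
    entry: int     # last 8 hex digits


def parse_changelog(value):
    """Split a 24-hex-digit changelog value into its three counters."""
    digits = value[2:] if value.lower().startswith("0x") else value
    if len(digits) != 24:
        raise ValueError(f"expected 24 hex digits, got {len(digits)}")
    return Changelog(
        data=int(digits[0:8], 16),
        metadata=int(digits[8:16], 16),
        entry=int(digits[16:24], 16),
    )


print(parse_changelog("0x000003d70000000100000000"))
# Changelog(data=983, metadata=1, entry=0)
```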
         

For directories, the metadata and entry changelogs are valid.  
For regular files, the data and metadata changelogs are valid.  
For special files such as device files, only the metadata changelog is valid.

When a file split-brain happens, it could be a data split-brain, a metadata
split-brain, or both. When a split-brain happens, the changelog of the
file would be something like this:  

* Example: (Let's consider both data and metadata split-brain on the same file.)  
[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a  
getfattr: Removing leading '/' from absolute path names  
\#file: gfs/brick-a/a  
trusted.afr.vol-client-0=0x000000000000000000000000  
trusted.afr.vol-client-1=0x000003d70000000100000000  
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57   
\#file: gfs/brick-b/a  
trusted.afr.vol-client-0=0x000003b00000000100000000  
trusted.afr.vol-client-1=0x000000000000000000000000  
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57  

###Observations:

####According to changelog extended attributes on file /gfs/brick-a/a:  
The first 8 digits of trusted.afr.vol-client-0 are all
zeros (0x00000000................), and the first 8 digits of
trusted.afr.vol-client-1 are not all zeros (0x000003d7................).
So the changelog on /gfs/brick-a/a implies that some data operations succeeded
on itself but failed on /gfs/brick-b/a.

The second 8 digits of trusted.afr.vol-client-0 are
all zeros (0x........00000000........), and the second 8 digits of
trusted.afr.vol-client-1 are not all zeros (0x........00000001........).
So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded 
on itself but failed on /gfs/brick-b/a.

####According to changelog extended attributes on file /gfs/brick-b/a:  
The first 8 digits of trusted.afr.vol-client-0 are not all
zeros (0x000003b0................), and the first 8 digits of
trusted.afr.vol-client-1 are all zeros (0x00000000................).
So the changelog on /gfs/brick-b/a implies that some data operations succeeded
on itself but failed on /gfs/brick-a/a.

The second 8 digits of trusted.afr.vol-client-0 are not
all zeros (0x........00000001........), and the second 8 digits of
trusted.afr.vol-client-1 are all zeros (0x........00000000........).
So the changelog on /gfs/brick-b/a implies that some metadata operations succeeded
on itself but failed on /gfs/brick-a/a.

Since each copy has data and metadata changes that are not on the other
copy, the file is in both data and metadata split-brain.
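
The reasoning in the observations above reduces to one rule for a two-brick replica: a category is in split-brain when each brick holds a non-zero pending count against the other. The sketch below encodes that rule; `pending` and `split_brain_types` are hypothetical names, and the function only looks at the two cross-brick attributes, not at a brick's changelog for itself.

```python
def pending(value):
    """Decode a 24-hex-digit changelog value into per-category counters."""
    digits = value[2:] if value.lower().startswith("0x") else value
    if len(digits) != 24:
        raise ValueError(f"expected 24 hex digits, got {len(digits)}")
    return {
        "data": int(digits[0:8], 16),
        "metadata": int(digits[8:16], 16),
        "entry": int(digits[16:24], 16),
    }


def split_brain_types(a_blames_b, b_blames_a):
    """Return the categories in which both bricks accuse each other."""
    a = pending(a_blames_b)
    b = pending(b_blames_a)
    return [kind for kind in ("data", "metadata", "entry") if a[kind] and b[kind]]


# trusted.afr.vol-client-1 on /gfs/brick-a/a and
# trusted.afr.vol-client-0 on /gfs/brick-b/a from the example above.
print(split_brain_types("0x000003d70000000100000000",
                        "0x000003b00000000100000000"))
# ['data', 'metadata']
```

If only one side has a non-zero counter for a category, that category is not in split-brain; the brick with the pending count is simply the source for a normal self-heal.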

Deciding on the correct copy:  
-----------------------------
The user may have to inspect the `stat` and `getfattr` output of the files to
decide which metadata to retain, and the contents of the file to decide which
data to retain. Continuing with the example above, let's say we want to retain
the data of /gfs/brick-a/a and the metadata of /gfs/brick-b/a.

Resetting the relevant changelogs to resolve the split-brain:  
-------------------------------------------------------------
For resolving data-split-brain:  
We need to change the changelog extended attributes on the files as if some data
operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. But
/gfs/brick-b/a should NOT have any changelog which says some data operations
succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. We need to reset the
data part of the changelog on trusted.afr.vol-client-0 of /gfs/brick-b/a.

For resolving metadata-split-brain:  
We need to change the changelog extended attributes on the files as if some
metadata operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a.
But /gfs/brick-a/a should NOT have any changelog which says some metadata
operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a.
We need to reset the metadata part of the changelog on
trusted.afr.vol-client-1 of /gfs/brick-a/a.

So, the intended changes are:  
On /gfs/brick-b/a:  
For trusted.afr.vol-client-0  
0x000003b00000000100000000 to 0x000000000000000100000000  
(Note that the metadata part is still not all zeros)  
Hence execute
`setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000 /gfs/brick-b/a`

On /gfs/brick-a/a:  
For trusted.afr.vol-client-1  
0x000003d70000000100000000 to 0x000003d70000000000000000  
(Note that the data part is still not all zeros)  
Hence execute  
`setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a`
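
The new attribute values are easy to get wrong when edited by hand, so the following sketch derives them by zeroing exactly one 8-digit field and leaving the others untouched. `clear_changelog_part` and `setfattr_command` are hypothetical helpers; the second one only builds the command string in the form used above and does not execute anything.

```python
# Position of each 8-digit field inside the 24-digit value.
FIELDS = {"data": 0, "metadata": 1, "entry": 2}


def clear_changelog_part(value, part):
    """Return the changelog value with one 8-digit field reset to zeros."""
    if not value.lower().startswith("0x") or len(value) != 26:
        raise ValueError("expected '0x' followed by 24 hex digits")
    digits = value[2:]
    start = FIELDS[part] * 8
    return "0x" + digits[:start] + "0" * 8 + digits[start + 8:]


def setfattr_command(attr, value, path):
    """Build (but do not run) the setfattr command line."""
    return f"setfattr -n {attr} -v {value} {path}"


# Keep the data of brick-a: clear the data accusation held by brick-b.
on_b = clear_changelog_part("0x000003b00000000100000000", "data")
# Keep the metadata of brick-b: clear the metadata accusation held by brick-a.
on_a = clear_changelog_part("0x000003d70000000100000000", "metadata")

print(setfattr_command("trusted.afr.vol-client-0", on_b, "/gfs/brick-b/a"))
print(setfattr_command("trusted.afr.vol-client-1", on_a, "/gfs/brick-a/a"))
```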

Thus after the above operations are done, the changelogs look like this:  
[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a  
getfattr: Removing leading '/' from absolute path names  
\#file: gfs/brick-a/a  
trusted.afr.vol-client-0=0x000000000000000000000000  
trusted.afr.vol-client-1=0x000003d70000000000000000  
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57  

\#file: gfs/brick-b/a  
trusted.afr.vol-client-0=0x000000000000000100000000  
trusted.afr.vol-client-1=0x000000000000000000000000  
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57  


Triggering Self-heal:
---------------------
Perform `ls -l <file-path-on-gluster-mount>` to trigger healing.

Fixing Directory entry split-brain:
----------------------------------
Afr has the ability to conservatively merge different entries in the directories
when there is a split-brain on directory.
If directory 'd' has entries '1', '2' on one brick and entries '3', '4' on
the other brick, then afr will merge all of the entries so that
'1', '2', '3', '4' are all present in the same directory.
(Note: this may cause deleted files to reappear if the split-brain
happened because files were deleted from the directory.)
Split-brain resolution needs human intervention when at least one entry in
that directory has the same file name but a different gfid on each brick.
Example:  
On brick-a the directory has entries '1' (with gfid g1), '2' and on brick-b
directory has entries '1' (with gfid g2) and '3'.
These kinds of directory split-brains need human intervention to resolve.
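
The merge rule and its failure case can be illustrated with a small model. This is a sketch of the rule as described here, not afr's implementation: `merge_entries` is a hypothetical function, each directory is modelled as a `{name: gfid}` dictionary, and the gfids for entries '2' and '3' are made-up placeholders because the example does not give them.

```python
def merge_entries(entries_a, entries_b):
    """Union two directory listings; report names whose gfids disagree.

    Returns (merged, conflicts): merged maps name -> gfid for entries that
    can be merged automatically, conflicts lists names needing manual work.
    """
    merged = {}
    conflicts = []
    for name in sorted(set(entries_a) | set(entries_b)):
        gfid_a = entries_a.get(name)
        gfid_b = entries_b.get(name)
        if gfid_a is not None and gfid_b is not None and gfid_a != gfid_b:
            conflicts.append(name)  # same name, different gfid
        else:
            merged[name] = gfid_a if gfid_a is not None else gfid_b
    return merged, conflicts


brick_a = {"1": "g1", "2": "gfid-of-2"}  # placeholder gfid for '2'
brick_b = {"1": "g2", "3": "gfid-of-3"}  # placeholder gfid for '3'
merged, conflicts = merge_entries(brick_a, brick_b)
print(merged)
print(conflicts)
```

Entries '2' and '3' merge cleanly, while entry '1' is reported as a conflict because its gfid differs between the two bricks.
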
The user needs to remove either file '1' on brick-a or the file '1' on brick-b
to resolve the split-brain. In addition, the corresponding gfid-link file also
needs to be removed. The gfid-link files are present in the .glusterfs folder
in the top-level directory of the brick. If the gfid of the file is
0x307a5c9efddd4e7c96e94fd4bcdcbd1b (the trusted.gfid extended attribute obtained
from the getfattr command earlier), the gfid-link file, which is named after the
gfid in its dashed UUID form, can be found at
> /gfs/brick-a/.glusterfs/30/7a/307a5c9e-fddd-4e7c-96e9-4fd4bcdcbd1b
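
The location of the gfid-link can be derived from the `trusted.gfid` value. The sketch below uses a hypothetical helper, `gfid_link_path`, and assumes that the link is named after the gfid in its canonical dashed UUID form, with the first two pairs of hex digits used as directory levels; verify the result on the brick before removing anything.

```python
import uuid
from pathlib import PurePosixPath


def gfid_link_path(brick_root, gfid_hex):
    """Build the expected gfid-link path for a trusted.gfid hex value."""
    digits = gfid_hex[2:] if gfid_hex.lower().startswith("0x") else gfid_hex
    # uuid.UUID validates the 32 hex digits and formats them as 8-4-4-4-12.
    canonical = str(uuid.UUID(hex=digits))
    return (PurePosixPath(brick_root) / ".glusterfs"
            / canonical[0:2] / canonical[2:4] / canonical)


print(gfid_link_path("/gfs/brick-a", "0x307a5c9efddd4e7c96e94fd4bcdcbd1b"))
# /gfs/brick-a/.glusterfs/30/7a/307a5c9e-fddd-4e7c-96e9-4fd4bcdcbd1b
```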

####Word of caution:
Before deleting the gfid-link, we have to ensure that there are no hard links
to the file present on that brick. If hard links exist, they must be deleted as
well.
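
One way to look for such hard links is to search the brick for other paths that share the file's inode. The following is a sketch under stated assumptions: `other_hard_links` is a hypothetical helper, it skips the top-level `.glusterfs` directory on the assumption that the gfid-link itself lives there as a hard link to the file, and the demonstration runs against a throwaway temporary directory rather than a real brick.

```python
import os
import tempfile


def other_hard_links(brick_root, file_path):
    """List paths under brick_root (outside .glusterfs) sharing file_path's inode."""
    target = os.lstat(file_path)
    target_real = os.path.realpath(file_path)
    matches = []
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # do not descend into the gfid store
        for name in filenames:
            candidate = os.path.join(dirpath, name)
            info = os.lstat(candidate)
            same_inode = (info.st_dev, info.st_ino) == (target.st_dev, target.st_ino)
            if same_inode and os.path.realpath(candidate) != target_real:
                matches.append(candidate)
    return sorted(matches)


# Demonstration on a temporary directory standing in for a brick.
with tempfile.TemporaryDirectory() as brick:
    original = os.path.join(brick, "a")
    with open(original, "w") as handle:
        handle.write("data")
    os.mkdir(os.path.join(brick, ".glusterfs"))
    os.link(original, os.path.join(brick, ".glusterfs", "gfid-link"))
    os.link(original, os.path.join(brick, "a-hardlink"))
    extra = [os.path.relpath(p, brick) for p in other_hard_links(brick, original)]

print(extra)
```

Walking a large brick this way can be slow; the sketch favours clarity over speed.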