glusterfs.git/xlators/cluster/afr/src, branch v3.12.9

cluster/afr: Fixing the flaws in arbiter becoming source patch

2018-04-18T13:23:19+00:00

Backport of https://review.gluster.org/19045

Problem:
Setting the write_subvol value to read_subvol in case of metadata
transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03)
might lead to the original problem of arbiter becoming source.

Scenario:
1) All bricks are up and good
2) 2 writes w1 and w2 are in progress in parallel
3) ctx->read_subvol is good for all the subvolumes
4) w1 succeeds on brick0 and fails on brick1; its post-op has not yet
   been done on disk
5) read/lookup comes on the same file and refreshes read_subvols back
   to all good
6) a metadata transaction happens, which assigns ctx->read_subvol
   (all good) to ctx->write_subvol
7) w2 succeeds on brick1 and fails on brick0; this updates the
   bricks in the reverse order, leading to arbiter becoming source

Fix:
Instead of setting ctx->write_subvol to ctx->read_subvol in the
pre-op stage when there is a metadata transaction, check the
transaction type in __afr_set_in_flight_sb_status(). Use the value of
ctx->write_subvol for a data transaction and the value of
ctx->read_subvol for other transactions.

With this patch we assign the value of ctx->write_subvol in the
afr_transaction_perform_fop() with the on disk value, instead of
assigning it in the afr_changelog_pre_op() with the in memory value.
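
The selection rule in the fix above can be sketched as follows. This is a
simplified model and not the AFR source: the enum, the struct and
sketch_in_flight_base() are hypothetical names, and each subvolume set is
reduced to a plain bitmask (bit i set means brick i is considered good).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transaction types; the real AFR enum is not reproduced here. */
typedef enum {
    SKETCH_TXN_DATA,
    SKETCH_TXN_METADATA,
    SKETCH_TXN_ENTRY
} sketch_txn_type_t;

/* Hypothetical stand-in for the per-inode context. */
typedef struct {
    uint32_t read_subvol;  /* may be refreshed by a read/lookup at any time */
    uint32_t write_subvol; /* only changed inside a transaction */
} sketch_inode_ctx_t;

/* Pick the mask that the in-flight split-brain check starts from:
 * write_subvol for data transactions, read_subvol for everything else. */
uint32_t
sketch_in_flight_base(const sketch_inode_ctx_t *ctx, sketch_txn_type_t type)
{
    assert(ctx != NULL);

    if (type == SKETCH_TXN_DATA)
        return ctx->write_subvol;

    return ctx->read_subvol;
}
```

With this rule a metadata transaction no longer needs to overwrite
write_subvol, so an inode refresh in between two writes cannot reset what
the first write learned.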

Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4
BUG: 1566131
Signed-off-by: karthik-us 
Signed-off-by: Ravishankar N

cluster/afr: Fix for arbiter becoming source

2018-04-18T13:23:19+00:00

Backport of https://review.gluster.org/#/c/18049/

Problem:
When eager-lock is on and two writes happen in parallel on an FD,
we were observing the following behaviour:
- First write fails on one data brick
- Since the post-op has not yet happened, the inode refresh will get
  both the data bricks as readable and set it in the inode context
- The in-flight split-brain check sees both the data bricks as readable
  and allows the second write
- Second write fails on the other data brick
- Now the post-op happens and marks both the data bricks as bad and
  arbiter will become source for healing

Fix:
Add one more variable called write_subvol to the inode context; it
holds the in-memory representation of the writable subvols. Inode
refresh does not update this value, and its lifetime is from pre-op
through unlock in the afr transaction. Initially the pre-op sets this
value to the read_subvol in the inode context, and the in-flight
split-brain check then uses this value instead of read_subvol.
After all the checks we update this value and set read_subvol to the
same value, so that read_subvol does not hold an incorrect value.
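
A minimal sketch of the lifetime described above, under stated assumptions:
the struct and the model_* functions are hypothetical, bricks are bits in a
mask, and the in-flight check is reduced to "refuse if no writable brick
would remain". It is an illustration of the idea, not the AFR code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode context. */
typedef struct {
    uint32_t read_subvol;  /* readable bricks, refreshed from disk */
    uint32_t write_subvol; /* writable bricks, in memory only */
} model_ctx_t;

/* Inode refresh: only the readable set is updated. */
void
model_inode_refresh(model_ctx_t *ctx, uint32_t on_disk_good)
{
    assert(ctx != NULL);
    ctx->read_subvol = on_disk_good;
}

/* Pre-op: the writable set starts out equal to the readable set. */
void
model_pre_op(model_ctx_t *ctx)
{
    assert(ctx != NULL);
    ctx->write_subvol = ctx->read_subvol;
}

/* In-flight check for a write that failed on the bricks in `failed`.
 * Returns 0 and records the result if a good brick remains,
 * -1 if accepting the write would leave no good brick. */
int
model_in_flight_check(model_ctx_t *ctx, uint32_t failed)
{
    uint32_t remaining;

    assert(ctx != NULL);
    remaining = ctx->write_subvol & ~failed;
    if (remaining == 0)
        return -1;

    ctx->write_subvol = remaining;
    ctx->read_subvol = remaining; /* keep read_subvol consistent */
    return 0;
}
```

In this model, a refresh that happens between the two writes restores
read_subvol to "all good" but leaves write_subvol untouched, so the second
write failing on the other brick is detected instead of being allowed.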

Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3
BUG: 1566131
Signed-off-by: karthik-us

cluster/afr: Prevent ping-event handling on shd

2018-04-06T12:50:03+00:00

On shd, we shouldn't treat any brick as down based
on latency; otherwise self-heal will never happen.
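
The intent can be sketched as a small policy function. Everything here is
an assumption made for illustration: the function name, the is_shd flag and
the latency threshold parameter are hypothetical and do not come from the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether a ping latency sample is allowed to mark a brick down.
 * In the self-heal daemon latency is ignored, so a slow brick is still
 * considered a heal target. */
bool
sketch_latency_may_mark_down(bool is_shd, long latency_msec,
                             long max_latency_msec)
{
    assert(latency_msec >= 0 && max_latency_msec >= 0);

    if (is_shd)
        return false;

    return latency_msec > max_latency_msec;
}
```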

fixes: 1562723
Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7
BUG: 1562723
Signed-off-by: Pranith Kumar K

cluster/afr: Fail open on split-brain

2018-03-08T06:39:57+00:00

Problem:
Append on a file with split-brain succeeds. Open is intercepted by open-behind;
when a write comes on the file, open-behind does open+write. Open succeeds
because afr doesn't fail it. Then write succeeds because write-behind
intercepts it. Flush is also intercepted by write-behind, so the application
never gets to know that the write failed.

Fix:
Fail open on split-brain, so that when open-behind does open+write, the open
fails, which leads to write failure. The application will know about this failure.
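
A rough model of why failing the open is enough, assuming split-brain is
represented as an empty set of readable bricks. The function names are
hypothetical, and the deferred open below only imitates the open+write
ordering described in the problem statement.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Open check: no readable brick is treated as split-brain here. */
int
sketch_open(uint32_t readable)
{
    return (readable == 0) ? -EIO : 0;
}

/* Deferred open: the open is only issued when the first write arrives.
 * `write_result` is what the write would return on its own
 * (0 or a negative errno). */
int
sketch_deferred_open_write(uint32_t readable, int write_result)
{
    int ret;

    assert(write_result <= 0);

    ret = sketch_open(readable);
    if (ret < 0)
        return ret; /* the write is never attempted */

    return write_result;
}
```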

Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15
BUG: 1544635
Signed-off-by: Pranith Kumar K 
(cherry picked from commit 786343abca3474ff01aa1017210112d97cbc4843)

cluster/afr: remove unnecessary child_up initialization

2018-02-06T07:07:33+00:00

The child_up array was initialized with all elements being -1 to
allow afr_notify() to differentiate down bricks from bricks that
haven't reported yet. With the current implementation this is no longer
needed, and it was causing unexpected results when other parts of
the code assumed that child_up[i] != 0 meant the brick was up.
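
The mismatch can be shown with a small example. The two counting functions
are hypothetical and use a plain int array rather than the real child_up
type; they only illustrate how a -1 "not reported yet" marker is miscounted
by a non-zero test.

```c
#include <assert.h>
#include <stddef.h>

/* Counts a child as up whenever its entry is non-zero. */
size_t
sketch_count_up_nonzero(const int *child_up, size_t n)
{
    size_t i, count = 0;

    assert(child_up != NULL);
    for (i = 0; i < n; i++)
        if (child_up[i])
            count++;

    return count;
}

/* Counts a child as up only when its entry is exactly 1. */
size_t
sketch_count_up_strict(const int *child_up, size_t n)
{
    size_t i, count = 0;

    assert(child_up != NULL);
    for (i = 0; i < n; i++)
        if (child_up[i] == 1)
            count++;

    return count;
}
```

For the array { -1, 0, 1 } the non-zero test reports two children up while
only one actually is; initializing to 0 makes both tests agree.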

Backport of:
> BUG: 1541038

Change-Id: I2a9d712ee64c512f24bd5cd3a48dcb37e3139472
BUG: 1541930
Signed-off-by: Xavier Hernandez

cluster/afr: Honor default timeout of 5min for analyzing split-brain files

2017-11-30T06:42:49+00:00

Problem:
After setting the split-brain-choice option to analyze and resolve a
file in split-brain, using the command
"setfattr -n replica.split-brain-choice -v "choiceX" "
the file should be accessible from the mount for the default timeout of 5 mins.
But the timeout was not honored, and the file could still be accessed
after the timeout.

Fix:
Call inode_invalidate() in afr_set_split_brain_choice_cbk() so that
it triggers cache invalidation after resetting the timer and the
split-brain choice. The next calls to access the file will then fail with EIO.
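
A toy model of the effect, assuming that accesses can be served from a
cache above AFR while it is still valid. The struct, its fields and both
functions are hypothetical; the real callback and inode_invalidate() are
not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode state. */
typedef struct {
    int spb_choice;   /* chosen brick index, -1 when no choice is set */
    bool cache_valid; /* stands in for cached state above AFR */
} sketch_inode_t;

/* Timer expiry: reset the choice and, with the fix, invalidate the cache. */
void
sketch_choice_timeout(sketch_inode_t *inode, bool invalidate)
{
    assert(inode != NULL);

    inode->spb_choice = -1;
    if (invalidate)
        inode->cache_valid = false;
}

/* Access: a valid cache answers without consulting the choice at all. */
int
sketch_access(const sketch_inode_t *inode)
{
    assert(inode != NULL);

    if (inode->cache_valid)
        return 0;

    return (inode->spb_choice < 0) ? -EIO : 0;
}
```

Without the invalidation step the model keeps returning success after the
timeout; with it, the next access fails with EIO.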

Change-Id: I698cb833676b22ff3e4c6daf8b883a0958f51a64
BUG: 1514380
Signed-off-by: karthik-us 
(cherry picked from commit 933ec57ccda2c1ba5ce6f207313c3b6802e67ca3)

cluster/afr: Make choose-local "reconfigurable"

2017-10-12T18:46:10+00:00

        Backport of:
        > Change-Id: Ibab292ba705d993b475cd0303fb3318211fb2500
        > Reviewed-on: https://review.gluster.org/18026
        > BUG: 1480525
        > cherry-picked from commit 1e2d6537875d16b783e3c50ada7ee61487c6d796

With this change, enabling choose-local (that is, its state
transitions from "off" to "on") takes effect after the first
gfid-lookup on "/" following the volume-set.

Change-Id: Ibab292ba705d993b475cd0303fb3318211fb2500
BUG: 1501022
Signed-off-by: Krutika Dhananjay

afr: heal gfid as a part of entry heal

2017-10-10T05:33:15+00:00

Problem:
If a brick crashes after an entry (file or dir) is created but before
gfid is assigned, the good bricks will have pending entry heal xattrs
but the heal won't complete because afr_selfheal_recreate_entry() tries
to create the entry again and it fails with EEXIST.

Fix:
We could have fixed posix_mknod/mkdir etc. to assign the gfid if the file
already exists, but the right thing to do seems to be to trigger a lookup
on the bad brick and let it heal the gfid instead of winding an
mknod/mkdir in the first place.
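
The decision can be sketched as below. This is an assumption-laden model:
the enum, the function and its two boolean inputs are hypothetical, and the
real heal path involves more state than shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for healing one entry on the bad brick. */
typedef enum {
    SKETCH_HEAL_CREATE,      /* entry is missing: create it */
    SKETCH_HEAL_LOOKUP_GFID, /* entry exists without gfid: heal via lookup */
    SKETCH_HEAL_NOTHING      /* entry exists with gfid: nothing to create */
} sketch_heal_action_t;

/* Choose the action from what the bad brick reports for the entry. */
sketch_heal_action_t
sketch_entry_heal_action(bool exists_on_bad_brick, bool has_gfid)
{
    /* A gfid cannot be present on an entry that does not exist. */
    assert(exists_on_bad_brick || !has_gfid);

    if (!exists_on_bad_brick)
        return SKETCH_HEAL_CREATE;

    if (!has_gfid)
        return SKETCH_HEAL_LOOKUP_GFID;

    return SKETCH_HEAL_NOTHING;
}
```

Checking the bad brick first avoids the EEXIST failure, because a create is
only attempted when the entry is really absent.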

(cherry picked from commit 20fa80057eb430fd72b4fa31b9b65598b8ec1265)
Change-Id: I82f76665a7541f1893ef8d847b78af6466aff1ff
BUG: 1499202
Signed-off-by: Ravishankar N

cluster/afr: Sending subvol up/down events when subvol comes up or goes down

2017-10-06T06:32:18+00:00

> BUG: 1493539

(cherry picked from commit 3bbb4fe4b33dc3a3ed068ed2284077f2a4d8265a)

Change-Id: I6580351b245d5f868e9ddc6a4eb4dd6afa3bb6ec
BUG: 1492066
Signed-off-by: karthik-us

afr: don't check for file size in afr_mark_source_sinks_if_file_empty

2017-10-05T12:54:40+00:00

... for AFR_METADATA_TRANSACTION and just mark source and sinks if
metadata is the same.

(cherry picked from commit 24637d54dcbc06de8a7de17c75b9291fcfcfbc84)
Change-Id: I69e55d3c842c7636e3538d1b57bc4deca67bed05
BUG: 1496317
Signed-off-by: Ravishankar N