glusterfs.git/xlators/cluster/afr/src, branch v3.12.14

cluster/afr: Fix dict-leak in pre-op

2018-08-17T05:34:35+00:00

At the time of pre-op, pre_op_xdata is populted with the xattrs we get from the
disk and at the time of post-op it gets over-written without unreffing the
previous value stored leading to a leak.
This is a regression we missed in
https://review.gluster.org/#/q/ba149bac92d169ae2256dbc75202dc9e5d06538e


Originally:
> Signed-off-by: Pranith Kumar K 
> (cherry picked from commit e7b79c59590c203c65f7ac8548b30d068c232d33)

Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7
Fixes: bz#1613512
Signed-off-by: Amar Tumballi

afr: don't update readables if inode refresh failed on all children

2018-07-11T14:04:25+00:00

Backport of: https://review.gluster.org/#/c/20029/
3.12 still supports quorum-reads, hence modified afr_inode_refresh_done() to
support that.

If inode refresh failed on all children of afr due to ENOENT (say file
migrated by dht), it resets the readables to zero. Any inflight txn which
then later comes on the inode fails with EIO because no readable
children present for the inode.

Fix:
Don't update readables when inode refresh fails on *all* children of
afr. In that way any inflight txns will either proceed with its own inode
refresh if needed and fail it with the right errno or use the old value
of readables and continue with the txn.

Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we
seem to be winding the lock and checking for quorum only in pre-op pahse.

Note: This should ideally fix BZ 1329505 since the stop gap fix for
it is has been reverted at https://review.gluster.org/#/c/20028.

Change-Id: I82990769f01be918a073fec83fc67ba4b3be24b1
BUG: 1599247
Signed-off-by: Ravishankar N

afr: heal gfids when file is not present on all bricks

2018-07-11T14:03:47+00:00

Backport of https://review.gluster.org/#/c/20271/ (only change is in .t)

commit 20fa80057eb430fd72b4fa31b9b65598b8ec1265 introduced a regression
wherein if a file is present in only 1 brick of replica *and* doesn't
have a gfid associated with it, it doesn't get healed upon the next
lookup from the client. Fix it.

Change-Id: I7d1111dcb45b1b8b8340a7d02558f05df70aa599
BUG: 1598121
fixes: bz#1598121
Signed-off-by: Ravishankar N 
(cherry picked from commit eb472d82a083883335bc494b87ea175ac43471ff)

afr: fix bug-1363721.t failure

2018-07-09T10:03:07+00:00

Backport of https://review.gluster.org/#/c/20036/
Note:  We need to update inode context's write_subvol even in case of compound
fops. This is not there in master and 4.1 since compound FOPS was removed in it.

Problem:
In the .t, when the only good brick was brought down, writes on the fd were
still succeeding on the bad bricks. The inflight split-brain check was
marking the write as failure but since the write succeeded on all the
bad bricks, afr_txn_nothing_failed() was set to true and we were
unwinding writev with success to DHT and then catching the failure in
post-op in the background.

Fix:
Don't wind the FOP phase if the write_subvol (which is populated with readable
subvols obtained in pre-op cbk) does not have at least 1 good brick which was up
when the transaction started.

Change-Id: I4a1fef4569609c31cffeaef591a64c10870e8d0b
BUG: 1598720
Signed-off-by: Ravishankar N

afr: add quorum checks in pre-op

2018-07-06T01:42:54+00:00

Backport of https://review.gluster.org/#/c/19781/

Problem:
We seem to be winding the FOP if pre-op did not succeed on quorum bricks
and then failing the FOP with EROFS since the fop did not meet quorum.
This essentially masks the actual error due to which pre-op failed. (See
BZ).

Fix:
Skip FOP phase if pre-op quorum is not met and go to post-op.

Change-Id: Ie58a41e8fa1ad79aa06093706e96db8eef61b6d9
BUG: 1597154
Signed-off-by: Ravishankar N

afr: capture the correct errno in post-op quorum check

2018-07-05T05:56:24+00:00

If the post-op phase of txn did not meet quorm checks, use that errno to
unwind the FOP rather than blindly setting ENOTCONN.

Change-Id: I0cb0c8771ec75a45f9a25ad4cd8601103deddf0c
BUG: 1597120
Signed-off-by: Ravishankar N 
(cherry picked from commit 440a048f24b006c80af3d7bcd0a1f13fe3459d87)

afr: add quorum checks in post-op

2018-07-04T04:04:22+00:00

afr relies on pending changelog xattrs to identify source and sinks and the
setting of these xattrs happen in post-op. So if post-op fails, we need to
unwind the write txn with a failure.

Change-Id: I0f019ac03890108324ee7672883d774918b20be1
BUG: 1597120
Signed-off-by: Ravishankar N 
(cherry picked from commit a40a87ec3b226ae86a6ed8f4af25b45965a20cad)

afr: don't treat all cases all bricks being blamed as split-brain

2018-07-04T04:03:45+00:00

Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not
met. Though the FOP is still unwound with failure, the xattrs remain on
the disk.  Due to these partial post-ops and partial heals (healing only when
2 bricks are up), we can end up in split-brain purely from the afr
xattrs point of view i.e each brick is blamed by atleast one of the
others. These scenarios are hit when there is frequent
connect/disconnect of the client/shd to the bricks while I/O or heal
are in progress.

Fix:
Instead of undoing the post-op, pick a source based on the xattr values.
If 2 bricks blame one, the blamed one must be treated as sink.
If there is no majority, all are sources. Once we pick a source,
self-heal will then do the heal instead of erroring out due to
split-brain.

Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae
BUG: 1597123
Signed-off-by: Ravishankar N 
(cherry picked from commit 0e6e8216823c2d9dafb81aae0f6ee3497c23d140)

cluster/afr: Fixing the flaws in arbiter becoming source patch

2018-04-18T13:23:19+00:00

Backport of https://review.gluster.org/19045

Problem:
Setting the write_subvol value to read_subvol in case of metadata
transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03)
might lead to the original problem of arbiter becoming source.

Scenario:
1) All bricks are up and good
2) 2 writes w1 and w2 are in progress in parallel
3) ctx->read_subvol is good for all the subvolumes
4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on
   the disk
5) read/lookup comes on the same file and refreshes read_subvols back
   to all good
6) metadata transaction happens which makes ctx->write_subvol to be
   assigned with ctx->read_subvol which is all good
7) w2 succeeds on brick1 and fails on brick0 and this will update the
   brick in reverse order leading to arbiter becoming source

Fix:
Instead of setting the ctx->write_subvol to ctx->read_subvol in the
pre-op statge, if there is a metadata transaction, check in the
function __afr_set_in_flight_sb_status() if it is a data/metadata
transaction. Use the value of ctx->write_subvol if it is a data
transactions and ctx->read_subvol value for other transactions.

With this patch we assign the value of ctx->write_subvol in the
afr_transaction_perform_fop() with the on disk value, instead of
assigning it in the afr_changelog_pre_op() with the in memory value.

Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4
BUG: 1566131
Signed-off-by: karthik-us 
Signed-off-by: Ravishankar N

cluster/afr: Fix for arbiter becoming source

2018-04-18T13:23:19+00:00

Backport of https://review.gluster.org/#/c/18049/

Problem:
When eager-lock is on, and two writes happen in parallel on a FD
we were observing the following behaviour:
- First write fails on one data brick
- Since the post-op is not yet happened, the inode refresh will get
  both the data bricks as readable and set it in the inode context
- In flight split brain check see both the data bricks as readable
  and allows the second write
- Second write fails on the other data brick
- Now the post-op happens and marks both the data bricks as bad and
  arbiter will become source for healing

Fix:
Adding one more variable called write_suvol in inode context and it
will have the in memory representation of the writable subvols. Inode
refresh will not update this value and its lifetime is pre-op through
unlock in the afr transaction. Initially the pre-op will set this
value same as read_subvol in inode context and then in the in flight
split brain check we will use this value instead of read_subvol.
After all the checks we will update the value of this and set the
read_subvol same as this to avoid having incorrect value in that.

Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3
BUG: 1566131
Signed-off-by: karthik-us