glusterfs.git/xlators/cluster/afr/src/afr-self-heal.h, branch v5.1

afr: fix incorrect reporting of directory split-brain

2018-10-11T11:00:02+00:00

Backport of https://review.gluster.org/#/c/glusterfs/+/21135/

Problem:
When a directory has dirty xattrs due to failed post-ops or when
replace/reset brick is performed, AFR does a conservative merge as
expected, but heal-info reports it as split-brain because there are no
clear sources.

Fix:
Modify pending flag to contain information about pending heals and
split-brains. For directories, if spit-brain flag is not set,just show
them as needing heal and not being in split-brain.

Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383
fixes: bz#1638163
Signed-off-by: Ravishankar N

Land clang-format changes

2018-09-12T11:52:48+00:00

Change-Id: I6f5d8140a06f3c1b2d196849299f8d483028d33b

afr: heal gfids when file is not present on all bricks

2018-06-19T06:05:20+00:00

commit 20fa80057eb430fd72b4fa31b9b65598b8ec1265 introduced a regression
wherein if a file is present in only 1 brick of replica *and* doesn't
have a gfid associated with it, it doesn't get healed upon the next
lookup from the client. Fix it.

Change-Id: I7d1111dcb45b1b8b8340a7d02558f05df70aa599
fixes: bz#1591193
Signed-off-by: Ravishankar N

cluster/afr: Make AFR eager-locking similar to EC

2018-03-14T13:32:35+00:00

Problem:
1) Afr's eager-lock only works for data transactions.
2) When there are conflicting writes, write with conflicting region initiates
unlock of eager-lock leading to extra pre-ops and post-ops on the file. When
eager-lock goes off, it leads to extra fsyncs for random-write workload in afr.

Solution (that is modeled after EC):
In EC, when there is a conflicting write, it waits for the current write to
complete before it winds the conflicted write. This leads to better utilization
of network and disk, because we will not be doing extra xattrops and FSYNCs and
inodelk/unlock. Moved fd based counters to inode based counters.

I tried to model the solution based on EC's locking, but it is not similar to
AFR because we had to keep backward compatibility.

Lifecycle of lock:
==================
First transaction is added to inode->owners list and an inodelk will be sent on
the wire. All the next transactions will be put in inode->waiters list until
the first transaction completes inodelk and [f]xattrop completely.  Once
[f]xattrop also completes, all the requests in the inode->waiters list are
checked if it conflict with any of the existing locks which are in
inode->owners list and if not are added to inode->owners list and resumed with
doing transaction. When these transactions complete fop phase they will be
moved to inode->post_op list and resume the transactions that were paused
because of conflicts. Post-op and unlock will not be issued on the wire until
that is the last transaction on that inode. Last transaction when it has to
perform post-op can choose to sleep for deyed-post-op-secs value. During that
time if any other transaction comes, it will wake up the sleeping transaction
and takes over the ownership of the lock and the cycle continues. If the
dealyed-post-op-secs expire, then the timer thread will wakeup the sleeping
transaction and it will set lock->release to true and starts doing post-op and
then unlock. During this time if any other transactions come, they will be put
in inode->frozen list. Once the previous unlock comes it will move the frozen
list to waiters list and moves the first element from this waiters-list to
owners-list and attempts the lock and the cycle continues. This is the general
idea.  There is logic at the time of dealying and at the time of new
transaction or in flush fop to wakeup existing sleeping transactions or
choosing whether to delay a transaction etc, which is subjected to change based
on future enhancements etc.

Fixes: #418
BUG: 1549606
Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc
Signed-off-by: Pranith Kumar K

cluster/afr: Fail open on split-brain

2017-10-26T18:23:35+00:00

Problem:
Append on a file with split-brain succeeds. Open is intercepted by open-behind,
when write comes on the file, open-behind does open+write. Open succeeds
because afr doesn't fail it. Then write succeeds because write-behind
intercepts it. Flush is also intercepted by write-behind, so the application
never gets to know that the write failed.

Fix:
Fail open on split-brain, so that when open-behind does open+write open fails
which leads to write failure. Application will know about this failure.

Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15
BUG: 1294051
Signed-off-by: Pranith Kumar K

afr: heal gfid as a part of entry heal

2017-10-09T06:23:08+00:00

Problem:
If a brick crashes after an entry (file or dir) is created but before
gfid is assigned, the good bricks will have pending entry heal xattrs
but the heal won't complete because afr_selfheal_recreate_entry() tries
to create the entry again and it fails with EEXIST.

Fix:
We could have fixed posx_mknod/mkdir etc to assign the gfid if the file
already exists but the right thing to do seems to be to trigger a lookup
on the bad brick and let it heal the gfid instead of winding an
mknod/mkdir in the first place.

Change-Id: I82f76665a7541f1893ef8d847b78af6466aff1ff
BUG: 1493415
Signed-off-by: Ravishankar N

afr: auto-resolve split-brains for zero-byte files

2017-09-26T04:04:18+00:00

Problems:
As described in BZ 1491670, renaming hardlinks can result in data/mdata
split-brain of the DHT link-to files (T files) without any mismatch of
data and metadata.

As described in BZ 1486063, for a zero-byte file with only dirty bits
set, arbiter brick will likely be chosen as the source brick.

Fix:
For zero byte files in split-brain, pick first brick as
a) data source if file size is zero on all bricks.
b) metadata source if metadata is the same on all bricks

In arbiter case, if file size is zero on all bricks and there are no
pending afr xattrs, pick 1st brick as data source.

Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04
BUG: 1491670
Signed-off-by: Ravishankar N 
Reported-by: Rahul Hinduja 
Reported-by: Mabi

afr: discover/lookup heal fixes

2017-09-04T05:31:31+00:00

Addresses review comments in commit 468ca877807625817b72921d1e9585036687b640

Change-Id: I04b1bd3b00abfd6758798d6272954e36a24249a9
BUG: 1473636
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/18187
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

afr: check validity of afr_reply

2017-08-31T09:40:58+00:00

...in various self-heal code paths.

Originally found by Pranith in __afr_selfheal_name_impunge ()

Also change __afr_selfheal_assign_gfid() to send lookup only on those
bricks that don't have a gfid matching that of the source.

Change-Id: I70a2ccd750a2af92c5fc36e0eefb2b6125404b4a
BUG: 1482923
Signed-off-by: Pranith Kumar K 
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/18065
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System

afr: heal metadata in discover code path

2017-08-16T11:46:47+00:00

During graph switch, if fuse sends nameless (gfid) lookups, afr takes
the discover code path to serve it. If there are pending metadata heals,
they do not happen unless an inode refresh happens as a part of
discover (which is not guaranteed to happen always).

This patch fixes it by attempting metadata heal as a part of discover,
just like how it is done in lookup code path.

Also removed creating superfluous  heal frames when launching heal.

Change-Id: I49868649361ebe5d70b6ea150f4686169b6c3070
BUG: 1473636
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/17850
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Karthik U S