glusterfs.git/xlators/cluster/afr/src, branch v3.11.0rc0

Halo Replication feature for AFR translator

2017-05-08T05:37:07+00:00

	Backport of https://review.gluster.org/16177
		    https://review.gluster.org/17174

Merged both these patches to make sure IPV6 changes don't make it to 3.11 at all.

Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10seconds) which
will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.

New FUSE mount options:
  halo-latency & halo-min-replicas: As descripted above.

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is gaurenteed for synchronous clients, this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients on the other hand and merely consistent to their region.
  Writes & deletes will be protected via entry-locks as usual preventing
  concurrent writes into files which are undergoing replication.  Read operations
  on the other hand should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The take away from this is care should be taken to ensure
  multiple writers do not write the same files resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &
  size).  Recommended method for preventing this is using the nfs-auth feature to
  define which region for each share has RW permissions, tiers not in the origin
  region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling & better dent type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer & valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

 >Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
 >BUG: 1428061
 >Signed-off-by: Kevin Vigor 
 >Reviewed-on: http://review.gluster.org/16099
 >Reviewed-on: https://review.gluster.org/16177
 >Tested-by: Pranith Kumar Karampuri 
 >Smoke: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >CentOS-regression: Gluster Build System 
 >Reviewed-by: Pranith Kumar Karampuri 

BUG: 1448416
Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/17192
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Kaushal M

cluster/afr: GFID split brain resolution with favorite-child-policy

2017-04-21T00:38:54+00:00

Problem:
Currently the automatic split brain resolution with favorite child policy
is not resolving the GFID split brains.

Fix:
When there is a GFID split brain and the favorite child policy is set to
size/mtime/ctime/majority, based on the policy decide on the source and
sinks. Delete the entry from the sinks and recreate it from the source.
Mark the appropriate pending attributes and resolve the GFID split brain.
When the heal takes place it will complete the pending heals and reset
the attributes.

Change-Id: Ie30e5373f94ca6f276745d9c3ad662b8acca6946
BUG: 1430719
Signed-off-by: karthik-us 
Reviewed-on: https://review.gluster.org/16878
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Ravishankar N 
CentOS-regression: Gluster Build System

afr: don't do a post-op on a brick if op failed

2017-04-19T02:29:25+00:00

Problem:
In afr-v2, self-blaming xattrs are not there by design. But if the FOP
failed on a brick due to an error other than ENOTCONN (or even due to
ENOTCONN, but we regained connection before postop was wound), we wind
the post-op also on the failed brick, leading to setting self-blaming
xattrs on that brick. This can lead to undesired results like healing of
files in split-brain etc.

Fix:
If a fop failed on a brick on which pre-op was successful, do not
perform post-op on it. This also produces the desired effect of not
resetting the dirty xattr on the brick, which is how it should be
because if the fop failed on a brick, there is no reason to clear the
dirty bit which actually serves as an indication of the failure.

Change-Id: I5f1caf4d1b39f36cf8093ccef940118638caa9c4
BUG: 1438255
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/16976
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

syncop: don't wake task in synctask_wake unless really needed

2017-03-28T22:34:48+00:00

Problem:

In EC and AFR, we launch synctasks during self-heal.

(i) These tasks usually stackwind a FOP to all its children and call
synctask_yield() which does a swapcontext to synctask_switchto() and puts the
task in syncenv's waitq by calling __wait(task). This happends as long as the
FOP ckbs from all children haven't been received.

(ii) For each FOP cbk, we call synctask_wake() which again does a swapcontext
to synctask_switchto() which now puts the task in syncenv's runq by calling
__run(task). When the task runs and the conext switches back to the FOP path,
it puts the task in waitq because we haven't heard from all children as
explained in (i).

Thus we are unnecessarily using the swapcontext syscalls to just toggle
the task back and forth between the waitq and runq.

Fix:
Store the stackwind count in new variable 'syncbarrier->waitfor' before
winding the fop. In each cbk when we call synctask_wake(),  perform an actual
wake only if the cbk count == stackwind count.

Change-Id: Id62d3b6ffed5a8c50f8b79267fb34e9470ba5ed5
BUG: 1434274
Signed-off-by: Ravishankar N 
Signed-off-by: Ashish Pandey 
Reviewed-on: https://review.gluster.org/16931
Smoke: Gluster Build System 
Reviewed-by: Niels de Vos 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Undo pending xattrs only on the up bricks

2017-03-27T09:52:31+00:00

Problem:
While doing conservative merge, even if a brick is down, it will reset
the pending xattr on that. When that brick comes up, as part of the
heal, it will consider this brick as the source and removes the entries
on the other bricks, which leads to data loss.

Fix:
Undo pending only for the bricks which are up.

Change-Id: I18436fa0bb1faa5f60531b357dea3f6b20446303
BUG: 1433571
Signed-off-by: karthik-us 
Reviewed-on: https://review.gluster.org/16913
Reviewed-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N

afr: do not mention split-brain in log message in read_txn

2017-03-20T13:58:50+00:00

I am seeing a lot of messages in qe/customer logs where read_txn
complains that file is possibly in split-brain because of no readable
subvol being found, does inode refresh and then there is no split-brain
message post the inode refresh. This means that a lookup was not issued
on the indoe to populate 'readable' or it can mean one brick is source
for data and the other for metadata, making readable to be zero (because
readable=intersection of (data,metadata readable) since commit
7a1c1e290470149696.

Since we anyway log actual split-brains post inode-refresh, move this
message to DEBUG log level.

Change-Id: Idb88b8ea362515279dc9b246f06b6b646c6d8013
BUG: 1433838
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/16879
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

afr: restore atime/mtime for non-regular files

2017-03-06T10:01:43+00:00

AFR restores atime/mtime only as a part of data heal. For non-regular
files (dirs, symlinks, char/block/socket files etc) which do not undergo
data-heal, atime/mtime is not restored.

This patch restores atime/mtime as a part of metadata heal for such
files.

Change-Id: Id8da885fc93fdf65c2f4bae2af3605b146ac1f16
BUG: 1429198
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/16844
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/afr: Perform new entry mark before creating new entry

2017-02-16T15:24:31+00:00

There is a chance for the source brick to go down just after
the new entry is created and before source brick is marked with
necessary pending markers. If after this any I/O happens then
new entry will become source and reverse heal will happen.
To prevent this mark the pending xattrs before creating the new
entry.

BUG: 1417466
Change-Id: I233b87e694d32e5d734df5a83b4d2ca711c17503
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/16474
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Krutika Dhananjay

afr: all children of AFR must be up to resolve s-brain

2017-02-10T01:37:00+00:00

Problem:
The various split-brain resolution policies (favorite-child-policy based,
CLI based and mount (get/setfattr) based) attempt to resolve split-brain
even when not all bricks of replica are up. This can be a problem when
say in a replica 3, the only good copy is down and the other 2 bricks
are up and blame each other (i.e. split-brain). We end up healing the
file in such a  case and allow I/O on it.

Fix:
A decision on whether the file is in split-brain or not must be taken
only if we are able to examine the afr xattrs of *all* bricks of a given
replica.

Change-Id: Icddb1268b380005799990f5379ef957d84639ef9
BUG: 1417522
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/16476
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

afr/cluster: Restore data-self-heal-window option

2017-02-08T07:53:43+00:00

Summary:
- Fixes a bug where data-self-heal-window was ignored and instead
  hard-coded to 128k
- Cherry-pick of D2752781

Test Plan:
- Prove tests

Reviewed By: sshreyas

Signed-off-by: Shreyas Siravara 

Change-Id: Ie38456ce9ad90921f7456fe02aaace88393433a9
BUG: 1404424
Reviewed-on-release-3.8-fb: http://review.gluster.org/16083
Tested-by: Shreyas Siravara 
Reviewed-by: Kevin Vigor 
Reviewed-on: https://review.gluster.org/16123
Reviewed-by: Jeff Darcy 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System