glusterfs.git/xlators/cluster/afr/src, branch v3.10.0rc1

cluster/afr: Perform new entry mark before creating new entry

2017-02-16T20:27:07+00:00

There is a chance for the source brick to go down just after
the new entry is created and before source brick is marked with
necessary pending markers. If after this any I/O happens then
new entry will become source and reverse heal will happen.
To prevent this mark the pending xattrs before creating the new
entry.

 >BUG: 1417466
 >Change-Id: I233b87e694d32e5d734df5a83b4d2ca711c17503
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: https://review.gluster.org/16474
 >Smoke: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >CentOS-regression: Gluster Build System 
 >Reviewed-by: Ravishankar N 
 >Reviewed-by: Krutika Dhananjay 

BUG: 1422942
Change-Id: I460caaa2c6571c4256e50421e48857e3c2769b4d
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/16646
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

afr: all children of AFR must be up to resolve s-brain

2017-02-15T12:31:45+00:00

Problem:
The various split-brain resolution policies (favorite-child-policy based,
CLI based and mount (get/setfattr) based) attempt to resolve split-brain
even when not all bricks of replica are up. This can be a problem when
say in a replica 3, the only good copy is down and the other 2 bricks
are up and blame each other (i.e. split-brain). We end up healing the
file in such a  case and allow I/O on it.

Fix:
A decision on whether the file is in split-brain or not must be taken
only if we are able to examine the afr xattrs of *all* bricks of a given
replica.

Signed-off-by: Ravishankar N 
> Reviewed-on: https://review.gluster.org/16476
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 

(cherry picked from commit 0e03336a9362e5717e561f76b0c543e5a197b31b)
Change-Id: Icddb1268b380005799990f5379ef957d84639ef9
BUG: 1420982
Reviewed-on: https://review.gluster.org/16587
Tested-by: Ravishankar N 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

core: run many bricks within one glusterfsd process

2017-02-02T00:54:58+00:00

This patch adds support for multiple brick translator stacks running in
a single brick server process.  This reduces our per-brick memory usage
by approximately 3x, and our appetite for TCP ports even more.  It also
creates potential to avoid process/thread thrashing, and to improve QoS
by scheduling more carefully across the bricks, but realizing that
potential will require further work.

Multiplexing is controlled by the "cluster.brick-multiplex" global
option.  By default it's off, and bricks are started in separate
processes as before.  If multiplexing is enabled, then *compatible*
bricks (mostly those with the same transport options) will be started in
the same process.

Backport of:
> Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
> BUG: 1385758
> Reviewed-on: https://review.gluster.org/14763

Change-Id: I4bce9080f6c93d50171823298fdf920258317ee8
BUG: 1418091
Signed-off-by: Jeff Darcy 
Reviewed-on: https://review.gluster.org/16496
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

afr: Avoid resetting event_gen when brick is always down

2017-01-13T02:54:02+00:00

Problem:
__afr_set_in_flight_sb_status(), which resets event_gen to zero, is
called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i]
is true even if the brick was down *before* the transaction started.
Hence say if 1 brick is down in  a replica-3, every writev that comes
will trigger an inode refresh because of this resetting, as seen from
the no. of FSTATs in the profile info in the BZ.

Fix:
Reset event gen only if the brick was previously a valid read child and
the FOP failed on it the first time.

Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
the function only resets event gen and not the data/metadata readable.

Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91
BUG: 1409206
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16309
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/afr: Do not log of split-brain when there isn't one

2017-01-12T06:42:02+00:00

* Even on errors like ENOENT, AFR logs split-brain after
  read-txn refresh, introduced by commit a07ddd8f.
  This can be a cause of much panic and confusion and needs to be fixed.

* Also fixed this issue in write-txns.

* Fixed afr read txns to log about split-brain only after knowing that
  there is no split-brain choice configured.

* Removed code duplication

* Fixed incorrect passing of error code in afr_write_txn_refresh_done()
  (the function was passing -0 as errno to gf_msg().

Change-Id: I354f454ce5bf0e5f00bc27916eb597367cb7d927
BUG: 1411625
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16362
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Pranith Kumar Karampuri

ec: Invalidations in disperse volume should not update the stat

2017-01-06T05:12:18+00:00

Issue:
In disperse volume, the file is present across bricks, hence the stat
from one brick doesn't carry the valid size of the file. Therefore
the upcall from one brick updating the md-cache results in wrong size
being updated.

Fix:
If the notification is cache invalidation then, indicate md-cache that
the attributes is invalid.

BUG: 1410375
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/16329
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Xavier Hernandez 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Fix missing name indices due to EEXIST error

2016-12-27T11:53:04+00:00

PROBLEM:
Consider a volume with  granular-entry-heal and sharding enabled. When
a replica is down and a shard is created as part of a write, the name
index is correctly created under indices/entry-changes/.
Now when a read on the same region triggers another MKNOD, the fop
fails on the online bricks with EEXIST. By virtue of this being a
symmetric error, the failed_subvols[] array is reset to all zeroes.
Because of this, before post-op, the GF_XATTROP_ENTRY_OUT_KEY will be
set, causing the name index, which was created in the previous MKNOD
operation, to be wrongly deleted in THIS MKNOD operation.

FIX:
The ideal fix would have been for a transaction to delete the name
index ONLY if it knows it is the one that created the index in the first
place. This would involve gathering information as to whether THIS xattrop
created the index from individual bricks, aggregating their responses and
based on the various posisble combinations of responses, decide whether to
delete the index or not. This is rather complex. Simpler fix would be
for post-op to examine local->op_ret in the event of no failed_subvols
to figure out whether to delete the name index or not. This can occasionally
lead to creation of stale name indices but they won't be affecting the IO path
or mess with pending changelogs in any way and self-heal in its crawl of
"entry-changes" directory would take care to delete such indices.

Change-Id: Ic1b5257f4dc9c20cb740a866b9598cf785a1affa
BUG: 1408712
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16286
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

afr: use accused matrix instead of readable matrix for deciding heals

2016-12-27T06:34:02+00:00

Problem:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are are triggered. These heals open an fd, causing eager lock to be
disabled (open fd count >1) in afr transactions, leading to extra FXATTROPS

Fix:
Use the accused matrix (derived from interpreting the afr pending
xattrs) to decide whether we can start heal or not.

Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
BUG: 1408395
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16277
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri

afr: Ignore event_generation checks post inode refresh for write txns

2016-12-22T11:06:31+00:00

Before http://review.gluster.org/#/c/15673/, after inode refresh, we
failed read txns in case of EIO or event_generation being zero. For
write transactions, the check was only for EIO. 15673 re-factored the
code to fail both read and write when event_generation=0. This seems to
have caused a regression as explained in the BZ.

This patch restores that behaviour in afr_txn_refresh_done().

Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744
BUG: 1406224
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16205
Reviewed-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System

cluster/afr: Fix per-txn optimistic changelog initialisation

2016-12-12T16:38:49+00:00

Incorrect initialisation of local->optimistic_change_log was leading
to skipped pre-op and post-op even when a brick didn't participate in
the txn because it was down.
The result - missing granular name index resulting in some entries
never getting healed.

FIX:
Initialise local->optimistic_change_log just before pre-op.

Also fixed granular entry heal to create the granular name index in
pre-op as opposed to post-op. This is to prevent loss of granular
information when during an entry txn, the good (src) brick goes
offline before the post-op is done. This would cause self-heal to
do conservative merge (since dirty xattr is the only information
available), which when granular-entry-heal is enabled, expects
granular indices, the lack of which can lead to loss of data in
the worst case.

Change-Id: Ia3ad716d6fb1821555f02180e86e8711a79f958d
BUG: 1402730
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16075
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System