glusterfs.git/xlators/cluster, branch release-3.7

afr: Avoid resetting event_gen when brick is always down

2017-01-14T03:59:55+00:00

Backport of http://review.gluster.org/#/c/16309/

Problem:
__afr_set_in_flight_sb_status(), which resets event_gen to zero, is
called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i]
is true even if the brick was down *before* the transaction started.
Hence say if 1 brick is down in  a replica-3, every writev that comes
will trigger an inode refresh because of this resetting, as seen from
the no. of FSTATs in the profile info in the BZ.

Fix:
Reset event gen only if the brick was previously a valid read child and
the FOP failed on it the first time.

Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
the function only resets event gen and not the data/metadata readable.

Change-Id: I7840f7123d3b3e0404743988088ec349391ca980
BUG: 1412890
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16387
Smoke: Gluster Build System 
Reviewed-by: Krutika Dhananjay 
Reviewed-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
Tested-by: Pranith Kumar Karampuri 
CentOS-regression: Gluster Build System

dht/rename : Incase of failure remove linkto file properly

2017-01-10T04:54:03+00:00

Generally linkto file is created using root user. Consider following
case, a user is trying to rename a file which he is not permitted.
So the rename fails with EACESS and when rename tries to cleanup the
linkto file, it fails.

The above issue happens when rename/00.t test executed on nfs-ganesha
clients :
Steps executed in script
* create a file "abc" using root
* rename the file "abc" to "xyz" using a non root user, it fails with EACESS
* delete "abc"
* create directory "abc" using root
* again try ot rename "abc" to "xyz" using non root user, test hungs here
which slowly leds to OOM kill of ganesha process

RCA put forwarded by Du for OOM kill of ganesha
Note that when we hit this bug, we've a scenario of a dentry being
present as:
    * a linkto file on one subvol
    * a directory on rest of subvols

When a lookup happens on the dentry in such a scenario, the control flow
goes into an infinite loop of:

    dht_lookup_everywhere
    dht_lookup_everywhere_cbk
    dht_lookup_unlink_cbk
    dht_lookup_everywhere_done
    dht_lookup_directory (as local->dir_count > 0)
    dht_lookup_dir_cbk (sets to local->need_selfheal = 1 as the entry is a linkto file on one of the subvol)
    dht_lookup_everywhere (as need_selfheal = 1).

This infinite loop can cause increased consumption of memory due to:
1) dht_lookup_directory assigns a new layout to local->layout unconditionally
2)  Most of the functions in this loop do a stack_wind of various fops.

This results in growing of call stack (note that call-stack is destroyed only after lookup response is
received by fuse - which never happens in this case)

Thanks Du for root causing the oom kill and Sushant for suggesting the fix

Upstream reference :
>Change-Id: I1e16bc14aa685542afbd21188426ecb61fd2689d
>BUG: 1397052
>Signed-off-by: Jiffin Tony Thottan 
>Reviewed-on: http://review.gluster.org/15894
>NetBSD-regression: NetBSD Build System 
>CentOS-regression: Gluster Build System 
>Smoke: Gluster Build System 
>Reviewed-by: Raghavendra G 
>(cherry picked from commit 57d59f4be205ae0c7888758366dc0049bdcfe449)

Change-Id: I1e16bc14aa685542afbd21188426ecb61fd2689d
BUG: 1401032
Signed-off-by: Jiffin Tony Thottan 
Reviewed-on: http://review.gluster.org/16016
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

afr: use accused matrix instead of readable matrix for deciding heals

2016-12-28T09:15:27+00:00

Problem:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are are triggered. These heals open an fd, causing eager lock to be
disabled (open fd count >1) in afr transactions, leading to extra FXATTROPS

Fix:
Use the accused matrix (derived from interpreting the afr pending
xattrs) to decide whether we can start heal or not.


> Reviewed-on: http://review.gluster.org/16277
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Smoke: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 
> Tested-by: Pranith Kumar Karampuri 
(cherry picked from commit 5a7c86e578f5bbd793126a035c30e6b052177a9f)

Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
BUG: 1408820
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16299
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: When failing fop due to lack of quorum, also log error string

2016-11-11T12:23:39+00:00

        Backport of: http://review.gluster.org/#/c/15800

Change-Id: I1e73b4518bcf26196d6326065ad404f878e70bd4
BUG: 1393631
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/15814
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

features/shard: Fill loc.pargfid too for named lookups on individual shards

2016-11-09T06:39:14+00:00

        Backport of: http://review.gluster.org/#/c/15788/

On a sharded volume when a brick is replaced while IO is going on, named
lookup on individual shards as part of read/write was failing with
ENOENT on the replaced brick, and as a result AFR initiated name heal in
lookup callback. But since pargfid was empty (which is what this patch
attempts to fix), the resolution of the shards by protocol/server used
to fail and the following pattern of logs was seen:

Brick-logs:

[2016-11-08 07:41:49.387127] W [MSGID: 115009]
[server-resolve.c:566:server_resolve] 0-rep-server: no resolution type
for (null) (LOOKUP)
[2016-11-08 07:41:49.387157] E [MSGID: 115050]
[server-rpc-fops.c:156:server_lookup_cbk] 0-rep-server: 91833: LOOKUP(null)
(00000000-0000-0000-0000-000000000000/16d47463-ece5-4b33-9c93-470be918c0f6.82)
==> (Invalid argument) [Invalid argument]

Client-logs:
[2016-11-08 07:41:27.497687] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-0: remote
operation failed. Path: (null) (00000000-0000-0000-0000-000000000000)
[Invalid argument]
[2016-11-08 07:41:27.497755] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-1: remote
operation failed. Path: (null) (00000000-0000-0000-0000-000000000000)
[Invalid argument]
[2016-11-08 07:41:27.498500] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-2: remote
operation failed. Path: (null) (00000000-0000-0000-0000-000000000000)
[Invalid argument]
[2016-11-08 07:41:27.499680] E [MSGID: 133010]

Also, this patch makes AFR by itself choose a non-NULL pargfid even if
its ancestors fail to initialize all pargfid placeholders.

Change-Id: I34b9f90d0f09766b6d87b3994d5cd7a77b622dcb
BUG: 1392853
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/15797
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Ravishankar N 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System

afr,ec: Heal device files with correct major, minor numbers

2016-10-27T06:23:20+00:00

Thanks a lot to xiaoping.wu@nokia.com from Nokia for the bug and the
fix.

 >BUG: 1384297
 >Change-Id: Ie443237e85d34633b5dd30f85eaa2ac34e45754c
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/15728
 >Smoke: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >Reviewed-by: Xavier Hernandez 
 >CentOS-regression: Gluster Build System 

Change-Id: I28636a741592335cebcaa1abc2af8460ebc740e1
BUG: 1388949
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/15736
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Xavier Hernandez

afr: Take full locks in arbiter only for data transactions

2016-10-16T06:11:12+00:00

Problem:
Sharding exposed a bug in arbiter config. where `dd` throughput was
extremely slow. Shard xlator was sending a fxattrop to update the file
size immediately after a writev. Arbiter was incorrectly over-riding the
LLONGMAX-1 start offset (for metadata domain locks) for this fxattrop,
causing the inodelk to be taken on the data domain. And since the
preceeding writev hadn't released the lock (afr does a 'lazy'
unlock if write succeeds on all bricks), this degraded to a blocking
lock causing extra lock/unlock calls and delays.

Fix:
Modify flock.l_len and flock.l_start to take full locks only for data
transactions.

> Reviewed-on: http://review.gluster.org/15641
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 
(cherry picked from commit 3a97486d7f9d0db51abcb13dcd3bc9db935e3a60)

Change-Id: I906895da2f2d16813607e6c906cb4defb21d7c3b
BUG: 1385226
Signed-off-by: Ravishankar N 
Reported-by: Max Raba 
Reviewed-on: http://review.gluster.org/15649
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Use locks for opendir

2016-09-15T19:13:05+00:00

Problem:
In some cases we see that readdir keeps winding to the brick that doesn't have
any blocked locks i.e. first brick. This is leading to the client assuming that
there are no blocking locks on the inode so it won't give away the lock. Other
clients end up blocked on the lock as if the command hung.

Fix:
Proper way to fix this issue is to use infra present in
http://review.gluster.org/14736 This is a stop gap fix where we start taking
inodelks in opendir which goes to all the bricks, this will detect if there is
any contention.

cherry picked from commit f013335400d033a9677797377b90b968803135f4:

>BUG: 1346719
>Change-Id: I91109107a26f6535b945ac476338e9f21dc31eb9
>Signed-off-by: Pranith Kumar K 
>Reviewed-on: http://review.gluster.org/15309
>Smoke: Gluster Build System 
>CentOS-regression: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>Reviewed-by: Ashish Pandey 

Change-Id: I91109107a26f6535b945ac476338e9f21dc31eb9
BUG: 1373392
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/15406
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

dht: "replica.split-brain-status" attribute value is not correct

2016-09-13T04:49:55+00:00

Problem: In a distributed-replicate volume attribute
         "replica.split-brain-status" value does not display split-brain
           condition though directory is in split-brain.
         If directory is in split brain on mutiple replica-pairs
         it does not show full list of replica pairs.

Solution: Update the dht_aggregate code to aggregate the xattr
          value in this specific condition.

Fix:      1) function getChoices returns the choices from split-brain
             status string.
          2) function add_opt adding the choices to local buffer to
             store in dictionary
          3) For the key "replica.split-brain-status" function dht_aggregate
             call dht_aggregate_split_brain_xattr to prepare the list.

Test:     To verify the patch followed below steps
          1) Create a distributed replica volume and create mount point
          2) Stop heal daemon
          3) Touch file and directories on mount point
             mkdir test{1..5};touch tmp{1..5}
          4) Down brick process on one of the replica set
             pkill -9 glusterfsd
          5) Change permission of dir on mount point
             chmod 755 test{1..5}
          6) Restart brick process on node with force option
          7) kill brick process on other node in same replica set
          8) Change permission of dir again on mount point
             chmod 766 test{1..5}
          9) Reexecute same step from 4-9 on other replica set also
          10) After check heal status on server it will show dir's are
              in split brain on all replica sets
          11) After check the replica.split-brain-status attr on mount
              point it will show wrong status of split brain.
          12) After apply the patch the attribute shows correct value.

> Change-Id: Icdfd72005a4aa82337c342762775a3d1761bbe4a
> Signed-off-by: Mohit Agrawal 
> Reviewed-on: http://review.gluster.org/15201
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Raghavendra G 
> (cherry picked from commit c4e9ec653c946002ab6d4c71ee8e6df056438a04)

Change-Id: Ia5234e8a2291a7e8a7211c82368f4df1c99fa099
Backport of commit c4e9ec653c946002ab6d4c71ee8e6df056438a04
BUG: 1375099
Signed-off-by: Mohit Agrawal 
Reviewed-on: http://review.gluster.org/15466
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order

2016-08-22T10:05:08+00:00

        Backport of: http://review.gluster.org/15080

When the bricks are brought offline and then online in cyclic
order while writes are in progress on a file, thanks to inode
refresh in write txns, AFR will mostly fail the write attempt
when the only good copy is offline. However, there is still a
remote possibility that the file will run into split-brain if
the brick that has the lone good copy goes offline *after* the
inode refresh but *before* the write txn completes (I call it
in-flight split-brain in the patch for ease of reference),
requiring intervention from admin to resolve the split-brain
before the IO can resume normally on the file. To get around this,
the patch does the following things:
i) retains the dirty xattrs on the file
ii) avoids marking the last of the good copies as bad (or accused)
    in case it is the one to go down during the course of a write.
iii) fails that particular write with the appropriate errno.

This way, we still have one good copy left despite the split-brain situation
which when it is back online, will be chosen as source to do the heal.

Change-Id: I7c13c6ddd5b8fe88b0f2684e8ce5f4a9c3a24a08
BUG: 1367270
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/15222
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Oleksandr Natalenko 
Reviewed-by: Pranith Kumar Karampuri