<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/xlators/cluster, branch v3.12.11</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>cluster/dht: Fix dht_rename lock order</title>
<updated>2018-05-09T05:04:20+00:00</updated>
<author>
<name>N Balachandran</name>
<email>nbalacha@redhat.com</email>
</author>
<published>2018-04-17T10:07:05+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=fa2a7145ab2a620267dc1c191d7788bf4d61afaf'/>
<id>fa2a7145ab2a620267dc1c191d7788bf4d61afaf</id>
<content type='text'>
Fixed dht_order_rename_lock to use the same inodelk ordering
as that of the dht selfheal locks (dictionary order of
lock subvolumes).

Change-Id: Ia3f8353b33ea2fd3bc1ba7e8e777dda6c1d33e0d
BUG: 1570475
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fixed dht_order_rename_lock to use the same inodelk ordering
as that of the dht selfheal locks (dictionary order of
lock subvolumes).

Change-Id: Ia3f8353b33ea2fd3bc1ba7e8e777dda6c1d33e0d
BUG: 1570475
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;
</pre>
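A minimal, self-contained C sketch of the ordering idea (the struct and
subvolume names are hypothetical and only illustrate dictionary-order
locking, not the actual dht code):
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* One pending inodelk request (illustrative type only). */
struct lock_req {
    const char *subvol_name;
};

/* Dictionary (strcmp) order of the subvolume names. */
static int
cmp_by_subvol_name(const void *a, const void *b)
{
    const struct lock_req *l = a;
    const struct lock_req *r = b;
    return strcmp(l-&gt;subvol_name, r-&gt;subvol_name);
}

int
main(void)
{
    struct lock_req locks[] = {
        { "patchy-client-3" },
        { "patchy-client-1" },
    };
    size_t n = sizeof(locks) / sizeof(locks[0]);

    /* Sorting before locking gives every caller the same global lock
     * order, which is what prevents rename/selfheal deadlocks when two
     * operations lock the same pair of subvolumes. */
    qsort(locks, n, sizeof(locks[0]), cmp_by_subvol_name);

    for (size_t i = 0; i &lt; n; i++)
        printf("lock %zu: %s\n", i, locks[i].subvol_name);

    return 0;
}
</pre>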
</div>
</content>
</entry>
<entry>
<title>cluster/dht: Handle file migrations when brick down</title>
<updated>2018-04-18T13:24:30+00:00</updated>
<author>
<name>N Balachandran</name>
<email>nbalacha@redhat.com</email>
</author>
<published>2018-04-06T10:36:51+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=47e5082b1c58f48d4de38bd69071556cf9a4d8a3'/>
<id>47e5082b1c58f48d4de38bd69071556cf9a4d8a3</id>
<content type='text'>
The decision as to which node would migrate a file
was based on the gfid of the file. Files were divided
among the nodes for the replica/disperse set. However,
if a brick was down when rebalance started, the nodeuuids
would be saved as NULL and a set of files would not be migrated.

Now, if the nodeuuid is NULL, the first non-null entry in
the set is the node responsible for migrating the file.

Change-Id: I72554c107792c7d534e0f25640654b6f8417d373
fixes: bz#1566820
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;

(cherry picked from commit 1f0765242a689980265c472646c64473a92d94c0)

Change-Id: Id1a6e847b0191b6a40707bea789a2a35ea3d9f68
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The decision as to which node would migrate a file
was based on the gfid of the file. Files were divided
among the nodes for the replica/disperse set. However,
if a brick was down when rebalance started, the nodeuuids
would be saved as NULL and a set of files would not be migrated.

Now, if the nodeuuid is NULL, the first non-null entry in
the set is the node responsible for migrating the file.

Change-Id: I72554c107792c7d534e0f25640654b6f8417d373
fixes: bz#1566820
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;

(cherry picked from commit 1f0765242a689980265c472646c64473a92d94c0)

Change-Id: Id1a6e847b0191b6a40707bea789a2a35ea3d9f68
</pre>
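The fallback can be illustrated with a small standalone C sketch (the
helper and uuid strings are hypothetical; only the selection rule comes
from the change above):
<pre>
#include &lt;stdio.h&gt;

/* Pick the node uuid responsible for migrating a file. `uuids` holds
 * one uuid string per brick in the replica/disperse set; an entry is
 * NULL when that brick was down while the uuids were collected.
 * `idx` is the slot the file's gfid maps to. */
static const char *
responsible_node(const char **uuids, int count, int idx)
{
    if (uuids[idx] != NULL)
        return uuids[idx];

    /* Saved uuid is NULL: fall back to the first non-null entry so
     * the file is still migrated by exactly one node. */
    for (int i = 0; i &lt; count; i++) {
        if (uuids[i] != NULL)
            return uuids[i];
    }
    return NULL; /* the whole set is down */
}

int
main(void)
{
    const char *set[] = { NULL, "uuid-node-b", "uuid-node-c" };
    printf("responsible: %s\n", responsible_node(set, 3, 0));
    return 0;
}
</pre>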
</div>
</content>
</entry>
<entry>
<title>cluster/dht: Wind open to all subvols</title>
<updated>2018-04-18T13:23:51+00:00</updated>
<author>
<name>N Balachandran</name>
<email>nbalacha@redhat.com</email>
</author>
<published>2018-04-05T16:11:44+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=cd858f9c1789edf5e5fa02ddd906de1d89938980'/>
<id>cd858f9c1789edf5e5fa02ddd906de1d89938980</id>
<content type='text'>
dht_opendir should wind the open to all subvols
whether or not local-&gt;subvols is set. This is
because dht_readdirp winds the calls to all subvols.

Change-Id: I67a96b06dad14a08967c3721301e88555aa01017
updates: bz#1566820
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;
(cherry picked from commit c4251edec654b4e0127577e004923d9729bc323d)
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
dht_opendir should wind the open to all subvols
whether or not local-&gt;subvols is set. This is
because dht_readdirp winds the calls to all subvols.

Change-Id: I67a96b06dad14a08967c3721301e88555aa01017
updates: bz#1566820
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;
(cherry picked from commit c4251edec654b4e0127577e004923d9729bc323d)
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Fixing the flaws in arbiter becoming source patch</title>
<updated>2018-04-18T13:23:19+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2018-04-11T15:22:27+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=ed3924b81491a79a08503661f55ab90d70b0d578'/>
<id>ed3924b81491a79a08503661f55ab90d70b0d578</id>
<content type='text'>
Backport of https://review.gluster.org/19045

Problem:
Setting the write_subvol value to read_subvol in case of metadata
transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03)
might lead to the original problem of arbiter becoming source.

Scenario:
1) All bricks are up and good
2) 2 writes w1 and w2 are in progress in parallel
3) ctx-&gt;read_subvol is good for all the subvolumes
4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on
   the disk
5) read/lookup comes on the same file and refreshes read_subvols back
   to all good
6) a metadata transaction happens, which causes ctx-&gt;write_subvol to be
   assigned the value of ctx-&gt;read_subvol, which is all good
7) w2 succeeds on brick1 and fails on brick0, and this will update the
   bricks in the reverse order, leading to the arbiter becoming source

Fix:
Instead of setting ctx-&gt;write_subvol to ctx-&gt;read_subvol in the
pre-op stage when there is a metadata transaction, check in the
function __afr_set_in_flight_sb_status() whether it is a data or
metadata transaction. Use the value of ctx-&gt;write_subvol for data
transactions and the ctx-&gt;read_subvol value for other transactions.

With this patch we assign the value of ctx-&gt;write_subvol in the
afr_transaction_perform_fop() with the on disk value, instead of
assigning it in the afr_changelog_pre_op() with the in memory value.

Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4
BUG: 1566131
Signed-off-by: karthik-us &lt;ksubrahm@redhat.com&gt;
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Backport of https://review.gluster.org/19045

Problem:
Setting the write_subvol value to read_subvol in case of metadata
transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03)
might lead to the original problem of arbiter becoming source.

Scenario:
1) All bricks are up and good
2) 2 writes w1 and w2 are in progress in parallel
3) ctx-&gt;read_subvol is good for all the subvolumes
4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on
   the disk
5) read/lookup comes on the same file and refreshes read_subvols back
   to all good
6) a metadata transaction happens, which causes ctx-&gt;write_subvol to be
   assigned the value of ctx-&gt;read_subvol, which is all good
7) w2 succeeds on brick1 and fails on brick0, and this will update the
   bricks in the reverse order, leading to the arbiter becoming source

Fix:
Instead of setting ctx-&gt;write_subvol to ctx-&gt;read_subvol in the
pre-op stage when there is a metadata transaction, check in the
function __afr_set_in_flight_sb_status() whether it is a data or
metadata transaction. Use the value of ctx-&gt;write_subvol for data
transactions and the ctx-&gt;read_subvol value for other transactions.

With this patch we assign the value of ctx-&gt;write_subvol in the
afr_transaction_perform_fop() with the on disk value, instead of
assigning it in the afr_changelog_pre_op() with the in memory value.

Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4
BUG: 1566131
Signed-off-by: karthik-us &lt;ksubrahm@redhat.com&gt;
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</pre>
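A simplified, hypothetical C sketch of the selection rule (the real
__afr_set_in_flight_sb_status works on AFR's inode context; the types
here are invented for illustration):
<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

typedef enum { AFR_DATA_TRANSACTION, AFR_METADATA_TRANSACTION } afr_txn_t;

/* Illustrative stand-in for the inode context discussed above. */
struct inode_ctx {
    uint16_t write_subvol;  /* on-disk writable subvols for the data fop */
    uint16_t read_subvol;   /* last refreshed readable subvols */
};

/* Data transactions consult write_subvol; metadata and other
 * transactions keep using read_subvol, so a metadata pre-op can no
 * longer overwrite write_subvol with a freshly refreshed value. */
static uint16_t
subvols_for_in_flight_check(const struct inode_ctx *ctx, afr_txn_t type)
{
    if (type == AFR_DATA_TRANSACTION)
        return ctx-&gt;write_subvol;
    return ctx-&gt;read_subvol;
}

int
main(void)
{
    struct inode_ctx ctx = { .write_subvol = 0x2, .read_subvol = 0x3 };
    printf("data txn uses 0x%x, metadata txn uses 0x%x\n",
           (unsigned) subvols_for_in_flight_check(&amp;ctx, AFR_DATA_TRANSACTION),
           (unsigned) subvols_for_in_flight_check(&amp;ctx, AFR_METADATA_TRANSACTION));
    return 0;
}
</pre>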
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Fix for arbiter becoming source</title>
<updated>2018-04-18T13:23:19+00:00</updated>
<author>
<name>karthik-us</name>
<email>ksubrahm@redhat.com</email>
</author>
<published>2017-08-16T11:56:48+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=4f473ddf96834bf6ab9624969639fdcee1857826'/>
<id>4f473ddf96834bf6ab9624969639fdcee1857826</id>
<content type='text'>
Backport of https://review.gluster.org/#/c/18049/

Problem:
When eager-lock is on and two writes happen in parallel on an FD,
we were observing the following behaviour:
- First write fails on one data brick
- Since the post-op has not yet happened, the inode refresh will get
  both the data bricks as readable and set that in the inode context
- The in-flight split-brain check sees both the data bricks as readable
  and allows the second write
- Second write fails on the other data brick
- Now the post-op happens and marks both the data bricks as bad, and
  the arbiter will become the source for healing

Fix:
Add one more variable called write_subvol in the inode context; it
will hold the in-memory representation of the writable subvols. Inode
refresh will not update this value, and its lifetime is from pre-op
through unlock in the afr transaction. Initially the pre-op will set
this value to be the same as read_subvol in the inode context, and the
in-flight split-brain check will then use this value instead of
read_subvol. After all the checks we will update this value and set
read_subvol to the same, to avoid leaving an incorrect value there.

Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3
BUG: 1566131
Signed-off-by: karthik-us &lt;ksubrahm@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Backport of https://review.gluster.org/#/c/18049/

Problem:
When eager-lock is on and two writes happen in parallel on an FD,
we were observing the following behaviour:
- First write fails on one data brick
- Since the post-op has not yet happened, the inode refresh will get
  both the data bricks as readable and set that in the inode context
- The in-flight split-brain check sees both the data bricks as readable
  and allows the second write
- Second write fails on the other data brick
- Now the post-op happens and marks both the data bricks as bad, and
  the arbiter will become the source for healing

Fix:
Add one more variable called write_subvol in the inode context; it
will hold the in-memory representation of the writable subvols. Inode
refresh will not update this value, and its lifetime is from pre-op
through unlock in the afr transaction. Initially the pre-op will set
this value to be the same as read_subvol in the inode context, and the
in-flight split-brain check will then use this value instead of
read_subvol. After all the checks we will update this value and set
read_subvol to the same, to avoid leaving an incorrect value there.

Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3
BUG: 1566131
Signed-off-by: karthik-us &lt;ksubrahm@redhat.com&gt;
</pre>
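A hypothetical, simplified C sketch of the write_subvol lifetime
described above (bitmask values and helper names are invented; only the
pre-op/check/unlock flow mirrors the description):
<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

struct inode_ctx {
    uint16_t read_subvol;   /* refreshed by lookups/reads */
    uint16_t write_subvol;  /* valid only from pre-op until unlock */
};

static void pre_op(struct inode_ctx *ctx)     { ctx-&gt;write_subvol = ctx-&gt;read_subvol; }
static void txn_unlock(struct inode_ctx *ctx) { ctx-&gt;write_subvol = 0; }

/* The in-flight split-brain check consults write_subvol, not read_subvol. */
static int
write_allowed(const struct inode_ctx *ctx, uint16_t brick_bit)
{
    return (ctx-&gt;write_subvol &amp; brick_bit) != 0;
}

static void
mark_failed(struct inode_ctx *ctx, uint16_t brick_bit)
{
    ctx-&gt;write_subvol &amp;= (uint16_t) ~brick_bit;  /* failed brick no longer writable */
}

int
main(void)
{
    struct inode_ctx ctx = { .read_subvol = 0x3, .write_subvol = 0 };

    pre_op(&amp;ctx);                /* write_subvol seeded from read_subvol */
    mark_failed(&amp;ctx, 0x1);      /* first write fails on brick0 */
    ctx.read_subvol = 0x3;       /* inode refresh: everything looks good again */

    /* Because the check uses write_subvol, brick0 stays unwritable and
     * the second write cannot succeed in the reverse order. */
    printf("second write allowed on brick0: %d\n", write_allowed(&amp;ctx, 0x1));
    printf("second write allowed on brick1: %d\n", write_allowed(&amp;ctx, 0x2));

    txn_unlock(&amp;ctx);
    return 0;
}
</pre>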
</div>
</content>
</entry>
<entry>
<title>cluster/dht:  Skipped files are not treated as errors</title>
<updated>2018-04-06T12:50:35+00:00</updated>
<author>
<name>N Balachandran</name>
<email>nbalacha@redhat.com</email>
</author>
<published>2018-03-14T04:35:06+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=9af1915e6135d5f699172f838342795b3b9d775d'/>
<id>9af1915e6135d5f699172f838342795b3b9d775d</id>
<content type='text'>
For skipped files, use a return value of 1 to prevent
error messages being logged.

&gt; Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861
&gt; BUG: 1553598
&gt; Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;

Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861
BUG: 1555161
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
For skipped files, use a return value of 1 to prevent
error messages being logged.

&gt; Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861
&gt; BUG: 1553598
&gt; Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;

Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861
BUG: 1555161
Signed-off-by: N Balachandran &lt;nbalacha@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Prevent ping-event handling on shd</title>
<updated>2018-04-06T12:50:03+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2018-03-29T15:58:33+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=14b7a7b63c970b735c4f7e8be9e6f31dd148c6f9'/>
<id>14b7a7b63c970b735c4f7e8be9e6f31dd148c6f9</id>
<content type='text'>
On the shd, we shouldn't treat any brick as down based
on latency; otherwise self-heal will never happen

fixes: 1562723
Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7
BUG: 1562723
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On the shd, we shouldn't treat any brick as down based
on latency; otherwise self-heal will never happen

fixes: 1562723
Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7
BUG: 1562723
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</pre>
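An illustrative standalone C sketch of the behaviour (the struct and
threshold are hypothetical; the point is only that the latency path is
skipped when the process is the self-heal daemon):
<pre>
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

struct child_state {
    bool   is_shd;           /* running inside the self-heal daemon? */
    bool   up;
    double max_latency_ms;   /* latency-based demotion threshold */
};

static void
on_ping_event(struct child_state *child, double latency_ms)
{
    if (child-&gt;is_shd)
        return;   /* shd must keep every responding brick usable for heals */

    /* Clients may demote a child that answers too slowly. */
    if (latency_ms &gt; child-&gt;max_latency_ms)
        child-&gt;up = false;
}

int
main(void)
{
    struct child_state c = { .is_shd = true, .up = true, .max_latency_ms = 5.0 };
    on_ping_event(&amp;c, 50.0);
    printf("brick still usable for shd: %d\n", c.up);
    return 0;
}
</pre>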
</div>
</content>
</entry>
<entry>
<title>cluster/ec: send list-node-uuids request to all subvolumes</title>
<updated>2018-04-06T12:49:33+00:00</updated>
<author>
<name>Xavi Hernandez</name>
<email>xhernandez@redhat.com</email>
</author>
<published>2018-03-28T09:34:49+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=3cd3e338ea4586b772d8be0e2b798e7fd36539b0'/>
<id>3cd3e338ea4586b772d8be0e2b798e7fd36539b0</id>
<content type='text'>
The xattr trusted.glusterfs.list-node-uuids was only sent to a single
subvolume. This was returning null uuids from the other subvolumes as
if they were down.

This fix forces that xattr to be requested from all subvolumes.

Backport of:
&gt; BUG: 1561406

Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f
BUG: 1561731
Signed-off-by: Xavi Hernandez &lt;xhernandez@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The xattr trusted.glusterfs.list-node-uuids was only sent to a single
subvolume. This was returning null uuids from the other subvolumes as
if they were down.

This fix forces that xattr to be requested from all subvolumes.

Backport of:
&gt; BUG: 1561406

Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f
BUG: 1561731
Signed-off-by: Xavi Hernandez &lt;xhernandez@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/ec: Change default read policy to gfid-hash</title>
<updated>2018-04-06T12:49:09+00:00</updated>
<author>
<name>Ashish Pandey</name>
<email>aspandey@redhat.com</email>
</author>
<published>2018-03-13T08:33:20+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=7061fc7e37847d0674059da0e62f42bd7032707f'/>
<id>7061fc7e37847d0674059da0e62f42bd7032707f</id>
<content type='text'>
Problem:
Whenever we read data from a file over NFS, NFS reads
more data than requested and caches it. Based on the
stat information it checks whether the cached/pre-read
data is still valid.

Consider a 4 + 2 EC volume with all the bricks on
different nodes.

In EC, with the round-robin read policy, reads are sent to
different sets of data bricks. This balances the read
fops across all the bricks and avoids overloading the
same set of bricks.

Due to small differences in clock speed, it is possible
that we get minor differences in atime, mtime or ctime
across bricks. That might cause a different stat to be
returned to NFS, based on which NFS will discard
cached/pre-read data that has not actually changed and
could still be used.

Solution:
Change the default read policy for EC to gfid-hash. That
will force all reads to go to the same set of bricks.

&gt;Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
&gt;BUG: 1554743
&gt;Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1558352
Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
Whenever we read data from a file over NFS, NFS reads
more data than requested and caches it. Based on the
stat information it checks whether the cached/pre-read
data is still valid.

Consider a 4 + 2 EC volume with all the bricks on
different nodes.

In EC, with the round-robin read policy, reads are sent to
different sets of data bricks. This balances the read
fops across all the bricks and avoids overloading the
same set of bricks.

Due to small differences in clock speed, it is possible
that we get minor differences in atime, mtime or ctime
across bricks. That might cause a different stat to be
returned to NFS, based on which NFS will discard
cached/pre-read data that has not actually changed and
could still be used.

Solution:
Change the default read policy for EC to gfid-hash. That
will force all reads to go to the same set of bricks.

&gt;Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
&gt;BUG: 1554743
&gt;Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1558352
Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;
</pre>
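The difference between the two policies can be shown with a small,
self-contained C sketch (the hash and brick count are illustrative;
gfid-hash only needs some stable function of the gfid):
<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define EC_DATA_BRICKS 4   /* data bricks in a 4 + 2 volume */

/* Round-robin: each read starts on the next set of data bricks. */
static unsigned
read_set_round_robin(unsigned *counter)
{
    return (*counter)++ % EC_DATA_BRICKS;
}

/* gfid-hash: the set is derived from the file's gfid, so every read
 * of the same file hits the same bricks and NFS keeps seeing
 * consistent time attributes in the stat replies. */
static unsigned
read_set_gfid_hash(const uint8_t gfid[16])
{
    unsigned h = 0;
    for (int i = 0; i &lt; 16; i++)
        h = h * 31 + gfid[i];
    return h % EC_DATA_BRICKS;
}

int
main(void)
{
    const uint8_t gfid[16] = { 0xab, 0x01 };   /* remaining bytes zero */
    unsigned rr = 0;

    unsigned a = read_set_round_robin(&amp;rr);
    unsigned b = read_set_round_robin(&amp;rr);
    unsigned c = read_set_round_robin(&amp;rr);
    printf("round-robin: %u %u %u\n", a, b, c);
    printf("gfid-hash:   %u %u %u\n", read_set_gfid_hash(gfid),
           read_set_gfid_hash(gfid), read_set_gfid_hash(gfid));
    return 0;
}
</pre>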
</div>
</content>
</entry>
<entry>
<title>cluster/ec: avoid delays in self-heal</title>
<updated>2018-04-06T12:48:38+00:00</updated>
<author>
<name>Xavi Hernandez</name>
<email>jahernan@redhat.com</email>
</author>
<published>2018-02-21T16:47:37+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=3487b309feb864d3bb8458357f7d72cbaee8b500'/>
<id>3487b309feb864d3bb8458357f7d72cbaee8b500</id>
<content type='text'>
Self-heal creates a thread per brick to sweep the index looking for
files that need to be healed. These threads are started before the
volume comes online, so nothing is done but waiting for the next
sweep. This happens once per minute.

When a replace brick command is executed, the new graph is loaded and
all index sweeper threads started. When all bricks have reported, a
getxattr request is sent to the root directory of the volume. This
causes a heal on it (because the new brick doesn't have good data),
and marks its contents as pending to be healed. This is done by the
index sweeper thread on the next round, one minute later.

This patch solves this problem by waking all index sweeper threads
after a successful check on the root directory.

Additionally, the index sweep thread scans the index directory
sequentially, but it might happen that after healing a directory entry
more index entries are created but skipped by the current directory
scan. This causes the remaining entries to be processed on the next
round, one minute later. The same can happen in the next round, so
the heal runs in bursts and takes a long time to finish, especially
on volumes with many directory levels.

This patch solves this problem by immediately restarting the index
sweep if a directory has been healed.

Backport of:
&gt; BUG: 1547662

Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e
BUG: 1555201
Signed-off-by: Xavi Hernandez &lt;jahernan@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Self-heal creates a thread per brick to sweep the index looking for
files that need to be healed. These threads are started before the
volume comes online, so nothing is done but waiting for the next
sweep. This happens once per minute.

When a replace brick command is executed, the new graph is loaded and
all index sweeper threads started. When all bricks have reported, a
getxattr request is sent to the root directory of the volume. This
causes a heal on it (because the new brick doesn't have good data),
and marks its contents as pending to be healed. This is done by the
index sweeper thread on the next round, one minute later.

This patch solves this problem by waking all index sweeper threads
after a successful check on the root directory.

Additionally, the index sweep thread scans the index directory
sequentially, but it might happen that after healing a directory entry
more index entries are created but skipped by the current directory
scan. This causes the remaining entries to be processed on the next
round, one minute later. The same can happen in the next round, so
the heal runs in bursts and takes a long time to finish, especially
on volumes with many directory levels.

This patch solves this problem by immediately restarting the index
sweep if a directory has been healed.

Backport of:
&gt; BUG: 1547662

Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e
BUG: 1555201
Signed-off-by: Xavi Hernandez &lt;jahernan@redhat.com&gt;
</pre>
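A hypothetical pthread sketch of the wake-up mechanism (the 60-second
interval comes from the description above; all names and structure are
invented for illustration, not taken from the ec code):
<pre>
#include &lt;pthread.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;
#include &lt;unistd.h&gt;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wakeup = PTHREAD_COND_INITIALIZER;
static bool need_sweep = false;
static bool stop_sweeping = false;

/* Per-brick index sweeper: normally sweeps once per minute, but a
 * broadcast makes it rescan the index immediately. */
static void *
index_sweeper(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&amp;lock);
    while (!stop_sweeping) {
        struct timespec until;
        clock_gettime(CLOCK_REALTIME, &amp;until);
        until.tv_sec += 60;                      /* periodic sweep interval */
        while (!need_sweep &amp;&amp; !stop_sweeping) {
            if (pthread_cond_timedwait(&amp;wakeup, &amp;lock, &amp;until) != 0)
                break;                           /* timed out: periodic sweep */
        }
        need_sweep = false;
        pthread_mutex_unlock(&amp;lock);
        printf("sweeping index, healing pending entries...\n");
        pthread_mutex_lock(&amp;lock);
    }
    pthread_mutex_unlock(&amp;lock);
    return NULL;
}

/* Called after the root-directory check succeeds, or after a directory
 * entry was healed, so new index entries are picked up without waiting
 * for the next one-minute round. */
static void
wake_sweepers(void)
{
    pthread_mutex_lock(&amp;lock);
    need_sweep = true;
    pthread_cond_broadcast(&amp;wakeup);
    pthread_mutex_unlock(&amp;lock);
}

int
main(void)
{
    pthread_t t;
    pthread_create(&amp;t, NULL, index_sweeper, NULL);
    sleep(1);
    wake_sweepers();
    sleep(1);
    pthread_mutex_lock(&amp;lock);
    stop_sweeping = true;
    pthread_cond_broadcast(&amp;wakeup);
    pthread_mutex_unlock(&amp;lock);
    pthread_join(t, NULL);
    return 0;
}
</pre>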
</div>
</content>
</entry>
</feed>
