glusterfs.git/xlators/cluster/dht/src, branch v3.11.0beta1

cluster/dht: Make rebalance throttle option tuned by number

2017-04-29T14:29:34+00:00

Current rebalance throttle options: lazy/normal/aggressive may not always be
sufficient for the purpose of throttling.  In our recent test, we observed for
certain setups, normal and aggressive modes behaved similarly consuming full
disk bandwidth. So in cases like this admin should be able to  tune it
down(or vice versa) depending on the need.

Along with old throttle configurations, thread counts are tuned based on number.
e.g. gluster v set vol-name cluster-rebal.throttle  5.

Admin can tune up/down between 0 and the number of cores available.

Note: For heterogenous servers, validation will fail on the old server if "number"
is given for throttle configuration.
The message looks something like this:
"volume set: failed: Staging failed on vm2. Error: cluster.rebal-throttle should be {lazy|normal|aggressive}"

Test: Manual test by logging active thread number after reconfiguring throttle option.
testcase: tests/basic/distribute/throttle-rebal.t

Change-Id: I46e3cde546900307831028b344ecf601fd9b02c3
BUG: 1438370
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16980
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Atin Mukherjee 
Reviewed-by: Raghavendra G

dht: send lookup on old name inside rename with bname and pargfid

2017-04-29T14:28:38+00:00

Inside rename, a lookup is done on the source name to make sure that
the file is there. But we used to do a gfid based lookup and hence,
even if the source name was renamed to a new name from some other client,
lookup will be successful as server3_3_lookup will fetch the new path
based on the gfid.

So even if the source file does not exist any more rename will carry on,
and as server3_3_link(destination is hashed to a different brick other
than source cached scenario) also does gfid based resolve, it wont
detect that the source name does not exist and hardlink creation will be
successful (since gfid based resolve will get the new dentry).

To solve this problem, do a name based lookup inside rename. So that
rename will fail right away if the source does not exist.

Change-Id: Ieba8bdd6675088dbf18de90ed4622df043d163bd
BUG: 1412135
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16375
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: N Balachandran 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/dht: rebalance perf enhancement

2017-04-29T14:28:19+00:00

Problem: Throttle settings "normal" and "aggressive" for rebalance
did not have performance difference.

normal mode spawns $(no. of cores - 4)/2 threads and aggressive
spawns $(no. of cores - 4) threads. Though aggressive mode has twice
the number of threads compared to that of normal mode, there was no
performance gain when switched to aggressive mode from normal mode.

RCA:
During the course of debugging the above problem, we tried assigning
migration job to migration threads spawned by rebalance, rather than
synctasks(as there is more overhead associated to manage the task
queue and threads). This gave us a significant improvement over rebalance
under synctasks. This patch does not really gurantee that there will be a
clear performance difference between normal and aggressive mode, but this
patch certainly maximized the disk utilization for 1GBfiles run.

Results:

Test enviroment:
Gluster Config:
Number of Bricks: 2 (one brick per disk(RAID-6 12 disk))
Bricks:
Brick1: server1:/brick/test1/1
Brick2: server2:/brick/test1/1
Options Reconfigured:
performance.readdir-ahead: on
server.event-threads: 4
client.event-threads: 4

1000 files with 1GB each were created/renamed such that all files will have
server1 as cached and server2 as hashed, so that all files will be migrated.

Test machines had 24 cores each.

Results  with/without synctask based migration:
-----------------------------------------------

mode                    normal(10threads)          aggressive(20threads)

timetaken               0:55:30 (h:m:s)            0:56:3 (h:m:s)
withsynctask

timetaken
with migrator           0:38:3 (h:m:s)             0:23:41 (h:m:s)
threads

From above table it can be seen that, there is a clear 2x perf gain between
rebalance with synctask vs rebalance with migrator threads.

Additionally this patch modifies the code so that caller will have the exact error
number returned by dht_migrate_file(earlier the errno meaning was overloaded). This
will help avoiding scenarios where migration failure due to ENOENT, can result in
rebalance abort/failure.

Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
BUG: 1420166
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16427
Smoke: Gluster Build System 
Reviewed-by: N Balachandran 
Reviewed-by: Raghavendra G 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/dht: Pass the correct xdata in fremovexattr fop

2017-04-28T04:27:14+00:00

Change-Id: Id84bc87e48f435573eba3b24d3fb3c411fd2445d
BUG: 1440051
Signed-off-by: Krutika Dhananjay 
Reviewed-on: https://review.gluster.org/17126
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Raghavendra G

cluster/dht Remove redundant logs in dht rmdir

2017-04-26T09:44:06+00:00

Removing redundant logs were introduced in
https://review.gluster.org/#/c/17065/

Change-Id: I0d6055488b51a13c91d2121e87f653cdb94888b0
BUG: 1445590
Signed-off-by: N Balachandran 
Reviewed-on: https://review.gluster.org/17118
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Susant Palai 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

feature/dht: Directory synchronization

2017-04-26T09:00:34+00:00

Design doc: https://review.gluster.org/16876

Directory creation is now synchronized with blocking inodelk of the
parent on the hashed subvolume followed by the entrylk on the hashed
subvolume between dht_mkdir, dht_rmdir, dht_rename_dir and lookup
selfheal mkdir.

To maintain internal consistency of directories across all subvols of
dht, we need locks. Specifically we are interested in:

 1. Consistency of layout of a directory. Only one writer should modify
    the layout at a time. A writer (layout setting during directory heal
    as part of lookup) shouldn't modify the layout while there are
    readers (all other fops like create, mkdir etc., which consume
    layout) and readers shouldn't read the layout while a writer is in
    progress. Readers can read the layout simultaneously. Writer takes
    a WRITE inodelk on the directory (whose layout is being modified)
    across ALL subvols. Reader takes a READ inodelk on the directory
    (whose layout is being read) on ANY subvol.

 2. Consistency of directory namespace across subvols. The path and
    associated gfid should be same on all subvols. A gfid should not be
    associated with more than one path on any subvol. All fops that can
    change directory names (mkdir, rmdir, renamedir, directory creation
    phase in lookup-heal) takes an entrylk on hashed subvol of the
    directory.

 NOTE1: In point 2 above, since dht takes entrylk on hashed subvol of a
        directory, the transaction itself is a consumer of layout on
        parent directory. So, the transaction is a reader of parent
        layout and does an inodelk on parent directory just like any
        other layout reader. So a mkdir (dir/subdir) would:

     > Acquire a READ inodelk on "dir" on any subvol.
     > Acquire an entrylk (dir, "subdir") on hashed subvol of "subdir".
     > creates directory on hashed subvol and possibly on non-hashed subvols.
     > UNLOCK (entrylk)
     > UNLOCK (inodelk)

 NOTE2: mkdir fop while setting the layout of the directory being created
        is considered as a reader, but NOT a writer. The reason is for
        a fop which can consume the layout of a directory to come either
        of the following conditions has to be true:

     > mkdir syscall from application has to complete. In this case no
       need of synchronization.
     > A lookup issued on the directory racing with mkdir has to complete.
       Since layout setting by a lookup is considered as a writer, only
       one of either mkdir or lookup will set the layout.

Code re-organization:
   All the lock related routines are moved to "dht-lock.c" file.
   New wrapper function is introduced to take blocking inodelk
   followed by entrylk 'dht_protect_namespace'

Updates #191
Change-Id: I01569094dfbe1852de6f586475be79c1ba965a31
Signed-off-by: Kotresh HR 
BUG: 1443373
Reviewed-on: https://review.gluster.org/15472
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System

cluster/dht: correct space check for rebalance

2017-04-25T16:28:19+00:00

With rebalance doing fallocate on destination, we don't need to
add file size to the "destination available space" to decide whether
to migrate the file or not.

Notes: Fallocate would have already occupied the file size space on
destination

Change-Id: If7f6a6654e6257726680cf20d618482a6e9095a6
BUG: 1441508
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17104
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: N Balachandran 
Reviewed-by: Raghavendra G

cluster/dht: Pass the req dict instead of NULL in dht_attr2()

2017-04-24T04:12:57+00:00

This bug was causing VMs to pause during rebalance. When qemu winds
down a STAT, shard fills the trusted.glusterfs.shard.file-size attribute
in the req dict which DHT doesn't wind its STAT fop with upon detecting
the file has undergone migration. As a result shard doesn't find the
value to this key in the unwind path, causing it to fail the STAT
with EINVAL.

Also, the same bug exists in other fops too, which is also fixed in
this patch.

Change-Id: Id7823fd932b4e5a9b8779ebb2b612a399c0ef5f0
BUG: 1440051
Signed-off-by: Krutika Dhananjay 
Reviewed-on: https://review.gluster.org/17085
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/dht: rm -rf fails if dir has stale linkto files

2017-04-21T05:49:41+00:00

rm -rf  fails with ENOENT if dir contains a lot of
stale linkto files. This is because a single
readdirp is sent as part of the rmdir which would return
and delete only as many linkto files on the bricks as would fit
in one readdirp buffer. Running rm -rf  multiple times
will eventually delete all the files. The fix sends readdirp
on each subvol until no more entries are returned.

Change-Id: I447f2d193de4bd8ac16e4541c6b919d22250e39e
BUG: 1442724
Signed-off-by: N Balachandran 
Reviewed-on: https://review.gluster.org/17065
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/dht: Skip file migration if the subvol that meets min-free-disk

2017-04-19T05:35:14+00:00

criteria happens to be the same subvol containing data-file

Rebalance need to figure out a new subvol in case the hashed subvol
does not have enough space. In the process of figuring out the new subvol,
we need to ignore the source subvol, otherwise it will lead to data loss.

Test: Manual
Ran the following
sizeof /tmp/1: 1.5GB
sizeof /brick/1: 16GB
sizeof /tmp/2: 1.5GB


glusterd;  gluster v create test1 vm1:/brick/1 vm1:/tmp/1;
gluster v start test1;
mount -t glusterfs vm1:test1 /mnt;
for i in {1..2000}
do
dd if=/dev/zero of=/mnt/file$i bs=1KB count=1 &> /dev/null;
done
gluster v add-brick test1 vm1:/tmp/2
gluster v set test1 min-free-disk 12GB
gluster v remove-brick test1 vm1:/tmp/1 star


file count and data were intact.

Change-Id: Ib8fc8467a3d48a7c12958824c4f0b88e160b86c1
BUG: 1441508
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17064
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G