glusterfs.git/xlators/cluster/dht/src/dht-common.h, branch v3.11.0beta1

cluster/dht: Make rebalance throttle option tuned by number

2017-04-29T14:29:34+00:00

Current rebalance throttle options: lazy/normal/aggressive may not always be
sufficient for the purpose of throttling.  In our recent test, we observed for
certain setups, normal and aggressive modes behaved similarly consuming full
disk bandwidth. So in cases like this admin should be able to  tune it
down(or vice versa) depending on the need.

Along with old throttle configurations, thread counts are tuned based on number.
e.g. gluster v set vol-name cluster-rebal.throttle  5.

Admin can tune up/down between 0 and the number of cores available.

Note: For heterogenous servers, validation will fail on the old server if "number"
is given for throttle configuration.
The message looks something like this:
"volume set: failed: Staging failed on vm2. Error: cluster.rebal-throttle should be {lazy|normal|aggressive}"

Test: Manual test by logging active thread number after reconfiguring throttle option.
testcase: tests/basic/distribute/throttle-rebal.t

Change-Id: I46e3cde546900307831028b344ecf601fd9b02c3
BUG: 1438370
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16980
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Atin Mukherjee 
Reviewed-by: Raghavendra G

cluster/dht: rebalance perf enhancement

2017-04-29T14:28:19+00:00

Problem: Throttle settings "normal" and "aggressive" for rebalance
did not have performance difference.

normal mode spawns $(no. of cores - 4)/2 threads and aggressive
spawns $(no. of cores - 4) threads. Though aggressive mode has twice
the number of threads compared to that of normal mode, there was no
performance gain when switched to aggressive mode from normal mode.

RCA:
During the course of debugging the above problem, we tried assigning
migration job to migration threads spawned by rebalance, rather than
synctasks(as there is more overhead associated to manage the task
queue and threads). This gave us a significant improvement over rebalance
under synctasks. This patch does not really gurantee that there will be a
clear performance difference between normal and aggressive mode, but this
patch certainly maximized the disk utilization for 1GBfiles run.

Results:

Test enviroment:
Gluster Config:
Number of Bricks: 2 (one brick per disk(RAID-6 12 disk))
Bricks:
Brick1: server1:/brick/test1/1
Brick2: server2:/brick/test1/1
Options Reconfigured:
performance.readdir-ahead: on
server.event-threads: 4
client.event-threads: 4

1000 files with 1GB each were created/renamed such that all files will have
server1 as cached and server2 as hashed, so that all files will be migrated.

Test machines had 24 cores each.

Results  with/without synctask based migration:
-----------------------------------------------

mode                    normal(10threads)          aggressive(20threads)

timetaken               0:55:30 (h:m:s)            0:56:3 (h:m:s)
withsynctask

timetaken
with migrator           0:38:3 (h:m:s)             0:23:41 (h:m:s)
threads

From above table it can be seen that, there is a clear 2x perf gain between
rebalance with synctask vs rebalance with migrator threads.

Additionally this patch modifies the code so that caller will have the exact error
number returned by dht_migrate_file(earlier the errno meaning was overloaded). This
will help avoiding scenarios where migration failure due to ENOENT, can result in
rebalance abort/failure.

Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
BUG: 1420166
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16427
Smoke: Gluster Build System 
Reviewed-by: N Balachandran 
Reviewed-by: Raghavendra G 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

feature/dht: Directory synchronization

2017-04-26T09:00:34+00:00

Design doc: https://review.gluster.org/16876

Directory creation is now synchronized with blocking inodelk of the
parent on the hashed subvolume followed by the entrylk on the hashed
subvolume between dht_mkdir, dht_rmdir, dht_rename_dir and lookup
selfheal mkdir.

To maintain internal consistency of directories across all subvols of
dht, we need locks. Specifically we are interested in:

 1. Consistency of layout of a directory. Only one writer should modify
    the layout at a time. A writer (layout setting during directory heal
    as part of lookup) shouldn't modify the layout while there are
    readers (all other fops like create, mkdir etc., which consume
    layout) and readers shouldn't read the layout while a writer is in
    progress. Readers can read the layout simultaneously. Writer takes
    a WRITE inodelk on the directory (whose layout is being modified)
    across ALL subvols. Reader takes a READ inodelk on the directory
    (whose layout is being read) on ANY subvol.

 2. Consistency of directory namespace across subvols. The path and
    associated gfid should be same on all subvols. A gfid should not be
    associated with more than one path on any subvol. All fops that can
    change directory names (mkdir, rmdir, renamedir, directory creation
    phase in lookup-heal) takes an entrylk on hashed subvol of the
    directory.

 NOTE1: In point 2 above, since dht takes entrylk on hashed subvol of a
        directory, the transaction itself is a consumer of layout on
        parent directory. So, the transaction is a reader of parent
        layout and does an inodelk on parent directory just like any
        other layout reader. So a mkdir (dir/subdir) would:

     > Acquire a READ inodelk on "dir" on any subvol.
     > Acquire an entrylk (dir, "subdir") on hashed subvol of "subdir".
     > creates directory on hashed subvol and possibly on non-hashed subvols.
     > UNLOCK (entrylk)
     > UNLOCK (inodelk)

 NOTE2: mkdir fop while setting the layout of the directory being created
        is considered as a reader, but NOT a writer. The reason is for
        a fop which can consume the layout of a directory to come either
        of the following conditions has to be true:

     > mkdir syscall from application has to complete. In this case no
       need of synchronization.
     > A lookup issued on the directory racing with mkdir has to complete.
       Since layout setting by a lookup is considered as a writer, only
       one of either mkdir or lookup will set the layout.

Code re-organization:
   All the lock related routines are moved to "dht-lock.c" file.
   New wrapper function is introduced to take blocking inodelk
   followed by entrylk 'dht_protect_namespace'

Updates #191
Change-Id: I01569094dfbe1852de6f586475be79c1ba965a31
Signed-off-by: Kotresh HR 
BUG: 1443373
Reviewed-on: https://review.gluster.org/15472
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System

cluster/dht: Skip file migration if the subvol that meets min-free-disk

2017-04-19T05:35:14+00:00

criteria happens to be the same subvol containing data-file

Rebalance need to figure out a new subvol in case the hashed subvol
does not have enough space. In the process of figuring out the new subvol,
we need to ignore the source subvol, otherwise it will lead to data loss.

Test: Manual
Ran the following
sizeof /tmp/1: 1.5GB
sizeof /brick/1: 16GB
sizeof /tmp/2: 1.5GB


glusterd;  gluster v create test1 vm1:/brick/1 vm1:/tmp/1;
gluster v start test1;
mount -t glusterfs vm1:test1 /mnt;
for i in {1..2000}
do
dd if=/dev/zero of=/mnt/file$i bs=1KB count=1 &> /dev/null;
done
gluster v add-brick test1 vm1:/tmp/2
gluster v set test1 min-free-disk 12GB
gluster v remove-brick test1 vm1:/tmp/1 star


file count and data were intact.

Change-Id: Ib8fc8467a3d48a7c12958824c4f0b88e160b86c1
BUG: 1441508
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17064
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/dht: Make rebalance honor min-free-disk

2017-04-13T09:57:42+00:00

test:  Manual

created files of size 1K on 2 brick(of size 1GB) setup .
added a brick of size 16GB.
set min-free-disk to 12GB(so that first two bricks won't receive any files).
removed one of the 1st brick of size 1GB.

Logs from test:
[2017-04-12 08:52:08.196484] W [MSGID: 0] [dht-rebalance.c:895:__dht_check_free_space]
 0-test1-dht: Write will cross min-free-disk for file - /tile32 on subvol - test1-client-1.
Looking for new subvol.

[2017-04-12 08:52:08.196904] I [MSGID: 0] [dht-rebalance.c:925:__dht_check_free_space]
0-test1-dht: new target found - test1-client-2 for file - /tile32

 - Post migration we have two files. The new destination (/brick/1) has the data file
[root@vm1 ~]# ll /brick/1/tile32
-rw-r--r--. 2 root root 0 Apr 12 14:22 /brick/1/tile32

 - On the old target the linkto file is there with linkto xattr pointing to /brick/1
[root@vm1 ~]# ll /tmp/2/tile32
---------T. 2 root root 1000 Apr 12 14:22 /tmp/2/tile32
[root@vm1 ~]# getfattr -m . -de text /tmp/2/tile32
getfattr: Removing leading '/' from absolute path names
security.selinux="unconfined_u:object_r:user_tmp_t:s0"
trusted.gfid="����:Aс�#�/'b2"
trusted.glusterfs.dht.linkto="test1-client-2"

Marking ./tests/features/worm_sh.t as bad test.
Reason being, this patch failed on master branch as well and it has nothing
to do with rebalance/remove-brick.

BUG: 1441508
Change-Id: I90bae251cda3d957a49cdceda90cd08311a392fb
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17034
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Raghavendra G 
CentOS-regression: Gluster Build System

dnt: Add a GF_DEFRAG_CMD_NONE to gf_defrag_type

2017-02-26T16:31:25+00:00

Coverity complain about enum mismatch since we assign GF_OP_CMD_NONE
to a variable holding gf_defrag_type.

Change-Id: I63e71f552b3cc752c26c1b8705420f38908e17e6
BUG: 789278
Signed-off-by: Michael Scherer 
Reviewed-on: https://review.gluster.org/16756
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan 
Tested-by: Michael Scherer 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

tier : Tier as a service

2017-01-17T04:49:47+00:00

tierd is implemented by separating from rebalance process.

The commands affected:

1) Attach tier will trigger this process instead of old one
2) tier start and tier start force will also trigger this process.
3) volume status [tier] will show tier daemon as a process instead
of task and normal tier status and tier detach status works.
4) tier stop implemented.
5) detach tier implemented separately along with new detach tier
status
6) volume tier volname status will work using the changes.
7) volume set works

This patch has separated the tier translator from the legacy
DHT rebalance code. It now sends the RPCs from the CLI
to glusterd separate to the DHT rebalance code.
The daemon is now a service, similar to the snapshot daemon,
and can be viewed using the volume status command.

The code for the validation and commit phase are the same
as the earlier tier validation code in DHT rebalance.

The “brickop” phase has been changed so that the status
command can use this framework.

The service management framework is now used.
DHT rebalance does not use this framework.

This service framework takes care of :

*) spawning the daemon, killing it and other such processes.
*) volume set options , which are written on the volfile.
*) restart and reconfigure functions. Restart is to restart
the daemon at two points
        1)after gluster goes down and comes up.
        2) to stop detach tier.
*) reconfigure is used to make immediate volfile changes.
By doing this, we don’t restart the daemon.
it has the code to rewrite the volfile for topological
changes too (which comes into place during add and remove brick).

With this patch the log, pid, and volfile are separated
and put into respective directories.

Change-Id: I3681d0d66894714b55aa02ca2a30ac000362a399
BUG: 1313838
Signed-off-by: hari gowtham 
Reviewed-on: http://review.gluster.org/13365
Smoke: Gluster Build System 
Tested-by: hari gowtham 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Dan Lambright 
Reviewed-by: Atin Mukherjee

feature/dht: undo partially successful dir rename

2017-01-11T15:40:25+00:00

As with dht, dirs are present on all subvolumes,
renaming them is a compound operation and thus a
partial success + partial failure scenario is
possible, resulting in an inconsistent state.

For purposes of reproduction, such a scenario can
easily be produced by stopping the volume, edit the
volfile of a certain subvolume to get at an
"option read-only on" setting, and then restart
the volume. Thus those operations that are to make change
on the affected subvolume will fail with EROFS.

To handle such scenarios, we introduce an in-memory cache
where we record the return values obtained from the
subvolumes. At the final stage of the dir rename operation
we check if it's a partial success/fail situation. If yes,
then we perform a reverse rename op on those subvolumes
where the operation succeeded.

Change-Id: I3d05f74f53932cb984a918d252a7309c1009a51d
BUG: 1412069
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/15739
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: N Balachandran

cluster/dht: Fix memory corruption while accessing regex stored in

2016-12-08T17:56:41+00:00

private

If reconfigure is executed parallely (or concurrently with dht_init),
there are races that can corrupt memory. One such race is modification
of regexes stored in conf (conf->rsync_regex_valid and
conf->extra_regex_valid) through dht_init_regex. With change [1],
reconfigure codepath can get executed parallely (with itself or with
dht_init) and this fix is needed.

Also, a reconfigure can race with any thread doing dht_layout_search,
resulting in dht_layout_search accessing regex freed up by reconfigure
(like in bz 1399134).

[1] http://review.gluster.org/15046

Change-Id: I039422a65374cf0ccbe0073441f0e8c442ebf830
BUG: 1399134
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/15945
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: N Balachandran 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/tier: handle fast demotions

2016-10-19T19:51:48+00:00

Demote files on priority if hi-watermark has been breached and continue
to demote until the watermark drops below hi-watermark.

Monitor watermark more frequently.
Trigger demotion as soon as hi-watermark is breached.
Add cluster.tier-emergency-demote-query-limit option to limit number
of files returned from the database query for every iteration of
tier_migrate_using_query_file(). If watermark hasn't dropped below
hi-watermark during the first iteration, the next iteration will be
triggered approximately 1 second after tier_demote() returns to the
main tiering loop.
Update changetimerecorder xlator to handle query for emergency demote
mode.

Add tier-ctr-interface.h:
Move tier and ctr interface specific macros and struct definition from
libglusterfs/src/gfdb/gfdb_data_store.h to new header
libglusterfs/src/tier-ctr-interface.h

Change-Id: If56af78c6c81d37529b9b6e65ae606ba5c99a811
BUG: 1366648
Signed-off-by: Milind Changire 
Reviewed-on: http://review.gluster.org/15158
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Dan Lambright