glusterfs.git/xlators/cluster, branch v3.10.5

cluster/rebalance: Fix hardlink migration failures

2017-08-11T20:03:45+00:00

A brief overview of how hardlink migration works:
  - Different hardlinks (to the same file) may hash to different bricks,
but their cached subvol will be the same. Rebalance picks up the first hardlink,
calculates its hash (call it TARGET) and sets the hashed subvolume as an xattr
on the data file.
  - All the hardlinks that come after this fetch that xattr and create
linkto files on TARGET (all linkto files for the hardlinks will be hardlinks
to each other on TARGET).
  - When the number of hardlinks on the source is equal to the number of hardlinks
on TARGET, the data migration happens.
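
The three steps above can be sketched as a toy model. This is only an
illustration of the described flow, not the gf_defrag_handle_hardlink code:
the function names, the hash stand-in and the subvol names are all assumptions.

```python
def hashed_subvol(name, subvols):
    # Stand-in for the DHT hash: deterministic, but NOT the real algorithm.
    return subvols[sum(name.encode()) % len(subvols)]


def migrate_hardlinks(links, subvols):
    """Walk the hardlinks in order; return (TARGET, link that triggered migration)."""
    target_xattr = None       # models the xattr stored on the data file
    linkto_on_target = set()  # linkto files created on TARGET so far
    migrated_at = None
    for name in links:
        if target_xattr is None:
            # First hardlink: its hashed subvol becomes TARGET.
            target_xattr = hashed_subvol(name, subvols)
        # Assumption: every hardlink, including the first, gets a linkto on TARGET.
        linkto_on_target.add(name)
        # Data moves only once TARGET has as many links as the source.
        if len(linkto_on_target) == len(links):
            migrated_at = name
    return target_xattr, migrated_at
```

In this model the data migration is triggered only while handling the last
hardlink, and TARGET is decided once, by the first hardlink.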

RACE:1
  Since rebalance is multi-threaded, the first lookup (which decides where the TARGET
subvol should be) can be called by two hardlink migrations in parallel, and they may end
up creating linkto files on two different TARGET subvols. Hence, the hardlinks won't be
migrated.

Fix: Rely on the xattr response of lookup inside gf_defrag_handle_hardlink since it
is executed under synclock.
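
A minimal sketch of the idea behind this fix, assuming a plain mutex in place
of the synclock: the TARGET decision is a check-and-set done under the lock, so
every thread ends up with the same answer. The class and function names are
hypothetical.

```python
import threading


class DataFile:
    """Holds the TARGET xattr; the lock stands in for the synclock."""

    def __init__(self):
        self.target_xattr = None
        self.lock = threading.Lock()

    def target_for(self, proposed):
        # Check-and-set under the lock: the first caller wins, the rest reuse it.
        with self.lock:
            if self.target_xattr is None:
                self.target_xattr = proposed
            return self.target_xattr


def run(proposals):
    """Let one thread per proposal race to decide TARGET; return what each saw."""
    data_file = DataFile()
    seen = []

    def worker(proposed):
        seen.append(data_file.target_for(proposed))

    threads = [threading.Thread(target=worker, args=(p,)) for p in proposals]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Whichever thread gets the lock first decides TARGET; no two threads can create
linkto files on different subvols.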

RACE:2
  The linkto files on TARGET can also be created by other clients if they are doing
lookups on the hardlinks. Consider a scenario where you have 100 hardlinks. When
rebalance is migrating the 99th hardlink, as a result of continuous lookups from other
clients, the linkcount on TARGET is equal to the source linkcount. Rebalance will migrate data
on the 99th hardlink itself. On the 100th hardlink migration, the hardlink will have TARGET as
its cached subvolume. If its hash is also the same, then a migration will be triggered from
TARGET to TARGET, leading to data loss.

Fix: Before the final data migration, make sure the source is not the same as the destination.
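
The guard can be pictured as follows. This is a hedged sketch with hypothetical
names; the actual check lives in the rebalance code and may look different.

```python
def should_migrate_data(cached_subvol, target_subvol, src_links, dst_links):
    """Decide whether the final data migration may run for a hardlinked file."""
    if cached_subvol == target_subvol:
        # Source and destination are the same brick: migrating would destroy data.
        return False
    # Otherwise migrate only once every hardlink has a linkto on TARGET.
    return src_links == dst_links
```

For the 100th hardlink in the scenario above, cached_subvol is already TARGET,
so the function returns False and no TARGET-to-TARGET migration is attempted.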

RACE:3
  Since a hardlink can be migrating to a non-hashed subvolume, a lookup from another
client, or even from rebalance itself, might delete the linkto file on TARGET, leading
to hardlinks never getting migrated.

This will be addressed in a different patch in future.

> Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
> BUG: 1469964
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17755
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: N Balachandran 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
BUG: 1473141
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17838
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: fix on demand migration files from client

2017-08-11T19:37:12+00:00

    On demand migration of files i.e. migration done by clients
    triggered by a setfattr was broken.

    Dependency on defrag led to crash when migration was triggered from
    client.

    Note: This functionality is not available for tiered volumes. Migration
    from tier served client will fail with ENOTSUP.

    usage (But refer to the steps mentioned below to avoid any issues) :
    setfattr -n "trusted.distribute.migrate-data" -v "1" 

    The purpose of fixing the on-demand client migration was to provide a
    workaround for cases where the user has many empty directories compared to
    files and wants to run a remove-brick process.

    Here are the steps to trigger file migration for the remove-brick process from
    a client. (Following the steps below exactly as written is highly recommended.)

    Let's say it is a replica volume and the user wants to remove a replica pair
    named brick1 and brick2. (Make sure healing is completed before you run
    these steps.)

    Step-1: Start remove-brick process
     - gluster v remove-brick  brick1 brick2 start
    Step-2: Kill the rebalance daemon
     - ps aux | grep glusterfs | grep rebalance\/ | awk '{print $2}' | xargs kill
    Step-3: Do a fresh mount as mentioned here
     -  glusterfs -s ${localhostname} --volfile-id rebalance/$volume-name /tmp/mount/point
    Step-4: Go to one of the bricks (among brick1 and brick2)
     - cd 
    Step-5: Run the following command.
     - find . -not \( -path ./.glusterfs -prune \) -type f -not -perm 01000 -exec bash -c 'setfattr -n "distribute.fix.layout" -v "1" ${mountpoint}/$(dirname '{}')' \; -exec  setfattr -n "trusted.distribute.migrate-data" -v "1" ${mountpoint}/'{}' \;

    This command ignores linkto files and empty directories, does a fix-layout of
    the parent directory, and triggers a migration operation on the files.

    Step-6: Once this process is completed do "remove-brick force"
     - gluster v remove-brick  brick1 brick2 force

    Note: Use the above script only when there is a large number of empty directories.
    Since the script crawls the brick directly and skips directories that
    are empty, the time spent fixing the layout of those directories is eliminated. (Even though
    the script does not do fix-layout on empty directories, a fresh layout will be built
    for each such directory after remove-brick, so application continuity is not affected.)
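
    To make the file selection in Step-5 easier to follow, here is a rough
    Python equivalent of the find expression only (it does not run setfattr).
    The function names are hypothetical; the selection rules are taken from the
    find flags: prune ./.glusterfs, keep regular files, drop files whose
    permission bits are exactly 01000.

```python
import os
import stat


def is_linkto_mode(mode):
    # find's "-perm 01000" matches permission bits that are exactly the sticky bit.
    return stat.S_IMODE(mode) == 0o1000


def migration_candidates(brick_root):
    """Yield brick-relative paths of the files the find command would act on."""
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # same effect as -prune
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: symlinks are not followed, like -type f
            if stat.S_ISREG(st.st_mode) and not is_linkto_mode(st.st_mode):
                yield os.path.relpath(full, brick_root)
```

    Empty directories never produce a candidate, which is why no fix-layout is
    issued for them.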

    Detailing the expectation for hardlink migration with this patch:
        Hardlinks are migrated only for the remove-brick process. It is essential
    to have a new mount (step-3) for the hardlink migration to happen. Why?
    setfattr is an inode-based operation. Since we are doing setfattr from a
    fuse mount here, inode_path will try to build the path from the dentries linked to the inode.
    For a file without hardlinks the path construction will be correct. But for hardlinks,
    the inode will have multiple dentries linked.

            Without a fresh mount, inode_path will always get the most recently linked dentry.
    e.g. if there are three hardlinks named dir1/link1, dir2/link2, dir3/link3, on a client
    where these hardlinks have been looked up, inode_path will always return the path dir3/link3
    if dir3/link3 was looked up most recently. Hence, we won't be able to create linkto
    files for all the other hardlinks on the destination (read gf_defrag_handle_hardlink for more details
    on hardlink migration).

            With a fresh mount, the lookup and setfattr become serialized, e.g. link2 won't be
    looked up until link1 is looked up and migrated. Hence, inode_path will always have the correct
    path: in this case the link1 dentry is picked up (as it is the most recently looked up dentry) and
    the path is built correctly.
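
    The difference between the two cases can be shown with a toy inode that
    keeps its dentries in link order. This models only the behaviour described
    above; the class and function names are hypothetical and this is not the
    libglusterfs inode table.

```python
class Inode:
    """Toy inode that remembers the dentries linked to it, oldest first."""

    def __init__(self):
        self.dentries = []

    def link(self, path):
        # A lookup links (or re-links) a dentry, making it the most recent one.
        if path in self.dentries:
            self.dentries.remove(path)
        self.dentries.append(path)

    def path(self):
        # As described above: the most recently linked dentry wins.
        return self.dentries[-1]


def paths_seen_by_setfattr(links, fresh_mount):
    """Return the path the model's inode.path() yields for each setfattr call."""
    inode = Inode()
    if not fresh_mount:
        for p in links:      # existing mount: every hardlink already looked up
            inode.link(p)
    seen = []
    for p in links:
        if fresh_mount:
            inode.link(p)    # fresh mount: lookup happens right before setfattr
        seen.append(inode.path())
    return seen
```

    On the existing mount every setfattr resolves to dir3/link3; on the fresh
    mount each setfattr resolves to the hardlink it was issued on.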

    Note: If you run the above script on an existing mount(all entries looked up), hard links may
    not be migrated, but there should not be any other issue. Please raise a bug, if you find any
    issue.

    Tests: Manual

> Change-Id: I9854cdd4955d9e24494f348fb29ba856ea7ac50a
> BUG: 1450975
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17115
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Smoke: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: I9854cdd4955d9e24494f348fb29ba856ea7ac50a
BUG: 1473140
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17837
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: initialize throttle option "normal" to same in init and reconfigure

2017-08-11T18:37:21+00:00

The value for "normal" was different in dht_init and dht_reconfigure.
Initialization/reconfiguration of the throttle option is now carved out into a separate
function (dht_configure_throttle). The "normal" value will be "2".

> Change-Id: Ie323eae019af41d6bef0a136e3d284dc82bab9a1
> BUG: 1451162
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17303
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Zhou Zhengping 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: Ie323eae019af41d6bef0a136e3d284dc82bab9a1
BUG: 1473137
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17836
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: Make rebalance throttle option tuned by number

2017-08-11T18:06:40+00:00

The current rebalance throttle options (lazy/normal/aggressive) may not always be
sufficient for the purpose of throttling. In our recent tests, we observed that for
certain setups the normal and aggressive modes behaved similarly, consuming the full
disk bandwidth. In cases like this the admin should be able to tune it
down (or up) depending on the need.

Along with the old throttle configurations, thread counts can now be tuned by number,
e.g. gluster v set vol-name cluster.rebal-throttle 5

The admin can tune up/down between 0 and the number of cores available.
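
A minimal sketch of how such a value could be validated, assuming the range is
inclusive at both ends. The function name and return shape are hypothetical and
do not reflect the actual option validation code.

```python
import os

NAMED_MODES = ("lazy", "normal", "aggressive")


def parse_throttle(value, cores=None):
    """Return ("mode", name) or ("threads", n); raise ValueError otherwise."""
    if cores is None:
        cores = os.cpu_count() or 1
    if value in NAMED_MODES:
        return ("mode", value)
    # Assumption: a numeric value is accepted when 0 <= n <= number of cores.
    if value.isdecimal() and 0 <= int(value) <= cores:
        return ("threads", int(value))
    raise ValueError(f"throttle must be one of {NAMED_MODES} or 0..{cores}")
```

With 24 cores, "5" and "normal" are accepted while "25" and "fast" are rejected.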

Note: For heterogeneous servers, validation will fail on the old server if a "number"
is given for the throttle configuration.
The message looks something like this:
"volume set: failed: Staging failed on vm2. Error: cluster.rebal-throttle should be {lazy|normal|aggressive}"

Test: Manual test by logging active thread number after reconfiguring throttle option.
testcase: tests/basic/distribute/throttle-rebal.t

> Change-Id: I46e3cde546900307831028b344ecf601fd9b02c3
> BUG: 1438370
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/16980
> NetBSD-regression: NetBSD Build System 
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Atin Mukherjee 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: I46e3cde546900307831028b344ecf601fd9b02c3
BUG: 1473136
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17835
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: rebalance perf enhancement

2017-08-11T17:40:29+00:00

Problem: Throttle settings "normal" and "aggressive" for rebalance
showed no performance difference.

normal mode spawns $(no. of cores - 4)/2 threads and aggressive
spawns $(no. of cores - 4) threads. Though aggressive mode has twice
the number of threads compared to that of normal mode, there was no
performance gain when switched to aggressive mode from normal mode.
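
As a worked example of the two formulas above (the floor of one thread for
small machines is an assumption added here, not something this commit states):

```python
def rebalance_threads(cores):
    """Thread counts per mode, following the formulas quoted above."""
    # Assumption: never go below one thread on machines with few cores.
    aggressive = max(cores - 4, 1)
    normal = max((cores - 4) // 2, 1)
    return {"normal": normal, "aggressive": aggressive}
```

For the 24-core test machines used below this gives 10 threads in normal mode
and 20 in aggressive mode.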

RCA:
While debugging the above problem, we tried assigning
migration jobs to migration threads spawned by rebalance, rather than to
synctasks (as there is more overhead associated with managing the task
queue and threads). This gave us a significant improvement over rebalance
under synctasks. This patch does not guarantee that there will be a
clear performance difference between normal and aggressive mode, but it
certainly maximized the disk utilization for the 1GB files run.
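
The general shape of "dedicated migrator threads draining a queue" can be
sketched as below. This is a generic worker-pool illustration with hypothetical
names, not the rebalance implementation.

```python
import queue
import threading


def migrate_all(files, n_threads, migrate_one):
    """Drain a shared job queue with n_threads dedicated migrator threads."""
    jobs = queue.Queue()
    for f in files:
        jobs.put(f)
    done = []

    def migrator():
        while True:
            try:
                f = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread exits
            done.append(migrate_one(f))

    threads = [threading.Thread(target=migrator) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Each thread picks up the next file as soon as it finishes the previous one, so
there is no separate task scheduler between the queue and the workers.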

Results:

Test environment:
Gluster Config:
Number of Bricks: 2 (one brick per disk(RAID-6 12 disk))
Bricks:
Brick1: server1:/brick/test1/1
Brick2: server2:/brick/test1/1
Options Reconfigured:
performance.readdir-ahead: on
server.event-threads: 4
client.event-threads: 4

1000 files with 1GB each were created/renamed such that all files will have
server1 as cached and server2 as hashed, so that all files will be migrated.

Test machines had 24 cores each.

Results with/without synctask-based migration:
----------------------------------------------

mode                              normal (10 threads)    aggressive (20 threads)

time taken with synctask          0:55:30 (h:m:s)        0:56:03 (h:m:s)

time taken with migrator threads  0:38:03 (h:m:s)        0:23:41 (h:m:s)

From the table above it can be seen that rebalance with migrator threads is clearly
faster than rebalance with synctasks: roughly 1.5x in normal mode and over 2x in
aggressive mode.
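
The ratios can be checked directly from the timings in the table. The helper
names below are illustrative; the timings are the ones reported above.

```python
def seconds(hms):
    """Convert an h:m:s string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


# (time with synctask, time with migrator threads), as reported in the table.
RESULTS = {
    "normal": ("0:55:30", "0:38:03"),
    "aggressive": ("0:56:03", "0:23:41"),
}


def speedup(mode):
    synctask, migrator = RESULTS[mode]
    return seconds(synctask) / seconds(migrator)
```

This gives a speedup of about 1.46 for normal mode and about 2.37 for
aggressive mode.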

Additionally, this patch modifies the code so that the caller gets the exact error
number returned by dht_migrate_file (earlier the errno meaning was overloaded). This
helps avoid scenarios where a migration failure due to ENOENT results in a
rebalance abort/failure.
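
A hedged sketch of why an exact errno helps the caller. The function name and
the three outcomes are hypothetical; only the ENOENT case is taken from the
text above.

```python
import errno


def handle_migration_result(err):
    """Map the errno from one migration attempt to a crawler action."""
    if err == 0:
        return "migrated"
    if err == errno.ENOENT:
        # The file vanished between crawl and migration: not a rebalance failure.
        return "skipped"
    return "failed"
```

With an overloaded error value the caller could not tell the second case from
the third.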

> Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
> BUG: 1420166
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/16427
> Smoke: Gluster Build System 
> Reviewed-by: N Balachandran 
> Reviewed-by: Raghavendra G 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Signed-off-by: Susant Palai 

Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
BUG: 1473134
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17834
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: correct space check for rebalance

2017-08-11T17:13:16+00:00

With rebalance doing fallocate on destination, we don't need to
add file size to the "destination available space" to decide whether
to migrate the file or not.

Notes: Fallocate would have already occupied the file size space on
destination

> Change-Id: If7f6a6654e6257726680cf20d618482a6e9095a6
> BUG: 1441508
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17104
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Amar Tumballi 
> Reviewed-by: N Balachandran 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: If7f6a6654e6257726680cf20d618482a6e9095a6
BUG: 1473133
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17833
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: Skip file migration if the subvol that meets min-free-disk

2017-08-11T16:14:12+00:00

... criteria happens to be the same subvol containing data-file

Rebalance needs to figure out a new subvol in case the hashed subvol
does not have enough space. In the process of figuring out the new subvol,
we need to ignore the source subvol, otherwise it will lead to data loss.
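
The selection rule can be sketched as follows. The function name, the
free-space test and the "most free space wins" tie-break are assumptions made
for illustration; only the exclusion of the source subvol comes from this commit.

```python
def pick_target(hashed, source, free_space, file_size, min_free_disk):
    """Return the subvol to migrate to, or None to skip the migration."""

    def fits(subvol):
        # Assumption: the write must leave at least min_free_disk bytes free.
        return free_space[subvol] - file_size >= min_free_disk

    if hashed != source and fits(hashed):
        return hashed
    # Look for another subvol, never considering the one that holds the data.
    candidates = [s for s in free_space
                  if s != source and s != hashed and fits(s)]
    if not candidates:
        return None
    return max(candidates, key=lambda s: free_space[s])
```

If the only subvol that satisfies min-free-disk is the source itself, the
function returns None and the file is left where it is.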

Test: Manual
Ran the following
sizeof /tmp/1: 1.5GB
sizeof /brick/1: 16GB
sizeof /tmp/2: 1.5GB


glusterd;  gluster v create test1 vm1:/brick/1 vm1:/tmp/1;
gluster v start test1;
mount -t glusterfs vm1:test1 /mnt;
for i in {1..2000}
do
dd if=/dev/zero of=/mnt/file$i bs=1KB count=1 &> /dev/null;
done
gluster v add-brick test1 vm1:/tmp/2
gluster v set test1 min-free-disk 12GB
gluster v remove-brick test1 vm1:/tmp/1 start


file count and data were intact.

> Change-Id: Ib8fc8467a3d48a7c12958824c4f0b88e160b86c1
> BUG: 1441508
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17064
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: Ib8fc8467a3d48a7c12958824c4f0b88e160b86c1
BUG: 1473133
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17832
Smoke: Gluster Build System 
Tested-by: Shyamsundar Ranganathan 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: Make rebalance honor min-free-disk

2017-08-11T12:18:11+00:00

test:  Manual

Created files of size 1K on a 2-brick setup (each brick of size 1GB).
Added a brick of size 16GB.
Set min-free-disk to 12GB (so that the first two bricks won't receive any files).
Removed one of the first two bricks (size 1GB).

Logs from test:
[2017-04-12 08:52:08.196484] W [MSGID: 0] [dht-rebalance.c:895:__dht_check_free_space]
 0-test1-dht: Write will cross min-free-disk for file - /tile32 on subvol - test1-client-1.
Looking for new subvol.

[2017-04-12 08:52:08.196904] I [MSGID: 0] [dht-rebalance.c:925:__dht_check_free_space]
0-test1-dht: new target found - test1-client-2 for file - /tile32

 - Post migration we have two files. The new destination (/brick/1) has the data file
[root@vm1 ~]# ll /brick/1/tile32
-rw-r--r--. 2 root root 0 Apr 12 14:22 /brick/1/tile32

 - On the old target the linkto file is there with linkto xattr pointing to /brick/1
[root@vm1 ~]# ll /tmp/2/tile32
---------T. 2 root root 1000 Apr 12 14:22 /tmp/2/tile32
[root@vm1 ~]# getfattr -m . -de text /tmp/2/tile32
getfattr: Removing leading '/' from absolute path names
security.selinux="unconfined_u:object_r:user_tmp_t:s0"
trusted.gfid=(binary value, not printable as text)
trusted.glusterfs.dht.linkto="test1-client-2"

Marking ./tests/features/worm_sh.t as a bad test.
Reason: it failed with this patch on the master branch as well, and it has nothing
to do with rebalance/remove-brick.

> BUG: 1441508
> Change-Id: I90bae251cda3d957a49cdceda90cd08311a392fb
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17034
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Reviewed-by: Amar Tumballi 
> Reviewed-by: Raghavendra G 
> CentOS-regression: Gluster Build System 
> Signed-off-by: Susant Palai 

Change-Id: I90bae251cda3d957a49cdceda90cd08311a392fb
BUG: 1473132
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17831
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

dht/rebalance: Crawler performance improvement

2017-08-11T11:48:54+00:00

The job of the crawler in rebalance is to fetch files from each
local subvolume and push them to the migration queue if they are eligible for
migration. We do a lookup on the entries received to figure out the
eligibility. Since the lookup is done on a local subvolume, we receive
linkto files as well as regular files. This requires us to do two lookups.

first: do a lookup on the file to figure out whether it is a linkto file
second: do a lookup on the file to figure out if it should be migrated

Note: The migrator thread also does one lookup for the file before
migration.

Optimization: Remove the lookups done by the crawler and offload this task
to the migrator threads. For linkto file verification, get the stat and
xattr information from readdirp.

So in total we have one lookup instead of three for each entry.
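
A rough sketch of classifying entries from the stat and xattr data that come
back with the directory listing, so the crawler needs no lookup of its own. The
function names and the entry tuple shape are hypothetical, and the linkto test
(regular file, sticky-bit-only mode, linkto xattr present) is an assumption
based on how linkto files appear elsewhere in this log.

```python
import stat

LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def is_linkto(mode, xattrs):
    """Classify an entry using only the stat mode and xattrs returned with it."""
    sticky_only = stat.S_IMODE(mode) == stat.S_ISVTX
    return stat.S_ISREG(mode) and sticky_only and LINKTO_XATTR in xattrs


def entries_to_queue(entries):
    # entries: iterable of (name, mode, xattrs), as a readdirp-like call might return.
    return [name for name, mode, xattrs in entries
            if not is_linkto(mode, xattrs)]
```

Only the non-linkto entries are queued; the single remaining lookup is the one
the migrator thread performs before migration.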

Performance numbers:
Created a two-node, two-brick setup with 100000 files and started
rebalance. Since there is no add-brick, no files will be migrated and
we measure only the crawler performance.

Without patch:
[root@gprfs039 ~]# grs
     Node  Rebalanced-files    size  scanned  failures  skipped     status  run time in h:m:s
---------  ----------------  ------  -------  --------  -------  ---------  -----------------
localhost                 0  0Bytes    50070         0        0  completed             0:0:48
  server2                 0  0Bytes    49930         0        0  completed             0:0:44
volume rebalance: test1: success

Total: 48 seconds

With the current patch:
[root@gprfs039 mnt]# gluster v rebalance test1 status
     Node  Rebalanced-files    size  scanned  failures  skipped     status  run time in h:m:s
---------  ----------------  ------  -------  --------  -------  ---------  -----------------
localhost                 0  0Bytes    50070         0        0  completed             0:0:12
  server2                 0  0Bytes    49930         0        0  completed             0:0:12
volume rebalance: test1: success

Total: 12 seconds

That's a 4X speed gain. :)

> Updates glusterfs#155
> Change-Id: Idc8e5b366e76c54aa40d698876ae62fe1630b6cc
> BUG: 1439571
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/15781
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Raghavendra G 

Updates glusterfs#155
Change-Id: Idc8e5b366e76c54aa40d698876ae62fe1630b6cc
BUG: 1473129
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17830
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

refcount: typecast function for calling on free

2017-08-11T11:22:43+00:00

All of the functions called to free the refcounted structure are doing a
typecast from (void*) to their own type that is being free'd. This
really is not needed, and the refcount interface becomes a little simpler
without the requirement of typecasting.

With this small improvement in the API, all callers are updated too.

Cherry picked from commit f2ca301bd741e3e3f076cd3f72fcd377bcef2a1a:
> Change-Id: I32473b6d1799f62861d4b2d78ea30c09e6c80ab1
> BUG: 1416889
> Signed-off-by: Niels de Vos 
> Reviewed-on: https://review.gluster.org/16471
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Reviewed-by: Xavier Hernandez 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Kaleb KEITHLEY 

Backport note: This patch makes it easier to backport changes that use
               gf_refcount_t. There is no functional change.

Change-Id: I32473b6d1799f62861d4b2d78ea30c09e6c80ab1
BUG: 1471870
Signed-off-by: Niels de Vos 
Reviewed-on: https://review.gluster.org/17913
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan