glusterfs.git/xlators/cluster/dht/src, branch v3.10.11

cluster/dht: Add migration checks to dht_(f)xattrop

2018-01-30T13:38:09+00:00

The dht_(f)xattrop implementation did not perform the
migration phase1/phase2 checks, which could cause issues
with rebalance on sharded volumes.
This does not solve the issue where fops may reach the target
out of order.

> Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c
> BUG: 1471031
> Signed-off-by: N Balachandran 

Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c
BUG: 1498081
Signed-off-by: N Balachandran

cluster/dht: EBADF handling for fremovexattr and fsetxattr

2017-10-03T12:26:48+00:00

Add EBADF handling for dht_fremovexattr and dht_fsetxattr.

> BUG: 1476665
> Signed-off-by: N Balachandran 
> Reviewed-on: https://review.gluster.org/17999
> Smoke: Gluster Build System 
> Reviewed-by: Shyamsundar Ranganathan 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Raghavendra G 

(cherry picked from commit 747a08d34e2a1e94d7fce68a3577370288bb1955)
Change-Id: Ide0d5812dae79655d2565157e5baabcd753b4309
BUG: 1467010
Signed-off-by: N Balachandran

dht: add FOP check to dht_file_setattr_cbk

2017-09-29T09:01:07+00:00

Problem:
bug-797171.7 loaded error-gen xlator on the brick which sent EBADF for a
non fd-based fop, namely setattr. This caused
dht_check_and_open_fd_on_subvol_task() to crash as local->fd was NULL.

Fix:
Call dht_check_and_open_fd_on_subvol_task() from dht_file_setattr_cbk
only for dht_fsetattr, not for dht_setattr or dht_setattr2.
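
The guard described in the fix can be modeled roughly as follows. This is a minimal Python sketch of the idea, not the DHT C code: `Local`, `FD_BASED_FOPS`, and `should_reopen_fd` are hypothetical names invented for illustration.

```python
import errno
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of fd-based fops; only these carry an fd in their local state.
FD_BASED_FOPS = {"fsetattr"}


@dataclass
class Local:
    fop: str                   # name of the fop that produced this callback
    fd: Optional[int] = None   # None for path-based fops such as setattr


def should_reopen_fd(local: Local, op_ret: int, op_errno: int) -> bool:
    """Return True only when an fd-based fop failed with EBADF."""
    if op_ret != -1 or op_errno != errno.EBADF:
        return False
    # A path-based fop has no fd to reopen, so the reopen task must not run.
    return local.fop in FD_BASED_FOPS and local.fd is not None
```

With this shape, an EBADF injected on a plain setattr falls through to normal error handling instead of dereferencing a missing fd.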

> Reviewed-on: https://review.gluster.org/18208
> Smoke: Gluster Build System 
> Reviewed-by: Susant Palai 
> Reviewed-by: Amar Tumballi 
> Reviewed-by: Raghavendra G 
> Reviewed-by: N Balachandran 
> CentOS-regression: Gluster Build System 
(cherry picked from commit 47188e9eac59de416a5c86c7ec7540ed6aaa1c98)


Signed-off-by: Ravishankar N 
Change-Id: Iab4999e213bf2065804f3f8237e470ad454e3c99
BUG: 1497122

cluster/dht: Check for open fd only on EBADF

2017-09-17T12:47:26+00:00

DHT fd-based fops used to check whether the fd was open
on the cached subvol before winding the call. However,
this introduced a performance regression of about
30% for reads.

This check was introduced to handle cases where files
were migrated while IOs were happening. As this is not
the common case, dht will now check if the fd is
open on the cached subvol only if the call fails
with EBADF.

This avoids the performance hit when no rebalance
is running.
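
The "check only on EBADF" approach can be illustrated with a small model. This is a Python sketch under stated assumptions, not the DHT implementation: `Subvol` and `read_with_lazy_open` are hypothetical stand-ins, and a real fop would be wound asynchronously rather than called directly.

```python
import errno


class Subvol:
    """Toy stand-in for a subvolume that tracks which fds are open on it."""

    def __init__(self):
        self.open_fds = set()
        self.open_calls = 0

    def open(self, fd):
        self.open_calls += 1
        self.open_fds.add(fd)

    def read(self, fd):
        if fd not in self.open_fds:
            raise OSError(errno.EBADF, "fd not open on this subvol")
        return b"data"


def read_with_lazy_open(subvol, fd):
    """Wind the read first; open the fd and retry once only on EBADF."""
    try:
        return subvol.read(fd)
    except OSError as err:
        if err.errno != errno.EBADF:
            raise
        subvol.open(fd)   # slow path, expected only after a migration
        return subvol.read(fd)
```

In the common case the read succeeds on the first attempt and no extra check is paid for; the open happens only when the subvol reports EBADF.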

> BUG: 1476665
> Signed-off-by: N Balachandran 
> Reviewed-on: https://review.gluster.org/17976
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Amar Tumballi 
> Reviewed-by: Susant Palai 
> Reviewed-by: Raghavendra G 

Change-Id: I2035a858d63c3fcd22bb634055bbb0ad01686808
BUG: 1467010
Signed-off-by: N Balachandran 
Reviewed-on: https://review.gluster.org/18057
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/rebalance: Fix hardlink migration failures

2017-08-11T20:03:45+00:00

A brief description of how hardlink migration works:
  - Different hardlinks (to the same file) may hash to different bricks,
but their cached subvol will be the same. Rebalance picks up the first hardlink,
calculates its hash (call it TARGET), and sets the hashed subvolume as an xattr
on the data file.
  - All hardlinks that come after this fetch that xattr and create linkto
files on TARGET (the linkto files for all the hardlinks are hardlinks
to each other on TARGET).
  - When the number of hardlinks on the source equals the number of hardlinks on
TARGET, the data migration happens.
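
The three steps above can be modeled as a short loop. This is a Python sketch for illustration only: `migrate_hardlinks` and the `hash_of` callback are hypothetical, and the real logic lives in gf_defrag_handle_hardlink.

```python
def migrate_hardlinks(links, source_link_count, hash_of):
    """Model: the first link fixes TARGET; data moves once link counts match.

    links: hardlink names in the order rebalance visits them.
    hash_of: hypothetical function mapping a name to its hashed subvol.
    Returns (target, index of the link on which data migration fires),
    with index None if the counts never match.
    """
    target = None            # stands in for the xattr stored on the data file
    linkto_on_target = 0
    for i, name in enumerate(links):
        if target is None:
            target = hash_of(name)   # first hardlink decides TARGET
        linkto_on_target += 1        # linkto created on TARGET for this link
        if linkto_on_target == source_link_count:
            return target, i         # counts match: migrate the data
    return target, None
```

Note that later links ignore their own hash and reuse the TARGET chosen for the first link, which is why all linkto files end up on one subvol.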

RACE:1
  Since rebalance is multi-threaded, the first lookup (which decides where the TARGET
subvol should be) can be issued by two hardlink migrations in parallel, and they may end
up creating linkto files on two different TARGET subvols. Hence, the hardlinks won't be
migrated.

Fix: Rely on the xattr response of the lookup inside gf_defrag_handle_hardlink, since it
is executed under synclock.

RACE:2
  The linkto files on TARGET can also be created by other clients if they are doing
lookups on the hardlinks. Consider a scenario with 100 hardlinks. While
rebalance is migrating the 99th hardlink, continuous lookups from another
client can make the link count on TARGET equal to the source link count. Rebalance will
then migrate the data on the 99th hardlink itself. When the 100th hardlink is migrated, it
will have TARGET as its cached subvolume. If its hash is also TARGET, a migration will be
triggered from TARGET to TARGET, leading to data loss.

Fix: Before the final data migration, make sure the source is not the same as the destination.
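
A minimal sketch of the RACE:2 guard, written in Python for illustration. The function name and parameters are hypothetical and do not mirror the actual C code.

```python
def should_migrate_data(source, target, source_links, target_links):
    """Return True when the final data migration is safe to run."""
    if source == target:
        # The file already lives on TARGET; a TARGET-to-TARGET migration
        # is the data-loss case described in RACE:2.
        return False
    # Otherwise migrate only once every hardlink has a linkto on TARGET.
    return source_links == target_links
```

The subvol comparison comes first so that a matching link count alone can never trigger a migration onto the same subvol.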

RACE:3
  Since a hardlink can be migrating to a non-hashed subvolume, a lookup from another
client, or even from rebalance itself, might delete the linkto file on TARGET, leading
to hardlinks never getting migrated.

This will be addressed in a separate patch in the future.

> Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
> BUG: 1469964
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17755
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: N Balachandran 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
BUG: 1473141
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17838
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: fix on demand migration files from client

2017-08-11T19:37:12+00:00

    On-demand migration of files, i.e. migration done by clients and
    triggered by a setfattr, was broken.

    A dependency on defrag led to a crash when migration was triggered from
    a client.

    Note: This functionality is not available for tiered volumes. Migration
    from tier served client will fail with ENOTSUP.

    Usage (but refer to the steps mentioned below to avoid any issues):
    setfattr -n "trusted.distribute.migrate-data" -v "1" 

    The purpose of fixing on-demand client migration was to provide a
    workaround for cases where the user has many more empty directories than
    files and wants to run a remove-brick operation.

    Here are the steps to trigger file migration for a remove-brick operation from
    a client. (It is highly recommended to follow the steps below exactly as written.)

    Let's say it is a replica volume and the user wants to remove a replica pair
    named brick1 and brick2. (Make sure healing has completed before you run
    these steps.)

    Step-1: Start remove-brick process
     - gluster v remove-brick  brick1 brick2 start
    Step-2: Kill the rebalance daemon
     - ps aux | grep glusterfs | grep rebalance\/ | awk '{print $2}' | xargs kill
    Step-3: Do a fresh mount as mentioned here
     -  glusterfs -s ${localhostname} --volfile-id rebalance/$volume-name /tmp/mount/point
    Step-4: Go to one of the bricks (among brick1 and brick2)
     - cd 
    Step-5: Run the following command.
     - find . -not \( -path ./.glusterfs -prune \) -type f -not -perm 01000 -exec bash -c 'setfattr -n "distribute.fix.layout" -v "1" ${mountpoint}/$(dirname '{}')' \; -exec  setfattr -n "trusted.distribute.migrate-data" -v "1" ${mountpoint}/'{}' \;

    This command ignores linkto files and empty directories, does a fix-layout of
    the parent directory, and triggers a migration operation on the files.

    Step-6: Once this process is completed do "remove-brick force"
     - gluster v remove-brick  brick1 brick2 force

    Note: Use the above script only when there is a large number of empty directories.
    Since the script crawls the brick directly and skips directories that
    are empty, the time spent fixing the layout of those directories is eliminated. (Even though
    the script does not do a fix-layout on empty directories, a fresh layout will be built
    for each such directory after remove-brick, so application continuity is not affected.)

    Expectations for hardlink migration with this patch:
        Hardlinks are migrated only for the remove-brick process. It is essential
    to have a new mount (step-3) for hardlink migration to happen. Why?
    setfattr is an inode-based operation. Since we are doing the setfattr from a
    fuse mount here, inode_path will try to build the path from the dentries linked to the inode.
    For a file without hardlinks, the path construction will be correct. But for hardlinks,
    the inode will have multiple dentries linked.

            Without a fresh mount, inode_path will always get the most recently linked dentry.
    e.g. if there are three hardlinks named dir1/link1, dir2/link2, dir3/link3, on a client
    where these hardlinks have been looked up, inode_path will always return the path dir3/link3
    if dir3/link3 was looked up most recently. Hence, we won't be able to create linkto
    files for all the other hardlinks on the destination (read gf_defrag_handle_hardlink for
    more details on hardlink migration).

            With a fresh mount, the lookup and setfattr become serialized, e.g. link2 won't be
    looked up until link1 is looked up and migrated. Hence, inode_path will always have the correct
    path; in this case the link1 dentry is picked up (as it is the most recently looked up dentry)
    and the path is built correctly.
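
The difference between an existing mount and a fresh mount can be illustrated with a toy model of the dentry behaviour described above. This Python sketch is an assumption-laden simplification: the `Inode` class and its methods are hypothetical and only mimic "the most recently linked dentry wins".

```python
class Inode:
    """Toy inode that remembers dentries in the order they were linked."""

    def __init__(self):
        self.dentries = []

    def link(self, path):
        # Re-linking a known dentry moves it to the most recent slot.
        if path in self.dentries:
            self.dentries.remove(path)
        self.dentries.append(path)

    def inode_path(self):
        # Model of the behaviour described above: most recently linked wins.
        return self.dentries[-1]


links = ["dir1/link1", "dir2/link2", "dir3/link3"]

# Existing mount: all three hardlinks were looked up before any setfattr.
existing = Inode()
for path in links:
    existing.link(path)
paths_existing = [existing.inode_path() for _ in links]

# Fresh mount: each lookup is immediately followed by its setfattr.
fresh = Inode()
paths_fresh = []
for path in links:
    fresh.link(path)
    paths_fresh.append(fresh.inode_path())
```

In this model `paths_existing` contains dir3/link3 three times, while `paths_fresh` contains each hardlink once, which matches the reason given for requiring a fresh mount.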

    Note: If you run the above script on an existing mount (with all entries looked up), hardlinks
    may not be migrated, but there should not be any other issue. Please raise a bug if you find
    any issue.

    Tests: Manual

> Change-Id: I9854cdd4955d9e24494f348fb29ba856ea7ac50a
> BUG: 1450975
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17115
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Smoke: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: I9854cdd4955d9e24494f348fb29ba856ea7ac50a
BUG: 1473140
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17837
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: initialize throttle option "normal" to same in init and reconfigure

2017-08-11T18:37:21+00:00

The "normal" value was different in dht_init and dht_reconfigure.
Initialization and reconfiguration of the throttle option are now carved out into a
separate function (dht_configure_throttle). The "normal" value will be "2".
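
The benefit of a single configuration function can be shown with a small Python model. Only the "normal" value of 2 is taken from the text above; `Conf`, `THROTTLE_VALUES`, `init`, and `reconfigure` are hypothetical names, and the real dht_configure_throttle is C code with a different signature.

```python
# Only the "normal" value comes from the commit text; the dict is illustrative.
THROTTLE_VALUES = {"normal": 2}


class Conf:
    def __init__(self):
        self.throttle_value = None


def configure_throttle(conf, option):
    """Single place that maps the throttle option to a value."""
    conf.throttle_value = THROTTLE_VALUES[option]


def init(option):
    conf = Conf()
    configure_throttle(conf, option)   # same helper as reconfigure
    return conf


def reconfigure(conf, option):
    configure_throttle(conf, option)   # same helper as init
```

Because both paths call the same helper, the value assigned for "normal" cannot diverge between initialization and reconfiguration.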

> Change-Id: Ie323eae019af41d6bef0a136e3d284dc82bab9a1
> BUG: 1451162
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17303
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Zhou Zhengping 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: Ie323eae019af41d6bef0a136e3d284dc82bab9a1
BUG: 1473137
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17836
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: Make rebalance throttle option tuned by number

2017-08-11T18:06:40+00:00

The current rebalance throttle options (lazy/normal/aggressive) may not always be
sufficient for the purpose of throttling. In our recent tests, we observed that on
certain setups the normal and aggressive modes behaved similarly, consuming the full
disk bandwidth. In cases like this, the admin should be able to tune it
down (or up) as needed.

In addition to the old throttle configurations, thread counts can now be tuned by number,
e.g. gluster v set vol-name cluster.rebal-throttle 5

The admin can tune up/down between 0 and the number of cores available.
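
A rough Python sketch of the validation this implies is shown below. It is not the glusterd validation code: `validate_throttle` and `NAMED_MODES` are hypothetical, and treating both bounds as inclusive is an assumption based on the sentence above.

```python
import os

NAMED_MODES = ("lazy", "normal", "aggressive")


def validate_throttle(value, cores=None):
    """Accept a named mode, or a number between 0 and the core count."""
    if cores is None:
        cores = os.cpu_count() or 1
    name = value.strip().lower()
    if name in NAMED_MODES:
        return name
    try:
        count = int(name)
    except ValueError:
        count = -1   # not a number and not a named mode
    if 0 <= count <= cores:   # inclusive bounds are an assumption
        return count
    raise ValueError(
        f"throttle should be one of {NAMED_MODES} or a number in 0..{cores}"
    )
```

For example, on a 24-core machine `validate_throttle("5", cores=24)` returns 5, while `validate_throttle("25", cores=24)` raises ValueError.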

Note: For heterogeneous servers, validation will fail on the old server if a "number"
is given for the throttle configuration.
The message looks something like this:
"volume set: failed: Staging failed on vm2. Error: cluster.rebal-throttle should be {lazy|normal|aggressive}"

Test: Manual test by logging active thread number after reconfiguring throttle option.
testcase: tests/basic/distribute/throttle-rebal.t

> Change-Id: I46e3cde546900307831028b344ecf601fd9b02c3
> BUG: 1438370
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/16980
> NetBSD-regression: NetBSD Build System 
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Atin Mukherjee 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: I46e3cde546900307831028b344ecf601fd9b02c3
BUG: 1473136
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17835
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: rebalance perf enhancement

2017-08-11T17:40:29+00:00

Problem: The throttle settings "normal" and "aggressive" for rebalance
showed no difference in performance.

Normal mode spawns $(no. of cores - 4)/2 threads and aggressive mode
spawns $(no. of cores - 4) threads. Though aggressive mode has twice
the number of threads of normal mode, there was no
performance gain when switching from normal mode to aggressive mode.
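
The thread-count formulas can be written out as a small Python helper. The function name is hypothetical, and integer division for normal mode is an assumption.

```python
def rebalance_threads(cores, mode):
    """Thread count per the formulas above; integer division is assumed."""
    if mode == "normal":
        return (cores - 4) // 2
    if mode == "aggressive":
        return cores - 4
    raise ValueError(f"unknown mode: {mode}")
```

For a 24-core machine this gives 10 threads in normal mode and 20 in aggressive mode.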

RCA:
While debugging the above problem, we tried assigning
migration jobs to migration threads spawned by rebalance rather than to
synctasks (as there is more overhead associated with managing the task
queue and threads). This gave us a significant improvement over rebalance
under synctasks. This patch does not really guarantee that there will be a
clear performance difference between normal and aggressive mode, but it
certainly maximized the disk utilization for the run with 1GB files.

Results:

Test environment:
Gluster Config:
Number of Bricks: 2 (one brick per disk(RAID-6 12 disk))
Bricks:
Brick1: server1:/brick/test1/1
Brick2: server2:/brick/test1/1
Options Reconfigured:
performance.readdir-ahead: on
server.event-threads: 4
client.event-threads: 4

1000 files of 1GB each were created/renamed such that all files have
server1 as cached and server2 as hashed, so that all files would be migrated.

Test machines had 24 cores each.

Results with/without synctask-based migration:
-----------------------------------------------

mode                                normal (10 threads)    aggressive (20 threads)

time taken with synctask            0:55:30 (h:m:s)        0:56:03 (h:m:s)

time taken with migrator threads    0:38:03 (h:m:s)        0:23:41 (h:m:s)

From the above table it can be seen that rebalance with migrator threads gives a
clear perf gain over rebalance with synctask: about 1.5x in normal mode and more
than 2x in aggressive mode.
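
The gain in each mode can be checked directly from the timings in the table. The following Python snippet is only an arithmetic aid; the helper names are hypothetical.

```python
def to_seconds(hms):
    """Convert an h:m:s string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(with_synctask, with_migrator_threads):
    """Ratio of synctask time to migrator-thread time."""
    return to_seconds(with_synctask) / to_seconds(with_migrator_threads)


normal_gain = round(speedup("0:55:30", "0:38:03"), 2)       # 3330 s / 2283 s
aggressive_gain = round(speedup("0:56:03", "0:23:41"), 2)   # 3363 s / 1421 s
```

This gives about 1.46x for normal mode and about 2.37x for aggressive mode.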

Additionally, this patch modifies the code so that the caller gets the exact error
number returned by dht_migrate_file (earlier, the meaning of errno was overloaded). This
helps avoid scenarios where a migration failure due to ENOENT results in a
rebalance abort/failure.
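
One way a caller could use an exact errno is sketched below in Python. This is an illustration, not the rebalance crawler code: `handle_migration_result` and its three outcome labels are hypothetical, and treating ENOENT as a benign skip is an assumption drawn from the sentence above.

```python
import errno


def handle_migration_result(ret, op_errno):
    """Classify one file's migration result for the caller."""
    if ret == 0:
        return "migrated"
    if op_errno == errno.ENOENT:
        # Assumed benign: the file vanished before it could be migrated,
        # so the overall rebalance should continue.
        return "skipped"
    return "failed"
```

With an overloaded errno the caller cannot make this distinction, so a vanished file would be indistinguishable from a real failure.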

> Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
> BUG: 1420166
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/16427
> Smoke: Gluster Build System 
> Reviewed-by: N Balachandran 
> Reviewed-by: Raghavendra G 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Signed-off-by: Susant Palai 

Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
BUG: 1473134
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17834
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

cluster/dht: correct space check for rebalance

2017-08-11T17:13:16+00:00

With rebalance doing fallocate on the destination, we don't need to
add the file size to the "destination available space" to decide whether
to migrate the file or not.

Note: fallocate would have already occupied the file-size space on the
destination.

> Change-Id: If7f6a6654e6257726680cf20d618482a6e9095a6
> BUG: 1441508
> Signed-off-by: Susant Palai 
> Reviewed-on: https://review.gluster.org/17104
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Amar Tumballi 
> Reviewed-by: N Balachandran 
> Reviewed-by: Raghavendra G 
> Signed-off-by: Susant Palai 

Change-Id: If7f6a6654e6257726680cf20d618482a6e9095a6
BUG: 1473133
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/17833
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan