path: root/xlators
Commit message | Author | Age | Files | Lines
* cluster/dht: act as passthrough for renames on single child DHT | Raghavendra G | 2018-04-10 | 1 | -7/+15
    The various synchronization steps in dht_rename for handling
    directories and files are necessary only if DHT has more than one child.

    Change-Id: Ie21ad419125504ca2f391b1ae2e5c1d166fee247
    fixes: bz#1563511
    Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* experimental/cloudsync: Download xlator for archival feature | Susant Palai | 2018-04-10 | 15 | -2/+2414
    spec-files: https://review.gluster.org/#/c/18854/

    Overview:
    * Cloudsync maintains three file states in its inode-ctx:
      1 - LOCAL, 2 - REMOTE, 3 - DOWNLOADING.
    * A data-modifying fop is allowed only if the state is LOCAL. If the
      state is REMOTE or DOWNLOADING, the client will download the file or
      wait for a download initiated by another client to finish.
    * Multiple downloads and uploads from different clients are
      synchronized by inodelk.
    * In POSIX a state check is done (part of a different commit) before
      allowing the fop to continue. If the state is remote/downloading, the
      fop is unwound with EREMOTE. The client will then download the file
      and continue with the fop again.
    * Basic algorithm for a fop (let's say write fop):
      - If LOCAL -> resume fop
      - If REMOTE ->
        - INODELK
        - STAT (this gets the state and heals it if needed)
        - DOWNLOAD
        - resume fop

    Note:
    * Developers will need to write plugins for download, based on the
      remote store they choose. In phase-1, support will be added for one
      remote store per volume. In future, more options for multiple remote
      stores will be explored.

    TODOs:
    - Implement stat/lookup/readdirp to return size info from xattr
    - Make plugins configurable
    - Implement unlink fop
    - Add metrics collection
    - Add sharding support

    Design Contributions:
    Aravinda V K <avishwan@redhat.com>
    Amar Tumballi <amarts@redhat.com>
    Ram Ankireddypalle <areddy@commvault.com>
    Susant Palai <spalai@redhat.com>

    updates: #387
    Change-Id: Iddf711ee7ab4e946ae3e472ff62791a7b85e6d4b
    Signed-off-by: Susant Palai <spalai@redhat.com>
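The write-fop flow described in the cloudsync commit above can be sketched roughly as follows. This is only an illustration of the LOCAL / REMOTE / DOWNLOADING state handling; the `Inode` class, the `write_fop` function, the `download` callback and the explicit unlock step are assumptions made for the sketch, not names or behaviour taken from the xlator.

```python
from enum import Enum


class State(Enum):
    LOCAL = 1
    REMOTE = 2
    DOWNLOADING = 3


class Inode:
    """Hypothetical stand-in for the per-inode context."""

    def __init__(self, state):
        self.state = state
        self.locked = False


def write_fop(inode, download, log):
    """Sketch of a data-modifying fop; `log` records the steps taken."""
    if inode.state is State.LOCAL:
        log.append("resume")
        return
    # REMOTE or DOWNLOADING: serialize with other clients (INODELK).
    inode.locked = True
    log.append("inodelk")
    try:
        # STAT: re-read the state under the lock, since another client
        # may have completed the download in the meantime.
        log.append("stat")
        if inode.state is not State.LOCAL:
            inode.state = State.DOWNLOADING
            download(inode)
            log.append("download")
            inode.state = State.LOCAL
    finally:
        # Releasing the lock here is an assumption of this sketch.
        inode.locked = False
        log.append("unlock")
    log.append("resume")
```

A second write on the same inode finds the state already LOCAL and resumes immediately, without taking the lock.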
* quota: allow writes when with EINVAL on pgfid isnot exist | Kinglong Mee | 2018-04-09 | 1 | -0/+21
    NFS client gets "Invalid argument" when writing a file through
    nfs-ganesha.

    1. With quota disabled, the NFS client mounts the nfs-ganesha share and
       does 'll' in the testing directory.
    2. Enable quota;
       getfattr: Removing leading '/' from absolute path names
       trusted.gfid=0xe2edaac0eca8420ebbbcba7e56bbd240
       trusted.gfid2path.b3250af8fa558e66=0x39663134343566662d653530332d343831352d396635312d3236633565366332633137642f7465737466696c653932
       trusted.glusterfs.quota.9f1445ff-e503-4815-9f51-26c5e6c2c17d.contri.3=0x00000000000002000000000000000001
       Notice: testfile92 without trusted.pgfid xattr.
    3. Restart the glusterfs volume with "gluster volume stop/start gvtest"
    4. echo somedata > testfile92
    5. ll testfile92
       -rw-r--r-- 1 root root 0 Mar 6 21:43 testfile92

    BUG: 1560319
    Change-Id: Iaa4dd1e891c99069fb85b7b11bb0482cbf2303b1
    fixes: bz#1560319
    Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
* features/index: Choose different base file on EMLINK error | Pranith Kumar K | 2018-04-06 | 1 | -18/+9
    Change-Id: I4648816af908539efdc2528608aa2ebf7f0d0e2f
    fixes: bz#1559004
    BUG: 1559004
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Turn ON the stripe-cache option by default | Ashish Pandey | 2018-04-06 | 1 | -1/+1
    Change-Id: I0a290396c30c635b13ee73004d20259efb76a954
    fixes: bz#1563945
    Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* glusterd: show brick online after port registration | Atin Mukherjee | 2018-04-05 | 1 | -2/+3
    The gluster-block project needs a dependency check to see whether all
    the bricks are online before bringing up the relevant gluster-block
    services. While the patch https://review.gluster.org/#/c/19785/
    attempts to write the script, a brick should be marked as online only
    when pmap_signin is completed.

    While this is perfectly fine without brick multiplexing, with brick
    multiplexing this patch still doesn't eliminate the race completely, as
    the attach_req call is asynchronous and glusterd immediately marks the
    port as registered.

    Change-Id: I81db54b88f7315e1b24e0234beebe00de6429f9d
    Fixes: bz#1563273
    Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
* afr: add quorum checks in pre-op | Ravishankar N | 2018-04-05 | 1 | -33/+31
    Problem:
    We seem to be winding the FOP if pre-op did not succeed on quorum bricks
    and then failing the FOP with EROFS since the fop did not meet quorum.
    This essentially masks the actual error due to which pre-op failed. (See
    BZ).

    Fix:
    Skip FOP phase if pre-op quorum is not met and go to post-op.

    Fixes: 1561129
    Change-Id: Ie58a41e8fa1ad79aa06093706e96db8eef61b6d9
    fixes: bz#1561129
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
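As a rough illustration of the fix in the afr commit above, the sketch below skips the FOP phase when too few bricks succeeded in pre-op and reports the pre-op error itself. The function name `transaction`, its parameters, and the `EIO` fallback are assumptions of the sketch, not AFR's actual interfaces.

```python
import errno


def transaction(pre_op_errnos, quorum, wind_fop, post_op):
    """pre_op_errnos holds one errno per brick, 0 meaning success."""
    succeeded = sum(1 for e in pre_op_errnos if e == 0)
    if succeeded < quorum:
        # Skip the FOP phase and surface the real pre-op error instead
        # of a later, generic EROFS. The EIO default is an assumption
        # for the case where no brick reported an error.
        result = next((e for e in pre_op_errnos if e != 0), errno.EIO)
    else:
        result = wind_fop()
    # Post-op runs in both cases.
    post_op()
    return result
```

With two of three bricks failing pre-op with `ENOSPC` and a quorum of 2, the caller sees `ENOSPC` and the fop is never wound.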
* glusterd: mark port_registered to true for all running bricks with brick mux | Atin Mukherjee | 2018-04-05 | 2 | -0/+3
    glusterd maintains a boolean flag 'port_registered' which is used to
    determine if a brick has completed its portmap sign-in process. This
    flag is (re)set in pmap_signin and pmap_signout events. In case of brick
    multiplexing, this flag is the identifier to determine if the very first
    brick with which the process is spawned has completed its sign-in
    process. However, on glusterd restart, when a brick is already
    identified as running, glusterd does a pmap_registry_bind to ensure its
    portmap table is updated, but this flag isn't updated. That is fine
    without brick multiplexing, but it causes an issue if the very first
    brick which came up as part of the process is replaced: the subsequent
    brick attach will then fail.

    One way to validate this is to create and start a volume, remove the
    first brick and then add-brick a new one. The add-brick operation will
    take a very long time, and after that the volume status will show the
    status of all bricks apart from the new brick as down.

    The solution is to set brickinfo->port_registered to true for all the
    running bricks when brick multiplexing is enabled.

    Change-Id: Ib0662d99d0fa66b1538947fd96b43f1cbc04e4ff
    Fixes: bz#1560957
    Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
* features/changelog: Update option levels | Aravinda VK | 2018-04-05 | 1 | -0/+7
    Options levels for Changelog Xlator

    Change-Id: Idd246717e38096c44258a990a0939f82e5fc9654
    Updates: #430
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
* cluster/dht: enable lookup-optimize by default | N Balachandran | 2018-04-04 | 2 | -2/+4
    Lookup-optimize has been shown to improve create performance. The code
    has been in the project for several years and is considered stable.

    Enabling this by default in order to test this in the upstream
    regression runs.

    Change-Id: Iab792979ee34f0af4713931e0b5b399c23f65313
    updates: bz#1557435
    BUG: 1557435
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* glusterd: fix txn_opinfo memory leak | Atin Mukherjee | 2018-04-04 | 3 | -9/+25
    For transactions where there's no volname involved (e.g. gluster v
    status), the originator node initiates with the staging phase, which
    means that in op-sm no unlock event is triggered, resulting in a
    txn_opinfo dictionary leak.

    Credits: cynthia.zhou@nokia-sbell.com

    Change-Id: I92fffbc2e8e1b010f489060f461be78aa2b86615
    Fixes: bz#1550339
    Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
* glusterd: honour localtime-logging for all the daemons | Atin Mukherjee | 2018-04-03 | 5 | -0/+30
    Change-Id: I97a70d29365b0a454241ac5f5cae56d93eefd73a
    Fixes: bz#1563334
    Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
* cluster/afr: Prevent ping-event handling on shd | Pranith Kumar K | 2018-04-03 | 1 | -0/+2
    On shd, we shouldn't treat any brick as down based on latency;
    otherwise self-heal will never happen.

    fixes: bz#1562717
    Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* glusterd: setting mgmt_v3_timer->timer to NULL after deleting mgmt_v3_timer | Sanju Rakonde | 2018-04-02 | 1 | -1/+0
    We are setting mgmt_v3_timer->timer to NULL after mgmt_v3_timer is
    deleted, which is unnecessary, so the statement is removed.

    This issue was caught while running glusterd with ASAN.

    Change-Id: Ied1f91590a2c64ec1af36d4de9c3febd6cf94bb9
    Fixes: bz#1562907
    Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* mount/fuse: Set default fuse reader thread count to 1 | Krutika Dhananjay | 2018-04-02 | 1 | -1/+1
    Updates #412
    Change-Id: Ida53d8b630feabb856a3551fa888f92382ade768
    Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* cluster/dht: Update dht option levels | N Balachandran | 2018-04-02 | 1 | -2/+16
    Set the levels for DHT options based on
    https://review.gluster.org/#/c/19466/

    Change-Id: I51b31a706a0b9517404e83224c89de145fd5d7e1
    updates: #430
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* mount/fuse: Add support for multi-threaded fuse readers | Krutika Dhananjay | 2018-04-02 | 5 | -83/+168
    Usage:
    Use 'reader-thread-count=<NUM>' as a command line option to set the
    thread count at the time of mounting the volume.

    The next task is to make these threads auto-scale based on the load,
    instead of having the user remount the volume every time to change the
    thread count.

    Updates #412
    Change-Id: I94aa1505e5ae6a133683d473e0e4e0edd139b76b
    Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* cluster/dht: Update layout in inode only on success | N Balachandran | 2018-04-02 | 2 | -4/+24
    With lookup-optimize enabled, gf_defrag_settle_hash in rebalance
    sometimes flips the on-disk layout on volume root post the migration of
    all files in the directory.

    This is sometimes seen when attempting to fix the layout of a directory
    multiple times before calling gf_defrag_settle_hash.
    dht_fix_layout_of_directory generates a new layout in memory but updates
    it in the inode ctx before it is set on disk. The layout may be
    different the second time around due to
    dht_selfheal_layout_maximize_overlap. If the layout is then not written
    to the disk, the inode now contains the wrong layout.
    gf_defrag_settle_hash does not check the correctness of the layout in
    the inode before updating the commit-hash and writing it to the disk
    thus changing the layout of the directory.

    Change-Id: Ie1407d92982518f2a0c40ec70ad370b34a87b4d4
    updates: bz#1557435
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* Revert "glusterd: handling brick termination in brick-mux" | Sanju Rakonde | 2018-03-29 | 4 | -55/+25
    This reverts commit a60fc2ddc03134fb23c5ed5c0bcb195e1649416b.

    This commit was causing multiple tests to time out when brick
    multiplexing is enabled. With further debugging, it's found that even
    though the volume stop transaction is converted into mgmt_v3 to allow
    the remote nodes to follow the synctask framework to process the
    command, there are other callers of glusterd_brick_stop () which are not
    synctask based.

    Change-Id: I7aee687abc6bfeaa70c7447031f55ed4ccd64693
    updates: bz#1545048
* afr: add new value for read-hash-mode volume option | Ravishankar N | 2018-03-29 | 6 | -32/+119
    Updates: #363

    This new value (3) will try to wind read requests to the child of AFR
    having the least amount of pending requests in its queue.

    Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
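The selection rule for the new read-hash-mode value can be pictured with a small sketch. The `pick_read_child` function and its two list arguments are hypothetical; the commit only states that reads go to the child with the fewest pending requests.

```python
def pick_read_child(pending, readable):
    """Return the index of the readable child with the fewest pending
    requests, or -1 if no child is readable.

    pending[i]  -- number of requests queued on child i (assumed counter)
    readable[i] -- whether child i may serve reads at all
    """
    candidates = [i for i, ok in enumerate(readable) if ok]
    if not candidates:
        return -1
    # min() keeps the first candidate on ties, so lower indices win.
    return min(candidates, key=lambda i: pending[i])
```

For example, with queue depths `[5, 2, 7]` the second child is chosen; if that child is not readable, the choice falls back to the first.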
* cluster/ec: send list-node-uuids request to all subvolumes | Xavi Hernandez | 2018-03-28 | 1 | -1/+1
    The xattr trusted.glusterfs.list-node-uuids was only sent to a single
    subvolume. This was returning null uuids from the other subvolumes as if
    they were down.

    This fix forces that xattr to be requested from all subvolumes.

    Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f
    fixes: bz#1561406
    Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* glusterd: changing the op-version of volume stop mgmt v3 | Kaleb S. KEITHLEY | 2018-03-28 | 1 | -3/+3
    log message describe the actual test

    Change-Id: I1ea7300a6b186032a65236492d6d2a6eef0ab983
    fixes: bz#1560441
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
* glusterd: handling brick termination in brick-mux | Sanju Rakonde | 2018-03-28 | 4 | -25/+55
    Problem: There's a race between the last glusterfs_handle_terminate()
    response sent to glusterd and the kill that happens immediately if the
    terminated brick is the last brick.

    Solution: When it is the last brick of the brick process, instead of
    glusterfsd killing itself, glusterd will kill the process in case of
    brick multiplexing. The gf_attach utility is changed accordingly.

    Change-Id: I386c19ca592536daa71294a13d9fc89a26d7e8c0
    fixes: bz#1545048
    BUG: 1545048
    Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* cluster/dht: ENOSPC will not fail rebalance | N Balachandran | 2018-03-28 | 1 | -6/+2
    ENOSPC returned by a file migration is no longer considered a rebalance
    failure.

    Change-Id: I21cf3a8acdc827bc478e138d6cb5db649d53a28c
    fixes: bz#1553598
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* Quota: heal directory on newly added bricks when quota limit is reached | Sanoj Unnikrishnan | 2018-03-28 | 4 | -4/+52
    Problem: If a lookup is done on a newly added brick for a path on which
    the limit has been reached, the lookup fails to heal the directory tree
    due to quota.

    Solution: Tag the lookup as an internal fop and ignore it in quota.
    Since marking a fop as internal does not usually give enough contextual
    information, new flags are introduced to pass the contextual info.

    dict_check_flag and dict_set_flag are added to aid flag operations. A
    flag is a single bit in a bit array (currently limited to 256 bits).

    Change-Id: Ifb6a68bcaffedd425dd0f01f7db24edd5394c095
    fixes: bz#1505355
    BUG: 1505355
    Signed-off-by: Sanoj Unnikrishnan <sunnikri@redhat.com>
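The flag mechanism mentioned in the quota commit above (one bit per flag in a 256-bit array) can be sketched as below. The helper names `new_flags`, `set_flag` and `check_flag` are hypothetical and only loosely mirror dict_set_flag and dict_check_flag; the real functions operate on a gluster dict and their signatures are not shown in this log.

```python
FLAG_BITS = 256  # limit stated in the commit message


def new_flags():
    """Return an empty 256-bit flag array (32 bytes)."""
    return bytearray(FLAG_BITS // 8)


def _check_range(bit):
    if not 0 <= bit < FLAG_BITS:
        raise ValueError("flag index out of range: %d" % bit)


def set_flag(flags, bit):
    """Set a single flag bit in place."""
    _check_range(bit)
    flags[bit // 8] |= 1 << (bit % 8)


def check_flag(flags, bit):
    """Return True if the given flag bit is set."""
    _check_range(bit)
    return bool(flags[bit // 8] & (1 << (bit % 8)))
```

Packing flags this way keeps the value small and fixed-size, at the cost of a hard upper bound on the number of distinct flags.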
* quick-read: Provide statistics to the monitor | Poornima G | 2018-03-28 | 2 | -26/+89
    Updates: #425
    Change-Id: Iea5198821f4eabc46bc63529afa4a92d4b4c2be0
    Signed-off-by: Poornima G <pgurusid@redhat.com>
* glusterd: changing the op-version of volume stop mgmt v3 | Sanju Rakonde | 2018-03-27 | 1 | -1/+1
    Change-Id: Iefc5a00d36436b23181871fa365f27b8d90cff0a
    fixes: bz#1560441
    Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* glusterd: Implementing volume stop in mgmt v3 | Sanju Rakonde | 2018-03-26 | 2 | -1/+66
    Change-Id: I8f9c594cf56331d54eb4884335699744685ef20d
    fixes: bz#1560441
    Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* nl-cache: Provide statistics to the monitor | Poornima G | 2018-03-24 | 1 | -9/+61
    Updates: #429
    Change-Id: Ic2e64422055f1838d5d453643c739ef1e9319cfe
    Signed-off-by: Poornima G <pgurusid@redhat.com>
* md-cache: Provide statistics to the monitor | Poornima G | 2018-03-24 | 1 | -9/+57
    Updates: #427
    Change-Id: Ib1f45016ac75d7bc2755db0dd4b68ce1d95d26c3
    Signed-off-by: Poornima G <pgurusid@redhat.com>
* features/quota: Add new fields to translator options for GD2 | Sanoj Unnikrishnan | 2018-03-24 | 2 | -29/+51
    alert-time, soft timeout, hard timeout, default soft limit and
    deem-statfs will be settable through the volume set command, hence they
    are marked as settable. Other options are used only via quota commands.

    Updates #302
    Change-Id: I02d258cc3aa7fe58ccbadd59441cce64cfd9ba6e
    Signed-off-by: Sanoj Unnikrishnan <sunnikri@redhat.com>
* libgfchangelog: Correct the log message | Niklas Hambüchen | 2018-03-24 | 1 | -1/+1
    Provide correct error message for changelog end time check

    Updated error message to print "wrong result for end".

    Original patch by Keith Schincke <kschinck@redhat.com> from
    https://review.gluster.org/#/c/8121/

    Change-Id: Ia3458cbac7784bfc71c05da67391a3f8259f18f0
    BUG: 1559126
    Signed-off-by: Niklas Hambüchen <mail@nh2.me>
* python: Remove all uses of find_library. Fixes #1450593 | Niklas Hambüchen | 2018-03-24 | 1 | -2/+1
    `find_library()` doesn't consider LD_LIBRARY_PATH on Python < 3.6.

    Change-Id: Iee26085cb5d14061001f19f032c2664d69a378a8
    BUG: 1450593
    Signed-off-by: Niklas Hambüchen <mail@nh2.me>
* rpcsvc: enable ownthread feature for glusterfs4_0_fop_prog | Milind Changire | 2018-03-22 | 1 | -0/+1
    Ownthread feature needs enabling for glusterfs4_0_fop_prog

    Change-Id: Idce63eb094ae0fdfcddbd52d0dee25aa0e074926
    BUG: 1559075
    Signed-off-by: Milind Changire <mchangir@redhat.com>
* cluster/ec: fix SHD crash for null gfid's | Xavi Hernandez | 2018-03-21 | 1 | -0/+8
    When the self-heal daemon is doing a full sweep it uses readdirp to get
    extra stat information from each file. This information is obtained in
    two steps by the posix xlator: first the directory is read to get the
    entries and then a stat is done on each entry to get additional info.
    Between these two steps, it's possible that the file is removed by the
    user, so we'll get an error, leaving the stat info empty.

    EC's heal daemon was using the gfid blindly, causing an assert failure
    when protocol/client was trying to encode the gfid.

    To fix the problem a check has been added. If we detect a null gfid, we
    simply ignore it and continue healing.

    Change-Id: I2e4acdcecd0b6951055e50d1c37d686a2186a228
    BUG: 1558016
    Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/afr: Switch to active-fd-count for open-fd checks | Pranith Kumar K | 2018-03-21 | 1 | -8/+8
    BUG: 1557932
    Change-Id: I3783e41b3812267bc10c0d05d062a31396ce135b
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* storage/posix: Add active-fd-count option in gluster | Pranith Kumar K | 2018-03-21 | 2 | -32/+32
    Problem:
    When dd happens on a sharded replicate volume, all the writes on shards
    happen through anon-fd. When the writes don't come quickly enough, the
    old anon-fd closes and a new fd gets created to serve the new writes.
    open-fd-count is decremented only after the fd is closed as part of
    fd_destroy(). So even when one fd is on the way to being closed, a new
    fd will be created, and during this short period it appears as though
    there are multiple fds opened on the file. AFR thinks another
    application opened the same file and switches off eager-lock, leading to
    extra latency.

    Fix:
    Have a different option called active-fd whose life cycle starts at
    fd_bind() and ends just before fd_destroy().

    BUG: 1557932
    Change-Id: I2e221f6030feeedf29fbb3bd6554673b8a5b9c94
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
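The difference between the two counters in the storage/posix commit above can be illustrated with a toy model. The `InodeFdCounters` class and its method names are made up for this sketch, and incrementing open-fd-count at bind time is an assumption; the commit only states when open-fd-count is decremented and when the active-fd life cycle begins and ends.

```python
class InodeFdCounters:
    """Toy per-inode counters; not the posix xlator's data structure."""

    def __init__(self):
        self.open_fd_count = 0
        self.active_fd_count = 0

    def fd_bind(self):
        # A new fd becomes usable on this inode.
        self.open_fd_count += 1      # assumption: counted from bind
        self.active_fd_count += 1

    def fd_begin_close(self):
        # The fd is on its way to being closed, but fd_destroy()
        # has not run yet, so it no longer counts as active.
        self.active_fd_count -= 1

    def fd_destroy(self):
        # Only now does open_fd_count drop.
        self.open_fd_count -= 1
```

In the anon-fd scenario, the old fd begins closing, a new fd is bound, and only then is the old one destroyed. In between, `open_fd_count` reads 2 while `active_fd_count` reads 1, which is why a check based on the active count does not mistake the transition for a second opener.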
* features/shard: Do list_del_init() while list memory is valid | Pranith Kumar K | 2018-03-20 | 1 | -1/+1
    Problem:
    shard_post_lookup_fsync_handler() goes over the list of inode-ctx that
    need to be fsynced and in cbk it removes each of the inode-ctx from the
    list. When the first member of the list is removed, it tries to modify
    the list head's memory with the latest next/prev, and when this happens
    there is no guarantee that the list head, which is in the stack memory
    of shard_post_lookup_fsync_handler(), is still valid.

    Fix:
    Do list_del_init() in the loop before winding fsync.

    BUG: 1557876
    Change-Id: If429d3634219e1a435bd0da0ed985c646c59c2ca
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* georep : Pause/Resume of geo-replication with wrong user | Sunny Kumar | 2018-03-20 | 1 | -0/+23
    Performing pause/resume on geo-replication with the wrong user (a user
    other than the one the session was set up with) always returns success.
    This further leads to snapshot creation failure, as an active
    geo-replication session is detected.

    Change-Id: I6e96e8dd3e861348b057475387f0093cb903ae88
    BUG: 1550936
    Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
* glusterd: TLS verification fails while using intermediate CA | Mohit Agrawal | 2018-03-19 | 1 | -0/+3
    Problem: TLS verification fails while using an intermediate CA if mgmt
    SSL is enabled.

    Solution: There are two main causes of the TLS verification failure:
    1) ssl_api is not called to set cert_depth.
    2) The current code does not allow setting the certificate depth while
       mgmt SSL is enabled.

    After applying this patch, to set the certificate depth the user needs
    to set the option
        transport.socket.ssl-cert-depth <depth>
    in /var/lib/glusterd/secure_acccess instead of in
    /etc/glusterfs/glusterd.vol.

    When secure_mgmt is set in ctx, the value of cert-depth is checked and
    saved in ctx. If the user does not provide a value for cert-depth, the
    default value of 1 is used.

    BUG: 1555154
    Change-Id: I89e9a9e1026e37efb5c20f9ec62b1989ef644f35
    Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* glusterd: glusterd crash in gd_mgmt_v3_unlock_timer_cbk | Gaurav Yadav | 2018-03-15 | 1 | -1/+0
    Memory of the same pointer was cleaned up twice inside
    gd_mgmt_v3_unlock_timer_cbk, causing glusterd to crash.

    Change-Id: I9147241d995780619474047b1010317a89b9965a
    BUG: 1550339
* cluster/afr: Make AFR eager-locking similar to EC | Pranith Kumar K | 2018-03-14 | 9 | -908/+813
    Problem:
    1) AFR's eager-lock only works for data transactions.
    2) When there are conflicting writes, the write with the conflicting
       region initiates unlock of the eager-lock, leading to extra pre-ops
       and post-ops on the file. When eager-lock goes off, it leads to extra
       fsyncs for random-write workloads in AFR.

    Solution (that is modeled after EC):
    In EC, when there is a conflicting write, it waits for the current write
    to complete before it winds the conflicted write. This leads to better
    utilization of network and disk, because we will not be doing extra
    xattrops, FSYNCs and inodelk/unlock. Moved fd based counters to inode
    based counters.

    I tried to model the solution on EC's locking, but the AFR version is
    not identical because we had to keep backward compatibility.

    Lifecycle of lock:
    ==================
    The first transaction is added to the inode->owners list and an inodelk
    is sent on the wire. All subsequent transactions are put in the
    inode->waiters list until the first transaction completes inodelk and
    [f]xattrop completely. Once [f]xattrop also completes, each request in
    the inode->waiters list is checked for conflicts with the existing locks
    in the inode->owners list; if there is no conflict, it is added to the
    inode->owners list and resumed. When these transactions complete the fop
    phase they are moved to the inode->post_op list, and the transactions
    that were paused because of conflicts are resumed. Post-op and unlock
    are not issued on the wire until the last transaction on that inode.
    When the last transaction has to perform post-op, it can choose to sleep
    for the delayed-post-op-secs value. If any other transaction comes
    during that time, it wakes up the sleeping transaction and takes over
    the ownership of the lock, and the cycle continues.

    If delayed-post-op-secs expires, the timer thread wakes up the sleeping
    transaction, which sets lock->release to true and starts doing post-op
    and then unlock. Any transactions that come during this time are put in
    the inode->frozen list. Once the previous unlock completes, the frozen
    list is moved to the waiters list, the first element of this waiters
    list is moved to the owners list and attempts the lock, and the cycle
    continues. This is the general idea. There is logic at the time of
    delaying, at the time of a new transaction, and in the flush fop to wake
    up existing sleeping transactions or to choose whether to delay a
    transaction, which is subject to change based on future enhancements.

    Fixes: #418
    BUG: 1549606
    Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
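The owners/waiters part of the lock lifecycle described in the AFR eager-locking commit above can be sketched as follows. This is a deliberately reduced model: transactions are plain `(start, end)` byte ranges, the delayed post-op timer and the frozen list are left out, and the class and method names (`InodeLock`, `submit`, `lock_acquired`, `fop_done`) are hypothetical.

```python
from collections import deque


def overlaps(a, b):
    """Half-open byte ranges (start, end) conflict if they intersect."""
    return a[0] < b[1] and b[0] < a[1]


class InodeLock:
    def __init__(self):
        self.acquired = False   # inodelk + [f]xattrop completed
        self.owners = []        # transactions allowed to run their fop
        self.waiters = deque()  # transactions waiting for lock/conflict
        self.post_op = []       # transactions done with the fop phase

    def submit(self, txn):
        if not self.acquired and not self.owners:
            # First transaction: becomes owner, inodelk goes on the wire.
            self.owners.append(txn)
            return
        self.waiters.append(txn)
        if self.acquired:
            self._promote()

    def lock_acquired(self):
        self.acquired = True
        self._promote()

    def fop_done(self, txn):
        self.owners.remove(txn)
        self.post_op.append(txn)
        self._promote()  # resume transactions paused by this one

    def _promote(self):
        still_waiting = deque()
        for txn in self.waiters:
            if any(overlaps(txn, o) for o in self.owners):
                still_waiting.append(txn)
            else:
                self.owners.append(txn)
        self.waiters = still_waiting
```

With writes to `(0, 10)`, `(5, 15)` and `(20, 30)` submitted in that order, only the first is an owner until the lock is acquired; afterwards `(20, 30)` runs alongside it, while `(5, 15)` waits until `(0, 10)` finishes its fop phase.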
* cluster/ec: Change default read policy to gfid-hash | Ashish Pandey | 2018-03-14 | 1 | -1/+1
    Problem:
    Whenever we read data from a file over NFS, NFS reads more data than
    requested and caches it. Based on the stat information, it checks
    whether the cached/pre-read data is still valid.

    Consider a 4 + 2 EC volume with all the bricks on different nodes. In
    EC, with the round-robin read policy, reads are sent to different sets
    of data bricks. This balances the read fops across all the bricks and
    avoids overloading the same set of bricks.

    Due to small differences in clock speed, it is possible that atime,
    mtime or ctime differ slightly between bricks. That might cause a
    different stat to be returned to NFS, based on which NFS will discard
    cached/pre-read data which has actually not changed and could be used.

    Solution:
    Change the read policy for EC to gfid-hash. That will force all the
    reads of a file to go to the same set of bricks.

    Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
    BUG: 1554743
    Signed-off-by: Ashish Pandey <aspandey@redhat.com>
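The contrast between the two read policies in the cluster/ec commit above can be sketched like this. The hash used here (the gfid's integer value modulo the brick count) and the function names are assumptions for illustration; the commit does not describe how EC derives the brick set from the gfid.

```python
import uuid


def round_robin_start(counter, bricks):
    """Starting brick rotates with every read."""
    return counter % bricks


def gfid_hash_start(gfid, bricks):
    """Starting brick depends only on the file's gfid (assumed hash)."""
    return gfid.int % bricks


def read_bricks(start, fragments, bricks):
    """Pick `fragments` consecutive bricks beginning at `start`."""
    return [(start + i) % bricks for i in range(fragments)]
```

For a 4 + 2 volume (6 bricks, 4 fragments needed per read), consecutive round-robin reads use different brick sets, whereas every read of a given gfid maps to one fixed set, so the stat returned to NFS always comes from the same bricks.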
* cluster/ec: avoid delays in self-heal | Xavi Hernandez | 2018-03-14 | 4 | -48/+93
    Self-heal creates a thread per brick to sweep the index looking for
    files that need to be healed. These threads are started before the
    volume comes online, so nothing is done but waiting for the next sweep.
    This happens once per minute.

    When a replace brick command is executed, the new graph is loaded and
    all index sweeper threads are started. When all bricks have reported, a
    getxattr request is sent to the root directory of the volume. This
    causes a heal on it (because the new brick doesn't have good data), and
    marks its contents as pending to be healed. This is done by the index
    sweeper thread on the next round, one minute later.

    This patch solves this problem by waking all index sweeper threads after
    a successful check on the root directory.

    Additionally, the index sweep thread scans the index directory
    sequentially, but it might happen that after healing a directory entry
    more index entries are created but skipped by the current directory
    scan. This causes the remaining entries to be processed on the next
    round, one minute later. The same can happen in the next round, so the
    heal runs in bursts and takes a long time to finish, especially on
    volumes with many directory levels.

    This patch solves this problem by immediately restarting the index sweep
    if a directory has been healed.

    Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e
    BUG: 1547662
    Signed-off-by: Xavi Hernandez <jahernan@redhat.com>
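The second change in the self-heal commit above, restarting the sweep right away when a directory was healed, can be sketched as a loop. `sweep_until_stable`, its callbacks and the `max_rounds` guard are hypothetical and exist only for this illustration.

```python
def sweep_until_stable(list_index, heal, max_rounds=100):
    """Repeat the index sweep while the previous pass healed a directory.

    list_index() -- returns a snapshot of the current index entries
    heal(entry)  -- heals one entry, returns True if it was a directory
    Returns the number of sweep rounds that were run.
    """
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        healed_directory = False
        for entry in list_index():
            if heal(entry):
                # Healing a directory may have added new index entries
                # that this pass's snapshot does not contain.
                healed_directory = True
        if not healed_directory:
            break  # nothing new can have appeared; wait for next wake-up
    return rounds
```

Without the restart, the entries created by healing a directory would sit in the index until the next periodic sweep, which is the burst behaviour the commit describes.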
* cluster/dht: Skipped files are not treated as errors | N Balachandran | 2018-03-12 | 1 | -9/+11
    For skipped files, use a return value of 1 to prevent error messages
    being logged.

    Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861
    BUG: 1553598
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* rpcsvc: correct event-thread scaling | Milind Changire | 2018-03-12 | 1 | -3/+4
    Problem:
    The auto thread count derived from the number of attaches and detaches
    was reset to 1 when server_reconfigure() was called.

    Solution:
    Avoid resetting auto-thread-count to 1.

    Change-Id: Ic00e86adb81ba3c828e354a6ccb638209ae58b3e
    BUG: 1547888
    Signed-off-by: Milind Changire <mchangir@redhat.com>
* protocol: Fix 4.0 client, parsing older iatt in dict | ShyamsundarR | 2018-03-10 | 4 | -44/+20
    In a mixed mode cluster involving 4.0 and older 3.x bricks, if clients
    are newer, then the iatt encoded in the dictionary can be of the older
    iatt format, which a newer client will map incorrectly to the newer
    structure.

    This causes failures in FOPs that depend on this iatt for some
    functionality (seen in mkdir operations failing as EIO, when DHT hits
    its internal setxattr call).

    The fix provided is to convert the iatt in the dict, based on which RPC
    version is used to communicate with the server.

    IOW, this is the reverse of change in commit "b966c7790e"

    Tested using a mixed mode cluster (i.e bricks in 3.12 and 4.0 versions)
    and a mixed set of clients, 3.12 and 4.0 clients.

    There is no regression test provided, as this needs a mixed mode cluster
    to test and validate.

    Change-Id: I454e54651ca836b9f7c28f45f51d5956106aefa9
    BUG: 1554053
    Signed-off-by: ShyamsundarR <srangana@redhat.com>
* protocol: Added iatt conversion to older format | ShyamsundarR | 2018-03-10 | 3 | -0/+94
    Added iatt conversion to an older format, when dealing with older RPC
    versions. This enables iatt structure conformance when dealing with
    older clients.

    This helps fix rolling upgrade from 3.x versions to 4.0 version of
    gluster by sending the right iatt in the dictionary when DHT requests
    the same.

    Change-Id: Ieaf925f81f8c7798a8fba1e90a59fa9dec82856c
    BUG: 1544699
    Signed-off-by: ShyamsundarR <srangana@redhat.com>
* protocol/client: fix memory corruption | Xavi Hernandez | 2018-03-09 | 6 | -92/+78
    There was an issue where some accesses to the saved_fds list were
    protected by the wrong mutex (lock instead of fd_lock).

    Additionally, the retrieval of fdctx from the fd's context and any
    checks done on it are now also protected by fd_lock, to prevent fdctx
    from becoming outdated just after retrieving it.

    Change-Id: If2910508bcb7d1ff23debb30291391f00903a6fe
    BUG: 1553129
    Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* glusterd: volume get fixes for client-io-threads & quorum-type | Ravishankar N | 2018-03-07 | 5 | -7/+52
    1. If a replica volume created on glusterfs-3.8 was upgraded to
       glusterfs-3.12, `gluster vol get volname client-io-threads` displayed
       'on' even though it wasn't and the xlator wasn't loaded on the
       client-graph. This was due to removing certain checks in
       glusterd_get_default_val_for_volopt as a part of commit
       47604fad4c2a3951077e41e0c007ceb979bb2c24. Fix it.

    2. Also, as a part of op-version bump-up, client-io-threads was being
       loaded on the clients during volfile regeneration. Prevent it.

    3. AFR assumes quorum-type to be auto in newly created replica 3 (odd
       replica in general) volumes but `gluster vol get quorum-type`
       displays 'none'. Fix it.

    Change-Id: I19e586361ed1065c70fb378533d3b4dac1095df9
    BUG: 1545056
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>