summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-common.c
Commit message (Collapse)AuthorAgeFilesLines
* afr: more quorum checks in lookup and new entry markingRavishankar N2020-06-161-11/+13
| | | | | | | | | | | | | | | | Problem: See github issue for details. Fix: -In lookup if the entry exists in 2 out of 3 bricks, don't fail the lookup with ENOENT just because there is an entrylk on the parent. Consider quorum before deciding. -If entry FOP does not succeed on quorum no. of bricks, do not perform new entry mark. Fixes: #1303 Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Prioritize ENOSPC over other errorskarthik-us2020-06-051-1/+3
| | | | | | | | | | | | | | | | | | | | | | Problem: In a replicate/arbiter volume if file creations or writes fails on quorum number of bricks and on one brick it is due to ENOSPC and on other brick it fails for a different reason, it may fail with errors other than ENOSPC in some cases. Fix: Prioritize ENOSPC over other lesser priority errors and do not set op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any error_no which can be misinterpreted by __afr_dir_write_finalize(). Also removing the function afr_has_arbiter_fop_cbk_quorum() which might consider a successful reply form a single brick as quorum success in some cases, whereas we always need fop to be successful on quorum number of bricks in arbiter configuration. Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08 Fixes: #1254 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: fix memory leak in afr_priv_destroy()Dmitry Antipov2020-06-011-0/+2
| | | | | | | | | | | | | | | | Found with GCC ASan: Direct leak of 202 byte(s) in 2 object(s) allocated from: #0 0x7fc6c6ef0667 in __interceptor_malloc (/usr/lib64/libasan.so.6+0xb0667) #1 0x7fc6c6bd145b in __gf_malloc /path/to/glusterfs/libglusterfs/src/mem-pool.c:175 #2 0x7fc6c6bd17a3 in gf_vasprintf /path/to/glusterfs/libglusterfs/src/mem-pool.c:223 #3 0x7fc6c6bd1993 in gf_asprintf /path/to/glusterfs/libglusterfs/src/mem-pool.c:243 #4 0x7fc6b0dc92f6 in init /path/to/glusterfs/xlators/cluster/afr/src/afr.c:590 ... Change-Id: I29feb1d30a045fb70472758e6ed4e195888090b2 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Fixes: #1278
* afr: event gen changesRavishankar N2020-04-241-74/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The general idea of the changes is to prevent resetting event generation to zero in the inode ctx, since event gen is something that should follow 'causal order'. Change #1: For a read txn, in inode refresh cbk, if event_generation is found zero, we are failing the read fop. This is not needed because change in event gen is only a marker for the next inode refresh to happen and should not be taken into account by the current read txn. Change #2: The event gen being zero above can happen if there is a racing lookup, which resets even get (in afr_lookup_done) if there are non zero afr xattrs. The resetting is done only to trigger an inode refresh and a possible client side heal on the next lookup. That can be acheived by setting the need_refresh flag in the inode ctx. So replaced all occurences of resetting even gen to zero with a call to afr_inode_need_refresh_set(). Change #3: In both lookup and discover path, we are doing an inode refresh which is not required since all 3 essentially do the same thing- update the inode ctx with the good/bad copies from the brick replies. Inode refresh also triggers background heals, but I think it is okay to do it when we call refresh during the read and write txns and not in the lookup path. The .ts which relied on inode refresh in lookup path to trigger heals are now changed to do read txn so that inode refresh and the heal happens. Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec Fixes: #1179 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Erik Jacobson <erik.jacobson at hpe.com>
* cluster/afr: Fixes for haloPranith Kumar K2020-03-131-4/+15
| | | | | | | | | | | Current implementation assumes that ping-event will come after connect event but that may not be the case in the cases where after socket connection fds need to be re-opened which would consume more time. So handle any order of the ping/child-up events. fixes: bz#1800583 Change-Id: I6bcdc0caa503bdc039ef2b4739fbf4afae121f05 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: prevent spurious entry heals leading to gfid split-brainRavishankar N2020-02-181-2/+2
| | | | | | | | | | | | | | | | | | | | Problem: In a hyperconverged setup with granular-entry-heal enabled, if a file is recreated while one of the bricks is down, and an index heal is triggered (with the brick still down), entry-self heal was doing a spurious heal with just the 2 good bricks. It was doing a post-op leading to removal of the filename from .glusterfs/indices/entry-changes as well as erroneous setting of afr xattrs on the parent. When the brick came up, the xattrs were cleared, resulting in the renamed file not getting healed and leading to gfid split-brain and EIO on the mount. Fix: Proceed with entry heal only when shd can connect to all bricks of the replica, just like in data and metadata heal. fixes: bz#1801624 Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/thin-arbiter: Wait for TA connection before ta-file lookupAshish Pandey2020-02-171-17/+21
| | | | | | | | | | | | | | | | | | Problem: When we mount a ta volume, as soon as 2 data bricks are connected we consider that the mount is done and then send a lookup/create on ta file on ta node. However, this connection with ta node might not have been completed. Due to this delay, ta replica id file will not be created and we will see ENOTCONN error in log file if we do lookup. Solution: As we know that this ta node could have a higher latency, we should wait for reasonable time for connection to happen before sending lookup/create on replica id file. fixes: bz#1720463 Change-Id: I36f90865afe617e4e84cee57fec832a16f5dd6cc
* tests: Fix spurious self-heald.t failurePranith Kumar K2020-02-111-23/+15
| | | | | | | | | | | | | | | | | | Problem: heal-info code assumes that all indices in xattrop directory definitely need heal. There is one corner case. The very first xattrop on the file will lead to adding the gfid to 'xattrop' index in fop path and in _cbk path it is removed because the fop is zero-xattr xattrop in success case. These gfids could be read by heal-info and shown as needing heal. Fix: Check the pending flag to see if the file definitely needs or not instead of which index is being crawled at the moment. fixes: bz#1801623 Change-Id: I79f00dc7366fedbbb25ec4bec838dba3b34c7ad5 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: simplify afr_has_quorum()Yaniv Kaul2020-01-021-26/+7
| | | | | | | | | | | | | 1. Perform AFR_COUNT() once, in afr_has_quorum() and pass the result to afr_lookup_has_quorum() 2. Simplify afr_lookup_has_quorum() - pass less parameters to it. (Via the change in item 1 above). 3. Make afr_is_add_replica_mount_lookup_on_root() static function. 4. Remove dead code - afr_decide_heal_info() which was not used. Change-Id: If9168cd01e22788a0e60b91e315787d2aa60e97b updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr: make heal info locklessRavishankar N2019-12-121-200/+168
| | | | | | | | | | | | | | | | | | | Changes in locks xlator: Added support for per-domain inodelk count requests. Caller needs to set GLUSTERFS_MULTIPLE_DOM_LK_CNT_REQUESTS key in the dict and then set each key with name 'GLUSTERFS_INODELK_DOM_PREFIX:<domain name>'. In the response dict, the xlator will send the per domain count as values for each of these keys. Changes in AFR: Replaced afr_selfheal_locked_inspect() with afr_lockless_inspect(). Logic has been added to make the latter behave same as the former, thus not breaking the current heal info output behaviour. fixes: bz#1774011 Change-Id: Ie9e83c162aa77f44a39c2ba7115de558120ada4d Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* Revert "afr: make heal info lockless"Ravishankar N2019-11-281-138/+65
| | | | | | | | This reverts commit fce5f68bc72d448490a0d41be494ac54a9181b3c. I merged the wrong patch by mistake! Hence reverting it. updates: bz#1774011 Change-Id: Id7d6ed1d727efc02467c8a9aea3374331261ebd5
* afr: make heal info locklessRavishankar N2019-11-281-65/+138
| | | | | | | | | | | | | | | | | | | Changes in locks xlator: Added support for per-domain inodelk count requests. Caller needs to set GLUSTERFS_MULTIPLE_DOM_LK_CNT_REQUESTS key in the dict and then set each key with name 'GLUSTERFS_INODELK_DOM_PREFIX:<domain name>'. In the response dict, the xlator will send the per domain count as values for each of these keys. Changes in AFR: Replaced afr_selfheal_locked_inspect() with afr_lockless_inspect(). Logic has been added to make the latter behave same as the former, thus not breaking the current heal info output behaviour. fixes: bz#1774011 Change-Id: I9ae08ce768b39aeb6ee230207b5b7fa744176952 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: fix log floodingRavishankar N2019-11-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | Commit "ccf33e789 - dict.c: remove redundant checks" removed some NULL checks in certain dict functions. This caused flooding of fuse mount logs when I/O was done on the mount on a replica volume: Message: W [dict.c:1478:dict_get_with_refn] (-->/usr/local/lib/libglusterfs.so.0(dict_get_uint32+0x4d) [0x7ff9121ec963] -->/usr/local/lib/libglusterfs.so.0(dict_get_with_ref+0x90) [0x7ff9121eb93f] -->/usr/local/lib/libglusterfs.so.0(+0x229be) [0x7ff9121eb9be] ) 0-dict: dict OR key (glusterfs.lk.lkmode) is NULL [Invalid argument] Fix: In the relevant AFR functions, check that dict is not NULL before trying to perform operations on it. See bug description for more details. fixes: bz#1772006 Change-Id: I30c89c0b5d6c80cc86a6047aae70127769412120 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: lock healing changesRavishankar N2019-10-301-35/+783
| | | | | | | | | | | | | | | | | | | Implements lock healing for gluster-block fencing use case. If mandatory lock is enabled: - Add domain lock/unlock to afr_lk fop. - Maintain a list of locks to be healed in afr_private_t. - Add lock to the list if afr_lk(F_SETLK or F_SETLKW) was sucessful. - Remove it from the list during afr_lk(F_UNLCK). - On child_down, mark lock as needing heal on that child. If lock is lost on quorum no. of bricks, remove it from the list and mark fd bad. - For fds marked as bad, fail the subsequent fd based fops. - On parent up, traverse the list and heal the locks IFF the client is the lk owner and has quorum. (shd does not heal any locks). updates: #613 Change-Id: I03c46ceaea30f5e6236d5ec13f71d843d827f1bc Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Take a copy of xattr-reqPranith Kumar K2019-10-301-2/+7
| | | | | | | | | | Afr adds its own xattrs to the req, so it should take a copy of the dictionary to prevent parent xlator re-using the modified xattr-req to another subvolume fixes: bz#1765155 Change-Id: I268e2dbd1b12323135d369e90a22a8bdde2cf7c2 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: support split-brain CLI for replica 3Ravishankar N2019-10-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Ever since we added quorum checks for lookups in afr via commit bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4, the split-brain resolution commands would not work for replica 3 because there would be no readables for the lookup fop. The argument was that split-brains do not occur in replica 3 but we do see (data/metadata) split-brain cases once in a while which indicate that there are a few bugs/corner cases yet to be discovered and fixed. Fortunately, commit 8016d51a3bbd410b0b927ed66be50a09574b7982 added GF_CLIENT_PID_GLFS_HEALD as the pid for all fops made by glfsheal. If we leverage this and allow lookups in afr when pid is GF_CLIENT_PID_GLFS_HEALD, split-brain resolution commands will work for replica 3 volumes too. Likewise, the check is added in shard_lookup as well to permit resolving split-brains by specifying "/.shard/shard-file.xx" as the file name (which previously used to fail with EPERM). Change-Id: I3c543dea79caf7cfbc1633e9089cb1cdd2538ba9 Fixes: bz#1756938 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: replace afr_frame_return() when possible with direct callYaniv Kaul2019-10-071-8/+5
| | | | | | | | | | | | If you are already under lock, just decrement the call count directly instead of removing the lock, re-taking the lock and decrementing. Implements https://github.com/gluster/glusterfs/issues/728 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I3fa20b4651fbdb826655c5a03baeed46e99b5487
* afr-common.c, afr-self-heal.h: calloc/alloca0 -> malloc/allocaYaniv Kaul2019-09-201-1/+1
| | | | | | | | | | In 3 cases, there was a memory allocation and zeroing, followed directly by populating it with content. Replaced with memory allocation that did not zero the memory. Change-Id: I4fbb5c924fb3a144e415d2368126b784dde760ea updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr/lookup: Pass xattr_req in while doing a selfheal in lookupMohammed Rafi KC2019-09-051-2/+6
| | | | | | | | | | We were not passing xattr_req when doing a name self heal as well as a meta data heal. Because of this, some xdata was missing which causes i/o errors Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f Fixes: bz#1728770 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* afr: wake up index healer threadsRavishankar N2019-08-301-4/+2
| | | | | | | | | | | | | ...whenever shd is re-enabled after disabling or there is a change in `cluster.heal-timeout`, without needing to restart shd or waiting for the current `cluster.heal-timeout` seconds to expire. See BZ 1743988 for more details. Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe fixes: bz#1744548 Reported-by: Glen Kiessling <glenk1973@hotmail.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr - Unused variablesBarak Sason2019-08-241-3/+8
| | | | | | | | | | | -Minor change to if-else structure to avoid code duplication. -Added logging in case method calls fails CID: 1394654 Updates: bz#789278 Change-Id: Ibef4450dc89ddd3bf951303d5b87f503924fd250 Signed-off-by: Barak Sason <bsasonro@redhat.com>
* Multiple files: get trivial stuff done before lockYaniv Kaul2019-08-011-6/+4
| | | | | | | | | Initialize a dictionary for example seems to be prefectly fine to be done before taking a lock. Change-Id: Ib29516c4efa8f0e2b526d512beab488fcd16d2e7 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/ta: Notify the clients only if there are pending healskarthik-us2019-07-121-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In case of thin arbiter, before index healer starts crawling the indices at every heal-timeout interval, even if there is nothing to be healed it will send an upcall notification to all the clients to release any AFR_TA_DOM_NOTIFY locks that they hold. SHD will wait for the upcall to return before proceeding with the heal even though there is nothing to be healed. This will also invalidates the cached information about the bricks states on the clients which leads to extra calls on TA from clients for the next reads & writes if needed. This will impact the IO performance. Fix: - Before sending the upcall to the clients, check for any pending heals on TA without taking any locks. - If there is nothing marked bad on TA, then continue with the index crawl to heal any dirty markings present on the files due to any post-op failure. - If there is a brick marked as bad on TA, then take the AFR_TA_DOM_NOTIFY lock on TA from SHD, get the state on TA and continue with the current healing process. Change-Id: Ieb477bc6cb18bbdfd4e7a0453c5ed79b574ec9d6 fixes: bz#1724184 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr/read: Implement latency based read child selectionMohammed Rafi KC2019-06-201-21/+80
| | | | | | | | | | | | | | | | | Network latency is an important factor selecting a read subvolume. So this patch is adding two new policy. 1) We measure the latency of a child during a GF_DUMP rpc call. Then use this latency to pick a read subvol having the least latency. 2) Second one is an hybrid mode where it calculates the effective latency by multiplying outstanding pending read request and latency, and choose the least one. Change-Id: Ia49c8a08ab61f7dcdad8b8950aa4d338e7accf97 fixes: #520 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* multiple files: another attempt to remove includesYaniv Kaul2019-06-141-2/+0
| | | | | | | | | | | | | | | | | | There are many include statements that are not needed. A previous more ambitious attempt failed because of *BSD plafrom (see https://review.gluster.org/#/c/glusterfs/+/21929/ ) Now trying a more conservative reduction. It does not solve all circular deps that we have, but it does reduce some of them. There is just too much to handle reasonably (dht-common.h includes dht-lock.h which includes dht-common.h ...), but it does reduce the overall number of lines of include we need to look at in the future to understand and fix the mess later one. Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr: thin-arbiter lock release fixesRavishankar N2019-05-101-1/+11
| | | | | | | | | | | | | | | | | | | | - pass fop state instead of afr local to afr_ta_dom_lock_check_and_release() - avoid afr_lock_release_synctask() being called simultaneosuly from notify code path and transaction (post-op) code path due to races. - Check if the post-op on TA is valid based on event_gen checks. - Invalidate in-memory information when we get TA child down. Note: Thi patch addresses some pending review comments of commit 053b1309dc8fbc05fcde5223e734da9f694cf5cc (https://review.gluster.org/#/c/glusterfs/+/20095/) fixes: bz#1698449 Change-Id: I2ccd7e1b53362f9f3fed8680aecb23b5011eb18c Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Set lk-owner before inodelk/entrylk/lkPranith Kumar K2019-04-221-18/+19
| | | | | | Updates: bz#1624701 Change-Id: I7152c28ad85925abccdcc4cd6de8cb2a2b847a51 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove local from owners_list on failure of lock-acquisitionPranith Kumar K2019-04-151-4/+4
| | | | | | | | | | | | | When eager-lock lock acquisition fails because of say network failures, the local is not being removed from owners_list, this leads to accumulation of waiting frames and the application will hang because the waiting frames are under the assumption that another transaction is in the process of acquiring lock because owner-list is not empty. Handled this case as well in this patch. Added asserts to make it easier to find these problems in future. fixes bz#1696599 Change-Id: I3101393265e9827755725b1f2d94a93d8709e923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Invalidate inode on change of split-brain-choicePranith Kumar K2019-04-051-4/+1
| | | | | | | | | | | When split-brain choice is changed from one brick to another brick, inode-invalidate is not called so readv call is served from cache leading to failures in split-brain-resolution.t. Fixed it by calling inode_invaldate() when this happens. updates bz#1193929 Change-Id: I2624614eec38c0303f3e1dc55dfae3d4b864218b Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Send inodelk/entrylk with non-zero lk-ownerPranith Kumar K2019-04-021-1/+16
| | | | | | | | | | | | | | Found missing assignment of lk-owner for an inodelk/entrylk before winding the fops. locks xlator at the moment allows this operation. This leads to multiple threads in the same client being able to get locks on the inode because lk-owner is same and transport is same. So isolation with locks can't be achieved. To fix it, we need locks xlator change which will disallow null-lk-owner based inodelk/entrylk/lk. To achieve that we need to first fix all the places which do this mistake. updates bz#1624701 Change-Id: Ic3431da3f451a1414f1f4fdcfc4cf41e555f69dd Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: thin-arbiter read txn fixesRavishankar N2019-03-291-7/+13
| | | | | | | | | | | | | - Fixes afr_ta_read_txn() to handle inode refresh failures. code-path. - Fixes a double free issue of dict. Note: This patch address post-merge review comments for commit 69532c141be160b3fea03c1579ae4ac13018dcdf fixes: bz#1686398 Change-Id: Id5299b45b68569d47df6b73755918237a1592cb4 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: add client-pid to all gf_event() callsRavishankar N2019-03-271-4/+8
| | | | | | | | | client-pid for glustershd is GF_CLIENT_PID_SELF_HEALD client-pid for glfsheal is GF_CLIENT_PID_GLFS_HEALD updates: bz#1689250 Change-Id: Ib3a863af160ff48c822a5e6b0c27c575c9887470 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Add quorum checks to open & opendir fopskarthik-us2019-03-081-0/+18
| | | | | | | | | | | | | | | Problem: Currently even if open & opendir fails on quorum number of bricks, but succeeds on atleast one brick, it will result in success. This leads to inconsistency in the behaviour with other operations following the open, which has quorum checks. Fix: Add quorum checks to open & opendir fops to avoid inconsistency. Change-Id: If8fcb82072a6dc45ea6d4a6754b79763215eba2a fixes: bz#1634664 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelogAshish Pandey2019-02-011-0/+1
| | | | | | | | | | | | | | | | If a fop to create an entry fails on one of the data brick, we mark the pending changelog on the entry on brick for which it was successful. This is done as part of post op phase to make sure that entry gets healed even if it gets renamed to some other path where its parent was not marked as bad. As it happens as part of post op, we should consider thin-arbiter to check if the brick, which was successful, is the good brick or not. This will avoide split brain and other issues. Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 updates: bz#1662264 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* Multiple files: reduce work while under lock.Yaniv Kaul2019-01-291-8/+7
| | | | | | | | | | | | | | | | | Mostly, unlock before logging. In some cases, moved different code that was not needed to be under lock (for example, taking time, or malloc'ing) to be executed before taking the lock. Note: logging might be slightly less accurate in order, since it may not be done now under the lock, so order of logs is racy. I think it's a reasonable compromise. Compile-tested only! updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I2438710016afc9f4f62a176ef1a0d3ed793b4f89
* afr : fix memory leakSunny Kumar2019-01-111-23/+15
| | | | | | | | | | | | | | | This patch fixes memory leak reported by ASan. The fix was first merged by https://review.gluster.org/#/c/glusterfs/+/21805. But later change was reverted due to this patch https://review.gluster.org/#/c/glusterfs/+/21178/. updates: bz#1633930 Change-Id: I1febe121e0be33a637397a0b54d6b78391692b0d Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
* cluster/afr: Refactor internal locking code to allow multiple inodelksPranith Kumar K2018-12-281-8/+1
| | | | | | | | | | | | | | | | For implementing copy_file_range fop, AFR needs to perform two inodelks in the same transaction. This patch brings in the necessary structure to make it easier to do so. Entry-locks in AFR were already taking multiple entry-locks on different inodes with the respective basenames. This patch extends the logic in inodelks to use the same lockee_t structure. This lead to removal of quite a lot of duplicate code present in afr-lk-common.c as both the locks are doing same things except 'winding' part. updates: #536 Change-Id: Ibfce7e3f260bb27b18645152ec680c33866fe0ae Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNTkarthik-us2018-12-181-21/+62
| | | | | | | | | | | | | | | | | | | | | Problem: When trying to convert a plain distribute volume to replica-3 or arbiter type it is failing with ENOTCONN error as the lookup on the root will fail as there is no quorum. Fix: Allow lookup on root if it is coming from the ADD_REPLICA_MOUNT which is used while adding bricks to a volume. It will try to set the pending xattrs for the newly added bricks to allow the heal to happen in the right direction and avoid data loss scenarios. Note: This fix will solve the problem of type conversion only in the case where the volume was mounted at least once. The conversion of non mounted volumes will still fail since the dht selfheal tries to set the directory layout will fail as they do that with the PID GF_CLIENT_PID_NO_ROOT_SQUASH set in the frame->root. Change-Id: Ic511939981dad118cc946754341318b164954b3b fixes: bz#1655854 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* Don't depend on string options to be valid alwaysPranith Kumar K2018-12-171-13/+9
| | | | | | updates bz#1650403 Change-Id: Ib5a11e691599ce4bd93c1ed5aca6060592893961 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* AFR xlator: use dict_{setn|getn|deln|get_int32n|set_int32n|set_strn}Yaniv Kaul2018-12-171-48/+67
| | | | | | | | | | | | | | | | | | | | In a previous patch (https://review.gluster.org/20769) we've added the key length to be passed to dict_* funcs, to remove the need to strlen() it. This patch moves some xlators to use it. - In some cases, moved strlen() of the key length outside of locks, which is usually a good thing. Please verify it's safe to do so. - In some cases, created a prefix for the keys, replacing something like "%d-%d" with a "%s" in snprintf(). Not sure it adds value, but improves readability. Please review carefully. Compile-tested only! Change-Id: I04f2a1eb2ecfc3283d849d150d10d088ae7aa7f1 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/afr: Fix mem leak reported by ASANKotresh HR2018-12-171-4/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Traceback: Direct leak of 765 byte(s) in 9 object(s) allocated from: #0 0x7ffb9cad2c48 in malloc (/lib64/libasan.so.5+0xeec48) #1 0x7ffb9c5f8949 in __gf_malloc ./libglusterfs/src/mem-pool.c:136 #2 0x7ffb9c5f91bb in gf_vasprintf ./libglusterfs/src/mem-pool.c:236 #3 0x7ffb9c5f938a in gf_asprintf ./libglusterfs/src/mem-pool.c:256 #4 0x7ffb826714ab in afr_get_heal_info ./xlators/cluster/afr/src/afr-common.c:6204 #5 0x7ffb825765e5 in afr_handle_heal_xattrs ./xlators/cluster/afr/src/afr-inode-read.c:1481 #6 0x7ffb825765e5 in afr_getxattr ./xlators/cluster/afr/src/afr-inode-read.c:1571 #7 0x7ffb9c635af7 in syncop_getxattr ./libglusterfs/src/syncop.c:1680 #8 0x406c78 in glfsh_process_entries ./heal/src/glfs-heal.c:810 #9 0x408555 in glfsh_crawl_directory ./heal/src/glfs-heal.c:898 #10 0x408cc0 in glfsh_print_pending_heals_type ./heal/src/glfs-heal.c:970 #11 0x408fc5 in glfsh_print_pending_heals ./heal/src/glfs-heal.c:1012 #12 0x409546 in glfsh_gather_heal_info ./heal/src/glfs-heal.c:1154 #13 0x403e96 in main ./heal/src/glfs-heal.c:1745 #14 0x7ffb99bc411a in __libc_start_main ../csu/libc-start.c:308 The dictionary is referenced by caller to print the status. So set it as dynstr, the last unref of dictionary will free it. updates: bz#1633930 Change-Id: Ib5a7cb891e6f7d90560859aaf6239e52ff5477d0 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* afr: some minor itable related cleanupsRavishankar N2018-12-121-2/+0
| | | | | | | | | | | | - this->itable always needs to be allocated, hence move it outside afr_selfheal_daemon_init(). - Invoke afr_selfheal_daemon_init() only for self-heal daemon case. - remove redundant itable allocation in afr_discover(). - destroy itable in fini. Updates bz#1193929 Change-Id: Ib28b50b607386f5a5aa7d2f743c8b506ccb10eae Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* cluster/afr: s/uuid_is_null/gf_uuid_is_nullPranith Kumar K2018-11-071-1/+1
| | | | | | Updates bz#1193929 Change-Id: I1b312dabffac7e101df8ce15557527fd28a2c61f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* all: fix the format string exceptionsAmar Tumballi2018-11-051-2/+2
| | | | | | | | | | | | | | | | Currently, there are possibilities in few places, where a user-controlled (like filename, program parameter etc) string can be passed as 'fmt' for printf(), which can lead to segfault, if the user's string contains '%s', '%d' in it. While fixing it, makes sense to make the explicit check for such issues across the codebase, by making the format call properly. Fixes: CVE-2018-14661 Fixes: bz#1644763 Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr/lease: Read child nodes from lease structureroot2018-10-251-1/+1
| | | | | | | | | | For lease operation, we allocate and store child nodes data in lease structure. Use the same in afr_lease_cbk() while checking for the quorum. Change-Id: If1fdd5a0798888afd39ad3df57d96487baf9d1e6 updates: #350 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory stateRavishankar N2018-10-251-47/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1579788 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr : Check for UP bricks before starting healAshish Pandey2018-10-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Problem: Currently for replica volume, even if only one brick is UP SHD will keep crawling index entries even if it can not heal anything. In thin-arbiter volume which is also a replica 2 volume, this causes inode lock contention which in turn sends upcall to all the clients to release notify locks, even if it can not do anything for healing. This will slow down the client performance and kills the purpose of keeping in memory information about bad brick. Solution: Before starting heal or even crawling, check if sufficient number of children are UP and available to check and heal entries. Change-Id: I011c9da3b37cae275f791affd56b8f1c1ac9255d updates: bz#1640581 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* all: fix warnings on non 64-bits architecturesXavi Hernandez2018-10-101-14/+17
| | | | | | | | | | When compiling in other architectures there appear many warnings. Some of them are actual problems that prevent gluster to work correctly on those architectures. Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0 fixes: bz#1632717 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* afr: fix incorrect reporting of directory split-brainRavishankar N2018-09-211-7/+7
| | | | | | | | | | | | | | | | | Problem: When a directory has dirty xattrs due to failed post-ops or when replace/reset brick is performed, AFR does a conservative merge as expected, but heal-info reports it as split-brain because there are no clear sources. Fix: Modify pending flag to contain information about pending heals and split-brains. For directories, if spit-brain flag is not set,just show them as needing heal and not being in split-brain. Fixes: bz#1626994 Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383 Signed-off-by: Ravishankar N <ravishankar@redhat.com>