summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-common.c
Commit message (Collapse)AuthorAgeFilesLines
* afr: event gen changesRavishankar N2020-07-081-74/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The general idea of the changes is to prevent resetting event generation to zero in the inode ctx, since event gen is something that should follow 'causal order'. Change #1: For a read txn, in inode refresh cbk, if event_generation is found zero, we are failing the read fop. This is not needed because change in event gen is only a marker for the next inode refresh to happen and should not be taken into account by the current read txn. Change #2: The event gen being zero above can happen if there is a racing lookup, which resets even get (in afr_lookup_done) if there are non zero afr xattrs. The resetting is done only to trigger an inode refresh and a possible client side heal on the next lookup. That can be acheived by setting the need_refresh flag in the inode ctx. So replaced all occurences of resetting even gen to zero with a call to afr_inode_need_refresh_set(). Change #3: In both lookup and discover path, we are doing an inode refresh which is not required since all 3 essentially do the same thing- update the inode ctx with the good/bad copies from the brick replies. Inode refresh also triggers background heals, but I think it is okay to do it when we call refresh during the read and write txns and not in the lookup path. The .ts which relied on inode refresh in lookup path to trigger heals are now changed to do read txn so that inode refresh and the heal happens. Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec Fixes: #1179 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Erik Jacobson <erik.jacobson at hpe.com> (cherry picked from commit f0fcd909ad4535b60c9208d4804ebe6afe421a09)
* cluster/afr: Prioritize ENOSPC over other errorskarthik-us2020-07-081-1/+3
| | | | | | | | | | | | | | | | | | | | | | | Problem: In a replicate/arbiter volume if file creations or writes fails on quorum number of bricks and on one brick it is due to ENOSPC and on other brick it fails for a different reason, it may fail with errors other than ENOSPC in some cases. Fix: Prioritize ENOSPC over other lesser priority errors and do not set op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any error_no which can be misinterpreted by __afr_dir_write_finalize(). Also removing the function afr_has_arbiter_fop_cbk_quorum() which might consider a successful reply form a single brick as quorum success in some cases, whereas we always need fop to be successful on quorum number of bricks in arbiter configuration. Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08 Fixes: #1254 Signed-off-by: karthik-us <ksubrahm@redhat.com> (cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)
* afr: more quorum checks in lookup and new entry markingRavishankar N2020-07-081-11/+13
| | | | | | | | | | | | | | | | | Problem: See github issue for details. Fix: -In lookup if the entry exists in 2 out of 3 bricks, don't fail the lookup with ENOENT just because there is an entrylk on the parent. Consider quorum before deciding. -If entry FOP does not succeed on quorum no. of bricks, do not perform new entry mark. Fixes: #1303 Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit c4a6748f25d2c1ab3ebcf89952278ebf94c8d371)
* afr: prevent spurious entry heals leading to gfid split-brainRavishankar N2020-02-281-2/+2
| | | | | | | | | | | | | | | | | | | | | Problem: In a hyperconverged setup with granular-entry-heal enabled, if a file is recreated while one of the bricks is down, and an index heal is triggered (with the brick still down), entry-self heal was doing a spurious heal with just the 2 good bricks. It was doing a post-op leading to removal of the filename from .glusterfs/indices/entry-changes as well as erroneous setting of afr xattrs on the parent. When the brick came up, the xattrs were cleared, resulting in the renamed file not getting healed and leading to gfid split-brain and EIO on the mount. Fix: Proceed with entry heal only when shd can connect to all bricks of the replica, just like in data and metadata heal. fixes: bz#1804594 Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)
* cluster/thin-arbiter: Wait for TA connection before ta-file lookupAshish Pandey2020-02-261-19/+21
| | | | | | | | | | | | | | | | | | | Problem: When we mount a ta volume, as soon as 2 data bricks are connected we consider that the mount is done and then send a lookup/create on ta file on ta node. However, this connection with ta node might not have been completed. Due to this delay, ta replica id file will not be created and we will see ENOTCONN error in log file if we do lookup. Solution: As we know that this ta node could have a higher latency, we should wait for reasonable time for connection to happen before sending lookup/create on replica id file. fixes: bz#1804546 Change-Id: I36f90865afe617e4e84cee57fec832a16f5dd6cc (cherry picked from commit a7fa54ddea3fe429f143b37e4de06a93b49d776a)
* afr: support split-brain CLI for replica 3Ravishankar N2019-10-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Ever since we added quorum checks for lookups in afr via commit bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4, the split-brain resolution commands would not work for replica 3 because there would be no readables for the lookup fop. The argument was that split-brains do not occur in replica 3 but we do see (data/metadata) split-brain cases once in a while which indicate that there are a few bugs/corner cases yet to be discovered and fixed. Fortunately, commit 8016d51a3bbd410b0b927ed66be50a09574b7982 added GF_CLIENT_PID_GLFS_HEALD as the pid for all fops made by glfsheal. If we leverage this and allow lookups in afr when pid is GF_CLIENT_PID_GLFS_HEALD, split-brain resolution commands will work for replica 3 volumes too. Likewise, the check is added in shard_lookup as well to permit resolving split-brains by specifying "/.shard/shard-file.xx" as the file name (which previously used to fail with EPERM). Change-Id: I3c543dea79caf7cfbc1633e9089cb1cdd2538ba9 Fixes: bz#1760792 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 47dbd753187f69b3835d2e42fdbe7485874c4b3e)
* afr/lookup: Pass xattr_req in while doing a selfheal in lookupMohammed Rafi KC2019-09-231-2/+6
| | | | | | | | | | | | | | | | | We were not passing xattr_req when doing a name self heal as well as a meta data heal. Because of this, some xdata was missing which causes i/o errors Backport of > https://review.gluster.org/#/c/glusterfs/+/23024/ >Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f >Fixes: bz#1728770 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Fixes: bz#1749307 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> (cherry picked from commit d026f0bcfd301712e4f0671ccf238f43f2e6dd30) Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f
* afr: wake up index healer threadsRavishankar N2019-09-051-4/+2
| | | | | | | | | | | | | | | (Backport of https://review.gluster.org/#/c/glusterfs/+/23288/) ...whenever shd is re-enabled after disabling or there is a change in `cluster.heal-timeout`, without needing to restart shd or waiting for the current `cluster.heal-timeout` seconds to expire. See BZ 1743988 for more details. Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe fixes: bz#1743988 Reported-by: Glen Kiessling <glenk1973@hotmail.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: thin-arbiter lock release fixesRavishankar N2019-05-151-1/+11
| | | | | | | | | | | | | | | | | | | | | - pass fop state instead of afr local to afr_ta_dom_lock_check_and_release() - avoid afr_lock_release_synctask() being called simultaneosuly from notify code path and transaction (post-op) code path due to races. - Check if the post-op on TA is valid based on event_gen checks. - Invalidate in-memory information when we get TA child down. Note: Thi patch addresses some pending review comments of commit 053b1309dc8fbc05fcde5223e734da9f694cf5cc (https://review.gluster.org/#/c/glusterfs/+/20095/) fixes: bz#1709130 Change-Id: I2ccd7e1b53362f9f3fed8680aecb23b5011eb18c Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 9ab2747da78061882f6734df4b265bce11adaef1)
* cluster/afr: Remove local from owners_list on failure of lock-acquisitionPranith Kumar K2019-04-161-4/+4
| | | | | | | | | | | | | When eager-lock lock acquisition fails because of say network failures, the local is not being removed from owners_list, this leads to accumulation of waiting frames and the application will hang because the waiting frames are under the assumption that another transaction is in the process of acquiring lock because owner-list is not empty. Handled this case as well in this patch. Added asserts to make it easier to find these problems in future. fixes bz#1699731 Change-Id: I3101393265e9827755725b1f2d94a93d8709e923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: thin-arbiter read txn fixesRavishankar N2019-04-161-7/+13
| | | | | | | | | | | | | | - Fixes afr_ta_read_txn() to handle inode refresh failures. code-path. - Fixes a double free issue of dict. Note: This patch address post-merge review comments for commit 69532c141be160b3fea03c1579ae4ac13018dcdf fixes: bz#1693992 Change-Id: Id5299b45b68569d47df6b73755918237a1592cb4 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 500bd0014128e6727e83b6cb77e8ac94304b8f4a)
* afr: add client-pid to all gf_event() callsRavishankar N2019-04-161-4/+8
| | | | | | | | | | client-pid for glustershd is GF_CLIENT_PID_SELF_HEALD client-pid for glfsheal is GF_CLIENT_PID_GLFS_HEALD updates: bz#1693155 Change-Id: Ib3a863af160ff48c822a5e6b0c27c575c9887470 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 8016d51a3bbd410b0b927ed66be50a09574b7982)
* cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelogAshish Pandey2019-02-011-0/+1
| | | | | | | | | | | | | | | | If a fop to create an entry fails on one of the data brick, we mark the pending changelog on the entry on brick for which it was successful. This is done as part of post op phase to make sure that entry gets healed even if it gets renamed to some other path where its parent was not marked as bad. As it happens as part of post op, we should consider thin-arbiter to check if the brick, which was successful, is the good brick or not. This will avoide split brain and other issues. Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 updates: bz#1662264 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* Multiple files: reduce work while under lock.Yaniv Kaul2019-01-291-8/+7
| | | | | | | | | | | | | | | | | Mostly, unlock before logging. In some cases, moved different code that was not needed to be under lock (for example, taking time, or malloc'ing) to be executed before taking the lock. Note: logging might be slightly less accurate in order, since it may not be done now under the lock, so order of logs is racy. I think it's a reasonable compromise. Compile-tested only! updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I2438710016afc9f4f62a176ef1a0d3ed793b4f89
* afr : fix memory leakSunny Kumar2019-01-111-23/+15
| | | | | | | | | | | | | | | This patch fixes memory leak reported by ASan. The fix was first merged by https://review.gluster.org/#/c/glusterfs/+/21805. But later change was reverted due to this patch https://review.gluster.org/#/c/glusterfs/+/21178/. updates: bz#1633930 Change-Id: I1febe121e0be33a637397a0b54d6b78391692b0d Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
* cluster/afr: Refactor internal locking code to allow multiple inodelksPranith Kumar K2018-12-281-8/+1
| | | | | | | | | | | | | | | | For implementing copy_file_range fop, AFR needs to perform two inodelks in the same transaction. This patch brings in the necessary structure to make it easier to do so. Entry-locks in AFR were already taking multiple entry-locks on different inodes with the respective basenames. This patch extends the logic in inodelks to use the same lockee_t structure. This lead to removal of quite a lot of duplicate code present in afr-lk-common.c as both the locks are doing same things except 'winding' part. updates: #536 Change-Id: Ibfce7e3f260bb27b18645152ec680c33866fe0ae Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNTkarthik-us2018-12-181-21/+62
| | | | | | | | | | | | | | | | | | | | | Problem: When trying to convert a plain distribute volume to replica-3 or arbiter type it is failing with ENOTCONN error as the lookup on the root will fail as there is no quorum. Fix: Allow lookup on root if it is coming from the ADD_REPLICA_MOUNT which is used while adding bricks to a volume. It will try to set the pending xattrs for the newly added bricks to allow the heal to happen in the right direction and avoid data loss scenarios. Note: This fix will solve the problem of type conversion only in the case where the volume was mounted at least once. The conversion of non mounted volumes will still fail since the dht selfheal tries to set the directory layout will fail as they do that with the PID GF_CLIENT_PID_NO_ROOT_SQUASH set in the frame->root. Change-Id: Ic511939981dad118cc946754341318b164954b3b fixes: bz#1655854 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* Don't depend on string options to be valid alwaysPranith Kumar K2018-12-171-13/+9
| | | | | | updates bz#1650403 Change-Id: Ib5a11e691599ce4bd93c1ed5aca6060592893961 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* AFR xlator: use dict_{setn|getn|deln|get_int32n|set_int32n|set_strn}Yaniv Kaul2018-12-171-48/+67
| | | | | | | | | | | | | | | | | | | | In a previous patch (https://review.gluster.org/20769) we've added the key length to be passed to dict_* funcs, to remove the need to strlen() it. This patch moves some xlators to use it. - In some cases, moved strlen() of the key length outside of locks, which is usually a good thing. Please verify it's safe to do so. - In some cases, created a prefix for the keys, replacing something like "%d-%d" with a "%s" in snprintf(). Not sure it adds value, but improves readability. Please review carefully. Compile-tested only! Change-Id: I04f2a1eb2ecfc3283d849d150d10d088ae7aa7f1 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/afr: Fix mem leak reported by ASANKotresh HR2018-12-171-4/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Traceback: Direct leak of 765 byte(s) in 9 object(s) allocated from: #0 0x7ffb9cad2c48 in malloc (/lib64/libasan.so.5+0xeec48) #1 0x7ffb9c5f8949 in __gf_malloc ./libglusterfs/src/mem-pool.c:136 #2 0x7ffb9c5f91bb in gf_vasprintf ./libglusterfs/src/mem-pool.c:236 #3 0x7ffb9c5f938a in gf_asprintf ./libglusterfs/src/mem-pool.c:256 #4 0x7ffb826714ab in afr_get_heal_info ./xlators/cluster/afr/src/afr-common.c:6204 #5 0x7ffb825765e5 in afr_handle_heal_xattrs ./xlators/cluster/afr/src/afr-inode-read.c:1481 #6 0x7ffb825765e5 in afr_getxattr ./xlators/cluster/afr/src/afr-inode-read.c:1571 #7 0x7ffb9c635af7 in syncop_getxattr ./libglusterfs/src/syncop.c:1680 #8 0x406c78 in glfsh_process_entries ./heal/src/glfs-heal.c:810 #9 0x408555 in glfsh_crawl_directory ./heal/src/glfs-heal.c:898 #10 0x408cc0 in glfsh_print_pending_heals_type ./heal/src/glfs-heal.c:970 #11 0x408fc5 in glfsh_print_pending_heals ./heal/src/glfs-heal.c:1012 #12 0x409546 in glfsh_gather_heal_info ./heal/src/glfs-heal.c:1154 #13 0x403e96 in main ./heal/src/glfs-heal.c:1745 #14 0x7ffb99bc411a in __libc_start_main ../csu/libc-start.c:308 The dictionary is referenced by caller to print the status. So set it as dynstr, the last unref of dictionary will free it. updates: bz#1633930 Change-Id: Ib5a7cb891e6f7d90560859aaf6239e52ff5477d0 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* afr: some minor itable related cleanupsRavishankar N2018-12-121-2/+0
| | | | | | | | | | | | - this->itable always needs to be allocated, hence move it outside afr_selfheal_daemon_init(). - Invoke afr_selfheal_daemon_init() only for self-heal daemon case. - remove redundant itable allocation in afr_discover(). - destroy itable in fini. Updates bz#1193929 Change-Id: Ib28b50b607386f5a5aa7d2f743c8b506ccb10eae Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* cluster/afr: s/uuid_is_null/gf_uuid_is_nullPranith Kumar K2018-11-071-1/+1
| | | | | | Updates bz#1193929 Change-Id: I1b312dabffac7e101df8ce15557527fd28a2c61f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* all: fix the format string exceptionsAmar Tumballi2018-11-051-2/+2
| | | | | | | | | | | | | | | | Currently, there are possibilities in few places, where a user-controlled (like filename, program parameter etc) string can be passed as 'fmt' for printf(), which can lead to segfault, if the user's string contains '%s', '%d' in it. While fixing it, makes sense to make the explicit check for such issues across the codebase, by making the format call properly. Fixes: CVE-2018-14661 Fixes: bz#1644763 Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr/lease: Read child nodes from lease structureroot2018-10-251-1/+1
| | | | | | | | | | For lease operation, we allocate and store child nodes data in lease structure. Use the same in afr_lease_cbk() while checking for the quorum. Change-Id: If1fdd5a0798888afd39ad3df57d96487baf9d1e6 updates: #350 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory stateRavishankar N2018-10-251-47/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1579788 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr : Check for UP bricks before starting healAshish Pandey2018-10-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Problem: Currently for replica volume, even if only one brick is UP SHD will keep crawling index entries even if it can not heal anything. In thin-arbiter volume which is also a replica 2 volume, this causes inode lock contention which in turn sends upcall to all the clients to release notify locks, even if it can not do anything for healing. This will slow down the client performance and kills the purpose of keeping in memory information about bad brick. Solution: Before starting heal or even crawling, check if sufficient number of children are UP and available to check and heal entries. Change-Id: I011c9da3b37cae275f791affd56b8f1c1ac9255d updates: bz#1640581 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* all: fix warnings on non 64-bits architecturesXavi Hernandez2018-10-101-14/+17
| | | | | | | | | | When compiling in other architectures there appear many warnings. Some of them are actual problems that prevent gluster to work correctly on those architectures. Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0 fixes: bz#1632717 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* afr: fix incorrect reporting of directory split-brainRavishankar N2018-09-211-7/+7
| | | | | | | | | | | | | | | | | Problem: When a directory has dirty xattrs due to failed post-ops or when replace/reset brick is performed, AFR does a conservative merge as expected, but heal-info reports it as split-brain because there are no clear sources. Fix: Modify pending flag to contain information about pending heals and split-brains. For directories, if spit-brain flag is not set,just show them as needing heal and not being in split-brain. Fixes: bz#1626994 Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Use 2 domain locking in SHD for thin-arbiterkarthik-us2018-09-201-3/+4
| | | | | | | | | | | | | | | | | | | | With this change when SHD starts the index crawl it requests all the clients to release the AFR_TA_DOM_NOTIFY lock so that clients will know the in memory state is no more valid and any new operations needs to query the thin-arbiter if required. When SHD completes healing all the files without any failure, it will again take the AFR_TA_DOM_NOTIFY lock and gets the xattrs on TA to see whether there are any new failures happened by that time. If there are new failures marked on TA, SHD will start the crawl immediately to heal those failures as well. If there are no new failures, then SHD will take the AFR_TA_DOM_MODIFY lock and unsets the xattrs on TA, so that both the data bricks will be considered as good there after. Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1 fixes: bz#1579788 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* Land part 2 of clang-format changesGluster Ant2018-09-121-5428/+5257
| | | | | Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
* multiple xlators: strncpy()->sprintf(), reduce strlen()'sYaniv Kaul2018-09-071-14/+11
| | | | | | | | | | | | | | | | | | | | | | | xlators/cluster/afr/src/afr-common.c xlators/cluster/dht/src/dht-common.c xlators/cluster/dht/src/dht-rebalance.c xlators/cluster/stripe/src/stripe-helpers.c strncpy may not be very efficient for short strings copied into a large buffer: If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written. Instead, use snprintf(). Also: - save the result of strlen() and re-use it when possible. - move from strlen to SLEN (sizeof() ) for const strings. Compile-tested only! Change-Id: Icdf79dd3d9f9ff120e4720ff2b8bd016df575c38 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr: thin-arbiter read txn changesRavishankar N2018-09-051-0/+22
| | | | | | | | | | | | | | | | If both data bricks are up, read subvol will be based on read_subvols. If only one data brick is up: - First qeury the data-brick that is up. If it blames the other brick, allow the reads. - If if doesn't, query the TA to obtain the source of truth. TODO: See if in-memory state can be maintained for read txns (BZ 1624358). updates: bz#1579788 Change-Id: I61eec35592af3a1aaf9f90846d9a358b2e4b2fcc Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* glusterd: Fix Buffer size issuesSanju Rakonde2018-09-041-3/+3
| | | | | | | | This patch fixes buffer size issue 1138522. Change-Id: Ia12fc8f34f75704f8ed3efae2022c4fd67a8c76c updates: bz#789278 Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* cluster/afr: Delegate name-heal when possiblePranith Kumar K2018-09-041-24/+76
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: When name-self-heal is triggered on the mount, it blocks lookup until name-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs name heal and all of them trigger heals waiting for other clients to complete heal. Fix: When a name-heal is needed but quorum number of names have the file and pending xattrs exist on the parent, then better to delegate the heal to SHD which will be completed as part of entry-heal of the parent directory. We could also do the same for quorum-number of names not present but we don't have any known use-case where this is a frequent occurrence so not changing that part at the moment. When there is a gfid mismatch or missing gfid it is important to complete the heal so that next rename doesn't assume everything is fine and perform a rename etc fixes bz#1622821 Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Coverity fixes in afrkarthik-us2018-08-291-2/+0
| | | | | | | | | | | | | | | Fixes the deadcode issue in "afr-common.c" and null pointer dereference isse in "afr-dir-read.c". CIDs: 1395160, 1389018 Scan details: https://scan6.coverity.com/reports.htm#v42418/p10714/fileInstanceId=85017760&defectInstanceId=25877740&mergedDefectId=1395160 https://scan6.coverity.com/reports.htm#v42418/p10714/fileInstanceId=85017734&defectInstanceId=25877951&mergedDefectId=1389018 Change-Id: I65dff57305aa3ae43544be5353f801d761193e97 updates: bz#789278 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Delegate metadata heal with pending xattrs to SHDPranith Kumar K2018-08-281-0/+44
| | | | | | | | | | | | | | | | | | | Problem: When metadata-self-heal is triggered on the mount, it blocks lookup until metadata-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs metadata heal and all of them trigger heals waiting for other clients to complete heal. Fix: Only when the heal is needed but the pending xattrs are not set, trigger metadata heal that could block lookup. This is the only case where different clients may give different metadata to the clients without heals, which should be avoided. Updates bz#1622821 Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: common thin-arbiter functionsRavishankar N2018-08-231-1/+136
| | | | | | | | | | | | | | ...that can be used by client and self-heal daemon, namely: afr_ta_post_op_lock() afr_ta_post_op_unlock() Note: These are not yet consumed. They will be used in the write txn changes patch which will introduce 2 domain locking. updates: bz#1579788 Change-Id: I636d50f8fde00736665060e8f9ee4510d5f38795 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* All: run codespell on the code and fix issues.Yaniv Kaul2018-07-221-7/+7
| | | | | | | | | | | | Please review, it's not always just the comments that were fixed. I've had to revert of course all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr,ec: Print if the subvolume is up in statedumpPranith Kumar K2018-07-031-0/+1
| | | | | | fixes bz#1597156 Change-Id: I323eb9190e40b12df216698dcdba74a6d336beeb Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: don't update readables if inode refresh failed on all childrenRavishankar N2018-06-181-27/+40
| | | | | | | | | | | | | | | | | | | | | | | | Problem: If inode refresh failed on all children of afr due to ENOENT (say file migrated by dht), it resets the readables to zero. Any inflight txn which then later comes on the inode fails with EIO because no readable children present for the inode. Fix: Don't update readables when inode refresh fails on *all* children of afr. In that way any inflight txns will either proceed with its own inode refresh if needed and fail it with the right errno or use the old value of readables and continue with the txn. Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we seem to be winding the lock and checking for quorum only in pre-op pahse. Note: This should ideally fix BZ 1329505 since the stop gap fix for it is has been reverted at https://review.gluster.org/#/c/20028. Change-Id: Ia638c092d8d12dc27afb3cdad133394845061319 updates: bz#1584483 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: fix bug-1363721.t failureRavishankar N2018-05-211-0/+14
| | | | | | | | | | | | | | | | | | | | | | Problem: In the .t, when the only good brick was brought down, writes on the fd were still succeeding on the bad bricks. The inflight split-brain check was marking the write as failure but since the write succeeded on all the bad bricks, afr_txn_nothing_failed() was set to true and we were unwinding writev with success to DHT and then catching the failure in post-op in the background. Fix: Don't wind the FOP phase if the write_subvol (which is populated with readable subvols obtained in pre-op cbk) does not have at least 1 good brick which was up when the transaction started. Note: This fix is not related to brick muliplexing. I ran the .t 10 times with this fix and brick-mux enabled without any failures. Change-Id: I915c9c366aa32cd342b1565827ca2d83cb02ae85 updates: bz#1577672 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: Add lease() fopPoornima G2018-05-051-0/+149
| | | | | | | | Change-Id: Ied047dd5ee44e9d5a5d3db214826f7df30332ef9 updates: #350 BUG: 1319992 Signed-off-by: Poornima G <pgurusid@redhat.com> Signed-off-by: Jiffin Tony Thottan <jthottan@redhat.com>
* afr: initial changes for thin arbiterRavishankar N2018-04-301-5/+88
| | | | | | | | | 1. Create thin arbiter index file during mount. 2. Set pending marker in thin arbiter id file in case of failure. Change-Id: I269eb8d069f0323f1fc616175e5e5eb7b91d5f82 updates: #352 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Keep child-up until ping-eventPranith Kumar K2018-04-251-24/+34
| | | | | | | | | | | | | | | | | | | | | Problem: If we have 2 bricks, brick-A and brick-B with brick-A within halo-max-latency and brick-B more than halo-max-latency. If we set both halo-min, halo-max replicas as '1'. In this case, brick-A comes online and then ping-latency will be updated for it. When brick-B comes online, we have 2 up-bricks, so the code tries to find the brick with worst latency to mark it down. Since Brick-B just came online it always had '0' latency so brick-B used to be marked offline and Brick-B would eventually be the one to be online even when brick-A is more suited. Fix: Consider latency of just-up child as HALO_MAX_LATENCY so that worst-child until ping-latency is found as the just-up brick. Also keep ping-latency as -1 until child-up during initialization. BUG: 1567881 fixes bz#1567881 Change-Id: I148262fe505468190f0eb99225d0f6d57cdb6f04 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Make sure latency-arg is passed to afrPranith Kumar K2018-04-181-0/+2
| | | | | | | | | | | xlator_notify doesn't pass the extra arguments that come in the input function, so XLATOR_NOTIFY macro should be used instead to pass the extra arguments to the function. BUG: 1567881 fixes bz#1567881 Change-Id: Ic15b6c446638cbacf3149693147a754219037c47 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Prevent ping-event handling on shdPranith Kumar K2018-04-031-0/+2
| | | | | | | | | On shd, we shouldn't treat any brick down based on latency, otherwise self-heal will never happen fixes: bz#1562717 Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: add new value for read-hash-mode volume optionRavishankar N2018-03-291-24/+62
| | | | | | | | | | Updates: #363 This new value (3) will try to wind read requests to the child of AFR having the least amount of pending requests in its queue. Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Make AFR eager-locking similar to ECPranith Kumar K2018-03-141-169/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: 1) Afr's eager-lock only works for data transactions. 2) When there are conflicting writes, write with conflicting region initiates unlock of eager-lock leading to extra pre-ops and post-ops on the file. When eager-lock goes off, it leads to extra fsyncs for random-write workload in afr. Solution (that is modeled after EC): In EC, when there is a conflicting write, it waits for the current write to complete before it winds the conflicted write. This leads to better utilization of network and disk, because we will not be doing extra xattrops and FSYNCs and inodelk/unlock. Moved fd based counters to inode based counters. I tried to model the solution based on EC's locking, but it is not similar to AFR because we had to keep backward compatibility. Lifecycle of lock: ================== First transaction is added to inode->owners list and an inodelk will be sent on the wire. All the next transactions will be put in inode->waiters list until the first transaction completes inodelk and [f]xattrop completely. Once [f]xattrop also completes, all the requests in the inode->waiters list are checked if it conflict with any of the existing locks which are in inode->owners list and if not are added to inode->owners list and resumed with doing transaction. When these transactions complete fop phase they will be moved to inode->post_op list and resume the transactions that were paused because of conflicts. Post-op and unlock will not be issued on the wire until that is the last transaction on that inode. Last transaction when it has to perform post-op can choose to sleep for deyed-post-op-secs value. During that time if any other transaction comes, it will wake up the sleeping transaction and takes over the ownership of the lock and the cycle continues. If the dealyed-post-op-secs expire, then the timer thread will wakeup the sleeping transaction and it will set lock->release to true and starts doing post-op and then unlock. During this time if any other transactions come, they will be put in inode->frozen list. Once the previous unlock comes it will move the frozen list to waiters list and moves the first element from this waiters-list to owners-list and attempts the lock and the cycle continues. This is the general idea. There is logic at the time of dealying and at the time of new transaction or in flush fop to wakeup existing sleeping transactions or choosing whether to delay a transaction etc, which is subjected to change based on future enhancements etc. Fixes: #418 BUG: 1549606 Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove unused code pathsPranith Kumar K2018-03-061-8/+2
| | | | | | | | | | | | | | | | Removed 1) afr-v1 self-heal locks related code which is not used anymore 2) transaction has some data types that are not needed, so removed them 3) Never used lock tracing available in afr as gluster's network tracing does the job. So removed that as well. 4) Changelog is always enabled and afr is always used with locks, so __changelog_enabled, afr_lock_server_count etc functions can be deleted. 5) transaction.fop/done/resume always call the same functions, so no need to have these variables. BUG: 1549606 Change-Id: I370c146fec2892d40e674d232a5d7256e003c7f1 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>