glusterfs.git -

	Commit message (Collapse)	Author	Age	Files	Lines
*	cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNT	karthik-us	2018-12-18	1	-21/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When trying to convert a plain distribute volume to replica-3 or arbiter type it is failing with ENOTCONN error as the lookup on the root will fail as there is no quorum. Fix: Allow lookup on root if it is coming from the ADD_REPLICA_MOUNT which is used while adding bricks to a volume. It will try to set the pending xattrs for the newly added bricks to allow the heal to happen in the right direction and avoid data loss scenarios. Note: This fix will solve the problem of type conversion only in the case where the volume was mounted at least once. The conversion of non mounted volumes will still fail since the dht selfheal tries to set the directory layout will fail as they do that with the PID GF_CLIENT_PID_NO_ROOT_SQUASH set in the frame->root. Change-Id: Ic511939981dad118cc946754341318b164954b3b fixes: bz#1655854 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	Don't depend on string options to be valid always	Pranith Kumar K	2018-12-17	1	-13/+9
\| \| \| \| \| \|	updates bz#1650403 Change-Id: Ib5a11e691599ce4bd93c1ed5aca6060592893961 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	AFR xlator: use dict_{setn\|getn\|deln\|get_int32n\|set_int32n\|set_strn}	Yaniv Kaul	2018-12-17	1	-48/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a previous patch (https://review.gluster.org/20769) we've added the key length to be passed to dict_* funcs, to remove the need to strlen() it. This patch moves some xlators to use it. - In some cases, moved strlen() of the key length outside of locks, which is usually a good thing. Please verify it's safe to do so. - In some cases, created a prefix for the keys, replacing something like "%d-%d" with a "%s" in snprintf(). Not sure it adds value, but improves readability. Please review carefully. Compile-tested only! Change-Id: I04f2a1eb2ecfc3283d849d150d10d088ae7aa7f1 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	cluster/afr: Fix mem leak reported by ASAN	Kotresh HR	2018-12-17	1	-4/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Traceback: Direct leak of 765 byte(s) in 9 object(s) allocated from: #0 0x7ffb9cad2c48 in malloc (/lib64/libasan.so.5+0xeec48) #1 0x7ffb9c5f8949 in __gf_malloc ./libglusterfs/src/mem-pool.c:136 #2 0x7ffb9c5f91bb in gf_vasprintf ./libglusterfs/src/mem-pool.c:236 #3 0x7ffb9c5f938a in gf_asprintf ./libglusterfs/src/mem-pool.c:256 #4 0x7ffb826714ab in afr_get_heal_info ./xlators/cluster/afr/src/afr-common.c:6204 #5 0x7ffb825765e5 in afr_handle_heal_xattrs ./xlators/cluster/afr/src/afr-inode-read.c:1481 #6 0x7ffb825765e5 in afr_getxattr ./xlators/cluster/afr/src/afr-inode-read.c:1571 #7 0x7ffb9c635af7 in syncop_getxattr ./libglusterfs/src/syncop.c:1680 #8 0x406c78 in glfsh_process_entries ./heal/src/glfs-heal.c:810 #9 0x408555 in glfsh_crawl_directory ./heal/src/glfs-heal.c:898 #10 0x408cc0 in glfsh_print_pending_heals_type ./heal/src/glfs-heal.c:970 #11 0x408fc5 in glfsh_print_pending_heals ./heal/src/glfs-heal.c:1012 #12 0x409546 in glfsh_gather_heal_info ./heal/src/glfs-heal.c:1154 #13 0x403e96 in main ./heal/src/glfs-heal.c:1745 #14 0x7ffb99bc411a in __libc_start_main ../csu/libc-start.c:308 The dictionary is referenced by caller to print the status. So set it as dynstr, the last unref of dictionary will free it. updates: bz#1633930 Change-Id: Ib5a7cb891e6f7d90560859aaf6239e52ff5477d0 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	afr: some minor itable related cleanups	Ravishankar N	2018-12-12	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \|	- this->itable always needs to be allocated, hence move it outside afr_selfheal_daemon_init(). - Invoke afr_selfheal_daemon_init() only for self-heal daemon case. - remove redundant itable allocation in afr_discover(). - destroy itable in fini. Updates bz#1193929 Change-Id: Ib28b50b607386f5a5aa7d2f743c8b506ccb10eae Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	libglusterfs: Move devel headers under glusterfs directory	ShyamsundarR	2018-12-05	1	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
*	cluster/afr: s/uuid_is_null/gf_uuid_is_null	Pranith Kumar K	2018-11-07	1	-1/+1
\| \| \| \| \| \|	Updates bz#1193929 Change-Id: I1b312dabffac7e101df8ce15557527fd28a2c61f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	all: fix the format string exceptions	Amar Tumballi	2018-11-05	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, there are possibilities in few places, where a user-controlled (like filename, program parameter etc) string can be passed as 'fmt' for printf(), which can lead to segfault, if the user's string contains '%s', '%d' in it. While fixing it, makes sense to make the explicit check for such issues across the codebase, by making the format call properly. Fixes: CVE-2018-14661 Fixes: bz#1644763 Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	afr/lease: Read child nodes from lease structure	root	2018-10-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	For lease operation, we allocate and store child nodes data in lease structure. Use the same in afr_lease_cbk() while checking for the quorum. Change-Id: If1fdd5a0798888afd39ad3df57d96487baf9d1e6 updates: #350 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
*	afr: thin-arbiter 2 domain locking and in-memory state	Ravishankar N	2018-10-25	1	-47/+157
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1579788 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
*	cluster/afr : Check for UP bricks before starting heal	Ashish Pandey	2018-10-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Currently for replica volume, even if only one brick is UP SHD will keep crawling index entries even if it can not heal anything. In thin-arbiter volume which is also a replica 2 volume, this causes inode lock contention which in turn sends upcall to all the clients to release notify locks, even if it can not do anything for healing. This will slow down the client performance and kills the purpose of keeping in memory information about bad brick. Solution: Before starting heal or even crawling, check if sufficient number of children are UP and available to check and heal entries. Change-Id: I011c9da3b37cae275f791affd56b8f1c1ac9255d updates: bz#1640581 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
*	all: fix warnings on non 64-bits architectures	Xavi Hernandez	2018-10-10	1	-14/+17
\| \| \| \| \| \| \| \| \| \|	When compiling in other architectures there appear many warnings. Some of them are actual problems that prevent gluster to work correctly on those architectures. Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0 fixes: bz#1632717 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	afr: fix incorrect reporting of directory split-brain	Ravishankar N	2018-09-21	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When a directory has dirty xattrs due to failed post-ops or when replace/reset brick is performed, AFR does a conservative merge as expected, but heal-info reports it as split-brain because there are no clear sources. Fix: Modify pending flag to contain information about pending heals and split-brains. For directories, if spit-brain flag is not set,just show them as needing heal and not being in split-brain. Fixes: bz#1626994 Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Use 2 domain locking in SHD for thin-arbiter	karthik-us	2018-09-20	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With this change when SHD starts the index crawl it requests all the clients to release the AFR_TA_DOM_NOTIFY lock so that clients will know the in memory state is no more valid and any new operations needs to query the thin-arbiter if required. When SHD completes healing all the files without any failure, it will again take the AFR_TA_DOM_NOTIFY lock and gets the xattrs on TA to see whether there are any new failures happened by that time. If there are new failures marked on TA, SHD will start the crawl immediately to heal those failures as well. If there are no new failures, then SHD will take the AFR_TA_DOM_MODIFY lock and unsets the xattrs on TA, so that both the data bricks will be considered as good there after. Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1 fixes: bz#1579788 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	Land part 2 of clang-format changes	Gluster Ant	2018-09-12	1	-5428/+5257
\| \| \| \| \|	Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
*	multiple xlators: strncpy()->sprintf(), reduce strlen()'s	Yaniv Kaul	2018-09-07	1	-14/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	xlators/cluster/afr/src/afr-common.c xlators/cluster/dht/src/dht-common.c xlators/cluster/dht/src/dht-rebalance.c xlators/cluster/stripe/src/stripe-helpers.c strncpy may not be very efficient for short strings copied into a large buffer: If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written. Instead, use snprintf(). Also: - save the result of strlen() and re-use it when possible. - move from strlen to SLEN (sizeof() ) for const strings. Compile-tested only! Change-Id: Icdf79dd3d9f9ff120e4720ff2b8bd016df575c38 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	afr: thin-arbiter read txn changes	Ravishankar N	2018-09-05	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If both data bricks are up, read subvol will be based on read_subvols. If only one data brick is up: - First qeury the data-brick that is up. If it blames the other brick, allow the reads. - If if doesn't, query the TA to obtain the source of truth. TODO: See if in-memory state can be maintained for read txns (BZ 1624358). updates: bz#1579788 Change-Id: I61eec35592af3a1aaf9f90846d9a358b2e4b2fcc Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	glusterd: Fix Buffer size issues	Sanju Rakonde	2018-09-04	1	-3/+3
\| \| \| \| \| \| \| \|	This patch fixes buffer size issue 1138522. Change-Id: Ia12fc8f34f75704f8ed3efae2022c4fd67a8c76c updates: bz#789278 Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	cluster/afr: Delegate name-heal when possible	Pranith Kumar K	2018-09-04	1	-24/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When name-self-heal is triggered on the mount, it blocks lookup until name-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs name heal and all of them trigger heals waiting for other clients to complete heal. Fix: When a name-heal is needed but quorum number of names have the file and pending xattrs exist on the parent, then better to delegate the heal to SHD which will be completed as part of entry-heal of the parent directory. We could also do the same for quorum-number of names not present but we don't have any known use-case where this is a frequent occurrence so not changing that part at the moment. When there is a gfid mismatch or missing gfid it is important to complete the heal so that next rename doesn't assume everything is fine and perform a rename etc fixes bz#1622821 Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Coverity fixes in afr	karthik-us	2018-08-29	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes the deadcode issue in "afr-common.c" and null pointer dereference isse in "afr-dir-read.c". CIDs: 1395160, 1389018 Scan details: https://scan6.coverity.com/reports.htm#v42418/p10714/fileInstanceId=85017760&defectInstanceId=25877740&mergedDefectId=1395160 https://scan6.coverity.com/reports.htm#v42418/p10714/fileInstanceId=85017734&defectInstanceId=25877951&mergedDefectId=1389018 Change-Id: I65dff57305aa3ae43544be5353f801d761193e97 updates: bz#789278 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	cluster/afr: Delegate metadata heal with pending xattrs to SHD	Pranith Kumar K	2018-08-28	1	-0/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When metadata-self-heal is triggered on the mount, it blocks lookup until metadata-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs metadata heal and all of them trigger heals waiting for other clients to complete heal. Fix: Only when the heal is needed but the pending xattrs are not set, trigger metadata heal that could block lookup. This is the only case where different clients may give different metadata to the clients without heals, which should be avoided. Updates bz#1622821 Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	afr: common thin-arbiter functions	Ravishankar N	2018-08-23	1	-1/+136
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	...that can be used by client and self-heal daemon, namely: afr_ta_post_op_lock() afr_ta_post_op_unlock() Note: These are not yet consumed. They will be used in the write txn changes patch which will introduce 2 domain locking. updates: bz#1579788 Change-Id: I636d50f8fde00736665060e8f9ee4510d5f38795 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	All: run codespell on the code and fix issues.	Yaniv Kaul	2018-07-22	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Please review, it's not always just the comments that were fixed. I've had to revert of course all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	afr,ec: Print if the subvolume is up in statedump	Pranith Kumar K	2018-07-03	1	-0/+1
\| \| \| \| \| \|	fixes bz#1597156 Change-Id: I323eb9190e40b12df216698dcdba74a6d336beeb Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	afr: don't update readables if inode refresh failed on all children	Ravishankar N	2018-06-18	1	-27/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: If inode refresh failed on all children of afr due to ENOENT (say file migrated by dht), it resets the readables to zero. Any inflight txn which then later comes on the inode fails with EIO because no readable children present for the inode. Fix: Don't update readables when inode refresh fails on all children of afr. In that way any inflight txns will either proceed with its own inode refresh if needed and fail it with the right errno or use the old value of readables and continue with the txn. Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we seem to be winding the lock and checking for quorum only in pre-op pahse. Note: This should ideally fix BZ 1329505 since the stop gap fix for it is has been reverted at https://review.gluster.org/#/c/20028. Change-Id: Ia638c092d8d12dc27afb3cdad133394845061319 updates: bz#1584483 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: fix bug-1363721.t failure	Ravishankar N	2018-05-21	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In the .t, when the only good brick was brought down, writes on the fd were still succeeding on the bad bricks. The inflight split-brain check was marking the write as failure but since the write succeeded on all the bad bricks, afr_txn_nothing_failed() was set to true and we were unwinding writev with success to DHT and then catching the failure in post-op in the background. Fix: Don't wind the FOP phase if the write_subvol (which is populated with readable subvols obtained in pre-op cbk) does not have at least 1 good brick which was up when the transaction started. Note: This fix is not related to brick muliplexing. I ran the .t 10 times with this fix and brick-mux enabled without any failures. Change-Id: I915c9c366aa32cd342b1565827ca2d83cb02ae85 updates: bz#1577672 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: Add lease() fop	Poornima G	2018-05-05	1	-0/+149
\| \| \| \| \| \| \| \|	Change-Id: Ied047dd5ee44e9d5a5d3db214826f7df30332ef9 updates: #350 BUG: 1319992 Signed-off-by: Poornima G <pgurusid@redhat.com> Signed-off-by: Jiffin Tony Thottan <jthottan@redhat.com>
*	afr: initial changes for thin arbiter	Ravishankar N	2018-04-30	1	-5/+88
\| \| \| \| \| \| \| \| \|	1. Create thin arbiter index file during mount. 2. Set pending marker in thin arbiter id file in case of failure. Change-Id: I269eb8d069f0323f1fc616175e5e5eb7b91d5f82 updates: #352 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Keep child-up until ping-event	Pranith Kumar K	2018-04-25	1	-24/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: If we have 2 bricks, brick-A and brick-B with brick-A within halo-max-latency and brick-B more than halo-max-latency. If we set both halo-min, halo-max replicas as '1'. In this case, brick-A comes online and then ping-latency will be updated for it. When brick-B comes online, we have 2 up-bricks, so the code tries to find the brick with worst latency to mark it down. Since Brick-B just came online it always had '0' latency so brick-B used to be marked offline and Brick-B would eventually be the one to be online even when brick-A is more suited. Fix: Consider latency of just-up child as HALO_MAX_LATENCY so that worst-child until ping-latency is found as the just-up brick. Also keep ping-latency as -1 until child-up during initialization. BUG: 1567881 fixes bz#1567881 Change-Id: I148262fe505468190f0eb99225d0f6d57cdb6f04 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Make sure latency-arg is passed to afr	Pranith Kumar K	2018-04-18	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	xlator_notify doesn't pass the extra arguments that come in the input function, so XLATOR_NOTIFY macro should be used instead to pass the extra arguments to the function. BUG: 1567881 fixes bz#1567881 Change-Id: Ic15b6c446638cbacf3149693147a754219037c47 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Prevent ping-event handling on shd	Pranith Kumar K	2018-04-03	1	-0/+2
\| \| \| \| \| \| \| \| \|	On shd, we shouldn't treat any brick down based on latency, otherwise self-heal will never happen fixes: bz#1562717 Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	afr: add new value for read-hash-mode volume option	Ravishankar N	2018-03-29	1	-24/+62
\| \| \| \| \| \| \| \| \| \|	Updates: #363 This new value (3) will try to wind read requests to the child of AFR having the least amount of pending requests in its queue. Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Make AFR eager-locking similar to EC	Pranith Kumar K	2018-03-14	1	-169/+146
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: 1) Afr's eager-lock only works for data transactions. 2) When there are conflicting writes, write with conflicting region initiates unlock of eager-lock leading to extra pre-ops and post-ops on the file. When eager-lock goes off, it leads to extra fsyncs for random-write workload in afr. Solution (that is modeled after EC): In EC, when there is a conflicting write, it waits for the current write to complete before it winds the conflicted write. This leads to better utilization of network and disk, because we will not be doing extra xattrops and FSYNCs and inodelk/unlock. Moved fd based counters to inode based counters. I tried to model the solution based on EC's locking, but it is not similar to AFR because we had to keep backward compatibility. Lifecycle of lock: ================== First transaction is added to inode->owners list and an inodelk will be sent on the wire. All the next transactions will be put in inode->waiters list until the first transaction completes inodelk and [f]xattrop completely. Once [f]xattrop also completes, all the requests in the inode->waiters list are checked if it conflict with any of the existing locks which are in inode->owners list and if not are added to inode->owners list and resumed with doing transaction. When these transactions complete fop phase they will be moved to inode->post_op list and resume the transactions that were paused because of conflicts. Post-op and unlock will not be issued on the wire until that is the last transaction on that inode. Last transaction when it has to perform post-op can choose to sleep for deyed-post-op-secs value. During that time if any other transaction comes, it will wake up the sleeping transaction and takes over the ownership of the lock and the cycle continues. If the dealyed-post-op-secs expire, then the timer thread will wakeup the sleeping transaction and it will set lock->release to true and starts doing post-op and then unlock. During this time if any other transactions come, they will be put in inode->frozen list. Once the previous unlock comes it will move the frozen list to waiters list and moves the first element from this waiters-list to owners-list and attempts the lock and the cycle continues. This is the general idea. There is logic at the time of dealying and at the time of new transaction or in flush fop to wakeup existing sleeping transactions or choosing whether to delay a transaction etc, which is subjected to change based on future enhancements etc. Fixes: #418 BUG: 1549606 Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Remove unused code paths	Pranith Kumar K	2018-03-06	1	-8/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Removed 1) afr-v1 self-heal locks related code which is not used anymore 2) transaction has some data types that are not needed, so removed them 3) Never used lock tracing available in afr as gluster's network tracing does the job. So removed that as well. 4) Changelog is always enabled and afr is always used with locks, so __changelog_enabled, afr_lock_server_count etc functions can be deleted. 5) transaction.fop/done/resume always call the same functions, so no need to have these variables. BUG: 1549606 Change-Id: I370c146fec2892d40e674d232a5d7256e003c7f1 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Remove compound-fops usage in afr	Pranith Kumar K	2018-03-06	1	-43/+0
\| \| \| \| \| \| \| \| \|	We are not seeing much improvement with this change. So removing the feature so that it doesn't need to be maintained anymore. Fixes: #414 Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Make afr_fsync a transaction	karthik-us	2018-03-02	1	-163/+0
\| \| \| \| \| \|	Change-Id: I713401feb96393f668efb074f2d5b870d19e6fda BUG: 1548361 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	cluster/afr: Fix dict-leak in pre-op	Pranith Kumar K	2018-02-28	1	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \|	At the time of pre-op, pre_op_xdata is populted with the xattrs we get from the disk and at the time of post-op it gets over-written without unreffing the previous value stored leading to a leak. This is a regression we missed in https://review.gluster.org/#/q/ba149bac92d169ae2256dbc75202dc9e5d06538e BUG: 1550078 Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	core: fix some of the dict_{get,set} with proper APIs	Amar Tumballi	2018-01-17	1	-3/+2
\| \| \| \| \| \| \|	updates #220 Change-Id: I6e25dbb69b2c7021e00073e8f025d212db7de0be Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	cluster/afr: Fixing the flaws in arbiter becoming source patch	karthik-us	2018-01-13	1	-114/+152
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Setting the write_subvol value to read_subvol in case of metadata transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03) might lead to the original problem of arbiter becoming source. Scenario: 1) All bricks are up and good 2) 2 writes w1 and w2 are in progress in parallel 3) ctx->read_subvol is good for all the subvolumes 4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on the disk 5) read/lookup comes on the same file and refreshes read_subvols back to all good 6) metadata transaction happens which makes ctx->write_subvol to be assigned with ctx->read_subvol which is all good 7) w2 succeeds on brick1 and fails on brick0 and this will update the brick in reverse order leading to arbiter becoming source Fix: Instead of setting the ctx->write_subvol to ctx->read_subvol in the pre-op statge, if there is a metadata transaction, check in the function __afr_set_in_flight_sb_status() if it is a data/metadata transaction. Use the value of ctx->write_subvol if it is a data transactions and ctx->read_subvol value for other transactions. With this patch we assign the value of ctx->write_subvol in the afr_transaction_perform_fop() with the on disk value, instead of assigning it in the afr_changelog_pre_op() with the in memory value. Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4 BUG: 1482064 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	afr: volume option fixes for GD2	Ravishankar N	2017-11-27	1	-1/+0
\| \| \| \| \| \| \| \| \|	This patch takes care of volume options exposed via the CLI. Updates #302 Change-Id: I6fd1645604928f6b9700e2425af4147cc6446a3a Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: coverity fixes	Ravishankar N	2017-11-24	1	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1.afr_discover_do: COPY_PASTE_ERROR 2.afr_fav_child_reset_sink_xattrs_cbk: REVERSE_INULL 3.afr_fop_lock_proceed: UNUSED_VALUE 4.afr_local_init: CHECKED_RETURN 5.afr_set_split_brain_choice: REVERSE_INULL 6.__afr_inode_write_finalize: FORWARD_NULL 7.afr_refresh_heal_done: REVERSE_INULL 8.afr_xl_op:UNUSED_VALUE 9.afr_changelog_populate_xdata: DEADCODE 10.set_afr_pending_xattrs_option: RESOURCE_LEAK Note: RESOURCE_LEAK complaints about afr_fgetxattr_pathinfo_cbk, afr_getxattr_list_node_uuids_cbk and afr_getxattr_pathinfo_cbk seem to be false alarms. Change-Id: Ia4ca1478b5e2922084732d14c1e7b1b03ad5ac45 BUG: 789278 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: add checks for allowing lookups	Ravishankar N	2017-11-18	1	-85/+159
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In an arbiter volume, lookup was being served from one of the sink bricks (source brick was down). shard uses the iatt values from lookup cbk to calculate the size and block count, which in this case were incorrect values. shard_local_t->last_block was thus initialised to -1, resulting in an infinite while loop in shard_common_resolve_shards(). Fix: Use client quorum logic to allow or fail the lookups from afr if there are no readable subvolumes. So in replica-3 or arbiter vols, if there is no good copy or if quorum is not met, fail lookup with ENOTCONN. With this fix, we are also removing support for quorum-reads xlator option. So if quorum is not met, neither read nor write txns are allowed and we fail the fop with ENOTCONN. Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0 BUG: 1467250 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Fix for arbiter becoming source	karthik-us	2017-11-18	1	-1/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When eager-lock is on, and two writes happen in parallel on a FD we were observing the following behaviour: - First write fails on one data brick - Since the post-op is not yet happened, the inode refresh will get both the data bricks as readable and set it in the inode context - In flight split brain check see both the data bricks as readable and allows the second write - Second write fails on the other data brick - Now the post-op happens and marks both the data bricks as bad and arbiter will become source for healing Fix: Adding one more variable called write_suvol in inode context and it will have the in memory representation of the writable subvols. Inode refresh will not update this value and its lifetime is pre-op through unlock in the afr transaction. Initially the pre-op will set this value same as read_subvol in inode context and then in the in flight split brain check we will use this value instead of read_subvol. After all the checks we will update the value of this and set the read_subvol same as this to avoid having incorrect value in that. Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3 BUG: 1482064 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	Coverity Issue: PW.INCLUDE_RECURSION in several files	Girjesh Rajoria	2017-11-09	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Coverity ID: 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 423, 424, 425, 426, 427, 428, 429, 436, 437, 438, 439, 440, 441, 442, 443 Issue: Event include_recursion Removed redundant, recursive includes from the files. Change-Id: I920776b1fa089a2d4917ca722d0075a9239911a7 BUG: 789278 Signed-off-by: Girjesh Rajoria <grajoria@redhat.com>
*	cluster/afr: Honor default timeout of 5min for analyzing split-brain files	karthik-us	2017-10-30	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: After setting split-brain-choice option to analyze the file to resolve the split brain using the command "setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>" should allow to access the file from mount for default timeout of 5mins. But the timeout was not honored and was able to access the file even after the timeout. Fix: Call the inode_invalidate() in afr_set_split_brain_choice_cbk() so that it will triger the cache invalidate after resetting the timer and the split brain choice. So the next calls to access the file will fail with EIO. Change-Id: I698cb833676b22ff3e4c6daf8b883a0958f51a64 BUG: 1503519 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	cluster/afr: Fail open on split-brain	Pranith Kumar K	2017-10-26	1	-8/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Append on a file with split-brain succeeds. Open is intercepted by open-behind, when write comes on the file, open-behind does open+write. Open succeeds because afr doesn't fail it. Then write succeeds because write-behind intercepts it. Flush is also intercepted by write-behind, so the application never gets to know that the write failed. Fix: Fail open on split-brain, so that when open-behind does open+write open fails which leads to write failure. Application will know about this failure. Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15 BUG: 1294051 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	xlator/cluster/afr:coverity Issue "UNUSED_VALUE" in afr_get_split_brain_status	Subha sree Mohankumar	2017-10-05	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \|	Issue: Event value_overwrite:Overwriting previous write to "ret" with value "-1". Fix : An "If" condition is added to check the value of "ret". Change-Id: I7b6bd4f20f73fa85eb8a5169644e275c7b56af51 BUG: 789278 Signed-off-by: Subha sree Mohankumar <smohanku@redhat.com>
*	cluster/afr: Sending subvol up/down events when subvol comes up or goes down	karthik-us	2017-09-20	1	-0/+2
\| \| \| \| \| \|	Change-Id: I6580351b245d5f868e9ddc6a4eb4dd6afa3bb6ec BUG: 1493539 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	afr: discover/lookup heal fixes	Ravishankar N	2017-09-04	1	-50/+51
\| \| \| \| \| \| \| \| \| \| \| \|	Addresses review comments in commit 468ca877807625817b72921d1e9585036687b640 Change-Id: I04b1bd3b00abfd6758798d6272954e36a24249a9 BUG: 1473636 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/18187 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
*	afr: check validity of afr_reply	Ravishankar N	2017-08-31	1	-9/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...in various self-heal code paths. Originally found by Pranith in __afr_selfheal_name_impunge () Also change __afr_selfheal_assign_gfid() to send lookup only on those bricks that don't have a gfid matching that of the source. Change-Id: I70a2ccd750a2af92c5fc36e0eefb2b6125404b4a BUG: 1482923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/18065 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>