summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-transaction.c
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: Prioritize ENOSPC over other errorskarthik-us2020-07-081-46/+2
| | | | | | | | | | | | | | | | | | | | | | | Problem: In a replicate/arbiter volume if file creations or writes fails on quorum number of bricks and on one brick it is due to ENOSPC and on other brick it fails for a different reason, it may fail with errors other than ENOSPC in some cases. Fix: Prioritize ENOSPC over other lesser priority errors and do not set op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any error_no which can be misinterpreted by __afr_dir_write_finalize(). Also removing the function afr_has_arbiter_fop_cbk_quorum() which might consider a successful reply form a single brick as quorum success in some cases, whereas we always need fop to be successful on quorum number of bricks in arbiter configuration. Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08 Fixes: #1254 Signed-off-by: karthik-us <ksubrahm@redhat.com> (cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)
* afr: thin-arbiter lock release fixesRavishankar N2019-05-151-46/+76
| | | | | | | | | | | | | | | | | | | | | - pass fop state instead of afr local to afr_ta_dom_lock_check_and_release() - avoid afr_lock_release_synctask() being called simultaneosuly from notify code path and transaction (post-op) code path due to races. - Check if the post-op on TA is valid based on event_gen checks. - Invalidate in-memory information when we get TA child down. Note: Thi patch addresses some pending review comments of commit 053b1309dc8fbc05fcde5223e734da9f694cf5cc (https://review.gluster.org/#/c/glusterfs/+/20095/) fixes: bz#1709130 Change-Id: I2ccd7e1b53362f9f3fed8680aecb23b5011eb18c Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 9ab2747da78061882f6734df4b265bce11adaef1)
* cluster/afr : TA: Return actual error code in case of failureAshish Pandey2019-05-131-6/+6
| | | | | | | | | | In afr_ta_post_op_do, we were sending EIO for every failure. However, the original error code should be sent. Change-Id: I9fdc15dac00d758baf8e6f14db244f526481a63a updates: bz#1709143 Signed-off-by: Ashish Pandey <aspandey@redhat.com> (cherry picked from commit 63159cdb5374f458d7d2bffec24d4720ffc96d6c)
* cluster/afr: Remove local from owners_list on failure of lock-acquisitionPranith Kumar K2019-04-161-11/+8
| | | | | | | | | | | | | When eager-lock lock acquisition fails because of say network failures, the local is not being removed from owners_list, this leads to accumulation of waiting frames and the application will hang because the waiting frames are under the assumption that another transaction is in the process of acquiring lock because owner-list is not empty. Handled this case as well in this patch. Added asserts to make it easier to find these problems in future. fixes bz#1699731 Change-Id: I3101393265e9827755725b1f2d94a93d8709e923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelogAshish Pandey2019-02-011-19/+75
| | | | | | | | | | | | | | | | If a fop to create an entry fails on one of the data brick, we mark the pending changelog on the entry on brick for which it was successful. This is done as part of post op phase to make sure that entry gets healed even if it gets renamed to some other path where its parent was not marked as bad. As it happens as part of post op, we should consider thin-arbiter to check if the brick, which was successful, is the good brick or not. This will avoide split brain and other issues. Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 updates: bz#1662264 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr: Refactor internal locking code to allow multiple inodelksPranith Kumar K2018-12-281-74/+96
| | | | | | | | | | | | | | | | For implementing copy_file_range fop, AFR needs to perform two inodelks in the same transaction. This patch brings in the necessary structure to make it easier to do so. Entry-locks in AFR were already taking multiple entry-locks on different inodes with the respective basenames. This patch extends the logic in inodelks to use the same lockee_t structure. This lead to removal of quite a lot of duplicate code present in afr-lk-common.c as both the locks are doing same things except 'winding' part. updates: #536 Change-Id: Ibfce7e3f260bb27b18645152ec680c33866fe0ae Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNTkarthik-us2018-12-181-5/+8
| | | | | | | | | | | | | | | | | | | | | Problem: When trying to convert a plain distribute volume to replica-3 or arbiter type it is failing with ENOTCONN error as the lookup on the root will fail as there is no quorum. Fix: Allow lookup on root if it is coming from the ADD_REPLICA_MOUNT which is used while adding bricks to a volume. It will try to set the pending xattrs for the newly added bricks to allow the heal to happen in the right direction and avoid data loss scenarios. Note: This fix will solve the problem of type conversion only in the case where the volume was mounted at least once. The conversion of non mounted volumes will still fail since the dht selfheal tries to set the directory layout will fail as they do that with the PID GF_CLIENT_PID_NO_ROOT_SQUASH set in the frame->root. Change-Id: Ic511939981dad118cc946754341318b164954b3b fixes: bz#1655854 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* AFR xlator: use dict_{setn|getn|deln|get_int32n|set_int32n|set_strn}Yaniv Kaul2018-12-171-3/+6
| | | | | | | | | | | | | | | | | | | | In a previous patch (https://review.gluster.org/20769) we've added the key length to be passed to dict_* funcs, to remove the need to strlen() it. This patch moves some xlators to use it. - In some cases, moved strlen() of the key length outside of locks, which is usually a good thing. Please verify it's safe to do so. - In some cases, created a prefix for the keys, replacing something like "%d-%d" with a "%s" in snprintf(). Not sure it adds value, but improves readability. Please review carefully. Compile-tested only! Change-Id: I04f2a1eb2ecfc3283d849d150d10d088ae7aa7f1 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory stateRavishankar N2018-10-251-24/+461
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1579788 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr: Make data eager-lock decision based on number of locksPranith Kumar K2018-09-211-6/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For both Virt and block workloads the file is opened multiple times leading to dynamically setting eager-lock to off for the workload. Instead of depending on the number-of-open-fds, if we change the logic to depend on number of inodelks, then it will give better performance than the earlier logic. When there is an eager-lock and number of inodelks is more than 1 we know that there is a conflicting lock, so depend on that information to decide whether to keep the current transaction go through delayed-post-op or not. Locks xlator doesn't have implementation to query number of locks in fxattrop in releases older than 3.10 so to keep things backward compatible in 3.12, data transactions will use new logic where as fxattrop transactions will use old logic. I am planning to send one more patch which makes metadata domain locks also depend on inodelk-count Profile info for a dd of 500MB to a file with another fd opened on the file using exec 250>filename Without this patch: 0.14 67.41 us 16.72 us 3870.82 us 892 FINODELK 0.59 279.87 us 95.71 us 2085.89 us 898 FXATTROP 3.46 366.43 us 81.75 us 6952.79 us 4000 WRITE 95.79 148733.99 us 50568.12 us 919127.86 us 273 FSYNC With this patch: 0.00 51.01 us 38.07 us 80.16 us 4 FINODELK 0.00 235.43 us 235.43 us 235.43 us 1 TRUNCATE 0.00 125.07 us 56.80 us 193.33 us 2 GETXATTR 0.00 135.86 us 62.13 us 209.59 us 2 INODELK 0.00 197.88 us 155.39 us 253.90 us 4 FXATTROP 0.00 450.59 us 394.28 us 506.89 us 2 XATTROP 0.00 56.96 us 19.06 us 406.59 us 23 FLUSH 37.81 273648.93 us 48.43 us 6017657.05 us 44 LOOKUP 62.18 4951.86 us 93.80 us 1143154.75 us 3999 WRITE postgresql benchmark performance changed from ~1130 TPS to ~2300TPS randio fio job inside Ovirt based VM went from ~600IOPs to ~2000IOPS fixes bz#1630368 Change-Id: If7f7388d2f08cf7f17ca517a4ea222560661dc36 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Batch writes in same lock even when multiple fds are openPranith Kumar K2018-09-191-9/+2
| | | | | | | | | | | | | | | | | | | | | | Problem: When eager-lock is disabled because of multiple-fds opened and app writes come on conflicting regions, the number of locks grows very fast leading to all the CPU being spent just in locking and unlocking by traversing huge queues in locks xlator for granting locks. Fix: Reduce the number of locks in transit by bundling the writes in the same lock and disable delayed piggy-pack when we learn that multiple fds are open on the file. This will reduce the size of queues in the locks xlator. This also reduces the number of network calls like inodelk/fxattrop. Please note that this problem can still happen if eager-lock is disabled as the writes will not be bundled in the same lock. fixes bz#1625961 Change-Id: I8fd1cf229aed54ce5abd4e6226351a039924dd91 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* Land part 2 of clang-format changesGluster Ant2018-09-121-1984/+1903
| | | | | Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
* afr: common thin-arbiter functionsRavishankar N2018-08-231-4/+10
| | | | | | | | | | | | | | ...that can be used by client and self-heal daemon, namely: afr_ta_post_op_lock() afr_ta_post_op_unlock() Note: These are not yet consumed. They will be used in the write txn changes patch which will introduce 2 domain locking. updates: bz#1579788 Change-Id: I636d50f8fde00736665060e8f9ee4510d5f38795 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: switch lk_owner only when pre-op succeedsRavishankar N2018-07-181-10/+10
| | | | | | | | | | | | | | | | | | | Problem: In a disk full scenario, we take a failure path in afr_transaction_perform_fop() and go to unlock phase. But we change the lk-owner before that, causing unlock to fail. When mount issues another fop that takes locks on that file, it hangs. Fix: Change lk-owner only when we are about to perform the fop phase. Also fix the same issue for arbiters when afr_txn_arbitrate_fop() fails the fop. Also removed the DISK_SPACE_CHECK_AND_GOTO in posix_xattrop. Otherwise truncate to zero will fail pre-op phase with ENOSPC when the user is actually trying to freee up space. Change-Id: Ic4c8a596b4cdf4a7fc189bf00b561113cf114353 fixes: bz#1602236 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Mark dirty for entry transactions for quorum failureskarthik-us2018-07-161-11/+51
| | | | | | | | | | | | | | | | | | Problem: If an entry creation transaction fails on quprum number of bricks it might end up setting the pending changelogs on the file itself on the brick where it got created. But the parent does not have any entry pending marker set. This will lead to the entry not getting healed by the self heal daemon automatically. Fix: For entry transactions mark dirty on the parent if it fails on quorum number of bricks, so that the heal can do conservative merge and entry gets healed by shd. Change-Id: I56448932dd409b3ddb095e2ae32e037b6157a607 fixes: bz#1586020 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Make sure lk-owner is assigned at the time of lockPranith Kumar K2018-07-041-2/+1
| | | | | | | | | | | | | | Problem: In the new eager-lock implementation lk-owner is assigned after the 'local' is added to the eager-lock list, so there exists a possibility of lock being sent even before lk-owner is assigned. Fix: Make sure to assign lk-owner before adding local to eager-lock list fixes bz#1597805 Change-Id: I26d1b7bcf3e8b22531f1dc0b952cae2d92889ef2 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: don't update readables if inode refresh failed on all childrenRavishankar N2018-06-181-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | Problem: If inode refresh failed on all children of afr due to ENOENT (say file migrated by dht), it resets the readables to zero. Any inflight txn which then later comes on the inode fails with EIO because no readable children present for the inode. Fix: Don't update readables when inode refresh fails on *all* children of afr. In that way any inflight txns will either proceed with its own inode refresh if needed and fail it with the right errno or use the old value of readables and continue with the txn. Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we seem to be winding the lock and checking for quorum only in pre-op pahse. Note: This should ideally fix BZ 1329505 since the stop gap fix for it is has been reverted at https://review.gluster.org/#/c/20028. Change-Id: Ia638c092d8d12dc27afb3cdad133394845061319 updates: bz#1584483 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: fix bug-1363721.t failureRavishankar N2018-05-211-0/+38
| | | | | | | | | | | | | | | | | | | | | | Problem: In the .t, when the only good brick was brought down, writes on the fd were still succeeding on the bad bricks. The inflight split-brain check was marking the write as failure but since the write succeeded on all the bad bricks, afr_txn_nothing_failed() was set to true and we were unwinding writev with success to DHT and then catching the failure in post-op in the background. Fix: Don't wind the FOP phase if the write_subvol (which is populated with readable subvols obtained in pre-op cbk) does not have at least 1 good brick which was up when the transaction started. Note: This fix is not related to brick muliplexing. I ran the .t 10 times with this fix and brick-mux enabled without any failures. Change-Id: I915c9c366aa32cd342b1565827ca2d83cb02ae85 updates: bz#1577672 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: initial changes for thin arbiterRavishankar N2018-04-301-0/+107
| | | | | | | | | 1. Create thin arbiter index file during mount. 2. Set pending marker in thin arbiter id file in case of failure. Change-Id: I269eb8d069f0323f1fc616175e5e5eb7b91d5f82 updates: #352 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: fixes to afr-eager lockingRavishankar N2018-04-181-0/+2
| | | | | | | | | | | | | 1. If pre-op fails on all bricks,set lock->release to true in afr_handle_lock_acquire_failure so that the GF_ASSERT in afr_unlock() does not crash. 2. Added a missing 'return' after handling pre-op failure in afr_transaction_perform_fop(), fixing a use-after-free issue. Change-Id: If0627a9124cb5d6405037cab3f17f8325eed2d83 fixes: bz#1561129 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: add quorum checks in pre-opRavishankar N2018-04-051-33/+31
| | | | | | | | | | | | | | | | | Problem: We seem to be winding the FOP if pre-op did not succeed on quorum bricks and then failing the FOP with EROFS since the fop did not meet quorum. This essentially masks the actual error due to which pre-op failed. (See BZ). Fix: Skip FOP phase if pre-op quorum is not met and go to post-op. Fixes: 1561129 Change-Id: Ie58a41e8fa1ad79aa06093706e96db8eef61b6d9 fixes: bz#1561129 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Make AFR eager-locking similar to ECPranith Kumar K2018-03-141-397/+516
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: 1) Afr's eager-lock only works for data transactions. 2) When there are conflicting writes, write with conflicting region initiates unlock of eager-lock leading to extra pre-ops and post-ops on the file. When eager-lock goes off, it leads to extra fsyncs for random-write workload in afr. Solution (that is modeled after EC): In EC, when there is a conflicting write, it waits for the current write to complete before it winds the conflicted write. This leads to better utilization of network and disk, because we will not be doing extra xattrops and FSYNCs and inodelk/unlock. Moved fd based counters to inode based counters. I tried to model the solution based on EC's locking, but it is not similar to AFR because we had to keep backward compatibility. Lifecycle of lock: ================== First transaction is added to inode->owners list and an inodelk will be sent on the wire. All the next transactions will be put in inode->waiters list until the first transaction completes inodelk and [f]xattrop completely. Once [f]xattrop also completes, all the requests in the inode->waiters list are checked if it conflict with any of the existing locks which are in inode->owners list and if not are added to inode->owners list and resumed with doing transaction. When these transactions complete fop phase they will be moved to inode->post_op list and resume the transactions that were paused because of conflicts. Post-op and unlock will not be issued on the wire until that is the last transaction on that inode. Last transaction when it has to perform post-op can choose to sleep for deyed-post-op-secs value. During that time if any other transaction comes, it will wake up the sleeping transaction and takes over the ownership of the lock and the cycle continues. If the dealyed-post-op-secs expire, then the timer thread will wakeup the sleeping transaction and it will set lock->release to true and starts doing post-op and then unlock. During this time if any other transactions come, they will be put in inode->frozen list. Once the previous unlock comes it will move the frozen list to waiters list and moves the first element from this waiters-list to owners-list and attempts the lock and the cycle continues. This is the general idea. There is logic at the time of dealying and at the time of new transaction or in flush fop to wakeup existing sleeping transactions or choosing whether to delay a transaction etc, which is subjected to change based on future enhancements etc. Fixes: #418 BUG: 1549606 Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove unused code pathsPranith Kumar K2018-03-061-122/+15
| | | | | | | | | | | | | | | | Removed 1) afr-v1 self-heal locks related code which is not used anymore 2) transaction has some data types that are not needed, so removed them 3) Never used lock tracing available in afr as gluster's network tracing does the job. So removed that as well. 4) Changelog is always enabled and afr is always used with locks, so __changelog_enabled, afr_lock_server_count etc functions can be deleted. 5) transaction.fop/done/resume always call the same functions, so no need to have these variables. BUG: 1549606 Change-Id: I370c146fec2892d40e674d232a5d7256e003c7f1 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove compound-fops usage in afrPranith Kumar K2018-03-061-329/+4
| | | | | | | | | We are not seeing much improvement with this change. So removing the feature so that it doesn't need to be maintained anymore. Fixes: #414 Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Fix dict-leak in pre-opPranith Kumar K2018-02-281-10/+10
| | | | | | | | | | | | At the time of pre-op, pre_op_xdata is populted with the xattrs we get from the disk and at the time of post-op it gets over-written without unreffing the previous value stored leading to a leak. This is a regression we missed in https://review.gluster.org/#/q/ba149bac92d169ae2256dbc75202dc9e5d06538e BUG: 1550078 Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: capture the correct errno in post-op quorum checkRavishankar N2018-01-311-8/+8
| | | | | | | | | If the post-op phase of txn did not meet quorm checks, use that errno to unwind the FOP rather than blindly setting ENOTCONN. Change-Id: I0cb0c8771ec75a45f9a25ad4cd8601103deddf0c BUG: 1506140 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: add quorum checks in post-opRavishankar N2018-01-191-0/+29
| | | | | | | | | | afr relies on pending changelog xattrs to identify source and sinks and the setting of these xattrs happen in post-op. So if post-op fails, we need to unwind the write txn with a failure. Change-Id: I0f019ac03890108324ee7672883d774918b20be1 BUG: 1506140 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Adding option to take full file lockkarthik-us2018-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In replica 3 volumes there is a possibilities of ending up in split brain scenario, when multiple clients writing data on the same file at non overlapping regions in parallel. Scenario: - Initially all the copies are good and all the clients gets the value of data readables as all good. - Client C0 performs write W1 which fails on brick B0 and succeeds on other two bricks. - C1 performs write W2 which fails on B1 and succeeds on other two bricks. - C2 performs write W3 which fails on B2 and succeeds on other two bricks. - All the 3 writes above happen in parallel and fall on different ranges so afr takes granular locks and all the writes are performed in parallel. Since each client had data-readables as good, it does not see file going into split-brain in the in_flight_split_brain check, hence performs the post-op marking the pending xattrs. Now all the bricks are being blamed by each other, ending up in split-brain. Fix: Have an option to take either full lock or range lock on files while doing data transactions, to prevent the possibility of ending up in split brains. With this change, by default the files will take full lock while doing IO. If you want to make use of the old range lock change the value of "cluster.full-lock" to "no". Change-Id: I7893fa33005328ed63daa2f7c35eeed7c5218962 BUG: 1535438 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Fixing the flaws in arbiter becoming source patchkarthik-us2018-01-131-20/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Setting the write_subvol value to read_subvol in case of metadata transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03) might lead to the original problem of arbiter becoming source. Scenario: 1) All bricks are up and good 2) 2 writes w1 and w2 are in progress in parallel 3) ctx->read_subvol is good for all the subvolumes 4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on the disk 5) read/lookup comes on the same file and refreshes read_subvols back to all good 6) metadata transaction happens which makes ctx->write_subvol to be assigned with ctx->read_subvol which is all good 7) w2 succeeds on brick1 and fails on brick0 and this will update the brick in reverse order leading to arbiter becoming source Fix: Instead of setting the ctx->write_subvol to ctx->read_subvol in the pre-op statge, if there is a metadata transaction, check in the function __afr_set_in_flight_sb_status() if it is a data/metadata transaction. Use the value of ctx->write_subvol if it is a data transactions and ctx->read_subvol value for other transactions. With this patch we assign the value of ctx->write_subvol in the afr_transaction_perform_fop() with the on disk value, instead of assigning it in the afr_changelog_pre_op() with the in memory value. Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4 BUG: 1482064 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: coverity fixesRavishankar N2017-11-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | 1.afr_discover_do: COPY_PASTE_ERROR 2.afr_fav_child_reset_sink_xattrs_cbk: REVERSE_INULL 3.afr_fop_lock_proceed: UNUSED_VALUE 4.afr_local_init: CHECKED_RETURN 5.afr_set_split_brain_choice: REVERSE_INULL 6.__afr_inode_write_finalize: FORWARD_NULL 7.afr_refresh_heal_done: REVERSE_INULL 8.afr_xl_op:UNUSED_VALUE 9.afr_changelog_populate_xdata: DEADCODE 10.set_afr_pending_xattrs_option: RESOURCE_LEAK Note: RESOURCE_LEAK complaints about afr_fgetxattr_pathinfo_cbk, afr_getxattr_list_node_uuids_cbk and afr_getxattr_pathinfo_cbk seem to be false alarms. Change-Id: Ia4ca1478b5e2922084732d14c1e7b1b03ad5ac45 BUG: 789278 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Fix for arbiter becoming sourcekarthik-us2017-11-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: When eager-lock is on, and two writes happen in parallel on a FD we were observing the following behaviour: - First write fails on one data brick - Since the post-op is not yet happened, the inode refresh will get both the data bricks as readable and set it in the inode context - In flight split brain check see both the data bricks as readable and allows the second write - Second write fails on the other data brick - Now the post-op happens and marks both the data bricks as bad and arbiter will become source for healing Fix: Adding one more variable called write_suvol in inode context and it will have the in memory representation of the writable subvols. Inode refresh will not update this value and its lifetime is pre-op through unlock in the afr transaction. Initially the pre-op will set this value same as read_subvol in inode context and then in the in flight split brain check we will use this value instead of read_subvol. After all the checks we will update the value of this and set the read_subvol same as this to avoid having incorrect value in that. Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3 BUG: 1482064 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Fail open on split-brainPranith Kumar K2017-10-261-42/+1
| | | | | | | | | | | | | | | | | Problem: Append on a file with split-brain succeeds. Open is intercepted by open-behind, when write comes on the file, open-behind does open+write. Open succeeds because afr doesn't fail it. Then write succeeds because write-behind intercepts it. Flush is also intercepted by write-behind, so the application never gets to know that the write failed. Fix: Fail open on split-brain, so that when open-behind does open+write open fails which leads to write failure. Application will know about this failure. Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15 BUG: 1294051 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: propagate correct errno for fop failures in arbiterRavishankar N2017-05-151-11/+12
| | | | | | | | | | | | | | | | | | | | | | | Problem: If quorum is not met in fop cbk, arbiter sends an ENOTCONN error to the upper xlators. In a VM workload with sharding enabled, this was leading to the VM pausing when replace-brick was performed as described in the BZ. Fix: Move the fop cbk arbitration logic to afr_handle_quorum() because in normal replica volumes, that is the function that has the quorum and errno checks in the fop cbk path before doing a post-op. Thanks to Pranith for suggesting this approach. Change-Id: Ie6315db30c5e36326b71b90a01da824109e86796 BUG: 1449610 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17235 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: don't do a post-op on a brick if op failedRavishankar N2017-04-181-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In afr-v2, self-blaming xattrs are not there by design. But if the FOP failed on a brick due to an error other than ENOTCONN (or even due to ENOTCONN, but we regained connection before postop was wound), we wind the post-op also on the failed brick, leading to setting self-blaming xattrs on that brick. This can lead to undesired results like healing of files in split-brain etc. Fix: If a fop failed on a brick on which pre-op was successful, do not perform post-op on it. This also produces the desired effect of not resetting the dirty xattr on the brick, which is how it should be because if the fop failed on a brick, there is no reason to clear the dirty bit which actually serves as an indication of the failure. Change-Id: I5f1caf4d1b39f36cf8093ccef940118638caa9c4 BUG: 1438255 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/16976 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Do not log of split-brain when there isn't oneKrutika Dhananjay2017-01-111-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | * Even on errors like ENOENT, AFR logs split-brain after read-txn refresh, introduced by commit a07ddd8f. This can be a cause of much panic and confusion and needs to be fixed. * Also fixed this issue in write-txns. * Fixed afr read txns to log about split-brain only after knowing that there is no split-brain choice configured. * Removed code duplication * Fixed incorrect passing of error code in afr_write_txn_refresh_done() (the function was passing -0 as errno to gf_msg(). Change-Id: I354f454ce5bf0e5f00bc27916eb597367cb7d927 BUG: 1411625 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16362 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Fix missing name indices due to EEXIST errorKrutika Dhananjay2016-12-271-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PROBLEM: Consider a volume with granular-entry-heal and sharding enabled. When a replica is down and a shard is created as part of a write, the name index is correctly created under indices/entry-changes/<dot-shard-gfid>. Now when a read on the same region triggers another MKNOD, the fop fails on the online bricks with EEXIST. By virtue of this being a symmetric error, the failed_subvols[] array is reset to all zeroes. Because of this, before post-op, the GF_XATTROP_ENTRY_OUT_KEY will be set, causing the name index, which was created in the previous MKNOD operation, to be wrongly deleted in THIS MKNOD operation. FIX: The ideal fix would have been for a transaction to delete the name index ONLY if it knows it is the one that created the index in the first place. This would involve gathering information as to whether THIS xattrop created the index from individual bricks, aggregating their responses and based on the various posisble combinations of responses, decide whether to delete the index or not. This is rather complex. Simpler fix would be for post-op to examine local->op_ret in the event of no failed_subvols to figure out whether to delete the name index or not. This can occasionally lead to creation of stale name indices but they won't be affecting the IO path or mess with pending changelogs in any way and self-heal in its crawl of "entry-changes" directory would take care to delete such indices. Change-Id: Ic1b5257f4dc9c20cb740a866b9598cf785a1affa BUG: 1408712 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16286 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Fix per-txn optimistic changelog initialisationKrutika Dhananjay2016-12-121-24/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incorrect initialisation of local->optimistic_change_log was leading to skipped pre-op and post-op even when a brick didn't participate in the txn because it was down. The result - missing granular name index resulting in some entries never getting healed. FIX: Initialise local->optimistic_change_log just before pre-op. Also fixed granular entry heal to create the granular name index in pre-op as opposed to post-op. This is to prevent loss of granular information when during an entry txn, the good (src) brick goes offline before the post-op is done. This would cause self-heal to do conservative merge (since dirty xattr is the only information available), which when granular-entry-heal is enabled, expects granular indices, the lack of which can lead to loss of data in the worst case. Change-Id: Ia3ad716d6fb1821555f02180e86e8711a79f958d BUG: 1402730 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16075 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Serialize conflicting locks on all subvolsPranith Kumar K2016-12-061-1/+3
| | | | | | | | | | | | | | | | | | | | | | | Problem: 1) When a blocking lock is issued and the parallel lock phase fails on all subvolumes with EAGAIN, it is not switching to serialized locking phase. 2) When quorum is enabled and locks fail partially it is better to give errno returned by brick rather than the default quorum errno. Fix: Handled this error case and changed op_errno to reflect the actual errno in case of quorum error. BUG: 1369077 Change-Id: Ifac2e4a13686e9fde601873012700966d56a7f31 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15984 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* afr: fix bug in passing child index in afr_inode_write_fillRavishankar N2016-12-061-4/+3
| | | | | | | | | | | Change-Id: I7b70de317a5f15a3bf483ffe40b971143deddc11 BUG: 1401218 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16029 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr, client: More mem-leak fixes in COMPOUND fop cbkKrutika Dhananjay2016-12-041-4/+21
| | | | | | | | | | | | | | | | | | | | | | Bugs found and fixed: 1. Use correct subvolume index in pre-op-writev compound cbk 2. Prevent use-after-free of local->compound_args members in compound fops cbk in protocol/client 3. Fix xdata and xattr leaks in client_process_response 4. Fix possible leak of xdata in client_pre_writev() in test mode. 5. Free req->compound_req_array.compound_req_array_val as well after freeing its members 6. Free tmp_rsp->flock.lk_owner.lk_owner_val in LK fop. Change-Id: I15b646d7d4e0e5cd4ea3d2d6452c815cf2eaf68f BUG: 1401218 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16020 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: fix auto-quorumJeff Darcy2016-11-281-10/+29
| | | | | | | | | | | | | | | | | | | | | | | | | (1) afr_have_quorum is dead code. It was copied to afr_has_quorum, and everything else uses that, but the original was never deleted (until now). (2) Auto-quorum should be default for any N>2. Leaving quorum disabled is BAD, but apparently deemed acceptable for N=2 because there's no real quorum in that case. For any larger number (including arbiter configurations) there is such a thing as real quorum and we should use it by default. Note that for N=3 the answers we get from "N % 2" (the old check) and "N > 2" (the new one) are the same. (3) The special case for even N in afr_has_quorum has been simplified and explained more thoroughly in a comment. Change-Id: I48b33c15093512fecf516b26dcf09afecb7ae33b Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/15873 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Fix the EIO that can occur in afr_inode_refresh as a resultPoornima G2016-11-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of cache invalidation(upcall). Issue: ------ When a cache invalidation is recieved as a result of changing pending xattr, the read_subvol is reset. Consider the below chain of execution: CHILD_DOWN ... afr_readv ... afr_inode_refresh ... afr_inode_read_subvol_reset <- as a result of pending xattr set by some other client GF_EVENT_UPCALL will be sent afr_refresh_done -> this results in an EIO, as the read subvol was reset by the end of the afr_inode_refresh Solution: --------- When GF_EVENT_UPCALL is recieved, instead of resetting read_subvol, set a variable need_refresh in inode_ctx, the next time some one starts a txn, along with event gen, need_rrefresh also needs to be checked. Change-Id: Ifda21a7a8039b8874215e1afa4bdf20f7d991b58 BUG: 1396952 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15892 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: allow I/O when favorite-child-policy is enabledRavishankar N2016-11-271-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently, I/O on a split-brained file fails even when the favorite-child-policy is set until the self-heal is complete. Fix: If a valid 'source' is found using the set favorite-child-policy, inspect and reset the afr pending xattrs on the 'sinks' (inside appropriate locks), refresh the inode and then proceed with the read or write transaction. The resetting itself happens in the self-heal code and hence can also happen in the client side background-heal or by the shd's index-heal in addition to the txn code path explained above. When it happens in via heal, we also add checks in undo-pending to not reset the sink xattrs again. Change-Id: Ic8c1317720cb26bd114b6fe6af4e58c73b864626 BUG: 1386188 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Simon Turcotte-Langevin <simon.turcotte-langevin@ubisoft.com> Reviewed-on: http://review.gluster.org/15673 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Fix bugs in [f]inodelk/[f]entrylkPranith Kumar K2016-11-261-8/+0
| | | | | | | | | | | | | | | | | | | | | | Problems: 1) Inodelk is not taking quorum into account 2) finodelk, [f]entrylk are not implemented correctly 3) By default afr doesn't go for non-blocking parallel locks. Fix: Implemented a common framework which can be used by [f]inodelk/[f]entrylk. Used quorum for the same. Change-Id: I239f13875a065298630d266941df10cfa3addc85 BUG: 1369077 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15802 Tested-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/afr: Fix deadlock due to compound fopsKrutika Dhananjay2016-11-261-14/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an afr data transaction is eligible for using eager-lock, this information is represented in local->transaction.eager_lock_on. However, if non-blocking inodelk attempt (which is a full lock) fails, AFR falls back to blocking locks which are range locks. At this point, local->transaction.eager_lock[] per brick is reset but local->transaction.eager_lock_on is still true. When AFR decides to compound post-op and unlock, it is after confirming that the transaction did not use eager lock (well, except for a small bug where local->transaction.locks_acquired[] is not considered). But within afr_post_op_unlock_do(), afr again incorrectly sets the lock range to full-lock based on local->transaction.eager_lock_on value. This is a bug and can lead to deadlock since the locks acquired were range locks and a full unlock is being sent leading to unlock failure and thereby every other lock request (be it from SHD or other clients or glfsheal) getting blocked forever and the user perceives a hang. FIX: Unconditionally rely on the range locks in inodelk object for unlocking when using compounded post-op + unlock. Big thanks to Pranith for helping with the debugging. Change-Id: Idb4938f90397fb4bd90921f9ae6ea582042e5c67 BUG: 1398566 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15929 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Handle rpc errors, xdr failures etc with proper NULL checksKrutika Dhananjay2016-11-241-6/+17
| | | | | | | | | | | Change-Id: Id8ba76ba116d056bc7299dc5ce0980680a5a23f8 BUG: 1398226 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15924 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: When failing fop due to lack of quorum, also log error stringKrutika Dhananjay2016-11-091-11/+12
| | | | | | | | | | | | Change-Id: I39de2bcfc660f23a4229a885a0e0420ca949ffc9 BUG: 1392865 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15800 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: Take full locks in arbiter only for data transactionsRavishankar N2016-10-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Sharding exposed a bug in arbiter config. where `dd` throughput was extremely slow. Shard xlator was sending a fxattrop to update the file size immediately after a writev. Arbiter was incorrectly over-riding the LLONGMAX-1 start offset (for metadata domain locks) for this fxattrop, causing the inodelk to be taken on the data domain. And since the preceeding writev hadn't released the lock (afr does a 'lazy' unlock if write succeeds on all bricks), this degraded to a blocking lock causing extra lock/unlock calls and delays. Fix: Modify flock.l_len and flock.l_start to take full locks only for data transactions. Change-Id: I906895da2f2d16813607e6c906cb4defb21d7c3b BUG: 1384906 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Max Raba <max.raba@comsysto.com> Reviewed-on: http://review.gluster.org/15641 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* xlators/afr: fix unused variable warnings/errorsKaleb S. KEITHLEY2016-09-151-2/+0
| | | | | | | | | | | | | | | | | | | http://review.gluster.org/14085 fixes a "pragma leak" where the generated rpc/xdr headers have a pair of pragmas that disable these warnings. With the warnings disabled, many unused variables have crept into the code base. And 14085 won't pass its own smoke test until all these warnings are fixed. BUG: 1369124 Change-Id: I4946d460f66d3430bac8910260c0cb1b3e9ff87b Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.org/15513 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Niels de Vos <ndevos@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>