summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src
Commit message (Collapse)AuthorAgeFilesLines
...
* cluster/afr: skip directory inspection when entry self-heal is offAnand Avati2013-08-111-1/+1
| | | | | | | | | | | | When user has explicitly configured to disable entry self-heal in the client, it is wrong to do the healing in opendir. So skip it. This is especially useful to reduce opendir() times after graph switches. Change-Id: Ic6eb9ff2334a5b8417f2f35410a366a536bad5df Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5528 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Unwind frame on error in readdir[p]Pranith Kumar K2013-08-081-2/+2
| | | | | | | | | Change-Id: I5701bf115e0aa1adb4fb52f5418534910a2268d4 BUG: 994959 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5531 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: check for non-zero call_count before doing a stack windRavishankar N2013-08-071-0/+5
| | | | | | | | | | | | | | | | | | | | When one of the bricks of a 1x2 replicate volume is down, writes to the volume is causing a race between afr_flush_wrapper() and afr_flush_cbk(). The latter frees up the call_frame's local variables in the unwind, while the former accesses them in the for loop and sending a stack wind the second time. This causes the FUSE mount process (glusterfs) toa receive a SIGSEGV when the corresponding unwind is hit. This patch adds the call_count check which was removed when afr_flush_wrapper() was introduced in commit 29619b4e Change-Id: I87d12ef39ea61cc4c8244c7f895b7492b90a7042 BUG: 988182 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/5393 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Disable eager-lock if open-fd-count > 1Pranith Kumar K2013-08-024-5/+98
| | | | | | | | | | | | | | | | | Lets say mount1 has eager-lock(full-lock) and after the eager-lock is taken mount2 opened the same file, it won't be able to perform any data operations until mount1 releases eager-lock. To avoid such scenario do not enable eager-lock for transaction if open-fd-count is > 1. Delaying of changelog piggybacking is avoided in this situation. Change-Id: I51b45d6a7c216a78860aff0265a0b8dabc6423a5 BUG: 910217 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5432 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: venkatesh somyajulu <vsomyaju@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Print self-heal log when self-heal succeedsPranith Kumar K2013-07-315-32/+120
| | | | | | | | | Change-Id: I95e47e589419dc6a032cbd8ba01964b6c176c2d5 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5408 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle REPLICATE_TRASH_DIR from old bricksPranith Kumar K2013-07-263-13/+52
| | | | | | | | | Change-Id: Ib99f79d3fa607c818dbc62006516480f598d8add BUG: 886998 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4640 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* storage/posix: implement batched fsync in a single threadAnand Avati2013-07-231-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because of the extra fsync()s issued by AFR transaction, they could potentially "clog" all the io-threads denying unrelated operations from making progress. This patch assigns a dedicated thread to issues fsyncs, as an experimental feature to understand performance characteristics with the approach. As a basis, incoming individual fsync requests are grouped into batches, falling in the same @batch-fsync-delay-usec window of time. These windows can extend in practice, as processing of the previous batch can take longer than @batch-fsync-delay-usec while new requests are getting batched. The feature support three modes (similar to the -S modes of fs_mark) - syncfs: In this mode one syncfs() is issued per batch, instead of N fsync()s (one per file.) - syncfs-single-fsync: In this mode one syncfs() is issued per batch (which, on Linux, guarantees the completion of write-out of dirty pages in the filesystem up to that point) and one single fsync() to synchronize or flush the controller/drive cache. This corresponds to -S 2 of fsmark. - syncfs-reverse-fsync: In this mode, one syncfs() is issued per batch, and all the open files in that batch are fsync()'ed in the reverse order of the queue. This corresponds to -S 4 of fsmark. - reverse-fsync: In this mode, no syncfs() is issued and all the files in the batch are fsync()'ed in the reverse order. This corresponds to -S 3 of fsmark. Change-Id: Ia1e170a810c780c8d80e02cf910accc4170c4cd4 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4746 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle parallel hardlinks self-healPranith Kumar K2013-07-231-0/+29
| | | | | | | | | Change-Id: Ieda11870c65edae500140b6c061f15a7b3f264f3 BUG: 986905 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5370 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* afr: customize client-pid=-1 xtime aggregation to tolerate a replica downAvra Sengupta2013-07-151-2/+9
| | | | | | | | | | | | Using the new 'pluggable policies' API of libxlator. Change-Id: Ie7528182dff8fb42c6e8287a106d3057944df775 BUG: 847839 Original Author: Csaba Henk <csaba@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4904 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* libxlator: implement pluggable aggregation policiesAvra Sengupta2013-07-151-0/+2
| | | | | | | | | | | | | | | | | The API is described in libxlator.h. Behavior remains the same for this commit; this is a preparatory step for per-translator customization of aggregation. Change-Id: I5d42923af59b2fd78e1ff59c12763875b57c5190 BUG: 847839 Original Author: Csaba Henk <csaba@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4903 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/*: get logic to calculate min() of the 'stime' xattrAvra Sengupta2013-07-141-0/+58
| | | | | | | | | | | | | | | | | | | | | | * in both distribute and replicate (ignoring stripe for now), add logic to calculate the min() of stime values. * What is a 'stime' ? Why is this required: - stime means 'slave xtime', mainly used to keep track of slave node's sync status when distributed geo-replication is used. Logic of calculating 'min()' for this stime is very important as in case of crashes/reboots/shutdown, we will have to 'restart' with crawling from stime time value from the mount point, which gives the 'min()' of all the bricks, which means, we don't miss syncing any files in the above cases. Change-Id: I2be8d434326572be9d4986db665570a6181db1ee BUG: 847839 Original Author: Amar Tumballi <amarts@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4893 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* afr : change the log level in lookup path to minimize incessant logging.Ravishankar N2013-07-081-3/+3
| | | | | | | | | | | | | Change the logging levels from WARNING to DEBUG in the lookup path to minimize incessant logging in case of gfid mismatch errors. Change-Id: I631b16df3249cf826606f547531f985dac696088 BUG: 959083 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/4939 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Let two data-self-heals compete in new domainPranith Kumar K2013-07-034-23/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: At the moment data-self-heal acquires locks in following pattern. It takes full file lock then gets xattrs on files on both replicas. Decides sources/sinks based on the xattrs. Now it acquires lock from 0-128k then unlocks the full file lock. Syncs 0-128k range from source to sink now acquires lock 128k+1 till 256k then unlocks 0-128k, syncs 128k+1 till 256k block... so on finally it takes full file lock again then unlocks the final small range block. It decrements pending counts and then unlocks the full file lock. This pattern of locks is chosen to avoid more than 1 self-heal to be in progress. BUT if another self-heal tries to take a full file lock while a self-heal is already in progress it will be put in blocked queue, further inodelks from writes by the application will also be put in blocked queue because of the way locks xlator grants inodelks. So until the self-heal is complete writes are blocked. Here is the code: xlators/features/locks/src/inodelk.c - line 225 if (__blocked_lock_conflict (dom, lock) && !(__owner_has_lock (dom, lock))) { ret = -EAGAIN; if (can_block == 0) goto out; gettimeofday (&lock->blkd_time, NULL); list_add_tail (&lock->blocked_locks, &dom->blocked_inodelks); } This leads to hangs in applications. Fix: Since we want to prevent two parallel self-heals. We let them compete in a separate "domain". Lets call the domain on which the locks have been taken on in previous approach as "data-domain". In the new approach When a self-heal is triggered, it acquires a full lock in the new domain "self-heal-domain". After this it performs data-self-heal using the locks in "data-domain" as before. unlock the full file lock in "self-heal-domain" With this approach, application's writevs don't have to wait in pending queue when more than 1 self-heal is triggered. Change-Id: Id79aef3dfa888945977fb9758374ac41c320d0d5 BUG: 967717 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5100 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Refactor inodelk to handle multiple domainsPranith Kumar K2013-07-0311-428/+500
| | | | | | | | | | | | | | | | | | - afr_local_copy should not be memduping locked nodes, that would mean that lock is taken in self-heal on those nodes even before it actually takes the lock. So removed memdup code. Even entry lock related copying (lockee info) is also not necessary for self-heal functionality, so removing that as well. Since it is not local_copy anymore changed its name. - My editor changed tabs to spaces. Change-Id: I8dfb92cb8338e9a967c06907a8e29a8404782d61 BUG: 967717 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5099 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Provide an option to disable afr durabilityPranith Kumar K2013-07-033-3/+23
| | | | | | | | | Change-Id: I40eec20ca6b3f857245a2438883822e251077ee9 BUG: 979365 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5269 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: post-op should complete before starting flushPranith Kumar K2013-07-033-36/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: At the moment afr-flush makes sure that a delayed post-op is woken up but it does not wait for it to complete the post-op before flush unwinds. These are the steps that are happening: 1) flush fop comes on an fd which wakes up a delayed post-op and continues with the flush fop. 2) post-op sends fsync on the wire. 3) flush completes and unwinds to fuse. 4) graph switch happens on the fuse mount disconnecting the old graph's client connections to bricks. 5) xattrop after fsync fails with ENOTCONN because the connections from old graph are taken down now. Fix: Wait for post-op to complete before starting to flush. We could make flush act similar to fsync (i.e.) wind flush as is but wait for post-op to complete before unwinding flush, but it is better to send flush as the final fop. So wind of flush will start after post-op is complete. Had to change fsync to accommodate this change. Change-Id: I93aa642647751969511718b0e137afbd067b388a BUG: 980548 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5274 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Allow data/entry self heal for metadata split-brainVenkatesh Somyajulu2013-07-027-134/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently whenever there is metadata split-brain, a variable sh->op_failed is set to 1 to denote that self heal got failed. But if we proceed for data self heal, even code-path of data self heal also relies on the sh->op_failed variable. So if will check for sh->op_failed variable and will eventually fails to do data self heal. So needed a mechanism to allow data self heal even if metadata is in split brain. Fix: Some data structure revamp is done in http://review.gluster.com/#/c/5106/ fix and this patch is based on the above fix. Now we can store which particular self-heal got failed i.e GFID_OR_MISSING_ENTRY_SELF_HEAL, METADATA, DATA, ENTRY. And we can do two types of self heal failure check. 1. Individual type check: We can check which among all four (Metadata, Data, Gfid or missing entry, entry self heal) got failed. 2. In afr_self_heal_completion_cbk, we need to make check based on the fact that if any specific self heal got failed treat the complete self heal as failure so that it will populate corresponding circular buffer of event history accordingly. Change-Id: Icb91e513bcc752386fc8a78812405cfabe5cac2d BUG: 977797 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/5253 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle NULL fdctx in fsyncPranith Kumar K2013-06-271-1/+4
| | | | | | | | | | | | | | | | | | | Problem: If fdctx is NULL in afr_fsync, process crashes because of NULL dereference. Fix: if fdctx is NULL, always say witnessed unstable write so that fsyncs are done properly. Handled fdctx being null in afr_delayed_changelog_post_op otherwise fsync stub is never resumed and the mount was hanging. Change-Id: Icacc900e9be63c29db3325cb0e19cc250adebaac BUG: 978794 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5258 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* nfs: Remove afr split-brain handling in nfsPranith Kumar K2013-06-251-5/+0
| | | | | | | | | | | | | | | | We added this code as an interim fix until afr can handle split-brains even when opens are not issued. Afr code has matured to reject fd based fops when there are split-brains so we can remove it. Change-Id: Ib337f78eccee86469a5eaabed1a547a2cea2bdcf BUG: 974972 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5227 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Fix fd/memory leak on fsyncPranith Kumar K2013-06-241-1/+1
| | | | | | | | | Change-Id: I764883811e30ca9d9c249ad00b6762101083a2fe BUG: 976800 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5248 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: Perform delayed changelog wakeups for anon fdPranith Kumar K2013-06-201-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Problem: Nfs xlator never does open on a file for performing writes, afr does not perform changelog wakeup for this fd so operations which do metadata operations as soon as the data operations are completed perceive a delay of 'post-op-delay-secs'. Fix: Perform changelog wakeup on anon-fd if the fd with same pid is not present in inode-list. Note: This approach is a short-term fix. A proper fix needs a new domain for taking metadata locks so that data/metadata locks don't compete with each other. Change-Id: I253afb289eadf30c7951e56fb2c4840d7132f5e4 BUG: 966018 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5066 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Improvement in logging of self heal completion statusVenkatesh Somyajulu2013-06-137-83/+280
| | | | | | | | | | | | | | | | | | | | | | Problem: As the end of the self heal, message logged by "afr_self_heal_completion_cbk" is inadequate to determine what exactly failed during the course of afr self heal. It is worth to have knowledge of what all types of self heal got triggered for an entity and whether the status is success or failure. Fix: At the end of self heal, it will log information about out of 4 types of self heal (gfid or missing entry self heal, metadata, data and entry self heal), who all got triggered and who all got failed or successful at the end. Change-Id: I5360762fbd7d391ac4c6af6706b4835c5801835a BUG: 968301 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/5106 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterfs: discard (hole punch) supportBrian Foster2013-06-133-0/+250
| | | | | | | | | | | | | | | | Add support for the DISCARD file operation. Discard punches a hole in a file in the provided range. Block de-allocation is implemented via fallocate() (as requested via fuse and passed on to the brick fs) but a separate fop is created within gluster to emphasize the fact that discard changes file data (the discarded region is replaced with zeroes) and must invalidate caches where appropriate. BUG: 963678 Change-Id: I34633a0bfff2187afeab4292a15f3cc9adf261af Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/5090 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* gluster: add fallocate fop supportBrian Foster2013-06-133-0/+253
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement support for the fallocate file operation. fallocate allocates blocks for a particular inode such that future writes to the associated region of the file are guaranteed not to fail with ENOSPC. This patch adds fallocate support to the following areas: - libglusterfs - mount/fuse - io-stats - performance/md-cache,open-behind - quota - cluster/afr,dht,stripe - rpc/xdr - protocol/client,server - io-threads - marker - storage/posix - libgfapi BUG: 949242 Change-Id: Ice8e61351f9d6115c5df68768bc844abbf0ce8bd Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4969 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Avoid order mismatch in blocking entrylksPranith Kumar K2013-06-051-6/+9
| | | | | | | | | | | | | | | | | | | | | | | Problem: When taking blocking entrylks, afr orders the entrylks based on uuid_compare of gfids of parent dirs, if they are equal then it orders them based on the basenames. While this approach works fine, the implementation assumes loc->gfids to be populated at the time of the comparison, but loc may have gfid in loc->inode->gfid instead of loc->gfid which was leading to order mismatches and dead-locks. Fix: Implemented loc_gfid which gives gfid by checking both loc->gfid, loc->inode->gfid. Used this for ordering the blocking entrylks. Change-Id: Ib0db36bbaf0df09fa87c3c3bb6a834db74fc2154 BUG: 965987 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5062 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Removed misleading log from afr_start_crawlVenkatesh Somyajulu2013-05-311-2/+2
| | | | | | | | | | | | | | | Problem: If it is fresh volume with no files created the xattrop directory is not present. If crawl happens in that time, lookup for xattrop directory will fail and it results in printing a misleading log. Fix: Changed misleading log. Change-Id: Iae5da3e8423564d64096f88abdaf8c98e4935840 BUG: 928575 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4993 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Club missing entry, missing gfid self-healsPranith Kumar K2013-05-071-45/+8
| | | | | | | | | | | | | | | | | | | | | Problem: gfid-self-heal always assigns the gfid(GFID-1) it gets from lookup. Between the time of lookup to triggering the gfid-self-heal the entry could have changed. Now lets say there is a case where one of the files of the replica subolumes already has a gfid (GFID-2) and the other does not. In that case healing should happen with GFID-2 instead of GFID-1. Fix: Missing-entry-self-heal already handles all these cases. So removed separate handling of gfid-self-heal. Change-Id: Ie96261e9036c8f3cb4cad89347f9bf7b681cdc1a BUG: 767585 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/2670 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Avoid self-healing extended attribute used by SELinux.Vijay Bellur2013-04-301-0/+8
| | | | | | | | | | | | | | | | | | Since removexattr() fails to remove "security.selinux" in a system where SELinux is enforcing, xattr self-healing fails. As a consequence of this, user extended attributes are not being healed. Added a check in afr to prune SELinux xattr from the dictionary used for removing xattrs from the sink. Minor changes in tests and md-cache as well. Signed-off-by: Vijay Bellur <vbellur@redhat.com> Change-Id: I854bfc0098dde812ce2afe64b125ee40c04bdeb1 BUG: 957877 Reviewed-on: http://review.gluster.org/4905 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Added documentation for eager-lock checkPranith Kumar K2013-04-221-0/+17
| | | | | | | | | Change-Id: Ifa42762adde8b55ef1e2b51a59c93cebd983343f BUG: 912581 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4792 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: let eager-locking do its own overlap checksAnand Avati2013-04-053-2/+87
| | | | | | | | | | | | | | | | | | | | | | | | Today there is a non-obvious dependence of eager-locking on write-behind. The reason is that eager-locking works as long as the inheriting transaction has no overlaps with any of the transactions already in progress. While write-behind provides non-overlapping writes as a side-effect most of times (and only guarantees it when strict-write-ordering option is enabled, which is not on by default) eager-lock needs the behavior as a guarantee. This is leading to complex and unwanted checks for the presence of write-behind in the graph, for the simple task of checking for overlaps. This patch removes the interdependence between eager-locking and write-behind by making eager-locking do its own overlap checks with in-progress writes. Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f BUG: 912581 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4782 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Treat all dir fop failure as success in changelogPranith Kumar K2013-04-033-2/+35
| | | | | | | | | | | | | | | For example: If a new entry creation fop fails with EEXIST or a delete entry fop fails with ENOENT, on all the subvols the fop is wound, then no change took place to the directory. So we can treat that case as no change happened to the directory. Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751 BUG: 860210 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4626 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Made afr_sh_purge_entry_common message log more clear.Venkatesh Somyajulu2013-04-031-3/+1
| | | | | | | | | | | | | | | FIX: In missing entry self heal, once the source directories are determined after the lookup and if file is not present on any of the brick which contains the souce directory, the entry is removed from the directory. So log message should give information of "Purging of entry". Change-Id: I4d3deb602e0812dc1c9c8ba0a466716d81dede7e BUG: 947312 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4753 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* pump: Set self-heal readdir size in pumpPranith Kumar K2013-04-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In Pump entry self-heal happens for each directory during the first opendir using conservative merge. But in entry-self-heal readdir is issued with '0' size. So entry self-heal is not creating any files. After pump thinks entry self-heal is complete it proceeds to heal each of the file in the directory it just healed. Fortunately most of the times it chooses source-brick in pump as read-child for readdir. This happens because readchild is the subvolume on which lookup succeeds first. In pump lookup succeeds faster in local process than on the destination brick process most of the times. For all the entries pump finds in readdir it does a lookup. During this lookup the entry on the destination brick is created and healed. This is the reason why replace-brick succeeds whenever read-child for the directory is chosen as the source-brick. Which is most of the times. When read-child is chosen as the destination brick, readdir returns no entries so replace-brick completes without syncing the whole data. Fix: Set readdir-size in pump so that entry self-heal happens with 64k size. This ensures that entry self-heal triggered from opendir actually creates the files on the destination brick. Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca BUG: 909800 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4712 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: detect in-progress creation in lookup and return ENOENTPranith Kumar K2013-04-021-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if any subvol returned ENOENT while parent entrylk lock was held, yield and return ENOENT for the entire lookup. This is how the issue happens: Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c' 1 Client A is in the middle of mkdir(/a). It has acquired lock. It has performed mkdir(/a) on one subvol, and second one is still in progress 2 Client B performs a lookup, sees directory /a on one, ENOENT on the other, succeeds lookup. 3 Client B performs lookup on /a/b on both subvols, both return ENOENT (one subvol because /a/b does not exist, another because /a itself does not exist) 4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with basename=b on one subvol, but fails on other subvol as /a is yet to be created by Client A. 5 Client A finishes mkdir of /a on other subvol 6 Client C also attempts to create /a/b, lookup returns ENOENT on both subvols. 7 Client C tries to obtain entrylk on on inode=/a with basename=b, obtains on one subvol (where B had failed), and waits for B to unlock on other subvol. 8 Client B finishes mkdir() on one subvol with GFID-1 and completes transaction and unlocks 9 Client C gets the lock on the second subvol, At this stage second subvol already has /a/b created from Client B, but Client C does not check that in the middle of mkdir transaction 10 Client C attempts mkdir /a/b on both subvols. It succeeds on ONLY ONE (where Client B could not get lock because of missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol. This way we have /a/b in GFID mismatch. One subvol got GFID-1 because Client B performed transaction on only one subvol (because entrylk() could not be obtained on second subvol because of missing parent dir -- caused by premature/speculative succeeding of lookup() on /a when locks are detected). Other subvol gets GFID-2 from Client C because while it was waiting for entrylk() on both subvols, Client B was in the middle of creating mkdir() on only one subvol, and Client C does not "expect" this when it is between lock() and pre-op()/op() phase of the transaction. Original-author: Anand Avati <avati@redhat.com> Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208 BUG: 860210 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4625 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: sync xattrs removed on source to sink(s)Venky Shankar2013-04-021-12/+85
| | | | | | | | | | | | | xattrs are first removed from sink followed by setting source xattrs. Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b BUG: 906646 Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-on: http://review.gluster.org/4656 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: prevent piggyback on stale pre_opPranith Kumar K2013-04-021-33/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here are the logs of a file on which we saw EIO because of size mismatch: [root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680 Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680 Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680 Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716 fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1 According to these logs fsync did not happen after writev with offset: 79360, len: 15716. Which is the reason for this problem. In total 3 writes came. lets call them w1, w2, w3 w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1 then is_piggyback_post_op() is called for w1 and it returns *false* w1's fsync is fired Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice, once by w2, one more time by w3 and become 2, 2 ------- Step-A Now fsync of w1 is complete and it goes ahead with post op and decrements pre_op_done[0], pre_op_done[1] to 0, 0 Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for both w2, w3. So fsync is not fired for both w2, w3 this patch prevents Step-A from happening. Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4752 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: fix fd leak with unsafe call_resume()Anand Avati2013-03-282-1/+17
| | | | | | | | | | | | | | Introduce AFR_CALL_RESUME macro which cleans up frame->local, like how AFR_STACK_UNWIND etc. do. Therefore fix leak in afr_fsync() path. Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4745 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: fsync before erase xattrs in data self-healPranith Kumar K2013-03-281-1/+75
| | | | | | | | | | | | Added extra fsync to data self-heal code to make sure the data reached disk before erasing the changelogs Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4744 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: piggyback and fsync resume changesPranith Kumar K2013-03-282-15/+18
| | | | | | | | | | | | | | 1) pre_op_piggyback should always be decremented. 2) Move fsync resume to just after post_op. 3) fsync stub should be created from afr's local not from the final response. Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4741 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: fsync() guarantees POST-OP completionAnand Avati2013-03-273-10/+56
| | | | | | | | | | | | | | | | | | | | | | | | AFR now provides a stronger guarantee that fsync() returns only after completely finishing all the deferred/delayed POST-OP on that open file. To acheive this we make a stub out of the returning fsync and register it with the "delayed" frame in afr_changelog_wake_resume(). The delayed frame, after getting woken up and finishing the POST-OP will call_resume() the registered stub (which UNWINDs the fsync) at the time of frame destruction. This provides a guarantee that an application's (or FUSE) fsync() returns only after finishing up all the previous transactions, including delayed POST-OPs and UNLOCK. Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4737 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: ensure DATA operations are made durable before POST-OPAnand Avati2013-03-274-22/+314
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The changelogging scheme of AFR stores information about the state of all replicas in all replicas (in the extended attribute of the respective files on each server) in the form of 'pending counts' of operations (effectively "dirty flags"). These xattrs are blindly trusted while performing self-heal, and therefore utmost care has to be taken while updating and maintaing them. The most critical updation is the clearing of the pending counts corresponding to the *other* server in the changelog of a given server. Before clearing the pending count, we need durability guarantee of the write which was performed on the other server. To obtain such a guarantee, it may be necessary to explicitly introduce an fsync() phase (if the file itself wasn't already opened with O_SYNC). This patch introduces the detection of unstable stable writes on a file and issues explicit fsync() on the servers before performing the POST-OP clearing of pending flags. Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4721 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* nfs, afr: Fail lookup only on split-brainPranith Kumar K2013-03-201-0/+4
| | | | | | | | | | | Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65 BUG: 921996 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4699 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Preserve mtime in self-healPranith Kumar K2013-03-121-14/+25
| | | | | | | | | | | | | | | | | | | | | | | Problem: Data self-heal may choose sink iatt to set mtimes. This happens because after syncing of data is done self-heal does one more xattrops/fstat to determine sources sinks to set the inode-ctx. Since this is done after data syncing and erase of xattrs, old source and old sink are now sources, but the mtimes of them differ. Old code just takes the first source from the list and update mtimes, which could be sink before the self-heal started. Fix: Set mtime from 'sources before syncing'. Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723 BUG: 918437 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4658 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/Csaba Henk2013-03-031-2/+2
| | | | | | | | | Change-Id: I03be3cb23684de4ab36cf2953002708466edd580 BUG: 765433 Signed-off-by: Csaba Henk <csaba@redhat.com> Reviewed-on: http://review.gluster.org/4601 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Turn on eager-lock for fd DATA transactionsPranith Kumar K2013-03-012-21/+7
| | | | | | | | | | | | | | | | | | | | Problem: With the present implementation, eager-lock is issued for any fd fop. eager-lock is being transferred to metadata transactions. But the lk-owner is set to local->fd address only for DATA transactions, but for METADATA transactions it is frame->root. Because of this unlock on the eager-lock fails and rebalance hangs. Fix: Enable eager-lock for fd DATA transactions Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a BUG: 916226 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4588 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Don't queue transactions during open-fd fixPranith Kumar K2013-02-226-385/+167
| | | | | | | | | | | | | | | | | | | | | Before Anonymous fds are available, afr had to queue up transactions if the file is not opened on one of its subvolumes. This happens until the attempt to open the file either succeeds or fails. These attempts happen until the file is successfully opened on the subvolume. Now client xlator uses anonymous fds to perform the fops if the fd used for the fop is not 'opened'. Fops will be successful even when the file is not opened so there is no need to queue up the transactions anymore in afr. Open is attempted on the subvolume where it is not opened independent of the fop. Change-Id: Id1a4b4ebe6f89f9efe8f6a8247918b91247d0819 BUG: 913051 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4568 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: do complete split-brain check in all the fd based fopsRaghavendra Bhat2013-02-194-19/+33
| | | | | | | | | | | | | | | | | fd based operations such as readv checked only for data split brain instead of complete split-brain (i.e both data + metadata) assuming that open would have done the complete split-brain check. However open-behind would have unwound open, without winding to afr thus preventing the complete split-brain check and some appliations will be able to read the contents of the file even though the file has metadata split-brain. So let all the fd based fops do a defensive check of complete split-brain. Change-Id: Ia90b35f2b08426dfcad804b7f8105278c86fbd2d BUG: 846240 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4548 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Use proper libtool option -avoid-version instead of bogus -avoidversionAnand Avati2013-02-071-2/+2
| | | | | | | | | | Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8 BUG: 859835 Signed-off-by: Anand Avati <avati@redhat.com> Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Reviewed-on: http://review.gluster.org/3967 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: serialize modification of {entrylk,inodelk}_lock_countAnand Avati2013-02-071-53/+54
| | | | | | | | | | | | | | Typically this lock was not needed in practice, but with http://review.gluster.org/3842, this code gets executed in multiple threads for different servers and we lose a count. This results in leaked lock and a hang for a future transaction. Change-Id: I377ed20e44f2a45cff522289dfef181f0653eca2 BUG: 765564 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4480 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Avoid priv->eager_lock value update racePranith Kumar K2013-02-064-4/+6
| | | | | | | | | | Change-Id: I7049c0c64e36a9dfa4cc0e0b34de7ec111d2f6c1 BUG: 908302 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4076 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>