summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-transaction.c
Commit message (Collapse)AuthorAgeFilesLines
* Revert "cluster/afr: eager locking of FD writes"v3.2.4qa3Vijay Bellur2011-09-221-1/+9
| | | | | | | | | This reverts commit 81456ec2dfb312ae60c5c4e6f960a3cbf8aaaa4c. Change-Id: Id03335117f5137f5d09781850bf4fba6eca0f73d Reviewed-on: http://review.gluster.com/492 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: eager locking of FD writesAnand Avati2011-09-081-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a change in the way write transactions hold a lock which optimizes the case of sequential writes from a single writer. Lock phase of a transaction has two sub-phases. First is an attempt to acquire locks in parallel by broadcasting non-blocking lock requests. If lock aquistion fails on any server, then the held locks are unlocked and revert to a blocking locked mode sequentially on one server after another. The change in this patch is to make the initial broadcasting lock request attempt to acquire lock on the entire file. If this fails, we revert back to the sequential "regional" blocking lock as before. In the case where such an "eager" lock is granted in the non-blocking phase, it gives rise to an opportunity for optimization. i.e, if the next write transaction on the same FD arrives before the unlock phase of the first transaction, it "takes over" the full file lock. Similarly if yet another transaction arrives before the unlock phase of the "optimized" transaction, that in turn "takes over" the lock as well. The actual unlock now happens at the end of the last "optimzed" transaction. Any operation which arrives before the unlock phase of the previous transaction is a potential candidate to become an "optimized" transaction. In cases where the previous transaction had aquired lock as a "regional" blocking lock, and the next transaction comes in before its unlock phase, then it would not be an "optimized" transaction. Implied assumption ------------------ Since two or more transactions can now operate within the same large lock, there is a possibility that overlapping transactions can arrive at oppoosite orders on the servers. However in the larger picture this is not possible as write-behind already ensures that no two overlapping writes on an inode are in transit at the same time. Overlapping writes across clients are not a problem as they compete at locks anyways. Theoretical benefits and potential harms ---------------------------------------- In case of a single writer: The benefits are large for sequential writes. In the best case the entire file write can happen with just one lock and unlock per server, provided writes are coming in fast enough and getting pipelined by write-behind soon enough (which is usually the case). If the writes are not coming in fast enough, then the optimization "kicks in" for only those subsets of writes which are close enough to get "piggybacked". For random writes the benefits are the same as well. In any case the overall performance is better than or equal to the performance without this optimization for a single writer. In case of multiple writers: When multiple writers are not writing concurrently, there is no negative performance impact. When multiple writers are writing concurrently to the same region, there is no negative impact either, as they were previously getting arbitrated at the locks translator too. In the case of multiple writers writing to different regions concurrently, there will be an increased number of "failovers" from failed parallel non-blocking to sequential blocking regional locks. This above "worst case" has a simple workaround that as soon as we detect > 1 open-fd-count in lookup xattr, we can disable this optimization on those fds. Beneficial side-effects ----------------------- There is another similar optimization in AFR for changelogs which goes by the name of "changelog-piggybacking". That works in a similar way where pending flags get 'taken over' or 'piggybacked' by the next transaction if its 'pre-op' phase kicks in before the 'post-op' phase of the previous transaction. It has been observed that this changelog-piggybacking optimization gives a saving of about ~55% savings of xattr calls hitting the wire, measured across various types of network interfaces. The side effect of this eager-lock optimization is that it gives an almost 100% saving of xattr calls by making the optimistic-changelog work much more efficiently as it gives a wider overlap of the xattr phases of two consecutive transactions. Change-Id: I41c02eb3b64c14c68ef66a344610ec3f024cd59d BUG: 3409 Reviewed-on: http://review.gluster.com/243 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* cluster/afr: Fix endian conversion for fxattropPranith Kumar K2011-08-181-1/+1
| | | | | | | | Change-Id: Ie67c4da49876555c3162909e474b9089a85f99a6 BUG: 3182 Reviewed-on: http://review.gluster.com/256 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* Change Copyright current yearPranith Kumar K2011-08-101-1/+1
| | | | | | | | Change-Id: Id1f1a91cf15d933d5621a0073ddaebe02df0f159 BUG: 3348 Reviewed-on: http://review.gluster.com/198 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* LICENSE: s/GNU Affero General Public/GNU General Public/Pranith Kumar K2011-08-061-3/+3
| | | | | | | | Change-Id: Ibf5f45431d7a55b70d7304649af652d6f25bb688 BUG: 3348 Reviewed-on: http://review.gluster.com/183 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* cluster/afr: log enhancement - part 3Amar Tumballi2011-04-011-19/+15
| | | | | | | | Signed-off-by: Amar Tumballi <amar@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 2346 (Log message enhancements in GlusterFS - phase 1) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2346
* cluster/afr: white-space cleanup - part 2Amar Tumballi2011-03-311-211/+211
| | | | | | | | Signed-off-by: Amar Tumballi <amar@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 2346 (Log message enhancements in GlusterFS - phase 1) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2346
* cluster/afr: Fix wrong memory allocationPranith K2011-03-141-4/+5
| | | | | | | | Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 2517 (the size of allocated memory may be wrong) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2517
* replicate: optimistic changelogAnand Avati2010-11-091-13/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The standard way of maintaining changelog in replicate has been to write out pending flags and to unset the pending flag post the actual operation. This new optimization kicks in only when all subvolumes are up. The optimization is that, during pre-op, no changelog is written for METADATA and ENTRY/RENAME operations. If during the operation nothing failed, no changelog is updated in post-op either. If however, something does fail during an operation, then, pending flags get written during post op pointing only towards the failed nodes. DATA transactions continue to work the way they are. If one subvolume is down, pending flags are written in pre-op changelog itself as before. The impact of this optimization is only in the case when both servers die or the client dies while the 'FOP' stage of the transaction is in progress. By nature of METADATA and ENTRY operations, detecting a mismatch later is not dependent on the presence of changelog. Changelog only determines the direction in which self-heal happens for these types of transactions. For the direction too this optimization does not have a major impact because in the cases of failure (both servers dieing or client dieing) the final state (direction of self-heal) would be arbitrary anyways as the syscall wouldn't have completed. Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 2068 (performance enhancements) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2068
* Copyright changesVijay Bellur2010-10-111-1/+1
| | | | | | | | Signed-off-by: Vijay Bellur <vijay@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 971 (dynamic volume management) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=971
* Change GNU GPL to GNU AGPLPranith K2010-10-041-3/+3
| | | | | | | | Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1388 () URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1388
* replicate: keep read_child in inode ctx as up-to-date as possibleAnand Avati2010-09-291-1/+49
| | | | | | | | | | | | | In every transaction check if the currently set read child in the inode context failed in the fop and set it to another subvol on which the latest fop has passed. This will prevent read fops landing on subvols which have witnessed a failure. Signed-off-by: Anand V. Avati <avati@amp.gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1172 (ls -lh on NFS mount of 2-mirror replicate gives incorrect file size) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1172
* replicate: replace first-write-to-flush optimizationAnand V. Avati2010-09-291-405/+363
| | | | | | | | | | | | use a changelog piggybacking optimization instead of first-write-to-flush optimization and do other cleanups (removal of post-post-op hook etc.) Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@amp.gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1235 (Bug for all pump/migrate commits) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1235
* whitespace cleanupAnand V. Avati2010-09-291-72/+81
| | | | | | | | | Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@amp.gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1235 (Bug for all pump/migrate commits) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1235
* cluster/afr: Use 2 phase locking for transactions and self heal.Pavan Sondur2010-08-221-474/+194
| | | | | | | | Signed-off-by: Pavan Vilas Sondur <pavan@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 960 () URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=960
* add pump xlator and changes for replace-brickPavan Sondur2010-08-061-1/+1
| | | | | | | | Signed-off-by: Pavan Vilas Sondur <pavan@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 1235 (Bug for all pump/migrate commits) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1235
* cluster/afr: Check for call_count in ENTRY_RENAME_TRANSACTIONVijay Bellur2010-04-201-1/+2
| | | | | | | | Signed-off-by: Vijay Bellur <vijay@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 824 (Crash in afr rename transaction) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=824
* afr: handle fdctx->pre_op_done handlingAnand Avati2009-11-291-0/+1
| | | | | | | | | | | | reset pre_op_done[i] to 0 after issuing a postop in flush. this was missed during the introduction of pre_op_done[] array and was resulting in a lot of spurious self heals when spurious flushes were received Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Include "common-utils.h" instead of alloca.hVikas Gorur2009-11-261-1/+1
| | | | | | | | | | | alloca.h should be included on a platform-specific basis. Lets common-utils.h handle that. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 349 (FreeBSD compilation error (alloca.h).) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=349
* cluster/afr: Do self-heal on unopened fds.Vikas Gorur2009-11-251-4/+37
| | | | | | | | | | | | | | This patch completes the previous patch for self-heal of open fds in replicate. If an fd was never opened on a subvolume, we remember that and do the open after we've done self-heal on that fd. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Do self-heal on reopened fds.Vikas Gorur2009-11-241-18/+112
| | | | | | | | | | | | | | | | | | | | | | | This patch brings in partial support for self-heal of open fds. The precondition is that the fd should have been opened successfully during the initial open() (or create()), and we assume that protocol/client has successfully reopened the fd when the subvolume comes back up. It works by doing an "up/down flush" (a dummy flush transaction to do post-op wherever necessary) and then triggering data self-heal on the file in the post-post-op hook of the dummy flush transaction. This ensures that any writes that come in during self-heal will wait until self-heal completes. The up/down flush is also done when a subvolume goes down, so that post-op is done on all subvolumes where pre-op was done. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Provide a post-post_op hook in the transaction.Vikas Gorur2009-11-241-6/+20
| | | | | | | | Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Hold blocking locks for data self-heal.Vikas Gorur2009-11-241-1/+1
| | | | | | | | | | | Data self-heal now holds blocking locks, and instead of locking on all subvolumes, it only locks on {data-lock-server-count} subvolumes. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Unlock only those paths which have been locked during rename.Vikas Gorur2009-11-241-77/+142
| | | | | | | | | | | For ENTRY_RENAME_TRANSACTIONs, keep track separately whether the lower_path and the higher_path have been locked, and unlock only those which have been. Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* afr transaction: fix op_ret check during lockingAnand Avati2009-10-131-3/+3
| | | | | | | Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* afr transaction prevent spurious unlocksAnand Avati2009-10-131-2/+4
| | | | | | | | | mark a subvol with held lock only if op_ret == 0 Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* cluster/afr: Hold second lock after first lock has been granted for rename ↵Vikas Gorur2009-10-121-30/+84
| | | | | | | | | | | | transactions. Hold the lock on the {higher_path} only after the lock on the {lower_path} has been granted successfully. Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* Changed occurrences of Z Research to Gluster.Vijay Bellur2009-10-071-1/+1
| | | | Signed-off-by: Anand V. Avati <avati@dev.gluster.com>
* cluster/afr - use different dictionaries for sending xattrop requests to ↵Basavanagowda Kanur2009-06-301-24/+50
| | | | | | | | | | | | | | | | each of the subvolume - This patch fixes bug #29. - Using separate copies of dictionaries also eliminates a potential bug in a setup consisting of afr with a posix and client, each having io-threads on top as children. Since posix_xattrop after performing required operations on the xattr array passed in dictionary, sets the result at the same key and in the same dictionary passed as input argument, there can be race conditions where in the results of the operation on posix-child can be sent to the other child as input argument for xattrop, which ofcourse is wrong. Signed-off-by: Anand V. Avati <avati@dev.gluster.com>
* Cleaned up log messages in replicate.Vikas Gorur2009-04-241-4/+5
| | | | Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* afr-transaction: handle double flushesAnand V. Avati2009-04-201-1/+3
| | | | __if_fd_pre_op_done - reset fd_ctx->pre_op_done to 0 so that double flushes do not result in two xattrop() calls
* Use original pid when calling the FOP in afr transaction.Vikas Gorur2009-04-161-2/+39
| | | | | | | | Save the original pid while locking and restore it after the FOP is done. This ensures posix-locks can release locks (fcntl) properly. Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* Changed xattr format of afr changelog to support adding and removing of ↵Vikas Gorur2009-04-161-45/+87
| | | | | | subvolumes while keeping existing data. Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* Fix in changelog logic.Vikas Gorur2009-04-071-18/+108
| | | | | | If a writev fails, remember it by marking it in the fd context. Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* Add extra 'volume' parameter to inodelk/entrylk callsVikas Gorur2009-03-121-10/+17
| | | | Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* check for fd ctx set in changelog_needed_post_op for flush() in afrVikas Gorur2009-02-271-7/+39
| | | | | | | Earlier the check was in afr_flush(), which caused race conditions with writev() Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* fixing warnings in previous patch (typecasting pointer to long)Anand V. Avati2009-02-271-1/+1
|
* reverting nested locks in posix-locks for inodelkVikas Gorur2009-02-271-0/+2
| | | | Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* updated copyright header to extend copyright upto 2009Basavanagowda Kanur2009-02-261-1/+1
| | | | | | updated copyright header to include 2009. Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* protect fd_ctx get/set in afr_transaction and afr.c with locksVikas Gorur2009-02-261-14/+18
| | | | Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* Added all filesVikas Gorur2009-02-181-0/+957