summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-transaction.c
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: Handle child_up & fd not opened case in xactionPranith Kumar K2012-08-171-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RCA: When an fd is opened while a brick is down, after the brick comes back up afr issues open on the other brick. It can fail for a number of reasons (enoent etc). While the system is in that state, inode/entrylks pre-op happen only on the brick that is up and fd is opened for fd-fops. post-op should consider only the bricks where both pre-op and fop succeeded as success, rest of them as failures. Code now marks only the children that are down as failures as opposed to child_down & fd-not-opened. This makes change-log appear as success on the subvolume where we did not do any fop leading to no change-log but differences in data/metadata for reg-files. Fix: Mark non-participants of fop as failure. This is tracked in transaction.pre_op[]. Tests: Simulated the scenario using err-gen on top of one of the client xlator which fails all fops always. Performed fops and the changelog represented pending fops on the brick with err-gen loaded. Tested the case of brick down and perform entry/metadata/data operations to confirm they still work as expected. Change-Id: I41905936126b19abba56ca581c0301a894507e1a BUG: 844987 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.com/3776 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* license: dual license under GPLV2 and LGPLV3+Kaleb KEITHLEY2012-05-101-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note that the license was not changed in any of the following: .../argp-standalone/... .../booster/... .../cli/... .../contrib/... .../extras/... .../glusterfsd/... .../glusterfs-hadoop/... .../mod_clusterfs/... .../scheduler/... .../swift/... The license was not changed in any of the non-building xlators. The license was not changed in any of the xlators that seemed — to me — to be clearly server-side only, e.g. protocol/server Note too that copyright was changed along with the license; I did not change the copyright in files where the license did not change. If you find any errors or ommissions please don't hesitate to let me know. The complete list of files with the license change is: libglusterfs/src/byte-order.h libglusterfs/src/call-stub.c libglusterfs/src/call-stub.h libglusterfs/src/checksum.c libglusterfs/src/checksum.h libglusterfs/src/circ-buff.c libglusterfs/src/circ-buff.h libglusterfs/src/common-utils.c libglusterfs/src/common-utils.h libglusterfs/src/compat-errno.c libglusterfs/src/compat-errno.h libglusterfs/src/compat.c libglusterfs/src/compat.h libglusterfs/src/daemon.c libglusterfs/src/daemon.h libglusterfs/src/defaults.c libglusterfs/src/defaults.h libglusterfs/src/dict.c libglusterfs/src/dict.h libglusterfs/src/event-history.c libglusterfs/src/event-history.h libglusterfs/src/event.c libglusterfs/src/event.h libglusterfs/src/fd-lk.c libglusterfs/src/fd-lk.h libglusterfs/src/fd.c libglusterfs/src/fd.h libglusterfs/src/gf-dirent.c libglusterfs/src/gf-dirent.h libglusterfs/src/globals.c libglusterfs/src/globals.h libglusterfs/src/glusterfs.h libglusterfs/src/graph-print.c libglusterfs/src/graph-utils.h libglusterfs/src/graph.c libglusterfs/src/hashfn.c libglusterfs/src/hashfn.h libglusterfs/src/iatt.h libglusterfs/src/inode.c libglusterfs/src/inode.h libglusterfs/src/iobuf.c libglusterfs/src/iobuf.h libglusterfs/src/latency.c libglusterfs/src/latency.h libglusterfs/src/list.h libglusterfs/src/lkowner.h libglusterfs/src/locking.h libglusterfs/src/logging.c libglusterfs/src/logging.h libglusterfs/src/mem-pool.c libglusterfs/src/mem-pool.h libglusterfs/src/mem-types.h libglusterfs/src/options.c libglusterfs/src/options.h libglusterfs/src/rbthash.c libglusterfs/src/rbthash.h libglusterfs/src/run.c libglusterfs/src/run.h libglusterfs/src/scheduler.c libglusterfs/src/scheduler.h libglusterfs/src/stack.c libglusterfs/src/stack.h libglusterfs/src/statedump.c libglusterfs/src/statedump.h libglusterfs/src/syncop.c libglusterfs/src/syncop.h libglusterfs/src/syscall.c libglusterfs/src/syscall.h libglusterfs/src/timer.c libglusterfs/src/timer.h libglusterfs/src/trie.c libglusterfs/src/trie.h libglusterfs/src/xlator.c libglusterfs/src/xlator.h libglusterfsclient/src/libglusterfsclient-dentry.c libglusterfsclient/src/libglusterfsclient-internals.h libglusterfsclient/src/libglusterfsclient.c libglusterfsclient/src/libglusterfsclient.h rpc/rpc-lib/src/auth-glusterfs.c rpc/rpc-lib/src/auth-null.c rpc/rpc-lib/src/auth-unix.c rpc/rpc-lib/src/protocol-common.h rpc/rpc-lib/src/rpc-clnt.c rpc/rpc-lib/src/rpc-clnt.h rpc/rpc-lib/src/rpc-transport.c rpc/rpc-lib/src/rpc-transport.h rpc/rpc-lib/src/rpcsvc-auth.c rpc/rpc-lib/src/rpcsvc-common.h rpc/rpc-lib/src/rpcsvc.c rpc/rpc-lib/src/rpcsvc.h rpc/rpc-lib/src/xdr-common.h rpc/rpc-lib/src/xdr-rpc.c rpc/rpc-lib/src/xdr-rpc.h rpc/rpc-lib/src/xdr-rpcclnt.c rpc/rpc-lib/src/xdr-rpcclnt.h rpc/rpc-transport/rdma/src/name.c rpc/rpc-transport/rdma/src/name.h rpc/rpc-transport/rdma/src/rdma.c rpc/rpc-transport/rdma/src/rdma.h rpc/rpc-transport/socket/src/name.c rpc/rpc-transport/socket/src/name.h rpc/rpc-transport/socket/src/socket.c rpc/rpc-transport/socket/src/socket.h xlators/cluster/afr/src/afr-common.c xlators/cluster/afr/src/afr-dir-read.c xlators/cluster/afr/src/afr-dir-read.h xlators/cluster/afr/src/afr-dir-write.c xlators/cluster/afr/src/afr-dir-write.h xlators/cluster/afr/src/afr-inode-read.c xlators/cluster/afr/src/afr-inode-read.h xlators/cluster/afr/src/afr-inode-write.c xlators/cluster/afr/src/afr-inode-write.h xlators/cluster/afr/src/afr-lk-common.c xlators/cluster/afr/src/afr-mem-types.h xlators/cluster/afr/src/afr-open.c xlators/cluster/afr/src/afr-self-heal-algorithm.c xlators/cluster/afr/src/afr-self-heal-algorithm.h xlators/cluster/afr/src/afr-self-heal-common.c xlators/cluster/afr/src/afr-self-heal-common.h xlators/cluster/afr/src/afr-self-heal-data.c xlators/cluster/afr/src/afr-self-heal-entry.c xlators/cluster/afr/src/afr-self-heal-metadata.c xlators/cluster/afr/src/afr-self-heal.h xlators/cluster/afr/src/afr-self-heald.c xlators/cluster/afr/src/afr-self-heald.h xlators/cluster/afr/src/afr-transaction.c xlators/cluster/afr/src/afr-transaction.h xlators/cluster/afr/src/afr.c xlators/cluster/afr/src/afr.h xlators/cluster/afr/src/pump.c xlators/cluster/afr/src/pump.h xlators/cluster/dht/src/dht-common.c xlators/cluster/dht/src/dht-common.h xlators/cluster/dht/src/dht-diskusage.c xlators/cluster/dht/src/dht-hashfn.c xlators/cluster/dht/src/dht-helper.c xlators/cluster/dht/src/dht-inode-read.c xlators/cluster/dht/src/dht-inode-write.c xlators/cluster/dht/src/dht-layout.c xlators/cluster/dht/src/dht-linkfile.c xlators/cluster/dht/src/dht-mem-types.h xlators/cluster/dht/src/dht-rebalance.c xlators/cluster/dht/src/dht-rename.c xlators/cluster/dht/src/dht-selfheal.c xlators/cluster/dht/src/dht.c xlators/cluster/dht/src/nufa.c xlators/cluster/dht/src/switch.c xlators/cluster/stripe/src/stripe-helpers.c xlators/cluster/stripe/src/stripe-mem-types.h xlators/cluster/stripe/src/stripe.c xlators/cluster/stripe/src/stripe.h xlators/features/index/src/index-mem-types.h ¹ xlators/features/index/src/index.c ¹ xlators/features/index/src/index.h ¹ xlators/performance/io-cache/src/io-cache.c xlators/performance/io-cache/src/io-cache.h xlators/performance/io-cache/src/ioc-inode.c xlators/performance/io-cache/src/ioc-mem-types.h xlators/performance/io-cache/src/page.c xlators/performance/io-threads/src/io-threads.c xlators/performance/io-threads/src/io-threads.h xlators/performance/io-threads/src/iot-mem-types.h xlators/performance/md-cache/src/md-cache-mem-types.h xlators/performance/md-cache/src/md-cache.c xlators/performance/quick-read/src/quick-read-mem-types.h xlators/performance/quick-read/src/quick-read.c xlators/performance/quick-read/src/quick-read.h xlators/performance/read-ahead/src/page.c xlators/performance/read-ahead/src/read-ahead-mem-types.h xlators/performance/read-ahead/src/read-ahead.c xlators/performance/read-ahead/src/read-ahead.h xlators/performance/symlink-cache/src/symlink-cache.c xlators/performance/write-behind/src/write-behind-mem-types.h xlators/performance/write-behind/src/write-behind.c xlators/protocol/auth/addr/src/addr.c ¹ xlators/protocol/auth/login/src/login.c ¹ xlators/protocol/client/src/client-callback.c xlators/protocol/client/src/client-handshake.c xlators/protocol/client/src/client-helpers.c xlators/protocol/client/src/client-lk.c xlators/protocol/client/src/client-mem-types.h xlators/protocol/client/src/client.c xlators/protocol/client/src/client.h xlators/protocol/client/src/client3_1-fops.c ¹ Copyright only, license reverted to original Change-Id: If560e826c61b6b26f8b9af7bed6e4bcbaeba31a8 BUG: 820551 Signed-off-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.com/3304 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Perform Flush with lk-owner given by parent xlator.Pranith Kumar K2012-05-071-0/+5
| | | | | | | | | | | | Lk-owner of posix-lk and flush should be same, flush can't clear posix-lks without that lk-owner. Change-Id: If775abb5741a0beb00c419b54d023fbd429e3cb7 BUG: 810502 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.com/3221 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: increment change log with correct byte orderPranith Kumar K2012-04-161-2/+5
| | | | | | | | | | Change-Id: Id2af3e61ad659ff6d168161673e5e1e19f36bdb5 BUG: 765194 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.com/3149 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* core: adding extra data for fopsAmar Tumballi2012-03-221-24/+47
| | | | | | | | | | | | | with this change, the xlator APIs will have a dictionary as extra argument, which is passed between all the layers. This can be utilized for overloading in some of the operations. Change-Id: I58a8186b3ef647650280e63f3e5e9b9de7827b40 Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 782265 Reviewed-on: http://review.gluster.com/2960 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Enable eager-lockPranith Kumar K2012-03-171-1/+2
| | | | | | | | | | | | | | | | | | Eager-lock is disabled by default. Use cluster.eager-lock on/off to change the config. write-behind on and eager-lock off is not supported configuration. In afr, when eager-lock is enabled the inode lock on fd is taken using the fd address as the lk-owner. So the lock is interchangableale between the inode-locks on the same fd. Change-Id: I7eef1ecd510f8028f5395dee882782da53c0de3f BUG: 802515 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.com/2925 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Revert "afr: [Un]Set the 'right' lkowner for [f]{inode|entry}_lk and the ↵Vijay Bellur2012-03-031-0/+2
| | | | | | | | | | | 'enclosed' fop." This reverts commit 2e80fdbeb6abbb23ff6789c2b98c82704883af0a. Change-Id: I417fd43e4195d63e5b8b83dd3beb712887130e1e Reviewed-on: http://review.gluster.com/2860 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* afr: [Un]Set the 'right' lkowner for [f]{inode|entry}_lk and the 'enclosed' fop.Krishnan Parthasarathi2012-03-011-2/+0
| | | | | | | | | | | | | | afr 'mangles' the lkowner inorder to ensure [f]inodelk/[f]entrylk fops from the same application contend. But other fops that are 'visible' to the application should operate with the lkowner provided by fuse for correct functioning of posix-locks xlator. Change-Id: I7e71f35ae7df2a070f1f46d4fc77eed26a717673 BUG: 790743 Signed-off-by: Krishnan Parthasarathi <kp@gluster.com> Reviewed-on: http://review.gluster.com/2752 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Perform xattrop with all afr-keysPranith Kumar K2012-01-271-11/+7
| | | | | | | | | | | | | | | | | | | | | | | | Self-heal does not happen if the file has change log xattr only for one of the subvol keys. This patch makes sure that xattrop is done for all the afr subvol keys after a new entry is created in entry-self-heal. 1) Added matrix create/cleanup functions 2) Impunging a new file does multiple xattrops on the source subvol, one per sink. The code can do a single xattrop after the entry is created on all the sinks. 3) Missing entry self-heal uses one frame per sink to heal the file. This leads to multiple xattrops on the source subvol. That code is changed now to use one frame which will create the file on all subvols. Change-Id: I65a42f9779b03f7efae283479f8653fb2cb8046b BUG: 762680 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.com/2503 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Krishnan Parthasarathi <kp@gluster.com>
* core: GFID filehandle based backend and anonymous FDsAnand Avati2012-01-201-6/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. What -------- This change introduces an infrastructure change in the filesystem which lets filesystem operation address objects (inodes) just by its GFID. Thus far GFID has been a unique identifier of a user-visible inode. But in terms of addressability the only mechanism thus far has been the backend filesystem path, which could be derived from the GFID only if it was cached in the inode table along with the entire set of dentry ancestry leading up to the root. This change essentially decouples addressability from the namespace. It is no more necessary to be aware of the parent directory to address a file or directory. 2. Why ------- The biggest use case for such a feature is NFS for generating persistent filehandles. So far the technique for generating filehandles in NFS has been to encode path components so that the appropriate inode_t can be repopulated into the inode table by means of a recursive lookup of each component top-down. Another use case is the ability to perform more intelligent self-healing and rebalancing of inodes with hardlinks and also to detect renames. A derived feature from GFID filehandles is anonymous FDs. An anonymous FD is an internal USABLE "fd_t" which does not map to a user opened file descriptor or to an internal ->open()'d fd. The ability to address a file by the GFID eliminates the need to have a persistent ->open()'d fd for the purpose of avoiding the namespace. This improves NFS read/write performance significantly eliminating open/close calls and also fixes some of today's limitations (like keeping an FD open longer than necessary resulting in disk space leakage) 3. How ------- At each storage/posix translator level, every file is hardlinked inside a hidden .glusterfs directory (under the top level export) with the name as the ascii-encoded standard UUID format string. For reasons of performance and scalability there is a two-tier classification of those hardlinks under directories with the initial parts of the UUID string as the directory names. For directories (which cannot be hardlinked), the approach is to use a symlink which dereferences the parent GFID path along with basename of the directory. The parent GFID dereference will in turn be a dereference of the grandparent with the parent's basename, and so on recursively up to the root export. 4. Development --------------- 4a. To leverage the ability to address an inode by its GFID, the technique is to perform a "nameless lookup". This means, to populate a loc_t structure as: loc_t { pargfid: NULL parent: NULL name: NULL path: NULL gfid: GFID to be looked up [out parameter] inode: inode_new () result [in parameter] } and performing such lookup will return in its callback an inode_t populated with the right contexts and a struct iatt which can be used to perform an inode_link () on the inode (without a parent and basename). The inode will now be hashed and linked in the inode table and findable via inode_find(). A fundamental change moving forward is that the primary fields in a loc_t structure are now going to be (pargfid, name) and (gfid) depending on the kind of FOP. So far path had been the primary field for operations. The remaining fields only serve as hints/helpers. 4b. If read/write is to be performed on an inode_t, the approach so far has been to: fd_create(), STACK_WIND(open, fd), fd_bind (in callback) and then perform STACK_WIND(read, fd) etc. With anonymous fds now you can do fd_anonymous (inode), STACK_WIND (read, fd). This results in great boost in performance in the inbuilt NFS server. 5. Misc ------- The inode_ctx_put[2] has been renamed to inode_ctx_set[2] to be consistent with the rest of the codebase. Change-Id: Ie4629edf6bd32a595f4d7f01e90c0a01f16fb12f BUG: 781318 Reviewed-on: http://review.gluster.com/669 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* cluster/afr: Double the call count if transaction is for renamePranith Kumar K2011-12-131-4/+18
| | | | | | | | | | | | | In rename the changelog modification needs to happen both on old parent-dir and new parent-dir, so 2 stack winds are done per brick. Change-Id: I43f34661e397c4288162213944529e18b7724b1d BUG: 766603 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.com/783 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Update read-child if it becomes stalePranith Kumar K2011-11-281-25/+19
| | | | | | | | Change-Id: I00c714a89575023f6dbdd3430dcbf191e5d08019 BUG: 3650 Reviewed-on: http://review.gluster.com/740 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* Second round of warning suppression.Jeff Darcy2011-09-291-3/+0
| | | | | | | | | | | | | | | | | | | | | | Used a #pragma to kill ~170 in rpcgen code. Added GF_UNUSED to deal with a few more from macros elsewhere. The remainder are function return values (mostly context and dict calls) that really should be checked. Those would be harder to fix without real understanding of the code where they occur, so they remain as reminders. (Patchset 2: deal with older gcc that doesn't handle #pragma GCC diagnostic) (Patchset 3: fix include paths in generated files) (Patchset 4: keep up with trunk, squash 9 new warnings) (Patchset 5: six more, all in AFR) Change-Id: I29760c8c81be4d7e6489312c5d0e92cc24814b7b BUG: 2550 Reviewed-on: http://review.gluster.com/378 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Make local->child_up immutablePranith Kumar K2011-09-211-83/+47
| | | | | | | | | | | | | | | | | | | | | | | Afr transaction performs lock, pre-op, op, post-op and unlock steps in that order. The child_up[] is overloaded with the information of where all the first two steps succeeded. This works perfectly fine for Transaction, but the locking/unlocking part of the code is re-used by data self-heal. In that each loop_frame does lock, rchecksum, read-from-source and write-to-sinks, unlock steps. Rchecksum fop assumes that the fop needs to happen on one source + all sinks and sets the call_count to that number. But if the lock step fails on any of the sinks it will mark the child_up of that child to 0, which will result in call_count mismatch and the frame will hang thinking that some more cbks need to come. When this happens loop_frame will never go to unlock step leading to hangs on that file. Change-Id: I3dd0449cc6193a980bacf637d935881f4b22210a BUG: 3597 Reviewed-on: http://review.gluster.com/474 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amar@gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: eager locking of FD writesAnand Avati2011-09-081-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a change in the way write transactions hold a lock which optimizes the case of sequential writes from a single writer. Lock phase of a transaction has two sub-phases. First is an attempt to acquire locks in parallel by broadcasting non-blocking lock requests. If lock aquistion fails on any server, then the held locks are unlocked and revert to a blocking locked mode sequentially on one server after another. The change in this patch is to make the initial broadcasting lock request attempt to acquire lock on the entire file. If this fails, we revert back to the sequential "regional" blocking lock as before. In the case where such an "eager" lock is granted in the non-blocking phase, it gives rise to an opportunity for optimization. i.e, if the next write transaction on the same FD arrives before the unlock phase of the first transaction, it "takes over" the full file lock. Similarly if yet another transaction arrives before the unlock phase of the "optimized" transaction, that in turn "takes over" the lock as well. The actual unlock now happens at the end of the last "optimzed" transaction. Any operation which arrives before the unlock phase of the previous transaction is a potential candidate to become an "optimized" transaction. In cases where the previous transaction had aquired lock as a "regional" blocking lock, and the next transaction comes in before its unlock phase, then it would not be an "optimized" transaction. Implied assumption ------------------ Since two or more transactions can now operate within the same large lock, there is a possibility that overlapping transactions can arrive at oppoosite orders on the servers. However in the larger picture this is not possible as write-behind already ensures that no two overlapping writes on an inode are in transit at the same time. Overlapping writes across clients are not a problem as they compete at locks anyways. Theoretical benefits and potential harms ---------------------------------------- In case of a single writer: The benefits are large for sequential writes. In the best case the entire file write can happen with just one lock and unlock per server, provided writes are coming in fast enough and getting pipelined by write-behind soon enough (which is usually the case). If the writes are not coming in fast enough, then the optimization "kicks in" for only those subsets of writes which are close enough to get "piggybacked". For random writes the benefits are the same as well. In any case the overall performance is better than or equal to the performance without this optimization for a single writer. In case of multiple writers: When multiple writers are not writing concurrently, there is no negative performance impact. When multiple writers are writing concurrently to the same region, there is no negative impact either, as they were previously getting arbitrated at the locks translator too. In the case of multiple writers writing to different regions concurrently, there will be an increased number of "failovers" from failed parallel non-blocking to sequential blocking regional locks. This above "worst case" has a simple workaround that as soon as we detect > 1 open-fd-count in lookup xattr, we can disable this optimization on those fds. Beneficial side-effects ----------------------- There is another similar optimization in AFR for changelogs which goes by the name of "changelog-piggybacking". That works in a similar way where pending flags get 'taken over' or 'piggybacked' by the next transaction if its 'pre-op' phase kicks in before the 'post-op' phase of the previous transaction. It has been observed that this changelog-piggybacking optimization gives a saving of about ~55% savings of xattr calls hitting the wire, measured across various types of network interfaces. The side effect of this eager-lock optimization is that it gives an almost 100% saving of xattr calls by making the optimistic-changelog work much more efficiently as it gives a wider overlap of the xattr phases of two consecutive transactions. Change-Id: I41c02eb3b64c14c68ef66a344610ec3f024cd59d BUG: 3409 Reviewed-on: http://review.gluster.com/240 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* Eliminate many "var set but not used" warnings with newer gcc.Jeff Darcy2011-09-071-2/+0
| | | | | | | | | | | | | | | | This fixes ~200 such warnings, but leaves three categories untouched. (1) Rpcgen code. (2) Macros which set variables in the outer (calling function) scope. (3) Variables which are set via function calls which may have side effects. Change-Id: I6554555f78ed26134251504b038da7e94adacbcd BUG: 2550 Reviewed-on: http://review.gluster.com/371 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* cluster/afr: Change definition of stale childPranith Kumar K2011-08-221-1/+1
| | | | | | | | | | | | The code is checking for priv->child_up[i], which can change while the fop is in progress. Since pending[child][id-of-transaction] alone is enough to tell if the child became stale or not, use just that. Change-Id: I494bf02cca66f4fd41526195fafce86a202c6bd1 BUG: 3455 Reviewed-on: http://review.gluster.com/293 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Perform self-heal without locking the whole filePranith Kumar K2011-08-201-16/+39
| | | | | | | | Change-Id: I206571c77f2d7b3c9f9d7bb82a936366fd99ce5c BUG: 3182 Reviewed-on: http://review.gluster.com/141 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Update fresh_children in lookup if no other ops in progressPranith Kumar K2011-08-191-27/+42
| | | | | | | | | | | | | If write/truncate fails we should remove the child that failed the fop from the fresh children. The previous code assumes that the children that succeeded the fop are fresh children, which is wrong. Fixed that in this patch. Change-Id: I1e6e21e20faea00516a0fdd2e95f2d7e9cf9076d BUG: 3411 Reviewed-on: http://review.gluster.com/263 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* Change Copyright current yearPranith Kumar K2011-08-101-1/+1
| | | | | | | | Change-Id: I2d10f2be44f518f496427f257988f1858e888084 BUG: 3348 Reviewed-on: http://review.gluster.com/200 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* LICENSE: s/GNU Affero General Public/GNU General Public/Pranith Kumar K2011-08-061-3/+3
| | | | | | | | Change-Id: I3914467611e573cccee0d22df93920cf1b2eb79f BUG: 3348 Reviewed-on: http://review.gluster.com/182 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
* cluster/afr: Choose next call child from fresh-children for inode-read-fopsPranith K2011-07-171-2/+2
| | | | | | | | Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Signed-off-by: Anand Avati <avati@gluster.com> BUG: 2840 (files not getting self-healed when the first child goes down) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2840
* cluster/afr: Add fresh children along with read-child to inode contextPranith K2011-07-171-15/+23
| | | | | | | | Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Signed-off-by: Anand Avati <avati@gluster.com> BUG: 2840 (files not getting self-healed when the first child goes down) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2840
* cluster/afr: log enhancement - part 3Amar Tumballi2011-04-011-19/+15
| | | | | | | | Signed-off-by: Amar Tumballi <amar@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 2346 (Log message enhancements in GlusterFS - phase 1) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2346
* cluster/afr: white-space cleanup - part 2Amar Tumballi2011-03-311-211/+211
| | | | | | | | Signed-off-by: Amar Tumballi <amar@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 2346 (Log message enhancements in GlusterFS - phase 1) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2346
* cluster/afr: Fix wrong memory allocationPranith K2011-03-141-4/+5
| | | | | | | | Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 2517 (the size of allocated memory may be wrong) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2517
* replicate: optimistic changelogAnand Avati2010-11-091-13/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The standard way of maintaining changelog in replicate has been to write out pending flags and to unset the pending flag post the actual operation. This new optimization kicks in only when all subvolumes are up. The optimization is that, during pre-op, no changelog is written for METADATA and ENTRY/RENAME operations. If during the operation nothing failed, no changelog is updated in post-op either. If however, something does fail during an operation, then, pending flags get written during post op pointing only towards the failed nodes. DATA transactions continue to work the way they are. If one subvolume is down, pending flags are written in pre-op changelog itself as before. The impact of this optimization is only in the case when both servers die or the client dies while the 'FOP' stage of the transaction is in progress. By nature of METADATA and ENTRY operations, detecting a mismatch later is not dependent on the presence of changelog. Changelog only determines the direction in which self-heal happens for these types of transactions. For the direction too this optimization does not have a major impact because in the cases of failure (both servers dieing or client dieing) the final state (direction of self-heal) would be arbitrary anyways as the syscall wouldn't have completed. Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 2068 (performance enhancements) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2068
* Copyright changesVijay Bellur2010-10-111-1/+1
| | | | | | | | Signed-off-by: Vijay Bellur <vijay@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 971 (dynamic volume management) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=971
* Change GNU GPL to GNU AGPLPranith K2010-10-041-3/+3
| | | | | | | | Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1388 () URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1388
* replicate: keep read_child in inode ctx as up-to-date as possibleAnand Avati2010-09-291-1/+49
| | | | | | | | | | | | | In every transaction check if the currently set read child in the inode context failed in the fop and set it to another subvol on which the latest fop has passed. This will prevent read fops landing on subvols which have witnessed a failure. Signed-off-by: Anand V. Avati <avati@amp.gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1172 (ls -lh on NFS mount of 2-mirror replicate gives incorrect file size) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1172
* replicate: replace first-write-to-flush optimizationAnand V. Avati2010-09-291-405/+363
| | | | | | | | | | | | use a changelog piggybacking optimization instead of first-write-to-flush optimization and do other cleanups (removal of post-post-op hook etc.) Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@amp.gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1235 (Bug for all pump/migrate commits) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1235
* whitespace cleanupAnand V. Avati2010-09-291-72/+81
| | | | | | | | | Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@amp.gluster.com> Signed-off-by: Vijay Bellur <vijay@dev.gluster.com> BUG: 1235 (Bug for all pump/migrate commits) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1235
* cluster/afr: Use 2 phase locking for transactions and self heal.Pavan Sondur2010-08-221-474/+194
| | | | | | | | Signed-off-by: Pavan Vilas Sondur <pavan@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 960 () URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=960
* add pump xlator and changes for replace-brickPavan Sondur2010-08-061-1/+1
| | | | | | | | Signed-off-by: Pavan Vilas Sondur <pavan@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 1235 (Bug for all pump/migrate commits) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1235
* cluster/afr: Check for call_count in ENTRY_RENAME_TRANSACTIONVijay Bellur2010-04-201-1/+2
| | | | | | | | Signed-off-by: Vijay Bellur <vijay@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 824 (Crash in afr rename transaction) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=824
* afr: handle fdctx->pre_op_done handlingAnand Avati2009-11-291-0/+1
| | | | | | | | | | | | reset pre_op_done[i] to 0 after issuing a postop in flush. this was missed during the introduction of pre_op_done[] array and was resulting in a lot of spurious self heals when spurious flushes were received Signed-off-by: Anand V. Avati <avati@blackhole.gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Include "common-utils.h" instead of alloca.hVikas Gorur2009-11-261-1/+1
| | | | | | | | | | | alloca.h should be included on a platform-specific basis. Lets common-utils.h handle that. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 349 (FreeBSD compilation error (alloca.h).) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=349
* cluster/afr: Do self-heal on unopened fds.Vikas Gorur2009-11-251-4/+37
| | | | | | | | | | | | | | This patch completes the previous patch for self-heal of open fds in replicate. If an fd was never opened on a subvolume, we remember that and do the open after we've done self-heal on that fd. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Do self-heal on reopened fds.Vikas Gorur2009-11-241-18/+112
| | | | | | | | | | | | | | | | | | | | | | | This patch brings in partial support for self-heal of open fds. The precondition is that the fd should have been opened successfully during the initial open() (or create()), and we assume that protocol/client has successfully reopened the fd when the subvolume comes back up. It works by doing an "up/down flush" (a dummy flush transaction to do post-op wherever necessary) and then triggering data self-heal on the file in the post-post-op hook of the dummy flush transaction. This ensures that any writes that come in during self-heal will wait until self-heal completes. The up/down flush is also done when a subvolume goes down, so that post-op is done on all subvolumes where pre-op was done. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Provide a post-post_op hook in the transaction.Vikas Gorur2009-11-241-6/+20
| | | | | | | | Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Hold blocking locks for data self-heal.Vikas Gorur2009-11-241-1/+1
| | | | | | | | | | | Data self-heal now holds blocking locks, and instead of locking on all subvolumes, it only locks on {data-lock-server-count} subvolumes. Signed-off-by: Vikas Gorur <vikas@gluster.com> Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 170 (Auto-heal fails on files that are open()-ed/mmap()-ed) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=170
* cluster/afr: Unlock only those paths which have been locked during rename.Vikas Gorur2009-11-241-77/+142
| | | | | | | | | | | For ENTRY_RENAME_TRANSACTIONs, keep track separately whether the lower_path and the higher_path have been locked, and unlock only those which have been. Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* afr transaction: fix op_ret check during lockingAnand Avati2009-10-131-3/+3
| | | | | | | Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* afr transaction prevent spurious unlocksAnand Avati2009-10-131-2/+4
| | | | | | | | | mark a subvol with held lock only if op_ret == 0 Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* cluster/afr: Hold second lock after first lock has been granted for rename ↵Vikas Gorur2009-10-121-30/+84
| | | | | | | | | | | | transactions. Hold the lock on the {higher_path} only after the lock on the {lower_path} has been granted successfully. Signed-off-by: Anand V. Avati <avati@dev.gluster.com> BUG: 112 (parallel deletion of files mounted by different clients on the same back-end hangs and/or does not completely delete) URL: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=112
* Changed occurrences of Z Research to Gluster.Vijay Bellur2009-10-071-1/+1
| | | | Signed-off-by: Anand V. Avati <avati@dev.gluster.com>
* cluster/afr - use different dictionaries for sending xattrop requests to ↵Basavanagowda Kanur2009-06-301-24/+50
| | | | | | | | | | | | | | | | each of the subvolume - This patch fixes bug #29. - Using separate copies of dictionaries also eliminates a potential bug in a setup consisting of afr with a posix and client, each having io-threads on top as children. Since posix_xattrop after performing required operations on the xattr array passed in dictionary, sets the result at the same key and in the same dictionary passed as input argument, there can be race conditions where in the results of the operation on posix-child can be sent to the other child as input argument for xattrop, which ofcourse is wrong. Signed-off-by: Anand V. Avati <avati@dev.gluster.com>
* Cleaned up log messages in replicate.Vikas Gorur2009-04-241-4/+5
| | | | Signed-off-by: Anand V. Avati <avati@amp.gluster.com>
* afr-transaction: handle double flushesAnand V. Avati2009-04-201-1/+3
| | | | __if_fd_pre_op_done - reset fd_ctx->pre_op_done to 0 so that double flushes do not result in two xattrop() calls
* Use original pid when calling the FOP in afr transaction.Vikas Gorur2009-04-161-2/+39
| | | | | | | | Save the original pid while locking and restore it after the FOP is done. This ensures posix-locks can release locks (fcntl) properly. Signed-off-by: Anand V. Avati <avati@amp.gluster.com>