cluster/afr: eager locking of FD writes

This patch changes the way write transactions hold a lock, optimizing the case of sequential writes from a single writer.

The lock phase of a transaction has two sub-phases. The first is an attempt to acquire locks in parallel by broadcasting non-blocking lock requests. If lock acquisition fails on any server, the locks already held are released and the transaction reverts to blocking mode, locking sequentially on one server after another.

The change in this patch is to make the initial broadcast lock request attempt to acquire a lock on the entire file. If this fails, we revert to the sequential "regional" blocking lock as before. When such an "eager" lock is granted in the non-blocking phase, it gives rise to an opportunity for optimization: if the next write transaction on the same FD arrives before the unlock phase of the first transaction, it "takes over" the full-file lock. Similarly, if yet another transaction arrives before the unlock phase of the "optimized" transaction, that in turn "takes over" the lock as well. The actual unlock now happens at the end of the last "optimized" transaction.

Any operation that arrives before the unlock phase of the previous transaction is a potential candidate to become an "optimized" transaction. If the previous transaction had acquired its lock as a "regional" blocking lock, a next transaction arriving before its unlock phase is not an "optimized" transaction.

Implied assumption
------------------

Since two or more transactions can now operate within the same large lock, overlapping transactions could in principle arrive in opposite orders on the servers. In the larger picture this is not possible, as write-behind already ensures that no two overlapping writes on an inode are in transit at the same time. Overlapping writes across clients are not a problem, as they compete for locks anyway.
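The "take over" bookkeeping described above can be sketched in isolation. The following is a minimal model, not the patch itself: it is reduced to a single server, it is single-threaded (the patch guards the counters with the fd lock), and it assumes every lock request sent to the server is granted immediately. The struct `eager_fd_ctx`, the functions `eager_lock` and `eager_unlock`, and the `wire_*` counters are hypothetical names used for illustration; only `lock_acquired` and `lock_piggyback` mirror fields added by this patch.

```c
/* Hypothetical, simplified stand-in for the per-fd counters that the
 * patch adds to afr_fd_ctx_t, reduced to one child (server). */
struct eager_fd_ctx {
        unsigned int lock_acquired;   /* full-file locks held on the server */
        unsigned int lock_piggyback;  /* transactions riding on that lock   */
        unsigned int wire_locks;      /* sketch-only: lock calls sent       */
        unsigned int wire_unlocks;    /* sketch-only: unlock calls sent     */
};

/* Lock phase. Returns 1 if the transaction took over an already held
 * full-file lock, 0 if a lock request had to be sent to the server
 * (assumed here to succeed at once). */
int
eager_lock (struct eager_fd_ctx *ctx)
{
        if (ctx->lock_acquired) {
                ctx->lock_piggyback++;
                return 1;
        }

        ctx->wire_locks++;
        ctx->lock_acquired++;
        return 0;
}

/* Unlock phase. Returns 1 if the unlock was absorbed because another
 * transaction is still riding on the lock, 0 if a real unlock was sent. */
int
eager_unlock (struct eager_fd_ctx *ctx)
{
        if (ctx->lock_piggyback) {
                ctx->lock_piggyback--;
                return 1;
        }

        ctx->lock_acquired--;
        ctx->wire_unlocks++;
        return 0;
}
```

In this model, three writes whose lock and unlock phases overlap (lock, lock, unlock, lock, unlock, unlock) result in one lock call and one unlock call on the wire, which is the best case the message describes for a pipelined sequential writer.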

Theoretical benefits and potential harms
----------------------------------------

In case of a single writer:

The benefits are large for sequential writes. In the best case, the entire file write can happen with just one lock and unlock per server, provided writes are coming in fast enough and getting pipelined by write-behind soon enough (which is usually the case). If the writes are not coming in fast enough, the optimization "kicks in" only for those subsets of writes which are close enough to get "piggybacked". For random writes the benefits are the same. In any case, the overall performance for a single writer is better than or equal to the performance without this optimization.

In case of multiple writers:

When multiple writers are not writing concurrently, there is no negative performance impact. When multiple writers are writing concurrently to the same region, there is no negative impact either, as they were previously arbitrated at the locks translator too. When multiple writers write to different regions concurrently, there will be an increased number of "failovers" from failed parallel non-blocking locks to sequential blocking regional locks.

This "worst case" has a simple workaround: as soon as we detect an open-fd-count > 1 in the lookup xattr, we can disable this optimization on those fds.

Beneficial side-effects
-----------------------

There is another similar optimization in AFR for changelogs, which goes by the name of "changelog-piggybacking". It works in a similar way: pending flags get 'taken over' or 'piggybacked' by the next transaction if its 'pre-op' phase kicks in before the 'post-op' phase of the previous transaction. It has been observed that this changelog-piggybacking optimization saves about 55% of the xattr calls hitting the wire, measured across various types of network interfaces.

The side effect of this eager-lock optimization is that it gives an almost 100% saving of xattr calls by making the optimistic-changelog work much more efficiently, as it gives a wider overlap of the xattr phases of two consecutive transactions.

Change-Id: I41c02eb3b64c14c68ef66a344610ec3f024cd59d
BUG: 3409
Reviewed-on: http://review.gluster.com/243
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@gluster.com>
author: Anand Avati <avati@gluster.com> 2011-08-16 12:56:21 +0530
committer: Anand Avati <avati@gluster.com> 2011-09-08 07:07:45 -0700
commit: f421ca932152c3a6102b19607d789dafdbe87ef1 (patch)
tree: b456b4f1c0393037846ccbed1af0fb533da767af
parent: e6992c73dc39a8e9e17f83eb1a6d5ab63c31f511 (diff)
6 files changed, 173 insertions, 61 deletions
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index a435a38b16d..94938d97f61 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -363,6 +363,7 @@ afr_local_transaction_cleanup (afr_local_t *local, xlator_t *this)
 
         GF_FREE (local->transaction.child_errno);
         GF_FREE (local->child_errno);
+        GF_FREE (local->transaction.eager_lock);
 
         GF_FREE (local->transaction.basename);
         GF_FREE (local->transaction.new_basename);
@@ -1264,6 +1265,22 @@ afr_fd_ctx_set (xlator_t *this, fd_t *fd)
                         goto unlock;
                 }
 
+                fd_ctx->lock_piggyback = GF_CALLOC (sizeof (*fd_ctx->lock_piggyback),
+                                                    priv->child_count,
+                                                    gf_afr_mt_char);
+                if (!fd_ctx->lock_piggyback) {
+                        ret = -ENOMEM;
+                        goto unlock;
+                }
+
+                fd_ctx->lock_acquired = GF_CALLOC (sizeof (*fd_ctx->lock_acquired),
+                                                    priv->child_count,
+                                                    gf_afr_mt_char);
+                if (!fd_ctx->lock_acquired) {
+                        ret = -ENOMEM;
+                        goto unlock;
+                }
+
                 fd_ctx->up_count   = priv->up_count;
                 fd_ctx->down_count = priv->down_count;
 
@@ -1504,6 +1521,12 @@ afr_cleanup_fd_ctx (xlator_t *this, fd_t *fd)
                 if (fd_ctx->pre_op_piggyback)
                         GF_FREE (fd_ctx->pre_op_piggyback);
 
+                if (fd_ctx->lock_piggyback)
+                        GF_FREE (fd_ctx->lock_piggyback);
+
+                if (fd_ctx->lock_acquired)
+                        GF_FREE (fd_ctx->lock_acquired);
+
                 GF_FREE (fd_ctx);
         }
 
diff --git a/xlators/cluster/afr/src/afr-lk-common.c b/xlators/cluster/afr/src/afr-lk-common.c
index c059a6b9e5b..157d54d42aa 100644
--- a/xlators/cluster/afr/src/afr-lk-common.c
+++ b/xlators/cluster/afr/src/afr-lk-common.c
@@ -555,6 +555,7 @@ afr_unlock_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         int32_t op_ret, int32_t op_errno)
 {
         afr_local_t *local = NULL;
+        int child_index = (long) cookie;
 
         local = frame->local;
 
@@ -568,6 +569,8 @@ afr_unlock_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         local->loc.path, strerror (op_errno));
         }
 
+        local->transaction.eager_lock[child_index] = 0;
+
         afr_unlock_common_cbk (frame, cookie, this, op_ret, op_errno);
 
         return 0;
@@ -581,8 +584,13 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this)
         afr_local_t         *local    = NULL;
         afr_private_t       *priv     = NULL;
         struct gf_flock flock = {0,};
+        struct gf_flock full_flock = {0,};
+        struct gf_flock *flock_use = &flock;
         int call_count = 0;
         int i = 0;
+        int piggyback = 0;
+        afr_fd_ctx_t        *fd_ctx      = NULL;
+
 
         local    = frame->local;
         int_lock = &local->internal_lock;
@@ -592,6 +600,8 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this)
         flock.l_len   = int_lock->lk_flock.l_len;
         flock.l_type  = F_UNLCK;
 
+        full_flock.l_type = F_UNLCK;
+
         call_count = afr_locked_nodes_count (int_lock->inode_locked_nodes,
                                              priv->child_count);
 
@@ -604,42 +614,70 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this)
                 goto out;
         }
 
+        if (local->fd)
+                fd_ctx = afr_fd_ctx_get (local->fd, this);
+
         for (i = 0; i < priv->child_count; i++) {
-                if (int_lock->inode_locked_nodes[i] & LOCKED_YES) {
-                        if (local->fd) {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
-                                                      AFR_UNLOCK_OP, &flock, F_SETLK, i);
+                if ((int_lock->inode_locked_nodes[i] & LOCKED_YES)
+                    != LOCKED_YES)
+                        continue;
 
-                                STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
-                                                   (void *) (long)i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->finodelk,
-                                                   this->name, local->fd,
-                                                   F_SETLK, &flock);
+                if (local->fd) {
+                        if (!local->transaction.eager_lock[i]) {
+                                goto wind;
+                        }
+
+                        piggyback = 0;
+                        flock_use = &full_flock;
+
+                        LOCK (&local->fd->lock);
+                        {
+                                if (fd_ctx->lock_piggyback[i]) {
+                                        fd_ctx->lock_piggyback[i]--;
+                                        piggyback = 1;
+                                } else {
+                                        fd_ctx->lock_acquired[i]--;
+                                }
+                        }
+                        UNLOCK (&local->fd->lock);
 
+                        if (piggyback) {
+                                afr_unlock_inodelk_cbk (frame, (void *) (long) i,
+                                                        this, 1, 0);
                                 if (!--call_count)
                                         break;
+                                continue;
+                        }
 
-                        } else {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
-                                                      AFR_UNLOCK_OP, &flock, F_SETLK, i);
+                wind:
+                        afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
+                                              AFR_UNLOCK_OP, flock_use, F_SETLK, i);
 
-                                STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
-                                                   (void *) (long)i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->inodelk,
-                                                   this->name, &local->loc,
-                                                   F_SETLK, &flock);
+                        STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
+                                           (void *) (long)i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->finodelk,
+                                           this->name, local->fd,
+                                           F_SETLK, flock_use);
 
-                                if (!--call_count)
-                                        break;
+                        if (!--call_count)
+                                break;
 
-                        }
+                } else {
+                        afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
+                                              AFR_UNLOCK_OP, &flock, F_SETLK, i);
 
-                }
+                        STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
+                                           (void *) (long)i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->inodelk,
+                                           this->name, &local->loc,
+                                           F_SETLK, &flock);
 
+                        if (!--call_count)
+                                break;
+                }
         }
-
 out:
         return 0;
 }
@@ -1288,6 +1326,7 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
         afr_private_t       *priv     = NULL;
         int call_count  = 0;
         int child_index = (long) cookie;
+        afr_fd_ctx_t        *fd_ctx = NULL;
 
         local    = frame->local;
         int_lock = &local->internal_lock;
@@ -1303,7 +1342,7 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
         }
         UNLOCK (&frame->lock);
 
-        if (op_ret < 0 ) {
+        if (op_ret < 0) {
                 if (op_errno == ENOSYS) {
                         /* return ENOTSUP */
                         gf_log (this->name, GF_LOG_ERROR,
@@ -1315,10 +1354,27 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         int_lock->lock_op_errno      = op_errno;
                         local->op_errno              = op_errno;
                 }
-        } else if (op_ret == 0) {
+        } else {
                 int_lock->inode_locked_nodes[child_index]
                         |= LOCKED_YES;
                 int_lock->inodelk_lock_count++;
+
+                if (priv->eager_lock && local->fd) {
+                        fd_ctx = afr_fd_ctx_get (local->fd, this);
+                        local->transaction.eager_lock[child_index] = 1;
+                        /* piggybacked */
+
+                        if (op_ret == 1) {
+                                /* piggybacked */
+                        } else if (op_ret == 0) {
+                                /* lock acquired from server */
+                                LOCK (&local->fd->lock);
+                                {
+                                        fd_ctx->lock_acquired[child_index]++;
+                                }
+                                UNLOCK (&local->fd->lock);
+                        }
+                }
         }
 
         if (call_count == 0) {
@@ -1358,6 +1414,9 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
         int      i          = 0;
         int      ret        = 0;
         struct gf_flock flock = {0,};
+        struct gf_flock full_flock = {0,};
+        struct gf_flock *flock_use = &flock;
+        int     piggyback = 0;
 
         local    = frame->local;
         int_lock = &local->internal_lock;
@@ -1367,6 +1426,8 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
         flock.l_len   = int_lock->lk_flock.l_len;
         flock.l_type  = int_lock->lk_flock.l_type;
 
+        full_flock.l_type = int_lock->lk_flock.l_type;
+
         initialize_inodelk_variables (frame, this);
 
         if (local->fd) {
@@ -1398,49 +1459,72 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
                         goto out;
                 }
 
+                frame->root->lk_owner = (long) (local->fd);
+
                 /* Send non-blocking inodelk calls only on up children
                    and where the fd has been opened */
                 for (i = 0; i < priv->child_count; i++) {
-                        if (local->child_up[i] && fd_ctx->opened_on[i]) {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
-                                                      AFR_LOCK_OP, &flock, F_SETLK, i);
+                        if (!local->child_up[i] || !fd_ctx->opened_on[i])
+                                continue;
 
-                                STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
-                                                   (void *) (long) i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->finodelk,
-                                                   this->name, local->fd,
-                                                   F_SETLK, &flock);
+                        if (!priv->eager_lock)
+                                goto wind;
+
+                        flock_use = &full_flock;
+                        piggyback = 0;
+
+                        LOCK (&local->fd->lock);
+                        {
+                                if (fd_ctx->lock_acquired[i]) {
+                                        fd_ctx->lock_piggyback[i]++;
+                                        piggyback = 1;
+                                }
+                        }
+                        UNLOCK (&local->fd->lock);
 
+                        if (piggyback) {
+                                /* (op_ret == 1) => indicate piggybacked lock */
+                                afr_nonblocking_inodelk_cbk (frame, (void *) (long) i,
+                                                             this, 1, 0);
                                 if (!--call_count)
                                         break;
-
+                                continue;
                         }
+                wind:
+                        afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
+                                              AFR_LOCK_OP, flock_use, F_SETLK, i);
 
+                        STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
+                                           (void *) (long) i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->finodelk,
+                                           this->name, local->fd,
+                                           F_SETLK, flock_use);
+
+                        if (!--call_count)
+                                break;
                 }
         } else {
                 call_count = internal_lock_count (frame, this, NULL);
                 int_lock->lk_call_count = call_count;
 
                 for (i = 0; i < priv->child_count; i++) {
-                        if (local->child_up[i]) {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
-                                                      AFR_LOCK_OP, &flock, F_SETLK, i);
+                        if (!local->child_up[i])
+                                continue;
+                        afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
+                                              AFR_LOCK_OP, &flock, F_SETLK, i);
 
-                                STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
-                                                   (void *) (long) i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->inodelk,
-                                                   this->name, &local->loc,
-                                                   F_SETLK, &flock);
-
-                                if (!--call_count)
-                                        break;
+                        STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
+                                           (void *) (long) i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->inodelk,
+                                           this->name, &local->loc,
+                                           F_SETLK, &flock);
 
-                        }
+                        if (!--call_count)
+                                break;
                 }
         }
-
 out:
         return ret;
 }
diff --git a/xlators/cluster/afr/src/afr-transaction.c b/xlators/cluster/afr/src/afr-transaction.c
index ac329958d2a..c0c409bbc96 100644
--- a/xlators/cluster/afr/src/afr-transaction.c
+++ b/xlators/cluster/afr/src/afr-transaction.c
@@ -377,13 +377,6 @@ afr_changelog_post_op_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
 
         child_index = (long) cookie;
 
-        if (op_ret == 1) {
-        }
-
-        if (op_ret == 0) {
-                __mark_pre_op_undone_on_fd (frame, this, child_index);
-        }
-
         LOCK (&frame->lock);
         {
                 call_count = --local->call_count;
@@ -563,6 +556,7 @@ afr_changelog_post_op (call_frame_t *frame, xlator_t *this)
                                 afr_changelog_post_op_cbk (frame, (void *)(long)i,
                                                            this, 1, 0, xattr[i]);
                         } else {
+                                __mark_pre_op_undone_on_fd (frame, this, i);
                                 STACK_WIND_COOKIE (frame,
                                                    afr_changelog_post_op_cbk,
                                                    (void *) (long) i,
@@ -1134,8 +1128,6 @@ afr_lock (call_frame_t *frame, xlator_t *this)
 {
         afr_pid_save (frame);
 
-        frame->root->pid = (long) frame->root;
-
         afr_set_lk_owner (frame, this);
 
         afr_set_lock_number (frame, this);
diff --git a/xlators/cluster/afr/src/afr.c b/xlators/cluster/afr/src/afr.c
index a662037e788..47224ec8caa 100644
--- a/xlators/cluster/afr/src/afr.c
+++ b/xlators/cluster/afr/src/afr.c
@@ -744,6 +744,8 @@ init (xlator_t *this)
         pthread_mutex_init (&priv->mutex, NULL);
         INIT_LIST_HEAD (&priv->saved_fds);
 
+        priv->eager_lock = _gf_true;
+
         ret = 0;
 out:
         return ret;
diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
index 6f40ded1274..6a9907e974b 100644
--- a/xlators/cluster/afr/src/afr.h
+++ b/xlators/cluster/afr/src/afr.h
@@ -92,6 +92,7 @@ typedef struct _afr_private {
         pthread_mutex_t  mutex;
         struct list_head saved_fds;   /* list of fds on which locks have succeeded */
         gf_boolean_t     optimistic_change_log;
+        gf_boolean_t     eager_lock;
 
         char                   vol_uuid[UUID_SIZE + 1];
         int32_t                *last_event;
@@ -597,6 +598,8 @@ typedef struct _afr_local {
         struct {
                 off_t start, len;
 
+                int *eager_lock;
+
                 char *basename;
                 char *new_basename;
 
@@ -636,6 +639,9 @@ typedef struct {
         unsigned int *opened_on;     /* which subvolumes the fd is open on */
         unsigned int *pre_op_piggyback;
 
+        unsigned int *lock_piggyback;
+        unsigned int *lock_acquired;
+
         int flags;
         int32_t wbflags;
         uint64_t up_count;   /* number of CHILD_UPs this fd has seen */
@@ -939,6 +945,10 @@ afr_transaction_local_init (afr_local_t *local, afr_private_t *priv)
                                                     priv->child_count,
                                                     gf_afr_mt_int32_t);
 
+        local->transaction.eager_lock = GF_CALLOC (sizeof (*local->transaction.eager_lock),
+                                                   priv->child_count,
+                                                   gf_afr_mt_int32_t);
+
         local->internal_lock.transaction_lk_type = AFR_TRANSACTION_LK;
 
         return 0;
@@ -947,5 +957,5 @@ afr_transaction_local_init (afr_local_t *local, afr_private_t *priv)
 int32_t
 afr_marker_getxattr (call_frame_t *frame, xlator_t *this,
                      loc_t *loc, const char *name,afr_local_t *local, afr_private_t *priv );
-
+afr_fd_ctx_t * afr_fd_ctx_get (fd_t *fd, xlator_t *this);
 #endif /* __AFR_H__ */
diff --git a/xlators/features/locks/src/inodelk.c b/xlators/features/locks/src/inodelk.c
index 6fa9f399763..a90f244ecdf 100644
--- a/xlators/features/locks/src/inodelk.c
+++ b/xlators/features/locks/src/inodelk.c
@@ -131,7 +131,8 @@ static int
 inodelk_conflict (pl_inode_lock_t *l1, pl_inode_lock_t *l2)
 {
         return (inodelk_overlap (l1, l2) &&
-                inodelk_type_conflict (l1, l2));
+                inodelk_type_conflict (l1, l2) &&
+                !same_inodelk_owner (l1, l2));
 }
 
 /* Determine if lock is grantable or not */
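
The last hunk makes two inode locks conflict only when they overlap, have conflicting types, and come from different owners. Together with the hunk that sets `frame->root->lk_owner` to the fd pointer, this appears to be what lets a full-file lock and later requests on the same fd coexist on the server. The sketch below shows the predicate on its own under simplifying assumptions: `struct sketch_lock` and the `sketch_*` functions are hypothetical stand-ins for `pl_inode_lock_t` and the locks translator helpers, ranges are inclusive, and ownership is reduced to a single integer.

```c
/* Hypothetical lock types for this sketch; the real code uses the
 * fcntl-style F_RDLCK / F_WRLCK values. */
enum sketch_lock_type { SKETCH_RDLCK, SKETCH_WRLCK };

/* Hypothetical, simplified stand-in for pl_inode_lock_t. */
struct sketch_lock {
        long                  start;  /* first byte covered              */
        long                  end;    /* last byte covered (inclusive)   */
        enum sketch_lock_type type;
        unsigned long         owner;  /* e.g. derived from the fd        */
};

/* Two ranges overlap if neither ends before the other starts. */
int
sketch_overlap (const struct sketch_lock *l1, const struct sketch_lock *l2)
{
        return (l1->end >= l2->start) && (l2->end >= l1->start);
}

/* Types conflict if at least one of the locks is a write lock. */
int
sketch_type_conflict (const struct sketch_lock *l1,
                      const struct sketch_lock *l2)
{
        return (l1->type == SKETCH_WRLCK) || (l2->type == SKETCH_WRLCK);
}

int
sketch_same_owner (const struct sketch_lock *l1, const struct sketch_lock *l2)
{
        return l1->owner == l2->owner;
}

/* Mirrors the shape of the patched inodelk_conflict(): locks held by
 * the same owner never conflict with each other. */
int
sketch_conflict (const struct sketch_lock *l1, const struct sketch_lock *l2)
{
        return sketch_overlap (l1, l2) &&
               sketch_type_conflict (l1, l2) &&
               !sketch_same_owner (l1, l2);
}
```

With this predicate, an overlapping write lock from a second owner still conflicts with a held full-file write lock, while an overlapping request from the same owner does not.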