diff options
| author | Anand Avati <avati@redhat.com> | 2013-02-02 18:59:10 -0800 | 
|---|---|---|
| committer | Anand Avati <avati@redhat.com> | 2013-02-03 12:14:19 -0800 | 
| commit | 80d08f13b0fd6ee0d10f0569165982913339607d (patch) | |
| tree | 904951edac6d0b38b9c597df4258893d1f988d67 | |
| parent | 34d8edf58ac0379d7495e57c9a846cdec3794b0d (diff) | |
cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatch
In dht_mkdir_cbk, EEXIST error is treated like a true error. Because
of this the following sequence of events can happen, eventually
resulting in GFID mismatch and (and possibly leaked locks and hang,
in the presence of replicate.)
The issue exists when many clients concurrently attempt creation of
directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir)
0. First mkdir happens by one client on the hashed subvolume. Only
   one client wins the race. Others racing mkdirs get EEXIST. Yet
   other "laggers" in the race encounter the just-created directory
   in lookup() on the hash dir.
1. At least one "lagger" lookup() notices that there are missing
   directories on other subvolumes (which the "winner" mkdir is yet
   to create), and starts off self-heal of the directory.
2. At least on some subvolumes, self-heal's mkdir wins the race
   against the "winner" mkdir and creates the directory first. This
   causes the "winner" mkdir to experience EEXIST error on those
   subvolumes.
3. On other subvolumes where "winner" mkdir won the race, self-heal
   experiences EEXIST error, but self-heal is properly translating
   that into a success (but mkdir code path is not -- which is the
   bug.)
4. Both mkdir and self-heal assign hash layouts to the just created
   directory. But self-heal distributes hash range across N (total)
   subvolumes, whereas mkdir distributes hash range across N - M
   (where M is the number of subvolumes where mkdir lost the race).
   Both the clients "cache" their respective layouts in the near
   future for all future creates inside them (evidence in logs)
5. During the creation of the subdirectory, two clients race again.
   Ideally winner performs mkdir() on the hashed subvolume and proceeds
   to create other dirs, loser experiences EEXIST error on the hashed
   subvolume and backs off. But in this case, because the two clients
   have different layout views of the parent directory (because of
   different hash splits and assignements), the hashed subvolumes for
   the new directory can end up being different. Therefore, both clients
   now win the race (they were never fighting against each other on a
   common server), assigning different GFIDs to the directory on their
   respective (different) subvolumes. Some of the remaining subvolumes
   get GFID1, others GFID2.
Conclusion/Fix:
   Making mkdir translate EEXIST error as success (just the way self-heal
   is already rightly doing) will bring back truth to the design claim
   that concurrent mkdir/self-heals perform deterministic + idempotent
   operations. This will prevent the differing "hash views" by different
   clients and thereby also avoid GFID mismatch by forcing all clients
   to have a "fair race", because the hashed subvolume for all will be
   the same (and thereby avoiding leaked locks and hangs.)
Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
| -rw-r--r-- | xlators/cluster/dht/src/dht-common.c | 9 | 
1 files changed, 9 insertions, 0 deletions
diff --git a/xlators/cluster/dht/src/dht-common.c b/xlators/cluster/dht/src/dht-common.c index 99cf6f787a3..802bd5a49ee 100644 --- a/xlators/cluster/dht/src/dht-common.c +++ b/xlators/cluster/dht/src/dht-common.c @@ -3876,6 +3876,15 @@ dht_mkdir_cbk (call_frame_t *frame, void *cookie, xlator_t *this,                          ret = dht_layout_merge (this, layout, prev->this,                                                  -1, ENOSPC, NULL);                  } else { +			if (op_ret == -1 && op_errno == EEXIST) +				/* Very likely just a race between mkdir and +				   self-heal (from lookup of a concurrent mkdir +				   attempt). +				   Ignore error for now. layout setting will +				   anyways fail if this was a different (old) +				   pre-existing different directory. +				*/ +				op_ret = 0;                          ret = dht_layout_merge (this, layout, prev->this,                                                  op_ret, op_errno, NULL);                  }  | 
