<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs-afrv1.git/xlators/cluster/dht, branch v3.4.0alpha2</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/'/>
<entry>
<title>cluster/distribute: Reopen fds in migration internally as root:root</title>
<updated>2013-03-04T10:42:30+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-02-14T07:34:58+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=cd4736baba8a60d007bff6ed633f9feba9862bfb'/>
<id>cd4736baba8a60d007bff6ed633f9feba9862bfb</id>
<content type='text'>
Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.

BUG: 884597
Change-Id: Iba73681eae4f280d39ee6c9a40009e195768bee7
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4612
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.

BUG: 884597
Change-Id: Iba73681eae4f280d39ee6c9a40009e195768bee7
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4612
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/distribute: Prevent spurious multiple defrag crawls</title>
<updated>2013-03-04T10:42:10+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-02-27T11:34:47+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=3335a3ded7f46ddcdf2a42edf0c6e78abeb9e898'/>
<id>3335a3ded7f46ddcdf2a42edf0c6e78abeb9e898</id>
<content type='text'>
In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was incorrect, as a later event could also trigger the
crawl again (because all subvols had already responded).

The fix is to make sure the thread is started only once, after
all subvols have responded for the first time.
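
A minimal sketch of the "start only once" guard described above, in
Python rather than the xlator's C. The class and method names are
hypothetical and are not taken from dht_notify; the sketch only shows
how a flag checked under a lock keeps later events from starting a
second crawl.

```python
import threading


class DefragStarter:
    """Hypothetical stand-in for the notify path; not the actual DHT code."""

    def __init__(self, subvol_count):
        self._lock = threading.Lock()
        self._pending = set(range(subvol_count))  # subvols not yet heard from
        self._started = False
        self.crawl_count = 0
        self.thread = None

    def _crawl(self):
        self.crawl_count += 1  # placeholder for the real defrag crawl

    def notify(self, subvol):
        """Record an event from a subvol; return True if this call started the crawl."""
        with self._lock:
            self._pending.discard(subvol)
            if self._pending or self._started:
                return False
            self._started = True  # set under the lock, so only one caller gets here
        self.thread = threading.Thread(target=self._crawl)
        self.thread.start()
        return True
```

With two subvols, the second distinct notify() starts the crawl, and any
later event from either subvol returns False without starting another.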

BUG: 916449
Change-Id: I1619344fbb1cb51d5e1db38d8a29821fa870fa8b
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4610
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was incorrect, as a later event could also trigger the
crawl again (because all subvols had already responded).

The fix is to make sure the thread is started only once, after
all subvols have responded for the first time.

BUG: 916449
Change-Id: I1619344fbb1cb51d5e1db38d8a29821fa870fa8b
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4610
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/distribute: Preserve file size during rebalance migration</title>
<updated>2013-03-04T10:42:00+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-02-25T04:32:15+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=f6a9f19be0e1afe7850842997b88182133d3464e'/>
<id>f6a9f19be0e1afe7850842997b88182133d3464e</id>
<content type='text'>
If holes are encountered, then we do not write these to the dst,
which sometimes causes the dst file size to be less than that of src. Data
is not corrupted, as when non-zero reads are received, we do write that data.

Call truncate to set the file size, so that the dst does not end up
shorter than src when the end of the file has holes.
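
The effect can be reproduced on a local filesystem with a short Python
sketch. This is an illustration only, not the rebalance code: the
function name copy_sparse and the 4096-byte chunk size are assumptions.
All-zero chunks are skipped with a seek instead of a write, so a final
truncate is needed to give the destination the source's size when the
file ends in a hole.

```python
import os


def copy_sparse(src_path, dst_path, chunk=4096):
    """Copy src to dst, skipping all-zero chunks; return the source size."""
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            data = src.read(chunk)
            if not data:
                break
            if not any(data):
                # Hole (all zero bytes): advance the offset without writing.
                dst.seek(len(data), os.SEEK_CUR)
            else:
                dst.write(data)
        # Without this call, a trailing hole would leave dst shorter than src.
        dst.truncate(size)
    return size
```

Removing the truncate call and copying a file whose last chunk is all
zeros leaves the destination short by the length of the trailing hole.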

Thanks to Brian Foster for providing the test case

BUG: 915554
Change-Id: I7e1e0c475118b073c3ebb87e93220c1ec22e8b7d
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4609
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If holes are encountered, then we do not write these to the dst,
which sometimes causes the dst file size to be less than that of src. Data
is not corrupted, as when non-zero reads are received, we do write that data.

Call truncate to set the file size, so that the dst does not end up
shorter than src when the end of the file has holes.

Thanks to Brian Foster for providing the test case

BUG: 915554
Change-Id: I7e1e0c475118b073c3ebb87e93220c1ec22e8b7d
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4609
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/distribute: Remove spurious fd_unref call</title>
<updated>2013-03-04T10:40:32+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-02-13T07:33:10+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=ea0b1c66bf61c12491bb4f2b7313df2a4d66ed6a'/>
<id>ea0b1c66bf61c12491bb4f2b7313df2a4d66ed6a</id>
<content type='text'>
After fix http://review.gluster.org/4282 (libglusterfs/syncop: do not
hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore.

BUG: 910661
Change-Id: Idedff91270966e6e70e71ee83785c0228e238d31
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4608
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
After fix http://review.gluster.org/4282 (libglusterfs/syncop: do not
hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore.

BUG: 910661
Change-Id: Idedff91270966e6e70e71ee83785c0228e238d31
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4608
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/dht: Create linkfile with file uid/gid</title>
<updated>2013-03-04T10:40:01+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-02-11T16:57:29+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=41400b3a4e34a77af5e13ba926ec372291553fe6'/>
<id>41400b3a4e34a77af5e13ba926ec372291553fe6</id>
<content type='text'>
Currently, linkfile creation happens as root.

Use uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.

Also added test/dht.rc to implement common dht functions

BUG: 884597
Change-Id: I6bc0e04f62d4716fc033681e5678e852a1be7a2f
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4607
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, linkfile creation happens as root.

Use uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.

Also added test/dht.rc to implement common dht functions

BUG: 884597
Change-Id: I6bc0e04f62d4716fc033681e5678e852a1be7a2f
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4607
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/dht: pathinfo xattr changes for directories</title>
<updated>2013-02-09T03:09:46+00:00</updated>
<author>
<name>Venky Shankar</name>
<email>vshankar@redhat.com</email>
</author>
<published>2012-09-18T07:11:46+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=19de18219b93097ede8d14c218011a873ebd50ed'/>
<id>19de18219b93097ede8d14c218011a873ebd50ed</id>
<content type='text'>
Since directories have presence on all subvolumes there is
no definite meaning of -&gt;hashed_subvol or -&gt;cached_subvol.
getxattr() code path chooses -&gt;cached_subvol for pathinfo
extended attribute. While this makes sense of files, it makes
less sense for directories. Further if a hashed or a cached
subvolume is down, and there's a getxattr request for a
directory, we return with an errno.

This patch changes pathinfo extended attribute contents by
aggregating information from all subvolumes that are up.

Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d
Signed-off-by: Venky Shankar &lt;vshankar@redhat.com&gt;
BUG: 856455
Reviewed-on: http://review.gluster.org/4047
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since directories have presence on all subvolumes there is
no definite meaning of -&gt;hashed_subvol or -&gt;cached_subvol.
getxattr() code path chooses -&gt;cached_subvol for pathinfo
extended attribute. While this makes sense for files, it makes
less sense for directories. Further, if a hashed or a cached
subvolume is down, and there's a getxattr request for a
directory, we return with an errno.

This patch changes pathinfo extended attribute contents by
aggregating information from all subvolumes that are up.

Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d
Signed-off-by: Venky Shankar &lt;vshankar@redhat.com&gt;
BUG: 856455
Reviewed-on: http://review.gluster.org/4047
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Use proper libtool option -avoid-version instead of bogus -avoidversion</title>
<updated>2013-02-07T23:12:56+00:00</updated>
<author>
<name>Anand Avati</name>
<email>avati@redhat.com</email>
</author>
<published>2013-02-07T22:25:03+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=d3e7881ecdba2124115de6666e48f34ce267d30d'/>
<id>d3e7881ecdba2124115de6666e48f34ce267d30d</id>
<content type='text'>
Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8
BUG: 859835
Signed-off-by: Anand Avati &lt;avati@redhat.com&gt;
Original-author: Kacper Kowalik (Xarthisius) &lt;xarthisius.kk@gmail.com&gt;
Signed-off-by: Kacper Kowalik (Xarthisius) &lt;xarthisius.kk@gmail.com&gt;
Reviewed-on: http://review.gluster.org/3967
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8
BUG: 859835
Signed-off-by: Anand Avati &lt;avati@redhat.com&gt;
Original-author: Kacper Kowalik (Xarthisius) &lt;xarthisius.kk@gmail.com&gt;
Signed-off-by: Kacper Kowalik (Xarthisius) &lt;xarthisius.kk@gmail.com&gt;
Reviewed-on: http://review.gluster.org/3967
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>dht: better layout-optimization algorithm</title>
<updated>2013-02-07T16:27:40+00:00</updated>
<author>
<name>Jeff Darcy</name>
<email>jdarcy@redhat.com</email>
</author>
<published>2013-02-06T00:19:06+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=da9d54cac629d9c0f7ae6b431abfb134b5f0eca3'/>
<id>da9d54cac629d9c0f7ae6b431abfb134b5f0eca3</id>
<content type='text'>
This method deals with the case where swapping might gain a bigger overlap
for the xlator currently under consideration, but sacrifices even more from
the xlator we're swapping with. For example:

A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554)
B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9)
C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff)

Here, the new range for B has a bigger overlap with the old C than with the
old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that
might lead us to swap. However, such a swap turns the new C's overlap from
0x55555556 (vs. old C) to *zero* (vs. old B).  In other words, we've gained
0x11111111 for B but lost 0x55555556 for C, so it's a bad idea.

The new algorithm accounts for all effects of the swap, so it not only avoids
bad swaps but can make some good ones that would have been missed previously.
For example, if swapping a range X with a later range Y would not increase the
overlap for X we would previously have skipped it even if the swap would
increase Y's overlap without affecting X's.  This is the normal case when we're
adding a new brick (which initially has zero overlap with any old range) so
finding more good swaps is probably even more important than avoiding bad ones.

Also, the logic in dht_overlap_calc was completely broken before, causing
integer overflows instead of providing correct values, so no matter what
higher-level algorithm was in place the GIGO effect would have resulted in
bad decisions.
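
The numbers in the example above can be checked with a short Python
sketch. It is an illustration only, not the dht_overlap_calc code: the
names overlap and swap_gain are hypothetical, and Python integers do not
overflow, so the arithmetic hazards of fixed-width C integers do not
appear here.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of hash values shared by two inclusive ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# Old and new ranges from the example in this message.
old = {
    "A": (0x00000000, 0x44444443),
    "B": (0x44444444, 0x77777776),
    "C": (0x77777777, 0xffffffff),
}
new = {
    "A": (0x00000000, 0x55555554),
    "B": (0x55555555, 0xaaaaaaa9),
    "C": (0xaaaaaaaa, 0xffffffff),
}


def swap_gain(x, y):
    """Net change in total overlap if the new ranges of x and y are swapped."""
    before = overlap(*new[x], *old[x]) + overlap(*new[y], *old[y])
    after = overlap(*new[x], *old[y]) + overlap(*new[y], *old[x])
    return after - before


# B would gain 0x11111111 but C would lose 0x55555556, so the net is negative.
print(hex(swap_gain("B", "C")))
```

Judging a swap by the sum over both ranges, rather than by one range
alone, is what rejects this swap: the net change is negative.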

Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f
BUG: 853258
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Reviewed-on: http://review.gluster.org/3908
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This method deals with the case where swapping might gain a bigger overlap
for the xlator currently under consideration, but sacrifices even more from
the xlator we're swapping with. For example:

A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554)
B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9)
C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff)

Here, the new range for B has a bigger overlap with the old C than with the
old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that
might lead us to swap. However, such a swap turns the new C's overlap from
0x55555556 (vs. old C) to *zero* (vs. old B).  In other words, we've gained
0x11111111 for B but lost 0x55555556 for C, so it's a bad idea.

The new algorithm accounts for all effects of the swap, so it not only avoids
bad swaps but can make some good ones that would have been missed previously.
For example, if swapping a range X with a later range Y would not increase the
overlap for X we would previously have skipped it even if the swap would
increase Y's overlap without affecting X's.  This is the normal case when we're
adding a new brick (which initially has zero overlap with any old range) so
finding more good swaps is probably even more important than avoiding bad ones.

Also, the logic in dht_overlap_calc was completely broken before, causing
integer overflows instead of providing correct values, so no matter what
higher-level algorithm was in place the GIGO effect would have resulted in
bad decisions.

Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f
BUG: 853258
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Reviewed-on: http://review.gluster.org/3908
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/dht: Correct min_free_disk behaviour</title>
<updated>2013-02-04T16:43:50+00:00</updated>
<author>
<name>Raghavendra Talur</name>
<email>rtalur@redhat.com</email>
</author>
<published>2013-01-24T05:56:37+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=2a46c8769bc2b6ad491a305ea1d38023d0e22617'/>
<id>2a46c8769bc2b6ad491a305ea1d38023d0e22617</id>
<content type='text'>
Problem:
Files were being created in a subvol which had less than
min_free_disk available even in the cases where other
subvols with more space were available.

Solution:
Changed the logic to look for a subvol which has more
space available.
In cases where all the subvols have less than
min_free_disk available, the one with the most free space and
at least one free inode is chosen.

Known Issue: Cannot ensure that the first file that is
created right after the min_free_disk value is crossed on a
brick will get created on another brick, because the disk
usage stat takes some time to update in the gluster process.
Will fix that as part of another bug.

Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur &lt;rtalur@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
Files were being created in a subvol which had less than
min_free_disk available even in the cases where other
subvols with more space were available.

Solution:
Changed the logic to look for a subvol which has more
space available.
In cases where all the subvols have less than
min_free_disk available, the one with the most free space and
at least one free inode is chosen.

Known Issue: Cannot ensure that the first file that is
created right after the min_free_disk value is crossed on a
brick will get created on another brick, because the disk
usage stat takes some time to update in the gluster process.
Will fix that as part of another bug.

Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur &lt;rtalur@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatch</title>
<updated>2013-02-03T20:14:19+00:00</updated>
<author>
<name>Anand Avati</name>
<email>avati@redhat.com</email>
</author>
<published>2013-02-03T02:59:10+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs-afrv1.git/commit/?id=80d08f13b0fd6ee0d10f0569165982913339607d'/>
<id>80d08f13b0fd6ee0d10f0569165982913339607d</id>
<content type='text'>
In dht_mkdir_cbk, EEXIST error is treated like a true error. Because
of this the following sequence of events can happen, eventually
resulting in GFID mismatch (and possibly leaked locks and hangs,
in the presence of replicate).

The issue exists when many clients concurrently attempt creation of
a directory and subdirectory (e.g. mkdir -p /mnt/gluster/dir1/subdir).

0. First mkdir happens by one client on the hashed subvolume. Only
   one client wins the race. Others racing mkdirs get EEXIST. Yet
   other "laggers" in the race encounter the just-created directory
   in lookup() on the hash dir.

1. At least one "lagger" lookup() notices that there are missing
   directories on other subvolumes (which the "winner" mkdir is yet
   to create), and starts off self-heal of the directory.

2. At least on some subvolumes, self-heal's mkdir wins the race
   against the "winner" mkdir and creates the directory first. This
   causes the "winner" mkdir to experience EEXIST error on those
   subvolumes.

3. On other subvolumes where "winner" mkdir won the race, self-heal
   experiences EEXIST error, but self-heal is properly translating
   that into a success (but mkdir code path is not -- which is the
   bug.)

4. Both mkdir and self-heal assign hash layouts to the just created
   directory. But self-heal distributes hash range across N (total)
   subvolumes, whereas mkdir distributes hash range across N - M
   (where M is the number of subvolumes where mkdir lost the race).
   Both clients "cache" their respective layouts and use them in the
   near future for all creates inside the directory (evidence in logs).

5. During the creation of the subdirectory, two clients race again.
   Ideally winner performs mkdir() on the hashed subvolume and proceeds
   to create other dirs, loser experiences EEXIST error on the hashed
   subvolume and backs off. But in this case, because the two clients
   have different layout views of the parent directory (because of
   different hash splits and assignments), the hashed subvolumes for
   the new directory can end up being different. Therefore, both clients
   now win the race (they were never fighting against each other on a
   common server), assigning different GFIDs to the directory on their
   respective (different) subvolumes. Some of the remaining subvolumes
   get GFID1, others GFID2.

Conclusion/Fix:
   Making mkdir translate EEXIST error as success (just the way self-heal
   is already rightly doing) will bring back truth to the design claim
   that concurrent mkdir/self-heals perform deterministic + idempotent
   operations. This will prevent the differing "hash views" by different
   clients and thereby also avoid GFID mismatch by forcing all clients
   to have a "fair race", because the hashed subvolume for all will be
   the same (and thereby avoiding leaked locks and hangs.)
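
As a local-filesystem analogy for the fix, the Python sketch below treats
EEXIST from mkdir as success, which makes repeated or racing calls
idempotent. It is not the dht_mkdir_cbk change itself, and the function
name mkdir_idempotent is hypothetical.

```python
import errno
import os


def mkdir_idempotent(path):
    """Create path; an already existing entry counts as success."""
    try:
        os.mkdir(path)
    except OSError as err:
        if err.errno != errno.EEXIST:
            raise  # any other failure is still a real error
    return 0
```

Calling it twice on the same path returns 0 both times, so the caller
that loses the race reaches the same result as the one that wins.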

Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati &lt;avati@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In dht_mkdir_cbk, EEXIST error is treated like a true error. Because
of this the following sequence of events can happen, eventually
resulting in GFID mismatch (and possibly leaked locks and hangs,
in the presence of replicate).

The issue exists when many clients concurrently attempt creation of
a directory and subdirectory (e.g. mkdir -p /mnt/gluster/dir1/subdir).

0. First mkdir happens by one client on the hashed subvolume. Only
   one client wins the race. Others racing mkdirs get EEXIST. Yet
   other "laggers" in the race encounter the just-created directory
   in lookup() on the hash dir.

1. At least one "lagger" lookup() notices that there are missing
   directories on other subvolumes (which the "winner" mkdir is yet
   to create), and starts off self-heal of the directory.

2. At least on some subvolumes, self-heal's mkdir wins the race
   against the "winner" mkdir and creates the directory first. This
   causes the "winner" mkdir to experience EEXIST error on those
   subvolumes.

3. On other subvolumes where "winner" mkdir won the race, self-heal
   experiences EEXIST error, but self-heal is properly translating
   that into a success (but mkdir code path is not -- which is the
   bug.)

4. Both mkdir and self-heal assign hash layouts to the just created
   directory. But self-heal distributes hash range across N (total)
   subvolumes, whereas mkdir distributes hash range across N - M
   (where M is the number of subvolumes where mkdir lost the race).
   Both clients "cache" their respective layouts and use them in the
   near future for all creates inside the directory (evidence in logs).

5. During the creation of the subdirectory, two clients race again.
   Ideally winner performs mkdir() on the hashed subvolume and proceeds
   to create other dirs, loser experiences EEXIST error on the hashed
   subvolume and backs off. But in this case, because the two clients
   have different layout views of the parent directory (because of
   different hash splits and assignments), the hashed subvolumes for
   the new directory can end up being different. Therefore, both clients
   now win the race (they were never fighting against each other on a
   common server), assigning different GFIDs to the directory on their
   respective (different) subvolumes. Some of the remaining subvolumes
   get GFID1, others GFID2.

Conclusion/Fix:
   Making mkdir translate EEXIST error as success (just the way self-heal
   is already rightly doing) will bring back truth to the design claim
   that concurrent mkdir/self-heals perform deterministic + idempotent
   operations. This will prevent the differing "hash views" by different
   clients and thereby also avoid GFID mismatch by forcing all clients
   to have a "fair race", because the hashed subvolume for all will be
   the same (and thereby avoiding leaked locks and hangs.)

Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati &lt;avati@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
