glusterfs-afrv1.git/xlators/cluster/dht, branch v3.4.0alpha2

cluster/distribute: Reopen fds in migration internally as root:root

2013-03-04T10:42:30+00:00

Although linkfile_create and the rebalance dst file create send a setattr
with the correct ownership, there is still a race window in which the
linkfile open (a client open due to migration) fails, because the file is
still owned by root:root at that point.

BUG: 884597
Change-Id: Iba73681eae4f280d39ee6c9a40009e195768bee7
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4612
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/distribute: Prevent spurious multiple defrag crawls

2013-03-04T10:42:10+00:00

In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was incorrect, as a later event could trigger the
crawl again (because all subvols had already responded).

The fix is to make sure the thread is started only once, after
all subvols have responded for the first time.
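
The "start only once" rule can be sketched as a latched flag checked under
a lock. This is a minimal illustration, not the actual dht_notify code:
struct crawl_state, crawl_state_init and crawl_should_start are
hypothetical names, and the caller is assumed to create the crawl thread
only when crawl_should_start returns true.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SUBVOLS 64

/* Hypothetical stand-in for the xlator's private state (not dht_conf_t). */
struct crawl_state {
    pthread_mutex_t lock;
    int  subvol_count;        /* number of child subvols */
    int  heard_count;         /* children that have responded at least once */
    bool heard[MAX_SUBVOLS];  /* per-child "has responded" flag */
    bool crawl_started;       /* latched when the crawl is first started */
};

void crawl_state_init(struct crawl_state *st, int subvol_count)
{
    assert(st != NULL && subvol_count > 0 && subvol_count <= MAX_SUBVOLS);
    memset(st, 0, sizeof(*st));
    st->subvol_count = subvol_count;
    pthread_mutex_init(&st->lock, NULL);
}

/*
 * Record an event from one child. Returns true for exactly one call: the
 * one on which the last outstanding child responds for the first time.
 * Later events from any child return false, so no second crawl starts.
 */
bool crawl_should_start(struct crawl_state *st, int child)
{
    bool start = false;

    assert(st != NULL && child >= 0 && child < st->subvol_count);

    pthread_mutex_lock(&st->lock);
    if (!st->heard[child]) {
        st->heard[child] = true;
        st->heard_count++;
    }
    if (st->heard_count == st->subvol_count && !st->crawl_started) {
        st->crawl_started = true;   /* latch: never reset */
        start = true;
    }
    pthread_mutex_unlock(&st->lock);

    return start;
}
```

Without the crawl_started latch, every event arriving after all children
have responded would satisfy the count check again, which is the spurious
re-trigger described above.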

BUG: 916449
Change-Id: I1619344fbb1cb51d5e1db38d8a29821fa870fa8b
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4610
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/distribute: Preserve file size during rebalance migration

2013-03-04T10:42:00+00:00

If holes are encountered, we do not write them to the dst,
which sometimes causes the dst file size to be smaller than the src.
Data is not corrupted, because non-zero reads are still written.

Call truncate with the src file size to prevent the dst from ending
up shorter than the src when the file ends in holes.
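
The effect can be reproduced on local files. The sketch below is an
assumption-laden illustration using plain POSIX calls rather than gluster
fops: copy_preserving_size and CHUNK are hypothetical names, and ftruncate
stands in for the truncate issued by the rebalance code.

```c
#define _XOPEN_SOURCE 700   /* expose pread, pwrite and ftruncate */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

static int is_all_zero(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/*
 * Copy src_fd to dst_fd, skipping chunks that read back as all zeros
 * (holes), then set the dst size to the src size. Returns 0 on success
 * and -1 on error.
 */
int copy_preserving_size(int src_fd, int dst_fd)
{
    char buf[CHUNK];
    struct stat st;
    off_t offset = 0;

    assert(src_fd >= 0 && dst_fd >= 0);

    if (fstat(src_fd, &st) < 0) {
        perror("fstat");
        return -1;
    }

    while (offset < st.st_size) {
        ssize_t n = pread(src_fd, buf, sizeof(buf), offset);
        if (n < 0) {
            perror("pread");
            return -1;
        }
        if (n == 0)
            break;

        if (!is_all_zero(buf, (size_t)n)) {
            ssize_t done = 0;
            while (done < n) {
                ssize_t w = pwrite(dst_fd, buf + done,
                                   (size_t)(n - done), offset + done);
                if (w < 0) {
                    perror("pwrite");
                    return -1;
                }
                done += w;
            }
        }
        offset += n;
    }

    /* Without this call a trailing hole would leave dst shorter than src. */
    if (ftruncate(dst_fd, st.st_size) < 0) {
        perror("ftruncate");
        return -1;
    }
    return 0;
}
```

Holes in the middle of the file are recreated implicitly by writing at the
original offsets; only a hole at the end needs the explicit size fix.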

Thanks to Brian Foster for providing the test case

BUG: 915554
Change-Id: I7e1e0c475118b073c3ebb87e93220c1ec22e8b7d
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4609
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/distribute: Remove spurious fd_unref call

2013-03-04T10:40:32+00:00

After fix http://review.gluster.org/4282 (libglusterfs/syncop: do not
hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore.

BUG: 910661
Change-Id: Idedff91270966e6e70e71ee83785c0228e238d31
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4608
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/dht: Create linkfile with file uid/gid

2013-03-04T10:40:01+00:00

Currently, linkfile creation happens as root.

Use the uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.

Also added test/dht.rc to implement common dht functions.

BUG: 884597
Change-Id: I6bc0e04f62d4716fc033681e5678e852a1be7a2f
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4607
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/dht: pathinfo xattr changes for directories

2013-02-09T03:09:46+00:00

Since directories are present on all subvolumes, ->hashed_subvol and
->cached_subvol have no definite meaning for them.
The getxattr() code path chooses ->cached_subvol for the pathinfo
extended attribute. While this makes sense for files, it makes
less sense for directories. Further, if a hashed or a cached
subvolume is down and there is a getxattr request for a
directory, we return with an errno.

This patch changes pathinfo extended attribute contents by
aggregating information from all subvolumes that are up.
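
The aggregation idea can be sketched as follows. This is only an
illustration under stated assumptions: struct subvol_info and
aggregate_pathinfo are hypothetical names, the space separator is an
arbitrary choice, and the per-subvolume strings are placeholders rather
than the real pathinfo format.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-subvolume record; not a glusterfs structure. */
struct subvol_info {
    const char *pathinfo;  /* value this subvolume would report */
    bool        up;        /* false if the subvolume is down */
};

/*
 * Join the pathinfo of every subvolume that is up into out, separated by
 * single spaces. Subvolumes that are down are skipped instead of failing
 * the whole request. Returns the number of subvolumes included, or -1 if
 * out is too small.
 */
int aggregate_pathinfo(const struct subvol_info *subvols, int count,
                       char *out, size_t out_len)
{
    size_t used = 0;
    int included = 0;

    assert(subvols != NULL && out != NULL && out_len > 0);
    out[0] = '\0';

    for (int i = 0; i < count; i++) {
        if (!subvols[i].up)
            continue;

        size_t len = strlen(subvols[i].pathinfo);
        size_t sep = included ? 1 : 0;

        /* Room for separator, value and the terminating NUL? */
        if (used + sep + len + 1 > out_len)
            return -1;

        if (sep)
            out[used++] = ' ';
        memcpy(out + used, subvols[i].pathinfo, len);
        used += len;
        out[used] = '\0';
        included++;
    }
    return included;
}
```

With three subvolumes of which the second is down, the result contains the
first and third values only, and the call still succeeds.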

Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d
Signed-off-by: Venky Shankar 
BUG: 856455
Reviewed-on: http://review.gluster.org/4047
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

Use proper libtool option -avoid-version instead of bogus -avoidversion

2013-02-07T23:12:56+00:00

Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8
BUG: 859835
Signed-off-by: Anand Avati 
Original-author: Kacper Kowalik (Xarthisius) 
Signed-off-by: Kacper Kowalik (Xarthisius) 
Reviewed-on: http://review.gluster.org/3967
Tested-by: Gluster Build System

dht: better layout-optimization algorithm

2013-02-07T16:27:40+00:00

This method deals with the case where swapping might gain a bigger overlap
for the xlator currently under consideration, but sacrifices even more from
the xlator we're swapping with. For example:

A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554)
B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9)
C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff)

Here, the new range for B has a bigger overlap with the old C than with the
old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that
might lead us to swap. However, such a swap turns the new C's overlap from
0x55555556 (vs. old C) to *zero* (vs. old B).  In other words, we've gained
0x11111111 for B but lost 0x55555556 for C, so it's a bad idea.

The new algorithm accounts for all effects of the swap, so it not only avoids
bad swaps but can make some good ones that would have been missed previously.
For example, if swapping a range X with a later range Y would not increase the
overlap for X we would previously have skipped it even if the swap would
increase Y's overlap without affecting X's.  This is the normal case when we're
adding a new brick (which initially has zero overlap with any old range) so
finding more good swaps is probably even more important than avoiding bad ones.

Also, the logic in dht_overlap_calc was completely broken before, causing
integer overflows instead of providing correct values, so no matter what
higher-level algorithm was in place the GIGO effect would have resulted in
bad decisions.
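
The arithmetic in the B/C example can be checked with a short sketch. It
is not the code in dht_overlap_calc: struct range, range_overlap and
swap_gain are hypothetical names, ranges are assumed to be inclusive, and
64-bit results are used so that the full 32-bit hash space (2^32 values)
does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive hash range, e.g. 0x44444444 - 0x77777776. */
struct range {
    uint32_t start;
    uint32_t stop;
};

/* Number of hash values shared by a and b, as a 64-bit count. */
uint64_t range_overlap(struct range a, struct range b)
{
    uint32_t lo = a.start > b.start ? a.start : b.start;
    uint32_t hi = a.stop < b.stop ? a.stop : b.stop;

    assert(a.start <= a.stop && b.start <= b.stop);

    if (lo > hi)
        return 0;
    /* Widen before adding 1 so that 0x0 - 0xffffffff yields 2^32. */
    return (uint64_t)hi - lo + 1;
}

/*
 * Net change in total overlap if bricks x and y exchange their new
 * ranges. A positive value means the swap keeps more data in place
 * overall; a negative value means it is a bad swap.
 */
int64_t swap_gain(struct range old_x, struct range new_x,
                  struct range old_y, struct range new_y)
{
    uint64_t before = range_overlap(old_x, new_x) +
                      range_overlap(old_y, new_y);
    uint64_t after  = range_overlap(old_x, new_y) +
                      range_overlap(old_y, new_x);

    return (int64_t)after - (int64_t)before;
}
```

For the ranges above, swap_gain for B and C evaluates to -0x44444445: the
0x11111111 gained for B is outweighed by the 0x55555556 lost for C, so the
swap is rejected when both sides are taken into account.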

Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f
BUG: 853258
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/3908
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: Correct min_free_disk behaviour

2013-02-04T16:43:50+00:00

Problem:
Files were being created in a subvol which had less than
min_free_disk available, even in cases where other
subvols with more space were available.

Solution:
Changed the logic to look for a subvol which has more
space available.
In cases where all the subvols have less than
min_free_disk available, the one with the most free space and
at least one free inode is chosen.

Known Issue: We cannot ensure that the first file created
right after the min-free-disk value is crossed on a
brick will be created on another brick, because the disk
usage stat takes some time to update in the gluster process.
Will fix that as part of another bug.

Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur 
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatch

2013-02-03T20:14:19+00:00

In dht_mkdir_cbk, an EEXIST error is treated like a true error. Because
of this, the following sequence of events can happen, eventually
resulting in a GFID mismatch (and possibly leaked locks and a hang
in the presence of replicate).

The issue exists when many clients concurrently attempt creation of
a directory and subdirectory (e.g. mkdir -p /mnt/gluster/dir1/subdir):

0. First mkdir happens by one client on the hashed subvolume. Only
   one client wins the race. Others racing mkdirs get EEXIST. Yet
   other "laggers" in the race encounter the just-created directory
   in lookup() on the hash dir.

1. At least one "lagger" lookup() notices that there are missing
   directories on other subvolumes (which the "winner" mkdir is yet
   to create), and starts off self-heal of the directory.

2. At least on some subvolumes, self-heal's mkdir wins the race
   against the "winner" mkdir and creates the directory first. This
   causes the "winner" mkdir to experience EEXIST error on those
   subvolumes.

3. On other subvolumes where "winner" mkdir won the race, self-heal
   experiences EEXIST error, but self-heal is properly translating
   that into a success (but mkdir code path is not -- which is the
   bug.)

4. Both mkdir and self-heal assign hash layouts to the just created
   directory. But self-heal distributes hash range across N (total)
   subvolumes, whereas mkdir distributes hash range across N - M
   (where M is the number of subvolumes where mkdir lost the race).
   Both the clients "cache" their respective layouts in the near
   future for all future creates inside them (evidence in logs)

5. During the creation of the subdirectory, two clients race again.
   Ideally winner performs mkdir() on the hashed subvolume and proceeds
   to create other dirs, loser experiences EEXIST error on the hashed
   subvolume and backs off. But in this case, because the two clients
   have different layout views of the parent directory (because of
   different hash splits and assignments), the hashed subvolumes for
   the new directory can end up being different. Therefore, both clients
   now win the race (they were never fighting against each other on a
   common server), assigning different GFIDs to the directory on their
   respective (different) subvolumes. Some of the remaining subvolumes
   get GFID1, others GFID2.

Conclusion/Fix:
   Making mkdir translate EEXIST error as success (just the way self-heal
   is already rightly doing) will bring back truth to the design claim
   that concurrent mkdir/self-heals perform deterministic + idempotent
   operations. This will prevent the differing "hash views" by different
   clients and thereby also avoid GFID mismatch by forcing all clients
   to have a "fair race", because the hashed subvolume for all will be
   the same (and thereby avoiding leaked locks and hangs.)

Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi