glusterfs.git/xlators/cluster/dht/src, branch v3.4.0alpha

cluster/dht: Correct min_free_disk behaviour

2013-02-04T16:43:50+00:00

Problem:
Files were being created in a subvol that had less than
min_free_disk available, even when other subvols with
more space were available.

Solution:
Changed the logic to look for a subvol that has more
space available.
When all the subvols have less than min_free_disk
available, the one with the most space and at least
one free inode is chosen.

Known Issue: There is no guarantee that the first file
created right after the min-free-disk value is crossed on
a brick will be created on another brick, because the disk
usage stat takes some time to update in the gluster process.
This will be fixed as part of another bug.
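
The selection rule described in the Solution above can be sketched as
follows. This is a minimal illustrative model in Python, not the DHT
source: the SubvolStat type, the pick_subvol name and the percentage
based threshold are all hypothetical, and keeping the hashed subvolume
while it is above the threshold is an assumption about the surrounding
behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubvolStat:
    """Hypothetical per-subvolume usage snapshot (not a GlusterFS structure)."""
    name: str
    free_disk_pct: float  # percentage of disk space still free
    free_inodes: int      # number of inodes still free


def pick_subvol(hashed: SubvolStat, subvols: List[SubvolStat],
                min_free_disk_pct: float) -> Optional[SubvolStat]:
    """Pick the subvolume on which a new file should be created."""
    # Assumption: the hashed subvolume is kept while it is above the threshold.
    if hashed.free_disk_pct >= min_free_disk_pct and hashed.free_inodes > 0:
        return hashed
    # Otherwise take the subvolume with the most free space that still has
    # at least one free inode. If any candidate is above the threshold, the
    # maximum is too; if none is, this is the "least full" fallback.
    candidates = [s for s in subvols if s.free_inodes > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.free_disk_pct)
```

Because the usage snapshot is refreshed only periodically, a sketch like
this would still show the Known Issue: a stale free_disk_pct lets one
more file land on a brick that has just crossed the threshold.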

Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur 
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatch

2013-02-03T20:14:19+00:00

In dht_mkdir_cbk, an EEXIST error is treated like a true error. Because
of this, the following sequence of events can happen, eventually
resulting in a GFID mismatch (and possibly leaked locks and hangs
in the presence of replicate).

The issue exists when many clients concurrently attempt creation of a
directory and subdirectory (e.g. mkdir -p /mnt/gluster/dir1/subdir).

0. First mkdir happens by one client on the hashed subvolume. Only
   one client wins the race. Other racing mkdirs get EEXIST. Yet
   other "laggers" in the race encounter the just-created directory
   in lookup() on the hash dir.

1. At least one "lagger" lookup() notices that there are missing
   directories on other subvolumes (which the "winner" mkdir is yet
   to create), and starts off self-heal of the directory.

2. On at least some subvolumes, self-heal's mkdir wins the race
   against the "winner" mkdir and creates the directory first. This
   causes the "winner" mkdir to experience EEXIST error on those
   subvolumes.

3. On other subvolumes where "winner" mkdir won the race, self-heal
   experiences EEXIST error, but self-heal is properly translating
   that into a success (but mkdir code path is not -- which is the
   bug.)

4. Both mkdir and self-heal assign hash layouts to the just created
   directory. But self-heal distributes hash range across N (total)
   subvolumes, whereas mkdir distributes hash range across N - M
   (where M is the number of subvolumes where mkdir lost the race).
   Both clients "cache" their respective layouts and use them
   for all future creates inside the directory (evidence in logs).

5. During the creation of the subdirectory, two clients race again.
   Ideally winner performs mkdir() on the hashed subvolume and proceeds
   to create other dirs, loser experiences EEXIST error on the hashed
   subvolume and backs off. But in this case, because the two clients
   have different layout views of the parent directory (because of
   different hash splits and assignments), the hashed subvolumes for
   the new directory can end up being different. Therefore, both clients
   now win the race (they were never fighting against each other on a
   common server), assigning different GFIDs to the directory on their
   respective (different) subvolumes. Some of the remaining subvolumes
   get GFID1, others GFID2.

Conclusion/Fix:
   Making mkdir translate EEXIST error as success (just the way self-heal
   is already rightly doing) will bring back truth to the design claim
   that concurrent mkdir/self-heals perform deterministic + idempotent
   operations. This will prevent the differing "hash views" by different
   clients and thereby also avoid GFID mismatch by forcing all clients
   to have a "fair race", because the hashed subvolume for all will be
   the same (and thereby avoiding leaked locks and hangs.)
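
The divergence in step 4 and the effect of the fix can be illustrated
with a toy model. This is a sketch under stated assumptions, not the
dht_mkdir_cbk code: successful_subvols and split_hash_range are
hypothetical helpers, and an equal split of a 32-bit hash space is
assumed purely for illustration.

```python
import errno


def successful_subvols(results, eexist_is_success):
    """Return the subvolumes whose mkdir counts as successful.

    results maps a subvolume name to an (op_ret, op_errno) pair.
    """
    ok = []
    for name, (op_ret, op_errno) in sorted(results.items()):
        if op_ret == 0:
            ok.append(name)
        elif eexist_is_success and op_errno == errno.EEXIST:
            # The directory already exists there (self-heal created it
            # first), so treat the reply as success.
            ok.append(name)
    return ok


def split_hash_range(subvols, hash_space=2**32):
    """Split the hash space evenly (an assumption) across subvols."""
    chunk = hash_space // len(subvols)
    layout = {}
    start = 0
    for i, name in enumerate(subvols):
        last = i == len(subvols) - 1
        end = hash_space - 1 if last else start + chunk - 1
        layout[name] = (start, end)
        start = end + 1
    return layout


# The "winner" mkdir lost the race to self-heal on s1 and s3 (M = 2).
results = {
    "s0": (0, 0),
    "s1": (-1, errno.EEXIST),
    "s2": (0, 0),
    "s3": (-1, errno.EEXIST),
}
old_view = split_hash_range(successful_subvols(results, False))  # N - M
new_view = split_hash_range(successful_subvols(results, True))   # N
```

With EEXIST treated as an error, old_view covers only s0 and s2, while
self-heal spreads the range over all four subvolumes. With EEXIST
treated as success, new_view matches the self-heal view, so both
clients compute the same hashed subvolume for the subdirectory.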

Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi

cluster/dht: stack wind with cookie

2013-02-01T01:18:03+00:00

Default_fops uses stack_wind_tail, which winds without creating a new frame,
so the cookie returned in the callback points to the wrong subvol. To avoid
the problems this causes, we now wind with the cookie passed explicitly and
get the subvol from it.

Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f
BUG: 893338
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/4388
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati 
Tested-by: Anand Avati

libglusterfs/syncop: do not hold ref on the fd in cbk

2013-01-31T07:40:37+00:00

* Do not do fd_ref in cbks of the fops which return a fd (such as
  open, opendir, create).

Change-Id: Ic2f5b234c5c09c258494f4fb5d600a64813823ad
BUG: 885008
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/4282
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Anand Avati

cluster/distribute: get_layout should account only available subvols

2013-01-24T07:43:39+00:00

The earlier logic checked whether layout-spread-count <= (subvol_cnt -
decommissioned bricks). With this check, if a subvol was down and layout-spread
was greater than the number of up subvols, a mkdir ended up creating holes in
the layout.

The fix is to consider only the subvols which are usable (neither down nor
decommissioned).
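
The difference between the two checks can be sketched as follows. This
is an illustrative Python model, not the get_layout code: the Subvol
tuple and both function names are hypothetical, and falling back to the
limit when the spread count is unset or too large is an assumption.

```python
from collections import namedtuple

# Hypothetical view of a subvolume's state (not a GlusterFS structure).
Subvol = namedtuple("Subvol", "name up decommissioned")


def old_layout_count(spread_count, subvols):
    """Earlier check: only decommissioned bricks are subtracted."""
    limit = len(subvols) - sum(1 for s in subvols if s.decommissioned)
    return spread_count if 0 < spread_count <= limit else limit


def new_layout_count(spread_count, subvols):
    """Fixed check: count only subvols that are up and not decommissioned."""
    usable = [s for s in subvols if s.up and not s.decommissioned]
    return spread_count if 0 < spread_count <= len(usable) else len(usable)


subvols = [
    Subvol("s0", up=True, decommissioned=False),
    Subvol("s1", up=False, decommissioned=False),  # down
    Subvol("s2", up=True, decommissioned=True),    # being removed
    Subvol("s3", up=True, decommissioned=False),
]
```

With a spread count of 3, the earlier check accepts 3 even though only
two subvols can actually take a range, which is how a hole appears. The
fixed check limits the layout to the two usable subvols.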

Change-Id: I61ad3bcaf4589f5a75f7887cfa595c98311ae3bb
BUG: 902610
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4412
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

glusterd/cli: Updated the options descriptions for "volume set help"

2013-01-22T07:03:57+00:00

Change-Id: I0db00b7334bb9707ab48bd661ac03a3ad818d6e4
BUG: 893458
Signed-off-by: Avra Sengupta 
Reviewed-on: http://review.gluster.org/4393
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

core: fixes for gcc's '-pedantic' flag build

2013-01-22T06:17:36+00:00

* warnings on 'void *' arguments
* warnings on empty initializations
* warnings on empty array (array[0])

Change-Id: Iae440f54cbd59580eb69f3ecaed5a9926c0edf95
BUG: 875913
Signed-off-by: Avra Sengupta 
Reviewed-on: http://review.gluster.org/4219
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/distribute: If cached_subvol is down, return ENOTCONN in lookup

2013-01-21T20:03:03+00:00

When we follow a linkfile and the lookup returns an ENOTCONN error, return
the error. The cached subvol is down, so lookup_everywhere won't succeed,
and it would actually end up clearing the linkfile and the namespace.
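
The decision described above can be sketched as a small function. This
is a hypothetical model in Python, not the DHT lookup callback: the
function name and the returned action strings are made up for
illustration, and sending every other error to lookup_everywhere is an
assumption.

```python
import errno


def after_linkfile_lookup(op_ret, op_errno):
    """Decide the next step once the lookup on the cached subvol returns."""
    if op_ret == 0:
        return "unwind_success"
    if op_errno == errno.ENOTCONN:
        # The cached subvol is down: report the error instead of running
        # lookup_everywhere, which would clear the linkfile.
        return "unwind_error"
    # Assumption: other failures still fall back to lookup_everywhere.
    return "lookup_everywhere"
```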

Change-Id: I772bf71531bc646e8fb62d3e8549a5fe0f3896da
BUG: 893378
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4383
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: update ctx-time only if we receive the new iatt

2013-01-17T08:26:57+00:00

1. Use local->postparent (which contains the merged iatt of all successful
calls) instead of postparent for the dht ctx time update.

2. Skip dht_inode_ctx_time_update when op_ret is -1.

Change-Id: Ie04a7842a41c241f911b6a3f76267b996d27fb43
BUG: 881013
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/4338
Reviewed-by: Shishir Gowda 
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: Add "afr.readdir-failover=off" option to the rebalance process

2012-12-17T03:45:53+00:00

By failing over readdir (default behaviour), rebalance could get duplicate
files, as readdir would re-read from offset 0. Rebalance should not attempt
to migrate these files again.

Additionally, we need to handle these cases as failure in rebalance crawl.

No test case provided, as we cannot determine the read child in afr.

Change-Id: If07508b4f92dacc17e0f695b48a866c7c66004be
BUG: 859387
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4300
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati