glusterfs.git, branch v3.3.2qa3

cluster/dht: Linkfiles creation with correct uid/gid

2013-05-16T15:49:39+00:00

If renames are done with different uid/gid (non-owners), then we would
end up with incorrect uid/gid.

The fix is to create linkfiles and heal their uid/gid as root:root. This
keeps both linkfile creation and uid/gid healing running as root:root in
all paths. Additionally, we need to consider the uid/gid from the
src_cached subvol only, and not from linkfiles.

Rename is also done as root:root when performed on a linkfile, because the
setattr of ownership on the linkfile is done after the rename.

BUG: 884597
Change-Id: Ifaacd8dba0f39cb909761ffc8fe7e06cd44ec8de
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/5025
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Create linkfile with file uid/gid

2013-05-16T15:47:44+00:00

Currently, linkfile creation happens as root.

Use the uid/gid returned from _cbk (link/rename) to set the correct ownership
of the linkfiles.

Change-Id: I5345cff193d5095442ca446fbe5ea05f2c2d86a3
Signed-off-by: shishir gowda 
BUG: 884597
Reviewed-on: http://review.gluster.org/5024
Reviewed-by: Vijay Bellur 
Tested-by: Gluster Build System

libglusterfs/statedump: move options file and statedumps from /tmp

2013-05-14T17:47:09+00:00

Change-Id: I6b107b9a668b0521b955dba8895cbbeaf9e7cb02
BUG: 764890
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/5005
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

glusterfs: add gf_mkostemp api and use it instead of mkostemp of libc

2013-05-14T16:02:11+00:00

Change-Id: Ia3d2f37ae1f7a7d87a75c82bedb4963729d45b6c
BUG: 764890
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/5004
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

geo-rep: retire old style ssh setup

2013-04-27T16:36:58+00:00

Users are still using geo-rep with the old, deprecated, insecure, unsupported
ssh setup. Not their fault -- the implementation of the new method had the
following characteristics:
- old method is possible, but with default settings it's not working
- it can be made operational by fiddling with "remote-gsyncd" tunable
- with default setting, an unhelpful, actually misleading error message is
  produced
- the UI gave no hint to the changes in the ssh setup

http://review.gluster.org/4392 tried to fix these; what it accomplished was
unrestricted support to the bad practice (by making the default old setup
operational).

From now on:
- we disable the old method by reserving the "remote-gsyncd" tunable
- if the old method is attempted, we give a hint on what to do

Change-Id: Icade94725d8d8d2d4c89cab992d4226351637b86
BUG: 895656
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.org/4892
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

glusterd: replace obsolete /usr/local reference for remote ssh/gsyncd

2013-04-27T16:36:39+00:00

See https://bugzilla.redhat.com/show_bug.cgi?id=895656
    https://bugzilla.redhat.com/show_bug.cgi?id=764679 (GLUSTER-2947)
    https://bugzilla.redhat.com/show_bug.cgi?id=764623 (GLUSTER-2891)

The comments in the bzs are a bit obtuse and/or vague. As near as I
can make out we had, for a while, a "convenience symlink" to or from
/usr/local/libexec/gsyncd, which no longer exists.

And, lacking any comments in the code, I gather this is some sort of
fallback or failsafe logic: if the first, normal attempt to invoke gsyncd
fails then an attempt is made to ssh to the box and invoke it.

In any event, there's nothing in /usr/local/... so it's unquestionably
wrong to try to invoke anything there.

[Backporting Kaleb's patch]

BUG: 895656
Change-Id: I3b7ac7a049b91ce101b930599294830147cc60ad
Signed-off-by: Kaleb S. KEITHLEY 
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.org/4891
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

distribute: Fix fds being leaked during rebalance

2013-04-26T07:35:00+00:00

This patch is a backport of 2 patches from the master branch which fix the
leak of fds during a rebalance process.

The patches are:
* libglusterfs/syncop: do not hold ref on the fd in cbk
  (e979c0de9dde14fe18d0ad7298c6da9cc878bbab)
* cluster/distribute: Remove suprious fd_unref call
  (5d29e598665456b2b7250fdca14de7409098877a)

Change-Id: Icea1d0b32cb3670f7decc24261996bca3fe816dc
BUG: 928631
Signed-off-by: Kaushal M 
Reviewed-on: http://review.gluster.org/4888
Reviewed-by: Vijay Bellur 
Tested-by: Gluster Build System

cluster/dht: Correct min_free_disk behaviour

2013-04-17T11:53:57+00:00

Problem:
Files were being created in subvol which had less than min_free_disk available
even in the cases where other subvols with more space were available.

Solution:
Changed the logic to look for a subvol which has more space available. In
cases where all the subvols have less than min_free_disk available, the one
with the most space and at least one free inode is chosen.
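
A minimal sketch of the selection rule described above, not the actual DHT
code. The struct, its fields and the function name are hypothetical, and
keeping the hashed subvol while it still satisfies min_free_disk is an
assumption drawn from the problem statement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-subvolume usage record; not the real DHT structure. */
struct subvol_usage {
    double        free_percent;  /* free disk space, in percent */
    unsigned long free_inodes;   /* number of free inodes */
};

/*
 * Return the index of the subvolume to create a file on.
 * The hashed subvolume is kept if it still satisfies min_free_disk
 * (assumption). Otherwise the subvolume with the most free space and at
 * least one free inode is chosen, which also covers the case where every
 * subvolume is below min_free_disk. Returns -1 if no subvolume has a
 * free inode.
 */
int pick_subvol(const struct subvol_usage *subvols, size_t count,
                size_t hashed, double min_free_disk)
{
    assert(subvols != NULL && hashed < count);

    if (subvols[hashed].free_percent >= min_free_disk &&
        subvols[hashed].free_inodes > 0)
        return (int)hashed;

    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (subvols[i].free_inodes == 0)
            continue;  /* no inode left, cannot hold a new file */
        if (best < 0 || subvols[i].free_percent > subvols[best].free_percent)
            best = (int)i;
    }
    return best;
}
```

With usages of 5%, 40% and 60% (the last one without free inodes) and a
10% threshold, a file hashed to the first subvol would land on the second.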

Known Issue: Cannot ensure that the first file created right after the
min-free-value is crossed on a brick will get created on another brick,
because the disk usage stat takes some time to update in the gluster
process. Will fix that as part of another bug.

Change-Id: Icaba552db053ad8b00be0914b1f4853fb7661bd3
BUG: 874554
Signed-off-by: Raghavendra Talur 
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/4839
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

dht: improve transform/detransform of d_off (and be ext4 safe)

2013-04-16T17:35:59+00:00

Backporting Avati's fix http://review.gluster.org/4711

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since brick d_off and global d_off are both 64 bits
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32 bits. So
we have another 32 bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize up to 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However, both these filesystems (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(), i.e. a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is broken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ are NOT tolerant to
  seekdir() to the "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global / N)

   brick_id = d_off_global % N
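
A minimal C sketch of the SMALL scheme, for illustration only. Function
names are hypothetical and the cap on the number of bricks is an assumption
made to keep the shifts well defined. Decoding is the inverse of the
encoding formula: dividing by N recovers the brick offset and the remainder
is the brick id.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed for the brick id: ROOF(Log2(N)). */
unsigned brick_id_bits(uint64_t n_bricks)
{
    unsigned bits = 0;
    while (bits < 63 && (UINT64_C(1) << bits) < n_bricks)
        bits++;
    return bits;
}

/*
 * SMALL encoding. Returns 0 and fills *global on success, or -1 when the
 * top n + 1 bits of the brick offset are not free and the HUGE scheme
 * has to be used instead.
 */
int small_encode(uint64_t d_off_brick, uint64_t brick_id, uint64_t n_bricks,
                 uint64_t *global)
{
    /* Assumption: at most 2^32 - 1 bricks, so n + 1 stays below 64. */
    assert(n_bricks > 0 && n_bricks <= UINT32_MAX && brick_id < n_bricks);

    unsigned n = brick_id_bits(n_bricks);
    uint64_t hi_mask = ~(UINT64_MAX >> (n + 1));  /* top n + 1 bits */

    if ((hi_mask & d_off_brick) != 0)
        return -1;

    *global = d_off_brick * n_bricks + brick_id;
    return 0;
}

/* SMALL decoding; the caller has already checked that bit 63 is 0. */
void small_decode(uint64_t global, uint64_t n_bricks,
                  uint64_t *d_off_brick, uint64_t *brick_id)
{
    assert(n_bricks > 0 && (global >> 63) == 0);
    *d_off_brick = global / n_bricks;
    *brick_id    = global % n_bricks;
}
```

For example, with N = 3, brick offset 1000 on brick 2 encodes to 3002 and
decodes back to (1000, 2).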

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | (d_off_brick & low_mask)) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set to 1, it indicates that
the encoding formula above was used. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)

    brick_id = global_d_off & low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that until recently EXT4 used to return only
what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT4 recently only decrease the probability
of a collision and do not eliminate it completely, anyway. In a way,
the "lost" n bits are made up for by decreasing the probability of
collision by sharding the files into N bricks / EXT4 directories
    -- call it "hash hedging", if you will :-)
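
A minimal C sketch of the HUGE scheme, for illustration only. Function
names and the HUGE_FLAG macro are hypothetical; n is ROOF(Log2(N)) and is
assumed to be smaller than 63. The round trip shows the low n bits of the
brick offset being lost, as discussed above.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_FLAG UINT64_C(0x8000000000000000)  /* bit 63 marks HUGE */

/*
 * HUGE encoding. Returns 0 and fills *global on success, or -1 when the
 * result would be 0xffffffffffffffff and the entry must be discarded.
 */
int huge_encode(uint64_t d_off_brick, uint64_t brick_id, unsigned n,
                uint64_t *global)
{
    assert(n < 63 && brick_id < (UINT64_C(1) << n));

    uint64_t low_mask = UINT64_MAX << n;  /* clears the low n bits */
    uint64_t g = (HUGE_FLAG | (d_off_brick & low_mask)) + brick_id;

    if (g == UINT64_MAX)
        return -1;

    *global = g;
    return 0;
}

/* HUGE decoding; the caller has already checked that bit 63 is 1. */
void huge_decode(uint64_t global, unsigned n,
                 uint64_t *d_off_brick, uint64_t *brick_id)
{
    assert(n < 63 && (global & HUGE_FLAG) != 0);

    uint64_t hi_mask  = UINT64_MAX << n;
    uint64_t low_mask = ~hi_mask;

    *d_off_brick = global & hi_mask & ~HUGE_FLAG;
    *brick_id    = global & low_mask;
}
```

With n = 2, brick offset 0x7123456789abcdef on brick 2 encodes to
0xf123456789abcdee and decodes to offset 0x7123456789abcdec, brick 2.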

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown 
BUG: 838784
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi 
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System 
Reviewed-on: http://review.gluster.org/4822

dict: Put "goto out" in dict_unserialize to avoid process crash

2013-04-12T07:21:09+00:00

Problem:
In the dictionary unserialization function (dict_unserialize), if
[(buf + vallen) > (orig_buf + size)], the value runs past the end of the
serialized buffer and the subsequent memdup fails.

Fix:
Put "goto out" whenever this condition is met.
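
A minimal sketch of the bounds check with an early "goto out", not the
actual dict_unserialize code. The function name and signature are
hypothetical, and plain malloc/memcpy stand in for memdup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy vallen bytes starting at buf, where the serialized data occupies
 * [orig_buf, orig_buf + size). Returns a malloc'd copy, or NULL if the
 * value would run past the end of the buffer or the allocation fails.
 */
void *copy_value_checked(const char *orig_buf, size_t size,
                         const char *buf, size_t vallen)
{
    void *copy = NULL;

    assert(orig_buf != NULL && buf >= orig_buf);

    /* Same test as (buf + vallen) > (orig_buf + size), written without
     * forming an out-of-range pointer. */
    size_t used = (size_t)(buf - orig_buf);
    if (used > size || vallen > size - used)
        goto out;

    copy = malloc(vallen ? vallen : 1);
    if (copy == NULL)
        goto out;
    memcpy(copy, buf, vallen);

out:
    return copy;
}
```

A value that fits inside the buffer is copied; one that would overrun it
yields NULL instead of reading past the end.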

Change-Id: Ia10ddc7e1cf551eed0e2c3d0f0364c6961e13025
BUG: 947824
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on: http://review.gluster.org/4770
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy