glusterfs.git/xlators/cluster/dht, branch v3.6.0alpha1

dht: fix rename race

2014-07-17T17:30:56+00:00

If two clients try to rename the same file at the same time, we
sometimes end up with *no file at all* in either the old or new
location.  That's kind of bad.  The culprit seems to be some overly
aggressive cleanup code.  AFAICT, based on today's study of the code,
the intent of the changed section is to remove any linkfile we might
have created before the actual rename.  However, what we're removing
might not be our extra link.  If we're racing with another client that's
also doing a rename, it might be the only remaining link to the user's
data.  The solution, which is good enough to pass this test but almost
certainly still not complete, is to be more selective about when we do
this unlink.  Now, we only do it if we know that, at some point, we
ourselves did in fact create the link without error (notably without
ENOENT on the source or EEXIST on the destination).
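
The idea of "only unlink what we created ourselves" can be illustrated with a toy model. This is a sketch, not the DHT code: `Namespace`, `attempt_link` and `cleanup_after_rename` are hypothetical names, and a plain dictionary stands in for the brick's namespace.

```python
import errno


class Namespace:
    """Toy stand-in for a namespace: maps a name to the data it points at."""

    def __init__(self):
        self.entries = {}

    def link(self, src, dst):
        # Return 0 on success or an errno value, mimicking link(2) failures.
        if src not in self.entries:
            return errno.ENOENT
        if dst in self.entries:
            return errno.EEXIST
        self.entries[dst] = self.entries[src]
        return 0

    def unlink(self, name):
        self.entries.pop(name, None)


def attempt_link(ns, src, dst):
    """Try to create the pre-rename link; report whether *we* created it."""
    return ns.link(src, dst) == 0


def cleanup_after_rename(ns, dst, link_created):
    """Remove the extra link only if this client created it without error."""
    if link_created:
        ns.unlink(dst)
        return True
    # Someone else owns this link; it may be the last reference to the data.
    return False
```

With two racing clients, the second one sees EEXIST, records that it did not create the link, and therefore leaves the surviving entry alone during cleanup.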

Change-Id: I8d8cce150b6f8b372c9fb813c90be58d69f8eb7b
BUG: 1117851
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/8269
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

dht: support heterogeneous brick sizes

2014-07-12T16:20:52+00:00

Calculation of layouts now considers the size of each brick, so that
smaller bricks don't get an "unfair" share of allocations and start
returning ENOSPC while the larger bricks still have plenty of space.
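
A minimal sketch of size-weighted layout calculation, assuming a 32-bit hash space split into contiguous ranges and strictly positive brick sizes. The function name `weighted_layout` and the input format are hypothetical; the actual patch may weight and order bricks differently.

```python
HASH_SPACE = 2**32  # assumed 32-bit hash space, ranges are inclusive


def weighted_layout(brick_sizes):
    """Split [0, 2**32) into contiguous ranges proportional to brick size.

    brick_sizes: dict mapping brick name -> size (any consistent unit,
    assumed to be strictly positive).
    Returns a list of (name, start, stop) tuples with inclusive bounds.
    """
    total = sum(brick_sizes.values())
    if total <= 0:
        raise ValueError("total brick size must be positive")

    layout = []
    start = 0
    cumulative = 0
    for name, size in brick_sizes.items():
        cumulative += size
        # Integer arithmetic on the cumulative size avoids rounding drift,
        # so the last range always ends exactly at 2**32 - 1.
        stop = HASH_SPACE * cumulative // total - 1
        layout.append((name, start, stop))
        start = stop + 1
    return layout
```

For example, a brick three times the size of another receives three quarters of the hash space, so it also receives roughly three quarters of new files.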

The observation has been made that some clients might get ENOTCONN when
trying to fetch disk-size information, and end up calculating layouts
differently.  The following meta-observations can be made.

(1) This scenario is extremely unlikely in configurations with AFR.

(2) The most likely consequence of this scenario is that some files will
be placed sub-optimally by the client with the obsolete (non-weighted)
layout.  They'll still be found anyway, so this isn't a show stopper.

(3) Without this patch it's *guaranteed* that some files will be placed
sub-optimally, because any layout that fails to account for brick sizes
is sub-optimal.

(4) We shouldn't be doing fix-layout from two nodes simultaneously
anyway.  That's inefficient at best.  Any instances of such behavior are
separate bugs, which should be fixed separately.

(5) In the most extreme edge case, two nodes doing weighted and
non-weighted layout fixes could race and end up creating an internally
inconsistent layout.  This condition is still transient; it will be
detected and repaired automatically the next time anyone fetches the
layout.  (If it's not, that's also a preexisting bug that can show up in
other contexts.)

In conclusion, it's not the purpose of this patch to fix bugs elsewhere
in DHT.  Its purpose is to make life incrementally better for users who
add new hardware with larger disks etc. than the older equipment.  It's
only one part of an ongoing process to improve layout management and
repair, all the way up to support for multiple hash rings or tiering.

Change-Id: I05eb6f9eface9cdaf8622e0260c8c7f29020447f
BUG: 1114680
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/8093
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra G 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Vijay Bellur

DHT/Logging

2014-07-12T16:16:54+00:00

Changed the log level of a message from none to debug, because messages
logged at level none are written without a log level in the log file.

Change-Id: I463d1095d69bbd0036958282da13cb8e0226f34f
BUG: 1116797
Signed-off-by: Nithya Balachandran 
Reviewed-on: http://review.gluster.org/8253
Reviewed-by: Krutika Dhananjay 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Added logging of new layout for dir-selfheal

2014-07-03T16:22:37+00:00

Added a log message that prints the new layout which will be used
for directory self-heal.

It prints:

a) Subvolume name
b) Error     Needed because layout healing depends on the
             error, and having it in the log helps with
             debugging
c) Start     Start of the layout range
d) Stop      End of the layout range

Change-Id: I48c9c697716a899165ed29b737362a75c62e09b3
BUG: 1113066
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on: http://review.gluster.org/8173
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

porting: Port for FreeBSD rebased from Mike Ma's efforts

2014-07-03T00:20:34+00:00

- Provides a working Gluster Management Daemon, CLI
- Provides a working GlusterFS server, GlusterNFS server
- Provides a working GlusterFS client
- execinfo port from FreeBSD is moved into ./contrib/libexecinfo
  for ease of portability on NetBSD. (FreeBSD 10 and OSX provide
  execinfo natively)
- More portability cleanups for Darwin, FreeBSD and NetBSD
- Provides a new rc script for FreeBSD

Change-Id: I8dff336f97479ca5a7f9b8c6b730051c0f8ac46f
BUG: 1111774
Original-Author: Mike Ma 
Signed-off-by: Harshavardhana 
Reviewed-on: http://review.gluster.org/8141
Tested-by: Gluster Build System 
Reviewed-by: Kaleb KEITHLEY

dht: pass xdata to xlators above.

2014-06-30T03:50:40+00:00

Change-Id: I96e9feb88443fcd7da40c33c0e8c4e2645b1fcf3
BUG: 1096047
Signed-off-by: Krishnan Parthasarathi 
Reviewed-on: http://review.gluster.org/7872
Reviewed-by: Shyamsundar Ranganathan 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: handle ESTALE appropriately in rmdir codepath.

2014-06-23T12:55:30+00:00

Until we separated the case of a file/directory not existing from the
case of its parent not existing [1], we used to include a subvolume in
the layout of a directory even if the directory was not present on that
subvolume. This was done so that a lookup racing with mkdir would still
create a correct layout. However, there are other scenarios in which a
directory is not present. One of them is creating a directory after an
add-brick. Since there is no guarantee that all the ancestors exist
after an add-brick (and hence the directory cannot be created), the
newly added brick should not be part of the layout. However, we used to
consider the newly added brick part of the layout (even before doing
fix-layout on all the ancestors), and this was the root cause of [2].
[1] fixed this issue and therefore [2] as well. However, [1] is
incomplete in that it did not modify the rmdir codepath accordingly.
This patch closes that gap.

[1] http://review.gluster.org/6322
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1006809

Change-Id: I79ab96bb8abb6f3d90bb6e235a1c465e1be0fd19
BUG: 1032894
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/8142
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur

Cluster/DHT : Logging changes

2014-06-19T05:59:57+00:00

Removed trailing spaces from the code

Change-Id: I427c9a01b514824f903e301863c2c29071db6483
BUG: 1075611
Signed-off-by: Nithya Balachandran 
Reviewed-on: http://review.gluster.org/8096
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

features/quota: Make dht_statfs_cbk more fool proof from quota_deem_statfs

2014-06-18T16:50:32+00:00

Problem:
The function assumes that if the quota-deem-statfs option is enabled, all of
the subvolumes send their xdata with the quota-deem-statfs flag ON. But this
may not hold when some of the subvolumes return errors.

The policy adopted here is to treat quota-deem-statfs as ON if *at least ONE*
of the subvolumes sends the flag ON, so df reports quota-modified statfs
values in that case. This can be visualized with the "Transition
Diagram/State Machine" below.

        Event: Each Quota deem statfs status from the individual bricks
        Action: Decision taken on the calculation of the statvfs received
        State: Whether quota deem statfs is ON or OFF (0: OFF, 1: ON)
        Input: Event from individual bricks

              ___                  ___
             /   \   OFF*         /   \  (OFF|ON)*
            |     |              |     |
             \   /        ON      \   /
        -----> 0  ----------------> 1

The transition function below shows which action is applied to the statfs
calculation for each combination of current state and received event.

         State          Event          Action
        ---------------------------------------
          OFF            OFF            OFF
          OFF            ON             REPLACE
          ON             OFF            NEGLECT
          ON             ON             COMPARE
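
The table above can be written down as a small transition function. This is a sketch of the state machine only: the function names are hypothetical, the action names are taken from the table, and what each action does to the accumulated statvfs is not modelled here.

```python
def qdstatfs_transition(state_on, event_on):
    """Return (new_state_on, action) for one brick's reply.

    state_on: whether quota-deem-statfs has been seen ON so far.
    event_on: whether this brick sent the quota-deem-statfs flag ON.
    """
    if not state_on:
        if event_on:
            # First ON reply: switch state to ON.
            return True, "REPLACE"
        return False, "OFF"
    # Once ON, the state never goes back to OFF.
    return True, ("COMPARE" if event_on else "NEGLECT")


def run_events(events):
    """Feed a sequence of per-brick events through the state machine."""
    state = False
    actions = []
    for event_on in events:
        state, action = qdstatfs_transition(state, event_on)
        actions.append(action)
    return state, actions
```

Starting from OFF, a single ON reply is enough to move to state ON, and every later reply is either compared or neglected.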

Change-Id: I0e8fb7d3945a3ca3dde0bb99de6cd397e27a3162
BUG: 1048786
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/6652
Reviewed-by: Krishnan Parthasarathi 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra G 
Tested-by: Raghavendra G

cluster/dht: Do layout self healing of directory for nameless lookup

2014-06-17T12:39:22+00:00

Problem: Currently, in the nameless lookup code path, even if the
         lookup detects layout anomalies at the end, the layout
         is not healed, because there is no code to heal it.
         So there can be a race between mkdir and lookup.

         Assume a mkdir is in progress from some other mount point,
         say M1. The directory has been created on some nodes but
         the layout is not set yet.

         Now a nameless lookup arrives from M2. The lookup succeeds
         because the directory is present on some of the nodes, but
         it won't heal the layout. If a create follows the lookup
         fop, file creation fails because the layout is absent.

Fix:     Included the layout self-heal code in the nameless lookup
         path. At the end of the lookup, the layout is computed as
         it would have been in a named lookup, but it is set only
         on those nodes where the directory is present.
         If a create fop follows, the probability of finding the
         subvolume with the proper hash range is now high, which
         reduces the race window.

Other:  Whenever a directory is created, we have to choose a brick
        from which we start allocating the layout in a circular fashion.
        To calculate this starting brick, I have changed the input
        from the name of the directory to the gfid of the directory.

        But to compute where a given file belongs, we will still
        use the name of the file. The hash computed from the file
        name should fall within one of the directory's hash ranges.

        Calculation of the hash for a file acts as a consumer and
        setting the directory layout based on the gfid acts as a
        producer; the two are independent of each other.
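
The producer/consumer split described above can be sketched as follows. This is an illustration under assumptions: `zlib.crc32` stands in for DHT's own hash function, ranges are equal-sized over an assumed 32-bit hash space, and the names `directory_layout` and `subvol_for_file` are hypothetical.

```python
import uuid
import zlib

HASH_SPACE = 2**32  # assumed 32-bit hash space


def _hash32(data):
    # Stand-in hash; the real DHT hash function is different.
    return zlib.crc32(data) & 0xFFFFFFFF


def directory_layout(gfid, subvols):
    """Producer: pick the starting subvolume from the directory's gfid and
    hand out equal hash ranges in circular order from there."""
    n = len(subvols)
    start_idx = _hash32(gfid.bytes) % n
    chunk = HASH_SPACE // n
    layout = []
    for i in range(n):
        name = subvols[(start_idx + i) % n]
        lo = i * chunk
        # The last range absorbs any remainder so the space is fully covered.
        hi = HASH_SPACE - 1 if i == n - 1 else lo + chunk - 1
        layout.append((name, lo, hi))
    return layout


def subvol_for_file(layout, filename):
    """Consumer: hash the file *name* and find the range that contains it."""
    h = _hash32(filename.encode("utf-8"))
    for name, lo, hi in layout:
        if lo <= h <= hi:
            return name
    raise LookupError("hash not covered by layout")
```

Because the file lookup only needs the finished ranges, it does not matter whether the starting brick was derived from the directory name or from its gfid.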

Change-Id: I3808c55082cd1b5c72d2c77cbbc063f55aa38bee
BUG: 1095888
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on: http://review.gluster.org/7493
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat 
Reviewed-by: Vijay Bellur