glusterfs.git/xlators/cluster/ec/src/ec-helpers.h, branch v3.7.14

cluster/ec: Automate heal for replace brick

2016-02-10T08:31:12+00:00

Problem:
After a replace brick command, newly added
brick does not contain data which existed
on old brick.

Solution:
Do getxattr after initialization of all the
bricks. This will trigger heal for brick root
as soon as it finds the version mismatch on
newly added brick.

Removing tests from ec-new-entry.t which were
required to simulate automation of heal after
replace brick.

master -
http://review.gluster.org/#/c/13353/

Change-Id: I08e3dfa565374097f6c08856325ea77727437e11
BUG: 1305755
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/13353
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/13403
Reviewed-by: Xavier Hernandez 
Tested-by: Xavier Hernandez

cluster/ec: Fix all EIO errors in EC

2015-05-28T11:12:06+00:00

        Backport of http://review.gluster.org/10770
        Backport of http://review.gluster.org/10806
        Backport of http://review.gluster.org/10787
        Backport of http://review.gluster.org/10868
        Backport of http://review.gluster.com/10852

 - When a blocking lock is requested, lock request is succeeded even when
ec->fragment number of locks are acquired successfully in non-blocking locking
phase. This will lead to fop succeeding only on the bricks where the locks are
acquired, leading to the necessity of self-heals. To prevent these un-necessary
self-heals, if the remaining locks fail with EAGAIN in non-blocking lock phase
try blocking locking phase instead.

 -  Handle lookup failures while op in progress

 - cluster/ec: Correctly cleanup delayed locks
When a delayed lock is pending, a graph switch doesn't correctly
terminate it. This means that the update of version and size xattrs
is lost, causing EIO errors. This patch handles GF_EVENT_PARENT_DOWN
event to correctly finish pending udpdates before completing the
graph switch.

 - Fix use after free crash
ec_heal creates ec_fop_data but doesn't run ec_manager. ec_fop_data_allocate
adds this fop to ec->pending_fops, because ec_manager is not run on this heal
fop it is never removed from ec->pending_fops. When it is accessed after free
it leads to crash. It is better to not to add HEAL fops to ec->pending_fops
because we don't want graph switch to hang the mount because of a BIG
file/directory heal.

- Forced unlock when lock contention is detected
EC uses an eager lock mechanism to optimize multiple read/write
requests on the same entry or inode. This increases performance
but can have adverse results when other clients try to access the
same entry/inode. To solve this, this patch adds a functionality
to detect when this happens and force an earlier release to not
block other clients.

The method consists on requesting GF_GLUSTERFS_INODELK_COUNT and
GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this
count is greater than one, the lock is marked to be released. All
fops already waiting for this lock will be executed normally before
releasing the lock, but new requests that also require it will be
blocked and restarted after the lock has been released and reacquired
again.

Another problem was that some operations did correctly lock the
parent of an entry when needed, but got the size and version xattrs
from the entry instead of the parent.

This patch solves this problem by binding all queries of size and
version to each lock and replacing all entrylk calls by inodelk ones
to remove concurrent updates on directory metadata.  This also allows
rename to correctly update source and destination directories.

BUG: 1225279
Change-Id: I02a6084b138dd38e018a462347cd9ce38610c7ef
Reviewed-on: http://review.gluster.org/10926
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: add separate versions for data/entry, metadata

2015-05-08T17:29:38+00:00

Adding 64 bits in "version" key of extended attributes. First 64 bits (Left)
represents Data version. Last 64 bits (right) represents Meta Data version.

Note: 3.7 and 3.6 version ec can't co-exist with this change because xattrop in
3.6 will fail with ERANGE as the buffer passed to it will be '8' bytes where as
the value will be 16 bytes in 3.7. Where as 3.7 version clients can work with
old version files. For upgrades we need to tell users to complete heals and
then upgrade

BUG: 1215265
Change-Id: Ib85114680cb7e75b8371c984d9f7b6401c1ffb93
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/10312
Reviewed-on: http://review.gluster.org/10626
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/dht: Change the subvolume encoding in d_off to be a "global"

2015-03-18T11:47:41+00:00

position in the graph rather than relative (local) to a particular
translator.

Encoding the volume in this way allows a single translator to manage
which brick is currently being scanned for directory entries. Using a
single translator minimizes allocated bits in the d_off. It also allows
multiple DHT translators in the same graph to have a common frame of
reference (the graph position) for which brick is being read. Multiple
DHT translators are needed for the Tiering feature.

The fix builds off a previous change (9332) which removed subvolume
encoding from AFR. The fix makes an equivalent change to the EC
translator.

More background can be found in fix 9332 and gluster-dev discussions [1].

DHT and AFR/EC are responsibile (as before) for choosing which brick to
enumerate directory entries in over the readdir lifecycle.

The client translator receiving the readdir fop encodes the dht_t. It
is referred to as the "leaf node" in the graph and corresponds to the
brick being scanned.

When DHT decodes the d_off, it translates the leaf node to a local
subvolume, which represents the next node in the graph leading to
the brick.

Tracking of leaf nodes is done in common utility functions. Leaf nodes
counts and positional information are updated on a graph switch.

[1] www.gluster.org/pipermail/gluster-devel/2015-January/043592.html

Change-Id: Iaf0ea86d7046b1ceadbad69d88707b243077ebc8
BUG: 1190734
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/9688
Reviewed-by: Xavier Hernandez 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur

ec: Remove unneeded 'inline' for ec_is_internal_xattr()

2015-01-19T06:01:17+00:00

An incorrectly placed 'inline' keyword caused compilation warnings
with gcc 5.

Change-Id: I2bf8c39b1514ea7dac13e82eb3b6ff4b98e62c79
BUG: 1182267
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9452
Tested-by: Gluster Build System 
Reviewed-by: Kaleb KEITHLEY 
Reviewed-by: Vijay Bellur

cluster/ec: Handle internal xattr get/set

2015-01-09T05:55:37+00:00

Problem:
Internal xattrs of EC like trusted.ec.size/config/version
can be modified by users and that can lead to misbehavior
in EC.

Fix:
Don't let the user modify the xattrs. Hide these xattrs
in getfattr outputs.

Change-Id: I39cec96ae12826b506b496fda7da74201015fd75
BUG: 1178688
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9385
Tested-by: Gluster Build System 
Reviewed-by: Emmanuel Dreyfus 
Tested-by: Emmanuel Dreyfus 
Reviewed-by: Xavier Hernandez

ec: Fix self-healing issues.

2014-12-04T19:34:38+00:00

Three problems have been detected:

1. Self healing is executed in background, allowing the fop that
   detected the problem to continue without blocks nor delays.

   While this is quite interesting to avoid unnecessary delays,
   it can cause spurious failures of self-heal because it may
   try to recover a file inside a directory that a previous
   self-heal has not recovered yet, causing the file self-heal
   to fail.

2. When a partial self-heal is being executed on a directory,
   if a full self-heal is attempted, it won't be executed
   because another self-heal is already in process, so the
   directory won't be fully repaired.

3. Information contained in loc's of some fop's is not enough
   to do a complete self-heal.

To solve these problems, I've made some changes:

* Improved ec_loc_from_loc() to add all available information
  to a loc.

* Before healing an entry, it's parent is checked and partially
  healed if necessary to avoid failures.

* All heal requests received for the same inode while another
  self-heal is being processed are queued. When the first heal
  completes, all pending requests are answered using the results
  of the first heal (without full execution), unless the first
  heal was a partial heal. In this case all partial heals are
  answered, and the first full heal is processed normally.

* An special virtual xattr (not physically stored on bricks)
  named 'trusted.ec.heal' has been created to allow synchronous
  self-heal of files.

  Now, the recommended way to heal an entire volume is this:

    find  -d -exec getfattr -h -n trusted.ec.heal {} \;

Some minor changes:

* ec_loc_prepare() has been renamed to ec_loc_update().

* All loc management functions return 0 on success and -1 on
  error.

* Do not delay fop unlocks if heal is needed.

* Added basic ec xattrs initially on create, mkdir and mknod
  fops.

* Some coding style changes

Change-Id: I2a5fd9c57349a153710880d6ac4b1fa0c1475985
BUG: 1161588
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9072
Tested-by: Gluster Build System 
Reviewed-by: Dan Lambright

ec: Change license

2014-12-03T18:45:26+00:00

Change-Id: Iae90ade2421898417b53dec0417a610cf306c44b
BUG: 1168167
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9201
Reviewed-by: Kaleb KEITHLEY 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

ec: Fix self-heal issues

2014-10-21T18:38:45+00:00

Problem: Doing an 'ls' of a directory that has been modified while one
         of the bricks was down, sometimes returns the old directory
         contents.

Cause: Directories are not marked when they are modified as files are.
       The ec xlator balances requests amongst available and healthy
       bricks. Since there is no way to detect that a directory is
       out of date in one of the bricks, it is used from time to time
       to return the directory contents.

Solution: Basically the solution consists in use versioning information
          also for directories, however some additional changes have
          been necessary.

Changes:

 * Use directory versioning:

     This required to lock full directory instead of a single entry for
     all requests that add or remove entries from it. This is needed to
     allow atomic version update. This affects the following fops:

         create, mkdir, mknod, link, symlink, rename, unlink, rmdir

     Another side effect is that opendir requires to do a previous
     lookup to get versioning information and discard out of date
     bricks for subsequent readdir(p) calls.

 * Restrict directory self-heal:

     Till now, when one discrepancy was found in lookup, a self-heal
     was automatically started. This caused the versioning information
     of a bad directory to be healed instantly, making the original
     problem to reapear again.

     To solve this, when a missing directory is detected in one or more
     bricks on lookup or opendir fops, only a partial self-heal is
     performed on it. A partial self-heal basically creates the
     directory but does not restore any additional information.

     This avoids that an 'ls' could repair the directory and cause the
     problem to happen again. With this change, output of 'ls' is
     always consistent. However, since the directory has been created
     in the brick, this allows any other operation on it (create new
     files, for example) to succeed on all bricks and not add additional
     work to the self-heal process.

     To force a self-heal of a directory, any other operation must be
     done on it. For example a getxattr.

     With these changes, the correct healing procedure that would avoid
     inconsistent directory browsing consists on a post-order traversal
     of directoriesi being healed. This way, the directory contents will
     be healed before healing the directory itslef.

 * Additional changes to fix self-heal errors

     - Don't use fop->fd to decide between fd/loc.

         open, opendir and create have an fd, but the correct data is in
         loc.

     - Fix incorrect management of bad bricks per inode/fd.

     - Fix incorrect selection of fop's target bricks when there are bad
       bricks involved.

     - Improved ec_loc_parent() to always return a parent loc as
       complete as possible.

Change-Id: Iaf3df174d7857da57d4a87b4a8740a7048b366ad
BUG: 1149726
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/8916
Reviewed-by: Dan Lambright 
Tested-by: Gluster Build System

ec: Add config information in an xattr

2014-09-23T16:12:32+00:00

To simplify backward compatibility of the ec xlator when some
parameter or the implementation itself is changed, a new xattr
is added to each file with the configuration needed to recover
it.

The new attribute is called 'trusted.ec.config', and it's a 64-bit
value containing the following information:

    8 bits: version of the config information (currently always 0)
    8 bits: algorithm used to encode the file (currently always 0)
    8 bits: size of the galois field (currently always 8)
    8 bits: number of bricks
    8 bits: redundancy
   24 bits: chunk size (currently 512)

This new xattr could allow, in a future version, to have different
configurations per file.

Change-Id: I8c12d40ff546cc201fc66caa367484be3d48aeb4
BUG: 1140861
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/8770
Tested-by: Gluster Build System 
Reviewed-by: Dan Lambright 
Reviewed-by: Vijay Bellur