glusterfs.git/xlators/cluster/ec/src/ec-helpers.h, branch v4.1dev

cluster/ec: Change [f]getxattr to parallel-dispatch-one

2017-12-22T09:25:29+00:00

At the moment in EC, [f]getxattr operations wait to acquire a lock
while other operations are in progress, even when they come from the same
mount that already holds a lock on the file/directory. This happens because
[f]getxattr operations follow the model where the operation is wound on 'k'
of the bricks and the answers are matched to make sure the data returned is
the same on all of them. This consistency check requires that no other
operations are ongoing while [f]getxattr operations are wound to the bricks.
We can perform [f]getxattr in another way as well: find the good_mask from
the lock that is already granted, wind the operation on any one of the good
bricks, and unwind the answer to the parent xlator after adjusting
size/blocks. Since we take good_mask into account, the reply we get will
reflect the state either before or after a possible ongoing operation. With
this method, the operation doesn't depend on the completion of ongoing
operations, which could take a long time (for example, with slow disks while
writes are in progress). Thus we reduce the time to serve [f]getxattr
requests.

I changed [f]getxattr to dispatch-one and added extra logic in
ec_link_has_lock_conflict() to not have any conflicts for fops with
EC_MINIMUM_ONE as fop->minimum to achieve the effect described above.
Modified scripts to make sure READ fop is received in EC to trigger heals.
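
The conflict rule described above can be sketched roughly as follows. This
is not the code of ec_link_has_lock_conflict(); the sketch_* names, the enum
values and the reduced fop structure are hypothetical, and every pair of
fops that does not involve a dispatch-one fop is simply treated as
conflicting, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fop "minimum" policy. The names and
 * numeric values are illustrative, not taken from the GlusterFS sources. */
enum sketch_minimum {
    SKETCH_MINIMUM_ONE, /* an answer from a single good brick is enough */
    SKETCH_MINIMUM_MIN, /* needs 'k' matching answers */
    SKETCH_MINIMUM_ALL  /* needs answers from all bricks */
};

/* Reduced model of a fop: only the field relevant to the conflict check. */
struct sketch_fop {
    enum sketch_minimum minimum;
};

/* Returns true when the two fops must be serialized on the same lock. */
static bool
sketch_lock_conflict(const struct sketch_fop *a, const struct sketch_fop *b)
{
    assert(a != NULL && b != NULL);

    /* A dispatch-one fop is wound to one good brick only, so it neither
     * waits for nor blocks any other fop that uses the lock. */
    if (a->minimum == SKETCH_MINIMUM_ONE || b->minimum == SKETCH_MINIMUM_ONE)
        return false;

    /* Simplification: everything else is serialized. */
    return true;
}
```

With this rule, a [f]getxattr modelled as SKETCH_MINIMUM_ONE never conflicts
with an in-progress write, while two writes still do.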

Updates gluster/glusterfs#368
Change-Id: I3b4ebf89181c336b7b8d5471b0454f016cdaf296
Signed-off-by: Pranith Kumar K

cluster/ec: Implement DISCARD FOP for EC

2017-10-25T11:52:41+00:00

Updates #254

This code change implements DISCARD FOP support for
EC.

BUG: 1461018
Change-Id: I09a9cb2aa9d91ec27add4f422dc9074af5b8b2db
Signed-off-by: Sunil Kumar Acharya

cluster/ec: add functions for stripe alignment

2017-10-13T08:17:27+00:00

This patch removes old functions to align offsets and sizes
to stripe size boundaries and adds new ones to offer more
possibilities.

The new functions are:

 * ec_adjust_offset_down()
     Aligns a given offset to a multiple of the stripe size
     equal or smaller than the initial one. It returns the
     size of the gap between the aligned offset and the given
     one.

 * ec_adjust_offset_up()
     Aligns a given offset to a multiple of the stripe size
     equal or greater than the initial one. It returns the
     size of the skipped region between the given offset and
     the aligned one. If an overflow happens, the returned
     value has negative sign (but correct value) and the
     offset is set to the maximum value (not aligned).

 * ec_adjust_size_down()
     Aligns the given size to a multiple of the stripe size
     equal or smaller than the initial one. It returns the
     size of the missed region between the aligned size and
     the given one.

 * ec_adjust_size_up()
     Aligns the given size to a multiple of the stripe size
     equal or greater than the initial one. It returns the
     size of the gap between the given size and the aligned
     one. If an overflow happens, the returned value has
     negative sign (but correct value) and the size is set
     to the maximum value (not aligned).

These functions have been defined in ec-helpers.h as static
inline since they are very small and compilers can optimize
them (especially the 'scale' argument).
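
As an illustration of the documented behaviour, here is a minimal sketch of
one "down" and one "up" helper. It is not the code from ec-helpers.h: the
sketch_* names and signatures are hypothetical, the stripe size is passed
directly, and the 'scale' argument is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rounds *offset down to a multiple of 'stripe' and
 * returns the size of the gap between the aligned and the given offset. */
static inline uint64_t
sketch_adjust_offset_down(uint64_t stripe, uint64_t *offset)
{
    assert(stripe > 0);

    uint64_t head = *offset % stripe;

    *offset -= head;

    return head;
}

/* Hypothetical helper: rounds *size up to a multiple of 'stripe' and
 * returns the size of the gap between the given and the aligned size.
 * On overflow, *size is set to the maximum value (not aligned) and the
 * gap is returned with negative sign. */
static inline int64_t
sketch_adjust_size_up(uint64_t stripe, uint64_t *size)
{
    assert(stripe > 0);

    uint64_t rem = *size % stripe;
    uint64_t tail = (rem == 0) ? 0 : stripe - rem;

    if (tail > UINT64_MAX - *size) {
        *size = UINT64_MAX;
        return -(int64_t)tail;
    }

    *size += tail;

    return (int64_t)tail;
}
```

For a stripe size of 512, an offset of 1000 is moved down to 512 with a gap
of 488, and a size of 1000 is moved up to 1024 with a gap of 24.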

Change-Id: I4c91009ad02f76c73772034dfde27ee1c78a80d7
Signed-off-by: Xavier Hernandez

cluster/ec: Don't trigger data/metadata heal on Lookups

2017-02-27T03:06:55+00:00

Problem-1:
If a Lookup, which doesn't take any locks, observes a version mismatch, the
result can't be trusted. If we launch a heal based on this information, it
leads to self-heals that affect I/O performance in the cases where the
Lookup is wrong. Since the self-heal daemon and client operations on the
inode that take locks can still trigger a heal, we can choose not to attempt
a heal on Lookup.

Problem-2:
Fixed spurious failure of
tests/bitrot/bug-1373520.t
For the issues above, what was happening was that ec_heal_inspect()
was preventing 'name' heal from happening.

Problem-3:
tests/basic/ec/ec-background-heals.t
To be honest, I don't know what the problem was. While fixing
the 2 problems above, I made some changes to ec_heal_inspect() and
ec_need_heal(), after which I could not recreate the spurious
failure even after a long time.

BUG: 1414287
Signed-off-by: Pranith Kumar K 
Change-Id: Ife2535e1d0b267712973673f6d474e288f3c6834
Reviewed-on: https://review.gluster.org/16468
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Xavier Hernandez 
CentOS-regression: Gluster Build System 
Reviewed-by: Ashish Pandey

cluster/ec: fix selinux issues with mmap()

2017-02-02T12:02:28+00:00

EC uses mmap() to create a memory area for the dynamic code. Since
the code is created on the fly and executed when needed, this region
of memory needs to have write and execution privileges.

This combination is not allowed by default by selinux. To solve the
problem a file is used as a backend storage for the dynamic code and
it's mapped into two distinct memory regions, one with write access
and the other one with execution access. This approach is the
recommended way to create dynamic code by a program in a more secure
way, and selinux allows it.

Additionally selinux requires that the backend file be stored in a
directory marked with type bin_t to be able to map it in an executable
area. To satisfy this condition, GLUSTERFS_LIBEXECDIR has been used.

This fix also changes the error check for mmap(), that was done
incorrectly (it checked against NULL instead of MAP_FAILED), and it
also correctly propagates the error codes and makes sure they aren't
silently ignored.
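
The corrected error check can be sketched as follows. This is only an
illustration of the MAP_FAILED check and of returning the error code to the
caller; it uses an anonymous read/write mapping instead of the two
file-backed mappings described above, and sketch_map_rw() is a hypothetical
name, not a function from the patch.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: maps 'size' bytes of private read/write memory.
 * Returns 0 and stores the address in *out on success, or a negative
 * errno value on failure (in which case *out is left untouched). */
static int
sketch_map_rw(size_t size, void **out)
{
    assert(out != NULL);

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* mmap() reports failure with MAP_FAILED, not with NULL, so a check
     * against NULL would miss every error. */
    if (ptr == MAP_FAILED)
        return -errno;

    *out = ptr;

    return 0;
}
```

A caller would test the return value and pass a negative result further up
instead of continuing with an invalid pointer.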

Change-Id: I71c2f88be4e4d795b6cfff96ab3799c362c54291
BUG: 1402661
Signed-off-by: Xavier Hernandez 
Reviewed-on: https://review.gluster.org/16405
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/ec: Add support for hardware acceleration

2016-09-08T17:08:25+00:00

This patch implements functionality for fast encoding/decoding
using hardware support. Currently, optimized code for x86_64, SSE
and AVX is added.

Additionally this patch implements a caching mechanism for inverse
matrices to reduce computation time, as well as a new method for
computing the inverse that takes quadratic time instead of cubic.

Finally some unnecessary memory copies have been eliminated to
further increase performance.

Change-Id: I26c75f26fb4201bd22b51335448ea4357235065a
BUG: 1289922
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/12837
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order

2016-08-22T09:38:36+00:00

When the bricks are brought offline and then online in cyclic
order while writes are in progress on a file, thanks to inode
refresh in write txns, AFR will mostly fail the write attempt
when the only good copy is offline. However, there is still a
remote possibility that the file will run into split-brain if
the brick that has the lone good copy goes offline *after* the
inode refresh but *before* the write txn completes (I call it
in-flight split-brain in the patch for ease of reference),
requiring intervention from admin to resolve the split-brain
before the IO can resume normally on the file. To get around this,
the patch does the following things:
i) retains the dirty xattrs on the file
ii) avoids marking the last of the good copies as bad (or accused)
    in case it is the one to go down during the course of a write.
iii) fails that particular write with the appropriate errno.

This way, we still have one good copy left despite the split-brain situation;
when it is back online, it will be chosen as the source for the heal.

Change-Id: I9ca634b026ac830b172bac076437cc3bf1ae7d8a
BUG: 1363721
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/15080
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Oleksandr Natalenko 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Automate heal for replace brick

2016-02-09T07:20:43+00:00

Problem:
After a replace brick command, the newly added
brick does not contain the data which existed
on the old brick.

Solution:
Do a getxattr after initialization of all the
bricks. This will trigger a heal for the brick root
as soon as it finds the version mismatch on
the newly added brick.

Removing tests from ec-new-entry.t which were
required to simulate automation of heal after
replace brick.

Change-Id: I08e3dfa565374097f6c08856325ea77727437e11
BUG: 1304686
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/13353
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/ec: Forced unlock when lock contention is detected

2015-05-27T10:25:47+00:00

EC uses an eager lock mechanism to optimize multiple read/write
requests on the same entry or inode. This increases performance
but can have adverse results when other clients try to access the
same entry/inode.

To solve this, this patch adds functionality to detect when this
happens and force an earlier release so that other clients are not
blocked.

The method consists of requesting GF_GLUSTERFS_INODELK_COUNT and
GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this
count is greater than one, the lock is marked to be released. All
fops already waiting for this lock will be executed normally before
releasing the lock, but new requests that also require it will be
blocked and restarted after the lock has been released and
reacquired.
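
A minimal sketch of the detection rule, assuming a reduced lock structure
with a single flag. The sketch_* names are hypothetical and none of the
queueing or restart logic of the real patch is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of an eager lock. */
struct sketch_lock {
    bool release; /* set once contention has been detected */
};

/* Called with the lock count returned for a fop that took the lock.
 * A count greater than one means somebody else wants the lock. */
static void
sketch_note_lock_count(struct sketch_lock *lock, uint32_t count)
{
    assert(lock != NULL);

    if (count > 1)
        lock->release = true;
}

/* A new request may reuse the eager lock only while no release is
 * pending; otherwise it has to wait until the lock is reacquired. */
static bool
sketch_can_join(const struct sketch_lock *lock)
{
    assert(lock != NULL);

    return !lock->release;
}
```

Once the flag is set it stays set until the lock is released, so a later
answer with a count of one does not cancel the pending release.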

Another problem was that some operations did correctly lock the
parent of an entry when needed, but got the size and version xattrs
from the entry instead of the parent.

This patch solves this problem by binding all queries of size and
version to each lock and replacing all entrylk calls by inodelk ones
to remove concurrent updates on directory metadata.  This also allows
rename to correctly update source and destination directories.

Change-Id: I2df0b22bc6f407d49f3cbf0733b0720015bacfbd
BUG: 1165041
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/10852
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: add separate versions for data/entry, metadata

2015-05-06T21:02:53+00:00

Add 64 bits to the "version" key of the extended attributes. The first 64
bits (left) represent the data version. The last 64 bits (right) represent
the metadata version.

Note: 3.7 and 3.6 versions of EC can't co-exist with this change because
xattrop in 3.6 will fail with ERANGE, as the buffer passed to it will be 8
bytes whereas the value will be 16 bytes in 3.7. However, 3.7 clients can
work with old version files. For upgrades we need to tell users to complete
heals and then upgrade.
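
The 16-byte layout can be sketched as below. The big-endian byte order, the
handling of a legacy 8-byte value (metadata version read as 0) and all
sketch_* names are assumptions made for illustration; the commit message
only states that the data version comes first and the metadata version
second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: each 64-bit counter is stored in big-endian byte order. */
static uint64_t
sketch_load_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];

    return v;
}

static void
sketch_store_be64(unsigned char *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Packs the data version (first 8 bytes) and the metadata version
 * (last 8 bytes) into a 16-byte value. */
static void
sketch_pack_versions(unsigned char out[16], uint64_t data, uint64_t meta)
{
    assert(out != NULL);

    sketch_store_be64(out, data);
    sketch_store_be64(out + 8, meta);
}

/* Reads a version value of 'len' bytes. Returns 0 on success or -1 for an
 * unsupported length. Assumption: an old 8-byte value carries only the
 * data version, so the metadata version is reported as 0. */
static int
sketch_unpack_versions(const unsigned char *buf, size_t len,
                       uint64_t *data, uint64_t *meta)
{
    assert(buf != NULL && data != NULL && meta != NULL);

    if (len != 8 && len != 16)
        return -1;

    *data = sketch_load_be64(buf);
    *meta = (len == 16) ? sketch_load_be64(buf + 8) : 0;

    return 0;
}
```

A reader that only provides an 8-byte buffer has no room for the second
counter, which matches the ERANGE failure described for 3.6.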

BUG: 1215265
Change-Id: Ib85114680cb7e75b8371c984d9f7b6401c1ffb93
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/10312
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System
Reviewed-by: Pranith Kumar Karampuri