glusterfs.git/xlators/cluster/ec/src/ec-generic.c, branch v3.7.14

cluster/ec: Do not ref dictionary in lookup

2016-04-09T18:51:08+00:00

Problem:
1) dict_for_each loops over the elements without any locks, so the members of
   the dictionary can be ref/unrefed while dict_for_each is executed by another
   thread leading to crashes.

Consider a tiered volume with distributed ec as the cold tier and distributed
replicate as the hot tier. Tier sends a lookup which fails on ec. (By this
time the dict already contains ec xattrs.) After this, the lookup_everywhere
code path is hit in tier, which triggers a lookup on each distribute's hashed
subvolume. That fails too, which leads to lookup_everywhere on the cold and
hot dhts in two parallel epoll threads. When ec tries to set
trusted.ec.version/dirty/size as keys in the dictionary, the older values
against the same keys get erased. If, while this erasing is going on, the
thread doing the lookup on afr's subvolume accesses these keys, either in
dict_copy_with_ref or in the client xlator trying to serialize, that can lead
to a crash or a hang, depending on whether the spin/mutex lock is called on
invalid memory.

2) EC deletes GF_CONTENT_KEY from the dictionary, which may lead to extra reads
   in case of lookup-everywhere for tiered volumes.

Fix:
Do dict_copy_with_ref() for the lookup-dictionary.
This avoids the 1st problem rather than actually fixing it.
The 2nd problem is fixed.
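
A minimal sketch of the copy-before-modify idea behind the fix. It does not
use the real dict_t or dict_copy_with_ref(); struct toy_dict and every toy_*
function are hypothetical stand-ins, and the string "GF_CONTENT_KEY" is only a
placeholder for the real key. The point it illustrates: once ec edits a
private copy, the caller's dictionary is neither overwritten nor stripped of
the content key.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

/* Hypothetical stand-in for a request dictionary; not the real dict_t. */
struct toy_dict {
    const char *keys[TOY_MAX];
    const char *vals[TOY_MAX];
    int count;
};

/* Return the slot holding key, or -1 if it is absent. */
static int toy_find(const struct toy_dict *d, const char *key)
{
    for (int i = 0; i < d->count; i++) {
        if (strcmp(d->keys[i], key) == 0)
            return i;
    }
    return -1;
}

/* Set or overwrite a key. Returns 0 on success, -1 if the table is full. */
static int toy_set(struct toy_dict *d, const char *key, const char *val)
{
    int i = toy_find(d, key);
    if (i < 0) {
        if (d->count == TOY_MAX)
            return -1;
        i = d->count++;
        d->keys[i] = key;
    }
    d->vals[i] = val;
    return 0;
}

/* Return the value stored for key, or NULL if it is absent. */
static const char *toy_get(const struct toy_dict *d, const char *key)
{
    int i = toy_find(d, key);
    return i < 0 ? NULL : d->vals[i];
}

/* Delete a key if present by moving the last entry into its slot. */
static void toy_del(struct toy_dict *d, const char *key)
{
    int i = toy_find(d, key);
    if (i < 0)
        return;
    d->count--;
    d->keys[i] = d->keys[d->count];
    d->vals[i] = d->vals[d->count];
}

/* Build the dictionary sent to the bricks from a private copy of the
 * caller's one. All edits land on the copy, never on the caller's dict. */
static struct toy_dict toy_lookup_xdata(const struct toy_dict *caller)
{
    struct toy_dict priv = *caller;
    toy_set(&priv, "trusted.ec.version", "");
    toy_set(&priv, "trusted.ec.dirty", "");
    toy_set(&priv, "trusted.ec.size", "");
    toy_del(&priv, "GF_CONTENT_KEY");
    return priv;
}
```

With this shape, a second thread walking the caller's dictionary never sees
entries being replaced underneath it, which is why the change sidesteps the
race without adding locking to dict_for_each.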

 >Change-Id: I5427aa14c48cb7572977d4de9a28c5ffff2b4b95
 >BUG: 1315560
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/13680
 >Smoke: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >CentOS-regression: Gluster Build System 
 >Reviewed-by: Xavier Hernandez 
 >(cherry picked from commit 64cba025b13aad7fb3020a04930cfa22fbfcb859)

Change-Id: I2828a0d9e730bc4b0ea6cee037365131767ae43e
BUG: 1322520
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/13859
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Krutika Dhananjay 
Smoke: Gluster Build System

cluster/ec: Allow read fops to be processed in parallel

2015-11-25T03:03:17+00:00

Currently ec only sends a single read request at a time for a given
inode. Since reads do not interfere with each other, this patch allows
multiple read requests to be sent in parallel.
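
As a rough illustration only, the admission rule can be pictured as a
reader/writer check per inode. struct toy_inode_state and both functions are
hypothetical, and the assumption that modifying fops stay exclusive is mine;
the commit message only states that reads may run in parallel.

```c
#include <stdbool.h>

/* Hypothetical per-inode bookkeeping; the real ec lock structures differ. */
struct toy_inode_state {
    int readers; /* read fops currently in flight */
    int writers; /* modifying fops currently in flight */
};

/* A read may start alongside other reads. A modifying fop needs the
 * inode to itself (assumption, see the lead-in). */
static bool toy_can_dispatch(const struct toy_inode_state *st, bool is_read)
{
    if (is_read)
        return st->writers == 0;
    return st->readers == 0 && st->writers == 0;
}

/* Account for a fop that has just been sent to the bricks. */
static void toy_dispatch(struct toy_inode_state *st, bool is_read)
{
    if (is_read)
        st->readers++;
    else
        st->writers++;
}
```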

This is a backport of these patches:

> Change-Id: If853430482a71767823f39ea70ff89797019d46b
> BUG: 1245689
> Signed-off-by: Xavier Hernandez 
> Reviewed-on: http://review.gluster.org/11742
> Tested-by: NetBSD Build System 
> Reviewed-by: Pranith Kumar Karampuri 
> Tested-by: Gluster Build System 
>
> Change-Id: I6042129f09082497b80782b5704a52c35c78f44d
> BUG: 1276031
> Signed-off-by: Xavier Hernandez 

Change-Id: I1b1146d1fd1828b12bfc566cd76e5ea110f8909b
BUG: 1251467
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/12447
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Fix tracking of good bricks

2015-08-14T09:02:21+00:00

The bitmask of good and bad bricks was kept in the context of the
corresponding inode or fd. This was problematic when an external
process (another client or the self-heal process) healed the
bricks, because nothing updated the bitmasks held by other clients.

This patch removes the bitmask stored in the context and calculates
which bricks are healthy after locking them and doing the initial
xattrop. After that, it's updated using the result of each fop.
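
A small sketch of the two steps described above, under stated assumptions:
struct toy_answer and both functions are hypothetical, and treating "highest
version among successful answers" as the definition of a healthy brick is an
illustrative simplification, not necessarily the rule ec applies.

```c
#include <stdint.h>

/* Hypothetical per-brick answer to the initial xattrop. */
struct toy_answer {
    int op_ret;       /* 0 on success, -1 on failure */
    uint64_t version; /* version reported by the brick */
};

/* Derive the good-brick mask from fresh answers instead of a cached
 * value. Bit i is set when brick i answered and reported the highest
 * version seen. nodes must not exceed 64. */
static uint64_t toy_good_mask(const struct toy_answer *ans, int nodes)
{
    uint64_t best = 0;
    uint64_t mask = 0;

    for (int i = 0; i < nodes; i++) {
        if (ans[i].op_ret == 0 && ans[i].version > best)
            best = ans[i].version;
    }
    for (int i = 0; i < nodes; i++) {
        if (ans[i].op_ret == 0 && ans[i].version == best)
            mask |= UINT64_C(1) << i;
    }
    return mask;
}

/* After each fop, keep only the bricks on which it succeeded. */
static uint64_t toy_good_after_fop(uint64_t good, uint64_t succeeded)
{
    return good & succeeded;
}
```

Because the mask is rebuilt under the lock every time, a heal done by another
client is picked up on the next lock cycle rather than being hidden by a
stale cached value.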

> Change-Id: I225e31cd219a12af4ca58871d8a4bb6f742b223c
> BUG: 1236065
> Signed-off-by: Xavier Hernandez 
> Reviewed-on: http://review.gluster.org/11844
> Tested-by: NetBSD Build System 
> Tested-by: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 

Change-Id: Idbe68b28b865c4b28366703ad1e96ae16ba44b66
BUG: 1235964
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/11867
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Minimize usage of EIO error

2015-08-08T15:36:57+00:00

>Change-Id: I82e245615419c2006a2d1b5e94ff0908d2f5e891
>BUG: 1245276
>Signed-off-by: Xavier Hernandez 
>Reviewed-on: http://review.gluster.org/11741
>Tested-by: Gluster Build System 
>Reviewed-by: Pranith Kumar Karampuri 
>Tested-by: NetBSD Build System 

Change-Id: Ifd3d63f88a686a2963c5ba2e62110249f84f338d
BUG: 1250864
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/11852
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System

cluster/ec: Do not handle GF_CONTENT_KEY

2015-07-21T11:53:20+00:00

GF_CONTENT_KEY aggregation requires that the fragments on the bricks belong to
the same data, i.e. no operations are modifying the content while lookup is
performed on it. The only way to know this is to get at least ec->fragments+1
responses and see that two different sets of ec->fragments fragments give the
same data. At the moment we feel that this slows down ec-lookup, so handling
of this key is removed for now.

 >Change-Id: I2da5087f1311d5cdde999062607b143b48c17713
 >BUG: 1226279
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/11003
 >Reviewed-by: Xavier Hernandez 
 >Tested-by: Gluster Build System 
 >Tested-by: NetBSD Build System 

BUG: 1243642
Change-Id: I490e33a7cec64ce4c2670c6f17c93e5ce9576b14
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/11678
Tested-by: Gluster Build System 
Reviewed-by: Xavier Hernandez

quota: Fix statfs values in EC when quota_deem_statfs is enabled

2015-06-27T10:09:24+00:00

This is a backport of http://review.gluster.org/#/c/11315/

> When quota_deem_statfs is enabled, quota sends aggregated statfs values.
> In EC we should not multiply statfs values by the number of fragments.
>
> Change-Id: I7ef8ea1598d84b86ba5c5941a2bbe0a6ab43c101
> BUG: 1233162
> Signed-off-by: vmallika 
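
The rule quoted above can be sketched as follows. struct toy_statfs, the
function name, and the deem_statfs flag are hypothetical; how the real code
detects that quota already aggregated the values is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the statfs fields that describe capacity. */
struct toy_statfs {
    uint64_t f_blocks;
    uint64_t f_bfree;
    uint64_t f_bavail;
};

/* Each brick stores one fragment of every file, so brick-level values
 * are scaled by the number of data fragments. When quota has already
 * supplied aggregated values, they are passed through unchanged. */
static void toy_statfs_scale(struct toy_statfs *st, uint32_t fragments,
                             bool deem_statfs)
{
    if (deem_statfs)
        return;
    st->f_blocks *= fragments;
    st->f_bfree *= fragments;
    st->f_bavail *= fragments;
}
```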

Change-Id: Iacc96b1ad42babd4de630f6cdc0092e8e9ac7f3b
BUG: 1236260
Signed-off-by: vmallika 
Reviewed-on: http://review.gluster.org/11434
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

ec: Porting messages to new logging framework

2015-06-27T09:19:38+00:00

This is a backport of http://review.gluster.org/#/c/10465/

cherry-picked from commit b0b9eaea9dbb4e9a535f5e969defc4556a9e2204
>Change-Id: Ia05ae750a245a37d48978e5f37b52f4fb0507a8c
>BUG: 1194640
>Signed-off-by: Nandaja Varma 

Change-Id: Ia05ae750a245a37d48978e5f37b52f4fb0507a8c
BUG: 1217722
Signed-off-by: Nandaja Varma 
Reviewed-on: http://review.gluster.org/11429
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Ignore differences in non locked inodes

2015-05-30T12:35:44+00:00

        Backport of http://review.gluster.org/10974

When ec combines iatt structures from multiple bricks, it checks
for equality in important fields. This is ok for iatt related to
inodes involved in the operation that have been locked before
starting execution. However some fops return iatt information
from other inodes. For example a rename locks source and destination
parent directories, but it also returns an iatt from the entry
itself.

In these cases we ignore differences in some fields to avoid false
detection of inconsistencies, which would trigger unnecessary self-heals.
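
A hedged sketch of such a comparison. struct toy_iatt is a hypothetical
subset of iatt, and the choice of which fields are always compared (type,
uid, gid) versus only compared for locked inodes (nlink, size) is an
assumption for illustration; the commit message does not list the fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the iatt fields. */
struct toy_iatt {
    uint32_t type;
    uint32_t uid;
    uint32_t gid;
    uint32_t nlink;
    uint64_t size;
};

/* Compare two answers for the same inode. Fields that can legitimately
 * change while the inode is not locked are only checked when locked. */
static bool toy_iatt_same(const struct toy_iatt *a, const struct toy_iatt *b,
                          bool locked)
{
    if (a->type != b->type || a->uid != b->uid || a->gid != b->gid)
        return false;
    if (!locked)
        return true;
    return a->nlink == b->nlink && a->size == b->size;
}
```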

This patch also solves another issue, which caused the real
size of the file stored in the inode context to be lost during
self-heal.

BUG: 1225796
Change-Id: I29f328a7b4895368ded859f3bae0359436c3588f
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/10983
Tested-by: Gluster Build System

cluster/ec: Fix all EIO errors in EC

2015-05-28T11:12:06+00:00

        Backport of http://review.gluster.org/10770
        Backport of http://review.gluster.org/10806
        Backport of http://review.gluster.org/10787
        Backport of http://review.gluster.org/10868
        Backport of http://review.gluster.com/10852

 - When a blocking lock is requested, the lock request succeeds even when only
ec->fragments locks are acquired in the non-blocking locking phase. This leads
to the fop succeeding only on the bricks where the locks were acquired, which
makes self-heals necessary. To prevent these unnecessary self-heals, if the
remaining locks fail with EAGAIN in the non-blocking lock phase, try the
blocking locking phase instead.

 - Handle lookup failures while op in progress

 - cluster/ec: Correctly cleanup delayed locks
When a delayed lock is pending, a graph switch doesn't correctly
terminate it. This means that the update of version and size xattrs
is lost, causing EIO errors. This patch handles GF_EVENT_PARENT_DOWN
event to correctly finish pending updates before completing the
graph switch.

 - Fix use after free crash
ec_heal creates ec_fop_data but doesn't run ec_manager. ec_fop_data_allocate
adds this fop to ec->pending_fops. Because ec_manager is not run on this heal
fop, it is never removed from ec->pending_fops. When it is accessed after free
it leads to a crash. It is better not to add HEAL fops to ec->pending_fops
because we don't want a graph switch to hang the mount because of a BIG
file/directory heal.

 - Forced unlock when lock contention is detected
EC uses an eager lock mechanism to optimize multiple read/write
requests on the same entry or inode. This increases performance
but can have adverse results when other clients try to access the
same entry/inode. To solve this, this patch adds functionality
to detect when this happens and force an earlier release so as not
to block other clients.

The method consists of requesting GF_GLUSTERFS_INODELK_COUNT and
GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this
count is greater than one, the lock is marked to be released. All
fops already waiting for this lock will be executed normally before
releasing the lock, but new requests that also require it will be
blocked and restarted after the lock has been released and
reacquired.
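
The contention check can be sketched like this. struct toy_lock and the two
functions are hypothetical; only the rule "a reported count greater than one
marks the lock for release" comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical eager-lock state kept by one client. */
struct toy_lock {
    bool acquired; /* this client currently holds the lock */
    bool release;  /* contention was seen; release as soon as possible */
};

/* Feed in the lock count a brick returned with a fop answer. A count
 * above one means someone else is waiting on the same lock. */
static void toy_lock_note_count(struct toy_lock *lk, int32_t count)
{
    if (count > 1)
        lk->release = true;
}

/* New fops may reuse the eager lock only while it is held and not
 * marked for release. Otherwise they wait for the next acquisition. */
static bool toy_lock_can_reuse(const struct toy_lock *lk)
{
    return lk->acquired && !lk->release;
}
```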

Another problem was that some operations did correctly lock the
parent of an entry when needed, but got the size and version xattrs
from the entry instead of the parent.

This patch solves this problem by binding all queries of size and
version to each lock and replacing all entrylk calls by inodelk ones
to remove concurrent updates on directory metadata.  This also allows
rename to correctly update source and destination directories.
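
The fallback from the non-blocking to the blocking lock phase, described in
the first item of this commit, can be sketched as a decision taken once all
non-blocking answers are in. The enum, the function, and the per-brick result
array are hypothetical; the handling of errors other than EAGAIN is an
assumption.

```c
#include <errno.h>

enum toy_lock_action {
    TOY_LOCK_DONE,           /* enough locks, no contention seen */
    TOY_LOCK_RETRY_BLOCKING, /* some brick said EAGAIN: use blocking locks */
    TOY_LOCK_FAIL            /* too few locks and no point in retrying */
};

/* results[i] is 0 when brick i granted the non-blocking lock, otherwise
 * an errno value. fragments is the minimum number of bricks needed. */
static enum toy_lock_action toy_after_nonblocking(const int *results,
                                                  int nodes, int fragments)
{
    int locked = 0;
    int eagain = 0;

    for (int i = 0; i < nodes; i++) {
        if (results[i] == 0)
            locked++;
        else if (results[i] == EAGAIN)
            eagain++;
    }
    /* Before the fix, reaching 'fragments' locks was enough even when
     * other bricks had answered EAGAIN. */
    if (eagain > 0)
        return TOY_LOCK_RETRY_BLOCKING;
    if (locked >= fragments)
        return TOY_LOCK_DONE;
    return TOY_LOCK_FAIL;
}
```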

BUG: 1225279
Change-Id: I02a6084b138dd38e018a462347cd9ce38610c7ef
Reviewed-on: http://review.gluster.org/10926
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Change meaning of trusted.ec.dirty

2015-05-09T03:12:31+00:00

- With this change, the xattr represents whether the file needs to be healed
  or not. It will have different values for data/entry and metadata changes.
- inode ref leaks and dict_set_dynstr related leaks fixed
- Added support for trylock/lock based on heal-cmd execution or not
  in data heal.
- Made fixes to pass regression runs

Change-Id: I9d8def4c2badde18a76b7898816fecfac113737a
BUG: 1216303
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/10385
Reviewed-on: http://review.gluster.org/10693
Tested-by: NetBSD Build System
Tested-by: Gluster Build System