<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/xlators/features/shard, branch v8dev</title>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>libglusterfs: cleanup iovec functions</title>
<updated>2019-06-11T13:28:07+00:00</updated>
<author>
<name>Xavi Hernandez</name>
<email>xhernandez@redhat.com</email>
</author>
<published>2019-05-31T16:40:30+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=952cf7e4f4393fcd9cf8c16b013d8f28915c990e'/>
<id>952cf7e4f4393fcd9cf8c16b013d8f28915c990e</id>
<content type='text'>
This patch cleans up some iovec code and creates two additional helper
functions to simplify management of iovec structures.

  iov_range_copy(struct iovec *dst, uint32_t dst_count, uint32_t dst_offset,
                 struct iovec *src, uint32_t src_count, uint32_t src_offset,
                 uint32_t size);

    This function copies up to 'size' bytes from 'src' at offset
    'src_offset' to 'dst' at 'dst_offset'. It returns the number of
    bytes copied.

  iov_skip(struct iovec *iovec, uint32_t count, uint32_t size);

    This function removes the initial 'size' bytes from 'iovec' and
    returns the updated number of iovec vectors remaining.

The signature of iov_subset() has also been modified to make it safer
and easier to use. The new signature is:

  iov_subset(struct iovec *src, int src_count, uint32_t start, uint32_t size,
             struct iovec **dst, int32_t dst_count);

    This function creates a new iovec array containing the subset of the
    'src' vector starting at 'start' with size 'size'. The resulting
    array is allocated if '*dst' is NULL, or copied to '*dst' if it fits
    (based on 'dst_count'). It returns the number of iovec vectors used.

A new set of functions to iterate through an iovec array has been
created. They can be used to simplify the implementation of other
iovec-based helper functions.
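
As an illustration, a hedged sketch of how a caller could combine the
two new helpers (buffers and sizes here are hypothetical; only the
helper semantics above are from this patch):

  char hdr[16], data[4096], copy[4096];
  struct iovec in[2] = { { hdr, sizeof(hdr) }, { data, sizeof(data) } };
  struct iovec out[1] = { { copy, sizeof(copy) } };
  uint32_t in_count = 2, copied;

  /* Drop the 16-byte header from the input vector, in place. */
  in_count = iov_skip(in, in_count, sizeof(hdr));

  /* Copy the remaining payload into the output vector; the return
   * value is the number of bytes actually copied. */
  copied = iov_range_copy(out, 1, 0, in, in_count, 0, sizeof(data));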

Change-Id: Ia5fe57e388e23392a8d6cdab17670e337cadd587
Updates: bz#1193929
Signed-off-by: Xavi Hernandez &lt;xhernandez@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Fix extra unref when inode object is lru'd out and added back</title>
<updated>2019-06-09T17:28:07+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2019-04-05T06:59:23+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=dc119e9c2f3db6d029ab1c5a81c171180db58192'/>
<id>dc119e9c2f3db6d029ab1c5a81c171180db58192</id>
<content type='text'>
Long tale of double unref! But do read...

In cases where a shard inode is evicted from the lru list while still
being part of the fsync list, and is added back shortly before its
unlink, there could be an extra inode_unref() of the base inode,
leading to its premature destruction and a crash.

One such specific case is the following -

Consider features.shard-deletion-rate = features.shard-lru-limit = 2.
This is an oversimplified example but explains the problem clearly.

First, a file is FALLOCATE'd to a size so that number of shards under
/.shard = 3 &gt; lru-limit.
Shards 1, 2 and 3 need to be resolved. 1 and 2 are resolved first.
Resultant lru list:
                               1 -----&gt; 2
refs on base inode -          (1)  +   (1) = 2
3 needs to be resolved. So 1 is lru'd out. Resultant lru list -
                               2 -----&gt; 3
refs on base inode -          (1)  +   (1) = 2

Note that 1 is inode_unlink()d but not destroyed because there are
non-zero refs on it since it is still participating in this ongoing
FALLOCATE operation.

FALLOCATE is sent on all participant shards. In the cbk, all of them
are added to fsync list.
Resulting fsync list -
                               1 -----&gt; 2 -----&gt; 3 (order doesn't matter)
refs on base inode -          (1)  +   (1)  +   (1) = 3
Total refs = 3 + 2 = 5

Now an attempt is made to unlink this file. Background deletion is triggered.
The first $shard-deletion-rate shards need to be unlinked in the first batch.
So shards 1 and 2 need to be resolved. inode_resolve fails on 1 but succeeds
on 2 and so it's moved to tail of list.
lru list now -
                              3 -----&gt; 2
No change in refs.

shard 1 is looked up. In lookup_cbk, it's linked and added back to lru list
at the cost of evicting shard 3.
lru list now -
                              2 -----&gt; 1
refs on base inode:          (1)  +   (1) = 2
fsync list now -
                              1 -----&gt; 2 (again order doesn't matter)
refs on base inode -         (1)  +   (1) = 2
Total refs = 2 + 2 = 4
After eviction, it is found that 3 needs fsync. So fsync is wound on
it, yet to be ack'd. So it is still inode_link()d.

Now deletion of shards 1 and 2 completes. The lru list is empty. The
base inode is unref'd and destroyed.
In the next batched deletion, 3 needs to be deleted. It is
inode_resolve()able. It is added back to the lru list, but the base
inode passed to __shard_update_shards_inode_list() is NULL since that
inode is destroyed. Yet its ctx-&gt;base_inode still contains the base
inode ptr from its first addition to the lru list, although no
additional ref was taken on it.
lru list now -
                              3
refs on base inode -         (0)
Total refs on base inode = 0
Unlink is sent on 3 and completes. Now, since the ctx contains a ptr
to the base_inode and the shard is part of the lru list, the base
inode is unref'd, leading to a crash.

FIX:
When a shard is re-added to the lru list, copy the base inode pointer
as-is into its inode ctx, even if it is NULL. This is needed to
prevent double unrefs at the time of deleting it.
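
A minimal sketch of that idea (surrounding code is illustrative; only
the ctx-&gt;base_inode field is from the message above):

  /* Mirror the caller's pointer unconditionally, NULL included, so
   * the unref done at unlink time is matched one-to-one by a prior
   * ref. */
  ctx-&gt;base_inode = base_inode;   /* may legitimately be NULL */
  if (base_inode)
          inode_ref(base_inode);  /* ref only what will be unref'd */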

Change-Id: I99a44039da2e10a1aad183e84f644d63ca552462
Updates: bz#1696136
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>across: clang-scan: fix NULL dereferencing warnings</title>
<updated>2019-06-04T10:30:29+00:00</updated>
<author>
<name>Amar Tumballi</name>
<email>amarts@redhat.com</email>
</author>
<published>2019-05-20T05:41:39+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=e7aeab3063ac5645136303278b477d7de35266c0'/>
<id>e7aeab3063ac5645136303278b477d7de35266c0</id>
<content type='text'>
All these checks are done after analyzing the clang-scan report
produced by the CI job @ https://build.gluster.org/job/clang-scan

updates: bz#1622665
Change-Id: I590305af4ceb779be952974b2a36066ffc4865ca
Signed-off-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Fix block-count accounting upon truncate to lower size</title>
<updated>2019-06-04T07:30:12+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2019-05-08T07:30:51+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=400b66d568ad18fefcb59949d1f8368d487b9a80'/>
<id>400b66d568ad18fefcb59949d1f8368d487b9a80</id>
<content type='text'>
The way delta_blocks is computed in shard is incorrect when a file
is truncated to a lower size: the accounting only considers the
change in size of the last of the truncated shards.

FIX:

Get the block-count of each shard in xdata just before its unlink at
posix. The summation of these, plus the change in size of the last
shard (from the actual truncate), is used to compute delta_blocks,
which is then used in the xattrop for the size update.
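
A hedged sketch of the resulting arithmetic (variable names are
illustrative only):

  /* Blocks freed by the shards that are unlinked outright; each
   * shard's block count is returned in xdata by posix just before
   * its unlink. */
  int i;
  int64_t delta_blocks = 0;
  for (i = 0; i &lt; unlinked_shard_count; i++)
          delta_blocks -= unlinked_shard_blocks[i];

  /* Plus the change in size of the last shard, from the actual
   * truncate wound on it. */
  delta_blocks += post_blocks_last_shard - pre_blocks_last_shard;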

Change-Id: I9128a192e9bf8c3c3a959e96b7400879d03d7c53
fixes: bz#1705884
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Fix crash during background shard deletion in a specific case</title>
<updated>2019-05-16T11:58:54+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2019-04-05T05:00:23+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=7ead11711181104b23c004e0cd5e9e3c3c981a4f'/>
<id>7ead11711181104b23c004e0cd5e9e3c3c981a4f</id>
<content type='text'>
Consider the following case -
1. A file gets FALLOCATE'd such that &gt; "shard-lru-limit" number of
   shards are created.
2. And then it is deleted.

The unique thing about FALLOCATE is that unlike WRITE, all of the
participant shards are resolved and created and fallocated in a single
batch. This means, in this case, after the first "shard-lru-limit"
number of shards are resolved and added to lru list, as part of
resolution of the remaining shards, some of the existing shards in lru
list will need to be evicted. So these evicted shards will be
inode_unlink()d as part of eviction. Now once the fop gets to the actual
FALLOCATE stage, the lru'd-out shards get added to fsync list.

2 things to note at this point:
i. the lru'd out shards are only part of the fsync list, so each holds
   1 ref on the base inode
ii. and the more recently used shards are part of both fsync and lru
    lists. So each of these shards holds 2 refs on the base inode - one
    for being part of fsync list, and the other for being part of lru
    list.

FALLOCATE completes successfully and then this very file is deleted,
and background shard deletion is launched. Here's where the ref counts
get mismatched. First, the inode_resolve()s during the deletion return
NULL for the lru'd-out inodes, because they have been inode_unlink()'d
by now. So these inodes need to be freshly looked up. But as part of
linking them in lookup_cbk (precisely in shard_link_block_inode()),
inode_link() returns the lru'd-out inode object. And its inode ctx is
still valid, with ctx-&gt;base_inode intact from the last time it was
added to the list.

But shard_common_lookup_shards_cbk() passes NULL in the place of the
base-inode pointer to __shard_update_shards_inode_list(). This means
that, as part of adding the lru'd-out inode back to the lru list, the
base inode is not ref'd since it's NULL.

Whereas post unlinking this shard, during shard_unlink_block_inode(),
ctx-&gt;base_inode is accessible and is unref'd because the shard was
found to be part of the lru list, although the matching ref never
occurred. This at some point leads to the base_inode refcount becoming
0, and it gets destroyed and released back while some of its
associated shards are still being unlinked in parallel; the client
then crashes whenever the freed inode is accessed next.

The fix is to pass the base inode correctly, if available, in
shard_link_block_inode().
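
A hedged sketch of the invariant being restored (variable names are
illustrative):

  /* Whatever base-inode pointer is recorded in the shard's ctx when
   * it (re)joins the lru list must carry exactly one matching ref,
   * since shard_unlink_block_inode() unrefs ctx-&gt;base_inode for
   * shards found on the lru list. */
  if (base_inode)
          ctx-&gt;base_inode = inode_ref(base_inode);
  else
          ctx-&gt;base_inode = NULL;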

Also, the patch fixes the ret value check in tests/bugs/shard/shard-fallocate.c

Change-Id: Ibd0bc4c6952367608e10701473cbad3947d7559f
Updates: bz#1696136
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Fix integer overflow in block count accounting</title>
<updated>2019-05-06T10:49:04+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2019-05-03T05:20:40+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=e18e98659dd2b41eb59cf593fd625f1821a20abf'/>
<id>e18e98659dd2b41eb59cf593fd625f1821a20abf</id>
<content type='text'>
... by holding delta_blocks in 64-bit int as opposed to 32-bit int.
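
An illustration of why 32 bits can overflow here (numbers
hypothetical):

  /* Truncating a 2TB sharded file to 0 must account for
   * 2TB / 512 bytes-per-block = 4294967296 blocks, which exceeds
   * INT32_MAX (2147483647) and wraps a 32-bit counter. */
  int64_t delta_blocks = 0;   /* previously int32_t */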

Change-Id: I2c1ddab17457f45e27428575ad16fa678fd6c0eb
updates: bz#1705884
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>shard: fix crash caused by using null inode</title>
<updated>2019-03-14T18:07:02+00:00</updated>
<author>
<name>Kinglong Mee</name>
<email>kinglongmee@gmail.com</email>
</author>
<published>2019-03-13T13:13:38+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=3f3da526333b91c787b2388319cb16297b4d8cc3'/>
<id>3f3da526333b91c787b2388319cb16297b4d8cc3</id>
<content type='text'>
Change-Id: I156bf962223304e586b83a36be59a0ca74589b43
Updates: bz#1688287
Signed-off-by: Kinglong Mee &lt;mijinlong@open-fs.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Ref shard inode while adding to fsync list</title>
<updated>2019-01-24T18:14:57+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2019-01-24T08:44:39+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=72922c1fd69191b220f79905a23395c3a87f86ce'/>
<id>72922c1fd69191b220f79905a23395c3a87f86ce</id>
<content type='text'>
PROBLEM:

Many of the earlier changes to the management of shards in the lru
and fsync lists assumed that if a given shard exists in the fsync
list, it must be part of the lru list as well. This was found not to
be true.

Consider this - a file is FALLOCATE'd to a size which would make the
number of participant shards greater than the lru list size. In this
case, some of the resolved shards that are to participate in this fop
will be evicted from the lru list to give way to the rest of the
shards. And once FALLOCATE completes, these shards are added to the
fsync list, but without a ref. After the fop completes, these shard
inodes are unref'd and destroyed while their inode ctxs are still part
of the fsync list. Now when FSYNC is called on the base file and the
fsync list is traversed, the client crashes due to illegal memory
access.

FIX:

Hold a ref on the shard inode when adding it to the fsync list as
well, and unref it under the following conditions (a sketch follows
the list):
1. when the shard is evicted from the lru list
2. when the base file is fsync'd
3. when the shards are deleted.
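
A sketch of the new ref discipline (list and field names are
illustrative, not necessarily the exact shard xlator symbols):

  /* Adding the shard to the fsync list now takes its own ref... */
  inode_ref(shard_inode);
  list_add_tail(&amp;ctx-&gt;to_fsync_list, &amp;base_ictx-&gt;to_fsync_list);

  /* ...so each of the three unref points above releases exactly the
   * ref that the corresponding list membership took. */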

Change-Id: Iab460667d091b8388322f59b6cb27ce69299b1b2
fixes: bz#1669077
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Fix launch of multiple synctasks for background deletion</title>
<updated>2019-01-11T08:35:55+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2018-12-28T13:23:15+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=c0c2022e7d7097e96270a74f37813eda0c4e6339'/>
<id>c0c2022e7d7097e96270a74f37813eda0c4e6339</id>
<content type='text'>
PROBLEM:

When multiple sharded files are deleted in quick succession, multiple
issues were observed:
1. misleading logs for a sharded file: one log message said the shards
   corresponding to the file were deleted successfully, yet it was
   followed by multiple logs suggesting the very same operation failed.
   This was because multiple synctasks attempted to clean up shards of
   the same file and only one of them succeeded (the one that got
   ENTRYLK successfully), with the rest of them logging failure.

2. multiple synctasks to do background deletion would be launched, one
   for each deleted file, but all of them would readdir entries from
   .remove_me at the same time and could potentially contend for
   ENTRYLK on .shard for each of the entry names. This is undesirable
   and wasteful.

FIX:
Background deletion will now follow a state machine. In the event that
there are multiple attempts to launch a synctask for background
deletion, one for each file deleted, only the first task is launched.
And if, while this task is doing the cleanup, more attempts are made
to delete other files, the state of the synctask is adjusted so that
it restarts the crawl even after reaching end-of-directory, to pick up
any files it may have missed in the previous iteration.
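
A hedged sketch of such a state machine (enum and type names are
illustrative):

  typedef enum {
          SHARD_BG_DELETION_NONE = 0,    /* no cleanup task running    */
          SHARD_BG_DELETION_LAUNCHING,   /* first task being spawned   */
          SHARD_BG_DELETION_IN_PROGRESS, /* crawl of .remove_me active */
  } shard_bg_deletion_state_t;

  /* Deletions that arrive while a crawl is IN_PROGRESS only flag the
   * running task to restart its crawl instead of launching another
   * synctask. */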

This patch also fixes an uninitialized lk-owner during
syncop_entrylk(), which was letting multiple background-deletion
synctasks enter the critical section at the same time. This led to
illegal memory access of the base inode in the second synctask after
the first synctask had destroyed it post shard deletion.
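
A minimal sketch of the lk-owner initialization (assuming the standard
libglusterfs helper; the frame variable is illustrative):

  /* Give each synctask frame a distinct, initialized lk-owner so that
   * ENTRYLK requests from different tasks cannot alias one another. */
  set_lk_owner_from_ptr(&amp;frame-&gt;root-&gt;lk_owner, frame-&gt;root);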

Change-Id: Ib33773d27fb4be463c7a8a5a6a4b63689705324e
updates: bz#1662368
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Assign fop id during background deletion to prevent excessive logging</title>
<updated>2019-01-08T12:08:48+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2018-12-28T01:57:11+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=aa28fe32364e39981981d18c784e7f396d56153f'/>
<id>aa28fe32364e39981981d18c784e7f396d56153f</id>
<content type='text'>
... of the kind

"[2018-12-26 05:22:44.195019] E [MSGID: 133010]
[shard.c:2253:shard_common_lookup_shards_cbk] 0-volume1-shard: Lookup
on shard 785 failed. Base file gfid = cd938e64-bf06-476f-a5d4-d580a0d37416
[No such file or directory]"

shard_common_lookup_shards_cbk() has a specific check to ignore ENOENT
errors without logging them during specific fops. But because
background deletion is done in a new frame (with local-&gt;fop being
GF_FOP_NULL), the ENOENT check is skipped and the absence of shards
gets logged every time.

To fix this, local-&gt;fop is initialized to GF_FOP_UNLINK during background deletion.
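
A minimal sketch of the fix (surrounding context is illustrative):

  /* Tag the synthetic frame used for background deletion so that the
   * ENOENT check in shard_common_lookup_shards_cbk() treats missing
   * shards as it would for a regular unlink. */
  local-&gt;fop = GF_FOP_UNLINK;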

Change-Id: I0ca8d3b3bfbcd354b4a555eee520eb0479bcda35
updates: bz#1662368
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
</feed>
