glusterfs.git/xlators, branch v5.3

fuse: add --lru-limit option

2019-01-16T15:59:03+00:00

The inode LRU mechanism is moot in fuse xlator (ie. there is no
limit for the LRU list), as fuse inodes are referenced from
kernel context, and thus they can only be dropped on request of
the kernel. This might results in a high number of passive
inodes which are useless for the glusterfs client, causing a
significant memory overhead.

This change tries to remedy this by extending the LRU semantics
and allowing to set a finite limit on the fuse inode LRU.

A brief history of problem:

When gluster's inode table was designed, fuse didn't have any
'invalidate' method, which means, userspace application could
never ask kernel to send a 'forget()' fop, instead had to wait
for kernel to send it based on kernel's parameters. Inode table
remembers the number of times kernel has cached the inode based
on the 'nlookup' parameter. And 'nlookup' field is not used by
no other entry points (like server-protocol, gfapi etc).

Hence the inode_table of fuse module always has to have lru-limit
as '0', which means no limit. GlusterFS always had to keep all
inodes in memory as kernel would have had a reference to it.
Again, the reason for this is, kernel's glusterfs inode reference
was pointer of 'inode_t' structure in glusterfs. As it is a
pointer, we could never free it (to prevent segfault, or memory
corruption).

Solution:

In the inode table, handle the prune case of inodes with 'nlookup'
differently, and call a 'invalidator' method, which in this case is
fuse_invalidate(), and it sends the request to kernel for getting
the forget request.

When the kernel sends the forget, it means, it has dropped all
the reference to the inode, and it will send the forget with the
'nlookup' parameter too. We just need to make sure to reduce the
'nlookup' value we have when we get forget. That automatically
cause the relevant prune to happen.

Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B

fixes: bz#1623107
Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2
Signed-off-by: Amar Tumballi

features/shard: Fix launch of multiple synctasks for background deletion

2019-01-15T04:57:11+00:00

PROBLEM:

When multiple sharded files are deleted in quick succession, multiple
issues were observed:
1. misleading logs corresponding to a sharded file where while one log
   message said the shards corresponding to the file were deleted
   successfully, this was followed by multiple logs suggesting the very
   same operation failed. This was because of multiple synctasks
   attempting to clean up shards of the same file and only one of them
   succeeding (the one that gets ENTRYLK successfully), and the rest of
   them logging failure.

2. multiple synctasks to do background deletion would be launched, one
   for each deleted file but all of them could readdir entries from
   .remove_me at the same time could potentially contend for ENTRYLK on
   .shard for each of the entry names. This is undesirable and wasteful.

FIX:
Background deletion will now follow a state machine. In the event that
there are multiple attempts to launch synctask for background deletion,
one for each file deleted, only the first task is launched. And if while
this task is doing the cleanup, more attempts are made to delete other
files, the state of the synctask is adjusted so that it restarts the
crawl even after reaching end-of-directory to pick up any files it may
have missed in the previous iteration.

This patch also fixes uninitialized lk-owner during syncop_entrylk()
which was leading to multiple background deletion synctasks entering
the critical section at the same time and leading to illegal memory access
of base inode in the second syntcask after it was destroyed post shard deletion
by the first synctask.

Change-Id: Ib33773d27fb4be463c7a8a5a6a4b63689705324e
updates: bz#1665803
Signed-off-by: Krutika Dhananjay 
(cherry picked from commit c0c2022e7d7097e96270a74f37813eda0c4e6339)

features/shard: Assign fop id during background deletion to prevent excessive logging

2019-01-14T06:41:20+00:00

... of the kind

"[2018-12-26 05:22:44.195019] E [MSGID: 133010]
[shard.c:2253:shard_common_lookup_shards_cbk] 0-volume1-shard: Lookup
on shard 785 failed. Base file gfid = cd938e64-bf06-476f-a5d4-d580a0d37416
[No such file or directory]"

shard_common_lookup_shards_cbk() has a specific check to ignore ENOENT error without
logging them during specific fops. But because background deletion is done in a new
frame (with local->fop being GF_FOP_NULL), the ENOENT check is skipped and the
absence of shards gets logged everytime.

To fix this, local->fop is initialized to GF_FOP_UNLINK during background deletion.

Change-Id: I0ca8d3b3bfbcd354b4a555eee520eb0479bcda35
updates: bz#1665803
Signed-off-by: Krutika Dhananjay 
(cherry picked from commit aa28fe32364e39981981d18c784e7f396d56153f)

leases: Reset lease_ctx->timer post deletion

2019-01-09T15:34:21+00:00

To avoid use_after_free, reset lease_ctx->timer back to NULL
after the structure has been freed.

Change-Id: Icd213ec809b8af934afdb519c335a4680a1d6cdc
updates: bz#1651323
Signed-off-by: Soumya Koduri 
(cherry picked from commit a9b0003c717087ff168bc143c70559162e53e0d5)

core: Fixed typos in nl-cache and logging-guidelines.md

2019-01-09T15:17:18+00:00

Replaced "recieve" with "receive".

Change-Id: I58a3d3d4a0093df4743de9fae4d8ff152d4b216c
fixes: bz#1662200
Signed-off-by: N Balachandran 
(cherry picked from commit a11c5c66321dd8411373a68cc163c981c7d083df)

io-cache: xdata needs to be passed for readv operations

2018-12-30T13:32:00+00:00

io-cache xlator has been skipping xdata references when the
date needs to be read into page cache. This patch fixes the same.

Note: similar changes may be needed for other fops as well
which are handled by io-cache.

Change-Id: I28d73d4ba471d13eb55d0fd0b5197d222df77a2a
updates: bz#1651323
Signed-off-by: Soumya Koduri 
(cherry picked from commit b3d88a0904131f6851f4185e43f815ecc3353ab5)

cluster/dht: sync brick root perms on add brick

2018-12-26T16:40:30+00:00

If a single brick is added to the volume and the
newly added brick is the first to respond to a
dht_revalidate call, its stbuf will not be merged
into local->stbuf as the brick does not yet have
a layout. The is_permission_different check therefore
fails to detect that an attr heal is required as it
only considers the stbuf values from existing bricks.
To fix this, merge all stbuf values into local->stbuf
and use local->prebuf to store the correct directory
attributes.

Change-Id: Ic9e8b04a1ab9ed1248b6b056e3450bbafe32e1bc
fixes: bz#1660736
Signed-off-by: N Balachandran

performance/rda: Fixed dict_t memory leak

2018-12-26T16:38:37+00:00

Removed all references to dict_t xdata_from_req which is
allocated but not used anywhere. It is also not cleaned up
and hence causes a memory leak.

fixes: bz#1659676

Change-Id: I2edb857696191e872ad12a12efc36999626bacc7
Signed-off-by: N Balachandran

shard: prevent segfault in shard_unlink_block_inode()

2018-12-20T14:19:55+00:00

gluster-blockd sometimes segfaults with the following backtrace: 

    Core was generated by `/usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x00007fbb9cd639b9 in shard_unlink_block_inode (local=local@entry=0x7fbb80000a78, shard_block_num=) at shard.c:2929
    2929	            base_ictx->fsync_count--;
    (gdb) bt
    #0  0x00007fbb9cd639b9 in shard_unlink_block_inode (local=local@entry=0x7fbb80000a78, shard_block_num=) at shard.c:2929
    #1  0x00007fbb9cd64311 in shard_unlink_shards_do_cbk (frame=frame@entry=0x7fbb9010a768, cookie=, this=, op_ret=, op_errno=, preparent=preparent@entry=0x7fbb7470dcf8, 
        postparent=postparent@entry=0x7fbb7470dd90, xdata=xdata@entry=0x0) at shard.c:2987

A fix for this has already been provided through a Converity report.

Backport of:

> Change-Id: Ic5d302a5e32d375acf8adc412763ab94e6dabc3d
> Signed-off-by: Sunny Kumar 
> (cherry picked from commit 145e180517054626d07892219fdee689b703c218)

Change-Id: I699a039e9c5115eb3376190dd8014427d12a293b
Updates: bz#1659563
Signed-off-by: Niels de Vos

afr: thin-arbiter 2 domain locking and in-memory state

2018-12-12T14:26:42+00:00

2 domain locking + xattrop for write-txn failures:
--------------------------------------------------
- A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and
AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases
AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad.

- All further write txn failures are handled based on this in-memory
value without querying the TA.

- When shd heals the files, it does so by requesting full lock on
AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall),
releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory
notion of which brick is bad. The next write txn failure is wound on TA
to again update the in-memory state.

- Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release
request is got is completed before the lock is released.

- Any write txns got after the release request are maintained in a ta_waitq.

- After the release is complete, the ta_waitq elements are spliced to a
separate queue which is then processed one by one.

- For fops that come in parallel when the in-memory bad brick is still
unknown, only one is wound to TA on wire. The other ones are maintained
in a ta_onwireq which is then processed after we get the response from
TA.

Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
updates: bz#1648205
Signed-off-by: Ravishankar N 
Signed-off-by: Ashish Pandey