glusterfs.git/xlators/performance/write-behind, branch v3.8.8

performance/write-behind: fix flush stuck by former failed writes

2016-11-03T09:28:13+00:00

the issue is happened in this case:
assume a file is opened with fd1 and fd2.
1. some WRITE opto fd1 got error, they were add back to 'todo' queue
   because of those error.
2. fd2 closed, a FLUSH op is send to write-behind.
3. FLUSH can not be unwind because it's not a legal waiter for those
   failed write(as func __wb_request_waiting_on() say). and those failed
   WRITE also can not be ended if fd1 is not closed. fd2 stuck in close
   syscall.

to resolve this issue, we can change the way we determine 2 requests is
'conflict': flush/fsync is not conflict with those write that is not
belonged to them. so __wb_pick_winds() can wind the FLUSH op.

below is some information when the stuck issue happen:
glusterdump logs:
[xlator.performance.write-behind.wb_inode]
path=/ltp-F9eG0ZSOME/rw-buffered-16436
inode=0x7fdbe8039b9c
window_conf=1048576
window_current=249856
transit-size=0
dontsync=0

[.WRITE]
request-ptr=0x7fdbe8020200
refcount=1
wound=no
generation-number=4
req->op_ret=-1
req->op_errno=116
sync-attempts=3
sync-in-progress=no
size=131072
offset=1220608
lied=-1
append=0
fulfilled=0
go=0

[.WRITE]
request-ptr=0x7fdbe8068c30
refcount=1
wound=no
generation-number=5
req->op_ret=-1
req->op_errno=116
sync-attempts=2
sync-in-progress=no
size=118784
offset=1351680
lied=-1
append=0
fulfilled=0
go=0

[.FLUSH]
request-ptr=0x7fdbe8021cd0
refcount=1
wound=no
generation-number=6
req->op_ret=0
req->op_errno=0
sync-attempts=0

gdb detail about above 3 requests:
(gdb) print *((wb_request_t *)0x7fdbe8021cd0)
$2 = {all = {next = 0x7fdbe803a608, prev = 0x7fdbe8068c30}, todo = {next
= 0x7fdbe803a618, prev = 0x7fdbe8068c40}, lie = {next = 0x7fdbe8021cf0,
    prev = 0x7fdbe8021cf0}, winds = {next = 0x7fdbe8021d00, prev =
0x7fdbe8021d00}, unwinds = {next = 0x7fdbe8021d10, prev =
0x7fdbe8021d10}, wip = {
    next = 0x7fdbe8021d20, prev = 0x7fdbe8021d20}, stub =
0x7fdbe80224dc, write_size = 0, orig_size = 0, total_size = 0, op_ret =
0, op_errno = 0,
  refcount = 1, wb_inode = 0x7fdbe803a5f0, fop = GF_FOP_FLUSH, lk_owner
= {len = 8, data = "W\322T\f\271\367y$", '\000' },
  iobref = 0x0, gen = 6, fd = 0x7fdbe800f0dc, wind_count = 0, ordering =
{size = 0, off = 0, append = 0, tempted = 0, lied = 0, fulfilled = 0,
    go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8020200)
$3 = {all = {next = 0x7fdbe8068c30, prev = 0x7fdbe803a608}, todo = {next
= 0x7fdbe8068c40, prev = 0x7fdbe803a618}, lie = {next = 0x7fdbe8068c50,
    prev = 0x7fdbe803a628}, winds = {next = 0x7fdbe8020230, prev =
0x7fdbe8020230}, unwinds = {next = 0x7fdbe8020240, prev =
0x7fdbe8020240}, wip = {
    next = 0x7fdbe8020250, prev = 0x7fdbe8020250}, stub =
0x7fdbe8062c3c, write_size = 131072, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe80311a0, gen = 4, fd = 0x7fdbe805c89c, wind_count = 3,
ordering = {size = 131072, off = 1220608, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8068c30)
$4 = {all = {next = 0x7fdbe8021cd0, prev = 0x7fdbe8020200}, todo = {next
= 0x7fdbe8021ce0, prev = 0x7fdbe8020210}, lie = {next = 0x7fdbe803a628,
    prev = 0x7fdbe8020220}, winds = {next = 0x7fdbe8068c60, prev =
0x7fdbe8068c60}, unwinds = {next = 0x7fdbe8068c70, prev =
0x7fdbe8068c70}, wip = {
    next = 0x7fdbe8068c80, prev = 0x7fdbe8068c80}, stub =
0x7fdbe806746c, write_size = 118784, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe8052b10, gen = 5, fd = 0x7fdbe805c89c, wind_count = 2,
ordering = {size = 118784, off = 1351680, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}

you can see they are all on 'todo' queue, and FLUSH op fd is not the
same WRITE op fd.

> Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
> BUG: 1372211
> Signed-off-by: Ryan Ding 

Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
BUG: 1390838
Signed-off-by: Ryan Ding 
Reviewed-on: http://review.gluster.org/15762
Tested-by: Raghavendra G 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System

performance/write-behind: remove the request from liability queue in

2016-10-18T05:37:09+00:00

wb_fulfill_request

Before this patch, a request is removed from liability queue only when
ref count of request hits 0. Though, wb_fulfill_request does an unref,
it need not be the last unref and hence the request may survive in
liability queue till the last unref. Let,

T1: the time at which wb_fulfill_request is invoked
T2: the time at which last unref is done on request

Let's consider a case of T2 > T1. In the time window between T1 and
T2, any other request (waiter) conflicting with request in liability
queue (blocker - basically a write which has been lied) is blocked
from winding. If T2 happens to be when wb_do_unwinds is invoked, no
further processing of request list happens and "waiter" would get
blocked forever. An example imaginary sequence of events is given
below:

1. A write request w1 is picked up for unwinding in __wb_pick_unwinds
   (but unwind is not done _yet_ and hence reference
   remains). However, w1 is moved to liability queue. Let's call this
   invocation of wb_process_queue by wb_writev as PQ1.

2. A flush (f1) request hits write behind. Since the liability queue
   of inode is not empty, f1 is not picked for unwinding. Let's call
   the invocation of wb_process_queue by wb_flush as PQ2.

3. PQ2 continues and picks w1 for fulfilling and invokes
   wb_fulfill. As part of successful wb_fulfill_cbk,
   wb_fulfill_request (w1) is invoked. But, w1 is not freed (and hence
   not removed from liability queue) as w1 is not unwound _yet_ and a
   ref remains (PQ1 has not invoked wb_do_unwinds _yet_).

4. wb_fulfill_cbk (triggered by PQ2) invokes a wb_process_queue (let's
   say PQ3). f1 is not resumed in PQ3 as w1 is still in liability
   queue. At this time, PQ2 and PQ3 are complete.

5. PQ1 continues, unwinds w1 and does last unref on w1 and w1 is freed
   (and removed from liability queue). Since PQ1 didn't invoke
   wb_fulfill on any other write requests, there won't be any future
   codepaths that would invoke wb_process_queue and f1 is stuck
   forever.

With this fix, w1 is removed from liability queue in step 3 above and
PQ3 resumes f1 in step 4 (as there are no requests conflicting with f1
in liability queue during execution of PQ3).

> Signed-off-by: Raghavendra G 
> BUG: 1379655
> Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
> Reviewed-on: http://review.gluster.org/15579
> CentOS-regression: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Smoke: Gluster Build System 
(cherry picked from commit a8b2a981881221925bb5edfe7bb65b25ad855c04)

Signed-off-by: Raghavendra G 
BUG: 1385620
Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
Reviewed-on: http://review.gluster.org/15658
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/write-behind: guaranteed retry after a short write

2016-05-05T03:27:10+00:00

* Don't mark the request with a fake EIO after a short write.
* retry the remaining buffer at least once before unwinding reply to
  application. This way we capture correct error from backend (ENOSPC,
  EDQUOT etc).

Thanks to "Vijaikumar Mallikarjuna" for the test
script.

Change-Id: I73a18b39b661a7424db1a7855a980469a51da8f9
BUG: 1332789
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/14197
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/write-behind: fix memory corruption

2016-01-14T05:42:13+00:00

1. while handling short writes, __wb_fulfill_short_write would've freed
   current request before moving on to next in the list if request is not
   big enough to accomodate completed number of bytes. So, make sure to
   save next member before invoking __wb_fulfill_short_write on current
   request. Also handle the case where request is exactly the size of
   remaining completed number of bytes.

2. When write request is unwound because there is a conflicting failed
   liability, make sure its deleted from tempted list. Otherwise there
   will be two unwinds (one as part handling a failed request in
   wb_do_winds and another in wb_do_unwinds).

Change-Id: Id1b54430796b19b956d24b8d99ab0cdd5140e4f6
BUG: 1297373
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/13213
Tested-by: NetBSD Build System 
Reviewed-by: Raghavendra Talur

performance/write-behind: maintain correct transit size in case of

2016-01-07T05:59:52+00:00

short writes.

1. Imagine a write when cache is filled with failed syncs.
2. This write won't be unwound since cache size has exceeded
configured limit.
3. With trickling-writes on by default, the last write request wont be
considered for winding when there is non zero in-transit size.
4. There was a bug in accounting of in-transit size when winds
resulted in short writes. Due to this bug, in-transit size used to be
non-zero even when there are no syncs in progress.
5. Due to 3 and 4, current write request won't be wound till there is
another write or fsync or flush from application. But application
can't do any other fop till current write request is unwound. This
resulted in deadlock and hence application would be hung in 'D'
state.

This patch fixes bug in accounting of in-transit size during short
writes.

Change-Id: I04ce8bb510efaaed7623cac38d69b32dbc3730ce
Signed-off-by: Raghavendra G 
BUG: 1279730
Reviewed-on: http://review.gluster.org/13177
Tested-by: Gluster Build System

wb: remove inline keyword

2015-12-31T10:33:10+00:00

When compiled with -Werror flag gcc throws the following
error:

‘iov_length’ is static but used in inline
function ‘__wb_modify_write_request’ which is not static.
Let gcc decide what functions to inline and remove the inline
keyword.

Change-Id: I6d832596eefcf08306634936e11d2c8d4b8f9ccd
BUG: 1279730
Signed-off-by: Raghavendra Talur 
Reviewed-on: http://review.gluster.org/13113

build: export minimum symbols from xlators for correct resolution

2015-12-22T17:15:01+00:00

Revisiting http://review.gluster.org/#/c/11814/, which unintentionally
introduced warnings from libtool about the xlator .so names.

According to [1], the -module option must appear in the Makefile.am
file(s); if -module is defined in a macro, e.g. in configure(.ac),
then libtool will not recognize that this is a module and will emit a
warning.

[1]
http://www.gnu.org/software/automake/manual/automake.html#Libtool-Modules

Change-Id: Ifa5f9327d18d139597791c305aa10cc4410fb078
BUG: 1248669
Signed-off-by: Kaleb S KEITHLEY 
Reviewed-on: http://review.gluster.org/13003
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System 
Reviewed-by: soumya k 
Reviewed-by: Niels de Vos

performance/write-behind: retry "failed syncs to backend"

2015-12-22T09:55:57+00:00

1. When sync fails, the cached-write is still preserved unless there
   is a flush/fsync waiting on it.

2. When a sync fails and there is a flush/fsync waiting on the
   cached-write, the cache is thrown away and no further retries will
   be made. In other words flush/fsync act as barriers for all the
   previous writes. The behaviour of fsync acting as a barrier is
   controlled by an option (see below for details). All previous
   writes are either successfully synced to backend or forgotten in
   case of an error. Without such barrier fop (especially flush which
   is issued prior to a close), we end up retrying for ever even after
   fd is closed.

3. If a fop is waiting on cached-write and syncing to backend fails,
   the waiting fop is failed.

4. sync failures when no fop is waiting are ignored and are not
   propagated to application. For eg.,
   a. first attempt of sync of a cached-write w1 fails
   b. second attempt of sync of w1 succeeds

   If there are no fops dependent on w1 are issued b/w a and b,
   application won't know about failure encountered in a.

5. The effect of repeated sync failures is that, there will be no
   cache for future writes and they cannot be written behind.

fsync as a barrier and resync of cached writes post fsync failure:
==================================================================
Whether to keep retrying failed syncs post fsync is controlled by an
option "resync-failed-syncs-after-fsync". By default, this option is
set to "off".

If sync of "cached-writes issued before fsync" (to backend) fails,
this option configures whether to retry syncing them after fsync or
forget them. If set to on, cached-writes are retried till a "flush"
fop (or a successful sync) on sync failures. fsync itself is failed
irrespective of the value of this option, when there is a sync failure
of any cached-writes issued before fsync.

Change-Id: I6097c9257bfb9ee5b15616fbe6a0576ae9af369a
Signed-off-by: Raghavendra G 
BUG: 1279730
Reviewed-on: http://review.gluster.org/12594

build: export minimum symbols from xlators for correct resolution

2015-09-24T14:37:42+00:00

We've been lucky that we haven't had any symbol collisions until now.
Now we have a collision between the snapview-client's svc_lookup() and
libntirpc's svc_lookup() with nfs-ganesha's FSAL_GLUSTER and libgfapi.

As a short term solution all the snapview-client's FOP methods were
changed to static scope. See http://review.gluster.org/11805. This
works in snapview-client because all the FOP methods are defined in
a single source file. This solution doesn't work for other xlators
with FOP methods defined in multiple source files.

To address this we link with libtool's '-export-symbols $symbol-file'
(a wrapper around `ld --version-script ...` --- on linux anyway) and
only export the minimum required symbols from the xlator sharedlib.

N.B. the libtool man page says that the symbol file should be named
foo.sym, thus the rename of *.exports to *.sym. While foo.exports
worked, we will follow the documentation.

Signed-off-by: Kaleb S. KEITHLEY 
BUG: 1248669
Change-Id: I1de68b3e3be58ae690d8bfb2168bfc019983627c
Reviewed-on: http://review.gluster.org/11814
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System 
Reviewed-by: soumya k 
Reviewed-by: Niels de Vos

feature/performace: Fix broken build

2015-06-28T10:03:49+00:00

Fix the build broken because of patch
http://review.gluster.org/#/c/9822/

Change-Id: I0ee502c0fad5be87186c80ab4729036f52f85fa3
BUG: 1194640
Signed-off-by: Kotresh HR 
Reviewed-on: http://review.gluster.org/11451
Reviewed-by: Niels de Vos 
Tested-by: Niels de Vos