glusterfs.git/xlators/performance/write-behind, branch release-3.7

performance/write-behind: fix flush stuck by former failed writes

2016-11-03T09:28:17+00:00

The issue happens in the following case.
Assume a file is opened with fd1 and fd2.
1. Some WRITE ops to fd1 got errors; they were added back to the 'todo'
   queue because of those errors.
2. fd2 is closed, and a FLUSH op is sent to write-behind.
3. The FLUSH cannot be unwound because it is not a legal waiter for
   those failed writes (as the function __wb_request_waiting_on() says),
   and those failed WRITEs cannot be ended while fd1 is not closed. fd2
   is stuck in the close syscall.

To resolve this issue, we can change the way we determine whether two
requests 'conflict': a flush/fsync does not conflict with writes that
do not belong to its fd, so __wb_pick_winds() can wind the FLUSH op.
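
The rule above can be sketched as a small predicate. This is only an
illustration, not the actual patch: req_t, op_t and requests_conflict()
are hypothetical names, fds are modeled as plain integers, and every
other pair of requests is simply assumed to conflict (the real code also
looks at other properties of the requests).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified request model (not the real wb_request_t). */
typedef enum { OP_WRITE, OP_FLUSH, OP_FSYNC } op_t;

typedef struct {
    op_t op;
    int  fd; /* stand-in for the fd pointer carried by a real request */
} req_t;

/*
 * Return true if 'waiter' has to wait for the earlier request 'blocker'.
 * Assumption of this sketch: a flush/fsync only waits for writes issued
 * on its own fd; all other pairs are treated as conflicting.
 */
static bool
requests_conflict(const req_t *blocker, const req_t *waiter)
{
    bool waiter_is_barrier = (waiter->op == OP_FLUSH ||
                              waiter->op == OP_FSYNC);

    if (waiter_is_barrier && blocker->op == OP_WRITE &&
        blocker->fd != waiter->fd)
        return false; /* write belongs to another fd: no conflict */

    return true;
}
```

With this predicate, a FLUSH on fd2 queued behind failed WRITEs on fd1
is not blocked by them, while a FLUSH on fd1 still is.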

Below is some information captured when the hang happens:
glusterdump logs:
[xlator.performance.write-behind.wb_inode]
path=/ltp-F9eG0ZSOME/rw-buffered-16436
inode=0x7fdbe8039b9c
window_conf=1048576
window_current=249856
transit-size=0
dontsync=0

[.WRITE]
request-ptr=0x7fdbe8020200
refcount=1
wound=no
generation-number=4
req->op_ret=-1
req->op_errno=116
sync-attempts=3
sync-in-progress=no
size=131072
offset=1220608
lied=-1
append=0
fulfilled=0
go=0

[.WRITE]
request-ptr=0x7fdbe8068c30
refcount=1
wound=no
generation-number=5
req->op_ret=-1
req->op_errno=116
sync-attempts=2
sync-in-progress=no
size=118784
offset=1351680
lied=-1
append=0
fulfilled=0
go=0

[.FLUSH]
request-ptr=0x7fdbe8021cd0
refcount=1
wound=no
generation-number=6
req->op_ret=0
req->op_errno=0
sync-attempts=0

gdb details of the above 3 requests:
(gdb) print *((wb_request_t *)0x7fdbe8021cd0)
$2 = {all = {next = 0x7fdbe803a608, prev = 0x7fdbe8068c30}, todo = {next
= 0x7fdbe803a618, prev = 0x7fdbe8068c40}, lie = {next = 0x7fdbe8021cf0,
    prev = 0x7fdbe8021cf0}, winds = {next = 0x7fdbe8021d00, prev =
0x7fdbe8021d00}, unwinds = {next = 0x7fdbe8021d10, prev =
0x7fdbe8021d10}, wip = {
    next = 0x7fdbe8021d20, prev = 0x7fdbe8021d20}, stub =
0x7fdbe80224dc, write_size = 0, orig_size = 0, total_size = 0, op_ret =
0, op_errno = 0,
  refcount = 1, wb_inode = 0x7fdbe803a5f0, fop = GF_FOP_FLUSH, lk_owner
= {len = 8, data = "W\322T\f\271\367y$", '\000' },
  iobref = 0x0, gen = 6, fd = 0x7fdbe800f0dc, wind_count = 0, ordering =
{size = 0, off = 0, append = 0, tempted = 0, lied = 0, fulfilled = 0,
    go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8020200)
$3 = {all = {next = 0x7fdbe8068c30, prev = 0x7fdbe803a608}, todo = {next
= 0x7fdbe8068c40, prev = 0x7fdbe803a618}, lie = {next = 0x7fdbe8068c50,
    prev = 0x7fdbe803a628}, winds = {next = 0x7fdbe8020230, prev =
0x7fdbe8020230}, unwinds = {next = 0x7fdbe8020240, prev =
0x7fdbe8020240}, wip = {
    next = 0x7fdbe8020250, prev = 0x7fdbe8020250}, stub =
0x7fdbe8062c3c, write_size = 131072, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe80311a0, gen = 4, fd = 0x7fdbe805c89c, wind_count = 3,
ordering = {size = 131072, off = 1220608, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8068c30)
$4 = {all = {next = 0x7fdbe8021cd0, prev = 0x7fdbe8020200}, todo = {next
= 0x7fdbe8021ce0, prev = 0x7fdbe8020210}, lie = {next = 0x7fdbe803a628,
    prev = 0x7fdbe8020220}, winds = {next = 0x7fdbe8068c60, prev =
0x7fdbe8068c60}, unwinds = {next = 0x7fdbe8068c70, prev =
0x7fdbe8068c70}, wip = {
    next = 0x7fdbe8068c80, prev = 0x7fdbe8068c80}, stub =
0x7fdbe806746c, write_size = 118784, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe8052b10, gen = 5, fd = 0x7fdbe805c89c, wind_count = 2,
ordering = {size = 118784, off = 1351680, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}

You can see they are all on the 'todo' queue, and the fd of the FLUSH
op is not the same as the fd of the WRITE ops.

> Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
> BUG: 1372211
> Signed-off-by: Ryan Ding 

Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
BUG: 1390840
Signed-off-by: Ryan Ding 
Reviewed-on: http://review.gluster.org/15763
Tested-by: Raghavendra G 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/write-behind: remove the request from liability queue in

2016-10-18T05:37:02+00:00

wb_fulfill_request

Before this patch, a request was removed from the liability queue only
when the ref count of the request hit 0. Though wb_fulfill_request does
an unref, it need not be the last unref, and hence the request may
survive in the liability queue till the last unref. Let,

T1: the time at which wb_fulfill_request is invoked
T2: the time at which last unref is done on request

Let's consider a case of T2 > T1. In the time window between T1 and
T2, any other request (waiter) conflicting with request in liability
queue (blocker - basically a write which has been lied) is blocked
from winding. If T2 happens to be when wb_do_unwinds is invoked, no
further processing of request list happens and "waiter" would get
blocked forever. An example imaginary sequence of events is given
below:

1. A write request w1 is picked up for unwinding in __wb_pick_unwinds
   (but unwind is not done _yet_ and hence reference
   remains). However, w1 is moved to liability queue. Let's call this
   invocation of wb_process_queue by wb_writev as PQ1.

2. A flush (f1) request hits write behind. Since the liability queue
   of inode is not empty, f1 is not picked for unwinding. Let's call
   the invocation of wb_process_queue by wb_flush as PQ2.

3. PQ2 continues and picks w1 for fulfilling and invokes
   wb_fulfill. As part of successful wb_fulfill_cbk,
   wb_fulfill_request (w1) is invoked. But, w1 is not freed (and hence
   not removed from liability queue) as w1 is not unwound _yet_ and a
   ref remains (PQ1 has not invoked wb_do_unwinds _yet_).

4. wb_fulfill_cbk (triggered by PQ2) invokes a wb_process_queue (let's
   say PQ3). f1 is not resumed in PQ3 as w1 is still in liability
   queue. At this time, PQ2 and PQ3 are complete.

5. PQ1 continues, unwinds w1 and does last unref on w1 and w1 is freed
   (and removed from liability queue). Since PQ1 didn't invoke
   wb_fulfill on any other write requests, there won't be any future
   codepaths that would invoke wb_process_queue and f1 is stuck
   forever.

With this fix, w1 is removed from liability queue in step 3 above and
PQ3 resumes f1 in step 4 (as there are no requests conflicting with f1
in liability queue during execution of PQ3).
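
The difference between the two behaviours can be sketched with a toy
model. All names below (inode_t, request_t, fulfill_request_old(),
fulfill_request_new(), flush_can_proceed()) are hypothetical and only
mimic the description above; the liability queue is reduced to a
counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the inode context and a request. */
typedef struct {
    int liability_len; /* number of requests on the liability queue */
} inode_t;

typedef struct {
    inode_t *inode;
    int      refcount;
    bool     on_liability;
} request_t;

static void
liability_add(request_t *req)
{
    if (!req->on_liability) {
        req->on_liability = true;
        req->inode->liability_len++;
    }
}

static void
liability_del(request_t *req)
{
    if (req->on_liability) {
        req->on_liability = false;
        req->inode->liability_len--;
    }
}

/* Drop one reference; the last unref also removes queue membership. */
static void
request_unref(request_t *req)
{
    if (--req->refcount == 0)
        liability_del(req);
}

/* Behaviour described as "before this patch": fulfilling only unrefs. */
static void
fulfill_request_old(request_t *req)
{
    request_unref(req);
}

/* Behaviour described as "with this fix": leave the queue immediately. */
static void
fulfill_request_new(request_t *req)
{
    liability_del(req);
    request_unref(req);
}

/* In this model a flush may proceed only if no liability is pending. */
static bool
flush_can_proceed(const inode_t *inode)
{
    return inode->liability_len == 0;
}
```

In this model, a request that still holds a second reference (the
pending unwind) keeps blocking the flush after fulfill_request_old(),
but not after fulfill_request_new().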

> Signed-off-by: Raghavendra G 
> BUG: 1379655
> Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
> Reviewed-on: http://review.gluster.org/15579
> CentOS-regression: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Smoke: Gluster Build System 
(cherry picked from commit a8b2a981881221925bb5edfe7bb65b25ad855c04)

Signed-off-by: Raghavendra G 
BUG: 1385622
Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
Reviewed-on: http://review.gluster.org/15657
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/write-behind: guaranteed retry after a short write

2016-05-05T03:27:33+00:00

* Don't mark the request with a fake EIO after a short write.
* Retry the remaining buffer at least once before unwinding the reply
  to the application. This way we capture the correct error from the
  backend (ENOSPC, EDQUOT, etc.).
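
The idea can be sketched as a retry loop. This is a minimal
illustration, not the translator code: backend_write_fn,
write_with_retry() and the fake_disk_* helpers are hypothetical names,
and the "backend" is a toy in-memory model that fails with ENOSPC once
it is full.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical backend hook: returns bytes written, or -1 with *err set. */
typedef ssize_t (*backend_write_fn)(void *ctx, const char *buf, size_t len,
                                    int *err);

/*
 * Write 'len' bytes, retrying the unwritten remainder after a short write.
 * Returns the total number of bytes written. *err is 0 unless the backend
 * reported an error, in which case it carries that error instead of a
 * made-up EIO.
 */
static size_t
write_with_retry(backend_write_fn fn, void *ctx, const char *buf, size_t len,
                 int *err)
{
    size_t done = 0;

    *err = 0;
    while (done < len) {
        int     e = 0;
        ssize_t n = fn(ctx, buf + done, len - done, &e);

        if (n < 0) {
            *err = e; /* the real reason reported by the backend */
            break;
        }
        if (n == 0)
            break; /* no progress and no error: stop instead of spinning */

        done += (size_t)n;
    }

    return done;
}

/* Toy backend: a "disk" with 'free' bytes left, ENOSPC once it is full. */
typedef struct {
    size_t free;
} fake_disk_t;

static ssize_t
fake_disk_write(void *ctx, const char *buf, size_t len, int *err)
{
    fake_disk_t *disk = ctx;
    size_t       n;

    (void)buf; /* the toy backend does not store the data */

    if (disk->free == 0) {
        *err = ENOSPC;
        return -1;
    }

    n = (len < disk->free) ? len : disk->free;
    disk->free -= n;
    return (ssize_t)n;
}
```

Writing 16 bytes to a fake disk with 10 bytes free first yields a short
write of 10 bytes; the retry of the remaining 6 bytes then surfaces
ENOSPC to the caller.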

Thanks to "Vijaikumar Mallikarjuna" for the test
script.

Change-Id: I73a18b39b661a7424db1a7855a980469a51da8f9
BUG: 1332790
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/14196
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/write-behind: retry "failed syncs to backend"

2016-02-16T09:09:34+00:00

1. When sync fails, the cached-write is still preserved unless there
   is a flush/fsync waiting on it.

2. When a sync fails and there is a flush/fsync waiting on the
   cached-write, the cache is thrown away and no further retries will
   be made. In other words flush/fsync act as barriers for all the
   previous writes. The behaviour of fsync acting as a barrier is
   controlled by an option (see below for details). All previous
   writes are either successfully synced to backend or forgotten in
   case of an error. Without such a barrier fop (especially flush, which
   is issued prior to a close), we end up retrying forever even after
   the fd is closed.

3. If a fop is waiting on cached-write and syncing to backend fails,
   the waiting fop is failed.

4. Sync failures when no fop is waiting are ignored and are not
   propagated to the application. For example:
   a. the first attempt to sync a cached-write w1 fails
   b. the second attempt to sync w1 succeeds

   If no fops dependent on w1 are issued between a and b, the
   application won't know about the failure encountered in a.

5. The effect of repeated sync failures is that there will be no
   cache for future writes and they cannot be written behind.

fsync as a barrier and resync of cached writes post fsync failure:
==================================================================
Whether to keep retrying failed syncs post fsync is controlled by an
option "resync-failed-syncs-after-fsync". By default, this option is
set to "off".

If sync of "cached-writes issued before fsync" (to backend) fails,
this option configures whether to retry syncing them after fsync or
forget them. If set to on, cached-writes are retried till a "flush"
fop (or a successful sync) on sync failures. fsync itself is failed
irrespective of the value of this option, when there is a sync failure
of any cached-writes issued before fsync.
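
The rules above can be summarised as a small decision function. This is
a sketch of the policy as described in this message only: waiter_t,
sync_failure_action_t and on_sync_failure() are hypothetical names, and
the boolean parameter stands in for the value of
"resync-failed-syncs-after-fsync".

```c
#include <assert.h>
#include <stdbool.h>

/* Which kind of fop (if any) is waiting on the cached write. */
typedef enum { WAITER_NONE, WAITER_FSYNC, WAITER_FLUSH, WAITER_OTHER } waiter_t;

typedef struct {
    bool keep_cached_write; /* keep the data and retry the sync later */
    bool fail_waiter;       /* fail the waiting fop with the sync error */
} sync_failure_action_t;

/* Decide what to do with a cached write whose sync to backend failed. */
static sync_failure_action_t
on_sync_failure(waiter_t waiter, bool resync_failed_syncs_after_fsync)
{
    sync_failure_action_t action = { true, false };

    switch (waiter) {
    case WAITER_NONE:
        /* Nobody is waiting: the failure is ignored, the write is kept. */
        break;
    case WAITER_FLUSH:
        /* flush is always a barrier: forget the write, fail the flush. */
        action.keep_cached_write = false;
        action.fail_waiter = true;
        break;
    case WAITER_FSYNC:
        /* fsync always fails; the option decides whether to keep retrying. */
        action.keep_cached_write = resync_failed_syncs_after_fsync;
        action.fail_waiter = true;
        break;
    case WAITER_OTHER:
        /* Any other dependent fop is failed, but the write is preserved. */
        action.fail_waiter = true;
        break;
    }

    return action;
}
```

With the option at its default ("off"), a failed sync with an fsync
waiting drops the cached write; with it "on", the write is kept for
further retries while the fsync itself is still failed.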

Change-Id: I6097c0257bfb9ee5b1f616fbe6a0576ae9af369a
Signed-off-by: Raghavendra G 
BUG: 1293534
Signed-off-by: Raghavendra Talur 
Reviewed-on: http://review.gluster.org/13057
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

Logging: Porting the performance translator

2015-06-29T09:21:53+00:00

         logs to new logging framework

> Change-Id: Ie6aaf8d30bd4457bb73c48e23e6b1dea27598644
> BUG: 1194640
> Signed-off-by: arao 

BUG: 1217722
Change-Id: I0856c43dbf8c0a1aa084d4478c9bdf3f41dfc0b8
Signed-off-by: arao 
Reviewed-on: http://review.gluster.org/11442
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra G

build: MacOSX Porting fixes

2014-04-24T21:41:48+00:00

git@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs

Working functionality on MacOSX

 - GlusterD (management daemon)
 - GlusterCLI (management cli)
 - GlusterFS FUSE (using OSXFUSE)
 - GlusterNFS (without NLM - issues with rpc.statd)

Change-Id: I20193d3f8904388e47344e523b3787dbeab044ac
BUG: 1089172
Signed-off-by: Harshavardhana 
Signed-off-by: Dennis Schafroth 
Tested-by: Harshavardhana 
Tested-by: Dennis Schafroth 
Reviewed-on: http://review.gluster.org/7503
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

write-behind: track filesize when doing extending writes

2014-02-28T05:56:48+00:00

A program that calls mmap() on a newly created sparse file may receive
a SIGBUS signal. If SIGBUS is not handled, a segmentation fault will
occur and the program will exit.

A bug in the write-behind translator can cause the write that creates a
sparse file (open(), seek(), write()) to be cached. The last write() may
not be sent to the server until write-behind deems this necessary.

* open(.., O_TRUNC, ...)/creat() the file; it is 0 bytes big
* seek() into the file, using offset 31
* write() 1 byte to the file
* bytes 0-30 are unwritten, so-called 'sparse'

The following illustration tries to capture this:

    Legend:
    [ = start of file
    _ = unallocated/unwritten bytes
    # = allocated bytes in the file
    ] = end of file

    [_______________#]
     |              |
     '- byte 0      '- byte 31

Without this change, reading from byte 0-30 will return an error, and
reading the same area through an mmap()'d pointer will trigger a SIGBUS.
Reading from this range did not trigger the outstanding write() to be
flushed. The brick that receives the read() (translated over the network
from mmap()) does not know that the file has been extended, and returns
-EINVAL. This error gets transported back from the brick to the
glusterfs-fuse client, and translated by the Linux kernel/VFS into
SIGBUS triggered by mmap().

In order to solve this, a new attribute is introduced in the wb_inode
structure: the current size of the file. All FOPs that can modify the
size are expected to update wb_inode->size. This makes it possible for
extending writes with an offset bigger than EOF to mark the unwritten
area as modified/pending.
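
A minimal sketch of this bookkeeping, under simplifying assumptions:
inode_size_t, range_t, track_write() and read_overlaps() are
hypothetical names, and only the tracked size is modeled, not the real
wb_inode structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the size tracked in wb_inode->size. */
typedef struct {
    uint64_t size;
} inode_size_t;

typedef struct {
    uint64_t off;
    uint64_t len;
} range_t;

/*
 * Record a cached write of 'len' bytes at 'off' and return the range that
 * has to be treated as modified/pending. For a write that starts beyond
 * the current EOF, the range also covers the hole between EOF and 'off'.
 */
static range_t
track_write(inode_size_t *ino, uint64_t off, uint64_t len)
{
    range_t  pending = { off, len };
    uint64_t end = off + len;

    if (off > ino->size) {
        pending.off = ino->size;
        pending.len = end - ino->size;
    }

    if (end > ino->size)
        ino->size = end; /* extending write: remember the new EOF */

    return pending;
}

/* Does a read of 'len' bytes at 'off' touch the pending range? */
static bool
read_overlaps(range_t pending, uint64_t off, uint64_t len)
{
    return len > 0 && pending.len > 0 &&
           off < pending.off + pending.len &&
           pending.off < off + len;
}
```

For the example above (empty file, 1 byte written at offset 31), the
pending range becomes bytes 0-31, so a read of bytes 0-30 overlaps it
and can be ordered after the cached write.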

Change-Id: If5ba6646732e6be26568541ea9b12852a5d0b988
BUG: 1058663
Signed-off-by: Niels de Vos 
Reviewed-on: http://review.gluster.org/6835
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra G 
Reviewed-by: Anand Avati

Fix for 'use after free' errors reported by coverity.

2014-02-06T06:09:13+00:00

Change-Id: I941fc89b2d696c7f227330321ed4bba3ed1deac4
BUG: 789278
Signed-off-by: Poornima 
Reviewed-on: http://review.gluster.org/6868
Reviewed-by: Raghavendra G 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

write-behind: handle iobref_merge() error gracefully

2013-11-26T18:32:00+00:00

.. by UNWINDing ENOMEM error, rather than crashing.

Change-Id: Ica2d6399eaf7e381e7ebc41155620559c139c4d3
BUG: 1034398
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6349
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi

performance/write-behind: invoke request queue processing if

2013-08-14T09:11:58+00:00

we find fd marked bad while trying to fulfill lies.

* flush was queued behind some unfulfilled write.
* A previously wound write returned an error and hence fd was marked
  bad with corresponding error.
* wb_fulfill_head (invocation probably rooted in wb_flush), before
  winding, checks for failures of previous writes and, since there was
  a failure, calls wb_head_done without winding even one request in head.
* wb_head_done unrefs all the requests in list "head".
* since flush was last operation on fd (and most likely last operation
  on inode itself), no one invokes wb_process_queue and flush is stuck
  in request queue for eternity.

Change-Id: I3b5b114a1c401d477dd7ff64fb6119b43fda2d18
BUG: 988642
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/5398
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati