glusterfs.git/rpc, branch v3.10.10

glusterd: clean up portmap on brick disconnect

2017-10-31T18:07:17+00:00

GlusterD's portmap entry for a brick is cleaned up when the brick process
initiates a PMAP_SIGNOUT event at shutdown. But if the brick process crashes
or is killed with SIGKILL, this event is never initiated and glusterd
is left with a stale port. Since GlusterD's portmap traversal happens both ways,
forward for allocation and backward for registry search, glusterd may end up
running with a stale port for a brick, which eventually causes clients to
fail to connect to the bricks.

The solution is to clean up the port entry as part of the brick disconnect
event when the process is down. Although this makes handling of the
PMAP_SIGNOUT event redundant in most cases, it is a safeguard that keeps
glusterd from running into stale port issues.
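
The cleanup described above can be sketched roughly as follows. This is a
minimal, self-contained illustration and not glusterd's actual code: the
pmap table, the function names and the port range are hypothetical and
chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified port registry; not glusterd's real structures. */
#define PMAP_BASE_PORT 49152
#define PMAP_SLOTS     16

struct pmap_entry {
    int  in_use;     /* 1 while a brick owns this port */
    char brick[64];  /* brick path registered for the port */
};

static struct pmap_entry pmap[PMAP_SLOTS];

/* Register a brick on the first free port (forward traversal). */
int pmap_alloc(const char *brick)
{
    assert(brick != NULL);
    for (int i = 0; i < PMAP_SLOTS; i++) {
        if (!pmap[i].in_use) {
            pmap[i].in_use = 1;
            strncpy(pmap[i].brick, brick, sizeof(pmap[i].brick) - 1);
            pmap[i].brick[sizeof(pmap[i].brick) - 1] = '\0';
            return PMAP_BASE_PORT + i;
        }
    }
    return -1;
}

/* Look a brick up from the highest port down (backward traversal). */
int pmap_search(const char *brick)
{
    for (int i = PMAP_SLOTS - 1; i >= 0; i--) {
        if (pmap[i].in_use && strcmp(pmap[i].brick, brick) == 0)
            return PMAP_BASE_PORT + i;
    }
    return -1;
}

/* Called on brick disconnect: drop the entry only if the process is gone. */
void pmap_on_brick_disconnect(const char *brick, int process_running)
{
    if (process_running)
        return;
    for (int i = 0; i < PMAP_SLOTS; i++) {
        if (pmap[i].in_use && strcmp(pmap[i].brick, brick) == 0)
            memset(&pmap[i], 0, sizeof(pmap[i]));
    }
}
```

With a cleanup of this kind, a brick killed by SIGKILL no longer leaves its
entry behind, so a later search cannot return the stale port.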

This patch also brings in the changes from change id
I705f101739ab1647ff52a92820d478354407264a, which are needed for the
compilation to go through.

> mainline patch : https://review.gluster.org/#/c/18541/
>                  https://review.gluster.org/#/c/17129/

Change-Id: I04c5be6d11e772ee4de16caf56dbb37d5c944303
BUG: 1507749
Signed-off-by: Atin Mukherjee

rpc: TLSv1_2_method() is deprecated in OpenSSL-1.1

2017-09-17T12:55:40+00:00

Fedora 26 has OpenSSL-1.1. Compile-time warnings indicate
that TLSv1_2_method() is now deprecated. As per the SSL man page:

  TLS_method(), TLS_server_method(), TLS_client_method()
    These are the general-purpose version-flexible SSL/TLS methods.
    The actual protocol version used will be negotiated to the highest
    version mutually supported by the client and the server. The
    supported protocols are SSLv3, TLSv1, TLSv1.1 and TLSv1.2.
    Applications should use these methods, and avoid the version-
    specific methods described below.
  ...
  TLSv1_2_method(), ...
  ...

Note that OpenSSL-1.1 refers to the version of the OpenSSL library; Fedora 25,
RHEL 7.3 and other distributions (still) have OpenSSL-1.0.

TLS versions are orthogonal to the OpenSSL version.  TLS_method() is the
new (in OpenSSL-1.1) version-flexible function intended to replace the
TLSv1_2_method() function in OpenSSL-1.0 and the older (?), insecure
SSLv23_method(). (OpenSSL-1.0 does not have TLS_method().)
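
As a rough illustration of the version split described above, the sketch
below picks the method constructor by OpenSSL version number. It is kept free
of OpenSSL headers so it builds anywhere; pick_tls_method_name() is a
hypothetical helper, and a version-number check is only one possible
approach, not necessarily what this patch does.

```c
#include <assert.h>
#include <string.h>

/* OpenSSL encodes release 1.1.0 as 0x10100000L in OPENSSL_VERSION_NUMBER. */
#define OPENSSL_1_1_0 0x10100000L

/*
 * Return the name of the method constructor that a build against the
 * given OpenSSL version would select. Hypothetical helper, for
 * illustration only.
 *
 * In real code the same decision would be made by the preprocessor:
 *
 *   #if OPENSSL_VERSION_NUMBER >= 0x10100000L
 *       meth = TLS_method();
 *   #else
 *       meth = TLSv1_2_method();
 *   #endif
 */
const char *pick_tls_method_name(long openssl_version)
{
    assert(openssl_version > 0);
    if (openssl_version >= OPENSSL_1_1_0)
        return "TLS_method";     /* version-flexible, OpenSSL >= 1.1 */
    return "TLSv1_2_method";     /* fixed to TLSv1.2, OpenSSL 1.0 */
}
```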

master: https://review.gluster.org/18268
master BZ: 1491025
release-3.12: https://review.gluster.org/18284
release-3.12 BZ: 1491690

Change-Id: I190363ccffe7c25606ea2cf30a6b9ff1ec186057
BUG: 1491691
Signed-off-by: Kaleb S. KEITHLEY 
Reviewed-on: https://review.gluster.org/18285
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System

refcount: typecast function for calling on free

2017-08-11T11:22:43+00:00

All of the functions called to free the refcounted structure are doing a
typecast from (void*) to their own type that is being freed. This
is not really needed, and the refcount interface becomes a little simpler
without the requirement of typecasting.

With this small improvement in the API, all callers are updated too.

Cherry picked from commit f2ca301bd741e3e3f076cd3f72fcd377bcef2a1a:
> Change-Id: I32473b6d1799f62861d4b2d78ea30c09e6c80ab1
> BUG: 1416889
> Signed-off-by: Niels de Vos 
> Reviewed-on: https://review.gluster.org/16471
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Reviewed-by: Xavier Hernandez 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Kaleb KEITHLEY 

Backport note: This patch makes it easier to backport changes that use
               gf_refcount_t. There is no functional change.

Change-Id: I32473b6d1799f62861d4b2d78ea30c09e6c80ab1
BUG: 1471870
Signed-off-by: Niels de Vos 
Reviewed-on: https://review.gluster.org/17913
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

rpc: add options to manage socket keepalive lifespan

2017-06-20T04:58:25+00:00

Problem:
Default values for handling socket timeouts for brick responses are
insufficient for aggressive applications such as databases.

Solution:
Add 1:1 gluster options for keepalive, keepalive-idle,
keepalive-interval and keepalive-timeout, matching the socket-level options
described in the tcp(7) man page.

Default values for the options are NOT aggressive and continue to be the
values which result in the default timeout when only the keepalive option is
turned on.

These options are Linux specific and will not be applicable to the
*BSDs.

mainline:
> BUG: 1426059
> Signed-off-by: Milind Changire 
> Reviewed-on: https://review.gluster.org/16731
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Reviewed-by: Raghavendra G 
(cherry picked from commit 6b8df081b46ac4f485c86a5052fc30472e74bfbb)

Change-Id: I2a08ecd949ca8ceb3e090d336ad634341e2dbf14
BUG: 1452038
Signed-off-by: Milind Changire 
Reviewed-on: https://review.gluster.org/17330
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra Talur

rpc: log more about socket disconnects

2017-05-31T05:57:36+00:00

Log more about the different paths leading to socket disconnect for 
ease of debugging.

Log via gf_log_callingfn() in __socket_disconnect() at loglevel
TRACE if socket connection is being torn down.

mainline:
> BUG: 1426125
> Signed-off-by: Milind Changire 
> Reviewed-on: https://review.gluster.org/16732
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Jeff Darcy 
(cherry picked from commit 67a35ac54bfd61a920c1919fbde588a04ac3358a)

Change-Id: I1e551c2d685784b5ec747f481179f64d524c0461
BUG: 1451977
Signed-off-by: Milind Changire 
Reviewed-on: https://review.gluster.org/17321
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra Talur

rpc: avoid logging success on failure

2017-05-31T05:57:16+00:00

Avoid logging "Success" in the event of a failure, especially when errno
has no meaningful value w.r.t. the failure. In this case errno
is zero even though there is indeed a failure at the RPC level.
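
The idea can be sketched as below: the strerror() text is appended only when
errno carries a real value. format_failure() is a hypothetical helper for
illustration and is not part of the gluster logging API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a failure message. The strerror() text is appended only when
 * err is non-zero, because strerror(0) describes "no error" (on glibc
 * the text is "Success"), which is misleading in a failure log line.
 */
int format_failure(char *buf, size_t len, const char *what, int err)
{
    assert(buf != NULL && what != NULL);
    if (err != 0)
        return snprintf(buf, len, "%s failed: %s", what, strerror(err));
    return snprintf(buf, len, "%s failed", what);
}
```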

mainline:
> BUG: 1426032
> Signed-off-by: Milind Changire 
> Reviewed-on: https://review.gluster.org/16730
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: N Balachandran 
> Reviewed-by: Jeff Darcy 
(cherry picked from commit 89c6bedc1c2e978f67ca29f212a357984cd8a2dd)

Change-Id: If2cc81aa1e590023ed22892dacbef7cac213e591
BUG: 1451995
Signed-off-by: Milind Changire 
Reviewed-on: https://review.gluster.org/17326
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra Talur

rpc: fix a routine to destroy RDMA qp(queue-pair)

2017-05-13T21:04:36+00:00

    This is a backport of https://review.gluster.org/#/c/17249/

Problem: If rdma_create_id() fails in gf_rdma_connect(), the
         process jumps to the 'unlock' label and then calls gf_rdma_teardown(),
         which calls __gf_rdma_teardown().
         Presently, __gf_rdma_teardown() checks the InfiniBand QP via peer->cm_id->qp.
         Unfortunately, cm_id is not allocated in this situation, so the process crashes.

Solution: If 'this->private->peer->cm_id' member is null, do not check
          'this->private->peer->cm_id->qp'.
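
A minimal sketch of the guarded check is shown below. The fake_* structs are
hypothetical stand-ins for the real librdmacm and gluster types, so this
illustrates the ordering of the NULL checks rather than the actual teardown
code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real types involved. */
struct fake_qp    { int id; };
struct fake_cm_id { struct fake_qp *qp; };
struct fake_peer  { struct fake_cm_id *cm_id; };

/*
 * Returns 1 if a queue pair exists and must be destroyed, 0 otherwise.
 * cm_id is tested before cm_id->qp, so a peer whose cm_id was never
 * allocated no longer causes a NULL dereference.
 */
int peer_has_qp(const struct fake_peer *peer)
{
    assert(peer != NULL);
    return peer->cm_id != NULL && peer->cm_id->qp != NULL;
}
```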

> Change-Id: Ie321b8cf175ef4f1bdd9733d73840f03ddff8c3b
> BUG: 1449495
> Signed-off-by: Ji-Hyeon Gim 
> Reviewed-on: https://review.gluster.org/17249
> Reviewed-by: Amar Tumballi 
> Reviewed-by: Prashanth Pai 
> NetBSD-regression: NetBSD Build System 
> Tested-by: Ji-Hyeon Gim
> CentOS-regression: Gluster Build System 
> Smoke: Gluster Build System 
> Reviewed-by: Jeff Darcy 

(cherry picked from commit ccfa06767f1282d9a3783e37555515a63cc62e69)

Change-Id: Ie321b8cf175ef4f1bdd9733d73840f03ddff8c3b
BUG: 1450564
Signed-off-by: Ji-Hyeon Gim 
Reviewed-on: https://review.gluster.org/17281
Smoke: Gluster Build System 
Tested-by: Ji-Hyeon Gim
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra Talur

rpc: fix transport add/remove race on port probing

2017-05-11T05:51:08+00:00

Problem:
Spurious __gf_free() assertion failures seen all over the place with
header->magic being overwritten when running port probing tests with
'nmap'

Solution:
Fix sequence of:
1. add accept()ed socket connection fd to epoll set
2. add newly created rpc_transport_t object in RPCSVC service list

Correct sequence is #2 followed by #1.

Reason:
Adding the new fd returned by accept() to the epoll set causes an epoll_wait()
to return immediately with a POLLIN event. This races ahead to a readv()
which returns with errno:104 (Connection reset by peer) during port
probing using 'nmap'. The error is then handled by the POLLERR code, which
removes the new transport object from the RPCSVC service list and later
unrefs and destroys the rpc transport object.
socket_server_event_handler() then catches up and registers the
unref'd/destroyed rpc transport object. This later manifests as
assertion failures in __gf_free() with the header->magic field botched
due to invalid address references.
All this does not result in a Segmentation Fault since the address
space continues to be mapped into the process and the pages are still
referenced elsewhere.

As a further note:
This race happens only in the accept() codepath. Only in this codepath
does the notify refer to two transports:
1. listener transport and
2. newly accepted transport
All other notifications refer to only one transport, i.e., the transport/socket
on which the event is received. Since epoll is ONE_SHOT, another event
won't arrive on the same socket till the current event is processed.
However, in the accept() codepath, the current event - ACCEPT - and the
new event - POLLIN/POLLERR - arrive on two different sockets:
1. ACCEPT on listener socket and
2. POLLIN/POLLERR on newly registered socket.
Also, note that these two events are handled in different thread contexts.

Cleanup:
Critical section in socket_server_event_handler() has been removed.
Instead, an additional ref on new_trans has been used to avoid ref/unref
race when notifying RPCSVC.
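
The corrected ordering and the extra reference can be sketched as follows.
This is a single-threaded illustration with hypothetical names (struct
transport, service_list_add(), epoll_set_add()); it shows the sequence of
steps only and does not reproduce the race or the real socket code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transport with a plain reference count. */
struct transport {
    int refs;
    int listed;  /* 1 once it is in the service list */
    int polled;  /* 1 once its fd is in the epoll set */
};

void transport_ref(struct transport *t)   { t->refs++; }
void transport_unref(struct transport *t) { assert(t->refs > 0); t->refs--; }

/* Stand-in for adding the transport to the RPCSVC service list. */
void service_list_add(struct transport *t) { t->listed = 1; }

/*
 * Stand-in for adding the fd to the epoll set. Events may fire as soon
 * as the fd is added, so the transport must already be registered.
 */
int epoll_set_add(struct transport *t)
{
    if (!t->listed)
        return -1;
    t->polled = 1;
    return 0;
}

/* Register a freshly accepted transport in the corrected order. */
int accept_new_transport(struct transport *t)
{
    int ret;

    transport_ref(t);        /* extra ref held while notifying */
    service_list_add(t);     /* step 2 first: service list */
    ret = epoll_set_add(t);  /* then step 1: epoll set */
    transport_unref(t);      /* drop the extra ref */
    return ret;
}
```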

mainline:
> BUG: 1438966
> Signed-off-by: Milind Changire 
> Reviewed-on: https://review.gluster.org/17139
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Amar Tumballi 
> Reviewed-by: Oleksandr Natalenko 
> Reviewed-by: Jeff Darcy 
(cherry picked from commit 4f7ef3020edcc75cdeb22d8da8a1484f9db77ac9)

Change-Id: I4417924bc9e6277d24bd1a1c5bcb7445bcb226a3
BUG: 1449169
Signed-off-by: Milind Changire 
Reviewed-on: https://review.gluster.org/17217
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

build: errors generating xdr stubs+headers with `make -j`

2017-03-28T12:08:12+00:00

Using a makebomb, on f23 at least, blows up when generating the
xdr headers and stubs. (Works reliably on f25 though, go figure.)
This change appears to mitigate the race on f23.

Master change https://review.gluster.org/16941
Master BZ: 1429696

Change-Id: I006066f0e7c3f8b65189f97c70089f3422e3e08b
BUG: 1430512
Signed-off-by: Kaleb S. KEITHLEY 
Reviewed-on: https://review.gluster.org/16942
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup

2017-03-27T13:58:29+00:00

Commit 086436a introduced a generation number (cleanup_gen) to ensure that the
rpc layer doesn't end up cleaning up the connection object if the
application layer has already destroyed it. Bumping up cleanup_gen was
done only in rpc_clnt_connection_cleanup (). However, the same is needed
in rpc_clnt_reconnect_cleanup () too: without it, if the object gets destroyed
through the reconnect event in the application layer, the rpc layer will
still try to delete the object, resulting in a double free
and a crash.

Peer probing an invalid host/IP was the basic test to catch this issue.

>Reviewed-on: https://review.gluster.org/16914
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>Reviewed-by: Milind Changire 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Jeff Darcy 
>(cherry picked from commit 39e09ad1e0e93f08153688c31433c38529f93716)

Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
BUG: 1434399
Signed-off-by: Atin Mukherjee 
Reviewed-on: https://review.gluster.org/16936
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan