glusterfs.git/rpc/rpc-transport/socket, branch v4.0dev

libglusterfs: Name threads on creation

2017-07-19T14:16:19+00:00

Set names to threads on creation for easier
debugging.

Output of top -H -p 
Before:
19773 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19774 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19775 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19776 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19777 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19778 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19779 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19780 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19781 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19782 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19783 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19784 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19785 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterfsd
19786 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterfsd
19787 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterfsd
19789 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19790 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
25178 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
 5398 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
 7881 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd

After:
19773 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19774 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustertimer
19775 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19776 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustermemsweep
19777 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustersproc0
19778 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustersproc1
19779 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterepoll0
19780 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusteridxwrker
19781 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusteriotwr0
19782 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterbrssign
19783 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterbrswrker
19784 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterclogecon
19785 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterclogd0
19786 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterclogd1
19787 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterclogd2
19789 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterposixjan
19790 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterposixfsy
25178 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterepoll1
 5398 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterepoll2
 7881 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterposixhc

Change-Id: Id5f333755c1ba168a2ffaa4fce6e71c375e10703
BUG: 1254002
Updates: #271
Signed-off-by: Raghavendra Talur 
Reviewed-on: https://review.gluster.org/11926
Reviewed-by: Prashanth Pai 
Smoke: Gluster Build System 
Reviewed-by: Niels de Vos 
CentOS-regression: Gluster Build System

socket: call init_openssl_mt() in init() and add cleanup

2017-07-17T17:09:05+00:00

init_openssl_mt() wasn't explicitly invoked and was run implicitly before
dlopen() returned as it was tagged as __attribute__ ((constructor)). This
function used to call GF_CALLOC() which wasn't available or initialized
when socket.so is dlopen()ed by an external program. The program used to
crash with SIGSEGV as follows:

0x00007ffff5efe1ad in __gf_calloc (nmemb=41, size=40, type=158, typestr=0x7ffff63eb3d6 "gf_sock_mt_lock_array")
at mem-pool.c:109
0x00007ffff63e6acf in init_openssl_mt () at socket.c:4016
0x00007ffff7de90da in call_init.part () from /lib64/ld-linux-x86-64.so.2
0x00007ffff7de91eb in _dl_init () from /lib64/ld-linux-x86-64.so.2
0x00007ffff7dedde1 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
0x00007ffff7de8f84 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
0x00007ffff7ded339 in _dl_open () from /lib64/ld-linux-x86-64.so.2

This change moves call to init_openssl_mt() from being a constructor function
to the init() function and introduces fini_openssl_mt() which cleans up
resources (called in destructor).

BUG: 1193929
Change-Id: Iab690897ec34e24c33f6b43f8d8d9f8fd75ac607
Signed-off-by: Prashanth Pai 
Reviewed-on: https://review.gluster.org/17753
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Jeff Darcy

Link against missed libraries to resolve symbols

2017-07-03T10:58:14+00:00

When external programs perform a dlopen("..so", RTLD_LAZY|RTLD_LOCAL)
on some shared objects like xlators, it can fail with dlerror set to
error string "undefined symbol ".

This was observed for the following shared objects: fuse.so, quota.so,
quotad.so, server.so, libgfrpc.so and socket.so

P.S: This was found while running a go program which fetches the list
of xlator options (volume_option_t) from xlator's shared object.

BUG: 1193929
Change-Id: I7b958409cf11fb67c2be32a3f85a96fb1260236b
Signed-off-by: Prashanth Pai 
Reviewed-on: https://review.gluster.org/17659
Smoke: Gluster Build System 
Reviewed-by: Amar Tumballi 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Jeff Darcy

protocol/server: make listen backlog value as configurable

2017-06-08T15:32:30+00:00

problem:

When we call listen from protocol/server, we are giving a
hard coded valie of 10 if it is not manually given.
With multiplexing, especially when glusterd restarts all
clients may try to connect to the server at a time.
Which will result in overflowing the queue, and kernel
will complain about the errors.

Solution:

This patch will introduce a volume set command to make backlog
value as a configurable. This patch also changes the default
values for backlog from 10 to 128. This changes is only applicable
for sockets listening from protocol.

Example:

gluster volume set  transport.listen-backlog 1024

Note: 1 Brick has to be restarted to get this value in effect
      2 This changes won't be reflected in glusterd, or other
        xlators which calls listen. If you need, you have to
        add this option to the volfile.

Change-Id: I0c5a2bbf28b5db612f9979e7560e05dd82b41477
BUG: 1456405
Signed-off-by: Mohammed Rafi KC 
Reviewed-on: https://review.gluster.org/17411
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G 
Reviewed-by: Raghavendra Talur 
Reviewed-by: Atin Mukherjee 
Reviewed-by: Niels de Vos 
Reviewed-by: Jeff Darcy

socket/reconfigure: reconfigure should be done on new dict

2017-05-29T16:51:49+00:00

In socket reconfigure, reconfigurations are doing with old
dict values. It should be with new reconfigured dict values

Change-Id: Iac5ad4382fe630806af14c99bb7950a288756a87
BUG: 1456405
Signed-off-by: Mohammed Rafi KC 
Reviewed-on: https://review.gluster.org/17412
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

socket: Avoid flooding of error message in case of SSL

2017-05-15T12:15:44+00:00

Problem: socket poller is throwing Input/Output messages during volume operation

Solution: Update the code in socket_connect function before call
          socket_spwan.

BUG: 1450559
Change-Id: I5f275fe7a4b730b16d7b0a0407e76288b07ceaef
Signed-off-by: Mohit Agrawal 
Reviewed-on: https://review.gluster.org/17280
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
Reviewed-by: Zhou Zhengping 
CentOS-regression: Gluster Build System 
Reviewed-by: Jeff Darcy

event/epoll: Add back socket for polling of events immediately after

2017-05-12T05:26:42+00:00

             reading the entire rpc message from the wire

Currently socket is added back for future events after higher layers
(rpc, xlators etc) have processed the message. If message processing
involves signficant delay (as in writev replies processed by Erasure
Coding), performance takes hit. Hence this patch modifies
transport/socket to add back the socket for polling of events
immediately after reading the entire rpc message, but before
notification to higher layers.

credits: Thanks to "Kotresh Hiremath Ravishankar"
          for assitance in fixing a regression in
         bitrot caused by this patch.

Change-Id: I04b6b9d0b51a1cfb86ecac3c3d87a5f388cf5800
BUG: 1448364
Signed-off-by: Raghavendra G 
Reviewed-on: https://review.gluster.org/15036
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
Reviewed-by: Amar Tumballi

rpc: Remove accidental IPV6 changes

2017-05-05T11:39:28+00:00

They snuck in with the HALO patch (07cc8679c)

Change-Id: I8ced6cbb0b49554fc9d348c453d4d5da00f981f6
BUG: 1447953
Signed-off-by: Kaushal M 
Reviewed-on: https://review.gluster.org/17174
Smoke: Gluster Build System 
Reviewed-by: Niels de Vos 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Amar Tumballi

rpc: fix transport add/remove race on port probing

2017-05-03T16:29:49+00:00

Problem:
Spurious __gf_free() assertion failures seen all over the place with
header->magic being overwritten when running port probing tests with
'nmap'

Solution:
Fix sequence of:
1. add accept()ed socket connection fd to epoll set
2. add newly created rpc_transport_t object in RPCSVC service list

Correct sequence is #2 followed by #1.

Reason:
Adding new fd returned by accept() to epoll set causes an epoll_wait()
to return immediately with a POLLIN event. This races ahead to a readv()
which returms with errno:104 (Connection reset by peer) during port
probing using 'nmap'. The error is then handled by POLLERR code to
remove the new transport object from RPCSVC service list and later
unref and destroy the rpc transport object.
socket_server_event_handler() then catches up with registering the
unref'd/destroyed rpc transport object. This is later manifest as
assertion failures in __gf_free() with the header->magic field botched
due to invalid address references.
All this does not result in a Segmentation Fault since the address
space continues to be mapped into the process and pages still being
referenced elsewhere.

As a further note:
This race happens only in accept() codepath. Only in this codepath,
the notify will be referring to two transports:
1, listener transport and
2. newly accepted transport
All other notify refer to only one transport i.e., the transport/socket
on which the event is received. Since epoll is ONE_SHOT another event won't
arrive on the same socket till the current event is processed. However, in
the accept() codepath, the current event - ACCEPT - and the new event -
POLLIN/POLLER - arrive on two different sockets:
1. ACCEPT on listener socket and
2. POLLIN/POLLERR on newly registered socket.
Also, note that these two events are handled different thread contexts.

Cleanup:
Critical section in socket_server_event_handler() has been removed.
Instead, an additional ref on new_trans has been used to avoid ref/unref
race when notifying RPCSVC.

Change-Id: I4417924bc9e6277d24bd1a1c5bcb7445bcb226a3
BUG: 1438966
Signed-off-by: Milind Changire 
Reviewed-on: https://review.gluster.org/17139
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Oleksandr Natalenko 
Reviewed-by: Jeff Darcy

Halo Replication feature for AFR translator

2017-05-02T10:23:53+00:00

Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10seconds) which
will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.

New FUSE mount options:
  halo-latency & halo-min-replicas: As descripted above.

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is gaurenteed for synchronous clients, this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients on the other hand and merely consistent to their region.
  Writes & deletes will be protected via entry-locks as usual preventing
  concurrent writes into files which are undergoing replication.  Read operations
  on the other hand should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The take away from this is care should be taken to ensure
  multiple writers do not write the same files resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &
  size).  Recommended method for preventing this is using the nfs-auth feature to
  define which region for each share has RW permissions, tiers not in the origin
  region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling & better dent type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer & valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
BUG: 1428061
Signed-off-by: Kevin Vigor 
Reviewed-on: http://review.gluster.org/16099
Reviewed-on: https://review.gluster.org/16177
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri