summaryrefslogtreecommitdiffstats
path: root/xlators/mgmt
diff options
context:
space:
mode:
authorRavishankar N <ravishankar@redhat.com>2014-01-24 17:27:38 +0530
committerVijay Bellur <vbellur@redhat.com>2014-02-03 05:45:03 -0800
commit8cf2a36dad6c8bac7cd3a05455fd555544ebb457 (patch)
tree796f4489363839009b18140d8d83a603316ffba7 /xlators/mgmt
parent3395b98a341228e89cf0a88e0d86c90117dbb285 (diff)
glusterd: Fix race in pid file update
This patch only removes lines of code. For personal gratification, giving a detailed explanation of what the problem was. When glusterd spawns the local brick process, say when a reboot of the node occurs,the glusterd_brick_start() and subsequently the glusterd_volume_start_glusterfs() function gets called twice; from glusterd_spawn_daemons() and glusterd_do_volume_quorum_action() respectively. This causes a race, best described by a pseudo-code of current behaviour. glusterd_volume_start_glusterfs() { if(!brick process running) { step-a) reap pid file( i.e. unlink it) step-b) fork a brick process which creates and locks pid file and binds the process to a socket. } } Time Event ---- ----- T1 Call-1 arrives, completes step-a, starts step-b T2 Call-2 arrives, enters step-a as Call-1's forked child is not yet running. T3 Call-1's forked child is alive, creates pidfile and locks it,binds its address to a socket. T4 Call-2 performs step-a; i.e.unlinks the pid file created by Call-1 !! (files can still be stil be unlinked despite a lockf on it) T5 Call-2 does step-b, and the forked child process creates a *new* pid file with it's pid and locks this file. T6 But Call-2's brick process is not able to bind to socket as it is already in use (courtesy T3) and hence exits (so no locks anymore on the pidfile). Result: - Pid file now contains PID of an extinct brick process. - `gluster volume status` shows this PID value. It also notices that there is no lock held on pid file by the currently running brick process (created by Call-1) and hence shows N/A for the online status. Also, as a result of events at T4, "ls -l /proc/<brick process PID>/fd/pidfile" shows up as deleted. Fix: 1.Do not unlink pid file. i.e. avoid step-a. Now at T5,Call-2's brick process cannot obtain lock on pid file (Call-1's process still holds it) and exits. 2. Unrelated, but remove lock-unlock sequence in glusterfs_pidfile_setup() which does not seem to be doing anything useful. Change-Id: I18d3e577bb56f68d85d3e0d0bfd268e50ed4f038 BUG: 1035586 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/6786 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Diffstat (limited to 'xlators/mgmt')
-rw-r--r--xlators/mgmt/glusterd/src/glusterd-utils.c12
1 files changed, 0 insertions, 12 deletions
diff --git a/xlators/mgmt/glusterd/src/glusterd-utils.c b/xlators/mgmt/glusterd/src/glusterd-utils.c
index 7c58e51ad..9fb2ab31e 100644
--- a/xlators/mgmt/glusterd/src/glusterd-utils.c
+++ b/xlators/mgmt/glusterd/src/glusterd-utils.c
@@ -1307,16 +1307,6 @@ out:
return ret;
}
-/* Caller should ensure that brick process is not running*/
-static void
-_reap_brick_process (char *pidfile, char *brickpath)
-{
- unlink (pidfile);
- /* Brick process is not running and pmap may have an entry for it.*/
- pmap_registry_remove (THIS, 0, brickpath,
- GF_PMAP_PORT_BRICKSERVER, NULL);
-}
-
static int
_mk_rundir_p (glusterd_volinfo_t *volinfo)
{
@@ -1371,8 +1361,6 @@ glusterd_volume_start_glusterfs (glusterd_volinfo_t *volinfo,
if (gf_is_service_running (pidfile, NULL))
goto connect;
- _reap_brick_process (pidfile, brickinfo->path);
-
port = brickinfo->port;
if (!port)
port = pmap_registry_alloc (THIS);