glusterfs.git/geo-replication, branch v4.1.9

geo-rep: Fix permissions with non-root setup

2018-11-28T05:04:10+00:00

Problem:
In a non-root fail-over/fail-back (FO/FB) setup, when the slave
is promoted to master, the session goes to 'Faulty'.

Cause:
The command 'gluster-mountbroker  '
is run as a pre-requisite on slave in non-root setup.
It modifies the permission and group of following required
directories and files recursively

  [1] /var/lib/glusterd/geo-replication
  [2] /var/log/glusterfs/geo-replication-slaves

In a normal setup, this is executed on slave node and hence
doing it recursively is not an issue on [1]. But when original
master becomes slave in non-root during FO/FB, it contains
ssh public keys and modifying permissions on them causes
geo-rep to fail with incorrect permissions.

Fix:
Don't do permission change recursively. Fix permissions for
required files.
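
The difference between a recursive and a targeted permission change
can be sketched as follows. This is only an illustration, not the code
of gluster-mountbroker: the helper name fix_permissions, the key file
name and the modes are assumptions made for the demo.

```python
import os
import stat
import tempfile

def fix_permissions(paths, mode, gid=-1):
    """Apply mode (and optionally group) to each listed path only.

    Unlike an os.walk()-based recursive change, files that merely live
    under the same directory (e.g. ssh public keys) are left untouched.
    """
    for path in paths:
        os.chown(path, -1, gid)  # -1 keeps the current owner/group
        os.chmod(path, mode)

# Throwaway directory standing in for the geo-replication directory.
georep_dir = tempfile.mkdtemp()
key_file = os.path.join(georep_dir, "secret.pem.pub")  # illustrative name
with open(key_file, "w") as f:
    f.write("ssh-rsa ...\n")
os.chmod(key_file, 0o600)

# Only the directory itself is listed, so only it is changed.
fix_permissions([georep_dir], 0o770)

dir_mode = stat.S_IMODE(os.stat(georep_dir).st_mode)
key_mode = stat.S_IMODE(os.stat(key_file).st_mode)
```

The key file keeps its restrictive mode because it was never listed,
which is the property the fix relies on.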

Backport of:
 > Patch: https://review.gluster.org/#/c/glusterfs/+/21689/
 > fixes: bz#1651498
 > Change-Id: I68a744644842e3b00abc26c95c06f123aa78361d
 > Signed-off-by: Kotresh HR 
(cherry picked from commit b2776b1ec1ad845ba568c4439bca3b57cc4d2592)

fixes: bz#1654118
Change-Id: I68a744644842e3b00abc26c95c06f123aa78361d
Signed-off-by: Kotresh HR

geo-rep: Fix traceback with symlink metadata sync

2018-11-12T15:40:59+00:00

While syncing metadata, 'os.chmod', 'os.chown',
'os.utime' should be used without de-reference.
But python supports only 'os.chown' without
de-reference. That's mostly because Linux
doesn't support 'chmod' on symlink file itself
but it does support 'chown'.

So while syncing metadata ops, if the file is a symlink
we should only sync 'chown' and skip 'chmod' and
'utime'. Otherwise it leads to tracebacks with
errors like EROFS, EPERM, EACCES, ENOENT.
Three of these errors (EPERM, EACCES, ENOENT)
were handled, but EROFS was not. Even so, the way
they were handled was not foolproof: the operation
was tried and the failure was handled based on the error.
All the errors with a symlink file for 'chown',
'utime' had to be passed to the safe errors list of
'errno_wrap'. This patch handles it better by
avoiding 'chmod' and 'utime' if the file is a
symlink.
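
A minimal sketch of the approach described above, assuming a
hypothetical helper named sync_metadata (the real gsyncd code is
structured differently): check for a symlink up front, apply only
'lchown' to it, and do 'chmod'/'utime' for everything else.

```python
import os
import stat
import tempfile

def sync_metadata(path, mode=None, uid=-1, gid=-1, times=None):
    """Apply metadata to path without ever following a symlink."""
    is_link = os.path.islink(path)
    # lchown acts on the link itself, so it is safe in both cases.
    os.lchown(path, uid, gid)
    if is_link:
        return  # skip chmod/utime: they would dereference the link
    if mode is not None:
        os.chmod(path, mode)
    if times is not None:
        os.utime(path, times)

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "file1")
link = os.path.join(workdir, "sym1")
open(target, "w").close()
os.chmod(target, 0o644)
os.symlink(target, link)

# Symlink: only the ownership call is made, the target is untouched.
sync_metadata(link, mode=0o600, times=(0, 0))
mode_after_link_sync = stat.S_IMODE(os.stat(target).st_mode)

# Regular file: all three operations are applied.
sync_metadata(target, mode=0o600, times=(0, 0))
mode_after_file_sync = stat.S_IMODE(os.stat(target).st_mode)
```

Deciding before the call, rather than trying the call and filtering
errno values afterwards, is what removes the need for a growing list
of "safe" errors.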

Backport of:
> Patch: https://review.gluster.org/21546/
> BUG: 1646104
> Change-Id: Ic354206455cdc7ab2a87d741d81f4efe1f19d77d
> Signed-off-by: Kotresh HR 
(cherry picked from commit 3c6cf9a4a1b46cab2dc53c1ee0afca0fe993102e)

fixes: bz#1646806
Change-Id: Ic354206455cdc7ab2a87d741d81f4efe1f19d77d
Signed-off-by: Kotresh HR

geo-rep/scripts: Fix traceback in gluster-mountbroker

2018-11-05T20:37:25+00:00

When 'gluster-mountbroker status' was issued, it
crashed in a corner case with "'str' object has no
attribute 'get'". Fixed the same.

Backport of:
> Patch: https://review.gluster.org/21507
> fixes: bz#1643929
> Signed-off-by: Kotresh HR 
> Change-Id: Iaf1a937ed0136b3b2058230c75fa89a215d8a5eb
(cherry picked from commit 5987b3388126a3c5e77481913cbaa4142117d19a)

fixes: bz#1644516
Signed-off-by: Kotresh HR 
Change-Id: Iaf1a937ed0136b3b2058230c75fa89a215d8a5eb

geo-rep: Add more intelligence to automatic error handling

2018-11-05T19:10:07+00:00

Geo-rep's automatic error handling does gfid conflict
resolution. But if there are ENOENT errors because the
parent is not synced to slave, it doesn't handle them.
This patch adds the intelligence to create missing
parent directories on slave. It can create the missing
directories up to a depth of 10.
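
The idea can be sketched with plain paths. This is an assumption-laden
illustration: the helper create_missing_parents is hypothetical and
works on path names, and it does not show how the actual patch walks
the hierarchy on the slave. Only the depth limit of 10 comes from the
text above.

```python
import os
import tempfile

MAX_DEPTH = 10  # limit stated in the commit message

def create_missing_parents(root, rel_path, max_depth=MAX_DEPTH):
    """Create the missing ancestors of root/rel_path.

    Returns the directories created (top-most first) and raises
    OSError if more than max_depth ancestors are missing.
    """
    missing = []
    parent = os.path.dirname(rel_path)
    while parent and not os.path.isdir(os.path.join(root, parent)):
        missing.append(parent)
        parent = os.path.dirname(parent)
    if len(missing) > max_depth:
        raise OSError("more than %d missing parent directories" % max_depth)
    missing.reverse()  # create the top-most directory first
    for rel in missing:
        os.mkdir(os.path.join(root, rel))
    return missing

slave_root = tempfile.mkdtemp()
created = create_missing_parents(slave_root, "a/b/c/file1")

# A path with 12 missing ancestors exceeds the limit and is rejected.
too_deep = "/".join(["d"] * 12) + "/file2"
try:
    create_missing_parents(slave_root, too_deep)
    rejected = False
except OSError:
    rejected = True
```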

Backport of:

> Patch: https://review.gluster.org/21498
> BUG: 1643402
> Change-Id: Ic97ed1fa5899c087e404d559e04f7963ed7bb54c
> Signed-off-by: Kotresh HR 
(cherry picked from commit 19775e0445411cca9ddd9d294fd54d0b6fbe6a03)

fixes: bz#1644518
Change-Id: Ic97ed1fa5899c087e404d559e04f7963ed7bb54c
Signed-off-by: Kotresh HR

geo-rep: Fix issue in gfid-conflict-resolution

2018-10-30T19:20:58+00:00

Problem:
During gfid-conflict-resolution, geo-rep crashes
with 'ValueError: list.remove(x): x not in list'

Cause and Analysis:
During gfid-conflict-resolution, the entry blob is
passed back to master along with additional
information to verify its integrity. If everything
looks fine, the entry creation is ignored and is
deleted from the original list.  But it is crashing
during removal of entry from the list saying entry
not in list. The reason is that the stat information
in the entry blob was modified and sent back to
master if present.

Fix:
Send back the correct stat information for
gfid-conflict-resolution.
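
The crash can be reproduced with plain Python lists. The entry layout
below is made up for the demo and is not the real entry blob format;
the point is only that list.remove() compares by equality, so a blob
whose stat was altered no longer matches the original.

```python
import copy

# Hypothetical entry list kept on the master side.
entries = [
    {"op": "CREATE", "gfid": "gfid-1", "entry": "pgfid/file1",
     "stat": {"uid": 0, "gid": 0}},
]

# The blob comes back with a modified stat.
returned = copy.deepcopy(entries[0])
returned["stat"] = {"uid": 1000, "gid": 1000}

try:
    entries.remove(returned)
    removed_modified = True
except ValueError:  # list.remove(x): x not in list
    removed_modified = False

# With the stat left intact, the blob matches and removal succeeds.
intact = copy.deepcopy(entries[0])
entries.remove(intact)
```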

Backport of:

> BUG: 1642865
> Change-Id: I47a6aa60b2a495465aa9314eebcb4085f0b1c4fd
> Signed-off-by: Kotresh HR 
(cherry picked from commit ff18121945bff394f3234e9f1a9d61ac97d4d493)

fixes: bz#1644163
Change-Id: I47a6aa60b2a495465aa9314eebcb4085f0b1c4fd
Signed-off-by: Kotresh HR

georep: fix hard-coded paths in gsyncd.conf.in

2018-09-23T10:40:18+00:00

This is part of the reason why we use autoconf (i.e. configure).
For an ordinary clone+autogen.sh+configure SBIN_DIR is
/usr/local/sbin; for an rpm or dpkg build it will be /usr/sbin.

I wonder how many more are lurking in our sources? /usr/libexec is
one that frequently bites us on Debian and Ubuntu, which don't have
/usr/libexec. (But it's all Linux, right?)

See https://bugzilla.redhat.com/show_bug.cgi?id=1601532

Reported-by: lohmaier+rhbz@gmail.com
Change-Id: I6523894416cc06236ea1f99529efd36e957bd98e
updates: bz#1632013
Signed-off-by: Kaleb S. KEITHLEY

geo-rep: Fix issues related config set

2018-09-21T13:25:43+00:00

1. '--ignore-missing-args' option for rsync is not
   being used even though the rsync version is
   greater than 3.1.0. Fixed the same.

2. '--existing' option for rsync is also not being
   used. Fixed the same.

3. geo-rep config fails to set rsync-options as the
   value contains '--'. Interestingly, Python's argparse
   treats a value starting with '--' (e.g., --ignore-missing-args)
   as an option. But when passed with something like
   --value=--ignore-missing-args, it succeeds. Fixed the
   same.
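
The argparse behaviour from item 3 can be reproduced in isolation.
The parser below is a stand-in written for this demo (the program
name and the 'name' positional are assumptions), not the parser
used by geo-rep.

```python
import argparse

parser = argparse.ArgumentParser(prog="config-set-demo")  # hypothetical
parser.add_argument("name")
parser.add_argument("--value")

def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return parser.parse_args(argv)
    except SystemExit:  # argparse reports errors by exiting
        return None

# Value passed as a separate token: it looks like another option,
# so '--value' is left without an argument and parsing fails.
separate = try_parse(["rsync-options", "--value", "--ignore-missing-args"])

# Value attached with '=': it is taken verbatim.
joined = try_parse(["rsync-options", "--value=--ignore-missing-args"])
```

The failing case prints argparse's usage message to stderr before
exiting, which is why the helper catches SystemExit.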

Backport of:
 > Patch: https://review.gluster.org/21191
 > Change-Id: Iaeb838acaff1c2920fee9c7f920c99edce13a0a1
 > Signed-off-by: Kotresh HR 
 > BUG: 1629561

Change-Id: Iaeb838acaff1c2920fee9c7f920c99edce13a0a1
Signed-off-by: Kotresh HR 
fixes: bz#1630140

geo-rep: Fix deadlock during worker start

2018-09-21T13:25:43+00:00

Analysis:
Monitor process spawns monitor threads (one per brick).
Each monitor thread forks worker and agent processes.
Each monitor thread, while initializing, updates the
monitor status file. It is synchronized using flock.
The race is that one thread can fork a worker while
another thread has the status file open, resulting in
the worker process holding a reference to the fd.

Cause:
flock gets unlocked either by specifically unlocking it
or by closing all duplicate fds referring to the file.
The code was relying on fd close, hence a reference
in worker/agent process by fork could cause the deadlock.

Fix:
1. flock is unlocked specifically.
2. Also made sure to update status file in appropriate places so that
the reference is not leaked to worker/agent process.

With this fix, both the deadlock and possible fd
leaks are solved.
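
The flock behaviour behind this can be shown in a single process on
Linux. In this sketch os.dup() stands in for the copy of the fd that
a forked worker would inherit; the file name and the helper names are
assumptions, and the code is not taken from the monitor.

```python
import fcntl
import os
import tempfile

def lock_is_free(path):
    """Try a non-blocking exclusive flock through a fresh open()."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)

def update_status(path, text):
    """Rewrite the status file, releasing the lock explicitly."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            os.ftruncate(fd, 0)
            os.write(fd, text.encode())
        finally:
            # Does not depend on every duplicate of fd being closed.
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)

status_file = os.path.join(tempfile.mkdtemp(), "monitor.status")
update_status(status_file, "Initializing...")

# Relying on close() alone: a leaked duplicate keeps the lock held.
fd = os.open(status_file, os.O_RDWR)
leaked = os.dup(fd)
fcntl.flock(fd, fcntl.LOCK_EX)
os.close(fd)
held_after_close = not lock_is_free(status_file)

# An explicit unlock through any duplicate releases it.
fcntl.flock(leaked, fcntl.LOCK_UN)
free_after_unlock = lock_is_free(status_file)
os.close(leaked)

update_status(status_file, "Active")
with open(status_file) as f:
    final_status = f.read()
```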

Backport of:
 > Patch: https://review.gluster.org/20704
 > BUG: bz#1614799
 > Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
 > Signed-off-by: Kotresh HR 

fixes: bz#1630145
Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
Signed-off-by: Kotresh HR

geo-rep: Fix issues with gfid conflict handling

2018-08-16T04:21:14+00:00

1. MKDIR/RMDIR is recorded on all bricks. So if
   one brick succeeds creating it, other bricks
   should ignore it. But this was not happening.
   The fix for rename of directories in hybrid crawl
   was trying to rename the directory to itself
   and in the process crashing with ENOENT if the
   directory is removed.

2. If file is created, deleted and a directory is
   created with same name, it was failing to sync.
   Again the issue is around the fix for rename
   of directories in hybrid crawl. Fixed the same.

   If the same case was done with hardlink present
   for the file, it was failing. This patch fixes
   that too.

Backport of:
 > BUG: 1598884
 > Change-Id: I6f3bca44e194e415a3d4de3b9d03cc8976439284
 > Signed-off-by: Kotresh HR 

fixes: bz#1611114
Change-Id: I6f3bca44e194e415a3d4de3b9d03cc8976439284
Signed-off-by: Kotresh HR

geo-rep: Fix symlink rename syncing issue

2018-08-16T04:21:01+00:00

Problem:
   Geo-rep sometimes fails to sync the rename of symlink
if the I/O is as follows

  1. touch file1
  2. ln -s "./file1" sym_400
  3. mv sym_400 renamed_sym_400
  4. mkdir sym_400

 The file 'renamed_sym_400' failed to sync to slave

Cause:
  Assume there are three distribute subvolumes (brick1, brick2, brick3).
  The changelogs are recorded as follows for above I/O pattern.
  Note that the MKDIR is recorded on all bricks.

  1. brick1:
     -------

     CREATE file1
     SYMLINK sym_400
     RENAME sym_400 renamed_sym_400
     MKDIR sym_400

  2. brick2:
     -------

     MKDIR sym_400

  3. brick3:
     -------

     MKDIR sym_400

  The operations on 'brick1' should be processed sequentially. But
  since MKDIR is recorded on all the bricks, 'brick2'/'brick3'
  processed MKDIR before 'brick1' did, causing out-of-order syncing:
  the directory sym_400 was created first.

  Now 'brick1' processed its changelog.

     CREATE file1 -> succeeds
     SYMLINK sym_400 -> No longer present in master. Ignored
     RENAME sym_400 renamed_sym_400
            While processing RENAME, if the source ('sym_400') is not
            present, the destination ('renamed_sym_400') is created. But
            geo-rep stats the name 'sym_400' to confirm the source file's
            presence. In this race, since the source name 'sym_400' is
            present as a directory, it doesn't create the destination.
            Hence RENAME is ignored.

Fix:
  The fix is to not rely only on the stat of the source name during RENAME.
  Geo-rep should stat the name and, if the name is present, check that the
  gfid is the same. Only then can it conclude that the source is present.
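
  The check can be sketched as below. The function source_present, the
  lookup callable and the gfid strings are all hypothetical; the real
  code obtains the gfid from the file system rather than from a dict.

```python
def source_present(lookup_gfid, name, expected_gfid):
    """Return True only if name exists AND still carries expected_gfid.

    lookup_gfid is a hypothetical callable returning the gfid currently
    bound to name, or None if the name does not exist.
    """
    current = lookup_gfid(name)
    return current is not None and current == expected_gfid

# State after brick2/brick3 have already replayed MKDIR sym_400:
# the name exists, but it now belongs to the directory.
slave_names = {"file1": "gfid-file1", "sym_400": "gfid-dir"}

# brick1 replays RENAME sym_400 -> renamed_sym_400 for the symlink.
name_only_check = "sym_400" in slave_names  # old check: says "present"
gfid_check = source_present(slave_names.get, "sym_400", "gfid-symlink")

# The source symlink is really absent, so the destination gets created.
create_destination = not gfid_check
```

  With the name-only check the RENAME would be dropped; comparing the
  gfid as well shows that the name belongs to a different object.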

Backport of:
 > BUG: 1600405
 > Change-Id: I9fbec4f13ca6a182798a7f81b356fe2003aff969
 > Signed-off-by: Kotresh HR 

fixes: bz#1611113
Change-Id: I9fbec4f13ca6a182798a7f81b356fe2003aff969
Signed-off-by: Kotresh HR