path: root/geo-replication/syncdaemon/monitor.py
* [geo-rep] Merging multiple import into single import (kshithijiyer, 2020-04-08; 1 file, -5/+5)

    Geo-replication has a large number of repeated imports, as shown below:

    ```
    from syncdutils import set_term_handler, finalize, lf
    from syncdutils import log_raise_exception, FreeObject, escape
    ```

    These imports can be clubbed together as shown below:

    ```
    from syncdutils import (set_term_handler, finalize, lf,
                            log_raise_exception, FreeObject, escape)
    ```

    Fixes: #1105
    Change-Id: I59a48dd57a70fc851d93150b85e736ce41e8b793
    Signed-off-by: kshithijiyer <kshithij.ki@gmail.com>
* georep: Merge Worker and Agent as a single process (Aravinda VK, 2019-11-07; 1 file, -77/+6)

    - libgfchangelog is simplified by removing the unnecessary API class.
    - Agent logic is merged into Worker, instead of running Worker and
      Agent as two separate processes and maintaining RPC between them.
    - The geo-rep Pause and Resume commands continue to work without any
      changes, but Agent functionality now gets paused along with the
      Worker.

    Updates: #755
    Change-Id: Ie2c00fa7dddf21f180f0649e0aaf084d29023c98
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
* geo-rep: performance improvement while syncing renames with existing gfid (Sunny Kumar, 2019-09-23; 1 file, -0/+2)

    Problem: The bug [1] addresses a data-inconsistency issue when handling
    RENAME with an existing destination. That fix requires some performance
    tuning, since the issue occurs in rename-heavy workloads.

    Solution: If the distribution count of the master volume is one, do not
    verify ops on the master and go ahead with the rename.

    The performance improvement with this patch can only be observed if the
    master volume has a distribution count of one.

    [1]. https://bugzilla.redhat.com/show_bug.cgi?id=1694820

    fixes: bz#1753857
    Change-Id: I8e9bcd575e7e35f40f9f78b7961c92dee642f47b
    Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
* geo-rep: fix gluster command path for non-root session (Sunny Kumar, 2019-06-28; 1 file, -2/+2)

    Problem: gluster command not found.

    Cause: The Volinfo class issues the command 'gluster vol info' to get
    information about a volume, e.g. the brick root needed for various
    operations. When a geo-rep session is configured for a non-root user,
    the Volinfo class fails to issue the gluster command because the
    gluster binary path is not available to the non-root user.

    Solution: Use the config value 'slave-gluster-command-dir' or
    'gluster-command-dir' to get the path for the gluster command, based
    on the caller.

    fixes: bz#1722740
    Change-Id: I4ec46373da01f5d00ecd160c4e8c6239da8b3859
    Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
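    A sketch of resolving the command path from config (the config access
    shown here is illustrative; only the two key names come from the
    commit above):

    ```
    import os

    def gluster_cmd_path(conf, on_slave):
        # Pick the command dir based on the caller: slave-side code reads
        # 'slave-gluster-command-dir', master-side code reads
        # 'gluster-command-dir'. `conf` is an illustrative dict here.
        key = "slave-gluster-command-dir" if on_slave else "gluster-command-dir"
        return os.path.join(conf.get(key, "/usr/sbin"), "gluster")
    ```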
* georep: python2 to python3 compat - syscalls (Kotresh HR, 2018-10-08; 1 file, -1/+3)

    1. ctypes/syscalls
       A) Arguments are expected to be encoded.
       B) Raw conversion of the return value from bytearray into string.
    2. struct pack/unpack - raw conversion of string to bytearray.
    3. basestring -> str

    Updates: #411
    Change-Id: I80f939adcdec0ed0022c87c0b76d057ad5559e5a
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
* georep: Fix python3 compatibility (os.pipe) (Kotresh HR, 2018-09-25; 1 file, -5/+5)

    'os.pipe' returns a pair of file descriptors which are non-inheritable
    by child processes. But geo-rep relies on the inheritable nature of
    pipe fds to communicate between parent and child processes. Hence a
    compatible pipe routine was written that works with both python2 and
    python3 and keeps the fds inheritable.

    Updates: #411
    Change-Id: I869d7a52eeecdecf3851d44ed400e69b32a612d9
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
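    A minimal sketch of such a routine, assuming the PEP 446 semantics of
    Python >= 3.4 (not the exact syncdutils code):

    ```
    import os

    def inheritable_pipe():
        # os.pipe() fds are inheritable on python2 but non-inheritable
        # on python3 (PEP 446); mark them inheritable explicitly so a
        # forked worker can keep using them.
        r, w = os.pipe()
        if hasattr(os, "set_inheritable"):  # python >= 3.4
            os.set_inheritable(r, True)
            os.set_inheritable(w, True)
        return r, w
    ```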
* geo-rep: Fix deadlock during worker start (Kotresh HR, 2018-08-13; 1 file, -4/+14)

    Analysis: The monitor process spawns monitor threads (one per brick).
    Each monitor thread forks worker and agent processes, and while
    initializing it updates the monitor status file, synchronized using
    flock. The race is that one thread can fork a worker while another
    thread has the status file open, so the worker process ends up holding
    a reference to the fd.

    Cause: flock is released either by unlocking it explicitly or by
    closing all duplicate fds referring to the file. The code relied on fd
    close, so a reference leaked into a worker/agent process by fork could
    cause the deadlock.

    Fix:
    1. flock is unlocked explicitly.
    2. The status file is also updated in appropriate places so that the
       fd reference is not leaked to the worker/agent process.

    With this fix, both the deadlock and the possible fd leaks are solved.

    fixes: bz#1614799
    Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
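    The explicit-unlock pattern from the fix, as a sketch (status file
    path and payload are illustrative):

    ```
    import fcntl
    import os

    def update_status(path, data):
        # data: bytes to write to the monitor status file
        fd = os.open(path, os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            os.write(fd, data)
        finally:
            # Unlock explicitly: a concurrently forked worker/agent may
            # hold a duplicate of this fd, so relying on close() alone
            # would leave the flock held and deadlock other threads.
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
    ```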
* All: run codespell on the code and fix issues. (Yaniv Kaul, 2018-07-22; 1 file, -1/+1)

    Please review, it's not always just the comments that were fixed.
    I've had to revert of course all calls to creat() that were changed
    to create() ...

    Only compile-tested!

    Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5
    updates: bz#1193929
    Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* geo-rep: Fix geo-rep for older versions of unshare (Kotresh HR, 2018-06-22; 1 file, -4/+11)

    Geo-rep mounts are private to the worker. It uses a mount namespace,
    created with the unshare command, to achieve this. However, the
    unshare command has to support the '--propagation' option, so geo-rep
    breaks on systems with an older unshare version. This patch makes it
    fall back to the lazy-umount behaviour if unshare does not support the
    propagation option.

    fixes: bz#1589782
    Change-Id: Ia614f068aede288d63ac62fea4461b1865066054
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
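    One way to detect support, as a sketch (the actual gsyncd detection
    may differ):

    ```
    import subprocess

    def unshare_propagation_supported():
        # Older util-linux versions of unshare(1) lack --propagation;
        # check the help text and fall back to lazy umount when absent.
        try:
            out = subprocess.check_output(["unshare", "--help"],
                                          stderr=subprocess.STDOUT)
        except (OSError, subprocess.CalledProcessError):
            return False
        return b"--propagation" in out
    ```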
* geo-rep: Remove lazy umount and use mount namespaces (Kotresh HR, 2018-02-22; 1 file, -1/+7)

    Lazy umounting the master volume by the worker causes issues with
    rsync's usage of getcwd. Hence the lazy umount is removed and a
    private mount namespace is used instead. On the slave, the lazy umount
    is retained, since a private namespace cannot be used in a non-root
    geo-rep setup.

    Change-Id: I403375c02cb3cc7d257a5f72bbdb5118b4c8779a
    BUG: 1546129
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
* geo-rep: Support for using Volinfo from Conf file (Aravinda VK, 2018-01-23; 1 file, -50/+18)

    Once Geo-replication is started, it runs Gluster commands to get
    Volume info from Master and Slave. With this patch, Georep can get
    Volume info from a conf file if the `--use-gconf-volinfo` argument is
    specified to the monitor.

    Create a config (or add to the config if it exists) with the following
    fields:

        [vars]
        master-bricks=NODEID:HOSTNAME:PATH,..
        slave-bricks=NODEID:HOSTNAME,..
        master-volume-id=
        slave-volume-id=
        master-replica-count=
        master-disperse-count=

    Note: Existing Geo-replication is not affected, since this is
    activated only when `--use-gconf-volinfo` is passed while spawning
    `gsyncd monitor`.

    Tiering support is not yet added, since Tiering + Glusterd2 is still
    under discussion.

    Fixes: #396
    Change-Id: I281baccbad03686c00f6488a8511dd6db0edc57a
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
* geo-rep: Refactoring Config and Arguments parsing (Aravinda VK, 2017-11-15; 1 file, -102/+138)

    - Fixed Python pep8 issues
    - Removed dead code
    - Rewritten configuration management
    - Rewritten arguments/subcommands handling
    - Added args upgrade to accommodate all these changes without changing
      glusterd code
    - Removed the use of md5, which was used to hash the brick path for
      the workdir

    Both Master and Slave nodes will have a subdirectory per session in
    the format "<mastervol>_<primary_slave_host>_<slavevol>":

        $GLUSTER_LOGDIR/geo-replication/<mastervol>_<primary_slave_host>_<slavevol>
        $GLUSTER_LOGDIR/geo-replication-slaves/<mastervol>_<primary_slave_host>_<slavevol>

    Log file paths are renamed, since session info is available from the
    directory name itself.

    $LOG_DIR_MASTER/
    - gsyncd.log               - Gsyncd and worker monitor logs
    - mnt-<brick-path>.log     - Aux mount logs, mounted by each worker
    - changes-<brick-path>.log - Changelog-related logs (one per brick)

    $LOG_DIR_SLAVE/
    - gsyncd.log - Slave gsyncd logs
    - mnt-<master-node>-<master-brick-path>.log - Aux mount logs, mounted
      for each connection from master-node:master-brick
    - mnt-mbr-<master-node>-<master-brick-path>.log - Same as above, but
      for a mountbroker setup

    Fixes: #73
    Change-Id: I2ec2a21e4e2a92fd92899d026e8543725276f021
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
* geo-rep: Fix rename of directory in hybrid crawl (Kotresh HR, 2017-11-10; 1 file, -82/+3)

    In hybrid crawl, renames and unlinks can't be synced, but directory
    renames can be detected. While syncing a directory on the slave, if
    the gfid already exists, it should be a rename. Hence, if the
    directory gfid already exists, rename it.

    Change-Id: Ibf9f99e76a3e02795a3c2befd8cac48a5c365bb6
    BUG: 1499566
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
* geo-rep: Fix status transition (Kotresh HR, 2017-10-11; 1 file, -1/+0)

    The status transition was as below, which is wrong:

        Created->Initializing->Active->Active/Passive->Stopped

    As soon as the monitor spawned the worker, the state changed from
    'Initializing' to 'Active' and then to 'Active/Passive' based on
    whether the worker got the lock or not. This is wrong; it should
    transition directly as below:

        Created->Initializing->Active/Passive->Stopped

    Change-Id: Ibf5ca5c4fdf168c403c6da01db60b93f0604aae7
    BUG: 1500284
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
* geo-rep: Structured log support (Aravinda VK, 2017-06-20; 1 file, -26/+35)

    Changed all log messages to the structured log format.

    Change-Id: Idae25f8b4ad0bbae38f4362cbda7bbf51ce7607b
    Updates: #240
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: https://review.gluster.org/17551
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
* geo-rep: Improve worker log messages (Kotresh HR, 2017-04-07; 1 file, -2/+7)

    The monitor process expects the worker to establish an SSH tunnel to
    the slave node and mount the master volume locally within 60 seconds,
    acknowledging the monitor by closing the feedback fd. If something
    goes wrong and the worker does not close the feedback fd within 60
    seconds, the monitor kills the worker, but there was no clue in the
    log message about the actual issue. This patch adds a log message
    indicating whether the worker hung during SSH or during the master
    mount.

    Change-Id: Id08a12fa6f3bba1d4fe8036728dbc290e6c14c8c
    BUG: 1261689
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: https://review.gluster.org/16997
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
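    The 60-second handshake can be pictured with a sketch like the
    following (fd/pid handling illustrative, not the exact monitor.py
    code):

    ```
    import os
    import select
    import signal

    def wait_for_worker_ack(feedback_r, worker_pid, timeout=60):
        # The worker closes its end of the feedback pipe once the ssh
        # tunnel and master mount are up; EOF makes our end readable,
        # which serves as the acknowledgement.
        ready, _, _ = select.select([feedback_r], [], [], timeout)
        if not ready:
            # No ack within the timeout: log whether ssh or the mount
            # is the likely culprit, then kill the hung worker.
            os.kill(worker_pid, signal.SIGKILL)
            return False
        return True
    ```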
* geo-rep: Use Host UUID to find local Gluster node (Aravinda VK, 2016-12-13; 1 file, -1/+1)

    To spawn workers for each local brick, Geo-rep was collecting all the
    machine IPs based on hostname and finding the local bricks based on
    connectivity. With this patch, Geo-rep identifies a local brick when
    the host UUID matches the UUID of the brick from the Volume info.

    BUG: 1401801
    Change-Id: Ic83c65df89e43cb86346e3ede227aa84d17ffd79
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/16035
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
* geo-rep/eventsapi: Add Master node information in Geo-rep Events (Aravinda VK, 2016-12-02; 1 file, -2/+11)

    Added Master node information to the GEOREP_ACTIVE, GEOREP_PASSIVE,
    GEOREP_FAULTY and GEOREP_CHECKPOINT_COMPLETED events.

    EVENT_GEOREP_ACTIVE (master_node and master_node_id are new fields):

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_ACTIVE",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "master_node": MASTER_NODE,
                "master_node_id": MASTER_NODE_ID,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH
            }
        }

    EVENT_GEOREP_PASSIVE (master_node and master_node_id are new fields):

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_PASSIVE",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "master_node": MASTER_NODE,
                "master_node_id": MASTER_NODE_ID,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH
            }
        }

    EVENT_GEOREP_FAULTY (master_node and master_node_id are new fields):

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_FAULTY",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "master_node": MASTER_NODE,
                "master_node_id": MASTER_NODE_ID,
                "current_slave_host": CURRENT_SLAVE_HOST,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH
            }
        }

    EVENT_GEOREP_CHECKPOINT_COMPLETED (master_node and master_node_id are
    new fields):

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_CHECKPOINT_COMPLETED",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "master_node": MASTER_NODE,
                "master_node_id": MASTER_NODE_ID,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH,
                "checkpoint_time": CHECKPOINT_TIME,
                "checkpoint_completion_time": CHECKPOINT_COMPLETION_TIME
            }
        }

    BUG: 1395660
    Change-Id: Ic91af52fa248c8e982e93a06be861dfd69689f34
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/15858
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
* geo-rep: Logging improvements (Aravinda VK, 2016-10-24; 1 file, -4/+4)

    - Redundant log messages removed.
    - Worker and connected slave node details added to the "starting
      worker" log.
    - Added log for Monitor state change.
    - Added log for Worker status change
      (Initializing/Active/Passive/Faulty).
    - Added log for Crawl status change.
    - Added log for config set and reset.
    - Added log for checkpoint set, reset and completion.

    BUG: 1359612
    Change-Id: Icc7173ff3c93de4b862bdb1a61760db7eaf14271
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/15684
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
* geo-rep/eventsapi: Additional Events (Aravinda VK, 2016-10-18; 1 file, -1/+4)

    Added the following events (the checkpoint event's "event" field is
    corrected here from the original message's copy-paste of
    "GEOREP_ACTIVE"):

    EVENT_GEOREP_ACTIVE:

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_ACTIVE",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH
            }
        }

    EVENT_GEOREP_PASSIVE:

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_PASSIVE",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH
            }
        }

    EVENT_GEOREP_CHECKPOINT_COMPLETED:

        {
            "nodeid": NODEID,
            "ts": TIMESTAMP,
            "event": "GEOREP_CHECKPOINT_COMPLETED",
            "message": {
                "master_volume": MASTER_VOLUME_NAME,
                "slave_host": SLAVE_HOST,
                "slave_volume": SLAVE_VOLUME,
                "brick_path": BRICK_PATH,
                "checkpoint_time": CHECKPOINT_TIME,
                "checkpoint_completion_time": CHECKPOINT_COMPLETION_TIME
            }
        }

    BUG: 1379330
    Change-Id: I90716175868c59dd65c8d202e73e0ede90347b6a
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/15630
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Tested-by: Kotresh HR <khiremat@redhat.com>
* eventsapi/geo-rep: Geo-rep will not work without eventsapi rpms (Aravinda VK, 2016-09-28; 1 file, -2/+2)

    If the glusterfs-events rpm is not installed, Geo-replication fails
    since it imports eventtypes: any call to gsyncd fails with an
    ImportError, and Glusterd start fails since it runs
    `gsyncd.py --version`.

        Traceback (most recent call last):
          File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 29, in <module>
            from syncdutils import FreeObject, norm, grabpidfile, finalize
          File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 28, in <module>
            from events import eventtypes
        ImportError: No module named events

    BUG: 1378057
    Change-Id: I1a9bc086c3d52449ec7296cb2f9ceb16cd41a8a4
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/15539
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Saravanakumar Arumugam <sarumuga@redhat.com>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
* geo-rep: add geo-rep events for server side changes (Saravanakumar Arumugam, 2016-08-31; 1 file, -16/+25)

    Event types were defined in #15351 to avoid merge conflicts.

    Add geo-rep events for changes applicable to a geo-rep session on the
    server side.

    Change-Id: Ia66574d2abccad7fce6a96667efbc7c6c8903fc6
    BUG: 1370445
    Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com>
    Reviewed-on: http://review.gluster.org/15328
    Tested-by: Aravinda VK <avishwan@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
* georep: add reset-sync-time option for session delete (Milind Changire, 2016-06-28; 1 file, -0/+10)

    Set the stime xattr at all the brick roots to (0,0) if the
    reset-sync-time argument has been provided on the command line.

    To avoid testing against directory-specific stime, the remote stime
    is assumed to be minus_infinity if the root directory stime is set to
    (0,0) before the directory scan begins. This triggers a full volume
    resync to the slave in the case of a geo-rep session recreation with
    the same master-slave volume pair.

    Command synopsis:

        gluster volume geo-replication <MASTERVOL> <SLAVE>::<SLAVEVOL> delete \
            [reset-sync-time]

    The gluster cli man page is updated to include the new sub-command
    reset-sync-time.

    Change-Id: Ie4ce03b9425ed9bb81eda8681058c0fc6f990948
    BUG: 1311926
    Signed-off-by: Milind Changire <mchangir@redhat.com>
    Reviewed-on: http://review.gluster.org/14051
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
* geo-rep: Handle Worker kill gracefully if worker already died (Aravinda VK, 2016-05-30; 1 file, -9/+9)

    If the Agent dies for any reason, the monitor tries to kill the Worker
    too. But if the worker has also died, the kill command raises ESRCH:
    No such process.

        [2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor:
        Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
        [2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
        Traceback (most recent call last):
          File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
            tf(*aa)
          File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
            slave_host, master)
          File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
            os.kill(cpid, signal.SIGKILL)
        OSError: [Errno 3] No such process

    With this patch, the monitor gracefully handles the case where the
    worker has already died.

    Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    BUG: 1339472
    Reviewed-on: http://review.gluster.org/14512
    Smoke: Gluster Build System <jenkins@build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
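    The graceful handling amounts to tolerating ESRCH, roughly:

    ```
    import errno
    import os
    import signal

    def kill_worker(cpid):
        # The worker may already be dead by the time the monitor reacts
        # to agent death; ESRCH ("No such process") is then expected.
        try:
            os.kill(cpid, signal.SIGKILL)
        except OSError as e:
            if e.errno != errno.ESRCH:
                raise
    ```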
* geo-rep: Fix getting subvol count (Kotresh HR, 2015-12-22; 1 file, -1/+4)

    Tiering doesn't support a disperse volume as hot tier, hence the xml
    output doesn't give 'hotdisperseCount'. Remove the usage of
    'hotdisperseCount' in geo-rep and return 0 instead.

    Change-Id: I736e29257de085a25e38eb02959caad3465ebcda
    BUG: 1292084
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/13062
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vivek
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
* geo-rep: Fix getting subvol number (Kotresh HR, 2015-12-21; 1 file, -16/+46)

    Fix getting the subvol number when the volume type is tier. For a tier
    volume, the subvol number was calculated incorrectly, hence a few
    workers didn't become ACTIVE, resulting in files not being replicated
    from the corresponding bricks. This patch addresses the same.

    Change-Id: Ic10ad7f09a0fa91b4bf2aa361dea3bd48be74853
    BUG: 1292084
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/12994
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
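    For intuition, the mapping from brick index to subvolume in a
    non-tiered volume looks roughly like the sketch below; a tiered volume
    has two independent brick ranges (hot and cold) that must be numbered
    separately, which is what the incorrect calculation missed. A
    simplified illustration, not the actual geo-rep code:

    ```
    def subvol_number(brick_idx, replica_count, disperse_count):
        # Width of one subvolume: disperse_count for disperse volumes,
        # replica_count otherwise (both 1 for plain distribute).
        width = disperse_count or replica_count or 1
        return brick_idx // width
    ```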
* geo-rep: use cold tier bricks for namespace operations (Saravanakumar Arumugam, 2015-12-03; 1 file, -6/+6)

    Problem: symlinks are not getting synced to the slave in a
    Tiering-based volume.

    Solution: Now, symlinks are created directly in the cold tier bricks
    (in the backend). Earlier, the cold tier was avoided for namespace
    operations and only the hot tier was used while processing changelogs.
    The cold tier is the HASH subvolume in a Tiering volume, so namespace
    operations are carried out only in the cold tier subvolume; the hot
    tier subvolume is avoided to prevent any races.

    Earlier, XSYNC was used (and changelog history avoided) during the
    initial sync in order to avoid races while processing history
    changelogs in the hot tier. This is no longer required, as there is no
    race from the hot tier. Also, both live and history changelog ENTRY
    operations are avoided in the hot tier to prevent any race with the
    cold tier.

    Change-Id: Ia8fbb7ae037f5b6cb683f36c0df5c3fc2894636e
    BUG: 1287519
    Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com>
    Reviewed-on: http://review.gluster.org/12844
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
* geo-rep: Fix syntax errors in GsyncdError (Aravinda VK, 2015-11-23; 1 file, -3/+2)

    %s was not replaced by actual values in GsyncdError.

    BUG: 1279921
    Change-Id: I3c0a10f07383ca72844a46f930b4aa3d3c29f568
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/12566
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* geo-rep: Kill Geo-rep Worker when Agent process dies (Aravinda VK, 2015-11-09; 1 file, -10/+44)

    When the Changelog agent process dies, Geo-replication fails to detect
    it and the worker keeps running without its Changelog agent; status
    shows Active/Passive without any progress. With this patch, the worker
    process gets killed whenever the Changelog agent dies.

    Change-Id: I30b4cc77f924f7e8174b8bfe415ac17f0b3851b4
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    BUG: 1277076
    Reviewed-on: http://review.gluster.org/12485
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
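    A sketch of the watch-and-kill pairing described above (pids
    illustrative, not the exact monitor code):

    ```
    import os
    import signal

    def watch_agent(agent_pid, worker_pid):
        # Block until the changelog agent exits (this also reaps it,
        # avoiding a zombie), then take the worker down with it so the
        # pair never runs half-alive.
        os.waitpid(agent_pid, 0)
        try:
            os.kill(worker_pid, signal.SIGKILL)
        except OSError:
            pass  # worker already gone
    ```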
* geo-rep: Avoid cold tier bricks during ENTRY operation (Saravanakumar Arumugam, 2015-10-26; 1 file, -4/+13)

    This is part of a series of patches which aim to fix geo-replication
    in a Tiering volume.

    Problem: Consider a file placed in the volume before the hot tier is
    attached. During any operation on the file, a lookup creates a linkto
    file in the hot tier. Now, any namespace operation carried out on the
    file is recorded in both the cold and the hot tier, leaving room for
    races when both changelogs are replayed.

    Solution: Replay (namespace-related) operations only from the hot
    tier.

    Why?
    a. If the file is placed directly in the hot tier, all fops are
       recorded in the HOT tier.
    b. If the file is already present in the cold tier and any fop is
       carried out, a linkto file is created in the hot tier. Operations
       like UNLINK and RENAME are then captured in the hot tier (by means
       of the linkto file).

    This way, both tiers' operations are available in the HOT tier itself.
    Once a file is demoted to the COLD tier, namespace operations recorded
    in the cold tier can be skipped, as they are already recorded in the
    HOT tier.

    How?
    1. Check whether the brick is in the cold tier and skip ENTRY
       operations.
    2. Also, if it is a cold tier brick, use Xsync (which is used during
       the initial run). This helps in getting all cold tier brick changes
       using a file system crawl and avoids races with hot tier bricks
       (which could happen if historychangelog were used on cold tier
       bricks).

    Dependent patches:
    1. http://review.gluster.org/12239
    2. http://review.gluster.org/12326

    Change-Id: I7692b1dbb8813a7e253451bca02f8f09a5782dde
    BUG: 1266875
    Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com>
    Reviewed-on: http://review.gluster.org/12355
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
* geo-rep: Status Enhancements (Aravinda VK, 2015-05-05; 1 file, -39/+19)

    Discussion on gluster-devel:
    http://www.gluster.org/pipermail/gluster-devel/2015-April/044301.html

    MASTER NODE   - Master Volume Node
    MASTER VOL    - Master Volume name
    MASTER BRICK  - Master Volume Brick
    SLAVE USER    - Slave User to which the Geo-rep session is established
    SLAVE         - <SLAVE_NODE>::<SLAVE_VOL> used in the Geo-rep create command
    SLAVE NODE    - Slave Node to which the Master worker is connected
    STATUS        - Worker Status (Created, Initializing, Active, Passive,
                    Faulty, Paused, Stopped)
    CRAWL STATUS  - Crawl type (Hybrid Crawl, History Crawl, Changelog Crawl)
    LAST_SYNCED   - Last Synced Time (local time in CLI output, UTC in XML output)
    ENTRY         - Number of entry operations pending (resets on worker restart)
    DATA          - Number of data operations pending (resets on worker restart)
    META          - Number of meta operations pending (resets on worker restart)
    FAILURES      - Number of failures
    CHECKPOINT TIME            - Checkpoint set time (local time in CLI output,
                                 UTC in XML output)
    CHECKPOINT COMPLETED       - Yes/No or N/A
    CHECKPOINT COMPLETION TIME - Checkpoint completed time (local time in CLI
                                 output, UTC in XML output)

    XML output (element names reconstructed with angle brackets lost in
    the web rendering):

        <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <cliOutput>
          <geoRep>
            <volume>
              <name>
              <sessions>
                <session>
                  <session_slave>
                  <pair>
                    <master_node>
                    <master_brick>
                    <slave_user>
                    <slave/>
                    <slave_node>
                    <status>
                    <crawl_status>
                    <entry>
                    <data>
                    <meta>
                    <failures>
                    <checkpoint_completed>
                    <master_node_uuid>
                    <last_synced>
                    <checkpoint_time>
                    <checkpoint_completion_time>

    BUG: 1212410
    Change-Id: I944a6c3c67f1e6d6baf9670b474233bec8f61ea3
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/10121
    Tested-by: NetBSD Build System
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* geo-rep: Adhering to the common storage for geo-rep (Kotresh HR, 2015-04-27; 1 file, -20/+0)

    Making geo-rep use the common storage shared by nfs, snapshot and
    geo-rep. The meta volume should be named gluster_shared_storage and
    should be mounted at "/var/run/gluster/shared_storage/". Geo-rep
    creates a directory called 'geo-rep' in the meta volume, and all the
    lock files are created inside it.

    Change-Id: I82d0bff9be191f75f643606a9a21d53559047ac4
    BUG: 1210344
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/10196
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Tested-by: NetBSD Build System
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
* feature/geo-rep: Active Passive Switching logic flock (Kotresh HR, 2015-03-15; 1 file, -1/+44)

    CURRENT DESIGN AND ITS LIMITATIONS:
    -----------------------------------
    Geo-replication syncs changes across geography using changelogs
    captured by the changelog translator, which sits on the server side
    just above the posix translator. Hence, in a distributed replicated
    setup, both replica pairs collect changelogs w.r.t. their bricks.
    Geo-replication syncs the changes using only one brick among the
    replica pair at a time, calling it "ACTIVE" and the other, non-syncing
    brick "PASSIVE".

    Consider the example below of a distributed replicated setup where
    NODE-1 has b1 and its replicated brick b1r is on NODE-2:

        NODE-1    NODE-2
          b1        b1r

    At the beginning, geo-replication chooses to sync changes from
    NODE-1:b1, and NODE-2:b1r is "PASSIVE". The logic depends on the
    virtual getxattr 'trusted.glusterfs.node-uuid', which always returns
    the first up subvolume, i.e., NODE-1. When NODE-1 goes down, the xattr
    returns NODE-2, which is made 'ACTIVE'. But when NODE-1 comes back,
    the xattr returns NODE-1 again and it is made 'ACTIVE' again. So for a
    brief interval of time, if NODE-2 has not finished processing the
    changelog, both NODE-2 and NODE-1 are ACTIVE, causing the rename race
    mentioned in the bug.

    SOLUTION:
    ---------
    1. Have a shared replicated storage, a glusterfs management volume
       specific to geo-replication.
    2. Geo-rep creates a file per replica set on the management volume.
    3. An fcntl lock on the above file is used for synchronization between
       geo-rep workers belonging to the same replica set.
    4. If the management volume is not configured, geo-replication falls
       back to the previous logic of using the first up subvolume.

    Each worker tries to lock the file on the shared storage; whoever wins
    becomes ACTIVE. This solves the problem, but there is an issue when
    the shared replicated storage goes down (all replicas down): the lock
    state is lost, so AFR needs to rebuild the lock state after the brick
    comes back up.

    NOTE:
    -----
    This patch brings in the prerequisite step of setting up a management
    volume for geo-replication during creation:
    1. Create a mgmt volume for geo-replication and start it. The
       management volume should be part of the master cluster; a three-way
       replicated volume with each brick on a different node is
       recommended for availability.
    2. Create the geo-rep session.
    3. Configure the mgmt volume for the geo-replication session as
       follows:

           gluster vol geo-rep <mastervol> slavenode::<slavevol> config \
               meta_volume <meta-vol-name>

    4. Start the geo-rep session.

    Backward Compatibility:
    -----------------------
    If the management volume is not configured, it falls back to the
    previous logic of using the node-uuid virtual xattr, but that is not
    recommended.

    Change-Id: I7319d2289516f534b69edd00c9d0db5a3725661a
    BUG: 1196632
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/9759
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
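    Step 3 of the solution can be sketched as below (lock file path
    illustrative; the real code keeps the fd open for as long as the
    worker remains ACTIVE):

    ```
    import fcntl
    import os

    def try_become_active(lock_path):
        # One lock file per replica set on the shared meta volume; the
        # worker that wins the fcntl lock becomes ACTIVE, the rest stay
        # PASSIVE. The fd must stay open to keep holding the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd      # ACTIVE: caller keeps the fd open
        except (IOError, OSError):
            os.close(fd)
            return None    # PASSIVE: another replica holds the lock
    ```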
* geo-rep: Fixing the typo errors (arao, 2015-02-03; 1 file, -1/+1)

    Change-Id: Iacc67e4ba9ac45e0858f3befe84ffb8fccf7e1c3
    BUG: 1075417
    Signed-off-by: arao <arao@redhat.com>
    Reviewed-on: http://review.gluster.org/9502
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
* geo-rep: Handle Volume status error while getting slave nodes (Aravinda VK, 2015-01-12; 1 file, -1/+6)

    The gluster volume status command does not return XML output when an
    error such as "Transaction in Progress" occurs; we need to handle the
    return code along with the XML error.

    BUG: 1151412
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Change-Id: Id5b7712df7cff58744b4c5a0d00870aec1d926a8
    Reviewed-on: http://review.gluster.org/9432
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
* geo-rep: Failover when a Slave node goes down (Aravinda VK, 2014-10-19; 1 file, -5/+64)

    When a slave node goes down, the worker in the master node can connect
    to a different slave node and resume operation. Existing georep was
    not checking the status of the slave node before worker restart. With
    this patch, the geo-rep worker checks the node status using
    `gluster volume status` when it goes faulty.

    BUG: 1151412
    Change-Id: If3ab7fdcf47f5b3f3ba383c515703c5f1f9dd668
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/8921
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
* geo-rep: Fix the fd leak in worker/agent spawn (Aravinda VK, 2014-06-30; 1 file, -2/+5)

    Worker and agent use a pipe to communicate: if the worker dies for
    some reason, the agent should get EOF and terminate. Each worker-agent
    pair is spawned in a thread; due to a race, multiple workers on the
    same node could retain the pipe fd references of other workers, so an
    agent would not get EOF even when its worker died.

    BUG: 1114003
    Change-Id: I36b9709b9392299483606bd3ef1db764fa3f2bff
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/8194
    Tested-by: Justin Clift <justin@gluster.org>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
* feature/geo-rep: Fix to retain pause state of gsyncd on restart (Kotresh H R, 2014-06-20; 1 file, -7/+18)

    On soft reboot, the geo-rep monitor was writing 'faulty' into the
    status file. It should not do so if the previous state is 'paused', as
    glusterd depends on the state file on node restart.

    Change-Id: Idd45abf13350b087371935f1b4f6e1a346433d27
    BUG: 1101410
    Signed-off-by: Kotresh H R <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/8097
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
* feature/geo-rep: Fix for changelog agent becoming zombie. (Kotresh H R, 2014-06-17; 1 file, -4/+12)

    The monitor process spawns the changelog agent but does not wait on
    it, so the agent becomes a zombie. When the worker dies or is killed,
    the monitor respawns both the worker and the corresponding agent,
    leaving the earlier changelog agent in the zombie state. This patch
    addresses the issue by waiting on the agent process in the monitor
    process.

    Change-Id: I571b7d6487133848edca67e7446f1caa70ae01c9
    BUG: 1103643
    Signed-off-by: Kotresh H R <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/7956
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
* feature/geo-rep: Fix to retain pause state of gsyncd on restart. (Kotresh H R, 2014-06-05; 1 file, -2/+7)

    A new gsyncd option '--pause-on-start' is introduced. When a node
    reboots and the status is paused, gsyncd is started with this option.
    After gsyncd spawns the worker and agent, the worker sends SIGSTOP to
    the negative pid of the monitor to enter pause mode.

    Change-Id: I5aad82c9a9fc8c243f384940b77d25e26e520d6d
    BUG: 1101410
    Signed-off-by: Kotresh H R <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/7885
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
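    Signalling a negative pid targets the whole process group, which is
    what lets a single signal pause the worker, ssh and rsync together;
    roughly (pid illustrative, a sketch rather than the exact gsyncd
    code):

    ```
    import os
    import signal

    def pause(monitor_pid):
        # Negative pid == process group: worker, ssh and rsync all stop,
        # while the separately-grouped changelog processing keeps running.
        os.kill(-monitor_pid, signal.SIGSTOP)

    def resume(monitor_pid):
        os.kill(-monitor_pid, signal.SIGCONT)
    ```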
* gsyncd / geo-rep: fix cli query for volinfo fetch (Venky Shankar, 2014-05-25; 1 file, -1/+1)

    With an unprivileged geo-replication session, the monitor was using
    user@slave for the --remote-host option of the gluster cli, thereby
    failing to successfully connect to the slave glusterd. This patch
    fixes the issue by selecting the hostname/IP from the specified slave
    endpoint url.

    For privileged geo-replication sessions, this patch has no effect, as
    the slave endpoint url is just the hostname/IP.

    Change-Id: I88f66c406a8d9a34db7fc626965f949075e3ceac
    BUG: 1077452
    Signed-off-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-on: http://review.gluster.org/7818
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
* gsyncd/geo-rep: Fix remote vol info fetching for non-root (Kotresh H R, 2014-05-15; 1 file, -1/+1)

    Signed-off-by: Kotresh H R <khiremat@redhat.com>
    Change-Id: If1d2cab3fcfe2391105551e54f0b9729a7c204e4
    BUG: 1077452
    Reviewed-on: http://review.gluster.org/7767
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Tested-by: Venky Shankar <vshankar@redhat.com>
* gsyncd / geo-rep: Partial support for Non-root geo-replication. (Venky Shankar, 2014-05-14; 1 file, -1/+1)

    This patch enables geo-replication to be run as an unprivileged user.
    As of now this is partial support, but it is very close to full
    functionality.

    Current limitation:
    * Geo-replication executes Gluster CLI commands on the slave via SSH.
      On a non-root setup, the Gluster CLI would run as an unprivileged
      user, failing to execute the command. As a workaround (for testing),
      setuid(2) the Gluster CLI executable or use the glusterd option to
      accept commands from an unprivileged CLI process.

    The commands in question are "system::" commands (for key management)
    and remote volume info fetching. Remote volume info fetching has been
    modified to use the gluster cli's --remote-host option rather than ssh
    and remote cli execution.

    Change-Id: Ica89e2ba9b7f48fd6e1c876c477d7822dc693617
    BUG: 1077452
    Signed-off-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-on: http://review.gluster.org/7658
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
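    With that change, remote volume info fetching looks roughly like the
    sketch below (function name illustrative; --remote-host and --xml are
    standard gluster cli options):

    ```
    import subprocess

    def fetch_slave_volinfo(slave_host, slave_vol):
        # Query the remote glusterd directly with --remote-host instead
        # of running the cli over ssh on the slave.
        return subprocess.check_output(
            ["gluster", "--xml", "--remote-host=%s" % slave_host,
             "volume", "info", slave_vol])
    ```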
* geo-rep: Pause and Resume feature for geo-replication (Aravinda VK, 2014-05-09; 1 file, -2/+32)

    Changelog consumption/processing now happens in a separate process
    group from the monitor. When the monitor process group gets SIGSTOP,
    all worker processes, ssh and rsync are paused, except the changelog
    processing; on SIGCONT it resumes operation. The changelog agent runs
    as a RepceServer, and the geo-rep worker communicates with it using a
    RepceClient.

    Change-Id: I35c333e4d8b13d03a7808aed601960eef23cfa04
    BUG: 1093602
    Signed-off-by: Venky Shankar <vshankar@redhat.com>
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/7322
* geo-rep: code pep8/flake8 fixes (Aravinda VK, 2014-04-07; 1 file, -25/+53)

    pep8 is a style guide for python:
    http://legacy.python.org/dev/peps/pep-0008/

    pep8 can be installed using `pip install pep8`.

    Usage: `pep8 <python file>`. For example, `pep8 master.py` will
    display all the coding standard errors.

    flake8 is used to identify unused imports and other issues in code:

        pip install flake8
        cd $GLUSTER_REPO/geo-replication/
        flake8 syncdaemon

    Updated license headers in each source file.

    Change-Id: I01c7d0a6091d21bfa48720e9fb5624b77fa3db4a
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/7311
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-by: Prashanth Pai <ppai@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
* geo-rep: Fix ValueError - signal only works in main thread (Aravinda VK, 2014-03-20; 1 file, -8/+7)

    When a worker process was not confirmed within 60 seconds of start,
    the monitor thread was terminated instead of stopping and restarting
    the worker. Before terminating, the monitor thread tried to add a
    signal handler for SIGTERM to clean up before exit. Signal handling
    does not work inside a thread, so a ValueError was raised. With this
    patch the monitor thread is not terminated; instead, only the worker
    is killed and restarted.

    Change-Id: I14df26c0cc3097af29293c81536c13b86075e28f
    BUG: 1078068
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/7294
    Reviewed-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
    Tested-by: Vijay Bellur <vbellur@redhat.com>
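    The underlying CPython restriction is easy to reproduce:

    ```
    import signal
    import threading

    def in_thread():
        try:
            # CPython only allows installing signal handlers from the
            # main thread; anywhere else this raises ValueError.
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
        except ValueError as e:
            print("as hit by the monitor thread:", e)

    t = threading.Thread(target=in_thread)
    t.start()
    t.join()
    ```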
* geo-rep: logrotate: Logrotate handling (Aravinda VK, 2013-10-03; 1 file, -13/+0)

    In existing georep, logrotate was implemented by handling SIGSTOP and
    SIGCONT, and gsyncd was failing to start again after SIGSTOP. The new
    approach uses WatchedFileHandler in logging, which tracks log file
    changes or logrotate: it reopens the log file if logrotate is
    triggered or if the same log file is updated by another process.

    As per the python docs
    (http://docs.python.org/2/library/logging.handlers.html):

        The WatchedFileHandler class, located in the logging.handlers
        module, is a FileHandler which watches the file it is logging to.
        If the file changes, it is closed and reopened using the file
        name. A file change can happen because of usage of programs such
        as newsyslog and logrotate which perform log file rotation. This
        handler, intended for use under Unix/Linux, watches the file to
        see if it has changed since the last emit. (A file is deemed to
        have changed if its device or inode have changed.) If the file
        has changed, the old file stream is closed, and the file opened
        to get a new stream.

    Change-Id: I30f65eb1e9778b12943d6e43b60a50344a7885c6
    BUG: 1012776
    Signed-off-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-on: http://review.gluster.org/5968
    Reviewed-by: Amar Tumballi <amarts@redhat.com>
    Reviewed-by: Harshavardhana <harsha@harshavardhana.net>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Anand Avati <avati@redhat.com>
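    Wiring a logger to this handler is a small change; a sketch (log path
    illustrative):

    ```
    import logging
    from logging.handlers import WatchedFileHandler

    # WatchedFileHandler checks the file's device/inode before each emit
    # and reopens the file when logrotate has moved it, so no signal
    # dance (SIGSTOP/SIGCONT) is needed anymore.
    handler = WatchedFileHandler(
        "/var/log/glusterfs/geo-replication/gsyncd.log")
    handler.setFormatter(
        logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s"))
    logging.getLogger().addHandler(handler)
    ```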
* gsyncd / geo-rep: distributify slave (Venky Shankar, 2013-09-04; 1 file, -14/+1)

    Commit fbb8fd92 introduced slave distributification but had some
    problems (the monitor would crash upon gsyncd start). This patch fixes
    the issue and makes the code more pythonic ;)

    Change-Id: I2cbf5669d81966046a4aeeb4a6ad11a947aa8f09
    BUG: 1003807
    Signed-off-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-by: Amar Tumballi <amarts@redhat.com>
    Tested-by: Amar Tumballi <amarts@redhat.com>
    Reviewed-on: http://review.gluster.org/5761
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Avra Sengupta <asengupt@redhat.com>
    Tested-by: Avra Sengupta <asengupt@redhat.com>
    Reviewed-by: Anand Avati <avati@redhat.com>
* geo-replication: fix the logic of choosing the remote node to sync (Venky Shankar, 2013-09-04; 1 file, -2/+11)

    Change-Id: Ie15636357d89e94b6bfad0e168b1fcad53508c47
    BUG: 1003807
    Signed-off-by: Amar Tumballi <amarts@redhat.com>
    Signed-off-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-on: http://review.gluster.org/5759
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Avra Sengupta <asengupt@redhat.com>
    Tested-by: Avra Sengupta <asengupt@redhat.com>
    Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: Saving geo-rep session details in a more specific path (Venky Shankar, 2013-09-04; 1 file, -1/+1)

    Session details are now saved in the
    /var/lib/glusterd/geo-replication/<mastervol>_<slaveip>_<slavevol>
    repo, to distinguish between two master-slave sessions where the slave
    name is the same across two different clusters.

    Change-Id: I57c93f55cc9bd4fe2bffe579028aaf5e4335b223
    BUG: 991501
    Signed-off-by: Avra Sengupta <asengupt@redhat.com>
    Signed-off-by: Venky Shankar <vshankar@redhat.com>
    Reviewed-on: http://review.gluster.org/5488
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Anand Avati <avati@redhat.com>