glusterfs.git -

	Commit message (Collapse)	Author	Age	Files	Lines
*	cluster/afr: Delay post-op for fsync	Pranith Kumar K	2020-08-27	3	-0/+175
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: AFR doesn't delay post-op for fsync fop. For fsync heavy workloads this leads to un-necessary fxattrop/finodelk for every fsync leading to bad performance. Fix: Have delayed post-op for fsync. Add special flag in xdata to indicate that afr shouldn't delay post-op in cases where either the process will terminate or graph-switch would happen. Otherwise it leads to un-necessary heals when the graph-switch/process-termination happens before delayed-post-op completes. Fixes: #1253 Change-Id: I531940d13269a111c49e0510d49514dc169f4577 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	features/shard: Aggregate file size, block-count before unwinding removexattr	Krutika Dhananjay	2020-07-13	1	-0/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Posix translator returns pre and postbufs in the dict in {F}REMOVEXATTR fops. These iatts are further cached at layers like md-cache. Shard translator, in its current state, simply returns these values without updating the aggregated file size and block-count. This patch fixes this problem. Change-Id: I4b2dd41ede472c5829af80a67401ec5a6376d872 Fixes: #1243 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit 32519525108a2ac6bcc64ad931dc8048d33d64de)
*	cluster/afr: Prioritize ENOSPC over other errors	karthik-us	2020-06-22	1	-0/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In a replicate/arbiter volume if file creations or writes fails on quorum number of bricks and on one brick it is due to ENOSPC and on other brick it fails for a different reason, it may fail with errors other than ENOSPC in some cases. Fix: Prioritize ENOSPC over other lesser priority errors and do not set op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any error_no which can be misinterpreted by __afr_dir_write_finalize(). Also removing the function afr_has_arbiter_fop_cbk_quorum() which might consider a successful reply form a single brick as quorum success in some cases, whereas we always need fop to be successful on quorum number of bricks in arbiter configuration. Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08 Fixes: #1254 Signed-off-by: karthik-us <ksubrahm@redhat.com> (cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)
*	afr: more quorum checks in lookup and new entry marking	Ravishankar N	2020-06-18	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: See github issue for details. Fix: -In lookup if the entry exists in 2 out of 3 bricks, don't fail the lookup with ENOENT just because there is an entrylk on the parent. Consider quorum before deciding. -If entry FOP does not succeed on quorum no. of bricks, do not perform new entry mark. Fixes: #1303 Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit c4a6748f25d2c1ab3ebcf89952278ebf94c8d371)
*	features/shard: Aggregate size, block-count in iatt before unwinding setxattr	Krutika Dhananjay	2020-06-15	1	-0/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Posix translator returns pre and postbufs in the dict in {F}SETXATTR fops. These iatts are further cached at layers like md-cache. Shard translator, in its current state, simply returns these values without updating the aggregated file size and block-count. This patch fixes this problem. Change-Id: I4da0eceb4235b91546df79270bcc0af8cd64e9ea Fixes: #1243 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit 29ec66c6ab77e2d6893c6e213a3d1fb148702c99)
*	open-behind: rewrite of internal logic	Xavi Hernandez	2020-06-15	5	-0/+872
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There was a critical flaw in the previous implementation of open-behind. When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak. To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use. To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor. The whole logic of handling background opens have been simplified and it's more efficient now. Only if the fop needs to be delayed until an open completes, a stub is created. Otherwise no memory allocations are needed. Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler. Change-Id: I6376a5491368e0e1c283cc452849032636261592 Fixes: #1225 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	tests: skip tests on absence of reflink in xfs	Pranith Kumar K	2020-06-10	1	-7/+9
\| \| \| \| \| \|	Fixes: #1223 Change-Id: I36cb72d920ffd77405051546615c5262c392daef Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	afr: event gen changes	Ravishankar N	2020-05-09	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The general idea of the changes is to prevent resetting event generation to zero in the inode ctx, since event gen is something that should follow 'causal order'. Change #1: For a read txn, in inode refresh cbk, if event_generation is found zero, we are failing the read fop. This is not needed because change in event gen is only a marker for the next inode refresh to happen and should not be taken into account by the current read txn. Change #2: The event gen being zero above can happen if there is a racing lookup, which resets even get (in afr_lookup_done) if there are non zero afr xattrs. The resetting is done only to trigger an inode refresh and a possible client side heal on the next lookup. That can be acheived by setting the need_refresh flag in the inode ctx. So replaced all occurences of resetting even gen to zero with a call to afr_inode_need_refresh_set(). Change #3: In both lookup and discover path, we are doing an inode refresh which is not required since all 3 essentially do the same thing- update the inode ctx with the good/bad copies from the brick replies. Inode refresh also triggers background heals, but I think it is okay to do it when we call refresh during the read and write txns and not in the lookup path. The .ts which relied on inode refresh in lookup path to trigger heals are now changed to do read txn so that inode refresh and the heal happens. Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec Fixes: #1179 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Erik Jacobson <erik.jacobson at hpe.com> (cherry picked from commit f0fcd909ad4535b60c9208d4804ebe6afe421a09)
*	snap_scheduler: python3 compatibility and new test case	Sunny Kumar	2020-04-14	1	-0/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: "snap_scheduler.py init" command failing with the below traceback: [root@dhcp43-104 ~]# snap_scheduler.py init Traceback (most recent call last): File "/usr/sbin/snap_scheduler.py", line 941, in <module> sys.exit(main(sys.argv[1:])) File "/usr/sbin/snap_scheduler.py", line 851, in main initLogger() File "/usr/sbin/snap_scheduler.py", line 153, in initLogger logfile = os.path.join(process.stdout.read()[:-1], SCRIPT_NAME + ".log") File "/usr/lib64/python3.6/posixpath.py", line 94, in join genericpath._check_arg_types('join', a, *p) File "/usr/lib64/python3.6/genericpath.py", line 151, in _check_arg_types raise TypeError("Can't mix strings and bytes in path components") from None TypeError: Can't mix strings and bytes in path components Solution: Added the 'universal_newlines' flag to Popen to support backward compatibility. Added a basic test for snapshot scheduler. Backport of: >Upstream patch: >https://review.gluster.org/#/c/glusterfs/+/24257/ >Change-Id: I78e8fabd866fd96638747ecd21d292f5ca074a4e >Fixes: #1134 >Signed-off-by: Sunny Kumar <sunkumar@redhat.com> >(cherry picked from commit a7d7ec066e56ac03bf252c26beb20fdc2c3b6772) Change-Id: I78e8fabd866fd96638747ecd21d292f5ca074a4e Fixes: #1134 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	features/utime: Don't access frame after stack-wind	Pranith Kumar K	2020-04-07	1	-0/+32
\| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: frame is accessed after stack-wind. This can lead to crash if the cbk frees the frame. Fix: Use new frame for the wind instead. Fixes: #832 Change-Id: I64754609f1114b0bbd4d1336fa81a56f2cca6e03 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	write-behind: fix data corruption	Xavi Hernandez	2020-04-07	2	-0/+307
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There was a bug in write-behind that allowed a previous completed write to overwrite the overlapping region of data from a future write. Suppose we want to send three writes (W1, W2 and W3). W1 and W2 are sequential, and W3 writes at the same offset of W2: W2.offset = W3.offset = W1.offset + W1.size Both W1 and W2 are sent in parallel. W3 is only sent after W2 completes. So W3 should always overwrite the overlapping part of W2. Suppose write-behind processes the requests from 2 concurrent threads: Thread 1 Thread 2 <received W1> <received W2> wb_enqueue_tempted(W1) /* W1 is assigned gen X / wb_enqueue_tempted(W2) / W2 is assigned gen X / wb_process_queue() __wb_preprocess_winds() / W1 and W2 are sequential and all * other requisites are met to merge * both requests. / __wb_collapse_small_writes(W1, W2) __wb_fulfill_request(W2) __wb_pick_unwinds() -> W2 / In this case, since the request is * already fulfilled, wb_inode->gen * is not updated. / wb_do_unwinds() STACK_UNWIND(W2) / The application has received the * result of W2, so it can send W3. / <received W3> wb_enqueue_tempted(W3) / W3 is assigned gen X / wb_process_queue() / Here we have W1 (which contains * the conflicting W2) and W3 with * same gen, so they are interpreted * as concurrent writes that do not * conflict. / __wb_pick_winds() -> W3 wb_do_winds() STACK_WIND(W3) wb_process_queue() / Eventually W1 will be * ready to be sent / __wb_pick_winds() -> W1 __wb_pick_unwinds() -> W1 / Here wb_inode->gen is * incremented. */ wb_do_unwinds() STACK_UNWIND(W1) wb_do_winds() STACK_WIND(W1) So, as we can see, W3 is sent before W1, which shouldn't happen. The problem is that wb_inode->gen is only incremented for requests that have not been fulfilled but, after a merge, the request is marked as fulfilled even though it has not been sent to the brick. This allows that future requests are assigned to the same generation, which could be internally reordered. Solution: Increment wb_inode->gen before any unwind, even if it's for a fulfilled request. Special thanks to Stefan Ring for writing a reproducer that has been crucial to identify the issue. Change-Id: Id4ab0f294a09aca9a863ecaeef8856474662ab45 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Fixes: #884
*	afr: mark pending xattrs as a part of metadata heal	Ravishankar N	2020-04-07	1	-0/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...if pending xattrs are zero for all children. Problem: If there are no pending xattrs and a metadata heal needs to be performed, it can be possible that we end up with xattrs inadvertendly deleted from all bricks, as explained in the BZ. Fix: After picking one among the sources as the good copy, mark pending xattrs on all sources to blame the sinks. Now even if this metadata heal fails midway, a subsequent heal will still choose one of the valid sources that it picked previously. Updates: #1067 Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)
*	glusterd: Brick process fails to come up with brickmux on	Vishal Pandey	2020-03-17	1	-1/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Issue: 1- In a cluster of 3 Nodes N1, N2, N3. Create 3 volumes vol1, vol2, vol3 with 3 bricks (one from each node) 2- Set cluster.brick-multiplex on 3- Start all 3 volumes 4- Check if all bricks on a node are running on same port 5- Kill N1 6- Set performance.readdir-ahead for volumes vol1, vol2, vol3 7- Bring N1 up and check volume status 8- All bricks processes not running on N1. Root Cause - Since, There is a diff in volfile versions in N1 as compared to N2 and N3 therefore glusterd_import_friend_volume() is called. glusterd_import_friend_volume() copies the new_volinfo and deletes old_volinfo and then calls glusterd_start_bricks(). glusterd_start_bricks() looks for the volfiles and sends an rpc request to glusterfs_handle_attach(). Now, since the volinfo has been deleted by glusterd_delete_stale_volume() from priv->volumes list before glusterd_start_bricks() and glusterd_create_volfiles_and_notify_services() and glusterd_list_add_order is called after glusterd_start_bricks(), therefore the attach RPC req gets an empty volfile path and that causes the brick to crash. Fix- Call glusterd_list_add_order() and glusterd_create_volfiles_and_notify_services before glusterd_start_bricks() cal is made in glusterd_import_friend_volume > Change-Id: Idfe0e8710f7eb77ca3ddfa1cabeb45b2987f41aa > Bug: bz#1773856 > Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> (cherry picked from commit 45e81aae791da9d013aba2286af44826227c05ec) Change-Id: Idfe0e8710f7eb77ca3ddfa1cabeb45b2987f41aa fixes: bz#1808964 Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	afr: prevent spurious entry heals leading to gfid split-brain	Ravishankar N	2020-02-25	2	-14/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In a hyperconverged setup with granular-entry-heal enabled, if a file is recreated while one of the bricks is down, and an index heal is triggered (with the brick still down), entry-self heal was doing a spurious heal with just the 2 good bricks. It was doing a post-op leading to removal of the filename from .glusterfs/indices/entry-changes as well as erroneous setting of afr xattrs on the parent. When the brick came up, the xattrs were cleared, resulting in the renamed file not getting healed and leading to gfid split-brain and EIO on the mount. Fix: Proceed with entry heal only when shd can connect to all bricks of the replica, just like in data and metadata heal. fixes: bz#1804591 Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)
*	server: Mount fails after reboot 1/3 gluster nodes	Mohit Agrawal	2020-02-10	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: At the time of coming up one server node(1x3) after reboot client is unmounted.The client is unmounted because a client is getting AUTH_FAILED event and client call fini for the graph.The client is getting AUTH_FAILED because brick is not attached with a graph at that moment Solution: To avoid the unmounting the client graph throw ENOENT error from server in case if brick is not attached with server at the time of authenticate clients. > Credits: Xavi Hernandez <xhernandez@redhat.com> > Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e > Fixes: bz#1793852 > Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> > (cherry picked from commit > f6421dff22a6ddaf14134f6894deae219948c89d) Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e Fixes: bz#1794019 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	rpc: Cleanup SSL specific data at the time of freeing rpc object	l17zhou	2020-02-10	1	-3/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: At the time of cleanup rpc object ssl specific data is not freeing so it has become a leak. Solution: To avoid the leak cleanup ssl specific data at the time of cleanup rpc object > Credits: l17zhou <cynthia.zhou@nokia-sbell.com.cn> > Fixes: bz#1768407 > Change-Id: I37f598673ae2d7a33c75f39eb8843ccc6dffaaf0 > (cherry picked from commit > > 54ed71dba174385ab0d8fa415e09262f6250430c) Change-Id: I37f598673ae2d7a33c75f39eb8843ccc6dffaaf0 Fixes: bz#1795540 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	geo-rep: Fix ssh-port validation	Sunny Kumar	2020-02-10	2	-0/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If non-standard ssh-port is used, Geo-rep can be configured to use ssh port by using config option, the value should be in allowed port range and non negative. At present it can accept negative value and outside allowed port range which is incorrect. Many Linux kernels use the port range 32768 to 61000. IANA suggests it should be in the range 1 to 2^16 - 1, so keeping the same. $ gluster volume geo-replication master 127.0.0.1::slave config ssh-port -22 geo-replication config updated successfully $ gluster volume geo-replication master 127.0.0.1::slave config ssh-port 22222222 geo-replication config updated successfully This patch fixes the above issue and have added few validations around this in test cases. Upstream Patch: https://review.gluster.org/#/c/glusterfs/+/24035/ Backport of: > Change-Id: I9875ab3f00d7257370fbac6f5ed4356d2fed3f3c > Fixes: bz#1792276 > Signed-off-by: Sunny Kumar <sunkumar@redhat.com> > (cherry picked from commit 485212e858bddd97573a3b2b811357b0d822005a) Change-Id: I9875ab3f00d7257370fbac6f5ed4356d2fed3f3c Fixes: bz#1793412 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	performance/md-cache: Do not skip caching of null character xattr values	Anoop C S	2019-12-19	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Null character string is a valid xattr value in file system. But for those xattrs processed by md-cache, it does not update its entries if value is null('\0'). This results in ENODATA when those xattrs are queried afterwards via getxattr() causing failures in basic operations like create, copy etc in a specially configured Samba setup for Mac OS clients. On the other side snapview-server is internally setting empty string("") as value for xattrs received as part of listxattr() and are not intended to be cached. Therefore we try to maintain that behaviour using an additional dictionary key to prevent updation of entries in getxattr() and fgetxattr() callbacks in md-cache. Credits: Poornima G <pgurusid@redhat.com> Change-Id: I7859cbad0a06ca6d788420c2a495e658699c6ff7 Fixes: bz#1785228 Signed-off-by: Anoop C S <anoopcs@redhat.com> (cherry picked from commit b4b683736367d93daad08a5ee6ca95778c07c5a4)
*	test: fix non-root test case for geo-rep	Sunny Kumar	2019-12-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: On a freshly installed system non-root geo-rep test case gets blocked. Solution: On a freshly installed system, the remote key need to be accepted automatically by ssh-copy-id. Credits: M. Scherer <mscherer@redhat.com> Backport of: > Change-Id: I5077f99a6681660f7e3e84c25ef216f521b7c29c > Fixes: bz#1779742 > Signed-off-by: Sunny Kumar <sunkumar@redhat.com> Change-Id: I5077f99a6681660f7e3e84c25ef216f521b7c29c Fixes: bz#1784790 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	test: fix suspicous non-root geo-rep test failures	Sunny Kumar	2019-11-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Export of env variable is required for ssh-copy-id command. Backport of: >fixes: bz#1765426 >Change-Id: Icaf7a848cb8f4ae9f887d885a8c5bb71f26633b4 >Signed-off-by: Sunny Kumar <sunkumar@redhat.com> >(cherry picked from commit febfa9f2ec9dfc5dbf4a68c3518f98364ebc461) Change-Id: Ic244b065db9959c0c6ba952955f0f68e3f96e925 fixes: bz#1765431 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	cluster/afr: Heal entries when there is a source & no healed_sinks	karthik-us	2019-11-14	1	-0/+89
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In a situation where B1 blames B2, B2 blames B1 and B3 doesn't blame anything for entry heal, heal will not complete even though we have clear source and sinks. This will happen because while doing afr_selfheal_find_direction() only the bricks which are blamed by non-accused bricks are considered as sinks. Later in __afr_selfheal_entry_finalize_source() when it tries to mark all the non-sources as sinks it fails to do so because there won't be any healed_sinks marked, no witness present and there will be a source. Fix: If there is a source and no healed_sinks, then reset all the locked sources to 0 and healed sinks to 1 to do conservative merge. Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548 Fixes: bz#1760699 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	afr: support split-brain CLI for replica 3	Ravishankar N	2019-11-13	1	-0/+111
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ever since we added quorum checks for lookups in afr via commit bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4, the split-brain resolution commands would not work for replica 3 because there would be no readables for the lookup fop. The argument was that split-brains do not occur in replica 3 but we do see (data/metadata) split-brain cases once in a while which indicate that there are a few bugs/corner cases yet to be discovered and fixed. Fortunately, commit 8016d51a3bbd410b0b927ed66be50a09574b7982 added GF_CLIENT_PID_GLFS_HEALD as the pid for all fops made by glfsheal. If we leverage this and allow lookups in afr when pid is GF_CLIENT_PID_GLFS_HEALD, split-brain resolution commands will work for replica 3 volumes too. Likewise, the check is added in shard_lookup as well to permit resolving split-brains by specifying "/.shard/shard-file.xx" as the file name (which previously used to fail with EPERM). Change-Id: I3c543dea79caf7cfbc1633e9089cb1cdd2538ba9 Fixes: bz#1760791 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 47dbd753187f69b3835d2e42fdbe7485874c4b3e)
*	geo-rep: Fix Permission denied traceback on non root setup	Kotresh HR	2019-11-06	1	-6/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: While syncing rename of directory in hybrid crawl, geo-rep crashes as below. Traceback (most recent call last): File "/usr/local/libexec/glusterfs/python/syncdaemon/repce.py", line 118, in worker res = getattr(self.obj, rmeth)(in_data[2:]) File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 588, in entry_ops src_entry = get_slv_dir_path(slv_host, slv_volume, gfid) File "/usr/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 687, in get_slv_dir_path [ENOENT], [ESTALE]) File "/usr/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap return call(arg) PermissionError: [Errno 13] Permission denied: '/bricks/brick1/b1/.glusterfs/8e/c0/8ec0fcd4-d50f-4a6e-b473-a7943ab66640' Cause: Conversion of gfid to path for a directory uses readlink on backend .glusterfs gfid path. But this fails for non root user with permission denied. Fix: Use gfid2path interface to get the path from gfid Backport of: > Patch: https://review.gluster.org/23570 > Change-Id: I9d40c713a1b32cea95144cbc0f384ada82972222 > BUG: 1763439 > Signed-off-by: Kotresh HR <khiremat@redhat.com> Change-Id: I9d40c713a1b32cea95144cbc0f384ada82972222 fixes: bz#1764030 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	geo-rep: Fix config upgrade on non-participating node	Kotresh HR	2019-11-06	2	-0/+179
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After upgrade, if the config files are of old format, it gets migrated to new format. Monitor process migrates it. Since monitor doesn't run on nodes where bricks are not hosted, it doesn't get migrated there. So this patch fixes the config upgrade on nodes which doesn't host bricks. This happens during config either on get/set/reset. Backport of: Patch: https://review.gluster.org/23555 Change-Id: Ibade2f2310b0f3affea21a3baa1ae0eb71162cba Signed-off-by: Kotresh HR <khiremat@redhat.com> BUG: 1762220 Change-Id: Ibade2f2310b0f3affea21a3baa1ae0eb71162cba fixes: bz#1764028 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	tests : test case for non-root geo-rep setup	Sunny Kumar	2019-11-06	1	-0/+251
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Added test case for non-root geo-rep setup. Backport of: > Patch: https://review.gluster.org/22902 > Change-Id: Ib6ebee79949a9f61bdc5c7b5e11b51b262750e98 > BUG: 1717827 > Signed-off-by: Sunny Kumar <sunkumar@redhat.com> > Signed-off-by: Kotresh HR <khiremat@redhat.com> Change-Id: Ib6ebee79949a9f61bdc5c7b5e11b51b262750e98 fixes: bz#1764026 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	geo-rep: Test case for upgrading config file	Shwetha K Acharya	2019-11-06	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Added test case for the patch https://review.gluster.org/#/c/glusterfs/+/22894/4 Also updated if else structure in gsyncdconfig.py to avoid repeated occurance of values in new configfile. Backport of: > Patch: https://review.gluster.org/22982 > BUG: 1707731 > Change-Id: If97e1d37ac52dbd17d47be6cb659fc5a3ccab6d7 > Signed-off-by: Shwetha K Acharya <sacharya@redhat.com> fixes: bz#1764003 Change-Id: If97e1d37ac52dbd17d47be6cb659fc5a3ccab6d7 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	tests: Fix spurious failure	Pranith Kumar K	2019-11-06	1	-2/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If heal from next brick starts after the first brick completes heal, then opendir on the brick can change atime leading to failure of the test. When ctime is disabled it is better to just check mtime to be same after heal. Backport of: > BUG: 1751134 > Change-Id: Ia03e30fd547e6bbe85c1e299845ffa122f3a2692 > Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> (cherry picked from commit 0e37cdf271a48d3e58c212e95664a2aa34da3940) fixes: bz#1769320 Change-Id: Ia03e30fd547e6bbe85c1e299845ffa122f3a2692 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	gfapi: 'glfs_h_creat_open' - new API to create handle and open fd	Soumya Koduri	2019-09-26	2	-0/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Right now we have two separate APIs, one - 'glfs_h_creat_handle' to create handle & another - 'glfs_h_open' to create a glfd to return to application Having two separate routines can result in access errors while trying to create and write into a read-only file. Since a fd is opened even during file/directory creation, introducing a new API to make these two operations atomic i.e, which can create both handle & fd and pass them to application This is backport of below mainline patch - - https://review.gluster.org/#/c/glusterfs/+/23448/ - bz#1753569 release-6: - https://review.gluster.org/#/c/glusterfs/+/23491/ Change-Id: Ibf513fcfcdad175f4d7eb6fa7a61b8feec6d33b5 fixes: bz#1756002 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
*	ctime/rebalance: Heal ctime xattr on directory during rebalance	Kotresh HR	2019-09-16	8	-1/+497
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After add-brick and rebalance, the ctime xattr is not present on rebalanced directories on new brick. This patch fixes the same. Note that ctime still doesn't support consistent time across distribute sub-volume. This patch also fixes the in-memory inconsistency of time attributes when metadata is self healed. Backport of: > Patch: https://review.gluster.org/23127/ > Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df > BUG: 1734026 > Signed-off-by: Kotresh HR <khiremat@redhat.com> (cherry picked from commit 304640e55c0f3c6d15f4e230dc6376e4f5020fea) Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df Signed-off-by: Kotresh HR <khiremat@redhat.com> fixes: bz#1752429
*	afr/lookup: Pass xattr_req in while doing a selfheal in lookup	Mohammed Rafi KC	2019-09-11	2	-0/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We were not passing xattr_req when doing a name self heal as well as a meta data heal. Because of this, some xdata was missing which causes i/o errors Backport of > https://review.gluster.org/#/c/glusterfs/+/23024/ >Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f >Fixes: bz#1728770 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Fixes: bz#1749305 Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> (cherry picked from commit d026f0bcfd301712e4f0671ccf238f43f2e6dd30)
*	tests: fix spurious failure of bug-1402841.t-mt-dir-scan-race.t	Ravishankar N	2019-09-05	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Since commit 600ba94183333c4af9b4a09616690994fd528478, shd starts healing as soon as it is toggled from disabled to enabled. This was causing the following line in the .t to fail on a 'fast' machine (always on my laptop and sometimes on the jenkins slaves). EXPECT_NOT "^0$" get_pending_heal_count $V0 because by the time shd was disabled, the heal was already completed. Fix: Increase the no. of files to be healed and make it a variable called FILE_COUNT, should we need to bump it up further because the machines become even faster. Also created pending metadata heals to increase the time taken to heal a file. fixes: bz#1749155 Change-Id: I5a26b08e45b8c19bce3c01ce67bdcc28ed48198d Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 724c657995a2e148243eeb78c68b620c6d7714a5)
*	afr: wake up index healer threads	Ravishankar N	2019-08-30	1	-0/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	...whenever shd is re-enabled after disabling or there is a change in `cluster.heal-timeout`, without needing to restart shd or waiting for the current `cluster.heal-timeout` seconds to expire. See BZ 1743988 for more details. Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe fixes: bz#1747301 Reported-by: Glen Kiessling <glenk1973@hotmail.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 600ba94183333c4af9b4a09616690994fd528478)
*	glusterd: ./tests/bugs/glusterd/bug-1595320.t is failing	Mohit Agrawal	2019-08-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: sometime ./tests/bugs/glusterd/bug-1595320.t is failing is failing at the time of checking brick_process after sending a kill signal to brick process Solution: Wait sometime after just sending a kill signal to brick process to make sure brick process is stopped > Change-Id: Iee9e91284618abfc62a550d47e4f9117785def58 > Fixes: bz#1743200 > Signed-off-by: Mohit Agrawal <moagrawal@redhat.com> > (cherry picked from commit 8f1620ad7f5d3d040fee55c5f873349800e2268d) Change-Id: Iee9e91284618abfc62a550d47e4f9117785def58 Fixes: bz#1745422 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	cluster/ec: fix EIO error for concurrent writes on sparse files	Xavi Hernandez	2019-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1739427 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	afr: restore timestamp of parent dir during entry-heal	Ravishankar N	2019-08-21	1	-0/+78
\| \| \| \| \| \| \|	Fixes: bz#1741041 Change-Id: I29e338bac62104233a6f80212df8d0fb016affda Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 8e9c53ebf16705b9a1db2fc486dc24a5cb244ddd)
*	features/shard: Send correct size when reads are sent beyond file size	Krutika Dhananjay	2019-08-21	1	-0/+29
\| \| \| \| \| \| \|	Change-Id: I0cebaaf55c09eb1fb77a274268ff564e871b743b fixes bz#1740316 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit 51237eda7c4b3846d08c5d24d1e3fe9b7ffba1d4)
*	ctime: Set mdata xattr on legacy files	Kotresh HR	2019-08-19	1	-0/+83
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The files which were created before ctime enabled would not have "trusted.glusterfs.mdata"(stores time attributes) xattr. Upon fops which modifies either ctime or mtime, the xattr gets created with latest ctime, mtime and atime, which is incorrect. It should update only the corresponding time attribute and rest from backend Solution: Creating xattr with values from brick is not possible as each brick of replica set would have different times. So create the xattr upon successful lookup if the xattr is not created Note To Reviewers: The time attributes used to set xattr is got from successful lookup. Instead of sending the whole iatt over the wire via setxattr, a structure called mdata_iatt is sent. The mdata_iatt contains only time attributes. Backport of: > Patch: https://review.gluster.org/22936 > Change-Id: I5e535631ddef04195361ae0364336410a2895dd4 > BUG: 1593542 > Signed-off-by: Kotresh HR <khiremat@redhat.com> Change-Id: I5e535631ddef04195361ae0364336410a2895dd4 updates: bz#1739430 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	cluster/afr: Fix incorrect reporting of gfid & type mismatch	karthik-us	2019-07-20	1	-0/+116
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problems: 1. When checking for type and gfid mismatch, if the type or gfid is unknown because of missing gfid handle and the gfid xattr it will be reported as type or gfid mismatch and the heal will not complete. 2. If the source selected during entry heal has null gfid the same will be sent to afr_lookup_and_heal_gfid(). In this function when we try to assign the gfid on the bricks where it does not exist, we are considering the same gfid and try to assign that on those bricks. This will fail in posix_gfid_set() since the gfid sent is null. Fix: If the gfid sent to afr_lookup_and_heal_gfid() is null choose a valid gfid before proceeding to assign the gfid on the bricks where it is missing. In afr_selfheal_detect_gfid_and_type_mismatch(), do not report type/gfid mismatch if the type/gfid is unknown or not set. Change-Id: Ia06552e4dc4a9f89cb7f5302833604bd21bbf7da fixes: bz#1729481 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	glusterd: do not mark skip_locking as true for geo-rep operations	Sanju Rakonde	2019-07-19	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to send the commit req to peers in case of geo-rep operations even though it is a no volname operation. In commit phase peers try to set the txn_opinfo which will fail because it is a no volname operation where we don't require a commit phase. We mark skip_locking as true for no volname operations, but we have to give an exception to geo-rep operations, so that they can set txn_opinfo in commit phase. Please refer to detailed RCA at the bug: 1730543 fixes: bz#1730543 Change-Id: I9f2478b12a281f6e052035c0563c40543493a3fc Signed-off-by: Sanju Rakonde <srakonde@redhat.com> (cherry picked from commit b917974ee922d7a2e079692ad7d6f61f900b37b2)
*	tests: Fix bug-1717819-metadata-split-brain-detection.t failure	karthik-us	2019-07-15	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: tests/bugs/replicate/bug-1717819-metadata-split-brain-detection.t fails intermittently in test cases #49 & #50, which compare the values of the user set xattr values after enabling the heal. We are not waiting for the heal to complete before comparing those values, which might lead those tests to fail. Fix: Wait till the HEAL-TIMEOUT before comparing the xattr values. Also cheking for the shd to come up and the bricks to connect to the shd process in another case. Change-Id: I0e245b328da9df23ce70c5300278fad1c1d9f7ff fixes: bz#1729895 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	test: Fix spurious failures in bug-1040275-brick-uid-reset-on-volume-restart.t	Mohit Agrawal	2019-07-09	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: test case is failing after just starting the volume at the time of running stat command on mount point and client is getting error transport endpoint is not conencted Solution: To avoid the error make sure all brick instance should be up and mount point should be active > Change-Id: I49553a04d5b13e155ee02f4a1888a07fe3ee2ff5 > fixes: bz#1721590 > Signed-off-by: Mohit Agrawal <moagrawal@redhat.com> > (cherry picked from commit 283b77805cca3027e333a11c9b00ac611662c9ee) Change-Id: I49553a04d5b13e155ee02f4a1888a07fe3ee2ff5 fixes: bz#1728182 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	glusterd/thin-arbiter: Thin-arbiter integration with GD1	Vishal Pandey	2019-07-04	2	-0/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	gluster volume create <VOLNAME> replica 2 thin-arbiter 1 <host1>:<brick1> <host2>:<brick2> <thin-arbiter-host>:<path-to-store-replica-id-file> [force] The changes have been made in a way that the last brick in the bricks list will be treated as the thin-arbiter. GD1 will be manipulated to consider replica count to be as 2 and continue creating the volume like any other replica 2 volume but since thin-arbiter volumes need ta-brick client xlator entries for each subvolume in fuse volfile, volfile generation is modified in a way to inject these entries seperately in the volfile for every subvolume. Few more additions - 1- Save the volinfo with new fields ta_bricks list and thin_arbiter_count. 2- Introduce a new option client.ta-brick-port to add remote-port to ta-brick xlator entry in fuse volfiles. The option can be set using the following CLI syntax - gluster volume set <VOLNAME> client.ta-brick-port <PORTNO.> 3- Volume Info will contain a Thin-Arbiter-path entry to distinguish from other replicate volumes. Change-Id: Ib434e2313b29716f32476c6c211d282c4ef39406 Updates #687 Signed-off-by: Vishal Pandey <vpandey@redhat.com> (cherry picked from commit 9b223b15ab69fce4076de036ee162f36a058bcd2)
*	cluster/ec: Prevent double pre-op xattrops	Pranith Kumar K	2019-06-22	1	-0/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Race: Thread-1 Thread-2 1) Does ec_get_size_version() to perform pre-op fxattrop as part of write-1 2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1 3) Completes executing ec_prepare_update_cbk leading to ctx->dirty[] = '1' 4) Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2' At the end of the writes the file will be marked for heal even when it doesn't need heal. Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked as '1' in step 2) above Updates bz#1593224 Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	posix/ctime: Fix ctime upgrade issue	Kotresh HR	2019-06-21	3	-16/+202
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: On a EC volume, during upgrade from the older version where ctime feature is not enabled(or not present) to the newer version where the ctime feature is available (enabled default), the self heal hangs and doesn't complete. Cause: The ctime feature has both client side code (utime) and server side code (posix). The feature is driven from client. Only if the client side sets the time in the frame, should the server side sets the time attributes in xattr. But posix setattr/fseattr was not doing that. When one of the server nodes is updated, since ctime is enabled by default, it starts setting xattr on setattr/fseattr on the updated node/brick. On a EC volume the first two updated nodes(bricks) are not a problem because there are 4 other bricks with consistent data. However once the third brick is updated, the new attribute(mdata xattr) will cause an inconsistency on metadata on 3 bricks, which prevents the file to be repaired. Fix: Don't create mdata xattr with utimes/utimensat system call. Only update if already present. Change-Id: Ieacedecb8a738bb437283ef3e0f042fd49dc4c8c fixes: bz#1720201 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	encryption/crypt: remove from volume file	Amar Tumballi	2019-06-20	3	-457/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The feature is not supported and is moved out of the codebase from glusterfs-5.x release. Doesn't make sense to keep the code to support it. For those who want to upgrade from an version supporting it to higher version, please do a 'gluster volume reset $VOL encryption reset' and then continue with the upgrade process. updates: bz#1648169 Change-Id: I8cf822c0d7195940bd37f6af2432a3cac68d44d1 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	tests: subdir-mount.t is failing for brick_mux regrssion	Mohit Agrawal	2019-06-17	1	-3/+8
\| \| \| \| \| \| \| \| \|	To avoid the failure wait to run hook script S13create-subdir-mounts.sh after executed add-brick command by test case. Change-Id: I063b6d0f86a550ed0a0527255e4dfbe8f0a8c02e fixes: bz#1720993 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	core: improve timer accuracy	Xavier Hernandez	2019-06-17	1	-3/+4
\| \| \| \| \| \| \| \|	Also fixed some issues on test ec-1468261.t. Change-Id: If156f86af986d9eed13cdd1f15c5a7214cd11706 Updates: bz#1193929 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
*	glusterd: add GF_TRANSPORT_BOTH_TCP_RDMA in glusterd_get_gfproxy_client_volfile	Atin Mukherjee	2019-06-17	1	-1/+3
\| \| \| \| \| \| \| \| \|	... with out which volume creation fails with "volume create: <xyz>: failed: Failed to create volume files" Fixes: bz#1716812 Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
*	tests: Add missing NFS test tag to the testfile	Aravinda VK	2019-06-15	1	-0/+2
\| \| \| \| \| \| \| \|	$SRC/glusterfs/bugs/nfs/showmount-many-clients.t Change-Id: I48758cc66fcb55f48c4a8a0a738b06867f6814a1 Signed-off-by: Aravinda VK <avishwan@redhat.com> Updates: bz#1193929
*	gfapi: provide an api for setting statedump path	Amar Tumballi	2019-06-14	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently for an application using glfsapi to use glusterfs, when a statedump is taken, it uses /var/run/gluster dir to dump info. There can be concerns as this directory may be owned by some other user, and hence it may fail taking statedump. Such applications should have an option to use different path. This patch provides an API to do so. Updates: bz#1689097 Change-Id: I8918e002bc823d83614c972b6c738baa04681b23 Signed-off-by: Amar Tumballi <amarts@redhat.com>