glusterfs.git -

	Commit message (Collapse)	Author	Age	Files	Lines
*	cluster/afr: Delay post-op for fsync	Pranith Kumar K	2020-06-08	3	-0/+175
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: AFR doesn't delay post-op for fsync fop. For fsync heavy workloads this leads to un-necessary fxattrop/finodelk for every fsync leading to bad performance. Fix: Have delayed post-op for fsync. Add special flag in xdata to indicate that afr shouldn't delay post-op in cases where either the process will terminate or graph-switch would happen. Otherwise it leads to un-necessary heals when the graph-switch/process-termination happens before delayed-post-op completes. Fixes: #1253 Change-Id: I531940d13269a111c49e0510d49514dc169f4577 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	open-behind: rewrite of internal logic	Xavi Hernandez	2020-06-04	4	-0/+871
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There was a critical flaw in the previous implementation of open-behind. When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak. To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use. To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor. The whole logic of handling background opens have been simplified and it's more efficient now. Only if the fop needs to be delayed until an open completes, a stub is created. Otherwise no memory allocations are needed. Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler. Change-Id: I6376a5491368e0e1c283cc452849032636261592 Fixes: #1225 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	io-cache,quick-read: deprecate volume options with flawed semantics or naming	Csaba Henk	2020-06-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- performance.cache-size has a flawed semantics, as it's dispatched on two independent translators, io-cache and quick-read. - performance.qr-cache-timeout has a confusing name, as other options affecting quick-read have an unabbreviated "quick-read-..." prefix in their names. We keep these options with unchanged operation, but in the help output we indicate their deprecation. The following better alternatives are introduced: - performance.io-cache-size to tune cache-size option of io-cache - performance.quick-read-cache-size to tune cache-size option of quick-read - performance.quick-read-cache-timeout as a preferred synonym for performance.qr-cache-timeout Fixes: #952 Change-Id: Ibd04fb638de8cac450ba992ad8a415154f9f4281 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	dht - sparse files rebalance enhancements	Barak Sason Rofman	2020-06-01	1	-0/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently data migration in rebalance reads sparse file sequentially, disregarding which segments are holes and which are data. This can lead to extremely long migration time for large sparse file. Data migration mechanism needs to be enhanced so only data segments are read and migrated. This can be achieved using lseek to seek for holes and data in the file. This enhancement is a consequence of https://bugzilla.redhat.com/show_bug.cgi?id=1823703 fixes: #1222 Change-Id: If5f448a0c532926464e1f34f504c5c94749b08c3 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
*	mgmt/glusterd: Stop old shd before increasing replica count	Pranith Kumar K	2020-05-16	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In add-brick that increases replica count SHD was restarted after pending xattrs are set on the new bricks and adding bricks. But before restarting SHD there is a possibility that old SHD would do a scan on root-directory see no heal is needed and delete index for root-dir leading to no heals until lookup is executed on the mount Fix: Stop shd, perform pending-xattr setting/adding new bricks and then restart shd Fixes: #1240 Change-Id: I94fd7c6c909211b597185dfe097a559db6c0d00f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	tests: Disable client-heals	Pranith Kumar K	2020-05-15	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: ok 32 [ 11/ 9] < 46> 'gf_rm_file_and_gfid_link /d/backends/patchy0 del-file' not ok 33 [ 13/ 131] < 48> '! dd if=/dev/zero of=/mnt/glusterfs/0/del-file bs=1M count=1 oflag=direct' -> '' The assumption in the test above is that the file wouldn't exist when dd happens. But heal can lead to creation of the file in some cases leading to spurious failures. Fix: Disable client side heal. Fixes: #1245 Change-Id: I96b2b45528f9dfb3199d503a467cafafba9b387f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	Fix ./tests/basic/fencing/afr-lock-heal-basic.t failure	Pranith Kumar K	2020-05-13	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In brick-mux tests, all bricks of the volume have same pid. "generate_brick_statedump" cleans up the older statedumps with same brick pid. So successive calls to this function will delete previous brick's statedump as all bricks share same pid. So grep calls to the statedump were failing leading to failure of the .t To fix this, stored the result we need from statedump before calling next brick's statedump Fixes: #1234 Change-Id: I824ed4dff79e7242b3e980364836b9af0e87a6ee Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	tests: skip tests on absence of reflink in xfs	Pranith Kumar K	2020-05-06	1	-7/+9
\| \| \| \| \| \|	Fixes: #1223 Change-Id: I36cb72d920ffd77405051546615c5262c392daef Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	afr: event gen changes	Ravishankar N	2020-04-24	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The general idea of the changes is to prevent resetting event generation to zero in the inode ctx, since event gen is something that should follow 'causal order'. Change #1: For a read txn, in inode refresh cbk, if event_generation is found zero, we are failing the read fop. This is not needed because change in event gen is only a marker for the next inode refresh to happen and should not be taken into account by the current read txn. Change #2: The event gen being zero above can happen if there is a racing lookup, which resets even get (in afr_lookup_done) if there are non zero afr xattrs. The resetting is done only to trigger an inode refresh and a possible client side heal on the next lookup. That can be acheived by setting the need_refresh flag in the inode ctx. So replaced all occurences of resetting even gen to zero with a call to afr_inode_need_refresh_set(). Change #3: In both lookup and discover path, we are doing an inode refresh which is not required since all 3 essentially do the same thing- update the inode ctx with the good/bad copies from the brick replies. Inode refresh also triggers background heals, but I think it is okay to do it when we call refresh during the read and write txns and not in the lookup path. The .ts which relied on inode refresh in lookup path to trigger heals are now changed to do read txn so that inode refresh and the heal happens. Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec Fixes: #1179 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Erik Jacobson <erik.jacobson at hpe.com>
*	tests: Fix spurious failure of tests/basic/quick-read-with-upcall.t	Pranith Kumar K	2020-04-22	1	-10/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The test is failing at 14:56:41 ok 13, LINENUM:38 14:56:41 not ok 14 Got "test-message0" instead of "test-message1", LINENUM:41 14:56:41 FAILED COMMAND: test-message1 cat /mnt/glusterfs/1/test.txt This happens because fuse sometimes doesn't send 'read' fop to glusterfs and is served from cache. Fix: Mount with direct-io-mode=yes so that read is always received by gluster Fixes: #1190 Change-Id: I369e2024a85dc492dc24c7579b161fb965f55d19 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	tests: do not truncate file offsets and sizes to 32-bit	Dmitry Antipov	2020-04-15	1	-3/+4
\| \| \| \| \| \| \| \| \| \|	Do not truncate file offsets and sizes to 32-bit to prevent tests from spurious failures on >2Gb files. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Change-Id: I2a77ea5f9f415249b23035eecf07129f19194ac2 Fixes: #1161
*	Posix: Optimize posix code to improve file creation	Mohit Agrawal	2020-04-06	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Before executing a fop in POSIX xlator it builds an internal path based on GFID.To validate the path it call's (l)stat system call and while .glusterfs is heavily loaded kernel takes time to lookup inode and due to that performance drops Solution: In this patch we followed two ways to improve the performance. 1) Keep open fd specific to first level directory(gfid[0]) in .glusterfs, it would force to kernel keep the inodes from all those files in cache. In case of memory pressure kernel won't uncache first level inodes. We need to open 256 fd's per brick to access the entry faster. 2) Use at based call's to access relative path to reduce path based lookup time. Note: To verify the patch we have executed kernel untar 100 times on 6 different clients after enabling metadata group-cache and some other option.We were getting more than 20 percent improvement in kenel untar after applying the patch. Credits: Xavi Hernandez <xhernandez@redhat.com> Change-Id: I1643e6b01ed669b2bb148d02f4e6a8e08da45343 updates: #891 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	cluster/ec: Add test for reset-brick command	Ashish Pandey	2020-04-06	1	-0/+50
\| \| \| \| \| \| \| \| \| \| \|	Following tests are done - 1 - After finishing reset-brick all the bricks should be up. 2 - Heal should be completed. 3 - Check number of entries present on brick which was reset. Change-Id: I9314bed180293a99d400d94bb8cc7ece999da29e Updates: #1144
*	snap_scheduler: python3 compatibility and new test case	Sunny Kumar	2020-04-03	1	-0/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: "snap_scheduler.py init" command failing with the below traceback: [root@dhcp43-104 ~]# snap_scheduler.py init Traceback (most recent call last): File "/usr/sbin/snap_scheduler.py", line 941, in <module> sys.exit(main(sys.argv[1:])) File "/usr/sbin/snap_scheduler.py", line 851, in main initLogger() File "/usr/sbin/snap_scheduler.py", line 153, in initLogger logfile = os.path.join(process.stdout.read()[:-1], SCRIPT_NAME + ".log") File "/usr/lib64/python3.6/posixpath.py", line 94, in join genericpath._check_arg_types('join', a, *p) File "/usr/lib64/python3.6/genericpath.py", line 151, in _check_arg_types raise TypeError("Can't mix strings and bytes in path components") from None TypeError: Can't mix strings and bytes in path components Solution: Added the 'universal_newlines' flag to Popen to support backward compatibility. Added a basic test for snapshot scheduler. Change-Id: I78e8fabd866fd96638747ecd21d292f5ca074a4e Fixes: #1134 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	tests: fix afr-lock-heal-* failure	Ravishankar N	2020-03-16	2	-12/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When brick-mux is enabled: i)brick statedumps seem to be listing the same lock information multiple times. While that is getting fixed, make changes to the .ts to check for unique values. ii)detecting a brick as online via brick_up_status() seems to be taking longer time when delaygen is enabled. Hence bump up PROCESS_UP_TIMEOUT to 90 for afr-lock-heal-advanced.t Updates: #1042 Change-Id: Ife76008f7a99dd1f1fe5791a32577366baaab4b3 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Fixes for halo	Pranith Kumar K	2020-03-13	1	-0/+61
\| \| \| \| \| \| \| \| \| \| \|	Current implementation assumes that ping-event will come after connect event but that may not be the case in the cases where after socket connection fds need to be re-opened which would consume more time. So handle any order of the ping/child-up events. fixes: bz#1800583 Change-Id: I6bcdc0caa503bdc039ef2b4739fbf4afae121f05 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	mgmt/glusterd: Adding validation for statedump path	yatipadia	2020-03-09	1	-5/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Description of problem: server.statedump-path is the path where statedumps are stored, by default it is /var/run/gluster. And can be set to any valid directory path. It was observed that server.statedump-path was also accepting file, non-existent file and non-existent paths as well. And statedump command was successful even when statedumps with all the invalid paths. a. A file b. A non-existent path Solution: Added a validation function in gluster-volume-set.c which will allow volume set to success if it's a valid directory and in all other cases, volume set should fail. Fixes: bz#1787122 Change-Id: Ia66e2b3d35f23efc5444c829928779a79d827b42 Signed-off-by: yatipadia <ypadia@redhat.com> Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	Updating gluster manual.	Rishubh Jain	2020-02-18	1	-0/+4
\| \| \| \| \| \| \| \| \| \|	Adding disperse-data to gluster manual under volume create command Change-Id: Ic9eb47c9e71a1d7a11af9394c615c8e90f8d1d69 Fixes: bz#1668239 Signed-off-by: Rishubh Jain <risjain@redhat.com> Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
*	feature/changelog: Avoid thread creation if xlator is not enabled	Mohit Agrawal	2020-02-09	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Changelog creates threads even if the changelog is not enabled Background: Changelog xlator broadly does two things 1. Journalling - Cosumers are geo-rep and glusterfind 2. Event Notification for registered events like (open, release etc) - Consumers are bitrot, geo-rep The existing option "changelog.changelog" controls journalling and there is no option to control event notification and is enabled by default. So when bitrot/geo-rep is not enabled on the volume, threads and resources(rpc and rbuf) related to event notifications consumes resources and cpu cycle which is unnecessary. Solution: The solution is to have two different options as below. 1. changelog-notification : Event notifications 2. changelog : Journalling This patch introduces the option "changelog-notification" which is not exposed to user. When either bitrot or changelog (journalling) is enabled, it internally enbales 'changelog-notification'. But once the 'changelog-notification' is enabled, it will not be disabled for the life time of the brick process even after bitrot and changelog is disabled. As of now, rpc resource cleanup has lot of races and is difficult to cleanup cleanly. If allowed, it leads to memory leaks and crashes on enable/disable of bitrot or changelog (journal) in a loop. Hence to be safer, the event notification is not disabled within lifetime of process once enabled. Change-Id: Ifd00286e0966049e8eb9f21567fe407cf11bb02a Updates: #475 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	ctime: Fix ctime inconsisteny with utimensat	Kotresh HR	2019-12-10	1	-0/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When touch is used to create a file, the ctime is not matching atime and mtime which ideally should match. There is a difference in nano seconds. Cause: When touch is used modify atime or mtime to current time (UTIME_NOW), the current time is taken from kernel. The ctime gets updated to current time when atime or mtime is updated. But the current time to update ctime is taken from utime xlator. Hence the difference in nano seconds. Fix: When utimesat uses UTIME_NOW, use the current time from kernel. fixes: bz#1773530 Change-Id: I9ccfa47dcd39df23396852b4216f1773c49250ce Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	glusterfsd-mgmt: unify read and write tests	Yaniv Kaul	2019-11-07	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Both read and write tests required writing first. Either just writing (write test) or write and then read (read test). So the code is now unified. 2. There's no reason to read zeros from /dev/zero. Just use a CALLOC'ed buffer. I don't think we should read and write zeros, but I did not change the code yet (I think compression and/or dedup will offset results) It appears neither read-perf nor write-perf were tested, so added basic tests for them. Change-Id: I24b1f249fa0335ed652a8982e99c0687d940230e updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	afr: lock healing changes	Ravishankar N	2019-10-30	4	-0/+612
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implements lock healing for gluster-block fencing use case. If mandatory lock is enabled: - Add domain lock/unlock to afr_lk fop. - Maintain a list of locks to be healed in afr_private_t. - Add lock to the list if afr_lk(F_SETLK or F_SETLKW) was sucessful. - Remove it from the list during afr_lk(F_UNLCK). - On child_down, mark lock as needing heal on that child. If lock is lost on quorum no. of bricks, remove it from the list and mark fd bad. - For fds marked as bad, fail the subsequent fd based fops. - On parent up, traverse the list and heal the locks IFF the client is the lk owner and has quorum. (shd does not heal any locks). updates: #613 Change-Id: I03c46ceaea30f5e6236d5ec13f71d843d827f1bc Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	tests: add tests for bz#1758878	Ravishankar N	2019-10-17	3	-34/+62
\| \| \| \| \| \| \| \|	updates: bz#1193929 Change-Id: I517fa29e57bde970c2c22ebc2de80fec1509cd2d Signed-off-by: Sanju Rakonde <srakonde@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	tests: Remove hard-coding of lower bound	Pranith Kumar K	2019-10-16	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Total blocks of XFS partition are dependent on XFS parameters used for formatting. So hardcoded lower-bound may not be the lower bound for different default parameters. Used blocks are lesser on my machine for a freshly formatted XFS partition compared to where the test fails. On my machine: Filesystem 1024-blocks Used Available Capacity Mounted on /dev/loop0 98980 5472 93508 6% /d/backends/patchy1 /dev/loop1 98980 5472 93508 6% /d/backends/patchy2 On a machine where this test fails: Filesystem 1024-blocks Used Available Capacity Mounted on /dev/loop0 96928 6112 90816 7% /d/backends/patchy1 /dev/loop1 96928 6112 90816 7% /d/backends/patchy2 Fix: Make lower bound 2% less than the brick-blocks available fixes: bz#1761759 Change-Id: I974d5e75766f7ff44780a2e4c2a19cd5d1d14a79 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Add afr_seek to fops table	Pranith Kumar K	2019-10-14	3	-1/+57
\| \| \| \| \| \|	fixes: bz#1760189 Change-Id: Iffbf8d6f4c50b8e2de8364658697bdbe96549f5d Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Implement read-mask feature	Pranith Kumar K	2019-09-27	1	-0/+114
\| \| \| \| \| \|	fixes: #725 Change-Id: Iaaefe6f49c8193c476b987b92df6bab3e2f62601 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	read-ahead/io-cache: turn off by default	Raghavendra Gowdappa	2019-09-26	2	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We've found perf xlators io-cache and read-ahead not adding any performance improvement. At best read-ahead is redundant due to kernel read-ahead and at worst io-cache is degrading the performance for workloads that doesn't involve re-read. Given that VFS already have both these functionalities, this patch makes these two translators turned off by default for native fuse mounts. For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have these xlators on by having custom profiles. Change-Id: Ie7535788909d4c741844473696f001274dc0bb60 Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com> fixes: bz#1676479
*	gfapi: 'glfs_h_creat_open' - new API to create handle and open fd	Soumya Koduri	2019-09-25	2	-0/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Right now we have two separate APIs, one - 'glfs_h_creat_handle' to create handle & another - 'glfs_h_open' to create a glfd to return to application Having two separate routines can result in access errors while trying to create and write into a read-only file. Since a fd is opened even during file/directory creation, introducing a new API to make these two operations atomic i.e, which can create both handle & fd and pass them to application Change-Id: Ibf513fcfcdad175f4d7eb6fa7a61b8feec6d33b5 Fixes: bz#1753569 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
*	ctime/rebalance: Heal ctime xattr on directory during rebalance	Kotresh HR	2019-09-16	6	-0/+477
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After add-brick and rebalance, the ctime xattr is not present on rebalanced directories on new brick. This patch fixes the same. Note that ctime still doesn't support consistent time across distribute sub-volume. This patch also fixes the in-memory inconsistency of time attributes when metadata is self healed. Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df fixes: bz#1734026 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	tests/shd: Mark "tests/basic/volume-scale-shd-mux.t" as bad	Mohammed Rafi KC	2019-09-16	1	-0/+2
\| \| \| \| \| \| \| \| \|	This test case is failing in upstream. Marking this test as bad for now. Change-Id: I014c67628c14683c32a3c1dd770b10aaf35ad4cc Updates: bz#1752331 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	libgfapi: return correct errno on invalid volume name	Sheetal Pamecha	2019-09-12	3	-8/+95
\| \| \| \| \| \| \| \| \| \| \| \| \|	glfs_init when called with volume name prefixed by '/' sets errno to 0. Setting errno to EINVAL to resolve the issue. Also volname is a parameter to glfs_new. Thus, validating volname in glfs_new itself and returning EINVAL from that function fixes: bz#1507896 Change-Id: I0d4d2423e26cc07644d50ec8cce788ecc639203d Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
*	tests: revive back volume-scale-shd-mux.t	Atin Mukherjee	2019-09-12	5	-30/+28
\| \| \| \| \| \| \|	Fixes: bz#1708929 Change-Id: I9cc81a9047ff874df752ca5552e00bf033485bd8 Signed-off-by: Atin Mukherjee <amukherj@redhat.com> Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	cluster/ec: quorum-count implementation	Pranith Kumar K	2019-09-08	2	-0/+215
\| \| \| \| \| \|	fixes: #721 Change-Id: I5333540e3c635ccf441cf1f4696e4c8986e38ea8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Fail fsync/flush for files on update size/version failure	Pranith Kumar K	2019-09-06	2	-0/+150
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: If update size/version is not successful on the file, updates on the same stripe could lead to data corruptions if the earlier un-aligned write is not successful on all the bricks. Application won't have any knowledge of this because update size/version happens in the background. Fix: Fail fsync/flush on fds that are opened before update-size-version went bad. fixes: bz#1748836 Change-Id: I9d323eddcda703bd27d55f340c4079d76e06e492 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	graph/cleanup: Fix race in graph cleanup	Mohammed Rafi KC	2019-09-05	2	-3/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We were unconditionally cleaning up the grap when we get child_down followed by parent_down. But this is prone to race condition when some of the bricks are already disconnected. In this case, even before the last child down is executed in the client xlator code,we might have freed the graph. Because the child_down event is alreadt recevied. To fix this race, we have introduced a check to see if all client xlator have cleared thier reconnect chain, and called the child_down for last time. Change-Id: I7d02813bc366dac733a836e0cd7b14a6fac52042 fixes: bz#1727329 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	tests/dht: Add a test file for file renames	N Balachandran	2019-08-19	1	-0/+1021
\| \| \| \| \| \| \| \| \|	Test the various combinations of hashed and cached subvols for the src and dst. Change-Id: I41416f9e5f2b7ea1c880d1913fdd6576da1ee868 fixes: bz#1626543 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	tests/shd: Break down shd mux tests into multiple .t file	Mohammed Rafi KC	2019-08-05	4	-149/+191
\| \| \| \| \| \| \| \| \| \|	Test file tests/basic/shd-mux.t was taking longer than 200 seconds in some iterations. So this patch is breaking the test case to three files Change-Id: I1430f58798f876edf6368d6f4b8b5a75f0114c31 Updates: bz#1708929 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	xdr: add code so we have more xdr functions covered	Amar Tumballi	2019-08-04	1	-0/+3
\| \| \| \| \| \|	Updates: bz#1693692 Change-Id: Ia10ccca5e1fed6c4269842ebb4d507662ca0f6a6 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	gfapi: increase function-coverage	Amar Tumballi	2019-07-31	2	-25/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	* Add few more mgmt functions to the coverage * While testing mgmt function, found an issue, where if the 'glfs_set_volfile_server()' is not called before calling 'glfs_unset_volfile_server()', unset would cause a crash. Null check of few variables fixes the issue, which is handled in this patch itself. * Added a test for volfile API Updates: bz#1693692 Change-Id: Iba151f8da1b64107e2f436ddbfef9da45b1c1588 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	trace: add more coverage by testing it with glfs-coverage too.	Amar Tumballi	2019-07-30	1	-0/+22
\| \| \| \| \| \| \| \| \| \|	make sure to provide 'log-file' option, so we can see the logs. This test does test volgen inserting the trace xlator in server graph. Updates: bz#1693692 Change-Id: I26c736b04376674b4c094d48060660421e6c983c Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	cluster/ec: fix EIO error for concurrent writes on sparse files	Xavi Hernandez	2019-07-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1730715 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	ctime: Set mdata xattr on legacy files	Kotresh HR	2019-07-22	1	-0/+83
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The files which were created before ctime enabled would not have "trusted.glusterfs.mdata"(stores time attributes) xattr. Upon fops which modifies either ctime or mtime, the xattr gets created with latest ctime, mtime and atime, which is incorrect. It should update only the corresponding time attribute and rest from backend Solution: Creating xattr with values from brick is not possible as each brick of replica set would have different times. So create the xattr upon successful lookup if the xattr is not created Note To Reviewers: The time attributes used to set xattr is got from successful lookup. Instead of sending the whole iatt over the wire via setxattr, a structure called mdata_iatt is sent. The mdata_iatt contains only time attributes. Change-Id: I5e535631ddef04195361ae0364336410a2895dd4 fixes: bz#1593542 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	quick-read: rename cache-invalidation key to avoid redundant keys	Atin Mukherjee	2019-07-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With group-metadata-cache group profile settings performance.cache-invalidation option when turned on enables both md-cache and quick-read xlator's cache-invalidation feature. While the intent of the group-metadata-cache is to set md-cache xlator's cache-invalidation feature, quick-read xlator also gets affected due to the same. While md-cache feature and it's profile existed since release-3.9, quick-read cache-invalidation was introduced in release-4 and due to this op-version mismatch on any cluster which is >= glusterfs-4 when this group profile is applied it breaks backward compatibility with the old clients. The proposed fix here is to rename the key in quick-read to 'quick-read-cache-invalidation' so that both these features have distinct identification. While this brings in by itself a backward compatibility challenge where this feature is enabled in an existing cluster and when the same is upgraded to a version where this change exists, it will lead to an unidentified old key. But as a workaround we can always ask users upgrading to release-7 version to turn off this option, upgrade the cluster and turn it back on with the new key. This needs to be documented once the patch is accepted. Fixes: bz#1698042 Change-Id: I30422ba6496208e21191a8d78ad29b2e21078664 Signed-off-by: Atin Mukherjee <amukherj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
*	glusterd/thin-arbiter: Thin-arbiter integration with GD1	Vishal Pandey	2019-06-28	2	-0/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	gluster volume create <VOLNAME> replica 2 thin-arbiter 1 <host1>:<brick1> <host2>:<brick2> <thin-arbiter-host>:<path-to-store-replica-id-file> [force] The changes have been made in a way that the last brick in the bricks list will be treated as the thin-arbiter. GD1 will be manipulated to consider replica count to be as 2 and continue creating the volume like any other replica 2 volume but since thin-arbiter volumes need ta-brick client xlator entries for each subvolume in fuse volfile, volfile generation is modified in a way to inject these entries seperately in the volfile for every subvolume. Few more additions - 1- Save the volinfo with new fields ta_bricks list and thin_arbiter_count. 2- Introduce a new option client.ta-brick-port to add remote-port to ta-brick xlator entry in fuse volfiles. The option can be set using the following CLI syntax - gluster volume set <VOLNAME> client.ta-brick-port <PORTNO.> 3- Volume Info will contain a Thin-Arbiter-path entry to distinguish from other replicate volumes. Change-Id: Ib434e2313b29716f32476c6c211d282c4ef39406 Updates #687 Signed-off-by: Vishal Pandey <vpandey@redhat.com>
*	lcov: add more tests to glfsxmp-coverage	Amar Tumballi	2019-06-25	1	-19/+79
\| \| \| \| \| \| \| \| \|	* found a bug with quiesce fallocate() - fixed. * found a bug with cloudsync part of code in posix - fixed updates: bz#1693692 Change-Id: I4f315ffebb612de072ae08761b8cd0f47714080a Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	cluster/ec: Prevent double pre-op xattrops	Pranith Kumar K	2019-06-22	1	-0/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Race: Thread-1 Thread-2 1) Does ec_get_size_version() to perform pre-op fxattrop as part of write-1 2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1 3) Completes executing ec_prepare_update_cbk leading to ctx->dirty[] = '1' 4) Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2' At the end of the writes the file will be marked for heal even when it doesn't need heal. Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked as '1' in step 2) above Updates bz#1593224 Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	posix/ctime: Fix ctime upgrade issue	Kotresh HR	2019-06-21	1	-16/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: On a EC volume, during upgrade from the older version where ctime feature is not enabled(or not present) to the newer version where the ctime feature is available (enabled default), the self heal hangs and doesn't complete. Cause: The ctime feature has both client side code (utime) and server side code (posix). The feature is driven from client. Only if the client side sets the time in the frame, should the server side sets the time attributes in xattr. But posix setattr/fseattr was not doing that. When one of the server nodes is updated, since ctime is enabled by default, it starts setting xattr on setattr/fseattr on the updated node/brick. On a EC volume the first two updated nodes(bricks) are not a problem because there are 4 other bricks with consistent data. However once the third brick is updated, the new attribute(mdata xattr) will cause an inconsistency on metadata on 3 bricks, which prevents the file to be repaired. Fix: Don't create mdata xattr with utimes/utimensat system call. Only update if already present. Change-Id: Ieacedecb8a738bb437283ef3e0f042fd49dc4c8c fixes: bz#1720201 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	core: improve timer accuracy	Xavier Hernandez	2019-06-17	1	-3/+4
\| \| \| \| \| \| \| \|	Also fixed some issues on test ec-1468261.t. Change-Id: If156f86af986d9eed13cdd1f15c5a7214cd11706 Updates: bz#1193929 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
*	gfapi: provide an api for setting statedump path	Amar Tumballi	2019-06-14	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently for an application using glfsapi to use glusterfs, when a statedump is taken, it uses /var/run/gluster dir to dump info. There can be concerns as this directory may be owned by some other user, and hence it may fail taking statedump. Such applications should have an option to use different path. This patch provides an API to do so. Updates: bz#1689097 Change-Id: I8918e002bc823d83614c972b6c738baa04681b23 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	tests: keep glfsxmp in tests directory	Amar Tumballi	2019-06-11	4	-9/+1815
\| \| \| \| \| \| \| \| \| \| \| \| \|	this is critical so all the tests will be contained in the same directory, and one can just 'cp -a tests/ <any-location>/' and run glusterfs tests. only 'glfsxmp.c' was an exception as it was just copying the file from api example directory. Now moved it to tests. updates: bz#1193929 Change-Id: I00359d64be580bffc5b3c3a090968d86c2c6952a Signed-off-by: Amar Tumballi <amarts@redhat.com>