| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
AFR doesn't delay post-op for fsync fop. For fsync heavy workloads
this leads to un-necessary fxattrop/finodelk for every fsync leading
to bad performance.
Fix:
Have delayed post-op for fsync. Add special flag in xdata to indicate
that afr shouldn't delay post-op in cases where either the
process will terminate or graph-switch would happen. Otherwise it leads
to un-necessary heals when the graph-switch/process-termination
happens before delayed-post-op completes.
Fixes: #1253
Change-Id: I531940d13269a111c49e0510d49514dc169f4577
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was a critical flaw in the previous implementation of open-behind.
When an open is done in the background, it's necessary to take a
reference on the fd_t object because once we "fake" the open answer,
the fd could be destroyed. However as long as there's a reference,
the release function won't be called. So, if the application closes
the file descriptor without having actually opened it, there will
always remain at least 1 reference, causing a leak.
To avoid this problem, the previous implementation didn't take a
reference on the fd_t, so there were races where the fd could be
destroyed while it was still in use.
To fix this, I've implemented a new xlator cbk that gets called from
fuse when the application closes a file descriptor.
The whole logic of handling background opens have been simplified and
it's more efficient now. Only if the fop needs to be delayed until an
open completes, a stub is created. Otherwise no memory allocations are
needed.
Correctly handling the close request while the open is still pending
has added a bit of complexity, but overall normal operation is simpler.
Change-Id: I6376a5491368e0e1c283cc452849032636261592
Fixes: #1225
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- performance.cache-size has a flawed semantics, as it's
dispatched on two independent translators, io-cache
and quick-read.
- performance.qr-cache-timeout has a confusing name, as
other options affecting quick-read have an unabbreviated
"quick-read-..." prefix in their names.
We keep these options with unchanged operation, but in the
help output we indicate their deprecation.
The following better alternatives are introduced:
- performance.io-cache-size to tune cache-size option of io-cache
- performance.quick-read-cache-size to tune cache-size option of
quick-read
- performance.quick-read-cache-timeout as a preferred synonym for
performance.qr-cache-timeout
Fixes: #952
Change-Id: Ibd04fb638de8cac450ba992ad8a415154f9f4281
Signed-off-by: Csaba Henk <csaba@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently data migration in rebalance reads sparse file sequentially,
disregarding which segments are holes and which are data. This can lead
to extremely long migration time for large sparse file.
Data migration mechanism needs to be enhanced so only data segments are
read and migrated. This can be achieved using lseek to seek for holes
and data in the file.
This enhancement is a consequence of
https://bugzilla.redhat.com/show_bug.cgi?id=1823703
fixes: #1222
Change-Id: If5f448a0c532926464e1f34f504c5c94749b08c3
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In add-brick that increases replica count
SHD was restarted after pending xattrs are set on the new bricks and
adding bricks. But before restarting SHD there is a possibility that
old SHD would do a scan on root-directory see no heal is needed and
delete index for root-dir leading to no heals until lookup is executed
on the mount
Fix:
Stop shd, perform pending-xattr setting/adding new bricks and
then restart shd
Fixes: #1240
Change-Id: I94fd7c6c909211b597185dfe097a559db6c0d00f
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
ok 32 [ 11/ 9] < 46> 'gf_rm_file_and_gfid_link /d/backends/patchy0 del-file'
not ok 33 [ 13/ 131] < 48> '! dd if=/dev/zero of=/mnt/glusterfs/0/del-file bs=1M count=1 oflag=direct' -> ''
The assumption in the test above is that the file wouldn't exist when dd
happens. But heal can lead to creation of the file in some cases leading to
spurious failures.
Fix:
Disable client side heal.
Fixes: #1245
Change-Id: I96b2b45528f9dfb3199d503a467cafafba9b387f
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In brick-mux tests, all bricks of the volume have same pid.
"generate_brick_statedump" cleans up the older statedumps
with same brick pid. So successive calls to this function
will delete previous brick's statedump as all bricks share
same pid. So grep calls to the statedump were failing leading
to failure of the .t
To fix this, stored the result we need from statedump before calling
next brick's statedump
Fixes: #1234
Change-Id: I824ed4dff79e7242b3e980364836b9af0e87a6ee
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
| |
Fixes: #1223
Change-Id: I36cb72d920ffd77405051546615c5262c392daef
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The general idea of the changes is to prevent resetting event generation
to zero in the inode ctx, since event gen is something that should
follow 'causal order'.
Change #1:
For a read txn, in inode refresh cbk, if event_generation is
found zero, we are failing the read fop. This is not needed
because change in event gen is only a marker for the next inode refresh to
happen and should not be taken into account by the current read txn.
Change #2:
The event gen being zero above can happen if there is a racing lookup,
which resets even get (in afr_lookup_done) if there are non zero afr
xattrs. The resetting is done only to trigger an inode refresh and a
possible client side heal on the next lookup. That can be acheived by
setting the need_refresh flag in the inode ctx. So replaced all
occurences of resetting even gen to zero with a call to
afr_inode_need_refresh_set().
Change #3:
In both lookup and discover path, we are doing an inode refresh which is
not required since all 3 essentially do the same thing- update the inode
ctx with the good/bad copies from the brick replies. Inode refresh also
triggers background heals, but I think it is okay to do it when we call
refresh during the read and write txns and not in the lookup path.
The .ts which relied on inode refresh in lookup path to trigger heals are
now changed to do read txn so that inode refresh and the heal happens.
Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec
Fixes: #1179
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reported-by: Erik Jacobson <erik.jacobson at hpe.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
The test is failing at
14:56:41 ok 13, LINENUM:38
14:56:41 not ok 14 Got "test-message0" instead of "test-message1", LINENUM:41
14:56:41 FAILED COMMAND: test-message1 cat /mnt/glusterfs/1/test.txt
This happens because fuse sometimes doesn't send 'read' fop to glusterfs
and is served from cache.
Fix:
Mount with direct-io-mode=yes so that read is always received by
gluster
Fixes: #1190
Change-Id: I369e2024a85dc492dc24c7579b161fb965f55d19
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Do not truncate file offsets and sizes to 32-bit to
prevent tests from spurious failures on >2Gb files.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Change-Id: I2a77ea5f9f415249b23035eecf07129f19194ac2
Fixes: #1161
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: Before executing a fop in POSIX xlator it builds an internal
path based on GFID.To validate the path it call's (l)stat
system call and while .glusterfs is heavily loaded kernel takes
time to lookup inode and due to that performance drops
Solution: In this patch we followed two ways to improve the performance.
1) Keep open fd specific to first level directory(gfid[0])
in .glusterfs, it would force to kernel keep the inodes
from all those files in cache. In case of memory pressure
kernel won't uncache first level inodes. We need to open
256 fd's per brick to access the entry faster.
2) Use at based call's to access relative path to reduce
path based lookup time.
Note: To verify the patch we have executed kernel untar 100 times on 6
different clients after enabling metadata group-cache and some
other option.We were getting more than 20 percent improvement in
kenel untar after applying the patch.
Credits: Xavi Hernandez <xhernandez@redhat.com>
Change-Id: I1643e6b01ed669b2bb148d02f4e6a8e08da45343
updates: #891
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Following tests are done -
1 - After finishing reset-brick all the bricks should be up.
2 - Heal should be completed.
3 - Check number of entries present on brick which was reset.
Change-Id: I9314bed180293a99d400d94bb8cc7ece999da29e
Updates: #1144
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
"snap_scheduler.py init" command failing with the below traceback:
[root@dhcp43-104 ~]# snap_scheduler.py init
Traceback (most recent call last):
File "/usr/sbin/snap_scheduler.py", line 941, in <module>
sys.exit(main(sys.argv[1:]))
File "/usr/sbin/snap_scheduler.py", line 851, in main
initLogger()
File "/usr/sbin/snap_scheduler.py", line 153, in initLogger
logfile = os.path.join(process.stdout.read()[:-1], SCRIPT_NAME + ".log")
File "/usr/lib64/python3.6/posixpath.py", line 94, in join
genericpath._check_arg_types('join', a, *p)
File "/usr/lib64/python3.6/genericpath.py", line 151, in _check_arg_types
raise TypeError("Can't mix strings and bytes in path components") from None
TypeError: Can't mix strings and bytes in path components
Solution:
Added the 'universal_newlines' flag to Popen to support backward compatibility.
Added a basic test for snapshot scheduler.
Change-Id: I78e8fabd866fd96638747ecd21d292f5ca074a4e
Fixes: #1134
Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When brick-mux is enabled:
i)brick statedumps seem to be listing the same lock information multiple times.
While that is getting fixed, make changes to the .ts to check for unique values.
ii)detecting a brick as online via brick_up_status() seems to be taking
longer time when delaygen is enabled. Hence bump up PROCESS_UP_TIMEOUT to
90 for afr-lock-heal-advanced.t
Updates: #1042
Change-Id: Ife76008f7a99dd1f1fe5791a32577366baaab4b3
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Current implementation assumes that ping-event will come after connect event
but that may not be the case in the cases where after socket connection fds
need to be re-opened which would consume more time. So handle any order of the
ping/child-up events.
fixes: bz#1800583
Change-Id: I6bcdc0caa503bdc039ef2b4739fbf4afae121f05
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Description of problem:
server.statedump-path is the path where statedumps are stored,
by default it is /var/run/gluster. And can be set to any valid
directory path. It was observed that server.statedump-path was
also accepting file, non-existent file and non-existent paths
as well. And statedump command was successful even when
statedumps with all the invalid paths.
a. A file
b. A non-existent path
Solution:
Added a validation function in gluster-volume-set.c which will
allow volume set to success if it's a valid directory
and in all other cases, volume set should fail.
Fixes: bz#1787122
Change-Id: Ia66e2b3d35f23efc5444c829928779a79d827b42
Signed-off-by: yatipadia <ypadia@redhat.com>
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Adding disperse-data to gluster manual under
volume create command
Change-Id: Ic9eb47c9e71a1d7a11af9394c615c8e90f8d1d69
Fixes: bz#1668239
Signed-off-by: Rishubh Jain <risjain@redhat.com>
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Changelog creates threads even if the changelog is not enabled
Background:
Changelog xlator broadly does two things
1. Journalling - Cosumers are geo-rep and glusterfind
2. Event Notification for registered events like (open, release etc) -
Consumers are bitrot, geo-rep
The existing option "changelog.changelog" controls journalling and
there is no option to control event notification and is enabled by
default. So when bitrot/geo-rep is not enabled on the volume, threads
and resources(rpc and rbuf) related to event notifications consumes
resources and cpu cycle which is unnecessary.
Solution:
The solution is to have two different options as below.
1. changelog-notification : Event notifications
2. changelog : Journalling
This patch introduces the option "changelog-notification" which is
not exposed to user. When either bitrot or changelog (journalling)
is enabled, it internally enbales 'changelog-notification'. But
once the 'changelog-notification' is enabled, it will not be disabled
for the life time of the brick process even after bitrot and changelog
is disabled. As of now, rpc resource cleanup has lot of races and is
difficult to cleanup cleanly. If allowed, it leads to memory leaks
and crashes on enable/disable of bitrot or changelog (journal) in a
loop. Hence to be safer, the event notification is not disabled within
lifetime of process once enabled.
Change-Id: Ifd00286e0966049e8eb9f21567fe407cf11bb02a
Updates: #475
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
When touch is used to create a file, the ctime is not matching
atime and mtime which ideally should match. There is a difference
in nano seconds.
Cause:
When touch is used modify atime or mtime to current time (UTIME_NOW),
the current time is taken from kernel. The ctime gets updated to current
time when atime or mtime is updated. But the current time to update
ctime is taken from utime xlator. Hence the difference in nano seconds.
Fix:
When utimesat uses UTIME_NOW, use the current time from kernel.
fixes: bz#1773530
Change-Id: I9ccfa47dcd39df23396852b4216f1773c49250ce
Signed-off-by: Kotresh HR <khiremat@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Both read and write tests required writing first. Either just
writing (write test) or write and then read (read test).
So the code is now unified.
2. There's no reason to read zeros from /dev/zero. Just use a
CALLOC'ed buffer.
I don't think we should read and write zeros, but I did not change
the code yet (I think compression and/or dedup will offset results)
It appears neither read-perf nor write-perf were tested, so added
basic tests for them.
Change-Id: I24b1f249fa0335ed652a8982e99c0687d940230e
updates: bz#1193929
Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implements lock healing for gluster-block fencing use case.
If mandatory lock is enabled:
- Add domain lock/unlock to afr_lk fop.
- Maintain a list of locks to be healed in afr_private_t.
- Add lock to the list if afr_lk(F_SETLK or F_SETLKW) was sucessful.
- Remove it from the list during afr_lk(F_UNLCK).
- On child_down, mark lock as needing heal on that child. If lock is
lost on quorum no. of bricks, remove it from the list and mark fd bad.
- For fds marked as bad, fail the subsequent fd based fops.
- On parent up, traverse the list and heal the locks IFF the client is
the lk owner and has quorum. (shd does not heal any locks).
updates: #613
Change-Id: I03c46ceaea30f5e6236d5ec13f71d843d827f1bc
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
|
|
|
|
|
|
|
|
| |
updates: bz#1193929
Change-Id: I517fa29e57bde970c2c22ebc2de80fec1509cd2d
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Total blocks of XFS partition are dependent on XFS parameters used
for formatting. So hardcoded lower-bound may not be the lower bound
for different default parameters.
Used blocks are lesser on my machine for a freshly formatted XFS partition
compared to where the test fails.
On my machine:
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/loop0 98980 5472 93508 6% /d/backends/patchy1
/dev/loop1 98980 5472 93508 6% /d/backends/patchy2
On a machine where this test fails:
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/loop0 96928 6112 90816 7% /d/backends/patchy1
/dev/loop1 96928 6112 90816 7% /d/backends/patchy2
Fix:
Make lower bound 2% less than the brick-blocks available
fixes: bz#1761759
Change-Id: I974d5e75766f7ff44780a2e4c2a19cd5d1d14a79
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
| |
fixes: bz#1760189
Change-Id: Iffbf8d6f4c50b8e2de8364658697bdbe96549f5d
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
| |
fixes: #725
Change-Id: Iaaefe6f49c8193c476b987b92df6bab3e2f62601
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We've found perf xlators io-cache and read-ahead not adding any
performance improvement. At best read-ahead is redundant due to kernel
read-ahead and at worst io-cache is degrading the performance for
workloads that doesn't involve re-read. Given that VFS already have
both these functionalities, this patch makes these two
translators turned off by default for native fuse mounts.
For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have
these xlators on by having custom profiles.
Change-Id: Ie7535788909d4c741844473696f001274dc0bb60
Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
fixes: bz#1676479
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Right now we have two separate APIs, one
- 'glfs_h_creat_handle' to create handle & another
- 'glfs_h_open' to create a glfd to return to application
Having two separate routines can result in access errors
while trying to create and write into a read-only file.
Since a fd is opened even during file/directory creation,
introducing a new API to make these two operations atomic i.e,
which can create both handle & fd and pass them to application
Change-Id: Ibf513fcfcdad175f4d7eb6fa7a61b8feec6d33b5
Fixes: bz#1753569
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After add-brick and rebalance, the ctime xattr is not present
on rebalanced directories on new brick. This patch fixes the
same.
Note that ctime still doesn't support consistent time across
distribute sub-volume.
This patch also fixes the in-memory inconsistency of time attributes
when metadata is self healed.
Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
fixes: bz#1734026
Signed-off-by: Kotresh HR <khiremat@redhat.com>
|
|
|
|
|
|
|
|
|
| |
This test case is failing in upstream. Marking this test as
bad for now.
Change-Id: I014c67628c14683c32a3c1dd770b10aaf35ad4cc
Updates: bz#1752331
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
glfs_init when called with volume name prefixed by '/'
sets errno to 0. Setting errno to EINVAL to resolve the issue.
Also volname is a parameter to glfs_new.
Thus, validating volname in glfs_new itself and
returning EINVAL from that function
fixes: bz#1507896
Change-Id: I0d4d2423e26cc07644d50ec8cce788ecc639203d
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
|
|
|
|
|
|
|
| |
Fixes: bz#1708929
Change-Id: I9cc81a9047ff874df752ca5552e00bf033485bd8
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
|
|
|
|
|
|
| |
fixes: #721
Change-Id: I5333540e3c635ccf441cf1f4696e4c8986e38ea8
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
If update size/version is not successful on the file, updates on the
same stripe could lead to data corruptions if the earlier un-aligned
write is not successful on all the bricks. Application won't have
any knowledge of this because update size/version happens in the
background.
Fix:
Fail fsync/flush on fds that are opened before update-size-version
went bad.
fixes: bz#1748836
Change-Id: I9d323eddcda703bd27d55f340c4079d76e06e492
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were unconditionally cleaning up the grap when we get
child_down followed by parent_down. But this is prone to
race condition when some of the bricks are already disconnected.
In this case, even before the last child down is executed in the
client xlator code,we might have freed the graph. Because the
child_down event is alreadt recevied.
To fix this race, we have introduced a check to see if all client
xlator have cleared thier reconnect chain, and called the child_down
for last time.
Change-Id: I7d02813bc366dac733a836e0cd7b14a6fac52042
fixes: bz#1727329
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Test the various combinations of hashed and cached
subvols for the src and dst.
Change-Id: I41416f9e5f2b7ea1c880d1913fdd6576da1ee868
fixes: bz#1626543
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Test file tests/basic/shd-mux.t was taking longer than 200
seconds in some iterations. So this patch is breaking the
test case to three files
Change-Id: I1430f58798f876edf6368d6f4b8b5a75f0114c31
Updates: bz#1708929
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
|
|
|
|
|
|
| |
Updates: bz#1693692
Change-Id: Ia10ccca5e1fed6c4269842ebb4d507662ca0f6a6
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add few more mgmt functions to the coverage
* While testing mgmt function, found an issue, where if the
'glfs_set_volfile_server()' is not called before calling
'glfs_unset_volfile_server()', unset would cause a crash.
Null check of few variables fixes the issue, which is handled
in this patch itself.
* Added a test for volfile API
Updates: bz#1693692
Change-Id: Iba151f8da1b64107e2f436ddbfef9da45b1c1588
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
make sure to provide 'log-file' option, so we can see the logs.
This test does test volgen inserting the trace xlator in server
graph.
Updates: bz#1693692
Change-Id: I26c736b04376674b4c094d48060660421e6c983c
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
EC doesn't allow concurrent writes on overlapping areas, they are
serialized. However non-overlapping writes are serviced in parallel.
When a write is not aligned, EC first needs to read the entire chunk
from disk, apply the modified fragment and write it again.
The problem appears on sparse files because a write to an offset
implicitly creates data on offsets below it (so, in some way, they
are overlapping). For example, if a file is empty and we read 10 bytes
from offset 10, read() will return 0 bytes. Now, if we write one byte
at offset 1M and retry the same read, the system call will return 10
bytes (all containing 0's).
So if we have two writes, the first one at offset 10 and the second one
at offset 1M, EC will send both in parallel because they do not overlap.
However, the first one will try to read missing data from the first chunk
(i.e. offsets 0 to 9) to recombine the entire chunk and do the final write.
This read will happen in parallel with the write to 1M. What could happen
is that half of the bricks process the write before the read, and the
half do the read before the write. Some bricks will return 10 bytes of
data while the otherw will return 0 bytes (because the file on the brick
has not been expanded yet).
When EC tries to recombine the answers from the bricks, it can't, because
it needs more than half consistent answers to recover the data. So this
read fails with EIO error. This error is propagated to the parent write,
which is aborted and EIO is returned to the application.
The issue happened because EC assumed that a write to a given offset
implies that offsets below it exist.
This fix prevents the read of the chunk from bricks if the current size
of the file is smaller than the read chunk offset. This size is
correctly tracked, so this fixes the issue.
Also modifying ec-stripe.t file for Test #13 within it.
In this patch, if a file size is less than the offset we are writing, we
fill zeros in head and tail and do not consider it strip cache miss.
That actually make sense as we know what data that part holds and there is
no need of reading it from bricks.
Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
Fixes: bz#1730715
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
The files which were created before ctime enabled would not
have "trusted.glusterfs.mdata"(stores time attributes) xattr.
Upon fops which modifies either ctime or mtime, the xattr
gets created with latest ctime, mtime and atime, which is
incorrect. It should update only the corresponding time
attribute and rest from backend
Solution:
Creating xattr with values from brick is not possible as
each brick of replica set would have different times.
So create the xattr upon successful lookup if the xattr
is not created
Note To Reviewers:
The time attributes used to set xattr is got from successful
lookup. Instead of sending the whole iatt over the wire via
setxattr, a structure called mdata_iatt is sent. The mdata_iatt
contains only time attributes.
Change-Id: I5e535631ddef04195361ae0364336410a2895dd4
fixes: bz#1593542
Signed-off-by: Kotresh HR <khiremat@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With group-metadata-cache group profile settings
performance.cache-invalidation option when turned on enables both md-cache
and quick-read xlator's cache-invalidation feature. While the intent of
the group-metadata-cache is to set md-cache xlator's cache-invalidation
feature, quick-read xlator also gets affected due to the same. While
md-cache feature and it's profile existed since release-3.9,
quick-read cache-invalidation was introduced in release-4 and due to
this op-version mismatch on any cluster which is >= glusterfs-4 when
this group profile is applied it breaks backward compatibility with the
old clients.
The proposed fix here is to rename the key in quick-read to
'quick-read-cache-invalidation' so that both these features have
distinct identification. While this brings in by itself a backward
compatibility challenge where this feature is enabled in an existing
cluster and when the same is upgraded to a version where this change
exists, it will lead to an unidentified old key. But as a workaround we
can always ask users upgrading to release-7 version to turn off this
option, upgrade the cluster and turn it back on with the new key. This
needs to be documented once the patch is accepted.
Fixes: bz#1698042
Change-Id: I30422ba6496208e21191a8d78ad29b2e21078664
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
gluster volume create <VOLNAME> replica 2 thin-arbiter 1 <host1>:<brick1> <host2>:<brick2>
<thin-arbiter-host>:<path-to-store-replica-id-file> [force]
The changes have been made in a way that the last brick in the bricks list
will be treated as the thin-arbiter.
GD1 will be manipulated to consider replica count to be as 2 and continue creating the
volume like any other replica 2 volume but since thin-arbiter volumes need ta-brick
client xlator entries for each subvolume in fuse volfile, volfile generation is
modified in a way to inject these entries seperately in the volfile for every subvolume.
Few more additions -
1- Save the volinfo with new fields ta_bricks list and thin_arbiter_count.
2- Introduce a new option client.ta-brick-port to add remote-port to ta-brick xlator entry
in fuse volfiles. The option can be set using the following CLI syntax -
gluster volume set <VOLNAME> client.ta-brick-port <PORTNO.>
3- Volume Info will contain a Thin-Arbiter-path entry to distinguish
from other replicate volumes.
Change-Id: Ib434e2313b29716f32476c6c211d282c4ef39406
Updates #687
Signed-off-by: Vishal Pandey <vpandey@redhat.com>
|
|
|
|
|
|
|
|
|
| |
* found a bug with quiesce fallocate() - fixed.
* found a bug with cloudsync part of code in posix - fixed
updates: bz#1693692
Change-Id: I4f315ffebb612de072ae08761b8cd0f47714080a
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Race:
Thread-1 Thread-2
1) Does ec_get_size_version() to perform
pre-op fxattrop as part of write-1
2) Calls ec_set_dirty_flag() in
ec_get_size_version() for write-2.
This sets dirty[] to 1
3) Completes executing
ec_prepare_update_cbk leading to
ctx->dirty[] = '1'
4) Takes LOCK(inode->lock) to check if there are
any flags and sets dirty-flag because
lock->waiting_flag is 0 now. This leads to
fxattrop to increment on-disk dirty[] to '2'
At the end of the writes the file will be marked for heal even when it doesn't need heal.
Fix:
Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked
as '1' in step 2) above
Updates bz#1593224
Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
On a EC volume, during upgrade from the older version where
ctime feature is not enabled(or not present) to the newer
version where the ctime feature is available (enabled default),
the self heal hangs and doesn't complete.
Cause:
The ctime feature has both client side code (utime) and
server side code (posix). The feature is driven from client.
Only if the client side sets the time in the frame, should
the server side sets the time attributes in xattr. But posix
setattr/fseattr was not doing that. When one of the server
nodes is updated, since ctime is enabled by default, it
starts setting xattr on setattr/fseattr on the updated node/brick.
On a EC volume the first two updated nodes(bricks) are not a
problem because there are 4 other bricks with consistent data.
However once the third brick is updated, the new attribute(mdata xattr)
will cause an inconsistency on metadata on 3 bricks, which
prevents the file to be repaired.
Fix:
Don't create mdata xattr with utimes/utimensat system call.
Only update if already present.
Change-Id: Ieacedecb8a738bb437283ef3e0f042fd49dc4c8c
fixes: bz#1720201
Signed-off-by: Kotresh HR <khiremat@redhat.com>
|
|
|
|
|
|
|
|
| |
Also fixed some issues on test ec-1468261.t.
Change-Id: If156f86af986d9eed13cdd1f15c5a7214cd11706
Updates: bz#1193929
Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently for an application using glfsapi to use glusterfs, when a
statedump is taken, it uses /var/run/gluster dir to dump info.
There can be concerns as this directory may be owned by some other
user, and hence it may fail taking statedump. Such applications
should have an option to use different path.
This patch provides an API to do so.
Updates: bz#1689097
Change-Id: I8918e002bc823d83614c972b6c738baa04681b23
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
this is critical so all the tests will be contained in the same
directory, and one can just 'cp -a tests/ <any-location>/' and
run glusterfs tests.
only 'glfsxmp.c' was an exception as it was just copying the
file from api example directory. Now moved it to tests.
updates: bz#1193929
Change-Id: I00359d64be580bffc5b3c3a090968d86c2c6952a
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|