| Commit message | Author | Age | Files | Lines |
| |
Right now we have two separate APIs:
- 'glfs_h_creat_handle' to create a handle, and
- 'glfs_h_open' to create a glfd to return to the application.
Having two separate routines can result in access errors
when trying to create and write into a read-only file.
Since an fd is opened even during file/directory creation,
introduce a new API that makes these two operations atomic, i.e.,
one that creates both the handle and the fd and passes them to the application.
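The access problem has a close POSIX analogue, shown here as a minimal Python sketch rather than as gfapi code. It assumes a POSIX system; the helper names `create_and_write` and `demo` are hypothetical. The descriptor returned by the creating open() stays writable even though the file mode is read-only, which is why creating and opening in one step avoids the error.

```python
import os
import tempfile


def create_and_write(path, payload):
    """Create a read-only (0444) file and write to it through the creating fd."""
    # The mode restricts later opens only; the fd returned by this call
    # is writable because it was requested with O_WRONLY at creation time.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o444)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Run the sketch in a scratch directory and return what was stored."""
    with tempfile.TemporaryDirectory() as d:
        return create_and_write(os.path.join(d, "ro-file"), b"hello")
```

Creating the file first and opening it for writing in a second call would instead be subject to the 0444 permission check for an unprivileged user.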
This is a backport of the mainline patch below -
- https://review.gluster.org/#/c/glusterfs/+/23448/
- bz#1753569
release-6:
- https://review.gluster.org/#/c/glusterfs/+/23491/
Change-Id: Ibf513fcfcdad175f4d7eb6fa7a61b8feec6d33b5
fixes: bz#1756002
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
| |
After add-brick and rebalance, the ctime xattr is not present
on rebalanced directories on new brick. This patch fixes the
same.
Note that ctime still doesn't support consistent time across
distribute sub-volume.
This patch also fixes the in-memory inconsistency of time attributes
when metadata is self healed.
Backport of:
> Patch: https://review.gluster.org/23127/
> Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
> BUG: 1734026
> Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 304640e55c0f3c6d15f4e230dc6376e4f5020fea)
Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
Signed-off-by: Kotresh HR <khiremat@redhat.com>
fixes: bz#1752429
| |
EC doesn't allow concurrent writes on overlapping areas; they are
serialized. However, non-overlapping writes are serviced in parallel.
When a write is not aligned, EC first needs to read the entire chunk
from disk, apply the modified fragment and write it again.
The problem appears on sparse files because a write to an offset
implicitly creates data on offsets below it (so, in some way, they
are overlapping). For example, if a file is empty and we read 10 bytes
from offset 10, read() will return 0 bytes. Now, if we write one byte
at offset 1M and retry the same read, the system call will return 10
bytes (all containing 0's).
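The sparse-file behaviour described above can be reproduced on a local POSIX filesystem. This is a minimal sketch with a hypothetical helper name (`sparse_read_demo`); it does not involve EC or GlusterFS at all.

```python
import os
import tempfile


def sparse_read_demo():
    """Return the result of a 10-byte read at offset 10, before and after a far write."""
    fd, path = tempfile.mkstemp()
    try:
        before = os.pread(fd, 10, 10)        # empty file: nothing to read
        os.pwrite(fd, b"x", 1024 * 1024)     # write one byte at offset 1M
        after = os.pread(fd, 10, 10)         # the hole below 1M now reads as zeros
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)
```

The second read returns data only because the write at 1M extended the file, which is the implicit overlap the message refers to.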
So if we have two writes, the first one at offset 10 and the second one
at offset 1M, EC will send both in parallel because they do not overlap.
However, the first one will try to read missing data from the first chunk
(i.e. offsets 0 to 9) to recombine the entire chunk and do the final write.
This read will happen in parallel with the write to 1M. What could happen
is that half of the bricks process the write before the read, and the other
half do the read before the write. Some bricks will return 10 bytes of
data while the others will return 0 bytes (because the file on the brick
has not been expanded yet).
When EC tries to recombine the answers from the bricks, it can't, because
it needs more than half consistent answers to recover the data. So this
read fails with EIO error. This error is propagated to the parent write,
which is aborted and EIO is returned to the application.
The issue happened because EC assumed that a write to a given offset
implies that offsets below it exist.
This fix prevents the read of the chunk from bricks if the current size
of the file is smaller than the read chunk offset. This size is
correctly tracked, so this fixes the issue.
Test #13 in ec-stripe.t has also been modified.
With this patch, if the file size is less than the offset we are writing to, we
fill the head and tail with zeros and do not consider it a stripe cache miss.
That makes sense, as we know what data that part holds and there is
no need to read it from the bricks.
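A rough sketch of the idea, not the actual EC code: when the tracked file size shows that the chunk holds no data yet, the head and tail are filled with zeros instead of being read. The function name, its parameters and the `read_from_bricks` callback are all hypothetical.

```python
def merge_unaligned_write(file_size, chunk_offset, chunk_size,
                          rel_offset, data, read_from_bricks):
    """Build the full chunk to write for an unaligned write inside one chunk."""
    if rel_offset + len(data) > chunk_size:
        raise ValueError("write does not fit in a single chunk")
    if file_size <= chunk_offset:
        # Nothing exists at or beyond this chunk: its content is known to be zeros.
        chunk = bytearray(chunk_size)
    else:
        # Existing data must be fetched; pad a short read with zeros.
        existing = bytes(read_from_bricks(chunk_offset, chunk_size))
        chunk = bytearray(existing.ljust(chunk_size, b"\x00"))
    chunk[rel_offset:rel_offset + len(data)] = data
    return bytes(chunk)
```

In the first branch no read is issued, so there is nothing that could race with a parallel write that extends the file.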
Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
Fixes: bz#1739427
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
Problem:
Files that were created before ctime was enabled do not
have the "trusted.glusterfs.mdata" xattr (which stores time attributes).
Upon fops that modify either ctime or mtime, the xattr
gets created with the latest ctime, mtime and atime, which is
incorrect. It should update only the corresponding time
attribute and take the rest from the backend.
Solution:
Creating xattr with values from brick is not possible as
each brick of replica set would have different times.
So create the xattr upon successful lookup if the xattr
does not already exist.
Note To Reviewers:
The time attributes used to set the xattr are obtained from a successful
lookup. Instead of sending the whole iatt over the wire via
setxattr, a structure called mdata_iatt is sent. The mdata_iatt
contains only time attributes.
Backport of:
> Patch: https://review.gluster.org/22936
> Change-Id: I5e535631ddef04195361ae0364336410a2895dd4
> BUG: 1593542
> Signed-off-by: Kotresh HR <khiremat@redhat.com>
Change-Id: I5e535631ddef04195361ae0364336410a2895dd4
updates: bz#1739430
Signed-off-by: Kotresh HR <khiremat@redhat.com>
| |
gluster volume create <VOLNAME> replica 2 thin-arbiter 1 <host1>:<brick1> <host2>:<brick2>
<thin-arbiter-host>:<path-to-store-replica-id-file> [force]
The changes have been made in a way that the last brick in the bricks list
will be treated as the thin-arbiter.
GD1 will be manipulated to treat the replica count as 2 and continue creating the
volume like any other replica 2 volume. However, since thin-arbiter volumes need ta-brick
client xlator entries for each subvolume in the fuse volfile, volfile generation is
modified to inject these entries separately into the volfile for every subvolume.
A few more additions -
1- Save the volinfo with new fields ta_bricks list and thin_arbiter_count.
2- Introduce a new option client.ta-brick-port to add remote-port to ta-brick xlator entry
in fuse volfiles. The option can be set using the following CLI syntax -
gluster volume set <VOLNAME> client.ta-brick-port <PORTNO.>
3- Volume Info will contain a Thin-Arbiter-path entry to distinguish
from other replicate volumes.
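The "last brick is the thin-arbiter" rule can be illustrated with a small sketch. It is not glusterd code; the function name `split_bricks` and its validation rules are assumptions made for the example.

```python
def split_bricks(bricks, thin_arbiter_count=1):
    """Split a CLI brick list into data bricks and the thin-arbiter entry."""
    if thin_arbiter_count not in (0, 1):
        raise ValueError("only 0 or 1 thin-arbiter is handled in this sketch")
    if thin_arbiter_count == 0:
        return list(bricks), []
    if len(bricks) < 3:
        raise ValueError("need two data bricks plus the thin-arbiter path")
    # The last entry in the list is treated as the thin-arbiter.
    return list(bricks[:-1]), [bricks[-1]]
```

With the create command shown above, the two `<host>:<brick>` entries would form the replica 2 set and the trailing entry would be kept separately, as in the `ta_bricks` list the message mentions.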
Change-Id: Ib434e2313b29716f32476c6c211d282c4ef39406
Updates #687
Signed-off-by: Vishal Pandey <vpandey@redhat.com>
(cherry picked from commit 9b223b15ab69fce4076de036ee162f36a058bcd2)
| |
Problem:
Race:
Thread-1                                  Thread-2
1) Does ec_get_size_version() to perform
   pre-op fxattrop as part of write-1
                                          2) Calls ec_set_dirty_flag() in
                                             ec_get_size_version() for write-2.
                                             This sets dirty[] to 1
3) Completes executing
   ec_prepare_update_cbk leading to
   ctx->dirty[] = '1'
                                          4) Takes LOCK(inode->lock) to check if there are
                                             any flags and sets dirty-flag because
                                             lock->waiting_flag is 0 now. This leads to
                                             fxattrop to increment on-disk dirty[] to '2'
At the end of the writes the file will be marked for heal even when it doesn't need heal.
Fix:
Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] from being
marked as '1' in step 2) above
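The general pattern behind the fix (do the check and the update under the same lock) can be sketched as follows. This is an analogy in Python, not the EC implementation; `InodeCtx`, `mark_dirty_once` and `run_writers` are hypothetical names.

```python
import threading


class InodeCtx:
    """Toy stand-in for per-inode state shared by concurrent writes."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty_requested = False
        self.on_disk_dirty = 0

    def mark_dirty_once(self):
        # Check and set happen inside the same critical section, so only
        # the first writer increments the (simulated) on-disk counter.
        with self.lock:
            if self.dirty_requested:
                return False
            self.dirty_requested = True
            self.on_disk_dirty += 1
            return True


def run_writers(n=8):
    """Start n concurrent writers and return the final dirty counter."""
    ctx = InodeCtx()
    threads = [threading.Thread(target=ctx.mark_dirty_once) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ctx.on_disk_dirty
```

If the flag were tested outside the lock, two writers could both see it unset and both increment the counter, which mirrors the dirty[] value of '2' in the race above.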
Updates bz#1593224
Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
Problem:
On an EC volume, during upgrade from an older version where the
ctime feature is not enabled (or not present) to a newer
version where the ctime feature is available (enabled by default),
the self heal hangs and doesn't complete.
Cause:
The ctime feature has both client side code (utime) and
server side code (posix). The feature is driven from the client.
The server side should set the time attributes in the xattr only
if the client side sets the time in the frame. But posix
setattr/fsetattr was not doing that. When one of the server
nodes is updated, since ctime is enabled by default, it
starts setting the xattr on setattr/fsetattr on the updated node/brick.
On an EC volume the first two updated nodes (bricks) are not a
problem because there are 4 other bricks with consistent data.
However, once the third brick is updated, the new attribute (mdata xattr)
will cause an inconsistency in the metadata on 3 bricks, which
prevents the file from being repaired.
Fix:
Don't create mdata xattr with utimes/utimensat system call.
Only update if already present.
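The "update only if already present" rule can be sketched with a plain dictionary standing in for a file's xattrs. This is an illustration, not posix xlator code: the function name is hypothetical, and the key name is taken from a related ctime commit message in this log.

```python
MDATA_KEY = "trusted.glusterfs.mdata"  # xattr name as quoted elsewhere in this log


def update_mdata_if_present(xattrs, times):
    """Update stored time attributes, but never create the xattr here."""
    if MDATA_KEY not in xattrs:
        # An old file without the xattr stays untouched, so bricks
        # running different versions keep consistent metadata.
        return False
    merged = dict(xattrs[MDATA_KEY])
    merged.update(times)
    xattrs[MDATA_KEY] = merged
    return True
```

Under this rule, an upgraded brick would not introduce an xattr that the bricks still running the old version lack.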
Change-Id: Ieacedecb8a738bb437283ef3e0f042fd49dc4c8c
fixes: bz#1720201
Signed-off-by: Kotresh HR <khiremat@redhat.com>
| |
Also fixed some issues on test ec-1468261.t.
Change-Id: If156f86af986d9eed13cdd1f15c5a7214cd11706
Updates: bz#1193929
Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
| |
Currently, when a statedump is taken for an application that uses
glusterfs through glfsapi, the info is dumped to the /var/run/gluster dir.
This can be a concern, as this directory may be owned by some other
user, and hence taking the statedump may fail. Such applications
should have an option to use a different path.
This patch provides an API to do so.
Updates: bz#1689097
Change-Id: I8918e002bc823d83614c972b6c738baa04681b23
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
this is critical so all the tests will be contained in the same
directory, and one can just 'cp -a tests/ <any-location>/' and
run glusterfs tests.
only 'glfsxmp.c' was an exception as it was just copying the
file from api example directory. Now moved it to tests.
updates: bz#1193929
Change-Id: I00359d64be580bffc5b3c3a090968d86c2c6952a
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
Problem:
The test case is failing to heal the volume within $HEAL_TIMEOUT @195.
This is happening because as part of split-brain resolution the file
gets expunged from the sink and the new entry mark for that file will
be done on the source bricks as part of impunging. Since the source
bricks' shd threads failed to get the heal-domain lock, they will wait
for the heal-timeout of 10 minutes, which is greater than $HEAL_TIMEOUT.
Fix:
Set the cluster.heal-timeout to 5 seconds to trigger the heal so that
one of the source bricks heals the file within the $HEAL_TIMEOUT.
Change-Id: Ie73c578cc5361c0d617a48ccc86026734d20ba8c
fixes: bz#1718998
Signed-off-by: karthik-us <ksubrahm@redhat.com>
| |
* Also some logging enhancements in snapview-server
Change-Id: I6a7646771cedf4bd1c62806eea69d720bbaf0c83
fixes: bz#1715921
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
| |
Frequent intermittent failures observed.
```
08:59:24 ok 11 [ 10/ 3] < 36> 'write_to /mnt/glusterfs/0/test.txt test-message1'
08:59:24 ok 12 [ 10/ 6] < 37> 'test-message1 cat /mnt/glusterfs/0/test.txt'
08:59:24 ok 13 [ 10/ 4] < 38> 'test-message0 cat /mnt/glusterfs/1/test.txt'
08:59:24 not ok 14 [ 3715/ 6] < 45> 'test-message1 cat /mnt/glusterfs/1/test.txt' -> 'Got "test-message0" instead of "test-message1"'
08:59:24 ok 15 [ 10/ 162] < 47> 'gluster --mode=script --wignore volume set patchy features.cache-invalidation on'
08:59:24 ok 16 [ 10/ 148] < 48> 'gluster --mode=script --wignore volume set patchy performance.qr-cache-timeout 15'
```
updates: bz#1718191
Change-Id: Ieb9e5a9a428995ff178f77bc4a5155b8298d3fa0
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
The test is giving frequent failures in regression.
Error seen is normally like below:
`09:09:24 not ok 58 [ 14/ 80343] < 104> '^3$ number_healer_threads_shd patchy_distribute1 __afr_shd_healer_wait' -> 'Got "1" instead of "^3$"'`
updates: bz#1708929
Change-Id: I240bdcfb76b1f953d75937a53c5dfabba134f282
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
This patch adds more test cases for shd mux.
The test cases include:
1) Creating multiple volumes to check the attach and detach
of self heal daemon requests.
2) Make sure the healing happens in all scenarios
3) After a volume detach make sure the threads of the detached
volume are all cleaned up.
4) Repeat all the above tests for ec volume
5) Node Reboot case
6) glusterd restart cases
7) Add-brick/remove brick
8) Convert a distributed volume to disperse volume
9) Convert a replicated volume to distributed volume
Change-Id: I7c317ef9d23a45ffd831157e4890d7c83a8fce7b
fixes: bz#1708929
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
| |
Translators covered:
* playground/template
* debug/delay-gen
* debug/error-gen
* features/namespace
* features/quiesce
* meta
updates: bz#1693692
Change-Id: Ic8fde8efcb309ea492d8e819241f786f7ff467a1
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
updates: bz#1693692
Change-Id: If4c30572d4501d169bb4b0871c677d974515867c
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
Also some cleanup:
* old-protocol.t was actually added to make sure we have line-coverage
* first-test.t should have been removed as per the comment. It doesn't do anything.
* add statvfs to rpc-coverage so we can cover statvfs in few xlators.
updates: bz#1693692
Change-Id: Ie8651ce007de484c4abced16b4de765aa5e517be
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
updates: bz#1193929
Change-Id: Iee9aab8140882069165621189741f189fb2cc884
Signed-off-by: Kotresh HR <khiremat@redhat.com>
| |
updates: bz#1193929
Change-Id: Ic26ab5277f720c734f083150c1c541763dfa64aa
Signed-off-by: Kotresh HR <khiremat@redhat.com>
| |
add test for async Read/Write combinations
glfs_read_async/write_async
glfs_pread_async/pwrite_async
glfs_readv_async/writev_async
glfs_preadv_async/pwritev_async
ftruncate/ftruncate_async
fsync/fsync_async
fdatasync/fdatasync_async
Updates: #655
Change-Id: I12beb97029fd60bce79650a376d8fcd8d383ef16
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
| |
* add more fops: f{get,set,list,remove}xattr(), access(), fstat(), fsetattr(),
getxattr(), lgetxattr(), llistxattr(), lsetxattr(), fgetxattr()
* handle some error cases (like volume not found)
Updates: #655
Change-Id: I3334bdf3090eafd83a54e1be12036ea01b181089
Signed-off-by: Amar Tumballi <amarts@redhat.com>
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
| |
Problem:
afr_child_up_status_meta works only when LOOKUP on $M0 is successful.
There are cases where quorum is not met and LOOKUP fails on $M0 which
leads to failures similar to:
grep: /mnt/glusterfs/0/.meta/graphs/active/patchy-replicate-0/private: Transport endpoint is not connected
This was happening once in a while based on attribute-timeout and
md-cache not serving the lookup.
Fix:
Find child-up status based on statedump instead. Also changed mount
options to include --entry-timeout=0 and --attribute-timeout=0
updates bz#1193929
Change-Id: Ic0de72c3006d7399a5feb3e4d10d4748949b2ab3
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
fixes bz#1706603
Change-Id: I0bfd30f787f157b7a54f71088f767ccfd7621208
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
Running with 2 second sleep at this place caused failures like:
`not ok 14 [ 2014/ 7] < 41> 'test-message1 cat /mnt/glusterfs/1/test.txt' -> 'Got "test-message0" instead of "test-message1"'`
in a few runs out of 100 iterations. But with the sleep increased to more than 3 seconds,
no failures were seen in 100 runs.
While I don't know the exact reason for the behavior yet, it looks like this
longer wait helps the regression pass without failures.
updates: bz#1693692
Change-Id: I0610b79bea53e36de3eea6c11234b7fc9dfd6232
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
Change-Id: Iceefe22af754096c599dc570d4894d14fce4deae
Updates: bz#1193929
Signed-off-by: Xavier Hernandez <xhernandez@redhat.com>
| |
In uss.t multiple snapshots are taken, and after all the tests
they are left for the cleanup() function to remove.
Instead, delete the snapshots and the volume once all
the tests are over so that the cleanup operation becomes a relatively
light operation.
Change-Id: I2342740bbb185cd6c9a450eb3b4f5cbbba78974c
fixes: bz#1704888
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
| |
Add testcase to test snapshot creation
while I/O is happening with changelog
enabled.
updates: bz#1193929
Change-Id: Ice4cb596286c583ed7308484d65902007a48396c
Signed-off-by: Kotresh HR <khiremat@redhat.com>
| |
Change-Id: Ie0a5c522dfa0123ca45f9decf5015d39b92cb0f3
updates: bz#1693692
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
| |
updates: bz#1693692
Change-Id: I848e622d7b8562e864f0e208aafdc21d9cb757d3
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
| |
Currently EC tries to reopen fd's that have been opened while a brick
was down. This is done as part of regular write operations, just after
having acquired the locks, and it's sent as a sub-fop of the main write
fop.
There were two problems:
1. The reopen was attempted on all UP bricks, even if a previous lock
didn't succeed. This is incorrect because most probably the open will
fail.
2. If reopen is sent and fails, the error is propagated to the main
operation, causing it to fail when it shouldn't.
To fix this, we only attempt reopens on bricks where the current fop
owns a lock, and we prevent any error from being propagated to the main
fop.
To implement this behaviour, an argument used to indicate the minimum
number of required answers has been overloaded to also include some flags. To
make the change consistent, it has been necessary to rename the
argument, which means that a lot of files have been changed. However,
there are no functional changes.
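One way to picture an argument that carries both a minimum answer count and flags is the bit-packing sketch below. The bit layout, the flag names and the helper functions are assumptions for illustration only and are not taken from the EC sources.

```python
MIN_MASK = 0xFF                 # low bits hold the minimum number of answers (assumed width)
FLAG_LOCK_OWNERS_ONLY = 1 << 8  # hypothetical: only target bricks where a lock is owned
FLAG_IGNORE_ERRORS = 1 << 9     # hypothetical: do not propagate sub-fop errors


def pack(minimum, flags=0):
    """Combine a minimum answer count and flags into one integer."""
    if not 0 <= minimum <= MIN_MASK:
        raise ValueError("minimum out of range")
    return minimum | flags


def unpack(value):
    """Split a packed value back into (minimum, flags)."""
    return value & MIN_MASK, value & ~MIN_MASK


def reopen_targets(up_mask, locked_mask, value):
    """Pick the bricks (as a bitmask) on which a reopen would be attempted."""
    _, flags = unpack(value)
    if flags & FLAG_LOCK_OWNERS_ONLY:
        return up_mask & locked_mask
    return up_mask
```

Callers that only pass a count keep working unchanged, which fits the statement that the rename touches many files without functional changes.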
This change has also uncovered a problem in discard code, which didn't
correctly process requests of small sizes because no real discard fop
was being processed, only a write of 0's on some region. In this case
some fields of the fop remained uninitialized or with incorrect values.
To fix this, a new function has been created to simulate success on a
fop and it's used in the discard case.
Thanks to Pranith for providing a test script that has also detected an
issue in this patch. This patch includes a small modification of this
script to force data to be written into bricks before stopping them.
Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec
Fixes: bz#1699866
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
updates: bz#1699866
Change-Id: I7ccd1fc5fc134eeb6d443c755962a20819320d48
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
Problem:
Creation of tar file on gluster volume throws warning
'file changed as we read it'
Cause:
During readdirp, for a few of the files whose inode is not
present, time attributes were served from the backend. This caused
the ctime of a few files to differ before and after the readdir
done by tar.
Solution:
If ctime feature is enabled and inode is not present, don't
serve the time attributes from backend file, serve it from xattr.
fixes: bz#1698078
Change-Id: I427ef865f97399475faf5aa6ca495f7e317603ae
Signed-off-by: Kotresh HR <khiremat@redhat.com>
| |
Change-Id: I3556793c5e9d58cc6a08644b41dc5740fab2610b
updates: bz#1628194
Signed-off-by: N Balachandran <nbalacha@redhat.com>
| |
1) The placement of cloudsync xlator has been changed
to make it shard xlator's child. If cloudsync has to
work with shard in the graph, it needs to be child of shard.
Change-Id: Ib55424fdcb7ce8edae9f19b8a6e3d3ba86c1f0c4
fixes: bz#1642168
Signed-off-by: Anuradha Talur <atalur@commvault.com>
| |
Protocol implements every fop, and in general makes up a large part of
the codebase. Since our regression is run mostly on 1 machine,
there was no way of forcing the client to use the old protocol (while the new
one is available). With this patch, a new 'testing' option is provided
which forces the client to use the old protocol if found.
This should help increase the code coverage by at least 10k lines overall.
updates: bz#1693692
Change-Id: Ie45256f7dea250671b689c72b4b6f25037cef948
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
Test ec-cpu-extensions.t has been modified so that it uses a bigger
matrix. This makes use of more functions from ec-code-c.c. Changing
read-policy to round-robin further increases the number of functions used,
reaching 100% line and function coverage for this file.
Change-Id: I26e4d33269cbd67f5d76d862f4cf1e69285e85e1
updates: bz#1193929
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
this test alone covers most of code of trace xlator
updates: bz#1693692
Change-Id: I287c72ee89bd1c02d992b020d5644e8dac0b77ab
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
Part 1: refactor the dht_lookup_dir_cbk
and dht_selfheal_directory functions.
Added a simple dht selfheal directory test
Change-Id: I1410c26359e3c14b396adbe751937a52bd2fcff9
updates: bz#1590385
Signed-off-by: N Balachandran <nbalacha@redhat.com>
| |
When the split-brain choice is changed from one brick to another
brick, inode-invalidate is not called, so the readv call is served
from cache, leading to failures in split-brain-resolution.t.
Fixed it by calling inode_invalidate() when this happens.
updates bz#1193929
Change-Id: I2624614eec38c0303f3e1dc55dfae3d4b864218b
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
It helps in increased code coverage of playground.
updates: bz#1693692
Change-Id: I81bcf30be1450948a6360d8915f06b973387a560
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
Problem:
The shd daemon is per node, which means it creates a graph
with all volumes on it. While this is great for utilizing
resources, it is not so good in terms of performance and manageability,
because self-heal daemons don't have the capability to automatically
reconfigure their graphs. So each time a configuration
change happens to the volumes (replicate/disperse), we need to restart
shd to bring the changes into the graph.
Because of this, all ongoing heals for all other volumes have to be
stopped in the middle and restarted all over again.
Solution:
This change makes shd a per-volume daemon, so that a graph
will be generated for each volume.
When we want to start/reconfigure shd for a volume, we first search
for an existing shd running on the node. If there is none, we
start a new process. If an shd daemon is already running, then
we simply detach the graph for the volume and reattach the updated
graph for the volume. This won't touch any of the ongoing operations
for any other volumes on the shd daemon.
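The start-or-reconfigure flow described above can be modelled with a toy sketch. It is not glusterd or shd code; the class and method names (`ShdProcess`, `Node`, `start_or_reconfigure`) are hypothetical and a "graph" is just an opaque Python object here.

```python
class ShdProcess:
    """Toy model of one self-heal daemon process holding per-volume graphs."""

    def __init__(self):
        self.graphs = {}

    def reconfigure(self, volume, graph):
        self.graphs.pop(volume, None)   # detach the old graph, if any
        self.graphs[volume] = graph     # attach the updated graph


class Node:
    """Toy model of a node that runs at most one shd process."""

    def __init__(self):
        self.shd = None
        self.spawned = 0

    def start_or_reconfigure(self, volume, graph):
        if self.shd is None:
            # No shd is running on this node yet: start one.
            self.shd = ShdProcess()
            self.spawned += 1
        self.shd.reconfigure(volume, graph)
```

Only the entry for the reconfigured volume is replaced; the graph objects of the other volumes are left as they are, which corresponds to ongoing heals not being interrupted.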
Example of an shd graph when it is per volume:
                       graph
             -----------------------
             |     debug-iostat    |
             -----------------------
            /           |           \
           /            |            \
     ---------      ---------      ---------
     | AFR-1 |      | AFR-2 |      | AFR-3 |
     ---------      ---------      ---------
A running shd daemon with 3 volumes will look like:
                       graph
             -----------------------
             |     debug-iostat    |
             -----------------------
            /           |           \
           /            |            \
   ------------   ------------   ------------
   | volume-1 |   | volume-2 |   | volume-3 |
   ------------   ------------   ------------
Change-Id: Idcb2698be3eeb95beaac47125565c93370afbd99
fixes: bz#1659708
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
| |
This patch implements a thread pool that is wait-free for adding jobs to
the queue and uses a very small locked region to get jobs. This makes it
possible to decrease contention drastically. It's based on the wfcqueue
structure provided by the urcu library.
It automatically enables more threads when load demands it, and stops
them when not needed. There's a maximum number of threads that can be
used. This value can be configured.
Depending on the workload, the maximum number of threads plays an
important role. So it needs to be configured for optimal performance.
Currently the thread pool doesn't self adjust the maximum for the
workload, so this configuration needs to be changed manually.
For this reason, the global thread pool has been made optional, so that
volumes can still use the thread pool provided by io-threads.
To enable it for bricks, the following option needs to be set:
config.global-threading = on
This option has no effect if bricks are already running. A restart is
required to activate it. It's recommended to also enable the following
option when running bricks with the global thread pool:
performance.iot-pass-through = on
To enable it for a FUSE mount point, the option '--global-threading'
must be added to the mount command. To change it, an umount and remount
is needed. It's recommended to disable the following option when using
global threading on a mount point:
performance.client-io-threads = off
To enable it for services managed by glusterd, glusterd needs to be
started with option '--global-threading'. In this case all daemons, like
self-heal, will be using the global thread pool.
Currently it can only be enabled for bricks, FUSE mounts and glusterd
services.
The maximum number of threads for clients and bricks can be configured
using the following options:
config.client-threads
config.brick-threads
These options can be applied online and their effect is immediate most of
the time. If one of them is set to 0, the maximum number of threads
will be calculated as #cores * 2.
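The rule for the value 0 amounts to the small calculation below. This is a sketch only: the function name and the optional `cores` parameter are hypothetical, and the real code may detect the core count differently.

```python
import os


def effective_max_threads(configured, cores=None):
    """Return the thread limit implied by a configured value (0 means automatic)."""
    if configured < 0:
        raise ValueError("thread count cannot be negative")
    if configured:
        return configured
    if cores is None:
        cores = os.cpu_count() or 1  # fall back to 1 if the count is unknown
    return cores * 2
```

For example, a configured value of 0 on a 4-core machine gives a maximum of 8 threads, while any non-zero value is used as is.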
Some distributions use a very old userspace-rcu library (version 0.7).
For this reason, some header files from version 0.10 have been copied
into contrib/userspace-rcu and are used if the detected version is 0.7
or older.
An additional change has been made to io-threads to prevent threads
from being started when iot-pass-through is set.
Change-Id: I09d19e246b9e6d53c6247b29dfca6af6ee00a24b
updates: #532
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
Without this patch the following error is seen:
....
warning: implicit declaration of function ‘makedev’ [-Wimplicit-function-declaration]
ret = mknod("cspecial", S_IFCHR | S_IRWXU | S_IRWXG, makedev(2, 3));
^~~~~~~
/usr/bin/ld: /tmp/ccIVwT46.o: in function `path_based_fops':
/home/pk/workspace/gerrit-repo/tests/basic/fops-sanity.c:478:
undefined reference to `makedev'
....
updates bz#1676797
Change-Id: I8a17c38fdfd458dd2dc75f4c7e2bf20ce555a042
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
Auto invalidation is necessary when the same (meta)data is shared/accessed
across multiple mounts. However, if (meta)data is not shared, all
relevant I/O goes through the cache of single mount and hence is
coherent with (meta)data on bricks always. So, fuse-auto-invalidation
can be disabled for this case which gives a huge performance boost for
workloads that write data and then immediately read the data they just
wrote.
From glusterfs --help,
<snip>
--auto-invalidation[=BOOL] controls whether fuse-kernel can
auto-invalidate attribute, dentry and page-cache.
Disable this only if same files/directories are
not accessed across two different mounts
concurrently [default: "on"]
</snip>
Details on how disabling auto-invalidation helped to reduce pgbench
init times can be found at [1]. Time taken for pgbench init of scale
8000 was 8340s. That is an improvement of 86% (59280s vs 8340s)
with auto-invalidations turned off along with other
optimizations. Just disabling auto-invalidation contributed a 56%
improvement by reducing the total time taken by 33260s.
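The two percentages follow from the quoted timings, assuming that 59280s is the baseline both are measured against. The helper name below is hypothetical.

```python
def improvement_percent(baseline_s, saved_s):
    """Percentage of the baseline time that was saved."""
    return 100.0 * saved_s / baseline_s


BASELINE_S = 59280   # pgbench init time before the optimizations (from the text)
FINAL_S = 8340       # time with all optimizations (from the text)
AUTO_INVAL_S = 33260 # time saved by disabling auto-invalidation alone (from the text)

total = improvement_percent(BASELINE_S, BASELINE_S - FINAL_S)
auto_only = improvement_percent(BASELINE_S, AUTO_INVAL_S)
```

Here `total` is about 85.9 and `auto_only` about 56.1, which round to the 86% and 56% figures given above.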
[1] https://www.spinics.net/lists/gluster-devel/msg25907.html
Change-Id: I0ed730dba9064bd9c576ad1800170a21e100e1ce
Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
updates: bz#1664934
| |
If a fop to create an entry fails on one of the data bricks,
we mark the pending changelog on the entry on the brick for which
it was successful. This is done as part of the post-op phase to
make sure that the entry gets healed even if it gets renamed to
some other path where its parent was not marked as bad.
As it happens as part of post-op, we should consider thin-arbiter
to check if the brick, which was successful, is the good brick or not.
This will avoid split brain and other issues.
Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
updates: bz#1662264
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
| |
Fixes: bz#1665358
Change-Id: Idbf88ec3ac683733b32c313377eeb72f2819bf0d
Signed-off-by: Amar Tumballi <amarts@redhat.com>
| |
There is a low level security issue with fencing since one client
can preempt another client's lock.
This patch does not completely eliminate the issue of a client
misbehaving, but it certainly adds a security layer for default use cases
that do not need fencing.
Change-Id: I55cd15f2ed1ae0f2556e3d27a2ef4bc10fdada1c
updates: #466
Signed-off-by: Susant Palai <spalai@redhat.com>
| |
design reference: https://review.gluster.org/#/c/glusterfs-specs/+/21925/
This patch adds the lock preempt support.
Note: The current model stores lock enforcement information as separate
xattr on disk. There is another effort going in parallel to store this
in stat(x) of the file. This patch is self sufficient to add fencing
support. Based on the availability of the stat(x) support either I will
rebase this patch or we can modify the necessary bits post merging this
patch.
Change-Id: If4a42f3e0afaee1f66cdb0360ad4e0c005b5b017
updates: #466
Signed-off-by: Susant Palai <spalai@redhat.com>
| |
With this changeset, default value for the AFR client side
heal volume option is set to "off"
fixes: bz#1663102
Change-Id: Ie4016932339c4896487e3e7cb5caca68739b7ba2
Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
|