summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/ec/src/ec.c
Commit message (Collapse)AuthorAgeFilesLines
* cluster/ec: Implement read-mask featurePranith Kumar K2019-09-271-0/+77
| | | | | | fixes: #725 Change-Id: Iaaefe6f49c8193c476b987b92df6bab3e2f62601 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: prevent filling shd log with "table not found" messagesXavi Hernandez2019-09-261-2/+13
| | | | | | | | | | | | | | | When self-heal daemon receives an inodelk contention notification, it tries to locate the related inode using inode_find() and the inode table owned by top-most xlator, which in this case doesn't have any inode table. This causes many messages to be logged by inode_find() function because the inode table passed is NULL. This patch prevents this by making sure the inode table is not NULL before calling inode_find(). Change-Id: I8d001bd180aaaf1521ba40a536b097fcf70c991f Fixes: bz#1755344 Signed-off-by: Xavi Hernandez <jahernan@redhat.com>
* cluster/ec: quorum-count implementationPranith Kumar K2019-09-081-0/+13
| | | | | | fixes: #721 Change-Id: I5333540e3c635ccf441cf1f4696e4c8986e38ea8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* ec/fini: Fix race between xlator cleanup and on going async fopMohammed Rafi KC2019-06-081-12/+25
| | | | | | | | | | | | | | | | Problem: While we process a cleanup, there is a chance for a race between async operations, for example ec_launch_replace_heal. So this can lead to invalid mem access. Solution: Just like we track on going heal fops, we can also track fops like ec_launch_replace_heal, so that we can decide when to send a PARENT_DOWN request. Change-Id: I055391c5c6c34d58aef7336847f3b570cb831298 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* ec/fini: Fix race with ec_fini and ec_notifyMohammed Rafi KC2019-05-211-0/+3
| | | | | | | | | | | | | | | During a graph cleanup, we first sent a PARENT_DOWN and wait for a child down to ultimately free the xlator and the graph. In the ec xlator, we cleanup the threads when we get a PARENT_DOWN event. But a racing event like CHILD_UP or event xl_op may trigger healing threads after threads cleanup. So there is a chance that the threads might access a freed private variabe Change-Id: I252d10181bb67b95900c903d479de707a8489532 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* ec/shd: Cleanup self heal daemon resources during ec finiMohammed Rafi KC2019-05-131-0/+47
| | | | | | | | | | We were not properly cleaning self-heal daemon resources during ec fini. With shd multiplexing, it is absolutely necessary to cleanup all the resources during ec fini. Change-Id: Iae4f1bce7d8c2e1da51ac568700a51088f3cc7f2 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/ec: fix shd healer wait timeoutKinglong Mee2019-05-061-0/+1
| | | | | | | | | Right now, the timeout is written by hard code, fix it by using heal-timeout. fixes: bz#1703020 Change-Id: I0d154e7807f9dba7efc3896805559bbfaa7af2ad Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* cluster/ec: fix fd reopenXavi Hernandez2019-04-231-20/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently EC tries to reopen fd's that have been opened while a brick was down. This is done as part of regular write operations, just after having acquired the locks, and it's sent as a sub-fop of the main write fop. There were two problems: 1. The reopen was attempted on all UP bricks, even if a previous lock didn't succeed. This is incorrect because most probably the open will fail. 2. If reopen is sent and fails, the error is propagated to the main operation, causing it to fail when it shouldn't. To fix this, we only attempt reopens on bricks where the current fop owns a lock, and we prevent any error to be propagated to the main fop. To implement this behaviour an argument used to indicate the minimum number of required answers has overloaded to also include some flags. To make the change consistent, it has been necessary to rename the argument, which means that a lot of files have been changed. However there are no functional changes. This change has also uncovered a problem in discard code, which didn't correctely process requests of small sizes because no real discard fop was being processed, only a write of 0's on some region. In this case some fields of the fop remained uninitialized or with incorrect values. To fix this, a new function has been created to simulate success on a fop and it's used in the discard case. Thanks to Pranith for providing a test script that has also detected an issue in this patch. This patch includes a small modification of this script to force data to be written into bricks before stopping them. Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec Fixes: bz#1699866 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* xlator: make 'xlator_api' mandatoryAmar Tumballi2018-12-131-1/+19
| | | | | | | | | | | | | | * Remove the options to load old symbol. * keep only 'xlator_api' symbol from being exported using xlator.sym * add xlator_api to all the xlators where its missing NOTE: This covers all the xlators which has at least a test case to validate its loading. If there is a translator, which doesn't have any test, then we should probably remove that from codebase. fixes: #164 Change-Id: Ibcdc8c9844cda6b4463d907a15813745d14c1ebb Signed-off-by: Amar Tumballi <amarts@redhat.com>
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* all: fix the format string exceptionsAmar Tumballi2018-11-051-9/+9
| | | | | | | | | | | | | | | | Currently, there are possibilities in few places, where a user-controlled (like filename, program parameter etc) string can be passed as 'fmt' for printf(), which can lead to segfault, if the user's string contains '%s', '%d' in it. While fixing it, makes sense to make the explicit check for such issues across the codebase, by making the format call properly. Fixes: CVE-2018-14661 Fixes: bz#1644763 Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* all: fix warnings on non 64-bits architecturesXavi Hernandez2018-10-101-1/+1
| | | | | | | | | | When compiling in other architectures there appear many warnings. Some of them are actual problems that prevent gluster to work correctly on those architectures. Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0 fixes: bz#1632717 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* Land part 2 of clang-format changesGluster Ant2018-09-121-945/+935
| | | | | Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
* All: run codespell on the code and fix issues.Yaniv Kaul2018-07-221-1/+1
| | | | | | | | | | | | Please review, it's not always just the comments that were fixed. I've had to revert of course all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr,ec: Print if the subvolume is up in statedumpPranith Kumar K2018-07-031-0/+1
| | | | | | fixes bz#1597156 Change-Id: I323eb9190e40b12df216698dcdba74a6d336beeb Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix pre-op xattrop managementXavi Hernandez2018-05-231-0/+1
| | | | | | | | | | | | | | | | | | | | Multiple pre-op xattrop can be simultaneously being processed. On the cbk it was checked if the fop was waiting for some specific data (like size and version) and, if so, it was assumed that this answer should contain that data. This is not true, since a fop can be waiting for some data, but it may come from the xattrop of another fop. This patch differentiates between needing some information and providing it. This is related to parallel writes. Disabling them fixed the problem, but also prevented concurrent reads. A change has been made so that disabling parallel writes still allows parallel reads. Fixes: bz#1578325 Change-Id: I74772ad6b80b7b37805da93d5ec3ae099e96b041 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Turn ON the stripe-cache option by defaultAshish Pandey2018-04-061-1/+1
| | | | | | Change-Id: I0a290396c30c635b13ee73004d20259efb76a954 fixes: bz#1563945 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: send list-node-uuids request to all subvolumesXavi Hernandez2018-03-281-1/+1
| | | | | | | | | | | | The xattr trusted.glusterfs.list-node-uuids was only sent to a single subvolume. This was returning null uuids from the other subvolumes as if they were down. This fix forces that xattr to be requested from all subvolumes. Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f fixes: bz#1561406 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Change default read policy to gfid-hashAshish Pandey2018-03-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Whenever we read data from file over NFS, NFS reads more data then requested and caches it. Based on the stat information it makes sure that the cached/pre-read data is valid or not. Consider 4 + 2 EC volume and all the bricks are on differnt nodes. In EC, with round-robin read policy, reads are sent on different set of data bricks. This way, it balances the read fops to go on all the bricks and avoid heating UP (overloading) same set of bricks. Due to small difference in clock speed, it is possible that we get minor difference for atime, mtime or ctime for different bricks. That might cause a different stat returned to NFS based on which NFS will discard cached/pre-read data which is actually not changed and could be used. Solution: Change read policy for EC as gfid-hash. That will force all the read to go to same set of bricks. Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84 BUG: 1554743 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: avoid delays in self-healXavi Hernandez2018-03-141-41/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Self-heal creates a thread per brick to sweep the index looking for files that need to be healed. These threads are started before the volume comes online, so nothing is done but waiting for the next sweep. This happens once per minute. When a replace brick command is executed, the new graph is loaded and all index sweeper threads started. When all bricks have reported, a getxattr request is sent to the root directory of the volume. This causes a heal on it (because the new brick doesn't have good data), and marks its contents as pending to be healed. This is done by the index sweeper thread on the next round, one minute later. This patch solves this problem by waking all index sweeper threads after a successful check on the root directory. Additionally, the index sweep thread scans the index directory sequentially, but it might happen that after healing a directory entry more index entries are created but skipped by the current directory scan. This causes the remaining entries to be processed on the next round, one minute later. The same can happen in the next round, so the heal is running in bursts and taking a lot to finish, specially on volumes with many directory levels. This patch solves this problem by immediately restarting the index sweep if a directory has been healed. Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e BUG: 1547662 Signed-off-by: Xavi Hernandez <jahernan@redhat.com>
* cluster/ec : EC options for GD2Sunil Kumar Acharya2018-01-221-1/+34
| | | | | | | Updates #302 Change-Id: I31b4648f7b1a394fceece5cba8120c579c66edd9 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* locks: added inodelk/entrylk contention upcall notificationsXavier Hernandez2018-01-161-7/+70
| | | | | | | | | | | | | | The locks xlator now is able to send a contention notification to the current owner of the lock. This is only a notification that can be used to improve performance of some client side operations that might benefit from extended duration of lock ownership. Nothing is done if the lock owner decides to ignore the message and to not release the lock. For forced release of acquired resources, leases must be used. Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Fix possible shift overflowXavier Hernandez2017-12-221-7/+0
| | | | | | | | | | A coverity scan has revelaed a potential shift overflow while scanning the bitmap of available subvolumes. The actual overflow cannot happen, but I've changed to test used to control the limit to make it explicit. Change-Id: Ieb55f010bbca68a1d86a93e47822f7c709a26e83 BUG: 789278 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/ec: Change [f]getxattr to parallel-dispatch-onePranith Kumar K2017-12-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment in EC, [f]getxattr operations wait to acquire a lock while other operations are in progress even when it is in the same mount with a lock on the file/directory. This happens because [f]getxattr operations follow the model where the operation is wound on 'k' of the bricks and are matched to make sure the data returned is same on all of them. This consistency check requires that no other operations are on-going while [f]getxattr operations are wound to the bricks. We can perform [f]getxattr in another way as well, where we find the good_mask from the lock that is already granted and wind the operation on any one of the good bricks and unwind the answer after adjusting size/blocks to the parent xlator. Since we are taking into account good_mask, the reply we get will either be before or after a possible on-going operation. Using this method, the operation doesn't need to depend on completion of on-going operations which could be taking long time (In case of some slow disks and writes are in progress etc). Thus we reduce the time to serve [f]getxattr requests. I changed [f]getxattr to dispatch-one and added extra logic in ec_link_has_lock_conflict() to not have any conflicts for fops with EC_MINIMUM_ONE as fop->minimum to achieve the effect described above. Modified scripts to make sure READ fop is received in EC to trigger heals. Updates gluster/glusterfs#368 Change-Id: I3b4ebf89181c336b7b8d5471b0454f016cdaf296 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Add default value for the redundancy optionKamal Mohanan2017-12-211-0/+1
| | | | | | | Updates #302 Change-Id: Ifccc6a64c2b2a3b3a0133e6c8042cdf61a0838ce Signed-off-by: Kamal Mohanan <kmohanan@redhat.com>
* cluster/ec: Fix op-version for disperse.other-eager-lockXavier Hernandez2017-11-141-1/+1
| | | | | | | | | The op-version used for the new option was wrong. It has been set to 3.13.0. Change-Id: I88fbd7834e4a8018c8906303e734c251e90be8cf BUG: 1502610 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/ec: Keep last written strip in in-memory cacheAshish Pandey2017-11-101-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Consider an EC volume with configuration 4 + 2. The stripe size for this would be 512 * 4 = 2048. That means, 2048 bytes of user data stored in one stripe. Let's say 2048 + 512 = 2560 bytes are already written on this volume. 512 Bytes would be in second stripe. Now, if there are sequential writes with offset 2560 and of size 1 Byte, we have to read the whole stripe, encode it with 1 Byte and then again have to write it back. Next, write with offset 2561 and size of 1 Byte will again READ-MODIFY-WRITE the whole stripe. This is causing bad performance because of lots of READ request travelling over the network. There are some tools and scenario's where such kind of load is coming and users are not aware of that. Example: fio and zip Solution: One possible solution to deal with this issue is to keep last stripe in memory. This way, we need not to read it again and we can save READ fop going over the network. Considering the above example, we have to keep last 2048 bytes (maximum) in memory per file. Change-Id: I3f95e6fc3ff81953646d374c445a40c6886b0b85 BUG: 1471753 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: create eager-lock option for non-regular filesXavier Hernandez2017-11-051-12/+25
| | | | | | | | | A new option is added to allow independent configuration of eager locking for regular files and non-regular files. Change-Id: I8f80e46d36d8551011132b15c0fac549b7fb1c60 BUG: 1502610 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/ec: Implement DISCARD FOP for ECSunil Kumar Acharya2017-10-251-1/+2
| | | | | | | | | | | Updates #254 This code change implements DISCARD FOP support for EC. BUG: 1461018 Change-Id: I09a9cb2aa9d91ec27add4f422dc9074af5b8b2db Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/ec: Allow parallel writes in EC if possiblePranith Kumar K2017-10-241-21/+30
| | | | | | | | | | | | | | | | | | Problem: Ec at the moment sends one modification fop after another, so if some of the disks become slow, for a while then the wait time for the writes that are waiting in the queue becomes really bad. Fix: Allow parallel writes when possible. For this we need to make 3 changes. 1) Each fop now has range parameters they will be updating. 2) Xattrop is changed to handle parallel xattrop requests where some would be modifying just dirty xattr. 3) Fops that refer to size now take locks and update the locks. Fixes #251 Change-Id: Ibc3c15372f91bbd6fb617f0d99399b3149fa64b2 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* ec: Increase notification in all the casesAshish Pandey2017-06-281-31/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: "gluster v heal <volname> info" is taking long time to respond when a brick is down. RCA: Heal info command does virtual mount. EC wait for 10 seconds, before sending UP call to upper xlator, to get notification (DOWN or UP) from all the bricks. Currently, we are increasing ec->xl_notify_count based on the current status of the brick. So, if a DOWN event notification has come and brick is already down, we are not increasing ec->xl_notify_count in ec_handle_down. Solution: Handle DOWN even as notification irrespective of what is the current status of brick. Change-Id: I0acac0db7ec7622d4c0584692e88ad52f45a910f BUG: 1464091 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: https://review.gluster.org/17606 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/ec: Implement FALLOCATE FOP for ECSunil Kumar Acharya2017-05-231-2/+3
| | | | | | | | | | | | | | | FALLOCATE file operations is not implemented in the existing EC code. This change set implements it for EC. BUG: 1448293 Change-Id: Id9ed914db984c327c16878a5b2304a0ea461b623 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com> Reviewed-on: https://review.gluster.org/15200 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: return all node uuids from all subvolumesXavier Hernandez2017-05-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | EC was retuning the UUID of the brick with smaller value. This had the side effect of not evenly balancing the load between bricks on rebalance operations. This patch modifies the common functions that combine multiple subvolume values into a single result to take into account the subvolume order and, optionally, other subvolumes that could be damaged. This makes easier to add future features where brick order is important. It also makes possible to easily identify the originating brick of each answer, in case some brick will have an special meaning in the future. Change-Id: Iee0a4da710b41224a6dc8e13fa8dcddb36c73a2f BUG: 1366817 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: https://review.gluster.org/17297 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ashish Pandey <aspandey@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Implement self-heal-window_size optionSunil Kumar Acharya2017-04-251-0/+14
| | | | | | | | | | | | | | | | Fix implements the heal window size option for EC. This option control the maximum size of read/write operation carried out in self-heal process. BUG: 1441491 Change-Id: I6c0ef65c9ca18b0828f91b319d4f52ac5b77d0d8 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com> Reviewed-on: https://review.gluster.org/17098 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Don't mark dirty on entry/meta ops in query-infoPranith Kumar K2017-03-071-3/+2
| | | | | | | | | | | | | | | | | | | | | We wanted to mark dirty for metadata/entry operations whenever query-info is set and info is not yet there because we are anyway sending xattrop over the network. But this is causing 25% regression from 3.8.8 so removing this optimization Also fixed two small issues that we didn't find in the previous patch 1) reconfigure failure was sending return value 0 for optimistic-changelog 2) ec->optimistic_changelog was set to true even before OPTION_INIT BUG: 1408809 Change-Id: Iabb0b64bd4d3623688790e4b67e5c20b4da977a1 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/16865 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Introduce optimistic changelog in ECPranith Kumar K2017-03-041-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Fix to https://bugzilla.redhat.com/show_bug.cgi?id=1316873 has made changes to set dirty flag before every update fop, data or metadata, and unset it after successful operation. That makes some of the fops very slow such as entry operations or metadata operations. Solution: File data operations are the only operation which take some time and setting dirty flag before a fop and unsetting it after serves the purpose as probability of failure of a fop is high when the time duration is more. For all the other operations, set dirty flag at the end of the fop, if any brick is down and need heal. Providing following option to choose between high performance or better heal marking for metadata and entry fops. Set/Unset dirty flag for every update fop at the start of the fop. If ON, this option impacts performance of entry operations or metadata operations as it will set dirty flag at the start and unset it at the end of ALL update fop. If OFF and all the bricks are good, dirty flag will be set at the start only for file fops For metadata and entry fops dirty flag will not be set at the start, if all the bricks are good. This does not impact performance for metadata operations and entry operation but has a very small window to miss marking entry as dirty in case it is required to be healed. Thanks to Xavi and Ashish for the design Picked the .t file from Ashish' patch https://review.gluster.org/16298 BUG: 1408809 Change-Id: I3ce860063f0e2901e50754dcfc3e4ed22daf819f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/16821 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Xavier Hernandez <xhernandez@datalab.es> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: fix selinux issues with mmap()Xavier Hernandez2017-02-021-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC uses mmap() to create a memory area for the dynamic code. Since the code is created on the fly and executed when needed, this region of memory needs to have write and execution privileges. This combination is not allowed by default by selinux. To solve the problem a file is used as a backend storage for the dynamic code and it's mapped into two distinct memory regions, one with write access and the other one with execution access. This approach is the recommended way to create dynamic code by a program in a more secure way, and selinux allows it. Additionally selinux requires that the backend file be stored in a directory marked with type bin_t to be able to map it in an executable area. To satisfy this condition, GLUSTERFS_LIBEXECDIR has been used. This fix also changes the error check for mmap(), that was done incorrectly (it checked against NULL instead of MAP_FAILED), and it also correctly propagates the error codes and makes sure they aren't silently ignored. Change-Id: I71c2f88be4e4d795b6cfff96ab3799c362c54291 BUG: 1402661 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: https://review.gluster.org/16405 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* core: run many bricks within one glusterfsd processJeff Darcy2017-01-301-10/+9
| | | | | | | | | | | | | | | | | | | | | | | This patch adds support for multiple brick translator stacks running in a single brick server process. This reduces our per-brick memory usage by approximately 3x, and our appetite for TCP ports even more. It also creates potential to avoid process/thread thrashing, and to improve QoS by scheduling more carefully across the bricks, but realizing that potential will require further work. Multiplexing is controlled by the "cluster.brick-multiplex" global option. By default it's off, and bricks are started in separate processes as before. If multiplexing is enabled, then *compatible* bricks (mostly those with the same transport options) will be started in the same process. Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb BUG: 1385758 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: https://review.gluster.org/14763 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Invalidations in disperse volume should not update the statPoornima G2017-01-051-0/+13
| | | | | | | | | | | | | | | | | | | | | | Issue: In disperse volume, the file is present across bricks, hence the stat from one brick doesn't carry the valid size of the file. Therefore the upcall from one brick updating the md-cache results in wrong size being updated. Fix: If the notification is cache invalidation then, indicate md-cache that the attributes is invalid. BUG: 1410375 Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/16329 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Fix lk-owner set race in ec_unlockPranith Kumar K2016-12-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | Problem: Rename does two locks. There is a case where when it tries to unlock it sends xattrop of the directory with new version, callback of these two xattrops can be picked up by two separate epoll threads. Both of them will try to set the lk-owner for unlock in parallel on the same frame so one of these unlocks will fail because the lk-owner doesn't match. Fix: Specify the lk-owner which will be set on inodelk frame which will not be over written by any other thread/operation. BUG: 1402710 Change-Id: I666ffc931440dc5253d72df666efe0ef1d73f99a Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16074 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr,dht,ec: Replace GF_EVENT_CHILD_MODIFIED with event SOME_DESCENDENT_DOWN/UPPoornima G2016-11-211-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently these are few events related to child_up/down: GF_EVENT_CHILD_UP : Issued when any of the protocol client connects. GF_EVENT_CHILD_MODIFIED : Issued by afr/dht/ec GF_EVENT_CHILD_DOWN : Issued when any of the protocol client disconnects. These events get modified at the dht/afr/ec layers. Here is a brief on the same. DHT: - All the subvolumes reported once, and atleast one child came up, then GF_EVENT_CHILD_UP is issued - connect GF_EVENT_CHILD_UP is issued - disconnect GF_EVENT_CHILD_MODIFIED is issued - All the subvolumes disconnected, GF_EVENT_CHILD_DOWN is issued AFR: - First subvolume came up, then GF_EVENT_CHILD_UP is issued - Subsequent subvolumes coming up, results in GF_EVENT_CHILD_MODIFIED - Any of the subvolumes go down, then GF_EVENT_SOME_CHILD_DOWN is issued - Last up subvolume goes down, then GF_EVENT_CHILD_DOWN is issued Until the patch [1] introduced GF_EVENT_SOME_CHILD_UP, GF_EVENT_CHILD_MODIFIED was issued by afr/dht when any of the subvolumes go up or down. Now with md-cache changes, there is a necessity to differentiate between child up and down. Hence, introducing GF_EVENT_SOME_DESCENDENT_DOWN/UP and getting rid of GF_EVENT_CHILD_MODIFIED. [1] http://review.gluster.org/12573 Change-Id: I704140b6598f7ec705493251d2dbc4191c965a58 BUG: 1396038 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15764 CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Rajesh Joseph <rjoseph@redhat.com>
* cluster/ec: Implement heal info with lockAshish Pandey2016-10-111-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently heal info command prints all the files/directories if the index for the file/directory is present in .glusterfs/indices folder. After implementing patch http://review.gluster.org/#/c/13733/ indices of the file which is going through update fop will also be present in .glusterfs/indices even if the fop is successful on all the brick. At this time if heal info command is being used, it will also display this file which is actually healthy and does not require any heal. Solution: Take lock on a file corresponding to the indices and inspect xattrs to decide if the file needs heal or not. Change-Id: I6361e2813ece369be12d02e74816df4eddb81cfa BUG: 1366815 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/15543 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* ec: Implement ipc fopPoornima G2016-09-251-1/+9
| | | | | | | | | | | | | | | The ipc will be wound to all the bricks, but for it to be successfull, the fop should succeed on minimum number of bricks. Change-Id: I3f8cb6a349e87bafd0773583def9d4e3765aa140 BUG: 1211863 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15387 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ashish Pandey <aspandey@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Add support for hardware accelerationXavier Hernandez2016-09-081-9/+44
| | | | | | | | | | | | | | | | | | | | | | | This patch implements functionalities for fast encoding/decoding using hardware support. Currently optimized x86_64, SSE and AVX is added. Additionally this patch implements a caching mecanism for inverse matrices to reduce computation time, as well as a new method for computing the inverse that takes quadratic time instead of cubic. Finally some unnecessary memory copies have been eliminated to further increase performance. Change-Id: I26c75f26fb4201bd22b51335448ea4357235065a BUG: 1289922 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/12837 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Add events for EC translatorAshish Pandey2016-09-071-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch will generates events in following cases which will be consumed by new event framework. Consider an EC volume with (K+M) configuration K = Data bricks M = Redundancy bricks 1- EVENT_EC_MIN_BRICKS_NOT_UP - When minimum "K" number of bricks, required for any ec fop, are not up. 2- EVENT_EC_MIN_BRICKS_UP When minimum "K" number of bricks, required for any ec fop, are up. Change-Id: I0414b8968c39740a171e5aa14b087afd524d574f BUG: 1371470 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/15348 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Do multi-threaded self-healPranith Kumar K2016-08-241-0/+25
| | | | | | | | | | | BUG: 1368451 Change-Id: I5d6b91d714ad6906dc478a401e614115c89a8fbb Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15083 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ashish Pandey <aspandey@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Restrict the launch of replace brick healAshish Pandey2016-06-061-1/+2
| | | | | | | | | | | | | | | | | | | | Problem: When features.cache-invalidation is ON, a lot of ec_notify function gets called which leads to launch of too many heals. This leads to no heal completion, which causes accumulation of heals. Solution: ec_launch_replace_heal should not be launch for every event. Replace brick will trigger a child up event and then only this heal function should be called. Change-Id: I57b44c6a279d57230daea1d93229be6069245b7d BUG: 1342796 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/14649 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Smoke: Gluster Build System <jenkins@build.gluster.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
* cluster/ec: Add/Modify description for eager-lock optionAshish Pandey2016-06-031-2/+12
| | | | | | | | | | | | | | | | | | | This patch provides description for disperse.eager-lock option for disperse volume. It also modifies the description for cluster.eager-lock option to indicate that this option is only for replica volume. Change-Id: Ie73298947fcaaa6aaf825978bc2d27ceaff386d2 BUG: 1327171 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/13999 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Provide an option to enable/disable eager lockAshish Pandey2016-03-151-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: If a fop takes lock, and completes its operation, it waits for 1 second before releasing the lock. However, If ec find any lock contention within this time period, it release the lock immediately before time expires. As we take lock on first brick, for few operations, like read, it might happen that discovery of lock contention might take long time and can degrades the performance. Solution: Provide an option to enable/disable eager lock. If eager lock is disabled, lock will be released as soon as fop completes. gluster v set <VOLUME NAME> disperse.eager-lock on gluster v set <VOLUME NAME> disperse.eager-lock off Change-Id: I000985a787eba3c190fdcd5981dfbf04e64af166 BUG: 1314649 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/13605 Smoke: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/ec: Automate heal for replace brickAshish Pandey2016-02-081-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: After a replace brick command, newly added brick does not contain data which existed on old brick. Solution: Do getxattr after initialization of all the bricks. This will trigger heal for brick root as soon as it finds the version mismatch on newly added brick. Removing tests from ec-new-entry.t which were required to simulate automation of heal after replace brick. Change-Id: I08e3dfa565374097f6c08856325ea77727437e11 BUG: 1304686 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/13353 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>