path: root/xlators/cluster/ec/src/ec-common.c
Commit message | Author | Age | Files | Lines
* cluster/ec: Inform failure when some bricks are unavailable.Ashish Pandey2020-08-251-30/+46
  Provide proper information about the failure when a fop fails on some of the bricks. Also provide information about the parent fop and the map of the bricks on which it is failing.
  Change-Id: If812739617df65cd146c8e667fbacff653717248
  updates #1434
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: Improve detection of new healsXavi Hernandez2020-07-221-1/+1
  When EC successfully healed a directory, it assumed that other entries inside that directory might have been created, which could require additional heal cycles. For this reason, when the heal happened as part of an index heal iteration, it triggered a new iteration.
  The problem happened when the directory was healthy, so no new entries were added, but its index entry was not removed for some reason. In this case self-heal entered an endless loop, healing the same directory continuously and causing high CPU utilization.
  This patch improves detection of new files added to the heal index so that a new index heal iteration is only triggered if there is new work to do.
  Change-Id: I2355742b85fbfa6de758bccc5d2e1a283c82b53f
  Fixes: #1354
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* Improve logging in EC, client and lock translatorAshish Pandey2020-02-031-1/+1
| | | | | Change-Id: I98af8672a25ff9fd9dba91a2e1384719f9155255 Fixes: bz#1779760
* cluster/ec: Mark release only when it is acquiredPranith Kumar K2019-09-121-2/+18
  Problem:
  Mount-1:
  1) Tries to acquire lock on 'dir1'.
  2) Lock is granted on brick-0.
  3) Gets a lock-contention notification, marks lock->release to true.
  4) New fop comes on 'dir1' which will be put in frozen list as lock->release is set to true.
  5) Lock acquisition from step-2 fails because 3 bricks went down in 4+2 setup.
  Mount-2:
  1) Tries to acquire lock on 'dir1'.
  2) Lock gets EAGAIN on brick-0 and leads to blocking lock on brick-0.
  3) Doesn't matter what happens on mount-2 from here on.
  The fop on mount-1 that was put in the frozen list will hang, because no codepath will move it from the frozen list to any other list and the lock will not be retried.
  Fix: Don't set lock->release to true if the lock is not acquired at the time of the lock-contention notification.
  fixes: bz#1743573
  Change-Id: Ie6630db8735ccf372cc54b873a3a3aed7a6082b7
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
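The fix described in this commit can be pictured with a minimal sketch. The struct and function names below are hypothetical stand-ins, not the real EC data structures; the only behaviour taken from the commit message is that a contention notification must not mark the lock for release unless the lock is already acquired.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the EC lock state. */
struct sketch_lock {
    bool acquired; /* the lock has been granted to this client */
    bool release;  /* release requested; new fops would be frozen */
};

/* Handle a lock-contention notification.
 * Returns true if the lock was marked for release. */
static bool sketch_on_contention(struct sketch_lock *lock)
{
    assert(lock != NULL);
    if (!lock->acquired) {
        /* Not acquired yet: marking release here could leave fops
         * stuck in the frozen list if the acquisition later fails. */
        return false;
    }
    lock->release = true;
    return true;
}
```

With this guard, a fop arriving while the acquisition is still pending is not parked behind a release flag that nothing would ever clear.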
* cluster/ec: quorum-count implementationPranith Kumar K2019-09-081-0/+13
| | | | | | fixes: #721 Change-Id: I5333540e3c635ccf441cf1f4696e4c8986e38ea8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fail fsync/flush for files on update size/version failurePranith Kumar K2019-09-061-0/+23
| | | | | | | | | | | | | | | | | Problem: If update size/version is not successful on the file, updates on the same stripe could lead to data corruptions if the earlier un-aligned write is not successful on all the bricks. Application won't have any knowledge of this because update size/version happens in the background. Fix: Fail fsync/flush on fds that are opened before update-size-version went bad. fixes: bz#1748836 Change-Id: I9d323eddcda703bd27d55f340c4079d76e06e492 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix reopen flags to avoid misbehaviorPranith Kumar K2019-07-301-1/+3
  Problem: When a file needs to be re-opened, the O_APPEND, O_EXCL and O_CREAT flags are not filtered in EC.
  - O_APPEND should be filtered because EC doesn't send O_APPEND below EC for open, to make sure writes happen on the individual fragments instead of at the end of the file.
  - O_EXCL should be filtered because shd could have created the file, so even when the file exists open should succeed.
  - O_CREAT should be filtered because open happens with gfid as parameter, so the open fop will create just the gfid, which will lead to problems.
  Fix: Filter out these flags in reopen.
  Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4
  Fixes: bz#1733935
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
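A minimal sketch of the flag filtering described above, using only the standard POSIX open flags. The function name is hypothetical and this is not the actual EC reopen code; it only shows the masking step.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical helper: derive the flags for a reopen from the flags
 * the fd was originally opened with, dropping O_APPEND, O_EXCL and
 * O_CREAT as the commit message describes. */
static int sketch_reopen_flags(int original_flags)
{
    assert(original_flags >= 0);
    return original_flags & ~(O_APPEND | O_EXCL | O_CREAT);
}
```

All other bits, such as the access mode or O_SYNC, pass through unchanged.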
* cluster/ec: Always read from good-maskPranith Kumar K2019-07-261-0/+3
  There are cases where fop->mask may have fop->healing added, and readv shouldn't be wound on fop->healing. To avoid this, always wind readv to lock->good_mask.
  fixes bz#1727081
  Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: inherit healing from lock when it has infoKinglong Mee2019-07-161-2/+3
  If the lock has info, the fop should inherit the healing mask from it. Otherwise, the fop cannot inherit the right healing mask when changed_flags is zero.
  Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f
  fixes: bz#1727081
  Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* cluster/ec: Prevent double pre-op xattropsPranith Kumar K2019-06-221-6/+7
  Problem: Race between two threads:
  1) Thread-1: Does ec_get_size_version() to perform pre-op fxattrop as part of write-1.
  2) Thread-2: Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1.
  3) Thread-1: Completes executing ec_prepare_update_cbk, leading to ctx->dirty[] = '1'.
  4) Thread-2: Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2'.
  At the end of the writes the file will be marked for heal even when it doesn't need heal.
  Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] from being marked as '1' in step 2) above.
  Updates bz#1593224
  Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* ec/fini: Fix race between xlator cleanup and on going async fopMohammed Rafi KC2019-06-081-0/+10
  Problem: While we process a cleanup, there is a chance of a race with async operations, for example ec_launch_replace_heal. This can lead to invalid memory access.
  Solution: Just like we track ongoing heal fops, we can also track fops like ec_launch_replace_heal, so that we can decide when to send a PARENT_DOWN request.
  Change-Id: I055391c5c6c34d58aef7336847f3b570cb831298
  fixes: bz#1703948
  Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/ec: honor contention notifications for partially acquired locksXavi Hernandez2019-05-251-1/+1
  EC was ignoring lock contention notifications received while a lock was being acquired. When a lock is partially acquired (some bricks have granted the lock but some others not yet) we can receive notifications from acquired bricks, which should be honored, since we may not receive more notifications after that.
  Since EC was ignoring them, once the lock was acquired, it was not released until the eager-lock timeout, causing unnecessary delays on other clients.
  This fix takes into consideration the notifications received before having completed the full lock acquisition. After that, the lock will be released as soon as possible.
  Fixes: bz#1708156
  Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Reopen shouldn't happen with O_TRUNCPranith Kumar K2019-05-051-1/+1
  Problem: Doing re-open with O_TRUNC will truncate the fragment even when it is not needed, requiring extra heals.
  Fix: At the time of re-open don't use O_TRUNC.
  fixes bz#1706603
  Change-Id: Idc6408968efaad897b95a5a52481c66e843d3fb8
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: fix fd reopenXavi Hernandez2019-04-231-20/+53
  Currently EC tries to reopen fd's that have been opened while a brick was down. This is done as part of regular write operations, just after having acquired the locks, and it's sent as a sub-fop of the main write fop.
  There were two problems:
  1. The reopen was attempted on all UP bricks, even if a previous lock didn't succeed. This is incorrect because most probably the open will fail.
  2. If reopen is sent and fails, the error is propagated to the main operation, causing it to fail when it shouldn't.
  To fix this, we only attempt reopens on bricks where the current fop owns a lock, and we prevent any error from being propagated to the main fop.
  To implement this behaviour, an argument used to indicate the minimum number of required answers has been overloaded to also include some flags. To make the change consistent, it has been necessary to rename the argument, which means that a lot of files have been changed. However there are no functional changes.
  This change has also uncovered a problem in discard code, which didn't correctly process requests of small sizes because no real discard fop was being processed, only a write of 0's on some region. In this case some fields of the fop remained uninitialized or with incorrect values. To fix this, a new function has been created to simulate success on a fop and it's used in the discard case.
  Thanks to Pranith for providing a test script that has also detected an issue in this patch. This patch includes a small modification of this script to force data to be written into bricks before stopping them.
  Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec
  Fixes: bz#1699866
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Don't enqueue an entry if it is already healingAshish Pandey2019-03-271-16/+27
  Problem:
  1 - heal-wait-qlength is 128 by default. If shd is disabled and we need to heal files, client side heal is needed. Accessing these files triggers the heal. However, it has been observed that a file can be enqueued multiple times in the heal wait queue, which in turn fills the queue and prevents other files from being enqueued.
  2 - While a file is going through healing and a write fop from a mount comes on that file, it sends the write to all the bricks, including the healing one. At the end it updates version and size on all the bricks. However, it does not unset the dirty flag on all the bricks, even if this write fop was successful on all the bricks. After healing completes, this dirty flag remains set and never gets cleaned up if SHD is disabled.
  Solution:
  1 - If an entry is already in the queue or going through the heal process, don't enqueue the next client side request to heal the same file.
  2 - Unset dirty on all the bricks at the end if the fop has succeeded on all the bricks, even if some of the bricks are going through heal.
  Change-Id: Ia61ffe230c6502ce6cb934425d55e2f40dd1a727
  updates: bz#1593224
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
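The first part of the solution above can be sketched as a guard in front of the heal wait queue. Everything here is an illustrative assumption: the types, the `heal_pending` flag and the counter-based queue are hypothetical and do not mirror the real EC structures. The limit plays the role of heal-wait-qlength.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode state: set while the inode is queued for
 * heal or being healed. */
struct sketch_inode {
    bool heal_pending;
};

/* Hypothetical wait queue, reduced to a counter and a limit. */
struct sketch_heal_queue {
    size_t length;
    size_t limit;
};

/* Returns true if the inode was enqueued, false if it was already
 * queued or healing, or if the queue is full. */
static bool sketch_enqueue_heal(struct sketch_heal_queue *queue,
                                struct sketch_inode *inode)
{
    assert(queue != NULL && inode != NULL);
    if (inode->heal_pending) {
        return false; /* duplicate request: do not consume a slot */
    }
    if (queue->length >= queue->limit) {
        return false; /* queue is full */
    }
    inode->heal_pending = true;
    queue->length++;
    return true;
}
```

Repeated accesses to the same damaged file then consume a single queue slot, leaving room for other files.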
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* all: fix warnings on non 64-bits architecturesXavi Hernandez2018-10-101-24/+25
  When compiling on other architectures, many warnings appear. Some of them are actual problems that prevent gluster from working correctly on those architectures.
  Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0
  fixes: bz#1632717
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* Land part 2 of clang-format changesGluster Ant2018-09-121-903/+882
| | | | | Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
* cluster/ec: Don't update trusted.ec.version if fop succeedsAshish Pandey2018-09-071-0/+9
  If a fop has succeeded on all the bricks and is trying to release the lock, there is no need to update the version for the file/entry. All it will do is increase the version from x to x+1 on all the bricks.
  If this update (x to x+1) fails on some brick, this will indicate that the entry is unhealthy while in reality everything is fine with the entry.
  Avoiding this update saves one xattrop at the end of the fops, which decreases the chances of entries being in an unhealthy state and also improves performance.
  Change-Id: Id9fca6bd2991425db6ed7d1f36af27027accb636
  fixes: bz#1623759
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: Improve logging for some critical error messagesAshish Pandey2018-09-071-14/+52
| | | | | | Change-Id: I037e52a3467467b81a1ba5416317870864060d4d updates: bz#1615703 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* All: run codespell on the code and fix issues.Yaniv Kaul2018-07-221-1/+1
| | | | | | | | | | | | Please review, it's not always just the comments that were fixed. I've had to revert of course all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/ec: Fix pre-op xattrop managementXavi Hernandez2018-05-231-29/+40
| | | | | | | | | | | | | | | | | | | | Multiple pre-op xattrop can be simultaneously being processed. On the cbk it was checked if the fop was waiting for some specific data (like size and version) and, if so, it was assumed that this answer should contain that data. This is not true, since a fop can be waiting for some data, but it may come from the xattrop of another fop. This patch differentiates between needing some information and providing it. This is related to parallel writes. Disabling them fixed the problem, but also prevented concurrent reads. A change has been made so that disabling parallel writes still allows parallel reads. Fixes: bz#1578325 Change-Id: I74772ad6b80b7b37805da93d5ec3ae099e96b041 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Do lock conflict check correctly for wait-listPranith Kumar K2018-02-011-8/+15
| | | | | | | | | | | | | | Problem: ec_link_has_lock_conflict() is traversing over only owner_list but the function is also getting called with wait_list. Fix: Modify ec_link_has_lock_conflict() to traverse lists correctly. Updated the callers to reflect the changes. BUG: 1540669 Change-Id: Ibd7ea10f4498e7c2761f9a6faac6d5cb7d750c91 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* locks: added inodelk/entrylk contention upcall notificationsXavier Hernandez2018-01-161-87/+166
| | | | | | | | | | | | | | The locks xlator now is able to send a contention notification to the current owner of the lock. This is only a notification that can be used to improve performance of some client side operations that might benefit from extended duration of lock ownership. Nothing is done if the lock owner decides to ignore the message and to not release the lock. For forced release of acquired resources, leases must be used. Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: OpenFD heal implementation for ECSunil Kumar Acharya2018-01-051-0/+113
| | | | | | | | | | | | | Existing EC code doesn't try to heal the OpenFD to avoid unnecessary healing of the data later. Fix implements the healing of open FDs before carrying out file operations on them by making an attempt to open the FDs on required up nodes. BUG: 1431955 Change-Id: Ib696f59c41ffd8d5678a484b23a00bb02764ed15 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/ec: Fix possible shift overflowXavier Hernandez2017-12-221-3/+3
  A Coverity scan has revealed a potential shift overflow while scanning the bitmap of available subvolumes. The actual overflow cannot happen, but I've changed the test used to control the limit to make it explicit.
  Change-Id: Ieb55f010bbca68a1d86a93e47822f7c709a26e83
  BUG: 789278
  Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/ec: Change [f]getxattr to parallel-dispatch-onePranith Kumar K2017-12-221-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment in EC, [f]getxattr operations wait to acquire a lock while other operations are in progress even when it is in the same mount with a lock on the file/directory. This happens because [f]getxattr operations follow the model where the operation is wound on 'k' of the bricks and are matched to make sure the data returned is same on all of them. This consistency check requires that no other operations are on-going while [f]getxattr operations are wound to the bricks. We can perform [f]getxattr in another way as well, where we find the good_mask from the lock that is already granted and wind the operation on any one of the good bricks and unwind the answer after adjusting size/blocks to the parent xlator. Since we are taking into account good_mask, the reply we get will either be before or after a possible on-going operation. Using this method, the operation doesn't need to depend on completion of on-going operations which could be taking long time (In case of some slow disks and writes are in progress etc). Thus we reduce the time to serve [f]getxattr requests. I changed [f]getxattr to dispatch-one and added extra logic in ec_link_has_lock_conflict() to not have any conflicts for fops with EC_MINIMUM_ONE as fop->minimum to achieve the effect described above. Modified scripts to make sure READ fop is received in EC to trigger heals. Updates gluster/glusterfs#368 Change-Id: I3b4ebf89181c336b7b8d5471b0454f016cdaf296 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix bugs in stripe-cache featureAshish Pandey2017-12-051-1/+1
  1 - This patch fixes a bug in ec_update_stripe() that prevented some stripes from being updated after a write.
  2 - This patch also includes a code modification for the case in which a file does not exist and we write at an unaligned offset and user size: the last stripe on which "end" will fall should also be cached.
  Change-Id: I069cb4be1c8d59c206e3b35a6991e1fbdbc9b474
  BUG: 1520758
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: Keep last written stripe in in-memory cacheAshish Pandey2017-11-101-0/+115
  Problem: Consider an EC volume with configuration 4 + 2. The stripe size for this would be 512 * 4 = 2048. That means 2048 bytes of user data are stored in one stripe. Let's say 2048 + 512 = 2560 bytes are already written on this volume. 512 bytes would be in the second stripe. Now, if there are sequential writes with offset 2560 and of size 1 byte, we have to read the whole stripe, encode it with the 1 byte and then write it back. The next write, with offset 2561 and size of 1 byte, will again READ-MODIFY-WRITE the whole stripe. This causes bad performance because of the many READ requests travelling over the network.
  There are some tools and scenarios where this kind of load occurs and users are not aware of it. Example: fio and zip.
  Solution: One possible solution to deal with this issue is to keep the last stripe in memory. This way, we need not read it again and we can save a READ fop going over the network. Considering the above example, we have to keep the last 2048 bytes (maximum) in memory per file.
  Change-Id: I3f95e6fc3ff81953646d374c445a40c6886b0b85
  BUG: 1471753
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
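The arithmetic in this example can be checked with a small sketch that computes which stripe-aligned byte range a write touches. The type and function names are hypothetical; only the numbers (512-byte fragments, 4 data bricks, 2048-byte stripe) come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open byte range [start, end). */
struct sketch_range {
    uint64_t start;
    uint64_t end;
};

/* Stripe-aligned range that must be read, modified and written back
 * for a write of `size` bytes at `offset`. */
static struct sketch_range sketch_stripe_span(uint64_t offset, uint64_t size,
                                              uint64_t stripe_size)
{
    struct sketch_range span;
    assert(stripe_size > 0 && size > 0);
    /* Round the start down and the end up to a stripe boundary. */
    span.start = offset - (offset % stripe_size);
    span.end = ((offset + size + stripe_size - 1) / stripe_size) * stripe_size;
    return span;
}
```

For the 4 + 2 volume above, 1-byte writes at offsets 2560 and 2561 both map to the range [2048, 4096), which is why caching that one stripe avoids the repeated reads.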
* cluster/ec: Remove possibility of NULL derefAshish Pandey2017-11-051-1/+1
| | | | | | | | | | | | | Coverity ID: 237 Problem: In ec_check_status we are trying to deref fop->answer which could be NULL. Solution: Check Null condition before using this pointer. Change-Id: I4f9a73dc2f062ca9c62b4c4baf0a6fcadade88f2 BUG: 789278 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: create eager-lock option for non-regular filesXavier Hernandez2017-11-051-1/+21
| | | | | | | | | A new option is added to allow independent configuration of eager locking for regular files and non-regular files. Change-Id: I8f80e46d36d8551011132b15c0fac549b7fb1c60 BUG: 1502610 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/ec: Allow parallel writes in EC if possiblePranith Kumar K2017-10-241-59/+132
  Problem: EC at the moment sends one modification fop after another, so if some of the disks become slow for a while, the wait time for the writes waiting in the queue becomes really bad.
  Fix: Allow parallel writes when possible. For this we need to make 3 changes.
  1) Each fop now has range parameters they will be updating.
  2) Xattrop is changed to handle parallel xattrop requests where some would be modifying just dirty xattr.
  3) Fops that refer to size now take locks and update the locks.
  Fixes #251
  Change-Id: Ibc3c15372f91bbd6fb617f0d99399b3149fa64b2
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Handle parallel get_size_versionPranith Kumar K2017-10-101-55/+95
| | | | | | Updates #251 Change-Id: I6244014dbc90af3239d63d75a064ae22ec12a054 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* Coverity Issue Fix : CHECKED_RETURNSubha sree Mohankumar2017-09-261-1/+1
  Issue: Event check_return: Calling "ec_dict_set_number" without checking return value.
  Fix: Cast the return value of the function "ec_dict_set_number" to void.
  Change-Id: Id97034f9b1b8591536d63dca680ca7c7a9c4fcc3
  BUG: 789278
  Signed-off-by: Subha sree Mohankumar <smohanku@redhat.com>
* cluster/ec: fix for BAD_SHIFT, follow-up patchKaleb S. KEITHLEY2017-09-201-11/+14
  Address comments to https://review.gluster.org/18067 (Change-Id I86e15d12939c610c99f5f96c551bb870df20f4b4), which was posted as an RFC as an example of a possible alternative fix to https://review.gluster.org/17860 (Change-Id I28a3bdd4a357526dba0cf84c262919c05cfa173e).
  This alternative fix preserves the unsignedness of the indexes throughout, obviating the need to check an index's value before using it to shift. (A shift by a negative number is undefined, as is a shift by at least as many bits as the type has.)
  BUG: 1474309
  Change-Id: I46fe9cec140d3397463780748f6876251acb06dd
  Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
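A minimal sketch of the idea, assuming a 64-bit brick bitmap: with an unsigned index and an explicit upper bound on the loop, the shift count can never be negative or reach the width of the type. The macro and function names are hypothetical and this is not the code from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Width of the hypothetical brick bitmap, in bits. */
#define SKETCH_MASK_BITS ((unsigned)(sizeof(uint64_t) * CHAR_BIT))

/* Returns the index of the first set bit at or after `start`, or
 * SKETCH_MASK_BITS if there is none. */
static unsigned sketch_next_set_bit(uint64_t mask, unsigned start)
{
    unsigned idx;
    assert(start <= SKETCH_MASK_BITS);
    /* idx stays below the bit width, so every shift is well defined. */
    for (idx = start; idx < SKETCH_MASK_BITS; idx++) {
        if ((mask >> idx) & 1u) {
            return idx;
        }
    }
    return SKETCH_MASK_BITS;
}
```

Because the index is unsigned, no separate check for a negative value is needed before shifting.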
* cluster/ec: coverity, fix for BAD_SHIFTKaleb S. KEITHLEY2017-08-281-11/+14
| | | | | | | | | | | | | | | | | This is how I would like to see this fixed. passes (eliminates the warning in) coverity. The use of uintptr_t as a bitmask is a problem IMO, especially on 32-bit clients. Change-Id: I86e15d12939c610c99f5f96c551bb870df20f4b4 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> Reviewed-on: https://review.gluster.org/18067 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jeff@pl.atyp.us>
* cluster/ec: Non-disruptive upgrade on EC volume failsSunil Kumar Acharya2017-07-141-1/+4
| | | | | | | | | | | | | | | | | | | | Problem: Enabling optimistic changelog on EC volume was not handling node down scenarios appropriately resulting in volume data inaccessibility. Solution: Update dirty xattr appropriately on good bricks whenever nodes are down. This would fix the metadata information as part of heal and thus ensures data accessibility. BUG: 1468261 Change-Id: I08b0d28df386d9b2b49c3de84b4aac1c729ac057 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com> Reviewed-on: https://review.gluster.org/17703 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Get size of file in EC [f]xattropPranith Kumar K2017-07-131-2/+17
| | | | | | | | | | | | | | | | | | | | Problem: For allowing parallel writes we shouldn't depend on ia_size to be same for all the bricks in each write_cbk(). But we need to make sure backend size is correct on all the bricks and no crashes/manual modifications happened. Fix: At the time of get_size_version() we do 1 check to make sure size of the file is same across the bricks. From then on the FOPs will give the status of the fop, so we rely on this information to keep which bricks are good/bad. Updates #251 Change-Id: I1df645347e2e9f2e09cfa4411b6cc305d7f4e4e5 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/17741 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Update xattr and heal size properlyAshish Pandey2017-06-061-2/+9
  Problem-1: Recursive healing of the same file happens when IO is going on, even after data heal completes.
  RCA: At the end of the write, when ec_update_size_version gets called, we send it only to good bricks and not to the healing brick. Due to this, the xattr on the healing brick will always remain out of sync, and when the background heal checks source and sink, it finds that this brick needs to be healed and starts healing from scratch. That involves ftruncate and writing all of the data again.
  Solution: Send xattrop on all the good bricks as well as the healing bricks.
  Problem-2: The above fix exposes data corruption during heal. If a write on a file is going on and heal finishes, we find that the file gets corrupted.
  RCA: The real problem happens in ec_rebuild_data(). Here we receive the 'size' argument, which contains the real file size at the time of starting self-heal, and it's assigned to heal->total_size. After that, a sequence of calls to ec_sync_heal_block() is done. Each call ends up calling ec_manager_heal_block(), which does the actual work of healing a block.
  First a lock on the inode is taken in state EC_STATE_INIT using ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is called. This function calls ec_set_inode_size() to store the real size of the inode (it uses heal->total_size).
  The next step is to read the block to be healed. This is done using a regular ec_readv(). One of the things this call does is to trim the returned size if the file is smaller than the requested size. In our case, when we read the last block of a file whose size was congruent to 512 mod 1024 at the time of starting self-heal, ec_readv() will return only the first 512 bytes, not the whole 1024 bytes.
  This isn't a problem since the following ec_writev() sent from the heal code only attempts to write the amount of data read, so it shouldn't modify the remaining 512 bytes.
  However ec_writev() also checks the file size. If we are writing the last block of the file (determined by the size stored on the inode that we have set to heal->total_size), any data beyond the (imposed) end of file will be cleared with 0's. This causes the 512 bytes after heal->total_size to be cleared. Since the file was written after heal started, these bytes contained data, so the block written to the damaged brick will be incorrect.
  Solution: Align heal->total_size to a multiple of the stripe size.
  Thanks to "Xavier Hernandez" <xhernandez@datalab.es> for finding the root cause and fixing the issue.
  Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
  BUG: 1428673
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
  Reviewed-on: https://review.gluster.org/16985
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
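The alignment step named in the solution can be sketched as follows. The commit message only says the size is aligned to a multiple of the stripe size; rounding up (rather than down) is an assumption here, chosen because it keeps the imposed end of file from falling inside the last block. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round `size` up to the next multiple of `stripe_size`.
 * Assumption: the heal size is rounded up, not down. */
static uint64_t sketch_align_up(uint64_t size, uint64_t stripe_size)
{
    uint64_t remainder;
    assert(stripe_size > 0);
    remainder = size % stripe_size;
    return remainder == 0 ? size : size + (stripe_size - remainder);
}
```

For the case in the message, a file of 1536 bytes (512 mod 1024) with a 1024-byte stripe would be treated as 2048 bytes during heal, so the trailing 512 bytes of the last block are not zeroed by the write.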
* cluster/ec : Don't count healing brick as healthy brickAshish Pandey2017-04-121-1/+1
  In ec_child_select, we should send the fop to healing bricks unconditionally, but when checking the number of healthy bricks against fragments and minimum count, we should not count these healing bricks.
  Count the bits of fop->mask before adding healing bricks to fop->mask.
  Change-Id: I3fa80bdd5ca34ca070d610116b84154b917c5999
  BUG: 1439527
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
  Reviewed-on: https://review.gluster.org/17007
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
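The ordering described above (count first, then add the healing bricks) can be sketched with plain bitmasks. The function names and signature are hypothetical and much simpler than the real ec_child_select; the sketch only shows that healing bricks receive the fop without counting towards the minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a brick mask. */
static int sketch_count_bits(uint64_t mask)
{
    int count = 0;
    while (mask != 0) {
        mask &= mask - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Returns true and stores the dispatch mask in *dispatch if enough
 * healthy bricks are available; healing bricks are added only after
 * the healthy bricks have been counted. */
static bool sketch_child_select(uint64_t healthy, uint64_t healing,
                                int minimum, uint64_t *dispatch)
{
    int healthy_count;
    assert(dispatch != NULL);
    healthy_count = sketch_count_bits(healthy);
    if (healthy_count < minimum) {
        return false;
    }
    *dispatch = healthy | healing;
    return true;
}
```

In a 4 + 2 setup with a minimum of 4, three healthy bricks plus one healing brick are rejected, while four healthy bricks plus one healing brick dispatch to all five.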
* cluster/ec: Don't mark dirty on entry/meta ops in query-infoPranith Kumar K2017-03-071-6/+0
| | | | | | | | | | | | | | | | | | | | | We wanted to mark dirty for metadata/entry operations whenever query-info is set and info is not yet there because we are anyway sending xattrop over the network. But this is causing 25% regression from 3.8.8 so removing this optimization Also fixed two small issues that we didn't find in the previous patch 1) reconfigure failure was sending return value 0 for optimistic-changelog 2) ec->optimistic_changelog was set to true even before OPTION_INIT BUG: 1408809 Change-Id: Iabb0b64bd4d3623688790e4b67e5c20b4da977a1 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/16865 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Introduce optimistic changelog in ECPranith Kumar K2017-03-041-1/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Fix to https://bugzilla.redhat.com/show_bug.cgi?id=1316873 has made changes to set dirty flag before every update fop, data or metadata, and unset it after successful operation. That makes some of the fops very slow such as entry operations or metadata operations. Solution: File data operations are the only operation which take some time and setting dirty flag before a fop and unsetting it after serves the purpose as probability of failure of a fop is high when the time duration is more. For all the other operations, set dirty flag at the end of the fop, if any brick is down and need heal. Providing following option to choose between high performance or better heal marking for metadata and entry fops. Set/Unset dirty flag for every update fop at the start of the fop. If ON, this option impacts performance of entry operations or metadata operations as it will set dirty flag at the start and unset it at the end of ALL update fop. If OFF and all the bricks are good, dirty flag will be set at the start only for file fops For metadata and entry fops dirty flag will not be set at the start, if all the bricks are good. This does not impact performance for metadata operations and entry operation but has a very small window to miss marking entry as dirty in case it is required to be healed. Thanks to Xavi and Ashish for the design Picked the .t file from Ashish' patch https://review.gluster.org/16298 BUG: 1408809 Change-Id: I3ce860063f0e2901e50754dcfc3e4ed22daf819f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/16821 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Xavier Hernandez <xhernandez@datalab.es> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Don't trigger data/metadata heal on Lookups  Pranith Kumar K  2017-02-26  1  -14/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem-1: A Lookup, which doesn't take any locks, can't be trusted when it observes a version mismatch. Launching a heal based on this information leads to self-heals that affect I/O performance in the cases where Lookup is wrong. Since the self-heal daemon and client operations on the inode that take locks can still trigger heal, we can choose not to attempt a heal on Lookup. Problem-2: Fixed spurious failure of tests/bitrot/bug-1373520.t. For the issues above, ec_heal_inspect() was preventing 'name' heal from happening. Problem-3: tests/basic/ec/ec-background-heals.t To be honest I don't know what the problem was; while fixing the 2 problems above, I made some changes to ec_heal_inspect() and ec_need_heal(), after which the spurious failure just didn't happen when I tried to recreate it, even after a long time. BUG: 1414287 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Change-Id: Ife2535e1d0b267712973673f6d474e288f3c6834 Reviewed-on: https://review.gluster.org/16468 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: Change level of messages to DEBUG  Ashish Pandey  2017-01-27  1  -2/+2
| | | | | | | | | | | | | | | | Heal failure or success should not be logged as a warning. Whether heal is happening can be observed from heal info. If we need to debug a case where heal is not happening, we can set the log level to DEBUG. Change-Id: I347665c8c8b6223bb08a9f3dd5643a10ddc3b93e BUG: 1417050 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: https://review.gluster.org/16473 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/disperse: Do not log fop failed for lockless fops  Ashish Pandey  2017-01-19  1  -12/+13
| | | | | | | | | | | | | | | | | | | | Problem: Operation failed messages are getting logged based on the callbacks of lockless fops. If a fop does not take a lock, it is possible that it will get some out-of-sync xattrs or iatts. We cannot depend on these callbacks to say that the fop has failed. Solution: Print failed messages only for locked fops. However, heal would still be triggered. Change-Id: I4427402c8c944c23f16073613caa03ea788bead3 BUG: 1414287 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/16435 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Fixing log message  Sunil Kumar H G  2017-01-08  1  -5/+10
| | | | | | | | | | | | | | | Updating the warning message with details to improve user understanding. BUG: 1409202 Change-Id: I001f8d5c01c97fff1e4e1a3a84b62e17c025c520 Signed-off-by: Sunil Kumar H G <sheggodu@redhat.com> Reviewed-on: http://review.gluster.org/16315 Tested-by: Sunil Kumar Acharya Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Do lookup on an existing file in link  Pranith Kumar K  2017-01-05  1  -3/+4
| | | | | | | | | | | | | | | | | | | Problem: In the link fop, lookup is happening on the new fop, which doesn't exist, so the iatt that ec serves to parent xlators has a size of zero, which leads to 'cat' giving empty output. Fix: Change the code so that lookup happens on the existing link instead. BUG: 1409730 Change-Id: I70eb02fe0633e61d1d110575589cc2dbe5235d76 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16320 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/ec: Fix lk-owner set race in ec_unlock  Pranith Kumar K  2016-12-13  1  -6/+8
| | | | | | | | | | | | | | | | | | | | | | Problem: Rename takes two locks. There is a case where, when it tries to unlock, it sends an xattrop of the directory with the new version, and the callbacks of these two xattrops can be picked up by two separate epoll threads. Both of them will try to set the lk-owner for unlock in parallel on the same frame, so one of these unlocks will fail because the lk-owner doesn't match. Fix: Specify the lk-owner to be set on the inodelk frame, which will not be overwritten by any other thread/operation. BUG: 1402710 Change-Id: I666ffc931440dc5253d72df666efe0ef1d73f99a Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16074 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: fix unused variable warnings/errors  Kaleb S. KEITHLEY  2016-09-16  1  -2/+0
| | | | | | | | | | | | | | | | | | http://review.gluster.org/14085 fixes a "pragma leak" where the generated rpc/xdr headers have a pair of pragmas that disable these warnings. With the warnings disabled, many unused variables have crept into the code base. And 14085 won't pass its own smoke test until all these warnings are fixed. BUG: 1369124 Change-Id: I24607fc2082c3424f876f740a88fb7d0173d322d Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.org/15518 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: set/unset dirty flag for data/metadata update  Ashish Pandey  2016-09-15  1  -122/+162
| | | | | | | | | | | | | | | | | | | | Currently, for all update operations, metadata or data, we set the dirty flag at the end of the operation only if a brick is down. This leads to a delay in healing and in some cases to no healing at all. In this patch we set (+1) the dirty flag at the start of metadata or data update operations and, after successful completion of the fop, unset (-1) it again. Change-Id: Ide5668bdec7b937a61c5c840cdc79a967598e1e9 BUG: 1316873 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/13733 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
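The set (+1) / unset (-1) scheme in the commit message above can be illustrated with a minimal sketch. It is not the translator's code: `inode_state`, `run_update_fop`, `needs_heal` and the two sample fops are hypothetical, and a plain in-memory counter stands in for whatever state the real translator keeps on the bricks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state; a stand-in for the real on-brick marker. */
struct inode_state {
    int64_t dirty;
};

/* An update operation reports success (true) or failure (false). */
typedef bool (*fop_fn)(void *arg);

/* Mark dirty before the update, clear the mark only if it succeeds. */
static bool run_update_fop(struct inode_state *st, fop_fn fop, void *arg)
{
    st->dirty += 1;               /* set (+1) at the start of the fop */
    bool ok = fop(arg);
    if (ok)
        st->dirty -= 1;           /* unset (-1) after successful completion */
    return ok;
}

/* A non-zero counter means some update did not complete cleanly. */
static bool needs_heal(const struct inode_state *st)
{
    return st->dirty != 0;
}

/* Two trivial fops used only to exercise the sketch. */
static bool fop_always_ok(void *arg)   { (void)arg; return true; }
static bool fop_always_fail(void *arg) { (void)arg; return false; }
```

Because the mark is written before the update starts, a fop that fails or never completes leaves the counter non-zero, so the need for heal is recorded without waiting for a brick to be detected as down.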