summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: fix race when bricks come upXavi Hernandez2020-04-063-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | The was a problem when self-heal was sending lookups at the same time that one of the bricks was coming up. In this case there was a chance that the number of 'up' bricks changes in the middle of sending the requests to subvolumes which caused a discrepancy in the expected number of replies and the actual number of sent requests. This discrepancy caused that AFR continued executing requests before all requests were complete. Eventually, the frame of the pending request was destroyed when the operation terminated, causing a use- after-free issue when the answer was finally received. In theory the same thing could happen in the reverse way, i.e. AFR tries to wait for more replies than sent requests, causing a hang. Backport of: > Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70 > Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> > Fixes: bz#1808875 Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Fixes: bz#1809440
* afr: prevent spurious entry heals leading to gfid split-brainRavishankar N2020-04-065-15/+7
| | | | | | | | | | | | | | | | | | | | | Problem: In a hyperconverged setup with granular-entry-heal enabled, if a file is recreated while one of the bricks is down, and an index heal is triggered (with the brick still down), entry-self heal was doing a spurious heal with just the 2 good bricks. It was doing a post-op leading to removal of the filename from .glusterfs/indices/entry-changes as well as erroneous setting of afr xattrs on the parent. When the brick came up, the xattrs were cleared, resulting in the renamed file not getting healed and leading to gfid split-brain and EIO on the mount. Fix: Proceed with entry heal only when shd can connect to all bricks of the replica, just like in data and metadata heal. fixes: #1103 Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)
* afr: mark pending xattrs as a part of metadata healRavishankar N2020-04-021-1/+61
| | | | | | | | | | | | | | | | | | | | ...if pending xattrs are zero for all children. Problem: If there are no pending xattrs and a metadata heal needs to be performed, it can be possible that we end up with xattrs inadvertendly deleted from all bricks, as explained in the BZ. Fix: After picking one among the sources as the good copy, mark pending xattrs on all sources to blame the sinks. Now even if this metadata heal fails midway, a subsequent heal will still choose one of the valid sources that it picked previously. Updates: #1067 Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)
* Cluster/afr: Don't treat all bricks having metadata pending as split-brainkarthik-us2020-03-022-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: We currently don't have a roll-back/undoing of post-ops if quorum is not met. Though the FOP is still unwound with failure, the xattrs remain on the disk. Due to these partial post-ops and partial heals (healing only when 2 bricks are up), we can end up in metadata split-brain purely from the afr xattrs point of view i.e each brick is blamed by atleast one of the others for metadata. These scenarios are hit when there is frequent connect/disconnect of the client/shd to the bricks. Fix: Pick a source based on the xattr values. If 2 bricks blame one, the blamed one must be treated as sink. If there is no majority, all are sources. Once we pick a source, self-heal will then do the heal instead of erroring out due to split-brain. This patch also adds restriction of all the bricks to be up to perform metadata heal to avoid any metadata loss. Removed the test case tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t as it was doing metadata heal even when only 2 of 3 bricks were up. Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09 fixes: bz#1806931 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: wake up index healer threadsRavishankar N2020-02-275-11/+25
| | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/glusterfs/+/23288/ ...whenever shd is re-enabled after disabling or there is a change in `cluster.heal-timeout`, without needing to restart shd or waiting for the current `cluster.heal-timeout` seconds to expire. See BZ 1743988 for more details. Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe fixes: bz#1807431 Reported-by: Glen Kiessling <glenk1973@hotmail.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/ec: Change handling of heal failure to avoid crashAshish Pandey2020-02-252-13/+13
| | | | | | | | | | | | | | | | | Problem: ec_getxattr_heal_cbk was called with NULL as second argument in case heal was failing. This function was dereferencing "cookie" argument which caused crash. Solution: Cookie is changed to carry the value that was supposed to be stored in fop->data, so even in the case when fop is NULL in error case, there won't be any NULL dereference. Thanks to Xavi for the suggestion about the fix. Change-Id: I0798000d5cadb17c3c2fbfa1baf77033ffc2bb8c fixes: bz#1805057
* cluster/ec: Update lock->good_mask on parent fop failurePranith Kumar K2020-02-252-0/+4
| | | | | | | | | | When discard/truncate performs write fop, it should do so after updating lock->good_mask to make sure readv happens on the correct mask fixes: bz#1805056 Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix reopen flags to avoid misbehaviorPranith Kumar K2020-02-252-3/+8
| | | | | | | | | | | | | | | | | | | | | | | Problem: when a file needs to be re-opened O_APPEND and O_EXCL flags are not filtered in EC. - O_APPEND should be filtered because EC doesn't send O_APPEND below EC for open to make sure writes happen on the individual fragments instead of at the end of the file. - O_EXCL should be filtered because shd could have created the file so even when file exists open should succeed - O_CREAT should be filtered because open happens with gfid as parameter. So open fop will create just the gfid which will lead to problems. Fix: Filter out these two flags in reopen. Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4 Fixes: bz#1805055 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Always read from good-maskPranith Kumar K2020-02-252-5/+25
| | | | | | | | | | There are cases where fop->mask may have fop->healing added and readv shouldn't be wound on fop->healing. To avoid this always wind readv to lock->good_mask fixes: bz#1805054 Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: fix EIO error for concurrent writes on sparse filesXavi Hernandez2020-02-251-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1805053 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: skip updating ctx->loc again when ec_fix_open/opendirKinglong Mee2020-02-252-10/+14
| | | | | | | | | | | | | The ec_manager_open/opendir memsets ctx->loc which causes memory/inode leak, and ec_fheal uses ctx->loc out of fd->lock that loc_copy may copy bad data when memset it. This patch skips updating ctx->loc when it is initilizaed. With it, ctx->loc is filled once, and never updated. Change-Id: I3bf5ffce4caf4c1c667f7acaa14b451d37a3550a fixes: bz#1805052 Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* cluster/ec: inherit healing from lock when it has infoKinglong Mee2020-02-251-2/+3
| | | | | | | | | If lock has info, fop should inherit healing mask from it. Otherwise, fop cannot inherit right healing when changed_flags is zero. Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f fixes: bz#1805051 Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* cluster/dht: Correct fd processing loopN Balachandran2020-02-251-22/+62
| | | | | | | | | | | | | | backport of https://review.gluster.org/#/c/glusterfs/+/23506/ The fd processing loops in the dht_migration_complete_check_task and the dht_rebalance_inprogress_task functions were unsafe and could cause an open to be sent on an already freed fd. This has been fixed. Change-Id: I0a3c7d2fba314089e03dfd704f9dceb134749540 Fixes: bz#1804522 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/ec: Prevent double pre-op xattropsPranith Kumar K2020-02-241-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Race: Thread-1 Thread-2 1) Does ec_get_size_version() to perform pre-op fxattrop as part of write-1 2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1 3) Completes executing ec_prepare_update_cbk leading to ctx->dirty[] = '1' 4) Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2' At the end of the writes the file will be marked for heal even when it doesn't need heal. Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked as '1' in step 2) above Fixes: bz#1805050 Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Reopen shouldn't happen with O_TRUNCPranith Kumar K2020-02-241-1/+1
| | | | | | | | | | | | | Problem: Doing re-open with O_TRUNC will truncate the fragment even when it is not needed needing extra heals Fix: At the time of re-open don't use O_TRUNC. fixes: bz#1805049 Change-Id: Idc6408968efaad897b95a5a52481c66e843d3fb8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: fix fd reopenXavi Hernandez2020-02-2014-274/+328
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently EC tries to reopen fd's that have been opened while a brick was down. This is done as part of regular write operations, just after having acquired the locks, and it's sent as a sub-fop of the main write fop. There were two problems: 1. The reopen was attempted on all UP bricks, even if a previous lock didn't succeed. This is incorrect because most probably the open will fail. 2. If reopen is sent and fails, the error is propagated to the main operation, causing it to fail when it shouldn't. To fix this, we only attempt reopens on bricks where the current fop owns a lock, and we prevent any error to be propagated to the main fop. To implement this behaviour an argument used to indicate the minimum number of required answers has overloaded to also include some flags. To make the change consistent, it has been necessary to rename the argument, which means that a lot of files have been changed. However there are no functional changes. This change has also uncovered a problem in discard code, which didn't correctely process requests of small sizes because no real discard fop was being processed, only a write of 0's on some region. In this case some fields of the fop remained uninitialized or with incorrect values. To fix this, a new function has been created to simulate success on a fop and it's used in the discard case. Thanks to Pranith for providing a test script that has also detected an issue in this patch. This patch includes a small modification of this script to force data to be written into bricks before stopping them. Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec Fixes: bz#1805047 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/afr: Heal entries when there is a source & no healed_sinkskarthik-us2019-10-111-0/+15
| | | | | | | | | | | | | | | | | | | | Problem: In a situation where B1 blames B2, B2 blames B1 and B3 doesn't blame anything for entry heal, heal will not complete even though we have clear source and sinks. This will happen because while doing afr_selfheal_find_direction() only the bricks which are blamed by non-accused bricks are considered as sinks. Later in __afr_selfheal_entry_finalize_source() when it tries to mark all the non-sources as sinks it fails to do so because there won't be any healed_sinks marked, no witness present and there will be a source. Fix: If there is a source and no healed_sinks, then reset all the locked sources to 0 and healed sinks to 1 to do conservative merge. Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548 Fixes: bz#1760710 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr/lookup: Pass xattr_req in while doing a selfheal in lookupMohammed Rafi KC2019-09-053-5/+17
| | | | | | | | | | | | | | | We were not passing xattr_req when doing a name self heal as well as a meta data heal. Because of this, some xdata was missing which causes i/o errors Backport of>https://review.gluster.org/#/c/glusterfs/+/23024/ >Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f >Fixes: bz#1728770 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Change-Id: I740ccdbfcf2667fe4a850c12ecdc4b9eeed08293 Fixes: bz#1749352 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/ec: honor contention notifications for partially acquired locksXavi Hernandez2019-06-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | EC was ignoring lock contention notifications received while a lock was being acquired. When a lock is partially acquired (some bricks have granted the lock but some others not yet) we can receive notifications from acquired bricks, which should be honored, since we may not receive more notifications after that. Since EC was ignoring them, once the lock was acquired, it was not released until the eager-lock timeout, causing unnecessary delays on other clients. This fix takes into consideration the notifications received before having completed the full lock acquisition. After that, the lock will be releaed as soon as possible. Backport of: > BUG: bz#1708156 > Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12 > Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Fixes: bz#1717282 Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Signed-off-by: Hari Gowtham <hgowtham@redhat.com>
* ec: fix truncate lock to cover the write in tuncate cleanKinglong Mee2019-05-081-2/+6
| | | | | | | | | | | ec_truncate_clean does writing under the lock granted for truncate, but the lock is calculated by ec_adjust_offset_up, so that, the write in ec_truncate_clean is out of lock. Updates: bz#1699500 Change-Id: I15ed1b0807d75c5eb817323f1c227e97d03e0e7c Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> (cherry picked from commit 0e1223491e964096384edfae5032ed0d50d028ad)
* cluster/afr: Remove local from owners_list on failure of lock-acquisitionPranith Kumar K2019-05-084-18/+14
| | | | | | | | | | | | | When eager-lock lock acquisition fails because of say network failures, the local is not being removed from owners_list, this leads to accumulation of waiting frames and the application will hang because the waiting frames are under the assumption that another transaction is in the process of acquiring lock because owner-list is not empty. Handled this case as well in this patch. Added asserts to make it easier to find these problems in future. fixes bz#1699736 Change-Id: I3101393265e9827755725b1f2d94a93d8709e923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/dht: Request linkto xattrs in dht_rmdir opendirN Balachandran2019-04-101-1/+26
| | | | | | | | | | | | | | | | | | | If parallel-readdir is enabled, the rda xlator is loaded below dht in the graph and proactively lists and caches entries when an opendir is performed. dht_rmdir checks if the directory being deleted contains stale linkto files by performing a readdirp on its child subvols. However, as the entries are actually read in during the opendir operation which does not request the linkto xattr,no linkto xattrs are present for the entries causing dht to incorrectly identify them as data files and fail the rmdir operation with ENOTEMPTY. DHT now always adds the linkto xattr in the list of xattrs requested in the opendir. Change-Id: I0711198e66c59146282eb8b88084170bedfb4018 fixes: bz#1695399 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit 110006bbcd5bb3e814b4cfe7d74cb41891ac3b0c)
* cluster/dht: Fix lookup selfheal and rmdir raceN Balachandran2019-04-081-9/+25
| | | | | | | | | | | | | | | A race between the lookup selfheal and rmdir can cause directories to be healed only on non-hashed subvols. This can prevent the directory from being listed from the mount point and in turn causes rm -rf to fail with ENOTEMPTY. Fix: Update the layout information correctly and reduce the call count only after processing the response. Change-Id: I812779aaf3d7bcf24aab1cb158cb6ed50d212451 fixes: bz#1695403 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit b0f1d782fc45313fce4e1c0e74127401d5342d05)
* cluster/afr: Send truncate on arbiter brick from SHDkarthik-us2019-03-121-15/+13
| | | | | | | | | | | | | | | | | | | Problem: In an arbiter volume configuration SHD will not send any writes onto the arbiter brick even if there is data pending marker for the arbiter brick. If we have a arbiter setup on the geo-rep master and there are data pending markers for the files on arbiter brick, SHD will not mark any data changelog during healing. While syncing the data from master to slave, if the arbiter-brick is considered as ACTIVE, then there is a chance that slave will miss out some data. If the arbiter brick is being newly added or replaced there is a chance of slave missing all the data during sync. Fix: If there is data pending marker for the arbiter brick, send truncate on the arbiter brick during heal, so that it will record truncate as the data transaction in changelog. Change-Id: I3242ba6cea6da495c418ef860d9c3359c5459dec fixes: bz#1687687 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelogAshish Pandey2019-02-184-19/+87
| | | | | | | | | | | | | | | | | | | If a fop to create an entry fails on one of the data brick, we mark the pending changelog on the entry on brick for which it was successful. This is done as part of post op phase to make sure that entry gets healed even if it gets renamed to some other path where its parent was not marked as bad. As it happens as part of post op, we should consider thin-arbiter to check if the brick, which was successful, is the good brick or not. This will avoide split brain and other issues. >Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 >Signed-off-by: Ashish Pandey <aspandey@redhat.com> Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 updates: bz#1672314 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/dht: Delete invalid linkto files in rmdirN Balachandran2019-02-041-1/+4
| | | | | | | | | | | | rm -rf <dir> fails on dirs which contain linkto files that point to themselves because dht incorrectly thought that they were cached files after looking them up. The fix now treats them as invalid linkto files and deletes them. Change-Id: I376c72a5309714ee339c74485e02cfb4e29be643 fixes: bz#1671611 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: sync brick root perms on add brickN Balachandran2018-12-261-16/+9
| | | | | | | | | | | | | | | | | If a single brick is added to the volume and the newly added brick is the first to respond to a dht_revalidate call, its stbuf will not be merged into local->stbuf as the brick does not yet have a layout. The is_permission_different check therefore fails to detect that an attr heal is required as it only considers the stbuf values from existing bricks. To fix this, merge all stbuf values into local->stbuf and use local->prebuf to store the correct directory attributes. Change-Id: Ic9e8b04a1ab9ed1248b6b056e3450bbafe32e1bc fixes: bz#1660736 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory stateRavishankar N2018-12-126-76/+679
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1648205 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* afr: assign gfid during name heal when no 'source' is present.Ravishankar N2018-12-124-52/+52
| | | | | | | | | | | | | | | | | | | Problem: If parent dir is in split-brain or has dirty xattrs set, and the file has gfid missing on one of the bricks, then name heal won't assign the gfid. Fix: Use the brick we select the gfid from as the 'source'. Note: Problem was found while trying to debug a split-brain issue on Cynthia Zhou's setup. fixes: bz#1655545 Change-Id: Id088d4f0fb017aa35122de426654194e581ed742 Reported-by: Cynthia Zhou <cynthia.zhou@nokia-sbell.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 4d58730c0cd6ab5db39aec8a15276f7bd3371b04)
* afr: open_ftruncate_cbk should read fd from local->cont.open structSoumya Koduri2018-11-291-2/+2
| | | | | | | | | | | | | afr_open stores the fd as part of its local->cont.open struct but when it calls ftruncate (if open flags contain O_TRUNC), the corresponding cbk function (afr_ open_ftruncate_cbk) is incorrectly referencing uninitialized local->fd. This patch fixes the same. Change-Id: Icbdedbd1b8cfea11d8f41b6e5c4cb4b44d989aba updates: bz#1651322 Signed-off-by: Soumya Koduri <skoduri@redhat.com> (cherry picked from commit fda594875c4cdb2a22e27aa13f5c66bee032ccb5)
* cluster/afr: Use 2 domain locking in SHD for thin-arbiterkarthik-us2018-11-293-89/+160
| | | | | | | | | | | | | | | | | | | | | | | | With this change when SHD starts the index crawl it requests all the clients to release the AFR_TA_DOM_NOTIFY lock so that clients will know the in memory state is no more valid and any new operations needs to query the thin-arbiter if required. When SHD completes healing all the files without any failure, it will again take the AFR_TA_DOM_NOTIFY lock and gets the xattrs on TA to see whether there are any new failures happened by that time. If there are new failures marked on TA, SHD will start the crawl immediately to heal those failures as well. If there are no new failures, then SHD will take the AFR_TA_DOM_MODIFY lock and unsets the xattrs on TA, so that both the data bricks will be considered as good there after. >Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1 >fixes: bz#1579788 >Signed-off-by: karthik-us <ksubrahm@redhat.com> (cherry picked from commit 5784a00f997212d34bd52b2303e20c097240d91c) Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1 fixes: bz#1648205
* cluster/ec: prevent infinite loop in self-heal fullXavi Hernandez2018-11-291-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a problem in commit 7f81067 that caused infinite loop when full heal was triggered. The previous commit was made to prevent self-heal to go idle after a replace brick operation. One of the changes consisted on setting a flag to force an immediate scan of the dirty directory if a heal on a directory succeeded (assuming it could have generated newer entries). However that change was causing an issue with a full self-heal, since every time an already healed directory was checked and it returned suceessfully, it was also setting the flag, forcing self-heal to start over again. This patch fixes this issue by only setting the flag if the heal is not full. It's assumed that a full self-heal will already traverse all entries automatically, so there's no need to force a new scan later. >Change-Id: Id12dbfc04e622b18183e796cc6cc87ccc30a6d55 >fixes: bz#1636631 >Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> (cherry picked from commit 7150c51ad75ccba22045a35fc31e5037612d1ad4) Change-Id: Id12dbfc04e622b18183e796cc6cc87ccc30a6d55 fixes: bz#1651525 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Change log level to DEBUG for lookup combineAshish Pandey2018-11-141-1/+1
| | | | | | | | | | | | | | As lookup is not a locked fop, we can not trust the data received in this to be same. Changing the log level to DEBUG in case lookup finds any difference. (cherry picked from commit 9be6bf3d90e3783b3ba559c93d41b933f8d53f03) Change-Id: I39499c44688a2455c7c6c69a798762d045d21b39 updates: bz#1644622 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* all: fix the format string exceptionsAmar Tumballi2018-11-094-13/+14
| | | | | | | | | | | | | | | | Currently, there are possibilities in few places, where a user-controlled (like filename, program parameter etc) string can be passed as 'fmt' for printf(), which can lead to segfault, if the user's string contains '%s', '%d' in it. While fixing it, makes sense to make the explicit check for such issues across the codebase, by making the format call properly. Fixes: CVE-2018-14661 Fixes: bz#1647666 Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* cluster/afr : Check for UP bricks before starting healAshish Pandey2018-11-083-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently for replica volume, even if only one brick is UP SHD will keep crawling index entries even if it can not heal anything. In thin-arbiter volume which is also a replica 2 volume, this causes inode lock contention which in turn sends upcall to all the clients to release notify locks, even if it can not do anything for healing. This will slow down the client performance and kills the purpose of keeping in memory information about bad brick. Solution: Before starting heal or even crawling, check if sufficient number of children are UP and available to check and heal entries. (cherry picked from commit f73b4476b15f9d6d3dc3c8e20c9742aacd857f9f) Change-Id: I011c9da3b37cae275f791affd56b8f1c1ac9255d updates: bz#1644645 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* afr/lease: Read child nodes from lease structureroot2018-11-081-1/+1
| | | | | | | | | | For lease operation, we allocate and store child nodes data in lease structure. Use the same in afr_lease_cbk() while checking for the quorum. Change-Id: If1fdd5a0798888afd39ad3df57d96487baf9d1e6 updates: #350 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
* afr: fix incorrect reporting of directory split-brainRavishankar N2018-10-117-16/+22
| | | | | | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/glusterfs/+/21135/ Problem: When a directory has dirty xattrs due to failed post-ops or when replace/reset brick is performed, AFR does a conservative merge as expected, but heal-info reports it as split-brain because there are no clear sources. Fix: Modify pending flag to contain information about pending heals and split-brains. For directories, if spit-brain flag is not set,just show them as needing heal and not being in split-brain. Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383 fixes: bz#1638163 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: prevent winding inodelks twice for arbiter volumesRavishankar N2018-10-111-1/+1
| | | | | | | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/glusterfs/+/21380/ Problem: In an arbiter volume, if there is a pending data heal of a file only on arbiter brick, self-heal takes inodelks twice due to a code-bug but unlocks it only once, leaving behind a stale lock on the brick. This causes the next write to the file to hang. Fix: Fix the code-bug to take lock only once. This bug was introduced master with commit eb472d82a083883335bc494b87ea175ac43471ff Thanks to Pranith Kumar K <pkarampu@redhat.com> for finding the RCA. fixes: bz#1638159 Change-Id: I15ad969e10a6a3c4bd255e2948b6be6dcddc61e1 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Make data eager-lock decision based on number of locksPranith Kumar K2018-10-053-6/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For both Virt and block workloads the file is opened multiple times leading to dynamically setting eager-lock to off for the workload. Instead of depending on the number-of-open-fds, if we change the logic to depend on number of inodelks, then it will give better performance than the earlier logic. When there is an eager-lock and number of inodelks is more than 1 we know that there is a conflicting lock, so depend on that information to decide whether to keep the current transaction go through delayed-post-op or not. Locks xlator doesn't have implementation to query number of locks in fxattrop in releases older than 3.10 so to keep things backward compatible in 3.12, data transactions will use new logic where as fxattrop transactions will use old logic. I am planning to send one more patch which makes metadata domain locks also depend on inodelk-count Profile info for a dd of 500MB to a file with another fd opened on the file using exec 250>filename Without this patch: 0.14 67.41 us 16.72 us 3870.82 us 892 FINODELK 0.59 279.87 us 95.71 us 2085.89 us 898 FXATTROP 3.46 366.43 us 81.75 us 6952.79 us 4000 WRITE 95.79 148733.99 us 50568.12 us 919127.86 us 273 FSYNC With this patch: 0.00 51.01 us 38.07 us 80.16 us 4 FINODELK 0.00 235.43 us 235.43 us 235.43 us 1 TRUNCATE 0.00 125.07 us 56.80 us 193.33 us 2 GETXATTR 0.00 135.86 us 62.13 us 209.59 us 2 INODELK 0.00 197.88 us 155.39 us 253.90 us 4 FXATTROP 0.00 450.59 us 394.28 us 506.89 us 2 XATTROP 0.00 56.96 us 19.06 us 406.59 us 23 FLUSH 37.81 273648.93 us 48.43 us 6017657.05 us 44 LOOKUP 62.18 4951.86 us 93.80 us 1143154.75 us 3999 WRITE postgresql benchmark performance changed from ~1130 TPS to ~2300TPS randio fio job inside Ovirt based VM went from ~600IOPs to ~2000IOPS fixes bz#1635972 Change-Id: If7f7388d2f08cf7f17ca517a4ea222560661dc36 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Batch writes in same lock even when multiple fds are openPranith Kumar K2018-10-051-9/+2
| | | | | | | | | | | | | | | | | | | | | | Problem: When eager-lock is disabled because of multiple-fds opened and app writes come on conflicting regions, the number of locks grows very fast leading to all the CPU being spent just in locking and unlocking by traversing huge queues in locks xlator for granting locks. Fix: Reduce the number of locks in transit by bundling the writes in the same lock and disable delayed piggy-pack when we learn that multiple fds are open on the file. This will reduce the size of queues in the locks xlator. This also reduces the number of network calls like inodelk/fxattrop. Please note that this problem can still happen if eager-lock is disabled as the writes will not be bundled in the same lock. fixes bz#1635975 Change-Id: I8fd1cf229aed54ce5abd4e6226351a039924dd91 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* gfapi: revert several patchs that introduced pre/post attrsShyamsundarR2018-09-172-13/+12
| | | | | | | | | | | | | | Reverted the following: - 248152767b0599986bbb6bb35fc27197f6be6964 - 09943beb499617212f2985ca8ea9ecd1ed1b470e - d01f7244e9d9f7e3ef84e0ba7b48ef1b1b09d809 The reverts are redone by hand, due to clang format changes that made using git to revert the changes more tedious. Change-Id: I96489638a2b641fb2206a110298543225783f7be Updates: bz#1628620 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* build: cleanup xlator link, --no-undefined, libuuidv6devKaleb S. KEITHLEY2018-09-121-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While attempting to build a (pre-)5.0 of glusterfs on Ubuntu bionic and cosmic, it became apparent that there are some gremlins hiding in the combination of the xlator export-symbols, the newish addition of -Wl,--no-undefined, and the new switch to libuuid from the old contrib/uuid. Note: even though Fedora 28 (and later) and Ubuntu bionic (and later) have the same nominal version of libtool, the Fedora version appears to do a better job of recursing through dependencies to determine the libraries to link with. Examination of the build logs showed that despite appearing to work on Fedora, not all xlators and shared libs were linked with -Wl, --no-undefined, and -luuid. And in the case of the gnfs xlator, it was not only not linked with -Wl,--no-undefined but alsos not linked with -lgfxdr and -lgfrpc. Added GF_XLATOR_LDFLAGS, similar to GF_XLATOR_DEFAULT_LDFLAGS. GF_XLATOR_DEFAULT_LDFLAGS is for xlators that export/expose the default or common set of symbols. GF_XLATOR_LDFLAGS is for those remaining xlators that export/expose non-default symbols, e.g. dht and glupy. This removes the need in the future to add things like $(UUID_LIBS) to every xlator's Makefile.am. Just add it to GF_XLATOR_LDFLAGS and GF_XLATOR_DEFAULT_LDFLAGS in configure.ac and you're done. This patch was tested on Fedora 28 (build, rpmbuild), Fedora Rawhide/30 (rpmbuild), RHEL8 (rpmbuild), CentOS7 (rpmbuild), Fedora koji --scratch build for f30/rawhide, and a Launchpad build for Ubuntu cosmic/18.10. Change-Id: Ieca104fa5c5d3c094e701c8ca4a73754dd0292b0 updates: bz#1193929 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
* Land part 2 of clang-format changesGluster Ant2018-09-1259-70721/+67551
| | | | | Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
* Land clang-format changesGluster Ant2018-09-1232-3929/+3873
| | | | Change-Id: I6f5d8140a06f3c1b2d196849299f8d483028d33b
* dht: Use snprintf instead of strncpyN Balachandran2018-09-121-9/+20
| | | | | | | | | | | The recent changes to use malloc instead of calloc left the new_name and new_path non-null terminated. We now use snprintf instead of strncpy to fix this. Change-Id: I1a31701ca9447efde38921be0ba2c73cde2e7976 fixes: bz#1626346 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Create a linkto file if requiredN Balachandran2018-09-101-0/+30
| | | | | | | | | | | Using the dht_filter_loc_subvol_key to create files on specific subvols did not create a linkto file. This can make the file inaccessible as lookup-optimize is now enabled by default. Change-Id: I78add5a31887378a479cb9c746b91678876b0dbe fixes: bz#1626394 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: optimize readdir for 1xn volsN Balachandran2018-09-101-12/+42
| | | | | | | | | Skip the hashed subvol check for volumes with distribute count of 1. Change-Id: I5703508b54a17c49a217c8a8e09884980705953a fixes: bz#1608175 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* ec-heal: remove a duplicate definition of alloca0Amar Tumballi2018-09-101-1/+0
| | | | | | | | | | the same macro is defined in common-utils.h, which seems to be much better place for the same. Updates: bz#1193929 Change-Id: I409b719c291102136500b955e5827a550142ed96 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* xlators/cluster/dht/src/tier-common.c:move to GF_MALLOC() instead of ↵Yaniv Kaul2018-09-101-1/+1
| | | | | | | | | | | | | | | | | | GF_CALLOC() when It doesn't make sense to calloc (allocate and clear) memory when the code right away fills that memory with data. It may be optimized by the compiler, or have a microscopic performance improvement. Please review carefully, especially for string allocation, with the terminating NULL string. Only compile-tested! Change-Id: I7d38a7d576f6777976fe86e5351a8d95caddbb9c updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/dht: Rework the debug xattr to get hashed subvolN Balachandran2018-09-072-28/+33
| | | | | | | | | | | | | | | | | | | The earlier implementation required the file to already exist when trying to get the hashed subvol. The reworked implementation allows a user to get the hashed subvol for any filename, whether it exists or not. Usage: getfattr -n "dht.file.hashed-subvol.<filename>" <parent dir> Eg:To get the hashed subvol for file-1 inside dir-1 getfattr -n "dht.file.hashed-subvol.file-1" /mnt/gluster/dir1 credit: rgowdapp@redhat.com Change-Id: Iae20bd5f56d387ef48c1c0a4ffa9f692866bf739 fixes: bz#1624244 Signed-off-by: N Balachandran <nbalacha@redhat.com>