summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: Fix dict-leak in pre-opPranith Kumar K2018-08-173-20/+20
| | | | | | | | | | | | | | | | | At the time of pre-op, pre_op_xdata is populted with the xattrs we get from the disk and at the time of post-op it gets over-written without unreffing the previous value stored leading to a leak. This is a regression we missed in https://review.gluster.org/#/q/ba149bac92d169ae2256dbc75202dc9e5d06538e Originally: > Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> > (cherry picked from commit e7b79c59590c203c65f7ac8548b30d068c232d33) Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7 Fixes: bz#1613512 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr: don't update readables if inode refresh failed on all childrenRavishankar N2018-07-114-21/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of: https://review.gluster.org/#/c/20029/ 3.12 still supports quorum-reads, hence modified afr_inode_refresh_done() to support that. If inode refresh failed on all children of afr due to ENOENT (say file migrated by dht), it resets the readables to zero. Any inflight txn which then later comes on the inode fails with EIO because no readable children present for the inode. Fix: Don't update readables when inode refresh fails on *all* children of afr. In that way any inflight txns will either proceed with its own inode refresh if needed and fail it with the right errno or use the old value of readables and continue with the txn. Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we seem to be winding the lock and checking for quorum only in pre-op pahse. Note: This should ideally fix BZ 1329505 since the stop gap fix for it is has been reverted at https://review.gluster.org/#/c/20028. Change-Id: I82990769f01be918a073fec83fc67ba4b3be24b1 BUG: 1599247 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: heal gfids when file is not present on all bricksRavishankar N2018-07-115-12/+51
| | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/20271/ (only change is in .t) commit 20fa80057eb430fd72b4fa31b9b65598b8ec1265 introduced a regression wherein if a file is present in only 1 brick of replica *and* doesn't have a gfid associated with it, it doesn't get healed upon the next lookup from the client. Fix it. Change-Id: I7d1111dcb45b1b8b8340a7d02558f05df70aa599 BUG: 1598121 fixes: bz#1598121 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit eb472d82a083883335bc494b87ea175ac43471ff)
* afr: fix bug-1363721.t failureRavishankar N2018-07-093-1/+70
| | | | | | | | | | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/20036/ Note: We need to update inode context's write_subvol even in case of compound fops. This is not there in master and 4.1 since compound FOPS was removed in it. Problem: In the .t, when the only good brick was brought down, writes on the fd were still succeeding on the bad bricks. The inflight split-brain check was marking the write as failure but since the write succeeded on all the bad bricks, afr_txn_nothing_failed() was set to true and we were unwinding writev with success to DHT and then catching the failure in post-op in the background. Fix: Don't wind the FOP phase if the write_subvol (which is populated with readable subvols obtained in pre-op cbk) does not have at least 1 good brick which was up when the transaction started. Change-Id: I4a1fef4569609c31cffeaef591a64c10870e8d0b BUG: 1598720 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: add quorum checks in pre-opRavishankar N2018-07-061-29/+31
| | | | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/19781/ Problem: We seem to be winding the FOP if pre-op did not succeed on quorum bricks and then failing the FOP with EROFS since the fop did not meet quorum. This essentially masks the actual error due to which pre-op failed. (See BZ). Fix: Skip FOP phase if pre-op quorum is not met and go to post-op. Change-Id: Ie58a41e8fa1ad79aa06093706e96db8eef61b6d9 BUG: 1597154 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: capture the correct errno in post-op quorum checkRavishankar N2018-07-051-8/+8
| | | | | | | | | | If the post-op phase of txn did not meet quorm checks, use that errno to unwind the FOP rather than blindly setting ENOTCONN. Change-Id: I0cb0c8771ec75a45f9a25ad4cd8601103deddf0c BUG: 1597120 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 440a048f24b006c80af3d7bcd0a1f13fe3459d87)
* cluster/dht: act as passthrough for renames on single child DHTRaghavendra G2018-07-051-7/+15
| | | | | | | | | | | Various synchronization present in dht_rename while handling directories and files is necessary only if we have more than only one child. Change-Id: Ie21ad419125504ca2f391b1ae2e5c1d166fee247 fixes: bz#1563513 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* afr: add quorum checks in post-opRavishankar N2018-07-041-0/+29
| | | | | | | | | | | afr relies on pending changelog xattrs to identify source and sinks and the setting of these xattrs happen in post-op. So if post-op fails, we need to unwind the write txn with a failure. Change-Id: I0f019ac03890108324ee7672883d774918b20be1 BUG: 1597120 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit a40a87ec3b226ae86a6ed8f4af25b45965a20cad)
* afr: don't treat all cases all bricks being blamed as split-brainRavishankar N2018-07-042-9/+48
| | | | | | | | | | | | | | | | | | | | | | | | Problem: We currently don't have a roll-back/undoing of post-ops if quorum is not met. Though the FOP is still unwound with failure, the xattrs remain on the disk. Due to these partial post-ops and partial heals (healing only when 2 bricks are up), we can end up in split-brain purely from the afr xattrs point of view i.e each brick is blamed by atleast one of the others. These scenarios are hit when there is frequent connect/disconnect of the client/shd to the bricks while I/O or heal are in progress. Fix: Instead of undoing the post-op, pick a source based on the xattr values. If 2 bricks blame one, the blamed one must be treated as sink. If there is no majority, all are sources. Once we pick a source, self-heal will then do the heal instead of erroring out due to split-brain. Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae BUG: 1597123 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 0e6e8216823c2d9dafb81aae0f6ee3497c23d140)
* cluster/dht: Remove EIO from dht_inode_missingN Balachandran2018-07-042-4/+2
| | | | | | | | | | | Removed EIO from the list of errnos that triggered a migrate check task. (cherry picked from commit c925962b91c67c8cd2391df7dd0251e0cbf66648) Change-Id: I7f89c7a16056421588f1af2377cebe6affddcb47 BUG: 1579673 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Fix dht_rename lock orderN Balachandran2018-05-091-18/+47
| | | | | | | | | | Fixed dht_order_rename_lock to use the same inodelk ordering as that of the dht selfheal locks (dictionary order of lock subvolumes). Change-Id: Ia3f8353b33ea2fd3bc1ba7e8e777dda6c1d33e0d BUG: 1570475 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Handle file migrations when brick downN Balachandran2018-04-181-5/+51
| | | | | | | | | | | | | | | | | | | The decision as to which node would migrate a file was based on the gfid of the file. Files were divided among the nodes for the replica/disperse set. However, if a brick was down when rebalance started, the nodeuuids would be saved as NULL and a set of files would not be migrated. Now, if the nodeuuid is NULL, the first non-null entry in the set is the node responsible for migrating the file. Change-Id: I72554c107792c7d534e0f25640654b6f8417d373 fixes: bz#1566820 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit 1f0765242a689980265c472646c64473a92d94c0) Change-Id: Id1a6e847b0191b6a40707bea789a2a35ea3d9f68
* cluster/dht: Wind open to all subvolsN Balachandran2018-04-181-10/+5
| | | | | | | | | | | dht_opendir should wind the open to all subvols whether or not local->subvols is set. This is because dht_readdirp winds the calls to all subvols. Change-Id: I67a96b06dad14a08967c3721301e88555aa01017 updates: bz#1566820 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit c4251edec654b4e0127577e004923d9729bc323d)
* cluster/afr: Fixing the flaws in arbiter becoming source patchRavishankar N2018-04-187-179/+276
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of https://review.gluster.org/19045 Problem: Setting the write_subvol value to read_subvol in case of metadata transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03) might lead to the original problem of arbiter becoming source. Scenario: 1) All bricks are up and good 2) 2 writes w1 and w2 are in progress in parallel 3) ctx->read_subvol is good for all the subvolumes 4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on the disk 5) read/lookup comes on the same file and refreshes read_subvols back to all good 6) metadata transaction happens which makes ctx->write_subvol to be assigned with ctx->read_subvol which is all good 7) w2 succeeds on brick1 and fails on brick0 and this will update the brick in reverse order leading to arbiter becoming source Fix: Instead of setting the ctx->write_subvol to ctx->read_subvol in the pre-op statge, if there is a metadata transaction, check in the function __afr_set_in_flight_sb_status() if it is a data/metadata transaction. Use the value of ctx->write_subvol if it is a data transactions and ctx->read_subvol value for other transactions. With this patch we assign the value of ctx->write_subvol in the afr_transaction_perform_fop() with the on disk value, instead of assigning it in the afr_changelog_pre_op() with the in memory value. Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4 BUG: 1566131 Signed-off-by: karthik-us <ksubrahm@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Fix for arbiter becoming sourcekarthik-us2018-04-184-6/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of https://review.gluster.org/#/c/18049/ Problem: When eager-lock is on, and two writes happen in parallel on a FD we were observing the following behaviour: - First write fails on one data brick - Since the post-op is not yet happened, the inode refresh will get both the data bricks as readable and set it in the inode context - In flight split brain check see both the data bricks as readable and allows the second write - Second write fails on the other data brick - Now the post-op happens and marks both the data bricks as bad and arbiter will become source for healing Fix: Adding one more variable called write_suvol in inode context and it will have the in memory representation of the writable subvols. Inode refresh will not update this value and its lifetime is pre-op through unlock in the afr transaction. Initially the pre-op will set this value same as read_subvol in inode context and then in the in flight split brain check we will use this value instead of read_subvol. After all the checks we will update the value of this and set the read_subvol same as this to avoid having incorrect value in that. Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3 BUG: 1566131 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/dht: Skipped files are not treated as errorsN Balachandran2018-04-061-5/+9
| | | | | | | | | | | | | For skipped files, use a return value of 1 to prevent error messages being logged. > Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861 > BUG: 1553598 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861 BUG: 1555161 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/afr: Prevent ping-event handling on shdPranith Kumar K2018-04-061-0/+2
| | | | | | | | | | On shd, we shouldn't treat any brick down based on latency, otherwise self-heal will never happen fixes: 1562723 Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7 BUG: 1562723 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: send list-node-uuids request to all subvolumesXavi Hernandez2018-04-061-1/+1
| | | | | | | | | | | | | | | The xattr trusted.glusterfs.list-node-uuids was only sent to a single subvolume. This was returning null uuids from the other subvolumes as if they were down. This fix forces that xattr to be requested from all subvolumes. Backport of: > BUG: 1561406 Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f BUG: 1561731 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: Change default read policy to gfid-hashAshish Pandey2018-04-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Whenever we read data from file over NFS, NFS reads more data then requested and caches it. Based on the stat information it makes sure that the cached/pre-read data is valid or not. Consider 4 + 2 EC volume and all the bricks are on differnt nodes. In EC, with round-robin read policy, reads are sent on different set of data bricks. This way, it balances the read fops to go on all the bricks and avoid heating UP (overloading) same set of bricks. Due to small difference in clock speed, it is possible that we get minor difference for atime, mtime or ctime for different bricks. That might cause a different stat returned to NFS based on which NFS will discard cached/pre-read data which is actually not changed and could be used. Solution: Change read policy for EC as gfid-hash. That will force all the read to go to same set of bricks. >Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84 >BUG: 1554743 >Signed-off-by: Ashish Pandey <aspandey@redhat.com> Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84 BUG: 1558352 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: avoid delays in self-healXavi Hernandez2018-04-064-48/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Self-heal creates a thread per brick to sweep the index looking for files that need to be healed. These threads are started before the volume comes online, so nothing is done but waiting for the next sweep. This happens once per minute. When a replace brick command is executed, the new graph is loaded and all index sweeper threads started. When all bricks have reported, a getxattr request is sent to the root directory of the volume. This causes a heal on it (because the new brick doesn't have good data), and marks its contents as pending to be healed. This is done by the index sweeper thread on the next round, one minute later. This patch solves this problem by waking all index sweeper threads after a successful check on the root directory. Additionally, the index sweep thread scans the index directory sequentially, but it might happen that after healing a directory entry more index entries are created but skipped by the current directory scan. This causes the remaining entries to be processed on the next round, one minute later. The same can happen in the next round, so the heal is running in bursts and taking a lot to finish, specially on volumes with many directory levels. This patch solves this problem by immediately restarting the index sweep if a directory has been healed. Backport of: > BUG: 1547662 Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e BUG: 1555201 Signed-off-by: Xavi Hernandez <jahernan@redhat.com>
* cluster/dht: ENOSPC will not fail rebalanceN Balachandran2018-04-021-9/+3
| | | | | | | | | ENOSPC returned by a file migration is no longer considered a rebalance failure. Change-Id: I21cf3a8acdc827bc478e138d6cb5db649d53a28c BUG: 1555161 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/afr: Fail open on split-brainPranith Kumar K2018-03-0811-95/+208
| | | | | | | | | | | | | | | | | | Problem: Append on a file with split-brain succeeds. Open is intercepted by open-behind, when write comes on the file, open-behind does open+write. Open succeeds because afr doesn't fail it. Then write succeeds because write-behind intercepts it. Flush is also intercepted by write-behind, so the application never gets to know that the write failed. Fix: Fail open on split-brain, so that when open-behind does open+write open fails which leads to write failure. Application will know about this failure. Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15 BUG: 1544635 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> (cherry picked from commit 786343abca3474ff01aa1017210112d97cbc4843)
* cluster/dht: Handle single dht child in dht_lookupN Balachandran2018-03-051-0/+10
| | | | | | | | | | | | | | | | | | | This patch limits itself to only handling the case where no file (data or linkto) exists on the subvol. Additional cases to be handled: 1. A linkto file was found on the only child subvol. This currently calls dht_lookup_everywhere which eventually deletes it. It can be deleted directly as it will not be pointing to a valid subvol. 2. Directory lookups - locking might be unnecessary in some cases. > Change-Id: I940ba34531f2aaee1d36fd9ca45ecfd46be662a4 > BUG: 1546620 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I940ba34531f2aaee1d36fd9ca45ecfd46be662a4 BUG: 1548270 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Ignore ENODATA from getxattr for posix aclsN Balachandran2018-03-051-7/+8
| | | | | | | | | | | | | dht_migrate_file no longer prints an error if getxattr for posix acls fails with ENODATA/ENOATTR. > Change-Id: Id9ecf6852cb5294c1c154b28d609889ea3420e1c > BUG: 1546954 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: Id9ecf6852cb5294c1c154b28d609889ea3420e1c BUG: 1548078 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Fixed a typoN Balachandran2018-03-051-2/+2
| | | | | | | | | | | | Replaced "then" with "than" > Change-Id: I73090e8c1a639befd7c5458e8d63bd173248bc7d > BUG: 1547128 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I73090e8c1a639befd7c5458e8d63bd173248bc7d BUG: 1547841 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Skip '..' for the volume root dirN Balachandran2018-02-121-0/+5
| | | | | | | | | | | | | | | | dht_populate_inode_for_dentry tries to update the layout for the '..' entry when listing the root of the volume. This entry does not correspond to an entry in the volume and therefore does not have a gfid or a layout on disk, causing layout processing to fail. > Change-Id: I2b7470e1c5e20d87b5545160697f24d041045140 > BUG: 1537457 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I2b7470e1c5e20d87b5545160697f24d041045140 BUG: 1539516 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Cleanup on fallocate failureN Balachandran2018-02-091-1/+17
| | | | | | | | | | | | | | | It looks like fallocate leaves a non-empty file behind in case of some failures. We now truncate the file to 0 bytes on failure in __dht_rebalance_create_dst_file. > Change-Id: Ia4ad7b94bb3624a301fcc87d9e36c4dc751edb59 > BUG: 1541916 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: Ia4ad7b94bb3624a301fcc87d9e36c4dc751edb59 BUG: 1542601 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Unlink linkto files as rootN Balachandran2018-02-081-3/+7
| | | | | | | | | | | | | | | Non-privileged users cannot delete linkto files. However the failure to unlink a stale linkto causes DHT to fail the lookup with EIO and hence prevent access to the file. > Change-Id: Id295362d41e52263790694602f36f1219f0646a2 > BUG: 1542318 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: Id295362d41e52263790694602f36f1219f0646a2 BUG: 1543016 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Fixed leak in dht_populate_inode_for_dentryN Balachandran2018-02-072-6/+10
| | | | | | | | | | | | | | | | Fixed an issue in dht_populate_inode_for_dentry where a layout is set in the inode without checking if it is already set. This overwrites the value each time without freeing the already existing layout. Also includes the changes in https://review.gluster.org/19471 > Change-Id: I651bf539a0b82b4ddc4c355890c16a8e91f5f1fd > BUG: 1541264 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I651bf539a0b82b4ddc4c355890c16a8e91f5f1fd BUG: 1541267 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/afr: remove unnecessary child_up initializationXavier Hernandez2018-02-061-7/+0
| | | | | | | | | | | | | | | The child_up array was initialized with all elements being -1 to allow afr_notify() to differentiate down bricks from bricks that haven't reported yet. With current implementation this is not needed anymore and it was causing unexpected results when other parts of the code considered that if child_up[i] != 0, it meant that it was up. Backport of: > BUG: 1541038 Change-Id: I2a9d712ee64c512f24bd5cd3a48dcb37e3139472 BUG: 1541930 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/dht: Add migration checks to dht_(f)xattropN Balachandran2018-02-065-45/+325
| | | | | | | | | | | | | | | | The dht_(f)xattrop implementation did not implement migration phase1/phase2 checks which could cause issues with rebalance on sharded volumes. This does not solve the issue where fops may reach the target out of order. > Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c > BUG: 1471031 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c BUG: 1540224 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/ec: OpenFD heal implementation for ECSunil Kumar Acharya2018-02-027-32/+184
| | | | | | | | | | | | | | | | Existing EC code doesn't try to heal the OpenFD to avoid unnecessary healing of the data later. Fix implements the healing of open FDs before carrying out file operations on them by making an attempt to open the FDs on required up nodes. Backport of: >BUG: 1431955 BUG: 1536334 Change-Id: Ib696f59c41ffd8d5678a484b23a00bb02764ed15 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/dht: Use percentages for space checkN Balachandran2018-01-121-5/+20
| | | | | | | | | | | | | | | | | | | | | With heterogenous bricks now being supported in DHT we could run into issues where files are not migrated even though there is sufficient space in newly added bricks which just happen to be considerably smaller than older bricks. Using percentages instead of absolute available space for space checks can mitigate that to some extent. Marking bug-1247563.t as that used to depend on the easier code to prevent a file from migrating. This will be removed once we find a way to force a file migration failure. > Change-Id: I3452520511f304dbf5af86f0632f654a92fcb647 > BUG: 1529440 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I3452520511f304dbf5af86f0632f654a92fcb647 BUG: 1530455 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: don't overfill the buffer in readdir(p)Raghavendra G2017-12-081-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | Superflous dentries that cannot be fit in the buffer size provided by kernel are thrown away by fuse-bridge. This means, * the next readdir(p) seen by readdir-ahead would have an offset of a dentry returned in a previous readdir(p) response. When readdir-ahead detects non-monotonic offset it turns itself off which can result in poor readdir performance. * readdirp can be cpu-intensive on brick and there is no point to read all those dentries just to be thrown away by fuse-bridge. So, the best strategy would be to fill the buffer optimally - neither overfill nor underfill. > Change-Id: Idb3d85dd4c08fdc4526b2df801d49e69e439ba84 > BUG: 1492625 > Signed-off-by: Raghavendra G <rgowdapp@redhat.com> (cherry picked from commit e785faead91f74dce7c832848f2e8f3f43bd0be5) Change-Id: Idb3d85dd4c08fdc4526b2df801d49e69e439ba84 BUG: 1478411 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: populate inode in dentry for single subvolume dhtRaghavendra G2017-12-062-1/+69
| | | | | | | | | | | | | | | ... in readdirp response if dentry points to a directory inode. This is a special case where the entire layout is stored in one single subvolume and hence no need for lookup to construct the layout >Change-Id: I44fd951e2393ec9dac2af120469be47081a32185 >BUG: 1492625 >Signed-off-by: Raghavendra G <rgowdapp@redhat.com> (cherry picked from commit 59d1cc720f52357f7a6f20bb630febc6a622c99c) Change-Id: I44fd951e2393ec9dac2af120469be47081a32185 BUG: 1478411 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/afr: Honor default timeout of 5min for analyzing split-brain fileskarthik-us2017-11-301-1/+5
| | | | | | | | | | | | | | | | | | | | Problem: After setting split-brain-choice option to analyze the file to resolve the split brain using the command "setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>" should allow to access the file from mount for default timeout of 5mins. But the timeout was not honored and was able to access the file even after the timeout. Fix: Call the inode_invalidate() in afr_set_split_brain_choice_cbk() so that it will triger the cache invalidate after resetting the timer and the split brain choice. So the next calls to access the file will fail with EIO. Change-Id: I698cb833676b22ff3e4c6daf8b883a0958f51a64 BUG: 1514380 Signed-off-by: karthik-us <ksubrahm@redhat.com> (cherry picked from commit 933ec57ccda2c1ba5ce6f207313c3b6802e67ca3)
* cluster/dht: make rebalance use truncate incaseSusant Palai2017-11-233-71/+99
| | | | | | | | | | | | | .. the brick file system does not support fallocate. > Change-Id: Id76cda2d8bb3b223b779e5e7a34f17c8bfa6283c > BUG: 1488103 > Signed-off-by: Susant Palai <spalai@redhat.com> Change-Id: Id76cda2d8bb3b223b779e5e7a34f17c8bfa6283c BUG: 1516691 Signed-off-by: Susant Palai <spalai@redhat.com>
* cluster/dht: Don't set ACLs on linkto fileN Balachandran2017-11-201-0/+11
| | | | | | | | | | | | | | | | | The trusted.SGI_ACL_FILE appears to set posix ACLs on the linkto file that is a target of file migration. This can mess up file permissions and cause linkto identification to fail. Now we remove all ACL xattrs from the results of the listxattr call on the source before setting them on the target. > BUG: 1514329 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I56802dbaed783a16e3fb90f59f4ce849f8a4a9b4 BUG: 1515042 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: fix crash when deleting directoriesZhang Huan2017-10-251-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | In DHT, after locks on all subvolumes are acquired, it would perform the following steps sequentially, 1. send remove dir on all other subvolumes except the hashed one in a loop; 2. wait for all pending rmdir to be done 3. remove dir on the hashed subvolume The problem is that in step 1 there is a check to skip hashed subvolume in the loop. If the last subvolume to check is actually the hashed one, and step 3 is quickly done before the last and hashed subvolume is checked, by accessing shared context data be destroyed in step 3, would cause a crash. Fix by saving shared data in a local variable to access later in the loop. > BUG: 1490642 > Signed-off-by: Zhang Huan <zhanghuan@open-fs.com> (cherry picked from commit 206120126d455417a81a48ae473d49be337e9463) Change-Id: I8db7cf7cb262d74efcb58eb00f02ea37df4be4e2 BUG: 1505221 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/ec: Improve performance with xattrop updateSunil Kumar Acharya2017-10-121-24/+102
| | | | | | | | | | | | | | | | Existing EC code updates the xattr on the subvolume in a sequential pattern resulting in very poor performance. With this fix EC now updates the xattr on the subvolume in parallel which improves the xattr update performance. >BUG: 1445663 >Change-Id: I3fc40d66db0b88875ca96a9fa01002ba386c0486 >Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com> BUG: 1499150 Change-Id: I3fc40d66db0b88875ca96a9fa01002ba386c0486 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/afr: Make choose-local "reconfigurable"Krutika Dhananjay2017-10-121-0/+11
| | | | | | | | | | | | | | | | Backport of: > Change-Id: Ibab292ba705d993b475cd0303fb3318211fb2500 > Reviewed-on: https://review.gluster.org/18026 > BUG: 1480525 > cherry-picked from commit 1e2d6537875d16b783e3c50ada7ee61487c6d796 With this change, enabling choose-local (which means its state makes transition from "off" to "on") will be effective after the first gfid-lookup on "/" since volume-set was executed. Change-Id: Ibab292ba705d993b475cd0303fb3318211fb2500 BUG: 1501022 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* cluster/dht: Don't store the entire uuid for subvolsN Balachandran2017-10-124-17/+45
| | | | | | | | | | | | | | | | | Comparing the uuid string of the local node against that stored in the local_subvol information is inefficient, especially as it is done for every file to be migrated. The code has now been changed to set the value of info to 1 if the nodeuuid is that of the node making the comparison so this becomes an integer comparison. > BUG: 1451434 > Signed-off-by: N Balachandran <nbalacha@redhat.com> > https://review.gluster.org/#/c/17851 (cherry picked from commit c4a608799a577a4f38139f6bb8a47da8efb0fec3) Change-Id: I7491d59caad3b71dbf5facc94dcde0cd53962775 BUG: 1500472 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* afr: heal gfid as a part of entry healRavishankar N2017-10-104-67/+120
| | | | | | | | | | | | | | | | | | | Problem: If a brick crashes after an entry (file or dir) is created but before gfid is assigned, the good bricks will have pending entry heal xattrs but the heal won't complete because afr_selfheal_recreate_entry() tries to create the entry again and it fails with EEXIST. Fix: We could have fixed posx_mknod/mkdir etc to assign the gfid if the file already exists but the right thing to do seems to be to trigger a lookup on the bad brick and let it heal the gfid instead of winding an mknod/mkdir in the first place. (cherry picked from commit 20fa80057eb430fd72b4fa31b9b65598b8ec1265) Change-Id: I82f76665a7541f1893ef8d847b78af6466aff1ff BUG: 1499202 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Sending subvol up/down events when subvol comes up or goes downkarthik-us2017-10-061-0/+2
| | | | | | | | | | > BUG: 1493539 (cherry picked from commit 3bbb4fe4b33dc3a3ed068ed2284077f2a4d8265a) Change-Id: I6580351b245d5f868e9ddc6a4eb4dd6afa3bb6ec BUG: 1492066 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: don't check for file size in afr_mark_source_sinks_if_file_emptyRavishankar N2017-10-051-6/+7
| | | | | | | | | | ... for AFR_METADATA_TRANSACTION and just mark source and sinks if metadata is the same. (cherry picked from commit 24637d54dcbc06de8a7de17c75b9291fcfcfbc84) Change-Id: I69e55d3c842c7636e3538d1b57bc4deca67bed05 BUG: 1496317 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: auto-resolve split-brains for zero-byte filesRavishankar N2017-10-053-0/+79
| | | | | | | | | | | | | | | | | | | | | | | | | Problems: As described in BZ 1491670, renaming hardlinks can result in data/mdata split-brain of the DHT link-to files (T files) without any mismatch of data and metadata. As described in BZ 1486063, for a zero-byte file with only dirty bits set, arbiter brick will likely be chosen as the source brick. Fix: For zero byte files in split-brain, pick first brick as a) data source if file size is zero on all bricks. b) metadata source if metadata is the same on all bricks In arbiter case, if file size is zero on all bricks and there are no pending afr xattrs, pick 1st brick as data source. (cherry picked from commit 1719cffa911c5287715abfdb991bc8862f0c994e) Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04 BUG: 1496317 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Rahul Hinduja <rhinduja@redhat.com> Reported-by: Mabi <mabi@protonmail.ch>
* afr: heal metadata in discover code pathRavishankar N2017-09-112-17/+46
| | | | | | | | | | | | | | | | | | | | | | | Combined backport of https://review.gluster.org/17850 and https://review.gluster.org/18187 During graph switch, if fuse sends nameless (gfid) lookups, afr takes the discover code path to serve it. If there are pending metadata heals, they do not happen unless an inode refresh happens as a part of discover (which is not guaranteed to happen always). This patch fixes it by attempting metadata heal as a part of discover, just like how it is done in lookup code path. Change-Id: I87c493045b9225741cad173bf3f645848697032e BUG: 1488168 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/18202 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Karthik U S <ksubrahm@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: jiffin tony Thottan <jthottan@redhat.com>
* dht: add FOP check to dht_file_setattr_cbkRavishankar N2017-09-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: bug-797171.7 loaded error-gen xlator on the brick which sent EBADF for a non fd-based fop, namely setattr. This caused dht_check_and_open_fd_on_subvol_task() to crash as local->fd was NULL. Fix: Call dht_check_and_open_fd_on_subvol_task() from dht_file_setattr_cbk only for dht_fsetattr and not dht_setattr or dht_setattr2 > Reviewed-on: https://review.gluster.org/18208 > Smoke: Gluster Build System <jenkins@build.gluster.org> > Reviewed-by: Susant Palai <spalai@redhat.com> > Reviewed-by: Amar Tumballi <amarts@redhat.com> > Reviewed-by: Raghavendra G <rgowdapp@redhat.com> > Reviewed-by: N Balachandran <nbalacha@redhat.com> > CentOS-regression: Gluster Build System <jenkins@build.gluster.org> (cherry picked from commit 47188e9eac59de416a5c86c7ec7540ed6aaa1c98) Signed-off-by: Ravishankar N <ravishankar@redhat.com> Change-Id: Iab4999e213bf2065804f3f8237e470ad454e3c99 BUG: 1489260 Reviewed-on: https://review.gluster.org/18222 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Log files skipped by rebalanceN Balachandran2017-09-072-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | There was no easy way to find out which files were skipped during a rebalance. Rebalance now logs a message for every skipped file using msgid 109126, making it easier to find all files that were skipped. > BUG: 1480445 > Signed-off-by: N Balachandran <nbalacha@redhat.com> > Reviewed-on: https://review.gluster.org/18021 > Smoke: Gluster Build System <jenkins@build.gluster.org> > CentOS-regression: Gluster Build System <jenkins@build.gluster.org> > Reviewed-by: hari gowtham <hari.gowtham005@gmail.com> > Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Change-Id: I2cac7db7285e2f82354251f3ea4094827b0daf3e BUG: 1486557 (cherry picked from commit a4c43ba9374b8f75a48d38a032353a0c7d311a73) Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/18149 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/dht: Aggregate xattrs only for dirs in dht_discover_cbkN Balachandran2017-09-071-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | If dht_discover finds data files on more than one subvol, racing calls to dht_discover_cbk could end up calling dht_aggregate_xattr which could delete dictionary data that is being accessed by higher layer translators. Fixed to call dht_aggregate_xattr only for directories and consider only the first file to be found. > BUG: 1484709 > Signed-off-by: N Balachandran <nbalacha@redhat.com> > Reviewed-on: https://review.gluster.org/18137 > Smoke: Gluster Build System <jenkins@build.gluster.org> > CentOS-regression: Gluster Build System <jenkins@build.gluster.org> > Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Change-Id: I4f3d2a405ec735d4f1bb33a04b7255eb2d179f8a BUG: 1486538 (cherry picked from commit 9420022df0962b1fa4f3ea3774249be81bc945cc) Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/18146 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>