summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* ctime/rebalance: Heal ctime xattr on directory during rebalanceKotresh HR2019-09-163-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | After add-brick and rebalance, the ctime xattr is not present on rebalanced directories on new brick. This patch fixes the same. Note that ctime still doesn't support consistent time across distribute sub-volume. This patch also fixes the in-memory inconsistency of time attributes when metadata is self healed. Backport of: > Patch: https://review.gluster.org/23127/ > Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df > BUG: 1734026 > Signed-off-by: Kotresh HR <khiremat@redhat.com> (cherry picked from commit 304640e55c0f3c6d15f4e230dc6376e4f5020fea) Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df Signed-off-by: Kotresh HR <khiremat@redhat.com> fixes: bz#1752429
* afr/lookup: Pass xattr_req in while doing a selfheal in lookupMohammed Rafi KC2019-09-113-5/+16
| | | | | | | | | | | | | | | | | | We were not passing xattr_req when doing a name self heal as well as a meta data heal. Because of this, some xdata was missing which causes i/o errors Backport of > https://review.gluster.org/#/c/glusterfs/+/23024/ >Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f >Fixes: bz#1728770 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Fixes: bz#1749305 Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> (cherry picked from commit d026f0bcfd301712e4f0671ccf238f43f2e6dd30)
* [RFC] change get_real_filename implementation to use ENOATTR instead of ENOENTMichael Adam2019-09-031-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_real_filename is implemented as a virtual extended attribute to help Samba implement the case-insensitive but case preserving SMB protocol more efficiently. It is implemented as a getxattr call on the parent directory with the virtual key of "get_real_filename:<entryname>" by looking for a spelling with different case for the provided file/dir name (<entryname>) and returning this correct spelling as a result if the entry is found. Originally (05aaec645a6262d431486eb5ac7cd702646cfcfb), the implementation used the ENOENT errno to return the authoritative answer that <entryname> does not exist in any case folding. Now this implementation is actually a violation or misuse of the defined API for the getxattr call which returns ENOENT for the case that the dir that the call is made against does not exist and ENOATTR (or the synonym ENODATA) for the case that the xattr does not exist. This was not a problem until the gluster fuse-bridge was changed to do map ENOENT to ESTALE in 59629f1da9dca670d5dcc6425f7f89b3e96b46bf, after which we the getxattr call for get_real_filename returned an ESTALE instead of ENOENT breaking the expectation in Samba. It is an independent problem that ESTALE should not leak out to user space but is intended to trigger retries between fuse and gluster. But nevertheless, the semantics seem to be incorrect here and should be changed. This patch changes the implementation of the get_real_filename virtual xattr to correctly return ENOATTR instead of ENOENT if the file/directory being looked up is not found. The Samba glusterfs_fuse vfs module which takes advantage of the get_real_filename over a fuse mount will receive a corresponding change to map ENOATTR to ENOENT. Without this change, it will still work correctly, but the performance optimization for nonexisting files is lost. On the other hand side, this change removes the distinction between the old not-implemented case and the implemented case. So Samba changed to treat ENOATTR like ENOENT will not work correctly any more against old servers that don't implement get_real_filename. I.e. existing files will be reported as non-existing Change-Id: I971b427ab8410636d5d201157d9af70e0d075b67 fixes: bz#1745914 Signed-off-by: Michael Adam <obnox@samba.org> (cherry picked from commit dc1b87fcfef08c9497b0c02b2410c9d18bbc2dba)
* afr: wake up index healer threadsRavishankar N2019-08-305-11/+25
| | | | | | | | | | | | | | ...whenever shd is re-enabled after disabling or there is a change in `cluster.heal-timeout`, without needing to restart shd or waiting for the current `cluster.heal-timeout` seconds to expire. See BZ 1743988 for more details. Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe fixes: bz#1747301 Reported-by: Glen Kiessling <glenk1973@hotmail.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 600ba94183333c4af9b4a09616690994fd528478)
* cluster/ec: Update lock->good_mask on parent fop failurePranith Kumar K2019-08-222-0/+4
| | | | | | | | | | When discard/truncate performs write fop, it should do so after updating lock->good_mask to make sure readv happens on the correct mask fixes: bz#1739424 Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix reopen flags to avoid misbehaviorPranith Kumar K2019-08-222-3/+8
| | | | | | | | | | | | | | | | | | | | | | | Problem: when a file needs to be re-opened O_APPEND and O_EXCL flags are not filtered in EC. - O_APPEND should be filtered because EC doesn't send O_APPEND below EC for open to make sure writes happen on the individual fragments instead of at the end of the file. - O_EXCL should be filtered because shd could have created the file so even when file exists open should succeed - O_CREAT should be filtered because open happens with gfid as parameter. So open fop will create just the gfid which will lead to problems. Fix: Filter out these two flags in reopen. Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4 Fixes: bz#1739426 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Always read from good-maskPranith Kumar K2019-08-222-5/+25
| | | | | | | | | | There are cases where fop->mask may have fop->healing added and readv shouldn't be wound on fop->healing. To avoid this always wind readv to lock->good_mask updates: bz#1739424 Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: fix EIO error for concurrent writes on sparse filesXavi Hernandez2019-08-221-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1739427 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: inherit healing from lock when it has infoKinglong Mee2019-08-221-2/+3
| | | | | | | | | If lock has info, fop should inherit healing mask from it. Otherwise, fop cannot inherit right healing when changed_flags is zero. Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f updates: bz#1739424 Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* afr: restore timestamp of parent dir during entry-healRavishankar N2019-08-211-0/+2
| | | | | | | Fixes: bz#1741041 Change-Id: I29e338bac62104233a6f80212df8d0fb016affda Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 8e9c53ebf16705b9a1db2fc486dc24a5cb244ddd)
* cluster/dht: Fixed a memleak in dht_rename_cbkN Balachandran2019-08-091-11/+33
| | | | | | | | | Fixed a memleak in dht_rename_cbk when creating a linkto file. Change-Id: I705adef3cb79e33806520fc2b15558e90e2c211c fixes: bz#1739337 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/ta: Notify the clients only if there are pending healskarthik-us2019-07-244-22/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In case of thin arbiter, before index healer starts crawling the indices at every heal-timeout interval, even if there is nothing to be healed it will send an upcall notification to all the clients to release any AFR_TA_DOM_NOTIFY locks that they hold. SHD will wait for the upcall to return before proceeding with the heal even though there is nothing to be healed. This will also invalidates the cached information about the bricks states on the clients which leads to extra calls on TA from clients for the next reads & writes if needed. This will impact the IO performance. Fix: - Before sending the upcall to the clients, check for any pending heals on TA without taking any locks. - If there is nothing marked bad on TA, then continue with the index crawl to heal any dirty markings present on the files due to any post-op failure. - If there is a brick marked as bad on TA, then take the AFR_TA_DOM_NOTIFY lock on TA from SHD, get the state on TA and continue with the current healing process. Change-Id: Ieb477bc6cb18bbdfd4e7a0453c5ed79b574ec9d6 fixes: bz#1729483 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Fix incorrect reporting of gfid & type mismatchkarthik-us2019-07-202-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problems: 1. When checking for type and gfid mismatch, if the type or gfid is unknown because of missing gfid handle and the gfid xattr it will be reported as type or gfid mismatch and the heal will not complete. 2. If the source selected during entry heal has null gfid the same will be sent to afr_lookup_and_heal_gfid(). In this function when we try to assign the gfid on the bricks where it does not exist, we are considering the same gfid and try to assign that on those bricks. This will fail in posix_gfid_set() since the gfid sent is null. Fix: If the gfid sent to afr_lookup_and_heal_gfid() is null choose a valid gfid before proceeding to assign the gfid on the bricks where it is missing. In afr_selfheal_detect_gfid_and_type_mismatch(), do not report type/gfid mismatch if the type/gfid is unknown or not set. Change-Id: Ia06552e4dc4a9f89cb7f5302833604bd21bbf7da fixes: bz#1729481 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/ec: Prevent double pre-op xattropsPranith Kumar K2019-06-221-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Race: Thread-1 Thread-2 1) Does ec_get_size_version() to perform pre-op fxattrop as part of write-1 2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1 3) Completes executing ec_prepare_update_cbk leading to ctx->dirty[] = '1' 4) Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2' At the end of the writes the file will be marked for heal even when it doesn't need heal. Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked as '1' in step 2) above Updates bz#1593224 Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* ec-heal: check file's gfid when deleting stale nameKinglong Mee2019-06-201-1/+11
| | | | | | | | | | | | A name-less lookup does not contain parent's stat, It is hard to check the lookuped file is at the right path. This patch changes to a name lookup, and check file's gfid with expected gfid. If the gfid is different, mark it estale. fixes: bz#1702131 Change-Id: I2de20b10d680eed1e2fb1d3830b3b3dec4520dbf Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* afr/read: Implement latency based read child selectionMohammed Rafi KC2019-06-203-27/+98
| | | | | | | | | | | | | | | | | Network latency is an important factor selecting a read subvolume. So this patch is adding two new policy. 1) We measure the latency of a child during a GF_DUMP rpc call. Then use this latency to pick a read subvol having the least latency. 2) Second one is an hybrid mode where it calculates the effective latency by multiplying outstanding pending read request and latency, and choose the least one. Change-Id: Ia49c8a08ab61f7dcdad8b8950aa4d338e7accf97 fixes: #520 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/dht: Strip out dht xattrsN Balachandran2019-06-191-0/+2
| | | | | | | | | | Some internal DHT xattrs were not being removed when calling getxattr in pass-through mode. This has been fixed. Change-Id: If7e3dbc7b495db88a566bd560888e3e9c167defa fixes: bz#1721435 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* afr/fini: Free local_pool data during an afr finiMohammed Rafi KC2019-06-171-0/+6
| | | | | | | | | We should free the mem_pool local_pool during an afr_fini. Otherwise this will lead to mem leak for shd Change-Id: I805a34a88077bf7b886c28b403798bf9eeeb1c0b Updates: bz#1716695 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* clang-scan: resolve warningAmar Tumballi2019-06-151-4/+0
| | | | | | | | | | | | | | | dht-common.c: because there was a 'goto err' before assigning the 'local' variable, there is possibility of NULL dereference. As the check which was done wouldn't ever be true, removed the check. glusterd-geo-rep.c: a possible path where 'slave_host' could be NULL when it gets passed to strcmp() is found. strcmp() expects a valid string. Add a NULL check. Updates: bz#1622665 Change-Id: I64c280bc1beac9a2b109e8fa88f2a5ce8b823c3a Signed-off-by: Amar Tumballi <amarts@redhat.com>
* multiple files: another attempt to remove includesYaniv Kaul2019-06-1431-95/+7
| | | | | | | | | | | | | | | | | | There are many include statements that are not needed. A previous more ambitious attempt failed because of *BSD plafrom (see https://review.gluster.org/#/c/glusterfs/+/21929/ ) Now trying a more conservative reduction. It does not solve all circular deps that we have, but it does reduce some of them. There is just too much to handle reasonably (dht-common.h includes dht-lock.h which includes dht-common.h ...), but it does reduce the overall number of lines of include we need to look at in the future to understand and fix the mess later one. Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* Cluster/afr: Don't treat all bricks having metadata pending as split-brainkarthik-us2019-06-102-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: We currently don't have a roll-back/undoing of post-ops if quorum is not met. Though the FOP is still unwound with failure, the xattrs remain on the disk. Due to these partial post-ops and partial heals (healing only when 2 bricks are up), we can end up in metadata split-brain purely from the afr xattrs point of view i.e each brick is blamed by atleast one of the others for metadata. These scenarios are hit when there is frequent connect/disconnect of the client/shd to the bricks. Fix: Pick a source based on the xattr values. If 2 bricks blame one, the blamed one must be treated as sink. If there is no majority, all are sources. Once we pick a source, self-heal will then do the heal instead of erroring out due to split-brain. This patch also adds restriction of all the bricks to be up to perform metadata heal to avoid any metadata loss. Removed the test case tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t as it was doing metadata heal even when only 2 of 3 bricks were up. Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09 fixes: bz#1717819 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* ec/fini: Fix race between xlator cleanup and on going async fopMohammed Rafi KC2019-06-086-15/+56
| | | | | | | | | | | | | | | | Problem: While we process a cleanup, there is a chance for a race between async operations, for example ec_launch_replace_heal. So this can lead to invalid mem access. Solution: Just like we track on going heal fops, we can also track fops like ec_launch_replace_heal, so that we can decide when to send a PARENT_DOWN request. Change-Id: I055391c5c6c34d58aef7336847f3b570cb831298 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/dht: Fix directory perms during selfhealN Balachandran2019-06-051-3/+5
| | | | | | | | | Fixed a bug in the revalidate code path that wiped out directory permissions if no mds subvol was found. Change-Id: I8b4239ffee7001493c59d4032a2d3062586ea115 fixes: bz#1716830 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* Fix some "Null pointer dereference" coverity issuesXavi Hernandez2019-05-262-0/+13
| | | | | | | | | | | | | | | | | | | | | | This patch fixes the following CID's: * 1124829 * 1274075 * 1274083 * 1274128 * 1274135 * 1274141 * 1274143 * 1274197 * 1274205 * 1274210 * 1274211 * 1288801 * 1398629 Change-Id: Ia7c86cfab3245b20777ffa296e1a59748040f558 Updates: bz#789278 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: honor contention notifications for partially acquired locksXavi Hernandez2019-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | EC was ignoring lock contention notifications received while a lock was being acquired. When a lock is partially acquired (some bricks have granted the lock but some others not yet) we can receive notifications from acquired bricks, which should be honored, since we may not receive more notifications after that. Since EC was ignoring them, once the lock was acquired, it was not released until the eager-lock timeout, causing unnecessary delays on other clients. This fix takes into consideration the notifications received before having completed the full lock acquisition. After that, the lock will be releaed as soon as possible. Fixes: bz#1708156 Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/dht: Lookup all files when processing directoryN Balachandran2019-05-231-6/+6
| | | | | | | | | | | | | | | | | | | | | A rebalance process currently only looks up files that it is supposed to migrate. This could cause issues when lookup-optimize is enabled as the dir layout can be updated with the commit hash before all files are looked up. This is expecially problematic of one of the rebalance processes fails to complete as clients will try to access files whose linkto files might not have been created. Each process will now lookup every file in the directory it is processing. Pros: Less likely that files will be inaccessible. Cons: More lookup requests sent to the bricks and a potential performance hit. Note: this does not handle races such as when a layout is updated on disk just as the create fop is sent by the client. Change-Id: I22b55846effc08d3b827c3af9335229335f67fb8 fixes: bz#1711764 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* ec/fini: Fix race with ec_fini and ec_notifyMohammed Rafi KC2019-05-213-0/+13
| | | | | | | | | | | | | | | During a graph cleanup, we first sent a PARENT_DOWN and wait for a child down to ultimately free the xlator and the graph. In the ec xlator, we cleanup the threads when we get a PARENT_DOWN event. But a racing event like CHILD_UP or event xl_op may trigger healing threads after threads cleanup. So there is a chance that the threads might access a freed private variabe Change-Id: I252d10181bb67b95900c903d479de707a8489532 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* afr/frame: Destroy frame after afr_selfheal_entry_granularMohammed Rafi KC2019-05-211-3/+8
| | | | | | | | | | | | In function "afr_selfheal_entry_granular", after completing the heal we are not destroying the frame. This will lead to crash. when we execute statedump operation, where it tried to access xlator object. If this xlator object is freed as part of the graph destroy this will lead to an invalid memory access Change-Id: I0a5e78e704ef257c3ac0087eab2c310e78fbe36d fixes: bz#1708926 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* ec/shd: Cleanup self heal daemon resources during ec finiMohammed Rafi KC2019-05-135-13/+122
| | | | | | | | | | We were not properly cleaning self-heal daemon resources during ec fini. With shd multiplexing, it is absolutely necessary to cleanup all the resources during ec fini. Change-Id: Iae4f1bce7d8c2e1da51ac568700a51088f3cc7f2 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* afr: thin-arbiter lock release fixesRavishankar N2019-05-103-47/+93
| | | | | | | | | | | | | | | | | | | | - pass fop state instead of afr local to afr_ta_dom_lock_check_and_release() - avoid afr_lock_release_synctask() being called simultaneosuly from notify code path and transaction (post-op) code path due to races. - Check if the post-op on TA is valid based on event_gen checks. - Invalidate in-memory information when we get TA child down. Note: Thi patch addresses some pending review comments of commit 053b1309dc8fbc05fcde5223e734da9f694cf5cc (https://review.gluster.org/#/c/glusterfs/+/20095/) fixes: bz#1698449 Change-Id: I2ccd7e1b53362f9f3fed8680aecb23b5011eb18c Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: log before attempting data self-heal.Ravishankar N2019-05-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I was working on a blog about troubleshooting AFR issues and I wanted to copy the messages logged by self-heal for my blog. I then realized that AFR-v2 is not logging *before* attempting data heal while it logs it for metadata and entry heals. I [MSGID: 108026] [afr-self-heal-entry.c:883:afr_selfheal_entry_do] 0-testvol-replicate-0: performing entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb. sources=[0] 2 sinks=1 I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1 I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-testvol-replicate-0: performing metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5 I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1 Adding it in this patch. Now there is a 'performing' and a corresponding 'Completed' message for every type of heal. fixes: bz#1707746 Change-Id: I0b954cf1e17b48280aefa76640b5119b92133d61 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* dht: Custom xattrs are not healed in case of add-brickroot2019-05-081-8/+1
| | | | | | | | | | | | | | | | Problem: If any custom xattrs are set on the directory before add a brick, xattrs are not healed on the directory after adding a brick. Solution: xattr are not healed because dht_selfheal_dir_mkdir_lookup_cbk checks the value of MDS and if MDS value is not negative selfheal code path does not take reference of MDS xattrs.Change the condition to take reference of MDS xattr so that custom xattrs are populated on newly added brick Updates: bz#1702299 Change-Id: Id14beedb98cce6928055f294e1594b22132e811c Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
* afr : fix Coverity CID 1398627Rinku Kothiya2019-05-071-2/+9
| | | | | | | | | | | Fixed coverity error, "Unchecked return value (CHECKED_RETURN)". Checking return value & logging error message if afr_set_pending_dict fails. updates: bz#789278 Change-Id: Iab7da6b4f3cd0622b95b8e1c412b007a330467e5 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* cluster/ec: fix shd healer wait timeoutKinglong Mee2019-05-062-1/+2
| | | | | | | | | Right now, the timeout is written by hard code, fix it by using heal-timeout. fixes: bz#1703020 Change-Id: I0d154e7807f9dba7efc3896805559bbfaa7af2ad Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* cluster/ec: Reopen shouldn't happen with O_TRUNCPranith Kumar K2019-05-051-1/+1
| | | | | | | | | | | | | Problem: Doing re-open with O_TRUNC will truncate the fragment even when it is not needed needing extra heals Fix: At the time of re-open don't use O_TRUNC. fixes bz#1706603 Change-Id: Idc6408968efaad897b95a5a52481c66e843d3fb8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/dht: Refactor dht lookup functionsN Balachandran2019-04-251-74/+30
| | | | | | | | | | Part 2: Modify dht_revalidate_cbk to call dht_selfheal_directory instead of separate calls to heal attrs and xattrs. Change-Id: Id41ac6c4220c2c35484812bbfc6157fc3c86b142 updates: bz#1590385 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/ec: fix fd reopenXavi Hernandez2019-04-2314-274/+328
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently EC tries to reopen fd's that have been opened while a brick was down. This is done as part of regular write operations, just after having acquired the locks, and it's sent as a sub-fop of the main write fop. There were two problems: 1. The reopen was attempted on all UP bricks, even if a previous lock didn't succeed. This is incorrect because most probably the open will fail. 2. If reopen is sent and fails, the error is propagated to the main operation, causing it to fail when it shouldn't. To fix this, we only attempt reopens on bricks where the current fop owns a lock, and we prevent any error to be propagated to the main fop. To implement this behaviour an argument used to indicate the minimum number of required answers has overloaded to also include some flags. To make the change consistent, it has been necessary to rename the argument, which means that a lot of files have been changed. However there are no functional changes. This change has also uncovered a problem in discard code, which didn't correctely process requests of small sizes because no real discard fop was being processed, only a write of 0's on some region. In this case some fields of the fop remained uninitialized or with incorrect values. To fix this, a new function has been created to simulate success on a fop and it's used in the discard case. Thanks to Pranith for providing a test script that has also detected an issue in this patch. This patch includes a small modification of this script to force data to be written into bricks before stopping them. Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec Fixes: bz#1699866 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/afr: Set lk-owner before inodelk/entrylk/lkPranith Kumar K2019-04-222-19/+23
| | | | | | Updates: bz#1624701 Change-Id: I7152c28ad85925abccdcc4cd6de8cb2a2b847a51 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove local from owners_list on failure of lock-acquisitionPranith Kumar K2019-04-154-18/+14
| | | | | | | | | | | | | When eager-lock lock acquisition fails because of say network failures, the local is not being removed from owners_list, this leads to accumulation of waiting frames and the application will hang because the waiting frames are under the assumption that another transaction is in the process of acquiring lock because owner-list is not empty. Handled this case as well in this patch. Added asserts to make it easier to find these problems in future. fixes bz#1696599 Change-Id: I3101393265e9827755725b1f2d94a93d8709e923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* Replace memdup() with gf_memdup()Vijay Bellur2019-04-121-1/+1
| | | | | | | | | memdup() and gf_memdup() have the same implementation. Removed one API as the presence of both can be confusing. Change-Id: I562130c668457e13e4288e592792872d2e49887e updates: bz#1193929 Signed-off-by: Vijay Bellur <vbellur@redhat.com>
* ec: fix truncate lock to cover the write in tuncate cleanKinglong Mee2019-04-121-2/+6
| | | | | | | | | | ec_truncate_clean does writing under the lock granted for truncate, but the lock is calculated by ec_adjust_offset_up, so that, the write in ec_truncate_clean is out of lock. Updates: bz#1699189 Change-Id: Idbe1fd48d26afe49c36b77db9f12e0907f5a4134 Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* cluster/afr: Thin-arbiter SHD fixeskarthik-us2019-04-122-13/+13
| | | | | | | | | This patch address post-merge review comments for commit 5784a00f997212d34bd52b2303e20c097240d91c Change-Id: I7ed954664a2ae8e1091d23ee3ceb9c66e83bfeac fixes: bz#1697930 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/dht: refactor dht lookup functionsN Balachandran2019-04-052-124/+119
| | | | | | | | | | Part 1: refactor the dht_lookup_dir_cbk and dht_selfheal_directory functions. Added a simple dht selfheal directory test Change-Id: I1410c26359e3c14b396adbe751937a52bd2fcff9 updates: bz#1590385 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/afr: Invalidate inode on change of split-brain-choicePranith Kumar K2019-04-051-4/+1
| | | | | | | | | | | When split-brain choice is changed from one brick to another brick, inode-invalidate is not called so readv call is served from cache leading to failures in split-brain-resolution.t. Fixed it by calling inode_invaldate() when this happens. updates bz#1193929 Change-Id: I2624614eec38c0303f3e1dc55dfae3d4b864218b Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix handling of heal info cases without locksAshish Pandey2019-04-041-25/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we use heal info command, it takes lot of time as in some cases it takes lock on entries to find out if the entry actually needs heal or not. There are some cases where we can avoid these locks and can conclude if the entry needs heal or not. 1 - We do a lookup (without lock) on an entry, which we found in .glusterfs/indices/xattrop, and find that lock count is zero. Now if the file contains dirty bit set on all or any brick, we can say that this entry needs heal. 2 - If the lock count is one and dirty is greater than 1, then it also means that some fop had left the dirty bit set which made the dirty count of current fop (which has taken lock) more than one. At this point also we can definitely say that this entry needs heal. This patch is modifying code to take into consideration above two points. It is also changing code to not to call ec_heal_inspect if ec_heal_do was called from client side heal. Client side heal triggeres heal only when it is sure that it requires heal. [We have changed the code to not to call heal for lookup] updates bz#1689799 Change-Id: I7f09f0ecd12f65a353297aefd57026fd2bebdf9c Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr: Send inodelk/entrylk with non-zero lk-ownerPranith Kumar K2019-04-022-12/+48
| | | | | | | | | | | | | | Found missing assignment of lk-owner for an inodelk/entrylk before winding the fops. locks xlator at the moment allows this operation. This leads to multiple threads in the same client being able to get locks on the inode because lk-owner is same and transport is same. So isolation with locks can't be achieved. To fix it, we need locks xlator change which will disallow null-lk-owner based inodelk/entrylk/lk. To achieve that we need to first fix all the places which do this mistake. updates bz#1624701 Change-Id: Ic3431da3f451a1414f1f4fdcfc4cf41e555f69dd Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: thin-arbiter read txn fixesRavishankar N2019-03-293-22/+37
| | | | | | | | | | | | | - Fixes afr_ta_read_txn() to handle inode refresh failures. code-path. - Fixes a double free issue of dict. Note: This patch address post-merge review comments for commit 69532c141be160b3fea03c1579ae4ac13018dcdf fixes: bz#1686398 Change-Id: Id5299b45b68569d47df6b73755918237a1592cb4 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/ec: Don't enqueue an entry if it is already healingAshish Pandey2019-03-275-30/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: 1 - heal-wait-qlength is by default 128. If shd is disabled and we need to heal files, client side heal is needed. If we access these files that will trigger the heal. However, it has been observed that a file will be enqueued multiple times in the heal wait queue, which in turn causes queue to be filled and prevent other files to be enqueued. 2 - While a file is going through healing and a write fop from mount comes on that file, it sends write on all the bricks including healing one. At the end it updates version and size on all the bricks. However, it does not unset dirty flag on all the bricks, even if this write fop was successful on all the bricks. After healing completion this dirty flag remain set and never gets cleaned up if SHD is disabled. Solution: 1 - If an entry is already in queue or going through heal process, don't enqueue next client side request to heal the same file. 2 - Unset dirty on all the bricks at the end if fop has succeeded on all the bricks even if some of the bricks are going through heal. Change-Id: Ia61ffe230c6502ce6cb934425d55e2f40dd1a727 updates: bz#1593224 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* afr: add client-pid to all gf_event() callsRavishankar N2019-03-276-15/+28
| | | | | | | | | client-pid for glustershd is GF_CLIENT_PID_SELF_HEALD client-pid for glfsheal is GF_CLIENT_PID_GLFS_HEALD updates: bz#1689250 Change-Id: Ib3a863af160ff48c822a5e6b0c27c575c9887470 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Remove un-used variables related to pumpPranith Kumar K2019-03-261-3/+0
| | | | | | updates bz#1193929 Change-Id: I01b60d644f517c00a1bcc127bf9a8ed90b6eb7a0 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>