summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: Do not log of split-brain when there isn't oneKrutika Dhananjay2017-01-113-24/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | * Even on errors like ENOENT, AFR logs split-brain after read-txn refresh, introduced by commit a07ddd8f. This can be a cause of much panic and confusion and needs to be fixed. * Also fixed this issue in write-txns. * Fixed afr read txns to log about split-brain only after knowing that there is no split-brain choice configured. * Removed code duplication * Fixed incorrect passing of error code in afr_write_txn_refresh_done() (the function was passing -0 as errno to gf_msg(). Change-Id: I354f454ce5bf0e5f00bc27916eb597367cb7d927 BUG: 1411625 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16362 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* feature/dht: undo partially successful dir renameCsaba Henk2017-01-114-7/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As with dht, dirs are present on all subvolumes, renaming them is a compound operation and thus a partial success + partial failure scenario is possible, resulting in an inconsistent state. For purposes of reproduction, such a scenario can easily be produced by stopping the volume, edit the volfile of a certain subvolume to get at an "option read-only on" setting, and then restart the volume. Thus those operations that are to make change on the affected subvolume will fail with EROFS. To handle such scenarios, we introduce an in-memory cache where we record the return values obtained from the subvolumes. At the final stage of the dir rename operation we check if it's a partial success/fail situation. If yes, then we perform a reverse rename op on those subvolumes where the operation succeeded. Change-Id: I3d05f74f53932cb984a918d252a7309c1009a51d BUG: 1412069 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/15739 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com>
* dht: At places needed use STACK_WIND_COOKIEPoornima G2017-01-0910-654/+664
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Issue: frame has a void * cookie pointer. In case of STACK_WIND_COOKIE frame->cookie is assigned to what is sent by the caller. In case of STACK_WIND frame->cookie is assigned to point point to the frame itself. For ease of coding, at many places, the cookie in the cbk is used to get the pointer to the next xl. This is inconsistent when STACK_WIND_TAIL comes into picture. Eg: dht_setxattr () { for (i = 0 ; i < conf->subvolume_cnt ; i++) { STACK_WIND (..dht_checking_pathinfo_cbk, conf->subvolumes[i] ..); } dht_checking_pathinfo_cbk (...void *cookie...) { prev = cookie; ... for (i = 0; i < conf->subvolume_cnt; i++) { if (conf->subvolumes[i] == prev->this) { ... } } } Consider the below graph: dht (parent) readdir-ahead => Doesn't define setxattr and uses STACK_WIND_TAIL protocol-client With this graph, when dht_checking_pathinfo_cbk is called, cookie will have frame pointing to protocol-client. i.e. prev->this will be protocol-client. But dht was expecting it to be readdir-ahead as it has stored in conf->subvolumes[i] Solution: Hence, as a thumb rule, if cbk is using cookie, then we explicitly call STACK_WIND_COOKIE. Change-Id: I83aea1e24c809c5a91a0db7283e908e125471bd4 BUG: 1401812 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/16039 Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Fixing log messageSunil Kumar H G2017-01-081-5/+10
| | | | | | | | | | | | | | | Updating the warning message with details to improve user understanding. BUG: 1409202 Change-Id: I001f8d5c01c97fff1e4e1a3a84b62e17c025c520 Signed-off-by: Sunil Kumar H G <sheggodu@redhat.com> Reviewed-on: http://review.gluster.org/16315 Tested-by: Sunil Kumar Acharya Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/dht: Incorrect migration checks in fsyncN Balachandran2017-01-081-5/+6
| | | | | | | | | | | | | | | | | Fixed the order of the migration phase checks in dht_fsync_cbk. Phase1 should never be hit if op_ret is non zero. Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I9222692e04868bffa93498059440f0aa553c83ec BUG: 1410777 Reviewed-on: http://review.gluster.org/16350 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* dht/rebalance: remove errno check for failure detectionSusant Palai2017-01-061-16/+13
| | | | | | | | | | | BUG: 1410355 Change-Id: I867419ca36a81ef7209e6911a46c1c2c898b8eab Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/16328 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/ec: Do lookup on an existing file in linkPranith Kumar K2017-01-053-17/+18
| | | | | | | | | | | | | | | | | | | Problem: In link fop lookup is happening on the new fop which doesn't exist so the iatt ec serves parent xlators has size as zero which leads to 'cat' giving empty output Fix: Change code so that lookup happens on the existing link instead. BUG: 1409730 Change-Id: I70eb02fe0633e61d1d110575589cc2dbe5235d76 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16320 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* ec: Invalidations in disperse volume should not update the statPoornima G2017-01-052-2/+14
| | | | | | | | | | | | | | | | | | | | | | Issue: In disperse volume, the file is present across bricks, hence the stat from one brick doesn't carry the valid size of the file. Therefore the upcall from one brick updating the md-cache results in wrong size being updated. Fix: If the notification is cache invalidation then, indicate md-cache that the attributes is invalid. BUG: 1410375 Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/16329 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/dht: Do rename cleanup as rootPranith Kumar K2017-01-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | Problem: Rename linkfile cleanup is done as non-root which may not have priviliges to do the rename so it fails with EACCESS. MKDIR on that name in future will start to hole on this subvolume. It is not easy to hit on fuse mounts because vfs takes care of the permission checks even before rename fop is wound. But with nfs-ganesha mounts it happens. Fix: Do rename cleanup as root BUG: 1409727 Change-Id: I414c1eb6dce76b4516a6c940557b249e6c3f22f4 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16317 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: N Balachandran <nbalacha@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/dht: Fix dict_leak in migration check tasksN Balachandran2017-01-021-0/+7
| | | | | | | | | | | | | | | | Fixed a memleak where dict was not being unrefed in the dht_migration_complete_check_task and dht_rebalance_inprogress_task functions. Change-Id: I3d42e9a2e5c8596c985bf6431a68fd3905227383 BUG: 1409186 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/16308 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: MOHIT AGRAWAL <moagrawa@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* cluster/afr: Fix missing name indices due to EEXIST errorKrutika Dhananjay2016-12-271-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PROBLEM: Consider a volume with granular-entry-heal and sharding enabled. When a replica is down and a shard is created as part of a write, the name index is correctly created under indices/entry-changes/<dot-shard-gfid>. Now when a read on the same region triggers another MKNOD, the fop fails on the online bricks with EEXIST. By virtue of this being a symmetric error, the failed_subvols[] array is reset to all zeroes. Because of this, before post-op, the GF_XATTROP_ENTRY_OUT_KEY will be set, causing the name index, which was created in the previous MKNOD operation, to be wrongly deleted in THIS MKNOD operation. FIX: The ideal fix would have been for a transaction to delete the name index ONLY if it knows it is the one that created the index in the first place. This would involve gathering information as to whether THIS xattrop created the index from individual bricks, aggregating their responses and based on the various posisble combinations of responses, decide whether to delete the index or not. This is rather complex. Simpler fix would be for post-op to examine local->op_ret in the event of no failed_subvols to figure out whether to delete the name index or not. This can occasionally lead to creation of stale name indices but they won't be affecting the IO path or mess with pending changelogs in any way and self-heal in its crawl of "entry-changes" directory would take care to delete such indices. Change-Id: Ic1b5257f4dc9c20cb740a866b9598cf785a1affa BUG: 1408712 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16286 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: use accused matrix instead of readable matrix for deciding healsRavishankar N2016-12-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Problem: afr_replies_interpret() used the 'readable' matrix to trigger client side heals after inode refresh. But for arbiter, readable is always zero. So when `dd` is run with a data brick down, spurious data heals are are triggered. These heals open an fd, causing eager lock to be disabled (open fd count >1) in afr transactions, leading to extra FXATTROPS Fix: Use the accused matrix (derived from interpreting the afr pending xattrs) to decide whether we can start heal or not. Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9 BUG: 1408395 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16277 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* dht/rebalance: reverify lookup failuresSusant Palai2016-12-222-58/+130
| | | | | | | | | | | | | | | | | | race: readdirp has read one entry, and doing a lookup on that entry, but user might have renamed/removed that entry just after readdirp but before lookup. Since remove-brick is a costly opertaion,will ingore any ENOENT/ESTALE failures and move on. Change-Id: I62c7fa93c0b9b7e764065ad1574b97acf51b5996 BUG: 1408115 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/15846 Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* afr: Ignore event_generation checks post inode refresh for write txnsRavishankar N2016-12-223-1/+4
| | | | | | | | | | | | | | | | | | | Before http://review.gluster.org/#/c/15673/, after inode refresh, we failed read txns in case of EIO or event_generation being zero. For write transactions, the check was only for EIO. 15673 re-factored the code to fail both read and write when event_generation=0. This seems to have caused a regression as explained in the BZ. This patch restores that behaviour in afr_txn_refresh_done(). Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744 BUG: 1406224 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16205 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/ec: Fix lk-owner set race in ec_unlockPranith Kumar K2016-12-136-34/+42
| | | | | | | | | | | | | | | | | | | | | | Problem: Rename does two locks. There is a case where when it tries to unlock it sends xattrop of the directory with new version, callback of these two xattrops can be picked up by two separate epoll threads. Both of them will try to set the lk-owner for unlock in parallel on the same frame so one of these unlocks will fail because the lk-owner doesn't match. Fix: Specify the lk-owner which will be set on inodelk frame which will not be over written by any other thread/operation. BUG: 1402710 Change-Id: I666ffc931440dc5253d72df666efe0ef1d73f99a Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16074 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Fix per-txn optimistic changelog initialisationKrutika Dhananjay2016-12-122-29/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incorrect initialisation of local->optimistic_change_log was leading to skipped pre-op and post-op even when a brick didn't participate in the txn because it was down. The result - missing granular name index resulting in some entries never getting healed. FIX: Initialise local->optimistic_change_log just before pre-op. Also fixed granular entry heal to create the granular name index in pre-op as opposed to post-op. This is to prevent loss of granular information when during an entry txn, the good (src) brick goes offline before the post-op is done. This would cause self-heal to do conservative merge (since dirty xattr is the only information available), which when granular-entry-heal is enabled, expects granular indices, the lack of which can lead to loss of data in the worst case. Change-Id: Ia3ad716d6fb1821555f02180e86e8711a79f958d BUG: 1402730 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16075 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/dht: Fix memory corruption while accessing regex stored inRaghavendra G2016-12-083-45/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | private If reconfigure is executed parallely (or concurrently with dht_init), there are races that can corrupt memory. One such race is modification of regexes stored in conf (conf->rsync_regex_valid and conf->extra_regex_valid) through dht_init_regex. With change [1], reconfigure codepath can get executed parallely (with itself or with dht_init) and this fix is needed. Also, a reconfigure can race with any thread doing dht_layout_search, resulting in dht_layout_search accessing regex freed up by reconfigure (like in bz 1399134). [1] http://review.gluster.org/15046 Change-Id: I039422a65374cf0ccbe0073441f0e8c442ebf830 BUG: 1399134 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/15945 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* cluster/afr: Remove backward compatibility for locks with v1Pranith Kumar K2016-12-072-69/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we have cascading locks with same lk-owner there is a possibility for a deadlock to happen. One example is as follows: self-heal takes a lock in data-domain for big name with 256 chars of "aaaa...a" and starts heal in a 3-way replication when brick-0 is offline and healing from brick-1 to brick-2 is in progress. So this lock is active on brick-1 and brick-2. Now brick-0 comes online and an operation wants to take full lock and the lock is granted at brick-0 and it is waiting for lock on brick-1. As part of entry healing it takes full locks on all the available bricks and then proceeds with healing the entry. Now this lock will start waiting on brick-0 because some other operation already has a granted lock on it. This leads to a deadlock. Operation is waiting for unlock on "aaaa..." by heal where as heal is waiting for the operation to unlock on brick-0. Initially I thought this is happening because healing is trying to take a lock on all the available bricks instead of just the bricks that are participating in heal. But later realized that same kind of deadlock can happen if a brick goes down after the heal starts but comes back before it completes. So the essential problem is the cascading locks with same lk-owner which were added for backward compatibility with afr-v1 which can be safely removed now that versions with afr-v1 are already EOL. This patch removes the compatibility with v1 which requires cascading locks with same lk-owner. In the next version we can make locking-scheme option a dummy and switch completely to v2. BUG: 1401404 Change-Id: Ic9afab8260f5ff4dff5329eb0429811bcb879079 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16024 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Serialize conflicting locks on all subvolsPranith Kumar K2016-12-062-33/+53
| | | | | | | | | | | | | | | | | | | | | | | Problem: 1) When a blocking lock is issued and the parallel lock phase fails on all subvolumes with EAGAIN, it is not switching to serialized locking phase. 2) When quorum is enabled and locks fail partially it is better to give errno returned by brick rather than the default quorum errno. Fix: Handled this error case and changed op_errno to reflect the actual errno in case of quorum error. BUG: 1369077 Change-Id: Ifac2e4a13686e9fde601873012700966d56a7f31 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15984 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* afr: fix bug in passing child index in afr_inode_write_fillRavishankar N2016-12-061-4/+3
| | | | | | | | | | | Change-Id: I7b70de317a5f15a3bf483ffe40b971143deddc11 BUG: 1401218 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16029 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr, client: More mem-leak fixes in COMPOUND fop cbkKrutika Dhananjay2016-12-043-20/+22
| | | | | | | | | | | | | | | | | | | | | | Bugs found and fixed: 1. Use correct subvolume index in pre-op-writev compound cbk 2. Prevent use-after-free of local->compound_args members in compound fops cbk in protocol/client 3. Fix xdata and xattr leaks in client_process_response 4. Fix possible leak of xdata in client_pre_writev() in test mode. 5. Free req->compound_req_array.compound_req_array_val as well after freeing its members 6. Free tmp_rsp->flock.lk_owner.lk_owner_val in LK fop. Change-Id: I15b646d7d4e0e5cd4ea3d2d6452c815cf2eaf68f BUG: 1401218 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16020 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Check xdata to avoid memory leakAshish Pandey2016-12-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | Problem: ec_writev_start calls ec_make_internal_fop_xdata to set "yes" in xdata before ec_readv (an internal fop) is called for head and tail. Second call to this function is overwriting the previous allocated dict_t to "xdata", which results in memory leak. Solution: In ec_make_internal_fop_xdata, check if *xdata is NULL or not to avoid overwriting *xdata. Change-Id: I49b83923e11aff9b92d002e86424c0c2e1f5f74f BUG: 1400818 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/16007 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* dht/md-cache: Filter invalidate if the file is made a linkto filePoornima G2016-12-021-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | Upcall as a part of setattr, sends an invalidation and the invalidation carries the resulting stat value. When a file is converted to linkto files, even then an invalidation is set and as a result the mountpoint shows the sticky bit in the stat of the file. eg: ---------T. 945 root root 0 Nov 8 10:14 hardlink.999 Fix: When dht recieves a notification of sticky bit change, it updates the flag, to indicate md-cache to send the subsequent lookup. Change-Id: Ic2fd7a5b196db0754f9b97072e644e6bf69da606 BUG: 1392713 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15789 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Niels de Vos <ndevos@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-by: Rajesh Joseph <rjoseph@redhat.com>
* selfheal: fix memory leak on client side healing queueMateusz Slupny2016-12-023-3/+8
| | | | | | | | | | | | | Change-Id: I2beaba829710565a3246f7449a5cd21755cf5f7d BUG: 1399592 Signed-off-by: Mateusz Slupny <mateusz.slupny@appeartv.com> Reviewed-on: http://review.gluster.org/15968 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org>
* dht/rename : Incase of failure remove linkto file properlyJiffin Tony Thottan2016-12-012-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generally linkto file is created using root user. Consider following case, a user is trying to rename a file which he is not permitted. So the rename fails with EACESS and when rename tries to cleanup the linkto file, it fails. The above issue happens when rename/00.t test executed on nfs-ganesha clients : Steps executed in script * create a file "abc" using root * rename the file "abc" to "xyz" using a non root user, it fails with EACESS * delete "abc" * create directory "abc" using root * again try ot rename "abc" to "xyz" using non root user, test hungs here which slowly leds to OOM kill of ganesha process RCA put forwarded by Du for OOM kill of ganesha Note that when we hit this bug, we've a scenario of a dentry being present as: * a linkto file on one subvol * a directory on rest of subvols When a lookup happens on the dentry in such a scenario, the control flow goes into an infinite loop of: dht_lookup_everywhere dht_lookup_everywhere_cbk dht_lookup_unlink_cbk dht_lookup_everywhere_done dht_lookup_directory (as local->dir_count > 0) dht_lookup_dir_cbk (sets to local->need_selfheal = 1 as the entry is a linkto file on one of the subvol) dht_lookup_everywhere (as need_selfheal = 1). This infinite loop can cause increased consumption of memory due to: 1) dht_lookup_directory assigns a new layout to local->layout unconditionally 2) Most of the functions in this loop do a stack_wind of various fops. This results in growing of call stack (note that call-stack is destroyed only after lookup response is received by fuse - which never happens in this case) Thanks Du for root causing the oom kill and Sushant for suggesting the fix Change-Id: I1e16bc14aa685542afbd21188426ecb61fd2689d BUG: 1397052 Signed-off-by: Jiffin Tony Thottan <jthottan@redhat.com> Reviewed-on: http://review.gluster.org/15894 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* all: remove dead translatorsJeff Darcy2016-11-2915-9875/+2
| | | | | | | | | | | | | | | | | | | | | The following have been completely removed from the source tree, makefiles, configure script, and RPM specfile. cluster/afr/pump cluster/ha cluster/map features/filter features/mac-compat features/path-convertor features/protect Change-Id: I2f966999ac3c180296ff90c1799548fba504f88f Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/15906 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: A hard link is lost during rebalance + lookupMohit Agrawal2016-11-291-40/+62
| | | | | | | | | | | | | | | | | | | | | | Problem: A hard link is lost during rebalance + lookup.Rebalance skip files if file has hardlink.In dht_migrate_file __is_file_migratable () function checks if a file has hardlink, if yes file is not migrated but if link is created after call this function then link will lost. Solution: Call __check_file_has_hardlink to check hardlink existence after (S+T) bits in migration process ,if file has hardlink then skip the file for migrate rebalance process. BUG: 1396048 Change-Id: Ia53c07ef42f1128c2eedf959a757e8df517b9d12 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Reviewed-on: http://review.gluster.org/15866 Reviewed-by: Susant Palai <spalai@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* afr: fix auto-quorumJeff Darcy2016-11-283-49/+31
| | | | | | | | | | | | | | | | | | | | | | | | | (1) afr_have_quorum is dead code. It was copied to afr_has_quorum, and everything else uses that, but the original was never deleted (until now). (2) Auto-quorum should be default for any N>2. Leaving quorum disabled is BAD, but apparently deemed acceptable for N=2 because there's no real quorum in that case. For any larger number (including arbiter configurations) there is such a thing as real quorum and we should use it by default. Note that for N=3 the answers we get from "N % 2" (the old check) and "N > 2" (the new one) are the same. (3) The special case for even N in afr_has_quorum has been simplified and explained more thoroughly in a comment. Change-Id: I48b33c15093512fecf516b26dcf09afecb7ae33b Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/15873 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Fix the EIO that can occur in afr_inode_refresh as a resultPoornima G2016-11-284-18/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of cache invalidation(upcall). Issue: ------ When a cache invalidation is recieved as a result of changing pending xattr, the read_subvol is reset. Consider the below chain of execution: CHILD_DOWN ... afr_readv ... afr_inode_refresh ... afr_inode_read_subvol_reset <- as a result of pending xattr set by some other client GF_EVENT_UPCALL will be sent afr_refresh_done -> this results in an EIO, as the read subvol was reset by the end of the afr_inode_refresh Solution: --------- When GF_EVENT_UPCALL is recieved, instead of resetting read_subvol, set a variable need_refresh in inode_ctx, the next time some one starts a txn, along with event gen, need_rrefresh also needs to be checked. Change-Id: Ifda21a7a8039b8874215e1afa4bdf20f7d991b58 BUG: 1396952 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15892 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Healing should not start if only "data" bricks are UPAshish Pandey2016-11-281-11/+17
| | | | | | | | | | | | | | | | | | | | | | Problem: In a disperse volume with "K+R" configuration, where "K" is the number of data bricks and "R" is the number of redundancy bricks (Total number of bricks, N = K+R), if only K bricks are UP, we should NOT start heal process. This is because the bricks, which are supposed to be healed, are not UP. This will unnecessary eat up the resources. Solution: Check for the number of xl_up_count and only if it is greater than ec->fragments (number of data bricks), start heal process. Change-Id: I8579f39cfb47b65ff0f76e623b048bd67b15473b BUG: 1399072 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/15937 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* afr: allow I/O when favorite-child-policy is enabledRavishankar N2016-11-278-61/+387
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently, I/O on a split-brained file fails even when the favorite-child-policy is set until the self-heal is complete. Fix: If a valid 'source' is found using the set favorite-child-policy, inspect and reset the afr pending xattrs on the 'sinks' (inside appropriate locks), refresh the inode and then proceed with the read or write transaction. The resetting itself happens in the self-heal code and hence can also happen in the client side background-heal or by the shd's index-heal in addition to the txn code path explained above. When it happens in via heal, we also add checks in undo-pending to not reset the sink xattrs again. Change-Id: Ic8c1317720cb26bd114b6fe6af4e58c73b864626 BUG: 1386188 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Simon Turcotte-Langevin <simon.turcotte-langevin@ubisoft.com> Reviewed-on: http://review.gluster.org/15673 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Fix bugs in [f]inodelk/[f]entrylkPranith Kumar K2016-11-265-336/+381
| | | | | | | | | | | | | | | | | | | | | | Problems: 1) Inodelk is not taking quorum into account 2) finodelk, [f]entrylk are not implemented correctly 3) By default afr doesn't go for non-blocking parallel locks. Fix: Implemented a common framework which can be used by [f]inodelk/[f]entrylk. Used quorum for the same. Change-Id: I239f13875a065298630d266941df10cfa3addc85 BUG: 1369077 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15802 Tested-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/afr: Fix deadlock due to compound fopsKrutika Dhananjay2016-11-261-14/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an afr data transaction is eligible for using eager-lock, this information is represented in local->transaction.eager_lock_on. However, if non-blocking inodelk attempt (which is a full lock) fails, AFR falls back to blocking locks which are range locks. At this point, local->transaction.eager_lock[] per brick is reset but local->transaction.eager_lock_on is still true. When AFR decides to compound post-op and unlock, it is after confirming that the transaction did not use eager lock (well, except for a small bug where local->transaction.locks_acquired[] is not considered). But within afr_post_op_unlock_do(), afr again incorrectly sets the lock range to full-lock based on local->transaction.eager_lock_on value. This is a bug and can lead to deadlock since the locks acquired were range locks and a full unlock is being sent leading to unlock failure and thereby every other lock request (be it from SHD or other clients or glfsheal) getting blocked forever and the user perceives a hang. FIX: Unconditionally rely on the range locks in inodelk object for unlocking when using compounded post-op + unlock. Big thanks to Pranith for helping with the debugging. Change-Id: Idb4938f90397fb4bd90921f9ae6ea582042e5c67 BUG: 1398566 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15929 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Handle rpc errors, xdr failures etc with proper NULL checksKrutika Dhananjay2016-11-241-6/+17
| | | | | | | | | | | Change-Id: Id8ba76ba116d056bc7299dc5ce0980680a5a23f8 BUG: 1398226 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15924 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/dht Set layout after mkdir as rootN Balachandran2016-11-211-0/+3
| | | | | | | | | | | | | | | | | | | | DHT does not set the layout for newly created directories as root. This causes EPERM failures when a non-root user with insufficient permissions creates directories. credit: srangana@redhat.com for RCA Change-Id: Ia646e41665ce172c43c5f01d2707455e8eb374ed BUG: 1392772 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/15794 Reviewed-by: Susant Palai <spalai@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* afr,dht,ec: Replace GF_EVENT_CHILD_MODIFIED with event SOME_DESCENDENT_DOWN/UPPoornima G2016-11-213-27/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently these are few events related to child_up/down: GF_EVENT_CHILD_UP : Issued when any of the protocol client connects. GF_EVENT_CHILD_MODIFIED : Issued by afr/dht/ec GF_EVENT_CHILD_DOWN : Issued when any of the protocol client disconnects. These events get modified at the dht/afr/ec layers. Here is a brief on the same. DHT: - All the subvolumes reported once, and atleast one child came up, then GF_EVENT_CHILD_UP is issued - connect GF_EVENT_CHILD_UP is issued - disconnect GF_EVENT_CHILD_MODIFIED is issued - All the subvolumes disconnected, GF_EVENT_CHILD_DOWN is issued AFR: - First subvolume came up, then GF_EVENT_CHILD_UP is issued - Subsequent subvolumes coming up, results in GF_EVENT_CHILD_MODIFIED - Any of the subvolumes go down, then GF_EVENT_SOME_CHILD_DOWN is issued - Last up subvolume goes down, then GF_EVENT_CHILD_DOWN is issued Until the patch [1] introduced GF_EVENT_SOME_CHILD_UP, GF_EVENT_CHILD_MODIFIED was issued by afr/dht when any of the subvolumes go up or down. Now with md-cache changes, there is a necessity to differentiate between child up and down. Hence, introducing GF_EVENT_SOME_DESCENDENT_DOWN/UP and getting rid of GF_EVENT_CHILD_MODIFIED. [1] http://review.gluster.org/12573 Change-Id: I704140b6598f7ec705493251d2dbc4191c965a58 BUG: 1396038 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15764 CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Rajesh Joseph <rjoseph@redhat.com>
* events: Add FMT_WARN for gf_eventPranith Kumar K2016-11-182-2/+5
| | | | | | | | | | | | | | | | | Raghavendra G found that posix is trying to print %s but passing an int when HEALTH_CHECK fails in posix. These are the kind of bugs that should be caught at compilation itself. Also fixed the problematic gf_event() callers. BUG: 1386097 Change-Id: Id7bd6d9a9690237cec3ca1aefa2aac085e8a1270 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15671 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Atin Mukherjee <amukherj@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/dht: Check for null inodeN Balachandran2016-11-151-2/+5
| | | | | | | | | | | | | | | Check for NULL inode before attempting to set dht inode ctx. Change-Id: I7693c18445f138221d8417df5e95b118cedb818a BUG: 1395261 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/15847 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
* cluster/afr: When failing fop due to lack of quorum, also log error stringKrutika Dhananjay2016-11-091-11/+12
| | | | | | | | | | | | Change-Id: I39de2bcfc660f23a4229a885a0e0420ca949ffc9 BUG: 1392865 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15800 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* features/shard: Fill loc.pargfid too for named lookups on individual shardsKrutika Dhananjay2016-11-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a sharded volume when a brick is replaced while IO is going on, named lookup on individual shards as part of read/write was failing with ENOENT on the replaced brick, and as a result AFR initiated name heal in lookup callback. But since pargfid was empty (which is what this patch attempts to fix), the resolution of the shards by protocol/server used to fail and the following pattern of logs was seen: Brick-logs: [2016-11-08 07:41:49.387127] W [MSGID: 115009] [server-resolve.c:566:server_resolve] 0-rep-server: no resolution type for (null) (LOOKUP) [2016-11-08 07:41:49.387157] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-rep-server: 91833: LOOKUP(null) (00000000-0000-0000-0000-000000000000/16d47463-ece5-4b33-9c93-470be918c0f6.82) ==> (Invalid argument) [Invalid argument] Client-logs: [2016-11-08 07:41:27.497687] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument] [2016-11-08 07:41:27.497755] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument] [2016-11-08 07:41:27.498500] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument] [2016-11-08 07:41:27.499680] E [MSGID: 133010] Also, this patch makes AFR by itself choose a non-NULL pargfid even if its ancestors fail to initialize all pargfid placeholders. Change-Id: I5f85b303ede135baaf92e87ec8e09941f5ded6c1 BUG: 1392445 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15788 CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org>
* cluster/tier: break the monolith processing functionMilind Changire2016-10-261-309/+429
| | | | | | | | | | | | | | Break tier_migrate_using_query_file() into a more manageable tier_migrate_link() and helpers. Change-Id: I5eb2d2cff9e7a2a2da567c3c4c2d53aab195f477 BUG: 1358296 Signed-off-by: Milind Changire <mchangir@redhat.com> Reviewed-on: http://review.gluster.org/14957 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
* afr,ec: Heal device files with correct major, minor numbersPranith Kumar K2016-10-262-2/+4
| | | | | | | | | | | | | | Thanks a lot to xiaoping.wu@nokia.com from Nokia for the bug and the fix. BUG: 1384297 Change-Id: Ie443237e85d34633b5dd30f85eaa2ac34e45754c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15728 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/dht: Incorrect volname in rebalance eventsN Balachandran2016-10-251-6/+50
| | | | | | | | | | | | | | | | The rebalance event code was using strtok to parse the volume name which is incorrect. Reworked the code to get the correct volume name using strstr. Change-Id: Ib5f3305a34e6bf1ecfef677d87c5aff96bdeb0e6 BUG: 1388010 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/15712 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* md-cache, afr: Reduce the window of stale readPoornima G2016-10-202-2/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Consider a replica setup, where one mount writes data to a file and the other mount reads the file. In afr, read operations are not transaction based, a brick(read subvolume) is chosen as a part of lookup or other operations, read is always wound only to the read subvolume, even if there was write from a different client that failed on this brick. This stale read continues until there is a lookup or any write operation from the mount point. Currently, this is not a major issue, as a lookup is issued before every read and it will switch the read subvolume to a correct one. But with the plan of increasing md-cache timeout to 600s, the stale read problem will be more pronounced, i.e. stale read can continue for 600s(or more if cascaded with readdirp), as there will be no lookups. Solution: Afr doesn't have any built-in solution for stale read(without affecting the performance). The solution that came up, was to use upcall. When a file on any brick is marked bad for the first time, upcall sends a notification to all the clients that had recently accessed the file. The solution has 2 parts: - Identifying when a file is marked bad, on any of the bricks, for the first time - Client side actions on recieving the notifications Identifying when a file is marked bad on any of the bricks for the first time: ----------------------------------------------------------------------------- The idea is to track xattrop in upcall. xattrop currently comes with 2 afr xattrs - afr dirty bit and afr pending xattrs. Dirty xattr is set to 1 before every write, and is unset if write succeeds. In certain scenarios, dirty xattr can be 0 and still the file could be bad copy. Hence do not track dirty xattr. Pending xattr is set on the good copy, indicating the other bricks that have bad copy. It is still not as simple as, notifying when any of the pending xattrs change. It could lead to flood of notifcations, in case the other brick is completely down or consistantly failing. Hence it is important to notify only once, the first time a good copy is marked bad. Client side actions on recieving pending xattr change, notification: -------------------------------------------------------------------- md-cache will invalidate the cache of that file, so that further lookup is passed down to afr and hence update the read subvolume. Invalidating only in md-cache is not enough, consider the folling oder of opertaions: - pending xattr invalidation - invalidate md-cache - readdirp on the bad read subvolume - fill md-cache - lookup (served from md-cache) - read - wound to the old read subvol. Hence, along with invalidating md-cache, it is very important to reset the read subvolume for that file, in afr. Design Credit: Anuradha Talur, Ravishankar N 1. xattrop doesn't carry info saying post op/pre op. 2. Pre xattrop will have 0 value for all pending xattrs, the cbk of pre xattrop carries the on-disk xattr value. Non zero indicated healing is required. 3. Post xattrop will have non zero value for any of the pending xattrs, if the fop failed on any of the bricks. Change-Id: I469cbc111714c433984fe1c922be2ef113c25804 BUG: 1211863 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15398 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/tier: handle fast demotionsMilind Changire2016-10-194-16/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Demote files on priority if hi-watermark has been breached and continue to demote until the watermark drops below hi-watermark. Monitor watermark more frequently. Trigger demotion as soon as hi-watermark is breached. Add cluster.tier-emergency-demote-query-limit option to limit number of files returned from the database query for every iteration of tier_migrate_using_query_file(). If watermark hasn't dropped below hi-watermark during the first iteration, the next iteration will be triggered approximately 1 second after tier_demote() returns to the main tiering loop. Update changetimerecorder xlator to handle query for emergency demote mode. Add tier-ctr-interface.h: Move tier and ctr interface specific macros and struct definition from libglusterfs/src/gfdb/gfdb_data_store.h to new header libglusterfs/src/tier-ctr-interface.h Change-Id: If56af78c6c81d37529b9b6e65ae606ba5c99a811 BUG: 1366648 Signed-off-by: Milind Changire <mchangir@redhat.com> Reviewed-on: http://review.gluster.org/15158 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
* trivial: correct some spelling mistakes in comments and logsNiels de Vos2016-10-181-1/+1
| | | | | | | | | | | | | | BUG: 1385593 Change-Id: Icfae9e557a284182c6c22e9606fdd641528906f0 Reported-by: Patrick Matthäi <pmatthaei@debian.org> Signed-off-by: Niels de Vos <ndevos@redhat.com> Reviewed-on: http://review.gluster.org/15656 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Atin Mukherjee <amukherj@redhat.com> Reviewed-by: jiffin tony Thottan <jthottan@redhat.com> Reviewed-by: Kotresh HR <khiremat@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Prevent dict_set() on NULL dictPranith Kumar K2016-10-151-1/+2
| | | | | | | | | | | | | | | | | In afr lookup when NULL dict is received in lookup, afr is supposed to set all the xattrs it requires in a new dict it creates, but for 'link-count' it is trying to set to the dict that is passed in lookup which can be NULL sometimes. This is leading to error logs. Fixed the same in this patch. BUG: 1385104 Change-Id: I679af89cfc410cbc35557ae0691763a05eb5ed0e Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15646 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* afr: Take full locks in arbiter only for data transactionsRavishankar N2016-10-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Sharding exposed a bug in arbiter config. where `dd` throughput was extremely slow. Shard xlator was sending a fxattrop to update the file size immediately after a writev. Arbiter was incorrectly over-riding the LLONGMAX-1 start offset (for metadata domain locks) for this fxattrop, causing the inodelk to be taken on the data domain. And since the preceeding writev hadn't released the lock (afr does a 'lazy' unlock if write succeeds on all bricks), this degraded to a blocking lock causing extra lock/unlock calls and delays. Fix: Modify flock.l_len and flock.l_start to take full locks only for data transactions. Change-Id: I906895da2f2d16813607e6c906cb4defb21d7c3b BUG: 1384906 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Max Raba <max.raba@comsysto.com> Reviewed-on: http://review.gluster.org/15641 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Implement heal info with lockAshish Pandey2016-10-115-79/+261
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently heal info command prints all the files/directories if the index for the file/directory is present in .glusterfs/indices folder. After implementing patch http://review.gluster.org/#/c/13733/ indices of the file which is going through update fop will also be present in .glusterfs/indices even if the fop is successful on all the brick. At this time if heal info command is being used, it will also display this file which is actually healthy and does not require any heal. Solution: Take lock on a file corresponding to the indices and inspect xattrs to decide if the file needs heal or not. Change-Id: I6361e2813ece369be12d02e74816df4eddb81cfa BUG: 1366815 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/15543 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* afr: fix incorrect debug log in selfheal pathRavishankar N2016-10-041-2/+2
| | | | | | | | | | | | | | | | | | | 1. While looking at glustershd logs in DEBUG log-level, it was found that all bricks of the replica were printed as local bricks even though they were not. Fixed it. 2. Print the name of the subvol from which the entry was got during index crawl. Change-Id: I08b32e38694c755715e9fe0c0e1dd9212abcfb16 BUG: 1381421 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/15610 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com>