glusterfs.git

Commit message	Author	Age	Files	Lines
...
*	cluster/dht: coverity fixes	N Balachandran	2018-08-21	2	-12/+14
Fixes 1133997, 1370910, 1382387, 1382444, 1394635 Change-Id: Ie63ad47abd5519b9b9536da26b61ed4c9eaf2c75 updates: bz#789278 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	performance/readdir-ahead: keep stats of cached dentries in sync with modifications	Krutika Dhananjay	2018-08-18	1	-2/+2
PROBLEM: Stats of dentries that are readdirp'd ahead can become stale due to fops like writes, truncates, etc. that modify the file pointed to by the dentries. When a readdir is finally wound at the offset corresponding to these entries, the iatts returned to the application come from readdir-ahead's cache and are stale by then. This problem is further aggravated when caching translators/modules cache and continue to serve this stale information.
FIX:
* Store the iatt in the context of the inode pointed to by the dentry.
* Whenever that inode undergoes modification, update the iatt stored in the inode-ctx in the cbk of the modification fop to reflect the modification.
* When serving a readdirp response to the application, update the iatts of the dentries with the iatts stored in the context of the inodes they point to.
* Some fops don't have valid iatts in their responses. For example, a write response whose data is still cached in write-behind will have a zeroed-out stat. In this case, keep only ia_type and ia_gfid and reset the rest of the iatt members to zero.
  - fuse-bridge in this case sends only the "entry" information back to the kernel; attributes are not sent.
  - gfapi sets entry->inode to NULL and zeroes out the entire stat.
* There is one tiny race between entry creation and a readdirp on its parent dir, which could cause the inode-ctx setting and the inode-ctx reading to happen on two different inode objects. To prevent this, when entry->inode is not equal to linked_inode:
  - fuse-bridge is made to send only "entry" information without attributes.
  - gfapi sets entry->inode to NULL and zeroes out the entire stat.
Change-Id: Ia27ff49a61922e88c73a1547ad8aacc9968a69df BUG: 1390050 Updates: bz#1390050 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
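A minimal sketch of the inode-ctx scheme described above. The types `iatt_sketch` and `rda_ctx_sketch` are hypothetical, trimmed-down stand-ins (not the real struct iatt or readdir-ahead ctx), and treating `ia_ctime == 0` as "reply carried no valid stat" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct iatt: only the fields this sketch needs. */
struct iatt_sketch {
    unsigned char ia_gfid[16];
    int ia_type;
    uint64_t ia_size;
    int64_t ia_ctime; /* assumption: 0 means "no valid stat in this reply" */
};

/* Hypothetical per-inode context kept by readdir-ahead. */
struct rda_ctx_sketch {
    struct iatt_sketch statbuf;
};

/* Called from the cbk of a modifying fop (write, truncate, ...). */
static void ctx_update_iatt(struct rda_ctx_sketch *ctx,
                            const struct iatt_sketch *post)
{
    assert(ctx != NULL);
    if (post == NULL || post->ia_ctime == 0) {
        /* No usable stat (e.g. a write cached in write-behind):
         * keep only type and gfid, zero everything else. */
        struct iatt_sketch keep = ctx->statbuf;
        memset(&ctx->statbuf, 0, sizeof(ctx->statbuf));
        memcpy(ctx->statbuf.ia_gfid, keep.ia_gfid, sizeof(keep.ia_gfid));
        ctx->statbuf.ia_type = keep.ia_type;
    } else {
        ctx->statbuf = *post; /* valid post-op stat: store it as-is */
    }
}

/* Called when serving a cached dentry: replace the iatt cached with
 * the dentry by the one kept in the inode ctx. */
static void serve_dentry_iatt(const struct rda_ctx_sketch *ctx,
                              struct iatt_sketch *dentry_stat)
{
    *dentry_stat = ctx->statbuf;
}
```

With a valid post-op stat the served dentry reflects the new size; after a modification without a stat, only type and gfid survive, which is the cue for consumers not to trust the attributes.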
*	cluster/dht: Fixed rebalanced files	N Balachandran	2018-08-14	1	-1/+1
An error caused skipped files to be counted as rebalanced files. Change-Id: I02333f099fb8b73ba953f41a2922021a1e4da7be fixes: bz#1615474 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: fix inode ref management in dht_heal_path	Susant Palai	2018-08-14	1	-2/+7
In dht_heal_path, the inodes are created and looked up from the top down. If the path is "a/b/c", the lookup is done on a, then b, and so on. Here is a rough snippet of the function "dht_heal_path" (the right-hand column is the ref count):
<snippet>
if (bname) {                                   ref_count
    - loc.inode = create/grep inode            1
    - syncop_lookup (loc.inode)
    - linked_inode = inode_link (loc.inode)    2
    /* clean up current loc */
    - loc_wipe(&loc)                           1
    /* set up parent and bname for next child */
    - loc.parent = inode
    - bname = next_child_name
}
out:
    - inode_ref (linked_inode)                 2
    - loc_wipe (&loc)                          1
</snippet>
The problem with the above code is that if bname is empty, i.e. the chain lookup is done, we still populate loc.parent for the next iteration. Since bname is empty, loc_wipe is also done in the out section. Because loc.parent was set to the previous inode, we unintentionally lose a ref. A dht_local_wipe as part of DHT_STACK_UNWIND then takes away the last ref, leading to inode_destroy.
This problem is currently observed with nfs-ganesha with nameless lookups. After the inode_purge, gfapi does not get the new inode to link, and hence it links the inode it sent in the lookup fop, which does not have any dht-related context (layout). This leads to an "invalid argument" error in a lookup path done in parallel with a tar operation.
Test done in the following way:
- create two nfs clients connected to two different nfs servers.
- run untar on one client and run lookup continuously on the other.
- Prior to this patch, "invalid argument" was seen; this is fixed by the current patch.
Change-Id: Ifb90c178a2f3c16604068c7da8fa562b877f5c61 fixes: bz#1610256 Signed-off-by: Susant Palai <spalai@redhat.com>
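A toy model of the ref-count invariant behind this bug: every pointer stored in a loc must own exactly one ref, so that loc_wipe() is always balanced. `toy_inode`, `toy_loc` and `toy_heal_path` are hypothetical stand-ins, and setting the parent only when another component follows (with its own ref) is one way to keep refs balanced, not necessarily the committed fix.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for inode_t and loc_t; hypothetical, not glusterfs types. */
typedef struct { int ref; } toy_inode;
typedef struct { toy_inode *inode; toy_inode *parent; } toy_loc;

static toy_inode *toy_ref(toy_inode *i) { i->ref++; return i; }
static void toy_unref(toy_inode *i) { assert(i->ref > 0); i->ref--; }

/* Drops exactly the refs the loc owns, like loc_wipe(). */
static void toy_loc_wipe(toy_loc *l)
{
    if (l->inode) toy_unref(l->inode);
    if (l->parent) toy_unref(l->parent);
    l->inode = NULL;
    l->parent = NULL;
}

/* Walks components chain[0..n-1] top-down and returns the last inode
 * with one extra ref owned by the caller. */
static toy_inode *toy_heal_path(toy_inode *chain, int n)
{
    toy_loc loc = {NULL, NULL};
    toy_inode *linked = NULL;

    for (int i = 0; i < n; i++) {
        loc.inode = toy_ref(&chain[i]);   /* create/grep: loc owns a ref */
        /* ... lookup and inode_link would happen here ... */
        linked = loc.inode;
        toy_loc_wipe(&loc);               /* drops loc.inode and old parent */
        if (i + 1 < n)                    /* only if another component follows */
            loc.parent = toy_ref(linked); /* parent owns its own ref */
    }
    if (linked)
        toy_ref(linked);                  /* ref handed back to the caller */
    toy_loc_wipe(&loc);                   /* balanced: loc owns nothing here */
    return linked;
}
```

Starting from ref 1 on each of three components, the intermediate inodes end back at 1 and only the returned inode carries the caller's extra ref.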
*	All: remove memset() before sprintf()	Yaniv Kaul	2018-08-14	2	-2/+2
It's not needed. There's a good chance the compiler is smart enough to remove it anyway, but it can't hurt - I hope. Compile-tested only! Change-Id: Id7c054e146ba630227affa591007803f3046416b updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	Revert "performance/readdir-ahead: Invalidate cached dentries if they're modified while in cache"	Raghavendra G	2018-08-03	1	-2/+2
This reverts commit 7131de81f72dda0ef685ed60d0887c6e14289b8c.
With the latest master, I created a single-brick volume and some files inside it.
[root@rhgs313-6 ~]# umount -f /mnt/fuse1; mount -t glusterfs -s 192.168.122.6:/thunder /mnt/fuse1; ls -l /mnt/fuse1/; echo "Trying again"; ls -l /mnt/fuse1
umount: /mnt/fuse1: not mounted
total 0
----------. 0 root root 0 Jan 1 1970 file-1
----------. 0 root root 0 Jan 1 1970 file-2
----------. 0 root root 0 Jan 1 1970 file-3
----------. 0 root root 0 Jan 1 1970 file-4
----------. 0 root root 0 Jan 1 1970 file-5
d---------. 0 root root 0 Jan 1 1970 subdir
Trying again
total 3
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-1
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-2
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-3
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-4
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-5
d---------. 0 root root 0 Jan 1 1970 subdir
[root@rhgs313-6 ~]#
The conversation can be followed on gluster-devel in the thread with the subject: tests/bugs/distribute/bug-1122443.t - spurious failure. git bisect pointed to this patch as the culprit.
Change-Id: I1eb46f6c196f44fde8ce991840a0e724e6f50862 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Updates: bz#1390050
*	performance/readdir-ahead: Invalidate cached dentries if they're modified while in cache	Krutika Dhananjay	2018-07-28	1	-2/+2
PROBLEM: Entries that are readdirp'd ahead can undergo modification through writes and truncates, which could modify their iatts. When a readdir is finally wound at the offset corresponding to these entries, the iatts returned to the application come from readdir-ahead's cache and are stale by then. This problem is further aggravated when caching translators/modules cache and continue to serve this stale information.
FIX: Whenever a dentry undergoes modification, a "dirty" flag (default 0) is set in its inode ctx in the cbk of the modification fop. When it's time for readdir-ahead to serve these entries, it reads the inode ctx and checks whether the entry is "dirty"; if it is, it sets the entry's attrs to all zeroes as an indicator to fuse, md-cache, etc. not to cache these attributes.
There is also one tiny race between entry creation and a readdirp on its parent dir, which could cause the inode-ctx setting and the inode-ctx reading to happen on two different inode objects. To prevent this, fuse-bridge is made to drop entries for which dentry->inode is not the same as the linked inode, in the readdirp cbk.
Change-Id: If7396507632b5268442ca580473d5155fee9cbef BUG: 1390050 Updates: bz#1390050 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
*	core (named threads): flood of -Wformat-truncation warnings with gcc-7.1	Kaleb S. KEITHLEY	2018-07-23	1	-2/+1
Starting in Fedora 26, which has gcc-7.1.x, -Wformat-truncation is enabled with -Wformat, resulting in a flood of new warnings. This many warnings is a concern because it makes it harder to see other warnings that should be addressed.
An example is at https://kojipkgs.fedoraproject.org//packages/glusterfs/3.12.0/1.fc28/data/logs/x86_64/build.log
For more info see https://review.gluster.org/#/c/18267/
I can't find much (or good) documentation on the heuristics the compiler uses for this warning. In the case of printing integer types, it appears to look at the available space in the destination and the range of values for the variable and/or its type.
To address the specific question about why 0x3ff versus 0xfff to mask the value: either would suffice to hint to the compiler that the printed value will fit in three characters. But the loop is from 0...1023 (or 0...0x3ff if you prefer), so I chose that as the more "accurate" mask, as it exactly matches the range of values of the loop.
Fixes: bz#1492847 Change-Id: I6e309ba42159841131d8241bfc0566ef09e00aa9
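A small sketch of the masking idea: a loop index in 0...1023 masked with 0x3ff always prints as at most three hex digits. `make_thread_name` and the "iotwr" prefix are hypothetical; whether a given gcc version stays quiet also depends on the destination buffer size, which this sketch does not claim to settle.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build a name from a prefix and a loop index.
 * idx & 0x3ff is in [0, 0x3ff], so "%03x" always prints exactly three
 * hex digits, which bounds the formatted length. */
static int make_thread_name(char *buf, size_t len, const char *prefix,
                            unsigned int idx)
{
    return snprintf(buf, len, "%s%03x", prefix, idx & 0x3ffu);
}
```

For example, index 1023 yields "iotwr3ff" and index 5 yields "iotwr005". Note that an index of 1024 would wrap to "000" rather than overflow, which is harmless only because the loop never exceeds 1023.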
*	All: run codespell on the code and fix issues.	Yaniv Kaul	2018-07-22	7	-25/+25
Please review; it's not always just the comments that were fixed. Of course, I've had to revert all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	cluster/dht: Set loc->gfid before healing attr	N Balachandran	2018-07-18	1	-1/+2
AFR takes inodelks when setting attrs. The loc->gfid and loc->inode->gfid were both null when dht_dir_attr_heal was called during a fresh lookup of an existing directory. As the gfid is null, client_pre_inodelk asserts in the gfid check. We now set the loc->gfid before calling dht_dir_attr_heal. Change-Id: I457f5a73fd301d97a03ca032587e73d4803298ac fixes: bz#1602866 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	dht: remove useless argument from dht_iatt_merge	Kinglong Mee	2018-07-18	7	-88/+78
The last use of the subvol argument was removed in 4e1ec35ef4f7 ("core: fill 'ia_ino' from 'ia_gfid' in 'storage/posix' ......") 7 years ago (2011-06-16). Change-Id: I9788d79e2e40cc153cf2960e28c7c1c1033dc8f7 fixes: bz#1601683 Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
*	dht: delete tier related internal xattr in dht_getxattr_cbk	Sunny Kumar	2018-07-17	1	-2/+1
Use dict_del instead of GF_REMOVE_INTERNAL_XATTR. For information about the problem and the fix, see https://review.gluster.org/20450. This patch makes some modifications requested by reviewers on the already merged patch: https://review.gluster.org/20450. Change-Id: I50c263e3411354bb9c1e028b64b9ebfd755dfe37 fixes: bz#1597563 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	tier: Move time string formatting	Yaniv Kaul	2018-07-17	1	-7/+10
There is no need to format the time unless ret is true. I don't think there's a reason to allocate memory for that struct and char array unless we are formatting either (but I'm not sure what the code convention is: are we OK with 'local' variable declarations?). Only compile-tested. Change-Id: I9feb09871943764bd76bdfc9ac6ca506f329aac1 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	dht: delete tier related internal xattr in dht_getxattr_cbk	Sunny Kumar	2018-07-16	1	-0/+15
Problem: Hot and cold tier brick changelogs report rsync failures. Solution: The geo-rep session fails to sync directories from the master volume to the slave volume due to a large number of changelog retries; the solution is to ignore the tier-related internal xattrs trusted.tier.fix.layout.complete and trusted.tier.tier-dht.commithash in dht_getxattr_cbk. Change-Id: I3530ffe7c4157584b439486f33ecd82ed8d66aee fixes: bz#1597563 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	md-cache: Do not invalidate cache post set/remove xattr	Poornima G	2018-07-11	1	-2/+2
Since the setxattr and removexattr fop cbks do not carry a poststat, the stat cache was being invalidated in the setxattr/removexattr cbks. Hence further lookups wouldn't be served from the cache. To prevent this invalidation, md-cache is modified to get the poststat in set/removexattr_cbk via the dict. Co-authored with Xavi Hernandez. Change-Id: I6b946be2d20b807e2578825743c25ba5927a60b4 fixes: bz#1586018 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Signed-off-by: Poornima G <pgurusid@redhat.com>
*	dht: Inconsistent permission for directories after brick stop/start	Mohit Agrawal	2018-07-11	3	-9/+77
Problem: Access permissions on directories are inconsistent after the down sub-volumes are brought back. For directories, dht_setattr first winds a call to the MDS and, once that call finishes, winds calls to the non-MDS subvols. At revalidate time, dht only compared the uid/gid with the stbuf uid/gid and, if either differed, set a flag to heal them. Solution: Also compare permissions in dht_revalidate_cbk when deciding whether to set the flag that triggers dht_dir_attr_heal. BUG: 1584517 Change-Id: I3e039607148005015b5d93364536158380d4c5aa fixes: bz#1584517 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	Fix compile warnings	Xavi Hernandez	2018-07-10	1	-12/+25
This patch fixes compile warnings that appear with newer compilers. The solution applied is only to remove the warnings, but it doesn't always solve the problem in the best way. It assumes that the problem will never happen, as the previous code assumed. Change-Id: I6e8470d6c2e2dbd3bd7d324b5fd2f92ffdc3d6ec updates: bz#1193929 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	cluster/dht: Do not try to use up the readdirp buffer	N Balachandran	2018-06-29	3	-53/+66
DHT attempts to use up the entire buffer in readdirp before unwinding, in an attempt to reduce the number of calls. However, this has 2 disadvantages:
1. It can cause a stack overflow when parallel readdir is enabled. If the buffer has only a little space left, rda can send back only one or two entries. If those entries are stripped out by dht_readdirp_cbk (linkto files, for example), it will once again wind down to rda in an attempt to fill the buffer before unwinding to FUSE. This process can continue for several iterations, causing the stack to grow and eventually overflow, crashing the process.
2. If parallel readdir is disabled, dht could send readdirp calls with small buffers to the bricks, thus increasing the number of network calls.
We are therefore reverting to the existing behaviour. Please note that this only mitigates the stack overflow; it does not prevent it. It is still possible if a subvol has thousands of linkto files, for instance.
Change-Id: I291bc181c5249762d0c4fe27fa4fc2631166adf5 fixes: bz#1593548 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Fix rename journal in changelog	Kotresh HR	2018-06-24	1	-0/+11
With patch [1], renames are journalled only on the cached subvolume: dht sends a special key on the cached subvolume so that the changelog journals the rename. With a single distribute sub-volume, the key was not being set. This patch fixes that. [1] https://review.gluster.org/10410 fixes: bz#1583018 Change-Id: Ic2e35b40535916fa506a714f257ba325e22d0961 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	cluster/dht: refactor dht_lookup	N Balachandran	2018-06-21	1	-211/+321
The dht lookup code is getting difficult to maintain due to its size. Refactoring the code will make it easier to modify in the future. Change-Id: Ic7cb5bf4f018504dfaa7f0d48cf42ab0aa34abdd updates: bz#1590385 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Minor code cleanup	N Balachandran	2018-06-20	1	-15/+13
Removed extra variable. Change-Id: If43c47f6630454aeadab357a36d061ec0b53cdb5 updates: bz#1590385 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Refactor rebalance code	N Balachandran	2018-06-13	1	-309/+253
Created init and cleanup functions for certain functionality in order to improve readability. Removed unused code. Change-Id: Ia6a2f4ab64923b6ea8e10487227fb5621eec1488 updates: bz#1586363 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Leverage MDS subvol for dht_removexattr also	Mohit Agrawal	2018-06-11	1	-60/+190
Problem: In a distributed volume, a situation can arise where custom extended attributes are not removed from all bricks after a brick stop/start or after a new brick is added. Solution: To resolve this, use the MDS subvol for removexattr as well. BUG: 1575587 Change-Id: I7701e0d3833e3064274cb269f26061bff9b71f50 fixes: bz#1575587 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	dht: Delete MDS internal xattr from dict in dht_getxattr_cbk	Mohit Agrawal	2018-06-03	1	-0/+4
Problem: When afr fetches xattrs in order to heal them, it cannot fetch the MDS xattr because posix_getxattr has a check to ignore it if the xattr name is the MDS xattr. Solution: To ignore that xattr, add a check in dht_getxattr_cbk instead of having the check in posix_getxattr. BUG: 1584098 Change-Id: I86cd2b2ee08488cb6c12f407694219d57c5361dc fixes: bz#1584098 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	cluster:dht: Corrected ret code check	N Balachandran	2018-05-30	1	-1/+1
syncop functions return -op_errno. Change-Id: Ifdb1bd1d1d11972b4306a2336e6737d6236a2fb1 fixes: bz#1580238 Signed-off-by: N Balachandran <nbalacha@redhat.com>
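A sketch of the return-code convention stated above: a failing call returns the negated errno, so callers test for a negative value and negate it, rather than expecting -1 with errno set. `fake_syncop_setxattr` and `do_setxattr` are hypothetical stand-ins, not real syncop functions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a syncop_*() call: 0 on success,
 * -op_errno on failure. */
static int fake_syncop_setxattr(int fail_with)
{
    return fail_with ? -fail_with : 0;
}

/* Correct check: ret < 0 means failure, and -ret is the errno. */
static int do_setxattr(int fail_with, int *op_errno)
{
    int ret = fake_syncop_setxattr(fail_with);
    if (ret < 0) {
        *op_errno = -ret;
        return -1;
    }
    *op_errno = 0;
    return 0;
}
```

A check such as `ret == -1` would only catch EPERM (errno 1) under this convention, which is the kind of mistake the commit corrects.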
*	dht: Excessive 'dict is null' logs in dht_revalidate_cbk	Mohit Agrawal	2018-05-29	1	-2/+4
Problem: In case of an error (ESTALE/ENOENT), dht_revalidate_cbk logs a "dict is null" error because the xattr is not available. Solution: To avoid these logs, update the condition in dht_revalidate_cbk and dht_lookup_dir_cbk. BUG: 1583565 Change-Id: Ife6b3eeb6d91bf24403ed3100e237bb5d15b4357 fixes: bz#1583565 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	cluster/dht: Increase failure count for lookup failure in remove-brick op	Susant Palai	2018-05-28	1	-3/+31
An entry from readdirp might get renamed just before migration, leading to lookup failures. For such lookup failures, the remove-brick process does not see any increment in the failure count. Though there is a warning message after remove-brick commit asking the user to check the decommissioned bricks for any files that were not migrated, it's better to increase the failure count so that the user can check the decommissioned bricks for files before commit. Note: This can produce spurious failures when rm -rf interacts with a remove-brick op: remove-brick shows a non-zero failure count, but the entry was actually deleted by the user. Fixes: bz#1580269 Change-Id: Icd1047ab9edc1d5bfc231a1f417a7801c424917c fixes: bz#1580269 Signed-off-by: Susant Palai <spalai@redhat.com>
*	cluster/dht: Fix rebalance log msg	N Balachandran	2018-05-24	1	-2/+2
Corrected the name of the xattr and fixed the code to log an error only if op_errno is not ENODATA or ENOATTR. Change-Id: I42c5b1d838eec586ac7bed2471eb1d27ff09a9ea fixes: bz#1580238 Signed-off-by: N Balachandran <nbalacha@redhat.com>
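A minimal sketch of the logging condition: a missing xattr is an expected outcome, so only other errors are logged. `xattr_error_should_log` is hypothetical. Linux `<errno.h>` does not define ENOATTR, so this sketch aliases it to ENODATA, as is common practice on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Linux has no ENOATTR in <errno.h>; alias it to ENODATA here. */
#ifndef ENOATTR
#define ENOATTR ENODATA
#endif

/* Hypothetical helper: "xattr not present" is not worth an error log. */
static bool xattr_error_should_log(int op_errno)
{
    return op_errno != ENODATA && op_errno != ENOATTR;
}
```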
*	build: Disallow unresolved symbol references	Prashanth Pai	2018-05-18	1	-8/+12
In the past, it was often[1] forgotten to link xlators against the symbols they refer to. This often caused glusterd2 to fail while loading an xlator's shared object (.so) file. This change adds "--no-undefined" as a linker flag, which makes the linker treat unresolved symbol references as errors and hence fail the link. [1]: https://review.gluster.org/#/c/19912/ https://review.gluster.org/#/c/19664/ https://review.gluster.org/#/c/19056/ https://review.gluster.org/#/c/17659/ https://bugzilla.redhat.com/show_bug.cgi?id=1532238 Bonus: Added the cloudsync and utime xlators' generated source files to .gitignore. Updates: bz#1193929 Change-Id: I9604a4a87b7313a5fa43bda5fdb37dfa7ef8facd Signed-off-by: Prashanth Pai <ppai@redhat.com>
*	cluster/dht: Remove EIO from dht_inode_missing	N Balachandran	2018-05-17	2	-4/+2
Removed EIO from the list of errnos that triggered a migrate check task. Change-Id: I7f89c7a16056421588f1af2377cebe6affddcb47 fixes: bz#1578823 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	dht: Excessive 'dict is null' logs in dht_discover_complete	Mohit Agrawal	2018-05-11	1	-1/+2
Problem: In a Geo-Rep setup, excessive "dict is null" logs appear from dht_discover_complete when the xattr is NULL. Solution: To avoid these logs, update a condition in dht_discover_complete. BUG: 1576767 Change-Id: Ic7aad712d9b6d69b85b76e4fdf2881adb0512237 fixes: bz#1576767 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	cluster/dht: Debug virtual xattrs for dht	N Balachandran	2018-05-09	2	-0/+101
Provide a virtual xattr with which to query the hashed subvol for a file. Change-Id: Ic7abd031f875da4b9084841ea7c25d6c8a851992 fixes: bz#1574421 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Debug logs in dht_readdir(p)_cbk	N Balachandran	2018-05-09	1	-2/+27
Additional log messages to help debug issues with file listings. Change-Id: Iccd07498ba01d597c0c40f026f4177dd06d7e901 fixes: bz#1575887 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	dht: Avoid dict log flooding for internal MDS xattr	Mohit Agrawal	2018-05-08	1	-1/+1
Problem: Before populating the MDS internal xattr, dht first checks whether MDS is present in the xattr. If the xattr dictionary is NULL, dict_get logs a message that either the dict or the key is NULL. Solution: Check xattr before calling dict_get; if it is NULL, there is no need to call dict_get. BUG: 1575910 Change-Id: I81604ec5945b85eba14b42f4583d06ec713028f4 fixes: bz#1575910 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	cluster/dht: fixes to parallel renames to same destination codepath	Raghavendra G	2018-05-07	6	-81/+558
Test case:
# while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f || break; echo "done:$uuid"; done
This script was run in parallel from multiple mountpoints. Along the course of getting the above use case working, many issues were found:
Issue 1:
========
Consider a rename (src, dst). We can encounter a situation where:
* dst is a file present at the time of lookup
* dst is removed by the time the rename fop reaches glusterfs
In this scenario, acquiring an inodelk on dst fails with ESTALE, resulting in failure of the rename. However, as per POSIX, rename should succeed irrespective of whether dst is present or not. Acquiring an entrylk provides synchronization even in races like this.
Algorithm:
1. Take inodelks on src and dst (if dst is present) on their respective cached subvols. These inodelks are done to preserve backward compatibility with older clients, so that synchronization is preserved when a volume is mounted by clients of different versions. Once the relevant older versions (3.10, 3.12, 3.13) reach EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of the inodelk on dst.
3. Protect the namespace of src and dst. To protect the namespace of a file, take an inodelk on the parent on the hashed subvol, then take an entrylk on the same subvol on the parent with the basename of the file. The inodelk on the parent guards against changes to the parent layout, so that the hashed subvol won't change during the rename.
4. <rest of rename continues>
5. Unlock all locks.
Issue 2:
========
Linkfile creation in the lookup codepath can race with a rename. Imagine the following scenario:
* lookup finds a data-file with gfid gfid-dst without a corresponding linkto file on the hashed-subvol. It decides to create a linkto file with gfid gfid-dst.
  - Note that some codepaths of dht-rename delete the linkto file of dst as the first step, so a lookup racing with an in-progress rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames the data-file, and hence the data-file at dst-path now has gfid gfid-src.
* lookup proceeds and creates a linkto file dst-path with gfid gfid-dst on the hashed-subvol.
* rename tries to create a linkto file dst-path with gfid gfid-src on the hashed-subvol, but it fails with EEXIST. However, EEXIST is ignored during linkto file creation.
Now we've ended up with dst-path having different gfids: gfid-dst on the linkto file and gfid-src on the data file. Future lookups on dst-path will always fail with ESTALE due to the differing gfids.
The fix is to synchronize linkfile creation in the lookup path with rename, using the same namespace-protection mechanism explained in the solution to Issue 1. Once the locks are acquired, and before proceeding with linkfile creation, we check whether the conditions for linkto file creation are still valid. If not, we skip linkto file creation.
Issue 3:
========
The gfid of dst-path can change by the time the locks are acquired. This means either another rename overwrote dst-path, or dst-path was deleted and recreated by a different client. When this happens, the cached-subvol for dst can change. If rename proceeds with the old gfid and the old cached subvol, we'll end up in inconsistent states, like dst-path having different gfids on different subvols, more than one data-file being present, etc. The fix is to do the lookup with a new inode after protecting the namespace of dst. After the lookup, we have to compare gfids and correct the local state appropriately to be in sync with the backend.
Issue 4:
========
During a revalidate lookup, if following a linkto file doesn't lead to a valid data-file, local->cached-subvol was not reset to NULL. This means we would be operating on stale state, which can lead to inconsistency. As a fix, reset it to NULL everywhere before proceeding with the lookup.
Issue 5:
========
Stale dentries left in the inode table on the brick resulted in failures of the link fop even though the file/dentry didn't exist on the backend fs. A patch has been submitted to fix this issue; please check the dependency tree of the current patch on gerrit for details. In short, we fix the problem by not blindly trusting the inode-table. Instead, we validate whether the dentry is present by doing a lookup on the backend fs.
Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165 updates: bz#1543279 BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
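A sketch of steps 1 to 3 of the Issue 1 algorithm, expressed as a lock plan: it only lists which locks would be requested and on which subvols; it does not acquire or order them. `struct lock_req`, `build_rename_lock_plan` and subvol names like "vol-client-0" are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one lock request in the rename plan. */
enum lock_kind { INODELK, ENTRYLK };

struct lock_req {
    enum lock_kind kind;
    const char *subvol;       /* subvolume to lock on */
    const char *target;       /* what is locked (file or parent dir) */
    const char *basename;     /* entrylk only */
    bool ignore_enoent_estale;
};

/* Fills out[] (room for 6 entries) and returns the request count.
 * dst_cached is NULL when dst does not exist. */
static size_t build_rename_lock_plan(struct lock_req out[6],
                                     const char *src_cached,
                                     const char *dst_cached,
                                     const char *src_parent_hashed,
                                     const char *src_base,
                                     const char *dst_parent_hashed,
                                     const char *dst_base)
{
    size_t n = 0;
    /* Step 1: inodelks on cached subvols (backward compatibility). */
    out[n++] = (struct lock_req){INODELK, src_cached, "src", NULL, false};
    if (dst_cached) /* Step 2: dst may vanish, so ENOENT/ESTALE is ignored. */
        out[n++] = (struct lock_req){INODELK, dst_cached, "dst", NULL, true};
    /* Step 3: namespace protection: inodelk on the parent on the hashed
     * subvol, then entrylk on the same parent with the basename. */
    out[n++] = (struct lock_req){INODELK, src_parent_hashed, "src-parent", NULL, false};
    out[n++] = (struct lock_req){ENTRYLK, src_parent_hashed, "src-parent", src_base, false};
    out[n++] = (struct lock_req){INODELK, dst_parent_hashed, "dst-parent", NULL, false};
    out[n++] = (struct lock_req){ENTRYLK, dst_parent_hashed, "dst-parent", dst_base, false};
    return n;
}
```

Only the dst inodelk tolerates ENOENT/ESTALE; the entrylks on the parents are what keep the rename synchronized when dst disappears.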
*	cluster/dht: log error only if layout healing is required	Raghavendra G	2018-05-07	2	-188/+12
Selfhealing of a directory is invoked under two conditions:
1. there is no layout on disk, or the layout has some anomalies (holes/overlaps)
2. the mds xattr is not set on the directory
When dht_selfheal_directory is called with a correct layout just to set the mds xattr, we see error msgs complaining about "not able to form layout on directory", which is misleading as the layout is correct. So, log this msg only if the layout has anomalies.
Change-Id: I4af25246fc3a2450c2426e9902d1a5b372eab125 updates: bz#1543279 BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
*	cluster/dht: unwind if dht_selfheal_dir_mkdir returns an error	Raghavendra G	2018-05-03	1	-1/+5
If dht_selfheal_dir_mkdir returns an error, the cbk passed to dht_selfheal_directory is not invoked. So the current codepath leaves an unwound frame, resulting in a fop that hangs forever. Change-Id: I422308b8a34a074301ca46b029ffe676f5e0f66c fixes: bz#1574305 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
*	dht: gf_defrag_settle_hash should ignore ENOENT and ESTALE error	Susant Palai	2018-04-30	1	-1/+8
Problem: A directory can be deleted just before gf_defrag_settle_hash, which internally does a setxattr operation on the directory. Solution: Ignore ENOENT and ESTALE errors. Fixes: bz#1572581 Change-Id: I2f91809f3b5e02976c4c3a5a596406a8b2f8f6f2 Signed-off-by: Susant Palai <spalai@redhat.com>
*	cluster/dht: Fix dht_rename lock order	N Balachandran	2018-04-23	1	-18/+47
Fixed dht_order_rename_lock to use the same inodelk ordering as that of the dht selfheal locks (dictionary order of lock subvolumes). Change-Id: Ia3f8353b33ea2fd3bc1ba7e8e777dda6c1d33e0d fixes: bz#1568348 Signed-off-by: N Balachandran <nbalacha@redhat.com>
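A sketch of ordering locks by the dictionary order of their subvolume names, so that every code path (rename and selfheal alike) acquires them in the same sequence and cannot deadlock against another. `struct lock_ent`, `order_locks` and the gfid tie-breaker are hypothetical assumptions of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical lock descriptor: only the fields needed for ordering. */
struct lock_ent {
    const char *subvol_name;
    const char *gfid; /* tie-breaker when two locks target one subvol */
};

/* Dictionary order of subvolume name, then gfid (assumed tie-breaker). */
static int lock_ent_cmp(const void *a, const void *b)
{
    const struct lock_ent *x = a, *y = b;
    int c = strcmp(x->subvol_name, y->subvol_name);
    return c != 0 ? c : strcmp(x->gfid, y->gfid);
}

/* Sort so every path acquires the locks in one global order. */
static void order_locks(struct lock_ent *locks, size_t n)
{
    qsort(locks, n, sizeof(*locks), lock_ent_cmp);
}
```

Dictionary order puts "vol-client-10" before "vol-client-2"; that is fine, because deadlock avoidance only requires that all lock takers use the same comparator, not a numeric one.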
*	fuse: do fd_resolve in fuse_getattr if fd is received	Susant Palai	2018-04-18	1	-2/+2
Problem: With the current code, after a graph switch the old fd is received in fuse_getattr, and since it is associated with the old inode, it does not have the inode ctx of the xlators in the new graph. Hence dht errored out with "no layout" for the fstat call, resulting in EINVAL. Solution: If an fd is passed, init and resolve the fd to carry on with getattr.
Test case:
- Created a single-brick distributed volume
- Started untar
- Added a new brick
Without this fix, untar used to abort with an ERROR.
Change-Id: I5805c463fb9a04ba5c24829b768127097ff8b9f9 fixes: bz#1566207 Signed-off-by: Susant Palai <spalai@redhat.com>
*	cluster/dht: Handle file migrations when brick down	N Balachandran	2018-04-13	1	-5/+51
The decision as to which node would migrate a file was based on the gfid of the file. Files were divided among the nodes for the replica/disperse set. However, if a brick was down when rebalance started, the nodeuuids would be saved as NULL and a set of files would not be migrated. Now, if the nodeuuid is NULL, the first non-null entry in the set is the node responsible for migrating the file. Change-Id: I72554c107792c7d534e0f25640654b6f8417d373 fixes: bz#1564198 Signed-off-by: N Balachandran <nbalacha@redhat.com>
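A sketch of the selection rule described above: files are spread across the nodes of a replica/disperse set using the gfid, and if the chosen slot's nodeuuid is NULL (brick down at rebalance start), the first non-NULL entry takes over. `pick_migrator` and the `gfid_hash % n` spreading are hypothetical assumptions; the real code may divide files differently.

```c
#include <assert.h>
#include <stddef.h>

/* uuids[]: one node uuid per brick of the set; NULL = brick was down.
 * gfid_hash: any stable hash of the file's gfid (assumption).
 * Returns an index into uuids[], or -1 if every entry is NULL. */
static int pick_migrator(const char *const uuids[], size_t n,
                         unsigned int gfid_hash)
{
    if (n == 0)
        return -1;
    size_t idx = gfid_hash % n;     /* spread files across the nodes */
    if (uuids[idx] != NULL)
        return (int)idx;
    for (size_t i = 0; i < n; i++)  /* fallback: first non-NULL entry */
        if (uuids[i] != NULL)
            return (int)i;
    return -1;
}
```

With {"uuid-a", NULL, "uuid-c"}, files hashing to the middle slot are picked up by the first node instead of being silently left unmigrated.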
*	cluster/dht: Wind open to all subvols	N Balachandran	2018-04-11	1	-10/+5
dht_opendir should wind the open to all subvols whether or not local->subvols is set. This is because dht_readdirp winds the calls to all subvols. Change-Id: I67a96b06dad14a08967c3721301e88555aa01017 updates: bz#1564198 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: act as passthrough for renames on single child DHT	Raghavendra G	2018-04-10	1	-7/+15
The various synchronization present in dht_rename for handling directories and files is necessary only if there is more than one child. Change-Id: Ie21ad419125504ca2f391b1ae2e5c1d166fee247 fixes: bz#1563511 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
*	cluster/dht: enable lookup-optimize by default	N Balachandran	2018-04-04	2	-2/+4
Lookup-optimize has been shown to improve create performance. The code has been in the project for several years and is considered stable. Enabling it by default in order to test it in the upstream regression runs. Change-Id: Iab792979ee34f0af4713931e0b5b399c23f65313 updates: bz#1557435 BUG: 1557435 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Update dht option levels	N Balachandran	2018-04-02	1	-2/+16
Set the levels for DHT options based on https://review.gluster.org/#/c/19466/ Change-Id: I51b31a706a0b9517404e83224c89de145fd5d7e1 updates: #430 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Update layout in inode only on success	N Balachandran	2018-04-02	2	-4/+24
With lookup-optimize enabled, gf_defrag_settle_hash in rebalance sometimes flips the on-disk layout on the volume root after all files in the directory have been migrated. This is sometimes seen when attempting to fix the layout of a directory multiple times before calling gf_defrag_settle_hash. dht_fix_layout_of_directory generates a new layout in memory but updates it in the inode ctx before it is set on disk. The layout may be different the second time around due to dht_selfheal_layout_maximize_overlap. If the layout is then not written to disk, the inode contains the wrong layout. gf_defrag_settle_hash does not check the correctness of the layout in the inode before updating the commit-hash and writing it to disk, thus changing the layout of the directory. Change-Id: Ie1407d92982518f2a0c40ec70ad370b34a87b4d4 updates: bz#1557435 Signed-off-by: N Balachandran <nbalacha@redhat.com>
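A sketch of the ordering the commit title calls for: persist the new layout first and publish it in the inode ctx only if the write succeeded, so the in-memory copy never holds a layout that is not on disk. `struct layout_sketch`, `struct dir_sketch`, `fix_layout` and the two writer callbacks are hypothetical and much simpler than the real dht layout.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified layout: one hash range per subvol. */
struct layout_sketch {
    unsigned int start[4];
    unsigned int stop[4];
    int cnt;
};

/* Hypothetical directory state: inode-ctx copy and on-disk copy. */
struct dir_sketch {
    struct layout_sketch inode_ctx;
    struct layout_sketch on_disk;
};

typedef int (*write_layout_fn)(struct dir_sketch *dir,
                               const struct layout_sketch *new_layout);

/* Write first; update the inode ctx only on success. */
static int fix_layout(struct dir_sketch *dir,
                      const struct layout_sketch *new_layout,
                      write_layout_fn write_layout)
{
    int ret = write_layout(dir, new_layout);
    if (ret < 0)
        return ret;              /* inode ctx left untouched */
    dir->inode_ctx = *new_layout;
    return 0;
}

/* Example writers used to exercise both paths. */
static int write_ok(struct dir_sketch *dir, const struct layout_sketch *l)
{
    dir->on_disk = *l;
    return 0;
}

static int write_fail(struct dir_sketch *dir, const struct layout_sketch *l)
{
    (void)dir;
    (void)l;
    return -1;
}
```

If the second fix-layout attempt computes a different layout but fails to write it, the inode ctx still matches what is on disk, so a later commit-hash update cannot flip the directory's layout.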
*	cluster/dht: ENOSPC will not fail rebalance	N Balachandran	2018-03-28	1	-6/+2
ENOSPC returned by a file migration is no longer considered a rebalance failure. Change-Id: I21cf3a8acdc827bc478e138d6cb5db649d53a28c fixes: bz#1553598 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	Quota: heal directory on newly added bricks when quota limit is reached	Sanoj Unnikrishnan	2018-03-28	1	-1/+16
Problem: If a lookup is done on a newly added brick for a path on which the quota limit has been reached, the lookup fails to heal the directory tree due to quota. Solution: Tag the lookup as an internal fop and ignore it in quota. Since marking a fop as internal does not usually give enough contextual information, new flags are introduced to pass the contextual info, and dict_check_flag and dict_set_flag are added to aid flag operations. A flag is a single bit in a bit array (currently limited to 256 bits). Change-Id: Ifb6a68bcaffedd425dd0f01f7db24edd5394c095 fixes: bz#1505355 BUG: 1505355 Signed-off-by: Sanoj Unnikrishnan <sunnikri@redhat.com>
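A sketch of "a flag is a single bit in a bit array (currently limited to 256 bits)". It does not reproduce the signatures of dict_set_flag/dict_check_flag; `struct flag_set`, `flag_set_bit` and `flag_check_bit` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FLAG_BITS 256 /* "currently limited to 256 bits" */

/* Hypothetical flag set: 256 bits packed into 32 bytes. */
struct flag_set {
    uint8_t bits[FLAG_BITS / 8];
};

/* Sets one flag; returns false if the flag number is out of range. */
static bool flag_set_bit(struct flag_set *fs, unsigned int flag)
{
    if (flag >= FLAG_BITS)
        return false;
    fs->bits[flag / 8] |= (uint8_t)(1u << (flag % 8));
    return true;
}

/* Tests one flag; out-of-range flags read as unset. */
static bool flag_check_bit(const struct flag_set *fs, unsigned int flag)
{
    if (flag >= FLAG_BITS)
        return false;
    return (fs->bits[flag / 8] >> (flag % 8)) & 1u;
}
```

Flag number f lives in byte f / 8 at bit f % 8, so the whole set fits in a fixed 32-byte value that is cheap to carry alongside a fop.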
*	cluster/dht: Skipped files are not treated as errors	N Balachandran	2018-03-12	1	-9/+11
For skipped files, use a return value of 1 to prevent error messages from being logged. Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861 BUG: 1553598 Signed-off-by: N Balachandran <nbalacha@redhat.com>
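A sketch of how a caller might account for a three-way migration result, where 1 means "skipped" and is neither a success nor an error. `struct migrate_stats` and `account_migration` are hypothetical; the 0 / 1 / negative convention is the assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical rebalance counters. */
struct migrate_stats {
    unsigned long rebalanced;
    unsigned long skipped;
    unsigned long failures;
};

/* ret: 0 = migrated, 1 = skipped (no error logged), < 0 = failure. */
static void account_migration(struct migrate_stats *st, int ret)
{
    assert(ret <= 1); /* other positive values are not expected here */
    if (ret == 0)
        st->rebalanced++;
    else if (ret == 1)
        st->skipped++;
    else
        st->failures++;
}
```

Keeping skips in their own counter means they are neither reported as failures nor added to the rebalanced count.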
*	cluster/dht: store the 'reaction' on failures per lock	Raghavendra G	2018-02-23	6	-38/+46
Currently it is passed to dht_blocking_inode(entry)lk as a single value that applies to all the locks passed in the argument. This is a limitation for cases where we want to ignore failures on only a few locks and fail for others. Change-Id: I02cfbcaafb593ad8140c0e5af725c866b630fb6b BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
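A sketch of storing the failure "reaction" with each lock rather than passing one value for the whole batch, so a single call can tolerate ENOENT/ESTALE on some locks while failing on others. The enum values, `struct lock_result` and `evaluate_lock_results` are hypothetical names, not the dht definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-lock reaction. */
enum lock_reaction {
    REACT_FAIL_ON_ANY_ERROR,
    REACT_IGNORE_ENOENT_ESTALE,
};

struct lock_result {
    enum lock_reaction reaction;
    int op_ret;   /* 0 on success, -1 on failure */
    int op_errno;
};

/* Returns 0 if the batch may proceed, otherwise the errno of the
 * first failure that its own reaction does not allow ignoring. */
static int evaluate_lock_results(const struct lock_result *res, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (res[i].op_ret == 0)
            continue;
        if (res[i].reaction == REACT_IGNORE_ENOENT_ESTALE &&
            (res[i].op_errno == ENOENT || res[i].op_errno == ESTALE))
            continue;
        return res[i].op_errno;
    }
    return 0;
}
```

An ESTALE on a lock marked "ignore" is tolerated, while the same batch still fails if a lock marked "fail on any error" returns ENOENT.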