summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/dht/src/dht-common.c
Commit message (Collapse)AuthorAgeFilesLines
...
* xlators/cluster/dht/src/dht-common.c :move to GF_MALLOC() instead of ↵Yaniv Kaul2018-09-071-2/+2
| | | | | | | | | | | | | | | | | | GF_CALLOC() when It doesn't make sense to calloc (allocate and clear) memory when the code right away fills that memory with data. It may be optimized by the compiler, or have a microscopic performance improvement. Please review carefully, especially for string allocation, with the terminating NULL string (added another byte to ensure it's there). Only compile-tested! Change-Id: Ia5e4f50dfb0c29809c2019fcfd8079507813249e updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* multiple xlators: strncpy()->sprintf(), reduce strlen()'sYaniv Kaul2018-09-071-12/+10
| | | | | | | | | | | | | | | | | | | | | | | xlators/cluster/afr/src/afr-common.c xlators/cluster/dht/src/dht-common.c xlators/cluster/dht/src/dht-rebalance.c xlators/cluster/stripe/src/stripe-helpers.c strncpy may not be very efficient for short strings copied into a large buffer: If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written. Instead, use snprintf(). Also: - save the result of strlen() and re-use it when possible. - move from strlen to SLEN (sizeof() ) for const strings. Compile-tested only! Change-Id: Icdf79dd3d9f9ff120e4720ff2b8bd016df575c38 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* multiple files: calloc -> mallocYaniv Kaul2018-09-041-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xlators/cluster/stripe/src/stripe-helpers.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible xlators/cluster/dht/src/tier.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible xlators/cluster/dht/src/dht-layout.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible xlators/cluster/dht/src/dht-helper.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible xlators/cluster/dht/src/dht-common.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible xlators/cluster/afr/src/afr.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible xlators/cluster/afr/src/afr-inode-read.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible tests/bugs/replicate/bug-1250170-fsync.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible tests/basic/gfapi/gfapi-async-calls-test.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible tests/basic/ec/ec-fast-fgetxattr.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible rpc/xdr/src/glusterfs3.h: Move to GF_MALLOC() instead of GF_CALLOC() when possible rpc/rpc-transport/socket/src/socket.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible rpc/rpc-lib/src/rpc-clnt.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible extras/geo-rep/gsync-sync-gfid.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-xml-output.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-rpc-ops.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-cmd-volume.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-cmd-system.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-cmd-snapshot.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-cmd-peer.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible cli/src/cli-cmd-global.c: Move to GF_MALLOC() instead of GF_CALLOC() when possible It doesn't make sense to calloc (allocate and clear) memory when the code right away fills that memory with data. It may be optimized by the compiler, or have a microscopic performance improvement. In some cases, also changed allocation size to be sizeof some struct or type instead of a pointer - easier to read. In some cases, removed redundant strlen() calls by saving the result into a variable. 1. Only done for the straightforward cases. There's room for improvement. 2. Please review carefully, especially for string allocation, with the terminating NULL string. Only compile-tested! updates: bz#1193929 Original-Author: Yaniv Kaul <ykaul@redhat.com> Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Signed-off-by: Amar Tumballi <amarts@redhat.com> Change-Id: I16274dca4078a1d06ae09a0daf027d734b631ac2
* xlators/cluster/dht/src/dht-common.c: simplify some if statementsYaniv Kaul2018-08-311-48/+52
| | | | | | | | | | | | Use goto when some vars are not set to simplify long if statements. Please review logic has not changed! Compile-tested only! updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I45ea2e906d0ccb468af5e1fa65db008edb00d734
* multiple files: remove unndeeded memset()Yaniv Kaul2018-08-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a squash of multiple commits: contrib/fuse-lib/misc.c: remove unneeded memset() All flock variables are properly set, no need to memset it. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I8e0512c5a88daadb0e587f545fdb9b32ca8858a2 libglusterfs/src/{client_t|fd|inode|stack}.c: remove some memset() I don't think there's a need for any of them. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I2be9ccc3a5cb5da51a92af73488cdabd1c527f59 libglusterfs/src/xlator.c: remove unneeded memset() All xl->mem_acct members are properly set, no need to memset it. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I7f264cd47e7a06255a3f3943c583de77ae8e3147 xlators/cluster/afr/src/afr-self-heal-common.c: remove unneeded memset() Since we are going over the whole array anyway, initialize it properly, to either 1 or 0. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: Ied4210388976b6a7a2e91cc3de334534d6fef201 xlators/cluster/dht/src/dht-common.c: remove unneeded memset() Since we are going over the whole array anyway it is initialized properly. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: Idc436d2bd0563b6582908d7cbebf9dbc66a42c9a xlators/cluster/ec/src/ec-helpers.c: remove unneeded memset() Since we are going over the whole array anyway it is initialized properly. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I81bf971f7fcecb4599e807d37f426f55711978fa xlators/mgmt/glusterd/src/glusterd-volgen.c: remove some memset() I don't think there's a need for any of them. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I476ea59ba53546b5153c269692cd5383da81ce2d xlators/mgmt/glusterd/src/glusterd-geo-rep.c: read() in 4K blocks The current 1K seems small. 4K is usually better (in Linux). Also remove a memset() that I don't think is needed between reads. Only compile-tested! Signed-off-by: Yaniv Kaul <ykaul@redhat.com> Change-Id: I5fb7950c92d282948376db14919ad12e589eac2b xlators/storage/posix/src/posix-{gfid-path|inode-fd-ops}.c: remove memset() before sys_*xattr() functions. I don't see a reason to memset the array sent to the functions sys_llistxattr(), sys_lgetxattr(), sys_lgetxattr(), sys_flistxattr(), sys_fgetxattr(). (Note: it's unclear to me why we are calling sys_*txattr() functions with XATTR_VAL_BUF_SIZE-1 size instead of XATTR_VAL_BUF_SIZE ). Only compile-tested! Change-Id: Ief2103b56ba6c71e40ed343a93684eef6b771346 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/dht: coverity fixesN Balachandran2018-08-211-12/+13
| | | | | | | | | Fixes 1133997, 1370910, 1382387, 1382444, 1394635 Change-Id: Ie63ad47abd5519b9b9536da26b61ed4c9eaf2c75 updates: bz#789278 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* performance/readdir-ahead: keep stats of cached dentries in sync with ↵Krutika Dhananjay2018-08-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | modifications PROBLEM: Stats of dentries that are readdirp'd ahead can become stale due to fops like writes, truncate etc that modify the file pointed by dentries. When a readdir is finally wound at offset corresponding to these entries, the iatts that are returned to the application come from readdir-ahead's cache, which are stale by now. This problem gets further aggravated when caching translators/modules cache and continue to serve this stale information. FIX: * Store the iatt in context of the inode pointed by dentry. * Whenever the inode pointed by dentry undergoes modification, in cbk of modification fop, update the iatt stored in inode-ctx to reflect the modification. * When serving a readdirp response from application, update iatts of dentries with the iatts stored in the context of inodes pointed by these dentries. * Some fops don't have valid iatts in their responses. For eg., write response whose data is still cached in write-behind will have zeroed out stat. In this case keep only ia_type and ia_gfid and reset rest of the iatt members to zero. - fuse-bridge in this case just sends "entry" information back to kernel and attr is not sent. - gfapi sets entry->inode to NULL and zeroes out the entire stat * There is one tiny race between the entry creation and a readdirp on its parent dir, which could cause the inode-ctx setting and inode ctx reading to happen on two different inode objects. To prevent this, when entry->inode doesn't eqaul to linked_inode, - fuse-bridge is made to send only "entry" information without attributes - gfapi sets entry->inode to NULL and zeroes out the entire stat. Change-Id: Ia27ff49a61922e88c73a1547ad8aacc9968a69df BUG: 1390050 Updates: bz#1390050 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* All: remove memset() before sprintf()Yaniv Kaul2018-08-141-1/+1
| | | | | | | | | | | | It's not needed. There's a good chance the compiler is smart enough to remove it anyway, but it can't hurt - I hope. Compile-tested only! Change-Id: Id7c054e146ba630227affa591007803f3046416b updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* Revert "performance/readdir-ahead: Invalidate cached dentries if they're ↵Raghavendra G2018-08-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | modified while in cache" This reverts commit 7131de81f72dda0ef685ed60d0887c6e14289b8c. With the latest master, I created a single brick volume and some files inside it. [root@rhgs313-6 ~]# umount -f /mnt/fuse1; mount -t glusterfs -s 192.168.122.6:/thunder /mnt/fuse1; ls -l /mnt/fuse1/; echo "Trying again"; ls -l /mnt/fuse1 umount: /mnt/fuse1: not mounted total 0 ----------. 0 root root 0 Jan 1 1970 file-1 ----------. 0 root root 0 Jan 1 1970 file-2 ----------. 0 root root 0 Jan 1 1970 file-3 ----------. 0 root root 0 Jan 1 1970 file-4 ----------. 0 root root 0 Jan 1 1970 file-5 d---------. 0 root root 0 Jan 1 1970 subdir Trying again total 3 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-1 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-2 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-3 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-4 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-5 d---------. 0 root root 0 Jan 1 1970 subdir [root@rhgs313-6 ~]# Conversation can be followed on gluster-devel on thread with subj: tests/bugs/distribute/bug-1122443.t - spurious failure. git-bisected pointed this patch as culprit. Change-Id: I1eb46f6c196f44fde8ce991840a0e724e6f50862 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Updates: bz#1390050
* performance/readdir-ahead: Invalidate cached dentries if they're modified ↵Krutika Dhananjay2018-07-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | while in cache PROBLEM: Entries that are readdirp'd ahead can undergo modification in terms of writes, truncates which could modify their iatts. When a readdir is finally wound at offset corresponding to these entries, the iatts that are returned to the application come from readdir-ahead's cache, which are stale by now. This problem gets further aggravated when caching translators/modules cache and continue to serve this stale information. FIX: Whenever a dentry undergoes modification, in the cbk of the modification fop, a "dirty" flag (default 0) is set in its inode ctx. When it's time for readdir-ahead to serve these entries, it will read the inode ctx and check if the entry is "dirty", and if it is, set the entry's attrs to all zeroes, as an indicator to fuse, md-cache etc not to cache these attributes. Also there is one tiny race between the entry creation and a readdirp on its parent dir, which could cause the inode-ctx setting and inode ctx reading to happen on two different inode objects. To prevent this, fuse-bridge is made to drop entries for which dentry->inode is not the same as linked inode, in readdirp cbk. Change-Id: If7396507632b5268442ca580473d5155fee9cbef BUG: 1390050 Updates: bz#1390050 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* All: run codespell on the code and fix issues.Yaniv Kaul2018-07-221-8/+8
| | | | | | | | | | | | Please review, it's not always just the comments that were fixed. I've had to revert of course all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/dht: Set loc->gfid before healing attrN Balachandran2018-07-181-1/+2
| | | | | | | | | | | | | AFR takes inodelks when setting attrs. The loc->gfid and loc->inode->gfid were both null when dht_dir_attr_heal was called during a fresh lookup of an existing directory. As the gfid is null, client_pre_inodelk asserts in the gfid check. We now set the loc->gfid before calling dht_dir_attr_heal. Change-Id: I457f5a73fd301d97a03ca032587e73d4803298ac fixes: bz#1602866 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* dht: remove useless argument from dht_iatt_mergeKinglong Mee2018-07-181-28/+24
| | | | | | | | | | The last using of the subvol argument has been removed at 4e1ec35ef4f7 ("core: fill 'ia_ino' from 'ia_gfid' in 'storage/posix' ......") 7 years ago (2011-06-16). Change-Id: I9788d79e2e40cc153cf2960e28c7c1c1033dc8f7 fixes: bz#1601683 Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
* dht: delete tier related internal xattr in dht_getxattr_cbkSunny Kumar2018-07-171-2/+1
| | | | | | | | | | | | | | Use dict_del instead of GF_REMOVE_INTERNAL_XATTR. For problem and fix related information see here - https://review.gluster.org/20450. This patch have some modification as requested by reviewers on already merged patch : https://review.gluster.org/20450. Change-Id: I50c263e3411354bb9c1e028b64b9ebfd755dfe37 fixes: bz#1597563 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
* dht: delete tier related internal xattr in dht_getxattr_cbkSunny Kumar2018-07-161-0/+15
| | | | | | | | | | | | | | Problem : Hot and Cold tier brick changelogs report rsync failure Solution : georep session is failing to sync directory from master volume to slave volume due to lot of changelog retries, solution would be to ignore tier related internal xattrs trusted.tier.fix.layout.complete and trusted.tier.tier-dht.commithash in dht_getxattr_cbk. Change-Id: I3530ffe7c4157584b439486f33ecd82ed8d66aee fixes: bz#1597563 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
* md-cache: Do not invalidate cache post set/remove xattrPoornima G2018-07-111-2/+2
| | | | | | | | | | | | | | | Since setxattr and removexattr fops cbk do not carry poststat, the stat cache was being invalidated in setxatr/remoxattr cbk. Hence the further lookup wouldn't be served from cache. To prevent this invalidation, md-cache is modified to get the poststat in set/removexattr_cbk in dict. Co-authored with Xavi Hernandez. Change-Id: I6b946be2d20b807e2578825743c25ba5927a60b4 fixes: bz#1586018 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Signed-off-by: Poornima G <pgurusid@redhat.com>
* dht: Inconsistent permission for directories after brick stop/startMohit Agrawal2018-07-111-8/+73
| | | | | | | | | | | | | | | | Problem: Inconsistent access permissions on directories after bringing back the down sub-volumes, in case of directories dht_setattr first wind a call on MDS once call is finished on MDS then wind a call on NON-MDS.At the time of revalidating dht just compare the uid/gid with stbuf uid/gid and if anyone differs set a flag to heal the same. Solution: Add a condition to compare permission also in dht_revalidate_cbk to set a flag to call dht_dir_attr_heal. BUG: 1584517 Change-Id: I3e039607148005015b5d93364536158380d4c5aa fixes: bz#1584517 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/dht: Do not try to use up the readdirp bufferN Balachandran2018-06-291-46/+66
| | | | | | | | | | | | | | | | | | | | | | | | | DHT attempts to use up the entire buffer in readdirp before unwinding in an attempt to reduce the number of calls. However, this has 2 disadvantages: 1. This can cause a stack overflow when parallel readdir is enabled. If the buffer only has a little space,rda can send back only one or two entries. If those entries are stripped out by dht_readdirp_cbk (linkto files for example) it will once again wind down to rda in an attempt to fill the buffer before unwinding to FUSE. This process can continue for several iterations, causing the stack to grow and eventually overflow, causing the process to crash. 2. If parallel readdir is disabled, dht could send readdirp calls with small buffers to the bricks, thus increasing the number of network calls. We are therefore reverting to the existing behaviour. Please note, this only mitigates the stack overflow, it does not prevent it from happening. This is still possible if a subvol has thousands of linkto files for instance. Change-Id: I291bc181c5249762d0c4fe27fa4fc2631166adf5 fixes: bz#1593548 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: refactor dht_lookupN Balachandran2018-06-211-211/+321
| | | | | | | | | | The dht lookup code is getting difficult to maintain due to its size. Refactoring the code will make it easier to modify it in future. Change-Id: Ic7cb5bf4f018504dfaa7f0d48cf42ab0aa34abdd updates: bz#1590385 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Minor code cleanupN Balachandran2018-06-201-15/+13
| | | | | | | | Removed extra variable. Change-Id: If43c47f6630454aeadab357a36d061ec0b53cdb5 updates: bz#1590385 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Leverage MDS subvol for dht_removexattr alsoMohit Agrawal2018-06-111-60/+190
| | | | | | | | | | | | | Problem: In a distributed volume situation can be arise when custom extended attributed are not removed from all bricks after stop/start or added newly brick. Solution: To resolve the same use MDS subvol for remove xattr also BUG: 1575587 Change-Id: I7701e0d3833e3064274cb269f26061bff9b71f50 fixes: bz#1575587 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* dht: Delete MDS internal xattr from dict in dht_getxattr_cbkMohit Agrawal2018-06-031-0/+4
| | | | | | | | | | | | | | Problem: At the time of fetching xattr to heal xattr by afr it is not able to fetch xattr because posix_getxattr has a check to ignore if xattr name is MDS Solution: To ignore same xattr update a check in dht_getxattr_cbk instead of having a check in posix_getxattr BUG: 1584098 Change-Id: I86cd2b2ee08488cb6c12f407694219d57c5361dc fixes: bz#1584098 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* dht: Excessive 'dict is null' logs in dht_revalidate_cbkMohit Agrawal2018-05-291-2/+4
| | | | | | | | | | | | | Problem: In case of error(ESTALE/ENOENT) dht_revalidate_cbk throws "dict is null" error because xattr is not available Solution: To avoid the logs update condition in dht_revalidate_cbk and dht_lookup_dir_cbk BUG: 1583565 Change-Id: Ife6b3eeb6d91bf24403ed3100e237bb5d15b4357 fixes: bz#1583565 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* dht: Excessive 'dict is null' logs in dht_discover_completeMohit Agrawal2018-05-111-1/+2
| | | | | | | | | | | | Problem: In Geo-Rep setup excessive "dict is null" logs in dht_discover_complete while xattr is NULL Solution: To avoid the logs update a condition in dht_discover_complete BUG: 1576767 Change-Id: Ic7aad712d9b6d69b85b76e4fdf2881adb0512237 fixes: bz#1576767 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/dht: Debug virtual xattrs for dhtN Balachandran2018-05-091-0/+95
| | | | | | | | | Provide a virtual xattr with which to query the hashed subvol for a file. Change-Id: Ic7abd031f875da4b9084841ea7c25d6c8a851992 fixes: bz#1574421 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Debug logs in dht_readdir(p)_cbkN Balachandran2018-05-091-2/+27
| | | | | | | | | Additional log messages to help debug issues with file listings. Change-Id: Iccd07498ba01d597c0c40f026f4177dd06d7e901 fixes: bz#1575887 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* dht: Avoid dict log flooding for internal MDS xattrMohit Agrawal2018-05-081-1/+1
| | | | | | | | | | | | | | Problem: Before populate MDS internal xattr first dht checks if MDS is present in xattr or not.If xattr dictionary is NULL dict_get log the message either dict or key is NULL Solution: Before call dict_get check xattr, if it is NULL then no need to call dict_get. BUG: 1575910 Change-Id: I81604ec5945b85eba14b42f4583d06ec713028f4 fixes: bz#1575910 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/dht: fixes to parallel renames to same destination codepathRaghavendra G2018-05-071-17/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Test case: # while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f || break; echo "done:$uuid"; done This script was run in parallel from multiple mountpoints Along the course of getting the above usecase working, many issues were found: Issue 1: ======= consider a case of rename (src, dst). We can encounter a situation where, * dst is a file present at the time of lookup * dst is removed by the time rename fop reaches glusterfs In this scenario, acquring inodelk on dst fails with ESTALE resulting in failure of rename. However, as per POSIX irrespective of whether dst is present or not, rename should be successful. Acquiring entrylk provides synchronization even in races like this. Algorithm: 1. Take inodelks on src and dst (if dst is present) on respective cached subvols. These inodelks are done to preserve backward compatibility with older clients, so that synchronization is preserved when a volume is mounted by clients of different versions. Once relevant older versions (3.10, 3.12, 3.13) reach EOL, this code can be removed. 2. Ignore ENOENT/ESTALE errors of inodelk on dst. 3. protect namespace of src and dst. To protect namespace of a file, take inodelk on parent on hashed subvol, then take entrylk on the same subvol on parent with basename of file. inodelk on parent is done to guard against changes to parent layout so that hashed subvol won't change during rename. 4. <rest of rename continues> 5. unlock all locks Issue 2: ======== linkfile creation in lookup codepath can race with a rename. Imagine the following scenario: * lookup finds a data-file with gfid - gfid-dst - without a corresponding linkto file on hashed-subvol. It decides to create linkto file with gfid - gfid-dst. - Note that some codepaths of dht-rename deletes linkto file of dst as first step. So, a lookup racing with an in-progress rename can easily run into this situation. * a rename (src-path:gfid-src, dst-path:gfid-dst) renames data-file and hence gfid of data-file changes to gfid-src with path dst-path. * lookup proceeds and creates linkto file - dst-path - with gfid - dst-gfid - on hashed-subvol. * rename tries to create a linkto file dst-path with src-gfid on hashed-subvol, but it fails with EEXIST. But EEXIST is ignored during linkto file creation. Now we've ended with dst-path having different gfids - dst-gfid on linkto file and src-gfid on data file. Future lookups on dst-path will always fail with ESTALE, due to differing gfids. The fix is to synchronize linkfile creation in lookup path with rename using the same mechanism of protecting namespace explained in solution of Issue 1. Once locks are acquired, before proceeding with linkfile creation, we check whether conditions for linkto file creation are still valid. If not, we skip linkto file creation. Issue 3: ======== gfid of dst-path can change by the time locks are acquired. This means, either another rename overwrote dst-path or dst-path was deleted and recreated by a different client. When this happens, cached-subvol for dst can change. If rename proceeds with old-gfid and old-cached subvol, we'll end up in inconsistent state(s) like dst-path with different gfids on different subvols, more than one data-file being present etc. Fix is to do the lookup with a new inode after protecting namespace of dst. Post lookup, we've to compare gfids and correct local state appropriately to be in sync with backend. Issue 4: ======== During revalidate lookup, if following a linkto file doesn't lead to a valid data-file, local->cached-subvol was not reset to NULL. This means we would be operating on a stale state which can lead to inconsistency. As a fix, reset it to NULL before proceeding with lookup everywhere. Issue 5: ======== Stale dentries left out in inode table on brick resulted in failures of link fop even though the file/dentry didn't exist on backend fs. A patch is submitted to fix this issue. Please check the dependency tree of current patch on gerrit for details In short, we fix the problem by not blindly trusting the inode-table. Instead we validate whether dentry is present by doing lookup on backend fs. Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165 updates: bz#1543279 BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Wind open to all subvolsN Balachandran2018-04-111-10/+5
| | | | | | | | | | dht_opendir should wind the open to all subvols whether or not local->subvols is set. This is because dht_readdirp winds the calls to all subvols. Change-Id: I67a96b06dad14a08967c3721301e88555aa01017 updates: bz#1564198 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Update layout in inode only on successN Balachandran2018-04-021-1/+24
| | | | | | | | | | | | | | | | | | | | | With lookup-optimize enabled, gf_defrag_settle_hash in rebalance sometimes flips the on-disk layout on volume root post the migration of all files in the directory. This is sometimes seen when attempting to fix the layout of a directory multiple times before calling gf_defrag_settle_hash. dht_fix_layout_of_directory generates a new layout in memory but updates it in the inode ctx before it is set on disk. The layout may be different the second time around due to dht_selfheal_layout_maximize_overlap. If the layout is then not written to the disk, the inode now contains the wrong layout. gf_defrag_settle_hash does not check the correctness of the layout in the inode before updating the commit-hash and writing it to the disk thus changing the layout of the directory. Change-Id: Ie1407d92982518f2a0c40ec70ad370b34a87b4d4 updates: bz#1557435 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: store the 'reaction' on failures per lockRaghavendra G2018-02-231-5/+6
| | | | | | | | | | | Currently its passed in dht_blocking_inode(entry)lk, which would be a global value for all the locks passed in the argument. This would be a limitation for cases where we want to ignore failures on only few locks and fail for others. Change-Id: I02cfbcaafb593ad8140c0e5af725c866b630fb6b BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Handle single dht child in dht_lookupN Balachandran2018-02-221-0/+13
| | | | | | | | | | | | | | | This patch limits itself to only handling the case where no file (data or linkto) exists on the subvol. Additional cases to be handled: 1. A linkto file was found on the only child subvol. This currently calls dht_lookup_everywhere which eventually deletes it. It can be deleted directly as it will not be pointing to a valid subvol. 2. Directory lookups - locking might be unnecessary in some cases. Change-Id: I940ba34531f2aaee1d36fd9ca45ecfd46be662a4 BUG: 1546620 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* performance/io-threads: expose io-thread queue depthsVarsha Rao2018-02-081-1/+4
| | | | | | | | | | | | | | | | | | | | The following release-3.8-fb branch patch is upstreamed: > io-stats: Expose io-thread queue depths > Commit ID: 69509ee7d2 > https://review.gluster.org/#/c/18143/ > By Shreyas Siravara <sshreyas@fb.com> Changes in this patch: - Replace iot_pri_t with gf_fop_pri_t - Replace IOT_PRI_{HI, LO, NORMAL, MAX, LEAST} with GF_FOP_PRI_{HI, LO, NORMAL, MAX, LEAST} - Use dict_unref() instead of dict_destroy() This patch is required to forward port io-threads namespace patch. Updates: #401 Change-Id: I1b47a63185a441a30fbc423ca1015df7b36c2518 Signed-off-by: Varsha Rao <varao@redhat.com>
* cluster/dht: Unlink linkto files as rootN Balachandran2018-02-061-3/+7
| | | | | | | | | | | Non-privileged users cannot delete linkto files. However the failure to unlink a stale linkto causes DHT to fail the lookup with EIO and hence prevent access to the file. Change-Id: Id295362d41e52263790694602f36f1219f0646a2 BUG: 1542318 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Fixed a leak in inode_refN Balachandran2018-02-021-3/+2
| | | | | | | | Introduced by commit d9f773ba719397c128 Change-Id: I3f3103a5a80daed7562ace72e5aa53b77e74fb94 BUG: 1541264 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Fixed leak in dht_populate_inode_for_dentryN Balachandran2018-02-021-3/+9
| | | | | | | | | | Fixed an issue in dht_populate_inode_for_dentry where a layout is set in the inode without checking if it is already set. This overwrites the value each time without freeing the already existing layout. Change-Id: I651bf539a0b82b4ddc4c355890c16a8e91f5f1fd BUG: 1541264 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Skip '..' for the volume root dirN Balachandran2018-01-241-0/+5
| | | | | | | | | | | | dht_populate_inode_for_dentry tries to update the layout for the '..' entry when listing the root of the volume. This entry does not correspond to an entry in the volume and therefore does not have a gfid or a layout on disk, causing layout processing to fail. Change-Id: I2b7470e1c5e20d87b5545160697f24d041045140 BUG: 1537457 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* core: fix some of the dict_{get,set} with proper APIsAmar Tumballi2018-01-171-2/+2
| | | | | | | updates #220 Change-Id: I6e25dbb69b2c7021e00073e8f025d212db7de0be Signed-off-by: Amar Tumballi <amarts@redhat.com>
* cluster/dht: Add migration checks to dht_(f)xattropN Balachandran2017-12-261-0/+47
| | | | | | | | | | | | The dht_(f)xattrop implementation did not implement migration phase1/phase2 checks which could cause issues with rebalance on sharded volumes. This does not solve the issue where fops may reach the target out of order. Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c BUG: 1471031 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* dht: Fill first_up_subvol before use in dht_opendirPoornima G2017-12-151-0/+5
| | | | | | | | Reported by: Sam McLeod Change-Id: Ic8f9b46b173796afd70aff1042834b03ac3e80b2 BUG: 1512437 Signed-off-by: Poornima G <pgurusid@redhat.com>
* cluster/dht: Fix build error due to switch statement on a booleanShreyas Siravara2017-12-051-16/+5
| | | | | Change-Id: Idf672b435e389baada732f609398404479306909 BUG: 1520974
* cluster/dht: populate inode in dentry for single subvolume dhtRaghavendra G2017-12-021-1/+66
| | | | | | | | | | ... in readdirp response if dentry points to a directory inode. This is a special case where the entire layout is stored in one single subvolume and hence no need for lookup to construct the layout Change-Id: I44fd951e2393ec9dac2af120469be47081a32185 BUG: 1492625 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: don't overfill the buffer in readdir(p)Raghavendra G2017-12-021-3/+18
| | | | | | | | | | | | | | | | | | | | Superflous dentries that cannot be fit in the buffer size provided by kernel are thrown away by fuse-bridge. This means, * the next readdir(p) seen by readdir-ahead would have an offset of a dentry returned in a previous readdir(p) response. When readdir-ahead detects non-monotonic offset it turns itself off which can result in poor readdir performance. * readdirp can be cpu-intensive on brick and there is no point to read all those dentries just to be thrown away by fuse-bridge. So, the best strategy would be to fill the buffer optimally - neither overfill nor underfill. Change-Id: Idb3d85dd4c08fdc4526b2df801d49e69e439ba84 BUG: 1492625 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Serialize mds update code path with lookup unwind in selfhealMohit Agrawal2017-11-221-141/+179
| | | | | | | | | | | | | | | | Problem: Sometime test case ./tests/bugs/bug-1371806_1.t is failing on centos due to race condition between fresh lookup and setxattr fop. Solution: In selfheal code path we do save mds on inode_ctx, it was not serialize with lookup unwind. Due to this behavior after lookup unwind if mds is not saved on inode_ctx and if any subsequent setxattr fop call it has failed with ENOENT because no mds has found on inode ctx.To resolve it save mds on inode ctx has been serialize with lookup unwind. BUG: 1498966 Change-Id: I8d4bb40a6cbf0cec35d181ec0095cc7142b02e29 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/dht: dead code coverity fixKartik_Burmee2017-11-211-3/+0
| | | | | | | | | | | | issue: Execution cannot reach this statement: "call_stub_destroy(stub);" function: dht_mkdir_hashed_cbk fix: removed the statement and the corresponding 'if' condition block. Change-Id: I3e31056ee489ede6864e51a8e666edc7da3c175f BUG: 789278 Signed-off-by: Kartik_Burmee <kburmee@redhat.com>
* cluster/dht: fix crash when deleting directoriesZhang Huan2017-10-161-2/+4
| | | | | | | | | | | | | | | | | | | | | In DHT, after locks on all subvolumes are acquired, it would perform the following steps sequentially, 1. send remove dir on all other subvolumes except the hashed one in a loop; 2. wait for all pending rmdir to be done 3. remove dir on the hashed subvolume The problem is that in step 1 there is a check to skip hashed subvolume in the loop. If the last subvolume to check is actually the hashed one, and step 3 is quickly done before the last and hashed subvolume is checked, by accessing shared context data be destroyed in step 3, would cause a crash. Fix by saving shared data in a local variable to access later in the loop. Change-Id: I8db7cf7cb262d74efcb58eb00f02ea37df4be4e2 BUG: 1490642 Signed-off-by: Zhang Huan <zhanghuan@open-fs.com>
* cluster/dht: Don't store the entire uuid for subvolsN Balachandran2017-10-101-7/+13
| | | | | | | | | | | | Comparing the uuid string of the local node against that stored in the local_subvol information is inefficient, especially as it is done for every file to be migrated. The code has now been changed to set the value of info to 1 if the nodeuuid is that of the node making the comparison so this becomes an integer comparison. Change-Id: I7491d59caad3b71dbf5facc94dcde0cd53962775 BUG: 1451434 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht : User xattrs are not healed after brick stop/startMohit Agrawal2017-10-041-63/+1275
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In a distributed volume custom extended attribute value for a directory does not display correct value after stop/start or added newly brick. If any extended(acl) attribute value is set for a directory after stop/added the brick the attribute(user|acl|quota) value is not updated on brick after start the brick. Solution: First store hashed subvol or subvol(has internal xattr) on inode ctx and consider it as a MDS subvol.At the time of update custom xattr (user,quota,acl, selinux) on directory first check the mds from inode ctx, if mds is not present on inode ctx then throw EINVAL error to application otherwise set xattr on MDS subvol with internal xattr value of -1 and then try to update the attribute on other non MDS volumes also.If mds subvol is down in that case throw an error "Transport endpoint is not connected". In dht_dir_lookup_cbk| dht_revalidate_cbk|dht_discover_complete call dht_call_dir_xattr_heal to heal custom extended attribute. In case of gnfs server if hashed subvol has not found based on loc then wind a call on all subvol to update xattr. Fix: 1) Save MDS subvol on inode ctx 2) Check if mds subvol is present on inode ctx 3) If mds subvol is down then call unwind with error ENOTCONN and if it is up then set new xattr "GF_DHT_XATTR_MDS" to -1 and wind a call on other subvol. 4) If setxattr fop is successful on non-mds subvol then increment the value of internal xattr to +1 5) At the time of directory_lookup check the value of new xattr GF_DHT_XATTR_MDS 6) If value is not 0 in dht_lookup_dir_cbk(other cbk) functions then call heal function to heal user xattr 7) syncop_setxattr on hashed_subvol to reset the value of xattr to 0 if heal is successful on all subvol. Test : To reproduce the issue followed below steps 1) Create a distributed volume and create mount point 2) Create some directory from mount point mkdir tmp{1..5} 3) Kill any one brick from the volume 4) Set extended attribute from mount point on directory setfattr -n user.foo -v "abc" ./tmp{1..5} It will throw error " Transport End point is not connected " for those hashed subvol is down 5) Start volume with force option to start brick process 6) Execute getfattr command on mount point for directory 7) Check extended attribute on brick getfattr -n user.foo <volume-location>/tmp{1..5} It shows correct value for directories for those xattr fop were executed successfully. Note: The patch will resolve xattr healing problem only for fuse mount not for nfs mount. BUG: 1371806 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I4eb137eace24a8cb796712b742f1d177a65343d5
* Fix a coverity error of checker type: CHECKED_RETURNKamal Mohanan2017-09-261-1/+1
| | | | | | | | | | | Problem: dht_frame_return was being called without checking the return value. Solution: Typecast the value returned by the function to void. Change-Id: Idfc6a7ed467d1c8f5f8d09ec26d9059f3d23b760 BUG: 789278 Signed-off-by: Kamal Mohanan <kmohanan@redhat.com>
* cluster/dht: Aggregate xattrs only for dirs in dht_discover_cbkN Balachandran2017-08-301-2/+11
| | | | | | | | | | | | | | | | | If dht_discover finds data files on more than one subvol, racing calls to dht_discover_cbk could end up calling dht_aggregate_xattr which could delete dictionary data that is being accessed by higher layer translators. Fixed to call dht_aggregate_xattr only for directories and consider only the first file to be found. Change-Id: I4f3d2a405ec735d4f1bb33a04b7255eb2d179f8a BUG: 1484709 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/18137 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>