* md-cache: Adapt integer data types to avoid integer overflow
  David Spisla, 2019-02-20 (1 file changed, -3/+3)

  The "struct iatt" in iatt.h uses int64_t types for storing the atime,
  mtime and ctime. Therefore the struct 'struct md_cache' in md-cache.c
  should also use these types to avoid an integer overflow. This can
  happen e.g. if someone uses a very high default-retention-period in
  the WORM-Xlator.

  Change-Id: I605268d300ab622b9c8ab30e459dc00d9340aad1
  fixes: bz#1678726
  Signed-off-by: David Spisla <david.spisla@iternity.com>
  (cherry picked from commit 15423e14f16dd1a15ee5e5cbbdbdd370e57ed59f)
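  A minimal sketch of the failure mode this change guards against
  (hypothetical struct names and field widths, not the actual md-cache
  code): a far-future timestamp, e.g. one produced by a very long WORM
  retention period, silently truncates when cached in a 32-bit field but
  survives in an int64_t field that matches struct iatt.

      #include <stdint.h>
      #include <stdio.h>
      #include <time.h>

      struct cache_narrow { uint32_t md_atime; };  /* hypothetical old layout */
      struct cache_wide   { int64_t  md_atime; };  /* width of struct iatt times */

      int main(void)
      {
          /* roughly "now + 200 years", past the 32-bit range */
          int64_t atime = (int64_t)time(NULL) + 200LL * 365 * 24 * 3600;

          struct cache_narrow n = { .md_atime = (uint32_t)atime }; /* truncates */
          struct cache_wide   w = { .md_atime = atime };           /* exact */

          printf("narrow: %u\nwide:   %lld\n", n.md_atime, (long long)w.md_atime);
          return 0;
      }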
* cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelog
  Ashish Pandey, 2019-02-18 (5 files changed, -25/+103)

  If a fop to create an entry fails on one of the data bricks, we mark the
  pending changelog on the entry on the brick for which it was successful.
  This is done as part of the post-op phase to make sure that the entry
  gets healed even if it gets renamed to some other path where its parent
  was not marked as bad.

  As this happens as part of post-op, we should consult the thin-arbiter
  to check whether the brick which was successful is the good brick or
  not. This will avoid split brain and other issues.

  >Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
  >Signed-off-by: Ashish Pandey <aspandey@redhat.com>

  Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
  updates: bz#1672314
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* Bump up timeout for tests on AWS
  Nigel Babu, 2019-02-07 (3 files changed, -3/+4)

  Fixes: bz#1673268
  Change-Id: I2b9be45f199f6436b858536c6f49be85902217f0
  Signed-off-by: Nigel Babu <nigelb@redhat.com>
* libglusterfs/common-utils.c: Fix buffer size for checksum computation
  Varsha Rao, 2019-02-04 (2 files changed, -2/+37)

  Problem: When the quorum count option is updated, the change is not
  reflected in the nfs-server.vol file. This is because in
  get_checksum_for_file(), when the last part of the file read is smaller
  than the buffer size, the read buffer still contains stale data from the
  previous read along with the newly read data.

  Solution: Pass the number of bytes read, instead of the fixed buffer
  size, when calculating the checksum.

  Change-Id: I4b641607c8a262961b3f3da0028a54e08c3f8589
  fixes: bz#1672248
  Signed-off-by: Varsha Rao <varao@redhat.com>
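  A minimal sketch of the pattern behind the fix (simplified, not the
  actual get_checksum_for_file() code or checksum algorithm): only the
  bytes returned by read() are fed into the checksum, so a short final
  read cannot pick up stale bytes left over from the previous iteration.

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>

      static uint32_t checksum_file(const char *path)
      {
          char     buf[4096];
          uint32_t sum = 0;
          ssize_t  n;
          int      fd = open(path, O_RDONLY);

          if (fd < 0)
              return 0;

          while ((n = read(fd, buf, sizeof(buf))) > 0) {
              /* The bug was to always walk sizeof(buf) bytes here. */
              for (ssize_t i = 0; i < n; i++)
                  sum = (sum << 1) ^ (uint8_t)buf[i];
          }
          close(fd);
          return sum;
      }

      int main(int argc, char **argv)
      {
          if (argc > 1)
              printf("checksum: %u\n", checksum_file(argv[1]));
          return 0;
      }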
* features/shard: Ref shard inode while adding to fsync list
  Krutika Dhananjay, 2019-02-04 (2 files changed, -8/+51)

  PROBLEM:

  Many of the earlier changes in the management of shards in the lru and
  fsync lists assumed that if a given shard exists in the fsync list, it
  must be part of the lru list as well. This was found to be not true.

  Consider this - a file is FALLOCATE'd to a size which would make the
  number of participant shards greater than the lru list size. In this
  case, some of the resolved shards that are to participate in this fop
  will be evicted from the lru list to give way to the rest of the shards.
  And once FALLOCATE completes, these shards are added to the fsync list
  but without a ref. After the fop completes, these shard inodes are
  unref'd and destroyed while their inode ctxs are still part of the fsync
  list. Now when an FSYNC is called on the base file and the fsync list is
  traversed, the client crashes due to illegal memory access.

  FIX:

  Hold a ref on the shard inode when adding to the fsync list as well,
  and unref under the following conditions:
  1. when the shard is evicted from the lru list
  2. when the base file is fsync'd
  3. when the shards are deleted

  Change-Id: Iab460667d091b8388322f59b6cb27ce69299b1b2
  fixes: bz#1669382
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
  (cherry picked from commit 72922c1fd69191b220f79905a23395c3a87f86ce)
* readdir-ahead: do not zero-out iatt in fop cbk
  Ravishankar N, 2019-02-04 (2 files changed, -20/+27)

  ...when ctime is zero. ia_type and ia_gfid always need to be non-zero
  for things to work correctly.

  Problem: Commit c9bde3021202f1d5c5a2d19ac05a510fc1f788ac zeroed out the
  iatt buffer in the cbks of modification fops before unwinding if the
  ctime in the buffer was zero. This was causing the fops to fail:
  noticeable when AFR's 'consistent-metadata' option was enabled. (AFR
  zeros out the ctime when the option is set. See commit
  4c4624c9bad2edf27128cb122c64f15d7d63bbc8.)

  Fixes:
  - Do not zero out the ia_type and ia_gfid of the iatt buf under any
    circumstance.
  - Also, fixed _rda_inode_ctx_update_iatts() to always update these
    values from the incoming buf when ctime is zero. Otherwise we end up
    with zero ia_type and ia_gfid the first time the function is called
    *and* the incoming buf has ctime set to zero.

  fixes: bz#1665145
  Reported-By: Michael Hanselmann <public@hansmi.ch>
  Change-Id: Ib72228892d42c3513c19fc6dfb543f2aa3489eca
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  (cherry picked from commit 09db11b0c020bc79d493c6d7e7ea4f3beb000c68)
* cluster/dht: Delete invalid linkto files in rmdir
  N Balachandran, 2019-02-04 (2 files changed, -1/+67)

  rm -rf <dir> fails on dirs which contain linkto files that point to
  themselves because dht incorrectly thought that they were cached files
  after looking them up. The fix now treats them as invalid linkto files
  and deletes them.

  Change-Id: I376c72a5309714ee339c74485e02cfb4e29be643
  fixes: bz#1671611
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
* socket: fix issue when socket write return with EAGAIN
  Zhang Huan, 2019-02-04 (1 file changed, -0/+2)

  When a socket write returns with EAGAIN, the remaining vector count is
  returned all the way back to the event handler, causing the follow-up
  pollin event to skip handling, and the dispatch loop complains about a
  failure even though a temporary write failure is not an error.

  [2018-12-29 07:31:41.772310] E [MSGID: 101191]
  [event-epoll.c:674:event_dispatch_epoll_worker] 0-epoll: Failed to
  dispatch handler

  Change-Id: Idf03d120b5f7619eda19720a583cbcc3e7da2504
  updates: bz#1651246
  Signed-off-by: Zhang Huan <zhanghuan@open-fs.com>
  (cherry picked from commit 0301a66bda44582e3a48519f2a5d365b0c38090d)
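  An illustrative sketch of the general pattern (not the GlusterFS socket
  transport itself): on a non-blocking socket, EAGAIN from write() means
  "the kernel buffer is full, retry on the next POLLOUT", not "the
  connection failed".

      #include <errno.h>
      #include <stddef.h>
      #include <unistd.h>

      /* Returns 0 when everything was written, 1 when the caller should
       * wait for POLLOUT and call again, -1 on a real error. */
      static int write_all_nonblocking(int fd, const char *buf,
                                       size_t *off, size_t len)
      {
          while (*off < len) {
              ssize_t n = write(fd, buf + *off, len - *off);

              if (n > 0) {
                  *off += (size_t)n;
                  continue;
              }
              if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                  return 1;   /* temporary: try again later */
              if (n < 0 && errno == EINTR)
                  continue;   /* interrupted: just retry */
              return -1;      /* genuine failure */
          }
          return 0;
      }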
* socket: don't pass return value from protocol handler to event handler
  Zhang Huan, 2019-02-04 (1 file changed, -2/+2)

  The event handler handles socket-level errors only, while the protocol
  handler handles protocol-level errors. If the protocol handler decides
  to disconnect on an error, it should call disconnect instead of
  returning an error back to the event handler.

  Change-Id: I9375be98cc52cb969085333f3c7229a91207d1bd
  updates: bz#1651246
  Signed-off-by: Zhang Huan <zhanghuan@open-fs.com>
  (cherry picked from commit cd5714554627fe90ee2c77685cb410a8fb25eceb)
* core: move "dict is NULL" logs to DEBUG log level
  Milind Changire, 2019-02-04 (1 file changed, -2/+2)

  Too many logs get printed if dict_ref() and dict_unref() are passed a
  NULL pointer.

  fixes: bz#1671217
  Change-Id: I18afd849d64318f68baa7b549ee310dac0e1e786
  Signed-off-by: Milind Changire <mchangir@redhat.com>
* api: bad GFAPI_4.1.6 block
  Kaleb S. KEITHLEY, 2019-01-29 (1 file changed, -2/+3)

  Missing 'global:' line; tabs, not spaces.

  Change-Id: Icdbc23b4e4cd608da1d764e81757201c4b1269a6
  fixes: bz#1670307
  Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
* doc: Release notes for 5.3 (tag: v5.3)
  ShyamsundarR, 2019-01-16 (1 file changed, -0/+30)

  Fixes: bz#1659085
  Change-Id: I5195e4eca6518e3122ea188e2f4891f5e68ca01a
  Signed-off-by: ShyamsundarR <srangana@redhat.com>
* fuse: add --lru-limit option
  Amar Tumballi, 2019-01-16 (11 files changed, -87/+396)

  The inode LRU mechanism is moot in the fuse xlator (i.e. there is no
  limit for the LRU list), as fuse inodes are referenced from kernel
  context, and thus they can only be dropped on request of the kernel.
  This might result in a high number of passive inodes which are useless
  for the glusterfs client, causing a significant memory overhead.

  This change tries to remedy this by extending the LRU semantics and
  allowing a finite limit to be set on the fuse inode LRU.

  A brief history of the problem: When gluster's inode table was designed,
  fuse didn't have any 'invalidate' method, which means a userspace
  application could never ask the kernel to send a 'forget()' fop; instead
  it had to wait for the kernel to send it based on the kernel's
  parameters. The inode table remembers the number of times the kernel has
  cached the inode based on the 'nlookup' parameter. And the 'nlookup'
  field is not used by any other entry points (like server-protocol, gfapi
  etc). Hence the inode table of the fuse module always had to have
  lru-limit set to '0', which means no limit. GlusterFS always had to keep
  all inodes in memory, as the kernel would have had a reference to them.
  Again, the reason for this is that the kernel's glusterfs inode
  reference was a pointer to the 'inode_t' structure in glusterfs. As it
  is a pointer, we could never free it (to prevent a segfault or memory
  corruption).

  Solution: In the inode table, handle the prune case of inodes with
  'nlookup' differently, and call an 'invalidator' method, which in this
  case is fuse_invalidate(); it sends the request to the kernel to get the
  forget request. When the kernel sends the forget, it means it has
  dropped all references to the inode, and it will send the forget with
  the 'nlookup' parameter too. We just need to make sure to reduce the
  'nlookup' value we have when we get the forget. That automatically
  causes the relevant prune to happen.

  Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B

  fixes: bz#1623107
  Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
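  A simplified model of the prune-with-invalidator idea described above
  (hypothetical types and names, not the real inode table): entries the
  kernel still "knows" about (nlookup > 0) cannot be freed directly, so
  the pruner asks for an invalidation and lets the eventual FORGET drop
  the last reference.

      #include <stdint.h>
      #include <stdlib.h>

      struct fake_inode {
          uint64_t           nlookup;  /* kernel lookups outstanding */
          struct fake_inode *next;     /* LRU chain, least recent first */
      };

      typedef void (*invalidator_fn)(struct fake_inode *ino);

      /* Prune the LRU down to 'limit' entries. */
      static void lru_prune(struct fake_inode **lru, size_t count,
                            size_t limit, invalidator_fn invalidate)
      {
          while (count > limit && *lru) {
              struct fake_inode *victim = *lru;

              *lru = victim->next;
              count--;
              if (victim->nlookup > 0)
                  invalidate(victim);  /* e.g. fuse_invalidate(); the
                                          kernel's FORGET later drops
                                          nlookup and frees the inode */
              else
                  free(victim);        /* kernel never saw it: free now */
          }
      }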
* features/shard: Fix launch of multiple synctasks for background deletion
  Krutika Dhananjay, 2019-01-15 (2 files changed, -71/+128)

  PROBLEM:

  When multiple sharded files are deleted in quick succession, multiple
  issues were observed:
  1. Misleading logs corresponding to a sharded file, where one log
     message said the shards corresponding to the file were deleted
     successfully, followed by multiple logs suggesting the very same
     operation failed. This was because of multiple synctasks attempting
     to clean up shards of the same file, with only one of them succeeding
     (the one that gets ENTRYLK successfully) and the rest of them logging
     failure.
  2. Multiple synctasks for background deletion would be launched, one for
     each deleted file, but all of them could readdir entries from
     .remove_me at the same time and potentially contend for ENTRYLK on
     .shard for each of the entry names. This is undesirable and wasteful.

  FIX:

  Background deletion will now follow a state machine. In the event that
  there are multiple attempts to launch a synctask for background
  deletion, one for each file deleted, only the first task is launched.
  And if, while this task is doing the cleanup, more attempts are made to
  delete other files, the state of the synctask is adjusted so that it
  restarts the crawl even after reaching end-of-directory, to pick up any
  files it may have missed in the previous iteration.

  This patch also fixes an uninitialized lk-owner during syncop_entrylk()
  which was leading to multiple background deletion synctasks entering the
  critical section at the same time and to illegal memory access of the
  base inode in the second synctask after it was destroyed post shard
  deletion by the first synctask.

  Change-Id: Ib33773d27fb4be463c7a8a5a6a4b63689705324e
  updates: bz#1665803
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
  (cherry picked from commit c0c2022e7d7097e96270a74f37813eda0c4e6339)
* features/shard: Assign fop id during background deletion to prevent excessive logging
  Krutika Dhananjay, 2019-01-14 (1 file changed, -0/+1)

  ... of the kind

  "[2018-12-26 05:22:44.195019] E [MSGID: 133010]
  [shard.c:2253:shard_common_lookup_shards_cbk] 0-volume1-shard: Lookup on
  shard 785 failed. Base file gfid = cd938e64-bf06-476f-a5d4-d580a0d37416
  [No such file or directory]"

  shard_common_lookup_shards_cbk() has a specific check to ignore ENOENT
  errors without logging them during specific fops. But because background
  deletion is done in a new frame (with local->fop being GF_FOP_NULL), the
  ENOENT check is skipped and the absence of shards gets logged every
  time.

  To fix this, local->fop is initialized to GF_FOP_UNLINK during
  background deletion.

  Change-Id: I0ca8d3b3bfbcd354b4a555eee520eb0479bcda35
  updates: bz#1665803
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
  (cherry picked from commit aa28fe32364e39981981d18c784e7f396d56153f)
* leases: Reset lease_ctx->timer post deletion
  Soumya Koduri, 2019-01-09 (1 file changed, -0/+1)

  To avoid a use-after-free, reset lease_ctx->timer back to NULL after the
  structure has been freed.

  Change-Id: Icd213ec809b8af934afdb519c335a4680a1d6cdc
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit a9b0003c717087ff168bc143c70559162e53e0d5)
* core: Fixed typos in nl-cache and logging-guidelines.md
  N Balachandran, 2019-01-09 (2 files changed, -3/+3)

  Replaced "recieve" with "receive".

  Change-Id: I58a3d3d4a0093df4743de9fae4d8ff152d4b216c
  fixes: bz#1662200
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
  (cherry picked from commit a11c5c66321dd8411373a68cc163c981c7d083df)
* gfapi: Access fs->oldvolfile under mutex lock
  Soumya Koduri, 2019-01-09 (1 file changed, -0/+6)

  In some cases (for example, when there are multiple RPC_CLNT_CONNECT
  notifications), multiple threads may fetch the volfile and try to update
  it in the 'fs' object simultaneously. Hence protect access to those
  variables under the fs->mutex lock.

  This is a backport of the two mainline patches below:
  - https://review.gluster.org/#/c/glusterfs/+/21882/
  - https://review.gluster.org/#/c/glusterfs/+/21927/

  Change-Id: Idaee9548560db32d83f4c04ebb1f375fee7864a9
  fixes: bz#1663131
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit 8fe3c6107a2b431d7cc0b8cfaeeb7941cf9590f9)
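  A minimal sketch of the locking pattern (hypothetical struct and field
  names, not the gfapi implementation): every thread that reads or
  replaces the cached volfile takes the same mutex, so a reader can never
  observe a half-updated buffer/length pair.

      #include <pthread.h>
      #include <stdlib.h>
      #include <string.h>

      struct fs_ctx {
          pthread_mutex_t mutex;
          char           *oldvolfile;
          size_t          oldvollen;
      };

      static int fs_store_volfile(struct fs_ctx *fs, const char *buf,
                                  size_t len)
      {
          char *copy = malloc(len);

          if (!copy)
              return -1;
          memcpy(copy, buf, len);

          pthread_mutex_lock(&fs->mutex);
          {
              free(fs->oldvolfile);   /* safe: no concurrent reader here */
              fs->oldvolfile = copy;
              fs->oldvollen  = len;
          }
          pthread_mutex_unlock(&fs->mutex);
          return 0;
      }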
* io-cache: xdata needs to be passed for readv operations
  Soumya Koduri, 2018-12-30 (2 files changed, -2/+16)

  The io-cache xlator has been skipping xdata references when the data
  needs to be read into the page cache. This patch fixes the same.

  Note: similar changes may be needed for other fops as well which are
  handled by io-cache.

  Change-Id: I28d73d4ba471d13eb55d0fd0b5197d222df77a2a
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit b3d88a0904131f6851f4185e43f815ecc3353ab5)
* geo-rep: Fix syncing of files with non-ascii filenames
  Kotresh HR, 2018-12-26 (5 files changed, -69/+194)

  Problem: Creation of files/directories with non-ascii names fails to
  sync to the slave. It crashes with the below traceback on the slave.

      ...
      File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/repce.py", line 118, in worker
        res = getattr(self.obj, rmeth)(*in_data[2:])
      File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 709, in entry_ops
        [ESTALE, EINVAL, EBUSY])
      File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap
        return call(*arg)
      File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/libcxattr.py", line 83, in lsetxattr
        cls.raise_oserr()
      File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/libcxattr.py", line 38, in raise_oserr
        raise OSError(errn, os.strerror(errn))
      OSError: [Errno 12] Cannot allocate memory

  Cause: The lengths passed to blob creation were calculated before
  encoding. Hence it was failing in the gfid-access layer.

  Fix: It appears that calculating the length properly fixes this issue.
  But it would cause issues in other places in 'python2' and not in
  'python3'. So encoding and decoding each required string to make
  geo-rep compatible with both 'python2' and 'python3' is a nightmare and
  is not foolproof. Hence the 'python2' code is kept as is, without
  encode/decode, and encode/decode is applied only to 'python3'.

  Added non-ascii filename tests to regression.

  Backport of:
  > Patch: https://review.gluster.org/21668
  > BUG: 1650893
  > Change-Id: I35cfaf848e07b1a0b5cb93c01b98b472f08271a6
  > Signed-off-by: Kotresh HR <khiremat@redhat.com>

  fixes: bz#1648642
  Change-Id: I35cfaf848e07b1a0b5cb93c01b98b472f08271a6
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
* cluster/dht: sync brick root perms on add brick
  N Balachandran, 2018-12-26 (2 files changed, -22/+14)

  If a single brick is added to the volume and the newly added brick is
  the first to respond to a dht_revalidate call, its stbuf will not be
  merged into local->stbuf as the brick does not yet have a layout. The
  is_permission_different check therefore fails to detect that an attr
  heal is required as it only considers the stbuf values from existing
  bricks.

  To fix this, merge all stbuf values into local->stbuf and use
  local->prebuf to store the correct directory attributes.

  Change-Id: Ic9e8b04a1ab9ed1248b6b056e3450bbafe32e1bc
  fixes: bz#1660736
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
* performance/rda: Fixed dict_t memory leak
  N Balachandran, 2018-12-26 (1 file changed, -8/+0)

  Removed all references to dict_t xdata_from_req which is allocated but
  not used anywhere. It is also not cleaned up and hence causes a memory
  leak.

  fixes: bz#1659676
  Change-Id: I2edb857696191e872ad12a12efc36999626bacc7
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
* shard: prevent segfault in shard_unlink_block_inode()
  Sunny Kumar, 2018-12-20 (1 file changed, -1/+2)

  gluster-blockd sometimes segfaults with the following backtrace:

      Core was generated by `/usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO'.
      Program terminated with signal 11, Segmentation fault.
      #0  0x00007fbb9cd639b9 in shard_unlink_block_inode (
          local=local@entry=0x7fbb80000a78, shard_block_num=<optimized out>) at shard.c:2929
      2929            base_ictx->fsync_count--;
      (gdb) bt
      #0  0x00007fbb9cd639b9 in shard_unlink_block_inode (
          local=local@entry=0x7fbb80000a78, shard_block_num=<optimized out>) at shard.c:2929
      #1  0x00007fbb9cd64311 in shard_unlink_shards_do_cbk (
          frame=frame@entry=0x7fbb9010a768, cookie=<optimized out>,
          this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>,
          preparent=preparent@entry=0x7fbb7470dcf8,
          postparent=postparent@entry=0x7fbb7470dd90, xdata=xdata@entry=0x0)
          at shard.c:2987

  A fix for this has already been provided through a Coverity report.

  Backport of:
  > Change-Id: Ic5d302a5e32d375acf8adc412763ab94e6dabc3d
  > Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
  > (cherry picked from commit 145e180517054626d07892219fdee689b703c218)

  Change-Id: I699a039e9c5115eb3376190dd8014427d12a293b
  Updates: bz#1659563
  Signed-off-by: Niels de Vos <ndevos@redhat.com>
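  The crash dereferences a base-inode context while decrementing its
  fsync count. A minimal sketch of the guard pattern such a fix typically
  takes (hypothetical names and simplified types; the real change may
  differ):

      struct shard_inode_ctx {
          int fsync_count;
      };

      static void unlink_block_inode(struct shard_inode_ctx *base_ictx)
      {
          /* Without this check, a NULL base context dereference
           * segfaults the process exactly as in the backtrace above. */
          if (base_ictx)
              base_ictx->fsync_count--;
      }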
* tests: Fix zero-flag.t script
  Krutika Dhananjay, 2018-12-19 (1 file changed, -0/+1)

  The default value of shard-block-size was changed from 4MB to 64MB
  sometime back. The script "fallocate"s a 6MB file and expects it to have
  1 shard under .shard. This worked when the shard-block-size was 4MB.
  With the default value now at 64MB, file "file1" won't have any shards
  under .shard and the stat on the 1st shard's path fails with ENOENT.
  Changed the script to explicitly set shard-block-size to 4MB.

  Change-Id: I7f1785922287d16d74c95fa57cbbe12e6e66e4f7
  fixes: bz#1660932
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
  (cherry picked from commit e69fc87593334b24432978dbf592fa73fe5fc38b)
* doc: release notes for release 5.2 (tag: v5.2)
  ShyamsundarR, 2018-12-12 (1 file changed, -0/+27)

  Fixes: bz#1649895
  Change-Id: Ic8f59df9d99efafd56c930e30040f046654a1f9e
  Signed-off-by: ShyamsundarR <srangana@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory state
  Ravishankar N, 2018-12-12 (7 files changed, -81/+684)

  2 domain locking + xattrop for write-txn failures:
  - A post-op wound on the TA takes an AFR_TA_DOM_NOTIFY range lock and an
    AFR_TA_DOM_MODIFY full lock, does the xattrop on the TA, releases the
    AFR_TA_DOM_MODIFY lock and stores in memory which brick is bad.
  - All further write-txn failures are handled based on this in-memory
    value without querying the TA.
  - When shd heals the files, it does so by requesting a full lock on the
    AFR_TA_DOM_NOTIFY domain. The client uses this as a cue (via upcall),
    releases the AFR_TA_DOM_NOTIFY range lock and invalidates its
    in-memory notion of which brick is bad. The next write-txn failure is
    wound on the TA to again update the in-memory state.
  - Any write txns that were incomplete when the AFR_TA_DOM_NOTIFY upcall
    release request was received are completed before the lock is
    released.
  - Any write txns received after the release request are maintained in a
    ta_waitq.
  - After the release is complete, the ta_waitq elements are spliced to a
    separate queue which is then processed one by one.
  - For fops that come in parallel when the in-memory bad brick is still
    unknown, only one is wound to the TA on the wire. The other ones are
    maintained in a ta_onwireq which is then processed after we get the
    response from the TA.

  Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
  updates: bz#1648205
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* features/bitrot: compare the signature with proper length
  Raghavendra Bhat, 2018-12-12 (2 files changed, -8/+14)

  * The scrubber was wrongly comparing the checksum of the file that it
    calculated (by reading the file) with the on-disk signature (stored
    via xattr). It was using strlen to determine the signature's length,
    while the actual length of the signature is given by the brick. Just
    use the actual length that the brick provides instead of trying to
    calculate the signature length via the strlen API.

  * In posix, gfid2path was using the same string that contains the list
    of all the xattrs of a file to save the value of the gfid2path xattr
    as well. This causes confusion when the gfid2path xattr is queried by
    the scrubber for getting the actual path of a corrupted file. Use a
    separate string to fetch the value of the xattr instead of the string
    that contains the list of xattrs.

  Backport of:
  > Patch: https://review.gluster.org/21752/
  > BUG: 1654805
  > Change-Id: I2d664ab524d2b312233476cb35863dde3122e9a9

  (cherry picked from commit f77fb6d568616592ab25501c402c140d15235ca9)
  Change-Id: I2d664ab524d2b312233476cb35863dde3122e9a9
  fixes: bz#1654370
  Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
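  A minimal sketch of the comparison rule this change enforces (not the
  scrubber code itself): a signature is opaque binary data, so it must be
  compared with an explicit length, never with strlen()/strcmp(), which
  stop at the first zero byte.

      #include <stdbool.h>
      #include <stddef.h>
      #include <string.h>

      static bool signature_matches(const unsigned char *computed,
                                    const unsigned char *ondisk,
                                    size_t sig_len /* length from brick */)
      {
          /* strcmp()/strlen() would truncate the comparison at the first
           * 0x00 byte and could report a false match or mismatch. */
          return memcmp(computed, ondisk, sig_len) == 0;
      }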
* afr: assign gfid during name heal when no 'source' is present.
  Ravishankar N, 2018-12-12 (5 files changed, -52/+201)

  Problem: If parent dir is in split-brain or has dirty xattrs set, and
  the file has gfid missing on one of the bricks, then name heal won't
  assign the gfid.

  Fix: Use the brick we select the gfid from as the 'source'.

  Note: Problem was found while trying to debug a split-brain issue on
  Cynthia Zhou's setup.

  fixes: bz#1655545
  Change-Id: Id088d4f0fb017aa35122de426654194e581ed742
  Reported-by: Cynthia Zhou <cynthia.zhou@nokia-sbell.com>
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  (cherry picked from commit 4d58730c0cd6ab5db39aec8a15276f7bd3371b04)
* leases: Do not conflict with internal fops
  Soumya Koduri, 2018-12-12 (1 file changed, -0/+11)

  Internal fops (with frame->root->pid < 0) are used to heal or move data
  and maintain data integrity. That is, they do not modify the client data
  which holds the lease. Hence there is no need to recall the lease for
  such fops.

  Note: As with locks, we would need the rebalance and self-heal daemon
  processes to heal lease state as well.

  Change-Id: I8988693fef8d00e17c19dcc842e2238f9eb5ab48
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit 080aa5b9e9d998552e23f7c33aed3afb0ca93c34)
* geo-rep: Fix traceback with symlink metadata sync
  Kotresh HR, 2018-12-12 (1 file changed, -15/+11)

  While syncing metadata, 'os.chmod', 'os.chown' and 'os.utime' should be
  used without de-referencing. But python supports only 'os.chown' without
  de-reference. That's mostly because Linux doesn't support 'chmod' on a
  symlink file itself, but it does support 'chown'. So while syncing
  metadata ops, if it's a symlink we should only sync 'chown' and not do
  'chmod' and 'utime'. Otherwise it will lead to tracebacks with errors
  like EROFS, EPERM, ACCESS, ENOENT.

  All three errors (EPERM, ACCESS, ENOENT) were handled except EROFS, but
  the way it was handled was not foolproof: the operation was tried and
  the failure was handled based on the errors. All the errors with a
  symlink file for 'chown' and 'utime' had to be passed to the safe errors
  list of 'errno_wrap'. This patch handles it better by avoiding 'chmod'
  and 'utime' if it's a symlink file.

  Backport of:
  > Patch: https://review.gluster.org/21546/
  > BUG: bz#1646104
  > Change-Id: Ic354206455cdc7ab2a87d741d81f4efe1f19d77d
  > Signed-off-by: Kotresh HR <khiremat@redhat.com>

  (cherry picked from commit 3c6cf9a4a1b46cab2dc53c1ee0afca0fe993102e)
  fixes: bz#1654115
  Change-Id: Ic354206455cdc7ab2a87d741d81f4efe1f19d77d
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
* gfapi: Offload callback notifications to synctask
  Soumya Koduri, 2018-12-12 (4 files changed, -10/+419)

  Upcall notifications are received from the server via epoll, and the
  same thread is used to forward these notifications to the application.
  This may lead to a deadlock and hang in the following scenario: as part
  of handling these callbacks, the application may have to do operations
  which involve sending I/Os down the gfapi stack, which in turn have to
  wait for epoll threads to receive the response. This can deadlock if all
  the epoll threads are waiting to complete these callback notifications.

  To address this, instead of using the epoll thread itself, make use of a
  synctask to send those notifications to the application.

  Change-Id: If614e0d09246e4279b9d1f40d883a32a39c8fd90
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit ad35193718a99494ab1b852ca4cbdf054f73de88)
* lease: Treat unlk request as noop if lease not found
  Soumya Koduri, 2018-12-12 (2 files changed, -4/+16)

  When the glusterfs server recalls a lease, it expects the client to
  flush data and unlock the lease. If that does not happen, it sets a
  timer (starting from the time it sends the RECALL request) and, post
  timeout, revokes the lease. Here we could have a race wherein the client
  did send the UNLK lease request, but because of network delay it may
  have reached the server after the lease was already revoked. To handle
  such situations, treat such requests as a noop and return success.

  Change-Id: I166402d10273f4f115ff04030ecbc14676a01663
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit c2e758b54d8a3f778e3e63db0000bb8b63de9b25)
* geo-rep: Fix permissions with non-root setup
  Kotresh HR, 2018-11-29 (1 file changed, -3/+9)

  Problem: In non-root fail-over/fail-back (FO/FB), when the slave is
  promoted as master, the session goes to 'Faulty'.

  Cause: The command 'gluster-mountbroker <mountbroker-root> <group>' is
  run as a pre-requisite on the slave in a non-root setup. It modifies the
  permissions and group of the following required directories and files
  recursively:
  [1] /var/lib/glusterd/geo-replication
  [2] /var/log/glusterfs/geo-replication-slaves

  In a normal setup, this is executed on the slave node and hence doing it
  recursively is not an issue for [1]. But when the original master
  becomes the slave in non-root FO/FB, [1] contains the ssh public keys,
  and modifying their permissions causes geo-rep to fail with incorrect
  permissions.

  Fix: Don't do the permission change recursively. Fix permissions only
  for the required files.

  Backport of:
  > Patch: https://review.gluster.org/#/c/glusterfs/+/21689/
  > BUG: bz#1651498
  > Change-Id: I68a744644842e3b00abc26c95c06f123aa78361d
  > Signed-off-by: Kotresh HR <khiremat@redhat.com>

  (cherry picked from commit b2776b1ec1ad845ba568c4439bca3b57cc4d2592)
  fixes: bz#1654117
  Change-Id: I68a744644842e3b00abc26c95c06f123aa78361d
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
* leases: Fix incorrect inode_ref/unrefs
  Soumya Koduri, 2018-11-29 (2 files changed, -3/+37)

  From testing and code reading, we found a couple of places where we
  incorrectly unref the inode, resulting in a use-after-free crash or ref
  leaks. This patch addresses a couple of them.

  a) When we try to grant the very first lease for an inode, an inode_ref
     is taken in __add_lease. This ref should be active till all the
     leases granted to that inode are released (i.e., till lease_cnt > 0).
     In addition, even after lease_cnt becomes '0', the inode should be
     active till all the blocked fops are resumed. Hence release this ref
     after resuming all those fops. To avoid granting new leases while
     resuming those fops, a new boolean (blocked_fops_resuming) is defined
     to flag it in the lease_ctx.

  b) 'new_lease_inode', which creates a new lease_inode_entry and takes a
     ref on the inode, is used while adding that entry to client_list and
     recall_list. Use its counterpart '__destroy_lease_inode', which does
     the unref, while removing those entries from those lists.

  c) An inode ref is also taken when added to timer->data. Unref the same
     after processing timer->data.

  Change-Id: Ie77c78ff4a971e0d9a66178597fb34faf39205fb
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit b7aec05aa965202ab73120acf0da4c32fe0cf16c)
* gfapi: Send fop_attr dict as part of syncop_open
  Soumya Koduri, 2018-11-29 (1 file changed, -1/+1)

  The lease id (stored in thread locals) is sent to the server via xdata.
  This dict variable is set but not passed as an argument in
  glfs_h_open(). Fixed the same.

  Change-Id: Idd2f8a0ec184b4b6b1ad1e6e5d75df551c36a96d
  updates: bz#1651323
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit 04be5463b20ababc29942fa967017e763d0ae2af)
* afr: open_ftruncate_cbk should read fd from local->cont.open struct
  Soumya Koduri, 2018-11-29 (1 file changed, -2/+2)

  afr_open stores the fd as part of its local->cont.open struct, but when
  it calls ftruncate (if the open flags contain O_TRUNC), the
  corresponding cbk function (afr_open_ftruncate_cbk) incorrectly
  references the uninitialized local->fd. This patch fixes the same.

  Change-Id: Icbdedbd1b8cfea11d8f41b6e5c4cb4b44d989aba
  updates: bz#1651322
  Signed-off-by: Soumya Koduri <skoduri@redhat.com>
  (cherry picked from commit fda594875c4cdb2a22e27aa13f5c66bee032ccb5)
* cluster/afr: Use 2 domain locking in SHD for thin-arbiter
  karthik-us, 2018-11-29 (5 files changed, -89/+390)

  With this change, when SHD starts the index crawl it requests all the
  clients to release the AFR_TA_DOM_NOTIFY lock, so that clients will know
  that the in-memory state is no longer valid and any new operations need
  to query the thin-arbiter if required.

  When SHD completes healing all the files without any failure, it will
  again take the AFR_TA_DOM_NOTIFY lock and get the xattrs on the TA to
  see whether any new failures happened in the meantime. If there are new
  failures marked on the TA, SHD will start the crawl immediately to heal
  those failures as well. If there are no new failures, then SHD will take
  the AFR_TA_DOM_MODIFY lock and unset the xattrs on the TA, so that both
  data bricks will be considered good thereafter.

  >Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1
  >fixes: bz#1579788
  >Signed-off-by: karthik-us <ksubrahm@redhat.com>

  (cherry picked from commit 5784a00f997212d34bd52b2303e20c097240d91c)
  Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1
  fixes: bz#1648205
* cluster/ec: prevent infinite loop in self-heal full
  Xavi Hernandez, 2018-11-29 (1 file changed, -5/+6)

  There was a problem in commit 7f81067 that caused an infinite loop when
  a full heal was triggered.

  The previous commit was made to prevent self-heal from going idle after
  a replace-brick operation. One of the changes consisted of setting a
  flag to force an immediate scan of the dirty directory if a heal on a
  directory succeeded (assuming it could have generated newer entries).

  However, that change was causing an issue with a full self-heal, since
  every time an already healed directory was checked and it returned
  successfully, it was also setting the flag, forcing self-heal to start
  over again.

  This patch fixes the issue by only setting the flag if the heal is not
  full. It's assumed that a full self-heal will already traverse all
  entries automatically, so there's no need to force a new scan later.

  >Change-Id: Id12dbfc04e622b18183e796cc6cc87ccc30a6d55
  >fixes: bz#1636631
  >Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>

  (cherry picked from commit 7150c51ad75ccba22045a35fc31e5037612d1ad4)
  Change-Id: Id12dbfc04e622b18183e796cc6cc87ccc30a6d55
  fixes: bz#1651525
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* glfsheal: add a '--nolog' flag
  Ravishankar N, 2018-11-28 (4 files changed, -17/+56)

  ...and if set, change the log level to GF_LOG_NONE. This is useful for
  monitoring applications which invoke the heal info set of commands once
  every minute, leading to unnecessary glfsheal* logs in
  /var/log/glusterfs/. For example, we can now run

  `gluster volume heal <VOLNAME> info --nolog`
  `gluster volume heal <VOLNAME> info split-brain --nolog`

  etc. The default log level is still retained at GF_LOG_INFO.

  The patch also changes glfsheal internally to accept '--xml' instead of
  'xml'.

  Note: The --nolog flag is *not* displayed in the help anywhere, for the
  sake of consistency in how the other flags are not displayed anywhere in
  the help.

  fixes: bz#1654236
  Change-Id: Ia08b6aa6e4a0548379db7e313dd4411ebc66f206
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  (cherry picked from commit fc9889d0373c323aab0d93f8ca31d2d8151bd041)
* doc: Release notes for 5.1 release of GlusterFS (tag: v5.1)
  ShyamsundarR, 2018-11-14 (1 file changed, -0/+51)

  Change-Id: I526bf9bfd889dd7aea19f71059042cd9a993e1d0
  Signed-off-by: ShyamsundarR <srangana@redhat.com>
  Fixes: bz#1640685
* cluster/ec: Change log level to DEBUG for lookup combine
  Ashish Pandey, 2018-11-14 (1 file changed, -1/+1)

  As lookup is not a locked fop, we cannot trust the data received in it
  to be the same. Change the log level to DEBUG in case lookup finds any
  difference.

  (cherry picked from commit 9be6bf3d90e3783b3ba559c93d41b933f8d53f03)
  Change-Id: I39499c44688a2455c7c6c69a798762d045d21b39
  updates: bz#1644622
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* gfapi: fix bad dict setting of lease-id
  Kinglong Mee, 2018-11-12 (1 file changed, -10/+16)

  lease_id is 16 bytes of opaque data; copying it with gf_strdup is wrong.

      Invalid read of size 2
         at 0x483FA2F: memmove (vg_replace_strmem.c:1270)
         by 0xE2EF6FB: ??? (in /usr/lib64/libtirpc.so.3.0.0)
         by 0xE2EE047: xdr_opaque (in /usr/lib64/libtirpc.so.3.0.0)
         by 0x107A97DC: xdr_gfx_value (glusterfs4-xdr.c:207)
         by 0x107A98C0: xdr_gfx_dict_pair (glusterfs4-xdr.c:321)
         by 0xE2EF35E: xdr_array (in /usr/lib64/libtirpc.so.3.0.0)
         by 0x107A9A89: xdr_gfx_dict (glusterfs4-xdr.c:335)
         by 0x107AA97B: xdr_gfx_write_req (glusterfs4-xdr.c:897)
         by 0x107A181E: xdr_serialize_generic (xdr-generic.c:25)
         by 0x231044A2: client_submit_request (client.c:205)
         by 0x2314D3C1: client4_0_writev (client-rpc-fops_v2.c:3863)
         by 0x230FD5FA: client_writev (client.c:956)
      Address 0xad659e18 is 72 bytes inside a block of size 73 alloc'd
         at 0x483880B: malloc (vg_replace_malloc.c:299)
         by 0x106BA7EC: __gf_malloc (mem-pool.c:136)
         by 0x1064521E: gf_strndup (mem-pool.h:166)
         by 0x1064521E: gf_strdup (mem-pool.h:183)
         by 0x1064521E: get_fop_attr_thrd_key (glfs.c:627)
         by 0x1064D8E9: glfs_pwritev@@GFAPI_3.4.0 (glfs-fops.c:1154)
         by 0x10610C0C: glusterfs_write2 (handle.c:2092)
         by 0x54D30C: mdcache_write2 (mdcache_file.c:647)
         by 0x48A3FC: nfs4_write (nfs4_op_write.c:459)
         by 0x48A44D: nfs4_op_write (nfs4_op_write.c:487)
         by 0x4634F5: nfs4_Compound (nfs4_Compound.c:947)
         by 0x460155: nfs_rpc_process_request (nfs_worker_thread.c:1329)
         by 0x4608A3: nfs_rpc_valid_NFS (nfs_worker_thread.c:1539)
         by 0x488F12F: svc_vc_decode (svc_vc.c:825)

  Backport of:
  > Patch: https://review.gluster.org/21586/
  > BUG: bz#1647651
  > Change-Id: Ib9fff55c897bc43c15036a869888e763df133757
  > Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>

  (cherry picked from commit 6d4cd8ce6c0d88d331ffed97c51d3061a3900561)
  Updates bz#1648923
  Change-Id: Ib9fff55c897bc43c15036a869888e763df133757
  Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
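  A minimal sketch of why strdup-style copying breaks here (hypothetical
  helper and constant, not the gfapi code): a fixed-size opaque value may
  contain embedded zero bytes and has no terminator, so it must be copied
  with memcpy() and an explicit length.

      #include <stdlib.h>
      #include <string.h>

      #define LEASE_ID_SIZE 16   /* assumed size of the opaque lease id */

      static char *copy_lease_id(const char *lease_id)
      {
          char *copy = malloc(LEASE_ID_SIZE);

          if (!copy)
              return NULL;
          /* strdup(lease_id) would stop at the first zero byte and could
           * read past the source buffer looking for a terminator. */
          memcpy(copy, lease_id, LEASE_ID_SIZE);
          return copy;
      }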
* geo-rep: Add more intelligence to automatic error handling
  Kotresh HR, 2018-11-09 (1 file changed, -24/+46)

  Geo-rep's automatic error handling does gfid conflict resolution. But if
  there are ENOENT errors because the parent is not synced to the slave,
  it doesn't handle them. This patch adds the intelligence to create
  missing parent directories on the slave. It can create the missing
  directories up to a depth of 10.

  Backport of:
  > Patch: https://review.gluster.org/21498/
  > fixes: bz#1643402
  > Change-Id: Ic97ed1fa5899c087e404d559e04f7963ed7bb54c
  > Signed-off-by: Kotresh HR <khiremat@redhat.com>

  (cherry picked from commit 19775e0445411cca9ddd9d294fd54d0b6fbe6a03)
  fixes: bz#1646896
  Change-Id: Ic97ed1fa5899c087e404d559e04f7963ed7bb54c
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
* glusterd: ensure volinfo->caps is set to correct value.
  Sanju Rakonde, 2018-11-09 (2 files changed, -0/+11)

  With commit febf5ed4848, during the volume create op we set
  volinfo->caps to 0 only if any of the bricks belong to the same node and
  brickinfo->vg[0] is null. Previously, we used to set volinfo->caps to 0
  when either a brick doesn't belong to the same node or brickinfo->vg[0]
  is null.

  With this patch, we set volinfo->caps to 0 when either a brick doesn't
  belong to the same node or brickinfo->vg[0] is null (as we did earlier,
  before commit febf5ed4848).

  > BUG: bz#1635820
  > Change-Id: I00a97415786b775fb088ac45566ad52b402f1a49
  > Signed-off-by: Sanju Rakonde <srakonde@redhat.com>

  (cherry picked from commit aae1c402b74fd02ed2f6473b896f108d82aef8e3)
  fixes: bz#1647968
  Change-Id: I00a97415786b775fb088ac45566ad52b402f1a49
  Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* all: fix the format string exceptions
  Amar Tumballi, 2018-11-09 (40 files changed, -118/+126)

  Currently there are a few places where a user-controlled string (such as
  a filename or program parameter) can be passed as the 'fmt' argument of
  printf(), which can lead to a segfault if the user's string contains
  '%s' or '%d'. While fixing it, it makes sense to check explicitly for
  such issues across the codebase and make the format calls properly.

  Fixes: CVE-2018-14661
  Fixes: bz#1647666
  Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
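  An illustrative sketch of this defect class (a format string bug) and
  the fix pattern; the function name is hypothetical:

      #include <stdio.h>

      static void log_name(const char *user_supplied_name)
      {
          /* Vulnerable: if the name contains "%s" or "%n", printf reads
           * (or writes) through whatever happens to be on the stack:
           *
           *     printf(user_supplied_name);
           *
           * Safe: keep the format string constant and pass the
           * user-controlled text as an argument. */
          printf("%s\n", user_supplied_name);
      }

      int main(void)
      {
          log_name("report-%s-%d.txt");   /* printed literally, no crash */
          return 0;
      }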
* features/locks: Use pthread_mutex_unlock() instead of pthread_mutex_lock()
  Vijay Bellur, 2018-11-09 (1 file changed, -1/+1)

  Fixes CID 1396581.

  Change-Id: Ic04091b5783a75d8e1e605a9c1c28b77fea048d3
  updates: bz#1647962
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
  Signed-off-by: Susant Palai <spalai@redhat.com>
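  A minimal sketch of the bug class this Coverity finding describes
  (hypothetical code, not the locks translator): calling
  pthread_mutex_lock() where the unlock was intended leaves the mutex
  held and deadlocks a non-recursive mutex on the next entry.

      #include <pthread.h>

      static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
      static int pending;

      static void drop_pending(void)
      {
          pthread_mutex_lock(&ctx_lock);
          pending--;
          /* The defect would be a second pthread_mutex_lock() here;
           * the fix is the matching unlock. */
          pthread_mutex_unlock(&ctx_lock);
      }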
* lock: Do not allow meta-lock count to be more than one
  Susant Palai, 2018-11-09 (1 file changed, -1/+34)

  In the current scheme of glusterfs, where lock migration is
  experimental, (ideally) only the rebalance process which is migrating
  the file should request a metalock. Hence, the metalock count should not
  be more than one for an inode. In the future, if there is a need for
  meta-locks from other clients, this patch can be reverted.

  Since pl_metalk is called as part of the setxattr operation, any
  (non-rebalance) client process residing outside the trusted network can
  exhaust the memory of the server node by issuing setxattr repeatedly on
  the metalock key. The current patch makes sure that more than one
  metalock cannot be granted on an inode.

  Fixes CVE-2018-14660

  updates: bz#1647962
  Change-Id: Ie1e697766388718804a9551bc58351808fe71069
  Signed-off-by: Susant Palai <spalai@redhat.com>
* server: don't allow '/' in basename
  Amar Tumballi, 2018-11-08 (2 files changed, -5/+19)

  The server stack needs to perform all such validation itself, assuming
  clients can be compromised. It is possible for a compromised client to
  send basenames containing '/' and with that create files without
  permission on the server. By sanitizing the basename, and not allowing
  anything other than an actual directory as the parent for any entry
  creation, we mitigate this so that clients are not able to exploit the
  server.

  Fixes: CVE-2018-14651
  Fixes: bz#1647663
  Change-Id: I5dc0da0da2713452ff2b65ac2ddbccf1a267dc20
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
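  A minimal sketch of a basename sanity check of the kind described above
  (hypothetical helper, not the actual server-side code):

      #include <stdbool.h>
      #include <string.h>

      static bool basename_is_sane(const char *name)
      {
          if (!name || !*name)
              return false;              /* empty name */
          if (strchr(name, '/'))
              return false;              /* embedded path component */
          if (!strcmp(name, ".") || !strcmp(name, ".."))
              return false;              /* would escape the parent */
          return true;
      }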
* cluster/afr : Check for UP bricks before starting heal
  Ashish Pandey, 2018-11-08 (3 files changed, -1/+19)

  Problem: Currently, for a replica volume, even if only one brick is UP,
  SHD will keep crawling index entries even though it cannot heal
  anything. In a thin-arbiter volume, which is also a replica 2 volume,
  this causes inode lock contention, which in turn sends upcalls to all
  the clients to release notify locks even though nothing can be healed.
  This slows down client performance and defeats the purpose of keeping
  in-memory information about the bad brick.

  Solution: Before starting heal, or even crawling, check whether a
  sufficient number of children are UP and available to check and heal
  entries.

  (cherry picked from commit f73b4476b15f9d6d3dc3c8e20c9742aacd857f9f)
  Change-Id: I011c9da3b37cae275f791affd56b8f1c1ac9255d
  updates: bz#1644645
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* glusterd: allow shared-storage to use bricks under glusterd working directory
  Sanju Rakonde, 2018-11-08 (6 files changed, -11/+15)

  With commit 44e4db, we do not allow the user to create a volume using
  glusterd's working directory, or any subdirectory under it, as a brick.
  This has broken shared-storage, since the volume
  "gluster-shared-storage" is created using bricks under glusterd's
  working directory.

  With this patch, we let the "gluster-shared-storage" volume use bricks
  under glusterd's working directory.

  > BUG: bz#1647029
  > Change-Id: Ifcbcf4576eea12cf46f199dea287b29bd3ec3bfd
  > Signed-off-by: Sanju Rakonde <srakonde@redhat.com>

  (cherry picked from commit bdb4ca184913c82ccf9552298f5d5b597794f2aa)
  fixes: bz#1647801
  Change-Id: Ifcbcf4576eea12cf46f199dea287b29bd3ec3bfd
  Signed-off-by: Sanju Rakonde <srakonde@redhat.com>