path: root/xlators
Commit message | Author | Age | Files | Lines
* cluster/afr: prevent piggyback on stale pre_op | Pranith Kumar K | 2013-04-02 | 1 | -33/+2
  Here are the logs of a file on which we saw EIO because of a size mismatch:

      [root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
      Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680
      Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680
      Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680
      Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716
      fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1

  According to these logs, fsync did not happen after the writev with offset: 79360, len: 15716, which is the cause of this problem.

  In total 3 writes came; let's call them w1, w2 and w3.
  - w1 does pre_op, so the pre_op_done[0] and pre_op_done[1] counts become 1 and 1.
  - is_piggyback_post_op() is then called for w1 and returns *false*, so w1's fsync is fired.
  - Now w2 and w3 come and see that pre_op_done[0] and pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are each incremented twice (once by w2, once more by w3) and become 2, 2. ------- Step-A
  - Now the fsync of w1 completes; w1 goes ahead with post-op and decrements pre_op_done[0] and pre_op_done[1] to 0, 0.
  - Now the w2 and w3 writevs complete and is_piggyback_post_op() returns *true* for both, so fsync is fired for neither w2 nor w3.

  This patch prevents Step-A from happening.

  Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
  BUG: 927146
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4752
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* dht: improve transform/detransform of d_off (and be ext4 safe) | Anand Avati | 2013-04-01 | 1 | -5/+82
  The scheme to encode brick d_off and brick id into global d_off has two approaches. Since both brick d_off and global d_off are 64 bits wide, we need to be careful about how the brick id is encoded.

  Filesystems like XFS always give a d_off which fits within 32 bits, so we have another 32 bits (actually 31 in this scheme, as seen ahead) to encode the brick id, which is typically plenty.

  Filesystems like recent EXT4 use up to the 63 low bits of d_off, as the d_off is calculated from a hash function value. This leaves us no "unused" bits to encode the brick id.

  However, both these filesystems (EXT4 more importantly) are "tolerant" in terms of the accuracy of the value presented back in seekdir(), i.e. a seekdir(val) actually seeks to the entry which has the "closest" true offset.

  This "two-prong" scheme exploits this behavior, which seems to be the best middle ground amongst various approaches, and has all the advantages of the old approach:

  - Works against XFS and EXT4, the two most common filesystems out there (which wasn't an "advantage" of the old approach, as it is broken against EXT4).
  - Probably works against most of the others as well. The ones which would NOT work are those which return HUGE d_offs _and_ are NOT tolerant to seekdir() to the "closest" true offset.
  - Nothing to "remember in memory" or evict "old entries".
  - Works fine across NFS server reboots and also NFS head failover.
  - Tolerant to seekdir() to arbitrary locations.

  Algorithm:

  Each d_off can be encoded in either of the two schemes. There is no requirement to encode all d_offs of a directory or a reply-set in the same scheme.

  The topmost bit of the 64 bits is used to specify the "type" of encoding of this particular d_off. If the topmost bit (bit-63) is 1, it indicates that the encoding scheme holds a HUGE d_off. If the topmost bit is 0, it indicates that the "small" d_off encoding scheme is used.

  The goal of the "small" d_off encoding is to stay as dense as possible towards the lower bits even in the global d_off.

  The goal of the HUGE d_off encoding is to stay as accurate (close) as possible to the "true" d_off after a round of encoding and decoding.

  If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick ID (call it "n").

  SMALL d_off
  ===========

  Encoding
  --------

  If the top n + 1 bits are free in a brick offset, then we leave the top bit as 0 and set the remaining bits based on the old formula:

      hi_mask = 0xffffffffffffffff
      hi_mask = ~(hi_mask >> (n + 1))
      if ((hi_mask & d_off_brick) != 0)
          do_large_d_off_encoding ()
      d_off_global = (d_off_brick * N) + brick_id

  Decoding
  --------

  If the top bit in the global offset is 0, it indicates that this encoding formula was used. So decoding such a global offset is the inverse of the old formula:

      if ((d_off_global & 0x8000000000000000) != 0)
          do_large_d_off_decoding()
      d_off_brick = (d_off_global / N)
      brick_id = d_off_global % N

  HUGE d_off
  ==========

  Encoding
  --------

  If the top n + 1 bits are NOT free in a given brick offset, then we set the top bit to 1 in the global offset. The low n bits are replaced by brick_id.

      low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N)); clears the low n bits
      d_off_global = (0x8000000000000000 | (d_off_brick & low_mask)) + brick_id
      if (d_off_global == 0xffffffffffffffff)
          discard_entry();

  Decoding
  --------

  If the top bit in the global offset is set to 1, it indicates that the encoding formula above was used. So decoding would look like:

      hi_mask = (0xffffffffffffffff << n)
      low_mask = ~(hi_mask)
      d_off_brick = (d_off_global & hi_mask & 0x7fffffffffffffff)
      brick_id = d_off_global & low_mask

  If "losing" the low n bits in this decoding of d_off_brick looks "scary", we need to realize that until recently EXT4 used to return only what can now be expressed as (d_off_global >> 32). The extra 31 bits of hash recently added by EXT4 only decrease the probability of a collision; they do not eliminate it completely anyway. In a way, the "lost" n bits are made up for by decreasing the probability of collision by sharding the files into N bricks / EXT directories -- call it "hash hedging", if you will :-)

  Thanks-to: Zach Brown <zab@redhat.com>
  Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
  BUG: 838784
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4711
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
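A minimal C sketch of the two-prong encode/decode described in the commit message above. It is not the GlusterFS source: bits_for(), dht_encode() and dht_decode() are hypothetical names, and N is assumed to be at most 2^62 so that every shift stays below 64. The SMALL decode is written as the exact inverse of the encode (d_off_brick = d_off_global / N, brick_id = d_off_global % N). Because the low n bits of a HUGE offset are always zero before the brick id goes in, OR-ing the id in gives the same result as the commit's "+ brick_id".

```c
#include <assert.h>
#include <stdint.h>

#define TOP_BIT  UINT64_C(0x8000000000000000)
#define ALL_ONES UINT64_C(0xffffffffffffffff)

/* n = ROOF(log2(N)): bits needed to store a brick id in [0, N). */
static unsigned bits_for(uint64_t N)
{
    unsigned n = 0;
    uint64_t ctrl = 1;

    assert(N >= 1 && N <= (UINT64_C(1) << 62)); /* keeps n + 1 <= 63 */
    while (ctrl < N) {
        ctrl <<= 1;
        n++;
    }
    return n;
}

/* Encode a brick-local d_off plus brick id into a global d_off.
 * Returns 0 on success, or -1 if the result equals the all-ones
 * value and the entry has to be discarded. */
static int dht_encode(uint64_t d_off_brick, uint64_t brick_id, uint64_t N,
                      uint64_t *d_off_global)
{
    unsigned n = bits_for(N);
    uint64_t hi_mask = ~(ALL_ONES >> (n + 1)); /* top n + 1 bits */

    assert(brick_id < N);
    if ((d_off_brick & hi_mask) == 0) {
        /* SMALL: the old dense formula; bit 63 stays 0. */
        *d_off_global = d_off_brick * N + brick_id;
        return 0;
    }

    /* HUGE: set bit 63 and replace the low n bits with the brick id. */
    *d_off_global = TOP_BIT | (d_off_brick & (ALL_ONES << n)) | brick_id;
    return (*d_off_global == ALL_ONES) ? -1 : 0;
}

/* Inverse of dht_encode(); HUGE offsets come back with the low n bits cleared. */
static void dht_decode(uint64_t d_off_global, uint64_t N,
                       uint64_t *d_off_brick, uint64_t *brick_id)
{
    unsigned n = bits_for(N);

    if ((d_off_global & TOP_BIT) == 0) {
        *d_off_brick = d_off_global / N;
        *brick_id = d_off_global % N;
    } else {
        uint64_t hi_mask = ALL_ONES << n;
        *d_off_brick = d_off_global & hi_mask & ~TOP_BIT;
        *brick_id = d_off_global & ~hi_mask;
    }
}
```

With N = 3 (so n = 2), an XFS-style offset 1000 on brick 2 encodes to 3002 and decodes back exactly. An EXT4-style offset 0x7000000000000005 on brick 1 takes the HUGE path and decodes to 0x7000000000000004. That small loss is what the commit relies on seekdir() tolerating.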
* mgmt/glusterd: Enable write-behind in nfs | Pranith Kumar K | 2013-04-01 | 1 | -1/+1
  We observed that, without write-behind in the graph, the number of write requests (and thus of inodelks) grows very rapidly, into the thousands.

  Change-Id: Id71c9c2b0a4c9601a4644a58a933221c62dab0c0
  BUG: 928341
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4734
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* storage/posix: honor O_SYNC and O_DSYNC sent in @flags of writev() | Anand Avati | 2013-03-29 | 2 | -5/+10
  Historic bug: to decide whether to fsync() after a write, posix_writev() has been inspecting pfd->flushwrites instead of checking @flags for O_SYNC|O_DSYNC. pfd->flushwrites was never set anywhere and is completely unused. This behavior dates from before anonymous FDs, when open() had a @wbflags param; it is a leftover from that cleanup.

  Change-Id: Id9bfe562a60db4eb3bd0a7705bdba91f2df2f3ec
  BUG: 916372
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4738
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
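A minimal POSIX sketch of the corrected rule: the O_SYNC/O_DSYNC bits in the flags sent with the write decide whether to flush, not a per-fd field. write_honoring_flags() and sync_kind_from_flags() are hypothetical names, not posix_writev(). Mapping O_DSYNC to fdatasync() and O_SYNC to fsync() is an assumption made for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

enum sync_kind { SYNC_NONE, SYNC_DATA, SYNC_FULL };

/* Decide the flush from the flags that came with this write. On Linux the
 * O_SYNC value includes the O_DSYNC bit, so O_SYNC is tested as a whole first. */
static enum sync_kind sync_kind_from_flags(int flags)
{
    if ((flags & O_SYNC) == O_SYNC)
        return SYNC_FULL;
    if ((flags & O_DSYNC) == O_DSYNC)
        return SYNC_DATA;
    return SYNC_NONE;
}

/* pwrite() the buffer, then flush as requested by @flags. */
static ssize_t write_honoring_flags(int fd, const void *buf, size_t len,
                                    off_t offset, int flags)
{
    ssize_t ret = pwrite(fd, buf, len, offset);

    if (ret < 0)
        return ret;

    switch (sync_kind_from_flags(flags)) {
    case SYNC_FULL:
        if (fsync(fd) != 0)          /* data and metadata */
            return -1;
        break;
    case SYNC_DATA:
        if (fdatasync(fd) != 0)      /* data only */
            return -1;
        break;
    case SYNC_NONE:
        break;
    }
    return ret;
}
```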
* cluster/afr: fix fd leak with unsafe call_resume() | Anand Avati | 2013-03-28 | 2 | -1/+17
  Introduce an AFR_CALL_RESUME macro which cleans up frame->local, as AFR_STACK_UNWIND etc. do. This fixes the leak in the afr_fsync() path.

  Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4745
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
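A simplified sketch of what a "resume that cleans up frame->local" macro can look like. The structs, local_release() and CALL_RESUME_AND_CLEANUP are hypothetical stand-ins, not GlusterFS's call_frame_t, call_stub_t or AFR_CALL_RESUME. The ordering (detach, resume, then release) is an assumption modeled on the unwind pattern the commit mentions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a call frame, its per-call state
 * (which may pin an fd) and a call stub. */
struct call_local { int holds_fd_ref; };
struct call_frame { struct call_local *local; };
struct call_stub {
    struct call_frame *frame;
    void (*resume)(struct call_frame *frame);
};

static int locals_released = 0;

static void local_release(struct call_local *local)
{
    /* A real cleanup would drop fd/inode refs here; the sketch frees and counts. */
    free(local);
    locals_released++;
}

/* Detach frame->local, resume the stub, then release the detached state,
 * so a resumed stub cleans up the same way an unwind does. */
#define CALL_RESUME_AND_CLEANUP(stub)                        \
    do {                                                     \
        struct call_local *local_ = (stub)->frame->local;    \
        (stub)->frame->local = NULL;                         \
        (stub)->resume((stub)->frame);                       \
        if (local_ != NULL)                                  \
            local_release(local_);                           \
    } while (0)

static void resume_noop(struct call_frame *frame) { (void)frame; }
```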
* cluster/afr: fsync before erase xattrs in data self-heal | Pranith Kumar K | 2013-03-28 | 1 | -1/+75
  Added an extra fsync to the data self-heal code to make sure the data has reached disk before erasing the changelogs.

  Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9
  BUG: 927146
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4744
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: piggyback and fsync resume changes | Pranith Kumar K | 2013-03-28 | 2 | -15/+18
  1) pre_op_piggyback should always be decremented.
  2) Move fsync resume to just after post_op.
  3) fsync stub should be created from afr's local, not from the final response.

  Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
  BUG: 927146
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4741
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: fsync() guarantees POST-OP completion | Anand Avati | 2013-03-27 | 3 | -10/+56
  AFR now provides a stronger guarantee that fsync() returns only after completely finishing all the deferred/delayed POST-OPs on that open file.

  To achieve this, we make a stub out of the returning fsync and register it with the "delayed" frame in afr_changelog_wake_resume(). The delayed frame, after getting woken up and finishing the POST-OP, will call_resume() the registered stub (which UNWINDs the fsync) at the time of frame destruction.

  This guarantees that an application's (or FUSE's) fsync() returns only after finishing up all the previous transactions, including delayed POST-OPs and UNLOCK.

  Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4737
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
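A hypothetical model of "park the fsync reply until the delayed POST-OP finishes". struct delayed_txn, fsync_done() and finish_delayed_post_op() are invented names, and a plain callback stands in for the call stub. In the flow the commit describes, the fsync path also wakes the delayed frame. Here that wake-up is left to the caller, which invokes finish_delayed_post_op() explicitly.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*reply_fn)(void *cookie);

/* A transaction whose POST-OP (and UNLOCK) has been deferred. */
struct delayed_txn {
    int post_op_pending;   /* 1 while the POST-OP is still outstanding */
    reply_fn resume;       /* parked fsync reply, if any */
    void *cookie;
};

/* fsync path: reply at once if nothing is deferred, otherwise park the reply. */
static void fsync_done(struct delayed_txn *txn, reply_fn reply, void *cookie)
{
    if (txn == NULL || !txn->post_op_pending) {
        reply(cookie);
        return;
    }
    assert(txn->resume == NULL);  /* this sketch parks at most one reply */
    txn->resume = reply;
    txn->cookie = cookie;
}

/* The delayed frame finishes POST-OP + UNLOCK, then releases the parked reply. */
static void finish_delayed_post_op(struct delayed_txn *txn)
{
    txn->post_op_pending = 0;
    if (txn->resume != NULL) {
        reply_fn fn = txn->resume;
        txn->resume = NULL;
        fn(txn->cookie);
    }
}

static void mark_replied(void *cookie) { *(int *)cookie = 1; }
```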
* cluster/afr: ensure DATA operations are made durable before POST-OP | Anand Avati | 2013-03-27 | 4 | -22/+314
  The changelogging scheme of AFR stores information about the state of all replicas in all replicas (in the extended attributes of the respective files on each server) in the form of 'pending counts' of operations (effectively "dirty flags"). These xattrs are blindly trusted while performing self-heal, and therefore utmost care has to be taken while updating and maintaining them.

  The most critical update is the clearing of the pending counts corresponding to the *other* server in the changelog of a given server. Before clearing the pending count, we need a durability guarantee for the write which was performed on the other server. To obtain such a guarantee, it may be necessary to explicitly introduce an fsync() phase (if the file itself wasn't already opened with O_SYNC).

  This patch introduces the detection of unstable writes on a file and issues an explicit fsync() on the servers before performing the POST-OP clearing of pending flags.

  Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4721
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
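A simplified sketch of "remember unstable writes, fsync before clearing the pending counts", modeled on a two-way replica. struct file_ctx, record_write(), post_op() and fake_fsync() are hypothetical. fake_fsync() only counts calls in place of sending fsync() to the servers. Treating a write as stable when either O_SYNC or O_DSYNC is set is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file state for a two-way replica. */
struct file_ctx {
    bool witnessed_unstable_write; /* a write since the last fsync lacked O_SYNC/O_DSYNC */
    int pending[2];                /* changelog "dirty" counts, one per replica */
};

static void record_write(struct file_ctx *ctx, int write_flags)
{
    if ((write_flags & (O_SYNC | O_DSYNC)) == 0)
        ctx->witnessed_unstable_write = true;
}

/* POST-OP: pending counts may only be cleared once the data is durable,
 * so an fsync phase runs first whenever an unstable write was seen. */
static int post_op(struct file_ctx *ctx, int (*do_fsync)(void *), void *arg)
{
    if (ctx->witnessed_unstable_write) {
        if (do_fsync(arg) != 0)
            return -1;   /* leave the counts set so self-heal still sees the file as dirty */
        ctx->witnessed_unstable_write = false;
    }
    ctx->pending[0] = 0;
    ctx->pending[1] = 0;
    return 0;
}

/* Stand-in for issuing fsync() to the servers; it only counts calls. */
static int fsync_calls;
static int fake_fsync(void *arg)
{
    (void)arg;
    fsync_calls++;
    return 0;
}
```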
* glusterd: Removed fd leaks in glusterfs_start utility function | Krishnan Parthasarathi | 2013-03-25 | 4 | -119/+79
  PROBLEM: The FILE* associated with the pidfile was leaked if pmap_registry_search on the brickinfo's path failed.

  FIX: Eliminate the use of the FILE* that was leaked. Use the glusterd_is_service_running utility function in place of the earlier attempt to check for the same.

  Change-Id: I94082bd5a94b8a6340f8cc11726d3264e364efe6
  BUG: 916549
  Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Reviewed-on: http://review.gluster.org/4596
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* config: better (i.e. more portable) test for libxml2 | Kaleb S. KEITHLEY | 2013-03-25 | 1 | -2/+2
  Over the weekend I tried to build on MacOS X¹ and ran into the following issues:

  1) The recent change to autogen.sh to test for pkg-config falls down.
  2) After removing the pkg-config test in autogen.sh, without pkg-config the PKG_CHECK_MODULES macro invocation in configure[.ac] falls down. N.B. Solaris users run into this too, even though there's a (broken) pkg-config package that can be installed.
  3) There are other problems in the code related to fuse that are beyond the scope of this.

  It seems that pkg-config is only a requirement for the definition of the PKG_CHECK_MODULES macro used to detect libxml2. Since this seems to be inherently unportable (at least to MacOS X and Solaris), I'd like to:

  A) Change the use of the PKG_CHECK_MODULES macro to the more portable AM_PATH_XML2 macro provided by the libxml2 package in /usr/.../share/aclocal/libxml.m4.
  B) Revisit the decision to add the check for pkg-config in autogen.sh in BZ 921817.

  For now this is just an RFC. If people are agreeable, I'll re-enter this change against BZ 921817.

  ¹Mountain Lion 10.8.3, XCode 4.6.1

  Change-Id: I237b1ed8919088345b8fd943423b2a6ad289981b
  BUG: 921817
  Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
  Reviewed-on: http://review.gluster.org/4720
  Reviewed-by: Justin Clift <jclift@redhat.com>
  Tested-by: Justin Clift <jclift@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: Simplify glusterd_service_stop() | Krishnan Parthasarathi | 2013-03-25 | 1 | -70/+11
  Change-Id: I396d250a3299ad1f7fce4bd14389b0c2756b6cb0
  BUG: 764890
  Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Reviewed-on: http://review.gluster.org/4718
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* mount: Added the xlator-option to mount.glusterfs script. | Avra Sengupta | 2013-03-22 | 1 | -0/+6
  Now all xlator-options can be set from the mount command as well.

  Example:

      mount -t glusterfs Hostname:/Volume_Name Mount_Point -o "xlator-option=xyz=123, xlator-option=abc=999"

  Change-Id: If52d994986839d1c969e3e2e01b2e1a29a3140b7
  BUG: 920583
  Signed-off-by: Avra Sengupta <asengupt@redhat.com>
  Reviewed-on: http://review.gluster.org/4660
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Shishir Gowda <sgowda@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Improve error logging when a brick from an old volume gets re-used | Niels de Vos | 2013-03-22 | 1 | -2/+7
  The error message shown when creating a volume that contains a brick with certain xattrs set on a parent directory is unclear. Users do not understand '... or a prefix of it is already part of a volume'.

  Most users check the final directory that is used for a brick, but not its parents. It would be helpful to present the user with the actual directory that is preventing the volume from using the brick.

  BUG: 923917
  Change-Id: I815ad32a992eb0e41ee8fca6ee9327400d042c45
  Signed-off-by: Niels de Vos <ndevos@redhat.com>
  Reviewed-on: http://review.gluster.org/4701
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* nfs: ACCESS - reply only what was asked for | Anand Avati | 2013-03-22 | 3 | -8/+12
  Set only those bits which were requested by the client. Some clients, like AIX, do not like the fact that we return the EXEC bit set in the ACCESS reply even though they only asked for the LOOKUP bit.

  Change-Id: I3c2fd5dce030ea5ddae0511497cafa078c4d76d6
  BUG: 924481
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4707
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
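A small sketch of the masking rule. The ACCESS3_* values are the bit definitions from the NFSv3 ACCESS procedure (RFC 1813). granted_from_perms() is a hypothetical, deliberately over-generous permission mapping: it grants both LOOKUP and EXECUTE for "x" to show why the reply must be intersected with the request. It is not the gluster NFS server's logic.

```c
#include <assert.h>
#include <stdint.h>

/* ACCESS3 request/reply bits (RFC 1813). */
#define ACCESS3_READ    0x0001
#define ACCESS3_LOOKUP  0x0002
#define ACCESS3_MODIFY  0x0004
#define ACCESS3_EXTEND  0x0008
#define ACCESS3_DELETE  0x0010
#define ACCESS3_EXECUTE 0x0020

/* Hypothetical mapping from rwx checks to grantable bits. */
static uint32_t granted_from_perms(int r, int w, int x)
{
    uint32_t g = 0;

    if (r)
        g |= ACCESS3_READ;
    if (w)
        g |= ACCESS3_MODIFY | ACCESS3_EXTEND | ACCESS3_DELETE;
    if (x)
        g |= ACCESS3_LOOKUP | ACCESS3_EXECUTE;
    return g;
}

/* The fix: never report a bit the client did not ask about. */
static uint32_t access3_reply(uint32_t requested, uint32_t granted)
{
    return requested & granted;
}
```

With an r-x directory, a client asking only for ACCESS3_LOOKUP gets back exactly ACCESS3_LOOKUP, with no stray EXECUTE bit.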
* dht: make DHT xattr names configurable | Jeff Darcy | 2013-03-21 | 12 | -125/+198
  This is necessary to support "DHT over DHT" configurations, so that the upper and lower instances of DHT don't step all over each other.

  Why would we even consider such a thing? Because it gives us the ability to do data tiering and rack-aware placement, either by themselves or as complements to other functionality such as erasure codes or deduplication, which save space but cost performance. By setting up the top-level DHT to place data into one of several lower-level DHT pools based on policy instead of pure elastic hashing, we get better performance for 90% of accesses and better storage efficiency for 90% of data, all for relatively low effort.

  Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
  BUG: 924265
  Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-on: http://review.gluster.org/4694
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
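A hypothetical sketch of why per-instance xattr names let two DHT layers coexist: each instance derives all of its names from its own configured base instead of a compiled-in constant. struct dht_xattr_names and dht_xattr_names_init() are invented for illustration. The base names used in the test below are illustrative only; "trusted.tier.dht" in particular is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Per-instance xattr names, all derived from one configurable base. */
struct dht_xattr_names {
    char layout[128];   /* directory layout xattr: "<base>" */
    char linkto[128];   /* link-file pointer xattr: "<base>.linkto" */
};

/* Returns 0 on success, -1 if a derived name would not fit. */
static int dht_xattr_names_init(struct dht_xattr_names *names, const char *base)
{
    int a = snprintf(names->layout, sizeof(names->layout), "%s", base);
    int b = snprintf(names->linkto, sizeof(names->linkto), "%s.linkto", base);

    if (a < 0 || (size_t)a >= sizeof(names->layout) ||
        b < 0 || (size_t)b >= sizeof(names->linkto))
        return -1;
    return 0;
}
```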
* features/marker: log error when unlinking timestamp file | Venky Shankar | 2013-03-21 | 1 | -4/+13
  ... so that it's easy to figure out which errno caused it. As of now it's only due to ENOSPC.

  Logging is done in the error handling routine, so any further changes that require unlinking the timestamp file due to some error condition(s) are logged too.

  Change-Id: Ia59338e2e32b2adbbd1d56aa260018270f1abae9
  BUG: 853911
  Signed-off-by: Venky Shankar <vshankar@redhat.com>
  Reviewed-on: http://review.gluster.org/4649
  Reviewed-by: Csaba Henk <csaba@redhat.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* performance/io-threads: Fix range-check for least-rate-limit | Pranith Kumar K | 2013-03-21 | 1 | -0/+1
  The issue could be fixed with .validate = GF_OPT_VALIDATE_MIN, but adding a max value is more robust.

  Change-Id: Ia69c6f86855dbd34a26e20391e77bfa0f796a200
  BUG: 923573
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4698
  Reviewed-by: Brian Foster <bfoster@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* nfs: handle stable write with @flags rather than fsync() | Anand Avati | 2013-03-20 | 2 | -68/+27
  Stable writes can be "made stable" by simply setting O_SYNC (or O_DSYNC, accordingly) in the write flags or fd->flags. Performing fsync() at the end of the write is extremely inefficient and completely messes up the eager-locking logic in AFR.

  Change-Id: I4d954c133641e246b2ab4df874bad0282667561f
  BUG: 916372
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4591
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
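A minimal sketch of folding the NFSv3 stability level into the write's own flags, as the commit describes, instead of issuing a trailing fsync(). The enum mirrors the stable_how values of the NFSv3 WRITE procedure (RFC 1813). write_flags_for() is a hypothetical helper, not the gluster NFS server code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>

/* stable_how argument of the NFSv3 WRITE procedure (RFC 1813). */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Fold the client's stability request into the flags of the write itself. */
static int write_flags_for(enum stable_how how, int base_flags)
{
    switch (how) {
    case FILE_SYNC:
        return base_flags | O_SYNC;    /* data and metadata on stable storage */
    case DATA_SYNC:
        return base_flags | O_DSYNC;   /* data on stable storage */
    case UNSTABLE:
    default:
        return base_flags;             /* a later COMMIT makes it durable */
    }
}
```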
* nfs, afr: Fail lookup only on split-brain | Pranith Kumar K | 2013-03-20 | 2 | -3/+7
  Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65
  BUG: 921996
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4699
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht: fix a typo | JulesWang | 2013-03-18 | 1 | -1/+1
  Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1
  BUG: 764890
  Signed-off-by: JulesWang <w.jq0722@gmail.com>
  Reviewed-on: http://review.gluster.org/4679
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* posix-acl: disable permission checks for fd based ops | shishir gowda | 2013-03-14 | 1 | -4/+4
  Signed-off-by: shishir gowda <sgowda@redhat.com>
  Change-Id: I9d49537c2c7b51d5598b80627d61f060aaec8549
  BUG: 921437
  Reviewed-on: http://review.gluster.org/4671
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* geo-rep: retire old style ssh setup | Csaba Henk | 2013-03-14 | 3 | -1/+9
  Users are still using geo-rep with the old, deprecated, insecure, unsupported ssh setup. Not their fault: the implementation of the new method had the following characteristics:

  - the old method is possible, but with default settings it's not working
  - it can be made operational by fiddling with the "remote-gsyncd" tunable
  - with default settings, an unhelpful, actually misleading error message is produced
  - the UI gave no hint of the changes in the ssh setup

  http://review.gluster.org/4392 tried to fix these; what it accomplished was unrestricted support for the bad practice (by making the default old setup operational). From now on:

  - we disable the old method by reserving the "remote-gsyncd" tunable
  - if the old method is attempted, we give a hint about what to do

  Change-Id: Icade94725d8d8d2d4c89cab992d4226351637b86
  BUG: 895656
  Signed-off-by: Csaba Henk <csaba@redhat.com>
  Reviewed-on: http://review.gluster.org/4602
  Reviewed-by: Venky Shankar <vshankar@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* Storage/posix: Don't log at ERROR level for failed getxattr. | Raghavendra Talur | 2013-03-12 | 1 | -0/+4
  Problem: ENOATTR returned by getxattr -n <NotAnExistingAttribute> <file> was being logged at ERROR level.

  Solution: Moved logging to DEBUG level.

  Change-Id: I982a577a4c231faa958ea71abdb272f8d5ffd70c
  BUG: 918052
  Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
  Reviewed-on: http://review.gluster.org/4628
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Preserve mtime in self-heal | Pranith Kumar K | 2013-03-12 | 1 | -14/+25
  Problem: Data self-heal may choose a sink's iatt to set mtimes. This happens because, after the data is synced, self-heal does one more xattrop/fstat to determine sources and sinks in order to set the inode-ctx. Since this is done after data syncing and the erasing of xattrs, the old source and the old sink are now both sources, but their mtimes differ. The old code just takes the first source from the list and updates mtimes from it, and that could be a file that was a sink before the self-heal started.

  Fix: Set mtime from the 'sources before syncing'.

  Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
  BUG: 918437
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4658
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Mark vol as deleted by renaming voldir before cleaning up the store | Krutika Dhananjay | 2013-03-11 | 4 | -44/+78
  PROBLEM: During 'volume delete', when glusterd fails to erase all information about a volume from the backend store (for instance because rmdir() failed on non-empty directories), not only does volume delete fail on that node, but subsequent attempts to restart glusterd also fail, because the volume store is left in an inconsistent state.

  FIX: Rename the volume directory path to a new location, <working-dir>/trash/<volume-id>.deleted, and then go on to clean up its contents. The volume is considered deleted once rename() succeeds, whether or not the cleanup succeeds.

  Change-Id: Iaf18e1684f0b101808bd5e1cd53a5d55790541a8
  BUG: 889630
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
  Reviewed-on: http://review.gluster.org/4639
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Kaushal M <kaushal@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
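A POSIX sketch of "rename first, clean up afterwards". rename() is the single atomic step that marks the volume deleted. The recursive removal that follows is best effort and cannot leave a half-deleted volume behind. delete_volume_dir() and remove_entry() are hypothetical names. The trash path layout follows the commit message, and nftw() is one possible way to do the cleanup.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* nftw() callback: with FTW_DEPTH, children are visited before their parent,
 * so remove() (unlink for files, rmdir for directories) works bottom-up. */
static int remove_entry(const char *path, const struct stat *sb, int type,
                        struct FTW *ftwbuf)
{
    (void)sb;
    (void)type;
    (void)ftwbuf;
    return remove(path);
}

/* Returns 0 once rename() succeeds, even if the cleanup later fails. */
static int delete_volume_dir(const char *voldir, const char *trashdir,
                             const char *volume_id)
{
    char dest[4096];
    int len = snprintf(dest, sizeof(dest), "%s/%s.deleted", trashdir, volume_id);

    if (len < 0 || (size_t)len >= sizeof(dest))
        return -1;
    if (rename(voldir, dest) != 0)
        return -1;              /* nothing changed: the volume is still intact */

    /* Best effort: a failure here leaves debris in the trash directory,
     * not an inconsistent volume store. */
    (void)nftw(dest, remove_entry, 16, FTW_DEPTH | FTW_PHYS);
    return 0;
}
```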
* glusterd: Fixed volume-sync in synctask codepath. | Krishnan Parthasarathi | 2013-03-10 | 1 | -9/+13
  Change-Id: I2911d3ac80825310f84c5ba6bd7890e65e1ee219
  BUG: 865700
  Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Reviewed-on: http://review.gluster.org/4624
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* glusterd: fix segfault on volume status detail | Lars Ellenberg | 2013-03-08 | 1 | -9/+11
  If for some reason glusterd_get_brick_root() fails, it frees the gf_strdup'ed *mount_point in its own error path and returns -1.

  Unfortunately, it had already assigned that pointer value to the output argument, so the caller, glusterd_add_brick_detail(), sees a non-NULL pointer and free()s it again: segfault.

  This could be fixed with a one-liner (*mount_point = NULL) in the error path, but I think glusterd_get_brick_root() should only assign to the output argument once all checks have passed, so I use a local temporary pointer, which makes the patch a bit larger.

  Change-Id: I3f3035f01e80a5e9bdf2da895e4cf7baa3dfbd2f
  BUG: 919352
  Signed-off-by: Lars Ellenberg <lars@linbit.com>
  Reviewed-on: http://review.gluster.org/4646
  Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
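A minimal sketch of the "publish the output argument only on success" pattern the commit adopts. brick_root_sketch() is a hypothetical stand-in, not glusterd_get_brick_root(). It doesn't resolve a real mount point, and its "must be an absolute path" check exists only to give the function a failure path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build the result in a local; assign *mount_point only after every check
 * has passed, so an error path never hands the caller a freed pointer. */
static int brick_root_sketch(const char *brick_path, char **mount_point)
{
    char *tmp = strdup(brick_path);

    if (tmp == NULL)
        return -1;
    if (tmp[0] != '/') {        /* illustrative check that can fail */
        free(tmp);
        return -1;              /* *mount_point left untouched */
    }
    *mount_point = tmp;         /* publish only on success */
    return 0;
}
```

A caller that unconditionally free()s its pointer after a failed call is now doing a harmless free(NULL) rather than a double free.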
* cluster/distribute: Fix layout overlaps due to spread-count in selfheal path | shishir gowda | 2013-03-07 | 1 | -50/+12
  We need to zero out the layout ranges before recalculating them; otherwise, when spread-count is set, we end up with stale ranges in the layout.

  Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets the layouts left unused after the recalculation.

  Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2
  BUG: 884455
  Signed-off-by: shishir gowda <sgowda@redhat.com>
  Reviewed-on: http://review.gluster.org/4647
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
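A hypothetical sketch of why the layout must be zeroed before it is recomputed. The 32-bit hash space is split evenly over the first `spread` slots, and a {0, 0} entry stands for "no range". struct range and fix_layout() are invented names. Real DHT chooses which subvolumes receive ranges differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hash range owned by one subvolume; {0, 0} means "no range" here. */
struct range { uint32_t start; uint32_t stop; };

/* Recompute the layout over the first @spread of @count subvolumes. Zeroing
 * every entry first prevents subvolumes outside the new spread from keeping
 * stale ranges that overlap the recomputed ones. */
static void fix_layout(struct range *layout, int count, int spread)
{
    uint64_t chunk;

    assert(spread >= 1 && spread <= count);
    memset(layout, 0, sizeof(*layout) * (size_t)count);

    chunk = (UINT64_C(1) << 32) / (uint64_t)spread;
    for (int i = 0; i < spread; i++) {
        layout[i].start = (uint32_t)(chunk * (uint64_t)i);
        layout[i].stop = (i == spread - 1)
                             ? UINT32_MAX
                             : (uint32_t)(chunk * (uint64_t)(i + 1) - 1);
    }
}
```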
* storage/posix: Remove a redundant lstat in posix_handle_hard. | Mohammed Junaid | 2013-03-06 | 1 | -7/+0
  Change-Id: I9129b71d5568eff3513c17e3607256783fdc42ec
  BUG: 903396
  Signed-off-by: Mohammed Junaid <junaid@redhat.com>
  Reviewed-on: http://review.gluster.org/4641
  Reviewed-by: Peter Portante <pportant@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
* nlm: use appropriate open flags while locking | Rajesh Amaravathi | 2013-03-05 | 1 | -1/+7
  In the case of a shared/read lock, open the file in O_RDONLY mode, and in the case of an exclusive lock, open the file in O_WRONLY mode, to emulate the behaviour of the POSIX fcntl implementation as described in the man pages.

  Change-Id: Ib9eab6570c3bc65f8bd48a14a9d801616213b295
  BUG: 916930
  Signed-off-by: Rajesh Amaravathi <rajesh@redhat.com>
  Reviewed-on: http://review.gluster.org/4603
  Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
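A POSIX sketch of the rule this commit emulates. fcntl() requires an fd open for reading to place a read (shared) lock, and an fd open for writing to place a write (exclusive) lock. open_flags_for_lock() and open_and_lock() are hypothetical helpers, not the NLM code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Shared locks need a readable fd, exclusive locks a writable one. */
static int open_flags_for_lock(short lock_type)
{
    return (lock_type == F_WRLCK) ? O_WRONLY : O_RDONLY;
}

/* Open @path with matching flags and take a non-blocking whole-file lock.
 * Returns the locked fd, or -1 on failure. */
static int open_and_lock(const char *path, short lock_type)
{
    int fd = open(path, open_flags_for_lock(lock_type));
    struct flock fl = { .l_type = lock_type, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };   /* l_len 0: to EOF */

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLK, &fl) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```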
* cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/ | Csaba Henk | 2013-03-03 | 3 | -7/+7
  Change-Id: I03be3cb23684de4ab36cf2953002708466edd580
  BUG: 765433
  Signed-off-by: Csaba Henk <csaba@redhat.com>
  Reviewed-on: http://review.gluster.org/4601
  Reviewed-by: Venky Shankar <vshankar@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Added description for nfs.transport-type option in volume set help. | Avra Sengupta | 2013-03-01 | 1 | -1/+4
  Change-Id: I9fe81dc1c3172158e8dd86c4fa2a04af18cb9dde
  BUG: 782285
  Signed-off-by: Avra Sengupta <asengupt@redhat.com>
  Reviewed-on: http://review.gluster.org/4582
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* Modified validation parameters for owner-uid and owner-gid. | Avra Sengupta | 2013-03-01 | 1 | -0/+4
  owner-uid and owner-gid will not receive negative values anymore.

  Change-Id: I82741d3d01b29e448294b2ec093fb70d22a5c77e
  BUG: 912297
  Signed-off-by: Avra Sengupta <asengupt@redhat.com>
  Reviewed-on: http://review.gluster.org/4581
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Turn on eager-lock for fd DATA transactions | Pranith Kumar K | 2013-03-01 | 2 | -21/+7
  Problem: With the present implementation, eager-lock is issued for any fd fop, so the eager-lock gets transferred to metadata transactions. But the lk-owner is set to the local->fd address only for DATA transactions; for METADATA transactions it is frame->root. Because of this, the unlock of the eager-lock fails and rebalance hangs.

  Fix: Enable eager-lock only for fd DATA transactions.

  Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a
  BUG: 916226
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4588
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: Added the validation function for subvols-per-directory | Avra Sengupta | 2013-02-28 | 8 | -26/+90
  Change-Id: Ie2259023b9001311a2032792639c3093054f6750
  BUG: 896431
  Signed-off-by: Avra Sengupta <asengupt@redhat.com>
  Reviewed-on: http://review.gluster.org/4552
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Fix some options in vme table | Kaushal M | 2013-02-28 | 1 | -7/+2
  Some of the options had invalid '.flags' members. In the original table these were supposed to be the op-versions, but the entries for the below options were missing the flags field, so the op-version was entered in its place.

  Change-Id: I408f5a972743eb37d9a58a809e8be8cb385bced8
  BUG: 903478
  Signed-off-by: Kaushal M <kaushal@redhat.com>
  Reviewed-on: http://review.gluster.org/4593
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Prevent spurious multiple defrag crawls | shishir gowda | 2013-02-28 | 1 | -9/+14
  In dht_notify, we used to create a thread to start defrag crawls after we had heard from all child subvols. This was incorrect, as a later event could trigger the crawl again (because all subvols had already responded).

  The fix makes sure the thread is started only once, after all subvols have responded for the first time.

  Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457
  BUG: 916449
  Signed-off-by: shishir gowda <sgowda@redhat.com>
  Reviewed-on: http://review.gluster.org/4587
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
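A single-threaded sketch of "start the crawl once, the first time every child has reported". struct dht_state, on_child_up() and start_defrag_crawl() are hypothetical. A counter stands in for the thread that the commit says is created at this point.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHILDREN 8

/* Hypothetical notify() bookkeeping for a distribute instance. */
struct dht_state {
    int child_count;
    bool child_up[MAX_CHILDREN];
    bool defrag_started;   /* latched the first time all children are up */
    int defrag_launches;   /* how many crawls were started */
};

static void start_defrag_crawl(struct dht_state *s)
{
    /* In the commit's description a crawler thread is created here. */
    s->defrag_launches++;
}

/* Called for every CHILD_UP event, including repeats after reconnects. */
static void on_child_up(struct dht_state *s, int child)
{
    assert(s->child_count <= MAX_CHILDREN);
    assert(child >= 0 && child < s->child_count);
    s->child_up[child] = true;

    for (int i = 0; i < s->child_count; i++)
        if (!s->child_up[i])
            return;

    if (!s->defrag_started) {
        s->defrag_started = true;
        start_defrag_crawl(s);
    }
}
```

If notify() can run concurrently, the check and set of defrag_started would need to happen under a lock. The sketch assumes events arrive one at a time.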
* Do not call xdr_string() with a NULL error message | Emmanuel Dreyfus | 2013-02-28 | 1 | -0/+1
  It is illegal to call xdr_string() with a NULL string. Linux just returns false; NetBSD gets a SIGSEGV when xdr_string() calls strlen(NULL).

  BUG: 916439
  Change-Id: Ia958470ada6e8e55a86d439922ec942d038f5f13
  Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org>
  Reviewed-on: http://review.gluster.org/4589
  Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Added validation function for stripe-block-size. (Avra Sengupta, 2013-02-27, 1 file, -2/+41)
  Change-Id: I050d01b01eac46550aa435da7d96a972e0393d35
  BUG: 770655
  Signed-off-by: Avra Sengupta <asengupt@redhat.com>
  Reviewed-on: http://review.gluster.org/4561
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: print hash function munging logs in DEBUG mode (Anand Avati, 2013-02-27, 1 file, -2/+2)
  Change-Id: Ia2e6bce80710d103da9d78afdb389ea162b00686
  BUG: 912564
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4590
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Add filter to support file patterns to be migrated (shishir gowda, 2013-02-27, 3 files, -1/+120)
  The 'gluster volume rebalance' command will be enhanced to support passing these options/patterns. <pattern> is a comma-separated list, as shown below. Precedence increases from right to left, so a pattern further to the left overrides one further to the right.
  e.g. "*avi,*pdf:10MB,*:1KB"
  From lowest to highest precedence:
    migrate all files with size equal to or greater than 1KB: "*:1KB"
    migrate all pdf files with size equal to or greater than 10MB: "*pdf:10MB"
    migrate all avi files: "*avi"
  With this option, it is possible to choose which files to migrate.
  Change-Id: I6d6d6a015bcbacf1debae2f278a2d92306fb055d
  BUG: 896456
  Signed-off-by: shishir gowda <sgowda@redhat.com>
  Reviewed-on: http://review.gluster.org/4366
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
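  The precedence rule can be read as "try the patterns left to right and let the first glob that matches decide". The sketch below implements that reading with POSIX fnmatch(); the function names, the 1024-based KB/MB/GB units, the 256-byte spec limit, and "no matching pattern means no migration" are all assumptions, not the actual dht-rebalance code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Converts "10MB"-style sizes to bytes. Assumption: binary (1024)
 * multiples and only KB/MB/GB suffixes; a bare number is bytes. */
static uint64_t parse_size(const char *s)
{
        char *end = NULL;
        uint64_t n = strtoull(s, &end, 10);

        if (strcmp(end, "KB") == 0) return n << 10;
        if (strcmp(end, "MB") == 0) return n << 20;
        if (strcmp(end, "GB") == 0) return n << 30;
        return n;
}

/* Decides whether a file named `name` of `size` bytes should be
 * migrated under `spec`, e.g. "*avi,*pdf:10MB,*:1KB". Patterns are
 * tried left to right; the first glob that matches decides, so a
 * pattern further left overrides one further right. A pattern with no
 * ":size" part accepts any size. Assumes spec fits in 255 bytes. */
static bool should_migrate(const char *spec, const char *name, uint64_t size)
{
        char buf[256];
        char *saveptr = NULL;
        char *tok;

        /* strtok_r() modifies its input, so work on a copy. */
        strncpy(buf, spec, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        for (tok = strtok_r(buf, ",", &saveptr); tok != NULL;
             tok = strtok_r(NULL, ",", &saveptr)) {
                char *colon = strchr(tok, ':');
                uint64_t min_size = 0;

                if (colon != NULL) {
                        *colon = '\0';          /* split glob from size */
                        min_size = parse_size(colon + 1);
                }
                if (fnmatch(tok, name, 0) == 0)
                        return size >= min_size; /* first match decides */
        }
        return false;   /* no pattern matched */
}
```

  Under this reading, `should_migrate("*avi,*pdf:10MB,*:1KB", "doc.pdf", 5u << 20)` is false: "*pdf" matches before the catch-all "*:1KB" is reached, and 5MB is below its 10MB threshold.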
* mgmt/glusterd: Move found-brick logs to DEBUG (Pranith Kumar K, 2013-02-27, 3 files, -4/+12)
  Change-Id: I1c311c21d7bdcad4956d3428bda39131c331cd7a
  BUG: 812356
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4585
  Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>
  Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* nfs/server: Fix multiple crashes in acl handling code. (Vijay Bellur, 2013-02-26, 1 file, -10/+16)
  Change-Id: I9b39a485c8b98d9eabe6153487f4dfbd26f8af13
  BUG: 915280
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
  Reviewed-on: http://review.gluster.org/4578
  Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* mgmt/glusterd: Expose error-gen options through volume set. (Vijay Bellur, 2013-02-26, 1 file, -0/+24)
  Change-Id: I7c696d99b43544923fb96d177229cdbac32c09fe
  BUG: 915280
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
  Reviewed-on: http://review.gluster.org/4577
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* debug/error-gen: Add support for reconfiguring options. (Vijay Bellur, 2013-02-26, 1 file, -66/+102)
  Change-Id: Ia10dc29e8608b02037b08e32a72766b6d43a98ba
  BUG: 915280
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
  Reviewed-on: http://review.gluster.org/4576
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* glusterd: Increasing throughput of synctask based mgmt ops. (Krishnan Parthasarathi, 2013-02-26, 3 files, -338/+386)
  Change-Id: Ibd963f78707b157fc4c9729aa87206cfd5ecfe81
  BUG: 913662
  Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Reviewed-on: http://review.gluster.org/4570
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* volgen: Use bind-address option for bricks when option set on glusterd (Krishnan Parthasarathi, 2013-02-26, 1 file, -8/+20)
  Brick processes listen on all interfaces on a given port. When multiple glusterds run on one machine, each glusterd assumes that it 'owns' the ports on that machine, which can lead the different glusterd instances to step on each other's ports. This fix ensures that brick processes listen only on their host IP when glusterd has the bind-address option set.
  Change-Id: I4c1b05643c64d3098bf56e977e768e611ffce0f5
  BUG: 913662
  Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Reviewed-on: http://review.gluster.org/4580
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Amar Tumballi <amarts@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
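  The difference between "all interfaces" and "only the host IP" comes down to the address passed to bind(). This is a minimal IPv4 sketch using plain POSIX sockets; `open_listener()` is a hypothetical helper, not the volgen or brick code, which instead writes the address into the generated volfile.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Opens a listening TCP socket on `port`. With bind_addr == NULL it
 * binds INADDR_ANY (every interface, the old brick behaviour); with an
 * address it binds only to that IPv4 address, as a brick would when
 * glusterd has bind-address set. Returns the fd, or -1 on error. */
static int open_listener(const char *bind_addr, unsigned short port)
{
        struct sockaddr_in sin;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(port);
        if (bind_addr == NULL)
                sin.sin_addr.s_addr = htonl(INADDR_ANY);
        else if (inet_pton(AF_INET, bind_addr, &sin.sin_addr) != 1)
                goto err;               /* not a valid IPv4 address */

        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
            listen(fd, 16) < 0)
                goto err;
        return fd;
err:
        close(fd);
        return -1;
}
```

  Two processes on one host can then reuse the same port number as long as each binds a different address, which is what keeps co-located glusterd instances from colliding.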
* cluster/distribute: Preserve file size during rebalance migration (shishir gowda, 2013-02-25, 1 file, -0/+6)
  If holes are encountered, we do not write them to the destination, which sometimes leaves the destination file smaller than the source. Data is not corrupted, since any non-zero data that is read does get written. Call truncate to set the file size, so that the destination does not end up shorter than the source when the file ends with holes. Thanks to Brian Foster for providing the test case.
  Change-Id: I3cdd143b63ec8d797273d76189dff8b05eb9e551
  BUG: 915554
  Signed-off-by: shishir gowda <sgowda@redhat.com>
  Reviewed-on: http://review.gluster.org/4574
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
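  The effect is easy to reproduce with an ordinary sparse copy: skipped zero blocks are never written, so a trailing hole leaves the copy short until it is truncated up to the source size. This sketch assumes a plain POSIX file-to-file copy with an arbitrary 4096-byte block; the rebalance code itself works through GlusterFS fops, and `copy_sparse()` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 4096   /* arbitrary copy block size for this sketch */

static bool all_zero(const char *buf, size_t len)
{
        for (size_t i = 0; i < len; i++)
                if (buf[i] != 0)
                        return false;
        return true;
}

/* Copies src_fd to dst_fd, skipping all-zero blocks so they stay holes
 * in the destination. Because skipped blocks are never written, a file
 * ending in a hole would come out short; the final ftruncate() sets
 * the destination to the source size. Short writes are treated as
 * errors for simplicity. Returns 0 on success, -1 on error. */
static int copy_sparse(int src_fd, int dst_fd)
{
        char buf[BLK];
        off_t off = 0;
        ssize_t n;
        struct stat st;

        if (fstat(src_fd, &st) < 0)
                return -1;

        while ((n = pread(src_fd, buf, sizeof(buf), off)) > 0) {
                if (!all_zero(buf, (size_t)n) &&
                    pwrite(dst_fd, buf, (size_t)n, off) != n)
                        return -1;
                off += n;
        }
        if (n < 0)
                return -1;

        /* Without this, trailing holes would shrink the copy. */
        return ftruncate(dst_fd, st.st_size);
}
```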
* cluster/afr: Don't queue transactions during open-fd fix (Pranith Kumar K, 2013-02-22, 6 files, -385/+167)
  Before anonymous fds were available, afr had to queue up transactions if the file was not opened on one of its subvolumes, until the attempt to open the file either succeeded or failed. These attempts continued until the file was successfully opened on the subvolume. Now the client xlator uses anonymous fds to perform fops when the fd used for the fop is not 'opened'. Fops succeed even when the file is not opened, so afr no longer needs to queue up transactions. Open is attempted on the subvolume where the file is not opened, independently of the fop.
  Change-Id: Id1a4b4ebe6f89f9efe8f6a8247918b91247d0819
  BUG: 913051
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4568
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
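  The change replaces "wait for open" with "proceed now, open in the background". The sketch below collapses that decision into one hypothetical helper, `pick_fd()`, over a hypothetical `struct file_ctx`; in GlusterFS the anonymous-fd fallback lives in the client xlator and the open retry in afr, so this only illustrates the control flow, not either layer's code.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SUBVOLS 4

/* Hypothetical per-file state kept by a replicating layer. */
struct file_ctx {
        bool opened[MAX_SUBVOLS];        /* open() succeeded on subvol i */
        bool open_pending[MAX_SUBVOLS];  /* background open requested */
};

enum fd_kind { FD_OPENED, FD_ANONYMOUS };

/* Never queues the fop. If the file is open on this subvolume, the
 * opened fd is used; otherwise the fop goes through an anonymous fd
 * and an open is requested on that subvolume independently of the
 * fop. Requesting again is harmless (the flag just stays set). */
static enum fd_kind pick_fd(struct file_ctx *ctx, int subvol)
{
        if (ctx->opened[subvol])
                return FD_OPENED;
        ctx->open_pending[subvol] = true;
        return FD_ANONYMOUS;
}
```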