summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
...
* dht: make nufa/switch call dht's init/finiJeff Darcy2013-04-038-1046/+806
| | | | | | | | | | | | | These functions keep changing as new functionality is added, so copying and pasting the code is not a good solution. This way ensures that all fields get initialized properly no matter how much new stuff we throw in. Change-Id: I9e9b043d2d305d31e80cf5689465555b70312756 BUG: 924488 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4710 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: add dispatch table for init/finiJeff Darcy2013-04-032-10/+27
| | | | | | | | | | | | | This adds a layer of indirection so that derivative translators such as NUFA and switch can refer to the parent's init/fini (in both cases DHT's) without having to create stub functions. Change-Id: I1af1fea70a9ddd2aa20485af7ae65f9660f19dd6 BUG: 924490 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4709 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Start rebalance with option readdir-optimize onshishir gowda2013-04-031-0/+1
| | | | | | | | | | | | | | With readdir-optimize set to on, we instruct the posix layer to ignore directory entries from not first subvolume. DHT discards directories returned from non first subvolume. By making posix itself ignore it, we are making directory crawls faster Change-Id: Ia1faf2dedec0c615c0632c3c063e846f5742ede6 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4613 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* build: require /usr/bin/fusermount when not carrying our own versionNiels de Vos2013-04-031-0/+3
| | | | | | | | | | | | | | | | | | The fuse.so from glusterfs-fuse will try to call /usr/bin/fusermount. This obviously fails when the fuse package is not installed and fusermount is not available. In order to prevent this problem, the glusterfs-fuse package should require /usr/bin/fusermount so that it gets automatically pulled in when glusterfs-fuse is installed with yum. BUG: 947830 Change-Id: I20fe836a1aaf751dbc04d9ec4ba5ea50573c71c5 Signed-off-by: Niels de Vos <ndevos@redhat.com> Reviewed-on: http://review.gluster.org/4768 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* rpc-lib: Move defaulting to socket message to debug levelRaghavendra Talur2013-04-021-1/+1
| | | | | | | | | | | | | | | | | | Problem: For every gluster cli operation from command line rpc init process is required. During init process we print "no transport type set, defaulting to socket" message at WARNING level, which is not necessary. Solution: Moved the log level to DEBUG. Change-Id: I73f4644264368b0f6c11a77ef66595018784ce79 BUG: 928204 Signed-off-by: Raghavendra Talur <rtalur@redhat.com> Reviewed-on: http://review.gluster.org/4727 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cli: add more meaningful error messagesBala.FA2013-04-021-84/+151
| | | | | | | | | Change-Id: I6e88e6763fa537f4705427b4673d86e6443c2c98 BUG: 928648 Signed-off-by: Bala.FA <barumuga@redhat.com> Reviewed-on: http://review.gluster.org/4747 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: add more specific log messagesBala.FA2013-04-021-8/+8
| | | | | | | | | Change-Id: I57fbdd83f3098e64886c3dd690a1ae04fc37442d BUG: 928648 Signed-off-by: Bala.FA <barumuga@redhat.com> Reviewed-on: http://review.gluster.org/4739 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Tests: Change rebalance status verificationPranith Kumar K2013-04-023-4/+4
| | | | | | | | | | | | | | According to the comment at the following URL https://bugzilla.redhat.com/show_bug.cgi?id=916226#c2 "success:" can come even before rebalance is completed. Changed it to check for "completed" instead. Change-Id: Ibe9d3b75493240f30261ac2a1280f32ef32886da BUG: 916226 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4614 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: detect in-progress creation in lookup and return ENOENTPranith Kumar K2013-04-021-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if any subvol returned ENOENT while parent entrylk lock was held, yield and return ENOENT for the entire lookup. This is how the issue happens: Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c' 1 Client A is in the middle of mkdir(/a). It has acquired lock. It has performed mkdir(/a) on one subvol, and second one is still in progress 2 Client B performs a lookup, sees directory /a on one, ENOENT on the other, succeeds lookup. 3 Client B performs lookup on /a/b on both subvols, both return ENOENT (one subvol because /a/b does not exist, another because /a itself does not exist) 4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with basename=b on one subvol, but fails on other subvol as /a is yet to be created by Client A. 5 Client A finishes mkdir of /a on other subvol 6 Client C also attempts to create /a/b, lookup returns ENOENT on both subvols. 7 Client C tries to obtain entrylk on on inode=/a with basename=b, obtains on one subvol (where B had failed), and waits for B to unlock on other subvol. 8 Client B finishes mkdir() on one subvol with GFID-1 and completes transaction and unlocks 9 Client C gets the lock on the second subvol, At this stage second subvol already has /a/b created from Client B, but Client C does not check that in the middle of mkdir transaction 10 Client C attempts mkdir /a/b on both subvols. It succeeds on ONLY ONE (where Client B could not get lock because of missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol. This way we have /a/b in GFID mismatch. One subvol got GFID-1 because Client B performed transaction on only one subvol (because entrylk() could not be obtained on second subvol because of missing parent dir -- caused by premature/speculative succeeding of lookup() on /a when locks are detected). Other subvol gets GFID-2 from Client C because while it was waiting for entrylk() on both subvols, Client B was in the middle of creating mkdir() on only one subvol, and Client C does not "expect" this when it is between lock() and pre-op()/op() phase of the transaction. Original-author: Anand Avati <avati@redhat.com> Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208 BUG: 860210 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4625 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Adds missing functions to ring.py, and more thorough tests.Alex Wheeler2013-04-022-14/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Situation: The function get_part_nodes is being called by Openstack-Swift's proxy/controllers/base.py: https://github.com/openstack/swift/blob/1.7.4/swift/proxy/controllers/base.py#L410 https://github.com/openstack/swift/blob/1.7.6/swift/proxy/controllers/base.py#L447 As this was not implemented in the current GlusterFS version of ring.py, it was calling swift's original get_part_nodes, which would often return the incorrect node, resulting in the incorrect drive being associated with a request. There is another function that the original ring.py implements -- get_other_nodes, which has to do with replication. Since GlusterFS is handling replication, this function should never be called. However, in the interest of completeness, that function is also being replaced. Code changes: The two functions, get_part_nodes, and get_other_nodes have been implemented to override the default functions, and get_nodes has been updated to store information in self vars, about the account being operated on, as the two new functions are not called with that info, and get_nodes appears to always be called first. The code has be refactored to all call _get_part_nodes, much like swift has refactored their code. Reason for implementation this way: I didn't see a better way to do it, but am open to suggestions. Test cases: A mock ring is created with two different devices, test and iops test_first_device: Ensure that the first device, test, is returned for both get_nodes, and get_part_node, and get_more_nodes returns volume_not_in_ring. test_invalid_device: Ensure that a request for a non-existant device returns volume_not_in_ring. test_second_device: Same as test_first_device, but for the second device, iops instead of test. test_second_device_with_reseller_prefix: Test that calling with the reseller prefix, AUTH_ will still return the correct data. Change-Id: I2f3d526934a47b01e9c065d0edf0fbf06f300369 BUG: 924792 Signed-off-by: Alex Wheeler <wheelear@gmail.com> Reviewed-on: http://review.gluster.org/4748 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* rpc/nfs: cleanup legacy code of general optionsRajesh Amaravathi2013-04-025-454/+268
| | | | | | | | | | | | | | | | | | Removing the code which handles "general" options. Since it is no longer possible to set general options which apply for all volumes by default, this was redundant. This cleanup of general options code also solves a bug wherein with nfs.addr-namelookup on, nfs.rpc-auth-reject wouldn't work on ip addresses Change-Id: Iba066e32f9a0255287c322ef85ad1d04b325d739 BUG: 921072 Signed-off-by: Rajesh Amaravathi <rajesh@redhat.com> Reviewed-on: http://review.gluster.org/4691 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* synctask: introduce synclocks for co-operative lockingAnand Avati2013-04-022-1/+167
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces a synclocks - co-operative locks for synctasks. Synctasks yield themselves when a lock cannot be acquired at the time of the lock call, and the unlocker will wake the yielded locker at the time of unlock. The implementation is safe in a multi-threaded syncenv framework. It is also safe for sharing the lock between non-synctasks. i.e, the same lock can be used for synchronization between a synctask and a regular thread. In such a situation, waiting synctasks will yield themselves while non-synctasks will sleep on a cond variable. The unlocker (which could be either a synctask or a regular thread) will wake up any type of lock waiter (synctask or regular). Usage: Declaration and Initialization ------------------------------ synclock_t lock; ret = synclock_init (&lock); if (ret) { /* lock could not be allocated */ } Locking and non-blocking lock attempt ------------------------------------- ret = synclock_trylock (&lock); if (ret && (errno == EBUSY)) { /* lock is held by someone else */ return; } synclock_lock (&lock); { /* critical section */ } synclock_unlock (&lock); Change-Id: I081873edb536ddde69a20f4a7dc6558ebf19f5b2 BUG: 763820 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4717 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <raghavendra@gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: sync xattrs removed on source to sink(s)Venky Shankar2013-04-025-12/+227
| | | | | | | | | | | | | xattrs are first removed from sink followed by setting source xattrs. Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b BUG: 906646 Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-on: http://review.gluster.org/4656 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: prevent piggyback on stale pre_opPranith Kumar K2013-04-021-33/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here are the logs of a file on which we saw EIO because of size mismatch: [root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680 Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680 Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680 Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716 fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1 According to these logs fsync did not happen after writev with offset: 79360, len: 15716. Which is the reason for this problem. In total 3 writes came. lets call them w1, w2, w3 w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1 then is_piggyback_post_op() is called for w1 and it returns *false* w1's fsync is fired Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice, once by w2, one more time by w3 and become 2, 2 ------- Step-A Now fsync of w1 is complete and it goes ahead with post op and decrements pre_op_done[0], pre_op_done[1] to 0, 0 Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for both w2, w3. So fsync is not fired for both w2, w3 this patch prevents Step-A from happening. Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4752 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* dht: improve transform/detransform of d_off (and be ext4 safe)Anand Avati2013-04-011-5/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The scheme to encode brick d_off and brick id into global d_off has two approaches. Since both brick d_off and global d_off are both 64-bit wide, we need to be careful about how the brick id is encoded. Filesystems like XFS always give a d_off which fits within 32bits. So we have another 32bits (actually 31, in this scheme, as seen ahead) to encode the brick id - which is typically plenty. Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off, as the d_off is calculated based on a hash function value. This leaves us no "unused" bits to encode the brick id. However both these filesystmes (EXT4 more importantly) are "tolerant" in terms of the accuracy of the value presented back in seekdir(). i.e, a seekdir(val) actually seeks to the entry which has the "closest" true offset. This "two-prong" scheme exploits this behavior - which seems to be the best middle ground amongst various approaches and has all the advantages of the old approach: - Works against XFS and EXT4, the two most common filesystems out there. (which wasn't an "advantage" of the old approach as it is borken against EXT4) - Probably works against most of the others as well. The ones which would NOT work are those which return HUGE d_offs _and_ NOT tolerant to seekdir() to "closest" true offset. - Nothing to "remember in memory" or evict "old entries". - Works fine across NFS server reboots and also NFS head failover. - Tolerant to seekdir() to arbitrary locations. Algorithm: Each d_off can be encoded in either of the two schemes. There is no requirement to encode all d_offs of a directory or a reply-set in the same scheme. The topmost bit of the 64 bits is used to specify the "type" of encoding of this particular d_off. If the topmost bit (bit-63) is 1, it indicates that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0, it indicates that the "small" d_off encoding scheme is used. The goal of the "small" d_off encoding is to stay as dense as possible towards the lower bits even in the global d_off. The goal of the HUGE d_off encoding is to stay as accurate (close) as possible to the "true" d_off after a round of encoding and decoding. If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick ID (call it "n"). SMALL d_off =========== Encoding -------- If the top n + 1 bits are free in a brick offset, then we leave the top bit as 0 and set the remaining bits based on the old formula: hi_mask = 0xffffffffffffffff hi_mask = ~(hi_mask >> (n + 1)) if ((hi_mask & d_off_brick) != 0) do_large_d_off_encoding () d_off_global = (d_off_brick * N) + brick_id Decoding -------- If the top bit in the global offset is 0, it indicates that this is the encoding formula used. So decoding such a global offset will be like the old formula: if ((d_off_global & 0x8000000000000000) != 0) do_large_d_off_decoding() d_off_brick = (d_off_global % N) brick_id = d_off_global / N HUGE d_off ========== Encoding -------- If the top n + 1 bits are NOT free in a given brick offset, then we set the top bit as 1 in the global offset. The low n bits are replaced by brick_id. low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N)) d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id if (d_off_global == 0xffffffffffffffff) discard_entry(); Decoding -------- If the top bit in the global offset is set 1, it indicates that the encoding formula used is above. So decoding would look like: hi_mask = (0xffffffffffffffff << n) low_mask = ~(hi_mask) d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff) brick_id = global_d_off & low_mask If "losing" the low n bits in this decoding of d_off_brick looks "scary", we need to realize that till recently EXT4 used to only return what can now be expressed as (d_off_global >> 32). The extra 31 bits of hash added by EXT recently, only decreases the probability of a collision, and not eliminate it completely, anyways. In a way, the "lost" n bits are made up by decreasing the probability of collision by sharding the files into N bricks / EXT directories -- call it "hash hedging", if you will :-) Thanks-to: Zach Brown <zab@redhat.com> Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855 BUG: 838784 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4711 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cli: Made volume top help string clearM S Vishwanath Bhat2013-04-011-4/+3
| | | | | | | | | | | | | | | | nfs option is not applicable for read-perf and write-perf nfs option and brick option can not be used in same command Change-Id: I920ba0de011df0cc5e0adca6597aaea9372fe592 BUG: 924335 Signed-off-by: M S Vishwanath Bhat <vbhat@redhat.com> Reviewed-on: http://review.gluster.org/4706 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Raghavendra Talur <rtalur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* mgmt/glusterd: Enable write-behind in nfsPranith Kumar K2013-04-011-1/+1
| | | | | | | | | | | | | We observed that the number of write requests thus inodelks are increasing very rapidly to thousands without write-behind in the graph. Change-Id: Id71c9c2b0a4c9601a4644a58a933221c62dab0c0 BUG: 928341 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4734 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* rpc: disable root-squash dynamically upon volume set commandRaghavendra Bhat2013-04-012-2/+68
| | | | | | | | | Change-Id: I2ba9ca339ffbe07cb74833165a46a941225b623d BUG: 927616 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4722 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* storage/posix: honor O_SYNC and O_DSYNC sent in @flags of writev()Anand Avati2013-03-292-5/+10
| | | | | | | | | | | | | | | | Historic bug - posix_writev() has been inspecting pfd->flushwrites for performing fsync() after write, instead of @flags for O_SYNC|O_DSYNC. pfd->flushwrites was never set anywhere and is unused completely. This is behavior from the time before anonymous FD where open() had @wbflags param. This is a leftover from that cleanup. Change-Id: Id9bfe562a60db4eb3bd0a7705bdba91f2df2f3ec BUG: 916372 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4738 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: fix fd leak with unsafe call_resume()Anand Avati2013-03-282-1/+17
| | | | | | | | | | | | | | Introduce AFR_CALL_RESUME macro which cleans up frame->local, like how AFR_STACK_UNWIND etc. do. Therefore fix leak in afr_fsync() path. Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4745 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: fsync before erase xattrs in data self-healPranith Kumar K2013-03-281-1/+75
| | | | | | | | | | | | Added extra fsync to data self-heal code to make sure the data reached disk before erasing the changelogs Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4744 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterfs.spec.in: sync with fedora glusterfs.specKaleb S. KEITHLEY2013-03-282-0/+13
| | | | | | | | | | | | | add --without ufo Change-Id: If1b77003ded537f9664fa6ad677d48d118516c64 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> BUG: 819130 Reviewed-on: http://review.gluster.org/4742 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: Luis Pabon <lpabon@redhat.com> Reviewed-by: Luis Pabon <lpabon@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: piggyback and fsync resume changesPranith Kumar K2013-03-283-16/+21
| | | | | | | | | | | | | | 1) pre_op_piggyback should always be decremented. 2) Move fsync resume to just after post_op. 3) fsync stub should be created from afr's local not from the final response. Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4741 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: fsync() guarantees POST-OP completionAnand Avati2013-03-273-10/+56
| | | | | | | | | | | | | | | | | | | | | | | | AFR now provides a stronger guarantee that fsync() returns only after completely finishing all the deferred/delayed POST-OP on that open file. To acheive this we make a stub out of the returning fsync and register it with the "delayed" frame in afr_changelog_wake_resume(). The delayed frame, after getting woken up and finishing the POST-OP will call_resume() the registered stub (which UNWINDs the fsync) at the time of frame destruction. This provides a guarantee that an application's (or FUSE) fsync() returns only after finishing up all the previous transactions, including delayed POST-OPs and UNLOCK. Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4737 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: ensure DATA operations are made durable before POST-OPAnand Avati2013-03-275-24/+316
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The changelogging scheme of AFR stores information about the state of all replicas in all replicas (in the extended attribute of the respective files on each server) in the form of 'pending counts' of operations (effectively "dirty flags"). These xattrs are blindly trusted while performing self-heal, and therefore utmost care has to be taken while updating and maintaing them. The most critical updation is the clearing of the pending counts corresponding to the *other* server in the changelog of a given server. Before clearing the pending count, we need durability guarantee of the write which was performed on the other server. To obtain such a guarantee, it may be necessary to explicitly introduce an fsync() phase (if the file itself wasn't already opened with O_SYNC). This patch introduces the detection of unstable stable writes on a file and issues explicit fsync() on the servers before performing the POST-OP clearing of pending flags. Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4721 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* libglusterfs/dict: fix infinite loop in dict_keys_join()Vijaykumar koppad2013-03-271-2/+4
| | | | | | | | | | | | - missing "pairs = next" caused infinite loop Change-Id: I9171be5bec051de6095e135d616534ab49cd4797 BUG: 905871 Signed-off-by: Vijaykumar Koppad <vijaykumar.koppad@gmail.com> Reviewed-on: http://review.gluster.org/4723 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: Removed fd leaks in glusterfs_start utility functionKrishnan Parthasarathi2013-03-256-119/+104
| | | | | | | | | | | | | | | | | | | PROBLEM: The FILE* associated with the pidfile was leaked if pmap_registry_search on the brickinfo' path failed. FIX: Eliminates the use of the FILE* that was leaked. Uses glusterd_is_service_running utility function in place of the earlier attempt to check for the same. Change-Id: I94082bd5a94b8a6340f8cc11726d3264e364efe6 BUG: 916549 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/4596 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* config: better (i.e. more portable) test for libxml2Kaleb S. KEITHLEY2013-03-254-15/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Over the weekend I tried to build on MacOS X¹ and ran into the following issues: 1) The recent change to autogen.sh to test for pkg-config falls down. 2) After removing the pkg-config test in autogen.sh, w/o pkg-config the PKG_CHECK_MODULES macro invocation in configure[.ac] falls down. N.B. Solaris users run into this too, even through there's a (broken) pkg-config package that can be installed. 3) There are other problems in the code related to fuse that are beyond the scope of this. It seems that pkg-config is only a requirement for the definition of the PKG_CHECK_MODULES macro used to detect libxml2. Since this seems to be inherently unportable — at least to MacOS X and Solaris — I'd like to: A) Change the use of the PKG_CHECK_MODULES macro to the more portable AM_PATH_XML2 macro provided by the libxml2 package in /usr/.../share/aclocal/libxml.m4 2) Revisit the decision to add the check for pkg-config in autogen.sh in BZ 921817. For now this is just an rfc. If people are agreeable I'll reenter this change against BZ 921817. ¹Mountain Lion 10.8.3, XCode 4.6.1 Change-Id: I237b1ed8919088345b8fd943423b2a6ad289981b BUG: 921817 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.org/4720 Reviewed-by: Justin Clift <jclift@redhat.com> Tested-by: Justin Clift <jclift@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: Simplify glusterd_service_stop()Krishnan Parthasarathi2013-03-251-70/+11
| | | | | | | | | Change-Id: I396d250a3299ad1f7fce4bd14389b0c2756b6cb0 BUG: 764890 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/4718 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterfsd: dump the in-memory graph rather than volfileAnand Avati2013-03-245-14/+108
| | | | | | | | | | | | | | | | | | | | | Currently we have been printing in the logfile, the volfile verbatim as received from the server. However we perform pre-processing on the graph we receive from the server, like adding ACL translator, applying --xlator-option cli params, etc. So print the serialized in-memory graph as the "volfile" in the log. This can be very handy to double check if certain --xlator-option param actually got applied or not, and in general is showing a "truer" representation of the real graph actually used. Change-Id: I0221dc56e21111b48a1ee3e5fe17a5ef820dc0c6 BUG: 924504 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4708 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com>
* mount: Added the xlator-option to mount.glusterfs script.Avra Sengupta2013-03-222-0/+56
| | | | | | | | | | | | | | | Now all xlator-options can be set from the mount command as well. Example : mount -t glusterfs Hostname:/Volume_Name Mount_Point -o "xlator-option=xyz=123, xlator-option=abc=999" Change-Id: If52d994986839d1c969e3e2e01b2e1a29a3140b7 BUG: 920583 Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4660 Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Shishir Gowda <sgowda@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterfsd: Fixed fd leak due to use of tmpfile()Krishnan Parthasarathi2013-03-223-32/+69
| | | | | | | | | Change-Id: I3c2dc070ebe967100170e39f3545acacc6016d61 BUG: 924075 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/4703 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* glusterd: Improve error logging when a brick from an old volume gets re-usedNiels de Vos2013-03-221-2/+7
| | | | | | | | | | | | | | | | | | | The error message when creating a volume that contains a brick with certain xatts set on a parent directory is unclear. Users do not understand '... or a prefix of it is already part of a volume'. Most users check the final directory that is used for a brick, but not its parents. It would be helpful to present the user with the actual directory that is preventing the volume to use the brick. BUG: 923917 Change-Id: I815ad32a992eb0e41ee8fca6ee9327400d042c45 Signed-off-by: Niels de Vos <ndevos@redhat.com> Reviewed-on: http://review.gluster.org/4701 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* nfs: ACCESS - reply only what was asked forAnand Avati2013-03-223-8/+12
| | | | | | | | | | | | | Set only those bits which were requested by the client. Some clients, like AIX, do not like the fact that we are returning the EXEC bit set in the ACCESS reply even though it only asked for LOOKUP bit. Change-Id: I3c2fd5dce030ea5ddae0511497cafa078c4d76d6 BUG: 924481 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4707 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* Added autogen.sh check for presence of tarJustin Clift2013-03-221-0/+6
| | | | | | | | | | Change-Id: I95313699edcf7bc2696505fcb475a4a67c1800cf BUG: 924891 Signed-off-by: Justin Clift <jclift@redhat.com> Reviewed-on: http://review.gluster.org/4716 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* glusterfs.spec.in: sync with fedora glusterfs.specKaleb S. KEITHLEY2013-03-221-18/+32
| | | | | | | | | | | | | | | | | remove hard-coded reference to swift version and dist tarballs minor improvements for rpm builds for regression tests, including adding cache on build.gluster.org to avoid random failures due to transient network or dns failures causing curl or git to fail. BUG: 819130 Change-Id: I4f1213056ae2987dd2202f9cfbb3ed4f16ffc7cf Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.org/4674 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Justin Clift <jclift@redhat.com> Tested-by: Justin Clift <jclift@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* socket: Make non-ssl sockets perform non-blocking connect()Krishnan Parthasarathi2013-03-221-0/+12
| | | | | | | | | Change-Id: Icb60cf7ad3ea7ca0eeb12fd19b95a6b340857bb2 BUG: 920916 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/4670 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: make DHT xattr names configurableJeff Darcy2013-03-2113-125/+233
| | | | | | | | | | | | | | | | | | | | This is necessary to support "DHT over DHT" configurations, so that the upper and lower instances of DHT don't step all over each other. Why would we even consider such a thing? Because it gives us the ability to do data tiering and rack-aware placement, either by themselves or as complements to other functionality such as erasure codes or deduplication which save space but cost performance. By setting up the top-level DHT to place data into one of several lower-level DHT pools based on policy instead of pure elastic hashing, we get better performance for 90% of accesses and better storage efficiency for 90% of data, all for relatively low effort. Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070 BUG: 924265 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4694 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* object-storage: Removed the redundant REMOTE_CLUSTER option.Mohammed Junaid2013-03-213-19/+3
| | | | | | | | | | | | | | Gluster cli uses the remote-host option to connect to the glusterd and by default it uses localhost to connect to glusterd. So, UFO code will use the remote-host option everytime to connect to the glusterd. Change-Id: I5a684d3c43fe9bdc9cc0b7c472a9d8145f9e1fd4 BUG: 878663 Signed-off-by: Mohammed Junaid <junaid@redhat.com> Reviewed-on: http://review.gluster.org/4690 Reviewed-by: Peter Portante <pportant@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* rpc-transport: fix glusterd crash when rdma.so missingRajesh Amaravathi2013-03-211-2/+4
| | | | | | | | | | | Add checks before trying to delete vol_opt from list and free Change-Id: I2858f58518394beb8f74fa477be81d7bdd38304f BUG: 924215 Signed-off-by: Rajesh Amaravathi <rajesh@redhat.com> Reviewed-on: http://review.gluster.org/4704 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* features/marker: log error when unlinking timestamp fileVenky Shankar2013-03-211-4/+13
| | | | | | | | | | | | | | | | ... so it's easy to figure out errno caused it. As of now it's only due to ENOSPC. Logging is done in the error handling routine, so any further changes that require unlinking of the timestamp file due to some error condition(s) are logged. Change-Id: Ia59338e2e32b2adbbd1d56aa260018270f1abae9 BUG: 853911 Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-on: http://review.gluster.org/4649 Reviewed-by: Csaba Henk <csaba@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* performance/io-threads: Fix range-check for least-rate-limitPranith Kumar K2013-03-212-0/+16
| | | | | | | | | | | | | The issue could be fixed with .validate=GF_OPT_VALIDATE_MIN. But adding max value is more robust. Change-Id: Ia69c6f86855dbd34a26e20391e77bfa0f796a200 BUG: 923573 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4698 Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* socket: Associated IP:port information with error logs to make debugging easierKrishnan Parthasarathi2013-03-201-3/+4
| | | | | | | | | Change-Id: I4e7268f74b392c5cee8c7b8fc4bb5ab41d74dd04 BUG: 922780 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/4686 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* rpc: before freeing the volume options object, delete it from the listRaghavendra Bhat2013-03-202-6/+7
| | | | | | | | | | | | | | | | | | | | | | * Suppose there is an xlator option which is considered by the xlator only if the source was built with debug mode enabled (the only example in the current code base is run-with-valgrind option for glusterd), then giving that option would make the process crash if the source was not built with debug mode enabled. Reason: In rpc, after getting the options symbol dynamically, it was stored in the newly allocated volume options structure and the structure's list head was added to the xlator's volume_options list. But while freeing the structure the list was not deleted. Thus when the list was traversed, already freed structure was accessed leading to segfault. Change-Id: I3e9e51dd2099e34b206199eae7ba44d9d88a86ad BUG: 922877 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4687 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cli: Better message on rebalance/remove-brick stopKaushal M2013-03-201-6/+43
| | | | | | | | | | | | A better message is displayed when rebalance or remove-brick is stopped, as migration may be still in progress. Change-Id: I5f301cc66f349a1a85245f7d7508bf746756bdea BUG: 923529 Signed-off-by: Kaushal M <kaushal@redhat.com> Reviewed-on: http://review.gluster.org/4695 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* nfs: handle stable write with @flags rather than fsync()Anand Avati2013-03-202-68/+27
| | | | | | | | | | | | | | | stable writes can be "made stable" by simply setting O_SYNC (or O_DSYNC, accordingly) in the write flags or fd->flags. Performing fsync() at the end of the write is extremely inefficient and completely messes up eager-locking logic in AFR. Change-Id: I4d954c133641e246b2ab4df874bad0282667561f BUG: 916372 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4591 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* iobuf: Added a function iobref_clearPrashanth Pai2013-03-202-1/+24
| | | | | | | | | | | Original-author: Venky Shankar <vshankar@redhat.com> Change-Id: Ibf861db6c1b084b798d210962344487a1919aad2 BUG: 921942 Signed-off-by: Prashanth Pai <nullpai@gmail.com> Reviewed-on: http://review.gluster.org/4595 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* nfs, afr: Fail lookup only on split-brainPranith Kumar K2013-03-202-3/+7
| | | | | | | | | | | Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65 BUG: 921996 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4699 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* rpc: For AF_UNIX sockets, do not set keepalive option.Venkatesh Somyajulu2013-03-181-2/+3
| | | | | | | | | | | Change-Id: I65df52e89fe7783ff00c2b7a49be571d15e8f0d0 BUG: 920009 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4684 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: fix a typoJulesWang2013-03-181-1/+1
| | | | | | | | | Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1 BUG: 764890 Signed-off-by: JulesWang <w.jq0722@gmail.com> Reviewed-on: http://review.gluster.org/4679 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>