summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/dht/src
Commit message (Collapse)AuthorAgeFilesLines
* cluster/dht: node-uuid for directories winds to all subvolumesAvra Sengupta2013-07-151-2/+6
| | | | | | | | | | | | | | | this works similar to pathinfo now except that the request is sent to all subvolumes of dht. Underlying replica selects it's subvolume in a round-robin fashion till one of them returns successfully. Change-Id: Ie46c5f7090d04d8c2e487b209916ae6791e94624 BUG: 847839 Original Author: Venky Shankar <vshankar@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/5225 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/*: get logic to calculate min() of the 'stime' xattrAvra Sengupta2013-07-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | * in both distribute and replicate (ignoring stripe for now), add logic to calculate the min() of stime values. * What is a 'stime' ? Why is this required: - stime means 'slave xtime', mainly used to keep track of slave node's sync status when distributed geo-replication is used. Logic of calculating 'min()' for this stime is very important as in case of crashes/reboots/shutdown, we will have to 'restart' with crawling from stime time value from the mount point, which gives the 'min()' of all the bricks, which means, we don't miss syncing any files in the above cases. Change-Id: I2be8d434326572be9d4986db665570a6181db1ee BUG: 847839 Original Author: Amar Tumballi <amarts@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4893 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Unlink dst file after cleanup during migrationshishir gowda2013-07-121-17/+17
| | | | | | | | | | | | | | | | | If a rename happens during migration, unlink fails. This leads to stale xattrs, and Sticky bits still being set. By removing the xattrs and Sticky bits from dst (through fd ops), the stale link file would be cleaned up eventually (even if unlink fails on src). Change-Id: Iec537d021905438327a20e1d811aa06e74034364 BUG: 983399 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5316 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: If linkfile unlink fails with ENOTCONN, do not failshishir gowda2013-07-111-1/+2
| | | | | | | | | | | | | | Currently if linkfile fails with ENOENT, we do not fail. We also need to treat failures with ENOTCONN as success, as if cached subvol is up, rm of a file should succeed. A stale linkfile will get removed later Change-Id: I71d136847933351ed9e2c939bda4a69bc96a3cfc BUG: 983416 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5317 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Ignore subvols with error in min-free-disk/inodesshishir gowda2013-07-105-17/+85
| | | | | | | | | | | | | | | Currently when selecting a alternative subvolume when hashed subvol has exceeded min-free-disk/inodes, we do not check if layouts have errors (including decommissioning). This leads to data being written to those subvolumes, and in case of decommissioning, will lead to data loss. Change-Id: Ie0c6cf4a29d7c53d8a6d8a8c1bd595cf58a0012a BUG: 982919 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5299 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* nufa: allow subvols with fanout > 1Krishnan Parthasarathi2013-07-041-67/+100
| | | | | | | | | | | | Previously, nufa wouldn't work on volume topologies such as distribute-replicate or distribute-stripe. Change-Id: Ia89ed4412a00601022c1fc94f046056ce4820fe8 BUG: 980838 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/5262 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: In reconfig handle removed decommissioned nodesshishir gowda2013-06-211-0/+26
| | | | | | | | | | | | | | If no decommissioned nodes options are set in the options, then clear the conf->decommissioned_bricks. Change-Id: I426d2bcc874aab21b2eba0b16a580b9a26672ea2 Signed-off-by: shishir gowda <sgowda@redhat.com> BUG: 973073 Reviewed-on: http://review.gluster.org/5199 Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterfs: discard (hole punch) supportBrian Foster2013-06-133-0/+139
| | | | | | | | | | | | | | | | Add support for the DISCARD file operation. Discard punches a hole in a file in the provided range. Block de-allocation is implemented via fallocate() (as requested via fuse and passed on to the brick fs) but a separate fop is created within gluster to emphasize the fact that discard changes file data (the discarded region is replaced with zeroes) and must invalidate caches where appropriate. BUG: 963678 Change-Id: I34633a0bfff2187afeab4292a15f3cc9adf261af Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/5090 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* gluster: add fallocate fop supportBrian Foster2013-06-133-0/+143
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement support for the fallocate file operation. fallocate allocates blocks for a particular inode such that future writes to the associated region of the file are guaranteed not to fail with ENOSPC. This patch adds fallocate support to the following areas: - libglusterfs - mount/fuse - io-stats - performance/md-cache,open-behind - quota - cluster/afr,dht,stripe - rpc/xdr - protocol/client,server - io-threads - marker - storage/posix - libgfapi BUG: 949242 Change-Id: Ice8e61351f9d6115c5df68768bc844abbf0ce8bd Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4969 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Make sure loc has gfidPranith Kumar K2013-06-101-0/+2
| | | | | | | | | | | | | | | | | | Problem: In some code paths neither loc->gfid nor loc->inode->gfid is populated which leads to EINVAL for linkfile setattr in dht_linkfile_attr_heal. Fix: Populate loc->gfid before dht_linkfile_attr_heal. Change-Id: I062770e6f6eaead304eff1dae81f8588a3b97eed BUG: 971805 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5178 Reviewed-by: Shishir Gowda <sgowda@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Fix unused-but-set-variable warningsPranith Kumar K2013-06-061-4/+0
| | | | | | | | | | Change-Id: Ie2b2dc54aa0d500c35752c72d3b562bcc05b1fc2 BUG: 969336 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5123 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Ignore ENOENT errors for unlink of linkfilesshishir gowda2013-06-061-2/+2
| | | | | | | | | | | | | If unlink of linkfile returns ENOENT, do not fail unlink. Proceed with unlinking of cached file. Change-Id: If7cec92b40c39d68dd9c3606c6c2c3a6bd67d27b BUG: 966848 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4971 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Prevent crash in dht_linkfile_lookupshishir gowda2013-06-041-3/+3
| | | | | | | | | | | | | Assign local = frame->local before dereferencing local->linkfile.linkfile_cbk. Additionally, fail if op_ret is non_zero. Change-Id: I96a2f34ba29887da9ccaae38a644431cf7c43265 BUG: 966858 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5141 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Set layout when inode is presentPranith Kumar K2013-06-041-1/+2
| | | | | | | | | | | | | | | | | Problem: Lookups in discovery fail with ENOENT so local->inode is never set. dht_layout_set logs the callstack when the function is called in that state. Fix: Don't set layout when lookups fail in discovery. Change-Id: I5d588314c89e3575fcf7796d57847e35fd20f89a BUG: 965434 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5055 Reviewed-by: Shishir Gowda <sgowda@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Return success in dht_discover if layout issuesJeff Darcy2013-06-033-33/+65
| | | | | | | | | | | | | | | | | We cannot heal in dht_discover, as it is a gfid based lookup, and not path based. So, returning error here would lead to app's to see failure. Also, update the layout in inode_ctx even if it has anomalies. Let subsequent heals fix the issue. Change-Id: I2358aadacf9a24e20a22ab0a6055c38c5eb6485c BUG: 960348 Original-author: shishir gowda <sgowda@redhat.com> Signed-off-by: shishir gowda <sgowda@redhat.com> Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4959 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Handle linkfile creation with EEXIST errorshishir gowda2013-05-311-0/+59
| | | | | | | | | | | | | If linkfile create fails with EEXISTS, then check if the file is a linkfile for the same file. If not, return the error Change-Id: Iab42db54422dea69de0049b5196365e65edadd91 BUG: 966858 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5060 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Linkfiles creation with correct uid/gidshishir gowda2013-05-312-20/+47
| | | | | | | | | | | | | | | | | | | | If renames are done with different uid/gid (non-owners), then we would end up with incorrect uid/gid. The fix is to create linkfiles, and heal the uid/gid as root:root. This preserves our notion of creation as root:root and heal the uid/gid as root:root in all paths. Additionally, we need to consider uid/gid from only src_cached subvol, and not from linkfiles. rename is also done as root:root if done on linkfile, as setattr of ownership on linkfile is done after the rename Change-Id: Icb5d431dc42da9c02dfae81980e3fe769a47a274 BUG: 884597 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4682 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/dht: Do not open fd in migration check/complete for non fd opsshishir gowda2013-05-312-1/+3
| | | | | | | | | | | | | | | | | if local->fd == NULL, then in dht_migration_check_complete, do not do open call. Let the layout get updated, and proceed with invoking the registered target_fn. if local->fd == NULL, do not call dht_rebalance_in_progress_check for truncate fop, but proceed with truncate2. Change-Id: Ia5a5d40bcea7bfb320ef7096af1e035b8847d4ff BUG: 960055 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4958 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: getxattr linkto as root:rootshishir gowda2013-05-311-6/+15
| | | | | | | | | | | | | | | | | | In path based op's like truncate, we use getxattr instead of fgetxattr call. These can fail with permission denied issues as linkto file creation, and setattr of ownership is not atomic, and in cases where setattr failed (subvols down..) The fix is to perform getxattr as root:root as it is a internal fop. fgetxattr, bypass the access check, as it already has a valid open fd. Change-Id: Ie221c9172e3c1c7ed4e50c8782d362826910756f BUG: 957074 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4890 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Do migration inprog/complete check only if ENOENTshishir gowda2013-05-312-2/+8
| | | | | | | | | | | | | | Additionally, update op_errno to the lasted failure. If failures found in complete_check, error returned would be EUCLEAN instead of the right failure (in this case ENOENT) Change-Id: Ib813867f4b817af651627b9ea07b0b09fa2b26ce BUG: 966852 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4989 Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* logging: Fix to avoid excessive logging.Venkatesh Somyajulu2013-05-281-1/+1
| | | | | | | | | | | | | | | | | | | | | mem_get function: Log message related to mem_pool calloc is removed as its been calculated in mempool 'stats'. This messgae is consuming nearly half of the total log messages in DEBUG mode. dht_hash_compute function: Changed log level from DEBUG to TRACE. client_fdctx_destroy function: Changed log level from DEBUG to TRACE. Change-Id: Ic948db0419e76df4e95ebd0cabaf66eadbaada6b BUG: 966851 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/5086 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht,posix: support for case discoveryAnand Avati2013-05-251-0/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is support for discovering a filename in a given directory which has a case insensitive match of a given name. It is implemented as a virtual extended attribute on the directory where the required filename is specified in the key. E.g: sh# getfattr -e "text" -n user.glusterfs.get_real_filename:FiLe-B /mnt/samba/patchy getfattr: Removing leading '/' from absolute path names # file: mnt/samba/patchy user.glusterfs.get_real_filename:FiLe-B="file-b" In reality, there can be multiple "answers" as the backend filesystem is case sensitive and there can be multiple files which can strcasecamp() successfully. In this case we pick the first matched file from the first responding server. If a matching file does not exist, we return ENOENT (and NOT ENODATA). This way the caller can differentiate between "unsupported" glusterfs API and file not existing. This API is used by Samba VFS to perform efficient discovery of the real filename without doing a full scan at the Samba level. Change-Id: I53054c4067cba69e585fd0bbce004495bc6e39e8 BUG: 953694 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4941 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* gfapi: link inodes in relevant entry FOPsAnand Avati2013-05-251-3/+3
| | | | | | | | | | | | Do not let inode linking to happen only in lookup(). While that works, it is inefficient. Change-Id: I51bbfb6255ec4324ab17ff00566375f49d120c06 BUG: 953694 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4931 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Ignore decommissioned subvol in overlap optimizationshishir gowda2013-05-241-0/+7
| | | | | | | | | | Change-Id: Ib727948c6e21b19fd509f258ff0aea1c5d1a84d1 BUG: 966845 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5056 Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Don't do extra unref in dht-migration checksPranith Kumar K2013-05-161-5/+2
| | | | | | | | | | | | | | | | | | | Problem: syncop_open used to perform a ref in syncop_open_cbk so the extra unref was needed but now syncop_open_cbk does not take a ref so no need to do extra unref. Fix: remove the extra fd_unref and let dht_local_wipe do the final unref. Change-Id: Ibe8f9a678d456a0c7bff175306068b5cd297ecc4 BUG: 961615 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4974 Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* object-storage: final removal of ufo codePeter Portante2013-05-101-54/+0
| | | | | | | | | | | | | | | | | | | See https://git.gluster.org/gluster-swift.git for the new location of the Gluster-Swift code. With this patch, no OpenStack Swift related RPMs are constructed. This patch also removes the unused code that references the user.ufo-test xattr key in the DHT translator. Change-Id: I2da32642cbd777737a41c5f9f6d33f059c85a2c1 BUG: 961902 (https://bugzilla.redhat.com/show_bug.cgi?id=961902) Signed-off-by: Peter Portante <peter.portante@redhat.com> Reviewed-on: http://review.gluster.org/4970 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Luis Pabon <lpabon@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Ignore non-participating subvols for layout checksshishir gowda2013-04-092-20/+88
| | | | | | | | | | | | | | | | | | | | | | | When subvols-per-directory is < available subvols, then there are layouts which are not populated. This leads to incorrect identification of holes or overlaps. We need to ignore layouts, which have err == 0, and start == stop. In the current scenario (start == stop == 0). Additionally, in layout-merge, treat missing xattrs as err = 0. In case of missing layouts, anomalies will reset them. For any other valid subvoles, err != 0 in case of layouts being zeroed out. Also reverted back dht_selfheal_dir_xattr, which does layout calculation only on subvols which have errors. Change-Id: I9f57062722c9e8a26285e10675c31a78921115a1 BUG: 921408 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4668 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* dht: make nufa/switch call dht's init/finiJeff Darcy2013-04-036-1032/+786
| | | | | | | | | | | | | These functions keep changing as new functionality is added, so copying and pasting the code is not a good solution. This way ensures that all fields get initialized properly no matter how much new stuff we throw in. Change-Id: I9e9b043d2d305d31e80cf5689465555b70312756 BUG: 924488 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4710 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: improve transform/detransform of d_off (and be ext4 safe)Anand Avati2013-04-011-5/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The scheme to encode brick d_off and brick id into global d_off has two approaches. Since both brick d_off and global d_off are both 64-bit wide, we need to be careful about how the brick id is encoded. Filesystems like XFS always give a d_off which fits within 32bits. So we have another 32bits (actually 31, in this scheme, as seen ahead) to encode the brick id - which is typically plenty. Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off, as the d_off is calculated based on a hash function value. This leaves us no "unused" bits to encode the brick id. However both these filesystmes (EXT4 more importantly) are "tolerant" in terms of the accuracy of the value presented back in seekdir(). i.e, a seekdir(val) actually seeks to the entry which has the "closest" true offset. This "two-prong" scheme exploits this behavior - which seems to be the best middle ground amongst various approaches and has all the advantages of the old approach: - Works against XFS and EXT4, the two most common filesystems out there. (which wasn't an "advantage" of the old approach as it is borken against EXT4) - Probably works against most of the others as well. The ones which would NOT work are those which return HUGE d_offs _and_ NOT tolerant to seekdir() to "closest" true offset. - Nothing to "remember in memory" or evict "old entries". - Works fine across NFS server reboots and also NFS head failover. - Tolerant to seekdir() to arbitrary locations. Algorithm: Each d_off can be encoded in either of the two schemes. There is no requirement to encode all d_offs of a directory or a reply-set in the same scheme. The topmost bit of the 64 bits is used to specify the "type" of encoding of this particular d_off. If the topmost bit (bit-63) is 1, it indicates that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0, it indicates that the "small" d_off encoding scheme is used. The goal of the "small" d_off encoding is to stay as dense as possible towards the lower bits even in the global d_off. The goal of the HUGE d_off encoding is to stay as accurate (close) as possible to the "true" d_off after a round of encoding and decoding. If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick ID (call it "n"). SMALL d_off =========== Encoding -------- If the top n + 1 bits are free in a brick offset, then we leave the top bit as 0 and set the remaining bits based on the old formula: hi_mask = 0xffffffffffffffff hi_mask = ~(hi_mask >> (n + 1)) if ((hi_mask & d_off_brick) != 0) do_large_d_off_encoding () d_off_global = (d_off_brick * N) + brick_id Decoding -------- If the top bit in the global offset is 0, it indicates that this is the encoding formula used. So decoding such a global offset will be like the old formula: if ((d_off_global & 0x8000000000000000) != 0) do_large_d_off_decoding() d_off_brick = (d_off_global % N) brick_id = d_off_global / N HUGE d_off ========== Encoding -------- If the top n + 1 bits are NOT free in a given brick offset, then we set the top bit as 1 in the global offset. The low n bits are replaced by brick_id. low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N)) d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id if (d_off_global == 0xffffffffffffffff) discard_entry(); Decoding -------- If the top bit in the global offset is set 1, it indicates that the encoding formula used is above. So decoding would look like: hi_mask = (0xffffffffffffffff << n) low_mask = ~(hi_mask) d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff) brick_id = global_d_off & low_mask If "losing" the low n bits in this decoding of d_off_brick looks "scary", we need to realize that till recently EXT4 used to only return what can now be expressed as (d_off_global >> 32). The extra 31 bits of hash added by EXT recently, only decreases the probability of a collision, and not eliminate it completely, anyways. In a way, the "lost" n bits are made up by decreasing the probability of collision by sharding the files into N bricks / EXT directories -- call it "hash hedging", if you will :-) Thanks-to: Zach Brown <zab@redhat.com> Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855 BUG: 838784 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4711 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht: make DHT xattr names configurableJeff Darcy2013-03-2111-125/+192
| | | | | | | | | | | | | | | | | | | | This is necessary to support "DHT over DHT" configurations, so that the upper and lower instances of DHT don't step all over each other. Why would we even consider such a thing? Because it gives us the ability to do data tiering and rack-aware placement, either by themselves or as complements to other functionality such as erasure codes or deduplication which save space but cost performance. By setting up the top-level DHT to place data into one of several lower-level DHT pools based on policy instead of pure elastic hashing, we get better performance for 90% of accesses and better storage efficiency for 90% of data, all for relatively low effort. Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070 BUG: 924265 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4694 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: fix a typoJulesWang2013-03-181-1/+1
| | | | | | | | | Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1 BUG: 764890 Signed-off-by: JulesWang <w.jq0722@gmail.com> Reviewed-on: http://review.gluster.org/4679 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Fix layout overlaps due to spread-count in selfheal pathshishir gowda2013-03-071-50/+12
| | | | | | | | | | | | | | | | We needed to zero out the layout range, before we re-calculate the range. When spread-count is issued, we would end up with stale ranges in the layout. Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets the un-used (after re-cal) layouts. Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2 BUG: 884455 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4647 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/Csaba Henk2013-03-031-2/+2
| | | | | | | | | Change-Id: I03be3cb23684de4ab36cf2953002708466edd580 BUG: 765433 Signed-off-by: Csaba Henk <csaba@redhat.com> Reviewed-on: http://review.gluster.org/4601 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Added the validation function for subvols-per-directoryAvra Sengupta2013-02-281-0/+2
| | | | | | | | | Change-Id: Ie2259023b9001311a2032792639c3093054f6750 BUG: 896431 Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4552 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Prevent spurious multiple defrag crawlsshishir gowda2013-02-281-9/+14
| | | | | | | | | | | | | | | | | | In dht_notify, we used to create a thread to start defrag crawls after we had heard from all child subvols. This was in-correct, as a later event, could also trigger the crawl again(due to the fact that all subvols had responded). The fix is to make sure, the thread is started only once after all subvols have responded the first time Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457 BUG: 916449 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4587 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/dht: print hash function munging logs in DEBUG modeAnand Avati2013-02-271-2/+2
| | | | | | | | | Change-Id: Ia2e6bce80710d103da9d78afdb389ea162b00686 BUG: 912564 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4590 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Add filter to support file patterns to be migratedshishir gowda2013-02-273-1/+120
| | | | | | | | | | | | | | | | | | | | | | | 'gluster volume rebalance' command will be enhanced to support passing of these options/pattern. <pattern> is comma separated list as show below. The Precedence is from right to left. e.g- "*avi,*pdf:10MB,*:1KB" The precedence is as follows: migrate all files with size equal or greater than 1KB "*:1KB" migrate all pdf files with size equal or greater than 10MB "*pdf:10MB" migrate all avi files "*avi" With this option, it is possible to choose which files to migrate. Change-Id: I6d6d6a015bcbacf1debae2f278a2d92306fb055d BUG: 896456 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4366 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Preserve file size during rebalance migrationshishir gowda2013-02-251-0/+6
| | | | | | | | | | | | | | | | | | If holes are encountered, then we do not write these to the dst, which sometimes causes file size to be lesser than src. Data is not corrupted, as when non-zero reads are received, we do write that data. Calling a truncrate to give file size to prevent it from being truncated to less than src in case the file end has holes. Thanks to Brian Foster for providing the test case Change-Id: I3cdd143b63ec8d797273d76189dff8b05eb9e551 BUG: 915554 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4574 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: Enable mem-accounting for nufaPranith Kumar K2013-02-191-0/+9
| | | | | | | | | Change-Id: I0cf8a59c81e30603aadc45393bbd49d80ae1621f BUG: 912657 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4542 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* distribute: add hash-name-regex optionJeff Darcy2013-02-185-23/+125
| | | | | | | | | | | | | | | | | | | | | This is to support the common "write to temp file then rename" idiom. In our case this causes us to create a linkfile during the rename in (N-1)/N cases, with a significant impact on performance e.g. for UFO which uses this idiom. This patch allows the user to specify up to two regular expressions that separate the permanent and transient parts of a temp-file name: rsync-hash-regex reimplements the existing RSYNC_FRIENDLY_NAME pattern where the temporary name for EXAMPLE is .EXAMPLE.suffix extra-hash-regex can be used for a second pattern, active concurrently, and can include alternatives via the POSIX extended-regex syntax Change-Id: Ic1a6ed19324bc31fefe91ee34b8478877a9c5d2c BUG: 912564 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4116 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: improvement in rebalance logsAmar Tumballi2013-02-171-2/+2
| | | | | | | | | | | | provide space between path and next string, so that we can grep for the correct path in error logs. Change-Id: Ide53e341864dce432406a58e8cbb2ff1480cfbde Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 815194 Reviewed-on: http://review.gluster.org/4531 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Reopen fds in migration internally as root:rootshishir gowda2013-02-141-1/+10
| | | | | | | | | | | | | | | Though linkfile_create and rebalance dst file create sent a setattr with correct ownership, there is still a race window where the linkfile open (client open due to migration) will fail, as its ownership will be root:root. Change-Id: I056092da6102319efa3bb9f1795f8c97db2a3fed BUG: 884597 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4513 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Remove suprious fd_unref callshishir gowda2013-02-141-2/+0
| | | | | | | | | | | | | After fix http://review.gluster.org/4282 (libglusterfsterfs/syncop: do not hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore. Change-Id: I5543986a4f2d965eef42ca979b39ac75439cec49 BUG: 910661 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4509 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Create linkfile with file uid/gidshishir gowda2013-02-134-4/+101
| | | | | | | | | | | | | | | | Currently, linkfile creation happens as root. use uid/gid returned from _cbk (link/rename) to set the correct ownership of the link files. Also added test/dht.rc to implement common dht functions Change-Id: Iecdcf52d89fed8286670ce430b9d19deebd27b8c BUG: 884597 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4304 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: pathinfo xattr changes for directoriesVenky Shankar2013-02-082-92/+224
| | | | | | | | | | | | | | | | | | | | Since directories have presence on all subvolumes there is no definite meaning of ->hashed_subvol or ->cached_subvol. getxattr() code path chooses ->cached_subvol for pathinfo extended attribute. While this makes sense of files, it makes less sense for directories. Further if a hashed or a cached subvolume is down, and there's a getxattr request for a directory, we return with an errno. This patch changes pathinfo extended attribute contents by aggregating information from all subvolumes that are up. Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d Signed-off-by: Venky Shankar <vshankar@redhat.com> BUG: 856455 Reviewed-on: http://review.gluster.org/4047 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Use proper libtool option -avoid-version instead of bogus -avoidversionAnand Avati2013-02-071-3/+3
| | | | | | | | | | Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8 BUG: 859835 Signed-off-by: Anand Avati <avati@redhat.com> Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Reviewed-on: http://review.gluster.org/3967 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht: better layout-optimization algorithmJeff Darcy2013-02-072-22/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This method deals with the case where swapping might gain a bigger overlap for the xlator currently under consideration, but sacrifices even more from the xlator we're swapping with. For example: A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554) B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9) C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff) Here, the new range for B has a bigger overlap with the old C than with the old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that might lead us to swap. However, such a swap turns the new C's overlap from 0x55555556 (vs. old C) to *zero* (vs. old B). In other words, we've gained 0x11111111 for B but lost 0x55555556 for C, so it's a bad idea. The new algorithm accounts for all effects of the swap, so it not only avoids bad swaps but can make some good ones that would have been missed previously. For example, if swapping a range X with a later range Y would not increase the overlap for X we would previously have skipped it even if the swap would increase Y's overlap without affecting X's. This is the normal case when we're adding a new brick (which initially has zero overlap with any old range) so finding more good swaps is probably even more important than avoiding bad ones. Also, the logic in dht_overlap_calc was completely broken before, causing integer overflows instead of providing correct values, so no matter what higher-level algorithm was in place the GIGO effect would have resulted in bad decisions. Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f BUG: 853258 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/3908 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Correct min_free_disk behaviourRaghavendra Talur2013-02-042-27/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Files were being created in subvol which had less than min_free_disk available even in the cases where other subvols with more space were available. Solution: Changed the logic to look for subvol which has more space available. In cases where all the subvols have lesser than Min_free_disk available , the one with max space and atleast one inode is available. Known Issue: Cannot ensure that first file that is created right after min-free-value is crossed on a brick will get created in other brick because disk usage stat takes some time to update in glusterprocess. Will fix that as part of another bug. Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455 BUG: 858488 Signed-off-by: Raghavendra Talur <rtalur@redhat.com> Reviewed-on: http://review.gluster.org/4420 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatchAnand Avati2013-02-031-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dht_mkdir_cbk, EEXIST error is treated like a true error. Because of this the following sequence of events can happen, eventually resulting in GFID mismatch and (and possibly leaked locks and hang, in the presence of replicate.) The issue exists when many clients concurrently attempt creation of directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir) 0. First mkdir happens by one client on the hashed subvolume. Only one client wins the race. Others racing mkdirs get EEXIST. Yet other "laggers" in the race encounter the just-created directory in lookup() on the hash dir. 1. At least one "lagger" lookup() notices that there are missing directories on other subvolumes (which the "winner" mkdir is yet to create), and starts off self-heal of the directory. 2. At least on some subvolumes, self-heal's mkdir wins the race against the "winner" mkdir and creates the directory first. This causes the "winner" mkdir to experience EEXIST error on those subvolumes. 3. On other subvolumes where "winner" mkdir won the race, self-heal experiences EEXIST error, but self-heal is properly translating that into a success (but mkdir code path is not -- which is the bug.) 4. Both mkdir and self-heal assign hash layouts to the just created directory. But self-heal distributes hash range across N (total) subvolumes, whereas mkdir distributes hash range across N - M (where M is the number of subvolumes where mkdir lost the race). Both the clients "cache" their respective layouts in the near future for all future creates inside them (evidence in logs) 5. During the creation of the subdirectory, two clients race again. Ideally winner performs mkdir() on the hashed subvolume and proceeds to create other dirs, loser experiences EEXIST error on the hashed subvolume and backs off. But in this case, because the two clients have different layout views of the parent directory (because of different hash splits and assignements), the hashed subvolumes for the new directory can end up being different. Therefore, both clients now win the race (they were never fighting against each other on a common server), assigning different GFIDs to the directory on their respective (different) subvolumes. Some of the remaining subvolumes get GFID1, others GFID2. Conclusion/Fix: Making mkdir translate EEXIST error as success (just the way self-heal is already rightly doing) will bring back truth to the design claim that concurrent mkdir/self-heals perform deterministic + idempotent operations. This will prevent the differing "hash views" by different clients and thereby also avoid GFID mismatch by forcing all clients to have a "fair race", because the hashed subvolume for all will be the same (and thereby avoiding leaked locks and hangs.) Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157 BUG: 907072 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4459 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com>
* cluster/dht: stack wind with cookiev3.4.0qa8Varun Shastry2013-01-313-29/+45
| | | | | | | | | | | | | | | Default_fops uses stack_wind_tail. It winds without creating the frame leading into wrong subvol return in the cookie. To avoid the problem caused by the same, we're getting the subvol by passing the cookie. Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f BUG: 893338 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/4388 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>