summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* cluster/ec: Prevent Null dereference in dht-renamePranith Kumar K2015-06-161-1/+1
| | | | | | | | | | | Backport of http://review.gluster.com/11178 BUG: 1230653 Change-Id: I708d4d84b1341fb7cb515808836721735dc63f14 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11179 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* features/changelog: Avoid setattr fop logging during renameSaravanakumar Arumugam2015-06-113-23/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: When a file is renamed and the (renamed)file's Hashing falls into a different brick, DHT creates a special file(linkto file) in the brick(Hashed subvolume) and carries out setattr operation on that file. Currently, Changelog records this(setattr) operation in Hashed subvolume. glusterfind in turn records this operation as MODIFY operation. So, there is a NEW entry in Cached subvolume and MODIFY entry in Hashed subvolume for the same file. Solution: Avoid logging setattr operation carried out, by marking the operation as internal fop using xdata. In changelog translator, check whether setattr is set as internal fop and skip accordingly. Change-Id: I21b09afb5a638b88a4ccb822442216680b7b74fd BUG: 1230687 Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com> Reviewed-on: http://review.gluster.org/11183 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* stripe: fix use-after-freeJeff Darcy2015-06-101-4/+10
| | | | | | | | | | | | | | | | | | | Pretty much a classic case. STRIPE_STACK_UNWIND frees the "local" structure. In the "virtual xattr" path, used for lock recovery among other things, we were calling STRIPE_STACK_UNWIND and then continuing to clean up "our" parts of the just-freed structure. Oops. Change-Id: Ifa961b89cd21a2893de39a9eea243d184f9eac46 BUG: 1228510 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/11037 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Niels de Vos <ndevos@redhat.com> (cherry picked from commit 62992ac27d729ecc7da500ce42dc46592c13d003) Reviewed-on: http://review.gluster.org/11145 Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: EC_XATTR_DIRTY doesn't come in responsePranith Kumar K2015-06-062-27/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.com/11078 Problem: ec_update_size_version expects all the keys it did xattrop with to come in response so that it can set the values again in ec_update_size_version_done. But EC_XATTR_DIRTY is not combined so the value won't be present in the response. So ctx->post/pre_dirty are not updated in ec_update_size_version_done. So these values are still non-zero. When ec_unlock_now is called as part of flush's unlock phase it again tries to perform same xattrop for EC_XATTR_DIRTY. But ec_update_size_version is not expected to be called in unlock phase of flush because ec_flush_size_version should have reset everything to zero and unlock is never invoked from ec_update_size_version_done for flush/fsync/fsyncdir. This leads to stale lock which leads to hang. Fix: EC_XATTR_DIRTY is removed in ex_xattrop_cbk and is never combined with other answers. So remove handling of this in the response. BUG: 1228160 Change-Id: I657efca6e706e7acb541f98f526943f67562da9f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11084 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
* dht: Add lookup-optimize configuration option for DHTShyam2015-06-054-16/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently with commit 4eaaf5 a mixed version cluster would have issues if lookup-uhashed is set to auto, as older clients would fail to validate the layouts if newer clients (i.e 3.7 or upwards) create directories. Also, in a mixed version cluster rebalance daemon would set commit hash for some subvolumes and not for the others. This commit fixes this problem by moving the enabling of the functionality introduced in the above mentioned commit to a new dht option. This option also has a op_version of 3_7_1 thereby preventing it from being set in a mixed version cluster. It brings in the following changes, - Option can be set only if min version of the cluster is 3.7.1 or more - Rebalance and mkdir update the layout with the commit hashes only if this option is set, hence ensuring rebalance works in a mixed version cluster, and also directories created by newer clients do not cause layout errors when read by older clients - This option also supersedes lookup-unhased, to enable the optimization for lookups more deterministic and not conflict with lookup-unhashed settings. Option added is cluster.lookup-optimize, which is a boolean. Usage: # gluster volume set VOLNAME cluster.lookup-optimize on Change-Id: Ifd1d4ce3f6438fcbcd60ffbfdbfb647355ea1ae0 BUG: 1225940 Signed-off-by: Shyam <srangana@redhat.com> Reviewed-on: http://review.gluster.org/10976 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: fix incorrect dst subvol info in inode_ctxNithya Balachandran2015-06-046-68/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stashing additional information in the inode_ctx to help decide whether the migration information is stale, which could happen if a file was migrated several times but FOPs only detected the P1 migration phase. If no FOP detects the P2 phase, the inode ctx1 is never reset. We now save the src subvol as well as the dst subvol in the inode ctx. The src subvol is the subvol on which the FOP was sent when the mig info was set in the inode ctx. This information is considered stale if: 1. The subvol on which the current FOP is sent is the same as the dst subvol in the ctx 2. The subvol on which the current FOP is sent is not the same as the src subvol in the ctx This does not handle the case where the same file might have been renamed such that the src subvol is the same but the dst subvol is different. However, that is unlikely to happen very often. Change-Id: I05a2e9b107ee64750c7ca629aee03b03a02ef75f BUG: 1225809 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/10967 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
* dht/rebalance : Fixed rebalance failureNithya Balachandran2015-06-042-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The rebalance process determines the local subvols for the node it is running on and only acts on files in those subvols. If a dist-rep or dist-disperse volume is created on 2 nodes by dividing the bricks equally across the nodes, one process might determine it has no local_subvols. When trying to update the commit hash, the function attempts to lock all local subvols. On the node with no local_subvols the dht inode lock operation fails, in turn causing the rebalance to fail. In a dist-rep volume with 2 nodes, if brick 0 of each replica set is on node1 and brick 1 is on node2, node2 will find that it has no local subvols. Change-Id: I7d73b5b4bf1c822eae6df2e6f79bd6a1606f4d1c BUG: 1221656 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on-master: http://review.gluster.org/10786 Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10788 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Fix dht_setxattr to follow files under migrationNithya Balachandran2015-06-042-17/+347
| | | | | | | | | | | | | | | | If a file is under migration, then any xattrs created on it are lost post migration of the file. This is because the xattrs are set only on the cached subvol of the source and as the source is under migration, it becomes a linkto file post migration. Change-Id: Ib8e233b519cf954e7723c6e26b38fa8f9b8c85c0 BUG: 1225839 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/10968 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: pass a destination subvol to fop2 variants to avoid races.Raghavendra G2015-06-045-134/+193
| | | | | | | | | | | | | | | | | | The destination subvol used in the fop2 variants is either stored in inode-ctx1 or local->cached_subvol. However, it is not guaranteed that a value stored in these locations before invocation of fop2 is still present after the invocation as these locations are shared among different concurrent operations. So, to preserve the atomicity of "check dst-subvol and invoke fop2 variant if dst-subvol found", we pass down the dst-subvol to fop2 variant. This patch also fixes error handling in some fop2 variants. Change-Id: Icc226228a246d3f223e3463519736c4495b364d2 BUG: 1225809 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/10966 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Don't rely on linkto xattr to find destination subvolRaghavendra G2015-06-031-101/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | during phase 2 of migration. linkto xattr on source file cannot be relied to find where the data file currently resides. This can happen if there are multiple migrations before phase 2 detection by a client. For eg., * migration (M1, node1, node2) starts. * application writes some data. DHT correctly stores the state in inode context that phase-1 of migration is in progress * migration M1 completes * migration (M2, node2, node3) is triggered and completed * application resumes writes to the file. DHT identifies it as phase-2 of migration. However, linkto xattr on node1 points to node2, but the file is on node3. A lookup correctly identifies node3 as cached subvol TBD: When we identify phase-2 of a previous migration (say M1), there might be a migration in progress - say (M3, node3, node4). In this case we need to send writes to both (node3, node4) not just node3. Also, the inode state needs to correctly indicate that its in phase-1 of migration. I'll send this as a different patch. Change-Id: I1a861f766258170af2f6c0935468edb6be687b95 BUG: 1225809 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/10805 Reviewed-on: http://review.gluster.org/10965 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: honour selfheal enable/disable volume set optionsRavishankar N2015-06-032-3/+11
| | | | | | | | | | | | | | | | | | | | | | | afr-v1 had the following volume set options that are used to enable/ disable self-heals from happening in AFR xlator when loaded in the client graph: cluster.metadata-self-heal cluster.data-self-heal cluster.entry-self-heal In afr-v2, these 3 heals can happen from the client if there is an inode refresh. This patch allows such heals to proceed only if the corresponding volume set options are set to true. Change-Id: I8d97d6020611152e73a269f3fdb607652c66cc86 BUG: 1227674 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/11012 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> (cherry picked from commit da111ae21429d33179cd11409bc171fae9d55194) Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/11062
* cluster/ec: Fix incorrect check for iatt differencesXavier Hernandez2015-06-031-5/+19
| | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/11018 A previous patch (http://review.gluster.org/10974) introduced a bug that caused that some metadata differences could not be detected in some circumstances. This could cause that self-heal is not triggered and the file not repaired. We also need to consider all differences for lookup requests, even if there isn't any lock. Special handling of differences in lookup is already done in lookup specific code. Change-Id: I88e3678634d848a89f428a37209c31464e42cf8b BUG: 1225796 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/11027 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* dht/rebalance: Change log_level to DEBUGSusant Palai2015-06-011-2/+3
| | | | | | | | | | | BUG: 1221503 Change-Id: I4ac87bb69e05e2dd445fc0a5bcf48d5bd3d0020b Reviewed-on: http://review.gluster.org/10756 Reviewed-by: N Balachandran <nbalacha@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10778
* cluster/tier: load libgfdb.so properly in all casesNiels de Vos2015-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | We should load libgfdb.so.0, not libgfdb.so Cherry picked from commit 628406f28364f6019261a3bb37335a494ccf8dda: > Change-Id: I7a0d64018ccd9893b1685de391e99b5392bd1879 > BUG: 1222092 > Signed-off-by: Dan Lambright <dlambrig@redhat.com> > Reviewed-on: http://review.gluster.org/10796 > Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> > Reviewed-by: Joseph Fernandes > Reviewed-by: Niels de Vos <ndevos@redhat.com> > Tested-by: Gluster Build System <jenkins@build.gluster.com> Change-Id: I7a0d64018ccd9893b1685de391e99b5392bd1879 BUG: 1221534 Signed-off-by: Niels de Vos <ndevos@redhat.com> Reviewed-on: http://review.gluster.org/10799 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Dan Lambright <dlambrig@redhat.com> Reviewed-by: mohammed rafi kc <rkavunga@redhat.com>
* tiering/rebalance: Use separate pid/socket file for tieringMohammed Rafi KC2015-05-311-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | Back port of http://review.gluster.org/10792 When promotion/demotion daemon starts, it uses the same pidfile as rebalance. This patch will introduce a different pid file for the same. >Change-Id: Ic484c53f51e00ae6b2d697748a9600b14829e23b >BUG: 1221970 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> >Reviewed-on: http://review.gluster.org/10792 >Reviewed-by: Atin Mukherjee <amukherj@redhat.com> >Tested-by: Gluster Build System <jenkins@build.gluster.com> >Tested-by: NetBSD Build System Change-Id: Idda13e983ffd443672aee0873ee51e8cc7089c49 BUG: 1221969 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/10980 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Joseph Fernandes Reviewed-by: Kaushal M <kaushal@redhat.com>
* tier: Do not allow detach-tier commands on a non-tiered volumeMohammed Rafi KC2015-05-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Back port of http://review.gluster.org/10773 >Change-Id: Ic92d25db68e40ef4a4388ef42affd1b3ee5a7ec6 >BUG: 1221270 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> >Reviewed-on: http://review.gluster.org/10773 >Reviewed-by: Atin Mukherjee <amukherj@redhat.com> >Reviewed-by: Raghavendra G <rgowdapp@redhat.com> >Reviewed-by: Kaushal M <kaushal@redhat.com> >Tested-by: Gluster Build System <jenkins@build.gluster.com> >Tested-by: NetBSD Build System >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Change-Id: I4b52da590dfcca8edc7e2b7e0c24c5dab7983c10 BUG: 1221967 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/10979 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Joseph Fernandes Reviewed-by: Kaushal M <kaushal@redhat.com>
* cluster/ec: Ignore differences in non locked inodesXavier Hernandez2015-05-305-28/+100
| | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/10974 When ec combines iatt structures from multiple bricks, it checks for equality in important fields. This is ok for iatt related to inodes involved in the operation that have been locked before starting execution. However some fops return iatt information from other inodes. For example a rename locks source and destination parent directories, but it also returns an iatt from the entry itself. In these cases we ignore differences in some fields to avoid false detection of inconsistencies and trigger unnecessary self-heals. Another issue is solved in this patch that caused that the real size of the file stored into the inode context was lost during self-heal. BUG: 1225796 Change-Id: I29f328a7b4895368ded859f3bae0359436c3588f Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/10983 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: allow readdir to proceed for directories in split-brainRavishankar N2015-05-291-18/+22
| | | | | | | | | | | | | | | | | | | | | | | Problem: afr_read_txn() bails out if read_subvol==-1. This meant that for directories that were in entry split-brain, FOPS like readdir, access, stat etc were not allowed. Fix: Except for getxattr, all other FOPS are wound on the first up child of afr. Change-Id: Iacec8fbb1e75c4d2094baa304f62331c81a6f670 BUG: 1218863 BUG: Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/10776 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Anuradha Talur <atalur@redhat.com> (cherry picked from commit 49b428433a03fcf709fdc8c08603b4cf02198e0a) Reviewed-on: http://review.gluster.org/10962 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Treat op_ret >= 0 as success in afr_final_errno()Krutika Dhananjay2015-05-281-1/+1
| | | | | | | | | | | | Backport of: http://review.gluster.org/10946 Change-Id: I77e7ec8627846d1d7e3198459ea7b8624b5f450b BUG: 1225743 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/10960 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: NetBSD Build System
* cluster/ec: Fix all EIO errors in ECPranith Kumar K2015-05-2817-1065/+1347
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/10770 Backport of http://review.gluster.org/10806 Backport of http://review.gluster.org/10787 Backport of http://review.gluster.org/10868 Backport of http://review.gluster.com/10852 - When a blocking lock is requested, lock request is succeeded even when ec->fragment number of locks are acquired successfully in non-blocking locking phase. This will lead to fop succeeding only on the bricks where the locks are acquired, leading to the necessity of self-heals. To prevent these un-necessary self-heals, if the remaining locks fail with EAGAIN in non-blocking lock phase try blocking locking phase instead. - Handle lookup failures while op in progress - cluster/ec: Correctly cleanup delayed locks When a delayed lock is pending, a graph switch doesn't correctly terminate it. This means that the update of version and size xattrs is lost, causing EIO errors. This patch handles GF_EVENT_PARENT_DOWN event to correctly finish pending udpdates before completing the graph switch. - Fix use after free crash ec_heal creates ec_fop_data but doesn't run ec_manager. ec_fop_data_allocate adds this fop to ec->pending_fops, because ec_manager is not run on this heal fop it is never removed from ec->pending_fops. When it is accessed after free it leads to crash. It is better to not to add HEAL fops to ec->pending_fops because we don't want graph switch to hang the mount because of a BIG file/directory heal. - Forced unlock when lock contention is detected EC uses an eager lock mechanism to optimize multiple read/write requests on the same entry or inode. This increases performance but can have adverse results when other clients try to access the same entry/inode. To solve this, this patch adds a functionality to detect when this happens and force an earlier release to not block other clients. The method consists on requesting GF_GLUSTERFS_INODELK_COUNT and GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this count is greater than one, the lock is marked to be released. All fops already waiting for this lock will be executed normally before releasing the lock, but new requests that also require it will be blocked and restarted after the lock has been released and reacquired again. Another problem was that some operations did correctly lock the parent of an entry when needed, but got the size and version xattrs from the entry instead of the parent. This patch solves this problem by binding all queries of size and version to each lock and replacing all entrylk calls by inodelk ones to remove concurrent updates on directory metadata. This also allows rename to correctly update source and destination directories. BUG: 1225279 Change-Id: I02a6084b138dd38e018a462347cd9ce38610c7ef Reviewed-on: http://review.gluster.org/10926 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* glusterd: add counter support for tiered volumesDan Lambright2015-05-283-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | back port of http://review.gluster.org/10292 This fix adds support to view the number of promoted or demoted files from the cli. The mechanism is isolmorphic to checking the status of volumes being rebalanced. gluster volume rebalance <vol> tier status >Change-Id: I1b11ca27355ceec36c488967c23531202030e205 >BUG: 1213063 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> >Signed-off-by: Dan Lambright <dlambrig@redhat.com> >Reviewed-on: http://review.gluster.org/10292 >Reviewed-by: Atin Mukherjee <amukherj@redhat.com> >Tested-by: Gluster Build System <jenkins@build.gluster.com> Change-Id: I543e886f17132b544274c83fdecca5a8da9d092a BUG: 1221477 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10775 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
* cluster/afr : Do not copy dict when it is NULLAnuradha2015-05-151-1/+1
| | | | | | | | | | | | | | | | | | | Backport of: http://review.gluster.org/10755 In afr_lookup_xattr_req_prepare(), dict_copy was done even though source dict was NULL. Change-Id: I85a5d2823ba021e7f78c1ce13402a0f16b08cb51 BUG: 1221470 Reviewed-on: http://review.gluster.org/10755 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Prashanth Pai <ppai@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/10777
* dht/tier/rebalancer: Fix reset of tiering client pidJoseph Fernandes2015-05-104-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the patch http://review.gluster.org/#/c/9657 the client pid set by tiering migration was getting over- written in dht_start_rebalance_task(). Just corrected it in dht_setxattr() before calling dht_start_rebalance_task() and removed it from dht_start_rebalance_task(). > http://review.gluster.org/#/c/10502/ > Cherry picked from commit a5fe0f594d41e1a11661d9074bb19e9c2e2c4776 > Change-Id: I37cfa111f83a4e5d498042575c93799f60b49870 > BUG: 1217937 > Signed-off-by: Joseph Fernandes <josferna@redhat.com> > Reviewed-on: http://review.gluster.org/10502 > Tested-by: Gluster Build System <jenkins@build.gluster.com> > Reviewed-by: Susant Palai <spalai@redhat.com> > Reviewed-by: Dan Lambright <dlambrig@redhat.com> Signed-off-by: Joseph Fernandes <josferna@redhat.com> Reviewed-on: http://review.gluster.org/10502 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-by: Dan Lambright <dlambrig@redhat.com> Signed-off-by: Joseph Fernandes <josferna@redhat.com> Conflicts: xlators/cluster/dht/src/dht-common.c xlators/cluster/dht/src/tier.c Change-Id: Id513114c9a880c6196162dd4b35bbf1155a8cd09 BUG: 1219027 Reviewed-on: http://review.gluster.org/10609 Reviewed-by: N Balachandran <nbalacha@redhat.com> Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: make lookup-unhashed=auto do something actually usefulJeff Darcy2015-05-096-48/+577
| | | | | | | | | | | | | | | | | | | | | | | | | The key concept here is to determine whether a directory is "clean" by comparing its last-known-good topology to the current one for the volume. These are stored as "commit hashes" on the directory and the volume root respectively. The volume's commit hash changes whenever a brick is added or removed, and a fix-layout is done. A directory's commit hash changes only when a full rebalance (not just fix-layout) is done on it. If all bricks are present and have a directory commit hash that matches the volume commit hash, then we can assume that every file is in its "proper" place. Therefore, if we look for a file in that proper place and don't find it, we can assume it's not on any other subvolume and *safely* skip the global (broadcast to all) lookup. Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3 BUG: 1220064 Reviewed-on-master: http://review.gluster.org/#/c/7702/ Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Signed-off-by: Shyam <srangana@redhat.com> Reviewed-on: http://review.gluster.org/10729 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Fix failures with missing filesXavier Hernandez2015-05-095-283/+164
| | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.com/9407 When a file does not exist on a brick but it does on others, there could be problems trying to access it because there was some loc_t structures with null 'pargfid' but 'name' was set. This forced inode resolution based on <pargfid>/name instead of <gfid> which would be the correct one. To solve this problem, 'name' is always set to NULL when 'pargfid' is not present. Another problem was caused by an incorrect management of errors while doing incremental locking. The only allowed error during an incremental locking was ENOTCONN, but missing files on a brick can be returned as ESTALE. This caused an EIO on the operation. This patch doesn't care of errors during an incremental locking. At the end of the operation it will check if there are enough successfully locked bricks to continue or not. BUG: 1220011 Change-Id: I4a1e6235d80e20ef7ef12daba0807b859ee5c435 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/10701 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* core: use reference counting for mem_acct structuresJeff Darcy2015-05-092-12/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing memory, our memory-accounting code expects to be able to dereference from the (previously) allocated block to its owning translator. However, as we have already found once in option validation and twice in logging, that translator might itself have been freed and the dereference attempt causes on of our daemons to crash with SIGSEGV. This patch attempts to fix that as follows: * We no longer embed a struct mem_acct directly in a struct xlator, but instead allocate it separately. * Allocated memory blocks now contain a pointer to the mem_acct instead of the xlator. * The mem_acct structure contains a reference count, manipulated in both the normal and translator allocate/free code using atomic increments and decrements. * Because it's now a separate structure, we can defer freeing the mem_acct until its reference count reaches zero (either way). * Some unit tests were disabled, because they embedded their own copies of the implementation for what they were supposedly testing. Life's too short to spend time fixing tests that seem designed to impede progress by requiring a certain implementation as well as behavior. Change-Id: Id929b11387927136f78626901729296b6c0d0fd7 BUG: 1219026 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/10417 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10723 Tested-by: NetBSD Build System
* glusterd: support for tier volumes 'detach start' and 'detach commit'Dan Lambright2015-05-094-8/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back port of http://review.gluster.org/10108 These commands work in a manner analagous to rebalancing when removing a brick. The existing migration daemon detects "detach start" and switches to moving data off the hot tier. While in this state all lookups are directed to the cold tier. gluster v detach-tier <vol> start gluster v detach-tier <vol> commit The status and stop cli commands shall be submitted separately. >Change-Id: I24fda5cc3ba74f5fb8aa9a3234ad51f18b80a8a0 >BUG: 1205540 >Signed-off-by: Dan Lambright <dlambrig@redhat.com> >Signed-off-by: root <root@localhost.localdomain> >Signed-off-by: Dan Lambright <dlambrig@redhat.com> >Reviewed-on: http://review.gluster.org/10108 >Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Change-Id: I212d748d077fb5870ee84b316c653acbafbea3f7 BUG: 1220047 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/10708 Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: change log level of developer logs to DEBUGVijay Bellur2015-05-091-2/+2
| | | | | | | | | | | | | | | | | | Backport of : http://review.gluster.org/10281 A few log messages in dht directory self heal at log level INFO are useful only for developers and these logs tend to casue excessive logs in our log files. Hence moving the log level of such logs to DEBUG. Change-Id: I8a543f4ddeb5c20b2978a0f7b18d8baccc935a54 BUG: 1217949 Signed-off-by: Vijay Bellur <vbellur@redhat.com> Reviewed-on: http://review.gluster.org/10281 Reviewed-by: N Balachandran <nbalacha@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-on: http://review.gluster.org/10704 Tested-by: NetBSD Build System Reviewed-by: Raghavendra Talur <rtalur@redhat.com>
* cluster/afr : Prevent inode-evict during split-brain resolutionAnuradha2015-05-095-58/+303
| | | | | | | | | | | | | | | | | | | | | | | | | Backport of: http://review.gluster.org/#/c/10134/ 1) Provided setfattr command to set timeout for split-brain choice. 2) If split-brain inspection/resolution is being done from the mount for a file, ref the inode when split-brain-choice is set. This inode will be unconditionally unref-ed after timeout seconds set by the user/default otherwise. 3) Updated the doc and testcase to reflect the changes. Change-Id: I15c9037dee28855f21e680e7e3632e1f48dba4e1 BUG: 1219388 Reviewed-on: http://review.gluster.org/10134 Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/10679
* cluster/ec: Change meaning of trusted.ec.dirtyPranith Kumar K2015-05-087-208/+555
| | | | | | | | | | | | | | | | | - With this change, the xattr will represent if the file needs to be healed or not. It will have different values for data/entry and metadata changes. - inode ref leaks and dict_set_dynstr related leaks fixed - Added support for trylock/lock based on heal-cmd execution or not in data heal. - Made fixes to pass regression runs Change-Id: I9d8def4c2badde18a76b7898816fecfac113737a BUG: 1216303 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10385 Reviewed-on: http://review.gluster.org/10693 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/ec: data heal implementation for ecPranith Kumar K2015-05-083-8/+788
| | | | | | | | | | | | | | | | | | | | | | | | | Data self-heal: 1) Take inode lock in domain 'this->name:self-heal' on 0-0 range (full file), So that no other processes try to do self-heal at the same time. 2) Take inode lock in domain 'this->name' on 0-0 range (full file), 3) perform fxattrop+fstat and get the xattrs on all the bricks 3) Choose the brick with ec->fragment number of same version as source 4) Truncate sinks 5) Unlock lock taken in 2) 5) For each block take full file lock, Read from sources write to the sinks, Unlock 6) Take full file lock and see if the file is still sane copy i.e. File didn't become unusable while the bricks are offline. Update mtime to before healing 7) xattrop with -ve values of 'dirty' and difference of highest and its own version values for version xattr 8) unlock lock acquired in 6) 9) unlock lock acquired in 1) Change-Id: I6f4d42cd5423c767262c9d7bb5ca7767adb3e5fd BUG: 1216303 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10384 Reviewed-on: http://review.gluster.org/10692 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/ec: metadata/name/entry heal implementation for ecPranith Kumar K2015-05-082-0/+1058
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Metadata self-heal: 1) Take inode lock in domain 'this->name' on 0-0 range (full file) 2) perform lookup and get the xattrs on all the bricks 3) Choose the brick with highest version as source 4) Setattr uid/gid/permissions 5) removexattr stale xattrs 6) Setxattr existing/new xattrs 7) xattrop with -ve values of 'dirty' and difference of highest and its own version values for version xattr 8) unlock lock acquired in 1) Entry self-heal: 1) take directory lock in domain 'this->name:self-heal' on 'NULL' to prevent more than one self-heal 2) we take directory lock in domain 'this->name' on 'NULL' 3) Perform lookup on version, dirty and remember the values 4) unlock lock acquired in 2) 5) readdir on all the bricks and trigger name heals 6) xattrop with -ve values of 'dirty' and difference of highest and its own version values for version xattr 7) unlock lock acquired in 1) Name heal: 1) Take 'name' lock in 'this->name' on 'NULL' 2) Perform lookup on 'name' and get stat and xattr structures 3) Build gfid_db where for each gfid we know what subvolumes/bricks have a file with 'name' 4) Delete all the stale files i.e. the file does not exist on more than ec->redundancy number of bricks 5) On all the subvolumes/bricks with missing entry create 'name' with same type,gfid,permissions etc. 6) Unlock lock acquired in 1) Known limitation: At the moment with present design, it conservatively preserves the 'name' in case it can not decide whether to delete it. this can happen in the following scenario: 1) we have 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (Lets say A) 2) rename d1/f1 -> d2/f2 is performed but the rename is successful only on one of the bricks (Lets say B) 3) Now name self-heal on d1 and d2 would re-create the file on both d1 and d2 resulting in d1/f1 and d2/f2. Because we wanted to prevent data loss in the case above, the following scenario is not healable, i.e. it needs manual intervention: 1) we have 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (Lets say A) 2) We have two hard links: d1/a, d2/b and another file d3/c even before the brick went down 3) rename d3/c -> d2/b is performed 4) Now name self-heal on d2/b doesn't heal because d2/b with older gfid will not be deleted. One could think why not delete the link if there is more than 1 hardlink, but that leads to similar data loss issue I described earlier: Scenario: 1) we have 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (Lets say A) 2) We have two hard links: d1/a, d2/b 3) rename d1/a -> d3/c, d2/b -> d4/d is performed and both the operations are successful only on one of the bricks (Lets say B) 4) Now name self-heal on the 'names' above which can happen in parallel can decide to delete the file thinking it has 2 links but after all the self-heals do unlinks we are left with data loss. Change-Id: I3a68218a47bb726bd684604efea63cf11cfd11be BUG: 1216303 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10298 Reviewed-on: http://review.gluster.org/10691 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System
* cluster/ec: Fix dictionary compare functionPranith Kumar K2015-05-084-118/+57
| | | | | | | | | | | | | | | | | | | | | If both dicts are NULL then equal. If one of the dicts is NULL but the other has only ignorable keys then also they are equal. If both dicts are non-null then check if for each non-ignorable key, values are same or not. value_ignore function is used to skip comparing values for the keys which must be present in both the dictionaries but the value could be different. geo-rep's stime xattr doesn't need to be present in list xattr but when getxattr comes on stime xattr even if there aren't enough responses with the xattr we should still give out an answer which is maximum of the stimes available. Change-Id: I8de2ceaa2db785b797f302f585d88e73b154167d BUG: 1216303 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10078 Reviewed-on: http://review.gluster.org/10690 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System
* cluster/ec: add separate versions for data/entry, metadataAshish Pandey2015-05-088-29/+104
| | | | | | | | | | | | | | | | | | | Adding 64 bits in "version" key of extended attributes. First 64 bits (Left) represents Data version. Last 64 bits (right) represents Meta Data version. Note: 3.7 and 3.6 version ec can't co-exist with this change because xattrop in 3.6 will fail with ERANGE as the buffer passed to it will be '8' bytes where as the value will be 16 bytes in 3.7. Where as 3.7 version clients can work with old version files. For upgrades we need to tell users to complete heals and then upgrade BUG: 1215265 Change-Id: Ib85114680cb7e75b8371c984d9f7b6401c1ffb93 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/10312 Reviewed-on: http://review.gluster.org/10626 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Use fd instead of loc for get_size_versionAshish Pandey2015-05-083-57/+63
| | | | | | | | | | Change-Id: Ia7d43cb3b222db34ecb0e35424f1766715ed8e6a BUG: 1219358 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/10176 Reviewed-on: http://review.gluster.org/10625 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/tier: don't use hot tier until subvolumes readyDan Lambright2015-05-082-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of fix 10435 to Gluster 3.7. When we attach a tier, the hot tier becomes the hashed subvolume. But directories may not yet have been replicated by the fix layout process. Hence lookups to those directories will fail on the hot subvolume. We should only go to the hashed subvolume once the layout has been fixed. This is known if the layout for the parent directory does not have an error. If there is an error, the cold tier is considered the hashed subvolume. The exception to this rules is ENOCON, in which case we do not know where the file is and must abort. Note we may revalidate a lookup for a directory even if the inode has not yet been populated by FUSE. This case can happen in tiering (where one tier has completed a lookup but the other has not, in which case we revalidate one tier when we call lookup the second time). Such inodes are still invalid and should not be consulted for validation. > http://review.gluster.org/#/c/10435/ > Change-Id: Ia2bc62e1d807bd70590bd2a8300496264d73c523 > BUG: 1214289 > Signed-off-by: Dan Lambright <dlambrig@redhat.com> > Reviewed-on: http://review.gluster.org/10435 > Tested-by: Gluster Build System <jenkins@build.gluster.com> > Reviewed-by: Raghavendra G <rgowdapp@redhat.com> > Reviewed-by: N Balachandran <nbalacha@redhat.com> > Signed-off-by: Dan Lambright <dlambrig@redhat.com> Change-Id: Ia2bc62e1d807bd70590bd2a8300496264d73c523 BUG: 1219547 Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10649 Tested-by: NetBSD Build System Reviewed-by: Joseph Fernandes Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* features/shard: Implement [f]truncate fopsKrutika Dhananjay2015-05-081-2/+2
| | | | | | | | | | | | | | | Backport of: http://review.gluster.org/10631 To-Do: * Make ftruncate work even in the absence of path * Aggregate and update ia_blocks appropriately when a file is truncated to a lower size. Change-Id: Icd424430066233ba61a030e72fdddf692d2b3f22 BUG: 1214247 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/10638 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* geo-rep: rename handling in dht volumeNithya Balachandran2015-05-081-0/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Background: Glusterfs changelogs are stored in each brick, which records the changes happened in that brick. Georep will run in all the nodes of master and processes changelogs "independently". Processing changelogs is in brick level, but all the fops will be replayed on "slave mount" point. Problem: With a DHT volume, in changelog "internal fops" are NOT recorded. For Rename case, Rename is recorded in "hashed" brick changelog. (DHT's internal fops like creating linkto file, unlink is NOT recorded). This lead us to inconsistent rename operations. For example, Distribute volume created with Two bricks B1, B2. //Consider master volume mounted @ /mnt/master and following operations executed: cd /mnt/master touch f1 // f1 falls on B1 Hash mv f1 f2 // f2 falls on B2 Hash // Here, Changelogs are recorded as below: @B1 CREATE f1 @B2 RENAME f1 f2 Here, race exists between Brick B1 and B2, say B2 will get executed first. Source file f1 itself is "NOT PRESENT", so it will go ahead and create f2 (Current implementation). We have this problem When rename falls in another brick and file is unlinked in Master. Similar kind of issue exists in following case too(multiple rename): CREATE f1 RENAME f1 f2 RENAME f2 f1 Solution: Instead of carrying out "changelogging" at "HASHED volume", carry out at the "CACHED volume". This way we have rename operations carried out where actual files are present. So,Changelog recorded as : @B1 CREATE f1 RENAME f1 f2 credit: sarumuga@redhat.com PS: Some of the races as the one below are _NOT_ fixed by this patch * f1 and f2 exist. B1 and B2 are their respective cached subvols. For both files hashed-subvol == cached-subvol * mv f1 f2 on master. * B1 has change-log entry of rename f1 f2 * rebalance migrates f2 from B1 and B2 * mv f2 f1 on master. * B2 has change-log entry of rename f2 f1 Since changelog entries (rename f1 f2) and (rename f2 f1) are processed independently by gsyncds, which of either f1 and f2 survives on slave is subject to race. Note that on master its file f1 with name f1 which survived. On slave it can be either file f1 with name f1 or file f2 with name f2 based on who wins the race of processing changelog. BUG: 1219412 Change-Id: I43725d69635e2ce065135691ef629014e8df7d50 Original-Author: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/10410 Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com> Reviewed-on: http://review.gluster.org/10628 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System Reviewed-by: Kotresh HR <khiremat@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* guster/dht: tiered volumes may not allow access to files undergoing migrationDan Lambright2015-05-081-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of fix 10324 to Gluster 3.7. If a read IO occurs against a file that has reached rebalance phase 2, we redirect the IO to the destination. For tiered volumes, when we try to reopen the file (on the destination), the lower level DHT receives the open call and fails; it does not have a "cached subvol". Fix is to "teach" the lower level DHT of the new location by sending a locate before the open. > http://review.gluster.org/#/c/10324/ > Change-Id: Ia4acb0035ff1da15f6a8f9ed54f43c76e8b98f5f > BUG: 1214048 > Signed-off-by: Dan Lambright <dlambrig@redhat.com> > Signed-off-by: root <root@gprfs018.sbu.lab.eng.bos.redhat.com> > Signed-off-by: Dan Lambright <dlambrig@redhat.com> > Reviewed-on: http://review.gluster.org/10324 > Tested-by: NetBSD Build System > Tested-by: Gluster Build System <jenkins@build.gluster.com> > Reviewed-by: Raghavendra G <rgowdapp@redhat.com> > Tested-by: Raghavendra G <rgowdapp@redhat.com> > Signed-off-by: Dan Lambright <dlambrig@redhat.com> Change-Id: Ia4acb0035ff1da15f6a8f9ed54f43c76e8b98f5f BUG: 1219608 Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10654 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System Reviewed-by: Joseph Fernandes Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Restore build on non Linux systemsEmmanuel Dreyfus2015-05-072-4/+7
| | | | | | | | | | | | | | | | | | | | | | This change broke the build on NetBSD, FreeBSD, and MacOS X: http://review.gluster.org/10526/ We restore the build with two fixes: - Use POSIX-compliant sysconf(_SC_NPROCESSORS_ONLN) to get the number of processors, instead of Linux specific get_nprocs(). That let us remove Linux-specific #include <sys/sysinfo.h> - Only define MAX() if it is not already defined. NetBSD defines it in <sys/param.h> which is already included Backport of: I62341c670598670e47ea2f69ab94864f96588b18 BUG: 1212676 Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org> Change-Id: I0f098153e76954bb85b5dca3f054a069e31dd94c Reviewed-on: http://review.gluster.org/10653 Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Tested-by: Vijay Bellur <vbellur@redhat.com>
* dht/rebalance: Throttle rebalanceSusant Palai2015-05-073-4/+133
| | | | | | | | | | | | | | | | Throttle value will be "normal" by default. For throttling down, a thread will be put in to sleep. And for throttling up, gf_defrag_process_dir will wake up the sleeping threads. Change-Id: I4892ab14982a1ff305aeb2d8bbd33c79d6877b69 BUG: 1219579 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10526 Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/10629 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Vijay Bellur <vbellur@redhat.com>
* dht/rebalancer: Marking tiering migration fopsJoseph Fernandes2015-05-073-6/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a follow up patch for http://review.gluster.org/#/c/10080 In the above, the suggested change in http://review.gluster.org/#/c/10080/7/xlators/cluster/dht/src/dht-rebalance.c doesnot work. The reason it doesnt work is promotion and demotion are done in a multithread way. Whenever a promotion or demotion thread is called, the frame of the old sync_op thread is not carried with it. As a result the frame->root->pid is not set. Solution: When the file is getting migrated, we get a tiering.migration key_value in the xattr dict, so that we pass this dic key-value when we do syncop_setxattr() to do data migration and set the frame->root->pid GF_CLIENT_PID_TIER_DEFRAG in dht_setxattr() just before calling dht_start_rebalance_task(). > http://review.gluster.org/#/c/10266/ > Change-Id: I86fef2d961b32fdd2c0c69d8512cbe846b393404 > BUG: 1194753 > Signed-off-by: Joseph Fernandes <josferna@redhat.com> > Reviewed-on: http://review.gluster.org/10266 > Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> > Reviewed-by: Susant Palai <spalai@redhat.com> > Reviewed-by: Dan Lambright <dlambrig@redhat.com> > Tested-by: Gluster Build System <jenkins@build.gluster.com> Change-Id: I6ab42b2c7a3c3e21c461d097b7558ee967b62c62 BUG: 1218959 Signed-off-by: Joseph Fernandes <josferna@redhat.com> Reviewed-on: http://review.gluster.org/10266 Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Signed-off-by: Joseph Fernandes <josferna@redhat.com> Reviewed-on: http://review.gluster.org/10601 Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: Perform inode-write on healing subvolsPranith Kumar K2015-05-073-129/+88
| | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/10372 If the version numbers do not match, then writes are performed only on at least N-R bricks which have same version. But if we want to do healing of files which are constantly modified we need to allow writes on subvols that are undergoing heal. Data healing will mark 62nd bit while the heal is going on. When the data transaction sees that this bit is set it needs to perform the fop on that subvol irrespective of whether the versions match or do not match. Fop is considered successful only if N-R non-healing bricks succeed. BUG: 1216303 Change-Id: I79aaf1ac86357c51547cdaaa56cf7338004cc512 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10440 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System
* rebalance: Introducing local crawl and parallel migrationSusant Palai2015-05-076-252/+1134
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current patch address two part of the design proposed. 1. Rebalance multiple files in parallel 2. Crawl only bricks that belong to the current node Brief design explanation for the above two points. 1. Rebalance multiple files in parallel: ------------------------------------- The existing rebalance engine is single threaded. Hence, introduced multiple threads which will be running parallel to the crawler. The current rebalance migration is converted to a "Producer-Consumer" frame work. Where Producer is : Crawler Consumer is : Migrating Threads Crawler: Crawler is the main thread. The job of the crawler is now limited to fix-layout of each directory and add the files which are eligible for the migration to a global queue in a round robin manner so that we will use all the disk resources efficiently. Hence, the crawler will not be "blocked" by migration process. Producer: Producer will monitor the global queue. If any file is added to this queue, it will dqueue that entry and migrate the file. Currently 20 migration threads are spawned at the beginning of the rebalance process. Hence, multiple file migration happens in parallel. 2. Crawl only bricks that belong to the current node: -------------------------------------------------- As rebalance process is spawned per node, it migrates only the files that belongs to it's own node for the sake of load balancing. But it also reads entries from the whole cluster, which is not necessary as readdir hits other nodes. New Design: As part of the new design the rebalancer decides the subvols that are local to the rebalancer node by checking the node-uuid of root directory prior to the crawler starts. Hence, readdir won't hit the whole cluster as it has already the context of local subvols and also node-uuid request for each file can be avoided. This makes the rebalance process "more scalable". Change-Id: I6f1b44086a09df8ca23935fd213509c70cc0c050 BUG: 1217381 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10466 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System Reviewed-by: N Balachandran <nbalacha@redhat.com>
* afr: add arbitration supportRavishankar N2015-05-058-32/+207
| | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/#/c/10258/ Add logic in afr to work in conjunction with the arbiter xlator when a replica 3 arbiter volume is created. More specifically, this patch: * Enables full locks for afr data transaction for such volumes. * Removes the upfront marking of pending xattrs at the time of pre-op and defer it to post-op. (This is an arbiter independent change and is made for all afr transactions.) * After pre-op stage, check if we can proceed with the fop stage without ending up in split-brain by examining the changelog xattrs. * Unwinds the fop with failure if only one source was available at the time of pre-op and the fop happened to fail on particular source brick. * Skips data self-heal if arbiter brick is the only source available. * Adds the arbiter-count option to the shd graph. This patch is a part of the arbiter logic implementation for 3 way AFR details of which can be found at http://review.gluster.org/#/c/9656/ Change-Id: I9603db9d04de5626eb2f4d8d959ef5b46113561d BUG: 1217689 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/10514 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* arbiter: load arbiter xlator on every 3rd brick of a replica 3 AFR subvolRavishankar N2015-05-032-0/+8
| | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/10257 Logic for adding the 'glusterd_brickinfo->group' member and using it to find the brick positon has been taken from http://review.gluster.org/#/c/9919. Thanks to Jeff Darcy for that. This patch is a part of the arbiter logic implementation for 3 way AFR details of which can be found at http://review.gluster.org/#/c/9656/ Change-Id: Idbfe4f29ee8e098e0102def8f38b32314316b188 BUG: 1217689 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/10479 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
* cluster/ec: Handle unhandled statesPranith Kumar K2015-05-021-0/+3
| | | | | | | | | | | | | | Backport of http://review.gluster.org/10390 This was leading to hangs when get_size_and_version fails BUG: 1216303 Change-Id: Iae455ee957b9377e1b0b711b0ef567d50d32c7cb Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10436 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr,dht: Fix memleak after syncop_readlinkPranith Kumar K2015-05-012-0/+2
| | | | | | | | | | | | Backport of http://review.gluster.org/10305 BUG: 1216302 Change-Id: Icb0f2d6bbff806e1c5827fabcbf46b9b7983491f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10441 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: tackle thread race in dht_getxattr_cbkSusant Palai2015-05-011-17/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | problem: 1. When two threads execute in parallel in dht_getxattr_cbk it may so happen that, both may find local->xattr to be NULL. As a result dht_aggregate_xattr may not get executed. 2. In dht_getxattr_cbk, thread1 thread2 T1 this_call_cnt = 2 -1 T2 this_call_cnt = 1 - 1 T3 fills local_xattr T4 DHT_STACK_UNWIND -> local_wipe T5 tries to dereference local which is already freed, leading to crash. Solution: for problem1: Execute critical section inside frame lock to resolve race. for problem2: Calculate this_call_count just before out section. BUG: 1217386 Change-Id: I14fdb0cb1825896721670d71f48c93053448be7b Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10389 Reviewed-by: Anuradha Talur <atalur@redhat.com> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Kotresh HR <khiremat@redhat.com> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10467 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* tier: relax libgfdb required version numberEmmanuel Dreyfus2015-04-281-1/+1
| | | | | | | | | | | | | | | | | | | | | When calling dlopen() for libgfdb, do not specify the library version number "libgfdb.so.0.0.1", since libtool will not always create libraries or link with that name with the full 3-digit version. For instance on NetBSD only up to the 2-digit version is available and "libgfdb.so.0.0.1" does not exist. Instead, just specify "libgfdb.so" and rely on smymlinks installed by libtool to find the relevant library. Backport of: I074b1009d3622a122fdaeb4b99658bca3277e211 BUG: 1212676 Change-Id: I334cb6be8508051105c393ce4bb350f9df014df5 Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org> Reviewed-on: http://review.gluster.org/10408 Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com>