path: root/xlators/cluster/dht/src/dht-common.h
Commit message | Author | Age | Files | Lines
* dht: "replica.split-brain-status" attribute value is not correctMohit Agrawal2016-09-121-0/+10
Problem: In a distributed-replicate volume, the attribute
"replica.split-brain-status" does not report the split-brain condition
even though the directory is in split brain. If a directory is in split
brain on multiple replica pairs, it does not show the full list of
replica pairs.

Solution: Update the dht_aggregate code to aggregate the xattr value in
this specific condition.

Fix:
1) Function getChoices returns the choices from the split-brain status
   string.
2) Function add_opt adds the choices to a local buffer, to be stored in
   the dictionary.
3) For the key "replica.split-brain-status", dht_aggregate calls
   dht_aggregate_split_brain_xattr to prepare the list.

Test: To verify the patch, the following steps were used:
1)  Create a distributed-replicate volume and create a mount point.
2)  Stop the heal daemon.
3)  Touch files and directories on the mount point:
        mkdir test{1..5}; touch tmp{1..5}
4)  Kill the brick process on one node of a replica set:
        pkill -9 glusterfsd
5)  Change permissions of the dirs on the mount point:
        chmod 755 test{1..5}
6)  Restart the brick process on that node with the force option.
7)  Kill the brick process on the other node in the same replica set.
8)  Change permissions of the dirs again on the mount point:
        chmod 766 test{1..5}
9)  Re-execute steps 4-8 on the other replica set as well.
10) Check the heal status on the server: it shows the dirs are in
    split brain on all replica sets.
11) Check the replica.split-brain-status attr on the mount point: it
    shows the wrong split-brain status.
12) After applying the patch, the attribute shows the correct value.

> Change-Id: Icdfd72005a4aa82337c342762775a3d1761bbe4a
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> Reviewed-on: http://review.gluster.org/15201
> Smoke: Gluster Build System <jenkins@build.gluster.org>
> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
> (cherry picked from commit c4e9ec653c946002ab6d4c71ee8e6df056438a04)

Change-Id: Ia5234e8a2291a7e8a7211c82368f4df1c99fa099
Backport of commit c4e9ec653c946002ab6d4c71ee8e6df056438a04
BUG: 1375099
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Reviewed-on: http://review.gluster.org/15466
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
Smoke: Gluster Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
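A minimal standalone C sketch of the aggregation flow described above.
The buffer handling, the add_opt helper and the example pair strings
are illustrative stand-ins, not the actual dht_aggregate code:

    #include <stdio.h>
    #include <string.h>

    #define STATUS_MAX 512

    /* Append one replica pair's split-brain choices to the aggregate. */
    static void add_opt(char *aggregate, const char *choice)
    {
            if (aggregate[0] != '\0')
                    strncat(aggregate, ",", STATUS_MAX - strlen(aggregate) - 1);
            strncat(aggregate, choice, STATUS_MAX - strlen(aggregate) - 1);
    }

    int main(void)
    {
            /* Choices reported by two replica pairs of the same volume. */
            const char *pairs[] = { "replicate-0:metadata-split-brain",
                                    "replicate-1:data-split-brain" };
            char aggregate[STATUS_MAX] = "";

            for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
                    add_opt(aggregate, pairs[i]);

            printf("replica.split-brain-status = %s\n", aggregate);
            return 0;
    }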
* dht: remember locked subvol and send unlock to the same (Mohammed Rafi KC, 2016-05-06, 1 file, -0/+11)
During locking we send the lock request to the cached subvol, and
normally we send the unlock to the cached subvol as well. But with a
parallel fresh lookup on a directory, there is a race window where the
cached subvol can change, and the unlock can go to a different subvol
than the one on which we took the lock. This results in a stale lock
held on one of the subvols.

So we store the details of the subvol on which we took the lock, and
unlock on that same subvol.

Backport of:
> Change-Id: I47df99491671b10624eb37d1d17e40bacf0b15eb
> BUG: 1311002
> Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
> Reviewed-on: http://review.gluster.org/13492
> Reviewed-by: N Balachandran <nbalacha@redhat.com>
> Smoke: Gluster Build System <jenkins@build.gluster.com>
> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>

Change-Id: Ia847e7115d2296ae9811b14a956f3b6bf39bd86d
BUG: 1333645
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Reviewed-on: http://review.gluster.org/14236
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
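A simplified standalone model of the idea, "unlock where you locked":
the subvol used for the lock is remembered in the call's local state,
so a concurrent lookup that changes the cached subvol cannot redirect
the unlock. Types and names are illustrative, not the real DHT locals:

    #include <stdio.h>

    struct subvol { const char *name; };

    struct local {
            struct subvol *lock_subvol;   /* remembered at lock time */
    };

    static void lock_fop(struct local *local, struct subvol *cached)
    {
            local->lock_subvol = cached;  /* remember; don't re-resolve later */
            printf("inodelk LOCK on %s\n", cached->name);
    }

    static void unlock_fop(struct local *local)
    {
            /* Use the remembered subvol, not the possibly changed cached one. */
            printf("inodelk UNLOCK on %s\n", local->lock_subvol->name);
    }

    int main(void)
    {
            struct subvol s0 = { "replicate-0" }, s1 = { "replicate-1" };
            struct local local = { 0 };

            lock_fop(&local, &s0);
            /* a parallel fresh lookup may switch the cached subvol to s1 here */
            (void)s1;
            unlock_fop(&local);           /* still goes to replicate-0 */
            return 0;
    }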
* tier/dht: check for rebalance completion for EIO error (Mohammed Rafi KC, 2016-04-28, 1 file, -1/+3)
When an ongoing rebalance-completion check task is triggered by dht,
there is a possibility of a race between afr marking a subvol as
non-readable and dht updating the cached subvol. In this window a
write can fail with EIO.

Backport of:
> Change-Id: I42638e6d4104c0dbe893d1bc73e1366188458c5d
> BUG: 1329503
> Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
> Reviewed-on: http://review.gluster.org/14049
> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
> Smoke: Gluster Build System <jenkins@build.gluster.com>
> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: N Balachandran <nbalacha@redhat.com>
> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
(cherry picked from commit a9ccd0c8ea6989c72073028b296f73a6fcb6b896)

Change-Id: Ifc741f05a9cfe5c1d9c03085312ebbf21b8b798b
BUG: 1330428
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Reviewed-on: http://review.gluster.org/14070
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/distribute: detect stale layouts in entry fops (Raghavendra G, 2016-04-25, 1 file, -1/+7)
    dht_mkdir ()
    {
        first-hashed-subvol = hashed-subvol for "bname" in in-memory
                              layout of "parent";
        inodelk (SETLKW, parent, "LAYOUT_HEAL_DOMAIN", "can be any
                 subvol, but we choose first-hashed-subvol randomly");
        {
          begin:
            hashed-subvol = hashed-subvol for "bname" in in-memory
                            layout of "parent";
            hash-range = extract hash-range from layout of "parent";
            ret = mkdir (parent/bname, hashed-subvol, hash-range);
            if (ret == "hash-value doesn't fall into layout stored on
                        the brick (this error is returned by
                        posix-mkdir)") {
                refresh_parent_layout ();
                goto begin;
            }
        }
        inodelk (UNLCK, parent, "LAYOUT_HEAL_DOMAIN",
                 "first-hashed-subvol");
        proceed with other parts of dht_mkdir;
    }

    posix_mkdir (parent/bname, client-hash-range)
    {
        disk-hash-range = getxattr (parent, "dht-layout-key");
        if (disk-hash-range != client-hash-range) {
            fail-with-error ("hash-value doesn't fall into layout
                              stored on the brick");
            return 0;
        }
        continue-with-posix-mkdir;
    }

Similar changes need to be done for dentry operations like create,
symlink, link, unlink, rmdir and rename. These will be addressed in
subsequent patches. This patch addresses only the mkdir codepath.

This change breaks stripe tests, as on some striped subvols dht layout
xattrs are not set for some reason. This results in failure of mkdir.
Since striped volumes are always created with dht, some tests
associated with stripe also fail. So, I am making the following test
changes (since stripe is out of maintenance):
* modify ./tests/basic/rpc-coverage.t not to use striped volumes
* mark all (2) tests in tests/bugs/stripe/ as bad tests

Change-Id: Idd1ae879f24a48303dc743c1bb4d91f89a629e25
BUG: 1329062
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/14040
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
* dht: lock on subvols to prevent lookup vs rmdir race (Sakshi, 2016-04-06, 1 file, -2/+9)
There is a possibility that while an rmdir has completed on some
non-hashed subvols and is proceeding to others, a lookup selfheal can
recreate the same directory on those subvols on which the rmdir had
already succeeded. The deletion of the parent directory will then fail
with ENOTEMPTY.

To fix this, take a blocking inodelk on the subvols before starting
the rmdir. Selfheal must also take a blocking inodelk before creating
the entry.

Backport of http://review.gluster.org/13528
> Change-Id: I168a195c35ac1230ba7124d3b0ca157755b3df96
> BUG: 1245065
> Signed-off-by: Sakshi <sabansal@redhat.com>
> Reviewed-on: http://review.gluster.org/13528
> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
> Smoke: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
> Tested-by: Raghavendra G <rgowdapp@redhat.com>

Change-Id: I168a195c35ac1230ba7124d3b0ca157755b3df96
BUG: 1257894
Signed-off-by: Sakshi <sabansal@redhat.com>
Reviewed-on: http://review.gluster.org/13915
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
* tier/dht : Attach tier fix layout to run in background (Joseph Fernandes, 2016-04-01, 1 file, -9/+19)
1. Spawn a thread for background fix-layout for the tier process.
2. Once the fix-layout is completed, a marker xattr is set on the root
   of the volume to mark its completion, so that even if the tier
   process is spawned again, fix-layout will not be issued if it was
   completed last time.
3. Note that promotion of legacy files will happen eventually, as the
   ctr lookup heal in the fix-layout slowly heals the ctr db for legacy
   files, or the ctr lookup heal happens due to a name lookup.
4. When a detach tier is successful in evacuating data from the hot
   tier, the marker xattr is removed, so that the next attach tier runs
   the background tier fix-layout again.

What remains:
1. Clear the marker xattr of the tiering fix-layout during detach
   commit instead of at the end of detach start. The issue is that
   detach commit is a glusterd operation and the volume is not mounted
   in glusterd. The reason to do it in detach commit is that if the
   admin attaches the same tier again, a background fix-layout would be
   triggered which would not be needed.
2. Clear the CTR DB of the cold bricks on detach commit, as it will
   hold entries that become stale when the volume is used with ctr off
   (ctr is switched off only on detach commit).

Backport of http://review.gluster.org/13491
> Change-Id: Ibe343572e95865325cd0eef4d0b976b626a3c0c5
> BUG: 1313228
> Signed-off-by: Joseph Fernandes <josferna@redhat.com>
> Reviewed-on: http://review.gluster.org/13491
> Smoke: Gluster Build System <jenkins@build.gluster.com>
> Tested-by: Joseph Fernandes
> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Dan Lambright <dlambrig@redhat.com>

Signed-off-by: Joseph Fernandes <josferna@redhat.com>
Change-Id: Ic28affdf78d2ac0f394f3dd59f0126df7915d609
BUG: 1323016
Reviewed-on: http://review.gluster.org/13879
Smoke: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Joseph Fernandes
Tested-by: Joseph Fernandes
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
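A rough standalone sketch of the marker-guarded background fix-layout
flow from points 1-2 above. The marker variable and helper names are
invented for illustration; the real code sets an xattr on the volume
root rather than a flag:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdbool.h>

    static bool fixlayout_marker_present = false; /* stands in for a root xattr */

    static void *tier_fixlayout(void *arg)
    {
            (void)arg;
            printf("fix-layout crawl running in background...\n");
            fixlayout_marker_present = true;      /* mark completion on "/" */
            return NULL;
    }

    static void tier_start(void)
    {
            pthread_t tid;

            if (fixlayout_marker_present) {       /* completed in a prior run */
                    printf("marker found, skipping fix-layout\n");
                    return;
            }
            pthread_create(&tid, NULL, tier_fixlayout, NULL);
            pthread_join(tid, NULL);              /* tier process carries on */
    }

    int main(void)
    {
            tier_start();   /* first run: does the crawl */
            tier_start();   /* respawned process: marker short-circuits it */
            return 0;
    }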
* dht: report constant directory size (Jeff Darcy, 2016-03-28, 1 file, -0/+7)
Directory size is meaningless. Every filesystem has its own
unpredictable way of increasing or decreasing it, based on internal
data structures and even transient conditions. Some filesystems (e.g.
ext4) never decrease it at all. Others (e.g. btrfs) don't even report
it. Very few programs look at it, and those that do are broken.
Unfortunately, one such program is GNU tar, which will complain when it
sees different values because at different times we got the value from
different DHT subvolumes. To avoid such problems, just report a
constant value.

Backport of http://review.gluster.org/#/c/13770/
> Change-Id: Id64ce917c75b5f7ff50cb55b6e997f3b3556e7e3
> BUG: 1302948
> Original-author: Shyam <srangana@redhat.com>
> Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
> Signed-off-by: N Balachandran <nbalacha@redhat.com>
> Reviewed-on: http://review.gluster.org/13770
> Smoke: Gluster Build System <jenkins@build.gluster.com>
> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
> CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
> Reviewed-by: Vijay Bellur <vbellur@redhat.com>

Change-Id: Id64ce917c75b5f7ff50cb55b6e997f3b3556e7e3
BUG: 1312721
Original-author: Shyam <srangana@redhat.com>
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Signed-off-by: N Balachandran <nbalacha@redhat.com>
Reviewed-on: http://review.gluster.org/13831
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
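An illustrative standalone sketch of the approach: whatever a
subvolume reports for a directory, the stat is overwritten with one
constant size/blocks pair so tar never sees the value change between
subvolumes. The constants and types here are placeholders, not
necessarily what DHT uses:

    #include <stdio.h>
    #include <stdint.h>

    struct iatt_model { uint64_t ia_size; uint64_t ia_blocks; int is_dir; };

    static void dht_fix_dir_stat(struct iatt_model *buf)
    {
            if (buf->is_dir) {
                    buf->ia_size   = 4096;  /* constant, whatever the backend said */
                    buf->ia_blocks = 8;
            }
    }

    int main(void)
    {
            struct iatt_model a = { 12288, 24, 1 };  /* ext4-ish answer */
            struct iatt_model b = {     6,  1, 1 };  /* another backend */

            dht_fix_dir_stat(&a);
            dht_fix_dir_stat(&b);
            printf("both report size=%llu\n", (unsigned long long)a.ia_size);
            return 0;
    }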
* tier: unlink during migration (Mohammed Rafi KC, 2016-02-22, 1 file, -0/+3)
Files deleted during promotion were not being deleted, as the files
move from hashed to non-hashed. On deleting a file that is undergoing
promotion, the unlink call is not sent to the dst file, as the hashed
subvol == cached subvol. This causes the file to reappear once the
migration is complete.

This patch also fixes a problem with stale linkfile deletion.

Backport of:
> Change-Id: I4b02a498218c9d8eeaa4556fa4219e91e7fa71e5
> BUG: 1282390
> Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
> Reviewed-on: http://review.gluster.org/12829
> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
> Tested-by: Dan Lambright <dlambrig@redhat.com>
(cherry picked from commit b5de382afa8c5777e455c7a376fc4f1f01d782d1)

Change-Id: I951adb4d929926bcd646dd7574f7a2d41d57479d
BUG: 1282388
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Reviewed-on: http://review.gluster.org/12991
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
* cluster/tier: allow db queries to be interruptable (Dan Lambright, 2016-02-18, 1 file, -4/+17)
A query to the database may take a long time if the database has many
entries. The tier daemon also sends IPC calls to the bricks, which can
run slowly, especially in RHEL6. While it is possible to track down
each such instance, the snapshot feature should not be affected by
database operations. It requires only that no migration be underway.
Therefore it is okay to pause tiering at any time, except when DHT is
moving a file.

This fix implements that strategy by monitoring when control passes to
DHT to migrate a file, using the GF_XATTR_FILE_MIGRATE_KEY trigger. If
no file is being migrated, the pause operation is successful.

> Change-Id: I21f168b1bd424077ad5f38cf82f794060a1fabf6
> BUG: 1287842
> Signed-off-by: Dan Lambright <dlambrig@redhat.com>
> Reviewed-on: http://review.gluster.org/13104
> Reviewed-by: Joseph Fernandes
> Tested-by: Gluster Build System <jenkins@build.gluster.com>

Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Change-Id: I667e0af24eaa66afefa860c4d73b324e4f39b997
BUG: 1288352
Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Reviewed-on: http://review.gluster.org/13199
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht : Ftruncate on migrating file fails with EINVAL (N Balachandran, 2016-01-29, 1 file, -4/+20)
What:
If dht_open is called on a migrating file after the inode_ctx is set,
subsequent FOPs on that fd do not open the fd on the dst subvol. This
is seen when the open-ftruncate-close sequence is repeatedly called on
a migrating file. A second call to the sequence described above causes
dht_truncate_cbk to call dht_truncate2, as the dht_inode_ctx was
already set by the first call. As dht_rebalance_in_progress_check is
not called, the fd is not opened on the dst subvol. On a
distributed-replicate volume, this causes AFR to open the fd using
afr_fix_open, but with the wrong flags, causing posix_ftruncate to
fail with EINVAL.

The fix:
We require fd-specific information to make a decision while handling
migrating files. Set the fd_ctx to indicate the fd has been opened on
the dst subvol, and check whether it has been set while processing
Phase1/Phase2 checks in the FOP callback functions.

> Change-Id: I43cdcd8017b4a11e18afdd210469de7cd9a5ef14
> Signed-off-by: N Balachandran <nbalacha@redhat.com>
> Reviewed-on: http://review.gluster.org/12985
> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
> Tested-by: Dan Lambright <dlambrig@redhat.com>

Change-Id: I99f8aceec105f16631def06a263f0561954d14b3
BUG: 1293827
Signed-off-by: N Balachandran <nbalacha@redhat.com>
Reviewed-on: http://review.gluster.org/13071
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Smoke: Gluster Build System <jenkins@build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
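A simplified standalone model of the fd_ctx idea: record "fd already
opened on dst" per fd, so the Phase1/Phase2 checks in the callbacks
know whether an open on the dst subvol is still needed. Types and
names are illustrative, not the real fd_ctx API:

    #include <stdio.h>
    #include <stdbool.h>

    struct fd_model {
            bool opened_on_dst;   /* the fd_ctx bit the fix introduces */
    };

    static int dht_check_and_open_on_dst(struct fd_model *fd)
    {
            if (fd->opened_on_dst)
                    return 0;             /* nothing to do */
            printf("opening fd on dst subvol with the original flags\n");
            fd->opened_on_dst = true;     /* set fd_ctx after a successful open */
            return 0;
    }

    int main(void)
    {
            struct fd_model fd = { false };

            dht_check_and_open_on_dst(&fd);  /* first FOP: opens on dst */
            dht_check_and_open_on_dst(&fd);  /* later FOPs: no-op */
            return 0;
    }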
* cluster/tier: do not block in synctask created from pause tier (Dan Lambright, 2015-12-25, 1 file, -0/+7)
We had run sleep() in the pause tier callback. Blocking within a
synctask is dangerous. The sleep() call does not inform the synctask
scheduler that a thread is no longer running; it therefore believes it
is still running. If a second synctask already exists, it may not be
able to run. This occurs if the thread limit in the pool has been
reached. Note the pool size only grows when a synctask is created, not
when it is moved from wait state to run state, as is the case when a
FOP completes. When the tier is paused during migration, synctasks
already exist, with high probability, waiting for responses to FOPs to
the server.

The fix is to yield() in the RPC callback, which will place the
synctask into the wait queue and free up a thread for the FOP
callback. A timer wakes the callback after sufficient time has elapsed
for the pause to occur.

This is a backport of 12987.
> Change-Id: I6a947ee04c6e5649946cb6d8207ba17263a67fc6
> BUG: 1267950
> Signed-off-by: Dan Lambright <dlambrig@redhat.com>
> Reviewed-on: http://review.gluster.org/12987
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Rajesh Joseph <rjoseph@redhat.com>

Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Change-Id: I9bb7564d57d59abb98e89c8d54c8b1ff4a5fc3f3
BUG: 1274100
Reviewed-on: http://review.gluster.org/13078
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Tested-by: Dan Lambright <dlambrig@redhat.com>
* cluster/tier: fix loading tier.so into glusterd (N Balachandran, 2015-12-04, 1 file, -1/+1)
The glusterd process loads the shared libraries of client translators.
This failed for tiering because of a reference to dht_methods, which
was unnecessarily defined as a global variable. The global variable
has been removed; it is now a member of dht_conf and is initialised in
the *_init calls.

> Change-Id: Ifa0a21e3962b5cd8d9b927ef1d087d3b25312953
> Signed-off-by: N Balachandran <nbalacha@redhat.com>
> Reviewed-on: http://review.gluster.org/12863
> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
> Tested-by: Dan Lambright <dlambrig@redhat.com>
(cherry picked from commit 96fc7f64da2ef09e82845a7ab97574f511a9aae5)

Change-Id: If3cc908ebfcd1f165504f15db2e3079d97f3132e
BUG: 1288352
Signed-off-by: N Balachandran <nbalacha@redhat.com>
Reviewed-on: http://review.gluster.org/12877
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Tested-by: Dan Lambright <dlambrig@redhat.com>
* dht: heal directory path if the directory is not present (Mohammed Rafi KC, 2015-11-09, 1 file, -0/+6)
Backport of http://review.gluster.org/#/c/12376/

After a successful nameless lookup, if the directory is not present on
any of the subvols, we will get the path of the directory and
recursively send a named lookup on each parent directory. This helps
particularly in scenarios like add-brick and attach-tier.

Change-Id: I97f3178adb08b88910f709f0acf1d86c147be017
BUG: 1279095
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Reviewed-on: http://review.gluster.org/12539
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/tier: add pause tier for snapshots (Dan Lambright, 2015-10-21, 1 file, -1/+10)
This is a backport of 12304.

Snaps of tiered volumes cannot handle files undergoing migration. We
implement a helper mechanism to "pause" migration. Any files
undergoing migration are aborted. Clean-up is done to remove sticky
bits and data at the destination. Migration is restarted after the
snap completes.

For testing, an internal switch is added. It is not exposed
externally:

    gluster volume set vol1 tier-pause [true|false]

> Change-Id: Ia85bbf89ac142e9b7e73fcbef98bb9da86097799
> BUG: 1267950
> Signed-off-by: Dan Lambright <dlambrig@redhat.com>
> Reviewed-on: http://review.gluster.org/12304
> Reviewed-by: N Balachandran <nbalacha@redhat.com>
> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
> Tested-by: Gluster Build System <jenkins@build.gluster.com>

Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Conflicts:
    xlators/mgmt/glusterd/src/glusterd-messages.h
Change-Id: I5f039d8d38a4c915bd873969f336b96755a0b8f1
BUG: 1274101
Reviewed-on: http://review.gluster.org/12411
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Tested-by: Dan Lambright <dlambrig@redhat.com>
* cluster/tier: add watermarks and policy driver (Dan Lambright, 2015-10-10, 1 file, -2/+26)
Backport of fix 12039.

This fix introduces infrastructure to support different policies for
promotion and demotion.

Currently the tier feature automatically promotes and demotes files
periodically based on access. This is good for testing but too
stringent for most real workloads. It makes it difficult to fully
utilize a hot tier: data will be demoted before it is touched, and it
is unlikely a 100GB hot SSD will have all its data touched in a window
of time.

A new parameter "mode" allows the user to pick promotion/demotion
policies. The "test mode" will be used for *.t and other general
testing; this is the current mechanism. The "cache mode" introduces
watermarks. The watermarks represent levels of data residing on the
hot tier.

"cache mode" policy: the percentage the hot tier is full is called P.
Do not promote or demote more than D MB or F files. A random number in
[0-100] is called R.

Rules for migration:

    if (P < watermark_low) don't demote, always promote.
    if (P >= watermark_low) && (P < watermark_hi)
        demote if R < P; promote if R > P.
    if (P > watermark_hi) always demote, don't promote.

    gluster volume set {vol} cluster.watermark-hi %
    gluster volume set {vol} cluster.watermark-low %
    gluster volume set {vol} cluster.tier-max-mb {D}
    gluster volume set {vol} cluster.tier-max-files {F}
    gluster volume set {vol} cluster.tier-mode {test|cache}

> Change-Id: I157f19667ec95aa1d53406041c1e3b073be127c2
> BUG: 1257911
> Signed-off-by: Dan Lambright <dlambrig@redhat.com>
> Reviewed-on: http://review.gluster.org/12039
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>

Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Conflicts:
    xlators/cluster/dht/src/dht-rebalance.c
    xlators/cluster/dht/src/tier.c
Change-Id: Ibfe6b89563ceab98708325cf5d5ab0997c64816c
BUG: 1270527
Reviewed-on: http://review.gluster.org/12330
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Tested-by: Dan Lambright <dlambrig@redhat.com>
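The migration rules above, restated as a small standalone C decision
function. Names, the example values and the tie behaviour at R == P
(not specified by the rules) are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    enum mig { PROMOTE_ALL, DEMOTE_ALL, PROBABILISTIC };

    static enum mig tier_policy(int p, int wm_low, int wm_hi)
    {
            if (p < wm_low)
                    return PROMOTE_ALL;   /* don't demote, always promote */
            if (p > wm_hi)
                    return DEMOTE_ALL;    /* always demote, don't promote */
            return PROBABILISTIC;         /* demote if R < P, promote if R > P */
    }

    int main(void)
    {
            int p = 70, wm_low = 40, wm_hi = 90;   /* hot tier 70% full */
            int r = rand() % 101;                  /* R in [0,100] */

            switch (tier_policy(p, wm_low, wm_hi)) {
            case PROMOTE_ALL:  printf("promote\n"); break;
            case DEMOTE_ALL:   printf("demote\n");  break;
            case PROBABILISTIC:
                    /* ties (R == P) fall to the promote side here */
                    printf(r < p ? "demote (R=%d < P=%d)\n"
                                 : "promote (R=%d >= P=%d)\n", r, p);
                    break;
            }
            return 0;
    }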
* cluster/tier: Handle FOPs on files being migrated (N Balachandran, 2015-09-25, 1 file, -1/+12)
Determine which DHT level is responsible for handling fops on a file
undergoing migration, based on the name of the linkto xattr set on the
file being migrated, and process accordingly.

Change-Id: I82772e39314d4fe7f2ba0dcf22de0c6a374ee139
BUG: 1265892
Signed-off-by: N Balachandran <nbalacha@redhat.com>
> Reviewed-on: http://review.gluster.org/12090
> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
(cherry picked from commit 470869a954c17f32a3ba43ccda7442f82c0da6b2)

Reviewed-on: http://review.gluster.org/12224
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Tested-by: Dan Lambright <dlambrig@redhat.com>
* cluster/dht: avoid mknod on decommissioned brick (Susant Palai, 2015-08-27, 1 file, -0/+5)
BUG: 1256702
Change-Id: I0795720cb77a9c77e608f34fbb69574fd2acb542
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/11998
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/12024
* dht: block/handle create op falling to decommissioned brick (Susant Palai, 2015-08-26, 1 file, -0/+11)
Problem: Between remove-brick start and the commit phase, the client
layout may not be in sync with the disk layout because of a lack of
lookups. Hence, a create call may fall on the decommissioned brick.

Solution: Acquire a lock on the hashed subvol, so that a fix-layout or
selfheal cannot step on the layout while we are reading it. Even if we
read a layout from before the remove-brick fix-layout and the file
falls on the decommissioned brick, the file will be migrated to a new
brick as per the fix-layout.

BUG: 1256283
Change-Id: I3ef1adaf20dfb9524396a3648d1a664464eda8c1
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/11260
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/12001
* cluster/dht: use refcount to manage memory used to store migration information (Raghavendra G, 2015-07-01, 1 file, -0/+2)
Without refcounting, we might free up the memory while other fops are
still accessing it.

BUG: 1235928
Change-Id: Ia4fa4a651cd6fe2394a0c20cef83c8d2cbc8750f
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/11419
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
* features/changelog: Avoid setattr fop logging during rename (Saravanakumar Arumugam, 2015-06-11, 1 file, -0/+18)
Problem:
When a file is renamed and the renamed file's hashing falls into a
different brick, DHT creates a special file (linkto file) in that
brick (the hashed subvolume) and carries out a setattr operation on
that file. Currently, changelog records this setattr operation in the
hashed subvolume, and glusterfind in turn records it as a MODIFY
operation. So there is a NEW entry in the cached subvolume and a
MODIFY entry in the hashed subvolume for the same file.

Solution:
Avoid logging the setattr operation by marking it as an internal fop
using xdata. In the changelog translator, check whether the setattr is
marked as an internal fop and skip it accordingly.

Change-Id: I21b09afb5a638b88a4ccb822442216680b7b74fd
BUG: 1230687
Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com>
Reviewed-on: http://review.gluster.org/11183
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
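A simplified standalone model of the marking-and-skipping flow: DHT
tags the setattr as internal before winding it, and changelog drops
tagged records. The boolean flag here is an illustrative stand-in for
the real dict_t xdata plumbing:

    #include <stdio.h>
    #include <stdbool.h>

    struct xdata_model { bool internal_fop; };

    static void dht_linkfile_setattr(struct xdata_model *xdata)
    {
            xdata->internal_fop = true;   /* mark before winding the setattr */
    }

    static void changelog_setattr(const struct xdata_model *xdata)
    {
            if (xdata->internal_fop) {
                    printf("internal fop: not recorded in changelog\n");
                    return;
            }
            printf("recording MODIFY in changelog\n");
    }

    int main(void)
    {
            struct xdata_model xdata = { false };

            dht_linkfile_setattr(&xdata);
            changelog_setattr(&xdata);  /* skipped: glusterfind sees one entry */
            return 0;
    }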
* dht: Add lookup-optimize configuration option for DHT (Shyam, 2015-06-05, 1 file, -0/+1)
Currently, with commit 4eaaf5 a mixed-version cluster would have
issues if lookup-unhashed is set to auto, as older clients would fail
to validate the layouts if newer clients (i.e. 3.7 or upwards) create
directories. Also, in a mixed-version cluster the rebalance daemon
would set the commit hash for some subvolumes and not for the others.

This commit fixes the problem by moving the enabling of the
functionality introduced in the above-mentioned commit to a new dht
option. This option also has an op_version of 3_7_1, thereby
preventing it from being set in a mixed-version cluster. It brings in
the following changes:
- The option can be set only if the min version of the cluster is
  3.7.1 or higher.
- Rebalance and mkdir update the layout with the commit hashes only if
  this option is set, ensuring rebalance works in a mixed-version
  cluster and that directories created by newer clients do not cause
  layout errors when read by older clients.
- This option also supersedes lookup-unhashed, to make enabling the
  lookup optimization more deterministic and avoid conflicts with
  lookup-unhashed settings.

The option added is cluster.lookup-optimize, which is a boolean.
Usage: # gluster volume set VOLNAME cluster.lookup-optimize on

Change-Id: Ifd1d4ce3f6438fcbcd60ffbfdbfb647355ea1ae0
BUG: 1225940
Signed-off-by: Shyam <srangana@redhat.com>
Reviewed-on: http://review.gluster.org/10976
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: fix incorrect dst subvol info in inode_ctx (Nithya Balachandran, 2015-06-04, 1 file, -1/+10)
Stashing additional information in the inode_ctx to help decide
whether the migration information is stale. This could happen if a
file was migrated several times but FOPs only detected the P1
migration phase. If no FOP detects the P2 phase, the inode ctx1 is
never reset.

We now save the src subvol as well as the dst subvol in the inode ctx.
The src subvol is the subvol on which the FOP was sent when the mig
info was set in the inode ctx. This information is considered stale
if:
1. The subvol on which the current FOP is sent is the same as the dst
   subvol in the ctx.
2. The subvol on which the current FOP is sent is not the same as the
   src subvol in the ctx.

This does not handle the case where the same file might have been
renamed such that the src subvol is the same but the dst subvol is
different. However, that is unlikely to happen very often.

Change-Id: I05a2e9b107ee64750c7ca629aee03b03a02ef75f
BUG: 1225809
Signed-off-by: Nithya Balachandran <nbalacha@redhat.com>
Reviewed-on: http://review.gluster.org/10967
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
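The two staleness rules above, as a standalone predicate with
illustrative types:

    #include <stdio.h>

    struct subvol { const char *name; };

    struct mig_ctx {
            struct subvol *src;   /* subvol the FOP used when ctx was set */
            struct subvol *dst;   /* migration destination recorded then */
    };

    static int mig_info_is_stale(const struct mig_ctx *ctx,
                                 const struct subvol *fop_subvol)
    {
            /* rule 1: FOP already went to the recorded dst, or
             * rule 2: FOP went to something other than the recorded src */
            return fop_subvol == ctx->dst || fop_subvol != ctx->src;
    }

    int main(void)
    {
            struct subvol s0 = { "sub0" }, s1 = { "sub1" };
            struct mig_ctx ctx = { &s0, &s1 };

            printf("fop on src: %s\n",
                   mig_info_is_stale(&ctx, &s0) ? "stale" : "usable");
            printf("fop on dst: %s\n",
                   mig_info_is_stale(&ctx, &s1) ? "stale" : "usable");
            return 0;
    }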
* cluster/dht: pass a destination subvol to fop2 variants to avoid races (Raghavendra G, 2015-06-04, 1 file, -3/+3)
The destination subvol used in the fop2 variants is either stored in
inode-ctx1 or local->cached_subvol. However, it is not guaranteed that
a value stored in these locations before invocation of fop2 is still
present after the invocation, as these locations are shared among
different concurrent operations. So, to preserve the atomicity of
"check dst-subvol and invoke fop2 variant if dst-subvol found", we
pass down the dst-subvol to the fop2 variant.

This patch also fixes error handling in some fop2 variants.

Change-Id: Icc226228a246d3f223e3463519736c4495b364d2
BUG: 1225809
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/10966
Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht/tier/rebalancer: Fix reset of tiering client pid (Joseph Fernandes, 2015-05-10, 1 file, -0/+1)
In the patch http://review.gluster.org/#/c/9657 the client pid set by
tiering migration was getting overwritten in
dht_start_rebalance_task(). This is corrected by setting it in
dht_setxattr(), before calling dht_start_rebalance_task(), and
removing it from dht_start_rebalance_task().

> http://review.gluster.org/#/c/10502/
> Cherry picked from commit a5fe0f594d41e1a11661d9074bb19e9c2e2c4776
> Change-Id: I37cfa111f83a4e5d498042575c93799f60b49870
> BUG: 1217937
> Signed-off-by: Joseph Fernandes <josferna@redhat.com>
> Reviewed-on: http://review.gluster.org/10502
> Tested-by: Gluster Build System <jenkins@build.gluster.com>
> Reviewed-by: Susant Palai <spalai@redhat.com>
> Reviewed-by: Dan Lambright <dlambrig@redhat.com>

Signed-off-by: Joseph Fernandes <josferna@redhat.com>
Reviewed-on: http://review.gluster.org/10502
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Susant Palai <spalai@redhat.com>
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Signed-off-by: Joseph Fernandes <josferna@redhat.com>
Conflicts:
    xlators/cluster/dht/src/dht-common.c
    xlators/cluster/dht/src/tier.c
Change-Id: Id513114c9a880c6196162dd4b35bbf1155a8cd09
BUG: 1219027
Reviewed-on: http://review.gluster.org/10609
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Tested-by: NetBSD Build System
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: make lookup-unhashed=auto do something actually useful (Jeff Darcy, 2015-05-09, 1 file, -1/+28)
The key concept here is to determine whether a directory is "clean" by
comparing its last-known-good topology to the current one for the
volume. These are stored as "commit hashes" on the directory and the
volume root respectively. The volume's commit hash changes whenever a
brick is added or removed and a fix-layout is done. A directory's
commit hash changes only when a full rebalance (not just fix-layout)
is done on it. If all bricks are present and have a directory commit
hash that matches the volume commit hash, then we can assume that
every file is in its "proper" place. Therefore, if we look for a file
in that proper place and don't find it, we can assume it's not on any
other subvolume and *safely* skip the global (broadcast to all)
lookup.

Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3
BUG: 1220064
Reviewed-on-master: http://review.gluster.org/#/c/7702/
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Signed-off-by: Shyam <srangana@redhat.com>
Reviewed-on: http://review.gluster.org/10729
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
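A standalone sketch of the "clean directory" test described above:
every subvol must be up and its directory commit hash must equal the
volume commit hash before the broadcast lookup can be skipped. Types
and values are illustrative:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct subvol_state {
            bool     up;
            uint32_t dir_commit_hash;  /* set by a full rebalance of this dir */
    };

    static bool dir_is_clean(const struct subvol_state *subvols, int n,
                             uint32_t vol_commit_hash)
    {
            for (int i = 0; i < n; i++)
                    if (!subvols[i].up ||
                        subvols[i].dir_commit_hash != vol_commit_hash)
                            return false;  /* fall back to lookup-everywhere */
            return true;                   /* safe to skip the broadcast */
    }

    int main(void)
    {
            struct subvol_state s[2] = { { true, 7 }, { true, 7 } };

            printf("clean: %d\n", dir_is_clean(s, 2, 7));  /* 1: skip */
            s[1].dir_commit_hash = 6;                      /* brick added since */
            printf("clean: %d\n", dir_is_clean(s, 2, 7));  /* 0: broadcast */
            return 0;
    }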
* glusterd: support for tier volumes 'detach start' and 'detach commit' (Dan Lambright, 2015-05-09, 1 file, -0/+6)
Backport of http://review.gluster.org/10108

These commands work in a manner analogous to rebalancing when removing
a brick. The existing migration daemon detects "detach start" and
switches to moving data off the hot tier. While in this state, all
lookups are directed to the cold tier.

    gluster v detach-tier <vol> start
    gluster v detach-tier <vol> commit

The status and stop cli commands shall be submitted separately.

> Change-Id: I24fda5cc3ba74f5fb8aa9a3234ad51f18b80a8a0
> BUG: 1205540
> Signed-off-by: Dan Lambright <dlambrig@redhat.com>
> Signed-off-by: root <root@localhost.localdomain>
> Signed-off-by: Dan Lambright <dlambrig@redhat.com>
> Reviewed-on: http://review.gluster.org/10108
> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>

Change-Id: I212d748d077fb5870ee84b316c653acbafbea3f7
BUG: 1220047
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Reviewed-on: http://review.gluster.org/10708
Reviewed-by: Dan Lambright <dlambrig@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht/rebalance: Throttle rebalance (Susant Palai, 2015-05-07, 1 file, -0/+8)
The throttle value will be "normal" by default. For throttling down, a
thread is put to sleep; for throttling up, gf_defrag_process_dir wakes
up the sleeping threads.

Change-Id: I4892ab14982a1ff305aeb2d8bbd33c79d6877b69
BUG: 1219579
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/10526
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/10629
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Vijay Bellur <vbellur@redhat.com>
* rebalance: Introducing local crawl and parallel migration (Susant Palai, 2015-05-07, 1 file, -0/+47)
The current patch addresses two parts of the proposed design:
1. Rebalance multiple files in parallel.
2. Crawl only bricks that belong to the current node.

Brief design explanation for the above two points.

1. Rebalance multiple files in parallel:
-----------------------------------------
The existing rebalance engine is single-threaded. Hence, multiple
threads were introduced, which run in parallel with the crawler. The
rebalance migration is converted to a producer-consumer framework,
where the producer is the crawler and the consumers are the migrating
threads (see the sketch after this message).

Crawler: The crawler is the main thread. Its job is now limited to
fix-layout of each directory and adding the files eligible for
migration to a global queue, in a round-robin manner across subvols,
so that all disk resources are used efficiently. Hence, the crawler is
not "blocked" by the migration process.

Consumers: The migration threads monitor the global queue. When a file
is added to the queue, a thread dequeues the entry and migrates the
file. Currently 20 migration threads are spawned at the beginning of
the rebalance process, so multiple file migrations happen in parallel.

2. Crawl only bricks that belong to the current node:
------------------------------------------------------
As the rebalance process is spawned per node, it migrates only the
files that belong to its own node for the sake of load balancing. But
it also used to read entries from the whole cluster, which is not
necessary, as readdir hits other nodes.

New design: the rebalancer determines the subvols local to the
rebalancer node by checking the node-uuid of the root directory before
the crawler starts. Hence, readdir does not hit the whole cluster, as
the rebalancer already has the context of the local subvols, and the
node-uuid request for each file can be avoided. This makes the
rebalance process more scalable.

Change-Id: I6f1b44086a09df8ca23935fd213509c70cc0c050
BUG: 1217381
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/10466
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Tested-by: NetBSD Build System
Reviewed-by: N Balachandran <nbalacha@redhat.com>
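A minimal standalone sketch of the producer-consumer shape described
above. Queue size, thread count and file names are illustrative (the
patch itself spawns 20 migrator threads):

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define NFILES   8

    static const char *queue[NFILES];
    static int head, tail, done;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *migrator(void *arg)        /* consumer */
    {
            (void)arg;
            for (;;) {
                    pthread_mutex_lock(&lock);
                    while (head == tail && !done)
                            pthread_cond_wait(&cond, &lock);
                    if (head == tail && done) {
                            pthread_mutex_unlock(&lock);
                            return NULL;
                    }
                    const char *file = queue[head++];
                    pthread_mutex_unlock(&lock);
                    printf("migrating %s\n", file);   /* the actual move */
            }
    }

    int main(void)
    {
            pthread_t tids[NWORKERS];
            static const char *files[NFILES] =
                    { "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7" };

            for (int i = 0; i < NWORKERS; i++)
                    pthread_create(&tids[i], NULL, migrator, NULL);

            for (int i = 0; i < NFILES; i++) {        /* the crawler's role */
                    pthread_mutex_lock(&lock);
                    queue[tail++] = files[i];
                    pthread_cond_signal(&cond);
                    pthread_mutex_unlock(&lock);
            }
            pthread_mutex_lock(&lock);
            done = 1;                                 /* crawl finished */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);

            for (int i = 0; i < NWORKERS; i++)
                    pthread_join(tids[i], NULL);
            return 0;
    }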
* Xlators : Fixed typos (Manikandan Selvaganesh, 2015-04-02, 1 file, -1/+1)
Change-Id: I948f85cb369206ee8ce8b8cd5e48cae9adb971c9
BUG: 1075417
Signed-off-by: Manikandan Selvaganesh <mselvaga@redhat.com>
Reviewed-on: http://review.gluster.org/9529
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
* libxlator: Change marker xattr handling interface (Pranith Kumar K, 2015-03-25, 1 file, -3/+0)
- Changed the implementation of marker xattr handling to take just a
  function which populates the important data that differs from the
  default 'gauge' values and the subvolumes where the call needs to be
  wound.
- Removed duplicate code I found while reading the code and moved it
  to cluster_marker_unwind. Removed unused structure members.
- Changed dht/afr/stripe implementations to follow the new
  implementation.
- Implemented marker xattr handling for ec.

Change-Id: Ib0c3626fe31eb7c8aae841eabb694945bf23abd4
BUG: 1200372
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/9892
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
Reviewed-by: Ravishankar N <ravishankar@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Add tier translator (Dan Lambright, 2015-03-21, 1 file, -0/+29)
The tier translator shares most of DHT's code. It differs in how
subvolumes are chosen for I/Os and how file migration (cache promotion
and demotion) is managed. The differing functionality is split between
DHT and tier logic according to the "tier_methods" structure.

A cache promotion and demotion thread is created in a manner similar
to the rebalance daemon. The thread operates a timing wheel which
periodically checks for promotion and demotion candidates (files).
Candidates are queued and then migrated. Candidates must exist on the
same node as the daemon and meet other criteria per the caching
policies.

This patch has two authors (Dan Lambright and Joseph Fernandes). Dan
did the DHT changes and Joe wrote the cache policies. The fix depends
on the DHT readdir changes and the database library, which have been
submitted separately. Header files in libglusterfs/src/gfdb should be
reviewed in patch 9683.

For more background and design see the feature page [1].
[1] http://www.gluster.org/community/documentation/index.php/Features/data-classification

Change-Id: Icc26c517ccecf5c42aef039f5b9c6f7afe83e46c
BUG: 1194753
Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Reviewed-on: http://review.gluster.org/9724
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Change the subvolume encoding in d_off to be a "global" position in the graph rather than relative (local) to a particular translator (Dan Lambright, 2015-03-18, 1 file, -3/+4)
Encoding the volume in this way allows a single translator to manage
which brick is currently being scanned for directory entries. Using a
single translator minimizes allocated bits in the d_off. It also
allows multiple DHT translators in the same graph to have a common
frame of reference (the graph position) for which brick is being read.
Multiple DHT translators are needed for the tiering feature.

The fix builds off a previous change (9332) which removed subvolume
encoding from AFR. The fix makes an equivalent change to the EC
translator. More background can be found in fix 9332 and gluster-devel
discussions [1].

DHT and AFR/EC are responsible (as before) for choosing which brick to
enumerate directory entries in over the readdir lifecycle. The client
translator receiving the readdir fop encodes the d_off. It is referred
to as the "leaf node" in the graph and corresponds to the brick being
scanned. When DHT decodes the d_off, it translates the leaf node to a
local subvolume, which represents the next node in the graph leading
to the brick.

Tracking of leaf nodes is done in common utility functions. Leaf node
counts and positional information are updated on a graph switch.

[1] www.gluster.org/pipermail/gluster-devel/2015-January/043592.html

Change-Id: Iaf0ea86d7046b1ceadbad69d88707b243077ebc8
BUG: 1190734
Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Reviewed-on: http://review.gluster.org/9688
Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: fixes to should_layout_fix logic (Raghavendra G, 2015-03-05, 1 file, -0/+5)
* Don't consider the "dir-spread-count" option. This option is not
  supported.
* Consider a transition from weighted to equal distribution, or vice
  versa, a valid case for fixing the layout.

Change-Id: I0dcfe555dae9269ce20a41611cfdaa4f96c9e98b
BUG: 1196615
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/9809
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht: fix for dht_lock_count() compile error (Niels de Vos, 2015-02-26, 1 file, -1/+1)
dht-common.h includes a function definition with "inline", but the
function is not declared in the header. Dropping the "inline" compile
directive so that linking against .o files works correctly.

BUG: 1196650
Change-Id: I105be591125b29cd455769b0c4ff22d6e139227d
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-on: http://review.gluster.org/9760
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: synchronize with other concurrent healers while healing layout (Raghavendra G, 2015-02-20, 1 file, -6/+29)
The current layout heal code assumes layout setting is idempotent.
This allowed multiple concurrent healers to set the layout without any
synchronization. However, this is not the case, as different healers
can come up with different layouts for the same directory, making
layout setting non-idempotent.

So, we bring in synchronization among healers to:
1. Not overwrite an on-disk well-formed layout.
2. Refresh the in-memory layout with the on-disk layout if the
   in-memory layout needs healing and the on-disk layout is well
   formed.

This patch can synchronize:
1. among multiple healers,
2. among multiple fix-layouts (which extend the layout to consider an
   added or removed brick),
3. (but) not between healers and fix-layouts.

So the problem of in-memory stale layouts (not matching the layout on
disk) is not _completely_ fixed by this patch.

Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Change-Id: Ia285f25e8d043bb3175c61468d0d11090acee539
BUG: 1176008
Reviewed-on: http://review.gluster.org/9302
Reviewed-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Modify dht-rebalance.c to use trusted.distribute.migrate-data rather than distribute.migrate-data and work with CLI (Dan Lambright, 2015-02-13, 1 file, -1/+1)
The change makes this work both when it is internally driven and from
the shell. The problem is further described in Bugzilla #1147107.

Change-Id: I4fe04cae661dca25432530ddf5ac6ff2c957d6b3
BUG: 1147107
Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Reviewed-on: http://review.gluster.org/9284
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* cluster/dht: Fix incorrect updates to parent times (Krutika Dhananjay, 2015-01-19, 1 file, -6/+5)
In directory write FOPs, as far as updates to timestamps associated
with the parent by DHT are concerned, there are three possibilities:

a) time (in sec) gotten from child of DHT < time (in sec) in inode ctx
b) time (in sec) gotten from child of DHT = time (in sec) in inode ctx
c) time (in sec) gotten from child of DHT > time (in sec) in inode ctx

In case (c), for the time in nsecs, it is the value returned by DHT's
child that must be selected. But what DHT_UPDATE_TIME ends up doing is
to choose the maximum of (time in nsec gotten from DHT's child, time
in nsec in inode ctx).

Change-Id: I535a600b9f89b8d9b6714a73476e63ce60e169a8
BUG: 1179169
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Reviewed-on: http://review.gluster.org/9457
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
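A standalone sketch of the corrected rule implied above: the
nanoseconds field must follow whichever side wins the seconds
comparison, rather than each field being maximized independently.
Types are illustrative:

    #include <stdio.h>
    #include <stdint.h>

    struct ts { uint32_t sec; uint32_t nsec; };

    static void dht_update_time(struct ts *ctx, const struct ts *child)
    {
            if (child->sec > ctx->sec ||
                (child->sec == ctx->sec && child->nsec > ctx->nsec)) {
                    ctx->sec  = child->sec;
                    ctx->nsec = child->nsec;  /* never mix the two sources */
            }
    }

    int main(void)
    {
            struct ts ctx   = { 100, 900 };
            struct ts child = { 101, 100 };   /* newer second, smaller nsec */

            dht_update_time(&ctx, &child);
            /* prints 101.000000100; independent max() would give 101.900 */
            printf("%u.%09u\n", ctx.sec, ctx.nsec);
            return 0;
    }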
* cluster/dht: Rename should not fail post hardlink creation (Shyam, 2014-09-02, 1 file, -5/+5)
In the rename path, we wind the creation of the newname hardlink and
the linkto file in the dst hashed at the same time. If the linkto
creation fails but the link creation succeeds, we enter the failure
code and clean up the created newname hardlink. In the interim, if
another client looks up newname and finds it as a hardlink from FUSE,
it could send an unlink for oldname instead of a rename. This,
combined with the above cleanup code, could end up losing all the file
copies, thereby losing data.

This fix separates these steps into 2 parts: creating the linkto first
and then the link file, so that after link file creation no failure
would clean up the newname file. If linkto fails, then link is not
attempted, thereby not polluting the namespace with newname.

Change-Id: I61da8e906060da16a31ea1076eec2f01fd617f44
BUG: 1130888
Signed-off-by: Shyam <srangana@redhat.com>
Reviewed-on: http://review.gluster.org/8570
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: synchronize rename and file-migration (Raghavendra G, 2014-08-28, 1 file, -0/+4)
Change-Id: I4f243c946f76d440680b651235f925e3d0ebf0fd
BUG: 1130888
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/8523
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: introduce locking api (Raghavendra G, 2014-08-26, 1 file, -0/+56)
Change-Id: I41389ba91951d3e63e617aa32cd0bee848261c72
BUG: 1130888
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: http://review.gluster.org/8521
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Modified logic of linkto file deletion on non-hashed (Venkatesh Somyajulu, 2014-07-31, 1 file, -0/+2)
Currently, whenever dht_lookup_everywhere gets called, if in
dht_lookup_everywhere_cbk a linkto file is found on a non-hashed
subvolume, the file is unlinked. But there are cases when this file is
under migration. Under such a condition, we should avoid deletion of
the file. When some other rebalance process changes the layout of the
parent such that the dst_file (w.r.t. migration) falls on a non-hashed
node, lookup could find it as a linkto file, but just before the
unlink the file may be under migration or already migrated. In such
cases the unlink can be avoided.

Race:
-----
We have two bricks (brick-1 and brick-2) with an initial file "a"
under BaseDir, which is hashed as well as cached on brick-1. Assume
hashing "a" gives 44.

Initial setup:
    Brick-1         Brick-2
    ---------       ----------
    BaseDir/a       BaseDir
    [1-50]          [51-100]

Now add a new brick, Brick-3.

1. Rebalance-1 on Node-1 (the Brick-1 node) will reset the BaseDir
   layout.
2. After that it will perform:
   a) create a linkto file on the new hashed subvol (brick-2);
   b) perform the file migration.

After (rebal-1, step 1), the base layout is fixed:
    Brick-1         Brick-2             Brick-3
    ---------       ----------          ------------
    BaseDir/a       BaseDir             BaseDir
    [1-33]          [34-66]             [67-100]

After (rebal-1, step 2a), only the linkto file is created:
    BaseDir/a       BaseDir/a(linkto)   BaseDir

Now rebalance-2 on Node-2 jumps in and performs steps 1 and 2a. After
(rebal-2, step 1), it changes the layout of BaseDir:
    BaseDir/a       BaseDir/a(link)     BaseDir
    [67-100]        [1-33]              [34-66]

For (rebal-2, step 2), it will perform a lookup on Brick-3, as per the
new layout 44 falls on Brick-3. But the lookup will fail, so
dht_lookup_everywhere gets called.

NOTE: on Brick-2 a linkto file was created by rebalance-1. Currently
that linkto file gets deleted by rebalance-2's lookup, as it is
considered a stale linkto file. With this patch, if rebalance is
already in progress or rebalance is over, the linkto file will not be
unlinked: if rebalance is in progress, the fd will be open, and if
rebalance is over, the linkto will no longer be set.

Change-Id: I3fee0d28de3c76197325536a9e30099d2413f079
BUG: 1116150
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/8345
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Fix races to avoid deletion of linkto file (Venkatesh Somyajulu, 2014-07-18, 1 file, -0/+17)
Explanation of the race between rebalance processes:
https://bugzilla.redhat.com/show_bug.cgi?id=1110694#c4

STATE 1: BRICK-1, the only brick. The cached file is in the system.
STATE 2: Add brick-2: BRICK-1, BRICK-2.
STATE 3: Lookup of the file on brick-2 by this node's rebalance will
         fail because the hashed file is not created yet, so
         dht_lookup_everywhere is about to get called.
STATE 4: As part of the lookup, the link file at brick-2 will be
         created.
STATE 5: getxattr to check that the cached file belongs to this node
         is done.
STATE 6: dht_lookup_everywhere_cbk detects the link created by
         rebalance-1 and will unlink it.
STATE 7: getxattr on the link file with the "pathinfo" key will fail,
         as the link file was deleted by the rebalance on node-2.

Fix: In STATE 6, we should avoid the deletion of the link file. Every
time dht_lookup_everywhere gets called, lookup is performed on all the
nodes. So to avoid STATE 6, if a linkto file is found, it is not
deleted until a valid case is found in dht_lookup_everywhere_done:

Case 1: If the linkto file points to the cached node and the cached
        file exists, unwind with success.
Case 2: If the linkto file does not point to the current cached node
        and the cached file exists:
        a) unlink the stale link file;
        b) create a new link file.
Case 3: Only the linkto file exists: delete the linkto file.
Case 4: Only the cached file exists: create the link file (handled
        even without the patch).
Case 5: Neither the cached nor the hashed file is present: return
        ENOENT (handled even without the patch).

Change-Id: Ibf53671410d8d613b8e2e7e5d0ec30fc7dcc0298
BUG: 1116150
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/8231
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Vijay Bellur <vbellur@redhat.com>
* dht: fix rename race (Jeff Darcy, 2014-07-17, 1 file, -0/+1)
If two clients try to rename the same file at the same time, we
sometimes end up with *no file at all* in either the old or new
location. That's kind of bad. The culprit seems to be some overly
aggressive cleanup code. AFAICT, based on today's study of the code,
the intent of the changed section is to remove any linkfile we might
have created before the actual rename. However, what we're removing
might not be our extra link. If we're racing with another client
that's also doing a rename, it might be the only remaining link to the
user's data.

The solution, which is good enough to pass this test but almost
certainly still not complete, is to be more selective about when we do
this unlink. Now, we only do it if we know that, at some point, we did
in fact create the link without error (notably ENOENT on the source or
EEXIST on the destination) ourselves.

Change-Id: I8d8cce150b6f8b372c9fb813c90be58d69f8eb7b
BUG: 1117851
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/8269
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: support heterogeneous brick sizesJeff Darcy2014-07-121-0/+4
Calculation of layouts now considers the size of each brick, so
that smaller bricks don't get an "unfair" share of allocations
and start returning ENOSPC while the larger bricks still have
plenty of space.

The observation has been made that some clients might get
ENOTCONN when trying to fetch disk-size information, and end up
calculating layouts differently. The following meta-observations
can be made:

(1) This scenario is extremely unlikely in configurations with
    AFR.
(2) The most likely consequence of this scenario is that some
    files will be placed sub-optimally by the client with the
    obsolete (non-weighted) layout. They'll still be found
    anyway, so this isn't a show-stopper.
(3) Without this patch it's *guaranteed* that some files will be
    placed sub-optimally, because any layout that fails to
    account for brick sizes is sub-optimal.
(4) We shouldn't be doing fix-layout from two nodes
    simultaneously anyway. That's inefficient at best. Any
    instances of such behavior are separate bugs, which should
    be fixed separately.
(5) In the most extreme edge case, two nodes doing weighted and
    non-weighted layout fixes could race and end up creating an
    internally inconsistent layout. This condition is still
    transient; it will be detected and repaired automatically
    the next time anyone fetches the layout. (If it's not,
    that's also a preexisting bug that can show up in other
    contexts.)

In conclusion, it's not the purpose of this patch to fix bugs
elsewhere in DHT. Its purpose is to make life incrementally
better for users who add new hardware with larger disks etc.
than the older equipment. It's only one part of an ongoing
process to improve layout management and repair, all the way up
to support for multiple hash rings or tiering.

Change-Id: I05eb6f9eface9cdaf8622e0260c8c7f29020447f
BUG: 1114680
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/8093
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
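[Editor's note] A self-contained sketch of weighted layout
assignment, not the actual DHT code (which operates on
dht_layout_t): each brick's slice of the 32-bit hash space is
proportional to its size, and the last brick absorbs rounding
error so the ranges cover the full space.

    #include <stdint.h>
    #include <stdio.h>

    static void
    weighted_layout(const uint64_t *sizes, int n,
                    uint32_t *starts, uint32_t *stops)
    {
        uint64_t total = 0;
        for (int i = 0; i < n; i++)
            total += sizes[i];

        uint64_t space = (uint64_t)UINT32_MAX + 1;  /* 2^32 hash values */
        uint64_t next = 0;
        for (int i = 0; i < n; i++) {
            /* last brick absorbs rounding so there is no hole at the top */
            uint64_t share = (i == n - 1) ? space - next
                                          : space * sizes[i] / total;
            starts[i] = (uint32_t)next;
            stops[i]  = (uint32_t)(next + share - 1);
            next     += share;
        }
    }

    int main(void)
    {
        /* relative sizes in small units (e.g. TiB) so that
         * space * sizes[i] stays within 64 bits */
        uint64_t sizes[] = { 1, 1, 2 };  /* brick-3 is twice as large */
        uint32_t starts[3], stops[3];
        weighted_layout(sizes, 3, starts, stops);
        for (int i = 0; i < 3; i++)
            printf("brick-%d: [%u - %u]\n", i + 1, starts[i], stops[i]);
        return 0;
    }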
* cluster/dht: Added logging of new layout for dir-selfhealVenkatesh Somyajulu2014-07-031-0/+3
Added a log message which records the new layout that will be
used for directory self-heal. It prints:

a) Subvolume name
b) Error --> needed because layout healing depends on the error,
   and having it in the log helps in debugging
c) Start --> start of the layout range
d) Stop  --> end of the layout range

Change-Id: I48c9c697716a899165ed29b737362a75c62e09b3
BUG: 1113066
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/8173
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
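[Editor's note] A sketch of the log line's shape. The exact
message format and GlusterFS's logging macros are not reproduced
here; fprintf stands in, and the struct is an assumed
simplification of the per-subvolume layout entry.

    #include <stdint.h>
    #include <stdio.h>

    struct layout_entry {
        const char *subvol_name;
        int         err;    /* why this range needs healing */
        uint32_t    start;  /* start of the hash range */
        uint32_t    stop;   /* end of the hash range */
    };

    /* One line per subvolume of the layout computed for self-heal. */
    static void
    log_new_layout(const char *dir, const struct layout_entry *e, int n)
    {
        for (int i = 0; i < n; i++)
            fprintf(stderr,
                    "Setting layout of %s with [Subvol_name: %s, "
                    "Err: %d, Start: %u, Stop: %u]\n",
                    dir, e[i].subvol_name, e[i].err,
                    e[i].start, e[i].stop);
    }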
* features/quota: Make dht_statfs_cbk more fool proof from quota_deem_statfsVarun Shastry2014-06-181-0/+12
Problem:
The function depends on the fact that if the quota-deem-statfs
option is enabled, all of the subvolumes send their xdata with
the quota-deem-statfs flag ON. But this may not be true in case
of errors on some of the subvolumes.

There is a decision/policy made which assumes quota-deem-statfs
to be ON if at least ONE of the subvolumes sends the flag ON. By
this, df reports quota-modified statfs values if *at least ONE*
of the bricks sends the quota-deem-statfs flag ON. This can be
visualized with the "Transition Diagram/State Machine" below.

Event:  each quota-deem-statfs status from the individual bricks
Action: decision taken on the calculation of the statvfs received
State:  whether quota-deem-statfs is ON or OFF (0: OFF, 1: ON)
Input:  event from individual bricks

           ___                     ___
          /   \ OFF*              /   \ (OFF|ON)*
          |   |                   |   |
          \   /         ON        \   /
    -----> 0 --------------------> 1

The transition function below depicts the relation between the
statfs calculation and the events received:

    State   Event   Action
    -------------------------------------
    OFF     OFF     OFF
    OFF     ON      REPLACE
    ON      OFF     NEGLECT
    ON      ON      COMPARE

Change-Id: I0e8fb7d3945a3ca3dde0bb99de6cd397e27a3162
BUG: 1048786
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/6652
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Raghavendra G <rgowdapp@redhat.com>
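[Editor's note] The state machine rendered as a C sketch, under
the assumption that COMPARE keeps the more conservative (smaller)
free-space figure; the struct is a simplified stand-in for
struct statvfs, and the function name is invented.

    #include <stdbool.h>

    struct statfs_lite {
        unsigned long f_blocks, f_bfree, f_bavail;
    };

    /* Fold one brick's statfs reply into the aggregate. */
    static void
    statfs_aggregate(struct statfs_lite *agg, bool *agg_deem,
                     const struct statfs_lite *in, bool in_deem)
    {
        if (!*agg_deem && in_deem) {        /* OFF + ON  -> REPLACE */
            *agg = *in;
            *agg_deem = true;
        } else if (*agg_deem && !in_deem) { /* ON  + OFF -> NEGLECT */
            /* ignore the non-quota reply */
        } else if (*agg_deem && in_deem) {  /* ON  + ON  -> COMPARE */
            if (in->f_bavail < agg->f_bavail)
                *agg = *in;
        } else {                            /* OFF + OFF -> plain sum */
            agg->f_blocks += in->f_blocks;
            agg->f_bfree  += in->f_bfree;
            agg->f_bavail += in->f_bavail;
        }
    }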
* cluster/dht: Do layout self healing of directory for nameless lookupVenkatesh Somyajulu2014-06-171-0/+6
Problem:
Currently in the nameless lookup code path, even if layout
anomalies are detected at the end of the lookup, layout healing
will not be done, as there is no code to heal it. So there can
be a race between mkdir and lookup. Assume mkdir is going on
from some other mount point, say M1: directories are created on
some nodes but the layout is not set yet. Now a nameless lookup
goes from M2; the lookup will be successful, as the directory is
present on some of the nodes, but it won't heal the layout. If a
create fop comes after the lookup, file creation will fail
because the layout is absent.

Fix:
Included the layout self-heal code in the nameless lookup path.
At the end of the lookup, the layout will be computed as it
would have been in the named lookup, but it will be set only on
those nodes where the directory is present. So if a create fop
comes afterwards, the probability of getting a subvolume with a
proper hash range is high now, which reduces the race window.

Other:
Whenever a directory is created, we have to choose a brick from
which we start allocating the layout in a circular fashion. To
calculate this starting brick, I have changed the candidate from
the name of the directory to the gfid of the directory. But to
compute where a given file belongs, we will still use the name
of the file; the hash computed from the name should belong to
one of the directory hash ranges. Calculation of the hash for a
file acts as a consumer, and the setting of the directory layout
based on gfid acts as a producer, and the two are independent of
each other.

Change-Id: I3808c55082cd1b5c72d2c77cbbc063f55aa38bee
BUG: 1095888
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/7493
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
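[Editor's note] A sketch of the gfid-based starting-brick choice
described under "Other". The hash below is a toy stand-in, not
DHT's hash function; the point is that the gfid is identical on
every node, so every healer independently picks the same
rotation, while file placement still hashes the file name.

    #include <stdint.h>

    static int
    layout_start_subvol(const uint8_t gfid[16], int subvol_count)
    {
        uint32_t h = 0;
        for (int i = 0; i < 16; i++)   /* toy hash over the 16-byte gfid */
            h = h * 31 + gfid[i];
        /* same gfid everywhere => same starting brick everywhere */
        return (int)(h % (uint32_t)subvol_count);
    }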
* cluster/dht: Bring option to choose gfid or name based hashingVenkatesh Somyajulu2014-06-171-0/+1
Change-Id: I11794eb2adceb88e75864aede450e904431a6273
BUG: 1095888
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/8049
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Cluster/DHT: New logging frameworkNithya Balachandran2014-06-161-0/+1
Moved all relevant DHT gf_log calls to the new logging framework.

Change-Id: I3af3cfe0416e332774a6c4ff6a091d006c400af2
BUG: 1075611
Signed-off-by: Nithya Balachandran <nbalacha@redhat.com>
Reviewed-on: http://review.gluster.org/7929
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
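[Editor's note] What such a conversion looks like, sketched with
stand-in macros so the snippet compiles on its own. The real
gf_log and gf_msg come from libglusterfs, and the message id
DHT_MSG_LAYOUT_ERROR below is made up for illustration.

    #include <stdio.h>

    #define GF_LOG_WARNING "W"
    #define DHT_MSG_LAYOUT_ERROR 109001   /* hypothetical message id */

    /* stand-in for the old free-form call */
    #define gf_log(dom, lvl, fmt, ...) \
        fprintf(stderr, "[%s] %s: " fmt "\n", lvl, dom, __VA_ARGS__)
    /* stand-in for the new message-id based call */
    #define gf_msg(dom, lvl, err, msgid, fmt, ...) \
        fprintf(stderr, "[%s] %s [MSGID: %d] " fmt "\n", \
                lvl, dom, msgid, __VA_ARGS__)

    int main(void)
    {
        /* before: free-form message, hard to index or document */
        gf_log("dht", GF_LOG_WARNING,
               "layout mismatch on %s", "subvol-0");
        /* after: same message, tagged with a stable message id */
        gf_msg("dht", GF_LOG_WARNING, 0, DHT_MSG_LAYOUT_ERROR,
               "layout mismatch on %s", "subvol-0");
        return 0;
    }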
* DHT/readdirp: Directory not shown/healed on mount point if existsSusant Palai2014-06-161-0/+3
on a single brick (non-first-up subvolume).

Problem:
If a snapshot is taken when mkdir has succeeded only on the
hashed subvolume, then after restoring the snapshot the
directory is not shown on the mount point.

Why:
dht_readdirp takes into account only those directory entries
which are present on the first_up_subvolume. Hence, if the
hashed subvolume is not the same as the first_up_subvolume, the
directory won't be listed on the mount point and also won't be
healed.

Solution:
Case 1: (Rebalance not running) If the hashed subvolume is NULL
        or down, filter in the first_up_subvolume; otherwise the
        corresponding hashed subvolume will take care of the
        directory entry.
Case 2: If the readdirp-optimize option is turned on, read from
        the first_up_subvolume.

Change-Id: Idaad28f1c9f688dbfb1a8a3ab8b244510c02365e
BUG: 1092433
Signed-off-by: Susant Palai <spalai@redhat.com>
Reviewed-on: http://review.gluster.org/7599
Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
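[Editor's note] A condensed sketch of the entry filter described
in the two cases above. The names are hypothetical; the real
logic sits inside DHT's readdirp callback.

    #include <stdbool.h>
    #include <stddef.h>

    struct subvol { bool up; };

    /* Decide whether the copy of a directory entry coming from
     * subvolume 'came_from' should appear in the listing. */
    static bool
    take_dir_entry(const struct subvol *hashed,
                   const struct subvol *came_from,
                   const struct subvol *first_up,
                   bool readdirp_optimize)
    {
        if (readdirp_optimize)              /* case 2: trust first-up */
            return came_from == first_up;
        if (hashed == NULL || !hashed->up)  /* case 1: fall back */
            return came_from == first_up;
        return came_from == hashed;         /* normal: hashed owns it */
    }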