path: root/xlators/cluster/afr/src/afr.c
Commit message | Author | Date | Files | Lines
* cluster/afr: Heal directory rename without rmdir/mkdir | Pranith Kumar K | 2020-10-01 | 1 | -1/+40

  Problem 1: When a directory is renamed while a brick is down, entry heal
  always did an `rm -rf` on that directory on the sink at the old location,
  then recreated the directory hierarchy with mkdir at the new location.
  This is inefficient.

  Problem 2: The heal order for a renamed directory could create the
  directory at the new location before deleting it from the old location,
  leaving two directories with the same gfid in posix.

  Fix: As part of heal, if the old location is healed first and the
  directory is not present on the source brick, always rename it into a
  hidden directory inside the sink brick, so that when heal is triggered at
  the new location, shd can rename it from this hidden directory into
  place. If the new-location heal is triggered first and it detects that
  the directory already exists on the brick, skip healing the directory
  until it appears in the hidden directory (this ordering rule is sketched
  after the entry).

  Credits: Ravi for the rename-data-loss.t script
  Fixes: #1211
  Change-Id: I0cba2006f35cd03d314d18211ce0bd530e254843
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
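  The ordering rule can be captured as a small decision function. A
  standalone sketch; the enum, inputs, and names are illustrative, not
  afr's actual heal code:

      /* Decide how to heal a renamed directory, per the rules above. */
      typedef enum {
          HEAL_RENAME_TO_HIDDEN,   /* old location healed first: park it */
          HEAL_RENAME_FROM_HIDDEN, /* new location heal: adopt parked dir */
          HEAL_SKIP_FOR_NOW,       /* same-gfid dir exists, not parked yet */
          HEAL_CREATE,             /* plain missing-entry heal */
      } heal_action_t;

      heal_action_t
      plan_dir_heal(int healing_old_location, int present_in_source,
                    int gfid_dir_in_hidden, int gfid_dir_on_brick)
      {
          if (healing_old_location && !present_in_source)
              return HEAL_RENAME_TO_HIDDEN;   /* never rm -rf the tree */
          if (gfid_dir_in_hidden)
              return HEAL_RENAME_FROM_HIDDEN; /* cheap rename, no mkdir */
          if (gfid_dir_on_brick)
              return HEAL_SKIP_FOR_NOW;       /* avoid 2 dirs, 1 gfid */
          return HEAL_CREATE;
      }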
* cluster/afr: Fixing coverity issues | karthik-us | 2020-07-13 | 1 | -2/+4

  Fixing the unchecked-return-value issues reported by Coverity scan.
  CID: 1400734
  CID: 1400750

  Change-Id: I3c953df9ade4a1548e41e18018edb1b041f7e15e
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
  Updates: #1060
* cluster/afr: Fixes for halo | Pranith Kumar K | 2020-03-13 | 1 | -1/+3

  The current implementation assumes that the ping event will arrive after
  the connect event, but that may not hold when fds need to be re-opened
  after the socket connection, which consumes more time. So handle the
  ping/child-up events arriving in any order.

  fixes: bz#1800583
  Change-Id: I6bcdc0caa503bdc039ef2b4739fbf4afae121f05
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
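  A minimal standalone sketch of tolerating either event order, using
  per-child flags; halo_child_t and the handler names are illustrative,
  not the afr structures:

      #include <stdbool.h>

      typedef struct {
          bool up_seen;   /* CHILD_UP received */
          bool ping_seen; /* first ping latency measured */
          double latency; /* milliseconds */
      } halo_child_t;

      static void halo_consider_child(halo_child_t *c)
      {
          if (c->up_seen && c->ping_seen) {
              /* both events have arrived, in whatever order: the
               * latency is now trustworthy, so halo placement
               * decisions can be made for this child */
          }
      }

      void on_child_up(halo_child_t *c)
      {
          c->up_seen = true;
          halo_consider_child(c);
      }

      void on_ping(halo_child_t *c, double ms)
      {
          c->ping_seen = true;
          c->latency = ms;
          halo_consider_child(c);
      }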
* afr: expose cluster.optimistic-change-log to CLI. | Ravishankar N | 2020-01-07 | 1 | -0/+2

  This volume option was not made available to the `gluster volume set`
  CLI.

  Reported-by: epolakis (https://github.com/kinsu) in
  https://github.com/gluster/glusterfs/issues/781
  fixes: bz#1787554
  Change-Id: I7141bdd4e53ee99e22b354edde8d023bfc0b2cd7
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: lock healing changes | Ravishankar N | 2019-10-30 | 1 | -0/+3

  Implements lock healing for the gluster-block fencing use case.

  If mandatory locking is enabled:
  - Add domain lock/unlock to the afr_lk fop.
  - Maintain a list of locks to be healed in afr_private_t (this
    bookkeeping is sketched after the entry).
  - Add a lock to the list if afr_lk(F_SETLK or F_SETLKW) was successful.
  - Remove it from the list during afr_lk(F_UNLCK).
  - On child_down, mark the lock as needing heal on that child. If the
    lock is lost on a quorum number of bricks, remove it from the list and
    mark the fd bad.
  - For fds marked as bad, fail the subsequent fd-based fops.
  - On parent up, traverse the list and heal the locks IFF the client is
    the lk-owner and has quorum (shd does not heal any locks).

  updates: #613
  Change-Id: I03c46ceaea30f5e6236d5ec13f71d843d827f1bc
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
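  A standalone sketch of that bookkeeping, assuming plain POSIX
  struct flock and an invented held_lock type; the real list inside
  afr_private_t differs:

      #include <fcntl.h>
      #include <stdlib.h>

      struct held_lock {
          struct flock fl;      /* the granted F_SETLK/F_SETLKW lock */
          int needs_heal[16];   /* per-child flag; fixed size for sketch */
          struct held_lock *next;
      };

      static struct held_lock *locks_to_heal;

      /* remember a lock only after afr_lk() succeeded */
      void lock_granted(const struct flock *fl)
      {
          struct held_lock *l = calloc(1, sizeof(*l));
          if (!l)
              return;
          l->fl = *fl;
          l->next = locks_to_heal;
          locks_to_heal = l;
      }

      /* F_UNLCK drops the matching lock from the heal list */
      void lock_released(const struct flock *fl)
      {
          struct held_lock **p = &locks_to_heal;
          for (; *p; p = &(*p)->next) {
              if ((*p)->fl.l_start == fl->l_start &&
                  (*p)->fl.l_len == fl->l_len) {
                  struct held_lock *dead = *p;
                  *p = dead->next;
                  free(dead);
                  return;
              }
          }
      }

      /* on child-down, every held lock needs healing on that child */
      void child_down(int child)
      {
          for (struct held_lock *l = locks_to_heal; l; l = l->next)
              l->needs_heal[child] = 1;
      }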
* cluster/afr: Add afr_seek to fops table | Pranith Kumar K | 2019-10-14 | 1 | -0/+1

  fixes: bz#1760189
  Change-Id: Iffbf8d6f4c50b8e2de8364658697bdbe96549f5d
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: wake up index healer threads | Ravishankar N | 2019-08-30 | 1 | -0/+10

  ...whenever shd is re-enabled after being disabled, or when
  `cluster.heal-timeout` changes, without needing to restart shd or wait
  for the current `cluster.heal-timeout` seconds to expire. See BZ 1743988
  for more details.

  Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe
  fixes: bz#1744548
  Reported-by: Glen Kiessling <glenk1973@hotmail.com>
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
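  A standalone sketch of the wakeup pattern with POSIX condition
  variables: the healer sleeps at most heal-timeout seconds, and a
  reconfigure wakes it immediately instead of letting the old interval
  run out. Names are illustrative, not shd's:

      #include <pthread.h>
      #include <time.h>

      static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
      static int heal_timeout = 600; /* seconds */

      void healer_loop(void (*do_crawl)(void))
      {
          pthread_mutex_lock(&mtx);
          for (;;) {
              struct timespec ts;
              clock_gettime(CLOCK_REALTIME, &ts);
              ts.tv_sec += heal_timeout;
              /* returns early if reconfigure signals, else on timeout */
              pthread_cond_timedwait(&cv, &mtx, &ts);
              pthread_mutex_unlock(&mtx);
              do_crawl();
              pthread_mutex_lock(&mtx);
          }
      }

      void reconfigure_heal_timeout(int new_timeout)
      {
          pthread_mutex_lock(&mtx);
          heal_timeout = new_timeout;
          pthread_cond_broadcast(&cv); /* wake sleeping healers now */
          pthread_mutex_unlock(&mtx);
      }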
* afr/read: Implement latency based read child selection | Mohammed Rafi KC | 2019-06-20 | 1 | -2/+5

  Network latency is an important factor in selecting a read subvolume, so
  this patch adds two new policies:
  1) Measure the latency of a child during a GF_DUMP rpc call, then use
     this latency to pick the read subvol with the least latency.
  2) A hybrid mode that calculates an effective latency by multiplying the
     outstanding pending read requests by the latency, and chooses the
     least one.

  Change-Id: Ia49c8a08ab61f7dcdad8b8950aa4d338e7accf97
  fixes: #520
  Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
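  A standalone sketch of the two policies; the struct fields and the +1
  on pending reads (to keep an idle child's score non-zero) are
  assumptions, not afr's exact arithmetic:

      typedef struct {
          int up;                 /* child is connected */
          double latency_ms;      /* measured via GF_DUMP ping */
          unsigned pending_reads; /* outstanding read requests */
      } child_stat_t;

      /* policy 1: least latency;
       * policy 2 (hybrid): least latency * (pending + 1) */
      int pick_read_child(const child_stat_t *c, int n, int hybrid)
      {
          int best = -1;
          double best_score = 0;
          for (int i = 0; i < n; i++) {
              if (!c[i].up)
                  continue;
              double score = hybrid
                                 ? c[i].latency_ms * (c[i].pending_reads + 1)
                                 : c[i].latency_ms;
              if (best < 0 || score < best_score) {
                  best = i;
                  best_score = score;
              }
          }
          return best; /* -1 if no child is up */
      }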
* afr/fini: Free local_pool data during an afr fini | Mohammed Rafi KC | 2019-06-17 | 1 | -0/+6

  We should free the mem_pool local_pool during afr_fini; otherwise it
  leaks memory in shd.

  Change-Id: I805a34a88077bf7b886c28b403798bf9eeeb1c0b
  Updates: bz#1716695
  Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/afr: Thin-arbiter SHD fixes | karthik-us | 2019-04-12 | 1 | -1/+1

  This patch addresses post-merge review comments for commit
  5784a00f997212d34bd52b2303e20c097240d91c.

  Change-Id: I7ed954664a2ae8e1091d23ee3ceb9c66e83bfeac
  fixes: bz#1697930
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr/shd: Cleanup self heal daemon resources during afr fini | Mohammed Rafi KC | 2019-02-12 | 1 | -0/+57

  We were not properly cleaning up self-heal daemon resources during afr
  fini. This patch cleans them up.

  Change-Id: I597860be6f781b195449e695d871b8667a418d5a
  updates: bz#1659708
  Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/afr: Disable client side heals in AFR by default. | Sunil Kumar Acharya | 2019-01-10 | 1 | -3/+3

  With this changeset, the default value for the AFR client-side heal
  volume option is set to "off".

  fixes: bz#1663102
  Change-Id: Ie4016932339c4896487e3e7cb5caca68739b7ba2
  Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNT | karthik-us | 2018-12-18 | 1 | -1/+1

  Problem: Converting a plain distribute volume to replica-3 or arbiter
  type fails with ENOTCONN, because the lookup on the root fails when
  there is no quorum.

  Fix: Allow lookup on root if it comes from the ADD_REPLICA_MOUNT used
  while adding bricks to a volume. It will try to set the pending xattrs
  for the newly added bricks so the heal happens in the right direction,
  avoiding data-loss scenarios.

  Note: This fix solves the type-conversion problem only when the volume
  was mounted at least once. Conversion of never-mounted volumes will
  still fail, since the dht selfheal's attempt to set the directory
  layout fails: it runs with the PID GF_CLIENT_PID_NO_ROOT_SQUASH set in
  frame->root.

  Change-Id: Ic511939981dad118cc946754341318b164954b3b
  fixes: bz#1655854
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
* Don't depend on string options to be valid always | Pranith Kumar K | 2018-12-17 | 1 | -9/+34

  updates bz#1650403
  Change-Id: Ib5a11e691599ce4bd93c1ed5aca6060592893961
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* AFR xlator: use dict_{setn|getn|deln|get_int32n|set_int32n|set_strn} | Yaniv Kaul | 2018-12-17 | 1 | -1/+1

  In a previous patch (https://review.gluster.org/20769) we added the key
  length as a parameter to the dict_* funcs, to remove the need to
  strlen() it. This patch moves some xlators to use it.
  - In some cases, moved the strlen() of the key outside of locks, which
    is usually a good thing. Please verify it's safe to do so.
  - In some cases, created a prefix for the keys, replacing something
    like "%d-%d" with a "%s" in snprintf(). Not sure it adds value, but
    it improves readability. Please review carefully.

  Compile-tested only!

  Change-Id: I04f2a1eb2ecfc3283d849d150d10d088ae7aa7f1
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
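  A sketch of the pattern, with the dict setter passed in as a function
  pointer since its exact signature is only approximated from the commit
  text:

      #define DICT_KEY "trusted.afr.dirty"
      /* sizeof a string literal includes the NUL, hence the -1; the
       * length is a compile-time constant, no strlen() at runtime */
      enum { DICT_KEY_LEN = sizeof(DICT_KEY) - 1 };

      void set_many(void *dict, void *value, int n,
                    int (*setn)(void *dict, const char *key, int keylen,
                                void *val))
      {
          /* key length computed once, outside any loop or lock */
          for (int i = 0; i < n; i++)
              setn(dict, DICT_KEY, DICT_KEY_LEN, value);
      }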
* afr: some minor itable related cleanups | Ravishankar N | 2018-12-12 | 1 | -3/+14

  - this->itable always needs to be allocated, hence move it outside
    afr_selfheal_daemon_init().
  - Invoke afr_selfheal_daemon_init() only for the self-heal daemon case.
  - Remove the redundant itable allocation in afr_discover().
  - Destroy the itable in fini.

  Updates: bz#1193929
  Change-Id: Ib28b50b607386f5a5aa7d2f743c8b506ccb10eae
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* all: add xlator_api to many translators | Amar Tumballi | 2018-12-06 | 1 | -2/+17

  Fixes: #164
  Change-Id: I93ad6f0232a1dc534df099059f69951e1339086f
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory state | Ravishankar N | 2018-10-25 | 1 | -5/+17

  2-domain locking + xattrop for write-txn failures:
  - A post-op wound on the TA takes an AFR_TA_DOM_NOTIFY range lock and an
    AFR_TA_DOM_MODIFY full lock, does the xattrop on the TA, releases the
    AFR_TA_DOM_MODIFY lock, and stores in memory which brick is bad.
  - All further write-txn failures are handled based on this in-memory
    value, without querying the TA.
  - When shd heals the files, it does so by requesting a full lock on the
    AFR_TA_DOM_NOTIFY domain. The client uses this as a cue (via upcall),
    releases the AFR_TA_DOM_NOTIFY range lock, and invalidates its
    in-memory notion of which brick is bad. The next write-txn failure is
    wound on the TA to update the in-memory state again.
  - Any write txns still incomplete when the AFR_TA_DOM_NOTIFY release
    request arrives are completed before the lock is released.
  - Any write txns arriving after the release request are held in a
    ta_waitq.
  - After the release is complete, the ta_waitq elements are spliced to a
    separate queue, which is then processed one by one.
  - For fops that arrive in parallel while the in-memory bad brick is
    still unknown, only one is wound to the TA on the wire. The others are
    held in a ta_onwireq, which is processed after the response from the
    TA arrives.

  Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
  updates: bz#1579788
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr: Use 2 domain locking in SHD for thin-arbiter | karthik-us | 2018-09-20 | 1 | -0/+1

  With this change, when SHD starts the index crawl it asks all clients to
  release the AFR_TA_DOM_NOTIFY lock, so that clients know their in-memory
  state is no longer valid and any new operations must query the
  thin-arbiter if required.

  When SHD completes healing all the files without any failure, it takes
  the AFR_TA_DOM_NOTIFY lock again and reads the xattrs on the TA to see
  whether any new failures happened in the meantime. If new failures are
  marked on the TA, SHD starts the crawl immediately to heal those
  failures as well. If there are no new failures, SHD takes the
  AFR_TA_DOM_MODIFY lock and unsets the xattrs on the TA, so that both
  data bricks are considered good thereafter.

  Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1
  fixes: bz#1579788
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
* Land part 2 of clang-format changes | Gluster Ant | 2018-09-12 | 1 | -1053/+992

  Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4
  Signed-off-by: Nigel Babu <nigelb@redhat.com>
* multiple files: calloc -> malloc | Yaniv Kaul | 2018-09-04 | 1 | -2/+2

  Move to GF_MALLOC() instead of GF_CALLOC() when possible in:
  xlators/cluster/stripe/src/stripe-helpers.c,
  xlators/cluster/dht/src/tier.c, xlators/cluster/dht/src/dht-layout.c,
  xlators/cluster/dht/src/dht-helper.c,
  xlators/cluster/dht/src/dht-common.c, xlators/cluster/afr/src/afr.c,
  xlators/cluster/afr/src/afr-inode-read.c,
  tests/bugs/replicate/bug-1250170-fsync.c,
  tests/basic/gfapi/gfapi-async-calls-test.c,
  tests/basic/ec/ec-fast-fgetxattr.c, rpc/xdr/src/glusterfs3.h,
  rpc/rpc-transport/socket/src/socket.c, rpc/rpc-lib/src/rpc-clnt.c,
  extras/geo-rep/gsync-sync-gfid.c, cli/src/cli-xml-output.c,
  cli/src/cli-rpc-ops.c, cli/src/cli-cmd-volume.c,
  cli/src/cli-cmd-system.c, cli/src/cli-cmd-snapshot.c,
  cli/src/cli-cmd-peer.c, cli/src/cli-cmd-global.c

  It doesn't make sense to calloc (allocate and clear) memory when the
  code right away fills that memory with data. It may be optimized by the
  compiler, or have a microscopic performance improvement. In some cases,
  the allocation size was also changed to be sizeof some struct or type
  instead of a pointer - easier to read. In some cases, redundant strlen()
  calls were removed by saving the result into a variable.

  1. Only done for the straightforward cases. There's room for
     improvement.
  2. Please review carefully, especially for string allocation, with the
     terminating NULL string.

  Only compile-tested!

  updates: bz#1193929
  Original-Author: Yaniv Kaul <ykaul@redhat.com>
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
  Change-Id: I16274dca4078a1d06ae09a0daf027d734b631ac2
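  A plain-libc illustration of the rule; gluster's GF_MALLOC()/GF_CALLOC()
  wrappers add memory accounting, but the same reasoning applies:

      #include <stdlib.h>
      #include <string.h>

      char *dup_buffer(const char *src, size_t len)
      {
          /* malloc is enough: every byte is overwritten immediately
           * below, so calloc's zero-fill would be wasted work */
          char *dst = malloc(len + 1);
          if (!dst)
              return NULL;
          memcpy(dst, src, len);
          dst[len] = '\0'; /* explicit terminating NUL */
          return dst;
      }

      struct conf { int flags; void *next; };

      struct conf *new_conf(void)
      {
          /* calloc stays where fields are NOT all assigned right away:
           * readers may see the struct before it is fully populated */
          return calloc(1, sizeof(struct conf));
      }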
* afr: common thin-arbiter functions | Ravishankar N | 2018-08-23 | 1 | -0/+2

  ...that can be used by the client and the self-heal daemon, namely:
  afr_ta_post_op_lock() and afr_ta_post_op_unlock().

  Note: These are not yet consumed. They will be used in the write-txn
  changes patch, which will introduce 2-domain locking.

  updates: bz#1579788
  Change-Id: I636d50f8fde00736665060e8f9ee4510d5f38795
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* All: run codespell on the code and fix issues. | Yaniv Kaul | 2018-07-22 | 1 | -2/+2

  Please review; it's not always just the comments that were fixed. I've
  had to revert, of course, all calls to creat() that were changed to
  create()...

  Only compile-tested!

  Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* afr: Add lease() fop | Poornima G | 2018-05-05 | 1 | -0/+1

  Change-Id: Ied047dd5ee44e9d5a5d3db214826f7df30332ef9
  updates: #350
  BUG: 1319992
  Signed-off-by: Poornima G <pgurusid@redhat.com>
  Signed-off-by: Jiffin Tony Thottan <jthottan@redhat.com>
* afr: initial changes for thin arbiter | Ravishankar N | 2018-04-30 | 1 | -2/+23

  1. Create the thin-arbiter index file during mount.
  2. Set the pending marker in the thin-arbiter id file in case of
     failure.

  Change-Id: I269eb8d069f0323f1fc616175e5e5eb7b91d5f82
  updates: #352
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Keep child-up until ping-event | Pranith Kumar K | 2018-04-25 | 1 | -1/+5

  Problem: Consider 2 bricks, brick-A within halo-max-latency and brick-B
  beyond halo-max-latency, with both halo-min-replicas and
  halo-max-replicas set to '1'. Brick-A comes online first and its
  ping-latency is updated. When brick-B comes online there are 2 up
  bricks, so the code looks for the brick with the worst latency to mark
  down. Since brick-B just came online, it always had '0' latency, so
  brick-B was marked offline; brick-B would thus eventually be the one
  online even when brick-A is better suited.

  Fix: Consider the latency of a just-up child to be HALO_MAX_LATENCY, so
  that until a ping-latency arrives, the just-up brick is the one found
  as the worst child. Also keep ping-latency at -1 until child-up during
  initialization.

  BUG: 1567881
  fixes bz#1567881
  Change-Id: I148262fe505468190f0eb99225d0f6d57cdb6f04
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
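  A standalone sketch of the fix, assuming latency is stored as -1 until
  the first ping; the constant and array layout are illustrative:

      #define HALO_MAX_LATENCY_MS 99999.0

      int find_worst_up_child(const double *latency_ms,
                              const int *child_up, int n)
      {
          int worst = -1;
          double worst_latency = -1.0;
          for (int i = 0; i < n; i++) {
              if (!child_up[i])
                  continue;
              /* latency < 0 means "no ping yet": assume the worst, so
               * a just-up brick never evicts a proven low-latency one */
              double l = latency_ms[i] < 0 ? HALO_MAX_LATENCY_MS
                                           : latency_ms[i];
              if (l > worst_latency) {
                  worst_latency = l;
                  worst = i;
              }
          }
          return worst;
      }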
* cluster/afr: Need heal-timeout to be configured as low as 5 seconds | Pranith Kumar K | 2018-04-20 | 1 | -1/+1

  In Halo replication there are pending heals more often than not, so it
  makes sense to give users the capability to configure heal-timeout as
  low as 5 seconds.

  BUG: 1569489
  fixes bz#1569489
  Change-Id: I451c1975827f66398b903f659c981ef3121d5376
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* afr: add new value for read-hash-mode volume option | Ravishankar N | 2018-03-29 | 1 | -5/+9

  This new value (3) will try to wind read requests to the child of AFR
  having the least amount of pending requests in its queue.

  Updates: #363
  Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Remove unused code paths | Pranith Kumar K | 2018-03-06 | 1 | -29/+6

  Removed:
  1) afr-v1 self-heal-locks related code, which is not used anymore.
  2) Some transaction data types that are not needed.
  3) The never-used lock tracing in afr, since gluster's network tracing
     does the job.
  4) Changelog is always enabled and afr is always used with locks, so
     __changelog_enabled, afr_lock_server_count, etc. can be deleted.
  5) transaction.fop/done/resume always call the same functions, so there
     is no need for these variables.

  BUG: 1549606
  Change-Id: I370c146fec2892d40e674d232a5d7256e003c7f1
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove compound-fops usage in afr | Pranith Kumar K | 2018-03-06 | 1 | -8/+2

  We are not seeing much improvement with this change, so remove the
  feature so that it doesn't need to be maintained anymore.

  Fixes: #414
  Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Make afr_fsync a transaction | karthik-us | 2018-03-02 | 1 | -1/+1

  Change-Id: I713401feb96393f668efb074f2d5b870d19e6fda
  BUG: 1548361
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: remove unnecessary child_up initialization | Xavier Hernandez | 2018-02-03 | 1 | -7/+0

  The child_up array was initialized with all elements being -1 to allow
  afr_notify() to differentiate down bricks from bricks that haven't
  reported yet. With the current implementation this is not needed
  anymore, and it was causing unexpected results when other parts of the
  code assumed that child_up[i] != 0 meant the brick was up.

  Change-Id: I2a9d712ee64c512f24bd5cd3a48dcb37e3139472
  BUG: 1541038
  Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* cluster/afr: Adding option to take full file lock | karthik-us | 2018-01-19 | 1 | -0/+11

  Problem: In replica-3 volumes there is a possibility of ending up in a
  split-brain scenario when multiple clients write to the same file at
  non-overlapping regions in parallel.

  Scenario:
  - Initially all the copies are good, and every client sees all data
    readables as good.
  - Client C0 performs write W1, which fails on brick B0 and succeeds on
    the other two bricks.
  - C1 performs write W2, which fails on B1 and succeeds on the other two.
  - C2 performs write W3, which fails on B2 and succeeds on the other two.
  All three writes happen in parallel and fall on different ranges, so afr
  takes granular locks and the writes proceed in parallel. Since each
  client had data readables as good, none sees the file going into
  split-brain in the in_flight_split_brain check, and each performs its
  post-op, marking the pending xattrs. Now all the bricks are blamed by
  each other, and the file ends up in split-brain.

  Fix: Add an option to take either a full lock or a range lock on files
  during data transactions, to prevent this possibility. With this change,
  files take a full lock during IO by default. To use the old range-lock
  behaviour, set "cluster.full-lock" to "no".

  Change-Id: I7893fa33005328ed63daa2f7c35eeed7c5218962
  BUG: 1535438
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
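  The two lock shapes in POSIX terms: l_len = 0 means "to end of file",
  so start 0 with length 0 covers the whole file. A sketch with the
  option plumbing omitted:

      #include <fcntl.h>

      void fill_lock(struct flock *fl, int full_lock, off_t off, off_t len)
      {
          fl->l_type = F_WRLCK;
          fl->l_whence = SEEK_SET;
          if (full_lock) {
              fl->l_start = 0;
              fl->l_len = 0;     /* 0 == to EOF: whole-file lock */
          } else {
              fl->l_start = off; /* granular: only the written range */
              fl->l_len = len;
          }
      }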
* mgmt/glusterd: Adding validation for setting quorum-count | karthik-us | 2017-12-29 | 1 | -1/+2

  In a replicated volume, quorum-count could be set to any value in the
  range [1 - 2147483647]. This patch adds validation so that at most
  replica_count can be set as the quorum-count value on a volume.

  Change-Id: I13952f3c6cf498c9f2b91161503fc0fba9d94898
  BUG: 1529515
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
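  The added check amounts to a range clamp; a minimal sketch:

      /* accept quorum-count only within [1, replica_count] */
      int validate_quorum_count(int quorum_count, int replica_count)
      {
          if (quorum_count < 1 || quorum_count > replica_count)
              return -1; /* surfaces as a volume-set CLI error */
          return 0;
      }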
* afr: volume option fixes for GD2 | Ravishankar N | 2017-11-27 | 1 | -38/+121

  This patch takes care of the volume options exposed via the CLI.

  Updates: #302
  Change-Id: I6fd1645604928f6b9700e2425af4147cc6446a3a
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: add checks for allowing lookups | Ravishankar N | 2017-11-18 | 1 | -5/+2

  Problem: In an arbiter volume, lookup was being served from one of the
  sink bricks (the source brick was down). shard uses the iatt values from
  the lookup cbk to calculate the size and block count, which in this case
  were incorrect values. shard_local_t->last_block was thus initialised to
  -1, resulting in an infinite while loop in
  shard_common_resolve_shards().

  Fix: Use client-quorum logic to allow or fail the lookups from afr if
  there are no readable subvolumes. So in replica-3 or arbiter volumes, if
  there is no good copy or if quorum is not met, fail the lookup with
  ENOTCONN. With this fix, we are also removing support for the
  quorum-reads xlator option: if quorum is not met, neither read nor write
  txns are allowed, and we fail the fop with ENOTCONN.

  Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0
  BUG: 1467250
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Make choose-local "reconfigurable" | Krutika Dhananjay | 2017-09-30 | 1 | -0/+11

  With this change, enabling choose-local (a transition of its state from
  "off" to "on") takes effect after the first gfid lookup on "/" following
  the volume-set.

  Change-Id: Ibab292ba705d993b475cd0303fb3318211fb2500
  BUG: 1480525
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* cluster/afr: Remove debug logs in fix_quorum_options() | Vijay Bellur | 2017-05-19 | 1 | -5/+0

  Change-Id: Id019b0c6425849eece8a9aba7acec9a521dfb10b
  BUG: 1452378
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
  Reviewed-on: https://review.gluster.org/17335
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  Reviewed-by: Jeff Darcy <jeff@pl.atyp.us>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* Halo Replication feature for AFR translator | Kevin Vigor | 2017-05-02 | 1 | -2/+96

  Summary: Halo Geo-replication is a feature which allows Gluster or NFS
  clients to write locally to their region (as defined by a latency "halo"
  or threshold, if you like), and have their writes asynchronously
  propagate from their origin to the rest of the cluster. Clients can also
  write synchronously to the cluster simply by specifying a halo-latency
  which is very large (e.g. 10 seconds), which will include all bricks. In
  other words, it allows clients to decide at mount time whether they want
  synchronous or asynchronous IO into a cluster, and the cluster can
  support both of these modes for any number of clients simultaneously.

  There are a few new volume options due to this feature:
  - halo-shd-latency: The threshold below which self-heal daemons will
    consider children (bricks) connected.
  - halo-nfsd-latency: The threshold below which NFS daemons will consider
    children (bricks) connected.
  - halo-latency: The threshold below which all other clients will
    consider children (bricks) connected.
  - halo-min-replicas: The minimum number of replicas which are to be
    enforced regardless of the latency specified in the above 3 options.
    If the number of children falls below this threshold, the next best
    (chosen by latency) shall be swapped in (selection is sketched after
    this entry).

  New FUSE mount options: halo-latency & halo-min-replicas, as described
  above.

  This feature combined with multi-threaded SHD support (D1271745) results
  in some pretty cool geo-replication possibilities.

  Operational notes:
  - Global consistency is guaranteed for synchronous clients; this is
    provided by the existing entry-locking mechanism.
  - Asynchronous clients, on the other hand, are merely consistent to
    their region. Writes & deletes will be protected via entry-locks as
    usual, preventing concurrent writes into files which are undergoing
    replication. Read operations should never block.
  - Writes are allowed from _any_ region and propagated from the origin to
    all other regions. The takeaway is that care should be taken to ensure
    multiple writers do not write the same files, resulting in a gfid
    split-brain which will require resolution via split-brain policies
    (majority, mtime & size). The recommended method for preventing this
    is using the nfs-auth feature to define which region has RW
    permissions for each share; tiers not in the origin region should have
    RO perms.

  TODO:
  - Synchronous clients (including the SHD) should choose clients from
    their own region as preferred sources for reads. Most of the plumbing
    is in place for this via the child_latency array.
  - Better gfid split-brain handling & better dirent-type split-brain
    handling (i.e. create a trash can and move the offending files into
    it).
  - Tagging, in addition to latency, as a means of defining which children
    you wish to synchronously write to.

  Test Plan:
  - The usual suspects: clang, gcc w/ address sanitizer & valgrind
  - Prove tests

  Reviewers: jackl, dph, cjh, meyering
  Reviewed By: meyering
  Subscribers: ethanr
  Differential Revision: https://phabricator.fb.com/D1272053
  Tasks: 4117827
  Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
  BUG: 1428061
  Signed-off-by: Kevin Vigor <kvigor@fb.com>
  Reviewed-on: http://review.gluster.org/16099
  Reviewed-on: https://review.gluster.org/16177
  Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
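  A standalone sketch of the halo membership selection described above:
  latency inside the halo admits a child, then the lowest-latency
  outsiders are swapped in until halo-min-replicas is met. Names and
  layout are illustrative, not the afr implementation:

      #include <stdbool.h>

      void select_halo(const double *latency_ms, bool *in_halo, int n,
                       double halo_latency_ms, int min_replicas)
      {
          int members = 0;
          for (int i = 0; i < n; i++) {
              in_halo[i] = latency_ms[i] <= halo_latency_ms;
              if (in_halo[i])
                  members++;
          }
          /* enforce the floor: admit the best remaining children */
          while (members < min_replicas) {
              int best = -1;
              for (int i = 0; i < n; i++)
                  if (!in_halo[i] &&
                      (best < 0 || latency_ms[i] < latency_ms[best]))
                      best = i;
              if (best < 0)
                  break; /* fewer than min_replicas children exist */
              in_halo[best] = true;
              members++;
          }
      }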
* core: run many bricks within one glusterfsd process | Jeff Darcy | 2017-01-30 | 1 | -0/+7

  This patch adds support for multiple brick translator stacks running in
  a single brick-server process. This reduces our per-brick memory usage
  by approximately 3x, and our appetite for TCP ports even more. It also
  creates potential to avoid process/thread thrashing, and to improve QoS
  by scheduling more carefully across the bricks, but realizing that
  potential will require further work.

  Multiplexing is controlled by the "cluster.brick-multiplex" global
  option. By default it's off, and bricks are started in separate
  processes as before. If multiplexing is enabled, then *compatible*
  bricks (mostly those with the same transport options) will be started in
  the same process.

  Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
  BUG: 1385758
  Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-on: https://review.gluster.org/14763
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* afr: fix auto-quorum | Jeff Darcy | 2016-11-28 | 1 | -2/+2

  (1) afr_have_quorum is dead code. It was copied to afr_has_quorum, and
  everything else uses that, but the original was never deleted (until
  now).

  (2) Auto-quorum should be the default for any N > 2. Leaving quorum
  disabled is BAD, but apparently deemed acceptable for N=2 because
  there's no real quorum in that case. For any larger number (including
  arbiter configurations) there is such a thing as real quorum, and we
  should use it by default. Note that for N=3 the answers we get from
  "N % 2" (the old check) and "N > 2" (the new one) are the same.

  (3) The special case for even N in afr_has_quorum has been simplified
  and explained more thoroughly in a comment.

  Change-Id: I48b33c15093512fecf516b26dcf09afecb7ae33b
  Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-on: http://review.gluster.org/15873
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
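  A sketch of the resulting predicate, assuming the even-N special case
  breaks the exact-half tie in favour of the set containing the first
  child (the commit only says the case was "simplified and explained in
  a comment", so this tie-break is an assumption):

      #include <stdbool.h>

      bool has_quorum(const unsigned char *child_up, int child_count)
      {
          int up = 0;
          for (int i = 0; i < child_count; i++)
              if (child_up[i])
                  up++;
          if (up > child_count / 2)
              return true; /* strict majority always wins */
          /* even N: exactly half up is quorate only with child 0,
           * the assumed tie-breaker */
          if ((child_count % 2 == 0) && (up == child_count / 2))
              return child_up[0];
          return false;
      }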
* afr: Implement IPC fop | Poornima G | 2016-09-29 | 1 | -0/+1

  Currently ipc() is not implemented in afr. md-cache and upcall use ipc
  to register the list of xattrs; see [1] for more details. For the ipc op
  GF_IPC_TARGET_UPCALL, it has to be wound to all the replica subvolumes.
  ipc() fails when any of the subvolumes fails with anything other than
  ENOTCONN, or when all of the subvolumes are down.

  [1] http://review.gluster.org/#/c/15002/

  Change-Id: I0f651330eafda64e4d922043fe53bd0014536247
  BUG: 1211863
  Signed-off-by: Poornima G <pgurusid@redhat.com>
  Reviewed-on: http://review.gluster.org/15378
  Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Consume compound fops in afr transaction | Anuradha Talur | 2016-09-01 | 1 | -0/+12

  Change-Id: Ib06ece3cce1b10d28d6d2953da28444f5c2457ad
  BUG: 1290304
  Signed-off-by: Anuradha Talur <atalur@redhat.com>
  Reviewed-on: http://review.gluster.org/15014
  Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Do multi-threaded self-heal | Pranith Kumar K | 2016-08-24 | 1 | -3/+3

  BUG: 1368451
  Change-Id: I5d6b91d714ad6906dc478a401e614115c89a8fbb
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/15083
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  Reviewed-by: Ashish Pandey <aspandey@redhat.com>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Give option to do consistent-io | Pranith Kumar K | 2016-08-22 | 1 | -4/+22

  Problem: When tiering/rebalance does migrations and afr with a 2-way
  replica is in the picture, migration can read stale data if the source
  brick goes down, and then write that data to the destination. Deletion
  of the file after migration then leads to permanent loss of the data.

  Fix: Rebalance/tiering should migrate only when the data is definitely
  not stale, so introduce an option in afr called consistent-io which will
  be enabled in migration daemons.

  BUG: 1306398
  Change-Id: I750f65091cc70a3ed4bf3c12f83d0949af43920a
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/13425
  Reviewed-by: Anuradha Talur <atalur@redhat.com>
  Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>
  Smoke: Gluster Build System <jenkins@build.gluster.org>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: afr-pending-xattr fallback check | Ravishankar N | 2016-06-09 | 1 | -31/+63

  Commit 6e635284a4411b816d4d860a28262c9e6dc4bd6a introduced a
  comma-separated list of values to be used as AFR's pending changelogs.
  If this xlator option is missing in the volfile, fall back to using the
  client xlator names for constructing the pending changelog names.

  Also, since the aforementioned commit was reverted from the 3.7 and 3.8
  branches, introduce GD_OP_VERSION_3_9_0 and change the op-version for
  this feature to GD_OP_VERSION_3_9_0.

  Change-Id: I3639b9ab475bd8d9929cc7527d9f4584dee1ad1b
  BUG: 1285152
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  Reviewed-on: http://review.gluster.org/14642
  Smoke: Gluster Build System <jenkins@build.gluster.com>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/ec: Add/Modify description for eager-lock option | Ashish Pandey | 2016-06-03 | 1 | -4/+5

  This patch provides a description for the disperse.eager-lock option on
  disperse volumes. It also modifies the description of the
  cluster.eager-lock option to indicate that it applies only to replica
  volumes.

  Change-Id: Ie73298947fcaaa6aaf825978bc2d27ceaff386d2
  BUG: 1327171
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
  Reviewed-on: http://review.gluster.org/13999
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  Smoke: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Ravishankar N <ravishankar@redhat.com>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* core: assorted spelling mistakes reported by Debian | Kaleb S KEITHLEY | 2016-05-26 | 1 | -1/+1

  See also: Change-Id I567a4be8f0f31f6285550f243fe802895f6bc43b

  Reported-by: Patrick Matthäi <pmatthaei@debian.org>
  BUG: 1336793
  Change-Id: Icb9a6ff94d86663a5bca4ba931d810439c02556e
  Signed-off-by: Kaleb S KEITHLEY <kkeithle@redhat.com>
  Reviewed-on: http://review.gluster.org/14526
  Smoke: Gluster Build System <jenkins@build.gluster.com>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Ravishankar N <ravishankar@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* afr: Automagic unsplit-brain by [ctime|mtime|size|majority] | Ravishankar N | 2016-05-25 | 1 | -0/+46

  Introduce cluster.favorite-child-policy which, when enabled with one of
  [ctime|mtime|size|majority], automatically heals files that are in
  split-brain. The majority policy will not pick a source if there is no
  majority. The other three policies pick the first brick with a valid
  reply and non-zero ctime/mtime/size as the source.

  Change-Id: I3c099a0404082213860f74f2c9b4d207cfaedb76
  BUG: 1328224
  Original-author: Richard Wareing <rwareing@fb.com>
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
  Reviewed-on: http://review.gluster.org/14026
  Smoke: Gluster Build System <jenkins@build.gluster.com>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anuradha Talur <atalur@redhat.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
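  A standalone sketch of two of the policies as described above; the
  reply struct and its fields are illustrative stand-ins for each brick's
  iatt reply, not afr's actual selection code:

      #include <sys/types.h>
      #include <time.h>

      typedef struct {
          int valid; /* brick replied */
          time_t ctime, mtime;
          off_t size;
      } reply_t;

      /* mtime policy, per the text above: the first brick with a valid
       * reply and a non-zero mtime becomes the source */
      int pick_by_mtime(const reply_t *r, int n)
      {
          for (int i = 0; i < n; i++)
              if (r[i].valid && r[i].mtime != 0)
                  return i;
          return -1;
      }

      /* majority: a (size, mtime) combination shared by more than half
       * the bricks; no majority leaves the file in split-brain */
      int pick_by_majority(const reply_t *r, int n)
      {
          for (int i = 0; i < n; i++) {
              int votes = 0;
              for (int j = 0; r[i].valid && j < n; j++)
                  if (r[j].valid && r[j].size == r[i].size &&
                      r[j].mtime == r[i].mtime)
                      votes++;
              if (votes > n / 2)
                  return i;
          }
          return -1;
      }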
* cluster/afr: Entry self-heal performance enhancements | Krutika Dhananjay | 2016-04-29 | 1 | -0/+10

  Change-Id: I52da41dff5619492b656c2217f4716a6cdadebe0
  BUG: 1269461
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
  Reviewed-on: http://review.gluster.org/12442
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
  Smoke: Gluster Build System <jenkins@build.gluster.com>
  CentOS-regression: Gluster Build System <jenkins@build.gluster.com>