From aa2f48dbd8f8ff1d10230fb9656f2ac7d99a48f8 Mon Sep 17 00:00:00 2001
From: Shyam
Date: Mon, 27 Feb 2017 13:25:14 -0500
Subject: doc: Moved feature pages that were delivered as a part of 3.10.0

Change-Id: I35a6b599eebbe42b5ef1244d2d72fa103bcf8acb
Signed-off-by: Shyam
Reviewed-on: https://review.gluster.org/16775
Reviewed-by: Vijay Bellur
---
 done/GlusterFS 3.10/client-opversion.md    | 111 +++++++++++++++++++
 done/GlusterFS 3.10/max-opversion.md       | 118 ++++++++++++++++++++
 done/GlusterFS 3.10/multiplexing.md        | 141 ++++++++++++++++++++++
 done/GlusterFS 3.10/readdir-ahead.md       | 167 +++++++++++++++++++++++++++++
 done/GlusterFS 3.10/rebalance-estimates.md | 128 ++++++++++++++++++++
 done/GlusterFS 3.10/tier_service.md        | 130 ++++++++++++++++++++++
 under_review/client-opversion.md           | 111 -------------------
 under_review/max-opversion.md              | 118 --------------------
 under_review/multiplexing.md               | 141 ------------------------
 under_review/readdir-ahead.md              | 167 -----------------------------
 under_review/rebalance-estimates.md        | 128 ----------------------
 under_review/tier_service.md               | 130 ----------------------
 12 files changed, 795 insertions(+), 795 deletions(-)
 create mode 100644 done/GlusterFS 3.10/client-opversion.md
 create mode 100644 done/GlusterFS 3.10/max-opversion.md
 create mode 100644 done/GlusterFS 3.10/multiplexing.md
 create mode 100644 done/GlusterFS 3.10/readdir-ahead.md
 create mode 100644 done/GlusterFS 3.10/rebalance-estimates.md
 create mode 100644 done/GlusterFS 3.10/tier_service.md
 delete mode 100644 under_review/client-opversion.md
 delete mode 100644 under_review/max-opversion.md
 delete mode 100644 under_review/multiplexing.md
 delete mode 100644 under_review/readdir-ahead.md
 delete mode 100644 under_review/rebalance-estimates.md
 delete mode 100644 under_review/tier_service.md

diff --git a/done/GlusterFS 3.10/client-opversion.md b/done/GlusterFS 3.10/client-opversion.md
new file mode 100644
index 0000000..8c9991e
--- /dev/null
+++ b/done/GlusterFS 3.10/client-opversion.md
@@ -0,0 +1,111 @@
+Feature
+-------
+
+Summary
+-------
+
+Support to get the op-version information for each client through the volume
+status command.
+
+Owners
+------
+
+Samikshan Bairagya
+
+Current status
+--------------
+
+Currently the only way to get an idea regarding the version of the connected
+clients is to grep for "accepted client from" in /var/log/glusterfs/bricks.
+There is no command that gives that information out to the users.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+https://bugzilla.redhat.com/show_bug.cgi?id=1409078
+
+Detailed Description
+--------------------
+
+The op-version information for each client can be added to the already existing
+volume status command. `volume status clients` currently gives the
+following information for each client:
+
+* Hostname:port
+* Bytes Read
+* Bytes Written
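+
+As a rough illustration (the exact column layout of the enhanced output is an
+assumption, not a final format), the workflow changes from grepping brick logs
+to a single CLI call:
+
+```sh
+# Today: infer client versions indirectly from the brick logs
+grep "accepted client from" /var/log/glusterfs/bricks/*.log
+
+# Proposed: op-version reported per client (illustrative output)
+gluster volume status <volname> clients
+#   Hostname:port        Bytes Read   Bytes Written   Op-version
+#   192.0.2.11:49152     2312         4096            30712
+```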
+
+Benefit to GlusterFS
+--------------------
+
+This would improve the user experience by making it easier for users to know
+the op-version of each client from a single command.
+
+Scope
+-----
+
+#### Nature of proposed change
+
+Adds more information to `volume status clients` output.
+
+#### Implications on manageability
+
+None.
+
+#### Implications on presentation layer
+
+None.
+
+#### Implications on persistence layer
+
+None.
+
+#### Implications on 'GlusterFS' backend
+
+None.
+
+#### Modification to GlusterFS metadata
+
+None.
+
+#### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+This can be tested by having clients with different glusterfs versions connected
+to running volumes, and executing the `volume status clients`
+command.
+
+User Experience
+---------------
+
+Users can use the `volume status clients` command to get
+information on the op-version of each client along with the information that
+was already available (hostname, bytes read and bytes written).
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+None.
+
+Status
+------
+
+In development.
+
+Comments and Discussion
+-----------------------
+
+ 1. Discussion on gluster-devel ML:
+    - [Thread 1](http://www.gluster.org/pipermail/gluster-users/2016-January/025064.html)
+    - [Thread 2](http://www.gluster.org/pipermail/gluster-devel/2017-January/051820.html)
+ 2. [Discussion on Github](https://github.com/gluster/glusterfs/issues/79)
+
diff --git a/done/GlusterFS 3.10/max-opversion.md b/done/GlusterFS 3.10/max-opversion.md
new file mode 100644
index 0000000..16d4ee4
--- /dev/null
+++ b/done/GlusterFS 3.10/max-opversion.md
@@ -0,0 +1,118 @@
+Feature
+-------
+
+Summary
+-------
+
+Support to retrieve the maximum supported op-version (cluster.op-version) in a
+heterogeneous cluster.
+
+Owners
+------
+
+Samikshan Bairagya
+
+Current status
+--------------
+
+Currently users can retrieve the op-version on which a cluster is operating by
+using the gluster volume get command on the global option cluster.op-version as
+follows:
+
+# gluster volume get <volname> cluster.op-version
+
+There is, however, no way for a user to find out the maximum op-version to
+which the cluster could be bumped up.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+https://bugzilla.redhat.com/show_bug.cgi?id=1365822
+
+Detailed Description
+--------------------
+
+A heterogeneous cluster operates on a common op-version that can be supported
+across all the nodes in the trusted storage pool. Upon upgrade of the nodes in
+the cluster, the cluster might support a higher op-version. However, since it
+is currently not possible for the user to get this op-version value, it is
+difficult for them to bump up the op-version of the cluster to the supported
+value.
+
+The maximum supported op-version in a cluster would be the minimum of the
+maximum op-versions supported by each of the nodes. To retrieve this, the
+volume get functionality could be invoked as follows:
+
+# gluster volume get all cluster.max-op-version
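+
+Putting the two together, a typical check-and-bump sequence after an upgrade
+would look roughly like this (the `volume set` invocation is the usual way to
+raise cluster.op-version and is shown here as an illustration):
+
+```sh
+# What the cluster is running at today
+gluster volume get <volname> cluster.op-version
+
+# Highest op-version every node in the pool can support
+gluster volume get all cluster.max-op-version
+
+# If the two differ, bump the cluster to the reported maximum, e.g. 31000
+gluster volume set all cluster.op-version 31000
+```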
+
+Benefit to GlusterFS
+--------------------
+
+This would improve the user experience by making it easier for users to know
+the maximum op-version on which the cluster can operate.
+
+Scope
+-----
+
+#### Nature of proposed change
+
+This adds a new non-settable global option, cluster.max-op-version.
+
+#### Implications on manageability
+
+None.
+
+#### Implications on presentation layer
+
+None.
+
+#### Implications on persistence layer
+
+None.
+
+#### Implications on 'GlusterFS' backend
+
+None.
+
+#### Modification to GlusterFS metadata
+
+None.
+
+#### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+This can be tested on a cluster with at least one node running on version 'n+1'
+and others on version 'n' where n = 3.10. The maximum supported op-version
+(cluster.max-op-version) should be returned by `volume get` as n in this case.
+
+User Experience
+---------------
+
+Upon upgrade of one or more nodes in a cluster, users can get the new maximum
+op-version the cluster can support.
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+None.
+
+Status
+------
+
+In development.
+
+Comments and Discussion
+-----------------------
+
+ 1. [Discussion on gluster-devel ML](http://www.gluster.org/pipermail/gluster-devel/2016-December/051650.html)
+ 2. [Discussion on Github](https://github.com/gluster/glusterfs/issues/56)
+
diff --git a/done/GlusterFS 3.10/multiplexing.md b/done/GlusterFS 3.10/multiplexing.md
new file mode 100644
index 0000000..fd06150
--- /dev/null
+++ b/done/GlusterFS 3.10/multiplexing.md
@@ -0,0 +1,141 @@
+Feature
+-------
+Brick Multiplexing
+
+Summary
+-------
+
+Use one process (and port) to serve multiple bricks.
+
+Owners
+------
+
+Jeff Darcy (jdarcy@redhat.com)
+
+Current status
+--------------
+
+In development.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+Mostly N/A, except that this will make implementing real QoS easier at some
+point in the future.
+
+Detailed Description
+--------------------
+
+The basic idea is very simple: instead of spawning a new process for every
+brick, we send an RPC to an existing brick process telling it to attach the new
+brick (identified and described by a volfile) beneath its protocol/server
+instance. Likewise, instead of killing a process to terminate a brick, we tell
+it to detach one of its (possibly several) brick translator stacks.
+
+Bricks can *not* share a process if they use incompatible transports (e.g. TLS
+vs. non-TLS). Also, a brick process serving several bricks is a larger failure
+domain than we have with a process per brick, so we might voluntarily decide to
+spawn a new process anyway just to keep the failure domains smaller. Lastly,
+there should always be a fallback to current brick-per-process behavior, by
+simply pretending that all bricks' transports are incompatible with each other.
+
+Benefit to GlusterFS
+--------------------
+
+Multiplexing should significantly reduce resource consumption:
+
+ * Each *process* will consume one TCP port, instead of each *brick* doing so.
+
+ * The cost of global data structures and object pools will be reduced to 1/N
+   of what it is now, where N is the average number of bricks per process.
+
+ * Thread counts will also be reduced to 1/N. This avoids the exponentially
+   bad thrashing effects as the total number of threads far exceeds the number
+   of cores, made worse by multiple processes trying to auto-scale the number
+   of network and disk I/O threads independently.
+
+These resource issues are already limiting the number of bricks and volumes we
+can support. By reducing all forms of resource consumption at once, we should
+be able to raise these user-visible limits by a corresponding amount.
+
+Scope
+-----
+
+#### Nature of proposed change
+
+The largest changes are at the two places where we do brick and process
+management - GlusterD at one end, generic glusterfsd code at the other. The
+new messages require changes to rpc and client/server translator code. The
+server translator needs further changes to look up one among several child
+translators instead of assuming only one. Auth code must be changed to handle
+separate permissions/credentials on each brick.
+
+Beyond these "obvious" changes, many lesser changes will undoubtedly be needed
+anywhere that we make assumptions about the relationships between bricks and
+processes. Anything that involves a "helper" daemon - e.g. self-heal, quota -
+is particularly suspect in this regard.
+
+#### Implications on manageability
+
+The fact that bricks can only share a process when they have compatible
+transports might affect decisions about what transport options to use for
+separate volumes.
+
+#### Implications on presentation layer
+
+N/A
+
+#### Implications on persistence layer
+
+N/A
+
+#### Implications on 'GlusterFS' backend
+
+N/A
+
+#### Modification to GlusterFS metadata
+
+N/A
+
+#### Implications on 'glusterd'
+
+GlusterD changes are integral to this feature, and described above.
+
+How To Test
+-----------
+
+For the most part, testing is of the "do no harm" sort; the most thorough test
+of this feature is to run our current regression suite. Only one additional
+test is needed - create/start a volume with multiple bricks on one node, and
+check that only one glusterfsd process is running.
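+
+A minimal sketch of that check (volume name and brick paths are placeholders):
+
+```sh
+# Create and start a volume with several bricks on this node
+gluster volume create testvol $(hostname):/bricks/b{1,2,3} force
+gluster volume start testvol
+
+# With multiplexing, one glusterfsd should serve all three bricks
+pgrep -c glusterfsd          # expected: 1
+
+# volume status should show the same port and PID for every brick
+gluster volume status testvol
+```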
+
+User Experience
+---------------
+
+Volume status can now include the possibly-surprising result of multiple bricks
+on the same node having the same port number and PID. Anything that relies on
+these values, such as monitoring or automatic firewall configuration (or our
+regression tests) could get confused and/or end up doing the wrong thing.
+
+Dependencies
+------------
+
+N/A
+
+Documentation
+-------------
+
+TBD (very little)
+
+Status
+------
+
+Very basic functionality - starting/stopping bricks along with volumes,
+mounting, doing I/O - works. Some features, especially snapshots, probably do
+not work. Currently running tests to identify the precise extent of needed
+fixes.
+
+Comments and Discussion
+-----------------------
+
+N/A
diff --git a/done/GlusterFS 3.10/readdir-ahead.md b/done/GlusterFS 3.10/readdir-ahead.md
new file mode 100644
index 0000000..71e5b62
--- /dev/null
+++ b/done/GlusterFS 3.10/readdir-ahead.md
@@ -0,0 +1,167 @@
+Feature
+-------
+Improve directory enumeration performance
+
+Summary
+-------
+Improve directory enumeration performance by implementing parallel readdirp
+at the dht layer.
+
+Owners
+------
+
+Raghavendra G
+Poornima G
+Rajesh Joseph
+
+Current status
+--------------
+
+In development.
+
+Related Feature Requests and Bugs
+---------------------------------
+https://bugzilla.redhat.com/show_bug.cgi?id=1401812
+
+Detailed Description
+--------------------
+
+Currently readdirp is sequential at the dht layer.
+This makes find and recursive listing of small directories very slow
+(directories whose contents can be accommodated in one readdirp call,
+e.g. ~600 entries if the buf size is 128k).
+
+The number of readdirp fops required to fetch the ls -l -R for nested
+directories is:
+no. of fops = (x + 1) * m * n
+n = number of bricks
+m = number of directories
+x = number of readdirp calls required to fetch the dentries completely
+(this depends on the size of the directory and the readdirp buf size)
+1 = readdirp fop that is sent just to detect the end of the directory.
+
+Eg: Let's say we list 800 directories with ~300 files each, with a readdirp
+buf size of 128K, on distribute 6:
+(1+1) * 800 * 6 = 9600 fops
+
+And all the readdirp fops are sent in a sequential manner to all the bricks.
+With parallel readdirp, the number of fops may not decrease drastically,
+but since they are issued in parallel, it will increase the throughput.
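+
+The arithmetic can be sanity-checked with a trivial helper (the numbers are the
+example values from this page):
+
+```sh
+# no. of fops = (x + 1) * m * n, per the formula above
+readdirp_fops() {
+    local x=$1 m=$2 n=$3    # readdirp calls per dir, directories, bricks
+    echo $(( (x + 1) * m * n ))
+}
+
+readdirp_fops 1 800 6       # prints 9600, matching the example
+```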
+
+Why it's not a straightforward problem to solve:
+One needs to briefly understand how the directory offset is handled in dht.
+[1], [2], [3] are some links that hint at the same.
+- The d_off is in the order of bricks identified by dht. Hence, the dentries
+should always be returned in the same order as the bricks, i.e. brick2 entries
+shouldn't be returned before brick1 reaches EOD.
+- We cannot store any info about the offset read so far, etc., in inode_ctx or
+fd_ctx.
+- In the case of very large directories, with a readdirp buf too small to hold
+all the dentries in any brick, parallel readdirp is an overhead. Sequential
+readdirp best suits large directories. This demands that dht be aware of, or
+speculate, the directory size.
+
+There were two solutions that we evaluated:
+1. Change dht_readdirp itself to wind readdirp in parallel
+   http://review.gluster.org/15160
+   http://review.gluster.org/15159
+   http://review.gluster.org/15169
+2. Load readdir-ahead as a child of dht
+   http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1401812
+
+For the reasons mentioned below, we go with the second approach, suggested by
+Raghavendra G:
+- It requires no or very few changes in dht
+- Along with empty/small directories it also benefits large directories
+The only slightly complicated part would be to tune the readdir-ahead
+buffer size for each instance.
+
+The perf gain observed is directly proportional to:
+- The number of nodes in the cluster/volume
+- The latency between the client and each node in the volume.
+
+Some references:
+[1] http://review.gluster.org/#/c/4711
+[2] https://www.mail-archive.com/gluster-devel@gluster.org/msg02834.html
+[3] http://www.gluster.org/pipermail/gluster-devel/2015-January/043592.html
+
+Benefit to GlusterFS
+--------------------
+
+Improves directory enumeration performance in large clusters.
+
+Scope
+-----
+
+#### Nature of proposed change
+
+- Changes in the readdir-ahead and dht xlators.
+- Change glusterd to load readdir-ahead as a child of dht
+  without breaking upgrade and downgrade scenarios.
+
+#### Implications on manageability
+
+N/A
+
+#### Implications on presentation layer
+
+N/A
+
+#### Implications on persistence layer
+
+N/A
+
+#### Implications on 'GlusterFS' backend
+
+N/A
+
+#### Modification to GlusterFS metadata
+
+N/A
+
+#### Implications on 'glusterd'
+
+GlusterD changes are integral to this feature, and described above.
+
+How To Test
+-----------
+
+For the most part, testing is of the "do no harm" sort; the most thorough test
+of this feature is to run our current regression suite.
+Some specific test cases include readdirp on all kinds of volumes:
+- distribute
+- replicate
+- shard
+- disperse
+- tier
+Also, readdirp while:
+- rebalance is in progress
+- tiering migration is in progress
+- self heal is in progress
+
+All of these test cases should be run while the memory consumption of the
+process is monitored.
+
+User Experience
+---------------
+
+Faster directory enumeration
+
+Dependencies
+------------
+
+N/A
+
+Documentation
+-------------
+
+TBD (very little)
+
+Status
+------
+
+Development in progress
+
+Comments and Discussion
+-----------------------
+
+N/A
diff --git a/done/GlusterFS 3.10/rebalance-estimates.md b/done/GlusterFS 3.10/rebalance-estimates.md
new file mode 100644
index 0000000..2a2c299
--- /dev/null
+++ b/done/GlusterFS 3.10/rebalance-estimates.md
@@ -0,0 +1,128 @@
+Feature
+-------
+
+Summary
+-------
+
+Provide a user interface to determine when the rebalance process will complete
+
+Owners
+------
+Nithya Balachandran
+
+
+Current status
+--------------
+Patch being worked on.
+
+
+Related Feature Requests and Bugs
+---------------------------------
+https://bugzilla.redhat.com/show_bug.cgi?id=1396004
+Desc: RFE: An administrator friendly way to determine rebalance completion time
+
+
+Detailed Description
+--------------------
+The rebalance operation starts a rebalance process on each node of the volume.
+Each process scans the files and directories on the local subvols, fixes the
+layout for each directory and migrates files to their new hashed subvolumes
+based on the new layouts.
+
+Currently we do not have any way to determine how long the rebalance process
+will take to complete.
+
+The proposed approach is as follows:
+
+ 1. Determine the total number of files and directories on the local subvol
+ 2. Calculate the rate at which files have been processed since the rebalance started
+ 3. Calculate the time required to process all the files based on the rate calculated
+ 4. Send these values in the rebalance status response
+ 5. Calculate the maximum time required among all the rebalance processes
+ 6. Display the time required along with the rebalance status output
+
+
+The time taken is a factor of the number and size of the files and the number
+of directories.
+Determining the number of files and directories is difficult as Glusterfs
+currently does not keep track of the number of files on each brick.
+
+The current approach uses the statfs call to determine the number of used
+inodes and uses that number as a rough estimate of how many files/directories
+are present on the brick. However, this number is not very accurate because the
+.glusterfs directory contributes heavily to this number.
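+
+A back-of-the-envelope sketch of that estimate (the variable names and values
+are illustrative; the real numbers come from the rebalance process and statfs):
+
+```sh
+# Inputs: files processed so far, seconds elapsed, and a rough total taken
+# from the used-inode count reported by statfs.
+processed=120000
+elapsed=600                       # seconds since the rebalance started
+total=1500000                     # used inodes on the brick (rough file count)
+
+rate=$(( processed / elapsed ))   # files per second so far
+remaining=$(( (total - processed) / rate ))
+echo "estimated seconds left: ${remaining}"
+```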
+
+Benefit to GlusterFS
+--------------------
+Improves the usability of rebalance operations.
+Administrators can now determine how long a rebalance operation will take to
+complete, allowing better planning.
+
+
+Scope
+-----
+
+#### Nature of proposed change
+
+Modifications required to the rebalance and the cli code.
+
+#### Implications on manageability
+
+The gluster volume rebalance status output will be modified.
+
+#### Implications on presentation layer
+
+None
+
+#### Implications on persistence layer
+
+None
+
+#### Implications on 'GlusterFS' backend
+
+None
+
+#### Modification to GlusterFS metadata
+
+None
+
+#### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+Run a rebalance and compare the estimates with the time actually taken to
+complete the rebalance.
+
+The feature needs to be tested against large workloads to determine the
+accuracy of the calculated times.
+
+User Experience
+---------------
+
+The gluster volume rebalance status output
+will display the expected time left for the rebalance process to complete.
+
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+Documents to be updated with the changes in the rebalance status output.
+
+
+Status
+------
+In development.
+
+
+
+Comments and Discussion
+-----------------------
+
+*Follow here*
diff --git a/done/GlusterFS 3.10/tier_service.md b/done/GlusterFS 3.10/tier_service.md
new file mode 100644
index 0000000..47640ee
--- /dev/null
+++ b/done/GlusterFS 3.10/tier_service.md
@@ -0,0 +1,130 @@
+Feature
+-------
+
+Tier as a daemon with the service framework of gluster.
+
+Summary
+-------
+
+The current tier process uses the same dht code. If any change is made to DHT
+it affects tier, and vice versa. To support add-brick on a tiered
+volume, we need a rebalance daemon. So the current tier daemon has to be
+separated from DHT. The new daemon has therefore been split from DHT and
+brought under the service framework.
+
+Owners
+------
+
+Dan Lambright
+
+Hari Gowtham
+
+Current status
+--------------
+
+In the current code, the tier daemon does not fall under the service framework,
+which makes it hard for gluster to manage. Moving it into gluster's service
+framework makes it easier to manage.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[BUG] https://bugzilla.redhat.com/show_bug.cgi?id=1313838
+
+Detailed Description
+--------------------
+
+This change is similar to the other daemons that come under the service
+framework. The service framework takes care of:
+
+*) Spawning the daemon, killing it, and other such operations.
+*) Volume set options.
+*) Restarting the daemon at two points
+   1) when gluster goes down and comes up.
+   2) to stop detach tier.
+*) Reconfigure is used to make volfile changes. The reconfigure checks if the
+daemon needs a restart or not and then does it as per the requirement.
+By doing this, we don't restart the daemon every time.
+*) Volume status lists the status of the tier daemon as a process instead of
+a task.
+*) remove-brick and detach tier are separated at the code level.
+
+With this patch the log, pid, and volfile are separated and put into their
+respective directories.
+
+
+Benefit to GlusterFS
+--------------------
+
+Improved stability; helps glusterd manage the daemon during situations
+like updates, node down, and restart.
+
+Scope
+-----
+
+#### Nature of proposed change
+
+A new service will be made available. The existing code will be removed in a
+while to make DHT rebalance easy to maintain, as the DHT and tier code are
+separated.
+
+#### Implications on manageability
+
+The older gluster commands are designed to be compatible with this change.
+
+#### Implications on presentation layer
+
+None.
+
+#### Implications on persistence layer
+
+None.
+
+#### Implications on 'GlusterFS' backend
+
+Remains the same as for Tier.
+
+#### Modification to GlusterFS metadata
+
+None.
+
+#### Implications on 'glusterd'
+
+The data related to tier is made persistent (it will be available after a
+reboot).
+The brick op phase, which is different for Tier (the brick op phase was earlier
+used to communicate with the daemon instead of the bricks), has been
+implemented in the commit phase.
+The volfile changes for setting the options are also taken care of using the
+service framework.
+
+How To Test
+-----------
+
+The basic tier commands need to be tested, as not much changes from the user's
+perspective. The same tests (attaching a tier, detaching it,
+status) used for testing tier have to be used.
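+
+For reference, a sketch of those commands (the `gluster volume tier` syntax is
+assumed to match the 3.10-era CLI; volume and brick names are placeholders):
+
+```sh
+# Attach a hot tier, check the tier daemon's status, then detach it again
+gluster volume tier tvol attach replica 2 node1:/hot/b1 node2:/hot/b1
+gluster volume tier tvol status
+gluster volume tier tvol detach start
+gluster volume tier tvol detach commit
+```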
+
+User Experience
+---------------
+
+No changes.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+https://docs.google.com/document/d/1_iyjiwTLnBJlCiUgjAWnpnPD801h5LNxLhHmN7zmk1o/edit?usp=sharing
+
+Status
+------
+
+Code being reviewed.
+
+Comments and Discussion
+-----------------------
+
+*Follow here*
diff --git a/under_review/client-opversion.md b/under_review/client-opversion.md
deleted file mode 100644
index 8c9991e..0000000
--- a/under_review/client-opversion.md
+++ /dev/null
@@ -1,111 +0,0 @@
-Feature
--------
-
-Summary
--------
-
-Support to get the op-version information for each client through the volume
-status command.
- -Owners ------- - -Samikshan Bairagya - -Current status --------------- - -Currently the only way to get an idea regarding the version of the connected -clients is to grep for "accepted client from" in /var/log/glusterfs/bricks. -There is no command that gives that information out to the users. - -Related Feature Requests and Bugs ---------------------------------- - -https://bugzilla.redhat.com/show_bug.cgi?id=1409078 - -Detailed Description --------------------- - -The op-version information for each client can be added to the already existing -volume status command. `volume status clients` currently gives the -following information for each client: - -* Hostname:port -* Bytes Read -* Bytes Written - -Benefit to GlusterFS --------------------- - -This would make the user-experience better as it would make it easier for users -to know the op-version of each client from a single command. - -Scope ------ - -#### Nature of proposed change - -Adds more information to `volume status clients` output. - -#### Implications on manageability - -None. - -#### Implications on presentation layer - -None. - -#### Implications on persistence layer - -None. - -#### Implications on 'GlusterFS' backend - -None. - -#### Modification to GlusterFS metadata - -None. - -#### Implications on 'glusterd' - -None. - -How To Test ------------ - -This can be tested by having clients with different glusterfs versions connected -to running volumes, and executing the `volume status clients` -command. - -User Experience ---------------- - -Users can use the `volume status clients` command to get -information on the op-versions for each client along with information that were -already available like (hostname, bytes read and bytes written). - -Dependencies ------------- - -None - -Documentation -------------- - -None. - -Status ------- - -In development. - -Comments and Discussion ------------------------ - - 1. Discussion on gluster-devel ML: - - [Thread 1](http://www.gluster.org/pipermail/gluster-users/2016-January/025064.html) - - [Thread 2](http://www.gluster.org/pipermail/gluster-devel/2017-January/051820.html) - 2. [Discussion on Github](https://github.com/gluster/glusterfs/issues/79) - diff --git a/under_review/max-opversion.md b/under_review/max-opversion.md deleted file mode 100644 index 16d4ee4..0000000 --- a/under_review/max-opversion.md +++ /dev/null @@ -1,118 +0,0 @@ -Feature -------- - -Summary -------- - -Support to retrieve the maximum supported op-version (cluster.op-version) in a -heterogeneous cluster. - -Owners ------- - -Samikshan Bairagya - -Current status --------------- - -Currently users can retrieve the op-version on which a cluster is operating by -using the gluster volume get command on the global option cluster.op-version as -follows: - -# gluster volume get cluster.op-version - -There is however no way for an user to find out the maximum op-version to which -the cluster could be bumped upto. - -Related Feature Requests and Bugs ---------------------------------- - -https://bugzilla.redhat.com/show_bug.cgi?id=1365822 - -Detailed Description --------------------- - -A heterogeneous cluster operates on a common op-version that can be supported -across all the nodes in the trusted storage pool.Upon upgrade of the nodes in -the cluster, the cluster might support a higher op-version. However, since it -is currently not possible for the user to get this op-version value, it is -difficult for them to bump up the op-version of the cluster to the supported -value. 
- -The maximum supported op-version in a cluster would be the minimum of the -maximum op-versions in each of the nodes. To retrieve this, the volume get -functionality could be invoked as follows: - -# gluster volume get all cluster.max-op-version - -Benefit to GlusterFS --------------------- - -This would make the user-experience better as it would make it easier for users -to know the maximum op-version on which the cluster can operate. - -Scope ------ - -#### Nature of proposed change - -This adds a new non-settable global option, cluster.max-op-version. - -#### Implications on manageability - -None. - -#### Implications on presentation layer - -None. - -#### Implications on persistence layer - -None. - -#### Implications on 'GlusterFS' backend - -None. - -#### Modification to GlusterFS metadata - -None. - -#### Implications on 'glusterd' - -None. - -How To Test ------------ - -This can be tested on a cluster with at least one node running on version 'n+1' -and others on version 'n' where n = 3.10. The maximum supported op-version -(cluster.max-op-version) should be returned by `volume get` as n in this case. - -User Experience ---------------- - -Upon upgrade of one or more nodes in a cluster, users can get the new maximum -op-version the cluster can support. - -Dependencies ------------- - -None - -Documentation -------------- - -None. - -Status ------- - -In development. - -Comments and Discussion ------------------------ - - 1. [Discussion on gluster-devel ML](http://www.gluster.org/pipermail/gluster-devel/2016-December/051650.html) - 2. [Discussion on Github](https://github.com/gluster/glusterfs/issues/56) - diff --git a/under_review/multiplexing.md b/under_review/multiplexing.md deleted file mode 100644 index fd06150..0000000 --- a/under_review/multiplexing.md +++ /dev/null @@ -1,141 +0,0 @@ -Feature -------- -Brick Multiplexing - -Summary -------- - -Use one process (and port) to serve multiple bricks. - -Owners ------- - -Jeff Darcy (jdarcy@redhat.com) - -Current status --------------- - -In development. - -Related Feature Requests and Bugs ---------------------------------- - -Mostly N/A, except that this will make implementing real QoS easier at some -point in the future. - -Detailed Description --------------------- - -The basic idea is very simple: instead of spawning a new process for every -brick, we send an RPC to an existing brick process telling it to attach the new -brick (identified and described by a volfile) beneath its protocol/server -instance. Likewise, instead of killing a process to terminate a brick, we tell -it to detach one of its (possibly several) brick translator stacks. - -Bricks can *not* share a process if they use incompatible transports (e.g. TLS -vs. non-TLS). Also, a brick process serving several bricks is a larger failure -domain than we have with a process per brick, so we might voluntarily decide to -spawn a new process anyway just to keep the failure domains smaller. Lastly, -there should always be a fallback to current brick-per-process behavior, by -simply pretending that all bricks' transports are incompatible with each other. - -Benefit to GlusterFS --------------------- - -Multiplexing should significantly reduce resource consumption: - - * Each *process* will consume one TCP port, instead of each *brick* doing so. - - * The cost of global data structures and object pools will be reduced to 1/N - of what it is now, where N is the average number of bricks per process. - - * Thread counts will also be reduced to 1/N. 
This avoids the exponentially - bad thrashing effects as the total number of threads far exceeds the number - of cores, made worse by multiple processes trying to auto-scale the nunber - of network and disk I/O threads independently. - -These resource issues are already limiting the number of bricks and volumes we -can support. By reducing all forms of resource consumption at once, we should -be able to raise these user-visible limits by a corresponding amount. - -Scope ------ - -#### Nature of proposed change - -The largest changes are at the two places where we do brick and process -management - GlusterD at one end, generic glusterfsd code at the other. The -new messages require changes to rpc and client/server translator code. The -server translator needs further changes to look up one among several child -translators instead of assuming only one. Auth code must be changed to handle -separate permissions/credentials on each brick. - -Beyond these "obvious" changes, many lesser changes will undoubtedly be needed -anywhere that we make assumptions about the relationships between bricks and -processes. Anything that involves a "helper" daemon - e.g. self-heal, quota - -is particularly suspect in this regard. - -#### Implications on manageability - -The fact that bricks can only share a process when they have compatible -transports might affect decisions about what transport options to use for -separate volumes. - -#### Implications on presentation layer - -N/A - -#### Implications on persistence layer - -N/A - -#### Implications on 'GlusterFS' backend - -N/A - -#### Modification to GlusterFS metadata - -N/A - -#### Implications on 'glusterd' - -GlusterD changes are integral to this feature, and described above. - -How To Test ------------ - -For the most part, testing is of the "do no harm" sort; the most thorough test -of this feature is to run our current regression suite. Only one additional -test is needed - create/start a volume with multiple bricks on one node, and -check that only one glusterfsd process is running. - -User Experience ---------------- - -Volume status can now include the possibly-surprising result of multiple bricks -on the same node having the same port number and PID. Anything that relies on -these values, such as monitoring or automatic firewall configuration (or our -regression tests) could get confused and/or end up doing the wrong thing. - -Dependencies ------------- - -N/A - -Documentation -------------- - -TBD (very little) - -Status ------- - -Very basic functionality - starting/stopping bricks along with volumes, -mounting, doing I/O - work. Some features, especially snapshots, probably do -not work. Currently running tests to identify the precise extent of needed -fixes. - -Comments and Discussion ------------------------ - -N/A diff --git a/under_review/readdir-ahead.md b/under_review/readdir-ahead.md deleted file mode 100644 index 71e5b62..0000000 --- a/under_review/readdir-ahead.md +++ /dev/null @@ -1,167 +0,0 @@ -Feature -------- -Improve directory enumeration performance - -Summary -------- -Improve directory enumeration performance by implementing parallel readdirp -at the dht layer. - -Owners ------- - -Raghavendra G -Poornima G -Rajesh Joseph - -Current status --------------- - -In development. - -Related Feature Requests and Bugs ---------------------------------- -https://bugzilla.redhat.com/show_bug.cgi?id=1401812 - -Detailed Description --------------------- - -Currently readdirp is sequential at the dht layer. 
-This makes find and recursive listing of small directories very slow -(directory whose content can be accomodated in one readdirp call, -eg: ~600 entries if buf size is 128k). - -The number of readdirp fops required to fetch the ls -l -R for nested -directories is: -no. of fops = (x + 1) * m * n -n = number of bricks -m = number of directories -x = number of readdirp calls required to fetch the dentries completely -(this depends on the size of the directory and the readdirp buf size) -1 = readdirp fop that is sent to just detect the end of directory. - -Eg: Let's say, to list 800 directories with files ~300 each and readdirp -buf size 128K, on distribute 6: -(1+1) * 800 * 6 = 9600 fops - -And all the readdirp fops are sent in sequential manner to all the bricks. -With parallel readdirp, the number of fops may not decrease drastically -but since they are issued in parallel, it will increase the throughput. - -Why its not a straightforward problem to solve: -One needs to briefly understand, how the directory offset is handled in dht. -[1], [2], [3] are some of the links that will hint the same. -- The d_off is in the order of bricks identfied by dht. Hence, the dentries -should always be returned in the same order as bricks. i.e. brick2 entries -shouldn't be returned before brick1 reaches EOD. -- We cannot store any info of offset read so far etc. in inode_ctx or fd_ctx -- In case of a very large directories, and readdirp buf too small to hold -all the dentries in any brick, parallel readdirp is a overhead. Sequential -readdirp best suits the large directories. This demands dht be aware of or -speculate the directory size. - -There were two solutions that we evaluated: -1. Change dht_readdirp itself to wind readdirp parallely - http://review.gluster.org/15160 - http://review.gluster.org/15159 - http://review.gluster.org/15169 -2. Load readd-ahead as a child of dht - http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1401812 - -For the below mentioned reasons we go with the second approach suggested by -Ragavendra G: -- It requires nil or very less changes in dht -- Along with empty/small directories it also benifits large directories -The only slightly complecated part would be to tune the readdir-ahead -buffer size for each instance. - -The perf gain observed is directly proportional to the: -- Number of nodes in the cluster/Volume -- Latency between client and each node in the volume. - -Some references: -[1] http://review.gluster.org/#/c/4711 -[2] https://www.mail-archive.com/gluster-devel@gluster.org/msg02834.html -[3] http://www.gluster.org/pipermail/gluster-devel/2015-January/043592.html - -Benefit to GlusterFS --------------------- - -Improves directory enumeration performance in large clusters. - -Scope ------ - -#### Nature of proposed change - -- Changes in readdir-ahead, dht xlators. -- Change glusterd to load readdir-ahead as a child of dht - and without breaking upgrade and downgrade scenarios - -#### Implications on manageability - -N/A - -#### Implications on presentation layer - -N/A - -#### Implications on persistence layer - -N/A - -#### Implications on 'GlusterFS' backend - -N/A - -#### Modification to GlusterFS metadata - -N/A - -#### Implications on 'glusterd' - -GlusterD changes are integral to this feature, and described above. - -How To Test ------------ - -For the most part, testing is of the "do no harm" sort; the most thorough test -of this feature is to run our current regression suite. 
-Some specific test cases include readdirp on all kind of volumes: -- distribute -- replicate -- shard -- disperse -- tier -Also, readdirp while: -- rebalance in progress -- tiering migration in progress -- self heal in progress - -And all the test cases being run while the memory consumption of the process -is monitored. - -User Experience ---------------- - -Faster directory enumeration - -Dependencies ------------- - -N/A - -Documentation -------------- - -TBD (very little) - -Status ------- - -Development in progress - -Comments and Discussion ------------------------ - -N/A diff --git a/under_review/rebalance-estimates.md b/under_review/rebalance-estimates.md deleted file mode 100644 index 2a2c299..0000000 --- a/under_review/rebalance-estimates.md +++ /dev/null @@ -1,128 +0,0 @@ -Feature -------- - -Summary -------- - -Provide a user interface to determine when the rebalance process will complete - -Owners ------- -Nithya Balachandran - - -Current status --------------- -Patch being worked on. - - -Related Feature Requests and Bugs ---------------------------------- -https://bugzilla.redhat.com/show_bug.cgi?id=1396004 -Desc: RFE: An administrator friendly way to determine rebalance completion time - - -Detailed Description --------------------- -The rebalance operation starts a rebalance process on each node of the volume. -Each process scans the files and directories on the local subvols, fixes the layout -for each directory and migrates files to their new hashed subvolumes based on the -new layouts. - -Currently we do not have any way to determine how long the rebalance process will -take to complete. - -The proposed approach is as follows: - - 1. Determine the total number of files and directories on the local subvol - 2. Calculate the rate at which files have been processed since the rebalance started - 3. Calculate the time required to process all the files based on the rate calculated - 4. Send these values in the rebalance status response - 5. Calculate the maximum time required among all the rebalance processes - 6. Display the time required along with the rebalance status output - - -The time taken is a factor or the number and size of the files and the number of directories. -Determining the number of files and directories is difficult as Glusterfs currently -does not keep track of the number of files on each brick. - -The current approach uses the statfs call to determine the number of used inodes -and uses that number as a rough estimate of how many files/directories ae present -on the brick. However, this number is not very accurate because the .glusterfs -directory contributes heavily to this number. - -Benefit to GlusterFS --------------------- -Improves the usability of rebalance operations. -Administrators can now determine how long a rebalance operation will take to complete -allowing better planning. - - -Scope ------ - -#### Nature of proposed change - -Modifications required to the rebalance and the cli code. - -#### Implications on manageability - -gluster volume rebalance status output will be modified - -#### Implications on presentation layer - -None - -#### Implications on persistence layer - -None - -#### Implications on 'GlusterFS' backend - -None - -#### Modification to GlusterFS metadata - -None - -#### Implications on 'glusterd' - -None - -How To Test ------------ - -Run a rebalance and compare the estimates with the time actually taken to complete -the rebalance. 
- -The feature needs to be tested against large workloads to determine the accuracy -of the calculated times. - -User Experience ---------------- - -Gluster volume rebalance status -will display the expected time left for the rebalance process to complete - - -Dependencies ------------- - -None - -Documentation -------------- - -Documents to be updated with the changes in the rebalance status output. - - -Status ------- -In development. - - - -Comments and Discussion ------------------------ - -*Follow here* diff --git a/under_review/tier_service.md b/under_review/tier_service.md deleted file mode 100644 index 47640ee..0000000 --- a/under_review/tier_service.md +++ /dev/null @@ -1,130 +0,0 @@ -Feature -------- - -Tier as a daemon with the service framework of gluster. - -Summary -------- - -Current tier process uses the same dht code. If any change is made to DHT -it affects tier and vice versa. On an attempt to support add brick on tiered -volume, we need a rebalance daemon. So the current tier daemon has to be -separated from DHT. And so the new Daemon has been split from DHT and comes -under the service framework. - -Owners ------- - -Dan Lambright - -Hari Gowtham - -Current status --------------- - -In the current code, it doesn't fall under the service framework and this -makes it hard for gluster to manage the daemon. Moving it into the gluster's -service framework makes it easier to be managed. - -Related Feature Requests and Bugs ---------------------------------- - -[BUG] https://bugzilla.redhat.com/show_bug.cgi?id=1313838 - -Detailed Description --------------------- - -This change is similar to the other daemons that come under service framework. -The service framework takes care of : - -*) Spawning the daemon, killing it and other such processes. -*) Volume set options. -*) Restarting the daemon at two points - 1) when gluster goes down and comes up. - 2) to stop detach tier. -*) Reconfigure is used to make volfile changes. The reconfigure checks if the -daemons needs a restart or not and then does it as per the requirement. -By doing this, we don’t restart the daemon everytime. -*) Volume status lists the status of tier daemon as a process instead of -a task. -*) remove-brick and detach tier are separated from code level. - -With this patch the log, pid, and volfile are separated and put into respective -directories. - - -Benefit to GlusterFS --------------------- - -Improved Stability, helps the glusterd to manage the daemon during situations -like update, node down, and restart. - -Scope ------ - -#### Nature of proposed change - -A new service will be made available. The existing code will be removed in a -while to make DHT rebalance easy to maintain as the DHT and tier code are -separated. - -#### Implications on manageability - -The older gluster commands are designed to be compatible with this change. - -#### Implications on presentation layer - -None. - -#### Implications on persistence layer - -None. - -#### Implications on 'GlusterFS' backend - -Remains the same as for Tier. - -#### Modification to GlusterFS metadata - -None. - -#### Implications on 'glusterd' - -The data related to tier is made persistent (will be available after reboot). -The brick op phase being different for Tier (brick op phase was earlier used -to communicate with the daemon instead of bricks) has been implemented in -the commit phase. -The volfile changes for setting the options are also take care of using the -service framework. 
- -How To Test ------------ - -The basic tier commands need to be tested as it doesn't change much -in the user perspective. The same test (like attaching tier, detaching it, -status) used for testing tier have to be used. - -User Experience ---------------- - -No changes. - -Dependencies ------------- - -None. - -Documentation -------------- - -https://docs.google.com/document/d/1_iyjiwTLnBJlCiUgjAWnpnPD801h5LNxLhHmN7zmk1o/edit?usp=sharing - -Status ------- - -Code being reviewed. - -Comments and Discussion ------------------------ - -*Follow here* -- cgit