Diffstat (limited to 'under_review')
-rw-r--r--  under_review/Better Brick Mgmt.md      | 180
-rw-r--r--  under_review/Compression Dedup.md      | 128
-rw-r--r--  under_review/Kerberos.md               | 326
-rw-r--r--  under_review/Split Network.md          | 138
-rw-r--r--  under_review/caching.md                | 143
-rw-r--r--  under_review/code-generation.md        | 143
-rw-r--r--  under_review/composite-operations.md   | 438
-rw-r--r--  under_review/dht-scalability.md        | 171
-rw-r--r--  under_review/index.md                  |  82
-rw-r--r--  under_review/lockdep.md                | 101
-rw-r--r--  under_review/stat-xattr-cache.md       | 197
-rw-r--r--  under_review/template.md               |  93
-rw-r--r--  under_review/volgen-rewrite.md         | 128
13 files changed, 2268 insertions, 0 deletions
diff --git a/under_review/Better Brick Mgmt.md b/under_review/Better Brick Mgmt.md
new file mode 100644
index 0000000..adfc781
--- /dev/null
+++ b/under_review/Better Brick Mgmt.md
@@ -0,0 +1,180 @@
+Goal
+----
+
+Easier (more autonomous) assignment of storage to specific roles
+
+Summary
+-------
+
+Managing bricks and arrangements of bricks (e.g. into replica sets)
+manually doesn't scale. Instead, we need more intuitive ways to group
+bricks together into pools, allocate space from those pools (creating
+new pools), and let users define volumes in terms of pools rather than
+individual bricks. The system then takes on the job of arranging those
+bricks into an intelligent volume configuration, e.g. replicating
+between bricks that are the same size/speed/type but not on the same
+server.
+
+Because this smarter and/or finer-grain resource allocation (plus
+general technology evolution) is likely to result in many more bricks
+per server than we have now, we also need a brick-daemon infrastructure
+capable of handling that.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, waiting until summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/data-classification](../GlusterFS 3.7/Data Classification.md)
+will drive the heaviest and/or most sophisticated use of this feature,
+and some of the underlying mechanisms were originally proposed there.
+
+Detailed Description
+--------------------
+
+To start with, we need to distinguish between the raw brick that the
+user allocates to GlusterFS and the pieces of that brick that result
+from our complicated storage allocation. Some documents refer to these
+as u-brick and s-brick respectively, though perhaps it's better to keep
+calling the former bricks and come up with a new name for the latter -
+slice, tile, pebble, etc. For now, let's stick with the u-brick/s-brick
+terminology. We can manipulate these objects in several ways.
+
+- Group u-bricks together into an equivalent pool of s-bricks
+ (trivially 1:1).
+
+- Allocate space from a pool of s-bricks, creating a set of smaller
+ s-bricks. Note that the results of applying this repeatedly might be
+ s-bricks which are on the same u-brick but part of different
+ volumes.
+
+- Combine multiple s-bricks into one via some combination of
+ replication, erasure coding, distribution, tiering, etc.
+
+- Export an s-brick as a volume.
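+
+As a rough illustration of this object model, here is a sketch in
+Python. Every name and field in it is hypothetical - this is an aid to
+understanding the operations above, not a committed design:
+
+    # Hypothetical u-brick/s-brick object model; illustration only.
+    import uuid
+
+    class SBrick:
+        def __init__(self, size, u_brick, parent=None):
+            self.id = uuid.uuid4()
+            self.size = size        # capacity in bytes
+            self.u_brick = u_brick  # raw brick this space comes from
+            self.parent = parent    # pool this s-brick was carved from
+
+    class Pool:
+        def __init__(self, s_bricks):
+            self.s_bricks = list(s_bricks)
+
+        @classmethod
+        def from_u_bricks(cls, u_bricks):
+            # Group u-bricks into an equivalent pool of s-bricks (1:1).
+            return cls(SBrick(size, ub) for ub, size in u_bricks)
+
+        def allocate(self, size):
+            # Carve a smaller s-brick out of each member. Repeated calls
+            # can leave s-bricks from different volumes on one u-brick.
+            carved = []
+            for sb in self.s_bricks:
+                assert sb.size >= size
+                sb.size -= size
+                carved.append(SBrick(size, sb.u_brick, parent=sb))
+            return Pool(carved)
+
+    def combine(kind, pools):
+        # Combine s-bricks via replication, erasure coding, etc. The
+        # result can be combined again, or exported as a volume.
+        return (kind, [p.s_bricks for p in pools])
+
+    pool_a = Pool.from_u_bricks([("server1:/bricks/b1", 10**12)])
+    pool_b = Pool.from_u_bricks([("server2:/bricks/b1", 10**12)])
+    volume = combine("replica-2",
+                     [pool_a.allocate(10**9), pool_b.allocate(10**9)])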
+
+These operations - especially combining - can be applied iteratively,
+creating successively more complex structures prior to the final export.
+To support this, the code we currently use to generate volfiles needs to
+be changed to generate similar definitions for the various levels of
+s-bricks. Combined with the need to support versioning of these files
+(for snapshots), this probably means a rewrite of the volgen code.
+Another type of configuration file we need to create is for a brick
+daemon. We will still run one glusterfsd process per u-brick, for
+several reasons:
+
+- Maximize compatibility with our current infrastructure for starting
+ and monitoring server processes.
+
+- Align the boundaries between actual and detected device failures.
+
+- Reduce the number of ports assigned, both for administrative
+ convenience and to avoid exhaustion.
+
+- Reduce context-switch and virtual-memory thrashing between too many
+ uncoordinated processes. Some day we might even add custom resource
+ control/scheduling between s-bricks within a process, which would be
+ impossible in separate processes.
+
+These new glusterfsd processes are going to require more complex
+volfiles, and more complex translator-graph code to consume those. They
+also need to be more parallel internally, so this feature depends on
+eliminating single-threaded bottlenecks such as our socket transport.
+
+Benefit to GlusterFS
+--------------------
+
+- Reduced administrative overhead for large/complex volume
+ configurations.
+
+- More flexible/sophisticated volume configurations, especially with
+ respect to other features such as tiering or internal enhancements
+ such as overlapping replica/erasure sets.
+
+- Improved performance.
+
+Scope
+-----
+
+### Nature of proposed change
+
+- New object model, exposed via both glusterd-level and user-level
+ commands on those objects.
+
+- Rewritten volfile infrastructure.
+
+- Significantly enhanced translator-graph infrastructure.
+
+- Multi-threaded transport.
+
+### Implications on manageability
+
+New commands will be needed to group u-bricks into pools, allocate
+s-bricks from pools, etc. There will also be new commands to view status
+of objects at various levels, and perhaps to set options on them. On the
+other hand, "volume create" will probably become simpler as the
+specifics of creating a volume are delegated downward to s-bricks.
+
+### Implications on presentation layer
+
+Surprisingly little.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+The on-disk structures (.glusterfs and so on) currently associated with
+a brick become associated with an s-brick. The u-brick itself will
+contain little, probably just an enumeration of the s-bricks into which
+it has been divided.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+See detailed description.
+
+How To Test
+-----------
+
+New tests will be needed for grouping/allocation functions. In
+particular, negative tests for incorrect or impossible configurations
+will be needed. Once s-bricks have been aggregated back into volumes,
+most of the current volume-level tests will still apply. Related tests
+will also be developed as part of the data classification feature.
+
+User Experience
+---------------
+
+See "implications on manageability" etc.
+
+Dependencies
+------------
+
+This feature is so closely associated with data classification that the
+two can barely be considered separately.
+
+Documentation
+-------------
+
+Much of our "brick and volume management" documentation will require a
+thorough review, if not an actual rewrite.
+
+Status
+------
+
+Design still in progress.
+
+Comments and Discussion
+-----------------------
diff --git a/under_review/Compression Dedup.md b/under_review/Compression Dedup.md
new file mode 100644
index 0000000..7829018
--- /dev/null
+++ b/under_review/Compression Dedup.md
@@ -0,0 +1,128 @@
+Feature
+-------
+
+Compression / Deduplication
+
+Summary
+-------
+
+In the never-ending quest to increase storage efficiency (or,
+equivalently, to decrease storage cost), we could compress and/or
+deduplicate data
+stored on bricks.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Just a vague idea so far.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+TBD
+
+Detailed Description
+--------------------
+
+Compression and deduplication for GlusterFS have been discussed many
+times. Deduplication across machines/bricks is a recognized Hard
+Problem, with uncertain benefits, and is thus considered out of scope.
+Deduplication within a brick is potentially achievable by using
+something like
+[lessfs](http://sourceforge.net/projects/lessfs/files/),
+which is itself a FUSE filesystem, so one fairly simple approach would
+be to integrate lessfs as a translator. There's no similar option for
+compression.
+
+In both cases, it's generally preferable to work on fully expanded files
+while they're open, and then compress/dedup when they're closed. Some of
+the bitrot or tiering infrastructure might be useful for moving files
+between these states, or detecting when such a change is needed. There
+are also some interesting interactions with quota, since we need to
+count the un-compressed un-deduplicated size of the file against quota
+(or do we?) and that's not what the underlying local file system will
+report.
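+
+As a toy model of the expand-on-open / compress-on-close idea and its
+quota interaction (pure illustration - the real work would live in a
+translator, not in Python):
+
+    # Toy model: expand on open, compress on close, and count the
+    # logical (uncompressed) size against quota rather than whatever
+    # the underlying local filesystem reports.
+    import zlib
+
+    class CompressedStore:
+        def __init__(self):
+            self.blobs = {}       # path -> compressed bytes "on disk"
+            self.open_files = {}  # path -> expanded bytes while open
+            self.quota_used = 0   # logical size, for quota accounting
+
+        def open(self, path):
+            blob = self.blobs.get(path, zlib.compress(b""))
+            self.open_files[path] = bytearray(zlib.decompress(blob))
+
+        def write(self, path, data):
+            self.quota_used += len(data)  # quota sees the logical size
+            self.open_files[path].extend(data)
+
+        def close(self, path):
+            data = bytes(self.open_files.pop(path))
+            self.blobs[path] = zlib.compress(data)  # shrink on close
+
+    store = CompressedStore()
+    store.open("/vol/file")
+    store.write("/vol/file", b"hello " * 1000)
+    store.close("/vol/file")
+    print(store.quota_used, len(store.blobs["/vol/file"]))  # 6000 vs ~50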
+
+Benefit to GlusterFS
+--------------------
+
+Less \$\$\$/GB for our users.
+
+Scope
+-----
+
+### Nature of proposed change
+
+New translators, hooks into bitrot/tiering/quota, probably new daemons.
+
+### Implications on manageability
+
+Besides turning these options on or off, or setting parameters, there
+will probably need to be some way of reporting the real vs.
+compressed/deduplicated size of files/bricks/volumes.
+
+### Implications on presentation layer
+
+Should be none.
+
+### Implications on persistence layer
+
+If the device-mapper (DM) developers ever get their act together on
+this front, we might be able to use some of their work instead of
+lessfs - though our experience relying on DM for thin provisioning and
+snapshots does not inspire confidence.
+
+### Implications on 'GlusterFS' backend
+
+What's on the brick will no longer match the data that the user stored
+(and might some day retrieve). In the case of compression,
+reconstituting the user-visible version of the data should be a simple
+matter of decompressing via a well known algorithm. In the case of
+deduplication, the relevant data structures are much more complicated
+and reconstitution will be correspondingly more difficult.
+
+### Modification to GlusterFS metadata
+
+Some of the information tracking deduplicated blocks will probably be
+stored "privately" in .glusterfs or similar.
+
+### Implications on 'glusterd'
+
+TBD
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+Mostly unchanged, except for performance. As with erasure coding, a
+compressed/deduplicated slow tier will usually need to be paired with a
+simpler fast tier for overall performance to be acceptable.
+
+Dependencies
+------------
+
+External: lessfs, DM, whatever other technology we use to do the
+low-level work
+
+Internal: tiering/bitrot (perhaps changelog?) to track state and detect
+changes
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Still just a vague idea.
+
+Comments and Discussion
+-----------------------
diff --git a/under_review/Kerberos.md b/under_review/Kerberos.md
new file mode 100644
index 0000000..8dda497
--- /dev/null
+++ b/under_review/Kerberos.md
@@ -0,0 +1,326 @@
+# Feature
+
+Kerberos
+
+# Summary
+
+Support for Kerberos in the different Gluster protocols. Configuration and
+usage is expected to be similar to Kerberized NFS.
+
+
+# Owners
+
+* core sponsor:
+ * [Niels de Vos](mailto:ndevos@redhat.com)
+* design reviewer:
+ * ...
+* developers
+ * ...
+
+
+# Current status
+
+Gluster supports SSL connections, but managing SSL certificates is
+difficult. Many organisations already use Kerberos, and using the same
+infrastructure for authentication and encryption will make deployment
+easier.
+
+
+# Related Feature Requests and Bugs
+
+There is a dependency on the compound/composite procedures for the Gluster
+protocol. This is needed to pass additional details that are currently passed
+in the RPC-header to the storage servers (mainly `lockowner`).
+
+
+# Benefit to GlusterFS
+
+* multi-user/multi-tenancy authentication/encryption, not per mountpoint
+* Kerberos is often available in enterprise environments, a lower barrier to entry compared to SSL
+* based on the industry standard for Kerberized NFS
+
+
+# Detailed Description
+
+*Detailed Feature Description*
+
+
+# Scope
+
+* similar to NFS in setup and features
+* authentication, (maybe) integrity, encryption
+* client <-> Gluster and Gluster internal
+* a KDC is not part of the development, existing KDCs should be used
+
+
+## Nature of proposed change
+
+Changes to the authentication part of the protocols affect all processes that
+handle networking.
+
+* clients
+ * FUSE client
+ * gfapi library (Samba, QEMU, NFS-Ganesha, ...)
+ * service processes like rebalance
+ * Gluster/NFS server
+
+* Server processes
+ * GlusterD management daemon
+ * brick processes (storage units)
+
+* Gluster CLI (configuration interface talking to GlusterD)
+
+
+## Design
+
+### Hostname/IP resolving:
+
+* hostnames will be required (functioning DNS)
+* hostname for mgmt daemon, ip-addresses for bricks possible (optional, needs extra work)
+* load-balancing (shared/rr-dns hostname) would need additional work (required, common setup)
+
+
+### Process of mounting (or connecting with libgfapi):
+
+* client connects to the GlusterD service (can be on the same server, or different machine)
+* client requests the volume-file that describes the layout (bricks) of the volume
+* client connects to the bricks
+
+
+### Principals
+
+Principals are expected to be configurable through the Gluster CLI, program
+arguments and/or in the `glusterd.vol` file. The example below illustrates the
+defaults and shows the differences between roles.
+
+On GlusterD servers:
+
+* `glusterd/${rr_dns}@REALM`: client side acceptance through the rr-dns hostname
+* `gluster/${hostname}@REALM`: any access directly to this storage server (glusterd, bricks, server-side processes like shd)
+
+Diagram showing the communications in the Trusted Storage Pool:
+
+ .-----------. <fetch volume layout> .----------.
+ | self-heal |--------------------------------->| GlusterD |
+ '-----------' gluster/${hostname}@REALM '----------'
+ ^
+ |
+ <GlusterD internal> |
+ gluster/${hostname}@REALM |
+ |
+ v
+ .----------.
+ | GlusterD |
+ '----------'
+
+GlusterD and self-heal in the above diagram are examples of services that are
+trusted by the storage servers. There is no user interaction for daemons like
+these, and the processes will only run on the storage servers. Rebalance,
+Gluster/NFS and quotad are other daemons that use the
+`gluster/${hostname}@REALM` Service Principal Name (SPN).
+
+On client systems that connect to GlusterD for the volume file:
+
+* `glusterfs/${client}@REALM`: client-side towards servers (glusterd/bricks)
+* `${username}@REALM`: I/O done by the user (or service processes like qemu)
+
+Diagram showing the communication done by the Gluster FUSE client. The
+principals used are marked with "I" for Initiators and "A" for the Acceptors:
+
+ .-------------. <fetch volume layout> .----------.
+ | FUSE client |--------------------------------->| GlusterD |
+ '-------------' I:glusterfs/${client}@REALM '----------'
+ | A:glusterd/{rr_dns}@REALM
+ |
+ |
+ | <I/O> .-------.
+ '------------------------------->| brick |
+ I:${username}@REALM '-------'
+ A:gluster/${hostname}@REALM
+
+
+When a client connects to a GlusterD service, GlusterD should provide its
+authentication with the `glusterd/${rr_dns}@REALM` principal. This Kerberos TGT
+may be shared by multiple GlusterD services, so that a round-robin DNS hostname
+can be used for mounting.
+
+
+### libgfapi access difficulties
+
+Kerberized Samba or NFS clients should be able to connect to a filesystem
+service (Samba or NFS-Ganesha), and get authenticated by their User Principal
+Name at the Gluster processes. GSSAPI supports this through constrained
+delegation. Not all KDCs support this feature, but Active Directory and
+FreeIPA do.
+
+There is a difficulty where a filesystem service (like NFS-Ganesha or Samba)
+receives connections from a non-Kerberos client, but does need to speak the
+Kerberized Gluster protocol to the storage servers. The services will need to
+impersonate the different users. GSS-Proxy makes it possible to obtain Kerberos
+TGTs on behalf of the connecting user. This TGT can then be used to
+authenticate the user through Kerberos at the Gluster services.
+
+ .-----------.
+ | User |
+ |-----------|
+ | Linux NFS |
+ '-----------'
+ |
+ | - - - - - - [Kerberos optional]
+ v
+ .-------------.
+ | NFS-Ganesha |
+ |-------------| <fetch volume layout> .----------.
+ | libgfapi |--------------------------------->| GlusterD |
+ '-------------' I:glusterfs/${client}@REALM '----------'
+ | A:glusterd/${rr_dns}@REALM
+ |
+ |
+ | <I/O> .-------.
+ '------------------------------->| brick |
+ I:${username}@REALM '-------'
+ A:gluster/${hostname}@REALM
+
+
+The `${username}@REALM` might not be available for the NFS-Ganesha process (or
+other filesystem services like Samba). In that case, all I/O can only be done
+through the `glusterfs/${client}@REALM` principal. This makes it impossible to
+map the principal to the user that does the I/O on the NFS-client side.
+
+To solve this problem, `COMPOUND` procedures can be used. A new `SETFSUID` and
+`SETFSGID` FOP could instruct the `COMPOUND` procedure to switch to a certain
+UID/GID. This requires trusting the Gluster-client fully, and should only be
+used as a fall-back solution when constrained delegation is not possible.
+
+
+### Username mapping
+
+Mapping of a UID to GIDs:
+
+1. Due to a protocol limitation, the number of groups sent in RPC packets is
+ limited. The bricks are already capable of resolving aux-GIDs based on the
+ UID that is sent in the RPC packet.
+1. Some way of mapping the Kerberos principal to uid/gid/aux-gids is needed.
+
+Resolving the `uid` from the User Principal Name can be done by the
+`gss_localname()` function provided by GSSAPI. The Gluster servers (brick
+processes) will need to map the Kerberos principal to a UID/GID that does the
+I/O for correct permission checking and file/directory ownership.
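+
+A minimal sketch of that brick-side resolution step, assuming the
+principal has already been mapped to a local username (e.g. by
+`gss_localname()`):
+
+    # Resolve a local username to uid/gid/aux-gids on the brick, rather
+    # than trusting whatever group list an RPC packet could carry.
+    import os
+    import pwd
+
+    def resolve_credentials(local_user):
+        ent = pwd.getpwnam(local_user)  # uid and primary gid
+        aux = os.getgrouplist(local_user, ent.pw_gid)  # all aux groups
+        return ent.pw_uid, ent.pw_gid, aux
+
+    me = pwd.getpwuid(os.getuid()).pw_name
+    print(resolve_credentials(me))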
+
+
+### RPC Protocol Changes
+
+RPC should ultimately use [RPCSEC_GSS](http://tools.ietf.org/html/rfc2203) like
+Kerberized NFS. Two issues need to be handled:
+
+* there is no lockowner in the RPC header
+* there are potential cases where the user doing the I/O can not be resolved
+  to a username with a Kerberos principal (the alternative to constrained
+  delegation)
+
+All attributes that can not be passed in the RPC header and can not be found
+out through other means will be passed as a 'fake FOP' in a COMPOUND/COMPOSITE
+procedure.
+
+ [RPC header]
+ [COMPOUND/COMPOSITE]
+        [SETFSUID] (as replacement for constrained delegation)
+ [SET_LOCKOWNER]
+ [actual FOP]
+
+The design and development of `COMPOUND`/`COMPOSITE` procedures is not part of
+the Kerberos feature; details can be found elsewhere.
+
+*TODO: add link for compound/composite design/discussion*
+
+
+## Implications on manageability
+
+Kerberos depends heavily on correct configuration of the participating servers.
+Services like DNS and time-synchronisation are a requirement for environments
+that want to use Kerberos. A central repository where users and groups are
+managed (LDAP, Active Directory, NIS, ...) is highly recommended.
+
+## Implications on presentation layer
+
+Bindings provided by the top-most xlators will need to provide an API for
+passing the options needed to configure/apply Kerberos functionalities.
+
+## Implications on persistence layer
+
+None.
+
+
+## Implications on 'GlusterFS' backend
+
+None.
+
+
+## Modification to GlusterFS metadata
+
+None.
+
+
+## Implications on 'glusterd'
+
+GlusterD will have to maintain the options for the Kerberos configuration,
+which would be similar to the current SSL implementation.
+
+The different Gluster daemons that receive connections will need to get an
+improved access control mechanism. Not all systems should be able to use the
+`glusterfs/${client}@REALM` Service Principal Name to do I/O (workaround for
+for constraint delegations). The same counts for the
+`gluster/${hostname}@REALM` Service Principal Name, which should be only
+accepted by systems in the Trusted Storage Pool.
+
+
+# How To Test
+
+The steps to configure Kerberos access to Gluster volumes would look like:
+
+1. verify that all participating systems are in DNS
+1. enable NTP or similar time-syncing between servers
+1. configure Kerberos system-wide in `/etc/krb5.conf`
+1. configure idmapping through `/etc/nsswitch.conf` (LDAP, AD, ..) and `/etc/idmapd.conf`
+1. add Kerberos TGTs to the `/etc/krb5.keytab` file
+1. enable Kerberos through GlusterD
+
+Performing I/O over a Kerberized FUSE mountpoint:
+
+1. `[root]` mount the volume, uses Kerberos TGT from `/etc/krb5.keytab`
+1. `[user]` should have a valid Kerberos TGT (obtained with `kinit`)
+1. `[user]` I/O should be permitted as normal
+1. `[user]` after invalidating the Kerberos TGT (with `kdestroy`), I/O should be denied
+
+Different ways of Kerberos usage can be inspected with
+[Wireshark](https://wireshark.org). The RPC-headers will not list the
+traditional AUTH_GLUSTERFS authentication structures, but RPCSEC_GSS with a
+readable form of the Kerberos principal instead of UID/GID values.
+
+
+# User Experience
+
+Configuration of Gluster/Kerberos should be very similar to the configuration
+of Kerberized NFS. The administrator needs to configure the participating
+servers and enable Kerberos support for the GlusterD and the Gluster Volumes.
+Users with a valid Kerberos TGT should not notice any difference while doing
+I/O.
+
+
+# Dependencies
+
+* requires functional `COMPOUND` procedures, including new procedures like
+ `SET_LOCKOWNER`
+
+
+# Documentation
+
+*TODO: Point to the pull requests for the `glusterdocs` repository.*
+
+
+# Status
+
+*Status of development - Design Ready, In development, Completed*
+
+
+# Comments and Discussion
+
+*TODO: Link to mailinglist thread(s) and the Gerrit review.*
diff --git a/under_review/Split Network.md b/under_review/Split Network.md
new file mode 100644
index 0000000..95cf944
--- /dev/null
+++ b/under_review/Split Network.md
@@ -0,0 +1,138 @@
+Goal
+----
+
+Better support for multiple networks, especially front-end vs. back-end.
+
+Summary
+-------
+
+GlusterFS generally expects that all clients and servers use a common
+set of network names and/or addresses. For many users, having a separate
+network exclusively for servers is highly desirable for both performance
+reasons (segregating administrative traffic and/or second-hop NFS
+traffic from ongoing user I/O) and security reasons (limiting
+administrative access to the private network). While such configurations
+can already be created with routing/iptables trickery, full and explicit
+support would be a great improvement.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+One proposal for the high-level syntax and semantics was made [on the
+mailing
+list](http://www.gluster.org/pipermail/gluster-users/2014-November/019463.html).
+
+Detailed Description
+--------------------
+
+At the very least, we need to be able to define and keep track of
+multiple names/addresses for the same node, one used on the back-end
+network, e.g. for "peer probe" and NFS, and the other used on the
+front-end network by native-protocol clients. The association can be
+done via the node UUID, but we still need a way for the user to specify
+which name/address is to be used for which purpose.
+
+Future enhancements could include multiple front-end (client) networks,
+and network-specific access control.
+
+Benefit to GlusterFS
+--------------------
+
+More flexible network topologies, potentially enhancing
+performance and/or security for some deployments.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The information in /var/lib/glusterd/peers/\* must be enhanced to
+include multiple names/addresses per peer, plus tags for roles
+associated with each address/name.
+
+The volfile-generation code must be enhanced to generate volfiles for
+each purpose - server, native client, NFS proxy, self-heal/rebalance -
+using the names/addresses appropriate to that purpose.
+
+### Implications on manageability
+
+CLI and GUI support must be added for viewing/changing the addresses
+associated with each server and the roles associated with each address.
+
+### Implications on presentation layer
+
+None. The changes should be transparent to users.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+See [nature of proposed change](#nature-of-proposed-change).
+
+### Implications on 'glusterd'
+
+See [nature of proposed change](#nature-of-proposed-change).
+
+How To Test
+-----------
+
+Set up a physical configuration with separate front-end and back-end
+networks.
+
+Use the new CLI/GUI features to define addresses and roles split across
+the two networks.
+
+Mount a volume using each of the several volfiles that result, and
+generate some traffic.
+
+Verify that the traffic is actually on the network appropriate to that
+mount type.
+
+User Experience
+---------------
+
+By default, nothing changes. If and only if a user wants to set up a
+more "advanced" split-network configuration, they'll have new tools
+allowing them to do that without having to "step outside" to mess with
+routing tables etc.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+New documentation will be needed at both the conceptual and detail
+levels, describing how (and why?) to set up a split-network
+configuration.
+
+Status
+------
+
+In design.
+
+Comments and Discussion
+-----------------------
+
+Some use-cases in [Bug 764850](https://bugzilla.redhat.com/764850).
+Feedback requested. Please jump in.
+
+[Discussion on gluster-devel](https://mail.corp.redhat.com/zimbra/#16)
diff --git a/under_review/caching.md b/under_review/caching.md
new file mode 100644
index 0000000..2c21c0c
--- /dev/null
+++ b/under_review/caching.md
@@ -0,0 +1,143 @@
+Goal
+----
+
+Improved performance via client-side caching.
+
+Summary
+-------
+
+GlusterFS has historically taken a very conservative approach to
+client-side caching, due to the cost and difficulty of ensuring
+consistency across a truly distributed file system. However, this has
+often led to a competitive disadvantage vs. other file systems that
+cache more aggressively. While one could argue that expecting an
+application designed for a local FS or NFS to behave the same way on a
+distributed FS is unrealistic, or question whether competitors' caching
+is really safe, this nonetheless remains one of our users' top requests.
+
+For purposes of this discussion, pre-fetching into cache is considered
+part of caching itself. However, write-behind caching (buffering) is a
+separate feature, and is not in scope.
+
+Owners
+------
+
+Xavier Hernandez <xhernandez@datalab.es>
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, waiting until summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/FS-Cache](Features/FS-Cache) is about a looser
+(non-consistent) kind of caching integrated via FUSE. This feature is
+differentiated by being fully consistent, and implemented in GlusterFS
+itself.
+
+[IMCa](http://mvapich.cse.ohio-state.edu/static/media/publications/slide/imca_icpp08.pdf)
+describes a completely external approach to caching (both data and
+metadata) with GlusterFS.
+
+Detailed Description
+--------------------
+
+Retaining data in cache on a client after it's read is trivial.
+Pre-fetching into that same cache is barely more difficult. All of the
+hard parts are on the server.
+
+- Tracking which clients still have cached copies of which data (or
+ metadata).
+
+- Issuing and waiting for invalidation requests when a client changes
+ data cached elsewhere.
+
+- Handling failures of the servers tracking client state, and of
+ communication with clients that need to be invalidated.
+
+- Doing all of this without putting performance in the toilet.
+
+Invalidating cached copies is analogous to breaking locks, so the
+async-notification and "oplock" code already being developed for
+multi-protocol (SMB3/NFS4) support can probably be used here. More
+design is probably needed around scalable/performant tracking of client
+cache state by servers.
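+
+A toy model of that server-side bookkeeping - track which clients hold
+a cached copy, invalidate the others before a conflicting write - looks
+roughly like this (illustration only, not a design):
+
+    # Track holders of cached copies per object; before a write from
+    # one client proceeds, invalidate every other holder. Handling
+    # failed/unreachable holders is the hard part, elided here.
+    from collections import defaultdict
+
+    class CacheTracker:
+        def __init__(self):
+            self.holders = defaultdict(set)  # object -> caching clients
+
+        def on_read(self, client, obj):
+            self.holders[obj].add(client)  # client now caches obj
+
+        def on_write(self, client, obj, send_invalidate):
+            for other in self.holders[obj] - {client}:
+                send_invalidate(other, obj)  # must wait for the ack
+            self.holders[obj] = {client}
+
+    t = CacheTracker()
+    t.on_read("clientA", "gfid-1234")
+    t.on_write("clientB", "gfid-1234",
+               lambda c, o: print("invalidate", o, "on", c))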
+
+Benefit to GlusterFS
+--------------------
+
+Much better performance for cache-friendly workloads.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Some of the existing "performance" translators could be replaced by a
+single client-caching translator. There will also need to be a
+server-side helper translator to track client cache states and issue
+invalidation requests at the appropriate times. Such asynchronous
+(server-initiated) requests probably require transport changes, and
+[GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) might play a part as
+well.
+
+### Implications on manageability
+
+New commands will be needed to set cache parameters, force cache
+flushes, etc.
+
+### Implications on presentation layer
+
+None, except for integration with the same async/oplock infrastructure
+as used separately in SMB and NFS.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+We will likely need some sort of database associated with each brick to
+maintain information about cache states.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+We'll need new tests to verify that invalidations are in fact occurring,
+that we can't read stale/inconsistent data despite the increased caching
+on clients, etc.
+
+User Experience
+---------------
+
+See "implications on manageability" section.
+
+Dependencies
+------------
+
+Async-notification and oplock code from the Samba team.
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Design in private review, hopefully available for public review soon.
+
+Comments and Discussion
+-----------------------
diff --git a/under_review/code-generation.md b/under_review/code-generation.md
new file mode 100644
index 0000000..5c25a13
--- /dev/null
+++ b/under_review/code-generation.md
@@ -0,0 +1,143 @@
+Goal
+----
+
+Reduce internal duplication of code by generating from templates.
+
+Summary
+-------
+
+The translator calling convention is based on long lists of
+operation-specific arguments instead of a common "control block"
+struct/union. As a result, many parts of our code are highly repetitive
+both internally and with respect to one another. As an example of
+internal redundancy, consider how many of the functions in
+[defaults.c](https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/defaults.c)
+look similar. As an example of external redundancy, consider how the
+[patch to add GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) has to
+make parallel changes to 17 files - defaults, stubs, syncops, RPC,
+io-threads, and so on. All of this duplication slows development of new
+features, and creates huge potential for errors as definitions that need
+to match don't. Indeed, during development of a code generator for NSR,
+several such inconsistencies have already been found.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+Code generation was already used successfully in the first generation of
+[NSR](../GlusterFS 3.6/New Style Replication.md) and will continue to be
+used in the second.
+
+Detailed Description
+--------------------
+
+See Summary section above.
+
+Benefit to GlusterFS
+--------------------
+
+- Fewer bugs from inconsistencies between how similar operations are
+ handled within one translator, or how a single operation is handled
+ across many.
+
+- Greater ease of adding new operation types, or new translators which
+ implement similar functionality for many operations.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The code-generation infrastructure itself consists of three parts:
+
+- A list of operations and their associated arguments (both original
+ and callback, with types).
+
+- A script to combine this list with a template to do the actual
+ generation.
+
+- Modifications to makefiles etc. to do generation during a build.
+
+The first and easiest target is auto-generated default functions. Stubs
+and syncops could follow pretty quickly. Other possibilities include:
+
+- GFAPI (both C and Python)
+
+- glupy
+
+- RPC (replace rpcgen?)
+
+- io-threads
+
+- changelog (the [full-data-logging
+ translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging)
+ on the forge already uses this technique)
+
+Even something as complicated as AFR/NSR/EC could use code generation to
+handle quorum checks more consistently, wrap operations in transactions,
+and so on. NSR already does; the others could.
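+
+A stripped-down sketch of the generator idea - an operation table plus
+a template, with the emitted C being schematic rather than the exact
+defaults.c content:
+
+    # Table-driven generation of default fop functions (sketch).
+    # The real fop list and template would live in separate files.
+    FOPS = {
+        "open": [("loc_t *", "loc"), ("int32_t", "flags"),
+                 ("fd_t *", "fd")],
+        "flush": [("fd_t *", "fd")],
+    }
+
+    TEMPLATE = """\
+    int32_t
+    default_{name} (call_frame_t *frame, xlator_t *this, {args})
+    {{
+            STACK_WIND_TAIL (frame, FIRST_CHILD (this),
+                             FIRST_CHILD (this)->fops->{name},
+                             {argnames});
+            return 0;
+    }}
+    """
+
+    for name, args in FOPS.items():
+        arglist = ", ".join("%s%s" % (t, n) for t, n in args)
+        argnames = ", ".join(n for _, n in args)
+        print(TEMPLATE.format(name=name, args=arglist,
+                              argnames=argnames))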
+
+### Implications on manageability
+
+None.
+
+### Implications on presentation layer
+
+None.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+This change is not intended to introduce any change visible except to
+developers. Standard regression tests should be sufficient to verify
+that no such change has occurred.
+
+User Experience
+---------------
+
+None.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+Developer documentation should explain the format of the fop-description
+and template files. In particular developers need to know what variables
+are available for use in templates, and how to add new ones.
+
+Status
+------
+
+Patch available to generate default functions. Others to follow.
+
+Comments and Discussion
+-----------------------
diff --git a/under_review/composite-operations.md b/under_review/composite-operations.md
new file mode 100644
index 0000000..5cc29b4
--- /dev/null
+++ b/under_review/composite-operations.md
@@ -0,0 +1,438 @@
+Feature
+-------
+
+Composite operations is an umbrella term for eliminating round trips
+through a variety of techniques, some of them borrowed - in spirit at
+least - from the NFS and SMB protocols.
+
+Why do we need this? All too frequently we encounter situations where
+Gluster performance is an order of magnitude or even two orders of
+magnitude slower than NFS or SMB to a local filesystem, particularly for
+small-file and metadata-intensive workloads (example: file browsing).
+You can argue that Gluster provides more functionality, so it should be
+slower, but we need to close the gap -- if Gluster was half the speed of
+NFS and provided much greater functionality plus scalability, users
+would be ok with some performance tradeoff.
+
+What is the root cause? Response time of Gluster APIs is much higher
+than response time of other protocols. A simple protocol trace can show
+you a root cause for this: excessive round-trips.
+
+There are several dimensions to this:
+
+- operations that require lookups on every brick (covered elsewhere)
+- excessive one-at-a-time access to xattrs and ACLs
+- client responsible for maintaining filesystem state instead of
+ server
+- SMB: case-insensitivity of Windows = no direct lookup by filename on
+ brick
+
+Summary
+-------
+
+An example of previous success: eager-lock. When Gluster was first
+acquired by Red Hat and testing with 10-GbE interfaces began, we quickly
+noticed that sequential write performance was not what we expected. The
+Gluster protocol required a 5-step sequence for every write from client
+to server(s), in order to maintain consistency between replicas, loosely
+paraphrased here:
+
+- lock-replica-inode
+- pre-op (mark replicas dirty)
+- write
+- post-op
+- unlock-replica-inode
+
+The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow
+the client to hang onto the lock, and the post-op for the previous write
+was combined with the pre-op for the current write and the actual write
+request, so that instead of 5 RPCs per write we got down to ONE RPC per
+write, and write performance improved significantly (how much TBS)
+
+Owners
+------
+
+TBS
+
+Current status
+--------------
+
+Some of the problems with round trips stem from lack of scalability in
+DHT protocol, and attributes of AFR protocol.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu
+- [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT
+- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md) - client no longer does replication
+
+*Note : search RHS buglist for small-file-related performance bugs and directory browsing performance bugs, I haven't done that yet, there are a LOT of them*
+
+Detailed Description
+--------------------
+
+Here are the proposals:
+
+- READDIRPLUS generalization
+- lockless-CREATE
+- CREATE-AND-WRITE - allow CREATE op to transmit data and metadata
+ also
+- case-insensitivity feature - removes perf. penalty for SMB
+
+### READDIRPLUS used to prefetch xattrs
+
+Recent correction: for SMB and other protocols that have additional
+security metadata, READDIRPLUS can be used more effectively to prefetch
+xattr data, such as ACLs and Windows-specific security info. However,
+upper layers have to make use of this feature. We treat ACLs as a
+special case of an extended attribute, since ACLs are not currently
+returned by READDIRPLUS (can someone confirm this?). The current RPC
+request and response structures are in
+[gfs3\_readdirp\_{req,rsp}](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x).
+The request structure's "dict" field can contain a list of extended
+attribute IDs (or names, not sure which).
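+
+Schematically, the intent is a single round trip that returns entries,
+stat data, and a requested subset of xattrs (a sketch - the names below
+are not the real gfs3\_readdirp fields):
+
+    # Sketch: readdirplus that also names the xattrs to prefetch, so one
+    # round trip returns entry + stat + security metadata per file.
+    def readdirplus(bricks, directory, want_xattrs):
+        reply = []
+        for name, (stat, xattrs) in bricks[directory].items():
+            wanted = {k: v for k, v in xattrs.items()
+                      if k in want_xattrs}
+            reply.append((name, stat, wanted))
+        return reply
+
+    bricks = {"/dir": {"a.txt": ({"size": 10},
+                                 {"system.posix_acl_access": b"...",
+                                  "security.NTACL": b"..."})}}
+    for entry in readdirplus(bricks, "/dir",
+                             {"system.posix_acl_access",
+                              "security.NTACL"}):
+        print(entry)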
+
+However, once these xattrs are prefetched, will md-cache translator in
+the client be able to hang onto them to prevent round-trips to the
+server? Is there any additional invalidation needed for the expanded
+role of md-cache?
+
+### eager-lock for directories
+
+This extension doesn't seem to impact APIs at all, but it does require a
+way to safely do a CREATE FOP that will either appear on all replicas or
+none (or allow self-healing to repair the difference in the directories
+in the correct way).
+
+If we have an NSR translator, this seems pretty straightforward. NSR
+only allows the client to talk to the "leader" server in the replica
+host set, and the leader then takes responsibility for propagating the
+change.
+
+With AFR, the situation is very different. In order to guarantee that a
+CREATE will succeed on all AFR subvolumes, the client must write-lock
+the parent directory. Otherwise some other client could create the same
+file at the same time on some but not all of the AFR subvolumes.
+
+But why unlock? Chances are that any immediate subsequent file create in
+that directory will be coming from the same client, so it makes sense
+for the client to hang onto the write lock for a short while, unless
+some other client wants it. This optimistic lock behavior is similar to
+the "eager-lock" feature in the AFR translator today. Doing this saves
+us not only the need to do a LOOKUP prior to CREATE, but also saves us
+the need to do a directory unlock per file!
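+
+In sketch form (a model of the optimistic behavior, not AFR's actual
+implementation), the client-side logic is:
+
+    # Keep the parent-directory write lock across consecutive creates;
+    # release it on contention or when a short hold timer expires.
+    import time
+
+    class EagerDirLock:
+        def __init__(self, hold_secs=0.5):
+            self.held_dir = None
+            self.last_use = 0.0
+            self.hold_secs = hold_secs
+
+        def create(self, directory, name):
+            if self.held_dir != directory:
+                self.release()
+                self.held_dir = directory  # one lock RPC...
+            self.last_use = time.monotonic()
+            return (directory, name)  # ...amortized over many creates
+
+        def maybe_release(self):
+            # Called on contention, or periodically from a timer.
+            if (self.held_dir and
+                    time.monotonic() - self.last_use > self.hold_secs):
+                self.release()
+
+        def release(self):
+            self.held_dir = None  # one unlock RPC for the whole batch
+
+    lk = EagerDirLock()
+    for i in range(100):
+        lk.create("/dir", "file%d" % i)  # only the first create locks
+    lk.maybe_release()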
+
+### CREATE-AND-WRITE
+
+This extension is similar to quick-read, where the OPEN FOP can return
+the file data if it's small enough. This extension adds the following
+features to the CREATE FOP:
+
+- optionally specify xattrs to associate with the file when it's
+  created
+- optionally specify write data (if it fits in 1 RPC)
+- optionally close the file (what RELEASE does today)
+- optionally fsync the file (for apps that require file persistence,
+  such as Swift)
+
+This option is also similar to what the librados (Ceph) API allows users to
+do today, see [Ioctx.write\_full in librados python
+binding](http://ceph.com/docs/master/rados/api/python/#writing-reading-and-removing-objects)
+
+This avoids the need for the round-trip sequence:
+
+- lock inode for write
+- create
+- write
+- flush(directory)
+- set-xattr[1]
+- set-xattr[2]
+- ...
+- set-xattr[N]
+- release
+- unlock inode
+
+The existing protocol structure is in [structure
+gfs3\_create\_req](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x)
+. We would allocate reserved bits from the "flags" field for the
+optional extensions. The xdata field in the request would contain a
+tagged sequence containing the optional parameter values.
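+
+In API terms the combined operation might be invoked roughly as below.
+The entry point and flag names are made up - as noted under Scope,
+glfs\_creat has no such parameters today:
+
+    # Hypothetical composite create: one call (one RPC) carrying open
+    # flags, initial data, xattrs, and fsync/close behavior.
+    CREATE_FSYNC = 0x1  # reserved bits from the CREATE "flags" field
+    CREATE_CLOSE = 0x2
+
+    def create_and_write(volume, path, data, xattrs, flags):
+        # All of this would travel in a single RPC (xdata carrying the
+        # tagged xattr/data sequence) instead of the lock/create/write/
+        # setxattr.../release/unlock round trips listed above.
+        volume[path] = {"data": data, "xattrs": dict(xattrs)}
+        if flags & CREATE_FSYNC:
+            pass  # server would fsync before replying
+        return None if flags & CREATE_CLOSE else path
+
+    vol = {}
+    create_and_write(vol, "/obj123", b"small file body",
+                     {"user.swift.metadata": b"{}"},
+                     CREATE_FSYNC | CREATE_CLOSE)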
+
+### case-insensitive volume support
+
+The SMB protocol bridges a divide between an operating system, Windows,
+that supports case-insensitive lookup, and an operating system, Linux
+(POSIX), that supports only case-sensitive lookup inside Gluster bricks.
+If nothing is done to bridge this gap, file lookup and creation become
+very expensive in large directories (a few thousand files in size):
+
+- on CREATE, the client has to search the entire directory to
+ determine whether some other file with the same name (but a
+ different case mix) already exists. This requires locking the
+ directory. Furthermore, consistent hashing, which knows nothing
+ about case mix, can not predict which brick might contain the file,
+ since it might have been created with a different case mix. This is
+ a SCALABILITY issue.
+
+- on LOOKUP, the client has to search all bricks for the filename
+ since there is in general no way to predict which brick the
+ case-altered version of the filename might have hashed to. This is a
+ SCALABILITY issue. The entire contents of the directory on each
+ brick must be searched as well.
+
+- SMB does support "case-sensitive yes" smb.conf configuration option,
+ but this is user-hostile since Windows does not understand it.
+
+What happens when a Linux user-mode process such as glusterfsd (brick)
+tries to do a case-insensitive lookup on a filename using a local
+filesystem? XFS has a feature for this, but Gluster can't assume XFS,
+and VFS-level support for case-insensitivity is not going to happen.
+You can do a readdir on the directory and scan for the case-insensitive
+match, but filling a directory that way is O(N\^2), where N is the
+number of files you place into the directory.
+
+**Proposal**: only use lower-case filenames (or upper-case, it doesn't
+matter) at the brick filesystem, and record the original case mix (how
+the user specified the filename at create/rename time) in an xattr, call
+it 'original-case'.
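+
+A sketch of the proposed brick-side behavior, with Python's casefold()
+standing in for whatever locale-aware folding is chosen (see the locale
+issue below):
+
+    # Store one canonical (folded) name per file on the brick, keep the
+    # user's case mix in an "original-case" xattr, and look up by the
+    # folded name - no directory scan needed.
+    class Brick:
+        def __init__(self):
+            self.files = {}  # folded name -> (original case, data)
+
+        def create(self, name, data):
+            key = name.casefold()
+            if key in self.files:
+                raise FileExistsError(name)  # "a" then "A" -> EEXIST
+            self.files[key] = (name, data)   # xattr keeps original case
+
+        def lookup(self, name):
+            original, data = self.files[name.casefold()]
+            return original, data
+
+    b = Brick()
+    b.create("ReadMe.TXT", b"hi")
+    print(b.lookup("readme.txt"))  # ('ReadMe.TXT', b'hi')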
+
+**Issue**: (from Ira Cooper): what locales would be supported? SMB
+already had to deal with this.
+
+We could define a 'case-insensitive' volume parameter (default off), so
+that users who have no SMB clients do not experience this change in
+behavior.
+
+This mapping to lower-case filenames has to happen at or above DHT layer
+to avoid the scalability issue above. If this is not done by DHT (if it
+is done in VFS-gluster SMB plugin for example), then Gluster clients on
+a POSIX filesystem will not see the same filenames as Windows users, and
+this will lead to confusion.
+
+However, this has consequences for sharing files between SMB and
+non-SMB clients - a non-SMB client will pay the performance penalty for
+case-insensitivity and will see case-insensitive behavior that is not
+strictly POSIX-compliant - for example, if I create file "a" and then
+file "A" in the same directory, the 2nd create will get EEXIST. That's
+the price you pay for having the two kinds of clients accessing the same
+volume - the most restricted client has to win.
+
+Changes required to DHT or equivalent:
+
+- READDIR(PLUS): report filenames as the user expects to see them,
+ using the original-case xattr. see above READDIRPLUS enhancement for
+ how this can be done efficiently.
+- CREATE (or RENAME): map the filename within the brick to lower case
+  before creating, and record the original case mix using the
+  original-case xattr. See the CREATE-AND-WRITE enhancement above for how
+  this can be done efficiently.
+- LOOKUP: map the filename to lower case before attempting a lookup on
+ the brick.
+- RENAME: To prevent loss of file during a client-side crash, first
+ delete the case-mix xattr, then do the rename, then re-add the
+ case-mix xattr. If the case-mix xattr is not present, then the
+ lower-case filename is returned by READDIR(PLUS) but the file is not
+ lost.
+
+Since existing SMB users may want to take advantage of this change, we
+need a process for converting a Gluster volume to support
+case-insensitivity:
+
+- optional - use "find /your/brick/directory -not -type d -a -not
+ -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' |
+ sort " command in parallel on every brick, and do sort -merge of
+ per-brick outputs followed by "uniq -d" to quickly determine if
+ there are case-insensitivity collisions on existing volume. This
+ would let user resolve such conflicts ahead of time without taking
+ down the volume.
+- shut down the volume
+- run a script on all bricks in parallel to convert it to
+ case-insensitive format - very fast because it runs on a local fs.
+ - rename the brick file to lower case and store an xattr with
+ original case.
+- turn volume lookup-unhashed to ON because files will not yet be on
+ the right brick.
+- set volume into case-insensitive state
+- start volume - it is now online but not in efficient state
+- rebalance (get DHT to place the files where they belong)
+ - If rebalance uncovers case-insensitive filename collisions (very
+ unlikely), the 2nd file is renamed to its original case-mix with
+ string 'case-collision-gfid' + hex gfid appended, and a counter
+ is incremented. A simple "find" command at each brick in
+ parallel executed with pdsh can locate all instances of such
+ files - the user then has to decide what they want to do with
+ them.
+- reset lookup-unhashed to default (auto)
+
+Benefit to GlusterFS
+--------------------
+
+- READDIRPLUS optimizations could completely solve the performance
+ problems with file browsing in large directories, at least to the
+ point where Gluster performs similarly to NFS and SMB in general and
+ can't be blamed. (DHT v2 could also improve performance by not
+ requiring round trips to every brick to retrieve a directory).
+
+- lockless-CREATE - can improve small-file create performance
+ significantly by condensing 4 round-trips into 1. Small-file create
+ is the worst-performing feature in Gluster today. However, it won't
+ solve small-file create problems until we address other areas below.
+
+- CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe
+ more) are combined into 1 round trip.
+
+The performance benefit increases as the Gluster client round-trip time
+to the servers increases. For example, these enhancements could make
+possible use of Gluster protocol over a WAN.
+
+Scope
+-----
+
+Still unsure. This impacts libgfapi - if we want applications to take
+advantage of these enhancements, we need to expose these APIs to
+applications somehow, and POSIX does not allow them AFAIK.
+
+CREATE-AND-WRITE impacts the translator interface. Translators must be
+able to pass down:
+
+- a list of xattr values (which translators in the stack can append
+ to).
+- a data buffer
+- flags to request optionally that file be fsynced and/or closed.
+
+The [fop\_create\_t
+params](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/libglusterfs/src/xlator.h)
+have both a "flags" parameter and a "xdata" parameter; this last
+parameter could be used to pass both data and xattrs in a tagged
+sequence format (not sure whether **dict\_t** supports this).
+
+### Nature of proposed change
+
+The Gluster code might need refactoring in create-related code to
+maximize common code between existing implementation, which won't go
+away, and the new implementation of these FOPS.
+
+However, I suspect that READDIRPLUS extensions may be possible to insert
+without disrupting existing code that much, may need some help on this
+one.
+
+### Implications on manageability
+
+The gluster volume profile command will have to be extended to get
+support for the new CREATE FOP if this is how we choose to implement it.
+
+These changes should otherwise be somewhat transparent to the
+management layer.
+
+### Implications on presentation layer
+
+Swift-on-file Gluster-specific code would have to change to take
+advantage of this feature.
+
+NFS and SMB would have to change to exploit new features to reduce
+SMB-specific xattr and ACL access.
+
+The
+[libgfapi](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/api/src/glfs.h)
+implementation would have to expose these features.
+
+- **glfs\_readdirplus\_r** - it's not clear that struct dirent would
+ be able to handle xattrs, and there is no place to specify which
+ extended attributes we are interested in.
+- **glfs\_creat** - has no parameters to support xattrs or write data.
+ So we'd need a new entry point to do this.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None. We are repackaging how data gets passed in protocol, not what it
+means.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+We have programs that can generate metadata-intensive workloads, such as
+smallfile benchmark or fio. For smallfile creates, we can use a modified
+version of the [parallel libgfapi
+benchmark](https://github.com/bengland2/parallel-libgfapi) (don't worry,
+I know the developer ;-) to verify that the response time for the new
+create-and-write API is better than before, or to verify that
+lockless-create improves response time.
+
+In the case of readdirplus extensions, we can test with simple libgfapi
+program coupled with a protocol trace or gluster volume profile output
+to see if it's working and has desired decrease in response time.
+
+User Experience
+---------------
+
+The impact of this operation should be functionally transparent to the
+end-user, but it should significantly improve Gluster performance to the
+point where throughput and response time are reasonably close (not
+equal) to NFS, SMB, etc on local filesystems. This is particularly true
+for small-file operations and directory browsing/listing.
+
+Dependencies
+------------
+
+This change will have significant impact on translators; it is not easy.
+Because this is a non-trivial change, an incremental approach should be
+specified and followed, with each stage committed and regression tested
+separately. For example, we could break CREATE-and-WRITE proposal into 4
+pieces:
+
+- add libgfapi support, with ENOSUPPORT returned for unimplemented
+ features
+- add list of xattrs written at create time.
+- add write data
+- add close and fsync options
+
+Documentation
+-------------
+
+How do we document RPC protocol changes? For now, I'll try to use the
+IDL .x file or whatever specifies the RPC itself.
+
+Status
+------
+
+Not designed yet.
+
+Comments and Discussion
+-----------------------
+
+### Jeff Darcy 16:20, 3 December 2014
+
+"SMB: case-insensitivity of Windows = no direct lookup by filename on
+brick"
+
+We did actually come up with a way to do the case-preserving and
+case-squashing lookups simultaneously before falling back to the global
+lookup, but AFAIK it's not implemented.
+
+READDIRPLUS extension: md-cache actually does pre-fetch some attributes
+associated with (Linux) ACLs and SELinux. Maybe it just needs to
+pre-fetch some others for SMB? Also, fetching into glusterfs process
+memory doesn't save us the context switch. For that we need dentry
+injection (or something like it) so that the information is available in
+the kernel by the time the user asks for it.
+
+"glfs\_creat - has no parameters to support xattrs"
+
+These are being added already because NSR reconciliation needs them (for
+many other calls already).
diff --git a/under_review/dht-scalability.md b/under_review/dht-scalability.md
new file mode 100644
index 0000000..83ef255
--- /dev/null
+++ b/under_review/dht-scalability.md
@@ -0,0 +1,171 @@
+Goal
+----
+
+More scalable DHT translator.
+
+Summary
+-------
+
+Current DHT inhibits scalability by requiring that directories be on all
+subvolumes. In addition to the extra message traffic this incurs during
+*mkdir*, it adds significant complexity keeping all of the directories
+consistent across operations like *create* and *rename*. What is
+proposed, in a nutshell, is that directories should only exist on one
+subvolume, which might contain "stubs" pointing to files and directories
+that can be accessed by GFID on the same or other subvolumes. In concert
+with this, the way we store layout information needs to change, so that
+at least the "fix-layout" part of rebalancing need not involve
+traversing every directory on every subvolume.
+
+Owners
+------
+
+Shyam Ranganathan <srangana@redhat.com>
+
+Raghavendra Gowdappa <rgowdapp@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/thousand-node-glusterd](../GlusterFS 3.6/Thousand Node Gluster.md)
+will define new ways of managing maintenance activities, some of which
+are DHT-related. Also, the new "mon cluster" might be where we store
+layout information.
+
+[Features/data-classification](../GlusterFS 3.7/Data Classification.md)
+also affects layout storage and use.
+
+Detailed Description
+--------------------
+
+Under this scheme, path-based lookup becomes very different. Currently,
+we look up a path on the file's "hashed" subvol first (according to
+parent-directory layout and file GFID). If it's not there, we need to
+look elsewhere - in the worst case on **all** subvolumes. In the future,
+our first lookup should be at the parent directory's subvolume. If the
+file is not there, it's not linked anywhere (though it might still exist
+unlinked and accessible by GFID) and we can terminate immediately. If it
+is there, then that single copy of the parent directory will contain a
+"stub" giving the file's GFID and a hint for what subvolume it's on
+(similar to a current linkfile). That information can then be used to
+initiate a GFID-based lookup. Many other code paths, such as *rename*,
+can leverage this new infrastructure to avoid current problems with
+multiple directory entries and linkfiles all for the same actual file.
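+
+As a sketch of the new lookup path (data structures illustrative only):
+
+    # One copy of each directory, holding stubs that map a name to a
+    # GFID plus a subvolume hint (like today's linkfile, but the only
+    # directory entry there is).
+    directories = {
+        "gfid-of-parent": {"file.txt": ("gfid-of-file", "subvol-3")},
+    }
+    gfid_store = {("subvol-3", "gfid-of-file"): b"contents"}
+
+    def lookup(parent_gfid, name):
+        stub = directories[parent_gfid].get(name)
+        if stub is None:
+            return None  # not linked anywhere: stop immediately, no
+                         # need to ask every subvolume
+        gfid, hint = stub
+        return gfid_store[(hint, gfid)]  # GFID-based lookup, hint first
+
+    print(lookup("gfid-of-parent", "file.txt"))
+    print(lookup("gfid-of-parent", "no-such-file"))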
+
+A possible enhancement would be to include more information in stubs,
+allowing readdirp to operate only on the directory and avoid going to
+every subvolume for information about individual files. Also, some
+secondary issues such as hard links and garbage collection (of unlinked
+but still open files) remain TBD in the final design.
+
+With respect to layout storage, the basic idea is to store a fairly
+small number of actual layouts - default, user defined, or related to
+data classification - that are each shared across many directories.
+These layouts are stored as part of our configuration, and the xattrs on
+individual directories need only specify a shared layout ID (plus
+possibly some additional "tweak" parameters) instead of a full explicit
+layout. When we do any kind of rebalancing, we need only change the
+shared layouts and not the pointers on each directory.
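+
+Sketched out (again, purely illustrative), the indirection looks like:
+
+    # Shared layouts live once in configuration; a directory's xattrs
+    # carry only a layout ID. Rebalance edits the shared entry instead
+    # of rewriting an xattr on every directory.
+    layouts = {"default-v1": [("subvol-0", 0x00000000, 0x7fffffff),
+                              ("subvol-1", 0x80000000, 0xffffffff)]}
+    dir_xattrs = {"gfid-of-dir": {"layout-id": "default-v1"}}
+
+    def layout_for(dir_gfid):
+        return layouts[dir_xattrs[dir_gfid]["layout-id"]]
+
+    # "fix-layout" becomes one config update, not a full tree walk:
+    layouts["default-v1"] = [("subvol-0", 0x00000000, 0x55555555),
+                             ("subvol-1", 0x55555556, 0xaaaaaaaa),
+                             ("subvol-2", 0xaaaaaaab, 0xffffffff)]
+    print(layout_for("gfid-of-dir"))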
+
+Benefit to GlusterFS
+--------------------
+
+Improved scalability and performance for all directory-entry operations.
+
+Improved reliability for conflicting directory-entry operations, and for
+layout repair.
+
+Almost instantaneous "fix-layout" completion.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Due to the complexity of the changes involved, this will probably be a
+new translator developed using a similar model to that used for AFR2.
+While it's likely to share/borrow a significant amount of code from
+current DHT, the new version will go through most of its development
+lifecycle separately and then completely supplant the old version, as
+opposed to integrating individual changes. For testing of
+compatibility/migration, it is probably desirable for both versions to
+coexist in the source tree and packages, but not necessarily in the same
+process.
+
+### Implications on manageability
+
+New/different options, but otherwise no change.
+
+### Implications on presentation layer
+
+No change. At this level the new DHT translator should be a plug-in
+replacement for the old one.
+
+### Implications on persistence layer
+
+None, unless you count reduced xattr usage.
+
+### Implications on 'GlusterFS' backend
+
+This will fundamentally change the directory structure on our back end.
+A file that is currently visible within a brick as \$BRICK\_ROOT/a/b/c
+might now be visible only as \$GFID\_FOR\_B/c with neither of the parent
+directories even present on that brick. Even that "file" will actually
+be a stub containing only the file's own GFID; to get the contents, one
+would need to look up that GFID in .glusterfs on this or some other
+brick.
+
+Linkfiles will be gone, also subsumed by stubs.
+
+### Modification to GlusterFS metadata
+
+Explicit layouts will be replaced by IDs for shared layouts (in config
+storage).
+
+### Implications on 'glusterd'
+
+Minimal changes (mostly volfile generation).
+
+How To Test
+-----------
+
+Most existing DHT tests should suffice, except for those that depend on
+the details of how layouts are stored and formatted. Those will have to
+be modified to go through the extra layer of indirection to where the
+actual layouts live.
+
+User Experience
+---------------
+
+None, except for better performance and less lost data.
+
+Dependencies
+------------
+
+See "related features" section.
+
+Documentation
+-------------
+
+TBD. There should be very little at the user level, though hopefully
+this time we'll do better at documenting the things developers
+(including add-on tool developers) need to know.
+
+Status
+------
+
+Design in progress
+
+The design and some notes can be found here:
+<https://drive.google.com/open?id=15_TOW9jwzW4griAmk-rqg2cWF-LHiR_TJ8Jn0vOvYpU&authuser=0>
+
+The gluster-devel thread opening this up for discussion is here:
+<https://www.mail-archive.com/gluster-devel%40gluster.org/msg03036.html>
+
+Comments and Discussion
+-----------------------
diff --git a/under_review/index.md b/under_review/index.md
new file mode 100644
index 0000000..0a3d47d
--- /dev/null
+++ b/under_review/index.md
@@ -0,0 +1,82 @@
+GlusterFS 4.0 Release Planning
+------------------------------
+
+Tentative Dates:
+
+Feature proposal for GlusterFS 4.0
+----------------------------------
+
+This list has been seeded with features from <http://goo.gl/QyjfxM>,
+which provides some rationale and context. Feel free to add more. Some
+of the individual feature pages are still incomplete, but should be
+completed before voting on the final 4.0 feature set.
+
+### Node Scaling Features
+
+- [Features/thousand-node-glusterd](../GlusterFS 3.6/Thousand Node Gluster.md):
+ Glusterd changes for higher scale.
+
+- [Features/dht-scalability](./dht-scalability.md):
+ a.k.a. DHT2
+
+- [Features/sharding-xlator](../GlusterFS 3.7/Sharding xlator.md):
+ Replacement for striping.
+
+- [Features/caching](./caching.md): Client-side caching, with coherency support.
+
+### Technology Scaling Features
+
+- [Features/data-classification](../GlusterFS 3.7/Data Classification.md):
+ Tiering, compliance, and more.
+
+- [Features/SplitNetwork](./Split Network.md):
+ Support for public/private (or other multiple) networks.
+
+- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md):
+ Log-based, chain replication.
+
+- [Features/better-brick-mgmt](./Better Brick Mgmt.md):
+ Flexible resource allocation + daemon infrastructure to handle
+ (many) more bricks
+
+- [Features/compression-dedup](./Compression Dedup.md):
+ Compression and/or deduplication
+
+### Small File Performance Features
+
+- [Features/composite-operations](./composite-operations.md):
+ Reducing round trips by wrapping multiple ops in one message.
+
+- [Features/stat-xattr-cache](./stat-xattr-cache.md):
+ Caching stat/xattr information in (user-space) server memory.
+
+### Technical Debt Reduction
+
+- [Features/code-generation](./code-generation.md):
+ Code generation
+
+- [Features/volgen-rewrite](./volgen-rewrite.md):
+ Technical-debt reduction
+
+### Other Features
+
+- [Features/rest-api](../GlusterFS 3.7/rest-api.md):
+ Fully generic API sufficient to support all CLI operations.
+
+- Features/mgmt-plugins:
+ No more patching glusterd for every new feature.
+
+- Features/perf-monitoring:
+ Always-on performance monitoring and hotspot identification.
+
+Proposing New Features
+----------------------
+
+[New Feature Template](../Feature Template.md)
+
+Use the template to create a new feature page, and then link to it from the "Feature Proposals" section above.
+
+Release Criteria
+----------------
+
+- TBD
diff --git a/under_review/lockdep.md b/under_review/lockdep.md
new file mode 100644
index 0000000..29b4888
--- /dev/null
+++ b/under_review/lockdep.md
@@ -0,0 +1,101 @@
+Feature
+-------
+Lockdep - runtime lock validator
+
+Summary
+-------
+
+Lockdep is, at its core, a "lock dependency correctness validator". It observes and maps locking rules as they occur dynamically, i.e., it keeps track of the dependencies between various locks (in a graph-like data structure) at runtime. Whenever a new lock is about to be taken, the lockdep subsystem "validates" the implied locking rule against the set of existing rules (which are learnt over time as the system is used). If this rule is "inconsistent" with the existing set, a probable deadlock is detected and logged. A successful validation "adds" the new rule and things move forward.
+
+Owners
+------
+
+Venky Shankar <vshankar@redhat.com>
+
+Current status
+--------------
+
+Feature proposed.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+TBD
+
+Detailed Description
+--------------------
+
+Lockdep helps catch lock-related deadlocks long before they are actually hit. As a codebase grows over time, it naturally accumulates many "interdependent" locks, and it becomes hard to define (and follow) locking orders. Lockdep ensures that such cases are caught before they are encountered in real life, e.g.
+
+ Thread 1: L1 -> L2
+ Thread 2: L2 -> L1
+
+The above example would surely deadlock in no time. These are probably the easier ones. Much nastier ones include grabbing a lock in a signal handler while the main thread (or any other) already holds that lock (this is similar to acquiring a lock in an interrupt handler for a given CPU while a task running on _that_ CPU already holds it). Such cases are also caught by lockdep.
+
+Benefit to GlusterFS
+--------------------
+
+Who doesn't want to be free from deadlocks :-)
+
+Lockdep would be disabled by default; compiling with -DUSE_LOCKDEP would transparently enable it.
+
+Scope
+-----
+
+#### Nature of proposed change
+
+Possibly adding a wrapper to the GlusterFS locking macros and maintaining a graph of locking rules; for reference, see kernel/locking in the Linux kernel source tree.
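+
+A hedged sketch of such a wrapper, assuming pthread mutexes; the
+gf_lockdep_* hooks and the rule graph behind them are hypothetical:
+
+    #include <pthread.h>
+
+    extern void gf_lockdep_acquire (void *lock, const char *file, int line);
+    extern void gf_lockdep_release (void *lock);
+
+    #ifdef USE_LOCKDEP
+    /* Record/validate the ordering rule before actually blocking. */
+    #define LOCK(m)   do { gf_lockdep_acquire (m, __FILE__, __LINE__); \
+                           pthread_mutex_lock (m); } while (0)
+    #define UNLOCK(m) do { pthread_mutex_unlock (m);                   \
+                           gf_lockdep_release (m); } while (0)
+    #else
+    #define LOCK(m)   pthread_mutex_lock (m)
+    #define UNLOCK(m) pthread_mutex_unlock (m)
+    #endif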
+
+#### Implications on manageability
+
+None.
+
+#### Implications on presentation layer
+
+None.
+
+#### Implications on persistence layer
+
+None.
+
+#### Implications on 'GlusterFS' backend
+
+None.
+
+#### Modification to GlusterFS metadata
+
+None.
+
+#### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+Enable lockdep at compile time by passing -DUSE_LOCKDEP in CFLAGS when running configure, then run the Gluster smoke/regression test suites.
+
+User Experience
+---------------
+
+Nothing for the end user, but immensely helpful for developers.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+TBD.
+
+Status
+------
+
+Design in progress.
+
+Comments and Discussion
+-----------------------
+
+More than welcome :-)
diff --git a/under_review/stat-xattr-cache.md b/under_review/stat-xattr-cache.md
new file mode 100644
index 0000000..e00399d
--- /dev/null
+++ b/under_review/stat-xattr-cache.md
@@ -0,0 +1,197 @@
+Feature
+-------
+
+server-side md-cache
+
+Summary
+-------
+
+Two years ago, Peter Portante noticed the extremely high number of
+system calls on the XFS brick required per Swift object. Since then, he
+and Ben England have observed several similar cases.
+
+More recently, while looking at a **netmist** single-thread workload run
+by a major banking customer to characterize Gluster performance, Ben
+observed this [system call profile PER
+FILE](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf).
+This is strong evidence of several problems with the POSIX translator:
+
+- repeated polling with **sys\_lgetxattr** of the **gfid** xattr
+- repeated **sys\_lstat** calls
+- polling of xattrs that were *undefined*
+- calling **sys\_llistxattr** to get the list of all xattrs AFTER all
+  other calls
+- calling **sys\_lgetxattr** two times, once to find out how big the
+  value is and once to get the value (see the sketch below)!
+- one-at-a-time calls to get individual xattrs
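+
+The double-call pattern from the fifth bullet looks like this in plain
+POSIX code (nothing here is Gluster-specific):
+
+    #include <stdlib.h>
+    #include <sys/types.h>
+    #include <sys/xattr.h>
+
+    /* Two syscalls per xattr: one to size the value, one to fetch it. */
+    static void *
+    get_xattr_the_slow_way (const char *path, const char *key,
+                            ssize_t *len)
+    {
+            ssize_t size = lgetxattr (path, key, NULL, 0); /* call #1 */
+            if (size < 0)
+                    return NULL;
+
+            void *buf = malloc (size);
+            if (buf == NULL)
+                    return NULL;
+
+            *len = lgetxattr (path, key, buf, size);       /* call #2 */
+            return buf;
+    }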
+
+All of the problems except for the last one could be solved through use
+of a metadata cache associated with each inode. The last problem is not
+solvable in a pure POSIX API at this time, although XFS offers an
+**ioctl** that can get all xattrs at once (the cache could conceivably
+determine whether the brick was XFS or not and exploit this where
+available).
+
+Note that as xattrs are added to the system, this becomes more and more
+costly, and new Gluster features typically require keeping state
+associated with a file, usually in one or more xattrs.
+
+Owners
+------
+
+TBS
+
+Current status
+--------------
+
+There is already an **md-cache** translator, so you would think that
+problems like this would not occur, but clearly they do -- that
+translator sits on the client side of the protocol, typically above
+translators such as AFR and DHT. The problems may be worse in cases
+where the md-cache translator is not present (example: SMB with the
+gluster-vfs plugin, which requires the stat-prefetch volume parameter
+to be set to *off*).
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md)
+- bugzillas TBS
+
+Detailed Description
+--------------------
+
+This proposal has changed as a result of discussions in
+\#gluster-meeting - instead of modifying the POSIX translator, we
+propose to use the md-cache translator in the server above the POSIX
+translator, and add negative caching capabilities to the md-cache
+translator.
+
+By "negative caching" we mean that md-cache can tell you if the xattr
+does not exist without calling down the translator stack. How can it do
+this? On the server side, the only path to the brick is through the
+md-cache translator. When it encounters an xattr get request for a file
+it has not seen before, the first step is to call down with llistxattr()
+to find out what xattrs are stored for that file. From that point on,
+until the file is evicted from the cache, any request from higher
+translators for a non-existent xattr value is immediately answered with
+ENODATA, without calling down to the POSIX translator.
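+
+A minimal sketch of that flow, using hypothetical structures and
+helpers rather than the real md-cache internals:
+
+    #include <errno.h>
+    #include <string.h>
+
+    /* Hypothetical per-inode cache entry (fixed sizes for brevity). */
+    struct xattr_cache_entry {
+            char known_keys[16][256]; /* names seen via llistxattr() */
+            int  nkeys;
+            int  primed; /* has llistxattr() been called down yet? */
+    };
+
+    extern struct xattr_cache_entry *cache_entry_for (const void *inode);
+    extern int prime_with_llistxattr (const void *inode,
+                                      struct xattr_cache_entry *e);
+
+    /* Returns 0 if the xattr exists (wind getxattr down as usual),
+     * -ENODATA if it provably does not (no call into POSIX needed). */
+    static int
+    negative_cache_check (const void *inode, const char *key)
+    {
+            struct xattr_cache_entry *e = cache_entry_for (inode);
+
+            /* First sight of this inode: one llistxattr() down the
+             * stack learns the complete set of on-disk keys. */
+            if (!e->primed && prime_with_llistxattr (inode, e) != 0)
+                    return -EIO;
+
+            for (int i = 0; i < e->nkeys; i++)
+                    if (strcmp (e->known_keys[i], key) == 0)
+                            return 0;
+
+            return -ENODATA;
+    }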
+
+We must ensure that the cache does not leak memory and that race
+conditions do not occur while multiple threads access it, but this
+seems like a manageable problem and is certainly not a new one for
+Gluster translator code.
+
+Benefit to GlusterFS
+--------------------
+
+Most of the system calls and about 50% of the elapsed time could have
+been removed from the above small-file read profile through use of this
+cache. This benefit will be more visible as we transition to using SSD
+storage, where disk seek times will not mask overheads such as this.
+
+Scope
+-----
+
+This can be done locally in the glusterfsd process by inserting the
+md-cache translator just above the POSIX translator, where the vast
+majority of the stat, getxattr, and setxattr calls originate.
+
+### Nature of proposed change
+
+No new translators are required. We may require some existing
+translators to call down the stack ("wind a FOP") instead of calling
+sys\_\*xattr themselves if these calls are heavily used, so that they
+can take advantage of the stat-xattr-cache.
+
+It is *really important* that the md-cache use listxattr() to
+immediately determine which xattrs are on disk, avoiding needless
+getxattr calls this way. At present it does not do this.
+
+### Implications on manageability
+
+None. We need to make sure that the cache is big enough to support the
+threads that use it, but not so big that it consumes a significant
+percentage of memory. We may want to make cache size and expiration
+time tunables so that we can experiment in performance testing to
+determine optimal parameters.
+
+### Implications on presentation layer
+
+Translators above the md-cache translator are not affected.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+We can use strace of a single-thread smallfile workload to verify that
+the cache is filtering out excess system calls. We could add counters
+to the cache to measure the cache hit rate.
+
+User Experience
+---------------
+
+Single-thread small-file creates should be faster, particularly on SSD
+storage. Performance testing is needed to quantify this further.
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+None, except for tunables relating to cache size and expiration time.
+
+Status
+------
+
+Not started.
+
+Comments and Discussion
+-----------------------
+
+Jeff Darcy: I've been saying for ages that we should store xattrs in a
+local DB and avoid local xattrs altogether. Besides performance, this
+would also eliminate the need for special configuration of the
+underlying local FS (to accommodate our highly unusual use of this
+feature) and generally be good for platform independence. Not quite so
+sure about other stat(2) information, but perhaps I could be persuaded.
+In any case, this has led me to look into the relevant code on a few
+occasions. Unfortunately, there are \*many\* places that directly call
+sys\_\*xattr instead of winding fops - glusterd (for replace-brick),
+changelog, quota, snapshots, and others. I think this feature is still
+very worthwhile, but all of the "cheating" we've tolerated over the
+years is going to make it more difficult.
+
+Ben England: a local DB might be a good option but could also become a
+bottleneck, unless you have a DB instance per brick (local) filesystem.
+One problem that the DB would solve is getting all the metadata in one
+query - at present POSIX API requires you to get one xattr at a time. If
+we implement a caching layer that hides whether a DB or xattrs are being
+used, we can make it easier to experiment with a DB (level DB?). On your
+2nd point, While it's true that there are many sites that call
+sys\_\*xattr directory, only a few of these really generate a lot of
+system calls. For example, some of these calls are only for the
+mountpoint. From a performance perspective, as long as we can intercept
+the vast majority of the sys\_\*xattr calls with this caching layer,
+IMHO we can tolerate a few exceptions in glusterd, etc. However, from a
+CORRECTNESS standpoint, we have to be careful that calls bypassing the
+caching layer don't cause cache contents to become stale (out-of-date,
+inconsistent with the on-disk brick filesystem contents).
diff --git a/under_review/template.md b/under_review/template.md
new file mode 100644
index 0000000..02b7de1
--- /dev/null
+++ b/under_review/template.md
@@ -0,0 +1,93 @@
+Feature
+-------
+
+Summary
+-------
+
+*Brief Description of the Feature* 
+
+Owners
+------
+
+**Feature Owners** - *Ideally includes you* :-)
+
+Current status
+--------------
+
+*Provide details on related existing features, if any and why this new feature is needed*
+
+Related Feature Requests and Bugs
+---------------------------------
+
+*Link all the related feature requests and bugs in [bugzilla](https://bugzilla.redhat.com) here. If there is no bug filed for this feature, please do so now. Add a comment and a link to this page in each related bug.*
+
+Detailed Description
+--------------------
+
+*Detailed Feature Description*
+
+Benefit to GlusterFS
+--------------------
+
+*Describe Value additions to GlusterFS*
+
+Scope
+-----
+
+#### Nature of proposed change
+
+*modification to existing code, new translators ...*
+
+#### Implications on manageability
+
+*Glusterd, GlusterCLI, Web Console, REST API*
+
+#### Implications on presentation layer
+
+*NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration*
+
+#### Implications on persistence layer
+
+*LVM, XFS, RHEL ...*
+
+#### Implications on 'GlusterFS' backend
+
+*brick's data format, layout changes*
+
+#### Modification to GlusterFS metadata
+
+*extended attributes used, internal hidden files to keep the metadata...*
+
+#### Implications on 'glusterd'
+
+*persistent store, configuration changes, brick-op...*
+
+How To Test
+-----------
+
+*Description on Testing the feature*
+
+User Experience
+---------------
+
+*Changes in CLI, effect on User experience...*
+
+Dependencies
+------------
+
+*Dependencies, if any*
+
+Documentation
+-------------
+
+*Documentation for the feature*
+
+Status
+------
+
+*Status of development - Design Ready, In development, Completed*
+
+Comments and Discussion
+-----------------------
+
+*Follow here*
diff --git a/under_review/volgen-rewrite.md b/under_review/volgen-rewrite.md
new file mode 100644
index 0000000..4b954b3
--- /dev/null
+++ b/under_review/volgen-rewrite.md
@@ -0,0 +1,128 @@
+Feature
+-------
+
+Volgen rewrite
+
+Summary
+-------
+
+The volfile-generation module has become an important choke point for
+development of new features, as each new feature needs to make changes
+here. Many previous feature additions have been rushed in by
+copying/pasting code or adding special-case checks, instead of
+refactoring. The result is a big hairball. Every new change that
+involves client translators has to deal with various permutations of
+replication/EC, striping/sharding, rebalance, self-heal, quota,
+snapshots, tiering, NFS, and so on. Each new change at this point is
+almost certain to introduce subtle errors that will only be caught when
+certain combinations of features and operations are attempted. There
+aren't enough tests to cover even the basic combinations, and we'd need
+hundreds more to test the obscure ones.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Just a proposal so far.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+TBD
+
+Detailed Description
+--------------------
+
+Many of the problems stem from the fact that our internal volfiles need
+to be consistent with, but slightly different from, one another. Instead
+of generating them all separately, we should separate the generation
+into two phases:
+
+- Generate a "core" or "vanilla" graph containing all of the
+  translators, option settings, etc. common to all of the
+  special-purpose volfiles.
+
+- For each special-purpose volfile, copy the core/vanilla graph (*not
+  the code* that generated it) and modify the copy to get what's
+  desired, as sketched below.
+
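+A sketch of the two-phase idea; graph_t and the helpers below are
+hypothetical stand-ins, not the current glusterd-volgen API:
+
+    typedef struct graph graph_t; /* opaque volfile graph */
+
+    /* Hypothetical helpers. */
+    extern graph_t *volgen_build_core_graph (const void *volinfo);
+    extern graph_t *graph_deep_copy (const graph_t *g);
+    extern int      graph_apply_role (graph_t *g, const char *role);
+
+    static graph_t *
+    volfile_for_role (const void *volinfo, const char *role)
+    {
+            /* Phase 1: build the shared "vanilla" graph exactly once
+             * (single-volume simplification for this sketch). */
+            static graph_t *core = NULL;
+            if (core == NULL)
+                    core = volgen_build_core_graph (volinfo);
+
+            /* Phase 2: copy the graph itself, *not* the code that
+             * built it, then mutate the copy for one role, e.g.
+             * "nfs", "self-heal", or "rebalance". */
+            graph_t *copy = graph_deep_copy (core);
+            if (copy != NULL)
+                    graph_apply_role (copy, role);
+            return copy;
+    }
+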
+Some of the other problems in this module stem from lower-level issues
+such as bad data- or control-structure choices (e.g. operating on a
+linear list of bricks instead of a proper graph), or complex
+object-lifecycle management (e.g. see
+<https://bugzilla.redhat.com/show_bug.cgi?id=1211749>). Some of these
+problems might be alleviated by using a higher-level language with
+complex data structures and garbage collection. An infrastructure
+already exists to do graph manipulation in Python, developed for HekaFS
+and subsequently used in several other places (it's already in our
+source tree).
+
+Benefit to GlusterFS
+--------------------
+
+More correct, and more \*verifiably\* correct, volfile generation even
+as the next dozen features are added. Also, accelerated development time
+for those next dozen features.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Pretty much limited to what currently exists in glusterd-volgen.c
+
+### Implications on manageability
+
+None.
+
+### Implications on presentation layer
+
+None.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+None, unless we decide to store volfiles in a different format (e.g.
+JSON) so we can use someone else's parser instead of rolling our own.
+
+How To Test
+-----------
+
+Practically every current test generates multiple volfiles, which will
+quickly smoke out any differences. Ideally, we'd add a bunch more tests
+(many of which might fail against current code) to verify correctness of
+results for particularly troublesome combinations of features.
+
+User Experience
+---------------
+
+None.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+None.
+
+Status
+------
+
+Still just a proposal.
+
+Comments and Discussion
+-----------------------