From 601bfa2719d8c9be40982b8a6526c21cd0ea4966 Mon Sep 17 00:00:00 2001 From: Kaushal M Date: Wed, 20 Jan 2016 13:09:23 +0530 Subject: Rename in_progress to under_review `in_progress` is vague term, which could either mean the feature review is in progress, or that the feature implementation is in progress. Renaming to `under_review` gives a much better indication that the feature is under review and implementation hasn't begun yet. Refer https://review.gluster.org/13187 for the discussion which lead to this Change-Id: I3f48e15deb4cf5486d7b8cac4a7915f9925f38f5 Signed-off-by: Kaushal M Reviewed-on: http://review.gluster.org/13264 Reviewed-by: Raghavendra Talur Tested-by: Raghavendra Talur --- under_review/composite-operations.md | 438 +++++++++++++++++++++++++++++++++++ 1 file changed, 438 insertions(+) create mode 100644 under_review/composite-operations.md (limited to 'under_review/composite-operations.md') diff --git a/under_review/composite-operations.md b/under_review/composite-operations.md new file mode 100644 index 0000000..5cc29b4 --- /dev/null +++ b/under_review/composite-operations.md @@ -0,0 +1,438 @@ +Feature +------- + +Composite operations is a term describing elimination of round trips +through a variety of techniques. Some of these techniques are borrowed +from NFS and SMB protocols in spirit at least. + +Why do we need this? All too frequently we encounter situations where +Gluster performance is an order of magnitude or even two orders of +magnitude slower than NFS or SMB to a local filesystem, particularly for +small-file and metadata-intensive workloads (example: file browsing). +You can argue that Gluster provides more functionality, so it should be +slower, but we need to close the gap -- if Gluster was half the speed of +NFS and provided much greater functionality plus scalability, users +would be ok with some performance tradeoff. + +What is the root cause? Response time of Gluster APIs is much higher +than response time of other protocols. A simple protocol trace can show +you a root cause for this: excessive round-trips. + +There are two dimensions to this: + +- operations that require lookups on every brick (covered elsewhere) +- excessive one-at-a-time access to xattrs and ACLs +- client responsible for maintaining filesystem state instead of + server +- SMB: case-insensitivity of Windows = no direct lookup by filename on + brick + +Summary +------- + +example of previous success: eager-lock. When Gluster was first acquired +by Red Hat and testing with 10-GbE interfaces began, we quickly noticed +that sequential write performance was not what we expected. The Gluster +protocol required a 5-step sequence for every write from client to +server(s), in order to maintain consistency between replicas, which is +loosely paraphrased here: + +- lock-replica-inode +- pre-op (mark replicas dirty) +- write +- post-op +- unlock-replica-inode + +The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow +the client to hang onto the lock, and we combined post-op for previous +write with pre-op for current write and actual write request so that +instead of 5 RPCs per write we got down to ONE RPC per write, and write +performance improved significantly (how much TBS) + +Owners +------ + +TBS + +Current status +-------------- + +Some of the problems with round trips stem from lack of scalability in +DHT protocol, and attributes of AFR protocol. + +Related Feature Requests and Bugs +--------------------------------- + +- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu +- [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT +- [Features/new-style-replication](..GlusterFS 3.6/New Style Replication.md)- client no longer does replication + +*Note : search RHS buglist for small-file-related performance bugs and directory browsing performance bugs, I haven't done that yet, there are a LOT of them* + +Detailed Description +-------------------- + +Here are the proposals: + +- READDIRPLUS generalization +- lockless-CREATE +- CREATE-AND-WRITE - allow CREATE op to transmit data and metadata + also +- case-insensitivity feature - removes perf. penalty for SMB + +### READDIRPLUS used to prefetch xattrs + +recent correction: For SMB and other protocols that have additional +security metadata, READDIRPLUS can be used more effectively to prefetch +xattr data, such as ACLs and Windows-specific security info. However, +upper layers have to make use of this feature. We treat ACLs as a +special case of an extended attribute, since ACLs are not currently +returned by READDIRPLUS (can someone confirm this?). The current RPC +request and response structure are in +[gfs3\_readdirp\_{req,rsp}](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x) +in the above source code URL. In both cases, the request structure field +"dict" can contain a list of extended attribute IDs (or names, not sure +which). + +However, once these xattrs are prefetched, will md-cache translator in +the client be able to hang onto them to prevent round-trips to the +server? Is there any additional invalidation needed for the expanded +role of md-cache? + +### eager-lock for directories + +This extension doesn't seem to impact APIs at all, but it does require a +way to safely do a CREATE FOP that will either appear on all replicas or +none (or allow self-healing to repair the difference in the directories +in the correct way). + +If we have an NSR translator, this seems pretty straightforward. NSR +only allows the client to talk to the "leader" server in the replica +host set, and the leader then takes responsibility for propagating the +change. + +With AFR, the situation is very different. In order to guarantee that a +CREATE will succeed on all AFR subvolumes, the client must write-lock +the parent directory. Otherwise some other client could create the same +file at the same time on some but not all of the AFR subvolumes. + +But why unlock? Chances are that any immediate subsequent file create in +that directory will be coming from the same client, so it makes sense +for the client to hang onto the write lock for a short while, unless +some other client wants it. This optimistic lock behavior is similar to +the "eager-lock" feature in the AFR translator today. Doing this saves +us not only the need to do a LOOKUP prior to CREATE, but also saves us +the need to do a directory unlock per file! + +### CREATE-AND-WRITE + +This extension is similar to quick-read, where the OPEN FOP can return +the file data if it's small enough. This extension adds the following +features to the CREATE FOP: + +- - optionally specify xattrs to associate with file when it's + created + - optionally specify write data (if it fits in 1 RPC) + - optionally close the file (what RELEASE does today) + - optionally fsync the file (for apps that require file + persistence such as Swift) + +This option is also similar to what librados (Ceph) API allows user to +do today, see [Ioctx.write\_full in librados python +binding](http://ceph.com/docs/master/rados/api/python/#writing-reading-and-removing-objects) + +This avoids the need for the round-trip sequence: + +- lock inode for write +- create +- write +- flush(directory) +- set-xattr[1] +- set-xattr[2] +- ... +- set-xattr[N] +- release +- unlock inode + +The existing protocol structure is in [structure +gfs3\_create\_req](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x) +. We would allocate reserved bits from the "flags" field for the +optional extensions. The xdata field in the request would contain a +tagged sequence containing the optional parameter values. + +### case-insensitive volume support + +The SMB protocol is bridging a divide between an operating system, +Windows, that supports case-insensitive lookup, and an operating system, +Linux (POSIX) that supports only case-sensitive lookup inside Gluster +bricks. If nothing is done to bridge this gap, file lookup and creation +becomes very expensive in large directories (a few thousand files in +size): + +- on CREATE, the client has to search the entire directory to + determine whether some other file with the same name (but a + different case mix) already exists. This requires locking the + directory. Furthermore, consistent hashing, which knows nothing + about case mix, can not predict which brick might contain the file, + since it might have been created with a different case mix. This is + a SCALABILITY issue. + +- on LOOKUP, the client has to search all bricks for the filename + since there is in general no way to predict which brick the + case-altered version of the filename might have hashed to. This is a + SCALABILITY issue. The entire contents of the directory on each + brick must be searched as well. + +- SMB does support "case-sensitive yes" smb.conf configuration option, + but this is user-hostile since Windows does not understand it. + +What happens when Linux user-mode process such as glusterfsd (brick) +tries to do a case-insensitive lookup on the filename using a local +filesystem? XFS has a feature for this, but Gluster can't assume XFS as +for VFS supporting case-insensitivity - it's not going to happen. Yes +you can do readdir on directory and scan for the case-insensitive match, +but it's O(N\^2) where N is number of files you place into a directory. + +**Proposal**: only use lower-case filenames (or upper-case, it doesn't +matter) at the brick filesystem, and record the original case mix (how +the user specified the filename at create/rename time) in an xattr, call +it 'original-case'. + +**Issue**: (from Ira Cooper): what locales would be supported? SMB +already had to deal with this. + +We could define a 'case-insensitive' volume parameter (default off), so +that users who have no SMB clients do not experience this change in +behavior. + +This mapping to lower-case filenames has to happen at or above DHT layer +to avoid the scalability issue above. If this is not done by DHT (if it +is done in VFS-gluster SMB plugin for example), then Gluster clients on +a POSIX filesystem will not see the same filenames as Windows users, and +this will lead to confusion. + +However, this has consequences for sharing file between SMB and non-SMB +client - non-SMB client will pay performance penalty for +case-insensitivity and will see case-insensitive behavior that is not +strictly POSIX-compliant - for example if I create file "a" and then +file "A" in same directory, the 2nd create will get EEXIST. That's the +price you pay for having the two kinds of clients accessing the same +volume - the most restricted client has to win. + +Changes required to DHT or equivalent: + +- READDIR(PLUS): report filenames as the user expects to see them, + using the original-case xattr. see above READDIRPLUS enhancement for + how this can be done efficiently. +- CREATE (or RENAME):, map the filename within the brick to lower case + before creating, and records the original case mix using the + original-case xattr. See CREATE-AND-WRITE enhancement above for how + this can be done efficiently. +- LOOKUP: map the filename to lower case before attempting a lookup on + the brick. +- RENAME: To prevent loss of file during a client-side crash, first + delete the case-mix xattr, then do the rename, then re-add the + case-mix xattr. If the case-mix xattr is not present, then the + lower-case filename is returned by READDIR(PLUS) but the file is not + lost. + +Since existing SMB users may want to take advantage of this change, we +need a process for converting a Gluster volume to support +case-insensitivity: + +- optional - use "find /your/brick/directory -not -type d -a -not + -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' | + sort " command in parallel on every brick, and do sort -merge of + per-brick outputs followed by "uniq -d" to quickly determine if + there are case-insensitivity collisions on existing volume. This + would let user resolve such conflicts ahead of time without taking + down the volume. +- shut down the volume +- run a script on all bricks in parallel to convert it to + case-insensitive format - very fast because it runs on a local fs. + - rename the brick file to lower case and store an xattr with + original case. +- turn volume lookup-unhashed to ON because files will not yet be on + the right brick. +- set volume into case-insensitive state +- start volume - it is now online but not in efficient state +- rebalance (get DHT to place the files where they belong) + - If rebalance uncovers case-insensitive filename collisions (very + unlikely), the 2nd file is renamed to its original case-mix with + string 'case-collision-gfid' + hex gfid appended, and a counter + is incremented. A simple "find" command at each brick in + parallel executed with pdsh can locate all instances of such + files - the user then has to decide what they want to do with + them. +- reset lookup-unhashed to default (auto) + +Benefit to GlusterFS +-------------------- + +- READDIRPLUS optimizations could completely solve the performance + problems with file browsing in large directories, at least to the + point where Gluster performs similarly to NFS and SMB in general and + can't be blamed. (DHT v2 could also improve performance by not + requiring round trips to every brick to retrieve a directory). + +- lockless-CREATE - can improve small-file create performance + significantly by condensing 4 round-trips into 1. Small-file create + is the worst-performing feature in Gluster today. However, it won't + solve small-file create problems until we address other areas below. + +- CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe + more) are combined into 1 round trip. + +The performance benefit increases as the Gluster client round-trip time +to the servers increases. For example, these enhancements could make +possible use of Gluster protocol over a WAN. + +Scope +----- + +Still unsure. This impacts libgfapi - if we want applications to take +advantage of these enhancements, we need to expose these APIs to +applications somehow, and POSIX does not allow them AFAIK. + +CREATE-AND-WRITE impacts the translator interface. Translators must be +able to pass down: + +- a list of xattr values (which translators in the stack can append + to). +- a data buffer +- flags to request optionally that file be fsynced and/or closed. + +The [fop\_create\_t +params](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/libglusterfs/src/xlator.h) +have both a "flags" parameter and a "xdata" parameter; this last +parameter could be used to pass both data and xattrs in a tagged +sequence format (not sure whether **dict\_t** supports this). + +### Nature of proposed change + +The Gluster code might need refactoring in create-related code to +maximize common code between existing implementation, which won't go +away, and the new implementation of these FOPS. + +However, I suspect that READDIRPLUS extensions may be possible to insert +without disrupting existing code that much, may need some help on this +one. + +### Implications on manageability + +The gluster volume profile command will have to be extended to get +support for the new CREATE FOP if this is how we choose to implement it. + +These changes should be somewhat transparent to management layer +otherwise. + +### Implications on presentation layer + +Swift-on-file Gluster-specific code would have to change to take +advantage of this feature. + +NFS and SMB would have to change to exploit new features to reduce +SMB-specific xattr and ACL access. + +The +[libgfapi](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/api/src/glfs.h) +implementation would have to expose these features. + +- **glfs\_readdirplus\_r** - it's not clear that struct dirent would + be able to handle xattrs, and there is no place to specify which + extended attributes we are interested in. +- **glfs\_creat** - has no parameters to support xattrs or write data. + So we'd need a new entry point to do this. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +None. + +### Modification to GlusterFS metadata + +None. We are repackaging how data gets passed in protocol, not what it +means. + +### Implications on 'glusterd' + +None. + +How To Test +----------- + +We have programs that can generate metadata-intensive workloads, such as +smallfile benchmark or fio. For smallfile creates, we can use a modified +version of the [parallel libgfapi +benchmark](https://github.com/bengland2/parallel-libgfapi) (don't worry, +I know the developer ;-) to verify that the response time for the new +create-and-write API is better than before, or to verify that +lockless-create improves response time. + +In the case of readdirplus extensions, we can test with simple libgfapi +program coupled with a protocol trace or gluster volume profile output +to see if it's working and has desired decrease in response time. + +User Experience +--------------- + +The impact of this operation should be functionally transparent to the +end-user, but it should significantly improve Gluster performance to the +point where throughput and response time are reasonably close (not +equal) to NFS, SMB, etc on local filesystems. This is particularly true +for small-file operations and directory browsing/listing. + +Dependencies +------------ + +This change will have significant impact on translators, it is not easy. +Because this is a non-trivial change, an incremental approach should be +specified and followed, with each stage committed and regression tested +separately. For example, we could break CREATE-and-WRITE proposal into 4 +pieces: + +- add libgfapi support, with ENOSUPPORT returned for unimplemented + features +- add list of xattrs written at create time. +- add write data +- add close and fsync options + +Documentation +------------- + +How do we document RPC protocol changes? For now, I'll try to use IDL .x +file or whatever specifies the RPC itself. + +Status +------ + +Not designed yet. + +Comments and Discussion +----------------------- + +### Jeff Darcy 16:20, 3 December 2014 + +Talk:Features/composite-operations + +"SMB: case-insensitivity of Windows = no direct lookup by filename on +brick" + +We did actually come up with a way to do the case-preserving and +case-squashing lookups simultaneously before falling back to the global +lookup, but AFAIK it's not implemented. + +READDIRPLUS extension: md-cache actually does pre-fetch some attributes +associated with (Linux) ACLs and SELinux. Maybe it just needs to +pre-fetch some others for SMB? Also, fetching into glusterfs process +memory doesn't save us the context switch. For that we need dentry +injection (or something like it) so that the information is available in +the kernel by the time the user asks for it. + +"glfs\_creat - has no parameters to support xattrs" + +These are being added already because NSR reconciliation needs them (for +many other calls already). -- cgit