From ee5fab234b0a9d317f7ec7cb7eb05c2d1172e94e Mon Sep 17 00:00:00 2001 From: Poornima G Date: Fri, 20 Jan 2017 12:16:04 +0530 Subject: Add feature page for negative lookup cache Change-Id: I5c3212416f8ed1729656eadc4ef28aeebfd16ebb Signed-off-by: Poornima G Reviewed-on: https://review.gluster.org/16436 Reviewed-by: Amar Tumballi Reviewed-by: Jeff Darcy Tested-by: Jeff Darcy --- done/negative-lookup-cache.md | 220 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 220 insertions(+) create mode 100644 done/negative-lookup-cache.md diff --git a/done/negative-lookup-cache.md b/done/negative-lookup-cache.md new file mode 100644 index 0000000..ce549f1 --- /dev/null +++ b/done/negative-lookup-cache.md @@ -0,0 +1,220 @@ +Feature +------- +Implement negative lookup cache and hence improve create performance + +Summary +------- +Before creating any file lookups(1 in Fuse, 5-6 in SMB etc.) are sent to verify if the file + already exists. Serving these lookup from the cache when possible, increases the create +performance by multiple folds in SMB access and by some percentage in Fuse/NFS access. +A lookup that fails with ENOENT is called as negative lookup. + +Owners +------ +Poornima G +Pranith Kumar Karampuri +Rajesh Joseph + +Current status +-------------- +In design phase. + +Related Feature Requests and Bugs +--------------------------------- + +Detailed Description +-------------------- + +### Design: +To identify if the lookup is on a non existant file, we need to compare the filename in lookup +against filenames in the parent directory. Hence, we have to have some kind of dentry cache and +a way to ensure consistency of the cache. + +#### Which cache consistency model? + +1. Cache entries and use directory leases for consistency + Directory RW lease is taken by any client that is caching the existing entries and creating new + entries in the directory. The lease is broken when any other client tries to create an entry in + the directory. With this approach we need to first implement directory leases and then the + negative lookup cache xlator. Directory leases implementation in the current dht setup becomes + very complex as the directories are replicated on all the bricks and leases should be acquired + on all the bricks. Since lease will be a in memory state on the bricks it becomes harder to heal + the states and handle disconnects, network partitions etc. Hence the approach 2. + +2. Cache entries and use upcall for consistency + Directory entries are cached, whenever a any other client is creating entries, the cache is + invalidated and built a fresh. With this there is a small window when the file was actually + created and the upcall has not yet reached the client. In this window, the cache can return + lookup as ENOENT though the file existed at that point on the bricks. But this window can happen + even otherwise because we do not serialize fops from multiple clients. If there are applications + which poll on a file being created, it is not a suitable use case for caching. Hence going with + this approach. + +#### What to cache? +Positive entries (entries that are already present on disk) and/or negative entries (entries that +were looked up and found to be ENOENT). The names of the entries are cached, but it can be plugged +to cache the hash of the name ([1] suggested by Pranith). The cache is bound by the upper limit +configured by the user. + +The positive entries(PE): +- Populated as a part of readdirp, and as a part of mkdir followed by creates inside that directory. + Lookups and other fops do not populate the positive entry (as it can grow long and is of no value + add) +- Freed on recieving upcall(with dentry change flag) or on expiring timeout of the cache. + +The negative entries(NE): +- Populated only when lookup/stat returns ENOENT. Fuse mostly sends only one lookup before create, + hence negative entry cache is almost useless. But for SMB access, multiple lookups/stats are sent + before creating the file. Hence the negative entry cache. It can exist even when the positive + entry cache is invalid. +- Freed on recieving upcall(with dentry change flag) or on expiring timeout of the cache. + +PE list and NE list can exist independent of each other and can be optionally disabled. + +#### What fops gets served from the cache? +1. Lookup/stat if its a negative lookup +2. Getxattr on get_real_filename (this is a case insensitive lookup) + +Pseudocode for lookup/stat and getxattr fops: +lookup/stat () { + if inode is found in inode table for this name then goto out + if (parent inode state == NE_VALID) + if found in NE list + STACK_UNWIND (ENOENT) + else + goto out + if (parent inode state == PE_VALID) + if not found in PE list + STACK_UNWIND (ENOENT) + else + goto out + else if (parent inode state == PE_PARTIAL || PE_INVALID) + goto out +out: + STACK_WIND +} +getxattr (get_real_filename) { + if (inode state == PE_VALID) + if (found in case insensitive PE search) + STACK_UNWIND (found filename) + else + if (inode state == PE_FULL) + STACK_UNWIND (ENOENT) + else if (inode state == PE_PARTIAL) + STACK_WIND +} + +#### Data structures to store the cache? +First thought would be to have a linked list of filenames. String compare of all the PE and NE list +is CPU consuming and may cause performance hit in some cases. Lookups can be catagorised to: Fresh ++ve lookup, Subsequent lookups(+ve), Negative lookups. + +Subsequent lookups: As the inode will be present in inode table we need not search PE/NE list, + hence do not suffer performance hit. + +Fresh +Ve lookups: Worst case it will have to compare it with every entry in the NE and PE list. + This will cause some performance hit. + +Fresh -Ve lookups: Worst case it will have to compare it with every entry in the NE list to find + that its not there, and PE is INVALID/PARTIAL hence has to goto bricks anyways. + (PE_FULL is not a bad case, as it can return from the cache instead of network fop) + +Fresh +ve lookup will have some performance impact, hence we need something better than sequential +string compare of names: +- Store the hash instead of names as suggested in [1]. If this is the case get_real_filename cannot + be served, unless we store two hashes (case as is- for lookups, and lower case hash- for + get_real_filename) +- Store it in htree, so that the search time reduces but adding and removing from a tree is not + O(1) operation. +- Set an upper limit on the number of NE and PE in the linked list. + +For the initial implementation going with the linked list for negative entries. +List of inode refs for positive entries. + +#### Other things to note? +- The timeout can be for the whole of the inode cache, in that case any further update of the cache +will not not change the cache time. Or the timeout can be per entry if the PE/NE list, but this can +lead to lot of bookkeeping and timers. The timer expiry cleanup can be as a part of different reaper +thread or we can register a time handler for each inode in the timerwheel (as it is scalse well). + +[1] http://lists.gluster.org/pipermail/gluster-users/2016-August/028107.html + +Benefit to GlusterFS +-------------------- + +Improves file/directory create performance. + +Scope +----- + +#### Nature of proposed change + +- Adds a new xlator and glusterd canges to enable that xlator + +#### Implications on manageability + +N/A + +#### Implications on presentation layer + +N/A + +#### Implications on persistence layer + +N/A + +#### Implications on 'GlusterFS' backend + +N/A + +#### Modification to GlusterFS metadata + +N/A + +#### Implications on 'glusterd' + +GlusterD changes are integral to this feature, and described above. + +How To Test +----------- + +For the most part, testing is of the "do no harm" sort; the most thorough test +of this feature is to run our current regression suite. +Some specific test cases include create on all kind of volumes and from multiple clienats: +- distribute +- replicate +- shard +- disperse +- tier +Also, create while: +- rebalance in progress +- tiering migration in progress +- self heal in progress + +And all the test cases being run while the memory consumption of the process +is monitored. + +User Experience +--------------- + +Faster file or directory creates + +Dependencies +------------ + +N/A + +Documentation +------------- + +TBD (very little) + +Status +------ + +Development in progress + +Comments and Discussion +----------------------- + +N/A -- cgit