|author||Poornima G <firstname.lastname@example.org>||2016-12-09 12:37:17 +0530|
|committer||Raghavendra G <email@example.com>||2017-01-05 20:19:01 -0800|
Add parallel readdirp feature
Change-Id: Iae0ef7181c0d416359dd87412bfa4c31c489559e Signed-off-by: Poornima G <firstname.lastname@example.org> Reviewed-on: http://review.gluster.org/16090 Reviewed-by: Raghavendra G <email@example.com> Tested-by: Raghavendra G <firstname.lastname@example.org>
1 files changed, 167 insertions, 0 deletions
diff --git a/under_review/readdir-ahead.md b/under_review/readdir-ahead.md
new file mode 100644
@@ -0,0 +1,167 @@
+Improve directory enumeration performance
+Improve directory enumeration performance by implementing parallel readdirp
+at the dht layer.
+Raghavendra G <email@example.com>
+Poornima G <firstname.lastname@example.org>
+Rajesh Joseph <email@example.com>
+Related Feature Requests and Bugs
+Currently readdirp is sequential at the dht layer.
+This makes find and recursive listing of small directories very slow
+(a small directory is one whose contents can be accommodated in one readdirp
+call, e.g. ~600 entries if the buf size is 128k).
+The number of readdirp fops required to fetch the ls -l -R for nested
+no. of fops = (x + 1) * m * n
+n = number of bricks
+m = number of directories
+x = number of readdirp calls required to fetch the dentries completely
+(this depends on the size of the directory and the readdirp buf size)
+1 = readdirp fop that is sent to just detect the end of directory.
+E.g., to list 800 directories with ~300 files each and a readdirp buf size
+of 128K, on a distribute volume with 6 bricks:
+(1+1) * 800 * 6 = 9600 fops
+All of these readdirp fops are sent sequentially, one brick after another.
+With parallel readdirp, the number of fops may not decrease drastically,
+but since they are issued in parallel, throughput increases.
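
The fop count formula above can be checked with a few lines of Python. This is only an illustrative sketch; the function and parameter names are hypothetical and do not come from the GlusterFS code base.

```python
def readdirp_fops(calls_per_dir, num_dirs, num_bricks):
    """Return (x + 1) * m * n.

    calls_per_dir (x): readdirp calls needed to fetch all dentries of a directory
    num_dirs (m):      number of directories
    num_bricks (n):    number of bricks
    The +1 is the extra readdirp that only detects end of directory.
    """
    return (calls_per_dir + 1) * num_dirs * num_bricks


# The example from the text: 800 small directories on 6 bricks, x = 1.
print(readdirp_fops(1, 800, 6))  # 9600
```

Because x stays the same whether the calls are issued sequentially or in parallel, this count shows why the gain is expected in throughput rather than in the number of fops.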
+Why it's not a straightforward problem to solve:
+One needs to briefly understand how the directory offset is handled in dht.
+, ,  are some of the links that will hint the same.
+- The d_off is in the order of bricks identified by dht. Hence, the dentries
+should always be returned in the same order as the bricks, i.e. brick2 entries
+shouldn't be returned before brick1 reaches EOD.
+- We cannot store any info of offset read so far etc. in inode_ctx or fd_ctx
+- In the case of very large directories, where the readdirp buf is too small
+to hold all the dentries of any brick, parallel readdirp is an overhead.
+Sequential readdirp best suits large directories. This demands that dht be
+aware of, or speculate about, the directory size.
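
The ordering constraint in the list above can be sketched as follows. This is a simplified model, not the dht or readdir-ahead implementation: `read_brick` is a hypothetical stand-in for a readdirp call, each brick is modelled as a dict, and every brick is assumed to fit in a single read. The reads are issued concurrently, but the results are still handed back strictly in brick order.

```python
from concurrent.futures import ThreadPoolExecutor


def read_brick(brick):
    """Hypothetical stand-in for readdirp: return all dentries of one brick."""
    return list(brick["entries"])


def parallel_listing(bricks):
    """Read all bricks concurrently, then return dentries in brick order."""
    # pool.map yields results in input order, regardless of completion order.
    with ThreadPoolExecutor(max_workers=max(1, len(bricks))) as pool:
        per_brick = list(pool.map(read_brick, bricks))
    ordered = []
    for entries in per_brick:
        # brick2 entries are appended only after all of brick1's entries.
        ordered.extend(entries)
    return ordered


bricks = [{"entries": ["a", "b"]}, {"entries": ["c"]}, {"entries": []}]
print(parallel_listing(bricks))
```

The sketch ignores the large-directory case from the last bullet, where reading every brick ahead of time would waste work that a sequential readdirp avoids.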
+There were two solutions that we evaluated:
+1. Change dht_readdirp itself to wind readdirp in parallel
+2. Load readdir-ahead as a child of dht
+For the below mentioned reasons we go with the second approach suggested by
+- It requires no or very few changes in dht
+- Along with empty/small directories it also benefits large directories
+The only slightly complicated part would be to tune the readdir-ahead
+buffer size for each instance.
+The perf gain observed is directly proportional to the:
+- Number of nodes in the cluster/Volume
+- Latency between client and each node in the volume.
+Benefit to GlusterFS
+Improves directory enumeration performance in large clusters.
+#### Nature of proposed change
+- Changes in readdir-ahead, dht xlators.
+- Change glusterd to load readdir-ahead as a child of dht
+ without breaking upgrade and downgrade scenarios
+#### Implications on manageability
+#### Implications on presentation layer
+#### Implications on persistence layer
+#### Implications on 'GlusterFS' backend
+#### Modification to GlusterFS metadata
+#### Implications on 'glusterd'
+GlusterD changes are integral to this feature, and described above.
+How To Test
+For the most part, testing is of the "do no harm" sort; the most thorough test
+of this feature is to run our current regression suite.
+Some specific test cases include readdirp on all kinds of volumes:
+Also, readdirp while:
+- rebalance in progress
+- tiering migration in progress
+- self heal in progress
+And all the test cases being run while the memory consumption of the process
+Faster directory enumeration
+TBD (very little)
+Development in progress
+Comments and Discussion