what is stat-prefetch? ====================== It is a translator which caches the dentries read in readdir. This dentry list is stored in the context of fd. Later when lookup happens on [parent-inode, basename (path)] combination, this list is searched for the basename. The dentry thus searched is used to fill up the stat corresponding to path being looked upon, thereby short-cutting lookup calls. This cache is preserved till closedir is called on the fd. The purpose of this translator is to optimize operations like 'ls -l', where a readdir is followed by lookup (stat) calls on each directory entry. 1. stat-prefetch harnesses the efficiency of short lookup calls (saves network roundtrip time for lookup calls from being accounted to the stat call). 2. To maintain the correctness, it does lookup-behind - lookup is winded to underlying translators after it is unwound to upper translators. lookup-behind is necessary as inode gets populated in server inode table only in lookup-cbk and also because various translators store their contexts in inode contexts during lookup calls. fops to be implemented: ======================= * lookup 1. check the dentry cache stored in context of fds opened by the same process on parent inode for basename. If found unwind with cached stat, else wind the lookup call to underlying translators. 2. stat is stored in the context of inode if the path being looked upon happens to be directory. This stat will be used to fill postparent stat when lookup happens on any of the directory contents. * readdir 1. cache the direntries returned in readdir_cbk in the context of fd. 2. if the readdir is happening on non-expected offsets (means a seekdir/rewinddir has happened), cache has to be flushed. 3. delete the entry corresponding to basename of path on which fd is opened from cache stored in parent. * chmod/fchmod delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since these calls change st_mode and st_ctime of stat. * chown/fchown delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since these calls change st_uid/st_gid and st_ctime of stat. * truncate/ftruncate delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since these calls change st_size/st_mtime of stat. * utimens delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since this call changes st_atime/st_mtime of stat. * readlink delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since this call changes st_atime of stat. * unlink 1. delete the entry corresponding to basename from cache stored in context of fds, opened on parent directory containing the file being unlinked. 2. delete the entry corresponding to basename of parent directory from cache of grand-parent. * rmdir 1. delete the entry corresponding to basename from cache stored in context of fds opened on parent inode. 2. remove the entire cache from all fds opened on inode corresponding to directory being removed. 3. delete the entry correspondig to basename of parent from cache stored in grand-parent. * readv delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since readv changes st_atime of file. * writev delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since writev can possibly change st_size and definitely changes st_mtime of file. * fsync there is a confusion here as to whether fsync updates mtime/ctimes. Disk based filesystems (atleast ext2) just writes the times stored in inode to disk during fsync and not the time at which fsync is being done. But in glusterfs, a translator like write-behind actually sends writes during fsync which will change mtime/ctime. Hence stat-prefetch implements fsync to delete the entry corresponding to basename from cache stored in context of fds opened on parent inode. * rename 1. remove entry corresponding to oldname from cache stored in fd contexts of oldparent. 2. remove entry corresponding to newname from cache stored in fd contexts of newparent. 3. remove entry corresponding to oldparent from cache stored in old-grand-parent, since removing oldname changes st_mtime and st_ctime of oldparent stat. 4. remove entry corresponding to newparent from cache stored in new-grand-parent, since adding newname changes st_mtime and st_ctime of newparent stat. 5. if oldname happens to be a directory, remove entire cache from all fds opened on it. * create/mknod/mkdir/symlink/link delete entry corresponding to basename of parent directory in which these operations are happening, from cache stored in context of fds opened on grand-parent, since adding a new entry to a directory changes st_mtime and st_ctime of parent directory. * setxattr/removexattr delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since setxattr changes st_ctime of file. * setdents 1. remove entry corresponding to basename of path on which fd is opened from cache stored in context of fds opened on parent. 2. for each of the entry in the direntry list, delete from cache stored in context of fd, the entry corresponding to basename of path being passed. * getdents 1. remove entry corresponding to basename of path on which fd is opened from cache stored in parent, since getdents changes st_atime. 2. remove entries corresponding to symbolic links from cache, since readlink would've changed st_atime. * checksum delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since st_atime is changed during this call. * xattrop/fxattrop delete the entry corresponding to basename from cache stored in context of fds opened on parent inode, since these calls modify st_ctime of file. callbacks to be implemented: ============================ * releasedir free the context stored in fd. * forget dree the stat if the inode corresponds to a directory. limitations: ============ * since a readdir does not return extended attributes of file, if need_xattr is set, short-cutting of lookup does not happen and lookup is passed to underlying translators. * posix_readdir does not check whether the dentries are spanning across multiple mount points. Hence it is not transforming inode numbers in stat buffers if posix is configured to allow export directory spanning on multiple mountpoints. This is a bug which needs to be fixed. posix_readdir should treat dentries the same way as if lookup is happening on dentries.