Diffstat (limited to 'doc/hacker-guide')
| -rw-r--r-- | doc/hacker-guide/Makefile.am | 8 |
| -rw-r--r-- | doc/hacker-guide/adding-fops.txt | 33 |
| -rw-r--r-- | doc/hacker-guide/bdb.txt | 70 |
| -rw-r--r-- | doc/hacker-guide/call-stub.txt | 1033 |
| -rw-r--r-- | doc/hacker-guide/hacker-guide.tex | 312 |
| -rw-r--r-- | doc/hacker-guide/posix.txt | 59 |
| -rw-r--r-- | doc/hacker-guide/replicate.txt | 206 |
| -rw-r--r-- | doc/hacker-guide/write-behind.txt | 45 |
8 files changed, 1766 insertions, 0 deletions
diff --git a/doc/hacker-guide/Makefile.am b/doc/hacker-guide/Makefile.am
new file mode 100644
index 000000000..65c92ac23
--- /dev/null
+++ b/doc/hacker-guide/Makefile.am
@@ -0,0 +1,8 @@
+EXTRA_DIST = replicate.txt bdb.txt posix.txt call-stub.txt write-behind.txt
+
+#EXTRA_DIST = hacker-guide.tex afr.txt bdb.txt posix.txt call-stub.txt write-behind.txt
+#hacker_guidedir = $(docdir)
+#hacker_guide_DATA = hacker-guide.pdf
+
+#hacker-guide.pdf: $(EXTRA_DIST)
+#	pdflatex $(srcdir)/hacker-guide.tex
diff --git a/doc/hacker-guide/adding-fops.txt b/doc/hacker-guide/adding-fops.txt
new file mode 100644
index 000000000..293de2637
--- /dev/null
+++ b/doc/hacker-guide/adding-fops.txt
@@ -0,0 +1,33 @@
+		  HOW TO ADD A NEW FOP TO GlusterFS
+		  =================================
+
+Steps to be followed when adding a new FOP to GlusterFS:
+
+1. Edit glusterfs.h and add a GF_FOP_* constant.
+
+2. Edit xlator.[ch] and:
+   2a. add the new prototype for fop and callback.
+   2b. edit xlator_fops structure.
+
+3. Edit xlator.c and add to fill_defaults.
+
+4. Edit protocol.h and add struct necessary for the new FOP.
+
+5. Edit defaults.[ch] and provide default implementation.
+
+6. Edit call-stub.[ch] and provide stub implementation.
+
+7. Edit common-utils.c and add to gf_global_variable_init().
+
+8. Edit client-protocol and add your FOP.
+
+9. Edit server-protocol and add your FOP.
+
+10. Implement your FOP in any translator for which the default implementation
+    is not sufficient.
+
+==========================================
+Last updated: Mon Oct 27 21:35:49 IST 2008
+
+Author: Vikas Gorur <vikas@zresearch.com>
+==========================================
diff --git a/doc/hacker-guide/bdb.txt b/doc/hacker-guide/bdb.txt
new file mode 100644
index 000000000..fd0bd3652
--- /dev/null
+++ b/doc/hacker-guide/bdb.txt
@@ -0,0 +1,70 @@
+
+* How does file translates to key/value pair?
+---------------------------------------------
+
+  in bdb a file is identified by key (obtained by taking basename() of the path of
+the file) and file contents are stored as value corresponding to the key in database
+file (defaults to glusterfs_storage.db under dirname() directory).
+
+* symlinks, directories
+-----------------------
+
+  symlinks and directories are stored as is.
+
+* db (database) files
+---------------------
+
+  every directory, including root directory, contains a database file called
+glusterfs_storage.db. all the regular files contained in the directory are stored
+as key/value pair inside the glusterfs_storage.db.
+
+* internal data cache
+---------------------
+
+  db does not provide a way to find out the size of the value corresponding to a key.
+so, bdb makes DB->get() call for key and takes the length of the value returned.
+since DB->get() also returns file contents for key, bdb maintains an internal cache and
+stores the file contents in the cache.
+  every directory maintains a seperate cache.
+
+* inode number transformation
+-----------------------------
+
+  bdb allocates a inode number to each file and directory on its own. bdb maintains a
+global counter and increments it after allocating inode number for each file
+(regular, symlink or directory). NOTE: bdb does not guarantee persistent inode numbers.
+
+* checkpoint thread
+-------------------
+
+  bdb creates a checkpoint thread at the time of init(). checkpoint thread does a
+periodic checkpoint on the DB_ENV. checkpoint is the mechanism, provided by db, to
+forcefully commit the logged transactions to the storage.
+
+NOTES ABOUT FOPS:
+-----------------
+
+lookup() -
+ 1> do lstat() on the path, if lstat fails, we assume that the file being looked up
+    is either a regular file or doesn't exist.
+ 2> lookup in the DB of parent directory for key corresponding to path. if key exists,
+    return key, with.
+    NOTE: 'struct stat' stat()ed from DB file is used as a container for 'struct stat'
+           of the regular file. st_ino, st_size, st_blocks are updated with file's values.
+
+readv() -
+ 1> do a lookup in bctx cache. if successful, return the requested data from cache.
+ 2> if cache missed, do a DB->get() the entire file content and insert to cache.
+
+writev():
+ 1> flush any cached content of this file.
+ 2> do a DB->put(), with DB_DBT_PARTIAL flag.
+    NOTE: DB_DBT_PARTIAL is used to do partial update of a value in DB.
+
+readdir():
+ 1> regular readdir() in a loop, and vomit all DB_ENV log files and DB files that
+    we encounter.
+ 2> if the readdir() buffer still has space, open a DB cursor and do a sequential
+    DBC->get() to fill the reaadir buffer.
+
+
diff --git a/doc/hacker-guide/call-stub.txt b/doc/hacker-guide/call-stub.txt
new file mode 100644
index 000000000..bca1579b2
--- /dev/null
+++ b/doc/hacker-guide/call-stub.txt
@@ -0,0 +1,1033 @@
+creating a call stub and pausing a call
+---------------------------------------
+libglusterfs provides seperate API to pause each of the fop. parameters to each API is
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+       NOTE: @fn should exactly take the same type and number of parameters that
+             the corresponding regular fop takes.
+rest will be the regular parameters to corresponding fop.
+
+NOTE: @frame can never be NULL. fop_<operation>_stub() fails with errno
+      set to EINVAL, if @frame is NULL. also wherever @loc is applicable,
+      @loc cannot be NULL.
+
+refer to individual stub creation API to know about call-stub creation's behaviour with
+specific parameters.
+
+here is the list of stub creation APIs for xlator fops.
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@loc       - pointer to location structure.
+             NOTE: @loc will be copied to a different location, with inode_ref() to
+	           @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+		   copied to a different location.
+@need_xattr - flag to specify if xattr should be returned or not.
+call_stub_t *
+fop_lookup_stub (call_frame_t *frame,
+		 fop_lookup_t fn,
+		 loc_t *loc,
+		 int32_t need_xattr);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+call_stub_t *
+fop_stat_stub (call_frame_t *frame,
+	       fop_stat_t fn,
+	       loc_t *loc);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to lk fop.
+         NOTE: @fd is stored with a fd_ref().
+call_stub_t *
+fop_fstat_stub (call_frame_t *frame,
+		fop_fstat_t fn,
+		fd_t *fd);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and
+	       @loc->parent, if not NULL. also @loc->path will be copied to a different location.
+@mode  - mode parameter to chmod.
+call_stub_t *
+fop_chmod_stub (call_frame_t *frame,
+		fop_chmod_t fn,
+		loc_t *loc,
+		mode_t mode);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to lk fop.
+         NOTE: @fd is stored with a fd_ref().
+@mode  - mode parameter for fchmod fop.
+call_stub_t *
+fop_fchmod_stub (call_frame_t *frame,
+		 fop_fchmod_t fn,
+		 fd_t *fd,
+		 mode_t mode);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and
+	       @loc->parent, if not NULL. also @loc->path will be copied to a different location.
+@uid   - uid parameter to chown.
+@gid   - gid parameter to chown.
+call_stub_t *
+fop_chown_stub (call_frame_t *frame,
+		fop_chown_t fn,
+		loc_t *loc,
+		uid_t uid,
+		gid_t gid);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to lk fop.
+         NOTE: @fd is stored with a fd_ref().
+@uid   - uid parameter to fchown.
+@gid   - gid parameter to fchown.
+call_stub_t *
+fop_fchown_stub (call_frame_t *frame,
+		 fop_fchown_t fn,
+		 fd_t *fd,
+		 uid_t uid,
+		 gid_t gid);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location, if not NULL.
+@off   - offset parameter to truncate fop.
+call_stub_t *
+fop_truncate_stub (call_frame_t *frame,
+		   fop_truncate_t fn,
+		   loc_t *loc,
+		   off_t off);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to lk fop.
+         NOTE: @fd is stored with a fd_ref().
+@off   - offset parameter to ftruncate fop.
+call_stub_t *
+fop_ftruncate_stub (call_frame_t *frame,
+		    fop_ftruncate_t fn,
+		    fd_t *fd,
+		    off_t off);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@tv    - tv parameter to utimens fop.
+call_stub_t *
+fop_utimens_stub (call_frame_t *frame,
+		  fop_utimens_t fn,
+		  loc_t *loc,
+		  struct timespec tv[2]);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@mask  - mask parameter for access fop.
+call_stub_t *
+fop_access_stub (call_frame_t *frame,
+		 fop_access_t fn,
+		 loc_t *loc,
+		 int32_t mask);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@size  - size parameter to readlink fop.
+call_stub_t *
+fop_readlink_stub (call_frame_t *frame,
+		   fop_readlink_t fn,
+		   loc_t *loc,
+		   size_t size);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@mode  - mode parameter to mknod fop.
+@rdev  - rdev parameter to mknod fop.
+call_stub_t *
+fop_mknod_stub (call_frame_t *frame,
+		fop_mknod_t fn,
+		loc_t *loc,
+		mode_t mode,
+		dev_t rdev);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@mode  - mode parameter to mkdir fop.
+call_stub_t *
+fop_mkdir_stub (call_frame_t *frame,
+		fop_mkdir_t fn,
+		loc_t *loc,
+		mode_t mode);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+call_stub_t *
+fop_unlink_stub (call_frame_t *frame,
+		 fop_unlink_t fn,
+		 loc_t *loc);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+call_stub_t *
+fop_rmdir_stub (call_frame_t *frame,
+		fop_rmdir_t fn,
+		loc_t *loc);
+
+@frame    - call frame which has to be used to resume the call at call_resume().
+@fn       - procedure to call during call_resume().
+@linkname - linkname parameter to symlink fop.
+@loc      - pointer to location structure.
+            NOTE: @loc will be copied to a different location, with inode_ref() to
+	          @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+		  copied to a different location.
+call_stub_t *
+fop_symlink_stub (call_frame_t *frame,
+		  fop_symlink_t fn,
+		  const char *linkname,
+		  loc_t *loc);
+
+@frame    - call frame which has to be used to resume the call at call_resume().
+@fn       - procedure to call during call_resume().
+@oldloc   - pointer to location structure.
+            NOTE: @oldloc will be copied to a different location, with inode_ref() to
+	          @oldloc->inode and @oldloc->parent, if not NULL. also @oldloc->path will
+		  be copied to a different location, if not NULL.
+@newloc   - pointer to location structure.
+            NOTE: @newloc will be copied to a different location, with inode_ref() to
+	          @newloc->inode and @newloc->parent, if not NULL. also @newloc->path will
+		  be copied to a different location, if not NULL.
+call_stub_t *
+fop_rename_stub (call_frame_t *frame,
+		 fop_rename_t fn,
+		 loc_t *oldloc,
+		 loc_t *newloc);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc     - pointer to location structure.
+           NOTE: @loc will be copied to a different location, with inode_ref() to
+	         @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+		 copied to a different location.
+@newpath - newpath parameter to link fop.
+call_stub_t *
+fop_link_stub (call_frame_t *frame,
+	       fop_link_t fn,
+	       loc_t *oldloc,
+	       const char *newpath);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@flags - flags parameter to create fop.
+@mode  - mode parameter to create fop.
+@fd    - file descriptor parameter to create fop.
+         NOTE: @fd is stored with a fd_ref().
+call_stub_t *
+fop_create_stub (call_frame_t *frame,
+		 fop_create_t fn,
+		 loc_t *loc,
+		 int32_t flags,
+		 mode_t mode, fd_t *fd);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@flags - flags parameter to open fop.
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+call_stub_t *
+fop_open_stub (call_frame_t *frame,
+	       fop_open_t fn,
+	       loc_t *loc,
+	       int32_t flags,
+	       fd_t *fd);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to lk fop.
+         NOTE: @fd is stored with a fd_ref().
+@size  - size parameter to readv fop.
+@off   - offset parameter to readv fop.
+call_stub_t *
+fop_readv_stub (call_frame_t *frame,
+		fop_readv_t fn,
+		fd_t *fd,
+		size_t size,
+		off_t off);
+
+@frame  - call frame which has to be used to resume the call at call_resume().
+@fn     - procedure to call during call_resume().
+@fd     - file descriptor parameter to lk fop.
+          NOTE: @fd is stored with a fd_ref().
+@vector - vector parameter to writev fop.
+	  NOTE: @vector is iov_dup()ed while creating stub. and frame->root->req_refs
+                dictionary is dict_ref()ed.
+@count  - count parameter to writev fop.
+@off    - off parameter to writev fop.
+call_stub_t *
+fop_writev_stub (call_frame_t *frame,
+		 fop_writev_t fn,
+		 fd_t *fd,
+		 struct iovec *vector,
+		 int32_t count,
+		 off_t off);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to flush fop.
+         NOTE: @fd is stored with a fd_ref().
+call_stub_t *
+fop_flush_stub (call_frame_t *frame,
+		fop_flush_t fn,
+		fd_t *fd);
+
+
+@frame    - call frame which has to be used to resume the call at call_resume().
+@fn       - procedure to call during call_resume().
+@fd       - file descriptor parameter to lk fop.
+            NOTE: @fd is stored with a fd_ref().
+@datasync - datasync parameter to fsync fop.
+call_stub_t *
+fop_fsync_stub (call_frame_t *frame,
+		fop_fsync_t fn,
+		fd_t *fd,
+		int32_t datasync);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and
+	       @loc->parent, if not NULL. also @loc->path will be copied to a different location.
+@fd    - file descriptor parameter to opendir fop.
+         NOTE: @fd is stored with a fd_ref().
+call_stub_t *
+fop_opendir_stub (call_frame_t *frame,
+		  fop_opendir_t fn,
+		  loc_t *loc,
+		  fd_t *fd);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to getdents fop.
+         NOTE: @fd is stored with a fd_ref().
+@size  - size parameter to getdents fop.
+@off   - off parameter to getdents fop.
+@flags - flags parameter to getdents fop.
+call_stub_t *
+fop_getdents_stub (call_frame_t *frame,
+		   fop_getdents_t fn,
+		   fd_t *fd,
+		   size_t size,
+		   off_t off,
+		   int32_t flag);
+
+@frame   - call frame which has to be used to resume the call at call_resume().
+@fn      - procedure to call during call_resume().
+@fd      - file descriptor parameter to setdents fop.
+           NOTE: @fd is stored with a fd_ref().
+@flags   - flags parameter to setdents fop.
+@entries - entries parameter to setdents fop.
+call_stub_t *
+fop_setdents_stub (call_frame_t *frame,
+		   fop_setdents_t fn,
+		   fd_t *fd,
+		   int32_t flags,
+		   dir_entry_t *entries,
+		   int32_t count);
+
+@frame    - call frame which has to be used to resume the call at call_resume().
+@fn       - procedure to call during call_resume().
+@fd       - file descriptor parameter to setdents fop.
+            NOTE: @fd is stored with a fd_ref().
+@datasync - datasync parameter to fsyncdir fop.
+call_stub_t *
+fop_fsyncdir_stub (call_frame_t *frame,
+		   fop_fsyncdir_t fn,
+		   fd_t *fd,
+		   int32_t datasync);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+call_stub_t *
+fop_statfs_stub (call_frame_t *frame,
+		 fop_statfs_t fn,
+		 loc_t *loc);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@dict  - dict parameter to setxattr fop.
+         NOTE: stub creation procedure stores @dict pointer with dict_ref() to it.
+call_stub_t *
+fop_setxattr_stub (call_frame_t *frame,
+		   fop_setxattr_t fn,
+		   loc_t *loc,
+		   dict_t *dict,
+		   int32_t flags);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@name  - name parameter to getxattr fop.
+call_stub_t *
+fop_getxattr_stub (call_frame_t *frame,
+		   fop_getxattr_t fn,
+		   loc_t *loc,
+		   const char *name);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@name  - name parameter to removexattr fop.
+         NOTE: name string will be copied to a different location while creating stub.
+call_stub_t *
+fop_removexattr_stub (call_frame_t *frame,
+		      fop_removexattr_t fn,
+		      loc_t *loc,
+		      const char *name);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to lk fop.
+         NOTE: @fd is stored with a fd_ref().
+@cmd   - command parameter to lk fop.
+@lock  - lock parameter to lk fop.
+         NOTE: lock will be copied to a different location while creating stub.
+call_stub_t *
+fop_lk_stub (call_frame_t *frame,
+	     fop_lk_t fn,
+	     fd_t *fd,
+	     int32_t cmd,
+	     struct flock *lock);
+
+@frame    - call frame which has to be used to resume the call at call_resume().
+@fn       - procedure to call during call_resume().
+@fd       - fd parameter to gf_lk fop.
+	    NOTE: @fd is fd_ref()ed while creating stub, if not NULL.
+@cmd      - cmd parameter to gf_lk fop.
+@lock     - lock paramater to gf_lk fop.
+	    NOTE: @lock is copied to a different memory location while creating
+	          stub.
+call_stub_t *
+fop_gf_lk_stub (call_frame_t *frame,
+		fop_gf_lk_t fn,
+		fd_t *fd,
+		int32_t cmd,
+		struct flock *lock);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@fd    - file descriptor parameter to readdir fop.
+         NOTE: @fd is stored with a fd_ref().
+@size  - size parameter to readdir fop.
+@off   - offset parameter to readdir fop.
+call_stub_t *
+fop_readdir_stub (call_frame_t *frame,
+		  fop_readdir_t fn,
+		  fd_t *fd,
+		  size_t size,
+		  off_t off);
+
+@frame - call frame which has to be used to resume the call at call_resume().
+@fn    - procedure to call during call_resume().
+@loc   - pointer to location structure.
+         NOTE: @loc will be copied to a different location, with inode_ref() to
+	       @loc->inode and @loc->parent, if not NULL. also @loc->path will be
+	       copied to a different location.
+@flags - flags parameter to checksum fop.
+call_stub_t *
+fop_checksum_stub (call_frame_t *frame,
+		   fop_checksum_t fn,
+		   loc_t *loc,
+		   int32_t flags);
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@inode     - inode parameter to @fn.
+	     NOTE: @inode pointer is stored with a inode_ref().
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+@dict      - dict parameter to @fn.
+	     NOTE: @dict pointer is stored with dict_ref().
+call_stub_t *
+fop_lookup_cbk_stub (call_frame_t *frame,
+		     fop_lookup_cbk_t fn,
+		     int32_t op_ret,
+		     int32_t op_errno,
+		     inode_t *inode,
+		     struct stat *buf,
+		     dict_t *dict);
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_stat_cbk_stub (call_frame_t *frame,
+		   fop_stat_cbk_t fn,
+		   int32_t op_ret,
+		   int32_t op_errno,
+		   struct stat *buf);
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_fstat_cbk_stub (call_frame_t *frame,
+		    fop_fstat_cbk_t fn,
+		    int32_t op_ret,
+		    int32_t op_errno,
+		    struct stat *buf);
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_chmod_cbk_stub (call_frame_t *frame,
+		    fop_chmod_cbk_t fn,
+		    int32_t op_ret,
+		    int32_t op_errno,
+		    struct stat *buf);
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_fchmod_cbk_stub (call_frame_t *frame,
+		     fop_fchmod_cbk_t fn,
+		     int32_t op_ret,
+		     int32_t op_errno,
+		     struct stat *buf);
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_chown_cbk_stub (call_frame_t *frame,
+		    fop_chown_cbk_t fn,
+		    int32_t op_ret,
+		    int32_t op_errno,
+		    struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_fchown_cbk_stub (call_frame_t *frame,
+		     fop_fchown_cbk_t fn,
+		     int32_t op_ret,
+		     int32_t op_errno,
+		     struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_truncate_cbk_stub (call_frame_t *frame,
+		       fop_truncate_cbk_t fn,
+		       int32_t op_ret,
+		       int32_t op_errno,
+		       struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_ftruncate_cbk_stub (call_frame_t *frame,
+			fop_ftruncate_cbk_t fn,
+			int32_t op_ret,
+			int32_t op_errno,
+			struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_utimens_cbk_stub (call_frame_t *frame,
+		      fop_utimens_cbk_t fn,
+		      int32_t op_ret,
+		      int32_t op_errno,
+		      struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+call_stub_t *
+fop_access_cbk_stub (call_frame_t *frame,
+		     fop_access_cbk_t fn,
+		     int32_t op_ret,
+		     int32_t op_errno);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@path      - path parameter to @fn.
+	     NOTE: @path is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_readlink_cbk_stub (call_frame_t *frame,
+		       fop_readlink_cbk_t fn,
+		       int32_t op_ret,
+		       int32_t op_errno,
+		       const char *path);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@inode     - inode parameter to @fn.
+	     NOTE: @inode pointer is stored with a inode_ref().
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_mknod_cbk_stub (call_frame_t *frame,
+		    fop_mknod_cbk_t fn,
+		    int32_t op_ret,
+		    int32_t op_errno,
+		    inode_t *inode,
+		    struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@inode     - inode parameter to @fn.
+	     NOTE: @inode pointer is stored with a inode_ref().
+@buf       - buf parameter to @fn.
+	     NOTE: @buf is copied to a different memory location, if not NULL.
+call_stub_t *
+fop_mkdir_cbk_stub (call_frame_t *frame,
+		    fop_mkdir_cbk_t fn,
+		    int32_t op_ret,
+		    int32_t op_errno,
+		    inode_t *inode,
+		    struct stat *buf);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+call_stub_t *
+fop_unlink_cbk_stub (call_frame_t *frame,
+		     fop_unlink_cbk_t fn,
+		     int32_t op_ret,
+		     int32_t op_errno);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+call_stub_t *
+fop_rmdir_cbk_stub (call_frame_t *frame,
+		    fop_rmdir_cbk_t fn,
+		    int32_t op_ret,
+		    int32_t op_errno);
+
+
+@frame     - call frame which has to be used to resume the call at call_resume().
+@fn        - procedure to call during call_resume().
+@op_ret    - op_ret parameter to @fn.
+@op_errno  - op_errno parameter to @fn.
+@inode     - inode parameter to @fn.
+	     NOTE: @inode pointer is stored with a inode_ref(). +@buf       - buf parameter to @fn. +	     NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_symlink_cbk_stub (call_frame_t *frame, +		      fop_symlink_cbk_t fn, +		      int32_t op_ret, +		      int32_t op_errno, +		      inode_t *inode, +		      struct stat *buf); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@buf       - buf parameter to @fn. +	     NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_rename_cbk_stub (call_frame_t *frame, +		     fop_rename_cbk_t fn, +		     int32_t op_ret, +		     int32_t op_errno, +		     struct stat *buf); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@inode     - inode parameter to @fn. +	     NOTE: @inode pointer is stored with a inode_ref(). +@buf       - buf parameter to @fn. +	     NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_link_cbk_stub (call_frame_t *frame, +		   fop_link_cbk_t fn, +		   int32_t op_ret, +		   int32_t op_errno, +		   inode_t *inode, +		   struct stat *buf); + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@fd        - fd parameter to @fn. +	     NOTE: @fd pointer is stored with a fd_ref(). +@inode     - inode parameter to @fn. +	     NOTE: @inode pointer is stored with a inode_ref(). +@buf       - buf parameter to @fn. +	     NOTE: @buf is copied to a different memory location, if not NULL. 
+call_stub_t * +fop_create_cbk_stub (call_frame_t *frame, +		     fop_create_cbk_t fn, +		     int32_t op_ret, +		     int32_t op_errno, +		     fd_t *fd, +		     inode_t *inode, +		     struct stat *buf); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@fd        - fd parameter to @fn. +	     NOTE: @fd pointer is stored with a fd_ref(). +call_stub_t * +fop_open_cbk_stub (call_frame_t *frame, +		   fop_open_cbk_t fn, +		   int32_t op_ret, +		   int32_t op_errno, +		   fd_t *fd); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@vector    - vector parameter to @fn.	 +	     NOTE: @vector is copied to a different memory location, if not NULL. also +	           frame->root->rsp_refs is dict_ref()ed. +@stbuf     - stbuf parameter to @fn. +	     NOTE: @stbuf is copied to a different memory location, if not NULL. +call_stub_t * +fop_readv_cbk_stub (call_frame_t *frame, +		    fop_readv_cbk_t fn, +		    int32_t op_ret, +		    int32_t op_errno, +		    struct iovec *vector, +		    int32_t count, +		    struct stat *stbuf); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@stbuf     - stbuf parameter to @fn. +	     NOTE: @stbuf is copied to a different memory location, if not NULL. +call_stub_t * +fop_writev_cbk_stub (call_frame_t *frame, +		     fop_writev_cbk_t fn, +		     int32_t op_ret, +		     int32_t op_errno, +		     struct stat *stbuf); + + +@frame     - call frame which has to be used to resume the call at call_resume(). 
+@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +call_stub_t * +fop_flush_cbk_stub (call_frame_t *frame, +		    fop_flush_cbk_t fn, +		    int32_t op_ret, +		    int32_t op_errno); + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +call_stub_t * +fop_fsync_cbk_stub (call_frame_t *frame, +		    fop_fsync_cbk_t fn, +		    int32_t op_ret, +		    int32_t op_errno); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@fd        - fd parameter to @fn. +	     NOTE: @fd pointer is stored with a fd_ref(). +call_stub_t * +fop_opendir_cbk_stub (call_frame_t *frame, +		      fop_opendir_cbk_t fn, +		      int32_t op_ret, +		      int32_t op_errno, +		      fd_t *fd); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@entries   - entries parameter to @fn. +@count     - count parameter to @fn. +call_stub_t * +fop_getdents_cbk_stub (call_frame_t *frame, +		      fop_getdents_cbk_t fn, +		      int32_t op_ret, +		      int32_t op_errno, +		      dir_entry_t *entries, +		      int32_t count); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. 
+call_stub_t * +fop_setdents_cbk_stub (call_frame_t *frame, +		       fop_setdents_cbk_t fn, +		       int32_t op_ret, +		       int32_t op_errno); + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +call_stub_t * +fop_fsyncdir_cbk_stub (call_frame_t *frame, +		       fop_fsyncdir_cbk_t fn, +		       int32_t op_ret, +		       int32_t op_errno); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@buf       - buf parameter to @fn. +	     NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_statfs_cbk_stub (call_frame_t *frame, +		     fop_statfs_cbk_t fn, +		     int32_t op_ret, +		     int32_t op_errno, +		     struct statvfs *buf); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +call_stub_t * +fop_setxattr_cbk_stub (call_frame_t *frame, +		       fop_setxattr_cbk_t fn, +		       int32_t op_ret, +		       int32_t op_errno); + + +@frame     - call frame which has to be used to resume the call at call_resume(). +@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +@value     - value dictionary parameter to @fn. +	     NOTE: @value pointer is stored with a dict_ref(). +call_stub_t * +fop_getxattr_cbk_stub (call_frame_t *frame, +		       fop_getxattr_cbk_t fn, +		       int32_t op_ret, +		       int32_t op_errno, +		       dict_t *value); + + +@frame     - call frame which has to be used to resume the call at call_resume(). 
+@fn        - procedure to call during call_resume().  +@op_ret    - op_ret parameter to @fn. +@op_errno  - op_errno parameter to @fn. +call_stub_t * +fop_removexattr_cbk_stub (call_frame_t *frame, +			  fop_removexattr_cbk_t fn, +			  int32_t op_ret, +			  int32_t op_errno); + + +@frame    - call frame which has to be used to resume the call at call_resume(). +@fn       - procedure to call during call_resume().  +@op_ret   - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@lock     - lock parameter to @fn. +	    NOTE: @lock is copied to a different memory location while creating +	          stub.  +call_stub_t * +fop_lk_cbk_stub (call_frame_t *frame, +		 fop_lk_cbk_t fn, +		 int32_t op_ret, +		 int32_t op_errno, +		 struct flock *lock); + +@frame    - call frame which has to be used to resume the call at call_resume(). +@fn       - procedure to call during call_resume().  +@op_ret   - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@lock     - lock parameter to @fn. +	    NOTE: @lock is copied to a different memory location while creating +	          stub.  +call_stub_t * +fop_gf_lk_cbk_stub (call_frame_t *frame, +		    fop_gf_lk_cbk_t fn, +		    int32_t op_ret, +		    int32_t op_errno, +		    struct flock *lock); + + +@frame    - call frame which has to be used to resume the call at call_resume(). +@fn       - procedure to call during call_resume().  +@op_ret   - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@entries  - entries parameter to @fn. +call_stub_t * +fop_readdir_cbk_stub (call_frame_t *frame, +		      fop_readdir_cbk_t fn, +		      int32_t op_ret, +		      int32_t op_errno, +		      gf_dirent_t *entries); + + +@frame         - call frame which has to be used to resume the call at call_resume(). +@fn            - procedure to call during call_resume().  +@op_ret        - op_ret parameter to @fn. +@op_errno      - op_errno parameter to @fn. +@file_checksum - file_checksum parameter to @fn. 
+                 NOTE: file_checksum will be copied to a different memory location
+		       while creating stub.
+@dir_checksum  - dir_checksum parameter to @fn.
+                 NOTE: dir_checksum will be copied to a different memory location
+		       while creating stub.
+call_stub_t *
+fop_checksum_cbk_stub (call_frame_t *frame,
+		       fop_checksum_cbk_t fn,
+		       int32_t op_ret,
+		       int32_t op_errno,
+		       uint8_t *file_checksum,
+		       uint8_t *dir_checksum);
+
+resuming a call:
+---------------
+  a paused call is resumed from its call stub through the call_resume() API.
+
+  void call_resume (call_stub_t *stub);
+
+  stub - call stub created while pausing the call.
+
+  NOTE: call_resume() will decrease the reference count of any fd_t, dict_t and inode_t that it finds
+        in stub->args.<operation>.<fd_t-or-inode_t-or-dict_t>. so, if any fd_t, dict_t or
+	inode_t pointers are assigned to stub->args.<operation>.<fd_t-or-inode_t-or-dict_t> after the
+	fop_<operation>_stub() call, they must be <fd_t-or-inode_t-or-dict_t>_ref()ed.
+
+	call_resume() does not STACK_DESTROY() for any fop.
+
+  if stub->fn is NULL, call_resume() does STACK_WIND() or STACK_UNWIND() using stub->frame.
+
+  return - call_resume() returns nothing. it fails only if stub is NULL, in which case errno is set to EINVAL.
diff --git a/doc/hacker-guide/hacker-guide.tex b/doc/hacker-guide/hacker-guide.tex
new file mode 100644
index 000000000..72c44df1a
--- /dev/null
+++ b/doc/hacker-guide/hacker-guide.tex
@@ -0,0 +1,312 @@
+\documentclass[12pt]{book}
+\usepackage{graphicx}
+% \usepackage{fancyhdr}
+
+% \pagestyle{fancy}
+\begin{document}
+
+% \headheight 117pt
+% \rhead{\includegraphics{zr-logo.eps}}
+
+\author{Z Research}
+\title{GlusterFS 1.3 Hacker's Guide}
+\date{June 1, 2007}
+
+\maketitle
+\frontmatter
+\tableofcontents
+
+\mainmatter
+\chapter{Introduction}
+
+\section{Coding guidelines}
+GlusterFS uses GNU Arch for version control.
To get the latest source do: +\begin{verbatim} +  $ tla register-archive http://arch.sv.gnu.org/archives/gluster +  $ tla -A gluster@sv.gnu.org get glusterfs--mainline--2.4 +\end{verbatim} +\noindent +GlusterFS follows the GNU coding +standards\footnote{http://www.gnu.org/prep/standards\_toc.html} for the +most part. + +\chapter{Major components} +\section{libglusterfs} +\texttt{libglusterfs} contains supporting code used by all the other components.  +The important files here are: + +\texttt{dict.c}: This is an implementation of a serializable dictionary type. It is +used by the protocol code to send requests and replies. It is also used to pass options +to translators. + +\texttt{logging.c}: This is a thread-safe logging library. The log messages go to a +file (default \texttt{/usr/local/var/log/glusterfs/*}). + +\texttt{protocol.c}: This file implements the GlusterFS on-the-wire +protocol. The protocol itself is a simple ASCII protocol, designed to +be easy to parse and be human readable. + +A sample GlusterFS protocol block looks like this: +\begin{verbatim} +  Block Start                            header +  0000000000000023                       callid +  00000001                               type +  00000016                               op +  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx       human-readable name +  00000000000000000000000000000ac3       block size +  <...>                                  block +  Block End +\end{verbatim} + +\texttt{stack.h}: This file defines the \texttt{STACK\_WIND} and +\texttt{STACK\_UNWIND} macros which are used to implement the parallel +stack that is maintained for inter-xlator calls. See the \textsl{Taking control +of the stack} section below for more details. + +\texttt{spec.y}: This contains the Yacc grammar for the GlusterFS +specification file, and the parsing code. 
+ + +Draw diagrams of trees +Two rules: +(1) directory structure is same +(2) file can exist only on one node + +\section{glusterfs-fuse} +\section{glusterfsd} +\section{transport} +\section{scheduler} +\section{xlator} + +\chapter{xlators} +\section{Taking control of the stack} +One can think of STACK\_WIND/UNWIND as a very specific RPC mechanism. + +% \includegraphics{stack.eps} + +\section{Overview of xlators} + +\flushleft{\LARGE\texttt{cluster/}} +\vskip 2ex +\flushleft{\Large\texttt{afr}} +\vskip 2ex +\flushleft{\Large\texttt{stripe}} +\vskip 2ex +\flushleft{\Large\texttt{unify}} + +\vskip 4ex +\flushleft{\LARGE\texttt{debug/}} +\vskip 2ex +\flushleft{\Large\texttt{trace}} +\vskip 2ex +The trace xlator simply logs all fops and mops, and passes them through to its child. + +\vskip 4ex +\flushleft{\LARGE\texttt{features/}} +\flushleft{\Large\texttt{posix-locks}} +\vskip 2ex +This xlator implements \textsc{posix} record locking semantics over +any kind of storage. + +\vskip 4ex +\flushleft{\LARGE\texttt{performance/}} + +\flushleft{\Large\texttt{io-threads}} +\vskip 2ex +\flushleft{\Large\texttt{read-ahead}} +\vskip 2ex +\flushleft{\Large\texttt{stat-prefetch}} +\vskip 2ex +\flushleft{\Large\texttt{write-behind}} +\vskip 2ex + +\vskip 4ex +\flushleft{\LARGE\texttt{protocol/}} +\vskip 2ex + +\flushleft{\Large\texttt{client}} +\vskip 2ex + +\flushleft{\Large\texttt{server}} +\vskip 2ex + +\vskip 4ex +\flushleft{\LARGE\texttt{storage/}} +\flushleft{\Large\texttt{posix}} +\vskip 2ex +The \texttt{posix} xlator is the one which actually makes calls to the +on-disk filesystem. Currently this is the only storage xlator available. However, +plans to develop other storage xlators, such as one for Amazon's S3 service, are +on the roadmap. + +\chapter{Writing a simple xlator} +\noindent +In this section we're going to write a rot13 xlator. 
``Rot13'' is a
+simple substitution cipher which obscures a text by replacing each
+letter with the letter thirteen places down the alphabet. So `a' (0)
+would become `n' (13), `b' (1) would become `o' (14), and so on.  Rot13
+applied to a piece of ciphertext yields the plaintext again, because rot13
+is its own inverse, since:
+
+\[
+x_c \equiv x + 13 \pmod{26}
+\]
+\[
+x_c + 13 \equiv x + 13 + 13 \equiv x \pmod{26}
+\]
+
+First we include the requisite headers.
+
+\begin{verbatim}
+#include <ctype.h>
+#include <sys/uio.h>
+
+#include "glusterfs.h"
+#include "xlator.h"
+#include "logging.h"
+
+/*
+ * This is a rot13 ``encryption'' xlator. It rot13's data when
+ * writing to disk and rot13's it back when reading it.
+ * This xlator is meant as an example, not for production
+ *  use ;) (hence no error-checking)
+ */
+
+\end{verbatim}
+
+Then we write the rot13 function itself. For simplicity, we only transform lower case
+letters. Any other byte is passed through as it is.
+
+\begin{verbatim}
+/* We only handle lower case letters for simplicity */
+static void
+rot13 (char *buf, int len)
+{
+  int i;
+  for (i = 0; i < len; i++) {
+    /* rotate 'a'..'z' by 13 places; leave every other byte alone */
+    if (buf[i] >= 'a' && buf[i] <= 'z')
+      buf[i] = 'a' + (buf[i] - 'a' + 13) % 26;
+  }
+}
+\end{verbatim}
+
+Next comes a utility function that applies rot13 to every buffer in an
+\texttt{iovec} array.
+ +\begin{verbatim} +static void +rot13_iovec (struct iovec *vector, int count) +{ +  int i; +  for (i = 0; i < count; i++) { +    rot13 (vector[i].iov_base, vector[i].iov_len); +  } +} +\end{verbatim} + +\begin{verbatim} +static int32_t +rot13_readv_cbk (call_frame_t *frame, +                 call_frame_t *prev_frame, +                 xlator_t *this, +                 int32_t op_ret, +                 int32_t op_errno, +                 struct iovec *vector, +                 int32_t count) +{ +  rot13_iovec (vector, count); + +  STACK_UNWIND (frame, op_ret, op_errno, vector, count); +  return 0; +} + +static int32_t +rot13_readv (call_frame_t *frame, +             xlator_t *this, +             dict_t *ctx, +             size_t size, +             off_t offset) +{ +  STACK_WIND (frame, +              rot13_readv_cbk, +              FIRST_CHILD (this), +              FIRST_CHILD (this)->fops->readv, +              ctx, size, offset); +  return 0; +} + +static int32_t +rot13_writev_cbk (call_frame_t *frame, +                  call_frame_t *prev_frame, +                  xlator_t *this, +                  int32_t op_ret, +                  int32_t op_errno) +{ +  STACK_UNWIND (frame, op_ret, op_errno); +  return 0; +} + +static int32_t +rot13_writev (call_frame_t *frame, +              xlator_t *this, +              dict_t *ctx, +              struct iovec *vector, +              int32_t count,  +              off_t offset) +{ +  rot13_iovec (vector, count); + +  STACK_WIND (frame,  +              rot13_writev_cbk, +              FIRST_CHILD (this), +              FIRST_CHILD (this)->fops->writev, +              ctx, vector, count, offset); +  return 0; +} + +\end{verbatim} + +Every xlator must define two functions and two external symbols. The functions are  +\texttt{init} and \texttt{fini}, and the symbols are \texttt{fops} and \texttt{mops}. 
+The \texttt{init} function is called when the xlator is loaded by GlusterFS, and
+contains code for the xlator to initialize itself. Note that if an xlator is present
+multiple times in the spec tree, the \texttt{init} function will be called each time
+the xlator is loaded.
+
+\begin{verbatim}
+int32_t
+init (xlator_t *this)
+{
+  /* reject both "no child" and "more than one child" */
+  if (!this->children || this->children->next) {
+    gf_log ("rot13", GF_LOG_ERROR,
+            "FATAL: rot13 should have exactly one child");
+    return -1;
+  }
+
+  gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
+  return 0;
+}
+\end{verbatim}
+
+Finally, \texttt{fini} has nothing to release in this xlator, and \texttt{fops}
+lists the two operations we implement.
+
+\begin{verbatim}
+
+void
+fini (xlator_t *this)
+{
+  return;
+}
+
+struct xlator_fops fops = {
+  .readv        = rot13_readv,
+  .writev       = rot13_writev
+};
+
+struct xlator_mops mops = {
+};
+
+\end{verbatim}
+
+\end{document}
+
diff --git a/doc/hacker-guide/posix.txt b/doc/hacker-guide/posix.txt
new file mode 100644
index 000000000..d0132abfe
--- /dev/null
+++ b/doc/hacker-guide/posix.txt
@@ -0,0 +1,59 @@
+---------------
+* storage/posix
+---------------
+
+- SET_FS_ID
+
+  This is so that all filesystem checks are done with the user's
+  uid/gid and not GlusterFS's uid/gid.
+
+- MAKE_REAL_PATH
+
+  This macro concatenates the base directory of the posix volume
+  ('option directory') with the given path.
+
+- need_xattr in lookup
+
+  If this flag is passed, lookup returns an xattr dictionary that contains
+  the file's create time, the file's contents, and the version number
+  of the file.
+
+  This is a hack to increase small file performance. If an application
+  wants to read a small file, it can finish its job with just a lookup
+  call instead of a lookup followed by read.
+
+- getdents/setdents
+
+  These are used by unify to set and get directory entries.
+
+- ALIGN_BUF
+
+  Macro to align an address to a page boundary (4K).
+
+- priv->export_statfs
+
+  In some cases, two exported volumes may reside on the same
+  partition on the server.
Sending statvfs info for both +  the volumes will lead to erroneous df output at the client, +  since free space on the partition will be counted twice. + +  In such cases, user can disable exporting statvfs info +  on one of the volumes by setting this option. + +- xattrop + +  This fop is used by replicate to set version numbers on files.  + +- getxattr/setxattr hack to read/write files + +  A key, GLUSTERFS_FILE_CONTENT_STRING, is handled in a special way by +  getxattr/setxattr. A getxattr with the key will return the entire +  content of the file as the value. A setxattr with the key will write +  the value as the entire content of the file. + +- posix_checksum +   +  This calculates a simple XOR checksum on all entry names in a +  directory that is used by unify to compare directory contents. + + diff --git a/doc/hacker-guide/replicate.txt b/doc/hacker-guide/replicate.txt new file mode 100644 index 000000000..284f373fb --- /dev/null +++ b/doc/hacker-guide/replicate.txt @@ -0,0 +1,206 @@ +--------------- +* cluster/replicate +--------------- + +Before understanding replicate, one must understand two internal FOPs: + +GF_FILE_LK: +  This is exactly like fcntl(2) locking, except the locks are in a  +  separate domain from locks held by applications. + +GF_DIR_LK (loc_t *loc, char *basename): +  This allows one to lock a name under a directory. For example, +  to lock /mnt/glusterfs/foo, one would use the call: + +  GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo") + +  If one wishes to lock *all* the names under a particular directory, +  supply the basename argument as NULL. + +  The locks can either be read locks or write locks; consult the  +  function prototype for more details. + +Both these operations are implemented by the features/locks (earlier +known as posix-locks) translator. 
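Since GF_FILE_LK is described as behaving exactly like fcntl(2) locking,
the region semantics can be illustrated with plain fcntl(2) on a local
file descriptor. This is only a sketch of the fcntl(2) side: the helper
names region_lock, write_lock_region and unlock_region are made up for
this example and are not part of GlusterFS, and nothing here models the
separate lock domain that GF_FILE_LK uses.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Describe the byte range [start, start + len) for fcntl(2).
 * len == 0 means "from start to end of file", so (0, 0) covers
 * the entire file. */
static struct flock
region_lock (short type, off_t start, off_t len)
{
        struct flock fl = {0};

        fl.l_type   = type;      /* F_RDLCK, F_WRLCK or F_UNLCK */
        fl.l_whence = SEEK_SET;  /* l_start is an absolute offset */
        fl.l_start  = start;
        fl.l_len    = len;
        return fl;
}

/* Try a non-blocking write lock (F_SETLK) first; if it is refused,
 * wait for the lock with F_SETLKW. Returns 0 on success, or -1 with
 * errno set by fcntl(2). */
static int
write_lock_region (int fd, off_t start, off_t len)
{
        struct flock fl = region_lock (F_WRLCK, start, len);

        assert (fd >= 0);
        if (fcntl (fd, F_SETLK, &fl) == 0)
                return 0;
        return fcntl (fd, F_SETLKW, &fl);
}

/* Release a previously locked region. */
static int
unlock_region (int fd, off_t start, off_t len)
{
        struct flock fl = region_lock (F_UNLCK, start, len);

        assert (fd >= 0);
        return fcntl (fd, F_SETLK, &fl);
}
```

A caller protecting a 512-byte write at offset 4096 would use
write_lock_region (fd, 4096, 512), and write_lock_region (fd, 0, 0) to
cover the whole file.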
+ +-------------- +* Basic design +-------------- + +All FOPs can be classified into four major groups: + + - inode-read +   Operations that read an inode's data (file contents) or metadata (perms, etc.). + +   access, getxattr, fstat, readlink, readv, stat. + + - inode-write +   Operations that modify an inode's data or metadata. + +   chmod, chown, truncate, writev, utimens. + + - dir-read +   Operations that read a directory's contents or metadata. + +   readdir, getdents, checksum. + + - dir-write +   Operations that modify a directory's contents or metadata. + +   create, link, mkdir, mknod, rename, rmdir, symlink, unlink. + +   Some of these make a subgroup in that they modify *two* different entries: +        link, rename, symlink. + + - Others +   Other operations. + +   flush, lookup, open, opendir, statfs. + +------------ +* Algorithms +------------ + +Each of the four major groups has its own algorithm: + + ---------------------- + - inode-read, dir-read + ---------------------- + + = Send a request to the first child that is up: +   - if it fails: +       try the next available child +   - if we have exhausted all children: +       return failure + + ------------- + - inode-write + ------------- + + All operations are done in parallel unless specified otherwise. + + (1) Send a GF_FILE_LK request on all children for a write lock on  +     the appropriate region +            (for metadata operations: entire file (0, 0) +             for writev: (offset, offset+size of buffer)) + +     - If a lock request fails on a child: +         unlock all children +         try to acquire a blocking lock (F_SETLKW) on each child, serially. +	  +	 If this fails (due to ENOTCONN or EINVAL): +           Consider this child as dead for rest of transaction. + + (2) Mark all children as "pending" on all (alive) children  +     (see below for meaning of "pending"). + +     - If it fails on any child: +         mark it as dead (in transaction local state). 
+
+ (3) Perform operation on all (alive) children.
+
+     - If it fails on any child:
+         mark it as dead (in transaction local state).
+
+ (4) Clear the "pending" mark for all successful children on all nodes.
+
+ (5) Unlock region on all (alive) children.
+
+ -----------
+ - dir-write
+ -----------
+
+ The algorithm for dir-write is the same as above, except that instead of
+ holding GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated
+ upon. In case of link-type calls, we hold locks on both the operand names.
+
+-----------
+* "pending"
+-----------
+
+ The "pending" number is like a journal entry. A pending entry is an
+ array of 32-bit integers stored in network byte-order as an extended
+ attribute of an inode (which can be a directory as well).
+
+ There are three keys corresponding to three types of pending operations:
+
+ - AFR_METADATA_PENDING
+     There are some metadata operations pending on this inode (perms, ctime/mtime,
+     xattr, etc.).
+
+ - AFR_DATA_PENDING
+     There is some data pending on this inode (writev).
+
+ - AFR_ENTRY_PENDING
+     There are some directory operations pending on this directory
+     (create, unlink, etc.).
+
+-----------
+* Self heal
+-----------
+
+ - On lookup, gather extended attribute data:
+   - If entry is a regular file:
+     - If an entry is present on one child and not on others:
+       - create entry on others.
+     - If entries exist but have different metadata (perms, etc.):
+       - consider the entry with the highest AFR_METADATA_PENDING number as
+         definitive and replicate its attributes on the other children.
+
+   - If entry is a directory:
+     - Consider the entry with the highest AFR_ENTRY_PENDING number as
+       definitive and replicate its contents on all children.
+
+   - If any two entries have non-matching types (i.e., one is a file and
+     the other is a directory):
+     - Announce to the user via log that a split-brain situation has been
+       detected, and do nothing.
+  + - On open, gather extended attribute data: +   - Consider the file with the highest AFR_DATA_PENDING number as +     the definitive one and replicate its contents on all other +     children. + + During all self heal operations, appropriate locks must be held on all + regions/entries being affected. + +--------------- +* Inode scaling +--------------- + +Inode scaling is necessary because if a situation arises where: +  - An inode number is returned for a directory (by lookup) which was +    previously the inode number of a file (as per FUSE's table), then +    FUSE gets horribly confused (consult a FUSE expert for more details). + +To avoid such a situation, we distribute the 64-bit inode space equally +among all children of replicate. + +To illustrate: + +If c1, c2, c3 are children of replicate, they each get 1/3 of the available +inode space: + +Child:        c1   c2   c3   c1   c2   c3   c1   c2   c3   c1   c2 ... +Inode number: 1    2    3    4    5    6    7    8    9    10   11 ... + +Thus, if lookup on c1 returns an inode number "2", it is scaled to "4" +(which is the second inode number in c1's space). + +This way we ensure that there is never a collision of inode numbers from +two different children. + +This reduction of inode space doesn't really reduce the usability of  +replicate since even if we assume replicate has 1024 children (which would be a +highly unusual scenario), each child still has a 54-bit inode space. + +2^54 ~ 1.8 * 10^16 + +which is much larger than any real world requirement. 
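The interleaving shown in the table above can be written down as a small
mapping. This is a sketch derived only from that table, not taken from the
replicate source: the function names scale_inode, owner_of_inode and
unscale_inode are hypothetical, children are numbered from 0 (c1 is 0, c2 is
1, c3 is 2), and local inode numbers are assumed to start at 1.

```c
#include <assert.h>
#include <stdint.h>

/* Map a child's local inode number (1, 2, 3, ...) into the shared
 * space, so that child k owns the numbers k+1, k+1+count, ... */
static uint64_t
scale_inode (uint64_t local_ino, uint32_t child_index, uint32_t child_count)
{
        assert (local_ino >= 1 && child_index < child_count);
        return (local_ino - 1) * child_count + child_index + 1;
}

/* Inverse mapping, part 1: which child owns a scaled inode number. */
static uint32_t
owner_of_inode (uint64_t scaled_ino, uint32_t child_count)
{
        assert (scaled_ino >= 1 && child_count >= 1);
        return (uint32_t) ((scaled_ino - 1) % child_count);
}

/* Inverse mapping, part 2: the inode number as seen on that child. */
static uint64_t
unscale_inode (uint64_t scaled_ino, uint32_t child_count)
{
        assert (scaled_ino >= 1 && child_count >= 1);
        return (scaled_ino - 1) / child_count + 1;
}
```

With three children, scale_inode (2, 0, 3) gives 4, matching the example in
the text where inode "2" on c1 is scaled to "4". Because each child owns a
distinct residue class, two different children can never produce the same
scaled number.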
+
+
+==============================================
+$ Last updated: Sun Oct 12 23:17:01 IST 2008 $
+$ Author: Vikas Gorur <vikas@zresearch.com>  $
+==============================================
+
diff --git a/doc/hacker-guide/write-behind.txt b/doc/hacker-guide/write-behind.txt
new file mode 100644
index 000000000..498e95480
--- /dev/null
+++ b/doc/hacker-guide/write-behind.txt
@@ -0,0 +1,45 @@
+basic working
+--------------
+
+  write-behind is basically a translator that lies to the application, reporting write requests as finished before they have actually finished.
+
+  on a regular translator tree without write-behind, control flow is like this:
+
+  1. application makes a write() system call.
+  2. VFS ==> FUSE ==> /dev/fuse.
+  3. fuse-bridge initiates a glusterfs writev() call.
+  4. writev() is STACK_WIND()ed up to the client-protocol or storage translator.
+  5. client-protocol, on receiving the reply from the server, starts STACK_UNWIND() towards the fuse-bridge.
+
+  on a translator tree with write-behind, control flow is like this:
+
+  1. application makes a write() system call.
+  2. VFS ==> FUSE ==> /dev/fuse.
+  3. fuse-bridge initiates a glusterfs writev() call.
+  4. writev() is STACK_WIND()ed up to the write-behind translator.
+  5. write-behind adds the write buffer to its internal queue and does a STACK_UNWIND() towards the fuse-bridge.
+
+  the write call is now complete from the application's perspective. after STACK_UNWIND()ing towards the fuse-bridge, write-behind initiates a fresh writev() call to its child translator, whose replies will be consumed by write-behind itself. write-behind _doesn't_ cache the write buffer, unless 'option flush-behind on' is specified in the volume specification file.
+
+windowing
+---------
+
+  with respect to write-behind, each write-buffer has three flags: 'stack_wound', 'write_behind' and 'got_reply'.
+
+  stack_wound: if set, indicates that write-behind has initiated STACK_WIND() towards the child translator.
+
+  write_behind: if set, indicates that write-behind has done STACK_UNWIND() towards fuse-bridge.
+
+  got_reply: if set, indicates that write-behind has received a reply from the child translator for a writev() STACK_WIND(). a request will be destroyed by write-behind only if this flag is set.
+
+  currently pending write requests = aggregate size of requests with write_behind = 1 and got_reply = 0.
+
+  window size limits the aggregate size of currently pending write requests. once the pending requests' size has reached the window size, write-behind blocks writev() calls from fuse-bridge.
+  blocking is only from the application's perspective. write-behind does STACK_WIND() to the child translator straight away, but holds back the STACK_UNWIND() towards fuse-bridge. STACK_UNWIND() is done only once write-behind has received enough replies to accommodate the currently blocked request.
+
+flush behind
+------------
+
+  if 'option flush-behind on' is specified in the volume specification file, then write-behind sends aggregate write requests to the child translator, instead of regular per-request STACK_WIND()s.
+
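  the accounting rule from the windowing section can be sketched in a few lines of C. this is an illustration only: struct wb_request_sketch and both function names are hypothetical and do not come from the write-behind source; only the three flag names and the "write_behind = 1 and got_reply = 0" rule are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical per-request record carrying the three flags */
struct wb_request_sketch {
        size_t size;          /* bytes in this write buffer */
        int    stack_wound;   /* STACK_WIND()ed to the child translator */
        int    write_behind;  /* STACK_UNWIND()ed towards fuse-bridge */
        int    got_reply;     /* child translator has replied */
};

/* aggregate size of requests already acknowledged to the application
 * (write_behind = 1) but not yet answered by the child (got_reply = 0) */
static size_t
wb_pending_size (const struct wb_request_sketch *reqs, size_t count)
{
        size_t i, pending = 0;

        assert (reqs != NULL || count == 0);
        for (i = 0; i < count; i++)
                if (reqs[i].write_behind && !reqs[i].got_reply)
                        pending += reqs[i].size;
        return pending;
}

/* non-zero once the pending size has reached the window size, i.e. the
 * STACK_UNWIND() for a new writev() has to be held back */
static int
wb_must_block (const struct wb_request_sketch *reqs, size_t count,
               size_t window_size)
{
        return wb_pending_size (reqs, count) >= window_size;
}
```

  a request that has been wound but not yet unwound (write_behind = 0) is deliberately left out of the sum, because the application is still waiting on it and it is therefore not "behind".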
