Diffstat (limited to 'doc/hacker-guide')
-rw-r--r-- | doc/hacker-guide/en-US/markdown/adding-fops.md | 18
-rw-r--r-- | doc/hacker-guide/en-US/markdown/afr.md | 191
-rw-r--r-- | doc/hacker-guide/en-US/markdown/coding-standard.md | 402
-rw-r--r-- | doc/hacker-guide/en-US/markdown/inode.md | 226
-rw-r--r-- | doc/hacker-guide/en-US/markdown/posix.md | 59
-rw-r--r-- | doc/hacker-guide/en-US/markdown/translator-development.md | 666
-rw-r--r-- | doc/hacker-guide/en-US/markdown/unittest.md | 228
-rw-r--r-- | doc/hacker-guide/en-US/markdown/write-behind.md | 56
8 files changed, 0 insertions, 1846 deletions
diff --git a/doc/hacker-guide/en-US/markdown/adding-fops.md b/doc/hacker-guide/en-US/markdown/adding-fops.md
deleted file mode 100644
index 3f72ed3e23a..00000000000
--- a/doc/hacker-guide/en-US/markdown/adding-fops.md
+++ /dev/null
@@ -1,18 +0,0 @@
-Adding a new FOP
-================
-
-Steps to be followed when adding a new FOP to GlusterFS:
-
-1. Edit `glusterfs.h` and add a `GF_FOP_*` constant.
-2. Edit `xlator.[ch]` and:
-    * add the new prototype for fop and callback.
-    * edit `xlator_fops` structure.
-3. Edit `xlator.c` and add to fill_defaults.
-4. Edit `protocol.h` and add struct necessary for the new FOP.
-5. Edit `defaults.[ch]` and provide default implementation.
-6. Edit `call-stub.[ch]` and provide stub implementation.
-7. Edit `common-utils.c` and add to gf_global_variable_init().
-8. Edit client-protocol and add your FOP.
-9. Edit server-protocol and add your FOP.
-10. Implement your FOP in any translator for which the default implementation
-    is not sufficient.
diff --git a/doc/hacker-guide/en-US/markdown/afr.md b/doc/hacker-guide/en-US/markdown/afr.md
deleted file mode 100644
index 566573a4e26..00000000000
--- a/doc/hacker-guide/en-US/markdown/afr.md
+++ /dev/null
@@ -1,191 +0,0 @@
-cluster/afr translator
-======================
-
-Locking
--------
-
-Before understanding replicate, one must understand two internal FOPs:
-
-### `GF_FILE_LK`
-
-This is exactly like `fcntl(2)` locking, except the locks are in a
-separate domain from locks held by applications.
-
-### `GF_DIR_LK (loc_t *loc, char *basename)`
-
-This allows one to lock a name under a directory. For example,
-to lock /mnt/glusterfs/foo, one would use the call:
-
-```
-GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
-```
-
-If one wishes to lock *all* the names under a particular directory,
-supply the basename argument as `NULL`.
-
-The locks can either be read locks or write locks; consult the
-function prototype for more details.
- -Both these operations are implemented by the features/locks (earlier -known as posix-locks) translator. - -Basic design ------------- - -All FOPs can be classified into four major groups: - -### inode-read - -Operations that read an inode's data (file contents) or metadata (perms, etc.). - -access, getxattr, fstat, readlink, readv, stat. - -### inode-write - -Operations that modify an inode's data or metadata. - -chmod, chown, truncate, writev, utimens. - -### dir-read - -Operations that read a directory's contents or metadata. - -readdir, getdents, checksum. - -### dir-write - -Operations that modify a directory's contents or metadata. - -create, link, mkdir, mknod, rename, rmdir, symlink, unlink. - -Some of these make a subgroup in that they modify *two* different entries: -link, rename, symlink. - -### Others - -Other operations. - -flush, lookup, open, opendir, statfs. - -Algorithms ----------- - -Each of the four major groups has its own algorithm: - -### inode-read, dir-read - -1. Send a request to the first child that is up: - * if it fails: - * try the next available child - * if we have exhausted all children: - * return failure - -### inode-write - - All operations are done in parallel unless specified otherwise. - -1. Send a ``GF_FILE_LK`` request on all children for a write lock on the - appropriate region - (for metadata operations: entire file (0, 0) for writev: - (offset, offset+size of buffer)) - * If a lock request fails on a child: - * unlock all children - * try to acquire a blocking lock (`F_SETLKW`) on each child, serially. - If this fails (due to `ENOTCONN` or `EINVAL`): - Consider this child as dead for rest of transaction. -2. Mark all children as "pending" on all (alive) children (see below for -meaning of "pending"). - * If it fails on any child: - * mark it as dead (in transaction local state). -3. Perform operation on all (alive) children. - * If it fails on any child: - * mark it as dead (in transaction local state). -4. 
Unmark all successful children as not "pending" on all nodes. -5. Unlock region on all (alive) children. - -### dir-write - - The algorithm for dir-write is same as above except instead of holding - `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon. - In case of link-type calls, we hold locks on both the operand names. - -"pending" ---------- - -The "pending" number is like a journal entry. A pending entry is an -array of 32-bit integers stored in network byte-order as the extended -attribute of an inode (which can be a directory as well). - -There are three keys corresponding to three types of pending operations: - -### `AFR_METADATA_PENDING` - -There are some metadata operations pending on this inode (perms, ctime/mtime, -xattr, etc.). - -### `AFR_DATA_PENDING` - -There is some data pending on this inode (writev). - -### `AFR_ENTRY_PENDING` - -There are some directory operations pending on this directory -(create, unlink, etc.). - -Self heal ---------- - -* On lookup, gather extended attribute data: - * If entry is a regular file: - * If an entry is present on one child and not on others: - * create entry on others. - * If entries exist but have different metadata (perms, etc.): - * consider the entry with the highest `AFR_METADATA_PENDING` number as - definitive and replicate its attributes on children. - * If entry is a directory: - * Consider the entry with the highest `AFR_ENTRY_PENDING` number as - definitive and replicate its contents on all children. - * If any two entries have non-matching types (i.e., one is file and - other is directory): - * Announce to the user via log that a split-brain situation has been - detected, and do nothing. -* On open, gather extended attribute data: - * Consider the file with the highest `AFR_DATA_PENDING` number as - the definitive one and replicate its contents on all other - children. - -During all self heal operations, appropriate locks must be held on all -regions/entries being affected. 
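As a rough illustration of the self-heal rule above (treat the copy with the highest pending number as definitive), here is a minimal C sketch. It assumes each child's pending count has already been read from its extended attribute as a single 32-bit integer, still in network byte order. `pick_source` and its parameters are hypothetical names for this note, not part of the afr code.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a GlusterFS function): given one pending
 * count per child, each still in network byte order, return the index
 * of the child with the highest count, or -1 if there is no input.
 * Ties are resolved in favour of the lowest child index.
 */
static int
pick_source (const uint32_t *pending_be, size_t child_count)
{
        size_t   i       = 0;
        int      source  = -1;
        uint32_t highest = 0;
        uint32_t count   = 0;

        if (pending_be == NULL)
                return -1;

        for (i = 0; i < child_count; i++) {
                /* convert from network to host byte order before comparing */
                count = ntohl (pending_be[i]);
                if ((source == -1) || (count > highest)) {
                        highest = count;
                        source  = (int) i;
                }
        }

        return source;
}
```

A caller would then replicate from the returned child to the others while holding the locks described above. Split-brain detection is deliberately left out of this sketch.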
- -Inode scaling -------------- - -Inode scaling is necessary because if a situation arises where an inode number -is returned for a directory (by lookup) which was previously the inode number -of a file (as per FUSE's table), then FUSE gets horribly confused (consult a -FUSE expert for more details). - -To avoid such a situation, we distribute the 64-bit inode space equally -among all children of replicate. - -To illustrate: - -If c1, c2, c3 are children of replicate, they each get 1/3 of the available -inode space: - -------------- -- -- -- -- -- -- -- -- -- -- -- --- -Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ... -Inode number: 1 2 3 4 5 6 7 8 9 10 11 ... -------------- -- -- -- -- -- -- -- -- -- -- -- --- - -Thus, if lookup on c1 returns an inode number "2", it is scaled to "4" -(which is the second inode number in c1's space). - -This way we ensure that there is never a collision of inode numbers from -two different children. - -This reduction of inode space doesn't really reduce the usability of -replicate since even if we assume replicate has 1024 children (which would be a -highly unusual scenario), each child still has a 54-bit inode space: -$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real -world requirement. diff --git a/doc/hacker-guide/en-US/markdown/coding-standard.md b/doc/hacker-guide/en-US/markdown/coding-standard.md deleted file mode 100644 index 368c5553464..00000000000 --- a/doc/hacker-guide/en-US/markdown/coding-standard.md +++ /dev/null @@ -1,402 +0,0 @@ -GlusterFS Coding Standards -========================== - -Structure definitions should have a comment per member ------------------------------------------------------- - -Every member in a structure definition must have a comment about its -purpose. The comment should be descriptive without being overly verbose. 
- -*Bad:* - -``` -gf_lock_t lock; /* lock */ -``` - -*Good:* - -``` -DBTYPE access_mode; /* access mode for accessing - * the databases, can be - * DB_HASH, DB_BTREE - * (option access-mode <mode>) - */ -``` - -Declare all variables at the beginning of the function ------------------------------------------------------- - -All local variables in a function must be declared immediately after the -opening brace. This makes it easy to keep track of memory that needs to be freed -during exit. It also helps debugging, since gdb cannot handle variables -declared inside loops or other such blocks. - -Always initialize local variables ---------------------------------- - -Every local variable should be initialized to a sensible default value -at the point of its declaration. All pointers should be initialized to NULL, -and all integers should be zero or (if it makes sense) an error value. - - -*Good:* - -``` -int ret = 0; -char *databuf = NULL; -int _fd = -1; -``` - -Initialization should always be done with a constant value ----------------------------------------------------------- - -Never use a non-constant expression as the initialization value for a variable. - - -*Bad:* - -``` -pid_t pid = frame->root->pid; -char *databuf = malloc (1024); -``` - -Validate all arguments to a function ------------------------------------- - -All pointer arguments to a function must be checked for `NULL`. -A macro named `VALIDATE` (in `common-utils.h`) -takes one argument, and if it is `NULL`, writes a log message and -jumps to a label called `err` after setting op_ret and op_errno -appropriately. It is recommended to use this template. - - -*Good:* - -``` -VALIDATE(frame); -VALIDATE(this); -VALIDATE(inode); -``` - -Never rely on precedence of operators -------------------------------------- - -Never write code that relies on the precedence of operators to execute -correctly. 
Such code can be hard to read and someone else might not -know the precedence of operators as accurately as you do. - -*Bad:* - -``` -if (op_ret == -1 && errno != ENOENT) -``` - -*Good:* - -``` -if ((op_ret == -1) && (errno != ENOENT)) -``` - -Use exactly matching types --------------------------- - -Use a variable of the exact type declared in the manual to hold the -return value of a function. Do not use an ``equivalent'' type. - - -*Bad:* - -``` -int len = strlen (path); -``` - -*Good:* - -``` -size_t len = strlen (path); -``` - -Never write code such as `foo->bar->baz`; check every pointer -------------------------------------------------------------- - -Do not write code that blindly follows a chain of pointer -references. Any pointer in the chain may be `NULL` and thus -cause a crash. Verify that each pointer is non-null before following -it. - -Check return value of all functions and system calls ----------------------------------------------------- - -The return value of all system calls and API functions must be checked -for success or failure. - -*Bad:* - -``` -close (fd); -``` - -*Good:* - -``` -op_ret = close (_fd); -if (op_ret == -1) { - gf_log (this->name, GF_LOG_ERROR, - "close on file %s failed (%s)", real_path, - strerror (errno)); - op_errno = errno; - goto out; -} -``` - - -Gracefully handle failure of malloc ------------------------------------ - -GlusterFS should never crash or exit due to lack of memory. If a -memory allocation fails, the call should be unwound and an error -returned to the user. - -*Use result args and reserve the return value to indicate success or failure:* - -The return value of every functions must indicate success or failure (unless -it is impossible for the function to fail --- e.g., boolean functions). If -the function needs to return additional data, it must be returned using a -result (pointer) argument. 
- -*Bad:* - -``` -int32_t dict_get_int32 (dict_t *this, char *key); -``` - -*Good:* - -``` -int dict_get_int32 (dict_t *this, char *key, int32_t *val); -``` - -Always use the `n' versions of string functions ------------------------------------------------ - -Unless impossible, use the length-limited versions of the string functions. - -*Bad:* - -``` -strcpy (entry_path, real_path); -``` - -*Good:* - -``` -strncpy (entry_path, real_path, entry_path_len); -``` - -No dead or commented code -------------------------- - -There must be no dead code (code to which control can never be passed) or -commented out code in the codebase. - -Only one unwind and return per function ---------------------------------------- - -There must be only one exit out of a function. `UNWIND` and return -should happen at only point in the function. - -Function length or Keep functions small ---------------------------------------- - -We live in the UNIX-world where modules do one thing and do it well. -This rule should apply to our functions also. If a function is very long, try splitting it -into many little helper functions. The question is, in a coding -spree, how do we know a function is long and unreadable. One rule of -thumb given by Linus Torvalds is that, a function should be broken-up -if you have 4 or more levels of indentation going on for more than 3-4 -lines. - -*Example for a helper function:* -``` -static int -same_owner (posix_lock_t *l1, posix_lock_t *l2) -{ - return ((l1->client_pid == l2->client_pid) && - (l1->transport == l2->transport)); -} -``` - -Defining functions as static ----------------------------- - -Define internal functions as static only if you're -very sure that there will not be a crash(..of any kind..) emanating in -that function. If there is even a remote possibility, perhaps due to -pointer derefering, etc, declare the function as non-static. This -ensures that when a crash does happen, the function name shows up the -in the back-trace generated by libc. 
However, doing so has potential -for polluting the function namespace, so to avoid conflicts with other -components in other parts, ensure that the function names are -prepended with a prefix that identify the component to which it -belongs. For eg. non-static functions in io-threads translator start -with iot_. - -Ensure function calls wrap around after 80-columns --------------------------------------------------- - -Place remaining arguments on the next line if needed. - -Functions arguments and function definition -------------------------------------------- - -Place all the arguments of a function definition on the same line -until the line goes beyond 80-cols. Arguments that extend beyind -80-cols should be placed on the next line. - -Style issues ------------- - -### Brace placement - -Use K&R/Linux style of brace placement for blocks. - -*Good:* - -``` -int some_function (...) -{ - if (...) { - /* ... */ - } else if (...) { - /* ... */ - } else { - /* ... */ - } - - do { - /* ... */ - } while (cond); -} -``` - -### Indentation - -Use *eight* spaces for indenting blocks. Ensure that your -file contains only spaces and not tab characters. You can do this -in Emacs by selecting the entire file (`C-x h`) and -running `M-x untabify`. - -To make Emacs indent lines automatically by eight spaces, add this -line to your `.emacs`: - -``` -(add-hook 'c-mode-hook (lambda () (c-set-style "linux"))) -``` - -### Comments - -Write a comment before every function describing its purpose (one-line), -its arguments, and its return value. Mention whether it is an internal -function or an exported function. - -Write a comment before every structure describing its purpose, and -write comments about each of its members. - -Follow the style shown below for comments, since such comments -can then be automatically extracted by doxygen to generate -documentation. 
- -*Good:* - -``` -/** -* hash_name -hash function for filenames -* @par: parent inode number -* @name: basename of inode -* @mod: number of buckets in the hashtable -* -* @return: success: bucket number -* failure: -1 -* -* Not for external use. -*/ -``` - -### Indicating critical sections - -To clearly show regions of code which execute with locks held, use -the following format: - -``` -pthread_mutex_lock (&mutex); -{ - /* code */ -} -pthread_mutex_unlock (&mutex); -``` - -*A skeleton fop function:* - -This is the recommended template for any fop. In the beginning come -the initializations. After that, the `success' control flow should be -linear. Any error conditions should cause a `goto` to a single -point, `out`. At that point, the code should detect the error -that has occurred and do appropriate cleanup. - -``` -int32_t -sample_fop (call_frame_t *frame, xlator_t *this, ...) -{ - char * var1 = NULL; - int32_t op_ret = -1; - int32_t op_errno = 0; - DIR * dir = NULL; - struct posix_fd * pfd = NULL; - - VALIDATE_OR_GOTO (frame, out); - VALIDATE_OR_GOTO (this, out); - - /* other validations */ - - dir = opendir (...); - - if (dir == NULL) { - op_errno = errno; - gf_log (this->name, GF_LOG_ERROR, - "opendir failed on %s (%s)", loc->path, - strerror (op_errno)); - goto out; - } - - /* another system call */ - if (...) { - op_errno = ENOMEM; - gf_log (this->name, GF_LOG_ERROR, - "out of memory :("); - goto out; - } - - /* ... 
*/ - - out: - if (op_ret == -1) { - - /* check for all the cleanup that needs to be - done */ - - if (dir) { - closedir (dir); - dir = NULL; - } - - if (pfd) { - FREE (pfd->path); - FREE (pfd); - pfd = NULL; - } - } - - STACK_UNWIND (frame, op_ret, op_errno, fd); - return 0; -} -``` diff --git a/doc/hacker-guide/en-US/markdown/inode.md b/doc/hacker-guide/en-US/markdown/inode.md deleted file mode 100644 index a340ab9ca8e..00000000000 --- a/doc/hacker-guide/en-US/markdown/inode.md +++ /dev/null @@ -1,226 +0,0 @@ -#Inode and dentry management in GlusterFS: - -##Background -Filesystems internally refer to files and directories via inodes. Inodes -are unique identifiers of the entities stored in a filesystem. Whenever an -application has to operate on a file/directory (read/modify), the filesystem -maps that file/directory to the right inode and start referring to that inode -whenever an operation has to be performed on the file/directory. - -In GlusterFS a new inode gets created whenever a new file/directory is created -OR when a successful lookup is done on a file/directory for the first time. -Inodes in GlusterFS are maintained by the inode table which gets initiated when -the filesystem daemon is started (both for the brick process as well as the -mount process). Below are some important data structures for inode management. 
- -## Data-structure (inode-table) -``` -struct _inode_table { - pthread_mutex_t lock; - size_t hashsize; /* bucket size of inode hash and dentry hash */ - char *name; /* name of the inode table, just for gf_log() */ - inode_t *root; /* root directory inode, with inode - number and gfid 1 */ - xlator_t *xl; /* xlator to be called to do purge and - the xlator which maintains the inode table*/ - uint32_t lru_limit; /* maximum LRU cache size */ - struct list_head *inode_hash; /* buckets for inode hash table */ - struct list_head *name_hash; /* buckets for dentry hash table */ - struct list_head active; /* list of inodes currently active (in an fop) */ - uint32_t active_size; /* count of inodes in active list */ - struct list_head lru; /* list of inodes recently used. - lru.next most recent */ - uint32_t lru_size; /* count of inodes in lru list */ - struct list_head purge; /* list of inodes to be purged soon */ - uint32_t purge_size; /* count of inodes in purge list */ - - struct mem_pool *inode_pool; /* memory pool for inodes */ - struct mem_pool *dentry_pool; /* memory pool for dentrys */ - struct mem_pool *fd_mem_pool; /* memory pool for fd_t */ - int ctxcount; /* number of slots in inode->ctx */ -}; -``` - -#Life-cycle -``` - -inode_table_new (size_t lru_limit, xlator_t *xl) - -This is a function which allocates a new inode table. Usually the top xlators in -the graph such as protocol/server (for bricks), fuse and nfs (for fuse and nfs -mounts) and libgfapi do inode managements. Hence they are the ones which will -allocate a new inode table by calling the above function. - -Each xlator graph in glusterfs maintains an inode table. So in fuse clients, -whenever there is a graph change due to add brick/remove brick or -addition/removal of some other xlators, a new graph is created which creates a -new inode table. - -Thus an allocated inode table is destroyed only when the filesystem daemon is -killed or unmounted. - -``` - -#what it contains. 
-```
-
-Inode table in glusterfs mainly contains a hash table for maintaining inodes.
-In general a file/directory is considered to be existing if there is a
-corresponding inode present in the inode table. If an inode for a file/directory
-cannot be found in the inode table, glusterfs tries to resolve it by sending a
-lookup on the entry for which the inode is needed. If lookup is successful, then
-a new inode corresponding to the entry is added to the hash table present in the
-inode table. Thus an inode present in the hash table means it is an existing
-file/directory within the filesystem. The inode table also contains the hash
-size of the hash table (as of now it is hard coded to 14057. The hash value of
-an inode is calculated using its gfid).
-
-Apart from the hash table, inode table also maintains three important lists of inodes:
-1) Active list:
-Active list contains all the active inodes (i.e. inodes which are currently part
-of some fop).
-2) Lru list:
-Least recently used inodes list. A limit can be set for the size of the lru
-list. For bricks it is 16384 and for clients it is infinity.
-3) Purge list:
-List of all the inodes which have to be purged (i.e. inodes which have to be
-deleted from the inode table due to unlink/rmdir/forget).
-
-Finally, it also contains the mem-pools for allocating inodes and dentries so
-that frequent malloc/calloc and free of the data structures can be avoided.
-``` - -#Data structure (inode) -``` -struct _inode { - inode_table_t *table; /* the table this inode belongs to */ - uuid_t gfid; /* unique identifier of the inode */ - gf_lock_t lock; - uint64_t nlookup; - uint32_t fd_count; /* Open fd count */ - uint32_t ref; /* reference count on this inode */ - ia_type_t ia_type; /* what kind of file */ - struct list_head fd_list; /* list of open files on this inode */ - struct list_head dentry_list; /* list of directory entries for this inode */ - struct list_head hash; /* hash table pointers */ - struct list_head list; /* active/lru/purge */ - - struct _inode_ctx *_ctx; /* place holder for keeping the - information about the inode by different xlators */ -}; - -As said above, inodes are internal way of identifying the files/directories. A -inode uniquely represents a file/directory. A new inode is created whenever a -create/mkdir/symlink/mknod operations are performed. Apart from that a new inode -is created upon the successful fresh lookup of a file/directory. Say the -filesystem contained some file "a" within root and the filesystem was -unmounted. Now when glusterfs is mounted and some operation is perfomed on "/a", -glusterfs tries to get the inode for the entry "a" with parent inode as -root. But, since glusterfs just came up, it will not be able to find the inode -for "a" and will send a lookup on "/a". If the lookup operation succeeds (i.e. -the root of glusterfs contains an entry called "a"), then a new inode for "/a" -is created and added to the inode table. - -Depending upon the situation, an inode can be in one of the 3 lists maintained -by the inode table. If some fop is happening on the inode, then the inode will -be present in the active inodes list maintained by the inode table. Active -inodes are those inodes whose refcount is greater than zero. 
Whenever some -operation comes on a file/directory, and the resolver tries to find the inode -for it, it increments the refcount of the inode before returning the inode. The -refcount of an inode can be incremented by calling the below function - -inode_ref (inode_t *inode) - -Any xlator which wants to operate on a inode as part of some fop (or wants the -inode in the callback), should hold a ref on the inode. -Once the fop is completed before sending the reply of the fop to the above -layers , the inode has to be unrefed. When the refcount of an inode becomes -zero, it is removed from the active inodes list and put into LRU list maintained -by the inode table. Thus in short if some fop is happening on a file/directory, -the corresponding inode will be in the active list or it will be in the LRU -list. -``` - -#Life Cycle - -A new inode is created whenever a new file/directory/symlink is created OR a -successful lookup of an existing entry is done. The xlators which does inode -management (as of now protocol/server, fuse, nfs, gfapi) will perform inode_link -operation upon successful lookup or successful creation of a new entry. - -inode_link (inode_t *inode, inode_t *parent, const char *name, - struct iatt *buf); - -inode_link actually adds the inode to the inode table (to be precise it adds -the inode to the hash table maintained by the inode table. The hash value is -calculated based on the gfid). Copies the gfid to the inode (the gfid is -present in the iatt structure). Creates a dentry with the new name. - -A inode is removed from the inode table and eventually destroyed when unlink -or rmdir operation is performed on a file/directory, or the the lru limit of -the inode table has been exceeded. 
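The ref/unref behaviour described above can be sketched with a toy model: an inode with a non-zero refcount is on the active list, and it moves to the LRU list when the refcount drops to zero. The `toy_*` names below are hypothetical and heavily simplified. The real `inode_ref`/`inode_unref` also take the table lock and manipulate `struct list_head` entries, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>

/* Which list a toy inode currently sits on (purge list omitted). */
enum toy_list {
        TOY_LRU,
        TOY_ACTIVE
};

/* Hypothetical, stripped-down stand-in for struct _inode. */
struct toy_inode {
        uint32_t      ref;  /* reference count on this inode */
        enum toy_list list; /* list the inode currently belongs to */
};

/* Taking a reference always leaves the inode on the active list. */
static void
toy_inode_ref (struct toy_inode *inode)
{
        if (inode == NULL)
                return;

        inode->ref++;
        inode->list = TOY_ACTIVE;
}

/* Dropping the last reference moves the inode to the LRU list. */
static void
toy_inode_unref (struct toy_inode *inode)
{
        if ((inode == NULL) || (inode->ref == 0))
                return;

        inode->ref--;
        if (inode->ref == 0)
                inode->list = TOY_LRU;
}
```

A translator that needs the inode in its callback would pair one `toy_inode_ref` at the start of the fop with one `toy_inode_unref` before unwinding, mirroring the rule stated above.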
- -#Data structure (dentry) -``` - -struct _dentry { - struct list_head inode_list; /* list of dentries of inode */ - struct list_head hash; /* hash table pointers */ - inode_t *inode; /* inode of this directory entry */ - char *name; /* name of the directory entry */ - inode_t *parent; /* directory of the entry */ -}; - -A dentry is the presence of an entry for a file/directory within its parent -directory. A dentry usually points to the inode to which it belongs to. In -glusterfs a dentry contains the following fields. -1) a hook using which it can add itself to the list of -the dentries maintained by the inode to which it points to. -2) A hash table pointer. -3) Pointer to the inode to which it belongs to. -4) Name of the dentry -5) Pointer to the inode of the parent directory in which the dentry is present - -A new dentry is created when a new file/directory/symlink is created or a hard -link to an existing file is created. - -__dentry_create (inode_t *inode, inode_t *parent, const char *name); - -A dentry holds a refcount on the parent -directory so that the parent inode is never removed from the active inode's list -and put to the lru list (If the lru limit of the lru list is exceeded, there is -a chance of parent inode being destroyed. To avoid it, the dentries hold a -reference to the parent inode). A dentry is removed whenevern a unlink/rmdir -is perfomed on a file/directory. Or when the lru limit has been exceeded, the -oldest inodes are purged out of the inode table, during which all the dentries -of the inode are removed. - -Whenever a unlink/rmdir comes on a file/directory, the corresponding inode -should be removed from the inode table. So upon unlink/rmdir, the inode will -be moved to the purge list maintained by the inode table and from there it is -destroyed. To be more specific, if a inode has to be destroyed, its refcount -and nlookup count both should become 0. 
For refcount to become 0, the inode -should not be part of any fop (there should not be any open fds). Or if the -inode belongs to a directory, then there should not be any fop happening on the -directory and it should not contain any dentries within it. For nlookup count to -become zero, a forget has to be sent on the inode with nlookup count set to 0 as -an argument. For fuse clients, forget is sent by the kernel itself whenever a -unlink/rmdir is performed. But for brick processes, upon unlink/rmdir, the -protocol/server itself has to do inode_forget. Whenever the inode has to be -deleted due to file removal or lru limit being exceeded the inode is retired -(i.e. all the dentries of the inode are deleted and the inode is moved to the -purge list maintained by the inode table), the nlookup count is set to 0 via -inode_forget api. The inode table, then prunes all the inodes from the purge -list by destroying the inode contexts maintained by each xlator. - -unlinking of the dentry is done via inode_unlink; - -void -inode_unlink (inode_t *inode, inode_t *parent, const char *name); - -If the inode has multiple hard links, then the unlink operation performed by -the application results just in the removal of the dentry with the name provided -by the application. For the inode to be removed, all the dentries of the inode -should be unlinked. -``` - diff --git a/doc/hacker-guide/en-US/markdown/posix.md b/doc/hacker-guide/en-US/markdown/posix.md deleted file mode 100644 index 84c813e55a2..00000000000 --- a/doc/hacker-guide/en-US/markdown/posix.md +++ /dev/null @@ -1,59 +0,0 @@ -storage/posix translator -======================== - -Notes ------ - -### `SET_FS_ID` - -This is so that all filesystem checks are done with the user's -uid/gid and not GlusterFS's uid/gid. - -### `MAKE_REAL_PATH` - -This macro concatenates the base directory of the posix volume -('option directory') with the given path. 
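The concatenation that `MAKE_REAL_PATH` performs can be sketched as a plain function. This is only an illustration under assumptions: `make_real_path`, its parameters, and the explicit length check are hypothetical, and the actual macro in the posix translator may allocate and behave differently.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of what MAKE_REAL_PATH does conceptually:
 * join the volume's base directory ('option directory') and the
 * path seen by the translator into dst.
 *
 * Returns 0 on success, -1 on invalid arguments or if dst is too small.
 */
static int
make_real_path (char *dst, size_t dst_len, const char *base, const char *path)
{
        size_t needed  = 0;
        int    written = 0;

        if ((dst == NULL) || (base == NULL) || (path == NULL))
                return -1;

        /* both strings plus the terminating NUL must fit */
        needed = strlen (base) + strlen (path) + 1;
        if (needed > dst_len)
                return -1;

        written = snprintf (dst, dst_len, "%s%s", base, path);
        if (written < 0)
                return -1;

        return 0;
}
```

For example, a base directory of `/export/brick1` and a path of `/dir/file` would yield `/export/brick1/dir/file`.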
- -### `need_xattr` in lookup - -If this flag is passed, lookup returns a xattr dictionary that contains -the file's create time, the file's contents, and the version number -of the file. - -This is a hack to increase small file performance. If an application -wants to read a small file, it can finish its job with just a lookup -call instead of a lookup followed by read. - -### `getdents`/`setdents` - -These are used by unify to set and get directory entries. - -### `ALIGN_BUF` - -Macro to align an address to a page boundary (4K). - -### `priv->export_statfs` - -In some cases, two exported volumes may reside on the same -partition on the server. Sending statvfs info for both -the volumes will lead to erroneous df output at the client, -since free space on the partition will be counted twice. - -In such cases, user can disable exporting statvfs info -on one of the volumes by setting this option. - -### `xattrop` - -This fop is used by replicate to set version numbers on files. - -### `getxattr`/`setxattr` hack to read/write files - -A key, `GLUSTERFS_FILE_CONTENT_STRING`, is handled in a special way by -`getxattr`/`setxattr`. A getxattr with the key will return the entire -content of the file as the value. A `setxattr` with the key will write -the value as the entire content of the file. - -### `posix_checksum` - -This calculates a simple XOR checksum on all entry names in a -directory that is used by unify to compare directory contents. diff --git a/doc/hacker-guide/en-US/markdown/translator-development.md b/doc/hacker-guide/en-US/markdown/translator-development.md deleted file mode 100644 index edadd5150dc..00000000000 --- a/doc/hacker-guide/en-US/markdown/translator-development.md +++ /dev/null @@ -1,666 +0,0 @@ -Translator development -====================== - -Setting the Stage ------------------ - -This is the first post in a series that will explain some of the details of -writing a GlusterFS translator, using some actual code to illustrate. 
-
-Before we begin, a word about environments. GlusterFS is over 300K lines of
-code spread across a few hundred files. That's no Linux kernel or anything, but
-you're still going to be navigating through a lot of code in every
-code-editing session, so some kind of cross-referencing is *essential*. I use
-cscope with the vim bindings, and if I couldn't do Ctrl+G and such to jump
-between definitions all the time my productivity would be cut in half. You may
-prefer different tools, but as I go through these examples you'll need
-something functionally similar to follow on. OK, on with the show.
-
-The first thing you need to know is that translators are not just bags of
-functions and variables. They need to have a very definite internal structure
-so that the translator-loading code can figure out where all the pieces are.
-The way it does this is to use dlsym to look for specific names within your
-shared-object file, as follows (from `xlator.c`):
-
-```
-if (!(xl->fops = dlsym (handle, "fops"))) {
-        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s",
-                dlerror ());
-        goto out;
-}
-
-if (!(xl->cbks = dlsym (handle, "cbks"))) {
-        gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s",
-                dlerror ());
-        goto out;
-}
-
-if (!(xl->init = dlsym (handle, "init"))) {
-        gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s",
-                dlerror ());
-        goto out;
-}
-
-if (!(xl->fini = dlsym (handle, "fini"))) {
-        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s",
-                dlerror ());
-        goto out;
-}
-```
-
-In this example, `xl` is a pointer to the in-memory object for the translator
-we're loading. As you can see, it's looking up various symbols *by name* in the
-shared object it just loaded, and storing pointers to those symbols. Some of
-them (e.g. `init`) are functions, while others (e.g. `fops`) are dispatch tables
-containing pointers to many functions. Together, these make up the translator's
-public interface.
- -Most of this glue or boilerplate can easily be found at the bottom of one of -the source files that make up each translator. We're going to use the `rot-13` -translator just for fun, so in this case you'd look in `rot-13.c` to see this: - -``` -struct xlator_fops fops = { - .readv = rot13_readv, - .writev = rot13_writev -}; - -struct xlator_cbks cbks = { -}; - -struct volume_options options[] = { -{ .key = {"encrypt-write"}, - .type = GF_OPTION_TYPE_BOOL -}, -{ .key = {"decrypt-read"}, - .type = GF_OPTION_TYPE_BOOL -}, -{ .key = {NULL} }, -}; -``` - -The `fops` table, defined in `xlator.h`, is one of the most important pieces. -This table contains a pointer to each of the filesystem functions that your -translator might implement -- `open`, `read`, `stat`, `chmod`, and so on. There -are 82 such functions in all, but don't worry; any that you don't specify here -will be seen as null and filled with defaults from `defaults.c` when your -translator is loaded. In this particular example, since `rot-13` is an -exceptionally simple translator, we only fill in two entries for `readv` and -`writev`. - -There are actually two other tables, also required to have predefined names, -that are also used to find translator functions: `cbks` (which is empty in this - snippet) and `dumpops` (which is missing entirely). The first of these specifies - entry points for when inodes are forgotten or file descriptors are released. -In other words, they're destructors for objects in which your translator might - have an interest. Mostly you can ignore them, because the default behavior -handles even the simpler cases of translator-specific inode/fd context -automatically. However, if the context you attach is a complex structure -requiring complex cleanup, you'll need to supply these functions. As for -`dumpops`, that's just used if you want to provide functions to pretty-print -various structures in logs. I've never used it myself, though I probably -should.
What's noteworthy here is that we don't even define `dumpops`. That's -because all of the functions that might use these dispatch functions will check - for `xl->dumpops` being `NULL` before calling through it. This is in sharp -contrast to the behavior for `fops` and `cbks`, which *must* be present. If -they're not, translator loading will fail because these pointers are not -checked every time and if they're `NULL` then we'll segfault. That's why we -provide an empty definition for `cbks`; it's OK for the individual function -pointers to be `NULL`, but not for the whole table to be absent. - -The last piece I'll cover today is options. As you can see, this is a table of -translator-specific option names and some information about their types. -GlusterFS actually provides a pretty rich set of types (`volume_option_type_t` -in `options.h`) which includes paths, translator names, percentages, and times -in addition to the obvious integers and strings. Also, the `volume_option_t` -structure can include information about alternate names, min/max/default -values, enumerated string values, and descriptions. We don't see any of these -here, so let's take a quick look at some more complex examples from `afr.c` and -then come back to `rot-13`. - -``` -{ .key = {"data-self-heal-algorithm"}, - .type = GF_OPTION_TYPE_STR, - .default_value = "", - .description = "Select between \"full\", \"diff\". The " - "\"full\" algorithm copies the entire file from " - "source to sink. The \"diff\" algorithm copies to " - "sink only those blocks whose checksums don't match " - "with those of source.", - .value = { "diff", "full", "" } -}, -{ .key = {"data-self-heal-window-size"}, - .type = GF_OPTION_TYPE_INT, - .min = 1, - .max = 1024, - .default_value = "1", - .description = "Maximum number blocks per file for which " - "self-heal process would be applied simultaneously."
-}, -``` - -When your translator is loaded, all of this information is used to parse the -options actually provided in the volfile, and then the result is turned into a -dictionary and stored as `xl->options`. This dictionary is then processed by -your `init` function, which you can see being looked up in the first code -fragment above. We're only going to look at a small part of `rot-13`'s -`init` for now. - -``` -priv->decrypt_read = 1; -priv->encrypt_write = 1; - -data = dict_get (this->options, "encrypt-write"); -if (data) { - if (gf_string2boolean (data->data, &priv->encrypt_write) - == -1) { - gf_log (this->name, GF_LOG_ERROR, - "encrypt-write takes only boolean options"); - return -1; - } -} -``` - -What we can see here is that we're setting some defaults in our priv structure, -then looking to see if an `encrypt-write` option was actually provided. If so, -we convert and store it. This is a pretty classic use of `dict_get` to fetch a -field from a dictionary, and of using one of many conversion functions in -`common-utils.c` to convert `data->data` into something we can use. - -So far we've covered the basics of how a translator gets loaded, how we find its -various parts, and how we process its options. In my next Translator 101 post, -we'll go a little deeper into other things that `init` and its companion `fini` -might do, and how some other fields in our `xlator_t` structure (commonly -referred to as `this`) are commonly used. - -`init`, `fini`, and private context ------------------------------------ - -In the previous Translator 101 post, we looked at some of the dispatch tables -and options processing in a translator. This time we're going to cover the rest - of the "shell" of a translator -- i.e. the other global parts not specific to -handling a particular request. - -Let's start by looking at the relationship between a translator and its shared -library.
At a first approximation, this is the relationship between an object -and a class in just about any object-oriented programming language. The class -defines behaviors, but has to be instantiated as an object to have any kind of -existence. In our case the object is an `xlator_t`. Several of these might be -created within the same daemon, sharing all of the same code through init/fini -and dispatch tables, but sharing *no data*. You could implement shared data (as - static variables in your shared libraries) but that's strongly discouraged. -Every function in your shared library will get an `xlator_t` as an argument, -and should use it. This lack of class-level data is one of the points where -the analogy to common OOP systems starts to break down. Another place is the -complete lack of inheritance. Translators inherit behavior (code) from exactly -one shared library -- looked up and loaded using the `type` field in a volfile -`volume ... end-volume` block -- and that's it -- not even single inheritance, -no subclasses or superclasses, no mixins or prototypes, just the relationship -between an object and its class. With that in mind, let's turn to the init -function that we just barely touched on last time. - -``` -int32_t -init (xlator_t *this) -{ - data_t *data = NULL; - rot_13_private_t *priv = NULL; - - if (!this->children || this->children->next) { - gf_log ("rot13", GF_LOG_ERROR, - "FATAL: rot13 should have exactly one child"); - return -1; - } - - if (!this->parents) { - gf_log (this->name, GF_LOG_WARNING, - "dangling volume. check volfile "); - } - - priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0); - if (!priv) - return -1; -``` - -At the very top, we see the function signature -- we get a pointer to the -`xlator_t` object that we're initializing, and we return an `int32_t` status. -As with most functions in the translator API, this should be zero to indicate -success. 
In this case it's safe to return -1 for failure, but watch out: in -dispatch-table functions, the return value means the status of the *function -call* rather than the *request*. A request error should be reflected as a -callback with a non-zero `op_ret` value, but the dispatch function itself -should still return zero. In fact, the handling of a non-zero return from a -dispatch function is not all that robust (we recently had a bug report in -HekaFS related to this) so it's something you should probably avoid -altogether. This only underscores the difference between dispatch functions -and `init`/`fini` functions, where non-zero returns *are* expected and handled -logically by aborting the translator setup. We can see that down at the -bottom, where we return -1 to indicate that we couldn't allocate our -private-data area (more about that later). - -The first thing this init function does is check that the translator is being -set up in the right kind of environment. Translators are called by parents and -in turn call children. Some translators are "initial" translators that inject -requests into the system from elsewhere -- e.g. mount/fuse injecting requests -from the kernel, protocol/server injecting requests from the network. Those -translators don't need parents, but `rot-13` does and so we check for that. -Similarly, some translators are "final" translators that (from the perspective -of the current process) terminate requests instead of passing them on -- e.g. -`protocol/client` passing them to another node, `storage/posix` passing them to -a local filesystem. Other translators "multiplex" between multiple children -- - passing each parent request on to one (`cluster/dht`), some -(`cluster/stripe`), or all (`cluster/afr`) of those children. `rot-13` fits -into none of those categories either, so it checks that it has *exactly one* -child.
It might be more convenient or robust if translator shared libraries -had standard variables describing these requirements, to be checked in a -consistent way by the translator-loading infrastructure itself instead of by -each separate init function, but this is the way translators work today. - -The last thing we see in this fragment is allocating our private data area. -This can literally be anything we want; the infrastructure just provides the -`private` pointer as a convenience but takes no responsibility for how it's used. In - this case we're using `GF_CALLOC` to allocate our own `rot_13_private_t` -structure. This gets us all the benefits of GlusterFS's memory-leak detection -infrastructure, but the way we're calling it is not quite ideal. For one thing, - the first two arguments -- from `calloc(3)` -- are kind of reversed. For -another, notice how the last argument is zero. That can actually be an -enumerated value, to tell the GlusterFS allocator *what* type we're -allocating. This can be very useful information for memory profiling and leak -detection, so it's recommended that you follow the example of any -`xxx-mem-types.h` file elsewhere in the source tree instead of just passing -zero here (even though that works). - -To finish our tour of standard initialization/termination, let's look at the -end of `init` and the beginning of `fini`: - -``` - this->private = priv; - gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded"); - return 0; -} - -void -fini (xlator_t *this) -{ - rot_13_private_t *priv = this->private; - - if (!priv) - return; - this->private = NULL; - GF_FREE (priv); -``` - -At the end of `init` we're just storing our private-data pointer in the `private` -field of our `xlator_t`, then returning zero to indicate that initialization -succeeded. As is usually the case, our `fini` is even simpler. All it really has -to do is `GF_FREE` our private-data pointer, which we do in a slightly -roundabout way here.
Notice how we don't even have a return value here, since -there's nothing obvious and useful that the infrastructure could do if `fini` -failed. - -That's practically everything we need to know to get our translator through -loading, initialization, options processing, and termination. If we had defined - no dispatch functions, we could actually configure a daemon to use our -translator and it would work as a basic pass-through from its parent to a -single child. In the next post I'll cover how to build the translator and -configure a daemon to use it, so that we can actually step through it in a -debugger and see how it all fits together before we actually start adding -functionality. - -This Time For Real ------------------- - -In the first two parts of this series, we learned how to write a basic -translator skeleton that can get through loading, initialization, and option -processing. This time we'll cover how to build that translator, configure a -volume to use it, and run the glusterfs daemon in debug mode. - -Unfortunately, there's not much direct support for writing new translators. You -can check out a GlusterFS tree and splice in your own translator directory, but - that's a bit painful because you'll have to update multiple makefiles plus a -bunch of autoconf garbage. As part of the HekaFS project, I basically reverse -engineered the truly necessary parts of the translator-building process and -then pestered one of the Fedora glusterfs package maintainers (thanks -daMaestro!) to add a `glusterfs-devel` package with the required headers. Since - then the complexity level in the HekaFS tree has crept back up a bit, but I -still remember the simple method and still consider it the easiest way to get -started on a new translator. For the sake of those not using Fedora, I'm going -to describe a method that doesn't depend on that header package. 
What it does -depend on is a GlusterFS source tree, much as you might have cloned from GitHub - or the Gluster review site. This tree doesn't have to be fully built, but you -do need to run `autogen.sh` and `configure` in it. Then you can take the -following simple makefile and put it in a directory with your actual source. - -``` -# Change these to match your source code. -TARGET = rot-13.so -OBJECTS = rot-13.o - -# Change these to match your environment. -GLFS_SRC = /srv/glusterfs -GLFS_LIB = /usr/lib64 -HOST_OS = GF_LINUX_HOST_OS - -# You shouldn't need to change anything below here. - -CFLAGS = -fPIC -Wall -O0 -g \ - -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE \ - -D$(HOST_OS) -I$(GLFS_SRC) -I$(GLFS_SRC)/contrib/uuid \ - -I$(GLFS_SRC)/libglusterfs/src -LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -LIBS = -lglusterfs -lpthread - -$(TARGET): $(OBJECTS) - $(CC) $(LDFLAGS) -o $(TARGET) $(OBJECTS) $(LIBS) -``` - -Yes, it's still Linux-specific. Mea culpa. As you can see, we're sticking with -the `rot-13` example, so you can just copy the files from -`xlators/encryption/rot-13/src` in your GlusterFS tree to follow on. Type -`make` and you should be rewarded with a nice little `.so` file. - -``` -xlator_example$ ls -l rot-13.so --rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so -``` - -Notice that we've built with optimization level zero and debugging symbols -included, which would not typically be the case for a packaged version of -GlusterFS. Let's put our version of `rot-13.so` into a slightly different file -on our system, so that it doesn't stomp on the installed version (not that -you'd ever want to use that anyway).
- -``` -xlator_example# ls /usr/lib64/glusterfs/3git/xlator/encryption/ -crypt.so crypt.so.0 crypt.so.0.0.0 rot-13.so rot-13.so.0 -rot-13.so.0.0.0 -xlator_example# cp rot-13.so \ - /usr/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so -``` - -These paths represent the current Gluster filesystem layout, which is likely to -be deprecated in favor of the Fedora layout; your paths may vary. At this point - we're ready to configure a volume using our new translator. To do that, I'm -going to suggest something that's strongly discouraged except during -development (the Gluster guys are going to hate me for this): write our own -volfile. Here's just about the simplest volfile you'll ever see. - -``` -volume my-posix - type storage/posix - option directory /srv/export -end-volume - -volume my-rot13 - type encryption/my-rot-13 - subvolumes my-posix -end-volume -``` - -All we have here is a basic brick using `/srv/export` for its data, and then -an instance of our translator layered on top -- no client or server is -necessary for what we're doing, and the system will automatically push a -mount/fuse translator on top if there's no server translator. To try this out, -all we need is the following command (assuming the directories involved already - exist). - -``` -xlator_example$ glusterfs --debug -f my.vol /srv/import -``` - -You should be rewarded with a whole lot of log output, including the text of -the volfile (this is very useful for debugging problems in the field). If you -go to another window on the same machine, you can see that you have a new -filesystem mounted. - -``` -~$ df /srv/import -Filesystem 1K-blocks Used Available Use% Mounted on -/srv/xlator_example/my.vol - 114506240 2706176 105983488 3% /srv/import -``` - -Just for fun, write something into a file in `/srv/import`, then look at the -corresponding file in `/srv/export` to see it all `rot-13`'ed for you. 
- -``` -~$ echo hello > /srv/import/a_file -~$ cat /srv/export/a_file -uryyb -``` - -There you have it -- functionality you control, implemented easily, layered on -top of local storage. Now you could start adding functionality -- real -encryption, perhaps -- and inevitably having to debug it. You could do that the - old-school way, with `gf_log` (preferred) or even plain old `printf`, or you -could run daemons under `gdb` instead. Alternatively, you could wait for the -next Translator 101 post, where we'll be doing exactly that. - -Debugging a Translator ----------------------- - -Now that we've learned what a translator looks like and how to build one, it's -time to run one and actually watch it work. The best way to do this is good -old-fashioned `gdb`, as follows (using some of the examples from last time). - -``` -xlator_example# gdb glusterfs -GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6) -... -(gdb) r --debug -f my.vol /srv/import -Starting program: /usr/sbin/glusterfs --debug -f my.vol /srv/import -... -[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init] - 0-glusterfs-fuse: FUSE inited with protocol versions: - glusterfs 7.13 kernel 7.13 -``` - -If you get to this point, your glusterfs client process is already running. You -can go to another window to see the mountpoint, do file operations, etc. - -``` -~# df /srv/import -Filesystem 1K-blocks Used Available Use% Mounted on -/root/xlator_example/my.vol - 114506240 2643968 106045568 3% /srv/import -~# ls /srv/import -a_file -~# cat /srv/import/a_file -hello -``` - -Now let's interrupt the process and see where we are. - -``` -^C -Program received signal SIGINT, Interrupt. 
-0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 -(gdb) info threads - 5 Thread 0x7fffeffff700 (LWP 27206) 0x0000003a002dd8c7 - in readv () - from /lib64/libc.so.6 - 4 Thread 0x7ffff50e3700 (LWP 27205) 0x0000003a0060b75b - in pthread_cond_timedwait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 - 3 Thread 0x7ffff5f02700 (LWP 27204) 0x0000003a0060b3dc - in pthread_cond_wait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 - 2 Thread 0x7ffff6903700 (LWP 27203) 0x0000003a0060f245 - in sigwait () - from /lib64/libpthread.so.0 -* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc - in pthread_cond_wait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 -``` - -Like any non-toy server, this one has multiple threads. What are they all -doing? Honestly, even I don't know. Thread 1 turns out to be in -`event_dispatch_epoll`, which means it's the one handling all of our network -I/O. Note that with socket multi-threading patch this will change, with one -thread in `socket_poller` per connection. Thread 2 is in `glusterfs_sigwaiter` -which means signals will be isolated to that thread. Thread 3 is in -`syncenv_task`, so it's a worker process for synchronous requests such as -those used by the rebalance and repair code. Thread 4 is in -`janitor_get_next_fd`, so it's waiting for a chance to close no-longer-needed -file descriptors on the local filesystem. (I admit I had to look that one up, -BTW.) Lastly, thread 5 is in `fuse_thread_proc`, so it's the one fetching -requests from our FUSE interface. You'll often see many more threads than -this, but it's a pretty good basic set. Now, let's set a breakpoint so we can -actually watch a request. - -``` -(gdb) b rot13_writev -Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119. -(gdb) c -Continuing. -``` - -At this point we go into our other window and do something that will involve a write. 
- -``` -~# echo goodbye > /srv/import/another_file -(back to the first window) -[Switching to Thread 0x7fffeffff700 (LWP 27206)] - -Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440, - fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, offset=0, - iobref=0x7fffe8001070) at rot-13.c:119 -119 rot_13_private_t *priv = (rot_13_private_t *)this->private; -``` - -Remember how we built with debugging symbols enabled and no optimization? That -will be pretty important for the next few steps. As you can see, we're in -`rot13_writev`, with several parameters. - -* `frame` is our always-present frame pointer for this request. Also, - `frame->local` will point to any local data we created and attached to the - request ourselves. -* `this` is a pointer to our instance of the `rot-13` translator. You can examine - it if you like to see the name, type, options, parent/children, inode table, - and other stuff associated with it. -* `fd` is a pointer to a file-descriptor *object* (`fd_t`, not just a - file-descriptor index which is what most people use "fd" for). This in turn - points to an inode object (`inode_t`) and we can associate our own - `rot-13`-specific data with either of these. -* `vector` and `count` together describe the data buffers for this write, which - we'll get to in a moment. -* `offset` is the offset into the file at which we're writing. -* `iobref` is a buffer-reference object, which is used to track the life cycle - of buffers containing read/write data. If you look closely, you'll notice that - `vector[0].iov_base` points to the same address as `iobref->iobrefs[0].ptr`, which - should give you some idea of the inter-relationships between vector and iobref. - -OK, now what about that `vector`? We can use it to examine the data being -written, like this. 
- -``` -(gdb) p vector[0] -$2 = {iov_base = 0x7ffff7936000, iov_len = 8} -(gdb) x/s 0x7ffff7936000 -0x7ffff7936000: "goodbye\n" -``` - -It's not always safe to view this data as a string, because it might just as -well be binary data, but since we're generating the write this time it's safe -and convenient. With that knowledge, let's step through things a bit. - -``` -(gdb) s -120 if (priv->encrypt_write) -(gdb) -121 rot13_iovec (vector, count); -(gdb) -rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 -57 for (i = 0; i < count; i++) { -(gdb) -58 rot13 (vector[i].iov_base, vector[i].iov_len); -(gdb) -rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45 -45 for (i = 0; i < len; i++) { -(gdb) -46 if (buf[i] >= 'a' && buf[i] <= 'z') -(gdb) -47 buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26); -``` - -Here we've stepped into `rot13_iovec`, which iterates through our vector -calling `rot13`, which in turn iterates through the characters in that chunk -doing the `rot-13` operation if/as appropriate. This is pretty straightforward -stuff, so let's skip to the next interesting bit. - -``` -(gdb) fin -Run till exit from #0 rot13 (buf=0x7ffff7936000 "goodbye\n", - len=8) at rot-13.c:47 -rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 -57 for (i = 0; i < count; i++) { -(gdb) fin -Run till exit from #0 rot13_iovec (vector=0x7fffe8000cd8, - count=1) at rot-13.c:57 -rot13_writev (frame=0x7ffff6e4402c, this=0x638440, - fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, - offset=0, iobref=0x7fffe8001070) at rot-13.c:123 -123 STACK_WIND (frame, -(gdb) b 129 -Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129. -(gdb) b rot13_writev_cbk -Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106. -(gdb) c -``` - -So we've set breakpoints on both the callback and the statement following the -`STACK_WIND`. Which one will we hit first? 
- -``` -Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c, - cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, - prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) - at rot-13.c:106 -106 STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, - prebuf, postbuf); -(gdb) bt -#0 rot13_writev_cbk (frame=0x7ffff6e4402c, - cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, - prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) - at rot-13.c:106 -#1 0x00007ffff52f1b37 in posix_writev (frame=0x7ffff6e440d8, - this=<value optimized out>, fd=<value optimized out>, - vector=<value optimized out>, count=1, - offset=<value optimized out>, iobref=0x7fffe8001070) - at posix.c:2217 -#2 0x00007ffff50e513e in rot13_writev (frame=0x7ffff6e4402c, - this=0x638440, fd=0x7ffff409802c, vector=0x7fffe8000cd8, - count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123 -``` - -Surprise! We're in `rot13_writev_cbk` now, called (indirectly) while we're -still in `rot13_writev` before `STACK_WIND` returns (still at rot-13.c:123). If - you did any request cleanup here, then you need to be careful about what you -do in the remainder of `rot13_writev` because data may have been freed etc. -It's tempting to say you should just do the cleanup in `rot13_writev` after -the `STACK_WIND`, but that's not valid because it's also possible that some -other translator returned without calling `STACK_UNWIND` -- i.e. before -`rot13_writev_cbk` is called, so then it would be the one getting null-pointer -errors instead. To put it another way, the callback and the return from -`STACK_WIND` can occur in either order or even simultaneously on different -threads. Even if you were to use reference counts, you'd have to make sure to -use locking or atomic operations to avoid races, and it's not worth it.
Unless -you *really* understand the possible flows of control and know what you're -doing, it's better to do cleanup in the callback and nothing after -`STACK_WIND`. - -At this point all that's left is a `STACK_UNWIND` and a return. The -`STACK_UNWIND` invokes our parent's completion callback, and in this case our -parent is FUSE so at that point the VFS layer is notified of the write being -complete. Finally, we return through several levels of normal function calls -until we come back to `fuse_thread_proc`, which waits for the next request. - -So that's it. For extra fun, you might want to repeat this exercise by stepping -through some other call -- stat or setxattr might be good choices -- but you'll - have to use a translator that actually implements those calls to see much -that's interesting. Then you'll pretty much know everything I knew when I -started writing my first for-real translators, and probably even a bit more. I -hope you've enjoyed this series, or at least found it useful, and if you have -any suggestions for other topics I should cover please let me know (via -comments or email, IRC or Twitter). diff --git a/doc/hacker-guide/en-US/markdown/unittest.md b/doc/hacker-guide/en-US/markdown/unittest.md deleted file mode 100644 index 5c6c0a8a039..00000000000 --- a/doc/hacker-guide/en-US/markdown/unittest.md +++ /dev/null @@ -1,228 +0,0 @@ -# Unit Tests in GlusterFS - -## Overview -[Art-of-unittesting][definitionofunittest] provides a good definition for unit tests. A good unit test is: - -* Able to be fully automated -* Has full control over all the pieces running (Use mocks or stubs to achieve this isolation when needed) -* Can be run in any order if part of many other tests -* Runs in memory (no DB or File access, for example) -* Consistently returns the same result (You always run the same test, so no random numbers, for example.
save those for integration or range tests) -* Runs fast -* Tests a single logical concept in the system -* Readable -* Maintainable -* Trustworthy (when you see its result, you don’t need to debug the code just to be sure) - -## cmocka -GlusterFS unit test framework is based on [cmocka][]. cmocka provides -developers with methods to isolate and test modules written in C language. It -also provides integration with Jenkins by providing JUnit XML compliant unit -test results. - -cmocka - -## Running Unit Tests -To execute the unit tests, all you need is to type `make check`. Here is a step-by-step example assuming you just cloned a GlusterFS tree: - -``` -$ ./autogen.sh -$ ./configure --enable-debug -$ make check -``` - -Sample output: - -``` -PASS: mem_pool_unittest -============================================================================ -Testsuite summary for glusterfs 3git -============================================================================ -# TOTAL: 1 -# PASS: 1 -# SKIP: 0 -# XFAIL: 0 -# FAIL: 0 -# XPASS: 0 -# ERROR: 0 -============================================================================ -``` - -In this example, `mem_pool_unittest` has multiple tests inside, but `make check` assumes that the program itself is the test, and that is why it only shows one test. Here is the output when we run `mem_pool_unittest` directly: - -``` -$ ./libglusterfs/src/mem_pool_unittest -[==========] Running 10 test(s). 
-[ RUN ] test_gf_mem_acct_enable_set -Expected assertion data != ((void *)0) occurred -[ OK ] test_gf_mem_acct_enable_set -[ RUN ] test_gf_mem_set_acct_info_asserts -Expected assertion xl != ((void *)0) occurred -Expected assertion size > ((4 + sizeof (size_t) + sizeof (xlator_t *) + 4 + 8) + 8) occurred -Expected assertion type <= xl->mem_acct.num_types occurred -[ OK ] test_gf_mem_set_acct_info_asserts -[ RUN ] test_gf_mem_set_acct_info_memory -[ OK ] test_gf_mem_set_acct_info_memory -[ RUN ] test_gf_calloc_default_calloc -[ OK ] test_gf_calloc_default_calloc -[ RUN ] test_gf_calloc_mem_acct_enabled -[ OK ] test_gf_calloc_mem_acct_enabled -[ RUN ] test_gf_malloc_default_malloc -[ OK ] test_gf_malloc_default_malloc -[ RUN ] test_gf_malloc_mem_acct_enabled -[ OK ] test_gf_malloc_mem_acct_enabled -[ RUN ] test_gf_realloc_default_realloc -[ OK ] test_gf_realloc_default_realloc -[ RUN ] test_gf_realloc_mem_acct_enabled -[ OK ] test_gf_realloc_mem_acct_enabled -[ RUN ] test_gf_realloc_ptr -Expected assertion ((void *)0) != ptr occurred -[ OK ] test_gf_realloc_ptr -[==========] 10 test(s) run. -[ PASSED ] 10 test(s). -[ FAILED ] 0 test(s). -[ REPORT ] Created libglusterfs_mem_pool_xunit.xml report -``` - - -## Writing Unit Tests - -### Enhancing your C functions - -#### Programming by Contract -Add the following to your C file: - -```c -#include <cmocka_pbc.h> -``` - -```c -/* - * Programming by Contract is a programming methodology - * which binds the caller and the function called to a - * contract. The contract is represented using Hoare Triple: - * {P} C {Q} - * where {P} is the precondition before executing command C, - * and {Q} is the postcondition. 
- * - * See also: - * http://en.wikipedia.org/wiki/Design_by_contract - * http://en.wikipedia.org/wiki/Hoare_logic - * http://dlang.org/dbc.html - */ - #ifndef CMOCKERY_PBC_H_ -#define CMOCKERY_PBC_H_ - -#if defined(UNIT_TESTING) || defined (DEBUG) - -#include <assert.h> - -/* - * Checks caller responsibility against contract - */ -#define REQUIRE(cond) assert(cond) - -/* - * Checks function responsibility against contract. - */ -#define ENSURE(cond) assert(cond) - -/* - * While REQUIRE and ENSURE apply to functions, INVARIANT - * applies to classes/structs. It ensures that instances - * of the class/struct are consistent. In other words, - * that the instance has not been corrupted. - */ -#define INVARIANT(invariant_fnc) do { (invariant_fnc); } while (0) - -#else -#define REQUIRE(cond) do { } while (0) -#define ENSURE(cond) do { } while (0) -#define INVARIANT(invariant_fnc) do { } while (0) - -#endif /* defined(UNIT_TESTING) || defined (DEBUG) */ -#endif /* CMOCKERY_PBC_H_ */ -``` - -##### Example -This is an _extremely_ simple example: - -```c -int divide (int n, int d) -{ - int ans; - - REQUIRE(d != 0); - - ans = n / d; - - // As code is added to this function throughout its lifetime, - // ENSURE will assert that data will be returned - // according to the contract. Again this is an - // extremely simple example. :-D - ENSURE( ans == (n / d) ); - - return ans; -} - -``` - -##### Important Note -`REQUIRE`, `ENSURE`, and `INVARIANT` are only available when `DEBUG` or `UNIT_TESTING` is set in the CFLAGS. You must pass `--enable-debug` to `./configure` to enable PBC on your non-unittest builds. - -#### Overriding functions -cmocka provides its own memory allocation functions which check for buffer overrun and memory leaks. The following header file must be included **last** to be able to override any of the memory allocation functions: - -```c -#include <cmocka.h> -``` - -This file will only take effect when the `UNIT_TESTING` CFLAG is set.
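The `divide` example in this section only exercises `REQUIRE` and `ENSURE`. As a closing sketch for the contract macros, here is one way `INVARIANT` might be used with a struct. Everything in it is an assumption for illustration: `bounded_counter` and its functions are hypothetical and do not exist in GlusterFS, and the three macros are redefined locally on top of `assert` so that the block compiles on its own instead of including `cmocka_pbc.h`.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the cmocka_pbc.h macros (assumption: debug build). */
#define REQUIRE(cond) assert (cond)
#define ENSURE(cond) assert (cond)
#define INVARIANT(invariant_fnc) do { (invariant_fnc); } while (0)

/* Hypothetical struct: a counter that must stay within [0, limit]. */
struct bounded_counter {
        int value;
        int limit;
};

/* Invariant function: asserts that the instance is still consistent. */
static void
bounded_counter_invariant (const struct bounded_counter *c)
{
        assert (c != NULL);
        assert (c->limit >= 0);
        assert (c->value >= 0 && c->value <= c->limit);
}

/* Adds n to the counter, saturating at the limit; returns the new value. */
static int
bounded_counter_add (struct bounded_counter *c, int n)
{
        int old;

        /* Caller's side of the contract. */
        REQUIRE (c != NULL);
        REQUIRE (n >= 0);
        INVARIANT (bounded_counter_invariant (c));

        old = c->value;
        /* limit - old cannot overflow because 0 <= old <= limit. */
        c->value = (n > c->limit - old) ? c->limit : old + n;

        /* Function's side of the contract. */
        ENSURE (c->value >= old);
        INVARIANT (bounded_counter_invariant (c));

        return c->value;
}
```

Checking the invariant on entry and again on exit means a corrupted instance is caught at the call that first sees it, rather than somewhere downstream.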
- -### Creating a unit test -Once you identify the C file you would like to test, first create a `unittest` directory under the directory where the C file is located. This keeps the unit tests in a separate directory. - -Next, you need to edit the `Makefile.am` file in the directory where your C file is located. Add the following sections to -`Makefile.am` if it does not already have them: - -``` -#### UNIT TESTS ##### -CLEANFILES += *.gcda *.gcno *_xunit.xml -noinst_PROGRAMS = -TESTS = -``` - -Now you can add the following for each of the unit tests that you would like to build: - -``` -### UNIT TEST xxx_unittest ### -xxx_unittest_CPPFLAGS = $(xxx_CPPFLAGS) -xxx_unittest_SOURCES = xxx.c \ - unittest/xxx_unittest.c -xxx_unittest_CFLAGS = $(UNITTEST_CFLAGS) -xxx_unittest_LDFLAGS = $(UNITTEST_LDFLAGS) -noinst_PROGRAMS += xxx_unittest -TESTS += xxx_unittest -``` - -Here `xxx` is the name of your C file. For an example, look at `libglusterfs/src/Makefile.am`. - -Copy the simple unit test from the [cmocka API][cmockaapi] to `unittest/xxx_unittest.c`. If you would like to see an example of a unit test, please refer to `libglusterfs/src/unittest/mem_pool_unittest.c`. - -#### Mocking -The linker may complain about missing functions needed by the C file you would like to test. Identify the required functions, place their stubs in a file called `unittest/xxx_mock.c`, then add this file to `xxx_unittest_SOURCES` in `Makefile.am`. This will allow you to use Cmockery2's mocking functions. - -#### Running the unit test -You can type `make` in the directory where the C file is located. Once it builds without errors, you can execute the test either by running the program directly (in our example above it is called `xxx_unittest`), or by running `make check`. - -#### Debugging -Sometimes you may need to debug your unit test. To do that, you will have to point `gdb` to the binary, which is located in the same directory as the source.
For example, you can do the following from the root of the source tree to debug `mem_pool_unittest`: - -``` -$ gdb libglusterfs/src/mem_pool_unittest -``` - - -[cmocka]: https://cmocka.org -[definitionofunittest]: http://artofunittesting.com/definition-of-a-unit-test/ -[cmockaapi]: https://api.cmocka.org diff --git a/doc/hacker-guide/en-US/markdown/write-behind.md b/doc/hacker-guide/en-US/markdown/write-behind.md deleted file mode 100644 index 0d78964fa20..00000000000 --- a/doc/hacker-guide/en-US/markdown/write-behind.md +++ /dev/null @@ -1,56 +0,0 @@ -performance/write-behind translator -=================================== - -Basic working --------------- - -Write-behind is a translator that tells the application its write requests -have finished before they have actually completed. - -On a regular translator tree without write-behind, control flow is like this: - -1. application makes a `write()` system call. -2. VFS ==> FUSE ==> `/dev/fuse`. -3. fuse-bridge initiates a glusterfs `writev()` call. -4. `writev()` is `STACK_WIND()`ed up to client-protocol or storage translator. -5. client-protocol, on receiving reply from server, starts `STACK_UNWIND()` towards the fuse-bridge. - -On a translator tree with write-behind, control flow is like this: - -1. application makes a `write()` system call. -2. VFS ==> FUSE ==> `/dev/fuse`. -3. fuse-bridge initiates a glusterfs `writev()` call. -4. `writev()` is `STACK_WIND()`ed up to write-behind translator. -5. write-behind adds the write buffer to its internal queue and does a `STACK_UNWIND()` towards the fuse-bridge. - -The write call is complete from the application's perspective. After -`STACK_UNWIND()`ing towards the fuse-bridge, write-behind initiates a fresh -`writev()` call to its child translator, whose replies are consumed by -write-behind itself. Write-behind _doesn't_ cache the write buffer unless -`option flush-behind on` is specified in the volume specification file.
- -Windowing ---------- - -With respect to write-behind, each write-buffer has three flags: `stack_wound`, `write_behind` and `got_reply`. - -* `stack_wound`: if set, indicates that write-behind has initiated `STACK_WIND()` towards child translator. -* `write_behind`: if set, indicates that write-behind has done `STACK_UNWIND()` towards fuse-bridge. -* `got_reply`: if set, indicates that write-behind has received reply from child translator for a `writev()` `STACK_WIND()`. A request is destroyed by write-behind only if this flag is set. - -Currently pending write requests = aggregate size of requests with `write_behind` = 1 and `got_reply` = 0. - -The window size limits the aggregate size of currently pending write requests. Once -the pending requests' size has reached the window size, write-behind blocks -`writev()` calls from fuse-bridge. Blocking is only from the application's -perspective. Write-behind does `STACK_WIND()` to the child translator -straight away, but holds back the `STACK_UNWIND()` towards fuse-bridge. -`STACK_UNWIND()` is done only once write-behind gets enough replies to -accommodate the currently blocked request. - -Flush behind ------------- - -If `option flush-behind on` is specified in the volume specification file, then -write-behind sends aggregate write requests to the child translator, instead of -regular per-request `STACK_WIND()`s. |
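The windowing rule described above can be sketched as a small accounting model. This is an illustration only, not the translator's actual code: `struct wb_request_model`, `wb_pending_size` and `wb_can_unwind_early` are hypothetical names, and the admission test (pending size plus the new request must fit within the window) is an assumed reading of "enough replies to accommodate the currently blocked request".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one queued write request; the field names mirror
 * the three flags described in the text, not the real data structure. */
struct wb_request_model {
    size_t size;          /* bytes carried by this write             */
    bool   stack_wound;   /* STACK_WIND() issued to child translator */
    bool   write_behind;  /* STACK_UNWIND() already sent upwards     */
    bool   got_reply;     /* child translator has replied            */
};

/* Aggregate size of requests with write_behind = 1 and got_reply = 0. */
static size_t wb_pending_size(const struct wb_request_model *reqs, size_t n)
{
    size_t total = 0;
    size_t i;

    assert(reqs != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (reqs[i].write_behind && !reqs[i].got_reply)
            total += reqs[i].size;
    }
    return total;
}

/* Assumed rule: a new request may be unwound early only if it still fits
 * in the window; otherwise its STACK_UNWIND() is held back. */
static bool wb_can_unwind_early(const struct wb_request_model *reqs, size_t n,
                                size_t new_size, size_t window_size)
{
    return wb_pending_size(reqs, n) + new_size <= window_size;
}
```

In this model a request that has been wound but not yet unwound (`write_behind` unset) does not count towards the window, which matches the definition of pending requests given in the text.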