author    Humble Devassy Chirammal <hchiramm@redhat.com>  2015-03-30 12:21:05 +0530
committer Kaleb KEITHLEY <kkeithle@redhat.com>  2015-03-30 05:51:12 -0700
commit    8907f67ba215172b01a7018adcbb063fcc4570e9 (patch)
tree      68cf6557990ae4215926068625a7b5e0e1374882 /doc/developer-guide/data-structures
parent    e3bd2387a5973df4548fe4a62b5dfc227a2bdc64 (diff)
doc: restructure developer docs to new layout
The developer-oriented information is scattered in the source and it is very difficult to identify it. With this patch, subdirs are created under developer-guide which will be the parent for developer notes. The changes suggested in http://review.gluster.org/#/c/8827/ are also included in this patch.

Change-Id: I4c8510d52c49f4066225f72cac8f97f087d6c70c
BUG: 1206539
Signed-off-by: Humble Devassy Chirammal <hchiramm@redhat.com>
Reviewed-on: http://review.gluster.org/10038
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Lalatendu Mohanty <lmohanty@redhat.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Diffstat (limited to 'doc/developer-guide/data-structures')
-rw-r--r--  doc/developer-guide/data-structures/inode.md    | 226
-rw-r--r--  doc/developer-guide/data-structures/iobuf.md    | 259
-rw-r--r--  doc/developer-guide/data-structures/mem-pool.md | 124
3 files changed, 609 insertions, 0 deletions
diff --git a/doc/developer-guide/data-structures/inode.md b/doc/developer-guide/data-structures/inode.md
new file mode 100644
index 00000000000..a340ab9ca8e
--- /dev/null
+++ b/doc/developer-guide/data-structures/inode.md
@@ -0,0 +1,226 @@
+#Inode and dentry management in GlusterFS
+
+##Background
+Filesystems internally refer to files and directories via inodes. Inodes
+are unique identifiers of the entities stored in a filesystem. Whenever an
+application has to operate on a file/directory (read/modify), the filesystem
+maps that file/directory to the right inode and starts referring to that
+inode whenever an operation has to be performed on the file/directory.
+
+In GlusterFS a new inode gets created whenever a new file/directory is created
+OR when a successful lookup is done on a file/directory for the first time.
+Inodes in GlusterFS are maintained by the inode table, which is initialized
+when the filesystem daemon is started (both for the brick process as well as
+the mount process). Below are some important data structures for inode
+management.
+
+## Data-structure (inode-table)
+```
+struct _inode_table {
+ pthread_mutex_t lock;
+ size_t hashsize; /* bucket size of inode hash and dentry hash */
+ char *name; /* name of the inode table, just for gf_log() */
+ inode_t *root; /* root directory inode, with inode
+ number and gfid 1 */
+ xlator_t *xl; /* xlator to be called to do purge and
+ the xlator which maintains the inode table*/
+ uint32_t lru_limit; /* maximum LRU cache size */
+ struct list_head *inode_hash; /* buckets for inode hash table */
+ struct list_head *name_hash; /* buckets for dentry hash table */
+ struct list_head active; /* list of inodes currently active (in an fop) */
+ uint32_t active_size; /* count of inodes in active list */
+ struct list_head lru; /* list of inodes recently used.
+ lru.next most recent */
+ uint32_t lru_size; /* count of inodes in lru list */
+ struct list_head purge; /* list of inodes to be purged soon */
+ uint32_t purge_size; /* count of inodes in purge list */
+
+ struct mem_pool *inode_pool; /* memory pool for inodes */
+ struct mem_pool *dentry_pool; /* memory pool for dentrys */
+ struct mem_pool *fd_mem_pool; /* memory pool for fd_t */
+ int ctxcount; /* number of slots in inode->ctx */
+};
+```
+
+##Life-cycle
+```
+inode_table_new (size_t lru_limit, xlator_t *xl)
+```
+
+This function allocates a new inode table. Usually the top xlators in the
+graph, such as protocol/server (for bricks), fuse and nfs (for fuse and nfs
+mounts) and libgfapi, do inode management. Hence they are the ones which
+allocate a new inode table by calling the above function.
+
+Each xlator graph in glusterfs maintains an inode table. So in fuse clients,
+whenever there is a graph change due to add brick/remove brick or
+addition/removal of some other xlators, a new graph is created, which creates
+a new inode table.
+
+Thus an allocated inode table is destroyed only when the filesystem daemon is
+killed or the filesystem is unmounted.
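+
+As a hedged sketch (not actual GlusterFS source), a top xlator might set up
+its table at init time as below; the lru limit of 16384 matches the
+brick-side default mentioned later in this document:
+
+```
+static int
+sample_init (xlator_t *this)
+{
+        /* allocate the table this xlator will resolve inodes against */
+        inode_table_t *itable = inode_table_new (16384, this);
+        if (!itable)
+                return -1;
+        this->itable = itable;
+        return 0;
+}
+```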
+
+##What it contains
+
+The inode table in glusterfs mainly contains a hash table for maintaining
+inodes. In general a file/directory is considered to exist if there is a
+corresponding inode present in the inode table. If an inode for a
+file/directory cannot be found in the inode table, glusterfs tries to resolve
+it by sending a lookup on the entry for which the inode is needed. If the
+lookup is successful, then a new inode corresponding to the entry is added to
+the hash table present in the inode table. Thus an inode present in the
+hash table means it is an existing file/directory within the filesystem. The
+inode table also contains the hash size of the hash table (as of now it is
+hard coded to 14057; the hash value of an inode is calculated using its
+gfid).
+
+Apart from the hash table, the inode table also maintains 3 important lists
+of inodes:
+1) Active list:
+The active list contains all the active inodes (i.e. inodes which are
+currently part of some fop).
+2) LRU list:
+The list of least recently used inodes. A limit can be set for the size of
+the lru list. For bricks it is 16384 and for clients it is unlimited.
+3) Purge list:
+The list of all the inodes which have to be purged (i.e. inodes which have to
+be deleted from the inode table due to unlink/rmdir/forget).
+
+Finally, it also contains the mem-pools for allocating inodes and dentries,
+so that frequent malloc/calloc and free of these data structures can be
+avoided.
+
+##Data structure (inode)
+```
+struct _inode {
+ inode_table_t *table; /* the table this inode belongs to */
+ uuid_t gfid; /* unique identifier of the inode */
+ gf_lock_t lock;
+ uint64_t nlookup;
+ uint32_t fd_count; /* Open fd count */
+ uint32_t ref; /* reference count on this inode */
+ ia_type_t ia_type; /* what kind of file */
+ struct list_head fd_list; /* list of open files on this inode */
+ struct list_head dentry_list; /* list of directory entries for this inode */
+ struct list_head hash; /* hash table pointers */
+ struct list_head list; /* active/lru/purge */
+
+ struct _inode_ctx *_ctx; /* place holder for keeping the
+ information about the inode by different xlators */
+};
+```
+
+As said above, inodes are the internal way of identifying files/directories.
+An inode uniquely represents a file/directory. A new inode is created
+whenever a create/mkdir/symlink/mknod operation is performed. Apart from
+that, a new inode is created upon the successful fresh lookup of a
+file/directory. Say the filesystem contained some file "a" within root and
+the filesystem was unmounted. Now when glusterfs is mounted and some
+operation is performed on "/a", glusterfs tries to get the inode for the
+entry "a" with the parent inode as root. But, since glusterfs just came up,
+it will not be able to find the inode for "a" and will send a lookup on
+"/a". If the lookup operation succeeds (i.e. the root of glusterfs contains
+an entry called "a"), then a new inode for "/a" is created and added to the
+inode table.
+
+Depending upon the situation, an inode can be in one of the 3 lists
+maintained by the inode table. If some fop is happening on the inode, then
+the inode will be present in the active inodes list maintained by the inode
+table. Active inodes are those inodes whose refcount is greater than zero.
+Whenever some operation comes on a file/directory, and the resolver tries to
+find the inode for it, it increments the refcount of the inode before
+returning the inode. The refcount of an inode can be incremented by calling
+the below function:
+
+`inode_ref (inode_t *inode)`
+
+Any xlator which wants to operate on an inode as part of some fop (or wants
+the inode in the callback) should hold a ref on the inode. Once the fop is
+completed, before sending the reply of the fop to the layers above, the
+inode has to be unrefed. When the refcount of an inode becomes zero, it is
+removed from the active inodes list and put into the LRU list maintained by
+the inode table. Thus, in short, if some fop is happening on a
+file/directory the corresponding inode will be in the active list; otherwise
+it will be in the LRU list.
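+
+A hedged sketch of this ref/unref discipline (not actual xlator code):
+
+```
+void
+sample_fop_path (inode_t *inode)
+{
+        inode_ref (inode);   /* keeps the inode on the active list */
+        /* ... wind the fop and use the inode in the callback ... */
+        inode_unref (inode); /* refcount 0 moves it to the LRU list */
+}
+```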
+
+##Life Cycle
+
+A new inode is created whenever a new file/directory/symlink is created OR a
+successful lookup of an existing entry is done. The xlators which do inode
+management (as of now protocol/server, fuse, nfs, gfapi) perform the
+inode_link operation upon successful lookup or successful creation of a new
+entry:
+
+inode_link (inode_t *inode, inode_t *parent, const char *name,
+ struct iatt *buf);
+
+inode_link actually adds the inode to the inode table (to be precise, it
+adds the inode to the hash table maintained by the inode table; the hash
+value is calculated based on the gfid), copies the gfid to the inode (the
+gfid is present in the iatt structure), and creates a dentry with the new
+name.
+
+An inode is removed from the inode table and eventually destroyed when an
+unlink or rmdir operation is performed on a file/directory, or when the lru
+limit of the inode table has been exceeded.
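+
+As an illustrative sketch (the helper below is hypothetical; in the real
+xlators the parent inode and name come from the loc_t of the fop being
+resolved), a lookup callback might link a fresh inode like this:
+
+```
+static void
+link_on_lookup (inode_t *inode, inode_t *parent, const char *name,
+                struct iatt *buf)
+{
+        /* inode_link returns the linked inode with a reference held */
+        inode_t *linked = inode_link (inode, parent, name, buf);
+        if (linked)
+                inode_unref (linked);
+}
+```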
+
+##Data structure (dentry)
+```
+struct _dentry {
+        struct list_head inode_list;   /* list of dentries of inode */
+        struct list_head hash;         /* hash table pointers */
+        inode_t         *inode;        /* inode of this directory entry */
+        char            *name;         /* name of the directory entry */
+        inode_t         *parent;       /* directory of the entry */
+};
+```
+
+A dentry is the presence of an entry for a file/directory within its parent
+directory. A dentry usually points to the inode to which it belongs. In
+glusterfs a dentry contains the following fields:
+1) A hook by which it adds itself to the list of dentries maintained by the
+inode to which it points.
+2) A hash table pointer.
+3) A pointer to the inode to which it belongs.
+4) The name of the dentry.
+5) A pointer to the inode of the parent directory in which the dentry is
+present.
+
+A new dentry is created when a new file/directory/symlink is created or a
+hard link to an existing file is created:
+
+`__dentry_create (inode_t *inode, inode_t *parent, const char *name)`
+
+A dentry holds a refcount on the parent inode so that the parent inode is
+never removed from the active inodes list and put on the lru list (if the
+lru limit of the lru list is exceeded, there is a chance of the parent inode
+being destroyed; to avoid this, dentries hold a reference to the parent
+inode). A dentry is removed whenever an unlink/rmdir is performed on a
+file/directory, or when the lru limit has been exceeded and the oldest
+inodes are purged out of the inode table, during which all the dentries of
+those inodes are removed.
+
+Whenever an unlink/rmdir comes on a file/directory, the corresponding inode
+should be removed from the inode table. So upon unlink/rmdir, the inode is
+moved to the purge list maintained by the inode table and from there it is
+destroyed. To be more specific, for an inode to be destroyed, its refcount
+and nlookup count should both become 0. For the refcount to become 0, the
+inode should not be part of any fop (and there should not be any open fds).
+If the inode belongs to a directory, then there should not be any fop
+happening on the directory and it should not contain any dentries within it.
+For the nlookup count to become zero, a forget has to be sent on the inode
+with nlookup count set to 0 as an argument. For fuse clients, forget is sent
+by the kernel itself whenever an unlink/rmdir is performed; but for brick
+processes, upon unlink/rmdir, the protocol/server itself has to do
+inode_forget. Whenever an inode has to be deleted due to file removal or the
+lru limit being exceeded, the inode is retired (i.e. all the dentries of the
+inode are deleted and the inode is moved to the purge list maintained by the
+inode table) and its nlookup count is set to 0 via the inode_forget API. The
+inode table then prunes all the inodes from the purge list, destroying the
+inode contexts maintained by each xlator.
+
+Unlinking of the dentry is done via inode_unlink:
+
+`void inode_unlink (inode_t *inode, inode_t *parent, const char *name)`
+
+If the inode has multiple hard links, then the unlink operation performed by
+the application results only in the removal of the dentry with the name
+provided by the application. For the inode to be removed, all the dentries
+of the inode should be unlinked.
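+
+For example (a hedged sketch of the hard-link case described above),
+removing one name of a file with two links only drops that dentry:
+
+```
+/* "b" is a hard link to "a"; the inode survives this call and is
+ * destroyed only after "b" is also unlinked and the inode is forgotten */
+inode_unlink (inode, parent, "a");
+```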
+
diff --git a/doc/developer-guide/data-structures/iobuf.md b/doc/developer-guide/data-structures/iobuf.md
new file mode 100644
index 00000000000..5f521f1485f
--- /dev/null
+++ b/doc/developer-guide/data-structures/iobuf.md
@@ -0,0 +1,259 @@
+#Iobuf-pool
+##Datastructures
+###iobuf
+Short for IO buffer. It is one allocatable unit for the consumers of the
+IOBUF API; each unit hosts @page_size (defined in the arena structure) bytes
+of memory. As an initial step of processing a fop, the IO buffer passed onto
+GlusterFS by other applications (FUSE VFS / applications using gfapi) is
+copied into GlusterFS space, i.e. iobufs. Hence iobufs are mostly
+allocated/deallocated in the fuse, gfapi and protocol xlators, and also in
+performance xlators to cache IO buffers, etc.
+```
+struct iobuf {
+ union {
+ struct list_head list;
+ struct {
+ struct iobuf *next;
+ struct iobuf *prev;
+ };
+ };
+ struct iobuf_arena *iobuf_arena;
+
+ gf_lock_t lock; /* for ->ptr and ->ref */
+ int ref; /* 0 == passive, >0 == active */
+
+ void *ptr; /* usable memory region by the consumer */
+
+ void *free_ptr; /* in case of stdalloc, this is the
+ one to be freed not the *ptr */
+};
+```
+
+###iobref
+A single fop may need multiple iobufs, as in vectored read/write.
+Hence multiple iobufs (16 by default) are encapsulated under one iobref.
+```
+struct iobref {
+ gf_lock_t lock;
+ int ref;
+ struct iobuf **iobrefs; /* list of iobufs */
+ int alloced; /* 16 by default, grows as required */
+ int used; /* number of iobufs added to this iobref */
+};
+```
+###iobuf_arena
+One region of memory MMAPed from the operating system. Each region MMAPs
+@arena_size bytes of memory and hosts @arena_size / @page_size iobufs.
+Iobufs of the same size are grouped into one arena, for sanity of access.
+
+```
+struct iobuf_arena {
+ union {
+ struct list_head list;
+ struct {
+ struct iobuf_arena *next;
+ struct iobuf_arena *prev;
+ };
+ };
+
+ size_t page_size; /* size of all iobufs in this arena */
+ size_t arena_size; /* this is equal to
+ (iobuf_pool->arena_size / page_size)
+ * page_size */
+ size_t page_count;
+
+ struct iobuf_pool *iobuf_pool;
+
+ void *mem_base;
+ struct iobuf *iobufs; /* allocated iobufs list */
+
+ int active_cnt;
+ struct iobuf active; /* head node iobuf
+ (unused by itself) */
+ int passive_cnt;
+ struct iobuf passive; /* head node iobuf
+ (unused by itself) */
+ uint64_t alloc_cnt; /* total allocs in this pool */
+ int max_active; /* max active buffers at a given time */
+};
+
+```
+###iobuf_pool
+Pool of iobufs. As the filesystem may require many IO buffers, a pool of
+iobufs is preallocated and kept; only if these preallocated ones are
+exhausted is the standard malloc/free path used, thus improving performance.
+The iobuf pool is generally one per process, allocated during
+glusterfs_ctx_t init (glusterfs_ctx_defaults_init); currently the
+preallocated iobuf pool memory is freed on process exit. The iobuf pool is
+globally accessible across GlusterFS, hence iobufs allocated by any xlator
+can be accessed by any other xlator (provided the iobuf is passed along).
+```
+struct iobuf_pool {
+ pthread_mutex_t mutex;
+ size_t arena_size; /* size of memory region in
+ arena */
+ size_t default_page_size; /* default size of iobuf */
+
+ int arena_cnt;
+ struct list_head arenas[GF_VARIABLE_IOBUF_COUNT];
+ /* array of arenas. Each element of the array is a list of arenas
+ holding iobufs of particular page_size */
+
+ struct list_head filled[GF_VARIABLE_IOBUF_COUNT];
+ /* array of arenas without free iobufs */
+
+ struct list_head purge[GF_VARIABLE_IOBUF_COUNT];
+ /* array of arenas which can be purged */
+
+ uint64_t request_misses; /* mostly the requests for higher
+ value of iobufs */
+};
+```
+~~~
+The default size of the iobuf_pool (as of yet):
+  1024 iobufs of 128 Bytes = 128 KB
+   512 iobufs of 512 Bytes = 256 KB
+   512 iobufs of   2 KB    =   1 MB
+   128 iobufs of   8 KB    =   1 MB
+    64 iobufs of  32 KB    =   2 MB
+    32 iobufs of 128 KB    =   4 MB
+     8 iobufs of 256 KB    =   2 MB
+     2 iobufs of   1 MB    =   2 MB
+  Total: ~13 MB
+~~~
+As seen in the data structure, iobuf_pool has 3 arena lists.
+
+- arenas:
+The arenas allocated during iobuf_pool creation are part of this list. This
+list also contains arenas that are partially filled, i.e. contain a few
+active and a few passive iobufs (passive_cnt != 0, active_cnt != 0, except
+for the initially allocated arenas). There will be 8 arenas by default, of
+the sizes mentioned above.
+- filled:
+If all the iobufs in an arena are filled (passive_cnt = 0), the arena is
+moved to the filled list. If any of the iobufs from a filled arena is
+released with iobuf_put, the arena moves back to the 'arenas' list.
+- purge:
+If there are no active iobufs in an arena (active_cnt = 0), the arena is
+moved to the purge list. iobuf_put() triggers destruction of the arenas in
+this list. The arenas in the purge list are destroyed only if there is at
+least one arena of the same page size in the 'arenas' list; that way there
+won't be spurious mmap/unmap of buffers. (e.g. if there is an arena
+(page_size=128KB, count=32) in the purge list, this arena is destroyed
+(munmap) only if there is an arena in 'arenas' list with page_size=128KB.)
+
+##APIs
+###iobuf_get
+
+```
+struct iobuf *iobuf_get (struct iobuf_pool *iobuf_pool);
+```
+Returns an iobuf of the default page size (128KB, hard coded as of yet).
+It also takes a reference (increments the ref count), hence there is no need
+to take one explicitly after getting the iobuf.
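+
+A hedged usage sketch (assuming access to the process-wide pool via the
+glusterfs_ctx_t of the xlator, here `this`):
+
+```
+struct iobuf *iob = iobuf_get (this->ctx->iobuf_pool);
+if (iob) {
+        /* iob->ptr is the usable @page_size-byte region */
+        memcpy (iob->ptr, "payload", 8);
+        iobuf_unref (iob);  /* freed once the ref count drops to 0 */
+}
+```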
+
+###iobuf_get2
+
+```
+struct iobuf * iobuf_get2 (struct iobuf_pool *iobuf_pool, size_t page_size);
+```
+Returns an iobuf of the specified page size; if page_size is 0, the default
+page size is used.
+```
+if (requested iobuf size > max iobuf size in the pool (1MB as of yet))
+ {
+ Perform a standard allocation (CALLOC) of the requested size and
+ add it to the list iobuf_pool->arenas[IOBUF_ARENA_MAX_INDEX].
+ }
+else
+ {
+ - Round up the page size to match the standard sizes in the iobuf
+ pool (e.g. if 3KB is requested, it is rounded to 8KB).
+ - Select the arena list corresponding to the rounded size
+ (e.g. select the 8KB arena).
+ If the selected arena has passive count > 0, then return an
+ iobuf from this arena and set the counters (passive/active/etc.)
+ appropriately.
+ Else the arena is full: allocate a new arena with the rounded
+ size and the standard page count, and add it to the arena list
+ (e.g. 128 iobufs of 8KB are allocated).
+ }
+```
+Also takes a reference(increments ref count), hence no need of doing it
+explicitly after getting iobuf.
+
+###iobuf_ref
+
+```
+struct iobuf *iobuf_ref (struct iobuf *iobuf);
+```
+Takes a reference on the iobuf. If using an iobuf allocated by some other
+xlator/function, it is good practice to take a reference so that the iobuf
+is not deleted by the allocator.
+
+###iobuf_unref
+```
+void iobuf_unref (struct iobuf *iobuf);
+```
+Unreferences the iobuf; once the ref count reaches zero the iobuf is
+considered free.
+
+```
+ - Delete the iobuf, if it was allocated from standard alloc, and return.
+ - Set the active/passive count appropriately.
+ - If passive count > 0 then add the arena to the 'arenas' list.
+ - If active count = 0 then add the arena to the 'purge' list.
+```
+Every iobuf_ref should have a corresponding iobuf_unref, and also every
+iobuf_get/iobuf_get2 should have a corresponding iobuf_unref.
+
+###iobref_new
+```
+struct iobref *iobref_new ();
+```
+Creates a new iobref structure and returns its pointer.
+
+###iobref_ref
+```
+struct iobref *iobref_ref (struct iobref *iobref);
+```
+Takes a reference on the iobref.
+
+###iobref_unref
+```
+void iobref_unref (struct iobref *iobref);
+```
+Decrements the reference count of the iobref. If the ref count reaches 0,
+all the iobufs in the iobref are unreffed (iobuf_unref) and the iobref is
+destroyed.
+
+###iobref_add
+```
+int iobref_add (struct iobref *iobref, struct iobuf *iobuf);
+```
+Adds the given iobuf to the iobref. It takes a ref on the iobuf before
+adding it, hence an explicit iobuf_ref is not required when adding to an
+iobref.
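+
+A hedged sketch of how a vectored fop might bundle its buffers (here `ctx`
+stands for the process-wide glusterfs_ctx_t; error handling is elided):
+
+```
+struct iobref *iobref = iobref_new ();
+struct iobuf  *iob    = iobuf_get2 (ctx->iobuf_pool, 8 * 1024);
+
+iobref_add (iobref, iob);  /* iobref takes its own ref on iob */
+iobuf_unref (iob);         /* drop the ref taken by iobuf_get2 */
+/* ... wind the fop carrying the iobref ... */
+iobref_unref (iobref);     /* unrefs the contained iobufs too */
+```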
+
+###iobref_merge
+```
+int iobref_merge (struct iobref *to, struct iobref *from);
+```
+Adds all the iobufs in the 'from' iobref to the 'to' iobref. Merging does
+not delete the 'from' iobref; it therefore results in another ref on all the
+iobufs added to the 'to' iobref. Hence iobref_unref should be performed on
+both the 'from' and 'to' iobrefs (performing iobref_unref only on 'to' will
+not free the iobufs and may result in a leak).
+
+###iobref_clear
+```
+void iobref_clear (struct iobref *iobref);
+```
+Unreferences all the iobufs in the iobref, and also unrefs the iobref.
+
+##Iobuf Leaks
+If iobuf_ref/iobuf_get calls do not have corresponding iobuf_unref calls,
+the iobufs are not freed and recurring execution of such a code path may
+lead to huge memory leaks. The easiest way to identify whether a memory leak
+is caused by iobufs is to take a statedump. If the statedump shows a lot of
+filled arenas then it is a sure sign of a leak. Refer to
+doc/debugging/statedump.md for more details.
+
+If iobufs are leaking, the next step is to find where the iobuf_unref went
+missing. There is no standard/easy way of debugging this; code reading and
+logs are the only ways. If there is the liberty to reproduce the memory leak
+at will, then logs (gf_log_callingfn) in iobuf_ref/unref might help.
+TODO: An easier way to debug iobuf leaks.
diff --git a/doc/developer-guide/data-structures/mem-pool.md b/doc/developer-guide/data-structures/mem-pool.md
new file mode 100644
index 00000000000..c71aa2a8ddd
--- /dev/null
+++ b/doc/developer-guide/data-structures/mem-pool.md
@@ -0,0 +1,124 @@
+#Mem-pool
+##Background
+There was a time when every fop in glusterfs incurred the cost of
+allocations/de-allocations for every stack wind/unwind between xlators,
+because the stack/frame/*_local_t in every wind/unwind was allocated and
+de-allocated. Because of all these system calls in the fop path there was a
+lot of latency, and the worst part is that most of the time the number of
+frames/stacks active at any time wouldn't cross a threshold. So it was
+decided that this threshold number of frames/stacks would be allocated only
+once, at the beginning of the process. One of them is taken from the pool of
+stacks/frames whenever a `STACK_WIND` is performed and put back into the
+pool in `STACK_UNWIND`/`STACK_DESTROY`, without incurring any extra system
+calls. New data structures are allocated only when the threshold number of
+such items are in active use, i.e. the pool is in complete use. The
+percentage increase in performance once this was added to all the common
+data structures (inode/fd/dict etc.) in xlators throughout the stack was
+tremendous.
+
+## Data structure
+```
+struct mem_pool {
+        struct list_head  list; /* list hook by which a free element is kept
+                                   in the pool's list of free members */
+        int               hot_count; /* mempool elements in active use */
+        int               cold_count; /* mempool elements not in use; new
+                                         allocations are served from here
+                                         until it drops to 0 */
+        gf_lock_t         lock; /* synchronization mechanism */
+        unsigned long     padded_sizeof_type; /* element size after padding
+                                                 with a list hook + mem-pool
+                                                 address + in-use info */
+        void             *pool; /* starting address of the pool */
+        void             *pool_end; /* ending address of the pool; an element
+                                       inside [pool, pool_end] was alloced
+                                       from the pool, otherwise calloced --
+                                       very useful for mem_put */
+        int               real_sizeof_type; /* element size without padding */
+        uint64_t          alloc_count; /* allocations of this type over the
+                                          life of the process, calloced
+                                          elements included */
+        uint64_t          pool_misses; /* heap allocations made because all
+                                          pool elements were in active use */
+        int               max_alloc; /* max pool elements in active use at
+                                        any point; does *not* include
+                                        calloced elements */
+        int               curr_stdalloc; /* elements currently allocated from
+                                            the heap because the pool is in
+                                            complete use; 0 otherwise */
+        int               max_stdalloc; /* max heap allocations in active use
+                                           at any point after the pool was
+                                           completely used */
+        char             *name; /* xlator-name:data-type as a string */
+        struct list_head  global_list; /* hook into the global_list of
+                                           mempools maintained in
+                                           glusterfs-ctx */
+};
+```
+
+##Life-cycle
+```
+mem_pool_new (data_type, unsigned long count)
+```
+
+This is a macro which expands to
+`mem_pool_new_fn (sizeof (data_type), count, string-rep-of-data_type)`:
+
+```
+struct mem_pool *
+mem_pool_new_fn (unsigned long sizeof_type, unsigned long count, char *name)
+```
+
+Padded element:
+```
+ ----------------------------------------
+|list-ptr|mem-pool-address|in-use|Element|
+ ----------------------------------------
+```
+
+This function allocates the `mem-pool` structure and sets up the pool for use.
+The `name` parameter above is the string representation of the data type.
+This `name` is appended to `xlator-name + ':'` so that it can be easily
+identified in things like statedump. `count` is the number of elements that
+need to be allocated. `sizeof_type` is the size of each element. Ideally
+`('sizeof_type' * 'count')` should be the size of the total pool, but to
+manage the pool using `mem_get`/`mem_put` (explained after this section)
+each element is padded in front with a `('list', 'mem-pool-address',
+'in_use')` header. So the actual size of the pool allocated is
+`('padded_sizeof_type' * 'count')`. Why these extra fields are needed will
+be evident after understanding how `mem_get` and `mem_put` are implemented.
+This function initializes all the `list` structures in front of each element
+and adds them to `mem_pool->list`, which represents the list of `cold`
+elements that can be handed out whenever `mem_get` is called on this
+mem_pool. It remembers the pool's start and end addresses in
+`mem_pool->pool` and `mem_pool->pool_end` respectively, and initializes
+`mem_pool->cold_count` to `count` and `mem_pool->hot_count` to `0`. The
+mem-pool is then added to the list of `global_list` maintained in
+`glusterfs-ctx`.
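+
+A hedged sketch of pool creation (the xlator-local struct below is
+hypothetical):
+
+```
+struct sample_local {
+        int   op;
+        char *path;
+};
+
+/* pre-allocate 1024 padded sample_local elements in one shot */
+struct mem_pool *local_pool = mem_pool_new (struct sample_local, 1024);
+```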
+
+
+```
+void* mem_get (struct mem_pool *mem_pool)
+
+Initial-list before mem-get
+----------------
+| Pool |
+| ----------- | ---------------------------------------- ----------------------------------------
+| | pool-list | |<---> |list-ptr|mem-pool-address|in-use|Element|<--->|list-ptr|mem-pool-address|in-use|Element|
+| ----------- | ---------------------------------------- ----------------------------------------
+----------------
+
+list after mem-get from the pool
+----------------
+| Pool |
+| ----------- | ----------------------------------------
+| | pool-list | |<--->|list-ptr|mem-pool-address|in-use|Element|
+| ----------- | ----------------------------------------
+----------------
+
+List when the pool is full:
+ ----------------
+| Pool | extra element that is allocated
+| ----------- | ----------------------------------------
+| | pool-list | | |list-ptr|mem-pool-address|in-use|Element|
+| ----------- | ----------------------------------------
+ ----------------
+```
+
+This function is similar to `malloc()` but it gives memory of the `element`
+type of this pool. When this function is called it increments
+`mem_pool->alloc_count` and checks whether there are any free elements in
+the pool that can be returned, by inspecting `mem_pool->cold_count`. If
+`mem_pool->cold_count` is non-zero then there are elements in the pool which
+are not in active use: it deletes one element from the list of free
+elements, decrements `mem_pool->cold_count`, and increments
+`mem_pool->hot_count` to indicate that one more element is in active use,
+updating `mem_pool->max_alloc` accordingly. It sets `in_use` in the padded
+memory to `1` and sets the mem_pool address in the padded memory as well
+(this is useful for mem_put), then returns the address of the memory after
+the padded boundary to the caller of this function. In the case where all
+the elements in the pool are in active use, it `callocs` the element with
+the padded size and sets the mem_pool address in the padded memory. To
+record the pool miss and give useful accounting information of pool usage,
+it increments `mem_pool->pool_misses` and `mem_pool->curr_stdalloc`,
+updating `mem_pool->max_stdalloc` accordingly.
+
+```
+void* mem_get0 (struct mem_pool *mem_pool)
+```
+Just like `calloc` is to `malloc`, `mem_get0` is to `mem_get`. It memsets the memory to all '0' before returning the element.
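+
+A hedged usage sketch, continuing the hypothetical `local_pool` from above:
+
+```
+struct sample_local *local = mem_get0 (local_pool);
+if (local) {
+        local->op = 1;   /* element arrives zeroed out */
+        /* ... use as the fop-local state ... */
+        mem_put (local); /* return the element to the pool */
+}
+```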
+
+
+```
+void mem_put (void *ptr)
+
+list before mem-put from the pool
+ ----------------
+| Pool |
+| ----------- | ----------------------------------------
+| | pool-list | |<--->|list-ptr|mem-pool-address|in-use|Element|
+| ----------- | ----------------------------------------
+ ----------------
+
+list after mem-put to the pool
+ ----------------
+| Pool |
+| ----------- | ---------------------------------------- ----------------------------------------
+| | pool-list | |<---> |list-ptr|mem-pool-address|in-use|Element|<--->|list-ptr|mem-pool-address|in-use|Element|
+| ----------- | ---------------------------------------- ----------------------------------------
+ ----------------
+
+If mem_put is putting an element not from pool then it is just freed so
+no change to the pool
+ ----------------
+| Pool |
+| ----------- |
+| | pool-list | |
+| ----------- |
+ ----------------
+```
+
+This function is similar to `free()`. Remember that the ptr passed to this
+function is the address of the element, so the function steps back to the
+head of the padding in front of it. If this memory falls between
+`mem_pool->pool` and `mem_pool->pool_end` then it is part of the 'pool'
+memory that was allocated, so some sanity checks are done to see whether the
+memory is indeed the head of an element, by checking that `in_use` is set to
+`1`. It resets `in_use` to `0`, gets the mem_pool address stored in the
+padded region, and adds this element to the list of free elements,
+decreasing `mem_pool->hot_count` and increasing `mem_pool->cold_count`. In
+the case where the padded-element address does not fall in the range
+`mem_pool->pool` to `mem_pool->pool_end`, it just frees the element and
+decreases `mem_pool->curr_stdalloc`.
+
+```
+void
+mem_pool_destroy (struct mem_pool *pool)
+```
+Deletes this pool from the `global_list` maintained by `glusterfs-ctx` and frees all the memory allocated in `mem_pool_new`.
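+
+A hedged sketch of teardown, continuing the hypothetical `local_pool`:
+
+```
+/* at xlator fini time, release the pre-allocated pool memory */
+mem_pool_destroy (local_pool);
+```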
+
+
+###How to pick pool-size
+This varies from workload to workload. Create the mem-pool with some
+reasonable size and run the workload. Take a statedump after the workload is
+complete. In the statedump, if `max_alloc` is always less than `cold_count`,
+consider reducing the size of the pool closer to `max_alloc`. On the other
+hand, if there are lots of `pool_misses`, then increase the `pool_size` by
+`max_stdalloc` to achieve a better hit rate for the pool.