Diffstat (limited to 'doc/features')
-rw-r--r-- doc/features/afr-statistics.md 142
-rw-r--r-- doc/features/afr-v1.md 340
-rw-r--r-- doc/features/brick-failure-detection.md 67
-rw-r--r-- doc/features/ctime.md 68
-rw-r--r-- doc/features/dht.md 223
-rw-r--r-- doc/features/file-snapshot.md 91
-rw-r--r-- doc/features/ganesha-ha.md 43
-rw-r--r-- doc/features/geo-replication/distributed-geo-rep.md 71
-rw-r--r-- doc/features/geo-replication/libgfchangelog.md 119
-rw-r--r-- doc/features/gfid-access.md 73
-rw-r--r-- doc/features/libgfapi.md 381
-rw-r--r-- doc/features/nufa.md 20
-rw-r--r-- doc/features/ovirt-integration.md 106
-rw-r--r-- doc/features/qemu-integration.md 231
-rw-r--r-- doc/features/quota-scalability.md 52
-rw-r--r-- doc/features/rdmacm.md 26
-rw-r--r-- doc/features/readdir-ahead.md 14
-rw-r--r-- doc/features/rebalance.md 74
-rw-r--r-- doc/features/server-quorum.md 44
-rw-r--r-- doc/features/worm.md 75
-rw-r--r-- doc/features/zerofill.md 26
21 files changed, 111 insertions, 2175 deletions
diff --git a/doc/features/afr-statistics.md b/doc/features/afr-statistics.md
deleted file mode 100644
index d0705845aa4..00000000000
--- a/doc/features/afr-statistics.md
+++ /dev/null
@@ -1,142 +0,0 @@
-##gluster volume heal <volume-name> statistics
-
-##Description
-During an index self-heal, the self-heal daemon reads entries from the
-/brick-path/.glusterfs/indices/xattrop/ directory of each local brick
-and attempts to heal the entries it finds there.
-Executing this command lists the crawl statistics of the self-heal done for
-each brick.
-
-For each brick, it will list:
-1. Starting time of crawl done for that brick.
-2. Ending time of crawl done for that brick.
-3. Number of entries for which self-heal was successfully attempted.
-4. Number of split-brain entries found while self-healing.
-5. Number of entries for which heal failed.
-
-
-
-Example:
-a) Create a gluster volume with replica count 2.
-b) Create 10 files.
-c) Kill brick_1 of this replica.
-d) Overwrite all 10 files.
-e) Kill the other brick (brick_2), and bring brick_1 back up.
-f) Overwrite all 10 files.
-
-Now all 10 files are in split-brain. The self-heal daemon will crawl
-both bricks and will count 10 files on each brick.
-It will report 10 files in split-brain with respect to the given brick.
-
-Gathering crawl statistics on volume volume1 has been successful
-------------------------------------------------
-
-Crawl statistics for brick no 0
-Hostname of brick 192.168.122.1
-
-Starting time of crawl: Tue May 20 19:13:11 2014
-
-Ending time of crawl: Tue May 20 19:13:12 2014
-
-Type of crawl: INDEX
-No. of entries healed: 0
-No. of entries in split-brain: 10
-No. of heal failed entries: 0
-------------------------------------------------
-
-Crawl statistics for brick no 1
-Hostname of brick 192.168.122.1
-
-Starting time of crawl: Tue May 20 19:13:12 2014
-
-Ending time of crawl: Tue May 20 19:13:12 2014
-
-Type of crawl: INDEX
-No. of entries healed: 0
-No. of entries in split-brain: 10
-No. of heal failed entries: 0
-
-------------------------------------------------
-
-
-As the output shows, the self-heal daemon detects 10 files in split-brain with
-respect to the given brick.
-
-
-
-
-##gluster volume heal <volume-name> statistics heal-count
-It lists the number of entries present in
-/<brick>/.glusterfs/indices/xattrop on each brick.
-
-
-1. Create a replicate volume.
-2. Kill one brick of the replicate volume (volume1).
-3. Create 10 files.
-4. Execute above command.
-
---------------------------------------------------------------------------------
-Gathering count of entries to be healed on volume volume1 has been successful
-
-Brick 192.168.122.1:/brick_1
-Number of entries: 10
-
-Brick 192.168.122.1:/brick_2
-No gathered input for this brick
--------------------------------------------------------------------------------
-
-
-
-
-
-
-##gluster volume heal <volume-name> statistics heal-count replica \
- ip_addr:/brick_location
-
-To list the number of entries to be healed in a particular replicate
-subvolume, name any one child of that replicate subvolume in the command;
-it will list the entries for all the children of that replicate subvolume.
-
-Example:                  dht
-                         /   \
-                        /     \
-               replica-1       replica-2
-                /   \            /   \
-         child-1   child-2  child-3   child-4
-         /brick1   /brick2  /brick3   /brick4
-
-gluster volume heal <vol-name> statistics heal-count replica ip:/brick1
-will list the count only for child-1 and child-2.
-
-gluster volume heal <vol-name> statistics heal-count replica ip:/brick3
-will list the count only for child-3 and child-4.
-
-
-
-1. Create a volume same as mentioned in the above graph.
-2. Kill Brick-2.
-3. Create some files.
-4. If we are interested in knowing the number of files to be healed from each
- brick of replica-1 only, mention any one child of that replica.
-
-gluster volume heal volume1 statistics heal-count replica 192.168.122.1:/brick2
-
-output:
--------
-Gathering count of entries to be healed per replica on volume volume1 has \
-been successful
-
-Brick 192.168.122.1:/brick_1
-Number of entries: 10 <--10 files
-
-Brick 192.168.122.1:/brick_2
-No gathered input for this brick <-Brick is down
-
-Brick 192.168.122.1:/brick_3
-No gathered input for this brick <--No result, as we are not
- interested.
-
-Brick 192.168.122.1:/brick_4
-No gathered input for this brick <--No result, as we are not interested.
-
-
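The `statistics heal-count` output shown in afr-statistics.md above is regular enough to read from a script. The following is a minimal Python sketch, assuming the output keeps the exact wording of the sample (`Brick ...`, `Number of entries: N`, `No gathered input for this brick`). `parse_heal_count` and `SAMPLE` are hypothetical names used for illustration and are not part of gluster.

```python
import re


def parse_heal_count(output):
    """Map each brick ("host:/path") to its pending entry count.

    A brick that reports "No gathered input for this brick" maps to None.
    The line formats are assumed from the sample output in the document.
    """
    counts = {}
    brick = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            # Remember which brick the following status line belongs to.
            brick = line[len("Brick "):].strip()
        elif brick is not None:
            match = re.match(r"Number of entries:\s*(\d+)", line)
            if match:
                counts[brick] = int(match.group(1))
                brick = None
            elif line.startswith("No gathered input"):
                counts[brick] = None
                brick = None
    return counts


# Sample copied from the heal-count example in the document.
SAMPLE = """\
Gathering count of entries to be healed on volume volume1 has been successful

Brick 192.168.122.1:/brick_1
Number of entries: 10

Brick 192.168.122.1:/brick_2
No gathered input for this brick
"""

if __name__ == "__main__":
    print(parse_heal_count(SAMPLE))
```

Returning `None` rather than `0` for a brick without gathered input keeps "brick is down or not queried" distinct from "nothing to heal".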
diff --git a/doc/features/afr-v1.md b/doc/features/afr-v1.md
deleted file mode 100644
index 0ab41a1ab4c..00000000000
--- a/doc/features/afr-v1.md
+++ /dev/null
@@ -1,340 +0,0 @@
-#Automatic File Replication
-Afr xlator in glusterfs is responsible for replicating the data across the bricks.
-
-###Responsibilities of AFR
-Its responsibilities include the following:
-
-1. Maintain replication consistency (i.e. Data on both the bricks should be same, even in the cases where there are operations happening on same file/directory in parallel from multiple applications/mount points as long as all the bricks in replica set are up)
-
-2. Provide a way of recovering data in case of failures as long as there is
- at least one brick which has the correct data.
-
-3. Serve fresh data for read/stat/readdir etc
-
-###Transaction framework
-For 1, 2 above afr uses transaction framework which consists of the following 5
-phases which happen on all the bricks in replica set(Bricks which are in replication):
-
-####1. Lock Phase
-####2. Pre-op Phase
-####3. Op Phase
-####4. Post-op Phase
-####5. Unlock Phase
-
-*Op Phase* is the actual operation sent by application (`write/create/unlink` etc). For every operation which afr receives that modifies data it sends that same operation in parallel to all the bricks in its replica set. This is how it achieves replication.
-
-*Lock, Unlock Phases* take necessary locks so that *Op phase* can provide **replication consistency** in normal work flow.
-
-#####For example:
-If an application performs `touch a` and the other one does `mkdir a`, afr makes sure that either file with name `a` or directory with name `a` is created on both the bricks.
-
-*Pre-op, Post-op Phases* provide changelogging which enables afr to figure out which copy is fresh.
-Once afr knows how to figure out fresh copy in the replica set it can **recover data** from fresh copy to stale copy. Also it can **serve fresh** data for `read/stat/readdir` etc.
-
-##Internal Operations
-Brief introduction to internal operations in Glusterfs which make *Locking, Unlocking, Pre/Post ops* possible:
-
-###Internal Locking Operations
-Glusterfs has **locks** translator which provides the following internal locking operations called `inodelk`, `entrylk` which are used by afr to achieve synchronization of operations on files or directories that conflict with each other.
-
-`Inodelk` gives the facility for translators in Glusterfs to obtain range (denoted by tuple with **offset**, **length**) locks in a given domain for an inode.
-Full file lock is denoted by the tuple (offset: `0`, length: `0`) i.e. length `0` is considered as infinity.
-
-`Entrylk` enables translators of Glusterfs to obtain locks on `name` in a given domain for an inode, typically a directory.
-
-**Locks** translator provides both *blocking* and *nonblocking* variants of these operations.
-
-###Xattrop
-For pre/post ops posix translator provides an operation called xattrop.
-xattrop is a way of *incrementing*/*decrementing* a number present in the extended attribute of the inode *atomically*.
-
-##Transaction Types
-There are 3 types of transactions in AFR.
-1. Data transactions
- - Operations that add/modify/truncate the file contents.
- - `Write`/`Truncate`/`Ftruncate` etc
-
-2. Metadata transactions
- - Operations that modify the data kept in inode.
- - `Chmod`/`Chown` etc
-
-3. Entry transactions
- - Operations that add/remove/rename entries in a directory
- - `Touch`/`Mkdir`/`Mknod` etc
-
-###Data transactions:
-
-*write* (`offset`, `size`) - writes data from `offset` of `size`
-
-*ftruncate*/*truncate* (`offset`) - truncates data from `offset` till the end of file.
-
-Afr internal locking needs to make sure that two conflicting data operations happen in order, one after the other, so that they do not result in replication inconsistency. Afr data operations take inodelks in the same domain (let's call it the `data` domain).
-
-*Write* operation with offset `O` and size `S` takes an inode lock in data domain with range `(O, S)`.
-
-*Ftruncate*/*Truncate* operations with offset `O` take inode locks in `data` domain with range `(O, 0)`. Please note that size `0` means size infinity.
-
-These ranges make sure that overlapping write/truncate/ftruncate operations are done one after the other.
-
-Now that we know the ranges the operations take locks on, we will see how locking happens in afr.
-
-####Lock:
-Afr initially attempts **non-blocking** locks on **all** the bricks of the replica set in **parallel**. If all the locks are successful then it goes on to perform pre-op. But in case **non-blocking** locks **fail** because there is *at least one conflicting operation* which already has a **granted lock** then it **unlocks** the **non-blocking** locks it already acquired in the previous step and proceeds to perform **blocking** locks **one after the other** on each of the subvolumes in the order of subvolumes specified in the volfile.
-
-The chance of **conflicting operations** is **very low**, and the time elapsed in the **non-blocking** locks phase is `Max(latencies of the bricks for responding to inodelk)`, whereas the time elapsed in the **blocking locks** phase is `Sum(latencies of the bricks for responding to inodelk)`. That is why afr always tries non-blocking locks first and only then moves to blocking locks.
-
-####Pre-op:
-Each file/dir in a brick maintains the changelog (roughly, a pending operation count) of itself and that of the files
-present in all the other bricks in its replica set as seen by that brick.
-
-Let's consider an example replica volume with 2 bricks, brick-a and brick-b.
-All files in brick-a will have 2 entries,
-one for itself and the other for the file present in its replica set, i.e. brick-b.
-One can inspect changelogs using the getfattr command.
-
- # getfattr -d -e hex -m. brick-a/file.txt
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
- # getfattr -d -e hex -m. brick-b/file.txt
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)
-
-#####Interpreting Changelog Value:
-Each extended attribute has a value which is `24` hexadecimal digits, i.e. `12` bytes.
-The first `8` digits (`4` bytes) represent the changelog of `data`. The second `8` digits represent the changelog
-of `metadata`. The last `8` digits represent the changelog of `directory entries`.
-
-Pictorially representing the same, we have:
-
-    0x 00000000 00000000 00000000
-           |        |        |
-           |        |         \_ changelog of directory entries
-           |         \_ changelog of metadata
-            \_ changelog of data
-
-Before write operation is performed on the brick, afr marks the file saying there is a pending operation.
-
-As part of this pre-op afr sends xattrop operation with increment 1 for data operation to make the extended attributes the following:
- # getfattr -d -e hex -m. brick-a/file.txt
- trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
- # getfattr -d -e hex -m. brick-b/file.txt
- trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual write operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends an xattrop operation with increment -1, i.e. decrement by 1, for the data operation to bring the extended attributes back to all zeros.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b whereas there are no pending operations on brick-a. So the xattrop operations for these two extended attributes now differ. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, a decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b an increment by '0' is sent to retain the pending operation count.
-
- # getfattr -d -e hex -m. brick-a/file.txt
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
-
- # getfattr -d -e hex -m. brick-b/file.txt
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-
-###Meta Data transactions:
-
-setattr, setxattr, removexattr
-All metadata operations take same inode lock with same range in metadata domain.
-
-####Lock:
-Metadata locking also starts with non-blocking locks and then moves on to blocking locks on any failures caused by conflicting operations.
-
-####Pre-op:
-Before metadata operation is performed on the brick, afr marks the file saying there is a pending operation.
-As part of this pre-op afr sends xattrop operation with increment 1 for metadata operation to make the extended attributes the following:
- # getfattr -d -e hex -m. brick-a/file.txt
- trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
- # getfattr -d -e hex -m. brick-b/file.txt
- trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual metadata operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends an xattrop operation with increment -1, i.e. decrement by 1, for the metadata operation to bring the extended attributes back to all zeros.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b whereas there are no pending operations on brick-a. So the xattrop operations for these two extended attributes now differ. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, a decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b an increment by '0' is sent to retain the pending operation count.
-
- # getfattr -d -e hex -m. brick-a/file.txt
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
-
- # getfattr -d -e hex -m. brick-b/file.txt
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-
-###Entry transactions:
-
-create, mknod, mkdir, link, symlink, rename, unlink, rmdir
-Pre-op/Post-op (done using xattrop) always happens on the parent directory.
-
-Entry Locks taken by these entry operations:
-
-**Create** (file `dir/a`): Lock on name `a` in inode of `dir`
-
-**mknod** (file `dir/a`): Lock on name `a` in inode of `dir`
-
-**mkdir** (dir `dir/a`): Lock on name `a` in inode of `dir`
-
-**link** (file `oldfile`, file `dir/newfile`): Lock on name `newfile` in inode of `dir`
-
-**Symlink** (file `oldfile`, file `dir`/`symlinkfile`): Lock on name `symlinkfile` in inode of `dir`
-
-**rename** of (file `dir1`/`file1`, file `dir2`/`file2`): Lock on name `file1` in inode of `dir1`, Lock on name `file2` in inode of `dir2`
-
-**rename** of (dir `dir1`/`dir2`, dir `dir3`/`dir4`): Lock on name `dir2` in inode of `dir1`, Lock on name `dir4` in inode of `dir3`, Lock on `NULL` in inode of `dir4`
-
-**unlink** (file `dir`/`a`): Lock on name `a` in inode of `dir`
-
-**rmdir** (dir `dir`/`a`): Lock on name `a` in inode of `dir`, Lock on `NULL` in inode of `a`
-
-####Lock:
-Entry locking also starts with non-blocking locks and then moves on to blocking locks on any failures caused by conflicting operations.
-
-####Pre-op:
-Before entry operation is performed on the brick, afr marks the directory saying there is a pending operation.
-
-As part of this pre-op afr sends xattrop operation with increment 1 for entry operation to make the extended attributes the following:
-
- # getfattr -d -e hex -m. brick-a/
- trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
- # getfattr -d -e hex -m. brick-b/
- trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual entry operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends an xattrop operation with increment -1, i.e. decrement by 1, for the entry operation to bring the extended attributes back to all zeros.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b whereas there are no pending operations on brick-a. So the xattrop operations for these two extended attributes now differ. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, a decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b an increment by '0' is sent to retain the pending operation count.
-
- # getfattr -d -e hex -m. brick-a/
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
- trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
-
- # getfattr -d -e hex -m. brick-b/
- trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
- trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-The parts above cover how replication consistency is achieved in afr.
-
-Now let us look at how afr can figure out how to recover from failures given the changelog extended attributes
-
-###Recovering from failures (Self-heal)
-For recovering from failures afr tries to determine which copy is the fresh copy based on the extended attributes.
-
-There are 3 possibilities:
-1. All the extended attributes are zero on all the bricks. This means there are no pending operations on any of the bricks so there is nothing to recover.
-2. According to the extended attributes there is a brick(brick-a) which noticed that there are operations pending on the other brick(brick-b).
- - There are 4 possibilities for brick-b
-
- - It did not even participate in transaction (all extended attributes on brick-b are zeros). Choose brick-a as source and perform recovery to brick-b.
-
- - It participated in the transaction but died even before post-op. (All extended attributes on brick-b have a pending-count). Choose brick-a as source and perform recovery to brick-b.
-
- - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on itself. Choose brick-a as source and perform recovery to brick-b.
-
- - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on brick-a. This situation is called Split-brain and there is no way to recover. This situation can happen in cases of network partition.
-
-3. The only remaining possibility is that both brick-a and brick-b have pending operations. In this case the changelog extended attributes are all non-zero on all the bricks. Basically what could have happened is that the operations started on the file but either the whole replica set went down or the mount process itself died before post-op was performed. In this case there is a possibility that the data on the bricks is different. Afr then chooses the file with the bigger size as source. If both files have the same size, it chooses the subvolume which has witnessed the larger number of pending operations on the other brick as source. If both have the same number of pending operations, it chooses the file with the newest ctime as source. If this is also the same, it just picks one of the two bricks as source and syncs data on to the other to make sure that the files are replicas of each other.
-
-###Self-healing:
-Afr does 3 types of self-heals for data recovery.
-
-1. Data self-heal
-
-2. Metadata self-heal
-
-3. Entry self-heal
-
-As we have seen earlier, afr depends on changelog extended attributes to figure out which copy is source and which copy is sink. General algorithm for performing this recovery (self-heal) is same for all of these different self-heals.
-
-1. Take appropriate full locks on the file/directory to make sure no other transaction is in progress while inspecting changelog extended attributes.
-In this step, for
- - Data self-heal afr takes inode lock with `offset: 0` and `size: 0`(infinity) in data domain.
- - Entry self-heal takes entry lock on directory with `NULL` name i.e. full directory lock.
- - Metadata self-heal takes the pre-defined range in the metadata domain on which all the metadata operations on that inode take locks. To prevent duplicate data self-heal an inode lock is taken in the self-heal domain as well.
-
-2. Perform Sync from fresh copy to stale copy.
-In this step,
- - Metadata self-heal gets the inode attributes, extended attributes from source copy and sets them on the stale copy.
-
- - Entry self-heal reads entries on stale directories and checks whether they are present on the source directory; if they are not present, it deletes them. Then it reads entries on the fresh directory and creates the missing entries on stale directories.
-
- - Data self-heal does things a bit differently to make sure no other writes on the file are blocked for the duration of self-heal, because file sizes could be as big as 100G (VM files) and we don't want to block all the transactions until the self-heal is over. Locks translator allows two overlapping locks to be granted if they are from the same lock owner. Using this, data self-heal takes a small 128k range lock, unlocks the previously acquired lock, heals just that 128k chunk, then takes the lock for the next 128k chunk, unlocks the previous lock, and moves on to the next one. It always makes sure that at least one lock is held on the file by self-heal throughout the duration of self-heal so that two self-heals don't happen in parallel.
-
- - Data self-heal has two algorithms. In 'diff' self-heal, a chunk is copied only when there is a data mismatch for that chunk. The other one, 'full' self-heal, is a blind copy of each chunk.
-
-3. Change extended attributes to mark new sources after the sync.
-
-4. Unlock the locks acquired to perform self-heal.
-
-### Transaction Optimizations:
-As we saw earlier afr transaction for all the operations that modify data happens in 5 phases, i.e. it sends 5 operations on the network for every operation. In the following sections we will see optimizations already implemented in afr which reduce the number of operations on the network to just 1 per transaction in best case.
-
-####Changelog-piggybacking
-This optimization comes into picture when on same file descriptor, before write1's post op is complete write2's pre-op starts and the operations are succeeding. When writes come in that manner we can piggyback on the pre-op of write1 for write2 and somehow tell write1 that write2 will do the post-op that was supposed to be done by write1. So write1's post-op does not happen over network, write2's pre-op does not happen over network. This optimization does not hold if there are any failures in write1's phases.
-
-####Delayed Post-op
-This optimization just delays the post-op of the write transaction (write1) by a pre-configured amount of time to increase the probability of the next write piggybacking on the pre-op done by write1.
-
-With the combination of these two optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op will happen. Then for the last write on the file post-op happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 3.
-
-####Eager-locking:
-This optimization comes into picture when only one file descriptor is open on the file and performing writes, just like in the previous optimization. It takes a full file lock on the file irrespective of the offset and size of the write, so that the lock acquired by write1 can be piggybacked by write2, and write2 takes the responsibility of unlocking it. Both write1 and write2 will have the same lock owner, and afr takes the responsibility of serializing overlapping writes so that replication consistency is maintained.
-
-With the combination of these optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op, full-file lock will happen. Then for the last write on the file post-op, unlock happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 1.
-
-###Quorum in afr:
-To avoid split-brains, afr employs the following quorum policies.
- - In replica set with odd number of bricks, replica set is said to be in quorum if more than half of the bricks are up.
- - In replica set with even number of bricks, if more than half of the bricks are up then it is said to be in quorum but if number of bricks that are up is equal to number of bricks that are down then, it is said to be in quorum if the first brick is also up in the set of bricks that are up.
-
-When quorum is not met in the replica set then modify operations on the mount are not allowed by afr.
-
-###Self-heal daemon and Index translator usage by afr:
-
-####Index xlator:
-On each brick index xlator is loaded. This xlator keeps track of what is happening in afr's pre-op and post-op. If there is an ongoing I/O or a pending self-heal, changelog xattrs would have non-zero values. Whenever xattrop/fxattrop fop (pre-op, post-ops are done using these fops) comes to index xlator a link (with gfid as name of the file on which the fop is performed) is added in <brick>/.glusterfs/indices/xattrop directory. If the value returned by the fop is zero the link is removed from the index otherwise it is kept until zero is returned in the subsequent xattrop/fxattrop fops.
-
-####Self-heal-daemon:
-The self-heal-daemon process keeps running on each machine of the trusted storage pool. This process has afr xlators of all the volumes which are started. Its job is to crawl indices on bricks that are local to that machine. If any of the files represented by the gfid of the link name need healing, it heals them automatically. This operation is performed every 10 minutes for each replica set. It is also performed when a brick comes online.
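Two rules from afr-v1.md above are compact enough to sketch in code: how a `trusted.afr.*` changelog value splits into its three counters, and how the quorum policy treats odd and even replica sets. This is an illustrative Python sketch only, not AFR's C implementation. `decode_changelog` and `in_quorum` are hypothetical names, and the big-endian interpretation is assumed from the `getfattr -e hex` samples, where a count of 1 appears as `00000001`.

```python
def decode_changelog(hex_value):
    """Split a value such as '0x000000010000000000000000' into the
    data, metadata and entry pending counters (4 bytes each)."""
    if hex_value.lower().startswith("0x"):
        digits = hex_value[2:]
    else:
        digits = hex_value
    if len(digits) != 24:
        raise ValueError("expected 24 hexadecimal digits (12 bytes)")
    raw = bytes.fromhex(digits)
    return {
        "data": int.from_bytes(raw[0:4], "big"),
        "metadata": int.from_bytes(raw[4:8], "big"),
        "entry": int.from_bytes(raw[8:12], "big"),
    }


def in_quorum(bricks_up):
    """Apply the quorum policy described in the document.

    bricks_up is a list of booleans in brick order, True when that
    brick is up. More than half up means quorum. Exactly half up
    (only possible with an even count) means quorum only if the
    first brick is among the bricks that are up.
    """
    total = len(bricks_up)
    if total == 0:
        raise ValueError("replica set has no bricks")
    up = sum(1 for is_up in bricks_up if is_up)
    if 2 * up > total:
        return True
    if 2 * up == total:
        return bool(bricks_up[0])
    return False


if __name__ == "__main__":
    # Values taken from the data-transaction post-op example on brick-a.
    print(decode_changelog("0x000000000000000000000000"))
    print(decode_changelog("0x000000010000000000000000"))
    print(in_quorum([True, False]), in_quorum([False, True]))
```

In the two-brick case the tie-break means writes continue only while the first brick is up, which is how the policy avoids both halves of a partition accepting modifications.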
diff --git a/doc/features/brick-failure-detection.md b/doc/features/brick-failure-detection.md
deleted file mode 100644
index 24f2a18f39f..00000000000
--- a/doc/features/brick-failure-detection.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Brick Failure Detection
-
-This feature attempts to identify storage/file system failures and disable the failed brick without disrupting the remainder of the node's operation.
-
-## Description
-
-Detecting failures on the filesystem that a brick uses makes it possible to handle errors that are caused from outside of the Gluster environment.
-
-There have been hanging brick processes when the underlying storage of a brick became unavailable. A hanging brick process can still use the network and respond to clients, but actual I/O to the storage is impossible and can cause noticeable delays on the client side.
-
-This feature provides better detection of storage subsystem failures and prevents brick processes from hanging when the storage hardware or the filesystem fails.
-
-A health-checker (thread) has been added to the posix xlator. This thread periodically checks the status of the filesystem (implies checking of functional storage-hardware).
-
-`glusterd` can detect that the brick process has exited, `gluster volume status` will show that the brick process is not running anymore. System administrators checking the logs should be able to triage the cause.
-
-## Usage and Configuration
-
-The health-checker is enabled by default and runs a check every 30 seconds. This interval can be changed per volume with:
-
- # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
-
-If `SECONDS` is set to 0, the health-checker will be disabled.
-
-## Failure Detection
-
-Errors are logged to the standard syslog (mostly `/var/log/messages`):
-
- Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512
- Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem
- Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
- Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed
- Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down
- Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned.
- Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM
-
-The messages labelled with `GlusterFS` in the above output are also written to the logs of the brick process.
-
-## Recovery after a failure
-
-When a brick process detects that the underlying storage is not responding anymore, the process will exit. The brick process is not restarted automatically; the sysadmin will need to fix the problem with the storage first.
-
-After correcting the storage (hardware or filesystem) issue, the following command will start the brick process again:
-
- # gluster volume start <VOLNAME> force
-
-## How To Test
-
-The health-checker thread that is part of each brick process will get started automatically when a volume has been started. Verifying its functionality can be done in different ways.
-
-On virtual hardware:
-
-* disconnect the disk from the VM that holds the brick
-
-On real hardware:
-
-* simulate a RAID-card failure by unplugging the card or cables
-
-On a system that uses LVM for the bricks:
-
-* use device-mapper to load an error-table for the disk, see [this description](http://review.gluster.org/5176).
-
-On any system (writing to random offsets of the block device, more difficult to trigger):
-
-1. cause corruption on the filesystem that holds the brick
-2. read contents from the brick, hoping to hit the corrupted area
-3. the filesystem should abort after hitting a bad spot, and the health-checker should notice that shortly afterwards
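The periodic check described under "Usage and Configuration" can be pictured with a short Python sketch. It is an illustration only, not the posix xlator's code: the function names, the use of `os.stat()` as the probe, and the `on_failure` callback are all assumptions made for this example. The one behaviour taken from the text is that an interval of 0 disables the checker.

```python
import os
import threading


def health_check_loop(brick_path, interval, on_failure, stop_event):
    """Probe brick_path every `interval` seconds (hypothetical helper).

    Calls on_failure(exc) and returns on the first OSError.
    """
    if interval <= 0:
        return  # an interval of 0 disables the checker, as with the volume option
    # Event.wait() returns False on timeout, so the loop runs once per interval
    # until stop_event is set.
    while not stop_event.wait(interval):
        try:
            os.stat(brick_path)  # assumed probe; the real check may differ
        except OSError as exc:
            on_failure(exc)
            return


def start_health_checker(brick_path, interval, on_failure):
    """Start the loop in a daemon thread; returns (thread, stop_event)."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=health_check_loop,
        args=(brick_path, interval, on_failure, stop_event),
        daemon=True,
    )
    thread.start()
    return thread, stop_event
```

In this sketch the callback stands in for the "health-check failed, going down" step seen in the log excerpt; a real brick process would terminate itself at that point.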
diff --git a/doc/features/ctime.md b/doc/features/ctime.md
new file mode 100644
index 00000000000..74a77abed4b
--- /dev/null
+++ b/doc/features/ctime.md
@@ -0,0 +1,68 @@
+# Consistent time attributes in gluster across replica/distribute
+
+
+#### Problem:
+Traditionally gluster has been using time attributes (ctime, atime, mtime) of files/dirs from bricks. The problem with this approach is that it is not consistent across replica and distribute bricks, and applications which depend on it break, as replica might not always return time attributes from the same brick.
+
+Tar especially gives "file changed as we read it" whenever it detects ctime differences when stat is served from different bricks. The way we have been trying to solve it is to serve the stat structures from the same brick in afr and max-time in dht. But that doesn't avoid the problem completely. Because there is no way to change ctime at the moment (lutimes() only allows mtime and atime), there is little we can do to make sure ctimes match after self-heals/xattr updates/rebalance.
+
+#### Solution Proposed:
+Store time attributes (ctime, mtime, atime) as an xattr of the file. The xattr is updated based
+on the fop. If a filesystem fop changes only mtime and ctime, update only those in xattr for
+that file.
+
+#### Design Overview:
+1) As part of each fop, the top layer will generate a timestamp and pass it down along
+ with other information
+ - This will bring a dependency for NTP synced clients along with servers
+ - There can be a difference in time if the fop is stuck in the xlator for various reasons,
+for example because of locks.
+
+ 2) On the server, the posix layer stores the value in memory (inode ctx) and syncs the data periodically to the disk as an extended attribute
+ - A sync call will also force it. If a fop comes for an inode which is not linked, we do the sync immediately.
+
+ 3) Each time inodes are created or initialized, the data is read from disk and stored in inode ctx.
+
+ 4) Before setting to inode_ctx we compare the timestamp stored and the timestamp received, and only store if the stored value is less than the received value.
+
+ 5) So in the best case data will be stored in and retrieved from memory. We replace the values in iatt with the values in inode_ctx.
+
+ 6) File ops that change the parent directory's time attributes need to be consistent across all the distributed directories across the subvolumes (e.g. a create call will change ctime and mtime of the parent dir).
+
+ - This has to be handled separately because we only send the fop to the hashed subvolume.
+ - We can asynchronously send the timeupdate setattr fop to the other subvolumes and change the values for the parent directory if the file fop is successful on the hashed subvolume.
+ - This will have a window where the times are inconsistent across dht subvolumes (Please provide your suggestions)
+
+7) Currently we have a couple of mount options for time attributes, like noatime, relatime, nodiratime etc., but these options are not explicitly handled even if they are given as mount options for a gluster mount.
+
+
+#### Implementation Overview:
+This feature involves changes in the following xlators.
+ - utime xlator
+ - posix xlator
+
+##### utime xlator:
+This is a new client side xlator which does the following tasks.
+
+1. It generates a timestamp and passes it down in frame->root->ctime and over the network.
+2. Based on the fop, it also decides which time attributes are to be updated, and passes this using "frame->root->flags"
+
+ Patches:
+ 1. https://review.gluster.org/#/c/19857/
+
+##### posix xlator:
+The following tasks are done in the posix xlator:
+
+1. Provides APIs to set and get the xattr from backend. It also caches the xattr in inode context. During get, it updates time attributes stored in xattr into iatt structure.
+2. Based on the flags from utime xlator, relevant fops update the time attributes in the xattr.
+
+ Patches:
+ 1. https://review.gluster.org/#/c/19267/
+ 2. https://review.gluster.org/#/c/19795/
+ 3. https://review.gluster.org/#/c/19796/
+
+#### Pending Work:
+1. Handling of time related mount options (noatime, relatime, etc.)
+2. flag based create (depending on flags in open, create behaviour might change)
+3. Changes in dht for directory sync across multiple subvolumes
+4. readdirp stat needs to be worked on.
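The update rule from the design overview (update only the attributes the fop touches, and never move a stored time backwards) can be sketched in Python as follows. This is a minimal illustration, not the posix xlator code: `InodeTimes`, `apply_fop_time`, and the flag values are hypothetical names chosen for this example, and the actual bit values carried in `frame->root->flags` are not given in this document.

```python
from dataclasses import dataclass


@dataclass
class InodeTimes:
    """Hypothetical in-memory (inode ctx) copy of the time attributes."""
    ctime: float = 0.0
    mtime: float = 0.0
    atime: float = 0.0


# Hypothetical flag bits, assumed for illustration only.
UPDATE_CTIME = 0x1
UPDATE_MTIME = 0x2
UPDATE_ATIME = 0x4


def apply_fop_time(ctx, fop_time, flags):
    """Update only the flagged attributes, and only if fop_time is newer.

    Returns True if anything changed, i.e. the xattr would need a sync.
    """
    changed = False
    for flag, name in (
        (UPDATE_CTIME, "ctime"),
        (UPDATE_MTIME, "mtime"),
        (UPDATE_ATIME, "atime"),
    ):
        # Store only if the stored value is less than the received one.
        if flags & flag and getattr(ctx, name) < fop_time:
            setattr(ctx, name, fop_time)
            changed = True
    return changed
```

Comparing before storing means a fop that was delayed in the stack (for example behind a lock) cannot overwrite a newer timestamp with an older one.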
diff --git a/doc/features/dht.md b/doc/features/dht.md
deleted file mode 100644
index c35dd6d0c27..00000000000
--- a/doc/features/dht.md
+++ /dev/null
@@ -1,223 +0,0 @@
-# How GlusterFS Distribution Works
-
-The defining feature of any scale-out system is its ability to distribute work
-or data among many servers. Accordingly, people in the distributed-system
-community have developed many powerful techniques to perform such distribution,
-but those techniques often remain little known or understood even among other
-members of the file system and database communities that benefit. This
-confusion is represented even in the name of the GlusterFS component that
-performs distribution - DHT, which stands for Distributed Hash Table but is not
-actually a DHT as that term is most commonly used or defined. The way
-GlusterFS's DHT works is based on a few basic principles:
-
- * All operations are driven by clients, which are all equal. There are no
- special nodes with special knowledge of where files are or should be.
-
- * Directories exist on all subvolumes (bricks or lower-level aggregations of
- bricks); files exist on only one.
-
- * Files are assigned to subvolumes based on *consistent hashing*, and even
- more specifically a form of consistent hashing exemplified by Amazon's
- [Dynamo][dynamo].
-
-The result of all this is that users are presented with a set of files that is
-the union of the files present on all subvolumes. The following sections
-describe how this "uniting" process actually works.
-
-## Layouts
-
-The conceptual basis of Dynamo-style consistent hashing is of numbers around a
-circle, like a clock. First, the circle is divided into segments and those
-segments are assigned to bricks. (For the sake of simplicity we'll use
-"bricks" hereafter even though they might actually be replicated/striped
-subvolumes.) Several factors guide this assignment.
-
- * Assignments are done separately for each directory.
-
- * Historically, segments have all been the same size. However, this can lead
- to smaller bricks becoming full while plenty of space remains on larger
- ones. If the *cluster.weighted-rebalance* option is set, segment sizes
- will be proportional to brick sizes.
-
- * Assignments need not include all bricks in the volume. If the
- *cluster.subvols-per-directory* option is set, only that many bricks will
- receive assignments for that directory.
-
-However these assignments are done, they collectively become what we call a
-*layout* for a directory. This layout is then stored using extended
-attributes, with each brick's copy of that extended attribute on that directory
-consisting of four 32-bit fields.
-
- * A version, which might be DHT\_HASH\_TYPE\_DM to represent an assignment as
- described above, or DHT\_HASH\_TYPE\_DM\_USER to represent an assignment made
- manually by the user (or external script).
-
- * A "commit hash" which will be described later.
-
- * The first number in the assigned range (segment).
-
- * The last number in the assigned range.
-
-For example, the extended attributes representing a weighted assignment between
-three bricks, one twice as big as the others, might look like this.
-
- * Brick A (the large one): DHT\_HASH\_TYPE\_DM 1234 0 0x7fffffff
-
- * Brick B: DHT\_HASH\_TYPE\_DM 1234 0x80000000 0xbfffffff
-
- * Brick C: DHT\_HASH\_TYPE\_DM 1234 0xc0000000 0xffffffff
-
-## Placing Files
-
-To place a file in a directory, we first need a layout for that directory - as
-described above. Next, we calculate a hash for the file. To minimize
-collisions either between files in the same directory with different names or
-between files in different directories with the same name, this hash is
-generated using both the (containing) directory's unique GFID and the file's
-name. This hash is then matched to one of the layout assignments, to yield
-what we call a *hashed location*. For example, consider the layout shown
-above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the
-corresponding file's hashed location would be on Brick B. A second file with a
-hash of 0xfaceb00c would be assigned to Brick C by the same reasoning.
-
-## Looking Up Files
-
-Because layout assignments might change, especially as bricks are added or
-removed, finding a file involves more than calculating its hashed location and
-looking there. That is in fact the first step, and works most of the time -
-i.e. the file is found where we expected it to be - but there are a few more
-steps when that's not the case. Historically, the next step has been to look
-for the file **everywhere** - i.e. to broadcast our lookup request to all
-subvolumes. If the file isn't found that way, it doesn't exist. At this
-point, an open that requires the file's presence will fail, or a create/mkdir
-that requires its absence will be allowed to continue.
-
-Regardless of whether a file is found at its hashed location or elsewhere, we
-now know its *cached location*. As the name implies, this is stored within DHT
-to satisfy future lookups. If it's not the same as the hashed location, we
-also take an extra step. This step is the creation of a *linkfile*, which is a
-special stub left at the **hashed** location pointing to the **cached**
-location. Therefore, if a client naively looks for a file at its hashed
-location and finds a linkfile instead, it can use that linkfile to look up the
-file where it really is instead of needing to inquire everywhere.
-
-## Rebalancing
-
-As bricks are added or removed, or files are renamed, many files can end up
-somewhere other than at their hashed locations. When this happens, the volumes
-need to be rebalanced. This process consists of two parts.
-
- 1. Calculate new layouts, according to the current set of bricks (and possibly
- their characteristics). We call this the "fix-layout" phase.
-
- 2. Migrate any "misplaced" files to their correct (hashed) locations, and
- clean up any linkfiles which are no longer necessary. We call this the
- "migrate-data" phase.
-
-Usually, these two phases are done together. (In fact, the code for them is
-somewhat intermingled.) However, the migrate-data phase can involve a lot of
-I/O and be very disruptive, so users can do just the fix-layout phase and defer
-migrate-data until a more convenient time. This allows new files to be placed
-on new bricks, even though old files might still be in the "wrong" place.
-
-When calculating a new layout to replace an old one, DHT specifically tries to
-maximize overlap of the assigned ranges, thus minimizing data movement. This
-difference can be very large. For example, consider the case where our example
-layout from earlier is updated to add a new double-sized brick. Here's a very
-inefficient way to do that.
-
- * Brick A (the large one): 0x00000000 to 0x55555555
-
- * Brick B: 0x55555556 to 0x7fffffff
-
- * Brick C: 0x80000000 to 0xaaaaaaaa
-
- * Brick D (the new one): 0xaaaaaaab to 0xffffffff
-
-This would cause files in the following ranges to be migrated:
-
- * 0x55555556 to 0x7fffffff (from A to B)
-
- * 0x80000000 to 0xaaaaaaaa (from B to C)
-
- * 0xaaaaaaab to 0xbfffffff (from B to D)
-
- * 0xc0000000 to 0xffffffff (from C to D)
-
-As an historical note, this is exactly what we used to do, and in this case it
-would have meant moving 8/12 (two thirds) of all files in the volume. Now let's
-consider a new layout that's optimized to maximize overlap with the old one.
-
- * Brick A: 0x00000000 to 0x55555555
-
- * Brick D: 0x55555556 to 0xaaaaaaaa <- optimized insertion point
-
- * Brick B: 0xaaaaaaab to 0xd5555554
-
- * Brick C: 0xd5555555 to 0xffffffff
-
-In this case we only need to move 5/12 of all files. In a volume with millions
-or even billions of files, reducing data movement by 1/4 of all files is a
-pretty big improvement. In the future, DHT might use "virtual node IDs" or
-multiple hash rings to make rebalancing even more efficient.
-
-## Rename Optimizations
-
-With the file-lookup mechanisms we already have in place, it's not necessary to
-move a file from one brick to another when it's renamed - even across
-directories. It will still be found, albeit a little less efficiently. The
-first client to look for it after the rename will add a linkfile, which every
-other client will follow from then on. Also, every client that has found the
-file once will continue to find it based on its cached location, without any
-network traffic at all. Because the extra lookup cost is small, and the
-movement cost might be very large, DHT renames the file "in place" on its
-current brick instead (taking advantage of the fact that directories exist
-everywhere).
-
-This optimization is further extended to handle cases where renames are very
-common. For example, rsync and similar tools often use a "write new then
-rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then
-moved into place only after its contents have been fully written. To make this
-process more efficient, DHT uses a regular expression to separate the permanent
-part of a file's name (in this case "xxx") from what is likely to be a
-temporary part (the leading "." and trailing ".1234"). That way, after the
-file is renamed it will be in its correct hashed location - which it wouldn't
-be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or
-broadcast lookups will be necessary.
-
-In fact, there are two regular expressions available for this purpose -
-*cluster.rsync-hash-regex* and *cluster.extra-hash-regex*. As its name
-implies, *rsync-hash-regex* defaults to the pattern that rsync uses, while
-*extra-hash-regex* can be set by the user to support a second tool using the
-same temporary-file idiom.
-
-## Commit Hashes
-
-A very recent addition to DHT's algorithmic arsenal is intended to reduce the
-number of "broadcast" lookups that it issues. If a volume is completely in
-balance, then no file could exist anywhere but at its hashed location.
-Therefore, if we've already looked there and not found it, then looking
-elsewhere would be pointless (and wasteful). The *commit hash* mechanism is
-used to detect this case. A commit hash is assigned to a volume, and
-separately to each directory, and then updated according to the following
-rules.
-
- * The volume commit hash is changed whenever actions are taken that might
- cause layout assignments across all directories to become invalid - i.e.
- bricks being added, removed, or replaced.
-
- * The directory commit hash is changed whenever actions are taken that might
- cause files to be "misplaced" - e.g. when they're renamed.
-
- * The directory commit hash is set to the volume commit hash when the
- directory is created, and whenever the directory is fully rebalanced so that
- all files are at their hashed locations.
-
-In other words, whenever either the volume or directory commit hash is changed,
-that creates a mismatch. In that case we revert to the "pessimistic"
-broadcast-lookup method described earlier. However, if the two hashes match
-then we can skip the broadcast lookup and return a result immediately.
-This has been observed to cause a 3x performance improvement in workloads that
-involve creating many small files across many bricks.
-
-[dynamo]: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
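The "Placing Files" and "Rebalancing" sections of the DHT document above can be made concrete with a small Python sketch. It is a simplified model under stated assumptions, not DHT's code: a layout is a plain dict mapping a brick name to an inclusive hash range, `locate` and `moved_fraction` are hypothetical helper names, and hashes are assumed to be spread uniformly over the 32-bit space. The layouts are the ones from the examples, with Brick A ending at 0x7fffffff so that Brick B starts directly after it.

```python
HASH_SPACE = 1 << 32  # layout ranges cover a 32-bit hash space


def locate(layout, file_hash):
    """Return the brick whose assigned range contains file_hash."""
    for brick, (start, end) in layout.items():
        if start <= file_hash <= end:
            return brick
    raise ValueError("hash not covered by any layout range")


def moved_fraction(old, new):
    """Fraction of the hash space whose owning brick differs between layouts."""
    moved = 0
    for new_brick, (n_start, n_end) in new.items():
        for old_brick, (o_start, o_end) in old.items():
            if old_brick == new_brick:
                continue  # same owner before and after: nothing to migrate
            low, high = max(n_start, o_start), min(n_end, o_end)
            if low <= high:
                moved += high - low + 1
    return moved / HASH_SPACE


# Weighted three-brick layout from the "Layouts" example.
old_layout = {
    "A": (0x00000000, 0x7fffffff),
    "B": (0x80000000, 0xbfffffff),
    "C": (0xc0000000, 0xffffffff),
}

# The two candidate layouts after adding Brick D.
naive_layout = {
    "A": (0x00000000, 0x55555555),
    "B": (0x55555556, 0x7fffffff),
    "C": (0x80000000, 0xaaaaaaaa),
    "D": (0xaaaaaaab, 0xffffffff),
}
optimized_layout = {
    "A": (0x00000000, 0x55555555),
    "D": (0x55555556, 0xaaaaaaaa),
    "B": (0xaaaaaaab, 0xd5555554),
    "C": (0xd5555555, 0xffffffff),
}
```

Calling `locate(old_layout, 0xabad1dea)` reproduces the hashed-location lookup from the text, and `moved_fraction(old_layout, naive_layout)` versus `moved_fraction(old_layout, optimized_layout)` compares how much of the hash space each new layout would force to migrate.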
diff --git a/doc/features/file-snapshot.md b/doc/features/file-snapshot.md
deleted file mode 100644
index 7f7c419fc7f..00000000000
--- a/doc/features/file-snapshot.md
+++ /dev/null
@@ -1,91 +0,0 @@
-#File Snapshot
-This feature gives the ability to take snapshot of files.
-
-##Description
-This feature adds file snapshotting support to glusterfs. Snapshots can be created, deleted and reverted.
-
-To take a snapshot of a file, the file must be in QCOW2 format, as the code for the block layer snapshot has been taken from Qemu and put into gluster as a translator.
-
-With this feature, glusterfs will have better integration with Openstack Cinder, and in general ability to take snapshots of files (typically VM images).
-
-New extended attribute (xattr) will be added to identify files which are 'snapshot managed' vs raw files.
-
-##Volume Options
-The following volume option needs to be set on the volume for taking file snapshots.
-
- # features.file-snapshot on
-##CLI parameters
-The following CLI parameters need to be passed to the setfattr command to create, delete and revert file snapshots.
-
- # trusted.glusterfs.block-format
- # trusted.glusterfs.block-snapshot-create
- # trusted.glusterfs.block-snapshot-goto
-##Fully loaded Example
-Download glusterfs3.5 rpms from download.gluster.org
-Install these rpms.
-
-start glusterd by using the command
-
- # service glusterd start
-Now create a volume by using the command
-
- # gluster volume create <vol_name> <brick_path>
-Run the command below to make sure that volume is created.
-
- # gluster volume info
-Now turn on the snapshot feature on the volume by using the command
-
- # gluster volume set <vol_name> features.file-snapshot on
-Verify that the option is set by using the command
-
- # gluster volume info
-User should be able to see another option in the volume info
-
- # features.file-snapshot: on
-Now mount the volume using fuse mount
-
- # mount -t glusterfs <vol_name> <mount_point>
-cd into the mount point
- # cd <mount_point>
- # touch <file_name>
-Size of the file can be set and format of the file can be changed to QCOW2 by running the command below. File size can be in KB/MB/GB
-
- # setfattr -n trusted.glusterfs.block-format -v qcow2:<file_size> <file_name>
-Now create another file and send data to that file by running the command
-
- # echo 'ABCDEFGHIJ' > <data_file1>
-Copy the data from one file to the other by running the command
-
- # dd if=<data_file1> of=<file_name> conv=notrunc
-Now take the `snapshot of the file` by running the command
-
- # setfattr -n trusted.glusterfs.block-snapshot-create -v <image1> <file_name>
-Add some more contents to the file and take another file snapshot by doing the following steps
-
- # echo '1234567890' > <data_file2>
- # dd if=<data_file2> of=<file_name> conv=notrunc
- # setfattr -n trusted.glusterfs.block-snapshot-create -v <image2> <file_name>
-Now `revert` both the file snapshots and write data to some files so that data can be compared.
-
- # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image1> <file_name>
- # dd if=<file_name> of=<out_file1> bs=11 count=1
- # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image2> <file_name>
- # dd if=<file_name> of=<out_file2> bs=11 count=1
-Now read the contents of the files and compare as below:
-
- # cat <data_file1>, <out_file1> and compare contents.
- # cat <data_file2>, <out_file2> and compare contents.
-##one line description for the variables used
-file_name = File which is created in the mount point initially.
-
-data_file1 = File which contains data 'ABCDEFGHIJ'
-
-image1 = First file snapshot which has 'ABCDEFGHIJ' + some null values.
-
-data_file2 = File which contains data '1234567890'
-
-image2 = second file snapshot which has '1234567890' + some null values.
-
-out_file1 = After reverting image1 this contains 'ABCDEFGHIJ'
-
-out_file2 = After reverting image2 this contains '1234567890'
diff --git a/doc/features/ganesha-ha.md b/doc/features/ganesha-ha.md
new file mode 100644
index 00000000000..4b226a22ccf
--- /dev/null
+++ b/doc/features/ganesha-ha.md
@@ -0,0 +1,43 @@
+# Overview of Ganesha HA Resource Agents in GlusterFS 3.7
+
+The ganesha_mon RA monitors its ganesha.nfsd daemon. While the
+daemon is running, it creates two attributes: ganesha-active and
+grace-active. When the daemon stops for any reason, the attributes
+are deleted. Deleting the ganesha-active attribute triggers the
+failover of the virtual IP (the IPaddr RA) to another node —
+according to constraint location rules — where ganesha.nfsd is
+still running.
+
+The ganesha_grace RA monitors the grace-active attribute. When
+the grace-active attribute is deleted, the ganesha_grace RA stops,
+and will not restart. This triggers pacemaker to invoke the notify
+action in the ganesha_grace RAs on the other nodes in the cluster;
+which send a DBUS message to their respective ganesha.nfsd.
+
+(N.B. grace-active is a bit of a misnomer. While the grace-active
+attribute exists, everything is normal and healthy. Deleting the
+attribute triggers putting the surviving ganesha.nfsds into GRACE.)
+
+To ensure that the remaining/surviving ganesha.nfsds are put into
+ NFS-GRACE before the IPaddr (virtual IP) fails over there is a
+short delay (sleep) between deleting the grace-active attribute
+and the ganesha-active attribute. To summarize, e.g. in a four
+node cluster:
+
+1. on node 2 ganesha_mon::monitor notices that ganesha.nfsd has died
+
+2. on node 2 ganesha_mon::monitor deletes its grace-active attribute
+
+3. on node 2 ganesha_grace::monitor notices that grace-active is gone
+and returns OCF_ERR_GENERIC, a.k.a. new error. When pacemaker tries
+to (re)start ganesha_grace, its start action will return
+OCF_NOT_RUNNING, a.k.a. known error, don't attempt further restarts.
+
+4. on nodes 1, 3, and 4, ganesha_grace::notify receives a post-stop
+notification indicating that node 2 is gone, and sends a DBUS message
+to its ganesha.nfsd, putting it into NFS-GRACE.
+
+5. on node 2 ganesha_mon::monitor waits a short period, then deletes
+its ganesha-active attribute. This triggers the IPaddr (virt IP)
+failover according to constraint location rules.
+
diff --git a/doc/features/geo-replication/distributed-geo-rep.md b/doc/features/geo-replication/distributed-geo-rep.md
deleted file mode 100644
index 0a3183d6269..00000000000
--- a/doc/features/geo-replication/distributed-geo-rep.md
+++ /dev/null
@@ -1,71 +0,0 @@
-Introduction
-============
-
-This document goes through the new design of distributed geo-replication, its features and the nature of changes involved. First we list some of the important features.
-
- - Distributed asynchronous replication
- - Fast and versatile change detection
- - Replica failover
- - Hardlink synchronization
- - Effective handling of deletes and renames
- - Configurable sync engine (rsync, tar+ssh)
- - Adaptive to a wide variety of workloads
- - GFID synchronization
-
-Geo-replication makes use of the all new *journaling* infrastructure (a.k.a. changelog) to achieve great performance and feature improvements as mentioned above. To understand more about changelogging and the helper library (*libgfchangelog*) refer to document: doc/features/geo-replication/libgfchangelog.md
-
-Data Replication
-----------------
-
-Geo-replication is responsible for incrementally replicating data from the master node to the slave. But isn't that similar to what AFR does? Yes, but here the slave is located geographically distant from the master. Geo-replication follows the eventually consistent replication model, which implies that, at any point of time, the slave would be lagging w.r.t. master, but would eventually catch up. Replication performance is dependent on two crucial factors:
- - Network latency
- - Change detection
-
-Network latency is something that is not in direct control for many reasons, but still there is always a best effort. Therefore, geo-replication offloads the data replication part to common UNIX file transfer utilities. We choose the grand daddy of file transfers, [rsync(1)] [1], as the default synchronization engine, as it's best known for its diff transfer algorithm for efficient usage of the network and lightning fast transfers (let alone the flexibility). But what about small-file performance? Due to its checksumming algorithm, rsync has more overhead for small files -- the overhead of checksumming outweighs the bytes to be transferred for small files. Therefore, geo-replication can also use a combination of tar piped over ssh to transfer a large number of small files. Tests have shown a great improvement over standard rsync. However, the sync engine is not yet dynamic to the file type and needs to be chosen manually by a configuration option.
-
-OTOH, change detection is something that is in full control of the application. Earlier (< release 3.5), geo-replication would perform a file system crawl to identify changes in the file system. This was not an unintelligent *check-every-single-inode* in the file system, but crawl logic based on *xtime*. xtime is an extended attribute maintained by the *marker* translator for each inode on the master and follows an upward-recursive marking pattern. Geo-replication would traverse a directory based on this simple condition:
-
-> xtime(master) > xtime(slave)
-
-E.g.:
-
-> MASTER SLAVE
->
-> /\ /\
-> d0 dir0 d0 dir0
-> / \ / \
-> d1 dir1 d1 dir1
-> / /
-> d2 d2
-> / /
-> file0 file0
-
-Consider the directory tree above. Assume that master and slave were in sync and the following operation happens on master:
-```
-touch /d0/d1/d2/file0
-```
-This would trigger an xtime marking (xtime being the current timestamp) from the leaf (*file0*) up to the root (*/*), i.e. an *xattr* on *file0*, *d2*, *d1*, *d0* and finally */*. The geo-replication daemon would crawl the file system based on the condition mentioned before and hence would only crawl the **left** part of the directory tree (as the **right** part would have equal xtimes).
-
-Although the above crawling algorithm is fast, it still has to crawl a good part of the file system. Also, to decide whether to crawl a particular subdirectory, geo-rep needs to compare xtime -- which is basically a **getxattr()** call on the master and slave (remember, *slave* is over a WAN).
-
-Therefore, in 3.5 the need arose to take crawling to the next level. Geo-replication now uses the changelogging infrastructure to identify changes in the filesystem. Actually, there is absolutely no crawl involved. Changelogging based detection is notification based. The geo-replication daemon registers itself with the changelog consumer library (*libgfchangelog*) and basically invokes a set of APIs to get the list of changes in the filesystem and replays them onto the slave. No crawl or extended attribute of any kind is involved.
-
-Distributed Geo-Replication
----------------------------
-Geo-replication (also known as gsyncd or geo-rep) used to be non-distributed before release 3.5. The node on which the geo-rep start command was executed was responsible for replicating data to the slave. If this node went offline for some reason (reboot, crash, etc.), replication would cease. So one of the main development efforts for release 3.5 was to *distributify* geo-replication. The geo-rep daemon running on each node (per brick) is responsible for replicating data **local** to each brick. This results in full parallelism and effective use of cluster/network resources.
-
-With release 3.5, the geo-rep start command spawns a geo-replication daemon on each node in the master cluster (one per brick). The geo-rep *status* command shows the geo-rep session status from each master node. Similarly, *stop* gracefully tears down the session from all nodes.
-
-What else is synced?
---------------------
- - GFID: Synchronizing the inode number (GFID) between master and the slave helps in synchronizing hardlinks.
- - Purges are also handled effectively as there is no entry comparison between master and slave. With changelog replay, geo-rep performs the unlink operation without having to resort to an expensive **readdir()** over the WAN.
- - Renames: With earlier geo-replication, because of the path based nature of crawling, renames were actually a delete and a create on the slave, followed by data transfer (not to mention the inode number change). Now, with changelogging, it's actually a **rename()** call on the slave.
-
-Replica Failover
-----------------
-One of the basic volume configurations is a replicated volume (synchronous replication). Having geo-replication sync data from all replicas would mean wastage of network bandwidth and possibly data corruption on the slave (though that's unlikely). Therefore, geo-rep on such volume configurations works in an **ACTIVE** and **PASSIVE** mode. The geo-rep daemon on one of the replicas is responsible for replicating data (**ACTIVE**), while the other geo-rep daemon is basically doing nothing (**PASSIVE**).
-
-On the event of the *ACTIVE* node going offline, the *PASSIVE* node identifies this event (there's a lag of max 60 seconds for this identification) and switches to *ACTIVE*; thereby taking over the role of replicating data from where the earlier *ACTIVE* node left off. This guarantees uninterrupted data replication even on node reboot/failures.
-
-[1]:http://rsync.samba.org
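The pre-3.5 xtime crawl described in the "Data Replication" section above can be sketched in Python. This is a toy model of the traversal condition `xtime(master) > xtime(slave)`, not gsyncd's code: each node is assumed to be a dict with an `xtime` value and, for directories, a `children` dict, and `crawl` is a hypothetical name. In the real system every xtime comparison is a getxattr() call, which is what made the crawl expensive over a WAN.

```python
def crawl(master, slave, path="/"):
    """Return paths of files whose xtime on the master is newer than on the slave.

    A subtree is descended into only when xtime(master) > xtime(slave);
    a missing slave node is treated as older than anything on the master.
    """
    slave_xtime = slave["xtime"] if slave is not None else -1
    if master["xtime"] <= slave_xtime:
        return []  # subtree already in sync: skip it without descending
    children = master.get("children")
    if children is None:
        return [path]  # a file (no children): it needs to be synced
    slave_children = (slave or {}).get("children") or {}
    changed = []
    for name, child in sorted(children.items()):
        child_path = path.rstrip("/") + "/" + name
        changed.extend(crawl(child, slave_children.get(name), child_path))
    return changed
```

Because the marker translator propagates xtime from a changed file up to the root, an unchanged directory has the same xtime on both sides and is pruned at its top level, which is why only one branch of the example tree is visited after the `touch`.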
diff --git a/doc/features/geo-replication/libgfchangelog.md b/doc/features/geo-replication/libgfchangelog.md
deleted file mode 100644
index 1dd0d24253a..00000000000
--- a/doc/features/geo-replication/libgfchangelog.md
+++ /dev/null
@@ -1,119 +0,0 @@
-libgfchangelog: "GlusterFS changelog" consumer library
-======================================================
-
-This document puts forward the intended need for GlusterFS changelog consumer library (a.k.a. libgfchangelog) for consuming changlogs produced by the Changelog translator. Further, it mentions the proposed design and the API exposed by it. A brief explanation of changelog translator can also be found as a commit message in the upstream source tree and the review link can be [accessed here] [1].
-
-Initial consumer of changelogs would be Geo-Replication (release 3.5). Possible consumers in the future could be backup utilities, GlusterFS self-heal, bit-rot detection, AV scanners. All these utilities have one thing in common - to get a list of changed entities (created/modified/deleted) in the file system. Therefore, the need arises to provide such functionality in the form of a shared library that applications can link against and query for changes (See API section). There is no plan as of now to provide language bindings as such, but for shell script friendliness: 'gfind' command line utility (which would be dynamically linked with libgfchangelog) would be helpful. As of now, development for this utility is still not commenced.
-
-The next section gives a brief introduction to how changelogs are organized and managed. Then we propose a couple of designs for libgfchangelog. The API set is not covered in this document (maybe later).
-
-Changelogs
-==========
-
-Changelogs can be thought of as a running history for an entity in the file system from the time the entity came into existence. The goal is to capture all possible transitions the entity underwent until the time it got purged. The transition namespace is broken up into three categories, with each category represented by a specific changelog format. Changes are recorded in a flat file in the filesystem and are rolled over after a specific time interval. All three categories are recorded in a single changelog file (sequentially) with a type for each entry. Having a single file reduces disk seeks and fragmentation and leaves fewer files to deal with. The strategy for pruning old logs is still undecided.
-
-
-Changelog Transition Namespace
-------------------------------
-
-As mentioned before the transition namespace is categorized into three types:
- - TYPE-I : Data operation
- - TYPE-II : Metadata operation
- - TYPE-III : Entry operation
-
-One could visualize the transition of a file system entity as a state machine transitioning from one type to another. For TYPE-I and TYPE-II operations there is no state transition as such, but a TYPE-III operation involves a state change from the file system's perspective. We can now classify file operations (fops) into one of the three types:
- - Data operation: write(), writev(), truncate(), ftruncate()
- - Metadata operation: setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr()
- - Entry operation: create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir()
-
-Changelog Entry Format
-----------------------
-
-In order to record the type of operation an entity underwent, a type identifier is used. Normally, the entity on which the operation is performed would be identified by the pathname, which is the most common way of addressing in a file system, but we choose to use the GlusterFS internal file identifier (GFID) instead (as GlusterFS supports a GFID based backend, the pathname field may not always be valid, and for other reasons which are out of scope of this document). Therefore, the format of the record for the three types of operation can be summarized as follows:
-
- - TYPE-I : GFID of the file
- - TYPE-II : GFID of the file
- - TYPE-III : GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME]
-
-GFIDs are analogous to inodes. TYPE-I and TYPE-II fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. TYPE-III fops record at the minimum a set of six or seven records (depending on the type of operation), which is sufficient to identify what type of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (which is an integer [an enumerated value which is used in GlusterFS]) and the parent GFID and the basename (analogous to parent inode and basename).
-
-Changelogs can be either in ascii or binary format, the difference being the format of the records that are persisted. In a binary changelog the gfids are recorded in their native format, i.e. a 16 byte record, and the fop number as a 4 byte integer. In an ascii changelog, the gfids are stored in their canonical form and the fop number is stringified and persisted. The null character is used as the record separator in changelogs. This makes it hard to read changelogs from the command line, but the packed format is needed to support file names with spaces and special characters. Below is a snippet of a changelog alongside its hexdump.
-
-```
-00000000 47 6c 75 73 74 65 72 46 53 20 43 68 61 6e 67 65 |GlusterFS Change|
-00000010 6c 6f 67 20 7c 20 76 65 72 73 69 6f 6e 3a 20 76 |log | version: v|
-00000020 31 2e 31 20 7c 20 65 6e 63 6f 64 69 6e 67 20 3a |1.1 | encoding :|
-00000030 20 32 0a 45 61 36 39 33 63 30 34 65 2d 61 66 39 | 2.Ea693c04e-af9|
-00000040 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 34 61 |e-4ba5-9ca7-1c4a|
-00000050 34 37 30 31 30 64 36 32 00 32 33 00 33 33 32 36 |47010d62.23.3326|
-00000060 31 00 30 00 30 00 66 36 35 34 32 33 32 65 2d 61 |1.0.0.f654232e-a|
-00000070 34 32 62 2d 34 31 62 33 2d 62 35 61 61 2d 38 30 |42b-41b3-b5aa-80|
-00000080 33 62 33 64 61 34 35 39 33 37 2f 6c 69 62 76 69 |3b3da45937/libvi|
-00000090 72 74 5f 64 72 69 76 65 72 5f 6e 65 74 77 6f 72 |rt_driver_networ|
-000000a0 6b 2e 73 6f 00 44 61 36 39 33 63 30 34 65 2d 61 |k.so.Da693c04e-a|
-000000b0 66 39 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 |f9e-4ba5-9ca7-1c|
-000000c0 34 61 34 37 30 31 30 64 36 32 00 45 36 65 39 37 |4a47010d62.E6e97|
-```
-
-As you can see, there is an *entry* operation (journal record starting with an "E"). Records for this operation are:
- - GFID : a693c04e-af9e-4ba5-9ca7-1c4a47010d62
- - FOP : 23 (create)
- - Mode : 33261
- - UID : 0
- - GID : 0
- - PARGFID/BNAME: f654232e-a42b-41b3-b5aa-803b3da45937/libvirt_driver_network.so
-
-**NOTE**: In case of a rename operation, there would be an additional record (for the target PARGFID/BNAME).
-
-libgfchangelog
---------------
-
-NOTE: changelogs generated by the changelog translator are rolled over [with the timestamp as the suffix] after a specific interval, after which a new changelog is started. The current changelog [changelog file without the timestamp as the suffix] should never be processed until it is rolled over. The rolled over logs should be treated as read-only.
-
-Capturing changes performed on a file system is useful for applications that rely on a file system scan (crawl) to figure out such information. Backup utilities, automatic file healing in a replicated environment, bit-rot detection and the like are some of the end user applications that require a set of changed entities in a file system to act on. The goal of libgfchangelog is to provide the application (consumer) with a fast and easy to use common query interface (API). The consumer need not worry about the changelog format, the nomenclature of the changelog files etc.
-
-Now we list functionality and some of the features.
-
-Functionality
--------------
-
-Changelog Processing: Processing involves reading changelog file(s) and converting the entries into a human-readable (or application understandable) format (in case of binary log format).
-Book-keeping: Keeping track of how much of the changelog the application has consumed (i.e. changes during the time slice start-time -> end-time).
-Serve API request: Update the consumer by providing the set of changes.
-
-Processing could be done in two ways:
-
-* Pre-processing (pre-processing from the library POV):
-Once a changelog file is rolled over (by the changelog translator), a set of post processing operations is performed. These operations could include converting a binary log file to an understandable format, collating a bunch of logs into a larger sampling period or just keeping a private copy of the changelog (in ascii format). Extra disk space is consumed to store this private copy. The library would then be free to consume these logs and serve application requests.
-
-* On-demand:
-The processing of the changelogs is triggered when an application requests changes. The downside is the additional time spent on decoding the logs and accumulating data at application request time (but no additional disk space is used over the time period).
-
-After processing, the changelog is ready to be consumed by the application. The function of processing is to convert the logs into human/application readable format (an example is shown below):
-
-```
-E a7264fe2-dd6b-43e1-8786-a03b42cc2489 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001%2Fservices1
-M a7264fe2-dd6b-43e1-8786-a03b42cc2489 NULL
-M 00000000-0000-0000-0000-000000000001 NULL
-D a7264fe2-dd6b-43e1-8786-a03b42cc2489
-```
-
-Features
---------
-
-The following points mention some of the features that the library could provide.
-
- - Consumer could choose the update type when it registers with the library. 'types' could be:
- - Streaming: The consumer is updated via stream of changes, ie. the library would just replay the logs
- - Consolidated: The consumer is provided with a consolidated view of the changelog, eg. if <gfid> had a DATA and a METADATA operation, it would be presented as a single update. Similarly for ENTRY operations.
- - Raw: This mode provides the consumer with the pathnames of the changelog files themselves (after processing). The changelogs should be strictly treated as read-only. This gives the consumer the flexibility to extract updates in their own preferred way (eg. using command line tools like sed, awk, sort | uniq etc.).
- - Application may choose to adopt a synchronous (blocking) or an asynchronous (callback) notification mechanism.
- - Provide a unified view of changelogs from multiple peers (replication scenario) or a global changelog view of the entire cluster.
-
-
-**The first cut of the library supports:**
- - Raw access mode
- - Synchronous programming model
- - Per brick changelog consumption ie. no unified/globally aggregated changelog
-
-[1]:http://review.gluster.org/5127
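The ASCII changelog layout described in libgfchangelog.md above (a one-line header, then null-separated fields, with each record opening on a type letter fused to a GFID) can be sketched as a small parser. This is only an illustration built from the hexdump shown in that document, not the library's actual implementation: the function name `parse_ascii_changelog`, the returned dictionary shape, and the rule "a token made of E, D or M followed by a canonical GFID starts a new record" are all assumptions.

```python
import re

# Canonical (hyphenated, lowercase) GFID, as seen in the hexdump.
_GFID = rb"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Assumption: a record starts with one type letter immediately followed by a GFID.
_RECORD_START = re.compile(rb"([EDM])(" + _GFID + rb")")


def parse_ascii_changelog(raw: bytes):
    """Split a rolled-over ASCII changelog into (header, records).

    Each record is a dict with the type letter, the GFID and the list of
    remaining null-separated fields (empty for data records).
    """
    header, _, body = raw.partition(b"\n")
    records = []
    for token in body.split(b"\0"):
        if not token:
            continue  # trailing separator
        start = _RECORD_START.fullmatch(token)
        if start:
            records.append({
                "type": start.group(1).decode(),
                "gfid": start.group(2).decode(),
                "fields": [],
            })
        elif records:
            # Any other token belongs to the record currently being read.
            records[-1]["fields"].append(token.decode())
    return header.decode(), records


# Sample bytes reconstructed from the hexdump in the document.
sample = b"GlusterFS Changelog | version: v1.1 | encoding : 2\n" + b"\0".join([
    b"Ea693c04e-af9e-4ba5-9ca7-1c4a47010d62", b"23", b"33261", b"0", b"0",
    b"f654232e-a42b-41b3-b5aa-803b3da45937/libvirt_driver_network.so",
    b"Da693c04e-af9e-4ba5-9ca7-1c4a47010d62",
    b"",
])
header, records = parse_ascii_changelog(sample)
```

Grouping fields by the record-start token, rather than by a fixed field count, keeps the sketch usable for rename records, which carry one extra PARGFID/BNAME field.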
diff --git a/doc/features/gfid-access.md b/doc/features/gfid-access.md
deleted file mode 100644
index 2d324a18bdb..00000000000
--- a/doc/features/gfid-access.md
+++ /dev/null
@@ -1,73 +0,0 @@
-#Gfid-access Translator
-The 'gfid-access' translator provides access to data in glusterfs using a
-virtual path. This particular translator is designed to provide direct access to
-files in glusterfs using its gfid. 'GFID' is glusterfs's inode number for a file
-to identify it uniquely. As of now, Geo-replication is the only consumer of this
-translator. The changelog translator logs the 'gfid' with corresponding file
-operation in journals which are consumed by Geo-Replication to replicate the
-files using gfid-access translator very efficiently.
-
-###Implications and Usage
-A new virtual directory called '.gfid' is exposed in the aux-gfid mount
-point when gluster volume is mounted with 'aux-gfid-mount' option.
-All the gfids of files are exposed in one level under the '.gfid' directory.
-No matter at what level the file resides, it is accessed using its
-gfid under this virtual directory as shown in the example below. All access
-protocols work seamlessly, as the complexities are handled internally.
-
-###Testing
-1. Mount glusterfs client with '-o aux-gfid-mount' as follows.
-
- mount -t glusterfs -o aux-gfid-mount <node-ip>:<volname> <mountpoint>
-
- Example:
-
- #mount -t glusterfs -o aux-gfid-mount rhs1:master /master-aux-mnt
-
-2. Get the 'gfid' of a file using normal mount or aux-gfid-mount and do some
- operations as follows.
-
- getfattr -n glusterfs.gfid.string <file>
-
- Example:
-
- #getfattr -n glusterfs.gfid.string /master-aux-mnt/file
- # file: file
- glusterfs.gfid.string="796d3170-0910-4853-9ff3-3ee6b1132080"
-
- #cat /master-aux-mnt/file
- sample data
-
- #stat /master-aux-mnt/file
- File: `file'
- Size: 12 Blocks: 1 IO Block: 131072 regular file
- Device: 13h/19d Inode: 11525625031905452160 Links: 1
- Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
- Access: 2014-05-23 20:43:33.239999863 +0530
- Modify: 2014-05-23 17:36:48.224999989 +0530
- Change: 2014-05-23 20:44:10.081999938 +0530
-
-
-3. Access files using virtual path as follows.
-
- /mountpoint/.gfid/<actual-canonical-gfid-of-the-file>
-
- Example:
-
- #cat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080
- sample data
- #stat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080
- File: `.gfid/796d3170-0910-4853-9ff3-3ee6b1132080'
- Size: 12 Blocks: 1 IO Block: 131072 regular file
- Device: 13h/19d Inode: 11525625031905452160 Links: 1
- Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
- Access: 2014-05-23 20:43:33.239999863 +0530
- Modify: 2014-05-23 17:36:48.224999989 +0530
- Change: 2014-05-23 20:44:10.081999938 +0530
-
- Note that the 'cat' command on the 'file' displays the same data whether the
- regular path or the virtual path is used. Similarly, the 'stat' command reports
- the same Inode Number for both paths, confirming that it is the same file.
-
-###Nature of changes
-This feature is introduced with 'gfid-access' translator.
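The virtual path convention described in gfid-access.md above (`<mountpoint>/.gfid/<canonical-gfid>`) can be sketched as a small helper. This is a minimal sketch, not part of GlusterFS: the function names `gfid_virtual_path` and `read_gfid` are hypothetical, and `read_gfid` assumes a Linux client where the `glusterfs.gfid.string` extended attribute shown in the document can be read through Python's `os.getxattr`.

```python
import os
import posixpath
import uuid


def gfid_virtual_path(mountpoint: str, gfid: str) -> str:
    """Return the '.gfid' virtual path for a file on an aux-gfid mount.

    uuid.UUID() rejects malformed input with ValueError and str() yields the
    canonical lowercase, hyphenated form used under the '.gfid' directory.
    """
    canonical = str(uuid.UUID(gfid))
    return posixpath.join(mountpoint, ".gfid", canonical)


def read_gfid(path: str) -> str:
    """Read the GFID of `path` from the virtual xattr named in the document.

    Assumption: the value is a text GFID, possibly NUL-terminated.
    """
    value = os.getxattr(path, "glusterfs.gfid.string")
    return value.decode().rstrip("\0")


# Example with the GFID from the document's getfattr output.
path = gfid_virtual_path("/master-aux-mnt", "796d3170-0910-4853-9ff3-3ee6b1132080")
```

On a mounted volume, `gfid_virtual_path(mnt, read_gfid(some_file))` would give the path used in step 3 of the testing procedure.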
diff --git a/doc/features/libgfapi.md b/doc/features/libgfapi.md
deleted file mode 100644
index dfc8cfe6527..00000000000
--- a/doc/features/libgfapi.md
+++ /dev/null
@@ -1,381 +0,0 @@
-One of the known methods to access glusterfs is via the FUSE module. However, it has some overhead or performance issues because of the number of context switches which need to be performed to complete one i/o transaction[1].
-
-
-To overcome this limitation, a new method called ‘libgfapi’ is introduced. libgfapi support is available from the GlusterFS-3.4 release.
-
-libgfapi is a userspace library for accessing data in glusterfs. The libgfapi library performs IO on gluster volumes directly without a FUSE mount. It is a filesystem-like api and runs in the application's process context. libgfapi eliminates FUSE and the kernel vfs layer from glusterfs volume access. Throughput and latency improve with libgfapi access. [1]
-
-
-Using libgfapi, various user-space filesystems (like NFS-Ganesha or Samba) or virtualizers (like QEMU) can interact with GlusterFS, which serves as the back-end filesystem. Currently the following projects integrate with glusterfs using the libgfapi interfaces.
-
-
-* qemu storage layer
-* Samba VFS plugin
-* NFS-Ganesha
-
-All the APIs in libgfapi make use of `struct glfs` object. This object
-contains information about volume name, glusterfs context associated,
-subvols in the graph etc which makes it unique for each volume.
-
-
-For any application to make use of libgfapi, it should typically start
-with the below APIs in the following order -
-
-* To create a new glfs object :
-
- glfs_t *glfs_new (const char *volname) ;
-
- glfs_new() returns glfs_t object.
-
-
-* On this newly created glfs_t, you need to set either a volfile path
- (glfs_set_volfile) or a volfile server (glfs_set_volfile_server).
- In case of failure, the corresponding cleanup routine is
- "glfs_unset_volfile_server"
-
- int glfs_set_volfile (glfs_t *fs, const char *volfile);
-
- int glfs_set_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ;
-
- int glfs_unset_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ;
-
-* Specify logging parameters using glfs_set_logging():
-
- int glfs_set_logging (glfs_t *fs, const char *logfile, int loglevel) ;
-
-* Initialize the glfs_t object using glfs_init():
- int glfs_init (glfs_t *fs) ;
-
-#### FOPs APIs available with libgfapi :
-
-
-
- int glfs_get_volumeid (struct glfs *fs, char *volid, size_t size);
-
- int glfs_setfsuid (uid_t fsuid) ;
-
- int glfs_setfsgid (gid_t fsgid) ;
-
- int glfs_setfsgroups (size_t size, const gid_t *list) ;
-
- glfs_fd_t *glfs_open (glfs_t *fs, const char *path, int flags) ;
-
- glfs_fd_t *glfs_creat (glfs_t *fs, const char *path, int flags,mode_t mode) ;
-
- int glfs_close (glfs_fd_t *fd) ;
-
- glfs_t *glfs_from_glfd (glfs_fd_t *fd) ;
-
- int glfs_set_xlator_option (glfs_t *fs, const char *xlator, const char *key,const char *value) ;
-
- typedef void (*glfs_io_cbk) (glfs_fd_t *fd, ssize_t ret, void *data);
-
- ssize_t glfs_read (glfs_fd_t *fd, void *buf,size_t count, int flags) ;
-
- ssize_t glfs_write (glfs_fd_t *fd, const void *buf,size_t count, int flags) ;
-
- int glfs_read_async (glfs_fd_t *fd, void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_write_async (glfs_fd_t *fd, const void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ;
-
- ssize_t glfs_readv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ;
-
- ssize_t glfs_writev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ;
-
- int glfs_readv_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_writev_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ;
-
- ssize_t glfs_pread (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags) ;
-
- ssize_t glfs_pwrite (glfs_fd_t *fd, const void *buf, size_t count, off_t offset, int flags) ;
-
- int glfs_pread_async (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_pwrite_async (glfs_fd_t *fd, const void *buf, int count, off_t offset,int flags, glfs_io_cbk fn, void *data) ;
-
- ssize_t glfs_preadv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, off_t offset, int flags) ;
-
- ssize_t glfs_pwritev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, off_t offset, int flags) ;
-
- int glfs_preadv_async (glfs_fd_t *fd, const struct iovec *iov, int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_pwritev_async (glfs_fd_t *fd, const struct iovec *iov, int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ;
-
- off_t glfs_lseek (glfs_fd_t *fd, off_t offset, int whence) ;
-
- int glfs_truncate (glfs_t *fs, const char *path, off_t length) ;
-
- int glfs_ftruncate (glfs_fd_t *fd, off_t length) ;
-
- int glfs_ftruncate_async (glfs_fd_t *fd, off_t length, glfs_io_cbk fn,void *data) ;
-
- int glfs_lstat (glfs_t *fs, const char *path, struct stat *buf) ;
-
- int glfs_stat (glfs_t *fs, const char *path, struct stat *buf) ;
-
- int glfs_fstat (glfs_fd_t *fd, struct stat *buf) ;
-
- int glfs_fsync (glfs_fd_t *fd) ;
-
- int glfs_fsync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ;
-
- int glfs_fdatasync (glfs_fd_t *fd) ;
-
- int glfs_fdatasync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ;
-
- int glfs_access (glfs_t *fs, const char *path, int mode) ;
-
- int glfs_symlink (glfs_t *fs, const char *oldpath, const char *newpath) ;
-
- int glfs_readlink (glfs_t *fs, const char *path,char *buf, size_t bufsiz) ;
-
- int glfs_mknod (glfs_t *fs, const char *path, mode_t mode, dev_t dev) ;
-
- int glfs_mkdir (glfs_t *fs, const char *path, mode_t mode) ;
-
- int glfs_unlink (glfs_t *fs, const char *path) ;
-
- int glfs_rmdir (glfs_t *fs, const char *path) ;
-
- int glfs_rename (glfs_t *fs, const char *oldpath, const char *newpath) ;
-
- int glfs_link (glfs_t *fs, const char *oldpath, const char *newpath) ;
-
- glfs_fd_t *glfs_opendir (glfs_t *fs, const char *path) ;
-
- int glfs_readdir_r (glfs_fd_t *fd, struct dirent *dirent,struct dirent **result) ;
-
- int glfs_readdirplus_r (glfs_fd_t *fd, struct stat *stat, struct dirent *dirent, struct dirent **result) ;
-
- struct dirent *glfs_readdir (glfs_fd_t *fd) ;
-
- struct dirent *glfs_readdirplus (glfs_fd_t *fd, struct stat *stat) ;
-
- long glfs_telldir (glfs_fd_t *fd) ;
-
- void glfs_seekdir (glfs_fd_t *fd, long offset) ;
-
- int glfs_closedir (glfs_fd_t *fd) ;
-
- int glfs_statvfs (glfs_t *fs, const char *path, struct statvfs *buf) ;
-
- int glfs_chmod (glfs_t *fs, const char *path, mode_t mode) ;
-
- int glfs_fchmod (glfs_fd_t *fd, mode_t mode) ;
-
- int glfs_chown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ;
-
- int glfs_lchown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ;
-
- int glfs_fchown (glfs_fd_t *fd, uid_t uid, gid_t gid) ;
-
- int glfs_utimens (glfs_t *fs, const char *path,struct timespec times[2]) ;
-
- int glfs_lutimens (glfs_t *fs, const char *path,struct timespec times[2]) ;
-
- int glfs_futimens (glfs_fd_t *fd, struct timespec times[2]) ;
-
- ssize_t glfs_getxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ;
-
- ssize_t glfs_lgetxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ;
-
- ssize_t glfs_fgetxattr (glfs_fd_t *fd, const char *name,void *value, size_t size) ;
-
- ssize_t glfs_listxattr (glfs_t *fs, const char *path,void *value, size_t size) ;
-
- ssize_t glfs_llistxattr (glfs_t *fs, const char *path, void *value,size_t size) ;
-
- ssize_t glfs_flistxattr (glfs_fd_t *fd, void *value, size_t size) ;
-
- int glfs_setxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ;
-
- int glfs_lsetxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ;
-
- int glfs_fsetxattr (glfs_fd_t *fd, const char *name,const void *value, size_t size, int flags) ;
-
- int glfs_removexattr (glfs_t *fs, const char *path, const char *name) ;
-
- int glfs_lremovexattr (glfs_t *fs, const char *path, const char *name) ;
-
- int glfs_fremovexattr (glfs_fd_t *fd, const char *name) ;
-
- int glfs_fallocate(glfs_fd_t *fd, int keep_size, off_t offset, size_t len) ;
-
- int glfs_discard(glfs_fd_t *fd, off_t offset, size_t len) ;
-
- int glfs_discard_async (glfs_fd_t *fd, off_t length, size_t lent, glfs_io_cbk fn, void *data) ;
-
- int glfs_zerofill(glfs_fd_t *fd, off_t offset, off_t len) ;
-
- int glfs_zerofill_async (glfs_fd_t *fd, off_t length, off_t len, glfs_io_cbk fn, void *data) ;
-
- char *glfs_getcwd (glfs_t *fs, char *buf, size_t size) ;
-
- int glfs_chdir (glfs_t *fs, const char *path) ;
-
- int glfs_fchdir (glfs_fd_t *fd) ;
-
- char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ;
-
- int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ;
-
- glfs_fd_t *glfs_dup (glfs_fd_t *fd) ;
-
-
- struct glfs_object *glfs_h_lookupat (struct glfs *fs,struct glfs_object *parent,
- const char *path,
- struct stat *stat) ;
-
- struct glfs_object *glfs_h_creat (struct glfs *fs, struct glfs_object *parent,
- const char *path, int flags, mode_t mode,
- struct stat *sb) ;
-
- struct glfs_object *glfs_h_mkdir (struct glfs *fs, struct glfs_object *parent,
- const char *path, mode_t flags,
- struct stat *sb) ;
-
- struct glfs_object *glfs_h_mknod (struct glfs *fs, struct glfs_object *parent,
- const char *path, mode_t mode, dev_t dev,
- struct stat *sb) ;
-
- struct glfs_object *glfs_h_symlink (struct glfs *fs, struct glfs_object *parent,
- const char *name, const char *data,
- struct stat *stat) ;
-
-
- int glfs_h_unlink (struct glfs *fs, struct glfs_object *parent,
- const char *path) ;
-
- int glfs_h_close (struct glfs_object *object) ;
-
- int glfs_caller_specific_init (void *uid_caller_key, void *gid_caller_key,
- void *future) ;
-
- int glfs_h_truncate (struct glfs *fs, struct glfs_object *object,
- off_t offset) ;
-
- int glfs_h_stat(struct glfs *fs, struct glfs_object *object,
- struct stat *stat) ;
-
- int glfs_h_getattrs (struct glfs *fs, struct glfs_object *object,
- struct stat *stat) ;
-
- int glfs_h_getxattrs (struct glfs *fs, struct glfs_object *object,
- const char *name, void *value,
- size_t size) ;
-
- int glfs_h_setattrs (struct glfs *fs, struct glfs_object *object,
- struct stat *sb, int valid) ;
-
- int glfs_h_setxattrs (struct glfs *fs, struct glfs_object *object,
- const char *name, const void *value,
- size_t size, int flags) ;
-
- int glfs_h_readlink (struct glfs *fs, struct glfs_object *object, char *buf,
- size_t bufsiz) ;
-
- int glfs_h_link (struct glfs *fs, struct glfs_object *linktgt,
- struct glfs_object *parent, const char *name) ;
-
- int glfs_h_rename (struct glfs *fs, struct glfs_object *olddir,
- const char *oldname, struct glfs_object *newdir,
- const char *newname) ;
-
- int glfs_h_removexattrs (struct glfs *fs, struct glfs_object *object,
- const char *name) ;
-
- ssize_t glfs_h_extract_handle (struct glfs_object *object,
- unsigned char *handle, int len) ;
-
- struct glfs_object *glfs_h_create_from_handle (struct glfs *fs,
- unsigned char *handle, int len,
- struct stat *stat) ;
-
-
- struct glfs_fd *glfs_h_opendir (struct glfs *fs,
- struct glfs_object *object) ;
-
- struct glfs_fd *glfs_h_open (struct glfs *fs, struct glfs_object *object,
- int flags) ;
-
-For more details on these apis please refer to glfs.h and glfs-handles.h in the source tree (api/src/) of glusterfs.
-
-* In case of failures, or to close the connection and destroy the glfs_t
-object, use glfs_fini.
-
- int glfs_fini (glfs_t *fs) ;
-
-
-All the fileops are typically divided into below categories
-
-* a) Handle based Operations -
-
-These APIs create/make use of a glfs_object (referred to as a handle) unique
-to each file within a volume.
-The structure glfs_object contains inode pointer and gfid.
-
-For example: Since NFS protocol uses file handles to access files, these APIs are
-mainly used by NFS-Ganesha server.
-
-Eg:
-
- struct glfs_object *glfs_h_lookupat (struct glfs *fs,
- struct glfs_object *parent,
- const char *path,
- struct stat *stat);
-
- struct glfs_object *glfs_h_creat (struct glfs *fs,
- struct glfs_object *parent,
- const char *path,
- int flags, mode_t mode,
- struct stat *sb);
-
- struct glfs_object *glfs_h_mkdir (struct glfs *fs,
- struct glfs_object *parent,
- const char *path, mode_t flags,
- struct stat *sb);
-
-
-
-* b) File path/descriptor based Operations -
-
-These APIs make use of a file path/descriptor to determine the file on
-which they need to operate.
-
-For example: Samba uses these APIs for file operations.
-
-Examples of the APIs using file path -
-
- int glfs_chdir (glfs_t *fs, const char *path) ;
-
- char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ;
-
-Once the file is opened, the file-descriptor generated is used for
-further operations.
-
-Eg:
-
- int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ;
- glfs_fd_t *glfs_dup (glfs_fd_t *fd) ;
-
-
-
-#### libgfapi bindings :
-
-libgfapi bindings are available for below languages:
-
- - Go
- - Java
- - python [2]
- - Ruby
-
-For more details on these bindings, please refer to:
-
- #http://www.gluster.org/community/documentation/index.php/Language_Bindings
-
-References:
-
-[1] http://humblec.com/libgfapi-interface-glusterfs/
-[2] http://www.gluster.org/2014/04/play-with-libgfapi-and-its-python-bindings/
-
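The start-up order listed in libgfapi.md above (glfs_new, then glfs_set_volfile_server, glfs_set_logging and glfs_init, with glfs_fini as cleanup) can be sketched through Python's `ctypes` as follows. This is a hedged illustration and not one of the language bindings the document mentions: the helper name `open_volume` is hypothetical, and the shared-library name `libgfapi.so.0`, the default port 24007 and log level 7 are assumptions that are not stated in the document.

```python
import ctypes


def open_volume(lib, volname, host, port=24007, transport="tcp",
                logfile="/dev/stderr", loglevel=7):
    """Run glfs_new -> glfs_set_volfile_server -> glfs_set_logging -> glfs_init.

    `lib` is a handle exposing the libgfapi C functions, for example
    ctypes.CDLL("libgfapi.so.0") (library name assumed).
    Returns the glfs_t pointer; the caller ends the session with
    lib.glfs_fini(fs).
    """
    # glfs_new() returns a pointer; declare it so ctypes does not treat it as int.
    lib.glfs_new.restype = ctypes.c_void_p
    raw = lib.glfs_new(volname.encode())
    if not raw:
        raise OSError("glfs_new() failed")
    fs = ctypes.c_void_p(raw)

    # The remaining calls return 0 on success, in the documented order.
    steps = [
        ("glfs_set_volfile_server", (fs, transport.encode(), host.encode(), port)),
        ("glfs_set_logging", (fs, logfile.encode(), loglevel)),
        ("glfs_init", (fs,)),
    ]
    for name, args in steps:
        if getattr(lib, name)(*args) != 0:
            lib.glfs_fini(fs)  # cleanup routine named in the document
            raise OSError(f"{name}() failed")
    return fs


# Usage on a machine with libgfapi installed (not executed here):
#   lib = ctypes.CDLL("libgfapi.so.0")
#   fs = open_volume(lib, "myvolume", "server1.example.com")
#   ...
#   lib.glfs_fini(fs)
```

Passing the library handle in as a parameter keeps the call sequence separate from library loading, so the ordering can be checked with a stand-in object where libgfapi is not installed.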
diff --git a/doc/features/nufa.md b/doc/features/nufa.md
deleted file mode 100644
index 03b8194b4c0..00000000000
--- a/doc/features/nufa.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# NUFA Translator
-
-The NUFA ("Non Uniform File Access") translator is a variant of the DHT ("Distributed Hash
-Table") translator, intended for use with workloads that have a high locality
-of reference. Instead of placing new files pseudo-randomly, it places them on
-the same nodes where they are created so that future accesses can be made
-locally. For replicated volumes, this means that one copy will be local and
-others will be remote; the read-replica selection mechanisms will then favor
-the local copy for reads. For non-replicated volumes, the only copy will be
-local.
-
-## Interface
-
-Use of NUFA is controlled by a volume option, as follows.
-
- gluster volume set myvolume cluster.nufa on
-
-This will cause the NUFA translator to be used wherever the DHT translator
-otherwise would be. The rest is all automatic.
-
diff --git a/doc/features/ovirt-integration.md b/doc/features/ovirt-integration.md
deleted file mode 100644
index 46dbeabbbaa..00000000000
--- a/doc/features/ovirt-integration.md
+++ /dev/null
@@ -1,106 +0,0 @@
-##Ovirt Integration with glusterfs
-
-oVirt is an opensource virtualization management platform. You can use oVirt to manage
-hardware nodes, storage and network resources, and to deploy and monitor virtual machines
-running in your data center. oVirt serves as the bedrock for Red Hat's Enterprise Virtualization product,
-and is the "upstream" project where new features are developed in advance of their inclusion
-in that supported product offering.
-
-To know more about ovirt please visit http://www.ovirt.org/ and to configure
-#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
-
-For the installation step of ovirt, please refer
-#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
-
-When oVirt is integrated with gluster, glusterfs can be used in the following forms:
-
-* As a storage domain to host VM disks.
-
-There are mainly two ways to exploit glusterfs as a storage domain.
- - POSIXFS_DOMAIN ( >=oVirt 3.1 )
- - GLUSTERFS_DOMAIN ( >=oVirt 3.3)
-
-The former one has performance overhead and is not an ideal way to consume images hosted in glusterfs volumes.
-With this method, qemu accesses VM images through a glusterfs `mount point` and incurs FUSE overhead.
-The libvirt treats this as a file type disk in its xml schema.
-
-The latter is the recommended way of using glusterfs with ovirt as a storage domain. This provides a better
-and more efficient way to access images hosted under glusterfs volumes. When qemu accesses a glusterfs volume using this method,
-it makes use of the `libgfapi` implementation of glusterfs, and this method is called native integration.
-Here glusterfs is added as a block backend to qemu and libvirt treats this as a `network` type disk.
-
-For more details on this, please refer # http://www.ovirt.org/Features/GlusterFS_Storage_Domain
-However there are 2 bugs which block usage of this feature.
-
-https://bugzilla.redhat.com/show_bug.cgi?id=1022961
-https://bugzilla.redhat.com/show_bug.cgi?id=1017289
-
-Please check above bugs for latest status.
-
-* To manage gluster trusted pools.
-
-oVirt web admin console can be used to -
- - add new / import existing gluster cluster
- - add/delete volumes
- - add/delete bricks
- - set/reset volume options
- - optimize volume for virt store
- - Rebalance and Remove bricks
- - Monitor gluster deployment - node, brick, volume status,
- Enhanced service monitoring (Physical node resources as well Quota, geo-rep and self-heal status) through Nagios integration(>=oVirt 3.4)
-
-
-
-When configuring ovirt to manage only a gluster cluster/trusted pool, you need to select `gluster` as the input for
-`Application mode` in OVIRT ENGINE CONFIGURATION option of `engine-setup` command.
-Refer # http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
-
-If you want to use gluster as both ( as a storage domain to host VM disks and to manage gluster trusted pools)
-you need to input `both` as a value for `Application mode` in engine-setup command.
-
-Once you have successfully installed oVirt Engine as mentioned above, you will be provided with instructions
-to access oVirt's web console.
-
-Below example shows how to configure gluster nodes in fedora.
-
-
-#Configuring gluster nodes.
-
-On the machine designated as your host, install any supported distribution( ex:Fedora/CentOS/RHEL...etc).
-A minimal installation is sufficient.
-
-Refer # http://www.ovirt.org/Quick_Start_Guide#Install_Hosts
-
-
-##Connect to Ovirt Engine
-
-Log In to Administration Console
-
-Ensure that you have the administrator password configured during installation of oVirt engine.
-
-- To connect to oVirt webadmin console
-
-
-Open a browser and navigate to https://domain.example.com/webadmin. Substitute domain.example.com with the URL provided during installation
-
-If this is your first time connecting to the administration console, oVirt Engine will issue
-security certificates for your browser. Click the link labelled this certificate to trust the
-ca.cer certificate. A pop-up displays, click Open to launch the Certificate dialog.
-Click `Install Certificate` and select to place the certificate in Trusted Root Certification Authorities store.
-
-
-The console login screen displays. Enter admin as your User Name, and enter the Password that
-you provided during installation. Ensure that your domain is set to Internal. Click Login.
-
-
-You have now successfully logged in to the oVirt web administration console. Here, you can configure and manage all your gluster resources.
-
-To manage gluster trusted pool:
-
-- Create a cluster with "Enable gluster service" - turned on. (Turn on "Enable virt service" if the same nodes are used as hypervisor as well)
-- Add hosts which have already been set up as in step Configuring gluster nodes.
-- Create a volume, and click on "Optimize for virt store". This sets the volume tunables so that the volume is optimized for use as an image store
-
-To use this volume as a storage domain:
-
-Please refer to the `User interface` section of www.ovirt.org/Features/GlusterFS_Storage_Domain
diff --git a/doc/features/qemu-integration.md b/doc/features/qemu-integration.md
deleted file mode 100644
index b44dc06bb43..00000000000
--- a/doc/features/qemu-integration.md
+++ /dev/null
@@ -1,231 +0,0 @@
-Using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via GlusterFS native client. However this has changed now with two specific enhancements:
-
-- A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support is available from GlusterFS-3.4 release.
-- QEMU (starting from QEMU-1.3) has a GlusterFS block driver that uses libgfapi, and hence there is no longer any FUSE overhead when QEMU works with VM images on gluster volumes.
-
-GlusterFS with its pluggable translator model can serve as a flexible storage backend for QEMU. QEMU has to just talk to GlusterFS and GlusterFS will hide different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available for QEMU. Efforts are also on to add block device backend in Gluster via Block Device (BD) translator that will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.
-
-###GlusterFS specification in QEMU
-
-VM image residing on gluster volume can be specified on QEMU command line using URI format
-
- gluster[+transport]://[server[:port]]/volname/image[?socket=...]
-
-
-
-* `gluster` is the protocol.
-
-* `transport` specifies the transport type used to connect to the gluster management daemon (glusterd). Valid transport types are `tcp`, `unix` and `rdma`. If a transport type isn't specified, then tcp is assumed.
-
-* `server` specifies the server where the volume file specification for the given volume resides. This can be either hostname, ipv4 address or ipv6 address. ipv6 address needs to be within square brackets [ ]. If transport type is unix, then server field should not be specified. Instead the socket field needs to be populated with the path to unix domain socket.
-
-* `port` is the port number on which glusterd is listening. This is optional; if it is not specified, QEMU will send 0, which makes gluster use the default port. If the transport type is unix, then port should not be specified.
-
-* `volname` is the name of the gluster volume which contains the VM image.
-
-* `image` is the path to the actual VM image that resides on gluster volume.
-
-
-###Examples:
-
- gluster://1.2.3.4/testvol/a.img
- gluster+tcp://1.2.3.4/testvol/a.img
- gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
- gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
- gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
- gluster+tcp://server.domain.com:24007/testvol/dir/a.img
- gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
- gluster+rdma://1.2.3.4:24007/testvol/a.img
-
-
-
-NOTE: (GlusterFS URI description and above examples are taken from QEMU documentation)
-
-###Configuring QEMU with GlusterFS backend
-
-While building QEMU from source, in addition to the normal configuration options, ensure that the `--enable-glusterfs` option is specified explicitly to the ./configure script to get glusterfs support in qemu.
-
-Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script:
-
- export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
-
-Without this, GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system.
-
-* Creating a VM image on GlusterFS backend
-
-qemu-img command can be used to create VM images on gluster backend. The general syntax for image creation looks like this:
-
-For ex:
-
- qemu-img create gluster://server/volname/path/to/image size
-
-## How to setup the environment:
-
-This use case (using a glusterfs backend as a VM disk store) is known as the 'Virt-Store' use case. The entire procedure can be split into:
-
-* Steps to be done on gluster volume side
-* Steps to be done on Hypervisor side
-
-
-##Steps to be done on gluster side
-
-These are the steps that need to be done on the gluster side. Specifically, this involves:
-
- Creating "Trusted Storage Pool"
- Creating a volume
- Tuning the volume for virt-store
- Tuning glusterd to accept requests from QEMU
- Tuning glusterfsd to accept requests from QEMU
- Setting ownership on the volume
- Starting the volume
-
-* Creating "Trusted Storage Pool"
-
-Install glusterfs rpms on the NODE. You can create a volume with a single node. You can also scale up the cluster, which we call a `Trusted Storage Pool`, by adding more nodes to it:
-
- gluster peer probe <hostname>
-
-* Creating a volume
-
-It is highly recommended to use a replicate or distribute-replicate volume for the virt-store use case, as it adds high availability and fault tolerance. Remember that a plain distribute volume works equally well.
-
- gluster volume create <volname> replica 2 <brick1> .. <brickN>
-
-where `<brick1>` is `<hostname>:/<path-of-dir>`
-
-
-Note: It is recommended to create sub-directories inside the brick and use those to create a volume. For example, if /home/brick1 is the mountpoint of an XFS filesystem, you can create a sub-directory /home/brick1/b1 inside it and use that while creating a volume. You can also use space available in the root filesystem for bricks. The gluster CLI, by default, issues a warning in that case. You can override it by using the force option:
-
- gluster volume create <volname> replica 2 <brick1> .. <brickN> force
-
-If you are new to GlusterFS, you can take a look at QuickStart (http://www.gluster.org/community/documentation/index.php/QuickStart) guide.
-
-* Tuning the volume for virt-store
-
-There are recommended settings available for virt-store. These provide good performance characteristics when enabled on a volume that is used for virt-store.
-
-Refer to http://www.gluster.org/community/documentation/index.php/Virt-store-usecase#Tunables for recommended tunables and for applying them on the volume, http://www.gluster.org/community/documentation/index.php/Virt-store-usecase#Applying_the_Tunables_on_the_volume
-
-
-* Tuning glusterd to accept requests from QEMU
-
-glusterd accepts requests only from applications that connect from a port number lower than 1024, and blocks all others. QEMU uses a port number greater than 1024; to make glusterd accept requests from QEMU, edit the glusterd vol file, /etc/glusterfs/glusterd.vol, and add the following:
-
- option rpc-auth-allow-insecure on
-
-Note: If you have installed glusterfs from source, you can find glusterd vol file at `/usr/local/etc/glusterfs/glusterd.vol`
-
-Restart glusterd after adding that option to glusterd vol file
-
- service glusterd restart
-
-* Tuning glusterfsd to accept requests from QEMU
-
-Enable the option `allow-insecure` on the particular volume
-
- gluster volume set <volname> server.allow-insecure on
-
-IMPORTANT: As of now (April 2, 2014) there is a bug: allow-insecure is not applied dynamically to a volume. You need to restart the volume for the change to take effect.
-
-
-* Setting ownership on the volume
-
-Set the ownership of the volume to qemu:qemu (uid 107 and gid 107 in the commands below)
-
- gluster volume set <vol-name> storage.owner-uid 107
- gluster volume set <vol-name> storage.owner-gid 107
-
-* Starting the volume
-
-Start the volume
-
- gluster volume start <vol-name>
-
-## Steps to be done on Hypervisor Side:
-
-To create a raw image,
-
- qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
-
-To create a qcow2 image,
-
- qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
-
-
-
-
-
-## Booting VM image from GlusterFS backend
-
-A VM image 'a.img' residing on gluster volume testvol can be booted using QEMU like this:
-
-
- qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio
-
-In addition to VM images, gluster drives can also be used as data drives:
-
- qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio
-
-Here 'a-data.img' from datavol gluster volume appears as a 2nd drive for the guest.
-
-It is also possible to make use of libvirt to define a disk and use it with qemu:
-
-
-### Create libvirt XML to define Virtual Machine
-
-virt-install is a Python wrapper that is mostly used to create a VM from a set of parameters. However, virt-install doesn't support any network filesystem [ https://bugzilla.redhat.com/show_bug.cgi?id=1017308 ]
-
-Create a libvirt VM XML description (see http://libvirt.org/formatdomain.html) in which the disk section is written so that the qemu driver for glusterfs is used. This can be seen in the following example XML description:
-
-
- <disk type='network' device='disk'>
- <driver name='qemu' type='raw' cache='none'/>
- <source protocol='gluster' name='distrepvol/vm3.img'>
- <host name='10.70.37.106' port='24007'/>
- </source>
- <target dev='vda' bus='virtio'/>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
- </disk>
-
-
-
-
-
-* Define the VM from the XML file that was created earlier
-
-
- virsh define <xml-file-description>
-
-* Verify that the VM is created successfully
-
-
- virsh list --all
-
-* Start the VM
-
-
- virsh start <VM>
-
-* Verification
-
-You can verify the disk image file that is being used by the VM:
-
- virsh domblklist <VM-Domain-Name/ID>
-
-The above should show the volume name and image name. Here is an example:
-
-
- [root@test ~]# virsh domblklist vm-test2
- Target Source
- ------------------------------------------------
- vda distrepvol/test.img
- hdc -
-
-
-Reference:
-
-For more details on this feature implementation and its advantages, please refer to:
-
-http://raobharata.wordpress.com/2012/10/29/qemu-glusterfs-native-integration/
-
-http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt
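The `gluster[+transport]://[server[:port]]/volname/image[?socket=...]` format described in qemu-integration.md above can be sketched as a small builder. This is only an illustration of the rules stated in that text; the function name `gluster_uri` and its parameters are hypothetical and are not part of QEMU or GlusterFS.

```python
def gluster_uri(volname, image, server=None, port=None, transport=None, socket=None):
    """Build a gluster URI following the rules quoted from the QEMU docs.

    Hypothetical helper for illustration; not a QEMU or GlusterFS API.
    """
    if transport not in (None, "tcp", "unix", "rdma"):
        raise ValueError("transport must be tcp, unix or rdma")
    # tcp is assumed when no transport is given, so the suffix is optional.
    scheme = "gluster" if transport is None else "gluster+" + transport

    if transport == "unix":
        # With unix transport, server and port must be absent and the
        # socket field carries the path to the unix domain socket.
        if server is not None or port is not None:
            raise ValueError("server/port must not be given with unix transport")
        if socket is None:
            raise ValueError("unix transport needs a socket path")
        authority = ""
        query = "?socket=" + socket
    else:
        if socket is not None:
            raise ValueError("socket is only valid with unix transport")
        if port is not None and server is None:
            raise ValueError("port requires a server")
        authority = server or ""
        # IPv6 addresses must be wrapped in square brackets.
        if ":" in authority and not authority.startswith("["):
            authority = "[" + authority + "]"
        if port is not None:
            authority += ":" + str(port)
        query = ""

    return scheme + "://" + authority + "/" + volname + "/" + image.lstrip("/") + query
```

Called with the values from the examples in that file, the helper reproduces strings such as `gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img`.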
diff --git a/doc/features/quota-scalability.md b/doc/features/quota-scalability.md
deleted file mode 100644
index e47c898dd2a..00000000000
--- a/doc/features/quota-scalability.md
+++ /dev/null
@@ -1,52 +0,0 @@
-Issues with older implementation:
------------------------------------
-* >#### Enforcement of quota was done on client side. This had the following two issues:
- > >* All clients are not trusted and hence enforcement is not secure.
- > >* Quota enforcer caches directory size for a certain time out period to reduce network calls to fetch size. On time out, this cache is validated by querying server. With more clients, the traffic caused due to this
-validation increases.
-
-* >#### Relying on lookup calls on a file/directory (inode) to update its contribution [time consuming]
-
-* >####Hardlimits were stored in a comma separated list.
- > >* Hence, changing hard limit of one directory is not an independent operation and would invalidate hard limits of other directories. We need to parse the string once for each of these directories just to identify whether its hard limit is changed. This limits the number of hard limits we can configure.
-
-* >####CLI used to fetch the list of directories on which quota-limit is set, from glusterd.
- > >* With a larger number of limits, the network overhead incurred to fetch this list limits the scalability of the number of directories on which we can set quota.
-
-* >#### Problem with NFS mount
- > >* Quota, for its enforcement and accounting requires all the ancestors of a file/directory till root. However, with NFS relying heavily on nameless lookups (through which there is no guarantee that ancestry can be
-accessed) this ancestry is not always present. Hence accounting and enforcement was not correct.
-
-
-New Design Implementation:
---------------------------------
-
-* Quota enforcement is moved to server side. This addresses issues that arose because of client side enforcement.
-
-* Two levels of quota limits, soft and hard, are introduced.
- This will result in a message being logged on reaching soft quota and writes will fail with EDQUOT after hard limit is reached.
-
-Work Flow
------------------
-
-* Accounting
- # This is done using the marker translator loaded on each brick of the volume. Accounting happens in the background, i.e., it doesn't happen in-flight with the file operation, so file operation latency is not
-directly affected by the time taken to perform accounting. This update is sent recursively upwards up to the root of the volume.
-
-* Enforcement
- # The enforcer updates its 'view' (cached) of directory's disk usage on the incidence of a file operation after the expiry of hard/soft timeout, depending on the current usage. Enforcer uses quotad to get the
-aggregated disk usage of a directory from the accounting information present on each brick (viz, provided by marker).
-
-* Aggregator (quotad)
- # Quotad is a daemon that serves volume-wide disk usage of a directory, on which quota is configured. It is present on all nodes in the cluster (trusted storage pool) as bricks don't have a global view of cluster.
-Quotad queries the disk usage information from all the bricks in that volume and aggregates. It manages all the volumes on which quota is enabled.
-
-
-Benefit to GlusterFS
----------------------------------
-
-* Support up to 65536 quota configurations per volume.
-* More quotas can be configured in a single volume, which lets GlusterFS support use cases such as home directories.
-
-###For more information on quota usability refer to the following link:
-> https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/Administration_Guide/index.html#chap-User_Guide-Dir_Quota-Enable
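The work flow described in quota-scalability.md above (per-brick accounting, aggregation by quotad, and soft/hard limit enforcement) can be sketched roughly as follows. This is a minimal model under stated assumptions, not the GlusterFS implementation: the function names, the dictionary of per-brick usage, and the rule that a write is rejected when it would push usage past the hard limit are all hypothetical.

```python
import errno
import os


def aggregate_usage(per_brick_usage):
    """Model of the aggregator: sum the usage each brick's marker reports.

    `per_brick_usage` maps a (hypothetical) brick name to bytes used under
    one directory on that brick.
    """
    return sum(per_brick_usage.values())


def enforce(used, write_size, soft_limit, hard_limit):
    """Model of the enforcer's decision for a single write.

    Returns True when the soft limit has been reached (the text says a
    message is logged in that case) and raises OSError(EDQUOT) when the
    hard limit would be exceeded. The exact comparison is an assumption.
    """
    projected = used + write_size
    if projected > hard_limit:
        raise OSError(errno.EDQUOT, os.strerror(errno.EDQUOT))
    return projected >= soft_limit


# Usage: two bricks report usage for the same directory.
used = aggregate_usage({"brick1": 300, "brick2": 500})
soft_warning = enforce(used, write_size=100, soft_limit=850, hard_limit=1000)
```

Keeping aggregation separate from enforcement mirrors the split in the text, where bricks have no global view and rely on quotad for the volume-wide figure.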
diff --git a/doc/features/rdmacm.md b/doc/features/rdmacm.md
deleted file mode 100644
index 2c287e85fb6..00000000000
--- a/doc/features/rdmacm.md
+++ /dev/null
@@ -1,26 +0,0 @@
-## Rdma Connection manager ##
-
-### What? ###
-Infiniband requires addresses of end points to be exchanged using an out-of-band channel (like tcp/ip). Glusterfs used a custom protocol over a tcp/ip channel to exchange this address. However, librdmacm provides the same functionality with the advantage of being a standard protocol. This helps if we want to communicate with a non-glusterfs entity (say nfs client with gluster nfs server) over infiniband.
-
-### Dependencies ###
-* [IP over Infiniband](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html) - The value to *option* **remote-host** in glusterfs transport configuration should be an IPoIB address
-* [rdma cm kernel module](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html#ss4.4)
-* [user space rdmacm library - librdmacm](https://www.openfabrics.org/downloads/rdmacm)
-
-### rdma-cm in >= GlusterFS 3.4 ###
-
-Following is the impact of http://review.gluster.org/#change,149.
-
-New userspace packages needed:
-librdmacm
-librdmacm-devel
-
-### Limitations ###
-
-* Because of bug [890502](https://bugzilla.redhat.com/show_bug.cgi?id=890502), we have to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over an IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
-
-* Currently, a client is free to choose between tcp and rdma transports while communicating with the server (by creating volumes with **transport-type tcp,rdma**). This independence was a by-product of our ability to use the tcp/ip channel - transports with *option transport-type tcp* - for the rdma connection establishment handshake too. However, with the new requirement of an IPoIB address for connection establishment, we lose this independence (until we bring in [multi-network support](https://bugzilla.redhat.com/show_bug.cgi?id=765437) - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
-
-### External links ###
-* [Infiniband Howto](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html)
diff --git a/doc/features/readdir-ahead.md b/doc/features/readdir-ahead.md
deleted file mode 100644
index 5302a021202..00000000000
--- a/doc/features/readdir-ahead.md
+++ /dev/null
@@ -1,14 +0,0 @@
-## Readdir-ahead ##
-
-### Summary ###
-Provide read-ahead support for directories to improve sequential directory read performance.
-
-### Owners ###
-Brian Foster
-
-### Detailed Description ###
-The read-ahead feature for directories is analogous to read-ahead for files. The objective is to detect sequential directory read operations and establish a pipeline for directory content. When a readdir request is received and fulfilled, preemptively issue subsequent readdir requests to the server in anticipation of those requests from the user. If sequential readdir requests are received, the directory content is already immediately available in the client. If subsequent requests are not sequential or not received, said data is simply dropped and the optimization is bypassed.
-
-readdir-ahead is currently disabled by default. It can be enabled with the following command:
-
- gluster volume set <volname> readdir-ahead on
diff --git a/doc/features/rebalance.md b/doc/features/rebalance.md
deleted file mode 100644
index 29b993008d2..00000000000
--- a/doc/features/rebalance.md
+++ /dev/null
@@ -1,74 +0,0 @@
-## Background
-
-
-For a more detailed description, view Jeff Darcy's blog post [here]
-(http://hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/)
-
-GlusterFS uses the distribute translator (DHT) to aggregate the space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file from its name. The file is then placed in the subvolume with the hash range that contains the hash value.
-
-## What is rebalance?
-
-The rebalance process migrates files between the DHT subvolumes when necessary.
-
-## When is rebalance required?
-
-Rebalancing is required for two main cases.
-
-1. Addition/Removal of bricks
-
-2. Renaming of a file
-
-## Addition/Removal of bricks
-
-Whenever the number or order of DHT subvolumes changes, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity.
-
-Addition of bricks which increase the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (This doesn't happen when bricks are added to a volume to increase redundancy, i.e. increase replica count of a volume). This will require an explicit rebalance command to be issued to migrate the files.
-
-Removal of bricks which decreases the size of a volume also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed.
-
-## Renaming of a file
-
-Renaming of file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
-
-## How does rebalance work?
-
-At a high level, the rebalance process consists of the following 3 steps:
-
-1. Crawl the volume to access all files
-2. Calculate the hash for the file
-3. If needed, migrate the file to the correct subvolume.
-
-
-The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
-
-## What will happen if rebalance is not run?
-
-### Addition of bricks
-
-With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately, i.e., the hash ranges are not recalculated immediately. This means that files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
-
-### Renaming a file
-
-When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume leaving the actual file on the original subvolume. A link file is an empty file, which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
-
-## Are clients affected during a rebalance process?
-
-The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
-
-##How are open files migrated?
-
-(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
-
-To achieve migration of open files, two things need to be ensured:
-a) any writes or changes happening to the file during migration are correctly synced to destination subvolume after the migration is complete.
-b) any further changes should be made to the destination subvolume
-
-Both of these requirements require sending notifications to clients. Clients are notified by overloading an attribute used in every callback function. DHT understands these attributes in the callbacks and can thus be notified whether a file is being migrated or not.
-
-During rebalance, a file will be in two phases
-
-1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set an 'in-migration' attribute on the file, which will notify the clients' DHT translator. The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way we satisfy the first requirement.
-
-2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
-
-The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration.
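The hash-range placement described in rebalance.md above, and why adding a subvolume makes some files eligible for migration, can be sketched as follows. This is a simplified model under stated assumptions: ranges are split evenly, and `zlib.crc32` stands in for the hash function, which is not the one DHT actually uses. All function names are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # DHT hashes are 32-bit


def assign_ranges(subvols):
    """Give each subvolume an equal, contiguous slice of the hash space."""
    step = HASH_SPACE // len(subvols)
    ranges = []
    for i, subvol in enumerate(subvols):
        start = i * step
        # The last subvolume absorbs any remainder so the space is covered.
        end = HASH_SPACE - 1 if i == len(subvols) - 1 else (i + 1) * step - 1
        ranges.append((subvol, start, end))
    return ranges


def name_hash(name):
    """Stand-in 32-bit hash of the file name (assumption: not DHT's hash)."""
    return zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF


def locate(name, ranges):
    """Return the subvolume whose range contains the file's hash."""
    h = name_hash(name)
    for subvol, start, end in ranges:
        if start <= h <= end:
            return subvol
    raise LookupError("hash not covered by any range")


def needs_migration(names, old_ranges, new_ranges):
    """Files whose hashed subvolume differs between two layouts."""
    return [n for n in names if locate(n, old_ranges) != locate(n, new_ranges)]


# Usage: compare a two-subvolume layout with a three-subvolume one.
old = assign_ranges(["subvol-0", "subvol-1"])
new = assign_ranges(["subvol-0", "subvol-1", "subvol-2"])
to_move = needs_migration(["a.img", "b.img", "c.img"], old, new)
```

In this model the three steps of the rebalance process correspond to iterating over the names (crawl), calling `name_hash` (hash), and moving whatever `needs_migration` returns.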
diff --git a/doc/features/server-quorum.md b/doc/features/server-quorum.md
deleted file mode 100644
index 7b20084cea8..00000000000
--- a/doc/features/server-quorum.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Server Quorum
-
-Server quorum is a feature intended to reduce the occurrence of "split brain"
-after a brick failure or network partition. Split brain happens when different
-sets of servers are allowed to process different sets of writes, leaving data
-in a state that can not be reconciled automatically. The key to avoiding split
-brain is to ensure that there can be only one set of servers - a quorum - that
-can continue handling writes. Server quorum does this by the brutal but
-effective means of forcing down all brick daemons on cluster nodes that can no
-longer reach enough of their peers to form a majority. Because there can only
-be one majority, there can be only one set of bricks remaining, and thus split
-brain can not occur.
-
-## Options
-
-Server quorum is controlled by two parameters:
-
- * **cluster.server-quorum-type**
-
- This value may be "server" to indicate that server quorum is enabled, or
- "none" to mean it's disabled.
-
- * **cluster.server-quorum-ratio**
-
- This is the percentage of cluster nodes that must be up to maintain quorum.
- More precisely, this percentage of nodes *plus one* must be up.
-
-Note that these are cluster-wide flags. All volumes served by the cluster will
-be affected. Once these values are set, quorum actions - starting or stopping
-brick daemons in response to node or network events - will be automatic.
-
-## Best Practices
-
-If a cluster with an even number of nodes is split exactly down the middle,
-neither half can have quorum (which requires **more than** half of the total).
-This is particularly important when N=2, in which case the loss of either node
-leads to loss of quorum. Therefore, it is highly advisable to ensure that the
-cluster size is three or greater. The "extra" node in this case need not have
-any bricks or serve any data. It need only be present to preserve the notion
-of a quorum majority less than the entire cluster membership, allowing the
-cluster to survive the loss of a single node without losing quorum.
-
-
-
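The majority rule described in server-quorum.md above can be written down as a short check. This is one reading of the text ("more than" the configured percentage of nodes must be up), not glusterd's actual code; the function name and the default ratio of 50 are assumptions made for illustration.

```python
def has_quorum(nodes_up, total_nodes, ratio_percent=50):
    """True when strictly more than ratio_percent of the nodes are up.

    Integer arithmetic avoids floating-point rounding at the boundary.
    Hypothetical helper; glusterd's real calculation may differ.
    """
    return nodes_up * 100 > total_nodes * ratio_percent


# Usage: the N=2 and N=3 cases discussed under Best Practices.
two_node_after_loss = has_quorum(1, 2)    # one of two nodes left
three_node_after_loss = has_quorum(2, 3)  # two of three nodes left
```

Under this reading, an even split of a four-node cluster leaves neither half with quorum, which matches the Best Practices paragraph.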
diff --git a/doc/features/worm.md b/doc/features/worm.md
deleted file mode 100644
index dba99777da5..00000000000
--- a/doc/features/worm.md
+++ /dev/null
@@ -1,75 +0,0 @@
-#WORM (Write Once Read Many)
-This feature enables you to create a `WORM volume` using gluster CLI.
-##Description
-WORM (write once, read many) is a desired feature for users who want to store data such as `log files`, where the data must not be modified.
-
-GlusterFS provides a new key `features.worm` which takes boolean values(enable/disable) for volume set.
-
-Internally, the volume set command with the 'features.worm' key will add the 'features/worm' translator in the brick's volume file.
-
-`This change would be reflected on a subsequent restart of the volume`, i.e gluster volume stop, followed by a gluster volume start.
-
-With a volume converted to WORM, the changes are as follows:
-
-* Reads are handled normally
-* Only files with O_APPEND flag will be supported.
-* Truncation and deletion won't be supported.
-
-##Volume Options
-Use the volume set command on a volume and see if the volume is actually turned into WORM type.
-
- # features.worm enable
-##Fully loaded Example
-The WORM feature is supported from glusterfs version 3.4.
-start glusterd by using the command
-
- # service glusterd start
-Now create a volume by using the command
-
- # gluster volume create <vol_name> <brick_path>
-start the volume created by running the command below.
-
- # gluster vol start <vol_name>
-Run the command below to make sure that volume is created.
-
- # gluster volume info
-Now turn on the WORM feature on the volume by using the command
-
- # gluster vol set <vol_name> features.worm enable
-Verify that the option is set by using the command
-
- # gluster volume info
-User should be able to see another option in the volume info
-
- # features.worm: enable
-Now restart the volume for the changes to reflect, by performing volume stop and start.
-
- # gluster volume stop <vol_name>
- # gluster volume start <vol_name>
-Now mount the volume using fuse mount
-
- # mount -t glusterfs <hostname>:/<vol_name> <mnt_point>
-create a file inside the mount point by running the command below
-
- # touch <file_name>
-Verify that user is able to create a file by running the command below
-
- # ls <file_name>
-
-##How To Test
-Now try deleting the file that was created above
-
- # rm <file_name>
-Since WORM is enabled on the volume, it gives the following error message `rm: cannot remove '/<mnt_point>/<file_name>': Read-only file system`
-
-put some content into the file which is created above.
-
- # echo "at the end of the file" >> <file_name>
-Now try editing the file by running the command below and verify that a `Read-only file system` error is displayed
-
- # sed -i "1iAt the beginning of the file" <file_name>
-Now read the contents of the file and verify that file can be read.
-
- cat <file_name>
-
-`Note: If WORM option is set on the volume before it is started, then volume need not be restarted for the changes to get reflected`.
diff --git a/doc/features/zerofill.md b/doc/features/zerofill.md
deleted file mode 100644
index c0f1fc5014c..00000000000
--- a/doc/features/zerofill.md
+++ /dev/null
@@ -1,26 +0,0 @@
-#zerofill API for GlusterFS
-zerofill() API would allow creation of pre-allocated and zeroed-out files on GlusterFS volumes by offloading the zeroing part to server and/or storage (storage offloads use SCSI WRITESAME).
-## Description
-
-Zerofill writes zeroes to a file in the specified range. This fop will be useful when a whole file needs to be initialized with zero (could be useful for zero filled VM disk image provisioning or during scrubbing of VM disk images).
-
-A client/application can issue this FOP for zeroing out. The Gluster server will zero out the required range of bytes, i.e. server-offloaded zeroing. In the absence of this fop, the client/application has to repeatedly issue write (zero) fops to the server, which is a very inefficient method because of the overheads involved in RPC calls and acknowledgements.
-
-WRITESAME is a SCSI T10 command that takes a block of data as input and writes the same data to other blocks. This write is handled completely within the storage and hence is known as offload. Linux now has support for the SCSI WRITESAME command, which is exposed to the user in the form of the BLKZEROOUT ioctl. The BD xlator can exploit the BLKZEROOUT ioctl to implement this fop. Thus zeroing out operations can be completely offloaded to the storage device,
-making it highly efficient.
-
-The fop takes two arguments offset and size. It zeroes out 'size' number of bytes in an opened file starting from 'offset' position.
-This feature adds zerofill support to the following areas:
-- libglusterfs
-- io-stats
-- performance/md-cache,open-behind
-- quota
-- cluster/afr,dht,stripe
-- rpc/xdr
-- protocol/client,server
-- io-threads
-- marker
-- storage/posix
-- libgfapi
-
-Client applications can exploit this fop by using glfs_zerofill, introduced in libgfapi. FUSE support for this fop has not been added, as there is no system call for it.
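The effect of the fop described in zerofill.md above (zero out `size` bytes of an open file starting at `offset`) can be illustrated on a local file with ordinary writes. This sketch does not call libgfapi or `glfs_zerofill`; it only models the resulting file contents, using the kind of repeated zero writes that the fop is meant to replace when they would otherwise cross the network. The function name and the chunk size are assumptions.

```python
import os


def zerofill(fd, offset, size, chunk=1 << 20):
    """Write `size` zero bytes into `fd` starting at `offset`.

    Local, non-offloaded model of the fop's semantics (POSIX only,
    because it relies on os.pwrite).
    """
    zeros = bytes(chunk)
    done = 0
    while done < size:
        n = min(chunk, size - done)
        # pwrite may write fewer bytes than requested, so advance by
        # the count it actually reports.
        done += os.pwrite(fd, zeros[:n], offset + done)
```

With server-side or storage-side offload, the loop above runs next to the data (or is replaced by a single WRITESAME-style request) instead of being driven by the client one RPC at a time.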