Diffstat (limited to 'doc/features')
| -rw-r--r-- | doc/features/ctime.md | 68 |
| -rw-r--r-- | doc/features/ganesha-ha.md | 43 |
| -rw-r--r-- | doc/features/rdma-cm-in-3.4.0.txt | 9 |
| -rw-r--r-- | doc/features/rebalance.md | 74 |
4 files changed, 111 insertions, 83 deletions
diff --git a/doc/features/ctime.md b/doc/features/ctime.md
new file mode 100644
index 00000000000..74a77abed4b
--- /dev/null
+++ b/doc/features/ctime.md
@@ -0,0 +1,68 @@
+# Consistent time attributes in gluster across replica/distribute
+
+
+#### Problem:
+Traditionally gluster has been using time attributes (ctime, atime, mtime) of files/dirs from bricks. The problem with this approach is that it is not consistent across replica and distribute bricks. Applications which depend on it break, as replica might not always return time attributes from the same brick.
+
+Tar especially gives "file changed as we read it" whenever it detects ctime differences when stat is served from different bricks. The way we have been trying to solve it is to serve the stat structures from the same brick in afr, and max-time in dht. But it doesn't avoid the problem completely. Because there is no way to change ctime at the moment (lutimes() only allows mtime, atime), there is little we can do to make sure ctimes match after self-heals/xattr updates/rebalance.
+
+#### Solution Proposed:
+Store time attributes (ctime, mtime, atime) as an xattr of the file. The xattr is updated based
+on the fop. If a filesystem fop changes only mtime and ctime, update only those in xattr for
+that file.
+
+#### Design Overview:
+1) As part of each fop, the top layer will generate a time stamp and pass it down along
+ with other information
+ - This will bring a dependency for NTP synced clients along with servers
+ - There can be a diff in time if the fop is stuck in the xlator for various reasons,
+for example, because of locks.
+
+ 2) On the server, the posix layer stores the value in memory (inode ctx) and will sync the data periodically to the disk as an extended attr
+ - Of course a sync call will also force it. If a fop comes for an inode which is not linked, we do the sync immediately.
+
+ 3) Each time inodes are created or initialized, the data is read from disk and stored in inode ctx.
+
+ 4) Before setting to inode_ctx we compare the timestamp stored and the timestamp received, and only store if the stored value is less than the received value.
+
+ 5) So in the best case data will be stored and retrieved from memory. We replace the values in iatt with the values in inode_ctx.
+
+ 6) File ops that change the parent directory time attributes need to be consistent across all the distributed directories across the subvolumes. (e.g. a create call will change ctime and mtime of the parent dir)
+
+ - This has to be handled separately because we only send the fop to the hashed subvolume.
+ - We can asynchronously send the timeupdate setattr fop to the other subvolumes and change the values for the parent directory if the file fop is successful on the hashed subvolume.
+ - This will have a window where the times are inconsistent across dht subvolumes (Please provide your suggestions)
+
+7) Currently we have a couple of mount options for time attributes like noatime, relatime, nodiratime etc. But we do not explicitly handle those options even if they are given as mount options on a gluster mount.
+
+
+#### Implementation Overview:
+This feature involves changes in the following xlators.
+ - utime xlator
+ - posix xlator
+
+##### utime xlator:
+This is a new client side xlator which does the following tasks.
+
+1. It will generate a time stamp and pass it down in frame->root->ctime and over the network.
+2. Based on the fop, it also decides the time attributes to be updated, and this is passed using "frame->root->flags"
+
+ Patches:
+ 1. https://review.gluster.org/#/c/19857/
+
+##### posix xlator:
+The following tasks are done in the posix xlator:
+
+1. Provides APIs to set and get the xattr from the backend. It also caches the xattr in the inode context. During get, it updates time attributes stored in the xattr into the iatt structure.
+2. Based on the flags from the utime xlator, relevant fops update the time attributes in the xattr.
+
+ Patches:
+ 1. https://review.gluster.org/#/c/19267/
+ 2. https://review.gluster.org/#/c/19795/
+ 3. https://review.gluster.org/#/c/19796/
+
+#### Pending Work:
+1. Handling of time related mount options (noatime, relatime, etc.)
+2. flag based create (depending on flags in open, create behaviour might change)
+3. Changes in dht for directory sync across multiple subvolumes
+4. readdirp stat needs to be worked on.
diff --git a/doc/features/ganesha-ha.md b/doc/features/ganesha-ha.md
new file mode 100644
index 00000000000..4b226a22ccf
--- /dev/null
+++ b/doc/features/ganesha-ha.md
@@ -0,0 +1,43 @@
+# Overview of Ganesha HA Resource Agents in GlusterFS 3.7
+
+The ganesha_mon RA monitors its ganesha.nfsd daemon. While the
+daemon is running, it creates two attributes: ganesha-active and
+grace-active. When the daemon stops for any reason, the attributes
+are deleted. Deleting the ganesha-active attribute triggers the
+failover of the virtual IP (the IPaddr RA) to another node,
+according to constraint location rules, where ganesha.nfsd is
+still running.
+
+The ganesha_grace RA monitors the grace-active attribute. When
+the grace-active attribute is deleted, the ganesha_grace RA stops,
+and will not restart. This triggers pacemaker to invoke the notify
+action in the ganesha_grace RAs on the other nodes in the cluster,
+which send a DBUS message to their respective ganesha.nfsd.
+
+(N.B. grace-active is a bit of a misnomer. While the grace-active
+attribute exists, everything is normal and healthy. Deleting the
+attribute triggers putting the surviving ganesha.nfsds into GRACE.)
+
+To ensure that the remaining/surviving ganesha.nfsds are put into
+ NFS-GRACE before the IPaddr (virtual IP) fails over, there is a
+short delay (sleep) between deleting the grace-active attribute
+and the ganesha-active attribute. To summarize, e.g. in a four
+node cluster:
+
+1. on node 2 ganesha_mon::monitor notices that ganesha.nfsd has died
+
+2. on node 2 ganesha_mon::monitor deletes its grace-active attribute
+
+3. on node 2 ganesha_grace::monitor notices that grace-active is gone
+and returns OCF_ERR_GENERIC, a.k.a. new error. When pacemaker tries
+to (re)start ganesha_grace, its start action will return
+OCF_NOT_RUNNING, a.k.a. known error, don't attempt further restarts.
+
+4. on nodes 1, 3, and 4, ganesha_grace::notify receives a post-stop
+notification indicating that node 2 is gone, and sends a DBUS message
+to its ganesha.nfsd, putting it into NFS-GRACE.
+
+5. on node 2 ganesha_mon::monitor waits a short period, then deletes
+its ganesha-active attribute. This triggers the IPaddr (virt IP)
+failover according to constraint location rules.
+
diff --git a/doc/features/rdma-cm-in-3.4.0.txt b/doc/features/rdma-cm-in-3.4.0.txt
deleted file mode 100644
index fd953e56b3f..00000000000
--- a/doc/features/rdma-cm-in-3.4.0.txt
+++ /dev/null
@@ -1,9 +0,0 @@
-Following is the impact of http://review.gluster.org/#change,149.
-
-New userspace packages needed:
-librdmacm
-librdmacm-devel
-
-rdmacm needs an IPoIB address for connection establishment. This requirement results in the following issues:
-* Because of bug #890502, we have to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over an IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
-* Currently the client is free to choose between tcp and rdma transports while communicating with the server (by creating volumes with transport-type tcp,rdma). This independence was a byproduct of our ability to use the normal channel used with transport-type tcp for the rdma connection establishment handshake too. However, with the new requirement of an IPoIB address for connection establishment, we lose this independence (till we bring in multi-network support - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
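Note on the ctime.md design added above: steps 2 to 5 of its Design Overview describe keeping the time attributes in the inode context and only ever moving them forward. The following is a minimal Python sketch of that update rule, not the posix xlator code. The flag names and values, the `TimeAttrs` type and the `apply_fop_time` helper are all hypothetical; the document only says that the utime xlator passes a time stamp and a set of flags selecting which attributes a fop updates.

```python
from dataclasses import dataclass

# Hypothetical flag bits. The document says the attributes to update travel in
# frame->root->flags, but it does not give the actual values.
UPDATE_ATIME = 0x1
UPDATE_MTIME = 0x2
UPDATE_CTIME = 0x4


@dataclass
class TimeAttrs:
    """Stand-in for the time attributes cached in the inode context."""
    atime: float = 0.0
    mtime: float = 0.0
    ctime: float = 0.0


def apply_fop_time(cached: TimeAttrs, stamp: float, flags: int) -> TimeAttrs:
    """Return the cached attributes after a fop carrying `stamp` and `flags`.

    Only the attributes selected by `flags` are touched, and a value is
    replaced only if the stored value is less than the received one (step 4).
    """
    def pick(old: float, wanted: int) -> float:
        return max(old, stamp) if flags & wanted else old

    return TimeAttrs(
        atime=pick(cached.atime, UPDATE_ATIME),
        mtime=pick(cached.mtime, UPDATE_MTIME),
        ctime=pick(cached.ctime, UPDATE_CTIME),
    )
```

In this sketch a write-like fop would carry `UPDATE_MTIME | UPDATE_CTIME`, so atime is left alone, and a delayed fop that arrives with an older stamp cannot move a stored time backwards.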
diff --git a/doc/features/rebalance.md b/doc/features/rebalance.md
deleted file mode 100644
index 29b993008d2..00000000000
--- a/doc/features/rebalance.md
+++ /dev/null
@@ -1,74 +0,0 @@
-## Background
-
-
-For a more detailed description, view Jeff Darcy's blog post [here]
-(http://hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/)
-
-GlusterFS uses the distribute translator (DHT) to aggregate the space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file from its name. The file is then placed in the subvolume with the hash range that contains the hash value.
-
-## What is rebalance?
-
-The rebalance process migrates files between the DHT subvolumes when necessary.
-
-## When is rebalance required?
-
-Rebalancing is required for two main cases.
-
-1. Addition/Removal of bricks
-
-2. Renaming of a file
-
-## Addition/Removal of bricks
-
-Whenever the number or order of DHT subvolumes changes, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity.
-
-Addition of bricks which increase the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (this doesn't happen when bricks are added to a volume to increase redundancy, i.e. increase the replica count of a volume). This will require an explicit rebalance command to be issued to migrate the files.
-
-Removal of bricks which decrease the size of a volume also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed.
-
-## Renaming of a file
-
-Renaming a file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
-
-## How does rebalance work?
-
-At a high level, the rebalance process consists of the following 3 steps:
-
-1. Crawl the volume to access all files
-2. Calculate the hash for the file
-3. If needed, migrate the file to the correct subvolume.
-
-
-The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
-
-## What will happen if rebalance is not run?
-
-### Addition of bricks
-
-With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately, i.e., the hash ranges are not recalculated immediately. This means that the files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
-
-### Renaming a file
-
-When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume, leaving the actual file on the original subvolume. A link file is an empty file which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
-
-## Are clients affected during a rebalance process?
-
-The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
-
-## How are open files migrated?
-
-(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
-
-To achieve migration of open files, two things need to be ensured:
-a) any writes or changes happening to the file during migration are correctly synced to the destination subvolume after the migration is complete.
-b) any further changes should be made to the destination subvolume
-
-Both of these requirements require sending notifications to clients. Clients are notified by overloading an attribute used in every callback function. DHT understands these attributes in the callbacks and can be notified if a file is being migrated or not.
-
-During rebalance, a file will be in one of two phases
-
-1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set an 'in-migration' attribute on the file, which will notify the clients' DHT translator. The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way we satisfy the first requirement.
-
-2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
-
-The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration.
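Note on the removed rebalance.md: its Background and "How does rebalance work?" sections describe placing each file in the subvolume whose 32-bit hash range contains the hash of the file name, and moving files once the ranges are recalculated. The following is a rough Python sketch of that idea under stated assumptions, not DHT's implementation. `zlib.crc32` stands in for the real hash function, which the document does not name, the ranges are assumed to be equal in size, and every function name here is hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # DHT hashes are 32-bit, per the document


def assign_ranges(subvols):
    """Split the hash space into contiguous ranges, one per subvolume.

    Equal-sized ranges are an assumption made for this sketch.
    """
    step = HASH_SPACE // len(subvols)
    ranges = []
    for i, name in enumerate(subvols):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == len(subvols) - 1 else start + step - 1
        ranges.append((start, end, name))
    return ranges


def name_hash(filename):
    """Stand-in 32-bit hash of the file name (not the hash DHT really uses)."""
    return zlib.crc32(filename.encode("utf-8")) & 0xFFFFFFFF


def hashed_subvol(ranges, filename):
    """Return the subvolume whose range contains the hash of the name."""
    h = name_hash(filename)
    for start, end, name in ranges:
        if start <= h <= end:
            return name
    raise ValueError("hash is outside every range")


def files_to_migrate(placement, ranges):
    """Map each misplaced file to the subvolume it should be moved to.

    `placement` maps a file name to the subvolume currently holding it; this
    mirrors the crawl, hash and migrate steps listed in the document.
    """
    moves = {}
    for filename, current in placement.items():
        target = hashed_subvol(ranges, filename)
        if target != current:
            moves[filename] = target
    return moves
```

With this sketch, computing `placement` against the ranges for two subvolumes and then calling `files_to_migrate` with the ranges for three subvolumes lists the files a rebalance would have to move after the add-brick. A rename fits the same picture: the new name may hash into a different range, which is the case where the document says DHT leaves a link file until rebalance moves the data.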
