Documentation for tiering feature. (WIP)

This is a WIP.

Change-Id: Ia36f77d158a370f77cb866a32308b27e10d39b5e
BUG: 1218638
Signed-off-by: Dan Lambright <dlambrig@redhat.com>
Reviewed-on: http://review.gluster.org/10656
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
author: Dan Lambright <dlambrig@redhat.com> 2015-05-07 22:48:24 -0400
committer: Humble Devassy Chirammal <humble.devassy@gmail.com> 2015-05-07 22:49:13 -0700
commit: 4ccd70b323d4cb929b7b7a88e592fc98fab06198 (patch)
tree: 9bd7f439f2f2cd670a281f66ed00d4020f0b8353
parent: d914cd909b9a99d7645b633000940195277bb6ff (diff)
1 files changed, 118 insertions, 0 deletions
diff --git a/doc/features/tier/tier.txt b/doc/features/tier/tier.txt
new file mode 100644
index 00000000000..3b99fb82c32
--- /dev/null
+++ b/doc/features/tier/tier.txt
@@ -0,0 +1,118 @@
+Tiering
+=======
+
+* Feature page:
+http://www.gluster.org/community/documentation/index.php/Features/data-classification
+
+* Design: goo.gl/bkU5qv
+
+Theory of operation
+-------------------
+
+The tiering feature enables different storage types to be used by the same
+logical volume. In Gluster 3.7, the two types are classified as "cold" and
+"hot", and are represented as two groups of bricks. The hot group acts as
+a cache for the cold group. The bricks within the two groups themselves are
+arranged according to standard Gluster volume conventions, e.g. replicated,
+distributed replicated, or dispersed.
+
+A normal gluster volume can become a tiered volume by "attaching" bricks
+to it. The attached bricks become the "hot" group. The bricks within the
+original gluster volume are the "cold" bricks.
+
+For example, the original volume may be dispersed on HDD, and the "hot"
+tier could be distributed-replicated SSDs.
+
+Once this new "tiered" volume is built, I/Os to it are subjected to caching
+heuristics:
+
+* All I/Os are forwarded to the hot tier.
+
+* If a lookup on the hot tier fails, the I/O is forwarded to the cold
+tier. This is a "cache miss".
+
+* Files on the hot tier that are not touched within some time are demoted
+(moved) to the cold tier (see performance parameters, below).
+
+* Files on the cold tier that are touched one or more times are promoted
+(moved) to the hot tier (see performance parameters, below).
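The promote/demote rules in the list above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `FileRecord`, `migration_action`, `DEMOTE_IDLE_SECONDS`, and `PROMOTE_MIN_TOUCHES` are hypothetical and do not come from the Gluster source, and the threshold values are placeholders for the performance parameters the text refers to.

```python
from dataclasses import dataclass

# Hypothetical thresholds standing in for the "performance parameters";
# the real tunable names and defaults are not given in this document.
DEMOTE_IDLE_SECONDS = 3600
PROMOTE_MIN_TOUCHES = 1


@dataclass
class FileRecord:
    path: str
    tier: str           # "hot" or "cold"
    last_access: float  # seconds since the epoch
    touches: int        # accesses seen in the current window


def migration_action(rec: FileRecord, now: float) -> str:
    """Return "demote", "promote", or "stay" for one file."""
    # Hot files that have been idle for too long move to the cold tier.
    if rec.tier == "hot" and now - rec.last_access > DEMOTE_IDLE_SECONDS:
        return "demote"
    # Cold files touched at least the minimum number of times move up.
    if rec.tier == "cold" and rec.touches >= PROMOTE_MIN_TOUCHES:
        return "promote"
    return "stay"
```

A file on the hot tier that was read an hour and a half ago would be demoted under these placeholder values, while a cold file with a single recent access would be promoted.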
+
+This resembles implementations by Ceph and the Linux device mapper (DM)
+component.
+
+Performance enhancements being considered include:
+
+* Biasing migration of large files over small.
+
+* Only demoting when the hot tier is close to full.
+
+* Write-back cache for database updates.
+
+Code organization
+-----------------
+
+The design endeavors to be upward compatible with future migration policies,
+such as scheduled file migration, data classification, etc. For example,
+the caching logic is self-contained and separate from the file migration. A
+different set of migration policies could use the same underlying migration
+engine. The I/O tracking and metadata store components are intended to be
+reusable for things besides caching semantics.
+
+Metadata:
+
+A database stores metadata on the files. Entries within it are added or
+removed by the changetimerecorder translator. The database is queried by
+the migration daemon. The results of the queries drive which files are to
+be migrated.
+
+The database resides within the libgfdb subdirectory. There is one database
+for each brick. The database is currently sqlite. However, the libgfdb
+library API is not tied to sqlite, and a different database could be used.
+
+For more information on libgfdb see the doc file: libgfdb.txt.
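To make the division of labor concrete, here is a minimal sketch of a per-brick metadata store using Python's built-in `sqlite3` module. It is not the libgfdb API: the table name `file_heat`, its columns, and the helper functions are all assumptions made for illustration. `record_access` and `remove_entry` stand in for what the changetimerecorder translator does, and `idle_since` for the kind of query the migration daemon issues.

```python
import sqlite3


def open_brick_db(path=":memory:"):
    """Open (or create) the database for one brick. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS file_heat ("
        " file_id TEXT PRIMARY KEY,"
        " last_access REAL NOT NULL,"
        " touches INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def record_access(db, file_id, when):
    """Add an entry if missing, then bump its access time and counter."""
    db.execute(
        "INSERT OR IGNORE INTO file_heat (file_id, last_access, touches)"
        " VALUES (?, ?, 0)",
        (file_id, when),
    )
    db.execute(
        "UPDATE file_heat SET last_access = ?, touches = touches + 1"
        " WHERE file_id = ?",
        (when, file_id),
    )


def remove_entry(db, file_id):
    """Drop the entry for a file, e.g. after it is deleted."""
    db.execute("DELETE FROM file_heat WHERE file_id = ?", (file_id,))


def idle_since(db, cutoff):
    """Return ids of files not accessed since `cutoff`, oldest first."""
    rows = db.execute(
        "SELECT file_id FROM file_heat WHERE last_access < ?"
        " ORDER BY last_access",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

Because callers only see these four functions, the storage engine behind them could be swapped without touching the caller, which is the property the text attributes to the libgfdb library API.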
+
+I/O tracking:
+
+The changetimerecorder server-side translator generates metadata about I/Os
+as they happen. Metadata is then entered into the database after the I/O
+completes. Internal I/Os are not included.
+
+Migration daemon:
+
+When a tiered volume is created, a migration daemon starts. There is one daemon
+for every tiered volume per node. The daemon sleeps and then periodically
+queries the database for files to promote or demote. The query callbacks
+assemble files into a list, which is then enumerated. The frequencies at
+which promotions and demotions happen are subject to user configuration.
+
+Selected files are migrated between the tiers using existing DHT migration
+logic. The tier translator will leverage DHT rebalance performance
+enhancements.
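The daemon's sleep, query, and enumerate cycle can be sketched as below. All names here (`run_cycle`, `daemon_loop`, `promote_every`, `demote_every`) are hypothetical, the frequencies are expressed in loop ticks purely for illustration, and the `promote`/`demote` callables stand in for the DHT migration logic rather than reproducing it.

```python
import time
from typing import Callable, Iterable, List


def run_cycle(query: Callable[[], Iterable[str]],
              migrate: Callable[[str], None]) -> int:
    """Collect query results into a list, then migrate each entry."""
    files: List[str] = list(query())  # query results assembled into a list
    for file_id in files:             # the list is then enumerated
        migrate(file_id)
    return len(files)


def daemon_loop(find_promotable, find_demotable, promote, demote,
                promote_every=2, demote_every=3, ticks=6,
                tick_seconds=1.0, sleep=time.sleep):
    """Sleep, then run promote and demote cycles at their own frequencies."""
    for tick in range(1, ticks + 1):
        sleep(tick_seconds)
        if tick % promote_every == 0:
            run_cycle(find_promotable, promote)
        if tick % demote_every == 0:
            run_cycle(find_demotable, demote)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a long-running daemon would loop indefinitely instead of for a fixed number of ticks.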
+
+Tier translator:
+
+The tier translator is the root node in tiered volumes. The first subvolume
+is the cold tier, and the second the hot tier. DHT logic for forwarding I/Os
+is largely unchanged. Exceptions are handled according to the dht_methods_t
+structure, which forks control according to DHT or tier type.
+
+The major exception is that DHT's layout is not used for choosing hashed
+subvolumes. Rather, the hot tier is always the hashed subvolume.
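The idea of forking control through a method table can be illustrated with a short sketch. This is not the real `dht_methods_t` structure: the `Methods` class, its single field, and both picker functions are assumptions, and `dht_pick` uses a simple hash-modulo as a stand-in for DHT's layout-based selection.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Sequence

COLD, HOT = 0, 1  # subvolume order from the text: cold first, hot second


def dht_pick(name: str, subvols: Sequence[str]) -> str:
    # Stand-in only: plain DHT consults the directory layout, not a modulo.
    return subvols[zlib.crc32(name.encode()) % len(subvols)]


def tier_pick(name: str, subvols: Sequence[str]) -> str:
    # The tier translator ignores the name: the hot tier is always chosen.
    return subvols[HOT]


@dataclass(frozen=True)
class Methods:
    """Hypothetical method table; shared code calls through it."""
    pick_hashed_subvol: Callable[[str, Sequence[str]], str]


DHT_METHODS = Methods(pick_hashed_subvol=dht_pick)
TIER_METHODS = Methods(pick_hashed_subvol=tier_pick)


def hashed_subvol(methods: Methods, name: str, subvols: Sequence[str]) -> str:
    """Shared code path: the table decides DHT or tier behavior."""
    return methods.pick_hashed_subvol(name, subvols)
```

With `TIER_METHODS`, every file name resolves to the second subvolume, while the same call with `DHT_METHODS` spreads names across the subvolumes.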
+
+Changes to DHT were made to allow "stacking", i.e. DHT over DHT:
+
+* readdir operations remember the index of the "leaf node" in the volume graph
+(client id), rather than a unique index for each DHT instance.
+
+* Each DHT instance uses a unique extended attribute for tracking migration.
+
+* In certain cases, it is legal for tiered volumes to have unpopulated inodes
+(whereas this would be an error in DHT's case).
+
+Currently tiered volume expansion (adding and removing bricks) is unsupported.
+
+glusterd:
+
+The tiered volume tree is a composition of two other volumes. The glusterd
+daemon builds it. Existing logic for adding and removing bricks is heavily
+leveraged to attach and detach tiers, and perform statistics collection.
+
+
+