From 9c21c5a632ba22a6f46d037bd4fa4d825b24d07f Mon Sep 17 00:00:00 2001
From: Joseph Fernandes
Date: Fri, 8 May 2015 15:18:06 +0530
Subject: tiering: Correction to tiering documentation

1) convert to md format
2) Add info about ctr and libgfdb

Change-Id: I531d8a0bff8195f759302c5e613c7af2113729eb
BUG: 1218638
Signed-off-by: Joseph Fernandes
Reviewed-on: http://review.gluster.org/10665
Reviewed-by: Humble Devassy Chirammal
Tested-by: Humble Devassy Chirammal
---
 doc/features/tier/tier.md  | 168 +++++++++++++++++++++++++++++++++++++++++++++
 doc/features/tier/tier.txt | 118 -------------------------------
 2 files changed, 168 insertions(+), 118 deletions(-)
 create mode 100644 doc/features/tier/tier.md
 delete mode 100644 doc/features/tier/tier.txt

diff --git a/doc/features/tier/tier.md b/doc/features/tier/tier.md
new file mode 100644
index 00000000000..13e7d971bdf
--- /dev/null
+++ b/doc/features/tier/tier.md
@@ -0,0 +1,168 @@
+##Tiering
+
+* ####Feature page:
+http://www.gluster.org/community/documentation/index.php/Features/data-classification
+
+* ####Design: goo.gl/bkU5qv
+
+###Theory of operation
+
+The tiering feature enables different storage types to be used by the same
+logical volume. In Gluster 3.7, the two types are classified as "cold" and
+"hot", and are represented as two groups of bricks. The hot group acts as
+a cache for the cold group. The bricks within the two groups are themselves
+arranged according to standard Gluster volume conventions, e.g. replicated,
+distributed replicated, or dispersed.
+
+A normal gluster volume can become a tiered volume by "attaching" bricks
+to it. The attached bricks become the "hot" group. The bricks within the
+original gluster volume are the "cold" bricks.
+
+For example, the original volume may be dispersed on HDD, and the "hot"
+tier could be distributed-replicated SSDs.
+
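+As a rough illustration, the steps below sketch how such a tiered volume
+might be assembled. The volume name, servers, brick paths and counts are
+placeholders, and the attach-tier syntax shown follows the Gluster 3.7 CLI,
+so it should be checked against `gluster help` on the build in use:
+
+    # Cold tier: a normal dispersed volume on the HDD bricks
+    gluster volume create tiervol disperse 6 redundancy 2 \
+        server{1..6}:/bricks/hdd/tiervol
+    gluster volume start tiervol
+
+    # Hot tier: attach SSD bricks; they become the "hot" group
+    # (here a 2x2 distributed-replicated set)
+    gluster volume attach-tier tiervol replica 2 \
+        server{1..4}:/bricks/ssd/tiervol
+
+    # Verify the hot and cold brick groups
+    gluster volume info tiervol
+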
+Once this new "tiered" volume is built, I/Os to it are subjected to caching
+heuristics:
+
+* All I/Os are forwarded to the hot tier.
+
+* If a lookup to the hot tier fails, the I/O is forwarded to the cold
+tier. This is a "cache miss".
+
+* Files on the hot tier that are not touched within some time are demoted
+(moved) to the cold tier (see performance parameters, below).
+
+* Files on the cold tier that are touched one or more times are promoted
+(moved) to the hot tier (see performance parameters, below).
+
+This resembles implementations by Ceph and the Linux data management (DM)
+component.
+
+Performance enhancements being considered include:
+
+* Biasing migration of large files over small.
+
+* Only demoting when the hot tier is close to full.
+
+* Adding a write-back cache for database updates.
+
+###Code organization
+
+The design endeavors to be upward compatible with future migration policies,
+such as scheduled file migration, data classification, etc. For example,
+the caching logic is self-contained and separate from the file migration. A
+different set of migration policies could use the same underlying migration
+engine. The I/O tracking and metadata store components are intended to be
+reusable for things besides caching semantics.
+
+####Libgfdb:
+
+Libgfdb provides an abstract mechanism to record the extra/rich metadata
+required for data maintenance, such as data tiering/classification.
+It provides the consumer with APIs for recording and querying, keeping
+the consumer abstracted from the data store used beneath to store the data.
+It works in a plug-and-play model, where data stores can be plugged in.
+Presently there is a plugin for Sqlite3. In the future a recording and
+querying performance optimizer will be provided. In the current
+implementation the schema of the metadata is fixed.
+
+####Schema:
+
+    GF_FILE_TB Table:
+    This table has one entry per file inode. It holds the metadata required to
+    make decisions in data maintenance.
+    GF_ID (Primary Key)       : File GFID (Universally Unique IDentifier in the namespace)
+    W_SEC, W_MSEC             : Write wind time in sec & micro-sec
+    UW_SEC, UW_MSEC           : Write un-wind time in sec & micro-sec
+    W_READ_SEC, W_READ_MSEC   : Read wind time in sec & micro-sec
+    UW_READ_SEC, UW_READ_MSEC : Read un-wind time in sec & micro-sec
+    WRITE_FREQ_CNTR INTEGER   : Write Frequency Counter
+    READ_FREQ_CNTR INTEGER    : Read Frequency Counter
+
+    GF_FLINK_TABLE:
+    This table has one entry for each hard link to a file inode.
+    GF_ID       : File GFID (Composite Primary Key)              --.
+    GF_PID      : Parent Directory GFID (Composite Primary Key)    |-> Primary Key
+    FNAME       : File Base Name (Composite Primary Key)         --'
+    FPATH       : File Full Path (redundant for now; this will be removed)
+    W_DEL_FLAG  : Flag used for crash consistency when a link is unlinked,
+                  i.e. set to 1 in the unlink wind; the record is deleted in
+                  the unlink unwind.
+    LINK_UPDATE : Flag used when a link is changed, i.e. renamed.
+                  Set to 1 in the rename wind and back to 0 in the rename unwind.
+
+Libgfdb API:
+Refer libglusterfs/src/gfdb/gfdb_data_store.h
+
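+Since the current data store plugin is Sqlite3, the per-brick database can be
+inspected with the `sqlite3` shell, which is handy when debugging promotion
+and demotion decisions. The database file name and location below are
+assumptions for illustration; only the table and column names come from the
+schema above:
+
+    # The .db location is an assumption -- find the actual file on the brick.
+    DB=/bricks/hdd/tiervol/.glusterfs/tiervol.db
+
+    # Ten most recently written files tracked on this brick, with their
+    # read/write frequency counters
+    sqlite3 "$DB" "SELECT GF_ID, WRITE_FREQ_CNTR, READ_FREQ_CNTR
+                   FROM GF_FILE_TB
+                   ORDER BY W_SEC DESC, W_MSEC DESC
+                   LIMIT 10;"
+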
+####ChangeTimeRecorder (CTR) Translator:
+
+ChangeTimeRecorder (CTR) is a server-side xlator (translator) which sits
+just above the posix xlator. The main role of this xlator is to record the
+access/write patterns of the files residing on the brick. It records the
+read (data only) and write (data and metadata) times and also counts how
+many times a file is read or written. This xlator also captures the hard
+links to a file (as they are required by data tiering to move files).
+
+The CTR xlator is a consumer of libgfdb.
+
+To Enable/Disable the CTR Xlator:
+
+    **gluster volume set <volname> features.ctr-enabled {on/off}**
+
+To Enable/Disable Frequency Counter Recording in the CTR Xlator:
+
+    **gluster volume set <volname> features.record-counters {on/off}**
+
+####Migration daemon:
+
+When a tiered volume is created, a migration daemon starts. There is one
+daemon for every tiered volume per node. The daemon sleeps and then
+periodically queries the database for files to promote or demote. The query
+callback assembles the files into a list, which is then enumerated. The
+frequencies at which promotions and demotions happen are subject to user
+configuration.
+
+Selected files are migrated between the tiers using existing DHT migration
+logic. The tier translator will leverage DHT rebalance performance
+enhancements.
+
+Configurables for the migration daemon:
+
+    gluster volume set <volname> cluster.tier-demote-frequency <value>
+
+    gluster volume set <volname> cluster.tier-promote-frequency <value>
+
+    gluster volume set <volname> cluster.read-freq-threshold <value>
+
+    gluster volume set <volname> cluster.write-freq-threshold <value>
+
+A combined tuning example is sketched at the end of this document.
+
+####Tier Translator:
+
+The tier translator is the root node in tiered volumes. The first subvolume
+is the cold tier, and the second the hot tier. DHT logic for forwarding I/Os
+is largely unchanged. Exceptions are handled according to the dht_methods_t
+structure, which forks control according to DHT or tier type.
+
+The major exception is that DHT's layout is not utilized for choosing hashed
+subvolumes. Rather, the hot tier is always the hashed subvolume.
+
+Changes to DHT were made to allow "stacking", i.e. DHT over DHT:
+
+* readdir operations remember the index of the "leaf node" in the volume graph
+(client id), rather than a unique index for each DHT instance.
+
+* Each DHT instance uses a unique extended attribute for tracking migration.
+
+* In certain cases, it is legal for tiered volumes to have unpopulated inodes
+(whereas this would be an error in DHT's case).
+
+Currently tiered volume expansion (adding and removing bricks) is unsupported.
+
+####glusterd:
+
+The tiered volume tree is a composition of two other volumes. The glusterd
+daemon builds it. Existing logic for adding and removing bricks is heavily
+leveraged to attach and detach tiers, and perform statistics collection.
+
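+Putting the options above together, a tuning session on a tiered volume
+could look like the sketch below. The volume name and numeric values are
+examples only; the frequency options are commonly interpreted in seconds,
+but defaults and valid ranges should be confirmed for the installed version:
+
+    # Record I/O times and frequency counters on every brick
+    gluster volume set tiervol features.ctr-enabled on
+    gluster volume set tiervol features.record-counters on
+
+    # Scan for demotion candidates every hour and for promotion
+    # candidates every two minutes (example values)
+    gluster volume set tiervol cluster.tier-demote-frequency 3600
+    gluster volume set tiervol cluster.tier-promote-frequency 120
+
+    # Require at least two reads/writes in a cycle before a file is promoted
+    gluster volume set tiervol cluster.read-freq-threshold 2
+    gluster volume set tiervol cluster.write-freq-threshold 2
+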
diff --git a/doc/features/tier/tier.txt b/doc/features/tier/tier.txt
deleted file mode 100644
index 3b99fb82c32..00000000000
--- a/doc/features/tier/tier.txt
+++ /dev/null
@@ -1,118 +0,0 @@
-Tiering
-=======
-
-* Feature page:
-http://www.gluster.org/community/documentation/index.php/Features/data-classification
-
-* Design: goo.gl/bkU5qv
-
-Theory of operation
-------------------
-
-The tiering feature enables different storage types to be used by the same
-logical volume. In Gluster 3.7, the two types are classified as "cold" and
-"hot", and are represented as two groups of bricks. The hot group acts as
-a cache for the cold group. The bricks within the two groups themselves are
-arranged according to standard Gluster volume conventions, e.g. replicated,
-distributed replicated, or dispersed.
-
-A normal gluster volume can become a tiered volume by "attaching" bricks
-to it. The attached bricks become the "hot" group. The bricks within the
-original gluster volume are the "cold" bricks.
-
-For example, the original volume may be dispersed on HDD, and the "hot"
-tier could be distributed-replicated SSDs.
-
-Once this new "tiered" volume is built, I/Os to it are subjected to cacheing
-heuristics:
-
-* All I/Os are forwarded to the hot tier.
-
-* If a lookup fails to the hot tier, the I/O will be forwarded to the cold
-tier. This is a "cache miss".
-
-* Files on the hot tier that are not touched within some time are demoted
-(moved) to the cold tier (see performance parameters, below).
-
-* Files on the cold tier that are touched one or more times are promoted
-(moved) to the hot tier. (see performance parameters, below).
-
-This resembles implementations by Ceph and the Linux data management (DM)
-component.
-
-Performance enhancements being considered include:
-
-* Biasing migration of large files over small.
-
-* Only demoting when the hot tier is close to full.
-
-* Write-back cache for database updates.
-
-Code organization
-----------------
-
-The design endevors to be upward compatible with future migration policies,
-such as scheduled file migration, data classification, etc. For example,
-the caching logic is self-contained and separate from the file migration. A
-different set of migration policies could use the same underlying migration
-engine. The I/O tracking and meta data store compontents are intended to be
-reusable for things besides caching semantics.
-
-Meta data:
-
-A database stores meta-data on the files. Entries within it are added or
-removed by the changetimerecorder translator. The database is queried by
-the migration daemon. The results of the queries drive which files are to
-be migrated.
-
-The database resides withi the libgfdb subdirectory. There is one database
-for each brick. The database is currently sqlite. However, the libgfdb
-library API is not tied to sqlite, and a different database could be used.
-
-For more information on libgfdb see the doc file: libgfdb.txt.
-
-I/O tracking:
-
-The changetimerecorder server-side translator generates metadata about I/Os
-as they happen. Metadata is then entered into the database after the I/O
-completes. Internal I/Os are not included.
-
-Migration daemon:
-
-When a tiered volume is created, a migration daemon starts. There is one daemon
-for every tiered volume per node. The daemon sleeps and then periodically
-queries the database for files to promote or demote. The query callbacks
-assembles files in a list, which is then enumerated. The frequencies by
-which promotes and demotes happen is subject to user configuration.
-
-Selected files are migrated between the tiers using existing DHT migration
-logic. The tier translator will leverage DHT rebalance performance
-enhancements.
-
-tier translator:
-
-The tier translator is the root node in tiered volumes. The first subvolume
-is the cold tier, and the second the hot tier. DHT logic for fowarding I/Os
-is largely unchanged. Exceptions are handled according to the dht_methods_t
-structure, which forks control according to DHT or tier type.
-
-The major exception is DHT's layout is not utilized for choosing hashed
-subvolumes. Rather, the hot tier is always the hashed subvolume.
-
-Changes to DHT were made to allow "stacking", i.e. DHT over DHT:
-
-* readdir operations remember the index of the "leaf node" in the volume graph
-(client id), rather than a unique index for each DHT instance.
-
-* Each DHT instance uses a unique extended attribute for tracking migration.
-
-* In certain cases, it is legal for tiered volumes to have unpopulated inodes
-(wheras this would be an error in DHT's case).
-
-Currently tiered volume expansion (adding and removing bricks) is unsupported.
-
-glusterd:
-
-The tiered volume tree is a composition of two other volumes. The glusterd
-daemon builds it. Existing logic for adding and removing bricks is heavily
-leveraged to attach and detach tiers, and perform statistics collection.
-
-
--
cgit