summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorKrutika Dhananjay <kdhananj@redhat.com>2015-05-05 07:57:10 +0530
committerHumble Devassy Chirammal <humble.devassy@gmail.com>2015-05-05 00:16:44 -0700
commit41c3c1c1e4ccc425ac2edb31d7fa970ff464ebdd (patch)
tree9775e8868cc9faeb1869d703c20a6384f2c8e45a
parent25e8e74eb7b81ccd114a9833371a3f72d284c48d (diff)
doc: Feature doc for shard translator
Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20 Change-Id: I127685bacb6cf924bb41a3b782ade1460afb91d6 BUG: 1200082 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/10537 Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com> Tested-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
-rw-r--r--doc/features/shard.md68
1 files changed, 68 insertions, 0 deletions
diff --git a/doc/features/shard.md b/doc/features/shard.md
new file mode 100644
index 00000000000..3588531a2b4
--- /dev/null
+++ b/doc/features/shard.md
@@ -0,0 +1,68 @@
+### Sharding xlator (Stripe 2.0)
+
+GlusterFS's answer to very large files (those which can grow beyond a
+single brick) has never been clear. There is a stripe xlator which allows you to
+do that, but that comes at a cost of flexibility - you can add servers only in
+multiple of stripe-count x replica-count, mixing striped and unstriped files is
+not possible in an "elegant" way. This also happens to be a big limiting factor
+for the big data/Hadoop use case where super large files are the norm (and where
+you want to split a file even if it could fit within a single server.)
+
+The proposed solution for this is to replace the current stripe xlator with a
+new Shard xlator. Unlike the stripe xlator, Shard is not a cluster xlator. It is
+placed on top of DHT. Initially all files will be created as normal files, even
+up to a certain configurable size. The first block (default 4MB) will be stored
+like a normal file. However further blocks will be stored in a file, named by
+the GFID and block index in a separate namespace (like /.shard/GFID1.1,
+/.shard/GFID1.2 ... /.shard/GFID1.N). File IO happening to a particular offset
+will write to the appropriate "piece file", creating it if necessary. The
+aggregated file size and block count will be stored in the xattr of the original
+(first block) file.
+
+The advantage of such a model:
+
+- Data blocks are distributed by DHT in a "normal way".
+- Adding servers can happen in any number (even one at a time) and DHT's
+ rebalance will spread out the "piece files" evenly.
+- Self-healing of a large file is now more distributed into smaller files across
+ more servers.
+- piece file naming scheme is immune to renames and hardlinks.
+
+Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20
+
+## Usage:
+
+Shard translator is disabled by default. To enable it on a given volume, execute
+<code>
+gluster volume set <VOLNAME> features.shard on
+</code>
+
+The default shard block size is 4MB. To modify it, execute
+<code>
+gluster volume set <VOLNAME> features.shard-block-size <value>
+</code>
+
+When a file is created in a volume with sharding disabled, its block size is
+persisted in its xattr on the first block. This property of the file will remain
+even if the shard-block-size for the volume is reconfigured later.
+
+If you want to disable sharding on a volume, it is advisable to create a new
+volume without sharding and copy out contents of this volume into the new
+volume.
+
+## Note:
+* Shard translator is still a beta feature in 3.7.0 and will be possibly fully
+ supported in one of the 3.7.x releases.
+* It is advisable to use shard translator in volumes with replication enabled
+ for fault tolerance.
+
+## TO-DO:
+* Complete implementation of zerofill, discard and fallocate fops.
+* Introduce caching and its invalidation within shard translator to store size
+ and block count of shard'ed files.
+* Make shard translator work for non-Hadoop and non-VM use cases where there are
+ multiple clients operating on the same file.
+* Serialize appending writes.
+* Manage recovery of size and block count better in the face of faults during
+ ongoing inode write fops.
+* Anything else that could crop up later :)