From 490b791f44135db72cba1d8df9b40a66b457bff2 Mon Sep 17 00:00:00 2001
From: shishir gowda <sgowda@redhat.com>
Date: Wed, 10 Apr 2013 10:42:25 +0530
Subject: dht: improve transform/detransform of d_off (and be ext4 safe)

Backporting  Avati's fix http://review.gluster.org/4711

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is borken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ NOT tolerant to
  seekdir() to "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global % N)

   brick_id = d_off_global / N

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)

    brick_id = global_d_off & low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
    -- call it "hash hedging", if you will :-)

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown <zab@redhat.com>
BUG: 838784
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-on: http://review.gluster.org/4822
---
 xlators/cluster/dht/src/dht-helper.c | 87 +++++++++++++++++++++++++++++++++---
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/xlators/cluster/dht/src/dht-helper.c b/xlators/cluster/dht/src/dht-helper.c
index 920a7aabc50..d866a2cd17f 100644
--- a/xlators/cluster/dht/src/dht-helper.c
+++ b/xlators/cluster/dht/src/dht-helper.c
@@ -40,6 +40,43 @@ dht_frame_return (call_frame_t *frame)
 }
 
 
+static uint64_t
+dht_bits_for (uint64_t num)
+{
+	uint64_t bits = 0, ctrl = 1;
+
+	while (ctrl < num) {
+		ctrl *= 2;
+		bits ++;
+	}
+
+	return bits;
+}
+
+/*
+ * A slightly "updated" version of the algorithm described in the commit log
+ * is used here.
+ *
+ * The only enhancement is that:
+ *
+ * - The number of bits used by the backend filesystem for HUGE d_off which
+ *   is described as 63, and
+ * - The number of bits used by the d_off presented by the transformation
+ *   upwards which is described as 64, are both made "configurable."
+ */
+
+
+#define BACKEND_D_OFF_BITS 63
+#define PRESENT_D_OFF_BITS 63
+
+#define ONE 1ULL
+#define MASK (~0ULL)
+#define PRESENT_MASK (MASK >> (64 - PRESENT_D_OFF_BITS))
+#define BACKEND_MASK (MASK >> (64 - BACKEND_D_OFF_BITS))
+
+#define TOP_BIT (ONE << (PRESENT_D_OFF_BITS - 1))
+#define SHIFT_BITS (max (0, (BACKEND_D_OFF_BITS - PRESENT_D_OFF_BITS + 1)))
+
 int
 dht_itransform (xlator_t *this, xlator_t *subvol, uint64_t x, uint64_t *y_p)
 {
@@ -47,6 +84,9 @@ dht_itransform (xlator_t *this, xlator_t *subvol, uint64_t x, uint64_t *y_p)
         int         cnt = 0;
         int         max = 0;
         uint64_t    y = 0;
+        uint64_t    hi_mask = 0;
+        uint64_t    off_mask = 0;
+        int         max_bits = 0;
 
         if (x == ((uint64_t) -1)) {
                 y = (uint64_t) -1;
@@ -60,7 +100,23 @@ dht_itransform (xlator_t *this, xlator_t *subvol, uint64_t x, uint64_t *y_p)
         max = conf->subvolume_cnt;
         cnt = dht_subvol_cnt (this, subvol);
 
-        y = ((x * max) + cnt);
+	if (max == 1) {
+		y = x;
+		goto out;
+	}
+
+        max_bits = dht_bits_for (max);
+
+        hi_mask = ~(PRESENT_MASK >> (max_bits + 1));
+
+        if (x & hi_mask) {
+                /* HUGE d_off */
+                off_mask = MASK << max_bits;
+                y = TOP_BIT | ((x >> SHIFT_BITS) & off_mask) | cnt;
+        } else {
+                /* small d_off */
+                y = ((x * max) + cnt);
+        }
 
 out:
         if (y_p)
@@ -137,16 +193,38 @@ dht_deitransform (xlator_t *this, uint64_t y, xlator_t **subvol_p,
         int         max = 0;
         uint64_t    x = 0;
         xlator_t   *subvol = 0;
+        int         max_bits = 0;
+        uint64_t    off_mask = 0;
+        uint64_t    host_mask = 0;
 
         if (!this->private)
-                goto out;
+                return -1;
 
         conf = this->private;
         max = conf->subvolume_cnt;
 
-        cnt = y % max;
-        x   = y / max;
+	if (max == 1) {
+		x = y;
+		cnt = 0;
+		goto out;
+	}
 
+        if (y & TOP_BIT) {
+                /* HUGE d_off */
+                max_bits = dht_bits_for (max);
+                off_mask = (MASK << max_bits);
+                host_mask = ~(off_mask);
+
+                x = ((y & ~TOP_BIT) & off_mask) << SHIFT_BITS;
+
+                cnt = y & host_mask;
+	} else {
+                /* small d_off */
+                cnt = y % max;
+                x = y / max;
+        }
+
+out:
         subvol = conf->subvolumes[cnt];
 
         if (subvol_p)
@@ -155,7 +233,6 @@ dht_deitransform (xlator_t *this, uint64_t y, xlator_t **subvol_p,
         if (x_p)
                 *x_p = x;
 
-out:
         return 0;
 }
 
-- 
cgit