<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/xlators, branch v3.4.0alpha3</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>glusterd: big lock - a coarse-grained locking to prevent races</title>
<updated>2013-04-17T12:48:50+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-15T10:26:57+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=92729add67e2e7b8c7589c2dfab0bde071a7faf2'/>
<id>92729add67e2e7b8c7589c2dfab0bde071a7faf2</id>
<content type='text'>
There are primarily three lists that are part of glusterd process,
that are concurrently accessed. Namely, priv-&gt;volumes, priv-&gt;peers
and volinfo-&gt;bricks_list.

Big-lock approach
-----------------
WHAT IS IT?
Big lock is a coarse-grained lock which protects all three
lists, mentioned above, from racy access.

HOW DOES IT WORK?
At any given point in time, glusterd's thread(s) are in execution
_iff_ there is a preceding, inbound network event. Of course, the
sigwaiter thread and timer thread are exceptions.
A network event is an external trigger to glusterd, via the epoll
thread, in the form of POLLIN and POLLERR.
As long as we take the big-lock at all such entry points and yield
it when we are done, we are guaranteed that all the network events,
accessing the global lists, are serialised.

This amounts to holding the big lock at
- all the handlers of all the actors in glusterd. (POLLIN)
- all the cbks in glusterd. (POLLIN)
- rpc_notify (DISCONNECT event), if we access/modify
  one of the three lists. (POLLERR)

In the case of synctask'ized volume operations, we must remember that,
if we held the big lock for the entire duration of the handler,
we may block other non-synctask rpc actors from executing.
For eg, volume-start would block in PMAP SIGNIN, if done incorrectly.
To prevent this, we need to yield the big lock, when we yield the
synctask, and reacquire on waking up of the synctask.

BUG: 948686
Change-Id: I429832f1fed67bcac0813403d58346558a403ce9
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4835
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There are primarily three lists that are part of glusterd process,
that are concurrently accessed. Namely, priv-&gt;volumes, priv-&gt;peers
and volinfo-&gt;bricks_list.

Big-lock approach
-----------------
WHAT IS IT?
Big lock is a coarse-grained lock which protects all three
lists, mentioned above, from racy access.

HOW DOES IT WORK?
At any given point in time, glusterd's thread(s) are in execution
_iff_ there is a preceding, inbound network event. Of course, the
sigwaiter thread and timer thread are exceptions.
A network event is an external trigger to glusterd, via the epoll
thread, in the form of POLLIN and POLLERR.
As long as we take the big-lock at all such entry points and yield
it when we are done, we are guaranteed that all the network events,
accessing the global lists, are serialised.

This amounts to holding the big lock at
- all the handlers of all the actors in glusterd. (POLLIN)
- all the cbks in glusterd. (POLLIN)
- rpc_notify (DISCONNECT event), if we access/modify
  one of the three lists. (POLLERR)

In the case of synctask'ized volume operations, we must remember that,
if we held the big lock for the entire duration of the handler,
we may block other non-synctask rpc actors from executing.
For eg, volume-start would block in PMAP SIGNIN, if done incorrectly.
To prevent this, we need to yield the big lock, when we yield the
synctask, and reacquire on waking up of the synctask.

BUG: 948686
Change-Id: I429832f1fed67bcac0813403d58346558a403ce9
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4835
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>glusterd: Fixed spurious wakeups in glusterd syncops</title>
<updated>2013-04-17T12:46:43+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-10T11:42:01+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=47c118e22d9d6fb6662fe96841ed4fe3089739b5'/>
<id>47c118e22d9d6fb6662fe96841ed4fe3089739b5</id>
<content type='text'>
glusterd syncops perform a barrier_wake whenever rpc_clnt_submit returned -1.
This is based on the wrong assumption that the cbkfn wasn't called.
This would result in one more wakeup than there ought to be.

BUG: 948686
Change-Id: I839fd218a81255fe50c2047d67461d45360e894d
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4834
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
glusterd syncops perform a barrier_wake whenever rpc_clnt_submit returned -1.
This is based on the wrong assumption that the cbkfn wasn't called.
This would result in one more wakeup than there ought to be.

BUG: 948686
Change-Id: I839fd218a81255fe50c2047d67461d45360e894d
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4834
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>glusterd: fix segfault on volume status detail</title>
<updated>2013-04-16T16:43:23+00:00</updated>
<author>
<name>Lars Ellenberg</name>
<email>lars@linbit.com</email>
</author>
<published>2013-03-01T23:59:15+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=4c8bb7c4b0471fe2a5095639f0fd44f50ba28dc8'/>
<id>4c8bb7c4b0471fe2a5095639f0fd44f50ba28dc8</id>
<content type='text'>
If for some reason glusterd_get_brick_root() fails,
it frees the gf_strdup'ed *mount_point in its own error path,
and returns -1.

Unfortunately it already had assigned that pointer value
to the output argument, the caller function
glusterd_add_brick_detail() sees a non-NULL pointer,
and free() again: segfault.

Could be fixed with a one-liner (*mount_point = NULL)
in the error path, but I think glusterd_get_brick_root()
should only assign to the output argument once all checks passed,
so I use a local temporary pointer, which increases the patch a bit.

Change-Id: I3f3035f01e80a5e9bdf2da895e4cf7baa3dfbd2f
BUG: 919352
Signed-off-by: Lars Ellenberg &lt;lars@linbit.com&gt;
Reviewed-on: http://review.gluster.org/4646
Reviewed-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-on: http://review.gluster.org/4841
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If for some reason glusterd_get_brick_root() fails,
it frees the gf_strdup'ed *mount_point in its own error path,
and returns -1.

Unfortunately it already had assigned that pointer value
to the output argument, the caller function
glusterd_add_brick_detail() sees a non-NULL pointer,
and free() again: segfault.

Could be fixed with a one-liner (*mount_point = NULL)
in the error path, but I think glusterd_get_brick_root()
should only assign to the output argument once all checks passed,
so I use a local temporary pointer, which increases the patch a bit.

Change-Id: I3f3035f01e80a5e9bdf2da895e4cf7baa3dfbd2f
BUG: 919352
Signed-off-by: Lars Ellenberg &lt;lars@linbit.com&gt;
Reviewed-on: http://review.gluster.org/4646
Reviewed-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-on: http://review.gluster.org/4841
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>glusterd: allow multiple instances of glusterd on one machine</title>
<updated>2013-04-16T11:48:58+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-16T05:06:31+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=29d7563416c0d94cf36d7e05493332aacebfa0e0'/>
<id>29d7563416c0d94cf36d7e05493332aacebfa0e0</id>
<content type='text'>
This is needed to support automated testing of cluster-communication
features such as probing and quorum.  In order to use this, you need to
do the following preparatory steps.

* Copy /var/lib/glusterd to another directory for each virtual host

* Ensure that each virtual host has a different UUID in its glusterd.info

Now you can start each copy of glusterd with the following xlator-options.

* management.transport.socket.bind-address=$ip_address

* management.working-directory=$unique_working_directory

You can use 127.x.y.z addresses for binding without needing to assign
them to interfaces explicitly.  Note that you must use addresses, not
names, because of some stuff in the socket code that's not worth fixing
just for this usage, but after that you can use names in /etc/hosts
instead.

At this point you can issue CLI commands to a specific glusterd using
the --remote-host option.  So far probe, volume create/start/stop,
mount, and basic I/O all seem to work as expected with multiple
instances.

Change-Id: I1beabb44cff8763d2774bc208b2ffcda27c1a550
BUG: 913555
Original-author: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4838
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is needed to support automated testing of cluster-communication
features such as probing and quorum.  In order to use this, you need to
do the following preparatory steps.

* Copy /var/lib/glusterd to another directory for each virtual host

* Ensure that each virtual host has a different UUID in its glusterd.info

Now you can start each copy of glusterd with the following xlator-options.

* management.transport.socket.bind-address=$ip_address

* management.working-directory=$unique_working_directory

You can use 127.x.y.z addresses for binding without needing to assign
them to interfaces explicitly.  Note that you must use addresses, not
names, because of some stuff in the socket code that's not worth fixing
just for this usage, but after that you can use names in /etc/hosts
instead.

At this point you can issue CLI commands to a specific glusterd using
the --remote-host option.  So far probe, volume create/start/stop,
mount, and basic I/O all seem to work as expected with multiple
instances.

Change-Id: I1beabb44cff8763d2774bc208b2ffcda27c1a550
BUG: 913555
Original-author: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4838
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>license: xlators/protocol/server dual license GPLv2 and LGPLv3+</title>
<updated>2013-04-12T17:06:20+00:00</updated>
<author>
<name>Kaleb S. KEITHLEY</name>
<email>kkeithle@redhat.com</email>
</author>
<published>2013-04-12T13:08:49+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=9b28ecc98dd5c773bf078e2e4d4752fa10c0daba'/>
<id>9b28ecc98dd5c773bf078e2e4d4752fa10c0daba</id>
<content type='text'>
cherry-pick from:
 refs/changes/16/4816/1; http://review.gluster.org/#/c/4816/

BUG: 951551
Change-Id: I3de5bd86d4238a60a0a85ba2e15d9c131969b210
Signed-off-by: Kaleb S. KEITHLEY &lt;kkeithle@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4817
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
cherry-pick from:
 refs/changes/16/4816/1; http://review.gluster.org/#/c/4816/

BUG: 951551
Change-Id: I3de5bd86d4238a60a0a85ba2e15d9c131969b210
Signed-off-by: Kaleb S. KEITHLEY &lt;kkeithle@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4817
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>glusterd: Fixed volume-sync in synctask codepath.</title>
<updated>2013-04-12T12:59:05+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-03-05T08:41:02+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=c88fa6d979900e7ee32bd13445bcf4a1dcf3f617'/>
<id>c88fa6d979900e7ee32bd13445bcf4a1dcf3f617</id>
<content type='text'>
Change-Id: I2911d3ac80825310f84c5ba6bd7890e65e1ee219
BUG: 950048
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4643
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Change-Id: I2911d3ac80825310f84c5ba6bd7890e65e1ee219
BUG: 950048
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4643
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Preserve mtime in self-heal</title>
<updated>2013-04-12T07:22:01+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2013-03-08T09:50:22+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=38c53e13e0eab7c8e5b10b71b2be20d4ef4ebc5a'/>
<id>38c53e13e0eab7c8e5b10b71b2be20d4ef4ebc5a</id>
<content type='text'>
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.

Fix:
Set mtime from 'sources before syncing'.

Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-on: http://review.gluster.org/4663
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.

Fix:
Set mtime from 'sources before syncing'.

Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-on: http://review.gluster.org/4663
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/distribute: Ignore non-participating subvols for layout checks</title>
<updated>2013-04-11T17:41:01+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-04-04T05:53:08+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=a8b6cf1b64de7e03652c05ecd8d63b73bbd2523e'/>
<id>a8b6cf1b64de7e03652c05ecd8d63b73bbd2523e</id>
<content type='text'>
Backporting fix http://review.gluster.org/#/c/4668/

When subvols-per-directory is &lt; available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts, which have err == 0, and start == stop.
In the current scenario (start == stop == 0).

Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.

For any other valid subvoles, err != 0 in case of layouts being zeroed out.

Also reverted back dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.

BUG: 921408
Change-Id: I75a8edcb92af5b53b3253c9addd7a812e9242836
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4800
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Backporting fix http://review.gluster.org/#/c/4668/

When subvols-per-directory is &lt; available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts, which have err == 0, and start == stop.
In the current scenario (start == stop == 0).

Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.

For any other valid subvoles, err != 0 in case of layouts being zeroed out.

Also reverted back dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.

BUG: 921408
Change-Id: I75a8edcb92af5b53b3253c9addd7a812e9242836
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4800
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>dht: improve transform/detransform of d_off (and be ext4 safe)</title>
<updated>2013-04-11T17:40:44+00:00</updated>
<author>
<name>shishir gowda</name>
<email>sgowda@redhat.com</email>
</author>
<published>2013-04-10T05:12:25+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=2a734f92c4f2797523aaf2ec2803ea88382ec1d6'/>
<id>2a734f92c4f2797523aaf2ec2803ea88382ec1d6</id>
<content type='text'>
Backporting  Avati's fix http://review.gluster.org/4711

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is borken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ NOT tolerant to
  seekdir() to "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask &gt;&gt; (n + 1))

   if ((hi_mask &amp; d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global &amp; 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global % N)

   brick_id = d_off_global / N

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff &lt;&lt; n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick &amp; low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:

    hi_mask = (0xffffffffffffffff &lt;&lt; n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off &amp; hi_mask &amp; 0x7fffffffffffffff)

    brick_id = global_d_off &amp; low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global &gt;&gt; 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
    -- call it "hash hedging", if you will :-)

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown &lt;zab@redhat.com&gt;
BUG: 838784
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Backporting  Avati's fix http://review.gluster.org/4711

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is borken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ NOT tolerant to
  seekdir() to "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask &gt;&gt; (n + 1))

   if ((hi_mask &amp; d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global &amp; 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global % N)

   brick_id = d_off_global / N

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff &lt;&lt; n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick &amp; low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:

    hi_mask = (0xffffffffffffffff &lt;&lt; n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off &amp; hi_mask &amp; 0x7fffffffffffffff)

    brick_id = global_d_off &amp; low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global &gt;&gt; 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
    -- call it "hash hedging", if you will :-)

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown &lt;zab@redhat.com&gt;
BUG: 838784
Signed-off-by: shishir gowda &lt;sgowda@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mgmt/glusterd: Start fs-crawl in separate thread so as not to block epoll</title>
<updated>2013-03-19T17:29:59+00:00</updated>
<author>
<name>Varun Shastry</name>
<email>vshastry@redhat.com</email>
</author>
<published>2013-03-14T13:26:18+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=e43c8fe9ee9e782ebed8aafdb15310edde587ba0'/>
<id>e43c8fe9ee9e782ebed8aafdb15310edde587ba0</id>
<content type='text'>
tests/basic/quota.t covers test case for this.

Patch is only for 3.4 branch, http://review.gluster.org/4495 fixes the issue
in master.

Change-Id: I92674f5413441cc896245d5b3d0925f44ce8b2d3
BUG: 919998
Signed-off-by: Varun Shastry &lt;vshastry@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4680
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tests/basic/quota.t covers test case for this.

Patch is only for 3.4 branch, http://review.gluster.org/4495 fixes the issue
in master.

Change-Id: I92674f5413441cc896245d5b3d0925f44ce8b2d3
BUG: 919998
Signed-off-by: Varun Shastry &lt;vshastry@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4680
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
