<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/tests/basic/afr, branch release-4.1</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>cluster/afr: Delegate name-heal when possible</title>
<updated>2018-09-21T13:27:00+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2018-08-27T07:10:16+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=a642f594a00e943e1fa1e121a3f4d331fed1c70b'/>
<id>a642f594a00e943e1fa1e121a3f4d331fed1c70b</id>
<content type='text'>
Problem:
When name-self-heal is triggered on the mount, it blocks
lookup until name-self-heal completes. But that can lead
to hangs when lot of clients are accessing a directory which
needs name heal and all of them trigger heals waiting
for other clients to complete heal.

Fix:
When a name-heal is needed but quorum number of names have the
file and pending xattrs exist on the parent, then better to
delegate the heal to SHD which will be completed as part of
entry-heal of the parent directory. We could also do the same
for quorum-number of names not present but we don't have
any known use-case where this is a frequent occurrence so
not changing that part at the moment. When there is a gfid
mismatch or missing gfid it is important to complete the heal
so that next rename doesn't assume everything is fine and
perform a rename etc

fixes bz#1625575
Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
When name-self-heal is triggered on the mount, it blocks
lookup until name-self-heal completes. But that can lead
to hangs when lot of clients are accessing a directory which
needs name heal and all of them trigger heals waiting
for other clients to complete heal.

Fix:
When a name-heal is needed but quorum number of names have the
file and pending xattrs exist on the parent, then better to
delegate the heal to SHD which will be completed as part of
entry-heal of the parent directory. We could also do the same
for quorum-number of names not present but we don't have
any known use-case where this is a frequent occurrence so
not changing that part at the moment. When there is a gfid
mismatch or missing gfid it is important to complete the heal
so that next rename doesn't assume everything is fine and
perform a rename etc

fixes bz#1625575
Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Delegate metadata heal with pending xattrs to SHD</title>
<updated>2018-09-21T13:27:00+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2018-08-27T06:16:33+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=00dae0e2ea1260a082af33dac5b06ca6aa68ccc3'/>
<id>00dae0e2ea1260a082af33dac5b06ca6aa68ccc3</id>
<content type='text'>
Problem:
When metadata-self-heal is triggered on the mount, it blocks
lookup until metadata-self-heal completes. But that can lead
to hangs when lot of clients are accessing a directory which
needs metadata heal and all of them trigger heals waiting
for other clients to complete heal.

Fix:
Only when the heal is needed but the pending xattrs are not set,
trigger metadata heal that could block lookup. This is the only
case where different clients may give different metadata to the
clients without heals, which should be avoided.

Updates bz#1625575
Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
When metadata-self-heal is triggered on the mount, it blocks
lookup until metadata-self-heal completes. But that can lead
to hangs when lot of clients are accessing a directory which
needs metadata heal and all of them trigger heals waiting
for other clients to complete heal.

Fix:
Only when the heal is needed but the pending xattrs are not set,
trigger metadata heal that could block lookup. This is the only
case where different clients may give different metadata to the
clients without heals, which should be avoided.

Updates bz#1625575
Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: add new value for read-hash-mode volume option</title>
<updated>2018-03-29T07:37:04+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2018-03-22T12:25:15+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=c87bd439ef12adc70dc580e75304121c3cd38e9a'/>
<id>c87bd439ef12adc70dc580e75304121c3cd38e9a</id>
<content type='text'>
Updates: #363

This new value (3) will try to wind read requests to the child of AFR
having the least amount of pending requests in its queue.

Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Updates: #363

This new value (3) will try to wind read requests to the child of AFR
having the least amount of pending requests in its queue.

Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Switch to active-fd-count for open-fd checks</title>
<updated>2018-03-21T05:06:32+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2018-03-19T09:56:40+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=448dec703d603a150dc1f1cc231c8389ab2fb2ea'/>
<id>448dec703d603a150dc1f1cc231c8389ab2fb2ea</id>
<content type='text'>
BUG: 1557932
Change-Id: I3783e41b3812267bc10c0d05d062a31396ce135b
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
BUG: 1557932
Change-Id: I3783e41b3812267bc10c0d05d062a31396ce135b
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Remove compound-fops usage in afr</title>
<updated>2018-03-06T07:55:25+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2018-03-02T04:43:20+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=9be043159a514db68b336c6aea49613f3125c5e0'/>
<id>9be043159a514db68b336c6aea49613f3125c5e0</id>
<content type='text'>
We are not seeing much improvement with this change. So removing the
feature so that it doesn't need to be maintained anymore.

Fixes: #414
Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We are not seeing much improvement with this change. So removing the
feature so that it doesn't need to be maintained anymore.

Fixes: #414
Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tests: bring option of per test timeout</title>
<updated>2018-02-15T08:20:19+00:00</updated>
<author>
<name>Amar Tumballi</name>
<email>amarts@redhat.com</email>
</author>
<published>2018-01-18T17:54:04+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=9d0d1fdd091d754149242fd4389b964695aacf13'/>
<id>9d0d1fdd091d754149242fd4389b964695aacf13</id>
<content type='text'>
This uses 'timeout' command with 300 seconds default. Right now,
there is just 1 test which takes more than that in a properly
setup machine.

Ideally best case is set the default to something like 30 seconds,
and if a test is supposed to take more than that, owner should add
a timeout line to test knowingly. That way, it makes test writers
think about a time limit too.

Change-Id: I747005ce1f208aeb2ecbf899e8feea487ecd21a0
Signed-off-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This uses 'timeout' command with 300 seconds default. Right now,
there is just 1 test which takes more than that in a properly
setup machine.

Ideally best case is set the default to something like 30 seconds,
and if a test is supposed to take more than that, owner should add
a timeout line to test knowingly. That way, it makes test writers
think about a time limit too.

Change-Id: I747005ce1f208aeb2ecbf899e8feea487ecd21a0
Signed-off-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: don't treat all cases all bricks being blamed as split-brain</title>
<updated>2018-02-01T14:17:50+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2018-01-28T08:20:47+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=0e6e8216823c2d9dafb81aae0f6ee3497c23d140'/>
<id>0e6e8216823c2d9dafb81aae0f6ee3497c23d140</id>
<content type='text'>
Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not
met. Though the FOP is still unwound with failure, the xattrs remain on
the disk.  Due to these partial post-ops and partial heals (healing only when
2 bricks are up), we can end up in split-brain purely from the afr
xattrs point of view i.e each brick is blamed by atleast one of the
others. These scenarios are hit when there is frequent
connect/disconnect of the client/shd to the bricks while I/O or heal
are in progress.

Fix:
Instead of undoing the post-op, pick a source based on the xattr values.
If 2 bricks blame one, the blamed one must be treated as sink.
If there is no majority, all are sources. Once we pick a source,
self-heal will then do the heal instead of erroring out due to
split-brain.

Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae
BUG: 1539358
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not
met. Though the FOP is still unwound with failure, the xattrs remain on
the disk.  Due to these partial post-ops and partial heals (healing only when
2 bricks are up), we can end up in split-brain purely from the afr
xattrs point of view i.e each brick is blamed by atleast one of the
others. These scenarios are hit when there is frequent
connect/disconnect of the client/shd to the bricks while I/O or heal
are in progress.

Fix:
Instead of undoing the post-op, pick a source based on the xattr values.
If 2 bricks blame one, the blamed one must be treated as sink.
If there is no majority, all are sources. Once we pick a source,
self-heal will then do the heal instead of erroring out due to
split-brain.

Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae
BUG: 1539358
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tests: Renable basic/afr/split-brain-favorite-child-policy.t</title>
<updated>2017-11-29T08:51:28+00:00</updated>
<author>
<name>Nigel Babu</name>
<email>nigelb@redhat.com</email>
</author>
<published>2017-11-29T07:15:15+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=e05d8eb96b9f06c39660caa9a613d6064f46f3f2'/>
<id>e05d8eb96b9f06c39660caa9a613d6064f46f3f2</id>
<content type='text'>
This test was failing due to an infra issue. The infra issue is now
fixed.

BUG: 1517961
Change-Id: I09dfab9c0a3ebe73c738222e6269d9e35c85eddb
Signed-off-by: Nigel Babu &lt;nigelb@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This test was failing due to an infra issue. The infra issue is now
fixed.

BUG: 1517961
Change-Id: I09dfab9c0a3ebe73c738222e6269d9e35c85eddb
Signed-off-by: Nigel Babu &lt;nigelb@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tests: mark currently failing regression tests as known issues</title>
<updated>2017-11-28T13:04:43+00:00</updated>
<author>
<name>Amar Tumballi</name>
<email>amarts@redhat.com</email>
</author>
<published>2017-11-27T18:26:50+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=858fae39936e5aee5ea4e3816a10ba310d04cf61'/>
<id>858fae39936e5aee5ea4e3816a10ba310d04cf61</id>
<content type='text'>
Change-Id: If6c36dc6c395730dfb17b5b4df6f24629d904926
BUG: 1517961
Signed-off-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Change-Id: If6c36dc6c395730dfb17b5b4df6f24629d904926
BUG: 1517961
Signed-off-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: add checks for allowing lookups</title>
<updated>2017-11-18T00:38:47+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2017-08-16T12:31:17+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4'/>
<id>bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4</id>
<content type='text'>
Problem:
In an arbiter volume, lookup was being served from one of the sink
bricks (source brick was down). shard uses the iatt values from lookup cbk
to calculate the size and block count, which in this case were incorrect
values. shard_local_t-&gt;last_block was thus initialised to -1, resulting
in an infinite while loop in shard_common_resolve_shards().

Fix:
Use client quorum logic to allow or fail the lookups from afr if there
are no readable subvolumes. So in replica-3 or arbiter vols, if there is
no good copy or if quorum is not met, fail lookup with ENOTCONN.

With this fix, we are also removing support for quorum-reads xlator
option. So if quorum is not met, neither read nor write txns are allowed
and we fail the fop with ENOTCONN.

Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0
BUG: 1467250
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
In an arbiter volume, lookup was being served from one of the sink
bricks (source brick was down). shard uses the iatt values from lookup cbk
to calculate the size and block count, which in this case were incorrect
values. shard_local_t-&gt;last_block was thus initialised to -1, resulting
in an infinite while loop in shard_common_resolve_shards().

Fix:
Use client quorum logic to allow or fail the lookups from afr if there
are no readable subvolumes. So in replica-3 or arbiter vols, if there is
no good copy or if quorum is not met, fail lookup with ENOTCONN.

With this fix, we are also removing support for quorum-reads xlator
option. So if quorum is not met, neither read nor write txns are allowed
and we fail the fop with ENOTCONN.

Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0
BUG: 1467250
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
