Commit messages

---

Open-behind was not keeping any reference on fds pending to be
opened. This made it possible for a concurrent close and an entry
fop (unlink, rename, ...) to destroy the fd while it was still
being used.

Change-Id: Ie9e992902cf2cd7be4af1f8b4e57af9bd6afd8e9
Fixes: #1028
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
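The underlying fix pattern here is plain reference counting: the pending open must hold its own reference so a concurrent close can never be the last one. A minimal standalone sketch (hypothetical names and types, not the actual open-behind code):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for an fd object shared between threads. */
    struct my_fd {
        pthread_mutex_t lock;
        int refcount;
    };

    static struct my_fd *fd_ref(struct my_fd *fd)
    {
        pthread_mutex_lock(&fd->lock);
        fd->refcount++;
        pthread_mutex_unlock(&fd->lock);
        return fd;
    }

    static void fd_unref(struct my_fd *fd)
    {
        int destroy;

        pthread_mutex_lock(&fd->lock);
        destroy = (--fd->refcount == 0);
        pthread_mutex_unlock(&fd->lock);
        if (destroy) {
            pthread_mutex_destroy(&fd->lock);
            free(fd); /* only the last reference destroys the fd */
        }
    }

    int main(void)
    {
        struct my_fd *fd = calloc(1, sizeof(*fd));

        pthread_mutex_init(&fd->lock, NULL);
        fd->refcount = 1;   /* creator's reference */
        fd_ref(fd);         /* the pending open takes its own reference */
        fd_unref(fd);       /* a concurrent close drops the creator's ref */
        fd_unref(fd);       /* fd survives until the open path releases it */
        printf("fd destroyed only after the pending open finished\n");
        return 0;
    }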
---

A crash is seen during a reattempt to clean up shards in the
background upon remount, and it recurs on every remount (which means
a remount is no workaround for the crash).
In such a situation, the in-memory base inode object will not exist
(new process, non-existent base shard), so local->resolver_base_inode
will be NULL.
In the event of an error (in this case, of space running out), the
process would crash while logging the error in the following line:

    gf_msg(this->name, GF_LOG_ERROR, local->op_errno, SHARD_MSG_FOP_FAILED,
           "failed to delete shards of %s",
           uuid_utoa(local->resolver_base_inode->gfid));

Fixed that by using local->base_gfid as the source of the gfid when
local->resolver_base_inode is NULL.

Change-Id: I0b49f2b58becd0d8874b3d4b14ff8d92a89d02d5
Fixes: #1127
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit cc43ac8651de9aa508b01cb259b43c02d89b2afc)
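The shape of the fix is a NULL-guarded fallback. A small standalone sketch with stand-in types (the real shard xlator structures are richer):

    #include <stdio.h>
    #include <string.h>

    typedef unsigned char gfid_t[16];

    struct inode { gfid_t gfid; };

    struct shard_local {
        struct inode *resolver_base_inode; /* may be NULL after remount */
        gfid_t base_gfid;                  /* always populated */
    };

    static const unsigned char *shard_gfid_for_logging(struct shard_local *local)
    {
        if (local->resolver_base_inode)
            return local->resolver_base_inode->gfid;
        return local->base_gfid; /* fallback: no dereference of NULL */
    }

    int main(void)
    {
        struct shard_local local = { .resolver_base_inode = NULL };

        memset(local.base_gfid, 0xab, sizeof(local.base_gfid));
        printf("gfid starts with %02x\n", shard_gfid_for_logging(&local)[0]);
        return 0;
    }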
---

There was a problem when self-heal was sending lookups at the same
time that one of the bricks was coming up. In this case there was a
chance that the number of 'up' bricks changed in the middle of sending
the requests to subvolumes, causing a discrepancy between the expected
number of replies and the actual number of sent requests.
This discrepancy caused AFR to continue executing requests before all
requests were complete. Eventually, the frame of the pending request
was destroyed when the operation terminated, causing a use-after-free
when the answer was finally received.
In theory the same thing could happen in reverse, i.e. AFR could wait
for more replies than sent requests, causing a hang.

Backport of:
> Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70
> Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
> Fixes: bz#1808875
Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
Fixes: bz#1809440
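A standalone sketch of the counting discipline the fix implies: take one snapshot of the 'up' bricks and derive the expected reply count from that same snapshot, inside a single critical section (names are hypothetical, not AFR's actual code):

    #include <pthread.h>
    #include <string.h>

    #define CHILD_COUNT 3

    struct frame_local {
        pthread_mutex_t lock;
        int call_count;                     /* replies still expected */
        unsigned char wind_on[CHILD_COUNT]; /* snapshot of 'up' bricks */
    };

    static void wind_requests(struct frame_local *l, const unsigned char *child_up)
    {
        int i, count = 0;

        pthread_mutex_lock(&l->lock);
        /* Snapshot and count come from the same data, atomically, so a
         * brick changing state mid-loop cannot skew the bookkeeping. */
        memcpy(l->wind_on, child_up, CHILD_COUNT);
        for (i = 0; i < CHILD_COUNT; i++)
            count += l->wind_on[i];
        l->call_count = count;
        pthread_mutex_unlock(&l->lock);

        for (i = 0; i < CHILD_COUNT; i++) {
            if (l->wind_on[i]) {
                /* send the request to child i; its cbk decrements
                 * call_count exactly once */
            }
        }
    }

    int main(void)
    {
        struct frame_local l = { .lock = PTHREAD_MUTEX_INITIALIZER };
        unsigned char up[CHILD_COUNT] = { 1, 0, 1 };

        wind_requests(&l, up);
        return (l.call_count == 2) ? 0 : 1;
    }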
---

Problem:
In a hyperconverged setup with granular-entry-heal enabled, if a file
is recreated while one of the bricks is down, and an index heal is
triggered (with the brick still down), entry self-heal was doing a
spurious heal with just the 2 good bricks. It was doing a post-op
leading to removal of the filename from
.glusterfs/indices/entry-changes as well as erroneous setting of afr
xattrs on the parent. When the brick came up, the xattrs were cleared,
resulting in the renamed file not getting healed and leading to gfid
split-brain and EIO on the mount.
Fix:
Proceed with entry heal only when shd can connect to all bricks of the
replica, just like in data and metadata heal.

fixes: #1103
Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)
---

...if pending xattrs are zero for all children.
Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, we can end up with xattrs inadvertently deleted from all
bricks, as explained in the BZ.
Fix:
After picking one among the sources as the good copy, mark pending
xattrs on all sources to blame the sinks. Now even if this metadata
heal fails midway, a subsequent heal will still choose one of the
valid sources that it picked previously.

Updates: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)
---

Problem:
We currently don't have a roll-back/undo of post-ops if quorum is not
met. Though the FOP is still unwound with failure, the xattrs remain
on the disk. Due to these partial post-ops and partial heals (healing
only when 2 bricks are up), we can end up in metadata split-brain
purely from the afr xattrs point of view, i.e. each brick is blamed by
at least one of the others for metadata. These scenarios are hit when
there is frequent connect/disconnect of the client/shd to the bricks.
Fix:
Pick a source based on the xattr values. If 2 bricks blame one, the
blamed one must be treated as a sink. If there is no majority, all are
sources. Once we pick a source, self-heal will then do the heal
instead of erroring out due to split-brain.
This patch also adds the restriction that all bricks must be up to
perform metadata heal, to avoid any metadata loss.
Removed the test case
tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t as it was
doing metadata heal even when only 2 of 3 bricks were up.

Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09
fixes: bz#1806931
Signed-off-by: karthik-us <ksubrahm@redhat.com>
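The source-picking policy described above is easy to state in code. A small standalone illustration (not the actual AFR implementation); blame[i][j] stands for brick i blaming brick j via afr xattrs:

    #include <stdio.h>

    #define BRICKS 3

    static void pick_sources(int blame[BRICKS][BRICKS], int sources[BRICKS])
    {
        int accused[BRICKS] = { 0 };
        int i, j, have_majority = 0;

        for (i = 0; i < BRICKS; i++)
            for (j = 0; j < BRICKS; j++)
                if (i != j && blame[i][j])
                    accused[j]++;

        for (j = 0; j < BRICKS; j++)
            if (accused[j] > BRICKS / 2)
                have_majority = 1;

        /* A majority blames one brick: it becomes the sink. Otherwise,
         * with no majority, all bricks are treated as sources. */
        for (j = 0; j < BRICKS; j++)
            sources[j] = have_majority ? (accused[j] <= BRICKS / 2) : 1;
    }

    int main(void)
    {
        /* B1 and B3 blame B2: B2 becomes the sink. */
        int blame[BRICKS][BRICKS] = { {0,1,0}, {0,0,0}, {0,1,0} };
        int sources[BRICKS], j;

        pick_sources(blame, sources);
        for (j = 0; j < BRICKS; j++)
            printf("brick %d: %s\n", j + 1, sources[j] ? "source" : "sink");
        return 0;
    }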
---

If an open comes on a file when a brick is down, and a fop comes on
the fd after the brick comes up, the client xlator would still wind
the fop on the anon-fd, leading to wrong behavior of the fops in some
cases.
Example:
If an lk fop is issued on the fd just after the brick is up in the
scenario above, the lk fop will be sent on the anon-fd instead of
being failed on that client xlator. This lock will never be freed upon
close of the fd, as flush on an anon-fd is invalid and is not wound
below the server xlator.
As a fix, fail the fop unless the fd has the FALLBACK_TO_ANON_FD flag.

> Change-Id: I77692d056660b2858e323bdabdfe0a381807cccc
> fixes bz#1390914
> Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Change-Id: I77692d056660b2858e323bdabdfe0a381807cccc
fixes bz#1808256
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
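A hedged sketch of such a gate (names, types and the chosen errno are illustrative; the real client xlator logic differs):

    #include <errno.h>

    #define FALLBACK_TO_ANON_FD 0x1
    #define REMOTE_FD_ANONYMOUS (-2)

    struct clnt_fd_ctx {
        int remote_fd; /* < 0 when the fd was never reopened on this brick */
        int flags;
    };

    static int resolve_remote_fd(const struct clnt_fd_ctx *ctx)
    {
        if (ctx->remote_fd >= 0)
            return ctx->remote_fd;      /* fd is valid on this brick */
        if (ctx->flags & FALLBACK_TO_ANON_FD)
            return REMOTE_FD_ANONYMOUS; /* caller explicitly opted in */
        return -EBADF;                  /* otherwise fail, don't wind */
    }

    int main(void)
    {
        struct clnt_fd_ctx stale = { .remote_fd = -1, .flags = 0 };

        return (resolve_remote_fd(&stale) == -EBADF) ? 0 : 1;
    }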
---

Backport of https://review.gluster.org/#/c/glusterfs/+/23288/
...whenever shd is re-enabled after being disabled, or there is a
change in `cluster.heal-timeout`, without needing to restart shd or
waiting for the current `cluster.heal-timeout` seconds to expire.
See BZ 1743988 for more details.

Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe
fixes: bz#1807431
Reported-by: Glen Kiessling <glenk1973@hotmail.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
---

Problem:
ec_getxattr_heal_cbk was called with NULL as its second argument when
heal was failing. This function was dereferencing the "cookie"
argument, which caused a crash.
Solution:
The cookie is changed to carry the value that was supposed to be
stored in fop->data, so even when fop is NULL on the error path there
won't be any NULL dereference.
Thanks to Xavi for the suggestion about the fix.

Change-Id: I0798000d5cadb17c3c2fbfa1baf77033ffc2bb8c
fixes: bz#1805057
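The pattern behind the fix, as a minimal standalone sketch (hypothetical names): carry the needed pointer in the cookie so the callback never touches the possibly-NULL fop:

    #include <stdio.h>

    struct heal_data { int healed; };

    /* cookie carries what used to live in fop->data, so the callback
     * works even on the error path where the fop itself is NULL. */
    static int getxattr_heal_cbk(void *cookie, void *fop, int op_ret)
    {
        struct heal_data *data = cookie;

        if (op_ret < 0)
            printf("heal failed, fop is %s, healed so far: %d\n",
                   fop ? "set" : "NULL", data->healed);
        return 0;
    }

    int main(void)
    {
        struct heal_data data = { .healed = 0 };

        /* Error path: fop is NULL, yet no NULL dereference happens. */
        return getxattr_heal_cbk(&data, NULL, -1);
    }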
---

When discard/truncate performs a write fop, it should do so after
updating lock->good_mask, to make sure readv happens on the correct
mask.

fixes: bz#1805056
Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
---

Problem:
When a file needs to be re-opened, the O_APPEND, O_EXCL and O_CREAT
flags are not filtered in EC.
- O_APPEND should be filtered because EC doesn't send O_APPEND below
EC for open, to make sure writes happen on the individual fragments
instead of at the end of the file.
- O_EXCL should be filtered because shd could have created the file,
so the open should succeed even when the file exists.
- O_CREAT should be filtered because the open happens with the gfid as
parameter, so the open fop would create just the gfid, which will lead
to problems.
Fix:
Filter out these flags in reopen.

Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4
Fixes: bz#1805055
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
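A minimal sketch of the filtering (the helper name is hypothetical; the actual EC reopen path differs):

    #include <fcntl.h>

    /* Clear the flags that must not be replayed when silently
     * re-opening an already existing fragment by gfid. */
    static int ec_fix_reopen_flags(int flags)
    {
        return flags & ~(O_APPEND | O_EXCL | O_CREAT);
    }

    int main(void)
    {
        int flags = O_RDWR | O_CREAT | O_EXCL | O_APPEND;

        return (ec_fix_reopen_flags(flags) == O_RDWR) ? 0 : 1;
    }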
---

There are cases where fop->mask may have fop->healing added, and readv
shouldn't be wound on fop->healing. To avoid this, always wind readv
to lock->good_mask.

fixes: bz#1805054
Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
---

EC doesn't allow concurrent writes on overlapping areas; they are
serialized. However, non-overlapping writes are serviced in parallel.
When a write is not aligned, EC first needs to read the entire chunk
from disk, apply the modified fragment and write it again.
The problem appears on sparse files because a write to an offset
implicitly creates data on offsets below it (so, in some way, they
are overlapping). For example, if a file is empty and we read 10 bytes
from offset 10, read() will return 0 bytes. Now, if we write one byte
at offset 1M and retry the same read, the system call will return 10
bytes (all containing 0's).
So if we have two writes, the first one at offset 10 and the second
one at offset 1M, EC will send both in parallel because they do not
overlap. However, the first one will try to read missing data from the
first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do
the final write. This read will happen in parallel with the write to
1M. What could happen is that half of the bricks process the write
before the read, and the other half do the read before the write. Some
bricks will return 10 bytes of data while the others will return 0
bytes (because the file on the brick has not been expanded yet).
When EC tries to recombine the answers from the bricks, it can't,
because it needs more than half consistent answers to recover the
data. So this read fails with EIO. This error is propagated to the
parent write, which is aborted and EIO is returned to the application.
The issue happened because EC assumed that a write to a given offset
implies that offsets below it exist.
This fix prevents the read of the chunk from bricks if the current
size of the file is smaller than the read chunk offset. This size is
correctly tracked, so this fixes the issue.
Also modified the ec-stripe.t file for Test #13 within it. With this
patch, if the file size is less than the offset we are writing to, we
fill zeros in the head and tail and do not consider it a stripe-cache
miss. That actually makes sense, as we know what data that part holds
and there is no need to read it from the bricks.

Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
Fixes: bz#1805053
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
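A simplified sketch of the size check (hypothetical names; the real EC code works on fragments and a tracked file size): when the chunk to complete lies entirely beyond the current file size, the missing head is known to be zeros and no brick read is issued:

    #include <stdint.h>
    #include <string.h>

    typedef int (*read_fn_t)(char *buf, size_t len, uint64_t offset);

    static void fill_head(char *chunk, size_t head_len, uint64_t chunk_offset,
                          uint64_t file_size, read_fn_t read_from_bricks)
    {
        if (chunk_offset >= file_size) {
            /* Sparse region: contents are zeros by definition, so no
             * read, and no chance of inconsistent brick answers. */
            memset(chunk, 0, head_len);
            return;
        }
        read_from_bricks(chunk, head_len, chunk_offset);
    }

    static int fake_read(char *buf, size_t len, uint64_t offset)
    {
        (void)buf; (void)len; (void)offset;
        return 0;
    }

    int main(void)
    {
        char chunk[16];

        /* Writing at 1M into an empty file: head is zero-filled locally. */
        fill_head(chunk, sizeof(chunk), 1024 * 1024, 0, fake_read);
        return chunk[0];
    }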
---

ec_manager_open/opendir memsets ctx->loc, which causes a memory/inode
leak; and ec_fheal uses ctx->loc outside fd->lock, so loc_copy may
copy bad data while another thread memsets it.
This patch skips updating ctx->loc when it is already initialized.
With it, ctx->loc is filled once and never updated.

Change-Id: I3bf5ffce4caf4c1c667f7acaa14b451d37a3550a
fixes: bz#1805052
Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
---

If the lock has info, the fop should inherit the healing mask from it.
Otherwise, the fop cannot inherit the right healing mask when
changed_flags is zero.

Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f
fixes: bz#1805051
Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
---

Backport of https://review.gluster.org/#/c/glusterfs/+/23506/
The fd processing loops in the dht_migration_complete_check_task and
the dht_rebalance_inprogress_task functions were unsafe and could
cause an open to be sent on an already freed fd. This has been fixed.

Change-Id: I0a3c7d2fba314089e03dfd704f9dceb134749540
Fixes: bz#1804522
Signed-off-by: N Balachandran <nbalacha@redhat.com>
---

Problem:
Race:
Thread-1:
1) Does ec_get_size_version() to perform the pre-op fxattrop as part
of write-1.
Thread-2:
2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2.
This sets dirty[] to 1.
Thread-1:
3) Completes executing ec_prepare_update_cbk, leading to
ctx->dirty[] = '1'.
Thread-2:
4) Takes LOCK(inode->lock) to check if there are any flags and sets
the dirty flag because lock->waiting_flag is 0 now. This leads to the
fxattrop incrementing the on-disk dirty[] to '2'.
At the end of the writes the file will be marked for heal even when it
doesn't need heal.
Fix:
Perform ec_set_dirty_flag() and the other checks inside LOCK() to
prevent dirty[] from being marked as '1' in step 2) above.

Fixes: bz#1805050
Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
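A standalone sketch of the locking discipline in the fix (simplified structure; the real code uses LOCK()/UNLOCK() on inode->lock): the check and the update happen in one critical section, so only the first writer triggers the pre-op fxattrop:

    #include <pthread.h>

    struct ec_inode_ctx {
        pthread_mutex_t lock;
        int dirty;           /* in-memory mirror of the on-disk dirty[] */
        int preop_in_flight; /* stand-in for lock->waiting_flag */
    };

    static int ec_set_dirty_flag(struct ec_inode_ctx *ctx)
    {
        int do_xattrop = 0;

        pthread_mutex_lock(&ctx->lock);
        if (!ctx->dirty && !ctx->preop_in_flight) {
            ctx->dirty = 1;
            ctx->preop_in_flight = 1;
            do_xattrop = 1; /* only the first writer sends the pre-op */
        }
        pthread_mutex_unlock(&ctx->lock);
        return do_xattrop;
    }

    int main(void)
    {
        struct ec_inode_ctx ctx = { .lock = PTHREAD_MUTEX_INITIALIZER };

        /* The second writer piggybacks instead of incrementing the
         * on-disk dirty[] a second time. */
        return (ec_set_dirty_flag(&ctx) == 1 &&
                ec_set_dirty_flag(&ctx) == 0) ? 0 : 1;
    }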
---

Problem:
Doing re-open with O_TRUNC will truncate the fragment even when it is
not needed, requiring extra heals.
Fix:
Don't use O_TRUNC at the time of re-open.

fixes: bz#1805049
Change-Id: Idc6408968efaad897b95a5a52481c66e843d3fb8
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
---

Problem: When one server node (in a 1x3 setup) comes up after a
reboot, the client gets unmounted. The client is unmounted because it
receives an AUTH_FAILED event and calls fini for the graph. The client
gets AUTH_FAILED because the brick is not attached to a graph at that
moment.
Solution: To avoid unmounting the client graph, throw an ENOENT error
from the server if the brick is not attached to the server at the time
of authenticating clients.

> Credits: Xavi Hernandez <xhernandez@redhat.com>
> Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e
> Fixes: bz#1793852
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (cherry picked from commit f6421dff22a6ddaf14134f6894deae219948c89d)
Fixes: bz#1804512
Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
---

Currently EC tries to reopen fds that were opened while a brick was
down. This is done as part of regular write operations, just after
having acquired the locks, and it's sent as a sub-fop of the main
write fop.
There were two problems:
1. The reopen was attempted on all UP bricks, even if a previous lock
didn't succeed. This is incorrect because most probably the open will
fail.
2. If a reopen is sent and fails, the error is propagated to the main
operation, causing it to fail when it shouldn't.
To fix this, we only attempt reopens on bricks where the current fop
owns a lock, and we prevent any error from being propagated to the
main fop.
To implement this behaviour, an argument used to indicate the minimum
number of required answers has been overloaded to also include some
flags. To make the change consistent, it has been necessary to rename
the argument, which means that a lot of files have been changed.
However, there are no functional changes.
This change has also uncovered a problem in the discard code, which
didn't correctly process requests of small sizes because no real
discard fop was being processed, only a write of 0's on some region.
In this case some fields of the fop remained uninitialized or with
incorrect values. To fix this, a new function has been created to
simulate success on a fop, and it's used in the discard case.
Thanks to Pranith for providing a test script that has also detected
an issue in this patch. This patch includes a small modification of
this script to force data to be written into bricks before stopping
them.

Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec
Fixes: bz#1805047
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
---

We need to safeguard against possible zeroing out of the iatt
structure in the acl ctx, which can cause many issues.

Backport of:
> fixes: bz#1668286
> Change-Id: Ie81a57d7453a6624078de3be8c0845bf4d432773
> Signed-off-by: Amar Tumballi <amarts@redhat.com>
(cherry-picked from commit 6bf9637a93011298d032332ca93009ba4e377e46)
fixes: bz#1767305
Change-Id: Ie81a57d7453a6624078de3be8c0845bf4d432773
Signed-off-by: Alan Orth <alan.orth@gmail.com>
---

When an iot_worker terminates, its resources are not reaped, which
consumes a lot of memory.
Detach iot_worker threads to automatically release their resources
back to the system.

Change-Id: I71fabb2940e76ad54dc56b4c41aeeead2644b8bb
fixes: bz#1718734
Signed-off-by: Liguang Li <liguang.lee6@gmail.com>
Signed-off-by: N Balachandran <nbalacha@redhat.com>
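The fix boils down to standard POSIX thread detachment. A minimal sketch (compile with -pthread):

    #include <pthread.h>

    static void *iot_worker(void *arg)
    {
        (void)arg; /* process queued fops, then exit when idle */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t thread;

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (pthread_create(&thread, &attr, iot_worker, NULL) != 0)
            return 1;
        pthread_attr_destroy(&attr);

        /* No pthread_join(): the worker's stack and descriptor are
         * reclaimed automatically when it terminates. */
        pthread_exit(NULL); /* exit main without killing the worker */
    }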
---

Problem:
In a situation where B1 blames B2, B2 blames B1 and B3 doesn't blame
anything for entry heal, the heal will not complete even though we
have a clear source and sinks. This happens because in
afr_selfheal_find_direction() only the bricks which are blamed by
non-accused bricks are considered as sinks. Later in
__afr_selfheal_entry_finalize_source(), when it tries to mark all the
non-sources as sinks, it fails to do so because there won't be any
healed_sinks marked, no witness present, and there will be a source.
Fix:
If there is a source and no healed_sinks, then reset all the locked
sources to 0 and healed sinks to 1 to do a conservative merge.

Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548
Fixes: bz#1760710
Signed-off-by: karthik-us <ksubrahm@redhat.com>
---

We were not passing xattr_req when doing a name self-heal as well as a
metadata heal. Because of this, some xdata was missing, which caused
I/O errors.

Backport of: https://review.gluster.org/#/c/glusterfs/+/23024/
>Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f
>Fixes: bz#1728770
>Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Change-Id: I740ccdbfcf2667fe4a850c12ecdc4b9eeed08293
Fixes: bz#1749352
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
---

Two reasons:
* Ping responses from glusterd may not be relevant for Halo
replication. Instead, it might be interested in only knowing whether
the brick itself is responsive.
* When a brick is killed, propagating GF_EVENT_CHILD_PING for ping
responses from glusterd results in GF_EVENT_DISCONNECT being
spuriously propagated to parent xlators. These DISCONNECT events are
from the connections the client establishes with glusterd as part of
its reconnect logic. Without GF_EVENT_CHILD_PING, the last event
propagated to parent xlators would be the first DISCONNECT event from
the brick, and hence subsequent DISCONNECTs to glusterd are not
propagated, as protocol/client prevents the same event from being
propagated to parent xlators consecutively. Propagating
GF_EVENT_CHILD_PING for ping responses from glusterd would change the
last_sent_event to GF_EVENT_CHILD_PING, and hence protocol/client
cannot prevent subsequent DISCONNECT events.

Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Fixes: bz#1739336
Change-Id: I50276680c52f05ca9e12149a3094923622d6eaef
(cherry picked from commit 5d66eafec581fb3209af74595784be8854ca40a4)
---

...without which volume creation fails with "volume create: <xyz>:
failed: Failed to create volume files"

>Fixes: bz#1716812
>Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb
>Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
Fixes: bz#1721106
Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
---

For nameless LOOKUPs, the server creates a new inode which shall
remain invalid until the fop is successfully processed, post which it
is linked into the inode table.
But in case there is an already linked inode for that entry, the
server discards that newly created inode, which results in an upcall
notification. This may result in the client being bombarded with
unnecessary upcalls, affecting performance if the data set is huge.
This issue can be avoided by looking up and storing the upcall context
in the original linked inode (if it exists), thus saving up on those
extra callbacks.

This is a backport of the mainline fix below -
> fixes: bz#1718338
> patch url: https://review.gluster.org/#/c/glusterfs/+/22840/
> Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Change-Id: I044a1737819bb40d1a049d2f53c0566e746d2a17
fixes: bz#1720634
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
---

EC was ignoring lock contention notifications received while a lock
was being acquired. When a lock is partially acquired (some bricks
have granted the lock but some others not yet), we can receive
notifications from the acquired bricks, which should be honored, since
we may not receive more notifications after that.
Since EC was ignoring them, once the lock was acquired it was not
released until the eager-lock timeout, causing unnecessary delays on
other clients.
This fix takes into consideration the notifications received before
having completed the full lock acquisition. After that, the lock will
be released as soon as possible.

Backport of:
> BUG: bz#1708156
> Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12
> Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
Fixes: bz#1717282
Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
Signed-off-by: Hari Gowtham <hgowtham@redhat.com>
---

ec_truncate_clean does its writing under the lock granted for
truncate, but the locked range is calculated by ec_adjust_offset_up,
so the write in ec_truncate_clean falls outside the locked range.

Updates: bz#1699500
Change-Id: I15ed1b0807d75c5eb817323f1c227e97d03e0e7c
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
(cherry picked from commit 0e1223491e964096384edfae5032ed0d50d028ad)
---

There is a race in the way O_DIRECT writes are handled. Assume two
overlapping write requests w1 and w2.
* w1 is issued and is in the wb_inode->wip queue as the response is
still pending from the bricks. Also, wb_request_unref in wb_do_winds
is not yet invoked.

    list_for_each_entry_safe (req, tmp, tasks, winds) {
        list_del_init (&req->winds);

        if (req->op_ret == -1) {
            call_unwind_error_keep_stub (req->stub, req->op_ret,
                                         req->op_errno);
        } else {
            call_resume_keep_stub (req->stub);
        }

        wb_request_unref (req);
    }

* w2 is issued and wb_process_queue is invoked. w2 is not picked up
for winding as w1 is still in wb_inode->wip. w2 is added to the todo
list and wb_writev for w2 returns.
* The response to w1 is received and invokes wb_request_unref. Assume
wb_request_unref in wb_do_winds (see point 1) is not invoked yet.
Since there is one more refcount, wb_request_unref in wb_writev_cbk of
w1 doesn't remove w1 from wip.
* wb_process_queue is invoked as part of wb_writev_cbk of w1. But it
fails to wind w2 as w1 is still in wip.
* wb_request_unref is invoked on w1 as part of wb_do_winds. w1 is
removed from all queues, including wb_inode->wip.
* After this point there is no invocation of wb_process_queue unless a
new request is issued from the application, causing w2 to be hung till
the next request.
This bug is similar to bz 1626780 and bz 1379655.

Change-Id: Iaa47437613591699d4c8ad18bc0b32de6affcc31
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Fixes: bz#1707198
(cherry picked from commit 6454132342c0b549365d92bcf3572ecd914f7fa8)
---

When eager-lock lock acquisition fails because of, say, network
failures, the local is not removed from owners_list. This leads to an
accumulation of waiting frames, and the application will hang, because
the waiting frames assume that another transaction is in the process
of acquiring the lock since the owners-list is not empty. Handled this
case as well in this patch.
Added asserts to make it easier to find these problems in the future.

fixes bz#1699736
Change-Id: I3101393265e9827755725b1f2d94a93d8709e923
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
---

If parallel-readdir is enabled, the rda xlator is loaded below dht in
the graph and proactively lists and caches entries when an opendir is
performed. dht_rmdir checks if the directory being deleted contains
stale linkto files by performing a readdirp on its child subvols.
However, as the entries are actually read in during the opendir
operation, which does not request the linkto xattr, no linkto xattrs
are present for the entries, causing dht to incorrectly identify them
as data files and fail the rmdir operation with ENOTEMPTY.
DHT now always adds the linkto xattr to the list of xattrs requested
in the opendir.

Change-Id: I0711198e66c59146282eb8b88084170bedfb4018
fixes: bz#1695399
Signed-off-by: N Balachandran <nbalacha@redhat.com>
(cherry picked from commit 110006bbcd5bb3e814b4cfe7d74cb41891ac3b0c)
---

This commit ensures the following:
1. Don't send a commit op request to the remote nodes when 'gluster v
status all' is executed, as for the status all transaction the local
commit gets the names of the volumes and the remote commit ops are
technically a no-op. So there is no need for additional rpc requests.
2. In the op state machine flow, if the transaction is in staged state
and op_info.skip_locking is true, then there is no need to set the txn
id in the priv->glusterd_txn_opinfo dictionary, which never gets
freed.

Fixes: bz#1694612
Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
(cherry picked from commit 34e010d64905b7387de57840d3fb16a326853c9b)
---

The fops allocate 3 kinds of payload (buffers) in the client xlator:
- fop payload: the buffer allocated by the write and put fops
- rsphdr payload: the buffer required by the reply cbk of some fops
like lookup and readdir
- rsp_payload: the buffer required by the reply cbk of fops like readv
etc.
Currently, in the lookup and readdir fops the rsphdr is sent as
payload, hence the allocated rsphdr buffer is also sent on the wire,
increasing the bandwidth consumption on the wire.
With this patch, the issue is fixed.

Fixes: bz#1673058
Change-Id: Ie8158921f4db319e60ad5f52d851fa5c9d4a269b
Signed-off-by: Poornima G <pgurusid@redhat.com>
---

A race between the lookup selfheal and rmdir can cause directories to
be healed only on non-hashed subvols. This can prevent the directory
from being listed from the mount point and in turn causes rm -rf to
fail with ENOTEMPTY.
Fix: Update the layout information correctly and reduce the call count
only after processing the response.

Change-Id: I812779aaf3d7bcf24aab1cb158cb6ed50d212451
fixes: bz#1695403
Signed-off-by: N Balachandran <nbalacha@redhat.com>
(cherry picked from commit b0f1d782fc45313fce4e1c0e74127401d5342d05)
---

Problem:
In an arbiter volume configuration, SHD will not send any writes onto
the arbiter brick even if there is a data pending marker for the
arbiter brick. If we have an arbiter setup on the geo-rep master and
there are data pending markers for the files on the arbiter brick, SHD
will not mark any data changelog during healing. While syncing the
data from master to slave, if the arbiter brick is considered as
ACTIVE, then there is a chance that the slave will miss out on some
data. If the arbiter brick is being newly added or replaced, there is
a chance of the slave missing all the data during sync.
Fix:
If there is a data pending marker for the arbiter brick, send a
truncate on the arbiter brick during heal, so that it will record the
truncate as the data transaction in the changelog.

Change-Id: I3242ba6cea6da495c418ef860d9c3359c5459dec
fixes: bz#1687687
Signed-off-by: karthik-us <ksubrahm@redhat.com>
---

Problem: Commit 5a152a changed the mechanism of computing the
checksum. In a heterogeneous cluster, peers are running into the
rejected state because we have different cksum computation mechanisms
on upgraded and non-upgraded nodes.
Solution: Add a check for op-version so that all the nodes in the
cluster follow the same mechanism for computing the cksum.

fixes: bz#1684569
> Change-Id: I1508f000e8c9895588b6011b8b6cc0eda7102193
> BUG: bz#1685120
> Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
Change-Id: I1508f000e8c9895588b6011b8b6cc0eda7102193
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
---

Two issues were found:
1. In wb_readdirp_cbk, the inode should be unrefed after wb_inode is
unlocked. Otherwise, the inode, and hence the context wb_inode, can be
freed by the time we try to unlock wb_inode.
2. wb_readdirp_mark_end iterates over a list of wb_inodes of children
of a directory. But the inodes could have been freed, and hence the
list might be corrupted. To fix this, take a reference on the inode
before adding it to the invalidate_list of the parent.

Change-Id: I911b0e0b2060f7f41ded0b05db11af6f9b7c09c5
Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
Updates: bz#1671556
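A minimal sketch of the unlock-before-unref ordering from issue 1 (stand-in types, not the actual write-behind code):

    #include <pthread.h>
    #include <stdlib.h>

    struct obj {
        pthread_mutex_t lock;
        int refcount;
    };

    static void obj_unref(struct obj *o)
    {
        int destroy;

        pthread_mutex_lock(&o->lock);
        destroy = (--o->refcount == 0);
        pthread_mutex_unlock(&o->lock);
        if (destroy) {
            pthread_mutex_destroy(&o->lock);
            free(o);
        }
    }

    static void use_and_release(struct obj *o)
    {
        pthread_mutex_lock(&o->lock);
        /* ... work on state protected by o->lock ... */
        pthread_mutex_unlock(&o->lock);
        obj_unref(o); /* unref strictly after the unlock: if this is
                       * the last reference, nothing still touches o */
    }

    int main(void)
    {
        struct obj *o = calloc(1, sizeof(*o));

        pthread_mutex_init(&o->lock, NULL);
        o->refcount = 1;
        use_and_release(o);
        return 0;
    }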
---

Change-Id: I7be9a5f48dcad1b136c479c58b1dca1e0488166d
Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
Fixes: bz#1671556
(cherry picked from commit 6175cb10cd5f59f3c7ae4100bc78f359b68ca3e9)
---

The "struct iatt" in iatt.h uses int64_t types for storing atime,
mtime and ctime. Therefore the struct 'struct md_cache' in md-cache.c
should also use these types to avoid an integer overflow. This can
happen e.g. if someone uses a very high default-retention-period in
the WORM xlator.

Change-Id: I605268d300ab622b9c8ab30e459dc00d9340aad1
fixes: bz#1678726
Signed-off-by: David Spisla <david.spisla@iternity.com>
(cherry picked from commit 15423e14f16dd1a15ee5e5cbbdbdd370e57ed59f)
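A sketch of the widened fields (member names are illustrative; the real struct md_cache carries much more state):

    #include <stdint.h>
    #include <stdio.h>

    /* Match the int64_t width that struct iatt uses for timestamps,
     * so large retention periods cannot overflow a narrower field. */
    struct md_cache_times {
        int64_t md_atime;
        int64_t md_mtime;
        int64_t md_ctime;
    };

    int main(void)
    {
        /* 2100-01-01 as a Unix timestamp, well past any 32-bit limit. */
        struct md_cache_times c = { .md_mtime = 4102444800LL };

        printf("mtime survives as %lld\n", (long long)c.md_mtime);
        return 0;
    }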
---

If a fop to create an entry fails on one of the data bricks, we mark
the pending changelog on the entry on the brick for which it was
successful. This is done as part of the post-op phase to make sure
that the entry gets healed even if it gets renamed to some other path
where its parent was not marked as bad.
As it happens as part of the post-op, we should consult thin-arbiter
to check if the brick which was successful is the good brick or not.
This will avoid split-brain and other issues.

>Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
>Signed-off-by: Ashish Pandey <aspandey@redhat.com>
Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
updates: bz#1672314
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
---

PROBLEM:
Many of the earlier changes in the management of shards in the lru and
fsync lists assumed that if a given shard exists in the fsync list, it
must be part of the lru list as well. This was found to be untrue.
Consider this: a file is FALLOCATE'd to a size which would make the
number of participant shards greater than the lru list size. In this
case, some of the resolved shards that are to participate in this fop
will be evicted from the lru list to give way to the rest of the
shards. And once FALLOCATE completes, these shards are added to the
fsync list, but without a ref. After the fop completes, these shard
inodes are unref'd and destroyed while their inode ctxs are still part
of the fsync list. Now when an FSYNC is called on the base file and
the fsync-list is traversed, the client crashes due to illegal memory
access.
FIX:
Hold a ref on the shard inode when adding it to the fsync list as
well, and unref under the following conditions:
1. when the shard is evicted from the lru list
2. when the base file is fsync'd
3. when the shards are deleted

Change-Id: Iab460667d091b8388322f59b6cb27ce69299b1b2
fixes: bz#1669382
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit 72922c1fd69191b220f79905a23395c3a87f86ce)
---

...when ctime is zero. ia_type and ia_gfid always need to be non-zero
for things to work correctly.
Problem:
Commit c9bde3021202f1d5c5a2d19ac05a510fc1f788ac zeroed out the iatt
buffer in the cbks of modification fops before unwinding if the ctime
in the buffer was zero. This was causing the fops to fail: noticeable
when AFR's 'consistent-metadata' option was enabled. (AFR zeroes out
the ctime when the option is set. See commit
4c4624c9bad2edf27128cb122c64f15d7d63bbc8.)
Fixes:
- Do not zero out the ia_type and ia_gfid of the iatt buf under any
circumstance.
- Also, fixed _rda_inode_ctx_update_iatts() to always update these
values from the incoming buf when ctime is zero. Otherwise we end up
with a zero ia_type and ia_gfid the first time the function is called
*and* the incoming buf has ctime set to zero.

fixes: bz#1665145
Reported-by: Michael Hanselmann <public@hansmi.ch>
Change-Id: Ib72228892d42c3513c19fc6dfb543f2aa3489eca
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 09db11b0c020bc79d493c6d7e7ea4f3beb000c68)
---

rm -rf <dir> fails on dirs which contain linkto files that point to
themselves, because dht incorrectly thought that they were cached
files after looking them up. The fix now treats them as invalid linkto
files and deletes them.

Change-Id: I376c72a5309714ee339c74485e02cfb4e29be643
fixes: bz#1671611
Signed-off-by: N Balachandran <nbalacha@redhat.com>
---

The inode LRU mechanism is moot in the fuse xlator (i.e. there is no
limit for the LRU list), as fuse inodes are referenced from kernel
context, and thus they can only be dropped on request of the kernel.
This might result in a high number of passive inodes which are useless
for the glusterfs client, causing a significant memory overhead.
This change tries to remedy this by extending the LRU semantics and
allowing a finite limit to be set on the fuse inode LRU.
A brief history of the problem:
When gluster's inode table was designed, fuse didn't have any
'invalidate' method, which means a userspace application could never
ask the kernel to send a 'forget()' fop; instead it had to wait for
the kernel to send it based on the kernel's parameters. The inode
table remembers the number of times the kernel has cached the inode
based on the 'nlookup' parameter. And the 'nlookup' field is not used
by any other entry points (like server-protocol, gfapi etc).
Hence the inode_table of the fuse module always had to have lru-limit
as '0', which means no limit. GlusterFS always had to keep all inodes
in memory as the kernel would have had a reference to them. Again, the
reason for this is that the kernel's glusterfs inode reference was a
pointer to an 'inode_t' structure in glusterfs. As it is a pointer, we
could never free it (to prevent a segfault or memory corruption).
Solution:
In the inode table, handle the prune case of inodes with 'nlookup'
differently, and call an 'invalidator' method, which in this case is
fuse_invalidate(); it sends the request to the kernel to obtain the
forget request.
When the kernel sends the forget, it means it has dropped all
references to the inode, and it will send the forget with the
'nlookup' parameter too. We just need to make sure to reduce the
'nlookup' value we have when we get the forget. That automatically
causes the relevant prune to happen.

Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B
fixes: bz#1623107
Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2
Signed-off-by: Amar Tumballi <amarts@redhat.com>
---

PROBLEM:
When multiple sharded files are deleted in quick succession, multiple
issues were observed:
1. Misleading logs corresponding to a sharded file, where one log
message said the shards corresponding to the file were deleted
successfully, followed by multiple logs suggesting the very same
operation failed. This was because of multiple synctasks attempting to
clean up shards of the same file and only one of them succeeding (the
one that gets the ENTRYLK successfully), with the rest of them logging
failure.
2. Multiple synctasks for background deletion would be launched, one
for each deleted file, but all of them could readdir entries from
.remove_me at the same time and potentially contend for ENTRYLK on
.shard for each of the entry names. This is undesirable and wasteful.
FIX:
Background deletion will now follow a state machine. In the event that
there are multiple attempts to launch a synctask for background
deletion, one for each file deleted, only the first task is launched.
And if, while this task is doing the cleanup, more attempts are made
to delete other files, the state of the synctask is adjusted so that
it restarts the crawl even after reaching end-of-directory, to pick up
any files it may have missed in the previous iteration.
This patch also fixes an uninitialized lk-owner during
syncop_entrylk() which was leading to multiple background deletion
synctasks entering the critical section at the same time, and to
illegal memory access of the base inode in the second synctask after
it was destroyed post shard deletion by the first synctask.

Change-Id: Ib33773d27fb4be463c7a8a5a6a4b63689705324e
updates: bz#1665803
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit c0c2022e7d7097e96270a74f37813eda0c4e6339)
---

...excessive logging of the kind:
"[2018-12-26 05:22:44.195019] E [MSGID: 133010]
[shard.c:2253:shard_common_lookup_shards_cbk] 0-volume1-shard: Lookup
on shard 785 failed. Base file gfid = cd938e64-bf06-476f-a5d4-d580a0d37416
[No such file or directory]"
shard_common_lookup_shards_cbk() has a specific check to ignore ENOENT
errors without logging them during specific fops. But because
background deletion is done in a new frame (with local->fop being
GF_FOP_NULL), the ENOENT check is skipped and the absence of shards
gets logged every time.
To fix this, local->fop is initialized to GF_FOP_UNLINK during
background deletion.

Change-Id: I0ca8d3b3bfbcd354b4a555eee520eb0479bcda35
updates: bz#1665803
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit aa28fe32364e39981981d18c784e7f396d56153f)
---

To avoid a use-after-free, reset lease_ctx->timer back to NULL after
the structure has been freed.

Change-Id: Icd213ec809b8af934afdb519c335a4680a1d6cdc
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit a9b0003c717087ff168bc143c70559162e53e0d5)
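The underlying pattern is tiny and general: clear the pointer right after freeing what it points to, so later checks see NULL instead of a dangling timer. A standalone illustration (stand-in types):

    #include <stdlib.h>

    struct timer;
    struct lease_ctx { struct timer *timer; };

    static void lease_timer_destroy(struct lease_ctx *ctx)
    {
        free(ctx->timer);
        ctx->timer = NULL; /* later code takes the "no timer armed" path */
    }

    int main(void)
    {
        struct lease_ctx ctx = { .timer = malloc(16) };

        lease_timer_destroy(&ctx);
        lease_timer_destroy(&ctx); /* safe: free(NULL) is a no-op */
        return 0;
    }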
---

Replaced "recieve" with "receive".

Change-Id: I58a3d3d4a0093df4743de9fae4d8ff152d4b216c
fixes: bz#1662200
Signed-off-by: N Balachandran <nbalacha@redhat.com>
(cherry picked from commit a11c5c66321dd8411373a68cc163c981c7d083df)
---

The io-cache xlator has been skipping xdata references when the data
needs to be read into the page cache. This patch fixes that.
Note: similar changes may be needed for other fops as well which are
handled by io-cache.

Change-Id: I28d73d4ba471d13eb55d0fd0b5197d222df77a2a
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit b3d88a0904131f6851f4185e43f815ecc3353ab5)