glusterfs.git/xlators/cluster/ec/src/ec-heal.c, branch release-8

cluster/ec: Remove stale entries from indices/xattrop folder

2020-08-20T13:21:58+00:00

Problem:
If a gfid is present in the indices/xattrop folder while
the file/dir is actually healthy and all the xattrs are healthy,
it causes a lot of lookups by shd on an entry which does not need
to be healed.
This whole process consumes a lot of CPU without doing meaningful
work.

Solution:
Set the trusted.ec.dirty xattr of the entry so that the actual heal process
happens and, at the end of it, during the unset of dirty, the gfid entry in
indices/xattrop is removed.

Change-Id: Ib1b9377d8dda384bba49523e9ff6ba9f0699cc1b
Fixes: #1385
Signed-off-by: Ashish Pandey 
(cherry picked from commit ba1b0a471dec968633f89c7f790b099fb4ad700d)

cluster/ec: Improve detection of new heals

2020-08-19T18:00:31+00:00

When EC successfully healed a directory it assumed that maybe other
entries inside that directory could have been created, which could
require additional heal cycles. For this reason, when the heal happened
as part of one index heal iteration, it triggered a new iteration.

The problem happened when the directory was healthy, so no new entries
were added, but its index entry was not removed for some reason. In
this case self-heal started an endless loop, healing the same directory
continuously and causing high CPU utilization.

This patch improves detection of new files added to the heal index so
that a new index heal iteration is only triggered if there is new work
to do.

Change-Id: I2355742b85fbfa6de758bccc5d2e1a283c82b53f
Fixes: #1354
Signed-off-by: Xavi Hernandez
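
The decision described in this commit can be illustrated with a small model. This is only a sketch under stated assumptions, not the code in ec-heal.c: the types `heal_result` and `sweep_state` and all function names are hypothetical. It contrasts re-triggering a sweep whenever something was healed with re-triggering only when new entries were actually queued.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of healing one index entry (not a real ec type). */
struct heal_result {
    bool healed;        /* the entry itself was healed successfully */
    size_t new_entries; /* entries newly queued for heal while healing it */
};

/* Hypothetical per-iteration bookkeeping for one index heal sweep. */
struct sweep_state {
    size_t healed_count; /* entries healed in this sweep */
    size_t new_work;     /* entries added to the index during this sweep */
};

/* Record the outcome of one entry in the sweep state. */
static void sweep_record(struct sweep_state *st, const struct heal_result *res)
{
    if (res->healed)
        st->healed_count++;
    st->new_work += res->new_entries;
}

/* Behaviour described as problematic: any healed entry causes a new sweep,
 * so a healthy directory with a stale index entry loops forever. */
static bool sweep_again_old(const struct sweep_state *st)
{
    return st->healed_count > 0;
}

/* Behaviour described by the patch: sweep again only if there is new work. */
static bool sweep_again_new(const struct sweep_state *st)
{
    return st->new_work > 0;
}
```

With a healthy directory whose index entry is stale (`healed` is true, `new_entries` is 0), `sweep_again_old` keeps returning true while `sweep_again_new` returns false.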

cluster/ec: Change handling of heal failure to avoid crash

2019-11-04T11:00:40+00:00

Problem:
ec_getxattr_heal_cbk was called with NULL as the second argument
when heal was failing.
This function dereferenced the "cookie" argument, which caused a crash.

Solution:
Cookie is changed to carry the value that was supposed to be
stored in fop->data, so even when fop is NULL in the error
case, there won't be any NULL dereference.

Thanks to Xavi for the suggestion about the fix.

Change-Id: I0798000d5cadb17c3c2fbfa1baf77033ffc2bb8c
fixes: bz#1729085
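
A minimal model of the pattern in this fix, assuming nothing about the real callback signature: the type `sketch_heal_ctx` and the function `sketch_heal_cbk` are hypothetical. The point is that the callback receives the needed context directly through the cookie instead of reaching it through a fop pointer that may be NULL on failure.

```c
#include <stddef.h>

/* Hypothetical context the callback needs (stands in for fop->data). */
struct sketch_heal_ctx {
    int completed; /* number of times the callback ran */
    int last_ret;  /* result reported by the last run */
};

/*
 * Hypothetical callback. The cookie carries the context itself, so the
 * callback never has to dereference a fop, which does not exist when the
 * heal fails before a fop is created.
 */
static void sketch_heal_cbk(void *cookie, int op_ret)
{
    struct sketch_heal_ctx *ctx = cookie;

    if (ctx == NULL)
        return; /* defensive: nothing to report to */

    ctx->completed++;
    ctx->last_ret = op_ret;
}
```

In the failure path the caller can invoke `sketch_heal_cbk(&ctx, -1)` without having any fop at hand.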

ctime/rebalance: Heal ctime xattr on directory during rebalance

2019-09-16T06:42:29+00:00

After add-brick and rebalance, the ctime xattr is not present
on rebalanced directories on the new brick. This patch fixes
that.

Note that ctime still doesn't support consistent time across
distribute sub-volume.

This patch also fixes the in-memory inconsistency of time attributes
when metadata is self healed.

Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
fixes: bz#1734026
Signed-off-by: Kotresh HR

cluster/ec: Fix coverity issue.

2019-08-13T05:59:21+00:00

Change-Id: I727287784a15d89441865de7f438002e4a370250
fixes: bz#1738763
Signed-off-by: Ashish Pandey

cluster/ec: Create heal task with heal process id

2019-07-30T12:57:42+00:00

Problem:
ec_data_undo_pending calls syncop_fxattrop->SYNCOP without
a frame. In this case SYNCOP gets the frame of the task.
However, when we create a synctask for heal we provide
frame as NULL.
Now, if the read-only feature is ON, it will receive the
process ID of the shd as 0 and will not consider it an
internal process. This prevents healing of a file and logs
a "Read-only file system" error message.

Solution:
While launching heal, create a synctask using a frame and set
the process id of the SHD, which is -6.

Change-Id: I37195399c85de322cbcac75633888922c4e3db4a
Fixes: bz#1734252
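
The effect of the pid on the read-only check can be sketched as follows. This is a simplified model, not the read-only translator's code: the function names are hypothetical, and the rule that internal processes are identified by a negative pid is an assumption of this sketch. The value -6 is the one quoted in the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* SHD process id as quoted in the commit message. */
#define SKETCH_PID_SELF_HEALD (-6)

/* Assumption of this sketch: internal processes carry a negative pid. */
static bool sketch_is_internal(int pid)
{
    return pid < 0;
}

/*
 * Hypothetical write gate: on a read-only volume only internal processes
 * may modify data; everyone else gets EROFS ("Read-only file system").
 */
static int sketch_readonly_check_write(bool read_only, int pid)
{
    if (read_only && !sketch_is_internal(pid))
        return -EROFS;
    return 0;
}
```

A heal task created without a frame shows up with pid 0 and is rejected, while one carrying `SKETCH_PID_SELF_HEALD` is allowed through.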

ec-heal: check file's gfid when deleting stale name

2019-06-20T21:24:17+00:00

A name-less lookup does not contain the parent's stat, so
it is hard to check that the looked-up file is at the right path.

This patch changes to a name lookup and checks the file's gfid against
the expected gfid. If the gfid is different, it is marked ESTALE.

fixes: bz#1702131
Change-Id: I2de20b10d680eed1e2fb1d3830b3b3dec4520dbf
Signed-off-by: Kinglong Mee
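
The gfid comparison described in this commit can be sketched like this. The type `sketch_gfid_t` and the function name are hypothetical stand-ins; the sketch only assumes that a gfid is a 16-byte identifier and that a mismatch is reported as ESTALE.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical 16-byte gfid, standing in for the real identifier type. */
typedef unsigned char sketch_gfid_t[16];

/*
 * Compare the gfid returned by a name lookup with the gfid we expected
 * to find under that name. A mismatch means the name now refers to a
 * different file, so the caller must not delete it as a stale name.
 */
static int sketch_check_gfid(const sketch_gfid_t found,
                             const sketch_gfid_t expected)
{
    if (memcmp(found, expected, sizeof(sketch_gfid_t)) != 0)
        return -ESTALE;
    return 0;
}
```

The caller would only proceed with removing the name when this check returns 0.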

multiple files: another attempt to remove includes

2019-06-14T16:50:32+00:00

There are many include statements that are not needed.
A previous, more ambitious attempt failed because of the *BSD platform
(see https://review.gluster.org/#/c/glusterfs/+/21929/ )

Now trying a more conservative reduction.
It does not solve all circular deps that we have, but it
does reduce some of them. There is just too much to handle
reasonably (dht-common.h includes dht-lock.h which includes
dht-common.h ...), but it does reduce the overall number of lines
of include we need to look at in the future to understand and fix
the mess later on.

Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f
updates: bz#1193929
Signed-off-by: Yaniv Kaul

ec/fini: Fix race between xlator cleanup and on going async fop

2019-06-08T12:20:10+00:00

Problem:
While we process a cleanup, there is a chance of a race with
async operations, for example ec_launch_replace_heal. This can
lead to invalid memory access.

Solution:
Just like we track on going heal fops, we can also track fops like
ec_launch_replace_heal, so that we can decide when to send a
PARENT_DOWN request.

Change-Id: I055391c5c6c34d58aef7336847f3b570cb831298
fixes: bz#1703948
Signed-off-by: Mohammed Rafi KC
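
One way to picture the tracking described in this commit is a counter of in-flight async operations guarded by a lock. The sketch below is an illustration under stated assumptions, not the ec xlator's implementation: `async_tracker` and its functions are hypothetical, and "ready" simply means the caller may now go ahead with the shutdown step (sending PARENT_DOWN in the commit's terms).

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical tracker for async operations that may outlive a request. */
struct async_tracker {
    pthread_mutex_t lock;
    unsigned int in_flight; /* async operations currently running */
    bool shutting_down;     /* cleanup has been requested */
};

static void tracker_init(struct async_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->in_flight = 0;
    t->shutting_down = false;
}

/* Register a new async operation; refused once cleanup has started. */
static bool tracker_begin(struct async_tracker *t)
{
    bool ok = false;

    pthread_mutex_lock(&t->lock);
    if (!t->shutting_down) {
        t->in_flight++;
        ok = true;
    }
    pthread_mutex_unlock(&t->lock);
    return ok;
}

/* Finish an operation; returns true if cleanup may proceed now. */
static bool tracker_end(struct async_tracker *t)
{
    bool ready;

    pthread_mutex_lock(&t->lock);
    t->in_flight--;
    ready = t->shutting_down && t->in_flight == 0;
    pthread_mutex_unlock(&t->lock);
    return ready;
}

/* Request cleanup; returns true if nothing is in flight. */
static bool tracker_shutdown(struct async_tracker *t)
{
    bool ready;

    pthread_mutex_lock(&t->lock);
    t->shutting_down = true;
    ready = t->in_flight == 0;
    pthread_mutex_unlock(&t->lock);
    return ready;
}
```

If `tracker_shutdown` returns false, the last call to `tracker_end` that returns true is the point at which the deferred shutdown step can be issued.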

ec/fini: Fix race with ec_fini and ec_notify

2019-05-21T12:57:39+00:00

During a graph cleanup, we first send a PARENT_DOWN and wait for
a child down to ultimately free the xlator and the graph.

In the ec xlator, we clean up the threads when we get a PARENT_DOWN event.
But a racing event like CHILD_UP or even an xl_op may trigger healing threads
after the threads have been cleaned up.

So there is a chance that the threads might access a freed private variable.

Change-Id: I252d10181bb67b95900c903d479de707a8489532
fixes: bz#1703948
Signed-off-by: Mohammed Rafi KC