glusterfs.git/xlators/storage, branch v3.6.1

Avoid spurious EINVAL in posix_readdir()

2014-10-29T09:26:56+00:00

On non-Linux systems, we check that seekdir() succeeds and return
EINVAL if it does not. We need this to avoid infinite loops if some
other component in GlusterFS uses seekdir() incorrectly. This
was introduced in this change: http://review.gluster.org/#/c/8760/

But seekdir() also fails when using the offset returned for the
last entry, and this is expected behavior. As a result, the seekdir()
test produces a spurious EINVAL when reaching the end of the directory.
That error is not propagated to the calling process, but it may harm
internal GlusterFS processing. At the very least, it produces a
spurious error message in the brick's log.

We fix the problem by remembering the last entry offset in fd private
data. When a new posix_readdir() invocation requests that offset,
we avoid returning EINVAL.

Backport of I4e67a2ea46538aae63eea663dd4aa33b16ad24c7

BUG: 1138897
Change-Id: I4e98294d157f67ae1a1f0ece1562c77d1219da40
Signed-off-by: Emmanuel Dreyfus 
Reviewed-on: http://review.gluster.org/8933
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

POSIX filesystem compliance: PATH_MAX

2014-10-03T14:58:10+00:00

POSIX requires filesystems to support paths of lengths up to
_XOPEN_PATH_MAX (1024). This is the PATH_MAX limit described here:
http://pubs.opengroup.org/onlinepubs/009604499/basedefs/limits.h.html

When using a path of 1023 bytes, the posix xlator attempts to create
an absolute path by prefixing the 1023-byte path with the brick
base path. The result is an absolute path of more than _XOPEN_PATH_MAX
bytes, which may be rejected by the backend filesystem.

Linux's ext3fs PATH_MAX seems to default to 4096, which means it
will work (unless the brick base path is longer than 3072 bytes, which
is unlikely to happen). NetBSD's FFS PATH_MAX defaults to 1024,
which means the bug can happen regardless of brick base path length.

If this condition is detected for a brick, the proposed fix is to
chdir() the brick glusterfsd daemon to its brick base directory.
Then when encountering a path that will exceed _XOPEN_PATH_MAX once
prefixed by the brick base path, a relative path is used instead
of an absolute one. We do not always use a relative path because some
operations require an absolute path on the brick base path itself
(e.g. statvfs).
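
A minimal sketch of that path selection, under stated assumptions:
make_backend_path() and SKETCH_PATH_MAX are hypothetical names, the
relative path is assumed to carry no leading slash, and the process is
assumed to have already called chdir() to the brick base directory.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_PATH_MAX 1024  /* stands in for _XOPEN_PATH_MAX */

/*
 * Build the path handed to the backend filesystem.
 * Returns 1 if the relative path was chosen, 0 if the absolute path
 * was chosen, and -1 if the output buffer is too small.
 */
static int make_backend_path(const char *base, const char *rel,
                             char *out, size_t outlen)
{
    /* Length of base + '/' + rel, without the terminating NUL. */
    size_t abs_len = strlen(base) + 1 + strlen(rel);
    int relative = (abs_len >= SKETCH_PATH_MAX);
    int n;

    if (relative)
        n = snprintf(out, outlen, "%s", rel);  /* resolved from cwd */
    else
        n = snprintf(out, outlen, "%s/%s", base, rel);

    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return relative;
}
```

Operations that need the brick base path itself would bypass this
helper and keep using the absolute path.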

At least on NetBSD, this chdir() uncovers a race condition which
causes file lookup to fail with ENODATA for a few seconds. The
volume quickly reaches a sane state, but regression tests are fast
enough to choke on it. The reason is obscure (as often with race
conditions), but sleeping one second after the chdir() seems to
change scheduling enough that the problem disappears.

Note that since the chdir() is only done when the brick backend
filesystem does not support long enough paths, it will not occur with
Linux ext3fs (unless the brick base path is over 3072 bytes long).

This is a backport of I7db3567948bc8fa8d99ca5f5ba6647fe425186a9

BUG: 1138897
Change-Id: Ib8eb3efaac8a7ba505d830623921338689229e9a
Signed-off-by: Emmanuel Dreyfus 
Reviewed-on: http://review.gluster.org/8864
Tested-by: Gluster Build System 
Reviewed-by: Harshavardhana 
Tested-by: Harshavardhana 
Reviewed-by: Vijay Bellur

Fix invalid seekdir() usage

2014-09-30T16:50:26+00:00

According to POSIX, seekdir() should only be given an offset obtained
from telldir() on the same DIR *:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html

Code in afr-self-heald.c and index.c operates outside of the
specification by using seekdir() with an offset from a directory that
was previously opened, closed, and re-opened. This seems to work on
Linux (although with no guarantee that it always will in the future).
On NetBSD, seekdir() with an invalid offset is a no-op, which causes an
infinite loop, since index_fill_readdir() always restarts from the
beginning of the directory.

The situation is fixed by using a non-anonymous fd in afr-self-heald.c:
we explicitly open the directory so that it remains open on the brick
side for as long as we want to reuse offsets in seekdir().
This requires adding an opendir fop to the index xlator.

If the brick was not updated, the opendir will fail and we fall back
to the standard-violating approach for backward compatibility on Linux.
On other systems we fail, since it never worked there.

While there, add tests to check seekdir() success in the index and posix
xlators, so that incorrect usage from calling code produces an explicit
error instead of an infinite loop. We can only do this on non-Linux
systems, for the sake of backward compatibility when the brick was
updated but not the client.

Backport of I88ca90acfcfee280988124bd6addc1a1893ca7ab

BUG: 1138897
Change-Id: I5446a9a17d5451ec5aab8fbd10d381da9a0a23ad
Signed-off-by: Emmanuel Dreyfus 
Reviewed-on: http://review.gluster.org/8860
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Vijay Bellur

glusterd/quota: Heal pgfid xattr on existing data when quota is enabled

2014-09-30T16:42:40+00:00

This is a backport of http://review.gluster.org/#/c/8878/

The pgfid extended attributes are used to construct the ancestry path
(from the file to the volume root) for nameless lookups on files.
As NFS relies on nameless lookups heavily, quota enforcement through NFS
would be inconsistent if quota were to be enabled on a volume with
existing data.

The solution is to heal the pgfid extended attributes as part of the
lookup performed by the quota-crawl process. In a posix lookup, check
for the pgfid xattr and, if it is missing, set it.

BUG: 1147953
Change-Id: I707d91a056e07452bfd1e070af5eddaa752a84ac
Signed-off-by: vmallika 
Reviewed-on: http://review.gluster.org/8890
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

Do not forbid fallocate on non-Linux systems

2014-09-26T12:24:04+00:00

Linux fallocate() differs from posix_fallocate() by
an extra flag argument that can take the FALLOC_FL_KEEP_SIZE value.

Do not test FALLOC_FL_KEEP_SIZE existence to enable fallocate()
in posix xlator, as sys_fallocate() in libglusterfs provides
support for both implementations.
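
A wrapper covering both variants might look like the sketch below.
fallocate_compat() is a hypothetical name and this is not the
libglusterfs sys_fallocate() implementation; the fallback to
posix_fallocate() on EOPNOTSUPP is an assumption made for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes fallocate() and FALLOC_FL_KEEP_SIZE */
#endif
#include <errno.h>
#include <fcntl.h>

/* Returns 0 on success or an errno value on failure. */
static int fallocate_compat(int fd, int flags, off_t offset, off_t len)
{
#ifdef FALLOC_FL_KEEP_SIZE
    /* Linux variant: the flags are passed through unchanged. */
    if (fallocate(fd, flags, offset, len) == 0)
        return 0;
    if (errno != EOPNOTSUPP || flags != 0)
        return errno;
    /* No native support and no flags: use the portable call below. */
#else
    if (flags != 0)
        return ENOTSUP;  /* no portable way to honour extra flags */
#endif
    /* posix_fallocate() returns the error number directly. */
    return posix_fallocate(fd, offset, len);
}
```

The caller no longer needs to know which variant the platform offers,
which is why the xlator does not have to test for FALLOC_FL_KEEP_SIZE.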

Backport of Idf41a0396028a15e81281791bf6912d7fd674e3f
BUG: 1138897
Signed-off-by: Emmanuel Dreyfus 

Change-Id: Ie6e5ea923561630c52a6db5c7f83313cfdc34811
Reviewed-on: http://review.gluster.org/8862
Tested-by: Gluster Build System 
Reviewed-by: Kaleb KEITHLEY 
Reviewed-by: Vijay Bellur

storage/posix: Log when mkdir is on an existing gfid but non-existent path

2014-09-19T16:46:11+00:00

Consider the following steps on a distribute volume:

1. rename (src, dst) on hashed subvolume
2. snapshot taken
3. restore snapshot and do stat on src and dst

Now we end up with two directories, src and dst, having the same gfid,
because distribute creates directories on subvolumes where they do not
exist as part of directory healing.

This can happen even with a race between rename and directory healing
in dht-lookup. It can lead to undefined behaviour while accessing
either of the two directories. Hence, we log the paths of both
directories, so that a sysadmin can take corrective action when
(s)he sees this log. One corrective action is to copy the
contents of both directories from the backend into a new directory and
delete both directories.

Since the effort involved in fixing this issue is non-trivial, we
provide this workaround until we come up with a fix.

Change-Id: I38f4520e6787ee33180a9cd1bf2f36f46daea1ea
BUG: 1144485
Signed-off-by: Raghavendra G 
Reviewed-on-master: http://review.gluster.org/8008
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/8783
Tested-by: Gluster Build System

storage/posix: Don't unlink .glusterfs-hardlink before linkto check

2014-09-12T14:06:16+00:00

BUG: 1138385
Change-Id: I90a10ac54123fbd8c7383ddcbd04e8879ae51232
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on-master: http://review.gluster.org/8559
Tested-by: Gluster Build System 
Reviewed-by: N Balachandran 
Reviewed-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/8612

storage/posix: Prefer gfid links for inode-handle

2014-09-12T09:55:51+00:00

Backport of http://review.gluster.org/8575

Problem:
A file's path can be changed by other in-flight entry operations, so if
renames are in progress during other operations such as open, those
operations may fail. We observed that this can also happen while renames
and readdirps/lookups are in progress, because the dentry table
sometimes goes stale.

Fix:
Prefer gfid-handles over paths for files. For directory handles,
preferring gfid-handles causes performance issues because it needs to
resolve paths by traversing up the symlinks.
Tests that check whether files are opened should check the gfid path
after this change, so a couple of tests were changed to reflect this.
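
As an illustration of what a gfid path looks like on a brick, the
sketch below builds the handle location under the brick's .glusterfs
directory, using the first two pairs of hex digits of the gfid as
subdirectories. gfid_handle_path() is a hypothetical helper, not the
function used by the posix xlator.

```c
#include <stdio.h>
#include <string.h>

/*
 * gfid_str is the canonical 36-character UUID string.
 * Returns 0 on success, -1 on a malformed gfid or a short buffer.
 */
static int gfid_handle_path(const char *brick, const char *gfid_str,
                            char *out, size_t outlen)
{
    int n;

    if (strlen(gfid_str) != 36)
        return -1;
    /* <brick>/.glusterfs/<hex 0-1>/<hex 2-3>/<gfid> */
    n = snprintf(out, outlen, "%s/.glusterfs/%.2s/%.2s/%s",
                 brick, gfid_str, gfid_str + 2, gfid_str);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

Because this path depends only on the gfid, it stays valid while the
file is being renamed.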

Note:
This patch doesn't fix the issue for directories. I think a complete fix
is to come up with an entry operation serialization xlator. Until then,
let's live with this.

BUG: 1136821
Change-Id: If93e46d542a4e96a81a0639b5210330f7dbe8be0
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/8594
Reviewed-by: Vijay Bellur 
Tested-by: Gluster Build System

storage/posix: remove deletion of entries in case of creation failures

2014-09-10T16:11:20+00:00

The code is not atomic enough to avoid deleting a dentry created by a
parallel dentry creation operation.

Change-Id: I9bd6d2aa9e7a1c0688c0a937b02a4b4f56d7aa2e
BUG: 1138387
Signed-off-by: Raghavendra G 
Reviewed-on-master: http://review.gluster.org/8327
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/8693
Tested-by: Vijay Bellur

cluster/dht: Modified logic of linkto file deletion on non-hashed

2014-09-10T15:48:56+00:00

Currently, whenever dht_lookup_everywhere gets called and
dht_lookup_everywhere_cbk finds a linkto file on a non-hashed
subvolume, the file is unlinked. But there are cases when this file
is under migration. In that case, we should avoid deleting
the file.

When some other rebalance process changes the layout of the parent
such that dst_file (w.r.t. migration) falls on a non-hashed node,
lookup may have found it as a linkto file, but just before the
unlink the file is under migration or already migrated.
In such cases the unlink can be avoided.

Race:
-------
If we have two bricks (brick-1 and brick-2) with initial file "a"
under BaseDir which is hashed as well as cached on (brick-1).

Assume hashing "a" gives 44.

                              Brick-1              Brick-2

Initial Setup:               BaseDir/a             BaseDir
                             [1-50]                [51-100]

Now add new-brick Brick-3.

1. Rebalance-1 on node Node-1 (Brick-1 node) will reset
the BaseDir Layout.

2. After that it will perform
a)  Create linkto file on  new-hashed (brick-2)
b)  Perform file migration.

1. Rebalance-1 fixes the base layout:
                 Brick-1             Brick-2           Brick-3
                 ---------         ----------         ------------
                 BaseDir/a            BaseDir           BaseDir
                  [1-33]              [34-66]           [67-100]

2. Only a) is     BaseDir/a          BaseDir/a(linkto)   BaseDir
   performed                         Create linktofile

Now Rebalance-2 on Node-2 jumps in and performs
steps 1 and 2-a.

After (rebal-2, step-1), it changes the layout of the BaseDir.
                    BaseDir/a     BaseDir/a(link)    BaseDir
                    [67-100]           [1-33]        [34-66]

For (rebal-2, step-2), it will perform a lookup on Brick-3, since with
the new layout 44 falls on Brick-3. But the lookup will fail,
so dht_lookup_everywhere gets called.
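
The three layouts in this walkthrough can be checked with a small
range lookup. This is a simplified sketch using the 1-100 hash space
of the example; struct range and hashed_brick() are hypothetical names.

```c
#include <stddef.h>

/* One layout slot: inclusive hash range owned by a brick number. */
struct range {
    int start;
    int stop;
    int brick;
};

/* Return the brick whose range contains hash, or -1 if none does. */
static int hashed_brick(const struct range *layout, size_t n, int hash)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (hash >= layout[i].start && hash <= layout[i].stop)
            return layout[i].brick;
    }
    return -1;
}
```

With hash 44, the initial layout selects Brick-1, the layout after
Rebalance-1 selects Brick-2, and the layout after Rebalance-2 selects
Brick-3, which is why the second lookup misses the linkto file.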

NOTE: A linkto file was created on Brick-2 by Rebalance-1.

Currently that linkto file gets deleted by the rebalance-2 lookup, as
it is considered a stale linkto file. But with this patch, if rebalance
is already in progress or rebalance is over, the linkto file will not be
unlinked. If rebalance is in progress, an fd will be open, and if
rebalance is over, the file will no longer be marked as linkto.
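
The resulting decision can be summarised as a predicate. This is a
sketch only: should_unlink_linkto() and its three inputs are
hypothetical, and how DHT actually obtains them is not described here.

```c
#include <stdbool.h>

/*
 * Decide whether a linkto file found by a lookup-everywhere pass may
 * be unlinked as stale.
 */
static bool should_unlink_linkto(bool on_hashed_subvol,
                                 bool marked_linkto,
                                 bool has_open_fd)
{
    if (on_hashed_subvol)
        return false;  /* only files on non-hashed subvolumes qualify */
    if (!marked_linkto)
        return false;  /* migration finished: it is now a data file */
    if (has_open_fd)
        return false;  /* migration in progress: rebalance holds an fd */
    return true;       /* a genuinely stale linkto file */
}
```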

Change-Id: I3fee0d28de3c76197325536a9e30099d2413f07d
BUG: 1138385
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on-master: http://review.gluster.org/8345
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra G 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Vijay Bellur