<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/libglusterfs/src/glusterfs.h, branch v3.11.0rc0</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>feature/dht: Directory synchronization</title>
<updated>2017-04-26T09:00:34+00:00</updated>
<author>
<name>Kotresh HR</name>
<email>khiremat@redhat.com</email>
</author>
<published>2017-01-03T07:35:06+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=4076b73b2f4fb3cca0737974b124f33f76f9c9c1'/>
<id>4076b73b2f4fb3cca0737974b124f33f76f9c9c1</id>
<content type='text'>
Design doc: https://review.gluster.org/16876

Directory creation is now synchronized using a blocking inodelk on the
parent on the hashed subvolume, followed by an entrylk on the hashed
subvolume, across dht_mkdir, dht_rmdir, dht_rename_dir and lookup
selfheal mkdir.

To maintain internal consistency of directories across all subvols of
dht, we need locks. Specifically we are interested in:

 1. Consistency of layout of a directory. Only one writer should modify
    the layout at a time. A writer (layout setting during directory heal
    as part of lookup) shouldn't modify the layout while there are
    readers (all other fops like create, mkdir etc., which consume
    layout) and readers shouldn't read the layout while a writer is in
    progress. Readers can read the layout simultaneously. The writer takes
    a WRITE inodelk on the directory (whose layout is being modified)
    across ALL subvols. A reader takes a READ inodelk on the directory
    (whose layout is being read) on ANY subvol.

 2. Consistency of directory namespace across subvols. The path and
    associated gfid should be the same on all subvols. A gfid should not be
    associated with more than one path on any subvol. All fops that can
    change directory names (mkdir, rmdir, renamedir, directory creation
    phase in lookup-heal) take an entrylk on the hashed subvol of the
    directory.

 NOTE1: In point 2 above, since dht takes an entrylk on the hashed subvol of a
        directory, the transaction itself is a consumer of the layout on the
        parent directory. So, the transaction is a reader of the parent
        layout and takes an inodelk on the parent directory just like any
        other layout reader. So a mkdir (dir/subdir) would:

     &gt; Acquire a READ inodelk on "dir" on any subvol.
     &gt; Acquire an entrylk (dir, "subdir") on hashed subvol of "subdir".
     &gt; Create the directory on the hashed subvol and possibly on non-hashed subvols.
     &gt; UNLOCK (entrylk)
     &gt; UNLOCK (inodelk)

 NOTE2: The mkdir fop, while setting the layout of the directory being
        created, is considered a reader, but NOT a writer. The reason is
        that before a fop which can consume the layout of a directory
        arrives, either of the following conditions has to be true:

     &gt; The mkdir syscall from the application has to complete. In this
       case there is no need for synchronization.
     &gt; A lookup issued on the directory racing with mkdir has to complete.
       Since layout setting by a lookup is considered a writer, only
       one of either mkdir or lookup will set the layout.

Code re-organization:
   All lock-related routines are moved to the "dht-lock.c" file.
   A new wrapper function, 'dht_protect_namespace', is introduced to take
   a blocking inodelk followed by an entrylk.

Updates #191
Change-Id: I01569094dfbe1852de6f586475be79c1ba965a31
Signed-off-by: Kotresh HR &lt;khiremat@redhat.com&gt;
BUG: 1443373
Reviewed-on: https://review.gluster.org/15472
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>xlator: do not call dlclose() when debugging</title>
<updated>2017-04-07T17:17:12+00:00</updated>
<author>
<name>Niels de Vos</name>
<email>ndevos@redhat.com</email>
</author>
<published>2017-02-28T06:37:00+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=ef36ac0d1b72ab2c07ed6e0a3116b7265c3c0164'/>
<id>ef36ac0d1b72ab2c07ed6e0a3116b7265c3c0164</id>
<content type='text'>
Valgrind cannot show the symbols of a .so after dlclose() has been called
on it. The unhelpful ??? entries in the output get resolved properly with
this change:

  ==25170== 344 bytes in 1 blocks are definitely lost in loss record 233 of 324
  ==25170==    at 0x4C29975: calloc (vg_replace_malloc.c:711)
  ==25170==    by 0x52C7C0B: __gf_calloc (mem-pool.c:117)
  ==25170==    by 0x12B0638A: ???
  ==25170==    by 0x528FCE6: __xlator_init (xlator.c:472)
  ==25170==    by 0x528FE16: xlator_init (xlator.c:498)
  ==25170==    by 0x52DA8D6: glusterfs_graph_init (graph.c:321)
  ==25170==    by 0x52DB587: glusterfs_graph_activate (graph.c:695)
  ==25170==    by 0x5046407: glfs_process_volfp (glfs-mgmt.c:79)
  ==25170==    by 0x5043B9E: glfs_volumes_init (glfs.c:281)
  ==25170==    by 0x5044FEC: glfs_init_common (glfs.c:986)
  ==25170==    by 0x50451A7: glfs_init@@GFAPI_3.4.0 (glfs.c:1031)

By not calling dlclose(), the dynamically loaded .so is still available
upon program exit, and Valgrind is able to resolve the symbols. This
will add an additional leak, so dlclose() is called for normal builds,
but skipped when configuring with "./configure --enable-valgrind" or
passing the "run-with-valgrind" xlator option.

URL: http://valgrind.org/docs/manual/faq.html#faq.unhelpful
Change-Id: I2044e21b1b8fcce32ad1a817fdd795218f967731
BUG: 1425623
Signed-off-by: Niels de Vos &lt;ndevos@redhat.com&gt;
Reviewed-on: https://review.gluster.org/16809
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Samikshan Bairagya &lt;samikshan@gmail.com&gt;
Reviewed-by: Kaleb KEITHLEY &lt;kkeithle@redhat.com&gt;
</content>
</entry>
<entry>
<title>core: run many bricks within one glusterfsd process</title>
<updated>2017-01-31T00:13:58+00:00</updated>
<author>
<name>Jeff Darcy</name>
<email>jdarcy@redhat.com</email>
</author>
<published>2016-12-08T21:24:15+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=1a95fc3036db51b82b6a80952f0908bc2019d24a'/>
<id>1a95fc3036db51b82b6a80952f0908bc2019d24a</id>
<content type='text'>
This patch adds support for multiple brick translator stacks running
in a single brick server process.  This reduces our per-brick memory usage by
approximately 3x, and our appetite for TCP ports even more.  It also creates
potential to avoid process/thread thrashing, and to improve QoS by scheduling
more carefully across the bricks, but realizing that potential will require
further work.

Multiplexing is controlled by the "cluster.brick-multiplex" global option.  By
default it's off, and bricks are started in separate processes as before.  If
multiplexing is enabled, then *compatible* bricks (mostly those with the same
transport options) will be started in the same process.
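
For example, assuming the usual syntax for cluster-wide options, multiplexing
can be turned on (and back off) with:

    gluster volume set all cluster.brick-multiplex on
    gluster volume set all cluster.brick-multiplex off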

Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
BUG: 1385758
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Reviewed-on: https://review.gluster.org/14763
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>tier : Tier as a service</title>
<updated>2017-01-17T04:49:47+00:00</updated>
<author>
<name>hari gowtham</name>
<email>hgowtham@redhat.com</email>
</author>
<published>2016-07-12T11:10:28+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=3263d1c4f4b7efd1a018c17e1ba4dd9245094f48'/>
<id>3263d1c4f4b7efd1a018c17e1ba4dd9245094f48</id>
<content type='text'>
tierd is implemented by separating it from the rebalance process.

The commands affected:

1) Attach tier will trigger this process instead of the old one.
2) tier start and tier start force will also trigger this process.
3) volume status [tier] will show the tier daemon as a process instead
of a task, and normal tier status and tier detach status work.
4) tier stop is implemented.
5) detach tier is implemented separately along with a new detach tier
status.
6) volume tier volname status will work using the changes.
7) volume set works.
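
For example, with these changes the daemon-backed status commands look like
(illustrative; &lt;VOLNAME&gt; is a placeholder and the exact CLI wording may
differ):

    gluster volume tier &lt;VOLNAME&gt; status
    gluster volume tier &lt;VOLNAME&gt; detach status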

This patch has separated the tier translator from the legacy
DHT rebalance code. It now sends the RPCs from the CLI
to glusterd separately from the DHT rebalance code.
The daemon is now a service, similar to the snapshot daemon,
and can be viewed using the volume status command.

The code for the validation and commit phases is the same
as the earlier tier validation code in DHT rebalance.

The “brickop” phase has been changed so that the status
command can use this framework.

The service management framework is now used.
DHT rebalance does not use this framework.

This service framework takes care of:

*) spawning the daemon, killing it and other such processes.
*) volume set options, which are written to the volfile.
*) restart and reconfigure functions. Restart is used to restart
the daemon at two points:
        1) after gluster goes down and comes up.
        2) to stop detach tier.
*) reconfigure is used to make immediate volfile changes.
By doing this, we don’t restart the daemon.
It also has the code to rewrite the volfile for topological
changes (which comes into play during add and remove brick).

With this patch the log, pid, and volfile are separated
and put into their respective directories.

Change-Id: I3681d0d66894714b55aa02ca2a30ac000362a399
BUG: 1313838
Signed-off-by: hari gowtham &lt;hgowtham@redhat.com&gt;
Reviewed-on: http://review.gluster.org/13365
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Tested-by: hari gowtham &lt;hari.gowtham005@gmail.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Dan Lambright &lt;dlambrig@redhat.com&gt;
Reviewed-by: Atin Mukherjee &lt;amukherj@redhat.com&gt;
</content>
</entry>
<entry>
<title>ec: Invalidations in disperse volume should not update the stat</title>
<updated>2017-01-06T05:12:18+00:00</updated>
<author>
<name>Poornima G</name>
<email>pgurusid@redhat.com</email>
</author>
<published>2017-01-05T10:06:02+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=95d07a3d2d68805d93d36a447436e27c48777939'/>
<id>95d07a3d2d68805d93d36a447436e27c48777939</id>
<content type='text'>
Issue:
In a disperse volume, the file is spread across bricks, hence the stat
from one brick doesn't carry the valid size of the file. Therefore
an upcall from one brick updating the md-cache results in the wrong
size being cached.

Fix:
If the notification is a cache invalidation, then indicate to md-cache
that the attributes are invalid.

BUG: 1410375
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G &lt;pgurusid@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16329
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
</entry>
<entry>
<title>md-cache, afr: Reduce the window of stale read</title>
<updated>2016-10-20T07:07:55+00:00</updated>
<author>
<name>Poornima G</name>
<email>pgurusid@redhat.com</email>
</author>
<published>2016-09-04T02:57:47+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=8d8eded58cd5431a7000a70337444b828cb400d8'/>
<id>8d8eded58cd5431a7000a70337444b828cb400d8</id>
<content type='text'>
Problem:
Consider a replica setup, where one mount writes data to a
file and the other mount reads the file. In afr, read operations
are not transaction based: a brick (read subvolume) is chosen as
a part of lookup or other operations, and reads are always wound only
to that read subvolume, even if a write from a different client
failed on this brick. This stale read continues until there is
a lookup or a write operation from the mount point. Currently, this
is not a major issue, as a lookup is issued before every read and it will
switch the read subvolume to a correct one. But with the plan of
increasing the md-cache timeout to 600s, the stale read problem will be
more pronounced, i.e. a stale read can continue for 600s (or more if
cascaded with readdirp), as there will be no lookups.

Solution:
Afr doesn't have any built-in solution for stale reads (without affecting
performance). The solution that came up was to use upcall. When a file
on any brick is marked bad for the first time, upcall sends a notification
to all the clients that had recently accessed the file. The solution has
2 parts:
- Identifying when a file is marked bad, on any of the bricks,
  for the first time
- Client side actions on receiving the notifications

Identifying when a file is marked bad on any of the bricks for the first time:
-----------------------------------------------------------------------------
The idea is to track xattrop in upcall. xattrop currently comes with 2 afr
xattrs - the afr dirty bit and the afr pending xattrs.
   The dirty xattr is set to 1 before every write, and is unset if the write
succeeds. In certain scenarios, the dirty xattr can be 0 and the file could
still be a bad copy. Hence do not track the dirty xattr.
   The pending xattr is set on the good copy, indicating the other bricks that
have a bad copy. It is still not as simple as notifying when any of the pending
xattrs change. That could lead to a flood of notifications, in case the other
brick is completely down or consistently failing. Hence it is important to
notify only once, the first time a good copy is marked bad.

Client side actions on receiving the pending xattr change notification:
------------------------------------------------------------------------
md-cache will invalidate the cache of that file, so that a further lookup is
passed down to afr and hence updates the read subvolume. Invalidating only in
md-cache is not enough; consider the following order of operations:
- pending xattr invalidation - invalidate md-cache
- readdirp on the bad read subvolume - fill md-cache
- lookup (served from md-cache)
- read - wound to the old read subvol.
Hence, along with invalidating md-cache, it is very important to reset the
read subvolume for that file in afr.

Design Credit: Anuradha Talur, Ravishankar N

1. xattrop doesn't carry info saying post op/pre op.
2. Pre xattrop will have a 0 value for all pending xattrs;
   the cbk of pre xattrop carries the on-disk xattr value.
   Non-zero indicates healing is required.
3. Post xattrop will have a non-zero value for any of the
   pending xattrs, if the fop failed on any of the bricks.
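
Putting these three observations together, the notify-once decision boils
down to a pending-xattr transition from zero to non-zero. A minimal sketch
(illustrative only; the real upcall code and field names differ):

  /* on_disk_pending comes from the cbk of the pre xattrop (the value
   * already on disk); post_pending is the value being set by the post
   * xattrop for the same brick. */
  static int
  should_notify (int on_disk_pending, int post_pending)
  {
          /* Already marked bad earlier: clients were notified once,
           * do not flood them again. */
          if (on_disk_pending != 0)
                  return 0;

          /* First transition from good to bad: send the upcall. */
          if (post_pending != 0)
                  return 1;

          return 0;
  }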

Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
BUG: 1211863
Signed-off-by: Poornima G &lt;pgurusid@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15398
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>glusterfsd/main: fix OOM adjustment for older kernels</title>
<updated>2016-10-11T12:18:05+00:00</updated>
<author>
<name>Oleksandr Natalenko</name>
<email>onatalen@redhat.com</email>
</author>
<published>2016-09-28T12:29:23+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=de07155bfae3c5846797cbb19ee044751cbe6f6e'/>
<id>de07155bfae3c5846797cbb19ee044751cbe6f6e</id>
<content type='text'>
Milind Changire reported that GlusterFS fails to build on RHEL5
because linux/oom.h is unavailable.

Milind's initial patch disables OOM adjustment completely
for those environments that do not have this header. However,
I'd take another approach that:

1) checks for linux/oom.h at compile time and defines the necessary
constants if the header is not present;
2) checks for the available OOM API in /proc at run time and uses it
accordingly.

This allows the OOM score to be adjusted properly on RHEL5 (whose kernel is
not new enough to present the new /proc API for that) as well as on RHEL6
(whose kernel has many things backported, including the new /proc API).
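
A self-contained sketch of the two checks (the configure test, helper name
and HAVE_LINUX_OOM_H macro are assumptions; the fallback values are the
usual linux/oom.h constants):

  #include &lt;stdio.h&gt;
  #include &lt;unistd.h&gt;

  #ifdef HAVE_LINUX_OOM_H            /* 1) compile-time check */
  #include &lt;linux/oom.h&gt;
  #else                              /* header missing: define the constants */
  #define OOM_SCORE_ADJ_MIN (-1000)
  #define OOM_SCORE_ADJ_MAX 1000
  #define OOM_DISABLE (-17)
  #define OOM_ADJUST_MAX 15
  #endif

  /* 2) run-time check: prefer the new /proc API, fall back to the old one.
   * Note that the two files expect different ranges: oom_score_adj takes
   * -1000..1000, oom_adj takes -17..15. */
  static int
  set_oom_score_adj (const char *value)
  {
          const char *path = "/proc/self/oom_score_adj";
          FILE *fp = NULL;

          if (access (path, F_OK) != 0)
                  path = "/proc/self/oom_adj";   /* older kernels (RHEL5) */

          fp = fopen (path, "w");
          if (fp == NULL)
                  return -1;

          fprintf (fp, "%s\n", value);
          fclose (fp);
          return 0;
  }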

Change-Id: I1bc610586872d208430575c149a7d0c54bd82370
BUG: 1379769
Signed-off-by: Oleksandr Natalenko &lt;onatalen@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15587
Tested-by: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>dht, md-cache, upcall: Add invalidation of IATT when the layout changes</title>
<updated>2016-08-31T06:08:54+00:00</updated>
<author>
<name>Poornima G</name>
<email>pgurusid@redhat.com</email>
</author>
<published>2016-08-23T12:45:22+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=065a27948c4e0651f5bdac1703939adf34e5380e'/>
<id>065a27948c4e0651f5bdac1703939adf34e5380e</id>
<content type='text'>
Issue:
dht_layout is built as a part of lookup only. The layout can be
modified by the rebalance process. Since every IO fop is preceded
by a lookup, there are very few issues with stale layouts. But
with the enhancements for aggressive caching of stats in md-cache,
lookups will reduce and expose the stale layout issue more often.

Solution:
Since stale layout is already an issue in dht, there is already
a plan to fix this at the dht layer, but that fix is not currently
planned for any release. Until it comes out, we can have
a workaround where the upcall sends a notification to md-cache
when the layout xattr is changed. As a part of the layout change
notification the existing cache is invalidated and the next lookup
will fetch the latest layout.

This is not a foolproof solution, as there is still a window between the
layout change and the next lookup (after invalidation of the stat) where
the layout will be stale. But until the final fix comes in, this reduces
the stale layout window.
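
As an illustration of the client-side reaction (hypothetical helper and
struct names; the real md-cache/upcall code differs), the handler only has
to drop the cached stat when the changed xattr is the dht layout xattr:

  #include &lt;string.h&gt;

  #define DHT_LAYOUT_XATTR "trusted.glusterfs.dht"

  /* Simplified stand-in for what md-cache keeps per inode. */
  struct md_cache_entry {
          int iatt_valid;   /* non-zero while the cached stat is usable */
  };

  /* Called when an upcall reports that 'xattr_name' changed on the file:
   * invalidate the cached stat so the next lookup goes down to dht and
   * picks up the fresh layout. */
  static void
  handle_xattr_invalidation (struct md_cache_entry *mdc,
                             const char *xattr_name)
  {
          if (strcmp (xattr_name, DHT_LAYOUT_XATTR) == 0)
                  mdc-&gt;iatt_valid = 0;
  }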

Change-Id: Iacf871a38b35880c1fc0bc68fe7ce291265e71d4
BUG: 1369638
Signed-off-by: Poornima G &lt;pgurusid@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15300
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
</content>
</entry>
<entry>
<title>eventsapi: Fix disable-events issue</title>
<updated>2016-08-24T08:23:06+00:00</updated>
<author>
<name>Aravinda VK</name>
<email>avishwan@redhat.com</email>
</author>
<published>2016-08-18T09:21:44+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=d1aa35c3619847922e092b7dbfb201bceea8fa33'/>
<id>d1aa35c3619847922e092b7dbfb201bceea8fa33</id>
<content type='text'>
Events-related sources are not loaded into libglusterfs when
configure is run with the --disable-events option. Due to this,
every call to gf_event has to be guarded with the USE_EVENTS macro.

To prevent this, the USE_EVENTS macro was included in events.c
itself (Patch #15054).

Instead of disabling the build of the entire "events" directory, the code
is selectively disabled, so that the constants and an empty gf_event
function are still exposed. Code will not fail even if gf_event is called
when events are disabled.
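
The pattern, roughly (a sketch only, not the actual events.h contents; the
gf_event signature shown here is an assumption):

  /* Keep the gf_event symbol and the event constants visible even when
   * eventing is compiled out, so callers need no #ifdef guards. */
  #ifdef USE_EVENTS
  int gf_event (int event, const char *fmt, ...);
  #else
  static inline int
  gf_event (int event, const char *fmt, ...)
  {
          (void) event;
          (void) fmt;
          return 0;             /* no-op when events are disabled */
  }
  #endif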

BUG: 1368042
Change-Id: Ia6abfe9c1e46a7640c4d8ff5ccf0e9c30c87f928
Signed-off-by: Aravinda VK &lt;avishwan@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15198
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>eventsapi: Gluster Eventing Feature implementation</title>
<updated>2016-07-18T20:52:20+00:00</updated>
<author>
<name>Aravinda VK</name>
<email>avishwan@redhat.com</email>
</author>
<published>2016-05-05T13:04:41+00:00</published>
<link rel='alternate' type='text/html' href='http://git.gluster.org/cgit/glusterfs.git/commit/?id=5ed781ecf531b7916e51c174426e222dab717fb8'/>
<id>5ed781ecf531b7916e51c174426e222dab717fb8</id>
<content type='text'>
[Depends on http://review.gluster.org/14627]

The design is available in `glusterfs-specs`. A change from the design
is support for webhooks instead of Websockets, as discussed in the design:
http://review.gluster.org/13115

Since Websocket support depends on REST APIs, I will add Websocket support
once the REST APIs patch gets merged.

Usage:
Run the following command to start/stop the Eventsapi server on all peers,
which will collect the notifications from any Gluster daemon and emit them
to the configured client.

    gluster-eventsapi start|stop|restart|reload

Status of running services can be checked using,

    gluster-eventsapi status

The events listener is an HTTP(S) server which listens to events emitted by
Gluster. Create an HTTP server to listen for POST requests and register that
URL using,

    gluster-eventsapi webhook-add &lt;URL&gt; [--bearer-token &lt;TOKEN&gt;]

For example, if the HTTP server is running at `http://192.168.122.188:9000`,
then add that URL using,

    gluster-eventsapi webhook-add http://192.168.122.188:9000

If it expects a token, then specify it using `--bearer-token` or `-t`.

We can also test whether all peer nodes can send messages to the webhook
using,

    gluster-eventsapi webhook-test &lt;URL&gt; [--bearer-token &lt;TOKEN&gt;]

Configurations can be viewed/updated using,

    gluster-eventsapi config-get [--name]
    gluster-eventsapi config-set &lt;NAME&gt; &lt;VALUE&gt;
    gluster-eventsapi config-reset &lt;NAME|all&gt;

If any peer node was down during config-set/reset or webhook
modifications, run the sync command from a good node when the peer node
comes back. Automatic update is not yet implemented.

    gluster-eventsapi sync

A basic events client (HTTP server) is included with the code. Start
the client with the required port and start listening to the
events.

    /usr/share/glusterfs/scripts/eventsdash.py --port 8080

The default port is 9000 if no port is specified. Once the client is
running, configure gluster-eventsapi to send events to it.

The Eventsapi client can be outside of the cluster; it can even be run on
Windows. The only requirement is that the client URL should be accessible
by all peer nodes (or ngrok (https://ngrok.com)-like tools can be used).

Events implemented with this patch,
- Volume Create
- Volume Start
- Volume Stop
- Volume Delete
- Peer Attach
- Peer Detach

It is easy to add/support more events. Since that touches the Gluster cmd
code, to avoid merge conflicts I will add support for more events once this
patch merges.

BUG: 1334044
Change-Id: I316827ac9dd1443454df7deffe4f54835f7f6a08
Signed-off-by: Aravinda VK &lt;avishwan@redhat.com&gt;
Reviewed-on: http://review.gluster.org/14248
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
</entry>
</feed>
