glusterfs.git/rpc/xdr/src/glusterfs-fops.x, branch v7dev

copy_file_range support in GlusterFS

2018-12-12T15:56:55+00:00

    * libglusterfs changes to add new fop

    * Fuse changes:
      - Changes in fuse bridge xlator to receive and send responses

    * posix changes to perform the op on the backend filesystem

    * protocol and rpc changes for sending and receiving the fop

    * gfapi changes for performing the fop

    * tools: glfs-copy-file-range tool for testing copy_file_range fop

      - Although copy_file_range support has been added to the upstream
        fuse kernel module, no kernel release containing that support
        has been made yet. It is expected to come in the upcoming
        release of linux-4.20.

        So, as of now, executing the copy_file_range fop on a fuse based
        filesystem results in the fuse kernel module sending a read on
        the source fd and a write on the destination fd.

        Therefore a small gfapi based tool has been written to be able
        to test the copy_file_range fop. This tool is similar (in
        functionality) to the example program given in the
        copy_file_range man page.

        So, running regular copy_file_range on a fuse mount point and
        running the gfapi based glfs-copy-file-range tool gives some idea
        of how fast copy_file_range (or reflink) can be.

        On the local machine this was the result obtained.

        mount -t glusterfs workstation:new /mnt/glusterfs
        [root@workstation ~]# cd /mnt/glusterfs/
        [root@workstation glusterfs]# ls
        file
        [root@workstation glusterfs]# cd
        [root@workstation ~]# time /tmp/a.out /mnt/glusterfs/file /mnt/glusterfs/new
        real  0m6.495s
        user  0m0.000s
        sys   0m1.439s
        [root@workstation ~]# time glfs-copy-file-range $(hostname) new /tmp/glfs.log /file /rrr
        OPEN_SRC: opening /file is success
        OPEN_DST: opening /rrr is success
        FSTAT_SRC: fstat on /rrr is success
        copy_file_range successful

        real  0m0.309s
        user  0m0.039s
        sys   0m0.017s

        This tool needs the following arguments:
         1) hostname
         2) volume name
         3) log file path
         4) source file path (relative to the gluster volume root)
         5) destination file path (relative to the gluster volume root)

        "glfs-copy-file-range <hostname> <volume name> <log file path> <source file path> <destination file path>"

      - Added a testcase as well to run glfs-copy-file-range tool

    * io-stats changes to capture the fop for profiling

    * NOTE:

      - Added a conditional check to see whether the copy_file_range
        syscall is available or not. If not, then return ENOSYS.

      - Added a conditional check for the kernel minor version in
        fuse_kernel.h and fuse-bridge wherever copy_file_range is
        referred to. The kernel minor version is kept as it is, i.e. 24.
        Increment it in future when there is a kernel release which
        contains support for the copy_file_range fop in the fuse kernel
        module.

    * The document which contains a writeup on this enhancement can be found at
      https://docs.google.com/document/d/1BSILbXr_knynNwxSyyu503JoTz5QFM_4suNIh2WwrSc/edit
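
The availability check and the copy itself can be sketched roughly as follows. This is a minimal illustration, not the code from this patch: the function names try_copy_file_range and copy_range_all are hypothetical, and testing for SYS_copy_file_range from <sys/syscall.h> is an assumption about how such a conditional check could be written.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: issue copy_file_range(2) if the build headers
 * know the syscall number, otherwise fail with ENOSYS.
 */
ssize_t
try_copy_file_range(int fd_in, int fd_out, size_t len)
{
#ifdef SYS_copy_file_range
    /* NULL offsets: use and advance each descriptor's own file offset. */
    return syscall(SYS_copy_file_range, fd_in, NULL, fd_out, NULL, len, 0U);
#else
    (void)fd_in;
    (void)fd_out;
    (void)len;
    errno = ENOSYS;
    return -1;
#endif
}

/* Copy up to len bytes; returns bytes copied, or -1 with errno set. */
ssize_t
copy_range_all(int fd_in, int fd_out, size_t len)
{
    size_t copied = 0;

    while (copied < len) {
        ssize_t ret = try_copy_file_range(fd_in, fd_out, len - copied);
        if (ret < 0) {
            if (errno == EINTR)
                continue; /* interrupted, retry */
            return -1;
        }
        if (ret == 0)
            break; /* source reached end of file */
        copied += (size_t)ret;
    }
    return (ssize_t)copied;
}
```

A caller would open the source and destination files, fstat the source to get its size, and pass that size as len. A return of -1 with errno set to ENOSYS signals that the syscall is not available.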

Change-Id: I280069c814dd21ce6ec3be00a884fc24ab692367
updates: #536
Signed-off-by: Raghavendra Bhat

libglusterfs: Move devel headers under glusterfs directory

2018-12-05T21:47:04+00:00

libglusterfs devel package headers are referenced in code using the
include semantics for a program's own headers. While this works, it
can be better, especially when dealing with out of tree xlator builds
or, in general, out of tree devel package usage.

Towards this, the following changes are done:
- Moved all devel headers under a glusterfs directory
- Included these headers using system header notation <> in all
  code outside of libglusterfs
- Included these headers using own program notation "" within
  libglusterfs

This change, although big, just moves the headers around and makes
including them from other sources correct.

This helps us correctly include libglusterfs includes without
namespace conflicts.

Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b
Updates: bz#1193929
Signed-off-by: ShyamsundarR

dict: add another type to handle backward compatibility

2018-01-17T03:53:37+00:00

This new type helps to avoid excessive logs. It should be
set only in case of
 * volume graph building (graph.y)
 * dict unserialize
   (happens once a dictionary is received on wire in old protocol)

All other dict set and get calls should have proper checks and log a
warning if there is a mismatch.
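
The intent can be sketched as below. This is a minimal illustration only: the enum values, the struct and the function names are hypothetical and are not the identifiers used in libglusterfs.

```c
#include <stdio.h>

/* Hypothetical type tags; the real enum in libglusterfs may differ. */
typedef enum {
    VAL_TYPE_UNKNOWN = 0,
    VAL_TYPE_STR_COMPAT, /* set by graph building / old-protocol unserialize */
    VAL_TYPE_INT,
    VAL_TYPE_STR
} val_type_t;

typedef struct {
    val_type_t type;
    const char *data;
} typed_val_t;

/* Return 1 if a getter expecting `want` may read `val` without a warning. */
int
type_matches(const typed_val_t *val, val_type_t want)
{
    if (val->type == want)
        return 1;
    /* Compat values carry no reliable type, so accept them silently. */
    if (val->type == VAL_TYPE_STR_COMPAT)
        return 1;
    return 0;
}

/* Getter sketch: logs on a mismatch and returns 1 if it warned, else 0. */
int
get_checked(const typed_val_t *val, val_type_t want, const char **out)
{
    int warned = 0;

    if (!type_matches(val, want)) {
        fprintf(stderr, "dict: type mismatch (have %d, want %d)\n",
                (int)val->type, (int)want);
        warned = 1;
    }
    *out = val->data;
    return warned;
}
```

With this shape, values that came from a volfile or from the old wire protocol never trigger the warning, while a typed value read through the wrong getter does.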

updates #220

Change-Id: I1cccb304a877aa80c07aaac95f10f5005e35b9c5
Signed-off-by: Amar Tumballi

dict: add more types for values

2018-01-05T09:35:07+00:00

Added 2 more types which are present in gluster codebase, mainly
IATT and UUID.

Updates #203

Change-Id: Ib6d6d6aefb88c3494fbf93dcbe08d9979484968f
Signed-off-by: Amar Tumballi

dict: support better on-wire transfer

2017-12-27T05:20:30+00:00

This patch brings data type awareness to the dictionary,
and also makes sure valid data is properly sent to the
other side of the wire using XDR.

The next step is to allow people to add more data types
(for example, Bool, UUID, iatt etc), and then make
it part of every fop signature on the wire.

Fixes #203

Change-Id: Ie0eee2db847bea2bf7dad80dec89ce3e7c5917c1
Signed-off-by: Amar Tumballi

rio/everywhere: add icreate/namelink fop

2017-12-05T21:23:57+00:00

icreate creates an inode, while namelink links the basename to its
parent gfid.

For now mkdir is the primary user of these fops. Better distribution is
achieved by creating the inode on (say) mds1 and linking the basename to
its parent gfid on mds2. The inode serves readdirp, stat etc.

More details about the fops are present at:
https://review.gluster.org/#/c/13395/3/design/DHT2/DHT2_Icreate_Namelink_Notes.md

This is a backport of three patches from the experimental branch.
1- https://review.gluster.org/#/c/18085/
2- https://review.gluster.org/#/c/18086/
3- https://review.gluster.org/#/c/18094/

Updates gluster/glusterfs#243
Change-Id: I1bd3d5a441a3cfab1acfeb52f15c6c867d362592
Signed-off-by: Susant Palai

libglusterfs: Add put fop

2017-12-05T14:21:01+00:00

Problem: Implementing a put fop in gluster has been a long-standing
request. The put fop in gluster may not have the exact semantics
of HTTP PUT, but can easily be extended to do so. Subsequent
patches will contain more semantics on the put fop and its
guarantees.

Why compound fop framework is not used for put?
Compound fop framework currently doesn't allow compounding of
entry fop and inode fops, i.e. fops on multiple inodes cannot be
combined in compound fop.

Updates #353
Change-Id: Idb7891b3e056d46d570bb7e31bad1b6a28656ada
Signed-off-by: Poornima G

glusterfs: Not able to mount running volume after enable brick mux and stopped any volume

2017-05-31T20:43:53+00:00

Problem: After enabling brick mux, if any volume goes down and a mount
         of a still-running volume is then attempted, the mount command
         hangs.

Solution: With brick mux enabled, the server shares one data structure,
          server_conf, among all associated subvolumes. When any
          subvolume goes down in some ungraceful manner (e.g. its brick
          directory is removed), the posix xlator sends a
          GF_EVENT_CHILD_DOWN event to its parent xlators, and server
          notify sets child_up to false in server_conf. When a client
          then tries to communicate with the server through a mount,
          conf->child_up is checked and found to be FALSE, so the
          message "translator are not yet ready" is thrown.
          This patch updates the server_conf structure to save the
          child_up status per xlator. Another important correction in
          this patch is cleaning up the threads of server side xlators
          after the volume is stopped.
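
The per-xlator status idea can be sketched as follows. This is a minimal illustration under assumptions: the struct layout, the fixed-size array and the function names are hypothetical and do not reflect the actual server_conf definition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 8

/* Hypothetical per-child status entry. */
typedef struct {
    const char *name;
    bool child_up;
} child_status_t;

/* Hypothetical stand-in for the shared server configuration. */
typedef struct {
    child_status_t children[MAX_CHILDREN];
    size_t count;
} server_conf_sketch_t;

/* Record up/down for one named subvolume only; -1 if the name is unknown. */
int
set_child_status(server_conf_sketch_t *conf, const char *name, bool up)
{
    for (size_t i = 0; i < conf->count; i++) {
        if (strcmp(conf->children[i].name, name) == 0) {
            conf->children[i].child_up = up;
            return 0;
        }
    }
    return -1;
}

/* A client asking for `name` is refused only if that subvolume is down. */
bool
child_is_up(const server_conf_sketch_t *conf, const char *name)
{
    for (size_t i = 0; i < conf->count; i++) {
        if (strcmp(conf->children[i].name, name) == 0)
            return conf->children[i].child_up;
    }
    return false;
}
```

With a single shared flag, one subvolume going down would make every mount see FALSE. With one entry per subvolume, a mount of a volume that is still running is unaffected.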

BUG: 1453977
Change-Id: Ic54da3f01881b7c9429ce92cc569236eb1d43e0d
Signed-off-by: Mohit Agrawal 
Reviewed-on: https://review.gluster.org/17356
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Raghavendra Talur 
CentOS-regression: Gluster Build System 
Reviewed-by: Jeff Darcy

Halo Replication feature for AFR translator

2017-05-02T10:23:53+00:00

Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10 seconds), which
will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.
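
How a latency threshold and a replica floor interact can be sketched as below. This is a minimal illustration, not the AFR implementation: the function name halo_select, the use of a negative latency to mean "unreachable", and treating a latency equal to the threshold as connected are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * latency_ms[i] < 0 means child i is unreachable (assumption).
 * Sets selected[i] = true for each child treated as connected and
 * returns the number of selected children.
 */
size_t
halo_select(const long *latency_ms, size_t n, long halo_latency_ms,
            size_t min_replicas, bool *selected)
{
    size_t picked = 0;

    /* First pass: every reachable child within the halo is connected. */
    for (size_t i = 0; i < n; i++) {
        selected[i] = latency_ms[i] >= 0 && latency_ms[i] <= halo_latency_ms;
        if (selected[i])
            picked++;
    }

    /* Below the floor: swap in the next-best reachable child by latency. */
    while (picked < min_replicas) {
        size_t best = n;

        for (size_t i = 0; i < n; i++) {
            if (selected[i] || latency_ms[i] < 0)
                continue;
            if (best == n || latency_ms[i] < latency_ms[best])
                best = i;
        }
        if (best == n)
            break; /* no reachable children left */
        selected[best] = true;
        picked++;
    }
    return picked;
}
```

For example, with latencies of 2, 40 and 90 ms, a halo of 5 ms and a floor of 2 replicas, the 2 ms child is inside the halo and the 40 ms child is swapped in to satisfy the floor.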

New FUSE mount options:
  halo-latency & halo-min-replicas: As described above.

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is guaranteed for synchronous clients; this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients, on the other hand, are merely consistent to their region.
  Writes & deletes will be protected via entry-locks as usual, preventing
  concurrent writes into files which are undergoing replication.  Read operations,
  on the other hand, should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The takeaway from this is that care should be taken to ensure
  multiple writers do not write the same files, resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &
  size).  The recommended method for preventing this is to use the nfs-auth
  feature to define which region has RW permissions for each share; tiers not
  in the origin region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling & better dent type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to.

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer & valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
BUG: 1428061
Signed-off-by: Kevin Vigor 
Reviewed-on: http://review.gluster.org/16099
Reviewed-on: https://review.gluster.org/16177
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

afr,dht,ec: Replace GF_EVENT_CHILD_MODIFIED with event SOME_DESCENDENT_DOWN/UP

2016-11-21T09:32:05+00:00

Currently there are a few events related to child_up/down:
GF_EVENT_CHILD_UP : Issued when any of the protocol clients
connects.
GF_EVENT_CHILD_MODIFIED : Issued by afr/dht/ec
GF_EVENT_CHILD_DOWN : Issued when any of the protocol clients
disconnects.
These events get modified at the dht/afr/ec layers. Here is a
brief on the same.

DHT:
- Once all the subvolumes have reported and at least one child has
  come up, GF_EVENT_CHILD_UP is issued
- On connect, GF_EVENT_CHILD_UP is issued
- On disconnect, GF_EVENT_CHILD_MODIFIED is issued
- When all the subvolumes have disconnected, GF_EVENT_CHILD_DOWN is issued

AFR:
- First subvolume came up, then GF_EVENT_CHILD_UP is issued
- Subsequent subvolumes coming up, results in GF_EVENT_CHILD_MODIFIED
- Any of the subvolumes goes down, then GF_EVENT_SOME_CHILD_DOWN is issued
- Last up subvolume goes down, then GF_EVENT_CHILD_DOWN is issued

Until the patch [1] introduced GF_EVENT_SOME_CHILD_UP,
GF_EVENT_CHILD_MODIFIED was issued by afr/dht when any of the subvolumes
went up or down.

Now with md-cache changes, there is a necessity to differentiate between
child up and down. Hence, introducing GF_EVENT_SOME_DESCENDENT_DOWN/UP and
getting rid of GF_EVENT_CHILD_MODIFIED.
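
An AFR-style aggregation with the new events can be sketched as below. This is a minimal illustration that follows the AFR rules listed above, with the modified event replaced by distinct up and down events. The enum, its value names and the function aggregate_event are hypothetical, and the real notify logic is more involved.

```c
/* Hypothetical event names mirroring the ones in this commit message. */
typedef enum {
    EV_NONE = 0,
    EV_CHILD_UP,
    EV_CHILD_DOWN,
    EV_SOME_DESCENDENT_UP,
    EV_SOME_DESCENDENT_DOWN
} ev_t;

/*
 * up_before: number of subvolumes that were up before the transition.
 * went_up:   non-zero if one subvolume just came up, zero if one went down.
 * Returns the event to propagate to the parent xlator.
 */
ev_t
aggregate_event(int up_before, int went_up)
{
    if (went_up) {
        /* First subvolume up is a full CHILD_UP; later ones are partial. */
        return up_before == 0 ? EV_CHILD_UP : EV_SOME_DESCENDENT_UP;
    }
    if (up_before <= 0)
        return EV_NONE; /* nothing was up, so there is nothing to report */
    /* Last up subvolume going down is a full CHILD_DOWN. */
    return up_before == 1 ? EV_CHILD_DOWN : EV_SOME_DESCENDENT_DOWN;
}
```

A consumer such as md-cache can then react differently to EV_SOME_DESCENDENT_UP and EV_SOME_DESCENDENT_DOWN, which a single modified event did not allow.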

[1] http://review.gluster.org/12573

Change-Id: I704140b6598f7ec705493251d2dbc4191c965a58
BUG: 1396038
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15764
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
Reviewed-by: N Balachandran 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Rajesh Joseph