glusterfs-snapshot.git/xlators, branch upstream

fuse: Check the return status from state->resolve_now

2013-11-15T01:47:59+00:00

Change-Id: I85fc6dd393449d365bb908b38c2827b58cb08171
BUG: 1030208
Signed-off-by: Vijaykumar M 
Reviewed-on: http://review.gluster.org/6262
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

gNFS: RFE for NFS connection behavior

2013-11-15T00:07:02+00:00

Implement reconfigure() for the NFS xlator so that volume set/reset
won't restart the NFS server process. However, a few options cannot be
reconfigured dynamically (e.g. nfs.mem-factor, nfs.port); changing
them still requires NFS to be restarted.

Change-Id: Ic586fd55b7933c0a3175708d8c41ed0475d74a1c
BUG: 1027409
Signed-off-by: Santosh Kumar Pradhan 
Reviewed-on: http://review.gluster.org/6236
Tested-by: Gluster Build System 
Reviewed-by: Rajesh Joseph 
Reviewed-by: Anand Avati

Transparent data encryption and metadata authentication

2013-11-13T23:12:49+00:00

.. in systems with non-trusted servers

This new functionality can be useful in various cloud technologies.
It is implemented via a special encryption (crypt) translator, which
works on the client side and performs encryption and authentication.

              1. Class of supported algorithms

The crypt translator can support any atomic symmetric block cipher
algorithm (one that requires padding the plain/cipher text before
performing the encryption/decryption transform; see the glossary in
atom.c for definitions). In particular, it can support algorithms with
the EOF issue (which require padding the end of the file with extra
data).

The crypt translator performs the translations
user -> (offset, size) -> (aligned-offset, padded-size) -> server
(and backward), and resolves individual FOPs (write(), truncate(),
etc.) to read-modify-write sequences.
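
As a rough illustration of the (offset, size) -> (aligned-offset,
padded-size) step, the following Python sketch widens a byte range to
atom boundaries. The function name align_request and its block_size
parameter are hypothetical; this is not code from the crypt translator
(which is written in C), and the 4096 default only mirrors the default
of the "block-size" option described in this message.

```python
from typing import Tuple


def align_request(offset: int, size: int, block_size: int = 4096) -> Tuple[int, int]:
    """Widen the byte range [offset, offset + size) to atom boundaries.

    Returns (aligned_offset, padded_size). Hypothetical helper used only
    to illustrate the arithmetic.
    """
    if offset < 0 or size < 0 or block_size <= 0:
        raise ValueError("offset and size must be >= 0, block_size > 0")
    # Round the start down to the nearest atom boundary.
    aligned_offset = offset - (offset % block_size)
    # Round the end up to the nearest atom boundary.
    end = offset + size
    aligned_end = -(-end // block_size) * block_size
    return aligned_offset, aligned_end - aligned_offset
```

A write that does not cover whole atoms therefore needs the enclosing
atoms to be read, modified and written back, which is why such FOPs
become read-modify-write sequences.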

A volume can contain files encrypted by different algorithms of the
mentioned class. To change some option value just reconfigure the
volume.

Currently only one algorithm is supported: AES_XTS.

Examples of algorithms that cannot be supported by the crypt
translator:

1. Asymmetric block cipher algorithms, which inflate data, e.g. RSA;
2. Symmetric block cipher algorithms with inline MACs for data
   authentication.

                   2. Implementation notes.

a) Atomic algorithms

Since any process in a stackable file system manipulates local data
(which can be made obsolete by the local data of another process), any
atomic cipher algorithm without proper support can lead to non-POSIX
behavior. To resolve such "collisions" we introduce locks: before
performing FOP->read(), FOP->write(), etc., the process must first
lock the file.

b) Algorithms with EOF issue

Such algorithms require padding the end of the file with some extra
data. Without proper support this would lose the information about the
real file size. Keeping track of the real file size is the
responsibility of the crypt translator. A special extended attribute
named "trusted.glusterfs.crypt.att.size" is used for this purpose. All
files contained in the bricks of an encrypted volume have "padded"
sizes.

                  3. Non-trusted servers and
                     Metadata authentication

We assume that the server on which the user's data is stored is
non-trusted. This means that the server can be subjected to various
attacks aimed at revealing the user's encrypted personal data. We
provide protection against such attacks.

Every encrypted file has specific private attributes (cipher algorithm
id, atom size, etc), which are packed to a string (so-called "format
string") and stored as a special extended attribute with the name
"trusted.glusterfs.crypt.att.cfmt". We protect the string from
tampering. This protection is mandatory, hardcoded and is always on.
Without such protection various attacks (based on extending the scope
of per-file secret keys) are possible.

Our authentication method has been developed in close collaboration
with the Red Hat security team and is implemented as "metadata loader
of version 1" (see file metadata.c). This method is NIST-compliant and
is based on checking 8-byte per-hardlink MACs, which are created (or
updated) by FOP->create(), FOP->link(), FOP->unlink() and
FOP->rename() from the following unique entities:

. file (hardlink) name;
. verified file's object id (gfid).

Before manipulating a file, we always check its MACs at FOP->open()
time. Some FOPs don't require a file to be opened (e.g.
FOP->truncate()). In such cases the crypt translator opens the file
itself, unconditionally.

                        4. Generating keys

Unique per-file keys are derived by NIST-compliant methods from:

a) the parent key;
b) the unique verified object id of the file (gfid).

The per-volume master key, provided by the user at mount time, is at
the root of this "tree of keys".

Those keys are used to:

1) encrypt/decrypt file data;
2) encrypt/decrypt file metadata;
3) create per-file and per-link MACs for metadata authentication.

                          5. Instructions
                 Getting started with crypt translator

Example:

1) Create a volume "myvol" and enable encryption:

   # gluster volume create myvol pepelac:/vols/xvol
   # gluster volume set myvol encryption on

2) Set location (absolute pathname) of your master key:

   # gluster volume set myvol encryption.master-key /home/me/mykey

3) Set other options to override default options, if needed.
   Start the volume.

4) On the client side make sure that the file /home/me/mykey exists
   and contains proper per-volume master key (that is 256-bit AES
   key). This key has to be in hex form, i.e. should be represented
   by 64 symbols from the set  {'0', ..., '9', 'a', ..., 'f'}.
   The key should start at the beginning of the file. All symbols at
   offsets >= 64 are ignored.

5) Mount the volume "myvol" on the client side:

   # glusterfs --volfile-server=pepelac --volfile-id=myvol /mnt

   After a successful mount, the file which contains the master key may
   be removed. NOTE: Keeping the master key between mount sessions is
   the user's responsibility.
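
The key-file format from step 4 can be sketched in Python as follows.
The helper names parse_master_key and generate_master_key_hex are
hypothetical and are not part of GlusterFS. The check is deliberately
strict about lowercase hex digits because that is the set listed
above; whether the translator itself accepts uppercase digits is not
stated here.

```python
import secrets

KEY_HEX_LEN = 64                      # 256-bit key written as hex symbols
HEX_DIGITS = set("0123456789abcdef")  # the symbol set listed in step 4


def parse_master_key(contents: str) -> bytes:
    """Return the 32-byte key stored at the start of a key file.

    Only the first 64 symbols are examined; everything at offsets >= 64
    is ignored, as described in step 4.
    """
    key_hex = contents[:KEY_HEX_LEN]
    if len(key_hex) < KEY_HEX_LEN:
        raise ValueError("master key must consist of 64 hex symbols")
    if not set(key_hex) <= HEX_DIGITS:
        raise ValueError("master key may only contain 0-9 and a-f")
    return bytes.fromhex(key_hex)


def generate_master_key_hex() -> str:
    """Create a random 256-bit key in the expected 64-symbol hex form."""
    return secrets.token_hex(32)
```

For example, writing the string returned by generate_master_key_hex()
to /home/me/mykey would produce a file that passes parse_master_key().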

**********************************************************************

WARNING! Losing the master key will make the content of all regular
files inaccessible. Mounting with an incorrect master key still allows
access to the content of directories: file names are not encrypted.

**********************************************************************

               6. Options of crypt translator

1) "master-key": specifies location (absolute pathname) of the file
   which contains per-volume master key. There is no default location
   for master key.

2) "data-key-size": specifies size of per-file key for data encryption.
   Possible values:
   . "256" default value
   . "512"

3) "block-size": specifies atom size. Possible values:
   . "512"
   . "1024"
   . "2048"
   . "4096" default value;
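
A minimal Python sketch of how the three options and their defaults
fit together is shown below. The function resolve_crypt_options and
the two tables are hypothetical and only restate the values listed
above; treating "master-key" as required is an assumption drawn from
the statement that it has no default location.

```python
import posixpath
from typing import Dict

# Values copied from the option list above.
ALLOWED_VALUES = {
    "data-key-size": ("256", "512"),
    "block-size": ("512", "1024", "2048", "4096"),
}
DEFAULTS = {"data-key-size": "256", "block-size": "4096"}


def resolve_crypt_options(user_options: Dict[str, str]) -> Dict[str, str]:
    """Merge user-supplied options with the documented defaults."""
    unknown = set(user_options) - set(ALLOWED_VALUES) - {"master-key"}
    if unknown:
        raise ValueError("unknown option(s): %s" % ", ".join(sorted(unknown)))
    # Assumption: master-key is mandatory because it has no default.
    master_key = user_options.get("master-key")
    if master_key is None or not posixpath.isabs(master_key):
        raise ValueError("master-key must be an absolute pathname")
    resolved = dict(DEFAULTS)
    resolved["master-key"] = master_key
    for name, allowed in ALLOWED_VALUES.items():
        if name in user_options:
            if user_options[name] not in allowed:
                raise ValueError("invalid value for %s" % name)
            resolved[name] = user_options[name]
    return resolved
```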

                       7. Test cases

Any workload, which involves the following file operations:

->create();
->open();
->readv();
->writev();
->truncate();
->ftruncate();
->link();
->unlink();
->rename();
->readdirp().

                        8. TODOs:

1) Currently the size of I/Os issued by the crypt translator is
   restricted by block_size (4K by default). We can use larger I/Os to
   improve performance.

Change-Id: I2601fe95c5c4dc5b22308a53d0cbdc071d5e5cee
BUG: 1030058
Signed-off-by: Edward Shishkin 
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/4667
Tested-by: Gluster Build System

cluster/dht - rebalance: handle the rebalance @ inode level (!fd level)

2013-11-13T19:45:18+00:00

* migrate all the fd's on an inode to newer subvol after rebalance
* use the migration in progress flag in inode, so all the operations
  on the inode can make use of it

Change-Id: Ib807a46e927a1062688fc15119c916797c52a350
BUG: 1013456
Signed-off-by: Amar Tumballi 
Reviewed-on: http://review.gluster.org/5891
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

libglusterfs/inode: introduce new APIs for ctx handling

2013-11-13T19:43:48+00:00

* inode_ctx_reset{0,1,2}() for resetting value1, value2, and both, respectively
* inode_ctx_get0() - to get the first value only
* inode_ctx_set0() - to set the first value only
* inode_ctx_get1() - to get the second value only
* inode_ctx_set1() - to set the second value only
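
The intent of the new helpers can be pictured with a toy two-slot
model in Python. The class InodeCtxModel is purely illustrative and
hypothetical: the real API is C and operates on an inode and a
translator, and its signatures and return conventions are not
reproduced here. The numbering of the reset methods follows the list
above.

```python
class InodeCtxModel:
    """Toy model of a context that holds two independent values."""

    def __init__(self):
        self.value1 = None
        self.value2 = None

    def set0(self, value):
        """Set the first value only."""
        self.value1 = value

    def get0(self):
        """Get the first value only."""
        return self.value1

    def set1(self, value):
        """Set the second value only."""
        self.value2 = value

    def get1(self):
        """Get the second value only."""
        return self.value2

    def reset0(self):
        """Clear value1 and return what it held; value2 is untouched."""
        old, self.value1 = self.value1, None
        return old

    def reset1(self):
        """Clear value2 and return what it held; value1 is untouched."""
        old, self.value2 = self.value2, None
        return old

    def reset2(self):
        """Clear both values and return them as a pair."""
        old = (self.value1, self.value2)
        self.value1 = self.value2 = None
        return old
```

The point of the split is that a caller interested in only one of the
two values no longer has to read or rewrite the other.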

Change-Id: I4dfbdac81d6a3f4e5784e060c76edabb1692ce03
Signed-off-by: Amar Tumballi 
Reviewed-on: http://review.gluster.org/5890
Reviewed-by: Anand Avati 
Tested-by: Anand Avati

bd: Add support to create clone, snapshot and merge of LV images.

2013-11-13T19:39:22+00:00

The special xattr names "clone" and "snapshot" can be used to create
full and linked clones of LV images. The GFID of the destination posix
file (to be mapped) is passed as the value of the xattr. The destination
posix file must exist before running this operation.

These operations form a basis for offloading storage related operations
from QEMU to GlusterFS.

Syntax for full clone: xattr name: "clone" value: "gfid-of-dest-file"
Syntax for linked clone: xattr name: "snapshot" value: "gfid-of-dest-file"
Syntax for merging: xattr name: "merge" value: "path-to-snapshot-file"

Example:
	setfattr -n clone -v "gfid-of-dest-file" /media/source
	setfattr -n snapshot -v "gfid-of-dest-file" /media/source
	setfattr -n merge -v "/media/sn" /media/sn
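
The three syntaxes above can be assembled programmatically, as in this
Python sketch. The helper bd_offload_argv is hypothetical. It assumes
that the GFID is given in the canonical textual UUID form, which this
message does not spell out, and it only builds the argument list
without running setfattr.

```python
import uuid
from typing import List


def bd_offload_argv(operation: str, value: str, source_path: str) -> List[str]:
    """Build a setfattr command line for clone, snapshot or merge."""
    if operation in ("clone", "snapshot"):
        # Assumption: the destination GFID is a textual UUID.
        # uuid.UUID raises ValueError for anything else.
        uuid.UUID(value)
    elif operation == "merge":
        # For merge the value is the path to the snapshot file.
        if not value:
            raise ValueError("merge needs the path to the snapshot file")
    else:
        raise ValueError("unsupported operation: %s" % operation)
    return ["setfattr", "-n", operation, "-v", value, source_path]
```

The resulting list could be handed to subprocess.run() on a client
where the volume is mounted.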

Change-Id: Id9f984a709d4c2e52a64ae75bb12a8ecb01f8776
BUG: 1028672
Signed-off-by: M. Mohan Kumar 
Reviewed-on: http://review.gluster.org/5626
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

bd: Add aio support to BD xlator

2013-11-13T19:39:11+00:00

The volume option bd-aio controls the AIO feature of the BD xlator. The
code is taken from posix-aio.c.

Change-Id: Ib049bd59c9d3f9101d33939838322cfa808de053
BUG: 1028672
Signed-off-by: M. Mohan Kumar 
Reviewed-on: http://review.gluster.org/5748
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

bd: Add BD support to other xlators

2013-11-13T19:38:55+00:00

Make changes to the distributed xlator to work with the BD xlator.
Unlike files, a block device can't be removed while it is open, so some
parts of the code were moved down to avoid this situation. Also, before
truncating a BD file its BD_XATTR must be set; otherwise the truncate
ends up truncating the posix file. So the file is created with the
needed BD_XATTR and then truncate is invoked. This change also enables
the BD xlator in the stripe volume type.

Change-Id: If127516e261fac5fc5b137e7fe33e100bc92acc0
BUG: 1028672
Signed-off-by: M. Mohan Kumar 
Reviewed-on: http://review.gluster.org/5235
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

bd: posix/multi-brick support to BD xlator

2013-11-13T19:38:42+00:00

Current BD xlator (block backend) has a few limitations such as
* Creation of directories not supported
* Supports only single brick
* Does not use extended attributes (and client gfid) like posix xlator
* Creation of special files (symbolic links, device nodes etc) not
  supported

The basic limitation of not allowing directory creation blocks
oVirt/VDSM from consuming the BD xlator as part of a Gluster domain,
since VDSM creates multi-level directories when GlusterFS is used as
the storage backend for storing VM images.

To overcome these limitations, a new BD xlator with the following
improvements is suggested.

* New hybrid BD xlator that handles both regular files and block device
  files
* The volume will have both POSIX and BD bricks. Regular files are
  created on POSIX bricks, block devices are created on the BD brick (VG)
* BD xlator leverages the existing POSIX xlator for most POSIX calls and
  hence sits above the POSIX xlator
* Block device file is differentiated from regular file by an extended
  attribute
* The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a
  posix file to Logical Volume (LV).
* When a client sends a request to set BD_XATTR on a posix file, a new
  LV is created and mapped to posix file. So every block device will
  have a representative file in POSIX brick with 'user.glusterfs.bd'
  (BD_XATTR) set.
* Hereafter, all operations on this file result in LV-related
  operations.

For example, opening a file that has BD_XATTR set results in opening
the LV block device, and reading it results in reading the
corresponding LV block device.

When the BD xlator gets a request to set BD_XATTR via a setxattr call,
it creates an LV and places information about this LV in the xattr of
the posix file. The xattr "user.glusterfs.bd" is used to identify that
a posix file is mapped to a BD.

Usage:
Server side:
[root@host1 ~]# gluster volume create bdvol host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2
It creates a distributed gluster volume 'bdvol' with Volume Group vg1
using posix brick /storage/vg1_info in host1 and Volume Group vg2 using
/storage/vg2_info in host2.

[root@host1 ~]# gluster volume start bdvol
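
The brick notation used in the create command, host:/path?vg, can be
taken apart as in the following Python sketch. BrickSpec and
parse_brick are hypothetical names; this is not how the gluster CLI
parses its arguments, only an illustration of the notation.

```python
from typing import NamedTuple, Optional


class BrickSpec(NamedTuple):
    host: str
    path: str
    vg: Optional[str]  # None for a plain posix brick without "?vg"


def parse_brick(spec: str) -> BrickSpec:
    """Split 'host:/path?vg' into host, brick path and volume group."""
    host, colon, rest = spec.partition(":")
    if not colon or not host or not rest.startswith("/"):
        raise ValueError("expected host:/absolute/path[?vg]")
    path, question, vg = rest.partition("?")
    if question and not vg:
        raise ValueError("'?' must be followed by a volume group name")
    return BrickSpec(host, path, vg or None)
```

With the command above, parse_brick("host1:/storage/vg1_info?vg1")
yields host "host1", posix brick path "/storage/vg1_info" and volume
group "vg1".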

Client side:
[root@node ~]# mount -t glusterfs host1:/bdvol /media
[root@node ~]# touch /media/posix
It creates regular posix file 'posix' in either host1:/vg1 or host2:/vg2 brick
[root@node ~]# mkdir /media/image
[root@node ~]# touch /media/image/lv1
It also creates regular posix file 'lv1' in either host1:/vg1 or
host2:/vg2 brick
[root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /media/image/lv1
[root@node ~]#
Above setxattr results in creating a new LV in corresponding brick's VG
and it sets 'user.glusterfs.bd' with value 'lv: --deltag
>

Changes from previous version V5:
* Removed support for delayed deleting of LVs

Changes from previous version V4:
* Consolidated the patches
* Removed usage of BD_XATTR_SIZE and consolidated it in BD_XATTR.

Changes from previous version V3:
* Added support in FUSE to support full/linked clone
* Added support to merge snapshots and provide information about origin
* bd_map xlator removed
* iatt structure used in inode_ctx. iatt is cached and updated during
  fsync/flush
* aio support
* Type and capabilities of volume are exported through getxattr

Changes from version 2:
* Used inode_context for caching BD size and to check if loc/fd is BD or
  not.
* Added GlusterFS server offloaded copy and snapshot through setfattr
  FOP. As part of this libgfapi is modified.
* BD xlator supports stripe
* During unlinking, if an LV file is already open, it is added to a
  delete list, and bd_del_thread tries to delete it from this list when
  the last reference to that file is closed.

Changes from previous version:
* gfid is used as name of LV
* ? is used to specify VG name for creating BD volume in volume
  create, add-brick. gluster volume create volname host:/path?vg
* open-behind issue is fixed
* A replicate brick can be added dynamically and LVs from source brick
  are replicated to destination brick
* A distribute brick can be added dynamically and rebalance operation
  distributes existing LVs/files to the new brick
* Thin provisioning support added.
* bd_map xlator support retained
* setfattr -n user.glusterfs.bd -v "lv" creates a regular LV and
  setfattr -n user.glusterfs.bd -v "thin" creates thin LV
* Capability and backend information added to gluster volume info (and
  --xml) so that management tools can exploit BD xlator.
* tracing support for bd xlator added

TODO:
* Add support to display snapshots for a given LV
* Display posix filename for list-origin instead of gfid

Change-Id: I00d32dfbab3b7c806e0841515c86c3aa519332f2
BUG: 1028672
Signed-off-by: M. Mohan Kumar 
Reviewed-on: http://review.gluster.org/4809
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

bd_map: Remove bd_map xlator

2013-11-13T19:38:28+00:00

Remove bd_map xlator and CLI related changes.

Change-Id: If7086205df1907127c1a1fa4ba603f1c48421d09
BUG: 1028672
Signed-off-by: M. Mohan Kumar 
Reviewed-on: http://review.gluster.org/5747
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati