glusterfs.git/xlators/cluster/dht/src/dht-helper.c, branch v3.7.0beta2

gluster/dht: tiered volumes may not allow access to files undergoing migration

2015-05-08T08:55:41+00:00

This is a backport of fix 10324 to Gluster 3.7.

If a read IO occurs against a file that has reached rebalance
phase 2, we redirect the IO to the destination. For tiered
volumes, when we try to reopen the file (on the destination),
the lower level DHT receives the open call and fails because it
does not have a "cached subvol". The fix is to "teach" the lower
level DHT the new location by sending a locate before the open.

> http://review.gluster.org/#/c/10324/
> Change-Id: Ia4acb0035ff1da15f6a8f9ed54f43c76e8b98f5f
> BUG: 1214048
> Signed-off-by: Dan Lambright 
> Signed-off-by: root 
> Signed-off-by: Dan Lambright 
> Reviewed-on: http://review.gluster.org/10324
> Tested-by: NetBSD Build System
> Tested-by: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Tested-by: Raghavendra G 
> Signed-off-by: Dan Lambright 

Change-Id: Ia4acb0035ff1da15f6a8f9ed54f43c76e8b98f5f
BUG: 1219608
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/10654
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System
Reviewed-by: Joseph Fernandes
Reviewed-by: Vijay Bellur

rebalance: Introducing local crawl and parallel migration

2015-05-07T09:37:02+00:00

This patch addresses two parts of the proposed design:
1. Rebalance multiple files in parallel
2. Crawl only bricks that belong to the current node

Brief design explanation for the above two points.

1. Rebalance multiple files in parallel:
   -------------------------------------
The existing rebalance engine is single threaded. This patch introduces
multiple threads that run in parallel with the crawler, converting
rebalance migration to a "Producer-Consumer" framework.

Where Producer is : Crawler
      Consumer is : Migrating Threads

Crawler: The crawler is the main thread. Its job is now limited to
fixing the layout of each directory and adding files that are eligible
for migration to a global queue in a round robin manner, so that all
disk resources are used efficiently. Hence, the crawler is not
"blocked" by the migration process.

Consumer: Each consumer (migration thread) monitors the global queue.
When a file is added to the queue, a consumer dequeues that entry and
migrates the file. Currently 20 migration threads are spawned at the
beginning of the rebalance process, so multiple files are migrated in
parallel.
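
The producer-consumer arrangement described above can be sketched
roughly as follows. This is a minimal illustration using POSIX threads,
not the code from this patch: struct migrate_queue, queue_push(),
migrate_worker(), run_rebalance(), QUEUE_CAP and N_WORKERS are all
hypothetical names, and the "migration" is reduced to a counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64 /* hypothetical bound, not taken from the patch */
#define N_WORKERS 4  /* the patch spawns 20; 4 keeps the sketch small */

struct migrate_queue {
    const char *items[QUEUE_CAP]; /* ring buffer of file paths */
    size_t head, tail, count;
    int done;     /* set once the crawler has queued everything */
    int migrated; /* number of files handled by the workers */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

/* Producer side: the crawler calls this for every eligible file. */
static void queue_push(struct migrate_queue *q, const char *path)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = path;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: each migration thread dequeues until the queue is
 * drained and the crawler has signalled completion. */
static void *migrate_worker(void *arg)
{
    struct migrate_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->count == 0)
            break; /* crawler is done and the queue is empty */
        const char *path = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);

        (void)path; /* a real worker would migrate the file here */

        pthread_mutex_lock(&q->lock);
        q->migrated++;
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Crawl the given paths with N_WORKERS consumers; return files migrated. */
static int run_rebalance(const char **paths, size_t n_files)
{
    struct migrate_queue q = { .count = 0 }; /* remaining fields zeroed */
    pthread_t workers[N_WORKERS];
    int rc;

    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    for (int i = 0; i < N_WORKERS; i++) {
        rc = pthread_create(&workers[i], NULL, migrate_worker, &q);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < n_files; i++)
        queue_push(&q, paths[i]);

    pthread_mutex_lock(&q.lock);
    q.done = 1; /* wake idle workers so they can exit */
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.lock);

    for (int i = 0; i < N_WORKERS; i++)
        pthread_join(workers[i], NULL);
    pthread_mutex_destroy(&q.lock);
    pthread_cond_destroy(&q.not_empty);
    pthread_cond_destroy(&q.not_full);
    return q.migrated;
}
```

Because the crawler only blocks when the bounded queue is full, it can
keep walking directories while the workers are busy. Build with
-pthread where the platform requires it.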

2. Crawl only bricks that belong to the current node:
   --------------------------------------------------
As a rebalance process is spawned per node, it migrates only the files
that belong to its own node for the sake of load balancing. However, it
also reads entries from the whole cluster, which is unnecessary and
makes readdir hit other nodes.

New Design:
        As part of the new design, the rebalancer determines which
subvols are local to the rebalancer node by checking the node-uuid of
the root directory before the crawler starts. Hence, readdir no longer
hits the whole cluster, as it already has the context of the local
subvols, and the node-uuid request for each file can be avoided. This
makes the rebalance process "more scalable".
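
As a rough illustration of the local-subvol selection, the sketch below
filters a list of subvolumes by comparing the node-uuid reported for the
root directory with the local node's uuid. struct subvol_info and
select_local_subvols() are hypothetical names for this example and are
not GlusterFS types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

struct subvol_info {                   /* hypothetical, for illustration */
    const char *name;
    unsigned char node_uuid[UUID_LEN]; /* node-uuid reported for "/" */
};

/* Store pointers to the subvolumes hosted on local_uuid in out[] and
 * return how many were found. out[] must have room for n entries. */
static size_t select_local_subvols(const struct subvol_info *all, size_t n,
                                   const unsigned char *local_uuid,
                                   const struct subvol_info **out)
{
    size_t n_local = 0;

    assert(local_uuid != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (memcmp(all[i].node_uuid, local_uuid, UUID_LEN) == 0)
            out[n_local++] = &all[i];
    }
    return n_local;
}
```

The crawler would then issue readdir only against the returned
subvolumes, which is why a per-file node-uuid request is no longer
needed.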

Change-Id: I6f1b44086a09df8ca23935fd213509c70cc0c050
BUG: 1217381
Signed-off-by: Susant Palai 
Reviewed-on: http://review.gluster.org/10466
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System
Reviewed-by: N Balachandran

libglusterfs/syncop: Add xdata to all syncop calls

2015-04-08T15:14:59+00:00

This patch adds support for xdata in both the
request and response path of syncops.

A few calls, such as lookup, already had this support;
variables have been renamed in a few places to maintain
uniformity.

xdata passed downwards is known as xdata_in
and xdata passed upwards is known as xdata_out.
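
The calling convention can be pictured with a toy example. The types
and the function below (struct toy_dict, toy_syncop_stat) are
hypothetical stand-ins, not the real dict_t or syncop API; they only
show one optional dictionary travelling down with the request and one
travelling back up with the reply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dict { const char *key; const char *value; }; /* stand-in */

/* xdata_in travels down with the request; xdata_out is filled on the
 * way back up. Either may be NULL when the caller does not need it. */
static int toy_syncop_stat(const char *path,
                           const struct toy_dict *xdata_in,
                           const struct toy_dict **xdata_out)
{
    static const struct toy_dict reply = { "toy.reply", "ok" };

    assert(path != NULL);
    /* Request path: act on extra data only if the caller supplied it. */
    if (xdata_in != NULL && strcmp(xdata_in->key, "toy.fail") == 0)
        return -1;
    /* Response path: hand extra data back only if the caller asked. */
    if (xdata_out != NULL)
        *xdata_out = &reply;
    return 0;
}
```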

There is an old patch by Jeff Darcy at
http://review.gluster.org/#/c/8769/3 which does the
same for some selected calls. It also brings in
xdata support at gfapi level.

xdata support at gfapi level would be introduced
in subsequent patches.

Change-Id: I340e94ebaf2a38e160e65bc30732e8fe1c532dcc
BUG: 1158621
Signed-off-by: Raghavendra Talur 
Reviewed-on: http://review.gluster.org/9859
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

Avoid conflict between contrib/uuid and system uuid

2015-04-04T17:48:35+00:00

glusterfs relies on the Linux uuid implementation, whose
API is incompatible with most other systems' uuid. As
a result, libglusterfs has to embed contrib/uuid,
which is the Linux implementation, on non-Linux systems.
This implementation is incompatible with the system's
built-in one, but the symbols have the same names.

Usually this is not a problem because when we link
with -lglusterfs, libc's symbols are trumped. However
there is a problem when a program not linked with
-lglusterfs dlopen()s a glusterfs component. In
such a case, libc's uuid implementation is already
loaded in the calling program, and it will be used
instead of libglusterfs's implementation, causing
crashes.

A possible workaround is to preload libglusterfs
in the calling program (using LD_PRELOAD on NetBSD, for
instance), but such a mechanism is neither portable nor
flexible. A much better approach is to rename
libglusterfs's uuid_* functions to gf_uuid_* to avoid
any possible conflict. This is what this change attempts.
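
To illustrate the renaming idea, the sketch below exports helpers under
a gf_uuid_ prefix, so that a uuid_compare() or uuid_is_null() already
loaded in the calling program cannot be picked up in their place. The
function bodies are simple byte-wise stand-ins written for this
example, not the contrib/uuid code, and GF_UUID_LEN is a hypothetical
constant.

```c
#include <assert.h>
#include <string.h>

#define GF_UUID_LEN 16 /* hypothetical constant for this sketch */

/* Prefixed names cannot collide with a uuid_compare()/uuid_is_null()
 * that the host program already resolved from its own libraries. */
int gf_uuid_compare(const unsigned char *u1, const unsigned char *u2)
{
    assert(u1 != NULL && u2 != NULL);
    return memcmp(u1, u2, GF_UUID_LEN);
}

int gf_uuid_is_null(const unsigned char *u)
{
    static const unsigned char zero[GF_UUID_LEN]; /* all-zero uuid */
    return gf_uuid_compare(u, zero) == 0;
}
```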

BUG: 1206587
Change-Id: I9ccd3e13afed1c7fc18508e92c7beb0f5d49f31a
Signed-off-by: Emmanuel Dreyfus 
Reviewed-on: http://review.gluster.org/10017
Tested-by: Gluster Build System 
Reviewed-by: Niels de Vos

cluster/dht: Add tier translator.

2015-03-21T16:50:29+00:00

The tier translator shares most of DHT's code. It differs in how
subvolumes are chosen for I/Os, and how file migration (cache promotion
and demotion) is managed. That different functionality is split to either
DHT or tier logic according to the "tier_methods" structure.

A cache promotion and demotion thread is created in a manner
similar to the rebalance daemon. The thread operates a timing
wheel which periodically checks for promotion and demotion candidates
(files). Candidates are queued and then migrated. Candidates must exist on
the same node as the daemon and meet other criteria per caching policies.

This patch has two authors (Dan Lambright and Joseph Fernandes). Dan
did the DHT changes and Joe wrote the cache policies. The fix depends on
DHT readdir changes and the database library, which have been submitted
separately. Header files in libglusterfs/src/gfdb should be reviewed in
patch 9683.

For more background and design see the feature page [1].

[1]
http://www.gluster.org/community/documentation/index.php/Features/data-classification

Change-Id: Icc26c517ccecf5c42aef039f5b9c6f7afe83e46c
BUG: 1194753
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/9724
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur

cluster/dht: Change the subvolume encoding in d_off to be a "global"

2015-03-18T11:47:41+00:00

position in the graph rather than relative (local) to a particular
translator.

Encoding the volume in this way allows a single translator to manage
which brick is currently being scanned for directory entries. Using a
single translator minimizes allocated bits in the d_off. It also allows
multiple DHT translators in the same graph to have a common frame of
reference (the graph position) for which brick is being read. Multiple
DHT translators are needed for the Tiering feature.

The fix builds off a previous change (9332) which removed subvolume
encoding from AFR. The fix makes an equivalent change to the EC
translator.

More background can be found in fix 9332 and gluster-dev discussions [1].

DHT and AFR/EC are responsible (as before) for choosing which brick to
enumerate directory entries from over the readdir lifecycle.

The client translator receiving the readdir fop encodes the dht_t. It
is referred to as the "leaf node" in the graph and corresponds to the
brick being scanned.

When DHT decodes the d_off, it translates the leaf node to a local
subvolume, which represents the next node in the graph leading to
the brick.

Tracking of leaf nodes is done in common utility functions. Leaf node
counts and positional information are updated on a graph switch.
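
The encode and decode steps can be sketched as below. The bit split is
an assumption made for this example (a fixed 8 bits for the leaf node,
56 for the brick's own offset); the actual patch may size the fields
differently. d_off_encode(), d_off_leaf(), d_off_brick_off() and
leaf_to_subvol() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_BITS 8u /* hypothetical width of the leaf node field */
#define OFF_BITS  (64u - LEAF_BITS)
#define OFF_MASK  ((UINT64_C(1) << OFF_BITS) - 1)

/* Client translator side: tag the brick offset with the global leaf id. */
static uint64_t d_off_encode(uint32_t leaf_id, uint64_t brick_off)
{
    assert(leaf_id < (1u << LEAF_BITS));
    assert(brick_off <= OFF_MASK);
    return ((uint64_t)leaf_id << OFF_BITS) | brick_off;
}

static uint32_t d_off_leaf(uint64_t d_off)
{
    return (uint32_t)(d_off >> OFF_BITS);
}

static uint64_t d_off_brick_off(uint64_t d_off)
{
    return d_off & OFF_MASK;
}

/* DHT side: map a global leaf id to the local subvolume that leads to
 * it. first_leaf[i] is the lowest leaf id below subvolume i and is
 * assumed to be in ascending order. Returns -1 if no subvolume covers
 * the leaf. */
static int leaf_to_subvol(uint32_t leaf_id, const uint32_t *first_leaf,
                          int n_subvols)
{
    int subvol = -1;

    for (int i = 0; i < n_subvols; i++) {
        if (first_leaf[i] <= leaf_id)
            subvol = i; /* last range starting at or before leaf_id */
    }
    return subvol;
}
```

Since the leaf id is a position in the whole graph, two stacked DHT
translators decoding the same d_off agree on which brick is meant, even
though each maps it to a different local subvolume.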

[1] www.gluster.org/pipermail/gluster-devel/2015-January/043592.html

Change-Id: Iaf0ea86d7046b1ceadbad69d88707b243077ebc8
BUG: 1190734
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/9688
Reviewed-by: Xavier Hernandez 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur

dht: fix for dht_lock_count() compile error

2015-02-26T17:14:06+00:00

dht-common.h includes a function definition with "inline", but the
function is not declared in the header. Dropping the "inline" compile
directive so that linking against .o files works correctly.

BUG: 1196650
Change-Id: I105be591125b29cd455769b0c4ff22d6e139227d
Signed-off-by: Niels de Vos 
Reviewed-on: http://review.gluster.org/9760
Tested-by: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Raghavendra G 
Tested-by: Raghavendra G

cluster/dht: synchronize with other concurrent healers while healing layout.

2015-02-20T10:35:00+00:00

The current layout heal code assumes that layout setting is idempotent.
This allowed multiple concurrent healers to set the layout without any
synchronization. However, this is not the case: different healers can
come up with different layouts for the same directory, which makes
layout setting non-idempotent. So, we bring in synchronization among
healers in order to
   1. Not overwrite an ondisk well-formed layout.
   2. Refresh the in-memory layout with the ondisk layout if the in-memory
   layout needs healing and the ondisk layout is well formed.

This patch can synchronize
   1. among multiple healers.
   2. among multiple fix-layouts (which extend the layout to account for
   added or removed bricks)
   3. (but) not between healers and fix-layouts. So, the problem of
   stale in-memory layouts (not matching the layout ondisk) is not
   _completely_ fixed by this patch.
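
The two rules for healers can be modelled with a small decision
function. This is a simplified sketch under stated assumptions: the
caller is assumed to hold whatever lock serializes healers on the
directory and to have re-read the ondisk layout after taking it.
struct toy_layout, enum heal_action and heal_layout() are hypothetical
names, not DHT code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_layout { bool well_formed; int id; }; /* stand-in for a layout */

enum heal_action {
    HEAL_NOTHING,        /* both copies are fine */
    HEAL_REFRESH_MEMORY, /* adopt the well-formed ondisk layout */
    HEAL_WRITE_DISK      /* ondisk layout is broken, write the proposal */
};

/* Must be called with the healer lock held (assumption of this sketch). */
static enum heal_action heal_layout(struct toy_layout *inmem,
                                    struct toy_layout *ondisk,
                                    struct toy_layout proposed)
{
    assert(inmem != NULL && ondisk != NULL);
    if (ondisk->well_formed) {
        if (inmem->well_formed)
            return HEAL_NOTHING;
        *inmem = *ondisk; /* rule 2: refresh, never overwrite (rule 1) */
        return HEAL_REFRESH_MEMORY;
    }
    *ondisk = proposed;
    *inmem = proposed;
    return HEAL_WRITE_DISK;
}
```

With two healers proposing different layouts, the first one to take the
lock writes its proposal and the second one adopts it instead of
overwriting it.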

Signed-off-by: Raghavendra G 
Change-Id: Ia285f25e8d043bb3175c61468d0d11090acee539
BUG: 1176008
Reviewed-on: http://review.gluster.org/9302
Reviewed-by: N Balachandran

libglusterfs: change signature of syncop_(f)getxattr

2015-01-06T06:00:09+00:00

Pass xdata dict to syncop_(f)getxattr calls.

This patch [1/3] is required as a part of afr automated split-brain resolution
implementation.

Change-Id: I3970b3dd6daf64681a031e37f8e9afb14fb3d668
BUG: 1136769
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/9375
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Niels de Vos 
Tested-by: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Vijay Bellur

cluster/dht: fix memory corruption in locking api.

2014-09-10T03:29:42+00:00



From the qsort(3) man page:

     The contents of the array are sorted in ascending order
     according to a comparison function pointed to by compar, which is
     called with two arguments that "point to the objects being
     compared".



qsort passes "pointers to members of the array" to the comparison
function. Since the members of the array happen to be (dht_lock_t *),
the arguments passed to dht_lock_request_cmp are of type (dht_lock_t
**). Previously we assumed them to be of type (dht_lock_t *), which
resulted in memory corruption.
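
A minimal sketch of the corrected pattern follows. struct toy_lock,
toy_lock_cmp() and sort_locks() are hypothetical stand-ins for
dht_lock_t and dht_lock_request_cmp, and the ordering key (a subvolume
name) is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_lock { const char *subvol; }; /* stand-in for dht_lock_t */

static int toy_lock_cmp(const void *a, const void *b)
{
    /* The array holds (struct toy_lock *) elements, so qsort passes
     * (struct toy_lock **). Dereference once to reach each lock. */
    const struct toy_lock *l1 = *(const struct toy_lock *const *)a;
    const struct toy_lock *l2 = *(const struct toy_lock *const *)b;

    return strcmp(l1->subvol, l2->subvol);
}

/* Sort an array of lock pointers in place by subvolume name. */
static void sort_locks(struct toy_lock **locks, size_t n)
{
    assert(locks != NULL || n == 0);
    qsort(locks, n, sizeof(locks[0]), toy_lock_cmp);
}
```

Casting the arguments directly to (const struct toy_lock *) would make
the comparator read the pointer array itself as if it were a lock
structure, which is the kind of misinterpretation this fix removes.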

Change-Id: Iee0758704434beaff3c3a1ad48d549cbdc9e1c96
BUG: 1139506
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/8659
Tested-by: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Vijay Bellur