summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorShireesh Anjal <anjalshireesh@gmail.com>2011-08-16 09:41:45 -0700
committerShireesh Anjal <anjalshireesh@gmail.com>2011-08-16 09:41:45 -0700
commitc61a9f626ed4efe7a2a64366c9ae66b440e85ebc (patch)
tree9a6024730d186c182ad4834822b0e81313182179
parentf98398d79aeca0a363694cb0c27cd1f47e94c721 (diff)
parent22cd0a65e79cf12e25800191d601bf437cf6213a (diff)
Merge pull request #235 from Dhandapani/81b070f7b594c6160c34ddc9ff3ba6eeda5aef81
Missing plugins added
-rw-r--r--src/com.gluster.storage.management.console.feature/feature.xml14
1 files changed, 14 insertions, 0 deletions
diff --git a/src/com.gluster.storage.management.console.feature/feature.xml b/src/com.gluster.storage.management.console.feature/feature.xml
index 8fbd13ec..fdafcd35 100644
--- a/src/com.gluster.storage.management.console.feature/feature.xml
+++ b/src/com.gluster.storage.management.console.feature/feature.xml
@@ -918,4 +918,18 @@
version="0.0.0"
unpack="false"/>
+ <plugin
+ id="com.jcraft.jsch"
+ download-size="0"
+ install-size="0"
+ version="0.0.0"
+ unpack="false"/>
+
+ <plugin
+ id="org.eclipse.jsch.core"
+ download-size="0"
+ install-size="0"
+ version="0.0.0"
+ unpack="false"/>
+
</feature>
id='n93' href='#n93'>93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
cluster/afr translator
======================

Locking
-------

Before understanding replicate, one must understand two internal FOPs:

### `GF_FILE_LK`

This is exactly like `fcntl(2)` locking, except the locks are in a
separate domain from locks held by applications.

### `GF_DIR_LK (loc_t *loc, char *basename)`

This allows one to lock a name under a directory. For example,
to lock /mnt/glusterfs/foo, one would use the call:

```
GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
```

If one wishes to lock *all* the names under a particular directory,
supply the basename argument as `NULL`.

The locks can either be read locks or write locks; consult the
function prototype for more details.

Both these operations are implemented by the features/locks (earlier
known as posix-locks) translator.

Basic design
------------

All FOPs can be classified into four major groups:

### inode-read

Operations that read an inode's data (file contents) or metadata (perms, etc.).

access, getxattr, fstat, readlink, readv, stat.

### inode-write

Operations that modify an inode's data or metadata.

chmod, chown, truncate, writev, utimens.

### dir-read

Operations that read a directory's contents or metadata.

readdir, getdents, checksum.

### dir-write

Operations that modify a directory's contents or metadata.

create, link, mkdir, mknod, rename, rmdir, symlink, unlink.

Some of these make a subgroup in that they modify *two* different entries:
link, rename, symlink.

### Others

Other operations.

flush, lookup, open, opendir, statfs.

Algorithms
----------

Each of the four major groups has its own algorithm:

### inode-read, dir-read

1. Send a request to the first child that is up:
    * if it fails:
        * try the next available child
    * if we have exhausted all children:
        * return failure

### inode-write

 All operations are done in parallel unless specified otherwise.

1. Send a ``GF_FILE_LK`` request on all children for a write lock on the 
   appropriate region
   (for metadata operations: entire file (0, 0) for writev:
   (offset, offset+size of buffer))
    * If a lock request fails on a child:
        * unlock all children
        * try to acquire a blocking lock (`F_SETLKW`) on each child, serially.
	  If this fails (due to `ENOTCONN` or `EINVAL`):
          Consider this child as dead for rest of transaction.
2. Mark all children as "pending" on all (alive) children (see below for 
meaning of "pending").
    * If it fails on any child:
        * mark it as dead (in transaction local state).
3. Perform operation on all (alive) children.
    * If it fails on any child:
        * mark it as dead (in transaction local state).
4. Unmark all successful children as not "pending" on all nodes.
5. Unlock region on all (alive) children.

### dir-write

 The algorithm for dir-write is same as above except instead of holding
 `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon.
 In case of link-type calls, we hold locks on both the operand names.

"pending"
---------

The "pending" number is like a journal entry. A pending entry is an
array of 32-bit integers stored in network byte-order as the extended
attribute of an inode (which can be a directory as well).

There are three keys corresponding to three types of pending operations:

### `AFR_METADATA_PENDING`

There are some metadata operations pending on this inode (perms, ctime/mtime,
xattr, etc.).

### `AFR_DATA_PENDING`

There is some data pending on this inode (writev).

### `AFR_ENTRY_PENDING`

There are some directory operations pending on this directory
(create, unlink, etc.).

Self heal
---------

* On lookup, gather extended attribute data:
    * If entry is a regular file:
        * If an entry is present on one child and not on others:
            * create entry on others.
        * If entries exist but have different metadata (perms, etc.):
            * consider the entry with the highest `AFR_METADATA_PENDING` number as
              definitive and replicate its attributes on children.
    * If entry is a directory:
        * Consider the entry with the highest `AFR_ENTRY_PENDING` number as
          definitive and replicate its contents on all children.
    * If any two entries have non-matching types (i.e., one is file and
      other is directory):
        * Announce to the user via log that a split-brain situation has been
          detected, and do nothing.
* On open, gather extended attribute data:
    * Consider the file with the highest `AFR_DATA_PENDING` number as
      the definitive one and replicate its contents on all other
      children.

During all self heal operations, appropriate locks must be held on all
regions/entries being affected.

Inode scaling
-------------

Inode scaling is necessary because if a situation arises where an inode number 
is returned for a directory (by lookup) which was previously the inode number 
of a file (as per FUSE's table), then FUSE gets horribly confused (consult a 
FUSE expert for more details).

To avoid such a situation, we distribute the 64-bit inode space equally
among all children of replicate.

To illustrate:

If c1, c2, c3 are children of replicate, they each get 1/3 of the available
inode space:

-------------  --  --  --  --  --  --  --  --  --  --  --  ---
Child:         c1  c2  c3  c1  c2  c3  c1  c