Add GD2 Design

Change-Id: I8ba55fd9a0f904d3bc57ac398babc7663036c62c
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
Reviewed-on: http://review.gluster.org/13237
Reviewed-by: Prashanth Pai <ppai@redhat.com>
Tested-by: Prashanth Pai <ppai@redhat.com>
author: Atin Mukherjee <amukherj@redhat.com> 2016-01-14 13:08:52 +0530
committer: Kaushal M <kaushal@redhat.com> 2016-01-27 21:54:01 -0800
commit: 063b5556d7271bfe06ec80b6a1957fbd5cacec51 (patch)
tree: a370786039866e5442c00fdb0e4b779cc170cf4f /design/GlusterD2
parent: 9968356e8e0b04fe87213e5ed2795d6afe65bff1 (diff)
4 files changed, 518 insertions, 0 deletions
diff --git a/design/GlusterD2/GD2-Design.md b/design/GlusterD2/GD2-Design.md
new file mode 100644
index 0000000..dafd7d1
--- /dev/null
+++ b/design/GlusterD2/GD2-Design.md
@@ -0,0 +1,135 @@
+# GlusterD 2.0 Design Document
+
+This document gives a high-level overview of the GlusterD-2.0 design. The design is being refined as we go along, and this document will be updated along the way.
+
+## Why GlusterD-2.0
+
+Gluster has come a long way as a POSIX-compliant distributed file system for small to medium sized clusters (10s-100s of nodes). Gluster.Next is a collection of improvements to push Gluster's capabilities to cloud-scale (read 1000s of nodes).
+
+GlusterD-2.0, the next version of native Gluster management software, aims to offer devops-friendly interfaces to build, configure and deploy a 'thousand-node' Gluster cloud.
+
+## GlusterD 1.0 limitations
+
+The current form of GlusterD has the following categories of limitations.
+
+### Nonlinear node scalability
+
+GlusterD's internal data, be it membership data or configuration data, is replicated across all the nodes, and each node maintains its own state and the state of all of its peers. This requires the use of an n^2 heartbeat/membership protocol, which doesn't scale when the cluster grows to thousands of nodes.
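To make the n^2 growth concrete, the following is a minimal arithmetic sketch rather than anything taken from GlusterD itself. It assumes every node tracks every other node in the full-mesh case, and that each node tracks only a single store endpoint in the alternative; the function names `fullMeshLinks` and `centralStoreLinks` are hypothetical.

```go
package main

import "fmt"

// fullMeshLinks returns the number of directed peer-tracking relationships
// when each of n nodes keeps state for every other node (assumed model).
func fullMeshLinks(n int) int {
	return n * (n - 1)
}

// centralStoreLinks returns the number of relationships when each of n nodes
// only tracks one central store endpoint (assumed model for comparison).
func centralStoreLinks(n int) int {
	return n
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("nodes=%d full-mesh=%d central-store=%d\n",
			n, fullMeshLinks(n), centralStoreLinks(n))
	}
}
```

Under these assumptions a 1000-node cluster needs 999000 tracked relationships in the full-mesh model, against 1000 when nodes only track a store.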
+
+### Code maintainability & feature integration
+
+Non-trivial effort is involved in adding management support for a new feature. Any new feature needs to hook into the GlusterD codebase, so the codebase keeps growing rapidly and becomes increasingly difficult to maintain.
+
+## Architectural Overview
+
+[[images/gd2-arch.png]]
+
+The core of GD2 is the centralized store. GD2 will maintain the cluster configuration data, which includes peer and volume information, in a central store instead of maintaining it on each peer. It is planned to use [etcd](https://coreos.com/etcd/) to provide the centralized store. etcd servers will only run on a subset of the cluster, tentatively called the monitor cluster. All other nodes of the cluster will be clients of the monitor cluster.
+
+GD2 will also be a native ReST server, exposing cluster management interfaces via an HTTP ReST API. The CLI will be rewritten as a ReST client which uses this API.
+
+## ReST API
+
+The main management interface with GD2 will be an HTTP ReST interface. APIs will be provided for the management of peers, management of volumes, local-GlusterD management, monitoring (events) and long-running asynchronous operations.
+
+More details on the ReST-API can be found at [[ReST-API]] (_note that this is still under active development_).
+
+### Gluster CLI
+
+The CLI application will be a ReST client talking to GD2 over its HTTP ReST interfaces. The CLI will support GlusterFS 3.x semantics, with changes as appropriate to fix some known issues.
+
+
+## Centralized store
+
+The central store is the most important part of GD2. The central store will provide GD2 with a centralized location to save cluster data, and have it accessible from the whole cluster. The central store helps avoid the complex and costly transactions in use now. We choose to use an external, distributed-replicated key-value store as the centralized store instead of implementing a new central store framework in GD2. [etcd](https://coreos.com/etcd/) and [consul](https://www.consul.io) are two of the stores in consideration.
+
+### Cluster topography with centralized store
+
+Only a subset of the larger GD2 cluster will be used for serving the centralized store, and the remaining nodes will be clients of this sub-cluster. This makes the GD2 cluster a two-tiered cluster.
+
+### Bootstrapping and managing the centralized store
+
+The centralized store needs to be bootstrapped and managed by GD2. Bootstrapping the central store requires choosing the nodes to be used as servers. A discussion on how to do this happened [here](https://www.gluster.org/pipermail/gluster-devel/2015-September/046740.html).
+
+The summary of the discussion is that,
+- each GD2 will manage a private instance of its own store server
+- on startup every GD2 will start the store in single-node mode
+- at the time of cluster expansion, the GD2 admin will decide whether the new node acts as a central store server or as a client, using an option called "join". If it is true, the new node will become one of the central store servers; otherwise it will be a client.
+ - if the probed node is a central-store server, GD2 will shut down its single-node store, then restart it and add the store server to the store cluster.
+ - if the probed node is not a central-store server, GD2 will shut down its single-node store and establish a client connection to the store cluster.
+
+The mechanisms of promotion and demotion of the nodes in the store cluster are still under discussion.
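The probe-time decision summarized above can be sketched as follows. This is only an illustration of the described flow, not GD2 code: the `StoreRole` type, the function names and the action strings are all hypothetical, and the only input taken from the text is the boolean "join" option.

```go
package main

import "fmt"

// StoreRole is a hypothetical type naming the two tiers described in the text.
type StoreRole string

const (
	RoleServer StoreRole = "server"
	RoleClient StoreRole = "client"
)

// roleForNewNode maps the "join" option to the role of the probed node.
func roleForNewNode(join bool) StoreRole {
	if join {
		return RoleServer
	}
	return RoleClient
}

// probeActions lists, in order, what the probed node's GD2 is described as
// doing once its role is known. The strings are illustrative only.
func probeActions(join bool) []string {
	actions := []string{"shut down single-node store"}
	if roleForNewNode(join) == RoleServer {
		return append(actions,
			"restart store server",
			"add store server to the store cluster")
	}
	return append(actions, "connect to the store cluster as a client")
}

func main() {
	for _, join := range []bool{true, false} {
		fmt.Printf("join=%v role=%s actions=%v\n",
			join, roleForNewNode(join), probeActions(join))
	}
}
```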
+
+
+## Transaction framework
+
+The transaction framework will ensure that a given set of actions will be performed in order, one after another, on the required peers of the cluster.
+
+The new transaction framework will be built around the central store. It will differ from the existing framework in 2 major ways,
+1. actions will only be performed where required, instead of being done across the whole cluster
+2. final committing of the results into the store will only be done by the node (be it a client or server) where the transaction was initiated
+
+The above 2 changes will help keep the transaction framework simple and fast.
+
+More details can be found at [[Transaction-framework]].
+
+## RPC communication
+
+GD2 is going to implement a new cross-language RPC framework using [protobuf](https://github.com/google/protobuf). A PoC Go package has already been implemented [here](https://github.com/kshlm/pbrpc). This will be used for communication between GlusterD instances and between GlusterD and Glusterfsd daemons. Communication between clients and bricks will continue to use the existing XDR-based Sun RPC implementation.
+
+GD2 is going to have the RPC hierarchy in the following manner
+```
+GD2
+|       |- client
+|- rpc -|- server
+|       |- services
+
+server/ -> would contain the code necessary to create and manage a listener and methods to register services
+client/ -> would contain code necessary to establish and manage client connections and methods to send requests
+services/ -> would contain the actual services. The services and the handler functions would be defined here
+```
+
+## Feature pluggability
+
+To ease integration of GlusterFS features into GD2 and to reduce the maintenance effort, GD2 will provide a pluggable interface. This interface will allow new features to integrate with GD2 without the feature developers having to modify GD2. It is mainly targeted at filesystem features that wouldn't require a large amount of change to management code. The interface will aim to provide the ability to,
+
+- insert xlators into a volume graph
+- set options on xlators
+- define and create custom volume graphs
+- define and manage daemons
+- create CLI commands
+- hook into existing CLI commands
+- query cluster and volume information
+- associate information with objects (peers, volumes)
+
+The above should satisfy most of the common requirements for GD2 plugins.
+
+Design work on the plugin interface has yet to begin.
+
+### Improving logging
+
+Improving logging in GD2 is crucial to better support the scale expected. GD2 will use structured logging to help improve log readability and machine parseability. Structured logging uses fixed strings with some attached metadata, generally in the form of key-value pairs, instead of variable log strings. Structured logging also allows us to create log contexts, which can be used to attach specific metadata to all logs in the context. Using log contexts and transaction-ids/request-ids allows us to easily identify and group logs related to a specific transaction or request.
+
+The current PoC of GD2 uses [logrus](https://github.com/Sirupsen/logrus), a structured logger for Go.
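The idea of fixed messages plus key-value metadata, and of a log context that carries a request id into every entry, can be illustrated with a small standard-library-only sketch. It deliberately does not use logrus; the `Fields` and `Context` types, the `reqid` key and the output format are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Fields is key-value metadata attached to a log entry.
type Fields map[string]string

// Context is a hypothetical log context: fields stored here are attached
// to every entry formatted through it.
type Context struct {
	fields Fields
}

// With returns a new context holding the existing fields plus f.
func (c Context) With(f Fields) Context {
	out := Fields{}
	for k, v := range c.fields {
		out[k] = v
	}
	for k, v := range f {
		out[k] = v
	}
	return Context{fields: out}
}

// Format renders a fixed message followed by the context fields and the
// per-entry fields, sorted by key so the output is stable.
func (c Context) Format(level, msg string, extra Fields) string {
	merged := c.With(extra).fields
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	fmt.Fprintf(&b, "level=%s msg=%q", level, msg)
	for _, k := range keys {
		fmt.Fprintf(&b, " %s=%q", k, merged[k])
	}
	return b.String()
}

func main() {
	// All entries for one request share the same request id.
	reqLog := Context{}.With(Fields{"reqid": "r-42"})
	fmt.Println(reqLog.Format("info", "volume started", Fields{"volume": "test-vol"}))
	fmt.Println(reqLog.Format("error", "brick start failed", Fields{"brick": "/gluster/brick1"}))
}
```

Because the message string stays fixed and the variable parts live in fields, entries for one request can be grouped by filtering on `reqid`.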
+
+
+## Other improvements
+
+GD2 also aims to improve the below.
+
+### Better op-version management
+
+Operating version (or op-version) is used to prevent troubles caused by running heterogeneous (nodes running different versions of glusterfs) clusters. It currently does not have clear guidelines on how it is supposed to be used, leading to inconsistent usage and problems. GD2 will set clear guidelines on how op-versions are supposed to be used, and provide suggested patterns to effectively use op-versions.
+
+### Developer/User documentation
+
+More focus will be given to improving the documentation for both users and developers. GD2 code will follow Go documentation practices and include proper documentation inline. This documentation can be easily extracted and hosted using the `godoc` tool.
+
+## Upgrades & Backward Compatibility
+
+As GD2 moves to a new store format and a new transaction mechanism, rolling upgrades from and backwards compatibility with 3.x releases will not be possible. Upgrades will involve service disruption.
+
+Support will be provided for the migration of older configuration data to GD2, possibly in the form of helper scripts.
+
+A detailed discussion on this can be found [here](http://www.gluster.org/pipermail/gluster-devel/2015-October/046866.html).
+
diff --git a/design/GlusterD2/ReST-API.md b/design/GlusterD2/ReST-API.md
new file mode 100644
index 0000000..b220171
--- /dev/null
+++ b/design/GlusterD2/ReST-API.md
@@ -0,0 +1,289 @@
+## Overview
+The main communication interface with GlusterD-2.0 (GD2) will be an HTTP ReST interface. This interface will be used by the GD2 CLI, and will also be available for external consumers to use.
+
+GD2 will provide ReST interfaces for the management of peers, management of volumes, local-GlusterD management, interfaces for monitoring (events) and long-running asynchronous operations.
+
+## Authentication
+The API will use a stateless authentication model using JSON web tokens. This will be based on the [Heketi authentication model][1].
+> This is still tentative
+
+## API
+This is the first version of the GD2 ReST API; `API_VERSION` is **1**. All the endpoints defined below will be prefixed with `/v<API_VERSION>` unless otherwise mentioned.
+
+GD2 uses JSON as its data serialization format. XML support is not planned initially.
+
+Most APIs follow these conventions:
+- URIs in the form of `/<version>/<ReST endpoint>/{id}`
+- Requests and responses in JSON format
+- `POST`: Send data to GD2, where the request body has data described in JSON format.
+- `GET`: Retrieve data from GD2, where the response body has data described in JSON format.
+- `DELETE`: Deletes the specified object from GD2.
+- The HTTP content-type for all requests and responses will be `application/json`.
+
+### Meta
+These APIs are used to get some meta information about GD2 and the cluster itself.
+
+#### Get version
+Returns the GlusterD and API versions.
+- **Method** : `GET`
+- **Endpoint** : `/version` _will not be prefixed_
+- **Request** : _Empty_
+- **Response** :
+	- **Status code** : `200 OK`
+	- **Body** :
+		- *glusterd-version* : A string. The GlusterD-2.0 version
+		- *api-version* : A string. The ReST API version
+		- *Example* :
+```json
+{
+    "glusterd-version": "dev",
+    "api-version": "1"
+}
+```
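As a rough illustration of the version endpoint, the sketch below serves the example response from a Go `net/http` handler and exercises it in-process with `net/http/httptest`. The handler name, the struct name and the helper `callVersion` are hypothetical; only the path, the method, the status code and the two JSON keys come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// VersionResponse mirrors the two documented keys of the response body.
type VersionResponse struct {
	GlusterdVersion string `json:"glusterd-version"`
	APIVersion      string `json:"api-version"`
}

// versionHandler answers GET /version; other methods are rejected.
// The values returned are the ones used in the example above.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(VersionResponse{GlusterdVersion: "dev", APIVersion: "1"})
}

// callVersion invokes the handler without opening a network socket and
// returns the status code plus the decoded body (zero value on non-200).
func callVersion(method string) (int, VersionResponse) {
	rec := httptest.NewRecorder()
	versionHandler(rec, httptest.NewRequest(method, "/version", nil))
	var v VersionResponse
	if rec.Code == http.StatusOK {
		json.NewDecoder(rec.Body).Decode(&v)
	}
	return rec.Code, v
}

func main() {
	code, v := callVersion(http.MethodGet)
	fmt.Println(code, v.GlusterdVersion, v.APIVersion)
}
```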
+
+#### Get info
+Returns some information about the cluster
+- **Method** : `GET`
+- **Endpoint** : `/info` _will not be prefixed_
+- **Request** : _Empty_
+- **Response** :
+  - **Status code** : `200 OK`
+  - **Body** : ***TBD***
+
+> NOTE: The APIs described here only define the responses to successful requests. Failed requests will follow a common response format, which will be defined later.
+
+### Peers
+The _peers_ endpoint will be used to manage peers in the cluster. All _peers_ endpoints will have the prefix `/peers/`.
+
+#### Attach peer
+- **Method** : `POST`
+- **Endpoint** : `/peers/`
+- **Request** :
+	- **Parameters** :  _None_
+	- **Body**:
+	    - *addresses* : An array of strings. Gives a list of addresses by which the new host can be contacted. The addresses can be FQDNs, short-names or IP addresses. At least 1 address is required
+		- *name* : A string, optional. The name to be used for the peer. This name can be used to refer to the peer in other commands. If not given, the first address in the addresses array will be used as the name.
+		- Example :
+```json
+{
+    "addresses": [
+        "host1name1",
+        "host1name2"
+    ],
+    "name": "host1"
+}
+```
+- **Response** :
+	- **Status code**: `201 Created`
+	-  **Body**:
+		- *id* : A string. The UUID of the newly added peer
+		- Example :
+```json
+	{ "id" : "de305d54-75b4-431b-adb2-eb6b9e546014"}
+```
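The request rules for attaching a peer (at least one address, and the name defaulting to the first address) could be checked on the server side along these lines. This is a sketch under assumptions: the type `AttachPeerRequest`, the function `parseAttachPeer` and the error text are hypothetical, and only the field names and the two rules come from the description above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// AttachPeerRequest mirrors the documented request body.
type AttachPeerRequest struct {
	Addresses []string `json:"addresses"`
	Name      string   `json:"name,omitempty"`
}

// parseAttachPeer decodes the body, enforces the "at least 1 address" rule
// and applies the documented default for the optional name.
func parseAttachPeer(body []byte) (AttachPeerRequest, error) {
	var req AttachPeerRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, err
	}
	if len(req.Addresses) == 0 {
		return req, errors.New("at least one address is required")
	}
	if req.Name == "" {
		req.Name = req.Addresses[0]
	}
	return req, nil
}

func main() {
	req, err := parseAttachPeer([]byte(`{"addresses": ["host1name1", "host1name2"]}`))
	fmt.Println(req.Name, req.Addresses, err)

	_, err = parseAttachPeer([]byte(`{"addresses": []}`))
	fmt.Println(err)
}
```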
+
+#### Get peer
+- **Method** : `GET`
+- **Endpoint** : `/peers/{id}`
+- **Request**:
+	- **Parameters** :
+		- `id` : UUID of the peer or the name of the peer to get.
+	- **Body** : *Empty*
+- **Response**:
+	- **Status code**: `200 OK`
+	- **Body** :
+		- *id* : A string. The UUID of the peer
+		- *name* : A string. The name of the peer.
+		- *addresses* : An array of strings. Each entry in the list is an address by which the peer can be connected to.
+		- *online* : A boolean. Gives online status of the peer
+		- Example :
+```json
+{
+    "id": "c1cf34e2-5bb7-4885-b8c2-0d92199993f9",
+    "name": "peer2",
+    "addresses": [
+      "p2a1",
+      "p2a2",
+      "p2a3"
+    ],
+    "online": true
+}
+```
+
+#### List peers
+- **Method** : `GET`
+- **Endpoint** : `/peers/`
+- **Request**:
+	- **Parameters** : *None*
+	- **Body** : *Empty*
+- **Response**:
+	- **Status code** : `200 OK`
+	- **Body** : An array of peer information JSON objects. For details of the peer information object, refer to *Get peer*.
+
+#### Detach peer
+- **Method** : `DELETE`
+- **Endpoint**: `/peers/{id}`
+- **Request**:
+	- **Parameters** :
+		- `id` : Name or ID of peer to be detached from the cluster
+	- **Body** : *Empty*
+- **Response**:
+	- **Status code** : `204 No Content`
+	- **Body** : *Empty*
+
+### Volumes
+The _volumes_ endpoints will be used to manage volumes in the cluster. All _volumes_ endpoints have the prefix `/volumes/`
+
+#### Create volume
+- **Method** : `POST`
+- **Endpoint**: `/volumes/`
+- **Request**:
+	- **Parameters** : *None*
+	- **Body** :
+	    - *name* : A string. Name of the volume
+	    - *stripe* :  An integer, optional. Represents the stripe count for a stripe volume
+	    - *replica* : An integer, optional. Represents the replica count for a replicate volume. Defaults to 1
+	    - *arbiter* : An integer, optional. Represents the arbiter count for an arbiter volume
+	    - *disperse-data* : An integer, optional. Represents the disperse-data count for a dispersed volume
+	    - *disperse-redundancy* : An integer, optional. Represents the redundancy count for a dispersed volume
+	    - *transport* : A string, optional. Represents the transport type (TCP/RDMA/Both). Defaults to TCP
+	    - *bricks* : An array of strings. Holds the list of bricks to be configured for the volume. Each brick is specified in the form `IP:brickpath`
+	    - *flags* : A JSON object with string-boolean key-value pairs, optional. The keys are flags and the values are their state. The default value is false.
+        - Available flags
+          - "reuse_bricks"
+          - "allow_root_dir"
+          - <need to add others we introduce>
+    - Example :
+```json
+{
+    "name": "test-vol",
+    "replica" : 2,
+    "bricks": [
+        "x.x.x.x:/home/bricks/b1",
+        "y.y.y.y:/home/bricks/b2"
+    ]
+}
+```
+- **Response**:
+	- **Status code** : `201 Created`
+	- **Body** : A JSON object of a volume. For details, refer to *Get volume*
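A possible way to decode the create-volume body and apply the documented defaults is sketched below. It covers only a subset of the fields; the type and function names are hypothetical, treating `name` and `bricks` as required is an assumption based on them not being marked optional, and splitting a brick at the first colon is a simplification that would not handle IPv6 literals.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// CreateVolumeRequest holds a subset of the documented request fields.
type CreateVolumeRequest struct {
	Name      string          `json:"name"`
	Replica   int             `json:"replica,omitempty"`
	Transport string          `json:"transport,omitempty"`
	Bricks    []string        `json:"bricks"`
	Flags     map[string]bool `json:"flags,omitempty"`
}

// Brick is the host and path parsed from an "IP:brickpath" string.
type Brick struct {
	Host string
	Path string
}

// parseBrick splits a brick string at the first colon.
func parseBrick(s string) (Brick, error) {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return Brick{}, fmt.Errorf("brick %q is not in IP:brickpath form", s)
	}
	return Brick{Host: parts[0], Path: parts[1]}, nil
}

// parseCreateVolume decodes the body, applies the documented defaults
// (replica 1, transport TCP) and parses every brick.
func parseCreateVolume(body []byte) (CreateVolumeRequest, []Brick, error) {
	var req CreateVolumeRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, nil, err
	}
	if req.Name == "" || len(req.Bricks) == 0 {
		return req, nil, errors.New("name and bricks are required")
	}
	if req.Replica == 0 {
		req.Replica = 1
	}
	if req.Transport == "" {
		req.Transport = "TCP"
	}
	bricks := make([]Brick, 0, len(req.Bricks))
	for _, b := range req.Bricks {
		brick, err := parseBrick(b)
		if err != nil {
			return req, nil, err
		}
		bricks = append(bricks, brick)
	}
	return req, bricks, nil
}

func main() {
	body := []byte(`{"name": "test-vol", "bricks": ["10.0.0.1:/home/bricks/b1"]}`)
	req, bricks, err := parseCreateVolume(body)
	fmt.Println(req.Replica, req.Transport, bricks, err)
}
```

Note that Go's `encoding/json` rejects trailing commas, so a body must be strictly valid JSON to be accepted by this sketch.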
+
+#### Get volume
+- **Method** : `GET`
+- **Endpoint**: `/volumes/{id}`
+- **Request**:
+	- **Parameters** :
+		- `id` : ID of the volume. The name of the volume is also accepted.
+	- **Body** : *Empty*
+- **Response**:
+	- **Status code**: `200 OK`
+	- **Body** :
+	    - *id* : A string. UUID of the volume
+	    - *name* :  A string. Name of the volume
+	    - *type* : A string. Type of the volume which is any of the following
+	        - Distribute
+	        - Stripe
+	        - Striped-Replicate
+	        - Disperse
+	        - Tier
+	        - Distributed-Stripe
+	        - Distributed-Replicate
+	        - Distributed-Striped-Replicate
+	        - Distributed-Disperse
+	    - *stripe* :  An integer. Represents the stripe count for a stripe volume
+	    - *replica* : An integer. Represents the replica count for a replicate volume
+	    - *arbiter* : An integer. Represents the arbiter count for an arbiter volume
+	    - *disperse-data* : An integer. Represents the disperse-data count for a dispersed volume
+	    - *disperse-redundancy* : An integer. Represents the redundancy count for a dispersed volume
+	    - *transport* : A string, transport type of the volume
+	    - *options* : A map of key value strings. Represents volume tunables
+	    - *status* : A string. Status of the volume either "Created" or "Started" or "Stopped"
+	    - *version* : An integer. Version of the volume, internal to GlusterD
+	    - *checksum* : An integer. md5 checksum value of the volume configuration, internal to GlusterD
+	    - *bricks* : An array of JSON objects. Each object represents a brick with the following details:
+	        - *id* : A string. UUID of the peer to which this brick belongs
+	        - *hostname* : A string. Host/IP of the brick
+	        - *path* : A string. Path of the brick
+	    - Example :
+```json
+{
+    "id": "0bef87b3-82ba-11e5-9d59-3c970e9eb10d",
+    "name": "test-vol",
+    "type": "Distribute",
+    "stripe": 0,
+    "replica": 0,
+    "arbiter": 0,
+    "disperse-data": 0,
+    "disperse-redundancy": 0,
+    "transport": "TCP",
+    "options": {},
+    "status": "Created",
+    "checksum": 12345,
+    "version": 1,
+    "bricks": [
+        {
+            "id": "e5c26603-82bf-11e5-9d59-3c970e9eb10d",
+            "hostname": "x.x.x.x",
+            "path": "/gluster/brick1"
+        },
+        {
+            "id": "06c47cc3-82c0-11e5-9d59-3c970e9eb10d",
+            "hostname": "y.y.y.y",
+            "path": "/gluster/brick2"
+        }
+    ]
+}
+```
+#### Get volumes
+- **Method** : `GET`
+- **Endpoint**: `/volumes/`
+- **Request** :
+	- **Parameters** : *None*
+	- **Body** : *Empty*
+- **Response** :
+	- **Status code** : `200 OK`
+	- **Body** : A string-string JSON map, with volume-id as key and volume name as value.
+    - Example :
+```json
+{
+  "0bef87b3-82ba-11e5-9d59-3c970e9eb10d": "test-vol",
+  "eb6aaea1-1689-4a45-8dbf-d4c98f76d9c5": "vol1"
+}
+```
+
+
+#### Start volume
+- **Method** : `POST`
+- **Endpoint**: `/volumes/{name}/start`
+- **Request** :
+	- **Parameters** :
+		- `name`: Name of the volume
+	- **Body** : _Empty_
+- **Response** :
+	- **Status code** : `200 OK`
+	- **Body** : _Empty_
+
+#### Stop volume
+- **Method** : `POST`
+- **Endpoint**: `/volumes/{name}/stop`
+- **Request** :
+	- **Parameters** :
+		- `name` : Name or ID of volume
+	- **Body** : _Empty_
+- **Response** :
+	- **Status code** : `200 OK`
+	- **Body** : _Empty_
+
+#### Delete volume
+- **Method** : `DELETE`
+- **Endpoint**: `/volumes/{name}`
+- **Request** :
+	- **Parameters** :
+		- `name` : Name of the volume
+	- **Body** : _Empty_
+- **Response** :
+	- **Status code** : `204 No Content`
+	- **Body** : _Empty_
+
+[1]: https://github.com/heketi/heketi/wiki/API#authentication
+
diff --git a/design/GlusterD2/Txn-Framework.md b/design/GlusterD2/Txn-Framework.md
new file mode 100644
index 0000000..5b6ce40
--- /dev/null
+++ b/design/GlusterD2/Txn-Framework.md
@@ -0,0 +1,94 @@
+# Transaction framework
+
+The transaction framework will ensure that a given set of actions will be performed in order, one after another, on the required peers of the cluster.
+
+The new transaction framework will be built around the central store.
+A GD2 transaction will have the following characteristics,
+
+1. actions will only be performed where required, instead of being done across the whole cluster
+2. final committing into the store will only be done by the node where the transaction was initiated, instead of being done on all nodes.
+
+
+## Transaction
+
+A transaction is basically a collection of steps or actions to be performed in order.
+A transaction object provides the framework with the following,
+
+1. a list of nodes that will be a part of the transaction
+2. a set of transaction [steps](#transaction-step)
+
+Given this information, the GD2 transaction framework will,
+
+- verify if all the listed nodes are online
+- run each step on all of the nodes, before proceeding to the next step
+- if a step fails, undo the changes done by the step and all previous steps.
+
+The base transaction is basically free-form, allowing users to create any order of steps. This keeps it flexible and extensible enough to create complex transactions.
+
+### Transaction step
+A step is an action to be performed, most likely a function that needs to be run.
+A step object provides the following information,
+
+1. The function to be run
+2. The list of nodes the step should be run on.
+3. An undo function that reverts any changes done by the step.
+
+Each step can have its own list of nodes, so that steps can be targeted to specific nodes and provide more flexibility. The list of nodes can also be specified as ALL or LEADER (Leader indicates the originator node of the transaction) to target all nodes in the transaction or just the leader.
+
+## Simple transaction template
+
+Any user is free to create free-form transactions by providing their own set of steps.
+To make it easier to quickly create simple transactions, a simple transaction template will be provided.
+
+The template will be based on the existing four step transaction algorithm being used.
+The template requires users to provide the following information
+
+1. a list of nodes to run the transaction on
+2. an object within the GD2 store to obtain a lock on
+3. a staging function, which checks if the node can perform the transaction
+4. a perform function, which performs the transaction action on the node
+5. a rollback function, to undo changes done by the perform function
+6. a store function to save results, if required, into the store
+
+When following the common template, a transaction will occur as follows.
+
+1. the transaction framework verifies all listed nodes are online
+2. the initiator node obtains a lock on the specified object
+3. the staging function is run on the listed nodes to verify if the transaction can happen
+4. the perform function is run on the listed nodes to actually perform the operation
+5. the store function is run on the initiator to store results
+6. the initiator node unlocks the locked object
+
+If any of the above steps (except unlock) fails, the transaction is aborted.
+- If 1 or 2 fail, the transaction is aborted immediately.
+- If 3 fails, unlock is done and the transaction is aborted.
+- If 4 or 5 fail, the rollback function is run on all nodes, the lock is released, and the transaction is aborted.
+
+## Example
+_(in pseudo code)_
+```
+# NewSimpleTxn returns a transaction object following the simple transaction template
+NewSimpleTxn(nodes, lockobj, stage, commit, store, rollback) {
+  lockStep = Step{Func: DefaultLock(lockobj), Undo: DefaultUnlock(lockobj), Nodes: [LEADER]}
+  stageStep = Step{Func: stage, Undo: nil, Nodes: [ALL]}
+  commitStep = Step{Func: commit, Undo: rollback, Nodes: [ALL]}
+  storeStep = Step{Func: store, Undo: nil, Nodes: [LEADER]}
+  unlockStep = Step{Func: DefaultUnlock(lockobj), Undo: nil, Nodes: [LEADER]}
+
+  return Txn{Nodes:nodes, Steps:[lockStep, stageStep, commitStep, storeStep, unlockStep]}
+}
+
+# Running a simple transaction
+txn = NewSimpleTxn(["node1","node2"], "some-volume", stagefunc, commitfunc, storefunc, rollbackfunc)
+result, error = txn.Perform() # Perform will run the steps in order and return the result
+
+# Free-form transaction
+# Taking replace-brick as an example (this is not a 100% correct replace-brick transaction)
+nodes = ["source", "dest"]
+checkdest = Step{Func: CanBrickBeCreated, Undo: nil, Nodes: ["dest"]}
+updateVolume = Step{Func: UpdateVolumeWithNewInfo, Undo: RevertToOldVolume, Nodes: [LEADER]}
+stopsource = Step{Func: StopBrick, Undo: StartBrick, Nodes: ["source"]}
+startdest = Step{Func: StartBrick, Undo: StopBrick, Nodes: ["dest"]}
+replaceBrick = Txn{Nodes: nodes, Steps:[checkdest, updateVolume, startdest, stopsource]}
+res, err = replaceBrick.Perform()
+```
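The pseudo code above can be turned into a small runnable Go sketch. It is an illustration under assumptions, not the GD2 implementation: steps run locally instead of over RPC, the lock and unlock steps are left out, the step function signature `func(node string) error` is invented for the example, errors from undo functions are ignored, and `runDemo` with its injected failure exists only to show the undo behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// StepFunc is the signature assumed here for step and undo functions.
type StepFunc func(node string) error

// Special node names, as described for step node lists.
const (
	All    = "ALL"
	Leader = "LEADER"
)

// Step is one action with an optional undo and the nodes it targets.
type Step struct {
	Name  string
	Func  StepFunc
	Undo  StepFunc // may be nil
	Nodes []string
}

// Txn is an ordered list of steps over a set of nodes.
type Txn struct {
	Leader string
	Nodes  []string
	Steps  []Step
}

// targets expands ALL and LEADER into concrete node names.
func (t Txn) targets(s Step) []string {
	var out []string
	for _, n := range s.Nodes {
		switch n {
		case All:
			out = append(out, t.Nodes...)
		case Leader:
			out = append(out, t.Leader)
		default:
			out = append(out, n)
		}
	}
	return out
}

// Perform runs each step on all of its nodes before moving to the next.
// On failure it undoes the failed step and all previous steps.
func (t Txn) Perform() error {
	for i, s := range t.Steps {
		for _, node := range t.targets(s) {
			if err := s.Func(node); err != nil {
				t.undo(i)
				return fmt.Errorf("step %q failed on %s: %w", s.Name, node, err)
			}
		}
	}
	return nil
}

// undo runs the undo functions of steps last..0 in reverse order.
func (t Txn) undo(last int) {
	for i := last; i >= 0; i-- {
		s := t.Steps[i]
		if s.Undo == nil {
			continue
		}
		for _, node := range t.targets(s) {
			_ = s.Undo(node) // undo errors are ignored in this sketch
		}
	}
}

// runDemo builds a stage/commit/store transaction whose steps only record
// "name@node" entries, failing at the entry named by failAt (if any).
func runDemo(failAt string) ([]string, error) {
	var trace []string
	record := func(name string) StepFunc {
		return func(node string) error {
			entry := name + "@" + node
			trace = append(trace, entry)
			if entry == failAt {
				return errors.New("injected failure")
			}
			return nil
		}
	}
	txn := Txn{
		Leader: "node1",
		Nodes:  []string{"node1", "node2"},
		Steps: []Step{
			{Name: "stage", Func: record("stage"), Nodes: []string{All}},
			{Name: "commit", Func: record("commit"), Undo: record("rollback"), Nodes: []string{All}},
			{Name: "store", Func: record("store"), Nodes: []string{Leader}},
		},
	}
	err := txn.Perform()
	return trace, err
}

func main() {
	fmt.Println(runDemo(""))
	fmt.Println(runDemo("commit@node2"))
}
```

In the failing run, the commit step's rollback is invoked on both nodes, matching the rule that a failed step and all previous steps are undone; the stage step has no undo function and is skipped.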
diff --git a/design/GlusterD2/images/gd2-arch.png b/design/GlusterD2/images/gd2-arch.png
new file mode 100644
index 0000000..a557ece
--- /dev/null
+++ b/design/GlusterD2/images/gd2-arch.png