doc: add documentation for the "Brick Failure Detection" feature

The documentation from the feature page should be included in the sources.

Change-Id: I4fd67ce1c56afc5236c00de8be9110dfa6bbe91f
BUG: 1086700
Feature-page: http://www.gluster.org/community/documentation/index.php/Features/Brick_Failure_Detection
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-on: http://review.gluster.org/7449
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
author: Niels de Vos <ndevos@redhat.com> 2014-04-11 12:56:24 +0200
committer: Vijay Bellur <vbellur@redhat.com> 2014-04-11 17:40:12 -0700
commit: 1c079acf4e9ef121e5e22e12243f15b080ae5f65 (patch)
tree: 566ba7db5e79367f96795d3c77a8eb8e2821a337
parent: 0e7f8af0db8201ee892979713ac86d5548f5ec73 (diff)
1 files changed, 67 insertions, 0 deletions
diff --git a/doc/features/brick-failure-detection.md b/doc/features/brick-failure-detection.md
new file mode 100644
index 000000000..24f2a18f3
--- /dev/null
+++ b/doc/features/brick-failure-detection.md
@@ -0,0 +1,67 @@
+# Brick Failure Detection
+
+This feature attempts to identify storage/file system failures and disable the failed brick without disrupting the remainder of the node's operation.
+
+## Description
+
+Detecting failures on the filesystem that a brick uses makes it possible to handle errors that originate outside the Gluster environment.
+
+Brick processes have been seen to hang when the underlying storage of a brick became unavailable. A hanging brick process can still use the network and respond to clients, but actual I/O to the storage is impossible and can cause noticeable delays on the client side.
+
+The goal of this feature is to detect storage subsystem failures more reliably, so that brick processes do not hang when the storage hardware or the filesystem fails.
+
+A health-checker (thread) has been added to the posix xlator. This thread periodically checks the status of the filesystem, which implies checking that the storage hardware is functional.
+
+When the check fails, the brick process exits. `glusterd` detects that the brick process has exited, and `gluster volume status` shows that the brick process is no longer running. System administrators checking the logs should then be able to triage the cause.
+
+## Usage and Configuration
+
+The health-checker is enabled by default and runs a check every 30 seconds. This interval can be changed per volume with:
+
+    # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
+
+If `SECONDS` is set to 0, the health-checker will be disabled.
+
+## Failure Detection
+
+Errors are logged to the standard syslog (usually `/var/log/messages`):
+
+    Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512
+    Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem
+    Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
+    Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed
+    Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down
+    Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned.
+    Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM
+
+The messages labelled with `GlusterFS` in the above output are also written to the logs of the brick process.
+
+## Recovery After a Failure
+
+When a brick process detects that the underlying storage is no longer responding, the process exits. The brick process is not restarted automatically; the sysadmin needs to fix the problem with the storage first.
+
+After correcting the storage (hardware or filesystem) issue, the following command will start the brick process again:
+
+    # gluster volume start <VOLNAME> force
+
+## How To Test
+
+The health-checker thread that is part of each brick process will get started automatically when a volume has been started. Verifying its functionality can be done in different ways.
+
+On virtual hardware:
+
+* disconnect the disk from the VM that holds the brick
+
+On real hardware:
+
+* simulate a RAID-card failure by unplugging the card or cables
+
+On a system that uses LVM for the bricks:
+
+* use device-mapper to load an error-table for the disk, see [this description](http://review.gluster.org/5176).
+
+On any system (by writing to random offsets of the block device; more difficult to trigger):
+
+1. cause corruption on the filesystem that holds the brick
+2. read contents from the brick, hoping to hit the corrupted area
+3. the filesystem should abort after hitting a bad spot, and the health-checker should notice that shortly afterwards
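For the LVM test case in the documentation above, one way to load an error table is sketched below. This is not taken from the referenced review. It is a minimal sketch that assumes root privileges and a brick on a device-mapper device under `/dev/mapper`. The helper names `error_table`, `inject_failure` and `restore_table` and the backup path under `/tmp` are hypothetical. Only the `dmsetup` and `blockdev` invocations are standard tools.

```shell
#!/bin/sh
# Sketch only: make all I/O to a device-mapper device fail by swapping
# its table for a single "error" target. Run on a test system, as root.

# Print an error table covering a device of the given size.
# The size is expressed in 512-byte sectors, as device-mapper expects.
error_table() {
    printf '0 %s error\n' "$1"
}

# Replace the live table of the named device with the error table.
# The original table is saved first so it can be restored later.
inject_failure() {
    dm_name=$1
    sectors=$(blockdev --getsz "/dev/mapper/$dm_name") || return 1
    dmsetup table "$dm_name" > "/tmp/$dm_name.table.orig" || return 1
    dmsetup suspend "$dm_name" &&
        dmsetup load "$dm_name" --table "$(error_table "$sectors")" &&
        dmsetup resume "$dm_name"
}

# Load the saved table again so the device maps to real storage.
restore_table() {
    dm_name=$1
    dmsetup suspend "$dm_name" &&
        dmsetup load "$dm_name" "/tmp/$dm_name.table.orig" &&
        dmsetup resume "$dm_name"
}
```

As a usage example, `inject_failure vg_bricks-brick1` (a made-up device name) should make the next health check fail, which with the default interval happens within about 30 seconds. After `restore_table vg_bricks-brick1`, the filesystem will most likely need to be unmounted and mounted again before `gluster volume start <VOLNAME> force` can bring the brick back.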