From 97defef2375b911c7b6a3924c242ba8ef4593686 Mon Sep 17 00:00:00 2001
From: Pranith Kumar K <pkarampu@redhat.com>
Date: Wed, 28 Jun 2017 09:10:53 +0530
Subject: features/shard: Remove ctx from LRU in shard_forget

Problem:
There is a race when the following two commands are executed on the mount in
parallel from two different terminals on a sharded volume,
which leads to use-after-free.

Terminal-1:
while true; do dd if=/dev/zero of=file1 bs=1M count=4; done

Terminal-2:
while true; do cat file1 > /dev/null; done

In the normal case this is the life-cycle of a shard-inode
1) Shard is added to LRU when it is first looked-up
2) For every operation on the shard it is moved up in LRU
3) When "unlink of the shard"/"LRU limit is hit" happens it is removed from LRU

But we are seeing a race where the inode stays in Shard LRU even after it is
forgotten which leads to Use-after-free and then some memory-corruptions.

These are the steps:
1) Shard is added to LRU when it is first looked-up
2) For every operation on the shard it is moved up in LRU

Reader-handler                                    Truncate-handler
1) Reader handler needs shard-x to be read.     1) Truncate has just deleted shard-x
2) In shard_common_resolve_shards(), it does
   inode_resolve() and that leads to
   a hit in LRU, so it is going to call
__shard_update_shards_inode_list() to move the
   inode to top of LRU
						2) shard-x gets unlinked from the itable
						   and inode_forget(inode, 0) is called
						   to make sure the inode can be purged
						   upon last unref
3) when __shard_update_shards_inode_list() is
   called it finds that the inode is not in LRU
   so it adds it back to the LRU-list

Both these operations complete and call inode_unref(shard-x) which leads to the inode
getting freed and forgotten, even when it is in Shard LRU list. When more inodes are
added to LRU, use-after-free will happen and it leads to undefined behaviors.

Fix:
I see that the inode can be removed from LRU even by the protocol layers like gfapi/gNFS
when LRU limit is reached. So it is better to add a check in shard_forget() to remove itself
from LRU list if it exists.

BUG: 1466037
Change-Id: Ia79c0c5c9d5febc56c41ddb12b5daf03e5281638
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: https://review.gluster.org/17644
Smoke: Gluster Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>
---
 tests/bugs/shard/parallel-truncate-read.t | 48 +++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 tests/bugs/shard/parallel-truncate-read.t

(limited to 'tests')

diff --git a/tests/bugs/shard/parallel-truncate-read.t b/tests/bugs/shard/parallel-truncate-read.t
new file mode 100644
index 00000000000..4de876f58f6
--- /dev/null
+++ b/tests/bugs/shard/parallel-truncate-read.t
@@ -0,0 +1,48 @@
+#!/bin/bash
+
+#This test will crash if shard's LRU contains a shard's inode even after the
+#inode is forgotten. Minimum time for crash to happen I saw was 180 seconds
+
+. $(dirname $0)/../../include.rc
+
+function keep_writing {
+        cd $M0;
+        while [ -f /tmp/parallel-truncate-read ]
+        do
+                dd if=/dev/zero of=file1 bs=1M count=16
+        done
+        cd
+}
+
+function keep_reading {
+        cd $M0;
+        while [ -f /tmp/parallel-truncate-read ]
+        do
+                cat file1 > /dev/null
+        done
+        cd
+}
+
+cleanup;
+
+TEST touch /tmp/parallel-truncate-read
+TEST glusterd
+TEST pidof glusterd
+TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0,1,2}
+TEST $CLI volume set $V0 features.shard on
+TEST $CLI volume set $V0 performance.quick-read off
+TEST $CLI volume set $V0 performance.io-cache off
+TEST $CLI volume set $V0 performance.stat-prefetch off
+TEST $CLI volume set $V0 performance.read-ahead off
+TEST $CLI volume start $V0
+
+TEST $GFS --volfile-id=$V0 --volfile-server=$H0 $M0
+keep_writing &
+keep_reading &
+sleep 180
+TEST rm -f /tmp/parallel-truncate-read
+wait
+#test that the mount is operational
+TEST stat $M0
+
+cleanup;
-- 
cgit