Feature
-------

server-side md-cache

Summary
-------

Two years ago, Peter Portante noticed the extremely high number of
system calls required per Swift object on an XFS brick. Since then, he
and Ben England have observed several similar cases.

More recently, while looking at a **netmist** single-thread workload run
by a major banking customer to characterize Gluster performance, Ben
observed this [per-file system call
profile](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf).
It is strong evidence of several problems with the POSIX translator:

-   repeated polling of the **gfid** xattr with **sys\_lgetxattr**
-   repeated **sys\_lstat** calls
-   polling of xattrs that were *undefined*
-   calling **sys\_llistxattr** to get the list of all xattrs *after*
    all other calls
-   calling **sys\_lgetxattr** twice per xattr, once to find out how big
    the value is and once to get the value (see the sketch below)
-   one-at-a-time calls to get individual xattrs
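
To make the last two points concrete, here is a minimal sketch (not
taken from the Gluster sources) of the portable way to read a single
xattr: it costs two **sys\_lgetxattr** calls per attribute and has to be
repeated for every attribute of interest.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Read one xattr the portable way: one lgetxattr() call to learn the size
 * of the value, a second one to fetch it.  Error handling is minimal. */
char *
read_one_xattr (const char *path, const char *name, ssize_t *len_out)
{
        ssize_t len = lgetxattr (path, name, NULL, 0);   /* size probe */
        if (len < 0)
                return NULL;                 /* e.g. ENODATA: not present */

        char *buf = malloc (len);
        if (!buf)
                return NULL;

        len = lgetxattr (path, name, buf, len);          /* actual fetch */
        if (len < 0) {
                free (buf);
                return NULL;
        }

        *len_out = len;
        return buf;
}
```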

All of the problems except for the last one could be solved through the
use of a metadata cache associated with each inode. The last problem is
not solvable with a pure POSIX API at this time, although XFS offers an
**ioctl** that can get all xattrs at once (the cache could conceivably
detect whether the brick is on XFS and exploit this where available).

Note that this cost grows as more xattrs are added to the system, and as
Gluster adds new features, these typically require that state be kept
associated with a file, usually in one or more xattrs.

Owners
------

TBS

Current status
--------------

There is already an **md-cache** translator, so you would expect
problems like this not to occur, but clearly they do: that translator is
typically loaded on the client side of the protocol, above translators
such as AFR and DHT. The problems may be worse in cases where the
md-cache translator is not present at all (for example, SMB with the
gluster-vfs plugin, which requires the *stat-prefetch* volume option to
be set to *off*).

Related Feature Requests and Bugs
---------------------------------

-   [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md)
-   bugzillas TBS

Detailed Description
--------------------

This proposal has changed as a result of discussions in
\#gluster-meeting: instead of modifying the POSIX translator, we propose
to load the md-cache translator on the server, above the POSIX
translator, and to add negative-caching capabilities to the md-cache
translator.

By "negative caching" we mean that md-cache can tell you if the xattr
does not exist without calling down the translator stack. How can it do
this? In the server side, the only path to the brick is through the
md-cache translator. When it encounters a xattr get request for a file
it has not seen before, the first step is to call down with llistxattr()
to find out what xattrs are stored for that file. From that point on
until the file is evicted from the cache, any request for non-existent
xattr values from higher translators will immediately be returned with
ENODATA, without calling down to POSIX translator.
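
A minimal, self-contained sketch of that negative-caching check is shown
below; the structure and function names are hypothetical and are not the
actual md-cache data structures.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define XATTR_NAMES_BUF 4096

/* Hypothetical per-inode cache entry: the NUL-separated list of xattr
 * names returned by one llistxattr() call on the brick file. */
struct xattr_name_cache {
        bool   populated;               /* llistxattr() has been done      */
        char   names[XATTR_NAMES_BUF];  /* "user.a\0trusted.gfid\0..."     */
        size_t names_len;
};

/* Return 0 if the xattr is known to exist (the value may still have to be
 * fetched), -ENODATA if it is known not to exist (negative hit, answered
 * without winding down), or -EAGAIN if the entry is not yet populated. */
static int
cache_check_xattr (const struct xattr_name_cache *c, const char *name)
{
        if (!c->populated)
                return -EAGAIN;

        for (size_t off = 0; off < c->names_len;
             off += strlen (&c->names[off]) + 1)
                if (strcmp (&c->names[off], name) == 0)
                        return 0;

        return -ENODATA;
}
```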

We must ensure that the cache does not leak memory and that concurrent
access from multiple threads does not introduce race conditions, but
this seems like a manageable problem and is certainly not a new one for
Gluster translator code.

Benefit to GlusterFS
--------------------

Most of the system calls, and about 50% of the elapsed time, could have
been removed from the small-file read profile above through the use of
this cache. The benefit will become more visible as we transition to SSD
storage, where disk seek times no longer mask overheads like these.

Scope
-----

This can be done locally within the glusterfsd process by inserting the
md-cache translator just above the POSIX translator, which is where the
vast majority of the stat, getxattr, and setxattr calls are generated.
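
For illustration, a hand-written brick-side volume-file fragment with
this stacking might look roughly as follows; the brick path is a
placeholder, and a generated volfile would contain additional
translators between these two.

```
volume brick-posix
    type storage/posix
    option directory /bricks/brick1     # placeholder brick path
end-volume

volume brick-md-cache
    type performance/md-cache
    subvolumes brick-posix
end-volume
```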

### Nature of proposed change

No new translators are required. Where existing translators make heavy
use of sys\_\*xattr calls themselves, we may require them to call down
the stack ("wind a FOP") instead, so that they can take advantage of the
stat-xattr cache; a sketch of that change follows.
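
As a rough sketch of what "winding a FOP" looks like: header paths
assume an installed glusterfs-devel layout, and the fop signature is
quoted from memory, so treat it as an assumption rather than a spec.

```c
#include <glusterfs/xlator.h>
#include <glusterfs/defaults.h>

/* Instead of calling sys_lgetxattr() on the brick path directly, wind the
 * getxattr fop to the child translator; a server-side md-cache below us
 * can then answer from its cache.  default_getxattr_cbk unwinds the reply
 * back up the stack unchanged. */
int32_t
example_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc,
                  const char *name, dict_t *xdata)
{
        STACK_WIND (frame, default_getxattr_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->getxattr,
                    loc, name, xdata);
        return 0;
}
```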

It is *really important* that md-cache use listxattr() to determine up
front which xattrs are on disk, thereby avoiding needless getxattr
calls. At present it does not do this.

### Implications on manageability

None. We need to make sure that the cache is big enough to support the
threads that use it, but not so big that it consumes a significant
percentage of memory. We may want to make the cache size and expiration
time tunables so that we can experiment in performance testing to
determine optimal values.
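
As an illustration, the expiration time already has a volume option
today, while a size limit for the server-side cache would be new; the
second option name below is hypothetical.

```sh
# existing md-cache option (seconds a cached entry is trusted)
gluster volume set <volname> performance.md-cache-timeout 60

# hypothetical new option for the server-side cache size
gluster volume set <volname> performance.md-cache-size 64MB
```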

### Implications on presentation layer

Translators above the md-cache translator are not affected.

### Implications on persistence layer

None.

### Implications on 'GlusterFS' backend

None

### Modification to GlusterFS metadata

None

### Implications on 'glusterd'

None

How To Test
-----------

We can strace a single-thread smallfile workload to verify that the
cache is filtering out excess system calls. We could also add counters
to the cache to measure the cache hit rate.
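
For example (the brick PID, mount point, and smallfile invocation below
are placeholders, and the exact syscall names depend on the kernel and
libc in use):

```sh
# count metadata syscalls made by the brick process during the run;
# stop strace with Ctrl-C afterwards to print the per-syscall summary
strace -c -f -p <glusterfsd-pid> \
       -e trace=lstat,lgetxattr,lsetxattr,llistxattr &

# single-thread smallfile create workload against the client mount point
python smallfile_cli.py --top /mnt/glusterfs/smf --threads 1 \
       --files 10000 --file-size 4 --operation create
```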

User Experience
---------------

Single-thread small-file creates should be faster, particularly on SSD
storage. Performance testing is needed to quantify this further.

Dependencies
------------

None

Documentation
-------------

None, except for tunables relating to cache size and expiration time.

Status
------

Not started.

Comments and Discussion
-----------------------

Jeff Darcy: I've been saying for ages that we should store xattrs in a
local DB and avoid local xattrs altogether. Besides performance, this
would also eliminate the need for special configuration of the
underlying local FS (to accommodate our highly unusual use of this
feature) and generally be good for platform independence. Not quite so
sure about other stat(2) information, but perhaps I could be persuaded.
In any case, this has led me to look into the relevant code on a few
occasions. Unfortunately, there are *many* places that directly call
sys\_\*xattr instead of winding fops - glusterd (for replace-brick),
changelog, quota, snapshots, and others. I think this feature is still
very worthwhile, but all of the "cheating" we've tolerated over the
years is going to make it more difficult.

Ben England: a local DB might be a good option but could also become a
bottleneck, unless you have a DB instance per brick (local) filesystem.
One problem that the DB would solve is getting all the metadata in one
query - at present POSIX API requires you to get one xattr at a time. If
we implement a caching layer that hides whether a DB or xattrs are being
used, we can make it easier to experiment with a DB (LevelDB?). On your
2nd point, while it's true that there are many sites that call
sys\_\*xattr directly, only a few of these really generate a lot of
system calls. For example, some of these calls are only for the
mountpoint. From a performance perspective, as long as we can intercept
the vast majority of the sys\_\*xattr calls with this caching layer,
IMHO we can tolerate a few exceptions in glusterd, etc. However, from a
CORRECTNESS standpoint, we have to be careful that calls bypassing the
caching layer don't cause cache contents to become stale (out-of-date,
inconsistent with the on-disk brick filesystem contents).