1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
|
Feature
-------
Implement negative lookup cache and hence improve create performance
Summary
-------
Before creating any file lookups(1 in Fuse, 5-6 in SMB etc.) are sent to verify if the file
already exists. Serving these lookup from the cache when possible, increases the create
performance by multiple folds in SMB access and by some percentage in Fuse/NFS access.
A lookup that fails with ENOENT is called as negative lookup.
Owners
------
Poornima G <pgurusid@redhat.com>
Pranith Kumar Karampuri <pkarampu@redhat.com>
Rajesh Joseph <rjoseph@redhat.com>
Current status
--------------
In design phase.
Related Feature Requests and Bugs
---------------------------------
Detailed Description
--------------------
### Design:
To identify if the lookup is on a non existant file, we need to compare the filename in lookup
against filenames in the parent directory. Hence, we have to have some kind of dentry cache and
a way to ensure consistency of the cache.
#### Which cache consistency model?
1. Cache entries and use directory leases for consistency
Directory RW lease is taken by any client that is caching the existing entries and creating new
entries in the directory. The lease is broken when any other client tries to create an entry in
the directory. With this approach we need to first implement directory leases and then the
negative lookup cache xlator. Directory leases implementation in the current dht setup becomes
very complex as the directories are replicated on all the bricks and leases should be acquired
on all the bricks. Since lease will be a in memory state on the bricks it becomes harder to heal
the states and handle disconnects, network partitions etc. Hence the approach 2.
2. Cache entries and use upcall for consistency
Directory entries are cached, whenever a any other client is creating entries, the cache is
invalidated and built a fresh. With this there is a small window when the file was actually
created and the upcall has not yet reached the client. In this window, the cache can return
lookup as ENOENT though the file existed at that point on the bricks. But this window can happen
even otherwise because we do not serialize fops from multiple clients. If there are applications
which poll on a file being created, it is not a suitable use case for caching. Hence going with
this approach.
#### What to cache?
Positive entries (entries that are already present on disk) and/or negative entries (entries that
were looked up and found to be ENOENT). The names of the entries are cached, but it can be plugged
to cache the hash of the name ([1] suggested by Pranith). The cache is bound by the upper limit
configured by the user.
The positive entries(PE):
- Populated as a part of readdirp, and as a part of mkdir followed by creates inside that directory.
Lookups and other fops do not populate the positive entry (as it can grow long and is of no value
add)
- Freed on recieving upcall(with dentry change flag) or on expiring timeout of the cache.
The negative entries(NE):
- Populated only when lookup/stat returns ENOENT. Fuse mostly sends only one lookup before create,
hence negative entry cache is almost useless. But for SMB access, multiple lookups/stats are sent
before creating the file. Hence the negative entry cache. It can exist even when the positive
entry cache is invalid.
- Freed on recieving upcall(with dentry change flag) or on expiring timeout of the cache.
PE list and NE list can exist independent of each other and can be optionally disabled.
#### What fops gets served from the cache?
1. Lookup/stat if its a negative lookup
2. Getxattr on get_real_filename (this is a case insensitive lookup)
Pseudocode for lookup/stat and getxattr fops:
lookup/stat () {
if inode is found in inode table for this name then goto out
if (parent inode state == NE_VALID)
if found in NE list
STACK_UNWIND (ENOENT)
else
goto out
if (parent inode state == PE_VALID)
if not found in PE list
STACK_UNWIND (ENOENT)
else
goto out
else if (parent inode state == PE_PARTIAL || PE_INVALID)
goto out
out:
STACK_WIND
}
getxattr (get_real_filename) {
if (inode state == PE_VALID)
if (found in case insensitive PE search)
STACK_UNWIND (found filename)
else
if (inode state == PE_FULL)
STACK_UNWIND (ENOENT)
else if (inode state == PE_PARTIAL)
STACK_WIND
}
#### Data structures to store the cache?
First thought would be to have a linked list of filenames. String compare of all the PE and NE list
is CPU consuming and may cause performance hit in some cases. Lookups can be catagorised to: Fresh
+ve lookup, Subsequent lookups(+ve), Negative lookups.
Subsequent lookups: As the inode will be present in inode table we need not search PE/NE list,
hence do not suffer performance hit.
Fresh +Ve lookups: Worst case it will have to compare it with every entry in the NE and PE list.
This will cause some performance hit.
Fresh -Ve lookups: Worst case it will have to compare it with every entry in the NE list to find
that its not there, and PE is INVALID/PARTIAL hence has to goto bricks anyways.
(PE_FULL is not a bad case, as it can return from the cache instead of network fop)
Fresh +ve lookup will have some performance impact, hence we need something better than sequential
string compare of names:
- Store the hash instead of names as suggested in [1]. If this is the case get_real_filename cannot
be served, unless we store two hashes (case as is- for lookups, and lower case hash- for
get_real_filename)
- Store it in htree, so that the search time reduces but adding and removing from a tree is not
O(1) operation.
- Set an upper limit on the number of NE and PE in the linked list.
For the initial implementation going with the linked list for negative entries.
List of inode refs for positive entries.
#### Other things to note?
- The timeout can be for the whole of the inode cache, in that case any further update of the cache
will not not change the cache time. Or the timeout can be per entry if the PE/NE list, but this can
lead to lot of bookkeeping and timers. The timer expiry cleanup can be as a part of different reaper
thread or we can register a time handler for each inode in the timerwheel (as it is scalse well).
[1] http://lists.gluster.org/pipermail/gluster-users/2016-August/028107.html
Benefit to GlusterFS
--------------------
Improves file/directory create performance.
Scope
-----
#### Nature of proposed change
- Adds a new xlator and glusterd canges to enable that xlator
#### Implications on manageability
N/A
#### Implications on presentation layer
N/A
#### Implications on persistence layer
N/A
#### Implications on 'GlusterFS' backend
N/A
#### Modification to GlusterFS metadata
N/A
#### Implications on 'glusterd'
GlusterD changes are integral to this feature, and described above.
How To Test
-----------
For the most part, testing is of the "do no harm" sort; the most thorough test
of this feature is to run our current regression suite.
Some specific test cases include create on all kind of volumes and from multiple clienats:
- distribute
- replicate
- shard
- disperse
- tier
Also, create while:
- rebalance in progress
- tiering migration in progress
- self heal in progress
And all the test cases being run while the memory consumption of the process
is monitored.
User Experience
---------------
Faster file or directory creates
Dependencies
------------
N/A
Documentation
-------------
TBD (very little)
Status
------
Development in progress
Comments and Discussion
-----------------------
N/A
|