Feature
-------

Summary
-------

Introducing a new file-based snapshot feature in Gluster, based on reflinks.

Owners
------
Prasanna Kumar Kalever <pkalever@redhat.com>

Current status
--------------
Proposed, awaiting approval.

Related Feature Requests and Bugs
---------------------------------
Currently we have volume-based snapshots and qcow2-based file snapshots in Gluster.
The qcow2 snapshot feature is limited to taking snapshots of files that are themselves of type qcow2;
what about VMDK, VDI, or other formats like txt, mk, md, gif, docx, xlsx, pptx, etc.?


Detailed Description
--------------------

#### What is a reflink?

You have surely used softlinks and hardlinks every day!

Reflinks support transparent copy-on-write, unlike soft/hard links, which makes them useful for snapshotting. A reflink points to the same data blocks that are used by the actual file (the blocks are shared between the real file and the reflink file, hence it is space-efficient), but it uses a different inode number, so it can have different permissions for accessing the same data blocks. Although reflinks may look similar to hardlinks, they are more space-efficient and can handle all operations that can be performed on a regular file, unlike hardlinks, which are limited to unlink().

Which filesystems support reflinks?
I think Btrfs was the first to implement them; XFS is now working hard to make them available, and in the future we may see them in ext4 as well.


You can get a feel for reflinks by following the tutorial at
https://pkalever.wordpress.com/2016/01/22/xfs-reflinks-tutorial/


##### POC in Gluster:
https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2

#### How are we doing it?
Currently there is no dedicated system call that exposes reflinks, so for now I decided to go with an ioctl call using the XFS_IOC_CLONE command.
But in the near future we can use sys_copy_file_range to do this job.
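To give a feel for the call, here is a minimal userspace sketch that clones one file into another. It assumes a backend filesystem with reflink support (XFS with reflink enabled, or Btrfs); XFS_IOC_CLONE carries the same value and semantics as the generic FICLONE from linux/fs.h, which the sketch uses. File names are placeholders.

```c
/*
 * Minimal sketch: clone (reflink) the contents of src into dst.
 * Assumes a filesystem with reflink support; FICLONE (linux/fs.h)
 * is the generic form of XFS_IOC_CLONE.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>          /* FICLONE */

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* dst now shares src's data blocks; later writes trigger copy-on-write. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("ioctl(FICLONE)");
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}
```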

In the POC I have used setxattr/getxattr to create/delete/list snapshots. The restore feature will use setxattr as well.

We can have a FOP for this. Although FUSE doesn't understand this ioctl, we will manage with a setxattr at the FUSE mount point; from the client side it travels as a FOP down to the posix xlator, which then issues the ioctl to the underlying filesystem. We plan to expose APIs for create, delete, list and restore.
Once the system call is ready we can switch to it easily.
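As an illustration of the setxattr-based trigger, here is a hedged sketch of what an application on the FUSE mount could do. The xattr key "glusterfs.fsnap.create", the mount path and the snapshot name are assumptions made up for this sketch; the real key names are not finalized yet.

```c
/*
 * Hypothetical sketch: requesting a snapshot from a FUSE mount via setxattr.
 * The xattr key "glusterfs.fsnap.create" and the paths are illustrative
 * assumptions; actual key names are not finalized in this proposal.
 */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *path     = "/mnt/glusterfs/vm-disk.img"; /* file on the mount */
    const char *snapname = "snap-001";                   /* snapshot name     */

    /* The FUSE client turns this into a FOP; the posix xlator would then
     * issue the reflink ioctl against the backend filesystem. */
    if (setxattr(path, "glusterfs.fsnap.create", snapname,
                 strlen(snapname), 0) < 0) {
        perror("setxattr");
        return 1;
    }

    printf("snapshot '%s' requested for %s\n", snapname, path);
    return 0;
}
```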

Are these snapshots internal or external?
We will have a separate file each time we create a snapshot; the snapshot file will have a different inode number and will be read-only. All these files are kept in a ".fsnap/" directory inside the parent directory where the snapshotted/actual file resides, so they will not be visible to the user (not even with ls -a, just like USS).


*** We can always restore to any snapshot available in the list, and the best part is that we can delete any snapshot between snapshot1 and snapshotN because all of them are independent ***

It is the application's duty to ensure the consistency of the file before it tries to create a snapshot; for example, in the case of a VM file snapshot it is the hypervisor that should freeze the I/O and then request the snapshot
(or we can use something like QEMU hooks to do this on the fly; this still needs investigation).

### Integration with Gluster:

#### Quota:
Since the snapshot files reside in the ".fsnap/" directory, which lives inside the same directory as the actual file, they fall under the same user's quota :)

#### DHT:
As said above, the snapshot files will reside in the same directory where the actual file resides, namely inside its ".fsnap/" directory.

#### Re-balancing:
The simplest solution could be: copy the actual file as a whole, then for the snapshot files rsync only the deltas and recreate the snapshot history by repeating the snapshot sequence after each snapshot-file rsync.

Or we can experiment with the dedupe ioctl to check the possibilities (see the sketch below).
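For reference, the sketch below shows how the generic dedupe ioctl (FIDEDUPERANGE, the VFS form of XFS_IOC_FILE_EXTENT_SAME) could be driven from userspace to re-share identical blocks after a rebalance copy. This is only an experiment aid, not part of the design; the paths are illustrative, and offsets/lengths generally need block alignment except at EOF.

```c
/*
 * Hedged sketch of the dedupe ioctl: ask the kernel to share identical
 * extents between two files. Requires a filesystem with dedupe support.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>      /* FIDEDUPERANGE, struct file_dedupe_range */

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(src, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* One destination entry appended after the fixed-size header. */
    struct file_dedupe_range *range =
        calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
    if (!range)
        return 1;
    range->src_offset          = 0;
    range->src_length          = st.st_size;
    range->dest_count          = 1;
    range->info[0].dest_fd     = dst;
    range->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, range) < 0) {
        perror("ioctl(FIDEDUPERANGE)");
        return 1;
    }

    if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)range->info[0].bytes_deduped);
    else
        printf("ranges differ; nothing deduped\n");

    free(range);
    close(src);
    close(dst);
    return 0;
}
```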

By the time we start development we will have "XFS reverse-mapping, reflink, and dedupe support" (I have seen the patches and they are in the testing phase).

#### AFR:
This will mostly be the same as a write FOP (inodelk's and quorum). There would be no way to recover or recreate a snapshot on a node (brick, to be precise) that was down while the snapshot was taken and comes back later in time.
We do not heal snapshot files.

#### Disperse:
Mostly: take the inodelk and snapshot the file on each of the bricks; that should work.

#### Sharding:
Assume we have a file split into 4 shards. If the snapshot-create FOP is sent to all the subvols holding the shards, that would be sufficient: each shard will then have a snapshot of its state.
The snapshot-list FOP should be sent only to the main subvol where shard 0 resides.
Deleting a snapshot should be similar to creating one.
Restore would be a little more difficult because the metadata of the file needs to be updated in the shard xlator.
<Needs more investigation>
Also, in the case of sharding, the bricks have a gfid-based flat filesystem. Hence the snapshots created will also land in the shard directory, so quota is not straightforward and needs additional work in this case.



Benefit to GlusterFS
--------------------

Gluster can have file-based snapshots independent of file type, hence every file can maintain revisions created by taking snapshots at virtually no cost.

Scope
-----

#### Nature of proposed change

A FOP will be added.

#### Implications on manageability

Introduces new setxattr options.

#### Implications on presentation layer

It will be a setxattr on the FUSE side.

In libgfapi we will have APIs for create/list/delete/apply snapshots.

#### Implications on persistence layer

Requires reflink support in the underlying filesystem.

#### Implications on 'GlusterFS' backend

On calling setxattr to create a snapshot of a file, a new file that consumes no extra data blocks (only an inode, as a reflink) will be created in the ".fsnap" directory with the name '<filename>-timestamp'; it has no gfid allocated to it,
so the posix xlator has to filter it out when someone calls readdir <Needs more investigation>.
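To illustrate the filtering rule, here is a minimal standalone sketch using plain readdir(3) that hides the ".fsnap" entry from a listing. The actual filtering would live in the posix xlator's readdir/readdirp code path; this program only demonstrates the idea.

```c
/*
 * Minimal sketch of the filtering rule: skip the ".fsnap" entry so
 * snapshot files never show up in directory listings.
 */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : ".";

    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Hide the snapshot directory so it stays invisible to users. */
        if (strcmp(entry->d_name, ".fsnap") == 0)
            continue;
        printf("%s\n", entry->d_name);
    }

    closedir(dir);
    return 0;
}
```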

#### Modification to GlusterFS metadata

None.

#### Implications on 'glusterd'

None.

How To Test
-----------

Test cases are required to verify the create/list/delete/apply snapshot operations.

User Experience
---------------

Snapshots can be taken on any file (not limited to the qcow2 file type), hence users can maintain revisions of any file at virtually zero cost.

Dependencies
------------

None.

Documentation
-------------

TBD

Status
------

##### POC is ready
https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2

Comments and Discussion
-----------------------

How can we make it better?