# CloudArchival-Design.md This document gives a high level overview of CloudArchival. The design is being refined as we go along, and this document will be updated along the way. ## Introduction This design solves the usecase where data that requires high-speed access is retained internally i.e. Glusterfs and lower-priority data is moved to a low-cost cloud-based archive storage. This will allow reduction in storage cost for usecases where a majority of data is cold and can be archived. ## Architectural Overview CloudeArchival has two components. A scanner/uploader tool and a downloader xlator in Glusterfs stack. ### 1. Scanner/uploader This tool will scan the file system and based on a policy, will upload the data to a predecided Cloud Storage. The policy can be user defined. A simple example would be, upload any file that has not been accessed for one month. ### 2. Downloader This xlator will download the file from Cloud-Storage when an access for read/write (basically any data modification) request is made. This xlator will be placed on the client side as AFR and EC xlators are client xlators. ## Work Flow - Phase I - Post scanning, the uploader will filter out files to be archived to Cloud. Once the data migration is complete to Cloud, the uploader will do a setxattr operation on the file to inform the downloader xlator to truncate the data. As part of this maintenance, downloader will store the size information as an xattr on the file to serve lookup/stat etc and then will truncate the data. - Phase II - While the data resides on Cloud, all meta-data operation can be performed locally on Glusterfs. The data will be downloaded only when a data modification is requested. For read/write request, the downloader will stub the request and start downloading the file from Cloud. Upon successful download, the stubbed request will be resumed. ## Cloud Information and Security Cloud information like which Cloud provider and it's access information can be stored per volume basis through Glusterd. There can only one cloud storage be attached to a volume. Since the communication channel to Cloud needs to be secured, the access information for Cloud should and must reside on the trusted storage pool. GF-proxy fits this requirement nicely as it runs on the trusted storage pool (as for now). Hence, the downloader will be part of GF-proxy daemon on the trusted storage pool. #### Note: Initial implementation will integrate with Amazon Web Service (AWS). Integration with other Cloud Storage will be left open for development to the community.