From 5fdd65f5f4f5df1d28b0fb4f7efed226d5db1b3c Mon Sep 17 00:00:00 2001
From: M S Vishwanath Bhat
Date: Fri, 24 Feb 2012 13:18:56 +0530
Subject: renaming hdfs -> glusterfs-hadoop

Change-Id: Ibb937af1231f6bbed9a2d4eaeabc6e9d4000887f
BUG: 797064
Signed-off-by: M S Vishwanath Bhat
Reviewed-on: http://review.gluster.com/2811
Tested-by: Gluster Build System
Reviewed-by: Vijay Bellur
---
 glusterfs-hadoop/README | 182 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)
 create mode 100644 glusterfs-hadoop/README

diff --git a/glusterfs-hadoop/README b/glusterfs-hadoop/README
new file mode 100644
index 000000000..3026f11c0
--- /dev/null
+++ b/glusterfs-hadoop/README
@@ -0,0 +1,182 @@

GlusterFS Hadoop Plugin
=======================

INTRODUCTION
------------

This document describes how to use GlusterFS (http://www.gluster.org/) as a backing store with Hadoop.


REQUIREMENTS
------------

* Supported OS is GNU/Linux
* GlusterFS and Hadoop installed on all machines in the cluster
* Java Runtime Environment (JRE)
* Maven (needed if you are building the plugin from source)
* JDK (needed if you are building the plugin from source)

NOTE: The plugin relies on two *nix command line utilities to function properly. They are:

* mount: used to mount GlusterFS volumes.
* getfattr: used to fetch extended attributes of a file.

Make sure they are installed on all hosts in the cluster and that their locations are in the
$PATH environment variable.


INSTALLATION
------------

** NOTE: The example below is for Hadoop version 0.20.2 ($GLUSTER_HOME/hdfs/0.20.2) **

* Building the plugin from source [Maven (http://maven.apache.org/) and a JDK are required
  to build the plugin]

  Change to the glusterfs-hadoop directory in the GlusterFS source tree and build the plugin.

    # cd $GLUSTER_HOME/hdfs/0.20.2
    # mvn package

  On a successful build the plugin will be present in the `target` directory.
  (NOTE: the version number will be a part of the plugin file name)

    # ls target/
    classes  glusterfs-0.20.2-0.1.jar  maven-archiver  surefire-reports  test-classes
             ^^^^^^^^^^^^^^^^^^^^^^^^

  Copy the plugin to the lib/ directory in your $HADOOP_HOME dir.

    # cp target/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib

  Copy the sample configuration file that ships with this source (conf/core-site.xml) to the
  conf directory in your $HADOOP_HOME dir.

    # cp conf/core-site.xml $HADOOP_HOME/conf

* Installing the plugin from RPM

  See the plugin documentation for installing from RPM.


CLUSTER INSTALLATION
--------------------

  In case it is tedious to do the above step(s) on all hosts in the cluster, use the
  build-and-deploy.py script to build the plugin in one place and deploy it (along with the
  configuration file) on all other hosts.

  This should be run on the host which is the Hadoop master [Job Tracker].

* STEPS (you would have done steps 1 and 2 anyway while deploying Hadoop; an example of both
  is sketched after this list)

  1. Edit the conf/slaves file in your Hadoop distribution; one line for each slave.
  2. Set up password-less ssh between the Hadoop master and the slave(s).
  3. Edit conf/core-site.xml with all GlusterFS related configuration (see CONFIGURATION).
  4. Run the following:
       # cd $GLUSTER_HOME/hdfs/0.20.2/tools
       # python ./build-and-deploy.py -b -d /path/to/hadoop/home -c

  This will build the plugin and copy it (and the config file) to all slaves (mentioned in
  $HADOOP_HOME/conf/slaves).
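  As a concrete illustration of steps 1 and 2 (the hostnames and remote user below are
  placeholders, not values shipped with the plugin):

    # cat $HADOOP_HOME/conf/slaves
    slave1.example.com
    slave2.example.com

    # ssh-keygen -t rsa                      <--- accept the defaults; empty passphrase
    # ssh-copy-id root@slave1.example.com    <--- repeat for each slave
    # ssh-copy-id root@slave2.example.com

  After this, ssh from the master to each slave should succeed without a password prompt.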
  Script options:
    -b : build the plugin
    -d : location of the hadoop directory
    -c : deploy core-site.xml
    -m : deploy mapred-site.xml
    -h : deploy hadoop-env.sh


CONFIGURATION
-------------

  All plugin configuration is done in a single XML file (core-site.xml) with <name> and
  <value> tags in each <property> block. (A sample file is sketched in the appendix at the
  end of this document.)

  A brief explanation of the tunables and the values they accept (change them wherever
  needed) is given below.

  name:  fs.glusterfs.impl
  value: org.apache.hadoop.fs.glusterfs.GlusterFileSystem

    The default FileSystem API to use (there is little reason to modify this).

  name:  fs.default.name
  value: glusterfs://server:port

    The default name that Hadoop uses to represent a file as a URI (typically a server:port
    tuple). Use any host in the cluster as the server and any port number. This option has
    to be in server:port format for Hadoop to create file URIs, but it is not otherwise used
    by the plugin.

  name:  fs.glusterfs.volname
  value: volume-dist-rep

    The volume to mount.

  name:  fs.glusterfs.mount
  value: /mnt/glusterfs

    The directory that the plugin will use to mount (FUSE mount) the volume.

  name:  fs.glusterfs.server
  value: 192.168.1.36, hackme.zugzug.org

    To mount a volume the plugin needs to know the hostname or the IP of a GlusterFS server
    in the cluster. Mention it here.

  name:  quick.slave.io
  value: [On/Off], [Yes/No], [1/0]

    NOTE: This option is not tested as of now.

    This is a performance tunable. Hadoop schedules jobs to the hosts that hold the relevant
    part of the file data; the job then does I/O on the file (via FUSE in the case of
    GlusterFS). When this option is set, the plugin will try to do I/O directly on the
    backing filesystem (ext3, ext4, etc.) the file resides on. Hence read performance will
    improve and the job will run faster.


USAGE
-----

  Once configured, start the Hadoop Map/Reduce daemons:

    # cd $HADOOP_HOME
    # ./bin/start-mapred.sh

  If the map/reduce job and task trackers are up, all I/O will be done to GlusterFS. (A
  quick way to verify this is sketched in the appendix.)


FOR HACKERS
-----------

* Source Layout

** version specific: hdfs/ **
./src
./src/main
./src/main/java
./src/main/java/org
./src/main/java/org/apache
./src/main/java/org/apache/hadoop
./src/main/java/org/apache/hadoop/fs
./src/main/java/org/apache/hadoop/fs/glusterfs
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickClass.java
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSXattr.java           <--- Fetch/parse extended attributes of a file
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEInputStream.java   <--- Input stream (instantiated during open() calls; quick read from the backing FS)
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickRepl.java
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEOutputStream.java  <--- Output stream (instantiated during creat() calls)
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFileSystem.java        <--- Entry point for the plugin (extends the Hadoop FileSystem class)
./src/test
./src/test/java
./src/test/java/org
./src/test/java/org/apache
./src/test/java/org/apache/hadoop
./src/test/java/org/apache/hadoop/fs
./src/test/java/org/apache/hadoop/fs/glusterfs
./src/test/java/org/apache/hadoop/fs/glusterfs/AppTest.java                  <--- Your test cases go here (if any :-))
./tools/build-deploy-jar.py                                                  <--- Build and deployment script
./conf
./conf/core-site.xml                                                         <--- Sample configuration file
./pom.xml                                                                    <--- Build XML file (used by Maven)

** toplevel: hdfs/ **
./COPYING                                                                    <--- License
./README                                                                     <--- This file
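
APPENDIX: SAMPLE core-site.xml
------------------------------

  The following is a minimal sketch of a core-site.xml, assembled from the tunables
  described in the CONFIGURATION section above. The server address, port, volume name, and
  mount point are placeholders and must be adapted to your cluster.

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.glusterfs.impl</name>
        <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>glusterfs://192.168.1.36:9000</value>
      </property>
      <property>
        <name>fs.glusterfs.volname</name>
        <value>volume-dist-rep</value>
      </property>
      <property>
        <name>fs.glusterfs.mount</name>
        <value>/mnt/glusterfs</value>
      </property>
      <property>
        <name>fs.glusterfs.server</name>
        <value>192.168.1.36</value>
      </property>
      <property>
        <name>quick.slave.io</name>
        <value>Off</value>
      </property>
    </configuration>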
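
APPENDIX: QUICK SANITY CHECK
----------------------------

  A simple way to confirm that Hadoop really is talking to GlusterFS (the paths below are
  examples): copy a file through the Hadoop FileSystem shell and check that it appears under
  the FUSE mount point configured in fs.glusterfs.mount.

    # cd $HADOOP_HOME
    # ./bin/hadoop fs -mkdir /tmp/gluster-test
    # ./bin/hadoop fs -copyFromLocal /etc/hosts /tmp/gluster-test/hosts
    # ./bin/hadoop fs -ls /tmp/gluster-test

    # ls /mnt/glusterfs/tmp/gluster-test    <--- the same file should be visible here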