Wednesday, December 1, 2010

Hadoop 0.20.S Virtual Machine Appliance


http://developer.yahoo.com/blogs/hadoop/posts/2010/06/hadoop_020s_virtualmachine/



At Yahoo!, we recently implemented a stronger notion of security for the Hadoop platform, based on Kerberos as the underlying authentication system. We also successfully enabled this feature within Yahoo! on our internal data processing clusters. I am sure many Hadoop developers and enterprise users are looking forward to getting hands-on experience with this enterprise-class Hadoop Security feature.
In the past, we've helped developers and users get started with Hadoop by hosting a comprehensive Hadoop tutorial on YDN, along with a pre-configured single-node Hadoop (0.18.0) Virtual Machine appliance.
This time, we decided to upgrade this Hadoop VM to a pre-configured single-node Hadoop 0.20.S cluster, along with the required Kerberos system components. We have also included Pig (version 0.7.0), a high-level SQL-like data processing language used at Yahoo!.
This blog post describes how to get started with the Hadoop 0.20.S VM appliance. The basic information about downloading, setting up VMware Player, and using the Hadoop VM is the same as described in tutorial module 3, except that you should use the following information and links to download the latest VMware Player and the Hadoop 0.20.S VM image. You should also review the security-specific commands below that need to be run before launching M/R or Pig jobs.
For more details on deploying and configuring Yahoo! Hadoop 0.20.S security distribution, look for continuing announcements and details on Hadoop-YDN.

Installing and Running the Hadoop 0.20.S Virtual Machine:

  • Virtual Machine and Hadoop environment: See details here.
  • Install VMware Player: See details here. To download the latest VMware Player for Windows/Linux, go to the VMware site.
  • Setting up the Virtual Environment for Hadoop 0.20.S:
Copy the [Hadoop 0.20.S Virtual Machine] to a location on your hard drive. It is a zipped VMware folder (hadoop-vm-appliance-0-20-S, approx. 400MB) that includes a few files: a .vmdk file that is a snapshot of the virtual machine's hard drive, and a .vmx file that contains the configuration information to start the virtual machine. After unzipping the folder, double-click on the hadoop-appliance-0.20.S.vmx file to start the virtual machine. Note: the uncompressed size of the hadoop-vm-appliance-0-20-S folder is ~2GB, and, depending on the data you upload for testing, the VM disk is configured to grow up to 20GB. (A command-line sketch of these steps follows this item.)
When you start the virtual machine for the first time, VMware Player will recognize that the virtual machine image is not in the location it used to be. Tell VMware Player that you copied this virtual machine image (choose "I copied it"); it will then generate new session identifiers for this instance of the virtual machine. If you later move the VM image to a different location on your own hard drive, tell VMware Player that you moved the image. After you select this option and click OK, the virtual machine should begin booting normally, performing the standard boot procedure for a Linux system. It will bind itself to an IP address on an unused network segment, and then display a prompt allowing a user to log in. Note: the IP address displayed on the login screen can be used to connect to the VM instance over SSH, which is much more convenient than the console (see details here). The login screen also displays information about starting/stopping Hadoop daemons, users/passwords, and how to shut down the VM.
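For reference, here is a minimal sketch of the unpack-and-launch steps on a Linux host (the zip file name and target directory are assumptions; adjust them to match your download location):

  # unzip the appliance to a folder of your choice
  unzip hadoop-vm-appliance-0-20-S.zip -d ~/vmware-vms
  # launch it with VMware Player (on Windows, double-click the .vmx file instead)
  vmplayer ~/vmware-vms/hadoop-vm-appliance-0-20-S/hadoop-appliance-0.20.S.vmx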
  • Virtual Machine User Accounts:
The virtual machine comes pre-configured with two user accounts: "root" and "hadoop-user". The hadoop-user account has sudo permissions to perform system-management functions, such as shutting down the virtual machine. The vast majority of your interaction with the virtual machine will be as hadoop-user. To log in as hadoop-user, first click inside the virtual machine's display; the virtual machine will take control of your keyboard and mouse. To escape back into Windows at any time, press CTRL+ALT. The hadoop-user password is hadoop; the root password is root. (An SSH login sketch follows this item.)
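If you prefer SSH over the VM console, a minimal sketch from the host machine (the address below is a placeholder; use the IP displayed on the VM's login screen):

  # log in over SSH; the password is hadoop
  ssh hadoop-user@<IP-shown-on-login-screen>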
  • Hadoop Environment:
Linux  : Ubuntu 8.04
Java   : JRE 6 Update 7 (see license info @ /usr/jre16/)
Hadoop : 0.20.S (installed @ /usr/local/hadoop; /home/hadoop-user/hadoop is a symlink to the install directory)
Pig    : 0.7.0 (pig jar installed @ /usr/local/pig; /home/hadoop-user/pig-tutorial/pig.jar is a symlink to the one in the install directory)
Login: hadoop-user, Passwd: hadoop (sudo privileges are granted to hadoop-user). The other users are hdfs and mapred (passwd: hadoop). The Hadoop VM starts all the required Hadoop and Kerberos daemons during the boot-up process, but in case you need to stop or restart them:
  • To start/stop/restart Hadoop: log in as hadoop-user and run 'sudo /etc/init.d/hadoop [start | stop | restart]' ('sudo /etc/init.d/hadoop' prints the usage)
  • To format the HDFS and clean all state/logs: log in as hadoop-user and run 'sudo reinit-hadoop'
  • To start/stop/restart the Kerberos KDC server: log in as hadoop-user and run 'sudo /etc/init.d/krb5-kdc [start | stop | restart]'
  • To start/stop/restart the Kerberos admin server: log in as hadoop-user and run 'sudo /etc/init.d/krb5-admin-server [start | stop | restart]'
To shut down the virtual machine: log in as hadoop-user and run 'sudo poweroff'.
Environment for 'hadoop-user' (set in /home/hadoop-user/.profile):
  HADOOP_HOME=/usr/local/hadoop
  HADOOP_CONF_DIR=/usr/local/etc/hadoop-conf
  PATH=/usr/local/hadoop/bin:$PATH
A quick restart-and-verify sketch follows this item.
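A minimal sketch for restarting the daemons and checking that HDFS came back up ('hadoop dfsadmin -report' is a standard HDFS command; under 0.20.S it needs a valid Kerberos ticket, so run 'kinit' first, as described in the next section):

  sudo /etc/init.d/hadoop restart
  kinit                      # password: hadoopYahoo1234
  hadoop dfsadmin -report    # should report the single live datanode of this VM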
  • Running M/R Jobs:
Running M/R jobs on Hadoop 0.20.S is pretty much the same as running them on a non-secure version of Hadoop, except that before running any Hadoop jobs or HDFS commands, the hadoop-user needs to obtain a Kerberos authentication ticket using the command 'kinit'; the password is hadoopYahoo1234.
For example:
  hadoop-user@hadoop-desk:~$ cd hadoop
  hadoop-user@hadoop-desk:~/hadoop$ kinit
  Password for hadoop-user@LOCALDOMAIN: hadoopYahoo1234
  hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000
For automated runs of Hadoop jobs, a keytab file is provided under the hadoop-user's home directory (/home/hadoop-user/hadoop-user.keytab). This allows the user to run 'kinit' without having to enter the password manually. So for automated runs of Hadoop commands or M/R and Pig jobs through the cron daemon, users can invoke the following command to get the Kerberos ticket. Use the command 'klist' to view the Kerberos ticket and its validity.
For example:
  hadoop-user@hadoop-desk:~$ cd hadoop
  hadoop-user@hadoop-desk:~/hadoop$ kinit -k -t /home/hadoop-user/hadoop-user.keytab hadoop-user/localhost@LOCALDOMAIN
  hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000
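Putting this together for unattended runs, here is a minimal sketch of a wrapper script plus a crontab entry (the script name, schedule, and log path are assumptions for illustration; the paths, jar, and Kerberos principal come from this post):

  #!/bin/sh
  # /home/hadoop-user/run-pi-job.sh -- hypothetical wrapper for a nightly job
  export HADOOP_HOME=/usr/local/hadoop
  export HADOOP_CONF_DIR=/usr/local/etc/hadoop-conf
  # get a Kerberos ticket non-interactively from the keytab
  kinit -k -t /home/hadoop-user/hadoop-user.keytab hadoop-user/localhost@LOCALDOMAIN
  # run the example job
  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000

Crontab entry (edit with 'crontab -e') to run it nightly at 2am:

  0 2 * * * /home/hadoop-user/run-pi-job.sh >> /home/hadoop-user/run-pi-job.log 2>&1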
  • Running Pig Tutorial:
The Pig tutorial is installed at "/home/hadoop-user/pig-tutorial". Example commands to run the Pig scripts are given in "example.run.cmd.sh". The data needed for the Pig scripts has already been copied to HDFS. See more details about the Pig tutorial at Pig@Apache. As with M/R jobs, get a Kerberos ticket with 'kinit' before running the scripts; see also the sketch after these commands:
  • hadoop-user@hadoop-desk:~$ cd pig-tutorial
  • hadoop-user@hadoop-desk:~$ sh example.run.cmd.sh
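To run an individual tutorial script directly, a minimal sketch (script1-hadoop.pig is one of the scripts shipped with the Apache Pig tutorial, and invoking pig.jar via org.apache.pig.Main is the standard way to run Pig 0.7 from the command line; the exact classpath on this VM may differ, so treat example.run.cmd.sh as the authoritative reference):

  cd /home/hadoop-user/pig-tutorial
  kinit    # password: hadoopYahoo1234
  # put the Hadoop conf dir on the classpath so Pig finds the cluster
  java -cp pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main script1-hadoop.pig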
  • Shutting down the VM:
When you are done with the virtual machine, you can turn it off by logging in as the hadoop-user and running the command 'sudo poweroff'. The virtual machine will shut itself down in an orderly fashion and the window it runs in will disappear.
Last but not least, I would like to thank Devaraj Das and Jianyong Dai from the Yahoo! Hadoop & Pig Development team for their help in setting up and configuring Hadoop 0.20.S and Pig, respectively.
Notice: Yahoo! does not offer any support for the Hadoop Virtual Machine. The software includes cryptographic software that is subject to U.S. export control laws and applicable export and import laws of other countries. BEFORE using any software made available from this site, it is your responsibility to understand and comply with these laws. This software is being exported in accordance with the Export Administration Regulations. As of June 2009, you are prohibited from exporting and re-exporting this software to Cuba, Iran, North Korea, Sudan, Syria and any other countries specified by regulatory update to the U.S. export control laws and regulations. Diversion contrary to U.S. law is prohibited.
