
Friday, October 21, 2016

Using the Cloudera Docker Container


source:  https://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html





Cloudera Docker Container


docker pull cloudera/quickstart:latest
latest: Pulling from cloudera/quickstart
1d00652ce734: Downloading [==>                        ] 232.5 MB/4.444 GB

Importing the Cloudera QuickStart Image

You can import the Docker image by pulling it from the Docker Hub:
docker pull cloudera/quickstart:latest
You can also download the image from the Cloudera website. After the file is downloaded and on your host, you can import it into Docker:
tar xzf cloudera-quickstart-vm-*-docker.tar.gz
docker import - cloudera/quickstart:latest < cloudera-quickstart-vm-*-docker/*.tar

Running a Cloudera QuickStart Container

To run a container using the image, you must know the name or hash of the image. If you followed the import instructions above, the name is cloudera/quickstart:latest. The hash is also printed in the terminal when you import, or you can look up the hashes of all imported images with:
docker images
Once you know the name or hash of the image, you can run it:
docker run --hostname=quickstart.cloudera --privileged=true -t -i [OPTIONS] [IMAGE] /usr/bin/docker-quickstart
The required flags and other options are described below:
  • --hostname=quickstart.cloudera (Required): The pseudo-distributed configuration assumes this hostname.
  • --privileged=true (Required): Needed by HBase, the MySQL-backed Hive metastore, Hue, Oozie, Sentry, and Cloudera Manager.
  • -t (Required): Allocate a pseudoterminal. Once services are started, a Bash shell takes over; this switch starts a terminal emulator to run the services.
  • -i (Required): Keep STDIN open so you can use the terminal, either immediately or by connecting later.
  • -p 8888 (Recommended): Map the Hue port in the guest to another port on the host.
  • -p [PORT] (Optional): Map any other ports (for example, 7180 for Cloudera Manager, 80 for a guided tutorial).
  • -d (Optional): Run the container in the background.
Use /usr/bin/docker-quickstart to start all CDH services and then run a Bash shell. You can run /bin/bash directly instead if you want to start the services manually.
See Networking for details about port mapping.
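Putting the flags together, here is one way to sketch a full invocation. The host-side port choices (8888 for Hue, 7180 for Cloudera Manager) are assumptions, not part of the original command; building the command as a string first lets you review it before running:

```shell
# Host ports for Hue and Cloudera Manager; illustrative choices, adjust as needed.
hue_port=8888
cm_port=7180

# -d runs the container in the background; drop it to stay attached.
run_cmd="docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p ${hue_port}:8888 -p ${cm_port}:7180 cloudera/quickstart:latest /usr/bin/docker-quickstart"

echo "$run_cmd"   # review, then execute with: eval "$run_cmd"
```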

Connecting to the Docker Shell

If you do not pass the -d flag to docker run, your terminal automatically attaches to the container.
A container dies when you exit the shell, but you can disconnect and leave the container running by typing Ctrl+p followed by Ctrl+q.
If you disconnect from the shell or passed the -d flag on startup, you can connect to the shell later using the following command:
docker attach [CONTAINER HASH]
You can look up the hashes of running containers using the following command:
docker ps
When attaching to a container, you might need to press Enter to see the shell prompt. To disconnect from the terminal without the container exiting, type Ctrl+p followed by Ctrl+q.

Networking

To make a port accessible outside the container, pass the -p flag to docker run. Docker maps this port to another port on the host system. You can look up the interface it binds to and the port number it maps to using the following command:
docker port [CONTAINER HASH] [GUEST PORT]
To interact with the Cloudera QuickStart image from other systems, make sure quickstart.cloudera resolves to the IP address of the machine where the image is running. You might also want to set up port forwarding so that the port you would normally connect to on a real cluster is mapped to the corresponding port.
When you are mapping ports like this, services are not aware and might provide links or other references to specific ports that are no longer available on your client.
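For example, if the Docker host's IP address were 192.168.99.100 (a purely illustrative value), the entry in a client machine's /etc/hosts would look like:

```
192.168.99.100   quickstart.cloudera
```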

Wednesday, October 19, 2016

Installing CDH 5 with YARN on a Single Linux Node in Pseudo-distributed mode


Before you start, uninstall MRv1 if necessary

If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.
  1. Stop the daemons:
    $ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done 
    $ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done
  2. Remove hadoop-0.20-conf-pseudo:
    • On Red Hat-compatible systems:
      $ sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On SLES systems:
      $ sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On Ubuntu or Debian systems:
      $ sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.

On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

Download the CDH 5 Package
  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
    OS Version                 Link to CDH 5 RPM
    RHEL/CentOS/Oracle 5       RHEL/CentOS/Oracle 5 link
    RHEL/CentOS/Oracle 6       RHEL/CentOS/Oracle 6 link
    RHEL/CentOS/Oracle 7       RHEL/CentOS/Oracle 7 link
  2. Install the RPM.
    For Red Hat/CentOS/Oracle 5:
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm 
    For Red Hat/CentOS/Oracle 6 (64-bit):
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
    For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.
Install CDH 5
  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Red Hat/CentOS/Oracle 5 systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    • For Red Hat/CentOS/Oracle 6 systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo yum install hadoop-conf-pseudo

On SLES systems, do the following:

Download and install the CDH 5 package
  1. Download the CDH 5 "1-click Install" package.
    Download the rpm file, choose Save File, and save it to a directory to which you have write access (for example, your home directory).
  2. Install the RPM:
    $ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm
    For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.
Install CDH 5
  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For all SLES systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo zypper install hadoop-conf-pseudo 

On Ubuntu and other Debian systems, do the following:

Download and install the package
  1. Download the CDH 5 "1-click Install" package:
    OS Version    Package Link
    Wheezy        Wheezy package
    Precise       Precise package
    Trusty        Trusty package
  2. Install the package by doing one of the following:
    • Choose Open with in the download window to use the package manager.
    • Choose Save File, save the package to a directory to which you have write access (for example, your home directory), and install it from the command line. For example:
      sudo dpkg -i cdh5-repository_1.0_all.deb
Install CDH 5
  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Ubuntu Lucid systems:
      $ curl -s https://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
    • For Ubuntu Precise systems:
      $ curl -s https://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    • For Debian Squeeze systems:
      $ curl -s https://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo apt-get update 
    $ sudo apt-get install hadoop-conf-pseudo

Starting Hadoop and Verifying it is Working Properly

For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.
  • To view the files on Red Hat or SLES systems:
$ rpm -ql hadoop-conf-pseudo
  • To view the files on Ubuntu systems:
$ dpkg -L hadoop-conf-pseudo
The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.
The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components look for the Hadoop configuration in /etc/hadoop/conf.
To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.
$ sudo -u hdfs hdfs namenode -format
Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

Step 2: Start HDFS

$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
To verify that the services have started, you can check the web console. The NameNode provides a web console at http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.

Step 3: Create the directories needed for Hadoop processes.

Issue the following command to create the directories needed for all installed Hadoop processes with the appropriate permissions.
$ sudo /usr/lib/hadoop/libexec/init-hdfs.sh

Step 4: Verify the HDFS File Structure:

Run the following command:
$ sudo -u hdfs hadoop fs -ls -R /
You should see output similar to the following excerpt:
...
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
...

Step 5: Start YARN

$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start 
$ sudo service hadoop-mapreduce-historyserver start

Step 6: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
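The per-user steps above can also be sketched as a small loop. The user names below are illustrative, and the loop prints the commands for review rather than running them:

```shell
# Illustrative user list; replace with your real Linux usernames.
users="alice bob joe"

# Print one mkdir/chown pair per user; pipe the output to sh to execute.
for u in $users; do
  echo "sudo -u hdfs hadoop fs -mkdir /user/$u"
  echo "sudo -u hdfs hadoop fs -chown $u /user/$u"
done
```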

Running an example application with YARN

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe 
    $ sudo -u hdfs hadoop fs -chown joe /user/joe
    Do the following steps as the user joe.
  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items:
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Set HADOOP_MAPRED_HOME for user joe:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  4. Run an example Hadoop job to grep with a regular expression in your input data.
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
  5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
    $ hadoop fs -ls 
    Found 2 items
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23
    You can see that there is a new directory called output23.
  6. List the output files.
    $ hadoop fs -ls output23 
    Found 2 items
    -rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
  7. Read the results in the output file.
    $ hadoop fs -cat output23/part-r-00000 | head
    1 dfs.safemode.min.datanodes
    1 dfs.safemode.extension
    1 dfs.replication
    1 dfs.permissions.enabled
    1 dfs.namenode.name.dir
    1 dfs.namenode.checkpoint.dir
    1 dfs.datanode.data.dir
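One caveat when rerunning the example: Hadoop refuses to write into an output directory that already exists, so a second run must use a fresh name (or remove the old output first). A minimal sketch, where the directory name output24 is an assumption:

```shell
# Pick an output directory that does not yet exist in HDFS.
out_dir="output24"

cmd="hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input $out_dir 'dfs[a-z.]+'"
echo "$cmd"   # review, then run with: eval "$cmd"
```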

Starting CDH Services Using the Command Line


You need to start and stop services in the right order to make sure everything starts or stops cleanly.
START services in this order:
  1. ZooKeeper: Cloudera recommends starting ZooKeeper before starting HDFS; this is a requirement in a high-availability (HA) deployment. In any case, always start ZooKeeper before HBase.
  2. HDFS: Start HDFS before all other services except ZooKeeper. If you are using HA, see the CDH 5 High Availability Guide for instructions.
  3. HttpFS
  4a. MRv1: Start MapReduce before Hive or Oozie. Do not start MRv1 if YARN is running.
  4b. YARN: Start YARN before Hive or Oozie. Do not start YARN if MRv1 is running.
  5. HBase
  6. Hive: Start the Hive metastore before starting HiveServer2 and the Hive console.
  7. Oozie
  8. Flume 1.x
  9. Sqoop
  10. Hue
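The ordering above can be sketched as a single loop. The init script names below are the usual CDH package service names but are assumptions here; confirm yours with ls /etc/init.d. The loop prints the commands rather than running them:

```shell
# Services in the recommended start order; remove the echo to execute for real.
for svc in zookeeper-server \
           hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode hadoop-hdfs-datanode \
           hadoop-httpfs \
           hadoop-yarn-resourcemanager hadoop-yarn-nodemanager \
           hbase-master hbase-regionserver \
           hive-metastore hive-server2 \
           oozie flume-ng-agent sqoop-metastore hue; do
  echo "sudo service $svc start"
done
```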