
Friday, October 21, 2016

Using the Cloudera Docker Container


source:  https://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html





Cloudera Docker Container


docker pull cloudera/quickstart:latest
latest: Pulling from cloudera/quickstart
1d00652ce734: Downloading [==>                        ] 232.5 MB/4.444 GB

Importing the Cloudera QuickStart Image

You can import the Docker image by pulling it from the Docker Hub:
docker pull cloudera/quickstart:latest
You can also download the image from the Cloudera website. After the file is downloaded and on your host, you can import it into Docker:
tar xzf cloudera-quickstart-vm-*-docker.tar.gz
docker import - cloudera/quickstart:latest < cloudera-quickstart-vm-*-docker/*.tar

Running a Cloudera QuickStart Container

To run a container using the image, you must know the name or hash of the image. If you followed the import instructions above, the name is cloudera/quickstart:latest. The hash is also printed in the terminal when you import, or you can look up the hashes of all imported images with:
docker images
Once you know the name or hash of the image, you can run it:
docker run --hostname=quickstart.cloudera --privileged=true -t -i [OPTIONS] [IMAGE] /usr/bin/docker-quickstart
The required flags and other options are described below:
  • --hostname=quickstart.cloudera (Required): The pseudo-distributed configuration assumes this hostname.
  • --privileged=true (Required): Needed by HBase, the MySQL-backed Hive metastore, Hue, Oozie, Sentry, and Cloudera Manager.
  • -t (Required): Allocate a pseudoterminal. Once services are started, a Bash shell takes over; this switch starts a terminal emulator to run the services.
  • -i (Required): Keep STDIN open so you can use the terminal, either immediately or by connecting later.
  • -p 8888 (Recommended): Map the Hue port in the guest to another port on the host.
  • -p [PORT] (Optional): Map any other ports (for example, 7180 for Cloudera Manager, 80 for a guided tutorial).
  • -d (Optional): Run the container in the background.
Use /usr/bin/docker-quickstart to start all CDH services and then run a Bash shell. You can run /bin/bash directly instead if you want to start the services manually.
See Networking for details about port mapping.
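Putting the flags together, here is one way to sketch a full invocation. The host-side port choices (8888 for Hue, 7180 for Cloudera Manager) are assumptions, not part of the original command; building the command as a string first lets you review it before running:

```shell
# Host ports for Hue and Cloudera Manager; illustrative choices, adjust as needed.
hue_port=8888
cm_port=7180

# -d runs the container in the background; drop it to stay attached.
run_cmd="docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p ${hue_port}:8888 -p ${cm_port}:7180 cloudera/quickstart:latest /usr/bin/docker-quickstart"

echo "$run_cmd"   # review, then execute with: eval "$run_cmd"
```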

Connecting to the Docker Shell

If you do not pass the -d flag to docker run, your terminal automatically attaches to the container.
A container dies when you exit the shell, but you can disconnect and leave the container running by typing Ctrl+p followed by Ctrl+q.
If you disconnect from the shell or passed the -d flag on startup, you can connect to the shell later using the following command:
docker attach [CONTAINER HASH]
You can look up the hashes of running containers using the following command:
docker ps
When attaching to a container, you might need to press Enter to see the shell prompt. To disconnect from the terminal without the container exiting, type Ctrl+p followed by Ctrl+q.

Networking

To make a port accessible outside the container, pass the -p flag to docker run. Docker maps this port to another port on the host system. You can look up the interface it binds to and the port number it maps to using the following command:
docker port [CONTAINER HASH] [GUEST PORT]
To interact with the Cloudera QuickStart image from other systems, make sure quickstart.cloudera resolves to the IP address of the machine where the image is running. You might also want to set up port forwarding so that the port you would normally connect to on a real cluster is mapped to the corresponding port.
When you are mapping ports like this, services are not aware and might provide links or other references to specific ports that are no longer available on your client.
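For example, if the Docker host's IP address were 192.168.99.100 (a purely illustrative value), the entry in a client machine's /etc/hosts would look like:

```
192.168.99.100   quickstart.cloudera
```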

Wednesday, October 19, 2016

Installing CDH 5 with YARN on a Single Linux Node in Pseudo-distributed mode


Before you start, uninstall MRv1 if necessary

If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.
  1. Stop the daemons:
    $ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done 
    $ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done
  2. Remove hadoop-0.20-conf-pseudo:
    • On Red Hat-compatible systems:
      $ sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On SLES systems:
      $ sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On Ubuntu or Debian systems:
      $ sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.

On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

Download the CDH 5 Package
  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
    OS Version                 Link to CDH 5 RPM
    RHEL/CentOS/Oracle 5       RHEL/CentOS/Oracle 5 link
    RHEL/CentOS/Oracle 6       RHEL/CentOS/Oracle 6 link
    RHEL/CentOS/Oracle 7       RHEL/CentOS/Oracle 7 link
  2. Install the RPM.
    For Red Hat/CentOS/Oracle 5:
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm 
    For Red Hat/CentOS/Oracle 6 (64-bit):
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
    For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.
Install CDH 5
  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Red Hat/CentOS/Oracle 5 systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    • For Red Hat/CentOS/Oracle 6 systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo yum install hadoop-conf-pseudo

On SLES systems, do the following:

Download and install the CDH 5 package
  1. Download the CDH 5 "1-click Install" package.
    Download the rpm file, choose Save File, and save it to a directory to which you have write access (for example, your home directory).
  2. Install the RPM:
    $ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm
    For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.
Install CDH 5
  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For all SLES systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo zypper install hadoop-conf-pseudo 

On Ubuntu and other Debian systems, do the following:

Download and install the package
  1. Download the CDH 5 "1-click Install" package:
    OS Version    Package Link
    Wheezy        Wheezy package
    Precise       Precise package
    Trusty        Trusty package
  2. Install the package by doing one of the following:
    • Choose Open with in the download window to use the package manager.
    • Choose Save File, save the package to a directory to which you have write access (for example, your home directory), and install it from the command line. For example:
      sudo dpkg -i cdh5-repository_1.0_all.deb
Install CDH 5
  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Ubuntu Lucid systems:
      $ curl -s https://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
    • For Ubuntu Precise systems:
      $ curl -s https://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    • For Debian Squeeze systems:
      $ curl -s https://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo apt-get update 
    $ sudo apt-get install hadoop-conf-pseudo

Starting Hadoop and Verifying it is Working Properly

For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.
  • To view the files on Red Hat or SLES systems:
$ rpm -ql hadoop-conf-pseudo
  • To view the files on Ubuntu systems:
$ dpkg -L hadoop-conf-pseudo
The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.
The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components look for the Hadoop configuration in /etc/hadoop/conf.
To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.
$ sudo -u hdfs hdfs namenode -format
Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

Step 2: Start HDFS

$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
To verify that the services have started, you can check the web console. The NameNode provides a web console at http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.

Step 3: Create the directories needed for Hadoop processes.

Issue the following command to create the directories needed for all installed Hadoop processes with the appropriate permissions.
$ sudo /usr/lib/hadoop/libexec/init-hdfs.sh

Step 4: Verify the HDFS File Structure:

Run the following command:
$ sudo -u hdfs hadoop fs -ls -R /
You should see output similar to the following excerpt:
...
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
...

Step 5: Start YARN

$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start 
$ sudo service hadoop-mapreduce-historyserver start

Step 6: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
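The per-user steps above can also be sketched as a small loop. The user names below are illustrative, and the loop prints the commands for review rather than running them:

```shell
# Illustrative user list; replace with your real Linux usernames.
users="alice bob joe"

# Print one mkdir/chown pair per user; pipe the output to sh to execute.
for u in $users; do
  echo "sudo -u hdfs hadoop fs -mkdir /user/$u"
  echo "sudo -u hdfs hadoop fs -chown $u /user/$u"
done
```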

Running an example application with YARN

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe 
    $ sudo -u hdfs hadoop fs -chown joe /user/joe
    Do the following steps as the user joe.
  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items:
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Set HADOOP_MAPRED_HOME for user joe:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  4. Run an example Hadoop job to grep with a regular expression in your input data.
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
  5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
    $ hadoop fs -ls 
    Found 2 items
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23
    You can see that there is a new directory called output23.
  6. List the output files.
    $ hadoop fs -ls output23 
    Found 2 items
    -rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
  7. Read the results in the output file.
    $ hadoop fs -cat output23/part-r-00000 | head
    1 dfs.safemode.min.datanodes
    1 dfs.safemode.extension
    1 dfs.replication
    1 dfs.permissions.enabled
    1 dfs.namenode.name.dir
    1 dfs.namenode.checkpoint.dir
    1 dfs.datanode.data.dir
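One caveat when rerunning the example: Hadoop refuses to write into an output directory that already exists, so a second run must use a fresh name (or remove the old output first). A minimal sketch, where the directory name output24 is an assumption:

```shell
# Pick an output directory that does not yet exist in HDFS.
out_dir="output24"

cmd="hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input $out_dir 'dfs[a-z.]+'"
echo "$cmd"   # review, then run with: eval "$cmd"
```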

Starting CDH Services Using the Command Line


You need to start and stop services in the right order to make sure everything starts or stops cleanly.
START services in this order:
  1. ZooKeeper: Cloudera recommends starting ZooKeeper before starting HDFS; this is a requirement in a high-availability (HA) deployment. In any case, always start ZooKeeper before HBase.
  2. HDFS: Start HDFS before all other services except ZooKeeper. If you are using HA, see the CDH 5 High Availability Guide for instructions.
  3. HttpFS
  4a. MRv1: Start MapReduce before Hive or Oozie. Do not start MRv1 if YARN is running.
  4b. YARN: Start YARN before Hive or Oozie. Do not start YARN if MRv1 is running.
  5. HBase
  6. Hive: Start the Hive metastore before starting HiveServer2 and the Hive console.
  7. Oozie
  8. Flume 1.x
  9. Sqoop
  10. Hue
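The ordering above can be sketched as a single loop. The init script names below are the usual CDH package service names but are assumptions here; confirm yours with ls /etc/init.d. The loop prints the commands rather than running them:

```shell
# Services in the recommended start order; remove the echo to execute for real.
for svc in zookeeper-server \
           hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode hadoop-hdfs-datanode \
           hadoop-httpfs \
           hadoop-yarn-resourcemanager hadoop-yarn-nodemanager \
           hbase-master hbase-regionserver \
           hive-metastore hive-server2 \
           oozie flume-ng-agent sqoop-metastore hue; do
  echo "sudo service $svc start"
done
```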