Wednesday, December 1, 2010

Downloading and installing Hadoop


http://wiki.apache.org/hadoop/GettingStartedWithHadoop


Hadoop can be downloaded from one of the Apache download mirrors. You may also download a nightly build or check out the code from Subversion and build it with Ant. Select a directory to install Hadoop under (let's say /foo/bar/hadoop-install) and untar the tarball in that directory. A directory corresponding to the version of Hadoop downloaded will be created under the /foo/bar/hadoop-install directory. For instance, if version 0.6.0 of Hadoop was downloaded, untarring as described above will create the directory /foo/bar/hadoop-install/hadoop-0.6.0. The examples in this document assume the existence of an environment variable $HADOOP_INSTALL that represents the path to all versions of Hadoop installed. In the above instance HADOOP_INSTALL=/foo/bar/hadoop-install. They further assume the existence of a symlink named hadoop in $HADOOP_INSTALL that points to the version of Hadoop being used. For instance, if version 0.6.0 is being used then $HADOOP_INSTALL/hadoop -> hadoop-0.6.0. All tools used to run Hadoop are in the directory $HADOOP_INSTALL/hadoop/bin, and all configuration files are in $HADOOP_INSTALL/hadoop/conf.
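For example, a minimal shell session that produces this layout (the version number and tarball name are illustrative):
% mkdir -p /foo/bar/hadoop-install
% cd /foo/bar/hadoop-install
% tar -xzf hadoop-0.6.0.tar.gz        # creates hadoop-0.6.0/
% ln -s hadoop-0.6.0 hadoop           # $HADOOP_INSTALL/hadoop -> hadoop-0.6.0
% export HADOOP_INSTALL=/foo/bar/hadoop-install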

Startup scripts

The $HADOOP_INSTALL/hadoop/bin directory contains some scripts used to launch Hadoop DFS and Hadoop Map/Reduce daemons. These are:
  • start-all.sh - Starts all Hadoop daemons, the namenode, datanodes, the jobtracker and tasktrackers.
  • stop-all.sh - Stops all Hadoop daemons.
  • start-mapred.sh - Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
  • stop-mapred.sh - Stops the Hadoop Map/Reduce daemons.
  • start-dfs.sh - Starts the Hadoop DFS daemons, the namenode and datanodes.
  • stop-dfs.sh - Stops the Hadoop DFS daemons.
It is also possible to run the Hadoop daemons as Windows Services using the Java Service Wrapper (downloaded separately). This still requires Cygwin to be installed, as Hadoop requires its df command; see the relevant JIRA issues for details.

Configuration files

The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop. These are:
  • hadoop-env.sh - This file contains some environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used, etc. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java 1.5.x installation used by Hadoop (see the sketch below).
  • slaves - This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. By default this contains the single entry localhost
  • hadoop-default.xml - This file contains generic default settings for Hadoop daemons and Map/Reduce jobs. Do not modify this file.
  • mapred-default.xml - This file contains site specific settings for the Hadoop Map/Reduce daemons and jobs. The file is empty by default. Putting configuration properties in this file will override Map/Reduce settings in the hadoop-default.xml file. Use this file to tailor the behavior of Map/Reduce on your site.
  • hadoop-site.xml - This file contains site specific settings for all Hadoop daemons and Map/Reduce jobs. This file is empty by default. Settings in this file override those in hadoop-default.xml and mapred-default.xml. This file should contain settings that must be respected by all servers and clients in a Hadoop installation, for instance, the location of the namenode and the jobtracker.
More details on configuration can be found on the HowToConfigure page.
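As an illustration, a minimal hadoop-env.sh fragment might look like this (the JDK path and heap size are assumptions, not requirements):
# hadoop-env.sh - environment for the Hadoop daemons
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun   # illustrative path; point at your own JDK
export HADOOP_HEAPSIZE=1000                    # daemon heap in MB (optional)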

Setting up Hadoop on a single node

This section describes how to get started by setting up a Hadoop cluster on a single node. The setup described here is an HDFS instance with a namenode and a single datanode and a Map/Reduce cluster with a jobtracker and a single tasktracker. The configuration procedures described in Basic Configuration are just as applicable for larger clusters.

Basic Configuration

Take a pass at putting together basic configuration settings for your cluster. Some of the settings that follow are required; others are recommended for more straightforward and predictable operation.
  • Hadoop Environment Settings - Ensure that JAVA_HOME is set in hadoop-env.sh and points to the Java installation you intend to use. You can set other environment variables in hadoop-env.sh to suit your requirements. Some of the default settings refer to the variable HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup scripts: HADOOP_HOME is the parent directory of the bin directory that holds the Hadoop scripts. In this instance it is $HADOOP_INSTALL/hadoop.
  • Jobtracker and Namenode settings - Figure out where to run your namenode and jobtracker. Set the variable fs.default.name to the Namenode's intended host:port. Set the variable mapred.job.tracker to the jobtracker's intended host:port. These settings should be in hadoop-site.xml. You may also want to set one or more of the following ports (also in hadoop-site.xml):
    • dfs.datanode.port
    • dfs.info.port
    • mapred.job.tracker.info.port
    • mapred.task.tracker.output.port
    • mapred.task.tracker.report.port
  • Data Path Settings - Figure out where your data goes. This includes settings for where the namenode stores the namespace checkpoint and the edits log, where the datanodes store filesystem blocks, storage locations for Map/Reduce intermediate output, and temporary storage for the HDFS client. The default values for these paths point to various locations in /tmp. While this might be OK for a single-node installation, for larger clusters storing data in /tmp is not an option. These settings must also be in hadoop-site.xml; it is important for them to be present there because they can otherwise be overridden by client configuration settings in Map/Reduce jobs. Set the following variables to appropriate values:
    • dfs.name.dir
    • dfs.data.dir
    • dfs.client.buffer.dir
    • mapred.local.dir
An example of a hadoop-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>8</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

</configuration>

Formatting the Namenode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem; this will cause all your data to be erased. Before formatting, ensure that the dfs.name.dir directory exists. If you just used the default, then mkdir -p /tmp/hadoop-username/dfs/name will create the directory. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format

Starting a Single node cluster

Run the command:
% $HADOOP_INSTALL/hadoop/bin/start-all.sh
This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
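You can verify that everything came up by listing the Java processes with the JDK's jps tool (the process IDs below are illustrative):
% jps
16000 NameNode
16087 DataNode
16175 SecondaryNameNode
16255 JobTracker
16344 TaskTracker
16405 Jps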

Stopping a Single node cluster

Run the command
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.

Separating Configuration from Installation

In the example described above, the configuration files used by the Hadoop cluster all lie in the Hadoop installation. This can become cumbersome when upgrading to a new release, since all custom config has to be re-created in the new installation. It is possible to separate the config from the install. To do so, select a directory to house the Hadoop configuration (let's say /foo/bar/hadoop-config). Copy all conf files to this directory. You can either set the HADOOP_CONF_DIR environment variable to refer to this directory or pass it directly to the Hadoop scripts with the --config option. In this case, the cluster start and stop commands specified in the above two sub-sections become:
% $HADOOP_INSTALL/hadoop/bin/start-all.sh --config /foo/bar/hadoop-config
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh --config /foo/bar/hadoop-config
Only the absolute path to the config directory should be passed to the scripts.
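A sketch of the one-time setup, assuming the /foo/bar/hadoop-config directory from the example above:
% mkdir -p /foo/bar/hadoop-config
% cp $HADOOP_INSTALL/hadoop/conf/* /foo/bar/hadoop-config
% export HADOOP_CONF_DIR=/foo/bar/hadoop-config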

Starting up a larger cluster

  • Ensure that the Hadoop package is accessible from the same path on all nodes that are to be included in the cluster. If you have separated configuration from the install then ensure that the config directory is also accessible the same way.
  • Populate the slaves file with the nodes to be included in the cluster, one node per line (see the example after this list).
  • Follow the steps in the Basic Configuration section above.
  • Format the Namenode
  • Run the command % $HADOOP_INSTALL/hadoop/bin/start-dfs.sh on the node you want the Namenode to run on. This will bring up HDFS with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file mentioned above.
  • Run the command % $HADOOP_INSTALL/hadoop/bin/start-mapred.sh on the machine you plan to run the Jobtracker on. This will bring up the Map/Reduce cluster with Jobtracker running on the machine you ran the command on and Tasktrackers running on machines listed in the slaves file.
  • The above two commands can also be executed with a --config option.
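For instance, a slaves file for a three-node cluster might read as follows (the hostnames are illustrative):
worker01.example.com
worker02.example.com
worker03.example.com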

Stopping the cluster

  • The cluster can be stopped by running % $HADOOP_INSTALL/hadoop/bin/stop-mapred.sh and then % $HADOOP_INSTALL/hadoop/bin/stop-dfs.sh on your Jobtracker and Namenode respectively. These commands also accept the --config option.

Hadoop-Hive/GettingStarted

Hive/GettingStarted - Hadoop Wiki

Hive introduction videos from Cloudera

Installation and Configuration

Requirements

  • Java 1.6
  • Hadoop 0.17.x to 0.20.x.

Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:
  $ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
  $ cd hive-x.y.z
  $ export HIVE_HOME=`pwd`
Finally, add $HIVE_HOME/bin to your PATH:
  $ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source

The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk
  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package
  $ cd build/dist
  $ ls
  README.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)
In the rest of the page, we use build/dist and <install-dir> interchangeably.

Running Hive

Hive uses Hadoop, which means:
  • you must have hadoop in your path, OR
  • export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) in HDFS before a table can be created in Hive. Commands to perform this setup:
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
I also find it useful, but not necessary, to set HIVE_HOME:
  $ export HIVE_HOME=<hive-install-dir>
To use the Hive command line interface (cli) from the shell:
  $ $HIVE_HOME/bin/hive

Configuration management overview

  • Hive's default configuration is stored in <install-dir>/conf/hive-default.xml. Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml.
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties.
  • Hive configuration is an overlay on top of Hadoop - the Hadoop configuration variables are inherited by default.
  • Hive configuration can be manipulated by:
    • editing hive-site.xml and defining any desired variables (including Hadoop variables) in it;
    • using the set command from the cli (see below);
    • invoking Hive with the syntax $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2, which sets the variables x1 and x2 to y1 and y2 respectively;
    • setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above.
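For example, the HIVE_OPTS route might look like this (the jobtracker address is illustrative):
  $ export HIVE_OPTS="-hiveconf mapred.job.tracker=myhost.mycompany.com:50030"
  $ $HIVE_HOME/bin/hive    # starts the cli with the overridden setting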

Runtime configuration

  • Hive queries are executed using map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
  • The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;
  • The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode

The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:
  mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.
Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:
  hive> SET mapred.job.tracker=local;
In addition, mapred.local.dir should point to a path that is valid on the local machine (for example /tmp/<username>/mapred/local); otherwise, the user will get an exception about allocating local disk space.
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically. The relevant option is:
  hive> SET hive.exec.mode.local.auto=true;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
  • The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  • The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
  • The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
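Putting this together, a sketch of forcing local execution from the cli (the path and table name are illustrative):
  hive> SET mapred.job.tracker=local;
  hive> SET mapred.local.dir=/tmp/myname/mapred/local;
  hive> SELECT COUNT(1) FROM pokes;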

Error Logs

Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN and the logs are stored in the folder:
  • /tmp/{user.name}/hive.log
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
  • bin/hive -hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using:
  • bin/hive -hiveconf hive.root.logger=INFO,DRFA
Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.
Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.
When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/{user.name}. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.
Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

DDL Operations

Creating Hive tables and browsing through them
  hive> CREATE TABLE pokes (foo INT, bar STRING);  
Creates a table called pokes with two columns, the first being an integer and the other a string
  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);  
Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.
By default, tables are assumed to be in text input format and the delimiters are assumed to be ^A (ctrl-a).
  hive> SHOW TABLES;
lists all the tables
  hive> SHOW TABLES '.*s';
lists all the tables that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
  hive> DESCRIBE invites;
shows the list of columns
As for altering tables, table names can be changed and additional columns can be added:
  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE events RENAME TO 3koobecaf;
Dropping tables:
  hive> DROP TABLE pokes;

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db.
Right now, in the default configuration, this metadata can only be seen by one user at a time.
The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to the JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
In the future, the metastore itself can be a standalone server.
If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.
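As an illustration, the embedded Derby database can be moved out of the working directory by overriding the connection URL at startup (the path is an assumption):
  $ bin/hive -hiveconf javax.jdo.option.ConnectionURL='jdbc:derby:;databaseName=/var/lib/hive/metastore_db;create=true'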

DML Operations

Loading data from flat files into Hive:
  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 
Loads a file that contains two columns separated by ctrl-a into the pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS.
The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
NOTES:
  • NO verification of data against the schema is performed by the load command.
  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
    The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
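To confirm that the partitioned loads above landed where expected, the partitions of a table can be listed:
  hive> SHOW PARTITIONS invites;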

SQL Operations

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the Hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.
Note that in all the examples that follow, INSERT (into a hive table, local directory or HDFS directory) is optional.
  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.
Partitioned tables must always have a partition selected in the WHERE clause of the statement.
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Selects all rows from the pokes table into a local directory.
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

GROUP BY

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN

  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT

  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Then, download and extract the data files:
wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
tar xvzf ml-data.tar__0.gz
And load it into the table that was just created:
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
Count the number of rows in table u_data:
SELECT COUNT(*) FROM u_data;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Now we can do some complex data analysis on the table u_data:
Create weekday_mapper.py:
import sys
import datetime

# Read tab-separated records from stdin, convert the unix timestamp
# to a weekday (1 = Monday .. 7 = Sunday), and emit the transformed record.
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
Use the mapper script:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

Apache Weblog Data

The format of the Apache weblog is customizable, but most webmasters use the default. For the default Apache weblog format, we can create a table with the following command.
More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662
add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
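Once the table exists, the log can be queried like any other Hive table; for example, a hypothetical request count per host:
  hive> SELECT host, COUNT(1) FROM apachelog GROUP BY host;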

Fear

If you have done the things you truly wanted to do, then even if one day you fall ill and grow old, you will not be filled with regret. You will not feel that your life ended before you had even half lived it. One lesson is now certain: while you can still act on the things you dream of, you must overcome your fear.

Patients at the end of their lives say that they discovered boundless happiness once they realized they had nothing left to fear and nothing left to lose. What we fear is fear itself, not the thing we are afraid of.