Friday, April 25, 2014

NoSQL + RDBMS = NewSQL

source: http://www.bloter.net/archives/134607

In the Big Data Era, Keep an Eye on ‘NewSQL’

Lee Ji-young | 2012.11.20

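Of the big data developer events held this year, an unusually large number covered NoSQL systems such as HBase, Cassandra, MongoDB, and Redis. That is probably because, since big data arrived, database administrators (DBAs) and web service developers in Korea have been torn over how to process large volumes of data: keep using MySQL, which has served them well as a relational database (RDB) tool, or switch to NoSQL, which distributes data processing without the relational data model or SQL. Both technologies have their own strengths and weaknesses in distributed processing of large-scale data, so it is hard to declare a winner.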

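“This past September, Google announced ‘Spanner’, a distributed database system for large-scale data. It was a NewSQL DB system that combined the stability of SQL with the flexibility of NoSQL. SQL and NoSQL are now evolving together.”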

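For DBAs and developers caught in this dilemma, the answer offered by Colin Charles, technology evangelist at SkySQL and a MySQL developer, was simple: “Choose both.” Rather than running a service on just one of the two technologies, he said, the best approach is to combine them appropriately.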

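NewSQL refers to an RDB that offers high scalability and performance like NoSQL. It supports SQL and complies with the ACID (atomicity, consistency, isolation, durability) properties, the four properties a system must have in order to process transactional data. On top of that, it adds the scalability and flexibility characteristic of NoSQL to the database management system (DBMS). In effect, it combines only the strengths of SQL and NoSQL.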
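To make the atomicity part of ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is not a NewSQL system and is not mentioned in the article; the accounts table and the transfer function are hypothetical, chosen only to show that a failed transaction leaves no partial update behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two hypothetical accounts in one transaction."""
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

ok = transfer(conn, "a", "b", 30)       # succeeds and commits
failed = transfer(conn, "a", "b", 500)  # overdraws, so both updates roll back
print(ok, failed, balances(conn))
```

The second transfer touches both rows before it fails, yet neither change survives. That all-or-nothing behavior is what the article means when it says NewSQL keeps the transactional guarantees of SQL.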

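At first glance this may look like an answer so easy that an elementary school student could come up with it. But anyone with even a little understanding of the paths SQL and NoSQL have taken knows that combining the two is by no means easy. Why else would Google, one of the companies handling the most data in the world, have announced a technology that fuses SQL and NoSQL only this year? After all, SQL is an old technology that appeared in the 1970s, and NoSQL is a concept that came out of the MapReduce paper Google published in 2004.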

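According to Charles, before NoSQL appeared, companies believed SQL was the best technology for storing data. The fact that SQL systems had ACID properties built in also played a part. But the situation changed with the arrival of unstructured data such as photos, videos, and search logs.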

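“At first, because of the convenience of SQL, companies didn't even glance at other DB systems. But as time passed and unstructured data that is hard to process with SQL appeared, such as social networking service (SNS) data, DB systems began to change.”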

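Developers naturally turned to NoSQL DBs, whose structure makes unstructured data easier to process and store. True to its name, which comes from ‘Not Only SQL’, NoSQL broke away from the fixed framework of SQL and, touting the scalability and flexibility of a distributed architecture, began to establish itself as a necessary technology for distributed data processing. The fact that most NoSQL systems are open source projects that can process data at low cost also added to their popularity.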

“On top of that, once companies like Google, Facebook, and Twitter emphasized NoSQL, a NoSQL craze swept through like an IT fad. Everyone announced a NoSQL DB. NoSQL began to develop along a different line from SQL.”

According to Charles, the two technologies, which had developed separately and seemed destined never to meet, entered a new phase as NoSQL revealed its limits.

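“NoSQL has excellent scalability, but because the schema cannot be changed, it is hard to detect when something goes wrong with the data. On top of that, there is no standard language like SQL, and because it is based on document storage, developers have to build the records and insert them themselves. So developers started saying it was ‘hard to work with’.”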

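This is the background against which NewSQL emerged, embracing the strengths of existing SQL-based RDBs while adding NoSQL strengths such as scalability and flexibility. NewSQL DBs are designed to combine distributed processing technology capable of handling large-scale transactions with the scalability of a distributed architecture. Along with Google Spanner, MariaDB, JustOneDB, Drizzle, and GenieDB have established themselves as NewSQL DBs.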

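“Of course, I don't think SQL and NoSQL will disappear because NewSQL has arrived. NewSQL will simply become one more way of processing data in a DB. But I think anyone who handles large volumes of data should keep an eye on NewSQL, which blends the strengths of each technology.”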

Wednesday, April 23, 2014

Build your own CDH5 QuickStart VM with Spark on CentOS

source: http://dennyglee.com/2014/03/04/build-your-own-cdh5-quickstart-vm-with-spark-on-centos/

By Denny Lee

A great way to jump into CDH5 and Spark (with the latest version of Hue) is to build your own CDH5 setup on a VM. As of this writing, a CDH5 QuickStart VM is not available (though you can download the Cloudera QuickStart VM for CDH4.5). Below are the steps to build your own CDH5 / Spark setup on CentOS 6.5. Note that installing CDH5 through Cloudera Manager is actually quite straightforward, so these instructions focus on the steps prior to installing Cloudera Manager 5 (and the express install of CDH5) to minimize the hiccups you may run into. They pick up after you’ve set up your CentOS VM. In my case I am using CentOS 6.5 (the latest download as of this writing) and VMWare Fusion (for my Mac … and no, I’m not going to get into the Parallels vs. Mac debate!).

Basic Configuration

In this case, I’ve set up a VMWare Fusion VM so that I can get it up and running on my Mac (and take backups / snapshots if and when I mess up the configuration). Its basic configuration is 4GB RAM, 2 cores, and 80GB of disk space with a Bridged (Autodetect) network so it can have its own IP address.

Ensure your login has sudo access or can log in as root

For this setup, I have a login of spark and I’ve added the spark login to the list of sudoers:

- login as root

- edit the sudoers list:

visudo -f /etc/sudoers

Validate Hostname

Ensure that the hostname and hosts file are set up correctly so that both Cloudera Manager and CDH can work correctly. You also need to keep localhost so that, if you choose the embedded postgresql (for the Hive and Oozie metastores), it will install correctly. Note that this configuration works if you’re developing; if you are doing anything larger, it is recommended that you go with the remote database setup. For example, Oozie will not be able to execute Hive jobs if the Hive metastore is configured locally.

/etc/sysconfig/network
    HOSTNAME=sparky 

/etc/hosts
    10.0.0.16   sparky
    127.0.0.1   localhost

A good way to validate that the hostname is set up correctly is to run the Python one-liner below (this is the script CM5 uses to validate the hostname):

python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
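If you want a slightly more talkative check, the same lookup can be wrapped in a small script. This is only a sketch: the classify_resolution helper is hypothetical, and flagging a loopback result as a problem is my assumption (based on the hosts file above mapping sparky to 10.0.0.16), not something CM5 is confirmed to do.

```python
import socket


def classify_resolution(fqdn, ip):
    """Return a short verdict for a resolved hostname/IP pair."""
    if ip.startswith("127."):
        return "%s resolves to loopback address %s" % (fqdn, ip)
    return "%s resolves to %s" % (fqdn, ip)


def resolve_local_host():
    """Resolve this machine's FQDN the same way the one-liner does."""
    fqdn = socket.getfqdn()
    return fqdn, socket.gethostbyname(fqdn)


# The pair from the /etc/hosts example, and a misconfigured variant.
print(classify_resolution("sparky", "10.0.0.16"))
print(classify_resolution("sparky", "127.0.0.1"))
```

On the VM itself, classify_resolution(*resolve_local_host()) runs the check against the live configuration.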

Opening up for connectivity

Another way to say this is that I’m opening up the surface area of attack. For dev systems behind a firewall that contain non-sensitive data, these actions should be okay. But please do so at your own risk. (sorry for the legal-ese here).

Disable SELinux
/etc/sysconfig/selinux 

Disable Firewall
System > Administration > Firewall 

Restart

These actions are required because you will need to disable SELinux in order to install Cloudera Manager. I disabled the firewall so as not to need to open all of the different ports that CDH uses (i.e. being lazy here). For these changes to take effect, you will need to restart.

Install Java

While not strictly required, since Cloudera Manager and CDH5 typically include the JDK, I usually install Java anyway. Since this is a dev setup on my Mac (VMWare Fusion running CentOS 6.5), I chose the latest version of Java (as of this writing, JDK 7u51). You can download the latest Linux x64 RPMs of Java at: http://www.java.com/en/download/help/linux_x64rpm_install.xml

rpm -ivh jdk-7u51-linux-x64.rpm

Optional Database Installation

If this is a production system, it is highly recommended that you follow these optional steps to install Postgres as a remote database (instead of an embedded database). If you are building this for your own development purposes, using the automated installation of an embedded database works fine and is much easier.

Install and Configure Postgresql: Install and configure Postgres for use with Cloudera Manager / CDH

Post-Install Steps for Postgresql: Validate that you can utilize postgresql.

Install Cloudera Manager and then CDH 5

Now that you’ve done all the above steps, you can run the automated installation of Cloudera Manager. Once this completes, it will jump into the express installation of CDH5. The handy instructions include:

CDH5 Installation Guide

Installation Path A – Automated Installation by Cloudera Manager

The above link is handy because you can just click on the installation through the web browser and choose the appropriate configurations (e.g. YARN, Spark, etc.). By default, Spark is included with the default CDH5 installation so you should be good to go provided you do not uncheck it. As noted above, the installation of CM5 and CDH5 is relatively straightforward and easy.

Some Installation Tips

Swappiness Error Message

During the installation, you may get the following error message:

Cloudera recommends setting /proc/sys/vm/swappiness to 0. Current setting is 60. Use the sysctl command to change this setting at runtime and edit /etc/sysctl.conf for this setting to be saved after a reboot. You may continue with installation, but you may run into issues with Cloudera Manager reporting that your hosts are unhealthy because they are swapping. The following hosts are affected:

To resolve this, you can run the command:

sudo sysctl -w vm.swappiness=0
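Note that sysctl -w only changes the running value; the message above also asks you to edit /etc/sysctl.conf so the setting survives a reboot. Here is a small sketch for checking that a sysctl.conf-style file really sets it. The function name and the sample text are made up for illustration.

```python
def swappiness_in_sysctl_conf(text):
    """Return the last vm.swappiness value in sysctl.conf-style text, or None."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or ';' at the start of a line).
        if not line or line[0] in "#;":
            continue
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "vm.swappiness":
            value = int(raw.strip())
    return value


# Hypothetical file contents, not taken from a real system.
sample = """
# Kernel sysctl configuration file
net.ipv4.ip_forward = 0
vm.swappiness = 0
"""
print(swappiness_in_sysctl_conf(sample))
```

On the VM you would pass in the contents of /etc/sysctl.conf and compare the result with the value in /proc/sys/vm/swappiness.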

Reconfigure Disk space

When I built my CentOS VM, I originally used the default 20GB of disk space and then wanted to expand it to 80GB. In addition to wanting more disk space for data, Cloudera Manager has health alerts for the /opt and /var folders: if they go below 10GB of space, you will typically get alerts. The /opt folder contains third party software including Cloudera’s parcels and parcel cache. Meanwhile the /var folder will typically contain the Cloudera logs. A VM built by VMWare Fusion will typically have three partitions: sda1 (the boot device), sda2 (contains everything), and sda3 (the Linux swap).

To reconfigure the disk space, there is an excellent blog: Live Resizing of an EXT4 FileSystem on Linux

My original setup had the configuration below

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      616447      307200   83  Linux
/dev/sda2          616448    37814271    18598912   83  Linux
/dev/sda3        37814272    41943039     2064384   82  Linux swap / Solaris

After resizing based on the linked instructions, now this is my setup.

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      616447      307200   83  Linux
/dev/sda2          616448   163643392    81513472+  83  Linux
/dev/sda3       163643393   167772159     2064383+  82  Linux swap / Solaris
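To sanity-check the resize, the Blocks column for /dev/sda2 can be converted to GiB. This is a quick sketch that assumes fdisk is reporting 1 KiB blocks (the usual default for this style of output) and ignores the trailing "+" that marks a partial block.

```python
BLOCK_BYTES = 1024  # assumption: fdisk "Blocks" are 1 KiB units


def blocks_to_gib(blocks):
    """Convert an fdisk block count to GiB."""
    return blocks * BLOCK_BYTES / 2**30


before = blocks_to_gib(18598912)  # /dev/sda2 before resizing
after = blocks_to_gib(81513472)   # /dev/sda2 after resizing
print(round(before, 2), round(after, 2), after - before)
```

Under that assumption sda2 grows from roughly 17.7 GiB to roughly 77.7 GiB, which matches the 20GB to 80GB expansion described above.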

Quick Links to Spark Tutorials

Below are some links to get you jump started on how to work with Spark:

Tuesday, April 22, 2014

RHEL / CentOS 6: Install Nginx Using Yum Command

source: http://www.cyberciti.biz/faq/install-nginx-centos-rhel-6-server-rpm-using-yum-command/

by NIX CRAFT on JANUARY 7, 2013 · LAST UPDATED JANUARY 7, 2013
How can I install the Nginx web server on CentOS Linux 6 or Red Hat Enterprise Linux 6 using the yum command?

Tutorial details
Difficulty: Intermediate
Root privileges: Yes
Requirements: CentOS/RHEL, yum
Estimated completion time: N/A

Recently, the nginx project started to distribute binary packages through its own nginx yum repository. You can either create /etc/yum.repos.d/nginx.repo yourself or install the rpm package directly. This package contains the yum configuration file and the public PGP key needed to authenticate signed RPMs.

Step #1: Install nginx repo

Type the following wget command to install nginx yum configuration file:
# cd /tmp

CentOS Linux v6.x users, type the following commands:
# wget http://nginx.org/packages/centos/6/noarch/RPMS/nginx-release-centos-6-0.el6.ngx.noarch.rpm
# rpm -ivh nginx-release-centos-6-0.el6.ngx.noarch.rpm

RHEL v6.x users, type the following commands:
# wget http://nginx.org/packages/rhel/6/noarch/RPMS/nginx-release-rhel-6-0.el6.ngx.noarch.rpm
# rpm -ivh nginx-release-rhel-6-0.el6.ngx.noarch.rpm

Sample outputs:

warning: nginx-release-rhel-6-0.el6.ngx.noarch.rpm: Header V4 RSA/SHA1 Signature, key ID 7bd9bf62: NOKEY
Preparing... ########################################### [100%]
1:nginx-release-rhel ########################################### [100%]

Step #2: Install nginx web-server

Type the following yum command to install nginx web-server:
# yum install nginx

Sample outputs:

Loaded plugins: product-id, rhnplugin, security, subscription-manager
Updating certificate-based repositories.
Unable to read consumer identity
nginx | 1.3 kB 00:00
nginx/primary | 4.8 kB 00:00
nginx 33/33
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package nginx.x86_64 0:1.2.6-1.el6.ngx will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
nginx x86_64 1.2.6-1.el6.ngx nginx 361 k

Transaction Summary
================================================================================
Install 1 Package(s)

Total download size: 361 k
Installed size: 835 k
Is this ok [y/N]: y
Downloading Packages:
nginx-1.2.6-1.el6.ngx.x86_64.rpm | 361 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
Installing : nginx-1.2.6-1.el6.ngx.x86_64 1/1
----------------------------------------------------------------------

Thanks for using NGINX!

Check out our community web site:
* http://nginx.org/en/support.html

If you have questions about commercial support for NGINX please visit:
* http://www.nginx.com/support.html

----------------------------------------------------------------------
Installed products updated.
Verifying : nginx-1.2.6-1.el6.ngx.x86_64 1/1

Installed:
nginx.x86_64 0:1.2.6-1.el6.ngx

Complete!

Step #3: Turn on nginx service

Type the following command:
# chkconfig nginx on

How do I start / stop / restart nginx web-server?

Type the following commands:
# service nginx start
# service nginx stop
# service nginx restart
# service nginx status
# service nginx reload

Step #4: Configuration files

Default configuration directory: /etc/nginx/
Default SSL and vhost config directory: /etc/nginx/conf.d/
Default log file directory: /var/log/nginx/
Default document root directory: /usr/share/nginx/html
Default configuration file: /etc/nginx/nginx.conf
Default server access log file: /var/log/nginx/access.log
Default server error log file: /var/log/nginx/error.log

To edit the nginx configuration file, enter:
# vi /etc/nginx/nginx.conf

Set or update worker_processes as follows. It should match the number of CPUs in your system; use the lscpu | grep '^CPU(s)' command to list the number of CPUs in the server:

worker_processes 2;

Turn on gzip support:

gzip on;

Save and close the file. Edit the file /etc/nginx/conf.d/default.conf, enter:
# vi /etc/nginx/conf.d/default.conf

Set IP address and TCP port number:

listen 202.54.1.1:80;

Set server name:

server_name www.cyberciti.biz;

Save and close the file. Start the server:
# service nginx start
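For the worker_processes step earlier in this section, the CPU count can also be read programmatically instead of with lscpu. The helper below is a hypothetical sketch that only builds the directive text; it does not edit nginx.conf for you.

```python
import os


def worker_processes_directive(cpus=None):
    """Build an nginx worker_processes line for the given CPU count."""
    if cpus is None:
        # os.cpu_count() can return None when the count is unknown.
        cpus = os.cpu_count() or 1
    return "worker_processes %d;" % cpus


print(worker_processes_directive(2))  # the two-CPU case used in this tutorial
print(worker_processes_directive())   # whatever this machine reports
```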

Verify that everything is working:
# netstat -tulpn | grep :80
# ps aux | grep nginx
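The netstat check above can be complemented by actually opening a TCP connection to the port. A minimal sketch follows; the port_is_open helper is hypothetical, and 127.0.0.1 is used only as an example, so substitute the address from your listen directive.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(port_is_open("127.0.0.1", 80))
```

A True result only shows that something accepts connections on that port; ps aux | grep nginx is still what confirms that the listener is nginx.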

Firewall configuration: Open TCP port # 80

Edit the file /etc/sysconfig/iptables, enter:
# vi /etc/sysconfig/iptables

Add the following lines, ensuring that they appear before the final LOG and DROP lines for the INPUT chain to open port 80:

-A INPUT -m state --state NEW -p tcp --dport 80 -j ACCEPT
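The placement requirement (before the final LOG and DROP lines) is easy to get wrong by hand, so here is a sketch of it as a pure text transformation. The function name and the sample rule set are hypothetical, and treating REJECT as a final target alongside LOG and DROP is my assumption about typical CentOS 6 rule files.

```python
import re

RULE = "-A INPUT -m state --state NEW -p tcp --dport 80 -j ACCEPT"
# Matches INPUT rules that log or refuse whatever reaches them.
FINAL_TARGET = re.compile(r"^-A INPUT\b.*-j (LOG|DROP|REJECT)\b")


def insert_before_final_rules(lines, rule=RULE):
    """Insert `rule` before the first INPUT LOG/DROP/REJECT line."""
    if rule in lines:
        return list(lines)  # already present, nothing to do
    for index, line in enumerate(lines):
        if FINAL_TARGET.match(line):
            return lines[:index] + [rule] + lines[index:]
    raise ValueError("no final LOG/DROP/REJECT rule found in the INPUT chain")


# Made-up rule file for illustration.
sample = [
    "*filter",
    ":INPUT ACCEPT [0:0]",
    "-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    "-A INPUT -i lo -j ACCEPT",
    "-A INPUT -j LOG",
    "-A INPUT -j DROP",
    "COMMIT",
]
updated = insert_before_final_rules(sample)
print("\n".join(updated))
```

The function works on a list of lines rather than on /etc/sysconfig/iptables directly, so you can review the result before writing anything back.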

Save and close the file. Finally, restart the firewall:
# service iptables restart

Monday, April 21, 2014

Cisco UCS and HP Blades: A Look at TCO

Total Cost of Ownership (TCO) of IT

source: http://www.nashnetworks.ca/total-cost-of-ownership-tco-of-it.htm

January 2009 "Making every IT dollar count!" Part 1
Copyright Nash Networks

Executive Summary

Total cost of ownership (TCO) includes the direct and indirect costs over the lifecycle of an asset.

The biggest direct IT costs are hardware, software and support.

The purchase price of hardware and software is typically less than 50% of the total direct costs.

Indirect, or "hidden", costs are caused by lost or reduced productivity because of downtime, informal peer support, suboptimal performance and other causes of poor functioning or wasted time.

Indirect costs can account for more than half the TCO. Despite this, they are often totally overlooked.

Costs can be reduced and productivity increased by proper planning and management.

Detailed calculation of TCO doesn't make much sense for most small businesses, but understanding the true costs, their relative importance and how they can be contained, is critically important.

Decision-makers must always balance the costs of a system versus the benefits it brings to the business and the end users.

"Cost is what you pay. Value is what you get." (Warren Buffett).

What Is TCO and Why Does It Matter?

TCO calculates the direct and indirect costs of IT over its lifecycle.

It's important because it gives a realistic picture of the true cost of IT, and from there it's possible to decide how to make IT expenditure most cost-effective.

Typical TCO and TCO Breakdown

The TCO of typical office PC systems ranges from $3,000 to $10,000 per unit per year. One study found that TCO was 2½ to 3 times the direct cost of hardware, software and support. In other words, for every dollar spent on direct costs, another $1.50 to $2 was spent indirectly.

For a typical server, management and maintenance represent about 60% of total costs and downtime chews up 15%.

TCO differs between organizations, given their different computing environments, user experience level and IT expertise.

PC systems have much higher indirect costs than direct costs.

TCO analysis is always inexact, due to the many assumptions and unknowns that have to be taken into account.

As you provide more functionality and capability to end users, TCO rises.

As you install more software or provide more complex hardware at the hands of end users, you pay increasingly more for support and maintenance.

The following table shows the TCO for a single desktop PC over its 3-4 year lifecycle. This example doesn't take all hidden costs into account, which is why it's so much lower than other estimates in this paper, but the numbers still demonstrate that the total cost of the computer over its lifetime is more than double the purchase cost.

Phase of Lifecycle	Cost
Purchase (computer; printer/scanner/fax; cables, printer ink; paper)	$3,090
Deployment (Setup, staff downtime)	$500
Operations (Admin, downtime)	$1,040
Support	$1,680
Retirement	$630
TOTAL COST	$6,940
APPROX. ANNUAL COST	$2,000

Microsoft's 2008 figures for a similar analysis are $5,384 per year over the lifecycle of a PC, of which acquisition costs average $1,364 per year.
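The arithmetic behind the lifecycle table and the Microsoft figures is easy to check. The sketch below uses only the numbers quoted above; the variable names are my own.

```python
# Lifecycle costs for a single desktop PC, from the table above (USD).
lifecycle_costs = {
    "Purchase": 3090,
    "Deployment": 500,
    "Operations": 1040,
    "Support": 1680,
    "Retirement": 630,
}

total = sum(lifecycle_costs.values())
ratio_to_purchase = total / lifecycle_costs["Purchase"]
annual_low, annual_high = total / 4, total / 3  # 4-year and 3-year lifecycles

# Microsoft's 2008 figures quoted above, per PC per year.
ms_annual_tco, ms_annual_acquisition = 5384, 1364
ms_acquisition_share = ms_annual_acquisition / ms_annual_tco

print(total, round(ratio_to_purchase, 2), annual_low, round(annual_high))
print(round(ms_acquisition_share * 100))
```

The total comes to $6,940, about 2.25 times the purchase cost, and spreading it over 3 to 4 years brackets the approximate $2,000 annual figure. In the Microsoft numbers, acquisition is roughly a quarter of the yearly TCO.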

Direct Costs

Direct IT costs typically include:

Hardware and software

Support

Consumables

Network-related recurring costs (e.g. Internet)

Facilities (e.g. environment-controlled server room)

Administration (human resources, training)

For direct costs alone, support is usually more than half the total IT budget. A study of US schools estimated that a well-supported technology program requires annual expenditures of 30-50% of the original investment. On university campuses, direct costs were broken down as hardware 24%, software 7% and personnel-related 53%. Another source estimated that 65% of IT budgets go to ongoing support.

The following tables show direct costs from 51 large organizations with average revenues of over $450 million and average staff numbers over 2,500 (Gartner 2007). Smaller organizations may have lower costs per user because of smaller budgets, less waste and less complexity, but on the other hand they lack the advantage of economies of scale and may need to maintain ageing systems.

Overview (Gartner 2007)

IT budget
Average IT operating budget as % of revenue	5.5%
Average IT capital budget as % of revenue	2.5%
Average IT operating budget per employee	$9,100

Breakdown of direct IT costs (Gartner 2007)

IT spending by category
Hardware	26%
Software	20%
Support (staff, external providers, contractors)	41%
Telecommunications	13%
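To see what these percentages mean in dollars, they can be applied to a budget figure. The sketch below applies them to the $9,100 average operating budget per employee from the overview table; whether Gartner's category split applies to that exact figure is my assumption, so treat the output as illustrative only.

```python
spending_share = {  # percent of IT spending, from the table above
    "Hardware": 26,
    "Software": 20,
    "Support": 41,
    "Telecommunications": 13,
}


def split_budget(budget, shares):
    """Split a budget across categories given percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100%")
    return {name: budget * pct / 100 for name, pct in shares.items()}


per_employee = split_budget(9100, spending_share)
for name, dollars in per_employee.items():
    print(f"{name}: ${dollars:,.0f}")
```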

Average operating budget (per employee per year) in different industries (Gartner 2007)

Industry	Annual Cost Per Employee Per Year
Average for all industries	$6,800
Communications	$15,800
Construction	$4,100
Distribution - Retail	$3,300
Distribution - Wholesale	$6,100
Financial Services - Banking	$13,800
Financial Services - Insurance	$9,800
Financial Services - Other	$11,200
Media	$11,000
Professional Services	$9,100
Transportation	$5,900

Indirect Costs

Indirect costs are all, in some way, related to lost productivity – with direct implications for profitability and competitiveness.

Availability takes precedence over all other requirements: A system is only useful if it's up, running and functioning well! Maintaining high availability requires significant maintenance and management and a pro-active approach.

Hidden costs include:

Downtime – scheduled and unscheduled. All or part of the network is not available to users.

Suboptimal functioning – e.g. inappropriate or outdated applications software, slow computers or poorly trained users.

User-induced problems – e.g. deleting critical files, ignoring warning messages, clicking on pop-ups that install viruses, changing configurations.

"Shadow support" – internal support provided by advanced end users on top of their official job duties. (When these end users are proficient and know their limits, this can save, rather than cost, money, but only if the time they spend on IT saves more than the productivity lost from their normal duties.)

“Futz” Factor - use of computers for non-business purposes (e.g. online games, surfing the Web or personal emails).

"Fiddle" Factor – time spent by users changing the look and feel of their computers e.g. changing the desktop, installing desktop accessories, fiddling with fonts.

Time that is often not tracked or is overlooked – for example, time spent researching purchases and getting quotes; time spent dealing with vendors before a problem is diagnosed and fixed.

Quantifying Hidden Costs

Quantifying hidden or indirect costs is extremely difficult, and working them out in great detail is generally not worth the time. However, it's certainly well worth the exercise of working through the different categories and putting an estimate to each. You might find the results surprising or even shocking!

You can download a free TCO calculator, in Excel format, from Info-Tech, or download Tri-Active's white paper, "Calculating Your Total Cost of Ownership (TCO)". Both are free, but require registration.
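If you would rather start with a rough back-of-the-envelope version, the exercise of putting an estimate to each category can be sketched in a few lines. Every number below is a placeholder invented for illustration, not data from any study cited here; plug in your own estimates of hours lost per user per month and your own hourly cost.

```python
def annual_hidden_cost(hours_per_month, hourly_cost):
    """Estimate yearly hidden cost per user from monthly hours lost per category."""
    return {
        category: hours * 12 * hourly_cost
        for category, hours in hours_per_month.items()
    }


# Placeholder inputs for illustration only; replace with your own estimates.
hours_lost_per_month = {
    "Downtime": 2.0,
    "Suboptimal functioning": 3.0,
    "User-induced problems": 1.0,
    "Shadow support": 1.5,
    "Futz factor": 4.0,
    "Fiddle factor": 0.5,
    "Untracked time": 1.0,
}
HOURLY_COST = 40.0  # placeholder loaded hourly cost of an employee

estimate = annual_hidden_cost(hours_lost_per_month, HOURLY_COST)
for category, cost in sorted(estimate.items(), key=lambda kv: -kv[1]):
    print(f"{category}: ${cost:,.0f}")
print(f"Total: ${sum(estimate.values()):,.0f}")
```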

Cost Reduction Strategies

How do you find the right balance between managing costs and optimizing productivity?

Active management of computers can substantially reduce lifetime costs. The following graphs demonstrate savings for desktop and notebook PCs.

(Notebooks have significantly higher TCO than desktops because they are more difficult to support and are generally provided to users with higher salaries than those using desktops.)

Pro-active monitoring. Monitoring tools, in experienced hands, allow problems to be addressed and resolved before they reach crisis point. Recently, we identified a faulty hard drive on a client's server before it had actually crashed. We were able to back up and replace the drive with no disruption to the organization. Without the monitoring, they would have found themselves with a crashed server, no backup and days of disruption and lost productivity.

Planning. Proper planning can cut both direct and indirect costs. For example, upgrading an entire network at once, if planned properly, can cost substantially less than piecemeal upgrades. You can negotiate volume discounts with suppliers, minimize downtime with a proper project schedule, ensure that all staff use the same systems and bring in trainers to ensure that everyone will be using the new system efficiently.

Policies & standards. In many organizations, users customize their computers to the extent that each is effectively operating a different kind of machine. This dramatically increases support costs as well as chewing up users' time. Clear policies on computer set-up and what is and isn't allowed will help dramatically. Proper Internet usage policies will cut virus and spyware infections.

Training. Poor IT skills in the workplace are a significant cause of lost productivity. Examples are slow and inaccurate typing, inability to effectively use a wide range of functions for key software like Office and inadvertently downloading viruses. Nobody would learn to drive a car by trial and error! Proper training is essential for cost-effective use of IT resources.

Vendor management. The more complex an environment becomes, the harder it is for business owners to diagnose the source of a problem. Let's say email isn't going out or coming in. Is it the Internet connection? The wireless router? The incoming email provider? The outgoing email provider? The spam filter? The BES server? The Exchange server? Outlook? Staff or owners typically spend a lot of time working with multiple vendors to try to pinpoint the problem. Users may not know what questions to ask, and vendor support personnel may be junior and limited in their ability to diagnose. Vendors are also notorious for passing the buck. By using a knowledgeable managed services provider as your single point of contact to liaise with and manage vendors, a whole lot of these issues magically disappear.

Automation. The last few years have seen an explosion of superb monitoring and management tools, yet a surprising number of IT consultants don't use them. That's bad news for their clients, because reactive on-site support costs more. Upgrades, for example, used to be a slow and laborious task, but can now be scheduled to run automatically outside of business hours, with no disruption to users. Remote patch management allows managed services providers to delay installing patches until they've been pronounced safe and then install automatically at convenient times.

Remote support tools allow the number of site visits to be cut by up to 90%, with dramatic savings. Not only do consultants not need to factor in travel costs, but techs can also work on several computers at once. That means you only pay for the time they spend working on your computer, not the hours it sometimes takes for processes to run while the tech hangs around your office drinking your coffee.

Backup and Disaster Recovery. An appropriate, well-functioning backup and DR system is a critical part of business insurance. It can save significant money and productivity, and, in some cases, whole businesses.

Security. Effective and appropriate security measures can prevent significant disruptions, and are essential to some organizations.

Keep It In Perspective

It certainly makes sense to keep costs as low as possible, but decision-makers must always balance the costs of a system versus the benefits it brings to the business and the end users.

Internet connectivity is a good example. Costs are significant - the connection, cabling, security, potential damage from hackers, viruses, and other malicious activities, staff time wasted on unauthorized surfing etc. On the other hand, what business can adequately compete or even survive without the access to information, worldwide reach, and accessibility to customers that the Internet provides?

Ultimately, many IT decisions you make will not be due to cost-avoidance but rather on the basis of business advantage.

Sources

Calculating Your Total Cost of Ownership (TCO)

Forget About March Madness Killing Productivity, Teach Your Employees How To Use Computers

Free Total Cost of Ownership (TCO) Calculator for IT

Gartner Research 2003 Desktop TCO update DF-19-9687

Gartner Research 2006-2007 IT Spending and Staffing Report - North America 5 March 2007 ID Number: G00146284

Gartner Says Effective Management Can Cut Total Cost of Ownership for Desktop PCs by 42 Per cent

How Windows Reduces TCO

IT illiteracy undermines productivity

IT Toolbox Buyer's Guide: Maintenance cost

Managing the PC Lifecycle: Total Cost of Ownership – Info-Tech Research Group 2005

Seven Benchmarks for Information Technology Investments - Educause Quarterly Number 3, 2002

The Enterprise PC Lifecycle - Microsoft 2008

Total Cost Of Ownership (TCO) - Sources Analyst Perspectives

Total Cost of Ownership and Cost Reduction Analyses: An Evaluation of End User Computing Costs

Total Cost of Ownership: Principles and Practical Applications - Floyd Piedad

Total Costs of Notebook Ownership When Used by Travelling Workers

Total Costs per year of PC Ownership, 2008

Thursday, April 17, 2014

Big Data Benchmark

source: https://amplab.cs.berkeley.edu/benchmark/

Introduction

Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

Redshift - a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse. We tested Redshift on HDDs.
Hive - a Hadoop-based data warehousing system. (v0.12)
Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
Stinger/Tez - Tez is a next generation Hadoop execution engine currently in development (v0.2.0)

This remains a work in progress and will evolve to include additional frameworks and new capabilities. We welcome contributions.

What this benchmark is not

This benchmark is not intended to provide a comprehensive overview of the tested platforms. We are aware that by choosing default configurations we have excluded many optimizations. The choice of a simple storage format, compressed SequenceFile, omits optimizations included in columnar formats such as ORCFile and Parquet. For now, we've targeted a simple comparison between these systems with the goal that the results are understandable and reproducible.

What is being evaluated?

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDF's, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDF's), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.

Changes and Notes (February 2014)

We changed the Hive configuration from Hive 0.10 on CDH4 to Hive 0.12 on HDP 2.0.6. As a result, direct comparisons between the current and previous Hive results should not be made. It is difficult to account for changes resulting from modifications to Hive as opposed to changes in the underlying Hadoop distribution.
We have added Tez as a supported platform. It is important to note that Tez is currently in a preview state.
Hive has improved its query optimization, which is also inherited by Shark. This set of queries does not test the improved optimizer.
We have changed the underlying filesystem from Ext3 to Ext4 for Hive, Tez, Impala, and Shark benchmarking.

Dataset and Workload

Our dataset and queries are inspired by the benchmark contained in a comparison of approaches to large scale analytics. The input data set consists of a set of unstructured HTML documents and two SQL tables which contain summary information. It was generated using Intel's Hadoop benchmark tools and data sampled from the Common Crawl document corpus. There are three datasets with the following schemas:

Dataset	Description	Schema
Documents	Unstructured HTML documents	(none)
Rankings	Lists websites and their page rank	pageURL VARCHAR(300), pageRank INT, avgDuration INT
UserVisits	Stores server logs for each web page	sourceIP VARCHAR(116), destURL VARCHAR(100), visitDate DATE, adRevenue FLOAT, userAgent VARCHAR(256), countryCode CHAR(3), languageCode CHAR(6), searchWord VARCHAR(32), duration INT

Query 1 and Query 2 are exploratory SQL queries. We vary the size of the result to expose the scaling properties of each system.
- Variant A: BI-Like - result sets are small (e.g., could fit in memory in a BI tool)
- Variant B: Intermediate - result set may not fit in memory on one node
- Variant C: ETL-Like - result sets are large and require several nodes to store
Query 3 is a join query with a small result set, but varying sizes of joins.
Query 4 is a bulk UDF query. It calculates a simplified version of PageRank using a sample of the Common Crawl dataset.

Hardware Configuration

Results | February 2014

We launch EC2 clusters and run each query several times. We report the median response time here. Except for Redshift, all data is stored on HDFS in compressed SequenceFile format. Each query is run with seven frameworks:

Redshift	Amazon Redshift with default options.
Shark - disk	Input and output tables are on-disk compressed with gzip. OS buffer cache is cleared before each run.
Impala - disk	Input and output tables are on-disk compressed with snappy. OS buffer cache is cleared before each run.
Shark - mem	Input tables are stored in Spark cache. Output tables are stored in Spark cache.
Impala - mem	Input tables are coerced into the OS buffer cache. Output tables are on disk (Impala has no notion of a cached table).
Hive	Hive on HDP 2.0.6 with default options. Input and output tables are on disk compressed with snappy. OS buffer cache is cleared before each run.
Tez	Tez with the configuration parameters specified here. Input and output tables are on disk compressed with snappy. OS buffer cache is cleared before each run.
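
As a small illustration of the reporting method described above (several runs per query, median reported), the sketch below computes a median from a list of timings. The function name and the sample timings are hypothetical and are not taken from the benchmark's own scripts.

```python
import statistics


def median_response_time(run_times_s):
    """Return the median of several measured response times, in seconds."""
    if not run_times_s:
        raise ValueError("need at least one run")
    return statistics.median(run_times_s)


# Hypothetical timings for one query on one framework (not benchmark data).
# One slow outlier (30.2 s) barely affects the median.
example_runs = [12.4, 11.9, 30.2, 12.1, 12.0]
print(median_response_time(example_runs))
```

A median is less sensitive than a mean to the occasional slow run, which is a plausible reason to prefer it on shared cloud hardware.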

1. Scan Query

SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

	Query 1A 32,888 results	Query 1B 3,331,851 results	Query 1C 89,974,976 results

	Median Response Time (s)
Redshift (HDD) - Current	2.49	2.61	9.46
Impala - Disk - 1.2.3	12.015	12.015	37.085
Impala - Mem - 1.2.3	2.17	3.01	36.04
Shark - Disk - 0.8.1	6.6	7	22.4
Shark - Mem - 0.8.1	1.7	1.8	3.6
Hive - 0.12 YARN	50.49	59.93	43.34
Tez - 0.2.0	28.22	36.35	26.44

This query scans and filters the dataset and stores the results.

This query primarily tests the throughput with which each framework can read and write table data. The best performers are Impala (mem) and Shark (mem) which see excellent throughput by avoiding disk. For on-disk data, Redshift sees the best throughput for two reasons. First, the Redshift clusters have more disks and second, Redshift uses columnar compression which allows it to bypass a field which is not used in the query. Shark and Impala scan at HDFS throughput with fewer disks.

Both Shark and Impala outperform Hive by 3-4X due in part to more efficient task launching and scheduling. As the result sets get larger, Impala becomes bottlenecked on the ability to persist the results back to disk. Nonetheless, since the last iteration of the benchmark Impala has improved its performance in materializing these large result-sets to disk.

Tez sees about a 40% improvement over Hive in these queries. This is in part due to the container pre-warming and reuse, which cuts down on JVM initialization time.
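
The "about 40%" figure can be checked against the Query 1 table above. The sketch below is only arithmetic on the published medians; the helper name is an illustrative assumption.

```python
# Median response times (s) copied from the Query 1 table above.
hive = {"1A": 50.49, "1B": 59.93, "1C": 43.34}
tez = {"1A": 28.22, "1B": 36.35, "1C": 26.44}


def improvement(baseline_s, candidate_s):
    """Fraction of the baseline's response time that the candidate saves."""
    return (baseline_s - candidate_s) / baseline_s


for variant in ("1A", "1B", "1C"):
    saved = improvement(hive[variant], tez[variant])
    print(f"Query {variant}: Tez saves {saved:.0%} of Hive's response time")
```

The three variants come out at roughly 39% to 44%, consistent with the figure quoted in the text.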

2. Aggregation Query

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)

	Query 2A 2,067,313 groups	Query 2B 31,348,913 groups	Query 2C 253,890,330 groups

	Median Response Time (s)
Redshift (HDD) - Current	25.46	56.51	79.15
Impala - Disk - 1.2.3	113.72	155.31	277.53
Impala - Mem - 1.2.3	84.35	134.82	261.015
Shark - Disk - 0.8.1	151.4	164.3	196.5
Shark - Mem - 0.8.1	83.7	100.1	132.6
Hive - 0.12 YARN	730.62	764.95	833.3
Tez - 0.2.0	377.48	438.03	427.56

This query applies string parsing to each input tuple then performs a high-cardinality aggregation.

Redshift's columnar storage provides greater benefit than in Query 1 since several columns of the UserVisits table are unused. While Shark's in-memory tables are also columnar, it is bottlenecked here on the speed at which it evaluates the SUBSTR expression. Since Impala is reading from the OS buffer cache, it must read and decompress entire rows. Unlike Shark, however, Impala evaluates this expression using very efficient compiled code. These two factors offset each other, and Impala and Shark achieve roughly the same raw throughput for in-memory tables. For larger result sets, Impala again sees high latency due to the speed of materializing output tables.
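
To make the shape of this query concrete, here is a minimal single-machine sketch of the same computation: take a prefix of each sourceIP and sum adRevenue per prefix. It is an illustration only, not how any of the benchmarked engines execute the query, and the sample rows are made up.

```python
from collections import defaultdict


def revenue_by_ip_prefix(rows, prefix_len):
    """Sum adRevenue grouped by the first prefix_len characters of sourceIP.

    rows is an iterable of (sourceIP, adRevenue) pairs. SQL's
    SUBSTR(sourceIP, 1, X) corresponds to the Python slice sourceIP[:X].
    """
    totals = defaultdict(float)
    for source_ip, ad_revenue in rows:
        totals[source_ip[:prefix_len]] += ad_revenue
    return dict(totals)


# Made-up rows for illustration (not benchmark data).
sample = [("10.1.2.3", 1.5), ("10.1.9.9", 2.0), ("172.16.0.1", 0.25)]
print(revenue_by_ip_prefix(sample, 4))
```

A longer prefix produces more distinct groups, which is how variants 2A through 2C increase the cardinality of the aggregation.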

3. Join Query

SELECT sourceIP, totalRevenue, avgPageRank
FROM
  (SELECT sourceIP,
          AVG(pageRank) as avgPageRank,
          SUM(adRevenue) as totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
       AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
    GROUP BY UV.sourceIP)
  ORDER BY totalRevenue DESC LIMIT 1

	Query 3A 485,312 rows	Query 3B 53,332,015 rows	Query 3C 533,287,121 rows

	Median Response Time (s)
Redshift (HDD) - Current	33.29	46.08	168.25
Impala - Disk - 1.2.3	108.68	129.815	431.26
Impala - Mem - 1.2.3	41.21	76.005	386.6
Shark - Disk - 0.8.1	111.7	135.6	382.6
Shark - Mem - 0.8.1	44.7	67.3	318
Hive - 0.12 YARN	561.14	717.56	2374.17
Tez - 0.2.0	323.06	402.33	1361.9

This query joins a smaller table to a larger table then sorts the results.

When the join is small (3A), all frameworks spend the majority of time scanning the large table and performing date comparisons. For larger joins, the initial scan becomes a less significant fraction of overall response time. For this reason the gap between in-memory and on-disk representations diminishes in query 3C. All frameworks perform partitioned joins to answer this query. CPU (due to hashing join keys) and network IO (due to shuffling data) are the primary bottlenecks. Redshift has an edge in this case because the overall network capacity in the cluster is higher.
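
The partitioned join mentioned above can be sketched in a few lines. This is a single-process illustration of the general technique (hash both inputs on the join key, then join matching partitions), not the implementation used by any of these frameworks. The function names, row layouts, and sample data are assumptions made for the example.

```python
def partition(rows, key_index, num_partitions):
    """Split rows into buckets by hashing the join key (the 'shuffle' step)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts


def partitioned_hash_join(rankings, user_visits, num_partitions=4):
    """Join Rankings (pageURL, pageRank) with UserVisits
    (sourceIP, destURL, adRevenue) on pageURL = destURL.

    Returns (sourceIP, pageRank, adRevenue) tuples in no particular order.
    """
    joined = []
    r_parts = partition(rankings, 0, num_partitions)
    uv_parts = partition(user_visits, 1, num_partitions)
    # Rows with equal keys land in the same partition number on both sides,
    # so each pair of partitions can be joined independently.
    for r_part, uv_part in zip(r_parts, uv_parts):
        rank_by_url = {url: rank for url, rank in r_part}  # build side
        for source_ip, dest_url, ad_revenue in uv_part:    # probe side
            if dest_url in rank_by_url:
                joined.append((source_ip, rank_by_url[dest_url], ad_revenue))
    return joined


# Made-up rows for illustration (not benchmark data).
rankings = [("a.com", 10), ("b.com", 20)]
visits = [("1.1.1.1", "a.com", 0.5), ("2.2.2.2", "b.com", 1.0), ("3.3.3.3", "c.com", 2.0)]
print(sorted(partitioned_hash_join(rankings, visits)))
```

In a cluster, each partition would live on a different node, so the partitioning step is where the hashing cost and the network shuffle described above come from.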

4. External Script Query

CREATE TABLE url_counts_partial AS 
  SELECT TRANSFORM (line)
    USING "python /root/url_count.py" as (sourcePage, destPage, cnt) 
  FROM documents;
CREATE TABLE url_counts_total AS 
  SELECT SUM(cnt) AS totalCount, destPage 
  FROM url_counts_partial 
  GROUP BY destPage;

	Query 4 (phase 1)	Query 4 (phase 2)	Query 4 (total)

	Median Response Time (s)
Redshift (HDD) - Current	not supported	not supported	not supported
Impala - Disk - 1.2.3	untested	untested	untested
Impala - Mem - 1.2.3	untested	untested	untested
Shark - Disk - 0.8.1	232.2	47.2	279.4
Shark - Mem - 0.8.1	162.9	28.1	191.4
Hive - 0.12 YARN	896.47	150.48	1047.45
Tez - 0.2.0	894.16	62.6	966.18

This query calls an external Python function which extracts and aggregates URL information from a web crawl dataset. It then aggregates a total count per URL.

Impala and Redshift do not currently support calling this type of UDF, so they are omitted from the result set. Impala UDFs must be written in Java or C++, whereas this script is written in Python. The performance advantage of Shark (disk) over Hive in this query is less pronounced than in Queries 1, 2, or 3 because the shuffle and reduce phases take a relatively small amount of time (this query only shuffles a small amount of data), so the task-launch overhead of Hive matters less. Also note that when the data is in-memory, Shark is bottlenecked by the speed at which it can pipe tuples to the Python process rather than memory throughput. This makes the speedup relative to disk around 5X (rather than 10X or more seen in other queries).
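
For readers unfamiliar with TRANSFORM scripts, the sketch below shows the general shape of one: read rows from stdin, write tab-separated result rows to stdout. It is not the benchmark's actual url_count.py, whose contents are not reproduced in this post. The regular expression, the function names, and the placeholder source page are all assumptions for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: a deliberately simple matcher for absolute links in HTML.
HREF_PATTERN = re.compile(r'href="(https?://[^"]+)"')


def count_links(lines, source_page="unknown"):
    """Count (source_page, dest_page) pairs found in the given HTML lines.

    source_page is a placeholder: how the real script identifies the
    source page is not shown in this post.
    """
    counts = Counter()
    for line in lines:
        for dest_page in HREF_PATTERN.findall(line):
            counts[(source_page, dest_page)] += 1
    return counts


def format_rows(counts):
    """Render counts as tab-separated rows: sourcePage, destPage, cnt."""
    return [f"{src}\t{dest}\t{cnt}" for (src, dest), cnt in sorted(counts.items())]


def main():
    # A TRANSFORM script receives input rows on stdin and returns
    # result rows on stdout, one tab-separated row per line.
    for row in format_rows(count_links(sys.stdin)):
        print(row)
```

Calling `main()` at the bottom of the file would turn this into a runnable streaming script. Every input tuple crosses a process boundary on its way to Python, which is the piping cost the paragraph above refers to.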

Discussion

These numbers compare performance on SQL workloads, but raw performance is just one of many important attributes of an analytic framework. Systems like Hive, Impala, and Shark are used because they offer a high degree of flexibility, both in terms of the underlying format of the data and the type of computation employed. Below we summarize a few qualitative points of comparison:

System	SQL variant	Execution engine	UDF Support	Mid-query fault tolerance	Open source	Commercial support	HDFS Compatible
Hive	Hive QL (HQL)	MapReduce	Yes	Yes	Yes	Yes	Yes
Tez	Hive QL (HQL)	Tez	Yes	Yes	Yes	Yes	Yes
Shark	Hive QL (HQL)	Spark	Yes	Yes	Yes	Yes	Yes
Impala	Some HQL + some extensions	DBMS	Yes (Java/C++)	No	Yes	Yes	Yes
Redshift	Full SQL 92 (?)	DBMS	No	No	No	Yes	No

FAQ

What's next?

We would like to include the columnar storage formats for Hadoop-based systems, such as Parquet and RCFile. We would also like to run the suite at higher scale factors, using different types of nodes, and/or inducing failures during execution. Finally, we plan to re-evaluate on a regular basis as new versions are released.

We wanted to begin with a relatively well known workload, so we chose a variant of the Pavlo benchmark. This benchmark is heavily influenced by relational queries (SQL) and leaves out other types of analytics, such as machine learning and graph processing. The largest table also has fewer columns than in many modern RDBMS warehouses. In future iterations of this benchmark, we may extend the workload to address these gaps.

How is this different from the 2008 Pavlo et al. benchmark?

This benchmark is not an attempt to exactly recreate the environment of the Pavlo et al. benchmark. Instead, it draws on that benchmark for inspiration in the dataset and workload. The most notable differences are as follows:

We run on a public cloud instead of using dedicated hardware.
We require that the results be materialized to an output table. This is necessary because some queries in our version have results which do not fit in memory on one machine.
The dataset used for Query 4 is an actual web crawl rather than a synthetic one.
Query 4 uses a Python UDF instead of SQL/Java UDFs.
We create different permutations of queries 1-3. These permutations result in shorter or longer response times.
The dataset is generated using the newer Intel generator instead of the original C scripts. The newer tools are well supported and designed to output Hadoop datasets.

Did you consider comparing Vertica, Teradata, SAP Hana, MongoDB, Postgres, RAMCloud, SQLite, insert-dbms-or-query-engine-here... etc?

We've started with a small number of EC2-hosted query engines because our primary goal is producing verifiable results. Over time we'd like to grow the set of frameworks. We actively welcome contributions!

This workload doesn't represent queries I run -- how can I test these frameworks on my own workload?

We've tried to cover a set of fundamental operations in this benchmark, but of course, it may not correspond to your own workload. The prepare scripts provided with this benchmark will load sample data sets into each framework. From there, you are welcome to run your own types of queries against these tables. Because these are all easy to launch on EC2, you can also load your own datasets.

Do these queries take advantage of different Hadoop file formats or data-layout options, such as Hive/Impala/Shark partitions or Redshift sort columns?

For now, no. The idea is to test "out of the box" performance on these queries even if you haven't done a bunch of up-front work at the loading stage to optimize for specific access patterns. For this reason we have opted to use simple storage formats across Hive, Impala and Shark benchmarking.

That being said, it is important to note that the various platforms optimize different use cases. As it stands, only Redshift can take advantage of its columnar compression. However, the other platforms could see improved performance by utilizing a columnar storage format. Specifically, Impala is likely to benefit from the usage of the Parquet columnar file format.

We may relax these requirements in the future.

Why didn't you test Hive in memory?

We did, but the results were very hard to stabilize. The reason is that it is hard to coerce the entire input into the buffer cache because of the way Hive uses HDFS: each file in HDFS has three replicas, and Hive's underlying scheduler may choose to launch a task at any replica on a given run. As a result, you would need 3X the amount of buffer cache (which exceeds the capacity in these clusters) and/or precise control over which node runs a given task (which is not offered by the MapReduce scheduler).

Contributing a New Framework

We plan to run this benchmark regularly and may introduce additional workloads over time. We welcome the addition of new frameworks as well. The only requirement is that running the benchmark be reproducible and verifiable in similar fashion to those already included. The best place to start is by contacting Patrick Wendell from the U.C. Berkeley AMPLab.

Run This Benchmark Yourself

Since Redshift, Shark, Hive, and Impala all provide tools to easily provision a cluster on EC2, this benchmark can be easily replicated.

Hosted data sets

To allow this benchmark to be easily reproduced, we've prepared various sizes of the input dataset in S3. The scale factor is defined such that each node in a cluster of the given size will hold ~25GB of the UserVisits table, ~1GB of the Rankings table, and ~30GB of the web crawl, uncompressed. The datasets are encoded in TextFile and SequenceFile format along with corresponding compressed versions. They are available publicly at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix].

S3 Suffix	Scale Factor	`Rankings` (rows)	`Rankings` (bytes)	`UserVisits` (rows)	`UserVisits` (bytes)	`Documents` (bytes)
/tiny/	small	1200	77.6KB	10000	1.7MB	6.8MB
/1node/	1	18 Million	1.28GB	155 Million	25.4GB	29.0GB
/5nodes/	5	90 Million	6.38GB	775 Million	126.8GB	136.9GB
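
The scale-factor rule above amounts to simple multiplication. The sketch below uses the approximate per-node sizes quoted in the text; the names are illustrative and the results are estimates, so they will not match the table exactly.

```python
# Approximate uncompressed size per node, in GB, as stated above.
PER_NODE_GB = {"UserVisits": 25, "Rankings": 1, "Documents": 30}


def approx_dataset_gb(scale_factor):
    """Estimate the uncompressed size of each dataset for a given scale factor."""
    return {name: gb * scale_factor for name, gb in PER_NODE_GB.items()}


print(approx_dataset_gb(5))
```

For scale factor 5 this gives about 125 GB of UserVisits and 5 GB of Rankings, in line with the 126.8GB and 6.38GB listed for /5nodes/. The crawl estimate (about 150 GB) is rougher than the listed 136.9GB.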

Launching and Loading Clusters

Create an Impala, Redshift, Hive/Tez or Shark cluster using their provided provisioning tools.

Each cluster should be created in the US East EC2 Region.
For Redshift, use the Amazon AWS console. Make sure to whitelist the node you plan to run the benchmark from in the Redshift control panel.
For Impala, use the Cloudera Manager EC2 deployment instructions. Make sure to upload your own RSA key so that you can use the same key to log into the nodes and run queries.
- Note: In order to use Ext4 as the underlying file system additional steps must be taken on each host machine. See the Ext4 section below.
For Shark, use Spark/Shark EC2 launch scripts. These are available as part of the latest Spark distribution.
- Note: In order to use the same settings that were used in the benchmark, such as Ext4, you must make a modification to the Spark EC2 script. See the Ext4 section below.

    $> ec2/spark-ec2 -s 5 -k [KEY PAIR NAME] -i [IDENTITY FILE] --hadoop-major-version=2 -t "m2.4xlarge" launch [CLUSTER NAME]

NOTE: You must set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

For Hive and Tez, use the following instructions to launch a cluster

Using Ext4

Shark

Modify ec2/spark_ec2.py:

Change: ssh(master, opts, "rm -rf spark-ec2 && git clone https://github.com/mesos/spark-ec2.git -b v2")
To:     ssh(master, opts, "rm -rf spark-ec2 && git clone https://github.com/ahirreddy/spark-ec2.git -b ext4-update")

Impala

Run the following commands on each node provisioned by the Cloudera Manager. These commands must be issued after an instance is provisioned but before services are installed.

  # Reformat /dev/xvdb as Ext4 and remount it.
  # mount with only a device looks up the mount point in /etc/fstab.
  dev=/dev/xvdb
  sudo umount $dev
  sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 $dev
  sudo mount -o defaults,noatime,nodiratime $dev

  # Format /dev/xvdc as Ext4 and mount it at /data0.
  dev=/dev/xvdc
  sudo mkdir /data0
  sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 $dev
  sudo mount -o defaults,noatime,nodiratime $dev
  sudo mount -t ext4 -o defaults,noatime,nodiratime $dev /data0

Hive/Tez

By default, our HDP launch scripts format the underlying filesystem as Ext4; no additional steps are required.

Launching Hive and Tez Clusters

This command will launch and configure the specified number of slaves in addition to a Master and an Ambari host.

    $> AWS_ACCESS_KEY_ID=[AWS ID] AWS_SECRET_ACCESS_KEY=[AWS SECRET] \
      ./prepare-hdp.sh --slaves=N --key-pair=[INSTANCE KEYPAIR] \
      --identity-file=[SSH PRIVATE KEY] --instance-type=[INSTANCE TYPE] \
      launch [CLUSTER NAME]

Once complete, it will report both the internal and external hostnames of each node.

SSH into the Ambari node as root and run ambari-server start.
Visit port 8080 of the Ambari node and log in as admin to begin cluster setup.
When prompted to enter hosts, you must use the internal EC2 hostnames.
Install all services and take care to install all master services on the node designated as master by the setup script.
This installation should take 10-20 minutes. Load the benchmark data once it is complete.

To install Tez on this cluster, use the following command. It will remove the ability to use normal Hive.

    $> ./prepare-benchmark.sh --hive-tez --hive-host [MASTER REPORTED BY SETUP SCRIPT] \
      --hive-identity-file [SSH PRIVATE KEY]

Loading Benchmark Data

Scripts for preparing data are included in the benchmark GitHub repo. Use the provided prepare-benchmark.sh to load an appropriately sized dataset into the cluster.

./prepare-benchmark.sh --help

Here are a few examples showing the options used in this benchmark:

Redshift

    $> ./prepare-benchmark.sh \
      --redshift \
      --aws-key-id=[AWS KEY ID] \
      --aws-key=[AWS KEY] \
      --redshift-username=[USERNAME] \
      --redshift-password=[PASSWORD] \
      --redshift-host=[ODBC HOST] \
      --redshift-database=[DATABASE] \
      --scale-factor=5

    $> ./run-query.sh \
      --redshift \
      --redshift-username=[USERNAME] \
      --redshift-password=[PASSWORD] \
      --redshift-host=[ODBC HOST] \
      --redshift-database=[DATABASE] \
      --query-num=[QUERY NUM]

Shark

    $> ./prepare-benchmark.sh \
      --shark \
      --aws-key-id=[AWS KEY ID] \
      --aws-key=[AWS KEY] \
      --shark-host=[SHARK MASTER] \
      --shark-identity-file=[IDENTITY FILE] \
      --scale-factor=5 \
      --file-format=text-deflate

    $> ./run-query.sh \
      --shark \
      --shark-host=[SHARK MASTER] \
      --shark-identity-file=[IDENTITY FILE] \
      --query-num=[QUERY NUM]

Impala/Hive

    $> ./prepare-benchmark.sh \
      --impala \
      --aws-key-id=[AWS KEY ID] \
      --aws-key=[AWS KEY] \
      --impala-host=[NAME NODE] \
      --impala-identity-file=[IDENTITY FILE] \
      --scale-factor=5 \
      --file-format=sequence-snappy

    $> ./run-query.sh \
      --impala \
      --impala-hosts=[COMMA SEPARATED LIST OF IMPALA NODES] \
      --impala-identity-file=[IDENTITY FILE] \
      --query-num=[QUERY NUM]

Hive/Tez

    $> ./prepare-benchmark.sh \
      --hive \
      --hive-host [MASTER REPORTED BY SETUP SCRIPT] \
      --hive-slaves [COMMA SEPARATED LIST OF SLAVES] \
      --hive-identity-file [SSH PRIVATE KEY] \
      -d [AWS ID] \
      -k [AWS SECRET] \
      --file-format=sequence-snappy \
      --scale-factor=5

If you are adding a new framework or using this to produce your own scientific performance numbers, get in touch with us. The virtualized environment of EC2 makes eking out the best results a bit tricky. We can help.
