Sunday, June 23, 2013

The HDP Sandbox is a Great Way to Start Learning Hadoop

Use the HDP Sandbox to Develop Your Hadoop Admin and Development Skills
Unless you have your own Hadoop cluster to play with, I strongly recommend you get the HDP Sandbox up and running on your laptop.  What's nice about the HDP Sandbox is that it is 100% open source.  The features and frameworks are free; you're not learning from some vendor's proprietary Hadoop version with features they will charge you for.  With the Sandbox and HDP you are learning Hadoop from a true open source perspective.

The Sandbox contains:
  • A fully functional Hadoop cluster running Ambari to play with.  You can run examples and sample code. Being able to use the HDP Sandbox is a great way to get hands-on practice as you are learning.
  • Your choice of Type 2 hypervisor (VMware, VirtualBox or Hyper-V) to run the Sandbox on.
  • Hadoop running on CentOS 6.4 with Java 1.6.0_24 (in the VMware VM).
  • MySQL and Postgres database servers for the Hadoop cluster.
  • The ability to log in as root in the CentOS OS and have command-line access to your Hadoop cluster.
  • Ambari, the management and monitoring tool for Apache Hadoop.
  • Hue is included in the HDP Sandbox.  Hue is a GUI containing:
    • Query editors for Hive, Pig and HCatalog
    • File Browser for HDFS
    • Job Designer/Browser for MapReduce
    • Oozie editor/dashboard
    • Pig, HBase and Bash shells
    • A collection of Hadoop APIs.
With the Hadoop Sandbox you can:
  • Point and click and run through the tutorials and videos.  Hit the Update button to get the latest tutorials.
  • Use Ambari to manage and monitor your Hadoop cluster.
  • Use the Linux bash shell to log into Centos as root and get command line access to your Hadoop environment.
    • Run a jps command and see all the master servers, data nodes and HBase processes running in your Hadoop cluster.  
    • At the Linux prompt get access to your configuration files and administration scripts. 
  • Use the Hue GUI to run Pig, Hive, and HCatalog commands.
  • Download tools like Datameer and Talend and access your Hadoop cluster from popular tools in the ecosystem.
  • Download data from the Internet and practice data ingestion into your Hadoop cluster.
  • Use Sqoop and the MySQL database that is running to practice moving data between a relational database and a Hadoop cluster (see the sketch after this list).  (Reminder: this MySQL database holds metadata for your Hadoop cluster, so be careful playing with it. In real life you would not use a metadata database to play with; you'd create a separate MySQL database server.)
  • If using VMware Fusion you can create snapshots of your VM, so you can always roll back.
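For the data-ingestion and Sqoop bullets above, here is a minimal sketch from the Sandbox command line. The file name, database name, and table name are hypothetical placeholders, not things that ship with the Sandbox; point Sqoop at whatever MySQL database and table you actually create to practice with:
# hadoop fs -mkdir /user/root/practice
# hadoop fs -put mydata.csv /user/root/practice/
# hadoop fs -ls /user/root/practice
# sqoop import --connect jdbc:mysql://localhost/mydb --username root -P --table mytable --target-dir /user/root/mytable -m 1
The -P flag prompts for the MySQL password, and -m 1 runs the import with a single map task, which is plenty for Sandbox-sized experiments.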
Downloading the HDP Sandbox and Working with an OVA File

The number one gotcha when installing the HDP Sandbox on a laptop is that virtualization is not turned on in the BIOS.  If you have problems, this is the first thing to check.

I chose the VMware VM, which downloads the Hortonworks+Sandbox+1.3+VMware+RC6.ova file. An OVA (open virtual appliance) is a single-file distribution of an OVF stored in the TAR format.  An OVF (Open Virtualization Format) package is a portable package created to standardize the deployment of a virtual appliance.  An OVF package structure has a number of files: a descriptor file, optional manifest and certificate files, optional disk images, and optional resource files (e.g. ISOs). The optional disk image files can be VMware vmdks, or any other supported disk image format.
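Because an OVA is just a TAR archive of the OVF package, you can list its contents from a terminal on your laptop before importing it (the file names inside the archive will vary by release):
$ tar -tvf Hortonworks+Sandbox+1.3+VMware+RC6.ova
Typically you'll see the .ovf descriptor, a manifest, and one or more .vmdk disk images.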

VMware Fusion converts the virtual machine from OVF format to VMware runtime (.vmx) format.
I went to the VMware Fusion menu bar and selected File - Import and imported the OVA file. Fusion performs OVF specification conformance and virtual hardware compliance checks.  Once complete you can start the VM.

When you start the VM, you may be asked to upgrade the VM; I chose yes. You'll then be prompted to initiate your Hortonworks Sandbox session, and to open a browser and enter a URL address like:
http://172.16.168.128. This will take you to a registration page.  When you finish registration it brings up the Sandbox.
  • Instructions are provided for how to start Ambari (the management tool), how to log in to the VM as root, and how to set up your hosts file (see the example after this list).
  • Instructions are provided on how to get your cursor back from the VM.
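As an example of the hosts-file step, you can map the VM's IP address to a friendly name on your laptop; the IP and host name below are placeholders, so use whatever your Sandbox start screen actually shows:
$ sudo sh -c 'echo "172.16.168.128  sandbox.hortonworks.com" >> /etc/hosts'
$ ping -c 1 sandbox.hortonworks.com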
In summary, you download the Sandbox VM file, import it, start the VM, and the instructions will lead you down the Hadoop yellow brick road.  When you start the VM, the initial screen will show you the URL for bringing up the management interface and also how to log in as root in a terminal window.

Accessing the Ambari Management Interface:
  • The browser URL was http://172.16.168.128 (yours may be different) to get to Videos, Tutorials, Sandbox and Ambari setup instructions.
  • Running on Mac OS X, hit Ctrl-Alt-F5 to get a root terminal window. Log in as root/hadoop.
  • Make sure you know how to get out of the VM window.  On Mac it is Ctrl-Alt-F5.
  • Get access to the Ambari interface on port 8080, e.g. http://172.16.168.128:8080.
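Once Ambari is reachable on port 8080, you can also check it from your laptop with a quick REST call; this sketch assumes the default admin/admin Ambari login and the IP address shown above:
$ curl -s -u admin:admin http://172.16.168.128:8080/api/v1/clusters
If Ambari is up, the call returns a small JSON document listing the cluster(s) it manages.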


Getting Started with the HDP Sandbox
Start with the following steps:
  • Get Ambari up and running.  Follow all the instructions.
  • Bring up Hue.  Look at all the interfaces and shells you have access to.
  • Log in as root using a terminal interface (or over ssh; see the example after this list). In Sandbox 1.3 the service accounts are root/hadoop for the superuser and hue/hadoop for an ordinary user.
  • Watch the videos.
  • Run through the tutorials. 
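If you prefer ssh to the VM console, you can log in to the Sandbox from a terminal on your laptop, assuming sshd is enabled in the VM; it uses the root/hadoop account and the IP address from your Sandbox start screen:
$ ssh root@172.16.168.128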
Here is the Sandbox welcome screen.  You are now walking into the light of Big Data and Hadoop.  :) 


A few commands to get you familiar with the Sandbox environment:
# java -version                 # which Java the Sandbox is running
# ifconfig                      # the VM's IP address
# uname -a                      # kernel and architecture details
# tail /etc/redhat-release      # the CentOS release
# ps -ef | grep mysqld          # confirm the MySQL server is running
# ps -ef | grep postgres        # confirm the Postgres server is running
# PATH=$PATH:$JAVA_HOME/bin     # put the JDK tools (including jps) on the PATH
# jps                           # list the running Hadoop JVM processes

You can run a jps command and see key Hadoop processes running, such as the NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker, HMaster, RegionServer and AmbariServer.


If you cd to the /etc/hadoop/conf directory, you can see the Hadoop configuration files.  Hint: core-site.xml, mapred-site.xml and hdfs-site.xml are good files to learn for the HDP admin certification test.  :)  
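For example, you can grep these files for their key properties; the property names below are the Hadoop 1.x names that HDP 1.3 uses (they are renamed in Hadoop 2.x):
# cd /etc/hadoop/conf
# grep -A 1 "fs.default.name" core-site.xml        # the HDFS NameNode URI
# grep -A 1 "mapred.job.tracker" mapred-site.xml   # the JobTracker address
# grep -A 1 "dfs.replication" hdfs-site.xml        # the HDFS block replication factor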


If you cd to the /usr/lib/hadoop/bin directory, you can see a number of the Hadoop admin scripts.
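Beyond the scripts themselves, a couple of standard Hadoop 1.x admin commands are worth trying; this sketch assumes the hdfs user is the HDFS superuser, which is the HDP default:
# su - hdfs -c "hadoop dfsadmin -report"   # cluster capacity and DataNode status
# su - hdfs -c "hadoop fsck /"             # HDFS file-system health check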




Most importantly, Have FUN!  :)

