RM BI Forum 2014 Notes – Cloudera Hadoop Masterclass


The Rittman Mead BI Forum started off with a one-day Hadoop Masterclass, delivered by Lars George. As he had messaged us the day before, we would learn what Hadoop is all about, what its major components are, and how to acquire, process and provide data as part of a production data processing pipeline. To that end, Lars advised that it would be useful to have an environment handy so we could follow along with the examples in the course and experiment at our convenience during and after the class. He directed us to the following link: the Cloudera Quickstart VM.

Lars recommends the following: “Select the CDH5 version of the VM. Please select a virtual machine image matching your VM platform of choice. If you do not have a VM host application installed yet, you can choose from a few available ones. VirtualBox is provided by Oracle and a great choice to use. It can be downloaded here. Set up the VM application, then download and start the Cloudera Quickstart VM to run on top of it. It is as easy as that.”
Find below a few notes I took during the Masterclass.
Lars divided the Masterclass into four parts.

I – Introduction to Hadoop

  • What is Big Data? – It’s not just about volume, but also about format and speed: the three V’s – Volume, Variety and Velocity
  • Introduction to Hadoop
  • HDFS
  • MapReduce
  • YARN
  • Cluster Planning

Hadoop is Open Source and Apache licensed — http://hadoop.apache.org
Many developers and contributors (Cloudera, Apple and others)
Many related projects, applications, tools
Hadoop is not a single system but a set of tools and projects that work together. For each part of the architecture you have to decide which tool to use and how to use it.

Hadoop – where to get it?
Load, process and analyze data
Hadoop concept – distribute the data across the system
Process the data where it resides
No network processing – move the computation to the data rather than the data over the network
High-level code (Java)
No communication between nodes – tasks run independently
Data is stored on the different machines in advance
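
To get a feel for the “data stored in advance” idea, the simplest exercise on the Quickstart VM is to copy a local file into HDFS and list it back. A minimal sketch, driven from Python; the hdfs command-line client is assumed to be on the PATH, and the /user/cloudera paths and sample.txt file are just examples:

  import subprocess

  def hdfs(*args):
      # Thin wrapper around the "hdfs dfs" command-line client.
      subprocess.check_call(["hdfs", "dfs"] + list(args))

  hdfs("-mkdir", "-p", "/user/cloudera/input")         # create a directory in HDFS
  hdfs("-put", "sample.txt", "/user/cloudera/input/")  # copy a local file into the cluster
  hdfs("-ls", "/user/cloudera/input")                  # list what we just loaded

HDFS then takes care of splitting the file into blocks and replicating them across the data nodes.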

Map Reduce Data Flow
  • Map
  • Sort and Shuffle
  • Reduce
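
To make that flow concrete, here is a minimal word-count sketch for Hadoop Streaming in Python (my own illustration, not material from the class): the mapper emits (word, 1) pairs, the framework sorts and shuffles them by key, and the reducer sums the counts per word.

  # mapper.py – reads raw lines from stdin, emits "word<TAB>1" for every word
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%d" % (word, 1))

  # reducer.py – its input arrives sorted by key (the shuffle), so all counts
  # for a given word are contiguous and can be summed in a single pass
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print("%s\t%d" % (current_word, current_count))

Both scripts would be submitted with the hadoop-streaming jar (hadoop jar hadoop-streaming*.jar -input … -output … -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py); the exact jar location depends on the distribution.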

II – Ingress and Egress

Ingress – moving data into Hadoop (HDFS)
Flume  (Near Real-Time Pipeline)
  • Source
  • (File) Channel
  • Sink —> polls the channel, collects events and writes them to e.g. HDFS
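
For illustration, these three pieces are wired together in a plain properties file. A minimal sketch of an agent definition – the directory and HDFS path are hypothetical; this is my addition, not a config shown in the class:

  # flume-agent.properties (hypothetical example)
  agent.sources  = src1
  agent.channels = ch1
  agent.sinks    = sink1

  # source: pick up files dropped into a spooling directory
  agent.sources.src1.type     = spooldir
  agent.sources.src1.spoolDir = /var/log/incoming
  agent.sources.src1.channels = ch1

  # channel: durable file channel between source and sink
  agent.channels.ch1.type          = file
  agent.channels.ch1.checkpointDir = /var/flume/checkpoint
  agent.channels.ch1.dataDirs      = /var/flume/data

  # sink: poll the channel and write the events to HDFS
  agent.sinks.sink1.type      = hdfs
  agent.sinks.sink1.hdfs.path = hdfs:///flume/events
  agent.sinks.sink1.channel   = ch1
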
Sqoop – transfers data between relational databases (Oracle, Teradata, SQL Server, etc.) and HDFS
Oracle Database Driver for Sqoop – OraOop by Quest 
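
As a rough sketch of what a Sqoop import looks like (the JDBC URL, credentials, table and target directory are made up for illustration), driven here from Python, although in practice you would simply run the command from the shell:

  import subprocess

  # Hypothetical connection details – replace with your own database and table.
  subprocess.check_call([
      "sqoop", "import",
      "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",
      "--username", "scott",
      "--password-file", "/user/cloudera/.db_password",  # file containing the DB password
      "--table", "SALES",
      "--target-dir", "/user/cloudera/sales",             # destination directory in HDFS
      "--num-mappers", "4",                               # parallel import tasks
  ])
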
File formats are important to keep in mind when you want to get the data out again.
Simple File versus Container (Structured) File
Parquet vs. Google Dremel – Parquet implements the columnar storage model described in Google’s Dremel paper
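
The practical difference between a simple text file and a container format is easy to see from Python; a small sketch using the pyarrow package (my addition, the table is a toy example):

  import pyarrow as pa
  import pyarrow.parquet as pq

  # In Parquet the data is laid out column by column, with the schema and
  # statistics embedded in the file – unlike a plain delimited text file.
  table = pa.table({
      "product": ["widget", "gadget", "widget"],
      "amount":  [10, 25, 7],
  })

  pq.write_table(table, "sales.parquet")        # write the container file
  print(pq.read_table("sales.parquet").schema)  # the schema travels with the data
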
BI Integration
  • Sqoop
  • HDFS Connector
  • ODBC/JDBC
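
BI tools usually attach over ODBC/JDBC; from a script the same SQL interface can be reached with, for example, the impyla package. A hypothetical sketch (host name and table are placeholders, 21050 is Impala’s default port), assuming Impala is running on the Quickstart VM:

  from impala.dbapi import connect

  # Placeholder connection details.
  conn = connect(host="quickstart.cloudera", port=21050)
  cur = conn.cursor()
  cur.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
  for row in cur.fetchall():
      print(row)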

III – NoSQL and Hadoop

ACID (atomicity, consistency, isolation, durability) 
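
The contrast being drawn here is with the full ACID guarantees of a relational database: a NoSQL store in the Hadoop world, such as HBase, trades most of that away and only guarantees atomicity at the level of a single row. A hypothetical sketch with the happybase Python client (the table and column names are made up, and the table is assumed to exist with a column family “cf”), talking to HBase’s Thrift server:

  import happybase

  connection = happybase.Connection("quickstart.cloudera")  # Thrift server, default port 9090
  table = connection.table("web_sessions")

  # A single-row put is atomic; there are no multi-row transactions.
  table.put(b"user42#2014-05-08", {b"cf:page": b"/home", b"cf:duration": b"37"})

  print(table.row(b"user42#2014-05-08"))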

IV – Analyzing Big Data

  • Pig
  • Hive  (HiveServer2 instead of HiveServer1)
  • Impala
  • Search – Lucene
  • Data Pipelines (micro –  macro)
  • Oozie (Workflow Server)
  • Information Architecture – Where / How to store data and how to secure this structure
  • Spark (APIs in Java, Python and Scala) – see the sketch after this list
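
Since Spark closed the list, here is the same word count as in the MapReduce sketch earlier, expressed with the Python (PySpark) API – a minimal illustration meant to be run with spark-submit, with a hypothetical HDFS input path:

  from pyspark import SparkContext

  sc = SparkContext(appName="wordcount-sketch")

  counts = (sc.textFile("hdfs:///user/cloudera/input/sample.txt")  # hypothetical path
              .flatMap(lambda line: line.split())    # split lines into words
              .map(lambda word: (word, 1))           # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))      # sum the counts per word

  for word, count in counts.collect():
      print(word, count)

  sc.stop()
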
I think Lars could have talked about Hadoop for two more days (with or without slides). Hadoop is all about making choices: there are many similar tools, projects and concepts, and it all depends on what you want to achieve.
Although this Masterclass was very informative, I still struggle to see the use case at this moment. A lot of my customers are still struggling with their ‘normal’ data…