The Rittman Mead BI Forum started off with a one-day Hadoop Masterclass, provided by Lars George. As he had messaged us the day before, we would learn what Hadoop is all about, what its major components are, and how to acquire, process and provide data as part of a production data processing pipeline. To that end, Lars advised that it would be useful to follow along with the examples in the course and have an environment handy, so we could experiment at our convenience during and after the class. He directed us to the following link: the Cloudera Quickstart VM.
Lars recommends the following: “Select the CDH5 version of the VM. Please select a virtual machine image matching your VM platform of choice. If you do not have a VM host application installed yet, you can choose from a few available ones. VirtualBox is provided by Oracle and a great choice to use. It can be downloaded here. Set up the VM application, then download and start the Cloudera Quickstart VM to run on top of it. It is as easy as that.”
Find below a few notes I took during the Masterclass.
Lars divided the Masterclass into four parts.
I – Introduction into Hadoop
- What is Big Data? – It’s not only about volume, but also about format and speed. Three V’s – Volume, Variety and Velocity
- Introduction to Hadoop
- Cluster Planning
Many developers and contributors, e.g. Cloudera and Apple
Many related projects, applications, tools
Hadoop is not a system but a set of tools, projects which work together. You should decide, for each part of the architecture, which tool you should use and how you would use it.
Hadoop – where to get it?
Load, Process and Analyze data
Hadoop Concept – distribute data in the system.
Process the data where it resides
No shipping of data over the network for processing; the computation moves to the data
High-level code (Java)
No communication between nodes
Data stored on different machines in advance
Map Reduce Data Flow
- Sort and Shuffle
II – Ingress and Egress
Ingress – moving data into Hadoop (HDFS)
(Near Real-Time Pipeline)
- (File) Channel
- Sink —> polls, collects and writes to e.g. HDFS
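The channel/sink notes above describe Apache Flume, the usual near real-time ingestion tool in CDH. A Flume agent wiring a source, a file channel and an HDFS sink together might look roughly like this (the agent name, log path and HDFS path are invented for illustration):

```properties
# Hypothetical Flume agent: tail an application log into HDFS
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: follow a log file (an exec source is one option)
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

# File channel: buffers events durably on local disk
agent1.channels.ch1.type = file

# Sink: polls the channel and writes batches of events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.sink1.channel = ch1
```

The file channel is what makes the pipeline resilient: events survive an agent restart because they are buffered on disk between source and sink.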
Transfer data between relational databases (Oracle, Teradata, SQL Server, etc.) and HDFS
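The tool for this RDBMS-to-HDFS transfer is typically Apache Sqoop. A sketch of an import might look like the following – the connection string, table name and target directory are made up for illustration:

```shell
# Hypothetical Sqoop import: copy an Oracle table into HDFS
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username scott --password-file /user/scott/.password \
  --table CUSTOMERS \
  --target-dir /data/customers \
  --num-mappers 4
```

Under the hood Sqoop generates a MapReduce job, so the import itself is parallelised across the cluster (here across four mappers).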
File formats are important to keep in mind when you want to get the data out again.
Simple File versus Container (Structured) File
- HDFS Connector
III – NoSQL and Hadoop
ACID (atomicity, consistency, isolation, durability)
IV – Analyzing Big Data
I think Lars could have talked about Hadoop for two more days (with or without slides). Hadoop is all about making choices: there are many similar tools, projects and concepts, and it all depends on what you want to achieve.
Although this Masterclass was very informative, I still struggle to see the use case at this moment. A lot of my customers are still struggling with their ‘normal’ data…