
Big Data and Hadoop for Developers – Level 1


Duration: 2 Days


Gartner predicts that 4.4 million jobs will be created globally to support BigData. BigData is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. It is imperative that organizations and IT leaders focus on the ever-increasing volume, variety and velocity of information that forms BigData. Hadoop is the core platform for structuring BigData, and solves the problem of making it useful for analytics. This course teaches you how to use Hadoop for BigData analysis and gives you a clear understanding of processing BigData with Hadoop.

Why learn about Processing BigData with Hadoop?

  • Businesses are now aware of the large volumes of data that they generate in their day-to-day transactions. They have also rea...


  • What Hadoop is and how it can help process large data sets.
  • How to write MapReduce programs using the Hadoop API.
  • How to use HDFS (the Hadoop Distributed File System), from the command line and the API, for effectively loading and processing data in Hadoop.
  • How to ingest data from an RDBMS or a data warehouse into Hadoop.
  • Best practices for building, debugging and optimizing Hadoop solutions.
  • An introduction to tools like Pig, Hive, HBase, Elastic MapReduce etc. and how they can help in BigData projects.

Who should attend

  • A developer who wants to learn Hadoop but doesn’t know where to start
  • A team that is struggling to extract insights from large-scale and fast-growing data in traditional systems
  • A team that has decided to migrate from an RDBMS or a traditional data warehouse to Hadoop, but needs help getting started


Course Outline

Day 1 and 2


  • Big Data
    • What is Big Data?
    • Trends across industries.
    • Opportunities to disrupt business models across industries.
    • Industry specific Use Cases.
    • Some brief Case Studies.
  • Data Science
    • An emerging new discipline.
    • Skills required to be a Data Scientist.
  • Hadoop
    • What is Hadoop?
    • Why do we need a new tool? / Motivations for Hadoop
    • A comparison with traditional databases (RDBMS) and data warehouses.
    • Data Hub/Lake/Reservoir: The role of Hadoop in a modern data architecture.
    • Apache Hadoop
    • Distributions including Hadoop: Cloudera, Hortonworks, MapR, IBM, Pivotal and Intel.
    • An overview of a typical Hadoop cluster.
    • Hadoop Deployment
      • Commodity Hardware
      • Hadoop Appliances
      • Hadoop on the Cloud
      • Hadoop as a Service

      Lab: Install and configure a multi-node Hadoop cluster with Ambari

Data Storage

  • File System Abstraction
  • Big Data and Distributed File Systems
  • Hadoop Distributed File System (HDFS)
    • HDFS Architecture
      • Architectural assumptions and goals
      • How data is stored in HDFS
      • How data is read from HDFS
      • Namenodes and Datanodes
      • Blocks
      • Data Replication
      • Fault Tolerance
      • Data Integrity
      • Namespaces
      • Federation in Hadoop 2.0
      • High Availability in Hadoop 2.0
      • Security and Encryption
    • HDFS Interfaces: FileSystem API, FS Shell, WebHDFS, FUSE etc.
      Lab: Manipulating files in HDFS using hadoop fs commands.
      Lab: Manipulating files in HDFS programmatically using the FileSystem API.
  • Alternative Hadoop File Systems: IBM GPFS, MapR-FS, Lustre, Amazon S3 etc.
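
To make the Blocks and Data Replication topics above concrete, here is a minimal Python sketch of the arithmetic involved. It assumes the common Hadoop 2 defaults of a 128 MB block size and a replication factor of 3; both are configurable per cluster. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: Hadoop 2 default block size (configurable)
REPLICATION = 3      # assumption: default replication factor (configurable)

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    The last block only occupies as much disk as the data it holds,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb

# A 1000 MB file under the assumed defaults.
blocks, replicas, raw_storage_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_storage_mb)
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas spread across the Datanodes, and consumes 3000 MB of raw disk.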

Data Processing

  • MapReduce
    • The fundamentals: map() and reduce()
    • Data Locality
    • Architecture of the MapReduce framework.
    • Phases of a MapReduce Job
      Lab: Write a simple log analysis MapReduce application
    • Job Execution
    • Partitioners
    • Combiners
    • The flow of <key, value> pairs in a MapReduce Job
      Lab: Write an Inverted Index MapReduce Application with custom Partitioner and Combiner
    • Custom types and Composite Keys
    • Custom Comparators
    • InputFormats and OutputFormats
    • Distributed Cache
    • MapReduce Design Patterns
    • Sorting
    • Joins
    • Streaming Job: Writing MapReduce programs in languages other than Java
      Lab: Writing a streaming MapReduce job in Python
  • YARN and Hadoop 2.0
    • Separating resource management and processing
    • YARN Applications: MapReduce, Tez, HBase, Storm, Spark, Giraph etc.
    • YARN Architecture
      • ResourceManager
      • NodeManagers
      • ApplicationMasters
      • Containers
      • Fault Tolerance
    • Tez: Accelerating processing of data stored in HDFS
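
The flow of <key, value> pairs through map, combine, partition, shuffle and reduce can be sketched in plain Python. This is only an in-memory illustration of the ideas behind the Inverted Index lab, not Hadoop code: all function names are hypothetical, and the partitioner deliberately routes keys by first character rather than by hash, as one example of a custom partitioning rule.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in the document."""
    for word in text.lower().split():
        yield word, doc_id

def combine_fn(pairs):
    """Combiner: drop duplicate (word, doc_id) pairs on the map side."""
    return sorted(set(pairs))

def partition_fn(key, num_reducers):
    """Custom partitioner: route a key to a reducer by its first character."""
    return ord(key[0]) % num_reducers

def reduce_fn(word, doc_ids):
    """Reduce: collect the sorted, de-duplicated document ids for a word."""
    return word, sorted(set(doc_ids))

def run_job(documents, num_reducers=2):
    # Shuffle target: one {key: [values]} bucket per simulated reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in documents.items():
        for word, value in combine_fn(map_fn(doc_id, text)):
            buckets[partition_fn(word, num_reducers)][word].append(value)
    index = {}
    for bucket in buckets:
        for word in sorted(bucket):  # each reducer sees its keys in sorted order
            key, values = reduce_fn(word, bucket[word])
            index[key] = values
    return index

docs = {"d1": "hadoop stores data", "d2": "hadoop processes data data"}
index = run_job(docs)
print(index["data"])
```

The combiner removes the repeated "data" pair from d2 before the shuffle, which is the kind of map-side reduction in transferred pairs that Combiners are meant to provide.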

Data Integration

  • Integrating Hadoop into your existing enterprise.
  • Introduction to Sqoop
    Lab: Importing data from an RDBMS to HDFS using Sqoop
    Lab: Exporting data from HDFS to an RDBMS
  • Other data integration tools: Flume, Kafka, Informatica, Talend etc.

Higher Level Tools

  • Defining workflows with Oozie
  • An introduction to Hive
    • Architecture
    • Interfaces: Hive Shell, Thrift, JDBC, ODBC etc.
    • HiveQL: A dialect of SQL
    • Data Types and File Formats
    • Creating Tables and Loading Data
    • Schema on Read
    • Querying Data
    • User Defined Functions
  • An introduction to Pig
    • Grunt Shell
    • Pig’s Data Model
    • Pig Latin
    • User Defined Functions
  • An introduction to HBase
    • Architecture
    • Client API
    • MapReduce Integration
    • Schema Design

About The Trainer

Dr. Yash Mody
Hadoop, Big Data Solution Specialist, Adobe AEM Architect

Dr. Yash Mody, PhD, has developed and architected several enterprise applications using platforms like Hadoop, Oracle ADF, Salesforce, IBM WebSphere, Quartz, SAP, Adobe AEM, Adobe LiveCycle, Apache Flex, TIBCO etc. At CloudThat, he works with our customers like PwC, Fidelity, Western Union, GE, HP, Oracle, Mahindra Bristlecone, Flipkart, Aditi, Sonata etc. to help them understand various big data technologies and design solutions for modern use cases like social media analytics, web analytics etc.

Over the years, Yash has trained over 1500 developers and architects from over 30 organisations.


Other Details



For the latest batch dates, fees, location, technical queries and general inquiries, contact our sales team at +91 8880002200.
