This four-day course gives software engineers a practical introduction to Big Data application development using popular projects from the Hadoop ecosystem and beyond.

NOT SCHEDULED. If you are interested, please contact us.
4-day training

Target audience
Software engineers

Technologies: e.g. Hadoop, Spark, Hive, Kafka, Sqoop, Avro, Parquet

Training overview

During the workshop, you’ll act as a Big Data engineer working for a fictional company called StreamRock™ that creates a music streaming application (similar to Spotify). The main goal of your work is to take advantage of Big Data technologies such as Hadoop, Spark, Hive, Kafka, Sqoop, Avro and Parquet to collect, store, clean and process various datasets about the users and the songs they listen to. You’ll process this data to get data-driven answers to many business questions and to power the product features that StreamRock™ builds. Every exercise will be executed on a remote multi-node Hadoop cluster.

The workshop is strongly focused on practical experience. The instructor will also share his own experience gained while working with Big Data technologies for several years.

This course is intended for software engineers who have at least basic knowledge of Python, Java and/or Scala and want to understand and develop distributed applications running on a Hadoop YARN cluster.

The training provides a carefully prepared mix of theory, exercises, demos, discussions, quizzes and fun! We make sure that each participant is highly engaged in hands-on exercises, discussions and teamwork exercises.

Course agenda*


Introduction to Big Data and Apache Hadoop

Description of the StreamRock company, along with the opportunities and challenges that Big Data technologies bring to it.

  • Hands-on exercise: Accessing a remote multi-node Hadoop cluster.

Introduction to HDFS

  • Hands-on exercise: Importing structured data into the cluster using HUE
  • Hands-on exercise: Interacting with HDFS using HDFS CLI, Snakebite and WebHDFS

Introduction to YARN

  • Hands-on exercise: Getting familiar with the YARN Web UI

A short overview of MapReduce

  • Hands-on exercise: Submitting an example ETL MapReduce job to the YARN cluster


Providing data-driven answers to business questions using an SQL-like solution

Introduction to Apache Hive

  • Hands-on exercise: Creating Hive databases and tables using HUE
  • Hands-on exercise: Ad-hoc analysis of structured data with HiveQL

Advanced aspects of Hive, e.g. partitioning, bucketing, strict mode, execution plans

  • Hands-on exercise: Hive partitioning

Extending Hive with custom UDFs and SerDes

  • Hands-on exercise: Using a custom Java UDF and a SerDe for JSON

Hadoop File Formats (Avro, Parquet, ORC)

  • Hands-on exercise: Interacting with Parquet and Avro in Hive


Implementing scalable ETL processes on the Hadoop cluster

Introduction to Apache Spark, Spark SQL, and Spark DataFrames

  • Hands-on exercise: Implementing an ETL job to clean and massage input data using Spark
  • Hands-on exercise: Implementing ad-hoc queries using Spark SQL and DataFrames
  • Hands-on exercise: Visualising the results of Spark queries using the Spark Notebook

Bonus: Overview of fast SQL-on-Hadoop solutions such as Hive, Spark SQL, Impala, Presto and Tez


ETL-related technologies

Introduction to Apache Sqoop

  • Hands-on exercise: Importing structured data from MySQL to HDFS and Hive using Sqoop

Real-time data collection with Apache Kafka

  • Hands-on exercise: Interacting with a Kafka cluster to produce and consume messages with CLI scripts
  • Hands-on exercise: Using the Kafka Java Producer with Avro Schema Registry

Introduction to Apache Oozie

  • Hands-on exercise: Building and executing an Oozie workflow
  • Hands-on exercise: Scheduling an Oozie workflow with the Oozie scheduler

* GetInData reserves the right to make any changes and adjustments to the presented agenda.


Our workshops and training programs are run by experienced instructors with many years of real-life Big Data experience. Get to know our team!

More information

The training lasts 4 days, from 9 am to 5 pm each day. There will be one lunch break and a few coffee breaks each day.

Contact Us!

Please contact us with any questions about our training courses, or if you would like to discuss a custom, on-site training course.

Piotr Krewski: +48 888 185 137
Klaudia Zdunczyk: +48 663 422 641


  • Hadoop Administrator Training, Allegro

    I highly value the substantive content of the course, as well as its great preparation and layout. Knowledge was passed on in an ordered, consistent and effective way. Participants' involvement during the workshop sessions is the best indicator of this positive training!

  • Big Data Workshop, Stepstone

    The Big Data workshops were led by real professionals, with tools and materials prepared in a way that allowed participants to get down to brass tacks straightaway without losing time. Attendees did not disturb each other, and everyone could work comfortably and effectively. One can notice the striking knowledge of the host and the fact that it comes from real professional work experience.

  • IE Business School

    This is an excellent course and an excellent teacher. Adam was well prepared, knew the subject material, was good at transmitting his knowledge to us and had prepared exercises that added a lot of value to the sessions. I would rank this six if I could.

  • Hadoop Developer Training, Confidential

    Professionally prepared and led courses. Coaches with vast experience in the presented realm.

  • IE Business School

    Outstanding professor, the course was very well planned, he is very knowledgeable about what he taught. He talked about real-world cases and managed to get the whole class interested for 6 hours straight. Definitely one of the best courses that we have had in the masters.

