Data Engineer Training

This four-day course gives software engineers a practical introduction to Big Data application development using popular projects from the Hadoop ecosystem and beyond.

During the workshop you’ll act as a Big Data engineer working for a fictional company called StreamRock™ that builds a music streaming application (similar to Spotify). The main goal of your work is to take advantage of Big Data technologies such as Hadoop, Spark, Hive, Kafka, Sqoop, Avro and Parquet to collect, store, clean and process various datasets about the users and the songs they listen to. You’ll process this data to get data-driven answers to many business questions and to power product features that StreamRock™ builds. Every exercise will be executed on a remote multi-node Hadoop cluster.
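
To give a feel for the kind of business question mentioned above, here is a minimal sketch in plain Python that finds the most played songs in a tiny in-memory sample. The event layout, the sample data and the `top_songs` helper are assumptions made purely for illustration; they are not taken from the course materials, where such questions are answered with Hive and Spark on the cluster.

```python
from collections import Counter

# Hypothetical listening events: (user_id, song_title).
# The layout and the values are made up for this illustration.
plays = [
    ("u1", "Song A"),
    ("u2", "Song B"),
    ("u1", "Song A"),
    ("u3", "Song C"),
    ("u2", "Song A"),
]


def top_songs(events, n=2):
    """Return the n most played songs as (title, play_count) pairs."""
    counts = Counter(title for _user, title in events)
    return counts.most_common(n)


top = top_songs(plays)
print(top)
```

On a real cluster the same aggregation would typically be expressed as a GROUP BY query over a much larger table of listening events.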

The workshop is strongly focused on hands-on practice. The instructor will also share practical experience gained over several years of working with Big Data technologies.



You can register for the upcoming training on our website.

The next Hadoop Developer Training will take place in Warsaw from 18 to 21 April 2017. The cost of the training is 5500 PLN per person plus tax. The workshop will be conducted in Polish. Before you register, please read the Terms & Conditions of our trainings carefully.

Target Audience

Software engineers who have (at least) basic knowledge of Python, Java and/or Scala and want to understand and develop distributed applications running on a Hadoop YARN cluster.

Training Agenda*

Day 1 – Introduction to Big Data and Apache Hadoop

  • Description of the StreamRock company, along with the opportunities and challenges that Big Data technologies bring to it.
    • Hands-on exercise: Accessing a remote multi-node Hadoop cluster.
  • Introduction to HDFS
    • Hands-on exercise: Importing structured data into the cluster using HUE
    • Hands-on exercise: Interacting with HDFS using HDFS CLI, Snakebite and WebHDFS
  • Introduction to YARN
    • Hands-on exercise: Getting familiar with the YARN Web UI
  • Short overview of MapReduce
    • Hands-on exercise: Submitting an example ETL MapReduce job to the YARN cluster

Day 2 – Providing data-driven answers to business questions using a SQL-like solution

  • Introduction to Apache Hive
  • Hands-on exercise: Creating Hive databases and tables using HUE
  • Hands-on exercise: Ad-hoc analysis of structured data with HiveQL
  • Advanced aspects of Hive, e.g. partitioning, bucketing, strict mode and execution plans
  • Hands-on exercise: Hive partitioning
  • Extending Hive with custom UDFs and SerDes
  • Hands-on exercise: Using a custom Java UDF and a SerDe for JSON
  • Hadoop File Formats (Avro, Parquet, ORC)
  • Hands-on exercise: Interacting with Parquet and Avro in Hive

Day 3 – Implementing scalable ETL processes on the Hadoop cluster

  • Introduction to Apache Spark, Spark SQL and Spark DataFrames
  • Hands-on exercise: Implementing an ETL job to clean and massage input data using Spark
  • Hands-on exercise: Implementing ad-hoc queries using Spark SQL and DataFrames
  • Hands-on exercise: Visualising the results of Spark queries using the Spark Notebook
  • Bonus: Overview of “Fast-SQL on Hadoop” solutions like Hive, Spark SQL, Impala, Presto and Tez

Day 4 – ETL-related technologies

  • Introduction to Apache Sqoop
  • Hands-on exercise: Importing structured data from MySQL to HDFS and Hive using Sqoop
  • Real-time data collection with Apache Kafka
  • Hands-on exercise: Interacting with a Kafka Cluster to produce and consume messages with CLI scripts
  • Hands-on exercise: Using the Kafka Java producer with the Avro Schema Registry
  • Introduction to Apache Oozie
  • Hands-on exercise: Building and executing an Oozie workflow
  • Hands-on exercise: Scheduling an Oozie workflow with the Oozie scheduler

* GetInData reserves the right to make any changes and adjustments to the presented agenda.

Sample Section

See our slides about HCatalog, which come from this training!

Our Approach

The training provides a carefully prepared mix of theory, exercises, demos, discussions, quizzes and … fun! We make sure that each participant is highly engaged in hands-on exercises, discussions and teamwork exercises.


The training takes four days, but it can be split into two separate two-day sessions.

More Information

Please contact us with any questions about our training courses, or if you would like to discuss a custom, on-site training course.