Hadoop Developer Training
This four-day course gives software engineers a practical introduction to Big Data application development using popular projects from the Hadoop ecosystem and beyond.
Training outcome
Participants will gain a detailed understanding of the architecture and role of the most important technologies from the Hadoop ecosystem. They will be able to independently load and transform huge datasets with the help of technologies like Hive, Spark, Sqoop, Kafka and Oozie.
Course agenda*
Day 1
Introduction to Big Data and Apache Hadoop
Description of StreamRock, along with the opportunities and challenges that Big Data technologies bring to it.
- Hands-on exercise: Accessing a remote multi-node Hadoop cluster.
Introduction to HDFS
- Hands-on exercise: Importing structured data into the cluster using HUE
- Hands-on exercise: Interacting with HDFS using HDFS CLI, Snakebite and WebHDFS
Introduction to YARN
- Hands-on exercise: Familiarisation with YARN Web UI
A short overview of MapReduce
- Hands-on exercise: Submitting an example ETL MapReduce job to a YARN cluster
Day 2
Providing data-driven answers to business questions using an SQL-like solution
Introduction to Apache Hive
- Hands-on exercise: Creating Hive databases and tables using HUE
- Hands-on exercise: Ad-hoc analysis of structured data with HiveQL
Advanced aspects of Hive, e.g. partitioning, bucketing, strict mode and execution plans
- Hands-on exercise: Hive partitioning
Extending Hive with custom UDFs and SerDes
- Hands-on exercise: Using a custom Java UDF and a SerDe for JSON
Hadoop File Formats (Avro, Parquet, ORC)
- Hands-on exercise: Interacting with Parquet and Avro in Hive
Day 3
Implementing scalable ETL processes on the Hadoop cluster
Introduction to Apache Spark, Spark SQL, and Spark DataFrames
- Hands-on exercise: Implementing an ETL job to clean and massage input data using Spark
- Hands-on exercise: Implementing ad-hoc queries using Spark SQL and DataFrames
- Hands-on exercise: Visualisation of the results of Spark queries using the Spark Notebook
Bonus: Overview of Fast-SQL on Hadoop – solutions like Hive, Spark SQL, Impala, Presto and Tez
Day 4
ETL-related technologies
Introduction to Apache Sqoop
- Hands-on exercise: Importing structured data from MySQL to HDFS and Hive using Sqoop
Real-time data collection with Apache Kafka
- Hands-on exercise: Interacting with a Kafka Cluster to produce and consume messages with CLI scripts
- Hands-on exercise: Using Kafka Java Producer with Avro Schema Registry
Introduction to Apache Oozie
- Hands-on exercise: Building and executing an Oozie workflow
- Hands-on exercise: Scheduling an Oozie workflow with the Oozie scheduler
Other Big Data Training
Machine Learning Operations Training (MLOps)
This four-day course will teach you how to operationalize Machine Learning models using popular open-source tools, like Kedro and Kubeflow, and deploy them using cloud computing.
Hadoop Administrator Training
This four-day course provides the practical and theoretical knowledge necessary to operate a Hadoop cluster. We put great emphasis on practical hands-on exercises that aim to prepare participants to work as effective Hadoop administrators.
Advanced Spark Training
This 2-day training is dedicated to Big Data engineers and data scientists who are already familiar with the basic concepts of Apache Spark and have hands-on experience implementing and running Spark applications.
Data Analyst Training
This four-day course teaches Data Analysts how to analyse massive amounts of data available in a Hadoop YARN cluster.
Real-Time Stream Processing
This two-day course teaches data engineers how to process unbounded streams of data in real-time using popular open-source frameworks.
Analytics engineering with Snowflake and dbt
This 2-day training is dedicated to data analysts, analytics engineers and data engineers who are interested in learning how to build and deploy Snowflake data transformation workflows faster than ever before.
Mastering ML/MLOps and AI-powered Data Applications in the Snowflake Data Cloud
This 2-day training is dedicated to data engineers, data scientists and tech enthusiasts. This workshop will provide hands-on experience and real-world insights into architecting data applications on the Snowflake Data Cloud.
Modern Data Pipelines with DBT
In this one-day workshop, you will learn how to create modern data transformation pipelines managed by DBT. Discover how you can improve the quality of your pipelines and the workflow of your data team by introducing a tool that standardizes the way you incorporate good practices.
Real-time analytics with Snowflake and dbt
This 2-day training is dedicated to data analysts, analytics engineers and data engineers who are interested in learning how to build and deploy real-time Snowflake data pipelines.