Advanced Spark Training
This 2-day training is intended for Big Data engineers and data scientists who are already familiar with the basic concepts of Apache Spark and have hands-on experience implementing and running Spark applications.
Participants will learn advanced aspects of working with Apache Spark and will be able to apply them to optimise and streamline Spark applications, as well as to integrate Spark with external data sources and sinks.
Apache Spark Training
- Overview of Interactive Notebooks
- How to set up an interactive environment for Spark
- Parquet Optimisations
- RDD vs DataFrames performance
- Eliminating Shuffle
- Controlling the number of partitions
- Introduction to Datasets
- RDD vs Datasets vs DataFrames
- Sample: how to process MAX_INT_VALUE records in 2 sec
- Encoders - Object Serialization in Spark
- Exercise: Working with Datasets
- Architecture details
- Advanced configuration settings important for Spark applications
- Testing Spark code
- Structuring Spark applications (clean code principles)
- Scala API tips
- New features at a glance
- Unifying DataFrames and Datasets
- Nested queries
- Whole Stage Codegen
- Vectorized reader
Completed in half the estimated time and with a fivefold improvement on data collection goals, the robust product has exponentially increased processing capabilities. GetInData’s in-depth engagement, reliability, and broad industry knowledge enabled seamless project execution and implementation.
GetInData had been supporting us in building production Big Data infrastructure and implementing real-time applications that process large streams of data. In light of our successful cooperation with GetInData, their unique experience and the quality of work delivered, we recommend the company as a Big Data vendor.
GetInData delivered a robust mechanism that met our requirements. Their involvement allowed us to add a feature to our product, despite not having the required developer capacity in-house.
Their consistent communication and responsiveness enabled GetInData to drive the project forward. They possess comprehensive knowledge of the relevant technologies and have an intuitive understanding of business needs and requirements. Customers can expect a partner that is open to feedback.
We sincerely recommend GetInData as a Big Data training provider! The trainer is a very experienced practitioner and he gave us a lot of tips regarding production deployments, possible issues as well as good practices that are invaluable for a Hadoop administrator.
The engineers and administrators at GetInData are world-class experts. They have proven experience in many open-source technologies such as Hadoop, Spark, Kafka and Flink for implementing batch and real-time pipelines.
Other Big Data Training
Machine Learning Operations Training (MLOps)
This four-day course will teach you how to operationalize Machine Learning models using popular open-source tools, like Kedro and Kubeflow, and deploy them using cloud computing.
Hadoop Administrator Training
This four-day course provides the practical and theoretical knowledge necessary to operate a Hadoop cluster. We put great emphasis on practical hands-on exercises that aim to prepare participants to work as effective Hadoop administrators.
Data Analyst Training
This four-day course teaches Data Analysts how to analyse massive amounts of data available in a Hadoop YARN cluster.
Real-Time Stream Processing
This two-day course teaches data engineers how to process unbounded streams of data in real-time using popular open-source frameworks.
Modern Data Pipelines with DBT
In this one-day workshop, you will learn how to create modern data transformation pipelines managed by DBT. Discover how you can improve the quality of your pipelines and the workflow of your data team by introducing a tool that standardizes the way good practices are applied within the team.