This two-day course teaches data engineers how to process unbounded streams of data in real time using popular open-source frameworks. We focus mostly on Apache Flink, the most promising open-source stream processing framework and one that is increasingly used in production. We also give short introductions to Spark Streaming, Apache Storm and Apache Samza, so that students know the existing alternatives and can choose the best tool for their use cases.
During the course we simulate a real-world end-to-end scenario: processing, in real time, the logs generated by users interacting with a mobile application. The technologies that we use include Kafka, Flink, HDFS, YARN and Elasticsearch. All exercises are done on a remote multi-node Hadoop cluster.
This course is intended for data engineers who are interested in using large-scale distributed tools to process streams of data in real time. Some experience coding in Python, Java, or Scala and basic familiarity with Big Data tools (e.g. Hadoop, Spark) are assumed.