A few months ago I was working on a project with a lot of geospatial data. The data was stored in HDFS and easily accessible through Hive. One of the tasks was to analyze this data, and the first step was to join two datasets on columns containing geographical coordinates. I wanted an easy and efficient solution. But […]
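To make the starting point concrete, here is a minimal sketch of the naive approach in HiveQL (the table and column names are hypothetical), which also hints at why an easy and efficient solution was not obvious:

```sql
-- Naive approach: join two datasets on exact coordinate values.
-- Exact equality on floating-point latitude/longitude almost never matches
-- real-world points, and a "within distance d" condition cannot be expressed
-- as a plain equi-join that Hive can execute efficiently.
SELECT e.event_id, p.place_name
FROM events e
JOIN places p
  ON e.lat = p.lat
 AND e.lon = p.lon;
```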
Camus, a MapReduce job that loads data from Kafka into HDFS, has a number of time-related configuration settings and assumptions. They control how many messages are consumed from Kafka in each Camus run and where the data is stored in HDFS. I summarize them in this blog post.
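As a taste of what the post covers, here is a hedged sketch of a camus.properties fragment with some of the time-related settings; the property names follow Camus's example configuration, and all values are examples only:

```properties
# Where Camus writes the final, time-partitioned data in HDFS
etl.destination.path=/user/camus/topics
# Granularity of the time-based partitioning of output files, in minutes
etl.output.file.time.partition.mins=60
# Timezone used when resolving time-based partition paths
etl.default.timezone=UTC

# Which message field carries the timestamp, and how it is formatted
camus.message.timestamp.field=timestamp
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss

# Upper bounds on how much data a single Camus run pulls from Kafka
kafka.max.pull.hrs=1
kafka.max.historical.days=3
```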
The LinkedIn Engineering blog is a great source of technical posts about building and using large-scale data pipelines with Kafka and its “ecosystem” of tools. In this post I provide several pictures and diagrams (including quotes) that summarize how the data pipeline has evolved at LinkedIn over the years. The actual content is based […]
We are excited to announce that GetInData has become the co-organizer (together with our partner Evention) of the Big Data Technology Summit 2016. The conference will be held in Warsaw on February 24-25th.
Go to Big Data Weekly Quiz #10 to start playing this week’s edition. The quiz covers topics from the latest issue of Hadoop Weekly and contains questions about Spark, Succinct Spark, Zeppelin, S3 and EMR. Remember to share your score on Twitter or Facebook! 🙂
One of our clients uses Apache Sentry (incubating) to define and enforce authorization rules for data in a Hadoop cluster. In this blog post, I would like to share my experience in using Sentry 1.4.0 with several tools from the Hadoop ecosystem that come with CDH 5.3 and CDH 5.4.
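For context, role-based rules in Sentry are typically defined with GRANT/REVOKE statements issued through HiveQL; a minimal sketch follows (the role, database, and group names are hypothetical):

```sql
-- Create a role and give it read-only access to one database
CREATE ROLE analyst_role;
GRANT SELECT ON DATABASE analytics TO ROLE analyst_role;

-- Sentry maps privileges to users via groups, not individual users
GRANT ROLE analyst_role TO GROUP analysts;
```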
We are extremely happy to announce that GetInData has become the official sponsor of the Warsaw Hadoop User Group!
In the first part of this blog series I described a few challenges that I had to face to quickly implement a simple Hive query and schedule it periodically on the Hadoop cluster. These challenges include data cataloguing, data discovery, data lineage and process scheduling. I also explained how they can be addressed using existing […]
When properly deployed, Spark Streaming 1.2 provides a zero data loss guarantee. To enjoy this mission-critical feature, you need to fulfill the following prerequisites:

- The input data comes from a reliable source and reliable receivers
- Application metadata is checkpointed by the application driver
- The write ahead log is enabled

Let’s briefly describe these prerequisites. In this blog post, we […]
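To illustrate, here is a minimal Scala sketch of a driver that wires up these prerequisites; the checkpoint directory, ZooKeeper host, topic, and application name are examples only:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// All paths, hosts, and topic names below are examples only.
val checkpointDir = "hdfs:///user/spark/checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("zero-data-loss-app")
    // prerequisite: enable the write ahead log for received data
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  // prerequisite: checkpoint application metadata to a reliable filesystem
  ssc.checkpoint(checkpointDir)
  // prerequisite: a reliable source and receiver, e.g. the Kafka receiver,
  // which acknowledges data to Kafka only after it has been stored
  val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
    Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
  lines.count().print()
  ssc
}

// recover the context from the checkpoint after a driver restart,
// or build a fresh one on the first run
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```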
This blog series is based on the talk “Simplified Data Management and Process Scheduling in Hadoop” that we gave at the Big Data Technical Conference in Poland in February 2015. Because the talk was very well received by the audience, we decided to convert it into a blog series. In the first part we describe possible open-source […]