Archive for the ‘Post’ Category

We share our knowledge happily

When properly deployed, Spark Streaming 1.2 provides a zero data loss guarantee. To enjoy this mission-critical feature, you need to fulfil the following prerequisites: the input data comes from a reliable source and reliable receivers; application metadata is checkpointed by the application driver; and the write ahead log is enabled. Let’s briefly describe these prerequisites. In this blog post, we […]

This blog series is based on the talk “Simplified Data Management and Process Scheduling in Hadoop” that we gave at the Big Data Technical Conference in Poland in February 2015. Because the talk was very well received by the audience, we decided to convert it into a blog series. In the first part we describe possible open-source […]

In this blog post, I describe a few surprising gotchas related to the import of a MySQL table into Hive using Sqoop 1.4.5 (the most recent version supported by vendors like Hortonworks or Cloudera at the time of writing this post). Real-world scenario: In my simple (yet real-world) use-case, I have a MySQL table and […]

We are happy to say that our Refcard, “Getting Started with Apache Hadoop”, has been published by DZone. This Refcard presents Apache Hadoop, a software framework that enables distributed storage and processing of large datasets using simple high-level programming models. The card covers the most important concepts of Hadoop, describes its architecture, and […]

We recommend reading “Agile migration of a single-node cluster from MapReduce Version 1 to YARN”, written by our consultant for IBM developerWorks. The abstract of the article follows: Although Hadoop vendors such as Cloudera and Hortonworks provide excellent and detailed documentation for installing YARN, they follow an all-or-nothing approach. […]

Jun 06, 2014

Adam Kawa

hadoop, yarn



We recommend reading “Introduction To YARN”, written by our consultant for IBM developerWorks. The abstract of the article follows: Apache Hadoop is currently one of the most popular tools for big data processing. It has been successfully deployed in production by many companies for several years. Though Hadoop is […]
