Posts Tagged

We share our knowledge happily

While many problems can be solved in batch, the stream-processing approach can give you even more benefits. In this blog post series we’ll discuss a real-world example of user session analytics to give you a use-case-driven overview of the business and technical problems that modern stream-processing technologies like Apache Flink help you [...]
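Session analytics of this kind typically groups a user’s events into sessions separated by a period of inactivity (what Flink calls session windows). A minimal batch-style sketch of that idea, assuming events are plain timestamps and an illustrative gap of 30 time units:

```python
def sessionize(timestamps, gap=30):
    """Split a user's event timestamps into sessions: a new session starts
    whenever the gap between consecutive events exceeds `gap`."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

print(sessionize([1, 2, 50, 51]))  # two sessions: [[1, 2], [50, 51]]
```

A stream processor like Flink applies the same logic continuously and per key (per user), rather than over a sorted batch as above.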

In this blog post we share the motivation, current status and challenges of our new project, called AirHadoop. AirHadoop follows the sharing-economy model: it aims to allow companies to temporarily gain more computing power and storage by using idle Hadoop clusters that belong to somebody else. Sharing economy: A sharing economy is an economic […]

A few months ago I was working on a project with a lot of geospatial data. The data was stored in HDFS and easily accessible through Hive. One of the tasks was to analyse this data, and the first step was to join two datasets on columns containing geographical coordinates. I wanted an easy and efficient solution. But […]
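One common way to make such a join tractable is to bucket coordinates into coarse grid cells and join on the cell key instead of on raw coordinates. A minimal Python sketch of that idea (the function names and the 0.1-degree cell size are illustrative, not from the original post):

```python
from collections import defaultdict

def grid_key(lat, lon, cell_size=0.1):
    """Map a coordinate to a coarse grid cell so nearby points share a key."""
    return (round(lat / cell_size), round(lon / cell_size))

def spatial_join(left, right, cell_size=0.1):
    """Pair up (id, lat, lon) records from `left` and `right` whose
    points fall into the same grid cell."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[grid_key(rec[1], rec[2], cell_size)].append(rec)
    pairs = []
    for rec in left:
        for match in buckets[grid_key(rec[1], rec[2], cell_size)]:
            pairs.append((rec[0], match[0]))
    return pairs

shops = [("shop-1", 52.23, 21.01)]
users = [("user-a", 52.21, 21.03), ("user-b", 40.71, -74.00)]
print(spatial_join(users, shops))  # [('user-a', 'shop-1')]
```

The same bucketing trick works as an equi-join key in Hive. Note that points near a cell boundary can land in adjacent cells; a production version would also probe neighbouring cells or use a proper geohash library.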

The LinkedIn Engineering blog is a great resource of technical blog posts related to building and using large-scale data pipelines with Kafka and its “ecosystem” of tools. In this post I provide several pictures and diagrams (including quotes) that summarise how the data pipeline has evolved at LinkedIn over the years. The actual content is based […]

In the first part of this blog series I described a few challenges that I had to face to quickly implement a simple Hive query and schedule it periodically on the Hadoop cluster. These challenges include data cataloguing, data discovery, data lineage and process scheduling. I also explained how they can be addressed using existing […]

This blog series is based on the talk “Simplified Data Management and Process Scheduling in Hadoop” that we gave at the Big Data Technical Conference in Poland in February 2015. Because the talk was very well received by the audience, we decided to convert it into a blog series. In the first part we describe possible open-source […]

We share our slides about Apache Tez, delivered by our consultant as a lightning talk at the Warsaw Hadoop User Group. Tez is a highly efficient and scalable execution engine that can be easily leveraged by existing tools like Hive, Pig or Cascading to run computations faster. In this talk, we describe Tez and […]

We are happy to share slides about HCatalog that come from the Data Analyst Training delivered by GetInData. HCatalog makes it easier for users of different data-processing tools (such as Apache Hive, Apache Pig and MapReduce) to share data on a Hadoop cluster. The slides cover HCatalog’s primary motivation, goals, the most important features, currently […]
