While many problems can be solved in batch, stream processing can offer additional benefits. In this blog post series we’ll discuss a real-world example of user session analytics to give you a use-case-driven overview of the business and technical problems that modern stream processing technologies like Apache Flink help you […]
As the Big Data Tech Warsaw 2017 conference is getting closer, we’d like to highlight the most interesting topics that will be covered during this exciting event. This year the event will feature more than 25 technical talks given in four parallel tracks.
Schema evolution of a Hive table backed by the Avro file format allows you to modify the table schema in several “schema-compatible” ways without rewriting all existing data. As a result, your HiveQL queries can read old and new Avro files uniformly using the current table schema. In this blog post I briefly […]
During my 6-year Hadoop adventure, I had the opportunity to work with Big Data technologies at several companies, ranging from fast-growing startups (e.g. Spotify) to global corporations and academic institutes. What really amazed me was the difference in how the use cases were defined, how fast valid solutions were built and how money was spent and […]
We are excited to announce that GetInData has become the co-organizer (together with our partner Evention) of Big Data Tech Warsaw 2017. The conference will be held in Warsaw (Poland) on February 9th, 2017.
In this blog post we share the motivation, current status and challenges of our new project, called AirHadoop. AirHadoop follows the sharing economy model and aims to let companies use idle Hadoop clusters that belong to somebody else to temporarily gain more computing power and storage. Shared economy A sharing economy is an economic […]
Camus, a MapReduce job that loads data from Kafka into HDFS, has a number of time-related configuration settings and assumptions. They control how many messages are consumed from Kafka in each Camus run and where the data is stored in HDFS. I summarize them in this blog post.
The LinkedIn Engineering blog is a great source of technical blog posts related to building and using large-scale data pipelines with Kafka and its “ecosystem” of tools. In this post I provide several pictures and diagrams (including quotes) that summarise how the data pipeline has evolved at LinkedIn over the years. The actual content is based […]
We are excited to announce that GetInData has become the co-organizer (together with our partner Evention) of Big Data Technology Summit 2016. The conference will be held in Warsaw on February 24-25th.
Go to Big Data Weekly Quiz #10 to start playing this week’s edition. The quiz covers topics from the last issue of Hadoop Weekly and contains questions about Spark, Succinct Spark, Zeppelin, S3 and EMR. Remember to share your score on Twitter or Facebook! 🙂