Review of presentations at Big Data Tech Warsaw 2017
On 9th of February 2017 we co-organized the third edition of Big Data Tech Warsaw – an exciting one-day conference with purely technical talks in the field of Big Data analysis, scalability, storage and search. In this year's edition, we had three presentations in the plenary session, led by guests from Google, Etsy and SAS Institute. Afterwards, participants had a chance to listen to 24 presentations divided into four simultaneous tracks – Operations & Deployment, Real-Time Processing, Data Application Development, and Analytics & Data Science. We were privileged to host Big Data experts from the following companies: Facebook, Spotify, SkyScanner, Criteo, ING, Mesosphere, dataArtisans and, of course, GetInData.
Below you can find a brief review of some of the presentations our team members attended.
Please, feel free to leave a comment!
Tomasz Żukowski, Data Scientist at GetInData
“Meta-Experimentation at Etsy”, presented by Emily Sommer, was a very interesting talk about the evolution of A/B testing at Etsy. Emily showed that there are some “dark paths” in A/B testing which can lead to misleading results. The main problem for Etsy was the lack of an i.i.d. distribution in their data, which is one of the main assumptions of the t-test they used. Their next step was the bootstrap, which solved the non-i.i.d. issue. But bootstrapping from even a reasonably sized population is memory-intensive, while Etsy’s reporting system was “online”, so the process could not be resource-hungry. Their choice was the bag of little bootstraps, as it met both requirements. The presentation showed that A/B testing is a tricky field, and that you should always remember to check your assumptions, run A/A tests, and design experiments carefully and reasonably.
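Etsy’s actual implementation was not shown in the talk, but the core idea of the bag of little bootstraps – resample small subsets of size roughly n^0.6 and inflate each resample back to the full sample size with multinomial weights – can be sketched in a few lines of NumPy. Everything below (function name, subset count, bootstrap count) is a hypothetical illustration, not Etsy’s code:

```python
import numpy as np

def blb_confidence_interval(data, stat=np.mean, n_subsets=20,
                            subset_exp=0.6, n_boot=100, alpha=0.05, seed=42):
    """Approximate a (1 - alpha) confidence interval for `stat` using
    the bag of little bootstraps.

    Each subset has only b = n**subset_exp elements, but every bootstrap
    replicate is inflated to the full sample size n via multinomial
    counts, so it behaves like a full-size resample at a fraction of
    the memory cost. The CI is computed within each subset and the
    endpoints are averaged across subsets.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    b = int(n ** subset_exp)          # small subset size, e.g. 10000**0.6 ~ 251
    lowers, uppers = [], []
    for _ in range(n_subsets):
        subset = rng.choice(data, size=b, replace=False)
        reps = []
        for _ in range(n_boot):
            # Draw how many times each subset element appears in a
            # full-size (n element) resample.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            reps.append(stat(np.repeat(subset, counts)))
        lo, hi = np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        lowers.append(lo)
        uppers.append(hi)
    return float(np.mean(lowers)), float(np.mean(uppers))

# Toy demo: data centred at 5.0, so the CI for the mean should bracket 5.
data = np.random.default_rng(0).normal(loc=5.0, size=10_000)
lo, hi = blb_confidence_interval(data)
```

The appeal for an online reporting system is that no single bootstrap step ever materializes more than one small subset in memory.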
“One Jupyter to rule them all” was a presentation about one of the most interesting tools for Data Analysts and Data Scientists. Mateusz Strzelecki showed how adopting Jupyter at Allegro helped their less technically skilled teams use Spark on a daily basis. On the other hand, he also raised problems caused by people using big data technologies without a good understanding of how they work inside – e.g. simply re-running failed Spark jobs without trying to understand and fix the root cause of the failure, such as too few resources (executors or memory).
Dawid Wysakowicz, Data Engineer at GetInData
I really liked the order of presentations in the first part of the “Real-time processing” track. It started with a nice presentation from Bartosz Łoś, who described the architecture they use at RTB House to process around 20k events per second. They started with daily jobs and developed a platform based on the Lambda Architecture that enabled them to cut that time down to a few seconds using streaming technologies like Apache Kafka and Apache Storm. It was a great lessons-learned story.
We found it a great introduction to GetInData’s own presentation of a different approach to that kind of problem, using modern stream processing engines like Apache Flink. We basically started where Bartek finished. We tried to emphasise the advantages of streaming not just in terms of latency, but also in ease of implementation, ease of operation and, most importantly, correctness of processing – all shown on an example use case that required event-time user sessions.
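The Flink code from our talk is not reproduced here, but the semantics of event-time session windows – grouping a user’s events by when they happened rather than when they arrived, and closing a session after a gap of inactivity – can be illustrated with a small, hypothetical plain-Python sketch:

```python
from collections import defaultdict

def sessionize(events, gap=30 * 60):
    """Group events into per-user sessions by event time.

    `events` is an iterable of (user_id, event_timestamp) pairs, where
    the timestamp is when the event happened (event time), not when it
    arrived. Two consecutive events of the same user belong to the same
    session if they are at most `gap` seconds apart -- the same
    semantics Flink's session windows provide out of the box.
    """
    per_user = defaultdict(list)
    for user, ts in events:
        per_user[user].append(ts)

    sessions = {}
    for user, stamps in per_user.items():
        stamps.sort()                       # order by event time, not arrival order
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] <= gap:
                user_sessions[-1].append(ts)   # still within the gap: same session
            else:
                user_sessions.append([ts])     # gap exceeded: start a new session
        sessions[user] = user_sessions
    return sessions

# Events may arrive out of order; event time decides the sessions.
events = [("alice", 100), ("alice", 200), ("bob", 50),
          ("alice", 5000), ("bob", 80)]
result = sessionize(events, gap=1000)
```

The hard part in a real streaming job – which this batch-style sketch sidesteps – is deciding when a window can safely close despite late, out-of-order events, which is exactly what watermarks in Flink are for.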
The next presentation, from Fabian Hueske (dataArtisans), was for me personally a really great ending to the first part of that track. He talked about the direction modern streaming engines will take. That direction will definitely include StreamSQL, which can benefit greatly from many years of experience with query optimizers in relational databases and, of course, can leverage the SQL skills Data Scientists already have.
After the break we were presented with some great topics that we rarely think of. It started with a talk by Ashish Tadose from PubMatic, who showed us that besides the most talked-about engines like Apache Storm or Apache Flink, there are still a number of great competitors – like Apache Apex, which he described and which often enjoys an even greater level of adoption.
The guys from ING Services (Krzysztof Adamski and Krzysztof Żmij) gave a very technical talk about their platform for processing network logs across different datacenters. It covered many tweaks and decisions they had to make to develop a working solution for a financial company. They dived as deep as individual Kafka parameters, Elasticsearch version compatibility, and when to upgrade.
The ending presentation, from Theofilos Kakantousis (Logical Clocks AB), showed us that many of the problems the guys from ING faced could be avoided, as long as we are fine with Streaming-as-a-Service platforms. Theofilos introduced their own Hadoop distribution, which integrates many standard frameworks – Hadoop (with their own implementation of HDFS and upgrades to the NameNodes), Spark, Flink, Kafka and Kibana – all wrapped into a nice set of project abstractions, UIs and configuration APIs that take away a big part of the operational burden.
Piotr Bednarek, Big Data Administrator at GetInData
The Operations & Deployment track started with the presentation by Michał Brzezicki, founder of SentiOne. He spoke about the infrastructure they needed to build, and the problems they faced, in order to deliver a high-performance web crawling and analytics platform for their clients. The presentation covered both general topics, like gathering data of different types from millions of web pages, and more detailed subjects, like fine-tuning an Elasticsearch cluster. For example, Michał pointed out that the heap size of an Elasticsearch node should not exceed 31 GB of RAM, so that the JVM can use smaller (compressed) pointers.
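The 31 GB figure comes from the JVM’s compressed-oops threshold: below roughly 32 GB of heap, the JVM can represent object references with 32-bit compressed pointers, which saves memory and cache space; crossing the threshold forces 64-bit pointers and can effectively waste the extra heap. A hypothetical `jvm.options` fragment for an Elasticsearch node (the 31g value is the talk’s recommendation, not an official default):

```
# config/jvm.options on each Elasticsearch node.
# Keep min and max heap equal, and below ~32 GB, so the JVM
# can keep using compressed ordinary object pointers (oops).
-Xms31g
-Xmx31g
```

Keeping `-Xms` and `-Xmx` identical also avoids heap-resizing pauses at runtime.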
The next speaker, Nikolay Golov, presented how Avito, the biggest Russian classifieds platform, faced rapid growth and how they scaled their infrastructure from millions to billions of pageviews per day. Avito’s general approach was to use enterprise-ready solutions provided by well-established vendors like HP and Vertica.
The first part of the session ended with the talk by Kamil Ciukszo from Alterdata.io and Krzysztof Baczyński from Cisco. They shared their insights into creating an effective, scalable and manageable environment for real-time big data processing and analytics using Cisco’s enterprise solutions.
After the lunch break, Stuart Pook from Criteo delivered a fantastic presentation on building a data center with capacity for 5,000 Hadoop nodes. Now, when more and more companies decide to move their infrastructure to the cloud, Criteo decided that it would be cheaper to stick with bare metal and built themselves a new datacenter. Stuart discussed many aspects of such an endeavour, like choosing a suitable hardware provider, benchmarking disks and switches, and the mistakes they made along the way. Just brilliant 🙂
Next came the presentation by Nelson Arape from Spotify. It was a great complement to the previous talk, because Spotify moved in the opposite direction – from on-premise to cloud infrastructure. Spotify decided that it should put more effort into developing a great music app rather than a great Hadoop cluster, so they “delegated” infrastructure operations to Google. Nelson focused on the back-end side of building a reliable event delivery system and the problems they faced in this new environment.
The closing presentation was delivered by Tomasz Sobczak from Findwise. Tomasz discussed the challenges of building large, distributed full-text search systems based on Apache Solr and Elasticsearch. After a short comparison of both systems, which share a common engine – Lucene – Tomasz discussed topics like the right cluster size, hardware requirements, and how many shards and replicas are needed. The really interesting part was about data indexing optimization: for example, avoid indexing unneeded data, always start testing with one shard, one replica and the default configuration, and then scale up when needed, changing a single configuration setting at a time.
Because of our premise of not accepting marketing and sales presentations, we selected a group of speakers consisting of Big Data practitioners; thanks to that, all presentations received high marks from attendees. We hope that the next edition will turn out to be another great and fruitful meeting for Big Data professionals.
See you next year!