In this episode of the RadioData Podcast, Adam Kawa talks with Jonas Björk from Acast. Topics include: analytics use cases implemented at Acast, the cloud-managed data tech stack at Acast, AI/ML in podcasting today and tomorrow, trends and innovations in the podcasting industry, and more.
We encourage you to listen to the whole podcast or, if you prefer, read it here.
Host: Adam Kawa, CEO of GetInData | Part of Xebia
Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.
Guest: Jonas Björk
Jonas Björk is the Chief Technology Officer at Acast. He has worked on data-related projects at many data-heavy companies such as Spotify, BizOne and Ericsson.
At Ericsson he worked in the machine learning department, building recommendation systems. He then joined a startup called BizOne, working on a Business Intelligence as a Service platform. It was during this time, about 10 years ago, that Adam Kawa and Jonas met. Jonas started exploring the Hadoop ecosystem, which led him to join Spotify. After several years he decided to join Acast to build a podcast service platform.
When Jonas joined Acast, the technical department consisted of about 15 people. Jonas was also very interested in podcasts, especially from an educational point of view. He listened to a lot of podcasts mostly from the technology space. For Jonas it was still one of the few permission-less publishing and permission-less consumption services. Even now anyone can start a podcast and doesn’t need permission to do it. One doesn’t have to agree to terms of service from BigTech companies - there is no gatekeeping. It’s very easy to get started and any voice can be heard by a larger audience. Anyone can subscribe and listen to many podcasts and evaluate them according to his or her standards.
Building the data platform, infrastructure and enabling people to listen to each other is something that keeps Jonas inspired and excited by his work.
Jonas: There are some aspects of this industry that resemble those at Spotify. In both cases it’s not trivial to define the financial/transactional aspect of the business. You have to define when a “listen” (the moment when you acknowledge that the podcast was “listened to”) counts as valid. Were there ads running in that “listen” or not? Which advertiser did those ads come from? You have to create sophisticated data pipelines with lots of computation in order to figure out who is making money and how much. At Spotify we referred to those pipelines as royalty pipelines, and at Acast we call them calculation pipelines.
What drives the core of this business is a unique combination of the volume of data being processed and its source, i.e. server logs. A podcast can be consumed through various types of clients and players, so we don’t have access to client data and have to restrict our processing to server data. Every piece of information has to be extracted from the server logs:
There are billions of requests coming in every month and that makes things both challenging and interesting at the same time.
The Interactive Advertising Bureau (IAB) dictates the regulations regarding podcasting:
At any time, an external auditor from the IAB can come and review our implementation of those regulations to validate it and provide us with certification stating that we have interpreted the requirements accurately.
We have our web-based client, and we had a mobile application which was discontinued last year. We used those to train models and to check whether our server-side calculations and models were correct when it comes to calculating those important parameters.
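To make the idea of counting a “listen” purely from server logs more concrete, here is a minimal Python sketch. It is not Acast's calculation pipeline and not the IAB rule set: the `LogLine` fields, the 24-hour deduplication window and the minimum byte threshold are all illustrative assumptions chosen only to show the shape of the problem.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogLine:
    """One simplified server log entry (hypothetical schema)."""
    ts: datetime
    ip: str
    user_agent: str
    episode_id: str
    bytes_sent: int


# Illustrative thresholds only; not taken from the interview or any standard.
DEDUP_WINDOW = timedelta(hours=24)
MIN_BYTES = 500_000


def count_listens(lines):
    """Count listens per episode from server log lines.

    A request counts as a listen if enough bytes were served and the same
    (ip, user_agent, episode) was not already counted within DEDUP_WINDOW.
    """
    last_counted = {}  # (ip, user_agent, episode_id) -> time of last counted listen
    listens = {}       # episode_id -> number of listens
    for line in sorted(lines, key=lambda entry: entry.ts):
        if line.bytes_sent < MIN_BYTES:
            continue  # too little audio served to count
        key = (line.ip, line.user_agent, line.episode_id)
        previous = last_counted.get(key)
        if previous is not None and line.ts - previous < DEDUP_WINDOW:
            continue  # repeat request from the same listener
        last_counted[key] = line.ts
        listens[line.episode_id] = listens.get(line.episode_id, 0) + 1
    return listens
```

A real pipeline would also need to attribute ads and advertisers to each listen, which this sketch leaves out.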
Jonas: It’s everything connected to the geographical distribution of the listeners, some of those are:
There are a lot of things that we can do with the data and push it to the creators so that they can understand the audience better, and how the content is being consumed.
Jonas: Obviously, general metrics such as how many shows are registering and how many shows are published. Then financial data: how the company is performing in different parts of the world, plus metrics tied to the different initiatives that we’re working on.
We have recently released a self-serve ad-buying feature that lets advertisers sign up with Acast and buy ads directly on the website without involving any Acast staff along that journey. It’s a very new type of use case for us. We have typically been dealing with demand from larger buyers spending big advertiser money on bigger campaigns. So this year we’re looking at how many advertisers are coming, what types of advertisers are coming, etc. It depends on the type of initiative we’re currently working on.
Jonas: When we started working with GetInData in 2019, we had a centralized data infrastructure and decided that it was time to start moving to a decentralized one. As we quickly gained critical mass in data processing, we had to decentralize our infrastructure and teams to bring more data competence into the products we were building and to bring teams closer to the product and value creation, so they could deliver more value to customers without relying on a central team to do the data work.
We made those decentralizing choices by moving first from Azure to on-premise, and then finally to AWS. Right now we’re trying to streamline how we operate. We’ve standardized some aspects of our work: data is shared across teams through S3 in Parquet format, with Athena as the query engine on top and Glue Catalog providing, so to say, a cross-account data-mesh setup. We also acquired companies like Podchaser last year, which can be considered the IMDb of podcasting.
It’s easy right now to generate text, turn that text into a podcast with the help of AI, and even deepfake it. This could also be harmful for podcasting, because it’s hard to form a relationship with AI-created content. What makes podcasting sticky is that it’s created by humans. Content is probably faster to create with AI, just by parsing text, but the relationship is harder to achieve.
Jonas: Because podcasts are based on old RSS technology, they haven't seen much innovation recently, and that is holding the industry back. It’s hard to innovate on something that benefits the whole industry and get it adopted.
We had RAD (Remote Audio Data), which made callbacks to remote endpoints whenever the user hit markers in the audio, so that the podcast provider could register user behavior more easily. With more and more privacy regulations, companies are no longer allowed to do that. Podcasting had to innovate around those limitations and focus much more on the context of what the user is listening to. There is a lot of room for improvement in analyzing the content itself, its semantics and its tone, and leveraging that instead of the attributes of the listening individual.
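The marker-based callback idea can be illustrated with a small Python sketch. It does not follow the actual RAD payload format and sends nothing over the network: the `Marker` class, the `events_for_progress` function and the example endpoint are hypothetical names used only to show how a player could decide which callbacks to emit as playback advances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    offset_s: float  # position in the audio, in seconds
    label: str       # hypothetical event name


def events_for_progress(markers, previous_s, current_s, endpoint):
    """Return the callbacks a player would emit for markers crossed
    while playback advanced from previous_s to current_s (seconds).

    Each callback is returned as a plain dict instead of being sent,
    so the sketch stays free of network calls.
    """
    return [
        {"endpoint": endpoint, "label": marker.label, "offset_s": marker.offset_s}
        for marker in sorted(markers, key=lambda m: m.offset_s)
        if previous_s < marker.offset_s <= current_s
    ]
```

For example, with markers at 10, 60 and 120 seconds, advancing playback from 0 to 60 seconds yields callbacks for the first two markers only.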
Jonas: I think that the number of podcasts will continue to grow, and there will be more AI-generated podcasts. There is something powerful in audio. As a parent, you want your kids to have less screen time. When you’re attached to a screen you cannot do anything else, while when you’re listening to a podcast you can do many other activities. For sure this medium is going to grow further in the future.
Rather than googling for something you’ll just tell ChatGPT: “Hey I’ve got a 25 minute run ahead of me, give me an episode about this interesting topic” and you will just start listening.
Jonas: I think automatic translation is doable but might not be as interesting as the original podcasts. The feeling of the original creator may be lost in the process of translation, but of course it has its use case. It depends on the content: parts of the content may be localized, or the advertisements may be localized. It's probably a lot easier to reach listeners with a localized commercial than with one left in its original language.
Jonas: Currently everything runs on AWS, and we try to minimize the time spent on maintenance. We prefer to pick managed services from the AWS ecosystem to reduce the cost of managing things ourselves. S3, Parquet and Athena form the data lake layer that we share across all product teams. Most of the pipelines are written in Spark, running on EMR. Most teams have their own Airflow instance as a scheduler. We also use ECS and Lambdas. For the low-latency dashboarding stack we use Snowflake. That’s the high-level stack.
In our data pipelines we mainly use Python, TypeScript or JavaScript. For the high performance parts we’ve moved to Rust. Rust has started to get some traction in the data ecosystem recently which is also an interesting trend.
We tried to be cloud agnostic in the beginning but we quite rapidly concluded that it was too costly to maintain, so we ended up choosing our go-to cloud provider which was AWS.
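As a rough illustration of the shared data lake layer (Parquet files on S3, queried through Athena), the Python sketch below builds the kind of `CREATE EXTERNAL TABLE` statement that registers a Parquet dataset for querying. The helper name `athena_parquet_ddl`, the database, table, columns and bucket are all hypothetical, and the cross-account Glue Catalog sharing mentioned in the interview is not covered.

```python
def athena_parquet_ddl(database, table, columns, partition_columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for a Parquet dataset on S3.

    columns and partition_columns are lists of (name, type) tuples.
    All names passed in are the caller's; nothing here is Acast's schema.
    """
    cols = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    parts = ", ".join(f"`{name}` {col_type}" for name, col_type in partition_columns)
    location = s3_location.rstrip("/") + "/"  # table locations are folder prefixes

    statement = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {cols}\n"
        f")\n"
    )
    if parts:
        statement += f"PARTITIONED BY ({parts})\n"
    statement += f"STORED AS PARQUET\nLOCATION '{location}'"
    return statement


# Hypothetical example: a daily-partitioned table of listens per episode.
example_ddl = athena_parquet_ddl(
    database="analytics",
    table="episode_listens",
    columns=[("episode_id", "string"), ("listens", "bigint")],
    partition_columns=[("dt", "string")],
    s3_location="s3://example-bucket/episode_listens",
)
```

Keeping the statement generation in a small function like this is one way for several teams to agree on a single table layout convention; whether Acast does so is not stated in the interview.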
Jonas: We pivoted away from a hyper-growth strategy towards reaching profitability earlier this year. We’re trying to keep things up and running while reconsidering some choices regarding our stack; it’s more about finding more efficiency in what we currently do than about finding new technologies to grow.
More and more companies are questioning their cloud spending. Cloud is great when you’re using what you’re paying for. But we very often pay for things that we’re not using, so we are revisiting some of our expenses to reduce costs. The financial aspect is becoming more and more important.
Jonas: Coming back to what we mentioned earlier: big advancements in terms of what data we are collecting and sharing in the industry need to be agreed upon and adopted. No single player will be able to decide that. That is a challenge, but it’s also a good thing for the industry. It prevents anyone from BigTech from just coming in, setting the rules and starting privacy-intrusive actions to spy on everyone across the entire web. That’s impossible in podcasting right now, and we don’t think it would be a good thing either. The good thing about Acast is that we don’t have to pivot away from those rules; we have a sustainable position without that. We can innovate even further in that space, which makes this industry super interesting.
I think that open platforms like Acast will win against the exclusive podcast platforms. We’re helping creators build an audience, and we want to help podcasters make more money. With Acast your show can be consumed anywhere because we distribute the content everywhere (Spotify, Apple store etc.).
Another challenge is that there is no canonical truth that identifies a show in podcasting, and platforms have to solve this for themselves. Podcasts exist on multiple platforms, but there is no true identifier that says which copy is the “original” one. If you move to another platform, there is no way to distinguish your podcast from a show replicating it elsewhere.
You can listen to the whole episode here:
Subscribe to the Radio DaTa podcast to stay up-to-date with the latest technology trends and discover the most interesting data use cases!