Radio DaTa Podcast

Data & analytics at Acast, AI & trends in the podcasting industry

In this episode of the RadioData Podcast, Adam Kawa talks with Jonas Björk from Acast. Topics include analytics use cases implemented at Acast, Acast’s cloud-managed data tech stack, AI/ML in podcasting today and tomorrow, trends and innovations in the podcasting industry, and more.

We encourage you to listen to the whole podcast or, if you prefer, read it here.

Host: Adam Kawa, CEO of GetInData | Part of Xebia

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guest: Jonas Björk

Jonas Björk is the Chief Technology Officer at Acast. He has worked on data-related projects at data-heavy companies such as Spotify, BizOne and Ericsson.

At Ericsson he worked in the machine learning department, building recommendation systems. He then joined BizOne, a startup offering a Business Intelligence as a Service platform. It was during this time, about 10 years ago, that Adam Kawa and Jonas met. Jonas also started exploring the Hadoop ecosystem, which led him to join Spotify. After several years there he decided to join Acast to build a podcast service platform.

What is Acast? Why did Jonas decide to join Acast?

When Jonas joined Acast, the technical department consisted of about 15 people. Jonas was also very interested in podcasts, especially from an educational point of view. He listened to a lot of podcasts mostly from the technology space. For Jonas it was still one of the few permission-less publishing and permission-less consumption services. Even now anyone can start a podcast and doesn’t need permission to do it. One doesn’t have to agree to terms of service from BigTech companies - there is no gatekeeping. It’s very easy to get started and any voice can be heard by a larger audience. Anyone can subscribe and listen to many podcasts and evaluate them according to his or her standards.

Building the data platform, infrastructure and enabling people to listen to each other is something that keeps Jonas inspired and excited by his work.

Could you share some information about what type of data you collect and process and what are the business use cases for this type of data?

Jonas: There are some aspects of this industry that resemble those at Spotify. In both cases it’s not trivial to define the financial/transactional aspect of the business. You have to define when a “listen” counts as valid (the moment at which you acknowledge that the podcast was “listened to”). Were there ads running in that “listen” or not? What advertiser did those ads come from? You have to create sophisticated data pipelines with lots of computation in order to figure out who is making money and how much. At Spotify we referred to those pipelines as royalty pipelines, and at Acast we call them calculation pipelines.

What drives the core of this business is a unique combination of the data volume being processed and the source of that data: server logs. A podcast can be consumed through many types of clients and players, so we don’t have access to client data and have to restrict our processing to server data. Every piece of information has to be extracted from the server logs:

Did the user listen to enough of a podcast so that it can be counted as a “listened” podcast?
Did the user listen to the ads in the beginning?
Did the user listen to the ads at the end?

There are billions of requests coming in every month and that makes things both challenging and interesting at the same time.
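The questions above can be illustrated with a minimal sketch, assuming each server log line has already been parsed into a served byte range for one listener and one episode file. The `RangeRequest` type, the constant-bitrate assumption, the 60-second threshold and the ad byte offsets are all hypothetical placeholders for illustration; they are not Acast’s actual rules or pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRequest:
    """One served byte range from a parsed log line (start inclusive, end exclusive)."""
    start: int
    end: int


def merged_bytes(requests):
    """Merge overlapping or adjacent ranges so re-requested bytes are counted once."""
    merged = []
    for r in sorted(requests, key=lambda r: r.start):
        if merged and r.start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], r.end))
        else:
            merged.append((r.start, r.end))
    return merged


def seconds_delivered(requests, bytes_per_second):
    """Estimate seconds of audio delivered, assuming a constant bitrate."""
    total = sum(end - start for start, end in merged_bytes(requests))
    return total / bytes_per_second


def is_listen(requests, bytes_per_second, min_seconds=60.0):
    """Hypothetical rule: count a listen once enough audio has been delivered."""
    return seconds_delivered(requests, bytes_per_second) >= min_seconds


def segment_delivered(requests, seg_start, seg_end):
    """True if a byte segment (for example an ad) is fully covered by one merged range."""
    return any(s <= seg_start and seg_end <= e for s, e in merged_bytes(requests))


# Two overlapping requests against a 128 kbps file (16,000 bytes per second).
reqs = [RangeRequest(0, 500_000), RangeRequest(400_000, 1_200_000)]
print(is_listen(reqs, 16_000))                          # enough audio delivered?
print(segment_delivered(reqs, 0, 480_000))              # ad at the beginning
print(segment_delivered(reqs, 9_000_000, 9_480_000))    # ad at the end
```

Merging ranges first matters because players often re-request overlapping chunks, and counting those bytes twice would inflate the estimate.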

The Interactive Advertising Bureau (IAB) sets the rules for podcast measurement:

how to count a “listen”
how to count ad impressions, according to podcasting measuring specifications

At any time, an external auditor from the IAB can come and review our implementation of those rules, validate it and provide us with certification stating that we have interpreted the requirements accurately.

We have our web-based client and had a mobile application, which was discontinued last year. We used those to train models and to check whether our server-side calculations and models were correct for those important parameters.

Can you tell a little bit more about valuable information that you gather and present to the podcast creators?

Jonas: It’s everything connected to the geographical distribution of the listeners, as well as questions such as:

How long do people listen to this episode?
When did they drop off?
Will they stick around to the end of the podcast?

There are a lot of things that we can do with the data and push it to the creators so that they can understand the audience better, and how the content is being consumed.
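As a rough illustration of the drop-off questions above, here is a minimal sketch that turns per-listener listening durations into a retention curve. It assumes the durations (in seconds) have already been estimated upstream, for example from delivered bytes; the function names, the one-minute buckets and the 90% completion threshold are hypothetical choices, not Acast’s definitions.

```python
def retention_curve(listened_seconds, episode_seconds, bucket_seconds=60):
    """Fraction of listeners who listened past the start of each time bucket."""
    if not listened_seconds:
        return []
    n_buckets = -(-episode_seconds // bucket_seconds)  # ceiling division
    total = len(listened_seconds)
    return [
        sum(1 for s in listened_seconds if s > i * bucket_seconds) / total
        for i in range(n_buckets)
    ]


def completion_rate(listened_seconds, episode_seconds, threshold=0.9):
    """Share of listeners who reached at least `threshold` of the episode."""
    if not listened_seconds:
        return 0.0
    needed = threshold * episode_seconds
    return sum(1 for s in listened_seconds if s >= needed) / len(listened_seconds)


# Five listeners of a three-minute episode.
durations = [30, 90, 150, 180, 180]
print(retention_curve(durations, 180))   # one value per minute of the episode
print(completion_rate(durations, 180))   # share who stuck around to the end
```

The largest step down between two neighbouring buckets shows where most listeners dropped off.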

What are the key metrics, dashboards and data sets that you look at very often as the CTO of the company?

Jonas: Obviously general metrics like how many shows are registering and how many shows are published. Then financial data: how the company is performing in different parts of the world. And also metrics related to the different initiatives that we’re working on.

We have recently released a self-serve ad-buying feature that lets advertisers sign up with Acast and buy ads directly on the website without involving any Acast staff along that journey. It’s a very new type of use case for us. We have typically been dealing with demand from larger buyers spending big budgets on bigger campaigns. So this year we’re looking at how many advertisers are coming, what types of advertisers are coming etc. It depends on the type of initiative we’re currently working on.

Acast is an ecosystem of different tools. Some of those tools come from companies acquired by Acast recently and those tools are related to improving podcasts using AI technologies. Could you tell us a little bit more about those tools and techniques?

Jonas: When we started working with GetInData in 2019, we had a centralized data infrastructure and decided it was time to start moving to a decentralized one. As we quickly gained critical mass in data processing, we had to decentralize our infrastructure and teams to bring more data competence into the products we were building, move teams closer to the product and value creation, and deliver more value to customers without relying on a central team to do the data work.

We made those decentralizing choices by moving first from Azure to on-premise, and then finally to AWS. Right now we’re trying to streamline how we operate. We’ve standardized some aspects of our work: for example, data is shared across teams through S3 in Parquet format, with Athena as the query engine on top and the Glue Catalog providing, so to say, a cross-account data mesh setup. We also acquired companies like Podchaser last year, which can be considered the IMDb of podcasting.

It’s easy right now to generate text with the help of AI, turn that text into a podcast, and even deepfake it. This could also be harmful for podcasting, because it’s hard to build a relationship with AI-created content. What makes podcasting sticky is that it’s created by humans. It’s harder to build that relationship with AI-generated content, even though it’s probably faster to create content with AI just by parsing text.

Do you see interesting trends in the podcasting industry that might leverage AI advancements in recent years?

Jonas: Because podcasts are based on old RSS technology, they haven’t seen much innovation recently, and that is holding the industry back. It’s hard to come up with an innovation that benefits the whole industry and gets adopted.

We had RAD (Remote Audio Data), which made callbacks to remote endpoints whenever the user hit markers in the audio, so that the podcast provider could register user behavior more easily. Right now, with more and more privacy regulations, companies are no longer allowed to do that. Podcasting had to innovate around those limitations and focus much more on the contextual aspect of what the user is listening to. There is a lot of room for improvement in analyzing the content itself, its semantics and its tone, and leveraging that instead of the attributes of the listening individual.

Do you assume that the number of podcasts is going to grow?

Jonas: I think that the number of podcasts will continue to grow, and there will be more AI-generated podcasts. There is something powerful in audio. As a parent, you want your kids to have less screen time. When you’re attached to a screen you can’t do anything else, whereas while listening to a podcast you can do many other things. For sure this medium is going to grow further in the future.

Rather than googling for something you’ll just tell ChatGPT: “Hey I’ve got a 25 minute run ahead of me, give me an episode about this interesting topic” and you will just start listening.

Do you think that AI could be a useful tool in improving the relationship between the content creator and the listener by applying translation? How many podcasts are now translated from local languages to other languages?

Jonas: I think automatic translation is doable but might not be as interesting as the original podcast; the feeling of the original creator may be lost in translation. Of course it has its use cases. It depends on the content: parts of the content may be localized, or the advertisements may be localized. It's probably a lot easier to reach with the localized commercial than in a native language.

Can you tell me a little bit more about the technologies that you use to create data analytics pipelines in your company?

Jonas: Currently everything runs on AWS, and we try to minimize the time spent on maintenance. We prefer managed services from the AWS ecosystem to reduce the cost of managing things ourselves. S3, Parquet and Athena form our data lake layer, which we share across all product teams. Most of the pipelines are written in Spark, running on EMR. Most teams have their own Airflow instance as a scheduler. We also use ECS and Lambdas. For the low-latency dashboarding stack we use Snowflake. That’s the high-level stack.
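To make the shared data lake layer a little more concrete, here is a minimal sketch of how a team might lay out a Parquet dataset in S3 using Hive-style `key=value` prefixes, a layout that Athena and the Glue Catalog can treat as partitions. The bucket name, dataset name and the year/month/day partitioning scheme are assumptions for illustration; the interview does not describe Acast’s actual layout.

```python
from datetime import date


def partition_prefix(bucket: str, dataset: str, day: date) -> str:
    """Build a Hive-style (key=value) S3 prefix for one day of a dataset."""
    return (
        f"s3://{bucket}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def parse_partition(prefix: str) -> dict:
    """Extract the key=value partition segments from an S3 prefix."""
    parts = {}
    for segment in prefix.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that contain "=" are partition keys
            parts[key] = value
    return parts


# Hypothetical bucket and dataset names.
prefix = partition_prefix("example-shared-data", "listens", date(2023, 10, 10))
print(prefix)
print(parse_partition(prefix))
```

With a predictable prefix like this, a consuming team can restrict a query to the days it needs instead of scanning the whole dataset.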

In our data pipelines we mainly use Python, TypeScript or JavaScript. For the high performance parts we’ve moved to Rust. Rust has started to get some traction in the data ecosystem recently which is also an interesting trend.

We tried to be cloud agnostic in the beginning but we quite rapidly concluded that it was too costly to maintain, so we ended up choosing our go-to cloud provider which was AWS.

Do you have any plans for 2023 that might change the technology stack that you’re using? Do you have some technologies on your radar?

Jonas: We pivoted away from a hyper-growth strategy into hitting profitability earlier this year. We’re trying to keep things up and running while reconsidering some choices regarding our stack. It’s more about finding more efficiency in what we currently do than about finding new technologies to grow.

More and more companies are questioning their cloud spending. Cloud is great when you’re using what you’re paying for. But very often we pay for things that we’re not using, so we are revisiting some of our expenses to reduce costs. The financial aspect is becoming more and more important.

What are the most challenging aspects of data specific projects in the podcasting industry?

Jonas: Coming back to what we mentioned earlier: big advancements in what data we collect and share in the industry need to be agreed upon and adopted. No single player can decide that on its own. That is a challenge, but it’s also a good thing for the industry. It prevents anyone from BigTech from just coming in, setting the rules and starting privacy-intrusive practices to spy on everyone across the entire web. That’s impossible in podcasting right now, and we don’t think it would be a good thing either. The good thing about Acast is that we don’t have to pivot away from those rules; we have a sustainable position without that. We can innovate even further in that space, which is what makes this industry so interesting.

I think that open platforms like Acast are the winners against the exclusive platforms for podcasts. We’re helping creators build an audience. We want to help podcasters make more money. With Acast your show can be consumed anywhere because we distribute that content everywhere (Spotify, Apple store etc.).

Another challenge is that there is no canonical truth that identifies a show in podcasting. Each platform has to solve for itself the question of which podcast is the “original” one. Podcasts exist on multiple platforms, but there is no true identifier that says the one on this platform is the original. If you move to another platform, there is no way to distinguish your podcast from a show replicating it elsewhere.

Subscribe to the Radio DaTa podcast to stay up-to-date with the latest technology trends and discover the most interesting data use cases!

Last updated: 10 October 2023

Written by

Piotr Tutak

Senior Software Engineer
