Radio DaTa Podcast
12 min read

Data & analytics at Acast, AI & trends in the podcasting industry

In this episode of the Radio DaTa Podcast, Adam Kawa talks with Jonas Björk from Acast. Topics covered include: analytics use cases implemented at Acast, the cloud-managed data tech stack at Acast, AI/ML in podcasting today and tomorrow, trends and innovations in the podcasting industry, and more.

We encourage you to listen to the whole podcast or, if you prefer reading, to go through the summary below.

Host: Adam Kawa, CEO of GetInData | Part of Xebia

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guest: Jonas Björk

Jonas Björk is the Chief Technology Officer at Acast. He has worked on data-related projects at several heavily data-focused companies, including Spotify, BizOne and Ericsson.

At Ericsson he worked in the machine learning department, building recommendation systems. He then joined a startup called BizOne, which offered a Business Intelligence as a Service platform. It was during this time, about 10 years ago, that Adam Kawa and Jonas met, and Jonas started exploring the Hadoop ecosystem, which led him to join Spotify. After several years he decided to join Acast to build a podcast service platform.

What is Acast? Why did Jonas decide to join Acast?

When Jonas joined Acast, the technical department consisted of about 15 people. Jonas was also very interested in podcasts, especially from an educational point of view, and listened to a lot of them, mostly from the technology space. For Jonas, podcasting is still one of the few permission-less publishing and permission-less consumption services. Even now, anyone can start a podcast and doesn’t need permission to do it. One doesn’t have to agree to the terms of service of BigTech companies - there is no gatekeeping. It’s very easy to get started and any voice can be heard by a larger audience. Anyone can subscribe to and listen to many podcasts and evaluate them according to their own standards.

Building the data platform, infrastructure and enabling people to listen to each other is something that keeps Jonas inspired and excited by his work.

Could you share some information about what type of data you collect and process, and what the business use cases for this type of data are?

Jonas: There are some aspects of this industry that resemble those at Spotify. In both cases it’s not trivial to define the financial/transactional aspect of the business. You have to define when a “listen” (the moment when you acknowledge that the podcast was “listened to”) should be counted as valid. Were there ads running in that “listen” or not? Which advertiser did those ads come from? You have to create sophisticated data pipelines with lots of computation in order to figure out who is making money and how much. At Spotify we referred to those pipelines as royalty pipelines; at Acast we call them calculation pipelines.

It’s a unique combination of the volume of data being processed and the source of that data, i.e. server logs, which drives the core of this business. Podcasts can be consumed through various types of clients and players, and because of that we don’t have access to client-side data; we have to restrict our processing to the server data. Every piece of information has to be extracted from the server logs:

  • Did the user listen to enough of a podcast so that it can be counted as a “listened” podcast? 
  • Did the user listen to the ads in the beginning? 
  • Did the user listen to the ads at the end? 

There are billions of requests coming in every month and that makes things both challenging and interesting at the same time.

The Interactive Advertising Bureau (IAB) sets the measurement guidelines for podcasting:

  • how to count a “listen”
  • how to count ad impressions, according to the podcast measurement specifications

At any time, an external auditor from the IAB can come and look at our implementation of those guidelines to validate it and provide us with the proper certification, which states that our interpretation of those requirements is accurate.
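To make the idea of such calculation pipelines a bit more concrete, here is a minimal, hypothetical sketch in PySpark (part of the stack Jonas describes later in the episode) of counting valid “listens” from server logs. The log schema, the 60-second threshold and the per-day deduplication are illustrative assumptions, not Acast’s actual IAB-certified logic.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("listen-counting-sketch").getOrCreate()

# Hypothetical server-log schema: one row per byte-range request.
# Assumed columns: episode_id, client_ip, user_agent, seconds_streamed, request_date
logs = spark.read.parquet("s3://example-bucket/server-logs/date=2023-10-01/")

listens = (
    logs
    # Assumption: a request stream counts as a "listen" only after a
    # minimum amount of the episode has been served (IAB-style threshold).
    .groupBy("episode_id", "client_ip", "user_agent", "request_date")
    .agg(F.sum("seconds_streamed").alias("total_seconds"))
    .where(F.col("total_seconds") >= 60)
    # Deduplicate: the same client counts at most once per episode per day.
    .select("episode_id", "client_ip", "user_agent", "request_date")
    .distinct()
)

daily_listens = listens.groupBy("episode_id", "request_date").count()
daily_listens.write.mode("overwrite").parquet("s3://example-bucket/listens/")
```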

We have our web-based client and a mobile application, which was discontinued last year. We used those to train models and to check whether our server-side calculations and models were correct when it comes to computing those important parameters.

Can you tell us a little bit more about the valuable information that you gather and present to podcast creators?

Jonas: It’s everything connected to the geographical distribution of the listeners and to how they listen, for example:

  • How long do people listen to this episode?
  • When did they drop off?
  • Will they stick around to the end of the podcast?

There is a lot that we can do with the data and push to the creators, so that they can better understand their audience and how the content is being consumed.

What are the key metrics, dashboards and data sets that you look at very often as the CTO of the company?

Jonas: Obviously general metrics like: how many shows are registering, how many shows are published? Financial data: how the company is performing in different parts of the world. And also metrics related to the different types of initiatives that we’re working on.

We have recently released a self-serve ad-buying feature that lets advertisers sign up with Acast and buy ads directly on the website, without involving any Acast staff along that journey. It’s a very new type of use case for us. We have typically been dealing with demand from larger buyers spending big advertising budgets on bigger campaigns. So this year we’re looking at how many advertisers are coming, what types of advertisers are coming, etc. It depends on the type of initiative we’re currently working on.

Acast is an ecosystem of different tools. Some of those tools come from companies recently acquired by Acast, and they are related to improving podcasts using AI technologies. Could you tell us a little bit more about those tools and techniques?

Jonas: When we started working with GetInData in 2019, we had a centralized data infrastructure and we decided that it was time to start moving to a decentralized one. Quite quickly, as we gained critical mass in data processing, we had to decentralize our infrastructure and teams in order to build more data competence into the products we were building, bring the teams closer to the product and value creation, and deliver more value to customers without relying on a central team to do the data work.

We made those decentralization moves by migrating first from Azure to on-premise, and then finally to AWS. Right now we’re trying to streamline how we operate. We have standardized some aspects of our work: data is shared across teams through S3 in Parquet format, with Athena as the query engine on top of it and the Glue Catalog providing what you could call a cross-account data mesh setup. We also acquired companies like Podchaser last year, which can be considered the IMDb of podcasting.
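As a rough illustration of this kind of setup, the sketch below registers a Parquet dataset shared through S3 as an external table and queries it with Athena via boto3. All the names here (bucket, database, table, columns) are hypothetical, and the cross-account part of a real data mesh (bucket policies, Lake Formation grants) is omitted.

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Hypothetical external table over a Parquet dataset shared through S3.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.daily_listens (
    episode_id string,
    request_date date,
    listens bigint
)
STORED AS PARQUET
LOCATION 's3://example-shared-bucket/listens/'
"""

query = """
SELECT episode_id, SUM(listens) AS total_listens
FROM analytics.daily_listens
WHERE request_date >= DATE '2023-09-01'
GROUP BY episode_id
ORDER BY total_listens DESC
LIMIT 10
"""

for sql in (ddl, query):
    # In practice you would poll get_query_execution() until each
    # statement finishes before submitting the next one.
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},  # Glue Catalog database
        ResultConfiguration={
            "OutputLocation": "s3://example-shared-bucket/athena-results/"
        },
    )
```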

It’s easy to generate text right now and to create a podcast out of that text with the help of AI, even deepfaking voices. This could also be harmful for podcasting, because it’s hard to build a relationship with AI-created content. What makes podcasting sticky is that it’s created by humans. It’s harder to achieve that relationship with AI-generated content, but it’s certainly faster to create content with AI technology just by synthesizing speech from text.

Do you see interesting trends in the podcasting industry that might leverage AI advancements in recent years?

Jonas: Because podcasts are based on the old RSS technology, they haven’t seen much innovation recently, and it’s something that’s holding this industry back. It’s hard to innovate in a way that benefits the whole industry and gets adopted.

We had RAD (Remote Audio Data), which made callbacks to remote endpoints whenever the user hit markers in the audio, so that the podcast provider could register user behavior more easily. Right now, with more and more privacy regulations, companies are no longer allowed to do that. Podcasting has had to innovate around those limitations and focus much more on the contextual aspect of what the user is listening to. There is a big field of improvement when it comes to analyzing the content itself - the semantics and the tone of the content - and leveraging that instead of the attributes of the listening individual.
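For readers unfamiliar with RAD, the idea can be sketched as a player firing an event to a tracking endpoint whenever playback crosses a marker embedded in the audio metadata. The payload fields and endpoint below are illustrative only, not the exact RAD specification.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical RAD-style callback: the player reports that playback
# crossed a marker embedded in the episode's audio metadata.
event = {
    "podcast_id": "example-show",
    "episode_id": "example-episode-42",
    "marker": "midroll_ad_start",   # marker id from the audio metadata
    "position_seconds": 913,
    "session_id": "anonymous-session-token",
}

req = Request(
    "https://tracking.example.com/rad/events",  # illustrative endpoint
    data=json.dumps({"events": [event]}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urlopen(req)  # fire-and-forget in a real player; error handling omitted
```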

Do you assume that the number of podcasts is going to grow?

Jonas: I think that the number of podcasts will continue to grow, and there will be more AI-generated podcasts. There is something powerful in audio. As a parent, you want your kids to have less screen time: when you’re attached to a screen you cannot do anything else, while when you’re listening to a podcast you can do many other activities. This medium is certainly going to grow further in the future.

Rather than googling for something you’ll just tell ChatGPT: “Hey I’ve got a 25 minute run ahead of me, give me an episode about this interesting topic” and you will just start listening.

Do you think that AI could be a useful tool in improving the relationship between the content creator and the listener by applying translation? How many podcasts are now translated from local languages to other languages?

Jonas: I think automatic translation is doable, but the result might not be as interesting as the original podcast; the feeling of the original creator may be lost in the process of translation, but of course it has its use cases. It depends on the content: parts of the content may be localized, or the advertisements may be localized. It’s probably a lot easier to reach listeners with a commercial localized into their native language.

Can you tell me a little bit more about the technologies that you use to create data analytics pipelines in your company?

Jonas: Currently everything runs on AWS and we try to minimize the time spent on maintenance. We prefer to pick managed services from the AWS ecosystem to reduce the cost of managing things ourselves. S3, Parquet and Athena form our data lake layer, which we share across all product teams. Most of the pipelines are written in Spark, running on EMR. Most teams have their own Airflow instance as a scheduler. We also use ECS and Lambdas. For the low-latency dashboarding stack we use Snowflake. That’s the high-level stack.
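To illustrate how these pieces could fit together, here is a minimal, hypothetical Airflow DAG that submits a Spark step to an existing EMR cluster and waits for it to finish. The cluster ID, script location and schedule are made up; a real pipeline would add retries, alerting and data-quality checks.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Hypothetical Spark step: run a daily listen-counting job on EMR.
SPARK_STEP = [{
    "Name": "daily-listen-counting",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/jobs/listen_counting.py"],
    },
}]

with DAG(
    dag_id="listen_counting_daily",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="submit_spark_job",
        job_flow_id="j-EXAMPLECLUSTER",  # existing EMR cluster (illustrative)
        steps=SPARK_STEP,
    )

    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_job",
        job_flow_id="j-EXAMPLECLUSTER",
        # The operator pushes the list of submitted step ids to XCom.
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_job')[0] }}",
    )

    add_step >> wait_for_step
```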

In our data pipelines we mainly use Python, TypeScript or JavaScript. For the high-performance parts we’ve moved to Rust. Rust has started to gain traction in the data ecosystem recently, which is also an interesting trend.

We tried to be cloud agnostic in the beginning but we quite rapidly concluded that it was too costly to maintain, so we ended up choosing our go-to cloud provider which was AWS.

Do you have any plans for 2023 that might change the technology stack that you’re using? Do you have some technologies on your radar?

Jonas: We pivoted away from a hyper-growth strategy towards hitting profitability earlier this year. We’re trying to keep things up and running while reconsidering some choices regarding our stack; it’s more about finding more efficiency in what we currently do than about finding new technologies to grow.

More and more companies are questioning their cloud spending. The cloud is great when you’re actually using what you’re paying for, but we very often pay for things that we’re not using, so we’re revisiting some of our expenses to reduce costs. The financial aspect is becoming more and more important.

What are the most challenging aspects of data specific projects in the podcasting industry?

Jonas: Coming back to what we mentioned earlier: big advancements in terms of what data is collected and shared in the industry need to be agreed upon and adopted broadly. There is no single player that can decide that alone. That is a challenge, but it’s also a good thing for the industry. It prevents anyone from BigTech from just coming in, setting the rules and starting privacy-intrusive actions to spy on everyone across the entire web. That’s impossible in podcasting right now, and we don’t think it would be a good thing either. The good thing about Acast is that we don’t have to pivot away from those rules; we have a sustainable position without that. We can innovate even further in that space, which makes this industry super interesting.

I think that open platforms like Acast are the winners against exclusive podcast platforms. We’re helping the creators build their audience, and we want to help podcasters make more money. With Acast, your show can be consumed anywhere, because we distribute the content everywhere (Spotify, Apple Podcasts, etc.).

Another challenge is that there is no canonical truth that identifies a show in podcasting. Every platform has to solve this for itself: podcasts exist on multiple platforms, but there is no true identifier that says which copy is the original one. If you move to another platform, there is no way to distinguish your podcast from a replica of the show on some other platform.
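As an illustration of the problem, a platform could approximate show identity by fingerprinting normalized feed metadata, as in the hypothetical sketch below; the chosen fields and normalization are assumptions for illustration. (Standardization efforts such as the Podcasting 2.0 <podcast:guid> tag aim to solve this at the industry level.)

```python
import hashlib
import re

def show_fingerprint(title: str, author: str, feed_url: str) -> str:
    """Illustrative fingerprint for matching likely-duplicate shows
    across platforms. Not a real industry identifier."""
    def norm(value: str) -> str:
        return re.sub(r"[^a-z0-9]", "", value.lower())

    # Feed URLs change when shows move hosts, so ignore scheme and host
    # and keep only the path component.
    path = feed_url.split("://", 1)[-1].split("/", 1)[-1]
    material = "|".join([norm(title), norm(author), norm(path)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]

# Two listings of (presumably) the same show on different platforms:
a = show_fingerprint("Radio DaTa", "GetInData", "https://hostA.example/feeds/radio-data")
b = show_fingerprint("Radio DaTa", "GetInData", "https://hostB.example/feeds/radio-data")
print(a == b)  # True: matching title/author/path suggests the same show
```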

You can listen to the whole episode here: 

Subscribe to the Radio DaTa podcast to stay up-to-date with the latest technology trends and discover the most interesting data use cases!

10 October 2023
