Remember our whitepaper “Guide to Recommendation Systems. Implementation of Machine Learning in Business” from the middle of last year? Our data scientist, Michal Stawikowski, did an excellent job of giving you a cross-sectional overview of the issues related to recommender systems. The paper analyzed the topic from the business side and dived into the technical details. It also presented an example of a four-step recommender system, in which the results are retrieved, filtered, scanned and sorted in successive steps. You can also find out what QuickStart ML Blueprints are and how they can help data scientists and engineers with building recommendation systems. Download the white paper here.
Today I would like to focus on a specific issue, namely news recommendation. With the development of artificial intelligence, new solutions based, for example, on GPT-4 or diffusion models have started to appear in recent months, with the aim of improving the effectiveness of recommendation engines. However, solutions based on slightly older techniques such as TF-IDF, word2vec or Bag-of-Words are still leading the way.
As a recap, below is a breakdown of the most important approaches to building recommendation engines.
To create a news recommendation engine, we can actually use any of the above approaches, depending on what our business objective and technological capabilities are. However, the news area is characterized by a particular sensitivity to the context of the news.
Traditional recommendation systems recommend articles according to how similar they are to articles in which the user was previously interested. Typically, similarity is measured as the distance between two pieces of text: a small distance indicates high similarity, while a large distance indicates low similarity. However, people's preferences depend on several factors, including context and recent social media trends. For example, a text about the latest transfers of one football club may not be of interest to a fan of another team, and such a news item may become instantly irrelevant if the transfer does not materialise after all.
It is important to remember that news recommendation systems face particular challenges because articles change quickly, data about readers is limited, and the relevance of articles is highly context-dependent. As a result, there is growing interest in creating personalised news recommendation systems that can provide users with articles that match their preferences and interests.
One approach to creating such systems is to use contextual information. Users' reading preferences and habits can vary depending on their location, the time of day and other factors. Given contextual information, news recommendation systems can personalise recommendations for each user, taking into account their current situation. Context and trends can be captured in several ways, such as analysing the content of articles that users click on, tracking users' social media activity, using collaborative filtering to identify similar users based on their clicking behaviour, and using contextual information such as time of day, location, device and user profile.
Below you can find a classification of features used for news recommendation systems:
Taking these issues into account, the target solution should be to build a hybrid model, which takes into account both content and user behaviour and preferences.
A key element in building methods for personalized news recommendations is news modeling. In this step, it is necessary to understand the content and capture the individual characteristics of the article. A large number of approaches can be used for this purpose, which we can divide into two main groups: feature-based methods and deep learning-based methods.
Feature-based methods use features prepared by the data scientist to represent news articles. These features are designed to capture different aspects of news content and context. In many collaborative filtering-based methods, articles are represented by news IDs. However, this approach can suffer from a 'cold start' problem: new articles are constantly being published and old articles quickly disappear, so the training set covers only a limited share of news identifiers. Because ID-based news modeling has these limitations, additional techniques are often used to describe news content statistically. One of these is Term Frequency-Inverse Document Frequency (TF-IDF), which extracts features from news texts. Other content features are also common, for example topics extracted from news titles, summaries and main content with techniques such as Latent Dirichlet Allocation (LDA). In addition, factors such as news popularity, frequency, sentiment and bias can be used in the model to improve news representation.
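To make the TF-IDF step more concrete, here is a minimal sketch using only the Python standard library. It uses the plain textbook weighting (term frequency times log of inverse document frequency); production libraries usually apply smoothing and normalisation on top of this. The function names and the sample headlines are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real system would also strip punctuation and stop words."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / document frequency), the plain textbook variant
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical headlines, invented for illustration only.
headlines = [
    "club signs new striker",
    "club confirms new stadium",
    "election results announced today",
]
vectors = tf_idf(headlines)
```

In this sketch, a term that occurs in fewer documents (such as "striker") receives a higher weight than one shared by several documents (such as "club"), and a term present in every document gets a weight of zero.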
On the other hand, deep learning-based methods use neural network models to automatically learn article representations from raw input data, such as news texts. In this case, we can largely skip the manual feature preparation step. These methods compete with the feature-based approach described above and can often capture the information and context of news articles more effectively, because they learn latent patterns directly from the raw input. For example, some methods use autoencoders, knowledge-aware convolutional neural networks (CNNs), multi-head self-attention networks and pre-trained language models (PLMs) to encode news text. Deep learning-based methods can also include news attributes, such as specific topics or concepts, in their analysis of news articles. In this way, they aim to gain a deeper understanding of the knowledge and common themes contained in the articles.
The next step in building a recommender system is user modeling. During this phase, it is important to understand the interests and preferences of users. This involves constructing user profiles based on a set of characteristics extracted from the news articles they clicked. Again, as with news modeling, methods can be broadly divided into feature-based and deep learning-based.
The first approach, feature-based user modeling, involves creating user profiles from a set of features built from historical user behavior, including clicked articles. These methods also use additional user characteristics, such as demographics (e.g. age, gender and occupation), user location, access patterns and user tags or keywords. In some cases, it may be possible to take into account user behavior on other platforms, such as social media and e-commerce platforms, to get additional information about user interests. However, this type of approach usually requires considerable expertise in feature design and validation, as well as access to a wide range of data, preferably of good quality.
On the other hand, user modeling methods based on deep learning aim to learn representations of users from their behavior, without the need for manual feature engineering. These methods infer user interests from click behavior, which is an implicit indicator of a user's interest in news articles. However, this data can be noisy and may not always accurately indicate a user's actual interests. To address this, many methods incorporate other types of information, such as user IDs, contextual features (e.g. user devices and locations) and several types of user feedback on the news platform, which adds engagement signals to the model. These methods can automatically learn deep representations of user interests for personalized news recommendations, which are typically more accurate than manually created user interest features.
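As a rough illustration of feature-based user modeling, the sketch below builds a profile from a click history plus optional static attributes. The article schema (a dict with a "topic" key), the profile fields and the sample data are all assumptions made for this example, not a prescribed format.

```python
from collections import Counter


def build_user_profile(clicked_articles, demographics=None):
    """Combine click-derived topic shares with optional static attributes.

    `clicked_articles` is a list of dicts with a "topic" key (hypothetical schema).
    """
    topic_counts = Counter(article["topic"] for article in clicked_articles)
    total_clicks = sum(topic_counts.values())

    # Share of clicks per topic, so profiles are comparable across heavy and light readers.
    if total_clicks:
        topic_shares = {
            topic: count / total_clicks for topic, count in topic_counts.items()
        }
    else:
        topic_shares = {}

    return {
        "topic_shares": topic_shares,
        "n_clicks": total_clicks,
        "demographics": dict(demographics or {}),
    }


# Hypothetical click history and attributes, invented for illustration.
clicks = [
    {"id": "a1", "topic": "football"},
    {"id": "a2", "topic": "football"},
    {"id": "a3", "topic": "politics"},
    {"id": "a4", "topic": "football"},
]
profile = build_user_profile(clicks, {"location": "Warsaw"})
```

A user with no clicks yet gets an empty `topic_shares` dict, which is one simple way to make the cold-start case explicit for downstream ranking code.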
Once the characteristics of news stories and users have been modeled, the next step is to create a ranking of candidate news stories based on their relevance to the user's interests. This is a key step in personalized news recommendation, as it aims to present users with the most relevant and engaging articles.
Relevance-based methods typically rank candidate articles based on their personalized match to the user's interests. The main problem with these methods is accurately measuring the relevance between candidate news items and the user's interests. Many techniques directly assess the relevance between the user and the news items, based on the similarity of their final representations. For example, some methods calculate the cosine similarity between user and news feature vectors, built for instance with CF-IDF (Concept Frequency-Inverse Document Frequency), to measure their relevance. Other methods use similarities between vectors of news topics and user interests to determine relevance. One of the challenges of personalized relevance-based ranking is the 'filter bubble' problem: recommending articles that are similar to those the user clicked on previously can limit diversity. To address this, strategies can be used to recommend articles that are slightly different from those clicked on previously, introducing variety and randomness.
Unlike relevance-based methods, ranking methods based on reinforcement learning aim to optimize the total reward in the long term. These methods explore potential user interests and aim to improve long-term user experience and engagement. Through exploration, they can increase the diversity of recommendation results and discover potential user interests.
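The cosine-similarity ranking described above can be sketched in a few lines of standard-library Python. Vectors are stored as sparse `{feature: weight}` dicts; the feature names, weights and article IDs are hypothetical, and a real system would obtain the vectors from its own news and user models.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_u * norm_v)


def rank_candidates(user_vector, candidates):
    """Return (article_id, score) pairs sorted from most to least relevant."""
    scored = [
        (article_id, cosine_similarity(user_vector, vector))
        for article_id, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical user and article vectors, invented for illustration.
user = {"football": 0.8, "transfers": 0.6}
candidates = {
    "transfer-rumours": {"football": 0.7, "transfers": 0.7},
    "match-report": {"football": 1.0},
    "budget-vote": {"politics": 1.0},
}
ranking = rank_candidates(user, candidates)
```

With these sample vectors, the article that overlaps with both of the user's interests ranks first and the unrelated politics article ranks last with a score of zero. A ranking like this, used on its own, is exactly what tends to produce the filter bubble, which is why a diversification step is often added afterwards.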
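Full reinforcement learning rankers are beyond the scope of a short example, but the exploration idea can be illustrated with an epsilon-greedy bandit, one of the simplest techniques in this family. The sketch below is a deliberately simplified stand-in, not the method of any particular system: the class name, the topic-level granularity and the simulated click probabilities are all assumptions made for illustration.

```python
import random


class EpsilonGreedyRecommender:
    """Epsilon-greedy bandit over topics (a simple stand-in for RL-based ranking)."""

    def __init__(self, topics, epsilon=0.1, seed=None):
        self.topics = list(topics)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {topic: 0 for topic in self.topics}
        self.clicks = {topic: 0 for topic in self.topics}

    def estimated_ctr(self, topic):
        """Observed click-through rate; 0.0 for topics never shown."""
        shown = self.shows[topic]
        return self.clicks[topic] / shown if shown else 0.0

    def select(self):
        # Explore: with probability epsilon, try a random topic.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.topics)
        # Exploit: otherwise pick the topic with the best observed click rate.
        return max(self.topics, key=self.estimated_ctr)

    def update(self, topic, clicked):
        """Record one impression and whether it was clicked (the reward)."""
        self.shows[topic] += 1
        self.clicks[topic] += int(clicked)


# Simulated feedback loop with hypothetical click probabilities (illustration only).
true_ctr = {"sport": 0.10, "politics": 0.05, "culture": 0.30}
recommender = EpsilonGreedyRecommender(true_ctr, epsilon=0.1, seed=42)
feedback_rng = random.Random(0)
for _ in range(5000):
    topic = recommender.select()
    recommender.update(topic, feedback_rng.random() < true_ctr[topic])
```

The `epsilon` parameter controls the trade-off: a higher value shows more varied topics at the cost of short-term clicks, while `epsilon=0` reduces the recommender to pure exploitation of what it has already observed.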
In comparison to recommendation systems in other domains such as movie recommendations, news recommendation engines face unique challenges due to the dynamic and time-sensitive nature of news content. While both types of recommendation systems leverage various techniques like collaborative filtering and content-based filtering, news recommendation engines must also contend with the scarcity of user data and the need for real-time adaptation to evolving news trends. Despite these differences, the overarching goal of personalized recommendation systems remains consistent: to provide users with relevant and engaging content tailored to their preferences and interests.
If you are seeking support to delve deeper into news recommendation system solutions, do not hesitate to take advantage of our experts' free consultation offers.