Combining Kedro and Streamlit to build a simple LLM-based Reading Assistant

Generative AI and Large Language Models are taking applied machine learning by storm, and there is no sign of this trend weakening. While it is important to remember that this stream is not likely to wipe out other branches of machine learning and there are still many things to be careful about when applying LLMs (check my previous blog post for some examples), the use of these models is unavoidable in some areas. But to leverage them really efficiently, some pieces must be put together in a proper way. In particular, the use of powerful language understanding capabilities has to be backed up by a clean and well-organized implementation and a pleasant user experience. With a very simple example, we will demonstrate how to achieve these 3 essential goals using commercial LLM APIs, Kedro and Streamlit respectively.

The idea of a Reading Assistant

Imagine that you have to read a very technical document without being an expert in the field. The document surely contains a lot of domain-specific wording, difficult-to-understand terms and possibly also many outside references. Reading such a document can be quite a pain; you spend more time looking for explanations on the Internet than on the document itself. Nowadays, having the power of Large Language Models at your fingertips, you can make this process a lot faster and easier. LLMs are pretrained on vast amounts of text from different domains and encode all this broad knowledge in their parameters, also allowing for seamless human-machine interaction using plain, natural language. In some cases, when pretrained knowledge is not enough, there is the possibility of adapting a model to a specific domain, instruction-tuning it, or performing other forms of finetuning to make it even more useful. This, however, is quite a complex and tricky topic, and we will not focus on it in this article, although it is under heavy research in our GetInData Advanced Analytics Labs.

So what exactly is the idea behind the LLM Reading Assistant? It is as simple as this:

You upload the document that you want to read into the web-based app and start reading
When you encounter any incomprehensible term or hard to understand portion of text, you can select it and ask the LLM to either explain or summarize it
An appropriate prompt will be constructed under the hood, sent via an API and the answer will be returned and printed
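The prompt construction step mentioned above can be sketched in a few lines of Python. This is only an illustration: the template wording, the PROMPT_TEMPLATES dictionary and the build_prompt function are assumptions made for this example and are not taken from the project code.

```python
# Hypothetical prompt templates, one per mode of operation.
# The wording is an assumption made for illustration only.
PROMPT_TEMPLATES = {
    "explain": (
        "Explain the following term or passage in plain language "
        "for a non-expert reader:\n\n{text}"
    ),
    "summarize": "Summarize the following passage in a few sentences:\n\n{text}",
}


def build_prompt(mode: str, input_text: str) -> str:
    """Return the prompt for the given mode, rejecting unknown modes and empty input."""
    if mode not in PROMPT_TEMPLATES:
        raise ValueError(f"Unsupported mode: {mode!r}")
    text = input_text.strip()
    if not text:
        raise ValueError("Input text must not be empty")
    return PROMPT_TEMPLATES[mode].format(text=text)


print(build_prompt("explain", "force majeure"))
```

Keeping the templates in a dictionary keyed by mode means that a new mode could be added with a single entry, without touching the function itself.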

The usefulness of this kind of tool can be proven in large organizations, where people with different roles (management, business, technical, legal etc.) have to deal with domain-specific documents, and the efficiency of processing them is key. As examples, we can think of:

legal acts
scientific papers
medical documentation
financial reports

The solution presented in this article is just a very simple PoC that presents the idea of a Reading Assistant and shows how easily you can build quite a functional application by combining the Kedro and Streamlit frameworks with commercial Large Language Models. To reforge it into a full-scale, production-grade tool, some important developments would be required, e.g.:

a more advanced user interface, allowing for a better user experience using context menus instead of manual copy-paste operations,
possibly a chat window to be able to extend communication with the model beyond simple explain/summarize queries,
an option to use large-context models and in-document search in addition to just relying on pretrained model knowledge,
comprehensive load and functional tests,
optional use of open-source, self-deployed models, finetuned on domain specific corpora.

Nevertheless, such a demo is always a good start, so let’s dive in to see how it works.

Implementation using Kedro, Streamlit and LLM APIs

The code of the application described here is publicly available as one of the QuickStart ML Blueprints which are a set of various ML solution examples, built with a modern open-source stack and according to the best data science development practices. You can find the project and its documentation here. Feel free to run and experiment with it, and also explore other blueprints that include classification/regression models, recommendation systems and time series forecasting etc.

Kedro users will surely notice that from this framework’s perspective, the presented solution is very much trimmed down compared to standard Kedro use cases. It consists of only one pipeline (run_assistant) that contains just a single node (complete_request). Since all input to the pipeline is passed via parameters (some of them in the standard way via Kedro conf files, the others via the Streamlit app, which will be explained later) and the only output is the LLM’s response that needs to be printed for the user, the project doesn’t use a data catalog. In this simple PoC there was also no need for MLflow logging; only the local logger was used for debugging purposes. One Kedro feature that is still very helpful is the pipeline configuration mechanism. It turns out that in such a special use case, seemingly not very much aligned with the usual Kedro way of working, it allows for flexible and efficient integration with the additional user interface layer formed by the Streamlit app.

On top of the Kedro run_assistant pipeline, there is another Python script run_app, that - not surprisingly - defines and runs the Streamlit application. In more detail, it serves the following purposes:

Displays an uploaded PDF file for reading
Handles additional input parameters that are not passed to the Kedro pipeline via Kedro conf files. These parameters are: the type of LLM API to be used, the LLM model, the mode of operation (either explanation or summarization, chosen as basic demo functionalities that can be extended with others) and the main input, which is the term or text to be explained or summarized
Triggers a run of the Kedro pipeline, which:
- collects input parameters (including remaining, technical parameters passed in the traditional Kedro way, needed to construct the prompt and get the response),
- sends the request via the selected API (currently supported are native OpenAI API, Azure OpenAI Service and VertexAI Google PaLM API) and retrieves the response
Prints the answer under the document
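The request step in the list above could be organized as a simple dispatch on the selected API. The sketch below is an assumption about how such a node might look, not the project's actual code: only the name complete_request and the three API options come from this article, while the function signature, the BACKENDS mapping and the echo backends are hypothetical stand-ins that make no network calls.

```python
from typing import Callable, Dict

# A backend takes (model, prompt) and returns the model's text response.
Backend = Callable[[str, str], str]


def _echo_backend(name: str) -> Backend:
    """Build a stand-in backend that only echoes its inputs (no network call)."""

    def call(model: str, prompt: str) -> str:
        return f"[{name}/{model}] {prompt}"

    return call


# Keys match the options offered in the app; the callables are placeholders
# that a real project would replace with calls to the respective SDKs.
BACKENDS: Dict[str, Backend] = {
    "OpenAI": _echo_backend("openai"),
    "Azure OpenAI": _echo_backend("azure"),
    "VertexAI PaLM": _echo_backend("vertexai"),
}


def complete_request(api: str, model: str, prompt: str) -> str:
    """Send the prompt through the selected API and return the answer."""
    try:
        backend = BACKENDS[api]
    except KeyError:
        raise ValueError(f"Unsupported API: {api!r}") from None
    return backend(model, prompt)


print(complete_request("OpenAI", "some-model", "Explain: force majeure"))
```

With this layout, supporting another provider would only require registering one more entry in the mapping.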

The interesting thing in this setup is the coupling between Streamlit and the Kedro pipeline. Kedro has its own set of parameters stored in the conf directory. By default, there are two subfolders there: base and local (you can also define other sets of parameters and use them as different environments). The first one is a set of default parameters for a baseline Kedro run and is stored in Git. The other one is stored only locally. You can use it to store parameters that are specific to your own environment and should not be shared. It is also a good place to put something temporary that should override your base configuration without modifying the base files. This makes parameters.yml in the local subdirectory a perfect place to use as the connection between the parameters entered in the Streamlit interface and the Kedro pipeline. In the Reading Assistant, it works as follows:

First, the KedroSession object is initialized, to be able to run the Kedro pipeline by providing its name in the run_app Python script:

import os

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Register the project located in the current working directory, then open a session
bootstrap_project(os.getcwd())
session = KedroSession.create()

Some Streamlit widgets are defined and their values are assigned to variables:

api = st.selectbox("Select an API:", ["OpenAI", "VertexAI PaLM", "Azure OpenAI"])
model = st.selectbox("Choose LLM:", model_choices)
mode = st.selectbox("Choose mode:", ["explain", "summarize"])
input_text = st.text_area("Paste term or text:", value="", height=200)

Each time input fields are updated, new parameter values are dumped to the local parameters file:

with open("./conf/local/parameters.yml", "w") as f:
    yaml.dump(
        {"api": api, "model": model, "mode": mode, "input_text": input_text}, f
    )

After clicking the “Get Answer!” button, the Kedro pipeline is triggered. It collects all the parameters: those from the base config, but also those from the local config that the Streamlit app keeps updating:

if st.button("Get Answer!"):
    # Run Kedro pipeline to use LLM
    answer = session.run("run_assistant_pipeline")["answer"]
else:
    answer = "Paste input text and click [Get Answer!] button"

Each time the button is clicked, the Kedro pipeline is rerun, possibly with new parameter values if they were updated in the meantime.
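The walkthrough above relies on values from the local config taking precedence over those from the base config. The snippet below is a deliberately simplified illustration of that precedence using plain dictionaries. It does not use Kedro's own config loader, which does considerably more, and the merge_parameters function as well as all parameter names and values are hypothetical.

```python
from typing import Any, Dict


def merge_parameters(base: Dict[str, Any], local: Dict[str, Any]) -> Dict[str, Any]:
    """Combine two parameter sets; keys present in `local` replace those in `base`."""
    merged = dict(base)  # copy, so the base parameters stay untouched
    merged.update(local)
    return merged


# Hypothetical values: technical defaults kept in the base config ...
base_params = {"max_tokens": 256, "temperature": 0.0, "mode": "explain"}
# ... and the values most recently written to the local config by the app.
local_params = {"api": "OpenAI", "mode": "summarize", "input_text": "Some passage"}

print(merge_parameters(base_params, local_params))
```

Here the merged set keeps the technical defaults from the base parameters, while "mode" takes the value most recently written by the app.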

Summary

And that’s it! This demonstrates a very simple yet effective way of managing, parameterizing and running Kedro pipelines via a Streamlit application. Of course, the example is very simple, but you can imagine more complex setups with multiple Kedro pipelines that use more Kedro features. In those scenarios, the Kedro project structure and its well-organized pipelining framework would bring even more benefit, while you would still leverage the ease of building Streamlit applications. Nevertheless, the coupling between the two would remain as simple as above.

If you are interested in other applications of LLMs and potential issues during implementation, check out our other blog posts and keep up with the new ones that are published, especially the one about the Shopping Assistant: an e-commerce conversational tool that provides search and recommendation capabilities using a natural language interface.

Do you have any questions? Feel free to sign up for a free consultation!

Last updated: 12 September 2023

Written by

Piotr Chaberski

Senior Data Scientist
