Combining Kedro and Streamlit to build a simple LLM-based Reading Assistant

Generative AI and Large Language Models are taking applied machine learning by storm, and there is no sign of this trend weakening. While it is important to remember that this stream is not likely to wipe out other branches of machine learning and there are still many things to be careful about when applying LLMs (check my previous blog post for some examples), the use of these models is unavoidable in some areas. But to leverage them really efficiently, some pieces must be put together in a proper way. In particular, the use of powerful language understanding capabilities has to be backed up by a clean and well-organized implementation and a pleasant user experience. With a very simple example, we will demonstrate how to achieve these 3 essential goals using commercial LLM APIs, Kedro and Streamlit respectively.

The idea of a Reading Assistant

Imagine that you have to read a very technical document without being an expert in the field. The document surely contains a lot of domain-specific wording, difficult-to-understand terms and possibly also many outside references. Reading such a document can be quite a pain; you spend more time looking for explanations on the Internet than on the document itself. Nowadays, with the power of Large Language Models at your fingertips, you can make this process a lot faster and easier. LLMs are pretrained on vast amounts of text from different domains and encode all this broad knowledge in their parameters, while also allowing for seamless human-machine interaction in plain, natural language. In some cases, when pretrained knowledge is not enough, it is possible to adapt a model to a specific domain, apply instruction tuning or perform other forms of finetuning to make it even more useful. This is a quite complex and tricky topic that we will not focus on in this article, although it is under heavy research in our GetInData Advanced Analytics Labs.

So what exactly is the idea behind the LLM Reading Assistant? It is as simple as this:

  • You upload the document that you want to read into the web-based app and start reading
  • When you encounter an incomprehensible term or a hard-to-understand portion of text, you can select it and ask the LLM to either explain or summarize it
  • An appropriate prompt is constructed under the hood and sent via an API, and the answer is returned and printed
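
The prompt construction step can be pictured with a minimal sketch like the one below. The template wording and the build_prompt helper are assumptions made for illustration; the article does not show the prompts used in the project.

```python
# Hypothetical prompt templates, one per mode; the project's actual wording may differ.
PROMPT_TEMPLATES = {
    "explain": (
        "Explain the following term or passage in plain language "
        "for a non-expert:\n\n{input_text}"
    ),
    "summarize": "Summarize the following passage in a few sentences:\n\n{input_text}",
}


def build_prompt(mode: str, input_text: str) -> str:
    """Return the prompt for the chosen mode; reject unknown modes and empty input."""
    if mode not in PROMPT_TEMPLATES:
        raise ValueError(f"Unsupported mode: {mode!r}")
    text = input_text.strip()
    if not text:
        raise ValueError("input_text must not be empty")
    return PROMPT_TEMPLATES[mode].format(input_text=text)


prompt = build_prompt("explain", "  heteroscedasticity  ")
```

Keeping the templates in a dictionary keyed by mode means that adding another mode later only requires a new entry, not a change to the function.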

The usefulness of this kind of tool can be proven in large organizations, where people with different roles (management, business, technical, legal etc.) have to deal with domain-specific documents, and the efficiency of processing them is key. As examples, we can think of:

  • legal acts
  • scientific papers
  • medical documentation
  • financial reports

The solution presented in this article is just a very simple PoC that presents the idea of a Reading Assistant and also shows how you can easily build a quite functional application using a combination of Kedro and Streamlit frameworks, backed up by commercial Large Language Models. To reforge it into a full-scale, production-grade tool, some important developments would be required, e.g.:

  • a more advanced user interface, allowing for a better user experience using context menus instead of manual copy-paste operations,
  • possibly a chat window to be able to extend communication with the model beyond simple explain/summarize queries,
  • an option to use large-context models and in-document search in addition to just relying on pretrained model knowledge,
  • comprehensive load and functional tests,
  • optional use of open-source, self-deployed models, finetuned on domain specific corpora. 

Nevertheless, such a demo is always a good start, so let’s dive in to see how it works.

Implementation using Kedro, Streamlit and LLM APIs

The code of the application described here is publicly available as one of the QuickStart ML Blueprints which are a set of various ML solution examples, built with a modern open-source stack and according to the best data science development practices. You can find the project and its documentation here. Feel free to run and experiment with it, and also explore other blueprints that include classification/regression models, recommendation systems and time series forecasting etc.

Kedro users will surely notice that from this framework’s perspective, the presented solution is very much trimmed down compared to standard Kedro use cases. It consists of only one pipeline (run_assistant) that contains just a single node (complete_request). Since all input to the pipeline is passed via parameters (some of them in the standard way via Kedro conf files, the others via the Streamlit app, which will be explained later) and the only output is the LLM’s response that needs to be printed for the user, the project doesn’t use a data catalog. In this simple PoC there was also no need for MLflow logging; only the local logger was used for debugging purposes. One Kedro feature that is still very helpful is the pipeline configuration mechanism. It turns out that in such a special use case, seemingly not very much aligned with the usual Kedro way of working, it allows for a flexible and efficient integration with the additional user interface layer formed by the Streamlit app.

On top of the Kedro run_assistant pipeline, there is another Python script, run_app, which defines and runs the Streamlit application. In more detail, it serves the following purposes:

  • Displays an uploaded PDF file for reading
  • Handles the input parameters that are not passed to the Kedro pipeline via Kedro conf files: the type of LLM API to be used, the LLM model, the mode of operation (explanation or summarization, chosen as basic demo functionalities that can be extended with others) and the main input, which is the term or text to be explained or summarized
  • Triggers the Kedro pipeline run, which:
    • collects the input parameters (including the remaining technical parameters, passed in the traditional Kedro way, that are needed to construct the prompt and get the response),
    • sends the request via the selected API (currently supported: the native OpenAI API, Azure OpenAI Service and the VertexAI Google PaLM API) and retrieves the response
  • Prints the answer under the document
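
Sending the request through whichever API the user selected can be sketched as a lookup in a small registry. This is only an illustration under stated assumptions: the signature of complete_request, the BACKENDS registry and the echo_backend stand-in are all hypothetical, and no real provider SDK is called here.

```python
from typing import Callable, Dict

# A backend takes (model, prompt) and returns the model's text answer.
Backend = Callable[[str, str], str]


def echo_backend(model: str, prompt: str) -> str:
    """Stand-in for local testing; a real backend would call the provider's SDK."""
    return f"[{model}] {prompt}"


# Keys are the API names offered to the user; every value here is a placeholder.
BACKENDS: Dict[str, Backend] = {
    "OpenAI": echo_backend,
    "Azure OpenAI": echo_backend,
    "VertexAI PaLM": echo_backend,
}


def complete_request(
    api: str, model: str, prompt: str, backends: Dict[str, Backend] = BACKENDS
) -> str:
    """Send the prompt through the backend registered for the selected API."""
    if api not in backends:
        raise ValueError(f"Unsupported API: {api!r}")
    return backends[api](model, prompt)


answer = complete_request("OpenAI", "some-model", "Explain: eigenvalue")
```

With this shape, supporting another provider would mean registering one more function, while the code that triggers the request stays unchanged.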

The interesting thing in this setup is the coupling between Streamlit and the Kedro pipeline. Kedro has its own set of parameters stored in the conf directory. By default, there are two subfolders there: base and local (you can also define other sets of parameters and use them as different environments). The first one holds the default parameters for a baseline Kedro run and is stored in Git. The other one is stored only locally. You can use it for parameters that are specific to your own environment and should not be shared. It is also a good place to put something temporary that should not overwrite your base configuration files. This makes parameters.yml in the local subdirectory a perfect connection point between the parameters entered in the Streamlit interface and the Kedro pipeline. In the Reading Assistant, it works as follows:

  1. First, the KedroSession object is initialized in the run_app Python script, so that the Kedro pipeline can later be run by name:

      import os

      from kedro.framework.session import KedroSession
      from kedro.framework.startup import bootstrap_project

      # Register the Kedro project located in the current working directory
      bootstrap_project(os.getcwd())
      session = KedroSession.create()

  2. Some Streamlit widgets are defined and their values are assigned to variables:

      import streamlit as st

      # model_choices is a list of model names defined elsewhere in run_app
      api = st.selectbox("Select an API:", ["OpenAI", "VertexAI PaLM", "Azure OpenAI"])
      model = st.selectbox("Choose LLM:", model_choices)
      mode = st.selectbox("Choose mode:", ["explain", "summarize"])
      input_text = st.text_area("Paste term or text:", value="", height=200)

  3. Each time the input fields are updated, the new parameter values are dumped to the local parameters file:

      import yaml

      # Overwrite the local parameters with the current widget values
      with open("./conf/local/parameters.yml", "w") as f:
          yaml.dump(
              {"api": api, "model": model, "mode": mode, "input_text": input_text}, f
          )

  4. After clicking the “Get Answer!” button, the Kedro pipeline is triggered. It collects all the parameters, both from the base config and from the local config that the Streamlit app keeps up to date:

      if st.button("Get Answer!"):
          # Run Kedro pipeline to use LLM
          answer = session.run("run_assistant_pipeline")["answer"]
      else:
          answer = "Paste input text and click [Get Answer!] button"

Each time the button is clicked, the Kedro pipeline is rerun - possibly with new parameter values, if they were updated in the meantime.
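
The coupling relies on values from the local configuration taking precedence over those from base. The sketch below mimics that precedence with plain dictionaries, assuming that top-level keys found in local replace the same keys in base. It does not call Kedro's config loader, and the base parameter names (max_tokens, temperature) are hypothetical.

```python
def merge_parameters(base: dict, local: dict) -> dict:
    """Return a new dict in which top-level keys from local override those from base."""
    return {**base, **local}


# Hypothetical technical defaults, as they might appear in conf/base/parameters.yml
base_params = {"max_tokens": 256, "temperature": 0.0, "mode": "explain"}

# Values as dumped by the Streamlit app into conf/local/parameters.yml
local_params = {
    "api": "OpenAI",
    "model": "some-model",
    "mode": "summarize",
    "input_text": "Text pasted by the user",
}

params = merge_parameters(base_params, local_params)
```

Here params["mode"] ends up as "summarize" because the local value wins, while max_tokens and temperature keep their base defaults.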

Summary

And that’s it! This demonstrates a very simple yet effective way of parameterizing and running Kedro pipelines via a Streamlit application. Of course, the example is very simple, but you can imagine more complex setups with multiple Kedro pipelines that use more Kedro features. In those scenarios, the Kedro project structure and its well-organized pipelining framework would bring more benefits, while Streamlit would still make the application easy to build. The coupling between the two would remain as simple as above.

If you are interested in other applications of LLMs and potential issues during implementation, check out our other blog posts and keep up with the new ones that are published, especially the one about the Shopping Assistant: an e-commerce conversational tool that provides search and recommendation capabilities using a natural language interface.

Do you have any questions? Feel free to sign up for a free consultation!

Kedro
large language models
LLM
reading assistant
streamlit
12 September 2023
