
Combining Kedro and Streamlit to build a simple LLM-based Reading Assistant

Generative AI and Large Language Models are taking applied machine learning by storm, and there is no sign of this trend weakening. While it is important to remember that generative AI is unlikely to wipe out other branches of machine learning, and that there are still many things to be careful about when applying LLMs (check my previous blog post for some examples), the use of these models is unavoidable in some areas. But to leverage them really efficiently, several pieces must be put together properly. In particular, powerful language understanding capabilities have to be backed up by a clean, well-organized implementation and a pleasant user experience. Using a very simple example, we will demonstrate how to achieve these three essential goals with commercial LLM APIs, Kedro and Streamlit respectively.

The idea of a Reading Assistant

Imagine that you have to read a very technical document without being an expert in the field. The document surely contains a lot of domain-specific wording, difficult-to-understand terms and possibly many outside references. Reading such a document can be quite a pain; you spend more time looking for explanations on the Internet than on the document itself. Nowadays, with the power of Large Language Models at your fingertips, you can make this process much faster and easier. LLMs are pretrained on vast amounts of text from different domains and encode all this broad knowledge in their parameters, while also allowing for seamless human-machine interaction in plain, natural language. In cases where pretrained knowledge is not enough, a model can be adapted to a specific domain or set of instructions, or other forms of finetuning can be applied to make it even more useful. This, however, is quite a complex and tricky topic, and we will not focus on it in this article, although it is under heavy research in our GetInData Advanced Analytics Labs.

So what exactly is the idea behind the LLM Reading Assistant? It is as simple as this:

  • You upload the document that you want to read into the web-based app and start reading
  • When you encounter an incomprehensible term or a hard-to-understand portion of text, you can select it and ask the LLM to either explain or summarize it
  • An appropriate prompt is constructed under the hood (see the sketch after this list), sent via an API, and the answer is returned and printed
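
For illustration, such a prompt could be built as in the minimal Python sketch below; the template wording here is hypothetical, as the actual templates live in the project's configuration:

# Hypothetical prompt templates - the real ones live in the project's Kedro conf files
PROMPT_TEMPLATES = {
    "explain": "Explain the following term or passage in simple language:\n\n{text}",
    "summarize": "Summarize the following passage concisely:\n\n{text}",
}

def build_prompt(mode: str, text: str) -> str:
    """Fill the template for the selected mode with the user's selection."""
    return PROMPT_TEMPLATES[mode].format(text=text)

# Example: build_prompt("explain", "stochastic gradient descent")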

This kind of tool can prove useful in large organizations, where people in different roles (management, business, technical, legal, etc.) have to deal with domain-specific documents and the efficiency of processing them is key. Examples include:

  • legal acts
  • scientific papers
  • medical documentation
  • financial reports

The solution presented in this article is just a very simple PoC that demonstrates the idea of a Reading Assistant and shows how easily you can build a quite functional application by combining the Kedro and Streamlit frameworks, backed by commercial Large Language Models. To turn it into a full-scale, production-grade tool, some important developments would be required, e.g.:

  • a more advanced user interface, allowing for a better user experience with context menus instead of manual copy-paste operations,
  • possibly a chat window to be able to extend communication with the model beyond simple explain/summarize queries,
  • an option to use large-context models and in-document search in addition to just relying on pretrained model knowledge,
  • comprehensive load and functional tests,
  • optional use of open-source, self-deployed models, finetuned on domain-specific corpora.

Nevertheless, such a demo is always a good start, so let’s dive in to see how it works.

Implementation using Kedro, Streamlit and LLM APIs

The code of the application described here is publicly available as one of the QuickStart ML Blueprints, a set of various ML solution examples built with a modern open-source stack and in line with the best data science development practices. You can find the project and its documentation here. Feel free to run and experiment with it, and also to explore the other blueprints, which include classification/regression models, recommendation systems, time series forecasting and more.

Kedro users will surely notice that, from this framework's perspective, the presented solution is very much trimmed down compared to standard Kedro use cases. It consists of only one pipeline (run_assistant) that contains just a single node (complete_request). Since all input to the pipeline is passed via parameters (some of them in the standard way via Kedro conf files, the others via the Streamlit app, which will be explained later) and the only output is the LLM's response that needs to be printed for the user, the project doesn't use a data catalog. In this simple PoC there was also no need for MLflow logging; only the local logger was used for debugging purposes. One Kedro feature that is still very helpful is the pipeline configuration mechanism. It turns out that in such a special use case, seemingly not very well aligned with the usual Kedro way of working, it allows for flexible and efficient integration with the additional user interface layer formed by the Streamlit app.
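
For reference, registering such a single-node pipeline could look roughly like the sketch below; the exact node signature and parameter names are assumptions, so see the repository for the real code:

# pipeline.py - minimal sketch of the single-node run_assistant pipeline
# (the inputs listed here are illustrative; the repository defines the real ones)
from kedro.pipeline import Pipeline, node

from .nodes import complete_request

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=complete_request,
                # parameters come from the merged conf/base and conf/local files
                inputs=["params:api", "params:model", "params:mode", "params:input_text"],
                outputs="answer",  # a free output, returned by session.run()
                name="complete_request",
            )
        ]
    )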

On top of the Kedro run_assistant pipeline, there is another Python script, run_app, which - not surprisingly - defines and runs the Streamlit application. In more detail, it serves the following purposes:

  • Displays an uploaded PDF file for reading
  • Handles the additional input parameters that are not passed to the Kedro pipeline via Kedro conf files: the type of LLM API to be used, the LLM model, the mode of operation (either explanation or summarization, chosen as basic demo functionalities that can of course be extended with others) and the main input, i.e. the term or text to be explained or summarized
  • Triggers the run of the Kedro pipeline, which:
    • collects the input parameters (including the remaining, technical parameters passed in the traditional Kedro way, needed to construct the prompt and get the response),
    • sends the request via the selected API (currently supported are the native OpenAI API, Azure OpenAI Service and the VertexAI Google PaLM API) and retrieves the response (a sketch of this node follows the list)
  • Prints the answer under the document
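
A minimal sketch of what the complete_request node might look like is shown below. It assumes the pre-1.0 openai Python package and hypothetical template wording, so treat it as an illustration rather than the repository's exact code:

# nodes.py - illustrative sketch of the complete_request node
# (assumes the pre-1.0 `openai` package; the real project also dispatches
# to Azure OpenAI Service and the VertexAI PaLM API)
import openai

TEMPLATES = {
    "explain": "Explain the following in simple terms:\n\n{text}",
    "summarize": "Summarize the following text:\n\n{text}",
}

def complete_request(api: str, model: str, mode: str, input_text: str) -> str:
    """Build the prompt for the selected mode and query the chosen API."""
    prompt = TEMPLATES[mode].format(text=input_text)
    if api == "OpenAI":
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["choices"][0]["message"]["content"]
    # Dispatch to the other supported APIs would go here
    raise ValueError(f"Unsupported API: {api}")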

The interesting thing in this setup is the coupling between Streamlit and the Kedro pipeline. Kedro has its own set of parameters stored in the conf directory. By default, there are two subfolders there: base and local (you can also define other sets of parameters and use them as different environments). The first one is a set of default parameters for a baseline Kedro run and is stored in Git. The other one is stored only locally; you can use it for parameters that are specific to your own environment and should not be shared, and it is also a good place to put something temporary that you do not wish to overwrite in your base configuration files. This makes parameters.yml in the local subdirectory a perfect connection point between the parameters entered in the Streamlit interface and the Kedro pipeline. Here is how it works in the Reading Assistant:

  1. First, a KedroSession object is initialized in the run_app Python script, so that the Kedro pipeline can later be run by providing its name:
import os

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Point Kedro at the project located in the current working directory
bootstrap_project(os.getcwd())
session = KedroSession.create()
  2. Some Streamlit widgets are defined and their values are assigned to variables:
import streamlit as st

api = st.selectbox("Select an API:", ["OpenAI", "VertexAI PaLM", "Azure OpenAI"])
model = st.selectbox("Choose LLM:", model_choices)  # model_choices: a list of model names defined elsewhere in the script
mode = st.selectbox("Choose mode:", ["explain", "summarize"])
input_text = st.text_area("Paste term or text:", value="", height=200)
  3. Each time the input fields are updated, the new parameter values are dumped to the local parameters file:
import yaml

with open("./conf/local/parameters.yml", "w") as f:
    yaml.dump(
        {"api": api, "model": model, "mode": mode, "input_text": input_text}, f
    )
  4. After clicking the “Get Answer!” button, the Kedro pipeline is triggered. It collects all the parameters - from the base config, but also from the local config that is constantly updated via the Streamlit app:
if st.button("Get Answer!"):
    # Run the Kedro pipeline to query the LLM
    answer = session.run("run_assistant_pipeline")["answer"]
else:
    answer = "Paste input text and click [Get Answer!] button"

Each time the button is clicked, the Kedro pipeline is rerun - possibly with new parameter values, if they were updated in the meantime.
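
Finally, the returned answer can be rendered back in the Streamlit page under the document, e.g. with something as simple as the snippet below (illustrative; the actual app may format the output differently):

# Display the LLM's answer below the document viewer
# (illustrative; the actual app may present it differently)
st.subheader("Answer")
st.write(answer)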

Summary

And that’s it! This demonstrates a very simple yet effective way of managing, parameterizing and running Kedro pipelines via a Streamlit application. Of course, the example is very simple, but you can imagine more complex setups with multiple Kedro pipelines that use more Kedro features. In those scenarios, the Kedro project structure and a well-organized pipelining framework would be even more advantageous, while still leveraging the ease of building Streamlit applications. Nevertheless, the coupling between the two would remain as simple as above.

If you are interested in other applications of LLMs and the potential issues that arise during implementation, check out our other blog posts and keep an eye on the new ones being published, especially the one about the Shopping Assistant: an e-commerce conversational tool that provides search and recommendation capabilities using a natural language interface.

Do you have any questions? Feel free to sign up for a free consultation!

Kedro
large language models
LLM
reading assistant
streamlit
12 September 2023
