What to be careful about and how you can benefit your business with the newest AI revolution
With the introduction of ChatGPT, Large Language Models (LLMs) have become without doubt the hottest topic in AI and it doesn’t seem that this is going to change any time soon. Although the underlying iterations of GPT models were just the next step in the process of developing better and better (or, also true to some extent: larger and larger) systems based on Transformer architecture, providing a publicly available chatbot interface sparked a revolution in thinking about how AI can help with different business or everyday tasks. Suddenly, everyone was able to ask the computer about pretty much anything and get answers that sounded credible - all this using just the natural language that they are fluent in, which doesn’t require any programming skills or knowledge of what kind of witchcraft happens underneath. Over the Internet, ideas, examples and demos of different LLM applications suddenly appeared out of nowhere. But as usual, the mix of insufficient experience and excessive enthusiasm can lead to misuse and in consequence to disappointment. So the question is: what can LLMs really offer you here and now, given the rapidly changing landscape?
Just to be transparent about intentions: this article is not trying to disregard the potential of LLMs. Large Language Models, and more generally, Generative AI, are already the next big thing not only in the machine learning world, but also in the everyday life of many people. Their influence will arguably just keep growing. Many people use ChatGPT to get answers quickly to numerous questions, even the complex ones - something that would take a lot longer just using a search engine and compiling the results on their own. GitHub Copilot assists millions of developers in creating production grade code (over 5.5 million downloads of the VSCode plugin, despite not being free to use, and that is only one way to use it). Everyone’s favorite Google Translate uses the Google Neural Machine Translation system, which was initially LSTM-based (when it replaced the former statistical learning system), but now it incorporates other architectures, including Transformer-based - and is still evolving.
There are also far from isolated reports of concern from various professionals, saying that the wide introduction of Generative AI tools threatens their position in the job market. While at first it may sound like another example of the recurring fear of technology taking over human labor, this is not completely without reason and we can already see some evidence of this. LLMs can be better than humans at writing simple articles and may pretty soon be capable of writing a decent novel. When a generative model won an art competition, it caused quite a shock. Stories can be found on social media (true or not, they are still plausible) regarding talented and creative graphic designers being replaced by mediocre ones, who use AI to generate thousands of initial ideas that only need a final polish. The same applies to other creative domains like music. Over the last few years, many pop songs already sounded like slightly altered copies of previous ones and now you don’t even need humans for such tracing - AI can generate songs, imitating the style and voice of any singer, and even invent a new one. Soon, if you are willing to accept the risk of making a mistake, you might be better off without the help of any tax or legal advisors. Why bother if you can just ask your AI assistant that has crunched all the available legal documents regarding your case and similar ones? Also, programmers and data scientists - or maybe especially them, since they know how it all works under the hood - should feel a bit anxious, since LLMs are getting better and better at coding. Now their help is much appreciated by developers who use those tools to speed up mundane and repetitive parts of their work - but the big question is: when will LLMs stop requiring human help to generate complete solutions?
All these examples are not in the science fiction domain anymore. However, looking from a 100% pragmatic perspective, we are not quite there yet (although we might get there quite quickly). LLMs are still not free of some drawbacks that are sometimes not mentioned in impressive, but very case-specific examples and demos. Let’s have a look at some pitfalls and then try to find out what use cases are really interesting, deployable in production and potentially profitable right now.
In the simplest take, a Large Language Model is a system that is able to process and memorize a vast amount of text data and then provide the most probable sequence of words (or more precisely: tokens) given some set of instructions, referred to as a prompt. This fragment - “the most probable sequence” - is crucial to understanding what LLMs can and cannot do, at least when pre-training on a given text corpus is not supplemented with additional techniques. In their very nature, foundation Large Language Models are like stochastic parrots: they shout out a bunch of words that are similar to what they heard before. To be fair about that statement, in conversational systems like ChatGPT there are a lot of additional mechanisms that help LLMs act like they are much more intelligent. There are workarounds to prevent them from giving predictions that are biased or not legally approved, methods to inject up-to-date or domain specific knowledge, plugins to allow for arithmetic or symbolic calculations etc. There are also other quite sophisticated techniques to improve LLMs’ performance and credibility based on their interactions with humans, like Reinforcement Learning from Human Feedback (RLHF). However, with today’s models, it still may happen that the generated output sequence makes no sense, even if the model is pretty confident about it. This phenomenon is called hallucination. LLMs are by default not optimized for admitting lack of knowledge or validating factual correctness, so they usually try to come up with some sentences that are fluent and plausible-sounding, but not necessarily true. Try asking ChatGPT about some public, but not extremely well-known person - you might get a complete, hilarious biography that is entirely made up.
Another thing to mention is that LLMs are prone to being misused, like other tools that are great for a specific purpose, but which people tend to rely on too much and use for purposes other than intended (my favorite examples being, just to put a cat amongst the pigeons: “Excel as data storage” and “Jupyter Notebooks in production”). Let’s imagine the following example - you are trying to build a “conversational recommendation system” on top of your e-commerce website’s data. You want the user to be able to provide a natural language query like “I am going on a week-long hike in the Tatra Mountains in July, what do I need?” and have the system provide a list of items like a large backpack, mountain boots and a rainproof jacket; the user can then add them to their basket or use this information just as a reminder in case they have forgotten to pack any important stuff. Let’s look at what such a “conversational shopping assistant” could use as a data source. One option could be to fine-tune it on user comments, travel blogs, articles etc. The model would then use only the subjective opinions provided by people (imagine some sarcastic comments in your dataset saying that you can go hiking in flip-flops and the model taking them seriously), which are also unrelated to the store’s current inventory. Of course, you might incorporate some additional mechanisms that adjust the answers to your offer, but that’s already something outside the pure LLM domain. If you would like to add some objective knowledge to your model, you might try to create a vector database of item embeddings that will allow for an efficient search for similar items. Then you might use an LLM to convert natural language input to a database-specific query, and after receiving the results, use it again to convert those results back into a human-readable answer.
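To make the idea of searching item embeddings more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the three-dimensional vectors are made up by hand, whereas a real system would obtain much larger vectors from an embedding model and store them in a vector database rather than in a dictionary.

```python
import math

# Hypothetical, hand-made 3-dimensional item embeddings (illustration only).
ITEM_EMBEDDINGS = {
    "large backpack": [0.9, 0.1, 0.0],
    "mountain boots": [0.8, 0.3, 0.1],
    "rainproof jacket": [0.7, 0.2, 0.4],
    "flip-flops": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_items(query_embedding, k=3):
    """Return the names of the k items most similar to the query vector."""
    scored = [
        (cosine_similarity(query_embedding, vector), name)
        for name, vector in ITEM_EMBEDDINGS.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Stand-in for the embedding of the hiking question from the text.
query = [1.0, 0.2, 0.1]
print(top_k_items(query))
```

In this toy setup the LLM is not involved in the search at all; it would only translate the user's question into a query and phrase the retrieved items as an answer.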
It is important to realize, however, that LLMs constitute only a component of the overall architecture and require a surrounding system to work successfully. For long-form documents, implementation of a data retrieval pipeline is necessary, while for a recommender system a candidate generation part is needed. Furthermore, you need to double-check the output of the LLM for the reasons mentioned in the previous paragraph, since you still might end up with the list of non-existing or irrelevant items.
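One simple form of such a double-check is to compare the items named by the model with the store's actual inventory. The sketch below assumes a hypothetical in-memory inventory and an already parsed list of item names; the function name and the sample data are made up for illustration.

```python
# Hypothetical store inventory; a real system would query the product database.
INVENTORY = {"large backpack", "mountain boots", "rainproof jacket"}


def filter_to_inventory(llm_items, inventory=INVENTORY):
    """Split item names suggested by the LLM into known and unknown ones."""
    in_stock, rejected = [], []
    for item in llm_items:
        normalized = item.strip().lower()
        if normalized in inventory:
            in_stock.append(normalized)
        else:
            rejected.append(normalized)
    return in_stock, rejected


# Made-up model output containing one item the store does not sell.
llm_output = ["Large backpack", "Mountain boots", "Self-heating tent"]
in_stock, rejected = filter_to_inventory(llm_output)
print("recommend:", in_stock)
print("drop:", rejected)
```

Exact string matching is the crudest possible check; it only shows where a validation step would sit in the surrounding system.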
While the described functionality might be very helpful and for sure this approach will evolve, it is important to note that at this moment in time it cannot be treated as a reliable, fully-fledged recommendation system, especially from the perspective of the e-commerce store. If you want to know more about modern recommendation systems, please check our whitepaper or browse Graph Neural Network-based or ranking-based implementations within our QuickStart ML Blueprints repository.
A common way to interact with LLMs is via prompts, which are basically commands written in natural language that tell the model what we want to get. This is very different from the traditional way of interacting with ML (or IT) systems, which is based on programming language interfaces. These are much more precise and do not allow for ambiguity. Prompts can be enhanced with specific patterns, additional data, parameters or examples that help the model give better answers. A whole new engineering area called prompt engineering is beginning to emerge, which deals with designing prompts for Large Language Models efficiently. For example, we can tell the model to “evaluate a collection of answers to a quiz” that we gathered from people, but add that “we want the results on a 0-10 scale in the format of Score: {number of points}; Why: {explanation}”. We can also include some examples of real answers in the prompt along with our own evaluation in the expected format to enable so-called few-shot learning, where the LLM is able to learn a specific task just by seeing a sample of expected output during inference. This brings writing prompts a bit closer to writing code, but we end up somewhere halfway: by introducing fixed structures to prompts we move away from the plain natural language experience, yet we are still very far from the fully deterministic commands possible with programming languages.
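The quiz example can be sketched as a prompt template plus a parser for the expected answer format. This is only an illustration under assumptions: the few-shot examples are invented, no model is actually called, and the function names are hypothetical. The parser exists because nothing guarantees that the model sticks to the requested format.

```python
import re

# Invented few-shot examples: (question, answer, score, explanation).
FEW_SHOT_EXAMPLES = [
    ("What is the capital of France?", "Paris.", 10, "Correct and complete."),
    ("What is the capital of France?", "Lyon.", 0, "Names the wrong city."),
]


def build_prompt(question, answer):
    """Compose instructions, few-shot examples and the answer to evaluate."""
    lines = [
        "Evaluate the answer to the quiz question on a 0-10 scale.",
        "Respond in the format: Score: {number of points}; Why: {explanation}",
        "",
    ]
    for ex_question, ex_answer, score, why in FEW_SHOT_EXAMPLES:
        lines += [
            f"Question: {ex_question}",
            f"Answer: {ex_answer}",
            f"Score: {score}; Why: {why}",
            "",
        ]
    lines += [f"Question: {question}", f"Answer: {answer}"]
    return "\n".join(lines)


RESPONSE_PATTERN = re.compile(r"Score:\s*(\d+)\s*;\s*Why:\s*(.+)", re.DOTALL)


def parse_response(text):
    """Return (score, explanation), or None if the format was not followed."""
    match = RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    score = int(match.group(1))
    if not 0 <= score <= 10:
        return None
    return score, match.group(2).strip()


print(build_prompt("What is the capital of Poland?", "Warsaw, I think."))
print(parse_response("Score: 9; Why: Correct, but stated with hesitation."))
```

Returning `None` for malformed output forces the calling code to decide what to do (retry, fall back, or flag for review) instead of silently trusting the model.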
From a reliability perspective it raises a lot of questions that are the subject of intense research, but are not yet fully answered: How can we precisely evaluate the quality of LLM results? How should we ensure that the model gives the same answers given the same prompt (it is not obvious due to the stochastic nature of LLMs)? How to version our engineered prompts and the output of the model? How to ensure the consistency of answers to particular prompts when we update our model or how to transfer prompts between different models?
In traditional software engineering, we have established best practices to make the code properly tested, reliable, auditable, well-documented, reproducible, portable and well-optimized. In machine learning we have many mathematical metrics and ways to translate them to business indicators, and we also have benchmarks to compare to. But for prompt engineering, especially applied to in-house fine-tuned models, we do not have anything similar. To get a lot of flexibility in interacting with a system using natural language and the possibility to train it on enormous amounts of unstructured data, we give up a lot of control of what this system produces in the end.
If we start thinking of LLMs not only as an extremely interesting R&D area but also as something that we might want to incorporate into actual production systems, we will very soon face a number of questions, some of which are:
All these questions are connected to each other and the answers are not easy to provide. Of course the architecture of any ML or IT system is a crucial thing, but in the case of the unspoiled area of production grade LLMs, we do not have many proven examples of optimum design choices. What’s more, the field is evolving so fast that one thing that a company can be fairly sure of when starting a new project that involves LLMs is that their solution will be outdated before it is finished. After OpenAI’s and Google’s decisions to keep their state of the art models proprietary, the open source community seemingly made it a point of honor to get ahead of tech giants despite the initial barrier of huge costs of training LLMs (by the way, there are rumors that those giants are beginning to realize that keeping their toys away from the other kids might not be the best strategy after all). It took the community barely about two months to start from recently (a very relative term nowadays in ML) published vanilla LLaMA architecture and introduce a series of game changing innovations like: running LLMs on edge devices, fine-tuning a robust model on a single GPU in a matter of hours thanks to the efficient application of a Low-Rank Adaptation technique, deploying another implementation on a CPU-only laptop thanks to 4-bit integer quantization, releasing tools and datasets that enable everyone to experiment with techniques previously unachievable by non-commercial players like Reinforcement Learning from Human Feedback and in general, achieving state of the art results with much smaller and cheaper models.
Answers to the questions stated above might literally change day by day. Before anyone tries to apply a specific set of architectures and techniques, a dozen new promising ones will appear. Incorporating organizational knowledge into LLM-based interfaces can in general be achieved in three ways: by injecting additional knowledge via prompt engineering, by designing a retrieval pipeline, or by fine-tuning a model that was previously pre-trained on some huge general knowledge dataset using a smaller, domain-specific dataset. All of these methods have their drawbacks: prompts have limited length, so not too much information can be smuggled in via prompt engineering; retrieval pipelines are usually data source specific, so the solution might not be generic enough, and they require additional engineering; fine-tuning the model to the required quality might turn out to be very costly, as its cost increases with model and dataset size. However, even at the time of completing this very article, new developments are emerging to mitigate the aforementioned drawbacks. The newest open source MPT-7B series of models (it was announced by MosaicML a few hours before I wrote these words) is able to accept an input of up to 65k tokens, which might make prompt limitations irrelevant. Frameworks like LangChain and Haystack are being actively developed, which makes creating modular systems that involve LLMs as components structured and transparent. Model scaling and fine-tuning are still serious challenges, but they are becoming conceptual problems rather than problems of affordability, given the successful applications of techniques like knowledge distillation or Low-Rank Adaptation. Taking all of this into account, while the research labs provide more and more options on how to apply LLMs to your own data, composing the best one seems like an almost impossible challenge - and the one that is the best today might be outperformed tomorrow.
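A back-of-the-envelope calculation shows why Low-Rank Adaptation makes fine-tuning cheaper. Instead of training a full update to a d × k weight matrix, LoRA trains two small matrices of shapes d × r and r × k whose product is added to the frozen weights. The dimensions below are illustrative assumptions, not taken from any specific model.

```python
def full_update_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_update_params(d, k, r):
    """Trainable parameters for a rank-r update B (d x r) times A (r x k)."""
    return r * (d + k)


# Illustrative sizes: one 4096 x 4096 projection matrix, adapter rank 8.
d, k, r = 4096, 4096, 8
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {r}):  {lora:,} trainable parameters")
print(f"reduction factor: {full // lora}x")
```

The saving applies to the number of trainable parameters (and therefore gradients and optimizer state) per adapted matrix; the frozen base weights still have to fit in memory.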
Similar uncertainty applies to deployment methods. We could also try to define three example ways of using LLMs in production: deploy a small, cheap-to-run model on a local machine; deploy a large, fine-tuned model by yourself in the cloud; or use paid APIs to closed-source models like the one offered by OpenAI. As before, some pros and cons can be found for all options and those lists evolve on a daily basis. Small models that would fit into a single machine’s memory were so far not performant enough for any serious application. With the advancements made by the ML community that were described before, this is likely to change, as the initial trend started by tech giants of making bigger and bigger models and training them on larger and larger datasets gives way to applying smart optimization techniques to model scaling and fine-tuning and using well-curated, high quality datasets instead of just colossal ones. Still, achieving state of the art performance might require quite large models, and thus significant resources.
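Whether a model fits into a single machine's memory can be roughly estimated from the parameter count and the number of bits stored per weight. The sketch below is deliberately simplified: it counts weights only and ignores activations, caches and the small overhead that quantization schemes add, and the 7 billion parameter count is just an illustrative size.

```python
def weights_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


NUM_PARAMETERS = 7_000_000_000  # illustrative "7B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weights_memory_gb(NUM_PARAMETERS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the footprint by a factor of four, which is what brings such models within reach of an ordinary laptop.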
There is one more crucial thing to consider when thinking about deploying LLMs on your own which is licensing. What started the open source counter attack against OpenAI’s and Google’s emerging dominance was the publication of Meta’s LLaMA model. The publication was, however, not full, so to speak. Initially, there was only source code published with no weights, so for the model to be usable the entire training procedure would have to be reproduced. The second thing is that LLaMA’s license is non-commercial, which means that it can be used for research purposes, but it cannot be deployed as a part of any commercial solution. The open source community quickly overcame the lack of weights problem by creating and training similar models and developing efficient fine-tuning techniques. But even if the license of the derivative model allows for commercial use, there might still be a catch. There are examples like the Vicuna-13B model that is published under Apache License 2.0 which allows for any type of use. But the model weights are applied on top of original LLaMA weights, and the LLaMA model, as mentioned before, has restrictions - and unfortunately those restrictions are transferred to the derivative model. Another example of a semi-restricted model may be the MPT-7B series. There are four models in the series with three different licenses, one of them, for a chat-enabled model, being a non-commercial one.
Of course, not only models are licensed. We also have to be careful about the datasets that we want to use for fine-tuning, because the same rules apply here. Also, the whole licensing process is likely to get much, much more complicated with the introduction of the EU AI Act, but this is a topic on its own and describing all the controversy would require much more space than this blogpost or the act itself would take.
The opposite option to deploying compact and efficient models on your own is to use commercial APIs. Those, however, are of course not free of charge. Taking OpenAI’s API as an example, the user is charged both for the number of tokens provided as the input in the prompt and for the number of tokens returned by the model. The more context you provide to the model, the better (hopefully) the response will be, but also the more costly. Controlling the length of the input sequence looks quite easy, but it gets more tricky if we consider the length of the output. We can explicitly prompt the model to stick to the required number of tokens for the response and it likely (but not surely) won’t go rogue and won’t decide to write an epic. But if the LLM is a part of a solution that you want to expose to other people, how do you ensure that they will stick to a reasonable volume of the model’s output? Even if you enforce some limits, users might start feeling too constrained and lose interest in your tool. Either way, the cost of using commercial APIs in production seems really hard to predict. Another problem might be the latency of responses, which is currently quite high and might be a serious limitation for efficient processing of multiple requests.
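The pricing scheme described above can be turned into a rough budget estimate. In the sketch below the prices, traffic and token counts are made-up placeholders rather than any vendor's actual rates, and the function names are hypothetical. The gap between the typical and the worst case is exactly the uncertainty that comes from not controlling the output length.

```python
def estimate_request_cost(input_tokens, output_tokens,
                          input_price_per_1k, output_price_per_1k):
    """Cost of one request when input and output tokens are priced separately."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def monthly_cost_range(requests_per_day, input_tokens, typical_output_tokens,
                       max_output_tokens, input_price_per_1k,
                       output_price_per_1k, days=30):
    """Return (typical, worst-case) monthly cost for a given output token cap."""
    requests = requests_per_day * days
    typical = requests * estimate_request_cost(
        input_tokens, typical_output_tokens,
        input_price_per_1k, output_price_per_1k)
    worst = requests * estimate_request_cost(
        input_tokens, max_output_tokens,
        input_price_per_1k, output_price_per_1k)
    return typical, worst


# Placeholder numbers for illustration only.
typical, worst = monthly_cost_range(
    requests_per_day=1000,
    input_tokens=1500,
    typical_output_tokens=300,
    max_output_tokens=1000,
    input_price_per_1k=0.03,
    output_price_per_1k=0.06,
)
print(f"typical month: {typical:.2f}, worst case: {worst:.2f}")
```

A hard cap on output tokens bounds the worst case, at the price of possibly truncated answers.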
Last but not least, there is one more thing to consider about using commercial interfaces to remotely hosted black-box models, which is data privacy. Although while using OpenAI’s API you can explicitly opt out from allowing your data to be used for GPT training, you still need to send some information via prompt. This might be a definitive no-go for regulated organizations like banks and a serious question mark for many others. Also, we are relying on services provided by a specific vendor for the implementation of the crucial system component. All failures or downtimes will affect our solution greatly and we have to also remember that switching to an alternative might be very tricky - even if only due to the problems with ensuring consistent answers for transferred prompts when using a different model.
We have covered some of the things that everyone considering the application of LLMs in their business should be careful about, and explained why not all demos seen on the Internet are production-ready, so let’s now discuss what LLMs can be used for.
This category could be filled with numerous examples, as the types of data sources underneath such conversational knowledge retrieval systems can be countless. Not to mention that LLMs that are only pre-trained on general text corpora are already a treasure trove of valuable information. One particular example of a “conversational recommendation system” - or even better, a “shopping assistant” - was already described in the section about potential issues with LLMs. One should be careful with calling this kind of arrangement a “recommendation system” in the full sense of the word, since it greatly depends on other components, headed by a proper information retrieval component. Still, a tool that provides a list of items meeting user-described criteria, with an explanation as to why they are relevant, might be of use on many occasions. For example, in large organizations, an internal knowledge repository (Confluence pages, internal databases, financial documents etc.) could be used. Depending on the scale and type of data, an LLM could be fine-tuned using this internal knowledge base, a retrieval pipeline could be built to later allow for adding a conversational layer to database queries, or both approaches could be combined.
A somewhat similar use case to the previous one can be defined when the user not only wants to retrieve some information from a knowledge base, but also to synthesize this knowledge in some way. LLMs can be very good at writing summaries of large collections of documents, which might be a huge time saver for many organizations. Companies that deal with vast amounts of financial data could use LLMs to easily generate excerpts with the needed information. IT service companies often prepare personalized summaries of their consultants’ experience for potential clients from specific industries, which, with the help of LLMs, could be done automatically using documentation of past projects. If given enough instructions and examples through the prompt, an LLM is capable of performing extensive sentiment analysis or document checking, providing quite comprehensive explanations of its assessment of the input’s quality. Another interesting application of LLMs could be defined as something like a “reading assistant” (there are probably some better names for such a functionality though). If you are dealing with documents that are difficult and contain many references to other sources (good examples of this are legal acts or scientific papers), reading often becomes quite an ordeal, since it is hard to assimilate knowledge when you are jumping from one source to another to seek explanations or external supplements all the time. LLMs fine-tuned on domain-specific datasets can greatly help with that - you just select a fragment that is hard to understand and the LLM provides you with a clear, compact hint; you save a lot of googling time and you don’t lose the reading flow.
Many potential use cases involve LLMs’ generative capabilities. The aforementioned GitHub Copilot, which is capable of co-generating code, gained a lot of users’ attention even before the current ChatGPT-triggered boom. There are also better and better open-source alternatives being developed, including, for example, the recently published StarCoder from Hugging Face. People are already using ChatGPT to help them write simple pieces of text like emails or short articles (have you received any LinkedIn recruitment messages with creative but quite artificial opening lines yet?). Even if we stick strictly to LLMs, not trying to incorporate other Generative AI tools, we can still go beyond the pure natural language domain in content generation. Since there are tools that allow us to define different complex objects as code, we can use a language model to generate that code given a set of general instructions. This way we can have AI-generated plots (e.g. using libraries like Python’s matplotlib and seaborn or R’s ggplot2), presentations (e.g. using Markdown like in Reveal.js or Marp), BI reports (e.g. with Looker’s LookML) or entire websites (which are usually nothing more than a mix of HTML, JavaScript and CSS).
Large Language Models can be used as the backbone for many sophisticated tasks, but as the name suggests, they are still in their very nature mathematical representations of natural language. Since modern architectures can be trained on multi-language text corpora, they quite naturally acquire translation capabilities. For everyday personal use, querying Google Translate via a website might seem adequate, but if we need to include translation services as part of a bigger solution, we need programmatic access to a translation engine, which in practice usually means going for some cloud provider’s services (e.g. Translation AI from Google Cloud) and potentially paying quite a lot for them. Also, LLMs are usually pre-trained on general knowledge corpora and won’t perform well in translating specialized documents, such as those from the medical, legal or engineering areas. To achieve that, we might need to build a customized solution and provide the LLM with additional, domain specific data via fine-tuning.
Data has always been the fuel of analytics and machine learning. We can even risk a broader statement - data is the fuel of modern day business. As crucial as it is, data has always been limited, especially high quality and annotated data, which was always the most valuable kind. While the center of gravity is shifting slightly with the broad adoption of self-supervised learning techniques (used also to train LLMs, by the way), labeled data is still needed for many specific supervised learning tasks and the most costly part of obtaining annotated datasets is human labeling. In many cases this might change, since LLMs are beginning to prove that they can be employed as efficient annotators. Taking one step forward - LLMs can also be used for generating entire synthetic datasets, which is especially needed in industries where actual datasets are limited or restricted, like in medicine. This is similar to other attempts that use statistical and classic ML techniques for tabular data, and Generative Adversarial Networks or diffusion models for the computer vision domain. If an LLM can synthesize text data, it can also do other specialized tasks such as imputing missing parts of datasets or improving the quality of examples in those datasets.
The main observation that can be made looking at what’s currently happening in the AI world is that once again everybody is overwhelmed by how fast things can change. After several breakthroughs in computer vision and NLP, the deep learning revolution that was sparked several years ago shows yet another face with Generative AI and LLMs and it doesn’t seem that it is going to slow down any time soon. New developments are being announced literally day by day, making it very hard to keep up and also to separate the wheat from the chaff. In this dynamic world it is important to distinguish information that is purely based on hype and excitement from information that can be forged into useful and economically reasonable practical tools. Hopefully this article can shed some light on the topic, although it will also be interesting to see how many parts get outdated sooner than any author would wish for.
If you want to discuss any potential LLM applications in your business or any other data-related topic that GetInData can help you with (including Advanced Analytics, Data-driven Organization Strategy, DataOps, MLOps and more), please sign up for a free consultation.
_____
This article was generated by ChatGPT.
…OK, it was not. But it could have been, don’t you think? We are not too far away.