Large Language Models - the legal aspects of licensing for commercial purposes

In the rapidly evolving landscape of artificial intelligence (AI), large language models (LLMs) have become indispensable tools for various applications, from natural language processing to content generation.

However, as organizations explore the integration of LLMs for commercial purposes, it's crucial to dive into the legal landscape that governs these advanced technologies. This includes multifaceted aspects such as copyrights, licensing, data privacy, sourcing, liabilities and broader AI transparency and ethics concerns.

Amid the escalating demand for sophisticated LLMs, the choice of license emerges as a critical factor in shaping accessibility, collaboration and overall impact, especially in commercial contexts. Before acquiring or deploying LLMs, organizations should thoroughly explore the legal complexities surrounding their use.

In this blog post, we'll explore the advantages and considerations of using open-source licenses for different large language models for commercial purposes. We will examine license models and take a closer look at the Vicuna and Llama 2 cases to help you find the open-source license best suited to you.

Types of LLMs

General vs Custom LLMs

Large language models can be broadly categorized into two main types: General LLMs and Custom LLMs.

General Large Language Models (LLMs) encompass models designed to execute a diverse array of language-related functions without being specifically adapted to a particular domain or application. These general LLMs, exemplified by OpenAI's GPT (Generative Pre-trained Transformer) models, undergo training on extensive datasets that capture a broad spectrum of language patterns and topics. Their versatility allows for application in tasks like text generation, language translation, summarization and more, all without the need for fine-tuning tailored to specific use cases.

In contrast, Custom Large Language Models (LLMs) are models that have undergone additional training or fine-tuning for specific applications or domains. Organizations or researchers working with custom LLMs often take a pre-existing general LLM and refine it by training it on a dataset pertinent to a particular industry, field or application. This fine-tuning process improves the model's performance on targeted tasks, optimizing custom LLMs for a narrower set of language functions relevant to a specific context. Consequently, this specialization makes them more effective in those specific domains.

Proprietary vs open source LLMs

LLMs operate within two predominant models: proprietary and open source.

Proprietary LLMs are owned by companies, require a license to use and often come with restrictions set out in the terms and conditions. Users usually have to pay a license fee and are prohibited from sharing or distributing the software or its outputs without authorization.

Open source LLMs are freely accessible to anyone, allowing improvement, modification and distribution without stringent limitations.

It is up to the company implementing an LLM to choose which direction to take.

Open source LLMs license models

With regard to open source models, two license models can apply: copyleft licenses and permissive licenses. Below we distinguish between the two.

The “Copyleft” license

Concerning copyleft licensing, there are various legal aspects that both the creators and users of copyleft-licensed works should be mindful of. Here are some legal considerations to bear in mind:

License Compatibility: It is crucial to verify that your copyleft license aligns with any other licenses applicable to your work. Certain licenses may be incompatible with each other, leading to legal complications if you attempt to merge works licensed under different terms.

Viral Effect: The viral effect of copyleft licenses dictates that any work derived from a copyleft-licensed work must also be licensed under the same copyleft terms. This can be a significant consideration for both creators and users, as it affects the ways in which the work can be used and distributed.

International Considerations: Copyleft licensing is a global phenomenon, and it is crucial to comprehend how the chosen license will be interpreted and enforced in various jurisdictions worldwide. Different countries may have distinct legal requirements and interpretations of copyleft licenses, necessitating thorough research before selecting a license.

Numerous copyleft licenses are available, such as the GNU General Public License (GPL) and the Creative Commons ShareAlike license. While these licenses come with distinct terms and conditions, they fundamentally revolve around a common principle: companies that use or modify a copyleft-licensed work are obligated to distribute their derived work under the identical license terms. Please note that what constitutes a "derived work" must be interpreted in light of the specific open source license. "Derived work" (or the term used in the copyleft license) is not necessarily as limited in scope as "derived work" would be under copyright law.

Some copyleft licenses define "derived work" as the entire product in which the open source component is used, in addition to material based on the original component. This is referred to as the so-called "strong" copyleft effect.

The intent behind incorporating copyleft clauses is to preserve the freedom granted by the open-source license for any "derived work." The underlying principle is the promotion of collective contributions to a growing repository of source code that remains open and accessible to anyone for use, commercial exploitation and further enhancement. By contrast, commercial developers typically aim to keep their entire source code confidential to deter plagiarism and other infringements. Additionally, they often prefer licensing their products under a stringent proprietary license of their choosing. Such licenses typically only provide the right to use the product for the licensee's internal purposes, without permitting commercialization, modifications or further development. Essentially, commercial developers seek to preserve the exclusive rights granted by copyright law.

When copyleft clauses come into play, developers of "derived work" are unable to dictate the terms and conditions for licensing the "derived work." Consequently, the copyleft effect is often deemed commercially unprofitable when an open-source component forms a "derived work."

In the context of LLMs, an example of a copyleft license is GPL 3.0. The GPL 3.0 requires that any derivative works of the software be licensed under the same license. This means that if you incorporate GPL-3.0-licensed software into your project and distribute the result, your project must also be licensed under GPL-3.0.

Permissive licenses

Using permissive open-source components typically presents fewer challenges compared to copyleft ones, as permissive licenses generally impose less strict obligations. Common permissive licenses include Apache 2.0, MIT and various BSD licenses. In general, permissive licenses grant users the right to use, copy, modify and distribute copies of the licensed source code component.

Developers can take the permissive-licensed software, make it their own through changes or additions, and keep their new version to themselves or share it if they choose to. This is a major advantage if you're looking to create proprietary software that you can sell and keep secret from competitors, and it is one of the main reasons why permissive licenses are popular.

However, these licenses often make such rights contingent on providing licensing information to the company's own licensees, including attributions of copyright owners and disclaimers. Consequently, failure to comply with this requirement could potentially render open-source license grants invalid. It's important to note that this risk applies to all open-source licenses, not just the permissive ones. Infringement of intellectual property rights may occur if open source is used without full compliance with the respective license terms.

The most popular permissive licenses are:

Apache 2.0 License

Requires the license and copyright notices to be included with the distributed code and/or as a notice in the software. However, derivative works, larger projects or modifications are permitted to carry different licensing terms when distributed and are not required to provide source code.

MIT License

This one bears the name of the famous university where it originated and is short, clear and easy to understand. It allows anyone to do whatever they wish with the original code, as long as the original copyright and license notice is included in the distributed source code or software.

Moreover, not all open-source licenses can be seamlessly combined with components licensed under other open-source licenses. For instance, it is generally assumed that a component licensed under the permissive MIT license can be integrated into a larger work licensed under the copyleft GPL license. Conversely, a component licensed under the GPL license may not be integrated into a larger piece of work intended to be licensed under the MIT license.
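The one-directional nature of this compatibility can be illustrated in code. The following is a minimal sketch, not legal advice: the table and the function name incompatible_components are hypothetical, the MIT and GPL-3.0 entries encode only the relationships described in this article, and the Apache-2.0 entries are an assumption added for illustration. Licenses missing from the table are conservatively flagged.

```python
# Hypothetical, deliberately small table: for each component license, the
# outbound licenses of a larger work into which such a component may be
# integrated. MIT and GPL-3.0 follow the article; Apache-2.0 is an assumption.
CAN_BE_INCLUDED_IN = {
    "MIT": {"MIT", "Apache-2.0", "GPL-3.0"},
    "Apache-2.0": {"Apache-2.0", "GPL-3.0"},
    "GPL-3.0": {"GPL-3.0"},
}


def incompatible_components(outbound_license, components):
    """Return the names of components that cannot be integrated into a work
    distributed under outbound_license, according to the table above.

    components maps a component name to its license identifier. A license
    that is not in the table is treated as incompatible, so that it gets
    reviewed by a person rather than silently accepted.
    """
    problems = []
    for name, component_license in components.items():
        allowed = CAN_BE_INCLUDED_IN.get(component_license, set())
        if outbound_license not in allowed:
            problems.append(name)
    return sorted(problems)


# An MIT component inside a GPL-3.0 work is accepted ...
mit_in_gpl = incompatible_components("GPL-3.0", {"tokenizer": "MIT"})
# ... but a GPL-3.0 component inside an MIT work is flagged.
gpl_in_mit = incompatible_components("MIT", {"core": "GPL-3.0", "utils": "MIT"})
```

A real compliance check would rely on a maintained compatibility matrix and legal review; the point here is only that compatibility depends on direction.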

A list of LLMs with open source licenses can be found on GitHub: https://github.com/eugeneyan/open-llms

Novel Licensing Approaches: RAIL (Responsible AI Licence) License

The evolving landscape introduces innovative licensing approaches, such as the RAIL license, which combines an open access approach with behavioral restrictions. This nuanced copyright license aims to enforce responsible AI use, introducing usage-based restrictions for models like OPT, Stable Diffusion and BLOOM.

This license has certain use-based restrictions. For example, the model cannot be used in anything that violates laws and regulations, exploits or harms minors, or discriminates against or harms “individuals or groups based on social behavior or known or predicted personal or personality characteristics”. For more information, see https://www.licenses.ai/

Some models under this license are: OPT, Stable Diffusion and BLOOM

BLOOM is an open-access multilingual language model available for commercial use under the bigscience-bloom-rail-1.0 license, with restrictions on providing medical advice and interpreting medical results.

Case Studies: Vicuna and Llama 2

Vicuna for research purposes

Vicuna is an open-source chatbot created by fine-tuning LLaMA. The Vicuna model card shows an Apache 2.0 license, which on its face permits commercial use. However, the underlying LLaMA weights are not licensed for commercial use. Vicuna therefore illustrates the complexity of licensing LLMs commercially: despite the Apache 2.0 license, the restrictions on the underlying LLaMA weights effectively limit the model to research purposes.
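The Vicuna case suggests a simple rule of thumb: a fine-tuned model is only as commercially usable as the most restrictive model in its lineage. Below is a minimal sketch of that rule. The ModelRecord class and the commercially_usable function are hypothetical names introduced for illustration, and the boolean flags simply restate what this article says about Vicuna and LLaMA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRecord:
    """Hypothetical record of a model and the model it was fine-tuned from."""
    name: str
    commercial_use_allowed: bool
    base: Optional["ModelRecord"] = None


def commercially_usable(model):
    """Return True only if the model and every ancestor in its fine-tuning
    chain allow commercial use."""
    current = model
    while current is not None:
        if not current.commercial_use_allowed:
            return False
        current = current.base
    return True


# Flags as described in this article: the Vicuna model card shows Apache 2.0,
# while the underlying LLaMA weights are not available for commercial use.
llama = ModelRecord("LLaMA", commercial_use_allowed=False)
vicuna = ModelRecord("Vicuna", commercial_use_allowed=True, base=llama)

vicuna_ok = commercially_usable(vicuna)  # False, because LLaMA blocks it
```

In practice the check would have to cover training data and any other inherited components as well, not just the base weights.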

Llama 2 with additional commercial restrictions

Under the Llama 2 license terms, if the monthly active users of the products or services offered by or on behalf of the licensee, or the licensee's affiliates, exceeded 700 million in the preceding calendar month (the license measures this as of the Llama 2 release date), you are required to request a license from Meta. Meta may grant such a license at its sole discretion, and you are not permitted to exercise any rights under the agreement until Meta expressly grants them. This runs counter to open-source principles.

Secondly, concerning the weights: Meta does not make the weights publicly available. To obtain a copy of the weights from Meta, you need to submit an application. Furthermore, these weights cannot be used to train any language model other than Llama 2, unless you obtain explicit written approval from Meta.
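The monthly active user threshold described earlier in this section can be written as a simple check. This is a minimal sketch: the function name and the way user counts are supplied are assumptions, the 700 million figure is taken from the terms as described above, and the check is no substitute for reading the agreement itself.

```python
# Threshold from the Llama 2 terms as described in this article.
LLAMA2_MAU_THRESHOLD = 700_000_000


def needs_separate_meta_license(monthly_active_users):
    """Return True if the combined monthly active users of the licensee and
    its affiliates in the relevant calendar month exceed the threshold.

    monthly_active_users is assumed to be a single combined figure that the
    caller has already calculated.
    """
    return monthly_active_users > LLAMA2_MAU_THRESHOLD


small_company = needs_separate_meta_license(2_500_000)
very_large_platform = needs_separate_meta_license(900_000_000)
```

For the vast majority of organizations the check returns False, which is why the restriction matters mainly to the largest platforms.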

Data Governance and licensing

LLMs licensing should be part of a risk assessment and subject to due diligence and/or a data governance policy.

This policy should be revisited to account for the specific risks that arise when a business develops an LLM or incorporates one into the technology it uses to conduct business or provide products or services to customers. A series of questions should also be asked about the sources from which the data has been taken, the licensing arrangements attached to that data and to the LLM, and the methods used to source the data. Sometimes the platforms from which the data has been obtained allow for, and even encourage, public access.

The legal terms governing the LLMs and the data used to train them should be thoroughly reviewed. This review is particularly important in reducing the risk that the permissions granted by a data provider or platform owner do not cover the intended use, or that the data is used in a way the terms specifically prohibit, such as for training LLMs.

For financial institutions (for example regulated in the UK by the Prudential Regulation Authority (PRA)) it is also a regulatory matter. The PRA has said they must ensure that they “obtain appropriate assurance and documentation from third parties on the provenance or lineage of the data to be satisfied that it has been collected and processed in line with applicable legal and regulatory requirements”.

In other sectors, robust contractual protections will need to be put in place, and internal governance structures, policies, processes and controls will be necessary to take advantage of the huge potential LLMs have to transform business.

Which open-source license is best suited to you?

Copyleft License Caution: Careful consideration is essential when opting for a copyleft license due to the restrictions described in the copyleft section above.
Copyleft vs. Permissive: Generally, copyleft licenses impose more restrictions and possibly offer less liability compared to permissive licenses. When prioritizing code reusability and shareability, a moderately permissive license is often the preferable choice.
GPL Versions and Compatibility: The GPL license exists in two main versions—GPLv2 and GPLv3. Noteworthy differences in GPLv3 address issues not covered in GPLv2, such as patents, and enhance compatibility with other open-source licenses like the Apache License 2.0. It's crucial to recognize that GPLv2 and GPLv3 are not compatible with each other.
Advantages of MIT Licenses: MIT licenses enjoy widespread use, boasting recognition and common understanding. Software licensed under MIT entails no restrictions on redistribution or monetization, making it appealing for various applications. Additionally, MIT licenses are compatible with many other open-source licenses, enabling the use of MIT-licensed code in projects employing different licenses.

Commercial considerations

The commercial deployment of LLMs demands a nuanced understanding of licensing terms. Organizations must carefully evaluate the conditions set by providers, ensuring compliance and staying informed about evolving requirements. Rapid developments, such as open-source alternatives to initially restricted models, underscore the need for continuous monitoring in this dynamic landscape.

Finding a balance between technical advantages and legal complexities is imperative for responsible and effective implementation. A thorough understanding of licensing models, coupled with vigilant monitoring, is the key to unlocking the transformative potential of LLMs in the ever-evolving landscape of artificial intelligence.

Last updated: 12 December 2023

Written by

Włodzimierz Marat

Head of Legal & Compliance
