Deploy open source LLM in your private cluster with Hugging Face and GKE Autopilot
Deploying applications based on Large Language Models (LLMs) presents numerous challenges, particularly around privacy, reliability and ease of deployment (see: Finding your way through the Large Language Models Hype - GetInData). One option is to deploy LLMs on your own infrastructure, but this often involves intricate setup and maintenance, which makes it a daunting task for many developers. Relying solely on public LLM APIs may not be feasible either: many companies restrict the use of external services, or process valuable data that includes PII (see: Run your first, private Large Language Model (LLM) on Google Cloud Platform - GetInData). Furthermore, public APIs often lack the service level agreements (SLAs) required to build robust and dependable applications on top of LLMs.
In this blog post, we will guide you through deploying Falcon LLM within your private Google Kubernetes Engine (GKE) Autopilot cluster using the text-generation-inference library from Hugging Face. The library is built with high-performance deployments in mind and is used by Hugging Face themselves in production to power the HF Model Hub widgets.
Falcon - the new best in class, open source large language model (at least in June 2023 🙃)
Falcon LLM is one of the most popular open source Large Language Models, and it recently took the OSS community by storm. Check out the Open LLM Leaderboard to compare the different models. This guide focuses on deploying the Falcon-7B-Instruct version; however, the same approach can be applied to other models, including Falcon-40B, depending on the available hardware and budget.
Prerequisites
Before diving into the deployment process of Falcon LLM in your private GKE Autopilot cluster, there are a few prerequisites you need to have in place:
Access to a GCP Project and a created GKE Autopilot Cluster: Ensure that you have access to a Google Cloud Platform (GCP) project and have set up a GKE Autopilot cluster. The Autopilot cluster provides a managed Kubernetes environment with automatic scaling and efficient resource allocation without the need for manual cluster configuration, which makes it a good fit for fast deployments. Make sure that the cluster is in a region/zone that supports GPUs.
Quota for Running T4 GPUs (or A100 for Falcon-40B): Verify that your GCP project has the necessary quota available to run T4 GPUs (or A100 GPUs for Falcon-40B).
Artifact Registry: Set up an Artifact Registry for Docker containers within your GCP project. Make sure that Artifact Registry is in the same region as the GKE cluster to ensure high-speed network transfer.
Google Cloud Storage (GCS) Bucket: Create a GCS bucket within the same region as your GKE cluster. The GCS bucket will be used to store model weights.
Basic Knowledge of Kubernetes: Having a basic understanding of Kubernetes concepts and operations is required to use this deployment process.
The following diagram shows the target infrastructure that you will obtain after following this blog post.
Model files preparation
In order to make deployments private and reliable, you first have to download the model’s files (weights, configuration and miscellaneous files) and store them in GCS.
If you clone the model repository with git, note that the disk size required is typically double the model size due to the git overhead.
Hugging Face Hub SDK: Another option is to use the Hugging Face Hub SDK, which provides a convenient way of fetching and interacting with models programmatically. Refer to the Hugging Face Hub SDK documentation (https://huggingface.co/docs/huggingface_hub/quick-start) for a quick start guide on using the SDK.
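As a rough sketch of the SDK route, the helper below downloads every file of a model repository into a local directory using `snapshot_download` from the `huggingface_hub` package. The helper name `download_model`, the repository ID and the target directory are assumptions made for illustration, not part of the original setup.

```python
from pathlib import Path


def download_model(repo_id: str, target_dir: str) -> Path:
    """Download all files of a Hugging Face Hub repository into target_dir."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the path of the folder holding the files.
    local_path = snapshot_download(repo_id=repo_id, local_dir=target_dir)
    return Path(local_path)


# Example call (assumed repository ID and directory; requires network access):
# download_model("tiiuae/falcon-7b-instruct", "./falcon-7b-instruct")
```

The downloaded directory can then be copied to the GCS bucket created in the prerequisites.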
After downloading the model files, you should have the following structure on your local disk:
Hugging Face’s text-generation-inference comes with a pre-built Docker image that is ready to use for deployments and can be pulled from https://ghcr.io/huggingface/text-generation-inference. For private and secure deployments, however, it is better not to rely on an external container registry: pull this image and store it in the Google Artifact Registry. The compressed image size is approx. 4GB.
docker pull ghcr.io/huggingface/text-generation-inference:latest
docker tag ghcr.io/huggingface/text-generation-inference:latest <region>-docker.pkg.dev/<project id>/<name of the artifact registry>/huggingface/text-generation-inference:latest
docker push <region>-docker.pkg.dev/<project id>/<name of the artifact registry>/huggingface/text-generation-inference:latest
You should also perform the same actions (pull/tag/push) for the google/cloud-sdk:slim image, which will be used to download model files in the initContainer of the deployment.
To let the deployment read the model files from GCS through Workload Identity, you will need:
A Kubernetes service account in the GKE Autopilot cluster
A new service account in GCP that will be connected to the Kubernetes service account
IAM permissions for the GCP service account (the Storage Object Viewer role on the bucket with the model’s files should be enough).
Preparing the deployment
Storage Class (storage.yaml)
Begin by creating a Storage Class that utilizes fast SSD drives for volumes. This will ensure optimal performance for the Falcon LLM deployment - you want to copy the model files from GCS and load them as fast as possible.
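A minimal sketch of such a Storage Class, written as a Python dict and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The class name `ssd`, the reclaim policy and the helper name `build_storage_class` are assumptions; the provisioner and the `pd-ssd` disk type are the standard GKE persistent disk CSI settings.

```python
import json


def build_storage_class(name: str = "ssd") -> dict:
    """Return a StorageClass manifest backed by SSD persistent disks."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},  # assumed name, pick your own
        # Compute Engine persistent disk CSI driver used by GKE.
        "provisioner": "pd.csi.storage.gke.io",
        "parameters": {"type": "pd-ssd"},
        # Delay provisioning until a pod is scheduled, so the disk is
        # created in the same zone as the node that runs the pod.
        "volumeBindingMode": "WaitForFirstConsumer",
        "reclaimPolicy": "Delete",
    }


manifest_json = json.dumps(build_storage_class(), indent=2)
print(manifest_json)
```

Saving the printed output to a file and applying it with `kubectl apply -f` should create the class.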
The Deployment manifest is responsible for deploying the text-generation-inference-based model, which exposes APIs as shown in the diagram above. Additionally, it includes an initContainer, which pulls the necessary Falcon LLM files from GCS into an ephemeral Persistent Volume Claim (PVC) for the main container to load.
The node selectors instruct GKE Autopilot to enable the metadata server for the deployment (to use Workload Identity) and to attach an nvidia-tesla-t4 GPU to the node that will execute the deployment. You can find out more about GPU-based workloads in GKE Autopilot in the GKE documentation.
The init container copies the data from GCS into the local SSD-backed PVC. Note that we’re using the gcloud alpha storage command, the new, throughput-optimized CLI for Google Cloud Storage, which can achieve a throughput of approximately 550MB/s (this may vary depending on the zone, cluster, etc.).
The main container (model) starts the text-generation-launcher with the quantized model (8-bit quantization) and, most importantly, with the --weights-cache-override and --huggingface-hub-cache parameters overridden. Together with the HUGGINGFACE_OFFLINE environment variable, they prevent the container from downloading anything from the Hugging Face Hub directly, giving you full control over the deployment.
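The three points above could be combined into a manifest roughly like the one below, again built as a Python dict and printed as JSON for `kubectl apply -f`. This is a sketch under assumptions, not the original manifest: the helper name, the `/data` mount path, the 100Gi volume size, port 8080, the container names, the launcher argument values and every `<...>` placeholder are illustrative. The node selector keys, the launcher flags and the `HUGGINGFACE_OFFLINE` variable are the ones named in the text.

```python
import json

MODEL_DIR = "/data/falcon-7b-instruct"  # assumed location of the copied files


def build_deployment(tgi_image, sdk_image, gcs_uri, service_account, storage_class):
    """Build a Deployment manifest as a plain dict (values are placeholders)."""
    labels = {"app": "falcon-7b-instruct"}
    mount = {"name": "model-storage", "mountPath": "/data"}
    volume = {
        "name": "model-storage",
        # Generic ephemeral volume: the PVC is created and deleted with the pod.
        "ephemeral": {"volumeClaimTemplate": {"spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": "100Gi"}},  # assumed size
        }}},
    }
    init_container = {
        "name": "fetch-model",
        "image": sdk_image,
        # Copy the model directory from GCS into the SSD-backed volume.
        "command": ["gcloud", "alpha", "storage", "cp", "-r", gcs_uri, "/data"],
        "volumeMounts": [mount],
    }
    model_container = {
        "name": "model",
        "image": tgi_image,
        "command": ["text-generation-launcher"],
        "args": [
            "--model-id", "tiiuae/falcon-7b-instruct",
            "--quantize", "bitsandbytes",  # 8-bit quantization
            "--port", "8080",  # assumed port
            "--weights-cache-override", MODEL_DIR,
            "--huggingface-hub-cache", MODEL_DIR,
        ],
        "env": [{"name": "HUGGINGFACE_OFFLINE", "value": "1"}],
        "ports": [{"containerPort": 8080}],
        "resources": {"limits": {"nvidia.com/gpu": "1"}},
        "volumeMounts": [mount],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "falcon-7b-instruct"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "serviceAccountName": service_account,
                    "nodeSelector": {
                        # Enables the metadata server (Workload Identity).
                        "iam.gke.io/gke-metadata-server-enabled": "true",
                        # Requests a node with a T4 GPU.
                        "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
                    },
                    "volumes": [volume],
                    "initContainers": [init_container],
                    "containers": [model_container],
                },
            },
        },
    }


deployment = build_deployment(
    tgi_image="<region>-docker.pkg.dev/<project id>/<name of the artifact registry>/huggingface/text-generation-inference:latest",
    sdk_image="<region>-docker.pkg.dev/<project id>/<name of the artifact registry>/google/cloud-sdk:slim",
    gcs_uri="gs://<bucket>/falcon-7b-instruct",
    service_account="<kubernetes service account>",
    storage_class="ssd",
)
print(json.dumps(deployment, indent=2))
```

The init container and the model container mount the same volume, so the files copied by the first are visible to the second when it starts.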
Service (service.yaml)
The last piece of the puzzle is a simple Kubernetes service, to expose the deployment:
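A minimal sketch of such a service, built as a Python dict and printed as JSON for `kubectl apply -f`. The service name `falcon-7b-instruct-service`, the ClusterIP type and port 8080 are assumptions; the selector reuses the `app=falcon-7b-instruct` label that the pods carry.

```python
import json


def build_service(name: str = "falcon-7b-instruct-service", port: int = 8080) -> dict:
    """Return a ClusterIP Service that routes traffic to the model pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},  # hypothetical name
        "spec": {
            # ClusterIP keeps the model reachable only from inside the cluster.
            "type": "ClusterIP",
            "selector": {"app": "falcon-7b-instruct"},
            "ports": [{"protocol": "TCP", "port": port, "targetPort": port}],
        },
    }


service_json = json.dumps(build_service(), indent=2)
print(service_json)
```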
After a few minutes, the deployment should be ready:
kubectl get pods -l app=falcon-7b-instruct
NAME READY STATUS RESTARTS AGE
falcon-7b-instruct-68d7b85f56-vbcsv 1/1 Running 0 8m
You can now connect to the model!
Connecting to the private Falcon-7B-Instruct model
Now that the model is deployed, you can connect to the model from within the cluster or from your local machine, by creating a port-forward to the service:
In order to query the model, you can use the HTTP API directly:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"Who is James Hetfield?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
Response:
{"generated_text":"\nJames Hetfield is a guitarist and singer for the American heavy metal band Metallica"}
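The same request can also be sent from Python using only the standard library. The sketch below mirrors the curl call: it posts the JSON payload to the `/generate` endpoint and reads `generated_text` from the response. The helper names `build_generate_request` and `generate` and the timeout value are assumptions for illustration.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str,
                           max_new_tokens: int = 17) -> urllib.request.Request:
    """Build the POST request for the /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, max_new_tokens: int = 17,
             timeout: float = 60.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

Calling `generate(base_url, "Who is James Hetfield?")` with `base_url` pointing at the forwarded local address should return the same text as the curl example.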
Or using Python (the example shows zero-shot text classification / sentiment analysis):
from text_generation import Client
client = Client("http://127.0.0.1:8080")
query = """
Answer the question using ONLY the provided context between triple backticks, and if the answer is not contained within the context, answer: "I don't know".
Only focus on the information within the context. Do NOT use any outside information. If the question is not about the context, answer: "I don't know".
Context:
```
I really like the new Metallica 72 Seasons album! I can't wait to hear the next one.
```
Question: What is the sentiment of the text? Answer should be one of the following words: "positive" / "negative" / "unknown".
"""
text = ""
for response in client.generate_stream(query, max_new_tokens=128):
    if not response.token.special:
        text += response.token.text
        print(response.token.text, end="")
print("")
Result:
> python query.py
Answer: "positive"
Full documentation of the service’s API is available here.
Summary
By following the steps in this blog post, you can deploy Falcon LLMs (or other OSS large language models) within a secure, private GKE Autopilot cluster. This approach allows you to maintain privacy and control over your data while benefiting from the scalability of a managed Kubernetes environment.
Embracing the power of Open Source LLMs empowers you to build reliable, efficient and privacy-focused applications that harness the capabilities of state-of-the-art language models.
20 July 2023