Deploying Furiosa-LLM on Kubernetes#
This guide describes how to deploy Furiosa-LLM, an OpenAI-compatible server optimized for Furiosa RNGD accelerators, on Kubernetes.
Prerequisites#
A Kubernetes cluster equipped with Furiosa RNGD devices.
A Hugging Face account and access token.
A Kubernetes storage class that supports dynamic volume provisioning.
For detailed instructions on setting up an RNGD cluster, please refer to Installing Prerequisites and Kubernetes Plugins.
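Before proceeding, you can confirm that the RNGD devices are visible to the Kubernetes scheduler. The command below is a quick check assuming the Furiosa device plugin advertises the furiosa.ai/rngd resource, the same resource name used in the Deployment manifest later in this guide:
kubectl describe node <node-name> | grep furiosa.ai/rngd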
Step 1: Prepare Kubernetes Secret#
Create a Kubernetes secret for your Hugging Face token:
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: <your_base64_encoded_hf_token>
Encode your token using the following command:
echo -n '<your_HF_TOKEN>' | base64
Then apply the secret:
kubectl apply -f secret.yaml
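Alternatively, if you prefer not to write a manifest, the same secret can be created directly from the command line; kubectl performs the base64 encoding for you:
kubectl create secret generic hf-token-secret --from-literal=token='<your_HF_TOKEN>'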
Step 2: Create Persistent Volume Claim (PVC)#
Note
Storing large files, such as LLM model caches, on a Pod's ephemeral storage may result in eviction due to disk pressure. Using a Persistent Volume Claim (PVC) is therefore highly recommended to ensure data durability and stability.
Create a PVC manifest to store models. The storage class specified below (“default”) is only an example; the storage classes available in your Kubernetes environment may differ. Verify and specify the appropriate storage class for your cluster:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # The storage class "default" is an example, use an appropriate one for your cluster.
  storageClassName: default
Apply the PVC:
kubectl apply -f pvc.yaml
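You can check that the claim was provisioned and bound (with a WaitForFirstConsumer storage class, the status may remain Pending until the Pod in Step 3 is scheduled):
kubectl get pvc llama-storage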
Step 3: Deploy Furiosa-LLM Server#
Create a Deployment manifest for serving the furiosa-ai/Llama-3.1-8B-Instruct-FP8 model with Furiosa-LLM.
For detailed information about the furiosa-ai/Llama-3.1-8B-Instruct-FP8 model, please see the Hugging Face model page.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-8b
  labels:
    app: llama-3-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-3-8b
  template:
    metadata:
      labels:
        app: llama-3-8b
    spec:
      containers:
        - name: llama-3-8b
          image: furiosaai/furiosa-llm:latest
          args:
            - "serve"
            - "furiosa-ai/Llama-3.1-8B-Instruct-FP8"
          ports:
            - containerPort: 8000
          resources:
            # Recommended resources for one RNGD card: 10 CPU cores and 100GB memory
            limits:
              cpu: 10
              memory: 100Gi
              furiosa.ai/rngd: "1"
            requests:
              cpu: 10
              memory: 100Gi
              furiosa.ai/rngd: "1"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          volumeMounts:
            - name: model-storage
              mountPath: /root/.cache/huggingface
          securityContext:
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: Unconfined
          # Increase initialDelaySeconds of livenessProbe and readinessProbe if larger models take longer to load
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 5
      volumes:
        # If dynamic PVC provisioning isn't possible in your cluster, consider using a hostPath volume.
        - name: model-storage
          persistentVolumeClaim:
            claimName: llama-storage
Apply the deployment:
kubectl apply -f deployment.yaml
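The first start can take several minutes while the model is downloaded and loaded onto the RNGD card, so it is worth watching the rollout before moving on:
kubectl rollout status deployment/llama-3-8b
kubectl get pods -l app=llama-3-8b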
Step 4: Expose the Deployment as a Service#
Expose the Furiosa-LLM server using a Kubernetes Service:
apiVersion: v1
kind: Service
metadata:
  name: llama-3-8b
spec:
  selector:
    app: llama-3-8b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Apply the service:
kubectl apply -f service.yaml
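Verify that the Service exists and has picked up the Pod as an endpoint:
kubectl get svc llama-3-8b
kubectl get endpoints llama-3-8b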
Step 5: Test the Deployment#
Confirm the server is running by inspecting the logs:
kubectl logs deployment/llama-3-8b
You should see output similar to:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
To test the inference endpoint, first forward the service port to your local machine:
kubectl port-forward svc/llama-3-8b 8000:8000
Then, issue a test request:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "furiosa-ai/Llama-3.1-8B-Instruct-FP8",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
You should receive a valid response from the Furiosa-LLM server, similar to:
{
  "id": "cmpl-69102e5d78c94e29b74660eaadbe39db",
  "object": "text_completion",
  "created": 1748675715,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " city that is known for its vibrant",
      "logprobs": null,
      "finish_reason": "length",
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7,
    "completion_tokens_details": null
  }
}
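Because Furiosa-LLM exposes an OpenAI-compatible API, you can exercise the chat-style endpoint in the same way. The request below is a minimal sketch that assumes the server also serves the /v1/chat/completions route for this model:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "furiosa-ai/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 32
  }'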
Conclusion#
Deploying Furiosa-LLM on Kubernetes combines the efficiency of Furiosa RNGD accelerators with the scalability and operational tooling of Kubernetes. This guide has walked through a straightforward setup for serving inference workloads on your cluster.
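As a next step, you can scale the Deployment across additional RNGD-equipped nodes; each replica requests one furiosa.ai/rngd resource, so the replica count is bounded by the RNGD cards available in the cluster:
kubectl scale deployment/llama-3-8b --replicas=2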