Deploying Furiosa-LLM with llm-d#
llm-d is a Kubernetes-native distributed inference framework that lets you serve LLM models across a cluster. Adopting llm-d can provide the following benefits:
Intelligent Inference Scheduling: llm-d provides a configurable load balancer with pluggable scorers, including metrics-based and prefix-cache-aware scorers, to route serving requests to optimal pods.
Prefill/Decode Disaggregation: llm-d selects optimal Prefill and Decode pods and relays KV Cache transfers between the designated pods.
Wide Expert-Parallelism: llm-d supports wide expert parallelism to deploy large Mixture-of-Experts (MoE) models.
llm-d integration with Furiosa-LLM#
Furiosa-LLM can be integrated with llm-d to support distributed serving of LLM models using RNGDs. Currently, the following integrations are supported:
Intelligent Inference Scheduling: Furiosa-LLM implements the Model Server Protocol’s metrics reporting to support Intelligent Inference Scheduling. The corresponding metrics are as follows:
Model Server Protocol metrics

Metric                 Furiosa-LLM Metric
TotalQueuedRequests    furiosa_llm_num_requests_waiting
TotalRunningRequests   furiosa_llm_num_requests_running
KVCacheUtilization     furiosa_llm_kv_cache_usage_percent
BlockSize              furiosa_llm_cache_config_info (label: block_size)
NumGPUBlocks           furiosa_llm_cache_config_info (label: num_gpu_blocks)
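Once a Furiosa-LLM server is running (see the deployment steps below), you can spot-check that these metrics are exposed. This is a minimal sketch that assumes the server publishes Prometheus metrics at /metrics on its serving port (8000) and uses the serving label from the deployment in Step 5:
# Pick one serving pod, forward its port, and look for the metric names from the table above.
POD=$(kubectl get pods -n llm-d -l llm-d.ai/inferenceServing=true -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "$POD" 8000:8000 -n llm-d &
curl -s http://localhost:8000/metrics | grep furiosa_llm_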
The following integrations are not currently supported:
Precise Prefix-Cache-Aware Scoring: Furiosa-LLM currently does not implement KV Cache events.
Prefill/Decode Disaggregation: Furiosa-LLM currently does not support prefill/decode disaggregation.
Wide Expert-Parallelism: Furiosa-LLM currently does not support wide expert parallelism.
Deploying Furiosa-LLM with llm-d#
This section describes how to deploy Furiosa-LLM with llm-d. The resulting deployment has Intelligent Inference Scheduling enabled, so requests are routed based on the metrics described above. This guide is based on llm-d’s Well-lit Path: Intelligent Inference Scheduling guide.
Prerequisites#
A Kubernetes cluster equipped with two or more Furiosa RNGD devices.
Hugging Face account and access token.
A Kubernetes storage class which supports dynamic volume provisioning.
For detailed instructions on setting up an RNGD cluster, please refer to Installing Prerequisites and Kubernetes Plugins.
You will also install Gateway API and a GIE-compatible gateway (Istio) in the steps below.
llm-d-modelservice and Helm repository#
llm-d provides the llm-d-modelservice Helm chart to simplify LLM deployment. We provide a fork of that chart to run Furiosa-LLM on RNGDs. Add the Furiosa Helm repository now so it is available when you deploy the model server in Step 5:
helm repo add furiosa https://furiosa-ai.github.io/helm-charts
helm repo update
Step 1: Set up Gateway API CRDs#
llm-d utilizes Kubernetes’ Gateway API Inference Extension (GIE). Appropriate CRDs need to be installed in the cluster.
# Install Gateway API CRDs
kubectl apply -f \
https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml
# Install Gateway API Inference Extension CRDs
kubectl apply -f \
https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.3.0/manifests.yaml
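You can confirm that the CRDs were installed by listing them. The InferencePool CRD name below follows the inference.networking.k8s.io API group used later in this guide:
kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io
kubectl get crd inferencepools.inference.networking.k8s.io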
Step 2: Deploy GIE-compatible Gateway#
A list of GIE-compatible gateways can be found in the Gateway API Inference Extension documentation. For this guide, we will deploy Istio as the GIE-compatible gateway. Install the base chart first, then the discovery chart (istiod) with GIE enabled.
Add the Istio Helm repository and install istio-base:
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
helm install istio-base istio/base -n istio-system \
--set defaultRevision=default --create-namespace
Install the Istio discovery chart (istiod) with GIE support, using the following values file (referenced below as /path/to/istiod.yaml):
meshConfig:
defaultConfig:
proxyMetadata:
ENABLE_GATEWAY_API_INFERENCE_EXTENSION: "true"
pilot:
env:
ENABLE_GATEWAY_API_INFERENCE_EXTENSION: "true"
helm install istiod istio/istiod -f /path/to/istiod.yaml -n istio-system --wait
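Before moving on, you can verify that Istio is running and that it registered the istio GatewayClass, which the llm-d-infra chart references in Step 5:
kubectl get pods -n istio-system
kubectl get gatewayclass istio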
Step 3: Prepare Kubernetes Secret#
We will use the llm-d namespace to deploy llm-d with the llm-d-modelservice Helm chart.
Create a Kubernetes secret for your Hugging Face token:
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
type: Opaque
data:
HF_TOKEN: <your_base64_encoded_hf_token>
Encode your token using the following command:
echo -n '<your_HF_TOKEN>' | base64
Then apply the secret:
kubectl create namespace llm-d
kubectl apply -f /path/to/hf-secret.yaml -n llm-d
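Equivalently, you can create the same secret directly from the command line and let kubectl handle the base64 encoding:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='<your_HF_TOKEN>' -n llm-d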
Step 4: Download the Target LLM Model#
Furiosa-LLM deployed using llm-d-modelservice can load Hugging Face models from a PVC, so models can be pre-downloaded before serving.
In this guide, we will pre-download the furiosa-ai/Llama-3.1-8B-Instruct model from Hugging Face Hub.
Create a PVC to store the model:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
# The storage class "default" is an example, use an appropriate one for your cluster.
storageClassName: default
kubectl apply -f /path/to/model-pvc.yaml -n llm-d
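Depending on the storage class, the PVC may remain Pending until the first consumer pod is scheduled (volume binding mode WaitForFirstConsumer); you can check its status with:
kubectl get pvc model-pvc -n llm-d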
Create a pod to download the model:
apiVersion: v1
kind: Pod
metadata:
name: model-downloader
spec:
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: model-pvc
initContainers:
- name: download-model
image: python:3.10-slim
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: HF_TOKEN
command:
- /bin/bash
- -c
- >
pip install --no-cache-dir huggingface_hub &&
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='furiosa-ai/Llama-3.1-8B-Instruct', cache_dir='/models')"
volumeMounts:
- name: model-storage
mountPath: /models
containers:
- name: model-consumer
image: ubuntu
command: ["sleep", "infinity"]
volumeMounts:
- name: model-storage
mountPath: /models
restartPolicy: Never
kubectl apply -f /path/to/model-downloader.yaml -n llm-d
Wait for the model download to complete before proceeding (e.g. check the pod status with kubectl get pods -n llm-d and logs with kubectl logs model-downloader -n llm-d -c download-model).
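Once the init container finishes, you can also verify that the model files landed on the PVC by listing them from the consumer container:
kubectl exec model-downloader -n llm-d -c model-consumer -- ls /models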
For detailed instructions on how to pre-download a model on PVCs, please refer to the upstream documentation.
Step 5: Deploy llm-d#
First, deploy the llm-d-infra Helm chart, which creates the inference Gateway used by the Inference Scheduler and by the HTTPRoute in Step 6. Create a values file (referenced below as /path/to/llm-d-infra.yaml) that selects the istio gateway class:
gateway:
gatewayClassName: istio
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add llm-d-infra https://llm-d-incubation.github.io/llm-d-infra/
helm repo update
helm install llm-d-infra llm-d-infra/llm-d-infra -n llm-d -f /path/to/llm-d-infra.yaml
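If the install succeeded, the chart creates the inference Gateway that the HTTPRoute in Step 6 attaches to (named llm-d-infra-inference-gateway in this guide); you can confirm it exists with:
kubectl get gateway -n llm-d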
Then, deploy the llm-d Inference Scheduler using the Helm chart. In the values file (referenced below as /path/to/inference-scheduler.yaml), set the metric flags to the corresponding Furiosa-LLM metrics from the table above to enable metrics-based request routing.
inferenceExtension:
replicas: 1
flags:
cache-info-metric: "furiosa_llm_cache_config_info"
kv-cache-usage-percentage-metric: "furiosa_llm_kv_cache_usage_percent"
lora-info-metric: ""
total-queued-requests-metric: "furiosa_llm_num_requests_waiting"
total-running-requests-metric: "furiosa_llm_num_requests_running"
image:
name: llm-d-inference-scheduler
hub: ghcr.io/llm-d
tag: v0.5.0
pullPolicy: Always
extProcPort: 9002
pluginsConfigFile: "default-plugins.yaml"
monitoring:
interval: "10s"
# Prometheus ServiceMonitor will be created when enabled for EPP metrics collection
secret:
name: inference-scheduling-gateway-sa-metrics-reader-secret
prometheus:
enabled: true
auth:
# To allow unauthenticated /metrics access (e.g., for debugging with curl), set to false
enabled: true
inferencePool:
targetPorts:
- number: 8000
modelServerType: vllm
modelServers:
matchLabels:
llm-d.ai/inferenceServing: "true"
provider:
name: istio
istio:
destinationRule:
host: "gaie-epp.llm-d.svc.cluster.local"
helm install gaie \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
--version v1.3.0 \
-n llm-d \
-f /path/to/inference-scheduler.yaml
The llm-d Inference Scheduler is deployed using GIE’s InferencePool Helm chart. For further information on customizing the deployment, please refer to the InferencePool Helm chart documentation.
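You can confirm that the InferencePool and its endpoint picker (EPP) Service were created; the names below come from the gaie release name and the gaie-epp destinationRule host used above:
kubectl get inferencepool gaie -n llm-d
kubectl get svc gaie-epp -n llm-d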
Finally, deploy the llm-d-modelservice Helm chart to run the Furiosa-LLM model server.
multinode: false
modelArtifacts:
name: "furiosa-ai/Llama-3.1-8B-Instruct"
uri: "pvc+hf://model-pvc/furiosa-ai/Llama-3.1-8B-Instruct"
size: 20Gi
authSecretName: "hf-token-secret"
labels:
llm-d.ai/inferenceServing: "true"
llm-d.ai/model: "Llama-3.1-8B-Instruct"
accelerator:
type: "furiosa"
routing:
proxy:
enabled: false
targetPort: 8000
prefill:
create: false
decode:
parallelism:
tensor: 8
replicas: 2
containers:
- name: "furiosa-llm"
image: furiosaai/furiosa-llm:latest
modelCommand: furiosaLLMServe
args:
- --enable-prefix-caching
- --disable-uvicorn-access-log
ports:
- containerPort: 8000
name: furiosa-llm
protocol: TCP
mountModelVolume: true
startupProbe:
httpGet:
path: /v1/models
port: furiosa-llm
initialDelaySeconds: 15
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 60
livenessProbe:
httpGet:
path: /health
port: furiosa-llm
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /v1/models
port: furiosa-llm
periodSeconds: 5
timeoutSeconds: 2
failureThreshold: 3
helm install ms furiosa/llm-d-modelservice -n llm-d -f /path/to/llm-d-modelservice.yaml
The llm-d-modelservice Helm chart provides more configuration options than the ones shown in the example above. Please refer to the llm-d-modelservice Helm chart documentation for details.
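Before wiring up the route, check that the decode pods come up Running and Ready, using the serving label set in the values file above:
kubectl get pods -n llm-d -l llm-d.ai/inferenceServing=true
Note that model loading can take some time; the startup probe above allows up to roughly 30 minutes (failureThreshold 60 × periodSeconds 30).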
Step 6: Expose the service with an HTTPRoute#
Create an HTTPRoute so that inference requests are routed to the llm-d Inference Scheduler:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: llm-d
spec:
parentRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: llm-d-infra-inference-gateway
rules:
- backendRefs:
- group: inference.networking.k8s.io
kind: InferencePool
name: gaie
port: 8000
weight: 1
timeouts:
backendRequest: 0s
request: 0s
matches:
- path:
type: PathPrefix
value: /
kubectl apply -f /path/to/httproute.yaml -n llm-d
After the route is applied, you can send inference requests to the Gateway address (e.g. via the Istio ingress). For details on how to obtain the Gateway URL and call the API, see the llm-d Inference Against llm-d guide.
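As a quick smoke test, you can port-forward the Gateway’s Service and send an OpenAI-compatible chat completion request. This is a sketch only: the Service name (llm-d-infra-inference-gateway-istio) and listener port (80) are assumptions based on how Istio typically provisions a Service for a Gateway, so check kubectl get svc -n llm-d for the actual name and port in your cluster:
# Assumed Service name and listener port; adjust to match your cluster.
kubectl port-forward svc/llm-d-infra-inference-gateway-istio 8000:80 -n llm-d &
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "furiosa-ai/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'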