Quick Start with Furiosa-LLM#
Furiosa-LLM is a serving framework for large language models (LLMs) that runs on FuriosaAI’s NPUs. It provides a Python API compatible with vLLM and a server compatible with OpenAI’s API. This document explains how to install and use Furiosa-LLM.
Warning
This document is based on Furiosa SDK 2025.2.0. The features and APIs described herein are subject to change in the future.
Installing Furiosa-LLM#
The minimum requirements for Furiosa-LLM are as follows:
- A system with the prerequisites installed (see Installing Prerequisites)
- Python 3.9, 3.10, 3.11, or 3.12
- PyTorch 2.5.1
- Sufficient storage space for model weights (varies depending on the model size)
To install the furiosa-compiler package and Furiosa-LLM, run the following commands:
sudo apt install -y furiosa-compiler
pip install --upgrade pip setuptools wheel
pip install --upgrade furiosa-llm
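To verify the installation, you can query the installed package version from Python. This is a minimal sanity check; it only inspects package metadata and does not require NPU access:
# Optional: verify that the furiosa-llm package is installed and print its version.
# This checks installed package metadata only; it does not touch the NPU.
from importlib.metadata import version

print(version("furiosa-llm"))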
Offline Batch Inference with Furiosa-LLM#
We now explain how to perform offline LLM inference using the Python API of Furiosa-LLM.
First, import the LLM and SamplingParams classes from the furiosa_llm module. The LLM class is used to load LLM models and provides the core API for LLM inference. SamplingParams is used to specify various parameters for text generation.
from furiosa_llm import LLM, SamplingParams
# Load the Llama 3.1 8B Instruct model
llm = LLM.load_artifact("furiosa-ai/Llama-3.1-8B-Instruct-FP8", devices="npu:0")
# You can specify various parameters for text generation
sampling_params = SamplingParams(min_tokens=10, top_p=0.3, top_k=100)
# Prompt for the model
message = [{"role": "user", "content": "What is the capital of France?"}]
prompt = llm.tokenizer.apply_chat_template(message, tokenize=False)
# Generate text
response = llm.generate([prompt], sampling_params)
# Print the output of the model
print(response[0].outputs[0].text)
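Because generate accepts a list of prompts and returns one result per prompt, you can also batch several prompts in a single call. Below is a minimal sketch that reuses the llm and sampling_params objects from the example above:
# Build several chat prompts and generate responses for all of them in one batch
messages = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Name three primary colors."}],
]
prompts = [llm.tokenizer.apply_chat_template(m, tokenize=False) for m in messages]

# generate returns one result per prompt, in the same order as the input list
responses = llm.generate(prompts, sampling_params)
for r in responses:
    print(r.outputs[0].text)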
Streaming Inference with Furiosa-LLM#
In addition to batch inference, Furiosa-LLM also supports streaming inference.
The key difference of streaming inference is that tokens are returned as soon as they are generated. This allows you to start printing or processing partial tokens before the whole inference process finishes.
To perform streaming inference, use the stream_generate method instead of generate. This method is asynchronous and returns a stream of tokens as they are generated.
import asyncio
from furiosa_llm import LLM, SamplingParams
async def main():
    # Load the Llama 3.1 8B Instruct model
    llm = LLM.load_artifact("furiosa-ai/Llama-3.1-8B-Instruct-FP8", devices="npu:0")

    # You can specify various parameters for text generation
    sampling_params = SamplingParams(min_tokens=10, top_p=0.3, top_k=100)

    # Prompt for the model
    message = [{"role": "user", "content": "What is the capital of France?"}]
    prompt = llm.tokenizer.apply_chat_template(message, tokenize=False)

    # Generate text and print each token as soon as it arrives
    async for output_txt in llm.stream_generate(prompt, sampling_params):
        print(output_txt, end="", flush=True)

# Run the async main function
if __name__ == "__main__":
    asyncio.run(main())
Chat Inference with Furiosa-LLM#
Furiosa-LLM provides a high-level chat method for models with chat capabilities. This method simplifies interactions by handling prompt templating internally: it applies the appropriate chat template and invokes the generate method with the formatted prompt.
For detailed examples of using the chat API, refer to the Furiosa-LLM documentation.
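As a quick illustration, here is a minimal sketch of the chat method, assuming it accepts a list of role/content messages and the same SamplingParams as generate, and that its return value mirrors generate (a list of request outputs); check the API reference for the exact signature:
from furiosa_llm import LLM, SamplingParams

# Load the model and set sampling parameters (same model as in the examples above)
llm = LLM.load_artifact("furiosa-ai/Llama-3.1-8B-Instruct-FP8", devices="npu:0")
sampling_params = SamplingParams(min_tokens=10, top_p=0.3, top_k=100)

# chat applies the chat template internally, so plain role/content messages are enough
messages = [{"role": "user", "content": "What is the capital of France?"}]

# Assumption: chat returns a list of request outputs, like generate
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)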
Launching the OpenAI-Compatible Server#
Furiosa-LLM can be deployed as a server that provides an API compatible with OpenAI’s. Since many LLM frameworks and applications are built on top of OpenAI’s API, you can easily integrate Furiosa-LLM into your existing applications.
By default, the server listens on the HTTP endpoint http://localhost:8000. You can change the binding address and port with the --host and --port options.
For now, the server can host only one model at a time; it also provides a chat template feature.
You can find more details in the OpenAI-Compatible Server section.
Below is an example of how to launch the server with the Llama 3.1 8B Instruct model.
# Launch the server; it listens on port 8000 by default
furiosa-llm serve furiosa-ai/Llama-3.1-8B-Instruct-FP8 --devices "npu:0"
The server loads the model and starts listening on the specified port. When the server is ready, you will see the following message:
INFO: Started server process [27507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Then, you can test the server using the following curl command:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "EMPTY",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
      }' \
  | python -m json.tool
Example output:
{
    "id": "chat-21f0b74b2c6040d3b615c04cb5bf2e2e",
    "object": "chat.completion",
    "created": 1736480800,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris.",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 42,
        "total_tokens": 49,
        "completion_tokens": 7
    },
    "prompt_logprobs": null
}
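Because the endpoint is OpenAI-compatible, you can also call it from the official openai Python client. Below is a minimal sketch, assuming the server from the example above is running locally and reusing the "EMPTY" placeholder model name from the curl example (the API key value is a placeholder as well):
from openai import OpenAI

# Point the client at the local Furiosa-LLM server; the API key is a placeholder,
# mirroring the "EMPTY" value used in the curl example above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EMPTY",  # same placeholder model name as in the curl example
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)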