Quick Start with Furiosa LLM#
Furiosa LLM is a serving framework for large language models (LLMs) that runs on FuriosaAI’s NPU. It provides a Python API compatible with vLLM and a server compatible with OpenAI’s API. This document explains how to install and use Furiosa LLM.
Warning
This document is based on Furiosa SDK 2025.1.0 (beta0). The features and APIs described herein are subject to change in the future.
Installing Furiosa LLM#
The minimum requirements for Furiosa LLM are as follows:
A system with the prerequisites installed (see Installing Prerequisites)
Python 3.8, 3.9, or 3.10
PyTorch 2.4.1
Enough storage space for model weights, e.g., about 100 GB for the Llama 3.1 70B model
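You can check the installed Python and PyTorch versions before proceeding; for example:
python --version
python -c "import torch; print(torch.__version__)"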
To install the furiosa-compiler package and Furiosa LLM, run the following commands:
sudo apt install -y furiosa-compiler
pip install --upgrade pip setuptools wheel
pip install --upgrade furiosa-llm
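After installation, you can verify that the package is visible to pip and importable from Python:
pip show furiosa-llm
python -c "import furiosa_llm"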
Building Model Artifacts#
To run Furiosa LLM with a given model, you need to build a model artifact first. This process starts with a pre-trained model from the Hugging Face Hub and involves calibration, quantization, and compilation, ultimately generating an artifact. Furiosa LLM provides a Python API to perform these steps simply.
Note
If you already have a pre-built model artifact, you can skip this section. According to our roadmap, the 2025.2 release will allow BF16 models to run on Furiosa LLM without the calibration and quantization steps, which will make the workflow below much simpler.
The following examples show how to build a model artifact from a pre-trained model.
from furiosa_llm.optimum.dataset_utils import create_data_loader
from furiosa_llm.optimum import QuantizerForCausalLM, QuantizationConfig
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Create a dataloader for calibration
dataloader = create_data_loader(
    tokenizer=model_id,
    dataset_name_or_path="mit-han-lab/pile-val-backup",
    dataset_split="validation",
    num_samples=5,  # Increase this number for better calibration
    max_sample_length=1024,
)
quantized_model = "./quantized_model"
# Load a pre-trained model from Hugging Face model hub
quantizer = QuantizerForCausalLM.from_pretrained(model_id)
# Calibrate, quantize the model, and save the quantized model
quantizer.quantize(quantized_model, dataloader, QuantizationConfig.w_f8_a_f8_kv_f8())
The code snippet above calibrates and quantizes the model using the QuantizerForCausalLM class and saves the quantized model to the specified directory.
from furiosa_llm.artifact.builder import ArtifactBuilder
quantized_model = "./quantized_model"
compiled_model = "./Output-Llama-3.1-8B-Instruct"
builder = ArtifactBuilder(
    quantized_model,
    tensor_parallel_size=4,
    max_seq_len_to_capture=1024,  # Maximum sequence length covered by LLM engine
)
builder.build(compiled_model)
ArtifactBuilder applies the parallelism strategy to the quantized model, compiles the parallelized model, and generates a model artifact. More details and examples can be found in the Model Preparation Workflow section.
Offline Batch Inference with Furiosa LLM#
We now explain how to perform offline LLM inference using the Python API of Furiosa LLM.
First, import the LLM and SamplingParams classes from the furiosa_llm module. The LLM class is used to load LLM models and provides the core API for LLM inference. SamplingParams is used to specify various parameters for text generation.
from furiosa_llm import LLM, SamplingParams
# Load the Llama 3.1 8B Instruct model
path = "./Llama-3.1-8B-Instruct"
llm = LLM.load_artifact(path, devices="npu:0")
# You can specify various parameters for text generation
sampling_params = SamplingParams(min_tokens=10, top_p=0.3, top_k=100)
# Generate text
prompts = ["Say this is a test"]
response = llm.generate(prompts, sampling_params)
# Print the output of the model
print(response[0].outputs[0].text)
Streaming Inference with Furiosa LLM#
In addition to batch inference, Furiosa LLM also supports streaming inference.
The key difference is that streaming inference returns tokens as soon as they are generated. This allows you to start printing or processing partial output before the whole inference process finishes.
To perform streaming inference, use the stream_generate method instead of generate. This method is asynchronous and returns a stream of tokens as they are generated.
import asyncio
from furiosa_llm import LLM, SamplingParams
async def main():
    # Load the Llama 3.1 8B Instruct model
    path = "./Llama-3.1-8B-Instruct"
    llm = LLM.load_artifact(path, devices="npu:0")

    # You can specify various parameters for text generation
    sampling_params = SamplingParams(min_tokens=10, top_p=0.3, top_k=100)

    # Generate text and print each token as it is produced
    prompt = "Say this is a test"
    async for output_txt in llm.stream_generate(prompt, sampling_params):
        print(output_txt, end="", flush=True)

# Run the async main function
if __name__ == "__main__":
    asyncio.run(main())
Launching the OpenAI-Compatible Server#
Furiosa LLM can be deployed as a server that provides an API compatible with OpenAI’s. Since many LLM frameworks and applications are built on top of OpenAI’s API, you can easily integrate Furiosa LLM into your existing applications.
By default, the server listens on the HTTP endpoint http://localhost:8000.
You can change the binding address and port with the --host and --port options.
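For example, to expose the server on all network interfaces and a different port, you could combine these options with the same artifact path used in the example below:
furiosa-llm serve ./Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8080 --devices "npu:0"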
For now, the server can host only one model at a time. It also provides a chat template feature.
You can find more details in the OpenAI-Compatible Server section.
Below is an example of how to launch the server with the Llama 3.1 8B Instruct model.
# Launch the server; it listens on port 8000 by default
furiosa-llm serve ./Llama-3.1-8B-Instruct --devices "npu:0"
The server loads the model and starts listening on the specified port. When the server is ready, you will see the following message:
INFO: Started server process [27507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Then, you can test the server using the following curl command:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "EMPTY",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }' \
    | python -m json.tool
Example output:
{
    "id": "chat-21f0b74b2c6040d3b615c04cb5bf2e2e",
    "object": "chat.completion",
    "created": 1736480800,
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris.",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 42,
        "total_tokens": 49,
        "completion_tokens": 7
    },
    "prompt_logprobs": null
}
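Because the server is OpenAI-compatible, you can also query it from the official openai Python client by pointing base_url at the local endpoint. The sketch below assumes the server launched above is still running; the api_key value is an arbitrary placeholder, since this quick-start setup does not configure authentication.
from openai import OpenAI

# Point the client at the local Furiosa LLM server; the API key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EMPTY",  # Matches the model name used in the curl example above
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)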
Using Chat Templates with Furiosa LLM#
Chat models are usually trained with a variety of prompt formats. In particular, Llama 3.x models require a specific prompt format to leverage multiple tools. You can find a full guide to prompt formatting in the Llama model card.
When using the OpenAI-Compatible Server with a chat model, furiosa-llm serve automatically applies the chat template if the model’s tokenizer provides it. Additionally, you can use the --chat-template option to specify a custom chat template path.
Note
The Chat API is not yet supported in furiosa-llm. Support is planned for the 2025.1 release.
Since furiosa-llm does not yet provide a chat API, you need to apply the chat template to the prompt manually with the tokenizer when using the LLM API, as shown below.
from furiosa_llm import LLM, SamplingParams
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
def apply_template(prompt):
    chat = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": prompt},
    ]
    # add_generation_prompt=True appends the assistant header so the model
    # generates a reply instead of continuing the user turn
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
path = "./Llama-3.1-8B-Instruct"
llm = LLM.load_artifact(path, devices="npu:0")
prompt1 = apply_template("What is the capital of France?")
prompt2 = apply_template("Say something nice about me.")
sampling_params = SamplingParams(min_tokens=10, top_p=0.3, top_k=100)
responses = llm.generate([prompt1, prompt2], sampling_params)
for response in responses:
    print(response.outputs[0].text)