OpenAI-Compatible Server#
In addition to the Python API, Furiosa LLM offers an OpenAI-compatible server that hosts a single model and provides two OpenAI-compatible APIs: the Completions API and the Chat API.
To launch the server, use the furiosa-llm serve command with the model artifact path, as follows:
furiosa-llm serve [ARTIFACT_PATH]
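For example, if the model artifact has been downloaded to a local directory named ./Llama-3.1-8B-Instruct (the same path used in the container example later in this document), the command would be:
furiosa-llm serve ./Llama-3.1-8B-Instruct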
The following sections describe how to launch and configure the server and interact with the server using OpenAI API clients.
Warning
This document is based on Furiosa SDK 2025.1.0 (beta0). The features and APIs described herein are subject to change in the future.
Prerequisites#
To use the OpenAI-Compatible server, you need the following:
A system with the prerequisites installed (see Installing Prerequisites)
An installation of Furiosa LLM
A model artifact
A chat template for chat applications (optional)
Using the OpenAI API#
Once the server is running, you can interact with it using an HTTP client, as shown in the following example:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "EMPTY",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }' \
  | python -m json.tool
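The Completions API is available in the same way. Here is a minimal sketch of a request against /v1/completions, assuming the same local server as above; the prompt and max_tokens values are illustrative:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "EMPTY",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }' \
  | python -m json.tool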
You can also use the OpenAI client to interact with the server.
To use the OpenAI client, you need to install the openai package first:
pip install openai
The OpenAI client provides two APIs: client.chat.completions and client.completions.
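For example, a non-streaming request through client.completions might look like the following sketch (assuming the server is running at its default address; the prompt and max_tokens values are illustrative):
from openai import OpenAI

# Replace the base URL if the server runs elsewhere
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="EMPTY",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)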
To stream responses, you can use the client.chat.completions API with stream=True, as follows:
import asyncio

from openai import AsyncOpenAI

# Replace the following with your base URL
base_url = "http://localhost:8000/v1"
api_key = "EMPTY"
client = AsyncOpenAI(api_key=api_key, base_url=base_url)

async def run():
    stream_chat_completion = await client.chat.completions.create(
        model="EMPTY",
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )
    async for chunk in stream_chat_completion:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

if __name__ == "__main__":
    asyncio.run(run())
By default, the Furiosa LLM server binds to localhost:8000. You can change the host and port using the --host and --port options.
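For example, to make the server listen on all network interfaces on port 8080:
furiosa-llm serve [ARTIFACT_PATH] --host 0.0.0.0 --port 8080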
Chat Templates#
To use a language model in a chat application, you need to prepare a structured string to give to the model as input. This is essential because the model must understand the conversation’s context, including each speaker’s role (e.g., “user” and “assistant”) and the message content. Just as different models require distinct tokenization methods, they also expect different input formats for chat. This is why a chat template is necessary.
Furiosa LLM supports chat templates based on the Jinja2 template engine, similar to Hugging Face Transformers. If the model’s tokenizer includes a built-in chat template, furiosa-llm serve will automatically use it. However, if the tokenizer lacks a built-in template, or if you want to override the default, you can specify one using the --chat-template parameter.
For reference, you can find a well-structured example of a chat template in the Llama 3.1 Model Card.
To launch the server with a custom chat template, use the following command:
furiosa-llm serve [ARTIFACT_PATH] --chat-template [CHAT_TEMPLATE_PATH]
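For a rough idea of what such a file contains, the sketch below is a generic Jinja2 template that renders the messages list into a single prompt string. It follows the Hugging Face convention of exposing messages and add_generation_prompt as template variables; the special tokens are placeholders rather than any particular model’s format:
{%- for message in messages -%}
<|{{ message['role'] }}|>
{{ message['content'] }}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>
{%- endif -%}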
Tool Calling#
Furiosa LLM supports tool calling (also known as function calling) for models trained with this capability.
Within the tool_choice options supported by the OpenAI API, Furiosa LLM supports "auto" and "none". Future releases will support "required" and named function calling.
The system converts model outputs into the OpenAI response format through a designated parser implementation. At this time, only the llama3_json parser is available. Additional parsers will be introduced in future releases.
The following command starts the server with tool calling enabled for Llama 3.1 models:
furiosa-llm serve [ARTIFACT_PATH] --enable-auto-tool-choice --tool-call-parser llama3_json
To use the tool calling feature, specify the tools and tool_choice parameters. Here’s an example:
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="test")

def get_weather(location: str, unit: str):
    return f"Getting the weather for {location} in {unit}..."

tool_functions = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"  # None is also equivalent to "auto"
)

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")
The output should look similar to the following (the exact arguments produced by the model may vary):
Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
Result: Getting the weather for San Francisco, CA in fahrenheit...
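Because the server follows the OpenAI Chat API, the tool’s result can then be returned to the model as a "tool" role message so that it can compose a final natural-language answer. The continuation below is a sketch under that assumption, reusing the variables from the example above:
# Call the local Python function with the model-provided arguments.
result = tool_functions[tool_call.name](**json.loads(tool_call.arguments))

# Send the tool result back as a "tool" role message so the model can answer in natural language.
followup = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather like in San Francisco?"},
        response.choices[0].message,  # the assistant message containing the tool call
        {
            "role": "tool",
            "tool_call_id": response.choices[0].message.tool_calls[0].id,
            "content": result,
        },
    ],
    tools=tools,
)
print(followup.choices[0].message.content)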
Compatibility with OpenAI API#
Below are the API parameters currently supported by Furiosa LLM:
Warning
Please note that using use_beam_search together with stream is not allowed, because beam search requires the whole sequence to produce the output tokens.
In the 2024.2 release, n works only for beam search. This limitation will be fixed in the next release.
Warning
The max_tokens parameter in the Chat API has been deprecated in favor of max_completion_tokens. While both parameters are currently supported for backwards compatibility, max_tokens will be removed in a future release.
Parameters supported by both the Completions and Chat APIs:
n
temperature
top_p
top_k
early_stopping
length_penalty
use_beam_search
best_of
stream
min_tokens
max_tokens
Parameters supported by the Chat API only:
max_completion_tokens
tools
tool_choice
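As a brief illustration of how these parameters are passed through the OpenAI client, the following sketch combines a few of them in a single Chat API request (the values are illustrative only):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EMPTY",
    messages=[{"role": "user", "content": "Write a haiku about autumn."}],
    temperature=0.7,
    top_p=0.9,
    max_completion_tokens=64,
)
print(response.choices[0].message.content)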
Launching the OpenAI-Compatible Server Container#
FuriosaAI offers a containerized server that can be used for faster deployment.
Here is an example that launches the Furiosa LLM server in a Docker container (replace $HF_TOKEN with your Hugging Face Hub token):
docker run -it --rm --privileged \
  --env HF_TOKEN=$HF_TOKEN \
  -v ./Llama-3.1-8B-Instruct:/model \
  -p 8000:8000 \
  furiosaai/furiosa-llm:latest \
  serve /model --devices "npu:0"