OpenAI Compatible Server#
In addition to the Python API, furiosa-llm also offers an OpenAI-compatible server that hosts a single model and provides two OpenAI-compatible APIs: the Completions API and the Chat API.
You can launch the server using the furiosa-llm serve command with an artifact path as follows:
furiosa-llm serve --model [ARTIFACT_PATH]
The following sections describe how to launch and configure the server and interact with the server using OpenAI API clients.
Warning
This document is based on Furiosa SDK 2024.2.1 (beta0); the features and APIs described here may change in future releases.
Prerequisites#
To use the OpenAI-compatible server, you need the following prerequisites:
LLM Engine Artifact
Chat template for chat applications (optional)
Chat Templates#
To use language models for chat applications, we need structured input rather than a single string. This is necessary because the model must understand the context of the conversation, including the role of each speaker (e.g., “user” and “assistant”) and the content of each message. As with tokenization, different models require very different input formats for chat; that is why we need a chat template.
Furiosa LLM supports chat templates based on the Jinja2 template engine, in the same way as HuggingFace Transformers. Chat Templates offers a comprehensive guide on what chat templates are and how to write them. You can also find a good example of a chat template in the Llama 3.1 Model Card.
The following command launches the server with the chat template:
furiosa-llm serve --model [ARTIFACT_PATH] --chat-template [CHAT_TEMPLATE_PATH]
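For reference, a chat template file is simply a Jinja2 template. The following is a minimal illustrative sketch, not the template of any particular model; the <|user|> and <|assistant|> role markers are hypothetical and must match the special tokens your model expects:
{%- for message in messages -%}
<|{{ message['role'] }}|>
{{ message['content'] }}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>
{%- endif -%}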
Arguments of furiosa-llm serve command#
By default, the server binds to localhost:8000, and you can change the host and port using the --host and --port options.
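For example, to listen on all network interfaces on port 8080 (with [ARTIFACT_PATH] as a placeholder, as above):
furiosa-llm serve --model [ARTIFACT_PATH] --host 0.0.0.0 --port 8080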
The following is the list of options and arguments for the serve command:
usage: furiosa-llm serve [-h] --model MODEL [--host HOST] [--port PORT] [--chat-template CHAT_TEMPLATE] [--response-role RESPONSE_ROLE] [-tp TENSOR_PARALLEL_SIZE] [-pp PIPELINE_PARALLEL_SIZE]
[-dp DATA_PARALLEL_SIZE] [--devices DEVICES]
options:
-h, --help show this help message and exit
--model MODEL The Hugging Face model id, or path to Furiosa model artifact. Currently only one model is supported per server.
--host HOST Host to bind the server to (default: 0.0.0.0)
--port PORT Port to bind the server to (default: 8000)
--chat-template CHAT_TEMPLATE
If given, the default chat template will be overridden with the given file. (Default: use chat template from tokenizer)
--response-role RESPONSE_ROLE
Response role for /v1/chat/completions API (default: assistant)
-tp TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE
Number of tensor parallel replicas. (default: 4)
-pp PIPELINE_PARALLEL_SIZE, --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
Number of pipeline stages. (default: 1)
-dp DATA_PARALLEL_SIZE, --data-parallel-size DATA_PARALLEL_SIZE
Data parallelism size. If not given, it will be inferred from total available PEs and other parallelism degrees.
--devices DEVICES Devices to use (e.g. "npu:0:*,npu:1:*"). If unspecified, all available devices from the host will be used.
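For example, the following illustrative invocation runs the model with a tensor parallelism degree of 4 and a single pipeline stage on device npu:0; adjust the parallelism degrees and the device string to your hardware:
furiosa-llm serve --model [ARTIFACT_PATH] -tp 4 -pp 1 --devices "npu:0:*"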
Using OpenAI API with Furiosa LLM#
Once the server is launched, you can interact with it using any HTTP client, as in the following curl command example.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "EMPTY",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }' \
  | python -m json.tool
You can also use the OpenAI client to interact with the server. To use the OpenAI client, install the openai package first:
pip install openai==1.58.1
The OpenAI client provides two APIs: client.chat.completions and client.completions.
You can use the client.chat.completions API with stream=True for streaming responses, as follows:
import asyncio

from openai import AsyncOpenAI

# Replace the following with your base URL
base_url = "http://localhost:8000/v1"
api_key = "EMPTY"

client = AsyncOpenAI(api_key=api_key, base_url=base_url)


async def run():
    stream_chat_completion = await client.chat.completions.create(
        model="EMPTY",
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )

    async for chunk in stream_chat_completion:
        print(chunk.choices[0].delta.content or "", end="", flush=True)


if __name__ == "__main__":
    asyncio.run(run())
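The client.completions API works similarly. The following is a minimal, non-streaming sketch against the Completions API; the prompt and max_tokens values are only illustrative:
from openai import OpenAI

# Replace the following with your base URL
base_url = "http://localhost:8000/v1"
api_key = "EMPTY"

client = OpenAI(api_key=api_key, base_url=base_url)

completion = client.completions.create(
    model="EMPTY",
    prompt="The capital of France is",
    max_tokens=16,
)
print(completion.choices[0].text)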
Compatibility with the OpenAI API#
Currently, furiosa-llm serve supports the OpenAI API parameters listed below. You can find more about each parameter in the Completions API and Chat API references.
Warning
Please note that using use_beam_search together with stream is not allowed, because beam search cannot determine the output tokens until the end of the sequence.
In the 2024.2 release, n works only for beam search; this will be fixed in a future release.
n
temperature
top_p
top_k
early_stopping
length_penalty
max_tokens
min_tokens
use_beam_search
best_of
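For example, the standard sampling parameters can be passed directly on a chat completion request with the OpenAI client; the values below are only illustrative:
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_completion = client.chat.completions.create(
    model="EMPTY",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,  # sampling temperature
    top_p=0.9,        # nucleus sampling
    max_tokens=64,    # maximum number of generated tokens
)
print(chat_completion.choices[0].message.content)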
Launching the OpenAI-Compatible Server Container#
Furiosa LLM can also be launched immediately as a containerized server.
docker run -it --rm --privileged \
  --env HF_TOKEN=$HF_TOKEN \
  -v ./Llama-3.1-8B-Instruct:/model \
  -p 8000:8000 \
  furiosaai/furiosa-llm:latest \
  serve --model /model --devices "npu:0:*"
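Since the container's port 8000 is mapped to the host, the curl and OpenAI client examples above work unchanged against http://localhost:8000 once the container is up.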