OpenAI Compatible Server#

In addition to the Python API, furiosa-llm also offers an OpenAI-compatible server that hosts a single model and provides two OpenAI-compatible APIs: the Completions API and the Chat API.

You can launch the server using the furiosa-llm serve command with an artifact path, as follows:

furiosa-llm serve --model [ARTIFACT_PATH]

The following sections describe how to launch and configure the server and interact with the server using OpenAI API clients.

Warning

This document is based on Furiosa SDK 2024.2.1 (beta0), and the features and APIs described in this document may change in the future.

Prerequisites#

To use the OpenAI-compatible server, you need the following prerequisites:

Chat Templates#

To use language models for chat applications, we need structured input instead of a single string. This is necessary because the model must understand the context of the conversation, including the role of each speaker (e.g., “user” and “assistant”) and the content of each message. As with tokenization, different models require very different input formats for chat. That’s why we need a chat template.

Furiosa LLM supports chat templates based on the Jinja2 template engine, in the same way as HuggingFace Transformers. Chat Templates offers a comprehensive guide on what chat templates are and how to write your own. You can also find a good example of a chat template in the Llama 3.1 Model Card.
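
To see what a chat template produces, you can render a conversation with HuggingFace Transformers. The following is a minimal sketch, assuming the transformers package is installed; the model id meta-llama/Llama-3.1-8B-Instruct is only an illustrative assumption and should be replaced with the tokenizer that matches your artifact.

from transformers import AutoTokenizer

# Illustrative model id; substitute the tokenizer that matches your artifact.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation into the single prompt string the model expects.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)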

The following command launches the server with the chat template:

furiosa-llm serve --model [ARTIFACT_PATH] --chat-template [CHAT_TEMPLATE_PATH]

Arguments of the furiosa-llm serve command#

By default, the server binds to 0.0.0.0:8000, and you can change the host and port using the --host and --port options. The following is the list of options and arguments for the serve command:

usage: furiosa-llm serve [-h] --model MODEL [--host HOST] [--port PORT] [--chat-template CHAT_TEMPLATE] [--response-role RESPONSE_ROLE] [-tp TENSOR_PARALLEL_SIZE] [-pp PIPELINE_PARALLEL_SIZE]
                        [-dp DATA_PARALLEL_SIZE] [--devices DEVICES]

options:
    -h, --help            show this help message and exit
    --model MODEL         The Hugging Face model id, or path to Furiosa model artifact. Currently only one model is supported per server.
    --host HOST           Host to bind the server to (default: 0.0.0.0)
    --port PORT           Port to bind the server to (default: 8000)
    --chat-template CHAT_TEMPLATE
                            If given, the default chat template will be overridden with the given file. (Default: use chat template from tokenizer)
    --response-role RESPONSE_ROLE
                            Response role for /v1/chat/completions API (default: assistant)
    -tp TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE
                            Number of tensor parallel replicas. (default: 4)
    -pp PIPELINE_PARALLEL_SIZE, --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                            Number of pipeline stages. (default: 1)
    -dp DATA_PARALLEL_SIZE, --data-parallel-size DATA_PARALLEL_SIZE
                            Data parallelism size. If not given, it will be inferred from the total available PEs and other parallelism degrees.
    --devices DEVICES     Devices to use (e.g. "npu:0:*,npu:1:*"). If unspecified, all available devices from the host will be used.
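
For example, the following command launches the server with explicit parallelism and device options; the flag values here are illustrative and should be adjusted to your artifact and hardware.

furiosa-llm serve --model [ARTIFACT_PATH] -tp 4 -pp 1 --devices "npu:0:*"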

Using OpenAI API with Furiosa LLM#

Once the server is launched, you can interact with it using any HTTP client, as in the following curl command example:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "EMPTY",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }' \
    | python -m json.tool

You can also use the OpenAI client to interact with the server. To do so, install the openai package first:

pip install openai==1.58.1

The OpenAI client provides two APIs: client.chat.completions and client.completions. You can use the client.chat.completions API with stream=True for streaming responses, as follows:

import asyncio
from openai import AsyncOpenAI

# Replace the following with your base URL
base_url = "http://localhost:8000/v1"
api_key = "EMPTY"

client = AsyncOpenAI(api_key=api_key, base_url=base_url)

async def run():
    stream_chat_completion = await client.chat.completions.create(
        model="EMPTY",
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )

    async for chunk in stream_chat_completion:
        print(chunk.choices[0].delta.content or "", end="", flush=True)


if __name__ == "__main__":
    asyncio.run(run())
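
For non-streaming requests, the synchronous client works in the same way. The following is a minimal sketch of the client.completions API; the model name "EMPTY", the prompt, and max_tokens are placeholder or illustrative values.

from openai import OpenAI

# Replace the base URL with your server's address.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

completion = client.completions.create(
    model="EMPTY",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)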

Compatibility with the OpenAI API#

Currently, furiosa-llm serve supports the following OpenAI API parameters. You can find more about each parameter in the Completions API and Chat API references. A request example using some of these parameters is shown after the list below.

Warning

Please note that using use_beam_search together with stream is not allowed, because beam search cannot determine the output tokens until the end of the sequence.

In the 2024.2 release, n works only with beam search; this will be fixed in the next release.

  • n

  • temperature

  • top_p

  • top_k

  • early_stopping

  • length_penalty

  • max_tokens

  • min_tokens

  • use_beam_search

  • best_of
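
As an illustration, the following sketch sends a chat request with a few of the supported sampling parameters through the OpenAI client; the values are arbitrary examples, and parameters outside the standard OpenAI client signature (such as top_k or use_beam_search) are omitted here.

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Illustrative sampling settings; tune them for your workload.
chat = client.chat.completions.create(
    model="EMPTY",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=64,
)
print(chat.choices[0].message.content)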

Launching the OpenAI-Compatible Server Container#

Furiosa LLM can also be launched immediately as a containerized server:

docker run -it --rm --privileged \
    --env HF_TOKEN=$HF_TOKEN \
    -v ./Llama-3.1-8B-Instruct:/model \
    -p 8000:8000 \
    furiosaai/furiosa-llm:latest \
    serve --model /model --devices "npu:0:*"
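
Once the container is running, the server is reachable on the host's port 8000 (via the -p 8000:8000 mapping), so the curl and OpenAI client examples above work unchanged.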