Responses API#
Furiosa-LLM implements the OpenResponses specification,
a multi-provider, interoperable LLM interface. The Responses API is available for
text generation models with a chat template. It supports text input, streaming,
multi-turn conversations, tool calling with tool_choice control
(see Tool Calling Support), and structured output via JSON Schema.
Note
The following features from the OpenResponses specification are not yet supported:

- Multimodal inputs — input_image, input_audio, and input_file content types are accepted but silently ignored.
- Built-in tools — web_search, file_search, code_interpreter, computer_use, and mcp tools are not supported. Only custom function tools are available.
- Background processing — background=true is accepted but has no effect.
- Auto-truncation — truncation="auto" is accepted but only "disabled" is implemented.
Endpoints#
- POST /v1/responses — Create a response (streaming or non-streaming).
- GET /v1/responses/{response_id} — Retrieve a previously stored response.
- POST /v1/responses/{response_id}/cancel — Cancel an in-progress response.
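As an illustration, these endpoints map onto the OpenAI Python SDK as follows. This is a minimal sketch: retrieval and cancellation assume the response store is enabled (see the next section) and an openai SDK version that exposes responses.retrieve and responses.cancel.

import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

# POST /v1/responses: create a response
response = client.responses.create(model=model, input="Hello!", store=True)

# GET /v1/responses/{response_id}: retrieve it later
retrieved = client.responses.retrieve(response.id)

# POST /v1/responses/{response_id}/cancel: only meaningful while the
# response is still in progress
cancelled = client.responses.cancel(response.id)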
Response Store#
The response store keeps responses in memory so they can be retrieved later or
referenced by previous_response_id for multi-turn conversations. The store
is disabled by default and must be explicitly enabled with the
--enable-responses-api-store server option:
furiosa-llm serve [ARTIFACT_PATH] --enable-responses-api-store
The following server options control the store behavior:
| Option | Default | Description |
|---|---|---|
| --enable-responses-api-store | false | Enable the in-memory response store. |
| | 10000 | Maximum number of responses to keep. Oldest entries are evicted when the limit is reached. |
| | 3600 | Time-to-live for stored responses in seconds. |
When the store is enabled and store=true is set in the request (the default),
the server stores the response and its full chat message history, enabling:

- Response retrieval via GET /v1/responses/{response_id}.
- Response cancellation via POST /v1/responses/{response_id}/cancel.
- Multi-turn conversations via previous_response_id.
To skip storage for a specific request, set store=false.
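For instance (a sketch reusing the client and model setup from the examples below):

# This response is generated but never stored; it cannot be retrieved
# or referenced via previous_response_id later.
response = client.responses.create(
    model=model,
    input="What is the capital of France?",
    store=False,
)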
When the store is disabled, GET /v1/responses/{response_id},
POST /v1/responses/{response_id}/cancel, and previous_response_id are
not available.
Note
Stored responses and their conversation histories are held in memory and are lost when the server restarts. Monitor memory consumption on the server if you expect a large number of stored responses.
Multi-Turn Conversations#
The Responses API supports two methods for multi-turn conversations:
Using previous_response_id (recommended)#
The server automatically prepends the stored conversation history from the
referenced response. This is the simplest approach and requires store=true
on the referenced response.
import os
from openai import OpenAI
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id
# Turn 1: store the response for later continuation
res1 = client.responses.create(
model=model,
input="My name is Alice.",
store=True,
)
print(f"Turn 1: {res1.output[0].content[0].text}")
print(f"Response ID: {res1.id}")
# Turn 2: continue the conversation using previous_response_id
res2 = client.responses.create(
model=model,
input="What is my name?",
previous_response_id=res1.id,
store=True,
)
print(f"Turn 2: {res2.output[0].content[0].text}")
Using manual context#
Alternatively, you can manually build the conversation context by appending
previous output items to the input array:
# Turn 1
res1 = client.responses.create(model=model, input="My name is Alice.")
# Turn 2: manually include previous context
context = [{"role": "user", "content": "My name is Alice."}]
context += res1.output
context.append({"role": "user", "content": "What is my name?"})
res2 = client.responses.create(model=model, input=context)
Examples#
Basic usage:
import os
from openai import OpenAI
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id
# Non-streaming
response = client.responses.create(
model=model,
input="What is the capital of France?",
)
print(response.output[0].content[0].text)
Streaming:
import os
from openai import OpenAI
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id
with client.responses.stream(
model=model,
input="What is the capital of France?",
) as stream:
for event in stream:
if event.type == "response.output_text.delta":
print(event.delta, end="", flush=True)
print()
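Structured output with a JSON Schema. This is a minimal sketch: the schema and its name are illustrative, and the shape of the text parameter is assumed to follow the OpenResponses json_schema format.

import json
import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

# Constrain the output to a small illustrative schema.
response = client.responses.create(
    model=model,
    input="What is the capital of France? Answer in JSON.",
    text={
        "format": {
            "type": "json_schema",
            "name": "capital_answer",
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        }
    },
)
print(json.loads(response.output[0].content[0].text))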
API Reference#
Parameters without descriptions follow the behavior defined for the corresponding parameters in the OpenResponses specification.
| Name | Type | Default | Description |
|---|---|---|---|
| model | string | — | Required by the client, but the value is ignored on the server. |
| input | string or array | — | Text string, or an array of input items (messages, function call outputs). Multimodal content types (input_image, input_audio, input_file) are accepted but silently ignored. |
| instructions | string | null | System-level instructions prepended to the conversation. |
| stream | boolean | false | |
| store | boolean | true | When true and the response store is enabled, the response and its full chat message history are stored. |
| temperature | float | 1.0 | |
| top_p | float | 1.0 | |
| top_k | integer | -1 | Furiosa-LLM extension; not part of the OpenResponses specification. |
| max_output_tokens | integer | null | If null, the server uses the maximum possible length given the input. |
| presence_penalty | float | 0.0 | Accepted for compatibility but not yet functional. |
| frequency_penalty | float | 0.0 | Accepted for compatibility but not yet functional. |
| tools | array | [] | Function tool definitions. Only custom function tools are supported. |
| tool_choice | string or object | "auto" | Controls how tools are invoked; see Tool Calling Support for the supported values. |
| text | object | null | Structured output configuration. Supports structured output via JSON Schema. |
| previous_response_id | string | null | ID of a previously stored response to continue the conversation from. |
| truncation | string | "disabled" | Only "disabled" is implemented; "auto" is accepted but has no effect. |
| reasoning | object | null | Accepted but not yet functional. |
| metadata | object | null | |
| user | string | null | |