Qwen3-VL
The Qwen3-VL models are dense vision-language models that pair a vision encoder with a dense transformer decoder, using Interleaved-MRoPE positional embeddings and DeepStack multi-level feature fusion to handle images and videos alongside text. They cover visual understanding tasks such as OCR, document and chart analysis, spatial reasoning, and video comprehension, and natively support tool (function) calling.
FuriosaAI publishes pre-compiled builds of the Qwen3-VL models under the furiosa-ai organization on the Hugging Face Hub. Each build ships a Furiosa Executable Bundle (FXB) for running the model on FuriosaAI RNGD with Furiosa-LLM. The same upstream weights also run on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream model cards linked below.
Variants
| Model | Quantization | RNGD cards | Notes |
|---|---|---|---|
| furiosa-ai/Qwen3-VL-32B-Instruct | None (16-bit) | 4 | 32B dense; Instruct (non-thinking) edition |
Architecture: Qwen3-VL (dense), Qwen3VLForConditionalGeneration
Input / Output: Image + Text / Text
Quantization: No quantization is applied — the model runs in the same precision as the upstream weights.
Usage
To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.
Launch the server
The simplest way to serve the model is:
# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct
When the server is ready, you will see:
INFO: Started server process [27507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Launch the server with tool calling
To enable tool (function) calling, start the server with the hermes tool-call parser:
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser hermes
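For intuition about what the hermes parser does: Qwen3 chat templates ask the model to wrap each call in `<tool_call>...</tool_call>` tags holding a JSON object with `name` and `arguments`, and the server turns that raw text into structured OpenAI-style `tool_calls`. The sketch below is a simplified, hypothetical re-implementation of that extraction step (`extract_hermes_tool_calls` and the `call_N` ID scheme are our own illustrative choices), not Furiosa-LLM's parser, which may also handle streaming and malformed output.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_hermes_tool_calls(text: str):
    """Split raw model output into plain content and OpenAI-style tool calls.

    Hypothetical helper for illustration only; it assumes every tagged block
    contains valid JSON of the form {"name": ..., "arguments": {...}}.
    """
    calls = []
    for i, raw in enumerate(TOOL_CALL_RE.findall(text)):
        payload = json.loads(raw)
        calls.append({
            "id": f"call_{i}",  # placeholder ID scheme (assumption)
            "type": "function",
            "function": {
                "name": payload["name"],
                # The OpenAI API carries arguments as a JSON-encoded string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Whatever remains outside the tags is ordinary assistant content.
    content = TOOL_CALL_RE.sub("", text).strip() or None
    return content, calls


raw_output = (
    "Let me check the weather.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
content, calls = extract_hermes_tool_calls(raw_output)
print(content)                            # Let me check the weather.
print(calls[0]["function"]["name"])       # get_weather
print(calls[0]["function"]["arguments"])  # {"city": "Paris"}
```

Text without any tags passes through unchanged with an empty call list, which is how an ordinary answer stays an ordinary answer.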
Query the server
The server exposes an OpenAI-compatible API. You can send a text-only request with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "furiosa-ai/Qwen3-VL-32B-Instruct",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}' \
| python -m json.tool
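The same text-only request can be sent from Python with no third-party packages. This is a minimal sketch using only the standard library (`urllib.request`), assuming the server launched above is listening on `localhost:8000`; `build_chat_request` and `chat` are hypothetical helper names, not part of Furiosa-LLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default port used by `furiosa-llm serve`
MODEL = "furiosa-ai/Qwen3-VL-32B-Instruct"


def build_chat_request(messages, **params):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Extra keyword arguments (e.g. max_tokens, temperature) are copied into
    the JSON body as-is.
    """
    body = {"model": MODEL, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **params):
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(build_chat_request(messages, **params)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat([{"role": "user", "content": "What is the capital of France?"}])` should return the assistant's answer as a plain string. Keeping request construction separate from sending makes the payload easy to inspect or reuse with another HTTP client.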
To ask about an image, pass an image_url content part in the message:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "furiosa-ai/Qwen3-VL-32B-Instruct",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
{"type": "text", "text": "Describe this image."}
]
}]
}' \
| python -m json.tool
The image_url.url field accepts a remote http:// or https:// URL, an inline base64 data: URL, or a local file:// path (the last requires the --allowed-local-media-path flag below).
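To send an image that is not reachable over HTTP, without enabling local file access on the server, encode it as a base64 `data:` URL. A minimal sketch, assuming the MIME type can be inferred from the file extension with the standard `mimetypes` module; `bytes_to_data_url`, `image_to_data_url`, and `image_message` are hypothetical helpers.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime: str) -> str:
    """Wrap raw image bytes in an inline base64 data: URL."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def image_to_data_url(path) -> str:
    """Read a local image and encode it; the MIME type comes from the extension."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path}")
    return bytes_to_data_url(Path(path).read_bytes(), mime)


def image_message(url: str, prompt: str) -> dict:
    """A user message with one image_url part followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": prompt},
        ],
    }
```

For example, `image_message(image_to_data_url("photo.jpg"), "Describe this image.")` (with `photo.jpg` standing in for any local file) yields a message that can go straight into the `messages` list of a chat request. Inline data grows the request by roughly a third over the raw file size, so remote URLs remain preferable for large images.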
Multimodal serving options
furiosa-llm serve provides flags to control multimodal behavior; requests that violate them are rejected with HTTP 400:
- --image-limit-per-prompt N / --video-limit-per-prompt N: maximum number of images/videos allowed per request (default: unlimited).
- --allowed-local-media-path PATH: allow file:// URLs whose resolved path is under PATH. Local file access is disabled unless this is set.
- --allowed-media-domains D [D ...]: whitelist of remote domains for SSRF protection. When set, only images from the listed domains are fetched.
- --interleave-mm-strings: keep image placeholders at their original positions when the model uses a string-format chat template (no-op for OpenAI-format templates, the common case).
- --mm-processor-cache-gb GB: size, in GB, of the UUID-keyed multimodal processor cache (default: 4.0; set to 0 to disable). Clients can tag an image_url part with a uuid field and re-reference it in follow-up requests without re-uploading the image bytes.
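Because violating requests fail with HTTP 400, a client may want to catch obvious problems before sending. The sketch below mirrors the intent of `--image-limit-per-prompt` and `--allowed-media-domains` on the client side. It is a hypothetical helper, not part of Furiosa-LLM, and it assumes that the limit counts every `image_url` part in the request and that allowed domains are matched exactly against the URL hostname; the server's real rules (for example, subdomain handling) may differ.

```python
from urllib.parse import urlparse


def check_request(messages, image_limit=None, allowed_domains=None):
    """Return a list of problems the server would likely reject.

    Hypothetical client-side pre-check. Assumptions: the image limit applies
    to all image_url parts across the request, and allowed domains are
    compared exactly with the hostname of http(s) URLs.
    """
    problems = []
    urls = [
        (part.get("image_url") or {}).get("url", "")
        for msg in messages
        if isinstance(msg.get("content"), list)  # plain-string content has no images
        for part in msg["content"]
        if part.get("type") == "image_url"
    ]
    if image_limit is not None and len(urls) > image_limit:
        problems.append(f"{len(urls)} images exceed the limit of {image_limit}")
    for url in urls:
        parsed = urlparse(url)
        # Only remote fetches are subject to the domain whitelist;
        # data: and file:// URLs are skipped here.
        if allowed_domains is not None and parsed.scheme in ("http", "https"):
            if parsed.hostname not in allowed_domains:
                problems.append(f"domain not allowed: {parsed.hostname}")
    return problems
```

An empty list means the request passes these two checks; it does not guarantee the server will accept it, since other limits (such as the video limit or local-path rules) are not modeled.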
For example, to serve local images under /srv/media and restrict remote fetches to a single domain:
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct \
--allowed-local-media-path /srv/media \
--allowed-media-domains cdn.example.com \
--image-limit-per-prompt 4
See the Vision-Language Models guide for image input formats, the UUID cache, and Python client examples.
Tool calling
With the server launched using --enable-auto-tool-choice --tool-call-parser hermes, you can pass tools and let the model decide when to call them. See the Tool Calling guide for a complete client example and details on tool-choice options.
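As a rough illustration of the request and response shapes involved, the sketch below defines one tool in the OpenAI `tools` format and turns the `tool_calls` of an assistant message into follow-up `tool` messages. The `get_weather` tool, its schema, its stand-in implementation, and the canned reply are all hypothetical; the Tool Calling guide remains the reference for Furiosa-LLM specifics.

```python
import json

# One tool in the OpenAI function-calling schema (hypothetical example tool).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> str:
    """Stand-in implementation; a real tool would call a weather service."""
    return f"Sunny in {city}"


LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute each requested call and build the matching role="tool" messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = LOCAL_TOOLS[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return results


# Canned assistant message shaped like choices[0].message in a response.
reply = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }],
}
print(run_tool_calls(reply))
```

In a real exchange, `TOOLS` goes in the `tools` field of the chat request; the assistant message and the returned `tool` messages are then appended to `messages` and sent again so the model can produce its final answer.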
Learn more
Vision-Language Models: image input formats, multimodal server options, and the UUID cache
Tool Calling: parsers, tool-choice options, and more examples
Furiosa-LLM Server (furiosa-llm serve): full OpenAI-compatible API reference and serving options
Upstream model card: Qwen/Qwen3-VL-32B-Instruct