Qwen3-Reranker#

The Qwen3-Reranker series is a family of reranking models built on the Qwen3 dense transformer backbone. Given a query and a set of candidate documents, a reranker produces a relevance score for each document, and those scores are used to reorder retrieval results. This is a common second stage in retrieval-augmented generation (RAG) and search pipelines.

FuriosaAI publishes pre-compiled builds of the Qwen3-Reranker models under the furiosa-ai organization on the Hugging Face Hub. Each build ships a Furiosa Executable Bundle (FXB) for running the model on FuriosaAI RNGD with Furiosa-LLM.

For the related embedding model see Qwen3-Embedding; for the dense Qwen3 chat models see Qwen3 (dense).

Variants#

| Model | Quantization | RNGD cards | Notes |
| --- | --- | --- | --- |
| furiosa-ai/Qwen3-Reranker-8B | None (16-bit) | 1 | 8B reranker |

  • Architecture: Qwen3 (dense), Qwen3ForSequenceClassification

  • Task: Reranking

  • Input / Output: Text (query-document pairs) / Relevance score

  • Quantization: None. The model runs in its native 16-bit precision.

Usage#

To run this model with Furiosa-LLM, follow the examples below after installing Furiosa-LLM and its prerequisites. You can use the model either offline through the Furiosa-LLM Python API or online through the OpenAI-compatible server.

Python API#

Load the artifact and call score with query-document pairs to obtain relevance scores:

from furiosa_llm import LLM

# Load the pre-compiled artifact
llm = LLM.from_artifacts("furiosa-ai/Qwen3-Reranker-8B")
# Score each (query, document) pair
scores = llm.score([("query", "document1"), ("query", "document2")])
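
The return type of score is not described here. As a minimal sketch, assuming it yields one float per input pair in input order, the scores could be used to reorder the candidates as follows. rank_documents is a hypothetical helper, not part of Furiosa-LLM, and the scores in the usage lines are made-up placeholders rather than model output.

```python
from typing import List, Sequence, Tuple


def rank_documents(
    documents: Sequence[str], scores: Sequence[float]
) -> List[Tuple[float, str]]:
    """Pair each document with its score and sort by descending relevance.

    Hypothetical helper: assumes scores[i] is the relevance of documents[i].
    """
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)


# Placeholder scores for illustration only; real values would come from llm.score(...).
ranked = rank_documents(["document1", "document2"], [0.12, 0.87])
for score, doc in ranked:
    print(f"{score:.4f}  {doc}")
```

If score returns richer objects instead of plain floats, extract the numeric value first and pass that list as scores.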

Online server#

The server exposes a /v1/rerank endpoint (compatible with the Cohere/Jina rerank API, also used by vLLM). Launch it the same way as any other model:

# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-Reranker-8B

Once it is ready, send a query and the candidate documents with curl; the server returns the documents reordered by relevance_score:

curl http://localhost:8000/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
    "model": "furiosa-ai/Qwen3-Reranker-8B",
    "query": "What is deep learning?",
    "documents": [
        "Deep learning is a subset of machine learning using neural networks.",
        "Python is a popular programming language for data science.",
        "Neural networks are inspired by biological neural networks."
    ]
    }' \
    | python -m json.tool

You can do the same from Python with the requests library, and pass top_n to keep only the most relevant documents:

import requests

response = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "furiosa-ai/Qwen3-Reranker-8B",
        "query": "What is deep learning?",
        "documents": [
            "Deep learning is a subset of machine learning using neural networks.",
            "Python is a popular programming language for data science.",
            "Neural networks are inspired by biological neural networks.",
        ],
        "top_n": 2,
    },
)

for result in response.json()["results"]:
    print(f"score={result['relevance_score']:.4f}  {result['document']['text']}")
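
When the results feed into later pipeline stages, it can help to turn the response into plain tuples. The sketch below relies only on the relevance_score and document.text fields used above, plus an index field pointing back at the position in the input documents list. The index field is an assumption borrowed from the Cohere/Jina rerank format; parse_rerank_results is a hypothetical helper, and the sample payload is hand-written rather than real server output.

```python
from typing import Any, Dict, List, Tuple


def parse_rerank_results(payload: Dict[str, Any]) -> List[Tuple[int, float, str]]:
    """Return (input_index, relevance_score, text) tuples, best match first.

    Assumes each result carries "relevance_score" and "document" -> "text".
    "index" is an assumed field; -1 is used when it is absent.
    """
    rows = [
        (r.get("index", -1), float(r["relevance_score"]), r["document"]["text"])
        for r in payload.get("results", [])
    ]
    # Sort explicitly instead of relying on the server's ordering.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hand-written payload for illustration; the scores are not real model output.
sample = {
    "results": [
        {"index": 2, "relevance_score": 0.31,
         "document": {"text": "Neural networks are inspired by biological neural networks."}},
        {"index": 0, "relevance_score": 0.95,
         "document": {"text": "Deep learning is a subset of machine learning using neural networks."}},
    ]
}
for index, score, text in parse_rerank_results(sample):
    print(f"#{index}  score={score:.4f}  {text}")
```

Against a live server, call response.raise_for_status() before passing response.json() to the helper, so that HTTP errors surface as exceptions rather than as a missing "results" key.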

To score query-document pairs directly instead of reranking, the server also exposes a /v1/score endpoint.

Learn more#