Reranking (Document Reranking)#

This example demonstrates how to use the rerank functionality to order documents by relevance to a query. Reranking is particularly useful for retrieval-augmented generation (RAG) and information retrieval applications.

Server API Example#

The rerank API is available through the OpenAI-compatible server. Currently, there is no direct Python LLM.rerank() method; use the LLM.score() method and sort results manually, or use the HTTP API.
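The server-side rerank endpoint sorts results for you; with `LLM.score()` you sort manually. A minimal sketch of that manual step follows. The call `llm.score(query, documents)` itself is assumed to return one float-like score per document (check the Score API example for the exact output shape); only the sorting helper below is shown concretely:

```python
def rank_by_score(scores):
    """Return (original_index, score) pairs sorted by score, best first."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# e.g. scores = [output.outputs.score for output in llm.score(query, documents)]
scores = [0.12, 0.91, 0.47]
for index, score in rank_by_score(scores):
    print(f"Document {index}: {score:.4f}")
```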

Basic Reranking#

Example of using Rerank API for document reranking#
import os

import requests

# Start server with: furiosa-llm serve path/to/reranker-model

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")

# Basic reranking
response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "reranker",
        "query": "What is deep learning?",
        "documents": [
            "Deep learning is a subset of machine learning using neural networks.",
            "Python is a popular programming language for data science.",
            "Machine learning algorithms learn patterns from data.",
            "Neural networks are inspired by biological neural networks.",
            "JavaScript is used for web development.",
        ],
    },
)

data = response.json()
print(f"Model: {data['model']}")
print(f"Total results: {len(data['results'])}")
print()

# Results are sorted by relevance score (descending); `index` is the
# document's position in the original `documents` list, not its rank
for rank, result in enumerate(data["results"], start=1):
    print(f"Rank {rank} (document {result['index']}): score = {result['relevance_score']:.4f}")
    print(f"  Document: {result['document']['text'][:60]}...")
    print()

Using top_n Parameter#

Limit the number of returned results to the top N most relevant documents:

Example of using Rerank API with top_n parameter#
import os

import requests

# Start server with: furiosa-llm serve path/to/reranker-model

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")

# Reranking with top_n parameter to limit results
response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "reranker",
        "query": "machine learning frameworks",
        "documents": [
            "TensorFlow is a popular machine learning framework.",
            "PyTorch is widely used in research.",
            "Scikit-learn provides simple ML tools.",
            "Pandas is for data manipulation.",
            "NumPy is for numerical computing.",
            "Keras is a high-level neural networks API.",
            "JAX is for high-performance ML research.",
        ],
        "top_n": 3,  # Only return top 3 most relevant
    },
)

data = response.json()
print(f"Showing top {len(data['results'])} results:")

for result in data["results"]:
    print(f"{result['index']}: {result['relevance_score']:.4f} - {result['document']['text']}")

Truncating Long Documents#

Use truncate_prompt_tokens to handle long documents that exceed the model’s context length:

Example of using Rerank API with truncate_prompt_tokens#
import os

import requests

# Start server with: furiosa-llm serve path/to/reranker-model

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")

# Long documents that may exceed model's context length
long_documents = [
    "Deep learning is a subset of machine learning. " * 50,
    "Neural networks process data in layers. " * 50,
    "Transformers revolutionized natural language processing. " * 50,
]

# Use truncate_prompt_tokens to handle long documents
response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "reranker",
        "query": "research findings",
        "documents": long_documents,
        "truncate_prompt_tokens": 512,  # Truncate to 512 tokens
        "top_n": 10,
    },
)

data = response.json()
print(f"Reranked {len(data['results'])} documents with truncation:")

for result in data["results"]:
    print(f"Document {result['index']}: {result['relevance_score']:.4f}")

Using Python Client#

You can point the OpenAI Python client at the server for chat and completion endpoints, but it has no native rerank support:

from openai import OpenAI

# Initialize client pointing to Furiosa-LLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Not used but required by client
)

# Note: the OpenAI client has no rerank method; call the /rerank
# endpoint with an HTTP client such as requests instead
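A small wrapper over the HTTP endpoint fills the gap. The field names mirror the examples above; the helper itself is illustrative, not part of any shipped SDK:

```python
import os

import requests


def build_rerank_payload(query, documents, model="reranker", top_n=None):
    """Assemble the JSON body expected by the /rerank endpoint."""
    payload = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def rerank(query, documents, base_url=None, **kwargs):
    """POST to the rerank endpoint and return results sorted by relevance."""
    base_url = base_url or os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
    response = requests.post(
        f"{base_url}/rerank", json=build_rerank_payload(query, documents, **kwargs)
    )
    response.raise_for_status()
    return response.json()["results"]
```

With the server running, `rerank("What is deep learning?", documents, top_n=3)` returns the top three results.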

Use Cases#

Reranking is essential for:

  • Retrieval-Augmented Generation (RAG)

    • First-stage retrieval returns many candidates

    • Reranking selects the most relevant documents to include in the LLM context

    • Improves answer quality by providing better context

  • Search Engines

    • Initial search returns broad results

    • Reranking orders them by relevance to user query

    • Enhances user experience with more accurate results

  • Question Answering Systems

    • Multiple knowledge base articles retrieved

    • Reranking identifies which article best answers the question

    • Reduces latency by processing fewer documents

  • Content Recommendation

    • Candidate items filtered by basic criteria

    • Reranking personalizes based on user query or context

    • Delivers more relevant recommendations

Workflow Example: RAG Pipeline#

import requests
from openai import OpenAI

# Step 1: Retrieve candidate documents (e.g., from vector database)
query = "How does attention mechanism work in transformers?"
candidate_documents = retrieve_from_vector_db(query, top_k=50)  # placeholder for your retrieval step; returns 50 candidate strings

# Step 2: Rerank to find most relevant
rerank_response = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "reranker",
        "query": query,
        "documents": candidate_documents,
        "top_n": 5  # Select top 5 for LLM context
    }
)

top_documents = [
    result["document"]["text"]
    for result in rerank_response.json()["results"]
]

# Step 3: Generate answer using reranked documents
llm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

context = "\n\n".join(top_documents)
completion = llm_client.chat.completions.create(
    model="llama",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
        {"role": "user", "content": query}
    ]
)

print(completion.choices[0].message.content)

API Compatibility#

Furiosa-LLM’s rerank API follows the vLLM rerank API specification and is compatible with:

  • Multiple endpoint paths: /rerank, /v1/rerank, /v2/rerank

  • JinaAI rerank API format (commonly used in RAG frameworks like RAGFlow)

This ensures compatibility with existing tools and frameworks that support these standards.

For scoring individual pairs without ranking, see Score API example.

API Reference#

See Rerank API Reference for complete parameter documentation.