Reranking (Document Reranking)#
This example demonstrates how to use the rerank functionality to order documents by relevance to a query. Reranking is particularly useful for retrieval-augmented generation (RAG) and information retrieval applications.
Server API Example#
The rerank API is available through the OpenAI-compatible server. There is currently no direct Python LLM.rerank() method; either call the HTTP API directly, or use the LLM.score() method and sort the results yourself, as sketched below.
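For offline (in-process) use, reranking can be emulated with LLM.score(). The following is a minimal sketch assuming Furiosa-LLM's LLM.score() follows a vLLM-style signature (one query scored against a list of texts, each output exposing an outputs.score field); check the API reference for the exact interface.

from furiosa_llm import LLM

# Assumption: score(query, documents) returns one output per document,
# each with an outputs.score field, as in the vLLM-style scoring API.
llm = LLM("path/to/reranker-model")

query = "What is deep learning?"
documents = [
    "Deep learning is a subset of machine learning.",
    "JavaScript is used for web development.",
]

outputs = llm.score(query, documents)

# Sort document indices by score, highest first, to emulate reranking.
ranked = sorted(
    enumerate(outputs), key=lambda pair: pair[1].outputs.score, reverse=True
)
for index, output in ranked:
    print(f"Document {index}: {output.outputs.score:.4f}")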
Basic Reranking#
import os
import requests
# Start server with: furiosa-llm serve path/to/reranker-model
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
# Basic reranking
response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "reranker",
        "query": "What is deep learning?",
        "documents": [
            "Deep learning is a subset of machine learning using neural networks.",
            "Python is a popular programming language for data science.",
            "Machine learning algorithms learn patterns from data.",
            "Neural networks are inspired by biological neural networks.",
            "JavaScript is used for web development.",
        ],
    },
)
data = response.json()
print(f"Model: {data['model']}")
print(f"Total results: {len(data['results'])}")
print()
# Results are sorted by relevance score (descending).
# 'index' is the document's position in the original input list, not its rank.
for rank, result in enumerate(data["results"], start=1):
    print(f"Rank {rank} (document {result['index']}): score = {result['relevance_score']:.4f}")
    print(f"  Document: {result['document']['text'][:60]}...")
    print()
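Each result carries the index of the document in the original input, its relevance score, and the document text. An illustrative (not verbatim) response body, following the vLLM/JinaAI-style rerank schema, looks like:

{
  "id": "rerank-...",
  "model": "reranker",
  "results": [
    {
      "index": 0,
      "document": {"text": "Deep learning is a subset of machine learning using neural networks."},
      "relevance_score": 0.97
    }
  ],
  "usage": {"total_tokens": 123}
}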
Using top_n Parameter#
Limit the number of returned results to the top N most relevant documents:
import os
import requests
# Start server with: furiosa-llm serve path/to/reranker-model
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
# Reranking with top_n parameter to limit results
response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "reranker",
        "query": "machine learning frameworks",
        "documents": [
            "TensorFlow is a popular machine learning framework.",
            "PyTorch is widely used in research.",
            "Scikit-learn provides simple ML tools.",
            "Pandas is for data manipulation.",
            "NumPy is for numerical computing.",
            "Keras is a high-level neural networks API.",
            "JAX is for high-performance ML research.",
        ],
        "top_n": 3,  # Only return the top 3 most relevant documents
    },
)
data = response.json()
print(f"Showing top {len(data['results'])} results:")
for result in data["results"]:
    print(f"{result['index']}: {result['relevance_score']:.4f} - {result['document']['text']}")
Truncating Long Documents#
Use truncate_prompt_tokens to handle long documents that exceed the model’s context length:
import os
import requests
# Start server with: furiosa-llm serve path/to/reranker-model
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
# Long documents that may exceed model's context length
long_documents = [
    "Deep learning is a subset of machine learning. " * 50,
    "Neural networks process data in layers. " * 50,
    "Transformers revolutionized natural language processing. " * 50,
]
# Use truncate_prompt_tokens to handle long documents
response = requests.post(
    f"{base_url}/rerank",
    json={
        "model": "reranker",
        "query": "research findings",
        "documents": long_documents,
        "truncate_prompt_tokens": 512,  # Truncate each prompt to 512 tokens
        "top_n": 10,
    },
)
data = response.json()
print(f"Reranked {len(data['results'])} documents with truncation:")
for result in data["results"]:
    print(f"Document {result['index']}: {result['relevance_score']:.4f}")
Using Python Client#
You can also use the OpenAI Python client or any HTTP client:
from openai import OpenAI
# Initialize client pointing to Furiosa-LLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # Not used but required by the client
)

# Note: the OpenAI Python client has no native rerank support,
# so call the endpoint with an HTTP client such as requests (see the helper below).
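As a stopgap, a small wrapper around requests covers the gap. The rerank helper below is hypothetical (not part of any client library), shown as a minimal sketch:

import requests

def rerank(base_url, model, query, documents, top_n=None):
    # Hypothetical helper: POST to the server's rerank endpoint and
    # return the results list, already sorted by relevance score.
    payload = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n
    response = requests.post(f"{base_url}/rerank", json=payload)
    response.raise_for_status()
    return response.json()["results"]

results = rerank(
    "http://localhost:8000/v1",
    "reranker",
    "What is deep learning?",
    ["Deep learning uses neural networks.", "JavaScript runs in browsers."],
    top_n=1,
)
print(results[0]["relevance_score"])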
Use Cases#
Reranking is essential for:

- Retrieval-Augmented Generation (RAG)
  - First-stage retrieval returns many candidates
  - Reranking selects the most relevant documents to include in the LLM context
  - Improves answer quality by providing better context
- Search Engines
  - Initial search returns broad results
  - Reranking orders them by relevance to the user query
  - Enhances user experience with more accurate results
- Question Answering Systems
  - Multiple knowledge base articles are retrieved
  - Reranking identifies which article best answers the question
  - Reduces latency by passing fewer documents downstream
- Content Recommendation
  - Candidate items are filtered by basic criteria
  - Reranking personalizes results based on the user query or context
  - Delivers more relevant recommendations
Workflow Example: RAG Pipeline#
import requests
from openai import OpenAI
# Step 1: Retrieve candidate documents (e.g., from a vector database).
# retrieve_from_vector_db is a placeholder for your own retrieval function.
query = "How does the attention mechanism work in transformers?"
candidate_documents = retrieve_from_vector_db(query, top_k=50)  # Get 50 candidates

# Step 2: Rerank to find the most relevant documents
rerank_response = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "reranker",
        "query": query,
        "documents": candidate_documents,
        "top_n": 5,  # Select the top 5 for the LLM context
    },
)
top_documents = [
    result["document"]["text"]
    for result in rerank_response.json()["results"]
]
# Step 3: Generate answer using reranked documents
llm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
context = "\n\n".join(top_documents)
completion = llm_client.chat.completions.create(
    model="llama",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(completion.choices[0].message.content)
API Compatibility#
Furiosa-LLM’s rerank API follows the vLLM rerank API specification and is compatible with:
- Multiple endpoint paths: /rerank, /v1/rerank, /v2/rerank
- The JinaAI rerank API format (commonly used in RAG frameworks like RAGFlow)
This ensures compatibility with existing tools and frameworks that support these standards.
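Since all three paths accept the same request format, clients hard-coded to any of them work unchanged; for example:

import requests

payload = {
    "model": "reranker",
    "query": "What is deep learning?",
    "documents": ["Deep learning uses neural networks."],
}

# The same request body works against any of the mounted endpoint paths.
for path in ("/rerank", "/v1/rerank", "/v2/rerank"):
    response = requests.post(f"http://localhost:8000{path}", json=payload)
    print(path, response.status_code)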
For scoring individual query-document pairs without ranking, see the Score API example.
API Reference#
See Rerank API Reference for complete parameter documentation.