OpenAI-Compatible API with Logprobs#

This example demonstrates how to retrieve log probabilities (logprobs) using the OpenAI-compatible Chat Completion API. There are two types of logprobs available:

  1. Generated Token Logprobs (Standard OpenAI): The log probabilities the model assigned to each token it generated, together with the most likely alternative tokens at each position.

  2. Prompt Token Logprobs (vLLM Extension): The log probabilities the model assigns to each token of the input prompt, given the tokens that precede it.

Chat Completion API Example#

The following example shows how to use both logprobs (for generated tokens) and prompt_logprobs (for prompt tokens) in a single request.

Example of using the Chat Completion API with logprobs and prompt_logprobs#
"""
OpenAI-Compatible Server - Chat Completion API with Logprobs Example

This example demonstrates how to use both:
1. logprobs: Log probabilities for generated tokens (standard OpenAI parameter)
2. prompt_logprobs: Log probabilities for prompt tokens (vLLM extension)

Model: Any model available on the server
Endpoint: http://localhost:8000
"""

import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")

client = OpenAI(base_url=base_url, api_key=api_key)

# Get available model
models = client.models.list()
model_name = models.data[0].id
print(f"Using model: {model_name}")
print()

# Chat completion request with both logprobs and prompt_logprobs
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=16,
    temperature=0.0,
    # Standard OpenAI parameter: logprobs for generated tokens
    logprobs=True,
    top_logprobs=3,  # Return top 3 alternatives for each generated token
    # vLLM/Furiosa-LLM extension parameters
    extra_body={
        "prompt_logprobs": 3,  # Top 3 logprobs per prompt token
        "return_token_ids": True,  # Return prompt token IDs
    },
)

# Print generated text
print("=" * 80)
print(f"Generated text: {response.choices[0].message.content}")
print("=" * 80)
print()

# =========================================================================
# Section 1: Generated Token Logprobs (Standard OpenAI)
# =========================================================================
print("=== Generated Token Logprobs (Standard OpenAI) ===")
print("These show the model's confidence for each token it generated.\n")

logprobs_content = response.choices[0].logprobs
if logprobs_content and logprobs_content.content:
    for idx, token_info in enumerate(logprobs_content.content):
        print(
            f"Position {idx}: '{token_info.token}' (logprob={token_info.logprob:.6f})"
        )

        # Show alternative tokens the model considered
        if token_info.top_logprobs:
            for alt in token_info.top_logprobs:
                marker = " <-- chosen" if alt.token == token_info.token else ""
                print(f"    '{alt.token}': logprob={alt.logprob:.6f}{marker}")
        print()
else:
    print("No logprobs in response")
print()

# =========================================================================
# Section 2: Prompt Token Logprobs (vLLM Extension)
# =========================================================================
print("=== Prompt Token Logprobs (vLLM Extension) ===")
print("These show how likely each prompt token was, given the preceding context.\n")

# Access vLLM extension fields directly from response object
prompt_logprobs = response.prompt_logprobs  # type: ignore[attr-defined]
prompt_token_ids = response.prompt_token_ids  # type: ignore[attr-defined]

if prompt_logprobs:
    for idx, token_logprobs in enumerate(prompt_logprobs):
        actual_token_id = prompt_token_ids[idx] if prompt_token_ids else None

        if token_logprobs is None:
            # First token has no logprobs (no prior context)
            print(
                f"Position {idx}: token_id={actual_token_id} "
                f"-> None (first token, no prior context)"
            )
            continue

        print(f"Position {idx}: token_id={actual_token_id}")

        # token_logprobs is in dict[token_id, Logprob] format
        for token_id_str, logprob_info in token_logprobs.items():
            token_id = int(token_id_str)
            logprob = logprob_info["logprob"]
            rank = logprob_info.get("rank")
            decoded_token = logprob_info.get("decoded_token", "")
            escaped_token = repr(decoded_token)[1:-1]

            is_actual = token_id == actual_token_id
            actual_marker = " <-- actual" if is_actual else ""
            print(
                f"    token_id={token_id:>6}, "
                f"token='{escaped_token}', "
                f"logprob={logprob:>10.6f}, "
                f"rank={rank}{actual_marker}"
            )
    print()
else:
    print("No prompt_logprobs in response (vLLM extension field)")
print()

Example Output#

Using model: Qwen/Qwen2.5-0.5B

================================================================================
Generated text: The capital of France is Paris.

================================================================================

=== Generated Token Logprobs (Standard OpenAI) ===
These show the model's confidence for each token it generated.

Position 0: 'The' (logprob=-0.298153)
    'The': logprob=-0.298153 <-- chosen
    'Paris': logprob=-1.551574
    'As': logprob=-4.550769

Position 1: ' capital' (logprob=-0.003914)
    ' capital': logprob=-0.003914 <-- chosen
    ' current': logprob=-6.505387
    ' Capital': logprob=-8.506357

Position 2: ' of' (logprob=-0.011788)
    ' of': logprob=-0.011788 <-- chosen
    ' city': logprob=-4.769587
    ' and': logprob=-8.139359

Position 3: ' France' (logprob=-0.003914)
    ' France': logprob=-0.003914 <-- chosen
    ' the': logprob=-10.007491
    'France': logprob=-11.007912

Position 4: ' is' (logprob=-0.003914)
    ' is': logprob=-0.003914 <-- chosen
    ',': logprob=-5.880869
    ' (': logprob=-9.756177

Position 5: ' Paris' (logprob=-0.007843)
    ' Paris': logprob=-0.007843 <-- chosen
    ' Lyon': logprob=-6.015181
    '巴黎': logprob=-6.947220

...

=== Prompt Token Logprobs (vLLM Extension) ===
These show how likely each prompt token was, given the preceding context.

Position 0: token_id=151644 -> None (first token, no prior context)
Position 1: token_id=8948
    token_id=  8948, token='system', logprob=-11.958041, rank=8063 <-- actual
    token_id= 72030, token='/API', logprob= -1.378512, rank=1
    token_id= 16731, token='/T', logprob= -3.065493, rank=2
    token_id= 59981, token='/block', logprob= -3.948318, rank=3
Position 2: token_id=198
    token_id=   198, token='\n', logprob= -1.947865, rank=1 <-- actual
    token_id=   271, token='\n\n', logprob= -2.263327, rank=2
    token_id= 69425, token=' 发', logprob= -2.448469, rank=3
...

Key Parameters#

Generated Token Logprobs (Standard OpenAI)#

  • logprobs (bool): When True, returns log probabilities for generated tokens.

  • top_logprobs (int): Number of most likely tokens to return at each generated position, each with its log probability. Requires logprobs=True.

These parameters are part of the standard OpenAI Chat Completion API and show what alternatives the model considered when generating each token.
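As a quick illustration, the sketch below reuses the client and model_name from the example above and converts each generated token's logprob into a probability with math.exp. The helper name print_token_confidence and the prompt text are purely illustrative.

import math

def print_token_confidence(choice) -> None:
    # Convert each generated token's logprob into a probability (illustrative helper).
    if not (choice.logprobs and choice.logprobs.content):
        print("No logprobs returned")
        return
    for info in choice.logprobs.content:
        prob = math.exp(info.logprob)  # logprob -> probability
        print(f"'{info.token}': {prob:.1%}")

resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Name one primary color."}],
    max_tokens=8,
    temperature=0.0,
    logprobs=True,   # standard OpenAI parameter: logprobs for generated tokens
    top_logprobs=3,  # up to 3 most likely tokens per generated position
)
print_token_confidence(resp.choices[0])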

Prompt Token Logprobs (vLLM Extension)#

  • prompt_logprobs (int): Number of top log probabilities to return for each prompt token. Pass via extra_body.

  • return_token_ids (bool): When True, returns the token IDs of the prompt tokens. Pass via extra_body.

These parameters are vLLM extensions and show how likely each token in your prompt was, given the preceding context.
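Because extra_body simply merges these keys into the request JSON, the same fields can also be sent with a plain HTTP client. A minimal sketch, assuming the server from the example above is reachable via the same base_url and api_key:

import requests

payload = {
    "model": model_name,
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 8,
    "logprobs": True,
    "top_logprobs": 3,
    # vLLM/Furiosa-LLM extension fields go directly into the JSON body
    "prompt_logprobs": 3,
    "return_token_ids": True,
}
resp = requests.post(
    f"{base_url}/chat/completions",
    json=payload,
    headers={"Authorization": f"Bearer {api_key}"},
)
data = resp.json()
print(len(data.get("prompt_logprobs") or []), "prompt positions with logprobs")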

Response Structure#

Generated Token Logprobs#

Located in the standard response structure:

response.choices[0].logprobs.content[i].token       # The chosen token
response.choices[0].logprobs.content[i].logprob     # Its log probability
response.choices[0].logprobs.content[i].top_logprobs  # Alternative tokens
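Because each content entry carries the logprob of the chosen token, summing them gives the log probability of the entire generated sequence. A short sketch using the response from the example above:

import math

content = response.choices[0].logprobs.content or []
sequence_logprob = sum(t.logprob for t in content)
print(f"Sequence logprob:     {sequence_logprob:.4f}")
print(f"Sequence probability: {math.exp(sequence_logprob):.2%}")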

Prompt Token Logprobs#

Located in the vLLM extension fields, accessible directly on the response object:

response.prompt_logprobs    # List of logprobs per prompt position
response.prompt_token_ids   # List of actual token IDs in the prompt

Each logprob entry contains:

  • logprob: The log probability value

  • rank: The rank of this token among all possibilities (1 = most likely)

  • decoded_token: The string representation of the token
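One common use of these fields is to score the prompt itself, for example by computing its perplexity from the logprob of the actual token at each position. A minimal sketch based on the structure above; as in the example output, the actual token's entry is returned alongside the top-k alternatives:

import math

actual_logprobs = []
for idx, position in enumerate(response.prompt_logprobs or []):
    if position is None:  # first prompt token has no logprob
        continue
    # Keys are token IDs serialized as strings; look up the actual token at this position
    entry = position.get(str(response.prompt_token_ids[idx]))
    if entry is not None:
        actual_logprobs.append(entry["logprob"])

if actual_logprobs:
    perplexity = math.exp(-sum(actual_logprobs) / len(actual_logprobs))
    print(f"Prompt perplexity: {perplexity:.2f}")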

Understanding the Output#

Generated Token Logprobs#

For generated tokens, a logprob closer to 0 indicates higher confidence:

  • logprob=-0.003914: \(e^{-0.003914}\) ≈ 99.6% probability (very confident)

  • logprob=-2.0: \(e^{-2.0}\) ≈ 13.5% probability (less confident)

  • logprob=-5.0: \(e^{-5.0}\) ≈ 0.7% probability (unlikely alternative)
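The conversion is simply math.exp(logprob), so the values above can be checked directly:

import math

for logprob in (-0.003914, -2.0, -5.0):
    print(f"logprob={logprob:>9} -> probability ≈ {math.exp(logprob):.1%}")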

Prompt Token Logprobs#

For prompt tokens, the rank field shows where the actual token ranked among all possible tokens at that position. A high rank (e.g., 8063) together with a very negative logprob means the model found that token “surprising” given the preceding context; this is expected for prompt tokens, since they are user-provided rather than model-generated.
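The rank field also makes it easy to flag such positions programmatically. A short sketch, reusing the response from the example above and treating any actual prompt token ranked outside the top 100 as “surprising” (the threshold is arbitrary):

SURPRISE_RANK = 100  # arbitrary threshold, for illustration only

for idx, position in enumerate(response.prompt_logprobs or []):
    if position is None:
        continue
    entry = position.get(str(response.prompt_token_ids[idx]))
    if entry and entry.get("rank", 0) > SURPRISE_RANK:
        print(
            f"Position {idx}: token={entry.get('decoded_token')!r}, "
            f"rank={entry['rank']}, logprob={entry['logprob']:.3f}"
        )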