Prefix Caching#
Prefix caching is a performance optimization technique that reuses previously computed KV cache entries for common prompt prefixes. This significantly reduces latency and computational costs for requests that share the same prompt beginning.
Overview#
When processing multiple requests with similar prefixes (e.g., system prompts, common instructions, or shared context), prefix caching eliminates redundant computation by:
Storing KV cache entries from previous computations
Matching new prompts against cached prefixes
Reusing matching cache entries instead of recomputing them
Computing only the non-cached portion of the prompt
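Taken together, these steps reduce to finding the longest previously computed token prefix of a new prompt and prefilling only the remainder. A minimal illustrative sketch of that idea (not Furiosa-LLM's internal API; the function below is hypothetical):

def longest_cached_prefix(prompt_tokens: list[int], cached_tokens: list[int]) -> int:
    """Return how many leading tokens of the prompt are already cached."""
    matched = 0
    for new_tok, cached_tok in zip(prompt_tokens, cached_tokens):
        if new_tok != cached_tok:
            break
        matched += 1
    return matched

# Only the suffix past the matched prefix needs a fresh forward pass:
# to_prefill = prompt_tokens[longest_cached_prefix(prompt_tokens, cached_tokens):]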
This can dramatically improve performance for scenarios with:
Repeated system prompts across multiple user queries
Common instruction templates
Shared conversation history in multi-turn dialogues
Document-based question answering where the document prefix is constant
How It Works#
Automatic Management#
The prefix cache is managed automatically by the scheduler:
Cache Population: KV cache entries are stored after each forward pass
Cache Matching: New requests are checked against the cache tree
Cache Eviction: Least recently used, non-active entries are evicted when memory is needed (starting from leaf suffixes)
Cache Invalidation: No manual invalidation required
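As a rough mental model of the eviction step (a simplified sketch under assumed data structures, not the scheduler's actual code), cached blocks can be viewed as nodes of a prefix tree, and eviction reclaims least recently used leaves that no active request holds:

import time

class CacheNode:
    """One cached KV block in a prefix tree (illustrative only)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}              # first token of a child block -> CacheNode
        self.ref_count = 0              # > 0 while an active request pins this node
        self.last_used = time.monotonic()

def eviction_candidates(root: CacheNode) -> list[CacheNode]:
    """Return unpinned leaf nodes, least recently used first."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node is not root and not node.children and node.ref_count == 0:
            leaves.append(node)
    return sorted(leaves, key=lambda n: n.last_used)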
Token-Level Matching#
Furiosa-LLM’s prefix caching operates at the token level, using a radix tree data structure for efficient prefix matching:
Request 1:  [S1] + [A]
              │
              │     ┌──────────────┐   ┌──────────────┐
              └──►  │ new compute  │ + │ new compute  │
                    └──────────────┘   └──────────────┘

Cache:      ROOT ── [S1] ── [A]

Request 2:  [S1] + [B]
              │
              │     ┌──────────────┐   ┌──────────────┐
              └──►  │  cache hit   │ + │ new compute  │
                    └──────────────┘   └──────────────┘

Cache:      ROOT ── [S1] ──┬── [A]
                           └── [B]
In the standard radix-cache flow (non-hybrid attention), Furiosa-LLM:
Finds the longest token-exact prefix from the start of the prompt
Reuses all matched KV blocks from that prefix
Computes and caches only the unmatched suffix
If a request diverges in the middle of an existing branch, the cache can split that branch internally so future requests can still reuse the shared portion precisely.
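The following is a simplified per-token sketch of this radix-tree flow, including the branch split described above. It is illustrative only: Furiosa-LLM tracks KV cache blocks rather than raw Python lists, and all class and method names here are assumptions.

class RadixNode:
    def __init__(self):
        self.children = {}              # first token of an edge -> (edge_tokens, child node)

class RadixCache:
    """Token-level prefix cache sketch: longest-prefix match, insert, and edge split."""
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already present in the tree."""
        node, i = self.root, 0
        while i < len(tokens):
            edge = node.children.get(tokens[i])
            if edge is None:
                break
            edge_tokens, child = edge
            k = 0
            while k < len(edge_tokens) and i + k < len(tokens) and edge_tokens[k] == tokens[i + k]:
                k += 1
            i += k
            if k < len(edge_tokens):    # diverged (or ended) inside this edge
                break
            node = child
        return i

    def insert(self, tokens):
        """Insert a token sequence, splitting an edge where a request diverges."""
        node, i = self.root, 0
        while i < len(tokens):
            edge = node.children.get(tokens[i])
            if edge is None:
                node.children[tokens[i]] = (tokens[i:], RadixNode())
                return
            edge_tokens, child = edge
            k = 0
            while k < len(edge_tokens) and i + k < len(tokens) and edge_tokens[k] == tokens[i + k]:
                k += 1
            if k < len(edge_tokens):
                # Split the edge so the shared part stays reusable by future requests.
                mid = RadixNode()
                mid.children[edge_tokens[k]] = (edge_tokens[k:], child)
                node.children[tokens[i]] = (edge_tokens[:k], mid)
                child = mid
            node, i = child, i + k

# Mirrors the diagram: Request 1 populates [S1] + [A]; Request 2 hits [S1], computes [B].
s1, a, b = [11, 12, 13], [21, 22], [31, 32]
cache = RadixCache()
cache.insert(s1 + a)
print(cache.match_prefix(s1 + b))       # 3 -> only the [B] suffix needs computation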
Hybrid Attention Models#
Prefix caching fully supports hybrid attention models that use both global attention and sliding-window attention. For these models, a prefix is reusable only when both cache types are valid for the same matched range:
Tokens match from the beginning of the prompt
Global KV cache entries are present for that range
Sliding-window KV entries are available as a valid contiguous window for the current position
If the token match continues but the sliding-window requirement is no longer satisfied, Furiosa-LLM reuses only the longest safe prefix and computes the rest. This keeps results correct while still maximizing reuse.
In practice, this means cache hits in hybrid models may be shorter than raw token overlap, especially when recent-window context differs across requests or has been evicted.
This behavior is also affected by the eviction policy. For sliding-window attention, Furiosa-LLM intentionally allows sliding-window blocks to be evicted independently of global blocks, including at intermediate (non-leaf) prefix-tree nodes. The rationale is memory efficiency: blocks that fall outside the frequently reused window are treated as stale and can be reclaimed early to make room for more useful cache content. Under high memory pressure, this can mean that token-prefix matching extends further than sliding-window validity; in that case, Furiosa-LLM safely reuses the longest prefix that still has a valid window, as sketched below.
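The sketch below illustrates the "longest safe prefix" decision for hybrid models. Block granularity, the bookkeeping of resident ranges, and all names are assumptions made for illustration; they are not the actual Furiosa-LLM internals.

def longest_safe_prefix(token_match_len: int,
                        global_valid_len: int,
                        resident_window_ranges: list[tuple[int, int]],
                        window_size: int) -> int:
    """
    Longest prefix length that is reusable for a hybrid-attention model.

    token_match_len:         tokens matching an existing cached branch
    global_valid_len:        leading tokens whose global-attention KV is still resident
    resident_window_ranges:  (start, end) token spans whose sliding-window KV is resident
    window_size:             sliding-window width of the model
    """
    # Global-attention reuse cannot exceed the resident global prefix.
    safe = min(token_match_len, global_valid_len)

    # Sliding-window reuse also needs the last `window_size` tokens before the
    # reuse boundary to be resident as one contiguous span.
    def window_ok(prefix_len: int) -> bool:
        need_start = max(0, prefix_len - window_size)
        return any(start <= need_start and prefix_len <= end
                   for start, end in resident_window_ranges)

    while safe > 0 and not window_ok(safe):
        safe -= 1                       # fall back to the longest prefix whose window is intact
    return safe

Everything past the returned length is recomputed, which is why cache hits in hybrid models can be shorter than the raw token overlap.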
Performance Benefits#
Latency Reduction#
Prefix caching can provide significant latency improvements:
50-90% reduction in time-to-first-token (TTFT) for requests with long shared prefixes
Improvement roughly proportional to the length of the matched prefix
No impact on generation quality or accuracy
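As a back-of-the-envelope estimate, if prefill cost is treated as roughly proportional to the number of uncached prompt tokens (a simplification; actual savings depend on the model, batching, and attention cost):

def estimated_prefill_savings(prompt_len: int, cached_prefix_len: int) -> float:
    """Fraction of prefill work skipped, assuming cost scales with uncached tokens."""
    return cached_prefix_len / prompt_len

# e.g. a 200-token system prompt cached out of a 220-token prompt
print(f"{estimated_prefill_savings(220, 200):.0%} of prefill work skipped")  # ~91%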
Example Scenarios#
Scenario 1: System Prompt Reuse
system_prompt = "You are a helpful assistant specialized in Python programming..." # 200 tokens
# First request: Full computation
response1 = llm.generate([
{"role": "system", "content": system_prompt},
{"role": "user", "content": "How do I read a file?"}
]) # Computes all 200 + N tokens
# Second request: Only user message computed
response2 = llm.generate([
{"role": "system", "content": system_prompt},
{"role": "user", "content": "How do I write to a file?"}
]) # Reuses 200 cached tokens, computes only new N tokens
Scenario 2: Document QA
long_document = "..."  # 2000 tokens

# Multiple questions about the same document
for question in questions:
    response = llm.generate([
        {"role": "system", "content": f"Answer based on: {long_document}"},
        {"role": "user", "content": question}
    ])  # Document prefix cached after first query
Throughput Impact#
By reducing computation per request:
Higher throughput for workloads with shared prefixes
Better resource utilization by serving more requests with the same compute
Reduced memory bandwidth usage
Configuration#
Prefix caching is not enabled by default in Furiosa-LLM. To enable it, pass the --enable-prefix-caching option when starting the Furiosa-LLM server:
furiosa-llm serve --enable-prefix-caching ...
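Once the server is started with this flag, clients need no special handling; repeating the same prompt prefix across requests is enough. Below is a minimal client sketch assuming the usual OpenAI-compatible endpoint; the base URL and model name are placeholders for your deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder URL

system_prompt = "You are a helpful AI assistant."  # kept byte-identical across requests

for question in ["How do I read a file?", "How do I write to a file?"]:
    completion = client.chat.completions.create(
        model="YOUR_MODEL_NAME",  # placeholder: the model served by furiosa-llm serve
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(completion.choices[0].message.content)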
Memory Management#
The prefix cache shares memory with the KV cache pool. The scheduler automatically:
Allocates cache entries from available KV cache memory
Evicts cached prefixes when memory pressure increases
No manual tuning of cache size or eviction policies is necessary.
Best Practices#
Maximizing Cache Hits#
To get the most benefit from prefix caching:
Use consistent system prompts: Keep system prompts identical across requests
Maintain token-exact matching: Even small changes (punctuation, whitespace) break the match from the point where the tokens diverge
Order messages consistently: The cache matches from the beginning of the prompt sequence
Example - Consistent Prompting:
# Good: Exact same system prompt
system_prompt = "You are a helpful AI assistant."
response1 = llm.generate([{"role": "system", "content": system_prompt}, ...])
response2 = llm.generate([{"role": "system", "content": system_prompt}, ...])
# ✓ Cache hit
# Bad: Slightly different prompts
response3 = llm.generate([{"role": "system", "content": "You are a helpful AI assistant."}, ...])
response4 = llm.generate([{"role": "system", "content": "You are a helpful assistant."}, ...])
# ✗ Cache miss due to difference
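To check whether two prompt variants actually share a token prefix, you can compare their tokenizations directly. This is illustrative only: substitute the tokenizer of the model you serve, and note that the server applies the chat template before tokenizing, so the comparison is approximate.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("YOUR_MODEL_NAME")  # placeholder: tokenizer of the served model

a = tok.encode("You are a helpful AI assistant.")
b = tok.encode("You are a helpful assistant.")

shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(f"shared prefix: {shared} of {len(a)} tokens")  # tokens after the divergence are recomputed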
Monitoring#
To monitor prefix caching effectiveness, observe:
Prefix cache hit ratio: Higher ratios indicate better reuse of cached prefixes. You can find this in the server logs.
Time-to-First-Token (TTFT): Should decrease for requests with cached prefixes
Request latency patterns: Requests that share a prefix with earlier requests should show consistent speedups
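A simple client-side way to observe the TTFT effect is to time the first streamed token for a cold request and then for a repeated request with the same prefix. The sketch below assumes the OpenAI-compatible streaming API; the base URL and model name are placeholders:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder URL

def time_to_first_token(messages, model="YOUR_MODEL_NAME"):  # placeholder model name
    start = time.perf_counter()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return None

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "How do I read a file?"},
]
cold = time_to_first_token(messages)   # first request also populates the prefix cache
warm = time_to_first_token(messages)   # repeated prefix should show a lower TTFT
print(f"cold TTFT: {cold:.3f}s, warm TTFT: {warm:.3f}s")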