# Hybrid KV Cache Management
Hybrid KV cache management is an internal optimization for models that mix global attention and sliding-window attention layers. Instead of treating all layers as if they had the same memory behavior, Furiosa-LLM manages them with separate KV cache pools and coordinated allocation logic.
This feature is applied automatically when the model uses hybrid attention; no extra user configuration is required.
## Overview
In hybrid models, global-attention layers and sliding-window layers have different cache growth patterns:

- Global attention: the cache grows with the full sequence length
- Sliding-window attention: the cache is bounded by the window size
Using one undifferentiated pool can over-provision memory for sliding-window layers, especially for long-context workloads. Hybrid KV cache management avoids this by separating the two memory paths.
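To make the gap concrete, here is a toy comparison of per-layer KV footprints; the head count, head dimension, and dtype size are illustrative assumptions, not values from any particular model:

```python
def kv_bytes_per_layer(seq_len, window=None, num_kv_heads=8,
                       head_dim=128, dtype_bytes=2):
    """KV bytes one layer holds for one sequence (K and V tensors).

    A global-attention layer caches every token; a sliding-window
    layer caches at most `window` tokens. All shape parameters are
    illustrative defaults, not any specific model's values.
    """
    cached = seq_len if window is None else min(seq_len, window)
    return 2 * cached * num_kv_heads * head_dim * dtype_bytes

# At 32k tokens, a global layer holds 64x the KV state of a layer
# with a 512-token sliding window:
print(kv_bytes_per_layer(32_768) // kv_bytes_per_layer(32_768, window=512))  # 64
```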
## How It Works
### Pool Partitioning
At initialization, Furiosa-LLM partitions KV cache memory into:

- A global-attention pool
- A sliding-window attention pool
Each pool then backs the KV cache space for layers of the corresponding attention type. The partitioning logic is attention-aware: memory is distributed according to expected global versus windowed usage rather than a one-size-fits-all split.
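As a minimal sketch of what an attention-aware split can look like (assuming a simple demand-proportional heuristic; the function and its inputs are hypothetical, not Furiosa-LLM's actual partitioning code):

```python
def partition_kv_budget(total_bytes, num_global_layers, num_window_layers,
                        window, expected_seq_len):
    """Split a KV budget between the two pools in proportion to the
    token capacity each layer type needs at the expected sequence length."""
    global_demand = num_global_layers * expected_seq_len
    window_demand = num_window_layers * min(expected_seq_len, window)
    global_pool = total_bytes * global_demand // (global_demand + window_demand)
    return global_pool, total_bytes - global_pool

# 24 global + 24 windowed layers, 1k window, 8k expected sequence length:
g, w = partition_kv_budget(64 << 30, 24, 24, 1024, 8192)
print(f"global: {g / 2**30:.1f} GiB, sliding-window: {w / 2**30:.1f} GiB")
```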
### Request Lifecycle
For each request phase, the scheduler coordinates both pools together:

- Prefill: allocates write blocks in both the global and sliding-window pools for incoming tokens
- Extend / decode: loads existing cached blocks and allocates new write blocks for newly generated tokens
- Cleanup: eagerly releases sliding-window blocks that have moved out of the valid window, while keeping global blocks reusable for the full prefix history
This is a key efficiency point: sliding-window cache entries outside the active window are reclaimed early instead of occupying memory until request completion.
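The sketch below illustrates the eager-release idea for a single sliding-window layer. The block manager, its names, and the block size are hypothetical stand-ins for the lifecycle described above, not Furiosa-LLM's actual data structures:

```python
from collections import deque

class SlidingWindowBlocks:
    """Cache blocks for one sliding-window layer of one request.

    Blocks whose tokens have all slid out of the attention window are
    returned to the free pool immediately, not at request completion.
    """

    def __init__(self, pool, block_tokens, window):
        self.pool = pool          # free block ids, shared across requests
        self.block_tokens = block_tokens
        self.window = window
        self.blocks = deque()     # (first_token_index, block_id), in order
        self.next_first = 0       # token index where the next block starts
        self.num_tokens = 0

    def append_tokens(self, n):
        # Prefill/decode: allocate write blocks to cover the new tokens.
        while self.num_tokens + n > self.next_first:
            self.blocks.append((self.next_first, self.pool.popleft()))
            self.next_first += self.block_tokens
        self.num_tokens += n
        # Cleanup: eagerly free blocks that are fully outside the window.
        window_start = self.num_tokens - self.window
        while self.blocks and self.blocks[0][0] + self.block_tokens <= window_start:
            _, block_id = self.blocks.popleft()
            self.pool.append(block_id)   # reusable by other requests now

pool = deque(range(1000))
req = SlidingWindowBlocks(pool, block_tokens=16, window=64)
for _ in range(1024):                    # decode 1024 tokens one at a time
    req.append_tokens(1)
print(len(req.blocks))                   # 4 blocks (~the 64-token window), not 64
```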
## Interaction with Prefix Caching
When prefix caching is enabled, hybrid cache management works with hybrid prefix matching to deduplicate both global and sliding-window blocks. If part of a matched prefix no longer has valid sliding-window cache, Furiosa-LLM still reuses the valid portion and computes only what is needed. For details on hybrid prefix-match behavior, see Hybrid Attention Models in Prefix Caching.
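As a hedged sketch of the resulting prefill split, where the hypothetical `valid_cached_len` stands in for whatever prefix both pools can still serve (the real matching rules are described in the Prefix Caching chapter):

```python
def prefill_split(prompt_len, valid_cached_len):
    """(tokens reused from cache, tokens that still need prefill compute).

    `valid_cached_len` is the longest prompt prefix for which both the
    global and sliding-window caches are still intact. If part of a
    matched prefix lost its sliding-window entries, only the still-valid
    portion counts, and the remainder is recomputed rather than the
    whole match being discarded.
    """
    reused = min(valid_cached_len, prompt_len)
    return reused, prompt_len - reused

print(prefill_split(prompt_len=4096, valid_cached_len=2560))  # (2560, 1536)
```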
## Why It Is Efficient
Compared to a single pooled approach, hybrid KV cache management provides:

- Lower memory waste in models with sparse or mixed attention patterns
- Higher effective cache capacity for long-context global attention
- Reduced eviction pressure, because stale sliding-window blocks are reclaimed earlier
- Stable serving behavior without requiring users to manually tune per-attention memory pools
For end users, the main benefit is straightforward: better KV memory utilization and more consistent performance on hybrid-attention models, with no extra configuration.
## Optional Tuning
For hybrid-attention models, you can optionally set the EXPECTED_AVERAGE_SEQ_LENGTH environment variable to guide how much KV memory is reserved for global attention versus sliding-window attention.
This is useful when your workload has a stable prompt-length pattern (for example, consistently very long prompts), and you want memory partitioning to better match that pattern.
- If unset, Furiosa-LLM uses a default ratio based on the model’s global-attention and sliding-window attention cache requirements.
- If set, Furiosa-LLM uses the provided expected sequence length to compute a more workload-aware split.
- The value must be a positive integer.
```sh
export EXPECTED_AVERAGE_SEQ_LENGTH=8192
furiosa-llm serve ...
```
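To see why the expected sequence length matters, the same demand-proportional toy formula as in the partitioning sketch (an assumption, not the real heuristic) shows how the split shifts as prompts get longer:

```python
def global_share(expected_seq_len, window=1024, n_global=24, n_window=24):
    """Fraction of the KV budget a demand-proportional split would give
    to the global pool (illustrative layer counts and window size)."""
    g = n_global * expected_seq_len
    w = n_window * min(expected_seq_len, window)
    return g / (g + w)

print(f"{global_share(1024):.0%}")    # 50%: short prompts, even split
print(f"{global_share(32_768):.0%}")  # 97%: long prompts favor the global pool
```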