# Furiosa-LLM
Furiosa-LLM is a high-performance inference engine for LLMs and multi-modal LLMs, offering state-of-the-art serving efficiency and optimizations. Key features of Furiosa-LLM include:
- vLLM-compatible APIs (`LLM`, `LLMEngine`, and `AsyncLLMEngine`)
- Efficient KV cache management with PagedAttention
- Continuous batching of incoming requests
- Quantization: FP8 (planned: INT4, INT8, GPTQ, AWQ)
- Data, tensor, and pipeline parallelism across multiple NPUs
- Various decoding algorithms: greedy search, top-k/top-p sampling, and speculative decoding (planned for 2026.3)
- OpenAI-compatible API server
- Tool calling and reasoning parser support
- Structured output generation with guided decoding (`guided_choice`, `guided_regex`, `guided_json`, `guided_grammar`)
- Chunked prefill
- Integration with Hugging Face models and the Hugging Face Hub
- Hugging Face PEFT support (planned)
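To illustrate how the OpenAI-compatible server and guided decoding fit together, here is a minimal sketch of a `/v1/chat/completions` request body using only the standard library. The model id and the vLLM-style `guided_json` field are assumptions for illustration, not confirmed by this page; see the Structured Output guide for the supported options.

```python
import json

# Hypothetical JSON schema to constrain the model's output.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
# "guided_json" follows the vLLM convention for guided decoding (assumed here).
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Name one city and its population as JSON."}
    ],
    "guided_json": schema,  # constrain the completion to match the schema
    "max_tokens": 128,
}

# This body would be POSTed to the server, e.g. http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
```

With `guided_json`, the server's guided decoder only samples tokens that keep the partial output valid under the schema, so the completion is guaranteed to parse as the requested JSON shape.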
## Documentation
- Quick Start with Furiosa-LLM: A quick start guide to Furiosa-LLM
- OpenAI-Compatible Server: Details about the OpenAI-compatible server and its features
- Tool Calling: Guide to tool calling with parsers and choice options
- Structured Output: Guide to structured output generation with guided decoding
- Prefix Caching: Guide to prefix caching for improved performance
- Hybrid KV Cache Management: Understanding hybrid KV cache management
- Model Preparation: How to prepare LLM models to be served by Furiosa-LLM
- Model Parallelism: A guide to model parallelism in Furiosa-LLM
- API Reference: Python API reference for Furiosa-LLM
- Examples: Examples of using Furiosa-LLM
- Deploying Furiosa-LLM on Kubernetes: A guide to deploying Furiosa-LLM on Kubernetes