# Furiosa-LLM
Furiosa-LLM is a high-performance inference engine for LLMs and multi-modal LLMs. Furiosa-LLM offers state-of-the-art serving efficiency and optimizations. Key features of Furiosa-LLM include:
- vLLM-compatible API (`LLM`, `LLMEngine`, and `AsyncLLMEngine`); see the sketch after this list
- Efficient KV cache management with PagedAttention
- Continuous batching of incoming requests
- Quantization: FP8 (planned: INT4, INT8, GPTQ, AWQ)
- Data, tensor, and pipeline parallelism across multiple NPUs
- OpenAI-compatible API server; see the client example after this list
- Various decoding algorithms: greedy search, top-k/top-p, and speculative decoding (planned for 2026.3)
- Tool calling and reasoning parser support
- Structured output generation (choice, regex, JSON schema, grammar); see the example after this list
- Chunked prefill
- Integration with Hugging Face models and Hub support
- Hugging Face PEFT support (planned)
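
Because the Python API mirrors vLLM, offline generation can look like the following minimal sketch. The model identifier and the exact layout of the returned output objects are assumptions here (the latter follows vLLM's `RequestOutput` convention); see the Quick Start and API Reference for the authoritative usage.

```python
# Minimal offline-generation sketch using the vLLM-compatible Python API.
# The model path is illustrative, and the output object layout is assumed
# to follow vLLM's RequestOutput convention.
from furiosa_llm import LLM, SamplingParams

llm = LLM("furiosa-ai/Llama-3.1-8B-Instruct")  # assumed model artifact
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for output in outputs:
    print(output.outputs[0].text)
```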
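
The OpenAI-compatible server can be queried with the standard `openai` Python client. The base URL, port, and model id below are assumptions; point them at your running server.

```python
# Sketch: querying a running Furiosa-LLM OpenAI-compatible server with the
# standard openai client. Base URL, port, and model id are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="furiosa-ai/Llama-3.1-8B-Instruct",  # assumed model id
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
)
print(response.choices[0].message.content)
```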
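
Structured output can be requested through the same server. The sketch below assumes vLLM-style guided-decoding fields passed via `extra_body` (here `guided_json`, a JSON-schema constraint); the Structured Output guide documents the parameter names Furiosa-LLM actually accepts.

```python
# Sketch of structured output via the OpenAI-compatible server, assuming
# vLLM-style guided decoding fields in extra_body. The "guided_json" field
# name is an assumption; check the Structured Output guide.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
response = client.chat.completions.create(
    model="furiosa-ai/Llama-3.1-8B-Instruct",  # assumed model id
    messages=[{"role": "user", "content": "Give a city and its population as JSON."}],
    extra_body={"guided_json": schema},  # assumed parameter name
)
print(response.choices[0].message.content)
```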
## Documentation
- Quick Start with Furiosa-LLM: A quick start guide to Furiosa-LLM
- OpenAI-Compatible Server: Details about the OpenAI-compatible server and its features
- Responses API: Guide to the OpenResponses-compatible Responses API
- Tool Calling: Guide to tool calling with parsers and choice options
- Structured Output: Guide to structured output generation
- Prefix Caching: Guide to prefix caching for improved performance
- Hybrid KV Cache Management: Understanding hybrid KV cache management
- Model Preparation: How to prepare models to be served by Furiosa-LLM
- Model Parallelism: A guide to model parallelism in Furiosa-LLM
- API Reference: Python API reference for Furiosa-LLM
- Examples: Usage examples for Furiosa-LLM
- Deploying Furiosa-LLM on Kubernetes: A guide to deploying Furiosa-LLM on Kubernetes