# Furiosa-LLM
Furiosa-LLM is a high-performance inference engine for LLMs and multi-modal LLMs, offering state-of-the-art serving efficiency and optimizations. Key features include:
- vLLM-compatible APIs: LLM, LLMEngine, and AsyncLLMEngine (see the Python sketch after this list)
- Efficient KV cache management with PagedAttention
- Continuous batching of incoming requests
- Quantization: FP8 (planned: INT4, INT8, GPTQ, AWQ)
- Support for data, tensor, and pipeline parallelism across multiple NPUs
- OpenAI-compatible API server (client example below)
- Various decoding algorithms: greedy search, beam search, top-k/top-p, and speculative decoding (planned for 2025.3)
- Support for context lengths of up to 32k tokens
- Tool calling and reasoning parser support
- Structured output generation with guided decoding: guided_choice, guided_regex, guided_json, guided_grammar (guided JSON sketch below)
- Chunked prefill
- Integration with Hugging Face models and hub support
- Hugging Face PEFT support (planned)
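A minimal sketch of offline generation through the vLLM-compatible Python API. The `furiosa_llm` import path, the artifact path, and the constructor call are assumptions here; the exact entry points are documented in Quick Start with Furiosa-LLM and the API Reference.

```python
# Offline generation with the vLLM-style API (illustrative names and paths).
from furiosa_llm import LLM, SamplingParams  # assumed import path

# Load a prebuilt model artifact; the path is a placeholder.
llm = LLM("./Llama-3.1-8B-Instruct")

# SamplingParams mirrors vLLM: temperature=0 selects greedy search,
# while top_p/top_k enable nucleus/top-k sampling.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["What is PagedAttention?"], params)
for output in outputs:
    print(output.outputs[0].text)
```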
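Because the server implements the OpenAI API, the stock `openai` Python client works against it unchanged. The port and model id below are placeholders; see OpenAI-Compatible Server for how to launch and configure the server.

```python
# Query a running Furiosa-LLM server with the official OpenAI client.
from openai import OpenAI

# The base_url port and model id are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```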
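For structured output, OpenAI-compatible servers commonly accept guided-decoding parameters such as `guided_json` through the request's extra body; the sketch below assumes that convention applies here as well. The schema and model id are illustrative, and the Structured Output guide documents the exact parameters.

```python
# Constrain generation to a JSON schema via guided decoding.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A hypothetical schema for the answer we want back.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Name a city and its population."}],
    # Passing guided_json via extra_body follows the common vLLM-style
    # convention; confirm against the Structured Output guide.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)
```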
## Documentation
- Quick Start with Furiosa-LLM: A quick start guide to Furiosa-LLM
- OpenAI-Compatible Server: Details about the OpenAI-compatible server and its features
- Structured Output: Guide to structured output generation with guided decoding
- Model Preparation: How to prepare models to be served by Furiosa-LLM
- Building Model Artifacts By Examples: A guide to building model artifacts through examples
- Model Parallelism: A guide to model parallelism in Furiosa-LLM
- API Reference: Python API reference for Furiosa-LLM
- Examples: Examples of using Furiosa-LLM
- Deploying Furiosa-LLM on Kubernetes: A guide to deploying Furiosa-LLM on Kubernetes