Furiosa LLM
Furiosa LLM is a high-performance inference engine for LLMs and multi-modal LLMs, offering state-of-the-art serving efficiency and optimizations. Key features of Furiosa LLM include:
vLLM-compatible API (LLM, LLMEngine, and AsyncLLMEngine); see the usage sketch after this list
Efficient KV cache management with PagedAttention
Continuous batching of incoming requests
Quantization: INT4, INT8, FP8, GPTQ, AWQ
Support for data, tensor, and pipeline parallelism across multiple NPUs
OpenAI-compatible API server; see the client sketch at the end of this section
Various decoding algorithms: greedy search, beam search, top-k/top-p, and speculative decoding (planned for 2025.1)
Support for context lengths of up to 32k tokens
Integration with Hugging Face models and the Hugging Face Hub
Hugging Face PEFT support (planned)
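As a minimal sketch of the vLLM-compatible API: the snippet below loads a prepared model artifact and generates text. The artifact path is a placeholder (prepare one first per the Model Preparation Workflow), and the output structure is assumed to mirror vLLM's.

```python
from furiosa_llm import LLM, SamplingParams

# Load a pre-compiled model artifact onto the NPU(s).
# The path below is a placeholder for an artifact you have prepared.
llm = LLM.load_artifact("./Llama-3.1-8B-Instruct")

# Decoding options follow vLLM's SamplingParams
# (greedy, top-k/top-p sampling, beam search, etc.).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```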
Documentation
Quick Start with Furiosa LLM: A quick start guide to Furiosa LLM
Model Preparation Workflow: Guide on how to prepare models to be served by Furiosa LLM
OpenAI-Compatible Server: Details about the OpenAI-compatible server and its features
Model Parallelism: Guide on tensor/pipeline/data parallelism in Furiosa LLM
API Reference: Python API reference for Furiosa LLM
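Because the server speaks the OpenAI API, it can be queried with the stock openai Python client. This is a sketch under assumptions: the base URL, API key, and model name below are placeholders to match to your deployment (see the OpenAI-Compatible Server guide for how to launch and configure the server).

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running Furiosa LLM server.
# base_url, api_key, and model are assumptions; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is an NPU?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```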