Furiosa-LLM

Furiosa-LLM#

Furiosa-LLM is a high-performance inference engine for large language models (LLMs) and multi-modal (vision-language) models. Furiosa-LLM offers state-of-the-art serving efficiency and optimizations. Key features of Furiosa-LLM include:

  • vLLM-compatible API (LLM, LLMEngine, AsyncLLMEngine API)

  • Text and multi-modal (vision-language) serving

  • Generative and pooling model support (embedding, reranking, classification)

  • Efficient KV cache management with PagedAttention

  • Radix-tree prefix caching for reuse of shared prompt prefixes

  • Hybrid KV cache for models that mix sliding-window and global attention

  • Continuous batching of incoming requests

  • Quantization: INT4, INT8, BF16, FP8, MXFP4, and NVFP4

  • Support for data, tensor, and pipeline parallelism across multiple NPUs

  • OpenAI-compatible API server with Chat Completions and Responses APIs

  • Various decoding algorithms: greedy search, top-k/top-p, and speculative decoding (planned)

  • Tool calling and reasoning parser support

  • Structured output generation (choice, regex, json schema, grammar)

  • Chunked prefill with mixed prefill/decode batching

  • Prometheus metrics endpoint for serving observability

  • Integration with Hugging Face models and hub support

Documentation#