Roadmap

FuriosaAI regularly releases software updates with new features, performance improvements, and expanded hardware support. This page presents the forward-looking roadmap of ongoing and upcoming projects and when they are expected to land, broken down by area of our software stack.

Note

The latest release is 2026.3.0. You can find the release notes here.

Upcoming Releases 2026 Q3

  • 🔨 EXAONE 4.5, Qwen 3.6, Gemma 4 model support

  • 🔨 Multi-modal (vision-language) serving optimization

  • 🔨 Hierarchical KV caching support, including KV cache offloading

  • 🔨 KVCacheConnector support

  • 🔨 Prefill/decode (PD) disaggregation

  • 🔨 Speculative decoding support

  • 🔨 PyTorch eager mode support


2026 Q1 - Q2

Furiosa-LLM

  • ✅ Qwen3 MoE, gpt-oss, K-EXAONE, Solar-Open model support

  • ✅ Qwen3-VL and multi-modal (vision-language) serving

  • ✅ Qwen3 dense and Qwen3 MoE (FP8), EXAONE 4.0 model support

  • ✅ TCL (Tensor Contraction Language) kernel framework and furiosa-kernels

  • ✅ FXB (Furiosa Executable Bundle) shareable compiled-artifact format

  • ✅ Overlap scheduler for zero-overhead batching

  • ✅ Per-model bucket presets for best default artifacts

  • ✅ Responses API support

  • ✅ Data Parallel router with prefix-aware and scoring-based routing

  • ✅ Enhanced observability with OpenTelemetry and per-device metrics

Platform & Packages

  • ✅ Python 3.14 support (supported versions now 3.10–3.14)

  • ✅ Broader arm64 (aarch64) support across Python wheels and cloud-native images

  • ✅ Rocky Linux 10 / RHEL support via .el10 RPM packages


2025 Q3 - Q4

Furiosa-LLM

  • ✅ Hybrid batching support (i.e., chunked prefill or inflight-batching)

  • ✅ EXAONE 4.0 and Qwen3 model support

  • ✅ Guided-decoding support (libguidance, xgrammar backends)

  • ✅ Tool-calling support

  • ✅ Prefix-caching support

  • ✅ Pooling Model support (embedding, score, and rank)

  • ✅ Fine-tuned model support

  • ✅ Tensor Parallelism support Phase 2: Inter-chip

  • ✅ Hugging Face Hub support

  • ✅ Pre-compiled artifacts on Hugging Face Hub

  • ✅ Qwen2 and Qwen2.5 model support

  • ✅ EXAONE3 model support

  • ✅ API Key based authentication support

  • ✅ Harmony response format support

Quantization

  • ✅ Fine-grained FP8 Quantization (dynamic quantization, mixed quantization)

Distributed & Scalable Inference

  • ✅ llm-d integration

  • ✅ NPU operator support for Kubernetes

  • ✅ DRA (Dynamic Resource Allocation) support for Kubernetes

2025 Q1 - Q2

  • ✅ Tool-calling support in Furiosa-LLM

  • ✅ Device remapping support (e.g., /dev/rngd/npu2pe0-3 -> /dev/rngd/npu0pe0-3) for containers

  • ✅ Automatic configuration for the maximum KV-cache memory allocation

  • ✅ Min-p sampling support

  • ✅ Chunked Prefill support in Furiosa-LLM

  • ✅ Chat API support in Furiosa-LLM

  • ✅ Reasoning parser support

  • ✅ Torch 2.5.1 support

  • ✅ Python 3.11 and 3.12 support

  • ✅ Support for building bfloat16, float16, and float32 models into model artifacts without quantization

  • ✅ Metrics endpoint (/metrics/) support in Furiosa-LLM

  • ✅ Model artifact support in Hugging Face Hub

  • ✅ Sampling parameter “logprobs” support

  • ✅ Container Runtime and Container Interface Device (CDI) support

2024 Q4

  • ✅ Language Model Support: CodeLLaMA2, Vicuna, Solar, EXAONE-3.0

  • ✅ Vision Model Support: MobileNetV1, MobileNetV2, ResNet152, ResNet50, EfficientNet, YOLOv8m, etc.

  • ✅ Tensor Parallelism support Phase 1: Intra-chip

  • ✅ Torch 2.4.1 support

  • ✅ Hugging Face Optimum integration