What’s New#
Furiosa SDK 2025.1.0 Beta1 (2025-02-24)#
2025.1.0 is the third major SDK release for RNGD. This release includes many new features and significant improvements, including LLM latency optimization, tool-calling support in Furiosa LLM, device remapping support for container environments, command-line tool improvements, and bug fixes.
Please refer to the Upgrading FuriosaAI’s Software section for instructions on obtaining this update.
🚀 Highlights#
- LLM latency optimization: up to 11.66% TTFT (time to first token) and 11.45% TPOT (time per output token) improvement for 30k-token inputs and 1k-token outputs
- Support tool calling in Furiosa LLM (Tool Calling)
- Support device remapping (e.g., `/dev/rngd/npu2pe0-3` -> `/dev/rngd/npu0pe0-3`) for container environments
- Add the new command line tool `furiosa-llm build` to easily build an artifact from a Hugging Face model (Building a Model Artifact)
- Fix continuous batch scheduling bugs that occur in certain ranges of sequence lengths and batch sizes
- Maximize blocked KV cache allocation by reducing memory fragmentation
- Allow the `furiosa-mlperf` command to specify `pipeline_parallel_size` and `data_parallel_size`
- Add the `--allowed-origins` argument to `furiosa-llm serve` (OpenAI-Compatible Server)
- Fix a `trust_remote_code` support bug in furiosa-llm
- Support Min-p sampling in `SamplingParams` (SamplingParams class); see the sketch after this list
- Allow `npu:X` in addition to `npu:X:*` in the `devices` option, e.g., `furiosa-llm serve ./model --devices "npu:0"`
- Allow the `furiosa-mlperf` command to accept `npu_q_limit` and `spare_ratio` options for performance tuning
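A minimal sketch of the new Min-p sampling option, assuming a prebuilt artifact and the vLLM-style `generate()` interface; the artifact path and prompt below are illustrative:

```python
from furiosa_llm import LLM, SamplingParams

# Load a prebuilt artifact (path is illustrative).
llm = LLM.load_artifact("./Llama-3.1-8B-Instruct")

# min_p keeps only tokens whose probability is at least min_p times
# that of the most likely token at each step.
params = SamplingParams(min_p=0.1, temperature=0.8, max_tokens=64)

outputs = llm.generate(["Summarize continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)  # assumes the vLLM-style output shape
```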
⚠️ Deprecations & Upcoming Changes#
The `LLM.from_artifacts()` API will be deprecated starting with the 2025.2.0 release. Please use `LLM.load_artifact()` instead (LLM class).
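A one-line migration sketch (the artifact path is illustrative):

```python
from furiosa_llm import LLM

# Before (will be deprecated in 2025.2.0):
# llm = LLM.from_artifacts("./my-artifact")

# After:
llm = LLM.load_artifact("./my-artifact")
```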
👎 Breaking Changes#
The `--model` option of `furiosa-llm serve` has become a positional argument. Please use `furiosa-llm serve <model>` instead of `furiosa-llm serve --model <model>` (OpenAI-Compatible Server).
Furiosa SDK 2024.2.1 Beta0 (2025-01-10)#
2024.2.1 is a minor release based on the 2024.2.0 major release.
Please refer to the Upgrading FuriosaAI’s Software section for instructions on obtaining this update.
🚀 Highlights#
- Support for context lengths of up to 32k in furiosa-llm for various models, including LLaMA 3.1 and EXAONE
- Artifacts with the same `tensor_parallel_size` are compatible regardless of `pipeline_parallel_size`, as sketched below
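For example, a hedged sketch of loading an artifact under a different pipeline layout; the `pipeline_parallel_size` keyword and artifact path are assumptions based on this note, not confirmed API details:

```python
from furiosa_llm import LLM

# An artifact built with tensor_parallel_size=4 (path illustrative) can be
# loaded with a different pipeline_parallel_size (keyword name assumed).
llm = LLM.load_artifacts("./my-tp4-artifact", pipeline_parallel_size=2)
```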
Versions of components:
| Package name | Version |
|---|---|
| furiosa-compiler | 2024.2.0 |
| furiosa-driver-rngd | 2024.2.1 |
| furiosa-firmware-tools-rngd | 2024.2.1 |
| furiosa-firmware-image-rngd | 2024.2.0 |
| furiosa-pert-rngd | 2024.2.1 |
| furiosa-model-compressor | 2024.2.0 |
| furiosa-llm | 2024.2.1 |
| furiosa-llm-models | 2024.2.0 |
| furiosa-mlperf | 2024.2.1 |
| furiosa-mlperf-resources | 4.1.0 |
| furiosa-native-compiler | 2024.2.0 |
| furiosa-native-runtime | 2024.2.1 |
| furiosa-feature-discovery | 2024.2.0 |
| furiosa-device-plugin | 2024.2.0 |
| furiosa-smi | 2024.2.0 |
| furiosa-libsmi | 2024.2.0 |
Furiosa SDK 2024.2.0 Beta0 (2024-12-23)#
2024.2.0 is the second major SDK release for RNGD. This release includes many new features and significant improvements, including new model support, context lengths of up to 8k, Tensor Parallelism, PyTorch 2.4, the Optimum API, and multiple performance improvements.
Please refer to the Upgrading FuriosaAI’s Software section for instructions on obtaining this update.
🚀 Highlights#
- New model support: Solar, EXAONE-3.0, CodeLLaMA2, Vicuna
- Up to 8k context length support in models such as LLaMA 3.1
- Tensor Parallelism support (`tensor_parallel_size <= 8`)
- PyTorch 2.4.1 support
- Transformers 4.44.2 support
- Furiosa LLM
  - ArtifactBuilder API and CLI tools (refer to ArtifactBuilder API)
    - Users can build artifacts from Hugging Face Hub models with a Hugging Face Transformers-compatible API
  - Hugging Face Transformers-compatible API support (`furiosa_llm.optimum`); see the sketch after this list
    - AutoModel, AutoModelForCausalLM, and AutoModelForQuestionAnswering APIs
    - QuantizerForCausalLM API support for calibration and quantization
    - LLMEngine and AsyncLLMEngine API support, compatible with vLLM
  - About 20% performance improvement in models based on LlamaForCausalLM (e.g., 3580 tokens/sec for the LLaMA 3.1 8B model with a single RNGD card)
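As a flavor of the Transformers-compatible API, a hedged sketch assuming `furiosa_llm.optimum.AutoModelForCausalLM` follows the usual `from_pretrained()`/`generate()` conventions; the model id, prompt, and generation arguments are illustrative:

```python
from transformers import AutoTokenizer
from furiosa_llm.optimum import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("FuriosaAI RNGD is", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```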
👎 Breaking Changes#
The `LLM.from_artifacts()` API has been deprecated. Please use `LLM.load_artifacts()` instead.
Artifacts built with 2024.1.x are not compatible with 2024.2.x. Please use artifacts built with 2024.2.x.
Versions of components:

| Package name | Version |
|---|---|
| furiosa-compiler | 2024.2.0 |
| furiosa-driver-rngd | 2024.2.0 |
| furiosa-firmware-tools-rngd | 2024.2.0 |
| furiosa-firmware-image-rngd | 2024.2.0 |
| furiosa-pert-rngd | 2024.2.0 |
| furiosa-llm | 2024.2.0 |
| furiosa-llm-models | 2024.2.0 |
| furiosa-mlperf | 2024.2.0 |
| furiosa-mlperf-resources | 4.1.0 |
| furiosa-model-compressor | 2024.2.0 |
| furiosa-native-compiler | 2024.2.0 |
| furiosa-native-runtime | 2024.2.0 |
| furiosa-smi | 2024.2.0 |
| furiosa-libsmi | 2024.2.0 |
| furiosa-device-plugin | 2024.2.0 |
| furiosa-feature-discovery | 2024.2.0 |
Furiosa SDK 2024.1.0 Alpha (2024-10-11)#
2024.1.0 is the first SDK release for RNGD. This is an alpha release, and the features and APIs described in this document may change in the future.
🚀 Highlights#
- Model support: LLaMA 3.1 8B/70B, BERT Large, GPT-J 6B
- Furiosa Quantizer supports the following quantization methods:
  - BF16 (W16A16)
  - INT8 Weight-Only (W8A16)
  - FP8 (W8A8)
  - INT8 SmoothQuant (W8A8)
- Furiosa LLM
  - Efficient KV cache management with PagedAttention
  - Continuous batching support in serving
  - OpenAI-compatible API server; see the sketch after this list
  - Greedy search and beam search
  - Pipeline Parallelism and Data Parallelism across multiple NPUs
- `furiosa-mlperf` command
  - Server and Offline scenarios
  - BERT, GPT-J, and LLaMA 3.1 benchmarks
- System Management Interface
  - System Management Interface library and CLI for the Furiosa NPU family
- Cloud Native Toolkit
  - Kubernetes integration for managing and monitoring the Furiosa NPU family
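Because the server speaks the OpenAI API, it can be queried with the standard `openai` client; a minimal sketch assuming a locally running server (the base URL, API key, and model name are illustrative):

```python
from openai import OpenAI

# Point the client at the local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # model name as served
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```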
Versions of components:

| Package name | Version |
|---|---|
| furiosa-compiler | 2024.2.0 |
| furiosa-device-plugin | 2024.2.0 |
| furiosa-driver-rngd | 2024.2.0 |
| furiosa-feature-discovery | 2024.1.0 |
| furiosa-firmware-tools-rngd | 2024.1.0 |
| furiosa-firmware-image-rngd | 2024.1.0 |
| furiosa-libsmi | 2024.2.0 |
| furiosa-llm | 2024.2.0 |
| furiosa-llm-models | 2024.2.0 |
| furiosa-mlperf | 2024.2.0 |
| furiosa-mlperf-resources | 4.1.0 |
| furiosa-model-compressor | 2024.1.0 |
| furiosa-native-compiler | 2024.2.0 |
| furiosa-native-runtime | 2024.2.0 |
| furiosa-smi | 2024.1.0 |