FuriosaAI’s Software Stack#

FuriosaAI provides a streamlined software stack that allows FuriosaAI NPUs to be used in a wide range of applications and environments. This page outlines the software stack provided by FuriosaAI, explaining the role of each component together with guidelines and tutorials. The diagram below shows the software stack, organized by layer.

FuriosaAI Software Stack

The following outlines the key components.

Kernel Driver, Firmware, and PE Runtime#

The kernel device driver enables the Linux operating system to recognize NPU devices and expose them as Linux device files. The firmware runs on the SoC within the RNGD card and provides low-level APIs to the PE Runtime (PERT), which runs on each Processing Element (PE). PERT communicates with the host runtime and schedules and manages the resources of PEs to execute NPU tasks.

Furiosa Compiler#

The Furiosa compiler analyzes and optimizes a model graph and generates an executable NPU program. Its optimization passes include graph-level optimization, operator fusion, memory-allocation optimization, scheduling, and minimizing data movement across layers. When the torch.compile() backend FuriosaBackend is used, or when furiosa-llm is used, the Furiosa compiler runs transparently to generate executable NPU programs for the runtime.
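
For example, an ordinary PyTorch module can be compiled for the NPU through the standard torch.compile() entry point. The sketch below is a minimal illustration; the backend identifier string "furiosa" is an assumption here, so check the Furiosa SDK documentation for the exact name under which FuriosaBackend is registered.

```python
import torch

# Any torch.nn.Module works the same way; a tiny MLP keeps the example short.
class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(512, 1024)
        self.fc2 = torch.nn.Linear(1024, 512)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP().eval()

# Compile for the NPU. The backend name "furiosa" is an illustrative
# assumption; use the identifier registered by FuriosaBackend in your SDK.
compiled = torch.compile(model, backend="furiosa")

with torch.no_grad():
    out = compiled(torch.randn(1, 512))  # first call triggers compilation
```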

Furiosa Runtime#

The runtime loads the executable NPU programs generated by the Furiosa compiler and runs them on the NPU. A single model can be compiled into multiple executable programs, depending on the model architecture and the application. The runtime is responsible for scheduling NPU programs and managing computation and memory resources on NPUs and CPUs. It can also use multiple NPUs, providing a single entry point for running a model across them.

Furiosa Model Compressor (Quantizer)#

Furiosa Model Compressor is both a library and a toolkit for model calibration and quantization. Quantization is a powerful technique for reducing memory footprint, computation cost, inference latency, and power consumption. Furiosa Model Compressor provides post-training quantization methods such as the following (a sketch of the underlying arithmetic appears after the list):

  • BF16 (W16A16)

  • INT8 Weight-Only (W8A16)

  • FP8 (W8A8)

  • INT8 SmoothQuant (W8A8)

  • INT4 Weight-Only (W4A16 AWQ / GPTQ) (Planned in release 2024.2)
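
To see why quantization shrinks a model, note that a scheme such as W8A16 stores each 16-bit weight as an 8-bit integer plus a shared scale, halving weight memory (a 4x reduction versus FP32). The snippet below is a generic illustration of symmetric per-tensor post-training quantization of weights, not Furiosa Model Compressor's API.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)

print(q.nbytes / w.nbytes)                      # 0.25: INT8 vs FP32 storage
print(np.abs(w - dequantize(q, scale)).max())   # worst-case rounding error
```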

Furiosa LLM#

Furiosa LLM provides a high-performance inference engine for LLMs such as Llama 3.1 70B and 8B, GPT-J, and BERT. It is designed to provide state-of-the-art serving optimizations for LLMs. Its key features include a vLLM-compatible API, PagedAttention, continuous batching, Hugging Face Hub support, and an OpenAI-compatible API server. You can find further information at Furiosa LLM.
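
Because Furiosa LLM exposes a vLLM-compatible API, offline text generation looks like the sketch below. The import path, constructor argument, and output fields mirror the vLLM-style interface and are assumptions here; consult the Furiosa LLM documentation for the exact entry points and artifact names.

```python
# Sketch of vLLM-style offline inference with Furiosa LLM. Names below are
# assumptions based on the vLLM-compatible API described above.
from furiosa_llm import LLM, SamplingParams

llm = LLM("furiosa-ai/Llama-3.1-8B-Instruct")  # placeholder model artifact
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

outputs = llm.generate(["What is an NPU?"], params)
for output in outputs:
    print(output.outputs[0].text)
```

The same engine can also be exposed through the OpenAI-compatible API server, so existing OpenAI clients work unchanged.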

Kubernetes Support#

Kubernetes, an open-source platform for managing containerized applications and services, is widely adopted thanks to its robust capabilities for deploying, scaling, and automating containerized workloads. The FuriosaAI software stack offers native integration with Kubernetes, allowing seamless deployment and management of AI applications within a Kubernetes environment.

FuriosaAI’s device plugin enables Kubernetes clusters to recognize FuriosaAI’s NPUs and allows the NPUs to be scheduled for workloads and services that require them, as in the example manifest below. This lets users easily deploy AI workloads on FuriosaAI NPUs in Kubernetes, enabling efficient resource utilization and scaling.
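
As a concrete illustration, a workload requests an NPU through Kubernetes' extended-resource mechanism. In the sketch below, the resource name furiosa.ai/npu and the container image are assumptions for illustration; the Cloud Native Toolkit documentation lists the exact resource names registered by the device plugin.

```yaml
# Example pod requesting one FuriosaAI NPU through the device plugin.
# The resource name "furiosa.ai/npu" and the image are illustrative
# assumptions; check the Cloud Native Toolkit docs for the exact values.
apiVersion: v1
kind: Pod
metadata:
  name: npu-inference
spec:
  containers:
    - name: app
      image: my-registry/llm-server:latest   # placeholder image
      resources:
        limits:
          furiosa.ai/npu: 1   # schedule onto a node with a free NPU
```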

You can find more information about Kubernetes support in the Cloud Native Toolkit.