Introduction
FuriosaAI’s Tensor Contraction Processor (TCP) is a massively parallel AI accelerator targeting inference workloads. High-level frameworks such as PyTorch and XLA abstract away memory layouts and hardware scheduling, but give programmers no control over either. Low-level kernel APIs give fine-grained control, but require reasoning in bytes and hardware addresses rather than tensors. TCP’s Virtual Instruction Set Architecture (Virtual ISA) bridges this gap: it lets programmers think in terms of tensors while directly managing memory allocation and tensor unit scheduling. This manual explains TCP programming through the Virtual ISA.
The manual walks through concrete examples, targeting two audiences: programmers writing Virtual ISA directly and compiler developers generating it. Basic Rust familiarity is assumed; see the language manual if needed.
Warning
Alpha Test Build: Experimental Software
This software is an early, experimental, and incomplete build intended strictly for technical evaluation and internal testing.
Before using this software for any production work, critical tasks, or for important data, you must consult with Furiosa engineers.
Your feedback is vital to our development. Please provide it.
Installation
Install two dependencies:
- Rust: Follow the official guide.
- Furiosa SDK: Follow the SDK documentation.
Your First Program
Create a new project:
cargo new --bin tcp-my-project
cd tcp-my-project
cargo add furiosa-visa-std tokio rand
Add rust-toolchain.toml:
[toolchain]
channel = "nightly-2025-12-12"
components = ["rustfmt", "clippy"]
Write main.rs:
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*; // provided by the Furiosa SDK
// Declare axis sizes
axes![A = 8, B = 512];
/// The main function running in host
#[tokio::main]
async fn main() {
// Acquire exclusive access to the TCP device
let mut ctx = Context::acquire();
// TCP has three memory levels:
// - Host: system memory
// - HBM (High-Bandwidth Memory): device's main memory
// - SRAM (on-chip scratchpad): the primary SRAM tier is called DM (Data Memory)
//
// Data flows: Host → HBM → DM → compute → DM → HBM → Host.
//
// Two DMA engines move data between these levels:
// - `ctx.pdma` (PCIe DMA): transfers between Host and HBM
// - `ctx.tdma` (Tensor DMA): transfers between HBM and DM
// Create tensor on host
// Tensors are parameterized by element type and mapping
// The mapping `m![A, B]` specifies `A` as the major axis and `B` as the minor axis
let mut rng = SmallRng::seed_from_u64(42);
let host: HostTensor<i8, m![A, B]> = HostTensor::rand(&mut rng);
// Transfer to device HBM using PCIe DMA engine
// HBM tensor has two dimensions: m![A] for chip and m![B] for intra-chip address
let hbm: HbmTensor<i8, m![A], m![B]> = host.to_hbm(&mut ctx.pdma, 0x1000).await;
// Launch kernel on device
// Host continues while kernel runs asynchronously, but the kernel synchronously occupies the device
launch(kernel, (&mut ctx, &hbm))
// Host waits for the asynchronous execution of the kernel to finish
.await;
}
#[device(chip = 1)] // Running on a single chip
fn kernel(ctx: &mut Context, hbm: &HbmTensor<i8, m![A], m![B]>) {
// Move to DM (Data Memory) in on-chip SRAM using Tensor DMA engine
let dm = hbm.to_dm::<m![1], m![A], m![B]>(&mut ctx.tdma, 0);
// ... perform computations ...
}
Build and Test
TCP supports two execution environments, ordered from fastest iteration to production use:
# 1. CPUs (standalone Rust)
cargo build # Add --release for optimized builds, same below
cargo test
# 2. Real TCP devices
cargo furiosa-opt build
cargo furiosa-opt test
Development Tools
The TCP Software Toolchain (cargo furiosa-opt) provides utilities for developing, testing, and optimizing Virtual ISA programs on Furiosa chips.
It complements the Furiosa SDK’s compiler by giving developers fine-grained control over program behavior, whether the programmer writes Virtual ISA by hand or a compiler generates it.
The toolchain consists of four components:
- Compiler: Translates Virtual ISA into executable code for the chip.
- Interpreter: Executes Virtual ISA as native Rust programs for software simulation and debugging.
- Language Server: Enables IDE features (autocompletion, diagnostics, navigation) via Rust’s language server infrastructure.
- Schedule Viewer: Visualizes the execution timeline to help identify performance bottlenecks.
Book Organization
The rest of this book is organized in the following chapters:
- Hello, TCP!: How TCP programming works, introduced through worked examples covering element-wise operations and tensor contractions.
- Mapping Tensors: How logical tensors map to physical memory: axis layout, stride, padding, and tiling.
- Moving Tensors: How data moves between memory tiers (HBM, DM) and the Tensor Unit via Fetch, Commit, and DMA engines.
- Computing Tensors: How the Tensor Unit pipeline (Switching, Collect, Contraction, Vector, Cast, Transpose) transforms data each cycle.
- Scheduling: How to control the order and concurrency of operations across contexts.
- Kernel Examples: End-to-end examples showing how mapping, movement, computation, and scheduling combine into real kernels.
License
This documentation and the entire furiosa-opt repository are licensed under the Apache License Version 2.0.
Hello, TCP!
This chapter introduces TCP programming through worked examples. Each example builds a mental model of how computation maps to hardware, making the rest of this book easier to follow. The first two examples cover element-wise operations; the remaining three cover tensor contractions (dot product, GEMV, and GEMM), each adding one new hardware concept. Two additional examples (Blocked GEMM and Flash Attention) are outlined as stubs.
Mathematical Background
This section defines the two mathematical concepts that TCP is built to accelerate: tensors and their contractions.
Tensor
A tensor is a mapping from each tensor index to its corresponding value.
To understand this, we must first define a tensor’s shape.
Unlike other libraries where axis order encodes meaning (e.g., NumPy’s ndarray), we define a tensor’s shape as an unordered set of named axes.
The shapes \(\{\texttt{N} = 4, \texttt{C} = 3\}\) and \(\{\texttt{C} = 3, \texttt{N} = 4\}\) identify the same tensor; axis names carry the meaning, not the position.
A tensor index is formed by specifying an index value for each axis. For a tensor with shape \(\{\texttt{N} = 4, \texttt{C} = 3\}\), the valid indices are: \(\{\texttt{N}: 0, \texttt{C}: 0\}\), \(\{\texttt{N}: 0, \texttt{C}: 1\}\), \(\{\texttt{N}: 0, \texttt{C}: 2\}\), \(\{\texttt{N}: 1, \texttt{C}: 0\}\), etc.
A tensor can behave like a multi-dimensional array of numbers. For example:
- 0D Tensor (Scalar): a single number like \(5.2\)
- 1D Tensor (Vector): a sequence like \([1, 2, 3]\) with one axis
- 2D Tensor (Matrix): a \(2 \times 4\) grid with two axes
- 4D Tensor: a batch of RGB images with shape \(\{\texttt{N} = 4, \texttt{C} = 3, \texttt{H} = 256, \texttt{W} = 512\}\)
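In Virtual ISA, tensor indices are written with the `i![...]` constructor exported by the prelude (it is documented in the Mapping Tensors chapter). A minimal sketch, assuming the \(\{\texttt{N} = 4, \texttt{C} = 3\}\) shape above:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
// Declare the axes, then spell out two of the twelve valid indices.
axes![N = 4, C = 3];
let first = i![N: 0, C: 0];
let next = i![N: 0, C: 1];
}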
Tensor Contraction
A tensor contraction is an operation that takes two tensors, pairs up specific axes that appear in both inputs, and then sums the products of their elements along those axes.
Einsum notation is a compact way to write contractions: each input tensor is listed by its axis labels, and output axes follow the → arrow; any axis that appears in both inputs but not in the output is summed over.
| Operation | Formula | Einsum notation |
|---|---|---|
| Dot product | \(\sum_i x_i y_i\) | \(I, I \rightarrow 1\) |
| GEMV | \(y_i = \sum_j A_{ij} x_j\) | \(IJ, J \rightarrow I\) |
| GEMM | \(C_{ij} = \sum_k A_{ik} B_{kj}\) | \(IK, KJ \rightarrow IJ\) |
Every contraction can be decomposed into three steps: Broadcast, Multiply, and Reduce.
| Step | Dot Product (\(I, I \rightarrow 1\)) | GEMV (\(IJ, J \rightarrow I\)) | GEMM (\(IK, KJ \rightarrow IJ\)) |
|---|---|---|---|
| Broadcast | none (axes match) | \(x\) broadcasts across \(I\) | \(A\) across \(J\); \(B\) across \(I\) |
| Multiply | \(x_i \cdot y_i\) | \(A_{ij} \cdot x_j\) | \(A_{ik} \cdot B_{kj}\) |
| Reduce | \(\sum_i x_i y_i\) | \(y_i = \sum_j A_{ij} x_j\) | \(C_{ij} = \sum_k A_{ik} B_{kj}\) |
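As a plain-Rust reference (not Virtual ISA code), the three steps for GEMV can be sketched as an ordinary iterator chain; the helper name below is hypothetical and only meant to make the decomposition concrete:
fn gemv_reference(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| {
            row.iter()
                .zip(x.iter())                 // broadcast: the same x is reused for every row i
                .map(|(a_ij, x_j)| a_ij * x_j) // multiply
                .sum::<f32>()                  // reduce over j
        })
        .collect()
}
fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let x = vec![10.0, 100.0];
    assert_eq!(gemv_reference(&a, &x), vec![210.0, 430.0]);
}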
Tensor Contraction Processor
This section covers the hardware concepts needed to understand the examples: the processing unit hierarchy, memory tiers, tensor mapping types, and execution contexts.
Processing Units
The TCP architecture accelerates these contractions by streaming tensor data through a hierarchy of parallel processing units.
| Level | Count (RNGD) | Role |
|---|---|---|
| Chip | (system-dependent) | Top-level unit; holds HBM |
| Cluster | 2 per chip | Groups 256 slices |
| Slice | 256 per cluster | Runs one Tensor Unit: a Fetch → Switching → Collect → Contraction → Vector → Cast → Transpose → Commit pipeline |
| Row | 8 per slice | One row of the Contraction Engine’s MAC (multiply-accumulate) array |
The Switch Engine connects slices, enabling data redistribution across the slice array.
Memory
| Type | Location | Capacity (RNGD) | Role |
|---|---|---|---|
HbmTensor | On-package | 48 GB, 1.5 TB/s | Long-term weight and activation storage |
DmTensor | On-chip SRAM | 256 MB total; 512 KB/slice | Primary working memory for computations |
TrfTensor | On-chip SRAM | 8 KB × 8 MAC rows / slice | Weight register file for Contraction Engine |
VrfTensor | On-chip SRAM | 8 KB / slice | Operand register file for Vector Engine |
Most alignment and capacity constraints in this book derive from the counts and capacities in these tables.
Tensor Mapping
TCP’s Virtual ISA exposes the hardware hierarchy through its type system.
Each tensor type encodes the element type and how each logical axis distributes across the hardware hierarchy.
For example, DmTensor<bf16, m![1], m![1 # 2], m![A / 8 # 256], m![A % 8]> (with axes![A = 2048]) represents a bf16 tensor on one chip, one of two clusters, distributed across 256 slices with 8 elements per slice.
TCP also introduces two kernel-specific parameters: Time indexes pipeline iterations; Packet indexes elements within each iteration.
The mapping expression (m![] macro and its operators) is used to express this distribution:
- `/` splits by stride: `A / 8` gives 2048 / 8 = 256 indices, the “which slice” index.
- `%` gives the inner count: `A % 8` gives the 8 indices for the elements each slice holds.
- `#` pads to the hardware unit count: `# 256` makes the slice count explicit.
Together, each element of A is mapped to a well-defined position within exactly one slice.
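The same type is easier to read when each hardware level is given a named alias; a minimal sketch (the `Chip`, `Cluster`, and `Slice` aliases reappear in the examples later in this chapter; `PerSlice` and `Example` are just illustrative names):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
type Chip = m![1];            // one chip
type Cluster = m![1 # 2];     // one active cluster, padded to the 2 clusters per chip
type Slice = m![A / 8 # 256]; // 2048 / 8 = 256 "which slice" indices
type PerSlice = m![A % 8];    // the 8 elements each slice holds
type Example = DmTensor<bf16, Chip, Cluster, Slice, PerSlice>;
}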
Execution Contexts
Every device kernel has two execution contexts running concurrently on separate hardware resources: ctx.main and ctx.sub.
main runs the primary computation; sub runs a concurrent pipeline, typically used to prefetch operands into TRF or VRF while main computes.
If main needs operands that sub is still fetching, main automatically waits for sub’s execution to ensure synchronization.
Because both contexts share the flat on-chip SRAM, the programmer must explicitly assign DM addresses (e.g. the addr argument in .to_dm(), .commit()) to prevent tensors from overlapping.
Addresses must not collide, but they can be non-contiguous.
Examples
The first two examples cover element-wise operations by using the Vector Engine; the remaining three cover tensor contractions by using the Contraction Engine.
Constant Addition
The first kernel takes a vector of integers and adds the constant 1 to each element.
It uses one chip, one of two clusters, and all 256 slices in that cluster, with one 8-element group per slice.
The Vector Engine processes integers using fixed-point operations, so we use vector_fxp(FxpBinaryOp::AddFxp, 1) to add the constant value.
flowchart TB
HOST[Host] <-->|PCIe DMA| HBM[(HBM)]
HBM <-->|Tensor DMA| DM[(DM)]
subgraph TU[Tensor Unit]
direction TB
FE[Fetch] --> SW["Switch (Forward)"] --> CO[Collect] --> VE["Vector (AddFxp +1)"] --> CM[Commit]
end
DM -->|stream| FE
CM -->|stream| DM
This example demonstrates the full Tensor Unit pipeline.
to_dm moves data from HBM to DM, splitting the flat tensor across 256 slices.
The begin → fetch → collect → vector_init → vector_intra_slice_branch → vector_fxp → vector_final → commit chain processes each slice in one pass, and vector_fxp(FxpBinaryOp::AddFxp, 1) adds the integer constant 1 to every element in parallel across all 256 slices.
BranchMode::Unconditional configures the pipeline to execute on every cycle.
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![A = 2048]; // declare named axis A with size 2048; used in all tensor types below
type Chip = m![1];
type Cluster = m![1 # 2]; // 1 active cluster; hardware has 2 per chip
type Slice = m![A / 8 # 256]; // distribute A across 256 slices, 8 elements each
#[tokio::main]
async fn main() {
let mut ctx = Context::acquire();
// Create input on the host and transfer to HBM
let mut rng = SmallRng::seed_from_u64(42);
let input = HostTensor::<i32, m![A]>::rand(&mut rng);
let in_hbm = input.to_hbm(&mut ctx.pdma, 0).await;
// Launch the device kernel
let out_hbm = launch(kernel, (&mut ctx, &in_hbm)).await;
// Transfer result back to host
let _out = out_hbm.to_host::<m![A]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(ctx: &mut Context, input: &HbmTensor<i32, Chip, m![A]>) -> HbmTensor<i32, Chip, m![A]> {
// HBM → DM: split 2048 elements across 256 slices (8 elements per slice)
let dm = input.to_dm::<Cluster, Slice, m![A % 8]>(&mut ctx.tdma, 0);
let result = ctx
.main
.begin(dm.view())
// Fetch: stream 8-element packets from DM into the pipeline
.fetch::<i32, m![1], m![A % 8]>()
// Collect: normalize the stream into 32-byte flits (8 × i32)
.collect::<m![1], m![A % 8]>()
// Vector Engine: enter pipeline and arm unconditionally
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
// Add the scalar constant 1 to every element
.vector_fxp(FxpBinaryOp::AddFxp, 1)
// Exit VE and commit: write results back to DM
.vector_final()
.commit::<m![A % 8]>(1 << 12);
// DM → HBM
result.to_hbm(&mut ctx.tdma, 1 << 28)
}
Elementwise Multiplication
The second kernel multiplies two same-shape vectors element-wise.
Because the Vector Engine’s fixed-point multiply operation (`FxpBinaryOp::MulInt`) takes a second operand per element, that operand must come from the VRF (Vector Register File).
The VRF is a small per-slice register file that the Vector Engine reads every cycle; it is loaded in the sub context while the main computation streams.
flowchart TB
LHS_HBM[(lhs: HBM)] -->|Tensor DMA| LHS_DM[(lhs: DM)]
RHS_HBM[(rhs: HBM)] -->|Tensor DMA| RHS_DM[(rhs: DM)]
subgraph sub[sub context]
direction LR
sFE[Fetch] --> sSW[Switch] --> sCO[Collect]
end
subgraph main[main context]
direction LR
mFE[Fetch] --> mSW[Switch] --> mCO[Collect] --> VE["Vector (MulInt)"] --> CM[Commit]
end
RHS_DM --> sFE
LHS_DM --> mFE
sCO --> VRF[(VRF)]
VRF --> VE
CM --> OUT_DM[(result: DM)]
OUT_DM -->|Tensor DMA| OUT_HBM[(HBM)]
This example adds the VRF and the sub context.
rhs_dm is allocated at a different base address (1 << 12) to avoid overlapping with lhs_dm.
The sub context loads rhs_dm into the VRF through the Fetch → Switch → Collect → .to_vrf(0) pipeline.
The main context then streams lhs_dm and multiplies each element by its VRF counterpart using MulInt; the hardware runs both contexts concurrently where possible.
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
type Slice = m![A / 8 # 256];
#[tokio::main]
async fn main() {
let mut ctx = Context::acquire();
let mut rng = SmallRng::seed_from_u64(42);
let lhs = HostTensor::<i32, m![A]>::rand(&mut rng);
let rhs = HostTensor::<i32, m![A]>::rand(&mut rng);
let lhs_hbm = lhs.to_hbm(&mut ctx.pdma, 0).await;
let rhs_hbm = rhs.to_hbm(&mut ctx.pdma, 1 << 28).await;
let out_hbm = launch(kernel, (&mut ctx, &lhs_hbm, &rhs_hbm)).await;
let _out = out_hbm.to_host::<m![A]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
lhs: &HbmTensor<i32, Chip, m![A]>,
rhs: &HbmTensor<i32, Chip, m![A]>,
) -> HbmTensor<i32, Chip, m![A]> {
// Move both operands from HBM to DM; use distinct base addresses to avoid overlap
let lhs_dm = lhs.to_dm::<Cluster, Slice, m![A % 8]>(&mut ctx.tdma, 0);
let rhs_dm = rhs.to_dm::<Cluster, Slice, m![A % 8]>(&mut ctx.tdma, 1 << 12);
// Sub context: load rhs into VRF (runs concurrently with the main context below).
// VRF holds a per-slice operand that the Vector Engine reads every cycle.
let rhs_vrf: VrfTensor<i32, Chip, Cluster, Slice, m![A % 8]> = ctx
.sub
.begin(rhs_dm.view())
.fetch::<i32, m![1], m![A % 8]>()
.collect::<m![A % 8 / 8], m![A % 8 % 8]>()
.to_vrf(0);
// Main context: multiply every lhs element by its rhs counterpart from VRF
let result = ctx
.main
.begin(lhs_dm.view())
.fetch::<i32, m![1], m![A % 8]>()
.collect::<m![1], m![A % 8]>()
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
// Each slice multiplies its 8 lhs elements by the matching 8 rhs elements in VRF
.vector_fxp(FxpBinaryOp::MulInt, &rhs_vrf)
.vector_final()
.commit::<m![A % 8]>(1 << 13);
result.to_hbm(&mut ctx.tdma, 1 << 28)
}
The following three examples implement the contractions from the table above, each introducing a different Switch Engine topology.
Dot Product
The dot product \(I, I \rightarrow 1\) is the simplest contraction: there is no broadcast step, and both operands reduce along the same axis.
The sub context loads rhs into the TRF, the on-chip register file that holds one operand stationary while the other streams through, via Fetch → Collect → .to_trf().
TrfAddress::Full dedicates the entire TRF to this tensor.
.align() pairs the streaming LHS flits with the stationary RHS, doubling the packet width.
.contract() multiplies and reduce-adds along A spatially via the hardware reduction tree; .accumulate() then performs temporal accumulation across the time axis, producing a scalar per slice; .cast() converts the f32 accumulator output back to bf16.
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
type Slice = m![1 # 256]; // 1 active slice; m![A / 8 # 256] would distribute across all 256
type Time = m![1]; // No temporal iteration
type Row = m![1]; // No row parallelism
#[tokio::main]
async fn main() {
let mut ctx = Context::acquire();
let mut rng = SmallRng::seed_from_u64(42);
let lhs = HostTensor::<bf16, m![A]>::rand(&mut rng);
let rhs = HostTensor::<bf16, m![A]>::rand(&mut rng);
let lhs_hbm = lhs.to_hbm(&mut ctx.pdma, 0).await;
let rhs_hbm = rhs.to_hbm(&mut ctx.pdma, 1 << 28).await;
let out_hbm = launch(kernel, (&mut ctx, &lhs_hbm, &rhs_hbm)).await;
let out = out_hbm.to_host::<m![1]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
lhs: &HbmTensor<bf16, Chip, m![A]>,
rhs: &HbmTensor<bf16, Chip, m![A]>,
) -> HbmTensor<bf16, Chip, m![1]> {
// HBM → DM
let lhs: DmTensor<bf16, Chip, Cluster, Slice, m![A]> = lhs.to_dm(&mut ctx.tdma, 0);
let rhs: DmTensor<bf16, Chip, Cluster, Slice, m![A]> = rhs.to_dm(&mut ctx.tdma, 1 << 12);
// Sub context: load rhs into TRF (TrfAddress::Full dedicates the entire TRF to this tensor)
let rhs: TrfTensor<bf16, Chip, Cluster, Slice, Row, m![A]> = ctx
.sub
.begin(rhs.view())
.fetch::<bf16, Time, m![A]>()
.collect::<m![{Time}, A / 16], m![A % 16]>()
.to_trf(TrfAddress::Full);
// Main context: stream lhs through the Contraction Engine, reduce along A
let result: DmTensor<bf16, Chip, Cluster, Slice, m![1 # 8]> = ctx
.main
.begin(lhs.view())
.fetch::<bf16, Time, m![A]>()
.collect::<m![A / 16], m![A % 16]>()
// Pair consecutive 32-byte flits into 64-byte packets, halving time steps (A/16 → A/32)
.align::<m![A / 32], m![A % 32], _, _>(&rhs)
.contract::<m![1]>()
.accumulate::<m![1], m![1 # 8]>(AccumulationKind::Interleaved)
.cast::<bf16, m![1 # 16]>() // cast f32 accumulator output back to bf16
.commit::<m![1 # 8]>(1 << 13);
// DM → HBM
result.to_hbm(&mut ctx.tdma, 2 << 28)
}
The dot product reduces along a single axis with no redistribution needed, so the Switch Engine is skipped and collect() is called directly on the FetchTensor.
The pseudocode below describes this behavior:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
// Dot product: both operands reduce along A; no slice redistribution needed.
fn collect_dot_product<'l, const T: Tu>(
input: FetchTensor<'l, T, bf16, m![1], m![1], m![1 # 256], m![1], m![A]>,
) -> CollectTensor<'l, T, bf16, m![1], m![1], m![1 # 256], m![A / 16], m![A % 16]> {
input.collect()
}
}
GEMV
GEMV \(IJ, J \rightarrow I\) extends the dot product by requiring the Switch Engine (which redistributes data across slices between Fetch and Collect) to broadcast the vector across all I slices, so each slice can independently compute its row of the output \(y_i = \sum_j A_{ij} x_j\).
The reduced dimension J splits into Time (one iteration per tile) and Packet (elements within each tile).
The preserved output dimension I maps to Slice, distributing output elements across slices for spatial parallelism.
GEMV requires broadcasting the vector to all I slices, which the Switch Engine handles with Broadcast01.
The pseudocode below describes this behavior:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![I = 256, J = 2048];
// GEMV: broadcast the vector across all I slices.
fn switch_gemv<'l, const T: Tu>(
input: FetchTensor<'l, T, bf16, m![1], m![1], m![1 # 256], m![1], m![J]>,
) -> SwitchTensor<'l, T, bf16, m![1], m![1], m![I], m![1 # 256], m![J]> {
input.switch(SwitchConfig::Broadcast01 {
slice1: 256,
slice0: 1,
time0: 1,
})
}
}
#![allow(unused)]
fn main() {
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![I = 256, J = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
type Slice = m![I]; // Distribute output dimension across slices
type Time = m![J / 32]; // Temporal iterations for reduction dimension
type Packet = m![J % 32]; // Packet size for reduction dimension
type Row = m![1];
async fn run() {
let mut ctx = Context::acquire();
// Create matrix and vector on host
let mut rng = SmallRng::seed_from_u64(42);
let matrix = HostTensor::<bf16, m![I, J]>::rand(&mut rng);
let vector = HostTensor::<bf16, m![J]>::rand(&mut rng);
// Transfer to HBM
let matrix_hbm = matrix.to_hbm(&mut ctx.pdma, 0 << 28).await;
let vector_hbm = vector.to_hbm(&mut ctx.pdma, 1 << 28).await;
// Launch kernel
let out_hbm = launch(kernel, (&mut ctx, &matrix_hbm, &vector_hbm)).await;
// Transfer result back
// TODO(jeongmin.park): Consider adding a type annotation here.
let out = out_hbm.to_host::<m![I]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
matrix: &HbmTensor<bf16, Chip, m![I, J]>,
vector: &HbmTensor<bf16, Chip, m![J]>,
) -> HbmTensor<bf16, Chip, m![I]> {
// Move data from HBM to DM
let matrix: DmTensor<bf16, Chip, Cluster, Slice, m![J]> = matrix.to_dm(&mut ctx.tdma, 0);
let vector: DmTensor<bf16, Chip, Cluster, Slice, m![J]> = vector.to_dm(&mut ctx.tdma, 1 << 12);
// Load vector into TRF
// The Switch Engine automatically broadcasts the vector to all `I` slices
let vector_trf: TrfTensor<bf16, Chip, Cluster, Slice, Row, m![J]> = ctx
.sub
.begin(vector.view())
.fetch::<bf16, m![1], m![J]>()
// Collect Engine: split into 32-byte flits.
.collect::<m![J / 16], m![J % 16]>()
.to_trf(TrfAddress::Full);
// Compute GEMV: matrix × vector
// Key difference: `I` maps to slice (preserved), `J` gets reduced
let result: DmTensor<bf16, Chip, Cluster, Slice, m![1]> = ctx
.main
.begin(matrix.view())
.fetch::<bf16, Time, Packet>()
.collect::<Time, Packet>()
.align::<Time, Packet, _, _>(&vector_trf)
.contract::<m![1]>()
.accumulate::<m![1], m![1 # 8]>(AccumulationKind::Interleaved)
.cast::<bf16, m![1 # 16]>()
.commit(0);
// Transfer result to HBM
result.to_hbm(&mut ctx.tdma, 2 << 28)
}
}
GEMM
GEMM \(IK, KJ \rightarrow IJ\) computes \(C_{ij} = \sum_k A_{ik} B_{kj}\). Each matrix broadcasts along its missing output dimension: \(A\) broadcasts across \(J\), \(B\) broadcasts across \(I\). Then the shared dimension \(K\) is reduced.
The main change from the GEMV example is that two output dimensions \(I\) and \(J\) are jointly mapped to Slice, so each slice computes a 2D tile of the output matrix.
The slice mapping now covers both dimensions, and the contraction output preserves both.
This example introduces type Slice = m![I / 32, J / 32], which decomposes the two output dimensions jointly: the 256 slices form a 16 × 16 grid, and each slice computes a 32 × 32 output tile.
The Switch Engine distributes each tile of B to the matching slice, so each slice sees only its portion of J.
.contract::<m![1]>() reduces along K spatially, and .accumulate::<m![I % 32, J / 8 % 4], m![J % 8]>(AccumulationKind::Interleaved) accumulates over time, preserving both I and J in the output.
#![allow(unused)]
fn main() {
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![I = 512, J = 512, K = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
// Distribute output dimensions `I` and `J` across slices
type Slice = m![I / 32, J / 32]; // 256 slices in a 16 × 16 grid; each handles a 32 × 32 output tile
type Row = m![J % 8];
// Host code similar to previous examples:
// - Create matrix tensors A and B
// - Transfer to HBM
// - Launch kernel
// - Transfer result back
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
a: &HbmTensor<bf16, Chip, m![I, K]>,
b: &HbmTensor<bf16, Chip, m![K, J]>,
) -> HbmTensor<bf16, Chip, m![I, J]> {
// Move data from HBM to DM
let a: DmTensor<bf16, Chip, Cluster, Slice, m![I % 32, K]> = a.to_dm(&mut ctx.tdma, 0);
let b: DmTensor<bf16, Chip, Cluster, Slice, m![J % 32, K]> = b.to_dm(&mut ctx.tdma, 1 << 12);
// Load matrix B into TRF
// Switch Engine distributes B across 256 slices
// Each slice gets the full `K` dimension but only the `J` values for its 32 × 32 output tile
// See: Switch Engine topologies for details on distribution
let b_trf: TrfTensor<bf16, Chip, Cluster, Slice, Row, m![J / 8 % 4, K]> = ctx
.sub
.begin(b.view())
.fetch::<bf16, m![J % 8, J / 8 % 4], m![K]>()
.collect::<m![J % 8, J / 8 % 4, K / 16], m![K % 16]>()
.to_trf(TrfAddress::Full);
// Compute GEMM: A × B
// Switch Engine ensures matching (`I / 32`, `J / 32`) slice distribution
// Contraction reduces along `K`, preserves `I` and `J`
let result: DmTensor<bf16, Chip, Cluster, Slice, m![I % 32, J % 32]> = ctx
.main
.begin(a.view())
.fetch::<bf16, m![I % 32, J / 8 % 4], m![K]>()
.collect::<m![I % 32, J / 8 % 4, K / 16], m![K % 16]>()
.align::<m![I % 32, J / 8 % 4, K / 32], m![K % 32], _, _>(&b_trf)
.contract::<m![1]>()
.accumulate::<m![I % 32, J / 8 % 4], m![J % 8]>(AccumulationKind::Interleaved)
.cast::<bf16, m![J % 8 # 16]>()
.commit(0);
// Transfer result to HBM
result.to_hbm(&mut ctx.tdma, 2 << 28)
}
}
Blocked GEMM
Note
This section is a work in progress. A complete example extending GEMM with blocking (tiling) for matrices that exceed on-chip DM capacity — covering temporal partitioning over the `K` dimension and spatial partitioning that distributes `I` and `J` tiles across multiple chips — will be added in a future release.
Flash Attention
Note
This section is a work in progress. A complete flash attention example combining GEMM-style contraction, softmax (Vector Engine), and the multi-pass main/sub prefetch pattern across a full transformer attention head will be added in a future release.
Together, the five complete examples above demonstrate every major hardware engine in TCP: the DMA engines (HBM to DM), the Vector Engine (element-wise ops), the Contraction Engine (multiply-reduce), and the Switch Engine (data redistribution across slices). (The Blocked GEMM and Flash Attention sections are stubs and will demonstrate additional patterns once complete.)
Further Reading
The examples above process tensors that fit in a single hardware pass. Real workloads often exceed the 512 KB/slice DM capacity and require partitioning into tiles. The next chapters cover two complementary strategies: temporal partitioning, which processes tiles sequentially over time, and spatial partitioning, which distributes tiles across parallel hardware units.
Each construct introduced in this chapter is covered in depth in the reference chapters:
- `axes![]`, `m![]`, `HbmTensor`, `DmTensor` → Mapping Tensors
- `.to_dm()`, `.to_hbm()`, `.fetch()`, `.commit()` → Moving Tensors
- `.contract()`, `.accumulate()`, `.cast()`, `.switch()`, `.vector_fxp()` → Computing Tensors
- `ctx.main`, `ctx.sub`, `launch()` → Scheduling
- End-to-end kernels combining all of the above → Kernel Examples
Mapping Tensors
This chapter explains what mappings are, how to declare them in TCP’s Virtual ISA, and how to choose them for performance.
Layout and Performance
Tensors have no intrinsic order of elements. A mapping is a function from tensor indices to buffer positions, which defines the order in which elements are stored. When storing a tensor in hardware, you need to decide how they will be mapped into the flat buffer.
The choice of mapping matters because hardware reads memory in contiguous blocks: elements stored far apart require more memory transfers. For example, one can choose which axis is major (outermost, changes slowest) and which is minor (innermost, changes fastest, stored contiguously). Changing a layout after allocation requires copying and transposing data, so the mapping chosen at allocation time constrains all subsequent operations to match that layout.
Consider a tensor with axes H (height, 6 rows) and W (width, 8 columns). The same tensor admits different mappings, each with different performance characteristics.
| H\W | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | a | b | c | d | e | f | g | h |
| 1 | i | j | k | l | m | n | o | p |
| 2 | · | · | · | · | · | · | · | · |
| 3 | · | · | · | · | · | · | · | · |
| 4 | · | · | · | · | · | · | · | · |
| 5 | · | · | · | · | · | · | · | · |
H Major, W Minor
A scan along W is contiguous; a scan along H accesses one element per cache line.
| H=0 |  |  |  |  |  |  |  | H=1 |  |  |  |  |  |  |  | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | ... |
W Major, H Minor
A scan along H is contiguous; a scan along W accesses one element per cache line.
| W=0 |  |  |  |  |  | W=1 |  |  |  |  |  | W=2 |  |  |  |  |  | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | i | · | · | · | · | b | j | · | · | · | · | c | k | · | · | · | · | ... |
Either choice sacrifices spatial locality (the property that nearby elements are stored at nearby addresses) in one direction. Tiling achieves good locality along both axes by grouping nearby H and W indices into 2D tiles.
2×2 Tiles
All elements within a tile are contiguous.
| t(0,0) |  |  |  | t(0,1) |  |  |  | t(0,2) |  |  |  | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | i | j | c | d | k | l | e | f | m | n | ... |
W-minor layout is fast along W but slow along H; H-minor layout is the reverse; tiling gives balanced locality in both directions at the cost of more complex address calculation.
The outer dimension of a decomposition can become a hardware time loop; the inner dimension, a parallel lane.
Choosing a decomposition is how programmers control which dimensions execute sequentially and which execute in parallel.
TCP names these hardware dimensions Time (the sequential loop counter) and Packet (the parallel data lane width), used throughout this book; Memory and Stream explains how decompositions map to them.
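For example, the GEMV example from the previous chapter decomposes its reduced axis `J` exactly this way; a minimal sketch of that decomposition (the alias names mirror the GEMV kernel):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![J = 2048];
type Time = m![J / 32];   // 2048 / 32 = 64 sequential pipeline iterations
type Packet = m![J % 32]; // 32 elements delivered in parallel each iteration
}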
The Declarative Approach
Virtual ISA lets the programmer declare a mapping in terms of logical axes; the compiler derives physical placement, alignment, and hardware scheduling.
In the above example, the simplest form is m![H, W] for H-major and m![W, H] for W-major, where the leftmost axis is major and the rightmost is minor.
Decomposing axes further with / and % enables tiling, expressed as m![H / 2, W / 2, H % 2, W % 2]: the first two dimensions are the tile indices and the last two are positions within the tile.
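A minimal sketch of these three layouts for the 6 × 8 example above (the alias names are illustrative):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![H = 6, W = 8];
type HMajor = m![H, W];                      // scans along W are contiguous
type WMajor = m![W, H];                      // scans along H are contiguous
type Tiled = m![H / 2, W / 2, H % 2, W % 2]; // 2×2 tiles; tile (0, 0) holds {a, b, i, j}
}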
Declarative mappings offer three benefits:
- Expressiveness: Layout is stated in terms of logical axes (`m![H, W]`), not raw memory strides or offsets.
- Correctness: The compiler normalizes mapping expressions to canonical form and verifies them symbolically, turning layout properties into compile-time invariants.
- Portability: The same expression targets CPUs, GPUs, and TCPs without rewrites; the compiler derives hardware-specific placement from the axis description.
Mapping expressions describe a tensor at every stage of its life, not only when it is at rest in memory. The same tensor can be stored in HBM, loaded into DM with a different layout, and streamed through the pipeline as packets; each stage holds the same mathematical values under a different mapping.
This unified view treats data movement as preserving the mathematical tensor: moving a tensor between stages changes only its physical representation, not its values. The Tensor Functions page formalizes this perspective and shows how it makes data movement composable with computation in the same pipeline.
Mapping Expressions
A mapping expression defines where each tensor element sits in a buffer. This page covers the available mapping constructors and the equivalences between mappings.
Consider a tensor with axes![A = 8, B = 512].
The mapping expression m![A, B] places \(A\) as the major axis and \(B\) as the minor axis, requiring a buffer of 8 × 512 = 4096 elements.
Buffer position 0 holds \(\{A=0, B=0\}\), position 1 holds \(\{A=0, B=1\}\), and so on through all 512 elements where \(A=0\) before moving to \(A=1\).
Axis Sizes
The axes! macro declares axis identifiers and their sizes.
Throughout this section, assume the following axis sizes.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
}
Mapping Interface
A mapping expression like m![H, W] is a Rust type that describes how tensor indices map to buffer positions.
Every mapping expression implements the M trait, which provides the buffer size and a buffer-index-to-tensor-index mapping function:
#![allow(unused)]
fn main() {
// Inside `furiosa_visa_std::prelude`...
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::fmt::Debug;
/// Tensor index: a map from axis identifiers to coordinate values.
pub struct Index { /* ... */ }
/// Constructs tensor indices.
/// `i![A: 2, B: 3]` creates an `Index` with A = 2 and B = 3.
macro_rules! i {
() => {};
/* ... */
}
}
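The `M` trait itself is not reproduced above; the following is a minimal sketch inferred from the `impl M for ...` blocks later on this page, so the real trait in `furiosa_visa_std` may carry additional items:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
/// Sketch of the mapping interface, inferred from the implementations below.
trait M {
    /// Number of buffer positions this mapping addresses.
    const SIZE: usize;
    /// Value-level description of the mapping.
    fn to_value() -> Mapping;
    /// Maps a buffer index to its tensor index; `None` for padding or out-of-range positions.
    fn map(i: usize) -> Option<Index>;
}
}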
A mapping defines what mathematical tensor a buffer represents.
For example, HostTensor<bf16, m![A, B]> denotes a host memory buffer containing m![A, B]::SIZE elements of bf16 data, which is 4096 elements.
We say a buffer holds a tensor \(T\) when:
- For every buffer index `i` and tensor index `ti`,
- if `m![A, B]::map(i) = Some(ti)`,
- then the `i`-th element of the buffer stores the value of tensor \(T\) at index `ti`.
Constructors
Mapping expressions are built by composing small constructors, each of which transforms or combines simpler mappings.
These expressions use arithmetic-like operators (/, %, and # for padding) to concisely define the mapping between tensor and linear buffer indices.
Symbol
A symbol is a single uppercase letter whose size comes from the shape declaration.
The mapping m![A] maps 8 buffer indices linearly to tensor indices along the axis: buffer index 0 holds i![] (empty tensor index), index 1 holds i![A: 1], index 2 holds i![A: 2], and so on:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
type E = m![A]; // Symbol<Ident::A, 8>
#[test]
fn test_symbol() {
for i in 0..E::SIZE {
assert_eq!(E::map(i), Some(i![A: i]));
}
assert_eq!(E::map(E::SIZE), None);
}
}
// Trait implementation
impl<S: AxisName> M for Symbol<S> {
const SIZE: usize = S::SIZE;
fn to_value() -> Mapping {
Mapping::Symbol {
symbol: S::NAME,
size: S::SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < S::SIZE {
let mut index = Index::new();
Index::add_term(
&mut index,
Term {
inner: Atom::Symbol {
symbol: S::NAME,
size: S::SIZE,
},
stride: 1,
modulo: S::SIZE,
},
i,
);
Some(index)
} else {
None
}
}
}
Note
For every symbol `A`, the 0’th index `i![A: 0]` corresponds to the empty tensor index `i![]`.
Pair
One way to store a 2D tensor with shape \(\{A=8, B=512\}\) is the pair mapping m![A, B].
This creates a buffer of 4096 elements where A is the major axis and B is the minor axis.
The first 512 elements hold A = 0 and the next 512 elements hold A = 1.
Buffer index 519 holds i![A: 1, B: 7] since 519 == 512 * 1 + 7.
The mapping Pair<L, R> maps the Cartesian product of two spaces into a linear buffer where L is the major dimension and R is the minor dimension.
The size is L::SIZE * R::SIZE, and the mapping uses floor division and modulo to decompose indices.
m![A, B, C, D] expands to Pair<A, Pair<B, Pair<C, D>>> and is right-associative.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
type E = m![A, B]; // Pair<m![A], m![B]>
#[test]
fn test_pair() {
for i in 0..E::SIZE {
assert_eq!(E::map(i), Some(i![A: i / <m![B]>::SIZE, B: i % <m![B]>::SIZE]));
}
assert_eq!(E::map(2 * <m![B]>::SIZE + 7), Some(i![A: 2, B: 7]));
assert_eq!(E::map(E::SIZE), None);
}
}
// Trait implementation
impl<L, R> M for Pair<L, R>
where
L: M,
R: M,
{
const SIZE: usize = L::SIZE * R::SIZE;
fn to_value() -> Mapping {
Mapping::Pair {
left: RBox::new(L::to_value()),
right: RBox::new(R::to_value()),
}
}
fn map(i: usize) -> Option<Index> {
let mut l = L::map(i / R::SIZE)?;
let r = R::map(i % R::SIZE)?;
Index::add(&mut l, r);
Some(l)
}
}
Identity
The identity mapping m![1] creates a single-element buffer that maps buffer index 0 to the empty tensor index i![].
It serves as the identity element for Pair: m![1, A] and m![A, 1] are both equivalent to m![A].
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
type E = m![1]; // Identity
#[test]
fn test_identity() {
assert_eq!(E::map(0), Some(i![]));
assert_eq!(E::map(1), None);
}
}
// Trait implementation
impl M for Identity {
const SIZE: usize = 1;
fn to_value() -> Mapping {
Mapping::Identity
}
fn map(i: usize) -> Option<Index> {
if i == 0 { Some(Index::new()) } else { None }
}
}
Padding
Padding aligns data to hardware requirements by adding unused buffer space.
For example, the DMA engine requires rows to start on 64-byte boundaries.
With axes![C = 13, D = 61], m![C, D] creates misaligned rows, since 61 is not a multiple of 64.
m![C, D # 64] fixes this by aligning each row to 64-byte boundaries, using 3 extra elements per row.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![C = 13, D = 61];
type E = m![C, D # 64]; // Pair<m![C], Padding<m![D], 64>>
#[test]
fn test_padding() {
assert_eq!(E::map(0), Some(i![C: 0, D: 0]));
assert_eq!(E::map(60), Some(i![C: 0, D: 60]));
assert_eq!(E::map(61), None); // padding
assert_eq!(E::map(62), None); // padding
assert_eq!(E::map(63), None); // padding
assert_eq!(E::map(64), Some(i![C: 1, D: 0]));
}
}
// Trait implementation
impl<L, const SIZE: usize> M for Padding<L, SIZE>
where
L: M,
{
const SIZE: usize = SIZE;
fn to_value() -> Mapping {
Mapping::Padding {
inner: RBox::new(L::to_value()),
padding: SIZE,
kind: PaddingKind::Top,
}
}
fn map(i: usize) -> Option<Index> {
L::map(i)
}
}
Resize
Resize constrains a mapping to a smaller logical size by truncating indices beyond the new size, discarding elements outside that range. Unlike padding, which expands the buffer, Resize shrinks the logical view.
The mapping m![D = 2] takes only the first 2 elements of axis D, producing indices D = 0 and D = 1.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![C = 2, D = 3];
type E = m![C, D = 2]; // Pair<m![C], Resize<m![D], 2>>
#[test]
fn test_resize() {
assert_eq!(E::map(0), Some(i![C: 0, D: 0]));
assert_eq!(E::map(1), Some(i![C: 0, D: 1]));
assert_eq!(E::map(2), Some(i![C: 1, D: 0]));
assert_eq!(E::map(3), Some(i![C: 1, D: 1]));
assert_eq!(E::map(4), None);
}
}
// Trait implementation
impl<L, const SIZE: usize> M for Resize<L, SIZE>
where
L: M,
{
const SIZE: usize = SIZE;
fn to_value() -> Mapping {
Mapping::Resize {
inner: RBox::new(L::to_value()),
resize: SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < SIZE { L::map(i) } else { None }
}
}
Tiling is implemented through indexed views, pure metadata transformations without data copies.
The .tile() method extracts a tile by resizing one dimension to the tile size and offsetting into the buffer.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
fn tiles() {
let tensor = unsafe { HbmTensor::<bf16, m![1], m![A, B]>::from_addr(0) };
let view = tensor.view(); // HbmTensorView::<'_, bf16, m![1], m![A, B]>
let tile01 = view.tile::<m![B], 2, m![A, B = 2 # 4]>(0); // HbmTensorView::<'_, bf16, m![1], m![A, B = 2 # 4]>
let tile23 = view.tile::<m![B], 2, m![A, B = 2 # 4]>(2); // HbmTensorView::<'_, bf16, m![1], m![A, B = 2 # 4]>
}
}
The .tile() method takes three type parameters and one value parameter.
- The tile dimension `m![B]` specifies which dimension to divide along.
- The tile size `2` specifies the number of elements per tile.
- The tile mapping `m![A, B = 2 # 4]` defines the resulting view’s mapping. The mapping `B = 2 # 4` signifies that dimension `B` has a logical size of `2` within the view but exists within a physical footprint of `4`. This is essential for preserving the original memory layout and stride calculations.
- The starting index specifies which tile to extract. Passing `0` captures the range `0..2` for `tile01`, while passing `2` captures the range `2..4` for `tile23`.
Stride and Modulo
Stride (/) and modulo (%) decompose a single dimension into two: the outer (block index) and the inner (position within block).
Consider the 512-element axis B divided into 8 blocks of 64 elements each.
The mapping m![B / 64, B % 64] creates an 8 × 64 grid where the first dimension selects which block and the second dimension selects the position within that block.
Buffer index 130 corresponds to block 2 at position 2 within that block, giving tensor index B = 64 × 2 + 2 = 130, equal to the flat-buffer result (since m![B / 64, B % 64] is equivalent to m![B]):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
type D1 = m![B / 64]; // stride with size 8
type D2 = m![B % 64]; // modulo with size 64
type E = m![B / 64, B % 64]; // equivalent to `m![B]`
#[test]
fn test_stride_modulo() {
for i in 0..8 {
assert_eq!(D1::map(i), Some(i![B / 64: i]));
}
assert_eq!(D1::map(8), None);
for j in 0..64 {
assert_eq!(D2::map(j), Some(i![B % 64: j]));
}
assert_eq!(D2::map(64), None);
for i in 0..8 {
for j in 0..64 {
assert_eq!(
E::map(64 * i + j), // i![B / 64: i, B % 64: j]
<m![B]>::map(64 * i + j), // equivalent to above
);
}
}
assert_eq!(E::map(512), None);
}
}
// Trait implementation
impl<L, const SIZE: usize> M for Stride<L, SIZE>
where
L: M,
{
const SIZE: usize = {
assert!(L::SIZE % SIZE == 0, "Stride size must divide the original size");
L::SIZE / SIZE
};
fn to_value() -> Mapping {
Mapping::Stride {
inner: RBox::new(L::to_value()),
stride: SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < Self::SIZE { L::map(i * SIZE) } else { None }
}
}
impl<L, const SIZE: usize> M for Modulo<L, SIZE>
where
L: M,
{
const SIZE: usize = {
assert!(L::SIZE % SIZE == 0, "Modulo size must divide the original size");
SIZE
};
fn to_value() -> Mapping {
Mapping::Modulo {
inner: RBox::new(L::to_value()),
modulo: SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < Self::SIZE { L::map(i % L::SIZE) } else { None }
}
}
Together, m![B / 64, B % 64] transforms axis B into an 8 × 64 grid.
The mapping is equivalent to m![B] but expresses a different logical view of the same data, revealing block structure hidden in the flat representation.
Stride and modulo mappings can be visualized in tabular form. Consider the mapping m![B / 4, B % 4] with B::SIZE = 16. The following table shows how buffer indices are arranged: each row corresponds to a specific index of B / 4 (the stride axis), and each column corresponds to an index of B % 4 (the modulo axis):
| | `i![B % 4: 0]` | `i![B % 4: 1]` | `i![B % 4: 2]` | `i![B % 4: 3]` |
|---|---|---|---|---|
| `i![B / 4: 0]` | `i![B: 0]` | `i![B: 1]` | `i![B: 2]` | `i![B: 3]` |
| `i![B / 4: 1]` | `i![B: 4]` | `i![B: 5]` | `i![B: 6]` | `i![B: 7]` |
| `i![B / 4: 2]` | `i![B: 8]` | `i![B: 9]` | `i![B: 10]` | `i![B: 11]` |
| `i![B / 4: 3]` | `i![B: 12]` | `i![B: 13]` | `i![B: 14]` | `i![B: 15]` |
Stride and modulo factorize a single mapping into multiple dimensions.
The expression m![B / n] creates an outer dimension indexing blocks of size n.
The expression m![B % n] creates an inner dimension indexing positions within each block.
Modulo differs from resize in how it handles buffer size:
- Resize shrinks the buffer by truncating indices beyond the new size.
- Modulo preserves the original buffer size while partitioning it into equal-sized blocks.
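A minimal sketch contrasting the two on an axis large enough to show both (assuming the standalone `m![B = 64]` resize form and the `axes![B = 512]` declaration used on this page):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![B = 512];
type Shrunk = m![B = 64];          // Resize: keeps only B = 0..63, dropping the rest
type Blocked = m![B / 64, B % 64]; // Stride + modulo: all 512 elements, viewed as 8 blocks of 64
#[test]
fn test_resize_vs_modulo() {
    assert_eq!(Shrunk::SIZE, 64);
    assert_eq!(Blocked::SIZE, 512);
    assert_eq!(Shrunk::map(64), None);              // truncated beyond the new size
    assert_eq!(Blocked::map(64), <m![B]>::map(64)); // still maps to B = 64
}
}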
These operations can be nested for complex decompositions.
The following example splits B into three dimensions where the buffer’s bit layout differs from that of the tensor index.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
// B's bits: 6 - 8, 0 - 4, 5
// Values: 0 - 7, 0 - 31, 0 - 1
type E = m![B / 64, B % 32, B / 32 % 2];
#[test]
fn test_nested_stride() {
for i in 0..8 {
for j in 0..32 {
for k in 0..2 {
assert_eq!(
E::map(64 * i + 2 * j + k),
Some(i![B: 64 * i + j + 32 * k]),
);
}
}
}
assert_eq!(E::map(512), None);
}
}
The buffer index decomposes as 64 * i + 2 * j + k where i selects the block, j selects position within the block, and k selects the sub-block.
The tensor index B reconstructs as 64 * i + j + 32 * k, which rearranges the bit positions.
For example, buffer index 67 maps to B = 97:
- Buffer: `67 = 64 * 1 + 2 * 1 + 1` gives `i = 1, j = 1, k = 1`
- Tensor: `B = 64 * 1 + 1 + 32 * 1 = 97`
- Verify: `97 / 64 = 1`, `97 % 32 = 1`, `(97 / 32) % 2 = 1`
This kind of bit rearrangement maps naturally to hardware memory layouts where address bits are reordered for bank interleaving or cache efficiency.
In binary, this rearranges bit positions: buffer 001_00001_1 becomes B = 001_1_00001.
The buffer groups bits as [8:6]_[5:1]_[0] while B groups them as [8:6]_[5]_[4:0].
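The same regrouping can be checked with ordinary integer arithmetic; a minimal plain-Rust sketch (no Virtual ISA types involved):
fn main() {
    let buf: usize = 0b001_00001_1; // buffer index 67, grouped as [8:6]_[5:1]_[0]
    let (i, j, k) = (buf / 64, (buf % 64) / 2, buf % 2);
    let b = 64 * i + j + 32 * k;    // reconstruct the tensor index
    assert_eq!(b, 0b001_1_00001);   // B = 97, grouped as [8:6]_[5]_[4:0]
}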
Tiling can operate on blocks rather than individual elements.
The following example tiles by block using m![B / 32] and creates overlapping tiles:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
let tensor = unsafe { HbmTensor::<bf16, m![1], m![A, B]>::from_addr(0) };
for i in 0..15 {
let tile = tensor.view().tile::<m![B / 32], 2, m![A, B / 32 = 2 # 16, B % 32]>(i);
}
}
With B = 512, the dimension B / 32 has 16 blocks numbered 0-15.
Each tile takes 2 consecutive blocks starting at index i.
Tile 0 covers blocks {0, 1}, tile 1 covers blocks {1, 2}, and so on through tile 14 covering blocks {14, 15}.
These tiles overlap because consecutive tiles share one block: tiles 0 and 1 both include block 1.
The tile mapping B / 32 = 2 resizes the block dimension to 2 since each tile contains exactly 2 blocks.
When tiling with a single block, B / 32 = 1 simplifies to the identity m![1] since the dimension has only one value.
Escape
For complex mappings, define type aliases and reference them using { ... }.
With separate mappings L = m![A] and R = m![B], combining them as m![{ L }, { R }] produces the same result as m![A, B]:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
type L = m![A];
type R = m![B];
type E = m![{ L }, { R }]; // equivalent to `m![A, B]`
}
This escape syntax breaks down complex mappings into named, reusable components.
Advanced Constructors
Skewed Axis
A skewed axis creates a diagonal access pattern across two dimensions.
Skewed axes introduce derived axis labels defined by arithmetic differences between existing axes; for example, B' = B - A defines a new axis B' whose coordinate at any point equals B minus A.
Algorithms that process data along diagonals use this pattern, such as certain wavefront computations.
The expression m![A, B' = 4] with B' = B - A creates a mapping where each row is shifted relative to the previous one.
The = operator specifies the logical size after skewing. The result wraps around using modular arithmetic.
For example, with axes![A = 4, B = 4] and B' = B - A:
| (A, B’) | (A, B) |
|---|---|
| (0, 0) | (0, 0) |
| (0, 1) | (0, 1) |
| (0, 2) | (0, 2) |
| (0, 3) | (0, 3) |
| (1, 0) | (1, 1) |
| (1, 1) | (1, 2) |
| (1, 2) | (1, 3) |
| (1, 3) | (1, 0) |
When A = 1 and B' = 3, the original B coordinate wraps to 0 via modular arithmetic since B = (B' + A) % 4 = (3 + 1) % 4 = 0.
Indirect Sequencing
TODO: Document indirect sequencing patterns for non-contiguous memory access. This advanced constructor provides index-based memory access where the sequence of buffer positions is determined by an indirection table rather than a mathematical formula.
Sliding (Linear Combination)
Note
Linear combination expressions `$(e1:n1, ..., ed:nd)` combine multiple dimensions with specified strides. Formal definition: `size_S($(e1:n1, ..., ed:nd)) = 1 + sum_k((size_S(ek) - 1) * nk)`. The mapping `S, $(e1:n1, ..., ed:nd) |- si ~ ti` holds if there exist `si1...sid, ti1...tid` such that for all `k`: `S, ek |- sik ~ tik`, `si = sum_k(sik * nk)`, and `ti = sum_k(tik * nk)`.
Linear combinations can encode outer sum: `e1 * e2` is equivalent to `$(e1 : size_S(e2), e2 : 1)`. However, outer sum is preferred because it’s more resilient to axis reordering. Changing `e1 * e2` to `e2 * e1` doesn’t require manual stride updates.
Sliding operations access overlapping data blocks, essential for convolutional neural networks. Consider a buffer of 9 elements representing a tensor with shape \(\{N=5, F=3\}\) where each row is a 3-element slice that slides one element at a time. The tensor element at \((N, F)\) maps to buffer index \(N + 2F\):
$$ \begin{array}{c|ccc} & F=0 & F=1 & F=2 \\ \hline N=0 & 0 & 2 & 4 \\ N=1 & 1 & 3 & 5 \\ N=2 & 2 & 4 & 6 \\ N=3 & 3 & 5 & 7 \\ N=4 & 4 & 6 & 8 \\ \end{array} $$
Note
In this sliding pattern, a single space index can map to multiple tensor indices. For example, space index `4` maps to `{4_N}`, `{2_N, 1_F}`, and `{2_F}` simultaneously. This illustrates the non-one-to-one nature of `(S, e).maps(si, ti)`.
This can be expressed using a linear combination expression where the N axis has stride 1 and the F axis has stride 2, yielding a total size of 1 + (5-1)*1 + (3-1)*2 = 9.
Equivalent Mapping
Mappings E1 and E2 are equivalent when:
- `E1::SIZE == E2::SIZE`
- For every `i`, `E1::map(i) == E2::map(i)`
The equivalence relation is reflexive, symmetric, and transitive. Examples:
- Identity of pairs: for every `E`, `E` is equivalent both to `m![{ E }, 1]` and `m![1, { E }]`.
- Stride-modulo decomposition: for every `E` whose size `E::SIZE` is divisible by `n`, `E` and `m![{ E } / n, { E } % n]` are equivalent.
- Pair projection: for every `A` and `B`, `m![[{ A }, { B }] / B::SIZE]` is equivalent to `m![A]` and `m![[{ A }, { B }] % B::SIZE]` is equivalent to `m![B]`.
- Associativity of pairs: for every `E1`, `E2`, `E3`, `m![{ E1 }, { E2 }, { E3 }]`, `m![[{ E1 }, { E2 }], { E3 }]`, and `m![{ E1 }, [{ E2 }, { E3 }]]` are equivalent.
- Idempotent operations: for every `E`, `E` is equivalent to `m![{ E } / 1]`, to `m![{ E } # E::SIZE]`, and to `m![{ E } = E::SIZE]`.
- Modulo by 1: for every `E`, `m![E % 1]` is equivalent to the identity mapping `m![1]`.
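Because equivalence is defined pointwise, it can be checked mechanically; a minimal sketch testing the first two equivalences above (assuming `axes![A = 8]`):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
type E1 = m![A];
type E2 = m![A, 1];         // identity of pairs
type E3 = m![A / 2, A % 2]; // stride-modulo decomposition (8 is divisible by 2)
#[test]
fn test_equivalences() {
    assert_eq!(E1::SIZE, E2::SIZE);
    assert_eq!(E1::SIZE, E3::SIZE);
    for i in 0..E1::SIZE {
        assert_eq!(E1::map(i), E2::map(i));
        assert_eq!(E1::map(i), E3::map(i));
    }
}
}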
Memory and Stream
The Mapping Expressions page covered host tensors, which use a single flat buffer.
Device tensors extend host tensors in two directions: storage adds spatial dimensions (chips, clusters, slices) to match the hardware hierarchy, while data flowing through the Tensor Unit pipeline takes a streaming form with Time and Packet dimensions.
Both arise from the same hardware distinction between static storage and pipeline flow.
HBM and SRAM
(See Formal Definition at the end of this page for the precise buffer-to-tensor correspondence.)
Device memory has multiple levels, each with its own geometry. Each level is represented as a separate type parameter, enabling spatial parallelism:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
// Assumed throughout this page.
axes![A = 8, B = 512];
// HBM tensors
struct HbmTensor<D: Scalar, Chip: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Element)>,
}
// SRAM tensors
// DM (Data Memory), TRF (Tensor Register File), and VRF (Vector Register File)
struct DmTensor<D: Scalar, Chip: M, Cluster: M, Slice: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Cluster, Slice, Element)>,
}
struct TrfTensor<D: Scalar, Chip: M, Cluster: M, Slice: M, Row: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Cluster, Slice, Row, Element)>,
}
struct VrfTensor<D: Scalar, Chip: M, Cluster: M, Slice: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Cluster, Slice, Element)>,
}
}
HBM tensors distribute data across chips for spatial parallelism (processing different data elements simultaneously on different hardware units).
For example, HbmTensor<bf16, m![A], m![B]> distributes 8 × 512 = 4096 elements across 8 chips with 512 elements per chip.
The i-th chip’s j-th element stores tensor index i![A = i, B = j].
SRAM tensor types add Cluster and Slice dimensions for finer-grained parallelism.
TrfTensor additionally has a Row dimension that distributes weight data across the 8 MAC rows per slice.
See Contraction Engine for details.
These tensor types assume all units at each level share the same mapping. The type parameters directly mirror the device structure, avoiding complex address calculations that would arise from flattening multi-dimensional storage into linear indices.
Alignment Constraint
Alignment constraints apply to the Element dimension: the starting address must be a multiple of size_of::<D>().
This ensures natural alignment for maximum throughput.
The Chip, Cluster, and Slice dimensions have no additional alignment constraints.
Size Constraint
Each dimension must fit within hardware limits. Each chip has 256MB of SRAM: 2 clusters × 256 slices × 512KB per slice. An 8-chip system provides 2GB total SRAM capacity.
All device tensor types share the following spatial constraints:
| Unit | Count | Constraint | Padding Required |
|---|---|---|---|
| Chip | System-dependent | Chip::SIZE == NUM_CHIPS | m![1 # NUM_CHIPS] |
| Cluster | 2 / Chip | Cluster::SIZE == 2 | m![1 # 2] |
| Slice | 256 / Cluster | Slice::SIZE == 256 | m![X / N # 256] |
Note
These exact-match constraints are a current limitation: the runtime operates at chip granularity (`#[device(chip = N)]`), so partial chip or cluster usage is not yet supported. Use the `#` padding operator to fill unused positions. This may be relaxed in future releases.
The Element dimension varies by tensor type:
| Type | Unit | Constraint |
|---|---|---|
DmTensor | 512KB / Slice | Element::SIZE * size_of::<D>() <= 512KB |
TrfTensor | 8KB / Row | Row::SIZE <= 8, Element::SIZE * size_of::<D>() <= 8KB |
VrfTensor | 8KB / Slice | Element::SIZE * size_of::<D>() <= 8KB |
When a kernel uses fewer clusters than the hardware provides, the Cluster dimension is padded.
For example, a single-cluster kernel uses type Cluster = m![1 # 2], meaning 1 logical cluster padded to the hardware’s 2 clusters per chip.
A DmTensor<D, ..., Element> at address addr occupies addr..(addr + Element::SIZE * size_of::<D>()).
Tensor Unit Stream
While tensor data is stored in DM in a compact, storage-optimized layout, the Tensor Unit receives tensor data as streams of elements delivered over time.
The Packet dimension determines how many elements are delivered to the Tensor Unit in a single cycle.
Fetch Sequencers read DM data chunks and deliver a portion each clock cycle.
The Time dimension models this sequence of data delivery.
Unlike spatial dimensions that are constrained by hardware capacity, Time has no hardware-imposed size limit; it grows with the amount of data to process.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::ConstParamTy;
use std::marker::PhantomData;
axes![N = 4, C = 64, H = 32, W = 32];
/// Pipeline stage.
/// `Vector` is intentionally absent: the Vector Engine uses a separate typestate
/// (`VectorBranchTensor` and friends) that tracks branch, ALU, and other Vector-specific state.
/// `Commit` is intentionally absent: once the Commit Engine writes results back to DM,
/// the data is at rest and the type becomes `DmTensor`, not `StreamTensor`.
#[derive(PartialEq, Eq, ConstParamTy)]
enum Position {
Begin, // After the start of the pipeline
Fetch, // After the Fetch Engine
Switch, // After the Switch Engine
Collect, // After the Collect Engine
Contraction, // After the Contraction Engine
Reduce, // After the Reduce Engine
Cast, // After the Cast Engine
Transpose, // After the Transpose Engine
}
struct StreamTensor<
'l, // Lifetime tied to the Tensor Unit context
const P: Position,
D: Scalar,
Chip: M,
Cluster: M,
Slice: M,
Time: M,
Packet: M,
> {
/* ... */
_marker: PhantomData<&'l (D, Chip, Cluster, Slice, Time, Packet)>,
}
type T<'l> = StreamTensor<
'l,
{ Position::Fetch }, // Fetch Engine's output
bf16,
m![1], // Chip: single chip
m![1], // Cluster: single cluster
m![C / 2], // Slice: distribute 64 channels across 32 slices
m![N, H, W], // Time: iterate over batch (N) and spatial (H, W) dimensions
m![C % 2], // Packet: 2 channels per cycle
>;
}
Type T streams a tensor with an aggregate shape of \(\{N=4, C=64, H=32, W=32\}\) across 32 slices (Slice::SIZE = m![C / 2]::SIZE = 32).
The Time dimension (m![N, H, W]) has size 4 * 32 * 32 = 4096, which means there are 4096 temporal iterations or cycles.
Each cycle, the Packet dimension m![C % 2] delivers 2 channels to each slice.
Since 32 slices operate in parallel, each cycle processes 32 * 2 = 64 channels total.
Formal Definition
The following formalizes the buffer-to-tensor correspondence described above for multi-dimensional storage.
For an HBM tensor holding tensor \(T\), the correspondence is:
- For every chip index i, element index j, and corresponding tensor indices ti, tj:
  - if Chip::map(i) = Some(ti) and Element::map(j) = Some(tj),
  - then the i-th chip’s j-th element stores the value of tensor \(T\) at index ti ∪ tj (the union of the two partial tensor indices).
The same principle extends to SRAM tensors: for a DmTensor, the correspondence additionally requires matching Cluster::map and Slice::map indices, with the tensor index being the union of all four partial indices.
TrfTensor further adds a Row index.
Stream tensors add Time and Packet dimensions: the Time dimension indexes which cycle delivers the data, and the Packet dimension indexes elements within a single cycle.
Tensor Functions
The preceding pages showed that the same tensor can live in different memory tiers with different mapping expressions. To reason about operations independently of physical layout, TCP models hardware operations as abstract functions on mathematical tensors, focusing on what data they produce rather than how it is physically arranged.
The function elementwise_add implements the mathematical operation \(f(T_1, T_2) = T_1 + T_2\):
- For every tensor \(T_1\) and \(T_2\),
  - if lhs holds \(T_1\) and rhs holds \(T_2\),
  - then the return value holds \(T_1 + T_2\).
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
fn elementwise_add(
lhs: &HbmTensor<bf16, m![A], m![B]>,
rhs: &HbmTensor<bf16, m![A], m![B]>,
) -> HbmTensor<bf16, m![A], m![B]> {
// ... computes elementwise add ...
todo!("elementwise add lhs and rhs")
}
}
The same reasoning applies to data movement: moving a tensor from one memory tier to another is also a tensor function, one that preserves the mathematical content while changing the physical representation.
The .to_dm() method implements the identity function on the mathematical tensor (not the physical representation), copying tensor \(T\) from HBM to on-chip Data Memory:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
fn hbm_to_dm(
ctx: &mut Context,
hbm: &HbmTensor<bf16, m![A], m![B]>,
) -> DmTensor<bf16, m![A], m![1], m![B / 2], m![B % 2]> {
hbm.to_dm(&mut ctx.tdma, 1 << 16) // 64KB offset
}
}
Both the input HbmTensor and output DmTensor hold the same mathematical tensor \(T\), but in different memory tiers and with different mapping expressions.
This means correctness is defined at the tensor level: a function is correct if its output holds the right mathematical tensor, regardless of which mapping or memory tier is used.
Treating data movement as a function on tensors rather than as a copy of bytes makes it composable with compute operations in the same pipeline.
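As a sketch of this composability, the following hypothetical function chains the two examples above: it adds in HBM (body elided with todo!(), as before) and then moves the sum to DM. The function name and the 64KB offset are illustrative only.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
/// Hypothetical composition: add in HBM, then move the sum to DM.
/// Correctness is stated at the tensor level: the returned DmTensor holds T1 + T2.
fn add_then_to_dm(
    ctx: &mut Context,
    lhs: &HbmTensor<bf16, m![A], m![B]>,
    rhs: &HbmTensor<bf16, m![A], m![B]>,
) -> DmTensor<bf16, m![A], m![1], m![B / 2], m![B % 2]> {
    // Elementwise add in HBM (body elided, as in the example above).
    let sum: HbmTensor<bf16, m![A], m![B]> = todo!("elementwise add lhs and rhs");
    // Tier change as an identity function on the mathematical tensor.
    sum.to_dm(&mut ctx.tdma, 1 << 16)
}
}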
Moving Tensors
This chapter explains the three engines that move tensor data between memory tiers (HBM, DM, SPM) and the Tensor Unit: the Fetch Engine (DM to pipeline), the Commit Engine (pipeline to DM), and the DMA Engine (HBM/SPM to DM). Their APIs are designed around what you control: packet sizes, which engine moves each tensor, and how axes map to hardware dimensions. The compiler translates these declarations into low-level hardware concerns such as memory bank scheduling, stride calculation, and access alignment.
Device memory has two primary levels: off-chip HBM (High Bandwidth Memory) for high-capacity storage, and on-chip SRAM for low-latency working memory. SRAM is subdivided into DM (Data Memory) (the primary working memory), SPM (Scratchpad Memory) (a smaller high-speed buffer within each DM), TRF (Tensor Register File), and VRF (Vector Register File). Tensors are stored in these tiers in storage-optimized layouts. This chapter covers HBM, DM, and SPM (the tiers accessed by the DMA, Fetch, and Commit engines); TRF and VRF are loaded through the Tensor Unit pipeline and are covered in Computing Tensors.
flowchart TB
HBM[(HBM)] <--> DMA[DMA]
SPM[(SPM)] <--> DMA[DMA]
DMA <--> DM[(DM)]
subgraph TU[Tensor Unit]
direction TB
FE[Fetch] --> DOT1[...] --> CT[Contraction] --> VE[Vector] --> DOT2[...] --> CM[Commit]
end
DM -->|stream| FE
CM -->|stream| DM
click DMA "./dma-engine.html" "DMA Engine"
click FE "./fetch-engine.html" "Fetch Engine"
click CT "../computing-tensors/contraction-engine/index.html" "Contraction Engine"
click VE "../computing-tensors/vector-engine/index.html" "Vector Engine"
click CM "./commit-engine.html" "Commit Engine"
click TU "../computing-tensors/index.html" "Tensor Unit"
The Fetch engine converts DM storage layout into packet streams for the Tensor Unit; the Commit engine performs the reverse; the DMA engine converts between HBM and DM layouts. All three engines rely on Sequencers, a configuration abstraction that controls memory access patterns through nested-loop configurations, generating and consuming fixed-size packets for deterministic per-cycle transfers and aligned bank access. Memory Performance provides guidance on achieving optimal throughput.
Sequencers read DM at the Fetch Engine and write DM at the Commit Engine, converting between storage layout and stream format. The DMA Engine is a separate pipeline that moves data between HBM/SPM and DM independently of the Tensor Unit.
Sequencer
The Fetch and Commit Engines use sequencers to address DM; this page explains how sequencers work, including their configuration constraints and the failure cases that arise when those constraints are exceeded.
Sequencers convert between tensors in memory and packet streams: reading converts a memory buffer into a stream of packets, and writing performs the reverse.
Each packet is a fixed-size chunk delivered each clock cycle; its size is set by the Packet mapping dimension.
The DMA Engine chains a read and a write sequencer to move data between HBM/SPM and DM without intermediate buffers.
As a kernel writer, you control the Time and Packet type parameters, which determine iteration count and packet size; the compiler derives the register configuration and strides.
For performance implications of Packet choices, see Memory Performance.
Interface
To explain sequencer concepts, we use simplified types that capture the essential structure.
The actual API is introduced in later sections.
The read and write methods preserve tensor values while transforming between memory layout and stream format.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
/// A tensor in a linear buffer with mapping `Buf`.
struct BufTensor<D: Scalar, Buf: M> {
/* ... */
_marker: PhantomData<(D, Buf)>,
}
/// A tensor in motion.
/// - `'l`: Lifetime tied to the source buffer, ensuring the stream cannot outlive its data.
/// - `Time`: Temporal mapping (iteration over time).
/// - `Packet`: Spatial mapping (contents of a single packet).
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M> {
/* ... */
_marker: PhantomData<&'l (D, Time, Packet)>,
}
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
/// Reads a tensor from a linear buffer into a stream.
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> {
// hardware implementation
unimplemented!()
}
/// Writes a stream of packets back into a linear buffer.
fn write<'l, Time: M, Packet: M>(&mut self, stream: StreamTensor<'l, D, Time, Packet>) {
// hardware implementation
unimplemented!()
}
}
}
Examples
(The Configuration section below explains how the compiler derives these configurations from tensor mappings.)
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
fn write<'l, Time: M, Packet: M>(&mut self, stream: StreamTensor<'l, D, Time, Packet>) { let _ = stream; }
}
axes![A = 8, B = 512, N = 4, C = 3, H = 8, W = 8, T = 4, P = 4];
/// Strided access: read 8×512 tensor as 128 packets of 32 elements.
/// Time = m![A, B / 32] produces 8 * 16 = 128 time steps.
/// Packet = m![B % 32] delivers 32 consecutive elements per packet.
fn strided_read<'l>(
buf: &'l BufTensor<bf16, m![A, B]>,
) -> StreamTensor<'l, bf16, m![A, B / 32], m![B % 32]> {
buf.read() // Automatic type inference
}
/// Strided write: write 128 packets of 32 elements back to 8×512 tensor.
fn strided_write(
buf: &mut BufTensor<bf16, m![A, B]>,
stream: StreamTensor<bf16, m![A, B / 32], m![B % 32]>,
) {
buf.write(stream)
}
/// Axis reordering read: change traversal from [N, C, H, W] to [W, H, C, N].
/// Time = m![W, H, C, N] iterates in reversed axis order.
/// Packet = m![1] delivers single-element packets.
fn axis_reordering_read<'l>(
buf: &'l BufTensor<bf16, m![N, C, H, W]>,
) -> StreamTensor<'l, bf16, m![W, H, C, N], m![1]> {
buf.read()
}
/// Axis reordering write: write [W, H, C, N] stream back to [N, C, H, W] buffer.
fn axis_reordering_write(
buf: &mut BufTensor<bf16, m![N, C, H, W]>,
stream: StreamTensor<bf16, m![W, H, C, N], m![1]>,
) {
buf.write(stream)
}
/// Tiling read: break axes into sub-blocks for cache efficiency.
/// Time = m![A % 2, B % 4, A / 2, B / 4] tiles A into 2 × 4, B into 4 × 128 blocks.
/// Packet = m![C # 32] pads C to 32 elements per packet.
fn tiling_read<'l>(
buf: &'l BufTensor<i8, m![A, B, C # 8]>,
) -> StreamTensor<'l, i8, m![A % 2, B % 4, A / 2, B / 4], m![C # 32]> {
buf.read()
}
/// Tiling write: write tiled stream back to buffer.
fn tiling_write(
buf: &mut BufTensor<i8, m![A, B, C # 8]>,
stream: StreamTensor<i8, m![A % 2, B % 4, A / 2, B / 4], m![C # 32]>,
) {
buf.write(stream)
}
/// Broadcasting read: replicate elements using stride 0.
/// Time = m![T, A] broadcasts T temporally (same data repeated T times).
/// Packet = m![P] broadcasts P spatially (same element fills packet).
fn broadcasting_read<'l>(
buf: &'l BufTensor<i8, m![A]>,
) -> StreamTensor<'l, i8, m![T, A], m![P]> {
buf.read()
}
/// Broadcasting write: write broadcast stream back to buffer.
fn broadcasting_write(
buf: &mut BufTensor<i8, m![A]>,
stream: StreamTensor<i8, m![T, A], m![P]>,
) {
buf.write(stream)
}
}
Configuration
The compiler translates input and output tensor mappings into nested-loop configurations that the sequencer hardware executes.
Each configuration has the form [size_0 : stride_0, size_1 : stride_1, ...] : packet_size, where each entry’s subscript corresponds to its position in the loop nest (0 = outermost), represented by the following Rust type:
#![allow(unused)]
fn main() {
struct Config {
/// Each entry defines a nested loop level.
entries: Vec<Entry>,
/// Number of elements per hardware fetch.
packet_size: usize,
}
struct Entry {
/// Number of iterations for this loop level.
size: usize,
/// Memory address distance (in elements) to skip after each iteration.
stride: isize,
}
}
Each entry encodes one dimension of tensor traversal.
The size field determines how many times this loop iterates, while the stride field determines the memory offset between consecutive iterations.
Together, entries form nested loops that traverse memory.
Example: [N, C, H, W] ↔ [W, H, C, N]
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
fn write<'l, Time: M, Packet: M>(&mut self, stream: StreamTensor<'l, D, Time, Packet>) { let _ = stream; }
}
struct Config {
entries: Vec<Entry>,
packet_size: usize,
}
struct Entry {
size: usize,
stride: isize,
}
axes![N = 4, C = 3, H = 8, W = 8];
fn read_nchw_whcn(buf: &BufTensor<bf16, m![N, C, H, W]>) ->
StreamTensor<bf16, m![W, H, C, N], m![1]> {
// Compiler-generated configuration: [8 : 1, 8 : 8, 3 : 64, 4 : 192] : 1
let config = Config {
entries: vec![
Entry { size: 8, stride: 1 }, // W
Entry { size: 8, stride: 8 }, // H
Entry { size: 3, stride: 64 }, // C
Entry { size: 4, stride: 192 }, // N
],
packet_size: 1,
};
// The hardware executes the configuration as nested loops:
for w in 0..8 {
for h in 0..8 {
for c in 0..3 {
for n in 0..4 {
// Read each address
let addr = 1 * w + 8 * h + 64 * c + 192 * n;
// yield buf[addr];
}
}
}
}
buf.read()
}
fn write_whcn_nchw(buf: &mut BufTensor<bf16, m![N, C, H, W]>,
stream: StreamTensor<bf16, m![W, H, C, N], m![1]>) {
// The compiler generates an identical config for writing
// The hardware executes the configuration as nested loops:
for w in 0..8 {
for h in 0..8 {
for c in 0..3 {
for n in 0..4 {
// Write to each address
let addr = 1 * w + 8 * h + 64 * c + 192 * n;
// buf[addr] = stream.next();
}
}
}
}
}
}
The Time dimension represents logical iteration steps, not physical clock cycles.
The Packet dimension represents the logical unit of data processed per Time step.
The hardware computes fetch_size to determine the minimum number of fetch cycles required (see Fetch Engine for the constraints of fetch_size).
Configuration Examples
The compiler automatically derives configurations required to traverse memory. The following examples illustrate common patterns.
Rearranging Axes
Rearranging axes changes the traversal order of the tensor.
When Time specifies a different axis order than Buf, the compiler computes strides to traverse memory in the requested order.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 8, B = 8, C = 8];
fn read_rearranging<'l>(
buf: &'l BufTensor<i8, m![A, B, C # 32]>, // Buf
) -> StreamTensor<'l, i8, m![B, A], m![C # 16]> { // Time, Packet
// Compiler-generated configuration: [
// B -> 8 : 32,
// A -> 8 : 256,
// C # 16 -> 16 : 1,
// ] : 16
buf.read()
}
}
The compiler generates configuration entries by processing the combined mapping m![B, A, C # 16] term by term, transforming Buf along the way.
For each term, the entry size equals the term size, and the stride equals the volume that term occupies within the current Buf.
After processing a term, Buf is updated to reflect that the axis has been consumed:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
B | 8 : 32 | m![C # 32]::SIZE | m![A, 1 # 8, C # 32] |
A | 8 : 256 | m![1 # 8, C # 32]::SIZE | m![1 # 64, C # 32] |
C # 16 | 16 : 1 | contiguous (Packet dimension) | 1 # 2048 |
The maximum fetch_size is 16.
The innermost entry 16 : 1 has stride 1, making elements contiguous within the packet.
Splitting Axes
Splitting axes enables tiling by breaking logical axes into multiple entries. This is useful for cache efficiency or matching tensor unit buffer sizes.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 8, B = 8, C = 4];
fn read_splitting<'l>(
buf: &'l BufTensor<i8, m![A, B, C # 8]>, // Buf
) -> StreamTensor<'l, i8, m![A % 2, B % 4, A / 2, B / 4], m![C # 32]> { // Time, Packet
// The compiler generates: [
// A % 2 -> 2 : 64,
// B % 4 -> 4 : 8,
// A / 2 -> 4 : 128,
// B / 4 -> 2 : 32,
// C # 32 -> 32 : 1,
// ] : 32
buf.read()
}
}
Expressions like A % 2 and A / 2 split axis A into separate entries.
The compiler processes m![A % 2, B % 4, A / 2, B / 4, C # 32] term by term:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
A % 2 | 2 : 64 | m![B, C # 8]::SIZE | m![A / 2, 1 # 2, B, C # 8] |
B % 4 | 4 : 8 | m![C # 8]::SIZE | m![A / 2, 1 # 2, B / 4, 1 # 4, C # 8] |
A / 2 | 4 : 128 | m![1 # 2, B / 4, 1 # 4, C # 8]::SIZE | m![1 # 8, B / 4, 1 # 4, C # 8] |
B / 4 | 2 : 32 | m![1 # 4, C # 8]::SIZE | m![1 # 64, C # 8] |
C # 32 | 32 : 1 | contiguous (Packet dimension) | 1 # 512 |
The maximum fetch_size is 32.
Slicing Axes
Slicing reads only a partial range of indices from the memory layout. This arises from indexed views that select subsets of the original tensor.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 16, B = 8, C = 8];
fn read_slicing<'l>(
buf: &'l BufTensor<i8, m![A, B, C]>, // Buf
) -> StreamTensor<'l, i8, m![A / 4, A % 4 = 3, B / 4, B % 4 = 2], m![C]> { // Time, Packet
// Compiler-generated configuration: [
// A / 4 -> 4 : 256,
// A % 4 = 3 -> 3 : 64,
// B / 4 -> 2 : 32,
// B % 4 = 2 -> 2 : 8,
// C -> 8 : 1,
// ] : 8
buf.read()
}
}
The = 3 notation limits A % 4 to only 3 iterations instead of 4, restricting the hardware to a sub-region of the tensor.
The compiler processes m![A / 4, A % 4 = 3, B / 4, B % 4 = 2, C] term by term:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
A / 4 | 4 : 256 | m![A % 4, B, C]::SIZE | m![1 # 4, A % 4, B, C] |
A % 4 = 3 | 3 : 64 | m![B, C]::SIZE (sliced to 3) | m![1 # 16, B, C] |
B / 4 | 2 : 32 | m![B % 4, C]::SIZE | m![1 # 32, B % 4, C] |
B % 4 = 2 | 2 : 8 | m![C]::SIZE (sliced to 2) | m![1 # 128, C] |
C | 8 : 1 | contiguous (Packet dimension) | 1 # 1024 |
The maximum fetch_size is 8.
Broadcasting Axes
Broadcasting replicates tensor elements across multiple packets or time steps. Stride 0 causes the hardware to repeatedly read from the same memory location.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 16, T = 4, P = 4];
fn read_broadcasting<'l>(
buf: &'l BufTensor<i8, m![A]>, // Buf
) -> StreamTensor<'l, i8, m![T, A], m![P]> { // Time, Packet
// Compiler-generated configuration: [
// T -> 4 : 0, // temporal broadcast
// A -> 16 : 1,
// P -> 4 : 0, // spatial broadcast
// ] : 4
buf.read()
}
}
Axes not present in Buf get stride 0.
The compiler processes m![T, A, P] term by term:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
T | 4 : 0 | not in Buf (broadcast) | m![A] |
A | 16 : 1 | A in m![A] | 1 # 16 |
P | 4 : 0 | not in Buf (broadcast) | 1 # 16 |
The maximum fetch_size is 4.
Since P has stride 0, the same element is replicated across the packet (spatial broadcast).
Since T has stride 0, the same data is repeated across time steps (temporal broadcast).
Merging Entries
When a transformation produces more than 8 entries, the compiler merges adjacent entries to meet hardware limits.
Adjacent entries (n1 : s1) and (n2 : s2) merge into (n1 * n2 : s2) when physically contiguous: s1 == n2 * s2.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![N = 8, C = 8, H = 8, W = 32];
fn read_merging<'l>(
buf: &'l BufTensor<i8, m![N, C, H, W]>, // Buf
) -> StreamTensor<'l, i8, m![W / 16, H % 2, H / 2, C / 2, C % 2, N / 2, N % 2, W / 8 % 2], m![W % 8]> { // Time, Packet
// Initial 9 entries:
// W / 16 -> 2 : 16,
// H % 2 -> 2 : 32,
// H / 2 -> 4 : 64,
// C / 2 -> 4 : 512,
// C % 2 -> 2 : 256,
// N / 2 -> 4 : 4096,
// N % 2 -> 2 : 2048,
// W / 8 % 2 -> 2 : 8,
// W % 8 -> 8 : 1,
// After merging to 6 entries:
// W / 16 -> 2 : 16,
// H % 2 -> 2 : 32,
// H / 2 -> 4 : 64,
// C -> 8 : 256, // merged C / 2 and C % 2
// N -> 8 : 2048, // merged N / 2 and N % 2
// W % 16 -> 16 : 1, // merged W / 8 % 2 and W % 8
buf.read()
}
}
The compiler processes m![W / 16, H % 2, H / 2, C / 2, C % 2, N / 2, N % 2, W / 8 % 2, W % 8] term by term, producing 9 initial entries:
| Term | Entry | Stride Source |
|---|---|---|
W / 16 | 2 : 16 | m![W % 16]::SIZE |
H % 2 | 2 : 32 | m![W]::SIZE |
H / 2 | 4 : 64 | m![H % 2, W]::SIZE |
C / 2 | 4 : 512 | m![C % 2, H, W]::SIZE |
C % 2 | 2 : 256 | m![H, W]::SIZE |
N / 2 | 4 : 4096 | m![N % 2, C, H, W]::SIZE |
N % 2 | 2 : 2048 | m![C, H, W]::SIZE |
W / 8 % 2 | 2 : 8 | m![W % 8]::SIZE |
W % 8 | 8 : 1 | contiguous (packet dimension) |
Since 9 entries exceed the hardware limit of 8, the compiler merges contiguous pairs where s1 == n2 * s2.
The entries for H % 2 -> (2 : 32) and H / 2 -> (4 : 64) are not merged because they are not physically contiguous (\(s_1 = 32 \neq n_2 \times s_2 = 4 \times 64 = 256\)).
The final configuration has 6 entries.
The last merge combines a temporal entry with the packet entry, increasing the packet size from 8 to 16.
| Term | Entry | Merged Entries |
|---|---|---|
W / 16 | 2 : 16 | |
H % 2 | 2 : 32 | |
H / 2 | 4 : 64 | |
C | 8 : 256 | C / 2 (4 : 512),C % 2 (2 : 256) |
N | 8 : 2048 | N / 2 (4 : 4096),N % 2 (2 : 2048) |
W % 16 | 16 : 1 | W / 8 % 2 (2 : 8),W % 8 (8 : 1) |
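To make the rule concrete, the following is an illustrative plain-Rust model of the merge step (not the compiler's implementation), reusing the Entry struct from the Configuration section. Applied to the nine entries above, it reaches the same six-entry result.
#![allow(unused)]
fn main() {
#[derive(Clone, Copy)]
struct Entry {
    size: usize,
    stride: isize,
}
/// Illustrative model of the merging rule above: repeatedly merge any adjacent
/// pair (outer n1 : s1, inner n2 : s2) with s1 == n2 * s2 into (n1 * n2 : s2).
/// Merging into the innermost (packet) entry increases the packet size accordingly.
fn merge_entries(mut entries: Vec<Entry>) -> Vec<Entry> {
    loop {
        let mergeable = (0..entries.len().saturating_sub(1)).find(|&i| {
            entries[i].stride == entries[i + 1].size as isize * entries[i + 1].stride
        });
        match mergeable {
            Some(i) => {
                let inner = entries[i + 1];
                entries[i].size *= inner.size;
                entries[i].stride = inner.stride;
                entries.remove(i + 1);
            }
            None => return entries,
        }
    }
}
// The nine entries from the example above collapse to six, within the 8-entry limit.
let entries = vec![
    Entry { size: 2, stride: 16 },   // W / 16
    Entry { size: 2, stride: 32 },   // H % 2
    Entry { size: 4, stride: 64 },   // H / 2
    Entry { size: 4, stride: 512 },  // C / 2
    Entry { size: 2, stride: 256 },  // C % 2
    Entry { size: 4, stride: 4096 }, // N / 2
    Entry { size: 2, stride: 2048 }, // N % 2
    Entry { size: 2, stride: 8 },    // W / 8 % 2
    Entry { size: 8, stride: 1 },    // W % 8
];
assert_eq!(merge_entries(entries).len(), 6);
}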
Configuration Failures
In TCP, violating any of the following limits causes a compilation error:
- Entry limit: Maximum 8 entries; the compiler merges adjacent entries where possible (see Merging Entries in Configuration Examples).
- Iteration limit: size <= 65,536 per entry.
- Packet size: Must be 1, 2, 4, 8, 16, or 32 bytes.
- Packet fetch: The innermost entry n : s must satisfy one of:
  - Contiguous access (adjacent elements): (s == 0 || s == 1) && n % packet_size == 0
  - Discrete access (single-element packets): packet_size == 1
If merging fails or limits are exceeded, redesign the tensor mapping or split the operation across multiple sequencer calls. The following examples illustrate common failure cases.
Insufficient Input
The temporal mapping Time attempts to iterate over indices that do not exist in Buf.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![N = 2048];
fn read_insufficient<'l>(
buf: &'l BufTensor<i8, m![N % 512]>, // Buf
) -> StreamTensor<'l, i8, m![N / 512], m![N % 512]> { // Time, Packet
buf.read() // Compilation error: insufficient input
}
}
Time requires N / 512, but Buf only contains N % 512.
The buffer does not have the indices that the temporal mapping tries to iterate over.
Incompatible Shapes
The buffer and stream mappings have the same total size but different mathematical structures that no single sequencer configuration can reconcile.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 15];
fn read_incompatible<'l>(
buf: &'l BufTensor<i8, m![A % 5, A / 5]>, // Buf
) -> StreamTensor<'l, i8, m![1], m![A % 3, A / 3]> { // Time, Packet
buf.read() // Compilation error: incompatible shapes
}
}
Both Buf and Packet represent 15 elements, but their internal index mappings differ.
The buffer uses a base-5 decomposition (A % 5, A / 5) while the packet uses a base-3 decomposition (A % 3, A / 3).
These are mathematically incompatible: there is no way to traverse memory in one pattern to produce the other.
The compiler detects this using a “factorizable” concept: it attempts to decompose axes with the same label into (inner, intersection, outer) components to find a common representation.
When no such factorization exists, the configuration is rejected.
Indirect Access
Standard sequencer entries use fixed strides: the memory offset between iterations is constant.
IndirectLoop extends this by allowing variable offsets per iteration, enabling gather operations with data-dependent access patterns.
The standard pattern (limit, stride) becomes (limit, [offset0, offset1, ...]), where each iteration uses a different offset from the provided sequence.
This supports operations like embedding lookups where indices are determined at runtime.
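A plain-Rust model of the difference is sketched below. The types and names are illustrative only and are not part of the Virtual ISA API; the sketch just expands each loop level into the sequence of address offsets it visits.
#![allow(unused)]
fn main() {
/// Illustrative model only; the actual IndirectLoop configuration belongs to
/// the sequencer hardware interface, not this Rust type.
enum Loop {
    /// Standard entry: `limit` iterations, constant `stride` between them.
    Strided { limit: usize, stride: isize },
    /// Indirect entry: `offsets.len()` iterations, each using its own offset.
    Indirect { offsets: Vec<isize> },
}
/// Expands one loop level into the sequence of address offsets it visits.
fn offsets(level: &Loop) -> Vec<isize> {
    match level {
        Loop::Strided { limit, stride } => (0..*limit as isize).map(|i| i * stride).collect(),
        Loop::Indirect { offsets } => offsets.clone(),
    }
}
// A strided level visits 0, 8, 16, 24; an indirect level can gather
// arbitrary, runtime-determined positions such as embedding-table rows.
assert_eq!(offsets(&Loop::Strided { limit: 4, stride: 8 }), vec![0, 8, 16, 24]);
assert_eq!(offsets(&Loop::Indirect { offsets: vec![40, 0, 24, 0] }), vec![40, 0, 24, 0]);
}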
Fetch Engine
The Tensor Unit is a pipeline of engines (Switch, Collect, Contraction, Vector, Cast, Transpose, Commit) that processes tensor data. It cannot operate directly on tensors stored in DM: data must first be converted into a stream of fixed-size packets that flow through the compute pipeline. The Fetch Engine performs this conversion: it reads tensor data from DM and produces packet streams for the rest of the Tensor Unit.
The Fetch Engine operates in two stages:
- Fetch Sequencer: Reads tensor data from DM using nested-loop configurations that define access patterns.
- Fetch Adapter: Post-processes streams through masking, type conversion, and batching to produce computation-ready packets.
Additional sections cover the interaction with the Switch Engine and performance guidelines.
As a kernel writer, you control the Time and Packet type parameters of .fetch(), which determine packet size and iteration count.
The compiler derives the sequencer loop configuration and stride calculation.
For performance implications of Packet choices, see Memory Performance.
Interface
The Fetch Engine implements a logical tensor move from DM to tensor streams:
impl<'l, const T: Tu, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
BeginTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Performs fetch operation to create a fetched tensor.
#[primitive(BeginTensor::fetch)]
pub fn fetch<D2: Scalar, Time2: M, Packet2: M>(self) -> FetchTensor<'l, T, D2, Chip, Cluster, Slice, Time2, Packet2>
where
D: FetchCast<D2>,
{
assert_eq!(Cluster::SIZE, 2, "Cluster size must be 2, got {}", Cluster::SIZE);
assert_eq!(Slice::SIZE, 256, "Slice size must be 256, got {}", Slice::SIZE);
let packet_bytes = D2::size_in_bytes_from_length(Packet2::SIZE);
assert_eq!(
packet_bytes % FETCH_ALIGN_BYTES,
0,
"Fetch output packet must be {FETCH_ALIGN_BYTES}-byte aligned, got {packet_bytes} bytes.",
);
FetchTensor::new(self.ctx, self.inner.map(|v| v.map(|v| v.cast())).transpose(true))
}
}
The resulting FetchTensor represents a stream of packets flowing through the Tensor Unit pipeline.
The mapping [Chip, Cluster, Slice, Time, Packet] distributes data across hardware and time.
Time2 represents the temporal iteration mapping, while Packet2 is the packet shape within each cycle.
The output packet size must be 8-byte aligned (a multiple of the fetch sequencer’s read granularity).
The output type D2 supports type casting (such as i8 to i32).
Notice that .fetch() does not have an output parameter for Slice: each slice independently reads its own DM data, so the Slice mapping is inherited unchanged from the input BeginTensor.
To redistribute data across slices, use the Switch Engine.
Examples
A matrix stored as 8-bit integers needs conversion to 32-bit integers for computation:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, B = 32];
/// Fetches matrix data from DM, casting i8 to i32.
fn fetch_matrix_example<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A, B]>,
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![A], m![B]> {
input.fetch()
}
}
The input BeginTensor represents data in DM.
The output FetchTensor represents a packet stream: 512 rows over time, each containing 32 i32 (128 bytes).
The compiler automatically configures the sequencer and adapter from the output type signature.
For the StreamTensor type hierarchy across pipeline stages, see Memory and Stream.
Fetch Sequencer
The Fetch Sequencer defines the memory access pattern: which addresses to read, in what order, and how to package the data into packets. Each slice executes its own sequencer independently, enabling parallel data movement.
Sequencers typically use homogeneous configurations: each slice processes the same pattern on its local data partition. The hardware also supports heterogeneous configurations where different slices execute different access patterns simultaneously.
Constraints
The sequencer has physical limits that must be respected:
- Chip::SIZE \(=\) number of chips in the system.
- Cluster::SIZE \(=\) 2 (clusters per chip).
- Slice::SIZE \(=\) 256 (slices per cluster).
Note
These exact-match constraints are a current limitation: the runtime operates at chip granularity (#[device(chip = N)]), so partial chip or cluster usage is not yet supported. Use the # padding operator (e.g., m![1 # 2] for a single logical cluster) to fill unused positions. This may be relaxed in future releases.
- Fetch addresses must be 1-byte aligned (minimal constraint), but 8-byte alignment is required for certain DMA operations.
- Depending on the context, the fetch_size is restricted to certain values:
  - The main-context supports a fetch_size of 1, 2, 4, 8, 16, or 32 bytes (see main-context).
  - The sub-context supports a fetch_size of 4 bytes (when casting from i4 to i32), or 8 bytes (otherwise) (see sub-context).
- fetch_size is determined as the largest supported divisor of gcd(packet_size, contiguous_sram_access_size):
  - fetch_size must divide packet_size because data fetched in a single memory read cannot be split across different packets.
  - fetch_size must divide contiguous_sram_access_size because a single memory fetch can only read physically contiguous data.
The contiguous_sram_access_size represents the total byte size of contiguous elements in memory that can be accessed without stride discontinuities. It is derived from the sequencer configuration by multiplying the sizes of consecutive physically contiguous entries, from the innermost to the outermost level. Two adjacent entries—an outer (n1 : s1) and an inner (n2 : s2)—are physically contiguous if s1 == n2 * s2.
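To make the derivation concrete, here is a small illustrative model (not the compiler's implementation) of contiguous_sram_access_size and fetch_size, reusing the Entry struct from the Configuration section. The helper names and the 1-byte element size are assumptions for the sketch; the values reproduce the three_axes_non_contiguous example below.
#![allow(unused)]
fn main() {
struct Entry {
    size: usize,
    stride: isize,
}
/// Total bytes of physically contiguous access, found by walking entries from
/// the innermost (last) level outward while adjacency holds: s1 == n2 * s2.
/// `elem_bytes` is the element size in bytes.
fn contiguous_sram_access_size(entries: &[Entry], elem_bytes: usize) -> usize {
    let mut contiguous = 1usize;
    for pair in entries.windows(2).rev() {
        let (outer, inner) = (&pair[0], &pair[1]);
        contiguous *= inner.size;
        if outer.stride != inner.size as isize * inner.stride {
            return contiguous * elem_bytes;
        }
    }
    contiguous * entries.first().map_or(1, |e| e.size) * elem_bytes
}
/// Largest supported divisor of gcd(packet_size, contiguous_sram_access_size).
fn fetch_size(packet_size: usize, contiguous: usize) -> usize {
    let supported = [32, 16, 8, 4, 2, 1];
    let gcd = |mut a: usize, mut b: usize| { while b != 0 { let t = a % b; a = b; b = t; } a };
    let g = gcd(packet_size, contiguous);
    *supported.iter().find(|&&s| g % s == 0).unwrap()
}
// three_axes_non_contiguous below: entries [3:32, 4:96, 4:8, 8:1] with i8 elements.
// Only H (4 : 8) and W (8 : 1) are contiguous, so the contiguous size is 8 * 4 = 32 bytes.
let entries = [
    Entry { size: 3, stride: 32 },
    Entry { size: 4, stride: 96 },
    Entry { size: 4, stride: 8 },
    Entry { size: 8, stride: 1 },
];
assert_eq!(contiguous_sram_access_size(&entries, 1), 32);
assert_eq!(fetch_size(8, 32), 8);
}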
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![N = 4, C = 3, H = 4, W = 8];
// Compiler-generated configuration: [
// N -> 4 : 96, (96 == 3 * 32, contiguous)
// C -> 3 : 32, (32 == 4 * 8, contiguous)
// H -> 4 : 8, (8 == 8 * 1, contiguous)
// W -> 8 : 1, (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 * 4 * 3 * 4 = 384
fn fully_contiguous<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![N, C, H], m![W]> {
input.fetch()
}
// Compiler-generated configuration: [
// C -> 3 : 32, (32 != 4 * 96, NOT contiguous)
// N -> 4 : 96, (96 != 4 * 8, NOT contiguous)
// H -> 4 : 8, (8 == 8 * 1, contiguous)
// W -> 8 : 1, (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 * 4 = 32
fn three_axes_non_contiguous<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![C], m![N, H, W]> {
input.fetch()
}
// Compiler-generated configuration: [
// N -> 4 : 96, (96 != 4 * 8, NOT contiguous)
// H -> 4 : 8, (8 != 3 * 32, NOT contiguous)
// C -> 3 : 32, (32 != 8 * 1, NOT contiguous)
// W -> 8 : 1, (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8
fn four_axes_non_contiguous<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, H, C, W]> {
input.fetch()
}
}
For detailed information on how packet size interacts with memory access patterns and sequencer configuration, see the sequencer configuration.
Note
The optimal sequencer configuration is automatically generated by the compiler based on the output type of fetch(). Users do not directly specify sequencer configurations in Virtual ISA. Similarly, fetch_size and contiguous_sram_access_size are automatically derived by the compiler, not directly specified by users.
Optimizations
Different configurations can achieve the same tensor move with varying efficiency. Two key optimizations dramatically improve performance: padding packets to maximize bandwidth and interleaving tensors to combine operations.
Padding Packets
Padding packets to full hardware bandwidth drastically reduces fetch cycles.
The increased packet size allows the compiler to increase fetch_size, which reduces the number of fetch cycles needed to transfer the same amount of data.
The following example demonstrates this effect:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 3, B = 5, C = 2];
/// Smallest packet: only C dimension (2 bytes). Takes 15 cycles.
fn fetch_packet_C<'l, const T: Tu>(
input: BeginTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![A, B, C]>,
) -> FetchTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![A, B], m![C]> {
input.fetch()
}
/// Medium packet: B and C dimensions padded to 16 bytes. Takes 3 cycles.
fn fetch_packet_BC<'l, const T: Tu>(
input: BeginTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![A, B, C]>,
) -> FetchTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![A], m![[B, C] # 16]> {
input.fetch()
}
/// Largest packet: all dimensions padded to 32 bytes. Takes 1 cycle.
fn fetch_packet_ABC<'l, const T: Tu>(
input: BeginTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![A, B, C]>,
) -> FetchTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![[A, B, C] # 32]> {
input.fetch()
}
}
Padding reads beyond the actual data, but this is safe because padding values do not affect computation.
Note that different padding strategies produce different FetchTensor mappings, which may affect downstream components.
Interleaving Tensors
Interleaving combines two tensors with identical mappings into a single sequencer operation, reducing overhead when both tensors are needed for the same computation.
An explicit axis is introduced in the Time dimension to encode alternation between the two tensors.
In the following example, the main context creates an interleaved tensor using begin_interleaved().
This introduces an axis I = 2 in the Time dimension, which encodes alternation between the two tensors.
The first temporal iteration fetches from lhs, the second iteration fetches from rhs,
the third iteration fetches the next packet from lhs, and so on, continuing this alternating pattern.
At most two tensors can be interleaved in a single fetch operation.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, B = 32, I = 2];
/// Interleaves two input tensors into a single packet stream.
/// Useful for operations like 'input1 + input2' in the Vector Engine.
/// The interleaved BeginTensor is created via Tu.begin_interleaved().
/// The `I = 2` axis in Time encodes alternation between the two tensors.
fn fetch_interleaved<'l>(
ctx: &'l mut Context,
lhs: &'l DmTensor<i8, m![1], m![1], m![1], m![A, B]>,
rhs: &'l DmTensor<i8, m![1], m![1], m![1], m![A, B]>,
) -> FetchTensor<'l, { Tu::Main }, i8, m![1], m![1], m![1], m![A, I], m![B]> {
ctx.main.begin_interleaved::<I, _, _, _, _, _>(lhs.view(), rhs.view()).fetch()
}
}
Fetch Adapter
The Fetch Adapter transforms raw packet streams into computation-ready format through five stages: masking, table indexing, type casting, zero-point subtraction, and batching. The main-context adapter supports all five stages, while sub-context adapters only support zero-point subtraction.
Sequencer and Adapter Interaction
The Fetch Sequencer operates with a fixed stream mapping: (Slice, Time) → Packet.
It controls what memory addresses are read and how elements are packed.
While structural transformations happen during memory access, element-wise transformations happen during adapter processing.
To achieve a specific data layout, reshape the tensor before the sequencer reads it, then apply value-wise transformations in the adapter stage.
For example, consider a tensor with S = [8_a], Element = a, Time = a ! 4. Slicing can be expressed as S = [4_a], Element = a + 4, Time = a. With offset consideration, S = [8_a], Element = a, Time = a @ 4 becomes S = [8_a], Element = 4 + a, Time = a. In principle, if we reshape the tensor appropriately, all forms could be expressed this way, though the necessity of this specific approach warrants further investigation.
Masking
The Tensor Unit requires data in power-of-two sizes for efficient processing, because its internal data paths operate on fixed-width units (32-byte flits containing 8 elements of 32-bit data).
A 63-element axis must be padded to 64 elements.
Without masking, the padded element might contain an arbitrary value that corrupts operations like sum or max.
Masking forces padded elements to neutral values so they do not influence the result.
For example, the Reducer sums elements along an axis. Summing 63 real elements plus 1 arbitrary padded value produces an incorrect result. Masking sets that padded element to zero.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 63];
/// Fetches with automatic masking: pads 63 elements to 64, masking the padding.
/// Hardware automatically masks the 64th element to zero
/// so reduce operations compute correctly on 63 valid elements.
fn fetch_with_masking<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A # 64]> {
input.fetch()
}
}
Masking Configuration
The Fetch Engine supports masking for innermost axes with padding on both sides, expressed as (# n + A + # m) where n is left padding, A is valid data, and m is right padding.
The hardware provides three masking cases to handle different padding scenarios, each optimized for specific padding patterns and axis sizes.
All masking configurations use three key parameters:
- last_dim: Specifies the dimension index to apply masking to.
- left_pad: Masks the first left_pad elements when the index of last_dim is 0.
- last_dim_rightmost_valid_count[0]: Masks dim0 - last_dim_rightmost_valid_count[0] elements from the right when the last_dim index is the last. This value is limited to 0-255 for 4-bit types, 0-31 for f32, as the final packet size must not exceed 256 bytes.
Example (Padding case 1)

- axes![A = 32, B = 90]
- dtype = i8
- base_addr = 0
- Element = m![A, B # 96]
- Configuration: last_dim = 1, lpad = 2, last_dim_rightmost_valid_count[0] = 4, pad_value = 0
- Stream mapping: let B' = # 2 + B + # 4 in { Time: m![A, B' / 32], Packet: m![B' % 32] }
- Sequencer configuration: [A = 32 : 96, B' / 32 = 3 : 32, B' % 32 = 32 : 1] : 32 @ base_addr = -2
- Packet size: m![B' % 32]::SIZE = 32
- Cycles: Time::SIZE = m![A, B' / 32]::SIZE = 32 * 3 = 96
- Result: The first 2 and last 4 values of (# 2 + B + # 4) are masked to 0.
Example (Padding case 2)
Case 2 handles the same masking as Case 1, but for non-contiguous padding regions that are split across the data.

- axes![A = 32, B = 90]
- dtype = i8
- base_addr = 0
- Element = m![A, B # 96]
- Configuration: last_dim = 0, lpad = 2, last_dim_rightmost_valid_count[0] = 4, pad_value = 0
- Stream mapping: let B' = # 2 + B + # 4 in { Time: m![B' / 32, A], Packet: m![B' % 32] }
- Sequencer configuration: [B' / 32 = 3 : 32, A = 32 : 96, B' % 32 = 32 : 1] : 32 @ base_addr = -2
- Packet size: m![B' % 32]::SIZE = 32
- Cycles: Time::SIZE = m![B' / 32, A]::SIZE = 3 * 32 = 96
- Result: The first 2 and last 4 values of (# 2 + B + # 4) are masked to 0.
Example (Padding case 3)

Case 3 supports larger right padding values through per-index masking. Cases 1 and 2 limit right padding to 255 * 4-bit, but Case 3 removes this limitation:
- Each entry index i uses its own last_dim_rightmost_valid_count[i] value.
- Supports last_dim_rightmost_valid_count[0..8] when the axis size is 8 or less.
Consider the following:
- axes![A = 32, B = 97]
- dtype = f32
- base_addr = 0
- Element = m![A, B # 128]
- Stream mapping: let B' = B + # 31 in { Time: m![A, B' / 16, 1], Packet: m![B' % 16] }
- Sequencer configuration: [A = 32 : 128, B' / 16 = 8 : 16, 1 = 1 : 0, B' % 16 = 16 : 1] : 16 @ base_addr = -2
- Packet size: m![B' % 16]::SIZE = 16
- Cycles: Time::SIZE = m![A, B' / 16, 1]::SIZE = 32 * 8 * 1 = 256
- Configuration: last_dim_rightmost_valid_count_dim = 1, last_dim = 2, last_dim_rightmost_valid_count[0..8] = [16, 16, 16, 16, 16, 16, 1, 0], pad_value = 0
- Result: Of (B + # 31), 97 elements are valid and 31 are masked as invalid.
Table Indexing
Some operations cannot be efficiently implemented with standard arithmetic. Non-linear activation functions like Sigmoid and GeLU require expensive approximations, and certain quantization schemes use custom encoding tables.
Table indexing provides hardware-accelerated lookup tables during the fetch stage. Each value is treated as an index into a pre-configured table, and the corresponding table entry is output instead. This enables:
- Non-linear activations: Implement Sigmoid, GeLU, and other functions through pre-computed lookup tables.
- Custom type casting: Translate specialized encodings like
MXFP4to standard formats using conversion tables.
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
/// Fetches with table lookup: each input value indexes into a pre-configured table.
/// Input [0, 1, 2, 3, 4, 5, 6, 7] with table[x] = 2*x
/// Output [0, 2, 4, 6, 8, 10, 12, 14]
fn fetch_with_table<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
table: &LookupTable<i8, i8>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]> {
input.fetch_with_table(table)
}
Performance
Table indexing can introduce performance overhead compared to direct memory fetches. Key considerations include:
- Additional latency for table lookup operations
- Potential bandwidth limitations when lookup tables are accessed
- Impact on pipeline throughput when table access becomes a bottleneck
Type Casting
The Fetch Adapter converts element types as data streams from DM, enabling computation on data stored at lower precision than the compute pipeline requires.
RNGD supports the following conversions:
| Input | Output |
|---|---|
i4 | i5, i32 |
i8 | i9, i32 |
i16 | i32 |
f8e4m3 | f32 |
f8e5m2 | f32 |
bf16 | f32 |
f16 | f32 |
f32 | bf16 |
RNGD-S supports the following additional type conversions:
| Input | Output |
|---|---|
i4 | i9 |
i16 | i9 |
f8e4m3 | bf16 |
f8e5m2 | bf16 |
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
/// Fetches with type casting: converts i8 storage to i32 for computation.
/// Input: i8 [0, 1, 2, 3, 4, 5, 6, 7]
/// Output: i32 [0, 1, 2, 3, 4, 5, 6, 7]
fn fetch_with_type_cast<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![1], m![A]> {
input.fetch()
}
}
Constraints
Type casting requires the data produced by a single fetch operation to not exceed 32 bytes.
This constraint exists as Fetch Engine packets are forwarded through the Switch Engine and Collect Engine, which normalizes packets into flits (flow control units) for the Compute Engine. Each flit is 32 bytes (single-channel mode) or 64 bytes (dual-channel mode).
The 32-byte limit avoids flit overflow. See Fetch Engine and Switch Engine Interaction for more details.
Consider the following examples:
- Valid: i4 to i32 conversion with a fetch_size of 4 bytes
  - Fetches 4 bytes (8 elements of i4)
  - Produces 32 bytes (8 elements of i32)
  - Output size: 32 bytes ✓
- Invalid: i4 to i32 conversion with a fetch_size of 8 bytes
  - Fetches 8 bytes (16 elements of i4)
  - Produces 64 bytes (16 elements of i32)
  - Output size: 64 bytes ✗ (exceeds 32-byte limit)
- Invalid: i8 to i32 conversion with a fetch_size of 16 bytes
  - Fetches 16 bytes (16 elements of i8)
  - Produces 64 bytes (16 elements of i32)
  - Output size: 64 bytes ✗ (exceeds 32-byte limit)
This constraint affects the allowed fetch_size values in sub-context operations:
- When casting from i4 to i32, fetch_size must be 4 bytes.
- For other conversions, fetch_size must be 8 bytes.
Zero-Point Subtraction
Quantization schemes are either symmetric (centered around zero) or asymmetric (shifted by an offset called the zero-point). Asymmetric quantization represents data ranges more efficiently but requires subtracting the zero-point before computation. When converting from quantized integers to computation types, the hardware can simultaneously subtract this offset.
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
/// Fetches with zero-point subtraction for asymmetric quantization.
/// Input: i8 [0, 1, 2, 3, 4, 5, 6, 7], with zero_point = 10
/// Output: i32 [-10, -9, -8, -7, -6, -5, -4, -3]
fn fetch_with_zero_point<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
zero_point: i8,
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![1], m![A]> {
input.fetch_with_zero_point(zero_point)
}
Interleaving fetches enable subtracting different zero points from each tensor. This is useful when combining tensors with different quantization parameters.
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, I = 2];
/// Fetches interleaved tensors with different zero points per tensor.
/// Input1: [0, 1, 2, 3, 4, 5, 6, 7], with zero_point = 100
/// Input2: [0, 1, 2, 3, 4, 5, 6, 7], with zero_point = -100
/// Output interleaved: [-100, -99, ..., 100, 101, ...]
fn fetch_interleaved_with_zero_points<'l, const T: Tu, const I: Ident>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![I], m![A]>,
zero_points: [i8; 2],
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![I], m![A]> {
input.fetch_with_zero_points(zero_points)
}
Batching
Memory systems have a minimum efficient transfer size to make full use of each memory access.
When a tensor’s natural packet size is smaller than this threshold, fetching each packet individually wastes bandwidth.
Batching combines multiple small packets into a single larger transfer by grouping consecutive time steps: the fetches_per_packet value determines how many individual fetches are combined into one packet.
With a fetch_size of 8 bytes and fetches_per_packet of 5, the adapter groups 5 fetches together to create a single 40-byte packet.
Note
The fetches_per_packet value is derived by the compiler from the output type of fetch(). Users do not directly specify fetches_per_packet.
Tip
Prefer fetching large packets directly from DM using the Fetch Sequencer. Use Fetch Adapter batching only when large packets cannot be retrieved in one cycle due to memory layout constraints, such as a 24-byte packet spread across three non-contiguous 8-byte locations.
The total number of cycles required to fetch the data is: $$ \text{\#cycles} = \texttt{Time::SIZE} \times \left\lceil \frac{\texttt{Packet::SIZE}}{\texttt{fetch\_size}} \right\rceil $$
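For example, fetch_batch_4 below has Time::SIZE = 4, a 96-byte packet, and fetch_size = 32 bytes, so #cycles = 4 × ceil(96 / 32) = 12.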
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![N = 4, C = 3, H = 4, W = 8];
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H = 4 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H, W]::SIZE = 384
/// packet_size = 8 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 8 bytes
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 1
/// #cycles = 48
fn fetch_batch_1<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N, C, H], m![W]> {
input.fetch()
}
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H / 2 = 2 : 16, H % 2 = 2 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H / 2, H % 2, W]::SIZE = 384
/// packet_size = 16 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 16 bytes
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 1
/// #cycles = 24
fn fetch_batch_2<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N, C, H / 2], m![H % 2, W]> {
input.fetch()
}
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H = 4 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H, W]::SIZE = 384
/// packet_size = 32 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 32 bytes
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 32 / 32 = 1
/// #cycles = 12
fn fetch_batch_3<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N, C], m![H, W]> {
input.fetch()
}
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H = 4 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H, W]::SIZE = 384
/// packet_size = 96 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 32 bytes
/// fetch_size should be in {1, 2, 4, 8, 16, 32}
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 96 / 32 = 3
/// #cycles = 12
fn fetch_batch_4<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N], m![C, H, W]> {
input.fetch()
}
}
Constraints
- The output packet size must be 8-byte aligned (a multiple of the fetch sequencer’s read granularity).
- The Time dimension size must be divisible by the batching factor.
- Type conversion is limited to specific type pairs (see Type Casting).
- Zero-point values must fit within the target data type’s representable range.
- The underlying sequencer produces base packets of 1, 2, 4, 8, 16, or 32 bytes (see Sequencer Constraints).
Fetch Engine and Switch Engine Interaction
After batching, Fetch Engine packets are forwarded to the Switch Engine, where data is routed through an interconnect network of slices. A network topology determines the distribution pattern of packets between slices. The packet passes through the Switch Engine unchanged.
After switching, the Collect Engine normalizes packets to 32-byte flits (flow control units used by all downstream engines).
The Collect Engine pads packets that are not divisible by the flit size (32 bytes) with zeros.
For example, for fetch_size = 8 bytes and fetches_per_packet = 5, the Fetch Adapter batches 5 fetches together, producing a 40-byte packet.
Since it is not 32-byte aligned, the Collect Engine adds 24 bytes of zero padding, producing a 64-byte packet (2 flits).
The hardware operates over flits (flow control units), which are the physical unit of data transfer. The Tensor Unit has two Switch Engines (one for each context), each with a 32-byte data width. Throughput and flit size depend on the configured mode:
- Single channel mode: A single flit has 32 bytes. Half of the available bandwidth is used.
- Dual channel mode: The main and sub contexts are combined to produce 64-byte flits. Dual channel mode requires explicit configuration. The compiler does not generate it automatically.
Forwarding 32-byte aligned packets from the Fetch Engine avoids wasted bandwidth in the Collect Engine. An unaligned packet requires padding: for example, for a 20-byte packet, \(\frac{32 - 20}{32} \approx 37.5\%\) of the flit payload is unused.
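As a rough model of this arithmetic (illustrative only; the names here are not part of the API), the wasted fraction for a given packet size can be computed as:
#![allow(unused)]
fn main() {
const FLIT_BYTES: usize = 32;
/// Fraction of flit payload wasted on zero padding for a given packet size,
/// based on the Collect Engine's 32-byte flit padding described above.
fn wasted_fraction(packet_bytes: usize) -> f64 {
    let flits = packet_bytes.div_ceil(FLIT_BYTES);
    let padding = flits * FLIT_BYTES - packet_bytes;
    padding as f64 / (flits * FLIT_BYTES) as f64
}
assert_eq!(wasted_fraction(32), 0.0);   // aligned: no padding
assert_eq!(wasted_fraction(20), 0.375); // 12 of 32 bytes padded, as above
assert_eq!(wasted_fraction(40), 0.375); // a 40-byte packet needs 2 flits (64 bytes)
}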
Tip
Prefer 32-byte aligned packets over unaligned ones.
Performance
Memory Bandwidth
Peak DM bandwidth is 256 B/cycle with proper DMN interleaving (see Memory Performance for the interleaving technique). Contiguous accesses enable parallel bank access; distributing fetches across slices maximizes parallelism. See Memory Performance for optimization strategies.
Adapter Overhead
Each adapter stage adds minimal latency: batching must accumulate fetches_per_packet packets, type conversion takes 1–2 cycles.
Bank Starvation
The Fetch Engine shares DM bank access with Commit and DMA Engines. Fetch operations have higher priority than DMA, but consecutive accesses to the same bank (64+ accesses) can starve the DMA Engine. The compiler prevents this by avoiding concurrent scheduling of problematic patterns. See Bank Starvation for details.
Dual Channel Mode Benefits
When both main and sub contexts are available:
- Bandwidth doubles to 64 bytes per cycle
- Fetch cycles halve for large transfers
- Trade-off: sub-context unavailable for independent operations
Packet Alignment
Prefer 32-byte aligned packets over unaligned ones. See Fetch Engine and Switch Engine Interaction for more details.
Commit Engine
The Commit Engine writes Tensor Unit results back to DM (Data Memory), the primary on-chip SRAM tier. It implements a logical tensor move from Tensor Unit streams to SRAM, writing each slice’s result to its designated DM address.
After the Tensor Unit completes computation, results exist as streaming packets distributed across slices. The Commit Engine transforms these packets through an adapter (truncating) and writes them to DM via a sequencer. This page covers the interface and examples, the adapter stages, the sequencer, sub-context operations, and performance guidelines.
Interface
impl<'l, const T: Tu, P: Position, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
StreamTensor<'l, { T }, P, D, Chip, Cluster, Slice, Time, Packet>
{
/// Commits to the data memory.
#[primitive(StreamTensor::commit)]
pub fn commit<Element: M>(self, address: Address) -> DmTensor<D, Chip, Cluster, Slice, Element> {
verify_commit::<D, Time, Packet, Element>();
DmTensor::new(self.inner.transpose(false), address)
}
/// Commits to mutable tensor view in the data memory.
#[primitive(StreamTensor::commit_view)]
pub fn commit_view<Element: M>(self, mut dst: DmTensorViewMut<'l, D, Chip, Cluster, Slice, Element>) {
verify_commit::<D, Time, Packet, Element>();
dst.inner.write_transpose(self.inner.view(), false);
}
}
The Commit Engine mirrors the Fetch Engine’s structure, but operates in reverse.
For detailed examples, see kernel examples.
Examples
Consider storing a matrix multiplication result C = A * B back to DM after computation.
The Cast Engine converts the Contraction Engine’s f32 packet elements to bf16 to save space.
The Commit Engine stores the resulting tensor to DM.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, M = 16, N = 8];
fn cast_commit<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![P], m![M], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![P], m![M, N # 16]> {
// Cast f32 to bf16 (Cast Engine), then commit to DM (Commit Engine).
// Input: M = 16 time steps, N = 8 f32 elements per packet (32 bytes).
// After cast: N = 8 bf16 elements padded to 16 (32 bytes).
// The sequencer writes across P = 256 slices.
input.cast::<bf16, m![N # 16]>().commit(0)
}
}
Adapter
The adapter transforms stream packets before writing to DM via truncating.
The main context and sub-context adapters both support truncating. The sub-context is typically used for prefetching to TRF/VRF.
Truncating
Truncating reduces packet size by keeping only the leading elements.
The input packet is always a full 32-byte flit.
The commit_in_size parameter controls how many bytes are actually written to DM: 8, 16, 24, or 32 bytes (where 32 bytes means no reduction).
This operation is typically used to discard trailing padding elements or to satisfy downstream alignment constraints.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, K = 2, W = 8, N = 16, J = 64];
fn i8_padding_truncation<'l, const T: Tu>(
input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![M, K], m![W # 32]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![M, K, W]> {
// Input: 8 i8 elements padded to 32 (32 bytes per packet).
// Truncation removes padding: only the 8 leading elements are written to DM.
// commit_in_size = 8 elements × 1 byte = 8 bytes.
input.commit(0)
}
fn f32_non_padding_truncation<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K], m![W]>,
) -> DmTensor<f32, m![1], m![1], m![1], m![M, K, W = 4]> {
// Input: 8 f32 elements (32 bytes per packet).
// Truncation: only the first 4 elements are written to DM.
// commit_in_size = 4 elements × 4 bytes = 16 bytes.
input.commit(0)
}
fn bf16_truncation_with_transpose<'l, const T: Tu>(
input: CastTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![1], m![K, M, N = 8]> {
// Input: 16 bf16 elements (32 bytes per packet).
// Truncation: only the leading 8 elements are written to DM.
// commit_in_size = 8 elements × 2 bytes = 16 bytes.
// Time is transposed: m![M, K] → m![K, M].
input.commit(0)
}
fn i4_no_truncation_with_transpose<'l, const T: Tu>(
input: CastTensor<'l, T, i4, m![1], m![1], m![1], m![M, K], m![J]>,
) -> DmTensor<i4, m![1], m![1], m![1], m![K, M, J]> {
// Input: 64 i4 elements (32 bytes per packet).
// No truncation: the full 32-byte packet is written to DM.
// commit_in_size = 64 elements × 0.5 bytes = 32 bytes.
// Time is transposed: m![M, K] → m![K, M].
input.commit(0)
}
}
Note
The commit_in_size value is automatically derived by the compiler from the output tensor mapping. It is not manually specified by the user.
Commit Sequencer
The commit sequencer writes streams to DM across slices. Each slice within an aggregation executes its own sequencer. This mirrors how fetch sequencers pull data into Tensor Units.
The commit_size value determines how many bytes are written per sequencer step.
It is analogous to the Fetch Engine’s fetch_size and is also derived from contiguous_sram_access_size:
$$ \texttt{commit\_size} = \gcd(\texttt{contiguous\_sram\_access\_size\_bytes},\ \texttt{commit\_in\_size}) $$
- When commit_size == commit_in_size, each time step produces a single DM write.
- When commit_size < commit_in_size, the packet is split into commit_in_size / commit_size writes per time step.
The main context supports a commit_size of 8, 16, 24, or 32 bytes (see main context).
The sub-context supports a commit_size of 8 bytes only (see sub-context).
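To make the derivation concrete, here is a minimal sketch in plain Rust; commit_size here is a hypothetical standalone helper, not the compiler's internal API, and the figures in the assertions come from the compiler-generated configurations shown in the example below.

```rust
// Illustrative sketch of the commit_size derivation; not the compiler's internal code.
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// commit_size = gcd(contiguous_sram_access_size, commit_in_size), both in bytes.
fn commit_size(contiguous_sram_access_size: usize, commit_in_size: usize) -> usize {
    gcd(contiguous_sram_access_size, commit_in_size)
}

fn main() {
    // Values taken from the compiler-generated configurations in the example below.
    assert_eq!(commit_size(64, 8), 8);   // no_transpose: one 8-byte write per time step
    assert_eq!(commit_size(32, 32), 32); // transpose: a single 32-byte write per time step
    assert_eq!(commit_size(16, 16), 16); // transpose_with_truncation
    assert_eq!(commit_size(8, 32), 8);   // padding_chunking: 32 / 8 = 4 writes per time step
}
```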
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, K = 2, W = 8, N = 16];
// Compiler-generated configuration: [
// M -> 4 : 16, (16 == 2 * 8, contiguous)
// K -> 2 : 8, (8 == 8 * 1, contiguous)
// W -> 8 : 1 (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = (8 * 2 * 4) elements × 1 byte = 64 bytes
// commit_in_size = 8 bytes (8 valid i8 elements out of 32-byte flit)
// commit_size = gcd(64, 8) = 8
fn no_transpose<'l, const T: Tu>(
input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![M, K], m![W # 32]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![M, K, W]> {
input.commit(0)
}
// Compiler-generated configuration: [
// M -> 4 : 8, (8 != 2 * 32, NOT contiguous)
// K -> 2 : 32, (32 != 8 * 1, NOT contiguous)
// W -> 8 : 1 (packet dimension, contiguous)
// ] : 32
// contiguous_sram_access_size = 8 elements × 4 bytes = 32 bytes
// commit_in_size = 32 bytes
// commit_size = gcd(32, 32) = 32
fn transpose<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K], m![W]>,
) -> DmTensor<f32, m![1], m![1], m![1], m![K, M, W]> {
input.commit(0)
}
// Compiler-generated configuration: [
// M -> 4 : 8, (8 != 2 * 32, NOT contiguous)
// K -> 2 : 32, (32 != 8 * 1, NOT contiguous)
// N -> 8 : 1 (truncated packet dimension, contiguous)
// ] : 16
// contiguous_sram_access_size = 8 elements × 2 bytes = 16 bytes
// commit_in_size = 16 bytes (8 bf16 elements; truncation from 16 elements to 8)
// commit_size = gcd(16, 16) = 16
fn transpose_with_truncation<'l, const T: Tu>(
input: CastTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![1], m![K, M, N = 8]> {
input.commit(0)
}
// Compiler-generated configuration: [
// K -> 2 : 64, (64 == 4 * 16, contiguous)
// M -> 4 : 16, (16 != 8 * 1, NOT contiguous)
// W -> 8 : 1 (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 elements × 1 byte = 8 bytes
// commit_in_size = 32 bytes
// commit_size = gcd(8, 32) = 8
//
// The 32-byte packet is split into 4 × 8-byte writes along the M axis:
// - Write 0: packet[ 0.. 8] → DM offset 0
// - Write 1: packet[ 8..16] → DM offset 16
// - Write 2: packet[16..24] → DM offset 32
// - Write 3: packet[24..32] → DM offset 48
fn padding_chunking<'l, const T: Tu>(
input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![K], m![M, W]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![K, M, W # 16]> {
input.commit(0)
}
}
Slice Bitmap
The slice bitmap enables selective commits to specific slices.
A 256-bit mask controls which slices receive commit data, with each bit corresponding to one slice.
For example:
- bitmap = 00000000...01 enables commit only to slice 0
- bitmap = 11111111...10 enables commit to all slices except slice 0
This feature supports workflows that compute on specific slices and commit results only to those slices.
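A minimal model of the bitmap in plain Rust; SliceBitmap is a hypothetical illustration of the 256-bit mask, not the Virtual ISA API.

```rust
// Illustrative model of the 256-bit slice bitmap (not the Virtual ISA API):
// bit k enables commits to slice k.
#[derive(Clone, Copy, Default)]
struct SliceBitmap([u64; 4]);

impl SliceBitmap {
    fn enable(&mut self, slice: usize) {
        assert!(slice < 256);
        self.0[slice / 64] |= 1 << (slice % 64);
    }
    fn is_enabled(&self, slice: usize) -> bool {
        ((self.0[slice / 64] >> (slice % 64)) & 1) == 1
    }
}

fn main() {
    // bitmap = 0...01: commit only to slice 0.
    let mut only_slice0 = SliceBitmap::default();
    only_slice0.enable(0);
    assert!(only_slice0.is_enabled(0) && !only_slice0.is_enabled(1));

    // bitmap = 1...10: commit to every slice except slice 0.
    let mut all_but_slice0 = SliceBitmap([u64::MAX; 4]);
    all_but_slice0.0[0] &= !1;
    assert!(!all_but_slice0.is_enabled(0) && all_but_slice0.is_enabled(255));
}
```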
Hardware Constraint
The commit sequencer must adhere to the same limits as fetch sequencers. See fetch sequencer constraints for details.
Sub-Context Operations
The sub-context Commit Engine provides specialized capabilities beyond the main context, though it supports fewer adapter stages.
- Valid Count Packing: This operation selectively commits only valid tensor elements based on a runtime count, excluding padding or invalid data from the output buffer. When computation produces variable-length results (for example, filtering operations or dynamic sequence lengths), valid count packing ensures that only meaningful elements are written to DM, preventing wasted memory and simplifying downstream processing. The hardware uses a count parameter to determine how many leading elements from each packet should be committed, discarding the remainder.
- Generate Mode: Writes a single 32-bit value to a specified address via an ITOS (immediate-to-SRAM) command, bypassing the Tensor Unit execution pipeline.
Constraints
- The input packet size must be 32 bytes.
- The commit_in_size must be 8, 16, 24, or 32 bytes. The commit_size must be 8, 16, 24, or 32 bytes for the main context and 8 bytes only for the sub-context. Note that the user only specifies the Element mapping; these constraints are internal to the compiler.
- The two contexts support different capabilities:
| Stage | Main context | Sub context |
|---|---|---|
| Truncating | Yes | Yes |
| Valid Count Packing | No | Yes |
| Generate Mode | No | Yes |
- Sub-context commits can only follow fetch; they cannot be preceded by Cast Engine or Transpose Engine operations.
- The commit sequencer shares the same limits as the fetch sequencer (see fetch sequencer constraints). Additionally, all sequencer strides must be multiples of 8 bytes.
Performance
Commit Engine performance directly affects overall computation throughput since DM writes must complete before subsequent operations can access the data.
Write Bandwidth
The Commit Engine achieves maximum write bandwidth when:
- Slice Interleaving: Distributing writes across all active slices (or the subset specified by the slice bitmap) avoids bottlenecks on individual slices. The RNGD chip has 64 slices per PE. The 256-bit bitmap accommodates up to 4 PEs (4 × 64 = 256).
- Sequential Addresses: Writing to sequential DM addresses within each slice enables parallel bank access (128 B/cycle per DMN, 256 B/cycle with DMN interleaving).
- Aligned Packet Sizes: Using 8-byte aligned packet sizes (8, 16, 24, 32 bytes) avoids partial bank writes.
For detailed memory performance characteristics, see Memory Performance.
Adapter Stage Costs
Each adapter stage adds minimal latency:
- Truncating: Nearly zero cost (simple data width reduction)
Bank Starvation Prevention
The Commit Engine shares DM bank access with the Fetch Engine and DMA Engine.
To prevent bank starvation and catastrophic NoC timeouts, ensure commit patterns avoid 64+ consecutive accesses to the same bank.
The compiler automatically enforces this constraint by treating violating operations as if they occupy DMA context, preventing concurrent DMA operations.
See DM Bank Starvation for details.
DMA Engine
The DMA Engine moves tensors directly between memory locations without involving the Tensor Unit. It supports all combinations of HBM, SPM, and DM transfers while optionally transforming memory layouts.
As a kernel writer, you control the source and destination memory tiers and any layout transformation expressed as mapping expressions. Prefer direct transfers between tiers: routing data through an intermediate tier (e.g., HBM→SPM→DM when HBM→DM suffices) adds unnecessary latency and bandwidth pressure. The compiler derives the read/write sequencer configuration.
This page covers the interface, worked examples, architecture, and performance characteristics.
Interface
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
/// Moves a tensor from one memory location to another using DMA.
/// Supports layout transformations during transfer.
fn dma<D: Scalar, InMedia, OutMedia, InMapping, OutMapping, StreamMapping>(
input: &Tensor<D, InMedia, InMapping>,
output: &mut Tensor<D, OutMedia, OutMapping>,
stream: StreamMapping,
) {
// Hardware implementation:
// - Read sequencer fetches from source memory
// - Write sequencer stores to destination memory
// - Stream mapping coordinates the transfer
}
The operation signature follows this pattern:
impl<D: Scalar, Chip: M, Element: M> HbmTensor<D, Chip, Element> {
/// Converts to data memory tensor.
#[primitive(HbmTensor::to_dm)]
pub fn to_dm<Cluster: M, Slice: M, Element2: M>(
&self,
_dma: &mut DmaContext<{ Dma::Tensor }>,
address: Address,
) -> DmTensor<D, Chip, Cluster, Slice, Element2> {
DmTensor::new(self.inner.transpose(true), address)
}
}
Transfer capabilities:
- All nine source-destination pairs between DM, SPM, and HBM (including same-tier copies)
- Cross-DMN, cross-cluster, and cross-chip transfers
- Inter-chip transfers via PCIe at 30 bytes/cycle
See also: Memory Performance, Sequencer.
Examples
Layout Transformation
Consider transposing a tensor’s layout while moving it from HBM to DM:
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![N = 4, C = 3, H = 8, W = 8];
// Tensor in HBM with NCHW layout
let hbm: HbmTensor<i8, m![1], m![N, C, H, W]> = /* ... */;
// Tensor DMA moves it to DM with NHWC layout (assumes a kernel `ctx: &mut Context` is in scope)
let dm: DmTensor<i8, m![1], m![1], m![1], m![N, H, W, C]> =
    hbm.to_dm::<m![1], m![1], m![N, H, W, C]>(&mut ctx.tdma, 0);
The DMA Engine reads from HBM using one access pattern and writes to DM using a different pattern, transforming the layout during transfer. For parameter definitions, see the Architecture section below.
Architecture
The DMA Engine coordinates paired read and write sequencers for flexible tensor movement. Each RNGD chip contains eight DMA Engines, one per pair of DMNs, so up to eight independent tensor transfers can proceed simultaneously.
Single-Engine Operation
A single DMA Engine operation transforms a tensor by reading it from one memory location and writing it to another with a potentially different layout.
Parameters
The DMA operation requires several parameters to specify the source tensor, destination tensor, and how data flows between them:
- shape: The tensor’s logical shape (declared via axes![...])
- dtype: Element datatype (e.g., i8, bf16)
- media_in, media_out: Source and destination media types (DM/SPM/HBM)
- b_in, b_out: Base memory addresses for input/output tensors (when the media is HBM, b = { element: b_element })
- In, Out: Mapping environments that specify how logical tensor indices map to physical memory locations
- Stream: Intermediate stream mapping environment that coordinates the read and write sequencers
The operation executes using two coordinated sequencers:
The read sequencer applies read(shape, dtype, b_in, In, Stream) to fetch data from the source, while the write sequencer applies write(shape, dtype, b_out, Out, Stream) to store data at the destination.
These sequencers work together through the shared Stream environment to ensure data flows correctly from source to destination.
Alignment Constraints
These constraints reflect the physical organization of memory hardware and the AXI bus protocol. The 8-byte DM write alignment stems from SRAM bank structure: each bank has an 8-byte data width, and the bank controller can only write complete 8-byte units. Misaligned writes require a read-modify-write operation, tripling the time and blocking other operations on that bank. The 1-byte read alignment reflects asymmetric hardware capabilities: SRAM read ports can extract arbitrary byte ranges using byte-select logic, but write ports cannot. HBM-to-DM 8-byte alignment combines both constraints: unaligned HBM reads incur severe performance penalties (potentially halving bandwidth), so the hardware enforces alignment for this critical path. The 4096-byte packet limit comes from the AXI bus protocol: AXI transactions cannot exceed 256 beats, and with 16-byte data width this yields 4096 bytes maximum. Violating these constraints causes correctness errors or hardware exceptions, not just performance degradation. The compiler enforces these rules because they are hardware invariants, not optimization hints.
Structural Requirements
The mapping environments must follow specific structural requirements depending on the media types involved:
- Stream must have a specific form:
  // Stream = { time: Time, packet: Packet }
- In/Out must have a specific form depending on the media media_in/media_out:
  // In =
  //   if media_in in {HBM, SPM}: { element: ElementIn }
  //   if media_in in {DM}:       { slice: SliceIn, element: ElementIn }
  //
  // Out =
  //   if media_out in {HBM, SPM}: { element: ElementOut }
  //   if media_out in {DM}:       { slice: SliceOut, element: ElementOut }
  This specifies the respective memory space.
- b_in/b_out must have a specific form depending on the media media_in/media_out:
  // b_in =
  //   if media_in in {HBM, SPM}: { chip: b_chip_in, element: b_element_in }
  //   if media_in in {DM}:       { chip: b_chip_in, cluster: b_cluster_in, slice: b_sliceIn, element: b_element_in }
  //
  // b_out =
  //   if media_out in {HBM, SPM}: { chip: b_chip_out, element: b_element_out }
  //   if media_out in {DM}:       { chip: b_chip_out, cluster: b_cluster_out, slice: b_sliceOut, element: b_element_out }
  This specifies addresses in the respective memory space.
- RNGD imposes the following hardware constraints on DMA Engine sequencers (see sequencer constraints for details):
  - Alignment requirements for addresses and packet size (Packet::SIZE):

    | | HBM | DM (SRAM) |
    |---|---|---|
    | Read address | 1B | 1B |
    | Write address | 1B | 8B |
    | Packet size | 1B | 8B |

    In addition, HBM-to-DM DMA transfers require 8-byte alignment for the read address, write address, and packet size, regardless of the values shown in the table above.
  - The packet size must be less than or equal to 4096 bytes (AXI protocol constraint).
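A sketch of how the explicitly stated rules above could be checked; check_dma_alignment and Media are hypothetical names used only for illustration, not compiler or SDK APIs.

```rust
// Sketch of the explicitly stated DMA alignment rules; not the compiler's implementation.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq)]
enum Media { Hbm, Spm, Dm }

fn check_dma_alignment(
    media_in: Media,
    media_out: Media,
    read_addr: u64,
    write_addr: u64,
    packet_size: u64,
) -> Result<(), &'static str> {
    if packet_size > 4096 {
        return Err("packet size exceeds 4096 bytes (AXI constraint)");
    }
    // DM writes require 8-byte aligned write addresses and packet sizes.
    if media_out == Media::Dm && (write_addr % 8 != 0 || packet_size % 8 != 0) {
        return Err("DM write address and packet size must be 8-byte aligned");
    }
    // HBM-to-DM transfers additionally require an 8-byte aligned read address.
    if media_in == Media::Hbm && media_out == Media::Dm && read_addr % 8 != 0 {
        return Err("HBM-to-DM transfers require an 8-byte aligned read address");
    }
    Ok(())
}

fn main() {
    assert!(check_dma_alignment(Media::Hbm, Media::Dm, 0x1000, 0x2000, 256).is_ok());
    assert!(check_dma_alignment(Media::Hbm, Media::Dm, 0x1001, 0x2000, 256).is_err());
}
```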
Example: Basic HBM-to-HBM Layout Transformation
This example demonstrates how a DMA operation transforms a tensor’s memory layout through a simple HBM-to-HBM transfer that rearranges tensor dimensions.
Consider a DMA operation with the following arguments:
axes![N = 4, C = 3, H = 8, W = 8];
// dtype = i8
// media_in = media_out = HBM
// b_in = { chip: 0, element: 1024 }, b_out = { chip: 0, element: 2048 }
// In = { element: m![N, C, H, W] }, Out = { element: m![H, C, N, W] }
// Stream = { time: m![H, C, N], packet: m![W] }
The compiler generates the following sequencer configurations from these arguments:
- Read sequencer configuration: [H=8:8, C=3:64, N=4:192, W=8:1]:8 HBM/D@1024
- Write sequencer configuration: [H=8:96, C=3:32, N=4:8, W=8:1]:8 HBM/D@2048
The hardware traverses memory locations according to these sequencer configurations. The following pseudocode models this behavior conceptually:
#![allow(unused)]
fn main() {
fn dma_sequencer() {
let packet_size = 8; // packet size divides last consecutive read/write sequencer configuration entry
for h in 0..8 {
for c in 0..3 {
for n in 0..4 {
for w_packet in 0..1 {
// packet size is 8, so W=8 is accessed as a single chunk
let read_index = h * 8 + c * 64 + n * 192 + w_packet * 1;
let stream = Mem[read_index..(read_index + packet_size)];
let write_index = h * 96 + c * 32 + n * 8 + w_packet * 1;
Mem[write_index..(write_index + packet_size)] = stream;
}
}
}
}
}
}
This example illustrates how the stream environment (Stream) mediates between different input and output layouts (In and Out), transforming the tensor’s organization in memory while moving it.
Performance
Optimal DMA performance requires attention to startup overhead, alignment, and packet size:
Startup overhead: Each DMA operation incurs approximately 500 cycles of initial overhead. Combining multiple transfers into fewer operations improves efficiency.
Alignment: While the constraints above specify minimum requirements, using larger alignment factors (particularly 256-byte alignment) yields better throughput. For detailed guidance, refer to the memory performance section.
Packet size and internal DMA requests: DMA automatically splits packets into 256-byte units internally: an n-byte packet becomes ceil(n / 256) DMA requests. Examples:
- If the innermost entry is x=4095:1, a 4095-byte packet results in 16 DMA requests.
- If the innermost entry is x=4099:1, since 4099 is prime, a single DmaCommand processes 1 byte at a time (very inefficient). Split it into two DmaCommands (e.g., a 4096-byte portion and a 3-byte portion) instead, though each additional DmaCommand adds ~500 cycles of initial latency.
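The request-count arithmetic can be sketched as follows; dma_requests is a hypothetical helper, not an SDK function.

```rust
// Illustrative request-count arithmetic: packets are split into 256-byte internal requests.
fn dma_requests(packet_size_bytes: u64) -> u64 {
    packet_size_bytes.div_ceil(256)
}

fn main() {
    assert_eq!(dma_requests(4095), 16); // a 4095-byte packet becomes 16 internal requests
    assert_eq!(dma_requests(256), 1);   // a 256-byte-aligned packet maps to a single request
    // A 4099-byte innermost entry cannot be handled efficiently as one packet (see above);
    // splitting it into 4096-byte and 3-byte commands costs 16 + 1 requests plus an extra
    // ~500-cycle startup for the second command.
}
```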
Homogeneous Aggregate Operation
Multiple DMA Engines work together in parallel to improve throughput for large tensor moves. The homogeneous aggregate operation distributes a single logical tensor move across DMA Engines in multiple DMNs, with all DMNs using identical stream environments to coordinate their work. With four chips, up to 32 DMA Engines execute portions of a single tensor move concurrently.
The operation has the following form:
// dma(shape, dtype, media_in, media_out, b_in, b_out, In, Stream, Out)
Each participating DMN executes its own DMA Engine to handle a portion of the overall transfer, together implementing the following single logical tensor move:
// <shape, In, media_in / dtype @ { element: b_in }> --id--> <shape, Out, media_out / dtype @ { element: b_out }>
Parallel execution across multiple DMNs requires extending the mapping environments beyond the single-DMN case to include chip, cluster, and slice dimensions:
// In =
// if media_in in {HBM, SPM}: { chip: ChipIn, element: ElementIn }
// if media_in in {DM}: { chip: ChipIn, cluster: ClusterIn, slice: SliceIn,
// element: ElementIn }
//
// Out =
// if media_out in {HBM, SPM}: { chip: ChipOut, element: ElementOut }
// if media_out in {DM}: { chip: ChipOut, cluster: ClusterOut, slice: SliceOut,
// element: ElementOut }
The key characteristic of homogeneous operations is that all DMNs share the same parametric stream environment:
// Stream = { chip: ChipStream, cluster: ClusterStream, slice: SliceStream,
// time: Time, packet: Packet }
Heterogeneous Aggregate Operation
The heterogeneous aggregate operation provides flexibility for different DMNs to process data differently during a parallel transfer. This variant allows each DMN to use a distinct stream environment while coordinating to perform a single logical tensor move.
Two constraints maintain correctness with this added flexibility:
- All participating DMA Engines must use the same input and output media types
- A single, unified input and output tensor mapping expression must govern the overall transfer
The heterogeneous aggregate DMA operation is defined as:
// dma(shape, dtype, media_in, media_out, b_in, b_out, In, StreamFn, Out)
Each DMN executes its own DMA Engine to implement the following single logical tensor move:
// <shape, In, media_in / dtype @ { element: b_in }> --id--> <shape, Out, media_out / dtype @ { element: b_out }>
The stream environment specification distinguishes this operation.
Instead of a single parametric stream environment shared by all DMNs, the heterogeneous operation uses StreamFn, a function mapping each DMN’s location to its own unique stream environment.
For a DMN at chip i, cluster j, and slice index k, the function StreamFn(i, j, k) returns that DMN’s specific stream mapping of the form { time: Time, packet: Packet }.
The input and output mapping environments (In and Out) remain structurally identical to the homogeneous case, ensuring a well-defined overall logical tensor move.
DMA Command Syntax
Two syntactic forms express DMA operations, depending on whether each DMN needs its own descriptor or can share a common pattern.
Heterogeneous Syntax (Full Flexibility)
The heterogeneous syntax specifies a complete DMA descriptor for each DMN individually, including potentially different source and destination media:
<DMACommand> ::= HashMap(<DmnIndex>, <DmaDescriptor>)
<DmaDescriptor> ::= (<DmaSequencer>, <source_media: Media>, <dest_media: Media>)
<DmaSequencer> ::= (<limit: integer>, <source_stride: integer>, <dest_stride: integer>)*,
(<source_base: integer>, <dest_base: integer>), <stride0: 1~4096>
<Media> ::= "HBM"(<ChipIndex>) | "DM"(DmnIndex) | "SPM"(DmnIndex)
<DmnIndex> ::= (<ChipIndex>, <ClusterInChipIndex>, <SliceInClusterIndex>)
<ChipIndex> ::= 0 | 1 | 2 | 3 (when using 4 chips)
<ClusterInChipIndex> ::= 0 | 1
<SliceInClusterIndex> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Note: While a DMA operation logically uses separate read and write sequencers, the compiler represents them compactly as a single DmaSequencer with paired strides and bases per entry (one for source, one for destination).
Homogeneous Syntax (Common Case)
For the common case where all DMNs follow a regular pattern, the homogeneous syntax offers a concise representation:
<DMACommand> ::= ( <source: Tensor>, <dest: Tensor>, HashMap(<DmnIndex>, <StreamShape>) )
<Tensor> ::= ( <Shape>, <Memory Mapping Expression>, <Media>, <addr: integer>, <Dtype> )
<Shape>, <Memory Mapping Expression>: defined before
<Media> ::= "HBM" | "DM" | "SPM"
<Dtype> ::= i4 | i8 | f8e4m3 | f8e5m2 | i16 | fp16 | bf16 | i32 | f32
<StreamShape> ::= <Memory Mapping Expression>
<DmnIndex> ::= (<ChipIndex>, <ClusterInChipIndex>, <SliceInClusterIndex>)
Key usage notes:
- DM tensor specifications must include chip, cluster, and slice dimensions in the Memory Mapping Expression to identify the exact memory location
- Each DMN’s StreamShape includes inter-DMN mapping information (e.g., chip: A!4, chip #2 means the stream shape uses A@2!1 to specify reading from a particular chip)
- Stream shapes are often inferred: if only source and destination tensors are provided, the compiler derives appropriate stream shapes. Alternatively, specify a single stream shape with chip/cluster/slice dimensions, from which per-DMN stream shapes are automatically derived
Example of heterogeneous mapping:
axes![A = 4, B = 256, C = 256, D = 256];
// source: [Chip: [A % 4], Dram: [B % 256 * C % 256 * D % 256]], HBM @ 0
// dest: [Chip: [A % 4], Cluster: [B / 128], Partitioning: [C % 256], InSlice: [B % 128 * D % 256]], DM @ 0
StreamShape for (Chip_i, Cluster_j, Slice_k):
// [A @ i % 1 * (B / 128) @ j % 1 * (C / 64) @ k % 1 * C % 32 * B % 128 * C / 32 % 2 (DMN) * D % 256]
// DMA Sequencer = [C=32:(32 * 256, slice_stride), B=128:(256 * 256, 256),
// C/32=2:(32 * 256, 32) * slice_stride, D=256:(1, 1)],
// base: (Chip, Cluster, Slice, HBM/InSlice) = ((i, i), (j, j), (k, k), (0, 0))
// (slice_stride = 4MB = virtual address space of in_slice DM)
Implementation Details
This section explains how the compiler generates DMA operations and how the hardware executes them.
Compiler generates aggregate operations by default: The compiler treats tensor-to-tensor moves (T → T’) as atomic units and automatically distributes work across available DMNs in parallel, similar to Fetch/Commit Sequencers. Aggregate operations are the primary abstraction programmers interact with, which explains why this documentation emphasizes them rather than single-DMN DMA.
Sequencer representation is compact: Although DMA operations logically use separate read and write sequencers, the compiler represents them efficiently as a single structure. Each entry in this unified sequencer contains shared loop limits but separate strides (one for read, one for write) and separate base addresses (one for source, one for destination). This compact representation exploits the fact that read and write amounts must always match.
DMA Engine assignment is flexible:
- Any DMA Engine among the 8 can handle any transfer, but using the DMN’s own DMA Engine is more efficient (not quantitatively measured).
- The compiler typically uses the source DM DMN’s DMA Engine, but any DMA Engine works.
- The 8 DMA Engines can transfer between different memory components in parallel (e.g., DMA #0: HBM ↔ DM, DMA #1: DM ↔ DM).
- The compiler only allows moving from one tensor (HBM/DM/SPM) to another tensor (HBM/DM/SPM).
- For inter-chip transfers, all chip IDs are globally agreed upon across the system
- Programmers can leave DMA Engine selection unspecified and let the compiler choose, though explicit specification is also supported
SRAM access patterns for optimal bandwidth: SRAM memory bandwidth depends critically on DMN (Data Memory Network) interleaving. For detailed SRAM performance characteristics and interleaving patterns, see the Data Memory section. The key principle: interleave across both DMNs to achieve full 256 B/cycle bandwidth.
Bandwidth trade-offs: DMA provides flexibility for arbitrary tensor moves but may underutilize SRAM slice bandwidth compared to the Tensor Unit. However, HBM bandwidth is often the bottleneck in practice, making this less critical. For SRAM-to-SRAM transfers, the Tensor Unit is often more efficient, except when the Switch Engine operates at size 256 (which may be slower than DMA).
Tensor Memory Mapping
The compiler automatically derives the correspondence between source and destination memory indices from the mapping environments.
Given tensor memory mappings (e, e'), the compiler computes how each flat memory index relates to the logical tensor dimensions:
// S, e' |- i ~ { i_A = (i % 65536) / 256, i_B = i / 65536, i_C = i % 256 }
A simple layout transformation that reorders dimensions:
axes![A = 256, B = 256, C = 256];
// e_1 = A * B * C, e_2 = B * A * C
// DMA: <S, e_1, HBM@0> =id=> <S, e_2, HBM@256^3>
DMA Sequencer Internals
This section explains how DMA sequencers execute at the hardware level.
A DMA Descriptor represents a single execution unit that the hardware can process. Each DMN’s DMA Engine can accept multiple DmaDescriptors, which it executes in sequence (or potentially in parallel when resources permit). The sequencer within each descriptor determines the exact order in which memory addresses are accessed.
Startup overhead detail: As mentioned in the performance considerations above, each DMA Descriptor incurs approximately 500 cycles of initial latency before data transfer begins.
Example sequencer execution:
// DmaSequencer = [A=256:(65536, 256), B=256:(256, 65536), C=256:(1, 1)], base=(0, 256^3)
Reading one data element per cycle, each cycle performs:
| i | ti | read addr | write addr |
|---|---|---|---|
| 0 | { A: 0, B: 0, C: 0 } | 0 | write_base (=256³) |
| 1 | { A: 0, B: 0, C: 1 } | 1 | 1 + write_base |
| … | … | … | … |
| 255 | { A: 0, B: 0, C: 255 } | 255 | 255 + write_base |
| 256 | { A: 0, B: 1, C: 0 } | 256 | 65536 + write_base |
| i = a*256² + b*256 + c | { A: a, B: b, C: c } | i | 256*a + 256²*b + c + write_base |
The DmaSequencer compactly represents this address mapping table.
With stride0 = 256, the hardware reads and writes 256 bytes per cycle: cycle 0 processes all values for (A, B, C) = (0, 0, 0..255) as a single packet.
A complete descriptor example:
// DmaSequencer = [A=256:(65536, 256), B=256:(256, 65536), C=256:(1, 1)],
// base=(0, 256^3), stride0 = 256
// media_source = HBM, media_dest = HBM, DmnIndex = (0, 0, 0)
This descriptor activates the DMA Engine on Chip 0, Cluster 0, DMN 0, moving data from HBM starting at address 0 to HBM starting at address 256³. The transfer completes in approximately 500 cycles (initial latency) + 256 × 256 cycles (data transfer).
How the compiler derives sequencers: Given source and destination tensor shapes along with a stream shape, the compiler derives the DMA sequencer configuration:
// stream_shape = [A * B * C]
// => read_sequencer = [A=256:65536, B=256:256, C=256:1], base=0
// => write_sequencer = [A=256:256, B=256:65536, C=256:1], base=256^3
The derivation process follows these steps:
- The read sequencer is derived by projecting the source tensor mapping onto the stream shape
- The write sequencer is derived by projecting the destination tensor mapping onto the stream shape
- These are combined into a unified DMA sequencer with paired strides and bases
- The packet size (stride0) is inferred from the consecutive read/write volume: if both read and write access 256 consecutive bytes, the optimal stride0 is 256 bytes
When stride0 is not 256-byte aligned, the cycle count formula is ceil(stride0 / 256). However, HBM write operations incur additional penalties beyond the ceil calculation. The unaligned write requires a Read-Modify-Write (RMW) operation for the partial 256-byte block, slowing the operation significantly (see the Misaligned Access section in ./memory-performance.md for details). For HBM read operations, the penalty is limited to the ceil overhead. For SRAM operations, alignment has minimal impact.
Memory Bandwidth Limits
Memory bandwidth limits are crucial for achieving optimal DMA performance. A single DMA Engine can theoretically move up to 256 bytes per clock cycle, but the actual transfer rate is constrained by the slowest component in the data path: the source memory, the destination memory, or the PCIe interconnect for inter-chip transfers.
For detailed characteristics and optimization strategies for each memory type, see:
- Data Memory (DM) performance
- High-Bandwidth Memory (HBM) performance
- Scratchpad Memory (SPM) performance
Key bandwidth constraints:
- HBM: 1.5 TB/s combined read + write per chip (0.75 TB/s read + 0.75 TB/s write)
- DM: 256 B/cycle per cluster with proper DMN interleaving (128 B/cycle per DMN)
- SPM: 128 B/cycle per cluster
- PCIe DMA Engine: 30 bytes/cycle for inter-chip transfers
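As a rough mental model (plain Rust, illustrative only, using the figures listed above), the realized transfer rate is bounded by the slowest component on the path:

```rust
// Illustrative bottleneck model: the realized rate is the minimum over the path.
fn path_bandwidth(src_bytes_per_cycle: u64, dst_bytes_per_cycle: u64, link: Option<u64>) -> u64 {
    let link = link.unwrap_or(u64::MAX); // on-chip transfers have no PCIe hop
    src_bytes_per_cycle.min(dst_bytes_per_cycle).min(link)
}

fn main() {
    // HBM -> DM on the same chip: bounded by the 256 B/cycle engine/DM side.
    assert_eq!(path_bandwidth(256, 256, None), 256);
    // Inter-chip transfer: the 30 B/cycle PCIe link dominates.
    assert_eq!(path_bandwidth(256, 256, Some(30)), 30);
}
```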
Detailed Examples
The following examples illustrate DMA Engine behavior across various configurations, from simple single-engine transfers to complex multi-DMN operations with performance considerations.
Example 1: Single DMA Engine HBM to HBM
This example demonstrates a basic HBM-to-HBM transfer using a single DMA Engine to rearrange tensor dimensions. The operation achieves good performance through effective channel interleaving, distributing memory accesses across different HBM channels to enable parallel processing.
Operation arguments:
axes![A = 8, B = 8, C = 256];
// dtype = i8
// media_in = media_out = HBM
// b_in = { chip: 0, element: 0 }, b_out = { chip: 0, element: 16384 }
// In = { element: m![A, B, C] }
// Out = { element: m![B, A, C] }
// Stream = { time: m![A, B], packet: m![C] }
Generated sequencer configurations:
- Read sequencer configuration: [A=8:2048, B=8:256, C=256:1]:256 HBM/D@0
- Write sequencer configuration: [A=8:256, B=8:2048, C=256:1]:256 HBM/D@16384
Why this achieves good performance: Channel interleaving enables efficient parallel processing. The strides in the non-innermost sequencer entries (256 and 2048) toggle HBM address bits 8 and 11, which correspond to the stack and channel selection bits. This access pattern ensures that every read and write request targets a different HBM channel, with multiple memory operations proceeding in parallel.
Although each 256-byte transfer takes 4 cycles at 0.75GHz clock speed, the parallel distribution across channels enables efficient execution. At 1GHz, the total time is approximately 128 cycles (64 read requests + 64 write requests) plus approximately 500 cycles of initial latency.
| read #i | ti | read addr | write addr |
|---|---|---|---|
| 0 | { A: 0, B: 0, C: 0 } | 0 | write_base(=16384) |
| 1 | { A: 0, B: 0, C: 1 } | 1 | 1 + write_base |
| 2 | { A: 0, B: 0, C: 2 } | 2 | 2 + write_base |
| … | … | … | … |
| 255 | { A: 0, B: 0, C: 255 } | 255 | 255 + write_base |
| 256 | { A: 0, B: 1, C: 0 } | 256 | 2048 + write_base |
| 257 | { A: 0, B: 1, C: 1 } | 257 | 2048 + 1 + write_base |
| … | … | … | … |
| i = a * 2048 + b * 256 + c | { A: a, B: b, C: c } | i | 256 * a + 2048 * b + c + write_base |
| … | … | … | … |
Bandwidth sharing note: HBM bandwidth is 1.5 TB/s (read + write combined), and each DMA Engine has 256 GB/s bandwidth. For DRAM ↔ DRAM operations, read bandwidth is 0.75 TB/s. If 4 DMA Engines perform DRAM ↔ DRAM operations, each gets ~0.1875 TB/s. Even with stride0=256, each engine reads 256B per request but cannot complete one request per cycle due to this bandwidth constraint.
Example 2: Single DMA Engine HBM to DM
This example demonstrates an HBM-to-DM transfer that achieves optimal bandwidth by carefully interleaving both HBM channels and DM DMNs. Both memory systems require specific access patterns to reach their full bandwidth potential.
Operation arguments:
axes![A = 256, B = 256, C = 256];
// dtype = i8
// media_in = HBM
// media_out = DM
// b_in = { chip: 0, element: 0 }
// b_out = { chip: 0, cluster: 0, slice: 0, element: 0 }
// In = { element: m![B, A, C] }
// Out = { slice: m![A / 4], element: m![A % 4, B, C] }
// Stream = { time: m![B, A % 4, A / 4 % 32, A / 128], packet: m![C] }
Generated sequencer configurations:
- Read sequencer configuration: [B=256:65536, A%4=4:256, A/4%32=32:1024, A/128=2:32768, C=256:1]:256 HBM/D@0
- Write sequencer configuration: [B=256:256, A%4=4:65536, A/4%32=32:slice_stride, A/128=2:DMN_stride, C=256:1]:256 DM/D@0
Performance analysis: Both HBM and DM achieve full bandwidth through careful interleaving in their respective access patterns.
HBM side: The stride of 32768 for the A/128=2 loop interleaves memory accesses effectively.
For the innermost 2 iterations, this interleaves at the byte level; for outer iterations, it interleaves across HBM channels.
The hardware command queue processes all 65536 requests (256 * 4 * 32 * 2) efficiently, utilizing full HBM bandwidth.
DM side: DMN and slice interleaving work together to maximize throughput. Each of the two DMNs provides 128 bytes/cycle bandwidth, so a 256-byte write normally requires 2 cycles on a single DMN. However, interleaving consecutive requests across both DMNs (achieved through the DMN_stride and slice_stride) enables the two DMNs to operate in parallel, processing one 256-byte request per cycle. All 65536 write requests therefore complete at one request per cycle.
Total execution time: Approximately 65536 cycles (read) + 65536 cycles (write) + 500 cycles (initial latency). Since reads and writes overlap in the pipeline, the actual time is closer to max(65536, 65536) + 500 ≈ 66036 cycles.
Example 3: Single DMA Engine DM to DM
This example shows a DM-to-DM transfer within a single cluster, where both reads and writes access the same DM. This scenario requires careful DMN interleaving for both operations to avoid contention and achieve maximum bandwidth.
Operation arguments:
axes![A = 256, B = 256, C = 256];
// dtype = i8
// media_in = DM
// media_out = DM
// b_in = { chip: 0, cluster: 0, slice: 0, element: 0 }
// b_out = { chip: 0, cluster: 0, slice: 0, element: 4 * 256 * 256 }
// In = { slice: m![A / 4], element: m![A % 4, B, C] }
// Out = { slice: m![A / 4], element: m![B, A % 4, C] }
// Stream = { time: m![B, A % 4, A / 4 % 32, A / 128], packet: m![C] }
Generated sequencer configurations:
- Read sequencer configuration: [B=256:256, A%4=4:65536, A/4%32=32:slice_stride, A/128=2:DMN_stride, C=256:1]:256 DM/D@0
- Write sequencer configuration: [B=256:1024, A%4=4:256, A/4%32=32:slice_stride, A/128=2:DMN_stride, C=256:1]:256 DM/D@(4 * 256 * 256)
Performance analysis: DMN and slice interleaving enable full bandwidth for both read and write operations. Each 256-byte access is structured to interleave across the two DMNs, while the outer loops interleave across different DM slices. Each DMN provides 128 bytes/cycle bandwidth, so a single 256-byte access normally requires 2 cycles on one DMN. However, alternating requests between both DMNs enables parallel operation to achieve full 256 B/cycle bandwidth.
Request execution:
- Total read requests: 65536 (256 * 4 * 32 * 2)
- Total write requests: 65536
- At saturation with proper interleaving, one request completes per cycle
Total execution time: Approximately 131072 cycles (since reads and writes must proceed sequentially for DM-to-DM within the same cluster) + 500 cycles (initial latency).
Note on packet size alignment: The choice of C=256 is important for performance. If C were between 1-255, the cycle count remains similar because the number of DMA requests determines execution time. However, if the packet size is 256n+r (where 0 ≤ r < 256), the cycle count increases by a factor of (n+1) due to more requests. Aligning packet sizes to 256-byte boundaries maximizes data transferred per request.
Example 4: Homogeneous DMA Engine, HBM to DM (Pathological: Bank Conflict)
This example demonstrates performance degradation from poorly designed memory access patterns: severe HBM bank conflicts. The issue arises when the stream shape causes consecutive accesses to trigger row switches within HBM banks, preventing efficient parallel execution and resulting in approximately 10x slower performance compared to well-optimized access patterns.
Operation arguments:
// 1 chip (8 DMNs): chip-related mapping is not needed
axes![A = 64, B = 2048, C = 1024];
// dtype = i8
// media_in = HBM
// media_out = DM
// b_in = 0
// b_out = 0
// In = { cluster: m![B / 1024], slice: m![B / 256 % 4, A], element: m![B % 256, C] }
// Out = { slice: m![A / 4], element: m![B, A % 4, C] }
// Stream = { cluster: m![B / 1024], slice: m![B / 256 % 4],
// time: m![B % 256, C / 256, A % 32, A / 32], packet: m![C % 256] }
Generated sequencer configurations:
- Read sequencer configuration at (cluster_i, dmn_j): [B%256=256:1024, C/256=4:256, A%32=32:2^21, A/32=2:2^26, C%256=256:1]:256 HBM/D@(i * (1024 * 1024) + j * (256 * 1024))
  - The base address offset i * (1024 * 1024) + j * (256 * 1024) is derived from the DMN location (B/1024, B/256%4) = (i, j)
- Write sequencer configuration at (cluster_i, dmn_j): [B%256=256:1024, C/256=4:256, A%32=32:slice_stride, A/32=2:DMN_stride, C%256=256:1]:256 DM/D@(cluster_i, dmn_j, 0)
Why this performs poorly: row-level bank conflicts The stream shape structure optimized for DM’s DMN/slice interleaving creates a pathological access pattern for HBM. The innermost interleaving dimensions (A%32 and A/32) correspond to HBM address bits 21 and 26, which control row addressing within banks. Consecutive memory accesses trigger row switches within the same bank on nearly every request.
Channel interleaving still occurs (the C dimension’s stride of 256 enables stack interleaving across all 32 channels), but this parallelism cannot compensate for the row conflict penalty within each channel. Each access within a channel must wait for the previous row to close and the new row to open, dramatically increasing latency.
Performance breakdown:
HBM reads (the bottleneck):
- Per DMN: 65536 data requests (256 * 4 * 32 * 2)
- Across 8 DMNs: 524288 total requests, distributed evenly across 32 HBM channels
- Each channel handles: 16384 requests
- Each request incurs approximately 40 cycles due to bank conflicts (a conservative estimate; actual penalty depends on tCCD and FR-FCFS scheduling)
- Total HBM time: approximately 655360 cycles (16384 * 40)
DM writes (not the bottleneck):
- DMN interleaving works correctly, achieving full 256 B/cycle bandwidth
- 65536 requests per DMN, processing at one request per cycle
Total execution time: Approximately 655360 cycles + 500 cycles (initial latency) ≈ 655860 cycles.
Critical lesson: Careful access pattern design is essential for performance. Avoid bank conflicts through proper stream shape construction. Note that this estimate is conservative; actual performance may be somewhat better due to FR-FCFS (First Ready-First Come First Served) memory scheduling, which can mitigate some conflicts, but the fundamental problem remains severe.
Example 5: Homogeneous DMA Engine HBM to DM (Pathological: Missing Stack Interleaving)
This example demonstrates another common pitfall: failing to interleave across HBM’s stack dimension (address bit 8). When this bit is not toggled by the access pattern, only 16 of the 32 available HBM channels are utilized, cutting effective bandwidth in half.
Operation arguments:
// 1 chip (8 DMNs)
axes![A = 8, B = 64, C = 8, D = 512];
// dtype = i8
// media_in = HBM
// media_out = DM
// b_in = 0
// b_out = 0
// In = { element: m![A, B, C, D] }
// Out = { cluster: m![A / 4], slice: m![A % 4, B], element: m![C, D % 256] }
// Stream = { cluster: m![A / 4], slice: m![A % 4], time: m![C, B % 32, B / 32], packet: m![D % 256] }
Generated sequencer configurations:
- Read sequencer configuration at (cluster_i, dmn_j): [C=8:512, B%32=32:4096, B/32=2:131072, D%256=256:1]:256 HBM/D@(i * 2^20 + j * 2^18)
  - The base address offset i * 2^20 + j * 2^18 is derived from the DMN location (A/4, A%4) = (i, j)
- Write sequencer configuration at (cluster_i, dmn_j): [C=8:256, B%32=32:slice_stride, B/32=2:DMN_stride, D%256=256:1]:256 DM/D@(cluster_i, dmn_j, 0)
Why this performs poorly: missing stack bit interleaving The stream shape does not exercise HBM address bit 8, which controls the stack dimension. In the HBM access pattern, the C axis has a stride of 512, so bit 8 is never toggled during the innermost loops. This occurs in operations like tensor splits where dimension structure changes between input and output (notice that the input tensor mapping includes D/256 but the output/stream does not).
HBM channel selection uses address bits 9-28, while the stack bit is bit 8. Without bit 8 interleaving, memory requests distribute across only 16 of the 32 available channels, immediately halving achievable bandwidth.
Performance breakdown:
HBM reads (the bottleneck):
- Per DMN: 512 data requests (8 * 32 * 2)
- Across 8 DMNs: 4096 total requests, distributed across only 16 channels
- Each channel handles: 256 requests
- Each channel’s bandwidth: 256B per 4 cycles at 0.75GHz, or approximately 5.3 cycles per request at 1GHz
- Total HBM time: approximately 1357 cycles (256 * 5.3)
DM writes (not the bottleneck):
- DMN interleaving achieves full 256 B/cycle bandwidth
- 512 requests per DMN (8 * 32 * 2), processing at one request per cycle
- DM writes overlap with HBM reads in the pipeline, so their latency is hidden
Total execution time: Approximately 1357 cycles + 500 cycles (initial latency) ≈ 1857 cycles.
Critical lesson: Achieving full HBM bandwidth (1.5TB/s) and DMA Engine bandwidth (2TB/s) requires memory access patterns that interleave across all 32 channels by toggling all relevant address bits including the stack bit (bit 8). Missing even one dimension of interleaving significantly degrades performance.
Example 6: Heterogeneous DMA Engine with Segmentation
This example demonstrates a heterogeneous DMA operation where the tensor shape does not divide evenly across all DMNs. Some DMNs must use different stream environments than others, and in extreme cases, a DMN may need to segment its work into multiple DMA commands to avoid writing to incorrect memory locations. This illustrates both the flexibility and complexity of heterogeneous DMA operations.
Operation arguments:
// 4 chips
axes![A = 15, B = 32, C = 256, D = 8];
// dtype = i8
// media_in = DM
// media_out = HBM
// b_in = 0
// b_out = 0
// In = let A' = A + 1# in
// { chip: m![D / 2], cluster: m![D % 2], slice: m![A' / 4, A' / 2 % 2, B],
// element: m![A' % 2, C] }
// Out = { chip: m![D / 2], element: m![D % 2, B, A, C] }
// StreamFn(chip_i, cluster_j, slice_k) = let A' = A + 1# in
// { chip: m![(D / 2) @ i = 1],
// cluster: m![(D % 2) @ j = 1],
// slice: m![(A' / 4) @ k = 1],
// time: (k == 0,1,2): m![A' % 2, B, A' / 2 % 2, C]
// (k == 3, exec #0): m![A' % 2, B, A' / 2 = 1, C]
// (k == 3, exec #1): m![A' = 1, B, A' / 2 % 2 @ 1, C],
// packet: m![C] }
The compiler generates the following sequencer configurations:
- Read sequencer configuration at (chip_i, cluster_j, dmn_k):
  - k = 0, 1, 2: [A'%2=2:256, B=32:slice_stride, A'/2%2=2:DMN_stride, C=256:1]:256 DM/D@(chip_i, cluster_j, dmn_k, 0)
  - k = 3:
    - execution #0: [A'%2=2:256, B=32:slice_stride, A'/2=1:DMN_stride, C=256:1]:256 DM/D@(chip_i, cluster_j, dmn_3, 0)
    - execution #1: [A'%2=2:256, B=32:slice_stride, A'/2=1:DMN_stride, C=256:1]:256 DM/D@(chip_i, cluster_j, dmn_3, 0)
- Write sequencer configuration at (chip_i, cluster_j, dmn_k):
  - k = 0, 1, 2: [A'%2=2:256, B=32:15 * 256, A'/2%2=2:512, C=256:1]:256 HBM/D@(0 + i * 2 * (15 * 32 * 256) + j * (15 * 32 * 256) + k * (4 * 256))
  - k = 3:
    - execution #0: [A'%2=2:256, B=32:15 * 256, A'/2=1:512, C=256:1]:256 HBM/D@(0 + i * 2 * (15 * 32 * 256) + j * (15 * 32 * 256) + 3 * (4 * 256))
    - execution #1: [A'%2=1:256, B=32:15 * 256, A'/2=1:512, C=256:1]:256 HBM/D@(0 + i * 2 * (15 * 32 * 256) + j * (15 * 32 * 256) + 3 * (4 * 256) + 512)
      - 512: offset by A'/2%2@1
Why DMN #3 requires segmentation: The tensor dimension A=15 does not divide evenly across 4 DMNs (15 = 3*4 + 3). DMNs #0, #1, and #2 each process exactly 4 elements of the A dimension. DMN #3 must process the remaining 3 elements (A=12, 13, 14) but its sequencer would naturally try to process 4 elements. If DMN #3 used the same single-command pattern as the other DMNs, it would write one extra element, corrupting memory in the region designated for B * (A + 1#) * C.
The compiler segments DMN #3’s work into two commands to avoid this:
- Execution #0 handles part of the valid range
- Execution #1 handles the remainder, ensuring the total is exactly 3 elements rather than 4
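The uneven split that forces this segmentation can be sketched directly; elements_for_dmn is a hypothetical helper, not part of the compiler.

```rust
// Illustrative work split for A = 15 across 4 DMNs (per the segmentation discussion above).
fn elements_for_dmn(total: usize, dmns: usize, dmn: usize) -> usize {
    let per_dmn = total.div_ceil(dmns);               // 15 elements -> nominally 4 per DMN
    total.saturating_sub(dmn * per_dmn).min(per_dmn)  // the last DMN gets only the remainder
}

fn main() {
    let split: Vec<usize> = (0..4).map(|k| elements_for_dmn(15, 4, k)).collect();
    // DMNs #0-#2 process 4 elements each; DMN #3 processes only 3, so its sequencer
    // must be segmented into two commands rather than naturally iterating 4 times.
    assert_eq!(split, vec![4, 4, 4, 3]);
}
```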
Performance comparison:
DMNs #0, #1, #2 (single command each):
- DM reads: 128 cycles for 2 * 32 * 2 packets of 256B each
- HBM writes: 128 cycles with proper channel interleaving
- Total: approximately 256 cycles (reads and writes overlap) + 500 cycles (initial latency) = 756 cycles
DMN #3 (two commands):
- Execution #0: 64 DM read cycles + 64 HBM write cycles + 500 cycles initial latency
- Note: reads from one DMN only, but slice interleaving still applies
- Execution #1: 32 DM read cycles + 32 HBM write cycles + 500 cycles initial latency
- Total: 192 data cycles + 1000 cycles (initial latency for two commands) = 1192 cycles
Overall execution time: The heterogeneous operation completes when the slowest DMN finishes. DMN #3 determines the total time: approximately 1192 cycles.
Key insight: Command segmentation incurs additional startup overhead (500 cycles per command). Choose tensor shapes that divide evenly across DMNs when possible, avoiding the need for heterogeneous stream environments and command segmentation.
Performance
DMA Engine performance depends on memory types, access patterns, and parallelism strategies.
Memory-Specific Bandwidth
Transfer bandwidth varies by memory type and configuration:
Data Memory (DM/SRAM):
- Peak bandwidth: 256 B/cycle (with proper DMN interleaving)
- Requires interleaving across both DMNs (128 B/cycle each)
- Bank conflicts and starvation can severely degrade performance
- See Memory Performance for DM optimization details
High-Bandwidth Memory (HBM):
- Peak bandwidth: 1.5 TB/s per chip (48 GB/s per channel × 32 channels)
- Channel interleaving is essential for high bandwidth
- Misaligned access and bank conflicts cause severe degradation
- See HBM Performance for optimization strategies
Scratchpad Memory (SPM):
- Bandwidth: 128 B/cycle per cluster
- Restricted to same-chip transfers
Startup Latency
Each DMA command incurs approximately 500 cycles of startup latency before data transfer begins. This fixed cost is amortized over large transfers but becomes significant for small tensors.
Command segmentation (as shown in Example 6) doubles startup latency by requiring two separate commands, emphasizing the importance of tensor shapes that divide evenly across DMNs.
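A back-of-the-envelope model of this amortization, using the approximate figures above; transfer_cycles is a hypothetical helper, not an SDK API.

```rust
// Rough amortization model; the 500-cycle startup and 256 B/cycle figures come from the text above.
fn transfer_cycles(total_bytes: u64, commands: u64) -> u64 {
    const STARTUP_CYCLES: u64 = 500;   // approximate per-command startup latency
    const BYTES_PER_CYCLE: u64 = 256;  // peak single-engine transfer rate
    commands * STARTUP_CYCLES + total_bytes.div_ceil(BYTES_PER_CYCLE)
}

fn main() {
    let total = 1 << 20; // 1 MiB
    // One large command amortizes the startup cost over the whole transfer...
    assert_eq!(transfer_cycles(total, 1), 500 + 4_096);
    // ...while 64 small commands pay the startup cost 64 times.
    assert_eq!(transfer_cycles(total, 64), 32_000 + 4_096);
}
```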
Parallelism Strategies
Multiple DMA Engines can operate simultaneously:
- 8 DMA Engines per chip (one per pair of DMNs, 8 DMNs per cluster)
- Parallel DMA operations on independent data enable high aggregate bandwidth
- Local DMN memory access is faster than cross-DMN access (not quantitatively measured)
Alignment Constraints
Strict alignment requirements affect performance:
- DM writes: 8-byte alignment required for addresses and packet sizes
- HBM operations: 1-byte alignment for reads/writes, but HBM-to-DM transfers require 8-byte alignment
- Maximum packet size: 4096 bytes (AXI protocol constraint)
- Misaligned access in HBM can halve bandwidth or trigger expensive Read-Modify-Write operations
Bank Starvation Prevention
DMA Engine shares DM bank access with Fetch and Commit Engines. DMA has the lowest priority among these engines, making it vulnerable to bank starvation. If a DMA request blocks for more than 4,096 cycles, a NoC timeout occurs, requiring a hardware reset.
The compiler prevents this by ensuring operations with 64+ consecutive same-bank accesses are not scheduled concurrently with DMA. See Bank Starvation for details.
Inter-Chip Transfers
PCIe-based inter-chip transfers have limited bandwidth:
- 30 B/cycle for both reads and writes
- Significantly slower than on-chip transfers
- Consider minimizing cross-chip data movement in algorithm design
Memory Performance
Memory performance fundamentally determines TCP program efficiency. This page is the primary actionable reference for kernel writers: it documents hardware specifications, explains why each constraint exists, and maps API choices to their performance consequences. It covers DM first (the tier kernel writers interact with most), then SPM and HBM.
In practice, most performance problems trace back to two root causes, each with multiple specific manifestations:
- Unnecessary hops: routing data through an intermediate memory tier (e.g., through SPM when a direct HBM→DM transfer suffices) adds latency and bandwidth pressure.
- Low throughput: a Packet that is smaller than necessary or non-contiguous in memory causes more sequencer iterations and strided access patterns. The following table details the hardware constraints that determine how Packet choices affect each memory type:
| Memory | Issue | Rule | Penalty |
|---|---|---|---|
| DM | Bank starvation | < 64 consecutive same-bank accesses | NoC timeout → hardware reset |
| DM | DMN interleaving | Alternate across 2 DMNs per cluster | 50% bandwidth loss |
| DM | Slice interleaving | Spread across 32 slices per DMN | Command queue contention |
| HBM | Alignment | 256-byte aligned access | Unaligned read: 2× penalty; unaligned write: ~50× penalty (RMW) |
| HBM | Bank conflicts | Avoid row switches within same bank | 30–40× degradation |
| HBM | Channel interleaving | Spread across 32 channels | Reduced parallelism |
Data Memory (DM)
Data Memory is the primary SRAM for tensor computations. A single RNGD chip contains 256MB of DM, structured to maximize parallel access and bandwidth. The following table summarizes the DM geometry:
| Unit | Count |
|---|---|
| Clusters | 2 / Chip |
| Data Memory Networks (DMNs) | 8 / Cluster |
| Slices | 32 / DMN |
| Banks | 16 / Slice |
| Rows | 4096 / Bank |
| Bytes | 8 / Row |
The SRAM hierarchy consists of clusters, DMNs, and slices. A single chip contains two clusters, each with eight Data Memory Networks. Each DMN contains 32 slices, totaling 256 slices per cluster. Clusters can exchange data through the Switch Engine; see the dedicated section for details.
Address Space in a Slice
Each slice provides 512KB of SRAM with a dedicated address space. The memory is organized into 16 parallel banks, each with an 8-byte data width, enabling a total data access rate of 128 B/cycle. Access to any individual bank is serialized, but the address space distributes 128 consecutive bytes across all 16 banks (8 bytes per bank) for parallel access. The following bit mapping defines this distribution:
| Bit # | Component |
|---|---|
| 0–2 | Byte |
| 3–6 | Bank |
| 7–18 | Row |
This bit mapping optimizes various access patterns, particularly during sequential access. Distributing consecutive bytes across banks enables parallel access and maximizes bandwidth utilization.
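To make the bit mapping concrete, a small decoding helper can recover the row, bank, and byte from a slice-local address; decode_slice_address is a hypothetical name used only for illustration, not an SDK function.

```rust
// Illustrative decode of a slice-local DM address into (row, bank, byte) per the bit mapping above.
fn decode_slice_address(addr: u32) -> (u32, u32, u32) {
    let byte = addr & 0x7;          // bits 0-2: byte within the 8-byte bank row
    let bank = (addr >> 3) & 0xF;   // bits 3-6: one of 16 banks
    let row = (addr >> 7) & 0xFFF;  // bits 7-18: one of 4096 rows
    (row, bank, byte)
}

fn main() {
    // 128 consecutive bytes are striped across all 16 banks, 8 bytes per bank.
    assert_eq!(decode_slice_address(0), (0, 0, 0));
    assert_eq!(decode_slice_address(8), (0, 1, 0));    // the next 8 bytes land in the next bank
    assert_eq!(decode_slice_address(127), (0, 15, 7)); // last byte of the 128-byte stripe
    assert_eq!(decode_slice_address(128), (1, 0, 0));  // then the row advances
}
```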
Optimizing DMA Performance
Achieving full DMA bandwidth requires following three guidelines: interleaving across DMNs, interleaving across slices, and preventing bank starvation.
1. Interleave across DMNs.
DMN interleaving is essential because each DMN provides only 128 B/cycle bandwidth. Since the standard 256-byte transfer unit requires two cycles per DMN, you should pipeline accesses across both DMNs to maintain continuous throughput:
| cycle | DMN #0 | DMN #1 |
|---|---|---|
| 0 | read #0 (1/2) | (idle) |
| 1 | read #0 (2/2) | read #1 (1/2) |
| 2 | read #2 (1/2) | read #1 (2/2) |
| 3 | read #2 (2/2) | read #3 (1/2) |
| … | … | … |
| 2n-1 | read #2n-2 (2/2) | read #2n-1 (1/2) |
| 2n | (idle) | read #2n-1 (2/2) |
2. Interleave across slices.
Slice interleaving improves efficiency by distributing DMA requests across the 32 slices within each DMN. Slices are shared resources used by the DMA, Fetch, and Commit Engines, so spreading requests helps manage contention. Each slice has two command queue entries to hold pending DMA requests.
3. Prevent bank starvation. Unlike the first two issues, bank starvation may force a complete chip reset, not just a slowdown.
Bank Starvation
The key constraint is the 64-access rule: Fetch and Commit engines must not access the same DM bank for more than 64 consecutive operations while DMA is active. Violating this causes a NoC timeout and full cluster reset.
The fundamental issue is priority inversion in a shared resource system: a low-priority requester is indefinitely blocked by high-priority ones accessing the same resource. The DM controller prioritizes requests in this order:
- Main-context Fetch Engine
- Main-context Commit Engine
- Sub-context Fetch Engine
- Sub-context Commit Engine
- DMA Engine
DMA has the lowest priority among all memory engines, which makes sense during normal operation since computation engines should get first access to data. However, this creates a dangerous scenario when high-priority engines continuously access the same bank: the DMA Engine’s request sits in the queue, unable to make progress, while the higher-priority engines monopolize that bank. After 4,096 cycles without a response, the NoC (Network on Chip) protocol declares the transaction dead and enters an exception state. This 4,096-cycle limit exists as a safety mechanism to detect deadlocks and hung transactions in the NoC protocol; without this timeout, a stuck transaction could hang the entire system indefinitely. When the timeout triggers, the hardware lacks a graceful recovery mechanism, and the only recovery is a full cluster domain reset, losing all computation state and requiring complete reinitialization.
The 64-access rule prevents this catastrophe: the Fetch and Commit Engines must not access the same bank for more than 64 consecutive operations while DMA is active.
Why 64? The limit follows from NoC bandwidth: (256 B/cycle DMA ÷ 128 B/cycle per DMN) × max consecutive accesses × 32 slices must stay below the 4,096-cycle timeout, which yields a maximum of fewer than 64 consecutive accesses. This ensures DMA requests complete before the timeout even in the worst case.
For example, suppose the DMA Engine issues a request to bank 0 (along with 15 other banks), but the main-context’s Fetch Engine continuously requests bank 0. The DMA request stalls, and if this exceeds 4,096 cycles, a NoC timeout forces a hardware reset.
Compiler scheduling behavior: When Tensor Unit operations would violate the 64-access limit, the compiler schedules them as if they occupy DMA, preventing concurrent DMA operations. This sacrifices the TCP architecture’s inherent main/sub/DMA context parallelism where data preparation and computation occur in parallel, but avoids catastrophic hardware resets. Treat this as a hard constraint: never use patterns with 64+ consecutive same-bank accesses.
Main-context starving sub-context is less severe because it does not trigger NoC timeouts and only increases processing time. Additionally, the Tensor Unit’s internal pipeline naturally generates back-pressure between Fetch and Commit Engines, preventing internal starvation.
Scheduling model: The scheduler uses context occupancy information: if operation A occupies a context (e.g., main context), the next operation B using that context waits until A completes. Understanding which contexts operations occupy enables predicting parallel execution. A scheduling visualization utility would help verify actual schedules.
The 64-access limit details:
- The limit is cumulative across all concurrent commands: if the total number of consecutive accesses from all commands (main-context fetch/commit + sub-context fetch/commit + DMA) to the same bank reaches 64 or more, DMA starvation occurs.
- Even if individual commands interleave accesses to the same bank, their combined access count still accumulates toward the 64-access limit, which can cause DMA Engine starvation.
- The compiler controls only single commands accessing the same bank consecutively; multiple commands interleaving the same bank are not controlled.
- In practice, sub-context rarely accesses the same bank consecutively (`StoTrf` and `StoVrf` operations typically use sequential addresses, and tiling prevents same-bank access).
- Sub-context operations that would exceed the limit are also not scheduled concurrently with DMA.
Note
Cumulative Bank Access Constraint
Even if each individual command accesses a bank fewer than 64 times, the TOTAL across all concurrent main/sub/DMA commands to the SAME bank must be less than 64.
The compiler prevents individual commands from exceeding this limit, but it cannot prevent accumulation from multiple concurrent operations. For example:
- Main-context Fetch: 30 consecutive bank accesses
- Sub-context Fetch: 20 consecutive bank accesses
- DMA: 1 concurrent request to the same bank
- Total: 51 accesses → Safe (below 64)
But if either the main-context or the sub-context count grows enough that the combined total reaches 64, starvation is triggered. This is why the compiler sacrifices main/sub/DMA parallelism (scheduling them sequentially instead) when the cumulative access count to the same bank would reach 64.
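A minimal sketch of the cumulative rule, assuming the per-engine access counts are known ahead of time; the helper is hypothetical and not an SDK API:
const MAX_CONSECUTIVE_SAME_BANK: u32 = 64;

// Sums the consecutive same-bank access counts of all engines that may run
// concurrently with DMA and checks them against the limit.
fn same_bank_accesses_safe(main_fetch: u32, main_commit: u32,
                           sub_fetch: u32, sub_commit: u32,
                           dma: u32) -> bool {
    main_fetch + main_commit + sub_fetch + sub_commit + dma
        < MAX_CONSECUTIVE_SAME_BANK
}

fn main() {
    // The example from the note: 30 + 20 + 1 = 51 → safe.
    assert!(same_bank_accesses_safe(30, 0, 20, 0, 1));
    // Growing either count until the total reaches 64 violates the rule.
    assert!(!same_bank_accesses_safe(43, 0, 20, 0, 1));
}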
Main/sub-context contention: Main-context can starve sub-context, but this is less severe:
- Unlike DMA starvation, sub-context starvation does not cause NoC timeout or hardware reset and only increases processing time.
- Collision probability is lower: DMA Engine occupies 16 banks at once, while sub fetch/commit engines occupy only one bank.
- Starvation does not occur between fetch and commit engines within the same context due to pipeline back-pressure.
Performance impact example: If main-context exec command continuously accesses a specific bank while sub-context stos command is scheduled, sub-context processing is delayed. Worst case: total time = main-context time + sub-context time. Ideal case: main and sub access different banks, achieving total time = max(main-context time, sub-context time).
Technical Details: Banks and Command Queues
Bank access: At 128 B/cycle DMN access, 16 banks are accessed simultaneously. Banks are shared resources among Fetch Engine, Commit Engine, and DMA Engine. Access to any individual bank is serialized.
DMN bandwidth: Within the Data Memory Network, Data Memory Slices share data paths, so DMA Engine transfers achieve 128 B/cycle per DMN.
Command queues: Each Data Memory Slice has a 2-entry command queue for pending DMA read/write requests. Since this is limited, spreading DMA requests across multiple slices is ideal: distributing across M slices reduces required throughput to 1/M even if request processing slows due to priority. DMN interleaving every n cycles achieves saturated 256 B/cycle.
Note
While command queues theoretically allow some burst access without interleaving, we strongly recommend always interleaving across DMNs when generating DMA streams, as this is the most natural approach.
The 4096-cycle limit derivation: The formula is: (TDMA_IO_BYTE / DMN_IO_BYTE) * Max_Consecutive_Access * DMN_SIZE < 4096.
With TDMA_IO_BYTE=256, DMN_IO_BYTE=128, DMN_SIZE=32, this yields Max_Consecutive_Access < 64.
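The same derivation written as ordinary Rust constants (illustrative only, not SDK code):
const TDMA_IO_BYTE: u64 = 256;      // DMA engine bandwidth per cycle
const DMN_IO_BYTE: u64 = 128;       // DMN bandwidth per cycle
const DMN_SIZE: u64 = 32;           // slices per DMN
const NOC_TIMEOUT_CYCLES: u64 = 4096;

// Largest consecutive same-bank access count that still satisfies
// (TDMA_IO_BYTE / DMN_IO_BYTE) * max * DMN_SIZE < NOC_TIMEOUT_CYCLES.
const MAX_CONSECUTIVE_ACCESS: u64 =
    NOC_TIMEOUT_CYCLES / ((TDMA_IO_BYTE / DMN_IO_BYTE) * DMN_SIZE) - 1;

fn main() {
    assert_eq!(MAX_CONSECUTIVE_ACCESS, 63); // i.e., strictly fewer than 64
}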
DMN NoC architecture: Tensor DMA connects to DRAM and DMN through a NoC acting as a hub. Each port (DMA port, DRAM port, DMN ports) receives requests and must send responses. Transactions are considered hung if response takes more than 4096 cycles after request. When a DMN port request doesn’t receive a response within 4096 cycles, the NoC treats it as an error and enters an exception state, requiring a cluster domain reset.
Data Memory Network topology: Data Memory Routers are DMN components connected in a ring topology forming the Data Memory Network. The path is: slice0_in → slice31_out, slice32_in → slice63_out.
Scratchpad Memory (SPM)
Note
This section is a work in progress; hardware-specific details (capacity, addressing, bank structure) are pending.
Scratchpad Memory provides additional fast storage within each DMN for temporary data and intermediate results. Each DMN contains SPM with a bandwidth of 128 B/cycle, offering high-speed access for frequently reused values such as constants, lookup tables, or small working sets that don’t require the full capacity of SRAM.
SPM serves as a middle tier in the memory hierarchy between the ultra-fast VRF (Vector Register File) and the larger SRAM. Its primary use cases include storing scalar constants, small weight matrices, activation function lookup tables, and configuration data that needs rapid access without consuming scarce VRF capacity. The compiler automatically selects SPM for data that exhibits high temporal locality but modest capacity requirements.
The key distinction from SRAM is explicit software management: the compiler explicitly allocates data to SPM when beneficial, whereas SRAM allocation follows more general-purpose policies. SPM’s 128 B/cycle bandwidth per DMN enables high-throughput access for small tensors, and because each DMN has dedicated SPM, there are no inter-DMN contention issues. SPM is particularly valuable for per-DMN state that would otherwise require repeated SRAM fetches.
High-Bandwidth Memory (HBM)
HBM provides high-capacity off-chip storage with substantial bandwidth for large tensor operations. A single RNGD chip contains 48GB of HBM. The following table summarizes the HBM geometry:
| Unit | Count |
|---|---|
| Stacks | 2 / Chip |
| Channels | 16 / Stack |
| Slices | 3 / Channel |
| Bank Groups | 4 / Slice |
| Banks | 4 / Bank Group |
| Rows | 16K / Bank |
| Bytes | 2K / Row |
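As with DM, the geometry multiplies out to the stated capacity; the following plain-Rust check is for illustration only:
fn main() {
    const STACKS: u64 = 2;
    const CHANNELS_PER_STACK: u64 = 16;
    const SLICES_PER_CHANNEL: u64 = 3;
    const BANK_GROUPS_PER_SLICE: u64 = 4;
    const BANKS_PER_GROUP: u64 = 4;
    const ROWS_PER_BANK: u64 = 16 * 1024;
    const BYTES_PER_ROW: u64 = 2 * 1024;

    let chip_bytes = STACKS * CHANNELS_PER_STACK * SLICES_PER_CHANNEL
        * BANK_GROUPS_PER_SLICE * BANKS_PER_GROUP * ROWS_PER_BANK * BYTES_PER_ROW;

    assert_eq!(chip_bytes, 48 * 1024 * 1024 * 1024); // 48GB per chip
}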
Address Space in a Chip
The HBM address space uses a non-linear bit mapping optimized for parallel sequential access. This design maximizes parallelism and minimizes overhead rather than directly mapping to physical geometry:
| Bit # | Main Component | Additional Components |
|---|---|---|
| 0–7 | Byte | |
| 8 | Stack | |
| 9–12 | Channel | |
| 13 | Bank Group | Channel |
| 14–16 | Byte | Channel |
| 17–18 | Bank | Channel |
| 19 | Bank Group | Channel |
| 20 | Slice | Channel |
| 21–33 | Row | Channel (21–28) |
| 34 | Slice | Row |
| 35 | Row | |
The bit assignment for each component corresponds to the physical memory geometry. For instance, the byte component occupies 11 bits (bits 0-7, 14-16) to represent 2K (2^11) bytes per row. Three exceptions exist:
- Slice representation: Two bits (20 and 34) represent slice, even though there are only three slices.
- Contiguous address space: Bit 34 is influenced by the row component to ensure bits 34 and 35 are never both 1, guaranteeing a contiguous 48GB address space.
- Channel XOR mapping: The channel component equals the XOR of bits 9-12 and 13-28 (e.g., the channel’s first bit equals the XOR of bits 9, 13, 21, and 25).
This unconventional bit order enhances performance by enabling parallelism across different memory resources.
Peak Bandwidth
Peak HBM bandwidth reaches 1.5TB/s per chip through parallel operation of stacks and channels. The channel controller transfers 64B/cycle at 0.75GHz,[^1] yielding 48GB/s per channel (0.75GHz x 64B/cycle) or 1.5TB/s per chip (48GB/s x 32 channels). The fundamental transfer unit is 256 bytes, requiring 4 clock cycles per channel. Saturating a single DMA Engine (256GB/s capacity) requires interleaving accesses across multiple channels.
Achieving peak bandwidth requires careful attention to access patterns. Channel throughput is highly sensitive to misalignment, bank conflicts, and resource sharing. Each channel controller has a 64-entry command queue that interleaves accesses to minimize penalties, but pathological cases can still cause severe degradation. The following sections describe causes of performance degradation and how to avoid them.
Misaligned Access
Misaligned access significantly degrades HBM performance. Bits 0-7 (the eight LSBs) represent the 256-byte minimum access unit within a memory row. Accessing data that crosses this boundary incurs substantial penalties.
Unaligned Read: Read requests crossing a 256-byte boundary require two NoC transfers, effectively halving bandwidth.
Unaligned or Partial Write: DMA packets are internally segmented into 256-byte transactions. When a packet’s size is not 256-byte aligned (e.g., a 2,800-byte packet splits into ten 256-byte requests plus one 240-byte request), the final “leftover” transaction requires a Read-Modify-Write (RMW) operation. RMW reads the entire 256-byte unit, updates the requested bytes, then writes the entire unit back. RMW can slow writes by roughly 50× compared to aligned writes.
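A small sketch (plain Rust, not an SDK API; the helper name is ours) of how a write packet splits into 256-byte transactions and when a leftover RMW transaction appears:
const HBM_ACCESS_UNIT: usize = 256;

// Returns (number of full 256-byte transactions, leftover bytes).
// A non-zero leftover means the final transaction needs Read-Modify-Write.
fn split_write(packet_bytes: usize) -> (usize, usize) {
    let full = packet_bytes / HBM_ACCESS_UNIT;
    let leftover = packet_bytes % HBM_ACCESS_UNIT;
    (full, leftover)
}

fn main() {
    // The 2,800-byte example from the text: ten full transactions plus a
    // 240-byte leftover that triggers RMW.
    assert_eq!(split_write(2_800), (10, 240));
    // A 256-byte-aligned packet avoids RMW entirely.
    assert_eq!(split_write(2_816), (11, 0));
}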
Bank Conflict
Bank conflicts cause severe performance degradation of 30–40× compared to accessing an already-open row. They occur when consecutive accesses target different rows within the same bank. Only one row per bank can be open at a time; all rows start closed. Once open, a row’s 256-byte words can be accessed quickly, but switching rows requires closing the current row and opening a new one, adding 40–50 ns (60–75 cycles at 1.5 GHz) of latency.
Channel interleaving mitigates bank conflicts. Interleaving accesses across all 32 channels distributes load and reduces conflicts. Bits 8-12 (the next five LSBs) represent independent stacks and channels; placing these at low addresses prevents interference between adjacent accesses, which is vital for parallelizing contiguous operations. Non-contiguous operations often benefit from natural channel interleaving because the channel component spans bits 9-28. However, the stack component only corresponds to bit 8, so interleaving this bit requires explicit attention.
The controller hides row-switch latency through command interleaving. Within each channel, the controller automatically interleaves commands across banks, enabling useful transfers while other banks perform row switches. The controller manages bank states using its command queue and employs FR-FCFS (First Ready-First Come First Served) scheduling, prioritizing commands targeting already-open rows.
Despite this sophisticated scheduling, access patterns that continuously switch rows within the same bank still degrade performance significantly. Compilers and programmers should estimate row-switch costs when generating code.
Column-to-Column Delay
Column-to-Column Delay (tCCD) rarely affects performance significantly, so skip this section on first reading.
tCCD is the minimum time between consecutive read or write commands on the same channel.
It determines the maximum command issue rate, directly affecting channel throughput.
Vendor specifications set tCCD values based on analog constraints for accessing DRAM stack layers and shared resources.
The tCCD value depends on which memory resources consecutive commands target:
| Command Relation | tCCD (cycles @ 1.5GHz) | Relative Performance | Reason for Penalty |
|---|---|---|---|
| Same Slice, Different Bank Group | 2 | 1 | Ideal interleaving of bank groups |
| Different Slice | 3 | 2/3 | Data path switching |
| Same Slice, Same Bank Group | 4 | 1/2 | Shared I/O buffer among four banks |
The optimal case is interleaving between different bank groups within the same slice (tCCD = 2 cycles at 1.5GHz), allowing a new 64B command to be issued every cycle at 0.75GHz, achieving back-to-back transmission and full channel speed.
Any tCCD greater than 2 reduces the command rate and channel utilization.
Pathological tCCD patterns cause less severe degradation than bank conflicts for two reasons: either they often coincide with bank conflicts anyway, or channel interleaving masks their impact:
- Different Slice (`tCCD = 3`): Slice ID corresponds to bit `20`, and bit `21` corresponds to the row. Interleaving across slices therefore likely causes bank conflicts simultaneously.
- Same Slice, Same Bank Group (`tCCD = 4`): This pattern interleaves bits `8`–`35` except bits `13`, `19`, `20`, and `34`. Bits `29`–`35` relate to bank conflicts; bits `8`–`28` relate to channel interleaving.
[^1]: Although the channel controller operates at a frequency of 0.75GHz, it performs eight bursts per cycle, leading to an effective frequency of 0.75 × 8 = 6GHz.
Computing Tensors
The Tensor Unit transforms data through a pipeline of eight specialized engines. Data flows from DM, through the engine pipeline, and back to DM. After the Collect Engine normalizes packets to flits (32-byte flow control units), all downstream engines — Contraction, Vector, Cast, Transpose, and Commit — operate on flits. (See Collect Engine for the normalization details.)
flowchart TB
subgraph SRAM
DM[(DM)] & TRF[(TRF)] & VRF[(VRF)]
end
subgraph TU[Tensor Unit]
direction LR
FE[Fetch] --> SW[Switching] --> CO[Collect] --> CE[Contraction] --> VE[Vector] --> CA[Cast] --> TR[Transpose] --> CM[Commit]
end
DM --> FE
CM --> DM
CO --> TRF --> CE
CO --> VRF --> VE
click FE "../moving-tensors/fetch-engine.html" "Fetch Engine"
click SW "./switch-engine.html" "Switch Engine"
click CO "./collect-engine.html" "Collect Engine"
click CE "./contraction-engine/index.html" "Contraction Engine"
click VE "./vector-engine/index.html" "Vector Engine"
click CA "./cast-engine.html" "Cast Engine"
click TR "./transpose-engine.html" "Transpose Engine"
click CM "../moving-tensors/commit-engine.html" "Commit Engine"
Two register files serve distinct roles: TRF (Tensor Register File; see hello-tcp memory overview) holds weights for the Contraction Engine (load once, reuse across many cycles), while VRF (Vector Register File) holds operands for the Vector Engine.
The Collect Engine loads data into TRF via .to_trf() and VRF via .to_vrf().
Fetch and Commit are part of the Tensor Unit pipeline but interface directly with DM; see Moving Tensors.
| Engine | Function | Key Constraint |
|---|---|---|
| Fetch | Load data from DM into the pipeline | Packet must be 8-byte aligned; Slice is unchanged |
| Switching | Redistribute data across slices | Ring network topology; Slice can change |
| Collect | Normalize packets to 32-byte flits | Output = exactly one flit |
| Contraction | Einsum: matmul, convolution, attention | Weight-stationary via TRF |
| Vector | Elementwise, binary, reduce operations | Only i32/f32 input |
| Cast | Precision lowering with batching | Output = exactly one flit |
| Transpose | Reorder elements within a flit | Within-flit only |
| Commit | Write results back to DM | Flit-aligned writes |
As a kernel writer, you specify data types, tensor mapping expressions, and computations in einsum form. The compiler translates these into per-engine hardware configurations.
Execution Contexts
Two execution contexts enable double-buffering (preparing the next operand batch while the current one is being computed) to hide memory latency:
| Context | Compute Engines | Fetch/Commit | Typical Use |
|---|---|---|---|
| Main | Exclusive access | Dedicated units | Computation |
| Sub | Idle only | Lower bandwidth | Prefetching to TRF/VRF |
While the main context computes, the sub context prefetches the next operand batch into TRF/VRF. When the sub context is unused, the main and sub Switch Engine channels combine into dual channel mode (see Switch Engine), doubling bandwidth. See Scheduling for how the scheduler coordinates the two contexts and the DMA Engine.
The following sections cover each engine in detail.
Switch Engine
The Fetch Engine produces a FetchTensor where each slice holds its own portion of data.
The Switch Engine then redistributes data across slices so each slice receives exactly what it needs for computation.
Data flows through a ring network of 256 interconnected slices; each slice’s router decides per packet whether to output locally or forward to a neighbor.
This data redistribution overlaps with computation, enabling the Contraction Engine to receive data in the exact pattern it needs while continuously executing operations. This page covers the interface, routing architecture (Forwarding, Broadcast01, Broadcast1, Transpose, InterTranspose, and Custom Topologies), hardware constraints, and performance characteristics.
Interface
impl<'l, const T: Tu, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
FetchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Performs switching operation to create a switched tensor.
///
/// Applies switching network routing only. The packet passes through
/// unchanged — no padding, no reshaping. Use [`SwitchTensor::collect`]
/// afterwards to normalize the packet to flit-sized chunks.
#[primitive(FetchTensor::switch)]
pub fn switch<Slice2: M, Time2: M>(
self,
config: SwitchConfig,
) -> SwitchTensor<'l, T, D, Chip, Cluster, Slice2, Time2, Packet> {
verify_switch::<Slice, Time, Slice2, Time2>(&config);
SwitchTensor::new(self.ctx, self.inner.transpose(true))
}
/// Skips the switching network and goes directly to collect.
///
/// Slice and Time are preserved from fetch; only the packet is normalized
/// to flit-sized chunks.
#[primitive(FetchTensor::collect)]
pub fn collect<Time2: M, Packet2: M>(self) -> CollectTensor<'l, T, D, Chip, Cluster, Slice, Time2, Packet2> {
verify_collect::<D, Time, Packet, Time2, Packet2>();
CollectTensor::new(self.ctx, self.inner.transpose(false))
}
}
The transformation preserves the tensor’s mathematical representation while redistributing data across slices.
The Chip and Cluster dimensions pass through unchanged; only Slice and Time are permuted.
The packet passes through the switch engine unchanged.
After switching, call collect() to normalize the packet to 32-byte flits.
Architecture
This section explains how routers make decisions to route data, then shows regular topologies with predictable data flow, and finally covers custom topologies that enable arbitrary patterns. The Switch Engine only supports specific slice and temporal dimension transformations determined by the switching topology.
Router Decision Process
Understanding how routers make forwarding decisions is essential before exploring specific topologies.
Each slice has a router that decides whether to send its input packet to an adjacent slice or to output.
Each packet has a source slice number attached, and each slice configures a snoop bitmap (a bitmask specifying which source slices’ packets to accept and output) to control which data it receives.
Each slice’s router can make the following three routing decisions:
- Input routing: The router decides whether its input packet goes to output, rightward to the next slice, or leftward to the previous slice.
- Right-neighbor routing: Data arriving from the right neighbor can be forwarded to output, rightward, or leftward.
- Left-neighbor routing: Data arriving from the left neighbor can be forwarded to output, leftward, or rightward.
Using these settings, data moving in a counter-clockwise ring pattern can be configured to reach the desired slice.
Common router configurations for counter-clockwise ring communication:
- Root node: Outputs input data and data from the right slice, sends to right slice.
- Middle node: Outputs data from the left slice, forwards to right.
- Leaf node: Forwards input data to left, outputs data from the left slice.
To understand how the switching mechanism routes data through the ring network, consider a minimal example with 2 slices and 2 input packets per slice.
This example shows how data flows through the ring over time, with each slice deciding whether to output data locally or forward it to neighbors.
Given:
- `axes![A = 2, B = 2, C = 64]`
- Slice: `m![A]`
- Time: `m![B]`
- Packet: `m![C]`
- Input packets per slice: `slice0 = [0, 1]`, `slice1 = [2, 3]`
| i (cycle) | slice#0 | slice#1 | Output Data |
|---|---|---|---|
| 0 | 0: from input, to (output, right) | | 0: [0]; 1: [] |
| 1 | 1: from input, to (output, right) | 0: from left, to output; 2: from input, to left | 0: [0, 1]; 1: [0] |
| 2 | 2: from right, to (output, right) | 1: from left, to output; 3: from input, to left | 0: [0, 1, 2]; 1: [0, 1] |
| 3 | 3: from right, to (output, right) | 2: from left, to output | 0: [0, 1, 2, 3]; 1: [0, 1, 2] |
| 4 | | 3: from left, to output | 0: [0, 1, 2, 3]; 1: [0, 1, 2, 3] |
As a result, a tensor with the following mapping expression is output:
- Slice: `m![A / 2, 2]`
- Time: `m![B / 2, B % 2, A % 2]`
- Packet: `m![C]`
The hardware provides pre-defined regular topologies (like Broadcast01 with slice0 = 2, slice1 = 1) that configure the routers to achieve such patterns efficiently.
Forwarding
Forwarding passes data through the switching network unchanged, preserving the Slice and Time dimension mapping.
Each slice’s router simply passes its input data directly to output; no inter-slice communication occurs.
To use forwarding, skip the .switch() call entirely and invoke collect() directly on the FetchTensor:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 32];
fn forwarding<'l, const T: Tu>(
input: FetchTensor<'l, T, f32, m![1], m![1], m![A], m![B], m![C]>,
) -> CollectTensor<'l, T, f32, m![1], m![1], m![A], m![B], m![C]> {
input.collect()
}
}
The ring network operates at the following minimum cost when forwarding:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
ring_size is 1 since no inter-slice communication is needed, making this the most efficient topology when no actual switching is required.
Broadcast01
Broadcast01 replicates data across slices along two inner Slice sub-dimensions (called slice0 and slice1 in the layout diagram below), enabling parallel computation on the same data across multiple processing elements.
This topology is essential for operations like matrix-vector multiplication where a vector needs to be broadcast to all rows of a matrix distributed across slices.
This topology is parameterized by slice1, slice0, and time0.
The compiler infers slice2 = InSlice::SIZE / (slice1 * slice0) and time1 = InTime::SIZE / time0.
The following table shows the input axis structure (outermost to innermost, left to right):
+--------------------------+---------------+
| Slice | Time |
+--------+--------+--------+-------+-------+
| slice2 | slice1 | slice0 | time1 | time0 |
+--------+--------+--------+-------+-------+
After switching, slice1 and slice0 move from Slice into Time, broadcasting those dimensions across the ring group while tiling slice2:
+----------------------+---------------------------------+
| Slice | Time |
+--------+------+------+-------+--------+-------+--------+
| slice2 | tile | tile | time1 | slice1 | time0 | slice0 |
+--------+------+------+-------+--------+-------+--------+
Moving slice1 and slice0 from Slice to Time creates slice2 independent ring groups, each of size slice1 × slice0, where slices within each ring group exchange data to achieve the broadcast pattern.
The broadcast dimensions (slice1, slice0) are placed at the innermost positions of the output Time dimension (just outside Packet).
This broadcast topology takes data that was spatially distributed across slices (the Slice axis) and broadcasts it over time.
Instead of different slices having different data, all slices in the same ring group receive the same data sequentially through time.
Example
Consider the following configuration:
- `axes![A = 256, B = 64, C = 63, D = 8]`
- dtype = `i8`
- In:
  - Chip: `m![D / 2]`
  - Cluster: `m![D % 2]`
  - Slice: `m![A]`
  - Time: `m![B]`
  - Packet: `m![C # 64]`
- Out:
  - Chip: `m![D / 2]`
  - Cluster: `m![D % 2]`
  - Slice: `m![A / 4, 4]`
  - Time: `m![B / 4, A / 2 % 2, B % 4, A % 2]`
  - Packet: `m![C # 64]`
This configuration sets slice1 = 2, slice0 = 2, time0 = 4 in the Broadcast01 topology.
The compiler infers slice2 = 256 / (2 * 2) = 64 and time1 = 64 / 4 = 16.
Notice that slice2 * slice1 * slice0 = 256 = Slice::SIZE, and time1 * time0 = 64 = (old)Time::SIZE.
The difference between the input and output mappings is that the A % 4 axis moved from Slice to Time, while slice2 is tiled.
This divides the 256 slices into 64 groups of size ring_size = slice0 * slice1 = 4.
The axis movement between Slice and Time enables this broadcast behavior: when an axis moves from Slice to Time, it creates dependencies where slices in a particular ring group receive data from the other slices in the same ring group.
The slice1 and slice0 broadcast axes each move to Time as A / 2 % 2 and A % 2, respectively.
This particular configuration is equivalent to the following custom snoop bitmap, which maps the slice identified by the bitmap index to its corresponding ring group.
The broadcast pattern is evident: rows 0-3 have identical entries as the slices they represent ({0, 1, 2, 3}) receive data from the same input slices ({0, 1, 2, 3}).
| Bitmap Index | (A / 4, A % 4) | A | Ring Group |
|---|---|---|---|
| 0 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 1 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 2 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 3 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 4 | (1, 0), (1, 1), (1, 2), (1, 3) | 4, 5, 6, 7 | 4, 5, 6, 7 |
| … | … | … | … |
| 255 | (63, 0), (63, 1), (63, 2), (63, 3) | 252, 253, 254, 255 | 252, 253, 254, 255 |
Since this matches exactly the pre-defined Broadcast01 form, it is an input/output format that can be processed by the Switch Engine.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 63, D = 8, X = 4];
fn broadcast01<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![D / 2], m![D % 2], m![A], m![B], m![C # 64]>,
) -> SwitchTensor<'l, T, i8, m![D / 2], m![D % 2], m![A / 4, X], m![B / 4, A / 2 % 2, B % 4, A % 2], m![C # 64]> {
// X is a newly introduced axis for broadcast semantics.
// Input: each slice has its own portion of data (256 slices, 64 time steps, 64 byte packets)
// Output: all slices receive broadcast data from their 4-slice ring group
// Packet passes through unchanged; call .collect() afterwards to normalize to flits.
input.switch::<m![A / 4, X], m![B / 4, A / 2 % 2, B % 4, A % 2]>(
SwitchConfig::Broadcast01 {
slice1: 2,
slice0: 2,
time0: 4
}
)
}
}
Cycle Estimation
The Switch Engine’s cycle estimation follows the formula:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = (\texttt{slice0} \times \texttt{slice1}) \times \texttt{B::SIZE} \times \frac{\texttt{(C # 64)::SIZE}}{32} $$
$$ = (2 \times 2) \times 64 \times \frac{64}{32} = 512 \text{ cycles} $$
Ring Structure
The ring_size of 4 means that inter-slice data movement occurs in groups of 4 slices, with data dependencies existing only within each ring.
When we group all 256 slices into rings of size 4, we get 64 independent rings that operate in parallel.
Within each ring, exchanging data takes time proportional to ring_size, and each packet represents the minimum unit of data exchange.
Regular topologies can be expressed as tensor mapping expressions. For example, with:
- `axes![A = 64, B = 64]`
- Slice = `m![A]`
- Time = `m![B / 2]`
- Packet = `m![B % 32]`
If configured with Broadcast01 (slice0 = 8, slice1 = 8, time0 = 2), the tensor mapping expression corresponding to the output is:
- `axes![A = 64, B = 64]`
- Slice = `m![A / 64, 64]`
- Time = `m![B / 4, A / 8 % 8, B / 2 % 2, A % 8]`
- Packet = `m![B % 32]`
Broadcast1
Broadcast1 replicates data across slices along Slice dimension 1, enabling parallel computation where a single dimension needs to be broadcast while preserving another dimension in the slice.
This topology is simpler than Broadcast01 as it only broadcasts along one Slice dimension.
This topology is parameterized by slice1 and slice0.
The compiler infers slice2 = InSlice::SIZE / (slice1 * slice0).
An input tensor structured as follows:
+--------------------------+--------+
| Slice | Time |
+--------+--------+--------+--------+
| slice2 | slice1 | slice0 | time0 |
+--------+--------+--------+--------+
is transformed into the following output tensor, where only the slice1 axis moves from Slice to Time, broadcasting this dimension across the slice’s ring group, while preserving slice0 in Slice dimension and tiling slice2.
+------------------------+----------------+
| Slice | Time |
+--------+------+--------+-------+--------+
| slice2 | tile | slice0 | time0 | slice1 |
+--------+------+--------+-------+--------+
Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 63, X = 4];
fn broadcast1<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A], m![B], m![C # 64]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A / 32, X, A % 8], m![B, A / 8 % 4], m![C # 64]> {
// X is a newly introduced axis for broadcast semantics.
// Packet passes through unchanged; call .collect() afterwards to normalize to flits.
input.switch::<m![A / 32, X, A % 8], m![B, A / 8 % 4]>(
SwitchConfig::Broadcast1 {
slice1: 4,
slice0: 8,
}
)
}
}
Transpose
Transpose permutes axes within the innermost part of the Slice dimension.
This topology is parameterized by slice1 and slice0.
An input tensor with the slice dimension structured as [slice2, slice1, slice0] is transformed so that the output slice becomes [slice2, slice0, slice1]:
+--------------------------+ +--------------------------+
| Slice | | Slice |
+--------+--------+--------+ --> +--------+--------+--------+
| slice2 | slice1 | slice0 | | slice2 | slice0 | slice1 |
+--------+--------+--------+ +--------+--------+--------+
Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 63];
// Transpose with slice1 = 32, slice0 = 2.
// Input Slice: m![A]: [slice2 = 4, slice1 = 32, slice0 = 2]
// Output Slice: m![A / 64, A % 2, A / 2 % 32]
fn transpose<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A], m![B], m![C # 64]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A / 64, A % 2, A / 2 % 32], m![B], m![C # 64]> {
input.switch::<m![A / 64, A % 2, A / 2 % 32], m![B]>(SwitchConfig::Transpose {
slice1: 32,
slice0: 2,
})
}
}
The output slice m![A / 64, A % 2, A / 2 % 32] decomposes the original axis A into three parts: A / 64 extracts slice2 (stride 64, size 4), A % 2 extracts slice0 (stride 1, size 2), and A / 2 % 32 extracts slice1 (stride 2, size 32).
Compared to the input slice ordering ([slice2, slice1, slice0]), slice1 and slice0 are swapped.
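One way to read this mapping is as index arithmetic on the axis value A. The following plain-Rust sketch (not SDK code, written only to mirror the decomposition above) computes where each input slice's data lands after the transpose:
// Input slice a = slice2*64 + slice1*2 + slice0 is remapped to
// output slice slice2*64 + slice0*32 + slice1.
fn transpose_slice_index(a: usize) -> usize {
    let slice2 = a / 64;        // stride 64, size 4
    let slice1 = (a / 2) % 32;  // stride 2,  size 32
    let slice0 = a % 2;         // stride 1,  size 2
    slice2 * 64 + slice0 * 32 + slice1
}

fn main() {
    assert_eq!(transpose_slice_index(0), 0);   // (0, 0, 0) stays in place
    assert_eq!(transpose_slice_index(1), 32);  // slice0 = 1 now has stride 32
    assert_eq!(transpose_slice_index(2), 1);   // slice1 = 1 now has stride 1
    assert_eq!(transpose_slice_index(65), 96); // slice2 = 1, slice0 = 1, slice1 = 0
}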
InterTranspose
While regular Transpose permutes axes within Slice only, InterTranspose swaps between the Slice and Time dimensions and transposes in the Time dimension.
This topology is parameterized by slice1 (the size of the dimension being swapped), slice0, and time0.
The compiler derives slice2 and time2 from the input Slice and Time mappings.
Since time1 must have the same size as slice1 for OutSlice::SIZE to be 256, this effectively swaps equally-sized chunks between the Slice and Time dimensions:
Input:
+--------------------------+-----------------------+
| Slice | Time |
+--------+--------+--------+-------+-------+-------+
| slice2 | slice1 | slice0 | time2 | time1 | time0 |
+--------+--------+--------+-------+-------+-------+
Output:
+--------------------------+------------------------+
| Slice | Time |
+--------+--------+--------+-------+-------+--------+
| slice2 | time1 | slice0 | time2 | time0 | slice1 |
+--------+--------+--------+-------+-------+--------+
The slice2 and slice0 axes remain unchanged in Slice, while time1 in Slice comes from the Time axis.
The output Time dimension contains slice1 from the original Slice dimension.
Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 32, C = 256];
// InterTranspose with slice1 = 2, slice0 = 16, time0 = 2.
// The compiler derives: slice2 = 8, time2 = 2.
// Input Slice: m![C] = [slice2 = 8, slice1 = 2, slice0 = 16]
// Input Time: m![A] = [time2 = 2, time1 = 2, time0 = 2]
// Output Slice: m![C / 32, A / 2 % 2, C % 16]
// Output Time: m![A / 4, A % 2, C / 16 % 2]
fn inter_transpose<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![C], m![A], m![B # 32]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![C / 32, A / 2 % 2, C % 16], m![A / 4, A % 2, C / 16 % 2], m![B # 32]> {
input.switch::<m![C / 32, A / 2 % 2, C % 16], m![A / 4, A % 2, C / 16 % 2]>(
SwitchConfig::InterTranspose {
slice1: 2,
slice0: 16,
time0: 2,
})
}
}
The output Slice (m![C / 32, A / 2 % 2, C % 16]) decomposes into:
- `C / 32` extracts `slice2` (from input `Slice`)
- `A / 2 % 2` extracts `time1` (from input `Time`)
- `C % 16` extracts `slice0` (from input `Slice`)
The output Time (m![A / 4, A % 2, C / 16 % 2]) contains:
- `A / 4` extracts `time2` (from input `Time`)
- `A % 2` extracts `time0` (from input `Time`)
- `C / 16 % 2` extracts `slice1` (from input `Slice`)
Custom Topologies
Regular topologies cover the most common data movement patterns efficiently, but some tensor operations require arbitrary permutations or partial axis extractions that don’t fit these predefined patterns.
Custom topologies solve this problem by allowing you to program exactly which input slices map to which output slices using a bitmap, giving you complete flexibility for complex transformations.
Configuration Overhead
The tradeoff for this flexibility is configuration overhead: using a custom topology requires preempting DMA and sub-context operations to write the bitmap to the hardware’s Special Function Registers (SFRs).
This setup cost makes custom topologies most appropriate when the computation benefits outweigh the initialization overhead.
Supported Transformation Patterns
Custom bitmaps support two key transformation patterns that regular topologies cannot express.
First, they enable free transpose with broadcast, allowing arbitrary permutation and broadcast of partitioning axes—regular topologies only support specific forms like Transpose or TransposedDim1Broadcast, but custom bitmaps let you freely mix axes while broadcasting.
Second, they support partial axis extraction, where only a portion of an axis moves to Time during broadcasting—regular topologies like Broadcast01 always move the entire broadcast axis, but custom bitmaps can select subsets.
Example 1: Arbitrary Permutation
This example demonstrates arbitrary slice dimension permutations that regular topologies cannot express, enabling flexible data reordering for specialized computation patterns.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![B % 4, B / 4, A % 4, A / 4]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
The difference between In and Out is that permutation occurred in the slice shape.
The form of permutation is [0, 1, 2, 3] to [3, 2, 1, 0].
There is no regular topology corresponding to such a free permutation, but it is a form that can be simply expressed with a custom bitmap.
| Bitmap Index | (B % 4, B / 4, A % 4, A / 4) | (A, B) | Ring Group |
|---|---|---|---|
| 0 | (0, 0, 0, 0) | (0, 0) | 0 |
| 1 | (0, 0, 0, 1) | (4, 0) | 64 |
| 2 | (0, 0, 0, 2) | (8, 0) | 128 |
| 3 | (0, 0, 0, 3) | (12, 0) | 192 |
| 4 | (0, 0, 1, 0) | (0, 1) | 1 |
| 5 | (0, 0, 1, 1) | (4, 1) | 65 |
| … | … | … | … |
| 255 | (3, 3, 3, 3) | (15, 15) | 255 |
The cycle calculation follows the standard formula:
$$ \text{cycle} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet}. $$
$$ = 256 \times \texttt{C::SIZE} \times \frac{\texttt{m![D, E]::SIZE}}{32} = 4096 $$
The ring size must be a power of 2, and in this case we need the maximum value of 256, as this particular permutation creates dependencies across all slices with no repeating structure.
For example, data from input slice 192 must reach output slice 3, which means we need a ring large enough to cover all such cross-slice dependencies.
This high cycle count reflects the cost of the arbitrary permutation—contrast this with regular topologies that achieve much lower cycle counts through structured parallelism.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 16, B = 16, C = 8, D = 8, E = 8];
fn arbitrary_permutation<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A, B], m![C], m![D, E]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![B % 4, B / 4, A % 4, A / 4], m![C], m![D, E]> {
input.switch::<m![B % 4, B / 4, A % 4, A / 4], m![C]>(
SwitchConfig::CustomBroadcast { ring_size: 256 }
)
}
}
Example 2: Multi-Axis Broadcast
This example shows broadcasting across multiple non-contiguous axes within the Slice dimension, useful for complex tensor operations that require replication along several independent dimensions simultaneously.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A / 2, 2, B / 2, 2]`
  - Time: `m![C, A % 2, B % 2]`
  - Packet: `m![D, E]`
The difference between In and Out is that the two axes A % 2 and B % 2 moved from Slice to Time, and broadcast occurred at the original position.
Among regular topologies, Dim0/Dim1Broadcast supports a similar form, but it cannot express cases where the axes moving from Slice to Time are non-adjacent within the slice dimension.
However, this is a form that can be simply expressed with a custom bitmap.
| Bitmap Index | (A / 2, A % 2, B / 2, B % 2) | (A, B) | Ring Group |
|---|---|---|---|
| 0 | (0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0), (0, 1, 0, 1) | (0, 0), (0, 1), (1, 0), (1, 1) | 0, 1, 16, 17 |
| 1 | (0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0), (0, 1, 0, 1) | (0, 0), (0, 1), (1, 0), (1, 1) | 0, 1, 16, 17 |
| 2 | (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 1, 0), (0, 1, 1, 1) | (0, 2), (0, 3), (1, 2), (1, 3) | 2, 3, 18, 19 |
| 3 | (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 1, 0), (0, 1, 1, 1) | (0, 2), (0, 3), (1, 2), (1, 3) | 2, 3, 18, 19 |
| … | … | … | … |
| 255 | (7, 0, 7, 0), (7, 0, 7, 1), (7, 1, 7, 0), (7, 1, 7, 1) | (14, 14), (14, 15), (15, 14), (15, 15) | 238, 239, 254, 255 |
The cycle calculation gives us:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = 32 \times 8 \times 2 = 512 \text{ cycles} $$
The ring_size of 32 is smaller than the full 256 slices because the outermost A / 2 part of the slice dimension doesn’t require data exchange—only the remaining 32 slices within each A / 2 group need to exchange data with each other.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 16, B = 16, C = 8, D = 8, E = 8, X = 2, Y = 2];
fn multi_axis_broadcast<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A, B], m![C], m![D, E]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A / 2, X, B / 2, Y], m![C, A % 2, B % 2], m![D, E]> {
input.switch::<m![A / 2, X, B / 2, Y], m![C, A % 2, B % 2]>(
SwitchConfig::CustomBroadcast { ring_size: 32 }
)
}
}
Understanding the Bitmap Pattern
Two key patterns appear in the bitmap that reveal how the transformation works.
First, broadcast manifests as identical bitmaps: bitmap[0] and bitmap[1] are completely identical because output slices 0 and 1 both receive the same source data, implementing the broadcast operation.
Second, the Slice to Time movement appears as one output slice receiving from multiple input slices: bitmap[0] = {0, 1, 16, 17} shows that output slice 0 collects data from four different input slices.
Example 3: Partial Axis Extraction (Slicing)
This example demonstrates extracting only a subset of an axis during the Slice to Time transformation, enabling selective data distribution for operations that don’t require the full axis range.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A, B / 4, 4]`
  - Time: `m![C, B % 4 = 3]`
  - Packet: `m![D, E]`
The difference between In and Out is that the B % 4 axis moved from Slice to Time, and broadcast occurred at the original position.
The unusual point is that B % 4 did not move to Time intact; it was partially sliced (3 out of 4 values).
Among regular topologies, Dim0/Dim1Broadcast supports a similar form, but it cannot express the case where an axis moving from Slice to Time is sliced.
However, this is a form that can be simply expressed with a custom bitmap.
| Bitmap Index | (A, B / 4, B % 4 = 3) | (A, B) | Ring Group |
|---|---|---|---|
| 0 | (0, 0, 0), (0, 0, 1), (0, 0, 2) | (0, 0), (0, 1), (0, 2) | 0, 1, 2 |
| 1 | (0, 0, 0), (0, 0, 1), (0, 0, 2) | (0, 0), (0, 1), (0, 2) | 0, 1, 2 |
| 2 | (0, 0, 0), (0, 0, 1), (0, 0, 2) | (0, 0), (0, 1), (0, 2) | 0, 1, 2 |
| 4 | (0, 1, 0), (0, 1, 1), (0, 1, 2) | (0, 4), (0, 5), (0, 6) | 4, 5, 6 |
| … | … | … | … |
| 255 | (15, 3, 0), (15, 3, 1), (15, 3, 2) | (15, 12), (15, 13), (15, 14) | 252, 253, 254 |
The cycle calculation follows the formula:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = 4 \times 8 \times 2 = 128 \text{ cycles} $$
The small ring_size of 4 reflects that the A, (B / 4) outermost portion of the slice doesn’t exchange data. Only the innermost 4 slices within each group need to communicate.
The bitmap reveals how partial axis extraction works: bitmap[0] = {0, 1, 2} shows that output slice 0 receives from only 3 input slices.
If the bitmap were {0, 1, 2, 3}, it would represent receiving the entire B axis (all 4 values).
By including only {0, 1, 2}, the bitmap implements slicing—extracting 3 out of 4 values from the B axis dimension.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 16, B = 16, C = 8, D = 8, E = 8, X = 4];
fn partial_axis_extraction<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A, B], m![C], m![D, E]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A, B / 4, X], m![C, B % 4 = 3], m![D, E]> {
input.switch::<m![A, B / 4, X], m![C, B % 4 = 3]>(
SwitchConfig::CustomBroadcast { ring_size: 4 }
)
}
}
Constraint 1: Order Preservation
Hardware limitations require that axes moving from Slice to Time must preserve their relative order from the input Slice dimension.
This constraint exists because the routing network can efficiently forward data in the original axis order, but reordering axes during the transfer would require additional buffering that the hardware doesn’t provide.
The following example shows an unsupported transformation that violates this constraint.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A, B / 4, 4]`
  - Time: `m![C, B % 2, B / 2]`
  - Packet: `m![D, E]`
In this example, the B % 2 and B / 2 axes appear in reversed order compared to their arrangement in the input slice dimension.
While the slice bitmap could theoretically represent this pattern, the hardware cannot execute it because it lacks the buffering needed to reorder axes during transfer.
If the output slice were instead m![A, B / 4, 4] with time m![C, B / 2, B % 2] and packet m![D, E], then the transformation would be valid.
In this corrected version, the B / 2, B % 2 axes maintain their original order from the input slice, satisfying the order preservation constraint.
Constraint 2: Innermost Time Position
The hardware requires axes moving from Slice to Time to appear at the innermost positions of the output time dimension.
Axes moving from Slice to Time are delivered last per packet, so they become the innermost Time dimensions in the output stream.
The following example shows an unsupported transformation that violates this constraint.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A / 2, 2, B / 2, 2]`
  - Time: `m![A % 2, C, B % 2]`
  - Packet: `m![D, E]`
In this example, the A % 2 and B % 2 axes that move from Slice to Time preserve their relative order correctly.
However, the transformation is still invalid because these axes are not positioned at the innermost part of the output time dimension—the C axis appears between A % 2 and B % 2, violating the innermost position requirement.
Note that Broadcast01 topology can sometimes work around this constraint using the time0 parameter, which provides additional flexibility in axis positioning.
Custom topologies lack this time0 mechanism, so they must strictly place all Slice to Time axes at the innermost time positions.
Constraints
Understanding switching constraints prevents compilation errors and ensures correct data movement patterns.
Why Switching Constraints Exist
The Switch Engine constraints reflect fundamental hardware design decisions about the ring network topology and router capabilities.
Ring network topology fundamentally limits flexibility. The hardware implements a physical ring connecting 256 slices in a fixed order. Data flows counter-clockwise through this ring, with each router deciding whether to output locally or forward to neighbors. This topology is highly efficient for regular patterns (like broadcasting) where all slices follow similar routing rules. However, it cannot efficiently express arbitrary permutations that would require complex routing tables or multiple ring passes. The hardware provides only 256 router configuration entries—one per slice—rather than a full crossbar switch that could connect any slice to any other.
Buffering constraints drive the order preservation rule.
Each slice router has minimal buffering (essentially one packet), which enables high throughput but prevents reordering.
When data arrives from the ring network, the router must immediately decide: output locally or forward?
It cannot buffer multiple packets and reorder them.
Therefore, axes moving from the Slice to Time dimension must maintain their original order—the hardware simply forwards data in arrival order without reordering capabilities.
Pipeline structure requires innermost time position.
Data from other slices arrives last within each packet, so Slice-to-Time axes naturally become the innermost time dimensions.
Placing these axes anywhere else would require the hardware to buffer and reorder complete time sequences, which would require prohibitive amounts of SRAM and complex control logic.
Regular Topology Constraints
Regular topologies impose specific structural requirements:
- Topology pattern matching: Input/output mapping expressions must match the predefined topology pattern; violating this causes a compilation error. Example: `Broadcast01` requires a specific axis ordering (`slice2`, `slice1`, `slice0`, `time1`) that cannot be arbitrarily reordered.
- Full cluster operation: `InSlice::SIZE` = `OutSlice::SIZE` = 256. Partial cluster operations are not supported; violating this causes a compilation error.
Custom Topology Constraints
Custom topologies provide flexibility but impose two critical constraints:
1. Order Preservation: Axes moving from Slice to Time must preserve their relative order from the input slice dimension (see Buffering constraints above).
Violating this causes a compilation error or incorrect data routing.
// Input: Slice: m![A, B]
// INVALID: B % 2 and B / 2 are reversed
// Output: Time: m![C, B % 2, B / 2]
// Valid: Time: m![C, B / 2, B % 2]
2. Innermost Time Position: Axes moving from Slice to Time must appear at the innermost positions of the output Time dimension (see Pipeline structure above).
Violating this causes a compilation error or incorrect data ordering.
// Input: Slice: m![A, B]
// INVALID: C appears between moved axes
// Output: Time: m![A % 2, C, B % 2]
// Valid: Time: m![C, A % 2, B % 2]
Note
The `Broadcast01` topology can sometimes work around the innermost position constraint using the `time0` parameter. Custom topologies lack this mechanism and must strictly follow the constraint.
Performance
The Switch Engine performance directly affects computation throughput since data redistribution overlaps with tensor operations.
Cycle Estimation
Switching operations follow the formula:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
Where:
- `ring_size`: Number of slices in each independent ring (e.g., `slice0 × slice1` for `Broadcast01`)
- `input_time`: Size of the input time dimension
- `cycles_per_packet`: `packet_size / 32` (number of 32-byte flits per packet)
For example, with ring_size = 4, input_time = 64, and 64-byte packets (2 flits):
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = 4 \times 64 \times 2 = 512 \text{ cycles} $$
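A hypothetical helper (not an SDK API; the function name is ours) that mirrors this formula can make the estimates in this chapter reproducible:
// packet_bytes is the per-packet size as it enters the Switch Engine.
fn switch_cycles(ring_size: u64, input_time: u64, packet_bytes: u64) -> u64 {
    let cycles_per_packet = packet_bytes / 32; // number of 32-byte flits per packet
    ring_size * input_time * cycles_per_packet
}

fn main() {
    // The worked example above: ring_size = 4, input_time = 64, 64-byte packets.
    assert_eq!(switch_cycles(4, 64, 64), 512);
    // Example 1 from Custom Topologies: ring_size = 256, input_time = 8, 64-byte packets.
    assert_eq!(switch_cycles(256, 8, 64), 4096);
}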
Parallelism Across Rings
When 256 slices are grouped into rings (e.g., 64 rings of size 4), all rings operate independently and in parallel.
This parallelism is critical for high throughput: although each ring takes ring_size × input_time × cycles_per_packet cycles to complete, all rings finish simultaneously.
Custom Topology Overhead
Custom topologies provide arbitrary permutation flexibility but incur configuration overhead:
- Requires preempting DMA and sub-context operations
- Must write the bitmap to Special Function Registers (SFRs)
- Setup cost makes custom topologies most appropriate when computation benefits outweigh initialization overhead
Communication Cost
Communication cost in the ring network scales with ring size and data volume:
- Regular topologies: Optimized for common patterns, minimal overhead
- Custom topologies: Flexible but potentially higher setup cost
- Ring topology characteristic: Data movement cost increases proportionally with ring size, unlike other dimensions where stride differences have minimal impact
Collect Engine
The Collect Engine normalizes packets to exactly one flit (a 32-byte flow control unit that all downstream engines operate on). It follows the Switch Engine in the pipeline, or the Fetch Engine directly when forwarding is implied.
Interface
impl<'l, const T: Tu, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
SwitchTensor<'l, { T }, D, Chip, Cluster, Slice, Time, Packet>
{
/// Normalizes packet to exactly 32 bytes (one flit).
///
/// Pads to flit-aligned boundary, then splits: inner 32 bytes become Packet2,
/// outer flit portion is absorbed into Time2.
/// For packets already ≤ 32 bytes, only padding is added.
#[primitive(SwitchTensor::collect)]
pub fn collect<Time2: M, Packet2: M>(self) -> CollectTensor<'l, T, D, Chip, Cluster, Slice, Time2, Packet2> {
verify_collect::<D, Time, Packet, Time2, Packet2>();
CollectTensor::new(self.ctx, self.inner.transpose(false))
}
}
Packet Normalization
The collect() method transforms an arbitrary-sized packet into exactly one flit (32 bytes):
- Pad the input packet to the nearest 32-byte boundary (if not already aligned). Skipped if the packet is already 32-byte aligned.
- Split at the flit boundary: the inner 32 bytes become `Packet2`, and the outer flit count is absorbed into `Time2`. Skipped if the padded packet is at most 32 bytes (i.e., fits in one flit).
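A small sketch (plain Rust, not an SDK API; the helper name is ours) of the same normalization arithmetic, useful for predicting the resulting Time and Packet shapes:
const FLIT_BYTES: usize = 32;

// Returns (padded packet size in bytes, flit count absorbed into Time when > 1).
fn collect_shape(packet_bytes: usize) -> (usize, usize) {
    let padded = packet_bytes.div_ceil(FLIT_BYTES) * FLIT_BYTES;
    let flits = padded / FLIT_BYTES;
    (padded, flits)
}

fn main() {
    assert_eq!(collect_shape(32), (32, 1)); // identity: already one flit
    assert_eq!(collect_shape(16), (32, 1)); // padded up to one flit
    assert_eq!(collect_shape(64), (64, 2)); // split: 2 flits, outer count goes to Time
}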
Packet = 32 bytes (identity)
i8, B = 32: packet = 32 elements × 1 byte = 32 bytes = one flit. Nothing changes.
Before: Time = m![A]
Packet = m![B]
┌──────────────────────────┐
│ B │ 32 bytes
└──────────────────────────┘
After: Time = m![A]
Packet = m![B # 32]
┌──────────────────────────┐
│ B # 32 │ 32 bytes
└──────────────────────────┘
Packet < 32 bytes (pad to one flit)
i8, B = 16: packet = 16 elements × 1 byte = 16 bytes. Padded to 32 bytes.
Before: Time = m![A]
Packet = m![B]
┌────────────┐
│ B │ 16 bytes
└────────────┘
After: Time = m![A]
Packet = m![B # 32]
┌────────────┬─────────────┐
│ B │ pad │ 32 bytes
└────────────┴─────────────┘
Packet > 32 bytes (split into flits)
bf16, B = 32: packet = 32 elements × 2 bytes = 64 bytes = 2 flits.
The outer flit count (2) is absorbed into Time.
Before: Time = m![A]
Packet = m![B]
┌──────────────────────────┬──────────────────────────┐
│ B/16 == 0 │ B/16 == 1 │ 64 bytes
└──────────────────────────┴──────────────────────────┘
32 bytes 32 bytes
After: Time = m![A, B/16]
Packet = m![B % 16]
┌──────────────────────────┐
│ B % 16 │ 32 bytes × B/16 time steps
└──────────────────────────┘
Each flit is delivered in a separate time step, so Time grows from m![A] to m![A, B/16].
Pipeline Position
The collect step is mandatory in the Tensor Unit pipeline: all downstream engines (Contraction, TRF/VRF load, etc.) require exactly-32-byte flits, so every execution must pass through fetch → [switch →] collect to normalize packets before proceeding.
When no slice redistribution is needed, call FetchTensor::collect() directly — no .switch() call is required:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 64];
fn direct_collect<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1], m![1], m![A, B / 32], m![B % 32]> {
input.collect()
}
}
Examples
Single-flit packet (identity)
When the input packet is already exactly 32 bytes, collect passes it through unchanged.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 32];
fn collect_identity<'l, const T: Tu>(
input: SwitchTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B # 32]>,
) -> CollectTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B # 32]> {
// B=32 elements × 1 byte (i8) = 32 bytes = one flit.
// Time and Packet pass through unchanged.
input.collect()
}
}
Sub-flit packet (padding added)
When the input packet is smaller than 32 bytes, collect pads to 32 bytes.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 16];
fn collect_padding<'l, const T: Tu>(
input: SwitchTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B # 32]> {
// B=16 elements × 1 byte = 16 bytes < 32 bytes.
// Padded to 32 bytes: Packet2 = m![B # 32].
// Time unchanged since it fits in one flit.
input.collect()
}
}
Multi-flit packet (outer absorbed into Time)
When the input packet exceeds 32 bytes, collect splits into flits and absorbs the outer portion into Time.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 32];
fn collect_multi_flit<'l, const T: Tu>(
input: SwitchTensor<'l, T, bf16, m![1], m![1], m![1], m![A], m![B]>,
) -> CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![A, B / 16], m![B % 16]> {
// B=32 elements × 2 bytes (bf16) = 64 bytes = 2 flits.
// Inner 16 elements = 32 bytes → Packet2 = m![B % 16].
// Outer 2 flits → absorbed into Time2 = m![A, B / 16].
input.collect()
}
}
Contraction Engine
The Contraction Engine performs einsum operations — tensor contractions such as matrix multiplication, convolution, and attention — which are the dominant computations in deep learning workloads.
The key mental model is weight-stationary execution: one operand (weights) is loaded into TRF (Tensor Register File) once and held fixed while the other streams through the pipeline, so maximizing TRF reuse minimizes memory traffic. As a kernel writer, you specify the einsum expression, input/output data types, and which tensor goes into the TRF as weights. The compiler maps this to the hardware components described below.
The rest of this chapter explains how einsum operations decompose into hardware primitives across the Contraction Engine’s two components: Aligner (Stream Adapter + TRF Sequencer) and Reducer.
Einsum
Einsum (Einstein summation) generalizes matrix multiplication to arbitrary tensors by specifying which dimensions to contract. For background, see Einsum Is All You Need.
// AB, BC -> AC
// AC[i, j] = sum(AB[i, k] * BC[k, j] for k in 0..B)
Every einsum decomposes into four fundamental steps:
- Broadcast LHS: Expand tensor T0: [A, B] to T0_prime: [A, B, C]
- Broadcast RHS: Expand tensor T1: [B, C] to T1_prime: [A, B, C]
- Elementwise multiply: Compute T2 = T0_prime * T1_prime
- Reduce-add: Sum over the contracted dimension to obtain T3: [A, C], where T3[i, j] = sum(T2[i, k, j] for k in 0..B)
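As a reference for these four steps, here is a minimal plain-Rust sketch (host-side code, not Virtual ISA; the function name and row-major layout are only illustrative) that computes AB, BC -> AC by explicit broadcast, elementwise multiply, and reduce-add:
/// Naive AB, BC -> AC einsum, spelled out as the four steps above.
/// `t0` is [A, B] and `t1` is [B, C], both row-major.
fn einsum_ab_bc(a: usize, b: usize, c: usize, t0: &[f32], t1: &[f32]) -> Vec<f32> {
    // Steps 1-3: broadcast both operands to [A, B, C] and multiply elementwise.
    let mut t2 = vec![0.0f32; a * b * c];
    for i in 0..a {
        for k in 0..b {
            for j in 0..c {
                // T0_prime[i, k, j] = T0[i, k]; T1_prime[i, k, j] = T1[k, j]
                t2[(i * b + k) * c + j] = t0[i * b + k] * t1[k * c + j];
            }
        }
    }
    // Step 4: reduce-add over the contracted dimension B.
    let mut t3 = vec![0.0f32; a * c];
    for i in 0..a {
        for j in 0..c {
            t3[i * c + j] = (0..b).map(|k| t2[(i * b + k) * c + j]).sum();
        }
    }
    t3
}
On the TCP, steps 1 and 2 never materialize T0_prime and T1_prime; the components listed in the table below realize the broadcasts implicitly.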
Overview
flowchart TB
subgraph CE[Contraction]
direction LR
SA[Stream Adapter] --> RD[Reducer]
TS[TRF Sequencer] --> RD
end
TRF[(TRF)] --> TS
CO[Collect] --> SA
RD --> VE[Vector]
click SA "./stream-adapter.html" "Stream Adapter"
click TS "./trf-sequencer.html" "TRF Sequencer"
click RD "./reducer.html" "Reducer"
click CO "../collect-engine.html" "Collect Engine"
click VE "../vector-engine/index.html" "Vector Engine"
The einsum steps map to diagram components:
| Einsum Step | Component |
|---|---|
| LHS broadcast | Switch Engine → Collect Engine → Aligner: Stream Adapter |
| RHS broadcast | Aligner: TRF Sequencer |
| Elementwise multiply | Reducer |
| Reduce-add | Reducer |
For reductions across slices or chips, the Vector Engine handles the final aggregation.
The following sections present case studies showing how common operations map to the Contraction Engine. Each case study shows a compiler-generated configuration dump; for the format definition, see Aligner and Reducer. For a beginner-friendly introduction, see the Hello, Contraction! Tutorial.
Case Studies
Batched MatMul
This section demonstrates how batched matrix multiplication maps to the Contraction Engine using the einsum VMK, VNK -> VMN, where V is the batch axis, M and N are the output axes, and K is the contraction axis.
Choose a mapping based on which axis is largest: use K contraction when K is large (maximizes Reducer efficiency), V vectorized when the batch axis V is large (maximizes temporal parallelism), and N×M tiled when both output axes are large (distributes work across TRF rows and time).
K contraction by Reducer
The Reducer can perform the K-axis contraction directly, placing K in the temporal dimension.
The following dump shows the resulting input, TRF, computation, and accumulation mappings:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [V=32, M=32, K=32] ] (1)
// TRF mapping: [ Row: [N=8] | H: [V=5, N/8=3, K=32] ] (1)
// Computation mapping: [ H: [V=32, M=32] | Row: [N=8] | T: [K=32] ] (1)
// Accumulation mapping: [ H: [V=32, M=32, N/8=3] | T: [N=8] ] (1)
V - vectorized mapping
The batch axis V can be placed in the temporal dimension for vectorized computation.
The following dump shows how V moves into the temporal dimension while M and K remain in their respective positions:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [M=5, K=32, V=32] ] (1)
// TRF mapping: [ Row: [N=8] | H: [N/8=3, K=32, V=32] ] (1)
// Computation mapping: [ H: [M=5, N/8=3, K=32] | Row: [N=8] | T: [V=32] ] (1)
// Accumulation mapping: [ H: [M=5, N=24] | T: [V=32] ] (1)
N x M - tiled mapping
Both output axes N and M can be tiled across the hardware for maximum parallelism.
The following dump shows both output axes distributed across the TRF Row and temporal dimensions:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [V=5, K=32, M=32] ] (1)
// TRF mapping: [ Row: [N=8] | H: [V=5, N/8=3, K=32, M=32] ] (1)
// Computation mapping: [ H: [V=5, N/8=3, K=32] | Row: [N=8] | T: [M=32] ] (1)
// Accumulation mapping: [ H: [V=5, N=24] | T: [M=32] ] (1)
Mixed configurations and constraints for these mappings are detailed in the Reducer section.
2D Convolution
This section demonstrates 2D convolution mapping using the einsum $(H + Fh)$(W + Fw)K, FhFwKC -> HWC, where H and W are spatial output axes, C is the output channel axis, and Fh, Fw, K are contraction axes (filter height, filter width, and input channels). Variations covered in the batched matmul section are omitted.
Filter-Stride 1
For stride-1 convolution, the Stream Adapter performs shift-reuse on the input to produce sliding windows. The $(H+Fh) sliding is done in the Fetch Engine before reaching the Stream Adapter, while the input $(W+Fw) undergoes shift-reuse in the Stream Adapter to produce Fw, W sliding in the computation. The example below uses shift-stride of 1 with two shifts:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=30, Fh=3, K=32, $(W=30 + Fw=3)=32] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=3, Fw=3] ] (1)
// Computation mapping: [ H: [H=30, C/8=3, Fh=3, K=32, Fw=3] | Row: [C=8] | T: [W=30+2#] ] (1)
// Accumulation mapping: [ H: [H=30, C=32] | T: [W=30+2#] ] (1)
Filter-Stride 2
For stride-2 convolution, shift-reuse with 1 shift and shift-stride of 2 extracts strided windows.
The transformation $(W:2=15 + Fw=4)=32 produces Fw/2=2, (W=15, Fw=2), conceptually extracting a size-2 axis with stride :2 from a linear combination as an outer product: $(W:2=15 + (Fw/2:2=2, Fw=2))=32 becomes Fw/2=2, $(W:2=15, Fw=2).
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=15, Fh=4, K=32, $(W:2=15 + Fw=4)=32] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=4, Fw=4] ] (1)
// Computation mapping: [ H: [H=15, C/8=3, Fh=4, K=32, Fw/2=2] | Row: [C=8] | T: [W=15+1#, Fw=2] ] (1)
// Accumulation mapping: [ H: [H=15, C=32] | T: [W=15+1#] ] (1)
To fully utilize MACs, fill more flits in the shift buffer by increasing feed_flits from the default of 2 to 3. The transformation $(W:2=16 + Fw=4)=34 then produces Fw/2=2, (W=16, Fw=2):
// Configuration: feed_flits = 3, input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=16, Fh=4, K=32, $(W:2=16 + Fw=4)=34] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=4, Fw=4] ] (1)
// Computation mapping: [ H: [H=16, C/8=3, Fh=4, K=32, Fw/2=2] | Row: [C=8] | T: [W=16, Fw=2] ] (1)
// Accumulation mapping: [ H: [H=16, C=32] | T: [W=16] ] (1)
Dilation 2
For dilation-2 convolution, shift-reuse with 2 shifts and shift-stride of 2 extracts dilated filter positions.
The transformation $(W=27 + Fw:2=3)=32 produces Fw=3, W=27, conceptually extracting a size-3 axis with stride :2 from a linear combination as an outer product: $(W=27 + Fw:2=3)=32 becomes Fw=3, $(W=27).
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=27, Fh=3, K=32, $(W=27 + Fw:2=3)=32] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=3, Fw=3] ] (1)
// Computation mapping: [ H: [H=27, C/8=3, Fh=3, K=32, Fw=3] | Row: [C=8] | T: [W=27+5#] ] (1)
// Accumulation mapping: [ H: [H=27, C=32] | T: [W=27+5#] ] (1)
Filter-Stride 2, Dilation 2
Combining stride-2 and dilation-2 requires shift operations similar to dilation 2 alone.
The transformation $(W:2=14 + Fw:2=3)=31 + 1# produces Fw=3, W=14, 1+1#, extracting a size-3 axis with stride :2 from a linear combination as an outer product: $(W:2=14 + Fw:2=3)=31 becomes Fw=3, $(W:2=14).
The notation 1z is similar to 1# (dummy padding) but filled with zeros instead of arbitrary values. The TRF must contain zero-padded dummies so that 1+1# contracted with 1+1z yields 1. Note that 1+1# contracted with 1+1# would yield 1#.
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=14, Fh=3, K=32, $(W:2=14 + Fw:2=3)=31+1#] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=3, Fw=3, 1+1z] ] (1)
// Computation mapping: [ H: [H=14, C/8=3, Fh=3, K=32, Fw=3] | Row: [C=8] | T: [W=14+2#, 1+1z] ] (1)
// Accumulation mapping: [ H: [H=14, C=32] | T: [W=14+2#] ] (1)
Constraints
Mapping alignment (compiler-enforced): The computation mapping from the Stream Adapter must exactly match the computation mapping from the TRF Sequencer. Misaligned mappings prevent the Reducer from operating correctly. The compiler ensures this alignment during code generation.
TRF capacity (programmer responsibility): TRF storage limits constrain weight tensor size. Large weight tensors require tiling and multiple SRAM-to-TRF loads, adding overhead. The TRF Sequencer can broadcast weight data across time and head dimensions, enabling efficient reuse of loaded weights.
Row utilization (programmer responsibility): The hardware provides 8 Rows. Operations should use all 8 Rows when possible to maximize throughput. Using fewer Rows (1, 2, 4) reduces parallelism and effective computational bandwidth.
Stream Adapter buffer limits (compiler-enforced): The Stream Adapter has limited buffer capacity for shift operations and packet collection. Configurations exceeding these limits are invalid. See Stream Adapter documentation for specific capacity constraints.
Data type support (compiler-enforced): Input data types are limited to i4, i8, f8, and bf16. Output types are widened automatically (i4/i8 → i32, f8/bf16 → f32). The Reducer does not support f32 input directly, though f32 can be processed in the Vector Engine after contraction.
Performance
Contraction Engine performance depends on Row utilization, TRF reuse, and memory bandwidth:
Row parallelism: Using all 8 Rows achieves 8× parallelism. Configurations with fewer Rows proportionally reduce throughput. Configure tensor mappings to distribute work across all available Rows.
TRF reuse through broadcasting: Weight data loaded into TRF can be broadcast across time and head dimensions at no additional cost. Design tensor mappings to maximize weight reuse through broadcasting, minimizing SRAM-to-TRF transfers.
Pipeline latency: The Contraction Engine pipeline includes the Reducer (5-7 cycles for spatial reduction depending on data type, plus cycles proportional to time dimension size for temporal reduction). Total latency is the sum of these stages plus Stream Adapter/TRF Sequencer overhead.
Memory bandwidth bottlenecks: The Stream Adapter is limited by DM fetch bandwidth (256 B/cycle with proper interleaving). TRF bandwidth is typically not a bottleneck due to the broadcasting capability. Ensure fetch patterns interleave across DMNs and slices to maximize bandwidth utilization.
Aligner
The Aligner stage prepares both operands for the Reducer by transforming them into a matching computation mapping.
The computation mapping is the common tensor layout ([Chip, Cluster, Slice, Row, Time, Packet]) that both the Stream Adapter and TRF Sequencer must produce so the Reducer can pair them element-by-element.
It is positioned within the Contraction Engine data flow as follows:
fetch() -> switch() -> collect() -> align(trf) -> contract() -> accumulate()
The Aligner consists of two parallel paths:
| Path | Component | Source | Role |
|---|---|---|---|
| Data | Stream Adapter | Collect Engine (Stream data from DM) | Collect flits, broadcast to Rows |
| Weight | TRF Sequencer | TRF (weight data) | Broadcast and transform weight data |
Overview
┌───────────────────────────────────────────────────┐
│ Aligner │
│ │
│ ┌─────────────────────┐ │
Switching ──────► │ Stream Adapter ────────►│ │ │
Engine │ │ Computation mapping │───► Reducer
│ | | │
TRF ────────────► │ TRF Sequencer ────────►│ │ │
│ └─────────────────────┘ │
│ │
└───────────────────────────────────────────────────┘
The computation mapping consists of the following dimensions:
- Chip: No change from Stream Adapter/TRF Sequencer input
- Cluster: No change from Stream Adapter/TRF Sequencer input
- Slice: No change from Stream Adapter/TRF Sequencer input
- Row: Maps to the 8 Rows in the Reducer
- Time: The temporal dimension for sequential processing
- Packet: Data packet dimension
The key difference between the two paths is:
- Stream Adapter: Always populates Rows via broadcasting, and supports basic flit collection and data feeding for convolutions.
- TRF Sequencer: Leverages a sequencer to enable more complex data transformations.
Example: Batched MatMul
A batched matrix multiplication demonstrates how the Stream Adapter and TRF Sequencer align data and weights into a matching computation mapping (each detailed in the Stream Adapter and TRF Sequencer sub-sections). The code below does three things:
- Flit Collection (Stream Adapter, collect_flits = 2): L = 2 flits are collected from the innermost Time axis into the Packet dimension, forming a 64B packet. The collected data is broadcast to Rows (1, 2, 4, or 8 rows depending on the computation mapping).
- Packet Broadcast (TRF Sequencer, reg_read_size = 32B): The TRF Sequencer reads 32B (K = 16 bf16) contiguously each cycle and broadcasts twice to fill the 64 bytes, matching the Stream Adapter's Packet.
- Time Permute (TRF Sequencer): The order of axes in TRF Element [O = 2, M = 32, K = 16] does not match Time: [M = 32, O = 2]. The sequencer reorders this by placing O in Entry 0 (inner loop) with stride 1024, while M uses Entry 1 (outer loop) with stride 32.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, L = 2, O = 2];
/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, O, M], m![K]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, M, K]> {
// TRF mapping: [
// Row: [N = 8]: 8 output channels mapped to 8 Rows
// Element: [O = 2, M = 32, K = 16]: each Row stores 2×32×16 = 1024 bf16 elements
// ]
weights.to_trf(TrfAddress::FirstHalf)
}
/// Aligns data and weights, then contracts (main context).
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![M, O, L], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, M, K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, O, L], m![N]> {
// Collect mapping: [Time: [M = 32, O = 2, L = 2], Packet: [K = 16]]
// TRF mapping: [Row: [N = 8], Element: [O = 2, M = 32, K = 16]]
//
// Stream Adapter (collect_flits = 2):
// Flit Collection:
// Collects L = 2 flits from innermost Time into Packet.
// After collection, the computation mapping dimensions become:
// Time = [M = 32, O = 2], Packet = [L = 2, K = 16] = 32 bf16 = 64B
// Broadcasts Packet to Rows (N = 8).
//
// TRF Sequencer (reg_read_size = 32B):
// Packet Broadcast:
// reg_read_size read: reads K = 16 bf16 = 32B contiguously from TRF,
// then broadcasts 2× to fill the 64B — matching Packet = [L = 2, K = 16].
// Time Permute:
// TRF Element outer of reg_read_size(K) is [O = 2(outer), M = 32(inner)],
// but Time is [M = 32(outer), O = 2(inner)] — M, O are reordered via sequencer.
//
// Compiler-generated TRF Sequencer configuration:
// Entry 0: { size: 2, stride: 1024 } — O (inner loop, stride = K×M×sizeof(bf16))
// Entry 1: { size: 32, stride: 32 } — M (outer loop, stride = K×sizeof(bf16))
//
// Computation mapping: [Time: [M = 32, O = 2], Row: [N = 8], Packet: [L = 2, K = 16]]
// Output mapping: [Time: [M = 32, O = 2, L = 2], Packet: [N = 8]]
// (K is contracted, column major)
input.align::<m![M, O], m![L, K], _, _>(trf)
.contract::<m![L]>()
.accumulate::<m![M, O, L], m![N]>(AccumulationKind::Interleaved)
}
}
For details on each component, see the sub-sections:
- Stream Adapter — Flit collection, Rows broadcast
- Advanced Operations — Transpose, Shift (for convolutions)
- TRF Sequencer — SRAM-to-TRF, weight broadcasting
Stream Adapter
The Stream Adapter is part of the Aligner stage. It transforms activation data from the Collect Engine into the computation mapping required by the Reducer. It collects incoming flits into properly sized packets and broadcasts them across Rows, enabling data reuse across output channels. This operation is the data-side counterpart to the TRF Sequencer, which prepares weight data on the other side.
Interface
The Stream Adapter is configured through the align method on CollectTensor (see TRF Sequencer — Interface for the full API).
The Time and Packet type parameters determine how the Stream Adapter reshapes the input:
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
impl<'l, const T: Tu, D, Chip, Cluster, Slice, Time, Packet>
CollectTensor<'l, { T }, D, Chip, Cluster, Slice, Time, Packet>
{
/// Aligns this input stream with a TRF tensor for contraction.
/// Configures both the Stream Adapter (data path) and TRF Sequencer (weight path)
/// to produce a matching computation mapping.
pub fn align<OutTime: M, OutPacket: M, Row: M, TrfElement: M>(
self,
trf_tensor: &TrfTensor<D, Chip, Cluster, Slice, Row, TrfElement>,
) -> AlignedPair<'l, { T }, D, Chip, Cluster, Slice, Row, OutTime, OutPacket> {
// Hardware implementation: configures Stream Adapter and TRF Sequencer
}
}
The typical data flow is: switch() → collect() → align(&trf) → contract() → accumulate() for activations (main context).
The Chip, Cluster, and Slice dimensions pass through unchanged.
Architecture
Conceptual Operation
The Stream Adapter transforms the collect tensor mapping into the computation mapping:
Collect mapping: [Chip, Cluster, Slice, Time, Packet]
↓ Stream Adapter (collect + broadcast)
Computation mapping: [Chip, Cluster, Slice, Row, OutTime, OutPacket]
This transformation involves three operations:
- Collect: Buffer collect_flits incoming 32-byte flits from the innermost Time axis into Packet, creating the OutTime and OutPacket mappings.
- Rows broadcast: Broadcast the collected OutPacket to 1, 2, 4, or 8 Rows (determined by the computation mapping).
- Time broadcast: Repeat the same activation data across tiling axes in OutTime.
For advanced operations (transpose, shift-and-reuse for convolutions), see Advanced Operations.
Flit Buffer
The Flit Buffer buffers incoming flits so the Reducer receives data in properly sized units.
The Collect Engine sends data in 32-byte flits.
The collect_flits parameter controls how many consecutive flits are collected into one OutPacket:
| collect_flits | Data per Packet | Zero padding | MAC utilization | Use case |
|---|---|---|---|---|
| 1 | 32 bytes | 32 bytes | Half | Small data where a single flit covers the Packet axis |
| 2 (default) | 64 bytes | None | Full | Standard — full mac_width utilization |
| 3 | 96 bytes | N/A | Full | Shift-reuse with padding (see Advanced) |
OutPacket is always 64 bytes (mac_width).
The collect_flits parameter determines how much of that 64 bytes is actual data versus zero padding.
When collect_flits = 2, the innermost Time axis is consumed into Packet.
For example, if the collect mapping has Time: [..., L = 2] and Packet: [K = 16], collecting L = 2 produces Packet = [L = 2, K = 16] = 32 bf16 elements = 64 bytes of data, filling the entire mac_width.
When collect_flits = 1, no Time axis is consumed.
The original Packet (32 bytes) occupies the first half, and the remaining 32 bytes are zero-padded.
Only half the MACs produce meaningful results — the zero-padded half always multiplies by zero.
The Flit Buffer has 96-byte physical capacity: up to 3 single-channel flits (32 bytes each) or 1 dual-channel flit (64 bytes).
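The data-versus-padding split in the table is simple arithmetic. The following plain-Rust sketch (host-side illustration, not Virtual ISA; it ignores the shift-reuse details of collect_flits = 3 and uses the flit and mac_width sizes stated above) returns the split per OutPacket:
const FLIT_BYTES: usize = 32;
const MAC_WIDTH_BYTES: usize = 64;

/// Returns (data_bytes, zero_pad_bytes) inside the 64-byte OutPacket
/// for a given collect_flits setting (1, 2, or 3).
fn out_packet_fill(collect_flits: usize) -> (usize, usize) {
    let data = (collect_flits * FLIT_BYTES).min(MAC_WIDTH_BYTES);
    (data, MAC_WIDTH_BYTES - data)
}

fn main() {
    assert_eq!(out_packet_fill(1), (32, 32)); // half MAC utilization
    assert_eq!(out_packet_fill(2), (64, 0));  // full mac_width
    assert_eq!(out_packet_fill(3), (64, 0));  // shift-reuse; see Advanced Operations
}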
Rows Broadcast
After collection, the Stream Adapter broadcasts the same OutPacket data to multiple Rows.
The number of Rows receiving the broadcast is determined by the computation mapping: 1, 2, 4, or 8.
This is in contrast to the TRF Sequencer, where each Row reads different weight data from its own TRF partition. The Reducer then multiplies each Row’s shared activation data against its unique weights.
┌─── Row 0: Packet (same data)
Stream Adapter ──┼─── Row 1: Packet (same data)
(rows=4) ├─── Row 2: Packet (same data)
└─── Row 3: Packet (same data)
Time Broadcast
When the computation mapping includes Time axes that have no corresponding axes in the activation data, the Stream Adapter tiles the input data.
For example, if the TRF data has a T = 5 axis that the activation data lacks, the Stream Adapter tiles the input Packet 5 times.
┌─── T = 0: Packet (same data)
├─── T = 1: Packet (same data)
Time broadcast ───┼─── T = 2: Packet (same data)
(T = 5) ├─── T = 3: Packet (same data)
└─── T = 4: Packet (same data)
Tiling axes are placed at the innermost positions of OutTime.
Multiple tiling axes can be used.
Specifications
| Parameter | Values | Description |
|---|---|---|
| collect_flits | 1, 2, 3 | Number of 32-byte flits collected per OutPacket |
| Flit Buffer capacity | 96 bytes | Physical buffer limit (3 × 32-byte flits) |
| OutPacket size | Always 64B | = mac_width; zero-padded when collect_flits = 1 |
| Rows | 1, 2, 4, 8 | Number of Rows receiving the broadcast (from computation mapping) |
| Tiling axes | Any size, stride = 0 | Time axes that broadcast activation data without re-fetching |
Performance
For collect_flits = 1 or 2, the Stream Adapter is effectively a pass-through with no overhead.
The collect_flits = 3 case (shift-reuse) introduces additional latency; see Advanced Operations.
Examples
collect_flits = 2 (Flit Collection)
This example collects L = 2 flits from the innermost Time axis into Packet, producing a 64B OutPacket (computation packet):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, L = 2, O = 2];
fn align<'l, const T: Tu>(
input: CollectTensor<'l, { T }, bf16, m![1], m![1], m![1], m![M, O, L], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, K]>,
) -> AlignedPair<'l, { T }, bf16, m![1], m![1], m![1], m![N], m![M, O], m![L, K]> {
// Collect mapping: [Time: [M=32, O=2, L=2], Packet: [K=16]]
//
// Stream Adapter (collect_flits = 2):
// Flit Collection:
// Collects L = 2 flits from innermost Time into Packet:
// Time = [M = 32, O = 2], Packet = [L = 2, K = 16] = 32 bf16 = 64B
// Broadcasts Packet to Rows (N = 8).
//
// Computation mapping:
// [Time: [M = 32, O = 2] | Row: [N = 8] | Packet: [L = 2, K = 16]]
input.align::<m![M, O], m![L, K], _, _>(trf)
}
}
collect_flits = 1 (No Collection)
When the Packet axis already covers the contraction dimension and no additional flits need to be collected:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16];
fn align<'l, const T: Tu>(
input: CollectTensor<'l, { T }, bf16, m![1], m![1], m![1], m![M], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AlignedPair<'l, { T }, bf16, m![1], m![1], m![1], m![N], m![M], m![K # 32]> {
// Switch mapping: [Time: [M = 32], Packet: [K = 16]]
//
// Stream Adapter (collect_flits = 1):
// Flit Collection:
// No Time axis collected — data = [K = 16] = 16 bf16 = 32B.
// Packet = [K = 16 # 32] = 64B (32B data + 32B zero padding).
// Broadcasts Packet to Rows (N = 8).
// Half MAC utilization — zero-padded half always multiplies by zero.
//
// Computation mapping:
// [Time: [M = 32] | Row: [N = 8] | Packet: [K = 16 # 32]] (64 bytes)
input.align::<m![M], m![K # 32], _, _>(trf)
}
}
Time Broadcast
When the TRF has axes not present in the input data, the Stream Adapter tiles the activation across Time:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, T = 5];
fn align<'l, const T: Tu>(
input: CollectTensor<'l, { T }, bf16, m![1], m![1], m![1], m![M], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![T, K]>,
) -> AlignedPair<'l, { T }, bf16, m![1], m![1], m![1], m![N], m![M, T], m![K # 32]> {
// Collect mapping: [Time: [M = 32], Packet: [K = 16]]
//
// Stream Adapter (collect_flits = 1):
// Flit Collection:
// No Time axis collected — Packet = [K = 16 # 32] (32B data + 32B zero padding).
// Rows Broadcast: N = 8.
// Time Broadcast: T = 5 - activation tiled 5 times per M position.
//
// Computation mapping:
// [Row: [N = 8], Time: [M = 32, T = 5], Packet: [K = 16 # 32]]
input.align::<m![M, T], m![K # 32], _, _>(trf)
}
}
TRF Sequencer
The TRF Sequencer is part of the Aligner stage.
It reads weight data from the Tensor Register File (TRF) and reshapes it to match the computation mapping required by the Reducer.
It broadcasts stored weights across the temporal (via sequencer) and spatial (via reg_read_size) dimensions, enabling weight reuse without additional memory usage.
This operation is the weight-side counterpart to the Stream Adapter, which prepares activation data on the input side.
Interface
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
/// TRF address modes for partitioning the register file.
enum TrfAddress {
FirstHalf, // First half of TRF
SecondHalf, // Second half of TRF
Full, // Entire TRF
}
impl<'l, const T: Tu, D, Chip, Cluster, Slice, Time, Packet>
CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Stores tensor data from the Collect Engine into TRF.
/// The outermost axes of the input become the Row dimension,
/// and the remaining inner axes become the Element dimension.
/// The resulting layout is [Chip, Cluster, Slice, Row, Element].
pub fn to_trf<Row: M, Element: M>(
self,
address: TrfAddress,
) -> TrfTensor<D, Chip, Cluster, Slice, Row, Element> {
// Hardware implementation: writes data to TRF via SRAM-to-TRF
}
/// Aligns this input stream with a TRF tensor for contraction.
/// Configures the TRF Sequencer to reshape the TRF tensor
/// to match the computation mapping.
pub fn align<OutTime: M, OutPacket: M, Row: M, TrfElement: M>(
self,
trf_tensor: &TrfTensor<D, Chip, Cluster, Slice, Row, TrfElement>,
) -> AlignedPair<'l, T, D, Chip, Cluster, Slice, Row, OutTime, OutPacket> {
// Hardware implementation: configures Stream Adapter and TRF Sequencer
}
}
The typical data flow is: collect() → to_trf() for weights (sub context), then collect() → align(&trf) → contract() → accumulate() for activations (main context).
The Chip, Cluster, and Slice dimensions pass through unchanged.
Architecture
Conceptual Operation
The TRF Sequencer transforms the TRF tensor mapping into the computation mapping:
TRF mapping: [Chip, Cluster, Slice, Row, Element]
↓ TRF Sequencer
Computation mapping: [Chip, Cluster, Slice, Row, OutTime, OutPacket]
This transformation involves four operations:
- Spatial Read: Fill in OutPacket (which is 64 bytes), with the mechanism involving reg_read_size.
- Row Partitioning: Each Row reads from its own TRF region.
- Temporal Broadcasting: Axes with stride 0 are broadcast across Time, reusing the same weight data each cycle.
- Time Reordering: The Time axes are reordered via a nested-loop sequencer configuration.
SRAM
│
│ SRAM-to-TRF (short command or tensor unit path)
▼
┌──────────────┐ TRF mapping: [Chip, Cluster, Slice, Row, Element]
│ TRF │
└──────┬───────┘
│ TRF Sequencer (nested-loop config)
▼
┌──────────────┐ Computation mapping: [Chip, Cluster, Slice, Row, OutTime, OutPacket]
│ Reducer │◄── Stream Adapter (activation data)
└──────────────┘
TRF Read Mechanism
Every cycle, the Reducer consumes exactly mac_width (64 bytes) of data. This 64-byte window is composed of two parts:
┌──────────── mac_width (64 bytes) ────────────┐
│ broadcast │ reg_read_size (contiguous) |
│ ← (repeated) → │ ← inner (from TRF) → |
└────────────────┴─────────────────────────────┘
- reg_read_size: The number of contiguous bytes read from TRF each cycle. Must be a power of two: 1, 2, 4, 8, 16, 32, or 64 bytes.
- broadcast: The portion of mac_width not covered by reg_read_size is filled by repeating the read data. For example, if reg_read_size = 8, the 8 bytes are broadcast 8× to fill 64 bytes.
The inner part (within reg_read_size) is always read contiguously each cycle. The sequencer does not control this region.
The outer part (beyond mac_width) is controlled by the sequencer entries’ (size, stride) pairs, which specify the iteration order over the remaining dimensions.
Note
reg_read_size is not a user-specified parameter. The compiler determines it by comparing the innermost axes of the TRF mapping and the computation mapping: the contiguous portion that is common to both (within 64 bytes) becomes reg_read_size. This means neither the TRF mapping alone nor the computation mapping alone determines reg_read_size — it is derived from their intersection.
Note
64-byte alignment constraint: When reg_read_size = 64 bytes (i.e., equal to mac_width), the base address and all sequencer strides must be aligned to 64 bytes. A 64-byte read spans both bank columns (32 bytes each); if the address is not 64-byte aligned, the read would cross a bank column boundary, which the hardware does not support.
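To make the reg_read_size / broadcast split concrete, the following plain-Rust sketch (host-side illustration, not Virtual ISA; the function name is hypothetical, and mac_width = 64 bytes comes from the text above) assembles one 64-byte window by repeating a contiguous reg_read_size-byte read:
const MAC_WIDTH: usize = 64;

/// Builds the 64-byte window consumed by the Reducer in one cycle:
/// `reg_read_size` contiguous bytes are read from `trf` at `base`,
/// then repeated to fill mac_width (e.g. 8 bytes are broadcast 8×).
fn read_window(trf: &[u8], base: usize, reg_read_size: usize) -> [u8; MAC_WIDTH] {
    assert!(reg_read_size.is_power_of_two() && reg_read_size <= MAC_WIDTH);
    let chunk = &trf[base..base + reg_read_size];
    let mut window = [0u8; MAC_WIDTH];
    for (i, byte) in window.iter_mut().enumerate() {
        *byte = chunk[i % reg_read_size]; // broadcast = repetition of the inner read
    }
    window
}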
SRAM-to-TRF (StoTRF)
Data is loaded into TRF after the Collect Engine. If the sequencer is configured for completely contiguous access (no gaps or reordering), the load can be executed as a short command (a compact hardware instruction that bypasses the full tensor unit pipeline).
If this condition is not met, the load goes through the full tensor unit path (SRAM → Fetch → Switch → Collect → to_trf()), which supports arbitrary layouts via the fetch engine but has higher setup overhead.
TRF Memory Layout
The TRF is a banked SRAM organized as 8 bank rows × 2 bank columns.
Each bank row corresponds to a Row.
Each bank contains 128 rows (Full mode) or 64 rows (Half mode), and each row provides 320 bits of storage, holding 32 bytes of element data in widened slots (see reg_types below):
bank row 7 ──────────────────────────────────────────┐
(= Row 7) bank col 0 bank col 1 │
: ┌─ 320b ─┐ ┌─ 320b ─┐ ╱
: ╱ ╱| ╱ ╱| ╱
bank row 1 ╱─────────╱ | ╱─────────╱ | ╱ bank row
bank row 0╱ ╱ | ╱ ╱ | ╱ (= Row)
│ │ │ │ │ │
│ 128 rows│ ╱ │ 128 rows│ ╱
│ │╱ │ │╱
└─────────┘ └─────────┘
Each bank row corresponds to a Row and can be accessed independently in parallel. The 2 bank columns within each bank row share the same row address space.
Each element in the TRF is addressed via a bit-field index:
┌───────────┬──────────────┬──────────┬──────────┐
│ bank row │ row in bank │ bank col │ offset │
│ (3 bit) │ (7b / 6b) │ (1 bit) │ (6 bit) │
└───────────┴──────────────┴──────────┴──────────┘
| Field | Bits | Description |
|---|---|---|
| bank row | 3 | Selects Row (0–7). Each bank row corresponds directly to a Row and can be accessed independently, enabling parallel reads. When rows < 8, unused bits extend the row address space |
| row in bank | 7 (Full) / 6 (Half) | Selects row within a bank: 128 rows (Full) or 64 rows (Half). FirstHalf uses the lower 64 rows (rows 0–63) and SecondHalf uses the upper 64 rows (rows 64–127), so Half mode needs only 6 bits |
| bank col | 1 | Selects bank column (2 columns per bank) |
| offset | 6 | Element offset within a row in 5-bit granularity (64 positions × 5 bits = 320 bits per row) |
The reg_types value determines how many 5-bit slots each element occupies:
| reg_types | Element Width | Slots per Element | Elements per Row |
|---|---|---|---|
| 0 | 5-bit (i4 extended) | 1 | 64 |
| 1 | 10-bit (i4→i8) | 2 | 32 |
| 2 | 10-bit (i8→f8) | 2 | 32 |
| 3 | 20-bit (bf16) | 4 | 16 |
When rows < 8, the unused bank row bits effectively increase the per-Row capacity. For example, with rows = 4, one extra bit extends row in bank from 7 to 8 bits, doubling the rows available per Row.
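The bit-field layout above can be illustrated with a small plain-Rust sketch (host-side illustration, not Virtual ISA; the function name is hypothetical, and the field widths follow the Full-mode table):
/// Packs a Full-mode TRF element index from its bit fields:
/// [ bank row (3) | row in bank (7) | bank col (1) | offset (6) ].
fn trf_index_full(bank_row: u32, row_in_bank: u32, bank_col: u32, offset: u32) -> u32 {
    assert!(bank_row < 8 && row_in_bank < 128 && bank_col < 2 && offset < 64);
    (bank_row << 14) | (row_in_bank << 7) | (bank_col << 6) | offset
}
In Half mode the row-in-bank field shrinks to 6 bits, and when fewer than 8 Rows are used the spare bank-row bits extend the per-Row address space, as noted in the table.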
Specifications
TRF Address Modes
| Mode | Region | Capacity |
|---|---|---|
| FirstHalf | First half of TRF | register_file_size / 2 |
| SecondHalf | Second half of TRF | register_file_size / 2 |
| Full | Entire TRF | register_file_size |
These address modes partition the TRF so that tensors stored in different regions do not interfere with each other.
Full dedicates the entire TRF to a single tensor.
FirstHalf and SecondHalf isolate up to two tensors, allowing them to coexist in TRF simultaneously (for example, one half can be read by the Sequencer while the other is written by SRAM-to-TRF, enabling double buffering).
The two halves can be flipped between iterations.
Rows
Each Row maps directly to a bank row in TRF. Since bank rows are physically independent, all Rows can read in parallel without contention.
| rows | Description |
|---|---|
| 1 | Single Row (1 bank row used) |
| 2 | 2 Rows (2 bank rows used) |
| 4 | 4 Rows (4 bank rows used) |
| 8 | 8 Rows (all bank rows used) |
Each Row reads the same sequencer pattern from a different TRF offset (different bank row, same row_in_bank/bank_col/offset).
Sequencer Configuration
The TRF Sequencer uses the same nested-loop configuration as all other sequencers (see Sequencer):
| Parameter | Range | Description |
|---|---|---|
| Entries | 1–8 | Each entry is a (size, stride) pair |
| size per entry | 1–65,536 | Iteration count for this dimension |
| stride per entry | signed 32-bit | Address increment per iteration |
The sequencer entries control iteration over the outer part — the dimensions beyond mac_width. The inner part (within reg_read_size) is read contiguously each cycle and is not represented in the sequencer entries. See TRF Read Mechanism for how the inner and outer parts relate.
Axes with stride = 0 are broadcast: the same data is repeated for each iteration of that dimension.
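As an illustration of the nested-loop semantics, this plain-Rust sketch (host-side illustration, not Virtual ISA; the function name is hypothetical) enumerates the byte addresses generated by a list of (size, stride) entries, with entry 0 treated as the innermost loop, as in the compiler-generated configurations shown earlier; a stride of 0 reproduces the broadcast behaviour described above:
/// Enumerates the addresses produced by sequencer entries, entry 0 innermost.
/// Each address is base + sum(index_i * stride_i).
fn sequencer_addresses(base: i64, entries: &[(u64, i64)]) -> Vec<i64> {
    let total: u64 = entries.iter().map(|&(size, _)| size).product();
    (0..total)
        .map(|mut n| {
            let mut addr = base;
            for &(size, stride) in entries {
                addr += (n % size) as i64 * stride; // index along this entry
                n /= size;
            }
            addr
        })
        .collect()
}

// Example: Entry 0 { size: 2, stride: 1024 } and Entry 1 { size: 32, stride: 32 }
// (the configuration from the Aligner batched-matmul example) visit
// 0, 1024, 32, 1056, 64, 1088, ... : O varies fastest, M advances by 32 bytes.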
Performance
TRF Cache
Purpose
The TRF bank columns are shared between read (main context sequencer) and write (sub context StoTRF). A read cache sits between the TRF banks and the Reducer so that cache hits serve reads without occupying a bank — freeing the bank for concurrent StoTRF writes.
Structure
The cache is direct-mapped:
| Parameter | Value |
|---|---|
| Rows per Row | 4 rows × 2 bank columns = 8 entries |
| Entry size | 32 bytes |
| Rows | 8 |
| Total capacity | 8 × 4 × 2 × 32 = 2,048 bytes |
Operation
- First read (cache miss) — data is fetched from the TRF bank and loaded into the cache. The bank is occupied for this cycle.
- Subsequent reads to the same address (cache hit) — data is served from the cache. The bank is not occupied, allowing StoTRF writes to proceed simultaneously.
Bank Conflict Priority
When a cache miss and an StoTRF write target the same bank column in the same cycle, read has higher priority than write. This means frequent cache misses can stall concurrent StoTRF operations.
Impact of reg_read_size
| reg_read_size | Bank columns used per cycle | Cache miss impact |
|---|---|---|
| ≤ 32 bytes | 1 column | A miss on the same bank column as a concurrent StoTRF write will still cause a conflict. However, if the innermost sequencer entry interleaves reads at 32-byte granularity across the two bank columns, then the sequencer and StoTRF alternate columns on successive cycles — avoiding degradation even during misses. |
| 64 bytes | Both columns | A miss occupies both columns simultaneously, blocking StoTRF for that cycle. Write throughput degrades in proportion to the miss rate. |
Examples
Basic Weight Broadcasting (MatMul)
This example shows a matrix multiplication where weights are stored in TRF and broadcast across the M (output row) dimension:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 32];
/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, K / 16], m![K % 16]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]> {
// TRF mapping: [
// Row: [N = 8]: output channels mapped to 8 Rows
// Element: [K = 32]: 32 weight elements stored per Row
// ]
weights.to_trf(TrfAddress::Full)
}
/// Performs matmul contraction (main context).
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K / 16], m![K % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M], m![N]> {
// TRF mapping: [
// Row: [N = 8],
// Element: [K = 32]
// ]
// Collect mapping: [
// Time: [M = 32, K / 16 = 2],
// Packet: [K % 16 = 16]
// ]
// Computation mapping (after collect_flits = 2 collection): [
// Time: [M = 32],
// Row: [N = 8],
// Packet: [K = 32]
// ]
//
// reg_read_size: K = 32 bf16 elements = 64 bytes = mac_width
// → reg_read_size = 64B (no broadcast, full mac_width read each cycle)
// → 64B alignment required: base and strides must be 64-byte aligned
//
// Compiler-generated TRF Sequencer configuration:
// Entry 0: { size: 32, stride: 0 } — M (broadcast, not in TRF)
// 1. M = 32 is broadcast (stride 0): weights reused for each M iteration
// 2. N = 8 maps to Row: each Row reads from its TRF partition
// 3. K axis is contracted, and remaining Time: [M], Row: [N]
// By using column major, outputs Time: [M], Packet: [N]
input.align::<m![M], m![K], _, _>(trf)
.contract::<m![1]>()
.accumulate::<m![M], m![N]>(AccumulationKind::Interleaved)
}
}
Small reg_read_size
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, L = 2, O = 2];
/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, O], m![K]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, K]> {
// TRF mapping: [
// Row: [N = 8]: output channels mapped to 8 Rows
// Element: [O = 2, K = 16]: 2 × 16 = 32 weight elements stored per Row
// ]
weights.to_trf(TrfAddress::FirstHalf)
}
/// Performs matmul contraction (main context).
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![O, M, L], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![O, M, L], m![N]> {
// TRF mapping: [
// Row: [N = 8],
// Element: [O = 2, K = 16]
// ]
// Computation mapping: [
// Time: [O = 2, M = 32],
// Row: [N = 8],
// Packet: [L = 2, K = 16]
// ]
//
// reg_read_size: K=16 bf16 elements = 32 bytes (= mac_width/2)
// → reg_read_size = 32B (outer size 2 is broadcast)
//
// Compiler-generated TRF Sequencer configuration:
// Entry 0: { size: 32, stride: 0 } — M (broadcast, not in TRF)
// Entry 1: { size: 2, stride: 32 } — O (direct from O)
// 1. M = 32 is broadcast (stride 0): weights reused for each M iteration
// 2. N = 8 maps to Row: each Row reads from its TRF partition
// 3. K axis is contracted, and remaining Time: [O, M], Row: [N], Packet: [L]
// By using column major, outputs Time: [O, M, L], Packet: [N]
input.align::<m![O, M], m![L, K], _, _>(trf)
.contract::<m![L]>()
.accumulate::<m![O, M, L], m![N]>(AccumulationKind::Interleaved)
}
}
TODO: Read (Main) and Write (Sub) to TRF at the same time
Reducer
The Reducer performs elementwise multiplication followed by reduce-add. Each slice’s Reducer contains 8 independent Rows, which are parallel MAC lanes that each process a different weight channel. It receives input data from the Stream Adapter and weight data from the TRF Sequencer.
Interface
The Reducer is invoked via .align() followed by .contract() and .accumulate():
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
impl CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet> {
/// Aligns input stream and TRF to computation mapping.
pub fn align<OutTime: M, OutPacket: M, Row: M, TrfElement: M>(
self,
trf: &TrfTensor<D, Chip, Cluster, Slice, Row, TrfElement>,
) -> AlignedPair<'l, T, D, Chip, Cluster, Slice, Row, OutTime, OutPacket>;
}
impl AlignedPair<'l, T, D, Chip, Cluster, Slice, Row, Time, Packet> {
/// Performs spatial reduction: elementwise multiplication followed by reduce-add
/// across the Packet dimension via the hardware reduction tree.
/// Data type is widened during contraction: i4/i8 -> i32, f8/bf16 -> f32.
pub fn contract<OutPacket: M>(
self,
) -> ContractionTensor<'l, T, OutD, Chip, Cluster, Slice, Row, Time, OutPacket>;
}
impl ContractionTensor<'l, T, D, Chip, Cluster, Slice, Row, Time, Packet> {
/// Performs temporal accumulation: accumulates values over the Time dimension
/// and produces the final contraction output.
pub fn accumulate<OutTime: M, OutPacket: M>(
self, kind: AccumulationKind,
) -> AccumulationTensor<'l, T, D, Chip, Cluster, Slice, OutTime, OutPacket>;
}
The Reducer computes the dot product of input stream \(X\) and TRF weights \(W\):
$$\text{output}[i] = \sum_{j} X[i, j] \times W[i, j]$$
The summation index \(j\) corresponds to axes removed during reduction:
- Spatial reduction removes axes from the Packet dimension via the hardware reduction tree
- Temporal reduction removes axes from the Time dimension via the accumulator buffer
The output mapping is determined by which axes survive reduction: OutPacket contains Packet axes after spatial reduction, and OutTime contains Time axes after temporal reduction.
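A host-side reference for this formula, as a minimal plain-Rust sketch (not Virtual ISA code; the function name is illustrative, and inputs are assumed to be already widened to f32), computes the per-Row dot product for one time step:
/// Reference for one Reducer time step: for each Row i,
/// output[i] = sum over j of X[i, j] * W[i, j], with j ranging over the reduced Packet axes.
fn reduce_step(x: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<f32> {
    x.iter()
        .zip(w)
        .map(|(xi, wi)| xi.iter().zip(wi).map(|(a, b)| a * b).sum())
        .collect()
}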
Examples
Matrix Multiplication
Matrix multiplication with 8 Rows operating in parallel:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 32, B = 32, C = 8];
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![A, B / 16], m![B % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![C], m![B]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![A], m![C]> {
// Computation mapping: [
// Time: [A = 32],
// Row: [C = 8],
// Packet: [B = 32] (32 bf16 elements = 64 bytes)
// ]
//
// Spatial reduction: tree depth 5 reduces 32 bf16 elements along B → f32
// Output (Interleaved): Time = [A], Packet = [C]
input.align::<m![A], m![B], _, _>(&trf)
.contract::<m![1]>()
.accumulate::<m![A], m![C]>(AccumulationKind::Interleaved)
}
}
At each Row, elementwise multiplication of input * trf occurs.
With tree depth 5, reduce-add sums over the 32 bf16 elements of B, producing one f32 per A position.
Full Tensor Reduction
This example demonstrates a complete reduce-add over a tensor m![A] with m![A]::SIZE = 65536, showing how spatial and temporal reduction combine with slice-level reduction.
Mapping:
Slice = m![A / 256], Time = m![A / 32 % 8], Packet = m![A % 32]
Reduction breakdown:
| Stage | Axes Reduced | Mechanism | Cycles |
|---|---|---|---|
| Spatial | A % 32 | Reducer tree (depth 5) | 5 |
| Temporal | A / 32 % 8 | Reducer accumulator (8 iterations) | 8 |
| Slice-level | A / 256 | Inter-Slice Block | 256 |
Analysis:
- Each slice processes
A / 256(256) elements - Within a slice:
A % 32elements are reduced spatially by the tree (5 cycles forbf16) - The temporal axis
A / 32 % 8means 8 flits arrive sequentially, accumulated by the buffer - After in-slice reduction completes (~40 cycles), 256 partial results exist across slices
- The Inter-Slice Block reduces these 256 slice results (256 cycles)
- Total: ~296 cycles for reducing 65536 elements to a single scalar
Architecture
The Reducer consists of 8 independent Rows operating in parallel. Data flows to the Rows from two sources:
- StreamUnit data: Broadcast to all Rows (same data to every row)
- TRF data: Read in parallel from 8 independent Row spaces (TRF Row \(i\) feeds Row \(i\) directly)
Each Row contains a reduction tree for spatial reduction, followed by a shared accumulator buffer for temporal reduction.

The diagram shows data widths at different stages. The 320b/640b corresponds to 64/128 elements for i4/i5, 32/64 elements for i8/f8/i9, 16/32 elements for bf16, and 8/16 elements for f32/i32.
Spatial Reduction
Each Row contains a reduction tree that sums products hierarchically.

At depth 0, each Row multiplies the input stream from the Stream Adapter, with the weight data from the TRF Sequencer (each 64 bytes wide). Each subsequent depth sums pairs of partial products, halving the element count from the previous depth. The tree depth varies by data type to provide sufficient depth for reducing the full data width:
- i4: depth 7 (reduces 128 elements)
- i8/f8: depth 6 (reduces 64 elements)
- bf16: depth 5 (reduces 32 elements)
The output data type is widened to accommodate larger result values from contraction.
With i8 input, i8 * i8 multiplication occurs first, and up to 64 values can be summed across the 6-depth tree.
Inputs i4/i8 produce i32 outputs, and inputs f8/bf16 produce f32 outputs.
Given a computation mapping of m![Row, Time, Packet], spatial reduction eliminates the innermost m![Packet % 2^n] axes (where n is the tree depth), producing an output mapping of m![Row, Time, Packet / 2^n].
Note
Spatial reduction in addition mode allows full 8-Row usage, but max mode only supports a single Row (Row 0).
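The hierarchical pairwise summation can be sketched on the host as follows (plain Rust, not a hardware model; the function name is illustrative, and the depth values follow the list above):
/// Pairwise reduction tree over one group of 2^depth products:
/// each depth halves the element count until a single partial sum remains.
fn tree_reduce(products: &[f32], depth: u32) -> f32 {
    assert_eq!(products.len(), 1usize << depth);
    let mut level = products.to_vec();
    for _ in 0..depth {
        level = level.chunks(2).map(|pair| pair[0] + pair[1]).collect();
    }
    level[0]
}

// bf16 input: one group of 32 products per Row, depth 5, yields a single f32 partial sum.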
Resize
After spatial reduction, the output is resized to exactly 32 i32/f32 elements per Row before being fed to the temporal accumulator.
When the tree depth is 0 (no spatial reduction), the 32 outer elements are truncated.
Otherwise, the spatial reduction output is padded or broadcast to fill the 32 columns of the temporal accumulator, depending on the output mode.
The Reducer supports two output modes that determine how the resize is performed:
- Sequential: Rows are sequentially ordered. The spatial reduction output is padded with zeros.
- Interleaved: Rows are interleaved. The spatial reduction output is repeated across the 32 columns.
The figure below illustrates the output of spatial reduction for various i8 reduction depths.
The left side shows Sequential mode adding zero-padding; the right shows Interleaved mode replicating the output to fill 32 element positions.

Temporal Reduction
After resizing, each Row feeds its output to a shared temporal accumulator.
The temporal accumulator stores intermediate results in a buffer and accumulates values that arrive sequentially over time, enabling reduce operations even when the reduce axis is not contiguous in the innermost dimension.
The buffer has 1024 slots total: 8 rows × 32 columns × 4 registers/column.
Consider axes![A = 2048, B = 8] and a tensor with mapping m![A, B], where we want to reduce along axis B.
With mapping Time = m![B / 4, A % 8] and Packet = m![B % 4], the spatial reduction stage outputs 16 flits (since Time::SIZE = m![B / 4, A % 8]::SIZE = 2 * 8 = 16).
The accumulator uses 8 buffer slots (one per A % 8 value) to accumulate across the B / 4 (2) iterations:
| flit # | B / 4 | A % 8 | Buffer Slot | Operation |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | Store |
| 1 | 0 | 1 | 1 | Store |
| 2 | 0 | 2 | 2 | Store |
| 3 | 0 | 3 | 3 | Store |
| 4 | 0 | 4 | 4 | Store |
| 5 | 0 | 5 | 5 | Store |
| 6 | 0 | 6 | 6 | Store |
| 7 | 0 | 7 | 7 | Store |
| 8 | 1 | 0 | 0 | Accumulate with flit #0 |
| 9 | 1 | 1 | 1 | Accumulate with flit #1 |
| 10 | 1 | 2 | 2 | Accumulate with flit #2 |
| 11 | 1 | 3 | 3 | Accumulate with flit #3 |
| 12 | 1 | 4 | 4 | Accumulate with flit #4 |
| 13 | 1 | 5 | 5 | Accumulate with flit #5 |
| 14 | 1 | 6 | 6 | Accumulate with flit #6 |
| 15 | 1 | 7 | 7 | Accumulate with flit #7, then output |
The first 8 flits are stored in buffer slots 0-7. When flits 8-15 arrive, they accumulate with the stored values. After flit 15, the buffer contains the final reduced results and outputs them.
For buffered reduction to work, the product of all axis sizes inner to the reduce axis must be at most 1024, in order to fit the accumulator buffer.
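The store-then-accumulate behaviour in the table can be mimicked with a plain-Rust sketch (host-side illustration, not a hardware model; the function name is hypothetical, and slot indexing follows the A % 8 / B / 4 example above):
/// Mimics the temporal accumulator for the example above: flits arrive in order
/// (B / 4 outer, A % 8 inner); the first pass stores, later passes accumulate.
fn accumulate_flits(flits: &[[f32; 32]], inner_slots: usize) -> Vec<[f32; 32]> {
    let mut buffer = vec![[0.0f32; 32]; inner_slots];
    for (i, flit) in flits.iter().enumerate() {
        let slot = i % inner_slots; // A % 8 selects the buffer slot
        for (acc, x) in buffer[slot].iter_mut().zip(flit) {
            if i < inner_slots {
                *acc = *x; // first B / 4 iteration: store
            } else {
                *acc += *x; // later iterations: accumulate
            }
        }
    }
    buffer // after the last flit, the buffer holds the reduced results
}
With 16 flits and inner_slots = 8, flits 0–7 are stored and flits 8–15 accumulate, matching the table.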
The temporal accumulator supports two operation modes: Sequential and Interleaved.
Interleaved provides a greater buffer capacity of 128 for the axes inner to the reduce axis, compared to the Sequential 32-element capacity. However, Interleaved changes the output packet structure. Choose the mode based on buffer constraints and whether the desired output ordering matches downstream requirements. See Constraints for the full buffer capacity rules.
Interleaved Mode
In Interleaved mode, the Reducer outputs data element-by-element across all Rows. The output bus carries one value from each of the 8 Rows, per beat.
- Packet Slicing: In Interleaved mode, not all of Packet is fed to the accumulator. Since the reduction tree broadcasts \(m\) partial sums across all 32 column positions (via replication), only the first \(m\) columns get written to accumulator entries, slicing Packet from 32 down to \(m\).
Note
User-specified slicing should only slice padded Packet axes.
- Column Interleaving: To achieve maximum accumulator utilization, all of the 32 accumulator columns are filled by interleaving \(\frac{32}{m}\) column groups over successive cycles. For \(m = 4\), the first cycle writes to columns 0–3, the next to columns 4–7, and so on, giving 8 interleave steps to fill all 32 columns.
- Full Row Utilization: Additionally, all 8 accumulator rows are always active regardless of the actual input Row count: if Row < 8, the data is padded to occupy all 8 Rows.
- Output: OutTime: m![Time', Packet / 2^n = m], OutPacket: m![Row # 8]. OutTime preserves the order of Time, Packet, but with some axes from Time removed. The removed axes undergo reduce-add, yielding Time'. OutPacket equals Row padded with dummies to align to 8, as all Rows are utilized.
Note
Interleaved mode has reduced accumulator utilization when Row < 8: only Row out of 8 rows store meaningful data, while the output bus always sends all 8 Rows together. Effective accumulator capacity is Row × 32 × 4 instead of the full 8 × 32 × 4 = 1024 slots. This limitation is most severe at Row = 1 (128 useful slots), but applies to Row = 2 and Row = 4 as well.
Example
This example performs a contraction where K is partially reduced spatially (K % 4 in Packet) and temporally (K / 16 in Time), with K % 16 / 4 surviving in the output:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, N = 8, K = 64];
fn interleaved<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![K / 16, M], m![K % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K % 16 / 4], m![N]> {
// Computation mapping:
// Time: [K / 16, M], Row: [N], Packet: [K % 16]
// 16 bf16 elements per packet = 32 bytes
//
// Spatial: tree depth 2 reduces groups of 4 bf16 -> 1 f32,
// leaving K % 16 / 4 (4) columns
// Temporal: K / 16 (4) iterations accumulated in buffer
//
// Interleaved output:
// m = 4 valid columns, 16 / 4 = 4 column groups interleaved
// OutTime = [M, K % 16 / 4] (K / 16 reduced, surviving Packet appended)
// OutPacket = [N] (Row, already 8)
input.align::<m![K / 16, M], m![K % 16 # 32], _, _>(&trf)
.contract::<m![K % 16 / 4]>()
.accumulate::<m![M, K % 16 / 4], m![N]>(AccumulationKind::Interleaved)
}
}
The axes inner to the reduce axis (K / 16) are M and K % 16 / 4, with a total size of 4 × 4 = 16.
This satisfies the Interleaved buffer constraint (≤ 128).
Sequential Mode
In Sequential mode, the Reducer outputs the reduced data in each Row sequentially.
The output bus carries up to 8 elements from Packet, per beat.
- Full Packet Utilization: In Sequential mode, all 32 columns of Packet / 2^n are fed to the accumulator. Unlike in Interleaved mode, no packet slicing occurs. Each cycle writes all 32 columns simultaneously, with zeros padding any unused positions.
- Row Interleaving: To achieve maximum accumulator utilization, all 8 accumulator rows are filled by interleaving \(\frac{8}{\texttt{Row}}\) row groups over successive cycles. With Row::SIZE = 4, the 4 input Rows are stored in accumulator rows 0–3 in the first cycle and in rows 4–7 in the next.
- Output: OutTime: m![Time', Row, Packet_outer], OutPacket: m![Packet_inner]. OutTime preserves the order of Time, Row, but with some axes from Time removed. The removed axes undergo reduce-add, yielding Time'. Since the output bus is 8 elements wide, only multiples of 8 elements (8, 16, 24, or 32) can be output, so Packet is split accordingly: Packet_inner holds the innermost elements padded up to a multiple of 8, and Packet_outer holds the remaining outer portion.
Example
The same computation mapping as the Interleaved example above, but with Sequential output:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, N = 8, K = 64];
fn sequential<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![K / 16, M], m![K % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, N], m![K % 16 / 4 # 8]> {
// Computation mapping:
// Time: [K / 16, M], Row: [N], Packet: [K % 16]
// 16 bf16 elements per packet = 32 bytes
//
// Spatial: tree depth 2 reduces groups of 4 bf16 -> 1 f32,
// leaving K % 16 / 4 (4) columns
// Temporal: K / 16 (4) iterations accumulated in buffer
//
// Sequential output:
// Packet'' = K % 16 / 4, padded to 8:
// - Packet_outer = [1],
// - Packet_inner = [K % 16 / 4 # 8]
// OutTime = [M, N] (K / 16 reduced, Row appended)
// OutPacket = [K % 16 / 4 # 8] (surviving Packet padded to 8)
input.align::<m![K / 16, M], m![K % 16 # 32], _, _>(&trf)
.contract::<m![K % 16 / 4]>()
.accumulate::<m![M, N], m![K % 16 / 4 # 8]>(AccumulationKind::Sequential)
}
}
The axes inner to the reduce axis (K / 16) are M and N, with a total size of 4 × 8 = 32.
This satisfies the Sequential buffer constraint (≤ 32).
Note
Sequential mode has reduced accumulator utilization when Packet is spatially reduced: only Packet / 2^n out of 32 elements store meaningful data per Row. Effective accumulator capacity drops to as little as 8 × 1 × 4 = 32 slots instead of the full 1024. This limitation applies whenever the non-padded portion of Packet / 2^n is fewer than 32 elements.
Constraints
- Row count: The hardware provides exactly 8 Rows. Operations can use 1, 2, 4, or 8 rows, but the Row dimension size must match one of these values.
- Tree depth: Determines how many elements can be reduced spatially. Depth 7 for i4 (128 elements), depth 6 for i8/f8 (64 elements), depth 5 for bf16 (32 elements). The input packet size must not exceed the maximum elements reducible at the given depth.
- Spatial output limit: For a tree depth of 0 (no spatial reduction), the Reducer outputs at most 32 i32/f32 elements. Configurations that would produce more than 32 output elements per cycle are invalid.
- Data types: Input types must be i4, i8, f8, or bf16. Output types are automatically widened to i32 (from i4/i8) or f32 (from f8/bf16). The type widening is mandatory.
- Reduce-max: Only supports using a single Row (Row 0), limiting reduce-max throughput to 1/8th of reduce-add capacity.
- Buffer capacity: The accumulator has 1024 buffer slots (8 rows × 32 columns × 4 registers/column). The product of axes inner to the outermost reduce axis must fit within this capacity.
  - Interleaved constraints: Requires axes inner to outermost reduce in OutTime to be at most 128. Full constraint: align_up(Row, 8) * (axes inner to reduce) ≤ 1024.
  - Interleaved utilization: When Row < 8, effective capacity is reduced to Row × 32 × 4 slots, preventing full buffer utilization.
  - Sequential constraints: Requires axes inner to outermost reduce in OutTime to be at most 32. Full constraint: align_up(reduced_packet.len(), 32) * (axes inner to reduce) ≤ 1024.
  - Sequential utilization: When Packet is reduced, full buffer utilization cannot be achieved. For instance, when reducing Packet totally, only one column of each Row is used, wasting 31/32 of the buffer capacity.
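The capacity rules above can be checked mechanically. The following plain-Rust sketch is an illustration of the stated formulas, not an SDK API (the function names are hypothetical); it validates a configuration against the Interleaved and Sequential constraints:
fn align_up(value: usize, to: usize) -> usize {
    value.div_ceil(to) * to
}

/// `inner_product` is the product of axis sizes inner to the outermost reduce axis;
/// `reduced_packet_len` is the surviving Packet length after spatial reduction.
fn fits_interleaved(rows: usize, inner_product: usize) -> bool {
    inner_product <= 128 && align_up(rows, 8) * inner_product <= 1024
}

fn fits_sequential(reduced_packet_len: usize, inner_product: usize) -> bool {
    inner_product <= 32 && align_up(reduced_packet_len, 32) * inner_product <= 1024
}

fn main() {
    // Interleaved example above: inner axes M (4) and K % 16 / 4 (4), 8 Rows.
    assert!(fits_interleaved(8, 4 * 4));
    // Sequential example above: surviving Packet of 4, inner axes M (4) and N (8).
    assert!(fits_sequential(4, 4 * 8));
}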
Performance
- Spatial latency: Tree depth determines spatial reduction latency: i4 depth 7 (128 elements in 7 cycles), i8/f8 depth 6 (64 elements in 6 cycles), bf16 depth 5 (32 elements in 5 cycles). Shallower trees complete faster; larger data types need less depth because fewer elements fit in the 64-byte packet.
- Temporal latency: Each accumulation cycle processes one packet. For a reduction axis of size N in the time dimension, the accumulator requires approximately N cycles to complete the reduction.
- Parallelism: Using all 8 Rows maximizes throughput. Each Row operates independently, so 8 rows achieve 8× parallelism compared to a single row.
- Type widening: Output data types are widened to prevent overflow (i4/i8 → i32, f8/bf16 → f32). This widening is automatic and adds minimal latency, but downstream components must handle 32-bit data.
- Reduce-max: Only supports single Row (Row 0) usage, limiting parallelism to 1/8th of reduce-add throughput.
- Truncation: When tree depth is 0, the Reducer can output at most 32 elements spatially. Larger packets are truncated.
- Pipeline integration: The Reducer sits between Stream Adapter/TRF Sequencer and Vector Engine, adding latency proportional to tree depth plus time dimension size.
Vector Engine
The Vector Engine applies element-wise operations: activations such as GELU and SiLU, normalizations such as softmax and layer norm, and binary operations. It is used both after the Contraction Engine (to post-process f32/i32 accumulator results) and independently for element-wise kernels that skip contraction entirely.
The Vector Engine operates exclusively on i32 and f32 data types. Data moves in 32-byte units called flits, each containing eight 32-bit values. This 32-bit restriction exists because lower-precision data is widened before or during computation: bf16 products accumulate in f32, and i8 products accumulate in i32.
The Vector Engine sits between the Contraction Engine and the Cast Engine in the Tensor Unit pipeline:
Fetch -> Switch -> Collect -> Contraction -> Vector -> Cast -> Transpose -> Commit
                      |                        ^
                      +------------------------+
                          (skip contraction)
Data enters the Vector Engine from one of two sources:
- the Collect Engine, when the Contraction Engine is skipped
- the Contraction Engine, when contraction produces the Vector Engine's input
Interface
/// Initializes Vector Engine processing for this tensor.
#[primitive(CollectTensor::vector_init)]
pub fn vector_init(self) -> VectorInitTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
where
D: VeScalar,
#[primitive(VectorInitTensor::vector_intra_slice_branch)]
pub fn vector_intra_slice_branch(
self,
branch: BranchMode,
) -> VectorBranchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet, D, NoTensor, { VeOrder::IntraFirst }> {
#[primitive(VectorInitTensor::vector_intra_slice_unzip)]
pub fn vector_intra_slice_unzip<I: AxisName, TileTime: M, SplitTime: M>(
self,
) -> VectorTensorPair<'l, T, D, stage::Branch, Chip, Cluster, Slice, SplitTime, Packet> {
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpI32,
) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
The same vector_init() entry point is available regardless of whether the input comes from the Collect Engine (when contraction is skipped) or the Contraction Engine (for post-contraction processing).
After vector_init(), choose the first block by calling either vector_intra_slice_branch(...), vector_intra_slice_unzip(...), or vector_inter_slice_reduce(...).
For detailed stage-by-stage API coverage, see Intra-Slice Block and Inter-Slice Block.
Quick Reference
| Block | How to Reach It | Use It For | Output |
|---|---|---|---|
| Intra-Slice Block | Start with vector_init(), then call vector_intra_slice_branch() | Elementwise ops, binary ops, intra-slice reduce | Chain stages, then vector_final() |
| Inter-Slice Block | Either call vector_init() -> vector_inter_slice_reduce() first, or switch from an eligible intra-slice tensor with vector_inter_slice_reduce() | Reduction across the 256 slices in a cluster | vector_inter_slice_reduce(), then optional intra-slice work or vector_final() |
| Two-group intra-slice mode | Start with vector_init(), then call vector_intra_slice_unzip() | Process two interleaved groups before combining them | _zip to merge, then vector_final() |
Examples
ReLU Activation
Applying ReLU activation (max(x, 0)) after matrix multiplication:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 128, N = 256, K = 64];
fn relu<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_clip(ClipBinaryOpF32::Max, 0.0f32)
.vector_final()
}
}
Inter-Slice Reduce
Reducing a tensor across slices:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn inter_slice_reduce<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
Ordering
| Order | Flow | Typical Use |
|---|---|---|
IntraFirst | Intra-Slice Block -> optional Inter-Slice Block | Post-process each slice, then reduce across slices |
InterFirst | Inter-Slice Block -> optional Intra-Slice Block | Reduce first, then apply elementwise post-processing |
The examples above show one concrete IntraFirst path and one concrete InterFirst path.
Constraints
When using i8 or bf16 input without the Contraction Engine, widening must still fit within one 32-byte flit. This limits how much data the Fetch Engine can supply per flit after type conversion. See Fetch Engine: Type Casting Constraints.
Intra-Slice Block
The Intra-Slice Block performs elementwise, binary, and intra-slice reduce operations on tensor data.
After the Contraction Engine completes matrix multiplication, the Intra-Slice Block applies activation functions, normalization, and other elementwise transformations to produce the final result.
For example, computing sigmoid(X * W + b) requires the Contraction Engine for X * W, then the Intra-Slice Block for addition and sigmoid activation.
Interface
/// Initializes Vector Engine processing for this tensor.
#[primitive(CollectTensor::vector_init)]
pub fn vector_init(self) -> VectorInitTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
where
D: VeScalar,
#[primitive(VectorInitTensor::vector_intra_slice_branch)]
pub fn vector_intra_slice_branch(
self,
branch: BranchMode,
) -> VectorBranchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet, D, NoTensor, { VeOrder::IntraFirst }> {
#[primitive(VectorInitTensor::vector_intra_slice_unzip)]
pub fn vector_intra_slice_unzip<I: AxisName, TileTime: M, SplitTime: M>(
self,
) -> VectorTensorPair<'l, T, D, stage::Branch, Chip, Cluster, Slice, SplitTime, Packet> {
The same vector_init() entry point is available regardless of whether the input comes from the Collect Engine (when contraction is skipped) or the Contraction Engine (for post-contraction processing).
After vector_init(), enter the intra-slice block with either vector_intra_slice_branch(...) or vector_intra_slice_unzip(...).
For the paired path entered through vector_intra_slice_unzip(...), see Two-Group Mode.
After entry, operations are chained stage by stage:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn staged_pipeline<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_fxp_to_fp(31)
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_fp_unary(FpUnaryOp::Sigmoid)
.vector_pad_way8::<m![A % 2 # 8]>()
.vector_fp_to_fxp(31)
.vector_clip(ClipBinaryOpI32::Max, 0)
.vector_final()
}
}
Each method corresponds to a hardware pipeline stage. The type system enforces valid stage transitions at compile time. For example, vector_fp_unary is only available after the pipeline has been narrowed with vector_split or vector_trim_way4.
Architecture
flowchart TD
classDef way8 fill:#e8f5e9,stroke:#2e7d32
classDef way4 fill:#e3f2fd,stroke:#1565c0
classDef conv fill:#fff3e0,stroke:#e65100
classDef ctrl fill:#f3e5f5,stroke:#6a1b9a
classDef vrf fill:#fce4ec,stroke:#880e4f
Entry["vector_intra_slice_branch(BranchMode)"]:::ctrl
VRF_L["VRF"]:::vrf
Logic["Logic Cluster<br><code>vector_logic()</code>"]:::way8
VRF_L -. "operand" .-> Logic
Entry --> Logic
Fxp["Fxp Cluster<br><code>vector_fxp()</code>"]:::way8
VRF_R1["VRF"]:::vrf
Fxp -. "operand" .- VRF_R1
Logic --> Fxp
FxpToFp["FxpToFp<br><code>vector_fxp_to_fp()</code>"]:::conv
Fxp --> FxpToFp
Narrow["Narrow Stage<br><code>vector_split() / vector_trim_way4()</code>"]:::conv
FxpToFp --> Narrow
VRF_L2["VRF"]:::vrf
Fp["Float Cluster<br><code>vector_fp_unary/binary/ternary()</code>"]:::way4
VRF_L2 -. "operand" .-> Fp
Narrow --> Fp
Reduce["IntraSliceReduce Stage<br><code>vector_intra_slice_reduce()</code>"]:::way4
Fp --> Reduce
FpDiv["FpDiv<br><code>vector_fp_div()</code>"]:::way4
VRF_R2["VRF"]:::vrf
FpDiv -. "operand" .- VRF_R2
Reduce --> FpDiv
Widen["Widen Stage<br><code>vector_concat() / vector_pad_way8()</code>"]:::conv
FpDiv --> Widen
FpToFxp["FpToFxp<br><code>vector_fp_to_fxp()</code>"]:::conv
Widen --> FpToFxp
VRF_L3["VRF"]:::vrf
Clip["Clip Cluster<br><code>vector_clip()</code>"]:::way8
VRF_L3 -. "operand" .-> Clip
FpToFxp --> Clip
Exit["vector_final()"]:::ctrl
Clip --> Exit
Quick Reference
Entry and Transition
Before the stage-by-stage table, it helps to separate the ways you can enter or resume the intra-slice block:
| Current state | Method | Result |
|---|---|---|
Fresh VE input after vector_init() | vector_intra_slice_branch(BranchMode) | Enters the single-stream intra-slice path |
Fresh VE input after vector_init() | vector_intra_slice_unzip() | Enters the two-group intra-slice path |
Tensor after vector_inter_slice_reduce() | vector_intra_slice_branch(BranchMode) | Continues with intra-slice work after inter-slice reduction |
The stage table below describes the single-stream path after vector_intra_slice_branch().
For the paired path after vector_intra_slice_unzip(), see Two-Group Mode.
Stages
Every stage is optional; you can skip directly from Branch to any downstream stage.
Recall from Vector Engine that Way8 processes 8 elements per cycle and Way4 processes 4.
The ALU column is shown only where the API exposes multiple competing ALUs inside one stage. Stages such as FxpToFp, Narrow, IntraSliceReduce, FpDiv, Widen, FpToFxp, Filter, and Output do not require a user-visible ALU choice here.
| Stage | Method | Data Type | Mode | ALUs | Notes |
|---|---|---|---|---|---|
| Branch | vector_intra_slice_branch(BranchMode) | i32, f32 | Way8 | | Single-stream entry after vector_init(), or continuation after vector_inter_slice_reduce() |
| Logic | vector_logic(op, operand) | i32, f32 | Way8 | LogicAnd, LogicOr, LogicXor, LogicLshift, LogicRshift | |
| Fxp | vector_fxp(op, operand) | i32 | Way8 | FxpAdd, FxpLshift, FxpMul, FxpRshift | |
| FxpToFp | vector_fxp_to_fp(int_width) | i32 → f32 | Way8 | | |
| Narrow | vector_split() / vector_trim_way4() | f32 | Way8 → Way4 | | |
| Float | vector_fp_unary/binary/ternary(op, ...) | f32 | Way4 | FpFma, FpFpu, FpExp, FpMul0, FpMul1 | |
| IntraSliceReduce | vector_intra_slice_reduce(op) | i32, f32 | Way4 | | |
| FpDiv | vector_fp_div(op, operand) | f32 | Way4 | | |
| Widen | vector_concat() / vector_pad_way8() | f32 | Way4 → Way8 | | |
| FpToFxp | vector_fp_to_fxp(int_width) | f32 → i32 | Way8 | | |
| Clip | vector_clip(op, operand) | i32, f32 | Way8 | ClipAdd, ClipMax, ClipMin | |
| Filter | vector_filter(mode) | i32, f32 | Way8 | | |
| Output | vector_final() | i32, f32 | Way8 | | |
vector_intra_slice_branch() is both the initial single-stream intra-slice entry after vector_init() and the continuation point after vector_inter_slice_reduce().
vector_intra_slice_unzip() is only available directly from vector_init().
Within a stage, each ALU can only be used once per pass. This matters mainly in Logic, Fxp, Fp, and Clip, where multiple operators share a stage-local ALU pool. For example, tanh(sqrt(x)) is impossible in a single pass because both tanh and sqrt require the FpFpu ALU. Such operations require multiple Tensor Unit invocations with intermediate results stored in DM or TRF.
vector_stash() is not a pipeline stage. It can be called at any Stashable point in the chain to snapshot the current tensor for later use as an operand. See Stash for details.
Examples
i32 Pipeline Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn add_constant<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_final()
}
}
f32 Pipeline Example
In this example, vector_trim_way4() is the Narrow step: it changes the tensor from Way8 to Way4 before the float operation.
Later, vector_pad_way8() is the Widen step: it changes the tensor from Way4 back to Way8 after the float operation.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn sigmoid<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_trim_way4::<m![A % 2 # 4]>() // Narrow: Way8 -> Way4
.vector_fp_unary(FpUnaryOp::Sigmoid)
.vector_pad_way8::<m![A % 2 # 8]>() // Widen: Way4 -> Way8
.vector_final()
}
}
For stash usage, see Stash.
Stage Details
Branch (vector_intra_slice_branch)
Enters the pipeline and configures conditional execution via BranchMode.
Each 32-bit element in a flit is assigned a 4-bit ExecutionId (0-15) that determines which operations to apply.
| Mode | Description |
|---|---|
Unconditional | All elements get ExecutionId 0 |
AxisToggle { axis } | Toggle group based on axis index (group_id = axis_index % 2) |
ValidCount | Via Valid Count Generator |
Comparison([InputCmp; 4]) | Set branch bits via comparison operations on input values |
Vrf | Load ExecutionIds from VRF (pre-written by Branch Logger in prior TuExec) |
Logic Cluster (vector_logic)
Bitwise operations on i32 or f32 (bit-level). Requires Way8 mode.
This stage has multiple ALUs, so operator choice matters for fusion: LogicAnd, LogicOr, LogicXor, LogicLshift, and LogicRshift can each be used at most once per pass.
i32 operations:
| Op | ALU | Note |
|---|---|---|
BitAnd | LogicAnd | bitwise and |
BitOr | LogicOr | bitwise or |
BitXor | LogicXor | bitwise xor |
LeftShift | LogicLshift | logical left shift |
LogicRightShift | LogicRshift | logical right shift |
ArithRightShift | LogicRshift | arithmetic right shift |
f32 operations:
| Op | ALU | Note |
|---|---|---|
BitAnd | LogicAnd | bitwise and on fp bit patterns |
BitOr | LogicOr | bitwise or on fp bit patterns |
BitXor | LogicXor | bitwise xor on fp bit patterns |
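As a sketch of how a Logic-stage op slots into the single-stream chain, the following masks the low byte of each i32 element. The shapes mirror the i32 pipeline example in Examples; the operator enum name LogicBinaryOpI32 is an assumption inferred from the naming of the other operator enums, while the op and ALU names come from the table above.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
// Illustrative sketch: keep only the low 8 bits of every element.
// LogicBinaryOpI32 is an assumed enum name; BitAnd maps to the LogicAnd ALU.
fn mask_low_byte<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_logic(LogicBinaryOpI32::BitAnd, 0xFF) // keep bits 0-7 of each element
        .vector_final()
}
}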
Fxp Cluster (vector_fxp)
Integer and fixed-point arithmetic on i32. Requires Way8 mode.
This stage has four reusable ALU classes: FxpAdd, FxpLshift, FxpMul, and FxpRshift. Operators sharing the same class cannot be fused in one pass.
| Op | ALU | Note |
|---|---|---|
AddFxp | FxpAdd | wrapping add |
AddFxpSat | FxpAdd | saturating add |
SubFxp | FxpAdd | wrapping subtract |
SubFxpSat | FxpAdd | saturating subtract |
LeftShift | FxpLshift | logical left shift |
LeftShiftSat | FxpLshift | saturating left shift |
MulFxp | FxpMul | fixed-point multiply |
MulInt | FxpMul | integer multiply |
LogicRightShift | FxpRshift | logical right shift |
ArithRightShift | FxpRshift | arithmetic right shift |
ArithRightShiftRound | FxpRshift | arithmetic right shift with rounding |
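Because FxpMul and FxpAdd are distinct ALU classes, an integer scale-and-bias fuses into a single pass. A minimal sketch, reusing the shapes of the i32 pipeline example in Examples (the function name is illustrative):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn scale_and_bias<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_fxp(FxpBinaryOp::MulInt, 3)   // FxpMul: x * 3
        .vector_fxp(FxpBinaryOp::AddFxp, 100) // FxpAdd: x * 3 + 100
        .vector_final()
}
}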
FxpToFp Conversion (vector_fxp_to_fp)
Converts i32 to f32. The int_width parameter specifies the integer bit width for the conversion.
| Method | Effect |
|---|---|
vector_fxp_to_fp(int_width) | convert i32 stream to f32 |
Narrow (vector_split, vector_trim_way4)
Way8 and Way4 are the two packet modes of the intra-slice pipeline.
In Way8, one packet carries 8 active lanes (Packet = m![... # 8]).
In Way4, one packet carries 4 active lanes (Packet = m![... # 4]).
Narrow switches the pipeline from Way8 to Way4; the floating-point and intra-slice reduce stages then run in Way4; Widen switches back to Way8.
This usually halves throughput for the float / reduce path, because the same logical tensor shape now takes twice as many packets or passes.
| Method | Use When | Effect |
|---|---|---|
vector_split() | both halves contain real data | split one 8-way flit into two 4-way packets, updating Time and Packet |
vector_trim_way4() | upper 4 lanes are already padding or irrelevant | keep only the lower 4 lanes |
Shape semantics:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![S = 64, A = 512];
fn split_semantics<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![S # 16 % 4], m![A % 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorNarrowTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![S # 16 % 4, A / 4 % 2], m![A % 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_split::<m![S # 16 % 4, A / 4 % 2], m![A % 4]>()
// shape semantics: [T], [P] -> [T, P / 2], [P % 4]
}
fn trim_way4_semantics<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorNarrowTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_trim_way4::<m![A % 2 # 4]>()
// shape semantics: [T], [P] -> [T], [P = 4]
}
}
Float Cluster (vector_fp_unary, vector_fp_binary, vector_fp_ternary)
Floating-point operations on f32. Requires Way4 mode.
That is, the input must already have passed through Narrow, so each packet carries 4 active lanes rather than 8.
This is the stage where ALU planning matters most. It exposes five independent ALUs, FpFma, FpFpu, FpExp, FpMul0, and FpMul1, and each can be used once per pass.
Unary ops:
| Op | ALU | Note |
|---|---|---|
Exp | FpExp | exponential |
NegExp | FpExp | negative exponential |
Sqrt | FpFpu | square root |
Tanh | FpFpu | hyperbolic tangent |
Sigmoid | FpFpu | sigmoid |
Erf | FpFpu | error function |
Log | FpFpu | natural logarithm |
Sin | FpFpu | sine |
Cos | FpFpu | cosine |
Binary ops:
| Op | ALU | Note |
|---|---|---|
AddF | FpFma | floating-point add |
SubF | FpFma | floating-point subtract |
MulF(FpMulAlu::Mul0) | FpMul0 | multiply using mul lane 0 |
MulF(FpMulAlu::Mul1) | FpMul1 | multiply using mul lane 1 |
MaskMulF(FpMulAlu::Mul0) | FpMul0 | masked multiply |
MaskMulF(FpMulAlu::Mul1) | FpMul1 | masked multiply |
DivF | FpFpu | division inside Fp stage |
Ternary ops:
| Op | ALU | Note |
|---|---|---|
FmaF | FpFma | fused multiply-add |
MaskFmaF | FpFma | masked fused multiply-add |
Example: To compute exp(sqrt(((x + 1) * 2) * 3)):
- x1 = x + 1 via FpFma (FpBinaryOp::AddF)
- x2 = x1 * 2 via FpMul0 (FpBinaryOp::MulF(FpMulAlu::Mul0))
- x3 = x2 * 3 via FpMul1 (FpBinaryOp::MulF(FpMulAlu::Mul1))
- x4 = sqrt(x3) via FpFpu (FpUnaryOp::Sqrt)
- x5 = exp(x4) via FpExp (FpUnaryOp::Exp)
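The same chain written as code, a hedged sketch that reuses the shapes of the f32 pipeline example and assumes Float-stage calls chain within one pass the same way the vector_fxp calls do in the ALU Conflict Example:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn exp_sqrt_chain<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_trim_way4::<m![A % 2 # 4]>()                        // Narrow: Way8 -> Way4
        .vector_fp_binary(FpBinaryOp::AddF, 1.0f32)                 // FpFma:  x + 1
        .vector_fp_binary(FpBinaryOp::MulF(FpMulAlu::Mul0), 2.0f32) // FpMul0: * 2
        .vector_fp_binary(FpBinaryOp::MulF(FpMulAlu::Mul1), 3.0f32) // FpMul1: * 3
        .vector_fp_unary(FpUnaryOp::Sqrt)                           // FpFpu:  sqrt
        .vector_fp_unary(FpUnaryOp::Exp)                            // FpExp:  exp
        .vector_pad_way8::<m![A % 2 # 8]>()                         // Widen: Way4 -> Way8
        .vector_final()
}
}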
IntraSliceReduce (vector_intra_slice_reduce)
Reduces axes within a single slice. Requires Way4 mode. This stage uses a dedicated reduction resource rather than a user-selectable ALU set.
| Data Type | Supported Ops |
|---|---|
i32 | AddSat, Max, Min |
f32 | Add, Max, Min |
See Intra-Slice Reduce for details.
FpDiv (vector_fp_div)
Floating-point division. Requires Way4 mode. The public API exposes only division here, so there is no operator-level ALU choice to plan in normal use.
| Op | Note |
|---|---|
FpDivBinaryOp::DivF | dedicated division stage after IntraSliceReduce |
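A minimal sketch of the division stage used on its own, dividing each element by a constant. The shapes mirror the sigmoid example in Examples; the function name is illustrative.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn scale_down<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_trim_way4::<m![A % 2 # 4]>()          // Narrow: Way8 -> Way4
        .vector_fp_div(FpDivBinaryOp::DivF, 255.0f32) // x / 255
        .vector_pad_way8::<m![A % 2 # 8]>()           // Widen: Way4 -> Way8
        .vector_final()
}
}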
Widen (vector_concat, vector_pad_way8)
These APIs enter the Widen stage and transition from Way4 back to Way8.
After Widen, later stages such as FpToFxp, Clip, Filter, and Output see 8-lane packets again.
| Method | Use When | Effect |
|---|---|---|
vector_concat() | reversing a prior vector_split() | merge two 4-way packets back into one 8-way flit |
vector_pad_way8() | reversing a prior vector_trim_way4() | pad a 4-way packet back to 8 lanes with invalid elements |
Shape semantics:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![S = 64, A = 512];
fn concat_semantics<'l, const T: Tu>(
input: VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![A / 4 % 2], m![A % 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorWidenTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![1], m![A % 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_concat::<m![1], m![A % 8]>()
// shape semantics: [T, P / 2], [P % 4] -> [T], [P]
}
fn pad_way8_semantics<'l, const T: Tu>(
input: VectorFpTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorWidenTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_pad_way8::<m![A % 2 # 8]>()
// shape semantics: [T], [P] -> [T], [P # 8]
}
}
FpToFxp Conversion (vector_fp_to_fxp)
Converts f32 back to i32. The int_width parameter specifies the integer bit width.
| Method | Effect |
|---|---|
vector_fp_to_fxp(int_width) | convert f32 stream back to i32 |
Clip Cluster (vector_clip)
Clamping and comparison operations. Requires Way8 mode.
This stage exposes three ALU classes, ClipAdd, ClipMax, and ClipMin, and each can be used once per pass.
i32 operations:
| Op | ALU | Note |
|---|---|---|
Min | ClipMin | minimum |
Max | ClipMax | maximum |
AbsMin | ClipMin | absolute minimum |
AbsMax | ClipMax | absolute maximum |
AddFxp | ClipAdd | wrapping add |
AddFxpSat | ClipAdd | saturating add |
f32 operations:
| Op | ALU | Note |
|---|---|---|
Min | ClipMin | minimum |
Max | ClipMax | maximum |
AbsMin | ClipMin | absolute minimum |
AbsMax | ClipMax | absolute maximum |
Add | ClipAdd | floating-point add |
Filter (vector_filter)
Applies an execution mask based on a branch condition to filter output flits.
Output (vector_final)
Exits the Vector Engine pipeline. The result can continue to the Cast Engine, Transpose Engine, or Commit Engine.
Stash (vector_stash)
Saves the current tensor data for later use as an operand via the Stash marker.
Key points:
- vector_stash() snapshots the current tensor for later use. Later binary or ternary ops can read it as Stash.
- The stash is typed. An f32 stash can be consumed only by later f32 ops, and an i32 stash only by later i32 ops.
- The stash follows the current tensor mapping. When it is read later, the implementation reinterprets or transposes it to the current mapping as needed.
- It is available only at Stashable stages: Branch, Logic, Fxp, Narrow, Fp, FpDiv, and Clip.
- It is not available after a binary op consumes the stash.
- It is a single slot per pass. The type system exposes at most one live stash in the chain.
See also:
- Operands, for how Stash is consumed as an operand
- Two-Group Mode, for the paired context where stash() is intentionally unavailable
Typical use:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn residual_max<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init() // enter VE
.vector_intra_slice_branch(BranchMode::Unconditional) // start the intra-slice path
.vector_stash() // save x
.vector_trim_way4::<m![A % 2 # 4]>() // narrow to Way4
.vector_fp_binary(FpBinaryOp::MulF(FpMulAlu::Mul0), 2.0f32) // compute 2x
.vector_pad_way8::<m![A % 2 # 8]>() // widen back to Way8
.vector_clip(ClipBinaryOpF32::Max, Stash) // max(2x, x)
.vector_final()
}
}
Stash: Fxp-Only Path
Stash at an early stage, then use it later in a Clip operation. This implements max(x + bias, x):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn stash_at_fxp<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init() // enter VE
.vector_intra_slice_branch(BranchMode::Unconditional) // start the intra-slice path
.vector_stash() // save original x
.vector_fxp(FxpBinaryOp::AddFxp, 100) // compute x + bias
.vector_clip(ClipBinaryOpI32::Max, Stash) // compute max(x + bias, x)
.vector_final()
}
}
Stash: Read/Write Across Narrow and Widen
Stash before narrowing, consume after widening. This computes max(sigmoid(x), x):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn stash_across_narrow_widen<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init() // enter VE
.vector_intra_slice_branch(BranchMode::Unconditional) // start the intra-slice path
.vector_stash() // save x (Way8)
.vector_trim_way4::<m![A % 2 # 4]>() // narrow to Way4
.vector_fp_unary(FpUnaryOp::Sigmoid) // compute sigmoid(x) in Way4
.vector_pad_way8::<m![A % 2 # 8]>() // widen back to Way8
.vector_clip(ClipBinaryOpF32::Max, Stash) // compute max(sigmoid(x), x)
.vector_final()
}
}
Operands
Operations (excluding unary and reduce) take operands specifying the second (or third) input.
The IntoOperands trait accepts multiple types:
| Operand Type | Example | Description |
|---|---|---|
| Constant | 100, 2.5f32 | Scalar broadcast to all elements |
| VRF tensor | VeRhs::vrf(&vrf_tensor) | Pre-loaded via .to_vrf() before entering the Vector Engine |
| Stash | Stash | Value saved by a prior vector_stash() call |
For ternary operations (FmaF), use (operand0, operand1) pairs or TernaryOperand.
Operands can be conditioned per ExecutionId group using VeBranchOperand:
- VeBranchOperand::always(operand), applied to all groups
- VeBranchOperand::group(operand, GroupId::Zero), applied only to group 0
VRF Input
Using pre-loaded VRF data as an operand:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, B = 256];
fn vrf_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![B], m![A % 2 # 8]>,
vrf: &VrfTensor<i32, m![1], m![1 # 2], m![A / 2], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![B], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, VeRhs::vrf(vrf))
.vector_final()
}
}
Argument Modes
Unary, binary, and ternary ops can select how the operator arguments are sourced.
For single-stream operations, “stream” refers to the tensor value carried by the self input of the method chain.
UnaryArgMode:
| Mode | Meaning | Computation |
|---|---|---|
Mode0 | stream | op(stream) (default) |
Mode1 | operand | op(operand) |
BinaryArgMode:
| Mode | Meaning | Computation |
|---|---|---|
Mode00 | stream / stream | op(stream, stream) |
Mode01 | stream / operand | op(stream, operand) (default) |
Mode10 | operand / stream | op(operand, stream) |
Mode11 | operand / operand | op(operand, operand) |
TernaryArgMode:
| Mode | Meaning | Computation |
|---|---|---|
Mode012 | stream / operand0 / operand1 | op(stream, operand0, operand1) (default) |
Mode002 | stream / stream / operand1 | op(stream, stream, operand1) |
Mode102 | operand0 / stream / operand1 | op(operand0, stream, operand1) |
Mode112 | operand0 / operand0 / operand1 | op(operand0, operand0, operand1) |
Mode020 | stream / operand1 / stream | op(stream, operand1, stream) |
Mode021 | stream / operand1 / operand0 | op(stream, operand1, operand0) |
Mode120 | operand0 / operand1 / stream | op(operand0, operand1, stream) |
In two-group mode, BinaryArgMode has two interpretations:
- For per-group ops such as vector_fxp_with_mode(...) or vector_fp_binary_with_mode(...), the mode is interpreted independently inside each group. 0 means that group’s stream and 1 means that group’s operand.
- For _zip ops such as vector_fxp_zip_with_mode(...) or vector_fp_zip_with_mode(...), the mode refers to the two grouped streams directly. 0 means Group 0 and 1 means Group 1.
For _zip ops, BinaryArgMode maps to the grouped streams as follows:
| Mode | Meaning | Computation |
|---|---|---|
Mode00 | group0 / group0 | op(group0, group0) |
Mode01 | group0 / group1 | op(group0, group1) (default) |
Mode10 | group1 / group0 | op(group1, group0) |
Mode11 | group1 / group1 | op(group1, group1) |
Single-stream example:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn bias_minus_x<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp_with_mode(FxpBinaryOp::SubFxp, BinaryArgMode::Mode10, 7) // compute 7 - x
.vector_final()
}
}
Two-group _zip example:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_sub_reverse<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_fp_zip_with_mode(FpBinaryOp::SubF, BinaryArgMode::Mode10) // compute group1 - group0
.vector_pad_way8::<m![A % 2 # 8]>()
.vector_final()
}
}
Two-Group Mode
Enter via vector_intra_slice_unzip() to process two interleaved groups in parallel.
This is the API used after begin_interleaved(...), where the collected tensor carries a 2-way grouping axis that should be treated as “group 0” and “group 1”.
The high-level flow is:
- vector_intra_slice_unzip() splits the grouped input into two parallel streams.
- Per-group stages run in lock-step on both groups.
- A _zip op merges the pair back into a single stream.
- The merged result can continue to vector_final().
There are two kinds of operations in this mode:
- Common stages apply to both groups together: vector_fxp_to_fp, vector_split, vector_trim_way4, vector_concat, vector_pad_way8, vector_fp_to_fxp.
- Per-group ops take one argument per group. Use () to skip one side, or pass different operands to each side. See Argument Modes for how BinaryArgMode is interpreted in per-group ops and _zip ops.
Minimal example, zip two interleaved groups with integer add:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_clip_zip(ClipBinaryOpI32::AddFxp)
.vector_final()
}
}
With asymmetric preprocessing, only group 0 is scaled before zip:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_preprocess_one_side<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_fxp(FxpBinaryOp::MulInt, 10, ()) // group 0 only
.vector_clip_zip(ClipBinaryOpI32::AddFxp)
.vector_final()
}
}
For float pipelines, both groups must narrow together before vector_fp_*, then zip in Way4, then widen again if later stages need Way8.
Important constraints:
- While the two groups are still paired (before _zip), stash() and filter() are not available.
- After _zip merges the pair, the result can continue downstream, but stash() and filter() remain unavailable on the merged tensor.
- ALU usage is shared across both groups. If either group uses an ALU in a stage, that ALU is consumed for the whole pair pass.
See also:
- Quick Reference, for the single-stream stage order that resumes after _zip
- Stash, for the snapshot path (unavailable in two-group mode)
Float Pipeline with Zip
Both groups go through the float path (narrow -> fp -> zip -> widen):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_fp_mul_zip<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_trim_way4::<m![A % 2 # 4]>() // both groups: Way8 -> Way4
.vector_fp_zip(FpBinaryOp::MulF(FpMulAlu::Mul0)) // group0 * group1 (Way4)
.vector_pad_way8::<m![A % 2 # 8]>() // Way4 -> Way8
.vector_final()
}
}
Per-Group Preprocessing
Apply different operations to each group before zipping:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_asymmetric_preprocess<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_fp_unary(FpUnaryOp::Exp, true, false) // group 0: exp(x), group 1: skip
.vector_fp_zip(FpBinaryOp::MulF(FpMulAlu::Mul0)) // exp(group0) * group1
.vector_pad_way8::<m![A % 2 # 8]>()
.vector_final()
}
}
Constraints
| Constraint | Detail |
|---|---|
| ALU single-use | Each ALU usable once per pass. Same-ALU operations require separate TU invocations. |
| Data types | i32/f32 only. Lower-precision data must be converted at fetch or after contraction. |
| ExecutionId range | 4-bit (0-15), limiting conditional paths to 16 branches per element. |
| VRF capacity | Limited capacity; binary/ternary operands must be pre-loaded via .to_vrf(). |
| Narrow/Widen overhead | Float operations halve throughput due to the Way8→Way4→Way8 path. |
| Stash single-use | Only one vector_stash() snapshot can be live in a pass, and it is unavailable after binary ops. |
| Two-group context | stash() and filter() are unavailable while paired (before _zip) and after _zip. |
ALU Conflict Example
Each ALU can only be used once. This example panics because AddFxp and SubFxp both use the FxpAdd ALU:
// PANICS: "FxpAdd is already in use"
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 10) // uses FxpAdd
.vector_fxp(FxpBinaryOp::MulInt, 2) // uses FxpMul ✓
.vector_fxp(FxpBinaryOp::SubFxp, 5) // uses FxpAdd again ✗
.vector_final()
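One way to keep this computation in a single pass, sketched under the assumption that finishing the add in the Clip stage is acceptable here: the Clip i32 table lists AddFxp on the ClipAdd ALU, so the final adjustment can move there and no ALU is used twice.
// OK: FxpAdd, FxpMul, and ClipAdd are each used once
input
    .vector_init()
    .vector_intra_slice_branch(BranchMode::Unconditional)
    .vector_fxp(FxpBinaryOp::AddFxp, 10)      // uses FxpAdd
    .vector_fxp(FxpBinaryOp::MulInt, 2)       // uses FxpMul
    .vector_clip(ClipBinaryOpI32::AddFxp, -5) // uses ClipAdd instead of a second FxpAdd
    .vector_final()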
Performance
Throughput
- Logic, Fxp, Clip Clusters: Full 8-way throughput (8 elements per cycle)
- Float Cluster: 4-way throughput (typically requires Narrow/Widen around the float path, effectively halving throughput)
Pipeline Latency
Each ALU introduces one cycle of latency.
Operations requiring multiple ALUs accumulate latencies.
For example, exp(sqrt(x)) adds 2 cycles (FpFpu for sqrt + FpExp for exp).
Intra-Slice Reduce
The Intra-Slice Reduce is a reduction operation performed by the IntraSliceReduce stage within the Intra-Slice Block.
At the hardware level, this corresponds to the reduction unit in the 4-way path.
It reduces axes within a single slice, contrasting with the Inter-Slice Block which reduces across the 256 slices of a cluster (inter-slice reduce).
This document covers the blocking-mode case where the accumulator result is not stored to an intermediate buffer.
Non-blocking mode (where accumulator results are stored to an intermediate buffer) is not covered here.
Interface
impl<
'l,
const T: Tu,
S,
Chip: M,
Cluster: M,
Slice: M,
Time: M,
Packet: M,
StashD: VeScalar,
Stash: TensorState<StashD>,
FS: stage::VeTensorContext,
const VE_ORDER: VeOrder,
> VectorTensor<'l, T, S, i32, Chip, Cluster, Slice, Time, Packet, StashD, Stash, VE_ORDER, FS, { Way4 }>
where
S: stage::Stage + CanTransitionTo<stage::IntraSliceReduce>,
{
/// Intra-slice reduce operation (i32).
#[primitive(VectorTensor::vector_intra_slice_reduce)]
pub fn vector_intra_slice_reduce<Reduce: AxisName, OTime: M, OPacket: M>(
mut self,
op: IntraSliceReduceOpI32,
) -> VectorIntraSliceReduceTensor<
'l,
T,
i32,
Chip,
Cluster,
Slice,
OTime,
OPacket,
StashD,
Stash,
VE_ORDER,
stage::Standalone,
{ Way4 },
>
impl<
'l,
const T: Tu,
S,
Chip: M,
Cluster: M,
Slice: M,
Time: M,
Packet: M,
StashD: VeScalar,
Stash: TensorState<StashD>,
FS: stage::VeTensorContext,
const VE_ORDER: VeOrder,
> VectorTensor<'l, T, S, f32, Chip, Cluster, Slice, Time, Packet, StashD, Stash, VE_ORDER, FS, { Way4 }>
where
S: stage::Stage + CanTransitionTo<stage::IntraSliceReduce>,
{
/// Intra-slice reduce operation (f32).
#[primitive(VectorTensor::vector_intra_slice_reduce)]
pub fn vector_intra_slice_reduce<Reduce: AxisName, OTime: M, OPacket: M>(
mut self,
op: IntraSliceReduceOpF32,
) -> VectorIntraSliceReduceTensor<
'l,
T,
f32,
Chip,
Cluster,
Slice,
OTime,
OPacket,
StashD,
Stash,
VE_ORDER,
stage::Standalone,
{ Way4 },
>
Parameters:
- REDUCE_LABEL: Which axis to reduce, specified as an Ident value (e.g., Ident::R). Each axes![] declaration creates a named label (Ident); all factors derived from the same declaration share that label. All factors in Time and Packet carrying this label are eliminated by the reduction, so they must not appear in the output shape (OTime, OPacket). For example, if R is split as R / 4 in Time and R % 4 in Packet, specifying REDUCE_LABEL = Ident::R eliminates both.
- op: The reduce operation (IntraSliceReduceOpI32 for i32, IntraSliceReduceOpF32 for f32).
- OTime, OPacket: The output Time and Packet shape after reduction. These must be exactly the input Time and Packet with all REDUCE_LABEL factors removed.
The Chip, Cluster, and Slice dimensions pass through unchanged from input to output.
Mechanism
Conceptual Operation
The IntraSliceReduce stage sits inside the Intra-Slice Block pipeline, after Narrow and before Widen.
It accepts a 4-way input (after Buffering Split divides the 8-way flit), performs a 2-level tree reduce to produce a single value, and accumulates the result into an accumulator slot.
4-way input from Narrow stage
┌───┬───┬───┬───┐
│ a │ b │ c │ d │
└─┬─┴─┬─┴─┬─┴─┬─┘
│ │ │ │
└─┬─┘ └─┬─┘ Level 1: pairwise reduce
op(a,b) op(c,d)
└───┬───┘ Level 2: pairwise reduce
op(op(a,b),op(c,d))
│
▼
┌─────────────┐
│ Accumulator │ Accumulate across time steps
│ (8 slots) │
└─────────────┘
The accumulator holds partial results across multiple input flits, implementing temporal reduction over the Time axis. Up to 8 accumulator slots are available, each serving as a buffer that accumulates partial reduce results across time steps.
Padding exclusion is handled by the Valid Count Generator (VCG), which tags each flit with the count of valid elements so the IntraSliceReduce stage can skip pad data.
Architectural Parameters
| Parameter | Value | Description |
|---|---|---|
| Tree input width | 4-way | Fixed; Narrow produces the 4-way input |
| Tree depth | 2 | Two levels of pairwise reduction |
| Accumulator slots | 8 | Independent reduction accumulators |
| Accumulator type | Temporal | Accumulates across time steps within each slot |
Supported Operations
Integer Operations (IntraSliceReduceOpI32)
| Operation | Description | Identity Element |
|---|---|---|
AddSat | Saturating addition | 0 |
Max | Maximum value | i32::MIN |
Min | Minimum value | i32::MAX |
Floating-Point Operations (IntraSliceReduceOpF32)
| Operation | Description | Identity Element |
|---|---|---|
Add | Floating-point addition | 0.0 |
Max | Maximum value | f32::NEG_INFINITY |
Min | Minimum value | f32::INFINITY |
Examples
Reduce in Time (i32 Saturating Add)
R exists only in Time, accumulated across time steps.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 16];
// Slice = m![A / 2], Time = m![R], Packet = m![A % 2 # 8] (8-way)
// R in Time → temporal accumulation. Packet is non-reduce.
fn reduce_time<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![A / 2], m![R], m![A % 2 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![A / 2], m![1], m![A % 2 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A % 2 # 4]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![1], m![A % 2 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![A / 2], Time = m![1], Packet = m![A % 2 # 4]
// R eliminated from Time.
}
}
Reduce in Packet Only (f32 Add)
R exists only in Packet, so the hardware performs a 4-way tree reduce within each flit with no temporal accumulation.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 4];
// Slice = m![A / 2], Time = m![A % 2], Packet = m![R # 8] (8-way)
// R in Packet → tree reduce within flit. VCG tags valid_count = |R|.
fn reduce_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![A / 2], m![A % 2], m![R # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![A / 2], m![A % 2], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![A % 2], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![A / 2], Time = m![A % 2], Packet = m![1 # 4]
// R eliminated from Packet.
}
}
Reduce Split Across Time and Packet (f32 Max)
R is split: R % 4 in Packet is tree-reduced within each flit, then accumulated across R / 4 time steps.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 16];
// Slice = m![A / 2], Time = m![R / 4], Packet = m![R % 4 # 8] (8-way)
// R % 4 in Packet → spatial tree reduce
// R / 4 in Time → temporal accumulation
fn reduce_time_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![A / 2], m![R / 4], m![R % 4 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R % 4]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Max,
)
// Output: Slice = m![A / 2], Time = m![1], Packet = m![1 # 4]
// Both R portions eliminated.
}
}
Reduce Axis Spanning Slice, Time, and Packet (i32 Min)
R spans all three dimensions. The VCG handles per-slice valid count variation for boundary slices.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13];
// Slice = m![R # 32 / 8], Time = m![R # 32 / 4 % 2], Packet = m![R # 32 % 4 # 8] (8-way)
// R split across all three: Slice (groups of 8), Time (pairs within group), Packet (4 elements).
// Boundary slices may have fewer valid time steps, and the VCG handles this.
fn reduce_slice_time_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 32 / 8], m![R # 32 / 4 % 2], m![R # 32 % 4 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 32 / 8], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 32 % 4]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
// Output: Slice = m![R # 32 / 8], Time = m![1], Packet = m![1 # 4]
// R eliminated from Time and Packet (accumulated within each slice).
}
}
Constraints
Accumulator Slot Limit
The IntraSliceReduce stage has 8 accumulator slots.
Each non-reduce (NR) position inside the outermost reduce factor occupies a separate slot, so the product of inner NR axis sizes must be ≤ 8.
Consider Time = m![R / 2, A % 2, R % 2] where R is the reduce label and A is non-reduce.
The NR factor A % 2 sits between the outer reduce R / 2 and inner reduce R % 2.
Each value of A % 2 needs its own accumulator slot to maintain an independent reduction:
Time = m![R / 2, A % 2, R % 2]
~~~~~~ ~~~~~~ ~~~~~~
outer R NR inner R
Flit sequence (R=4, A%2 has values 0,1):
flit #0: R/2=0, A%2=0, R%2=0 ──→ ┌─────────────────┐
flit #1: R/2=0, A%2=0, R%2=1 ──→ │ Slot 0 (A%2=0) │ accumulates R for A%2=0
flit #4: R/2=1, A%2=0, R%2=0 ──→ │ │
flit #5: R/2=1, A%2=0, R%2=1 ──→ └─────────────────┘
flit #2: R/2=0, A%2=1, R%2=0 ──→ ┌─────────────────┐
flit #3: R/2=0, A%2=1, R%2=1 ──→ │ Slot 1 (A%2=1) │ accumulates R for A%2=1
flit #6: R/2=1, A%2=1, R%2=0 ──→ │ │
flit #7: R/2=1, A%2=1, R%2=1 ──→ └─────────────────┘
2 NR positions → 2 slots used (≤ 8 ✓)
With multiple NR factors, slot usage multiplies:
Valid: Time = m![R, A % 2, B % 4] → 2 × 4 = 8 slots (≤ 8 ✓)
Invalid: Time = m![R, A % 3, B % 4] → 3 × 4 = 12 slots (> 8 ✗)
If the NR product exceeds 8, the mapping must be restructured.
Invalid: Accumulator Slot Limit Exceeded (i32 AddSat)
NR factors between reduce factors occupy accumulator slots.
Here A % 3 (3 values) and B % 4 (4 values) sit inside the reduce axis, requiring 3 × 4 = 12 slots, exceeding the 8-slot limit.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 6, B = 8, R = 16];
// Time = m![R, A % 3, B % 4] -> NR product = 3 × 4 = 12 > 8
fn invalid_too_many_slots<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![A / 3], m![R, A % 3, B % 4], m![B / 4 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![A / 3], m![A % 3, B % 4], m![B / 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![B / 4]>()
.vector_intra_slice_reduce::<R, m![A % 3, B % 4], m![B / 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Rejected: 12 accumulator slots required, but only 8 are available.
}
}
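One possible restructuring, sketched under the assumption that the surrounding mapping allows R to be placed innermost in Time: with both NR factors outer to R, no NR position sits inside the reduction and a single accumulator slot suffices (the function name is illustrative).
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 6, B = 8, R = 16];
// Time = m![A % 3, B % 4, R] -> no NR factor inner to R, so 1 slot is needed
fn restructured_slots<'l, const T: Tu>(
    input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![A / 3], m![A % 3, B % 4, R], m![B / 4 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![A / 3], m![A % 3, B % 4], m![B / 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
    input
        .vector_trim_way4::<m![B / 4]>()
        .vector_intra_slice_reduce::<R, m![A % 3, B % 4], m![B / 4]>(
            IntraSliceReduceOpI32::AddSat,
        )
}
}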
Padding Strategy
When a reduce axis is padded to fit hardware dimensions, the padded positions contain arbitrary data that must be excluded from the reduction result. Three strategies are available:
| Situation | Strategy |
|---|---|
| Mapping supported by VCG (see Valid Count Generator’s Interface) | VCG (automatic, no extra setup) |
| Unsupported VCG placement, simple reduce op (Add, Max, Min) | Identity-element padding via Fetch Engine’s pad_value |
Unsupported VCG placement, composed op (e.g., exp + Add) | Restructure the mapping, or use other methods |
1. VCG (Valid Count Generator).
The VCG tags each flit with a valid_count so the IntraSliceReduce stage excludes pad elements automatically.
Not all axis placements across Slice, Time, and Packet are supported; see Valid Count Generator’s Interface for details.
2. Identity-element padding.
Fill pad positions with the identity element of the reduce operation before data reaches the Intra-Slice Block.
The Fetch Engine’s padding adapter can set a pad_value during fetch.
| Operation | Identity Element |
|---|---|
AddSat / Add | 0 / 0.0 |
Max | i32::MIN / f32::NEG_INFINITY |
Min | i32::MAX / f32::INFINITY |
This does not work when the reduce operation is composed with a preceding non-invertible transformation.
For example, exp(x) + exp(y) + ... (sum of exponentials): there is no value p such that exp(p) = 0 (the additive identity), so padding with any value produces an incorrect contribution.
3. Other methods.
NaN masking via ExecutionId and per-slice SFR override (stosfr/itosfr) are additional options, but they are not covered on this page.
Performance
Throughput
The 2-level tree reduce is fully pipelined within the Intra-Slice Block pipeline. Each input flit passes through the tree in one pipeline stage, adding no extra per-flit throughput cost.
Latency
The reduce must accumulate all input flits for a reduction group before emitting the result.
If the reduce axis spans n time steps, the first output flit is delayed by n flit cycles beyond the normal pipeline latency.
In a multi-engine pipeline, this accumulation delay can stall downstream engines waiting for the first flit.
This page covers blocking mode only. Non-blocking mode, ExecutionId-based NaN masking, and per-slice SFR override are outside its scope.
Valid Count Generator’s Interface
Overview
Recall that when a reduce axis is padded, the extra positions contain arbitrary data that must be excluded from the reduction (see Padding Strategy).
The Valid Count Generator (VCG) solves this at the hardware level: it tags each 8-element packet with a valid_count (abbreviated vc), telling the Intra-Slice Reduce stage how many elements are real data.
The intra-slice reduce API takes a REDUCE_LABEL that identifies the axis to reduce (e.g., Ident::R).
When that axis is padded, the VCG automatically determines which elements are real data and which are padding, based on how the axis is distributed across Slice, Time, and Packet.
The VCG requires the reduce axis to be structured in a specific way (padded, split, and distributed across Slice, Time, and Packet).
The distribution rules are described in How R Should Be Distributed, followed by concrete examples for each placement.
The VCG is configured automatically by the compiler; no manual setup is needed.
For the underlying hardware mechanism, see Valid Count Generator’s Implementation.
Quick Reference
If you are checking whether a mapping is supported, start with this table and the examples below.
| Placement | Mode | Example | Supported |
|---|---|---|---|
| Slice + Time | Time Reduce | Slice = m![X, R # 24 / 3], Time = m![R # 24 % 3] | Yes |
| Time only | Time Reduce | Time = m![R # 16] | Yes |
| Slice + Time (transposed, supported) | Time Reduce | Slice = m![X, R # 8 % 4], Time = m![R # 8 / 4] | Yes |
| Slice + Time (transposed, not supported) | Time Reduce | Slice = m![X, R # 20 % 4], Time = m![R # 20 / 4] | No |
| Non-outer/inner ordering | Time Reduce | Slice = m![X, R # 16 / 2 % 4, R # 16 / 8] | No |
| Packet only | Packet Reduce | Packet = m![R # 8] | Yes |
| Time + Packet | Packet Reduce | Time = m![R # 24 / 8], Packet = m![R # 24 % 8] | Yes |
| Time + Packet (Packet not innermost) | Packet Reduce | Time = m![R # 24 % 8], Packet = m![R # 24 / 8 # 8] | No |
| Time + Packet (mixed Packet axes) | Packet Reduce | Packet = m![R # 24 % 4, A] | No |
| Slice + Packet | Packet Reduce | Slice = m![R # 2048 / 8], Packet = m![R # 2048 % 8] | No |
Two Modes
Depending on whether R appears in Packet, the VCG operates in one of two exclusive modes:
Packet Reduce Mode: R appears in Packet, for example Packet = m![R # 24 % 8].
The VCG assigns a per-packet valid_count from 0 to 8, so the IntraSliceReduce stage knows how many packet elements are real data.
Time Reduce Mode: R does not appear in Packet, only in Slice and/or Time.
The VCG makes a binary valid-or-invalid decision per flit.
How R Should Be Distributed
All examples in this document assume the same high-level pattern: first pad the reduce axis, then split it, then place the resulting sub-expressions across Slice, Time, and Packet.
At the top level, R # p is split into an outer and an inner part:
R # p -> R # p / k (outer), R # p % k (inner)
Each part is then assigned to one of the three hardware dimensions. Within a dimension, a part can be split again recursively.
The VCG rule depends on where R appears:
- Slice: when
Rappears multiple times in Slice, the stride order must increase from inner to outer. This keeps each slice’sRrange contiguous. - Time: multiple
Rsub-expressions can appear in any order, and non-reduce axes may sit between them. - Packet: Packet must still be padded to
# 8.Rmay appear at most once in Packet, and it must be the innermost% kpart occupying the packet prefix.
Example: why Slice stride order matters
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 32];
// R # 16, split into 2 * 4 * 2:
// Slice = m![R # 16 / 8, X, R # 16 / 2 % 4]
// Time = m![R # 16 % 2]
//
// Slice strides for R:
// R # 16 / 2 % 4 -> stride = 2 (inner)
// R # 16 / 8 -> stride = 8 (outer)
// 2 < 8, so each slice receives a contiguous R interval.
fn example_stride_ordering<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
}
}
Examples: Time Reduce Mode
R does not appear in Packet.
The VCG decides per-flit: valid or invalid.
Slice + Time
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 4, R = 17, X = 32];
// The most common pattern. Outer part of R → Slice, inner part → Time.
//
// R = 17, padded to 24 = 8 * 3.
// R # 24 is split into:
// R # 24 / 3 (size 8, outer) → Slice
// R # 24 % 3 (size 3, inner) → Time
// This follows the standard outer→Slice, inner→Time pattern.
//
// Slice = m![X, R # 24 / 3], Time = m![R # 24 % 3], Packet = m![A # 8]
// |Slice| = X(32) * 8 = 256.
// Time Reduce Mode: R does not appear in Packet. VCG gates flits by slice and time.
// Boundary slice = floor(17 / 3) = 5. Valid time steps in boundary = 17 mod 3 = 2.
// For each X group, R-slices 0-4: all 3 time steps valid.
// R-slice 5: 2 of 3 valid (boundary). R-slices 6-7: all invalid.
fn reduce_slice_time<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 24 / 3], m![R # 24 % 3], m![A # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 24 / 3], m![1], m![A], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A]>()
.vector_intra_slice_reduce::<R, m![1], m![A]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![X, R # 24 / 3], Time = m![1], Packet = m![A]
// R eliminated from Time. Boundary slice (R-slice #5) accumulated only their valid steps.
}
}
Valid count trace for R = 17 (within one X group)
| R-slice | Group | t=0 | t=1 | t=2 |
|---|---|---|---|---|
| 0 | all-valid | 0 | 1 | 2 |
| 1 | all-valid | 3 | 4 | 5 |
| 2 | all-valid | 6 | 7 | 8 |
| 3 | all-valid | 9 | 10 | 11 |
| 4 | all-valid | 12 | 13 | 14 |
| 5 | boundary | 15 | 16 | . |
| 6 | all-invalid | . | . | . |
| 7 | all-invalid | . | . | . |
This pattern repeats identically for each of the 32 X groups (256 total slices).
Slice + Time (R Split Into Multiple Sub-Expressions in Slice)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 32];
// R can appear as multiple sub-expressions in Slice.
//
// R # 16 is first split into outer (/ 2) and inner (% 2) for Slice vs Time:
// R # 16 / 2 → Slice portion (size 8)
// R # 16 % 2 → Time portion (size 2)
//
// The Slice portion (R # 16 / 2, size 8) is further split into two sub-expressions:
// R # 16 / 8 stride = 8, size = 2 (outer)
// R # 16 / 2 % 4 stride = 2, size = 4 (inner)
//
// Slice = m![R # 16 / 8, X, R # 16 / 2 % 4], Time = m![R # 16 % 2], Packet = m![1 # 8]
// Slice product = 2 * 32 * 4 = 256.
//
// Slice ordering check (inner to outer, ascending stride):
// R # 16 / 2 % 4 stride = 2, size = 4 (inner)
// R # 16 / 8 stride = 8, size = 2 (outer)
// inner_stride(2) * size(4) = 8 = outer_stride ✓
//
// This gives each slice contiguous R indices (within one X group):
// S0: 0,1 S1: 2,3 S2: 4,5 S3: 6,7 S4: 8,9 S5: 10,11 S6: 12,13 S7: 14,15
// Time Reduce Mode:
// Boundary slice = floor(13 / 2) = 6. Valid time steps = 13 mod 2 = 1.
// S0-S5: all-valid, S6: boundary (1 of 2), S7: all-invalid.
fn reduce_multi_level<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
}
}
Slice + Time (Transposed)
The reverse of Slice + Time: the inner part of R goes to Slice, the outer part to Time.
In transposed mode, the slice ID represents the inner index and the time step the outer index.
Slices beyond the boundary still have valid data at early time steps.
Given R # p split as Slice = m![..., R # p % slice_size], Time = m![R # p / slice_size, ...] (where slice_size is the size of R’s portion in Slice), this mode is supported only when:
$$\text{time_size} = p / \text{slice_size} = \lceil |R| / \text{slice_size} \rceil$$
- \(|R|\): original axis size (before padding)
- \(\text{slice_size}\): the size of R’s portion in Slice (R # p % slice_size)
- \(\text{time_size}\): the size of R’s portion in Time (R # p / slice_size), which equals \(p / \text{slice_size}\)
Slice + Time (Transposed, Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 5, X = 64];
// Transposed: inner part of R → Slice, outer part → Time.
// (The reverse of standard Slice + Time.)
//
// R = 5, padded to 8 = 4 * 2.
// R # 8 is split into:
// R # 8 % 4 (size 4, inner) → Slice (transposed: inner goes to Slice)
// R # 8 / 4 (size 2, outer) → Time (transposed: outer goes to Time)
//
// Slice = m![X, R # 8 % 4], Time = m![R # 8 / 4], Packet = m![1 # 8]
// |Slice| = 64 * 4 = 256.
//
// time_size(2) == ceil(5 / 4) = 2 ✓
//
// Time Reduce Mode:
// Boundary slice = |R| mod slice_size = 5 mod 4 = 1.
// Valid time steps in boundary = floor(|R| / slice_size) = floor(5/4) = 1.
// R-slice 0 (< boundary): all 2 time steps valid.
// R-slice 1 (= boundary): 1 of 2 time steps valid.
// R-slices 2-3 (> boundary): also 1 of 2 time steps valid (transposed behavior).
fn reduce_transposed<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 8 % 4], m![R # 8 / 4], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 8 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![X, R # 8 % 4], Time = m![1], Packet = m![1 # 4]
}
}
Slice + Time (Transposed, Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 14, X = 64];
// NOT supported: time_size is over-allocated.
//
// Slice = m![X, R # 20 % 4], Time = m![R # 20 / 4], Packet = m![1 # 8]
// Slice = 64 * 4 = 256.
//
// R = 14, padded to 20 = 4 (S) * 5 (time_size).
// time_size(5) != ceil(14 / 4) = 4 ✗
//
// R-slice 0 should have 4 valid time steps (indices 0, 4, 8, 12, all < 14).
// But the VCG classifies R-slice 0 as "all-valid" = 5 time steps. WRONG.
//
// A possible fix: pad to 16 instead, so time_size = ceil(14/4) = 4:
// Slice = m![X, R # 16 % 4], Time = m![R # 16 / 4], Packet = m![1 # 8]
fn reduce_transposed_wrong<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 20 % 4], m![R # 20 / 4], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 20 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// ✗ VCG will over-count valid time steps. Use R # 16 instead of R # 20.
}
}
Time Only
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 12, X = 64];
// R exists only in Time. VCG gates excess time steps as invalid.
// All slices see the same pattern.
//
// Slice = m![X, A / 2], Time = m![R # 16], Packet = m![A % 2 # 8]
// Slice = 64 * 4 = 256.
//
// R = 12, padded to 16. Time Reduce Mode: time steps 0-11 valid, 12-15 invalid.
fn reduce_time_only<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, A / 2], m![R # 16], m![A % 2 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, A / 2], m![1], m![A % 2 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![A % 2 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![A % 2 # 4]
// R eliminated from Time. Time steps 12-15 were gated off.
}
}
Non-Outer/Inner Ordering (Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 32];
// NOT supported: reordered sub-expressions in Slice break monotonic validity.
// R's sub-expressions must form a clean outer/inner relationship across dimensions.
// If reordered, the VCG cannot express the resulting validity pattern.
//
// Slice = m![X, R # 16 / 2 % 4, R # 16 / 8], Time = m![R # 16 % 2]
// Slice = 32 * 4 * 2 = 256.
//
// Striped R indices per R-slice group:
// S0: 0,1 S1: 8,9 S2: 2,3 S3: 10,11 S4: 4,5 S5: 12,13 S6: 6,7 S7: 14,15
// Validity: valid, valid, valid, valid, valid, partial, valid, invalid
//
// Non-monotonic (S6 valid after S5 partial). The VCG cannot express this.
//
// Fix: use standard ordering instead:
// Slice = m![X, R # 16 / 8, R # 16 / 2 % 4], Time = m![R # 16 % 2]
fn reduce_wrong_ordering<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4, R # 16 / 8], m![R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4, R # 16 / 8], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
// ✗ Non-monotonic slice validity. VCG cannot express this pattern.
}
}
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 64];
// NOT supported: Time-Slice-Time interleave.
// Slice = m![X, R # 16 / 2 % 4], Time = m![R # 16 / 8, R # 16 % 2]
// Slice = 64 * 4 = 256.
//
// The interleave causes different R-slices to need different valid time step counts:
// S0: R indices 0,1,8,9 -> 4/4 valid
// S1: R indices 2,3,10,11 -> 4/4 valid
// S2: R indices 4,5,12,13 -> 3/4 valid
// S3: R indices 6,7,14,15 -> 2/4 valid
//
// The VCG has a single threshold, so it cannot express per-slice values.
//
// Fix: standard Slice outer, Time inner:
// Slice = m![X, R # 16 / 4], Time = m![R # 16 % 4]
fn reduce_wrong_interleave<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4], m![R # 16 / 8, R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
// ✗ Per-slice V values needed. VCG cannot express this pattern.
}
}
Time: Flexible Ordering
When R is split into multiple sub-expressions within Time, they can appear in any order and non-reduce axes can sit between them (unlike Slice, where ordering is strict).
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 4, R = 45, X = 2, Y = 256];
// R's Time portion can be split into multiple sub-expressions (order does not matter).
//
// R # 48 is split into outer (/ 8) and inner (% 8) for Time vs Packet:
// R # 48 / 8 → Time portion (size 6)
// R # 48 % 8 → Packet portion (size 8)
//
// The Time portion (R # 48 / 8, size 6) is further split:
// R # 48 / 8 / 2 (size 3)
// R # 48 / 8 % 2 (size 2)
// These can appear in any order with non-reduce axes between them.
//
// Slice = m![Y], Time = m![R # 48 / 8 % 2, X, R # 48 / 8 / 2], Packet = m![A # 8]
// |Slice| = 256.
//
// The VCG combines both R sub-expressions' positions to determine validity.
// (Contrast with Slice, where ordering is strict.)
fn reduce_time_split<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![Y], m![R # 48 / 8 % 2, X, R # 48 / 8 / 2], m![A # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![Y], m![X], m![A], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A]>()
.vector_intra_slice_reduce::<R, m![X], m![A]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![Y], Time = m![X], Packet = m![A]
// Both R sub-expressions eliminated from Time; X remains.
}
}
Examples: Packet Reduce Mode
R appears in Packet.
The VCG assigns a per-packet valid_count (0-8) that varies by time step.
Packet Only
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 3, X = 64];
// R exists only in Packet. Every packet gets the same constant valid_count = |R|.
//
// Slice = m![X, A / 2], Time = m![1], Packet = m![R # 8]
// Slice = 64 * 4 = 256.
//
// R = 3, padded to 8. Packet Reduce Mode: every packet has vc = 3.
// No slice or time variation needed.
fn reduce_packet_only<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![R # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
// R eliminated from Packet. All 3 of 8 elements were counted as valid.
}
}
Time + Packet
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 19, X = 64];
// R spans both Time and Packet.
// VCG produces full packets first, then a partial packet at the tail.
//
// Slice = m![X, A / 2], Time = m![R # 24 / 8], Packet = m![R # 24 % 8]
// Slice = 64 * 4 = 256.
//
// R = 19, padded to 24 = 3 (Time) * 8 (Packet).
// Packet Reduce Mode: R fills all 8 Packet positions.
// t=0: vc = 8 (all valid)
// t=1: vc = 8 (all valid)
// t=2: vc = 3 (first 3 valid, last 5 are padding)
// All slices see the same [8, 8, 3] pattern.
fn reduce_time_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R # 24 / 8], m![R # 24 % 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 24 % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
// R eliminated from both Time and Packet.
}
}
Time + Packet (Packet Not Innermost, Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 19, X = 64];
// NOT supported: R's portion in Packet must be the innermost (lowest) sub-expression.
//
// R = 19, padded to 24 = 3 * 8.
// R # 24 is split into:
// R # 24 % 8 (size 8, inner), should go to Packet
// R # 24 / 8 (size 3, outer), should go to Time
// But here they are swapped: the OUTER part (R # 24 / 8) goes to Packet,
// and the INNER part (R # 24 % 8) goes to Time.
//
// Slice = m![X, A / 2], Time = m![R # 24 % 8], Packet = m![R # 24 / 8 # 8]
// |Slice| = 64 * 4 = 256.
//
// The VCG's prefix-based valid count assumes R in Packet is the innermost index.
// When the outer part is in Packet instead, the valid count pattern no longer
// forms a simple decreasing sequence; it would need non-contiguous validity.
//
// Fix: put the inner part in Packet and the outer part in Time:
// Time = m![R # 24 / 8], Packet = m![R # 24 % 8]
fn reduce_wrong_packet_outer<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R # 24 % 8], m![R # 24 / 8 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 24 / 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// ✗ Outer part in Packet violates innermost requirement.
}
}
Time + Packet (R fills fewer than 8 positions)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 7, X = 64];
// When R fills fewer than 8 Packet positions, the remaining must be padding, not another axis.
// The valid count is capped at the size of R in Packet.
//
// Slice = m![X, A / 2], Time = m![R # 8 / 4], Packet = m![R # 8 % 4 # 8]
// Slice = 64 * 4 = 256.
//
// R = 7, padded to 8 = 2 * 4. R fills 4 Packet positions, padded to 8-way.
// Packet Reduce Mode: valid count capped at 4 (the size of R in Packet).
// t=0: vc = 4 (positions 0-3 valid, 4-7 are padding)
// t=1: vc = 3 (positions 0-2 valid)
// All slices see the same [4, 3] pattern.
//
// Supported: R solely occupies the prefix; positions 4-7 are padding, not another axis.
fn reduce_time_packet_partial<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R # 8 / 4], m![R # 8 % 4 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 8 % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
}
}
Time + Packet (Mixed Packet Axes, Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 2, R = 19, X = 256];
// NOT supported. R must be the sole occupant of the Packet prefix.
// If another axis shares the Packet, the prefix-based count marks that axis's data as padding.
//
// Slice = m![X], Time = m![R # 24 / 4], Packet = m![R # 24 % 4, A # 8]
// Slice = 256.
//
// R fills positions 0-3, A fills positions 4-5 (padded to 8).
// Packet Reduce Mode: valid count applies to the whole packet as a prefix.
// vc = 3 means "positions 0-2 valid", but A's real data at positions 4-5
// is ALWAYS treated as invalid, regardless of A's actual size.
// The reduce result silently loses A's contributions.
//
// Fix: put A outside Packet, or pad R to fill all 8 positions:
// Time = m![R # 24 / 8], Packet = m![R # 24 % 8]
fn reduce_wrong_mixed_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X], m![R # 24 / 4], m![R # 24 % 4, A # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X], m![1], m![A # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 24 % 4, A]>()
.vector_intra_slice_reduce::<R, m![1], m![A # 4]>(
IntraSliceReduceOpF32::Add,
)
// ✗ A's data at positions 4-5 silently excluded by prefix-based vc.
}
}
Time + Packet: Perfectly Aligned
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 24, X = 64];
// When |R| is exactly divisible by the size of R in Packet,
// every packet is full and the VCG is not needed.
//
// Slice = m![X, A / 2], Time = m![R / 8], Packet = m![R % 8]
// Slice = 64 * 4 = 256.
//
// R = 24, no padding needed! 24 = 3 * 8.
// Every element is real data. All vc = 8.
fn reduce_time_packet_aligned<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R / 8], m![R % 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
}
}
Slice + Packet (Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 2045];
// NOT supported. The VCG produces the same valid_count for all slices at a given time step.
// When R spans Slice and Packet, different slices need different counts.
//
// Slice = m![R # 2048 / 8], Time = m![1], Packet = m![R # 2048 % 8]
// Slice = 2048 / 8 = 256.
//
// R = 2045, padded to 2048 = 256 (Slice) * 8 (Packet).
// Packet Reduce Mode:
// Slices 0-254 need vc = 8 (full). Slice 255 needs vc = 5 (2045 mod 8 = 5).
// But vc is the same for all slices at the same time step.
// The VCG cannot produce vc = 8 for some slices and vc = 5 for others.
//
// Fix: add R to Time so R spans Slice + Time instead:
// Slice = m![R # 2048 / 8], Time = m![R # 2048 % 8], Packet = m![1 # 8]
// Another possible fix: if R's size were 2048, no padding would be introduced, so no VCG is needed at all:
// Slice = m![R / 8], Time = m![1], Packet = m![R % 8]
fn reduce_wrong_slice_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 2048 / 8], m![1], m![R # 2048 % 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 2048 / 8], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 2048 % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// ✗ Slice-varying vc needed. VCG cannot express this pattern.
}
}
Valid Count Generator’s Implementation
This document describes what the Valid Count Generator (VCG) hardware can express, independent of mapping expressions or tensor shapes. VCG tags are consumed by the Intra-Slice Reduce stage to exclude padding from reductions. For how mapping expressions control VCG behavior (supported placements, constraints, and examples), see Valid Count Generator’s Interface.
Data Model
Data flows into the VectorEngine as a stream of flits (packets). Each flit contains 8 elements. The VCG operates at the VectorEngine’s input, tagging each 8-way flit with a valid count. The 4-way halving and its valid count derivation are described in Downstream: 4-Way Operations.
A flit is identified by two coordinates. A slice corresponds to the Slice dimension in the mapping; a time step indexes sequential flits within a slice.
| Coordinate | Range | Meaning |
|---|---|---|
| s (slice number) | [0, num_slices) | Which slice processes this flit |
| t (time step) | [0, num_flits) | Sequential position within a slice |
The VCG assigns a valid count (abbreviated vc in formulas and diagrams) to each flit:
$$\text{vc}(s, t) \in \{0, 1, \ldots, 8\}$$
Element p (where p is in [0, 8)) within flit (s, t) is valid if and only if p < vc(s, t).
Valid elements always form a contiguous prefix; this is a fundamental hardware constraint.
The VCG cannot express “elements 0, 1, 3 are valid but 2 is not.”
Valid Count Formula
The VCG computes vc(s, t) through a pipeline of stages:
$$t \overset{\text{Sequencer}}{\longrightarrow} (c_0, c_1, \ldots, c_{k-1}) \overset{\text{Original Dims}}{\longrightarrow} \text{idx}(t) \overset{\text{Validity}}{\longrightarrow} \text{vc}(s, t)$$
- Sequencer: The flat time index t is decomposed into counter values \((c_0, c_1, \ldots, c_{k-1})\) via mixed-radix decomposition.
- Original Dimensions: Each counter is assigned to one of 4 dimensions (packet dim or gate dim 0-2). Per-dimension indices are computed as \(\text{idx}_d(t) = \sum_{i} c_i \cdot \sigma_i\), where \(\sigma_i\) is the counter’s stride.
- Validity Decision:
  - Packet Dim: produces a packet-level valid count \(\text{packet_vc}(t) = \min(\text{stride}_p, \max(0, V_p - \text{idx}_p(t)))\).
  - Gate Dims: each produces a binary gate \(\text{gate}_d(s, t) \in \{0, 1\}\) based on slice classification (below/boundary/above a threshold) and the per-dim index.
The final valid count combines these components:
$$\text{vc}(s, t) = \text{packet_vc}(t) \times \text{gate}_0(s, t) \times \text{gate}_1(s, t) \times \text{gate}_2(s, t)$$
vc(s,t) = packet_vc(t) × gate_0(s,t) × gate_1(s,t) × gate_2(s,t)
─────────── ─────────── ─────────── ───────────
packet dim gate dim 0 gate dim 1 gate dim 2
(count 0-8) (gate 0/1) (gate 0/1) (gate 0/1)
- If all gates are open (= 1): the flit gets packet_vc(t) valid elements.
- If any gate is closed (= 0): vc = 0 (the entire flit is invalid, regardless of packet dim’s count).
VCG Configuration
Configuration is organized around two concepts: counters (which drive the sequencer) and original dimensions (which decide validity). Counters produce a flit sequence; each dim uses its assigned counters to compute an index and decide validity.
The VCG is configured via the following parameters (each is explained in detail in subsequent sections):
| Field | Scope | Description |
|---|---|---|
| Counter limits \(L_0 \ldots L_7\) | per counter (up to 8) | Sequencer counter limits |
| Original dim assignment | per counter | Which dim (packet / gate 0-2) each counter belongs to |
| stride \(\sigma_i\) | per counter | stride for index computation |
| \(\text{mask}_{gd}\) | per gate dim | Slice-id bitmask |
| \(\text{match}_{gd}\) | per gate dim | Threshold for slice classification |
| \(V_p\) / \(V_{gd}\) | packet dim / per gate dim | Valid count / threshold |
| \(P_{gd}\) | per gate dim | Standard (0) vs transposed (1) |
Unassigned counters and disabled gate dims (\(\text{mask}_{gd} = 0, \text{match}_{gd} = 1\)) effectively pass through as “all valid.”
Sequencer
The sequencer interprets the flat time index t as a multi-dimensional counter.
Counter Structure
Up to 8 nested counters iterate to produce the flit sequence:
$$t \to (c_0, c_1, \ldots, c_{k-1})$$
where \(c_0\) is the fastest (innermost) and \(c_{k-1}\) is the slowest (outermost).
Each counter \(c_i\) has a limit \(L_i\), cycling through \(0, 1, \ldots, L_i - 1\), and a stride \(\sigma_i\) that scales the counter’s contribution to the dimension index (see Original Dimensions). The total number of flits per slice is \(L_0 \times L_1 \times \cdots \times L_{k-1}\).
Example: 3 counters with limits [3, 2, 2]
This produces 3 * 2 * 2 = 12 flits per slice. The counters cycle as:
t=0: (c_0=0, c_1=0, c_2=0)
t=1: (c_0=1, c_1=0, c_2=0)
t=2: (c_0=2, c_1=0, c_2=0)
t=3: (c_0=0, c_1=1, c_2=0) <- c_0 wraps, c_1 increments
t=4: (c_0=1, c_1=1, c_2=0)
t=5: (c_0=2, c_1=1, c_2=0)
t=6: (c_0=0, c_1=0, c_2=1) <- c_1 wraps, c_2 increments
t=7: (c_0=1, c_1=0, c_2=1)
t=8: (c_0=2, c_1=0, c_2=1)
t=9: (c_0=0, c_1=1, c_2=1)
t=10: (c_0=1, c_1=1, c_2=1)
t=11: (c_0=2, c_1=1, c_2=1)
c_0 changes every flit, c_1 every 3 flits, c_2 every 6 flits, just like digits in a mixed-radix number.
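The decomposition is ordinary mixed-radix arithmetic. The following standalone sketch (plain Rust, not Virtual ISA code; the function and variable names are illustrative) reproduces the trace above for limits [3, 2, 2]:
// Decompose a flat time index t into counter values (c_0 fastest, c_{k-1} slowest),
// given per-counter limits. Plain-Rust illustration of the sequencer, not device code.
fn decompose(mut t: usize, limits: &[usize]) -> Vec<usize> {
    limits
        .iter()
        .map(|&limit| {
            let c = t % limit;
            t /= limit;
            c
        })
        .collect()
}

fn main() {
    let limits = [3, 2, 2];
    let total: usize = limits.iter().product();
    for t in 0..total {
        // e.g. t=3 -> [0, 1, 0], t=6 -> [0, 0, 1], matching the trace above.
        println!("t={:2}: {:?}", t, decompose(t, &limits));
    }
}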
The sequencer produces counter values; the next step is mapping them to original dimensions and then to the validity decision.
Original Dimensions
Each counter is assigned to one of 4 original dimensions (packet dim or gate dim 0-2), or left unassigned.
Let \(D_d\) be the set of counters assigned to original dimension d.
Each counter contributes to its assigned dimension’s index by multiplying its current value by its stride.
The sum of all contributions gives the current position within that dimension’s data:
$$\text{idx}_d(t) = \sum _ {i \in D_d} c_i(t) \cdot \sigma_i$$
This index tracks the position within that dimension’s original data range. Multiple counters can be assigned to the same dim; their contributions are simply summed.
Example: Counters mapped to original dimensions
Suppose 3 counters are configured as follows:
| Counter | Limit | stride | Assigned to |
|---|---|---|---|
| c_0 | 3 | 8 | packet dim (W axis) |
| c_1 | 2 | 1 | gate dim 0 (C axis) |
| c_2 | 2 | 1 | gate dim 1 (H axis) |
At time step t=4, which gives (c_0=1, c_1=1, c_2=0):
- idx_p = 1 * 8 = 8, position 8 along W
- idx_g0 = 1 * 1 = 1, position 1 along C
- idx_g1 = 0 * 1 = 0, position 0 along H
Each dim uses its index independently to decide validity.
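As a minimal plain-Rust illustration of the same computation (the strides and dim assignments are those in the table above; the counter values are the ones the sequencer yields at t = 4):
fn main() {
    // c_0 -> packet dim (stride 8), c_1 -> gate dim 0 (stride 1), c_2 -> gate dim 1 (stride 1).
    // At t = 4 the sequencer yields (c_0, c_1, c_2) = (1, 1, 0).
    let (c0, c1, c2) = (1u32, 1, 0);

    // idx_d(t) = sum of c_i * sigma_i over the counters assigned to dim d.
    // Here each dim has exactly one counter, so each sum has a single term.
    let idx_p = c0 * 8;  // position 8 along W
    let idx_g0 = c1 * 1; // position 1 along C
    let idx_g1 = c2 * 1; // position 0 along H
    assert_eq!((idx_p, idx_g0, idx_g1), (8, 1, 0));
    println!("idx_p={idx_p}, idx_g0={idx_g0}, idx_g1={idx_g1}");
}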
Validity Decision
The following diagram shows the complete pipeline from time index to final valid count. Each stage is explained in the subsections below.
t (flat time index)
│
├─ mixed-radix decomposition (see Sequencer above)
▼
(c_0, c_1, ..., c_{k-1}) ← counter values
│
├─ each counter assigned to a dim, multiplied by stride σ_i
│ (see Original Dimensions above)
▼
idx_p(t), idx_g0(t), idx_g1(t), idx_g2(t) ← per-dim indices
│
├─ packet dim: packet_vc = min(stride_p, max(0, V_p - idx_p))
├─ gate dim 0: gate_0 = f(masked_id(s), idx_g0, match_g0, V_g0)
├─ gate dim 1: gate_1 = f(masked_id(s), idx_g1, match_g1, V_g1)
├─ gate dim 2: gate_2 = f(masked_id(s), idx_g2, match_g2, V_g2)
│
▼
vc(s,t) = packet_vc(t) × gate_0(s,t) × gate_1(s,t) × gate_2(s,t)
Packet dim and gate dims make qualitatively different judgments:
- Packet dim answers: “how many elements in this flit are valid?” (a count, 0-8)
- Gate dims each answer: “is this flit valid at all?” (a binary gate, yes or no)
Gate dims act as gates: only when all three report “valid” does packet dim’s count take effect. If any gate reports “invalid”, the entire flit gets valid count = 0.
Packet Dim: Packet-Level Valid Count
Packet dim determines how many elements within a flit are valid. Two parameters control the computation:
- \(V_p\): the original valid count for packet dim (the unpadded size of the data along this dimension).
- \(\text{stride}_p\): the stride of the innermost counter assigned to packet dim, representing how many flit elements belong to the axis tracked by packet dim.
The per-packet valid count is:
$$\text{packet_vc}(t) = \min(\text{stride}_p, \max(0, V_p - \text{idx}_p(t)))$$
packet_vc(t) = min( stride_p, max(0, V_p - idx_p(t) ))
───────── ─────────────────
HW width cap remaining valid data
When the axis fills all 8 flit positions, \(\text{stride}_p = 8\) and the formula is equivalent to \(\min(8, \ldots)\). When the axis occupies only \(k < 8\) positions (with the remaining positions padded), \(\text{stride}_p = k\) caps the valid count so that only the axis’s portion of the flit is counted as valid.
Hardware constraints:
- The innermost Packet counter must always be assigned to packet dim.
- Other counters may also be assigned to packet dim (e.g., a Time counter for the same axis).
- When no axis is assigned to packet dim, packet_vc is always 8 (full flit) or 0 (empty flit), effectively making packet dim a binary gate like gate dims.
As the sequencer advances, \(\text{idx}_p\) increases and packet_vc decreases.
This produces a repeating sawtooth pattern:
Example 1: V_p = 19, stride_p = 8, counter stride=8, limit=3 (axis fills full 8-way)
flit 0: idx_p = 0 -> packet_vc = min(8, 19 - 0) = 8 (full)
flit 1: idx_p = 8 -> packet_vc = min(8, 19 - 8) = 8 (full)
flit 2: idx_p = 16 -> packet_vc = min(8, 19 - 16) = 3 (partial)
Example 2: V_p = 11, stride_p = 4, counter stride=4, limit=3 (axis fills 4 of 8 positions)
flit 0: idx_p = 0 -> packet_vc = min(4, 11 - 0) = 4 (full within stride)
flit 1: idx_p = 4 -> packet_vc = min(4, 11 - 4) = 4 (full within stride)
flit 2: idx_p = 8 -> packet_vc = min(4, 11 - 8) = 3 (partial)
In Example 2, positions 4-7 in each flit are padding and automatically excluded by the \(\text{stride}_p = 4\) cap.
Key property: packet_vc depends only on the sequencer state t, not on the slice s.
All slices receive the same packet valid count for the same time step.
Example: Why packet_vc is slice-independent
If \(V_p = 19\), \(\text{stride}_p = 8\), and the packet counter cycles [0, 8, 16], then:
- At t=2 (idx_p=16): packet_vc = 3 for every slice.
- Slice 0 gets vc=3, slice 5 gets vc=3, slice 15 gets vc=3, all the same.
This is because packet dim’s formula \(\min(\text{stride}_p, V_p - \text{idx}_p)\) has no s term.
Gate dims can still make certain slices’ final vc = 0 (by reporting invalid),
but they cannot change the packet_vc value itself.
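A small plain-Rust check of the formula against Examples 1 and 2 above (standalone sketch, not Virtual ISA code):
// packet_vc(t) = min(stride_p, max(0, V_p - idx_p(t))). Note it has no slice term.
fn packet_vc(stride_p: i64, v_p: i64, idx_p: i64) -> i64 {
    stride_p.min((v_p - idx_p).max(0))
}

fn main() {
    // Example 1: V_p = 19, stride_p = 8, counter stride = 8, limit = 3.
    let vcs1: Vec<i64> = (0..3).map(|c| packet_vc(8, 19, c * 8)).collect();
    assert_eq!(vcs1, vec![8, 8, 3]);

    // Example 2: V_p = 11, stride_p = 4, counter stride = 4, limit = 3.
    let vcs2: Vec<i64> = (0..3).map(|c| packet_vc(4, 11, c * 4)).collect();
    assert_eq!(vcs2, vec![4, 4, 3]);

    println!("example 1: {vcs1:?}, example 2: {vcs2:?}");
}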
Gate Dims: Per-Flit Binary Validity
Gate dims 0, 1, 2 decide whether a flit as a whole is valid (1) or invalid (0), not a count.
Each gate dim classifies slices into groups by extracting a subset of the slice-id bits (via a bitmask) and comparing the result against a threshold. The bitmask \(\text{mask}_{gd}\) selects which bits of the slice-id this gate dim tracks:
$$\text{masked_id} (s) = s \mathbin{\&} \text{mask}_{gd}$$
Example: 16 slices (4-bit slice_id), mask_g0 = 0b1100
slice_id (4 bits): [ b3 b2 b1 b0 ]
mask_g0 = 0b1100: [ 1 1 0 0 ]
─────────────────
masked_id: [ b3 b2 0 0 ] → extracts the upper 2 bits
Slices fall into three groups based on comparing \(\text{masked_id}\) with \(\text{match}_{gd}\):
| Group | Condition | Meaning |
|---|---|---|
| Below | \(\text{masked_id}(s) < \text{match}_{gd}\) | All time steps valid |
| Boundary | \(\text{masked_id}(s) = \text{match}_{gd}\) | Valid when \(\text{idx}_{gd}(t) < V_{gd}\) |
| Above | \(\text{masked_id}(s) > \text{match}_{gd}\) | Depends on mode (see below) |
The \(P_{gd}\) flag selects between two modes that differ only in the “above” group:
Standard mode (\(P_{gd} = 0\))
$$\text{gate}_d(s, t) = \begin{cases} 1 & \text{masked_id}(s) < \text{match}_{gd} \\ [\text{idx}_{gd}(t) < V_{gd}] & \text{masked_id}(s) = \text{match}_{gd} \\ 0 & \text{masked_id}(s) > \text{match}_{gd} \end{cases}$$
Above-threshold slices are entirely invalid. This is the common case: the Slice factor is laid out in ascending order, so slices beyond the boundary contain no valid data.
Transposed mode (\(P_{gd} = 1\))
$$\text{gate}_d(s, t) = \begin{cases} 1 & \text{masked_id}(s) < \text{match}_{gd} \\ [\text{idx}_{gd}(t) < V_{gd}] & \text{masked_id}(s) \ge \text{match}_{gd} \end{cases}$$
Above-threshold slices get the same \(V_{gd}\) check as the boundary: they are not entirely invalid. This handles the transposed case where the slice ID encodes the inner index: slices beyond the boundary still contain valid data at early time steps (the outer index is small enough), and only run out of valid data at the same point as the boundary slice.
To disable a gate dim (make it always valid), set \(\text{mask}_{gd} = 0, \text{match}_{gd} = 1\). Then \(\text{masked_id} = 0 < 1\) for all slices, so every slice is in the “below” group.
Example: Standard mode, H=5 split into Ho=4 (slice) × Hi=2 (time)
H=5 is split into Ho × Hi = 4 × 2 (padded from 5 to 8).
Ho is the Slice factor (encoded in slice-id bits), Hi is the Time factor (sequencer counter).
Axis index = Ho * 2 + Hi. Valid when index < 5.
Gate dim 0 config: mask=0b1100 (extracts 2 bits for Ho), match=2, V_g0=1, standard mode.
16 slices, where masked_id = (slice_id & 0b1100) >> 2 gives Ho:
| Ho | masked_id | Group | Hi=0 | Hi=1 |
|---|---|---|---|---|
| 0 | 0 | below (< 2) | valid | valid |
| 1 | 1 | below (< 2) | valid | valid |
| 2 | 2 | boundary (= 2) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 3 | 3 | above (> 2) | invalid | invalid |
Ho=0,1: both time steps valid (index 0-3, all < 5). Ho=2: only first time step (index 4 < 5), second invalid (index 5 >= 5). Ho=3: fully invalid (index 6, 7 >= 5).
Example: Transposed mode, H=5 split into Ho=4 (slice, inner) × Hi=2 (time, outer)
H=5 is split into Ho × Hi = 4 × 2 (padded from 5 to 8), but transposed: Ho is the inner factor, Hi is the outer factor.
Axis index = Hi * 4 + Ho. Valid when index < 5.
Gate dim 0 config: match=1 (= 5 mod 4), V_g0=1 (= floor(5/4)), transposed mode.
| Ho | masked_id | Group | Hi=0 | Hi=1 |
|---|---|---|---|---|
| 0 | 0 | below (< 1) | valid | valid |
| 1 | 1 | boundary (= 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 2 | 2 | above (> 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 3 | 3 | above (> 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
Verify: Ho=0, Hi=0: 0 < 5, Hi=1: 4 < 5, so 2 steps. Ho=1, Hi=0: 1 < 5, Hi=1: 5 >= 5, so 1 step. Ho=2, Hi=0: 2 < 5, Hi=1: 6 >= 5, so 1 step. Ho=3, Hi=0: 3 < 5, Hi=1: 7 >= 5, so 1 step.
Key difference from standard: the “above” group (Ho=2,3) still gets V_g0=1 valid time steps, not zero.
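The two modes differ only in the “above” branch. The following plain-Rust sketch of the gate decision reproduces both H=5 tables above (standalone illustration, not Virtual ISA code; masked_id is shown already shifted down to Ho for readability):
// gate_d(s, t): binary validity from slice classification and the per-dim index.
fn gate(masked_id: u32, match_gd: u32, idx: u32, v_gd: u32, transposed: bool) -> bool {
    if masked_id < match_gd {
        true               // "below": all time steps valid
    } else if masked_id == match_gd {
        idx < v_gd         // "boundary": valid while idx < V
    } else if transposed {
        idx < v_gd         // "above", transposed: same check as the boundary
    } else {
        false              // "above", standard: entirely invalid
    }
}

fn main() {
    // Standard: H=5 = Ho(4, slice) x Hi(2, time); match=2, V=1.
    for ho in 0..4u32 {
        let row: Vec<bool> = (0..2).map(|hi| gate(ho, 2, hi, 1, false)).collect();
        println!("standard   Ho={ho}: {row:?}");
    }
    // Transposed: H=5 = Ho(4, slice, inner) x Hi(2, time, outer); match=1, V=1.
    for ho in 0..4u32 {
        let row: Vec<bool> = (0..2).map(|hi| gate(ho, 1, hi, 1, true)).collect();
        println!("transposed Ho={ho}: {row:?}");
    }
}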
Example: Full VCG computation for [H=5, C=5, W=19], step-by-step build-up
This example builds up from one axis to three, so each dimension’s contribution is clear.
Original shape [H, C, W] = [5, 5, 19].
Each axis is split into a slice part (slice_id) and a time part (sequencer):
H = 5 -> Ho(slice) * Hi(time) = 4 * 2 (padded from 5 to 8)
C = 5 -> Co(slice) * Ci(time) = 4 * 2 (padded from 5 to 8)
W = 19 -> Wi(packet) = 3 * 8 (padded from 19 to 24)
Step 1: W=19 only (packet dim, no gates)
Ignore H and C for now. Disable gate dims 0 and 1. Every slice processes 3 flits (Wi limit=3), and packet dim produces the sawtooth:
packet_vc: 8, 8, 3
^ ^
full 19 - 16 = 3 (partial)
Since there are no gates, every slice gets this exact same pattern:
All slices, all flits:
flit 0: ████████ (vc=8)
flit 1: ████████ (vc=8)
flit 2: ███ (vc=3)
Step 2: Add C=5 (packet dim + gate dim 0)
Now enable the C-axis gate (gate dim 0).
C=5 is split into Co(slice, 4 values) * Ci(time, limit 2).
The C-gate uses: mask=0b0011 (extracts Co from slice_id), match=2, V_g0=1, standard mode.
Each slice now runs 6 flits: Ci (limit 2) * Wi (limit 3). The C-gate classifies slices by their Co value:
| Co | Group | Effect |
|---|---|---|
| 0 | below (< 2) | gate open: all 6 flits get packet dim’s pattern |
| 1 | below (< 2) | gate open: same |
| 2 | boundary (= 2) | gate open for Ci=0, closed for Ci=1 |
| 3 | above (> 2) | gate closed: all 6 flits get vc=0 |
Result per slice (6 flits = 2 Ci groups * 3 Wi flits):
Co=0: [8,8,3, 8,8,3] <- both Ci steps valid
Co=1: [8,8,3, 8,8,3] <- same
Co=2: [8,8,3, 0,0,0] <- Ci=0 valid, Ci=1 gated off
Co=3: [0,0,0, 0,0,0] <- entirely gated off
Notice the gate’s effect: some slices go entirely to zero, and the boundary slice loses its second half.
But within the valid flits, the [8,8,3] pattern from packet dim is unchanged.
Step 3: Add H=5 (full 3-axis, packet dim + gate dim 0 + gate dim 1)
Now enable the H-axis gate (gate dim 1).
H=5 is split into Ho(slice, 4 values) * Hi(time, limit 2).
The H-gate uses: mask=0b1100 (extracts Ho from slice_id), match=0b1000, V_g1=1, standard mode.
Slice ID encodes both slice factors: slice_id = Ho * 4 + Co, giving 16 slices.
Each slice now runs 12 flits: Hi (limit 2) * Ci (limit 2) * Wi (limit 3).
| Dim | Axis | What it tracks | VCG config |
|---|---|---|---|
| packet | W=19 | element count in packet | V_p=19, stride_p=8, counter stride=8, limit=3 |
| gate 0 | C=5 | gate: is Co within valid range? | mask=0b0011, match=2, V_g0=1, standard |
| gate 1 | H=5 | gate: is Ho within valid range? | mask=0b1100, match=0b1000, V_g1=1, standard |
The H-gate classifies slices by Ho, same logic as C-gate by Co:
| Ho | Group | Effect |
|---|---|---|
| 0 | below | H-gate open |
| 1 | below | H-gate open |
| 2 | boundary | H-gate open for Hi=0, closed for Hi=1 |
| 3 | above | H-gate closed |
The final vc for each flit is packet_vc(t) * C_gate(s,t) * H_gate(s,t).
Both gates must be open for packet dim’s count to survive.
The complete heatmap (16 slices * 12 flits). Columns are slices grouped by Ho; rows are flits grouped by (Hi, Ci). Right-side annotations show which gates are active for each row:
Ho=0 |Ho=1 |Ho=2 |Ho=3
Co: 0 1 2 3 | 0 1 2 3| 0 1 2 3| 0 1 2 3
H-gate: v v v v | v v v v| > > > >| x x x x
C-gate: v v > x | v v > x| v v > x| v v > x
--------------------------------------------------------------------------------
t= 0 Hi=0,Ci=0 W 8 8 8 0 | 8 8 8 0| 8 8 8 0| 0 0 0 0 H:v C:v
t= 1 | 8 8 8 0 | 8 8 8 0| 8 8 8 0| 0 0 0 0
t= 2 | 3 3 3 0 | 3 3 3 0| 3 3 3 0| 0 0 0 0
| | |
t= 3 Hi=0,Ci=1 W 8 8 0 0 | 8 8 0 0| 8 8 0 0| 0 0 0 0 H:v C:>
t= 4 | 8 8 0 0 | 8 8 0 0| 8 8 0 0| 0 0 0 0
t= 5 | 3 3 0 0 | 3 3 0 0| 3 3 0 0| 0 0 0 0
| | |
t= 6 Hi=1,Ci=0 W 8 8 8 0 | 8 8 8 0| 0 0 0 0| 0 0 0 0 H:> C:v
t= 7 | 8 8 8 0 | 8 8 8 0| 0 0 0 0| 0 0 0 0
t= 8 | 3 3 3 0 | 3 3 3 0| 0 0 0 0| 0 0 0 0
| | |
t= 9 Hi=1,Ci=1 W 8 8 0 0 | 8 8 0 0| 0 0 0 0| 0 0 0 0 H:> C:>
t=10 | 8 8 0 0 | 8 8 0 0| 0 0 0 0| 0 0 0 0
t=11 | 3 3 0 0 | 3 3 0 0| 0 0 0 0| 0 0 0 0
v = open (below threshold) > = boundary (partial) x = closed (above)
Reading the patterns:
- Ho=3 columns (rightmost 4): all 0. H-gate x (above threshold, always closed).
- Co=3 columns (every 4th): all 0. C-gate x.
- Co=2 columns (H:v C:>): C-gate is boundary; only rows with Ci=0 pass. Compare Co=1 vs Co=2 to see the gate’s effect.
- Ho=2 columns (H:> C:v): H-gate is boundary; only rows with Hi=0 pass. Compare Ho=1 vs Ho=2.
- Ho=2 * Co=2 (both >): only (Hi=0, Ci=0) rows pass, the intersection of both boundaries.
- Within valid cells: the [8, 8, 3] sawtooth from packet dim always appears, the same regardless of slice.
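To tie the stages together, here is a standalone plain-Rust simulation of this configuration (not Virtual ISA code; the gate and counter settings are the ones in the table above). It recomputes the 16 × 12 heatmap and spot-checks a few cells against the diagram:
fn main() {
    // Counters (innermost first): Wi (limit 3, stride 8) -> packet dim,
    // Ci (limit 2, stride 1) -> gate dim 0, Hi (limit 2, stride 1) -> gate dim 1.
    // slice_id = Ho * 4 + Co (16 slices).
    // Gate 0 (C): mask=0b0011, match=2, V=1, standard.
    // Gate 1 (H): mask=0b1100, match=0b1000 (i.e. Ho=2), V=1, standard.
    let vc = |s: u32, t: u32| -> u32 {
        // Mixed-radix decomposition of t into (wi, ci, hi), wi fastest.
        let (wi, ci, hi) = (t % 3, (t / 3) % 2, t / 6);
        // Packet dim: packet_vc = min(stride_p=8, max(0, V_p=19 - idx_p)).
        let packet_vc = 8u32.min(19u32.saturating_sub(wi * 8));
        // Gate dims (standard): below -> open, boundary -> idx < V, above -> closed.
        let c_gate = (s & 0b0011) < 2 || ((s & 0b0011) == 2 && ci < 1);
        let h_gate = (s & 0b1100) < 0b1000 || ((s & 0b1100) == 0b1000 && hi < 1);
        if c_gate && h_gate { packet_vc } else { 0 }
    };

    // Print the 16-slice x 12-flit heatmap (columns are slices, Co fastest).
    for t in 0..12u32 {
        let row: Vec<u32> = (0..16u32).map(|s| vc(s, t)).collect();
        println!("t={t:2}: {row:?}");
    }

    // Spot checks against the diagram above.
    assert_eq!(vc(0, 2), 3);   // Ho=0, Co=0, t=2: packet-dim sawtooth tail
    assert_eq!(vc(2, 0), 8);   // Co=2 boundary, Ci=0 row: C-gate still open
    assert_eq!(vc(2, 3), 0);   // Co=2 boundary, Ci=1 row: C-gate closed
    assert_eq!(vc(8, 6), 0);   // Ho=2 boundary, Hi=1 row: H-gate closed
    assert_eq!(vc(15, 0), 0);  // Ho=3, Co=3: both gates closed
}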
What Patterns Are Expressible
The valid count formula is a product of four independent terms.
This multiplicative structure determines which vc(s, t) functions the hardware can produce, and which it cannot.
Why Limitations Arise
Each limitation traces back to a specific part of the formula.
Packet dim cannot see slice-id.
The packet dim formula packet_vc(t) = min(stride_p, max(0, V_p − idx_p(t))) depends only on t.
If two slices need different partial counts at the same time step, packet dim cannot produce both:
Suppose we need: vc(s=0, t=0) = 8, vc(s=1, t=0) = 3
───────────────
packet dim would need to output
both 8 and 3 at t=0, impossible
Each gate classifies slices by a single threshold after masking.
Gate dims first apply a bitmask to the slice-id (masked_id = slice_id & mask), then compare against one value match.
This produces three contiguous groups: below, boundary, above.
A gate cannot express “slices 0, 3, 7 are valid but 1, 2 are not”; it can represent only contiguous ranges of masked_id.
The mask selects which bits of the slice-id to inspect, allowing one gate to track a specific axis even when the slice-id encodes multiple axes.
At most 4 independent checks. One packet count (packet dim) + three binary gates (gate dims) = 4 orthogonal dimensions total.
Single-Axis Scenarios
A padded axis (original size n, padded to n' > n) occupies some combination of Packet (packet dim), Time (sequencer), and Slice (gate dims via slice-id bits).
Single-position: Packet / Time / Slice
When an axis occupies only one position, validity tracking is straightforward:
- Packet only → packet dim handles the sawtooth (see Packet Dim examples).
- Time only → gate dim with mask=0, match=0: all slices are boundary, binary validity by time step.
- Slice only → gate dim with appropriate mask/match: all time steps within a valid slice pass, invalid slices are fully gated.
All three are always supported.
Slice + Time
One axis split between slice-id bits and sequencer counters. This is the VCG’s most important use case, and it is how gate dims are typically used.
Standard (slice outer, time inner): axis index = Ho × time_count + Hi.
Example: H=14, Ho=8 (slice) × Hi=3 (time), standard mode
Gate dim config: match = ⌊14/3⌋ = 4, V = 14 mod 3 = 2.
Ho masked_id group Hi=0 Hi=1 Hi=2
── ───────── ───────── ────────── ────────── ──────────
0   0          below      always ✅    always ✅    always ✅
1 1 below always ✅ always ✅ always ✅
2 2 below always ✅ always ✅ always ✅
3 3 below always ✅ always ✅ always ✅
4 4 boundary idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
5 5 above ❌ ❌ ❌
6 6 above ❌ ❌ ❌
7 7 above ❌ ❌ ❌
Why “below” is genuinely all-valid: Ho × 3 + Hi < 4 × 3 = 12 ≤ 14 for all Hi ∈ [0,3).
Standard mode tolerates over-allocated time_count; the “below” interpretation remains correct.
Transposed (time outer, slice inner): axis index = Hi × slice_count + Ho.
Example: H=19, Ho=8 (slice, inner) × Hi=3 (time, outer), transposed mode
Gate dim config: match = 19 mod 8 = 3, V = ⌊19/8⌋ = 2, transposed mode.
Ho masked_id group Hi=0 Hi=1 Hi=2
── ───────── ───────── ────────── ────────── ──────────
0 0 below always ✅ always ✅ always ✅
1 1 below always ✅ always ✅ always ✅
2 2 below always ✅ always ✅ always ✅
3 3 boundary idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
4 4 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
5 5 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
6 6 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
7 7 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
Verify against real data (axis index = Hi × 8 + Ho, valid when < 19):
- Ho=0, Hi=2: 2×8 + 0 = 16 < 19 ✅; “below” gives all-valid = 3 steps, need V+1 = 3 steps ✅
- Ho=3, Hi=2: 2×8 + 3 = 19 ≥ 19 ❌; boundary gives 2 steps ✅
- Ho=7, Hi=1: 1×8 + 7 = 15 < 19 ✅; “above” gives V=2 steps, actual need is 2 steps ✅
Constraint: time_count must equal ⌈n / slice_count⌉ (= V + 1).
The “below” group gets time_count valid steps from the HW “all-valid” interpretation.
If time_count > V + 1, the “below” group receives more valid steps than the data actually has.
Packet + Time
Both packet and time factors assigned to packet dim, with multiple counters contributing to idx_p(t) (see Original Dimensions).
Example: n=50, two counters on packet dim, contiguous (`stride_outer = 8 × 3 = 24`)
c_inner (limit=3, stride=8): packet counter
c_outer (limit=3, stride=24): time counter (24 = 8 × 3 ✅ contiguous)
idx_p = c_outer × 24 + c_inner × 8
c_inner=0 c_inner=1 c_inner=2
───────── ───────── ─────────
c_outer=0 idx_p=0→8 idx_p=8→8 idx_p=16→8
c_outer=1 idx_p=24→8 idx_p=32→8 idx_p=40→8
c_outer=2 idx_p=48→2 idx_p=56→0 idx_p=64→0
↑
min(8, 50-48)=2
Packet dim handles both the within-flit and across-flit boundaries.
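A quick plain-Rust check of the table above (two counters contributing to idx_p; a standalone sketch, not Virtual ISA code):
fn main() {
    // Two counters assigned to packet dim: c_inner (limit 3, stride 8),
    // c_outer (limit 3, stride 24 = 8 * 3, contiguous). V_p = 50, stride_p = 8.
    for c_outer in 0..3u32 {
        let row: Vec<u32> = (0..3u32)
            .map(|c_inner| {
                let idx_p = c_outer * 24 + c_inner * 8;
                8u32.min(50u32.saturating_sub(idx_p)) // packet_vc
            })
            .collect();
        println!("c_outer={c_outer}: {row:?}"); // [8,8,8], [8,8,8], [2,0,0]
    }
}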
Slice + Packet: not supported
One axis split between slice (gate dim) and packet (packet dim). This directly violates the slice-independent packet count constraint (see Packet Dim: Key property).
packet_vc(t) depends only on t, but the boundary slice needs a different partial count than all-valid slices.
The gate can multiply by 0 or 1, so it can fully close a flit but cannot change the partial count.
Example: n=10, stride_p=8, slice_count=2
What we need:
Ho=0: elements 0-7, all valid → vc = 8
Ho=1: elements 8-15, first 2 valid → vc = 2
─
partial count, different from 8
Attempt 1: set `V_p = 10`:
packet_vc = min(8, 10-0) = 8 for ALL slices
Ho=0: vc = 8 × 1 = 8 ✅
Ho=1: vc = 8 × 1 = 8 ❌ (need 2, not 8)
vc = 8 × 0 = 0 ❌ (gate can close to 0, not to 2)
Attempt 2: set `V_p = 2`:
packet_vc = min(8, 2-0) = 2 for ALL slices
Ho=0: vc = 2 × 1 = 2 ❌ (need 8, not 2)
No single V_p works. Packet dim produces one value; the gate can only multiply by 0 or 1.
Note on degenerate cases:
When n % stride_p = 0, every packet is either fully valid or fully invalid, so packet dim produces no partial counts and a gate alone handles validity.
This is effectively Slice only, not a true Slice + Packet scenario.
Similarly, n <= stride_p means a single flit covers the entire axis, reducing to Packet only.
When the VCG cannot express the required pattern, Padding Strategy alternatives are available.
Slice + Time + Packet: not supported
The axis spans all three positions.
The Slice + Packet conflict carries over: the boundary slice still needs a different partial count than all-valid slices, and packet_vc(t) still cannot vary by slice.
The same degenerate exception applies: n % stride_p = 0 eliminates partial counts, reducing to Slice + Time (packet dim unused).
Multiple Axes
Each padded axis that needs validity tracking consumes one original dimension slot:
| Resource | Capacity | Notes |
|---|---|---|
| Packet Dim (packet count) | 1 slot | Innermost counter determines stride_p |
| Gate Dims (binary gates) | 3 slots | One gate per padded axis |
| Unpadded axes | free | No dim needed (mask=0, match=1) |
When the packet axis is fully aligned (n % stride_p = 0), packet_vc is constant and packet dim is effectively unused, so it can be repurposed as a gate for another axis.
Summary
A valid count function vc(s, t) is VCG-expressible only if:
- Prefix property: Valid elements form a contiguous prefix [0, vc) within each flit.
- Slice-independent packet count: packet_vc(t) must be the same across all slices at the same t. Slices can be gated to vc = 0, but cannot receive a different partial count.
- Monotonic slice ordering: Each gate dim classifies slices by a single threshold on masked_id.
- At most 4 orthogonal dimensions: 1 packet count + 3 binary gates.
| Placement | Dim | Supported? | Key constraint |
|---|---|---|---|
| Packet only | packet | ✅ | none |
| Time only | gate | ✅ | none |
| Slice only | gate | ✅ | none |
| Slice + Time (standard) | gate | ✅ | none |
| Slice + Time (transposed) | gate | ✅ | time_count = ⌈n / slice_count⌉ |
| Packet + Time | packet | ✅ | none |
| Slice + Packet | packet + gate | ❌ | packet_vc(t) cannot vary by slice |
| Slice + Time + Packet | packet + gate | ❌ | same as Slice + Packet |
For mapping-level code examples of each placement, see Examples. For unsupported cases, see Padding Strategy.
Downstream: 4-Way Operations
VCG assigns valid counts per 8-way flit, but the VectorArithmeticUnit can operate on 4-way halves.
| Operation | Input | Output | Valid Count Transformation |
|---|---|---|---|
| split_way4 | 8-way flit (vc = v) | two 4-way flits | vc_low = min(v, 4), vc_high = max(v - 4, 0) |
| trim_way4 | 8-way flit (vc = v) | one 4-way flit | vc = v (requires v <= 4) |
| concat_way8 | two 4-way flits | 8-way flit | vc = vc_low + vc_high |
| pad_way8 | 4-way flit | 8-way flit | vc unchanged |
The prefix property is preserved through split and concat.
For trim_way4, the constraint v <= 4 must be statically guaranteed by the mapping; if the upper 4 elements could be valid, trimming them would lose data.
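The valid-count bookkeeping for these operations is plain arithmetic; the sketch below models only the counts (the function names mirror the table, but this is a standalone plain-Rust illustration, not the Virtual ISA API):
// Valid-count transformations for 4-way operations (the prefix property is preserved).
fn split_way4(v: u32) -> (u32, u32) {
    (v.min(4), v.saturating_sub(4)) // (vc_low, vc_high)
}

fn trim_way4(v: u32) -> u32 {
    assert!(v <= 4, "trim_way4 requires vc <= 4, guaranteed statically by the mapping");
    v
}

fn concat_way8(vc_low: u32, vc_high: u32) -> u32 {
    vc_low + vc_high
}

fn main() {
    assert_eq!(split_way4(6), (4, 2));
    assert_eq!(trim_way4(3), 3);
    assert_eq!(concat_way8(4, 2), 6);
    println!("ok");
}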
Inter-Slice Block
The Inter-Slice Block performs inter-slice reduction, aggregating partial results across the 256 slices within a cluster.
It preserves Chip, Cluster, and Packet, and rewrites Slice and Time to SliceOut and TimeOut.
Interface
i32 Interface
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpI32,
) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
f32 Interface
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpF32,
) -> VectorInterSliceReduceTensor<'l, T, f32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
You can reach this block in two ways:
- Run inter-slice first: vector_init() -> vector_inter_slice_reduce::<SliceOut, TimeOut>(op)
- Run intra-slice first, then switch: call vector_inter_slice_reduce() directly on the current intra-slice tensor instead of calling vector_init() again.
In the IntraFirst path, vector_inter_slice_reduce() is available only from Way8 intra-slice stages that can transition to inter-slice reduction: Branch, Logic, Fxp, FxpToFp, Widen, FpToFxp, and Clip.
It is not available from Way4 stages such as Narrow, Fp, IntraSliceReduce, or FpDiv.
Quick Reference
| Current state | Method | Result |
|---|---|---|
| Fresh VE input after vector_init() | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Enters inter-slice reduction directly (InterFirst) |
| Eligible intra-slice tensor | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Transitions from intra-slice to inter-slice reduction (IntraFirst) |
| Tensor after vector_inter_slice_reduce() | vector_intra_slice_branch(BranchMode) | Switches to intra-slice work after inter-slice reduction |
Operations
Integer Operations (InterSliceReduceOpI32)
| Operation | Description |
|---|---|
| Add | Wrapping addition |
| AddSat | Saturating addition |
| Max | Maximum value |
| Min | Minimum value |
Floating-Point Operations (InterSliceReduceOpF32)
| Operation | Description |
|---|---|
| Add | Floating-point addition |
| Max | Maximum value |
| Min | Minimum value |
| Mul | Floating-point multiplication |
Output Mapping Rule
After inter-slice reduction removes a slice factor R, the output mapping typically follows one of three rules:
| Rule | Output mapping | Reference |
|---|---|---|
| Broadcast | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, X], TimeOut = m![C] | Broadcast Into a New Slice Axis |
| Dummy | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, 1 # n], TimeOut = m![C] | Dummy Replacement |
| Promotion | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, C], TimeOut = m![1] | Promotion from Time into SliceOut |
Chip, Cluster, and Packet pass through unchanged.
Only Slice and Time are rewritten into SliceOut and TimeOut.
Examples
Dummy Replacement
Replace the reduced slice factor with a dummy factor in SliceOut:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn inter_slice_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
R occupies part of the Slice dimension. After reduction, R is eliminated; its 4 slots in Slice are replaced by the dummy factor 1 # 4, while A / 8 passes through unchanged.
Broadcast Into a New Slice Axis
Introduce a new axis in SliceOut, and broadcast the reduced value over that axis:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![W = 64, R = 4, X = 4, P = 8];
fn broadcast_into_x<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, X], m![1], m![P]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![W, X], m![1]>(InterSliceReduceOpF32::Add)
.vector_final()
}
}
Here, R is reduced away. X is a new axis that appears only in SliceOut,
so the reduced value is broadcast over the X positions in the output.
Promotion from Time into SliceOut
If Time already contains an axis that should occupy the freed slice space, promote that axis into SliceOut.
The promoted axis does not have to be the outermost axis in Time:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![W = 32, R = 4, T0 = 2, T2 = 4, T1 = 2, P = 8];
fn axis_promotion<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![T0, T2, T1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, T2], m![T0, T1], m![P]> {
// Before: Slice = m![W, R], Time = m![T0, T2, T1], Packet = m![P]
// After: Slice = m![W, T2], Time = m![T0, T1], Packet = m![P]
// R is reduced away, and T2 is promoted from the middle of Time into Slice.
input
.vector_init()
.vector_inter_slice_reduce::<m![W, T2], m![T0, T1]>(InterSliceReduceOpF32::Add)
.vector_final()
}
}
Inter-Slice Reduce with AddSat, Then Intra-Slice
Reducing an i32 tensor across slices, then applying an elementwise add:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn reduce_then_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_final()
}
}
Intra-Slice Then Inter-Slice Reduce with AddSat
Applying an intra-slice operation first, then reducing the resulting i32 tensor across slices:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn add_then_reduce<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
Constraints
| Constraint | Detail |
|---|---|
| Data types | i32 and f32 only |
| Scope | Reduction happens within one 256-slice cluster |
| Packet mapping | Packet does not change across inter-slice reduction |
Performance
Inter-slice reduce is best understood as a ring-like global reduction across the participating slices. For documentation purposes, the most useful high-level estimate is:
| Quantity | Rough rule of thumb |
|---|---|
| First reduced output | on the order of one ring traversal for the reduction group |
| Total time | input streaming time + that ring-sized tail |
| Main tuning knob | reduction ratio, that is, how many slices participate in one inter-slice contraction group |
If you want a quick mental model, let r be the reduction ratio or route-group size:
- first output appears after roughly O(r) cycles
- larger r means more noticeable inter-slice tail latency
- if upstream already produces flits slowly, that upstream rate dominates and the inter-slice cost is partly hidden
This is intentionally a high-level approximation. The practical mental model is simple: stream partial results in, then pay about one ring traversal before the reduced result settles.
Interaction With Other Pipelines
- Contraction -> Inter-Slice: if contraction takes longer to produce partial sums, contraction can dominate and inter-slice may not be the bottleneck.
- Intra-Slice -> Inter-Slice: intra-slice work can reduce the number of packets that reach inter-slice, or simply take longer itself. In those cases, inter-slice is less visible because there is less data to reduce, or because the front half already dominates.
- Large ring / large reduction ratio: when many slices participate, inter-slice tail latency grows and can become the bottleneck.
- Small tensors: even when total data volume is small, the fixed ring-style tail can still matter because it is amortized over fewer packets.
For an end-to-end contraction example that includes inter-slice reduction, see Reducer.
Cast Engine
Storing full f32/i32 results in DM would waste memory; the Cast Engine narrows them back to application-specified types (e.g., bf16) before the Commit Engine writes to DM.
Interface
impl<'l, const T: Tu, D: VeScalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M> StreamCast<D>
for CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
type CastOutput<D2: Scalar, OutPacket: M>
= CastTensor<'l, T, D2, Chip, Cluster, Slice, Time, OutPacket>
where
D: Cast<D2>;
#[primitive(CollectTensor::cast)]
fn cast<D2: Scalar, OutPacket: M>(self) -> Self::CastOutput<D2, OutPacket>
where
D: Cast<D2>,
{
cast_stream(self.ctx, self.inner)
}
}
Precision Lowering
Precision lowering downcasts f32 or i32 data into specific lower-precision formats:
| Input Type (D1) | Supported Output Types (D2) |
|---|---|
| i32 | i4, i8, i16 |
| f32 | f8e5m2, f8e4m3, f16, bf16 |
Packet Transformation
The input packet must be exactly 32 bytes (one flit). The Collect Engine ensures this before data reaches the Cast Engine.
After casting each element to the output type, the result is padded back to 32 bytes. Time passes through unchanged.
Input: Time = [T], Packet = [P # (32 / sizeof(D1))], dtype = D1
Output: Time = [T], Packet = [P # (32 / sizeof(D2))], dtype = D2
Examples
Single-flit packet
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![B = 4, A = 8];
fn cast_i32_to_i8<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1], m![1], m![B], m![A]>,
) -> CastTensor<'l, T, i8, m![1], m![1], m![1], m![B], m![A # 32]> {
input.cast()
}
}
Before the cast, each flit is fully utilized: A = 8 elements x 4 bytes (i32) = 32 bytes.
After the cast, each element shrinks to 1 byte (i8), so A = 8 elements occupy only 8 bytes.
The A # 32 padding fills the remaining 24 bytes to maintain the 32-byte flit alignment.
Time stays m![B] because it passes through unchanged.
Padded input packet
When the input data doesn’t fill the full flit, it arrives already padded from the Collect Engine.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 4];
fn cast_padded<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1], m![1], m![1], m![A # 8]>,
) -> CastTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A # 32]> {
input.cast()
}
}
Input packet A # 8 = 4 data elements padded to 8 elements at i32 = 32 bytes (one flit).
After cast to i8, 4 data elements occupy 4 bytes, padded to 32: m![A # 32].
This under-utilization may look wasteful, but the Cast Engine is a pass-through stage and never the pipeline bottleneck; the downstream Commit Engine aggregates multiple under-utilized flits into dense DM writes, so no bandwidth is wasted at the DM level.
Transpose Engine
When computation results are in a different memory layout than DM requires, the Transpose Engine reorders the data within flits before the Commit Engine writes to DM.
The Transpose Engine reorders data within a 2D matrix by swapping rows and columns.
It interprets input data as a [in_rows, in_cols] matrix, transposes it, and optionally slices padded elements to produce the desired output shape.
Interface
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
impl<'l, const T: Tu, D, Chip, Cluster, Slice, Time, Packet>
CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Transposes axes between the Time and Packet mappings.
/// Swaps the innermost Time axes with the Packet axis, converting [A, B] layout to [B, A].
pub fn transpose<OutTime: M, OutPacket: M>(
self,
) -> TransposeTensor<'l, T, D, Chip, Cluster, Slice, OutTime, OutPacket>
{
// Hardware implementation: swaps rows and columns within [Time, Packet]
}
}
The Transpose Engine operates on the Time and Packet dimensions only. The Chip, Cluster, and Slice dimensions pass through unchanged.
Architecture
Conceptual Operation
The Transpose Engine performs four stages:
- Unpack: Each input packet is 32 bytes, but the transpose buffer only uses the first `elements_per_packet` elements (see Internal Buffer Architecture). This stage discards extraneous padding from each packet, keeping only `elements_per_packet` elements. There are `in_rows` time steps (each delivering `packets_per_col` packets), which assemble the `[in_rows × in_cols]` input matrix, where `in_cols = packets_per_col × elements_per_packet`.
- Transpose: The matrix is transposed: `[in_rows × in_cols]` → `[in_cols × in_rows]`.
- Trim: After transposing, padded elements within each input packet constitute entire rows. This stage allows the removal of those padded rows, producing `[out_rows × in_rows]`, where `out_rows <= in_cols`.
- Align: Each output row is `in_rows` elements wide. This stage pads each row to 32 bytes (`output_alignment` elements), producing the final output packets of shape `[out_rows × (in_rows # output_alignment)]`.
in_cols output_alignment
┌─────────────────┐ ┌──────────────────┐
│ 12 13 14 15 ... │ │ 3 7 11 15 ... │
in_rows │ 8 9 10 11 ... │ ────► │ 2 6 10 14 ... │ out_rows
│ 4 5 6 7 ... │ │ 1 5 9 13 ... │
│ 0 1 2 3 ... │ │ 0 4 8 12 ... │
└─────────────────┘ └──────────────────┘
data_in data_out
Specifications
Internal Buffer Architecture
The Transpose Engine has two internal buffers, each with num_buffer_cols = 16 columns. The input interface receives a fixed number of elements per cycle based on the data type:
| Data Type | elements_per_packet |
|---|---|
| 4-bit | 16 |
| 8/16/32-bit | 8 |
Input Bus Constraints
The input bus to the Transpose Engine is 32 bytes, but its usable capacity depends on the data type:
| Type | Input Format |
|---|---|
| 4-bit | 4b × 16 |
| 8-bit | 8b × 8 |
| 16-bit | 16b × 8 |
| 32-bit | 32b × 8 |
The Transpose Engine receives data from three possible sources:
- Contraction Engine: Outputs 32b × 8
- Vector Engine: Outputs 4b × 16, 8b × 8, 16b × 8, or 32b × 8
- Fetch Engine: Outputs 4b × 16, 8b × 8, 16b × 8, or 32b × 8
Constraints
The following parameters are dependent on the data type:
| Data type | elements_per_packet | output_alignment | Max in_rows | Valid in_cols |
|---|---|---|---|---|
| 4-bit | 16 | 64 | 16 | 16, 32 |
| 8-bit | 8 | 32 | 8 | 8, 16, 32 |
| 16-bit | 8 | 16 | 4 | 8, 16, 32 |
| 32-bit | 8 | 8 | 2 | 8, 16, 32 |
The following are type-agnostic:
- Both the input and output packets must be 32 bytes.
- `out_rows` <= `in_cols` (determines the number of sliced rows in the Trim stage)
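For quick reference, the type-dependent parameters from the table above can be captured in a small lookup; this helper is illustrative only and not part of the SDK.

// Illustrative lookup of the type-dependent Transpose Engine parameters (not an SDK API).
// Returns (elements_per_packet, output_alignment, max_in_rows) per the constraints table.
fn transpose_params(elem_bits: u32) -> Option<(u32, u32, u32)> {
    match elem_bits {
        4 => Some((16, 64, 16)),
        8 => Some((8, 32, 8)),
        16 => Some((8, 16, 4)),
        32 => Some((8, 8, 2)),
        _ => None,
    }
}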
Performance
Double Buffering
The buffering mode is determined by comparing in_cols with num_buffer_cols.
Double buffering occurs when in_cols <= num_buffer_cols.
Otherwise, single buffering is used.
| in_cols | Condition | Buffering Mode |
|---|---|---|
| 8 | 8 ≤ 16 | Double buffering |
| 16 | 16 ≤ 16 | Double buffering |
| 32 | 32 > 16 | Single buffering |
- Double buffering: One buffer receives input while the other produces output simultaneously
- Single buffering: Both buffers are used together, so input and output must alternate
Cycle Calculation
Variable definitions:
$$ \texttt{input_flits_per_iter} = \texttt{in_rows} \times \frac{\texttt{in_cols}}{\texttt{elements_per_packet}} $$ $$ = \texttt{in_rows} \times \texttt{packets_per_col} $$ $$ \texttt{output_flits_per_iter} = \texttt{out_rows} $$ $$ \texttt{n} = \frac{\texttt{OutTime::SIZE}}{\texttt{out_rows}} $$
Cycles per iteration:
- Double buffering: `max(input_flits_per_iter, output_flits_per_iter)`. Input and output happen simultaneously, so the slower one determines the cycle count.
- Single buffering: `input_flits_per_iter + output_flits_per_iter`. Input and output alternate, so both are added.
Total cycles in a burst:
- Double buffering (pipelined execution): $$ \texttt{input_flits_per_iter} + (n - 1) \times \texttt{cycles_per_iter} + \texttt{output_flits_per_iter} $$
  - `input_flits_per_iter`: initial input-only phase (filling the first buffer)
  - `(n - 1) * cycles_per_iter`: middle phase where input and output overlap
  - `output_flits_per_iter`: final output-only phase (draining the last buffer)
- Single buffering (sequential execution): $$ n \times \texttt{cycles_per_iter} $$
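Putting the two cases together, a small host-side estimator (illustrative only, not an SDK API) reproduces the cycle counts quoted in the examples below.

const NUM_BUFFER_COLS: u64 = 16;

// Illustrative Transpose Engine cycle estimate for one burst (not an SDK API).
// `in_cols = packets_per_col * elements_per_packet`; `out_time_size` is OutTime::SIZE.
fn transpose_cycles(
    in_rows: u64,
    in_cols: u64,
    elements_per_packet: u64,
    out_rows: u64,
    out_time_size: u64,
) -> u64 {
    let input_flits_per_iter = in_rows * (in_cols / elements_per_packet);
    let output_flits_per_iter = out_rows;
    let n = out_time_size / out_rows;
    if in_cols <= NUM_BUFFER_COLS {
        // Double buffering: input and output overlap across iterations.
        let cycles_per_iter = input_flits_per_iter.max(output_flits_per_iter);
        input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter
    } else {
        // Single buffering: input and output alternate within each iteration.
        n * (input_flits_per_iter + output_flits_per_iter)
    }
}

For the basic 8×8 example below, `transpose_cycles(8, 8, 8, 8, 64)` evaluates to 72, matching the per-example comments.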
Examples
Basic 8×8 Transpose
The simplest case transposes an 8×8 matrix across the Packet and Time dimensions:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, C = 8, D = 8, E = 8];
fn basic_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, i8, m![1], m![1], m![P], m![C, D], m![E # 32]>,
) -> TransposeTensor<'l, T, i8, m![1], m![1], m![P], m![C, E], m![D # 32]> {
// in_rows = 8 (D)
// packets_per_col = 1,
// elements_per_packet = 8 (i8),
// in_cols = packets_per_col * elements_per_packet = 8 (E)
// out_rows = 8 (E)
// output_alignment = 32 (i8)
// 1. Unpack: [in_rows x packets_per_col x packet]: [D, E # 32] →
// [in_rows x packets_per_col x elements_per_packet]: [D, E] =
// [in_rows x in_cols]
// 2. Transpose: [in_rows x in_cols]: [D, E] →
// [in_cols x in_rows]: [E, D]
// 3. Trim: [in_cols x in_rows]: [E, D] →
// [out_rows x in_rows]: [E, D] (no rows trimmed)
// 4. Align: [out_rows x in_rows]: [E, D] →
// [out_rows x (in_rows # output_alignment)]: [E, D # 32]
// cycle estimation: in_cols (8) ≤ num_buffer_cols (16), double buffering
// input_flits_per_iter = 8, output_flits_per_iter = 8, n = 8 (C), cycles_per_iter = 8
// cycles = input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter = 72
input.transpose()
}
}
Small Matrix Transpose
Transpose works with matrices smaller than the maximum size:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 64, A = 4, B = 2];
fn small_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, i8, m![1], m![1], m![P], m![A], m![B # 32]>,
) -> TransposeTensor<'l, T, i8, m![1], m![1], m![P], m![B], m![A # 32]> {
// in_rows = 4 (A),
// packets_per_col = 1,
// elements_per_packet = 8 (i8),
// in_cols = packets_per_col * elements_per_packet = 1 * 8 = 8
// (B=2 data elements, padded internally to 8)
// out_rows = 2 (B),
// output_alignment = 32 (i8)
// 1. Unpack: [in_rows x packets_per_col x packet]: [A, B # 32] →
// [in_rows x packets_per_col x elements_per_packet]: [A, B # 8] =
// [in_rows x in_cols]
// 2. Transpose: [in_rows x in_cols]: [A, B # 8] →
// [in_cols x in_rows]: [B # 8, A]
// 3. Trim: [in_cols x in_rows]: [B # 8, A] →
// [out_rows x in_rows]: [B, A] (6 rows trimmed)
// 4. Align: [out_rows x in_rows]: [B, A] →
// [out_rows x (in_rows # output_alignment)]: [B, A # 32]
// cycle estimation: in_cols (8) ≤ num_buffer_cols (16), double buffering
// input_flits_per_iter = 4, output_flits_per_iter = 2, n = 1, cycles_per_iter = 4
// cycles = input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter = 6
input.transpose()
}
}
Large Column Transpose (in_cols = 32, single buffering)
When in_cols exceeds num_buffer_cols, single buffering is used:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, B = 2, C = 8, D = 4, E = 8];
fn large_col_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, i8, m![1], m![1], m![P], m![B, C, D], m![E # 32]>,
) -> TransposeTensor<'l, T, i8, m![1], m![1], m![P], m![B, D, E], m![C # 32]> {
// in_rows = 8 (C),
// packets_per_col = 4 (D),
// elements_per_packet = 8 (i8),
// in_cols = packets_per_col * elements_per_packet = 32 (D * E),
// out_rows = 32 (D * E),
// output_alignment = 32 (i8)
// 1. Unpack: [in_rows x packets_per_col x packet]: [C, D, E # 32] →
// [in_rows x packets_per_col x elements_per_packet]: [C, D, E] =
// [in_rows x in_cols]
// 2. Transpose: [in_rows x in_cols]: [C, D, E] →
// [in_cols x in_rows]: [D, E, C]
// 3. Trim: [in_cols x in_rows]: [D, E, C] →
// [out_rows x in_rows]: [D, E, C] (no rows trimmed)
// 4. Align: [out_rows x in_rows]: [D, E, C] →
// [out_rows x (in_rows # output_alignment)]: [D, E, C # 32]
// cycle estimation: in_cols (32) > num_buffer_cols (16), single buffering
// input_flits_per_iter = 8 * 4 = 32, output_flits_per_iter = 32, n = 2 (B), cycles_per_iter = 32 + 32 = 64
// cycles = n * cycles_per_iter = 128
input.transpose()
}
}
16-bit Data Type (bf16)
For 16-bit types, the maximum in_rows is reduced to 4:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, C = 8, D = 4, E = 8];
fn bf16_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![P], m![C, D], m![E # 16]>,
) -> TransposeTensor<'l, T, bf16, m![1], m![1], m![P], m![C, E], m![D # 16]> {
// in_rows = 4 (D),
// packets_per_col = 1,
// elements_per_packet = 8 (bf16),
// in_cols = 8 (E),
// out_rows = 8 (E),
// output_alignment = 16 (bf16)
// 1. Unpack: [in_rows x packets_per_col x packet]: [D, E # 16] →
// [in_rows x in_cols]: [D, E]
// 2. Transpose: [in_rows x in_cols]: [D, E] →
// [in_cols x in_rows]: [E, D]
// 3. Trim: [in_cols x in_rows]: [E, D] →
// [out_rows x in_rows]: [E, D] (no rows trimmed)
// 4. Align: [out_rows x in_rows]: [E, D] →
// [out_rows x (in_rows # output_alignment)]: [E, D # 16]
// cycle estimation: in_cols (8) ≤ num_buffer_cols (16), double buffering
// input_flits_per_iter = 4, output_flits_per_iter = 8, n = 8 (C), cycles_per_iter = 8
// cycles = input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter = 68
input.transpose()
}
}
Scheduling
Scheduling determines how operations execute on hardware resources. This chapter explains how the Virtual ISA translates programs into executable schedules. Two programmer-visible inputs determine the schedule: the textual order of operations and explicit memory address assignments. The scheduler respects the written order as the authoritative sequence and does not reorder operations; it then analyzes resource dependencies to determine which operations can run in parallel.
Operation Order
Operation order is how the program communicates sequencing intent to the scheduler: the textual order of operations defines their execution order.
The following example shows this, where load_from_host() loads a tensor from host memory and .op() represents any pipeline operation:
let t0 = load_from_host(); // O0
let t1 = load_from_host(); // O1
let t2 = t0.op(); // O2
let t3 = t1.op(); // O3
let t4 = t2.op(); // O4
let t5 = t4.op(); // O5
The final execution order respects the written order: O0 → O1 → O2 → O3 → O4 → O5.
Memory Allocation
Each tensor requires a specific memory address for precise scheduling. Currently, tensor addresses must be specified explicitly by the programmer.
Hardware Resources
The hardware provides three allocatable resources that can execute in parallel:
| Resource | Description |
|---|---|
| Main context | Primary Tensor Unit execution context |
| Sub context | Secondary context for data prefetching |
| Direct Memory Access (DMA) Engine | Memory-to-memory data transfer |
Main context handles compute-intensive operations through the complete Tensor Unit pipeline — Fetch, Switching, Collect, Contraction, Vector, Cast, Transpose, and Commit — but can only execute one operation at a time.
Sub context runs data movement operations (SRAM-to-TRF transfers, SRAM-to-SRAM copies) concurrently with the main context, enabling double-buffering where the next operation’s data is prepared while the current one computes.
DMA Engine moves large tensors between HBM and SRAM independently of both Tensor Unit contexts, enabling overlapped data transfer and computation.
Two factors cause operations to serialize: resource conflicts and memory dependencies.
Resource conflicts occur when two operations require the same resource, forcing the later one to wait. For example, two matrix multiplications both requiring the main context must execute sequentially. However, a matrix multiplication (main context) can run in parallel with a DMA transfer (DMA engine) because they use different resources.
Memory dependencies arise from data hazards on shared addresses. Read-after-write (RAW) hazards require a read to see the result of a preceding write. Write-after-read (WAR) hazards prevent a write from overwriting data still being read. Write-after-write (WAW) hazards require writes to the same address to execute in order. The scheduler detects these hazards by analyzing the memory addresses specified in the program.
The scheduler manages these constraints automatically. It analyzes each operation’s resource usage and memory addresses to determine parallelism opportunities while respecting program order, inserting implicit waits where necessary. This dependency resolution frees programmers from manually inserting synchronization barriers, though memory addresses must still be specified explicitly (see Memory Allocation).
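As a small illustration using the same placeholder operations as the Operation Order example (`load_from_host()` and `.op()`), the comments below sketch how resource usage and data dependencies drive the schedule; the resource assignments in the comments are assumptions for illustration, not scheduler output.

// Sketch only: resource/address notes in comments are illustrative assumptions.
let t0 = load_from_host(); // O0: DMA engine writes t0
let t1 = load_from_host(); // O1: DMA engine writes t1 (waits for O0: same resource)
let t2 = t0.op();          // O2: main context reads t0 (RAW on t0, waits for O0; may overlap O1)
let t3 = t1.op();          // O3: main context (waits for O2: same resource; RAW on t1, waits for O1)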
Kernel Examples
The introductory tutorial briefly introduced temporal and spatial partitioning for large tensors in its Further Reading section. The preceding chapters explained how mapping expressions distribute work across TCP’s hardware hierarchy and how each component reduces partial results. This chapter shows how to combine mapping, movement, computation, and scheduling into complete, working kernels. The table below summarizes the available parallelism and reduction at each level:
| Dimension | Type | Defined in | Reduced in |
|---|---|---|---|
| Chip | Spatial | HBM, SRAM, Stream | DMA + Vector |
| Cluster | Spatial | SRAM, Stream | DMA + Vector |
| Slice | Spatial | SRAM, Stream | Vector |
| Row | Spatial | TRF | Contraction |
| Time | Temporal | Stream | Contraction |
| Packet | Spatial | Stream | Contraction |
For cross-chip and cross-cluster reduction patterns (the Chip and Cluster rows above), see Chip/Cluster Reduce, which demonstrates DMA broadcast followed by Vector Engine binary add.
The examples progress from single-engine patterns to composed multi-engine patterns to full model implementations:
- Tiling (coming soon): Tile size selection, memory layout, and accumulation strategies.
- Split Reduce: Interleaved fetch for reducing across multiple tensor instances. Use when a reduction dimension exceeds what a single tile can accumulate.
- Chip/Cluster Reduce: ReduceScatter and AllReduce across chips. Use when computation must be distributed across multiple chips or clusters.
- Fetch and Commit Engine: Axis permutation, full-flit commit, tail padding, and tensor segmentation. Use when data layout transformations are needed between memory and compute.
- GEMM with Double-Buffering (coming soon): DMA load from HBM, sub-context TRF prefetch, main-context tiled contraction, cast, and commit. A short end-to-end example bridging single-engine patterns and full model implementations.
- Transformer: Llama 3 70B implementation with prefill and decode phases. A full model combining tiling, multi-chip reduce, and memory management.
- Mixture of Experts: Branchless TopK routing and blockwise sparse computation. A full model demonstrating dynamic routing with sparse computation patterns.
Tiling
Warning
This page is a work in progress. Content will be added in a future release.
Tiling breaks large tensors into smaller tiles that fit in on-chip memory. When a tensor exceeds VRF capacity (8KB per slice) or DM capacity, it must be processed in multiple iterations.
When to Use Tiling
Tiling applies when:
- A tensor dimension exceeds what fits in a single hardware pass — compare the dimension size against the DM capacity table in Memory Performance.
- Memory bandwidth needs to be optimized by reusing loaded data — check whether the same data is fetched more than once across operations.
- Computation needs to be distributed across time rather than space — use when the spatial dimensions are already fully distributed but a loop over tiles is needed.
Basic Tiling Pattern
The basic pattern is: (1) choose a tile size that fits in VRF/DM, (2) loop over tiles in the outer dimensions, (3) fetch each tile from HBM to DM, (4) run the computation, and (5) accumulate partial results before writing back. The tile size must satisfy alignment constraints (32-byte flits) and leave room for double-buffering if overlapping fetch with compute.
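As a minimal sketch of the tile-size arithmetic only (no Virtual ISA calls; the constants follow the constraints listed on this page):

const VRF_BYTES_PER_SLICE: usize = 8 * 1024; // 8KB per slice
const FLIT_BYTES: usize = 32;                // 32-byte flit alignment

// Check whether a candidate tile fits the per-slice VRF budget (illustration only).
fn tile_fits(tile_elems: usize, elem_bytes: usize, double_buffered: bool) -> bool {
    let bytes = tile_elems * elem_bytes;
    // Leave room for double-buffering when fetch overlaps compute.
    let budget = if double_buffered { VRF_BYTES_PER_SLICE / 2 } else { VRF_BYTES_PER_SLICE };
    bytes % FLIT_BYTES == 0 && bytes <= budget
}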
Warning
Add a simple tiling example showing:
- Original tensor shape exceeding VRF
- Tile size calculation
- Loop structure for processing tiles
- Accumulation of partial results
// TODO: Example code
// axes![M = 8192, N = 8192, K = 2048];
//
// Tile sizes chosen to fit in VRF:
// type TileM = m![M / 32]; // 256 elements per tile
// type TileN = m![N / 32]; // 256 elements per tile
//
// Outer loop iterates over tiles
// Inner computation processes one tile
Example: Tiled Matrix Multiplication
Warning
Add complete GEMM example with tiling:
- Input matrices A[M, K] and B[K, N] where M, N, K exceed VRF capacity
- Tile along M and N dimensions
- Accumulate partial results across K tiles
Memory Layout
Warning
Describe how tiles are laid out in HBM and DM
Tile Size Selection
Warning
Explain constraints for choosing tile sizes:
- VRF capacity (8KB per slice)
- DM capacity
- Alignment requirements (32-byte flits)
- Trade-off between tile size and iteration count
Accumulation Strategy
Warning
Explain how partial results are accumulated:
- Accumulate in higher precision (f32) to avoid precision loss
- Store intermediate results in DM or HBM depending on size
- Final cast to output precision (bf16)
Example: Tiled Attention
Warning
Add attention example showing tiling for long sequences:
- Query, Key, Value tensors with long sequence length
- Tile along sequence dimension
- FlashAttention-style tiling for memory efficiency
Performance Considerations
Warning
Add performance analysis:
- Overhead of tile boundary handling
- Memory bandwidth utilization
- Optimal tile sizes for different tensor shapes
- Interaction with hardware prefetching
Split Reduce
Split reduce handles reductions when a logical reduction axis cannot be mapped to a single continuous hardware dimension: the axis is split into multiple separate tensor instances that must be fetched independently using interleaved fetch and combined using Vector Engine binary operations.
When to Use Split Reduce
Split reduce applies when:
- The reduce axis exceeds single-tensor capacity: A reduction axis is too large to fit in VRF (8KB per slice) as a single tensor, requiring the logical axis to be split into multiple physical tensor instances.
- Independent tensor instances exist: Multiple tensor instances hold different portions of the same logical reduction axis (e.g., from different model layers, experts, or temporal segments).
- Avoiding cross-chip communication: Data resides on the same chip/cluster but in separate memory allocations, making interleaved fetch more efficient than DMA-based approaches.
Split reduce sits between slice-level and chip-level reductions in TCP’s reduction hierarchy:
- Packet reduce: Within a single packet (Reducer)
- Time reduce: Across time dimension (Reducer)
- Slice reduce: Across slices within a cluster (Inter-Slice Block)
- Split reduce: Across multiple independent tensor instances, using interleaved fetch (alternating loads from separate tensor instances) combined with Vector Engine binary ops
- Chip/Cluster reduce: Across chips or clusters (DMA + interleaved fetch + Vector Engine binary op)
Implementation: Interleaved Fetch
Split reduce uses interleaved fetch to load multiple tensor instances alternately, creating a time-interleaved stream that the Vector Engine reduces. The fetch pattern introduces an interleave dimension I that indexes the separate tensor instances:
// Two tensor instances to be reduced together
let tensor_0: DmTensor<bf16, m![1], m![1], m![1], m![A, B]> = ...;
let tensor_1: DmTensor<bf16, m![1], m![1], m![1], m![A, B]> = ...;
// Interleaved fetch creates alternating time stream: I=2 dimension
let interleaved: TuTensor<bf16, m![1], m![1], m![1],
m![I: 2, A], m![B]
> = ctx.main.begin_interleaved().fetch(&tensor_0, &tensor_1);
// Vector Engine reduction combines the I dimension
let reduced: TuTensor<bf16, m![1], m![1], m![1],
m![A], m![B]
> = interleaved.reduce_add(axis: I);
The interleaved fetch alternates between tensor instances in the time dimension: time[0] holds data from tensor_0, time[1] from tensor_1, time[2] from tensor_0 again, and so on. The Vector Engine performs binary operations (add, max, min) across the interleave dimension to complete the reduction.
Example 1: Layer Normalization Split Reduction
Layer normalization computes statistics (mean, variance) over the feature dimension. When the feature dimension is too large to fit in a single VRF allocation, it must be split into multiple chunks that are processed separately and then combined.
The Problem: Layer normalization requires computing the mean and variance of all features for each token. The formula is:
output = (input - mean) / sqrt(variance + epsilon)
where mean and variance are computed over the entire Hidden dimension.
When Hidden is very large (like 8192 elements), the tensor won’t fit in the 8KB VRF, so we cannot reduce it in a single operation.
Input: A 3D tensor representing transformer activations:
- Shape: `[Batch=32, SeqLen=128, Hidden=8192]`
- Data type: `bf16` (2 bytes per element)
- Total size: 32 × 128 × 8192 × 2 bytes = 64 MB
- Per-token slice: For each of 4096 tokens (32 × 128), we have 8192 features = 16 KB per token
- VRF constraint: Only 8KB per slice ≈ 4096 `bf16` elements
- Problem: Cannot load all 8192 features for a token simultaneously
Solution Strategy:
Split the Hidden dimension into two 4096-element chunks:
- Chunk 0: `[Batch=32, SeqLen=128, Hidden_0=4096]` - first half of features
- Chunk 1: `[Batch=32, SeqLen=128, Hidden_1=4096]` - second half of features
- Each chunk = 4096 elements × 2 bytes = 8 KB, fits in VRF
Step-by-Step Execution
Step 1: Compute Partial Statistics
First, compute statistics for each chunk independently:
// Chunk 0: Hidden dimensions 0..4096
let chunk_0: DmTensor<bf16, m![1], m![1], m![1], m![Batch, SeqLen, Hidden_0: 4096]> = ...;
// Chunk 1: Hidden dimensions 4096..8192
let chunk_1: DmTensor<bf16, m![1], m![1], m![1], m![Batch, SeqLen, Hidden_1: 4096]> = ...;
// Compute sum for each chunk (using Reducer + Inter-Slice Block)
let sum_0: DmTensor<f32, m![1], m![1], m![1], m![Batch, SeqLen]> = chunk_0.reduce_sum(axis: Hidden_0);
let sum_1: DmTensor<f32, m![1], m![1], m![1], m![Batch, SeqLen]> = chunk_1.reduce_sum(axis: Hidden_1);
Step 2: Interleaved Fetch and Combine
Use split reduce to combine the partial sums:
// Fetch both chunks in interleaved pattern
let interleaved_sums: TuTensor<f32, m![1], m![1], m![1],
m![I: 2, Batch, SeqLen], m![1]
> = ctx.main.begin_interleaved().fetch(&sum_0, &sum_1);
// Vector Engine adds across I dimension to get total sum
let total_sum: TuTensor<f32, m![1], m![1], m![1],
m![Batch, SeqLen], m![1]
> = interleaved_sums.reduce_add(axis: I);
// Compute mean: total_sum / Hidden
let mean = total_sum * (1.0 / 8192.0); // Vector Engine scalar multiply
Step 3: Compute Variance
Similarly, combine partial variance calculations:
// Compute squared differences for each chunk
let sq_diff_0 = (chunk_0 - mean).square().reduce_sum(axis: Hidden_0);
let sq_diff_1 = (chunk_1 - mean).square().reduce_sum(axis: Hidden_1);
// Split reduce to combine variance contributions
let interleaved_vars: TuTensor<f32, m![1], m![1], m![1],
m![I: 2, Batch, SeqLen], m![1]
> = ctx.main.begin_interleaved().fetch(&sq_diff_0, &sq_diff_1);
let total_sq_diff = interleaved_vars.reduce_add(axis: I);
// Divide the summed squared differences by the feature count to obtain the variance
let total_variance = total_sq_diff * (1.0 / 8192.0);
let std = total_variance.sqrt();
Output:
The three steps produce the statistics needed for layer normalization:
- Mean: `[Batch=32, SeqLen=128]` - one mean value per token, representing the average of all 8192 features
- Standard deviation: `[Batch=32, SeqLen=128]` - one std value per token
- Result: Use these statistics to normalize each token's 8192 features:
  - `normalized_chunk_0 = (chunk_0 - mean) / std`
  - `normalized_chunk_1 = (chunk_1 - mean) / std`

Computing statistics in two separate chunks produces the same mathematical result as computing over all 8192 features at once:
- Mathematically: `mean([a,b,c,d,e,f]) = (sum(a,b,c) + sum(d,e,f)) / 6`
- In practice: `mean([Hidden_0, Hidden_1]) = (sum(Hidden_0) + sum(Hidden_1)) / 8192`
Split reduce computes global statistics despite VRF capacity limits.
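A plain-Rust sanity check of this identity (illustration only; it runs on the host, not on the TCP):

// Mean over two chunks equals the mean over their concatenation.
fn chunked_mean(chunk_0: &[f32], chunk_1: &[f32]) -> f32 {
    let n = (chunk_0.len() + chunk_1.len()) as f32;
    (chunk_0.iter().sum::<f32>() + chunk_1.iter().sum::<f32>()) / n
}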
Hardware Mapping
The split reduce operation maps to hardware as follows:
| Operation | Hardware Component | Cycles |
|---|---|---|
| Fetch chunk_0 | Fetch Engine | ~1 cycle per 32-byte flit |
| Fetch chunk_1 | Fetch Engine (interleaved) | ~1 cycle per 32-byte flit |
| Interleave dimension creation | Fetch Sequencer | 0 (structural transformation) |
| Binary add across I | Vector Engine | 1 cycle per packet |
Performance Analysis
Total cycles for split reduce:
- Fetch both tensors: `2 * (Batch * SeqLen * ceil(Hidden / flit_elements))` cycles
- Vector Engine reduction: `(Batch * SeqLen)` cycles
- Total: Dominated by fetch time, ~8K cycles for this example
Bottleneck: Memory bandwidth for fetching both tensor instances sequentially.
Optimization: Restructure the computation to avoid splitting the reduction axis when possible. If the axis must be split, minimize the number of split instances.
Example 2: Batch Normalization Across Split Batches
Batch normalization computes statistics across the batch dimension. When processing very large batches, the batch dimension may be split across multiple tensor allocations.
Problem Setup
- Input: `[Batch_0 = 256, ...]`, `[Batch_1 = 256, ...]` (two separate batch tensors)
- Reduction goal: Compute mean and variance across all 512 examples
- Constraint: Cannot allocate a single tensor for all 512 examples due to memory limits
Execution Pattern
// Two batch allocations
let batch_0: DmTensor<bf16, m![1], m![1], m![1], m![Batch_0: 256, C, H, W]> = ...;
let batch_1: DmTensor<bf16, m![1], m![1], m![1], m![Batch_1: 256, C, H, W]> = ...;
// Compute per-batch statistics (reduce over H, W)
let batch_stats_0 = batch_0.reduce_mean(axis: [H, W]); // [Batch_0=256, C]
let batch_stats_1 = batch_1.reduce_mean(axis: [H, W]); // [Batch_1=256, C]
// Split reduce to combine batch statistics
let interleaved: TuTensor<f32, m![1], m![1], m![1],
m![I: 2, Batch: 256, C], m![1]
> = ctx.main.begin_interleaved().fetch(&batch_stats_0, &batch_stats_1);
// Compute global statistics across all batches
let global_mean = interleaved.reduce_mean(axis: I); // Average the two batch means (valid because both splits hold the same number of examples)
This pattern extends naturally to more than two splits by increasing the interleave dimension: I: 4 for four splits, etc.
Example 3: Mixture of Experts Partial Reduction
Problem Setup
- Expert outputs: Multiple tensors from different expert evaluations
- Routing weights: Weights determining how much each expert contributes
- Goal: Weighted sum across expert outputs
Execution Pattern
// Expert outputs from separate evaluations (simplified: 2 experts)
let expert_0_output: DmTensor<bf16, m![1], m![1], m![1], m![Tokens, Hidden]> = ...;
let expert_1_output: DmTensor<bf16, m![1], m![1], m![1], m![Tokens, Hidden]> = ...;
let routing_weights: [f32; 2] = [0.7, 0.3]; // Per-expert weights
// Apply routing weights during fetch using zero-point arithmetic or scaling
let weighted_0 = expert_0_output * routing_weights[0];
let weighted_1 = expert_1_output * routing_weights[1];
// Split reduce to combine weighted expert contributions
let interleaved: TuTensor<bf16, m![1], m![1], m![1],
m![I: 2, Tokens], m![Hidden]
> = ctx.main.begin_interleaved().fetch(&weighted_0, &weighted_1);
let combined_output = interleaved.reduce_add(axis: I);
Example 4: Temporal Reduction Across Windows
Problem Setup
- Input: Video frames or sequence tokens split into temporal chunks
- Goal: Compute global statistics across all chunks
- Constraint: Cannot load all chunks simultaneously due to memory limits
Execution Pattern
// Temporal chunks
let chunk_t0: DmTensor<bf16, m![1], m![1], m![1], m![Time_0: 128, Features]> = ...;
let chunk_t1: DmTensor<bf16, m![1], m![1], m![1], m![Time_1: 128, Features]> = ...;
let chunk_t2: DmTensor<bf16, m![1], m![1], m![1], m![Time_2: 128, Features]> = ...;
let chunk_t3: DmTensor<bf16, m![1], m![1], m![1], m![Time_3: 128, Features]> = ...;
// Compute per-chunk max (e.g., for max pooling over time)
let max_t0 = chunk_t0.reduce_max(axis: Time_0); // [Features]
let max_t1 = chunk_t1.reduce_max(axis: Time_1); // [Features]
let max_t2 = chunk_t2.reduce_max(axis: Time_2); // [Features]
let max_t3 = chunk_t3.reduce_max(axis: Time_3); // [Features]
// Split reduce with I=4 to find global maximum
let interleaved: TuTensor<bf16, m![1], m![1], m![1],
m![I: 4], m![Features]
> = ctx.main.begin_interleaved().fetch(&max_t0, &max_t1, &max_t2, &max_t3);
let global_max = interleaved.reduce_max(axis: I);
Comparison with Other Reduction Methods
The choice between split reduce and its alternatives depends on data location, tensor shape, and whether data can be merged into a single allocation.
Split Reduce vs. Slice Reduce (Inter-Slice Block)
| Aspect | Split Reduce | Slice Reduce (Inter-Slice Block) |
|---|---|---|
| Data layout | Multiple independent tensors | Single tensor across slices |
| Fetch pattern | Interleaved fetch from multiple sources | Single contiguous fetch |
| Reduction hardware | Vector Engine binary ops | Inter-Slice Block |
| Typical cycles | ~2x fetch time + cycles | ~256 cycles (slice reduction) |
| Use case | Data cannot fit in single tensor | Data distributed across hardware |
Prefer split reduce: Multiple tensor instances that cannot be merged into a single tensor due to memory allocation constraints, but all reside on the same chip/cluster.
Prefer slice reduce: Allocate a single tensor that spans slices, allowing the hardware to handle distribution automatically.
Split Reduce vs. Chip/Cluster Reduce
| Aspect | Split Reduce | Chip/Cluster Reduce |
|---|---|---|
| Data location | Same chip/cluster | Across chips/clusters |
| Communication | Local memory fetch | DMA over chip interconnect |
| Overhead | Minimal (interleaved fetch) | Significant (DMA + synchronization) |
| Bandwidth | SRAM bandwidth | Chip interconnect bandwidth |
Prefer split reduce: All data resides on the same chip, even if in separate allocations.
Prefer chip/cluster reduce: Data is distributed across physically separate processing units requiring cross-chip communication.
Implementation Methods
The split reduce operation maps to the following hardware primitives:
- Interleaved fetch: Fetch Engine with `begin_interleaved()` mode, creating the `I` interleave dimension
- Reduction across I: Vector Engine binary operations (add, max, min) configured to reduce the interleave axis
- Alternative for 2-way split: Can use a binary operation directly without an explicit interleave dimension
Two-Instance Optimization
For the common case of splitting into exactly two instances, the Vector Engine can perform the reduction without creating an explicit interleave dimension:
// Direct binary operation for 2-way split
let sum_0: TuTensor<f32, m![1], m![1], m![1], m![A], m![B]> = ...;
let sum_1: TuTensor<f32, m![1], m![1], m![1], m![A], m![B]> = ...;
// Fetch both and add in one operation
let total = sum_0.binary_add(sum_1); // No interleave dimension needed
This optimization reduces overhead by combining fetch and reduction into a single pipelined operation.
Performance Considerations
Cycle Analysis
Split reduce cycle count depends on three factors:
- Fetch cycles: `N_splits * fetch_cycles_per_tensor`
- Vector Engine cycles: `Time_dim_size * cycles_per_packet` (typically 1 cycle per packet)
- Pipeline overlap: Fetch and VE operations can overlap when possible
Total cycles ≈ N_splits * fetch_cycles + max(0, VE_cycles - pipeline_overlap)
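The same model as a tiny helper (illustrative only, not an SDK API):

// Illustrative estimator for the split-reduce cycle model above.
fn split_reduce_cycles(n_splits: u64, fetch_cycles_per_tensor: u64, ve_cycles: u64, overlap: u64) -> u64 {
    // Total ≈ N_splits * fetch_cycles + max(0, VE_cycles - pipeline_overlap)
    n_splits * fetch_cycles_per_tensor + ve_cycles.saturating_sub(overlap)
}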
Memory Bandwidth
Split reduce consumes memory bandwidth proportionally to the number of splits:
- 2-way split: 2x memory bandwidth vs. single tensor
- 4-way split: 4x memory bandwidth vs. single tensor
Optimization: Minimize the number of splits by maximizing individual tensor size within VRF capacity.
Comparison to Alternatives
For a reduction requiring combining N tensor instances:
| Method | Cycles | Memory BW | Complexity |
|---|---|---|---|
| Split reduce (interleaved) | ~N * fetch + VE | N * tensor_size | Low |
| Sequential fetch + accumulate | ~N * (fetch + VE) | N * tensor_size | Medium |
| DMA to single buffer + reduce | DMA + single_reduce | N * tensor_size | High |
Split reduce with interleaved fetch provides the best balance of performance and implementation simplicity for same-chip reductions.
Constraints and Limitations
Hardware Constraints
- Interleave dimension size: Limited by Fetch Engine capabilities
- Tensor alignment: All tensor instances must have compatible shapes for interleaving
- VRF capacity: After interleaving, the combined tensor must fit in VRF (8KB per slice)
When Split Reduce Is Not Optimal
- Single tensor possible: Data fits in one tensor allocation, use slice reduce (Inter-Slice Block) instead
- Cross-chip reduction needed: Data spans chips, use chip/cluster reduce with DMA
- Very large split count: Beyond ~8 splits, consider alternative memory management strategies
Best Practices
- Minimize splits: Design tensor allocations to minimize the number of splits required
- Power-of-2 splits: Use 2, 4, or 8 splits when possible for optimal hardware utilization
- Reuse reduction results: Cache split reduce results when the same combination is needed multiple times
- Consider memory layout: Organize tensor allocations to enable efficient interleaved fetch patterns
Chip/Cluster Reduce
When a previous operation has already mapped the reduce axis to the Chip or Cluster dimension, chip/cluster reduce is needed to combine partial results across physically separate processing units.
This section demonstrates how to perform those reduction operations when data is distributed across multiple chips or clusters.
When possible, assigning reduce axes to Slice/Element (reduced by Inter-Slice Block/Vector Engine) is preferred because it avoids cross-chip communication overhead.
Two main operations implement chip/cluster reduce: AllReduce and ReduceScatter. Both combine Switch Engine operations (for data redistribution across slices within a cluster) with Vector Engine binary operations (for actual reduction computation).
ReduceScatter
ReduceScatter reduces data distributed across chip/cluster axes while distributing the result so each chip holds a portion.
This operation is useful when you need both reduction and result distribution in a single step.
Example: 4-chip ReduceScatter with Add
This example demonstrates how to perform reduction across chips when data is partitioned by one dimension (A) but needs to be reduced along a different dimension (B).
The challenge is that each chip owns data for all B values of its assigned A value, but we need to sum across all A values for each B.
Input:
A 2D tensor [A=4, B=4] with 16 total elements, distributed across 4 chips:
- Shape: `[A=4, B=4]` - 16 elements total
- Data type: `i8` (8-bit signed integer)
- Storage: SRAM on each chip
- Distribution: `In = {chip: A, slice: 256, element: B}`
  - Chip 0 owns: `(A=0, B=0)`, `(A=0, B=1)`, `(A=0, B=2)`, `(A=0, B=3)` - all B values for A=0
  - Chip 1 owns: `(A=1, B=0)`, `(A=1, B=1)`, `(A=1, B=2)`, `(A=1, B=3)` - all B values for A=1
  - Chip 2 owns: `(A=2, B=0)`, `(A=2, B=1)`, `(A=2, B=2)`, `(A=2, B=3)` - all B values for A=2
  - Chip 3 owns: `(A=3, B=0)`, `(A=3, B=1)`, `(A=3, B=2)`, `(A=3, B=3)` - all B values for A=3
Goal:
Reduce along the A axis (summing across chips) while keeping results distributed by B:
- Output shape: `[B=4]` - 4 elements (A dimension eliminated by reduction)
- Output distribution: `Out = {chip: 4, slice: 256, element: 1}`
  - Chip 0 should hold: sum of `(A=0..3, B=0)` - the sum of all A values for B=0
  - Chip 1 should hold: sum of `(A=0..3, B=1)` - the sum of all A values for B=1
  - Chip 2 should hold: sum of `(A=0..3, B=2)` - the sum of all A values for B=2
  - Chip 3 should hold: sum of `(A=0..3, B=3)` - the sum of all A values for B=3
Processing:
Slice (also called asymmetric slice) is a sub-context operation that extracts a subset of elements from specific chip positions (see Implementation Methods).
ChipShuffle is a DMA-based redistribution operation that moves data from one chip to another.
The algorithm works through six stages: create four intermediate tensors using diagonal Slice + ChipShuffle patterns, add them to reduce the A axis, then broadcast results to all chips.
The diagonal pattern ensures each chip receives the data it needs for its assigned B value.
Initial State
Each chip owns one value along the A axis:
- Chip 0: `(A=0, B=0)`, `(A=0, B=1)`, `(A=0, B=2)`, `(A=0, B=3)`
- Chip 1: `(A=1, B=0)`, `(A=1, B=1)`, `(A=1, B=2)`, `(A=1, B=3)`
- Chip 2: `(A=2, B=0)`, `(A=2, B=1)`, `(A=2, B=2)`, `(A=2, B=3)`
- Chip 3: `(A=3, B=0)`, `(A=3, B=1)`, `(A=3, B=2)`, `(A=3, B=3)`
Step 1: Create Tensor T0 - Slice(0,1,2,3)
This step selects specific positions along the B axis from each chip using Slice, creating a diagonal selection pattern:
- Chip 0: select (0,0)
- Chip 1: select (1,1)
- Chip 2: select (2,2)
- Chip 3: select (3,3)
Result T0:
- Chip 0: (0,0)
- Chip 1: (1,1)
- Chip 2: (2,2)
- Chip 3: (3,3)
Step 2: Create Tensor T1 - Slice(3,0,1,2) + ChipShuffle(1,2,3,0)
This step combines Slice with ChipShuffle to create a rotated diagonal pattern.
First, Slice selects elements:
- Chip 0: (0,3)
- Chip 1: (1,0)
- Chip 2: (2,1)
- Chip 3: (3,2)
Then ChipShuffle(1,2,3,0) redistributes the data so each chip receives data from another chip:
- Data from Chip 1 moves to Chip 0: (1,0)
- Data from Chip 2 moves to Chip 1: (2,1)
- Data from Chip 3 moves to Chip 2: (3,2)
- Data from Chip 0 moves to Chip 3: (0,3)
Step 3: Create Tensor T2 - Slice(2,3,0,1) + ChipShuffle(2,3,0,1)
This step creates another rotated diagonal pattern. First, Slice selects positions:
- Chip 0: select (0,2)
- Chip 1: select (1,3)
- Chip 2: select (2,0)
- Chip 3: select (3,1)
Then ChipShuffle(2,3,0,1) redistributes the data, yielding T2:
- Chip 0: (2,0)
- Chip 1: (3,1)
- Chip 2: (0,2)
- Chip 3: (1,3)
Step 4: Create Tensor T3 - Slice(1,2,3,0) + ChipShuffle(3,0,1,2)
This step creates the final rotated diagonal pattern. First, Slice selects positions:
- Chip 0: select (0,1)
- Chip 1: select (1,2)
- Chip 2: select (2,3)
- Chip 3: select (3,0)
Then ChipShuffle(3,0,1,2) redistributes the data, yielding T3:
- Chip 0: (3,0)
- Chip 1: (0,1)
- Chip 2: (1,2)
- Chip 3: (2,3)
Step 5: Vector Engine Add - A Axis Reduction
This step performs the actual reduction by adding all 4 tensors element-wise:
- Chip 0: (0,0) + (1,0) + (2,0) + (3,0)
- Chip 1: (1,1) + (2,1) + (3,1) + (0,1)
- Chip 2: (2,2) + (3,2) + (0,2) + (1,2)
- Chip 3: (3,3) + (0,3) + (1,3) + (2,3)
After this addition, each chip holds only one value because the A axis has been reduced:
Intermediate = { chip: B, slice: 256, element: 1 }
Step 6: AllGather
This final step broadcasts the result so all chips hold the complete reduction output. Each chip gathers data from Chip 0 through Chip 3:
Intermediate = { chip: 4, slice: 256, element: B }
Output:
After all six steps complete, each chip holds a portion of the reduced result:
- Final distribution: `Out = {chip: A, slice: 256, element: 4}`
- Chip 0: Holds sum of all `(A=*, B=0)` values
- Chip 1: Holds sum of all `(A=*, B=1)` values
- Chip 2: Holds sum of all `(A=*, B=2)` values
- Chip 3: Holds sum of all `(A=*, B=3)` values
The A axis has been reduced (summed across all 4 chips), and the results are scattered across chips based on the B value.
Each chip now owns one element representing the sum of all A values for its assigned B coordinate.
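The data movement above can be checked with a plain-Rust simulation. It models only the diagonal Slice + ChipShuffle + add pattern; it is not Virtual ISA code and invokes no TCP primitives.

// Simulate the 4-chip ReduceScatter: input[a][b] is element (A=a, B=b), chip a owns row a.
fn reduce_scatter_sim(input: [[i32; 4]; 4]) -> [i32; 4] {
    let n = 4usize;
    let mut per_chip_sum = [0i32; 4];
    for step in 0..n {
        // Slice: chip c extracts its element (A=c, B=(c - step) mod n).
        let sliced: Vec<i32> = (0..n).map(|c| input[c][(c + n - step) % n]).collect();
        // ChipShuffle: chip d receives the element sliced on chip (d + step) mod n,
        // so after the shuffle chip d holds (A=(d + step) mod n, B=d).
        for d in 0..n {
            per_chip_sum[d] += sliced[(d + step) % n];
        }
    }
    // per_chip_sum[d] is the sum over all A of (A, B=d): the scattered result.
    per_chip_sum
}

Running this on the 4×4 example leaves chip `d` with the sum of column `B = d`, matching the ReduceScatter output described above.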
Why this example is useful:
ReduceScatter combines two operations that frequently occur together in distributed computing:
- Reduction across processors: Summing/aggregating data distributed across multiple chips
- Result distribution: Each chip gets a portion of the result rather than duplicating it everywhere
This pattern is essential for:
- Distributed matrix multiplication: Reduce partial products from different chips while distributing the result
- Gradient aggregation in data parallelism: Sum gradients across workers, with each worker holding a portion
- Memory efficiency: Avoids storing the full reduced result on every chip (unlike AllReduce)
- Pipeline parallelism: Enables efficient communication patterns between pipeline stages
The diagonal slicing pattern is key: it ensures that data needed for each output element is gathered from all chips before reduction, minimizing communication rounds.
AllReduce
AllReduce reduces data distributed across the chip axis so that all chips have identical reduction results.
Unlike ReduceScatter, AllReduce ensures every chip ends up with the complete result rather than a portion.
Example: 4-chip AllReduce with Add
This example demonstrates the most common collective operation in distributed deep learning: reducing values across all processors so every processor has the identical complete result. This is essential for operations like averaging gradients across data-parallel training workers.
Input:
A 2D tensor [A=4, B=4] distributed across 4 chips by the A dimension:
- Shape: `[A=4, B=4]` - 16 elements total
- Data type: `i8` (8-bit signed integer)
- Storage: SRAM on each chip
- Distribution: `In = {chip: A, slice: 256, element: B}`
  - Chip 0 owns: `(A=0, B=0-3)` - all 4 B values for A=0
  - Chip 1 owns: `(A=1, B=0-3)` - all 4 B values for A=1
  - Chip 2 owns: `(A=2, B=0-3)` - all 4 B values for A=2
  - Chip 3 owns: `(A=3, B=0-3)` - all 4 B values for A=3
Goal:
Reduce along the A axis and replicate the complete result to all chips:
- Output shape: `[B=4]` - 4 elements (A dimension eliminated by summation)
- Output distribution: `Out = {chip: 4, slice: 256, element: B}`
  - Every chip holds: sum of all `(A=0..3, B=0)`, sum of all `(A=0..3, B=1)`, sum of all `(A=0..3, B=2)`, sum of all `(A=0..3, B=3)`
  - All chips have identical data after AllReduce completes
Processing:
The algorithm creates 4 versions of the input tensor through rotation, then adds them all together:
- Use 3 `ChipShuffle` operations on the original tensor `T0` to create 3 rotated versions (`T1`, `T2`, `T3`)
- Add all 4 tensors element-wise using the Vector Engine
- Every chip performs the same additions on its local data, producing identical results everywhere
Initial State (T0)
Each chip owns one value along the A axis:
- Chip 0: (A=0, B=0), (A=0, B=1), (A=0, B=2), (A=0, B=3)
- Chip 1: (A=1, B=0), (A=1, B=1), (A=1, B=2), (A=1, B=3)
- Chip 2: (A=2, B=0), (A=2, B=1), (A=2, B=2), (A=2, B=3)
- Chip 3: (A=3, B=0), (A=3, B=1), (A=3, B=2), (A=3, B=3)
Step 1: Create Tensor T1 - ChipShuffle(1,2,3,0)
This step rotates the data by one chip position.
ChipShuffle(1,2,3,0) is applied to the original T0:
- Data from Chip 1 moves to Chip 0
- Data from Chip 2 moves to Chip 1
- Data from Chip 3 moves to Chip 2
- Data from Chip 0 moves to Chip 3
The resulting T1:
- Chip 0: (1,0), (1,1), (1,2), (1,3)
- Chip 1: (2,0), (2,1), (2,2), (2,3)
- Chip 2: (3,0), (3,1), (3,2), (3,3)
- Chip 3: (0,0), (0,1), (0,2), (0,3)
Step 2: Create Tensor T2 - ChipShuffle(2,3,0,1)
This step rotates the data by two chip positions.
ChipShuffle(2,3,0,1) is applied to the original T0:
- Data from Chip 2 moves to Chip 0
- Data from Chip 3 moves to Chip 1
- Data from Chip 0 moves to Chip 2
- Data from Chip 1 moves to Chip 3
The resulting T2:
- Chip 0: (2,0), (2,1), (2,2), (2,3)
- Chip 1: (3,0), (3,1), (3,2), (3,3)
- Chip 2: (0,0), (0,1), (0,2), (0,3)
- Chip 3: (1,0), (1,1), (1,2), (1,3)
Step 3: Create Tensor T3 - ChipShuffle(3,0,1,2)
This step rotates the data by three chip positions.
ChipShuffle(3,0,1,2) is applied to the original T0:
- Data from Chip 3 moves to Chip 0
- Data from Chip 0 moves to Chip 1
- Data from Chip 1 moves to Chip 2
- Data from Chip 2 moves to Chip 3
The resulting T3:
- Chip 0: (3,0), (3,1), (3,2), (3,3)
- Chip 1: (0,0), (0,1), (0,2), (0,3)
- Chip 2: (1,0), (1,1), (1,2), (1,3)
- Chip 3: (2,0), (2,1), (2,2), (2,3)
Step 4: Vector Engine Add - A Axis Reduction
This step performs the actual reduction by adding all 4 tensors T0, T1, T2, T3:
- Chip 0: (0,0)+(1,0)+(2,0)+(3,0), (0,1)+(1,1)+(2,1)+(3,1), (0,2)+(1,2)+(2,2)+(3,2), (0,3)+(1,3)+(2,3)+(3,3)
- Chip 1: (1,0)+(2,0)+(3,0)+(0,0), (1,1)+(2,1)+(3,1)+(0,1), (1,2)+(2,2)+(3,2)+(0,2), (1,3)+(2,3)+(3,3)+(0,3)
- Chip 2: (2,0)+(3,0)+(0,0)+(1,0), (2,1)+(3,1)+(0,1)+(1,1), (2,2)+(3,2)+(0,2)+(1,2), (2,3)+(3,3)+(0,3)+(1,3)
- Chip 3: (3,0)+(0,0)+(1,0)+(2,0), (3,1)+(0,1)+(1,1)+(2,1), (3,2)+(0,2)+(1,2)+(2,2), (3,3)+(0,3)+(1,3)+(2,3)
Notice that each chip computes the same mathematical result, just with operands in different orders (addition is commutative, so order doesn’t matter). After this step, all chips have identical data.
Output:
After the AllReduce completes, every chip holds the complete reduced result:
- Final distribution: `Out = {chip: 4, slice: 256, element: B}`
- Every chip holds identical data: the sum of all A values for each B position
- All chips have: `[sum(A=0..3, B=0), sum(A=0..3, B=1), sum(A=0..3, B=2), sum(A=0..3, B=3)]`
This can be viewed as transforming [A=4] | [B=4] to [Broadcast=4] | [B=4]:
- The `A` axis has been reduced (eliminated through summation)
- The result is broadcast to all chips (every chip has the complete result)
Why this example is useful:
AllReduce is the workhorse operation for distributed machine learning:
- Data parallel training: Average gradients computed across multiple batches on different chips
- Model averaging: Combine parameter updates from multiple workers
- Synchronization primitive: Ensure all chips have identical state before proceeding
- Global statistics: Compute metrics like mean/max/min across the entire distributed dataset
Key characteristics:
- Bandwidth efficient: Each chip only receives data from 3 shuffle operations (not 3 full tensor transfers)
- Symmetric: All chips perform the same computation, simplifying implementation
- Complete replication: Every chip ends with full result, enabling independent downstream operations
- Foundation for collectives: More complex distributed operations build on AllReduce
The rotation-based algorithm shown here scales to any power-of-2 number of chips: for 8 chips, use 7 rotations; for 16 chips, use 15 rotations, etc.
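The rotation pattern can also be checked with a plain-Rust simulation (not Virtual ISA code); it models the `T0`..`T3` rotations and the final add for any number of chips.

// Simulate rotation-based AllReduce: each chip owns one row of `input`.
// After n-1 rotations and adds, every chip holds the element-wise sum of all rows.
fn all_reduce_sim(input: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    let n = input.len();
    let width = input[0].len();
    // Start from the local row (T0), then add the rotated versions (T1..Tn-1).
    let mut result = input.clone();
    for rotation in 1..n {
        for chip in 0..n {
            // ChipShuffle by `rotation`: chip receives the row of chip (chip + rotation) % n.
            let src = (chip + rotation) % n;
            for col in 0..width {
                result[chip][col] += input[src][col];
            }
        }
    }
    result
}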
Implementation Methods for Each Operation
Each operation in chip/cluster reduce maps to specific hardware primitives. Understanding these mappings helps predict performance and resource usage patterns.
Asymmetric Slice
Chip/cluster asymmetric slice operations extract a subset of data from specific positions in the chip or cluster dimension. The ParallelCopy operation implements this by running in the sub-context using the stos (Store to SRAM) command. This approach enables selective data extraction without full tensor movement, copying only the elements at positions specified by the slice indices. The sub-context execution ensures that slice operations can overlap with main-context computation, maintaining pipeline efficiency.
Shuffle
Chip/cluster shuffle redistributes data across chips using DMA operations through HBM. The DmaCommand handles intra-chip shuffles by moving data between HBM regions associated with different chips, while PCIeDmaCommand extends this capability to inter-chip communication when needed. The HBM-to-HBM transfer pattern avoids unnecessary round-trips through chip-local memory, directly routing data to its destination. Shuffle operations are the primary cost factor in chip/cluster reduce because they involve cross-chip data movement over the interconnect fabric, typically requiring hundreds to thousands of cycles depending on data volume.
Tensor Addition
Tensor addition combines multiple input tensors element-wise to perform the actual reduction computation. This operation runs in the main context using a two-stage approach: interleaved fetch brings data from multiple tensor instances into the pipeline, and the Vector Engine’s binary add operation performs the element-wise summation. The interleaved fetch pattern enables the Vector Engine to process additions efficiently by presenting operands in alternating time steps, avoiding the need for separate accumulation buffers. This main-context execution provides maximum throughput for the arithmetic-intensive reduction phase after data has been properly arranged through slice and shuffle operations.
Fetch and Commit Engine
The Fetch Engine reads tensors from SRAM while the Commit Engine writes them back. These examples demonstrate the complete data path: input tensor -> fetch sequencer -> Switch Engine -> Collect Engine -> commit unit -> output tensor.
Each example focuses on a specific pattern: axis permutation, full-flit commit, tail padding optimization, and tensor segmentation. These four patterns represent distinct aspects of the fetch-commit data path: axis reordering (permutation), write granularity (full-flit), memory layout choices (tail padding), and handling of tensors that exceed hardware capacity (segmentation).
Example 1: Axis Permutation
This example demonstrates tensor reshaping by permuting axes during a fetch-commit cycle. The Switch Engine enables axis reordering without additional computation by controlling data flow from fetch to commit.
axes![A = 3, B = 5, C = 2];
// Input: shape [A, B, C] at address 0
let input: DmTensor<f8, m![1], m![1], m![1], m![A, B, C]> = ...;
// Output: shape [B, A, C] at address 1024 (permuted layout)
let output: DmTensor<f8, m![1], m![1], m![1], m![B, A, C # 6]> = ctx
.main
.begin(input.view())
.fetch::<f8, m![A, B], m![C # 6]>() // Time=[A,B], Packet=[C] padded to 8 bytes
.collect::<m![A, B], m![C # 30]>() // Pad to 32-byte flit (forwarding switch implied)
.commit(1024); // Write with permuted sequencer config
Input Tensor:
The input is a 3D tensor stored in SRAM with dimensions [A=3, B=5, C=2], containing 30 elements total:
- Shape: `A × B × C = 3 × 5 × 2`
- Data type: `f8` (8-bit floating-point)
- Memory layout: `m![A, B, C]` - consecutive in memory as `A` varies slowest, `C` varies fastest
- Base address: `b = 0` (starts at SRAM address 0)
- Physical storage: Elements are arranged as `[A0,B0,C0][A0,B0,C1][A0,B1,C0]...[A2,B4,C1]`
Labeling elements by their indices, memory contains:
Address 0-1: (A=0,B=0,C=0-1)
Address 2-3: (A=0,B=1,C=0-1)
Address 4-5: (A=0,B=2,C=0-1)
...continuing with A=0, varying B...
Address 10-11: (A=1,B=0,C=0-1)
...and so on
Output Tensor (Target):
Store the same logical tensor with axes permuted to layout [B, A, C]:
- Shape: Still `B × A × C = 5 × 3 × 2` (same 30 elements, different order)
- Data type: `f8` (unchanged)
- Memory layout: `m![B, A, C # 6]` - now `B` varies slowest, with 6 bytes of padding after each `C` pair
- Base address: `b = 1024` (stored at SRAM address 1024)
- Physical storage: Elements arranged as `[B0,A0,C0-1][B0,A1,C0-1][B0,A2,C0-1][B1,A0,C0-1]...`
This reordering changes which elements are adjacent in memory: in the input, all B values for A=0 are contiguous; in the output, all A values for B=0 are contiguous.
Processing:
The axis permutation happens through three stages, handled by the Fetch Sequencer, the Collect Engine, and the Commit Unit:
- Fetch Sequencer: Reads the input tensor from SRAM and creates a packet stream
  - Time dimension: `Time = m![A, B]` - iterates through 15 cycles (3 × 5)
  - Packet dimension: `Packet = m![C # 6]` - each packet contains 2 `C` elements plus 6 bytes of padding
  - Fetch size: 8 bytes per cycle (meets hardware alignment requirement)
  - Note: Hardware requires 8-byte packet alignment, so we cannot use `C` = 2 bytes alone; we pad to 8 bytes
- Collect Engine: Normalizes packets into standard 32-byte flits for the commit stage
  - Input packets (8 bytes) are padded to create 32-byte flits
  - Time dimension: `Time = m![A, B]` - unchanged, still 15 cycles
  - Flit dimension: `Flit = m![C # 30]` - 2 data bytes + 30 bytes padding = 32-byte flit
  - The Collect Engine pads and normalizes packet sizes without reordering data
- Commit Unit: Writes data to SRAM with the new axis order `[B, A, C]`
  - Receives flits with time `m![A, B]` but writes to memory layout `m![B, A, C # 6]`
  - The write sequencer configuration creates the permutation
  - Commit size: 8 bytes per write (matching fetch size)
  - Slices incoming 32-byte flits down to 8-byte write units
The write sequencer configuration determines how to map the incoming time-ordered stream m![A, B, C] to the permuted memory layout m![B, A, C #6].
The notation [axis=count:stride, ...] @ base / commit_size means: for each axis, loop count times advancing stride bytes per step; @ sets the base address; / sets the bytes written per commit operation (see Sequencer for the full sequencer model).
The sequencer is configured as: [A=3:8, B=5:24, C=8:1] @ 1024 / 8
- `A=3:8` means loop 3 times with stride 8 bytes between iterations
- `B=5:24` means loop 5 times with stride 24 bytes between iterations
- `C=8:1` means write 8 bytes (the packet size) with stride 1
- Base address: 1024 (output tensor starts here)
- Commit size: 8 bytes per write operation
This configuration causes data arriving in [A, B] time order to be written to addresses that correspond to [B, A] spatial order. Here’s how the writes occur:
| Cycle i | Time axes | Write to memory address | Explanation |
|---|---|---|---|
| 0 | A=0, B=0 | 1024-1032 (B=0, A=0) | First element: writes to base address |
| 1 | A=0, B=1 | 1048-1056 (B=1, A=0) | Stride 24 bytes forward (next B) |
| 2 | A=0, B=2 | 1072-1080 (B=2, A=0) | Another 24-byte stride |
| 3-4 | A=0, B=3-4 | Continue with B=3,4 | Complete A=0 row |
| 5 | A=1, B=0 | 1032-1040 (B=0, A=1) | Jump to B=0, A=1 (+8 from cycle 0) |
| 6 | A=1, B=1 | 1056-1064 (B=1, A=1) | +24 stride for next B |
| 7-14 | Continue | … | Complete all A=1,2 rows |
Notice how the write pattern interleaves: we write A=0,B=0 then A=0,B=1, but these end up at addresses that place all A values for each B together in the output layout.
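The address pattern in the table can be reproduced with a short host-side sketch (illustration only; it mirrors the sequencer configuration rather than invoking it):

// Reproduce the write addresses implied by [A=3:8, B=5:24, C=8:1] @ 1024 / 8.
// Flits arrive in [A, B] time order and each commit writes 8 bytes.
fn permuted_write_addresses() -> Vec<(u64, u64, u64)> {
    let (base, a_stride, b_stride) = (1024u64, 8u64, 24u64);
    let mut writes = Vec::new();
    for a in 0..3u64 {
        for b in 0..5u64 {
            writes.push((a, b, base + a * a_stride + b * b_stride));
        }
    }
    writes
}

The first entries are (0, 0, 1024), (0, 1, 1048), (0, 2, 1072), and cycle 5 gives (1, 0, 1032), matching the table.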
Output:
After commit completes, SRAM address 1024 onwards contains the tensor with permuted layout:
- Memory layout: `[B=5, A=3, C=2]` with 6-byte padding
- Physical arrangement: All `A` values for `B=0` are contiguous, then all `A` values for `B=1`, etc.
- Address structure:

  1024-1032: (B=0, A=0, C=0-1) + 6 bytes padding
  1032-1040: (B=0, A=1, C=0-1) + 6 bytes padding
  1040-1048: (B=0, A=2, C=0-1) + 6 bytes padding
  1048-1056: (B=1, A=0, C=0-1) + 6 bytes padding
  ...and so on
The permutation is complete: the same 30 data elements that were in [A, B, C] order are now in [B, A, C] order. This operation takes 15 cycles (one per A×B combination) and requires no actual computation—only memory read/write with different address patterns.
Key constraints:
Three constraints govern axis permutation operations:
- 8-byte alignment: `commit_in_size` and `commit_size` are always in 8-byte units, so the target tensor for commit always corresponds to an 8-byte aligned range, naturally creating 8-byte tail alignment (a dummy is added to align the tail to 8 bytes).
- Sequencer limit: Like the Fetch Engine, sequencer entries are limited to 8 total (limit < 65536).
- Non-contiguous writes: Since the write sequencer sets the commit address, committed data need not be contiguous in flit time order - permutations like `AB -> BA` are possible.
Why this example is useful:
Axis permutation is a common requirement in deep learning:
- Tensor layout transformations: Converting between `NCHW` (batch, channels, height, width) and `NHWC` (batch, height, width, channels) formats for different operations
- Matrix transpose: Preparing data for operations that require transposed matrices without actual computation
- Memory access optimization: Reordering axes to make the most frequently accessed dimension innermost for better cache performance
- Inter-operation compatibility: Reformatting tensors to match the input requirements of subsequent operations
The TCP architecture performs these reshapes during data movement without consuming compute resources or requiring separate transpose kernels.
Example 2: Full-flit Commit
This example demonstrates full-flit commit, an optimization that writes entire 32-byte flits directly to memory without slicing them into smaller chunks. Tensor dimensions naturally aligned to 32-byte boundaries eliminate commit slicer overhead and simplify write sequencer configuration.
axes![A = 3, B = 5, C = 2];
// Input: same shape [A, B, C]
let input: DmTensor<f8, m![1], m![1], m![1], m![A, B, C]> = ...;
// Output: merge B and C, pad to 32 bytes per A slice
let output: DmTensor<f8, m![1], m![1], m![1], m![A, [B, C] # 22]> = ctx
.main
.begin(input.view())
.fetch::<f8, m![A], m![[B, C] # 22]>() // Time=[A], Packet=[B,C] padded to 32 bytes
.collect::<m![A], m![[B, C] # 22]>() // Already 32-byte flit, identity collect
.commit(1024); // Full-flit commit: 3 cycles vs 15 in Example 1
Input Tensor: The input tensor is identical to Example 1, but committed with a different memory layout that allows full-flit writes:
- Shape: `[A=3, B=5, C=2]` containing 30 elements
- Data type: `f8` (8-bit floating-point, 1 byte per element)
- Memory layout: `m![A, B, C]` - standard row-major order
- Base address: `b = 0`
- Element size: 1 byte × 30 elements = 30 bytes of data
Output Tensor (Target): Instead of permuting axes like Example 1, we merge the last two dimensions and add padding:
- Shape: Still `[A=3, B=5, C=2]` logically, but stored as `[A=3, BC=10]`
- Data type: `f8` (unchanged)
- Memory layout: `m![A, [B, C] # 22]` - merge the B and C dimensions, add 22 bytes of padding
- Base address: `b = 1024`
- Physical layout: Each `A` iteration stores 10 data bytes (B×C) plus 22 padding bytes = 32 bytes total
- The 32-byte size per `A` slice perfectly matches the hardware flit size, enabling full-flit writes
Processing:
Data dimensions aligned with hardware flit size enable a simpler pipeline than Example 1:
- Fetch Sequencer: Reads input and pads to 32-byte packets immediately
  - Time dimension: `Time = m![A]` - 3 cycles (one per `A` slice, matching the kernel above)
  - Packet dimension: `Packet = m![[B, C] # 22]` - merges B and C, adds 22 bytes padding to reach 32 bytes
  - Fetch size: 32 bytes per cycle (full packet, not split)
  - The sequencer pads from 10 data bytes to 32 bytes during fetch
- Collect Engine: Receives 32-byte packets and passes them through as 32-byte flits
  - Time dimension: `Time = m![A]` - just 3 cycles since B and C are merged into the packet
  - Flit dimension: `Flit = m![[B, C] # 22]` - full 32-byte flit with no additional padding needed
  - No reformatting required: packet size = flit size = 32 bytes
- Commit Unit: Writes full 32-byte flits directly to memory without slicing
  - Receives 32-byte flits and writes them as complete 32-byte units
  - commit_in_size = 32 bytes: no slicer operation needed
  - commit_size = 32 bytes: each write operation handles a full flit
  - Time: only 3 cycles (one per `A`), much faster than Example 1's 15 cycles
The write sequencer configuration is simple: `[A=3:32, [B,C]=32:1] @ 1024 / 32`
- `A=3:32` means loop 3 times with a 32-byte stride (one full flit per A)
- `[B,C]=32:1` means write 32 bytes with stride 1 (continuous write of the flit contents)
- Each cycle writes one complete flit: cycle 0 writes the flit for A=0, cycle 1 for A=1, cycle 2 for A=2 (see the sketch below)
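As a host-side illustration of those three writes, here is a minimal Rust sketch that replays the full-flit configuration `[A=3:32, [B,C]=32:1] @ 1024 / 32`; names and loop structure are for illustration only.

```rust
fn main() {
    // Full-flit commit: [A=3:32, [B,C]=32:1] @ 1024 / 32
    let base: usize = 1024;
    let flit: usize = 32; // one full flit per commit
    for a in 0..3 {
        let addr = base + a * 32;
        println!(
            "cycle {}: A={} -> bytes {}..{} (10 data + 22 padding)",
            a, a, addr, addr + flit
        );
    }
}
```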
Output:
After commit, SRAM address 1024 onwards contains the tensor packed into 32-byte-aligned blocks:
- Memory layout: `[A=3, BC=10+padding]` with each `A` slice occupying exactly 32 bytes
- Physical structure:
  - 1024-1056: (A=0, all 10 B×C elements) + 22 bytes padding = 32 bytes
  - 1056-1088: (A=1, all 10 B×C elements) + 22 bytes padding = 32 bytes
  - 1088-1120: (A=2, all 10 B×C elements) + 22 bytes padding = 32 bytes
- Performance: Only 3 write cycles vs 15 in Example 1 (5× faster)
- Simplicity: No slicing overhead, no complex stride patterns
Why this example is useful:
Full-flit commit demonstrates an important optimization strategy:
- Alignment optimization: When you can pad dimensions to 32-byte boundaries, commit becomes much more efficient
- Reduced cycles: Fewer, larger writes complete faster than many small writes
- Hardware efficiency: Writing full flits maximizes memory bandwidth utilization
- Design principle: Sometimes adding padding to align with hardware granularity improves overall performance
This technique is particularly valuable for:
- Small tensors where padding overhead is minimal compared to the benefit
- Intermediate results that don’t need compact storage
- Situations where downstream operations also benefit from 32-byte alignment
Key constraint: Write sequencer configurations require non-zero stride for all entries. This means you cannot discard data beyond slicing (no selective writes), and broadcast (reuse) operations are not possible during commit.
Example 3: Tail Padding and Fetch Size
The amount of tail padding dramatically affects fetch/commit efficiency.
In the mapping expression m![A # 72], the # 72 pads A up to 72 elements; the pads are referred to as dummy in the hardware configuration below.
Understanding padding interaction with hardware fetch size constraints enables optimization between memory usage (less padding) and performance (more padding aligned to hardware boundaries).
axes![A = 65, B = 2];
let input: DmTensor<f8, m![1], m![1], m![1], m![B, A # 72]> = ...;
// Option 1: dummy=7, fetch_size=24 bytes, 6 cycles
let out_7: DmTensor<f8, m![1], m![1], m![1], m![B, A # 72]> = ctx.main
.begin(input.view())
.fetch::<f8, m![B * (A # 72) / 24], m![A % 24]>()
.collect::<m![B * (A # 72) / 24], m![A % 24 # 8]>()
.commit(1024);
// Option 2: dummy=31, fetch_size=32 bytes, 6 cycles (best performance)
let out_31: DmTensor<f8, m![1], m![1], m![1], m![B, A # 96]> = ctx.main
.begin(input.view())
.fetch::<f8, m![B * (A # 96) / 32], m![A % 32]>()
.collect::<m![B * (A # 96) / 32], m![A % 32]>()
.commit(1024);
The Problem:
Commit a tensor with shape [A=65, B=2] (130 bytes of data). Hardware fetch sizes must be 8, 16, 24, or 32 bytes.
Determine the padding amount for dimension A to maximize performance.
Input Tensor:
- Shape: `[A=65, B=2]` - 130 elements (65 elements across the A dimension, 2 across B)
- Data type: `f8` (1 byte per element)
- Memory layout: `m![B, A # 7]` - stored with 7 bytes of tail padding after A
- Base address: `b = 0`
- Total size: `2 × (65 + 7) = 144 bytes` (includes padding)
Output Tensor (Variable Padding): The target can have different padding amounts, each enabling different fetch sizes:
- Shape: `[A=65, B=2]` (same logical data)
- Data type: `f8`
- Base address: `b = 1024`
- Memory layout options:
  - `m![B, A # 7]`: 7 bytes padding → enables 24-byte fetch size
  - `m![B, A # 15]`: 15 bytes padding → enables 16-byte fetch size
  - `m![B, A # 23]`: 23 bytes padding → enables 8-byte fetch size (worst)
  - `m![B, A # 31]`: 31 bytes padding → enables 32-byte fetch size (best)
The optimal fetch size unit varies depending on the tail dummy value.
The following subsections show each case:
dummy = 7
- fetch sequencer output
  - `Time = m![B * (A # 7) / 24]`
  - `Flit = m![A % 24]`
  - fetch_size = 24 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 7) / 24]`
  - `Flit = m![A % 24 # 8]`
- Commit Unit
  - commit_in_size = 24 bytes
  - sliced shape
    - `Time = m![B * (A # 7) / 24]`
    - `Flit = m![A % 24]`
  - write sequencer configuration
    - `m![B, A # 7]` -> `m![B * (A # 7) / 24 * A % 24]`
    - Sequencer configuration: `[B=2:72, (A # 7)/24=3:24, A=24:1] @ 1024 / 24`
dummy = 15
- fetch sequencer output
  - `Time = m![B * (A # 15) / 16]`
  - `Flit = m![A % 16]`
  - fetch_size = 16 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 15) / 16]`
  - `Flit = m![A % 16]`
- Commit Unit
  - commit_in_size = 16 bytes
  - sliced shape
    - `Time = m![B * (A # 15) / 16]`
    - `Flit = m![A % 16]`
  - write sequencer configuration
    - `m![B, A # 15]` -> `m![B * (A # 15) / 16 * A % 16]`
    - Sequencer configuration: `[B=2:80, (A # 15)/16=5:16, A=16:1] @ 1024 / 16`
dummy = 23
- fetch sequencer output
  - `Time = m![B * (A # 23) / 8]`
  - `Flit = m![A % 8]`
  - fetch_size = 8 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 23) / 8]`
  - `Flit = m![A % 8 # 24]`
- Commit Unit
  - commit_in_size = 8 bytes
  - sliced shape
    - `Time = m![B * (A # 23) / 8]`
    - `Flit = m![A % 8]`
  - write sequencer configuration
    - `m![B, A # 23]` -> `m![B * (A # 23) / 8 * A % 8]`
    - Sequencer configuration: `[B=2:88, (A # 23)/8=11:8, A=8:1] @ 1024 / 8`
dummy = 31
- fetch sequencer output
  - `Time = m![B * (A # 31) / 32]`
  - `Flit = m![A % 32]`
  - fetch_size = 32 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 31) / 32]`
  - `Flit = m![A % 32]`
- Commit Unit
  - commit_in_size = 32 bytes
  - sliced shape
    - `Time = m![B * (A # 31) / 32]`
    - `Flit = m![A % 32]`
  - write sequencer configuration
    - `m![B, A # 31]` -> `m![B * (A # 31) / 32 * A % 32]`
    - Sequencer configuration: `[B=2:96, (A # 31)/32=3:32, A=32:1] @ 1024 / 32`
Summary: The Impact of Padding Choice
The following table summarizes how output tail padding affects performance:
| Padding (dummy) | fetch_size | Fetch cycles | Memory overhead | Efficiency |
|---|---|---|---|---|
| 7 | 24 bytes | 6 cycles | 14 bytes (9.7%) | Good |
| 15 | 16 bytes | 10 cycles | 30 bytes (18.8%) | Moderate |
| 23 | 8 bytes | 22 cycles | 46 bytes (26.1%) | Poor |
| 31 | 32 bytes | 6 cycles | 62 bytes (32.3%) | Best |
Key Insights:
- Performance varies dramatically: `dummy=23` requires 22 cycles (8-byte fetches) while `dummy=31` requires only 6 cycles (32-byte fetches) - nearly 4× faster despite using similar amounts of padding
- Optimal padding aligns with hardware: The best performance comes when `(data_size + padding)` is divisible by 32 bytes (the largest fetch size)
- Trade-off: Adding 8 more bytes of padding (23→31) increases memory overhead from 26.1% to 32.3% (about 6 percentage points) but improves performance by 3.7× (22 cycles → 6 cycles)
- Design principle: Prefer padding amounts that enable the largest possible fetch size (32 bytes), even with slightly more memory waste
Why this example is useful:
Naive padding choices cause severe performance degradation:
- Padding to arbitrary values like 23 bytes forces small 8-byte fetches
- Understanding fetch size constraints enables strategic padding choices
- Pad to the next multiple of 32 bytes when possible
- The memory cost of better padding is usually negligible compared to the performance gain
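To see where the numbers in the summary table come from, here is a small host-side Rust sketch that, for each dummy value, picks the largest hardware fetch size (8, 16, 24, or 32 bytes) that evenly divides the padded A run and then derives the cycle count and overhead. The selection rule is a simplification for this example, not a statement of the compiler's actual policy.

```rust
fn main() {
    // Reproduce the Example 3 summary table: A = 65, B = 2, f8 elements (1 byte each).
    let (a, b) = (65usize, 2usize);
    let fetch_sizes = [32usize, 24, 16, 8]; // hardware fetch sizes, largest first

    for dummy in [7usize, 15, 23, 31] {
        let padded = a + dummy;
        // Largest fetch size that evenly divides the padded A run.
        let fetch = fetch_sizes
            .iter()
            .copied()
            .find(|f| padded % *f == 0)
            .expect("padded size must be divisible by some fetch size");
        let cycles = b * (padded / fetch);
        let total = b * padded;
        let overhead = b * dummy;
        println!(
            "dummy={:2}: fetch_size={:2}B, cycles={:2}, overhead={}B ({:.1}%)",
            dummy,
            fetch,
            cycles,
            overhead,
            100.0 * overhead as f64 / total as f64
        );
    }
}
```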
Why dummy=23 Cannot Use 32-byte Commits
The dummy=23 case cannot use 32-byte commits because write sequencer configurations must never exceed tensor boundaries:
- fetch sequencer output
  - `Time = m![B * (A # 31) / 32]`: Since A + 23 is not divisible by 32, setting fetch_size = 32 bytes requires fetching A + 31 elements.
  - `Flit = m![A % 32]`
  - fetch_size = 32 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 31) / 32]`
  - `Flit = m![A % 32]`
- Commit Unit
  - commit_in_size = 32 bytes (if commit_in_size < 32, it would cut off part of the valid A range, so commit_in_size must be 32 bytes)
  - sliced shape
    - `Time = m![B * (A # 31) / 32]`
    - `Flit = m![A % 32]`
  - write sequencer configuration
    - `m![B, A # 23]` -> `m![B * (A # 31) / 32 * A % 32]`
    - Sequencer configuration: `[B=2:88, (A # 31)/32=3:32, A=32:1] @ 1024 / 32` - each B row would write 3 × 32 = 96 bytes into an 88-byte row of the `A # 23` output, overrunning the tensor boundary
Key insight: Read sequencer configurations can safely overfetch (reading dummy addresses beyond the input tensor range is acceptable), but write sequencer configurations must never write beyond the tensor boundary (as it could write data to space occupied by other tensors).
This asymmetry is why dummy=23 cannot use 32-byte commits.
Example 4: Tensor Segmentation
TCP handles tensors exceeding Vector Register File (VRF) capacity through segmentation. Tensors too large to process in a single execution are automatically split into smaller chunks that fit within hardware constraints, with each chunk processed independently.
axes![A = 2048, B = 32];
// Input: 64KB tensor, exceeds 8KB VRF limit
let input: DmTensor<f8, m![1], m![1], m![1], m![A, B]> = ...;
// Segmented into two executions (compiler handles this automatically)
// Execution #0: first half of A
let seg_0: DmTensor<f8, m![1], m![1], m![1], m![A % 1024, B]> = ctx.main
.begin(input.view().slice(A, 0..1024))
.fetch::<f8, m![A % 1024], m![B]>()
.collect::<m![A % 1024], m![B]>()
.commit(256 * 1024); // Write to 256K
// Execution #1: second half of A
let seg_1: DmTensor<f8, m![1], m![1], m![1], m![A @ 1024, B]> = ctx.main
.begin(input.view().slice(A, 1024..2048))
.fetch::<f8, m![A @ 1024], m![B]>()
.collect::<m![A @ 1024], m![B]>()
.commit(256 * 1024 + 32 * 1024); // Write to 256K + 32K
The Problem: The Vector Engine’s VRF has only 8KB of capacity per slice. Tensors requiring more storage than this cannot be fetched and processed in one operation. Segmentation splits the tensor across multiple executions.
The 8KB limit is per-slice, not total. While a cluster has 256 slices (2MB total VRF), each slice holds only 8KB. Tensor distribution across slices is controlled by the slice dimension in the tensor mapping. Tensors without enough elements mapped to the slice dimension, or operations requiring entire rows/columns in individual slices (common in reduction operations), hit the per-slice limit before using all 256 slices. A [2048, 32] tensor with mapping m![1, 1, 1, 2048, 32] (no slice distribution) attempts to store all 64KB in slice 0, exceeding the 8KB limit. Even with slice distribution m![1, 1, 256, 8, 32], each slice stores only 256 bytes, but intermediate results or operation constraints may require more per-slice storage. Segmentation ensures each slice’s VRF usage stays within the 8KB hardware limit.
Input Tensor: A large 2D tensor that exceeds VRF capacity:
- Shape: `[A=2048, B=32]` - 65,536 elements
- Data type: `f8` (1 byte per element)
- Total size: 2048 × 32 = 65,536 bytes = 64 KB
- Memory layout: `m![A, B]` - standard row-major
- Base address: `b = 0`
- Problem: 64 KB far exceeds the 8 KB VRF limit per slice
Output Tensor: The same tensor needs to be written to a different SRAM location:
- Shape: `[A=2048, B=32]` (identical)
- Data type: `f8`
- Memory layout: `m![A, B]`
- Base address: `b = 256K` (different location)
Solution Strategy:
Split the A dimension into two segments:
- Segment 1: `A % 1024` (first 1024 elements) = 1024 × 32 = 32 KB
- Segment 2: `A @ 1024` (second 1024 elements) = 1024 × 32 = 32 KB
- Each 32 KB segment still far exceeds the 8 KB per-slice VRF limit on its own

This requires further splitting or distributing dimensions across slices. Segmentation processes arbitrarily large tensors by dividing them into hardware-manageable chunks; a sketch of the host-side bookkeeping follows.
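The following Rust sketch illustrates, on the host, the kind of bookkeeping the segmentation performs for this example: splitting the A dimension into segments and computing each execution's commit base address. The 32 KB per-execution budget and the variable names are assumptions made for this sketch, not the compiler's actual policy.

```rust
fn main() {
    // Illustrative segmentation bookkeeping for a [A=2048, B=32] f8 tensor.
    let (a, b) = (2048usize, 32usize);
    let bytes_per_a_row = b; // 32 bytes per A index
    let segment_budget = 32 * 1024; // bytes committed per execution in this example
    let rows_per_segment = segment_budget / bytes_per_a_row; // 1024 A rows

    let out_base = 256 * 1024; // output base address (256K, as in the example)
    let mut start = 0;
    let mut exec = 0;
    while start < a {
        let end = (start + rows_per_segment).min(a);
        let commit_base = out_base + start * bytes_per_a_row;
        println!(
            "execution #{exec}: A[{start}..{end}] -> commit at {commit_base} ({} bytes)",
            (end - start) * bytes_per_a_row
        );
        start = end;
        exec += 1;
    }
}
```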
Processing:
Process the tensor in two separate executions:
Execution #0
- fetch sequencer output
  - `Time = m![A % 1024]`
  - `Flit = m![B]`
  - fetch_size = 32 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![A % 1024]`
  - `Flit = m![B]`
- Commit Unit
  - commit_in_size = 32 bytes
  - sliced shape
    - `Time = m![A % 1024]`
    - `Flit = m![B]`
  - write sequencer configuration
    - `m![A, B]` -> `m![(A % 1024), B]`
    - Sequencer configuration: `[A%1024=1024:32, B=32:1] @ 256K / 32`
  - From the entire output tensor with mapping `m![A, B]`, only the first half is fetched and committed.
Execution #1 (Second Half)
Processes the second half of the A dimension:
- Fetch sequencer output:
  - `Time = m![A @ 1024]` - time dimension covers A elements 1024-2047
  - `Flit = m![B]` - 32-byte packets containing the full B dimension
  - fetch_size = 32 bytes
- Switch Engine output:
  - `Time = m![A @ 1024]` - 1024 cycles for the second half of A
  - `Flit = m![B]` - 32-byte flits
- Commit Unit:
  - commit_in_size = 32 bytes (full-flit commit)
  - Write sequencer configuration: `[A@1024=1024:32, B=32:1] @ (256K + 32 * 1024) / 32`
  - Base address: `256K + 32KB` (offset to skip the first segment)
  - Writes to addresses 256K+32KB through 256K+64KB
Output:
After both executions complete, the output tensor is reconstructed:
- Memory layout: SRAM starting at address 256K contains the complete tensor `[A=2048, B=32]`
- Segment 1: Addresses 256K to 256K+32KB hold `A[0:1023], B[0:31]`
- Segment 2: Addresses 256K+32KB to 256K+64KB hold `A[1024:2047], B[0:31]`
- Result: Logically identical to the input, just stored at a different location
The compiler automatically determines segmentation requirements and splits tensors into multiple executions. From the programmer’s perspective, this is a single logical operation; the segmentation is transparent.
Why this example is useful:
Tensor segmentation is essential for practical deep learning workloads:
- Large model support: Modern LLMs have tensors with billions of elements that cannot fit in VRF
- Automatic handling: The compiler manages segmentation automatically based on VRF capacity
- No performance penalty for well-designed splits: When segment boundaries align with memory access patterns, segmentation adds minimal overhead
- Scalability: This mechanism enables processing tensors of arbitrary size on fixed hardware
- Memory hierarchy exploitation: Segmentation naturally maps to hierarchical memory systems (VRF → SRAM → HBM)
In practice, the compiler considers multiple factors when segmenting:
- VRF capacity constraints
- Memory bandwidth utilization
- Alignment with tensor unit requirements
- Minimizing the number of segments to reduce overhead
Transformer Architecture
This page uses Llama 3 70B as a concrete example to show how each transformer operation maps to specific TCP hardware components. Llama 3 70B implements a decoder-only transformer architecture with two main phases: prefill (input encoding) and decode (token generation).
Model Parameters
The following parameters define the Llama 3 70B architecture, grouped by category:
Sequence dimensions (control input/output length):
- `B`: batch size
- `s_in`: input sequence length
- `s_max`: maximum sequence length / context length
- `s`: total sequence length processed so far (prefill + decode)
Model size (vocabulary and layer counts):
- `V = 128256`: vocab size
- `D = 8192`: hidden dimension / size of embedding
- `F = 28672`: intermediate dimension for the FFN up projection
- `L = 80`: number of layers
Attention head dimensions (how attention is partitioned):
- `h_q = 64`: number of query heads
- `h_kv = 8`: number of key/value heads
- `G = 8`: number of attention groups (= h_q / h_kv)
- `d_k = 128`: head dimension (equal to `D / h_q`)
- `d_k_prime = 64`: split head dimension for RoPE computation
- `f = 2`: frequency dimension for adjacent heads (`d_k = d_k_prime * f`)
Prefill Phase
The prefill phase processes the entire input sequence in parallel, outputting the first token while storing computed Key/Value pairs as KV cache. The transformer block executes on all input tokens provided by the user. The following subsections describe each step in order.
1. Embedding Lookup
Embedding lookup converts input tokens to vector space representations.
- Input
  - `input: shape![B, s_in]`
  - Token indices of the input text (which vocabulary entry each token corresponds to)
- Weight
  - `w_emb: shape![V, D]`
  - Pre-trained embedding value table for each vocabulary entry
- Output
  - `x_0: shape![B, s_in, D]`
- Operation
  - `x_0 = gather(index: input, table: w_emb)`
  - gather: Operation that reads values from the table using index values specified in the index tensor.
  - Processed by TensorDMA.
2. Transformer Layers (repeated L times)
Each transformer layer applies attention and feed-forward operations sequentially.
For each layer l = 1, ..., L, perform the following:
2.1. Input Layer Normalization
Input layer normalization stabilizes training by normalizing activations before attention.
- Input
  - `x_prev: shape![B, s_in, D]`
- Output
  - `x_norm: shape![B, s_in, D]`
- Operation
  - Apply RMSNorm: `x_norm = RMSNorm(x_prev)`
  - RMSNorm: Root Mean Square Layer Normalization
  - Processed by Vector Engine (see the reference sketch below).
2.2. Multi-Head Grouped Query Attention (GQA)
Grouped Query Attention (GQA) improves memory efficiency by sharing key/value heads across multiple query heads, reducing KV cache size.
2.2.1. QKV Projection
QKV projection transforms the normalized input into Query, Key, and Value tensors.
- Input
  - `x_norm: shape![B, s_in, D]`
- Weights
  - `w_q: shape![D, h_q, d_k]`
  - `w_k: shape![D, h_kv, d_k]`
  - `w_v: shape![D, h_kv, d_k]`
- Outputs
  - `Q: shape![B, s_in, h_q, d_k]`
  - `K: shape![B, s_in, h_kv, d_k]`
  - `V: shape![B, s_in, h_kv, d_k]`
- Operations
  - `Q = einsum(x_norm, w_q)`
  - `K = einsum(x_norm, w_k)`
  - `V = einsum(x_norm, w_v)`
  - matmul corresponds to einsum (= broadcast + elementwise mul + reduce add).
    - elementwise mul: Contraction Engine
    - reduce add
      - packet reduce: Reducer
      - time reduce: Reducer
      - slice reduce: global adder tree
      - split reduce: interleaved fetch + Vector Engine binary op
      - cluster/chip reduce: DMA + interleaved fetch + Vector Engine binary op
2.2.2. Rotary Position Embedding (RoPE)
Rotary Position Embedding (RoPE) applies positional information to Query and Key tensors through rotation transformations.
- Inputs
  - `Q: shape![B, s_in, h_q, d_k]`
  - `K: shape![B, s_in, h_kv, d_k]`
  - `d_k = d_k_prime * f`
    - Split the `d_k` axis to apply the RoPE rotation in a TCP-friendly manner.
- RoPE table
  - `w_rope: shape![s_max, d_k_prime, 2, 2]`
  - Pre-computed table of cos/sin values based on sequence position and head position.
  - The RoPE operation groups consecutive pairs among the `d_k` values and applies a rotation transformation using cos/sin.
  - Store the 2 × 2 matrix representing the cos/sin rotation transformation for TCP-friendly execution.
- Position
  - `position: shape![s_in]`
  - `position(i) = i`
- Outputs
  - `Q_rope: shape![B, h_q, s_in, d_k]`
  - `K_rope: shape![B, h_kv, s_in, d_k]`
- Operations
  - RoPE table lookup
    - `t_rope: shape![s_in, d_k_prime, 2, 2] = gather(index: position, table: w_rope)`
  - Apply RoPE
    - RoPE computation reduces to a simple einsum operation given the prepared rotation transformation matrix values.
    - Reshape (noop)
      - `Q: shape![B, s_in, h_q, d_k] == shape![B, s_in, h_q, d_k_prime, f]`
      - `K: shape![B, s_in, h_kv, d_k] == shape![B, s_in, h_kv, d_k_prime, f]`
      - `t_rope: shape![s_in, d_k_prime, 2, 2] == shape![s_in, d_k_prime, f, 2]`
    - einsum
      - `Q_rope = einsum(Q, t_rope)`
        - `(shape![B, s_in, h_q, d_k_prime, f], shape![s_in, d_k_prime, f, 2]) -> shape![B, h_q, s_in, d_k_prime, 2] == shape![B, h_q, s_in, d_k]`
      - `K_rope = einsum(K, t_rope)`
        - `(shape![B, s_in, h_kv, d_k_prime, f], shape![s_in, d_k_prime, f, 2]) -> shape![B, h_kv, s_in, d_k_prime, 2] == shape![B, h_kv, s_in, d_k]`
As a result of RoPE, Q/K values encode relative positional information.
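To illustrate the table-based formulation, the following Rust sketch builds the 2 × 2 rotation blocks for a single position and applies them to one head vector viewed as `[d_k_prime, f]` with `f = 2`. The base frequency 10000 is the conventional RoPE choice and the index order of the 2 × 2 block is a guess made for this sketch; neither is specified by the text above.

```rust
fn main() {
    // Build a [d_k_prime, 2, 2] rotation table for one position and apply it to a
    // query vector split into consecutive pairs (q[2i], q[2i+1]).
    let d_k = 8usize;
    let d_k_prime = d_k / 2;
    let pos = 3usize;

    // w_rope[pos][i] ~ [[cos, -sin], [sin, cos]] for angle = pos * theta_i (assumed layout).
    let mut table = vec![[[0.0f32; 2]; 2]; d_k_prime];
    for i in 0..d_k_prime {
        let theta = (pos as f32) / 10000f32.powf(2.0 * i as f32 / d_k as f32);
        let (s, c) = theta.sin_cos();
        table[i] = [[c, -s], [s, c]];
    }

    let q: Vec<f32> = (0..d_k).map(|i| i as f32 * 0.1).collect();
    let mut q_rope = vec![0.0f32; d_k];
    for i in 0..d_k_prime {
        let (x0, x1) = (q[2 * i], q[2 * i + 1]);
        // Contraction over the f axis: q_rope[i][r] = sum_f q[i][f] * rot[i][r][f]
        q_rope[2 * i] = table[i][0][0] * x0 + table[i][0][1] * x1;
        q_rope[2 * i + 1] = table[i][1][0] * x0 + table[i][1][1] * x1;
    }
    println!("{:?}", q_rope);
}
```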
2.2.3. Store in KV Cache
KV cache stores the current layer’s Key and Value for reuse during the decode phase, avoiding redundant computation.
- Inputs
  - `K_rope: shape![B, h_kv, s_in, d_k]`
  - `V: shape![B, s_in, h_kv, d_k]`
- KV Cache (for layer `l`)
  - `kv_cache_l_K: shape![B, h_kv, s_in, d_k]`
  - `kv_cache_l_V: shape![B, h_kv, s_in, d_k]`
- Operations
  - `kv_cache_l_K = K_rope`
  - `kv_cache_l_V = V`
  - Cache storage: Stores the einsum computation results from DM to HBM, processed by TensorDMA.
2.2.4. Grouped Query Attention Computation
Grouped Query Attention shares each key/value head across multiple query heads.
Each of the 8 KV heads is shared with 8 Query heads (G = h_q / h_kv = 64 / 8 = 8).
2.2.4.1. Attention Scores Computation
Attention scores measure the relevance between query and key positions using dot product similarity.
- Inputs
  - `Q_rope: shape![B, h_q, s_in, d_k]`
  - `K_rope: shape![B, h_kv, s_in, d_k]`
- Output
  - `scores: shape![B, h_q, s_in, s_in]`
- Operations
  - `scores = (Q_rope @ K_rope.T) / sqrt(d_k)`
  - Reshape (noop)
    - The dot product operation can be expressed as an einsum. Each tensor's shape axes must be precisely distinguished from the output shape's perspective to accurately represent the einsum operation's semantics.
    - `Q_rope: shape![B, h_q, s_in, d_k] == shape![B, G, h_kv, s_in_q, d_k]`
    - `K_rope: shape![B, h_kv, s_in, d_k] == shape![B, h_kv, s_in_k, d_k]`
  - einsum
    - `scores_before_normalize = einsum(Q_rope, K_rope)`
      - `(shape![B, G, h_kv, s_in_q, d_k], shape![B, h_kv, s_in_k, d_k]) -> shape![B, G, h_kv, s_in_q, s_in_k] == shape![B, h_q, s_in, s_in]`
    - The einsum expression shows that `G` was broadcast from `K_rope`, and `d_k` was reduced.
  - Normalize
    - `scores = scores_before_normalize / sqrt(d_k)`
    - Division by `sqrt(d_k)` can be computed as multiplication by `1/sqrt(d_k)`. The value `1/sqrt(d_k)` is pre-computed, and the Vector Engine performs a simple constant multiplication (see the reference sketch below).
2.2.4.2. Causal Mask Application
Causal masking prevents tokens from attending to future positions, preserving autoregressive semantics.
In the prefill phase, s_in tokens are processed in parallel, but the i-th token must not reference tokens after position i to maintain the autoregressive model’s semantics.
- Input
  - `scores: shape![B, h_q, s_in, s_in]`
  - `attention_mask: shape![s_in, s_in]`
    - `attention_mask(i, j) = true if j <= i, false if j > i`
- Output
  - `scores_masked: shape![B, h_q, s_in, s_in]`
- Operation
  - `scores_masked(b, h, i, j) = scores(b, h, i, j) if j <= i, -inf if j > i`
  - In the Vector Engine, the `attention_mask` tensor is written to the branch log, then processed through branched operations.
2.2.4.3. Softmax Application
Softmax normalizes attention scores into a probability distribution over key positions.
- Input
  - `scores_masked: shape![B, h_q, s_in, s_in]`
- Output
  - `attn_weights: shape![B, h_q, s_in, s_in]`
- Operation
  - `attn_weights = softmax(scores_masked)`
  - Softmax computes the ratio at which each query should reference each token to combine values.
  - Reduces over the key-corresponding axis among the two `s_in` dimensions.
  - `softmax(x)_i = exp(x_i) / sum_j(exp(x_j))`
  - Processed by Vector Engine (a reference sketch of the masked softmax follows).
2.2.4.4. Weighted Sum (Attention Output)
Weighted sum computes the attention output by combining Value vectors according to attention weights.
- Inputs
  - `attn_weights: shape![B, h_q, s_in, s_in]`
  - `V: shape![B, s_in, h_kv, d_k]`
- Output
  - `attn_output: shape![B, h_q, s_in, d_k]`
- Operations
  - Reshape (noop)
    - `attn_weights: shape![B, h_q, s_in, s_in] == shape![B, G, h_kv, s_in_q, s_in_kv]`
    - `V: shape![B, s_in, h_kv, d_k] == shape![B, h_kv, s_in_kv, d_k]`
  - einsum
    - `attn_output = einsum(attn_weights, V)`
      - `(shape![B, G, h_kv, s_in_q, s_in_kv], shape![B, h_kv, s_in_kv, d_k]) -> shape![B, G, h_kv, s_in_q, d_k] == shape![B, h_q, s_in, d_k]`
    - The einsum expression shows that `G` was broadcast from `V`, and `s_in_kv` was reduced.
2.2.5. Output Projection
Output projection combines the multi-head attention results into a single hidden state vector.
- Input
  - `attn_output: shape![B, h_q, s_in, d_k]`
- Weight
  - `w_o: shape![h_q, d_k, D]`
- Output
  - `attn_out: shape![B, s_in, D]`
- Operations
  - `attn_out = einsum(attn_output, w_o)`
    - `(shape![B, h_q, s_in, d_k], shape![h_q, d_k, D]) -> shape![B, s_in, D]`
2.2.6. Residual Connection
Residual connection adds the attention output to the layer input, improving gradient flow during training.
- Inputs
  - `x_prev: shape![B, s_in, D]`
  - `attn_out: shape![B, s_in, D]`
- Output
  - `x_attn: shape![B, s_in, D]`
- Operation
  - `x_attn = x_prev + attn_out`
  - elementwise addition: Processed by Vector Engine
2.3. Feed-Forward Network (FFN)
The Feed-Forward Network applies non-linear transformations to each token independently after attention.
2.3.1. Post-Attention Layer Normalization
Post-attention normalization stabilizes activations before the FFN computation.
- Input
  - `x_attn: shape![B, s_in, D]`
- Output
  - `x_ffn_norm: shape![B, s_in, D]`
- Operation
  - `x_ffn_norm = RMSNorm(x_attn)`
  - RMSNorm: Processed by Vector Engine
2.3.2. SwiGLU FFN
SwiGLU (Swish-Gated Linear Unit) is Llama 3’s activation function, combining gating with the Swish non-linearity.
- Input
  - `x_ffn_norm: shape![B, s_in, D]`
- Weights
  - `w_gate: shape![D, F]`
  - `w_up: shape![D, F]`
  - `w_down: shape![F, D]`
- Output
  - `ffn_out: shape![B, s_in, D]`
- Operations
  - Gate projection: `gate = einsum(x_ffn_norm, w_gate)`
    - `(shape![B, s_in, D], shape![D, F]) -> shape![B, s_in, F]`
  - Up projection: `up = einsum(x_ffn_norm, w_up)`
    - `(shape![B, s_in, D], shape![D, F]) -> shape![B, s_in, F]`
  - SwiGLU activation: `activated = SiLU(gate) * up`
    - SiLU (Swish): `SiLU(x) = x * sigmoid(x)`
    - `*`: element-wise multiplication
    - Processed by Vector Engine (see the activation sketch below)
  - Down projection: `ffn_out = einsum(activated, w_down)`
    - `(shape![B, s_in, F], shape![F, D]) -> shape![B, s_in, D]`
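As an element-wise host reference for the activation step, here is a minimal Rust sketch of `activated = SiLU(gate) * up` on toy vectors.

```rust
fn main() {
    // Element-wise reference for SwiGLU: activated = SiLU(gate) * up.
    fn silu(x: f32) -> f32 {
        x / (1.0 + (-x).exp()) // equivalent to x * sigmoid(x)
    }

    let gate = vec![0.5f32, -1.0, 2.0];
    let up = vec![1.0f32, 3.0, -0.5];
    let activated: Vec<f32> = gate.iter().zip(&up).map(|(&g, &u)| silu(g) * u).collect();
    println!("{:?}", activated);
}
```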
2.3.3. Residual Connection
FFN residual connection adds the FFN output to the post-attention output.
- Inputs
  - `x_attn: shape![B, s_in, D]`
  - `ffn_out: shape![B, s_in, D]`
- Output
  - `x_l: shape![B, s_in, D]`
- Operation
  - `x_l = x_attn + ffn_out`
  - elementwise addition: Processed by Vector Engine
3. Final Layer Normalization
Final layer normalization is applied after passing through all 80 transformer layers.
- Input
  - `x_L: shape![B, s_in, D]`
- Output
  - `x_final: shape![B, s_in, D]`
- Operation
  - `x_final = RMSNorm(x_L)`
  - RMSNorm: Processed by Vector Engine
4. Language Model Head (Output Layer)
The language model head converts the hidden state at the last token position into vocabulary logits for next-token prediction.
- Input
  - `x_final: shape![B, s_in, D]`
- Weight
  - `w_lm_head: shape![D, V]`
  - Typically `w_lm_head = w_emb.T` (weight tying)
- Output
  - `logits: shape![B, V]`
- Operations
  - Slice: In the prefill phase, only the last token is used
    - `x_last: shape![B, D] = x_final[:, -1, :]`
    - Extract only the hidden state of the last token to predict the next token
    - Process the slice as a simple view operation depending on the shape, or use parallel copy to directly read and move a portion of the data.
  - einsum: Logit computation over the vocabulary
    - `logits = einsum(x_last, w_lm_head)`
    - `(shape![B, D], shape![D, V]) -> shape![B, V]`
5. Sampling
Sampling converts logit values into a probability distribution and selects the next token. This process occurs on the Host, not the TCP.
- Input
  - `logits: shape![B, V]`
  - `temperature: scalar` (sampling temperature parameter, typically 0.7~1.0)
- Output
  - `next_token` (the sampled token index for each batch element)
- Operations
  - Temperature scaling: `logits_scaled = logits / temperature`
    - Higher temperature leads to more diverse token selection; lower temperature leads to more deterministic selection
    - The value `1/temperature` is pre-computed, then applied as a constant multiplication in the Vector Engine
  - Softmax: `probs: shape![B, V] = softmax(logits_scaled)`
    - `softmax(x)_i = exp(x_i) / sum_j(exp(x_j))`
    - Apply softmax over the vocabulary axis (`V`)
  - Token sampling (see the host-side sketch below):
    - Sample the next token index from the probability distribution `probs`
    - Sampling strategies:
      - Greedy: `next_token = argmax_i(probs_i)`
      - Top-k sampling: Sample only from the top k tokens by probability
      - Top-p (nucleus) sampling: Sample from the smallest token set whose cumulative probability exceeds p
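The following host-side Rust sketch walks through temperature scaling, softmax over the vocabulary axis, and greedy selection for one batch element. Real top-k/top-p sampling would additionally draw from a truncated distribution; that is omitted here to keep the sketch dependency-free, and the toy logits are assumptions.

```rust
fn main() {
    // Host-side sampling sketch: temperature scaling, softmax over V, greedy argmax.
    let logits = vec![2.0f32, 0.5, 1.0, -1.0]; // toy vocabulary of V = 4
    let temperature = 0.8f32;

    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

    // Greedy: argmax over the vocabulary axis.
    let next_token = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap();
    println!("probs = {:?}, next_token = {}", probs, next_token);
}
```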
Decode Phase
The decode phase reuses the same operation sequence as prefill (embedding, transformer layers, LM head, sampling), but operates on a single token at a time, reusing cached KV pairs instead of recomputing them. The decode phase generates tokens one at a time autoregressively, continuing until an end token (EOS) is produced or the maximum length is reached. Unlike prefill, decode processes only one token per iteration.
Three characteristics distinguish decode from prefill:
- Single-token input:
s_in = 1(only the most recent output token is used as query) - KV cache reuse: Previously computed Key and Value tensors are reused rather than recomputed
- Autoregressive generation: Each token prediction references all previous tokens via the cache
For each decoding step s = s_prefill + 1, ..., s_max:
1. Embedding Lookup
Embedding lookup converts the previously generated token to its vector representation.
- Input
  - `input: shape![B, 1]`
  - Token index sampled in the previous step
- Weight
  - `w_emb: shape![V, D]`
- Output
  - `x_0: shape![B, 1, D]`
- Operation
  - `x_0 = gather(index: input, table: w_emb)`
  - Processed by TensorDMA
2. Transformer Layers (repeated L times)
Each transformer layer processes the single token through attention and FFN, reusing cached KV pairs.
For each layer l = 1, ..., L, perform the following:
2.1. Input Layer Normalization
Input layer normalization prepares the token for attention computation.
- Input
  - `x_prev: shape![B, 1, D]`
- Output
  - `x_norm: shape![B, 1, D]`
- Operation
  - `x_norm = RMSNorm(x_prev)`
  - Processed by Vector Engine
2.2. Multi-Head Grouped Query Attention (GQA)
Attention in decode phase computes attention between the current token (query) and all cached tokens (keys/values).
2.2.1. QKV Projection
QKV projection computes Query, Key, and Value for the current token only.
- Input
  - `x_norm: shape![B, 1, D]`
- Weights
  - `w_q: shape![D, h_q, d_k]`
  - `w_k: shape![D, h_kv, d_k]`
  - `w_v: shape![D, h_kv, d_k]`
- Outputs
  - `Q: shape![B, 1, h_q, d_k]`
  - `K_new: shape![B, 1, h_kv, d_k]`
  - `V_new: shape![B, 1, h_kv, d_k]`
- Operations
  - `Q = einsum(x_norm, w_q)`
  - `K_new = einsum(x_norm, w_k)`
  - `V_new = einsum(x_norm, w_v)`
  - `(shape![B, 1, D], shape![D, h_q/kv, d_k]) -> shape![B, 1, h_q/kv, d_k]`
2.2.2. Rotary Position Embedding (RoPE)
RoPE applies positional encoding corresponding to the current sequence position.
- Inputs
  - `Q: shape![B, 1, h_q, d_k]`
  - `K_new: shape![B, 1, h_kv, d_k]`
- RoPE table
  - `w_rope: shape![s_max, d_k_prime, 2, 2]`
- Position
  - `position: shape![1]`
  - `position(0) = s` (total sequence length processed so far)
- Outputs
  - `Q_rope: shape![B, h_q, 1, d_k]`
  - `K_rope: shape![B, h_kv, 1, d_k]`
- Operations
  - RoPE table lookup
    - `t_rope: shape![1, d_k_prime, 2, 2] = gather(index: position, table: w_rope)`
  - Apply RoPE
    - Reshape (noop)
      - `Q: shape![B, 1, h_q, d_k] == shape![B, 1, h_q, d_k_prime, f]`
      - `K_new: shape![B, 1, h_kv, d_k] == shape![B, 1, h_kv, d_k_prime, f]`
      - `t_rope: shape![1, d_k_prime, 2, 2] == shape![1, d_k_prime, f, 2]`
    - einsum
      - `Q_rope = einsum(Q, t_rope)`
        - `(shape![B, 1, h_q, d_k_prime, f], shape![1, d_k_prime, f, 2]) -> shape![B, h_q, 1, d_k_prime, 2] == shape![B, h_q, 1, d_k]`
      - `K_rope = einsum(K_new, t_rope)`
        - `(shape![B, 1, h_kv, d_k_prime, f], shape![1, d_k_prime, f, 2]) -> shape![B, h_kv, 1, d_k_prime, 2] == shape![B, h_kv, 1, d_k]`
2.2.3. KV Cache Update
KV cache update appends the new Key and Value to the existing cache for future token generation.
- Inputs
  - `kv_cache_l_K: shape![B, h_kv, s-1, d_k]`
  - `kv_cache_l_V: shape![B, h_kv, s-1, d_k]`
  - `K_rope: shape![B, h_kv, 1, d_k]`
  - `V_new: shape![B, 1, h_kv, d_k]`
  - TODO (youseok.yang): `V_new` has shape `[B, 1, h_kv, d_k]` but the cache expects `[B, h_kv, s, d_k]`. Either correct `V_new`'s shape to `[B, h_kv, 1, d_k]` (consistent with `K_rope` and the cache) or add an explicit reshape/transpose step before the cache update.
- Outputs
  - `kv_cache_l_K: shape![B, h_kv, s, d_k]`
  - `kv_cache_l_V: shape![B, h_kv, s, d_k]`
- Operations
  - Concatenate: Add the new K, V to the existing cache
    - `kv_cache_l_K[s-1] = K_rope`
    - `kv_cache_l_V[s-1] = V_new`
    - Processing differs depending on the concat axis allocation. Data movement between slices: use RoutingEngine/parallel copy; data movement between elements: use parallel copy.
    - Concat on HBM using DMA is also possible.
2.2.4. Grouped Query Attention Computation
Attention computation uses the current Query against the entire KV cache to determine which past tokens are relevant.
2.2.4.1. Attention Scores Computation
Attention scores measure similarity between the current Query and all cached Keys.
- Inputs
  - `Q_rope: shape![B, h_q, 1, d_k]`
  - `kv_cache_l_K: shape![B, h_kv, s, d_k]`
- Output
  - `scores: shape![B, h_q, 1, s]`
- Operations
  - `scores = (Q_rope @ kv_cache_l_K.T) / sqrt(d_k)`
  - Reshape (noop)
    - `Q_rope: shape![B, h_q, 1, d_k] == shape![B, G, h_kv, 1, d_k]`
    - `kv_cache_l_K: shape![B, h_kv, s, d_k] == shape![B, h_kv, s, d_k]`
  - einsum
    - `scores_before_normalize = einsum(Q_rope, kv_cache_l_K)`
      - `(shape![B, G, h_kv, 1, d_k], shape![B, h_kv, s, d_k]) -> shape![B, G, h_kv, 1, s] == shape![B, h_q, 1, s]`
    - The einsum expression shows that `G` was broadcast from `kv_cache_l_K`, and `d_k` was reduced.
  - Normalize
    - `scores = scores_before_normalize / sqrt(d_k)`
    - Processed as a constant multiplication in the Vector Engine
2.2.4.2. Softmax Application
Softmax converts scores to attention weights. Causal mask is unnecessary in decode because the current token only references past tokens.
- Input
  - `scores: shape![B, h_q, 1, s]`
- Output
  - `attn_weights: shape![B, h_q, 1, s]`
- Operation
  - `attn_weights = softmax(scores)`
  - Softmax is applied over the last axis (`s`, i.e., all past tokens)
  - `softmax(x)_i = exp(x_i) / sum_j(exp(x_j))`
  - Processed by Vector Engine
2.2.4.3. Weighted Sum (Attention Output)
Weighted sum combines cached Values according to attention weights to produce the attention output.
- Inputs
  - `attn_weights: shape![B, h_q, 1, s]`
  - `kv_cache_l_V: shape![B, h_kv, s, d_k]`
- Output
  - `attn_output: shape![B, h_q, 1, d_k]`
- Operations
  - Reshape (noop)
    - `attn_weights: shape![B, h_q, 1, s] == shape![B, G, h_kv, 1, s]`
    - `kv_cache_l_V: shape![B, h_kv, s, d_k] == shape![B, h_kv, s, d_k]`
  - einsum
    - `attn_output = einsum(attn_weights, kv_cache_l_V)`
      - `(shape![B, G, h_kv, 1, s], shape![B, h_kv, s, d_k]) -> shape![B, G, h_kv, 1, d_k] == shape![B, h_q, 1, d_k]`
    - The einsum expression shows that `G` was broadcast from `kv_cache_l_V`, and `s` was reduced.
2.2.5. Output Projection
Output projection transforms the attention result back to the hidden dimension.
- Input
  - `attn_output: shape![B, h_q, 1, d_k]`
- Weight
  - `w_o: shape![h_q, d_k, D]`
- Output
  - `attn_out: shape![B, 1, D]`
- Operations
  - `attn_out = einsum(attn_output, w_o)`
    - `(shape![B, h_q, 1, d_k], shape![h_q, d_k, D]) -> shape![B, 1, D]`
2.2.6. Residual Connection
Residual connection combines attention output with layer input.
- Inputs
  - `x_prev: shape![B, 1, D]`
  - `attn_out: shape![B, 1, D]`
- Output
  - `x_attn: shape![B, 1, D]`
- Operation
  - `x_attn = x_prev + attn_out`
  - elementwise addition: Processed by Vector Engine
2.3. Feed-Forward Network (FFN)
FFN in decode phase is identical to prefill, but processes only a single token (sequence length = 1).
2.3.1. Post-Attention Layer Normalization
Post-attention normalization prepares the token for FFN processing.
- Input
  - `x_attn: shape![B, 1, D]`
- Output
  - `x_ffn_norm: shape![B, 1, D]`
- Operation
  - `x_ffn_norm = RMSNorm(x_attn)`
  - Processed by Vector Engine
2.3.2. SwiGLU FFN
SwiGLU applies the gated activation function with three projections.
- Input
  - `x_ffn_norm: shape![B, 1, D]`
- Weights
  - `w_gate: shape![D, F]`
  - `w_up: shape![D, F]`
  - `w_down: shape![F, D]`
- Output
  - `ffn_out: shape![B, 1, D]`
- Operations
  - Gate projection: `gate = einsum(x_ffn_norm, w_gate)`
    - `(shape![B, 1, D], shape![D, F]) -> shape![B, 1, F]`
  - Up projection: `up = einsum(x_ffn_norm, w_up)`
    - `(shape![B, 1, D], shape![D, F]) -> shape![B, 1, F]`
  - SwiGLU activation: `activated = SiLU(gate) * up`
    - Processed by Vector Engine
  - Down projection: `ffn_out = einsum(activated, w_down)`
    - `(shape![B, 1, F], shape![F, D]) -> shape![B, 1, D]`
2.3.3. Residual Connection
FFN residual connection produces the final layer output.
- Inputs
  - `x_attn: shape![B, 1, D]`
  - `ffn_out: shape![B, 1, D]`
- Output
  - `x_l: shape![B, 1, D]`
- Operation
  - `x_l = x_attn + ffn_out`
  - elementwise addition: Processed by Vector Engine
3. Final Layer Normalization
Final layer normalization prepares the output for the language model head.
- Input
  - `x_L: shape![B, 1, D]`
- Output
  - `x_final: shape![B, 1, D]`
- Operation
  - `x_final = RMSNorm(x_L)`
  - Processed by Vector Engine
4. Language Model Head
The language model head projects the hidden state to vocabulary logits. Unlike prefill, no slice operation is needed since there is only a single token.
- Input
  - `x_final: shape![B, 1, D]`
- Weight
  - `w_lm_head: shape![D, V]`
- Output
  - `logits: shape![B, V]`
- Operations
  - Reshape/Squeeze: Remove the sequence dimension
    - `x_squeezed: shape![B, D] = squeeze(x_final)`
  - einsum: Logit computation over the vocabulary
    - `logits = einsum(x_squeezed, w_lm_head)`
    - `(shape![B, D], shape![D, V]) -> shape![B, V]`
5. Sampling
Sampling is identical to Prefill Sampling: temperature scaling, softmax, and token selection, performed on the Host.
6. Termination Conditions
Generation terminates when any of three conditions is met:
- EOS token generated: Sampled token is the End-of-Sequence token
- Maximum length reached:
s >= s_max - User-defined termination conditions: When specific patterns or conditions are met
If generation continues, update s <- s + 1 and return to the next decoding step.
Prefill vs Decode Phase Comparison
The following table summarizes the key differences between prefill and decode phases:
| Characteristic | Prefill Phase | Decode Phase |
|---|---|---|
| Input sequence length | s_in (variable) | 1 (fixed) |
| Parallel processing | s_in tokens processed in parallel | Only 1 token processed |
| KV Cache | Create and store | Read and update |
| Attention computation | Causal mask required | Causal mask not required |
| Attention shape | shape![B, h_q, s_in, s_in] | shape![B, h_q, 1, s] |
| Computation characteristics | Compute-bound (large-scale computation) | Memory-bound (KV cache access) |
| Throughput | High (parallel processing) | Low (sequential processing) |
| Latency | Relatively high | Low (per token) |
Mixture of Experts
Mixture of Experts (MoE) scales model capacity by routing each token to only K of E experts rather than all of them; this sparse activation allows many parameters while keeping inference cost manageable. This example shows how to implement MoE on TCP hardware, focusing on two key challenges: replacing control-flow-based TopK routing with branchless matrix operations, and executing sparse expert computations blockwise.
Background: Basic FFN
To understand MoE, first consider the basic FFN (Feed-Forward Network) in transformer blocks. The following describes FFN with only up/down projection, without gate projection:
- Input
  - `x_ffn_norm: T x D`
- Weights
  - `W_up: D x F` (up projection)
  - `W_down: F x D` (down projection)
- Output
  - `ffn_out: T x D`
- Operations
  - Up projection: `up = einsum(x_ffn_norm, W_up)`
    - `(T x D), (D x F) -> T x F`
  - Down projection: `ffn_out = einsum(up, W_down)`
    - `(T x F), (F x D) -> T x D`
MoE Structure
MoE replaces a single FFN with E independent FFNs called experts.
Each expert has its own weights:
- `W_up[0], W_up[1], ..., W_up[E-1]`
- `W_down[0], W_down[1], ..., W_down[E-1]`
Computing all experts would increase computation by E times.
To avoid this, MoE uses a router to select only the Top-K most suitable experts per token, enabling sparse computation.
Model Parameters
The following arguments define an MoE layer:
- `T`: number of tokens
  - prefill: `T = B * s_in`
  - decode: `T = B`
- `D`: hidden dimension
- `F`: intermediate dimension of the FFN up projection result
- `E`: number of total experts (typically 128)
- `K`: number of experts applied per token
  - llama4: 1, gpt-oss: 4, qwen3: 8
MoE Processing Steps
MoE processing consists of three main stages: routing (selecting which experts to use), sparse expert computation (applying the selected experts), and combining (merging expert outputs with routing weights).
1. Gating (Router) & Top-K Selection
The router calculates a score for each expert for every token, determining which experts should process each token:
- Input
  - `x_norm: T x D`
- Weight
  - `W_router: D x E` (gating network weights)
- Output
  - `scores: T x E`
- Operation
  - `scores = einsum(x_norm, W_router)`
    - `(T x D), (D x E) -> T x E`
  - Calculates the score (logit) for the `E` experts per token
2. Top-K Selection
This step selects the Top-K Experts based on router scores and calculates the weight for each selected Expert:
- Input
  - `scores: T x E`
- Outputs
  - `topk_indices: T x K` (selected Expert ID per token)
  - `routing_weights: T x K` (weight of each selected Expert per token)
- Operations
  - Top-K selection: `raw_weights, topk_indices = topk(scores, K)`
    - Extract the `K` Expert indices and scores with the highest scores per token
  - Softmax normalization: `routing_weights = softmax(raw_weights)`
    - Convert the selected `K` scores to probability values (sum is 1 per token)
    - `softmax(x)[i] = exp(x[i]) / sum(exp(x[j]) for j in 0..K)`

The output for each token t consists of:
- `topk_indices[t, :]`: `K` Expert IDs (`0 <= e < E`)
- `routing_weights[t, :]`: weights of those Experts (sum is 1)
3. Sparse Expert Computation
Only selected Experts perform computation, making this stage sparse.
A total of T * K Expert calls occur, but each Expert only computes for the tokens that selected it.
For each token t in [0, T-1] and selected Expert k in [0, K-1]:
- Selected Expert ID: `e = topk_indices[t, k]`
- Input
  - `x_norm[t]: D` (input of token `t`)
- Weights (weights of Expert `e`)
  - `W_up[e]: D x F`
  - `W_down[e]: F x D`
- Output
  - `y[t, k]: D` (k-th Expert output of token `t`)
- Operations
  - Up projection: `up = einsum(x_norm[t], W_up[e])`
    - `D, (D x F) -> F`
  - Down projection: `y[t, k] = einsum(up, W_down[e])`
    - `F, (F x D) -> D`
The results for all (t, k) pairs are collected into y_experts: T x K x D.
4. Weighted Sum (Combine)
The final step combines the K Expert outputs using the routing weights calculated earlier:
- Inputs
  - `y_experts: T x K x D`
  - `routing_weights: T x K` (weight of each Expert)
- Output
  - `ffn_out: T x D`
- Operations
  - `ffn_out = einsum(y_experts, routing_weights)`
The result is that each token receives the weighted average output of its selected K Experts.
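As a host-side reference for the combine einsum, here is a tiny Rust sketch that computes each token's output as the routing-weighted sum of its `K` expert outputs. The sizes and toy values are assumptions for illustration.

```rust
fn main() {
    // Combine: moe_out[t] = sum_k routing_weights[t][k] * y_experts[t][k][:].
    let (t_len, k_sel, d) = (2usize, 2usize, 3usize);
    // y_experts: T x K x D, routing_weights: T x K (each row sums to 1).
    let y_experts = vec![
        vec![vec![1.0f32, 0.0, 0.0], vec![0.0, 1.0, 0.0]], // token 0
        vec![vec![2.0f32, 2.0, 2.0], vec![4.0, 0.0, 4.0]], // token 1
    ];
    let routing_weights = vec![vec![0.75f32, 0.25], vec![0.5f32, 0.5]];

    let mut moe_out = vec![vec![0.0f32; d]; t_len];
    for t in 0..t_len {
        for k in 0..k_sel {
            for dim in 0..d {
                moe_out[t][dim] += routing_weights[t][k] * y_experts[t][k][dim];
            }
        }
    }
    println!("{:?}", moe_out); // [[0.75, 0.25, 0.0], [3.0, 1.0, 3.0]]
}
```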
MoE Implementation on TCP
Implementing MoE efficiently on TCP requires bridging the gap between the model’s logical structure and hardware constraints. This section describes the techniques needed to achieve high performance.
1. Overview and Design Philosophy
1.1. Bridging Logical and Physical Execution
Two fundamental challenges arise when implementing MoE on TCP:
- Challenge 1: Conflict between control flow and parallel structure
- Problem: General
Top-Kalgorithms use branch statements where the execution path varies depending on data values. Such branch statements cause performance degradation in SIMT-based accelerators that process thousands of elements with a single instruction. - Solution: Completely removing control flow and using Branchless
Top-Ktechnique with matrix operations and bit manipulation is essential.
- Problem: General
- Challenge 2: Gap between logical Routing and physical execution
- Problem: Logically, MoE is a process where each token finds the Expert that suits it (Token-centric). However, if implemented as is, memory access becomes irregular and the number of tokens to process per Expert changes dynamically, reducing TCP compiler efficiency.
- Solution: The perspective must be shifted to a method where the Expert becomes the subject and collects tokens (Expert-centric).
1.2. Core Techniques for TCP Implementation
Two core techniques address these challenges:
- Branchless TopK: Performs routing via matrix operations only, eliminating all control flow
- Blockwise execution: Processes only selected Experts with data packed in fixed-size Block units
The following sections describe each technique in detail.
2. Branchless TopK
Branchless TopK replaces control-flow-based sorting with pure matrix operations.
This approach consists of three stages: bit packing to combine score and index, parallel ranking to determine order, and filtering to extract the top K results.
2.1. Bit Packing (Combining Score and Index)
The Vector Engine pipeline operates on all 256 slices in lockstep, so any operation whose address or control path depends on runtime data values must be replaced with a fixed sequence of matrix operations. Bit packing bundles score and index into a single value so the Expert ID is preserved when scores are reordered during sorting:
- Inputs
  - `scores: T x E`
  - `Index_expert: E`
    - `Index_expert(e) = e` where `e = 0, 1, 2, ..., E - 1`
- Outputs
  - `Packed_Value: T x E`
    - Tensor with (score, index) packed.
  - `Packed_Value_cmp: T x E`
    - Tensor with (score, index) packed, preprocessed so that score magnitude can be compared using integer comparison.
- Operations
  - Packing
    - Place the Expert score (e.g., `bf16`) in the upper bits and the Expert index (e.g., `int16`) in the lower bits to create a single 32-bit integer (or float).
    - `Packed_Value_unprocessed = (Score << 16) | Index`
    - Processed in Vector Engine.
  - Comparison trick (see the host-side sketch below)
    - This preprocessing enables magnitude comparison of score values using simple integer comparison.
    - Bit-flipping preprocessing solves the problem of negative values having their ordering reversed when float values are compared as integers. This enables accurate Top-K selection with only integer comparators.
    - `Packed_Value_cmp = if Packed_Value >= 0 { Packed_Value } else { Packed_Value ^ 0x7fff0000 }`
2.2. Parallel Ranking (All-to-All Comparison)
Parallel ranking determines the order of all experts simultaneously instead of sequential sorting.
Although this requires E x E comparisons, TCP efficiency remains high because only matrix operations are used without control flow:
- Input
  - `Packed_Value_cmp: T x E`
    - 32-bit packed tensor with the comparison trick applied.
- Output
  - `Rank: T x E`
    - Rank of each Expert (0-based). Higher scores are closer to 0.
- Operations (see the ranking sketch below)
  - Broadcast & Compare
    - Replicate (tile) `Packed_Value_cmp` along the `E` axis to expand to a `T x E x E` shape, then compare magnitude relationships for all Expert pairs `(i, j)`.
    - `Compare[t, i, j] = 1 if Packed_Value_cmp[t, j] > Packed_Value_cmp[t, i] else 0`
    - Meaning: "Is Expert `j`'s score higher than Expert `i`'s?"
  - Rank calculation (ReduceSum)
    - Sum along the `E` (comparison target) axis to calculate the rank.
    - `Rank[t, i] = sum(Compare[t, i, j] for j in 0..E)`
    - Meaning: "The total number of Experts with a higher score than mine" becomes my rank.
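The following Rust sketch replays the all-to-all ranking for a single token on the host: each expert's rank is the number of experts whose packed value is strictly greater, and Top-K membership is simply `rank < K`. The packed values are arbitrary stand-ins for `Packed_Value_cmp`.

```rust
fn main() {
    // Branchless ranking for one token.
    let packed_cmp: [i32; 5] = [40, 10, 55, 10, -3]; // stand-ins for Packed_Value_cmp
    let e = packed_cmp.len();

    let mut rank = vec![0u32; e];
    for i in 0..e {
        for j in 0..e {
            // Compare[t, i, j] contributes 1 when expert j outranks expert i.
            rank[i] += (packed_cmp[j] > packed_cmp[i]) as u32;
        }
    }
    println!("{:?}", rank); // [1, 2, 0, 2, 4]

    let k = 2;
    let topk: Vec<usize> = (0..e).filter(|&i| rank[i] < k).collect();
    println!("top-{k} experts: {:?}", topk); // [0, 2]
}
```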
2.3. Filtering & Unpacking
Filtering extracts the top K entries based on rank, then unpacking separates the packed scores and indices:
- Inputs
  - `Rank: T x E`
  - `Packed_Value: T x E`
    - Note: The original packed value, before the comparison trick was applied, must be used so that the accurate score/index can be restored later.
- Outputs
  - `TopK_Indices: T x K`
  - `TopK_Scores: T x K`
  - `routing_weights: T x K` (weights for the K selected experts per token)
- Operations
  - Filtering (FilterCompaction)
    - Only elements satisfying the Top-K condition (`Rank < K`) are kept.
    - `Mask[t, i] = 1 if Rank[t, i] < K else 0`
    - Only `Packed_Value` entries at positions where the mask is true are collected and compressed to `T x K` size.
    - Result: `Selected_Packed: T x K`
    - Uses the filter function of the Vector Engine.
  - Unpacking
    - Restore scores and indices from the selected 32-bit values through bit operations.
    - Score extraction: `TopK_Scores = Selected_Packed >> 16` (then reinterpreted as the `bf16` type)
    - Index extraction: `TopK_Indices = Selected_Packed & 0xffff`
  - Softmax normalization
    - Softmax is applied to the extracted Top-K scores to calculate the final weights, which are used in the Combine stage later.
    - `routing_weights[t, k] = exp(TopK_Scores[t, k]) / sum(exp(TopK_Scores[t, j]) for j in 0..K)`
3. Blockwise Execution
Blockwise execution physically rearranges data based on Top-K routing decisions while satisfying TCP’s static shape constraints. This section describes how to handle dynamic token-to-expert assignments efficiently.
3.1. Problem: Dynamic Shape & Memory Explosion
The core challenge is that the number of tokens L_e assigned per Expert varies dynamically depending on the input.
In the worst case, if all tokens are concentrated on a specific Expert, L_e ~ T.
Two approaches address this challenge:
- Naive solution: Allocating a buffer of maximum size `T` for all Experts requires memory of size `E x T x D`, most of which is wasted as padding.
- Blockwise solution: Instead of variable lengths `L_e`, manage data in fixed-size Block (`B`) units to keep memory usage at approximately the `T x K` level.
3.2. Grid Size Calculation
Grid size determines how many blocks are needed to process all tokens.
Tokens for the same Expert are grouped into blocks of B tokens, enabling blockwise computation with a single expert loaded.
The total number of blocks needed (Grid Size, G) is calculated as the sum of blocks required per expert:
- Number of blocks allocated to Expert `e`
  - Number of tokens allocated to `e`: `Count_e`
  - Number of blocks: `ceil(Count_e / B)`
- `G = sum(ceil(Count_e / B) for e in 0..E)`
The compiler calculates the worst-case G value and allocates memory space.
At runtime, sparse operations skip execution for empty Grids.
In the worst case where all Experts include a grid containing only one token, (T*K - E) / B + E Grids are required.
3.3. Index Generation (Cumsum-based Address Calculation)
Index generation computes the destination address for each token using cumsum-based parallel address calculation.
(Cumsum is implemented in the Vector Engine using branch logging; see Section 4 for the hardware implementation.)
This approach avoids loops and enables efficient parallel execution:
- Inputs
  - `TopK_Indices: T x K`
  - `Expert_Indices: E = [0, 1, ..., E-1]`
  - `Block_Range: G = [0, 1, ..., G-1]` (sequence of maximum block count, e.g., 32)
- Outputs
  - `Scatter_Idx: T x K` (final 1D address where each token will move)
  - `Expert_IDs: G` (Expert number each Block is responsible for)
- Operations (a host-side replay follows this list)
  - Mask generation (one-hot)
    - Convert indices to a computable mask form.
    - `Expert_Mask: T x K x E = one_hot(TopK_Indices, depth=E)`
  - Histogram
    - Sum the masks to count the number of tokens allocated per Expert.
    - `Count: E = reduce_sum(Expert_Mask, axis: (T, K))`
  - Block calculation
    - Calculate the number of Blocks needed for each Expert.
    - `Num_Blocks: E = ceil(Count / B)`
  - Global offset calculation
    - Through cumsum, obtain the block start index where each Expert starts in the entire Grid (`G`).
    - `Global_Offset: E = cumsum(Num_Blocks) - Num_Blocks`
  - Local offset calculation
    - Using the mask and cumsum, calculate each token's position in its Expert's queue.
    - `Cumsum_Mask: T x K x E = cumsum(Expert_Mask, axis: (T, K))`
    - `Token_Rank: T x K = gather(Cumsum_Mask, index: TopK_Indices)`
    - `Local_Offset: T x K = Token_Rank - 1`
  - Expert ID expansion
    - `Diff: E x G = Num_Blocks - Block_Range`
    - `Grid: E x G`
      - `Grid(e, i) = if Diff(e, i) > 0 { Expert_Indices(e) } else { -1 }`
    - `Expert_IDs: G = filter_compaction(Grid, condition=(Grid >= 0))`
    - Example:
      - expert 0: 2 blocks, expert 1: 3 blocks, expert 3: 3 blocks
      - `Diff[0] = [2, 1, 0, -1, -2, ...]`, `Diff[1] = [3, 2, 1, 0, -1, ...]`: has positive entries equal to the number of allocated blocks per expert.
      - `Grid[0] = [0, 0, -1, -1, ...]`, `Grid[1] = [1, 1, 1, -1, -1, ...]`: repeats the expert id as many times as that expert's allocated blocks.
      - `Expert_IDs = [0, 0, 1, 1, 1, 3, 3, 3]`: Filter only values >= 0 (expert ids) from `Grid`.
  - Address synthesis
    - `Scatter_Idx = (Global_Offset * B) + Local_Offset`
    - Calculates which block, and which position within the block, each of the `T` tokens corresponds to. `Scatter_Idx in [0, G * B)`
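The Rust sketch below replays this cumsum-based address calculation on the host for a tiny case. The expert counts, block size, and assignment list are toy values, and the `Global_Offset` lookup per slot stands in for the gather over `TopK_Indices`.

```rust
fn main() {
    // E = 4 experts, block size B = 2, flattened (token, k) expert assignments.
    let e = 4usize;
    let b = 2usize;
    let topk_indices = [0usize, 1, 1, 3, 1, 3, 0, 3];

    // Histogram: tokens per expert.
    let mut count = vec![0usize; e];
    for &ex in &topk_indices {
        count[ex] += 1;
    }
    // Blocks per expert and exclusive cumsum -> block start offsets.
    let num_blocks: Vec<usize> = count.iter().map(|c| (c + b - 1) / b).collect();
    let mut global_offset = vec![0usize; e];
    for i in 1..e {
        global_offset[i] = global_offset[i - 1] + num_blocks[i - 1];
    }

    // Local offset: running position of each slot within its expert's queue.
    let mut seen = vec![0usize; e];
    let scatter_idx: Vec<usize> = topk_indices
        .iter()
        .map(|&ex| {
            let addr = global_offset[ex] * b + seen[ex];
            seen[ex] += 1;
            addr
        })
        .collect();

    // Expert ID per block (the expert a whole block belongs to).
    let expert_ids: Vec<usize> = (0..e)
        .flat_map(|ex| std::iter::repeat(ex).take(num_blocks[ex]))
        .collect();

    println!("count       = {:?}", count);
    println!("num_blocks  = {:?}", num_blocks);
    println!("global_off  = {:?}", global_offset);
    println!("scatter_idx = {:?}", scatter_idx);
    println!("expert_ids  = {:?}", expert_ids);
}
```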
3.4. Dispatch (Blockwise Scatter)
Dispatch physically rearranges tokens using the computed addresses, placing each token in its designated block position:
- Inputs
  - `x_norm: T x D` (input after attention and norm)
  - `Scatter_Idx: T x K` (final 1D address where each token will move)
- Output
  - `x_blocked: G x B x D` (rearranged blocked tensor)
- Operation
  - Scatter
    - Place the tokens of `x_norm` at their `Scatter_Idx` positions.
3.5. Sparse Computation (Weight Gather)
Sparse computation applies Expert weights to the sorted Blocks. The key insight is that weights are gathered only for Experts that have assigned tokens:
- Inputs
  - `x_blocked: G x B x D`
  - `Expert_IDs: G` (Expert number each Block is responsible for)
- Output
  - `y_blocked: G x B x D`
- Operations
  - Weight gather
    - Using `Expert_IDs` as indices, only the necessary weights are fetched.
    - `W_gathered_up: G x D x F = gather(W_up, index: Expert_IDs)`
    - `W_gathered_down: G x F x D = gather(W_down, index: Expert_IDs)`
  - Sparse MLP
    - Operations are performed only for valid Blocks (`G`).
    - `up: G x B x F = einsum(x_blocked, W_gathered_up)`
    - `y_blocked: G x B x D = einsum(up, W_gathered_down)`
3.6. Combine (Weighted Sum)
Combine restores results to original token order and applies Routing probabilities. This is the final step that produces the MoE layer output:
- Inputs
  - `y_blocked: G x B x D`
  - `Scatter_Idx: T x K`
  - `routing_weights: T x K`
- Output
  - `moe_out: T x D` (final MoE layer output)
- Operations
  - Gather
    - Using `Scatter_Idx` in reverse, results are fetched from `y_blocked` in the original token order.
    - `y_restored: T x K x D = gather(y_blocked, index: Scatter_Idx)`
  - Weighted sum
    - The final output is obtained by multiplying with the `routing_weights` from the Top-K process and summing over `K`.
    - `y_weighted: T x K x D = einsum(y_restored, routing_weights)`
    - `moe_out: T x D = reduce_sum(y_weighted, axis: K)`
4. Cumsum Implementation on TCP
Cumsum is a key primitive used in index generation.
On TCP, it is implemented in Vector Engine using the following approach:
- Create a static branch logger: For the axis (of size n) over which the sum is computed,
branch(i) = if i == 0 {
0
} else if i < n - 1 {
1
} else {
2 // i == n - 1
}
- Configure the Vector Engine as follows:
add %mainstream, OperandRead(branch = 1, 2)
WriteOperand(branch = 0, 1)