Introduction
FuriosaAI’s Tensor Contraction Processor (TCP) is a massively parallel AI accelerator targeting inference workloads. High-level frameworks such as PyTorch and XLA abstract away memory layouts and hardware scheduling, but give programmers no control over either. Low-level kernel APIs give fine-grained control, but require reasoning in bytes and hardware addresses rather than tensors. TCP’s Virtual Instruction Set Architecture (Virtual ISA) bridges this gap: it lets programmers think in terms of tensors while directly managing memory allocation and tensor unit scheduling. This manual explains TCP programming through the Virtual ISA.
The manual walks through concrete examples, targeting two audiences: programmers writing Virtual ISA directly and compiler developers generating it. Basic Rust familiarity is assumed; see the language manual if needed.
Warning
Alpha Test Build: Experimental Software
This software is an early, experimental, and incomplete build intended strictly for technical evaluation and internal testing.
Before using this software for any production work, critical tasks, or for important data, you must consult with Furiosa engineers.
Your feedback is vital to our development. Please provide it.
Installation
Install two dependencies:
- Rust: Follow the official guide.
- Furiosa SDK: Follow the SDK documentation.
Your First Program
Create a new project:
cargo new --bin tcp-my-project
cd tcp-my-project
cargo add furiosa-visa-std tokio rand
Add rust-toolchain.toml:
[toolchain]
channel = "nightly-2025-12-12"
components = ["rustfmt", "clippy"]
Write main.rs:
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*; // provided by the Furiosa SDK
// Declare axis sizes
axes![A = 8, B = 512];
/// The main function running in host
#[tokio::main]
async fn main() {
// Acquire exclusive access to the TCP device
let mut ctx = Context::acquire();
// TCP has three memory levels:
// - Host: system memory
// - HBM (High-Bandwidth Memory): device's main memory
// - SRAM (on-chip scratchpad): the primary SRAM tier is called DM (Data Memory)
//
// Data flows: Host → HBM → DM → compute → DM → HBM → Host.
//
// Two DMA engines move data between these levels:
// - `ctx.pdma` (PCIe DMA): transfers between Host and HBM
// - `ctx.tdma` (Tensor DMA): transfers between HBM and DM
// Create tensor on host
// Tensors are parameterized by element type and mapping
// The mapping `m![A, B]` specifies `A` as the major axis and `B` as the minor axis
let mut rng = SmallRng::seed_from_u64(42);
let host: HostTensor<i8, m![A, B]> = HostTensor::rand(&mut rng);
// Transfer to device HBM using PCIe DMA engine
// HBM tensor has two dimensions: m![A] for chip and m![B] for intra-chip address
let hbm: HbmTensor<i8, m![A], m![B]> = host.to_hbm(&mut ctx.pdma, 0x1000).await;
// Launch kernel on device
// Host continues while kernel runs asynchronously, but the kernel synchronously occupies the device
launch(kernel, (&mut ctx, &hbm))
// Host waits for the asynchronous execution of the kernel to finish
.await;
}
#[device(chip = 1)] // Running on a single chip
fn kernel(ctx: &mut Context, hbm: &HbmTensor<i8, m![A], m![B]>) {
// Move to DM (Data Memory) in on-chip SRAM using Tensor DMA engine
let dm = hbm.to_dm::<m![1], m![A], m![B]>(&mut ctx.tdma, 0);
// ... perform computations ...
}
Build and Test
TCP supports two execution environments, ordered from fastest iteration to production use:
# 1. CPUs (standalone Rust)
cargo build # Add --release for optimized builds, same below
cargo test
# 2. Real TCP devices
cargo furiosa-opt build
cargo furiosa-opt test
Development Tools
The TCP Software Toolchain (cargo furiosa-opt) provides utilities for developing, testing, and optimizing Virtual ISA programs on Furiosa chips.
It complements the Furiosa SDK’s compiler by giving developers fine-grained control over program behavior, whether the programmer writes Virtual ISA by hand or a compiler generates it.
The toolchain consists of four components:
- Compiler: Translates Virtual ISA into executable code for the chip.
- Interpreter: Executes Virtual ISA as native Rust programs for software simulation and debugging.
- Language Server: Enables IDE features (autocompletion, diagnostics, navigation) via Rust’s language server infrastructure.
- Schedule Viewer: Visualizes the execution timeline to help identify performance bottlenecks.
Book Organization
The rest of this book is organized in the following chapters:
- Hello, TCP!: How TCP programming works, introduced through worked examples covering element-wise operations and tensor contractions.
- Mapping Tensors: How logical tensors map to physical memory: axis layout, stride, padding, and tiling.
- Moving Tensors: How data moves between memory tiers (HBM, DM) and the Tensor Unit via Fetch, Commit, and DMA engines.
- Computing Tensors: How the Tensor Unit pipeline (Switching, Collect, Contraction, Vector, Cast, Transpose) transforms data each cycle.
- Scheduling: How to control the order and concurrency of operations across contexts.
- Kernel Examples: End-to-end examples showing how mapping, movement, computation, and scheduling combine into real kernels.
License
This documentation and the entire furiosa-opt repository are licensed under the Apache License Version 2.0.
Hello, TCP!
This chapter introduces TCP programming through worked examples. Each example builds a mental model of how computation maps to hardware, making the rest of this book easier to follow. The first two examples cover element-wise operations; the remaining three cover tensor contractions (dot product, GEMV, and GEMM), each adding one new hardware concept. Two additional examples (Blocked GEMM and Flash Attention) are outlined as stubs.
Mathematical Background
This section defines the two mathematical concepts that TCP is built to accelerate: tensors and their contractions.
Tensor
A tensor is a mapping from each tensor index to its corresponding value.
To understand this, we must first define a tensor’s shape.
Unlike other libraries where axis order encodes meaning (e.g., NumPy’s ndarray), we define a tensor’s shape as an unordered set of named axes.
The shapes \(\{\texttt{N} = 4, \texttt{C} = 3\}\) and \(\{\texttt{C} = 3, \texttt{N} = 4\}\) identify the same tensor; axis names carry the meaning, not the position.
A tensor index is formed by specifying an index value for each axis. For a tensor with shape \(\{\texttt{N} = 4, \texttt{C} = 3\}\), the valid indices are: \(\{\texttt{N}: 0, \texttt{C}: 0\}\), \(\{\texttt{N}: 0, \texttt{C}: 1\}\), \(\{\texttt{N}: 0, \texttt{C}: 2\}\), \(\{\texttt{N}: 1, \texttt{C}: 0\}\), etc.
A tensor can behave like a multi-dimensional array of numbers. For example:
- 0D Tensor (Scalar): a single number like \(5.2\)
- 1D Tensor (Vector): a sequence like \([1, 2, 3]\) with one axis
- 2D Tensor (Matrix): a \(2 \times 4\) grid with two axes
- 4D Tensor: a batch of RGB images with shape \(\{\texttt{N} = 4, \texttt{C} = 3, \texttt{H} = 256, \texttt{W} = 512\}\)
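In Virtual ISA, tensor indices are written with the `i![...]` constructor exported by the prelude (it is documented in the Mapping Tensors chapter). A minimal sketch, assuming the \(\{\texttt{N} = 4, \texttt{C} = 3\}\) shape above:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
// Declare the axes, then spell out two of the twelve valid indices.
axes![N = 4, C = 3];
let first = i![N: 0, C: 0];
let next = i![N: 0, C: 1];
}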
Tensor Contraction
A tensor contraction is an operation that takes two tensors, pairs up specific axes that appear in both inputs, and then sums the products of their elements along those axes.
Einsum notation is a compact way to write contractions: each input tensor is listed by its axis labels, and output axes follow the → arrow; any axis that appears in both inputs but not in the output is summed over.
| Operation | Formula | Einsum notation |
|---|---|---|
| Dot product | \(\sum_i x_i y_i\) | \(I, I \rightarrow 1\) |
| GEMV | \(y_i = \sum_j A_{ij} x_j\) | \(IJ, J \rightarrow I\) |
| GEMM | \(C_{ij} = \sum_k A_{ik} B_{kj}\) | \(IK, KJ \rightarrow IJ\) |
Every contraction can be decomposed into three steps: Broadcast, Multiply, and Reduce.
| Step | Dot Product (\(I, I \rightarrow 1\)) | GEMV (\(IJ, J \rightarrow I\)) | GEMM (\(IK, KJ \rightarrow IJ\)) |
|---|---|---|---|
| Broadcast | none (axes match) | \(x\) broadcasts across \(I\) | \(A\) across \(J\); \(B\) across \(I\) |
| Multiply | \(x_i \cdot y_i\) | \(A_{ij} \cdot x_j\) | \(A_{ik} \cdot B_{kj}\) |
| Reduce | \(\sum_i x_i y_i\) | \(y_i = \sum_j A_{ij} x_j\) | \(C_{ij} = \sum_k A_{ik} B_{kj}\) |
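As a plain-Rust reference (not Virtual ISA code), the three steps for GEMV can be sketched as an ordinary iterator chain; the helper name below is hypothetical and only meant to make the decomposition concrete:
fn gemv_reference(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| {
            row.iter()
                .zip(x.iter())                 // broadcast: the same x is reused for every row i
                .map(|(a_ij, x_j)| a_ij * x_j) // multiply
                .sum::<f32>()                  // reduce over j
        })
        .collect()
}
fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let x = vec![10.0, 100.0];
    assert_eq!(gemv_reference(&a, &x), vec![210.0, 430.0]);
}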
Tensor Contraction Processor
This section covers the hardware concepts needed to understand the examples: the processing unit hierarchy, memory tiers, tensor mapping types, and execution contexts.
Processing Units
The TCP architecture accelerates these contractions by streaming tensor data through a hierarchy of parallel processing units.
| Level | Count (RNGD) | Role |
|---|---|---|
| Chip | (system-dependent) | Top-level unit; holds HBM |
| Cluster | 2 per chip | Groups 256 slices |
| Slice | 256 per cluster | Runs one Tensor Unit: a Fetch → Switching → Collect → Contraction → Vector → Cast → Transpose → Commit pipeline |
| Row | 8 per slice | One row of the Contraction Engine’s MAC (multiply-accumulate) array |
The Switch Engine connects slices, enabling data redistribution across the slice array.
Memory
| Type | Location | Capacity (RNGD) | Role |
|---|---|---|---|
HbmTensor | On-package | 48 GB, 1.5 TB/s | Long-term weight and activation storage |
DmTensor | On-chip SRAM | 256 MB total; 512 KB/slice | Primary working memory for computations |
TrfTensor | On-chip SRAM | 8 KB × 8 MAC rows / slice | Weight register file for Contraction Engine |
VrfTensor | On-chip SRAM | 8 KB / slice | Operand register file for Vector Engine |
Most alignment and capacity constraints in this book derive from the counts and capacities in these tables.
Tensor Mapping
TCP’s Virtual ISA exposes the hardware hierarchy through its type system.
Each tensor type encodes the element type and how each logical axis distributes across the hardware hierarchy.
For example, DmTensor<bf16, m![1], m![1 # 2], m![A / 8 # 256], m![A % 8]> (with axes![A = 2048]) represents a bf16 tensor on one chip, one of two clusters, distributed across 256 slices with 8 elements per slice.
TCP also introduces two kernel-specific parameters: Time indexes pipeline iterations; Packet indexes elements within each iteration.
The mapping expression (m![] macro and its operators) is used to express this distribution:
- `/` splits by stride: `A / 8` gives 2048 / 8 = 256 indices, the “which slice” index.
- `%` gives the inner count: `A % 8` gives the 8 indices for the elements each slice holds.
- `#` pads to the hardware unit count: `# 256` makes the slice count explicit.
Together, each element of A is mapped to a well-defined position within exactly one slice.
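The same type is easier to read when each hardware level is given a named alias; a minimal sketch (the `Chip`, `Cluster`, and `Slice` aliases reappear in the examples later in this chapter; `PerSlice` and `Example` are just illustrative names):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
type Chip = m![1];            // one chip
type Cluster = m![1 # 2];     // one active cluster, padded to the 2 clusters per chip
type Slice = m![A / 8 # 256]; // 2048 / 8 = 256 "which slice" indices
type PerSlice = m![A % 8];    // the 8 elements each slice holds
type Example = DmTensor<bf16, Chip, Cluster, Slice, PerSlice>;
}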
Execution Contexts
Every device kernel has two execution contexts running concurrently on separate hardware resources: ctx.main and ctx.sub.
main runs the primary computation; sub runs a concurrent pipeline, typically used to prefetch operands into TRF or VRF while main computes.
If main needs operands that sub is still fetching, main automatically waits for sub’s execution to ensure synchronization.
Because both contexts share the flat on-chip SRAM, the programmer must explicitly assign DM addresses (e.g. the addr argument in .to_dm(), .commit()) to prevent tensors from overlapping.
Addresses must not collide, but they can be non-contiguous.
Examples
The first two examples cover element-wise operations by using the Vector Engine; the remaining three cover tensor contractions by using the Contraction Engine.
Constant Addition
The first kernel takes a vector of integers and adds the constant 1 to each element.
It uses one chip, one of two clusters, and all 256 slices in that cluster, with one 8-element group per slice.
The Vector Engine processes integers using fixed-point operations, so we use vector_fxp(FxpBinaryOp::AddFxp, 1) to add the constant value.
flowchart TB
HOST[Host] <-->|PCIe DMA| HBM[(HBM)]
HBM <-->|Tensor DMA| DM[(DM)]
subgraph TU[Tensor Unit]
direction TB
FE[Fetch] --> SW["Switch (Forward)"] --> CO[Collect] --> VE["Vector (AddFxp +1)"] --> CM[Commit]
end
DM -->|stream| FE
CM -->|stream| DM
This example demonstrates the full Tensor Unit pipeline.
to_dm moves data from HBM to DM, splitting the flat tensor across 256 slices.
The begin → fetch → collect → vector_init → vector_intra_slice_branch → vector_fxp → vector_final → commit chain processes each slice in one pass, and vector_fxp(FxpBinaryOp::AddFxp, 1) adds the integer constant 1 to every element in parallel across all 256 slices.
BranchMode::Unconditional configures the pipeline to execute on every cycle.
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![A = 2048]; // declare named axis A with size 2048; used in all tensor types below
type Chip = m![1];
type Cluster = m![1 # 2]; // 1 active cluster; hardware has 2 per chip
type Slice = m![A / 8 # 256]; // distribute A across 256 slices, 8 elements each
#[tokio::main]
async fn main() {
let mut ctx = Context::acquire();
// Create input on the host and transfer to HBM
let mut rng = SmallRng::seed_from_u64(42);
let input = HostTensor::<i32, m![A]>::rand(&mut rng);
let in_hbm = input.to_hbm(&mut ctx.pdma, 0).await;
// Launch the device kernel
let out_hbm = launch(kernel, (&mut ctx, &in_hbm)).await;
// Transfer result back to host
let _out = out_hbm.to_host::<m![A]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(ctx: &mut Context, input: &HbmTensor<i32, Chip, m![A]>) -> HbmTensor<i32, Chip, m![A]> {
// HBM → DM: split 2048 elements across 256 slices (8 elements per slice)
let dm = input.to_dm::<Cluster, Slice, m![A % 8]>(&mut ctx.tdma, 0);
let result = ctx
.main
.begin(dm.view())
// Fetch: stream 8-element packets from DM into the pipeline
.fetch::<i32, m![1], m![A % 8]>()
// Collect: normalize the stream into 32-byte flits (8 × i32)
.collect::<m![1], m![A % 8]>()
// Vector Engine: enter pipeline and arm unconditionally
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
// Add the scalar constant 1 to every element
.vector_fxp(FxpBinaryOp::AddFxp, 1)
// Exit VE and commit: write results back to DM
.vector_final()
.commit::<m![A % 8]>(1 << 12);
// DM → HBM
result.to_hbm(&mut ctx.tdma, 1 << 28)
}
Elementwise Multiplication
The second kernel multiplies two same-shape vectors element-wise.
Because the Vector Engine’s fixed-point multiply operation (`FxpBinaryOp::MulInt`) takes a second operand per element, that operand must come from the VRF (Vector Register File).
The VRF is a small per-slice register file that the Vector Engine reads every cycle; it is loaded in the sub context while the main computation streams.
flowchart TB
LHS_HBM[(lhs: HBM)] -->|Tensor DMA| LHS_DM[(lhs: DM)]
RHS_HBM[(rhs: HBM)] -->|Tensor DMA| RHS_DM[(rhs: DM)]
subgraph sub[sub context]
direction LR
sFE[Fetch] --> sSW[Switch] --> sCO[Collect]
end
subgraph main[main context]
direction LR
mFE[Fetch] --> mSW[Switch] --> mCO[Collect] --> VE["Vector (MulInt)"] --> CM[Commit]
end
RHS_DM --> sFE
LHS_DM --> mFE
sCO --> VRF[(VRF)]
VRF --> VE
CM --> OUT_DM[(result: DM)]
OUT_DM -->|Tensor DMA| OUT_HBM[(HBM)]
This example adds the VRF and the sub context.
rhs_dm is allocated at a different base address (1 << 12) to avoid overlapping with lhs_dm.
The sub context loads rhs_dm into the VRF through the Fetch → Switch → Collect → .to_vrf(0) pipeline.
The main context then streams lhs_dm and multiplies each element by its VRF counterpart using MulInt; the hardware runs both contexts concurrently where possible.
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
type Slice = m![A / 8 # 256];
#[tokio::main]
async fn main() {
let mut ctx = Context::acquire();
let mut rng = SmallRng::seed_from_u64(42);
let lhs = HostTensor::<i32, m![A]>::rand(&mut rng);
let rhs = HostTensor::<i32, m![A]>::rand(&mut rng);
let lhs_hbm = lhs.to_hbm(&mut ctx.pdma, 0).await;
let rhs_hbm = rhs.to_hbm(&mut ctx.pdma, 1 << 28).await;
let out_hbm = launch(kernel, (&mut ctx, &lhs_hbm, &rhs_hbm)).await;
let _out = out_hbm.to_host::<m![A]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
lhs: &HbmTensor<i32, Chip, m![A]>,
rhs: &HbmTensor<i32, Chip, m![A]>,
) -> HbmTensor<i32, Chip, m![A]> {
// Move both operands from HBM to DM; use distinct base addresses to avoid overlap
let lhs_dm = lhs.to_dm::<Cluster, Slice, m![A % 8]>(&mut ctx.tdma, 0);
let rhs_dm = rhs.to_dm::<Cluster, Slice, m![A % 8]>(&mut ctx.tdma, 1 << 12);
// Sub context: load rhs into VRF (runs concurrently with the main context below).
// VRF holds a per-slice operand that the Vector Engine reads every cycle.
let rhs_vrf: VrfTensor<i32, Chip, Cluster, Slice, m![A % 8]> = ctx
.sub
.begin(rhs_dm.view())
.fetch::<i32, m![1], m![A % 8]>()
.collect::<m![A % 8 / 8], m![A % 8 % 8]>()
.to_vrf(0);
// Main context: multiply every lhs element by its rhs counterpart from VRF
let result = ctx
.main
.begin(lhs_dm.view())
.fetch::<i32, m![1], m![A % 8]>()
.collect::<m![1], m![A % 8]>()
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
// Each slice multiplies its 8 lhs elements by the matching 8 rhs elements in VRF
.vector_fxp(FxpBinaryOp::MulInt, &rhs_vrf)
.vector_final()
.commit::<m![A % 8]>(1 << 13);
result.to_hbm(&mut ctx.tdma, 1 << 28)
}
The following three examples implement the contractions from the table above, each introducing a different Switch Engine topology.
Dot Product
The dot product \(I, I \rightarrow 1\) is the simplest contraction: there is no broadcast step, and both operands reduce along the same axis.
The sub context loads rhs into the TRF, the on-chip register file that holds one operand stationary while the other streams through, via Fetch → Collect → .to_trf().
TrfAddress::Full dedicates the entire TRF to this tensor.
.align() pairs the streaming LHS flits with the stationary RHS, doubling the packet width.
.contract() multiplies and reduce-adds along A spatially via the hardware reduction tree; .accumulate() then performs temporal accumulation across the time axis, producing a scalar per slice; .cast() converts the f32 accumulator output back to bf16.
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
type Slice = m![1 # 256]; // 1 active slice; m![A / 8 # 256] would distribute across all 256
type Time = m![1]; // No temporal iteration
type Row = m![1]; // No row parallelism
#[tokio::main]
async fn main() {
let mut ctx = Context::acquire();
let mut rng = SmallRng::seed_from_u64(42);
let lhs = HostTensor::<bf16, m![A]>::rand(&mut rng);
let rhs = HostTensor::<bf16, m![A]>::rand(&mut rng);
let lhs_hbm = lhs.to_hbm(&mut ctx.pdma, 0).await;
let rhs_hbm = rhs.to_hbm(&mut ctx.pdma, 1 << 28).await;
let out_hbm = launch(kernel, (&mut ctx, &lhs_hbm, &rhs_hbm)).await;
let out = out_hbm.to_host::<m![1]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
lhs: &HbmTensor<bf16, Chip, m![A]>,
rhs: &HbmTensor<bf16, Chip, m![A]>,
) -> HbmTensor<bf16, Chip, m![1]> {
// HBM → DM
let lhs: DmTensor<bf16, Chip, Cluster, Slice, m![A]> = lhs.to_dm(&mut ctx.tdma, 0);
let rhs: DmTensor<bf16, Chip, Cluster, Slice, m![A]> = rhs.to_dm(&mut ctx.tdma, 1 << 12);
// Sub context: load rhs into TRF (TrfAddress::Full dedicates the entire TRF to this tensor)
let rhs: TrfTensor<bf16, Chip, Cluster, Slice, Row, m![A]> = ctx
.sub
.begin(rhs.view())
.fetch::<bf16, Time, m![A]>()
.collect::<m![{Time}, A / 16], m![A % 16]>()
.to_trf(TrfAddress::Full);
// Main context: stream lhs through the Contraction Engine, reduce along A
let result: DmTensor<bf16, Chip, Cluster, Slice, m![1 # 8]> = ctx
.main
.begin(lhs.view())
.fetch::<bf16, Time, m![A]>()
.collect::<m![A / 16], m![A % 16]>()
// Pair consecutive 32-byte flits into 64-byte packets, halving time steps (A/16 → A/32)
.align::<m![A / 32], m![A % 32], _, _>(&rhs)
.contract::<m![1]>()
.accumulate::<m![1], m![1 # 8]>(AccumulationKind::Interleaved)
.cast::<bf16, m![1 # 16]>() // cast f32 accumulator output back to bf16
.commit::<m![1 # 8]>(1 << 13);
// DM → HBM
result.to_hbm(&mut ctx.tdma, 2 << 28)
}
The dot product reduces along a single axis with no redistribution needed, so the Switch Engine is skipped and collect() is called directly on the FetchTensor.
The pseudocode below describes this behavior:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 2048];
// Dot product: both operands reduce along A; no slice redistribution needed.
fn collect_dot_product<'l, const T: Tu>(
input: FetchTensor<'l, T, bf16, m![1], m![1], m![1 # 256], m![1], m![A]>,
) -> CollectTensor<'l, T, bf16, m![1], m![1], m![1 # 256], m![A / 16], m![A % 16]> {
input.collect()
}
}
GEMV
GEMV \(IJ, J \rightarrow I\) extends the dot product by requiring the Switch Engine (which redistributes data across slices between Fetch and Collect) to broadcast the vector across all I slices, so each slice can independently compute its row of the output \(y_i = \sum_j A_{ij} x_j\).
The reduced dimension J splits into Time (one iteration per tile) and Packet (elements within each tile).
The preserved output dimension I maps to Slice, distributing output elements across slices for spatial parallelism.
GEMV requires broadcasting the vector to all I slices, which the Switch Engine handles with Broadcast01.
The pseudocode below describes this behavior:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![I = 256, J = 2048];
// GEMV: broadcast the vector across all I slices.
fn switch_gemv<'l, const T: Tu>(
input: FetchTensor<'l, T, bf16, m![1], m![1], m![1 # 256], m![1], m![J]>,
) -> SwitchTensor<'l, T, bf16, m![1], m![1], m![I], m![1 # 256], m![J]> {
input.switch(SwitchConfig::Broadcast01 {
slice1: 256,
slice0: 1,
time0: 1,
})
}
}
#![allow(unused)]
fn main() {
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;
axes![I = 256, J = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
type Slice = m![I]; // Distribute output dimension across slices
type Time = m![J / 32]; // Temporal iterations for reduction dimension
type Packet = m![J % 32]; // Packet size for reduction dimension
type Row = m![1];
async fn run() {
let mut ctx = Context::acquire();
// Create matrix and vector on host
let mut rng = SmallRng::seed_from_u64(42);
let matrix = HostTensor::<bf16, m![I, J]>::rand(&mut rng);
let vector = HostTensor::<bf16, m![J]>::rand(&mut rng);
// Transfer to HBM
let matrix_hbm = matrix.to_hbm(&mut ctx.pdma, 0 << 28).await;
let vector_hbm = vector.to_hbm(&mut ctx.pdma, 1 << 28).await;
// Launch kernel
let out_hbm = launch(kernel, (&mut ctx, &matrix_hbm, &vector_hbm)).await;
// Transfer result back
// TODO(jeongmin.park): Consider adding a type annotation here.
let out = out_hbm.to_host::<m![I]>(&mut ctx.pdma).await;
}
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
matrix: &HbmTensor<bf16, Chip, m![I, J]>,
vector: &HbmTensor<bf16, Chip, m![J]>,
) -> HbmTensor<bf16, Chip, m![I]> {
// Move data from HBM to DM
let matrix: DmTensor<bf16, Chip, Cluster, Slice, m![J]> = matrix.to_dm(&mut ctx.tdma, 0);
let vector: DmTensor<bf16, Chip, Cluster, Slice, m![J]> = vector.to_dm(&mut ctx.tdma, 1 << 12);
// Load vector into TRF
// The Switch Engine automatically broadcasts the vector to all `I` slices
let vector_trf: TrfTensor<bf16, Chip, Cluster, Slice, Row, m![J]> = ctx
.sub
.begin(vector.view())
.fetch::<bf16, m![1], m![J]>()
// Collect Engine: split into 32-byte flits.
.collect::<m![J / 16], m![J % 16]>()
.to_trf(TrfAddress::Full);
// Compute GEMV: matrix × vector
// Key difference: `I` maps to slice (preserved), `J` gets reduced
let result: DmTensor<bf16, Chip, Cluster, Slice, m![1]> = ctx
.main
.begin(matrix.view())
.fetch::<bf16, Time, Packet>()
.collect::<Time, Packet>()
.align::<Time, Packet, _, _>(&vector_trf)
.contract::<m![1]>()
.accumulate::<m![1], m![1 # 8]>(AccumulationKind::Interleaved)
.cast::<bf16, m![1 # 16]>()
.commit(0);
// Transfer result to HBM
result.to_hbm(&mut ctx.tdma, 2 << 28)
}
}
GEMM
GEMM \(IK, KJ \rightarrow IJ\) computes \(C_{ij} = \sum_k A_{ik} B_{kj}\). Each matrix broadcasts along its missing output dimension: \(A\) broadcasts across \(J\), \(B\) broadcasts across \(I\). Then the shared dimension \(K\) is reduced.
The main change from the GEMV example is that two output dimensions \(I\) and \(J\) are jointly mapped to Slice, so each slice computes a 2D tile of the output matrix.
The slice mapping now covers both dimensions, and the contraction output preserves both.
This example introduces type Slice = m![I / 32, J / 32], which decomposes the two output dimensions jointly: the 256 slices form a 16 × 16 grid, and each slice computes a 32 × 32 output tile.
The Switch Engine distributes each tile of B to the matching slice, so each slice sees only its portion of J.
.contract::<m![1]>() reduces along K spatially, and .accumulate::<m![I % 32, J / 8 % 4], m![J % 8]>(AccumulationKind::Interleaved) accumulates over time, preserving both I and J in the output.
#![allow(unused)]
fn main() {
#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![I = 512, J = 512, K = 2048];
type Chip = m![1];
type Cluster = m![1 # 2];
// Distribute output dimensions `I` and `J` across slices
type Slice = m![I / 32, J / 32]; // 256 slices in a 16 × 16 grid; each handles a 32 × 32 output tile
type Row = m![J % 8];
// Host code similar to previous examples:
// - Create matrix tensors A and B
// - Transfer to HBM
// - Launch kernel
// - Transfer result back
#[device(chip = 1)]
fn kernel(
ctx: &mut Context,
a: &HbmTensor<bf16, Chip, m![I, K]>,
b: &HbmTensor<bf16, Chip, m![K, J]>,
) -> HbmTensor<bf16, Chip, m![I, J]> {
// Move data from HBM to DM
let a: DmTensor<bf16, Chip, Cluster, Slice, m![I % 32, K]> = a.to_dm(&mut ctx.tdma, 0);
let b: DmTensor<bf16, Chip, Cluster, Slice, m![J % 32, K]> = b.to_dm(&mut ctx.tdma, 1 << 12);
// Load matrix B into TRF
// Switch Engine distributes B across 256 slices
// Each slice gets the full `K` dimension but only the `J` values for its 32 × 32 output tile
// See: Switch Engine topologies for details on distribution
let b_trf: TrfTensor<bf16, Chip, Cluster, Slice, Row, m![J / 8 % 4, K]> = ctx
.sub
.begin(b.view())
.fetch::<bf16, m![J % 8, J / 8 % 4], m![K]>()
.collect::<m![J % 8, J / 8 % 4, K / 16], m![K % 16]>()
.to_trf(TrfAddress::Full);
// Compute GEMM: A × B
// Switch Engine ensures matching (`I / 32`, `J / 32`) slice distribution
// Contraction reduces along `K`, preserves `I` and `J`
let result: DmTensor<bf16, Chip, Cluster, Slice, m![I % 32, J % 32]> = ctx
.main
.begin(a.view())
.fetch::<bf16, m![I % 32, J / 8 % 4], m![K]>()
.collect::<m![I % 32, J / 8 % 4, K / 16], m![K % 16]>()
.align::<m![I % 32, J / 8 % 4, K / 32], m![K % 32], _, _>(&b_trf)
.contract::<m![1]>()
.accumulate::<m![I % 32, J / 8 % 4], m![J % 8]>(AccumulationKind::Interleaved)
.cast::<bf16, m![J % 8 # 16]>()
.commit(0);
// Transfer result to HBM
result.to_hbm(&mut ctx.tdma, 2 << 28)
}
}
Blocked GEMM
Note
This section is a work in progress. A complete example extending GEMM with blocking (tiling) for matrices that exceed on-chip DM capacity — covering temporal partitioning over the `K` dimension and spatial partitioning that distributes `I` and `J` tiles across multiple chips — will be added in a future release.
Flash Attention
Note
This section is a work in progress. A complete flash attention example combining GEMM-style contraction, softmax (Vector Engine), and the multi-pass main/sub prefetch pattern across a full transformer attention head will be added in a future release.
Together, the five complete examples above demonstrate every major hardware engine in TCP: the DMA engines (HBM to DM), the Vector Engine (element-wise ops), the Contraction Engine (multiply-reduce), and the Switch Engine (data redistribution across slices). (The Blocked GEMM and Flash Attention sections are stubs and will demonstrate additional patterns once complete.)
Further Reading
The examples above process tensors that fit in a single hardware pass. Real workloads often exceed the 512 KB/slice DM capacity and require partitioning into tiles. The next chapters cover two complementary strategies: temporal partitioning, which processes tiles sequentially over time, and spatial partitioning, which distributes tiles across parallel hardware units.
Each construct introduced in this chapter is covered in depth in the reference chapters:
- `axes![]`, `m![]`, `HbmTensor`, `DmTensor` → Mapping Tensors
- `.to_dm()`, `.to_hbm()`, `.fetch()`, `.commit()` → Moving Tensors
- `.contract()`, `.accumulate()`, `.cast()`, `.switch()`, `.vector_fxp()` → Computing Tensors
- `ctx.main`, `ctx.sub`, `launch()` → Scheduling
- End-to-end kernels combining all of the above → Kernel Examples
Mapping Tensors
This chapter explains what mappings are, how to declare them in TCP’s Virtual ISA, and how to choose them for performance.
Layout and Performance
Tensors have no intrinsic order of elements. A mapping is a function from tensor indices to buffer positions, which defines the order in which elements are stored. When storing a tensor in hardware, you need to decide how they will be mapped into the flat buffer.
The choice of mapping matters because hardware reads memory in contiguous blocks: elements stored far apart require more memory transfers. For example, one can choose which axis is major (outermost, changes slowest) and which is minor (innermost, changes fastest, stored contiguously). Changing a layout after allocation requires copying and transposing data, so the mapping chosen at allocation time constrains all subsequent operations to match that layout.
Consider a tensor with axes H (height, 6 rows) and W (width, 8 columns). The same tensor admits different mappings, each with different performance characteristics.
| H\W | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | a | b | c | d | e | f | g | h |
| 1 | i | j | k | l | m | n | o | p |
| 2 | · | · | · | · | · | · | · | · |
| 3 | · | · | · | · | · | · | · | · |
| 4 | · | · | · | · | · | · | · | · |
| 5 | · | · | · | · | · | · | · | · |
H Major, W Minor
A scan along W is contiguous; a scan along H accesses one element per cache line.
| H=0 |  |  |  |  |  |  |  | H=1 |  |  |  |  |  |  |  | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | ... |
W Major, H Minor
A scan along H is contiguous; a scan along W accesses one element per cache line.
| W=0 |  |  |  |  |  | W=1 |  |  |  |  |  | W=2 |  |  |  |  |  | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | i | · | · | · | · | b | j | · | · | · | · | c | k | · | · | · | · | ... |
Either choice sacrifices spatial locality (the property that nearby elements are stored at nearby addresses) in one direction. Tiling achieves good locality along both axes by grouping nearby H and W indices into 2D tiles.
2×2 Tiles
All elements within a tile are contiguous.
| t(0,0) |  |  |  | t(0,1) |  |  |  | t(0,2) |  |  |  | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | i | j | c | d | k | l | e | f | m | n | ... |
W-minor layout is fast along W but slow along H; H-minor layout is the reverse; tiling gives balanced locality in both directions at the cost of more complex address calculation.
The outer dimension of a decomposition can become a hardware time loop; the inner dimension, a parallel lane.
Choosing a decomposition is how programmers control which dimensions execute sequentially and which execute in parallel.
TCP names these hardware dimensions Time (the sequential loop counter) and Packet (the parallel data lane width), used throughout this book; Memory and Stream explains how decompositions map to them.
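For example, the GEMV example from the previous chapter decomposes its reduced axis `J` exactly this way; a minimal sketch of that decomposition (the alias names mirror the GEMV kernel):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![J = 2048];
type Time = m![J / 32];   // 2048 / 32 = 64 sequential pipeline iterations
type Packet = m![J % 32]; // 32 elements delivered in parallel each iteration
}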
The Declarative Approach
Virtual ISA lets the programmer declare a mapping in terms of logical axes; the compiler derives physical placement, alignment, and hardware scheduling.
In the above example, the simplest form is m![H, W] for H-major and m![W, H] for W-major, where the leftmost axis is major and the rightmost is minor.
Decomposing axes further with / and % enables tiling, expressed as m![H / 2, W / 2, H % 2, W % 2]: the first two dimensions are the tile indices and the last two are positions within the tile.
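A minimal sketch of these three layouts for the 6 × 8 example above (the alias names are illustrative):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![H = 6, W = 8];
type HMajor = m![H, W];                      // scans along W are contiguous
type WMajor = m![W, H];                      // scans along H are contiguous
type Tiled = m![H / 2, W / 2, H % 2, W % 2]; // 2×2 tiles; tile (0, 0) holds {a, b, i, j}
}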
Declarative mappings offer three benefits:
- Expressiveness: Layout is stated in terms of logical axes (`m![H, W]`), not raw memory strides or offsets.
- Correctness: The compiler normalizes mapping expressions to canonical form and verifies them symbolically, turning layout properties into compile-time invariants.
- Portability: The same expression targets CPUs, GPUs, and TCPs without rewrites; the compiler derives hardware-specific placement from the axis description.
Mapping expressions describe a tensor at every stage of its life, not only when it is at rest in memory. The same tensor can be stored in HBM, loaded into DM with a different layout, and streamed through the pipeline as packets; each stage holds the same mathematical values under a different mapping.
This unified view treats data movement as preserving the mathematical tensor: moving a tensor between stages changes only its physical representation, not its values. The Tensor Functions page formalizes this perspective and shows how it makes data movement composable with computation in the same pipeline.
Mapping Expressions
A mapping expression defines where each tensor element sits in a buffer. This page covers the available mapping constructors and the equivalences between mappings.
Consider a tensor with axes![A = 8, B = 512].
The mapping expression m![A, B] places \(A\) as the major axis and \(B\) as the minor axis, requiring a buffer of 8 × 512 = 4096 elements.
Buffer position 0 holds \(\{A=0, B=0\}\), position 1 holds \(\{A=0, B=1\}\), and so on through all 512 elements where \(A=0\) before moving to \(A=1\).
Axis Sizes
The axes! macro declares axis identifiers and their sizes.
Throughout this section, assume the following axis sizes.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
}
Mapping Interface
A mapping expression like m![H, W] is a Rust type that describes how tensor indices map to buffer positions.
Every mapping expression implements the M trait, which provides the buffer size and a buffer-index-to-tensor-index mapping function:
#![allow(unused)]
fn main() {
// Inside `furiosa_visa_std::prelude`...
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::fmt::Debug;
/// Tensor index: a map from axis identifiers to coordinate values.
pub struct Index { /* ... */ }
/// Constructs tensor indices.
/// `i![A: 2, B: 3]` creates an `Index` with A = 2 and B = 3.
macro_rules! i {
() => {};
/* ... */
}
}
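The `M` trait itself is not reproduced above; the following is a minimal sketch inferred from the `impl M for ...` blocks later on this page, so the real trait in `furiosa_visa_std` may carry additional items:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
/// Sketch of the mapping interface, inferred from the implementations below.
trait M {
    /// Number of buffer positions this mapping addresses.
    const SIZE: usize;
    /// Value-level description of the mapping.
    fn to_value() -> Mapping;
    /// Maps a buffer index to its tensor index; `None` for padding or out-of-range positions.
    fn map(i: usize) -> Option<Index>;
}
}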
A mapping defines what mathematical tensor a buffer represents.
For example, HostTensor<bf16, m![A, B]> denotes a host memory buffer containing m![A, B]::SIZE elements of bf16 data, which is 4096 elements.
We say a buffer holds a tensor \(T\) when:
- For every buffer index `i` and tensor index `ti`,
- if `m![A, B]::map(i) = Some(ti)`,
- then the `i`-th element of the buffer stores the value of tensor \(T\) at index `ti`.
Constructors
Mapping expressions are built by composing small constructors, each of which transforms or combines simpler mappings.
These expressions use arithmetic-like operators (/, %, and # for padding) to concisely define the mapping between tensor and linear buffer indices.
Symbol
A symbol is a single uppercase letter whose size comes from the shape declaration.
The mapping m![A] maps 8 buffer indices linearly to tensor indices along the axis: buffer index 0 holds i![] (empty tensor index), index 1 holds i![A: 1], index 2 holds i![A: 2], and so on:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
type E = m![A]; // Symbol<Ident::A, 8>
#[test]
fn test_symbol() {
for i in 0..E::SIZE {
assert_eq!(E::map(i), Some(i![A: i]));
}
assert_eq!(E::map(E::SIZE), None);
}
}
// Trait implementation
impl<S: AxisName> M for Symbol<S> {
const SIZE: usize = S::SIZE;
fn to_value() -> Mapping {
Mapping::Symbol {
symbol: S::NAME,
size: S::SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < S::SIZE {
let mut index = Index::new();
Index::add_term(
&mut index,
Term {
inner: Atom::Symbol {
symbol: S::NAME,
size: S::SIZE,
},
stride: 1,
modulo: S::SIZE,
},
i,
);
Some(index)
} else {
None
}
}
}
Note
For every symbol `A`, the 0’th index `i![A: 0]` corresponds to the empty tensor index `i![]`.
Pair
One way to store a 2D tensor with shape \(\{A=8, B=512\}\) is the pair mapping m![A, B].
This creates a buffer of 4096 elements where A is the major axis and B is the minor axis.
The first 512 elements hold A = 0 and the next 512 elements hold A = 1.
Buffer index 519 holds i![A: 1, B: 7] since 519 == 512 * 1 + 7.
The mapping Pair<L, R> maps the Cartesian product of two spaces into a linear buffer where L is the major dimension and R is the minor dimension.
The size is L::SIZE * R::SIZE, and the mapping uses floor division and modulo to decompose indices.
m![A, B, C, D] expands to Pair<A, Pair<B, Pair<C, D>>> and is right-associative.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
type E = m![A, B]; // Pair<m![A], m![B]>
#[test]
fn test_pair() {
for i in 0..E::SIZE {
assert_eq!(E::map(i), Some(i![A: i / <m![B]>::SIZE, B: i % <m![B]>::SIZE]));
}
assert_eq!(E::map(2 * <m![B]>::SIZE + 7), Some(i![A: 2, B: 7]));
assert_eq!(E::map(E::SIZE), None);
}
}
// Trait implementation
impl<L, R> M for Pair<L, R>
where
L: M,
R: M,
{
const SIZE: usize = L::SIZE * R::SIZE;
fn to_value() -> Mapping {
Mapping::Pair {
left: RBox::new(L::to_value()),
right: RBox::new(R::to_value()),
}
}
fn map(i: usize) -> Option<Index> {
let mut l = L::map(i / R::SIZE)?;
let r = R::map(i % R::SIZE)?;
Index::add(&mut l, r);
Some(l)
}
}
Identity
The identity mapping m![1] creates a single-element buffer that maps buffer index 0 to the empty tensor index i![].
It serves as the identity element for Pair: m![1, A] and m![A, 1] are both equivalent to m![A].
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
type E = m![1]; // Identity
#[test]
fn test_identity() {
assert_eq!(E::map(0), Some(i![]));
assert_eq!(E::map(1), None);
}
}
// Trait implementation
impl M for Identity {
const SIZE: usize = 1;
fn to_value() -> Mapping {
Mapping::Identity
}
fn map(i: usize) -> Option<Index> {
if i == 0 { Some(Index::new()) } else { None }
}
}
Padding
Padding aligns data to hardware requirements by adding unused buffer space.
For example, the DMA engine requires rows to start on 64-byte boundaries.
With axes![C = 13, D = 61], m![C, D] creates misaligned rows, since 61 is not a multiple of 64.
m![C, D # 64] fixes this by aligning each row to 64-byte boundaries, using 3 extra elements per row.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![C = 13, D = 61];
type E = m![C, D # 64]; // Pair<m![C], Padding<m![D], 64>>
#[test]
fn test_padding() {
assert_eq!(E::map(0), Some(i![C: 0, D: 0]));
assert_eq!(E::map(60), Some(i![C: 0, D: 60]));
assert_eq!(E::map(61), None); // padding
assert_eq!(E::map(62), None); // padding
assert_eq!(E::map(63), None); // padding
assert_eq!(E::map(64), Some(i![C: 1, D: 0]));
}
}
// Trait implementation
impl<L, const SIZE: usize> M for Padding<L, SIZE>
where
L: M,
{
const SIZE: usize = SIZE;
fn to_value() -> Mapping {
Mapping::Padding {
inner: RBox::new(L::to_value()),
padding: SIZE,
kind: PaddingKind::Top,
}
}
fn map(i: usize) -> Option<Index> {
L::map(i)
}
}
Resize
Resize constrains a mapping to a smaller logical size by truncating indices beyond the new size, discarding elements outside that range. Unlike padding, which expands the buffer, Resize shrinks the logical view.
The mapping m![D = 2] takes only the first 2 elements of axis D, producing indices D = 0 and D = 1.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![C = 2, D = 3];
type E = m![C, D = 2]; // Pair<m![C], Resize<m![D], 2>>
#[test]
fn test_resize() {
assert_eq!(E::map(0), Some(i![C: 0, D: 0]));
assert_eq!(E::map(1), Some(i![C: 0, D: 1]));
assert_eq!(E::map(2), Some(i![C: 1, D: 0]));
assert_eq!(E::map(3), Some(i![C: 1, D: 1]));
assert_eq!(E::map(4), None);
}
}
// Trait implementation
impl<L, const SIZE: usize> M for Resize<L, SIZE>
where
L: M,
{
const SIZE: usize = SIZE;
fn to_value() -> Mapping {
Mapping::Resize {
inner: RBox::new(L::to_value()),
resize: SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < SIZE { L::map(i) } else { None }
}
}
Tiling is implemented through indexed views, pure metadata transformations without data copies.
The .tile() method extracts a tile by resizing one dimension to the tile size and offsetting into the buffer.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
fn tiles() {
let tensor = unsafe { HbmTensor::<bf16, m![1], m![A, B]>::from_addr(0) };
let view = tensor.view(); // HbmTensorView::<'_, bf16, m![1], m![A, B]>
let tile01 = view.tile::<m![B], 2, m![A, B = 2 # 4]>(0); // HbmTensorView::<'_, bf16, m![1], m![A, B = 2 # 4]>
let tile23 = view.tile::<m![B], 2, m![A, B = 2 # 4]>(2); // HbmTensorView::<'_, bf16, m![1], m![A, B = 2 # 4]>
}
}
The .tile() method takes three type parameters and one value parameter.
- The tile dimension `m![B]` specifies which dimension to divide along.
- The tile size `2` specifies the number of elements per tile.
- The tile mapping `m![A, B = 2 # 4]` defines the resulting view’s mapping. The mapping `B = 2 # 4` signifies that dimension `B` has a logical size of `2` within the view but exists within a physical footprint of `4`. This is essential for preserving the original memory layout and stride calculations.
- The starting index specifies which tile to extract. Passing `0` captures the range `0..2` for `tile01`, while passing `2` captures the range `2..4` for `tile23`.
Stride and Modulo
Stride (/) and modulo (%) decompose a single dimension into two: the outer (block index) and the inner (position within block).
Consider the 512-element axis B divided into 8 blocks of 64 elements each.
The mapping m![B / 64, B % 64] creates an 8 × 64 grid where the first dimension selects which block and the second dimension selects the position within that block.
Buffer index 130 corresponds to block 2 at position 2 within that block, giving tensor index B = 64 × 2 + 2 = 130, equal to the flat-buffer result (since m![B / 64, B % 64] is equivalent to m![B]):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
type D1 = m![B / 64]; // stride with size 8
type D2 = m![B % 64]; // modulo with size 64
type E = m![B / 64, B % 64]; // equivalent to `m![B]`
#[test]
fn test_stride_modulo() {
for i in 0..8 {
assert_eq!(D1::map(i), Some(i![B / 64: i]));
}
assert_eq!(D1::map(8), None);
for j in 0..64 {
assert_eq!(D2::map(j), Some(i![B % 64: j]));
}
assert_eq!(D2::map(64), None);
for i in 0..8 {
for j in 0..64 {
assert_eq!(
E::map(64 * i + j), // i![B / 64: i, B % 64: j]
<m![B]>::map(64 * i + j), // equivalent to above
);
}
}
assert_eq!(E::map(512), None);
}
}
// Trait implementation
impl<L, const SIZE: usize> M for Stride<L, SIZE>
where
L: M,
{
const SIZE: usize = {
assert!(L::SIZE % SIZE == 0, "Stride size must divide the original size");
L::SIZE / SIZE
};
fn to_value() -> Mapping {
Mapping::Stride {
inner: RBox::new(L::to_value()),
stride: SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < Self::SIZE { L::map(i * SIZE) } else { None }
}
}
impl<L, const SIZE: usize> M for Modulo<L, SIZE>
where
L: M,
{
const SIZE: usize = {
assert!(L::SIZE % SIZE == 0, "Modulo size must divide the original size");
SIZE
};
fn to_value() -> Mapping {
Mapping::Modulo {
inner: RBox::new(L::to_value()),
modulo: SIZE,
}
}
fn map(i: usize) -> Option<Index> {
if i < Self::SIZE { L::map(i % L::SIZE) } else { None }
}
}
Together, m![B / 64, B % 64] transforms axis B into an 8 × 64 grid.
The mapping is equivalent to m![B] but expresses a different logical view of the same data, revealing block structure hidden in the flat representation.
Stride and modulo mappings can be visualized in tabular form. Consider the mapping m![B / 4, B % 4] with B::SIZE = 16. The following table shows how buffer indices are arranged: each row corresponds to a specific index of B / 4 (the stride axis), and each column corresponds to an index of B % 4 (the modulo axis):
| | `i![B % 4: 0]` | `i![B % 4: 1]` | `i![B % 4: 2]` | `i![B % 4: 3]` |
|---|---|---|---|---|
| `i![B / 4: 0]` | `i![B: 0]` | `i![B: 1]` | `i![B: 2]` | `i![B: 3]` |
| `i![B / 4: 1]` | `i![B: 4]` | `i![B: 5]` | `i![B: 6]` | `i![B: 7]` |
| `i![B / 4: 2]` | `i![B: 8]` | `i![B: 9]` | `i![B: 10]` | `i![B: 11]` |
| `i![B / 4: 3]` | `i![B: 12]` | `i![B: 13]` | `i![B: 14]` | `i![B: 15]` |
Stride and modulo factorize a single mapping into multiple dimensions.
The expression m![B / n] creates an outer dimension indexing blocks of size n.
The expression m![B % n] creates an inner dimension indexing positions within each block.
Modulo differs from resize in how it handles buffer size:
- Resize shrinks the buffer by truncating indices beyond the new size.
- Modulo preserves the original buffer size while partitioning it into equal-sized blocks.
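A minimal sketch contrasting the two on an axis large enough to show both (assuming the standalone `m![B = 64]` resize form and the `axes![B = 512]` declaration used on this page):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![B = 512];
type Shrunk = m![B = 64];          // Resize: keeps only B = 0..63, dropping the rest
type Blocked = m![B / 64, B % 64]; // Stride + modulo: all 512 elements, viewed as 8 blocks of 64
#[test]
fn test_resize_vs_modulo() {
    assert_eq!(Shrunk::SIZE, 64);
    assert_eq!(Blocked::SIZE, 512);
    assert_eq!(Shrunk::map(64), None);              // truncated beyond the new size
    assert_eq!(Blocked::map(64), <m![B]>::map(64)); // still maps to B = 64
}
}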
These operations can be nested for complex decompositions.
The following example splits B into three dimensions where the buffer’s bit layout differs from that of the tensor index.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
// B's bits: 6 - 8, 0 - 4, 5
// Values: 0 - 7, 0 - 31, 0 - 1
type E = m![B / 64, B % 32, B / 32 % 2];
#[test]
fn test_nested_stride() {
for i in 0..8 {
for j in 0..32 {
for k in 0..2 {
assert_eq!(
E::map(64 * i + 2 * j + k),
Some(i![B: 64 * i + j + 32 * k]),
);
}
}
}
assert_eq!(E::map(512), None);
}
}
The buffer index decomposes as 64 * i + 2 * j + k where i selects the block, j selects position within the block, and k selects the sub-block.
The tensor index B reconstructs as 64 * i + j + 32 * k, which rearranges the bit positions.
For example, buffer index 67 maps to B = 97:
- Buffer: `67 = 64 * 1 + 2 * 1 + 1` gives `i = 1, j = 1, k = 1`
- Tensor: `B = 64 * 1 + 1 + 32 * 1 = 97`
- Verify: `97 / 64 = 1`, `97 % 32 = 1`, `(97 / 32) % 2 = 1`
This kind of bit rearrangement maps naturally to hardware memory layouts where address bits are reordered for bank interleaving or cache efficiency.
In binary, this rearranges bit positions: buffer 001_00001_1 becomes B = 001_1_00001.
The buffer groups bits as [8:6]_[5:1]_[0] while B groups them as [8:6]_[5]_[4:0].
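The same regrouping can be checked with ordinary integer arithmetic; a minimal plain-Rust sketch (no Virtual ISA types involved):
fn main() {
    let buf: usize = 0b001_00001_1; // buffer index 67, grouped as [8:6]_[5:1]_[0]
    let (i, j, k) = (buf / 64, (buf % 64) / 2, buf % 2);
    let b = 64 * i + j + 32 * k;    // reconstruct the tensor index
    assert_eq!(b, 0b001_1_00001);   // B = 97, grouped as [8:6]_[5]_[4:0]
}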
Tiling can operate on blocks rather than individual elements.
The following example tiles by block using m![B / 32] and creates overlapping tiles:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
let tensor = unsafe { HbmTensor::<bf16, m![1], m![A, B]>::from_addr(0) };
for i in 0..15 {
let tile = tensor.view().tile::<m![B / 32], 2, m![A, B / 32 = 2 # 16, B % 32]>(i);
}
}
With B = 512, the dimension B / 32 has 16 blocks numbered 0-15.
Each tile takes 2 consecutive blocks starting at index i.
Tile 0 covers blocks {0, 1}, tile 1 covers blocks {1, 2}, and so on through tile 14 covering blocks {14, 15}.
These tiles overlap because consecutive tiles share one block: tiles 0 and 1 both include block 1.
The tile mapping B / 32 = 2 resizes the block dimension to 2 since each tile contains exactly 2 blocks.
When tiling with a single block, B / 32 = 1 simplifies to the identity m![1] since the dimension has only one value.
Escape
For complex mappings, define type aliases and reference them using { ... }.
With separate mappings L = m![A] and R = m![B], combining them as m![{ L }, { R }] produces the same result as m![A, B]:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
type L = m![A];
type R = m![B];
type E = m![{ L }, { R }]; // equivalent to `m![A, B]`
}
This escape syntax breaks down complex mappings into named, reusable components.
Advanced Constructors
Skewed Axis
A skewed axis creates a diagonal access pattern across two dimensions.
Skewed axes introduce derived axis labels defined by arithmetic differences between existing axes; for example, B' = B - A defines a new axis B' whose coordinate at any point equals B minus A.
Algorithms that process data along diagonals use this pattern, such as certain wavefront computations.
The expression m![A, B' = 4] with B' = B - A creates a mapping where each row is shifted relative to the previous one.
The = operator specifies the logical size after skewing. The result wraps around using modular arithmetic.
For example, with axes![A = 4, B = 4] and B' = B - A:
| (A, B’) | (A, B) |
|---|---|
| (0, 0) | (0, 0) |
| (0, 1) | (0, 1) |
| (0, 2) | (0, 2) |
| (0, 3) | (0, 3) |
| (1, 0) | (1, 1) |
| (1, 1) | (1, 2) |
| (1, 2) | (1, 3) |
| (1, 3) | (1, 0) |
When A = 1 and B' = 3, the original B coordinate wraps to 0 via modular arithmetic since B = (B' + A) % 4 = (3 + 1) % 4 = 0.
Indirect Sequencing
TODO: Document indirect sequencing patterns for non-contiguous memory access. This advanced constructor provides index-based memory access where the sequence of buffer positions is determined by an indirection table rather than a mathematical formula.
Sliding (Linear Combination)
Note
Linear combination expressions `$(e1:n1, ..., ed:nd)` combine multiple dimensions with specified strides. Formal definition: `size_S($(e1:n1, ..., ed:nd)) = 1 + sum_k((size_S(ek) - 1) * nk)`. The mapping `S, $(e1:n1, ..., ed:nd) |- si ~ ti` holds if there exist `si1...sid, ti1...tid` such that for all `k`: `S, ek |- sik ~ tik`, `si = sum_k(sik * nk)`, and `ti = sum_k(tik * nk)`.
Linear combinations can encode outer sum: `e1 * e2` is equivalent to `$(e1 : size_S(e2), e2 : 1)`. However, outer sum is preferred because it’s more resilient to axis reordering. Changing `e1 * e2` to `e2 * e1` doesn’t require manual stride updates.
Sliding operations access overlapping data blocks, essential for convolutional neural networks. Consider a buffer of 9 elements representing a tensor with shape \(\{N=5, F=3\}\) where each row is a 3-element slice that slides one element at a time. The tensor element at \((N, F)\) maps to buffer index \(N + 2F\):
$$ \begin{array}{c|ccc} & F=0 & F=1 & F=2 \\ \hline N=0 & 0 & 2 & 4 \\ N=1 & 1 & 3 & 5 \\ N=2 & 2 & 4 & 6 \\ N=3 & 3 & 5 & 7 \\ N=4 & 4 & 6 & 8 \\ \end{array} $$
Note
In this sliding pattern, a single space index can map to multiple tensor indices. For example, space index `4` maps to `{4_N}`, `{2_N, 1_F}`, and `{2_F}` simultaneously. This illustrates the non-one-to-one nature of `(S, e).maps(si, ti)`.
This can be expressed using a linear combination expression where the N axis has stride 1 and the F axis has stride 2, yielding a total size of 1 + (5-1)*1 + (3-1)*2 = 9.
Equivalent Mapping
Mappings E1 and E2 are equivalent when:
- `E1::SIZE == E2::SIZE`
- For every `i`, `E1::map(i) == E2::map(i)`
The equivalence relation is reflexive, symmetric, and transitive. Examples:
- Identity of pairs: for every `E`, `E` is equivalent both to `m![{ E }, 1]` and `m![1, { E }]`.
- Stride-modulo decomposition: for every `E` whose size `E::SIZE` is divisible by `n`, `E` and `m![{ E } / n, { E } % n]` are equivalent.
- Pair projection: for every `A` and `B`, `m![[{ A }, { B }] / B::SIZE]` is equivalent to `m![A]` and `m![[{ A }, { B }] % B::SIZE]` is equivalent to `m![B]`.
- Associativity of pairs: for every `E1`, `E2`, `E3`, `m![{ E1 }, { E2 }, { E3 }]`, `m![[{ E1 }, { E2 }], { E3 }]`, and `m![{ E1 }, [{ E2 }, { E3 }]]` are equivalent.
- Idempotent operations: for every `E`, `E` is equivalent to `m![{ E } / 1]`, to `m![{ E } # E::SIZE]`, and to `m![{ E } = E::SIZE]`.
- Modulo by 1: for every `E`, `m![E % 1]` is equivalent to the identity mapping `m![1]`.
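Because equivalence is defined pointwise, it can be checked mechanically; a minimal sketch testing the first two equivalences above (assuming `axes![A = 8]`):
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
type E1 = m![A];
type E2 = m![A, 1];         // identity of pairs
type E3 = m![A / 2, A % 2]; // stride-modulo decomposition (8 is divisible by 2)
#[test]
fn test_equivalences() {
    assert_eq!(E1::SIZE, E2::SIZE);
    assert_eq!(E1::SIZE, E3::SIZE);
    for i in 0..E1::SIZE {
        assert_eq!(E1::map(i), E2::map(i));
        assert_eq!(E1::map(i), E3::map(i));
    }
}
}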
Memory and Stream
The Mapping Expressions page covered host tensors, which use a single flat buffer.
Device tensors extend host tensors in two directions: storage adds spatial dimensions (chips, clusters, slices) to match the hardware hierarchy, while data flowing through the Tensor Unit pipeline takes a streaming form with Time and Packet dimensions.
Both arise from the same hardware distinction between static storage and pipeline flow.
HBM and SRAM
(See Formal Definition at the end of this page for the precise buffer-to-tensor correspondence.)
Device memory has multiple levels, each with its own geometry. Each level is represented as a separate type parameter, enabling spatial parallelism:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
// Assumed throughout this page.
axes![A = 8, B = 512];
// HBM tensors
struct HbmTensor<D: Scalar, Chip: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Element)>,
}
// SRAM tensors
// DM (Data Memory), TRF (Tensor Register File), and VRF (Vector Register File)
struct DmTensor<D: Scalar, Chip: M, Cluster: M, Slice: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Cluster, Slice, Element)>,
}
struct TrfTensor<D: Scalar, Chip: M, Cluster: M, Slice: M, Row: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Cluster, Slice, Row, Element)>,
}
struct VrfTensor<D: Scalar, Chip: M, Cluster: M, Slice: M, Element: M> {
/* ... */
_marker: PhantomData<(D, Chip, Cluster, Slice, Element)>,
}
}
HBM tensors distribute data across chips for spatial parallelism (processing different data elements simultaneously on different hardware units).
For example, HbmTensor<bf16, m![A], m![B]> distributes 8 × 512 = 4096 elements across 8 chips with 512 elements per chip.
The i-th chip’s j-th element stores tensor index i![A = i, B = j].
SRAM tensor types add Cluster and Slice dimensions for finer-grained parallelism.
TrfTensor additionally has a Row dimension that distributes weight data across the 8 MAC rows per slice.
See Contraction Engine for details.
These tensor types assume all units at each level share the same mapping. The type parameters directly mirror the device structure, avoiding complex address calculations that would arise from flattening multi-dimensional storage into linear indices.
Alignment Constraint
Alignment constraints apply to the Element dimension: the starting address must be a multiple of size_of::<D>().
This ensures natural alignment for maximum throughput.
The Chip, Cluster, and Slice dimensions have no additional alignment constraints.
Size Constraint
Each dimension must fit within hardware limits. Each chip has 256MB of SRAM: 2 clusters × 256 slices × 512KB per slice. An 8-chip system provides 2GB total SRAM capacity.
All device tensor types share the following spatial constraints:
| Unit | Count | Constraint | Padding Required |
|---|---|---|---|
| Chip | System-dependent | Chip::SIZE == NUM_CHIPS | m![1 # NUM_CHIPS] |
| Cluster | 2 / Chip | Cluster::SIZE == 2 | m![1 # 2] |
| Slice | 256 / Cluster | Slice::SIZE == 256 | m![X / N # 256] |
Note
These exact-match constraints are a current limitation: the runtime operates at chip granularity (`#[device(chip = N)]`), so partial chip or cluster usage is not yet supported. Use the `#` padding operator to fill unused positions. This may be relaxed in future releases.
The Element dimension varies by tensor type:
| Type | Unit | Constraint |
|---|---|---|
DmTensor | 512KB / Slice | Element::SIZE * size_of::<D>() <= 512KB |
TrfTensor | 8KB / Row | Row::SIZE <= 8, Element::SIZE * size_of::<D>() <= 8KB |
VrfTensor | 8KB / Slice | Element::SIZE * size_of::<D>() <= 8KB |
When a kernel uses fewer clusters than the hardware provides, the Cluster dimension is padded.
For example, a single-cluster kernel uses type Cluster = m![1 # 2], meaning 1 logical cluster padded to the hardware’s 2 clusters per chip.
A DmTensor<D, ..., Element> at address addr occupies addr..(addr + Element::SIZE * size_of::<D>()).
Tensor Unit Stream
While tensor data is stored in DM in a compact, storage-optimized layout, the Tensor Unit receives tensor data as streams of elements delivered over time.
The Packet dimension determines how many elements are delivered to the Tensor Unit in a single cycle.
Fetch Sequencers read DM data chunks and deliver a portion each clock cycle.
The Time dimension models this sequence of data delivery.
Unlike spatial dimensions that are constrained by hardware capacity, Time has no hardware-imposed size limit; it grows with the amount of data to process.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::ConstParamTy;
use std::marker::PhantomData;
axes![N = 4, C = 64, H = 32, W = 32];
/// Pipeline stage.
/// `Vector` is intentionally absent: the Vector Engine uses a separate typestate
/// (`VectorBranchTensor` and friends) that tracks branch, ALU, and other Vector-specific state.
/// `Commit` is intentionally absent: once the Commit Engine writes results back to DM,
/// the data is at rest and the type becomes `DmTensor`, not `StreamTensor`.
#[derive(PartialEq, Eq, ConstParamTy)]
enum Position {
Begin, // After the start of the pipeline
Fetch, // After the Fetch Engine
Switch, // After the Switch Engine
Collect, // After the Collect Engine
Contraction, // After the Contraction Engine
Reduce, // After the Reduce Engine
Cast, // After the Cast Engine
Transpose, // After the Transpose Engine
}
struct StreamTensor<
'l, // Lifetime tied to the Tensor Unit context
const P: Position,
D: Scalar,
Chip: M,
Cluster: M,
Slice: M,
Time: M,
Packet: M,
> {
/* ... */
_marker: PhantomData<&'l (D, Chip, Cluster, Slice, Time, Packet)>,
}
type T<'l> = StreamTensor<
'l,
{ Position::Fetch }, // Fetch Engine's output
bf16,
m![1], // Chip: single chip
m![1], // Cluster: single cluster
m![C / 2], // Slice: distribute 64 channels across 32 slices
m![N, H, W], // Time: iterate over batch (N) and spatial (H, W) dimensions
m![C % 2], // Packet: 2 channels per cycle
>;
}
Type T streams a tensor with an aggregate shape of \(\{N=4, C=64, H=32, W=32\}\) across 32 slices (Slice::SIZE = m![C / 2]::SIZE = 32).
The Time dimension (m![N, H, W]) has size 4 * 32 * 32 = 4096, which means there are 4096 temporal iterations or cycles.
Each cycle, the Packet dimension m![C % 2] delivers 2 channels to each slice.
Since 32 slices operate in parallel, each cycle processes 32 * 2 = 64 channels total.
Formal Definition
The following formalizes the buffer-to-tensor correspondence described above for multi-dimensional storage.
For an HBM tensor holding tensor \(T\), the correspondence is:
- For every chip index i, element index j, and corresponding tensor indices ti, tj:
  - if Chip::map(i) = Some(ti) and Element::map(j) = Some(tj),
  - then the i-th chip’s j-th element stores the value of tensor \(T\) at index ti ∪ tj (the union of the two partial tensor indices).
The same principle extends to SRAM tensors: for a DmTensor, the correspondence additionally requires matching Cluster::map and Slice::map indices, with the tensor index being the union of all four partial indices.
TrfTensor further adds a Row index.
Stream tensors add Time and Packet dimensions: the Time dimension indexes which cycle delivers the data, and the Packet dimension indexes elements within a single cycle.
Tensor Functions
The preceding pages showed that the same tensor can live in different memory tiers with different mapping expressions. To reason about operations independently of physical layout, TCP models hardware operations as abstract functions on mathematical tensors, focusing on what data they produce rather than how it is physically arranged.
The function elementwise_add implements the mathematical operation \(f(T_1, T_2) = T_1 + T_2\):
- For every tensor \(T_1\) and \(T_2\),
  - if lhs holds \(T_1\) and rhs holds \(T_2\),
  - then the return value holds \(T_1 + T_2\).
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
fn elementwise_add(
lhs: &HbmTensor<bf16, m![A], m![B]>,
rhs: &HbmTensor<bf16, m![A], m![B]>,
) -> HbmTensor<bf16, m![A], m![B]> {
// ... computes elementwise add ...
todo!("elementwise add lhs and rhs")
}
}
The same reasoning applies to data movement: moving a tensor from one memory tier to another is also a tensor function, one that preserves the mathematical content while changing the physical representation.
The .to_dm() method implements the identity function on the mathematical tensor (not the physical representation), copying tensor \(T\) from HBM to on-chip Data Memory:
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
fn hbm_to_dm(
ctx: &mut Context,
hbm: &HbmTensor<bf16, m![A], m![B]>,
) -> DmTensor<bf16, m![A], m![1], m![B / 2], m![B % 2]> {
hbm.to_dm(&mut ctx.tdma, 1 << 16) // 64KB offset
}
}
Both the input HbmTensor and output DmTensor hold the same mathematical tensor \(T\), but in different memory tiers and with different mapping expressions.
This means correctness is defined at the tensor level: a function is correct if its output holds the right mathematical tensor, regardless of which mapping or memory tier is used.
Treating data movement as a function on tensors rather than as a copy of bytes makes it composable with compute operations in the same pipeline.
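As a sketch of this composability, the following hypothetical function chains the two examples above: it adds in HBM (body elided with todo!(), as before) and then moves the sum to DM. The function name and the 64KB offset are illustrative only.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 512];
/// Hypothetical composition: add in HBM, then move the sum to DM.
/// Correctness is stated at the tensor level: the returned DmTensor holds T1 + T2.
fn add_then_to_dm(
    ctx: &mut Context,
    lhs: &HbmTensor<bf16, m![A], m![B]>,
    rhs: &HbmTensor<bf16, m![A], m![B]>,
) -> DmTensor<bf16, m![A], m![1], m![B / 2], m![B % 2]> {
    // Elementwise add in HBM (body elided, as in the example above).
    let sum: HbmTensor<bf16, m![A], m![B]> = todo!("elementwise add lhs and rhs");
    // Tier change as an identity function on the mathematical tensor.
    sum.to_dm(&mut ctx.tdma, 1 << 16)
}
}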
Moving Tensors
This chapter explains the three engines that move tensor data between memory tiers (HBM, DM, SPM) and the Tensor Unit: the Fetch Engine (DM to pipeline), the Commit Engine (pipeline to DM), and the DMA Engine (HBM/SPM to DM). Their APIs are designed around what you control: packet sizes, which engine moves each tensor, and how axes map to hardware dimensions. The compiler translates these declarations into low-level hardware concerns such as memory bank scheduling, stride calculation, and access alignment.
Device memory has two primary levels: off-chip HBM (High Bandwidth Memory) for high-capacity storage, and on-chip SRAM for low-latency working memory. SRAM is subdivided into DM (Data Memory) (the primary working memory), SPM (Scratchpad Memory) (a smaller high-speed buffer within each DM), TRF (Tensor Register File), and VRF (Vector Register File). Tensors are stored in these tiers in storage-optimized layouts. This chapter covers HBM, DM, and SPM (the tiers accessed by the DMA, Fetch, and Commit engines); TRF and VRF are loaded through the Tensor Unit pipeline and are covered in Computing Tensors.
flowchart TB
HBM[(HBM)] <--> DMA[DMA]
SPM[(SPM)] <--> DMA[DMA]
DMA <--> DM[(DM)]
subgraph TU[Tensor Unit]
direction TB
FE[Fetch] --> DOT1[...] --> CT[Contraction] --> VE[Vector] --> DOT2[...] --> CM[Commit]
end
DM -->|stream| FE
CM -->|stream| DM
click DMA "./dma-engine.html" "DMA Engine"
click FE "./fetch-engine.html" "Fetch Engine"
click CT "../computing-tensors/contraction-engine/index.html" "Contraction Engine"
click VE "../computing-tensors/vector-engine/index.html" "Vector Engine"
click CM "./commit-engine.html" "Commit Engine"
click TU "../computing-tensors/index.html" "Tensor Unit"
The Fetch engine converts DM storage layout into packet streams for the Tensor Unit; the Commit engine performs the reverse; the DMA engine converts between HBM and DM layouts. All three engines rely on Sequencers, a configuration abstraction that controls memory access patterns through nested-loop configurations, generating and consuming fixed-size packets for deterministic per-cycle transfers and aligned bank access. Memory Performance provides guidance on achieving optimal throughput.
Sequencers read DM at the Fetch Engine and write DM at the Commit Engine, converting between storage layout and stream format. The DMA Engine is a separate pipeline that moves data between HBM/SPM and DM independently of the Tensor Unit.
Sequencer
The Fetch and Commit Engines use sequencers to address DM; this page explains how sequencers work, including their configuration constraints and the failure cases that arise when those constraints are exceeded.
Sequencers convert between tensors in memory and packet streams: reading converts a memory buffer into a stream of packets, and writing performs the reverse.
Each packet is a fixed-size chunk delivered each clock cycle; its size is set by the Packet mapping dimension.
The DMA Engine chains a read and a write sequencer to move data between HBM/SPM and DM without intermediate buffers.
As a kernel writer, you control the Time and Packet type parameters, which determine iteration count and packet size; the compiler derives the register configuration and strides.
For performance implications of Packet choices, see Memory Performance.
Interface
To explain sequencer concepts, we use simplified types that capture the essential structure.
The actual API is introduced in later sections.
The read and write methods preserve tensor values while transforming between memory layout and stream format.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
/// A tensor in a linear buffer with mapping `Buf`.
struct BufTensor<D: Scalar, Buf: M> {
/* ... */
_marker: PhantomData<(D, Buf)>,
}
/// A tensor in motion.
/// - `'l`: Lifetime tied to the source buffer, ensuring the stream cannot outlive its data.
/// - `Time`: Temporal mapping (iteration over time).
/// - `Packet`: Spatial mapping (contents of a single packet).
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M> {
/* ... */
_marker: PhantomData<&'l (D, Time, Packet)>,
}
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
/// Reads a tensor from a linear buffer into a stream.
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> {
// hardware implementation
unimplemented!()
}
/// Writes a stream of packets back into a linear buffer.
fn write<'l, Time: M, Packet: M>(&mut self, stream: StreamTensor<'l, D, Time, Packet>) {
// hardware implementation
unimplemented!()
}
}
}
Examples
(The Configuration section below explains how the compiler derives these configurations from tensor mappings.)
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
fn write<'l, Time: M, Packet: M>(&mut self, stream: StreamTensor<'l, D, Time, Packet>) { let _ = stream; }
}
axes![A = 8, B = 512, N = 4, C = 3, H = 8, W = 8, T = 4, P = 4];
/// Strided access: read 8×512 tensor as 128 packets of 32 elements.
/// Time = m![A, B / 32] produces 8 * 16 = 128 time steps.
/// Packet = m![B % 32] delivers 32 consecutive elements per packet.
fn strided_read<'l>(
buf: &'l BufTensor<bf16, m![A, B]>,
) -> StreamTensor<'l, bf16, m![A, B / 32], m![B % 32]> {
buf.read() // Automatic type inference
}
/// Strided write: write 128 packets of 32 elements back to 8×512 tensor.
fn strided_write(
buf: &mut BufTensor<bf16, m![A, B]>,
stream: StreamTensor<bf16, m![A, B / 32], m![B % 32]>,
) {
buf.write(stream)
}
/// Axis reordering read: change traversal from [N, C, H, W] to [W, H, C, N].
/// Time = m![W, H, C, N] iterates in reversed axis order.
/// Packet = m![1] delivers single-element packets.
fn axis_reordering_read<'l>(
buf: &'l BufTensor<bf16, m![N, C, H, W]>,
) -> StreamTensor<'l, bf16, m![W, H, C, N], m![1]> {
buf.read()
}
/// Axis reordering write: write [W, H, C, N] stream back to [N, C, H, W] buffer.
fn axis_reordering_write(
buf: &mut BufTensor<bf16, m![N, C, H, W]>,
stream: StreamTensor<bf16, m![W, H, C, N], m![1]>,
) {
buf.write(stream)
}
/// Tiling read: break axes into sub-blocks for cache efficiency.
/// Time = m![A % 2, B % 4, A / 2, B / 4] tiles A into 2 × 4, B into 4 × 128 blocks.
/// Packet = m![C # 32] pads C to 32 elements per packet.
fn tiling_read<'l>(
buf: &'l BufTensor<i8, m![A, B, C # 8]>,
) -> StreamTensor<'l, i8, m![A % 2, B % 4, A / 2, B / 4], m![C # 32]> {
buf.read()
}
/// Tiling write: write tiled stream back to buffer.
fn tiling_write(
buf: &mut BufTensor<i8, m![A, B, C # 8]>,
stream: StreamTensor<i8, m![A % 2, B % 4, A / 2, B / 4], m![C # 32]>,
) {
buf.write(stream)
}
/// Broadcasting read: replicate elements using stride 0.
/// Time = m![T, A] broadcasts T temporally (same data repeated T times).
/// Packet = m![P] broadcasts P spatially (same element fills packet).
fn broadcasting_read<'l>(
buf: &'l BufTensor<i8, m![A]>,
) -> StreamTensor<'l, i8, m![T, A], m![P]> {
buf.read()
}
/// Broadcasting write: write broadcast stream back to buffer.
fn broadcasting_write(
buf: &mut BufTensor<i8, m![A]>,
stream: StreamTensor<i8, m![T, A], m![P]>,
) {
buf.write(stream)
}
}
Configuration
The compiler translates input and output tensor mappings into nested-loop configurations that the sequencer hardware executes.
Each configuration has the form [size_0 : stride_0, size_1 : stride_1, ...] : packet_size, where each entry’s subscript corresponds to its position in the loop nest (0 = outermost), represented by the following Rust type:
#![allow(unused)]
fn main() {
struct Config {
/// Each entry defines a nested loop level.
entries: Vec<Entry>,
/// Number of elements per hardware fetch.
packet_size: usize,
}
struct Entry {
/// Number of iterations for this loop level.
size: usize,
/// Memory address distance (in elements) to skip after each iteration.
stride: isize,
}
}
Each entry encodes one dimension of tensor traversal.
The size field determines how many times this loop iterates, while the stride field determines the memory offset between consecutive iterations.
Together, entries form nested loops that traverse memory.
Example: [N, C, H, W] ↔ [W, H, C, N]
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
fn write<'l, Time: M, Packet: M>(&mut self, stream: StreamTensor<'l, D, Time, Packet>) { let _ = stream; }
}
struct Config {
entries: Vec<Entry>,
packet_size: usize,
}
struct Entry {
size: usize,
stride: isize,
}
axes![N = 4, C = 3, H = 8, W = 8];
fn read_nchw_whcn(buf: &BufTensor<bf16, m![N, C, H, W]>) ->
StreamTensor<bf16, m![W, H, C, N], m![1]> {
// Compiler-generated configuration: [8 : 1, 8 : 8, 3 : 64, 4 : 192] : 1
let config = Config {
entries: vec![
Entry { size: 8, stride: 1 }, // W
Entry { size: 8, stride: 8 }, // H
Entry { size: 3, stride: 64 }, // C
Entry { size: 4, stride: 192 }, // N
],
packet_size: 1,
};
// The hardware executes the configuration as nested loops:
for w in 0..8 {
for h in 0..8 {
for c in 0..3 {
for n in 0..4 {
// Read each address
let addr = 1 * w + 8 * h + 64 * c + 192 * n;
// yield buf[addr];
}
}
}
}
buf.read()
}
fn write_whcn_nchw(buf: &mut BufTensor<bf16, m![N, C, H, W]>,
stream: StreamTensor<bf16, m![W, H, C, N], m![1]>) {
// The compiler generates an identical config for writing
// The hardware executes the configuration as nested loops:
for w in 0..8 {
for h in 0..8 {
for c in 0..3 {
for n in 0..4 {
// Write to each address
let addr = 1 * w + 8 * h + 64 * c + 192 * n;
// buf[addr] = stream.next();
}
}
}
}
}
}
The Time dimension represents logical iteration steps, not physical clock cycles.
The Packet dimension represents the logical unit of data processed per Time step.
The hardware computes fetch_size to determine the minimum number of fetch cycles required (see Fetch Engine for the constraints of fetch_size).
Configuration Examples
The compiler automatically derives configurations required to traverse memory. The following examples illustrate common patterns.
Rearranging Axes
Rearranging axes changes the traversal order of the tensor.
When Time specifies a different axis order than Buf, the compiler computes strides to traverse memory in the requested order.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 8, B = 8, C = 8];
fn read_rearranging<'l>(
buf: &'l BufTensor<i8, m![A, B, C # 32]>, // Buf
) -> StreamTensor<'l, i8, m![B, A], m![C # 16]> { // Time, Packet
// Compiler-generated configuration: [
// B -> 8 : 32,
// A -> 8 : 256,
// C # 16 -> 16 : 1,
// ] : 16
buf.read()
}
}
The compiler generates configuration entries by processing the combined mapping m![B, A, C # 16] term by term, transforming Buf along the way.
For each term, the entry size equals the term size, and the stride equals the volume that term occupies within the current Buf.
After processing a term, Buf is updated to reflect that the axis has been consumed:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
B | 8 : 32 | m![C # 32]::SIZE | m![A, 1 # 8, C # 32] |
A | 8 : 256 | m![1 # 8, C # 32]::SIZE | m![1 # 64, C # 32] |
C # 16 | 16 : 1 | contiguous (Packet dimension) | 1 # 2048 |
The maximum fetch_size is 16.
The innermost entry 16 : 1 has stride 1, making elements contiguous within the packet.
Splitting Axes
Splitting axes enables tiling by breaking logical axes into multiple entries. This is useful for cache efficiency or matching tensor unit buffer sizes.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 8, B = 8, C = 4];
fn read_splitting<'l>(
buf: &'l BufTensor<i8, m![A, B, C # 8]>, // Buf
) -> StreamTensor<'l, i8, m![A % 2, B % 4, A / 2, B / 4], m![C # 32]> { // Time, Packet
// The compiler generates: [
// A % 2 -> 2 : 64,
// B % 4 -> 4 : 8,
// A / 2 -> 4 : 128,
// B / 4 -> 2 : 32,
// C # 32 -> 32 : 1,
// ] : 32
buf.read()
}
}
Expressions like A % 2 and A / 2 split axis A into separate entries.
The compiler processes m![A % 2, B % 4, A / 2, B / 4, C # 32] term by term:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
A % 2 | 2 : 64 | m![B, C # 8]::SIZE | m![A / 2, 1 # 2, B, C # 8] |
B % 4 | 4 : 8 | m![C # 8]::SIZE | m![A / 2, 1 # 2, B / 4, 1 # 4, C # 8] |
A / 2 | 4 : 128 | m![1 # 2, B / 4, 1 # 4, C # 8]::SIZE | m![1 # 8, B / 4, 1 # 4, C # 8] |
B / 4 | 2 : 32 | m![1 # 4, C # 8]::SIZE | m![1 # 64, C # 8] |
C # 32 | 32 : 1 | contiguous (Packet dimension) | 1 # 512 |
The maximum fetch_size is 32.
Slicing Axes
Slicing reads only a partial range of indices from the memory layout. This arises from indexed views that select subsets of the original tensor.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 16, B = 8, C = 8];
fn read_slicing<'l>(
buf: &'l BufTensor<i8, m![A, B, C]>, // Buf
) -> StreamTensor<'l, i8, m![A / 4, A % 4 = 3, B / 4, B % 4 = 2], m![C]> { // Time, Packet
// Compiler-generated configuration: [
// A / 4 -> 4 : 256,
// A % 4 = 3 -> 3 : 64,
// B / 4 -> 2 : 32,
// B % 4 = 2 -> 2 : 8,
// C -> 8 : 1,
// ] : 8
buf.read()
}
}
The = 3 notation limits A % 4 to only 3 iterations instead of 4, restricting the hardware to a sub-region of the tensor.
The compiler processes m![A / 4, A % 4 = 3, B / 4, B % 4 = 2, C] term by term:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
A / 4 | 4 : 256 | m![A % 4, B, C]::SIZE | m![1 # 4, A % 4, B, C] |
A % 4 = 3 | 3 : 64 | m![B, C]::SIZE (sliced to 3) | m![1 # 16, B, C] |
B / 4 | 2 : 32 | m![B % 4, C]::SIZE | m![1 # 32, B % 4, C] |
B % 4 = 2 | 2 : 8 | m![C]::SIZE (sliced to 2) | m![1 # 128, C] |
C | 8 : 1 | contiguous (Packet dimension) | 1 # 1024 |
The maximum fetch_size is 8.
Broadcasting Axes
Broadcasting replicates tensor elements across multiple packets or time steps. Stride 0 causes the hardware to repeatedly read from the same memory location.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 16, T = 4, P = 4];
fn read_broadcasting<'l>(
buf: &'l BufTensor<i8, m![A]>, // Buf
) -> StreamTensor<'l, i8, m![T, A], m![P]> { // Time, Packet
// Compiler-generated configuration: [
// T -> 4 : 0, // temporal broadcast
// A -> 16 : 1,
// P -> 4 : 0, // spatial broadcast
// ] : 4
buf.read()
}
}
Axes not present in Buf get stride 0.
The compiler processes m![T, A, P] term by term:
| Term | Entry | Stride Source | Buf After |
|---|---|---|---|
T | 4 : 0 | not in Buf (broadcast) | m![A] |
A | 16 : 1 | A in m![A] | 1 # 16 |
P | 4 : 0 | not in Buf (broadcast) | 1 # 16 |
The maximum fetch_size is 4.
Since P has stride 0, the same element is replicated across the packet (spatial broadcast).
Since T has stride 0, the same data is repeated across time steps (temporal broadcast).
Merging Entries
When a transformation produces more than 8 entries, the compiler merges adjacent entries to meet hardware limits.
Adjacent entries (n1 : s1) and (n2 : s2) merge into (n1 * n2 : s2) when physically contiguous: s1 == n2 * s2.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![N = 8, C = 8, H = 8, W = 32];
fn read_merging<'l>(
buf: &'l BufTensor<i8, m![N, C, H, W]>, // Buf
) -> StreamTensor<'l, i8, m![W / 16, H % 2, H / 2, C / 2, C % 2, N / 2, N % 2, W / 8 % 2], m![W % 8]> { // Time, Packet
// Initial 9 entries:
// W / 16 -> 2 : 16,
// H % 2 -> 2 : 32,
// H / 2 -> 4 : 64,
// C / 2 -> 4 : 512,
// C % 2 -> 2 : 256,
// N / 2 -> 4 : 4096,
// N % 2 -> 2 : 2048,
// W / 8 % 2 -> 2 : 8,
// W % 8 -> 8 : 1,
// After merging to 6 entries:
// W / 16 -> 2 : 16,
// H % 2 -> 2 : 32,
// H / 2 -> 4 : 64,
// C -> 8 : 256, // merged C / 2 and C % 2
// N -> 8 : 2048, // merged N / 2 and N % 2
// W % 16 -> 16 : 1, // merged W / 8 % 2 and W % 8
buf.read()
}
}
The compiler processes m![W / 16, H % 2, H / 2, C / 2, C % 2, N / 2, N % 2, W / 8 % 2, W % 8] term by term, producing 9 initial entries:
| Term | Entry | Stride Source |
|---|---|---|
W / 16 | 2 : 16 | m![W % 16]::SIZE |
H % 2 | 2 : 32 | m![W]::SIZE |
H / 2 | 4 : 64 | m![H % 2, W]::SIZE |
C / 2 | 4 : 512 | m![C % 2, H, W]::SIZE |
C % 2 | 2 : 256 | m![H, W]::SIZE |
N / 2 | 4 : 4096 | m![N % 2, C, H, W]::SIZE |
N % 2 | 2 : 2048 | m![C, H, W]::SIZE |
W / 8 % 2 | 2 : 8 | m![W % 8]::SIZE |
W % 8 | 8 : 1 | contiguous (packet dimension) |
Since 9 entries exceed the hardware limit of 8, the compiler merges contiguous pairs where s1 == n2 * s2.
The entries for H % 2 -> (2 : 32) and H / 2 -> (4 : 64) are not merged because they are not physically contiguous (\(s_1 = 32 \neq n_2 \times s_2 = 4 \times 64 = 256\)).
The final configuration has 6 entries.
The last merge combines a temporal entry with the packet entry, increasing the packet size from 8 to 16.
| Term | Entry | Merged Entries |
|---|---|---|
W / 16 | 2 : 16 | |
H % 2 | 2 : 32 | |
H / 2 | 4 : 64 | |
C | 8 : 256 | C / 2 (4 : 512),C % 2 (2 : 256) |
N | 8 : 2048 | N / 2 (4 : 4096),N % 2 (2 : 2048) |
W % 16 | 16 : 1 | W / 8 % 2 (2 : 8),W % 8 (8 : 1) |
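To make the rule concrete, the following is an illustrative plain-Rust model of the merge step (not the compiler's implementation), reusing the Entry struct from the Configuration section. Applied to the nine entries above, it reaches the same six-entry result.
#![allow(unused)]
fn main() {
#[derive(Clone, Copy)]
struct Entry {
    size: usize,
    stride: isize,
}
/// Illustrative model of the merging rule above: repeatedly merge any adjacent
/// pair (outer n1 : s1, inner n2 : s2) with s1 == n2 * s2 into (n1 * n2 : s2).
/// Merging into the innermost (packet) entry increases the packet size accordingly.
fn merge_entries(mut entries: Vec<Entry>) -> Vec<Entry> {
    loop {
        let mergeable = (0..entries.len().saturating_sub(1)).find(|&i| {
            entries[i].stride == entries[i + 1].size as isize * entries[i + 1].stride
        });
        match mergeable {
            Some(i) => {
                let inner = entries[i + 1];
                entries[i].size *= inner.size;
                entries[i].stride = inner.stride;
                entries.remove(i + 1);
            }
            None => return entries,
        }
    }
}
// The nine entries from the example above collapse to six, within the 8-entry limit.
let entries = vec![
    Entry { size: 2, stride: 16 },   // W / 16
    Entry { size: 2, stride: 32 },   // H % 2
    Entry { size: 4, stride: 64 },   // H / 2
    Entry { size: 4, stride: 512 },  // C / 2
    Entry { size: 2, stride: 256 },  // C % 2
    Entry { size: 4, stride: 4096 }, // N / 2
    Entry { size: 2, stride: 2048 }, // N % 2
    Entry { size: 2, stride: 8 },    // W / 8 % 2
    Entry { size: 8, stride: 1 },    // W % 8
];
assert_eq!(merge_entries(entries).len(), 6);
}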
Configuration Failures
In TCP, violating any of the following limits causes a compilation error:
- Entry limit: Maximum 8 entries; the compiler merges adjacent entries where possible (see Merging Entries in Configuration Examples).
- Iteration limit: size <= 65,536 per entry.
- Packet size: Must be 1, 2, 4, 8, 16, or 32 bytes.
- Packet fetch: The innermost entry n : s must satisfy one of:
  - Contiguous access (adjacent elements): (s == 0 || s == 1) && n % packet_size == 0
  - Discrete access (single-element packets): packet_size == 1
If merging fails or limits are exceeded, redesign the tensor mapping or split the operation across multiple sequencer calls. The following examples illustrate common failure cases.
Insufficient Input
The temporal mapping Time attempts to iterate over indices that do not exist in Buf.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![N = 2048];
fn read_insufficient<'l>(
buf: &'l BufTensor<i8, m![N % 512]>, // Buf
) -> StreamTensor<'l, i8, m![N / 512], m![N % 512]> { // Time, Packet
buf.read() // Compilation error: insufficient input
}
}
Time requires N / 512, but Buf only contains N % 512.
The buffer does not have the indices that the temporal mapping tries to iterate over.
Incompatible Shapes
The buffer and stream mappings have the same total size but different mathematical structures that no single sequencer configuration can reconcile.
#![allow(unused)]
fn main() {
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
use std::marker::PhantomData;
struct BufTensor<D: Scalar, Buf: M>(PhantomData<(D, Buf)>);
struct StreamTensor<'l, D: Scalar, Time: M, Packet: M>(PhantomData<&'l (D, Time, Packet)>);
impl<D: Scalar, Buf: M> BufTensor<D, Buf> {
fn read<'l, Time: M, Packet: M>(&'l self) -> StreamTensor<'l, D, Time, Packet> { unimplemented!() }
}
axes![A = 15];
fn read_incompatible<'l>(
buf: &'l BufTensor<i8, m![A % 5, A / 5]>, // Buf
) -> StreamTensor<'l, i8, m![1], m![A % 3, A / 3]> { // Time, Packet
buf.read() // Compilation error: incompatible shapes
}
}
Both Buf and Packet represent 15 elements, but their internal index mappings differ.
The buffer uses a base-5 decomposition (A % 5, A / 5) while the packet uses a base-3 decomposition (A % 3, A / 3).
These are mathematically incompatible: there is no way to traverse memory in one pattern to produce the other.
The compiler detects this using a “factorizable” concept: it attempts to decompose axes with the same label into (inner, intersection, outer) components to find a common representation.
When no such factorization exists, the configuration is rejected.
Indirect Access
Standard sequencer entries use fixed strides: the memory offset between iterations is constant.
IndirectLoop extends this by allowing variable offsets per iteration, enabling gather operations with data-dependent access patterns.
The standard pattern (limit, stride) becomes (limit, [offset0, offset1, ...]), where each iteration uses a different offset from the provided sequence.
This supports operations like embedding lookups where indices are determined at runtime.
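A plain-Rust model of the difference is sketched below. The types and names are illustrative only and are not part of the Virtual ISA API; the sketch just expands each loop level into the sequence of address offsets it visits.
#![allow(unused)]
fn main() {
/// Illustrative model only; the actual IndirectLoop configuration belongs to
/// the sequencer hardware interface, not this Rust type.
enum Loop {
    /// Standard entry: `limit` iterations, constant `stride` between them.
    Strided { limit: usize, stride: isize },
    /// Indirect entry: `offsets.len()` iterations, each using its own offset.
    Indirect { offsets: Vec<isize> },
}
/// Expands one loop level into the sequence of address offsets it visits.
fn offsets(level: &Loop) -> Vec<isize> {
    match level {
        Loop::Strided { limit, stride } => (0..*limit as isize).map(|i| i * stride).collect(),
        Loop::Indirect { offsets } => offsets.clone(),
    }
}
// A strided level visits 0, 8, 16, 24; an indirect level can gather
// arbitrary, runtime-determined positions such as embedding-table rows.
assert_eq!(offsets(&Loop::Strided { limit: 4, stride: 8 }), vec![0, 8, 16, 24]);
assert_eq!(offsets(&Loop::Indirect { offsets: vec![40, 0, 24, 0] }), vec![40, 0, 24, 0]);
}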
Fetch Engine
The Tensor Unit is a pipeline of engines (Switch, Collect, Contraction, Vector, Cast, Transpose, Commit) that processes tensor data. It cannot operate directly on tensors stored in DM: data must first be converted into a stream of fixed-size packets that flow through the compute pipeline. The Fetch Engine performs this conversion: it reads tensor data from DM and produces packet streams for the rest of the Tensor Unit.
The Fetch Engine operates in two stages:
- Fetch Sequencer: Reads tensor data from DM using nested-loop configurations that define access patterns.
- Fetch Adapter: Post-processes streams through masking, type conversion, and batching to produce computation-ready packets.
Additional sections cover the interaction with the Switch Engine and performance guidelines.
As a kernel writer, you control the Time and Packet type parameters of .fetch(), which determine packet size and iteration count.
The compiler derives the sequencer loop configuration and stride calculation.
For performance implications of Packet choices, see Memory Performance.
Interface
The Fetch Engine implements a logical tensor move from DM to tensor streams:
impl<'l, const T: Tu, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
BeginTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Performs fetch operation to create a fetched tensor.
#[primitive(BeginTensor::fetch)]
pub fn fetch<D2: Scalar, Time2: M, Packet2: M>(self) -> FetchTensor<'l, T, D2, Chip, Cluster, Slice, Time2, Packet2>
where
D: FetchCast<D2>,
{
assert_eq!(Cluster::SIZE, 2, "Cluster size must be 2, got {}", Cluster::SIZE);
assert_eq!(Slice::SIZE, 256, "Slice size must be 256, got {}", Slice::SIZE);
let packet_bytes = D2::size_in_bytes_from_length(Packet2::SIZE);
assert_eq!(
packet_bytes % FETCH_ALIGN_BYTES,
0,
"Fetch output packet must be {FETCH_ALIGN_BYTES}-byte aligned, got {packet_bytes} bytes.",
);
FetchTensor::new(self.ctx, self.inner.map(|v| v.map(|v| v.cast())).transpose(true))
}
}
The resulting FetchTensor represents a stream of packets flowing through the Tensor Unit pipeline.
The mapping [Chip, Cluster, Slice, Time, Packet] distributes data across hardware and time.
Time2 represents the temporal iteration mapping, while Packet2 is the packet shape within each cycle.
The output packet size must be 8-byte aligned (a multiple of the fetch sequencer’s read granularity).
The output type D2 supports type casting (such as i8 to i32).
Notice that .fetch() does not have an output parameter for Slice: each slice independently reads its own DM data, so the Slice mapping is inherited unchanged from the input BeginTensor.
To redistribute data across slices, use the Switch Engine.
Examples
A matrix stored as 8-bit integers needs conversion to 32-bit integers for computation:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, B = 32];
/// Fetches matrix data from DM, casting i8 to i32.
fn fetch_matrix_example<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A, B]>,
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![A], m![B]> {
input.fetch()
}
}
The input BeginTensor represents data in DM.
The output FetchTensor represents a packet stream: 512 rows over time, each containing 32 i32 (128 bytes).
The compiler automatically configures the sequencer and adapter from the output type signature.
For the StreamTensor type hierarchy across pipeline stages, see Memory and Stream.
Fetch Sequencer
The Fetch Sequencer defines the memory access pattern: which addresses to read, in what order, and how to package the data into packets. Each slice executes its own sequencer independently, enabling parallel data movement.
Sequencers typically use homogeneous configurations: each slice processes the same pattern on its local data partition. The hardware also supports heterogeneous configurations where different slices execute different access patterns simultaneously.
Constraints
The sequencer has physical limits that must be respected:
- Chip::SIZE \(=\) number of chips in the system.
- Cluster::SIZE \(=\) 2 (clusters per chip).
- Slice::SIZE \(=\) 256 (slices per cluster).
Note
These exact-match constraints are a current limitation: the runtime operates at chip granularity (#[device(chip = N)]), so partial chip or cluster usage is not yet supported. Use the # padding operator (e.g., m![1 # 2] for a single logical cluster) to fill unused positions. This may be relaxed in future releases.
- Fetch addresses must be 1-byte aligned (minimal constraint), but 8-byte alignment is required for certain DMA operations.
- Depending on the context, the fetch_size is restricted to certain values:
  - The main-context supports a fetch_size of 1, 2, 4, 8, 16, or 32 bytes (see main-context).
  - The sub-context supports a fetch_size of 4 bytes (when casting from i4 to i32), or 8 bytes (otherwise) (see sub-context).
- fetch_size is determined as the largest supported divisor of gcd(packet_size, contiguous_sram_access_size):
  - fetch_size must divide packet_size because data fetched in a single memory read cannot be split across different packets.
  - fetch_size must divide contiguous_sram_access_size because a single memory fetch can only read physically contiguous data.
The contiguous_sram_access_size represents the total byte size of contiguous elements in memory that can be accessed without stride discontinuities. It is derived from the sequencer configuration by multiplying the sizes of consecutive physically contiguous entries, from the innermost to the outermost level. Two adjacent entries—an outer (n1 : s1) and an inner (n2 : s2)—are physically contiguous if s1 == n2 * s2.
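To make the derivation concrete, here is a small illustrative model (not the compiler's implementation) of contiguous_sram_access_size and fetch_size, reusing the Entry struct from the Configuration section. The helper names and the 1-byte element size are assumptions for the sketch; the values reproduce the three_axes_non_contiguous example below.
#![allow(unused)]
fn main() {
struct Entry {
    size: usize,
    stride: isize,
}
/// Total bytes of physically contiguous access, found by walking entries from
/// the innermost (last) level outward while adjacency holds: s1 == n2 * s2.
/// `elem_bytes` is the element size in bytes.
fn contiguous_sram_access_size(entries: &[Entry], elem_bytes: usize) -> usize {
    let mut contiguous = 1usize;
    for pair in entries.windows(2).rev() {
        let (outer, inner) = (&pair[0], &pair[1]);
        contiguous *= inner.size;
        if outer.stride != inner.size as isize * inner.stride {
            return contiguous * elem_bytes;
        }
    }
    contiguous * entries.first().map_or(1, |e| e.size) * elem_bytes
}
/// Largest supported divisor of gcd(packet_size, contiguous_sram_access_size).
fn fetch_size(packet_size: usize, contiguous: usize) -> usize {
    let supported = [32, 16, 8, 4, 2, 1];
    let gcd = |mut a: usize, mut b: usize| { while b != 0 { let t = a % b; a = b; b = t; } a };
    let g = gcd(packet_size, contiguous);
    *supported.iter().find(|&&s| g % s == 0).unwrap()
}
// three_axes_non_contiguous below: entries [3:32, 4:96, 4:8, 8:1] with i8 elements.
// Only H (4 : 8) and W (8 : 1) are contiguous, so the contiguous size is 8 * 4 = 32 bytes.
let entries = [
    Entry { size: 3, stride: 32 },
    Entry { size: 4, stride: 96 },
    Entry { size: 4, stride: 8 },
    Entry { size: 8, stride: 1 },
];
assert_eq!(contiguous_sram_access_size(&entries, 1), 32);
assert_eq!(fetch_size(8, 32), 8);
}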
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![N = 4, C = 3, H = 4, W = 8];
// Compiler-generated configuration: [
// N -> 4 : 96, (96 == 3 * 32, contiguous)
// C -> 3 : 32, (32 == 4 * 8, contiguous)
// H -> 4 : 8, (8 == 8 * 1, contiguous)
// W -> 8 : 1, (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 * 4 * 3 * 4 = 384
fn fully_contiguous<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![N, C, H], m![W]> {
input.fetch()
}
// Compiler-generated configuration: [
// C -> 3 : 32, (32 != 4 * 96, NOT contiguous)
// N -> 4 : 96, (96 != 4 * 8, NOT contiguous)
// H -> 4 : 8, (8 == 8 * 1, contiguous)
// W -> 8 : 1, (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 * 4 = 32
fn three_axes_non_contiguous<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![C], m![N, H, W]> {
input.fetch()
}
// Compiler-generated configuration: [
// N -> 4 : 96, (96 != 4 * 8, NOT contiguous)
// H -> 4 : 8, (8 != 3 * 32, NOT contiguous)
// C -> 3 : 32, (32 != 8 * 1, NOT contiguous)
// W -> 8 : 1, (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8
fn four_axes_non_contiguous<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![N, H, C, W]> {
input.fetch()
}
}
For detailed information on how packet size interacts with memory access patterns and sequencer configuration, see the sequencer configuration.
Note
The optimal sequencer configuration is automatically generated by the compiler based on the output type of fetch(). Users do not directly specify sequencer configurations in Virtual ISA. Similarly, fetch_size and contiguous_sram_access_size are automatically derived by the compiler, not directly specified by users.
Optimizations
Different configurations can achieve the same tensor move with varying efficiency. Two key optimizations dramatically improve performance: padding packets to maximize bandwidth and interleaving tensors to combine operations.
Padding Packets
Padding packets to full hardware bandwidth drastically reduces fetch cycles.
The increased packet size allows the compiler to increase fetch_size, which reduces the number of fetch cycles needed to transfer the same amount of data.
The following example demonstrates this effect:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 3, B = 5, C = 2];
/// Smallest packet: only C dimension (2 bytes). Takes 15 cycles.
fn fetch_packet_C<'l, const T: Tu>(
input: BeginTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![A, B, C]>,
) -> FetchTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![A, B], m![C]> {
input.fetch()
}
/// Medium packet: B and C dimensions padded to 16 bytes. Takes 3 cycles.
fn fetch_packet_BC<'l, const T: Tu>(
input: BeginTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![A, B, C]>,
) -> FetchTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![A], m![[B, C] # 16]> {
input.fetch()
}
/// Largest packet: all dimensions padded to 32 bytes. Takes 1 cycle.
fn fetch_packet_ABC<'l, const T: Tu>(
input: BeginTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![A, B, C]>,
) -> FetchTensor<'l, T, f8e4m3, m![1], m![1], m![1], m![1], m![[A, B, C] # 32]> {
input.fetch()
}
}
Padding reads beyond the actual data, but this is safe because padding values do not affect computation.
Note that different padding strategies produce different FetchTensor mappings, which may affect downstream components.
Interleaving Tensors
Interleaving combines two tensors with identical mappings into a single sequencer operation, reducing overhead when both tensors are needed for the same computation.
An explicit axis is introduced in the Time dimension to encode alternation between the two tensors.
In the following example, the main context creates an interleaved tensor using begin_interleaved().
This introduces an axis I = 2 in the Time dimension, which encodes alternation between the two tensors.
The first temporal iteration fetches from lhs, the second iteration fetches from rhs,
the third iteration fetches the next packet from lhs, and so on, continuing this alternating pattern.
At most two tensors can be interleaved in a single fetch operation.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, B = 32, I = 2];
/// Interleaves two input tensors into a single packet stream.
/// Useful for operations like 'input1 + input2' in the Vector Engine.
/// The interleaved BeginTensor is created via Tu.begin_interleaved().
/// The `I = 2` axis in Time encodes alternation between the two tensors.
fn fetch_interleaved<'l>(
ctx: &'l mut Context,
lhs: &'l DmTensor<i8, m![1], m![1], m![1], m![A, B]>,
rhs: &'l DmTensor<i8, m![1], m![1], m![1], m![A, B]>,
) -> FetchTensor<'l, { Tu::Main }, i8, m![1], m![1], m![1], m![A, I], m![B]> {
ctx.main.begin_interleaved::<I, _, _, _, _, _>(lhs.view(), rhs.view()).fetch()
}
}
Fetch Adapter
The Fetch Adapter transforms raw packet streams into computation-ready format through five stages: masking, table indexing, type casting, zero-point subtraction, and batching. The main-context adapter supports all five stages, while sub-context adapters only support zero-point subtraction.
Sequencer and Adapter Interaction
The Fetch Sequencer operates with a fixed stream mapping: (Slice, Time) → Packet.
It controls what memory addresses are read and how elements are packed.
While structural transformations happen during memory access, element-wise transformations happen during adapter processing.
To achieve a specific data layout, reshape the tensor before the sequencer reads it, then apply value-wise transformations in the adapter stage.
For example, consider a tensor with S = [8_a], Element = a, Time = a ! 4. Slicing can be expressed as S = [4_a], Element = a + 4, Time = a. With offset consideration, S = [8_a], Element = a, Time = a @ 4 becomes S = [8_a], Element = 4 + a, Time = a. In principle, if we reshape the tensor appropriately, all forms could be expressed this way, though the necessity of this specific approach warrants further investigation.
Masking
The Tensor Unit requires data in power-of-two sizes for efficient processing, because its internal data paths operate on fixed-width units (32-byte flits containing 8 elements of 32-bit data).
A 63-element axis must be padded to 64 elements.
Without masking, the padded element might contain an arbitrary value that corrupts operations like sum or max.
Masking forces padded elements to neutral values so they do not influence the result.
For example, the Reducer sums elements along an axis. Summing 63 real elements plus 1 arbitrary padded value produces an incorrect result. Masking sets that padded element to zero.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 63];
/// Fetches with automatic masking: pads 63 elements to 64, masking the padding.
/// Hardware automatically masks the 64th element to zero
/// so reduce operations compute correctly on 63 valid elements.
fn fetch_with_masking<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A # 64]> {
input.fetch()
}
}
Masking Configuration
The Fetch Engine supports masking for innermost axes with padding on both sides, expressed as (# n + A + # m) where n is left padding, A is valid data, and m is right padding.
The hardware provides three masking cases to handle different padding scenarios, each optimized for specific padding patterns and axis sizes.
All masking configurations use three key parameters:
- last_dim: Specifies the dimension index to apply masking to.
- left_pad: Masks the first left_pad elements when the index of last_dim is 0.
- last_dim_rightmost_valid_count[0]: Masks dim0 - last_dim_rightmost_valid_count[0] elements from the right when the last_dim index is the last. This value is limited to 0-255 for 4-bit types, 0-31 for f32, as the final packet size must not exceed 256 bytes.
Example (Padding case 1)

- axes![A = 32, B = 90]
- dtype = i8
- base_addr = 0
- Element = m![A, B # 96]
- Configuration: last_dim = 1, lpad = 2, last_dim_rightmost_valid_count[0] = 4, pad_value = 0
- Stream mapping: let B' = # 2 + B + # 4 in { Time: m![A, B' / 32], Packet: m![B' % 32] }
- Sequencer configuration: [A = 32 : 96, B' / 32 = 3 : 32, B' % 32 = 32 : 1] : 32 @ base_addr = -2
- Packet size: m![B' % 32]::SIZE = 32
- Cycles: Time::SIZE = m![A, B' / 32]::SIZE = 32 * 3 = 96
- Result: The first 2 and last 4 values of (# 2 + B + # 4) are masked to 0.
Example (Padding case 2)
Case 2 handles the same masking as Case 1, but for non-contiguous padding regions that are split across the data.

- axes![A = 32, B = 90]
- dtype = i8
- base_addr = 0
- Element = m![A, B # 96]
- Configuration: last_dim = 0, lpad = 2, last_dim_rightmost_valid_count[0] = 4, pad_value = 0
- Stream mapping: let B' = # 2 + B + # 4 in { Time: m![B' / 32, A], Packet: m![B' % 32] }
- Sequencer configuration: [B' / 32 = 3 : 32, A = 32 : 96, B' % 32 = 32 : 1] : 32 @ base_addr = -2
- Packet size: m![B' % 32]::SIZE = 32
- Cycles: Time::SIZE = m![B' / 32, A]::SIZE = 3 * 32 = 96
- Result: The first 2 and last 4 values of (# 2 + B + # 4) are masked to 0.
Example (Padding case 3)

Case 3 supports larger right padding values through per-index masking. Cases 1 and 2 limit right padding to 255 * 4-bit, but Case 3 removes this limitation:
- Each entry index i uses its own last_dim_rightmost_valid_count[i] value.
- Supports last_dim_rightmost_valid_count[0..8] when the axis size is 8 or less.
Consider the following:
- axes![A = 32, B = 97]
- dtype = f32
- base_addr = 0
- Element = m![A, B # 128]
- Stream mapping: let B' = B + # 31 in { Time: m![A, B' / 16, 1], Packet: m![B' % 16] }
- Sequencer configuration: [A = 32 : 128, B' / 16 = 8 : 16, 1 = 1 : 0, B' % 16 = 16 : 1] : 16 @ base_addr = -2
- Packet size: m![B' % 16]::SIZE = 16
- Cycles: Time::SIZE = m![A, B' / 16, 1]::SIZE = 32 * 8 * 1 = 256
- Configuration: last_dim_rightmost_valid_count_dim = 1, last_dim = 2, last_dim_rightmost_valid_count[0..8] = [16, 16, 16, 16, 16, 16, 1, 0], pad_value = 0
- Result: Of (B + # 31), 97 elements are valid and 31 are masked as invalid.
Table Indexing
Some operations cannot be efficiently implemented with standard arithmetic. Non-linear activation functions like Sigmoid and GeLU require expensive approximations, and certain quantization schemes use custom encoding tables.
Table indexing provides hardware-accelerated lookup tables during the fetch stage. Each value is treated as an index into a pre-configured table, and the corresponding table entry is output instead. This enables:
- Non-linear activations: Implement Sigmoid, GeLU, and other functions through pre-computed lookup tables.
- Custom type casting: Translate specialized encodings like
MXFP4to standard formats using conversion tables.
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
/// Fetches with table lookup: each input value indexes into a pre-configured table.
/// Input [0, 1, 2, 3, 4, 5, 6, 7] with table[x] = 2*x
/// Output [0, 2, 4, 6, 8, 10, 12, 14]
fn fetch_with_table<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
table: &LookupTable<i8, i8>,
) -> FetchTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]> {
input.fetch_with_table(table)
}
Performance
Table indexing can introduce performance overhead compared to direct memory fetches. Key considerations include:
- Additional latency for table lookup operations
- Potential bandwidth limitations when lookup tables are accessed
- Impact on pipeline throughput when table access becomes a bottleneck
Type Casting
The Fetch Adapter converts element types as data streams from DM, enabling computation on data stored at lower precision than the compute pipeline requires.
RNGD supports the following conversions:
| Input | Output |
|---|---|
i4 | i5, i32 |
i8 | i9, i32 |
i16 | i32 |
f8e4m3 | f32 |
f8e5m2 | f32 |
bf16 | f32 |
f16 | f32 |
f32 | bf16 |
RNGD-S supports the following additional type conversions:
| Input | Output |
|---|---|
i4 | i9 |
i16 | i9 |
f8e4m3 | bf16 |
f8e5m2 | bf16 |
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
/// Fetches with type casting: converts i8 storage to i32 for computation.
/// Input: i8 [0, 1, 2, 3, 4, 5, 6, 7]
/// Output: i32 [0, 1, 2, 3, 4, 5, 6, 7]
fn fetch_with_type_cast<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![1], m![A]> {
input.fetch()
}
}
Constraints
Type casting requires the data produced by a single fetch operation to not exceed 32 bytes.
This constraint exists as Fetch Engine packets are forwarded through the Switch Engine and Collect Engine, which normalizes packets into flits (flow control units) for the Compute Engine. Each flit is 32 bytes (single-channel mode) or 64 bytes (dual-channel mode).
The 32-byte limit avoids flit overflow. See Fetch Engine and Switch Engine Interaction for more details.
Consider the following examples:
- Valid: i4 to i32 conversion with a fetch_size of 4 bytes
  - Fetches 4 bytes (8 elements of i4)
  - Produces 32 bytes (8 elements of i32)
  - Output size: 32 bytes ✓
- Invalid: i4 to i32 conversion with a fetch_size of 8 bytes
  - Fetches 8 bytes (16 elements of i4)
  - Produces 64 bytes (16 elements of i32)
  - Output size: 64 bytes ✗ (exceeds 32-byte limit)
- Invalid: i8 to i32 conversion with a fetch_size of 16 bytes
  - Fetches 16 bytes (16 elements of i8)
  - Produces 64 bytes (16 elements of i32)
  - Output size: 64 bytes ✗ (exceeds 32-byte limit)
This constraint affects the allowed fetch_size values in sub-context operations:
- When casting from i4 to i32, fetch_size must be 4 bytes.
- For other conversions, fetch_size must be 8 bytes.
Zero-Point Subtraction
Quantization schemes are either symmetric (centered around zero) or asymmetric (shifted by an offset called the zero-point). Asymmetric quantization represents data ranges more efficiently but requires subtracting the zero-point before computation. When converting from quantized integers to computation types, the hardware can simultaneously subtract this offset.
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8];
/// Fetches with zero-point subtraction for asymmetric quantization.
/// Input: i8 [0, 1, 2, 3, 4, 5, 6, 7], with zero_point = 10
/// Output: i32 [-10, -9, -8, -7, -6, -5, -4, -3]
fn fetch_with_zero_point<'l, const T: Tu>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A]>,
zero_point: i8,
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![1], m![A]> {
input.fetch_with_zero_point(zero_point)
}
Interleaving fetches enable subtracting different zero points from each tensor. This is useful when combining tensors with different quantization parameters.
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, I = 2];
/// Fetches interleaved tensors with different zero points per tensor.
/// Input1: [0, 1, 2, 3, 4, 5, 6, 7], with zero_point = 100
/// Input2: [0, 1, 2, 3, 4, 5, 6, 7], with zero_point = -100
/// Output interleaved: [-100, -99, ..., 100, 101, ...]
fn fetch_interleaved_with_zero_points<'l, const T: Tu, const I: Ident>(
input: BeginTensor<'l, T, i8, m![1], m![1], m![1], m![I], m![A]>,
zero_points: [i8; 2],
) -> FetchTensor<'l, T, i32, m![1], m![1], m![1], m![I], m![A]> {
input.fetch_with_zero_points(zero_points)
}
Batching
Memory systems have a minimum efficient transfer size to make full use of each memory access.
When a tensor’s natural packet size is smaller than this threshold, fetching each packet individually wastes bandwidth.
Batching combines multiple small packets into a single larger transfer by grouping consecutive time steps: the fetches_per_packet value determines how many individual fetches are combined into one packet.
With a fetch_size of 8 bytes and fetches_per_packet of 5, the adapter groups 5 fetches together to create a single 40-byte packet.
Note
The fetches_per_packet value is derived by the compiler from the output type of fetch(). Users do not directly specify fetches_per_packet.
Tip
Prefer fetching large packets directly from DM using the Fetch Sequencer. Use Fetch Adapter batching only when large packets cannot be retrieved in one cycle due to memory layout constraints, such as a 24-byte packet spread across three non-contiguous 8-byte locations.
The total number of cycles required to fetch the data is: $$ \text{\#cycles} = \texttt{Time::SIZE} \times \left\lceil \frac{\texttt{Packet::SIZE}}{\texttt{fetch\_size}} \right\rceil $$
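For example, fetch_batch_4 below has Time::SIZE = 4, a 96-byte packet, and fetch_size = 32 bytes, so #cycles = 4 × ceil(96 / 32) = 12.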
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![N = 4, C = 3, H = 4, W = 8];
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H = 4 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H, W]::SIZE = 384
/// packet_size = 8 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 8 bytes
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 1
/// #cycles = 48
fn fetch_batch_1<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N, C, H], m![W]> {
input.fetch()
}
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H / 2 = 2 : 16, H % 2 = 2 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H / 2, H % 2, W]::SIZE = 384
/// packet_size = 16 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 16 bytes
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 1
/// #cycles = 24
fn fetch_batch_2<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N, C, H / 2], m![H % 2, W]> {
input.fetch()
}
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H = 4 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H, W]::SIZE = 384
/// packet_size = 32 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 32 bytes
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 32 / 32 = 1
/// #cycles = 12
fn fetch_batch_3<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N, C], m![H, W]> {
input.fetch()
}
/// Sequencer config: [N = 4 : 96, C = 3 : 32, H = 4 : 8, W = 8 : 1].
/// contiguous_sram_access_size = m![N, C, H, W]::SIZE = 384
/// packet_size = 96 bytes
/// fetch_size = gcd(packet_size, contiguous_sram_access_size) = 32 bytes
/// fetch_size should be in {1, 2, 4, 8, 16, 32}
/// batching_factor (fetches_per_packet) = packet_size / fetch_size = 96 / 32 = 3
/// #cycles = 12
fn fetch_batch_4<'l, const T: Tu>(
input: BeginTensor<'l, T, i4, m![1], m![1], m![1], m![1], m![N, C, H, W]>,
) -> FetchTensor<'l, T, i4, m![1], m![1], m![1], m![N], m![C, H, W]> {
input.fetch()
}
}
Constraints
- The output packet size must be 8-byte aligned (a multiple of the fetch sequencer’s read granularity).
- The Time dimension size must be divisible by the batching factor.
- Type conversion is limited to specific type pairs (see Type Casting).
- Zero-point values must fit within the target data type’s representable range.
- The underlying sequencer produces base packets of 1, 2, 4, 8, 16, or 32 bytes (see Sequencer Constraints).
Fetch Engine and Switch Engine Interaction
After batching, Fetch Engine packets are forwarded to the Switch Engine, where data is routed through an interconnect network of slices. A network topology determines the distribution pattern of packets between slices. The packet passes through the Switch Engine unchanged.
After switching, the Collect Engine normalizes packets to 32-byte flits (flow control units used by all downstream engines).
The Collect Engine pads packets that are not divisible by the flit size (32 bytes) with zeros.
For example, for fetch_size = 8 bytes and fetches_per_packet = 5, the Fetch Adapter batches 5 fetches together, producing a 40-byte packet.
Since it is not 32-byte aligned, the Collect Engine adds 24 bytes of zero padding, producing a 64-byte packet (2 flits).
The hardware operates over flits (flow control units), which are the physical unit of data transfer. The Tensor Unit has two Switch Engines (one for each context), each with a 32-byte data width. Throughput and flit size depend on the configured mode:
- Single channel mode: A single flit has 32 bytes. Half of the available bandwidth is used.
- Dual channel mode: The main and sub contexts are combined to produce 64-byte flits. Dual channel mode requires explicit configuration. The compiler does not generate it automatically.
Forwarding 32-byte aligned packets from the Fetch Engine avoids wasted bandwidth in the Collect Engine. An unaligned packet requires padding: for example, for a 20-byte packet, \(\frac{32 - 20}{32} \approx 37.5\%\) of the flit payload is unused.
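As a rough model of this arithmetic (illustrative only; the names here are not part of the API), the wasted fraction for a given packet size can be computed as:
#![allow(unused)]
fn main() {
const FLIT_BYTES: usize = 32;
/// Fraction of flit payload wasted on zero padding for a given packet size,
/// based on the Collect Engine's 32-byte flit padding described above.
fn wasted_fraction(packet_bytes: usize) -> f64 {
    let flits = packet_bytes.div_ceil(FLIT_BYTES);
    let padding = flits * FLIT_BYTES - packet_bytes;
    padding as f64 / (flits * FLIT_BYTES) as f64
}
assert_eq!(wasted_fraction(32), 0.0);   // aligned: no padding
assert_eq!(wasted_fraction(20), 0.375); // 12 of 32 bytes padded, as above
assert_eq!(wasted_fraction(40), 0.375); // a 40-byte packet needs 2 flits (64 bytes)
}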
Tip
Prefer 32-byte aligned packets over unaligned ones.
Performance
Memory Bandwidth
Peak DM bandwidth is 256 B/cycle with proper DMN interleaving (see Memory Performance for the interleaving technique). Contiguous accesses enable parallel bank access; distributing fetches across slices maximizes parallelism. See Memory Performance for optimization strategies.
Adapter Overhead
Each adapter stage adds minimal latency: batching must accumulate fetches_per_packet packets, type conversion takes 1–2 cycles.
Bank Starvation
The Fetch Engine shares DM bank access with Commit and DMA Engines. Fetch operations have higher priority than DMA, but consecutive accesses to the same bank (64+ accesses) can starve the DMA Engine. The compiler prevents this by avoiding concurrent scheduling of problematic patterns. See Bank Starvation for details.
Dual Channel Mode Benefits
When both main and sub contexts are available:
- Bandwidth doubles to 64 bytes per cycle
- Fetch cycles halve for large transfers
- Trade-off: sub-context unavailable for independent operations
Packet Alignment
Prefer 32-byte aligned packets over unaligned ones. See Fetch Engine and Switch Engine Interaction for more details.
Commit Engine
The Commit Engine writes Tensor Unit results back to DM (Data Memory), the primary on-chip SRAM tier. It implements a logical tensor move from Tensor Unit streams to SRAM, writing each slice’s result to its designated DM address.
After the Tensor Unit completes computation, results exist as streaming packets distributed across slices. The Commit Engine transforms these packets through an adapter (truncating) and writes them to DM via a sequencer. This page covers the interface and examples, the adapter stages, the sequencer, sub-context operations, and performance guidelines.
Interface
impl<'l, const T: Tu, P: Position, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
StreamTensor<'l, { T }, P, D, Chip, Cluster, Slice, Time, Packet>
{
/// Commits to the data memory.
#[primitive(StreamTensor::commit)]
pub fn commit<Element: M>(self, address: Address) -> DmTensor<D, Chip, Cluster, Slice, Element> {
verify_commit::<D, Time, Packet, Element>();
DmTensor::new(self.inner.transpose(false), address)
}
/// Commits to mutable tensor view in the data memory.
#[primitive(StreamTensor::commit_view)]
pub fn commit_view<Element: M>(self, mut dst: DmTensorViewMut<'l, D, Chip, Cluster, Slice, Element>) {
verify_commit::<D, Time, Packet, Element>();
dst.inner.write_transpose(self.inner.view(), false);
}
}
The Commit Engine mirrors the Fetch Engine’s structure, but operates in reverse.
For detailed examples, see kernel examples.
Examples
Consider storing a matrix multiplication result C = A * B back to DM after computation.
The Cast Engine converts the Contraction Engine’s f32 packet elements to bf16 to save space.
The Commit Engine stores the resulting tensor to DM.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, M = 16, N = 8];
fn cast_commit<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![P], m![M], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![P], m![M, N # 16]> {
// Cast f32 to bf16 (Cast Engine), then commit to DM (Commit Engine).
// Input: M = 16 time steps, N = 8 f32 elements per packet (32 bytes).
// After cast: N = 8 bf16 elements padded to 16 (32 bytes).
// The sequencer writes across P = 256 slices.
input.cast::<bf16, m![N # 16]>().commit(0)
}
}
Adapter
The adapter transforms stream packets before writing to DM via truncating.
The main context and sub-context adapters both support truncating. The sub-context is typically used for prefetching to TRF/VRF.
Truncating
Truncating reduces packet size by keeping only the leading elements.
The input packet is always a full 32-byte flit.
The commit_in_size parameter controls how many bytes are actually written to DM: 8, 16, 24, or 32 bytes (where 32 bytes means no reduction).
This operation is typically used to discard trailing padding elements or to satisfy downstream alignment constraints.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, K = 2, W = 8, N = 16, J = 64];
fn i8_padding_truncation<'l, const T: Tu>(
input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![M, K], m![W # 32]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![M, K, W]> {
// Input: 8 i8 elements padded to 32 (32 bytes per packet).
// Truncation removes padding: only the 8 leading elements are written to DM.
// commit_in_size = 8 elements × 1 byte = 8 bytes.
input.commit(0)
}
fn f32_non_padding_truncation<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K], m![W]>,
) -> DmTensor<f32, m![1], m![1], m![1], m![M, K, W = 4]> {
// Input: 8 f32 elements (32 bytes per packet).
// Truncation: only the first 4 elements are written to DM.
// commit_in_size = 4 elements × 4 bytes = 16 bytes.
input.commit(0)
}
fn bf16_truncation_with_transpose<'l, const T: Tu>(
input: CastTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![1], m![K, M, N = 8]> {
// Input: 16 bf16 elements (32 bytes per packet).
// Truncation: only the leading 8 elements are written to DM.
// commit_in_size = 8 elements × 2 bytes = 16 bytes.
// Time is transposed: m![M, K] → m![K, M].
input.commit(0)
}
fn i4_no_truncation_with_transpose<'l, const T: Tu>(
input: CastTensor<'l, T, i4, m![1], m![1], m![1], m![M, K], m![J]>,
) -> DmTensor<i4, m![1], m![1], m![1], m![K, M, J]> {
// Input: 64 i4 elements (32 bytes per packet).
// No truncation: the full 32-byte packet is written to DM.
// commit_in_size = 64 elements × 0.5 bytes = 32 bytes.
// Time is transposed: m![M, K] → m![K, M].
input.commit(0)
}
}
Note
The commit_in_size value is automatically derived by the compiler from the output tensor mapping. It is not manually specified by the user.
Commit Sequencer
The commit sequencer writes streams to DM across slices. Each slice within an aggregation executes its own sequencer. This mirrors how fetch sequencers pull data into Tensor Units.
The commit_size value determines how many bytes are written per sequencer step.
It is analogous to the Fetch Engine’s fetch_size and is also derived from contiguous_sram_access_size:
$$ \texttt{commit\_size} = \gcd(\texttt{contiguous\_sram\_access\_size\_bytes},\ \texttt{commit\_in\_size}) $$
- When commit_size == commit_in_size, each time step produces a single DM write.
- When commit_size < commit_in_size, the packet is split into commit_in_size / commit_size writes per time step.
The main context supports a commit_size of 8, 16, 24, or 32 bytes (see main context).
The sub-context supports a commit_size of 8 bytes only (see sub-context).
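To make the derivation concrete, here is a minimal sketch in plain Rust; commit_size here is a hypothetical standalone helper, not the compiler's internal API, and the figures in the assertions come from the compiler-generated configurations shown in the example below.

```rust
// Illustrative sketch of the commit_size derivation; not the compiler's internal code.
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// commit_size = gcd(contiguous_sram_access_size, commit_in_size), both in bytes.
fn commit_size(contiguous_sram_access_size: usize, commit_in_size: usize) -> usize {
    gcd(contiguous_sram_access_size, commit_in_size)
}

fn main() {
    // Values taken from the compiler-generated configurations in the example below.
    assert_eq!(commit_size(64, 8), 8);   // no_transpose: one 8-byte write per time step
    assert_eq!(commit_size(32, 32), 32); // transpose: a single 32-byte write per time step
    assert_eq!(commit_size(16, 16), 16); // transpose_with_truncation
    assert_eq!(commit_size(8, 32), 8);   // padding_chunking: 32 / 8 = 4 writes per time step
}
```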
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, K = 2, W = 8, N = 16];
// Compiler-generated configuration: [
// M -> 4 : 16, (16 == 2 * 8, contiguous)
// K -> 2 : 8, (8 == 8 * 1, contiguous)
// W -> 8 : 1 (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = (8 * 2 * 4) elements × 1 byte = 64 bytes
// commit_in_size = 8 bytes (8 valid i8 elements out of 32-byte flit)
// commit_size = gcd(64, 8) = 8
fn no_transpose<'l, const T: Tu>(
input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![M, K], m![W # 32]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![M, K, W]> {
input.commit(0)
}
// Compiler-generated configuration: [
// M -> 4 : 8, (8 != 2 * 32, NOT contiguous)
// K -> 2 : 32, (32 != 8 * 1, NOT contiguous)
// W -> 8 : 1 (packet dimension, contiguous)
// ] : 32
// contiguous_sram_access_size = 8 elements × 4 bytes = 32 bytes
// commit_in_size = 32 bytes
// commit_size = gcd(32, 32) = 32
fn transpose<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K], m![W]>,
) -> DmTensor<f32, m![1], m![1], m![1], m![K, M, W]> {
input.commit(0)
}
// Compiler-generated configuration: [
// M -> 4 : 8, (8 != 2 * 32, NOT contiguous)
// K -> 2 : 32, (32 != 8 * 1, NOT contiguous)
// N -> 8 : 1 (truncated packet dimension, contiguous)
// ] : 16
// contiguous_sram_access_size = 8 elements × 2 bytes = 16 bytes
// commit_in_size = 16 bytes (8 bf16 elements; truncation from 16 elements to 8)
// commit_size = gcd(16, 16) = 16
fn transpose_with_truncation<'l, const T: Tu>(
input: CastTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![1], m![K, M, N = 8]> {
input.commit(0)
}
// Compiler-generated configuration: [
// K -> 2 : 64, (64 == 4 * 16, contiguous)
// M -> 4 : 16, (16 != 8 * 1, NOT contiguous)
// W -> 8 : 1 (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 elements × 1 byte = 8 bytes
// commit_in_size = 32 bytes
// commit_size = gcd(8, 32) = 8
//
// The 32-byte packet is split into 4 × 8-byte writes along the M axis:
// - Write 0: packet[ 0.. 8] → DM offset 0
// - Write 1: packet[ 8..16] → DM offset 16
// - Write 2: packet[16..24] → DM offset 32
// - Write 3: packet[24..32] → DM offset 48
fn padding_chunking<'l, const T: Tu>(
input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![K], m![M, W]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![K, M, W # 16]> {
input.commit(0)
}
}
Slice Bitmap
The slice bitmap enables selective commits to specific slices.
A 256-bit mask controls which slices receive commit data, with each bit corresponding to one slice.
For example:
- bitmap = 00000000...01 enables commit only to slice 0
- bitmap = 11111111...10 enables commit to all slices except slice 0
This feature supports workflows that compute on specific slices and commit results only to those slices.
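A minimal model of the bitmap in plain Rust; SliceBitmap is a hypothetical illustration of the 256-bit mask, not the Virtual ISA API.

```rust
// Illustrative model of the 256-bit slice bitmap (not the Virtual ISA API):
// bit k enables commits to slice k.
#[derive(Clone, Copy, Default)]
struct SliceBitmap([u64; 4]);

impl SliceBitmap {
    fn enable(&mut self, slice: usize) {
        assert!(slice < 256);
        self.0[slice / 64] |= 1 << (slice % 64);
    }
    fn is_enabled(&self, slice: usize) -> bool {
        ((self.0[slice / 64] >> (slice % 64)) & 1) == 1
    }
}

fn main() {
    // bitmap = 0...01: commit only to slice 0.
    let mut only_slice0 = SliceBitmap::default();
    only_slice0.enable(0);
    assert!(only_slice0.is_enabled(0) && !only_slice0.is_enabled(1));

    // bitmap = 1...10: commit to every slice except slice 0.
    let mut all_but_slice0 = SliceBitmap([u64::MAX; 4]);
    all_but_slice0.0[0] &= !1;
    assert!(!all_but_slice0.is_enabled(0) && all_but_slice0.is_enabled(255));
}
```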
Hardware Constraint
The commit sequencer must adhere to the same limits as fetch sequencers. See fetch sequencer constraints for details.
Sub-Context Operations
The sub-context Commit Engine provides specialized capabilities beyond the main context, though it supports fewer adapter stages.
- Valid Count Packing: This operation selectively commits only valid tensor elements based on a runtime count, excluding padding or invalid data from the output buffer. When computation produces variable-length results (for example, filtering operations or dynamic sequence lengths), valid count packing ensures that only meaningful elements are written to DM, preventing wasted memory and simplifying downstream processing. The hardware uses a count parameter to determine how many leading elements from each packet should be committed, discarding the remainder.
- Generate Mode: Writes a single 32-bit value to a specified address via an ITOS (immediate-to-SRAM) command, bypassing the Tensor Unit execution pipeline.
Constraints
- The input packet size must be 32 bytes.
- The commit_in_size must be 8, 16, 24, or 32 bytes. The commit_size must be 8, 16, 24, or 32 bytes for the main context and 8 bytes only for the sub-context. Note that the user only specifies the Element mapping; these constraints are internal to the compiler.
- The two contexts support different capabilities:
| Stage | Main context | Sub context |
|---|---|---|
| Truncating | Yes | Yes |
| Valid Count Packing | No | Yes |
| Generate Mode | No | Yes |
- Sub-context commits can only follow fetch; they cannot be preceded by Cast Engine or Transpose Engine operations.
- The commit sequencer shares the same limits as the fetch sequencer (see fetch sequencer constraints). Additionally, all sequencer strides must be multiples of 8 bytes.
Performance
Commit Engine performance directly affects overall computation throughput since DM writes must complete before subsequent operations can access the data.
Write Bandwidth
The Commit Engine achieves maximum write bandwidth when:
- Slice Interleaving: Distributing writes across all active slices (or the subset specified by the slice bitmap) avoids bottlenecks on individual slices. The RNGD chip has 64 slices per PE. The 256-bit bitmap accommodates up to 4 PEs (4 × 64 = 256).
- Sequential Addresses: Writing to sequential DM addresses within each slice enables parallel bank access (128 B/cycle per DMN, 256 B/cycle with DMN interleaving).
- Aligned Packet Sizes: Using 8-byte aligned packet sizes (8, 16, 24, 32 bytes) avoids partial bank writes.
For detailed memory performance characteristics, see Memory Performance.
Adapter Stage Costs
Each adapter stage adds minimal latency:
- Truncating: Nearly zero cost (simple data width reduction)
Bank Starvation Prevention
The Commit Engine shares DM bank access with the Fetch Engine and DMA Engine.
To prevent bank starvation and catastrophic NoC timeouts, ensure commit patterns avoid 64+ consecutive accesses to the same bank.
The compiler automatically enforces this constraint by treating violating operations as if they occupy DMA context, preventing concurrent DMA operations.
See DM Bank Starvation for details.
DMA Engine
The DMA Engine moves tensors directly between memory locations without involving the Tensor Unit. It supports all combinations of HBM, SPM, and DM transfers while optionally transforming memory layouts.
As a kernel writer, you control the source and destination memory tiers and any layout transformation expressed as mapping expressions. Prefer direct transfers between tiers: routing data through an intermediate tier (e.g., HBM→SPM→DM when HBM→DM suffices) adds unnecessary latency and bandwidth pressure. The compiler derives the read/write sequencer configuration.
This page covers the interface, worked examples, architecture, and performance characteristics.
Interface
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
/// Moves a tensor from one memory location to another using DMA.
/// Supports layout transformations during transfer.
fn dma<D: Scalar, InMedia, OutMedia, InMapping, OutMapping, StreamMapping>(
input: &Tensor<D, InMedia, InMapping>,
output: &mut Tensor<D, OutMedia, OutMapping>,
stream: StreamMapping,
) {
// Hardware implementation:
// - Read sequencer fetches from source memory
// - Write sequencer stores to destination memory
// - Stream mapping coordinates the transfer
}
The operation signature follows this pattern:
impl<D: Scalar, Chip: M, Element: M> HbmTensor<D, Chip, Element> {
/// Converts to data memory tensor.
#[primitive(HbmTensor::to_dm)]
pub fn to_dm<Cluster: M, Slice: M, Element2: M>(
&self,
_dma: &mut DmaContext<{ Dma::Tensor }>,
address: Address,
) -> DmTensor<D, Chip, Cluster, Slice, Element2> {
DmTensor::new(self.inner.transpose(true), address)
}
}
Transfer capabilities:
- All nine source-destination pairs between DM, SPM, and HBM (including same-tier copies)
- Cross-DMN, cross-cluster, and cross-chip transfers
- Inter-chip transfers via PCIe at 30 bytes/cycle
See also: Memory Performance, Sequencer.
Examples
Layout Transformation
Consider transposing a tensor’s layout while moving it from HBM to DM:
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![N = 4, C = 3, H = 8, W = 8];
// Tensor in HBM with NCHW layout
let hbm: HbmTensor<i8, m![1], m![N, C, H, W]> = /* ... */;
// Tensor DMA moves it to DM with NHWC layout (assumes a kernel `ctx: &mut Context` is in scope)
let dm: DmTensor<i8, m![1], m![1], m![1], m![N, H, W, C]> =
    hbm.to_dm::<m![1], m![1], m![N, H, W, C]>(&mut ctx.tdma, 0);
The DMA Engine reads from HBM using one access pattern and writes to DM using a different pattern, transforming the layout during transfer. For parameter definitions, see the Architecture section below.
Architecture
The DMA Engine coordinates paired read and write sequencers for flexible tensor movement. Each RNGD chip contains eight DMA Engines, one per pair of DMNs, so up to eight independent tensor transfers can proceed simultaneously.
Single-Engine Operation
A single DMA Engine operation transforms a tensor by reading it from one memory location and writing it to another with a potentially different layout.
Parameters
The DMA operation requires several parameters to specify the source tensor, destination tensor, and how data flows between them:
- shape: The tensor’s logical shape (declared via axes![...])
- dtype: Element datatype (e.g., i8, bf16)
- media_in, media_out: Source and destination media types (DM/SPM/HBM)
- b_in, b_out: Base memory addresses for input/output tensors (when the media is HBM, b = { element: b_element })
- In, Out: Mapping environments that specify how logical tensor indices map to physical memory locations
- Stream: Intermediate stream mapping environment that coordinates the read and write sequencers
The operation executes using two coordinated sequencers:
The read sequencer applies read(shape, dtype, b_in, In, Stream) to fetch data from the source, while the write sequencer applies write(shape, dtype, b_out, Out, Stream) to store data at the destination.
These sequencers work together through the shared Stream environment to ensure data flows correctly from source to destination.
Alignment Constraints
These constraints reflect the physical organization of memory hardware and the AXI bus protocol. The 8-byte DM write alignment stems from SRAM bank structure: each bank has an 8-byte data width, and the bank controller can only write complete 8-byte units. Misaligned writes require a read-modify-write operation, tripling the time and blocking other operations on that bank. The 1-byte read alignment reflects asymmetric hardware capabilities: SRAM read ports can extract arbitrary byte ranges using byte-select logic, but write ports cannot. HBM-to-DM 8-byte alignment combines both constraints: unaligned HBM reads incur severe performance penalties (potentially halving bandwidth), so the hardware enforces alignment for this critical path. The 4096-byte packet limit comes from the AXI bus protocol: AXI transactions cannot exceed 256 beats, and with 16-byte data width this yields 4096 bytes maximum. Violating these constraints causes correctness errors or hardware exceptions, not just performance degradation. The compiler enforces these rules because they are hardware invariants, not optimization hints.
Structural Requirements
The mapping environments must follow specific structural requirements depending on the media types involved:
- Stream must have a specific form:
  // Stream = { time: Time, packet: Packet }
- In/Out must have a specific form depending on the media media_in/media_out:
  // In =
  //   if media_in in {HBM, SPM}: { element: ElementIn }
  //   if media_in in {DM}:       { slice: SliceIn, element: ElementIn }
  //
  // Out =
  //   if media_out in {HBM, SPM}: { element: ElementOut }
  //   if media_out in {DM}:       { slice: SliceOut, element: ElementOut }
  This specifies the respective memory space.
- b_in/b_out must have a specific form depending on the media media_in/media_out:
  // b_in =
  //   if media_in in {HBM, SPM}: { chip: b_chip_in, element: b_element_in }
  //   if media_in in {DM}:       { chip: b_chip_in, cluster: b_cluster_in, slice: b_sliceIn, element: b_element_in }
  //
  // b_out =
  //   if media_out in {HBM, SPM}: { chip: b_chip_out, element: b_element_out }
  //   if media_out in {DM}:       { chip: b_chip_out, cluster: b_cluster_out, slice: b_sliceOut, element: b_element_out }
  This specifies addresses in the respective memory space.
- RNGD imposes the following hardware constraints on DMA Engine sequencers (see sequencer constraints for details):
  - Alignment requirements for addresses and packet size (Packet::SIZE):

    | | HBM | DM (SRAM) |
    |---|---|---|
    | Read address | 1B | 1B |
    | Write address | 1B | 8B |
    | Packet size | 1B | 8B |

    In addition, HBM-to-DM DMA transfers require 8-byte alignment for the read address, write address, and packet size, regardless of the values shown in the table above.
  - The packet size must be less than or equal to 4096 bytes (AXI protocol constraint).
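A sketch of how the explicitly stated rules above could be checked; check_dma_alignment and Media are hypothetical names used only for illustration, not compiler or SDK APIs.

```rust
// Sketch of the explicitly stated DMA alignment rules; not the compiler's implementation.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq)]
enum Media { Hbm, Spm, Dm }

fn check_dma_alignment(
    media_in: Media,
    media_out: Media,
    read_addr: u64,
    write_addr: u64,
    packet_size: u64,
) -> Result<(), &'static str> {
    if packet_size > 4096 {
        return Err("packet size exceeds 4096 bytes (AXI constraint)");
    }
    // DM writes require 8-byte aligned write addresses and packet sizes.
    if media_out == Media::Dm && (write_addr % 8 != 0 || packet_size % 8 != 0) {
        return Err("DM write address and packet size must be 8-byte aligned");
    }
    // HBM-to-DM transfers additionally require an 8-byte aligned read address.
    if media_in == Media::Hbm && media_out == Media::Dm && read_addr % 8 != 0 {
        return Err("HBM-to-DM transfers require an 8-byte aligned read address");
    }
    Ok(())
}

fn main() {
    assert!(check_dma_alignment(Media::Hbm, Media::Dm, 0x1000, 0x2000, 256).is_ok());
    assert!(check_dma_alignment(Media::Hbm, Media::Dm, 0x1001, 0x2000, 256).is_err());
}
```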
Example: Basic HBM-to-HBM Layout Transformation
This example demonstrates how a DMA operation transforms a tensor’s memory layout through a simple HBM-to-HBM transfer that rearranges tensor dimensions.
Consider a DMA operation with the following arguments:
axes![N = 4, C = 3, H = 8, W = 8];
// dtype = i8
// media_in = media_out = HBM
// b_in = { chip: 0, element: 1024 }, b_out = { chip: 0, element: 2048 }
// In = { element: m![N, C, H, W] }, Out = { element: m![H, C, N, W] }
// Stream = { time: m![H, C, N], packet: m![W] }
The compiler generates the following sequencer configurations from these arguments:
- Read sequencer configuration: [H=8:8, C=3:64, N=4:192, W=8:1]:8 HBM/D@1024
- Write sequencer configuration: [H=8:96, C=3:32, N=4:8, W=8:1]:8 HBM/D@2048
The hardware traverses memory locations according to these sequencer configurations. The following pseudocode models this behavior conceptually:
#![allow(unused)]
fn main() {
fn dma_sequencer() {
let packet_size = 8; // packet size divides last consecutive read/write sequencer configuration entry
for h in 0..8 {
for c in 0..3 {
for n in 0..4 {
for w_packet in 0..1 {
// packet size is 8, so W=8 is accessed as a single chunk
let read_index = h * 8 + c * 64 + n * 192 + w_packet * 1;
let stream = Mem[read_index..(read_index + packet_size)];
let write_index = h * 96 + c * 32 + n * 8 + w_packet * 1;
Mem[write_index..(write_index + packet_size)] = stream;
}
}
}
}
}
}
This example illustrates how the stream environment (Stream) mediates between different input and output layouts (In and Out), transforming the tensor’s organization in memory while moving it.
Performance
Optimal DMA performance requires attention to startup overhead, alignment, and packet size:
Startup overhead: Each DMA operation incurs approximately 500 cycles of initial overhead. Combining multiple transfers into fewer operations improves efficiency.
Alignment: While the constraints above specify minimum requirements, using larger alignment factors (particularly 256-byte alignment) yields better throughput. For detailed guidance, refer to the memory performance section.
Packet size and internal DMA requests: DMA automatically splits packets into 256-byte units internally: an n-byte packet becomes ceil(n / 256) DMA requests. Examples:
- If the innermost entry is x=4095:1, a 4095-byte packet results in 16 DMA requests.
- If the innermost entry is x=4099:1, since 4099 is prime, a single DmaCommand processes 1 byte at a time (very inefficient). Split it into two DmaCommands (e.g., a 4096-byte portion and a 3-byte portion) instead, though each additional DmaCommand adds ~500 cycles of initial latency.
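The request-count arithmetic can be sketched as follows; dma_requests is a hypothetical helper, not an SDK function.

```rust
// Illustrative request-count arithmetic: packets are split into 256-byte internal requests.
fn dma_requests(packet_size_bytes: u64) -> u64 {
    packet_size_bytes.div_ceil(256)
}

fn main() {
    assert_eq!(dma_requests(4095), 16); // a 4095-byte packet becomes 16 internal requests
    assert_eq!(dma_requests(256), 1);   // a 256-byte-aligned packet maps to a single request
    // A 4099-byte innermost entry cannot be handled efficiently as one packet (see above);
    // splitting it into 4096-byte and 3-byte commands costs 16 + 1 requests plus an extra
    // ~500-cycle startup for the second command.
}
```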
Homogeneous Aggregate Operation
Multiple DMA Engines work together in parallel to improve throughput for large tensor moves. The homogeneous aggregate operation distributes a single logical tensor move across DMA Engines in multiple DMNs, with all DMNs using identical stream environments to coordinate their work. With four chips, up to 32 DMA Engines execute portions of a single tensor move concurrently.
The operation has the following form:
// dma(shape, dtype, media_in, media_out, b_in, b_out, In, Stream, Out)
Each participating DMN executes its own DMA Engine to handle a portion of the overall transfer, together implementing the following single logical tensor move:
// <shape, In, media_in / dtype @ { element: b_in }> --id--> <shape, Out, media_out / dtype @ { element: b_out }>
Parallel execution across multiple DMNs requires extending the mapping environments beyond the single-DMN case to include chip, cluster, and slice dimensions:
// In =
// if media_in in {HBM, SPM}: { chip: ChipIn, element: ElementIn }
// if media_in in {DM}: { chip: ChipIn, cluster: ClusterIn, slice: SliceIn,
// element: ElementIn }
//
// Out =
// if media_out in {HBM, SPM}: { chip: ChipOut, element: ElementOut }
// if media_out in {DM}: { chip: ChipOut, cluster: ClusterOut, slice: SliceOut,
// element: ElementOut }
The key characteristic of homogeneous operations is that all DMNs share the same parametric stream environment:
// Stream = { chip: ChipStream, cluster: ClusterStream, slice: SliceStream,
// time: Time, packet: Packet }
Heterogeneous Aggregate Operation
The heterogeneous aggregate operation provides flexibility for different DMNs to process data differently during a parallel transfer. This variant allows each DMN to use a distinct stream environment while coordinating to perform a single logical tensor move.
Two constraints maintain correctness with this added flexibility:
- All participating DMA Engines must use the same input and output media types
- A single, unified input and output tensor mapping expression must govern the overall transfer
The heterogeneous aggregate DMA operation is defined as:
// dma(shape, dtype, media_in, media_out, b_in, b_out, In, StreamFn, Out)
Each DMN executes its own DMA Engine to implement the following single logical tensor move:
// <shape, In, media_in / dtype @ { element: b_in }> --id--> <shape, Out, media_out / dtype @ { element: b_out }>
The stream environment specification distinguishes this operation.
Instead of a single parametric stream environment shared by all DMNs, the heterogeneous operation uses StreamFn, a function mapping each DMN’s location to its own unique stream environment.
For a DMN at chip i, cluster j, and slice index k, the function StreamFn(i, j, k) returns that DMN’s specific stream mapping of the form { time: Time, packet: Packet }.
The input and output mapping environments (In and Out) remain structurally identical to the homogeneous case, ensuring a well-defined overall logical tensor move.
DMA Command Syntax
Two syntactic forms express DMA operations, depending on whether each DMN needs its own descriptor or can share a common pattern.
Heterogeneous Syntax (Full Flexibility)
The heterogeneous syntax specifies a complete DMA descriptor for each DMN individually, including potentially different source and destination media:
<DMACommand> ::= HashMap(<DmnIndex>, <DmaDescriptor>)
<DmaDescriptor> ::= (<DmaSequencer>, <source_media: Media>, <dest_media: Media>)
<DmaSequencer> ::= (<limit: integer>, <source_stride: integer>, <dest_stride: integer>)*,
(<source_base: integer>, <dest_base: integer>), <stride0: 1~4096>
<Media> ::= "HBM"(<ChipIndex>) | "DM"(DmnIndex) | "SPM"(DmnIndex)
<DmnIndex> ::= (<ChipIndex>, <ClusterInChipIndex>, <SliceInClusterIndex>)
<ChipIndex> ::= 0 | 1 | 2 | 3 (when using 4 chips)
<ClusterInChipIndex> ::= 0 | 1
<SliceInClusterIndex> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Note: While a DMA operation logically uses separate read and write sequencers, the compiler represents them compactly as a single DmaSequencer with paired strides and bases per entry (one for source, one for destination).
Homogeneous Syntax (Common Case)
For the common case where all DMNs follow a regular pattern, the homogeneous syntax offers a concise representation:
<DMACommand> ::= ( <source: Tensor>, <dest: Tensor>, HashMap(<DmnIndex>, <StreamShape>) )
<Tensor> ::= ( <Shape>, <Memory Mapping Expression>, <Media>, <addr: integer>, <Dtype> )
<Shape>, <Memory Mapping Expression>: defined before
<Media> ::= "HBM" | "DM" | "SPM"
<Dtype> ::= i4 | i8 | f8e4m3 | f8e5m2 | i16 | fp16 | bf16 | i32 | f32
<StreamShape> ::= <Memory Mapping Expression>
<DmnIndex> ::= (<ChipIndex>, <ClusterInChipIndex>, <SliceInClusterIndex>)
Key usage notes:
- DM tensor specifications must include chip, cluster, and slice dimensions in the Memory Mapping Expression to identify the exact memory location
- Each DMN’s StreamShape includes inter-DMN mapping information (e.g., chip: A!4, chip #2 means the stream shape uses A@2!1 to specify reading from a particular chip)
- Stream shapes are often inferred: if only source and destination tensors are provided, the compiler derives appropriate stream shapes. Alternatively, specify a single stream shape with chip/cluster/slice dimensions, from which per-DMN stream shapes are automatically derived
Example of heterogeneous mapping:
axes![A = 4, B = 256, C = 256, D = 256];
// source: [Chip: [A % 4], Dram: [B % 256 * C % 256 * D % 256]], HBM @ 0
// dest: [Chip: [A % 4], Cluster: [B / 128], Partitioning: [C % 256], InSlice: [B % 128 * D % 256]], DM @ 0
StreamShape for (Chip_i, Cluster_j, Slice_k):
// [A @ i % 1 * (B / 128) @ j % 1 * (C / 64) @ k % 1 * C % 32 * B % 128 * C / 32 % 2 (DMN) * D % 256]
// DMA Sequencer = [C=32:(32 * 256, slice_stride), B=128:(256 * 256, 256),
// C/32=2:(32 * 256, 32) * slice_stride, D=256:(1, 1)],
// base: (Chip, Cluster, Slice, HBM/InSlice) = ((i, i), (j, j), (k, k), (0, 0))
// (slice_stride = 4MB = virtual address space of in_slice DM)
Implementation Details
This section explains how the compiler generates DMA operations and how the hardware executes them.
Compiler generates aggregate operations by default: The compiler treats tensor-to-tensor moves (T → T’) as atomic units and automatically distributes work across available DMNs in parallel, similar to Fetch/Commit Sequencers. Aggregate operations are the primary abstraction programmers interact with, which explains why this documentation emphasizes them rather than single-DMN DMA.
Sequencer representation is compact: Although DMA operations logically use separate read and write sequencers, the compiler represents them efficiently as a single structure. Each entry in this unified sequencer contains shared loop limits but separate strides (one for read, one for write) and separate base addresses (one for source, one for destination). This compact representation exploits the fact that read and write amounts must always match.
DMA Engine assignment is flexible:
- Any DMA Engine among the 8 can handle any transfer, but using the DMN’s own DMA Engine is more efficient (not quantitatively measured).
- The compiler typically uses the source DM DMN’s DMA Engine, but any DMA Engine works.
- The 8 DMA Engines can transfer between different memory components in parallel (e.g., DMA #0: HBM ↔ DM, DMA #1: DM ↔ DM).
- The compiler only allows moving from one tensor (HBM/DM/SPM) to another tensor (HBM/DM/SPM).
- For inter-chip transfers, all chip IDs are globally agreed upon across the system
- Programmers can leave DMA Engine selection unspecified and let the compiler choose, though explicit specification is also supported
SRAM access patterns for optimal bandwidth: SRAM memory bandwidth depends critically on DMN (Data Memory Network) interleaving. For detailed SRAM performance characteristics and interleaving patterns, see the Data Memory section. The key principle: interleave across both DMNs to achieve full 256 B/cycle bandwidth.
Bandwidth trade-offs: DMA provides flexibility for arbitrary tensor moves but may underutilize SRAM slice bandwidth compared to the Tensor Unit. However, HBM bandwidth is often the bottleneck in practice, making this less critical. For SRAM-to-SRAM transfers, the Tensor Unit is often more efficient, except when the Switch Engine operates at size 256 (which may be slower than DMA).
Tensor Memory Mapping
The compiler automatically derives the correspondence between source and destination memory indices from the mapping environments.
Given tensor memory mappings (e, e'), the compiler computes how each flat memory index relates to the logical tensor dimensions:
// S, e' |- i ~ { i_A = (i % 65536) / 256, i_B = i / 65536, i_C = i % 256 }
A simple layout transformation that reorders dimensions:
axes![A = 256, B = 256, C = 256];
// e_1 = A * B * C, e_2 = B * A * C
// DMA: <S, e_1, HBM@0> =id=> <S, e_2, HBM@256^3>
DMA Sequencer Internals
This section explains how DMA sequencers execute at the hardware level.
A DMA Descriptor represents a single execution unit that the hardware can process. Each DMN’s DMA Engine can accept multiple DmaDescriptors, which it executes in sequence (or potentially in parallel when resources permit). The sequencer within each descriptor determines the exact order in which memory addresses are accessed.
Startup overhead detail: As mentioned in the performance considerations above, each DMA Descriptor incurs approximately 500 cycles of initial latency before data transfer begins.
Example sequencer execution:
// DmaSequencer = [A=256:(65536, 256), B=256:(256, 65536), C=256:(1, 1)], base=(0, 256^3)
Reading one data element per cycle, each cycle performs:
| i | ti | read addr | write addr |
|---|---|---|---|
| 0 | { A: 0, B: 0, C: 0 } | 0 | write_base (=256³) |
| 1 | { A: 0, B: 0, C: 1 } | 1 | 1 + write_base |
| … | … | … | … |
| 255 | { A: 0, B: 0, C: 255 } | 255 | 255 + write_base |
| 256 | { A: 0, B: 1, C: 0 } | 256 | 65536 + write_base |
| i = a*256² + b*256 + c | { A: a, B: b, C: c } | i | 256*a + 256²*b + c + write_base |
The DmaSequencer compactly represents this address mapping table.
With stride0 = 256, the hardware reads and writes 256 bytes per cycle: cycle 0 processes all values for (A, B, C) = (0, 0, 0..255) as a single packet.
A complete descriptor example:
// DmaSequencer = [A=256:(65536, 256), B=256:(256, 65536), C=256:(1, 1)],
// base=(0, 256^3), stride0 = 256
// media_source = HBM, media_dest = HBM, DmnIndex = (0, 0, 0)
This descriptor activates the DMA Engine on Chip 0, Cluster 0, DMN 0, moving data from HBM starting at address 0 to HBM starting at address 256³. The transfer completes in approximately 500 cycles (initial latency) + 256 × 256 cycles (data transfer).
How the compiler derives sequencers: Given source and destination tensor shapes along with a stream shape, the compiler derives the DMA sequencer configuration:
// stream_shape = [A * B * C]
// => read_sequencer = [A=256:65536, B=256:256, C=256:1], base=0
// => write_sequencer = [A=256:256, B=256:65536, C=256:1], base=256^3
The derivation process follows these steps:
- The read sequencer is derived by projecting the source tensor mapping onto the stream shape
- The write sequencer is derived by projecting the destination tensor mapping onto the stream shape
- These are combined into a unified DMA sequencer with paired strides and bases
- The packet size (stride0) is inferred from the consecutive read/write volume: if both read and write access 256 consecutive bytes, the optimal stride0 is 256 bytes
When stride0 is not 256-byte aligned, the cycle count formula is ceil(stride0 / 256). However, HBM write operations incur additional penalties beyond the ceil calculation. The unaligned write requires a Read-Modify-Write (RMW) operation for the partial 256-byte block, slowing the operation significantly (see the Misaligned Access section in ./memory-performance.md for details). For HBM read operations, the penalty is limited to the ceil overhead. For SRAM operations, alignment has minimal impact.
Memory Bandwidth Limits
Memory bandwidth limits are crucial for achieving optimal DMA performance. A single DMA Engine can theoretically move up to 256 bytes per clock cycle, but the actual transfer rate is constrained by the slowest component in the data path: the source memory, the destination memory, or the PCIe interconnect for inter-chip transfers.
For detailed characteristics and optimization strategies for each memory type, see:
- Data Memory (DM) performance
- High-Bandwidth Memory (HBM) performance
- Scratchpad Memory (SPM) performance
Key bandwidth constraints:
- HBM: 1.5 TB/s combined read + write per chip (0.75 TB/s read + 0.75 TB/s write)
- DM: 256 B/cycle per cluster with proper DMN interleaving (128 B/cycle per DMN)
- SPM: 128 B/cycle per cluster
- PCIe DMA Engine: 30 bytes/cycle for inter-chip transfers
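As a rough mental model (plain Rust, illustrative only, using the figures listed above), the realized transfer rate is bounded by the slowest component on the path:

```rust
// Illustrative bottleneck model: the realized rate is the minimum over the path.
fn path_bandwidth(src_bytes_per_cycle: u64, dst_bytes_per_cycle: u64, link: Option<u64>) -> u64 {
    let link = link.unwrap_or(u64::MAX); // on-chip transfers have no PCIe hop
    src_bytes_per_cycle.min(dst_bytes_per_cycle).min(link)
}

fn main() {
    // HBM -> DM on the same chip: bounded by the 256 B/cycle engine/DM side.
    assert_eq!(path_bandwidth(256, 256, None), 256);
    // Inter-chip transfer: the 30 B/cycle PCIe link dominates.
    assert_eq!(path_bandwidth(256, 256, Some(30)), 30);
}
```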
Detailed Examples
The following examples illustrate DMA Engine behavior across various configurations, from simple single-engine transfers to complex multi-DMN operations with performance considerations.
Example 1: Single DMA Engine HBM to HBM
This example demonstrates a basic HBM-to-HBM transfer using a single DMA Engine to rearrange tensor dimensions. The operation achieves good performance through effective channel interleaving, distributing memory accesses across different HBM channels to enable parallel processing.
Operation arguments:
axes![A = 8, B = 8, C = 256];
// dtype = i8
// media_in = media_out = HBM
// b_in = { chip: 0, element: 0 }, b_out = { chip: 0, element: 16384 }
// In = { element: m![A, B, C] }
// Out = { element: m![B, A, C] }
// Stream = { time: m![A, B], packet: m![C] }
Generated sequencer configurations:
- Read sequencer configuration: [A=8:2048, B=8:256, C=256:1]:256 HBM/D@0
- Write sequencer configuration: [A=8:256, B=8:2048, C=256:1]:256 HBM/D@16384
Why this achieves good performance: Channel interleaving enables efficient parallel processing. The strides in the non-innermost sequencer entries (256 and 2048) toggle HBM address bits 8 and 11, which correspond to the stack and channel selection bits. This access pattern ensures that every read and write request targets a different HBM channel, with multiple memory operations proceeding in parallel.
Although each 256-byte transfer takes 4 cycles at 0.75GHz clock speed, the parallel distribution across channels enables efficient execution. At 1GHz, the total time is approximately 128 cycles (64 read requests + 64 write requests) plus approximately 500 cycles of initial latency.
| read #i | ti | read addr | write addr |
|---|---|---|---|
| 0 | { A: 0, B: 0, C: 0 } | 0 | write_base(=16384) |
| 1 | { A: 0, B: 0, C: 1 } | 1 | 1 + write_base |
| 2 | { A: 0, B: 0, C: 2 } | 2 | 2 + write_base |
| … | … | … | … |
| 255 | { A: 0, B: 0, C: 255 } | 255 | 255 + write_base |
| 256 | { A: 0, B: 1, C: 0 } | 256 | 2048 + write_base |
| 257 | { A: 0, B: 1, C: 1 } | 257 | 2048 + 1 + write_base |
| … | … | … | … |
| i = a * 2048 + b * 256 + c | { A: a, B: b, C: c } | i | 256 * a + 2048 * b + c + write_base |
| … | … | … | … |
Bandwidth sharing note: HBM bandwidth is 1.5 TB/s (read + write combined), and each DMA Engine has 256 GB/s bandwidth. For DRAM ↔ DRAM operations, read bandwidth is 0.75 TB/s. If 4 DMA Engines perform DRAM ↔ DRAM operations, each gets ~0.1875 TB/s. Even with stride0=256, each engine reads 256B per request but cannot complete one request per cycle due to this bandwidth constraint.
Example 2: Single DMA Engine HBM to DM
This example demonstrates an HBM-to-DM transfer that achieves optimal bandwidth by carefully interleaving both HBM channels and DM DMNs. Both memory systems require specific access patterns to reach their full bandwidth potential.
Operation arguments:
axes![A = 256, B = 256, C = 256];
// dtype = i8
// media_in = HBM
// media_out = DM
// b_in = { chip: 0, element: 0 }
// b_out = { chip: 0, cluster: 0, slice: 0, element: 0 }
// In = { element: m![B, A, C] }
// Out = { slice: m![A / 4], element: m![A % 4, B, C] }
// Stream = { time: m![B, A % 4, A / 4 % 32, A / 128], packet: m![C] }
Generated sequencer configurations:
- Read sequencer configuration: [B=256:65536, A%4=4:256, A/4%32=32:1024, A/128=2:32768, C=256:1]:256 HBM/D@0
- Write sequencer configuration: [B=256:256, A%4=4:65536, A/4%32=32:slice_stride, A/128=2:DMN_stride, C=256:1]:256 DM/D@0
Performance analysis: Both HBM and DM achieve full bandwidth through careful interleaving in their respective access patterns.
HBM side: The stride of 32768 for the A/128=2 loop interleaves memory accesses effectively.
For the innermost 2 iterations, this interleaves at the byte level; for outer iterations, it interleaves across HBM channels.
The hardware command queue processes all 65536 requests (256 * 4 * 32 * 2) efficiently, utilizing full HBM bandwidth.
DM side: DMN and slice interleaving work together to maximize throughput. Each of the two DMNs provides 128 bytes/cycle bandwidth, so a 256-byte write normally requires 2 cycles on a single DMN. However, interleaving consecutive requests across both DMNs (achieved through the DMN_stride and slice_stride) enables the two DMNs to operate in parallel, processing one 256-byte request per cycle. All 65536 write requests therefore complete at one request per cycle.
Total execution time: Approximately 65536 cycles (read) + 65536 cycles (write) + 500 cycles (initial latency). Since reads and writes overlap in the pipeline, the actual time is closer to max(65536, 65536) + 500 ≈ 66036 cycles.
Example 3: Single DMA Engine DM to DM
This example shows a DM-to-DM transfer within a single cluster, where both reads and writes access the same DM. This scenario requires careful DMN interleaving for both operations to avoid contention and achieve maximum bandwidth.
Operation arguments:
axes![A = 256, B = 256, C = 256];
// dtype = i8
// media_in = DM
// media_out = DM
// b_in = { chip: 0, cluster: 0, slice: 0, element: 0 }
// b_out = { chip: 0, cluster: 0, slice: 0, element: 4 * 256 * 256 }
// In = { slice: m![A / 4], element: m![A % 4, B, C] }
// Out = { slice: m![A / 4], element: m![B, A % 4, C] }
// Stream = { time: m![B, A % 4, A / 4 % 32, A / 128], packet: m![C] }
Generated sequencer configurations:
- Read sequencer configuration: [B=256:256, A%4=4:65536, A/4%32=32:slice_stride, A/128=2:DMN_stride, C=256:1]:256 DM/D@0
- Write sequencer configuration: [B=256:1024, A%4=4:256, A/4%32=32:slice_stride, A/128=2:DMN_stride, C=256:1]:256 DM/D@(4 * 256 * 256)
Performance analysis: DMN and slice interleaving enable full bandwidth for both read and write operations. Each 256-byte access is structured to interleave across the two DMNs, while the outer loops interleave across different DM slices. Each DMN provides 128 bytes/cycle bandwidth, so a single 256-byte access normally requires 2 cycles on one DMN. However, alternating requests between both DMNs enables parallel operation to achieve full 256 B/cycle bandwidth.
Request execution:
- Total read requests: 65536 (256 * 4 * 32 * 2)
- Total write requests: 65536
- At saturation with proper interleaving, one request completes per cycle
Total execution time: Approximately 131072 cycles (since reads and writes must proceed sequentially for DM-to-DM within the same cluster) + 500 cycles (initial latency).
Note on packet size alignment: The choice of C=256 is important for performance. If C were between 1-255, the cycle count remains similar because the number of DMA requests determines execution time. However, if the packet size is 256n+r (where 0 ≤ r < 256), the cycle count increases by a factor of (n+1) due to more requests. Aligning packet sizes to 256-byte boundaries maximizes data transferred per request.
Example 4: Homogeneous DMA Engine, HBM to DM (Pathological: Bank Conflict)
This example demonstrates performance degradation from poorly designed memory access patterns: severe HBM bank conflicts. The issue arises when the stream shape causes consecutive accesses to trigger row switches within HBM banks, preventing efficient parallel execution and resulting in approximately 10x slower performance compared to well-optimized access patterns.
Operation arguments:
// 1 chip (8 DMNs): chip-related mapping is not needed
axes![A = 64, B = 2048, C = 1024];
// dtype = i8
// media_in = HBM
// media_out = DM
// b_in = 0
// b_out = 0
// In = { cluster: m![B / 1024], slice: m![B / 256 % 4, A], element: m![B % 256, C] }
// Out = { slice: m![A / 4], element: m![B, A % 4, C] }
// Stream = { cluster: m![B / 1024], slice: m![B / 256 % 4],
// time: m![B % 256, C / 256, A % 32, A / 32], packet: m![C % 256] }
Generated sequencer configurations:
- Read sequencer configuration at (cluster_i, dmn_j): [B%256=256:1024, C/256=4:256, A%32=32:2^21, A/32=2:2^26, C%256=256:1]:256 HBM/D@(i * (1024 * 1024) + j * (256 * 1024))
  - The base address offset i * (1024 * 1024) + j * (256 * 1024) is derived from the DMN location (B/1024, B/256%4) = (i, j)
- Write sequencer configuration at (cluster_i, dmn_j): [B%256=256:1024, C/256=4:256, A%32=32:slice_stride, A/32=2:DMN_stride, C%256=256:1]:256 DM/D@(cluster_i, dmn_j, 0)
Why this performs poorly: row-level bank conflicts The stream shape structure optimized for DM’s DMN/slice interleaving creates a pathological access pattern for HBM. The innermost interleaving dimensions (A%32 and A/32) correspond to HBM address bits 21 and 26, which control row addressing within banks. Consecutive memory accesses trigger row switches within the same bank on nearly every request.
Channel interleaving still occurs (the C dimension’s stride of 256 enables stack interleaving across all 32 channels), but this parallelism cannot compensate for the row conflict penalty within each channel. Each access within a channel must wait for the previous row to close and the new row to open, dramatically increasing latency.
Performance breakdown:
HBM reads (the bottleneck):
- Per DMN: 65536 data requests (256 * 4 * 32 * 2)
- Across 8 DMNs: 524288 total requests, distributed evenly across 32 HBM channels
- Each channel handles: 16384 requests
- Each request incurs approximately 40 cycles due to bank conflicts (a conservative estimate; actual penalty depends on tCCD and FR-FCFS scheduling)
- Total HBM time: approximately 655360 cycles (16384 * 40)
DM writes (not the bottleneck):
- DMN interleaving works correctly, achieving full 256 B/cycle bandwidth
- 65536 requests per DMN, processing at one request per cycle
Total execution time: Approximately 655360 cycles + 500 cycles (initial latency) ≈ 655860 cycles.
Critical lesson: Careful access pattern design is essential for performance. Avoid bank conflicts through proper stream shape construction. Note that this estimate is conservative; actual performance may be somewhat better due to FR-FCFS (First Ready-First Come First Served) memory scheduling, which can mitigate some conflicts, but the fundamental problem remains severe.
Example 5: Homogeneous DMA Engine HBM to DM (Pathological: Missing Stack Interleaving)
This example demonstrates another common pitfall: failing to interleave across HBM’s stack dimension (address bit 8). When this bit is not toggled by the access pattern, only 16 of the 32 available HBM channels are utilized, cutting effective bandwidth in half.
Operation arguments:
// 1 chip (8 DMNs)
axes![A = 8, B = 64, C = 8, D = 512];
// dtype = i8
// media_in = HBM
// media_out = DM
// b_in = 0
// b_out = 0
// In = { element: m![A, B, C, D] }
// Out = { cluster: m![A / 4], slice: m![A % 4, B], element: m![C, D % 256] }
// Stream = { cluster: m![A / 4], slice: m![A % 4], time: m![C, B % 32, B / 32], packet: m![D % 256] }
Generated sequencer configurations:
- Read sequencer configuration at (cluster_i, dmn_j): [C=8:512, B%32=32:4096, B/32=2:131072, D%256=256:1]:256 HBM/D@(i * 2^20 + j * 2^18)
  - The base address offset i * 2^20 + j * 2^18 is derived from the DMN location (A/4, A%4) = (i, j)
- Write sequencer configuration at (cluster_i, dmn_j): [C=8:256, B%32=32:slice_stride, B/32=2:DMN_stride, D%256=256:1]:256 DM/D@(cluster_i, dmn_j, 0)
Why this performs poorly: missing stack bit interleaving The stream shape does not exercise HBM address bit 8, which controls the stack dimension. In the HBM access pattern, the C axis has a stride of 512, so bit 8 is never toggled during the innermost loops. This occurs in operations like tensor splits where dimension structure changes between input and output (notice that the input tensor mapping includes D/256 but the output/stream does not).
HBM channel selection uses address bits 9-28, while the stack bit is bit 8. Without bit 8 interleaving, memory requests distribute across only 16 of the 32 available channels, immediately halving achievable bandwidth.
Performance breakdown:
HBM reads (the bottleneck):
- Per DMN: 512 data requests (8 * 32 * 2)
- Across 8 DMNs: 4096 total requests, distributed across only 16 channels
- Each channel handles: 256 requests
- Each channel’s bandwidth: 256B per 4 cycles at 0.75GHz, or approximately 5.3 cycles per request at 1GHz
- Total HBM time: approximately 1357 cycles (256 * 5.3)
DM writes (not the bottleneck):
- DMN interleaving achieves full 256 B/cycle bandwidth
- 512 requests per DMN (8 * 32 * 2), processing at one request per cycle
- DM writes overlap with HBM reads in the pipeline, so their latency is hidden
Total execution time: Approximately 1357 cycles + 500 cycles (initial latency) ≈ 1857 cycles.
Critical lesson: Achieving full HBM bandwidth (1.5TB/s) and DMA Engine bandwidth (2TB/s) requires memory access patterns that interleave across all 32 channels by toggling all relevant address bits including the stack bit (bit 8). Missing even one dimension of interleaving significantly degrades performance.
Example 6: Heterogeneous DMA Engine with Segmentation
This example demonstrates a heterogeneous DMA operation where the tensor shape does not divide evenly across all DMNs. Some DMNs must use different stream environments than others, and in extreme cases, a DMN may need to segment its work into multiple DMA commands to avoid writing to incorrect memory locations. This illustrates both the flexibility and complexity of heterogeneous DMA operations.
Operation arguments:
// 4 chips
axes![A = 15, B = 32, C = 256, D = 8];
// dtype = i8
// media_in = DM
// media_out = HBM
// b_in = 0
// b_out = 0
// In = let A' = A + 1# in
// { chip: m![D / 2], cluster: m![D % 2], slice: m![A' / 4, A' / 2 % 2, B],
// element: m![A' % 2, C] }
// Out = { chip: m![D / 2], element: m![D % 2, B, A, C] }
// StreamFn(chip_i, cluster_j, slice_k) = let A' = A + 1# in
// { chip: m![(D / 2) @ i = 1],
// cluster: m![(D % 2) @ j = 1],
// slice: m![(A' / 4) @ k = 1],
// time: (k == 0,1,2): m![A' % 2, B, A' / 2 % 2, C]
// (k == 3, exec #0): m![A' % 2, B, A' / 2 = 1, C]
// (k == 3, exec #1): m![A' = 1, B, A' / 2 % 2 @ 1, C],
// packet: m![C] }
The compiler generates the following sequencer configurations:
- Read sequencer configuration at (chip_i, cluster_j, dmn_k):
  - k = 0, 1, 2: [A'%2=2:256, B=32:slice_stride, A'/2%2=2:DMN_stride, C=256:1]:256 DM/D@(chip_i, cluster_j, dmn_k, 0)
  - k = 3:
    - execution #0: [A'%2=2:256, B=32:slice_stride, A'/2=1:DMN_stride, C=256:1]:256 DM/D@(chip_i, cluster_j, dmn_3, 0)
    - execution #1: [A'%2=2:256, B=32:slice_stride, A'/2=1:DMN_stride, C=256:1]:256 DM/D@(chip_i, cluster_j, dmn_3, 0)
- Write sequencer configuration at (chip_i, cluster_j, dmn_k):
  - k = 0, 1, 2: [A'%2=2:256, B=32:15 * 256, A'/2%2=2:512, C=256:1]:256 HBM/D@(0 + i * 2 * (15 * 32 * 256) + j * (15 * 32 * 256) + k * (4 * 256))
  - k = 3:
    - execution #0: [A'%2=2:256, B=32:15 * 256, A'/2=1:512, C=256:1]:256 HBM/D@(0 + i * 2 * (15 * 32 * 256) + j * (15 * 32 * 256) + 3 * (4 * 256))
    - execution #1: [A'%2=1:256, B=32:15 * 256, A'/2=1:512, C=256:1]:256 HBM/D@(0 + i * 2 * (15 * 32 * 256) + j * (15 * 32 * 256) + 3 * (4 * 256) + 512)
      - 512: offset by A'/2%2@1
Why DMN #3 requires segmentation: The tensor dimension A=15 does not divide evenly across 4 DMNs (15 = 3*4 + 3). DMNs #0, #1, and #2 each process exactly 4 elements of the A dimension. DMN #3 must process the remaining 3 elements (A=12, 13, 14) but its sequencer would naturally try to process 4 elements. If DMN #3 used the same single-command pattern as the other DMNs, it would write one extra element, corrupting memory in the region designated for B * (A + 1#) * C.
The compiler segments DMN #3’s work into two commands to avoid this:
- Execution #0 handles part of the valid range
- Execution #1 handles the remainder, ensuring the total is exactly 3 elements rather than 4
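The uneven split that forces this segmentation can be sketched directly; elements_for_dmn is a hypothetical helper, not part of the compiler.

```rust
// Illustrative work split for A = 15 across 4 DMNs (per the segmentation discussion above).
fn elements_for_dmn(total: usize, dmns: usize, dmn: usize) -> usize {
    let per_dmn = total.div_ceil(dmns);               // 15 elements -> nominally 4 per DMN
    total.saturating_sub(dmn * per_dmn).min(per_dmn)  // the last DMN gets only the remainder
}

fn main() {
    let split: Vec<usize> = (0..4).map(|k| elements_for_dmn(15, 4, k)).collect();
    // DMNs #0-#2 process 4 elements each; DMN #3 processes only 3, so its sequencer
    // must be segmented into two commands rather than naturally iterating 4 times.
    assert_eq!(split, vec![4, 4, 4, 3]);
}
```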
Performance comparison:
DMNs #0, #1, #2 (single command each):
- DM reads: 128 cycles for 2 * 32 * 2 packets of 256B each
- HBM writes: 128 cycles with proper channel interleaving
- Total: approximately 256 cycles (reads and writes overlap) + 500 cycles (initial latency) = 756 cycles
DMN #3 (two commands):
- Execution #0: 64 DM read cycles + 64 HBM write cycles + 500 cycles initial latency
- Note: reads from one DMN only, but slice interleaving still applies
- Execution #1: 32 DM read cycles + 32 HBM write cycles + 500 cycles initial latency
- Total: 192 data cycles + 1000 cycles (initial latency for two commands) = 1192 cycles
Overall execution time: The heterogeneous operation completes when the slowest DMN finishes. DMN #3 determines the total time: approximately 1192 cycles.
Key insight: Command segmentation incurs additional startup overhead (500 cycles per command). Choose tensor shapes that divide evenly across DMNs when possible, avoiding the need for heterogeneous stream environments and command segmentation.
Performance
DMA Engine performance depends on memory types, access patterns, and parallelism strategies.
Memory-Specific Bandwidth
Transfer bandwidth varies by memory type and configuration:
Data Memory (DM/SRAM):
- Peak bandwidth: 256 B/cycle (with proper DMN interleaving)
- Requires interleaving across both DMNs (128 B/cycle each)
- Bank conflicts and starvation can severely degrade performance
- See Memory Performance for DM optimization details
High-Bandwidth Memory (HBM):
- Peak bandwidth: 1.5 TB/s per chip (48 GB/s per channel × 32 channels)
- Channel interleaving is essential for high bandwidth
- Misaligned access and bank conflicts cause severe degradation
- See HBM Performance for optimization strategies
Scratchpad Memory (SPM):
- Bandwidth: 128 B/cycle per cluster
- Restricted to same-chip transfers
Startup Latency
Each DMA command incurs approximately 500 cycles of startup latency before data transfer begins. This fixed cost is amortized over large transfers but becomes significant for small tensors.
Command segmentation (as shown in Example 6) doubles startup latency by requiring two separate commands, emphasizing the importance of tensor shapes that divide evenly across DMNs.
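A back-of-the-envelope model of this amortization, using the approximate figures above; transfer_cycles is a hypothetical helper, not an SDK API.

```rust
// Rough amortization model; the 500-cycle startup and 256 B/cycle figures come from the text above.
fn transfer_cycles(total_bytes: u64, commands: u64) -> u64 {
    const STARTUP_CYCLES: u64 = 500;   // approximate per-command startup latency
    const BYTES_PER_CYCLE: u64 = 256;  // peak single-engine transfer rate
    commands * STARTUP_CYCLES + total_bytes.div_ceil(BYTES_PER_CYCLE)
}

fn main() {
    let total = 1 << 20; // 1 MiB
    // One large command amortizes the startup cost over the whole transfer...
    assert_eq!(transfer_cycles(total, 1), 500 + 4_096);
    // ...while 64 small commands pay the startup cost 64 times.
    assert_eq!(transfer_cycles(total, 64), 32_000 + 4_096);
}
```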
Parallelism Strategies
Multiple DMA Engines can operate simultaneously:
- 8 DMA Engines per chip (one per pair of DMNs, 8 DMNs per cluster)
- Parallel DMA operations on independent data enable high aggregate bandwidth
- Local DMN memory access is faster than cross-DMN access (not quantitatively measured)
Alignment Constraints
Strict alignment requirements affect performance:
- DM writes: 8-byte alignment required for addresses and packet sizes
- HBM operations: 1-byte alignment for reads/writes, but HBM-to-DM transfers require 8-byte alignment
- Maximum packet size: 4096 bytes (AXI protocol constraint)
- Misaligned access in HBM can halve bandwidth or trigger expensive Read-Modify-Write operations
Bank Starvation Prevention
DMA Engine shares DM bank access with Fetch and Commit Engines. DMA has the lowest priority among these engines, making it vulnerable to bank starvation. If a DMA request blocks for more than 4,096 cycles, a NoC timeout occurs, requiring a hardware reset.
The compiler prevents this by ensuring operations with 64+ consecutive same-bank accesses are not scheduled concurrently with DMA. See Bank Starvation for details.
Inter-Chip Transfers
PCIe-based inter-chip transfers have limited bandwidth:
- 30 B/cycle for both reads and writes
- Significantly slower than on-chip transfers
- Consider minimizing cross-chip data movement in algorithm design
Memory Performance
Memory performance fundamentally determines TCP program efficiency. This page is the primary actionable reference for kernel writers: it documents hardware specifications, explains why each constraint exists, and maps API choices to their performance consequences. It covers DM first (the tier kernel writers interact with most), then SPM and HBM.
In practice, most performance problems trace back to two root causes, each with multiple specific manifestations:
- Unnecessary hops: routing data through an intermediate memory tier (e.g., through SPM when a direct HBM→DM transfer suffices) adds latency and bandwidth pressure.
- Low throughput: a Packet that is smaller than necessary or non-contiguous in memory causes more sequencer iterations and strided access patterns. The following table details the hardware constraints that determine how Packet choices affect each memory type:
| Memory | Issue | Rule | Penalty |
|---|---|---|---|
| DM | Bank starvation | < 64 consecutive same-bank accesses | NoC timeout → hardware reset |
| DM | DMN interleaving | Alternate across 2 DMNs per cluster | 50% bandwidth loss |
| DM | Slice interleaving | Spread across 32 slices per DMN | Command queue contention |
| HBM | Alignment | 256-byte aligned access | Unaligned read: 2× penalty; unaligned write: ~50× penalty (RMW) |
| HBM | Bank conflicts | Avoid row switches within same bank | 30–40× degradation |
| HBM | Channel interleaving | Spread across 32 channels | Reduced parallelism |
Data Memory (DM)
Data Memory is the primary SRAM for tensor computations. A single RNGD chip contains 256MB of DM, structured to maximize parallel access and bandwidth. The following table summarizes the DM geometry:
| Unit | Count |
|---|---|
| Clusters | 2 / Chip |
| Data Memory Networks (DMNs) | 8 / Cluster |
| Slices | 32 / DMN |
| Banks | 16 / Slice |
| Rows | 4096 / Bank |
| Bytes | 8 / Row |
The SRAM hierarchy consists of clusters, DMNs, and slices. A single chip contains two clusters, each with eight Data Memory Networks. Each DMN contains 32 slices, totaling 256 slices per cluster. Clusters can exchange data through the Switch Engine; see the dedicated section for details.
Address Space in a Slice
Each slice provides 512KB of SRAM with a dedicated address space. The memory is organized into 16 parallel banks, each with an 8-byte data width, enabling a total data access rate of 128 B/cycle. Access to any individual bank is serialized, but the address space distributes 128 consecutive bytes across all 16 banks (8 bytes per bank) for parallel access. The following bit mapping defines this distribution:
| Bit # | Component |
|---|---|
| 0–2 | Byte |
| 3–6 | Bank |
| 7–18 | Row |
This bit mapping optimizes various access patterns, particularly during sequential access. Distributing consecutive bytes across banks enables parallel access and maximizes bandwidth utilization.
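To make the bit mapping concrete, a small decoding helper can recover the row, bank, and byte from a slice-local address; decode_slice_address is a hypothetical name used only for illustration, not an SDK function.

```rust
// Illustrative decode of a slice-local DM address into (row, bank, byte) per the bit mapping above.
fn decode_slice_address(addr: u32) -> (u32, u32, u32) {
    let byte = addr & 0x7;          // bits 0-2: byte within the 8-byte bank row
    let bank = (addr >> 3) & 0xF;   // bits 3-6: one of 16 banks
    let row = (addr >> 7) & 0xFFF;  // bits 7-18: one of 4096 rows
    (row, bank, byte)
}

fn main() {
    // 128 consecutive bytes are striped across all 16 banks, 8 bytes per bank.
    assert_eq!(decode_slice_address(0), (0, 0, 0));
    assert_eq!(decode_slice_address(8), (0, 1, 0));    // the next 8 bytes land in the next bank
    assert_eq!(decode_slice_address(127), (0, 15, 7)); // last byte of the 128-byte stripe
    assert_eq!(decode_slice_address(128), (1, 0, 0));  // then the row advances
}
```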
Optimizing DMA Performance
Achieving full DMA bandwidth requires following three guidelines: interleaving across DMNs, interleaving across slices, and preventing bank starvation.
1. Interleave across DMNs.
DMN interleaving is essential because each DMN provides only 128 B/cycle bandwidth. Since the standard 256-byte transfer unit requires two cycles per DMN, you should pipeline accesses across both DMNs to maintain continuous throughput:
| cycle | DMN #0 | DMN #1 |
|---|---|---|
| 0 | read #0 (1/2) | (idle) |
| 1 | read #0 (2/2) | read #1 (1/2) |
| 2 | read #2 (1/2) | read #1 (2/2) |
| 3 | read #2 (2/2) | read #3 (1/2) |
| … | … | … |
| 2n-1 | read #2n-2 (2/2) | read #2n-1 (1/2) |
| 2n | (idle) | read #2n-1 (2/2) |
2. Interleave across slices.
Slice interleaving improves efficiency by distributing DMA requests across the 32 slices within each DMN. Slices are shared resources used by the DMA, Fetch, and Commit Engines, so spreading requests helps manage contention. Each slice has two command queue entries to hold pending DMA requests.
3. Prevent bank starvation. Unlike the first two issues, bank starvation may force a complete chip reset, not just a slowdown.
Bank Starvation
The key constraint is the 64-access rule: Fetch and Commit engines must not access the same DM bank for more than 64 consecutive operations while DMA is active. Violating this causes a NoC timeout and full cluster reset.
The fundamental issue is priority inversion in a shared resource system: a low-priority requester is indefinitely blocked by high-priority ones accessing the same resource. The DM controller prioritizes requests in this order:
- Main-context Fetch Engine
- Main-context Commit Engine
- Sub-context Fetch Engine
- Sub-context Commit Engine
- DMA Engine
DMA has the lowest priority among all memory engines, which makes sense during normal operation since computation engines should get first access to data. However, this creates a dangerous scenario when high-priority engines continuously access the same bank: the DMA Engine’s request sits in the queue, unable to make progress, while the higher-priority engines monopolize that bank. After 4,096 cycles without a response, the NoC (Network on Chip) protocol declares the transaction dead and enters an exception state. This 4,096-cycle limit exists as a safety mechanism to detect deadlocks and hung transactions in the NoC protocol; without this timeout, a stuck transaction could hang the entire system indefinitely. When the timeout triggers, the hardware lacks a graceful recovery mechanism, and the only recovery is a full cluster domain reset, losing all computation state and requiring complete reinitialization.
The 64-access rule prevents this catastrophe: the Fetch and Commit Engines must not access the same bank for more than 64 consecutive operations while DMA is active.
Why 64? The limit follows from NoC bandwidth: (256 B/cycle DMA ÷ 128 B/cycle per DMN) × max consecutive accesses × 32 slices must stay below the 4,096-cycle timeout, which yields a maximum of fewer than 64 consecutive accesses. This ensures DMA requests complete before the timeout even in the worst case.
For example, suppose the DMA Engine issues a request to bank 0 (along with 15 other banks), but the main-context’s Fetch Engine continuously requests bank 0. The DMA request stalls, and if this exceeds 4,096 cycles, a NoC timeout forces a hardware reset.
Compiler scheduling behavior: When Tensor Unit operations would violate the 64-access limit, the compiler schedules them as if they occupy DMA, preventing concurrent DMA operations. This sacrifices the TCP architecture’s inherent main/sub/DMA context parallelism where data preparation and computation occur in parallel, but avoids catastrophic hardware resets. Treat this as a hard constraint: never use patterns with 64+ consecutive same-bank accesses.
Main-context starving sub-context is less severe because it does not trigger NoC timeouts and only increases processing time. Additionally, the Tensor Unit’s internal pipeline naturally generates back-pressure between Fetch and Commit Engines, preventing internal starvation.
Scheduling model: The scheduler uses context occupancy information: if operation A occupies a context (e.g., main context), the next operation B using that context waits until A completes. Understanding which contexts operations occupy enables predicting parallel execution. A scheduling visualization utility would help verify actual schedules.
The 64-access limit details:
- The limit is cumulative across all concurrent commands: if the total number of consecutive accesses from all commands (main-context fetch/commit + sub-context fetch/commit + DMA) to the same bank reaches 64 or more, DMA starvation occurs.
- Even if individual commands interleave accesses to the same bank, their combined access count still accumulates toward the 64-access limit, which can cause DMA Engine starvation.
- The compiler controls only single commands accessing the same bank consecutively; multiple commands interleaving the same bank are not controlled.
- In practice, sub-context rarely accesses the same bank consecutively (`StoTrf` and `StoVrf` operations typically use sequential addresses, and tiling prevents same-bank access).
- Sub-context operations that would exceed the limit are also not scheduled concurrently with DMA.
Note
Cumulative Bank Access Constraint
Even if each individual command accesses a bank fewer than 64 times, the TOTAL across all concurrent main/sub/DMA commands to the SAME bank must be less than 64.
The compiler prevents individual commands from exceeding this limit, but it cannot prevent accumulation from multiple concurrent operations. For example:
- Main-context Fetch: 30 consecutive bank accesses
- Sub-context Fetch: 20 consecutive bank accesses
- DMA: 1 concurrent request to the same bank
- Total: 51 accesses → Safe (below 64)
But if either the main-context or the sub-context count grows enough that the combined total reaches 64, starvation is triggered. This is why the compiler sacrifices main/sub/DMA parallelism (scheduling them sequentially instead) when the cumulative access count to the same bank would reach 64.
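A minimal sketch of the cumulative rule, assuming the per-engine access counts are known ahead of time; the helper is hypothetical and not an SDK API:
const MAX_CONSECUTIVE_SAME_BANK: u32 = 64;

// Sums the consecutive same-bank access counts of all engines that may run
// concurrently with DMA and checks them against the limit.
fn same_bank_accesses_safe(main_fetch: u32, main_commit: u32,
                           sub_fetch: u32, sub_commit: u32,
                           dma: u32) -> bool {
    main_fetch + main_commit + sub_fetch + sub_commit + dma
        < MAX_CONSECUTIVE_SAME_BANK
}

fn main() {
    // The example from the note: 30 + 20 + 1 = 51 → safe.
    assert!(same_bank_accesses_safe(30, 0, 20, 0, 1));
    // Growing either count until the total reaches 64 violates the rule.
    assert!(!same_bank_accesses_safe(43, 0, 20, 0, 1));
}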
Main/sub-context contention: Main-context can starve sub-context, but this is less severe:
- Unlike DMA starvation, sub-context starvation does not cause NoC timeout or hardware reset and only increases processing time.
- Collision probability is lower: DMA Engine occupies 16 banks at once, while sub fetch/commit engines occupy only one bank.
- Starvation does not occur between fetch and commit engines within the same context due to pipeline back-pressure.
Performance impact example: If main-context exec command continuously accesses a specific bank while sub-context stos command is scheduled, sub-context processing is delayed. Worst case: total time = main-context time + sub-context time. Ideal case: main and sub access different banks, achieving total time = max(main-context time, sub-context time).
Technical Details: Banks and Command Queues
Bank access: At 128 B/cycle DMN access, 16 banks are accessed simultaneously. Banks are shared resources among Fetch Engine, Commit Engine, and DMA Engine. Access to any individual bank is serialized.
DMN bandwidth: Within the Data Memory Network, Data Memory Slices share data paths, so DMA Engine transfers achieve 128 B/cycle per DMN.
Command queues: Each Data Memory Slice has a 2-entry command queue for pending DMA read/write requests. Since this is limited, spreading DMA requests across multiple slices is ideal: distributing across M slices reduces required throughput to 1/M even if request processing slows due to priority. DMN interleaving every n cycles achieves saturated 256 B/cycle.
Note
While command queues theoretically allow some burst access without interleaving, we strongly recommend always interleaving across DMNs when generating DMA streams, as this is the most natural approach.
The 4096-cycle limit derivation: The formula is: (TDMA_IO_BYTE / DMN_IO_BYTE) * Max_Consecutive_Access * DMN_SIZE < 4096.
With TDMA_IO_BYTE=256, DMN_IO_BYTE=128, DMN_SIZE=32, this yields Max_Consecutive_Access < 64.
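The same derivation written as ordinary Rust constants (illustrative only, not SDK code):
const TDMA_IO_BYTE: u64 = 256;      // DMA engine bandwidth per cycle
const DMN_IO_BYTE: u64 = 128;       // DMN bandwidth per cycle
const DMN_SIZE: u64 = 32;           // slices per DMN
const NOC_TIMEOUT_CYCLES: u64 = 4096;

// Largest consecutive same-bank access count that still satisfies
// (TDMA_IO_BYTE / DMN_IO_BYTE) * max * DMN_SIZE < NOC_TIMEOUT_CYCLES.
const MAX_CONSECUTIVE_ACCESS: u64 =
    NOC_TIMEOUT_CYCLES / ((TDMA_IO_BYTE / DMN_IO_BYTE) * DMN_SIZE) - 1;

fn main() {
    assert_eq!(MAX_CONSECUTIVE_ACCESS, 63); // i.e., strictly fewer than 64
}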
DMN NoC architecture: Tensor DMA connects to DRAM and DMN through a NoC acting as a hub. Each port (DMA port, DRAM port, DMN ports) receives requests and must send responses. Transactions are considered hung if response takes more than 4096 cycles after request. When a DMN port request doesn’t receive a response within 4096 cycles, the NoC treats it as an error and enters an exception state, requiring a cluster domain reset.
Data Memory Network topology: Data Memory Routers are DMN components connected in a ring topology forming the Data Memory Network. The path is: slice0_in → slice31_out, slice32_in → slice63_out.
Scratchpad Memory (SPM)
Note
This section is a work in progress; hardware-specific details (capacity, addressing, bank structure) are pending.
Scratchpad Memory provides additional fast storage within each DMN for temporary data and intermediate results. Each DMN contains SPM with a bandwidth of 128 B/cycle, offering high-speed access for frequently reused values such as constants, lookup tables, or small working sets that don’t require the full capacity of SRAM.
SPM serves as a middle tier in the memory hierarchy between the ultra-fast VRF (Vector Register File) and the larger SRAM. Its primary use cases include storing scalar constants, small weight matrices, activation function lookup tables, and configuration data that needs rapid access without consuming scarce VRF capacity. The compiler automatically selects SPM for data that exhibits high temporal locality but modest capacity requirements.
The key distinction from SRAM is explicit software management: the compiler explicitly allocates data to SPM when beneficial, whereas SRAM allocation follows more general-purpose policies. SPM’s 128 B/cycle bandwidth per DMN enables high-throughput access for small tensors, and because each DMN has dedicated SPM, there are no inter-DMN contention issues. SPM is particularly valuable for per-DMN state that would otherwise require repeated SRAM fetches.
High-Bandwidth Memory (HBM)
HBM provides high-capacity off-chip storage with substantial bandwidth for large tensor operations. A single RNGD chip contains 48GB of HBM. The following table summarizes the HBM geometry:
| Unit | Count |
|---|---|
| Stacks | 2 / Chip |
| Channels | 16 / Stack |
| Slices | 3 / Channel |
| Bank Groups | 4 / Slice |
| Banks | 4 / Bank Group |
| Rows | 16K / Bank |
| Bytes | 2K / Row |
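As with DM, the geometry multiplies out to the stated capacity; the following plain-Rust check is for illustration only:
fn main() {
    const STACKS: u64 = 2;
    const CHANNELS_PER_STACK: u64 = 16;
    const SLICES_PER_CHANNEL: u64 = 3;
    const BANK_GROUPS_PER_SLICE: u64 = 4;
    const BANKS_PER_GROUP: u64 = 4;
    const ROWS_PER_BANK: u64 = 16 * 1024;
    const BYTES_PER_ROW: u64 = 2 * 1024;

    let chip_bytes = STACKS * CHANNELS_PER_STACK * SLICES_PER_CHANNEL
        * BANK_GROUPS_PER_SLICE * BANKS_PER_GROUP * ROWS_PER_BANK * BYTES_PER_ROW;

    assert_eq!(chip_bytes, 48 * 1024 * 1024 * 1024); // 48GB per chip
}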
Address Space in a Chip
The HBM address space uses a non-linear bit mapping optimized for parallel sequential access. This design maximizes parallelism and minimizes overhead rather than directly mapping to physical geometry:
| Bit # | Main Component | Additional Components |
|---|---|---|
| 0–7 | Byte | |
| 8 | Stack | |
| 9–12 | Channel | |
| 13 | Bank Group | Channel |
| 14–16 | Byte | Channel |
| 17–18 | Bank | Channel |
| 19 | Bank Group | Channel |
| 20 | Slice | Channel |
| 21–33 | Row | Channel (21–28) |
| 34 | Slice | Row |
| 35 | Row | |
The bit assignment for each component corresponds to the physical memory geometry. For instance, the byte component occupies 11 bits (bits 0-7, 14-16) to represent 2K (2^11) bytes per row. Three exceptions exist:
- Slice representation: Two bits (20 and 34) represent slice, even though there are only three slices.
- Contiguous address space: Bit 34 is influenced by the row component to ensure bits 34 and 35 are never both 1, guaranteeing a contiguous 48GB address space.
- Channel XOR mapping: The channel component equals the XOR of bits 9-12 and 13-28 (e.g., the channel’s first bit equals the XOR of bits 9, 13, 21, and 25).
This unconventional bit order enhances performance by enabling parallelism across different memory resources.
Peak Bandwidth
Peak HBM bandwidth reaches 1.5TB/s per chip through parallel operation of stacks and channels. The channel controller transfers 64B/cycle at 0.75GHz,[^1] yielding 48GB/s per channel (0.75GHz x 64B/cycle) or 1.5TB/s per chip (48GB/s x 32 channels). The fundamental transfer unit is 256 bytes, requiring 4 clock cycles per channel. Saturating a single DMA Engine (256GB/s capacity) requires interleaving accesses across multiple channels.
Achieving peak bandwidth requires careful attention to access patterns. Channel throughput is highly sensitive to misalignment, bank conflicts, and resource sharing. Each channel controller has a 64-entry command queue that interleaves accesses to minimize penalties, but pathological cases can still cause severe degradation. The following sections describe causes of performance degradation and how to avoid them.
Misaligned Access
Misaligned access significantly degrades HBM performance. Bits 0-7 (the eight LSBs) represent the 256-byte minimum access unit within a memory row. Accessing data that crosses this boundary incurs substantial penalties.
Unaligned Read: Read requests crossing a 256-byte boundary require two NoC transfers, effectively halving bandwidth.
Unaligned or Partial Write: DMA packets are internally segmented into 256-byte transactions. When a packet’s size is not 256-byte aligned (e.g., a 2,800-byte packet splits into ten 256-byte requests plus one 240-byte request), the final “leftover” transaction requires a Read-Modify-Write (RMW) operation. RMW reads the entire 256-byte unit, updates the requested bytes, then writes the entire unit back. RMW can slow writes by roughly 50× compared to aligned writes.
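A small sketch (plain Rust, not an SDK API; the helper name is ours) of how a write packet splits into 256-byte transactions and when a leftover RMW transaction appears:
const HBM_ACCESS_UNIT: usize = 256;

// Returns (number of full 256-byte transactions, leftover bytes).
// A non-zero leftover means the final transaction needs Read-Modify-Write.
fn split_write(packet_bytes: usize) -> (usize, usize) {
    let full = packet_bytes / HBM_ACCESS_UNIT;
    let leftover = packet_bytes % HBM_ACCESS_UNIT;
    (full, leftover)
}

fn main() {
    // The 2,800-byte example from the text: ten full transactions plus a
    // 240-byte leftover that triggers RMW.
    assert_eq!(split_write(2_800), (10, 240));
    // A 256-byte-aligned packet avoids RMW entirely.
    assert_eq!(split_write(2_816), (11, 0));
}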
Bank Conflict
Bank conflicts cause severe performance degradation of 30–40× compared to accessing an already-open row. They occur when consecutive accesses target different rows within the same bank. Only one row per bank can be open at a time; all rows start closed. Once open, a row’s 256-byte words can be accessed quickly, but switching rows requires closing the current row and opening a new one, adding 40–50 ns (60–75 cycles at 1.5 GHz) of latency.
Channel interleaving mitigates bank conflicts. Interleaving accesses across all 32 channels distributes load and reduces conflicts. Bits 8-12 (the next five LSBs) represent independent stacks and channels; placing these at low addresses prevents interference between adjacent accesses, which is vital for parallelizing contiguous operations. Non-contiguous operations often benefit from natural channel interleaving because the channel component spans bits 9-28. However, the stack component only corresponds to bit 8, so interleaving this bit requires explicit attention.
The controller hides row-switch latency through command interleaving. Within each channel, the controller automatically interleaves commands across banks, enabling useful transfers while other banks perform row switches. The controller manages bank states using its command queue and employs FR-FCFS (First Ready-First Come First Served) scheduling, prioritizing commands targeting already-open rows.
Despite this sophisticated scheduling, access patterns that continuously switch rows within the same bank still degrade performance significantly. Compilers and programmers should estimate row-switch costs when generating code.
Column-to-Column Delay
Column-to-Column Delay (tCCD) rarely affects performance significantly, so skip this section on first reading.
tCCD is the minimum time between consecutive read or write commands on the same channel.
It determines the maximum command issue rate, directly affecting channel throughput.
Vendor specifications set tCCD values based on analog constraints for accessing DRAM stack layers and shared resources.
The tCCD value depends on which memory resources consecutive commands target:
| Command Relation | tCCD (cycles @ 1.5GHz) | Relative Performance | Reason for Penalty |
|---|---|---|---|
| Same Slice, Different Bank Group | 2 | 1 | Ideal interleaving of bank groups |
| Different Slice | 3 | 2/3 | Data path switching |
| Same Slice, Same Bank Group | 4 | 1/2 | Shared I/O buffer among four banks |
The optimal case is interleaving between different bank groups within the same slice (tCCD = 2 cycles at 1.5GHz), allowing a new 64B command to be issued every cycle at 0.75GHz, achieving back-to-back transmission and full channel speed.
Any tCCD greater than 2 reduces the command rate and channel utilization.
Pathological tCCD patterns cause less severe degradation than bank conflicts for two reasons: either they often coincide with bank conflicts anyway, or channel interleaving masks their impact:
- Different Slice (`tCCD = 3`): Slice ID corresponds to bit `20`, and bit `21` corresponds to the row. Interleaving across slices therefore likely causes bank conflicts simultaneously.
- Same Slice, Same Bank Group (`tCCD = 4`): This pattern interleaves bits `8`–`35` except bits `13`, `19`, `20`, and `34`. Bits `29`–`35` relate to bank conflicts; bits `8`–`28` relate to channel interleaving.
[^1]: Although the channel controller operates at a frequency of 0.75GHz, it performs eight bursts per cycle, leading to an effective frequency of 0.75 × 8 = 6GHz.
Computing Tensors
The Tensor Unit transforms data through a pipeline of eight specialized engines. Data flows from DM, through the engine pipeline, and back to DM. After the Collect Engine normalizes packets to flits (32-byte flow control units), all downstream engines — Contraction, Vector, Cast, Transpose, and Commit — operate on flits. (See Collect Engine for the normalization details.)
flowchart TB
subgraph SRAM
DM[(DM)] & TRF[(TRF)] & VRF[(VRF)]
end
subgraph TU[Tensor Unit]
direction LR
FE[Fetch] --> SW[Switching] --> CO[Collect] --> CE[Contraction] --> VE[Vector] --> CA[Cast] --> TR[Transpose] --> CM[Commit]
end
DM --> FE
CM --> DM
CO --> TRF --> CE
CO --> VRF --> VE
click FE "../moving-tensors/fetch-engine.html" "Fetch Engine"
click SW "./switch-engine.html" "Switch Engine"
click CO "./collect-engine.html" "Collect Engine"
click CE "./contraction-engine/index.html" "Contraction Engine"
click VE "./vector-engine/index.html" "Vector Engine"
click CA "./cast-engine.html" "Cast Engine"
click TR "./transpose-engine.html" "Transpose Engine"
click CM "../moving-tensors/commit-engine.html" "Commit Engine"
Two register files serve distinct roles: TRF (Tensor Register File; see hello-tcp memory overview) holds weights for the Contraction Engine (load once, reuse across many cycles), while VRF (Vector Register File) holds operands for the Vector Engine.
The Collect Engine loads data into TRF via .to_trf() and VRF via .to_vrf().
Fetch and Commit are part of the Tensor Unit pipeline but interface directly with DM; see Moving Tensors.
| Engine | Function | Key Constraint |
|---|---|---|
| Fetch | Load data from DM into the pipeline | Packet must be 8-byte aligned; Slice is unchanged |
| Switching | Redistribute data across slices | Ring network topology; Slice can change |
| Collect | Normalize packets to 32-byte flits | Output = exactly one flit |
| Contraction | Einsum: matmul, convolution, attention | Weight-stationary via TRF |
| Vector | Elementwise, binary, reduce operations | Only i32/f32 input |
| Cast | Precision lowering with batching | Output = exactly one flit |
| Transpose | Reorder elements within a flit | Within-flit only |
| Commit | Write results back to DM | Flit-aligned writes |
As a kernel writer, you specify data types, tensor mapping expressions, and computations in einsum form. The compiler translates these into per-engine hardware configurations.
Execution Contexts
Two execution contexts enable double-buffering (preparing the next operand batch while the current one is being computed) to hide memory latency:
| Context | Compute Engines | Fetch/Commit | Typical Use |
|---|---|---|---|
| Main | Exclusive access | Dedicated units | Computation |
| Sub | Idle only | Lower bandwidth | Prefetching to TRF/VRF |
While the main context computes, the sub context prefetches the next operand batch into TRF/VRF. When the sub context is unused, the main and sub Switch Engine channels combine into dual channel mode (see Switch Engine), doubling bandwidth. See Scheduling for how the scheduler coordinates the two contexts and the DMA Engine.
The following sections cover each engine in detail.
Switch Engine
The Fetch Engine produces a FetchTensor where each slice holds its own portion of data.
The Switch Engine then redistributes data across slices so each slice receives exactly what it needs for computation.
Data flows through a ring network of 256 interconnected slices; each slice’s router decides per packet whether to output locally or forward to a neighbor.
This data redistribution overlaps with computation, enabling the Contraction Engine to receive data in the exact pattern it needs while continuously executing operations. This page covers the interface, routing architecture (Forwarding, Broadcast01, Broadcast1, Transpose, InterTranspose, and Custom Topologies), hardware constraints, and performance characteristics.
Interface
impl<'l, const T: Tu, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
FetchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Performs switching operation to create a switched tensor.
///
/// Applies switching network routing only. The packet passes through
/// unchanged — no padding, no reshaping. Use [`SwitchTensor::collect`]
/// afterwards to normalize the packet to flit-sized chunks.
#[primitive(FetchTensor::switch)]
pub fn switch<Slice2: M, Time2: M>(
self,
config: SwitchConfig,
) -> SwitchTensor<'l, T, D, Chip, Cluster, Slice2, Time2, Packet> {
verify_switch::<Slice, Time, Slice2, Time2>(&config);
SwitchTensor::new(self.ctx, self.inner.transpose(true))
}
/// Skips the switching network and goes directly to collect.
///
/// Slice and Time are preserved from fetch; only the packet is normalized
/// to flit-sized chunks.
#[primitive(FetchTensor::collect)]
pub fn collect<Time2: M, Packet2: M>(self) -> CollectTensor<'l, T, D, Chip, Cluster, Slice, Time2, Packet2> {
verify_collect::<D, Time, Packet, Time2, Packet2>();
CollectTensor::new(self.ctx, self.inner.transpose(false))
}
}
The transformation preserves the tensor’s mathematical representation while redistributing data across slices.
The Chip and Cluster dimensions pass through unchanged; only Slice and Time are permuted.
The packet passes through the switch engine unchanged.
After switching, call collect() to normalize the packet to 32-byte flits.
Architecture
This section explains how routers make decisions to route data, then shows regular topologies with predictable data flow, and finally covers custom topologies that enable arbitrary patterns. The Switch Engine only supports specific slice and temporal dimension transformations determined by the switching topology.
Router Decision Process
Understanding how routers make forwarding decisions is essential before exploring specific topologies.
Each slice has a router that decides whether to send its input packet to an adjacent slice or to output.
Each packet has a source slice number attached, and each slice configures a snoop bitmap (a bitmask specifying which source slices’ packets to accept and output) to control which data it receives.
Each slice’s router can make the following three routing decisions:
- Input routing: The router decides whether its input packet goes to output, rightward to the next slice, or leftward to the previous slice.
- Right-neighbor routing: Data arriving from the right neighbor can be forwarded to output, rightward, or leftward.
- Left-neighbor routing: Data arriving from the left neighbor can be forwarded to output, leftward, or rightward.
Using these settings, data moving in a counter-clockwise ring pattern can be configured to reach the desired slice.
Common router configurations for counter-clockwise ring communication:
- Root node: Outputs input data and data from the right slice, sends to right slice.
- Middle node: Outputs data from the left slice, forwards to right.
- Leaf node: Forwards input data to left, outputs data from the left slice.
To understand how the switching mechanism routes data through the ring network, consider a minimal example with 2 slices and 2 input packets per slice.
This example shows how data flows through the ring over time, with each slice deciding whether to output data locally or forward it to neighbors.
Given:
- `axes![A = 2, B = 2, C = 64]`
- Slice: `m![A]`
- Time: `m![B]`
- Packet: `m![C]`
- Input packets per slice: `slice0 = [0, 1]`, `slice1 = [2, 3]`
| i (cycle) | slice#0 | slice#1 | Output Data |
|---|---|---|---|
| 0 | 0: from input, to (output, right) | | 0: [0]; 1: [] |
| 1 | 1: from input, to (output, right) | 0: from left, to output; 2: from input, to left | 0: [0, 1]; 1: [0] |
| 2 | 2: from right, to (output, right) | 1: from left, to output; 3: from input, to left | 0: [0, 1, 2]; 1: [0, 1] |
| 3 | 3: from right, to (output, right) | 2: from left, to output | 0: [0, 1, 2, 3]; 1: [0, 1, 2] |
| 4 | | 3: from left, to output | 0: [0, 1, 2, 3]; 1: [0, 1, 2, 3] |
As a result, a tensor with the following mapping expression is output:
- Slice: `m![A / 2, 2]`
- Time: `m![B / 2, B % 2, A % 2]`
- Packet: `m![C]`
The hardware provides pre-defined regular topologies (like Broadcast01 with slice0 = 2, slice1 = 1) that configure the routers to achieve such patterns efficiently.
Forwarding
Forwarding passes data through the switching network unchanged, preserving the Slice and Time dimension mapping.
Each slice’s router simply passes its input data directly to output; no inter-slice communication occurs.
To use forwarding, skip the .switch() call entirely and invoke collect() directly on the FetchTensor:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 32];
fn forwarding<'l, const T: Tu>(
input: FetchTensor<'l, T, f32, m![1], m![1], m![A], m![B], m![C]>,
) -> CollectTensor<'l, T, f32, m![1], m![1], m![A], m![B], m![C]> {
input.collect()
}
}
The ring network operates at the following minimum cost when forwarding:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
ring_size is 1 since no inter-slice communication is needed, making this the most efficient topology when no actual switching is required.
Broadcast01
Broadcast01 replicates data across slices along two inner Slice sub-dimensions (called slice0 and slice1 in the layout diagram below), enabling parallel computation on the same data across multiple processing elements.
This topology is essential for operations like matrix-vector multiplication where a vector needs to be broadcast to all rows of a matrix distributed across slices.
This topology is parameterized by slice1, slice0, and time0.
The compiler infers slice2 = InSlice::SIZE / (slice1 * slice0) and time1 = InTime::SIZE / time0.
The following table shows the input axis structure (outermost to innermost, left to right):
+--------------------------+---------------+
| Slice | Time |
+--------+--------+--------+-------+-------+
| slice2 | slice1 | slice0 | time1 | time0 |
+--------+--------+--------+-------+-------+
After switching, slice1 and slice0 move from Slice into Time, broadcasting those dimensions across the ring group while tiling slice2:
+----------------------+---------------------------------+
| Slice | Time |
+--------+------+------+-------+--------+-------+--------+
| slice2 | tile | tile | time1 | slice1 | time0 | slice0 |
+--------+------+------+-------+--------+-------+--------+
Moving slice1 and slice0 from Slice to Time creates slice2 independent ring groups, each of size slice1 × slice0, where slices within each ring group exchange data to achieve the broadcast pattern.
The broadcast dimensions (slice1, slice0) are placed at the innermost positions of the output Time dimension (just outside Packet).
This broadcast topology takes data that was spatially distributed across slices (the Slice axis) and broadcasts it over time.
Instead of different slices having different data, all slices in the same ring group receive the same data sequentially through time.
Example
Consider the following configuration:
- `axes![A = 256, B = 64, C = 63, D = 8]`
- dtype = `i8`
- In:
  - Chip: `m![D / 2]`
  - Cluster: `m![D % 2]`
  - Slice: `m![A]`
  - Time: `m![B]`
  - Packet: `m![C # 64]`
- Out:
  - Chip: `m![D / 2]`
  - Cluster: `m![D % 2]`
  - Slice: `m![A / 4, 4]`
  - Time: `m![B / 4, A / 2 % 2, B % 4, A % 2]`
  - Packet: `m![C # 64]`
This configuration sets slice1 = 2, slice0 = 2, time0 = 4 in the Broadcast01 topology.
The compiler infers slice2 = 256 / (2 * 2) = 64 and time1 = 64 / 4 = 16.
Notice that slice2 * slice1 * slice0 = 256 = Slice::SIZE, and time1 * time0 = 64 = (old)Time::SIZE.
The difference between the input and output mappings is that the A % 4 axis moved from Slice to Time, while slice2 is tiled.
This divides the 256 slices into 64 groups of size ring_size = slice0 * slice1 = 4.
The axis movement between Slice and Time enables this broadcast behavior: when an axis moves from Slice to Time, it creates dependencies where slices in a particular ring group receive data from the other slices in the same ring group.
The slice1 and slice0 broadcast axes each move to Time as A / 2 % 2 and A % 2, respectively.
This particular configuration is equivalent to the following custom snoop bitmap, which maps the slice identified by the bitmap index to its corresponding ring group.
The broadcast pattern is evident: rows 0-3 have identical entries as the slices they represent ({0, 1, 2, 3}) receive data from the same input slices ({0, 1, 2, 3}).
| Bitmap Index | (A / 4, A % 4) | A | Ring Group |
|---|---|---|---|
| 0 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 1 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 2 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 3 | (0, 0), (0, 1), (0, 2), (0, 3) | 0, 1, 2, 3 | 0, 1, 2, 3 |
| 4 | (1, 0), (1, 1), (1, 2), (1, 3) | 4, 5, 6, 7 | 4, 5, 6, 7 |
| … | … | … | … |
| 255 | (63, 0), (63, 1), (63, 2), (63, 3) | 252, 253, 254, 255 | 252, 253, 254, 255 |
Since this matches exactly the pre-defined Broadcast01 form, it is an input/output format that can be processed by the Switch Engine.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 63, D = 8, X = 4];
fn broadcast01<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![D / 2], m![D % 2], m![A], m![B], m![C # 64]>,
) -> SwitchTensor<'l, T, i8, m![D / 2], m![D % 2], m![A / 4, X], m![B / 4, A / 2 % 2, B % 4, A % 2], m![C # 64]> {
// X is a newly introduced axis for broadcast semantics.
// Input: each slice has its own portion of data (256 slices, 64 time steps, 64 byte packets)
// Output: all slices receive broadcast data from their 4-slice ring group
// Packet passes through unchanged; call .collect() afterwards to normalize to flits.
input.switch::<m![A / 4, X], m![B / 4, A / 2 % 2, B % 4, A % 2]>(
SwitchConfig::Broadcast01 {
slice1: 2,
slice0: 2,
time0: 4
}
)
}
}
Cycle Estimation
The Switch Engine’s cycle estimation follows the formula:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = (\texttt{slice0} \times \texttt{slice1}) \times \texttt{B::SIZE} \times \frac{\texttt{(C # 64)::SIZE}}{32} $$
$$ = (2 \times 2) \times 64 \times \frac{64}{32} = 512 \text{ cycles} $$
Ring Structure
The ring_size of 4 means that inter-slice data movement occurs in groups of 4 slices, with data dependencies existing only within each ring.
When we group all 256 slices into rings of size 4, we get 64 independent rings that operate in parallel.
Within each ring, exchanging data takes time proportional to ring_size, and each packet represents the minimum unit of data exchange.
Regular topologies can be expressed as tensor mapping expressions. For example, with:
- `axes![A = 64, B = 64]`
- Slice = `m![A]`
- Time = `m![B / 2]`
- Packet = `m![B % 32]`
If configured with Broadcast01 (slice0 = 8, slice1 = 8, time0 = 2), the tensor mapping expression corresponding to the output is:
- `axes![A = 64, B = 64]`
- Slice = `m![A / 64, 64]`
- Time = `m![B / 4, A / 8 % 8, B / 2 % 2, A % 8]`
- Packet = `m![B % 32]`
Broadcast1
Broadcast1 replicates data across slices along Slice dimension 1, enabling parallel computation where a single dimension needs to be broadcast while preserving another dimension in the slice.
This topology is simpler than Broadcast01 as it only broadcasts along one Slice dimension.
This topology is parameterized by slice1 and slice0.
The compiler infers slice2 = InSlice::SIZE / (slice1 * slice0).
An input tensor structured as follows:
+--------------------------+--------+
| Slice | Time |
+--------+--------+--------+--------+
| slice2 | slice1 | slice0 | time0 |
+--------+--------+--------+--------+
is transformed into the following output tensor, where only the slice1 axis moves from Slice to Time, broadcasting this dimension across the slice’s ring group, while preserving slice0 in Slice dimension and tiling slice2.
+------------------------+----------------+
| Slice | Time |
+--------+------+--------+-------+--------+
| slice2 | tile | slice0 | time0 | slice1 |
+--------+------+--------+-------+--------+
Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 63, X = 4];
fn broadcast1<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A], m![B], m![C # 64]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A / 32, X, A % 8], m![B, A / 8 % 4], m![C # 64]> {
// X is a newly introduced axis for broadcast semantics.
// Packet passes through unchanged; call .collect() afterwards to normalize to flits.
input.switch::<m![A / 32, X, A % 8], m![B, A / 8 % 4]>(
SwitchConfig::Broadcast1 {
slice1: 4,
slice0: 8,
}
)
}
}
Transpose
Transpose permutes axes within the innermost part of the Slice dimension.
This topology is parameterized by slice1 and slice0.
An input tensor with the slice dimension structured as [slice2, slice1, slice0] is transformed so that the output slice becomes [slice2, slice0, slice1]:
+--------------------------+ +--------------------------+
| Slice | | Slice |
+--------+--------+--------+ --> +--------+--------+--------+
| slice2 | slice1 | slice0 | | slice2 | slice0 | slice1 |
+--------+--------+--------+ +--------+--------+--------+
Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 256, B = 64, C = 63];
// Transpose with slice1 = 32, slice0 = 2.
// Input Slice: m![A]: [slice2 = 4, slice1 = 32, slice0 = 2]
// Output Slice: m![A / 64, A % 2, A / 2 % 32]
fn transpose<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A], m![B], m![C # 64]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A / 64, A % 2, A / 2 % 32], m![B], m![C # 64]> {
input.switch::<m![A / 64, A % 2, A / 2 % 32], m![B]>(SwitchConfig::Transpose {
slice1: 32,
slice0: 2,
})
}
}
The output slice m![A / 64, A % 2, A / 2 % 32] decomposes the original axis A into three parts: A / 64 extracts slice2 (stride 64, size 4), A % 2 extracts slice0 (stride 1, size 2), and A / 2 % 32 extracts slice1 (stride 2, size 32).
Compared to the input slice ordering ([slice2, slice1, slice0]), slice1 and slice0 are swapped.
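One way to read this mapping is as index arithmetic on the axis value A. The following plain-Rust sketch (not SDK code, written only to mirror the decomposition above) computes where each input slice's data lands after the transpose:
// Input slice a = slice2*64 + slice1*2 + slice0 is remapped to
// output slice slice2*64 + slice0*32 + slice1.
fn transpose_slice_index(a: usize) -> usize {
    let slice2 = a / 64;        // stride 64, size 4
    let slice1 = (a / 2) % 32;  // stride 2,  size 32
    let slice0 = a % 2;         // stride 1,  size 2
    slice2 * 64 + slice0 * 32 + slice1
}

fn main() {
    assert_eq!(transpose_slice_index(0), 0);   // (0, 0, 0) stays in place
    assert_eq!(transpose_slice_index(1), 32);  // slice0 = 1 now has stride 32
    assert_eq!(transpose_slice_index(2), 1);   // slice1 = 1 now has stride 1
    assert_eq!(transpose_slice_index(65), 96); // slice2 = 1, slice0 = 1, slice1 = 0
}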
InterTranspose
While regular Transpose permutes axes within Slice only, InterTranspose swaps between the Slice and Time dimensions and transposes in the Time dimension.
This topology is parameterized by slice1 (the size of the dimension being swapped), slice0, and time0.
The compiler derives slice2 and time2 from the input Slice and Time mappings.
Since time1 must have the same size as slice1 for OutSlice::SIZE to be 256, this effectively swaps equally-sized chunks between the Slice and Time dimensions:
Input:
+--------------------------+-----------------------+
| Slice | Time |
+--------+--------+--------+-------+-------+-------+
| slice2 | slice1 | slice0 | time2 | time1 | time0 |
+--------+--------+--------+-------+-------+-------+
Output:
+--------------------------+------------------------+
| Slice | Time |
+--------+--------+--------+-------+-------+--------+
| slice2 | time1 | slice0 | time2 | time0 | slice1 |
+--------+--------+--------+-------+-------+--------+
The slice2 and slice0 axes remain unchanged in Slice, while time1 in Slice comes from the Time axis.
The output Time dimension contains slice1 from the original Slice dimension.
Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 32, C = 256];
// InterTranspose with slice1 = 2, slice0 = 16, time0 = 2.
// The compiler derives: slice2 = 8, time2 = 2.
// Input Slice: m![C] = [slice2 = 8, slice1 = 2, slice0 = 16]
// Input Time: m![A] = [time2 = 2, time1 = 2, time0 = 2]
// Output Slice: m![C / 32, A / 2 % 2, C % 16]
// Output Time: m![A / 4, A % 2, C / 16 % 2]
fn inter_transpose<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![C], m![A], m![B # 32]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![C / 32, A / 2 % 2, C % 16], m![A / 4, A % 2, C / 16 % 2], m![B # 32]> {
input.switch::<m![C / 32, A / 2 % 2, C % 16], m![A / 4, A % 2, C / 16 % 2]>(
SwitchConfig::InterTranspose {
slice1: 2,
slice0: 16,
time0: 2,
})
}
}
The output Slice (m![C / 32, A / 2 % 2, C % 16]) decomposes into:
- `C / 32` extracts `slice2` (from input `Slice`)
- `A / 2 % 2` extracts `time1` (from input `Time`)
- `C % 16` extracts `slice0` (from input `Slice`)
The output Time (m![A / 4, A % 2, C / 16 % 2]) contains:
- `A / 4` extracts `time2` (from input `Time`)
- `A % 2` extracts `time0` (from input `Time`)
- `C / 16 % 2` extracts `slice1` (from input `Slice`)
Custom Topologies
Regular topologies cover the most common data movement patterns efficiently, but some tensor operations require arbitrary permutations or partial axis extractions that don’t fit these predefined patterns.
Custom topologies solve this problem by allowing you to program exactly which input slices map to which output slices using a bitmap, giving you complete flexibility for complex transformations.
Configuration Overhead
The tradeoff for this flexibility is configuration overhead: using a custom topology requires preempting DMA and sub-context operations to write the bitmap to the hardware’s Special Function Registers (SFRs).
This setup cost makes custom topologies most appropriate when the computation benefits outweigh the initialization overhead.
Supported Transformation Patterns
Custom bitmaps support two key transformation patterns that regular topologies cannot express.
First, they enable free transpose with broadcast, allowing arbitrary permutation and broadcast of partitioning axes—regular topologies only support specific forms like Transpose or TransposedDim1Broadcast, but custom bitmaps let you freely mix axes while broadcasting.
Second, they support partial axis extraction, where only a portion of an axis moves to Time during broadcasting—regular topologies like Broadcast01 always move the entire broadcast axis, but custom bitmaps can select subsets.
Example 1: Arbitrary Permutation
This example demonstrates arbitrary slice dimension permutations that regular topologies cannot express, enabling flexible data reordering for specialized computation patterns.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![B % 4, B / 4, A % 4, A / 4]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
The difference between In and Out is that permutation occurred in the slice shape.
The form of permutation is [0, 1, 2, 3] to [3, 2, 1, 0].
There is no regular topology corresponding to such a free permutation, but it is a form that can be simply expressed with a custom bitmap.
| Bitmap Index | (B % 4, B / 4, A % 4, A / 4) | (A, B) | Ring Group |
|---|---|---|---|
| 0 | (0, 0, 0, 0) | (0, 0) | 0 |
| 1 | (0, 0, 0, 1) | (4, 0) | 64 |
| 2 | (0, 0, 0, 2) | (8, 0) | 128 |
| 3 | (0, 0, 0, 3) | (12, 0) | 192 |
| 4 | (0, 0, 1, 0) | (0, 1) | 1 |
| 5 | (0, 0, 1, 1) | (4, 1) | 65 |
| … | … | … | … |
| 255 | (3, 3, 3, 3) | (15, 15) | 255 |
The cycle calculation follows the standard formula:
$$ \text{cycle} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet}. $$
$$ = 256 \times \texttt{C::SIZE} \times \frac{\texttt{m![D, E]::SIZE}}{32} = 4096 $$
The ring size must be a power of 2, and in this case we need the maximum value of 256, as this particular permutation creates dependencies across all slices with no repeating structure.
For example, data from input slice 192 must reach output slice 3, which means we need a ring large enough to cover all such cross-slice dependencies.
This high cycle count reflects the cost of the arbitrary permutation—contrast this with regular topologies that achieve much lower cycle counts through structured parallelism.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 16, B = 16, C = 8, D = 8, E = 8];
fn arbitrary_permutation<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A, B], m![C], m![D, E]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![B % 4, B / 4, A % 4, A / 4], m![C], m![D, E]> {
input.switch::<m![B % 4, B / 4, A % 4, A / 4], m![C]>(
SwitchConfig::CustomBroadcast { ring_size: 256 }
)
}
}
Example 2: Multi-Axis Broadcast
This example shows broadcasting across multiple non-contiguous axes within the Slice dimension, useful for complex tensor operations that require replication along several independent dimensions simultaneously.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A / 2, 2, B / 2, 2]`
  - Time: `m![C, A % 2, B % 2]`
  - Packet: `m![D, E]`
The difference between In and Out is that the two axes A % 2 and B % 2 moved from Slice to Time, and broadcast occurred at the original position.
Among regular topologies, Dim0/Dim1Broadcast supports a similar form, but it cannot express cases where the axes moving from Slice to Time are non-adjacent within the slice dimension.
However, this is a form that can be simply expressed with a custom bitmap.
| Bitmap Index | (A / 2, A % 2, B / 2, B % 2) | (A, B) | Ring Group |
|---|---|---|---|
| 0 | (0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0), (0, 1, 0, 1) | (0, 0), (0, 1), (1, 0), (1, 1) | 0, 1, 16, 17 |
| 1 | (0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0), (0, 1, 0, 1) | (0, 0), (0, 1), (1, 0), (1, 1) | 0, 1, 16, 17 |
| 2 | (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 1, 0), (0, 1, 1, 1) | (0, 2), (0, 3), (1, 2), (1, 3) | 2, 3, 18, 19 |
| 3 | (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 1, 0), (0, 1, 1, 1) | (0, 2), (0, 3), (1, 2), (1, 3) | 2, 3, 18, 19 |
| … | … | … | … |
| 255 | (7, 0, 7, 0), (7, 0, 7, 1), (7, 1, 7, 0), (7, 1, 7, 1) | (14, 14), (14, 15), (15, 14), (15, 15) | 238, 239, 254, 255 |
The cycle calculation gives us:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = 32 \times 8 \times 2 = 512 \text{ cycles} $$
The ring_size of 32 is smaller than the full 256 slices because the outermost A / 2 part of the slice dimension doesn’t require data exchange—only the remaining 32 slices within each A / 2 group need to exchange data with each other.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 16, B = 16, C = 8, D = 8, E = 8, X = 2, Y = 2];
fn multi_axis_broadcast<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A, B], m![C], m![D, E]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A / 2, X, B / 2, Y], m![C, A % 2, B % 2], m![D, E]> {
input.switch::<m![A / 2, X, B / 2, Y], m![C, A % 2, B % 2]>(
SwitchConfig::CustomBroadcast { ring_size: 32 }
)
}
}
Understanding the Bitmap Pattern
Two key patterns appear in the bitmap that reveal how the transformation works.
First, broadcast manifests as identical bitmaps: bitmap[0] and bitmap[1] are completely identical because output slices 0 and 1 both receive the same source data, implementing the broadcast operation.
Second, the Slice to Time movement appears as one output slice receiving from multiple input slices: bitmap[0] = {0, 1, 16, 17} shows that output slice 0 collects data from four different input slices.
Example 3: Partial Axis Extraction (Slicing)
This example demonstrates extracting only a subset of an axis during the Slice to Time transformation, enabling selective data distribution for operations that don’t require the full axis range.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A, B / 4, 4]`
  - Time: `m![C, B % 4 = 3]`
  - Packet: `m![D, E]`
The difference between In and Out is that the B % 4 axis moved from Slice to Time, and broadcast occurred at the original position.
The unusual point is that B % 4 did not move to Time intact; it was partially sliced (3 out of 4 values).
Among regular topologies, Dim0/Dim1Broadcast supports a similar form, but it cannot express the case where an axis moving from Slice to Time is sliced.
However, this is a form that can be simply expressed with a custom bitmap.
| Bitmap Index | (A, B / 4, B % 4 = 3) | (A, B) | Ring Group |
|---|---|---|---|
| 0 | (0, 0, 0), (0, 0, 1), (0, 0, 2) | (0, 0), (0, 1), (0, 2) | 0, 1, 2 |
| 1 | (0, 0, 0), (0, 0, 1), (0, 0, 2) | (0, 0), (0, 1), (0, 2) | 0, 1, 2 |
| 2 | (0, 0, 0), (0, 0, 1), (0, 0, 2) | (0, 0), (0, 1), (0, 2) | 0, 1, 2 |
| 4 | (0, 1, 0), (0, 1, 1), (0, 1, 2) | (0, 4), (0, 5), (0, 6) | 4, 5, 6 |
| … | … | … | … |
| 255 | (15, 3, 0), (15, 3, 1), (15, 3, 2) | (15, 12), (15, 13), (15, 14) | 252, 253, 254 |
The cycle calculation follows the formula:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = 4 \times 8 \times 2 = 128 \text{ cycles} $$
The small ring_size of 4 reflects that the A, (B / 4) outermost portion of the slice doesn’t exchange data. Only the innermost 4 slices within each group need to communicate.
The bitmap reveals how partial axis extraction works: bitmap[0] = {0, 1, 2} shows that output slice 0 receives from only 3 input slices.
If the bitmap were {0, 1, 2, 3}, it would represent receiving the entire B axis (all 4 values).
By including only {0, 1, 2}, the bitmap implements slicing—extracting 3 out of 4 values from the B axis dimension.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 16, B = 16, C = 8, D = 8, E = 8, X = 4];
fn partial_axis_extraction<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![A, B], m![C], m![D, E]>,
) -> SwitchTensor<'l, T, i8, m![1], m![1], m![A, B / 4, X], m![C, B % 4 = 3], m![D, E]> {
input.switch::<m![A, B / 4, X], m![C, B % 4 = 3]>(
SwitchConfig::CustomBroadcast { ring_size: 4 }
)
}
}
Constraint 1: Order Preservation
Hardware limitations require that axes moving from Slice to Time must preserve their relative order from the input Slice dimension.
This constraint exists because the routing network can efficiently forward data in the original axis order, but reordering axes during the transfer would require additional buffering that the hardware doesn’t provide.
The following example shows an unsupported transformation that violates this constraint.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A, B / 4, 4]`
  - Time: `m![C, B % 2, B / 2]`
  - Packet: `m![D, E]`
In this example, the B % 2 and B / 2 axes appear in reversed order compared to their arrangement in the input slice dimension.
While the slice bitmap could theoretically represent this pattern, the hardware cannot execute it because it lacks the buffering needed to reorder axes during transfer.
If the output slice were instead m![A, B / 4, 4] with time m![C, B / 2, B % 2] and packet m![D, E], then the transformation would be valid.
In this corrected version, the B / 2, B % 2 axes maintain their original order from the input slice, satisfying the order preservation constraint.
Constraint 2: Innermost Time Position
The hardware requires axes moving from Slice to Time to appear at the innermost positions of the output time dimension.
Axes moving from Slice to Time are delivered last per packet, so they become the innermost Time dimensions in the output stream.
The following example shows an unsupported transformation that violates this constraint.
Arguments:
- 1 cluster (256 slices): `Chip`/`Cluster` context applies to a single cluster.
- `axes![A = 16, B = 16, C = 8, D = 8, E = 8]`
- dtype = `i8`
- In:
  - Slice: `m![A, B]`
  - Time: `m![C]`
  - Packet: `m![D, E]`
- Out:
  - Slice: `m![A / 2, 2, B / 2, 2]`
  - Time: `m![A % 2, C, B % 2]`
  - Packet: `m![D, E]`
In this example, the A % 2 and B % 2 axes that move from Slice to Time preserve their relative order correctly.
However, the transformation is still invalid because these axes are not positioned at the innermost part of the output time dimension—the C axis appears between A % 2 and B % 2, violating the innermost position requirement.
Note that Broadcast01 topology can sometimes work around this constraint using the time0 parameter, which provides additional flexibility in axis positioning.
Custom topologies lack this time0 mechanism, so they must strictly place all Slice to Time axes at the innermost time positions.
Constraints
Understanding switching constraints prevents compilation errors and ensures correct data movement patterns.
Why Switching Constraints Exist
The Switch Engine constraints reflect fundamental hardware design decisions about the ring network topology and router capabilities.
Ring network topology fundamentally limits flexibility. The hardware implements a physical ring connecting 256 slices in a fixed order. Data flows counter-clockwise through this ring, with each router deciding whether to output locally or forward to neighbors. This topology is highly efficient for regular patterns (like broadcasting) where all slices follow similar routing rules. However, it cannot efficiently express arbitrary permutations that would require complex routing tables or multiple ring passes. The hardware provides only 256 router configuration entries—one per slice—rather than a full crossbar switch that could connect any slice to any other.
Buffering constraints drive the order preservation rule.
Each slice router has minimal buffering (essentially one packet), which enables high throughput but prevents reordering.
When data arrives from the ring network, the router must immediately decide: output locally or forward?
It cannot buffer multiple packets and reorder them.
Therefore, axes moving from the Slice to Time dimension must maintain their original order—the hardware simply forwards data in arrival order without reordering capabilities.
Pipeline structure requires innermost time position.
Data from other slices arrives last within each packet, so Slice-to-Time axes naturally become the innermost time dimensions.
Placing these axes anywhere else would require the hardware to buffer and reorder complete time sequences, which would require prohibitive amounts of SRAM and complex control logic.
Regular Topology Constraints
Regular topologies impose specific structural requirements:
- Topology pattern matching: Input/output mapping expressions must match the predefined topology pattern; violating this causes a compilation error. Example: `Broadcast01` requires a specific axis ordering (`slice2`, `slice1`, `slice0`, `time1`) that cannot be arbitrarily reordered.
- Full cluster operation: `InSlice::SIZE` = `OutSlice::SIZE` = 256. Partial cluster operations are not supported; violating this causes a compilation error.
Custom Topology Constraints
Custom topologies provide flexibility but impose two critical constraints:
1. Order Preservation: Axes moving from Slice to Time must preserve their relative order from the input slice dimension (see Buffering constraints above).
Violating this causes a compilation error or incorrect data routing.
// Input: Slice: m![A, B]
// INVALID: B % 2 and B / 2 are reversed
// Output: Time: m![C, B % 2, B / 2]
// Valid: Time: m![C, B / 2, B % 2]
2. Innermost Time Position: Axes moving from Slice to Time must appear at the innermost positions of the output Time dimension (see Pipeline structure above).
Violating this causes a compilation error or incorrect data ordering.
// Input: Slice: m![A, B]
// INVALID: C appears between moved axes
// Output: Time: m![A % 2, C, B % 2]
// Valid: Time: m![C, A % 2, B % 2]
Note
The `Broadcast01` topology can sometimes work around the innermost position constraint using the `time0` parameter. Custom topologies lack this mechanism and must strictly follow the constraint.
Performance
The Switch Engine performance directly affects computation throughput since data redistribution overlaps with tensor operations.
Cycle Estimation
Switching operations follow the formula:
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
Where:
- `ring_size`: Number of slices in each independent ring (e.g., `slice0 × slice1` for `Broadcast01`)
- `input_time`: Size of the input time dimension
- `cycles_per_packet`: `packet_size / 32` (number of 32-byte flits per packet)
For example, with ring_size = 4, input_time = 64, and 64-byte packets (2 flits):
$$ \text{#cycles} = \text{ring_size} \times \text{input_time} \times \text{cycles_per_packet} $$
$$ = 4 \times 64 \times 2 = 512 \text{ cycles} $$
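A hypothetical helper (not an SDK API; the function name is ours) that mirrors this formula can make the estimates in this chapter reproducible:
// packet_bytes is the per-packet size as it enters the Switch Engine.
fn switch_cycles(ring_size: u64, input_time: u64, packet_bytes: u64) -> u64 {
    let cycles_per_packet = packet_bytes / 32; // number of 32-byte flits per packet
    ring_size * input_time * cycles_per_packet
}

fn main() {
    // The worked example above: ring_size = 4, input_time = 64, 64-byte packets.
    assert_eq!(switch_cycles(4, 64, 64), 512);
    // Example 1 from Custom Topologies: ring_size = 256, input_time = 8, 64-byte packets.
    assert_eq!(switch_cycles(256, 8, 64), 4096);
}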
Parallelism Across Rings
When 256 slices are grouped into rings (e.g., 64 rings of size 4), all rings operate independently and in parallel.
This parallelism is critical for high throughput: although each ring takes ring_size × input_time × cycles_per_packet cycles to complete, all rings finish simultaneously.
Custom Topology Overhead
Custom topologies provide arbitrary permutation flexibility but incur configuration overhead:
- Requires preempting DMA and sub-context operations
- Must write the bitmap to Special Function Registers (SFRs)
- Setup cost makes custom topologies most appropriate when computation benefits outweigh initialization overhead
Communication Cost
Communication cost in the ring network scales with ring size and data volume:
- Regular topologies: Optimized for common patterns, minimal overhead
- Custom topologies: Flexible but potentially higher setup cost
- Ring topology characteristic: Data movement cost increases proportionally with ring size, unlike other dimensions where stride differences have minimal impact
Collect Engine
The Collect Engine normalizes packets to exactly one flit (a 32-byte flow control unit that all downstream engines operate on). It follows the Switch Engine in the pipeline, or the Fetch Engine directly when forwarding is implied.
Interface
impl<'l, const T: Tu, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
SwitchTensor<'l, { T }, D, Chip, Cluster, Slice, Time, Packet>
{
/// Normalizes packet to exactly 32 bytes (one flit).
///
/// Pads to flit-aligned boundary, then splits: inner 32 bytes become Packet2,
/// outer flit portion is absorbed into Time2.
/// For packets already ≤ 32 bytes, only padding is added.
#[primitive(SwitchTensor::collect)]
pub fn collect<Time2: M, Packet2: M>(self) -> CollectTensor<'l, T, D, Chip, Cluster, Slice, Time2, Packet2> {
verify_collect::<D, Time, Packet, Time2, Packet2>();
CollectTensor::new(self.ctx, self.inner.transpose(false))
}
}
Packet Normalization
The collect() method transforms an arbitrary-sized packet into exactly one flit (32 bytes):
- Pad the input packet to the nearest 32-byte boundary (if not already aligned). Skipped if the packet is already 32-byte aligned.
- Split at the flit boundary: the inner 32 bytes become `Packet2`, and the outer flit count is absorbed into `Time2`. Skipped if the padded packet is at most 32 bytes (i.e., fits in one flit).
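A small sketch (plain Rust, not an SDK API; the helper name is ours) of the same normalization arithmetic, useful for predicting the resulting Time and Packet shapes:
const FLIT_BYTES: usize = 32;

// Returns (padded packet size in bytes, flit count absorbed into Time when > 1).
fn collect_shape(packet_bytes: usize) -> (usize, usize) {
    let padded = packet_bytes.div_ceil(FLIT_BYTES) * FLIT_BYTES;
    let flits = padded / FLIT_BYTES;
    (padded, flits)
}

fn main() {
    assert_eq!(collect_shape(32), (32, 1)); // identity: already one flit
    assert_eq!(collect_shape(16), (32, 1)); // padded up to one flit
    assert_eq!(collect_shape(64), (64, 2)); // split: 2 flits, outer count goes to Time
}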
Packet = 32 bytes (identity)
i8, B = 32: packet = 32 elements × 1 byte = 32 bytes = one flit. Nothing changes.
Before: Time = m![A]
Packet = m![B]
┌──────────────────────────┐
│ B │ 32 bytes
└──────────────────────────┘
After: Time = m![A]
Packet = m![B # 32]
┌──────────────────────────┐
│ B # 32 │ 32 bytes
└──────────────────────────┘
Packet < 32 bytes (pad to one flit)
i8, B = 16: packet = 16 elements × 1 byte = 16 bytes. Padded to 32 bytes.
Before: Time = m![A]
Packet = m![B]
┌────────────┐
│ B │ 16 bytes
└────────────┘
After: Time = m![A]
Packet = m![B # 32]
┌────────────┬─────────────┐
│ B │ pad │ 32 bytes
└────────────┴─────────────┘
Packet > 32 bytes (split into flits)
bf16, B = 32: packet = 32 elements × 2 bytes = 64 bytes = 2 flits.
The outer flit count (2) is absorbed into Time.
Before: Time = m![A]
Packet = m![B]
┌──────────────────────────┬──────────────────────────┐
│ B/16 == 0 │ B/16 == 1 │ 64 bytes
└──────────────────────────┴──────────────────────────┘
32 bytes 32 bytes
After: Time = m![A, B/16]
Packet = m![B % 16]
┌──────────────────────────┐
│ B % 16 │ 32 bytes × B/16 time steps
└──────────────────────────┘
Each flit is delivered in a separate time step, so Time grows from m![A] to m![A, B/16].
Pipeline Position
The collect step is mandatory in the Tensor Unit pipeline: all downstream engines (Contraction, TRF/VRF load, etc.) require exactly-32-byte flits, so every execution must pass through fetch → [switch →] collect to normalize packets before proceeding.
When no slice redistribution is needed, call FetchTensor::collect() directly — no .switch() call is required:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 64];
fn direct_collect<'l, const T: Tu>(
input: FetchTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1], m![1], m![A, B / 32], m![B % 32]> {
input.collect()
}
}
Examples
Single-flit packet (identity)
When the input packet is already exactly 32 bytes, collect passes it through unchanged.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 32];
fn collect_identity<'l, const T: Tu>(
input: SwitchTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B # 32]>,
) -> CollectTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B # 32]> {
// B=32 elements × 1 byte (i8) = 32 bytes = one flit.
// Time and Packet pass through unchanged.
input.collect()
}
}
Sub-flit packet (padding added)
When the input packet is smaller than 32 bytes, collect pads to 32 bytes.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 16];
fn collect_padding<'l, const T: Tu>(
input: SwitchTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1], m![1], m![A], m![B # 32]> {
// B=16 elements × 1 byte = 16 bytes < 32 bytes.
// Padded to 32 bytes: Packet2 = m![B # 32].
// Time unchanged since it fits in one flit.
input.collect()
}
}
Multi-flit packet (outer absorbed into Time)
When the input packet exceeds 32 bytes, collect splits into flits and absorbs the outer portion into Time.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, B = 32];
fn collect_multi_flit<'l, const T: Tu>(
input: SwitchTensor<'l, T, bf16, m![1], m![1], m![1], m![A], m![B]>,
) -> CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![A, B / 16], m![B % 16]> {
// B=32 elements × 2 bytes (bf16) = 64 bytes = 2 flits.
// Inner 16 elements = 32 bytes → Packet2 = m![B % 16].
// Outer 2 flits → absorbed into Time2 = m![A, B / 16].
input.collect()
}
}
Contraction Engine
The Contraction Engine performs einsum operations — tensor contractions such as matrix multiplication, convolution, and attention — which are the dominant computations in deep learning workloads.
The key mental model is weight-stationary execution: one operand (weights) is loaded into TRF (Tensor Register File) once and held fixed while the other streams through the pipeline, so maximizing TRF reuse minimizes memory traffic. As a kernel writer, you specify the einsum expression, input/output data types, and which tensor goes into the TRF as weights. The compiler maps this to the hardware components described below.
The rest of this chapter explains how einsum operations decompose into hardware primitives across the Contraction Engine’s two components: Aligner (Stream Adapter + TRF Sequencer) and Reducer.
Einsum
Einsum (Einstein summation) generalizes matrix multiplication to arbitrary tensors by specifying which dimensions to contract. For background, see Einsum Is All You Need.
// AB, BC -> AC
// AC[i, j] = sum(AB[i, k] * BC[k, j] for k in 0..B)
Every einsum decomposes into four fundamental steps:
- Broadcast LHS: Expand tensor T0: [A, B] to T0_prime: [A, B, C]
- Broadcast RHS: Expand tensor T1: [B, C] to T1_prime: [A, B, C]
- Elementwise multiply: Compute T2 = T0_prime * T1_prime
- Reduce-add: Sum over the contracted dimension to obtain T3: [A, C], where T3[i, j] = sum(T2[i, k, j] for k in 0..B)
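As a reference for these four steps, here is a minimal plain-Rust sketch (host-side code, not Virtual ISA; the function name and row-major layout are only illustrative) that computes AB, BC -> AC by explicit broadcast, elementwise multiply, and reduce-add:
/// Naive AB, BC -> AC einsum, spelled out as the four steps above.
/// `t0` is [A, B] and `t1` is [B, C], both row-major.
fn einsum_ab_bc(a: usize, b: usize, c: usize, t0: &[f32], t1: &[f32]) -> Vec<f32> {
    // Steps 1-3: broadcast both operands to [A, B, C] and multiply elementwise.
    let mut t2 = vec![0.0f32; a * b * c];
    for i in 0..a {
        for k in 0..b {
            for j in 0..c {
                // T0_prime[i, k, j] = T0[i, k]; T1_prime[i, k, j] = T1[k, j]
                t2[(i * b + k) * c + j] = t0[i * b + k] * t1[k * c + j];
            }
        }
    }
    // Step 4: reduce-add over the contracted dimension B.
    let mut t3 = vec![0.0f32; a * c];
    for i in 0..a {
        for j in 0..c {
            t3[i * c + j] = (0..b).map(|k| t2[(i * b + k) * c + j]).sum();
        }
    }
    t3
}
On the TCP, steps 1 and 2 never materialize T0_prime and T1_prime; the components listed in the table below realize the broadcasts implicitly.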
Overview
flowchart TB
subgraph CE[Contraction]
direction LR
SA[Stream Adapter] --> RD[Reducer]
TS[TRF Sequencer] --> RD
end
TRF[(TRF)] --> TS
CO[Collect] --> SA
RD --> VE[Vector]
click SA "./stream-adapter.html" "Stream Adapter"
click TS "./trf-sequencer.html" "TRF Sequencer"
click RD "./reducer.html" "Reducer"
click CO "../collect-engine.html" "Collect Engine"
click VE "../vector-engine/index.html" "Vector Engine"
The einsum steps map to diagram components:
| Einsum Step | Component |
|---|---|
| LHS broadcast | Switch Engine → Collect Engine → Aligner: Stream Adapter |
| RHS broadcast | Aligner: TRF Sequencer |
| Elementwise multiply | Reducer |
| Reduce-add | Reducer |
For reductions across slices or chips, the Vector Engine handles the final aggregation.
The following sections present case studies showing how common operations map to the Contraction Engine. Each case study shows a compiler-generated configuration dump; for the format definition, see Aligner and Reducer. For a beginner-friendly introduction, see the Hello, Contraction! Tutorial.
Case Studies
Batched MatMul
This section demonstrates how batched matrix multiplication maps to the Contraction Engine using the einsum VMK, VNK -> VMN, where V is the batch axis, M and N are the output axes, and K is the contraction axis.
Choose a mapping based on which axis is largest: use K contraction when K is large (maximizes Reducer efficiency), V vectorized when the batch axis V is large (maximizes temporal parallelism), and N×M tiled when both output axes are large (distributes work across TRF rows and time).
K contraction by Reducer
The Reducer can perform the K-axis contraction directly, placing K in the temporal dimension.
The following dump shows the resulting input, TRF, computation, and accumulation mappings:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [V=32, M=32, K=32] ] (1)
// TRF mapping: [ Row: [N=8] | H: [V=5, N/8=3, K=32] ] (1)
// Computation mapping: [ H: [V=32, M=32] | Row: [N=8] | T: [K=32] ] (1)
// Accumulation mapping: [ H: [V=32, M=32, N/8=3] | T: [N=8] ] (1)
V - vectorized mapping
The batch axis V can be placed in the temporal dimension for vectorized computation.
The following dump shows how V moves into the temporal dimension while M and K remain in their respective positions:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [M=5, K=32, V=32] ] (1)
// TRF mapping: [ Row: [N=8] | H: [N/8=3, K=32, V=32] ] (1)
// Computation mapping: [ H: [M=5, N/8=3, K=32] | Row: [N=8] | T: [V=32] ] (1)
// Accumulation mapping: [ H: [M=5, N=24] | T: [V=32] ] (1)
N x M - tiled mapping
Both output axes N and M can be tiled across the hardware for maximum parallelism.
The following dump shows both output axes distributed across the TRF Row and temporal dimensions:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [V=5, K=32, M=32] ] (1)
// TRF mapping: [ Row: [N=8] | H: [V=5, N/8=3, K=32, M=32] ] (1)
// Computation mapping: [ H: [V=5, N/8=3, K=32] | Row: [N=8] | T: [M=32] ] (1)
// Accumulation mapping: [ H: [V=5, N=24] | T: [M=32] ] (1)
Mixed configurations and constraints for these mappings are detailed in the Reducer section.
2D Convolution
This section demonstrates 2D convolution mapping using the einsum $(H + Fh)$(W + Fw)K, FhFwKC -> HWC, where H and W are spatial output axes, C is the output channel axis, and Fh, Fw, K are contraction axes (filter height, filter width, and input channels). Variations covered in the batched matmul section are omitted.
Filter-Stride 1
For stride-1 convolution, the Stream Adapter performs shift-reuse on the input to produce sliding windows. The $(H+Fh) sliding is done in the Fetch Engine before reaching the Stream Adapter, while the input $(W+Fw) undergoes shift-reuse in the Stream Adapter to produce Fw, W sliding in the computation. The example below uses shift-stride of 1 with two shifts:
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=30, Fh=3, K=32, $(W=30 + Fw=3)=32] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=3, Fw=3] ] (1)
// Computation mapping: [ H: [H=30, C/8=3, Fh=3, K=32, Fw=3] | Row: [C=8] | T: [W=30+2#] ] (1)
// Accumulation mapping: [ H: [H=30, C=32] | T: [W=30+2#] ] (1)
Filter-Stride 2
For stride-2 convolution, shift-reuse with 1 shift and shift-stride of 2 extracts strided windows.
The transformation $(W:2=15 + Fw=4)=32 produces Fw/2=2, (W=15, Fw=2), conceptually extracting a size-2 axis with stride :2 from a linear combination as an outer product: $(W:2=15 + (Fw/2:2=2, Fw=2))=32 becomes Fw/2=2, $(W:2=15, Fw=2).
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=15, Fh=4, K=32, $(W:2=15 + Fw=4)=32] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=4, Fw=4] ] (1)
// Computation mapping: [ H: [H=15, C/8=3, Fh=4, K=32, Fw/2=2] | Row: [C=8] | T: [W=15+1#, Fw=2] ] (1)
// Accumulation mapping: [ H: [H=15, C=32] | T: [W=15+1#] ] (1)
To fully utilize MACs, fill more flits in the shift buffer by increasing feed_flits from the default of 2 to 3. The transformation $(W:2=16 + Fw=4)=34 then produces Fw/2=2, (W=16, Fw=2):
// Configuration: feed_flits = 3, input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=16, Fh=4, K=32, $(W:2=16 + Fw=4)=34] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=4, Fw=4] ] (1)
// Computation mapping: [ H: [H=16, C/8=3, Fh=4, K=32, Fw/2=2] | Row: [C=8] | T: [W=16, Fw=2] ] (1)
// Accumulation mapping: [ H: [H=16, C=32] | T: [W=16] ] (1)
Dilation 2
For dilation-2 convolution, shift-reuse with 2 shifts and shift-stride of 2 extracts dilated filter positions.
The transformation $(W=27 + Fw:2=3)=32 produces Fw=3, W=27, conceptually extracting a size-3 axis with stride :2 from a linear combination as an outer product: $(W=27 + Fw:2=3)=32 becomes Fw=3, $(W=27).
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=27, Fh=3, K=32, $(W=27 + Fw:2=3)=32] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=3, Fw=3] ] (1)
// Computation mapping: [ H: [H=27, C/8=3, Fh=3, K=32, Fw=3] | Row: [C=8] | T: [W=27+5#] ] (1)
// Accumulation mapping: [ H: [H=27, C=32] | T: [W=27+5#] ] (1)
Filter-Stride 2, Dilation 2
Combining stride-2 and dilation-2 requires shift operations similar to dilation 2 alone.
The transformation $(W:2=14 + Fw:2=3)=31 + 1# produces Fw=3, W=14, 1+1#, extracting a size-3 axis with stride :2 from a linear combination as an outer product: $(W:2=14 + Fw:2=3)=31 becomes Fw=3, $(W:2=14).
The notation 1z is similar to 1# (dummy padding) but filled with zeros instead of arbitrary values. The TRF must contain zero-padded dummies so that 1+1# contracted with 1+1z yields 1. Note that 1+1# contracted with 1+1# would yield 1#.
// Configuration: input_type = bf16, trf_type = bf16, reduce_op = `Add`
// Input mapping: [ H: [H=14, Fh=3, K=32, $(W:2=14 + Fw:2=3)=31+1#] ] (1)
// TRF mapping: [ Row: [C=8] | H: [K=32, C=24, Fh=3, Fw=3, 1+1z] ] (1)
// Computation mapping: [ H: [H=14, C/8=3, Fh=3, K=32, Fw=3] | Row: [C=8] | T: [W=14+2#, 1+1z] ] (1)
// Accumulation mapping: [ H: [H=14, C=32] | T: [W=14+2#] ] (1)
Constraints
Mapping alignment (compiler-enforced): The computation mapping from the Stream Adapter must exactly match the computation mapping from the TRF Sequencer. Misaligned mappings prevent the Reducer from operating correctly. The compiler ensures this alignment during code generation.
TRF capacity (programmer responsibility): TRF storage limits constrain weight tensor size. Large weight tensors require tiling and multiple SRAM-to-TRF loads, adding overhead. The TRF Sequencer can broadcast weight data across time and head dimensions, enabling efficient reuse of loaded weights.
Row utilization (programmer responsibility): The hardware provides 8 Rows. Operations should use all 8 Rows when possible to maximize throughput. Using fewer Rows (1, 2, 4) reduces parallelism and effective computational bandwidth.
Stream Adapter buffer limits (compiler-enforced): The Stream Adapter has limited buffer capacity for shift operations and packet collection. Configurations exceeding these limits are invalid. See Stream Adapter documentation for specific capacity constraints.
Data type support (compiler-enforced): Input data types are limited to i4, i8, f8, and bf16. Output types are widened automatically (i4/i8 → i32, f8/bf16 → f32). The Reducer does not support f32 input directly, though f32 can be processed in the Vector Engine after contraction.
Performance
Contraction Engine performance depends on Row utilization, TRF reuse, and memory bandwidth:
Row parallelism: Using all 8 Rows achieves 8× parallelism. Configurations with fewer Rows proportionally reduce throughput. Configure tensor mappings to distribute work across all available Rows.
TRF reuse through broadcasting: Weight data loaded into TRF can be broadcast across time and head dimensions at no additional cost. Design tensor mappings to maximize weight reuse through broadcasting, minimizing SRAM-to-TRF transfers.
Pipeline latency: The Contraction Engine pipeline includes the Reducer (5-7 cycles for spatial reduction depending on data type, plus cycles proportional to time dimension size for temporal reduction). Total latency is the sum of these stages plus Stream Adapter/TRF Sequencer overhead.
Memory bandwidth bottlenecks: The Stream Adapter is limited by DM fetch bandwidth (256 B/cycle with proper interleaving). TRF bandwidth is typically not a bottleneck due to the broadcasting capability. Ensure fetch patterns interleave across DMNs and slices to maximize bandwidth utilization.
Aligner
The Aligner stage prepares both operands for the Reducer by transforming them into a matching computation mapping.
The computation mapping is the common tensor layout ([Chip, Cluster, Slice, Row, Time, Packet]) that both the Stream Adapter and TRF Sequencer must produce so the Reducer can pair them element-by-element.
It is positioned within the Contraction Engine data flow as follows:
fetch() -> switch() -> collect() -> align(trf) -> contract() -> accumulate()
The Aligner consists of two parallel paths:
| Path | Component | Source | Role |
|---|---|---|---|
| Data | Stream Adapter | Collect Engine (Stream data from DM) | Collect flits, broadcast to Rows |
| Weight | TRF Sequencer | TRF (weight data) | Broadcast and transform weight data |
Overview
┌───────────────────────────────────────────────────┐
│ Aligner │
│ │
│ ┌─────────────────────┐ │
Switching ──────► │ Stream Adapter ────────►│ │ │
Engine │ │ Computation mapping │───► Reducer
│ | | │
TRF ────────────► │ TRF Sequencer ────────►│ │ │
│ └─────────────────────┘ │
│ │
└───────────────────────────────────────────────────┘
The computation mapping consists of the following dimensions:
- Chip: No change from Stream Adapter/TRF Sequencer input
- Cluster: No change from Stream Adapter/TRF Sequencer input
- Slice: No change from Stream Adapter/TRF Sequencer input
- Row: Maps to the 8 Rows in the Reducer
- Time: The temporal dimension for sequential processing
- Packet: Data packet dimension
The key difference between the two paths is:
- Stream Adapter: Always populates Rows via broadcasting, and supports basic flit collection and data feeding for convolutions.
- TRF Sequencer: Leverages a sequencer to enable more complex data transformations.
Example: Batched MatMul
A batched matrix multiplication demonstrates how the Stream Adapter and TRF Sequencer align data and weights into a matching computation mapping (each detailed in the Stream Adapter and TRF Sequencer sub-sections). The code below does three things:
- Flit Collection (Stream Adapter, collect_flits = 2): L = 2 flits are collected from the innermost Time axis into the Packet dimension, forming a 64B packet. The collected data is broadcast to Rows (1, 2, 4, or 8 rows depending on the computation mapping).
- Packet Broadcast (TRF Sequencer, reg_read_size = 32B): The TRF Sequencer reads 32B (K = 16 bf16) contiguously each cycle and broadcasts twice to fill the 64 bytes, matching the Stream Adapter's Packet.
- Time Permute (TRF Sequencer): The order of axes in TRF Element [O = 2, M = 32, K = 16] does not match Time: [M = 32, O = 2]. The sequencer reorders this by placing O in Entry 0 (inner loop) with stride 1024, while M uses Entry 1 (outer loop) with stride 32.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, L = 2, O = 2];
/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, O, M], m![K]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, M, K]> {
// TRF mapping: [
// Row: [N = 8]: 8 output channels mapped to 8 Rows
// Element: [O = 2, M = 32, K = 16]: each Row stores 2×32×16 = 1024 bf16 elements
// ]
weights.to_trf(TrfAddress::FirstHalf)
}
/// Aligns data and weights, then contracts (main context).
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![M, O, L], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, M, K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, O, L], m![N]> {
// Collect mapping: [Time: [M = 32, O = 2, L = 2], Packet: [K = 16]]
// TRF mapping: [Row: [N = 8], Element: [O = 2, M = 32, K = 16]]
//
// Stream Adapter (collect_flits = 2):
// Flit Collection:
// Collects L = 2 flits from innermost Time into Packet.
// After collection, the computation mapping dimensions become:
// Time = [M = 32, O = 2], Packet = [L = 2, K = 16] = 32 bf16 = 64B
// Broadcasts Packet to Rows (N = 8).
//
// TRF Sequencer (reg_read_size = 32B):
// Packet Broadcast:
// reg_read_size read: reads K = 16 bf16 = 32B contiguously from TRF,
// then broadcasts 2× to fill the 64B — matching Packet = [L = 2, K = 16].
// Time Permute:
// TRF Element outer of reg_read_size(K) is [O = 2(outer), M = 32(inner)],
// but Time is [M = 32(outer), O = 2(inner)] — M, O are reordered via sequencer.
//
// Compiler-generated TRF Sequencer configuration:
// Entry 0: { size: 2, stride: 1024 } — O (inner loop, stride = K×M×sizeof(bf16))
// Entry 1: { size: 32, stride: 32 } — M (outer loop, stride = K×sizeof(bf16))
//
// Computation mapping: [Time: [M = 32, O = 2], Row: [N = 8], Packet: [L = 2, K = 16]]
// Output mapping: [Time: [M = 32, O = 2, L = 2], Packet: [N = 8]]
// (K is contracted, column major)
input.align::<m![M, O], m![L, K], _, _>(trf)
.contract::<m![L]>()
.accumulate::<m![M, O, L], m![N]>(AccumulationKind::Interleaved)
}
}
For details on each component, see the sub-sections:
- Stream Adapter — Flit collection, Rows broadcast
- Advanced Operations — Transpose, Shift (for convolutions)
- TRF Sequencer — SRAM-to-TRF, weight broadcasting
Stream Adapter
The Stream Adapter is part of the Aligner stage. It transforms activation data from the Collect Engine into the computation mapping required by the Reducer. It collects incoming flits into properly sized packets and broadcasts them across Rows, enabling data reuse across output channels. This operation is the data-side counterpart to the TRF Sequencer, which prepares weight data on the other side.
Interface
The Stream Adapter is configured through the align method on CollectTensor (see TRF Sequencer — Interface for the full API).
The Time and Packet type parameters determine how the Stream Adapter reshapes the input:
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
impl<'l, const T: Tu, D, Chip, Cluster, Slice, Time, Packet>
CollectTensor<'l, { T }, D, Chip, Cluster, Slice, Time, Packet>
{
/// Aligns this input stream with a TRF tensor for contraction.
/// Configures both the Stream Adapter (data path) and TRF Sequencer (weight path)
/// to produce a matching computation mapping.
pub fn align<OutTime: M, OutPacket: M, Row: M, TrfElement: M>(
self,
trf_tensor: &TrfTensor<D, Chip, Cluster, Slice, Row, TrfElement>,
) -> AlignedPair<'l, { T }, D, Chip, Cluster, Slice, Row, OutTime, OutPacket> {
// Hardware implementation: configures Stream Adapter and TRF Sequencer
}
}
The typical data flow is: switch() → collect() → align(&trf) → contract() → accumulate() for activations (main context).
The Chip, Cluster, and Slice dimensions pass through unchanged.
Architecture
Conceptual Operation
The Stream Adapter transforms the collect tensor mapping into the computation mapping:
Collect mapping: [Chip, Cluster, Slice, Time, Packet]
↓ Stream Adapter (collect + broadcast)
Computation mapping: [Chip, Cluster, Slice, Row, OutTime, OutPacket]
This transformation involves three operations:
- Collect: Buffer collect_flits incoming 32-byte flits from the innermost Time axis into Packet, creating the OutTime and OutPacket mappings.
- Rows broadcast: Broadcast the collected OutPacket to 1, 2, 4, or 8 Rows (determined by the computation mapping).
- Time broadcast: Repeat the same activation data across tiling axes in OutTime.
For advanced operations (transpose, shift-and-reuse for convolutions), see Advanced Operations.
Flit Buffer
The Flit Buffer buffers incoming flits so the Reducer receives data in properly sized units.
The Collect Engine sends data in 32-byte flits.
The collect_flits parameter controls how many consecutive flits are collected into one OutPacket:
| collect_flits | Data per Packet | Zero padding | MAC utilization | Use case |
|---|---|---|---|---|
| 1 | 32 bytes | 32 bytes | Half | Small data where a single flit covers the Packet axis |
| 2 (default) | 64 bytes | None | Full | Standard — full mac_width utilization |
| 3 | 96 bytes | N/A | Full | Shift-reuse with padding (see Advanced) |
OutPacket is always 64 bytes (mac_width).
The collect_flits parameter determines how much of that 64 bytes is actual data versus zero padding.
When collect_flits = 2, the innermost Time axis is consumed into Packet.
For example, if the collect mapping has Time: [..., L = 2] and Packet: [K = 16], collecting L = 2 produces Packet = [L = 2, K = 16] = 32 bf16 elements = 64 bytes of data, filling the entire mac_width.
When collect_flits = 1, no Time axis is consumed.
The original Packet (32 bytes) occupies the first half, and the remaining 32 bytes are zero-padded.
Only half the MACs produce meaningful results — the zero-padded half always multiplies by zero.
The Flit Buffer has 96-byte physical capacity: up to 3 single-channel flits (32 bytes each) or 1 dual-channel flit (64 bytes).
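The data-versus-padding split in the table is simple arithmetic. The following plain-Rust sketch (host-side illustration, not Virtual ISA; it ignores the shift-reuse details of collect_flits = 3 and uses the flit and mac_width sizes stated above) returns the split per OutPacket:
const FLIT_BYTES: usize = 32;
const MAC_WIDTH_BYTES: usize = 64;

/// Returns (data_bytes, zero_pad_bytes) inside the 64-byte OutPacket
/// for a given collect_flits setting (1, 2, or 3).
fn out_packet_fill(collect_flits: usize) -> (usize, usize) {
    let data = (collect_flits * FLIT_BYTES).min(MAC_WIDTH_BYTES);
    (data, MAC_WIDTH_BYTES - data)
}

fn main() {
    assert_eq!(out_packet_fill(1), (32, 32)); // half MAC utilization
    assert_eq!(out_packet_fill(2), (64, 0));  // full mac_width
    assert_eq!(out_packet_fill(3), (64, 0));  // shift-reuse; see Advanced Operations
}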
Rows Broadcast
After collection, the Stream Adapter broadcasts the same OutPacket data to multiple Rows.
The number of Rows receiving the broadcast is determined by the computation mapping: 1, 2, 4, or 8.
This is in contrast to the TRF Sequencer, where each Row reads different weight data from its own TRF partition. The Reducer then multiplies each Row’s shared activation data against its unique weights.
┌─── Row 0: Packet (same data)
Stream Adapter ──┼─── Row 1: Packet (same data)
(rows=4) ├─── Row 2: Packet (same data)
└─── Row 3: Packet (same data)
Time Broadcast
When the computation mapping includes Time axes that have no corresponding axes in the activation data, the Stream Adapter tiles the input data.
For example, if the TRF data has a T = 5 axis that the activation data lacks, the Stream Adapter tiles the input Packet 5 times.
┌─── T = 0: Packet (same data)
├─── T = 1: Packet (same data)
Time broadcast ───┼─── T = 2: Packet (same data)
(T = 5) ├─── T = 3: Packet (same data)
└─── T = 4: Packet (same data)
Tiling axes are placed at the innermost positions of OutTime.
Multiple tiling axes can be used.
Specifications
| Parameter | Values | Description |
|---|---|---|
| collect_flits | 1, 2, 3 | Number of 32-byte flits collected per OutPacket |
| Flit Buffer capacity | 96 bytes | Physical buffer limit (3 × 32-byte flits) |
| OutPacket size | Always 64B | = mac_width; zero-padded when collect_flits = 1 |
| Rows | 1, 2, 4, 8 | Number of Rows receiving the broadcast (from computation mapping) |
| Tiling axes | Any size, stride = 0 | Time axes that broadcast activation data without re-fetching |
Performance
For collect_flits = 1 or 2, the Stream Adapter is effectively a pass-through with no overhead.
The collect_flits = 3 case (shift-reuse) introduces additional latency; see Advanced Operations.
Examples
collect_flits = 2 (Flit Collection)
This example collects L = 2 flits from the innermost Time axis into Packet, producing a 64B OutPacket (computation packet):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, L = 2, O = 2];
fn align<'l, const T: Tu>(
input: CollectTensor<'l, { T }, bf16, m![1], m![1], m![1], m![M, O, L], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, K]>,
) -> AlignedPair<'l, { T }, bf16, m![1], m![1], m![1], m![N], m![M, O], m![L, K]> {
// Collect mapping: [Time: [M=32, O=2, L=2], Packet: [K=16]]
//
// Stream Adapter (collect_flits = 2):
// Flit Collection:
// Collects L = 2 flits from innermost Time into Packet:
// Time = [M = 32, O = 2], Packet = [L = 2, K = 16] = 32 bf16 = 64B
// Broadcasts Packet to Rows (N = 8).
//
// Computation mapping:
// [Time: [M = 32, O = 2] | Row: [N = 8] | Packet: [L = 2, K = 16]]
input.align::<m![M, O], m![L, K], _, _>(trf)
}
}
collect_flits = 1 (No Collection)
When the Packet axis already covers the contraction dimension and no additional flits need to be collected:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16];
fn align<'l, const T: Tu>(
input: CollectTensor<'l, { T }, bf16, m![1], m![1], m![1], m![M], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AlignedPair<'l, { T }, bf16, m![1], m![1], m![1], m![N], m![M], m![K # 32]> {
// Switch mapping: [Time: [M = 32], Packet: [K = 16]]
//
// Stream Adapter (collect_flits = 1):
// Flit Collection:
// No Time axis collected — data = [K = 16] = 16 bf16 = 32B.
// Packet = [K = 16 # 32] = 64B (32B data + 32B zero padding).
// Broadcasts Packet to Rows (N = 8).
// Half MAC utilization — zero-padded half always multiplies by zero.
//
// Computation mapping:
// [Time: [M = 32] | Row: [N = 8] | Packet: [K = 16 # 32]] (64 bytes)
input.align::<m![M], m![K # 32], _, _>(trf)
}
}
Time Broadcast
When the TRF has axes not present in the input data, the Stream Adapter tiles the activation across Time:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, T = 5];
fn align<'l, const T: Tu>(
input: CollectTensor<'l, { T }, bf16, m![1], m![1], m![1], m![M], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![T, K]>,
) -> AlignedPair<'l, { T }, bf16, m![1], m![1], m![1], m![N], m![M, T], m![K # 32]> {
// Collect mapping: [Time: [M = 32], Packet: [K = 16]]
//
// Stream Adapter (collect_flits = 1):
// Flit Collection:
// No Time axis collected — Packet = [K = 16 # 32] (32B data + 32B zero padding).
// Rows Broadcast: N = 8.
// Time Broadcast: T = 5 - activation tiled 5 times per M position.
//
// Computation mapping:
// [Row: [N = 8], Time: [M = 32, T = 5], Packet: [K = 16 # 32]]
input.align::<m![M, T], m![K # 32], _, _>(trf)
}
}
TRF Sequencer
The TRF Sequencer is part of the Aligner stage.
It reads weight data from the Tensor Register File (TRF) and reshapes it to match the computation mapping required by the Reducer.
It broadcasts stored weights across the temporal (via sequencer) and spatial (via reg_read_size) dimensions, enabling weight reuse without additional memory usage.
This operation is the weight-side counterpart to the Stream Adapter, which prepares activation data on the input side.
Interface
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
/// TRF address modes for partitioning the register file.
enum TrfAddress {
FirstHalf, // First half of TRF
SecondHalf, // Second half of TRF
Full, // Entire TRF
}
impl<'l, const T: Tu, D, Chip, Cluster, Slice, Time, Packet>
CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Stores tensor data from the Collect Engine into TRF.
/// The outermost axes of the input become the Row dimension,
/// and the remaining inner axes become the Element dimension.
/// The resulting layout is [Chip, Cluster, Slice, Row, Element].
pub fn to_trf<Row: M, Element: M>(
self,
address: TrfAddress,
) -> TrfTensor<D, Chip, Cluster, Slice, Row, Element> {
// Hardware implementation: writes data to TRF via SRAM-to-TRF
}
/// Aligns this input stream with a TRF tensor for contraction.
/// Configures the TRF Sequencer to reshape the TRF tensor
/// to match the computation mapping.
pub fn align<OutTime: M, OutPacket: M, Row: M, TrfElement: M>(
self,
trf_tensor: &TrfTensor<D, Chip, Cluster, Slice, Row, TrfElement>,
) -> AlignedPair<'l, T, D, Chip, Cluster, Slice, Row, OutTime, OutPacket> {
// Hardware implementation: configures Stream Adapter and TRF Sequencer
}
}
The typical data flow is: collect() → to_trf() for weights (sub context), then collect() → align(&trf) → contract() → accumulate() for activations (main context).
The Chip, Cluster, and Slice dimensions pass through unchanged.
Architecture
Conceptual Operation
The TRF Sequencer transforms the TRF tensor mapping into the computation mapping:
TRF mapping: [Chip, Cluster, Slice, Row, Element]
↓ TRF Sequencer
Computation mapping: [Chip, Cluster, Slice, Row, OutTime, OutPacket]
This transformation involves four operations:
- Spatial Read: Fill in OutPacket (which is 64 bytes), with the mechanism involving reg_read_size.
- Row Partitioning: Each Row reads from its own TRF region.
- Temporal Broadcasting: Axes with stride 0 are broadcast across Time, reusing the same weight data each cycle.
- Time Reordering: The Time axes are reordered via a nested-loop sequencer configuration.
SRAM
│
│ SRAM-to-TRF (short command or tensor unit path)
▼
┌──────────────┐ TRF mapping: [Chip, Cluster, Slice, Row, Element]
│ TRF │
└──────┬───────┘
│ TRF Sequencer (nested-loop config)
▼
┌──────────────┐ Computation mapping: [Chip, Cluster, Slice, Row, OutTime, OutPacket]
│ Reducer │◄── Stream Adapter (activation data)
└──────────────┘
TRF Read Mechanism
Every cycle, the Reducer consumes exactly mac_width (64 bytes) of data. This 64-byte window is composed of two parts:
┌──────────── mac_width (64 bytes) ────────────┐
│ broadcast │ reg_read_size (contiguous) |
│ ← (repeated) → │ ← inner (from TRF) → |
└────────────────┴─────────────────────────────┘
- reg_read_size: The number of contiguous bytes read from TRF each cycle. Must be a power of two: 1, 2, 4, 8, 16, 32, or 64 bytes.
- broadcast: The portion of mac_width not covered by reg_read_size is filled by repeating the read data. For example, if reg_read_size = 8, the 8 bytes are broadcast 8× to fill 64 bytes.
The inner part (within reg_read_size) is always read contiguously each cycle. The sequencer does not control this region.
The outer part (beyond mac_width) is controlled by the sequencer entries’ (size, stride) pairs, which specify the iteration order over the remaining dimensions.
Note
reg_read_size is not a user-specified parameter. The compiler determines it by comparing the innermost axes of the TRF mapping and the computation mapping: the contiguous portion that is common to both (within 64 bytes) becomes reg_read_size. This means neither the TRF mapping alone nor the computation mapping alone determines reg_read_size — it is derived from their intersection.
Note
64-byte alignment constraint: When reg_read_size = 64 bytes (i.e., equal to mac_width), the base address and all sequencer strides must be aligned to 64 bytes. A 64-byte read spans both bank columns (32 bytes each); if the address is not 64-byte aligned, the read would cross a bank column boundary, which the hardware does not support.
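To make the reg_read_size / broadcast split concrete, the following plain-Rust sketch (host-side illustration, not Virtual ISA; the function name is hypothetical, and mac_width = 64 bytes comes from the text above) assembles one 64-byte window by repeating a contiguous reg_read_size-byte read:
const MAC_WIDTH: usize = 64;

/// Builds the 64-byte window consumed by the Reducer in one cycle:
/// `reg_read_size` contiguous bytes are read from `trf` at `base`,
/// then repeated to fill mac_width (e.g. 8 bytes are broadcast 8×).
fn read_window(trf: &[u8], base: usize, reg_read_size: usize) -> [u8; MAC_WIDTH] {
    assert!(reg_read_size.is_power_of_two() && reg_read_size <= MAC_WIDTH);
    let chunk = &trf[base..base + reg_read_size];
    let mut window = [0u8; MAC_WIDTH];
    for (i, byte) in window.iter_mut().enumerate() {
        *byte = chunk[i % reg_read_size]; // broadcast = repetition of the inner read
    }
    window
}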
SRAM-to-TRF (StoTRF)
Data is loaded into TRF after the Collect Engine. If the sequencer is configured for completely contiguous access (no gaps or reordering), the load can be executed as a short command (a compact hardware instruction that bypasses the full tensor unit pipeline).
If this condition is not met, the load goes through the full tensor unit path (SRAM → Fetch → Switch → Collect → to_trf()), which supports arbitrary layouts via the fetch engine but has higher setup overhead.
TRF Memory Layout
The TRF is a banked SRAM organized as 8 bank rows × 2 bank columns.
Each bank row corresponds to a Row.
Each bank contains 128 rows (Full mode) or 64 rows (Half mode), and each row provides 320 bits of storage, holding 32 bytes of element data in widened slots (see reg_types below):
bank row 7 ──────────────────────────────────────────┐
(= Row 7) bank col 0 bank col 1 │
: ┌─ 320b ─┐ ┌─ 320b ─┐ ╱
: ╱ ╱| ╱ ╱| ╱
bank row 1 ╱─────────╱ | ╱─────────╱ | ╱ bank row
bank row 0╱ ╱ | ╱ ╱ | ╱ (= Row)
│ │ │ │ │ │
│ 128 rows│ ╱ │ 128 rows│ ╱
│ │╱ │ │╱
└─────────┘ └─────────┘
Each bank row corresponds to a Row and can be accessed independently in parallel. The 2 bank columns within each bank row share the same row address space.
Each element in the TRF is addressed via a bit-field index:
┌───────────┬──────────────┬──────────┬──────────┐
│ bank row │ row in bank │ bank col │ offset │
│ (3 bit) │ (7b / 6b) │ (1 bit) │ (6 bit) │
└───────────┴──────────────┴──────────┴──────────┘
| Field | Bits | Description |
|---|---|---|
| bank row | 3 | Selects Row (0–7). Each bank row corresponds directly to a Row and can be accessed independently, enabling parallel reads. When rows < 8, unused bits extend the row address space |
| row in bank | 7 (Full) / 6 (Half) | Selects row within a bank: 128 rows (Full) or 64 rows (Half). FirstHalf uses the lower 64 rows (rows 0–63) and SecondHalf uses the upper 64 rows (rows 64–127), so Half mode needs only 6 bits |
| bank col | 1 | Selects bank column (2 columns per bank) |
| offset | 6 | Element offset within a row in 5-bit granularity (64 positions × 5 bits = 320 bits per row) |
The reg_types value determines how many 5-bit slots each element occupies:
| reg_types | Element Width | Slots per Element | Elements per Row |
|---|---|---|---|
| 0 | 5-bit (i4 extended) | 1 | 64 |
| 1 | 10-bit (i4→i8) | 2 | 32 |
| 2 | 10-bit (i8→f8) | 2 | 32 |
| 3 | 20-bit (bf16) | 4 | 16 |
When rows < 8, the unused bank row bits effectively increase the per-Row capacity. For example, with rows = 4, one extra bit extends row in bank from 7 to 8 bits, doubling the rows available per Row.
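The bit-field layout above can be illustrated with a small plain-Rust sketch (host-side illustration, not Virtual ISA; the function name is hypothetical, and the field widths follow the Full-mode table):
/// Packs a Full-mode TRF element index from its bit fields:
/// [ bank row (3) | row in bank (7) | bank col (1) | offset (6) ].
fn trf_index_full(bank_row: u32, row_in_bank: u32, bank_col: u32, offset: u32) -> u32 {
    assert!(bank_row < 8 && row_in_bank < 128 && bank_col < 2 && offset < 64);
    (bank_row << 14) | (row_in_bank << 7) | (bank_col << 6) | offset
}
In Half mode the row-in-bank field shrinks to 6 bits, and when fewer than 8 Rows are used the spare bank-row bits extend the per-Row address space, as noted in the table.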
Specifications
TRF Address Modes
| Mode | Region | Capacity |
|---|---|---|
| FirstHalf | First half of TRF | register_file_size / 2 |
| SecondHalf | Second half of TRF | register_file_size / 2 |
| Full | Entire TRF | register_file_size |
These address modes partition the TRF so that tensors stored in different regions do not interfere with each other.
Full dedicates the entire TRF to a single tensor.
FirstHalf and SecondHalf isolate up to two tensors, allowing them to coexist in TRF simultaneously (for example, one half can be read by the Sequencer while the other is written by SRAM-to-TRF, enabling double buffering).
The two halves can be flipped between iterations.
Rows
Each Row maps directly to a bank row in TRF. Since bank rows are physically independent, all Rows can read in parallel without contention.
| rows | Description |
|---|---|
| 1 | Single Row (1 bank row used) |
| 2 | 2 Rows (2 bank rows used) |
| 4 | 4 Rows (4 bank rows used) |
| 8 | 8 Rows (all bank rows used) |
Each Row reads the same sequencer pattern from a different TRF offset (different bank row, same row_in_bank/bank_col/offset).
Sequencer Configuration
The TRF Sequencer uses the same nested-loop configuration as all other sequencers (see Sequencer):
| Parameter | Range | Description |
|---|---|---|
| Entries | 1–8 | Each entry is a (size, stride) pair |
| size per entry | 1–65,536 | Iteration count for this dimension |
| stride per entry | signed 32-bit | Address increment per iteration |
The sequencer entries control iteration over the outer part — the dimensions beyond mac_width. The inner part (within reg_read_size) is read contiguously each cycle and is not represented in the sequencer entries. See TRF Read Mechanism for how the inner and outer parts relate.
Axes with stride = 0 are broadcast: the same data is repeated for each iteration of that dimension.
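As an illustration of the nested-loop semantics, this plain-Rust sketch (host-side illustration, not Virtual ISA; the function name is hypothetical) enumerates the byte addresses generated by a list of (size, stride) entries, with entry 0 treated as the innermost loop, as in the compiler-generated configurations shown earlier; a stride of 0 reproduces the broadcast behaviour described above:
/// Enumerates the addresses produced by sequencer entries, entry 0 innermost.
/// Each address is base + sum(index_i * stride_i).
fn sequencer_addresses(base: i64, entries: &[(u64, i64)]) -> Vec<i64> {
    let total: u64 = entries.iter().map(|&(size, _)| size).product();
    (0..total)
        .map(|mut n| {
            let mut addr = base;
            for &(size, stride) in entries {
                addr += (n % size) as i64 * stride; // index along this entry
                n /= size;
            }
            addr
        })
        .collect()
}

// Example: Entry 0 { size: 2, stride: 1024 } and Entry 1 { size: 32, stride: 32 }
// (the configuration from the Aligner batched-matmul example) visit
// 0, 1024, 32, 1056, 64, 1088, ... : O varies fastest, M advances by 32 bytes.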
Performance
TRF Cache
Purpose
The TRF bank columns are shared between read (main context sequencer) and write (sub context StoTRF). A read cache sits between the TRF banks and the Reducer so that cache hits serve reads without occupying a bank — freeing the bank for concurrent StoTRF writes.
Structure
The cache is direct-mapped:
| Parameter | Value |
|---|---|
| Rows per Row | 4 rows × 2 bank columns = 8 entries |
| Entry size | 32 bytes |
| Rows | 8 |
| Total capacity | 8 × 4 × 2 × 32 = 2,048 bytes |
Operation
- First read (cache miss) — data is fetched from the TRF bank and loaded into the cache. The bank is occupied for this cycle.
- Subsequent reads to the same address (cache hit) — data is served from the cache. The bank is not occupied, allowing StoTRF writes to proceed simultaneously.
Bank Conflict Priority
When a cache miss and an StoTRF write target the same bank column in the same cycle, read has higher priority than write. This means frequent cache misses can stall concurrent StoTRF operations.
Impact of reg_read_size
| reg_read_size | Bank columns used per cycle | Cache miss impact |
|---|---|---|
| ≤ 32 bytes | 1 column | A miss on the same bank column as a concurrent StoTRF write will still cause a conflict. However, if the innermost sequencer entry interleaves reads at 32-byte granularity across the two bank columns, then the sequencer and StoTRF alternate columns on successive cycles — avoiding degradation even during misses. |
| 64 bytes | Both columns | A miss occupies both columns simultaneously, blocking StoTRF for that cycle. Write throughput degrades in proportion to the miss rate. |
Examples
Basic Weight Broadcasting (MatMul)
This example shows a matrix multiplication where weights are stored in TRF and broadcast across the M (output row) dimension:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 32];
/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, K / 16], m![K % 16]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]> {
// TRF mapping: [
// Row: [N = 8]: output channels mapped to 8 Rows
// Element: [K = 32]: 32 weight elements stored per Row
// ]
weights.to_trf(TrfAddress::Full)
}
/// Performs matmul contraction (main context).
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K / 16], m![K % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M], m![N]> {
// TRF mapping: [
// Row: [N = 8],
// Element: [K = 32]
// ]
// Collect mapping: [
// Time: [M = 32, K / 16 = 2],
// Packet: [K % 16 = 16]
// ]
// Computation mapping (after collect_flits = 2 collection): [
// Time: [M = 32],
// Row: [N = 8],
// Packet: [K = 32]
// ]
//
// reg_read_size: K = 32 bf16 elements = 64 bytes = mac_width
// → reg_read_size = 64B (no broadcast, full mac_width read each cycle)
// → 64B alignment required: base and strides must be 64-byte aligned
//
// Compiler-generated TRF Sequencer configuration:
// Entry 0: { size: 32, stride: 0 } — M (broadcast, not in TRF)
// 1. M = 32 is broadcast (stride 0): weights reused for each M iteration
// 2. N = 8 maps to Row: each Row reads from its TRF partition
// 3. K axis is contracted, and remaining Time: [M], Row: [N]
// By using column major, outputs Time: [M], Packet: [N]
input.align::<m![M], m![K], _, _>(trf)
.contract::<m![1]>()
.accumulate::<m![M], m![N]>(AccumulationKind::Interleaved)
}
}
Small reg_read_size
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 32, N = 8, K = 16, L = 2, O = 2];
/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, O], m![K]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, K]> {
// TRF mapping: [
// Row: [N = 8]: output channels mapped to 8 Rows
// Element: [O = 2, K = 16]: 2 × 16 = 32 weight elements stored per Row
// ]
weights.to_trf(TrfAddress::FirstHalf)
}
/// Performs matmul contraction (main context).
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![O, M, L], m![K]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![O, M, L], m![N]> {
// TRF mapping: [
// Row: [N = 8],
// Element: [O = 2, K = 16]
// ]
// Computation mapping: [
// Time: [O = 2, M = 32],
// Row: [N = 8],
// Packet: [L = 2, K = 16]
// ]
//
// reg_read_size: K=16 bf16 elements = 32 bytes (= mac_width/2)
// → reg_read_size = 32B (outer size 2 is broadcast)
//
// Compiler-generated TRF Sequencer configuration:
// Entry 0: { size: 32, stride: 0 } — M (broadcast, not in TRF)
// Entry 1: { size: 2, stride: 32 } — O (direct from O)
// 1. M = 32 is broadcast (stride 0): weights reused for each M iteration
// 2. N = 8 maps to Row: each Row reads from its TRF partition
// 3. K axis is contracted, and remaining Time: [O, M], Row: [N], Packet: [L]
// By using column major, outputs Time: [O, M, L], Packet: [N]
input.align::<m![O, M], m![L, K], _, _>(trf)
.contract::<m![L]>()
.accumulate::<m![O, M, L], m![N]>(AccumulationKind::Interleaved)
}
}
TODO: Read (Main) and Write (Sub) to TRF at the same time
Reducer
The Reducer performs elementwise multiplication followed by reduce-add. Each slice’s Reducer contains 8 independent Rows, which are parallel MAC lanes that each process a different weight channel. It receives input data from the Stream Adapter and weight data from the TRF Sequencer.
Interface
The Reducer is invoked via .align() followed by .contract() and .accumulate():
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
impl CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet> {
/// Aligns input stream and TRF to computation mapping.
pub fn align<OutTime: M, OutPacket: M, Row: M, TrfElement: M>(
self,
trf: &TrfTensor<D, Chip, Cluster, Slice, Row, TrfElement>,
) -> AlignedPair<'l, T, D, Chip, Cluster, Slice, Row, OutTime, OutPacket>;
}
impl AlignedPair<'l, T, D, Chip, Cluster, Slice, Row, Time, Packet> {
/// Performs spatial reduction: elementwise multiplication followed by reduce-add
/// across the Packet dimension via the hardware reduction tree.
/// Data type is widened during contraction: i4/i8 -> i32, f8/bf16 -> f32.
pub fn contract<OutPacket: M>(
self,
) -> ContractionTensor<'l, T, OutD, Chip, Cluster, Slice, Row, Time, OutPacket>;
}
impl ContractionTensor<'l, T, D, Chip, Cluster, Slice, Row, Time, Packet> {
/// Performs temporal accumulation: accumulates values over the Time dimension
/// and produces the final contraction output.
pub fn accumulate<OutTime: M, OutPacket: M>(
self, kind: AccumulationKind,
) -> AccumulationTensor<'l, T, D, Chip, Cluster, Slice, OutTime, OutPacket>;
}
The Reducer computes the dot product of input stream \(X\) and TRF weights \(W\):
$$\text{output}[i] = \sum_{j} X[i, j] \times W[i, j]$$
The summation index \(j\) corresponds to axes removed during reduction:
- Spatial reduction removes axes from the Packet dimension via the hardware reduction tree
- Temporal reduction removes axes from the Time dimension via the accumulator buffer
The output mapping is determined by which axes survive reduction: OutPacket contains Packet axes after spatial reduction, and OutTime contains Time axes after temporal reduction.
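A host-side reference for this formula, as a minimal plain-Rust sketch (not Virtual ISA code; the function name is illustrative, and inputs are assumed to be already widened to f32), computes the per-Row dot product for one time step:
/// Reference for one Reducer time step: for each Row i,
/// output[i] = sum over j of X[i, j] * W[i, j], with j ranging over the reduced Packet axes.
fn reduce_step(x: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<f32> {
    x.iter()
        .zip(w)
        .map(|(xi, wi)| xi.iter().zip(wi).map(|(a, b)| a * b).sum())
        .collect()
}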
Examples
Matrix Multiplication
Matrix multiplication with 8 Rows operating in parallel:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 32, B = 32, C = 8];
fn matmul<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![A, B / 16], m![B % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![C], m![B]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![A], m![C]> {
// Computation mapping: [
// Time: [A = 32],
// Row: [C = 8],
// Packet: [B = 32] (32 bf16 elements = 64 bytes)
// ]
//
// Spatial reduction: tree depth 5 reduces 32 bf16 elements along B → f32
// Output (Interleaved): Time = [A], Packet = [C]
input.align::<m![A], m![B], _, _>(&trf)
.contract::<m![1]>()
.accumulate::<m![A], m![C]>(AccumulationKind::Interleaved)
}
}
At each Row, elementwise multiplication of input * trf occurs.
With tree depth 5, reduce-add sums over the 32 bf16 elements of B, producing one f32 per A position.
Full Tensor Reduction
This example demonstrates a complete reduce-add over a tensor m![A] with m![A]::SIZE = 65536, showing how spatial and temporal reduction combine with slice-level reduction.
Mapping:
Slice = m![A / 256], Time = m![A / 32 % 8], Packet = m![A % 32]
Reduction breakdown:
| Stage | Axes Reduced | Mechanism | Cycles |
|---|---|---|---|
| Spatial | A % 32 | Reducer tree (depth 5) | 5 |
| Temporal | A / 32 % 8 | Reducer accumulator (8 iterations) | 8 |
| Slice-level | A / 256 | Inter-Slice Block | 256 |
Analysis:
- Each slice processes
A / 256(256) elements - Within a slice:
A % 32elements are reduced spatially by the tree (5 cycles forbf16) - The temporal axis
A / 32 % 8means 8 flits arrive sequentially, accumulated by the buffer - After in-slice reduction completes (~40 cycles), 256 partial results exist across slices
- The Inter-Slice Block reduces these 256 slice results (256 cycles)
- Total: ~296 cycles for reducing 65536 elements to a single scalar
Architecture
The Reducer consists of 8 independent Rows operating in parallel. Data flows to the Rows from two sources:
- StreamUnit data: Broadcast to all Rows (same data to every row)
- TRF data: Read in parallel from 8 independent Row spaces (TRF Row \(i\) feeds Row \(i\) directly)
Each Row contains a reduction tree for spatial reduction, followed by a shared accumulator buffer for temporal reduction.

The diagram shows data widths at different stages. The 320b/640b corresponds to 64/128 elements for i4/i5, 32/64 elements for i8/f8/i9, 16/32 elements for bf16, and 8/16 elements for f32/i32.
Spatial Reduction
Each Row contains a reduction tree that sums products hierarchically.

At depth 0, each Row multiplies the input stream from the Stream Adapter, with the weight data from the TRF Sequencer (each 64 bytes wide). Each subsequent depth sums pairs of partial products, halving the element count from the previous depth. The tree depth varies by data type to provide sufficient depth for reducing the full data width:
- i4: depth 7 (reduces 128 elements)
- i8/f8: depth 6 (reduces 64 elements)
- bf16: depth 5 (reduces 32 elements)
The output data type is widened to accommodate larger result values from contraction.
With i8 input, i8 * i8 multiplication occurs first, and up to 64 values can be summed across the 6-depth tree.
Inputs i4/i8 produce i32 outputs, and inputs f8/bf16 produce f32 outputs.
Given a computation mapping of m![Row, Time, Packet], spatial reduction eliminates the innermost m![Packet % 2^n] axes (where n is the tree depth), producing an output mapping of m![Row, Time, Packet / 2^n].
Note
Spatial reduction in addition mode allows full 8-Row usage, but max mode only supports a single Row (Row 0).
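The hierarchical pairwise summation can be sketched on the host as follows (plain Rust, not a hardware model; the function name is illustrative, and the depth values follow the list above):
/// Pairwise reduction tree over one group of 2^depth products:
/// each depth halves the element count until a single partial sum remains.
fn tree_reduce(products: &[f32], depth: u32) -> f32 {
    assert_eq!(products.len(), 1usize << depth);
    let mut level = products.to_vec();
    for _ in 0..depth {
        level = level.chunks(2).map(|pair| pair[0] + pair[1]).collect();
    }
    level[0]
}

// bf16 input: one group of 32 products per Row, depth 5, yields a single f32 partial sum.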
Resize
After spatial reduction, the output is resized to exactly 32 i32/f32 elements per Row before being fed to the temporal accumulator.
When the tree depth is 0 (no spatial reduction), the 32 outer elements are truncated.
Otherwise, the spatial reduction output is padded or broadcast to fill the 32 columns of the temporal accumulator, depending on the output mode.
The Reducer supports two output modes that determine how the resize is performed:
- Sequential: Rows are sequentially ordered. The spatial reduction output is padded with zeros.
- Interleaved: Rows are interleaved. The spatial reduction output is repeated across the 32 columns.
The figure below illustrates the output of spatial reduction for various i8 reduction depths.
The left side shows Sequential mode adding zero-padding; the right shows Interleaved mode replicating the output to fill 32 element positions.

Temporal Reduction
After resizing, each Row feeds its output to a shared temporal accumulator.
The temporal accumulator stores intermediate results in a buffer and accumulates values that arrive sequentially over time, enabling reduce operations even when the reduce axis is not contiguous in the innermost dimension.
The buffer has 1024 slots total: 8 rows × 32 columns × 4 registers/column.
Consider axes![A = 2048, B = 8] and a tensor with mapping m![A, B], where we want to reduce along axis B.
With mapping Time = m![B / 4, A % 8] and Packet = m![B % 4], the spatial reduction stage outputs 16 flits (since Time::SIZE = m![B / 4, A % 8]::SIZE = 2 * 8 = 16).
The accumulator uses 8 buffer slots (one per A % 8 value) to accumulate across the B / 4 (2) iterations:
| flit # | B / 4 | A % 8 | Buffer Slot | Operation |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | Store |
| 1 | 0 | 1 | 1 | Store |
| 2 | 0 | 2 | 2 | Store |
| 3 | 0 | 3 | 3 | Store |
| 4 | 0 | 4 | 4 | Store |
| 5 | 0 | 5 | 5 | Store |
| 6 | 0 | 6 | 6 | Store |
| 7 | 0 | 7 | 7 | Store |
| 8 | 1 | 0 | 0 | Accumulate with flit #0 |
| 9 | 1 | 1 | 1 | Accumulate with flit #1 |
| 10 | 1 | 2 | 2 | Accumulate with flit #2 |
| 11 | 1 | 3 | 3 | Accumulate with flit #3 |
| 12 | 1 | 4 | 4 | Accumulate with flit #4 |
| 13 | 1 | 5 | 5 | Accumulate with flit #5 |
| 14 | 1 | 6 | 6 | Accumulate with flit #6 |
| 15 | 1 | 7 | 7 | Accumulate with flit #7, then output |
The first 8 flits are stored in buffer slots 0-7. When flits 8-15 arrive, they accumulate with the stored values. After flit 15, the buffer contains the final reduced results and outputs them.
For buffered reduction to work, the product of all axis sizes inner to the reduce axis must be at most 1024, in order to fit the accumulator buffer.
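The store-then-accumulate behaviour in the table can be mimicked with a plain-Rust sketch (host-side illustration, not a hardware model; the function name is hypothetical, and slot indexing follows the A % 8 / B / 4 example above):
/// Mimics the temporal accumulator for the example above: flits arrive in order
/// (B / 4 outer, A % 8 inner); the first pass stores, later passes accumulate.
fn accumulate_flits(flits: &[[f32; 32]], inner_slots: usize) -> Vec<[f32; 32]> {
    let mut buffer = vec![[0.0f32; 32]; inner_slots];
    for (i, flit) in flits.iter().enumerate() {
        let slot = i % inner_slots; // A % 8 selects the buffer slot
        for (acc, x) in buffer[slot].iter_mut().zip(flit) {
            if i < inner_slots {
                *acc = *x; // first B / 4 iteration: store
            } else {
                *acc += *x; // later iterations: accumulate
            }
        }
    }
    buffer // after the last flit, the buffer holds the reduced results
}
With 16 flits and inner_slots = 8, flits 0–7 are stored and flits 8–15 accumulate, matching the table.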
The temporal accumulator supports two operation modes: Sequential and Interleaved.
Interleaved provides a greater buffer capacity of 128 for the axes inner to the reduce axis, compared to the Sequential 32-element capacity. However, Interleaved changes the output packet structure. Choose the mode based on buffer constraints and whether the desired output ordering matches downstream requirements. See Constraints for the full buffer capacity rules.
Interleaved Mode
In Interleaved mode, the Reducer outputs data element-by-element across all Rows. The output bus carries one value from each of the 8 Rows, per beat.
- Packet Slicing: In Interleaved mode, not all of Packet is fed to the accumulator. Since the reduction tree broadcasts \(m\) partial sums across all 32 column positions (via replication), only the first \(m\) columns get written to accumulator entries, slicing Packet from 32 down to \(m\).
Note
User-specified slicing should only slice padded Packet axes.
- Column Interleaving: To achieve maximum accumulator utilization, all of the 32 accumulator columns are filled by interleaving \(\frac{32}{m}\) column groups over successive cycles. For \(m = 4\), the first cycle writes to columns 0–3, the next to columns 4–7, and so on, giving 8 interleave steps to fill all 32 columns.
- Full Row Utilization: Additionally, all 8 accumulator rows are always active regardless of the actual input Row count: if Row < 8, the data is padded to occupy all 8 Rows.
- Output: OutTime: m![Time', Packet / 2^n = m], OutPacket: m![Row # 8]. OutTime preserves the order of Time, Packet, but with some axes from Time removed. The removed axes undergo reduce-add, yielding Time'. OutPacket equals Row padded with dummies to align to 8, as all Rows are utilized.
Note
Interleaved mode has reduced accumulator utilization when Row < 8: only Row out of 8 rows store meaningful data, while the output bus always sends all 8 Rows together. Effective accumulator capacity is Row × 32 × 4 instead of the full 8 × 32 × 4 = 1024 slots. This limitation is most severe at Row = 1 (128 useful slots), but applies to Row = 2 and Row = 4 as well.
Example
This example performs a contraction where K is partially reduced spatially (K % 4 in Packet) and temporally (K / 16 in Time), with K % 16 / 4 surviving in the output:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, N = 8, K = 64];
fn interleaved<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![K / 16, M], m![K % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K % 16 / 4], m![N]> {
// Computation mapping:
// Time: [K / 16, M], Row: [N], Packet: [K % 16]
// 16 bf16 elements per packet = 32 bytes
//
// Spatial: tree depth 2 reduces groups of 4 bf16 -> 1 f32,
// leaving K % 16 / 4 (4) columns
// Temporal: K / 16 (4) iterations accumulated in buffer
//
// Interleaved output:
// m = 4 valid columns, 16 / 4 = 4 column groups interleaved
// OutTime = [M, K % 16 / 4] (K / 16 reduced, surviving Packet appended)
// OutPacket = [N] (Row, already 8)
input.align::<m![K / 16, M], m![K % 16 # 32], _, _>(&trf)
.contract::<m![K % 16 / 4]>()
.accumulate::<m![M, K % 16 / 4], m![N]>(AccumulationKind::Interleaved)
}
}
The axes inner to the reduce axis (K / 16) are M and K % 16 / 4, with a total size of 4 × 4 = 16.
This satisfies the Interleaved buffer constraint (≤ 128).
Sequential Mode
In Sequential mode, the Reducer outputs the reduced data in each Row sequentially.
The output bus carries up to 8 elements from Packet, per beat.
- Full Packet Utilization: In Sequential mode, all 32 columns of Packet / 2^n are fed to the accumulator. Unlike in Interleaved mode, no packet slicing occurs. Each cycle writes all 32 columns simultaneously, with zeros padding any unused positions.
- Row Interleaving: To achieve maximum accumulator utilization, all 8 accumulator rows are filled by interleaving \(\frac{8}{\texttt{Row}}\) row groups over successive cycles. With Row::SIZE = 4, the 4 input Rows are stored in accumulator rows 0–3 in the first cycle and in rows 4–7 in the next.
- Output: OutTime: m![Time', Row, Packet_outer], OutPacket: m![Packet_inner]. OutTime preserves the order of Time, Row, but with some axes from Time removed. The removed axes undergo reduce-add, yielding Time'. Since the output bus is 8 elements wide, only multiples of 8 elements (8, 16, 24, or 32) can be output, so Packet is split accordingly: Packet_inner holds the innermost elements padded up to a multiple of 8, and Packet_outer holds the remaining outer portion.
Example
The same computation mapping as the Interleaved example above, but with Sequential output:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, N = 8, K = 64];
fn sequential<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![K / 16, M], m![K % 16]>,
trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, N], m![K % 16 / 4 # 8]> {
// Computation mapping:
// Time: [K / 16, M], Row: [N], Packet: [K % 16]
// 16 bf16 elements per packet = 32 bytes
//
// Spatial: tree depth 2 reduces groups of 4 bf16 -> 1 f32,
// leaving K % 16 / 4 (4) columns
// Temporal: K / 16 (4) iterations accumulated in buffer
//
// Sequential output:
// Packet'' = K % 16 / 4, padded to 8:
// - Packet_outer = [1],
// - Packet_inner = [K % 16 / 4 # 8]
// OutTime = [M, N] (K / 16 reduced, Row appended)
// OutPacket = [K % 16 / 4 # 8] (surviving Packet padded to 8)
input.align::<m![K / 16, M], m![K % 16 # 32], _, _>(&trf)
.contract::<m![K % 16 / 4]>()
.accumulate::<m![M, N], m![K % 16 / 4 # 8]>(AccumulationKind::Sequential)
}
}
The axes inner to the reduce axis (K / 16) are M and N, with a total size of 4 × 8 = 32.
This satisfies the Sequential buffer constraint (≤ 32).
Note
Sequential mode has reduced accumulator utilization when Packet is spatially reduced: only Packet / 2^n out of 32 elements store meaningful data per Row. Effective accumulator capacity drops to as little as 8 × 1 × 4 = 32 slots instead of the full 1024. This limitation applies whenever the non-padded portion of Packet / 2^n is fewer than 32 elements.
Constraints
- Row count: The hardware provides exactly 8 Rows. Operations can use 1, 2, 4, or 8 rows, but the Row dimension size must match one of these values.
- Tree depth: Determines how many elements can be reduced spatially. Depth 7 for i4 (128 elements), depth 6 for i8/f8 (64 elements), depth 5 for bf16 (32 elements). The input packet size must not exceed the maximum elements reducible at the given depth.
- Spatial output limit: For a tree depth of 0 (no spatial reduction), the Reducer outputs at most 32 i32/f32 elements. Configurations that would produce more than 32 output elements per cycle are invalid.
- Data types: Input types must be i4, i8, f8, or bf16. Output types are automatically widened to i32 (from i4/i8) or f32 (from f8/bf16). The type widening is mandatory.
- Reduce-max: Only supports using a single Row (Row 0), limiting reduce-max throughput to 1/8th of reduce-add capacity.
- Buffer capacity: The accumulator has 1024 buffer slots (8 rows × 32 columns × 4 registers/column). The product of axes inner to the outermost reduce axis must fit within this capacity.
  - Interleaved constraints: Requires axes inner to outermost reduce in OutTime to be at most 128. Full constraint: align_up(Row, 8) * (axes inner to reduce) ≤ 1024.
  - Interleaved utilization: When Row < 8, effective capacity is reduced to Row × 32 × 4 slots, preventing full buffer utilization.
  - Sequential constraints: Requires axes inner to outermost reduce in OutTime to be at most 32. Full constraint: align_up(reduced_packet.len(), 32) * (axes inner to reduce) ≤ 1024.
  - Sequential utilization: When Packet is reduced, full buffer utilization cannot be achieved. For instance, when reducing Packet totally, only one column of each Row is used, wasting 31/32 of the buffer capacity.
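The capacity rules above can be checked mechanically. The following plain-Rust sketch is an illustration of the stated formulas, not an SDK API (the function names are hypothetical); it validates a configuration against the Interleaved and Sequential constraints:
fn align_up(value: usize, to: usize) -> usize {
    value.div_ceil(to) * to
}

/// `inner_product` is the product of axis sizes inner to the outermost reduce axis;
/// `reduced_packet_len` is the surviving Packet length after spatial reduction.
fn fits_interleaved(rows: usize, inner_product: usize) -> bool {
    inner_product <= 128 && align_up(rows, 8) * inner_product <= 1024
}

fn fits_sequential(reduced_packet_len: usize, inner_product: usize) -> bool {
    inner_product <= 32 && align_up(reduced_packet_len, 32) * inner_product <= 1024
}

fn main() {
    // Interleaved example above: inner axes M (4) and K % 16 / 4 (4), 8 Rows.
    assert!(fits_interleaved(8, 4 * 4));
    // Sequential example above: surviving Packet of 4, inner axes M (4) and N (8).
    assert!(fits_sequential(4, 4 * 8));
}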
Performance
- Spatial latency: Tree depth determines spatial reduction latency: i4 depth 7 (128 elements in 7 cycles), i8/f8 depth 6 (64 elements in 6 cycles), bf16 depth 5 (32 elements in 5 cycles). Shallower trees complete faster; larger data types need less depth because fewer elements fit in the 64-byte packet.
- Temporal latency: Each accumulation cycle processes one packet. For a reduction axis of size N in the time dimension, the accumulator requires approximately N cycles to complete the reduction.
- Parallelism: Using all 8 Rows maximizes throughput. Each Row operates independently, so 8 rows achieve 8× parallelism compared to a single row.
- Type widening: Output data types are widened to prevent overflow (i4/i8 → i32, f8/bf16 → f32). This widening is automatic and adds minimal latency, but downstream components must handle 32-bit data.
- Reduce-max: Only supports single Row (Row 0) usage, limiting parallelism to 1/8th of reduce-add throughput.
- Truncation: When tree depth is 0, the Reducer can output at most 32 elements spatially. Larger packets are truncated.
- Pipeline integration: The Reducer sits between Stream Adapter/TRF Sequencer and Vector Engine, adding latency proportional to tree depth plus time dimension size.
Vector Engine
The Vector Engine applies element-wise operations: activations such as GELU and SiLU, normalizations such as softmax and layer norm, and binary operations. It is used both after the Contraction Engine (to post-process f32/i32 accumulator results) and independently for element-wise kernels that skip contraction entirely.
The Vector Engine operates exclusively on i32 and f32 data types. Data moves in 32-byte units called flits, each containing eight 32-bit values. This 32-bit restriction exists because lower-precision data is widened before or during computation: bf16 products accumulate in f32, and i8 products accumulate in i32.
The Vector Engine sits between the Contraction Engine and the Cast Engine in the Tensor Unit pipeline:
Fetch -> Switch -> Collect -> Contraction -> Vector -> Cast -> Transpose -> Commit
                      |                        ^
                      +------------------------+
                          (skip contraction)
Data enters the Vector Engine from one of two sources:
- the Collect Engine, when the Contraction Engine is skipped
- the Contraction Engine, when contraction produces the Vector Engine's input
Interface
/// Initializes Vector Engine processing for this tensor.
#[primitive(CollectTensor::vector_init)]
pub fn vector_init(self) -> VectorInitTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
where
D: VeScalar,
#[primitive(VectorInitTensor::vector_intra_slice_branch)]
pub fn vector_intra_slice_branch(
self,
branch: BranchMode,
) -> VectorBranchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet, D, NoTensor, { VeOrder::IntraFirst }> {
#[primitive(VectorInitTensor::vector_intra_slice_unzip)]
pub fn vector_intra_slice_unzip<I: AxisName, TileTime: M, SplitTime: M>(
self,
) -> VectorTensorPair<'l, T, D, stage::Branch, Chip, Cluster, Slice, SplitTime, Packet> {
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpI32,
) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
The same vector_init() entry point is available regardless of whether the input comes from the Collect Engine (when contraction is skipped) or the Contraction Engine (for post-contraction processing).
After vector_init(), choose the first block by calling either vector_intra_slice_branch(...), vector_intra_slice_unzip(...), or vector_inter_slice_reduce(...).
For detailed stage-by-stage API coverage, see Intra-Slice Block and Inter-Slice Block.
Quick Reference
| Block | How to Reach It | Use It For | Output |
|---|---|---|---|
| Intra-Slice Block | Start with vector_init(), then call vector_intra_slice_branch() | Elementwise ops, binary ops, intra-slice reduce | Chain stages, then vector_final() |
| Inter-Slice Block | Either call vector_init() -> vector_inter_slice_reduce() first, or switch from an eligible intra-slice tensor with vector_inter_slice_reduce() | Reduction across the 256 slices in a cluster | vector_inter_slice_reduce(), then optional intra-slice work or vector_final() |
| Two-group intra-slice mode | Start with vector_init(), then call vector_intra_slice_unzip() | Process two interleaved groups before combining them | _zip to merge, then vector_final() |
Examples
ReLU Activation
Applying ReLU activation (max(x, 0)) after matrix multiplication:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 128, N = 256, K = 64];
fn relu<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_clip(ClipBinaryOpF32::Max, 0.0f32)
.vector_final()
}
}
Inter-Slice Reduce
Reducing a tensor across slices:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn inter_slice_reduce<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
Ordering
| Order | Flow | Typical Use |
|---|---|---|
IntraFirst | Intra-Slice Block -> optional Inter-Slice Block | Post-process each slice, then reduce across slices |
InterFirst | Inter-Slice Block -> optional Intra-Slice Block | Reduce first, then apply elementwise post-processing |
The examples above show one concrete IntraFirst path and one concrete InterFirst path.
Constraints
When using i8 or bf16 input without the Contraction Engine, widening must still fit within one 32-byte flit. This limits how much data the Fetch Engine can supply per flit after type conversion. See Fetch Engine: Type Casting Constraints.
Intra-Slice Block
The Intra-Slice Block performs elementwise, binary, and intra-slice reduce operations on tensor data.
After the Contraction Engine completes matrix multiplication, the Intra-Slice Block applies activation functions, normalization, and other elementwise transformations to produce the final result.
For example, computing sigmoid(X * W + b) requires the Contraction Engine for X * W, then the Intra-Slice Block for addition and sigmoid activation.
Interface
/// Initializes Vector Engine processing for this tensor.
#[primitive(CollectTensor::vector_init)]
pub fn vector_init(self) -> VectorInitTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
where
D: VeScalar,
#[primitive(VectorInitTensor::vector_intra_slice_branch)]
pub fn vector_intra_slice_branch(
self,
branch: BranchMode,
) -> VectorBranchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet, D, NoTensor, { VeOrder::IntraFirst }> {
#[primitive(VectorInitTensor::vector_intra_slice_unzip)]
pub fn vector_intra_slice_unzip<I: AxisName, TileTime: M, SplitTime: M>(
self,
) -> VectorTensorPair<'l, T, D, stage::Branch, Chip, Cluster, Slice, SplitTime, Packet> {
The same vector_init() entry point is available regardless of whether the input comes from the Collect Engine (when contraction is skipped) or the Contraction Engine (for post-contraction processing).
After vector_init(), enter the intra-slice block with either vector_intra_slice_branch(...) or vector_intra_slice_unzip(...).
For the paired path entered through vector_intra_slice_unzip(...), see Two-Group Mode.
After entry, operations are chained stage by stage:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn staged_pipeline<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_fxp_to_fp(31)
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_fp_unary(FpUnaryOp::Sigmoid)
.vector_pad_way8::<m![A % 2 # 8]>()
.vector_fp_to_fxp(31)
.vector_clip(ClipBinaryOpI32::Max, 0)
.vector_final()
}
}
Each method corresponds to a hardware pipeline stage. The type system enforces valid stage transitions at compile time. For example, vector_fp_unary is only available after the pipeline has been narrowed with vector_split or vector_trim_way4.
Architecture
flowchart TD
classDef way8 fill:#e8f5e9,stroke:#2e7d32
classDef way4 fill:#e3f2fd,stroke:#1565c0
classDef conv fill:#fff3e0,stroke:#e65100
classDef ctrl fill:#f3e5f5,stroke:#6a1b9a
classDef vrf fill:#fce4ec,stroke:#880e4f
Entry["vector_intra_slice_branch(BranchMode)"]:::ctrl
VRF_L["VRF"]:::vrf
Logic["Logic Cluster<br><code>vector_logic()</code>"]:::way8
VRF_L -. "operand" .-> Logic
Entry --> Logic
Fxp["Fxp Cluster<br><code>vector_fxp()</code>"]:::way8
VRF_R1["VRF"]:::vrf
Fxp -. "operand" .- VRF_R1
Logic --> Fxp
FxpToFp["FxpToFp<br><code>vector_fxp_to_fp()</code>"]:::conv
Fxp --> FxpToFp
Narrow["Narrow Stage<br><code>vector_split() / vector_trim_way4()</code>"]:::conv
FxpToFp --> Narrow
VRF_L2["VRF"]:::vrf
Fp["Float Cluster<br><code>vector_fp_unary/binary/ternary()</code>"]:::way4
VRF_L2 -. "operand" .-> Fp
Narrow --> Fp
Reduce["IntraSliceReduce Stage<br><code>vector_intra_slice_reduce()</code>"]:::way4
Fp --> Reduce
FpDiv["FpDiv<br><code>vector_fp_div()</code>"]:::way4
VRF_R2["VRF"]:::vrf
FpDiv -. "operand" .- VRF_R2
Reduce --> FpDiv
Widen["Widen Stage<br><code>vector_concat() / vector_pad_way8()</code>"]:::conv
FpDiv --> Widen
FpToFxp["FpToFxp<br><code>vector_fp_to_fxp()</code>"]:::conv
Widen --> FpToFxp
VRF_L3["VRF"]:::vrf
Clip["Clip Cluster<br><code>vector_clip()</code>"]:::way8
VRF_L3 -. "operand" .-> Clip
FpToFxp --> Clip
Exit["vector_final()"]:::ctrl
Clip --> Exit
Quick Reference
Entry and Transition
Before the stage-by-stage table, it helps to separate the ways you can enter or resume the intra-slice block:
| Current state | Method | Result |
|---|---|---|
Fresh VE input after vector_init() | vector_intra_slice_branch(BranchMode) | Enters the single-stream intra-slice path |
Fresh VE input after vector_init() | vector_intra_slice_unzip() | Enters the two-group intra-slice path |
Tensor after vector_inter_slice_reduce() | vector_intra_slice_branch(BranchMode) | Continues with intra-slice work after inter-slice reduction |
The stage table below describes the single-stream path after vector_intra_slice_branch().
For the paired path after vector_intra_slice_unzip(), see Two-Group Mode.
Stages
Every stage is optional; you can skip directly from Branch to any downstream stage.
Recall from Vector Engine that Way8 processes 8 elements per cycle and Way4 processes 4.
The ALU column is shown only where the API exposes multiple competing ALUs inside one stage. Stages such as FxpToFp, Narrow, IntraSliceReduce, FpDiv, Widen, FpToFxp, Filter, and Output do not require a user-visible ALU choice here.
| Stage | Method | Data Type | Mode | ALUs | Notes |
|---|---|---|---|---|---|
| Branch | vector_intra_slice_branch(BranchMode) | i32, f32 | Way8 | | Single-stream entry after vector_init(), or continuation after vector_inter_slice_reduce() |
| Logic | vector_logic(op, operand) | i32, f32 | Way8 | LogicAnd, LogicOr, LogicXor, LogicLshift, LogicRshift | |
| Fxp | vector_fxp(op, operand) | i32 | Way8 | FxpAdd, FxpLshift, FxpMul, FxpRshift | |
| FxpToFp | vector_fxp_to_fp(int_width) | i32 → f32 | Way8 | | |
| Narrow | vector_split() / vector_trim_way4() | f32 | Way8 → Way4 | | |
| Float | vector_fp_unary/binary/ternary(op, ...) | f32 | Way4 | FpFma, FpFpu, FpExp, FpMul0, FpMul1 | |
| IntraSliceReduce | vector_intra_slice_reduce(op) | i32, f32 | Way4 | | |
| FpDiv | vector_fp_div(op, operand) | f32 | Way4 | | |
| Widen | vector_concat() / vector_pad_way8() | f32 | Way4 → Way8 | | |
| FpToFxp | vector_fp_to_fxp(int_width) | f32 → i32 | Way8 | | |
| Clip | vector_clip(op, operand) | i32, f32 | Way8 | ClipAdd, ClipMax, ClipMin | |
| Filter | vector_filter(mode) | i32, f32 | Way8 | | |
| Output | vector_final() | i32, f32 | Way8 | | |
vector_intra_slice_branch() is both the initial single-stream intra-slice entry after vector_init() and the continuation point after vector_inter_slice_reduce().
vector_intra_slice_unzip() is only available directly from vector_init().
Within a stage, each ALU can only be used once per pass. This matters mainly in Logic, Fxp, Fp, and Clip, where multiple operators share a stage-local ALU pool. For example, tanh(sqrt(x)) is impossible in a single pass because both tanh and sqrt require the FpFpu ALU. Such operations require multiple Tensor Unit invocations with intermediate results stored in DM or TRF.
vector_stash() is not a pipeline stage. It can be called at any Stashable point in the chain to snapshot the current tensor for later use as an operand. See Stash for details.
Examples
i32 Pipeline Example
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn add_constant<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_final()
}
}
f32 Pipeline Example
In this example, vector_trim_way4() is the Narrow step: it changes the tensor from Way8 to Way4 before the float operation.
Later, vector_pad_way8() is the Widen step: it changes the tensor from Way4 back to Way8 after the float operation.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn sigmoid<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_trim_way4::<m![A % 2 # 4]>() // Narrow: Way8 -> Way4
.vector_fp_unary(FpUnaryOp::Sigmoid)
.vector_pad_way8::<m![A % 2 # 8]>() // Widen: Way4 -> Way8
.vector_final()
}
}
For stash usage, see Stash.
Stage Details
Branch (vector_intra_slice_branch)
Enters the pipeline and configures conditional execution via BranchMode.
Each 32-bit element in a flit is assigned a 4-bit ExecutionId (0-15) that determines which operations to apply.
| Mode | Description |
|---|---|
Unconditional | All elements get ExecutionId 0 |
AxisToggle { axis } | Toggle group based on axis index (group_id = axis_index % 2) |
ValidCount | Via Valid Count Generator |
Comparison([InputCmp; 4]) | Set branch bits via comparison operations on input values |
Vrf | Load ExecutionIds from VRF (pre-written by Branch Logger in prior TuExec) |
Logic Cluster (vector_logic)
Bitwise operations on i32 or f32 (bit-level). Requires Way8 mode.
This stage has multiple ALUs, so operator choice matters for fusion: LogicAnd, LogicOr, LogicXor, LogicLshift, and LogicRshift can each be used at most once per pass.
i32 operations:
| Op | ALU | Note |
|---|---|---|
BitAnd | LogicAnd | bitwise and |
BitOr | LogicOr | bitwise or |
BitXor | LogicXor | bitwise xor |
LeftShift | LogicLshift | logical left shift |
LogicRightShift | LogicRshift | logical right shift |
ArithRightShift | LogicRshift | arithmetic right shift |
f32 operations:
| Op | ALU | Note |
|---|---|---|
BitAnd | LogicAnd | bitwise and on fp bit patterns |
BitOr | LogicOr | bitwise or on fp bit patterns |
BitXor | LogicXor | bitwise xor on fp bit patterns |
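As a sketch of how a Logic-stage op slots into the single-stream chain, the following masks the low byte of each i32 element. The shapes mirror the i32 pipeline example in Examples; the operator enum name LogicBinaryOpI32 is an assumption inferred from the naming of the other operator enums, while the op and ALU names come from the table above.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
// Illustrative sketch: keep only the low 8 bits of every element.
// LogicBinaryOpI32 is an assumed enum name; BitAnd maps to the LogicAnd ALU.
fn mask_low_byte<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_logic(LogicBinaryOpI32::BitAnd, 0xFF) // keep bits 0-7 of each element
        .vector_final()
}
}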
Fxp Cluster (vector_fxp)
Integer and fixed-point arithmetic on i32. Requires Way8 mode.
This stage has four reusable ALU classes: FxpAdd, FxpLshift, FxpMul, and FxpRshift. Operators sharing the same class cannot be fused in one pass.
| Op | ALU | Note |
|---|---|---|
AddFxp | FxpAdd | wrapping add |
AddFxpSat | FxpAdd | saturating add |
SubFxp | FxpAdd | wrapping subtract |
SubFxpSat | FxpAdd | saturating subtract |
LeftShift | FxpLshift | logical left shift |
LeftShiftSat | FxpLshift | saturating left shift |
MulFxp | FxpMul | fixed-point multiply |
MulInt | FxpMul | integer multiply |
LogicRightShift | FxpRshift | logical right shift |
ArithRightShift | FxpRshift | arithmetic right shift |
ArithRightShiftRound | FxpRshift | arithmetic right shift with rounding |
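Because FxpMul and FxpAdd are distinct ALU classes, an integer scale-and-bias fuses into a single pass. A minimal sketch, reusing the shapes of the i32 pipeline example in Examples (the function name is illustrative):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn scale_and_bias<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_fxp(FxpBinaryOp::MulInt, 3)   // FxpMul: x * 3
        .vector_fxp(FxpBinaryOp::AddFxp, 100) // FxpAdd: x * 3 + 100
        .vector_final()
}
}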
FxpToFp Conversion (vector_fxp_to_fp)
Converts i32 to f32. The int_width parameter specifies the integer bit width for the conversion.
| Method | Effect |
|---|---|
vector_fxp_to_fp(int_width) | convert i32 stream to f32 |
Narrow (vector_split, vector_trim_way4)
Way8 and Way4 are the two packet modes of the intra-slice pipeline.
In Way8, one packet carries 8 active lanes (Packet = m![... # 8]).
In Way4, one packet carries 4 active lanes (Packet = m![... # 4]).
Narrow switches the pipeline from Way8 to Way4; the floating-point and intra-slice reduce stages then run in Way4; Widen switches back to Way8.
This usually halves throughput for the float / reduce path, because the same logical tensor shape now takes twice as many packets or passes.
| Method | Use When | Effect |
|---|---|---|
vector_split() | both halves contain real data | split one 8-way flit into two 4-way packets, updating Time and Packet |
vector_trim_way4() | upper 4 lanes are already padding or irrelevant | keep only the lower 4 lanes |
Shape semantics:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![S = 64, A = 512];
fn split_semantics<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![S # 16 % 4], m![A % 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorNarrowTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![S # 16 % 4, A / 4 % 2], m![A % 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_split::<m![S # 16 % 4, A / 4 % 2], m![A % 4]>()
// shape semantics: [T], [P] -> [T, P / 2], [P % 4]
}
fn trim_way4_semantics<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorNarrowTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_trim_way4::<m![A % 2 # 4]>()
// shape semantics: [T], [P] -> [T], [P = 4]
}
}
Float Cluster (vector_fp_unary, vector_fp_binary, vector_fp_ternary)
Floating-point operations on f32. Requires Way4 mode.
That is, the input must already have passed through Narrow, so each packet carries 4 active lanes rather than 8.
This is the stage where ALU planning matters most. It exposes five independent ALUs, FpFma, FpFpu, FpExp, FpMul0, and FpMul1, and each can be used once per pass.
Unary ops:
| Op | ALU | Note |
|---|---|---|
Exp | FpExp | exponential |
NegExp | FpExp | negative exponential |
Sqrt | FpFpu | square root |
Tanh | FpFpu | hyperbolic tangent |
Sigmoid | FpFpu | sigmoid |
Erf | FpFpu | error function |
Log | FpFpu | natural logarithm |
Sin | FpFpu | sine |
Cos | FpFpu | cosine |
Binary ops:
| Op | ALU | Note |
|---|---|---|
AddF | FpFma | floating-point add |
SubF | FpFma | floating-point subtract |
MulF(FpMulAlu::Mul0) | FpMul0 | multiply using mul lane 0 |
MulF(FpMulAlu::Mul1) | FpMul1 | multiply using mul lane 1 |
MaskMulF(FpMulAlu::Mul0) | FpMul0 | masked multiply |
MaskMulF(FpMulAlu::Mul1) | FpMul1 | masked multiply |
DivF | FpFpu | division inside Fp stage |
Ternary ops:
| Op | ALU | Note |
|---|---|---|
FmaF | FpFma | fused multiply-add |
MaskFmaF | FpFma | masked fused multiply-add |
Example: To compute exp(sqrt(((x + 1) * 2) * 3)):
- x1 = x + 1 via FpFma (FpBinaryOp::AddF)
- x2 = x1 * 2 via FpMul0 (FpBinaryOp::MulF(FpMulAlu::Mul0))
- x3 = x2 * 3 via FpMul1 (FpBinaryOp::MulF(FpMulAlu::Mul1))
- x4 = sqrt(x3) via FpFpu (FpUnaryOp::Sqrt)
- x5 = exp(x4) via FpExp (FpUnaryOp::Exp)
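The same chain written as code, a hedged sketch that reuses the shapes of the f32 pipeline example and assumes Float-stage calls chain within one pass the same way the vector_fxp calls do in the ALU Conflict Example:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn exp_sqrt_chain<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_trim_way4::<m![A % 2 # 4]>()                        // Narrow: Way8 -> Way4
        .vector_fp_binary(FpBinaryOp::AddF, 1.0f32)                 // FpFma:  x + 1
        .vector_fp_binary(FpBinaryOp::MulF(FpMulAlu::Mul0), 2.0f32) // FpMul0: * 2
        .vector_fp_binary(FpBinaryOp::MulF(FpMulAlu::Mul1), 3.0f32) // FpMul1: * 3
        .vector_fp_unary(FpUnaryOp::Sqrt)                           // FpFpu:  sqrt
        .vector_fp_unary(FpUnaryOp::Exp)                            // FpExp:  exp
        .vector_pad_way8::<m![A % 2 # 8]>()                         // Widen: Way4 -> Way8
        .vector_final()
}
}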
IntraSliceReduce (vector_intra_slice_reduce)
Reduces axes within a single slice. Requires Way4 mode. This stage uses a dedicated reduction resource rather than a user-selectable ALU set.
| Data Type | Supported Ops |
|---|---|
i32 | AddSat, Max, Min |
f32 | Add, Max, Min |
See Intra-Slice Reduce for details.
FpDiv (vector_fp_div)
Floating-point division. Requires Way4 mode. The public API exposes only division here, so there is no operator-level ALU choice to plan in normal use.
| Op | Note |
|---|---|
FpDivBinaryOp::DivF | dedicated division stage after IntraSliceReduce |
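A minimal sketch of the division stage used on its own, dividing each element by a constant. The shapes mirror the sigmoid example in Examples; the function name is illustrative.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn scale_down<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_trim_way4::<m![A % 2 # 4]>()          // Narrow: Way8 -> Way4
        .vector_fp_div(FpDivBinaryOp::DivF, 255.0f32) // x / 255
        .vector_pad_way8::<m![A % 2 # 8]>()           // Widen: Way4 -> Way8
        .vector_final()
}
}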
Widen (vector_concat, vector_pad_way8)
These APIs enter the Widen stage and transition from Way4 back to Way8.
After Widen, later stages such as FpToFxp, Clip, Filter, and Output see 8-lane packets again.
| Method | Use When | Effect |
|---|---|---|
vector_concat() | reversing a prior vector_split() | merge two 4-way packets back into one 8-way flit |
vector_pad_way8() | reversing a prior vector_trim_way4() | pad a 4-way packet back to 8 lanes with invalid elements |
Shape semantics:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![S = 64, A = 512];
fn concat_semantics<'l, const T: Tu>(
input: VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![A / 4 % 2], m![A % 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorWidenTensor<'l, T, i32, m![1], m![1 # 2], m![S # 16 / 4], m![1], m![A % 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_concat::<m![1], m![A % 8]>()
// shape semantics: [T, P / 2], [P % 4] -> [T], [P]
}
fn pad_way8_semantics<'l, const T: Tu>(
input: VectorFpTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorWidenTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input.vector_pad_way8::<m![A % 2 # 8]>()
// shape semantics: [T], [P] -> [T], [P # 8]
}
}
FpToFxp Conversion (vector_fp_to_fxp)
Converts f32 back to i32. The int_width parameter specifies the integer bit width.
| Method | Effect |
|---|---|
vector_fp_to_fxp(int_width) | convert f32 stream back to i32 |
Clip Cluster (vector_clip)
Clamping and comparison operations. Requires Way8 mode.
This stage exposes three ALU classes, ClipAdd, ClipMax, and ClipMin, and each can be used once per pass.
i32 operations:
| Op | ALU | Note |
|---|---|---|
Min | ClipMin | minimum |
Max | ClipMax | maximum |
AbsMin | ClipMin | absolute minimum |
AbsMax | ClipMax | absolute maximum |
AddFxp | ClipAdd | wrapping add |
AddFxpSat | ClipAdd | saturating add |
f32 operations:
| Op | ALU | Note |
|---|---|---|
Min | ClipMin | minimum |
Max | ClipMax | maximum |
AbsMin | ClipMin | absolute minimum |
AbsMax | ClipMax | absolute maximum |
Add | ClipAdd | floating-point add |
Filter (vector_filter)
Applies an execution mask based on a branch condition to filter output flits.
Output (vector_final)
Exits the Vector Engine pipeline. The result can continue to the Cast Engine, Transpose Engine, or Commit Engine.
Stash (vector_stash)
Saves the current tensor data for later use as an operand via the Stash marker.
Key points:
- vector_stash() snapshots the current tensor for later use. Later binary or ternary ops can read it as Stash.
- The stash is typed. An f32 stash can be consumed only by later f32 ops, and an i32 stash only by later i32 ops.
- The stash follows the current tensor mapping. When it is read later, the implementation reinterprets or transposes it to the current mapping as needed.
- It is available only at Stashable stages: Branch, Logic, Fxp, Narrow, Fp, FpDiv, and Clip.
- It is not available after a binary op consumes the stash.
- It is a single slot per pass. The type system exposes at most one live stash in the chain.
See also:
- Operands, for how Stash is consumed as an operand
- Two-Group Mode, for the paired context where stash() is intentionally unavailable
Typical use:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn residual_max<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init() // enter VE
.vector_intra_slice_branch(BranchMode::Unconditional) // start the intra-slice path
.vector_stash() // save x
.vector_trim_way4::<m![A % 2 # 4]>() // narrow to Way4
.vector_fp_binary(FpBinaryOp::MulF(FpMulAlu::Mul0), 2.0f32) // compute 2x
.vector_pad_way8::<m![A % 2 # 8]>() // widen back to Way8
.vector_clip(ClipBinaryOpF32::Max, Stash) // max(2x, x)
.vector_final()
}
}
Stash: Fxp-Only Path
Stash at an early stage, then use it later in a Clip operation. This implements max(x + bias, x):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn stash_at_fxp<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init() // enter VE
.vector_intra_slice_branch(BranchMode::Unconditional) // start the intra-slice path
.vector_stash() // save original x
.vector_fxp(FxpBinaryOp::AddFxp, 100) // compute x + bias
.vector_clip(ClipBinaryOpI32::Max, Stash) // compute max(x + bias, x)
.vector_final()
}
}
Stash: Read/Write Across Narrow and Widen
Stash before narrowing, consume after widening. This computes max(sigmoid(x), x):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn stash_across_narrow_widen<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init() // enter VE
.vector_intra_slice_branch(BranchMode::Unconditional) // start the intra-slice path
.vector_stash() // save x (Way8)
.vector_trim_way4::<m![A % 2 # 4]>() // narrow to Way4
.vector_fp_unary(FpUnaryOp::Sigmoid) // compute sigmoid(x) in Way4
.vector_pad_way8::<m![A % 2 # 8]>() // widen back to Way8
.vector_clip(ClipBinaryOpF32::Max, Stash) // compute max(sigmoid(x), x)
.vector_final()
}
}
Operands
Operations (excluding unary and reduce) take operands specifying the second (or third) input.
The IntoOperands trait accepts multiple types:
| Operand Type | Example | Description |
|---|---|---|
| Constant | 100, 2.5f32 | Scalar broadcast to all elements |
| VRF tensor | VeRhs::vrf(&vrf_tensor) | Pre-loaded via .to_vrf() before entering the Vector Engine |
| Stash | Stash | Value saved by a prior vector_stash() call |
For ternary operations (FmaF), use (operand0, operand1) pairs or TernaryOperand.
Operands can be conditioned per ExecutionId group using VeBranchOperand:
- VeBranchOperand::always(operand), applied to all groups
- VeBranchOperand::group(operand, GroupId::Zero), applied only to group 0
VRF Input
Using pre-loaded VRF data as an operand:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, B = 256];
fn vrf_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![B], m![A % 2 # 8]>,
vrf: &VrfTensor<i32, m![1], m![1 # 2], m![A / 2], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![B], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, VeRhs::vrf(vrf))
.vector_final()
}
}
Argument Modes
Unary, binary, and ternary ops can select how the operator arguments are sourced.
For single-stream operations, “stream” refers to the tensor value carried by the self input of the method chain.
UnaryArgMode:
| Mode | Meaning | Computation |
|---|---|---|
Mode0 | stream | op(stream) (default) |
Mode1 | operand | op(operand) |
BinaryArgMode:
| Mode | Meaning | Computation |
|---|---|---|
Mode00 | stream / stream | op(stream, stream) |
Mode01 | stream / operand | op(stream, operand) (default) |
Mode10 | operand / stream | op(operand, stream) |
Mode11 | operand / operand | op(operand, operand) |
TernaryArgMode:
| Mode | Meaning | Computation |
|---|---|---|
Mode012 | stream / operand0 / operand1 | op(stream, operand0, operand1) (default) |
Mode002 | stream / stream / operand1 | op(stream, stream, operand1) |
Mode102 | operand0 / stream / operand1 | op(operand0, stream, operand1) |
Mode112 | operand0 / operand0 / operand1 | op(operand0, operand0, operand1) |
Mode020 | stream / operand1 / stream | op(stream, operand1, stream) |
Mode021 | stream / operand1 / operand0 | op(stream, operand1, operand0) |
Mode120 | operand0 / operand1 / stream | op(operand0, operand1, stream) |
In two-group mode, BinaryArgMode has two interpretations:
- For per-group ops such as vector_fxp_with_mode(...) or vector_fp_binary_with_mode(...), the mode is interpreted independently inside each group. 0 means that group’s stream and 1 means that group’s operand.
- For _zip ops such as vector_fxp_zip_with_mode(...) or vector_fp_zip_with_mode(...), the mode refers to the two grouped streams directly. 0 means Group 0 and 1 means Group 1.
For _zip ops, BinaryArgMode maps to the grouped streams as follows:
| Mode | Meaning | Computation |
|---|---|---|
Mode00 | group0 / group0 | op(group0, group0) |
Mode01 | group0 / group1 | op(group0, group1) (default) |
Mode10 | group1 / group0 | op(group1, group0) |
Mode11 | group1 / group1 | op(group1, group1) |
Single-stream example:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512];
fn bias_minus_x<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp_with_mode(FxpBinaryOp::SubFxp, BinaryArgMode::Mode10, 7) // compute 7 - x
.vector_final()
}
}
Two-group _zip example:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_sub_reverse<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_fp_zip_with_mode(FpBinaryOp::SubF, BinaryArgMode::Mode10) // compute group1 - group0
.vector_pad_way8::<m![A % 2 # 8]>()
.vector_final()
}
}
Two-Group Mode
Enter via vector_intra_slice_unzip() to process two interleaved groups in parallel.
This is the API used after begin_interleaved(...), where the collected tensor carries a 2-way grouping axis that should be treated as “group 0” and “group 1”.
The high-level flow is:
- vector_intra_slice_unzip() splits the grouped input into two parallel streams.
- Per-group stages run in lock-step on both groups.
- A _zip op merges the pair back into a single stream.
- The merged result can continue to vector_final().
There are two kinds of operations in this mode:
- Common stages apply to both groups together: vector_fxp_to_fp, vector_split, vector_trim_way4, vector_concat, vector_pad_way8, vector_fp_to_fxp.
- Per-group ops take one argument per group. Use () to skip one side, or pass different operands to each side. See Argument Modes for how BinaryArgMode is interpreted in per-group ops and _zip ops.
Minimal example, zip two interleaved groups with integer add:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_clip_zip(ClipBinaryOpI32::AddFxp)
.vector_final()
}
}
With asymmetric preprocessing, only group 0 is scaled before zip:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_preprocess_one_side<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_fxp(FxpBinaryOp::MulInt, 10, ()) // group 0 only
.vector_clip_zip(ClipBinaryOpI32::AddFxp)
.vector_final()
}
}
For float pipelines, both groups must narrow together before vector_fp_*, then zip in Way4, then widen again if later stages need Way8.
Important constraints:
- While the two groups are still paired (before _zip), stash() and filter() are not available.
- After _zip merges the pair, the result can continue downstream, but stash() and filter() remain unavailable on the merged tensor.
- ALU usage is shared across both groups. If either group uses an ALU in a stage, that ALU is consumed for the whole pair pass.
See also:
- Quick Reference, for the single-stream stage order that resumes after _zip
- Stash, for the snapshot path (unavailable in two-group mode)
Float Pipeline with Zip
Both groups go through the float path (narrow -> fp -> zip -> widen):
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_fp_mul_zip<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_trim_way4::<m![A % 2 # 4]>() // both groups: Way8 -> Way4
.vector_fp_zip(FpBinaryOp::MulF(FpMulAlu::Mul0)) // group0 * group1 (Way4)
.vector_pad_way8::<m![A % 2 # 8]>() // Way4 -> Way8
.vector_final()
}
}
Per-Group Preprocessing
Apply different operations to each group before zipping:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 512, I = 2];
fn pair_asymmetric_preprocess<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![I], m![A % 2 # 8]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![A / 2], m![1], m![A % 2 # 8]> {
input
.vector_init()
.vector_intra_slice_unzip::<I, m![1 # 2], m![1]>()
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_fp_unary(FpUnaryOp::Exp, true, false) // group 0: exp(x), group 1: skip
.vector_fp_zip(FpBinaryOp::MulF(FpMulAlu::Mul0)) // exp(group0) * group1
.vector_pad_way8::<m![A % 2 # 8]>()
.vector_final()
}
}
Constraints
| Constraint | Detail |
|---|---|
| ALU single-use | Each ALU usable once per pass. Same-ALU operations require separate TU invocations. |
| Data types | i32/f32 only. Lower-precision data must be converted at fetch or after contraction. |
| ExecutionId range | 4-bit (0-15), limiting conditional paths to 16 branches per element. |
| VRF capacity | Limited capacity; binary/ternary operands must be pre-loaded via .to_vrf(). |
| Narrow/Widen overhead | Float operations halve throughput due to the Way8→Way4→Way8 path. |
| Stash single-use | Only one vector_stash() snapshot can be live in a pass, and it is unavailable after binary ops. |
| Two-group context | stash() and filter() are unavailable while paired (before _zip) and after _zip. |
ALU Conflict Example
Each ALU can only be used once. This example panics because AddFxp and SubFxp both use the FxpAdd ALU:
// PANICS: "FxpAdd is already in use"
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 10) // uses FxpAdd
.vector_fxp(FxpBinaryOp::MulInt, 2) // uses FxpMul ✓
.vector_fxp(FxpBinaryOp::SubFxp, 5) // uses FxpAdd again ✗
.vector_final()
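One way to keep this computation in a single pass, sketched under the assumption that finishing the add in the Clip stage is acceptable here: the Clip i32 table lists AddFxp on the ClipAdd ALU, so the final adjustment can move there and no ALU is used twice.
// OK: FxpAdd, FxpMul, and ClipAdd are each used once
input
    .vector_init()
    .vector_intra_slice_branch(BranchMode::Unconditional)
    .vector_fxp(FxpBinaryOp::AddFxp, 10)      // uses FxpAdd
    .vector_fxp(FxpBinaryOp::MulInt, 2)       // uses FxpMul
    .vector_clip(ClipBinaryOpI32::AddFxp, -5) // uses ClipAdd instead of a second FxpAdd
    .vector_final()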
Performance
Throughput
- Logic, Fxp, Clip Clusters: Full 8-way throughput (8 elements per cycle)
- Float Cluster: 4-way throughput (typically requires Narrow/Widen around the float path, effectively halving throughput)
Pipeline Latency
Each ALU introduces one cycle of latency.
Operations requiring multiple ALUs accumulate latencies.
For example, exp(sqrt(x)) adds 2 cycles (FpFpu for sqrt + FpExp for exp).
Intra-Slice Reduce
The Intra-Slice Reduce is a reduction operation performed by the IntraSliceReduce stage within the Intra-Slice Block.
At the hardware level, this corresponds to the reduction unit in the 4-way path.
It reduces axes within a single slice, contrasting with the Inter-Slice Block which reduces across the 256 slices of a cluster (inter-slice reduce).
This document covers the blocking-mode case where the accumulator result is not stored to an intermediate buffer.
Non-blocking mode (where accumulator results are stored to an intermediate buffer) is not covered here.
Interface
impl<
'l,
const T: Tu,
S,
Chip: M,
Cluster: M,
Slice: M,
Time: M,
Packet: M,
StashD: VeScalar,
Stash: TensorState<StashD>,
FS: stage::VeTensorContext,
const VE_ORDER: VeOrder,
> VectorTensor<'l, T, S, i32, Chip, Cluster, Slice, Time, Packet, StashD, Stash, VE_ORDER, FS, { Way4 }>
where
S: stage::Stage + CanTransitionTo<stage::IntraSliceReduce>,
{
/// Intra-slice reduce operation (i32).
#[primitive(VectorTensor::vector_intra_slice_reduce)]
pub fn vector_intra_slice_reduce<Reduce: AxisName, OTime: M, OPacket: M>(
mut self,
op: IntraSliceReduceOpI32,
) -> VectorIntraSliceReduceTensor<
'l,
T,
i32,
Chip,
Cluster,
Slice,
OTime,
OPacket,
StashD,
Stash,
VE_ORDER,
stage::Standalone,
{ Way4 },
>
impl<
'l,
const T: Tu,
S,
Chip: M,
Cluster: M,
Slice: M,
Time: M,
Packet: M,
StashD: VeScalar,
Stash: TensorState<StashD>,
FS: stage::VeTensorContext,
const VE_ORDER: VeOrder,
> VectorTensor<'l, T, S, f32, Chip, Cluster, Slice, Time, Packet, StashD, Stash, VE_ORDER, FS, { Way4 }>
where
S: stage::Stage + CanTransitionTo<stage::IntraSliceReduce>,
{
/// Intra-slice reduce operation (f32).
#[primitive(VectorTensor::vector_intra_slice_reduce)]
pub fn vector_intra_slice_reduce<Reduce: AxisName, OTime: M, OPacket: M>(
mut self,
op: IntraSliceReduceOpF32,
) -> VectorIntraSliceReduceTensor<
'l,
T,
f32,
Chip,
Cluster,
Slice,
OTime,
OPacket,
StashD,
Stash,
VE_ORDER,
stage::Standalone,
{ Way4 },
>
Parameters:
- REDUCE_LABEL: Which axis to reduce, specified as an Ident value (e.g., Ident::R). Each axes![] declaration creates a named label (Ident); all factors derived from the same declaration share that label. All factors in Time and Packet carrying this label are eliminated by the reduction, so they must not appear in the output shape (OTime, OPacket). For example, if R is split as R / 4 in Time and R % 4 in Packet, specifying REDUCE_LABEL = Ident::R eliminates both.
- op: The reduce operation (IntraSliceReduceOpI32 for i32, IntraSliceReduceOpF32 for f32).
- OTime, OPacket: The output Time and Packet shape after reduction. These must be exactly the input Time and Packet with all REDUCE_LABEL factors removed.
The Chip, Cluster, and Slice dimensions pass through unchanged from input to output.
Mechanism
Conceptual Operation
The IntraSliceReduce stage sits inside the Intra-Slice Block pipeline, after Narrow and before Widen.
It accepts a 4-way input (after Buffering Split divides the 8-way flit), performs a 2-level tree reduce to produce a single value, and accumulates the result into an accumulator slot.
4-way input from Narrow stage
┌───┬───┬───┬───┐
│ a │ b │ c │ d │
└─┬─┴─┬─┴─┬─┴─┬─┘
│ │ │ │
└─┬─┘ └─┬─┘ Level 1: pairwise reduce
op(a,b) op(c,d)
└───┬───┘ Level 2: pairwise reduce
op(op(a,b),op(c,d))
│
▼
┌─────────────┐
│ Accumulator │ Accumulate across time steps
│ (8 slots) │
└─────────────┘
The accumulator holds partial results across multiple input flits, implementing temporal reduction over the Time axis. Up to 8 accumulator slots are available, each serving as a buffer that accumulates partial reduce results across time steps.
Padding exclusion is handled by the Valid Count Generator (VCG), which tags each flit with the count of valid elements so the IntraSliceReduce stage can skip pad data.
Architectural Parameters
| Parameter | Value | Description |
|---|---|---|
| Tree input width | 4-way | Fixed; Narrow produces the 4-way input |
| Tree depth | 2 | Two levels of pairwise reduction |
| Accumulator slots | 8 | Independent reduction accumulators |
| Accumulator type | Temporal | Accumulates across time steps within each slot |
Supported Operations
Integer Operations (IntraSliceReduceOpI32)
| Operation | Description | Identity Element |
|---|---|---|
AddSat | Saturating addition | 0 |
Max | Maximum value | i32::MIN |
Min | Minimum value | i32::MAX |
Floating-Point Operations (IntraSliceReduceOpF32)
| Operation | Description | Identity Element |
|---|---|---|
Add | Floating-point addition | 0.0 |
Max | Maximum value | f32::NEG_INFINITY |
Min | Minimum value | f32::INFINITY |
Examples
Reduce in Time (i32 Saturating Add)
R exists only in Time, accumulated across time steps.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 16];
// Slice = m![A / 2], Time = m![R], Packet = m![A % 2 # 8] (8-way)
// R in Time → temporal accumulation. Packet is non-reduce.
fn reduce_time<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![A / 2], m![R], m![A % 2 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![A / 2], m![1], m![A % 2 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A % 2 # 4]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![1], m![A % 2 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![A / 2], Time = m![1], Packet = m![A % 2 # 4]
// R eliminated from Time.
}
}
Reduce in Packet Only (f32 Add)
R exists only in Packet, so the hardware performs a 4-way tree reduce within each flit with no temporal accumulation.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 4];
// Slice = m![A / 2], Time = m![A % 2], Packet = m![R # 8] (8-way)
// R in Packet → tree reduce within flit. VCG tags valid_count = |R|.
fn reduce_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![A / 2], m![A % 2], m![R # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![A / 2], m![A % 2], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![A % 2], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![A / 2], Time = m![A % 2], Packet = m![1 # 4]
// R eliminated from Packet.
}
}
Reduce Split Across Time and Packet (f32 Max)
R is split: R % 4 in Packet is tree-reduced within each flit, then accumulated across R / 4 time steps.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 16];
// Slice = m![A / 2], Time = m![R / 4], Packet = m![R % 4 # 8] (8-way)
// R % 4 in Packet → spatial tree reduce
// R / 4 in Time → temporal accumulation
fn reduce_time_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![A / 2], m![R / 4], m![R % 4 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R % 4]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Max,
)
// Output: Slice = m![A / 2], Time = m![1], Packet = m![1 # 4]
// Both R portions eliminated.
}
}
Reduce Axis Spanning Slice, Time, and Packet (i32 Min)
R spans all three dimensions. The VCG handles per-slice valid count variation for boundary slices.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13];
// Slice = m![R # 32 / 8], Time = m![R # 32 / 4 % 2], Packet = m![R # 32 % 4 # 8] (8-way)
// R split across all three: Slice (groups of 8), Time (pairs within group), Packet (4 elements).
// Boundary slices may have fewer valid time steps, and the VCG handles this.
fn reduce_slice_time_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 32 / 8], m![R # 32 / 4 % 2], m![R # 32 % 4 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 32 / 8], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 32 % 4]>() // 8-way → 4-way
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
// Output: Slice = m![R # 32 / 8], Time = m![1], Packet = m![1 # 4]
// R eliminated from Time and Packet (accumulated within each slice).
}
}
Constraints
Accumulator Slot Limit
The IntraSliceReduce stage has 8 accumulator slots.
Each non-reduce (NR) position inside the outermost reduce factor occupies a separate slot, so the product of inner NR axis sizes must be ≤ 8.
Consider Time = m![R / 2, A % 2, R % 2] where R is the reduce label and A is non-reduce.
The NR factor A % 2 sits between the outer reduce R / 2 and inner reduce R % 2.
Each value of A % 2 needs its own accumulator slot to maintain an independent reduction:
Time = m![R / 2, A % 2, R % 2]
~~~~~~ ~~~~~~ ~~~~~~
outer R NR inner R
Flit sequence (R=4, A%2 has values 0,1):
flit #0: R/2=0, A%2=0, R%2=0 ──→ ┌─────────────────┐
flit #1: R/2=0, A%2=0, R%2=1 ──→ │ Slot 0 (A%2=0) │ accumulates R for A%2=0
flit #4: R/2=1, A%2=0, R%2=0 ──→ │ │
flit #5: R/2=1, A%2=0, R%2=1 ──→ └─────────────────┘
flit #2: R/2=0, A%2=1, R%2=0 ──→ ┌─────────────────┐
flit #3: R/2=0, A%2=1, R%2=1 ──→ │ Slot 1 (A%2=1) │ accumulates R for A%2=1
flit #6: R/2=1, A%2=1, R%2=0 ──→ │ │
flit #7: R/2=1, A%2=1, R%2=1 ──→ └─────────────────┘
2 NR positions → 2 slots used (≤ 8 ✓)
With multiple NR factors, slot usage multiplies:
Valid: Time = m![R, A % 2, B % 4] → 2 × 4 = 8 slots (≤ 8 ✓)
Invalid: Time = m![R, A % 3, B % 4] → 3 × 4 = 12 slots (> 8 ✗)
If the NR product exceeds 8, the mapping must be restructured.
Invalid: Accumulator Slot Limit Exceeded (i32 AddSat)
NR factors between reduce factors occupy accumulator slots.
Here A % 3 (3 values) and B % 4 (4 values) sit inside the reduce axis, requiring 3 × 4 = 12 slots, exceeding the 8-slot limit.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 6, B = 8, R = 16];
// Time = m![R, A % 3, B % 4] -> NR product = 3 × 4 = 12 > 8
fn invalid_too_many_slots<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![A / 3], m![R, A % 3, B % 4], m![B / 4 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![A / 3], m![A % 3, B % 4], m![B / 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![B / 4]>()
.vector_intra_slice_reduce::<R, m![A % 3, B % 4], m![B / 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Rejected: 12 accumulator slots required, but only 8 are available.
}
}
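One possible restructuring, sketched under the assumption that the surrounding mapping allows R to be placed innermost in Time: with both NR factors outer to R, no NR position sits inside the reduction and a single accumulator slot suffices (the function name is illustrative).
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 6, B = 8, R = 16];
// Time = m![A % 3, B % 4, R] -> no NR factor inner to R, so 1 slot is needed
fn restructured_slots<'l, const T: Tu>(
    input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![A / 3], m![A % 3, B % 4, R], m![B / 4 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![A / 3], m![A % 3, B % 4], m![B / 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
    input
        .vector_trim_way4::<m![B / 4]>()
        .vector_intra_slice_reduce::<R, m![A % 3, B % 4], m![B / 4]>(
            IntraSliceReduceOpI32::AddSat,
        )
}
}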
Padding Strategy
When a reduce axis is padded to fit hardware dimensions, the padded positions contain arbitrary data that must be excluded from the reduction result. Three strategies are available:
| Situation | Strategy |
|---|---|
| Mapping supported by VCG (see Valid Count Generator’s Interface) | VCG (automatic, no extra setup) |
| Unsupported VCG placement, simple reduce op (Add, Max, Min) | Identity-element padding via Fetch Engine’s pad_value |
Unsupported VCG placement, composed op (e.g., exp + Add) | Restructure the mapping, or use other methods |
1. VCG (Valid Count Generator).
The VCG tags each flit with a valid_count so the IntraSliceReduce stage excludes pad elements automatically.
Not all axis placements across Slice, Time, and Packet are supported; see Valid Count Generator’s Interface for details.
2. Identity-element padding.
Fill pad positions with the identity element of the reduce operation before data reaches the Intra-Slice Block.
The Fetch Engine’s padding adapter can set a pad_value during fetch.
| Operation | Identity Element |
|---|---|
AddSat / Add | 0 / 0.0 |
Max | i32::MIN / f32::NEG_INFINITY |
Min | i32::MAX / f32::INFINITY |
This does not work when the reduce operation is composed with a preceding non-invertible transformation.
For example, exp(x) + exp(y) + ... (sum of exponentials): there is no value p such that exp(p) = 0 (the additive identity), so padding with any value produces an incorrect contribution.
3. Other methods.
NaN masking via ExecutionId and per-slice SFR override (stosfr/itosfr) are additional options, but they are not covered on this page.
Performance
Throughput
The 2-level tree reduce is fully pipelined within the Intra-Slice Block pipeline. Each input flit passes through the tree in one pipeline stage, adding no extra per-flit throughput cost.
Latency
The reduce must accumulate all input flits for a reduction group before emitting the result.
If the reduce axis spans n time steps, the first output flit is delayed by n flit cycles beyond the normal pipeline latency.
In a multi-engine pipeline, this accumulation delay can stall downstream engines waiting for the first flit.
This page covers blocking mode only. Non-blocking mode, ExecutionId-based NaN masking, and per-slice SFR override are outside its scope.
Valid Count Generator’s Interface
Overview
Recall that when a reduce axis is padded, the extra positions contain arbitrary data that must be excluded from the reduction (see Padding Strategy).
The Valid Count Generator (VCG) solves this at the hardware level: it tags each 8-element packet with a valid_count (abbreviated vc), telling the Intra-Slice Reduce stage how many elements are real data.
The intra-slice reduce API takes a REDUCE_LABEL that identifies the axis to reduce (e.g., Ident::R).
When that axis is padded, the VCG automatically determines which elements are real data and which are padding, based on how the axis is distributed across Slice, Time, and Packet.
The VCG requires the reduce axis to be structured in a specific way (padded, split, and distributed across Slice, Time, and Packet).
The distribution rules are described in How R Should Be Distributed, followed by concrete examples for each placement.
The VCG is configured automatically by the compiler; no manual setup is needed.
For the underlying hardware mechanism, see Valid Count Generator’s Implementation.
Quick Reference
If you are checking whether a mapping is supported, start with this table and the examples below.
| Placement | Mode | Example | Supported |
|---|---|---|---|
| Slice + Time | Time Reduce | Slice = m![X, R # 24 / 3], Time = m![R # 24 % 3] | Yes |
| Time only | Time Reduce | Time = m![R # 16] | Yes |
| Slice + Time (transposed, supported) | Time Reduce | Slice = m![X, R # 8 % 4], Time = m![R # 8 / 4] | Yes |
| Slice + Time (transposed, not supported) | Time Reduce | Slice = m![X, R # 20 % 4], Time = m![R # 20 / 4] | No |
| Non-outer/inner ordering | Time Reduce | Slice = m![X, R # 16 / 2 % 4, R # 16 / 8] | No |
| Packet only | Packet Reduce | Packet = m![R # 8] | Yes |
| Time + Packet | Packet Reduce | Time = m![R # 24 / 8], Packet = m![R # 24 % 8] | Yes |
| Time + Packet (Packet not innermost) | Packet Reduce | Time = m![R # 24 % 8], Packet = m![R # 24 / 8 # 8] | No |
| Time + Packet (mixed Packet axes) | Packet Reduce | Packet = m![R # 24 % 4, A] | No |
| Slice + Packet | Packet Reduce | Slice = m![R # 2048 / 8], Packet = m![R # 2048 % 8] | No |
Two Modes
Depending on whether R appears in Packet, the VCG operates in one of two exclusive modes:
Packet Reduce Mode: R appears in Packet, for example Packet = m![R # 24 % 8].
The VCG assigns a per-packet valid_count from 0 to 8, so the IntraSliceReduce stage knows how many packet elements are real data.
Time Reduce Mode: R does not appear in Packet, only in Slice and/or Time.
The VCG makes a binary valid-or-invalid decision per flit.
How R Should Be Distributed
All examples in this document assume the same high-level pattern: first pad the reduce axis, then split it, then place the resulting sub-expressions across Slice, Time, and Packet.
At the top level, R # p is split into an outer and an inner part:
R # p -> R # p / k (outer), R # p % k (inner)
Each part is then assigned to one of the three hardware dimensions. Within a dimension, a part can be split again recursively.
The VCG rule depends on where R appears:
- Slice: when
Rappears multiple times in Slice, the stride order must increase from inner to outer. This keeps each slice’sRrange contiguous. - Time: multiple
Rsub-expressions can appear in any order, and non-reduce axes may sit between them. - Packet: Packet must still be padded to
# 8.Rmay appear at most once in Packet, and it must be the innermost% kpart occupying the packet prefix.
Example: why Slice stride order matters
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 32];
// R # 16, split into 2 * 4 * 2:
// Slice = m![R # 16 / 8, X, R # 16 / 2 % 4]
// Time = m![R # 16 % 2]
//
// Slice strides for R:
// R # 16 / 2 % 4 -> stride = 2 (inner)
// R # 16 / 8 -> stride = 8 (outer)
// 2 < 8, so each slice receives a contiguous R interval.
fn example_stride_ordering<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
}
}
Examples: Time Reduce Mode
R does not appear in Packet.
The VCG decides per-flit: valid or invalid.
Slice + Time
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 4, R = 17, X = 32];
// The most common pattern. Outer part of R → Slice, inner part → Time.
//
// R = 17, padded to 24 = 8 * 3.
// R # 24 is split into:
// R # 24 / 3 (size 8, outer) → Slice
// R # 24 % 3 (size 3, inner) → Time
// This follows the standard outer→Slice, inner→Time pattern.
//
// Slice = m![X, R # 24 / 3], Time = m![R # 24 % 3], Packet = m![A # 8]
// |Slice| = X(32) * 8 = 256.
// Time Reduce Mode: R does not appear in Packet. VCG gates flits by slice and time.
// Boundary slice = floor(17 / 3) = 5. Valid time steps in boundary = 17 mod 3 = 2.
// For each X group, R-slices 0-4: all 3 time steps valid.
// R-slice 5: 2 of 3 valid (boundary). R-slices 6-7: all invalid.
fn reduce_slice_time<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 24 / 3], m![R # 24 % 3], m![A # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 24 / 3], m![1], m![A], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A]>()
.vector_intra_slice_reduce::<R, m![1], m![A]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![X, R # 24 / 3], Time = m![1], Packet = m![A]
// R eliminated from Time. Boundary slice (R-slice #5) accumulated only their valid steps.
}
}
Valid count trace for R = 17 (within one X group)
| R-slice | Group | t=0 | t=1 | t=2 |
|---|---|---|---|---|
| 0 | all-valid | 0 | 1 | 2 |
| 1 | all-valid | 3 | 4 | 5 |
| 2 | all-valid | 6 | 7 | 8 |
| 3 | all-valid | 9 | 10 | 11 |
| 4 | all-valid | 12 | 13 | 14 |
| 5 | boundary | 15 | 16 | . |
| 6 | all-invalid | . | . | . |
| 7 | all-invalid | . | . | . |
This pattern repeats identically for each of the 32 X groups (256 total slices).
Slice + Time (R Split Into Multiple Sub-Expressions in Slice)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 32];
// R can appear as multiple sub-expressions in Slice.
//
// R # 16 is first split into outer (/ 2) and inner (% 2) for Slice vs Time:
// R # 16 / 2 → Slice portion (size 8)
// R # 16 % 2 → Time portion (size 2)
//
// The Slice portion (R # 16 / 2, size 8) is further split into two sub-expressions:
// R # 16 / 8 stride = 8, size = 2 (outer)
// R # 16 / 2 % 4 stride = 2, size = 4 (inner)
//
// Slice = m![R # 16 / 8, X, R # 16 / 2 % 4], Time = m![R # 16 % 2], Packet = m![1 # 8]
// Slice product = 2 * 32 * 4 = 256.
//
// Slice ordering check (inner to outer, ascending stride):
// R # 16 / 2 % 4 stride = 2, size = 4 (inner)
// R # 16 / 8 stride = 8, size = 2 (outer)
// inner_stride(2) * size(4) = 8 = outer_stride ✓
//
// This gives each slice contiguous R indices (within one X group):
// S0: 0,1 S1: 2,3 S2: 4,5 S3: 6,7 S4: 8,9 S5: 10,11 S6: 12,13 S7: 14,15
// Time Reduce Mode:
// Boundary slice = floor(13 / 2) = 6. Valid time steps = 13 mod 2 = 1.
// S0-S5: all-valid, S6: boundary (1 of 2), S7: all-invalid.
fn reduce_multi_level<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 16 / 8, X, R # 16 / 2 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
}
}
Slice + Time (Transposed)
The reverse of Slice + Time: the inner part of R goes to Slice, the outer part to Time.
In transposed mode, the slice ID represents the inner index and the time step the outer index.
Slices beyond the boundary still have valid data at early time steps.
Given R # p split as Slice = m![..., R # p % slice_size], Time = m![R # p / slice_size, ...] (where slice_size is the size of R’s portion in Slice), this mode is supported only when:
$$\text{time_size} = p / \text{slice_size} = \lceil |R| / \text{slice_size} \rceil$$
- \(|R|\): original axis size (before padding)
- \(\text{slice_size}\): the size of R’s portion in Slice (R # p % slice_size)
- \(\text{time_size}\): the size of R’s portion in Time (R # p / slice_size), which equals \(p / \text{slice_size}\)
Slice + Time (Transposed, Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 5, X = 64];
// Transposed: inner part of R → Slice, outer part → Time.
// (The reverse of standard Slice + Time.)
//
// R = 5, padded to 8 = 4 * 2.
// R # 8 is split into:
// R # 8 % 4 (size 4, inner) → Slice (transposed: inner goes to Slice)
// R # 8 / 4 (size 2, outer) → Time (transposed: outer goes to Time)
//
// Slice = m![X, R # 8 % 4], Time = m![R # 8 / 4], Packet = m![1 # 8]
// |Slice| = 64 * 4 = 256.
//
// time_size(2) == ceil(5 / 4) = 2 ✓
//
// Time Reduce Mode:
// Boundary slice = |R| mod slice_size = 5 mod 4 = 1.
// Valid time steps in boundary = floor(|R| / slice_size) = floor(5/4) = 1.
// R-slice 0 (< boundary): all 2 time steps valid.
// R-slice 1 (= boundary): 1 of 2 time steps valid.
// R-slices 2-3 (> boundary): also 1 of 2 time steps valid (transposed behavior).
fn reduce_transposed<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 8 % 4], m![R # 8 / 4], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 8 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![X, R # 8 % 4], Time = m![1], Packet = m![1 # 4]
}
}
Slice + Time (Transposed, Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 14, X = 64];
// NOT supported: time_size is over-allocated.
//
// Slice = m![X, R # 20 % 4], Time = m![R # 20 / 4], Packet = m![1 # 8]
// Slice = 64 * 4 = 256.
//
// R = 14, padded to 20 = 4 (S) * 5 (time_size).
// time_size(5) != ceil(14 / 4) = 4 ✗
//
// R-slice 0 should have 4 valid time steps (indices 0, 4, 8, 12, all < 14).
// But the VCG classifies R-slice 0 as "all-valid" = 5 time steps. WRONG.
//
// A possible fix: pad to 16 instead, so time_size = ceil(14/4) = 4:
// Slice = m![X, R # 16 % 4], Time = m![R # 16 / 4], Packet = m![1 # 8]
fn reduce_transposed_wrong<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 20 % 4], m![R # 20 / 4], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 20 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// ✗ VCG will over-count valid time steps. Use R # 16 instead of R # 20.
}
}
Time Only
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 12, X = 64];
// R exists only in Time. VCG gates excess time steps as invalid.
// All slices see the same pattern.
//
// Slice = m![X, A / 2], Time = m![R # 16], Packet = m![A % 2 # 8]
// Slice = 64 * 4 = 256.
//
// R = 12, padded to 16. Time Reduce Mode: time steps 0-11 valid, 12-15 invalid.
fn reduce_time_only<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, A / 2], m![R # 16], m![A % 2 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, A / 2], m![1], m![A % 2 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A % 2 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![A % 2 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![A % 2 # 4]
// R eliminated from Time. Time steps 12-15 were gated off.
}
}
Non-Outer/Inner Ordering (Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 32];
// NOT supported: reordered sub-expressions in Slice break monotonic validity.
// R's sub-expressions must form a clean outer/inner relationship across dimensions.
// If reordered, the VCG cannot express the resulting validity pattern.
//
// Slice = m![X, R # 16 / 2 % 4, R # 16 / 8], Time = m![R # 16 % 2]
// Slice = 32 * 4 * 2 = 256.
//
// Striped R indices per R-slice group:
// S0: 0,1 S1: 8,9 S2: 2,3 S3: 10,11 S4: 4,5 S5: 12,13 S6: 6,7 S7: 14,15
// Validity: valid, valid, valid, valid, valid, partial, valid, invalid
//
// Non-monotonic (S6 valid after S5 partial). The VCG cannot express this.
//
// Fix: use standard ordering instead:
// Slice = m![X, R # 16 / 8, R # 16 / 2 % 4], Time = m![R # 16 % 2]
fn reduce_wrong_ordering<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4, R # 16 / 8], m![R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4, R # 16 / 8], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
// ✗ Non-monotonic slice validity. VCG cannot express this pattern.
}
}
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 13, X = 64];
// NOT supported: Time-Slice-Time interleave.
// Slice = m![X, R # 16 / 2 % 4], Time = m![R # 16 / 8, R # 16 % 2]
// Slice = 64 * 4 = 256.
//
// The interleave causes different R-slices to need different valid time step counts:
// S0: R indices 0,1,8,9 -> 4/4 valid
// S1: R indices 2,3,10,11 -> 4/4 valid
// S2: R indices 4,5,12,13 -> 3/4 valid
// S3: R indices 6,7,14,15 -> 2/4 valid
//
// The VCG has a single threshold, so it cannot express per-slice values.
//
// Fix: standard Slice outer, Time inner:
// Slice = m![X, R # 16 / 4], Time = m![R # 16 % 4]
fn reduce_wrong_interleave<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4], m![R # 16 / 8, R # 16 % 2], m![1 # 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![X, R # 16 / 2 % 4], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![1 # 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::Min,
)
// ✗ Per-slice V values needed. VCG cannot express this pattern.
}
}
Time: Flexible Ordering
When R is split into multiple sub-expressions within Time, they can appear in any order and non-reduce axes can sit between them (unlike Slice, where ordering is strict).
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 4, R = 45, X = 2, Y = 256];
// R's Time portion can be split into multiple sub-expressions (order does not matter).
//
// R # 48 is split into outer (/ 8) and inner (% 8) for Time vs Packet:
// R # 48 / 8 → Time portion (size 6)
// R # 48 % 8 → Packet portion (size 8)
//
// The Time portion (R # 48 / 8, size 6) is further split:
// R # 48 / 8 / 2 (size 3)
// R # 48 / 8 % 2 (size 2)
// These can appear in any order with non-reduce axes between them.
//
// Slice = m![Y], Time = m![R # 48 / 8 % 2, X, R # 48 / 8 / 2], Packet = m![A # 8]
// |Slice| = 256.
//
// The VCG combines both R sub-expressions' positions to determine validity.
// (Contrast with Slice, where ordering is strict.)
fn reduce_time_split<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![Y], m![R # 48 / 8 % 2, X, R # 48 / 8 / 2], m![A # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![Y], m![X], m![A], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![A]>()
.vector_intra_slice_reduce::<R, m![X], m![A]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![Y], Time = m![X], Packet = m![A]
// Both R sub-expressions eliminated from Time; X remains.
}
}
Examples: Packet Reduce Mode
R appears in Packet.
The VCG assigns a per-packet valid_count (0-8) that varies by time step.
Packet Only
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 3, X = 64];
// R exists only in Packet. Every packet gets the same constant valid_count = |R|.
//
// Slice = m![X, A / 2], Time = m![1], Packet = m![R # 8]
// Slice = 64 * 4 = 256.
//
// R = 3, padded to 8. Packet Reduce Mode: every packet has vc = 3.
// No slice or time variation needed.
fn reduce_packet_only<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![R # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
// R eliminated from Packet. All 3 of 8 elements were counted as valid.
}
}
Time + Packet
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 19, X = 64];
// R spans both Time and Packet.
// VCG produces full packets first, then a partial packet at the tail.
//
// Slice = m![X, A / 2], Time = m![R # 24 / 8], Packet = m![R # 24 % 8]
// Slice = 64 * 4 = 256.
//
// R = 19, padded to 24 = 3 (Time) * 8 (Packet).
// Packet Reduce Mode: R fills all 8 Packet positions.
// t=0: vc = 8 (all valid)
// t=1: vc = 8 (all valid)
// t=2: vc = 3 (first 3 valid, last 5 are padding)
// All slices see the same [8, 8, 3] pattern.
fn reduce_time_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R # 24 / 8], m![R # 24 % 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 24 % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
// R eliminated from both Time and Packet.
}
}
Time + Packet (Packet Not Innermost, Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 19, X = 64];
// NOT supported: R's portion in Packet must be the innermost (lowest) sub-expression.
//
// R = 19, padded to 24 = 3 * 8.
// R # 24 is split into:
// R # 24 % 8 (size 8, inner), should go to Packet
// R # 24 / 8 (size 3, outer), should go to Time
// But here they are swapped: the OUTER part (R # 24 / 8) goes to Packet,
// and the INNER part (R # 24 % 8) goes to Time.
//
// Slice = m![X, A / 2], Time = m![R # 24 % 8], Packet = m![R # 24 / 8 # 8]
// |Slice| = 64 * 4 = 256.
//
// The VCG's prefix-based valid count assumes R in Packet is the innermost index.
// When the outer part is in Packet instead, the valid count pattern no longer
// forms a simple decreasing sequence; it would need non-contiguous validity.
//
// Fix: put the inner part in Packet and the outer part in Time:
// Time = m![R # 24 / 8], Packet = m![R # 24 % 8]
fn reduce_wrong_packet_outer<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R # 24 % 8], m![R # 24 / 8 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 24 / 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// ✗ Outer part in Packet violates innermost requirement.
}
}
Time + Packet (R fills fewer than 8 positions)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 7, X = 64];
// When R fills fewer than 8 Packet positions, the remaining must be padding, not another axis.
// The valid count is capped at the size of R in Packet.
//
// Slice = m![X, A / 2], Time = m![R # 8 / 4], Packet = m![R # 8 % 4 # 8]
// Slice = 64 * 4 = 256.
//
// R = 7, padded to 8 = 2 * 4. R fills 4 Packet positions, padded to 8-way.
// Packet Reduce Mode: valid count capped at 4 (the size of R in Packet).
// t=0: vc = 4 (positions 0-3 valid, 4-7 are padding)
// t=1: vc = 3 (positions 0-2 valid)
// All slices see the same [4, 3] pattern.
//
// Supported: R solely occupies the prefix; positions 4-7 are padding, not another axis.
fn reduce_time_packet_partial<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R # 8 / 4], m![R # 8 % 4 # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 8 % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
}
}
Time + Packet (Mixed Packet Axes, Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 2, R = 19, X = 256];
// NOT supported. R must be the sole occupant of the Packet prefix.
// If another axis shares the Packet, the prefix-based count marks that axis's data as padding.
//
// Slice = m![X], Time = m![R # 24 / 4], Packet = m![R # 24 % 4, A # 8]
// Slice = 256.
//
// R fills positions 0-3, A fills positions 4-5 (padded to 8).
// Packet Reduce Mode: valid count applies to the whole packet as a prefix.
// vc = 3 means "positions 0-2 valid", but A's real data at positions 4-5
// is ALWAYS treated as invalid, regardless of A's actual size.
// The reduce result silently loses A's contributions.
//
// Fix: put A outside Packet, or pad R to fill all 8 positions:
// Time = m![R # 24 / 8], Packet = m![R # 24 % 8]
fn reduce_wrong_mixed_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X], m![R # 24 / 4], m![R # 24 % 4, A # 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X], m![1], m![A # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 24 % 4, A]>()
.vector_intra_slice_reduce::<R, m![1], m![A # 4]>(
IntraSliceReduceOpF32::Add,
)
// ✗ A's data at positions 4-5 silently excluded by prefix-based vc.
}
}
Time + Packet: Perfectly Aligned
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 8, R = 24, X = 64];
// When |R| is exactly divisible by the size of R in Packet,
// every packet is full and the VCG is not needed.
//
// Slice = m![X, A / 2], Time = m![R / 8], Packet = m![R % 8]
// Slice = 64 * 4 = 256.
//
// R = 24, no padding needed! 24 = 3 * 8.
// Every element is real data. All vc = 8.
fn reduce_time_packet_aligned<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![R / 8], m![R % 8], f32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, f32, m![1], m![1], m![X, A / 2], m![1], m![1 # 4], f32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpF32::Add,
)
// Output: Slice = m![X, A / 2], Time = m![1], Packet = m![1 # 4]
}
}
Slice + Packet (Not Supported)
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 2045];
// NOT supported. The VCG produces the same valid_count for all slices at a given time step.
// When R spans Slice and Packet, different slices need different counts.
//
// Slice = m![R # 2048 / 8], Time = m![1], Packet = m![R # 2048 % 8]
// Slice = 2048 / 8 = 256.
//
// R = 2045, padded to 2048 = 256 (Slice) * 8 (Packet).
// Packet Reduce Mode:
// Slices 0-254 need vc = 8 (full). Slice 255 needs vc = 5 (2045 mod 8 = 5).
// But vc is the same for all slices at the same time step.
// The VCG cannot produce vc = 8 for some slices and vc = 5 for others.
//
// Fix: add R to Time so R spans Slice + Time instead:
// Slice = m![R # 2048 / 8], Time = m![R # 2048 % 8], Packet = m![1 # 8]
// Another possible fix: if R's size were 2048, no padding would be introduced, so no VCG is needed at all:
// Slice = m![R / 8], Time = m![1], Packet = m![R % 8]
fn reduce_wrong_slice_packet<'l, const T: Tu>(
input: VectorBranchTensor<'l, T, i32, m![1], m![1], m![R # 2048 / 8], m![1], m![R # 2048 % 8], i32, NoTensor, { stage::VeOrder::IntraFirst }>,
) -> VectorIntraSliceReduceTensor<'l, T, i32, m![1], m![1], m![R # 2048 / 8], m![1], m![1 # 4], i32, NoTensor, { stage::VeOrder::IntraFirst }>
{
input
.vector_trim_way4::<m![R # 2048 % 4]>()
.vector_intra_slice_reduce::<R, m![1], m![1 # 4]>(
IntraSliceReduceOpI32::AddSat,
)
// ✗ Slice-varying vc needed. VCG cannot express this pattern.
}
}
Valid Count Generator’s Implementation
This document describes what the Valid Count Generator (VCG) hardware can express, independent of mapping expressions or tensor shapes. VCG tags are consumed by the Intra-Slice Reduce stage to exclude padding from reductions. For how mapping expressions control VCG behavior (supported placements, constraints, and examples), see Valid Count Generator’s Interface.
Data Model
Data flows into the VectorEngine as a stream of flits (packets). Each flit contains 8 elements. The VCG operates at the VectorEngine’s input, tagging each 8-way flit with a valid count. The 4-way halving and its valid count derivation are described in Downstream: 4-Way Operations.
A flit is identified by two coordinates. A slice corresponds to the Slice dimension in the mapping; a time step indexes sequential flits within a slice.
| Coordinate | Range | Meaning |
|---|---|---|
| s (slice number) | [0, num_slices) | Which slice processes this flit |
| t (time step) | [0, num_flits) | Sequential position within a slice |
The VCG assigns a valid count (abbreviated vc in formulas and diagrams) to each flit:
$$\text{vc}(s, t) \in \{0, 1, \ldots, 8\}$$
Element p (where p is in [0, 8)) within flit (s, t) is valid if and only if p < vc(s, t).
Valid elements always form a contiguous prefix; this is a fundamental hardware constraint.
The VCG cannot express “elements 0, 1, 3 are valid but 2 is not.”
Valid Count Formula
The VCG computes vc(s, t) through a pipeline of stages:
$$t \overset{\text{Sequencer}}{\longrightarrow} (c_0, c_1, \ldots, c_{k-1}) \overset{\text{Original Dims}}{\longrightarrow} \text{idx}(t) \overset{\text{Validity}}{\longrightarrow} \text{vc}(s, t)$$
- Sequencer: The flat time index t is decomposed into counter values \((c_0, c_1, \ldots, c_{k-1})\) via mixed-radix decomposition.
- Original Dimensions: Each counter is assigned to one of 4 dimensions (packet dim or gate dim 0-2). Per-dimension indices are computed as \(\text{idx}_d(t) = \sum_{i} c_i \cdot \sigma_i\), where \(\sigma_i\) is the counter’s stride.
- Validity Decision:
  - Packet Dim: produces a packet-level valid count \(\text{packet_vc}(t) = \min(\text{stride}_p, \max(0, V_p - \text{idx}_p(t)))\).
  - Gate Dims: each produces a binary gate \(\text{gate}_d(s, t) \in \{0, 1\}\) based on slice classification (below/boundary/above a threshold) and the per-dim index.
The final valid count combines these components:
$$\text{vc}(s, t) = \text{packet_vc}(t) \times \text{gate}_0(s, t) \times \text{gate}_1(s, t) \times \text{gate}_2(s, t)$$
vc(s,t) = packet_vc(t) × gate_0(s,t) × gate_1(s,t) × gate_2(s,t)
─────────── ─────────── ─────────── ───────────
packet dim gate dim 0 gate dim 1 gate dim 2
(count 0-8) (gate 0/1) (gate 0/1) (gate 0/1)
- If all gates are open (= 1): the flit gets packet_vc(t) valid elements.
- If any gate is closed (= 0): vc = 0 (the entire flit is invalid, regardless of packet dim’s count).
VCG Configuration
Configuration is organized around two concepts: counters (which drive the sequencer) and original dimensions (which decide validity). Counters produce a flit sequence; each dim uses its assigned counters to compute an index and decide validity.
The VCG is configured via the following parameters (each is explained in detail in subsequent sections):
| Field | Scope | Description |
|---|---|---|
| Counter limits \(L_0 \ldots L_7\) | per counter (up to 8) | Sequencer counter limits |
| Original dim assignment | per counter | Which dim (packet / gate 0-2) each counter belongs to |
| stride \(\sigma_i\) | per counter | stride for index computation |
| \(\text{mask}_{gd}\) | per gate dim | Slice-id bitmask |
| \(\text{match}_{gd}\) | per gate dim | Threshold for slice classification |
| \(V_p\) / \(V_{gd}\) | packet dim / per gate dim | Valid count / threshold |
| \(P_{gd}\) | per gate dim | Standard (0) vs transposed (1) |
Unassigned counters and disabled gate dims (\(\text{mask}_{gd} = 0, \text{match}_{gd} = 1\)) effectively pass through as “all valid.”
Sequencer
The sequencer interprets the flat time index t as a multi-dimensional counter.
Counter Structure
Up to 8 nested counters iterate to produce the flit sequence:
$$t \to (c_0, c_1, \ldots, c_{k-1})$$
where \(c_0\) is the fastest (innermost) and \(c_{k-1}\) is the slowest (outermost).
Each counter \(c_i\) has a limit \(L_i\), cycling through \(0, 1, \ldots, L_i - 1\), and a stride \(\sigma_i\) that scales the counter’s contribution to the dimension index (see Original Dimensions). The total number of flits per slice is \(L_0 \times L_1 \times \cdots \times L_{k-1}\).
Example: 3 counters with limits [3, 2, 2]
This produces 3 * 2 * 2 = 12 flits per slice. The counters cycle as:
t=0: (c_0=0, c_1=0, c_2=0)
t=1: (c_0=1, c_1=0, c_2=0)
t=2: (c_0=2, c_1=0, c_2=0)
t=3: (c_0=0, c_1=1, c_2=0) <- c_0 wraps, c_1 increments
t=4: (c_0=1, c_1=1, c_2=0)
t=5: (c_0=2, c_1=1, c_2=0)
t=6: (c_0=0, c_1=0, c_2=1) <- c_1 wraps, c_2 increments
t=7: (c_0=1, c_1=0, c_2=1)
t=8: (c_0=2, c_1=0, c_2=1)
t=9: (c_0=0, c_1=1, c_2=1)
t=10: (c_0=1, c_1=1, c_2=1)
t=11: (c_0=2, c_1=1, c_2=1)
c_0 changes every flit, c_1 every 3 flits, c_2 every 6 flits, just like digits in a mixed-radix number.
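The decomposition is ordinary mixed-radix arithmetic. The following standalone sketch (plain Rust, not Virtual ISA code; the function and variable names are illustrative) reproduces the trace above for limits [3, 2, 2]:
// Decompose a flat time index t into counter values (c_0 fastest, c_{k-1} slowest),
// given per-counter limits. Plain-Rust illustration of the sequencer, not device code.
fn decompose(mut t: usize, limits: &[usize]) -> Vec<usize> {
    limits
        .iter()
        .map(|&limit| {
            let c = t % limit;
            t /= limit;
            c
        })
        .collect()
}

fn main() {
    let limits = [3, 2, 2];
    let total: usize = limits.iter().product();
    for t in 0..total {
        // e.g. t=3 -> [0, 1, 0], t=6 -> [0, 0, 1], matching the trace above.
        println!("t={:2}: {:?}", t, decompose(t, &limits));
    }
}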
The sequencer produces counter values; the next step is mapping them to original dimensions and then to the validity decision.
Original Dimensions
Each counter is assigned to one of 4 original dimensions (packet dim or gate dim 0-2), or left unassigned.
Let \(D_d\) be the set of counters assigned to original dimension d.
Each counter contributes to its assigned dimension’s index by multiplying its current value by its stride.
The sum of all contributions gives the current position within that dimension’s data:
$$\text{idx}_d(t) = \sum _ {i \in D_d} c_i(t) \cdot \sigma_i$$
This index tracks the position within that dimension’s original data range. Multiple counters can be assigned to the same dim; their contributions are simply summed.
Example: Counters mapped to original dimensions
Suppose 3 counters are configured as follows:
| Counter | Limit | stride | Assigned to |
|---|---|---|---|
| c_0 | 3 | 8 | packet dim (W axis) |
| c_1 | 2 | 1 | gate dim 0 (C axis) |
| c_2 | 2 | 1 | gate dim 1 (H axis) |
At time step t=4, which gives (c_0=1, c_1=1, c_2=0):
- idx_p = 1 * 8 = 8, position 8 along W
- idx_g0 = 1 * 1 = 1, position 1 along C
- idx_g1 = 0 * 1 = 0, position 0 along H
Each dim uses its index independently to decide validity.
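As a minimal plain-Rust illustration of the same computation (the strides and dim assignments are those in the table above; the counter values are the ones the sequencer yields at t = 4):
fn main() {
    // c_0 -> packet dim (stride 8), c_1 -> gate dim 0 (stride 1), c_2 -> gate dim 1 (stride 1).
    // At t = 4 the sequencer yields (c_0, c_1, c_2) = (1, 1, 0).
    let (c0, c1, c2) = (1u32, 1, 0);

    // idx_d(t) = sum of c_i * sigma_i over the counters assigned to dim d.
    // Here each dim has exactly one counter, so each sum has a single term.
    let idx_p = c0 * 8;  // position 8 along W
    let idx_g0 = c1 * 1; // position 1 along C
    let idx_g1 = c2 * 1; // position 0 along H
    assert_eq!((idx_p, idx_g0, idx_g1), (8, 1, 0));
    println!("idx_p={idx_p}, idx_g0={idx_g0}, idx_g1={idx_g1}");
}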
Validity Decision
The following diagram shows the complete pipeline from time index to final valid count. Each stage is explained in the subsections below.
t (flat time index)
│
├─ mixed-radix decomposition (see Sequencer above)
▼
(c_0, c_1, ..., c_{k-1}) ← counter values
│
├─ each counter assigned to a dim, multiplied by stride σ_i
│ (see Original Dimensions above)
▼
idx_p(t), idx_g0(t), idx_g1(t), idx_g2(t) ← per-dim indices
│
├─ packet dim: packet_vc = min(stride_p, max(0, V_p - idx_p))
├─ gate dim 0: gate_0 = f(masked_id(s), idx_g0, match_g0, V_g0)
├─ gate dim 1: gate_1 = f(masked_id(s), idx_g1, match_g1, V_g1)
├─ gate dim 2: gate_2 = f(masked_id(s), idx_g2, match_g2, V_g2)
│
▼
vc(s,t) = packet_vc(t) × gate_0(s,t) × gate_1(s,t) × gate_2(s,t)
Packet dim and gate dims make qualitatively different judgments:
- Packet dim answers: “how many elements in this flit are valid?” (a count, 0-8)
- Gate dims each answer: “is this flit valid at all?” (a binary gate, yes or no)
Gate dims act as gates: only when all three report “valid” does packet dim’s count take effect. If any gate reports “invalid”, the entire flit gets valid count = 0.
Packet Dim: Packet-Level Valid Count
Packet dim determines how many elements within a flit are valid. Two parameters control the computation:
- \(V_p\): the original valid count for packet dim (the unpadded size of the data along this dimension).
- \(\text{stride}_p\): the stride of the innermost counter assigned to packet dim, representing how many flit elements belong to the axis tracked by packet dim.
The per-packet valid count is:
$$\text{packet_vc}(t) = \min(\text{stride}_p, \max(0, V_p - \text{idx}_p(t)))$$
packet_vc(t) = min( stride_p, max(0, V_p - idx_p(t) ))
───────── ─────────────────
HW width cap remaining valid data
When the axis fills all 8 flit positions, \(\text{stride}_p = 8\) and the formula is equivalent to \(\min(8, \ldots)\). When the axis occupies only \(k < 8\) positions (with the remaining positions padded), \(\text{stride}_p = k\) caps the valid count so that only the axis’s portion of the flit is counted as valid.
Hardware constraints:
- The innermost Packet counter must always be assigned to packet dim.
- Other counters may also be assigned to packet dim (e.g., a Time counter for the same axis).
- When no axis is assigned to packet dim, packet_vc is always 8 (full flit) or 0 (empty flit), effectively making packet dim a binary gate like gate dims.
As the sequencer advances, \(\text{idx}_p\) increases and packet_vc decreases.
This produces a repeating sawtooth pattern:
Example 1: V_p = 19, stride_p = 8, counter stride=8, limit=3 (axis fills full 8-way)
flit 0: idx_p = 0 -> packet_vc = min(8, 19 - 0) = 8 (full)
flit 1: idx_p = 8 -> packet_vc = min(8, 19 - 8) = 8 (full)
flit 2: idx_p = 16 -> packet_vc = min(8, 19 - 16) = 3 (partial)
Example 2: V_p = 11, stride_p = 4, counter stride=4, limit=3 (axis fills 4 of 8 positions)
flit 0: idx_p = 0 -> packet_vc = min(4, 11 - 0) = 4 (full within stride)
flit 1: idx_p = 4 -> packet_vc = min(4, 11 - 4) = 4 (full within stride)
flit 2: idx_p = 8 -> packet_vc = min(4, 11 - 8) = 3 (partial)
In Example 2, positions 4-7 in each flit are padding and automatically excluded by the \(\text{stride}_p = 4\) cap.
Key property: packet_vc depends only on the sequencer state t, not on the slice s.
All slices receive the same packet valid count for the same time step.
Example: Why packet_vc is slice-independent
If \(V_p = 19\), \(\text{stride}_p = 8\), and the packet counter cycles [0, 8, 16], then:
- At t=2 (idx_p=16): packet_vc = 3 for every slice.
- Slice 0 gets vc=3, slice 5 gets vc=3, slice 15 gets vc=3, all the same.
This is because packet dim’s formula \(\min(\text{stride}_p, V_p - \text{idx}_p)\) has no s term.
Gate dims can still make certain slices’ final vc = 0 (by reporting invalid),
but they cannot change the packet_vc value itself.
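A small plain-Rust check of the formula against Examples 1 and 2 above (standalone sketch, not Virtual ISA code):
// packet_vc(t) = min(stride_p, max(0, V_p - idx_p(t))). Note it has no slice term.
fn packet_vc(stride_p: i64, v_p: i64, idx_p: i64) -> i64 {
    stride_p.min((v_p - idx_p).max(0))
}

fn main() {
    // Example 1: V_p = 19, stride_p = 8, counter stride = 8, limit = 3.
    let vcs1: Vec<i64> = (0..3).map(|c| packet_vc(8, 19, c * 8)).collect();
    assert_eq!(vcs1, vec![8, 8, 3]);

    // Example 2: V_p = 11, stride_p = 4, counter stride = 4, limit = 3.
    let vcs2: Vec<i64> = (0..3).map(|c| packet_vc(4, 11, c * 4)).collect();
    assert_eq!(vcs2, vec![4, 4, 3]);

    println!("example 1: {vcs1:?}, example 2: {vcs2:?}");
}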
Gate Dims: Per-Flit Binary Validity
Gate dims 0, 1, 2 decide whether a flit as a whole is valid (1) or invalid (0), not a count.
Each gate dim classifies slices into groups by extracting a subset of the slice-id bits (via a bitmask) and comparing the result against a threshold. The bitmask \(\text{mask}_{gd}\) selects which bits of the slice-id this gate dim tracks:
$$\text{masked_id} (s) = s \mathbin{\&} \text{mask}_{gd}$$
Example: 16 slices (4-bit slice_id), mask_g0 = 0b1100
slice_id (4 bits): [ b3 b2 b1 b0 ]
mask_g0 = 0b1100: [ 1 1 0 0 ]
─────────────────
masked_id: [ b3 b2 0 0 ] → extracts the upper 2 bits
Slices fall into three groups based on comparing \(\text{masked_id}\) with \(\text{match}_{gd}\):
| Group | Condition | Meaning |
|---|---|---|
| Below | \(\text{masked_id}(s) < \text{match}_{gd}\) | All time steps valid |
| Boundary | \(\text{masked_id}(s) = \text{match}_{gd}\) | Valid when \(\text{idx}_{gd}(t) < V_{gd}\) |
| Above | \(\text{masked_id}(s) > \text{match}_{gd}\) | Depends on mode (see below) |
The \(P_{gd}\) flag selects between two modes that differ only in the “above” group:
Standard mode (\(P_{gd} = 0\))
$$\text{gate}_d(s, t) = \begin{cases} 1 & \text{masked_id}(s) < \text{match}_{gd} \\ [\text{idx}_{gd}(t) < V_{gd}] & \text{masked_id}(s) = \text{match}_{gd} \\ 0 & \text{masked_id}(s) > \text{match}_{gd} \end{cases}$$
Above-threshold slices are entirely invalid. This is the common case: the Slice factor is laid out in ascending order, so slices beyond the boundary contain no valid data.
Transposed mode (\(P_{gd} = 1\))
$$\text{gate}_d(s, t) = \begin{cases} 1 & \text{masked_id}(s) < \text{match}_{gd} \\ [\text{idx}_{gd}(t) < V_{gd}] & \text{masked_id}(s) \ge \text{match}_{gd} \end{cases}$$
Above-threshold slices get the same \(V_{gd}\) check as the boundary: they are not entirely invalid. This handles the transposed case where the slice ID encodes the inner index: slices beyond the boundary still contain valid data at early time steps (the outer index is small enough), and only run out of valid data at the same point as the boundary slice.
To disable a gate dim (make it always valid), set \(\text{mask}_{gd} = 0, \text{match}_{gd} = 1\). Then \(\text{masked_id} = 0 < 1\) for all slices, so every slice is in the “below” group.
Example: Standard mode, H=5 split into Ho=4 (slice) × Hi=2 (time)
H=5 is split into Ho × Hi = 4 × 2 (padded from 5 to 8).
Ho is the Slice factor (encoded in slice-id bits), Hi is the Time factor (sequencer counter).
Axis index = Ho * 2 + Hi. Valid when index < 5.
Gate dim 0 config: mask=0b1100 (extracts 2 bits for Ho), match=2, V_g0=1, standard mode.
16 slices, where masked_id = (slice_id & 0b1100) >> 2 gives Ho:
| Ho | masked_id | Group | Hi=0 | Hi=1 |
|---|---|---|---|---|
| 0 | 0 | below (< 2) | valid | valid |
| 1 | 1 | below (< 2) | valid | valid |
| 2 | 2 | boundary (= 2) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 3 | 3 | above (> 2) | invalid | invalid |
Ho=0,1: both time steps valid (index 0-3, all < 5). Ho=2: only first time step (index 4 < 5), second invalid (index 5 >= 5). Ho=3: fully invalid (index 6, 7 >= 5).
Example: Transposed mode, H=5 split into Ho=4 (slice, inner) × Hi=2 (time, outer)
H=5 is split into Ho × Hi = 4 × 2 (padded from 5 to 8), but transposed: Ho is the inner factor, Hi is the outer factor.
Axis index = Hi * 4 + Ho. Valid when index < 5.
Gate dim 0 config: match=1 (= 5 mod 4), V_g0=1 (= floor(5/4)), transposed mode.
| Ho | masked_id | Group | Hi=0 | Hi=1 |
|---|---|---|---|---|
| 0 | 0 | below (< 1) | valid | valid |
| 1 | 1 | boundary (= 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 2 | 2 | above (> 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 3 | 3 | above (> 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
Verify: Ho=0, Hi=0: 0 < 5, Hi=1: 4 < 5, so 2 steps. Ho=1, Hi=0: 1 < 5, Hi=1: 5 >= 5, so 1 step. Ho=2, Hi=0: 2 < 5, Hi=1: 6 >= 5, so 1 step. Ho=3, Hi=0: 3 < 5, Hi=1: 7 >= 5, so 1 step.
Key difference from standard: the “above” group (Ho=2,3) still gets V_g0=1 valid time steps, not zero.
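The two modes differ only in the “above” branch. The following plain-Rust sketch of the gate decision reproduces both H=5 tables above (standalone illustration, not Virtual ISA code; masked_id is shown already shifted down to Ho for readability):
// gate_d(s, t): binary validity from slice classification and the per-dim index.
fn gate(masked_id: u32, match_gd: u32, idx: u32, v_gd: u32, transposed: bool) -> bool {
    if masked_id < match_gd {
        true               // "below": all time steps valid
    } else if masked_id == match_gd {
        idx < v_gd         // "boundary": valid while idx < V
    } else if transposed {
        idx < v_gd         // "above", transposed: same check as the boundary
    } else {
        false              // "above", standard: entirely invalid
    }
}

fn main() {
    // Standard: H=5 = Ho(4, slice) x Hi(2, time); match=2, V=1.
    for ho in 0..4u32 {
        let row: Vec<bool> = (0..2).map(|hi| gate(ho, 2, hi, 1, false)).collect();
        println!("standard   Ho={ho}: {row:?}");
    }
    // Transposed: H=5 = Ho(4, slice, inner) x Hi(2, time, outer); match=1, V=1.
    for ho in 0..4u32 {
        let row: Vec<bool> = (0..2).map(|hi| gate(ho, 1, hi, 1, true)).collect();
        println!("transposed Ho={ho}: {row:?}");
    }
}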
Example: Full VCG computation for [H=5, C=5, W=19], step-by-step build-up
This example builds up from one axis to three, so each dimension’s contribution is clear.
Original shape [H, C, W] = [5, 5, 19].
Each axis is split into a slice part (slice_id) and a time part (sequencer):
H = 5 -> Ho(slice) * Hi(time) = 4 * 2 (padded from 5 to 8)
C = 5 -> Co(slice) * Ci(time) = 4 * 2 (padded from 5 to 8)
W = 19 -> Wi(packet) = 3 * 8 (padded from 19 to 24)
Step 1: W=19 only (packet dim, no gates)
Ignore H and C for now. Disable gate dims 0 and 1. Every slice processes 3 flits (Wi limit=3), and packet dim produces the sawtooth:
packet_vc: 8, 8, 3
^ ^
full 19 - 16 = 3 (partial)
Since there are no gates, every slice gets this exact same pattern:
All slices, all flits:
flit 0: ████████ (vc=8)
flit 1: ████████ (vc=8)
flit 2: ███ (vc=3)
Step 2: Add C=5 (packet dim + gate dim 0)
Now enable the C-axis gate (gate dim 0).
C=5 is split into Co(slice, 4 values) * Ci(time, limit 2).
The C-gate uses: mask=0b0011 (extracts Co from slice_id), match=2, V_g0=1, standard mode.
Each slice now runs 6 flits: Ci (limit 2) * Wi (limit 3). The C-gate classifies slices by their Co value:
| Co | Group | Effect |
|---|---|---|
| 0 | below (< 2) | gate open: all 6 flits get packet dim’s pattern |
| 1 | below (< 2) | gate open: same |
| 2 | boundary (= 2) | gate open for Ci=0, closed for Ci=1 |
| 3 | above (> 2) | gate closed: all 6 flits get vc=0 |
Result per slice (6 flits = 2 Ci groups * 3 Wi flits):
Co=0: [8,8,3, 8,8,3] <- both Ci steps valid
Co=1: [8,8,3, 8,8,3] <- same
Co=2: [8,8,3, 0,0,0] <- Ci=0 valid, Ci=1 gated off
Co=3: [0,0,0, 0,0,0] <- entirely gated off
Notice the gate’s effect: some slices go entirely to zero, and the boundary slice loses its second half.
But within the valid flits, the [8,8,3] pattern from packet dim is unchanged.
Step 3: Add H=5 (full 3-axis, packet dim + gate dim 0 + gate dim 1)
Now enable the H-axis gate (gate dim 1).
H=5 is split into Ho(slice, 4 values) * Hi(time, limit 2).
The H-gate uses: mask=0b1100 (extracts Ho from slice_id), match=0b1000, V_g1=1, standard mode.
Slice ID encodes both slice factors: slice_id = Ho * 4 + Co, giving 16 slices.
Each slice now runs 12 flits: Hi (limit 2) * Ci (limit 2) * Wi (limit 3).
| Dim | Axis | What it tracks | VCG config |
|---|---|---|---|
| packet | W=19 | element count in packet | V_p=19, stride_p=8, counter stride=8, limit=3 |
| gate 0 | C=5 | gate: is Co within valid range? | mask=0b0011, match=2, V_g0=1, standard |
| gate 1 | H=5 | gate: is Ho within valid range? | mask=0b1100, match=0b1000, V_g1=1, standard |
The H-gate classifies slices by Ho, same logic as C-gate by Co:
| Ho | Group | Effect |
|---|---|---|
| 0 | below | H-gate open |
| 1 | below | H-gate open |
| 2 | boundary | H-gate open for Hi=0, closed for Hi=1 |
| 3 | above | H-gate closed |
The final vc for each flit is packet_vc(t) * C_gate(s,t) * H_gate(s,t).
Both gates must be open for packet dim’s count to survive.
The complete heatmap (16 slices * 12 flits). Columns are slices grouped by Ho; rows are flits grouped by (Hi, Ci). Right-side annotations show which gates are active for each row:
Ho=0 |Ho=1 |Ho=2 |Ho=3
Co: 0 1 2 3 | 0 1 2 3| 0 1 2 3| 0 1 2 3
H-gate: v v v v | v v v v| > > > >| x x x x
C-gate: v v > x | v v > x| v v > x| v v > x
--------------------------------------------------------------------------------
t= 0 Hi=0,Ci=0 W 8 8 8 0 | 8 8 8 0| 8 8 8 0| 0 0 0 0 H:v C:v
t= 1 | 8 8 8 0 | 8 8 8 0| 8 8 8 0| 0 0 0 0
t= 2 | 3 3 3 0 | 3 3 3 0| 3 3 3 0| 0 0 0 0
| | |
t= 3 Hi=0,Ci=1 W 8 8 0 0 | 8 8 0 0| 8 8 0 0| 0 0 0 0 H:v C:>
t= 4 | 8 8 0 0 | 8 8 0 0| 8 8 0 0| 0 0 0 0
t= 5 | 3 3 0 0 | 3 3 0 0| 3 3 0 0| 0 0 0 0
| | |
t= 6 Hi=1,Ci=0 W 8 8 8 0 | 8 8 8 0| 0 0 0 0| 0 0 0 0 H:> C:v
t= 7 | 8 8 8 0 | 8 8 8 0| 0 0 0 0| 0 0 0 0
t= 8 | 3 3 3 0 | 3 3 3 0| 0 0 0 0| 0 0 0 0
| | |
t= 9 Hi=1,Ci=1 W 8 8 0 0 | 8 8 0 0| 0 0 0 0| 0 0 0 0 H:> C:>
t=10 | 8 8 0 0 | 8 8 0 0| 0 0 0 0| 0 0 0 0
t=11 | 3 3 0 0 | 3 3 0 0| 0 0 0 0| 0 0 0 0
v = open (below threshold) > = boundary (partial) x = closed (above)
Reading the patterns:
- Ho=3 columns (rightmost 4): all 0. H-gate x (above threshold, always closed).
- Co=3 columns (every 4th): all 0. C-gate x.
- Co=2 columns (H:v C:>): C-gate is boundary; only rows with Ci=0 pass. Compare Co=1 vs Co=2 to see the gate’s effect.
- Ho=2 columns (H:> C:v): H-gate is boundary; only rows with Hi=0 pass. Compare Ho=1 vs Ho=2.
- Ho=2 * Co=2 (both >): only (Hi=0, Ci=0) rows pass, the intersection of both boundaries.
- Within valid cells: the [8, 8, 3] sawtooth from packet dim always appears, the same regardless of slice.
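To tie the stages together, here is a standalone plain-Rust simulation of this configuration (not Virtual ISA code; the gate and counter settings are the ones in the table above). It recomputes the 16 × 12 heatmap and spot-checks a few cells against the diagram:
fn main() {
    // Counters (innermost first): Wi (limit 3, stride 8) -> packet dim,
    // Ci (limit 2, stride 1) -> gate dim 0, Hi (limit 2, stride 1) -> gate dim 1.
    // slice_id = Ho * 4 + Co (16 slices).
    // Gate 0 (C): mask=0b0011, match=2, V=1, standard.
    // Gate 1 (H): mask=0b1100, match=0b1000 (i.e. Ho=2), V=1, standard.
    let vc = |s: u32, t: u32| -> u32 {
        // Mixed-radix decomposition of t into (wi, ci, hi), wi fastest.
        let (wi, ci, hi) = (t % 3, (t / 3) % 2, t / 6);
        // Packet dim: packet_vc = min(stride_p=8, max(0, V_p=19 - idx_p)).
        let packet_vc = 8u32.min(19u32.saturating_sub(wi * 8));
        // Gate dims (standard): below -> open, boundary -> idx < V, above -> closed.
        let c_gate = (s & 0b0011) < 2 || ((s & 0b0011) == 2 && ci < 1);
        let h_gate = (s & 0b1100) < 0b1000 || ((s & 0b1100) == 0b1000 && hi < 1);
        if c_gate && h_gate { packet_vc } else { 0 }
    };

    // Print the 16-slice x 12-flit heatmap (columns are slices, Co fastest).
    for t in 0..12u32 {
        let row: Vec<u32> = (0..16u32).map(|s| vc(s, t)).collect();
        println!("t={t:2}: {row:?}");
    }

    // Spot checks against the diagram above.
    assert_eq!(vc(0, 2), 3);   // Ho=0, Co=0, t=2: packet-dim sawtooth tail
    assert_eq!(vc(2, 0), 8);   // Co=2 boundary, Ci=0 row: C-gate still open
    assert_eq!(vc(2, 3), 0);   // Co=2 boundary, Ci=1 row: C-gate closed
    assert_eq!(vc(8, 6), 0);   // Ho=2 boundary, Hi=1 row: H-gate closed
    assert_eq!(vc(15, 0), 0);  // Ho=3, Co=3: both gates closed
}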
What Patterns Are Expressible
The valid count formula is a product of four independent terms.
This multiplicative structure determines which vc(s, t) functions the hardware can produce, and which it cannot.
Why Limitations Arise
Each limitation traces back to a specific part of the formula.
Packet dim cannot see slice-id.
The packet dim formula packet_vc(t) = min(stride_p, max(0, V_p − idx_p(t))) depends only on t.
If two slices need different partial counts at the same time step, packet dim cannot produce both:
Suppose we need: vc(s=0, t=0) = 8, vc(s=1, t=0) = 3
───────────────
packet dim would need to output
both 8 and 3 at t=0, impossible
Each gate classifies slices by a single threshold after masking.
Gate dims first apply a bitmask to the slice-id (masked_id = slice_id & mask), then compare against one value match.
This produces three contiguous groups: below, boundary, above.
A gate cannot express “slices 0, 3, 7 are valid but 1, 2 are not”; it can represent only contiguous ranges of masked_id.
The mask selects which bits of the slice-id to inspect, allowing one gate to track a specific axis even when the slice-id encodes multiple axes.
At most 4 independent checks. One packet count (packet dim) + three binary gates (gate dims) = 4 orthogonal dimensions total.
Single-Axis Scenarios
A padded axis (original size n, padded to n' > n) occupies some combination of Packet (packet dim), Time (sequencer), and Slice (gate dims via slice-id bits).
Single-position: Packet / Time / Slice
When an axis occupies only one position, validity tracking is straightforward:
- Packet only → packet dim handles the sawtooth (see Packet Dim examples).
- Time only → gate dim with mask=0, match=0: all slices are boundary, binary validity by time step.
- Slice only → gate dim with appropriate mask/match: all time steps within a valid slice pass, invalid slices are fully gated.
All three are always supported.
Slice + Time
One axis split between slice-id bits and sequencer counters. This is the VCG’s most important use case, and it is how gate dims are typically used.
Standard (slice outer, time inner): axis index = Ho × time_count + Hi.
Example: H=14, Ho=8 (slice) × Hi=3 (time), standard mode
Gate dim config: match = ⌊14/3⌋ = 4, V = 14 mod 3 = 2.
Ho masked_id group Hi=0 Hi=1 Hi=2
── ───────── ───────── ────────── ────────── ──────────
0   0          below      always ✅    always ✅    always ✅
1 1 below always ✅ always ✅ always ✅
2 2 below always ✅ always ✅ always ✅
3 3 below always ✅ always ✅ always ✅
4 4 boundary idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
5 5 above ❌ ❌ ❌
6 6 above ❌ ❌ ❌
7 7 above ❌ ❌ ❌
Why “below” is genuinely all-valid: Ho × 3 + Hi < 4 × 3 = 12 ≤ 14 for all Hi ∈ [0,3).
Standard mode tolerates over-allocated time_count; the “below” interpretation remains correct.
Transposed (time outer, slice inner): axis index = Hi × slice_count + Ho.
Example: H=19, Ho=8 (slice, inner) × Hi=3 (time, outer), transposed mode
Gate dim config: match = 19 mod 8 = 3, V = ⌊19/8⌋ = 2, transposed mode.
Ho masked_id group Hi=0 Hi=1 Hi=2
── ───────── ───────── ────────── ────────── ──────────
0 0 below always ✅ always ✅ always ✅
1 1 below always ✅ always ✅ always ✅
2 2 below always ✅ always ✅ always ✅
3 3 boundary idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
4 4 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
5 5 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
6 6 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
7 7 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
Verify against real data (axis index = Hi × 8 + Ho, valid when < 19):
- Ho=0, Hi=2: 2×8 + 0 = 16 < 19 ✅; “below” gives all-valid = 3 steps, need V+1 = 3 steps ✅
- Ho=3, Hi=2: 2×8 + 3 = 19 ≥ 19 ❌; boundary gives 2 steps ✅
- Ho=7, Hi=1: 1×8 + 7 = 15 < 19 ✅; “above” gives V=2 steps, actual need is 2 steps ✅
Constraint: time_count must equal ⌈n / slice_count⌉ (= V + 1).
The “below” group gets time_count valid steps from the HW “all-valid” interpretation.
If time_count > V + 1, the “below” group receives more valid steps than the data actually has.
Packet + Time
Both packet and time factors assigned to packet dim, with multiple counters contributing to idx_p(t) (see Original Dimensions).
Example: n=50, two counters on packet dim, contiguous (`stride_outer = 8 × 3 = 24`)
c_inner (limit=3, stride=8): packet counter
c_outer (limit=3, stride=24): time counter (24 = 8 × 3 ✅ contiguous)
idx_p = c_outer × 24 + c_inner × 8
c_inner=0 c_inner=1 c_inner=2
───────── ───────── ─────────
c_outer=0 idx_p=0→8 idx_p=8→8 idx_p=16→8
c_outer=1 idx_p=24→8 idx_p=32→8 idx_p=40→8
c_outer=2 idx_p=48→2 idx_p=56→0 idx_p=64→0
↑
min(8, 50-48)=2
Packet dim handles both the within-flit and across-flit boundaries.
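A quick plain-Rust check of the table above (two counters contributing to idx_p; a standalone sketch, not Virtual ISA code):
fn main() {
    // Two counters assigned to packet dim: c_inner (limit 3, stride 8),
    // c_outer (limit 3, stride 24 = 8 * 3, contiguous). V_p = 50, stride_p = 8.
    for c_outer in 0..3u32 {
        let row: Vec<u32> = (0..3u32)
            .map(|c_inner| {
                let idx_p = c_outer * 24 + c_inner * 8;
                8u32.min(50u32.saturating_sub(idx_p)) // packet_vc
            })
            .collect();
        println!("c_outer={c_outer}: {row:?}"); // [8,8,8], [8,8,8], [2,0,0]
    }
}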
Slice + Packet: not supported
One axis split between slice (gate dim) and packet (packet dim). This directly violates the slice-independent packet count constraint (see Packet Dim: Key property).
packet_vc(t) depends only on t, but the boundary slice needs a different partial count than all-valid slices.
The gate can multiply by 0 or 1, so it can fully close a flit but cannot change the partial count.
Example: n=10, stride_p=8, slice_count=2
What we need:
Ho=0: elements 0-7, all valid → vc = 8
Ho=1: elements 8-15, first 2 valid → vc = 2
─
partial count, different from 8
Attempt 1: set `V_p = 10`:
packet_vc = min(8, 10-0) = 8 for ALL slices
Ho=0: vc = 8 × 1 = 8 ✅
Ho=1: vc = 8 × 1 = 8 ❌ (need 2, not 8)
vc = 8 × 0 = 0 ❌ (gate can close to 0, not to 2)
Attempt 2: set `V_p = 2`:
packet_vc = min(8, 2-0) = 2 for ALL slices
Ho=0: vc = 2 × 1 = 2 ❌ (need 8, not 2)
No single V_p works. Packet dim produces one value; the gate can only multiply by 0 or 1.
Note on degenerate cases:
When n % stride_p = 0, every packet is either fully valid or fully invalid, so packet dim produces no partial counts and a gate alone handles validity.
This is effectively Slice only, not a true Slice + Packet scenario.
Similarly, n <= stride_p means a single flit covers the entire axis, reducing to Packet only.
When the VCG cannot express the required pattern, Padding Strategy alternatives are available.
Slice + Time + Packet: not supported
The axis spans all three positions.
The Slice + Packet conflict carries over: the boundary slice still needs a different partial count than all-valid slices, and packet_vc(t) still cannot vary by slice.
The same degenerate exception applies: n % stride_p = 0 eliminates partial counts, reducing to Slice + Time (packet dim unused).
Multiple Axes
Each padded axis that needs validity tracking consumes one original dimension slot:
| Resource | Capacity | Notes |
|---|---|---|
| Packet Dim (packet count) | 1 slot | Innermost counter determines stride_p |
| Gate Dims (binary gates) | 3 slots | One gate per padded axis |
| Unpadded axes | free | No dim needed (mask=0, match=1) |
When the packet axis is fully aligned (n % stride_p = 0), packet_vc is constant and packet dim is effectively unused, so it can be repurposed as a gate for another axis.
Summary
A valid count function vc(s, t) is VCG-expressible only if:
- Prefix property: Valid elements form a contiguous prefix [0, vc) within each flit.
- Slice-independent packet count: packet_vc(t) must be the same across all slices at the same t. Slices can be gated to vc = 0, but cannot receive a different partial count.
- Monotonic slice ordering: Each gate dim classifies slices by a single threshold on masked_id.
- At most 4 orthogonal dimensions: 1 packet count + 3 binary gates.
| Placement | Dim | Supported? | Key constraint |
|---|---|---|---|
| Packet only | packet | ✅ | none |
| Time only | gate | ✅ | none |
| Slice only | gate | ✅ | none |
| Slice + Time (standard) | gate | ✅ | none |
| Slice + Time (transposed) | gate | ✅ | time_count = ⌈n / slice_count⌉ |
| Packet + Time | packet | ✅ | none |
| Slice + Packet | packet + gate | ❌ | packet_vc(t) cannot vary by slice |
| Slice + Time + Packet | packet + gate | ❌ | same as Slice + Packet |
For mapping-level code examples of each placement, see Examples. For unsupported cases, see Padding Strategy.
Downstream: 4-Way Operations
VCG assigns valid counts per 8-way flit, but the VectorArithmeticUnit can operate on 4-way halves.
| Operation | Input | Output | Valid Count Transformation |
|---|---|---|---|
| split_way4 | 8-way flit (vc = v) | two 4-way flits | vc_low = min(v, 4), vc_high = max(v - 4, 0) |
| trim_way4 | 8-way flit (vc = v) | one 4-way flit | vc = v (requires v <= 4) |
| concat_way8 | two 4-way flits | 8-way flit | vc = vc_low + vc_high |
| pad_way8 | 4-way flit | 8-way flit | vc unchanged |
The prefix property is preserved through split and concat.
For trim_way4, the constraint v <= 4 must be statically guaranteed by the mapping; if the upper 4 elements could be valid, trimming them would lose data.
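The valid-count bookkeeping for these operations is plain arithmetic; the sketch below models only the counts (the function names mirror the table, but this is a standalone plain-Rust illustration, not the Virtual ISA API):
// Valid-count transformations for 4-way operations (the prefix property is preserved).
fn split_way4(v: u32) -> (u32, u32) {
    (v.min(4), v.saturating_sub(4)) // (vc_low, vc_high)
}

fn trim_way4(v: u32) -> u32 {
    assert!(v <= 4, "trim_way4 requires vc <= 4, guaranteed statically by the mapping");
    v
}

fn concat_way8(vc_low: u32, vc_high: u32) -> u32 {
    vc_low + vc_high
}

fn main() {
    assert_eq!(split_way4(6), (4, 2));
    assert_eq!(trim_way4(3), 3);
    assert_eq!(concat_way8(4, 2), 6);
    println!("ok");
}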
Inter-Slice Block
The Inter-Slice Block performs inter-slice reduction, aggregating partial results across the 256 slices within a cluster.
It preserves Chip, Cluster, and Packet, and rewrites Slice and Time to SliceOut and TimeOut.
Interface
i32 Interface
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpI32,
) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
f32 Interface
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpF32,
) -> VectorInterSliceReduceTensor<'l, T, f32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
You can reach this block in two ways:
- Run inter-slice first: vector_init() -> vector_inter_slice_reduce::<SliceOut, TimeOut>(op)
- Run intra-slice first, then switch: call vector_inter_slice_reduce() directly on the current intra-slice tensor instead of calling vector_init() again.
In the IntraFirst path, vector_inter_slice_reduce() is available only from Way8 intra-slice stages that can transition to inter-slice reduction: Branch, Logic, Fxp, FxpToFp, Widen, FpToFxp, and Clip.
It is not available from Way4 stages such as Narrow, Fp, IntraSliceReduce, or FpDiv.
Quick Reference
| Current state | Method | Result |
|---|---|---|
| Fresh VE input after vector_init() | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Enters inter-slice reduction directly (InterFirst) |
| Eligible intra-slice tensor | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Transitions from intra-slice to inter-slice reduction (IntraFirst) |
| Tensor after vector_inter_slice_reduce() | vector_intra_slice_branch(BranchMode) | Switches to intra-slice work after inter-slice reduction |
Operations
Integer Operations (InterSliceReduceOpI32)
| Operation | Description |
|---|---|
| Add | Wrapping addition |
| AddSat | Saturating addition |
| Max | Maximum value |
| Min | Minimum value |
Floating-Point Operations (InterSliceReduceOpF32)
| Operation | Description |
|---|---|
| Add | Floating-point addition |
| Max | Maximum value |
| Min | Minimum value |
| Mul | Floating-point multiplication |
Output Mapping Rule
After inter-slice reduction removes a slice factor R, the output mapping typically follows one of three rules:
| Rule | Output mapping | Reference |
|---|---|---|
| Broadcast | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, X], TimeOut = m![C] | Broadcast Into a New Slice Axis |
| Dummy | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, 1 # n], TimeOut = m![C] | Dummy Replacement |
| Promotion | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, C], TimeOut = m![1] | Promotion from Time into SliceOut |
Chip, Cluster, and Packet pass through unchanged.
Only Slice and Time are rewritten into SliceOut and TimeOut.
Examples
Dummy Replacement
Replace the reduced slice factor with a dummy factor in SliceOut:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn inter_slice_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
R occupies part of the Slice dimension. After reduction, R is eliminated; its 4 slots in Slice are replaced by the dummy factor 1 # 4, while A / 8 passes through unchanged.
Broadcast Into a New Slice Axis
Introduce a new axis in SliceOut, and broadcast the reduced value over that axis:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![W = 64, R = 4, X = 4, P = 8];
fn broadcast_into_x<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, X], m![1], m![P]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![W, X], m![1]>(InterSliceReduceOpF32::Add)
.vector_final()
}
}
Here, R is reduced away. X is a new axis that appears only in SliceOut,
so the reduced value is broadcast over the X positions in the output.
Promotion from Time into SliceOut
If Time already contains an axis that should occupy the freed slice space, promote that axis into SliceOut.
The promoted axis does not have to be the outermost axis in Time:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![W = 32, R = 4, T0 = 2, T2 = 4, T1 = 2, P = 8];
fn axis_promotion<'l, const T: Tu>(
input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![T0, T2, T1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, T2], m![T0, T1], m![P]> {
// Before: Slice = m![W, R], Time = m![T0, T2, T1], Packet = m![P]
// After: Slice = m![W, T2], Time = m![T0, T1], Packet = m![P]
// R is reduced away, and T2 is promoted from the middle of Time into Slice.
input
.vector_init()
.vector_inter_slice_reduce::<m![W, T2], m![T0, T1]>(InterSliceReduceOpF32::Add)
.vector_final()
}
}
Inter-Slice Reduce with AddSat, Then Intra-Slice
Reducing an i32 tensor across slices, then applying an elementwise add:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn reduce_then_add<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_final()
}
}
Intra-Slice Then Inter-Slice Reduce with AddSat
Applying an intra-slice operation first, then reducing the resulting i32 tensor across slices:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn add_then_reduce<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_fxp(FxpBinaryOp::AddFxp, 100)
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
Constraints
| Constraint | Detail |
|---|---|
| Data types | i32 and f32 only |
| Scope | Reduction happens within one 256-slice cluster |
| Packet mapping | Packet does not change across inter-slice reduction |
Performance
Inter-slice reduce is best understood as a ring-like global reduction across the participating slices. For documentation purposes, the most useful high-level estimate is:
| Quantity | Rough rule of thumb |
|---|---|
| First reduced output | on the order of one ring traversal for the reduction group |
| Total time | input streaming time + that ring-sized tail |
| Main tuning knob | reduction ratio, that is, how many slices participate in one inter-slice contraction group |
If you want a quick mental model, let r be the reduction ratio or route-group size:
- first output appears after roughly O(r) cycles
- larger r means more noticeable inter-slice tail latency
- if upstream already produces flits slowly, that upstream rate dominates and the inter-slice cost is partly hidden
This is intentionally a high-level approximation. The practical mental model is simple: stream partial results in, then pay about one ring traversal before the reduced result settles.
Interaction With Other Pipelines
- Contraction -> Inter-Slice: if contraction takes longer to produce partial sums, contraction can dominate and inter-slice may not be the bottleneck.
- Intra-Slice -> Inter-Slice: intra-slice work can reduce the number of packets that reach inter-slice, or simply take longer itself. In those cases, inter-slice is less visible because there is less data to reduce, or because the front half already dominates.
- Large ring / large reduction ratio: when many slices participate, inter-slice tail latency grows and can become the bottleneck.
- Small tensors: even when total data volume is small, the fixed ring-style tail can still matter because it is amortized over fewer packets.
For an end-to-end contraction example that includes inter-slice reduction, see Reducer.
Cast Engine
Storing full f32/i32 results in DM would waste memory; the Cast Engine narrows them back to application-specified types (e.g., bf16) before the Commit Engine writes to DM.
Interface
impl<'l, const T: Tu, D: VeScalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M> StreamCast<D>
for CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
type CastOutput<D2: Scalar, OutPacket: M>
= CastTensor<'l, T, D2, Chip, Cluster, Slice, Time, OutPacket>
where
D: Cast<D2>;
#[primitive(CollectTensor::cast)]
fn cast<D2: Scalar, OutPacket: M>(self) -> Self::CastOutput<D2, OutPacket>
where
D: Cast<D2>,
{
cast_stream(self.ctx, self.inner)
}
}
Precision Lowering
Precision lowering downcasts f32 or i32 data into specific lower-precision formats:
| Input Type (D1) | Supported Output Types (D2) |
|---|---|
| i32 | i4, i8, i16 |
| f32 | f8e5m2, f8e4m3, f16, bf16 |
Packet Transformation
The input packet must be exactly 32 bytes (one flit). The Collect Engine ensures this before data reaches the Cast Engine.
After casting each element to the output type, the result is padded back to 32 bytes. Time passes through unchanged.
Input: Time = [T], Packet = [P # (32 / sizeof(D1))], dtype = D1
Output: Time = [T], Packet = [P # (32 / sizeof(D2))], dtype = D2
Examples
Single-flit packet
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![B = 4, A = 8];
fn cast_i32_to_i8<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1], m![1], m![B], m![A]>,
) -> CastTensor<'l, T, i8, m![1], m![1], m![1], m![B], m![A # 32]> {
input.cast()
}
}
Before the cast, each flit is fully utilized: A = 8 elements x 4 bytes (i32) = 32 bytes.
After the cast, each element shrinks to 1 byte (i8), so A = 8 elements occupy only 8 bytes.
The A # 32 padding fills the remaining 24 bytes to maintain the 32-byte flit alignment.
Time stays m![B] because it passes through unchanged.
Padded input packet
When the input data doesn’t fill the full flit, it arrives already padded from the Collect Engine.
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![A = 4];
fn cast_padded<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1], m![1], m![1], m![A # 8]>,
) -> CastTensor<'l, T, i8, m![1], m![1], m![1], m![1], m![A # 32]> {
input.cast()
}
}
Input packet A # 8 = 4 data elements padded to 8 elements at i32 = 32 bytes (one flit).
After cast to i8, 4 data elements occupy 4 bytes, padded to 32: m![A # 32].
This under-utilization may look wasteful, but the Cast Engine is a pass-through stage and never the pipeline bottleneck; the downstream Commit Engine aggregates multiple under-utilized flits into dense DM writes, so no bandwidth is wasted at the DM level.
Transpose Engine
When computation results are in a different memory layout than DM requires, the Transpose Engine reorders the data within flits before the Commit Engine writes to DM.
The Transpose Engine reorders data within a 2D matrix by swapping rows and columns.
It interprets input data as a [in_rows, in_cols] matrix, transposes it, and optionally slices padded elements to produce the desired output shape.
Interface
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
impl<'l, const T: Tu, D, Chip, Cluster, Slice, Time, Packet>
CollectTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
{
/// Transposes axes between the Time and Packet mappings.
/// Swaps the innermost Time axes with the Packet axis, converting [A, B] layout to [B, A].
pub fn transpose<OutTime: M, OutPacket: M>(
self,
) -> TransposeTensor<'l, T, D, Chip, Cluster, Slice, OutTime, OutPacket>
{
// Hardware implementation: swaps rows and columns within [Time, Packet]
}
}
The Transpose Engine operates on the Time and Packet dimensions only. The Chip, Cluster, and Slice dimensions pass through unchanged.
Architecture
Conceptual Operation
The Transpose Engine performs four stages:
- Unpack: Each input packet is 32 bytes, but the transpose buffer only uses the first `elements_per_packet` elements (see Internal Buffer Architecture). This stage discards extraneous padding from each packet, keeping only `elements_per_packet` elements. There are `in_rows` time steps (each delivering `packets_per_col` packets), which assemble the `[in_rows × in_cols]` input matrix, where `in_cols = packets_per_col × elements_per_packet`.
- Transpose: The matrix is transposed: `[in_rows × in_cols]` → `[in_cols × in_rows]`.
- Trim: After transposing, padded elements within each input packet constitute entire rows. This stage allows the removal of those padded rows, producing `[out_rows × in_rows]`, where `out_rows <= in_cols`.
- Align: Each output row is `in_rows` elements wide. This stage pads each row to 32 bytes (`output_alignment` elements), producing the final output packets of shape `[out_rows × (in_rows # output_alignment)]`.
in_cols output_alignment
┌─────────────────┐ ┌──────────────────┐
│ 12 13 14 15 ... │ │ 3 7 11 15 ... │
in_rows │ 8 9 10 11 ... │ ────► │ 2 6 10 14 ... │ out_rows
│ 4 5 6 7 ... │ │ 1 5 9 13 ... │
│ 0 1 2 3 ... │ │ 0 4 8 12 ... │
└─────────────────┘ └──────────────────┘
data_in data_out
Specifications
Internal Buffer Architecture
The Transpose Engine has two internal buffers, each with num_buffer_cols = 16 columns. The input interface receives a fixed number of elements per cycle based on the data type:
| Data Type | elements_per_packet |
|---|---|
| 4-bit | 16 |
| 8/16/32-bit | 8 |
Input Bus Constraints
The input bus to the Transpose Engine is 32 bytes, but its usable capacity depends on the data type:
| Type | Input Format |
|---|---|
| 4-bit | 4b × 16 |
| 8-bit | 8b × 8 |
| 16-bit | 16b × 8 |
| 32-bit | 32b × 8 |
The Transpose Engine receives data from three possible sources:
- Contraction Engine: Outputs 32b × 8
- Vector Engine: Outputs 4b × 16, 8b × 8, 16b × 8, or 32b × 8
- Fetch Engine: Outputs 4b × 16, 8b × 8, 16b × 8, or 32b × 8
Constraints
The following parameters are dependent on the data type:
| Data type | elements_per_packet | output_alignment | Max in_rows | Valid in_cols |
|---|---|---|---|---|
| 4-bit | 16 | 64 | 16 | 16, 32 |
| 8-bit | 8 | 32 | 8 | 8, 16, 32 |
| 16-bit | 8 | 16 | 4 | 8, 16, 32 |
| 32-bit | 8 | 8 | 2 | 8, 16, 32 |
The following are type-agnostic:
- Both the input and output packets must be 32 bytes.
- `out_rows` <= `in_cols` (determines the number of sliced rows in the Trim stage)
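For quick reference, the type-dependent parameters from the table above can be captured in a small lookup; this helper is illustrative only and not part of the SDK.

// Illustrative lookup of the type-dependent Transpose Engine parameters (not an SDK API).
// Returns (elements_per_packet, output_alignment, max_in_rows) per the constraints table.
fn transpose_params(elem_bits: u32) -> Option<(u32, u32, u32)> {
    match elem_bits {
        4 => Some((16, 64, 16)),
        8 => Some((8, 32, 8)),
        16 => Some((8, 16, 4)),
        32 => Some((8, 8, 2)),
        _ => None,
    }
}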
Performance
Double Buffering
The buffering mode is determined by comparing in_cols with num_buffer_cols.
Double buffering occurs when in_cols <= num_buffer_cols.
Otherwise, single buffering is used.
| in_cols | Condition | Buffering Mode |
|---|---|---|
| 8 | 8 ≤ 16 | Double buffering |
| 16 | 16 ≤ 16 | Double buffering |
| 32 | 32 > 16 | Single buffering |
- Double buffering: One buffer receives input while the other produces output simultaneously
- Single buffering: Both buffers are used together, so input and output must alternate
Cycle Calculation
Variable definitions:
$$ \texttt{input_flits_per_iter} = \texttt{in_rows} \times \frac{\texttt{in_cols}}{\texttt{elements_per_packet}} $$ $$ = \texttt{in_rows} \times \texttt{packets_per_col} $$ $$ \texttt{output_flits_per_iter} = \texttt{out_rows} $$ $$ \texttt{n} = \frac{\texttt{OutTime::SIZE}}{\texttt{out_rows}} $$
Cycles per iteration:
- Double buffering: `max(input_flits_per_iter, output_flits_per_iter)`. Input and output happen simultaneously, so the slower one determines the cycle count.
- Single buffering: `input_flits_per_iter + output_flits_per_iter`. Input and output alternate, so both are added.
Total cycles in a burst:
- Double buffering (pipelined execution): $$ \texttt{input_flits_per_iter} + (n - 1) \times \texttt{cycles_per_iter} + \texttt{output_flits_per_iter} $$
  - `input_flits_per_iter`: initial input-only phase (filling the first buffer)
  - `(n - 1) * cycles_per_iter`: middle phase where input and output overlap
  - `output_flits_per_iter`: final output-only phase (draining the last buffer)
- Single buffering (sequential execution): $$ n \times \texttt{cycles_per_iter} $$
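Putting the two cases together, a small host-side estimator (illustrative only, not an SDK API) reproduces the cycle counts quoted in the examples below.

const NUM_BUFFER_COLS: u64 = 16;

// Illustrative Transpose Engine cycle estimate for one burst (not an SDK API).
// `in_cols = packets_per_col * elements_per_packet`; `out_time_size` is OutTime::SIZE.
fn transpose_cycles(
    in_rows: u64,
    in_cols: u64,
    elements_per_packet: u64,
    out_rows: u64,
    out_time_size: u64,
) -> u64 {
    let input_flits_per_iter = in_rows * (in_cols / elements_per_packet);
    let output_flits_per_iter = out_rows;
    let n = out_time_size / out_rows;
    if in_cols <= NUM_BUFFER_COLS {
        // Double buffering: input and output overlap across iterations.
        let cycles_per_iter = input_flits_per_iter.max(output_flits_per_iter);
        input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter
    } else {
        // Single buffering: input and output alternate within each iteration.
        n * (input_flits_per_iter + output_flits_per_iter)
    }
}

For the basic 8×8 example below, `transpose_cycles(8, 8, 8, 8, 64)` evaluates to 72, matching the per-example comments.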
Examples
Basic 8×8 Transpose
The simplest case transposes an 8×8 matrix across the Packet and Time dimensions:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, C = 8, D = 8, E = 8];
fn basic_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, i8, m![1], m![1], m![P], m![C, D], m![E # 32]>,
) -> TransposeTensor<'l, T, i8, m![1], m![1], m![P], m![C, E], m![D # 32]> {
// in_rows = 8 (D)
// packets_per_col = 1,
// elements_per_packet = 8 (i8),
// in_cols = packets_per_col * elements_per_packet = 8 (E)
// out_rows = 8 (E)
// output_alignment = 32 (i8)
// 1. Unpack: [in_rows x packets_per_col x packet]: [D, E # 32] →
// [in_rows x packets_per_col x elements_per_packet]: [D, E] =
// [in_rows x in_cols]
// 2. Transpose: [in_rows x in_cols]: [D, E] →
// [in_cols x in_rows]: [E, D]
// 3. Trim: [in_cols x in_rows]: [E, D] →
// [out_rows x in_rows]: [E, D] (no rows trimmed)
// 4. Align: [out_rows x in_rows]: [E, D] →
// [out_rows x (in_rows # output_alignment)]: [E, D # 32]
// cycle estimation: in_cols (8) ≤ num_buffer_cols (16), double buffering
// input_flits_per_iter = 8, output_flits_per_iter = 8, n = 8 (C), cycles_per_iter = 8
// cycles = input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter = 72
input.transpose()
}
}
Small Matrix Transpose
Transpose works with matrices smaller than the maximum size:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 64, A = 4, B = 2];
fn small_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, i8, m![1], m![1], m![P], m![A], m![B # 32]>,
) -> TransposeTensor<'l, T, i8, m![1], m![1], m![P], m![B], m![A # 32]> {
// in_rows = 4 (A),
// packets_per_col = 1,
// elements_per_packet = 8 (i8),
// in_cols = packets_per_col * elements_per_packet = 1 * 8 = 8
// (B=2 data elements, padded internally to 8)
// out_rows = 2 (B),
// output_alignment = 32 (i8)
// 1. Unpack: [in_rows x packets_per_col x packet]: [A, B # 32] →
// [in_rows x packets_per_col x elements_per_packet]: [A, B # 8] =
// [in_rows x in_cols]
// 2. Transpose: [in_rows x in_cols]: [A, B # 8] →
// [in_cols x in_rows]: [B # 8, A]
// 3. Trim: [in_cols x in_rows]: [B # 8, A] →
// [out_rows x in_rows]: [B, A] (6 rows trimmed)
// 4. Align: [out_rows x in_rows]: [B, A] →
// [out_rows x (in_rows # output_alignment)]: [B, A # 32]
// cycle estimation: in_cols (8) ≤ num_buffer_cols (16), double buffering
// input_flits_per_iter = 4, output_flits_per_iter = 2, n = 1, cycles_per_iter = 4
// cycles = input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter = 6
input.transpose()
}
}
Large Column Transpose (in_cols = 32, single buffering)
When in_cols exceeds num_buffer_cols, single buffering is used:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, B = 2, C = 8, D = 4, E = 8];
fn large_col_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, i8, m![1], m![1], m![P], m![B, C, D], m![E # 32]>,
) -> TransposeTensor<'l, T, i8, m![1], m![1], m![P], m![B, D, E], m![C # 32]> {
// in_rows = 8 (C),
// packets_per_col = 4 (D),
// elements_per_packet = 8 (i8),
// in_cols = packets_per_col * elements_per_packet = 32 (D * E),
// out_rows = 32 (D * E),
// output_alignment = 32 (i8)
// 1. Unpack: [in_rows x packets_per_col x packet]: [C, D, E # 32] →
// [in_rows x packets_per_col x elements_per_packet]: [C, D, E] =
// [in_rows x in_cols]
// 2. Transpose: [in_rows x in_cols]: [C, D, E] →
// [in_cols x in_rows]: [D, E, C]
// 3. Trim: [in_cols x in_rows]: [D, E, C] →
// [out_rows x in_rows]: [D, E, C] (no rows trimmed)
// 4. Align: [out_rows x in_rows]: [D, E, C] →
// [out_rows x (in_rows # output_alignment)]: [D, E, C # 32]
// cycle estimation: in_cols (32) > num_buffer_cols (16), single buffering
// input_flits_per_iter = 8 * 4 = 32, output_flits_per_iter = 32, n = 2 (B), cycles_per_iter = 32 + 32 = 64
// cycles = n * cycles_per_iter = 128
input.transpose()
}
}
16-bit Data Type (bf16)
For 16-bit types, the maximum in_rows is reduced to 4:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, C = 8, D = 4, E = 8];
fn bf16_transpose<'l, const T: Tu>(
input: CollectTensor<'l, T, bf16, m![1], m![1], m![P], m![C, D], m![E # 16]>,
) -> TransposeTensor<'l, T, bf16, m![1], m![1], m![P], m![C, E], m![D # 16]> {
// in_rows = 4 (D),
// packets_per_col = 1,
// elements_per_packet = 8 (bf16),
// in_cols = 8 (E),
// out_rows = 8 (E),
// output_alignment = 16 (bf16)
// 1. Unpack: [in_rows x packets_per_col x packet]: [D, E # 16] →
// [in_rows x in_cols]: [D, E]
// 2. Transpose: [in_rows x in_cols]: [D, E] →
// [in_cols x in_rows]: [E, D]
// 3. Trim: [in_cols x in_rows]: [E, D] →
// [out_rows x in_rows]: [E, D] (no rows trimmed)
// 4. Align: [out_rows x in_rows]: [E, D] →
// [out_rows x (in_rows # output_alignment)]: [E, D # 16]
// cycle estimation: in_cols (8) ≤ num_buffer_cols (16), double buffering
// input_flits_per_iter = 4, output_flits_per_iter = 8, n = 8 (C), cycles_per_iter = 8
// cycles = input_flits_per_iter + (n - 1) * cycles_per_iter + output_flits_per_iter = 68
input.transpose()
}
}
Scheduling
Scheduling determines how operations execute on hardware resources. This chapter explains how the Virtual ISA translates programs into executable schedules. Two programmer-visible inputs determine the schedule: the textual order of operations and explicit memory address assignments. The scheduler respects the written order as the authoritative sequence and does not reorder operations; it then analyzes resource dependencies to determine which operations can run in parallel.
Operation Order
Operation order is how the program communicates sequencing intent to the scheduler: the textual order of operations defines their execution order.
The following example shows this, where load_from_host() loads a tensor from host memory and .op() represents any pipeline operation:
let t0 = load_from_host(); // O0
let t1 = load_from_host(); // O1
let t2 = t0.op(); // O2
let t3 = t1.op(); // O3
let t4 = t2.op(); // O4
let t5 = t4.op(); // O5
The final execution order respects the written order: O0 → O1 → O2 → O3 → O4 → O5.
Memory Allocation
Each tensor requires a specific memory address for precise scheduling. Currently, tensor addresses must be specified explicitly by the programmer.
Hardware Resources
The hardware provides three allocatable resources that can execute in parallel:
| Resource | Description |
|---|---|
| Main context | Primary Tensor Unit execution context |
| Sub context | Secondary context for data prefetching |
| Direct Memory Access (DMA) Engine | Memory-to-memory data transfer |
Main context handles compute-intensive operations through the complete Tensor Unit pipeline — Fetch, Switching, Collect, Contraction, Vector, Cast, Transpose, and Commit — but can only execute one operation at a time.
Sub context runs data movement operations (SRAM-to-TRF transfers, SRAM-to-SRAM copies) concurrently with the main context, enabling double-buffering where the next operation’s data is prepared while the current one computes.
DMA Engine moves large tensors between HBM and SRAM independently of both Tensor Unit contexts, enabling overlapped data transfer and computation.
Two factors cause operations to serialize: resource conflicts and memory dependencies.
Resource conflicts occur when two operations require the same resource, forcing the later one to wait. For example, two matrix multiplications both requiring the main context must execute sequentially. However, a matrix multiplication (main context) can run in parallel with a DMA transfer (DMA engine) because they use different resources.
Memory dependencies arise from data hazards on shared addresses. Read-after-write (RAW) hazards require a read to see the result of a preceding write. Write-after-read (WAR) hazards prevent a write from overwriting data still being read. Write-after-write (WAW) hazards require writes to the same address to execute in order. The scheduler detects these hazards by analyzing the memory addresses specified in the program.
The scheduler manages these constraints automatically. It analyzes each operation’s resource usage and memory addresses to determine parallelism opportunities while respecting program order, inserting implicit waits where necessary. This dependency resolution frees programmers from manually inserting synchronization barriers, though memory addresses must still be specified explicitly (see Memory Allocation).
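As a small illustration using the same placeholder operations as the Operation Order example (`load_from_host()` and `.op()`), the comments below sketch how resource usage and data dependencies drive the schedule; the resource assignments in the comments are assumptions for illustration, not scheduler output.

// Sketch only: resource/address notes in comments are illustrative assumptions.
let t0 = load_from_host(); // O0: DMA engine writes t0
let t1 = load_from_host(); // O1: DMA engine writes t1 (waits for O0: same resource)
let t2 = t0.op();          // O2: main context reads t0 (RAW on t0, waits for O0; may overlap O1)
let t3 = t1.op();          // O3: main context (waits for O2: same resource; RAW on t1, waits for O1)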
Kernel Examples
The introductory tutorial briefly introduced temporal and spatial partitioning for large tensors in its Further Reading section. The preceding chapters explained how mapping expressions distribute work across TCP’s hardware hierarchy and how each component reduces partial results. This chapter shows how to combine mapping, movement, computation, and scheduling into complete, working kernels. The table below summarizes the available parallelism and reduction at each level:
| Dimension | Type | Defined in | Reduced in |
|---|---|---|---|
| Chip | Spatial | HBM, SRAM, Stream | DMA + Vector |
| Cluster | Spatial | SRAM, Stream | DMA + Vector |
| Slice | Spatial | SRAM, Stream | Vector |
| Row | Spatial | TRF | Contraction |
| Time | Temporal | Stream | Contraction |
| Packet | Spatial | Stream | Contraction |
For cross-chip and cross-cluster reduction patterns (the Chip and Cluster rows above), see Chip/Cluster Reduce, which demonstrates DMA broadcast followed by Vector Engine binary add.
The examples progress from single-engine patterns to composed multi-engine patterns to full model implementations:
- Tiling (coming soon): Tile size selection, memory layout, and accumulation strategies.
- Split Reduce: Interleaved fetch for reducing across multiple tensor instances. Use when a reduction dimension exceeds what a single tile can accumulate.
- Chip/Cluster Reduce: ReduceScatter and AllReduce across chips. Use when computation must be distributed across multiple chips or clusters.
- Fetch and Commit Engine: Axis permutation, full-flit commit, tail padding, and tensor segmentation. Use when data layout transformations are needed between memory and compute.
- GEMM with Double-Buffering (coming soon): DMA load from HBM, sub-context TRF prefetch, main-context tiled contraction, cast, and commit. A short end-to-end example bridging single-engine patterns and full model implementations.
- Transformer: Llama 3 70B implementation with prefill and decode phases. A full model combining tiling, multi-chip reduce, and memory management.
- Mixture of Experts: Branchless TopK routing and blockwise sparse computation. A full model demonstrating dynamic routing with sparse computation patterns.
Tiling
Warning
This page is a work in progress. Content will be added in a future release.
Tiling breaks large tensors into smaller tiles that fit in on-chip memory. When a tensor exceeds VRF capacity (8KB per slice) or DM capacity, it must be processed in multiple iterations.
When to Use Tiling
Tiling applies when:
- A tensor dimension exceeds what fits in a single hardware pass — compare the dimension size against the DM capacity table in Memory Performance.
- Memory bandwidth needs to be optimized by reusing loaded data — check whether the same data is fetched more than once across operations.
- Computation needs to be distributed across time rather than space — use when the spatial dimensions are already fully distributed but a loop over tiles is needed.
Basic Tiling Pattern
The basic pattern is: (1) choose a tile size that fits in VRF/DM, (2) loop over tiles in the outer dimensions, (3) fetch each tile from HBM to DM, (4) run the computation, and (5) accumulate partial results before writing back. The tile size must satisfy alignment constraints (32-byte flits) and leave room for double-buffering if overlapping fetch with compute.
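As a minimal sketch of the tile-size arithmetic only (no Virtual ISA calls; the constants follow the constraints listed on this page):

const VRF_BYTES_PER_SLICE: usize = 8 * 1024; // 8KB per slice
const FLIT_BYTES: usize = 32;                // 32-byte flit alignment

// Check whether a candidate tile fits the per-slice VRF budget (illustration only).
fn tile_fits(tile_elems: usize, elem_bytes: usize, double_buffered: bool) -> bool {
    let bytes = tile_elems * elem_bytes;
    // Leave room for double-buffering when fetch overlaps compute.
    let budget = if double_buffered { VRF_BYTES_PER_SLICE / 2 } else { VRF_BYTES_PER_SLICE };
    bytes % FLIT_BYTES == 0 && bytes <= budget
}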
Warning
Add a simple tiling example showing:
- Original tensor shape exceeding VRF
- Tile size calculation
- Loop structure for processing tiles
- Accumulation of partial results
// TODO: Example code
// axes![M = 8192, N = 8192, K = 2048];
//
// Tile sizes chosen to fit in VRF:
// type TileM = m![M / 32]; // 256 elements per tile
// type TileN = m![N / 32]; // 256 elements per tile
//
// Outer loop iterates over tiles
// Inner computation processes one tile
Example: Tiled Matrix Multiplication
Warning
Add complete GEMM example with tiling:
- Input matrices A[M, K] and B[K, N] where M, N, K exceed VRF capacity
- Tile along M and N dimensions
- Accumulate partial results across K tiles
Memory Layout
Warning
Describe how tiles are laid out in HBM and DM
Tile Size Selection
Warning
Explain constraints for choosing tile sizes:
- VRF capacity (8KB per slice)
- DM capacity
- Alignment requirements (32-byte flits)
- Trade-off between tile size and iteration count
Accumulation Strategy
Warning
Explain how partial results are accumulated:
- Accumulate in higher precision (f32) to avoid precision loss
- Store intermediate results in DM or HBM depending on size
- Final cast to output precision (bf16)
Example: Tiled Attention
Warning
Add attention example showing tiling for long sequences:
- Query, Key, Value tensors with long sequence length
- Tile along sequence dimension
- FlashAttention-style tiling for memory efficiency
Performance Considerations
Warning
Add performance analysis:
- Overhead of tile boundary handling
- Memory bandwidth utilization
- Optimal tile sizes for different tensor shapes
- Interaction with hardware prefetching
Split Reduce
Split reduce handles reductions when a logical reduction axis cannot be mapped to a single continuous hardware dimension: the axis is split into multiple separate tensor instances that must be fetched independently using interleaved fetch and combined using Vector Engine binary operations.
When to Use Split Reduce
Split reduce applies when:
- The reduce axis exceeds single-tensor capacity: A reduction axis is too large to fit in VRF (8KB per slice) as a single tensor, requiring the logical axis to be split into multiple physical tensor instances.
- Independent tensor instances exist: Multiple tensor instances hold different portions of the same logical reduction axis (e.g., from different model layers, experts, or temporal segments).
- Avoiding cross-chip communication: Data resides on the same chip/cluster but in separate memory allocations, making interleaved fetch more efficient than DMA-based approaches.
Split reduce sits between slice-level and chip-level reductions in TCP’s reduction hierarchy:
- Packet reduce: Within a single packet (Reducer)
- Time reduce: Across time dimension (Reducer)
- Slice reduce: Across slices within a cluster (Inter-Slice Block)
- Split reduce: Across multiple independent tensor instances, using interleaved fetch (alternating loads from separate tensor instances) combined with Vector Engine binary ops
- Chip/Cluster reduce: Across chips or clusters (DMA + interleaved fetch + Vector Engine binary op)
Implementation: Interleaved Fetch
Split reduce uses interleaved fetch to load multiple tensor instances alternately, creating a time-interleaved stream that the Vector Engine reduces. The fetch pattern introduces an interleave dimension I that indexes the separate tensor instances:
// Two tensor instances to be reduced together
let tensor_0: DmTensor<bf16, m![1], m![1], m![1], m![A, B]> = ...;
let tensor_1: DmTensor<bf16, m![1], m![1], m![1], m![A, B]> = ...;
// Interleaved fetch creates alternating time stream: I=2 dimension
let interleaved: TuTensor<bf16, m![1], m![1], m![1],
m![I: 2, A], m![B]
> = ctx.main.begin_interleaved().fetch(&tensor_0, &tensor_1);
// Vector Engine reduction combines the I dimension
let reduced: TuTensor<bf16, m![1], m![1], m![1],
m![A], m![B]
> = interleaved.reduce_add(axis: I);
The interleaved fetch alternates between tensor instances in the time dimension: time[0] holds data from tensor_0, time[1] from tensor_1, time[2] from tensor_0 again, and so on. The Vector Engine performs binary operations (add, max, min) across the interleave dimension to complete the reduction.
Example 1: Layer Normalization Split Reduction
Layer normalization computes statistics (mean, variance) over the feature dimension. When the feature dimension is too large to fit in a single VRF allocation, it must be split into multiple chunks that are processed separately and then combined.
The Problem: Layer normalization requires computing the mean and variance of all features for each token. The formula is:
output = (input - mean) / sqrt(variance + epsilon)
where mean and variance are computed over the entire Hidden dimension.
When Hidden is very large (like 8192 elements), the tensor won’t fit in the 8KB VRF, so we cannot reduce it in a single operation.
Input: A 3D tensor representing transformer activations:
- Shape: `[Batch=32, SeqLen=128, Hidden=8192]`
- Data type: `bf16` (2 bytes per element)
- Total size: 32 × 128 × 8192 × 2 bytes = 64 MB
- Per-token slice: For each of 4096 tokens (32 × 128), we have 8192 features = 16 KB per token
- VRF constraint: Only 8KB per slice ≈ 4096 `bf16` elements
- Problem: Cannot load all 8192 features for a token simultaneously
Solution Strategy:
Split the Hidden dimension into two 4096-element chunks:
- Chunk 0: `[Batch=32, SeqLen=128, Hidden_0=4096]` - first half of features
- Chunk 1: `[Batch=32, SeqLen=128, Hidden_1=4096]` - second half of features
- Each chunk = 4096 elements × 2 bytes = 8 KB, fits in VRF
Step-by-Step Execution
Step 1: Compute Partial Statistics
First, compute statistics for each chunk independently:
// Chunk 0: Hidden dimensions 0..4096
let chunk_0: DmTensor<bf16, m![1], m![1], m![1], m![Batch, SeqLen, Hidden_0: 4096]> = ...;
// Chunk 1: Hidden dimensions 4096..8192
let chunk_1: DmTensor<bf16, m![1], m![1], m![1], m![Batch, SeqLen, Hidden_1: 4096]> = ...;
// Compute sum for each chunk (using Reducer + Inter-Slice Block)
let sum_0: DmTensor<f32, m![1], m![1], m![1], m![Batch, SeqLen]> = chunk_0.reduce_sum(axis: Hidden_0);
let sum_1: DmTensor<f32, m![1], m![1], m![1], m![Batch, SeqLen]> = chunk_1.reduce_sum(axis: Hidden_1);
Step 2: Interleaved Fetch and Combine
Use split reduce to combine the partial sums:
// Fetch both chunks in interleaved pattern
let interleaved_sums: TuTensor<f32, m![1], m![1], m![1],
m![I: 2, Batch, SeqLen], m![1]
> = ctx.main.begin_interleaved().fetch(&sum_0, &sum_1);
// Vector Engine adds across I dimension to get total sum
let total_sum: TuTensor<f32, m![1], m![1], m![1],
m![Batch, SeqLen], m![1]
> = interleaved_sums.reduce_add(axis: I);
// Compute mean: total_sum / Hidden
let mean = total_sum * (1.0 / 8192.0); // Vector Engine scalar multiply
Step 3: Compute Variance
Similarly, combine partial variance calculations:
// Compute squared differences for each chunk
let sq_diff_0 = (chunk_0 - mean).square().reduce_sum(axis: Hidden_0);
let sq_diff_1 = (chunk_1 - mean).square().reduce_sum(axis: Hidden_1);
// Split reduce to combine variance contributions
let interleaved_vars: TuTensor<f32, m![1], m![1], m![1],
m![I: 2, Batch, SeqLen], m![1]
> = ctx.main.begin_interleaved().fetch(&sq_diff_0, &sq_diff_1);
let total_sq_diff = interleaved_vars.reduce_add(axis: I);
// Divide the summed squared differences by the feature count to obtain the variance
let total_variance = total_sq_diff * (1.0 / 8192.0);
let std = total_variance.sqrt();
Output:
The three steps produce the statistics needed for layer normalization:
- Mean: `[Batch=32, SeqLen=128]` - one mean value per token, representing the average of all 8192 features
- Standard deviation: `[Batch=32, SeqLen=128]` - one std value per token
- Result: Use these statistics to normalize each token's 8192 features:
  - `normalized_chunk_0 = (chunk_0 - mean) / std`
  - `normalized_chunk_1 = (chunk_1 - mean) / std`

Computing statistics in two separate chunks produces the same mathematical result as computing over all 8192 features at once:
- Mathematically: `mean([a,b,c,d,e,f]) = (sum(a,b,c) + sum(d,e,f)) / 6`
- In practice: `mean([Hidden_0, Hidden_1]) = (sum(Hidden_0) + sum(Hidden_1)) / 8192`
Split reduce computes global statistics despite VRF capacity limits.
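A plain-Rust sanity check of this identity (illustration only; it runs on the host, not on the TCP):

// Mean over two chunks equals the mean over their concatenation.
fn chunked_mean(chunk_0: &[f32], chunk_1: &[f32]) -> f32 {
    let n = (chunk_0.len() + chunk_1.len()) as f32;
    (chunk_0.iter().sum::<f32>() + chunk_1.iter().sum::<f32>()) / n
}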
Hardware Mapping
The split reduce operation maps to hardware as follows:
| Operation | Hardware Component | Cycles |
|---|---|---|
| Fetch chunk_0 | Fetch Engine | ~1 cycle per 32-byte flit |
| Fetch chunk_1 | Fetch Engine (interleaved) | ~1 cycle per 32-byte flit |
| Interleave dimension creation | Fetch Sequencer | 0 (structural transformation) |
| Binary add across I | Vector Engine | 1 cycle per packet |
Performance Analysis
Total cycles for split reduce:
- Fetch both tensors: `2 * (Batch * SeqLen * ceil(Hidden / flit_elements))` cycles
- Vector Engine reduction: `(Batch * SeqLen)` cycles
- Total: Dominated by fetch time, ~8K cycles for this example
Bottleneck: Memory bandwidth for fetching both tensor instances sequentially.
Optimization: Restructure the computation to avoid splitting the reduction axis when possible. If the axis must be split, minimize the number of split instances.
Example 2: Batch Normalization Across Split Batches
Batch normalization computes statistics across the batch dimension. When processing very large batches, the batch dimension may be split across multiple tensor allocations.
Problem Setup
- Input: `[Batch_0 = 256, ...]`, `[Batch_1 = 256, ...]` (two separate batch tensors)
- Reduction goal: Compute mean and variance across all 512 examples
- Constraint: Cannot allocate a single tensor for all 512 examples due to memory limits
Execution Pattern
// Two batch allocations
let batch_0: DmTensor<bf16, m![1], m![1], m![1], m![Batch_0: 256, C, H, W]> = ...;
let batch_1: DmTensor<bf16, m![1], m![1], m![1], m![Batch_1: 256, C, H, W]> = ...;
// Compute per-batch statistics (reduce over H, W)
let batch_stats_0 = batch_0.reduce_mean(axis: [H, W]); // [Batch_0=256, C]
let batch_stats_1 = batch_1.reduce_mean(axis: [H, W]); // [Batch_1=256, C]
// Split reduce to combine batch statistics
let interleaved: TuTensor<f32, m![1], m![1], m![1],
m![I: 2, Batch: 256, C], m![1]
> = ctx.main.begin_interleaved().fetch(&batch_stats_0, &batch_stats_1);
// Compute global statistics across all batches
let global_mean = interleaved.reduce_mean(axis: I); // Average the two batch means (valid because both splits hold the same number of examples)
This pattern extends naturally to more than two splits by increasing the interleave dimension: I: 4 for four splits, etc.
Example 3: Mixture of Experts Partial Reduction
Problem Setup
- Expert outputs: Multiple tensors from different expert evaluations
- Routing weights: Weights determining how much each expert contributes
- Goal: Weighted sum across expert outputs
Execution Pattern
// Expert outputs from separate evaluations (simplified: 2 experts)
let expert_0_output: DmTensor<bf16, m![1], m![1], m![1], m![Tokens, Hidden]> = ...;
let expert_1_output: DmTensor<bf16, m![1], m![1], m![1], m![Tokens, Hidden]> = ...;
let routing_weights: [f32; 2] = [0.7, 0.3]; // Per-expert weights
// Apply routing weights during fetch using zero-point arithmetic or scaling
let weighted_0 = expert_0_output * routing_weights[0];
let weighted_1 = expert_1_output * routing_weights[1];
// Split reduce to combine weighted expert contributions
let interleaved: TuTensor<bf16, m![1], m![1], m![1],
m![I: 2, Tokens], m![Hidden]
> = ctx.main.begin_interleaved().fetch(&weighted_0, &weighted_1);
let combined_output = interleaved.reduce_add(axis: I);
Example 4: Temporal Reduction Across Windows
Problem Setup
- Input: Video frames or sequence tokens split into temporal chunks
- Goal: Compute global statistics across all chunks
- Constraint: Cannot load all chunks simultaneously due to memory limits
Execution Pattern
// Temporal chunks
let chunk_t0: DmTensor<bf16, m![1], m![1], m![1], m![Time_0: 128, Features]> = ...;
let chunk_t1: DmTensor<bf16, m![1], m![1], m![1], m![Time_1: 128, Features]> = ...;
let chunk_t2: DmTensor<bf16, m![1], m![1], m![1], m![Time_2: 128, Features]> = ...;
let chunk_t3: DmTensor<bf16, m![1], m![1], m![1], m![Time_3: 128, Features]> = ...;
// Compute per-chunk max (e.g., for max pooling over time)
let max_t0 = chunk_t0.reduce_max(axis: Time_0); // [Features]
let max_t1 = chunk_t1.reduce_max(axis: Time_1); // [Features]
let max_t2 = chunk_t2.reduce_max(axis: Time_2); // [Features]
let max_t3 = chunk_t3.reduce_max(axis: Time_3); // [Features]
// Split reduce with I=4 to find global maximum
let interleaved: TuTensor<bf16, m![1], m![1], m![1],
m![I: 4], m![Features]
> = ctx.main.begin_interleaved().fetch(&max_t0, &max_t1, &max_t2, &max_t3);
let global_max = interleaved.reduce_max(axis: I);
Comparison with Other Reduction Methods
The choice between split reduce and its alternatives depends on data location, tensor shape, and whether data can be merged into a single allocation.
Split Reduce vs. Slice Reduce (Inter-Slice Block)
| Aspect | Split Reduce | Slice Reduce (Inter-Slice Block) |
|---|---|---|
| Data layout | Multiple independent tensors | Single tensor across slices |
| Fetch pattern | Interleaved fetch from multiple sources | Single contiguous fetch |
| Reduction hardware | Vector Engine binary ops | Inter-Slice Block |
| Typical cycles | ~2x fetch time + cycles | ~256 cycles (slice reduction) |
| Use case | Data cannot fit in single tensor | Data distributed across hardware |
Prefer split reduce: Multiple tensor instances that cannot be merged into a single tensor due to memory allocation constraints, but all reside on the same chip/cluster.
Prefer slice reduce: Allocate a single tensor that spans slices, allowing the hardware to handle distribution automatically.
Split Reduce vs. Chip/Cluster Reduce
| Aspect | Split Reduce | Chip/Cluster Reduce |
|---|---|---|
| Data location | Same chip/cluster | Across chips/clusters |
| Communication | Local memory fetch | DMA over chip interconnect |
| Overhead | Minimal (interleaved fetch) | Significant (DMA + synchronization) |
| Bandwidth | SRAM bandwidth | Chip interconnect bandwidth |
Prefer split reduce: All data resides on the same chip, even if in separate allocations.
Prefer chip/cluster reduce: Data is distributed across physically separate processing units requiring cross-chip communication.
Implementation Methods
The split reduce operation maps to the following hardware primitives:
- Interleaved fetch: Fetch Engine with `begin_interleaved()` mode, creating the `I` interleave dimension
- Reduction across I: Vector Engine binary operations (add, max, min) configured to reduce the interleave axis
- Alternative for 2-way split: Can use a binary operation directly without an explicit interleave dimension
Two-Instance Optimization
For the common case of splitting into exactly two instances, the Vector Engine can perform the reduction without creating an explicit interleave dimension:
// Direct binary operation for 2-way split
let sum_0: TuTensor<f32, m![1], m![1], m![1], m![A], m![B]> = ...;
let sum_1: TuTensor<f32, m![1], m![1], m![1], m![A], m![B]> = ...;
// Fetch both and add in one operation
let total = sum_0.binary_add(sum_1); // No interleave dimension needed
This optimization reduces overhead by combining fetch and reduction into a single pipelined operation.
Performance Considerations
Cycle Analysis
Split reduce cycle count depends on three factors:
- Fetch cycles: `N_splits * fetch_cycles_per_tensor`
- Vector Engine cycles: `Time_dim_size * cycles_per_packet` (typically 1 cycle per packet)
- Pipeline overlap: Fetch and VE operations can overlap when possible
Total cycles ≈ N_splits * fetch_cycles + max(0, VE_cycles - pipeline_overlap)
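The same model as a tiny helper (illustrative only, not an SDK API):

// Illustrative estimator for the split-reduce cycle model above.
fn split_reduce_cycles(n_splits: u64, fetch_cycles_per_tensor: u64, ve_cycles: u64, overlap: u64) -> u64 {
    // Total ≈ N_splits * fetch_cycles + max(0, VE_cycles - pipeline_overlap)
    n_splits * fetch_cycles_per_tensor + ve_cycles.saturating_sub(overlap)
}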
Memory Bandwidth
Split reduce consumes memory bandwidth proportionally to the number of splits:
- 2-way split: 2x memory bandwidth vs. single tensor
- 4-way split: 4x memory bandwidth vs. single tensor
Optimization: Minimize the number of splits by maximizing individual tensor size within VRF capacity.
Comparison to Alternatives
For a reduction requiring combining N tensor instances:
| Method | Cycles | Memory BW | Complexity |
|---|---|---|---|
| Split reduce (interleaved) | ~N * fetch + VE | N * tensor_size | Low |
| Sequential fetch + accumulate | ~N * (fetch + VE) | N * tensor_size | Medium |
| DMA to single buffer + reduce | DMA + single_reduce | N * tensor_size | High |
Split reduce with interleaved fetch provides the best balance of performance and implementation simplicity for same-chip reductions.
Constraints and Limitations
Hardware Constraints
- Interleave dimension size: Limited by Fetch Engine capabilities
- Tensor alignment: All tensor instances must have compatible shapes for interleaving
- VRF capacity: After interleaving, the combined tensor must fit in VRF (8KB per slice)
When Split Reduce Is Not Optimal
- Single tensor possible: Data fits in one tensor allocation, use slice reduce (Inter-Slice Block) instead
- Cross-chip reduction needed: Data spans chips, use chip/cluster reduce with DMA
- Very large split count: Beyond ~8 splits, consider alternative memory management strategies
Best Practices
- Minimize splits: Design tensor allocations to minimize the number of splits required
- Power-of-2 splits: Use 2, 4, or 8 splits when possible for optimal hardware utilization
- Reuse reduction results: Cache split reduce results when the same combination is needed multiple times
- Consider memory layout: Organize tensor allocations to enable efficient interleaved fetch patterns
Chip/Cluster Reduce
When a previous operation has already mapped the reduce axis to the Chip or Cluster dimension, chip/cluster reduce is needed to combine partial results across physically separate processing units.
This section demonstrates how to perform those reduction operations when data is distributed across multiple chips or clusters.
When possible, assigning reduce axes to Slice/Element (reduced by Inter-Slice Block/Vector Engine) is preferred because it avoids cross-chip communication overhead.
Two main operations implement chip/cluster reduce: AllReduce and ReduceScatter. Both combine Switch Engine operations (for data redistribution across slices within a cluster) with Vector Engine binary operations (for actual reduction computation).
ReduceScatter
ReduceScatter reduces data distributed across chip/cluster axes while distributing the result so each chip holds a portion.
This operation is useful when you need both reduction and result distribution in a single step.
Example: 4-chip ReduceScatter with Add
This example demonstrates how to perform reduction across chips when data is partitioned by one dimension (A) but needs to be reduced along a different dimension (B).
The challenge is that each chip owns data for all B values of its assigned A value, but we need to sum across all A values for each B.
Input:
A 2D tensor [A=4, B=4] with 16 total elements, distributed across 4 chips:
- Shape: `[A=4, B=4]` - 16 elements total
- Data type: `i8` (8-bit signed integer)
- Storage: SRAM on each chip
- Distribution: `In = {chip: A, slice: 256, element: B}`
  - Chip 0 owns: `(A=0, B=0)`, `(A=0, B=1)`, `(A=0, B=2)`, `(A=0, B=3)` - all B values for A=0
  - Chip 1 owns: `(A=1, B=0)`, `(A=1, B=1)`, `(A=1, B=2)`, `(A=1, B=3)` - all B values for A=1
  - Chip 2 owns: `(A=2, B=0)`, `(A=2, B=1)`, `(A=2, B=2)`, `(A=2, B=3)` - all B values for A=2
  - Chip 3 owns: `(A=3, B=0)`, `(A=3, B=1)`, `(A=3, B=2)`, `(A=3, B=3)` - all B values for A=3
Goal:
Reduce along the A axis (summing across chips) while keeping results distributed by B:
- Output shape: `[B=4]` - 4 elements (A dimension eliminated by reduction)
- Output distribution: `Out = {chip: 4, slice: 256, element: 1}`
  - Chip 0 should hold: sum of `(A=0..3, B=0)` - the sum of all A values for B=0
  - Chip 1 should hold: sum of `(A=0..3, B=1)` - the sum of all A values for B=1
  - Chip 2 should hold: sum of `(A=0..3, B=2)` - the sum of all A values for B=2
  - Chip 3 should hold: sum of `(A=0..3, B=3)` - the sum of all A values for B=3
Processing:
Slice (also called asymmetric slice) is a sub-context operation that extracts a subset of elements from specific chip positions (see Implementation Methods).
ChipShuffle is a DMA-based redistribution operation that moves data from one chip to another.
The algorithm works through six stages: create four intermediate tensors using diagonal Slice + ChipShuffle patterns, add them to reduce the A axis, then broadcast results to all chips.
The diagonal pattern ensures each chip receives the data it needs for its assigned B value.
Initial State
Each chip owns one value along the A axis:
- Chip 0: `(A=0, B=0)`, `(A=0, B=1)`, `(A=0, B=2)`, `(A=0, B=3)`
- Chip 1: `(A=1, B=0)`, `(A=1, B=1)`, `(A=1, B=2)`, `(A=1, B=3)`
- Chip 2: `(A=2, B=0)`, `(A=2, B=1)`, `(A=2, B=2)`, `(A=2, B=3)`
- Chip 3: `(A=3, B=0)`, `(A=3, B=1)`, `(A=3, B=2)`, `(A=3, B=3)`
Step 1: Create Tensor T0 - Slice(0,1,2,3)
This step selects specific positions along the B axis from each chip using Slice, creating a diagonal selection pattern:
- Chip 0: select (0,0)
- Chip 1: select (1,1)
- Chip 2: select (2,2)
- Chip 3: select (3,3)
Result T0:
- Chip 0: (0,0)
- Chip 1: (1,1)
- Chip 2: (2,2)
- Chip 3: (3,3)
Step 2: Create Tensor T1 - Slice(3,0,1,2) + ChipShuffle(1,2,3,0)
This step combines Slice with ChipShuffle to create a rotated diagonal pattern.
First, Slice selects elements:
- Chip 0: (0,3)
- Chip 1: (1,0)
- Chip 2: (2,1)
- Chip 3: (3,2)
Then ChipShuffle(1,2,3,0) redistributes the data so each chip receives data from another chip:
- Data from Chip 1 moves to Chip 0: (1,0)
- Data from Chip 2 moves to Chip 1: (2,1)
- Data from Chip 3 moves to Chip 2: (3,2)
- Data from Chip 0 moves to Chip 3: (0,3)
Step 3: Create Tensor T2 - Slice(2,3,0,1) + ChipShuffle(2,3,0,1)
This step creates another rotated diagonal pattern. First, Slice selects positions:
- Chip 0: select (0,2)
- Chip 1: select (1,3)
- Chip 2: select (2,0)
- Chip 3: select (3,1)
Then ChipShuffle(2,3,0,1) redistributes the data, yielding T2:
- Chip 0: (2,0)
- Chip 1: (3,1)
- Chip 2: (0,2)
- Chip 3: (1,3)
Step 4: Create Tensor T3 - Slice(1,2,3,0) + ChipShuffle(3,0,1,2)
This step creates the final rotated diagonal pattern. First, Slice selects positions:
- Chip 0: select (0,1)
- Chip 1: select (1,2)
- Chip 2: select (2,3)
- Chip 3: select (3,0)
Then ChipShuffle(3,0,1,2) redistributes the data, yielding T3:
- Chip 0: (3,0)
- Chip 1: (0,1)
- Chip 2: (1,2)
- Chip 3: (2,3)
Step 5: Vector Engine Add - A Axis Reduction
This step performs the actual reduction by adding all 4 tensors element-wise:
- Chip 0: (0,0) + (1,0) + (2,0) + (3,0)
- Chip 1: (1,1) + (2,1) + (3,1) + (0,1)
- Chip 2: (2,2) + (3,2) + (0,2) + (1,2)
- Chip 3: (3,3) + (0,3) + (1,3) + (2,3)
After this addition, each chip holds only one value because the A axis has been reduced:
Intermediate = { chip: B, slice: 256, element: 1 }
Step 6: AllGather
This final step broadcasts the result so all chips hold the complete reduction output. Each chip gathers data from Chip 0 through Chip 3:
Intermediate = { chip: 4, slice: 256, element: B }
Output:
After all six steps complete, each chip holds a portion of the reduced result:
- Final distribution: `Out = {chip: A, slice: 256, element: 4}`
- Chip 0: Holds sum of all `(A=*, B=0)` values
- Chip 1: Holds sum of all `(A=*, B=1)` values
- Chip 2: Holds sum of all `(A=*, B=2)` values
- Chip 3: Holds sum of all `(A=*, B=3)` values
The A axis has been reduced (summed across all 4 chips), and the results are scattered across chips based on the B value.
Each chip now owns one element representing the sum of all A values for its assigned B coordinate.
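The data movement above can be checked with a plain-Rust simulation. It models only the diagonal Slice + ChipShuffle + add pattern; it is not Virtual ISA code and invokes no TCP primitives.

// Simulate the 4-chip ReduceScatter: input[a][b] is element (A=a, B=b), chip a owns row a.
fn reduce_scatter_sim(input: [[i32; 4]; 4]) -> [i32; 4] {
    let n = 4usize;
    let mut per_chip_sum = [0i32; 4];
    for step in 0..n {
        // Slice: chip c extracts its element (A=c, B=(c - step) mod n).
        let sliced: Vec<i32> = (0..n).map(|c| input[c][(c + n - step) % n]).collect();
        // ChipShuffle: chip d receives the element sliced on chip (d + step) mod n,
        // so after the shuffle chip d holds (A=(d + step) mod n, B=d).
        for d in 0..n {
            per_chip_sum[d] += sliced[(d + step) % n];
        }
    }
    // per_chip_sum[d] is the sum over all A of (A, B=d): the scattered result.
    per_chip_sum
}

Running this on the 4×4 example leaves chip `d` with the sum of column `B = d`, matching the ReduceScatter output described above.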
Why this example is useful:
ReduceScatter combines two operations that frequently occur together in distributed computing:
- Reduction across processors: Summing/aggregating data distributed across multiple chips
- Result distribution: Each chip gets a portion of the result rather than duplicating it everywhere
This pattern is essential for:
- Distributed matrix multiplication: Reduce partial products from different chips while distributing the result
- Gradient aggregation in data parallelism: Sum gradients across workers, with each worker holding a portion
- Memory efficiency: Avoids storing the full reduced result on every chip (unlike AllReduce)
- Pipeline parallelism: Enables efficient communication patterns between pipeline stages
The diagonal slicing pattern is key: it ensures that data needed for each output element is gathered from all chips before reduction, minimizing communication rounds.
AllReduce
AllReduce reduces data distributed across the chip axis so that all chips have identical reduction results.
Unlike ReduceScatter, AllReduce ensures every chip ends up with the complete result rather than a portion.
Example: 4-chip AllReduce with Add
This example demonstrates the most common collective operation in distributed deep learning: reducing values across all processors so every processor has the identical complete result. This is essential for operations like averaging gradients across data-parallel training workers.
Input:
A 2D tensor [A=4, B=4] distributed across 4 chips by the A dimension:
- Shape: `[A=4, B=4]` - 16 elements total
- Data type: `i8` (8-bit signed integer)
- Storage: SRAM on each chip
- Distribution: `In = {chip: A, slice: 256, element: B}`
  - Chip 0 owns: `(A=0, B=0-3)` - all 4 B values for A=0
  - Chip 1 owns: `(A=1, B=0-3)` - all 4 B values for A=1
  - Chip 2 owns: `(A=2, B=0-3)` - all 4 B values for A=2
  - Chip 3 owns: `(A=3, B=0-3)` - all 4 B values for A=3
Goal:
Reduce along the A axis and replicate the complete result to all chips:
- Output shape: `[B=4]` - 4 elements (A dimension eliminated by summation)
- Output distribution: `Out = {chip: 4, slice: 256, element: B}`
  - Every chip holds: sum of all `(A=0..3, B=0)`, sum of all `(A=0..3, B=1)`, sum of all `(A=0..3, B=2)`, sum of all `(A=0..3, B=3)`
  - All chips have identical data after AllReduce completes
Processing:
The algorithm creates 4 versions of the input tensor through rotation, then adds them all together:
- Use 3 `ChipShuffle` operations on the original tensor `T0` to create 3 rotated versions (`T1`, `T2`, `T3`)
- Add all 4 tensors element-wise using the Vector Engine
- Every chip performs the same additions on its local data, producing identical results everywhere
Initial State (T0)
Each chip owns one value along the A axis:
- Chip 0: (A=0, B=0), (A=0, B=1), (A=0, B=2), (A=0, B=3)
- Chip 1: (A=1, B=0), (A=1, B=1), (A=1, B=2), (A=1, B=3)
- Chip 2: (A=2, B=0), (A=2, B=1), (A=2, B=2), (A=2, B=3)
- Chip 3: (A=3, B=0), (A=3, B=1), (A=3, B=2), (A=3, B=3)
Step 1: Create Tensor T1 - ChipShuffle(1,2,3,0)
This step rotates the data by one chip position.
ChipShuffle(1,2,3,0) is applied to the original T0:
- Data from Chip 1 moves to Chip 0
- Data from Chip 2 moves to Chip 1
- Data from Chip 3 moves to Chip 2
- Data from Chip 0 moves to Chip 3
The resulting T1:
- Chip 0: (1,0), (1,1), (1,2), (1,3)
- Chip 1: (2,0), (2,1), (2,2), (2,3)
- Chip 2: (3,0), (3,1), (3,2), (3,3)
- Chip 3: (0,0), (0,1), (0,2), (0,3)
Step 2: Create Tensor T2 - ChipShuffle(2,3,0,1)
This step rotates the data by two chip positions.
ChipShuffle(2,3,0,1) is applied to the original T0:
- Data from Chip 2 moves to Chip 0
- Data from Chip 3 moves to Chip 1
- Data from Chip 0 moves to Chip 2
- Data from Chip 1 moves to Chip 3
The resulting T2:
- Chip 0: (2,0), (2,1), (2,2), (2,3)
- Chip 1: (3,0), (3,1), (3,2), (3,3)
- Chip 2: (0,0), (0,1), (0,2), (0,3)
- Chip 3: (1,0), (1,1), (1,2), (1,3)
Step 3: Create Tensor T3 - ChipShuffle(3,0,1,2)
This step rotates the data by three chip positions.
ChipShuffle(3,0,1,2) is applied to the original T0:
- Data from Chip 3 moves to Chip 0
- Data from Chip 0 moves to Chip 1
- Data from Chip 1 moves to Chip 2
- Data from Chip 2 moves to Chip 3
The resulting T3:
- Chip 0: (3,0), (3,1), (3,2), (3,3)
- Chip 1: (0,0), (0,1), (0,2), (0,3)
- Chip 2: (1,0), (1,1), (1,2), (1,3)
- Chip 3: (2,0), (2,1), (2,2), (2,3)
Step 4: Vector Engine Add - A Axis Reduction
This step performs the actual reduction by adding all 4 tensors T0, T1, T2, T3:
- Chip 0: (0,0)+(1,0)+(2,0)+(3,0), (0,1)+(1,1)+(2,1)+(3,1), (0,2)+(1,2)+(2,2)+(3,2), (0,3)+(1,3)+(2,3)+(3,3)
- Chip 1: (1,0)+(2,0)+(3,0)+(0,0), (1,1)+(2,1)+(3,1)+(0,1), (1,2)+(2,2)+(3,2)+(0,2), (1,3)+(2,3)+(3,3)+(0,3)
- Chip 2: (2,0)+(3,0)+(0,0)+(1,0), (2,1)+(3,1)+(0,1)+(1,1), (2,2)+(3,2)+(0,2)+(1,2), (2,3)+(3,3)+(0,3)+(1,3)
- Chip 3: (3,0)+(0,0)+(1,0)+(2,0), (3,1)+(0,1)+(1,1)+(2,1), (3,2)+(0,2)+(1,2)+(2,2), (3,3)+(0,3)+(1,3)+(2,3)
Notice that each chip computes the same mathematical result, just with operands in different orders (addition is commutative, so order doesn’t matter). After this step, all chips have identical data.
Output:
After the AllReduce completes, every chip holds the complete reduced result:
- Final distribution: `Out = {chip: 4, slice: 256, element: B}`
- Every chip holds identical data: the sum of all A values for each B position
- All chips have: `[sum(A=0..3, B=0), sum(A=0..3, B=1), sum(A=0..3, B=2), sum(A=0..3, B=3)]`
This can be viewed as transforming [A=4] | [B=4] to [Broadcast=4] | [B=4]:
- The `A` axis has been reduced (eliminated through summation)
- The result is broadcast to all chips (every chip has the complete result)
Why this example is useful:
AllReduce is the workhorse operation for distributed machine learning:
- Data parallel training: Average gradients computed across multiple batches on different chips
- Model averaging: Combine parameter updates from multiple workers
- Synchronization primitive: Ensure all chips have identical state before proceeding
- Global statistics: Compute metrics like mean/max/min across the entire distributed dataset
Key characteristics:
- Bandwidth efficient: Each chip only receives data from 3 shuffle operations (not 3 full tensor transfers)
- Symmetric: All chips perform the same computation, simplifying implementation
- Complete replication: Every chip ends with full result, enabling independent downstream operations
- Foundation for collectives: More complex distributed operations build on AllReduce
The rotation-based algorithm shown here scales to any power-of-2 number of chips: for 8 chips, use 7 rotations; for 16 chips, use 15 rotations, etc.
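The rotation pattern can also be checked with a plain-Rust simulation (not Virtual ISA code); it models the `T0`..`T3` rotations and the final add for any number of chips.

// Simulate rotation-based AllReduce: each chip owns one row of `input`.
// After n-1 rotations and adds, every chip holds the element-wise sum of all rows.
fn all_reduce_sim(input: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    let n = input.len();
    let width = input[0].len();
    // Start from the local row (T0), then add the rotated versions (T1..Tn-1).
    let mut result = input.clone();
    for rotation in 1..n {
        for chip in 0..n {
            // ChipShuffle by `rotation`: chip receives the row of chip (chip + rotation) % n.
            let src = (chip + rotation) % n;
            for col in 0..width {
                result[chip][col] += input[src][col];
            }
        }
    }
    result
}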
Implementation Methods for Each Operation
Each operation in chip/cluster reduce maps to specific hardware primitives. Understanding these mappings helps predict performance and resource usage patterns.
Asymmetric Slice
Chip/cluster asymmetric slice operations extract a subset of data from specific positions in the chip or cluster dimension. The ParallelCopy operation implements this by running in the sub-context using the stos (Store to SRAM) command. This approach enables selective data extraction without full tensor movement, copying only the elements at positions specified by the slice indices. The sub-context execution ensures that slice operations can overlap with main-context computation, maintaining pipeline efficiency.
Shuffle
Chip/cluster shuffle redistributes data across chips using DMA operations through HBM. The DmaCommand handles intra-chip shuffles by moving data between HBM regions associated with different chips, while PCIeDmaCommand extends this capability to inter-chip communication when needed. The HBM-to-HBM transfer pattern avoids unnecessary round-trips through chip-local memory, directly routing data to its destination. Shuffle operations are the primary cost factor in chip/cluster reduce because they involve cross-chip data movement over the interconnect fabric, typically requiring hundreds to thousands of cycles depending on data volume.
Tensor Addition
Tensor addition combines multiple input tensors element-wise to perform the actual reduction computation. This operation runs in the main context using a two-stage approach: interleaved fetch brings data from multiple tensor instances into the pipeline, and the Vector Engine’s binary add operation performs the element-wise summation. The interleaved fetch pattern enables the Vector Engine to process additions efficiently by presenting operands in alternating time steps, avoiding the need for separate accumulation buffers. This main-context execution provides maximum throughput for the arithmetic-intensive reduction phase after data has been properly arranged through slice and shuffle operations.
Fetch and Commit Engine
The Fetch Engine reads tensors from SRAM while the Commit Engine writes them back. These examples demonstrate the complete data path: input tensor -> fetch sequencer -> Switch Engine -> Collect Engine -> commit unit -> output tensor.
Each example focuses on a specific pattern: axis permutation, full-flit commit, tail padding optimization, and tensor segmentation. These four patterns represent distinct aspects of the fetch-commit data path: axis reordering (permutation), write granularity (full-flit), memory layout choices (tail padding), and handling of tensors that exceed hardware capacity (segmentation).
Example 1: Axis Permutation
This example demonstrates tensor reshaping by permuting axes during a fetch-commit cycle. The Switch Engine enables axis reordering without additional computation by controlling data flow from fetch to commit.
axes![A = 3, B = 5, C = 2];
// Input: shape [A, B, C] at address 0
let input: DmTensor<f8, m![1], m![1], m![1], m![A, B, C]> = ...;
// Output: shape [B, A, C] at address 1024 (permuted layout)
let output: DmTensor<f8, m![1], m![1], m![1], m![B, A, C # 6]> = ctx
.main
.begin(input.view())
.fetch::<f8, m![A, B], m![C # 6]>() // Time=[A,B], Packet=[C] padded to 8 bytes
.collect::<m![A, B], m![C # 30]>() // Pad to 32-byte flit (forwarding switch implied)
.commit(1024); // Write with permuted sequencer config
Input Tensor:
The input is a 3D tensor stored in SRAM with dimensions [A=3, B=5, C=2], containing 30 elements total:
- Shape: `A × B × C = 3 × 5 × 2`
- Data type: `f8` (8-bit floating-point)
- Memory layout: `m![A, B, C]` - consecutive in memory as `A` varies slowest, `C` varies fastest
- Base address: `b = 0` (starts at SRAM address 0)
- Physical storage: Elements are arranged as `[A0,B0,C0][A0,B0,C1][A0,B1,C0]...[A2,B4,C1]`
Labeling elements by their indices, memory contains:
Address 0-1: (A=0,B=0,C=0-1)
Address 2-3: (A=0,B=1,C=0-1)
Address 4-5: (A=0,B=2,C=0-1)
...continuing with A=0, varying B...
Address 10-11: (A=1,B=0,C=0-1)
...and so on
Output Tensor (Target):
Store the same logical tensor with axes permuted to layout [B, A, C]:
- Shape: Still `B × A × C = 5 × 3 × 2` (same 30 elements, different order)
- Data type: `f8` (unchanged)
- Memory layout: `m![B, A, C # 6]` - now `B` varies slowest, with 6 bytes of padding after each `C` pair
- Base address: `b = 1024` (stored at SRAM address 1024)
- Physical storage: Elements arranged as `[B0,A0,C0-1][B0,A1,C0-1][B0,A2,C0-1][B1,A0,C0-1]...`
This reordering changes which elements are adjacent in memory: in the input, all B values for A=0 are contiguous; in the output, all A values for B=0 are contiguous.
Processing:
The axis permutation happens through three stages, handled by the Fetch Sequencer, the Collect Engine, and the Commit Unit:
- Fetch Sequencer: Reads the input tensor from SRAM and creates a packet stream
  - Time dimension: `Time = m![A, B]` - iterates through 15 cycles (3 × 5)
  - Packet dimension: `Packet = m![C # 6]` - each packet contains 2 `C` elements plus 6 bytes of padding
  - Fetch size: 8 bytes per cycle (meets hardware alignment requirement)
  - Note: Hardware requires 8-byte packet alignment, so we cannot use `C` = 2 bytes alone; we pad to 8 bytes
- Collect Engine: Normalizes packets into standard 32-byte flits for the commit stage
  - Input packets (8 bytes) are padded to create 32-byte flits
  - Time dimension: `Time = m![A, B]` - unchanged, still 15 cycles
  - Flit dimension: `Flit = m![C # 30]` - 2 data bytes + 30 bytes padding = 32-byte flit
  - The Collect Engine pads and normalizes packet sizes without reordering data
- Commit Unit: Writes data to SRAM with the new axis order `[B, A, C]`
  - Receives flits with time `m![A, B]` but writes to memory layout `m![B, A, C # 6]`
  - The write sequencer configuration creates the permutation
  - Commit size: 8 bytes per write (matching fetch size)
  - Slices incoming 32-byte flits down to 8-byte write units
The write sequencer configuration determines how to map the incoming time-ordered stream m![A, B, C] to the permuted memory layout m![B, A, C #6].
The notation [axis=count:stride, ...] @ base / commit_size means: for each axis, loop count times advancing stride bytes per step; @ sets the base address; / sets the bytes written per commit operation (see Sequencer for the full sequencer model).
The sequencer is configured as: [A=3:8, B=5:24, C=8:1] @ 1024 / 8
- `A=3:8` means loop 3 times with stride 8 bytes between iterations
- `B=5:24` means loop 5 times with stride 24 bytes between iterations
- `C=8:1` means write 8 bytes (the packet size) with stride 1
- Base address: 1024 (output tensor starts here)
- Commit size: 8 bytes per write operation
This configuration causes data arriving in [A, B] time order to be written to addresses that correspond to [B, A] spatial order. Here’s how the writes occur:
| Cycle i | Time axes | Write to memory address | Explanation |
|---|---|---|---|
| 0 | A=0, B=0 | 1024-1032 (B=0, A=0) | First element: writes to base address |
| 1 | A=0, B=1 | 1048-1056 (B=1, A=0) | Stride 24 bytes forward (next B) |
| 2 | A=0, B=2 | 1072-1080 (B=2, A=0) | Another 24-byte stride |
| 3-4 | A=0, B=3-4 | Continue with B=3,4 | Complete A=0 row |
| 5 | A=1, B=0 | 1032-1040 (B=0, A=1) | Jump to B=0, A=1 (+8 from cycle 0) |
| 6 | A=1, B=1 | 1056-1064 (B=1, A=1) | +24 stride for next B |
| 7-14 | Continue | … | Complete all A=1,2 rows |
Notice how the write pattern interleaves: we write A=0,B=0 then A=0,B=1, but these end up at addresses that place all A values for each B together in the output layout.
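The address pattern in the table can be reproduced with a short host-side sketch (illustration only; it mirrors the sequencer configuration rather than invoking it):

// Reproduce the write addresses implied by [A=3:8, B=5:24, C=8:1] @ 1024 / 8.
// Flits arrive in [A, B] time order and each commit writes 8 bytes.
fn permuted_write_addresses() -> Vec<(u64, u64, u64)> {
    let (base, a_stride, b_stride) = (1024u64, 8u64, 24u64);
    let mut writes = Vec::new();
    for a in 0..3u64 {
        for b in 0..5u64 {
            writes.push((a, b, base + a * a_stride + b * b_stride));
        }
    }
    writes
}

The first entries are (0, 0, 1024), (0, 1, 1048), (0, 2, 1072), and cycle 5 gives (1, 0, 1032), matching the table.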
Output:
After commit completes, SRAM address 1024 onwards contains the tensor with permuted layout:
- Memory layout: `[B=5, A=3, C=2]` with 6-byte padding
- Physical arrangement: All `A` values for `B=0` are contiguous, then all `A` values for `B=1`, etc.
- Address structure:

  1024-1032: (B=0, A=0, C=0-1) + 6 bytes padding
  1032-1040: (B=0, A=1, C=0-1) + 6 bytes padding
  1040-1048: (B=0, A=2, C=0-1) + 6 bytes padding
  1048-1056: (B=1, A=0, C=0-1) + 6 bytes padding
  ...and so on
The permutation is complete: the same 30 data elements that were in [A, B, C] order are now in [B, A, C] order. This operation takes 15 cycles (one per A×B combination) and requires no actual computation—only memory read/write with different address patterns.
Key constraints:
Three constraints govern axis permutation operations:
- 8-byte alignment: `commit_in_size` and `commit_size` are always in 8-byte units, so the target tensor for commit always corresponds to an 8-byte aligned range, naturally creating 8-byte tail alignment (a dummy is added to align the tail to 8 bytes).
- Sequencer limit: Like the Fetch Engine, sequencer entries are limited to 8 total (limit < 65536).
- Non-contiguous writes: Since the write sequencer sets the commit address, committed data need not be contiguous in flit time order - permutations like `AB -> BA` are possible.
Why this example is useful:
Axis permutation is a common requirement in deep learning:
- Tensor layout transformations: Converting between `NCHW` (batch, channels, height, width) and `NHWC` (batch, height, width, channels) formats for different operations
- Matrix transpose: Preparing data for operations that require transposed matrices without actual computation
- Memory access optimization: Reordering axes to make the most frequently accessed dimension innermost for better cache performance
- Inter-operation compatibility: Reformatting tensors to match the input requirements of subsequent operations
The TCP architecture performs these reshapes during data movement without consuming compute resources or requiring separate transpose kernels.
Example 2: Full-flit Commit
This example demonstrates full-flit commit, an optimization that writes entire 32-byte flits directly to memory without slicing them into smaller chunks. Tensor dimensions naturally aligned to 32-byte boundaries eliminate commit slicer overhead and simplify write sequencer configuration.
axes![A = 3, B = 5, C = 2];
// Input: same shape [A, B, C]
let input: DmTensor<f8, m![1], m![1], m![1], m![A, B, C]> = ...;
// Output: merge B and C, pad to 32 bytes per A slice
let output: DmTensor<f8, m![1], m![1], m![1], m![A, [B, C] # 22]> = ctx
.main
.begin(input.view())
.fetch::<f8, m![A], m![[B, C] # 22]>() // Time=[A], Packet=[B,C] padded to 32 bytes
.collect::<m![A], m![[B, C] # 22]>() // Already 32-byte flit, identity collect
.commit(1024); // Full-flit commit: 3 cycles vs 15 in Example 1
Input Tensor: The input tensor is identical to Example 1, but committed with a different memory layout that allows full-flit writes:
- Shape: `[A=3, B=5, C=2]` containing 30 elements
- Data type: `f8` (8-bit floating-point, 1 byte per element)
- Memory layout: `m![A, B, C]` - standard row-major order
- Base address: `b = 0`
- Element size: 1 byte × 30 elements = 30 bytes of data
Output Tensor (Target): Instead of permuting axes like Example 1, we merge the last two dimensions and add padding:
- Shape: Still `[A=3, B=5, C=2]` logically, but stored as `[A=3, BC=10]`
- Data type: `f8` (unchanged)
- Memory layout: `m![A, [B, C] # 22]` - merge the B and C dimensions, add 22 bytes of padding
- Base address: `b = 1024`
- Physical layout: Each `A` iteration stores 10 data bytes (B×C) plus 22 padding bytes = 32 bytes total
- The 32-byte size per `A` slice perfectly matches the hardware flit size, enabling full-flit writes
Processing:
Data dimensions aligned with hardware flit size enable a simpler pipeline than Example 1:
- Fetch Sequencer: Reads input and pads to 32-byte packets immediately
  - Time dimension: `Time = m![A]` - 3 cycles (one per `A` slice, matching the kernel above)
  - Packet dimension: `Packet = m![[B, C] # 22]` - merges B and C, adds 22 bytes padding to reach 32 bytes
  - Fetch size: 32 bytes per cycle (full packet, not split)
  - The sequencer pads from 10 data bytes to 32 bytes during fetch
- Collect Engine: Receives 32-byte packets and passes them through as 32-byte flits
  - Time dimension: `Time = m![A]` - just 3 cycles since B and C are merged into the packet
  - Flit dimension: `Flit = m![[B, C] # 22]` - full 32-byte flit with no additional padding needed
  - No reformatting required: packet size = flit size = 32 bytes
- Commit Unit: Writes full 32-byte flits directly to memory without slicing
  - Receives 32-byte flits and writes them as complete 32-byte units
  - commit_in_size = 32 bytes: no slicer operation needed
  - commit_size = 32 bytes: each write operation handles a full flit
  - Time: only 3 cycles (one per `A`), much faster than Example 1's 15 cycles
The write sequencer configuration is simple: `[A=3:32, [B,C]=32:1] @ 1024 / 32`
- `A=3:32` means loop 3 times with a 32-byte stride (one full flit per A)
- `[B,C]=32:1` means write 32 bytes with stride 1 (continuous write of the flit contents)
- Each cycle writes one complete flit: cycle 0 writes the flit for A=0, cycle 1 for A=1, cycle 2 for A=2 (see the sketch below)
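As a host-side illustration of those three writes, here is a minimal Rust sketch that replays the full-flit configuration `[A=3:32, [B,C]=32:1] @ 1024 / 32`; names and loop structure are for illustration only.

```rust
fn main() {
    // Full-flit commit: [A=3:32, [B,C]=32:1] @ 1024 / 32
    let base: usize = 1024;
    let flit: usize = 32; // one full flit per commit
    for a in 0..3 {
        let addr = base + a * 32;
        println!(
            "cycle {}: A={} -> bytes {}..{} (10 data + 22 padding)",
            a, a, addr, addr + flit
        );
    }
}
```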
Output:
After commit, SRAM address 1024 onwards contains the tensor packed into 32-byte-aligned blocks:
- Memory layout: `[A=3, BC=10+padding]` with each `A` slice occupying exactly 32 bytes
- Physical structure:
  - 1024-1056: (A=0, all 10 B×C elements) + 22 bytes padding = 32 bytes
  - 1056-1088: (A=1, all 10 B×C elements) + 22 bytes padding = 32 bytes
  - 1088-1120: (A=2, all 10 B×C elements) + 22 bytes padding = 32 bytes
- Performance: Only 3 write cycles vs 15 in Example 1 (5× faster)
- Simplicity: No slicing overhead, no complex stride patterns
Why this example is useful:
Full-flit commit demonstrates an important optimization strategy:
- Alignment optimization: When you can pad dimensions to 32-byte boundaries, commit becomes much more efficient
- Reduced cycles: Fewer, larger writes complete faster than many small writes
- Hardware efficiency: Writing full flits maximizes memory bandwidth utilization
- Design principle: Sometimes adding padding to align with hardware granularity improves overall performance
This technique is particularly valuable for:
- Small tensors where padding overhead is minimal compared to the benefit
- Intermediate results that don’t need compact storage
- Situations where downstream operations also benefit from 32-byte alignment
Key constraint: Write sequencer configurations require non-zero stride for all entries. This means you cannot discard data beyond slicing (no selective writes), and broadcast (reuse) operations are not possible during commit.
Example 3: Tail Padding and Fetch Size
The amount of tail padding dramatically affects fetch/commit efficiency.
In the mapping expression m![A # 72], the # 72 pads A up to 72 elements; the pads are referred to as dummy in the hardware configuration below.
Understanding padding interaction with hardware fetch size constraints enables optimization between memory usage (less padding) and performance (more padding aligned to hardware boundaries).
axes![A = 65, B = 2];
let input: DmTensor<f8, m![1], m![1], m![1], m![B, A # 72]> = ...;
// Option 1: dummy=7, fetch_size=24 bytes, 6 cycles
let out_7: DmTensor<f8, m![1], m![1], m![1], m![B, A # 72]> = ctx.main
.begin(input.view())
.fetch::<f8, m![B * (A # 72) / 24], m![A % 24]>()
.collect::<m![B * (A # 72) / 24], m![A % 24 # 8]>()
.commit(1024);
// Option 2: dummy=31, fetch_size=32 bytes, 6 cycles (best performance)
let out_31: DmTensor<f8, m![1], m![1], m![1], m![B, A # 96]> = ctx.main
.begin(input.view())
.fetch::<f8, m![B * (A # 96) / 32], m![A % 32]>()
.collect::<m![B * (A # 96) / 32], m![A % 32]>()
.commit(1024);
The Problem:
Commit a tensor with shape [A=65, B=2] (130 bytes of data). Hardware fetch sizes must be 8, 16, 24, or 32 bytes.
Determine the padding amount for dimension A to maximize performance.
Input Tensor:
- Shape: `[A=65, B=2]` - 130 elements (65 elements across the A dimension, 2 across B)
- Data type: `f8` (1 byte per element)
- Memory layout: `m![B, A # 7]` - stored with 7 bytes of tail padding after A
- Base address: `b = 0`
- Total size: `2 × (65 + 7) = 144 bytes` (includes padding)
Output Tensor (Variable Padding): The target can have different padding amounts, each enabling different fetch sizes:
- Shape: `[A=65, B=2]` (same logical data)
- Data type: `f8`
- Base address: `b = 1024`
- Memory layout options:
  - `m![B, A # 7]`: 7 bytes padding → enables 24-byte fetch size
  - `m![B, A # 15]`: 15 bytes padding → enables 16-byte fetch size
  - `m![B, A # 23]`: 23 bytes padding → enables 8-byte fetch size (worst)
  - `m![B, A # 31]`: 31 bytes padding → enables 32-byte fetch size (best)
The optimal fetch size unit varies depending on the tail dummy value.
The following subsections show each case:
dummy = 7
- fetch sequencer output
  - `Time = m![B * (A # 7) / 24]`
  - `Flit = m![A % 24]`
  - fetch_size = 24 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 7) / 24]`
  - `Flit = m![A % 24 # 8]`
- Commit Unit
  - commit_in_size = 24 bytes
  - sliced shape
    - `Time = m![B * (A # 7) / 24]`
    - `Flit = m![A % 24]`
  - write sequencer configuration
    - `m![B, A # 7]` -> `m![B * (A # 7) / 24 * A % 24]`
    - Sequencer configuration: `[B=2:72, (A # 7)/24=3:24, A=24:1] @ 1024 / 24`
dummy = 15
- fetch sequencer output
  - `Time = m![B * (A # 15) / 16]`
  - `Flit = m![A % 16]`
  - fetch_size = 16 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 15) / 16]`
  - `Flit = m![A % 16]`
- Commit Unit
  - commit_in_size = 16 bytes
  - sliced shape
    - `Time = m![B * (A # 15) / 16]`
    - `Flit = m![A % 16]`
  - write sequencer configuration
    - `m![B, A # 15]` -> `m![B * (A # 15) / 16 * A % 16]`
    - Sequencer configuration: `[B=2:80, (A # 15)/16=5:16, A=16:1] @ 1024 / 16`
dummy = 23
- fetch sequencer output
  - `Time = m![B * (A # 23) / 8]`
  - `Flit = m![A % 8]`
  - fetch_size = 8 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 23) / 8]`
  - `Flit = m![A % 8 # 24]`
- Commit Unit
  - commit_in_size = 8 bytes
  - sliced shape
    - `Time = m![B * (A # 23) / 8]`
    - `Flit = m![A % 8]`
  - write sequencer configuration
    - `m![B, A # 23]` -> `m![B * (A # 23) / 8 * A % 8]`
    - Sequencer configuration: `[B=2:88, (A # 23)/8=11:8, A=8:1] @ 1024 / 8`
dummy = 31
- fetch sequencer output
  - `Time = m![B * (A # 31) / 32]`
  - `Flit = m![A % 32]`
  - fetch_size = 32 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 31) / 32]`
  - `Flit = m![A % 32]`
- Commit Unit
  - commit_in_size = 32 bytes
  - sliced shape
    - `Time = m![B * (A # 31) / 32]`
    - `Flit = m![A % 32]`
  - write sequencer configuration
    - `m![B, A # 31]` -> `m![B * (A # 31) / 32 * A % 32]`
    - Sequencer configuration: `[B=2:96, (A # 31)/32=3:32, A=32:1] @ 1024 / 32`
Summary: The Impact of Padding Choice
The following table summarizes how output tail padding affects performance:
| Padding (dummy) | fetch_size | Fetch cycles | Memory overhead | Efficiency |
|---|---|---|---|---|
| 7 | 24 bytes | 6 cycles | 14 bytes (9.7%) | Good |
| 15 | 16 bytes | 10 cycles | 30 bytes (18.8%) | Moderate |
| 23 | 8 bytes | 22 cycles | 46 bytes (26.1%) | Poor |
| 31 | 32 bytes | 6 cycles | 62 bytes (32.3%) | Best |
Key Insights:
- Performance varies dramatically: `dummy=23` requires 22 cycles (8-byte fetches) while `dummy=31` requires only 6 cycles (32-byte fetches) - nearly 4× faster despite using similar amounts of padding
- Optimal padding aligns with hardware: The best performance comes when `(data_size + padding)` is divisible by 32 bytes (the largest fetch size)
- Trade-off: Adding 8 more bytes of padding (23→31) increases memory overhead from 26.1% to 32.3% (about 6 percentage points) but improves performance by 3.7× (22 cycles → 6 cycles)
- Design principle: Prefer padding amounts that enable the largest possible fetch size (32 bytes), even with slightly more memory waste
Why this example is useful:
Naive padding choices cause severe performance degradation:
- Padding to arbitrary values like 23 bytes forces small 8-byte fetches
- Understanding fetch size constraints enables strategic padding choices
- Pad to the next multiple of 32 bytes when possible
- The memory cost of better padding is usually negligible compared to the performance gain
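To see where the numbers in the summary table come from, here is a small host-side Rust sketch that, for each dummy value, picks the largest hardware fetch size (8, 16, 24, or 32 bytes) that evenly divides the padded A run and then derives the cycle count and overhead. The selection rule is a simplification for this example, not a statement of the compiler's actual policy.

```rust
fn main() {
    // Reproduce the Example 3 summary table: A = 65, B = 2, f8 elements (1 byte each).
    let (a, b) = (65usize, 2usize);
    let fetch_sizes = [32usize, 24, 16, 8]; // hardware fetch sizes, largest first

    for dummy in [7usize, 15, 23, 31] {
        let padded = a + dummy;
        // Largest fetch size that evenly divides the padded A run.
        let fetch = fetch_sizes
            .iter()
            .copied()
            .find(|f| padded % *f == 0)
            .expect("padded size must be divisible by some fetch size");
        let cycles = b * (padded / fetch);
        let total = b * padded;
        let overhead = b * dummy;
        println!(
            "dummy={:2}: fetch_size={:2}B, cycles={:2}, overhead={}B ({:.1}%)",
            dummy,
            fetch,
            cycles,
            overhead,
            100.0 * overhead as f64 / total as f64
        );
    }
}
```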
Why dummy=23 Cannot Use 32-byte Commits
The dummy=23 case cannot use 32-byte commits because write sequencer configurations must never exceed tensor boundaries:
- fetch sequencer output
  - `Time = m![B * (A # 31) / 32]`: Since A + 23 is not divisible by 32, setting fetch_size = 32 bytes requires fetching A + 31 elements.
  - `Flit = m![A % 32]`
  - fetch_size = 32 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![B * (A # 31) / 32]`
  - `Flit = m![A % 32]`
- Commit Unit
  - commit_in_size = 32 bytes (if commit_in_size < 32, it would cut off part of the valid A range, so commit_in_size must be 32 bytes)
  - sliced shape
    - `Time = m![B * (A # 31) / 32]`
    - `Flit = m![A % 32]`
  - write sequencer configuration
    - `m![B, A # 23]` -> `m![B * (A # 31) / 32 * A % 32]`
    - Sequencer configuration: `[B=2:88, (A # 31)/32=3:32, A=32:1] @ 1024 / 32` - each B row would write 3 × 32 = 96 bytes into an 88-byte row of the `A # 23` output, overrunning the tensor boundary
Key insight: Read sequencer configurations can safely overfetch (reading dummy addresses beyond the input tensor range is acceptable), but write sequencer configurations must never write beyond the tensor boundary (as it could write data to space occupied by other tensors).
This asymmetry is why dummy=23 cannot use 32-byte commits.
Example 4: Tensor Segmentation
TCP handles tensors exceeding Vector Register File (VRF) capacity through segmentation. Tensors too large to process in a single execution are automatically split into smaller chunks that fit within hardware constraints, with each chunk processed independently.
axes![A = 2048, B = 32];
// Input: 64KB tensor, exceeds 8KB VRF limit
let input: DmTensor<f8, m![1], m![1], m![1], m![A, B]> = ...;
// Segmented into two executions (compiler handles this automatically)
// Execution #0: first half of A
let seg_0: DmTensor<f8, m![1], m![1], m![1], m![A % 1024, B]> = ctx.main
.begin(input.view().slice(A, 0..1024))
.fetch::<f8, m![A % 1024], m![B]>()
.collect::<m![A % 1024], m![B]>()
.commit(256 * 1024); // Write to 256K
// Execution #1: second half of A
let seg_1: DmTensor<f8, m![1], m![1], m![1], m![A @ 1024, B]> = ctx.main
.begin(input.view().slice(A, 1024..2048))
.fetch::<f8, m![A @ 1024], m![B]>()
.collect::<m![A @ 1024], m![B]>()
.commit(256 * 1024 + 32 * 1024); // Write to 256K + 32K
The Problem: The Vector Engine’s VRF has only 8KB of capacity per slice. Tensors requiring more storage than this cannot be fetched and processed in one operation. Segmentation splits the tensor across multiple executions.
The 8KB limit is per-slice, not total. While a cluster has 256 slices (2MB total VRF), each slice holds only 8KB. Tensor distribution across slices is controlled by the slice dimension in the tensor mapping. Tensors without enough elements mapped to the slice dimension, or operations requiring entire rows/columns in individual slices (common in reduction operations), hit the per-slice limit before using all 256 slices. A [2048, 32] tensor with mapping m![1, 1, 1, 2048, 32] (no slice distribution) attempts to store all 64KB in slice 0, exceeding the 8KB limit. Even with slice distribution m![1, 1, 256, 8, 32], each slice stores only 256 bytes, but intermediate results or operation constraints may require more per-slice storage. Segmentation ensures each slice’s VRF usage stays within the 8KB hardware limit.
Input Tensor: A large 2D tensor that exceeds VRF capacity:
- Shape: `[A=2048, B=32]` - 65,536 elements
- Data type: `f8` (1 byte per element)
- Total size: 2048 × 32 = 65,536 bytes = 64 KB
- Memory layout: `m![A, B]` - standard row-major
- Base address: `b = 0`
- Problem: 64 KB far exceeds the 8 KB VRF limit per slice
Output Tensor: The same tensor needs to be written to a different SRAM location:
- Shape: `[A=2048, B=32]` (identical)
- Data type: `f8`
- Memory layout: `m![A, B]`
- Base address: `b = 256K` (different location)
Solution Strategy:
Split the A dimension into two segments:
- Segment 1: `A % 1024` (first 1024 elements) = 1024 × 32 = 32 KB
- Segment 2: `A @ 1024` (second 1024 elements) = 1024 × 32 = 32 KB
- Each 32 KB segment still far exceeds the 8 KB per-slice VRF limit on its own

This requires further splitting or distributing dimensions across slices. Segmentation processes arbitrarily large tensors by dividing them into hardware-manageable chunks; a sketch of the host-side bookkeeping follows.
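The following Rust sketch illustrates, on the host, the kind of bookkeeping the segmentation performs for this example: splitting the A dimension into segments and computing each execution's commit base address. The 32 KB per-execution budget and the variable names are assumptions made for this sketch, not the compiler's actual policy.

```rust
fn main() {
    // Illustrative segmentation bookkeeping for a [A=2048, B=32] f8 tensor.
    let (a, b) = (2048usize, 32usize);
    let bytes_per_a_row = b; // 32 bytes per A index
    let segment_budget = 32 * 1024; // bytes committed per execution in this example
    let rows_per_segment = segment_budget / bytes_per_a_row; // 1024 A rows

    let out_base = 256 * 1024; // output base address (256K, as in the example)
    let mut start = 0;
    let mut exec = 0;
    while start < a {
        let end = (start + rows_per_segment).min(a);
        let commit_base = out_base + start * bytes_per_a_row;
        println!(
            "execution #{exec}: A[{start}..{end}] -> commit at {commit_base} ({} bytes)",
            (end - start) * bytes_per_a_row
        );
        start = end;
        exec += 1;
    }
}
```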
Processing:
Process the tensor in two separate executions:
Execution #0
- fetch sequencer output
  - `Time = m![A % 1024]`
  - `Flit = m![B]`
  - fetch_size = 32 bytes
- Switch Engine output (= commit unit input)
  - `Time = m![A % 1024]`
  - `Flit = m![B]`
- Commit Unit
  - commit_in_size = 32 bytes
  - sliced shape
    - `Time = m![A % 1024]`
    - `Flit = m![B]`
  - write sequencer configuration
    - `m![A, B]` -> `m![(A % 1024), B]`
    - Sequencer configuration: `[A%1024=1024:32, B=32:1] @ 256K / 32`
  - From the entire output tensor with mapping `m![A, B]`, only the first half is fetched and committed.
Execution #1 (Second Half)
Processes the second half of the A dimension:
- Fetch sequencer output:
  - `Time = m![A @ 1024]` - time dimension covers A elements 1024-2047
  - `Flit = m![B]` - 32-byte packets containing the full B dimension
  - fetch_size = 32 bytes
- Switch Engine output:
  - `Time = m![A @ 1024]` - 1024 cycles for the second half of A
  - `Flit = m![B]` - 32-byte flits
- Commit Unit:
  - commit_in_size = 32 bytes (full-flit commit)
  - Write sequencer configuration: `[A@1024=1024:32, B=32:1] @ (256K + 32 * 1024) / 32`
  - Base address: `256K + 32KB` (offset to skip the first segment)
  - Writes to addresses 256K+32KB through 256K+64KB
Output:
After both executions complete, the output tensor is reconstructed:
- Memory layout: SRAM starting at address 256K contains the complete tensor `[A=2048, B=32]`
- Segment 1: Addresses 256K to 256K+32KB hold `A[0:1023], B[0:31]`
- Segment 2: Addresses 256K+32KB to 256K+64KB hold `A[1024:2047], B[0:31]`
- Result: Logically identical to the input, just stored at a different location
The compiler automatically determines segmentation requirements and splits tensors into multiple executions. From the programmer’s perspective, this is a single logical operation; the segmentation is transparent.
Why this example is useful:
Tensor segmentation is essential for practical deep learning workloads:
- Large model support: Modern LLMs have tensors with billions of elements that cannot fit in VRF
- Automatic handling: The compiler manages segmentation automatically based on VRF capacity
- No performance penalty for well-designed splits: When segment boundaries align with memory access patterns, segmentation adds minimal overhead
- Scalability: This mechanism enables processing tensors of arbitrary size on fixed hardware
- Memory hierarchy exploitation: Segmentation naturally maps to hierarchical memory systems (VRF → SRAM → HBM)
In practice, the compiler considers multiple factors when segmenting:
- VRF capacity constraints
- Memory bandwidth utilization
- Alignment with tensor unit requirements
- Minimizing the number of segments to reduce overhead
Transformer Architecture
This page uses Llama 3 70B as a concrete example to show how each transformer operation maps to specific TCP hardware components. Llama 3 70B implements a decoder-only transformer architecture with two main phases: prefill (input encoding) and decode (token generation).
Model Parameters
The following parameters define the Llama 3 70B architecture, grouped by category:
Sequence dimensions (control input/output length):
- `B`: batch size
- `s_in`: input sequence length
- `s_max`: maximum sequence length / context length
- `s`: total sequence length processed so far (prefill + decode)
Model size (vocabulary and layer counts):
- `V = 128256`: vocab size
- `D = 8192`: hidden dimension / size of embedding
- `F = 28672`: intermediate dimension for the FFN up projection
- `L = 80`: number of layers
Attention head dimensions (how attention is partitioned):
- `h_q = 64`: number of query heads
- `h_kv = 8`: number of key/value heads
- `G = 8`: number of attention groups (= h_q / h_kv)
- `d_k = 128`: head dimension (equal to `D / h_q`)
- `d_k_prime = 64`: split head dimension for RoPE computation
- `f = 2`: frequency dimension for adjacent heads (`d_k = d_k_prime * f`)
Prefill Phase
The prefill phase processes the entire input sequence in parallel, outputting the first token while storing computed Key/Value pairs as KV cache. The transformer block executes on all input tokens provided by the user. The following subsections describe each step in order.
1. Embedding Lookup
Embedding lookup converts input tokens to vector space representations.
- Input
  - `input: shape![B, s_in]`
  - Token indices of the input text (which vocabulary entry each token corresponds to)
- Weight
  - `w_emb: shape![V, D]`
  - Pre-trained embedding value table for each vocabulary entry
- Output
  - `x_0: shape![B, s_in, D]`
- Operation
  - `x_0 = gather(index: input, table: w_emb)`
  - gather: Operation that reads values from the table using index values specified in the index tensor.
  - Processed by TensorDMA.
2. Transformer Layers (repeated L times)
Each transformer layer applies attention and feed-forward operations sequentially.
For each layer l = 1, ..., L, perform the following:
2.1. Input Layer Normalization
Input layer normalization stabilizes training by normalizing activations before attention.
- Input
  - `x_prev: shape![B, s_in, D]`
- Output
  - `x_norm: shape![B, s_in, D]`
- Operation
  - Apply RMSNorm: `x_norm = RMSNorm(x_prev)`
  - RMSNorm: Root Mean Square Layer Normalization
  - Processed by Vector Engine (see the reference sketch below).
2.2. Multi-Head Grouped Query Attention (GQA)
Grouped Query Attention (GQA) improves memory efficiency by sharing key/value heads across multiple query heads, reducing KV cache size.
2.2.1. QKV Projection
QKV projection transforms the normalized input into Query, Key, and Value tensors.
- Input
  - `x_norm: shape![B, s_in, D]`
- Weights
  - `w_q: shape![D, h_q, d_k]`
  - `w_k: shape![D, h_kv, d_k]`
  - `w_v: shape![D, h_kv, d_k]`
- Outputs
  - `Q: shape![B, s_in, h_q, d_k]`
  - `K: shape![B, s_in, h_kv, d_k]`
  - `V: shape![B, s_in, h_kv, d_k]`
- Operations
  - `Q = einsum(x_norm, w_q)`
  - `K = einsum(x_norm, w_k)`
  - `V = einsum(x_norm, w_v)`
  - matmul corresponds to einsum (= broadcast + elementwise mul + reduce add).
    - elementwise mul: Contraction Engine
    - reduce add
      - packet reduce: Reducer
      - time reduce: Reducer
      - slice reduce: global adder tree
      - split reduce: interleaved fetch + Vector Engine binary op
      - cluster/chip reduce: DMA + interleaved fetch + Vector Engine binary op
2.2.2. Rotary Position Embedding (RoPE)
Rotary Position Embedding (RoPE) applies positional information to Query and Key tensors through rotation transformations.
- Inputs
  - `Q: shape![B, s_in, h_q, d_k]`
  - `K: shape![B, s_in, h_kv, d_k]`
  - `d_k = d_k_prime * f`
    - Split the `d_k` axis to apply the RoPE rotation in a TCP-friendly manner.
- RoPE table
  - `w_rope: shape![s_max, d_k_prime, 2, 2]`
  - Pre-computed table of cos/sin values based on sequence position and head position.
  - The RoPE operation groups consecutive pairs among the `d_k` values and applies a rotation transformation using cos/sin.
  - Store the 2 × 2 matrix representing the cos/sin rotation transformation for TCP-friendly execution.
- Position
  - `position: shape![s_in]`
  - `position(i) = i`
- Outputs
  - `Q_rope: shape![B, h_q, s_in, d_k]`
  - `K_rope: shape![B, h_kv, s_in, d_k]`
- Operations
  - RoPE table lookup
    - `t_rope: shape![s_in, d_k_prime, 2, 2] = gather(index: position, table: w_rope)`
  - Apply RoPE
    - RoPE computation reduces to a simple einsum operation given the prepared rotation transformation matrix values.
    - Reshape (noop)
      - `Q: shape![B, s_in, h_q, d_k] == shape![B, s_in, h_q, d_k_prime, f]`
      - `K: shape![B, s_in, h_kv, d_k] == shape![B, s_in, h_kv, d_k_prime, f]`
      - `t_rope: shape![s_in, d_k_prime, 2, 2] == shape![s_in, d_k_prime, f, 2]`
    - einsum
      - `Q_rope = einsum(Q, t_rope)`
        - `(shape![B, s_in, h_q, d_k_prime, f], shape![s_in, d_k_prime, f, 2]) -> shape![B, h_q, s_in, d_k_prime, 2] == shape![B, h_q, s_in, d_k]`
      - `K_rope = einsum(K, t_rope)`
        - `(shape![B, s_in, h_kv, d_k_prime, f], shape![s_in, d_k_prime, f, 2]) -> shape![B, h_kv, s_in, d_k_prime, 2] == shape![B, h_kv, s_in, d_k]`
As a result of RoPE, Q/K values encode relative positional information.
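To illustrate the table-based formulation, the following Rust sketch builds the 2 × 2 rotation blocks for a single position and applies them to one head vector viewed as `[d_k_prime, f]` with `f = 2`. The base frequency 10000 is the conventional RoPE choice and the index order of the 2 × 2 block is a guess made for this sketch; neither is specified by the text above.

```rust
fn main() {
    // Build a [d_k_prime, 2, 2] rotation table for one position and apply it to a
    // query vector split into consecutive pairs (q[2i], q[2i+1]).
    let d_k = 8usize;
    let d_k_prime = d_k / 2;
    let pos = 3usize;

    // w_rope[pos][i] ~ [[cos, -sin], [sin, cos]] for angle = pos * theta_i (assumed layout).
    let mut table = vec![[[0.0f32; 2]; 2]; d_k_prime];
    for i in 0..d_k_prime {
        let theta = (pos as f32) / 10000f32.powf(2.0 * i as f32 / d_k as f32);
        let (s, c) = theta.sin_cos();
        table[i] = [[c, -s], [s, c]];
    }

    let q: Vec<f32> = (0..d_k).map(|i| i as f32 * 0.1).collect();
    let mut q_rope = vec![0.0f32; d_k];
    for i in 0..d_k_prime {
        let (x0, x1) = (q[2 * i], q[2 * i + 1]);
        // Contraction over the f axis: q_rope[i][r] = sum_f q[i][f] * rot[i][r][f]
        q_rope[2 * i] = table[i][0][0] * x0 + table[i][0][1] * x1;
        q_rope[2 * i + 1] = table[i][1][0] * x0 + table[i][1][1] * x1;
    }
    println!("{:?}", q_rope);
}
```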
2.2.3. Store in KV Cache
KV cache stores the current layer’s Key and Value for reuse during the decode phase, avoiding redundant computation.
- Inputs
  - `K_rope: shape![B, h_kv, s_in, d_k]`
  - `V: shape![B, s_in, h_kv, d_k]`
- KV Cache (for layer `l`)
  - `kv_cache_l_K: shape![B, h_kv, s_in, d_k]`
  - `kv_cache_l_V: shape![B, h_kv, s_in, d_k]`
- Operations
  - `kv_cache_l_K = K_rope`
  - `kv_cache_l_V = V`
  - Cache storage: Stores the einsum computation results from DM to HBM, processed by TensorDMA.
2.2.4. Grouped Query Attention Computation
Grouped Query Attention shares each key/value head across multiple query heads.
Each of the 8 KV heads is shared with 8 Query heads (G = h_q / h_kv = 64 / 8 = 8).
2.2.4.1. Attention Scores Computation
Attention scores measure the relevance between query and key positions using dot product similarity.
- Inputs
  - `Q_rope: shape![B, h_q, s_in, d_k]`
  - `K_rope: shape![B, h_kv, s_in, d_k]`
- Output
  - `scores: shape![B, h_q, s_in, s_in]`
- Operations
  - `scores = (Q_rope @ K_rope.T) / sqrt(d_k)`
  - Reshape (noop)
    - The dot product operation can be expressed as an einsum. Each tensor's shape axes must be precisely distinguished from the output shape's perspective to accurately represent the einsum operation's semantics.
    - `Q_rope: shape![B, h_q, s_in, d_k] == shape![B, G, h_kv, s_in_q, d_k]`
    - `K_rope: shape![B, h_kv, s_in, d_k] == shape![B, h_kv, s_in_k, d_k]`
  - einsum
    - `scores_before_normalize = einsum(Q_rope, K_rope)`
      - `(shape![B, G, h_kv, s_in_q, d_k], shape![B, h_kv, s_in_k, d_k]) -> shape![B, G, h_kv, s_in_q, s_in_k] == shape![B, h_q, s_in, s_in]`
    - The einsum expression shows that `G` was broadcast from `K_rope`, and `d_k` was reduced.
  - Normalize
    - `scores = scores_before_normalize / sqrt(d_k)`
    - Division by `sqrt(d_k)` can be computed as multiplication by `1/sqrt(d_k)`. The value `1/sqrt(d_k)` is pre-computed, and the Vector Engine performs a simple constant multiplication (see the reference sketch below).
2.2.4.2. Causal Mask Application
Causal masking prevents tokens from attending to future positions, preserving autoregressive semantics.
In the prefill phase, s_in tokens are processed in parallel, but the i-th token must not reference tokens after position i to maintain the autoregressive model’s semantics.
- Input
  - `scores: shape![B, h_q, s_in, s_in]`
  - `attention_mask: shape![s_in, s_in]`
    - `attention_mask(i, j) = true if j <= i, false if j > i`
- Output
  - `scores_masked: shape![B, h_q, s_in, s_in]`
- Operation
  - `scores_masked(b, h, i, j) = scores(b, h, i, j) if j <= i, -inf if j > i`
  - In the Vector Engine, the `attention_mask` tensor is written to the branch log, then processed through branched operations.
2.2.4.3. Softmax Application
Softmax normalizes attention scores into a probability distribution over key positions.
- Input
  - `scores_masked: shape![B, h_q, s_in, s_in]`
- Output
  - `attn_weights: shape![B, h_q, s_in, s_in]`
- Operation
  - `attn_weights = softmax(scores_masked)`
  - Softmax computes the ratio at which each query should reference each token to combine values.
  - Reduces over the key-corresponding axis among the two `s_in` dimensions.
  - `softmax(x)_i = exp(x_i) / sum_j(exp(x_j))`
  - Processed by Vector Engine (a reference sketch of the masked softmax follows).
2.2.4.4. Weighted Sum (Attention Output)
Weighted sum computes the attention output by combining Value vectors according to attention weights.
- Inputs
  - `attn_weights: shape![B, h_q, s_in, s_in]`
  - `V: shape![B, s_in, h_kv, d_k]`
- Output
  - `attn_output: shape![B, h_q, s_in, d_k]`
- Operations
  - Reshape (noop)
    - `attn_weights: shape![B, h_q, s_in, s_in] == shape![B, G, h_kv, s_in_q, s_in_kv]`
    - `V: shape![B, s_in, h_kv, d_k] == shape![B, h_kv, s_in_kv, d_k]`
  - einsum
    - `attn_output = einsum(attn_weights, V)`
      - `(shape![B, G, h_kv, s_in_q, s_in_kv], shape![B, h_kv, s_in_kv, d_k]) -> shape![B, G, h_kv, s_in_q, d_k] == shape![B, h_q, s_in, d_k]`
    - The einsum expression shows that `G` was broadcast from `V`, and `s_in_kv` was reduced.
2.2.5. Output Projection
Output projection combines the multi-head attention results into a single hidden state vector.
- Input
  - `attn_output: shape![B, h_q, s_in, d_k]`
- Weight
  - `w_o: shape![h_q, d_k, D]`
- Output
  - `attn_out: shape![B, s_in, D]`
- Operations
  - `attn_out = einsum(attn_output, w_o)`
    - `(shape![B, h_q, s_in, d_k], shape![h_q, d_k, D]) -> shape![B, s_in, D]`
2.2.6. Residual Connection
Residual connection adds the attention output to the layer input, improving gradient flow during training.
- Inputs
  - `x_prev: shape![B, s_in, D]`
  - `attn_out: shape![B, s_in, D]`
- Output
  - `x_attn: shape![B, s_in, D]`
- Operation
  - `x_attn = x_prev + attn_out`
  - elementwise addition: Processed by Vector Engine
2.3. Feed-Forward Network (FFN)
The Feed-Forward Network applies non-linear transformations to each token independently after attention.
2.3.1. Post-Attention Layer Normalization
Post-attention normalization stabilizes activations before the FFN computation.
- Input
  - `x_attn: shape![B, s_in, D]`
- Output
  - `x_ffn_norm: shape![B, s_in, D]`
- Operation
  - `x_ffn_norm = RMSNorm(x_attn)`
  - RMSNorm: Processed by Vector Engine
2.3.2. SwiGLU FFN
SwiGLU (Swish-Gated Linear Unit) is Llama 3’s activation function, combining gating with the Swish non-linearity.
- Input
  - `x_ffn_norm: shape![B, s_in, D]`
- Weights
  - `w_gate: shape![D, F]`
  - `w_up: shape![D, F]`
  - `w_down: shape![F, D]`
- Output
  - `ffn_out: shape![B, s_in, D]`
- Operations
  - Gate projection: `gate = einsum(x_ffn_norm, w_gate)`
    - `(shape![B, s_in, D], shape![D, F]) -> shape![B, s_in, F]`
  - Up projection: `up = einsum(x_ffn_norm, w_up)`
    - `(shape![B, s_in, D], shape![D, F]) -> shape![B, s_in, F]`
  - SwiGLU activation: `activated = SiLU(gate) * up`
    - SiLU (Swish): `SiLU(x) = x * sigmoid(x)`
    - `*`: element-wise multiplication
    - Processed by Vector Engine (see the activation sketch below)
  - Down projection: `ffn_out = einsum(activated, w_down)`
    - `(shape![B, s_in, F], shape![F, D]) -> shape![B, s_in, D]`
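As an element-wise host reference for the activation step, here is a minimal Rust sketch of `activated = SiLU(gate) * up` on toy vectors.

```rust
fn main() {
    // Element-wise reference for SwiGLU: activated = SiLU(gate) * up.
    fn silu(x: f32) -> f32 {
        x / (1.0 + (-x).exp()) // equivalent to x * sigmoid(x)
    }

    let gate = vec![0.5f32, -1.0, 2.0];
    let up = vec![1.0f32, 3.0, -0.5];
    let activated: Vec<f32> = gate.iter().zip(&up).map(|(&g, &u)| silu(g) * u).collect();
    println!("{:?}", activated);
}
```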
2.3.3. Residual Connection
FFN residual connection adds the FFN output to the post-attention output.
- Inputs
  - `x_attn: shape![B, s_in, D]`
  - `ffn_out: shape![B, s_in, D]`
- Output
  - `x_l: shape![B, s_in, D]`
- Operation
  - `x_l = x_attn + ffn_out`
  - elementwise addition: Processed by Vector Engine
3. Final Layer Normalization
Final layer normalization is applied after passing through all 80 transformer layers.
- Input
  - `x_L: shape![B, s_in, D]`
- Output
  - `x_final: shape![B, s_in, D]`
- Operation
  - `x_final = RMSNorm(x_L)`
  - RMSNorm: Processed by Vector Engine
4. Language Model Head (Output Layer)
The language model head converts the hidden state at the last token position into vocabulary logits for next-token prediction.
- Input
  - `x_final: shape![B, s_in, D]`
- Weight
  - `w_lm_head: shape![D, V]`
  - Typically `w_lm_head = w_emb.T` (weight tying)
- Output
  - `logits: shape![B, V]`
- Operations
  - Slice: In the prefill phase, only the last token is used
    - `x_last: shape![B, D] = x_final[:, -1, :]`
    - Extract only the hidden state of the last token to predict the next token
    - Process the slice as a simple view operation depending on the shape, or use parallel copy to directly read and move a portion of the data.
  - einsum: Logit computation over the vocabulary
    - `logits = einsum(x_last, w_lm_head)`
    - `(shape![B, D], shape![D, V]) -> shape![B, V]`
5. Sampling
Sampling converts logit values into a probability distribution and selects the next token. This process occurs on the Host, not the TCP.
- Input
  - `logits: shape![B, V]`
  - `temperature: scalar` (sampling temperature parameter, typically 0.7~1.0)
- Output
  - `next_token` (the sampled token index for each batch element)
- Operations
  - Temperature scaling: `logits_scaled = logits / temperature`
    - Higher temperature leads to more diverse token selection; lower temperature leads to more deterministic selection
    - The value `1/temperature` is pre-computed, then applied as a constant multiplication in the Vector Engine
  - Softmax: `probs: shape![B, V] = softmax(logits_scaled)`
    - `softmax(x)_i = exp(x_i) / sum_j(exp(x_j))`
    - Apply softmax over the vocabulary axis (`V`)
  - Token sampling (see the host-side sketch below):
    - Sample the next token index from the probability distribution `probs`
    - Sampling strategies:
      - Greedy: `next_token = argmax_i(probs_i)`
      - Top-k sampling: Sample only from the top k tokens by probability
      - Top-p (nucleus) sampling: Sample from the smallest token set whose cumulative probability exceeds p
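The following host-side Rust sketch walks through temperature scaling, softmax over the vocabulary axis, and greedy selection for one batch element. Real top-k/top-p sampling would additionally draw from a truncated distribution; that is omitted here to keep the sketch dependency-free, and the toy logits are assumptions.

```rust
fn main() {
    // Host-side sampling sketch: temperature scaling, softmax over V, greedy argmax.
    let logits = vec![2.0f32, 0.5, 1.0, -1.0]; // toy vocabulary of V = 4
    let temperature = 0.8f32;

    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

    // Greedy: argmax over the vocabulary axis.
    let next_token = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap();
    println!("probs = {:?}, next_token = {}", probs, next_token);
}
```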
Decode Phase
The decode phase reuses the same operation sequence as prefill (embedding, transformer layers, LM head, sampling), but operates on a single token at a time, reusing cached KV pairs instead of recomputing them. The decode phase generates tokens one at a time autoregressively, continuing until an end token (EOS) is produced or the maximum length is reached. Unlike prefill, decode processes only one token per iteration.
Three characteristics distinguish decode from prefill:
- Single-token input:
s_in = 1(only the most recent output token is used as query) - KV cache reuse: Previously computed Key and Value tensors are reused rather than recomputed
- Autoregressive generation: Each token prediction references all previous tokens via the cache
For each decoding step s = s_prefill + 1, ..., s_max:
1. Embedding Lookup
Embedding lookup converts the previously generated token to its vector representation.
- Input
  - `input: shape![B, 1]`
  - Token index sampled in the previous step
- Weight
  - `w_emb: shape![V, D]`
- Output
  - `x_0: shape![B, 1, D]`
- Operation
  - `x_0 = gather(index: input, table: w_emb)`
  - Processed by TensorDMA
2. Transformer Layers (repeated L times)
Each transformer layer processes the single token through attention and FFN, reusing cached KV pairs.
For each layer l = 1, ..., L, perform the following:
2.1. Input Layer Normalization
Input layer normalization prepares the token for attention computation.
- Input
  - `x_prev: shape![B, 1, D]`
- Output
  - `x_norm: shape![B, 1, D]`
- Operation
  - `x_norm = RMSNorm(x_prev)`
  - Processed by Vector Engine
2.2. Multi-Head Grouped Query Attention (GQA)
Attention in decode phase computes attention between the current token (query) and all cached tokens (keys/values).
2.2.1. QKV Projection
QKV projection computes Query, Key, and Value for the current token only.
- Input
  - `x_norm: shape![B, 1, D]`
- Weights
  - `w_q: shape![D, h_q, d_k]`
  - `w_k: shape![D, h_kv, d_k]`
  - `w_v: shape![D, h_kv, d_k]`
- Outputs
  - `Q: shape![B, 1, h_q, d_k]`
  - `K_new: shape![B, 1, h_kv, d_k]`
  - `V_new: shape![B, 1, h_kv, d_k]`
- Operations
  - `Q = einsum(x_norm, w_q)`
  - `K_new = einsum(x_norm, w_k)`
  - `V_new = einsum(x_norm, w_v)`
  - `(shape![B, 1, D], shape![D, h_q/kv, d_k]) -> shape![B, 1, h_q/kv, d_k]`
2.2.2. Rotary Position Embedding (RoPE)
RoPE applies positional encoding corresponding to the current sequence position.
- Inputs
  - `Q: shape![B, 1, h_q, d_k]`
  - `K_new: shape![B, 1, h_kv, d_k]`
- RoPE table
  - `w_rope: shape![s_max, d_k_prime, 2, 2]`
- Position
  - `position: shape![1]`
  - `position(0) = s` (total sequence length processed so far)
- Outputs
  - `Q_rope: shape![B, h_q, 1, d_k]`
  - `K_rope: shape![B, h_kv, 1, d_k]`
- Operations
  - RoPE table lookup
    - `t_rope: shape![1, d_k_prime, 2, 2] = gather(index: position, table: w_rope)`
  - Apply RoPE
    - Reshape (noop)
      - `Q: shape![B, 1, h_q, d_k] == shape![B, 1, h_q, d_k_prime, f]`
      - `K_new: shape![B, 1, h_kv, d_k] == shape![B, 1, h_kv, d_k_prime, f]`
      - `t_rope: shape![1, d_k_prime, 2, 2] == shape![1, d_k_prime, f, 2]`
    - einsum
      - `Q_rope = einsum(Q, t_rope)`
        - `(shape![B, 1, h_q, d_k_prime, f], shape![1, d_k_prime, f, 2]) -> shape![B, h_q, 1, d_k_prime, 2] == shape![B, h_q, 1, d_k]`
      - `K_rope = einsum(K_new, t_rope)`
        - `(shape![B, 1, h_kv, d_k_prime, f], shape![1, d_k_prime, f, 2]) -> shape![B, h_kv, 1, d_k_prime, 2] == shape![B, h_kv, 1, d_k]`
2.2.3. KV Cache Update
KV cache update appends the new Key and Value to the existing cache for future token generation.
- Inputs
  - `kv_cache_l_K: shape![B, h_kv, s-1, d_k]`
  - `kv_cache_l_V: shape![B, h_kv, s-1, d_k]`
  - `K_rope: shape![B, h_kv, 1, d_k]`
  - `V_new: shape![B, 1, h_kv, d_k]`
  - TODO (youseok.yang): `V_new` has shape `[B, 1, h_kv, d_k]` but the cache expects `[B, h_kv, s, d_k]`. Either correct `V_new`'s shape to `[B, h_kv, 1, d_k]` (consistent with `K_rope` and the cache) or add an explicit reshape/transpose step before the cache update.
- Outputs
  - `kv_cache_l_K: shape![B, h_kv, s, d_k]`
  - `kv_cache_l_V: shape![B, h_kv, s, d_k]`
- Operations
  - Concatenate: Add the new K, V to the existing cache
    - `kv_cache_l_K[s-1] = K_rope`
    - `kv_cache_l_V[s-1] = V_new`
    - Processing differs depending on the concat axis allocation. Data movement between slices: use RoutingEngine/parallel copy; data movement between elements: use parallel copy.
    - Concat on HBM using DMA is also possible.
2.2.4. Grouped Query Attention Computation
Attention computation uses the current Query against the entire KV cache to determine which past tokens are relevant.
2.2.4.1. Attention Scores Computation
Attention scores measure similarity between the current Query and all cached Keys.
- Inputs
  - `Q_rope: shape![B, h_q, 1, d_k]`
  - `kv_cache_l_K: shape![B, h_kv, s, d_k]`
- Output
  - `scores: shape![B, h_q, 1, s]`
- Operations
  - `scores = (Q_rope @ kv_cache_l_K.T) / sqrt(d_k)`
  - Reshape (noop)
    - `Q_rope: shape![B, h_q, 1, d_k] == shape![B, G, h_kv, 1, d_k]`
    - `kv_cache_l_K: shape![B, h_kv, s, d_k] == shape![B, h_kv, s, d_k]`
  - einsum
    - `scores_before_normalize = einsum(Q_rope, kv_cache_l_K)`
      - `(shape![B, G, h_kv, 1, d_k], shape![B, h_kv, s, d_k]) -> shape![B, G, h_kv, 1, s] == shape![B, h_q, 1, s]`
    - The einsum expression shows that `G` was broadcast from `kv_cache_l_K`, and `d_k` was reduced.
  - Normalize
    - `scores = scores_before_normalize / sqrt(d_k)`
    - Processed as a constant multiplication in the Vector Engine
2.2.4.2. Softmax Application
Softmax converts scores to attention weights. Causal mask is unnecessary in decode because the current token only references past tokens.
- Input
  - `scores: shape![B, h_q, 1, s]`
- Output
  - `attn_weights: shape![B, h_q, 1, s]`
- Operation
  - `attn_weights = softmax(scores)`
  - Softmax is applied over the last axis (`s`, i.e., all past tokens)
  - `softmax(x)_i = exp(x_i) / sum_j(exp(x_j))`
  - Processed by Vector Engine
2.2.4.3. Weighted Sum (Attention Output)
Weighted sum combines cached Values according to attention weights to produce the attention output.
- Inputs
  - `attn_weights: shape![B, h_q, 1, s]`
  - `kv_cache_l_V: shape![B, h_kv, s, d_k]`
- Output
  - `attn_output: shape![B, h_q, 1, d_k]`
- Operations
  - Reshape (noop)
    - `attn_weights: shape![B, h_q, 1, s] == shape![B, G, h_kv, 1, s]`
    - `kv_cache_l_V: shape![B, h_kv, s, d_k] == shape![B, h_kv, s, d_k]`
  - einsum
    - `attn_output = einsum(attn_weights, kv_cache_l_V)`
      - `(shape![B, G, h_kv, 1, s], shape![B, h_kv, s, d_k]) -> shape![B, G, h_kv, 1, d_k] == shape![B, h_q, 1, d_k]`
    - The einsum expression shows that `G` was broadcast from `kv_cache_l_V`, and `s` was reduced.
2.2.5. Output Projection
Output projection transforms the attention result back to the hidden dimension.
- Input
  - `attn_output: shape![B, h_q, 1, d_k]`
- Weight
  - `w_o: shape![h_q, d_k, D]`
- Output
  - `attn_out: shape![B, 1, D]`
- Operations
  - `attn_out = einsum(attn_output, w_o)`
    - `(shape![B, h_q, 1, d_k], shape![h_q, d_k, D]) -> shape![B, 1, D]`
2.2.6. Residual Connection
Residual connection combines attention output with layer input.
- Inputs
  - `x_prev: shape![B, 1, D]`
  - `attn_out: shape![B, 1, D]`
- Output
  - `x_attn: shape![B, 1, D]`
- Operation
  - `x_attn = x_prev + attn_out`
  - elementwise addition: Processed by Vector Engine
2.3. Feed-Forward Network (FFN)
FFN in decode phase is identical to prefill, but processes only a single token (sequence length = 1).
2.3.1. Post-Attention Layer Normalization
Post-attention normalization prepares the token for FFN processing.
- Input
  - `x_attn: shape![B, 1, D]`
- Output
  - `x_ffn_norm: shape![B, 1, D]`
- Operation
  - `x_ffn_norm = RMSNorm(x_attn)`
  - Processed by Vector Engine
2.3.2. SwiGLU FFN
SwiGLU applies the gated activation function with three projections.
- Input
  - `x_ffn_norm: shape![B, 1, D]`
- Weights
  - `w_gate: shape![D, F]`
  - `w_up: shape![D, F]`
  - `w_down: shape![F, D]`
- Output
  - `ffn_out: shape![B, 1, D]`
- Operations
  - Gate projection: `gate = einsum(x_ffn_norm, w_gate)`
    - `(shape![B, 1, D], shape![D, F]) -> shape![B, 1, F]`
  - Up projection: `up = einsum(x_ffn_norm, w_up)`
    - `(shape![B, 1, D], shape![D, F]) -> shape![B, 1, F]`
  - SwiGLU activation: `activated = SiLU(gate) * up`
    - Processed by Vector Engine
  - Down projection: `ffn_out = einsum(activated, w_down)`
    - `(shape![B, 1, F], shape![F, D]) -> shape![B, 1, D]`
2.3.3. Residual Connection
FFN residual connection produces the final layer output.
- Inputs
  - `x_attn: shape![B, 1, D]`
  - `ffn_out: shape![B, 1, D]`
- Output
  - `x_l: shape![B, 1, D]`
- Operation
  - `x_l = x_attn + ffn_out`
  - elementwise addition: Processed by Vector Engine
3. Final Layer Normalization
Final layer normalization prepares the output for the language model head.
- Input
  - `x_L: shape![B, 1, D]`
- Output
  - `x_final: shape![B, 1, D]`
- Operation
  - `x_final = RMSNorm(x_L)`
  - Processed by Vector Engine
4. Language Model Head
The language model head projects the hidden state to vocabulary logits. Unlike prefill, no slice operation is needed since there is only a single token.
- Input
  - `x_final: shape![B, 1, D]`
- Weight
  - `w_lm_head: shape![D, V]`
- Output
  - `logits: shape![B, V]`
- Operations
  - Reshape/Squeeze: Remove the sequence dimension
    - `x_squeezed: shape![B, D] = squeeze(x_final)`
  - einsum: Logit computation over the vocabulary
    - `logits = einsum(x_squeezed, w_lm_head)`
    - `(shape![B, D], shape![D, V]) -> shape![B, V]`
5. Sampling
Sampling is identical to Prefill Sampling: temperature scaling, softmax, and token selection, performed on the Host.
6. Termination Conditions
Generation terminates when any of three conditions is met:
- EOS token generated: Sampled token is the End-of-Sequence token
- Maximum length reached:
s >= s_max - User-defined termination conditions: When specific patterns or conditions are met
If generation continues, update s <- s + 1 and return to the next decoding step.
Prefill vs Decode Phase Comparison
The following table summarizes the key differences between prefill and decode phases:
| Characteristic | Prefill Phase | Decode Phase |
|---|---|---|
| Input sequence length | s_in (variable) | 1 (fixed) |
| Parallel processing | s_in tokens processed in parallel | Only 1 token processed |
| KV Cache | Create and store | Read and update |
| Attention computation | Causal mask required | Causal mask not required |
| Attention shape | shape![B, h_q, s_in, s_in] | shape![B, h_q, 1, s] |
| Computation characteristics | Compute-bound (large-scale computation) | Memory-bound (KV cache access) |
| Throughput | High (parallel processing) | Low (sequential processing) |
| Latency | Relatively high | Low (per token) |
Mixture of Experts
Mixture of Experts (MoE) scales model capacity by routing each token to only K of E experts rather than all of them; this sparse activation allows many parameters while keeping inference cost manageable. This example shows how to implement MoE on TCP hardware, focusing on two key challenges: replacing control-flow-based TopK routing with branchless matrix operations, and executing sparse expert computations blockwise.
Background: Basic FFN
To understand MoE, first consider the basic FFN (Feed-Forward Network) in transformer blocks. The following describes FFN with only up/down projection, without gate projection:
- Input
  - `x_ffn_norm: T x D`
- Weights
  - `W_up: D x F` (up projection)
  - `W_down: F x D` (down projection)
- Output
  - `ffn_out: T x D`
- Operations
  - Up projection: `up = einsum(x_ffn_norm, W_up)`
    - `(T x D), (D x F) -> T x F`
  - Down projection: `ffn_out = einsum(up, W_down)`
    - `(T x F), (F x D) -> T x D`
MoE Structure
MoE replaces a single FFN with E independent FFNs called experts.
Each expert has its own weights:
- `W_up[0], W_up[1], ..., W_up[E-1]`
- `W_down[0], W_down[1], ..., W_down[E-1]`
Computing all experts would increase computation by E times.
To avoid this, MoE uses a router to select only the Top-K most suitable experts per token, enabling sparse computation.
Model Parameters
The following arguments define an MoE layer:
- `T`: number of tokens
  - prefill: `T = B * s_in`
  - decode: `T = B`
- `D`: hidden dimension
- `F`: intermediate dimension of the FFN up projection result
- `E`: number of total experts (typically 128)
- `K`: number of experts applied per token
  - llama4: 1, gpt-oss: 4, qwen3: 8
MoE Processing Steps
MoE processing consists of three main stages: routing (selecting which experts to use), sparse expert computation (applying the selected experts), and combining (merging expert outputs with routing weights).
1. Gating (Router) & Top-K Selection
The router calculates a score for each expert for every token, determining which experts should process each token:
- Input
  - `x_norm: T x D`
- Weight
  - `W_router: D x E` (gating network weights)
- Output
  - `scores: T x E`
- Operation
  - `scores = einsum(x_norm, W_router)`
    - `(T x D), (D x E) -> T x E`
  - Calculates the score (logit) for the `E` experts per token
2. Top-K Selection
This step selects the Top-K Experts based on router scores and calculates the weight for each selected Expert:
- Input
  - `scores: T x E`
- Outputs
  - `topk_indices: T x K` (selected Expert ID per token)
  - `routing_weights: T x K` (weight of each selected Expert per token)
- Operations
  - Top-K selection: `raw_weights, topk_indices = topk(scores, K)`
    - Extract the `K` Expert indices and scores with the highest scores per token
  - Softmax normalization: `routing_weights = softmax(raw_weights)`
    - Convert the selected `K` scores to probability values (sum is 1 per token)
    - `softmax(x)[i] = exp(x[i]) / sum(exp(x[j]) for j in 0..K)`

The output for each token t consists of:
- `topk_indices[t, :]`: `K` Expert IDs (`0 <= e < E`)
- `routing_weights[t, :]`: weights of those Experts (sum is 1)
3. Sparse Expert Computation
Only selected Experts perform computation, making this stage sparse.
A total of T * K Expert calls occur, but each Expert only computes for the tokens that selected it.
For each token t in [0, T-1] and selected Expert k in [0, K-1]:
- Selected Expert ID: `e = topk_indices[t, k]`
- Input
  - `x_norm[t]: D` (input of token `t`)
- Weights (weights of Expert `e`)
  - `W_up[e]: D x F`
  - `W_down[e]: F x D`
- Output
  - `y[t, k]: D` (k-th Expert output of token `t`)
- Operations
  - Up projection: `up = einsum(x_norm[t], W_up[e])`
    - `D, (D x F) -> F`
  - Down projection: `y[t, k] = einsum(up, W_down[e])`
    - `F, (F x D) -> D`
The results for all (t, k) pairs are collected into y_experts: T x K x D.
4. Weighted Sum (Combine)
The final step combines the K Expert outputs using the routing weights calculated earlier:
- Inputs
  - `y_experts: T x K x D`
  - `routing_weights: T x K` (weight of each Expert)
- Output
  - `ffn_out: T x D`
- Operations
  - `ffn_out = einsum(y_experts, routing_weights)`
The result is that each token receives the weighted average output of its selected K Experts.
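As a host-side reference for the combine einsum, here is a tiny Rust sketch that computes each token's output as the routing-weighted sum of its `K` expert outputs. The sizes and toy values are assumptions for illustration.

```rust
fn main() {
    // Combine: moe_out[t] = sum_k routing_weights[t][k] * y_experts[t][k][:].
    let (t_len, k_sel, d) = (2usize, 2usize, 3usize);
    // y_experts: T x K x D, routing_weights: T x K (each row sums to 1).
    let y_experts = vec![
        vec![vec![1.0f32, 0.0, 0.0], vec![0.0, 1.0, 0.0]], // token 0
        vec![vec![2.0f32, 2.0, 2.0], vec![4.0, 0.0, 4.0]], // token 1
    ];
    let routing_weights = vec![vec![0.75f32, 0.25], vec![0.5f32, 0.5]];

    let mut moe_out = vec![vec![0.0f32; d]; t_len];
    for t in 0..t_len {
        for k in 0..k_sel {
            for dim in 0..d {
                moe_out[t][dim] += routing_weights[t][k] * y_experts[t][k][dim];
            }
        }
    }
    println!("{:?}", moe_out); // [[0.75, 0.25, 0.0], [3.0, 1.0, 3.0]]
}
```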
MoE Implementation on TCP
Implementing MoE efficiently on TCP requires bridging the gap between the model’s logical structure and hardware constraints. This section describes the techniques needed to achieve high performance.
1. Overview and Design Philosophy
1.1. Bridging Logical and Physical Execution
Two fundamental challenges arise when implementing MoE on TCP:
- Challenge 1: Conflict between control flow and parallel structure
- Problem: General
Top-Kalgorithms use branch statements where the execution path varies depending on data values. Such branch statements cause performance degradation in SIMT-based accelerators that process thousands of elements with a single instruction. - Solution: Completely removing control flow and using Branchless
Top-Ktechnique with matrix operations and bit manipulation is essential.
- Problem: General
- Challenge 2: Gap between logical Routing and physical execution
- Problem: Logically, MoE is a process where each token finds the Expert that suits it (Token-centric). However, if implemented as is, memory access becomes irregular and the number of tokens to process per Expert changes dynamically, reducing TCP compiler efficiency.
- Solution: The perspective must be shifted to a method where the Expert becomes the subject and collects tokens (Expert-centric).
1.2. Core Techniques for TCP Implementation
Two core techniques address these challenges:
- Branchless TopK: Performs routing via matrix operations only, eliminating all control flow
- Blockwise execution: Processes only selected Experts with data packed in fixed-size Block units
The following sections describe each technique in detail.
2. Branchless TopK
Branchless TopK replaces control-flow-based sorting with pure matrix operations.
This approach consists of three stages: bit packing to combine score and index, parallel ranking to determine order, and filtering to extract the top K results.
2.1. Bit Packing (Combining Score and Index)
The Vector Engine pipeline operates on all 256 slices in lockstep, so any operation whose address or control path depends on runtime data values must be replaced with a fixed sequence of matrix operations. Bit packing bundles score and index into a single value so the Expert ID is preserved when scores are reordered during sorting:
- Inputs
  - `scores: T x E`
  - `Index_expert: E`
    - `Index_expert(e) = e` where `e = 0, 1, 2, ..., E - 1`
- Outputs
  - `Packed_Value: T x E`
    - Tensor with (score, index) packed.
  - `Packed_Value_cmp: T x E`
    - Tensor with (score, index) packed, preprocessed so that score magnitude can be compared using integer comparison.
- Operations
  - Packing
    - Place the Expert score (e.g., `bf16`) in the upper bits and the Expert index (e.g., `int16`) in the lower bits to create a single 32-bit integer (or float).
    - `Packed_Value_unprocessed = (Score << 16) | Index`
    - Processed in Vector Engine.
  - Comparison trick (see the host-side sketch below)
    - This preprocessing enables magnitude comparison of score values using simple integer comparison.
    - Bit-flipping preprocessing solves the problem of negative values having their ordering reversed when float values are compared as integers. This enables accurate Top-K selection with only integer comparators.
    - `Packed_Value_cmp = if Packed_Value >= 0 { Packed_Value } else { Packed_Value ^ 0x7fff0000 }`
2.2. Parallel Ranking (All-to-All Comparison)
Parallel ranking determines the order of all experts simultaneously instead of sequential sorting.
Although this requires E x E comparisons, TCP efficiency remains high because only matrix operations are used without control flow:
- Input
  - `Packed_Value_cmp: T x E`
    - 32-bit packed tensor with the comparison trick applied.
- Output
  - `Rank: T x E`
    - Rank of each Expert (0-based). Higher scores are closer to 0.
- Operations (see the ranking sketch below)
  - Broadcast & Compare
    - Replicate (tile) `Packed_Value_cmp` along the `E` axis to expand to a `T x E x E` shape, then compare magnitude relationships for all Expert pairs `(i, j)`.
    - `Compare[t, i, j] = 1 if Packed_Value_cmp[t, j] > Packed_Value_cmp[t, i] else 0`
    - Meaning: "Is Expert `j`'s score higher than Expert `i`'s?"
  - Rank calculation (ReduceSum)
    - Sum along the `E` (comparison target) axis to calculate the rank.
    - `Rank[t, i] = sum(Compare[t, i, j] for j in 0..E)`
    - Meaning: "The total number of Experts with a higher score than mine" becomes my rank.
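The following Rust sketch replays the all-to-all ranking for a single token on the host: each expert's rank is the number of experts whose packed value is strictly greater, and Top-K membership is simply `rank < K`. The packed values are arbitrary stand-ins for `Packed_Value_cmp`.

```rust
fn main() {
    // Branchless ranking for one token.
    let packed_cmp: [i32; 5] = [40, 10, 55, 10, -3]; // stand-ins for Packed_Value_cmp
    let e = packed_cmp.len();

    let mut rank = vec![0u32; e];
    for i in 0..e {
        for j in 0..e {
            // Compare[t, i, j] contributes 1 when expert j outranks expert i.
            rank[i] += (packed_cmp[j] > packed_cmp[i]) as u32;
        }
    }
    println!("{:?}", rank); // [1, 2, 0, 2, 4]

    let k = 2;
    let topk: Vec<usize> = (0..e).filter(|&i| rank[i] < k).collect();
    println!("top-{k} experts: {:?}", topk); // [0, 2]
}
```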
2.3. Filtering & Unpacking
Filtering extracts the top K entries based on rank, then unpacking separates the packed scores and indices:
- Inputs
  - `Rank: T x E`
  - `Packed_Value: T x E`
    - Note: The original packed value, before the comparison trick was applied, must be used so that the accurate score/index can be restored later.
- Outputs
  - `TopK_Indices: T x K`
  - `TopK_Scores: T x K`
  - `routing_weights: T x K` (weights for the K selected experts per token)
- Operations
  - Filtering (FilterCompaction)
    - Only elements satisfying the Top-K condition (`Rank < K`) are kept.
    - `Mask[t, i] = 1 if Rank[t, i] < K else 0`
    - Only `Packed_Value` entries at positions where the mask is true are collected and compressed to `T x K` size.
    - Result: `Selected_Packed: T x K`
    - Uses the filter function of the Vector Engine.
  - Unpacking
    - Restore scores and indices from the selected 32-bit values through bit operations.
    - Score extraction: `TopK_Scores = Selected_Packed >> 16` (then reinterpreted as the `bf16` type)
    - Index extraction: `TopK_Indices = Selected_Packed & 0xffff`
  - Softmax normalization
    - Softmax is applied to the extracted Top-K scores to calculate the final weights, which are used in the Combine stage later.
    - `routing_weights[t, k] = exp(TopK_Scores[t, k]) / sum(exp(TopK_Scores[t, j]) for j in 0..K)`
3. Blockwise Execution
Blockwise execution physically rearranges data based on Top-K routing decisions while satisfying TCP’s static shape constraints. This section describes how to handle dynamic token-to-expert assignments efficiently.
3.1. Problem: Dynamic Shape & Memory Explosion
The core challenge is that the number of tokens L_e assigned per Expert varies dynamically depending on the input.
In the worst case, if all tokens are concentrated on a specific Expert, L_e ~ T.
Two approaches address this challenge:
- Naive solution: Allocating a buffer of maximum size `T` for all Experts requires memory of size `E x T x D`, most of which is wasted as padding.
- Blockwise solution: Instead of variable lengths `L_e`, manage data in fixed-size Block (`B`) units to keep memory usage at approximately the `T x K` level.
3.2. Grid Size Calculation
Grid size determines how many blocks are needed to process all tokens.
Tokens for the same Expert are grouped into blocks of B tokens, enabling blockwise computation with a single expert loaded.
The total number of blocks needed (Grid Size, G) is calculated as the sum of blocks required per expert:
- Number of blocks allocated to Expert `e`
  - Number of tokens allocated to `e`: `Count_e`
  - Number of blocks: `ceil(Count_e / B)`
- `G = sum(ceil(Count_e / B) for e in 0..E)`
The compiler calculates the worst-case G value and allocates memory space.
At runtime, sparse operations skip execution for empty Grids.
In the worst case where all Experts include a grid containing only one token, (T*K - E) / B + E Grids are required.
3.3. Index Generation (Cumsum-based Address Calculation)
Index generation computes the destination address for each token using cumsum-based parallel address calculation.
(Cumsum is implemented in the Vector Engine using branch logging; see Section 4 for the hardware implementation.)
This approach avoids loops and enables efficient parallel execution:
- Inputs
  - `TopK_Indices: T x K`
  - `Expert_Indices: E = [0, 1, ..., E-1]`
  - `Block_Range: G = [0, 1, ..., G-1]` (sequence of maximum block count, e.g., 32)
- Outputs
  - `Scatter_Idx: T x K` (final 1D address where each token will move)
  - `Expert_IDs: G` (Expert number each Block is responsible for)
- Operations (a host-side replay follows this list)
  - Mask generation (one-hot)
    - Convert indices to a computable mask form.
    - `Expert_Mask: T x K x E = one_hot(TopK_Indices, depth=E)`
  - Histogram
    - Sum the masks to count the number of tokens allocated per Expert.
    - `Count: E = reduce_sum(Expert_Mask, axis: (T, K))`
  - Block calculation
    - Calculate the number of Blocks needed for each Expert.
    - `Num_Blocks: E = ceil(Count / B)`
  - Global offset calculation
    - Through cumsum, obtain the block start index where each Expert starts in the entire Grid (`G`).
    - `Global_Offset: E = cumsum(Num_Blocks) - Num_Blocks`
  - Local offset calculation
    - Using the mask and cumsum, calculate each token's position in its Expert's queue.
    - `Cumsum_Mask: T x K x E = cumsum(Expert_Mask, axis: (T, K))`
    - `Token_Rank: T x K = gather(Cumsum_Mask, index: TopK_Indices)`
    - `Local_Offset: T x K = Token_Rank - 1`
  - Expert ID expansion
    - `Diff: E x G = Num_Blocks - Block_Range`
    - `Grid: E x G`
      - `Grid(e, i) = if Diff(e, i) > 0 { Expert_Indices(e) } else { -1 }`
    - `Expert_IDs: G = filter_compaction(Grid, condition=(Grid >= 0))`
    - Example:
      - expert 0: 2 blocks, expert 1: 3 blocks, expert 3: 3 blocks
      - `Diff[0] = [2, 1, 0, -1, -2, ...]`, `Diff[1] = [3, 2, 1, 0, -1, ...]`: has positive entries equal to the number of allocated blocks per expert.
      - `Grid[0] = [0, 0, -1, -1, ...]`, `Grid[1] = [1, 1, 1, -1, -1, ...]`: repeats the expert id as many times as that expert's allocated blocks.
      - `Expert_IDs = [0, 0, 1, 1, 1, 3, 3, 3]`: Filter only values >= 0 (expert ids) from `Grid`.
  - Address synthesis
    - `Scatter_Idx = (Global_Offset * B) + Local_Offset`
    - Calculates which block, and which position within the block, each of the `T` tokens corresponds to. `Scatter_Idx in [0, G * B)`
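The Rust sketch below replays this cumsum-based address calculation on the host for a tiny case. The expert counts, block size, and assignment list are toy values, and the `Global_Offset` lookup per slot stands in for the gather over `TopK_Indices`.

```rust
fn main() {
    // E = 4 experts, block size B = 2, flattened (token, k) expert assignments.
    let e = 4usize;
    let b = 2usize;
    let topk_indices = [0usize, 1, 1, 3, 1, 3, 0, 3];

    // Histogram: tokens per expert.
    let mut count = vec![0usize; e];
    for &ex in &topk_indices {
        count[ex] += 1;
    }
    // Blocks per expert and exclusive cumsum -> block start offsets.
    let num_blocks: Vec<usize> = count.iter().map(|c| (c + b - 1) / b).collect();
    let mut global_offset = vec![0usize; e];
    for i in 1..e {
        global_offset[i] = global_offset[i - 1] + num_blocks[i - 1];
    }

    // Local offset: running position of each slot within its expert's queue.
    let mut seen = vec![0usize; e];
    let scatter_idx: Vec<usize> = topk_indices
        .iter()
        .map(|&ex| {
            let addr = global_offset[ex] * b + seen[ex];
            seen[ex] += 1;
            addr
        })
        .collect();

    // Expert ID per block (the expert a whole block belongs to).
    let expert_ids: Vec<usize> = (0..e)
        .flat_map(|ex| std::iter::repeat(ex).take(num_blocks[ex]))
        .collect();

    println!("count       = {:?}", count);
    println!("num_blocks  = {:?}", num_blocks);
    println!("global_off  = {:?}", global_offset);
    println!("scatter_idx = {:?}", scatter_idx);
    println!("expert_ids  = {:?}", expert_ids);
}
```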
3.4. Dispatch (Blockwise Scatter)
Dispatch physically rearranges tokens using the computed addresses, placing each token in its designated block position:
- Inputs
  - `x_norm: T x D` (input after attention and norm)
  - `Scatter_Idx: T x K` (final 1D address where each token will move)
- Output
  - `x_blocked: G x B x D` (rearranged blocked tensor)
- Operation
  - Scatter
    - Place the tokens of `x_norm` at their `Scatter_Idx` positions.
3.5. Sparse Computation (Weight Gather)
Sparse computation applies Expert weights to the sorted Blocks. The key insight is that weights are gathered only for Experts that have assigned tokens:
- Inputs
  - `x_blocked: G x B x D`
  - `Expert_IDs: G` (Expert number each Block is responsible for)
- Output
  - `y_blocked: G x B x D`
- Operations
  - Weight gather
    - Using `Expert_IDs` as indices, only the necessary weights are fetched.
    - `W_gathered_up: G x D x F = gather(W_up, index: Expert_IDs)`
    - `W_gathered_down: G x F x D = gather(W_down, index: Expert_IDs)`
  - Sparse MLP
    - Operations are performed only for valid Blocks (`G`).
    - `up: G x B x F = einsum(x_blocked, W_gathered_up)`
    - `y_blocked: G x B x D = einsum(up, W_gathered_down)`
3.6. Combine (Weighted Sum)
Combine restores results to original token order and applies Routing probabilities. This is the final step that produces the MoE layer output:
- Inputs
  - `y_blocked: G x B x D`
  - `Scatter_Idx: T x K`
  - `routing_weights: T x K`
- Output
  - `moe_out: T x D` (final MoE layer output)
- Operations
  - Gather
    - Using `Scatter_Idx` in reverse, results are fetched from `y_blocked` in the original token order.
    - `y_restored: T x K x D = gather(y_blocked, index: Scatter_Idx)`
  - Weighted sum
    - The final output is obtained by multiplying with the `routing_weights` from the Top-K process and summing over `K`.
    - `y_weighted: T x K x D = einsum(y_restored, routing_weights)`
    - `moe_out: T x D = reduce_sum(y_weighted, axis: K)`
4. Cumsum Implementation on TCP
Cumsum is a key primitive used in index generation.
On TCP, it is implemented in Vector Engine using the following approach:
- Create a static branch logger: For the axis (of size n) over which the sum is computed,
branch(i) = if i == 0 {
0
} else if i < n - 1 {
1
} else {
2 // i == n - 1
}
- Configure the Vector Engine as follows:
add %mainstream, OperandRead(branch = 1, 2)
WriteOperand(branch = 0, 1)