Commit Engine

The Commit Engine writes Tensor Unit results back to DM (Data Memory), the primary on-chip SRAM tier. It implements a logical tensor move from Tensor Unit streams to SRAM, writing each slice’s result to its designated DM address.

After the Tensor Unit completes computation, results exist as streaming packets distributed across slices. The Commit Engine transforms these packets through an adapter (truncating) and writes them to DM via a sequencer. This page covers the interface and examples, the adapter stages, the sequencer, sub-context operations, and performance guidelines.

Interface

impl<'l, const T: Tu, P: Position, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M>
    StreamTensor<'l, { T }, P, D, Chip, Cluster, Slice, Time, Packet>
{
    /// Commits to the data memory.
    #[primitive(StreamTensor::commit)]
    pub fn commit<Element: M>(self, address: Address) -> DmTensor<D, Chip, Cluster, Slice, Element> {
        verify_commit::<D, Time, Packet, Element>();
        DmTensor::new(self.inner.transpose(false), address)
    }

    /// Commits to a mutable tensor view in the data memory.
    #[primitive(StreamTensor::commit_view)]
    pub fn commit_view<Element: M>(self, mut dst: DmTensorViewMut<'l, D, Chip, Cluster, Slice, Element>) {
        verify_commit::<D, Time, Packet, Element>();
        dst.inner.write_transpose(self.inner.view(), false);
    }
}

The Commit Engine mirrors the Fetch Engine’s structure, but operates in reverse.

For detailed examples, see kernel examples.

Examples

Consider storing a matrix multiplication result C = A * B back to DM after computation. The Cast Engine converts the Contraction Engine’s f32 packet elements to bf16 to save space. The Commit Engine stores the resulting tensor to DM.

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![P = 256, M = 16, N = 8];

fn cast_commit<'l, const T: Tu>(
    input: AccumulationTensor<'l, T, f32, m![1], m![1], m![P], m![M], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![P], m![M, N # 16]> {
    // Cast f32 to bf16 (Cast Engine), then commit to DM (Commit Engine).
    // Input: M = 16 time steps, N = 8 f32 elements per packet (32 bytes).
    // After cast: N = 8 bf16 elements padded to 16 (32 bytes).
    // The sequencer writes across P = 256 slices.
    input.cast::<bf16, m![N # 16]>().commit(0)
}

Adapter

The adapter transforms stream packets by truncating them before they are written to DM.

The main context and sub-context adapters both support truncating. The sub-context is typically used for prefetching to TRF/VRF.

Truncating

Truncating reduces packet size by keeping only the leading elements. The input packet is always a full 32-byte flit. The commit_in_size parameter controls how many bytes are actually written to DM: 8, 16, 24, or 32 bytes (where 32 bytes means no reduction). Truncation is used to discard trailing padding elements or to satisfy downstream alignment constraints.
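The byte arithmetic behind commit_in_size can be checked with a small plain-Python helper (illustrative only, not the kernel DSL; in practice the compiler derives this value from the output tensor mapping):

```python
def commit_in_size(valid_elements, element_bits):
    """Bytes written to DM per packet: valid elements times element width.

    Illustrative helper -- the real value is derived by the compiler, not
    computed by user code.
    """
    bits = valid_elements * element_bits
    assert bits % 8 == 0, "must be a whole number of bytes"
    size = bits // 8
    assert size in (8, 16, 24, 32), "commit_in_size must be 8, 16, 24, or 32 bytes"
    return size

# The four truncation cases discussed in this section:
print(commit_in_size(8, 8))    # 8 valid i8 elements   -> 8 bytes
print(commit_in_size(4, 32))   # 4 valid f32 elements  -> 16 bytes
print(commit_in_size(8, 16))   # 8 valid bf16 elements -> 16 bytes
print(commit_in_size(64, 4))   # 64 i4 elements        -> 32 bytes (no truncation)
```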

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, K = 2, W = 8, N = 16, J = 64];

fn i8_padding_truncation<'l, const T: Tu>(
    input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![M, K], m![W # 32]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![M, K, W]> {
    // Input: 8 i8 elements padded to 32 (32 bytes per packet).
    // Truncation removes padding: only the 8 leading elements are written to DM.
    // commit_in_size = 8 elements × 1 byte = 8 bytes.
    input.commit(0)
}

fn f32_non_padding_truncation<'l, const T: Tu>(
    input: AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K], m![W]>,
) -> DmTensor<f32, m![1], m![1], m![1], m![M, K, W = 4]> {
    // Input: 8 f32 elements (32 bytes per packet).
    // Truncation: only the first 4 elements are written to DM.
    // commit_in_size = 4 elements × 4 bytes = 16 bytes.
    input.commit(0)
}

fn bf16_truncation_with_transpose<'l, const T: Tu>(
    input: CastTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![1], m![K, M, N = 8]> {
    // Input: 16 bf16 elements (32 bytes per packet).
    // Truncation: only the leading 8 elements are written to DM.
    // commit_in_size = 8 elements × 2 bytes = 16 bytes.
    // Time is transposed: m![M, K] → m![K, M].
    input.commit(0)
}

fn i4_no_truncation_with_transpose<'l, const T: Tu>(
    input: CastTensor<'l, T, i4, m![1], m![1], m![1], m![M, K], m![J]>,
) -> DmTensor<i4, m![1], m![1], m![1], m![K, M, J]> {
    // Input: 64 i4 elements (32 bytes per packet).
    // No truncation: the full 32-byte packet is written to DM.
    // commit_in_size = 64 elements × 0.5 bytes = 32 bytes.
    // Time is transposed: m![M, K] → m![K, M].
    input.commit(0)
}

Note

The commit_in_size value is automatically derived by the compiler from the output tensor mapping. It is not manually specified by the user.

Commit Sequencer

The commit sequencer writes streams to DM across slices. Each slice within an aggregation executes its own sequencer. This mirrors how fetch sequencers pull data into Tensor Units.

The commit_size value determines how many bytes are written per sequencer step. It is analogous to the Fetch Engine’s fetch_size and is also derived from contiguous_sram_access_size:

$$ \texttt{commit_size} = \gcd(\texttt{contiguous_sram_access_size_bytes},\ \texttt{commit_in_size}) $$

  • When commit_size == commit_in_size, each time step produces a single DM write.
  • When commit_size < commit_in_size, the packet is split into commit_in_size / commit_size writes per time step.

The main context supports a commit_size of 8, 16, 24, or 32 bytes (see main context). The sub-context supports a commit_size of 8 bytes only (see sub-context).
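The derivation of commit_size and the resulting write count per time step can be sketched in plain Python (illustrative helper names, not part of the kernel DSL):

```python
from math import gcd

def commit_size(contiguous_sram_access_size, commit_in_size, sub_context=False):
    """commit_size = gcd(contiguous_sram_access_size, commit_in_size), in bytes."""
    size = gcd(contiguous_sram_access_size, commit_in_size)
    allowed = (8,) if sub_context else (8, 16, 24, 32)
    assert size in allowed, "commit_size outside the supported set for this context"
    return size

def writes_per_time_step(commit_in_size, commit_size):
    """A packet is split into commit_in_size / commit_size DM writes per step."""
    return commit_in_size // commit_size

# From the sequencer examples in this section:
print(commit_size(64, 8))            # -> 8  (single 8-byte write per step)
print(commit_size(32, 32))           # -> 32
print(commit_size(16, 16))           # -> 16
print(commit_size(8, 32))            # -> 8
print(writes_per_time_step(32, 8))   # -> 4  (packet split into four 8-byte writes)
```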

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 4, K = 2, W = 8, N = 16];

// Compiler-generated configuration: [
//   M -> 4 : 16,  (16 == 2 * 8,  contiguous)
//   K -> 2 : 8,   (8  == 8 * 1,  contiguous)
//   W -> 8 : 1    (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = (8 * 2 * 4) elements × 1 byte = 64 bytes
// commit_in_size = 8 bytes (8 valid i8 elements out of 32-byte flit)
// commit_size = gcd(64, 8) = 8
fn no_transpose<'l, const T: Tu>(
    input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![M, K], m![W # 32]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![M, K, W]> {
    input.commit(0)
}

// Compiler-generated configuration: [
//   M -> 4 : 8,   (8  != 2 * 32, NOT contiguous)
//   K -> 2 : 32,  (32 != 8 * 1,  NOT contiguous)
//   W -> 8 : 1    (packet dimension, contiguous)
// ] : 32
// contiguous_sram_access_size = 8 elements × 4 bytes = 32 bytes
// commit_in_size = 32 bytes
// commit_size = gcd(32, 32) = 32
fn transpose<'l, const T: Tu>(
    input: AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, K], m![W]>,
) -> DmTensor<f32, m![1], m![1], m![1], m![K, M, W]> {
    input.commit(0)
}

// Compiler-generated configuration: [
//   M -> 4 : 8,   (8  != 2 * 32, NOT contiguous)
//   K -> 2 : 32,  (32 != 8 * 1,  NOT contiguous)
//   N -> 8 : 1    (truncated packet dimension, contiguous)
// ] : 16
// contiguous_sram_access_size = 8 elements × 2 bytes = 16 bytes
// commit_in_size = 16 bytes (8 bf16 elements; truncation from 16 elements to 8)
// commit_size = gcd(16, 16) = 16
fn transpose_with_truncation<'l, const T: Tu>(
    input: CastTensor<'l, T, bf16, m![1], m![1], m![1], m![M, K], m![N]>,
) -> DmTensor<bf16, m![1], m![1], m![1], m![K, M, N = 8]> {
    input.commit(0)
}

// Compiler-generated configuration: [
//   K -> 2 : 64,  (64 == 4 * 16, contiguous)
//   M -> 4 : 16,  (16 != 8 * 1,  NOT contiguous)
//   W -> 8 : 1    (packet dimension, contiguous)
// ] : 8
// contiguous_sram_access_size = 8 elements × 1 byte = 8 bytes
// commit_in_size = 32 bytes
// commit_size = gcd(8, 32) = 8
//
// The 32-byte packet is split into 4 × 8-byte writes along the M axis:
// - Write 0: packet[ 0.. 8] → DM offset  0
// - Write 1: packet[ 8..16] → DM offset 16
// - Write 2: packet[16..24] → DM offset 32
// - Write 3: packet[24..32] → DM offset 48
fn padding_chunking<'l, const T: Tu>(
    input: CastTensor<'l, T, i8, m![1], m![1], m![1], m![K], m![M, W]>,
) -> DmTensor<i8, m![1], m![1], m![1], m![K, M, W # 16]> {
    input.commit(0)
}

Slice Bitmap

The slice bitmap enables selective commits to specific slices. A 256-bit mask controls which slices receive commit data, with each bit corresponding to one slice.

For example:

  • bitmap = 00000000...01 enables commit only to slice 0
  • bitmap = 11111111...10 enables commit to all slices except slice 0

This feature supports workflows that compute on specific slices and commit results only to those slices.
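A slice bitmap can be modeled as a plain 256-bit integer, one bit per slice (a Python sketch; `slice_bitmap` is an illustrative helper, not a hardware API):

```python
def slice_bitmap(slices):
    """Build a 256-bit commit mask with one bit per slice (bit i -> slice i)."""
    mask = 0
    for s in slices:
        assert 0 <= s < 256, "only 256 slices are addressable"
        mask |= 1 << s
    return mask

only_slice_0 = slice_bitmap([0])                # ...0001: commit to slice 0 only
all_but_slice_0 = slice_bitmap(range(1, 256))   # ...1110: all slices except 0

print(only_slice_0)                      # -> 1
print(all_but_slice_0 & 1)               # -> 0 (slice 0 excluded)
print(bin(all_but_slice_0).count('1'))   # -> 255 slices enabled
```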

Hardware Constraint

The commit sequencer must adhere to the same limits as fetch sequencers. See fetch sequencer constraints for details.

Sub-Context Operations

The sub-context Commit Engine provides specialized capabilities beyond the main context, though its commit_size is limited to 8 bytes.

  • Valid Count Packing: This operation selectively commits only valid tensor elements based on a runtime count, excluding padding or invalid data from the output buffer. When computation produces variable-length results (for example, filtering operations or dynamic sequence lengths), valid count packing ensures that only meaningful elements are written to DM, preventing wasted memory and simplifying downstream processing. The hardware uses a count parameter to determine how many leading elements from each packet should be committed, discarding the remainder.
  • Generate Mode: Writes a single 32-bit value to a specified address via an ITOS (immediate-to-SRAM) command, bypassing the Tensor Unit execution pipeline.
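Conceptually, valid count packing keeps only the leading `count` elements of each packet and densely packs the survivors. A minimal Python model of that behavior (illustrative only, not the hardware interface):

```python
def valid_count_pack(packet, count):
    """Keep only the first `count` valid elements of one packet; drop the rest."""
    assert 0 <= count <= len(packet)
    return packet[:count]

def pack_stream(packets, counts):
    """Concatenate the valid prefixes of a packet stream into one dense buffer."""
    out = []
    for packet, count in zip(packets, counts):
        out.extend(valid_count_pack(packet, count))
    return out

# Two packets with 3 and 1 valid elements; the padding (0) never reaches DM.
print(pack_stream([[5, 7, 9, 0], [2, 0, 0, 0]], [3, 1]))  # -> [5, 7, 9, 2]
```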

Constraints

  • The input packet size must be 32 bytes.
  • The commit_in_size must be 8, 16, 24, or 32 bytes. The commit_size must be 8, 16, 24, or 32 bytes for the main context, and 8 bytes only for the sub-context. Note that the user only specifies the Element mapping; these sizes are derived and enforced internally by the compiler.
  • The two contexts support different capabilities:
    Stage                  Main context   Sub-context
    Truncating             Yes            Yes
    Valid Count Packing    No             Yes
    Generate Mode          No             Yes
  • Sub-context commits can only follow a fetch; they cannot be preceded by Cast Engine or Transpose Engine operations.
  • The commit sequencer shares the same limits as the fetch sequencer (see fetch sequencer constraints). Additionally, all sequencer strides must be multiples of 8 bytes.
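The size and stride constraints above can be collected into one small checker (plain Python, not the kernel DSL; the function name and signature are illustrative):

```python
def validate_commit(commit_in_size, commit_size, strides_bytes, sub_context=False):
    """Check the commit-side size/stride constraints described above."""
    assert commit_in_size in (8, 16, 24, 32), "commit_in_size must be 8/16/24/32 B"
    allowed = (8,) if sub_context else (8, 16, 24, 32)
    assert commit_size in allowed, "commit_size outside the supported set"
    assert all(s % 8 == 0 for s in strides_bytes), \
        "all sequencer strides must be multiples of 8 bytes"
    return True

print(validate_commit(16, 8, [64, 16, 8]))             # -> True
print(validate_commit(8, 8, [32], sub_context=True))   # -> True
```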

Performance

Commit Engine performance directly affects overall computation throughput since DM writes must complete before subsequent operations can access the data.

Write Bandwidth

The Commit Engine achieves maximum write bandwidth when:

  • Slice Interleaving: Distributing writes across all active slices (or the subset specified by the slice bitmap) avoids bottlenecks on individual slices. The RNGD chip has 64 slices per PE. The 256-bit bitmap accommodates up to 4 PEs (4 × 64 = 256).
  • Sequential Addresses: Writing to sequential DM addresses within each slice enables parallel bank access (128 B/cycle per DMN, 256 B/cycle with DMN interleaving).
  • Aligned Packet Sizes: Using 8-byte aligned packet sizes (8, 16, 24, 32 bytes) avoids partial bank writes.

For detailed memory performance characteristics, see Memory Performance.

Adapter Stage Costs

Each adapter stage adds minimal latency:

  • Truncating: Nearly zero cost (simple data width reduction)

Bank Starvation Prevention

The Commit Engine shares DM bank access with the Fetch Engine and DMA Engine. To prevent bank starvation and catastrophic NoC timeouts, ensure commit patterns avoid 64+ consecutive accesses to the same bank. The compiler automatically enforces this constraint by treating violating operations as if they occupy DMA context, preventing concurrent DMA operations.

See DM Bank Starvation for details.