Aligner
The Aligner stage prepares both operands for the Reducer by transforming them into a matching computation mapping.
The computation mapping is the common tensor layout ([Chip, Cluster, Slice, Row, Time, Packet]) that both the Stream Adapter and TRF Sequencer must produce so the Reducer can pair them element-by-element.
It is positioned within the Contraction Engine data flow as follows:
fetch() -> switch() -> collect() -> align(trf) -> contract() -> accumulate()
The Aligner consists of two parallel paths:
| Path | Component | Source | Role |
|---|---|---|---|
| Data | Stream Adapter | Collect Engine (Stream data from DM) | Collect flits, broadcast to Rows |
| Weight | TRF Sequencer | TRF (weight data) | Broadcast and transform weight data |
Overview
                  ┌───────────────────────────────────────────────────┐
                  │                      Aligner                      │
                  │                                                   │
                  │                         ┌─────────────────────┐   │
Switching ──────► │ Stream Adapter ────────►│                     │   │
Engine            │                         │ Computation mapping │───► Reducer
                  │                         │                     │   │
TRF ────────────► │ TRF Sequencer  ────────►│                     │   │
                  │                         └─────────────────────┘   │
                  │                                                   │
                  └───────────────────────────────────────────────────┘
The computation mapping consists of the following dimensions:
- Chip: No change from Stream Adapter/TRF Sequencer input
- Cluster: No change from Stream Adapter/TRF Sequencer input
- Slice: No change from Stream Adapter/TRF Sequencer input
- Row: Maps to the 8 Rows in the Reducer
- Time: The temporal dimension for sequential processing
- Packet: Data packet dimension
The key difference between the two paths is:
- Stream Adapter: Always populates Rows via broadcasting, and supports basic flit collection and data feeding for convolutions.
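- TRF Sequencer: Uses a programmable sequencer (loop entries, each with a size and a stride) to broadcast and reorder weight data, which enables more complex transformations than the Stream Adapter supports.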
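As a rough illustration of the Stream Adapter path, the plain-Rust sketch below models flit collection followed by Row broadcast. It assumes 32-byte flits and a collection factor of 2, the values used in the batched MatMul example in this section. The function names `collect_into_packets` and `broadcast_to_rows` are hypothetical and are not part of `furiosa_visa_std`; the hardware may implement these steps differently.

```rust
// Hypothetical software model of the Stream Adapter. Names and sizes are
// illustrative assumptions, not the furiosa_visa_std API.
const ROWS: usize = 8;

/// Concatenates every `n` consecutive flits into one packet.
fn collect_into_packets(flits: &[Vec<u8>], n: usize) -> Vec<Vec<u8>> {
    flits.chunks(n).map(|group| group.concat()).collect()
}

/// Copies one packet to every Row, so each Row sees identical data.
fn broadcast_to_rows(packet: &[u8], rows: usize) -> Vec<Vec<u8>> {
    vec![packet.to_vec(); rows]
}

fn main() {
    // Four 32-byte flits, each filled with its own index for easy inspection.
    let flits: Vec<Vec<u8>> = (0..4u8).map(|i| vec![i; 32]).collect();

    // Collecting 2 flits at a time yields 64-byte packets.
    let packets = collect_into_packets(&flits, 2);
    assert_eq!(packets.len(), 2);
    assert_eq!(packets[0].len(), 64);

    // Every Row receives the same packet.
    let rows = broadcast_to_rows(&packets[0], ROWS);
    assert!(rows.iter().all(|r| r == &packets[0]));

    println!(
        "{} packets of {} bytes, each broadcast to {} rows",
        packets.len(),
        packets[0].len(),
        rows.len()
    );
}
```

The model keeps collection and broadcast as separate steps to mirror the description above: collection changes the Packet size, while broadcast only replicates a finished packet across Rows.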
Example: Batched MatMul
A batched matrix multiplication demonstrates how the Stream Adapter and TRF Sequencer align data and weights into a matching computation mapping (each detailed in the Stream Adapter and TRF Sequencer sub-sections). The code below does three things:
- Flit Collection (Stream Adapter, `collect_flits = 2`): `L = 2` flits are collected from the innermost `Time` axis into the `Packet` dimension, forming a 64B packet. The collected data is broadcast to Rows (1, 2, 4, or 8 rows depending on the computation mapping).
- Packet Broadcast (TRF Sequencer, `reg_read_size = 32B`): The TRF Sequencer reads 32B (`K = 16` bf16) contiguously each cycle and broadcasts twice to fill the 64 bytes, matching the Stream Adapter's Packet.
- Time Permute (TRF Sequencer): The order of axes in TRF Element `[O = 2, M = 32, K = 16]` does not match `Time: [M = 32, O = 2]`. The sequencer reorders this by placing `O` in Entry 0 (inner loop) with stride 1024, while `M` uses Entry 1 (outer loop) with stride 32.
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![M = 32, N = 8, K = 16, L = 2, O = 2];

/// Stores weights into TRF (sub context).
fn store_weights<'l, const T: Tu>(
    weights: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![N, O, M], m![K]>,
) -> TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, M, K]> {
    // TRF mapping: [
    //   Row: [N = 8]: 8 output channels mapped to 8 Rows
    //   Element: [O = 2, M = 32, K = 16]: each Row stores 2×32×16 = 1024 bf16 elements
    // ]
    weights.to_trf(TrfAddress::FirstHalf)
}

/// Aligns data and weights, then contracts (main context).
fn matmul<'l, const T: Tu>(
    input: CollectTensor<'l, T, bf16, m![1], m![1], m![1], m![M, O, L], m![K]>,
    trf: &TrfTensor<bf16, m![1], m![1], m![1], m![N], m![O, M, K]>,
) -> AccumulationTensor<'l, T, f32, m![1], m![1], m![1], m![M, O, L], m![N]> {
    // Collect mapping: [Time: [M = 32, O = 2, L = 2], Packet: [K = 16]]
    // TRF mapping: [Row: [N = 8], Element: [O = 2, M = 32, K = 16]]
    //
    // Stream Adapter (collect_flits = 2):
    //   Flit Collection:
    //     Collects L = 2 flits from innermost Time into Packet.
    //     After collection, the computation mapping dimensions become:
    //     Time = [M = 32, O = 2], Packet = [L = 2, K = 16] = 32 bf16 = 64B
    //   Broadcasts Packet to Rows (N = 8).
    //
    // TRF Sequencer (reg_read_size = 32B):
    //   Packet Broadcast:
    //     Reads K = 16 bf16 = 32B (one reg_read_size unit) contiguously from TRF,
    //     then broadcasts 2× to fill the 64B, matching Packet = [L = 2, K = 16].
    //   Time Permute:
    //     The TRF Element axes outside the reg_read_size unit (K) are ordered
    //     [O = 2 (outer), M = 32 (inner)], but Time is [M = 32 (outer), O = 2 (inner)],
    //     so the sequencer reorders M and O.
    //
    // Compiler-generated TRF Sequencer configuration (strides in bytes):
    //   Entry 0: { size: 2, stride: 1024 }: O (inner loop, stride = K×M×sizeof(bf16))
    //   Entry 1: { size: 32, stride: 32 }: M (outer loop, stride = K×sizeof(bf16))
    //
    // Computation mapping: [Time: [M = 32, O = 2], Row: [N = 8], Packet: [L = 2, K = 16]]
    // Output mapping: [Time: [M = 32, O = 2, L = 2], Packet: [N = 8]]
    // (K is contracted, column major)
    input.align::<m![M, O], m![L, K], _, _>(trf)
        .contract::<m![L]>()
        .accumulate::<m![M, O, L], m![N]>(AccumulationKind::Interleaved)
}
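The Time Permute and Packet Broadcast steps can be checked with a small plain-Rust model of the sequencer's address generation. This is only a sketch under stated assumptions: the `Entry` struct, `read_offsets`, and `broadcast_packet` are hypothetical names that do not come from `furiosa_visa_std`, strides are taken to be byte offsets within one Row's TRF Element, and Entry 0 is taken to be the innermost loop, as in the configuration quoted in the example.

```rust
/// One loop level of the sequencer (hypothetical model):
/// `size` iterations, spaced `stride` bytes apart.
#[derive(Clone, Copy)]
struct Entry {
    size: usize,
    stride: usize,
}

/// Generates byte offsets in read order; `entries[0]` is the innermost loop.
fn read_offsets(entries: &[Entry]) -> Vec<usize> {
    let mut offsets = vec![0usize];
    // Build outward: each outer level repeats the inner pattern at its stride.
    for e in entries {
        let inner = offsets.clone();
        offsets = (0..e.size)
            .flat_map(|i| inner.iter().map(move |o| o + i * e.stride))
            .collect();
    }
    offsets
}

/// Reads `reg_read_size` bytes at `offset` and repeats them to fill the packet.
fn broadcast_packet(row: &[u8], offset: usize, reg_read_size: usize, packet_size: usize) -> Vec<u8> {
    row[offset..offset + reg_read_size].repeat(packet_size / reg_read_size)
}

fn main() {
    const K: usize = 16;
    const M: usize = 32;
    const O: usize = 2;
    const BF16_BYTES: usize = 2;

    // Entry 0 walks O (stride 1024 B), Entry 1 walks M (stride 32 B).
    let entries = [
        Entry { size: O, stride: K * M * BF16_BYTES },
        Entry { size: M, stride: K * BF16_BYTES },
    ];
    let offsets = read_offsets(&entries);

    // Time step t = m * O + o must read TRF Element [o, m, 0..K].
    for m in 0..M {
        for o in 0..O {
            assert_eq!(offsets[m * O + o], (o * M + m) * K * BF16_BYTES);
        }
    }

    // Fill one Row so that each 32-byte unit holds its own unit index.
    let row: Vec<u8> = (0..O * M * K * BF16_BYTES)
        .map(|b| (b / (K * BF16_BYTES)) as u8)
        .collect();

    // The second time step (m = 0, o = 1) reads unit 32, broadcast to 64 bytes.
    let packet = broadcast_packet(&row, offsets[1], K * BF16_BYTES, 64);
    assert_eq!(packet.len(), 64);
    assert!(packet.iter().all(|&b| b == 32));
    println!("first offsets: {:?}", &offsets[..4]);
}
```

Under these assumptions the generated order visits `M` in the outer loop and `O` in the inner loop, which is the `Time: [M = 32, O = 2]` order the Stream Adapter side produces.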
For details on each component, see the sub-sections:
- Stream Adapter — Flit collection, Rows broadcast
- Advanced Operations — Transpose, Shift (for convolutions)
- TRF Sequencer — SRAM-to-TRF, weight broadcasting