Collect Engine

All downstream engines (Contraction Engine, Vector Engine, Cast Engine, Transpose Engine, and Commit Engine) consume exactly 32-byte flits. The Collect Engine normalizes arbitrary-sized packets to one flit in two steps:

Pad the input packet up to the next 32-byte boundary. Skipped if the packet is already 32-byte aligned.
Split at the flit boundary: the inner 32 bytes become Packet2, and the outer flit count is absorbed into Time2. Skipped if the padded packet is exactly 32 bytes (a single flit).

The resulting CollectTensor either flows down the pipeline to a downstream engine or is stored in the Register Files.
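
The two steps above can be sketched as plain size arithmetic. This is a minimal illustration, not the library's implementation: `normalize_packet` and `FLIT_BYTES` are hypothetical names introduced here, and the sketch models only byte counts, not axis mappings.

```rust
/// Flit size in bytes, as stated in the text.
const FLIT_BYTES: usize = 32;

/// Hypothetical helper (not part of furiosa_opt_std): models the two
/// collect steps on a packet size given in bytes.
/// Returns (padded_bytes, flit_count).
fn normalize_packet(packet_bytes: usize) -> (usize, usize) {
    assert!(packet_bytes > 0, "packet must be non-empty");
    // Step 1: pad up to the next 32-byte boundary (no-op when aligned).
    let flits = (packet_bytes + FLIT_BYTES - 1) / FLIT_BYTES;
    let padded = flits * FLIT_BYTES;
    // Step 2: split into flits; `flits` is the count absorbed into Time2.
    (padded, flits)
}
```

Under this model, a 16-byte packet and a 32-byte packet both yield one flit, while a 64-byte packet yields two.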

Interface

SwitchTensor and FetchTensor both expose .collect() with the same semantics. The FetchTensor entry point bypasses the Switch Engine when no slice distribution is needed.

impl<'l, const T: Tu, P: CanApplyCollect, D: Scalar, Chip: M, Cluster: M, Slice: M, Time: M, Packet: M, B: Backend>
    TuTensor<'l, T, P, D, Chip, Cluster, Slice, Time, Packet, B>
{
    /// Normalizes packet to exactly 32 bytes (one flit).
    ///
    /// Pads to flit-aligned boundary, then splits: inner 32 bytes become
    /// `Packet2`, outer flit portion is absorbed into `Time2`. For packets
    /// of at most 32 bytes, only padding (if any) is applied; no split occurs.
    #[primitive(TuTensor::collect)]
    pub fn collect<Time2: M, Packet2: M>(self) -> CollectTensor<'l, T, D, Chip, Cluster, Slice, Time2, Packet2, B> {
        verify_collect::<D, Time, Packet, Time2, Packet2>();
        CollectTensor::new(self.ctx, self.inner.transpose(false))
    }
}

Examples

Single-Flit Packet

#![feature(adt_const_params)]
#![allow(unused)]
fn main() {
extern crate furiosa_opt_std;
use furiosa_opt_std::prelude::*;
axes![A = 8, B = 32];

fn collect_identity<'l, const T: Tu>(
    input: SwitchTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B # 32]> {
    // B=32 elements × 1 byte (i8) = 32 bytes = one flit.
    // Time and Packet pass through unchanged.
    input.collect()
}

let mut ctx = Context::acquire();

let c: SwitchTensor<'_, _, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B]> = SwitchTensor::new(&mut ctx.main, Tensor::zero());
let _o = collect_identity(c);
}

When the input packet is already exactly 32 bytes, collect passes it through unchanged (B = 32 elements × 1 byte for i8 = 32 bytes).

Before:   Time = m![A]
          Packet = m![B]
          ┌──────────────────────────┐
          │            B             │  32 bytes
          └──────────────────────────┘

After:    Time = m![A]
          Packet = m![B # 32]
          ┌──────────────────────────┐
          │          B # 32          │  32 bytes
          └──────────────────────────┘

Sub-Flit Packet

#![feature(adt_const_params)]
#![allow(unused)]
fn main() {
extern crate furiosa_opt_std;
use furiosa_opt_std::prelude::*;
axes![A = 8, B = 16];

fn collect_padding<'l, const T: Tu>(
    input: SwitchTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B # 32]> {
    // B=16 elements × 1 byte = 16 bytes < 32 bytes.
    // Padded to 32 bytes: Packet2 = m![B # 32].
    // Time unchanged since it fits in one flit.
    input.collect()
}

let mut ctx = Context::acquire();

let c: SwitchTensor<'_, _, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B]> = SwitchTensor::new(&mut ctx.main, Tensor::zero());
let _o = collect_padding(c);
}

When the input packet is smaller than 32 bytes, collect pads to 32 bytes (B = 16 elements × 1 byte for i8 = 16 bytes).

Before:   Time = m![A]
          Packet = m![B]
          ┌────────────┐
          │     B      │  16 bytes
          └────────────┘

After:    Time = m![A]
          Packet = m![B # 32]
          ┌────────────┬─────────────┐
          │     B      │     pad     │  32 bytes
          └────────────┴─────────────┘

Multi-Flit Packet

#![feature(adt_const_params)]
#![allow(unused)]
fn main() {
extern crate furiosa_opt_std;
use furiosa_opt_std::prelude::*;
axes![A = 8, B = 32];

fn collect_multi_flit<'l, const T: Tu>(
    input: SwitchTensor<'l, T, bf16, m![1], m![1 # 2], m![1 # 256], m![A], m![B]>,
) -> CollectTensor<'l, T, bf16, m![1], m![1 # 2], m![1 # 256], m![A, B / 16], m![B % 16]> {
    // B=32 elements × 2 bytes (bf16) = 64 bytes = 2 flits.
    // Inner 16 elements = 32 bytes → Packet2 = m![B % 16].
    // Outer 2 flits → absorbed into Time2 = m![A, B / 16].
    input.collect()
}

let mut ctx = Context::acquire();

let c: SwitchTensor<'_, _, bf16, m![1], m![1 # 2], m![1 # 256], m![A], m![B]> = SwitchTensor::new(&mut ctx.main, Tensor::zero());
let _o = collect_multi_flit(c);
}

When the input packet exceeds 32 bytes, collect splits into flits and absorbs the outer flit count into Time (B = 32 elements × 2 bytes for bf16 = 64 bytes, so B / 16 = 2 flits).

Before:   Time = m![A]
          Packet = m![B]
          ┌──────────────────────────┬──────────────────────────┐
          │       B / 16 == 0        │       B / 16 == 1        │  64 bytes
          └──────────────────────────┴──────────────────────────┘
                    32 bytes                   32 bytes

After:    Time = m![A, B / 16]
          Packet = m![B % 16]
          ┌──────────────────────────┐
          │          B % 16          │  32 bytes  × B/16 time steps
          └──────────────────────────┘
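
The split in the diagram above amounts to a division and a remainder on the element index. The sketch below illustrates that mapping; `split_index` is a hypothetical helper introduced here, and it assumes the element size divides 32 bytes evenly.

```rust
const FLIT_BYTES: usize = 32;

/// Hypothetical helper: maps element index `b` of the original packet to
/// (flit index absorbed into Time2, position inside the flit in Packet2).
/// Assumes `elem_bytes` divides FLIT_BYTES evenly.
fn split_index(b: usize, elem_bytes: usize) -> (usize, usize) {
    // Elements that fit in one flit: 16 for bf16 (2 bytes each).
    let elems_per_flit = FLIT_BYTES / elem_bytes;
    (b / elems_per_flit, b % elems_per_flit)
}
```

For bf16, element 15 is the last element of the first flit and element 16 is the first element of the second, which corresponds to B / 16 and B % 16 in the signature.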

Multi-Flit Packet With Padding

#![feature(adt_const_params)]
#![allow(unused)]
fn main() {
extern crate furiosa_opt_std;
use furiosa_opt_std::prelude::*;
axes![A = 8, B = 56];

fn collect_multi_flit_padded<'l, const T: Tu>(
    input: SwitchTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B]>,
) -> CollectTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![A, B # 64 / 32], m![B # 64 % 32]> {
    // B is not 32-byte aligned; first pad B to a multiple of 32 bytes.
    // B # 64=64 elements × 1 byte (i8) = 64 bytes = 2 flits.
    // Inner 32 elements = 32 bytes → Packet2 = m![B # 64 % 32].
    // Outer 2 flits → absorbed into Time2 = m![A, B # 64 / 32].
    input.collect()
}

let mut ctx = Context::acquire();

let c: SwitchTensor<'_, _, i8, m![1], m![1 # 2], m![1 # 256], m![A], m![B]> = SwitchTensor::new(&mut ctx.main, Tensor::zero());
let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { collect_multi_flit_padded(c) }));
}

When the input packet is not aligned to 32 bytes, it is first padded (B = 56 elements × 1 byte for i8 = 56 bytes, padded to 64 bytes). Then, collect splits into flits and absorbs the outer flit count (B # 64 / 32 = 2) into Time.

Before:   Time = m![A]
          Packet = m![B]
          ┌──────────────────────────┬───────────────┐
          │       B / 32 == 0        │  B / 32 == 1  │  56 bytes
          └──────────────────────────┴───────────────┘
                    32 bytes             24 bytes

Padded:   Time = m![A]
          Packet = m![B # 64]
          ┌──────────────────────────┬───────────────┬──────────┐
          │       B / 32 == 0        │  B / 32 == 1  │   pad    │  64 bytes
          └──────────────────────────┴───────────────┴──────────┘
                    32 bytes                   32 bytes

After:    Time = m![A, B # 64 / 32]
          Packet = m![B # 64 % 32]
          ┌──────────────────────────┐
          │       B # 64 % 32        │  32 bytes  × B # 64 / 32 time steps
          └──────────────────────────┘
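
The pad-then-split sequence can also be modeled on raw bytes. This is a sketch under stated assumptions: `pad_and_split` is a hypothetical function, and the fill value for padding is taken as a parameter because the text does not say what the hardware writes into padded lanes.

```rust
const FLIT_BYTES: usize = 32;

/// Hypothetical model of collect on raw bytes: pads `packet` with `pad`
/// up to a flit boundary, then splits it into 32-byte flits (one per
/// time step). The pad value is an assumption of this sketch.
fn pad_and_split(packet: &[u8], pad: u8) -> Vec<[u8; FLIT_BYTES]> {
    let mut padded = packet.to_vec();
    let rem = padded.len() % FLIT_BYTES;
    if rem != 0 {
        // Step 1: extend to the next multiple of 32 bytes.
        padded.resize(padded.len() + FLIT_BYTES - rem, pad);
    }
    // Step 2: cut into flits of exactly 32 bytes each.
    padded
        .chunks_exact(FLIT_BYTES)
        .map(|chunk| {
            let mut flit = [0u8; FLIT_BYTES];
            flit.copy_from_slice(chunk);
            flit
        })
        .collect()
}
```

Applied to a 56-byte packet, the sketch produces two flits: the first holds bytes 0 to 31, and the second holds bytes 32 to 55 followed by 8 bytes of padding.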

Register File Loading

After normalization, store the CollectTensor into the Tensor Register File via .to_trf() or the Vector Register File via .to_vrf(). The “To TRF” and “To VRF” subsections below describe the store mechanism: how time_inner is derived, and how [time_inner, Packet] is sequenced into Element.

To TRF

.to_trf::<Row, Element>() partitions the TRF along its row dimension. The kernel writer chooses Row (the row layout in the TRF, with Row::SIZE in {1, 2, 4, 8}) and Element (the per-row element layout). The compiler then finds a time_inner such that Time decomposes into [Row, time_inner] and [time_inner, Packet] is sequenced into Element, so each row of the TRF is filled by time_inner consecutive flits.
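
As a rough illustration of the decomposition described above, the sketch below works on sizes only. The helpers `time_inner` and `elements_per_row` are hypothetical, and the real compiler reasons about axis mappings rather than bare sizes, so treat this as a simplified model.

```rust
/// Hypothetical size-only model: given Time::SIZE and Row::SIZE, returns
/// the time_inner for which Time decomposes into [Row, time_inner], or
/// None when no such decomposition exists.
fn time_inner(time_size: usize, row_size: usize) -> Option<usize> {
    // Row::SIZE must be one of {1, 2, 4, 8}, per the text.
    if !matches!(row_size, 1 | 2 | 4 | 8) {
        return None;
    }
    if time_size % row_size != 0 {
        return None;
    }
    Some(time_size / row_size)
}

/// Elements stored per TRF row: time_inner consecutive flits, each
/// carrying `packet_elems` elements.
fn elements_per_row(time_size: usize, row_size: usize, packet_elems: usize) -> Option<usize> {
    time_inner(time_size, row_size).map(|inner| inner * packet_elems)
}
```

With Time of size 1, Row of size 1, and a 32-element packet, this model gives time_inner = 1 and 32 elements per row.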

.to_trf() uses the entire TRF (TrfAddress::Full). To let two tensors occupy the TRF independently, use .to_trf_at::<Row, Element>(address) with a TrfAddress that selects the region:

Full: the entire TRF.
FirstHalf / SecondHalf: the TRF split into two halves.

The compiler bounds the resulting tensor’s total byte size by the chosen region’s capacity.
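
The capacity bound can be pictured with a small model. Everything here is an assumption for illustration: `Region` is a stand-in enum rather than the library's TrfAddress type, `fits` is a hypothetical check, and the total TRF capacity is passed as a parameter because the text does not state it.

```rust
/// Stand-in for the region choices listed above (not the real TrfAddress).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Region {
    Full,
    FirstHalf,
    SecondHalf,
}

/// Capacity in bytes of `region`, given the whole-TRF capacity `trf_bytes`.
/// `trf_bytes` is a caller-supplied assumption; the real value is not
/// given in the text.
fn region_capacity(region: Region, trf_bytes: usize) -> usize {
    match region {
        Region::Full => trf_bytes,
        Region::FirstHalf | Region::SecondHalf => trf_bytes / 2,
    }
}

/// Hypothetical check mirroring the bound: the tensor's total byte size
/// must not exceed the chosen region's capacity.
fn fits(region: Region, trf_bytes: usize, tensor_bytes: usize) -> bool {
    tensor_bytes <= region_capacity(region, trf_bytes)
}
```

In this model, two tensors that each fit within half of the capacity can be placed in FirstHalf and SecondHalf without overlapping.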

#![feature(adt_const_params)]
#![allow(unused)]
fn main() {
extern crate furiosa_opt_std;
use furiosa_opt_std::prelude::*;
axes![B = 32];

fn load_trf<'l, const T: Tu>(
    input: CollectTensor<'l, T, i8, m![1], m![1 # 2], m![1 # 256], m![1], m![B]>,
) -> TrfTensor<i8, m![1], m![1 # 2], m![1 # 256], m![1], m![B]> {
    input.to_trf()
}

let mut ctx = Context::acquire();

let c: CollectTensor<'_, _, i8, m![1], m![1 # 2], m![1 # 256], m![1], m![B]> = CollectTensor::new(&mut ctx.main, Tensor::zero());
let _o = load_trf(c);
}

To VRF

.to_vrf::<Element>() stores the flits into the VRF; .to_vrf_at::<Element>(address) stores at a raw Address (no bounded-region selection). The kernel writer chooses Element, the destination element layout in the VRF. Unlike .to_trf (which accepts any Scalar element type), .to_vrf requires a VeScalar element type (i.e., i32 or f32) because the Vector Engine downstream consumes these types only.

#![feature(adt_const_params)]
#![allow(unused)]
fn main() {
extern crate furiosa_opt_std;
use furiosa_opt_std::prelude::*;
axes![B = 64];

fn load_vrf<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![1 # 256], m![B / 8], m![B % 8]>,
) -> VrfTensor<i32, m![1], m![1 # 2], m![1 # 256], m![B]> {
    input.to_vrf()
}

let mut ctx = Context::acquire();

let c: CollectTensor<'_, _, i32, m![1], m![1 # 2], m![1 # 256], m![B / 8], m![B % 8]> = CollectTensor::new(&mut ctx.main, Tensor::zero());
let _o = load_vrf(c);
}
