Vector Engine

The Vector Engine applies element-wise operations: activations such as GELU and SiLU, normalizations such as softmax and layer norm, and binary operations. It is used both after the Contraction Engine (to post-process f32/i32 accumulator results) and independently for element-wise kernels that skip contraction entirely.
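As a point of reference, these element-wise stages behave like a map over 32-bit values. The sketch below is ordinary Rust, not the Tensor Unit API, applying a tanh-approximation GELU to f32 accumulator outputs:

```rust
// Reference model only (plain Rust, not the Vector Engine API):
// an element-wise stage maps one function over every 32-bit value.
// Here, the tanh approximation of GELU over f32 accumulator results.
fn gelu(x: f32) -> f32 {
    0.5 * x
        * (1.0
            + ((2.0 / std::f32::consts::PI).sqrt() * (x + 0.044715 * x * x * x)).tanh())
}

fn main() {
    // Pretend these are f32 accumulator outputs from the Contraction Engine.
    let acc: Vec<f32> = vec![-2.0, -0.5, 0.0, 0.5, 2.0];
    let out: Vec<f32> = acc.iter().map(|&x| gelu(x)).collect();
    println!("{:?}", out);
}
```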

The Vector Engine operates exclusively on i32 and f32 data types. Data moves in 32-byte units called flits, each containing eight 32-bit values. This 32-bit restriction exists because lower-precision data is widened before or during computation: bf16 products accumulate in f32, and i8 products accumulate in i32.
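The flit layout can be illustrated in plain Rust (an illustration only, not the hardware interface): eight 32-bit values always pack into exactly 32 bytes.

```rust
// Illustration of a flit: eight 32-bit values occupy exactly 32 bytes.
fn main() {
    let values: [f32; 8] = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0];
    // Serialize each f32 into its 4 little-endian bytes and concatenate.
    let flit: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
    assert_eq!(flit.len(), 32); // one flit = 32 bytes = 8 x 32-bit lanes
}
```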

The Vector Engine sits between the Contraction Engine and the Cast Engine in the Tensor Unit pipeline:

Fetch -> Switch -> Collect -> Contraction -> Vector -> Cast -> Transpose -> Commit
                    |                       ^
                    +-----------------------+
                     (skip contraction)

Data enters the Vector Engine from one of two sources:

  • the Collect Engine, when the Contraction Engine is skipped
  • the Contraction Engine, when post-processing contraction results

Interface

    /// Initializes Vector Engine processing for this tensor.
    #[primitive(CollectTensor::vector_init)]
    pub fn vector_init(self) -> VectorInitTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
    where
        D: VeScalar,

    #[primitive(VectorInitTensor::vector_intra_slice_branch)]
    pub fn vector_intra_slice_branch(
        self,
        branch: BranchMode,
    ) -> VectorBranchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet, D, NoTensor, { VeOrder::IntraFirst }> {

    #[primitive(VectorInitTensor::vector_intra_slice_unzip)]
    pub fn vector_intra_slice_unzip<I: AxisName, TileTime: M, SplitTime: M>(
        self,
    ) -> VectorTensorPair<'l, T, D, stage::Branch, Chip, Cluster, Slice, SplitTime, Packet> {

    #[primitive(VectorInitTensor::vector_inter_slice_reduce)]
    pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
        self,
        op: InterSliceReduceOpI32,
    ) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {

The same vector_init() entry point is available regardless of whether the input comes from the Collect Engine (when contraction is skipped) or the Contraction Engine (for post-contraction processing). After vector_init(), choose the first block by calling one of vector_intra_slice_branch(...), vector_intra_slice_unzip(...), or vector_inter_slice_reduce(...). For detailed stage-by-stage API coverage, see Intra-Slice Block and Inter-Slice Block.

Quick Reference

Intra-Slice Block
  • How to reach it: start with vector_init(), then call vector_intra_slice_branch().
  • Use it for: element-wise ops, binary ops, intra-slice reduce.
  • Output: chain stages, then vector_final().

Inter-Slice Block
  • How to reach it: either call vector_init() -> vector_inter_slice_reduce() first, or switch from an eligible intra-slice tensor with vector_inter_slice_reduce().
  • Use it for: reduction across the 256 slices in a cluster.
  • Output: vector_inter_slice_reduce(), then optional intra-slice work or vector_final().

Two-group intra-slice mode
  • How to reach it: start with vector_init(), then call vector_intra_slice_unzip().
  • Use it for: processing two interleaved groups before combining them.
  • Output: _zip to merge, then vector_final().

Examples

ReLU Activation

Applying ReLU activation (max(x, 0)) after matrix multiplication:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![M = 128, N = 256, K = 64];

fn relu<'l, const T: Tu>(
    input: AccumulationTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_clip(ClipBinaryOpF32::Max, 0.0f32)
        .vector_final()
}

Inter-Slice Reduce

Reducing a tensor across slices:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![R = 4, A = 512];

fn inter_slice_reduce<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_final()
}

Ordering

IntraFirst
  • Flow: Intra-Slice Block -> optional Inter-Slice Block
  • Typical use: post-process each slice, then reduce across slices.

InterFirst
  • Flow: Inter-Slice Block -> optional Intra-Slice Block
  • Typical use: reduce first, then apply element-wise post-processing.

The examples above show one concrete IntraFirst path and one concrete InterFirst path.
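A plain-Rust reference model (not the Tensor Unit API) makes the difference between the two orderings concrete for a ReLU combined with a cross-slice add-reduce; the results diverge whenever a slice holds negative values:

```rust
// Reference model only: two "slices" of two elements each.
fn relu(x: f32) -> f32 {
    x.max(0.0)
}

fn main() {
    let slices = vec![vec![-1.0f32, 2.0], vec![3.0, -4.0]];

    // IntraFirst: element-wise op inside each slice, then reduce across slices.
    let intra_first: Vec<f32> = (0..2)
        .map(|i| slices.iter().map(|s| relu(s[i])).sum())
        .collect();

    // InterFirst: reduce across slices first, then apply the element-wise op.
    let inter_first: Vec<f32> = (0..2)
        .map(|i| relu(slices.iter().map(|s| s[i]).sum()))
        .collect();

    // The orderings are not interchangeable for non-linear ops like ReLU.
    println!("{:?} vs {:?}", intra_first, inter_first); // [3.0, 2.0] vs [2.0, 0.0]
}
```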

Constraints

When using i8 or bf16 input without the Contraction Engine, widening must still fit within one 32-byte flit. This limits how much data the Fetch Engine can supply per flit after type conversion. See Fetch Engine: Type Casting Constraints.
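The arithmetic behind this constraint can be sketched as follows; fetched_per_flit is a hypothetical helper for illustration, and the authoritative rules are in the Fetch Engine chapter:

```rust
// Sketch of the flit-budget arithmetic (assumed model): because data is
// widened to 32 bits, the widened width, not the source width, bounds
// how many elements one 32-byte flit can carry.
const FLIT_BYTES: usize = 32;

fn fetched_per_flit(src_bytes: usize, widened_bytes: usize) -> usize {
    // The post-cast data must still fit in a single flit.
    FLIT_BYTES / widened_bytes.max(src_bytes)
}

fn main() {
    println!("i8   -> i32: {} elements/flit", fetched_per_flit(1, 4)); // 8, not 32
    println!("bf16 -> f32: {} elements/flit", fetched_per_flit(2, 4)); // 8, not 16
    println!("f32 native:  {} elements/flit", fetched_per_flit(4, 4)); // 8
}
```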