Vector Engine
The Vector Engine applies element-wise operations: activations such as GELU and SiLU, normalizations such as softmax and layer norm, and binary operations. It is used both after the Contraction Engine (to post-process f32/i32 accumulator results) and independently for element-wise kernels that skip contraction entirely.
The Vector Engine operates exclusively on i32 and f32 data types. Data moves in 32-byte units called flits, each containing eight 32-bit values. This 32-bit restriction exists because lower-precision data is widened before or during computation: bf16 products accumulate in f32, and i8 products accumulate in i32.
The Vector Engine sits between the Contraction Engine and the Cast Engine in the Tensor Unit pipeline:
Fetch -> Switch -> Collect -> Contraction -> Vector -> Cast -> Transpose -> Commit
| ^
+-----------------------+
(skip contraction)
Data enters the Vector Engine as either:
- From the Collect Engine when the Contraction Engine is skipped
- From the Contraction Engine when it produces the input
Interface
/// Initializes Vector Engine processing for this tensor.
#[primitive(CollectTensor::vector_init)]
pub fn vector_init(self) -> VectorInitTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet>
where
D: VeScalar,
#[primitive(VectorInitTensor::vector_intra_slice_branch)]
pub fn vector_intra_slice_branch(
self,
branch: BranchMode,
) -> VectorBranchTensor<'l, T, D, Chip, Cluster, Slice, Time, Packet, D, NoTensor, { VeOrder::IntraFirst }> {
#[primitive(VectorInitTensor::vector_intra_slice_unzip)]
pub fn vector_intra_slice_unzip<I: AxisName, TileTime: M, SplitTime: M>(
self,
) -> VectorTensorPair<'l, T, D, stage::Branch, Chip, Cluster, Slice, SplitTime, Packet> {
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpI32,
) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
The same vector_init() entry point is available regardless of whether the input comes from the Collect Engine (when contraction is skipped) or the Contraction Engine (for post-contraction processing).
After vector_init(), choose the first block by calling either vector_intra_slice_branch(...), vector_intra_slice_unzip(...), or vector_inter_slice_reduce(...).
For detailed stage-by-stage API coverage, see Intra-Slice Block and Inter-Slice Block.
Quick Reference
| Block | How to Reach It | Use It For | Output |
|---|---|---|---|
| Intra-Slice Block | Start with vector_init(), then call vector_intra_slice_branch() | Elementwise ops, binary ops, intra-slice reduce | Chain stages, then vector_final() |
| Inter-Slice Block | Either call vector_init() -> vector_inter_slice_reduce() first, or switch from an eligible intra-slice tensor with vector_inter_slice_reduce() | Reduction across the 256 slices in a cluster | vector_inter_slice_reduce(), then optional intra-slice work or vector_final() |
| Two-group intra-slice mode | Start with vector_init(), then call vector_intra_slice_unzip() | Process two interleaved groups before combining them | _zip to merge, then vector_final() |
Examples
ReLU Activation
Applying ReLU activation (max(x, 0)) after matrix multiplication:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![M = 128, N = 256, K = 64];
fn relu<'l, const T: Tu>(
input: AccumulationTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1], m![K], m![M], m![N]> {
input
.vector_init()
.vector_intra_slice_branch(BranchMode::Unconditional)
.vector_clip(ClipBinaryOpF32::Max, 0.0f32)
.vector_final()
}
}
Inter-Slice Reduce
Reducing a tensor across slices:
#![allow(unused)]
fn main() {
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];
fn inter_slice_reduce<'l, const T: Tu>(
input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
input
.vector_init()
.vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
.vector_final()
}
}
Ordering
| Order | Flow | Typical Use |
|---|---|---|
IntraFirst | Intra-Slice Block -> optional Inter-Slice Block | Post-process each slice, then reduce across slices |
InterFirst | Inter-Slice Block -> optional Intra-Slice Block | Reduce first, then apply elementwise post-processing |
The examples above show one concrete IntraFirst path and one concrete InterFirst path.
Constraints
When using i8 or bf16 input without the Contraction Engine, widening must still fit within one 32-byte flit. This limits how much data the Fetch Engine can supply per flit after type conversion. See Fetch Engine: Type Casting Constraints.