Computing Tensors
The Tensor Unit transforms data through a pipeline of eight specialized engines. Data flows from DM, through the engine pipeline, and back to DM. After the Collect Engine normalizes packets to flits (32-byte flow control units), all downstream engines — Contraction, Vector, Cast, Transpose, and Commit — operate on flits. (See Collect Engine for the normalization details.)
flowchart TB
subgraph SRAM
DM[(DM)] & TRF[(TRF)] & VRF[(VRF)]
end
subgraph TU[Tensor Unit]
direction LR
FE[Fetch] --> SW[Switching] --> CO[Collect] --> CE[Contraction] --> VE[Vector] --> CA[Cast] --> TR[Transpose] --> CM[Commit]
end
DM --> FE
CM --> DM
CO --> TRF --> CE
CO --> VRF --> VE
click FE "../moving-tensors/fetch-engine.html" "Fetch Engine"
click SW "./switch-engine.html" "Switch Engine"
click CO "./collect-engine.html" "Collect Engine"
click CE "./contraction-engine/index.html" "Contraction Engine"
click VE "./vector-engine/index.html" "Vector Engine"
click CA "./cast-engine.html" "Cast Engine"
click TR "./transpose-engine.html" "Transpose Engine"
click CM "../moving-tensors/commit-engine.html" "Commit Engine"
Two register files serve distinct roles: TRF (Tensor Register File; see hello-tcp memory overview) holds weights for the Contraction Engine (load once, reuse across many cycles), while VRF (Vector Register File) holds operands for the Vector Engine.
The Collect Engine loads data into TRF via .to_trf() and VRF via .to_vrf().
Fetch and Commit are part of the Tensor Unit pipeline but interface directly with DM; see Moving Tensors.
| Engine | Function | Key Constraint |
|---|---|---|
| Fetch | Load data from DM into the pipeline | Packet must be 8-byte aligned; Slice is unchanged |
| Switching | Redistribute data across slices | Ring network topology; Slice can change |
| Collect | Normalize packets to 32-byte flits | Output = exactly one flit |
| Contraction | Einsum: matmul, convolution, attention | Weight-stationary via TRF |
| Vector | Elementwise, binary, reduce operations | Only i32/f32 input |
| Cast | Precision lowering with batching | Output = exactly one flit |
| Transpose | Reorder elements within a flit | Within-flit only |
| Commit | Write results back to DM | Flit-aligned writes |
As a kernel writer, you specify data types, tensor mapping expressions, and computations in einsum form. The compiler translates these into per-engine hardware configurations.
Execution Contexts
Two execution contexts enable double-buffering (preparing the next operand batch while the current one is being computed) to hide memory latency:
| Context | Compute Engines | Fetch/Commit | Typical Use |
|---|---|---|---|
| Main | Exclusive access | Dedicated units | Computation |
| Sub | Idle only | Lower bandwidth | Prefetching to TRF/VRF |
While the main context computes, the sub context prefetches the next operand batch into TRF/VRF. When the sub context is unused, the main and sub Switch Engine channels combine into dual channel mode (see Switch Engine), doubling bandwidth. See Scheduling for how the scheduler coordinates the two contexts and the DMA Engine.
The following sections cover each engine in detail.