Computing Tensors

The Tensor Unit transforms data through a pipeline of eight specialized engines. Data flows from DM, through the engine pipeline, and back to DM. After the Collect Engine normalizes packets to flits (32-byte flow control units), all downstream engines — Contraction, Vector, Cast, Transpose, and Commit — operate on flits. (See Collect Engine for the normalization details.)

flowchart TB
    subgraph SRAM
        DM[(DM)] & TRF[(TRF)] & VRF[(VRF)]
    end

    subgraph TU[Tensor Unit]
        direction LR
        FE[Fetch] --> SW[Switching] --> CO[Collect] --> CE[Contraction] --> VE[Vector] --> CA[Cast] --> TR[Transpose] --> CM[Commit]
    end

    DM --> FE
    CM --> DM
    CO --> TRF --> CE
    CO --> VRF --> VE

    click FE "../moving-tensors/fetch-engine.html" "Fetch Engine"
    click SW "./switch-engine.html" "Switch Engine"
    click CO "./collect-engine.html" "Collect Engine"
    click CE "./contraction-engine/index.html" "Contraction Engine"
    click VE "./vector-engine/index.html" "Vector Engine"
    click CA "./cast-engine.html" "Cast Engine"
    click TR "./transpose-engine.html" "Transpose Engine"
    click CM "../moving-tensors/commit-engine.html" "Commit Engine"

Two register files serve distinct roles: TRF (Tensor Register File; see hello-tcp memory overview) holds weights for the Contraction Engine (load once, reuse across many cycles), while VRF (Vector Register File) holds operands for the Vector Engine. The Collect Engine loads data into TRF via `.to_trf()` and into VRF via `.to_vrf()`.
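
A minimal sketch of how a kernel might route data to the two register files, assuming a hypothetical `collect` handle whose `.to_trf()`/`.to_vrf()` methods correspond to the calls named above; the function and argument names are illustrative stand-ins, not a documented API:

```python
# Hypothetical kernel fragment. Only .to_trf()/.to_vrf() come from
# this page; every other name is an illustrative stand-in.

def stage_operands(collect, weights, activations):
    # Weights are loaded once and stay resident in TRF so the
    # Contraction Engine can reuse them across many cycles.
    collect(weights).to_trf()

    # Activations change every iteration, so they stream through
    # VRF for the Vector Engine to consume.
    collect(activations).to_vrf()
```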

Fetch and Commit are part of the Tensor Unit pipeline but interface directly with DM; see Moving Tensors.

| Engine | Function | Key Constraint |
|---|---|---|
| Fetch | Load data from DM into the pipeline | Packet must be 8-byte aligned; Slice is unchanged |
| Switching | Redistribute data across slices | Ring network topology; Slice can change |
| Collect | Normalize packets to 32-byte flits | Output = exactly one flit |
| Contraction | Einsum: matmul, convolution, attention | Weight-stationary via TRF |
| Vector | Elementwise, binary, reduce operations | Only i32/f32 input |
| Cast | Precision lowering with batching | Output = exactly one flit |
| Transpose | Reorder elements within a flit | Within-flit only |
| Commit | Write results back to DM | Flit-aligned writes |
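
Since a flit is 32 bytes, per-flit element counts follow directly from element width. A quick check, assuming standard type widths (the f16 and i8 rows are illustrative; this page only names i32 and f32):

```python
# Elements per 32-byte flit, assuming standard element widths.
FLIT_BYTES = 32

for dtype, width in [("f32", 4), ("i32", 4), ("f16", 2), ("i8", 1)]:
    print(f"{dtype}: {FLIT_BYTES // width} elements per flit")
# f32: 8, i32: 8, f16: 16, i8: 32
```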

As a kernel writer, you specify data types, tensor mapping expressions, and computations in einsum form. The compiler translates these into per-engine hardware configurations.
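
For intuition, the einsum expression for a plain matmul looks like this. The snippet checks the contraction with NumPy rather than any actual kernel toolchain; the subscript string is standard einsum notation, and nothing here is a documented compiler API:

```python
import numpy as np

# "mk,kn->mn": contract over the shared index k, the same einsum
# form a kernel writer would hand to the compiler for a matmul.
a = np.random.rand(4, 8).astype(np.float32)  # activations [m, k]
w = np.random.rand(8, 3).astype(np.float32)  # weights     [k, n]
out = np.einsum("mk,kn->mn", a, w)
assert out.shape == (4, 3)
```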

Execution Contexts

Two execution contexts enable double-buffering (preparing the next operand batch while the current one is being computed) to hide memory latency:

| Context | Compute Engines | Fetch/Commit | Typical Use |
|---|---|---|---|
| Main | Exclusive access | Dedicated units | Computation |
| Sub | Idle only | Lower bandwidth | Prefetching to TRF/VRF |

While the main context computes, the sub context prefetches the next operand batch into TRF/VRF. When the sub context is unused, the main and sub Switch Engine channels combine into dual channel mode (see Switch Engine), doubling bandwidth. See Scheduling for how the scheduler coordinates the two contexts and the DMA Engine.
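
A schematic of the double-buffering loop the two contexts enable, with hypothetical `prefetch`/`compute` stand-ins for work issued to the sub and main contexts (not a real scheduler API):

```python
# Schematic double-buffering: while the main context computes on
# batch i, the sub context prefetches batch i+1 into TRF/VRF.
def run(batches, prefetch, compute):
    if not batches:
        return
    prefetch(batches[0])              # sub context: stage the first batch
    for i, batch in enumerate(batches):
        if i + 1 < len(batches):
            prefetch(batches[i + 1])  # sub context: stage the next batch
        compute(batch)                # main context: compute current batch
        # In hardware the two calls above overlap in time; the
        # sequential order here only shows the handoff pattern.
```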

The following sections cover each engine in detail.