
Kernel Examples

The introductory tutorial briefly introduced temporal and spatial partitioning for large tensors in its Further Reading section. The preceding chapters explained how mapping expressions distribute work across TCP’s hardware hierarchy and how each component reduces partial results. This chapter shows how to combine mapping, movement, computation, and scheduling into complete, working kernels. The table below summarizes the available parallelism and reduction at each level:

| Dimension | Type     | Defined in        | Reduced in   |
|-----------|----------|-------------------|--------------|
| Chip      | Spatial  | HBM, SRAM, Stream | DMA + Vector |
| Cluster   | Spatial  | SRAM, Stream      | DMA + Vector |
| Slice     | Spatial  | SRAM, Stream      | Vector       |
| Row       | Spatial  | TRF               | Contraction  |
| Time      | Temporal | Stream            | Contraction  |
| Packet    | Spatial  | Stream            | Contraction  |

For cross-chip and cross-cluster reduction patterns (the Chip and Cluster rows above), see Chip/Cluster Reduce, which demonstrates DMA broadcast followed by Vector Engine binary add.
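Independent of TCP's specific API, the collective semantics behind those patterns can be sketched in plain NumPy, modeling each chip's partial result as an array. The function names here (`reduce_scatter`, `all_reduce`) are illustrative placeholders, not TCP calls; on real hardware the adds run on the Vector Engine and the data movement is DMA broadcast.

```python
import numpy as np

def reduce_scatter(partials):
    """Chip i ends up with shard i of the elementwise sum (illustrative only)."""
    n = len(partials)
    # Split each chip's partial result into n shards along axis 0.
    shards = [np.array_split(p, n) for p in partials]
    # Shard i is summed across all source chips and lands on chip i.
    return [sum(shards[src][i] for src in range(n)) for i in range(n)]

def all_reduce(partials):
    """Every chip ends up with the full elementwise sum (illustrative only)."""
    reduced_shards = reduce_scatter(partials)   # phase 1: reduce-scatter
    full = np.concatenate(reduced_shards)       # phase 2: all-gather
    return [full.copy() for _ in partials]
```

The two-phase AllReduce shown here (reduce-scatter then all-gather) is the standard decomposition; the Chip/Cluster Reduce chapter covers how TCP maps each phase onto DMA and the Vector Engine.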

The examples progress from single-engine patterns to composed multi-engine patterns to full model implementations:

  • Tiling (coming soon): Tile size selection, memory layout, and accumulation strategies.
  • Split Reduce: Interleaved fetch for reducing across multiple tensor instances. Use when a reduction dimension exceeds what a single tile can accumulate.
  • Chip/Cluster Reduce: ReduceScatter and AllReduce across chips. Use when computation must be distributed across multiple chips or clusters.
  • Fetch and Commit Engine: Axis permutation, full-flit commit, tail padding, and tensor segmentation. Use when data layout transformations are needed between memory and compute.
  • GEMM with Double-Buffering (coming soon): DMA load from HBM, sub-context TRF prefetch, main-context tiled contraction, cast, and commit. A short end-to-end example bridging single-engine patterns and full model implementations.
  • Transformer: Llama 3 70B implementation with prefill and decode phases. A full model combining tiling, multi-chip reduce, and memory management.
  • Mixture of Experts: Branchless TopK routing and blockwise sparse computation. A full model demonstrating dynamic routing with sparse computation patterns.
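As a preview of the double-buffering pattern named in the GEMM example above, the control flow can be sketched generically: two buffers are ping-ponged so that the fetch of tile i+1 is issued while tile i is being computed. The helpers `load_tile` and `compute_tile` below are hypothetical stand-ins, not TCP API calls, and on real hardware the load and compute would overlap rather than run sequentially as they do in this Python model.

```python
def run_double_buffered(num_tiles, load_tile, compute_tile):
    """Sketch of a double-buffered tile loop (sequential stand-in for
    hardware overlap; load_tile/compute_tile are placeholder callbacks)."""
    buffers = [None, None]
    buffers[0] = load_tile(0)  # prefetch the first tile before the loop
    results = []
    for i in range(num_tiles):
        current = buffers[i % 2]
        if i + 1 < num_tiles:
            # Issue the next fetch into the other buffer; on hardware this
            # DMA would proceed concurrently with the compute below.
            buffers[(i + 1) % 2] = load_tile(i + 1)
        results.append(compute_tile(current))
    return results
```

The full GEMM chapter adds the TCP-specific pieces this sketch omits: sub-context TRF prefetch, the tiled contraction itself, and the cast and commit stages.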