
Kernel Examples

The introductory tutorial briefly introduced temporal and spatial partitioning for large tensors in its Further Reading section. The preceding chapters explained how mapping expressions distribute work across TCP’s hardware hierarchy and how each component reduces partial results. This chapter shows how to combine mapping, movement, computation, and scheduling into complete, working kernels. The table below summarizes the available parallelism and reduction at each level:

| Dimension | Type     | Defined in        | Reduced in   |
|-----------|----------|-------------------|--------------|
| Chip      | Spatial  | HBM, SRAM, Stream | DMA + Vector |
| Cluster   | Spatial  | SRAM, Stream      | DMA + Vector |
| Slice     | Spatial  | SRAM, Stream      | Vector       |
| Row       | Spatial  | TRF               | Contraction  |
| Time      | Temporal | Stream            | Contraction  |
| Packet    | Spatial  | Stream            | Contraction  |

For cross-chip and cross-cluster reduction patterns (the Chip and Cluster rows above), see Chip/Cluster Reduce, which demonstrates DMA broadcast followed by Vector Engine binary add.
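Independent of TCP's specific API, the collective semantics behind those patterns can be sketched in plain NumPy, modeling each chip's partial result as an array. The function names here (`reduce_scatter`, `all_reduce`) are illustrative placeholders, not TCP calls; on real hardware the adds run on the Vector Engine and the data movement is DMA broadcast.

```python
import numpy as np

def reduce_scatter(partials):
    """Chip i ends up with shard i of the elementwise sum (illustrative only)."""
    n = len(partials)
    # Split each chip's partial result into n shards along axis 0.
    shards = [np.array_split(p, n) for p in partials]
    # Shard i is summed across all source chips and lands on chip i.
    return [sum(shards[src][i] for src in range(n)) for i in range(n)]

def all_reduce(partials):
    """Every chip ends up with the full elementwise sum (illustrative only)."""
    reduced_shards = reduce_scatter(partials)   # phase 1: reduce-scatter
    full = np.concatenate(reduced_shards)       # phase 2: all-gather
    return [full.copy() for _ in partials]
```

The two-phase AllReduce shown here (reduce-scatter then all-gather) is the standard decomposition; the Chip/Cluster Reduce chapter covers how TCP maps each phase onto DMA and the Vector Engine.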

The examples progress from single-engine patterns to composed multi-engine patterns to full model implementations:

  • Tiling (coming soon): Tile size selection, memory layout, and accumulation strategies.
  • Split Reduce: Interleaved fetch for reducing across multiple tensor instances. Use when a reduction dimension exceeds what a single tile can accumulate.
  • Chip/Cluster Reduce: ReduceScatter and AllReduce across chips. Use when computation must be distributed across multiple chips or clusters.
  • Fetch and Commit Engine: Axis permutation, full-flit commit, tail padding, and tensor segmentation. Use when data layout transformations are needed between memory and compute.
  • GEMM with Double-Buffering (coming soon): DMA load from HBM, sub-context TRF prefetch, main-context tiled contraction, cast, and commit. A short end-to-end example bridging single-engine patterns and full model implementations.
  • Transformer: Llama 3 70B implementation with prefill and decode phases. A full model combining tiling, multi-chip reduce, and memory management.
  • Mixture of Experts: Branchless TopK routing and blockwise sparse computation. A full model demonstrating dynamic routing with sparse computation patterns.
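As a preview of the double-buffering pattern named in the GEMM example above, the control flow can be sketched generically: two buffers are ping-ponged so that the fetch of tile i+1 is issued while tile i is being computed. The helpers `load_tile` and `compute_tile` below are hypothetical stand-ins, not TCP API calls, and on real hardware the load and compute would overlap rather than run sequentially as they do in this Python model.

```python
def run_double_buffered(num_tiles, load_tile, compute_tile):
    """Sketch of a double-buffered tile loop (sequential stand-in for
    hardware overlap; load_tile/compute_tile are placeholder callbacks)."""
    buffers = [None, None]
    buffers[0] = load_tile(0)  # prefetch the first tile before the loop
    results = []
    for i in range(num_tiles):
        current = buffers[i % 2]
        if i + 1 < num_tiles:
            # Issue the next fetch into the other buffer; on hardware this
            # DMA would proceed concurrently with the compute below.
            buffers[(i + 1) % 2] = load_tile(i + 1)
        results.append(compute_tile(current))
    return results
```

The full GEMM chapter adds the TCP-specific pieces this sketch omits: sub-context TRF prefetch, the tiled contraction itself, and the cast and commit stages.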