Tiling

Warning

This page is a work in progress. Content will be added in a future release.

Tiling breaks large tensors into smaller tiles that fit in on-chip memory. When a tensor exceeds VRF capacity (8KB per slice) or DM capacity, it cannot be processed in a single pass and must instead be processed over multiple iterations, one tile at a time.

When to Use Tiling

Tiling applies when:

  • A tensor dimension exceeds what fits in a single hardware pass — compare the dimension size against the DM capacity table in Memory Performance.
  • Memory bandwidth must be conserved by reusing loaded data — check whether the same data is fetched more than once across operations.
  • Computation must be distributed across time rather than space — use when the spatial dimensions are already fully distributed but a loop over tiles is still needed.

Basic Tiling Pattern

The basic pattern is: (1) choose a tile size that fits in VRF/DM, (2) loop over tiles in the outer dimensions, (3) fetch each tile from HBM to DM, (4) run the computation, and (5) accumulate partial results before writing back. The tile size must satisfy alignment constraints (32-byte flits) and leave room for double-buffering if overlapping fetch with compute.

Warning

Add a simple tiling example showing:

  • Original tensor shape exceeding VRF
  • Tile size calculation
  • Loop structure for processing tiles
  • Accumulation of partial results
// TODO: Example code
// axes![M = 8192, N = 8192, K = 2048];
//
// Tile sizes chosen to fit in VRF:
// type TileM = m![M / 32];  // 256 elements per tile
// type TileN = m![N / 32];  // 256 elements per tile
//
// Outer loop iterates over tiles
// Inner computation processes one tile
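In the meantime, the five steps of the basic pattern can be sketched in plain Python. This is purely illustrative: the 8KB VRF capacity, f32 element size, and 32-byte flit alignment come from this page, while the tensor, the fetch, and the computation (a sum of squares) are stand-ins for the real primitives.

```python
# Illustrative sketch of the basic tiling pattern in plain Python.
# VRF capacity, element size, and flit alignment are taken from this page;
# everything else is a stand-in, not a real accelerator API.

VRF_BYTES = 8 * 1024      # on-chip capacity per slice
ELEM_BYTES = 4            # f32
FLIT_BYTES = 32           # alignment unit

# Step 1: choose a tile size that fits in the VRF. Halve the budget to
# leave room for double-buffering, then round down to whole flits.
tile_elems = (VRF_BYTES // 2 // FLIT_BYTES) * (FLIT_BYTES // ELEM_BYTES)

def sum_of_squares(tensor):
    """Steps 2-5: loop over tiles, fetch, compute, accumulate."""
    acc = 0.0
    for start in range(0, len(tensor), tile_elems):   # step 2: outer loop
        tile = tensor[start:start + tile_elems]       # step 3: "fetch" HBM -> DM
        partial = sum(x * x for x in tile)            # step 4: compute one tile
        acc += partial                                # step 5: accumulate
    return acc

data = [0.5] * 8192              # 32KB of f32: four times the VRF capacity
print(tile_elems)                # 1024 elements (4KB) per tile
print(sum_of_squares(data))      # 8192 * 0.25 = 2048.0
```

With these constants the tile works out to 1024 elements (4KB), so the 8192-element input is processed in 8 iterations.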

Example: Tiled Matrix Multiplication

Warning

Add complete GEMM example with tiling:

  • Input matrices A[M, K] and B[K, N] where M, N, K exceed VRF capacity
  • Tile along M and N dimensions
  • Accumulate partial results across K tiles
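Until the full example lands, the loop structure can be sketched in plain Python with tiny stand-in shapes (the real shapes would be on the order of M = N = 8192, K = 2048; the 4-element tiles here are placeholders for VRF-sized tiles):

```python
# Sketch of tiled matrix multiplication in plain Python (pure lists, no
# accelerator API). Tiny sizes so the example is easy to trace by hand.
M, N, K = 8, 8, 8
TILE_M, TILE_N, TILE_K = 4, 4, 4   # stand-ins for VRF-sized tiles

A = [[float(i + j) for j in range(K)] for i in range(M)]      # A[M][K]
B = [[float(i * j % 5) for j in range(N)] for i in range(K)]  # B[K][N]

# C is tiled along M and N; partial products accumulate across K tiles.
C = [[0.0] * N for _ in range(M)]
for m0 in range(0, M, TILE_M):            # loop over output row tiles
    for n0 in range(0, N, TILE_N):        # loop over output column tiles
        for k0 in range(0, K, TILE_K):    # accumulate across K tiles
            for m in range(m0, m0 + TILE_M):
                for n in range(n0, n0 + TILE_N):
                    C[m][n] += sum(A[m][k] * B[k][n]
                                   for k in range(k0, k0 + TILE_K))

# Reference: untiled multiply must give the same result.
ref = [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
       for m in range(M)]
assert C == ref
```

Each (m0, n0) output tile stays resident while the k0 loop streams K tiles of A and B past it, which is the data reuse that tiling is meant to buy.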

Memory Layout

Warning

Describe how tiles are laid out in HBM and DM

Tile Size Selection

Warning

Explain constraints for choosing tile sizes:

  • VRF capacity (8KB per slice)
  • DM capacity
  • Alignment requirements (32-byte flits)
  • Trade-off between tile size and iteration count
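The constraints above can be encoded in a small helper. The 8KB VRF and 32-byte flit figures come from this page; the function itself is a hypothetical sketch, not part of any real API:

```python
# Hypothetical helper encoding the tile-size constraints listed above.
# VRF capacity and flit size are from this page; the API is made up.

def choose_tile_elems(dim, elem_bytes, vrf_bytes=8 * 1024, flit_bytes=32,
                      double_buffer=True):
    """Largest flit-aligned tile (in elements) that fits the VRF budget.

    Returns (tile_elems, iterations) for a dimension of `dim` elements.
    """
    budget = vrf_bytes // 2 if double_buffer else vrf_bytes
    elems_per_flit = flit_bytes // elem_bytes
    tile_elems = (budget // flit_bytes) * elems_per_flit  # whole flits only
    tile_elems = min(tile_elems, dim)
    # Trade-off: a larger tile means fewer iterations (less loop overhead)
    # but leaves less headroom for other operands.
    iterations = -(-dim // tile_elems)  # ceiling division
    return tile_elems, iterations

# 8192 f32 elements (32KB) against an 8KB VRF with double-buffering:
print(choose_tile_elems(8192, 4))   # -> (1024, 8)
# The same dimension in bf16 halves the bytes, doubling the tile:
print(choose_tile_elems(8192, 2))   # -> (2048, 4)
```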

Accumulation Strategy

Warning

Explain how partial results are accumulated:

  • Accumulate in higher precision (f32) to avoid precision loss
  • Store intermediate results in DM or HBM depending on size
  • Final cast to output precision (bf16)
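The precision argument can be demonstrated in plain Python by simulating bf16 as an f32 with the low 16 bits truncated (round-to-zero; real hardware may round-to-nearest, but the loss mechanism is the same):

```python
# Why partial results are accumulated in f32: near 1.0, adjacent bf16
# values are 2**-8 apart, so partial results smaller than that vanish
# if the accumulator itself is bf16.
import struct

def to_bf16(x):
    """Simulate bf16 by zeroing the low 16 bits of the f32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

partials = [0.001] * 1000   # 1000 small partial results

# Accumulating directly in bf16: each 0.001 is below the bf16 spacing
# near 1.0, so every addition rounds away and the sum never moves.
acc_bf16 = to_bf16(1.0)
for p in partials:
    acc_bf16 = to_bf16(acc_bf16 + p)

# Accumulating in higher precision, with a single final cast to bf16:
acc_f32 = 1.0
for p in partials:
    acc_f32 += p
result = to_bf16(acc_f32)

print(acc_bf16)   # 1.0 -- every contribution was lost
print(result)     # 2.0 -- contributions survived, one rounding at the end
```

Storing the f32 accumulator in DM (or spilling to HBM when the partial-result tensor is too large) and casting once at the end bounds the rounding error to a single conversion instead of one per tile.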

Example: Tiled Attention

Warning

Add attention example showing tiling for long sequences:

  • Query, Key, Value tensors with long sequence length
  • Tile along sequence dimension
  • FlashAttention-style tiling for memory efficiency
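The core of that tiling scheme, the online softmax, can be sketched in plain Python for a single query vector (an illustration of the technique, not this hardware's implementation): keys and values are consumed one sequence tile at a time, with a running max and denominator, so the full score row never has to be materialized.

```python
# FlashAttention-style tiling sketch: one query, K/V processed per
# sequence tile with an online softmax. Illustrative only.
import math

def attention_tiled(q, keys, values, tile=4):
    m = float("-inf")            # running max of scores seen so far
    l = 0.0                      # running softmax denominator
    o = [0.0] * len(values[0])   # running (unnormalized) output
    for t0 in range(0, len(keys), tile):    # loop over sequence tiles
        k_tile = keys[t0:t0 + tile]
        v_tile = values[t0:t0 + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)         # rescale old accumulators
        l = l * scale + sum(math.exp(s - m_new) for s in scores)
        o = [oi * scale for oi in o]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            o = [oi + w * vi for oi, vi in zip(o, v)]
        m = m_new
    return [oi / l for oi in o]

def attention_direct(q, keys, values):
    """Reference: materialize all scores at once, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    mx = max(scores)
    ws = [math.exp(s - mx) for s in scores]
    z = sum(ws)
    return [sum(w * v[d] for w, v in zip(ws, values)) / z
            for d in range(len(values[0]))]

q = [0.1, 0.2, 0.3]
keys = [[0.1 * i, 0.2, 0.05 * i] for i in range(8)]    # seq_len = 8
values = [[float(i), 1.0, -float(i)] for i in range(8)]
tiled = attention_tiled(q, keys, values, tile=4)
direct = attention_direct(q, keys, values)
assert all(abs(a - b) < 1e-9 for a, b in zip(tiled, direct))
```

Because only a tile of keys and values plus three small accumulators are live at any time, on-chip usage is independent of sequence length, which is what makes the scheme attractive for long sequences.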

Performance Considerations

Warning

Add performance analysis:

  • Overhead of tile boundary handling
  • Memory bandwidth utilization
  • Optimal tile sizes for different tensor shapes
  • Interaction with hardware prefetching