Introduction

FuriosaAI’s Tensor Contraction Processor (TCP) is a massively parallel AI accelerator targeting inference workloads. High-level frameworks such as PyTorch and XLA abstract away memory layouts and hardware scheduling, but give programmers no control over either. Low-level kernel APIs give fine-grained control, but require reasoning in bytes and hardware addresses rather than tensors. TCP’s Virtual Instruction Set Architecture (Virtual ISA) bridges this gap: it lets programmers think in terms of tensors while directly managing memory allocation and tensor unit scheduling. This manual explains TCP programming through the Virtual ISA.

The manual walks through concrete examples, targeting two audiences: programmers writing Virtual ISA directly and compiler developers generating it. Basic Rust familiarity is assumed; see the language manual if needed.

Warning

Alpha Test Build: Experimental Software

This software is an early, experimental, and incomplete build intended strictly for technical evaluation and internal testing.

Before using this software for any production work, critical tasks, or for important data, you must consult with Furiosa engineers.

Your feedback is vital to our development, so please share it with us.

Installation

Install two dependencies:

Your First Program

Create a new project:

cargo new --bin tcp-my-project
cd tcp-my-project
cargo add furiosa-visa-std rand
cargo add tokio --features macros,rt-multi-thread  # needed by #[tokio::main]

Add rust-toolchain.toml:

[toolchain]
channel = "nightly-2025-12-12"
components = ["rustfmt", "clippy"]

Write main.rs:

#![feature(register_tool)]
#![register_tool(tcp)]
extern crate furiosa_visa_std;
extern crate tokio;
extern crate rand;
use rand::SeedableRng;
use rand::rngs::SmallRng;
use furiosa_visa_std::prelude::*;  // provided by the Furiosa SDK

// Declare axis sizes
axes![A = 8, B = 512];

/// The main function, which runs on the host
#[tokio::main]
async fn main() {
    // Acquire exclusive access to the TCP device
    let mut ctx = Context::acquire();

    // TCP has three memory levels:
    // - Host: system memory
    // - HBM (High-Bandwidth Memory): device's main memory
    // - SRAM (on-chip scratchpad): the primary SRAM tier is called DM (Data Memory)
    //
    // Data flows: Host → HBM → DM → compute → DM → HBM → Host.
    //
    // Two DMA engines move data between these levels:
    // - `ctx.pdma` (PCIe DMA): transfers between Host and HBM
    // - `ctx.tdma` (Tensor DMA): transfers between HBM and DM

    // Create tensor on host
    // Tensors are parameterized by element type and mapping
    // The mapping `m![A, B]` specifies `A` as the major axis and `B` as the minor axis
    let mut rng = SmallRng::seed_from_u64(42);
    let host: HostTensor<i8, m![A, B]> = HostTensor::rand(&mut rng);

    // Transfer to device HBM using PCIe DMA engine
    // HBM tensor has two dimensions: m![A] for chip and m![B] for intra-chip address
    let hbm: HbmTensor<i8, m![A], m![B]> = host.to_hbm(&mut ctx.pdma, 0x1000).await;

    // Launch the kernel on the device.
    // The launch is asynchronous from the host's point of view, so the host can
    // continue while the kernel runs; the kernel itself occupies the device
    // synchronously until it completes.
    // Awaiting here makes the host wait for the kernel to finish.
    launch(kernel, (&mut ctx, &hbm)).await;
}

#[device(chip = 1)] // Running on a single chip
fn kernel(ctx: &mut Context, hbm: &HbmTensor<i8, m![A], m![B]>) {
    // Move to DM (Data Memory) in on-chip SRAM using Tensor DMA engine
    let dm = hbm.to_dm::<m![1], m![A], m![B]>(&mut ctx.tdma, 0);

    // ... perform computations ...
}
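
The comment on `m![A, B]` says that `A` is the major axis and `B` is the minor axis. As a rough illustration of what that ordering implies, the following standalone sketch computes linear element offsets for the same axis sizes. It assumes a dense, unpadded, row-major layout; the constants and the helper names `offset` and `index` are hypothetical and are not part of `furiosa-visa-std`. Real mappings may involve strides, padding, and tiling, as described in the Mapping Tensors chapter.

```rust
// Toy model of a two-axis mapping. This is NOT the furiosa-visa-std API;
// it only illustrates "A major, B minor" under a dense row-major assumption.
const A: usize = 8; // major axis size, mirroring `axes![A = 8, B = 512]`
const B: usize = 512; // minor axis size

/// Linear element offset of logical index (a, b): stepping `b` moves by one
/// element, stepping `a` moves by a full row of `B` elements.
fn offset(a: usize, b: usize) -> usize {
    assert!(a < A && b < B, "index out of bounds");
    a * B + b
}

/// Inverse of `offset`: recover the logical index (a, b) from a linear offset.
fn index(off: usize) -> (usize, usize) {
    assert!(off < A * B, "offset out of bounds");
    (off / B, off % B)
}

fn main() {
    println!("total elements = {}", A * B);
    println!("offset(1, 0)   = {}", offset(1, 0));
    println!("offset(7, 511) = {}", offset(7, 511));
    println!("index(513)     = {:?}", index(513));
}
```

Under this assumption, elements that differ only in `B` are adjacent, while consecutive values of `A` are 512 elements apart. With `i8` elements, the whole tensor occupies 4096 bytes.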

Build and Test

TCP supports two execution environments, ordered from fastest iteration to production use:

# 1. CPUs (standalone Rust)
cargo build  # Add --release for optimized builds, same below
cargo test

# 2. Real TCP devices
cargo furiosa-opt build
cargo furiosa-opt test

Development Tools

The TCP Software Toolchain (cargo furiosa-opt) provides utilities for developing, testing, and optimizing Virtual ISA programs on Furiosa chips. It complements the Furiosa SDK’s compiler by giving developers fine-grained control over program behavior, whether the programmer writes Virtual ISA by hand or a compiler generates it.

The toolchain consists of four components:

  • Compiler: Translates Virtual ISA into executable code for the chip.
  • Interpreter: Executes Virtual ISA as native Rust programs for software simulation and debugging.
  • Language Server: Enables IDE features (autocompletion, diagnostics, navigation) via Rust’s language server infrastructure.
  • Schedule Viewer: Visualizes the execution timeline to help identify performance bottlenecks.

Book Organization

The rest of this book is organized into the following chapters:

  • Hello, TCP!: How TCP programming works, introduced through worked examples covering element-wise operations and tensor contractions.
  • Mapping Tensors: How logical tensors map to physical memory: axis layout, stride, padding, and tiling.
  • Moving Tensors: How data moves between memory tiers (HBM, DM) and the Tensor Unit via Fetch, Commit, and DMA engines.
  • Computing Tensors: How the Tensor Unit pipeline (Switching, Collect, Contraction, Vector, Cast, Transpose) transforms data each cycle.
  • Scheduling: How to control the order and concurrency of operations across contexts.
  • Kernel Examples: End-to-end examples showing how mapping, movement, computation, and scheduling combine into real kernels.

License

This documentation and the entire furiosa-opt repository are licensed under the Apache License, Version 2.0.