Inter-Slice Block

The Inter-Slice Block performs inter-slice reduction, aggregating partial results across the 256 slices within a cluster. It preserves Chip, Cluster, and Packet, and rewrites Slice and Time to SliceOut and TimeOut (the Slice2 and Time2 type parameters in the signatures below).

Interface

i32 Interface

    #[primitive(VectorInitTensor::vector_inter_slice_reduce)]
    pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
        self,
        op: InterSliceReduceOpI32,
    ) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {

f32 Interface

    #[primitive(VectorInitTensor::vector_inter_slice_reduce)]
    pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
        self,
        op: InterSliceReduceOpF32,
    ) -> VectorInterSliceReduceTensor<'l, T, f32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {

You can reach this block in two ways:

  • Run inter-slice first: vector_init() -> vector_inter_slice_reduce::<SliceOut, TimeOut>(op)
  • Run intra-slice first, then switch: call vector_inter_slice_reduce() directly on the current intra-slice tensor instead of calling vector_init() again.

In the IntraFirst path, vector_inter_slice_reduce() is available only from Way8 intra-slice stages that can transition to inter-slice reduction: Branch, Logic, Fxp, FxpToFp, Widen, FpToFxp, and Clip. It is not available from Way4 stages such as Narrow, Fp, IntraSliceReduce, or FpDiv.
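The eligibility rule above can be sketched as a plain lookup. This is a minimal model in standalone Rust, not part of furiosa_visa_std: the IntraStage enum and the can_enter_inter_slice method are hypothetical names introduced only for illustration, and the Way4 list covers only the stages the text names as examples.

```rust
/// Hypothetical model of the intra-slice stages named in the text.
/// This enum is illustrative only; it is not a type from furiosa_visa_std.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntraStage {
    // Way8 stages listed as able to transition to inter-slice reduction.
    Branch,
    Logic,
    Fxp,
    FxpToFp,
    Widen,
    FpToFxp,
    Clip,
    // Way4 stages given as examples that cannot.
    Narrow,
    Fp,
    IntraSliceReduce,
    FpDiv,
}

impl IntraStage {
    /// Returns true when the text lists this stage as a valid starting
    /// point for vector_inter_slice_reduce() in the IntraFirst path.
    pub fn can_enter_inter_slice(self) -> bool {
        matches!(
            self,
            IntraStage::Branch
                | IntraStage::Logic
                | IntraStage::Fxp
                | IntraStage::FxpToFp
                | IntraStage::Widen
                | IntraStage::FpToFxp
                | IntraStage::Clip
        )
    }
}
```

For example, IntraStage::Fxp.can_enter_inter_slice() is true and IntraStage::FpDiv.can_enter_inter_slice() is false. In the real API the rule appears to be carried by which tensor types expose the method, so an invalid transition would presumably be rejected at compile time rather than at run time.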

Quick Reference

| Current state | Method | Result |
| --- | --- | --- |
| Fresh VE input after vector_init() | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Enters inter-slice reduction directly (InterFirst) |
| Eligible intra-slice tensor | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Transitions from intra-slice to inter-slice reduction (IntraFirst) |
| Tensor after vector_inter_slice_reduce() | vector_intra_slice_branch(BranchMode) | Switches to intra-slice work after inter-slice reduction |

Operations

Integer Operations (InterSliceReduceOpI32)

| Operation | Description |
| --- | --- |
| Add | Wrapping addition |
| AddSat | Saturating addition |
| Max | Maximum value |
| Min | Minimum value |
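The difference between Add and AddSat only shows up at the i32 range limits. The sketch below models the four integer operations on the host with the standard library; ReduceOpI32 and reduce_i32 are hypothetical stand-ins, not the furiosa_visa_std types, and the left-to-right fold order is an assumption rather than a statement about how the hardware combines slices.

```rust
/// Hypothetical host-side stand-in for InterSliceReduceOpI32.
#[derive(Debug, Clone, Copy)]
pub enum ReduceOpI32 {
    Add,
    AddSat,
    Max,
    Min,
}

/// Folds one partial value per slice into a single result, left to right.
/// Returns None when the reduction group is empty.
pub fn reduce_i32(op: ReduceOpI32, partials: &[i32]) -> Option<i32> {
    partials.iter().copied().reduce(|acc, x| match op {
        // Wrapping addition: overflow wraps around modulo 2^32.
        ReduceOpI32::Add => acc.wrapping_add(x),
        // Saturating addition: overflow clamps to i32::MIN or i32::MAX.
        ReduceOpI32::AddSat => acc.saturating_add(x),
        ReduceOpI32::Max => acc.max(x),
        ReduceOpI32::Min => acc.min(x),
    })
}
```

With partials [i32::MAX, 1], Add wraps to i32::MIN while AddSat stays at i32::MAX. Saturating addition is not associative, so once a partial sum clamps, the final value can depend on the order in which slices are combined.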

Floating-Point Operations (InterSliceReduceOpF32)

| Operation | Description |
| --- | --- |
| Add | Floating-point addition |
| Max | Maximum value |
| Min | Minimum value |
| Mul | Floating-point multiplication |
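Floating-point reductions round at each step, so the result of Add or Mul can depend on the order in which partial values are combined. The host-side sketch below illustrates this with plain f32 arithmetic; ReduceOpF32 and reduce_f32 are hypothetical names, the fold order is an assumption, and the NaN behavior of Max and Min here is that of Rust's f32::max and f32::min, which may differ from the hardware.

```rust
/// Hypothetical host-side stand-in for InterSliceReduceOpF32.
#[derive(Debug, Clone, Copy)]
pub enum ReduceOpF32 {
    Add,
    Max,
    Min,
    Mul,
}

/// Folds one partial value per slice into a single result, left to right.
/// Returns None when the reduction group is empty.
pub fn reduce_f32(op: ReduceOpF32, partials: &[f32]) -> Option<f32> {
    partials.iter().copied().reduce(|acc, x| match op {
        ReduceOpF32::Add => acc + x,
        // f32::max and f32::min return the non-NaN operand if one is NaN.
        ReduceOpF32::Max => acc.max(x),
        ReduceOpF32::Min => acc.min(x),
        ReduceOpF32::Mul => acc * x,
    })
}
```

Reducing [1.0e8, 1.0, -1.0e8] with Add gives 0.0 in this model, because 1.0 is lost when added to 1.0e8 in f32, while [1.0e8, -1.0e8, 1.0] gives 1.0. Comparisons against a reference implementation should therefore use a tolerance rather than exact equality.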

Output Mapping Rule

After inter-slice reduction removes a slice factor R, the output mapping typically follows one of three rules:

| Rule | Output mapping | Reference |
| --- | --- | --- |
| Broadcast | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, X], TimeOut = m![C] | Broadcast Into a New Slice Axis |
| Dummy | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, 1 # n], TimeOut = m![C] | Dummy Replacement |
| Promotion | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, C], TimeOut = m![1] | Promotion from Time into SliceOut |

Chip, Cluster, and Packet pass through unchanged. Only Slice and Time are rewritten into SliceOut and TimeOut.
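To make the index bookkeeping concrete, the following sketch models the Broadcast and Promotion rules on nested vectors in plain Rust. It is an illustration under stated assumptions, not the library's implementation: the functions reduce_r, broadcast, and promote are hypothetical, the input is assumed rectangular and indexed as [a][r][c] for Slice = m![A, R] and Time = m![C], and the reduction is a wrapping integer sum. The Dummy rule is left out because the contents of the padded slots are not specified here.

```rust
/// Input indexed as [a][r][c]: Slice = m![A, R], Time = m![C].
/// Assumes every [a] entry holds the same number of rows and columns.
pub type Input = Vec<Vec<Vec<i32>>>;

/// Sums over the r index, producing values indexed as [a][c].
pub fn reduce_r(input: &Input) -> Vec<Vec<i32>> {
    input
        .iter()
        .map(|per_a| {
            let c_len = per_a.first().map_or(0, |row| row.len());
            (0..c_len)
                .map(|c| {
                    per_a
                        .iter()
                        .map(|row| row[c])
                        .fold(0i32, |acc, v| acc.wrapping_add(v))
                })
                .collect::<Vec<i32>>()
        })
        .collect::<Vec<Vec<i32>>>()
}

/// Broadcast rule: SliceOut = m![A, X], TimeOut = m![C].
/// Output is indexed as [a][x][c]; every x position holds the same reduced row.
pub fn broadcast(input: &Input, x_len: usize) -> Vec<Vec<Vec<i32>>> {
    reduce_r(input)
        .into_iter()
        .map(|row| vec![row; x_len])
        .collect::<Vec<Vec<Vec<i32>>>>()
}

/// Promotion rule: SliceOut = m![A, C], TimeOut = m![1].
/// Output is indexed as [a][c][t] with a single time step t = 0.
pub fn promote(input: &Input) -> Vec<Vec<Vec<i32>>> {
    reduce_r(input)
        .into_iter()
        .map(|row| row.into_iter().map(|v| vec![v]).collect::<Vec<Vec<i32>>>())
        .collect::<Vec<Vec<Vec<i32>>>>()
}
```

Both rules produce the same reduced values; they differ only in where those values land. Broadcast repeats each reduced row across the new X positions, while Promotion moves the C index out of Time and into the slice dimension, leaving a single time step.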

Examples

Dummy Replacement

Replace the reduced slice factor with a dummy factor in SliceOut:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];

fn inter_slice_add<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    // Reduce over R with saturating addition; R is replaced by the dummy factor 1 # 4.
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_final()
}

R (size 4) occupies part of the Slice dimension. After reduction, R is eliminated and the four slice slots it occupied are filled by the dummy factor 1 # 4, so each of the A / 8 positions still spans four slots in SliceOut.

Broadcast Into a New Slice Axis

Introduce a new axis in SliceOut, and broadcast the reduced value over that axis:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![W = 64, R = 4, X = 4, P = 8];

fn broadcast_into_x<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, X], m![1], m![P]> {
    // Reduce over R with floating-point addition; X replaces R in SliceOut.
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![W, X], m![1]>(InterSliceReduceOpF32::Add)
        .vector_final()
}

Here, R is reduced away. X is a new axis that appears only in SliceOut, so the reduced value is broadcast over the X positions in the output.

Promotion from Time into SliceOut

If Time already contains an axis that should occupy the freed slice space, promote that axis into SliceOut. The promoted axis does not have to be the outermost axis in Time:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![W = 32, R = 4, T0 = 2, T2 = 4, T1 = 2, P = 8];

fn axis_promotion<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![T0, T2, T1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, T2], m![T0, T1], m![P]> {
    // Before: Slice = m![W, R], Time = m![T0, T2, T1], Packet = m![P]
    // After:  Slice = m![W, T2], Time = m![T0, T1], Packet = m![P]
    // R is reduced away, and T2 is promoted from the middle of Time into Slice.
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![W, T2], m![T0, T1]>(InterSliceReduceOpF32::Add)
        .vector_final()
}

Inter-Slice Reduce with AddSat, Then Intra-Slice

Reducing an i32 tensor across slices, then applying an elementwise add:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];

fn reduce_then_add<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    // InterFirst path: reduce across slices, then switch to intra-slice work.
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_fxp(FxpBinaryOp::AddFxp, 100)
        .vector_final()
}

Intra-Slice Then Inter-Slice Reduce with AddSat

Applying an intra-slice operation first, then reducing the resulting i32 tensor across slices:

#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;
axes![R = 4, A = 512];

fn add_then_reduce<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    // IntraFirst path: run the intra-slice Fxp stage, then reduce across slices.
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_fxp(FxpBinaryOp::AddFxp, 100)
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_final()
}

Constraints

| Constraint | Detail |
| --- | --- |
| Data types | i32 and f32 only |
| Scope | Reduction happens within one 256-slice cluster |
| Packet mapping | Packet does not change across inter-slice reduction |

Performance

Inter-slice reduce is best understood as a ring-like global reduction across the participating slices. A useful high-level estimate is:

| Quantity | Rough rule of thumb |
| --- | --- |
| First reduced output | On the order of one ring traversal for the reduction group |
| Total time | Input streaming time + that ring-sized tail |
| Main tuning knob | Reduction ratio, that is, how many slices participate in one inter-slice contraction group |

If you want a quick mental model, let r be the reduction ratio or route-group size:

  • first output appears after roughly O(r) cycles
  • larger r means more noticeable inter-slice tail latency
  • if upstream already produces flits slowly, that upstream rate dominates and the inter-slice cost is partly hidden

This is intentionally a high-level approximation. The practical mental model is simple: stream partial results in, then pay about one ring traversal before the reduced result settles.
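The rule of thumb above can be written down as a small back-of-the-envelope calculator. This is a sketch only: estimate_total_cycles and tail_fraction are hypothetical helpers, and the cost of one cycle per participating slice is an assumed constant chosen to match the O(r) description, not a measured hardware figure.

```rust
/// Rough total time: input streaming time plus one ring traversal.
/// Assumes the ring tail costs about `r` cycles (one per participating slice).
pub fn estimate_total_cycles(input_stream_cycles: u64, r: u64) -> u64 {
    input_stream_cycles.saturating_add(r)
}

/// Fraction of the estimated total that is spent in the ring tail.
/// Returns 0.0 when there is no work at all.
pub fn tail_fraction(input_stream_cycles: u64, r: u64) -> f64 {
    let total = estimate_total_cycles(input_stream_cycles, r);
    if total == 0 {
        0.0
    } else {
        r as f64 / total as f64
    }
}
```

Under these assumptions, a long input stream of 1024 cycles with r = 4 gives an estimate of 1028 cycles, so the tail is negligible. A short stream of 16 cycles with r = 16 gives 32 cycles with half of the time in the tail, which is the small-tensor effect in miniature.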

Interaction With Other Pipelines

  • Contraction -> Inter-Slice: if contraction takes longer to produce partial sums, contraction can dominate and inter-slice may not be the bottleneck.
  • Intra-Slice -> Inter-Slice: intra-slice work can reduce the number of packets that reach inter-slice, or simply take longer itself. In those cases, inter-slice is less visible because there is less data to reduce, or because the front half already dominates.
  • Large ring / large reduction ratio: when many slices participate, inter-slice tail latency grows and can become the bottleneck.
  • Small tensors: even when total data volume is small, the fixed ring-style tail can still matter because it is amortized over fewer packets.

For an end-to-end contraction example that includes inter-slice reduction, see Reducer.