Inter-Slice Block
The Inter-Slice Block performs inter-slice reduction, aggregating partial results across the 256 slices within a cluster.
It preserves Chip, Cluster, and Packet, and rewrites Slice and Time to SliceOut and TimeOut.
Interface
i32 Interface
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpI32,
) -> VectorInterSliceReduceTensor<'l, T, i32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
f32 Interface
#[primitive(VectorInitTensor::vector_inter_slice_reduce)]
pub fn vector_inter_slice_reduce<Slice2: M, Time2: M>(
self,
op: InterSliceReduceOpF32,
) -> VectorInterSliceReduceTensor<'l, T, f32, Chip, Cluster, Slice2, Time2, Packet, { VeOrder::InterFirst }> {
You can reach this block in two ways:
- Run inter-slice first: vector_init() -> vector_inter_slice_reduce::<SliceOut, TimeOut>(op)
- Run intra-slice first, then switch: call vector_inter_slice_reduce() directly on the current intra-slice tensor instead of calling vector_init() again.
In the IntraFirst path, vector_inter_slice_reduce() is available only from Way8 intra-slice stages that can transition to inter-slice reduction: Branch, Logic, Fxp, FxpToFp, Widen, FpToFxp, and Clip.
It is not available from Way4 stages such as Narrow, Fp, IntraSliceReduce, or FpDiv.
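The eligibility rule above can be written down as a small lookup. This is only a sketch: `IntraStage` and `can_enter_inter_slice` are hypothetical names invented for illustration, not items from `furiosa_visa_std`, where the rule is presumably expressed by the method simply not being available on ineligible tensor types. The Way4 variants are the ones named in the text, which may not be an exhaustive list.

```rust
/// Hypothetical stage labels, for illustration only (not library types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntraStage {
    // Way8 stages named in the text above.
    Branch,
    Logic,
    Fxp,
    FxpToFp,
    Widen,
    FpToFxp,
    Clip,
    // Way4 stages named in the text above.
    Narrow,
    Fp,
    IntraSliceReduce,
    FpDiv,
}

/// Returns true if the text lists this stage as able to transition
/// to inter-slice reduction on the IntraFirst path.
pub fn can_enter_inter_slice(stage: IntraStage) -> bool {
    use IntraStage::*;
    matches!(stage, Branch | Logic | Fxp | FxpToFp | Widen | FpToFxp | Clip)
}
```

For example, `can_enter_inter_slice(IntraStage::Fxp)` is true, while `can_enter_inter_slice(IntraStage::FpDiv)` is false.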
Quick Reference
| Current state | Method | Result |
|---|---|---|
| Fresh VE input after vector_init() | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Enters inter-slice reduction directly (InterFirst) |
| Eligible intra-slice tensor | vector_inter_slice_reduce::<SliceOut, TimeOut>(op) | Transitions from intra-slice to inter-slice reduction (IntraFirst) |
| Tensor after vector_inter_slice_reduce() | vector_intra_slice_branch(BranchMode) | Switches to intra-slice work after inter-slice reduction |
Operations
Integer Operations (InterSliceReduceOpI32)
| Operation | Description |
|---|---|
| Add | Wrapping addition |
| AddSat | Saturating addition |
| Max | Maximum value |
| Min | Minimum value |
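The difference between Add and AddSat only shows up when the running sum leaves the i32 range. The following host-side sketch models both with the standard library's `wrapping_add` and `saturating_add`. The function names are hypothetical, and the left-to-right accumulation order is an assumption made for the model, since the order used by the hardware is not stated here.

```rust
/// Models `Add`: the sum wraps around on overflow.
/// Assumption: partials are accumulated left to right.
pub fn reduce_add_wrapping(partials: &[i32]) -> i32 {
    partials.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

/// Models `AddSat`: the sum clamps at i32::MIN / i32::MAX on overflow.
/// Assumption: partials are accumulated left to right.
pub fn reduce_add_saturating(partials: &[i32]) -> i32 {
    partials.iter().fold(0i32, |acc, &x| acc.saturating_add(x))
}
```

With the partials `[i32::MAX, 1, 0, 0]`, the wrapping version returns `i32::MIN` and the saturating version returns `i32::MAX`. For inputs that never overflow, both return the same sum.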
Floating-Point Operations (InterSliceReduceOpF32)
| Operation | Description |
|---|---|
| Add | Floating-point addition |
| Max | Maximum value |
| Min | Minimum value |
| Mul | Floating-point multiplication |
Output Mapping Rule
After inter-slice reduction removes a slice factor R, the output mapping typically follows one of three rules:
| Rule | Output mapping | Reference |
|---|---|---|
| Broadcast | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, X], TimeOut = m![C] | Broadcast Into a New Slice Axis |
| Dummy | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, 1 # n], TimeOut = m![C] | Dummy Replacement |
| Promotion | Slice = m![A, R], Time = m![C] -> SliceOut = m![A, C], TimeOut = m![1] | Promotion from Time into SliceOut |
Chip, Cluster, and Packet pass through unchanged.
Only Slice and Time are rewritten into SliceOut and TimeOut.
Examples
Dummy Replacement
Replace the reduced slice factor with a dummy factor in SliceOut:
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![R = 4, A = 512];

fn inter_slice_add<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_final()
}
R occupies the inner part of the Slice dimension. After reduction, R is eliminated, and the four slots it occupied under each A / 8 position are filled by the dummy factor 1 # 4.
Broadcast Into a New Slice Axis
Introduce a new axis in SliceOut, and broadcast the reduced value over that axis:
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![W = 64, R = 4, X = 4, P = 8];

fn broadcast_into_x<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, X], m![1], m![P]> {
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![W, X], m![1]>(InterSliceReduceOpF32::Add)
        .vector_final()
}
Here, R is reduced away. X is a new axis that appears only in SliceOut,
so the reduced value is broadcast over the X positions in the output.
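To make the data movement concrete, here is a host-side model of reduce-and-broadcast in plain Rust. It does not use the `furiosa_visa_std` API. The function name and the flat `[w][r]` input layout are assumptions chosen for illustration, with one value per slice and Time and Packet held fixed.

```rust
/// Host-side model of the Broadcast rule with an Add reduction.
/// `input[w * r_size + r]` is the partial value held by slice (w, r).
/// The result is laid out as `out[w * x_size + x]`; every x position
/// under the same w holds the sum over r.
pub fn reduce_and_broadcast(
    input: &[f32],
    w_size: usize,
    r_size: usize,
    x_size: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), w_size * r_size, "input must have W * R elements");
    let mut out = Vec::with_capacity(w_size * x_size);
    for w in 0..w_size {
        // Reduce the R partials that share this w.
        let sum: f32 = input[w * r_size..(w + 1) * r_size].iter().sum();
        // Broadcast the reduced value over the new X axis.
        for _ in 0..x_size {
            out.push(sum);
        }
    }
    out
}
```

With W = 2, R = 4, X = 4 and the input `[1, 2, 3, 4, 10, 20, 30, 40]`, the model returns four copies of 10 followed by four copies of 100.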
Promotion from Time into SliceOut
If Time already contains an axis that should occupy the freed slice space, promote that axis into SliceOut.
The promoted axis does not have to be the outermost axis in Time:
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![W = 32, R = 4, T0 = 2, T2 = 4, T1 = 2, P = 8];

fn axis_promotion<'l, const T: Tu>(
    input: CollectTensor<'l, T, f32, m![1], m![1 # 2], m![W, R], m![T0, T2, T1], m![P]>,
) -> VectorFinalTensor<'l, T, f32, m![1], m![1 # 2], m![W, T2], m![T0, T1], m![P]> {
    // Before: Slice = m![W, R], Time = m![T0, T2, T1], Packet = m![P]
    // After: Slice = m![W, T2], Time = m![T0, T1], Packet = m![P]
    // R is reduced away, and T2 is promoted from the middle of Time into Slice.
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![W, T2], m![T0, T1]>(InterSliceReduceOpF32::Add)
        .vector_final()
}
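The index arithmetic behind promotion can be modeled on the host. The sketch below is plain Rust, not the `furiosa_visa_std` API. `Dims`, `reduce_and_promote`, and the flat row-major layouts are assumptions for illustration. It reads the mapping as: each output element at (w, t2, t0, t1) is the sum over r of the input elements at (w, r, t0, t2, t1), with Packet held fixed.

```rust
/// Axis sizes for the model (hypothetical helper type).
pub struct Dims {
    pub w: usize,
    pub r: usize,
    pub t0: usize,
    pub t2: usize,
    pub t1: usize,
}

/// Host-side model of Promotion with an Add reduction.
/// Input is row-major over [W, R, T0, T2, T1] (Slice = [W, R], Time = [T0, T2, T1]).
/// Output is row-major over [W, T2, T0, T1] (Slice = [W, T2], Time = [T0, T1]).
pub fn reduce_and_promote(input: &[f32], d: &Dims) -> Vec<f32> {
    assert_eq!(input.len(), d.w * d.r * d.t0 * d.t2 * d.t1);
    let mut out = vec![0.0f32; d.w * d.t2 * d.t0 * d.t1];
    for w in 0..d.w {
        for r in 0..d.r {
            for t0 in 0..d.t0 {
                for t2 in 0..d.t2 {
                    for t1 in 0..d.t1 {
                        let src = (((w * d.r + r) * d.t0 + t0) * d.t2 + t2) * d.t1 + t1;
                        // T2 moves next to W; R disappears because it is summed away.
                        let dst = ((w * d.t2 + t2) * d.t0 + t0) * d.t1 + t1;
                        out[dst] += input[src];
                    }
                }
            }
        }
    }
    out
}
```

With W = 1, R = 2, T0 = 2, T2 = 2, T1 = 1 and input values 0 through 7, the model returns `[4, 8, 6, 10]`: the two r planes are summed, and the T0 and T2 axes swap places in the output order.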
Inter-Slice Reduce with AddSat, Then Intra-Slice
Reduce an i32 tensor across slices, then apply an elementwise add:
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![R = 4, A = 512];

fn reduce_then_add<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    input
        .vector_init()
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_fxp(FxpBinaryOp::AddFxp, 100)
        .vector_final()
}
Intra-Slice Then Inter-Slice Reduce with AddSat
Apply an intra-slice operation first, then reduce the resulting i32 tensor across slices:
#![feature(adt_const_params)]
extern crate furiosa_visa_std;
use furiosa_visa_std::prelude::*;

axes![R = 4, A = 512];

fn add_then_reduce<'l, const T: Tu>(
    input: CollectTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, R], m![1], m![A % 8]>,
) -> VectorFinalTensor<'l, T, i32, m![1], m![1 # 2], m![A / 8, 1 # 4], m![1], m![A % 8]> {
    input
        .vector_init()
        .vector_intra_slice_branch(BranchMode::Unconditional)
        .vector_fxp(FxpBinaryOp::AddFxp, 100)
        .vector_inter_slice_reduce::<m![A / 8, 1 # 4], m![1]>(InterSliceReduceOpI32::AddSat)
        .vector_final()
}
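The two orderings are not interchangeable: adding a constant before the reduction applies it once per participating slice, while adding it afterwards applies it once in total. The host-side sketch below shows this. The function names reuse the example names only as labels, the elementwise add is modeled with `wrapping_add`, and the left-to-right accumulation order is an assumption. The overflow behavior of the real AddFxp operation is not specified in this section.

```rust
/// Models "inter-slice AddSat, then elementwise add of `c`".
pub fn reduce_then_add(partials: &[i32], c: i32) -> i32 {
    let reduced = partials.iter().fold(0i32, |acc, &x| acc.saturating_add(x));
    // Assumption: the elementwise add is modeled as a wrapping add.
    reduced.wrapping_add(c)
}

/// Models "elementwise add of `c`, then inter-slice AddSat".
pub fn add_then_reduce(partials: &[i32], c: i32) -> i32 {
    partials
        .iter()
        .map(|&x| x.wrapping_add(c)) // `c` is added once per participating slice
        .fold(0i32, |acc, x| acc.saturating_add(x))
}
```

For the four partials `[1, 2, 3, 4]` and `c = 100`, `reduce_then_add` gives 110 and `add_then_reduce` gives 410, a difference of `c * (R - 1)` with R = 4.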
Constraints
| Constraint | Detail |
|---|---|
| Data types | i32 and f32 only |
| Scope | Reduction happens within one 256-slice cluster |
| Packet mapping | Packet does not change across inter-slice reduction |
Performance
Inter-slice reduce is best understood as a ring-like global reduction across the participating slices. For documentation purposes, the most useful high-level estimate is:
| Quantity | Rough rule of thumb |
|---|---|
| First reduced output | on the order of one ring traversal for the reduction group |
| Total time | input streaming time + that ring-sized tail |
| Main tuning knob | reduction ratio, that is, how many slices participate in one inter-slice contraction group |
If you want a quick mental model, let r be the reduction ratio or route-group size:
- first output appears after roughly O(r) cycles
- larger r means more noticeable inter-slice tail latency
- if upstream already produces flits slowly, that upstream rate dominates and the inter-slice cost is partly hidden
This is intentionally a high-level approximation. The practical mental model is simple: stream partial results in, then pay about one ring traversal before the reduced result settles.
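The rule of thumb can be turned into a back-of-the-envelope calculator. This is a sketch under a loud assumption: the ring traversal is taken as exactly r cycles (a constant factor of 1), which the text does not state. It only says the tail is on the order of r. The function names are hypothetical, and the results are for comparing configurations, not for predicting measured cycle counts.

```rust
/// Rough total: input streaming time plus one ring-sized tail.
/// Assumption: the tail is exactly `r` cycles (unit constant factor).
pub fn estimated_total_cycles(input_stream_cycles: u64, r: u64) -> u64 {
    input_stream_cycles + r
}

/// Fraction of the estimated total that is spent in the ring tail.
/// Returns NaN when both arguments are zero.
pub fn tail_share(input_stream_cycles: u64, r: u64) -> f64 {
    r as f64 / (input_stream_cycles + r) as f64
}
```

Under this model, a 1024-cycle input stream with r = 4 spends well under 1% of its time in the tail, while a 16-cycle stream with r = 16 spends half of it there. That is why the tail is more visible for short streams and large reduction groups.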
Interaction With Other Pipelines
- Contraction -> Inter-Slice: if contraction takes longer to produce partial sums, contraction can dominate and inter-slice may not be the bottleneck.
- Intra-Slice -> Inter-Slice: intra-slice work can reduce the number of packets that reach inter-slice, or simply take longer itself. In those cases, inter-slice is less visible because there is less data to reduce, or because the front half already dominates.
- Large ring / large reduction ratio: when many slices participate, inter-slice tail latency grows and can become the bottleneck.
- Small tensors: even when total data volume is small, the fixed ring-style tail can still matter because it is amortized over fewer packets.
For an end-to-end contraction example that includes inter-slice reduction, see Reducer.