Valid Count Generator’s Implementation
This document describes what the Valid Count Generator (VCG) hardware can express, independent of mapping expressions or tensor shapes. VCG tags are consumed by the Intra-Slice Reduce stage to exclude padding from reductions. For how mapping expressions control VCG behavior (supported placements, constraints, and examples), see Valid Count Generator’s Interface.
Data Model
Data flows into the VectorEngine as a stream of flits (packets). Each flit contains 8 elements. The VCG operates at the VectorEngine’s input, tagging each 8-way flit with a valid count. The 4-way halving and its valid count derivation are described in Downstream: 4-Way Operations.
A flit is identified by two coordinates. A slice corresponds to the Slice dimension in the mapping; a time step indexes sequential flits within a slice.
| Coordinate | Range | Meaning |
|---|---|---|
s (slice number) | [0, num_slices) | Which slice processes this flit |
t (time step) | [0, num_flits) | Sequential position within a slice |
The VCG assigns a valid count (abbreviated vc in formulas and diagrams) to each flit:
$$\text{vc}(s, t) \in {0, 1, \ldots, 8}$$
Element p (where p is in [0, 8)) within flit (s, t) is valid if and only if p < vc(s, t).
Valid elements always form a contiguous prefix; this is a fundamental hardware constraint.
The VCG cannot express “elements 0, 1, 3 are valid but 2 is not.”
Valid Count Formula
The VCG computes vc(s, t) through a pipeline of stages:
$$t \overset{\text{Sequencer}}{\longrightarrow} (c_0, c_1, \ldots, c_{k-1}) \overset{\text{Original Dims}}{\longrightarrow} \text{idx}(t) \overset{\text{Validity}}{\longrightarrow} \text{vc}(s, t)$$
- Sequencer: The flat time index
tis decomposed into counter values \((c_0, c_1, \ldots, c_{k-1})\) via mixed-radix decomposition. - Original Dimensions: Each counter is assigned to one of 4 dimensions (packet dim or gate dim 0-2). Per-dimension indices are computed as \(\text{idx}(t) = \sum_{i} c_i \cdot \sigma _ i\), where \(\sigma_i\) is the counter’s stride.
- Validity Decision:
- Packet Dim: produces a packet-level valid count \(\text{packet_vc}(t) = \min(\text{stride}_p,; \max(0,; V_p - \text{idx}_p(t)))\).
- Gate Dims: each produces a binary gate \(\text{gate}_d(s, t) \in {0, 1}\) based on slice classification (below/boundary/above a threshold) and the per-dim index.
The final valid count combines these components:
$$\text{vc}(s, t) = \text{packet_vc}(t) \times \text{gate}_0(s, t) \times \text{gate}_1(s, t) \times \text{gate}_2(s, t)$$
vc(s,t) = packet_vc(t) × gate_0(s,t) × gate_1(s,t) × gate_2(s,t)
─────────── ─────────── ─────────── ───────────
packet dim gate dim 0 gate dim 1 gate dim 2
(count 0-8) (gate 0/1) (gate 0/1) (gate 0/1)
- If all gates are open (= 1): the flit gets
packet_vc(t)valid elements. - If any gate is closed (= 0):
vc = 0(entire flit is invalid, regardless of packet dim’s count).
VCG Configuration
Configuration is organized around two concepts: counters (which drive the sequencer) and original dimensions (which decide validity). Counters produce a flit sequence; each dim uses its assigned counters to compute an index and decide validity.
The VCG is configured via the following parameters (each is explained in detail in subsequent sections):
| Field | Scope | Description |
|---|---|---|
| Counter limits \(L_0 \ldots L_7\) | per counter (up to 8) | Sequencer counter limits |
| Original dim assignment | per counter | Which dim (packet / gate 0-2) each counter belongs to |
| stride \(\sigma_i\) | per counter | stride for index computation |
| \(\text{mask}_{gd}\) | per gate dim | Slice-id bitmask |
| \(\text{match}_{gd}\) | per gate dim | Threshold for slice classification |
| \(V_p\) / \(V_{gd}\) | packet dim / per gate dim | Valid count / threshold |
| \(P_{gd}\) | per gate dim | Standard (0) vs transposed (1) |
Unassigned counters and disabled gate dims (\(\text{mask}_{gd} = 0, \text{match}_{gd} = 1\)) effectively pass through as “all valid.”
Sequencer
The sequencer interprets the flat time index t as a multi-dimensional counter.
Counter Structure
Up to 8 nested counters iterate to produce the flit sequence:
$$t \to (c_0, c_1, \ldots, c_{k-1})$$
where \(c_0\) is the fastest (innermost) and \(c_{k-1}\) is the slowest (outermost).
Each counter \(c_i\) has a limit \(L_i\), cycling through \(0, 1, \ldots, L_i - 1\), and a stride \(\sigma_i\) that scales the counter’s contribution to the dimension index (see Original Dimensions). The total number of flits per slice is \(L_0 \times L_1 \times \cdots \times L_{k-1}\).
Example: 3 counters with limits [3, 2, 2]
This produces 3 * 2 * 2 = 12 flits per slice. The counters cycle as:
t=0: (c_0=0, c_1=0, c_2=0)
t=1: (c_0=1, c_1=0, c_2=0)
t=2: (c_0=2, c_1=0, c_2=0)
t=3: (c_0=0, c_1=1, c_2=0) <- c_0 wraps, c_1 increments
t=4: (c_0=1, c_1=1, c_2=0)
t=5: (c_0=2, c_1=1, c_2=0)
t=6: (c_0=0, c_1=0, c_2=1) <- c_1 wraps, c_2 increments
t=7: (c_0=1, c_1=0, c_2=1)
t=8: (c_0=2, c_1=0, c_2=1)
t=9: (c_0=0, c_1=1, c_2=1)
t=10: (c_0=1, c_1=1, c_2=1)
t=11: (c_0=2, c_1=1, c_2=1)
c_0 changes every flit, c_1 every 3 flits, c_2 every 6 flits, just like digits in a mixed-radix number.
The sequencer produces counter values; the next step is mapping them to original dimensions and then to the validity decision.
Original Dimensions
Each counter is assigned to one of 4 original dimensions (packet dim or gate dim 0-2), or left unassigned.
Let \(D_d\) be the set of counters assigned to original dimension d.
Each counter contributes to its assigned dimension’s index by multiplying its current value by its stride.
The sum of all contributions gives the current position within that dimension’s data:
$$\text{idx}_d(t) = \sum _ {i \in D_d} c_i(t) \cdot \sigma_i$$
This index tracks the position within that dimension’s original data range. Multiple counters can be assigned to the same dim; their contributions are simply summed.
Example: Counters mapped to original dimensions
Suppose 3 counters are configured as follows:
| Counter | Limit | stride | Assigned to |
|---|---|---|---|
| c_0 | 3 | 8 | packet dim (W axis) |
| c_1 | 2 | 1 | gate dim 0 (C axis) |
| c_2 | 2 | 1 | gate dim 1 (H axis) |
At time step t=4, which gives (c_0=1, c_1=1, c_2=0):
idx_p = 1 * 8 = 8, position 8 along Widx_g0 = 1 * 1 = 1, position 1 along Cidx_g1 = 0 * 1 = 0, position 0 along H
Each dim uses its index independently to decide validity.
Validity Decision
The following diagram shows the complete pipeline from time index to final valid count. Each stage is explained in the subsections below.
t (flat time index)
│
├─ mixed-radix decomposition (see Sequencer above)
▼
(c_0, c_1, ..., c_{k-1}) ← counter values
│
├─ each counter assigned to a dim, multiplied by stride σ_i
│ (see Original Dimensions above)
▼
idx_p(t), idx_g0(t), idx_g1(t), idx_g2(t) ← per-dim indices
│
├─ packet dim: packet_vc = min(stride_p, max(0, V_p - idx_p))
├─ gate dim 0: gate_0 = f(masked_id(s), idx_g0, match_g0, V_g0)
├─ gate dim 1: gate_1 = f(masked_id(s), idx_g1, match_g1, V_g1)
├─ gate dim 2: gate_2 = f(masked_id(s), idx_g2, match_g2, V_g2)
│
▼
vc(s,t) = packet_vc(t) × gate_0(s,t) × gate_1(s,t) × gate_2(s,t)
Packet dim and gate dims make qualitatively different judgments:
- Packet dim answers: “how many elements in this flit are valid?” (a count, 0-8)
- Gate dims each answer: “is this flit valid at all?” (a binary gate, yes or no)
Gate dims act as gates: only when all three report “valid” does packet dim’s count take effect. If any gate reports “invalid”, the entire flit gets valid count = 0.
Packet Dim: Packet-Level Valid Count
Packet dim determines how many elements within a flit are valid. Two parameters control the computation:
- \(V_p\): the original valid count for packet dim (the unpadded size of the data along this dimension).
- \(\text{stride}_p\): the stride of the innermost counter assigned to packet dim, representing how many flit elements belong to the axis tracked by packet dim.
The per-packet valid count is:
$$\text{packet_vc}(t) = \min(\text{stride}_p, \max(0, V_p - \text{idx}_p(t)))$$
packet_vc(t) = min( stride_p, max(0, V_p - idx_p(t) ))
───────── ─────────────────
HW width cap remaining valid data
When the axis fills all 8 flit positions, \(\text{stride}_p = 8\) and the formula is equivalent to \(\min(8, \ldots)\). When the axis occupies only \(k < 8\) positions (with the remaining positions padded), \(\text{stride}_p = k\) caps the valid count so that only the axis’s portion of the flit is counted as valid.
Hardware constraints:
- The innermost Packet counter must always be assigned to packet dim.
- Other counters may also be assigned to packet dim (e.g., a Time counter for the same axis).
- When no axis is assigned to packet dim,
packet_vcis always 8 (full flit) or 0 (empty flit), effectively making packet dim a binary gate like gate dims.
As the sequencer advances, \(\text{idx}_p\) increases and packet_vc decreases.
This produces a repeating sawtooth pattern:
Example 1: V_p = 19, stride_p = 8, counter stride=8, limit=3 (axis fills full 8-way)
flit 0: idx_p = 0 -> packet_vc = min(8, 19 - 0) = 8 (full)
flit 1: idx_p = 8 -> packet_vc = min(8, 19 - 8) = 8 (full)
flit 2: idx_p = 16 -> packet_vc = min(8, 19 - 16) = 3 (partial)
Example 2: V_p = 11, stride_p = 4, counter stride=4, limit=3 (axis fills 4 of 8 positions)
flit 0: idx_p = 0 -> packet_vc = min(4, 11 - 0) = 4 (full within stride)
flit 1: idx_p = 4 -> packet_vc = min(4, 11 - 4) = 4 (full within stride)
flit 2: idx_p = 8 -> packet_vc = min(4, 11 - 8) = 3 (partial)
In Example 2, positions 4-7 in each flit are padding and automatically excluded by the \(\text{stride}_p = 4\) cap.
Key property: packet_vc depends only on the sequencer state t, not on the slice s.
All slices receive the same packet valid count for the same time step.
Example: Why packet_vc is slice-independent
If \(V_p = 19\), \(\text{stride}_p = 8\), and the packet counter cycles [0, 8, 16], then:
- At t=2 (idx_p=16): packet_vc = 3 for every slice.
- Slice 0 gets vc=3, slice 5 gets vc=3, slice 15 gets vc=3, all the same.
This is because packet dim’s formula \(\min(\text{stride}_p, V_p - \text{idx}_p)\) has no s term.
Gate dims can still make certain slices’ final vc = 0 (by reporting invalid),
but they cannot change the packet_vc value itself.
Gate Dims: Per-Flit Binary Validity
Gate dims 0, 1, 2 decide whether a flit as a whole is valid (1) or invalid (0), not a count.
Each gate dim classifies slices into groups by extracting a subset of the slice-id bits (via a bitmask) and comparing the result against a threshold. The bitmask \(\text{mask}_{gd}\) selects which bits of the slice-id this gate dim tracks:
$$\text{masked_id} (s) = s \mathbin{\&} \text{mask}_{gd}$$
Example: 16 slices (4-bit slice_id), mask_g0 = 0b1100
slice_id (4 bits): [ b3 b2 b1 b0 ]
mask_g0 = 0b1100: [ 1 1 0 0 ]
─────────────────
masked_id: [ b3 b2 0 0 ] → extracts the upper 2 bits
Slices fall into three groups based on comparing \(\text{masked_id}\) with \(\text{match}_{gd}\):
| Group | Condition | Meaning |
|---|---|---|
| Below | \(\text{masked_id}(s) < \text{match}_{gd}\) | All time steps valid |
| Boundary | \(\text{masked_id}(s) = \text{match}_{gd}\) | Valid when \(\text{idx}_{gd}(t) < V_{gd}\) |
| Above | \(\text{masked_id}(s) > \text{match}_{gd}\) | Depends on mode (see below) |
The \(P_{gd}\) flag selects between two modes that differ only in the “above” group:
Standard mode (\(P_{gd} = 0\))
$$\text{gate}_d(s, t) = \begin{cases} 1 & \text{masked_id}(s) < \text{match}_{gd} \\ [\text{idx}_{gd}(t) < V_{gd}] & \text{masked_id}(s) = \text{match}_{gd} \\ 0 & \text{masked_id}(s) > \text{match}_{gd} \end{cases}$$
Above-threshold slices are entirely invalid. This is the common case: the Slice factor is laid out in ascending order, so slices beyond the boundary contain no valid data.
Transposed mode (\(P_{gd} = 1\))
$$\text{gate}_d(s, t) = \begin{cases} 1 & \text{masked_id}(s) < \text{match}_{gd} \\ [\text{idx}_{gd}(t) < V_{gd}] & \text{masked_id}(s) \ge \text{match}_{gd} \end{cases}$$
Above-threshold slices get the same \(V_{gd}\) check as the boundary: they are not entirely invalid. This handles the transposed case where the slice ID encodes the inner index: slices beyond the boundary still contain valid data at early time steps (the outer index is small enough), and only run out of valid data at the same point as the boundary slice.
To disable a gate dim (make it always valid), set \(\text{mask}_{gd} = 0, \text{match}_{gd} = 1\). Then \(\text{masked_id} = 0 < 1\) for all slices, so every slice is in the “below” group.
Example: Standard mode, H=5 split into Ho=4 (slice) × Hi=2 (time)
H=5 is split into Ho × Hi = 4 × 2 (padded from 5 to 8).
Ho is the Slice factor (encoded in slice-id bits), Hi is the Time factor (sequencer counter).
Axis index = Ho * 2 + Hi. Valid when index < 5.
Gate dim 0 config: mask=0b1100 (extracts 2 bits for Ho), match=2, V_g0=1, standard mode.
16 slices, where masked_id = (slice_id & 0b1100) >> 2 gives Ho:
| Ho | masked_id | Group | Hi=0 | Hi=1 |
|---|---|---|---|---|
| 0 | 0 | below (< 2) | valid | valid |
| 1 | 1 | below (< 2) | valid | valid |
| 2 | 2 | boundary (= 2) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 3 | 3 | above (> 2) | invalid | invalid |
Ho=0,1: both time steps valid (index 0-3, all < 5). Ho=2: only first time step (index 4 < 5), second invalid (index 5 >= 5). Ho=3: fully invalid (index 6, 7 >= 5).
Example: Transposed mode, H=5 split into Ho=4 (slice, inner) × Hi=2 (time, outer)
H=5 is split into Ho × Hi = 4 × 2 (padded from 5 to 8), but transposed: Ho is the inner factor, Hi is the outer factor.
Axis index = Hi * 4 + Ho. Valid when index < 5.
Gate dim 0 config: match=1 (= 5 mod 4), V_g0=1 (= floor(5/4)), transposed mode.
| Ho | masked_id | Group | Hi=0 | Hi=1 |
|---|---|---|---|---|
| 0 | 0 | below (< 1) | valid | valid |
| 1 | 1 | boundary (= 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 2 | 2 | above (> 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
| 3 | 3 | above (> 1) | valid (idx=0 < 1) | invalid (idx=1 >= 1) |
Verify: Ho=0, Hi=0: 0 < 5, Hi=1: 4 < 5, so 2 steps. Ho=1, Hi=0: 1 < 5, Hi=1: 5 >= 5, so 1 step. Ho=2, Hi=0: 2 < 5, Hi=1: 6 >= 5, so 1 step. Ho=3, Hi=0: 3 < 5, Hi=1: 7 >= 5, so 1 step.
Key difference from standard: the “above” group (Ho=2,3) still gets V_g0=1 valid time steps, not zero.
Example: Full VCG computation for [H=5, C=5, W=19], step-by-step build-up
This example builds up from one axis to three, so each dimension’s contribution is clear.
Original shape [H, C, W] = [5, 5, 19].
Each axis is split into a slice part (slice_id) and a time part (sequencer):
H = 5 -> Ho(slice) * Hi(time) = 4 * 2 (padded from 5 to 8)
C = 5 -> Co(slice) * Ci(time) = 4 * 2 (padded from 5 to 8)
W = 19 -> Wi(packet) = 3 * 8 (padded from 19 to 24)
Step 1: W=19 only (packet dim, no gates)
Ignore H and C for now. Disable gate dims 0 and 1. Every slice processes 3 flits (Wi limit=3), and packet dim produces the sawtooth:
packet_vc: 8, 8, 3
^ ^
full 19 - 16 = 3 (partial)
Since there are no gates, every slice gets this exact same pattern:
All slices, all flits:
flit 0: ████████ (vc=8)
flit 1: ████████ (vc=8)
flit 2: ███ (vc=3)
Step 2: Add C=5 (packet dim + gate dim 0)
Now enable the C-axis gate (gate dim 0).
C=5 is split into Co(slice, 4 values) * Ci(time, limit 2).
The C-gate uses: mask=0b0011 (extracts Co from slice_id), match=2, V_g0=1, standard mode.
Each slice now runs 6 flits: Ci (limit 2) * Wi (limit 3). The C-gate classifies slices by their Co value:
| Co | Group | Effect |
|---|---|---|
| 0 | below (< 2) | gate open: all 6 flits get packet dim’s pattern |
| 1 | below (< 2) | gate open: same |
| 2 | boundary (= 2) | gate open for Ci=0, closed for Ci=1 |
| 3 | above (> 2) | gate closed: all 6 flits get vc=0 |
Result per slice (6 flits = 2 Ci groups * 3 Wi flits):
Co=0: [8,8,3, 8,8,3] <- both Ci steps valid
Co=1: [8,8,3, 8,8,3] <- same
Co=2: [8,8,3, 0,0,0] <- Ci=0 valid, Ci=1 gated off
Co=3: [0,0,0, 0,0,0] <- entirely gated off
Notice the gate’s effect: some slices go entirely to zero, and the boundary slice loses its second half.
But within the valid flits, the [8,8,3] pattern from packet dim is unchanged.
Step 3: Add H=5 (full 3-axis, packet dim + gate dim 0 + gate dim 1)
Now enable the H-axis gate (gate dim 1).
H=5 is split into Ho(slice, 4 values) * Hi(time, limit 2).
The H-gate uses: mask=0b1100 (extracts Ho from slice_id), match=0b1000, V_g1=1, standard mode.
Slice ID encodes both slice factors: slice_id = Ho * 4 + Co, giving 16 slices.
Each slice now runs 12 flits: Hi (limit 2) * Ci (limit 2) * Wi (limit 3).
| Dim | Axis | What it tracks | VCG config |
|---|---|---|---|
| packet | W=19 | element count in packet | V_p=19, stride_p=8, counter stride=8, limit=3 |
| gate 0 | C=5 | gate: is Co within valid range? | mask=0b0011, match=2, V_g0=1, standard |
| gate 1 | H=5 | gate: is Ho within valid range? | mask=0b1100, match=0b1000, V_g1=1, standard |
The H-gate classifies slices by Ho, same logic as C-gate by Co:
| Ho | Group | Effect |
|---|---|---|
| 0 | below | H-gate open |
| 1 | below | H-gate open |
| 2 | boundary | H-gate open for Hi=0, closed for Hi=1 |
| 3 | above | H-gate closed |
The final vc for each flit is packet_vc(t) * C_gate(s,t) * H_gate(s,t).
Both gates must be open for packet dim’s count to survive.
The complete heatmap (16 slices * 12 flits). Columns are slices grouped by Ho; rows are flits grouped by (Hi, Ci). Right-side annotations show which gates are active for each row:
Ho=0 |Ho=1 |Ho=2 |Ho=3
Co: 0 1 2 3 | 0 1 2 3| 0 1 2 3| 0 1 2 3
H-gate: v v v v | v v v v| > > > >| x x x x
C-gate: v v > x | v v > x| v v > x| v v > x
--------------------------------------------------------------------------------
t= 0 Hi=0,Ci=0 W 8 8 8 0 | 8 8 8 0| 8 8 8 0| 0 0 0 0 H:v C:v
t= 1 | 8 8 8 0 | 8 8 8 0| 8 8 8 0| 0 0 0 0
t= 2 | 3 3 3 0 | 3 3 3 0| 3 3 3 0| 0 0 0 0
| | |
t= 3 Hi=0,Ci=1 W 8 8 0 0 | 8 8 0 0| 8 8 0 0| 0 0 0 0 H:v C:>
t= 4 | 8 8 0 0 | 8 8 0 0| 8 8 0 0| 0 0 0 0
t= 5 | 3 3 0 0 | 3 3 0 0| 3 3 0 0| 0 0 0 0
| | |
t= 6 Hi=1,Ci=0 W 8 8 8 0 | 8 8 8 0| 0 0 0 0| 0 0 0 0 H:> C:v
t= 7 | 8 8 8 0 | 8 8 8 0| 0 0 0 0| 0 0 0 0
t= 8 | 3 3 3 0 | 3 3 3 0| 0 0 0 0| 0 0 0 0
| | |
t= 9 Hi=1,Ci=1 W 8 8 0 0 | 8 8 0 0| 0 0 0 0| 0 0 0 0 H:> C:>
t=10 | 8 8 0 0 | 8 8 0 0| 0 0 0 0| 0 0 0 0
t=11 | 3 3 0 0 | 3 3 0 0| 0 0 0 0| 0 0 0 0
v = open (below threshold) > = boundary (partial) x = closed (above)
Reading the patterns:
- Ho=3 columns (rightmost 4): all 0. H-gate
x(above threshold, always closed). - Co=3 columns (every 4th): all 0. C-gate
x. - Co=2 columns (H:v C:
>): C-gate is boundary; only rows with Ci=0 pass. Compare Co=1 vs Co=2 to see the gate’s effect. - Ho=2 columns (H:
>C:v): H-gate is boundary; only rows with Hi=0 pass. Compare Ho=1 vs Ho=2. - Ho=2 * Co=2 (both
>): only (Hi=0, Ci=0) rows pass, the intersection of both boundaries. - Within valid cells: the
[8, 8, 3]sawtooth from packet dim always appears, the same regardless of slice.
What Patterns Are Expressible
The valid count formula is a product of four independent terms.
This multiplicative structure determines which vc(s, t) functions the hardware can produce, and which it cannot.
Why Limitations Arise
Each limitation traces back to a specific part of the formula.
Packet dim cannot see slice-id.
The packet dim formula packet_vc(t) = min(stride_p, max(0, V_p − idx_p(t))) depends only on t.
If two slices need different partial counts at the same time step, packet dim cannot produce both:
Suppose we need: vc(s=0, t=0) = 8, vc(s=1, t=0) = 3
───────────────
packet dim would need to output
both 8 and 3 at t=0, impossible
Each gate classifies slices by a single threshold after masking.
Gate dims first apply a bitmask to the slice-id (masked_id = slice_id & mask), then compare against one value match.
This produces three contiguous groups: below, boundary, above.
A gate cannot express “slices 0, 3, 7 are valid but 1, 2 are not”; it can represent only contiguous ranges of masked_id.
The mask selects which bits of the slice-id to inspect, allowing one gate to track a specific axis even when the slice-id encodes multiple axes.
At most 4 independent checks. One packet count (packet dim) + three binary gates (gate dims) = 4 orthogonal dimensions total.
Single-Axis Scenarios
A padded axis (original size n, padded to n' > n) occupies some combination of Packet (packet dim), Time (sequencer), and Slice (gate dims via slice-id bits).
Single-position: Packet / Time / Slice
When an axis occupies only one position, validity tracking is straightforward:
- Packet only → packet dim handles the sawtooth (see Packet Dim examples).
- Time only → gate dim with
mask=0, match=0: all slices are boundary, binary validity by time step. - Slice only → gate dim with appropriate mask/match: all time steps within a valid slice pass, invalid slices are fully gated.
All three are always supported.
Slice + Time
One axis split between slice-id bits and sequencer counters. This is the VCG’s most important use case, and it is how gate dims are typically used.
Standard (slice outer, time inner): axis index = Ho × time_count + Hi.
Example: H=14, Ho=8 (slice) × Hi=3 (time), standard mode
Gate dim config: match = ⌊14/3⌋ = 4, V = 14 mod 3 = 2.
Ho masked_id group Hi=0 Hi=1 Hi=2
── ───────── ───────── ────────── ────────── ──────────
0 0 below idx=0 < 2 ✅ idx=1 < 2 ✅ always ✅
1 1 below always ✅ always ✅ always ✅
2 2 below always ✅ always ✅ always ✅
3 3 below always ✅ always ✅ always ✅
4 4 boundary idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
5 5 above ❌ ❌ ❌
6 6 above ❌ ❌ ❌
7 7 above ❌ ❌ ❌
Why “below” is genuinely all-valid: Ho × 3 + Hi < 4 × 3 = 12 ≤ 14 for all Hi ∈ [0,3).
Standard mode tolerates over-allocated time_count; the “below” interpretation remains correct.
Transposed (time outer, slice inner): axis index = Hi × slice_count + Ho.
Example: H=19, Ho=8 (slice, inner) × Hi=3 (time, outer), transposed mode
Gate dim config: match = 19 mod 8 = 3, V = ⌊19/8⌋ = 2, transposed mode.
Ho masked_id group Hi=0 Hi=1 Hi=2
── ───────── ───────── ────────── ────────── ──────────
0 0 below always ✅ always ✅ always ✅
1 1 below always ✅ always ✅ always ✅
2 2 below always ✅ always ✅ always ✅
3 3 boundary idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
4 4 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
5 5 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
6 6 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
7 7 above idx=0 < 2 ✅ idx=1 < 2 ✅ idx=2 < 2 ❌
Verify against real data (axis index = Hi × 8 + Ho, valid when < 19):
- Ho=0, Hi=2:
2×8 + 0 = 16 < 19✅; “below” gives all-valid = 3 steps, needV+1 = 3steps ✅ - Ho=3, Hi=2:
2×8 + 3 = 19 ≥ 19❌; boundary gives 2 steps ✅ - Ho=7, Hi=1:
1×8 + 7 = 15 < 19✅; “above” gives V=2 steps, actual need is 2 steps ✅
Constraint: time_count must equal ⌈n / slice_count⌉ (= V + 1).
The “below” group gets time_count valid steps from the HW “all-valid” interpretation.
If time_count > V + 1, the “below” group receives more valid steps than the data actually has.
Packet + Time
Both packet and time factors assigned to packet dim, with multiple counters contributing to idx_p(t) (see Original Dimensions).
Example: n=50, two counters on packet dim, contiguous (`stride_outer = 8 × 3 = 24`)
c_inner (limit=3, stride=8): packet counter
c_outer (limit=3, stride=24): time counter (24 = 8 × 3 ✅ contiguous)
idx_p = c_outer × 24 + c_inner × 8
c_inner=0 c_inner=1 c_inner=2
───────── ───────── ─────────
c_outer=0 idx_p=0→8 idx_p=8→8 idx_p=16→8
c_outer=1 idx_p=24→8 idx_p=32→8 idx_p=40→8
c_outer=2 idx_p=48→2 idx_p=56→0 idx_p=64→0
↑
min(8, 50-48)=2
Packet dim handles both the within-flit and across-flit boundaries.
Slice + Packet: not supported
One axis split between slice (gate dim) and packet (packet dim). This directly violates the slice-independent packet count constraint (see Packet Dim: Key property).
packet_vc(t) depends only on t, but the boundary slice needs a different partial count than all-valid slices.
The gate can multiply by 0 or 1, so it can fully close a flit but cannot change the partial count.
Example: n=10, stride_p=8, slice_count=2
What we need:
Ho=0: elements 0-7, all valid → vc = 8
Ho=1: elements 8-15, first 2 valid → vc = 2
─
partial count, different from 8
Attempt 1: set `V_p = 10`:
packet_vc = min(8, 10-0) = 8 for ALL slices
Ho=0: vc = 8 × 1 = 8 ✅
Ho=1: vc = 8 × 1 = 8 ❌ (need 2, not 8)
vc = 8 × 0 = 0 ❌ (gate can close to 0, not to 2)
Attempt 2: set `V_p = 2`:
packet_vc = min(8, 2-0) = 2 for ALL slices
Ho=0: vc = 2 × 1 = 2 ❌ (need 8, not 2)
No single V_p works. Packet dim produces one value; the gate can only multiply by 0 or 1.
Note on degenerate cases:
When n % stride_p = 0, every packet is either fully valid or fully invalid, so packet dim produces no partial counts and a gate alone handles validity.
This is effectively Slice only, not a true Slice + Packet scenario.
Similarly, n <= stride_p means a single flit covers the entire axis, reducing to Packet only.
When the VCG cannot express the required pattern, Padding Strategy alternatives are available.
Slice + Time + Packet: not supported
The axis spans all three positions.
The Slice + Packet conflict carries over: the boundary slice still needs a different partial count than all-valid slices, and packet_vc(t) still cannot vary by slice.
The same degenerate exception applies: n % stride_p = 0 eliminates partial counts, reducing to Slice + Time (packet dim unused).
Multiple Axes
Each padded axis that needs validity tracking consumes one original dimension slot:
| Resource | Capacity | Notes |
|---|---|---|
| Packet Dim (packet count) | 1 slot | Innermost counter determines stride_p |
| Gate Dims (binary gates) | 3 slots | One gate per padded axis |
| Unpadded axes | free | No dim needed (mask=0, match=1) |
When the packet axis is fully aligned (n % stride_p = 0), packet_vc is constant and packet dim is effectively unused, so it can be repurposed as a gate for another axis.
Summary
A valid count function vc(s, t) is VCG-expressible only if:
- Prefix property: Valid elements form a contiguous prefix
[0, vc)within each flit. - Slice-independent packet count:
packet_vc(t)must be the same across all slices at the samet. Slices can be gated tovc = 0, but cannot receive a different partial count. - Monotonic slice ordering: Each gate dim classifies slices by a single threshold on
masked_id. - At most 4 orthogonal dimensions: 1 packet count + 3 binary gates.
| Placement | Dim | Supported? | Key constraint |
|---|---|---|---|
| Packet only | packet | ✅ | none |
| Time only | gate | ✅ | none |
| Slice only | gate | ✅ | none |
| Slice + Time (standard) | gate | ✅ | none |
| Slice + Time (transposed) | gate | ✅ | time_count = ⌈n / slice_count⌉ |
| Packet + Time | packet | ✅ | none |
| Slice + Packet | packet + gate | ❌ | packet_vc(t) cannot vary by slice |
| Slice + Time + Packet | packet + gate | ❌ | same as Slice + Packet |
For mapping-level code examples of each placement, see Examples. For unsupported cases, see Padding Strategy.
Downstream: 4-Way Operations
VCG assigns valid counts per 8-way flit, but the VectorArithmeticUnit can operate on 4-way halves.
| Operation | Input | Output | Valid Count Transformation |
|---|---|---|---|
| split_way4 | 8-way flit (vc = v) | two 4-way flits | vc_low = min(v, 4), vc_high = max(v - 4, 0) |
| trim_way4 | 8-way flit (vc = v) | one 4-way flit | vc = v (requires v <= 4) |
| concat_way8 | two 4-way flits | 8-way flit | vc = vc_low + vc_high |
| pad_way8 | 4-way flit | 8-way flit | vc unchanged |
The prefix property is preserved through split and concat.
For trim_way4, the constraint v <= 4 must be statically guaranteed by the mapping; if the upper 4 elements could be valid, trimming them would lose data.