Add on-the-fly compute support with a transpose engine #112

Merged
DanielKellerM merged 13 commits into pulp-platform:devel from DanielKellerM:compute/transpose-engine
Jun 17, 2026

@DanielKellerM

Collaborator

Summary

This PR adds optional on-the-fly compute to the generated iDMA backends: transfers can be transformed while they stream through the DMA, with no extra memory passes. The first (and currently only) compute op is a matrix transpose; the architecture is an extension point for further ops.

The transpose datapath is adapted from the datamover (Ratha) HWPE — imported verbatim from pulp-platform/datamover@d58a985 in the first commit (original authors credited), then reworked to iDMA conventions.

Architecture

opt.compute (request)  →  idma_transpose_midend  →  idma_nd_midend  →  backend
                           (tensor shape → 4-D walk)                     │
                                       write seam: idma_otf_compute ── idma_otf_transpose
  • Request model (idma_pkg): opt.compute = {enable, op, params} carried per transfer; transpose params are the element mode (int8/fp16/fp32) and the tensor shape (M, N up to 4095).
  • Geometry midend: expands a compact transpose request into the tiled NumDim=4 ND walk (row / row-tile / col-tile) for the unmodified idma_nd_midend; the geometry strength-reduces to shifts except one 12x12 stride product.
  • Write-seam dispatcher: idma_otf_compute latches the per-transfer options and selects one op; the AXI write manager gains an external strobe mask plus a strobe-independent beat-done so partial edge tiles drain correctly.
  • Engine: NE x NE tile ping-pong (NE = StrbWidth/elem), runtime element size, element-granular edge masking. Steady state costs 1 + 1/NE cycles per beat, i.e. NE + 1 cycles per NE-beat tile (~98% of bus peak at NE=64).
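
The tile arithmetic behind the Engine bullet can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the element sizes (1/2/4 bytes for int8/fp16/fp32) are assumed, StrbWidth is taken in bytes, and the steady-state figure is read as NE + 1 cycles for a tile of NE beats, which reproduces the quoted ~98% at NE=64.

```python
from math import ceil

# Element sizes in bytes; mode names follow the PR text, the mapping is assumed.
ELEM_BYTES = {"int8": 1, "fp16": 2, "fp32": 4}

def tile_edge(strb_width_bytes: int, mode: str) -> int:
    """NE = StrbWidth / elem: elements per bus beat, and the tile edge length."""
    return strb_width_bytes // ELEM_BYTES[mode]

def tile_count(m: int, n: int, ne: int) -> int:
    """Number of NE x NE tiles covering an M x N tensor (edge tiles are partial)."""
    return ceil(m / ne) * ceil(n / ne)

def bus_efficiency(ne: int) -> float:
    """Fraction of bus peak if a tile of NE beats costs NE + 1 cycles (assumed reading)."""
    return ne / (ne + 1)

if __name__ == "__main__":
    ne = tile_edge(64, "int8")  # 512-bit bus, int8 elements
    print(ne, tile_count(100, 70, ne), round(bus_efficiency(ne), 3))
```

Under these assumptions, smaller NE (wider elements or a narrower bus) lowers the achievable fraction of bus peak, since the one-cycle overhead is amortized over fewer beats.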

Generation-time configuration

Compute is a generation decision, not an SV parameter: IDMA_VIDMA_IDS entries take the form variant[:ops][:fd|hd] (default rw_axi = transpose, full duplex). Variants that are not listed render without any compute logic. The hd option builds a single tile bank: half the buffer area (StrbWidth^2 bytes per bank, e.g. 4 KiB at 512 bit) at roughly half the streaming rate.
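
To make the variant[:ops][:fd|hd] entry format concrete, here is a minimal Python sketch of a parser. It is not the code in util/mario/util.py: `parse_compute_ids`, `ComputeVariant`, the whitespace-separated list and the `+` separator between ops are all assumptions made for illustration. The rejection of empty and duplicate IDs mirrors the validation described in the commit log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComputeVariant:
    variant: str        # backend variant ID, e.g. "rw_axi"
    ops: tuple          # enabled compute ops
    full_duplex: bool   # True = fd (two tile banks), False = hd (one bank)

def parse_compute_ids(spec: str) -> list:
    """Parse whitespace-separated 'variant[:ops][:fd|hd]' entries (hypothetical helper)."""
    seen, result = set(), []
    for entry in spec.split():
        fields = entry.split(":")
        variant = fields[0]
        if not variant or variant in seen:
            raise ValueError(f"empty or duplicate compute ID in {entry!r}")
        seen.add(variant)
        ops, full_duplex = ("transpose",), True  # defaults named in the PR text
        for field in fields[1:]:
            if field in ("fd", "hd"):
                full_duplex = field == "fd"
            elif field:
                ops = tuple(field.split("+"))    # '+' as op separator is assumed
        result.append(ComputeVariant(variant, ops, full_duplex))
    return result

def bank_bytes(strb_width_bytes: int) -> int:
    """Tile buffer size per bank: StrbWidth^2 bytes."""
    return strb_width_bytes ** 2
```

With StrbWidth = 64 bytes (a 512-bit bus), `bank_bytes(64)` gives 4096 bytes, matching the 4 KiB quoted above.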

Constraints, enforced at generation/elaboration time: compute variants need a single AXI write port and NO_ERROR_HANDLING; sources/destinations must be readable/writable up to the tile-padded bounds (documented contract; writes are strobe-masked, reads are not).
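
The tile-padded contract can be illustrated with a small Python sketch. Both helpers are hypothetical, and the mask layout (valid elements in the low byte lanes of a beat) is an assumption, not something the PR states.

```python
def padded_extent(dim: int, ne: int) -> int:
    """Round a tensor dimension up to a whole number of NE-element tiles."""
    return -(-dim // ne) * ne   # ceiling division, then scale back up

def edge_strobe(valid_elems: int, elem_bytes: int, strb_width_bytes: int) -> int:
    """Byte-strobe mask enabling only the first valid_elems elements of a beat."""
    n_bytes = valid_elems * elem_bytes
    if not 0 <= n_bytes <= strb_width_bytes:
        raise ValueError("valid element count exceeds the beat width")
    return (1 << n_bytes) - 1   # low n_bytes byte lanes enabled (assumed layout)
```

For example, a dimension of 70 elements with NE = 64 pads to 128, so reads may touch addresses up to that bound, while the write of the final partial beat is limited by a mask such as `edge_strobe(6, 1, 64)`.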

Verification

  • Standalone engine regression against a DPI-C golden model, in both duplex modes
  • Multi-tile aligned + edge ND transposes through the generated rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression and a field-for-field midend unit test
  • Non-compute variants regenerate logic-identical to the base branch

Copilot AI left a comment

Pull request overview

This PR adds optional on-the-fly compute support to generated iDMA backends by introducing a transpose operation that is dispatched at the AXI write seam and carried per-transfer via new opt.compute request fields. It extends the generator to selectively include compute logic per backend variant, adds a transpose-geometry midend, updates write masking/beat retirement for edge tiles, and provides multiple new regressions (engine-only, ND end-to-end, and back-to-back).

Changes:

  • Extend request/options types with compute_options_t and add generator plumbing to enable compute per backend variant (--compute-ids / IDMA_VIDMA_IDS).
  • Add transpose datapath blocks (idma_otf_compute, idma_otf_transpose) and integrate them into the transport write seam, including strobe-mask support and strobe-independent beat retirement in idma_axi_write.
  • Add transpose geometry expander midend plus new directed/unit testbenches and a DPI-C golden model.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| util/mario/util.py | Adds parsing for compute-enabled backend variant IDs (--compute-ids). |
| util/mario/transport_layer.py | Threads compute configuration into transport-layer template context. |
| util/mario/legalizer.py | Threads compute enable into legalizer template context. |
| util/mario/backend.py | Enforces compute placement constraints (single AXI write port) and passes op set into backend template context. |
| util/gen_idma.py | Adds --compute-ids CLI support and forwards compute configuration into renderers. |
| src/include/idma/typedef.svh | Extends options_t with compute field to carry per-transfer compute config. |
| src/idma_pkg.sv | Introduces compute op enums and packed option/enable types for on-the-fly compute. |
| src/midend/idma_transpose_midend.sv | New midend that expands transpose requests into a NumDim=4 tiled ND walk. |
| src/midend/idma_nd_midend.sv | Adds a simulation-time stride/address width consistency assert. |
| src/backend/idma_otf_compute.sv | New write-seam dispatcher that latches per-transfer compute options and selects the active engine. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) with edge masking. |
| src/backend/idma_axi_write.sv | Adds external strobe mask input and strobe-independent beat-done output for correct draining of masked beats. |
| src/db/idma_axi.yml | Wires compute into write datapath request and connects new idma_axi_write ports. |
| src/db/idma_tilelink.yml | Forwards compute into write datapath request struct literal. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Integrates compute at the write seam and carries shifted external mask to the write manager. |
| src/backend/tpl/idma_legalizer.sv.tpl | Forces decouple signals when compute is enabled and forwards opt.compute through mutable options. |
| src/backend/tpl/idma_backend.sv.tpl | Adds compute fields to datapath request structs and increases meta FIFO depth for compute latency. |
| test/idma_test.sv | Extends test driver task to optionally enable/parameterize transpose per transfer. |
| test/idma_transpose_dpi.c | Adds DPI-C golden transpose model for standalone engine verification. |
| test/tb_idma_otf_transpose.sv | New standalone transpose-engine self-checking regression using DPI golden. |
| test/tb_idma_transpose_nd.sv | New end-to-end ND→backend transpose regression with edge/padding checks. |
| test/tb_idma_transpose_b2b.sv | New end-to-end back-to-back transpose regression to catch stale per-transfer state. |
| test/midend/tb_idma_transpose_midend.sv | Unit test verifying transpose midend geometry expansion and passthrough behavior. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend regression under backpressure to catch base-address reuse. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling design doc for transpose integration. |
| idma.mk | Adds compute-ids generation hook, split-RTL sim flow tag, and new Questa/VSIM targets for transpose regressions. |
| Bender.yml | Adds new RTL sources and introduces a split_rtl target flow for per-variant generated files and transpose tests. |
| jobs/backend_rw_axi/transpose_none.txt | Adds a placeholder job file for the rw_axi backend. |


Comment thread util/gen_idma.py
Comment thread util/mario/util.py
Comment thread src/backend/idma_otf_transpose.sv
@DanielKellerM DanielKellerM marked this pull request as draft June 10, 2026 14:26
@DanielKellerM DanielKellerM mentioned this pull request Jun 16, 2026
7 tasks
@DanielKellerM DanielKellerM added the enhancement New feature or request label Jun 16, 2026
Comment thread doc/transpose-engine-routing-plan.md Outdated
FrancescoConti and others added 7 commits June 16, 2026 15:34
Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985
(branch cdurrer/konark), the transpose core of the Ratha HWPE.

Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch>
Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>

Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready
with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime
element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile
banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden
regression.

compute_options_t carries {enable, op, params} in the request options;
transpose_options_t packs the element mode and tensor shape; compute_enable_t
is the compile-time per-op build gate.

idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile /
col-tile) from the tensor shape and the bus StrbWidth, leaving the generic
nd_midend to walk it; the geometry folds to shifts except one stride product.
Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the
tile-padded access contract; nd_midend asserts strides match the address width.

idma_otf_compute latches the per-transfer compute options and runtime-selects
one op per transfer; the AXI write manager gains an external strobe mask and a
strobe-independent beat-done so edge tiles drain. Compute support is decided
at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the
seam, the per-op ComputeEnable set and the transpose duplex into the listed
variants only; non-compute variants are untouched. The write-side FIFOs grow
by a tile to clear the legalizer in-flight bound and compute variants require
NO_ERROR_HANDLING.

Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back
geometry-leak checks, an nd_midend burst-address regression, a field-for-field
midend unit test and launch_tf transpose options; the engine regression runs in
both duplex modes.

Validate IDMA_VIDMA_IDS against the built backend variants, reject empty
and duplicate compute IDs, and assert StrbWidth is a power of two >= 2 in
the transpose engine. Drop the standalone routing-plan doc (folded into
the docs PR) and trim the midend header.
@DanielKellerM DanielKellerM force-pushed the compute/transpose-engine branch from 577c381 to f9853a5 Compare June 16, 2026 13:50
The transpose files used a '//' empty-comment separator and singular
'Author:'; lint-authors.py requires a blank line after the SPDX line and
after the author list (its header-regex is a folded YAML scalar with a
trailing newline) plus a plural 'Authors:' bullet list.

The PeakRDL addrmap packages are bundled in idma_generated.sv but were
missing from the per-variant split_rtl source list, so the split_rtl
compile flow could not find idma_*_addrmap_pkg (used by the desc64 reg
wrapper). Order them after the reg_top files, matching the bundle.

Stock backends carry no on-the-fly compute by default; opt in via
IDMA_VIDMA_IDS=<id>. A stamp file forces regeneration of the compute-
bearing RTL when the value changes (Make tracks timestamps, not variable
values). The transpose sim targets opt in to rw_axi via a target-specific
variable.

idma_rtl_clean left the .vidma_ids stamp behind; remove it for hygiene
(the stamp only triggers regen, so this was not a correctness bug).
Also delete the empty, unreferenced jobs/backend_rw_axi/transpose_none.txt.
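
The stamp-file idea in the commits above is the usual write-if-changed idiom. The real mechanism lives in idma.mk; the following Python sketch only illustrates the principle, and `update_stamp` is a hypothetical helper rather than anything in the repository.

```python
import tempfile
from pathlib import Path

def update_stamp(stamp: Path, value: str) -> bool:
    """Rewrite the stamp only when the recorded value differs; return True if rewritten."""
    if stamp.exists() and stamp.read_text() == value:
        return False          # unchanged: old timestamp is kept, nothing regenerates
    stamp.write_text(value)   # changed or missing: the new timestamp triggers a rebuild
    return True

if __name__ == "__main__":
    stamp = Path(tempfile.mkdtemp()) / ".vidma_ids"
    print(update_stamp(stamp, "rw_axi"))   # first write
    print(update_stamp(stamp, "rw_axi"))   # same value, stamp untouched
```

Because a timestamp-based build tool only sees file modification times, making the generated RTL depend on such a stamp turns a change in the variable's value into a change it can detect.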
@DanielKellerM DanielKellerM marked this pull request as ready for review June 16, 2026 19:50
@DanielKellerM DanielKellerM changed the title Add on-the-fly compute support with a transpose engine at the write seam Add on-the-fly compute support with a transpose engine Jun 17, 2026

@DanielKellerM DanielKellerM left a comment

Collaborator Author

to fix

Comment thread src/backend/tpl/idma_transport_layer.sv.tpl Outdated
Comment thread src/backend/tpl/idma_transport_layer.sv.tpl Outdated
Comment thread src/backend/idma_axi_write.sv Outdated
Comment thread src/backend/idma_otf_compute.sv Outdated
Comment thread src/midend/idma_transpose_midend.sv Outdated
Comment thread Bender.yml Outdated
Comment thread Bender.yml Outdated
Comment thread Bender.yml Outdated
Comment thread Bender.yml Outdated
Comment thread idma.mk Outdated
The transpose TBs took the M/N/EB geometry as elaboration-time parameters,
so the makefile drove coverage with a long list of per-geometry vsim runs.
M/N/EB are runtime loop bounds and addressing, not packed dimensions, so
the sweep now lives in the TB: each self-checks a geometry list in one
elaboration and the make target runs once per bus width, matching the
single-vsim convention of the other self-checking testbenches.

Remove the split_rtl Bender target and -t split_rtl from the compile
scripts: it existed only for the hand-edited transpose prototype, and the
generated bundle (idma_generated.sv) carries the compute-enabled rw_axi
when IDMA_VIDMA_IDS=rw_axi (the transpose sim targets opt in). Also remove
a duplicate idma_sim_tb_idma_rt_midend target, drop the internal
'write seam' wording, and trim verbose comments.
@DanielKellerM DanielKellerM force-pushed the compute/transpose-engine branch from eac0fc6 to 5a47a21 Compare June 17, 2026 10:03
@DanielKellerM DanielKellerM merged commit 2435af6 into pulp-platform:devel Jun 17, 2026
12 checks passed
@DanielKellerM

DanielKellerM commented Jun 17, 2026

Collaborator Author

FYI @FrancescoConti - the long-promised feature :)

DanielKellerM added a commit that referenced this pull request Jun 19, 2026
PR #112 added opt.compute to idma_req_t, but the desc64 stimulus class
randomizes idma_req_t and zeroes every opt sub-field the descriptor format
cannot express except compute. The golden model thus carried random compute
values while the DUT (descriptors have no compute encoding) emitted zero,
firing a Burst mismatch on every descriptor and turning the non-allow_failure
desc64 vcs-sim / vsim-sim-cov jobs red. Constrain compute to zero, matching
the existing beo/axi-param zeroing.
DanielKellerM added a commit that referenced this pull request Jun 19, 2026
Two generation defects in the compute (#112) / multi-head (#136) tracks:

- The idma_otf_compute .ComputeEnable parameter rendered a bare assignment
  pattern '{...}; Questa infers the type but DC Presto (VER-294) and Spyglass
  reject it. Type-prefix it with idma_pkg::compute_enable_t.

- w_beat_done was a single scalar net bound by every write instance, so a
  backend with >1 write head drove it multiply (vsim-3839, multihead_rw).
  Vectorize it per write head like the other write-port nets; keep the scalar
  for the single-write-port case the compute engine consumes.
DanielKellerM added a commit to DanielKellerM/iDMA that referenced this pull request Jun 23, 2026
inst64 is a multi-write backend (rw_axi_rw_init_rw_obi) that cannot host the
pulp-platform#112 FF transpose engine. Add an AddrGenTranspose mode to idma_transpose_midend:
instead of the NumDim=4 tiled engine walk, emit an element-granular NumDim=3
swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) and clear
compute.enable so the backend runs a plain strided copy. Correct on any protocol
(ideal on random-access OBI/TCDM). idma_inst64_top gains the AddrGenTranspose
param, wires it to the expander, and gates the engine-only gen_compute_check.
The inst64 transpose harness drives it end-to-end (int8/fp16/fp32, square/rect/
swapped, back-to-back, reject) -- it could not even elaborate before.
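
The swapped-stride approach can be modelled in Python as a plain strided copy. The commit only states that the program is element-granular, NumDim=3 and yields a contiguous N x M destination; the assignment of strides to dimensions below is an assumption, and `swapped_stride_copy` is a hypothetical name.

```python
def swapped_stride_copy(src: bytes, m: int, n: int, eb: int) -> bytes:
    """Transpose a row-major M x N tensor using only a strided copy of eb-byte elements."""
    dst = bytearray(m * n * eb)
    # Dimension 0: one element of eb contiguous bytes.
    # Dimension 1: M repetitions, src stride N*eb (down a source column), dst stride eb.
    # Dimension 2: N repetitions, src stride eb (next source column), dst stride M*eb.
    for c in range(n):
        for r in range(m):
            s = c * eb + r * n * eb
            d = c * m * eb + r * eb
            dst[d:d + eb] = src[s:s + eb]   # out_T[c][r] = in[r][c]
    return bytes(dst)
```

Each copied chunk is a single element, so this form trades burst length for generality: no tile buffer is needed, which fits the commit's remark that it is correct on any protocol and best suited to random-access memories.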

Labels

enhancement New feature or request

3 participants