Add on-the-fly compute support with a transpose engine by DanielKellerM · Pull Request #112 · pulp-platform/iDMA

DanielKellerM · 2026-06-10T14:13:49Z

Summary

This PR adds optional on-the-fly compute to the generated iDMA backends: transfers can be transformed while they stream through the DMA, with no extra memory passes. The first (and currently only) compute op is a matrix transpose; the architecture is an extension point for further ops.

The transpose datapath is adapted from the datamover (Ratha) HWPE — imported verbatim from pulp-platform/datamover@d58a985 in the first commit (original authors credited), then reworked to iDMA conventions.

Architecture

opt.compute (request)  →  idma_transpose_midend  →  idma_nd_midend  →  backend
                           (tensor shape → 4-D walk)                     │
                                       write seam: idma_otf_compute ── idma_otf_transpose

Request model (idma_pkg): opt.compute = {enable, op, params} carried per transfer; transpose params are the element mode (int8/fp16/fp32) and the tensor shape (M, N up to 4095).
Geometry midend: expands a compact transpose request into the tiled NumDim=4 ND walk (row / row-tile / col-tile) for the unmodified idma_nd_midend; the geometry strength-reduces to shifts except for one 12-bit x 12-bit stride product.
Write-seam dispatcher: idma_otf_compute latches the per-transfer options and selects one op; the AXI write manager gains an external strobe mask plus a strobe-independent beat-done so partial edge tiles drain correctly.
Engine: NE x NE tile ping-pong (NE = StrbWidth / element size in bytes), runtime element size, element-granular edge masking. Steady state costs NE + 1 cycles per NE-beat tile, i.e. 1 + 1/NE cycles per beat (~98% of bus peak at NE = 64).
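The steady-state figure can be checked with a few lines of arithmetic. This is a minimal sketch, not taken from the RTL: it reads the quoted cost as one extra cycle for every NE data beats, which is what the ~98% figure at NE = 64 implies (64/65). That reading is an assumption, and the function names are hypothetical.

```python
from fractions import Fraction


def elems_per_beat(strb_width_bytes: int, elem_bytes: int) -> int:
    """NE = StrbWidth / element size, as stated in the PR description."""
    if elem_bytes <= 0 or strb_width_bytes % elem_bytes:
        raise ValueError("element size must divide the strobe width")
    return strb_width_bytes // elem_bytes


def steady_state_efficiency(ne: int) -> Fraction:
    """Fraction of bus peak, ASSUMING NE data beats plus one extra cycle per tile."""
    return Fraction(ne, ne + 1)


# 512-bit bus (StrbWidth = 64 bytes) with int8, fp16 and fp32 elements.
for elem_bytes in (1, 2, 4):
    ne = elems_per_beat(64, elem_bytes)
    print(f"elem={elem_bytes} B  NE={ne}  efficiency={float(steady_state_efficiency(ne)):.3f}")
```

Under the same assumption, wider elements lower NE and therefore lower the achievable fraction of bus peak.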

Generation-time configuration

Compute is a generation decision, not an SV parameter: IDMA_VIDMA_IDS entries take variant[:ops][:fd|hd] (default rw_axi = transpose, full duplex). Non-listed variants render without any compute logic. The hd option builds a single tile bank — half the buffer area (StrbWidth^2 bytes per bank, e.g. 4 KiB at 512 bit) for roughly half the streaming rate.
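To illustrate the entry grammar, here is a minimal Python sketch of how a variant[:ops][:fd|hd] entry could be split, together with the tile-buffer size it implies. It is not the code in util/mario/util.py: parse_entry, ComputeEntry and tile_buffer_bytes are hypothetical names, and the defaults (transpose, full duplex when a field is omitted) and the comma separator for several ops are assumptions.

```python
from dataclasses import dataclass

KNOWN_OPS = ("transpose",)   # the only op the PR describes
DUPLEX_MODES = ("fd", "hd")  # full duplex / half duplex


@dataclass(frozen=True)
class ComputeEntry:
    variant: str
    ops: tuple
    full_duplex: bool


def parse_entry(entry: str) -> ComputeEntry:
    """Parse one 'variant[:ops][:fd|hd]' entry (hypothetical helper)."""
    fields = entry.strip().split(":")
    variant, rest = fields[0], fields[1:]
    if not variant:
        raise ValueError(f"missing variant in {entry!r}")
    duplex = "fd"  # ASSUMED default when the duplex suffix is omitted
    if rest and rest[-1] in DUPLEX_MODES:
        duplex = rest.pop()
    if len(rest) > 1:
        raise ValueError(f"too many fields in {entry!r}")
    # ASSUMPTION: several ops would be comma-separated; omitted means transpose.
    ops = tuple(rest[0].split(",")) if rest else ("transpose",)
    for op in ops:
        if op not in KNOWN_OPS:
            raise ValueError(f"unknown op {op!r} in {entry!r}")
    return ComputeEntry(variant, ops, duplex == "fd")


def tile_buffer_bytes(strb_width_bytes: int, full_duplex: bool) -> int:
    """StrbWidth^2 bytes per bank; two banks for fd, a single bank for hd."""
    banks = 2 if full_duplex else 1
    return banks * strb_width_bytes ** 2


print(parse_entry("rw_axi:hd"))
print(tile_buffer_bytes(64, full_duplex=False))  # one bank at 512 bit
```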

Constraints, enforced at generation/elaboration time: compute variants need a single AXI write port and NO_ERROR_HANDLING; sources/destinations must be readable/writable up to the tile-padded bounds (documented contract; writes are strobe-masked, reads are not).

Verification

Standalone engine regression against a DPI-C golden model, in both duplex modes
Multi-tile aligned + edge ND transposes through the generated rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression and a field-for-field midend unit test
Non-compute variants regenerate logic-identical to the base branch
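For readers who want a reference to compare results against, the following is a small byte-level transpose model in Python. It is only a sketch of what a golden model for this operation has to compute; it is not the DPI-C model in test/idma_transpose_dpi.c, and transpose_bytes is a hypothetical name. It assumes a densely packed, row-major M x N source and an N x M destination.

```python
def transpose_bytes(src: bytes, m: int, n: int, elem_bytes: int) -> bytes:
    """Transpose a row-major M x N matrix whose elements are elem_bytes wide."""
    if len(src) != m * n * elem_bytes:
        raise ValueError("source length does not match M * N * elem_bytes")
    dst = bytearray(len(src))
    for r in range(m):
        for c in range(n):
            s = (r * n + c) * elem_bytes  # element (r, c) in the source
            d = (c * m + r) * elem_bytes  # element (c, r) in the destination
            dst[d:d + elem_bytes] = src[s:s + elem_bytes]
    return bytes(dst)


# 2 x 3 int8 matrix [[0, 1, 2], [3, 4, 5]] becomes 3 x 2.
print(list(transpose_bytes(bytes(range(6)), m=2, n=3, elem_bytes=1)))
```

Because elements are moved as opaque byte groups, the same routine covers int8, fp16 and fp32 by changing elem_bytes to 1, 2 or 4.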

Copilot

Pull request overview

This PR adds optional on-the-fly compute support to generated iDMA backends by introducing a transpose operation that is dispatched at the AXI write seam and carried per-transfer via new opt.compute request fields. It extends the generator to selectively include compute logic per backend variant, adds a transpose-geometry midend, updates write masking/beat retirement for edge tiles, and provides multiple new regressions (engine-only, ND end-to-end, and back-to-back).

Changes:

Extend request/options types with compute_options_t and add generator plumbing to enable compute per backend variant (--compute-ids / IDMA_VIDMA_IDS).
Add transpose datapath blocks (idma_otf_compute, idma_otf_transpose) and integrate them into the transport write seam, including strobe-mask support and strobe-independent beat retirement in idma_axi_write.
Add transpose geometry expander midend plus new directed/unit testbenches and a DPI-C golden model.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 3 comments.


File	Description
util/mario/util.py	Adds parsing for compute-enabled backend variant IDs (`--compute-ids`).
util/mario/transport_layer.py	Threads compute configuration into transport-layer template context.
util/mario/legalizer.py	Threads compute enable into legalizer template context.
util/mario/backend.py	Enforces compute placement constraints (single AXI write port) and passes op set into backend template context.
util/gen_idma.py	Adds `--compute-ids` CLI support and forwards compute configuration into renderers.
src/include/idma/typedef.svh	Extends `options_t` with `compute` field to carry per-transfer compute config.
src/idma_pkg.sv	Introduces compute op enums and packed option/enable types for on-the-fly compute.
src/midend/idma_transpose_midend.sv	New midend that expands transpose requests into a NumDim=4 tiled ND walk.
src/midend/idma_nd_midend.sv	Adds a simulation-time stride/address width consistency assert.
src/backend/idma_otf_compute.sv	New write-seam dispatcher that latches per-transfer compute options and selects the active engine.
src/backend/idma_otf_transpose.sv	New transpose engine (tile ping-pong) with edge masking.
src/backend/idma_axi_write.sv	Adds external strobe mask input and strobe-independent beat-done output for correct draining of masked beats.
src/db/idma_axi.yml	Wires compute into write datapath request and connects new `idma_axi_write` ports.
src/db/idma_tilelink.yml	Forwards compute into write datapath request struct literal.
src/backend/tpl/idma_transport_layer.sv.tpl	Integrates compute at the write seam and carries shifted external mask to the write manager.
src/backend/tpl/idma_legalizer.sv.tpl	Forces decouple signals when compute is enabled and forwards `opt.compute` through mutable options.
src/backend/tpl/idma_backend.sv.tpl	Adds compute fields to datapath request structs and increases meta FIFO depth for compute latency.
test/idma_test.sv	Extends test driver task to optionally enable/parameterize transpose per transfer.
test/idma_transpose_dpi.c	Adds DPI-C golden transpose model for standalone engine verification.
test/tb_idma_otf_transpose.sv	New standalone transpose-engine self-checking regression using DPI golden.
test/tb_idma_transpose_nd.sv	New end-to-end ND→backend transpose regression with edge/padding checks.
test/tb_idma_transpose_b2b.sv	New end-to-end back-to-back transpose regression to catch stale per-transfer state.
test/midend/tb_idma_transpose_midend.sv	Unit test verifying transpose midend geometry expansion and passthrough behavior.
test/midend/tb_idma_nd_midend_b2b.sv	Back-to-back ND midend regression under backpressure to catch base-address reuse.
doc/transpose-engine-routing-plan.md	Detailed routing/signaling design doc for transpose integration.
idma.mk	Adds compute-ids generation hook, split-RTL sim flow tag, and new Questa/VSIM targets for transpose regressions.
Bender.yml	Adds new RTL sources and introduces a `split_rtl` target flow for per-variant generated files and transpose tests.
jobs/backend_rw_axi/transpose_none.txt	Adds a placeholder job file for the rw_axi backend.


Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985 (branch cdurrer/konark), the transpose core of the Ratha HWPE. Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch> Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>

Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden regression.
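The element-granular edge strobe can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the logic in idma_otf_transpose.sv: it assumes the valid elements of an edge beat occupy the low-order byte lanes, and both function names are hypothetical.

```python
def valid_elems_in_last_tile(dim: int, ne: int) -> int:
    """Number of valid elements along one dimension of the final (edge) tile."""
    if dim <= 0 or ne <= 0:
        raise ValueError("dim and ne must be positive")
    rem = dim % ne
    return rem if rem else ne


def edge_strobe(strb_width: int, elem_bytes: int, valid_elems: int) -> int:
    """Byte-strobe mask covering whole elements only.

    ASSUMPTION: valid elements sit in the low-order lanes of the beat.
    """
    ne = strb_width // elem_bytes
    if not 0 <= valid_elems <= ne:
        raise ValueError("valid_elems out of range")
    return (1 << (valid_elems * elem_bytes)) - 1


# 8-byte bus, fp16 elements, 3 of 4 elements valid: six low strobe bits set.
print(bin(edge_strobe(strb_width=8, elem_bytes=2, valid_elems=3)))
```

Masking in whole-element steps means a partial tile never writes a fraction of an element, whatever the runtime element size.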

compute_options_t carries {enable, op, params} in the request options; transpose_options_t packs the element mode and tensor shape; compute_enable_t is the compile-time per-op build gate.

idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile / col-tile) from the tensor shape and the bus StrbWidth, leaving the generic nd_midend to walk it; the geometry folds to shifts except one stride product. Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the tile-padded access contract; nd_midend asserts strides match the address width.
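As a rough picture of the tiled walk, the Python sketch below enumerates source beat addresses for a row / row-tile / col-tile nesting. It is a sketch under assumptions, not the midend's actual program: the loop order (row innermost, col-tile outermost), the stride values, and the name source_beat_addresses are all assumed for illustration. It does show how tile counts reduce to shifts when NE is a power of two, and why reads reach the tile-padded bounds.

```python
def source_beat_addresses(m, n, strb_width, elem_bytes, base=0):
    """List source byte addresses of each StrbWidth-wide beat in the assumed walk."""
    if strb_width % elem_bytes:
        raise ValueError("element size must divide the strobe width")
    ne = strb_width // elem_bytes            # elements per beat; tiles are NE x NE
    if ne & (ne - 1):
        raise ValueError("NE must be a power of two")
    shift = ne.bit_length() - 1              # log2(NE)
    row_tiles = (m + ne - 1) >> shift        # ceil(M / NE) as a shift
    col_tiles = (n + ne - 1) >> shift        # ceil(N / NE) as a shift
    row_stride = n * elem_bytes              # bytes between consecutive source rows
    addrs = []
    for col_tile in range(col_tiles):        # ASSUMED outermost dimension
        for row_tile in range(row_tiles):
            for row in range(ne):            # ASSUMED innermost repeated dimension
                r = (row_tile << shift) + row
                addrs.append(base + r * row_stride + col_tile * strb_width)
    return addrs


# 4 x 8 int8 matrix on a 4-byte bus: two column tiles of four rows each.
print(source_beat_addresses(m=4, n=8, strb_width=4, elem_bytes=1))
```

With M = 5 on the same bus the walk covers eight rows, three of them beyond the tensor, which is the kind of access the tile-padded contract has to permit.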

idma_otf_compute latches the per-transfer compute options and runtime-selects one op per transfer; the AXI write manager gains an external strobe mask and a strobe-independent beat-done so edge tiles drain. Compute support is decided at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the seam, the per-op ComputeEnable set and the transpose duplex into the listed variants only; non-compute variants are untouched. The write-side FIFOs grow by a tile to clear the legalizer in-flight bound, and compute variants require NO_ERROR_HANDLING.

Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression, a field-for-field midend unit test and launch_tf transpose options; the engine regression runs in both duplex modes.

Validate IDMA_VIDMA_IDS against the built backend variants, reject empty and duplicate compute IDs, and assert StrbWidth is a power of two >= 2 in the transpose engine. Drop the standalone routing-plan doc (folded into the docs PR) and trim the midend header.
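The checks named in this commit message can be sketched in Python as follows. This is not the generator's code: check_compute_ids is a hypothetical helper that collects problems instead of aborting, and it assumes the variant name is the part of each entry before the first colon.

```python
def check_compute_ids(compute_ids, built_variants):
    """Return a list of problems; an empty list means the IDs are acceptable."""
    errors = []
    seen = set()
    for entry in compute_ids:
        variant = entry.split(":")[0].strip()
        if not variant:
            errors.append(f"empty compute ID in {entry!r}")
        elif variant in seen:
            errors.append(f"duplicate compute ID {variant!r}")
        elif variant not in built_variants:
            errors.append(f"{variant!r} is not a built backend variant")
        seen.add(variant)
    return errors


built = ["rw_axi", "rw_axi_rw_init_rw_obi"]
print(check_compute_ids(["rw_axi", "rw_axi:hd", "", "rw_obi"], built))
```

Reporting every problem in one pass lets a misconfigured IDMA_VIDMA_IDS value be fixed in a single edit rather than one error at a time.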

The transpose files used a '//' empty-comment separator and singular 'Author:'; lint-authors.py requires a blank line after the SPDX line and after the author list (its header-regex is a folded YAML scalar with a trailing newline) plus a plural 'Authors:' bullet list.

The PeakRDL addrmap packages are bundled in idma_generated.sv but were missing from the per-variant split_rtl source list, so the split_rtl compile flow could not find idma_*_addrmap_pkg (used by the desc64 reg wrapper). Order them after the reg_top files, matching the bundle.

Stock backends carry no on-the-fly compute by default; opt in via IDMA_VIDMA_IDS=<id>. A stamp file forces regeneration of the compute-bearing RTL when the value changes (Make tracks timestamps, not variable values). The transpose sim targets opt in to rw_axi via a target-specific variable.

idma_rtl_clean left the .vidma_ids stamp behind; remove it for hygiene (the stamp only triggers regen, so this was not a correctness bug). Also delete the empty, unreferenced jobs/backend_rw_axi/transpose_none.txt.

DanielKellerM

to fix

The transpose TBs took the M/N/EB geometry as elaboration-time parameters, so the makefile drove coverage with a long list of per-geometry vsim runs. M/N/EB are runtime loop bounds and addressing, not packed dimensions, so the sweep now lives in the TB: each self-checks a geometry list in one elaboration and the make target runs once per bus width, matching the single-vsim convention of the other self-checking testbenches.

Remove the split_rtl Bender target and -t split_rtl from the compile scripts: it existed only for the hand-edited transpose prototype, and the generated bundle (idma_generated.sv) carries the compute-enabled rw_axi when IDMA_VIDMA_IDS=rw_axi (the transpose sim targets opt in). Also remove a duplicate idma_sim_tb_idma_rt_midend target, drop the internal 'write seam' wording, and trim verbose comments.

DanielKellerM · 2026-06-17T15:18:57Z

FYI @FrancescoConti - the long-promised feature :)

PR #112 added opt.compute to idma_req_t, but the desc64 stimulus class randomizes idma_req_t and zeroes every opt sub-field the descriptor format cannot express except compute. The golden model thus carried random compute values while the DUT (descriptors have no compute encoding) emitted zero, firing a Burst mismatch on every descriptor and turning the non-allow_failure desc64 vcs-sim / vsim-sim-cov jobs red. Constrain compute to zero, matching the existing beo/axi-param zeroing.

Two generation defects in the compute (#112) / multi-head (#136) tracks:
- The idma_otf_compute .ComputeEnable parameter rendered a bare assignment pattern '{...}; Questa infers the type but DC Presto (VER-294) and Spyglass reject it. Type-prefix it with idma_pkg::compute_enable_t.
- w_beat_done was a single scalar net bound by every write instance, so a backend with >1 write head drove it multiply (vsim-3839, multihead_rw). Vectorize it per write head like the other write-port nets; keep the scalar for the single-write-port case the compute engine consumes.

inst64 is a multi-write backend (rw_axi_rw_init_rw_obi) that cannot host the pulp-platform#112 FF transpose engine. Add an AddrGenTranspose mode to idma_transpose_midend: instead of the NumDim=4 tiled engine walk, emit an element-granular NumDim=3 swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) and clear compute.enable so the backend runs a plain strided copy. Correct on any protocol (ideal on random-access OBI/TCDM). idma_inst64_top gains the AddrGenTranspose param, wires it to the expander, and gates the engine-only gen_compute_check. The inst64 transpose harness drives it end-to-end (int8/fp16/fp32, square/rect/swapped, back-to-back, reject) -- it could not even elaborate before.
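The swapped-stride idea can be demonstrated with a small Python model of a strided copy. It is a sketch, not the program idma_transpose_midend emits: the mapping of the three dimensions (one contiguous element, an inner loop over source rows, an outer loop over source columns) and the names addrgen_transpose_program and run_strided_copy are assumptions for illustration.

```python
def addrgen_transpose_program(m, n, elem_bytes):
    """Strides for out_T[c][r] = in[r][c] with a contiguous N x M destination."""
    return {
        "length": elem_bytes,                          # dim 0: one contiguous element
        "reps": (m, n),                                # inner loop over r, outer over c
        "src_strides": (n * elem_bytes, elem_bytes),   # r steps a row, c steps an element
        "dst_strides": (elem_bytes, m * elem_bytes),   # destination is walked linearly
    }


def run_strided_copy(src: bytes, prog) -> bytes:
    """Execute the program as a plain strided copy, with no compute engine."""
    inner, outer = prog["reps"]
    length = prog["length"]
    dst = bytearray(len(src))
    for o in range(outer):
        for i in range(inner):
            s = i * prog["src_strides"][0] + o * prog["src_strides"][1]
            d = i * prog["dst_strides"][0] + o * prog["dst_strides"][1]
            dst[d:d + length] = src[s:s + length]
    return bytes(dst)


prog = addrgen_transpose_program(m=2, n=3, elem_bytes=1)
print(list(run_strided_copy(bytes(range(6)), prog)))
```

Each copy moves a single element, so this form trades burst efficiency for protocol independence, in line with the commit message's remark that it is ideal on random-access OBI/TCDM.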

DanielKellerM requested a review from thommythomaso as a code owner June 10, 2026 14:13

Copilot AI review requested due to automatic review settings June 10, 2026 14:13

Copilot started reviewing on behalf of DanielKellerM June 10, 2026 14:14

DanielKellerM mentioned this pull request Jun 10, 2026

inst64: Drive on-the-fly transpose; add the snitch integration harness #113

Closed

Copilot AI reviewed Jun 10, 2026


Comment thread util/gen_idma.py

Comment thread util/mario/util.py

Comment thread src/backend/idma_otf_transpose.sv

DanielKellerM marked this pull request as draft June 10, 2026 14:26

DanielKellerM mentioned this pull request Jun 16, 2026

release: iDMA 0.7.0 #129

Draft

7 tasks

DanielKellerM added the enhancement New feature or request label Jun 16, 2026

DanielKellerM commented Jun 16, 2026


Comment thread doc/transpose-engine-routing-plan.md Outdated

FrancescoConti and others added 7 commits June 16, 2026 15:34

idma_pkg: Add the per-transfer compute request model

6e590eb


DanielKellerM force-pushed the compute/transpose-engine branch from 577c381 to f9853a5 Compare June 16, 2026 13:50

DanielKellerM added 4 commits June 16, 2026 15:59

build: Clean .vidma_ids in idma_rtl_clean and drop orphan job file

449daf6


DanielKellerM marked this pull request as ready for review June 16, 2026 19:50

DanielKellerM changed the title from 'Add on-the-fly compute support with a transpose engine at the write seam' to 'Add on-the-fly compute support with a transpose engine' Jun 17, 2026

DanielKellerM commented Jun 17, 2026


DanielKellerM added 2 commits June 17, 2026 12:02

DanielKellerM force-pushed the compute/transpose-engine branch from eac0fc6 to 5a47a21 Compare June 17, 2026 10:03

DanielKellerM merged commit 2435af6 into pulp-platform:devel Jun 17, 2026
12 checks passed

This was referenced Jun 23, 2026

inst64: On-the-fly transpose via address generation (no engine) #141

Open

inst64: Peak-bandwidth transpose via the FF engine on the multi-write backend #143

Open

DanielKellerM merged 13 commits into pulp-platform:devel from DanielKellerM:compute/transpose-engine


3 participants
