Add on-the-fly compute support with a transpose engine #112
Pull request overview
This PR adds optional on-the-fly compute support to generated iDMA backends by introducing a transpose operation that is dispatched at the AXI write seam and carried per-transfer via new opt.compute request fields. It extends the generator to selectively include compute logic per backend variant, adds a transpose-geometry midend, updates write masking/beat retirement for edge tiles, and provides multiple new regressions (engine-only, ND end-to-end, and back-to-back).
Changes:
- Extend request/options types with compute_options_t and add generator plumbing to enable compute per backend variant (--compute-ids / IDMA_VIDMA_IDS).
- Add transpose datapath blocks (idma_otf_compute, idma_otf_transpose) and integrate them into the transport write seam, including strobe-mask support and strobe-independent beat retirement in idma_axi_write.
- Add transpose geometry expander midend plus new directed/unit testbenches and a DPI-C golden model.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| util/mario/util.py | Adds parsing for compute-enabled backend variant IDs (--compute-ids). |
| util/mario/transport_layer.py | Threads compute configuration into transport-layer template context. |
| util/mario/legalizer.py | Threads compute enable into legalizer template context. |
| util/mario/backend.py | Enforces compute placement constraints (single AXI write port) and passes op set into backend template context. |
| util/gen_idma.py | Adds --compute-ids CLI support and forwards compute configuration into renderers. |
| src/include/idma/typedef.svh | Extends options_t with compute field to carry per-transfer compute config. |
| src/idma_pkg.sv | Introduces compute op enums and packed option/enable types for on-the-fly compute. |
| src/midend/idma_transpose_midend.sv | New midend that expands transpose requests into a NumDim=4 tiled ND walk. |
| src/midend/idma_nd_midend.sv | Adds a simulation-time stride/address width consistency assert. |
| src/backend/idma_otf_compute.sv | New write-seam dispatcher that latches per-transfer compute options and selects the active engine. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) with edge masking. |
| src/backend/idma_axi_write.sv | Adds external strobe mask input and strobe-independent beat-done output for correct draining of masked beats. |
| src/db/idma_axi.yml | Wires compute into write datapath request and connects new idma_axi_write ports. |
| src/db/idma_tilelink.yml | Forwards compute into write datapath request struct literal. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Integrates compute at the write seam and carries shifted external mask to the write manager. |
| src/backend/tpl/idma_legalizer.sv.tpl | Forces decouple signals when compute is enabled and forwards opt.compute through mutable options. |
| src/backend/tpl/idma_backend.sv.tpl | Adds compute fields to datapath request structs and increases meta FIFO depth for compute latency. |
| test/idma_test.sv | Extends test driver task to optionally enable/parameterize transpose per transfer. |
| test/idma_transpose_dpi.c | Adds DPI-C golden transpose model for standalone engine verification. |
| test/tb_idma_otf_transpose.sv | New standalone transpose-engine self-checking regression using DPI golden. |
| test/tb_idma_transpose_nd.sv | New end-to-end ND→backend transpose regression with edge/padding checks. |
| test/tb_idma_transpose_b2b.sv | New end-to-end back-to-back transpose regression to catch stale per-transfer state. |
| test/midend/tb_idma_transpose_midend.sv | Unit test verifying transpose midend geometry expansion and passthrough behavior. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend regression under backpressure to catch base-address reuse. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling design doc for transpose integration. |
| idma.mk | Adds compute-ids generation hook, split-RTL sim flow tag, and new Questa/VSIM targets for transpose regressions. |
| Bender.yml | Adds new RTL sources and introduces a split_rtl target flow for per-variant generated files and transpose tests. |
| jobs/backend_rw_axi/transpose_none.txt | Adds a placeholder job file for the rw_axi backend. |
Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985 (branch cdurrer/konark), the transpose core of the Ratha HWPE.
Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch>
Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>
Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden regression.
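To illustrate what an element-granular edge strobe means, here is a minimal Python sketch. It assumes one strobe bit per byte lane and that the valid elements of an edge beat occupy the lowest lanes; the helper name `edge_strobe` and that lane ordering are illustrative assumptions, not taken from the RTL.

```python
def edge_strobe(valid_elems, elem_bytes, strb_width):
    """Return a byte-strobe mask with the low `valid_elems` elements enabled.

    Hypothetical helper: assumes one strobe bit per byte lane and that valid
    elements occupy the lowest lanes of the beat.
    """
    if elem_bytes not in (1, 2, 4):  # int8 / fp16 / fp32
        raise ValueError("unsupported element size")
    elems_per_beat = strb_width // elem_bytes
    if not 0 <= valid_elems <= elems_per_beat:
        raise ValueError("valid_elems out of range")
    # Whole elements only: the mask never splits an element across the edge.
    return (1 << (valid_elems * elem_bytes)) - 1


# A 64-byte bus carrying fp16: 5 valid elements enable 10 byte lanes.
mask = edge_strobe(valid_elems=5, elem_bytes=2, strb_width=64)
print(f"{mask:#x}")
```

Because the mask is built from whole elements, a partially valid beat can never write half of an fp16 or fp32 value.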
compute_options_t carries {enable, op, params} in the request options;
transpose_options_t packs the element mode and tensor shape; compute_enable_t
is the compile-time per-op build gate.
idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile / col-tile) from the tensor shape and the bus StrbWidth, leaving the generic nd_midend to walk it; the geometry folds to shifts except one stride product. Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the tile-padded access contract; nd_midend asserts strides match the address width.
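As a rough illustration of the geometry derivation, the Python sketch below computes tile counts and tile-padded extents from the tensor shape and the bus strobe width. It assumes square tiles of NE x NE elements with NE = StrbWidth / element bytes; the function and field names are hypothetical, and the real midend's field layout and stride encoding are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileGeometry:
    ne: int            # elements per bus beat (assumed tile edge)
    row_tiles: int     # tiles needed to cover the M rows
    col_tiles: int     # tiles needed to cover the N columns
    padded_rows: int   # row extent rounded up to a whole tile
    padded_cols: int   # column extent rounded up to a whole tile


def tile_geometry(m, n, elem_bytes, strb_width):
    """Hypothetical model of the tiled-walk geometry (not the RTL)."""
    if strb_width < 4:
        raise ValueError("StrbWidth must be >= 4")
    if elem_bytes not in (1, 2, 4):  # int8 / fp16 / fp32
        raise ValueError("unsupported element size")
    if m <= 0 or n <= 0:
        raise ValueError("zero-sized dimension")
    ne = strb_width // elem_bytes
    row_tiles = -(-m // ne)  # ceiling division
    col_tiles = -(-n // ne)
    return TileGeometry(ne, row_tiles, col_tiles, row_tiles * ne, col_tiles * ne)


print(tile_geometry(m=100, n=70, elem_bytes=2, strb_width=64))
```

The padded extents are what the tile-padded access contract refers to: under these assumptions, a 100 x 70 fp16 tensor on a 64-byte bus is walked as if it were 128 x 96.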
idma_otf_compute latches the per-transfer compute options and runtime-selects one op per transfer; the AXI write manager gains an external strobe mask and a strobe-independent beat-done so edge tiles drain. Compute support is decided at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the seam, the per-op ComputeEnable set and the transpose duplex into the listed variants only; non-compute variants are untouched. The write-side FIFOs grow by a tile to clear the legalizer in-flight bound, and compute variants require NO_ERROR_HANDLING.
Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression, a field-for-field midend unit test and launch_tf transpose options; the engine regression runs in both duplex modes.
Validate IDMA_VIDMA_IDS against the built backend variants, reject empty and duplicate compute IDs, and assert StrbWidth is a power of two >= 2 in the transpose engine. Drop the standalone routing-plan doc (folded into the docs PR) and trim the midend header.
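The validation rules can be sketched in Python as follows. This is not the code in util/mario/util.py: the function name, the whitespace-separated entry list, the default op and duplex values, and the handling of a lone `fd`/`hd` field are all assumptions made for illustration.

```python
DUPLEX_MODES = ("fd", "hd")


def parse_compute_ids(spec, built_variants):
    """Parse `variant[:ops][:fd|hd]` entries into {variant: {"ops", "duplex"}}.

    Hypothetical helper. Assumes whitespace-separated entries, a default op of
    "transpose" and a default duplex of "fd".
    """
    parsed = {}
    for entry in spec.split():
        variant, *rest = entry.split(":")
        if not variant:
            raise ValueError(f"empty compute ID in {entry!r}")
        if variant not in built_variants:
            raise ValueError(f"{variant!r} is not a built backend variant")
        if variant in parsed:
            raise ValueError(f"duplicate compute ID {variant!r}")
        if len(rest) > 2:
            raise ValueError(f"too many fields in {entry!r}")
        ops, duplex = "transpose", "fd"
        for field in rest:
            if field in DUPLEX_MODES:
                duplex = field
            elif field:
                ops = field
            else:
                raise ValueError(f"empty field in {entry!r}")
        parsed[variant] = {"ops": ops, "duplex": duplex}
    return parsed


print(parse_compute_ids("rw_axi:transpose:hd", {"rw_axi"}))
```

Failing early at generation time keeps a mistyped variant name from silently producing a backend without compute logic.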
577c381 to f9853a5
The transpose files used a '//' empty-comment separator and singular 'Author:'; lint-authors.py requires a blank line after the SPDX line and after the author list (its header-regex is a folded YAML scalar with a trailing newline) plus a plural 'Authors:' bullet list.
The PeakRDL addrmap packages are bundled in idma_generated.sv but were missing from the per-variant split_rtl source list, so the split_rtl compile flow could not find idma_*_addrmap_pkg (used by the desc64 reg wrapper). Order them after the reg_top files, matching the bundle.
Stock backends carry no on-the-fly compute by default; opt in via IDMA_VIDMA_IDS=<id>. A stamp file forces regeneration of the compute-bearing RTL when the value changes (Make tracks timestamps, not variable values). The transpose sim targets opt in to rw_axi via a target-specific variable.
idma_rtl_clean left the .vidma_ids stamp behind; remove it for hygiene (the stamp only triggers regen, so this was not a correctness bug). Also delete the empty, unreferenced jobs/backend_rw_axi/transpose_none.txt.
The transpose TBs took the M/N/EB geometry as elaboration-time parameters, so the makefile drove coverage with a long list of per-geometry vsim runs. M/N/EB are runtime loop bounds and addressing, not packed dimensions, so the sweep now lives in the TB: each self-checks a geometry list in one elaboration and the make target runs once per bus width, matching the single-vsim convention of the other self-checking testbenches.
Remove the split_rtl Bender target and -t split_rtl from the compile scripts: it existed only for the hand-edited transpose prototype, and the generated bundle (idma_generated.sv) carries the compute-enabled rw_axi when IDMA_VIDMA_IDS=rw_axi (the transpose sim targets opt in). Also remove a duplicate idma_sim_tb_idma_rt_midend target, drop the internal 'write seam' wording, and trim verbose comments.
eac0fc6 to 5a47a21
FYI @FrancescoConti - the long-promised feature :)
PR #112 added opt.compute to idma_req_t, but the desc64 stimulus class randomizes idma_req_t and zeroes every opt sub-field the descriptor format cannot express except compute. The golden model thus carried random compute values while the DUT (descriptors have no compute encoding) emitted zero, firing a Burst mismatch on every descriptor and turning the non-allow_failure desc64 vcs-sim / vsim-sim-cov jobs red. Constrain compute to zero, matching the existing beo/axi-param zeroing.
Two generation defects in the compute (#112) / multi-head (#136) tracks:
- The idma_otf_compute .ComputeEnable parameter rendered a bare assignment pattern '{...}; Questa infers the type but DC Presto (VER-294) and Spyglass reject it. Type-prefix it with idma_pkg::compute_enable_t.
- w_beat_done was a single scalar net bound by every write instance, so a backend with >1 write head drove it multiply (vsim-3839, multihead_rw). Vectorize it per write head like the other write-port nets; keep the scalar for the single-write-port case the compute engine consumes.
inst64 is a multi-write backend (rw_axi_rw_init_rw_obi) that cannot host the pulp-platform#112 FF transpose engine. Add an AddrGenTranspose mode to idma_transpose_midend: instead of the NumDim=4 tiled engine walk, emit an element-granular NumDim=3 swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) and clear compute.enable so the backend runs a plain strided copy. Correct on any protocol (ideal on random-access OBI/TCDM). idma_inst64_top gains the AddrGenTranspose param, wires it to the expander, and gates the engine-only gen_compute_check. The inst64 transpose harness drives it end-to-end (int8/fp16/fp32, square/rect/swapped, back-to-back, reject) -- it could not even elaborate before.
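The swapped-stride idea can be modelled in a few lines of Python. The sketch assumes a row-major M x N source, an innermost contiguous copy of one element, and two strided dimensions above it; the dictionary layout and the function names are hypothetical and do not mirror the midend's request struct.

```python
def swapped_stride_program(m, n, elem_bytes):
    """Hypothetical NumDim=3 program: 1D element copy plus two strided dims."""
    return {
        "length": elem_bytes,  # innermost contiguous copy: one element
        "dims": [
            # Walk down one source column; the destination row is contiguous.
            {"reps": m, "src_stride": n * elem_bytes, "dst_stride": elem_bytes},
            # Step to the next source column, i.e. the next destination row.
            {"reps": n, "src_stride": elem_bytes, "dst_stride": m * elem_bytes},
        ],
    }


def run_program(prog, src):
    """Execute the program as a plain strided copy over a byte buffer."""
    dst = bytearray(len(src))
    length = prog["length"]
    inner, outer = prog["dims"]
    for c in range(outer["reps"]):
        for r in range(inner["reps"]):
            s = c * outer["src_stride"] + r * inner["src_stride"]
            d = c * outer["dst_stride"] + r * inner["dst_stride"]
            dst[d:d + length] = src[s:s + length]
    return bytes(dst)


# 2 x 3 int8 matrix [[0, 1, 2], [3, 4, 5]] becomes the 3 x 2 transpose.
out = run_program(swapped_stride_program(2, 3, 1), bytes(range(6)))
print(list(out))
```

Every copy moves a single element, which is why this form suits random-access memories better than burst-oriented protocols.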
Summary
This PR adds optional on-the-fly compute to the generated iDMA backends: transfers can be transformed while they stream through the DMA, with no extra memory passes. The first (and currently only) compute op is a matrix transpose; the architecture is an extension point for further ops.
The transpose datapath is adapted from the datamover (Ratha) HWPE: imported verbatim from pulp-platform/datamover@d58a985 in the first commit (original authors credited), then reworked to iDMA conventions.
Architecture
- idma_pkg): opt.compute = {enable, op, params} carried per transfer; transpose params are the element mode (int8/fp16/fp32) and the tensor shape (M, N up to 4095).
- NumDim=4 ND walk (row / row-tile / col-tile) for the unmodified idma_nd_midend; the geometry strength-reduces to shifts except one 12x12 stride product.
- idma_otf_compute latches the per-transfer options and selects one op; the AXI write manager gains an external strobe mask plus a strobe-independent beat-done so partial edge tiles drain correctly.
- NE = StrbWidth/elem), runtime element size, element-granular edge masking. Steady state 1 + 1/NE cycles per tile (~98% of bus peak at NE=64).
Generation-time configuration
Compute is a generation decision, not an SV parameter: IDMA_VIDMA_IDS entries take variant[:ops][:fd|hd] (default rw_axi = transpose, full duplex). Non-listed variants render without any compute logic. The hd option builds a single tile bank: half the buffer area (StrbWidth^2 bytes per bank, e.g. 4 KiB at 512 bit) for roughly half the streaming rate.
Constraints, enforced at generation/elaboration time: compute variants need a single AXI write port and NO_ERROR_HANDLING; sources/destinations must be readable/writable up to the tile-padded bounds (documented contract; writes are strobe-masked, reads are not).
Verification
- rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression and a field-for-field midend unit test
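The two figures quoted in the description (about 98% of bus peak at NE=64, and 4 KiB per bank at 512 bit) can be checked with a short Python calculation. It assumes the steady-state cost is read as 1 + 1/NE cycles per bus beat, so that efficiency is the reciprocal; that reading and the function names are assumptions, not statements from the RTL.

```python
def streaming_efficiency(ne):
    """Fraction of bus peak, assuming 1 + 1/NE cycles per beat in steady state."""
    return 1.0 / (1.0 + 1.0 / ne)


def bank_bytes(bus_bits):
    """Tile-bank size in bytes, taken as StrbWidth^2 with StrbWidth = bus bytes."""
    strb_width = bus_bits // 8
    return strb_width * strb_width


print(f"efficiency at NE=64: {streaming_efficiency(64):.4f}")
print(f"bank size at 512 bit: {bank_bytes(512)} bytes")
```

Under the same reading, narrower buses or wider elements lower NE and therefore the achievable fraction of bus peak, for example 16/17 at NE=16.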