inst64: Drive on-the-fly transpose; add the snitch integration harness by DanielKellerM · Pull Request #113 · pulp-platform/iDMA

DanielKellerM · 2026-06-10T14:14:16Z

Summary

Drives on-the-fly transpose through the inst64 (Snitch) frontend and adds a standalone Snitch integration harness.

inst64's backend (rw_axi_rw_init_rw_obi) is multi-write and therefore cannot host the #112 single-AXI-write transpose engine. Instead, the transpose is executed by address generation: idma_transpose_midend (new AddrGenTranspose mode) emits an element-granular NumDim=3 swapped-stride ND program (out_T[c][r] = in[r][c]) and clears opt.compute, so the backend runs a plain strided copy — no engine, no tile buffer. OBI/TCDM is random-access, so this is the natural (and only needed) mechanism for inst64; the #112 FF engine remains the full-throughput path for AXI↔AXI transpose on rw_axi.
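The swapped-stride idea above can be modelled in a few lines of Python. This is only a behavioural sketch, not the RTL: the names `transpose_program` and `run_copy` are hypothetical, and the loop order (source walked row-major, one element per transaction) is an assumption consistent with out_T[c][r] = in[r][c] and a contiguous N x M destination.

```python
def transpose_program(src, dst, m, n, eb, dst_pitch=None):
    """Yield (src_addr, dst_addr, length) for an M x N -> N x M transpose.

    Assumed layout: the source is row-major with M rows of N elements,
    each element is `eb` bytes, and the destination row pitch defaults
    to M * eb bytes (contiguous output).
    """
    pitch = m * eb if dst_pitch is None else dst_pitch
    for r in range(m):          # outer dim: src stride N*eb, dst stride eb
        for c in range(n):      # middle dim: src stride eb, dst stride pitch
            yield (src + (r * n + c) * eb, dst + c * pitch + r * eb, eb)


def run_copy(mem, program):
    """Execute the program as a plain strided copy on a bytearray."""
    for s, d, length in program:
        mem[d:d + length] = mem[s:s + length]


def demo():
    """Transpose a 2 x 3 matrix of 2-byte elements and return the result."""
    m, n, eb = 2, 3, 2
    src, dst = 0, 64
    mem = bytearray(128)
    for i in range(m * n):
        mem[src + i * eb:src + (i + 1) * eb] = i.to_bytes(eb, "little")
    run_copy(mem, transpose_program(src, dst, m, n, eb))
    return [int.from_bytes(mem[dst + i * eb:dst + (i + 1) * eb], "little")
            for i in range(m * n)]


if __name__ == "__main__":
    print(demo())
```

The copy itself knows nothing about transposition: all of the reordering lives in the generated addresses, which is why no engine or tile buffer is needed on a random-access target.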

What's here

Frontend (idma_inst64_top): decode the transpose DMCPY into opt.compute, splice idma_transpose_midend ahead of idma_nd_midend, and reject malformed requests (no hardware / reserved mode / zero dim / unaligned dst).
Address-gen (idma_transpose_midend, AddrGenTranspose): swapped-stride transpose with an optional BankSkew that pads the dst row pitch by one bus-word when the column stride would hammer a single TCDM bank — makes the per-column word stride odd, so writes round-robin across all banks on any power-of-2-bank L1, at ≤1 word/row cost.
Harness (systems/snitch): standalone BFM driving the accelerator port plus AXI and native-OBI sim memories.

Testing

The self-checking harness TB sweeps a geometry list in one elaboration (int8/fp16/fp32, square/rectangular/odd) over the real OBI/TCDM port:

OBI→OBI — transpose a tile within L1/TCDM
AXI→OBI — load an external matrix into TCDM transposed
back-to-back (consecutive cases), cross-transfer compute-leak (a plain copy after transposes must not inherit opt.compute), and malformed-request reject
both BankSkew off (contiguous N×M) and on (padded N×M′, padding stays sentinel)
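As an illustration of what the padded-layout check amounts to, here is a small Python sketch of a self-checking loop over a geometry list. It is not the SystemVerilog TB: the sentinel value, the geometry list, the pad size, and the helper names `model_transpose` and `check_padded_transpose` are all hypothetical.

```python
SENTINEL = 0xA5  # hypothetical fill value for untouched destination bytes


def model_transpose(src, m, n, eb, pitch):
    """Reference model: write the N x M transpose into a sentinel-filled buffer."""
    dst = bytearray([SENTINEL]) * (n * pitch)
    for r in range(m):
        for c in range(n):
            s = (r * n + c) * eb
            d = c * pitch + r * eb
            dst[d:d + eb] = src[s:s + eb]
    return dst


def check_padded_transpose(src, dst, m, n, eb, pitch):
    """Return a list of mismatches: wrong data or overwritten padding."""
    errors = []
    for c in range(n):
        row = dst[c * pitch:(c + 1) * pitch]
        for r in range(m):
            want = src[(r * n + c) * eb:(r * n + c + 1) * eb]
            if row[r * eb:(r + 1) * eb] != want:
                errors.append(("data", c, r))
        if any(b != SENTINEL for b in row[m * eb:]):
            errors.append(("pad", c))
    return errors


def sweep(geometries, pad_bytes):
    """Run every (M, N, EB) case in one pass and collect all mismatches."""
    failures = []
    for m, n, eb in geometries:
        src = bytes((i * 7 + 1) % 251 for i in range(m * n * eb))
        pitch = m * eb + pad_bytes
        dst = model_transpose(src, m, n, eb, pitch)
        failures += check_padded_transpose(src, dst, m, n, eb, pitch)
    return failures


if __name__ == "__main__":
    print(sweep([(4, 4, 1), (3, 5, 2), (8, 2, 4)], pad_bytes=8))
```

With `pad_bytes=0` the same check covers the contiguous N×M case, since the padding slice of each row is then empty.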

A standalone copy harness (tb_idma_inst64_copy) also passes. All changes live in src/ — no codegen/templatization.

Notes

Address-gen is element-granular (M·N transactions): ideal on random-access OBI/TCDM, slower than a burst on AXI. The FF engine (Add on-the-fly compute support with a transpose engine #112) stays the full-throughput path for AXI↔AXI on rw_axi.
Rebased on current devel; the earlier transpose-engine commits are dropped (they landed via Add on-the-fly compute support with a transpose engine #112).

Copilot

Pull request overview

This PR wires the on-the-fly transpose feature end-to-end for the Snitch inst64 frontend and adds a standalone Snitch integration harness, while also extending the backend generation flow to optionally include compute support in selected variants.

Changes:

Add a typed per-transfer opt.compute capability (transpose op + params) and route it through legalizer/backend/transport to a write-seam compute dispatcher and transpose engine.
Extend the inst64 frontend to decode transpose requests from spare DMCPY argb bits, expand transpose geometry via a new idma_transpose_midend, and reject malformed transpose requests.
Add new SV/DPI-C testbenches (engine-level + ND/back-to-back) and a Snitch inst64 integration harness + Makefile flow (snitch_transpose_sweep), plus docs and Bender target support (split_rtl).

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.


File	Description
util/mario/util.py	Add parsing for `--compute-ids` configuration strings (ops + fd/hd).
util/mario/transport_layer.py	Pass compute enable/op info into transport-layer templating context.
util/mario/legalizer.py	Pass compute enable flag into legalizer templating context.
util/mario/backend.py	Enforce “single AXI write port” constraint for compute-enabled variants and pass ops into backend templating context.
util/gen_idma.py	Add `--compute-ids` CLI and propagate compute config into generators.
test/tb_idma_transpose_nd.sv	Multi-tile end-to-end transpose test via ND midend → compute backend → AXI sim mem.
test/tb_idma_transpose_b2b.sv	End-to-end back-to-back transpose regression to distinct destinations.
test/tb_idma_otf_transpose.sv	Standalone transpose engine SV testbench using DPI-C golden model.
test/midend/tb_idma_transpose_midend.sv	Unit test for transpose geometry expansion midend.
test/midend/tb_idma_nd_midend_b2b.sv	Back-to-back ND midend base-address reload regression under backpressure.
test/idma_transpose_dpi.c	DPI-C golden model for element-granular transpose verification.
test/idma_test.sv	Extend request-driving task to optionally program transpose compute options.
systems/snitch/test/tb_idma_inst64_transpose.sv	Snitch `inst64` end-to-end transpose integration test (incl. rejects + no-leak).
systems/snitch/test/tb_idma_inst64_copy.sv	Snitch `inst64` plain-copy regression.
systems/snitch/test/idma_inst64_tb_pkg.sv	Package/types/constants for the standalone Snitch harness.
systems/snitch/test/idma_inst64_drv_if.sv	Accelerator-bus BFM tasks, including `DMCPY`-encoded transpose launch helpers.
systems/snitch/test/idma_inst64_base.sv	Base harness instantiating `idma_inst64_top` + AXI sim memories.
systems/snitch/README.md	Document Snitch harness purpose, build flow, and transpose contract.
systems/snitch/Makefile	Standalone build + sim/sweep targets for the Snitch harness.
systems/snitch/.gitignore	Ignore build products for the Snitch harness flow.
src/midend/idma_transpose_midend.sv	New combinational transpose geometry expander (NumDim=4) for ND midend.
src/midend/idma_nd_midend.sv	Add non-synth assertion enforcing stride width == address width.
src/include/idma/typedef.svh	Extend `options_t` with typed `compute` options field.
src/idma_pkg.sv	Define compute op enum, transpose params, compute options, and feature enable struct.
src/frontend/inst64/idma_inst64_top.sv	Add `ComputeEnable` param, decode/validate transpose from `DMCPY`, splice transpose midend, widen strides to addr width, add backend capability cross-check.
src/db/idma_tilelink.yml	Forward compute options into write datapath request struct where needed.
src/db/idma_axi.yml	Forward compute options; extend AXI write template to accept strobe mask + beat-done pulse.
src/backend/tpl/idma_transport_layer.sv.tpl	Add write-seam compute integration (dispatcher + mask/beat-done plumbing).
src/backend/tpl/idma_legalizer.sv.tpl	Force decouple on compute transfers; propagate compute options into mutable transfer opts and write datapath req.
src/backend/tpl/idma_backend.sv.tpl	Add compute-enabled variant metadata (`ComputeEnable`), enforce NO_ERROR_HANDLING, increase meta FIFO depth for compute latency, propagate compute options into write datapath req.
src/backend/idma_otf_transpose.sv	New transpose engine (tile ping-pong) producing per-byte strobe mask.
src/backend/idma_otf_compute.sv	New write-seam compute dispatcher (currently transpose only).
src/backend/idma_axi_write.sv	Add external strobe mask input and a strobe-independent “beat accepted” pulse output.
jobs/backend_rw_axi/transpose_none.txt	Add job artifact/marker for transpose-none configuration (empty in this diff).
idma.mk	Add compute-enabled variant list (`IDMA_VIDMA_IDS`), propagate to generator, add simulation targets for transpose regressions, include `split_rtl` in vsim script target set.
doc/transpose-engine-routing-plan.md	Detailed routing/signaling plan and rationale for transpose integration.
Bender.yml	Add compute RTL, new midend, Snitch harness sources, transpose tests, and introduce `split_rtl` generated-file selection.


+    /// Extra write-descriptor slots covering the compute (transpose) tile-fill latency
+    localparam int unsigned ComputeFifoDepth = ${"StrbWidth" if enable_compute else "32'd0"};
+% if enable_compute:
+
+    /// Per-op compute set baked into this variant (frontends may cross-check)
+    localparam idma_pkg::compute_enable_t ComputeEnable =
+        '{${', '.join("%s: 1'b1" % op for op in compute_ops)}};
+`ifndef SYNTHESIS
+    // no engine flush on abort: compute is incompatible with error handling
+    initial assert (ErrorCap == idma_pkg::NO_ERROR_HANDLING) else
+        $fatal(1, "compute requires ErrorCap == NO_ERROR_HANDLING");
+`endif
+% endif


+  // full/empty token
+  always_ff @(posedge clk_i or negedge rst_ni) begin
+    if (!rst_ni || clear_i || exec_done) begin
+      full_q <= 2'b00;
+    end else begin


Decode the transpose from spare DMCPY argb bits into opt.compute, expand NumDim to 4 with addr-width strides and splice the transpose midend between the request FIFO and the nd_midend, gated by a ComputeEnable parameter. Malformed requests (no hardware, reserved mode, zero dims, unaligned dst) get an error response and the backend's baked compute set is cross-checked at elaboration.

Standalone BFM harness driving the accelerator port: copy and transpose testbenches and a sweep covering all element sizes, tiling, edge, back-to-back, leak and reject cases, registered behind the snitch_cluster target; the flow regenerates the RTL before compiling.

inst64 is a multi-write backend (rw_axi_rw_init_rw_obi) that cannot host the pulp-platform#112 FF transpose engine. Add an AddrGenTranspose mode to idma_transpose_midend: instead of the NumDim=4 tiled engine walk, emit an element-granular NumDim=3 swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) and clear compute.enable so the backend runs a plain strided copy. Correct on any protocol (ideal on random-access OBI/TCDM). idma_inst64_top gains the AddrGenTranspose param, wires it to the expander, and gates the engine-only gen_compute_check. The inst64 transpose harness drives it end-to-end (int8/fp16/fp32, square/rect/swapped, back-to-back, reject) -- it could not even elaborate before.

The harness gains an obi_sim_mem backdoor; the transpose TB now drives the real OBI/TCDM port instead of AXI-range addresses: OBI->OBI (transpose a tile within L1/TCDM, the Snitch DMA case), AXI->OBI (load an external matrix into TCDM transposed), back-to-back, no-leak OBI copy, and reject. PASS for int8/fp16/fp32. Closes the end-to-end gap -- previously the inst64 TB only hit the AXI path; the OBI read+write ports are now covered through the frontend.

A transposed write walks the dst with stride M*E; when M*E is an even number of bus words this hammers a single TCDM bank (1/B bandwidth on a B-bank L1). New BankSkew param (default off) pads the dst row pitch by one bus-word (NE elements) in that case, making the per-column word stride odd -> round-robins all banks on any power-of-2-bank TCDM, at <=1 word/row cost. Plumbed through idma_inst64_top; the harness/TB drive it and check the N x M' padded layout (padding columns stay sentinel). PASS skew-on (32x8 EB4 -> pitch 48) and skew-off (contiguous). Default off keeps the contiguous N x M output.
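The bank argument can be checked numerically. The sketch below is an assumption-laden model, not the RTL: it assumes word-interleaved banks (bank index = word address mod B), handles only row pitches that are a whole number of bus words, and uses hypothetical names (`skewed_pitch_bytes`, `banks_touched`) and a hypothetical 64-byte bus word.

```python
def skewed_pitch_bytes(m, eb, word_bytes, bank_skew=True):
    """Destination row pitch in bytes, padded by one bus word when the
    unpadded pitch M * eb is an even number of whole bus words."""
    pitch = m * eb
    words, rem = divmod(pitch, word_bytes)
    if bank_skew and rem == 0 and words % 2 == 0:
        pitch += word_bytes
    return pitch


def banks_touched(pitch_bytes, word_bytes, num_banks, rows):
    """Set of banks hit when writing one column across `rows` dst rows,
    assuming bank = word_address mod num_banks."""
    stride_words = pitch_bytes // word_bytes
    return {(i * stride_words) % num_banks for i in range(rows)}


if __name__ == "__main__":
    m, eb, word_bytes, banks = 64, 4, 64, 4   # hypothetical configuration
    for skew in (False, True):
        pitch = skewed_pitch_bytes(m, eb, word_bytes, bank_skew=skew)
        print(skew, pitch, sorted(banks_touched(pitch, word_bytes, banks, banks)))
```

In this configuration the unpadded pitch is 4 words, so every row of a column lands in the same bank; padding to 5 words gives an odd stride, which is coprime with any power-of-2 bank count and therefore visits every bank before repeating.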

Per the iDMA TB convention, a self-checking TB drives its own stimulus in one elaboration. M/N/EB are runtime (the DMCPY carries them), so the transpose TB now loops a localparam geometry list (int8/fp16/fp32, square/rect/odd, incl. the bank-skew-triggering shapes) instead of taking M/N/EB as elaboration params swept from the Makefile. Consecutive cases also cover back-to-back leak. Only BankSkew stays structural: the make target runs one vsim per BankSkew config. Drops the external TP_SWEEP loop. PASS BankSkew off and on.

The snitch harness files used the singular '// Author:' header (and the Makefile a trailing '#' line); lint-authors requires a blank line after SPDX, plural '// Authors:', a '// - Name <email>' bullet, and a blank line after the author block. Normalize all six.

DanielKellerM · 2026-06-23T09:48:42Z

Superseded by #141.

Copilot AI review requested due to automatic review settings June 10, 2026 14:14

DanielKellerM requested review from micprog and thommythomaso as code owners June 10, 2026 14:14

Copilot started reviewing on behalf of DanielKellerM June 10, 2026 14:14

DanielKellerM marked this pull request as draft June 10, 2026 14:18

Copilot AI reviewed Jun 10, 2026


DanielKellerM mentioned this pull request Jun 10, 2026

build: Add per-top trimmed vsim compile scripts #116

Merged

DanielKellerM force-pushed the systems/snitch-integration branch from 30bf0a1 to 854427a June 11, 2026 14:52

DanielKellerM mentioned this pull request Jun 16, 2026

release: iDMA 0.7.0 #129

Draft

7 tasks

DanielKellerM added the enhancement New feature or request label Jun 16, 2026

DanielKellerM added 6 commits June 22, 2026 19:20

DanielKellerM force-pushed the systems/snitch-integration branch from 1cccfcb to 29ddc74 June 23, 2026 09:29

DanielKellerM closed this Jun 23, 2026

DanielKellerM deleted the systems/snitch-integration branch June 23, 2026 12:44

DanielKellerM wants to merge 7 commits into pulp-platform:devel from DanielKellerM:systems/snitch-integration

2 participants
