inst64: Drive on-the-fly transpose; add the snitch integration harness#113
Closed
DanielKellerM wants to merge 7 commits into
Closed
inst64: Drive on-the-fly transpose; add the snitch integration harness#113DanielKellerM wants to merge 7 commits into
DanielKellerM wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR wires the on-the-fly transpose feature end-to-end for the Snitch inst64 frontend and adds a standalone Snitch integration harness, while also extending the backend generation flow to optionally include compute support in selected variants.
Changes:
- Add a typed per-transfer
opt.computecapability (transpose op + params) and route it through legalizer/backend/transport to a write-seam compute dispatcher and transpose engine. - Extend the
inst64frontend to decode transpose requests from spareDMCPYargb bits, expand transpose geometry via a newidma_transpose_midend, and reject malformed transpose requests. - Add new SV/DPI-C testbenches (engine-level + ND/back-to-back) and a Snitch
inst64integration harness + Makefile flow (snitch_transpose_sweep), plus docs and Bender target support (split_rtl).
Reviewed changes
Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| util/mario/util.py | Add parsing for --compute-ids configuration strings (ops + fd/hd). |
| util/mario/transport_layer.py | Pass compute enable/op info into transport-layer templating context. |
| util/mario/legalizer.py | Pass compute enable flag into legalizer templating context. |
| util/mario/backend.py | Enforce “single AXI write port” constraint for compute-enabled variants and pass ops into backend templating context. |
| util/gen_idma.py | Add --compute-ids CLI and propagate compute config into generators. |
| test/tb_idma_transpose_nd.sv | Multi-tile end-to-end transpose test via ND midend → compute backend → AXI sim mem. |
| test/tb_idma_transpose_b2b.sv | End-to-end back-to-back transpose regression to distinct destinations. |
| test/tb_idma_otf_transpose.sv | Standalone transpose engine SV testbench using DPI-C golden model. |
| test/midend/tb_idma_transpose_midend.sv | Unit test for transpose geometry expansion midend. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend base-address reload regression under backpressure. |
| test/idma_transpose_dpi.c | DPI-C golden model for element-granular transpose verification. |
| test/idma_test.sv | Extend request-driving task to optionally program transpose compute options. |
| systems/snitch/test/tb_idma_inst64_transpose.sv | Snitch inst64 end-to-end transpose integration test (incl. rejects + no-leak). |
| systems/snitch/test/tb_idma_inst64_copy.sv | Snitch inst64 plain-copy regression. |
| systems/snitch/test/idma_inst64_tb_pkg.sv | Package/types/constants for the standalone Snitch harness. |
| systems/snitch/test/idma_inst64_drv_if.sv | Accelerator-bus BFM tasks, including DMCPY-encoded transpose launch helpers. |
| systems/snitch/test/idma_inst64_base.sv | Base harness instantiating idma_inst64_top + AXI sim memories. |
| systems/snitch/README.md | Document Snitch harness purpose, build flow, and transpose contract. |
| systems/snitch/Makefile | Standalone build + sim/sweep targets for the Snitch harness. |
| systems/snitch/.gitignore | Ignore build products for the Snitch harness flow. |
| src/midend/idma_transpose_midend.sv | New combinational transpose geometry expander (NumDim=4) for ND midend. |
| src/midend/idma_nd_midend.sv | Add non-synth assertion enforcing stride width == address width. |
| src/include/idma/typedef.svh | Extend options_t with typed compute options field. |
| src/idma_pkg.sv | Define compute op enum, transpose params, compute options, and feature enable struct. |
| src/frontend/inst64/idma_inst64_top.sv | Add ComputeEnable param, decode/validate transpose from DMCPY, splice transpose midend, widen strides to addr width, add backend capability cross-check. |
| src/db/idma_tilelink.yml | Forward compute options into write datapath request struct where needed. |
| src/db/idma_axi.yml | Forward compute options; extend AXI write template to accept strobe mask + beat-done pulse. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Add write-seam compute integration (dispatcher + mask/beat-done plumbing). |
| src/backend/tpl/idma_legalizer.sv.tpl | Force decouple on compute transfers; propagate compute options into mutable transfer opts and write datapath req. |
| src/backend/tpl/idma_backend.sv.tpl | Add compute-enabled variant metadata (ComputeEnable), enforce NO_ERROR_HANDLING, increase meta FIFO depth for compute latency, propagate compute options into write datapath req. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) producing per-byte strobe mask. |
| src/backend/idma_otf_compute.sv | New write-seam compute dispatcher (currently transpose only). |
| src/backend/idma_axi_write.sv | Add external strobe mask input and a strobe-independent “beat accepted” pulse output. |
| jobs/backend_rw_axi/transpose_none.txt | Add job artifact/marker for transpose-none configuration (empty in this diff). |
| idma.mk | Add compute-enabled variant list (IDMA_VIDMA_IDS), propagate to generator, add simulation targets for transpose regressions, include split_rtl in vsim script target set. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling plan and rationale for transpose integration. |
| Bender.yml | Add compute RTL, new midend, Snitch harness sources, transpose tests, and introduce split_rtl generated-file selection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+173
to
+185
| /// Extra write-descriptor slots covering the compute (transpose) tile-fill latency | ||
| localparam int unsigned ComputeFifoDepth = ${"StrbWidth" if enable_compute else "32'd0"}; | ||
| % if enable_compute: | ||
|
|
||
| /// Per-op compute set baked into this variant (frontends may cross-check) | ||
| localparam idma_pkg::compute_enable_t ComputeEnable = | ||
| '{${', '.join("%s: 1'b1" % op for op in compute_ops)}}; | ||
| `ifndef SYNTHESIS | ||
| // no engine flush on abort: compute is incompatible with error handling | ||
| initial assert (ErrorCap == idma_pkg::NO_ERROR_HANDLING) else | ||
| $fatal(1, "compute requires ErrorCap == NO_ERROR_HANDLING"); | ||
| `endif | ||
| % endif |
Comment on lines
+177
to
+181
| // full/empty token | ||
| always_ff @(posedge clk_i or negedge rst_ni) begin | ||
| if (!rst_ni || clear_i || exec_done) begin | ||
| full_q <= 2'b00; | ||
| end else begin |
30bf0a1 to
854427a
Compare
Decode the transpose from spare DMCPY argb bits into opt.compute, expand NumDim to 4 with addr-width strides and splice the transpose midend between the request FIFO and the nd_midend, gated by a ComputeEnable parameter. Malformed requests (no hardware, reserved mode, zero dims, unaligned dst) get an error response and the backend's baked compute set is cross-checked at elaboration.
Standalone BFM harness driving the accelerator port: copy and transpose testbenches and a sweep covering all element sizes, tiling, edge, back-to-back, leak and reject cases, registered behind the snitch_cluster target; the flow regenerates the RTL before compiling.
inst64 is a multi-write backend (rw_axi_rw_init_rw_obi) that cannot host the pulp-platform#112 FF transpose engine. Add an AddrGenTranspose mode to idma_transpose_midend: instead of the NumDim=4 tiled engine walk, emit an element-granular NumDim=3 swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) and clear compute.enable so the backend runs a plain strided copy. Correct on any protocol (ideal on random-access OBI/TCDM). idma_inst64_top gains the AddrGenTranspose param, wires it to the expander, and gates the engine-only gen_compute_check. The inst64 transpose harness drives it end-to-end (int8/fp16/fp32, square/rect/ swapped, back-to-back, reject) -- it could not even elaborate before.
The harness gains an obi_sim_mem backdoor; the transpose TB now drives the real OBI/TCDM port instead of AXI-range addresses: OBI->OBI (transpose a tile within L1/TCDM, the Snitch DMA case), AXI->OBI (load an external matrix into TCDM transposed), back-to-back, no-leak OBI copy, and reject. PASS for int8/ fp16/fp32. Closes the end-to-end gap -- previously the inst64 TB only hit the AXI path; the OBI read+write ports are now covered through the frontend.
A transposed write walks the dst with stride M*E; when M*E is an even number of bus words this hammers a single TCDM bank (1/B bandwidth on a B-bank L1). New BankSkew param (default off) pads the dst row pitch by one bus-word (NE elements) in that case, making the per-column word stride odd -> round-robins all banks on any power-of-2-bank TCDM, at <=1 word/row cost. Plumbed through idma_inst64_top; the harness/TB drive it and check the N x M' padded layout (padding columns stay sentinel). PASS skew-on (32x8 EB4 -> pitch 48) and skew-off (contiguous). Default off keeps the contiguous N x M output.
Per the iDMA TB convention, a self-checking TB drives its own stimulus in one elaboration. M/N/EB are runtime (the DMCPY carries them), so the transpose TB now loops a localparam geometry list (int8/fp16/fp32, square/rect/odd, incl. the bank-skew-triggering shapes) instead of taking M/N/EB as elaboration params swept from the Makefile. Consecutive cases also cover back-to-back leak. Only BankSkew stays structural: the make target runs one vsim per BankSkew config. Drops the external TP_SWEEP loop. PASS BankSkew off and on.
1cccfcb to
29ddc74
Compare
The snitch harness files used the singular '// Author:' header (and the Makefile a trailing '#' line); lint-authors requires a blank line after SPDX, plural '// Authors:', a '// - Name <email>' bullet, and a blank line after the author block. Normalize all six.
Collaborator
Author
|
Superseded by #141. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Drives on-the-fly transpose through the inst64 (Snitch) frontend and adds a standalone Snitch integration harness.
inst64's backend (
rw_axi_rw_init_rw_obi) is multi-write and therefore cannot host the #112 single-AXI-write transpose engine. Instead, the transpose is executed by address generation:idma_transpose_midend(newAddrGenTransposemode) emits an element-granularNumDim=3swapped-stride ND program (out_T[c][r] = in[r][c]) and clearsopt.compute, so the backend runs a plain strided copy — no engine, no tile buffer. OBI/TCDM is random-access, so this is the natural (and only needed) mechanism for inst64; the #112 FF engine remains the full-throughput path for AXI↔AXI transpose onrw_axi.What's here
idma_inst64_top): decode the transpose DMCPY intoopt.compute, spliceidma_transpose_midendahead ofidma_nd_midend, and reject malformed requests (no hardware / reserved mode / zero dim / unaligned dst).idma_transpose_midend,AddrGenTranspose): swapped-stride transpose with an optionalBankSkewthat pads the dst row pitch by one bus-word when the column stride would hammer a single TCDM bank — makes the per-column word stride odd, so writes round-robin across all banks on any power-of-2-bank L1, at ≤1 word/row cost.systems/snitch): standalone BFM driving the accelerator port plus AXI and native-OBI sim memories.Testing
The self-checking harness TB sweeps a geometry list in one elaboration (int8/fp16/fp32, square/rectangular/odd) over the real OBI/TCDM port:
opt.compute), and malformed-request rejectBankSkewoff (contiguous N×M) and on (padded N×M′, padding stays sentinel)A standalone copy harness (
tb_idma_inst64_copy) also passes. All changes live insrc/— no codegen/templatization.Notes
rw_axi.devel; the earlier transpose-engine commits are dropped (they landed via Add on-the-fly compute support with a transpose engine #112).