Skip to content

inst64: Drive on-the-fly transpose; add the snitch integration harness#113

Closed
DanielKellerM wants to merge 7 commits into
pulp-platform:develfrom
DanielKellerM:systems/snitch-integration
Closed

inst64: Drive on-the-fly transpose; add the snitch integration harness#113
DanielKellerM wants to merge 7 commits into
pulp-platform:develfrom
DanielKellerM:systems/snitch-integration

Conversation

@DanielKellerM

@DanielKellerM DanielKellerM commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

Drives on-the-fly transpose through the inst64 (Snitch) frontend and adds a standalone Snitch integration harness.

inst64's backend (rw_axi_rw_init_rw_obi) is multi-write and therefore cannot host the #112 single-AXI-write transpose engine. Instead, the transpose is executed by address generation: idma_transpose_midend (new AddrGenTranspose mode) emits an element-granular NumDim=3 swapped-stride ND program (out_T[c][r] = in[r][c]) and clears opt.compute, so the backend runs a plain strided copy — no engine, no tile buffer. OBI/TCDM is random-access, so this is the natural (and only needed) mechanism for inst64; the #112 FF engine remains the full-throughput path for AXI↔AXI transpose on rw_axi.

What's here

  • Frontend (idma_inst64_top): decode the transpose DMCPY into opt.compute, splice idma_transpose_midend ahead of idma_nd_midend, and reject malformed requests (no hardware / reserved mode / zero dim / unaligned dst).
  • Address-gen (idma_transpose_midend, AddrGenTranspose): swapped-stride transpose with an optional BankSkew that pads the dst row pitch by one bus-word when the column stride would hammer a single TCDM bank — makes the per-column word stride odd, so writes round-robin across all banks on any power-of-2-bank L1, at ≤1 word/row cost.
  • Harness (systems/snitch): standalone BFM driving the accelerator port plus AXI and native-OBI sim memories.

Testing

The self-checking harness TB sweeps a geometry list in one elaboration (int8/fp16/fp32, square/rectangular/odd) over the real OBI/TCDM port:

  • OBI→OBI — transpose a tile within L1/TCDM
  • AXI→OBI — load an external matrix into TCDM transposed
  • back-to-back (consecutive cases), cross-transfer compute-leak (a plain copy after transposes must not inherit opt.compute), and malformed-request reject
  • both BankSkew off (contiguous N×M) and on (padded N×M′, padding stays sentinel)

A standalone copy harness (tb_idma_inst64_copy) also passes. All changes live in src/ — no codegen/templatization.

Notes

Copilot AI review requested due to automatic review settings June 10, 2026 14:14
@DanielKellerM DanielKellerM marked this pull request as draft June 10, 2026 14:18

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires the on-the-fly transpose feature end-to-end for the Snitch inst64 frontend and adds a standalone Snitch integration harness, while also extending the backend generation flow to optionally include compute support in selected variants.

Changes:

  • Add a typed per-transfer opt.compute capability (transpose op + params) and route it through legalizer/backend/transport to a write-seam compute dispatcher and transpose engine.
  • Extend the inst64 frontend to decode transpose requests from spare DMCPY argb bits, expand transpose geometry via a new idma_transpose_midend, and reject malformed transpose requests.
  • Add new SV/DPI-C testbenches (engine-level + ND/back-to-back) and a Snitch inst64 integration harness + Makefile flow (snitch_transpose_sweep), plus docs and Bender target support (split_rtl).

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
util/mario/util.py Add parsing for --compute-ids configuration strings (ops + fd/hd).
util/mario/transport_layer.py Pass compute enable/op info into transport-layer templating context.
util/mario/legalizer.py Pass compute enable flag into legalizer templating context.
util/mario/backend.py Enforce “single AXI write port” constraint for compute-enabled variants and pass ops into backend templating context.
util/gen_idma.py Add --compute-ids CLI and propagate compute config into generators.
test/tb_idma_transpose_nd.sv Multi-tile end-to-end transpose test via ND midend → compute backend → AXI sim mem.
test/tb_idma_transpose_b2b.sv End-to-end back-to-back transpose regression to distinct destinations.
test/tb_idma_otf_transpose.sv Standalone transpose engine SV testbench using DPI-C golden model.
test/midend/tb_idma_transpose_midend.sv Unit test for transpose geometry expansion midend.
test/midend/tb_idma_nd_midend_b2b.sv Back-to-back ND midend base-address reload regression under backpressure.
test/idma_transpose_dpi.c DPI-C golden model for element-granular transpose verification.
test/idma_test.sv Extend request-driving task to optionally program transpose compute options.
systems/snitch/test/tb_idma_inst64_transpose.sv Snitch inst64 end-to-end transpose integration test (incl. rejects + no-leak).
systems/snitch/test/tb_idma_inst64_copy.sv Snitch inst64 plain-copy regression.
systems/snitch/test/idma_inst64_tb_pkg.sv Package/types/constants for the standalone Snitch harness.
systems/snitch/test/idma_inst64_drv_if.sv Accelerator-bus BFM tasks, including DMCPY-encoded transpose launch helpers.
systems/snitch/test/idma_inst64_base.sv Base harness instantiating idma_inst64_top + AXI sim memories.
systems/snitch/README.md Document Snitch harness purpose, build flow, and transpose contract.
systems/snitch/Makefile Standalone build + sim/sweep targets for the Snitch harness.
systems/snitch/.gitignore Ignore build products for the Snitch harness flow.
src/midend/idma_transpose_midend.sv New combinational transpose geometry expander (NumDim=4) for ND midend.
src/midend/idma_nd_midend.sv Add non-synth assertion enforcing stride width == address width.
src/include/idma/typedef.svh Extend options_t with typed compute options field.
src/idma_pkg.sv Define compute op enum, transpose params, compute options, and feature enable struct.
src/frontend/inst64/idma_inst64_top.sv Add ComputeEnable param, decode/validate transpose from DMCPY, splice transpose midend, widen strides to addr width, add backend capability cross-check.
src/db/idma_tilelink.yml Forward compute options into write datapath request struct where needed.
src/db/idma_axi.yml Forward compute options; extend AXI write template to accept strobe mask + beat-done pulse.
src/backend/tpl/idma_transport_layer.sv.tpl Add write-seam compute integration (dispatcher + mask/beat-done plumbing).
src/backend/tpl/idma_legalizer.sv.tpl Force decouple on compute transfers; propagate compute options into mutable transfer opts and write datapath req.
src/backend/tpl/idma_backend.sv.tpl Add compute-enabled variant metadata (ComputeEnable), enforce NO_ERROR_HANDLING, increase meta FIFO depth for compute latency, propagate compute options into write datapath req.
src/backend/idma_otf_transpose.sv New transpose engine (tile ping-pong) producing per-byte strobe mask.
src/backend/idma_otf_compute.sv New write-seam compute dispatcher (currently transpose only).
src/backend/idma_axi_write.sv Add external strobe mask input and a strobe-independent “beat accepted” pulse output.
jobs/backend_rw_axi/transpose_none.txt Add job artifact/marker for transpose-none configuration (empty in this diff).
idma.mk Add compute-enabled variant list (IDMA_VIDMA_IDS), propagate to generator, add simulation targets for transpose regressions, include split_rtl in vsim script target set.
doc/transpose-engine-routing-plan.md Detailed routing/signaling plan and rationale for transpose integration.
Bender.yml Add compute RTL, new midend, Snitch harness sources, transpose tests, and introduce split_rtl generated-file selection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +173 to +185
/// Extra write-descriptor slots covering the compute (transpose) tile-fill latency
localparam int unsigned ComputeFifoDepth = ${"StrbWidth" if enable_compute else "32'd0"};
% if enable_compute:

/// Per-op compute set baked into this variant (frontends may cross-check)
localparam idma_pkg::compute_enable_t ComputeEnable =
'{${', '.join("%s: 1'b1" % op for op in compute_ops)}};
`ifndef SYNTHESIS
// no engine flush on abort: compute is incompatible with error handling
initial assert (ErrorCap == idma_pkg::NO_ERROR_HANDLING) else
$fatal(1, "compute requires ErrorCap == NO_ERROR_HANDLING");
`endif
% endif
Comment on lines +177 to +181
// full/empty token
always_ff @(posedge clk_i or negedge rst_ni) begin
if (!rst_ni || clear_i || exec_done) begin
full_q <= 2'b00;
end else begin
@DanielKellerM DanielKellerM force-pushed the systems/snitch-integration branch from 30bf0a1 to 854427a Compare June 11, 2026 14:52
@DanielKellerM DanielKellerM mentioned this pull request Jun 16, 2026
7 tasks
@DanielKellerM DanielKellerM added the enhancement New feature or request label Jun 16, 2026
Decode the transpose from spare DMCPY argb bits into opt.compute, expand
NumDim to 4 with addr-width strides and splice the transpose midend between
the request FIFO and the nd_midend, gated by a ComputeEnable parameter.
Malformed requests (no hardware, reserved mode, zero dims, unaligned dst) get
an error response and the backend's baked compute set is cross-checked at
elaboration.
Standalone BFM harness driving the accelerator port: copy and transpose
testbenches and a sweep covering all element sizes, tiling, edge, back-to-back,
leak and reject cases, registered behind the snitch_cluster target; the flow
regenerates the RTL before compiling.
inst64 is a multi-write backend (rw_axi_rw_init_rw_obi) that cannot host the
pulp-platform#112 FF transpose engine. Add an AddrGenTranspose mode to idma_transpose_midend:
instead of the NumDim=4 tiled engine walk, emit an element-granular NumDim=3
swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) and clear
compute.enable so the backend runs a plain strided copy. Correct on any protocol
(ideal on random-access OBI/TCDM). idma_inst64_top gains the AddrGenTranspose
param, wires it to the expander, and gates the engine-only gen_compute_check.
The inst64 transpose harness drives it end-to-end (int8/fp16/fp32, square/rect/
swapped, back-to-back, reject) -- it could not even elaborate before.
The harness gains an obi_sim_mem backdoor; the transpose TB now drives the
real OBI/TCDM port instead of AXI-range addresses: OBI->OBI (transpose a tile
within L1/TCDM, the Snitch DMA case), AXI->OBI (load an external matrix into
TCDM transposed), back-to-back, no-leak OBI copy, and reject. PASS for int8/
fp16/fp32. Closes the end-to-end gap -- previously the inst64 TB only hit the
AXI path; the OBI read+write ports are now covered through the frontend.
A transposed write walks the dst with stride M*E; when M*E is an even number
of bus words this hammers a single TCDM bank (1/B bandwidth on a B-bank L1).
New BankSkew param (default off) pads the dst row pitch by one bus-word (NE
elements) in that case, making the per-column word stride odd -> round-robins
all banks on any power-of-2-bank TCDM, at <=1 word/row cost. Plumbed through
idma_inst64_top; the harness/TB drive it and check the N x M' padded layout
(padding columns stay sentinel). PASS skew-on (32x8 EB4 -> pitch 48) and
skew-off (contiguous). Default off keeps the contiguous N x M output.
Per the iDMA TB convention, a self-checking TB drives its own stimulus in
one elaboration. M/N/EB are runtime (the DMCPY carries them), so the transpose
TB now loops a localparam geometry list (int8/fp16/fp32, square/rect/odd, incl.
the bank-skew-triggering shapes) instead of taking M/N/EB as elaboration params
swept from the Makefile. Consecutive cases also cover back-to-back leak. Only
BankSkew stays structural: the make target runs one vsim per BankSkew config.
Drops the external TP_SWEEP loop. PASS BankSkew off and on.
@DanielKellerM DanielKellerM force-pushed the systems/snitch-integration branch from 1cccfcb to 29ddc74 Compare June 23, 2026 09:29
The snitch harness files used the singular '// Author:' header (and the
Makefile a trailing '#' line); lint-authors requires a blank line after SPDX,
plural '// Authors:', a '// - Name <email>' bullet, and a blank line after the
author block. Normalize all six.
@DanielKellerM

DanielKellerM commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Superseded by #141.

@DanielKellerM DanielKellerM deleted the systems/snitch-integration branch June 23, 2026 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants