inst64: Peak-bandwidth transpose via the FF engine on the multi-write backend by DanielKellerM · Pull Request #143 · pulp-platform/iDMA

DanielKellerM · 2026-06-24T09:50:04Z

What

Enable the #112 on-the-fly FF transpose engine on the multi-write inst64
(rw_axi_rw_init_rw_obi) backend, so a transpose runs at full bus-word
bandwidth over the AXI / OBI write ports instead of being limited to the
element-granular address-generation path. The Snitch DM core can now
transpose TCDM↔TCDM (OBI→OBI) and SoC↔TCDM (AXI↔OBI) at peak bandwidth.

How

util/mario/backend.py: relax the compute gate from "single AXI write" to
"any AXI or OBI write port" (pure-INIT still rejected).
idma_obi_write + idma_obi.yml: give the OBI write port the two signals the
engine needs and AXI already had — w_beat_done_o (= write_happening) and
mask_ext_i (ANDed into the strobe).
idma_transport_layer.sv.tpl: add the multi-write w_beat_done retire mux
(keyed on dst_protocol; INIT → 1'b0).
idma_backend.sv.tpl: size the bypass AW FIFO by NumAxInFlight +
ComputeFifoDepth (was a hardcoded 2); otherwise the read-ahead engine
deadlocks an OBI write port.
idma_inst64_top: decode the transpose DMCPY into opt.compute and splice
idma_transpose_midend (the engine walk) ahead of the ND midend; reject
malformed requests (feature off, reserved mode, zero dim, twod, element-
misaligned src/dst); capability assert that the backend has the engine.

The midend RTL is unchanged (the engine walk is devel's existing behaviour).
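
As a rough behavioural sketch of the two OBI write-port signals and the retire mux described above (not the RTL itself): the `Protocol` enum and the function names are hypothetical, and only the three facts stated in this PR are modelled, namely that mask_ext_i is ANDed into the strobe, that w_beat_done_o equals write_happening, and that INIT always retires as 1'b0.

```python
from enum import Enum


class Protocol(Enum):
    """Hypothetical labels for the backend's destination protocols."""
    AXI = "axi"
    OBI = "obi"
    INIT = "init"


def obi_write_beat(strb, mask_ext, write_happening):
    """Return (effective_strobe, w_beat_done) for one OBI write beat."""
    effective_strb = strb & mask_ext  # mask_ext_i is ANDed into the strobe
    w_beat_done = write_happening     # strobe-independent: set even if fully masked
    return effective_strb, w_beat_done


def retire_mux(dst_protocol, axi_beat_done, obi_beat_done):
    """Select the engine's retire pulse by destination protocol."""
    if dst_protocol is Protocol.AXI:
        return axi_beat_done
    if dst_protocol is Protocol.OBI:
        return obi_beat_done
    return False  # INIT has no write beats to retire (1'b0)
```

A fully masked beat still retires in this model: `obi_write_beat(0b1111, 0b0000, True)` yields a zero strobe with the done flag set.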

Output layout

Aᵀ lands with an NE-aligned leading dimension MP = ceil(M/NE)·NE, where
NE = StrbWidth/E is the number of E-byte elements per bus word. The output
is dense when M is NE-aligned and padded (valid columns + sentinel padding)
otherwise.
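
The layout rule can be illustrated with a small software model. This is a sketch under assumptions: row-major storage, A given as an M×N list of lists, and `None` standing in for the sentinel, whose actual value is not stated here. The function names are hypothetical.

```python
def padded_pitch(m, strb_width, elem_bytes):
    """MP = ceil(M/NE) * NE with NE = StrbWidth / E (elements per bus word)."""
    ne = strb_width // elem_bytes
    return -(-m // ne) * ne  # integer ceiling division, then scale back up


def transpose_padded(a, ne, sentinel=None):
    """Transpose an M x N matrix into N rows of MP entries each."""
    m, n = len(a), len(a[0])
    mp = -(-m // ne) * ne
    out = [[sentinel] * mp for _ in range(n)]
    for i in range(m):
        for j in range(n):
            out[j][i] = a[i][j]  # columns i >= M keep the sentinel
    return out


a = [[1, 2], [3, 4], [5, 6]]        # M = 3, N = 2
print(padded_pitch(3, 8, 2))        # NE = 4, so MP = 4
print(transpose_padded(a, ne=4))    # [[1, 3, 5, None], [2, 4, 6, None]]
```

With M = 3 and NE = 4, each row of Aᵀ carries three valid columns and one padding slot; with M = 8 the pitch equals M and the output is dense.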

Validation

make idma_hw_all IDMA_VIDMA_IDS=rw_axi_rw_init_rw_obi: clean.
New public tb_idma_transpose_obi_engine (rw_obi+engine, padded checker +
in-suite negative control): PASS at DataWidth 32 and 64, peak bandwidth in
the aligned case, no deadlock.
tb_idma_transpose_nd (rw_axi engine) regression: PASS.
inst64 frontend + engine end-to-end: elaborates clean, OBI→OBI and AXI→OBI
transposes are correct at peak, and reject cases reject with no bursts
launched.

Caveat

OBI has no bursts, so the write is one beat per grant. For a single linear
master writing to word-interleaved TCDM, consecutive addresses hit different
banks, so the port still sustains ~1 beat/cycle (full beat width). It is,
however, more exposed to TCDM grant contention than AXI (which amortises via
bursts).
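
To make the bank argument concrete, here is a minimal sketch of a word-interleaved address-to-bank mapping. The mapping, the word size, and the bank count are assumptions chosen for illustration; the actual TCDM interconnect may differ.

```python
def bank_of(addr, word_bytes, num_banks):
    """Word-interleaved mapping: consecutive words go to consecutive banks."""
    return (addr // word_bytes) % num_banks


# A linear stream of 8-byte words over 4 banks (hypothetical sizes).
banks = [bank_of(addr, word_bytes=8, num_banks=4) for addr in range(0, 64, 8)]
print(banks)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

No two back-to-back beats target the same bank in this model, so a lone linear writer does not conflict with itself; contention arises only from other masters hitting the same bank in the same cycle.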

Follow-ups

#141 (inst64: On-the-fly transpose via address generation (no engine)) will
rebase onto this PR, adding the AddrGenTranspose low-area / dense fallback
on top.
The nonfree CI will run the engine regressions in lockstep.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

DanielKellerM · 2026-06-24T16:35:43Z

        stream_fifo_optimal_wrap #(
-            .Depth        ( 2                    ),
+            .Depth        ( NumAxInFlight + ComputeFifoDepth ),


Explain in detail: why is this required for the transpose?

DanielKellerM · 2026-06-24T16:36:27Z

+    // one write beat accepted on the bus; drives the compute engine's retire
+    assign w_beat_done_o = write_happening;
+


There is already a retire id job. Why not use that?

DanielKellerM · 2026-06-24T16:36:39Z

    parameter int unsigned NumChannels     = 32'd1,
    parameter bit          TCDMAliasEnable = 1'b0,
    parameter int unsigned DMATracing      = 32'd0,
+    /// Compile-time on-the-fly compute feature enables (e.g. transpose)


remove verbose comment

DanielKellerM · 2026-06-24T16:36:51Z

    localparam type tf_len_t             = logic[TFLenWidth-1:0];
    localparam type offset_t             = logic[OffsetWidth-1:0];
-    localparam type strides_t            = logic[RepWidth-1:0];
+    // strides must match addr_t: signed transpose deltas would not sign-extend if narrower


remove verbose comment

DanielKellerM · 2026-06-24T16:37:11Z

+        // expand a transpose request into the backend FF engine's NumDim=4 tiled
+        // walk (compute.enable kept). Aᵀ lands PADDED at an NE-aligned leading
+        // pitch MP*E, MP = ceil(M/NE)*NE (dense when M is NE-aligned).


Verbose comments; keep this to a single line.

DanielKellerM · 2026-06-24T16:37:32Z

+                            // transpose request (register form only): argb spare bits
+                            // carry {enable, mode, tensor_m, tensor_n}. The FF engine
+                            // emits Aᵀ PADDED at NE-aligned leading pitch MP = ceil(M/NE)*NE.


more verbose comments

DanielKellerM · 2026-06-24T16:37:41Z

+                            // reject malformed transpose: feature off, reserved mode,
+                            // zero dim, twod (cfg[1]), or src/dst not E-aligned (E=1<<mode)


more verbose comments

DanielKellerM · 2026-06-24T16:38:19Z

This file should live only in the nonfree repo.

Lift the compute gate to any AXI/OBI write port (pure-INIT still rejected); give the OBI write port the engine's per-beat retire w_beat_done_o = write_happening (strobe-independent, pulses even for masked beats) and external mask mask_ext_i; add the multi-write w_beat_done retire mux. Gate the bypass i_aw_fifo depth on enable_compute: the engine reads a full NE-beat tile before any write retires, so on combined-aw+w (OBI) the read-meta bypass FIFO must hold a tile's worth (NumAxInFlight + ComputeFifoDepth, matching its lockstep twin i_w_dp_req) or the legalizer deadlocks. Non-compute backends keep Depth 2 (byte-identical to before).
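
The depth rule in the commit message above can be sketched as follows. This is a toy occupancy model rather than the RTL: it assumes one FIFO entry per read beat and no pops until the first write retires, and the parameter values in the usage lines are hypothetical.

```python
def bypass_aw_fifo_depth(enable_compute, num_ax_in_flight, compute_fifo_depth):
    """Depth of the bypass AW FIFO; non-compute backends keep the old value of 2."""
    if enable_compute:
        return num_ax_in_flight + compute_fifo_depth
    return 2


def tile_fits(depth, tile_beats):
    """True if a whole tile can be pushed before anything is popped."""
    occupancy = 0
    for _ in range(tile_beats):
        if occupancy == depth:
            return False  # push blocks, yet no write can retire to drain it
        occupancy += 1
    return True


depth = bypass_aw_fifo_depth(True, num_ax_in_flight=2, compute_fifo_depth=4)
print(depth, tile_fits(depth, tile_beats=4))  # 6 True
print(tile_fits(2, tile_beats=4))             # False: the old depth stalls
```

In this model the old depth of 2 blocks on the third beat of a four-beat tile, which corresponds to the deadlock the change avoids.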

Decode the transpose DMCPY (argb spare bits) into opt.compute and splice idma_transpose_midend (engine walk) ahead of the ND midend, gated by ComputeEnable.transpose with NumDim=4. Widen strides_t to addr_t so signed transpose deltas sign-extend. Reject malformed requests (feature off, reserved mode, zero dim, twod, element-misaligned src/dst) and assert the backend carries the engine.
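
The reject conditions listed above can be written as a single predicate. This is a sketch, not the frontend's decode logic: the argument names are hypothetical, and the set of non-reserved modes is passed in by the caller because it is not spelled out here. The element size follows the E = 1 << mode relation from the code comment.

```python
def reject_transpose(feature_on, mode, valid_modes, tensor_m, tensor_n,
                     twod, src, dst):
    """Return True if a transpose request must be rejected."""
    elem_bytes = 1 << mode  # E = 1 << mode
    return (
        not feature_on                 # feature compiled out
        or mode not in valid_modes     # reserved mode
        or tensor_m == 0               # zero dimension
        or tensor_n == 0
        or twod                        # twod (cfg[1]) set
        or src % elem_bytes != 0       # src not E-aligned
        or dst % elem_bytes != 0       # dst not E-aligned
    )


modes = {0, 1, 2, 3}  # assumed set of legal modes, for illustration only
print(reject_transpose(True, 2, modes, 8, 8, False, 0x1000, 0x2000))  # False
print(reject_transpose(True, 2, modes, 8, 8, False, 0x1002, 0x2000))  # True
```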

DanielKellerM requested review from micprog and thommythomaso as code owners June 24, 2026 09:50

Copilot AI review requested due to automatic review settings June 24, 2026 09:50

Copilot AI reviewed Jun 24, 2026

DanielKellerM commented Jun 24, 2026

DanielKellerM force-pushed the inst64/transpose-engine branch from 9e122a3 to b358048 on June 24, 2026 17:17

DanielKellerM force-pushed the inst64/transpose-engine branch from b358048 to f5684c3 on June 25, 2026 12:17

DanielKellerM wants to merge 2 commits into devel from inst64/transpose-engine