inst64: Peak-bandwidth transpose via the FF engine on the multi-write backend #143
Open
DanielKellerM wants to merge 2 commits into
DanielKellerM
commented
Jun 24, 2026
Comment on lines +878 to +879
      stream_fifo_optimal_wrap #(
    -     .Depth ( 2 ),
    +     .Depth ( NumAxInFlight + ComputeFifoDepth ),
Collaborator
Author
Explain this in detail: why is it required for the transpose?
Comment on lines +113 to +115
    // one write beat accepted on the bus; drives the compute engine's retire
    assign w_beat_done_o = write_happening;
There is already a retire ID job; why not use that?
    parameter int unsigned NumChannels = 32'd1,
    parameter bit TCDMAliasEnable = 1'b0,
    parameter int unsigned DMATracing = 32'd0,
    /// Compile-time on-the-fly compute feature enables (e.g. transpose)
remove verbose comment
    localparam type tf_len_t = logic[TFLenWidth-1:0];
    localparam type offset_t = logic[OffsetWidth-1:0];
    localparam type strides_t = logic[RepWidth-1:0];
    // strides must match addr_t: signed transpose deltas would not sign-extend if narrower
remove verbose comment
Comment on lines +378 to +380
    // expand a transpose request into the backend FF engine's NumDim=4 tiled
    // walk (compute.enable kept). Aᵀ lands PADDED at an NE-aligned leading
    // pitch MP*E, MP = ceil(M/NE)*NE (dense when M is NE-aligned).
verbose comments, keep single line
Comment on lines +609 to +611
    // transpose request (register form only): argb spare bits
    // carry {enable, mode, tensor_m, tensor_n}. The FF engine
    // emits Aᵀ PADDED at NE-aligned leading pitch MP = ceil(M/NE)*NE.
more verbose comments
Comment on lines +623 to +624
    // reject malformed transpose: feature off, reserved mode,
    // zero dim, twod (cfg[1]), or src/dst not E-aligned (E=1<<mode)
more verbose comments
This file should be only in the nonfree repo.
Lift the compute gate to any AXI/OBI write port (pure-INIT still rejected); give the OBI write port the engine's per-beat retire w_beat_done_o = write_happening (strobe-independent, pulses even for masked beats) and external mask mask_ext_i; add the multi-write w_beat_done retire mux. Gate the bypass i_aw_fifo depth on enable_compute: the engine reads a full NE-beat tile before any write retires, so on combined-aw+w (OBI) the read-meta bypass FIFO must hold a tile's worth (NumAxInFlight + ComputeFifoDepth, matching its lockstep twin i_w_dp_req) or the legalizer deadlocks. Non-compute backends keep Depth 2 (byte-identical to before).
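The deadlock argument in this commit message can be illustrated with a toy Python model. It is a sketch under stated assumptions, not the RTL: it assumes each read beat pushes one entry into the bypass FIFO, entries pop only when a write beat retires, and writes start only after all NE beats of a tile are read. The function names and the example parameter values are hypothetical.

```python
def bypass_fifo_depth(enable_compute: bool, num_ax_in_flight: int,
                      compute_fifo_depth: int) -> int:
    """Depth rule from the commit message: a tile's worth with compute, else 2."""
    return num_ax_in_flight + compute_fifo_depth if enable_compute else 2


def tile_completes(depth: int, ne: int, max_cycles: int = 1000) -> bool:
    """Toy model: True if one NE-beat tile drains, False if nothing can move."""
    occupancy = 0    # entries currently held in the bypass FIFO
    reads_done = 0   # read beats collected by the engine
    writes_done = 0  # write beats retired
    for _ in range(max_cycles):
        if reads_done == ne and writes_done < ne and occupancy > 0:
            # Writes may start only once the whole tile has been read.
            occupancy -= 1
            writes_done += 1
        elif reads_done < ne and occupancy < depth:
            # A read beat needs a free FIFO slot for its metadata.
            occupancy += 1
            reads_done += 1
        else:
            return False  # no read and no write can progress: deadlock
        if writes_done == ne:
            return True
    return False


# Hypothetical numbers: NE = 8 beats per tile, 3 requests in flight.
print(tile_completes(depth=2, ne=8))                            # False
print(tile_completes(bypass_fifo_depth(True, 3, 8), ne=8))      # True
```

In this model a depth of 2 stalls after two read beats because no write can retire until the tile is complete, which is the situation the deeper FIFO avoids.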
9e122a3 to
b358048
Decode the transpose DMCPY (argb spare bits) into opt.compute and splice idma_transpose_midend (engine walk) ahead of the ND midend, gated by ComputeEnable.transpose with NumDim=4. Widen strides_t to addr_t so signed transpose deltas sign-extend. Reject malformed requests (feature off, reserved mode, zero dim, twod, element-misaligned src/dst) and assert the backend carries the engine.
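The reject conditions listed in this commit message can be summarised as a small Python predicate. This is only a sketch: the argument names are hypothetical, and the set of reserved mode encodings is an assumption (the source does not say which encodings are reserved). The element size E = 1 << mode follows the code comment quoted in the review.

```python
def transpose_request_ok(feature_enabled: bool, mode: int,
                         tensor_m: int, tensor_n: int, twod: bool,
                         src: int, dst: int,
                         reserved_modes=frozenset({3})) -> bool:
    """Return True if a transpose request passes the reject checks.

    reserved_modes is an ASSUMPTION for illustration only.
    """
    if not feature_enabled:               # feature off
        return False
    if mode in reserved_modes:            # reserved mode
        return False
    if tensor_m == 0 or tensor_n == 0:    # zero dimension
        return False
    if twod:                              # 2D flag set together with transpose
        return False
    elem_bytes = 1 << mode                # E = 1 << mode
    if src % elem_bytes or dst % elem_bytes:
        return False                      # element-misaligned src/dst
    return True
```

A request that fails any check would be rejected before any burst is launched, which matches the behaviour reported under Validation.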
b358048 to
f5684c3
What
Enable the #112 on-the-fly FF transpose engine on the multi-write inst64
(rw_axi_rw_init_rw_obi) backend, so a transpose runs at full bus-word
bandwidth over the AXI / OBI write ports instead of only the element-granular
address-generation path. The Snitch DM core can now transpose TCDM↔TCDM
(OBI→OBI) and SoC↔TCDM (AXI↔OBI) at peak.
How
- util/mario/backend.py: relax the compute gate from "single AXI write" to "any AXI or OBI write port" (pure-INIT still rejected).
- idma_obi_write + idma_obi.yml: give the OBI write port the two signals the engine needs and AXI already had: w_beat_done_o (= write_happening) and mask_ext_i (ANDed into the strobe).
- idma_transport_layer.sv.tpl: add the multi-write w_beat_done retire mux (keyed on dst_protocol; INIT → 1'b0).
- idma_backend.sv.tpl: size the bypass AW FIFO by NumAxInFlight + ComputeFifoDepth (was a hardcoded 2); the read-ahead engine deadlocks an OBI write port otherwise.
- idma_inst64_top: decode the transpose DMCPY into opt.compute and splice idma_transpose_midend (the engine walk) ahead of the ND midend; reject malformed requests (feature off, reserved mode, zero dim, twod, element-misaligned src/dst); capability assert that the backend has the engine.
The midend RTL is unchanged (the engine walk is devel's existing behaviour).
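The OBI port signals and the retire mux described above can be modelled behaviourally in Python. This is a hedged sketch, not the generated RTL: the protocol names are represented as hypothetical strings, and the functions only mirror the stated behaviour (retire pulse equals write_happening regardless of strobe, external mask ANDed into the strobe, INIT destinations never retire a compute beat).

```python
def obi_write_beat(strobe: int, mask_ext: int, write_happening: bool):
    """Return (effective_strobe, w_beat_done) for one OBI write beat."""
    effective_strobe = strobe & mask_ext  # mask_ext_i is ANDed into the strobe
    w_beat_done = write_happening         # strobe-independent retire pulse
    return effective_strobe, w_beat_done


def w_beat_done_mux(dst_protocol: str, axi_done: bool, obi_done: bool) -> bool:
    """Select the retire pulse of the active destination protocol."""
    if dst_protocol == "AXI":
        return axi_done
    if dst_protocol == "OBI":
        return obi_done
    return False  # INIT (and anything else) maps to 1'b0


# A fully masked beat writes no bytes but still retires.
print(obi_write_beat(0b1111, 0b0000, True))  # (0, True)
```

Keeping the retire pulse independent of the strobe lets the engine count padded (fully masked) beats the same way as valid ones.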
Output layout
Aᵀ lands with an NE-aligned leading dimension MP = ceil(M/NE)·NE
(NE = StrbWidth/E): dense when M is NE-aligned, padded (valid columns +
sentinel padding) otherwise.
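The layout rule can be checked with a short Python reference model. It is a sketch under assumptions: the matrix is held as a row-major list of rows, E = 1 << mode as in the reviewed code comment, and the sentinel value (None here) is a placeholder, since the source does not state what the padding contains.

```python
def padded_pitch(m: int, strb_width: int, mode: int):
    """Return (NE, MP) with E = 1 << mode, NE = StrbWidth/E, MP = ceil(M/NE)*NE."""
    elem_bytes = 1 << mode
    ne = strb_width // elem_bytes
    mp = -(-m // ne) * ne  # integer ceiling division, then scale back up
    return ne, mp


def transpose_padded(a, ne: int, sentinel=None):
    """Transpose an M x N matrix into N rows of MP elements each."""
    m, n = len(a), len(a[0])
    mp = -(-m // ne) * ne
    out = [[sentinel] * mp for _ in range(n)]
    for i in range(m):
        for j in range(n):
            out[j][i] = a[i][j]  # element (i, j) lands at row j, column i
    return out


print(padded_pitch(5, 8, 1))   # (4, 8): M = 5 is padded up to 8
print(padded_pitch(8, 8, 1))   # (4, 8): M = 8 is already NE-aligned (dense)
print(transpose_padded([[1, 2], [3, 4], [5, 6]], ne=2))
# [[1, 3, 5, None], [2, 4, 6, None]]
```

A consumer of the padded output reads only the first M columns of each row and skips the remaining MP - M entries.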
Validation
- make idma_hw_all IDMA_VIDMA_IDS=rw_axi_rw_init_rw_obi: clean.
- tb_idma_transpose_obi_engine (rw_obi + engine, padded checker + in-suite negative control): PASS at DataWidth 32 and 64, peak for aligned, no deadlock.
- tb_idma_transpose_nd (rw_axi engine) regression: PASS.
- transpose correct at peak, reject cases reject with no bursts launched.
Caveat
OBI has no bursts, so the write is one beat per grant. For a single linear
master writing to word-interleaved TCDM, consecutive addresses hit different
banks → ~1 beat/cycle = full beat-width, but it is more exposed to TCDM grant
contention than AXI (which amortises via bursts).
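The bank-spreading claim can be illustrated with a small Python sketch of word-interleaved addressing. It assumes the beat width equals the bank word width and uses hypothetical values for the word size and bank count; real TCDM configurations may differ.

```python
def bank_of(addr: int, word_bytes: int, num_banks: int) -> int:
    """Bank index under word interleaving: consecutive words rotate over banks."""
    return (addr // word_bytes) % num_banks


def banks_for_linear_write(base: int, num_beats: int,
                           word_bytes: int, num_banks: int):
    """Banks hit by a linear master issuing one word-sized beat per grant."""
    return [bank_of(base + i * word_bytes, word_bytes, num_banks)
            for i in range(num_beats)]


# Hypothetical TCDM: 8-byte words, 32 banks.
print(banks_for_linear_write(0x0, 4, 8, 32))  # [0, 1, 2, 3]
```

Because back-to-back beats land in different banks, a single linear writer does not conflict with itself; any stall comes from other masters competing for the same bank in the same cycle.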
Follow-ups
AddrGenTranspose low-area / dense fallback on top.