inst64: Peak-bandwidth transpose via the FF engine on the multi-write backend #143

Open
DanielKellerM wants to merge 2 commits into devel from inst64/transpose-engine

@DanielKellerM

Collaborator

What

Enable the #112 on-the-fly FF transpose engine on the multi-write inst64
(rw_axi_rw_init_rw_obi) backend, so that a transpose runs at full bus-word
bandwidth over the AXI / OBI write ports instead of only on the
element-granular address-generation path. The Snitch DM core can now transpose
TCDM↔TCDM (OBI→OBI) and SoC↔TCDM (AXI↔OBI) at peak bandwidth.

How

  • util/mario/backend.py: relax the compute gate from "single AXI write" to
    "any AXI or OBI write port" (pure-INIT still rejected).
  • idma_obi_write + idma_obi.yml: give the OBI write port the two signals the
    engine needs and AXI already had — w_beat_done_o (= write_happening) and
    mask_ext_i (ANDed into the strobe).
  • idma_transport_layer.sv.tpl: add the multi-write w_beat_done retire mux
    (keyed on dst_protocol; INIT → 1'b0).
  • idma_backend.sv.tpl: size the bypass AW FIFO by NumAxInFlight +
    ComputeFifoDepth (was a hardcoded 2); otherwise the read-ahead engine
    deadlocks an OBI write port.
  • idma_inst64_top: decode the transpose DMCPY into opt.compute and splice
    idma_transpose_midend (the engine walk) ahead of the ND midend; reject
    malformed requests (feature off, reserved mode, zero dim, twod, element-
    misaligned src/dst); capability assert that the backend has the engine.

The midend RTL is unchanged (the engine walk is devel's existing behaviour).
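
The reject conditions listed above can be modelled in a few lines. This is only a behavioural sketch in Python, not the RTL: the function name, the argument names, and `max_mode` are hypothetical (the PR does not say which modes are reserved), and `E = 1 << mode` is taken from the RTL comment quoted in the review thread.

```python
def transpose_request_ok(feature_on, mode, tensor_m, tensor_n,
                         twod, src, dst, max_mode=3):
    """Return True if a transpose request passes the listed reject checks.

    max_mode is a hypothetical bound; the real reserved encodings are
    defined by the RTL, not by this sketch.
    """
    if not feature_on:                     # feature compiled out
        return False
    if mode < 0 or mode > max_mode:        # reserved mode (assumed range)
        return False
    if tensor_m == 0 or tensor_n == 0:     # zero dimension
        return False
    if twod:                               # 2D transfers are rejected
        return False
    elem_bytes = 1 << mode                 # E = 1 << mode
    if src % elem_bytes or dst % elem_bytes:
        return False                       # element-misaligned src/dst
    return True
```

For example, a 16-bit-element request (`mode=1`) with `src=0x101` is rejected because the source address is not 2-byte aligned.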

Output layout

Aᵀ lands with an NE-aligned leading dimension MP = ceil(M/NE)·NE, where
NE = StrbWidth/E and E is the element size in bytes (E = 1 << mode): dense when
M is NE-aligned, padded (valid columns + sentinel padding) otherwise.
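
To make the padded layout concrete, here is a small Python sketch of the arithmetic. It assumes A is an M×N row-major matrix, so that each row of Aᵀ holds M valid elements; the function names and the `sentinel` placeholder are illustrative, since the PR does not state the sentinel value.

```python
def padded_leading_dim(m, strb_width, elem_bytes):
    """Return (NE, MP) with NE = StrbWidth / E and MP = ceil(M / NE) * NE."""
    ne = strb_width // elem_bytes
    mp = -(-m // ne) * ne          # integer ceiling division, then scale
    return ne, mp


def transpose_padded(a, ne, sentinel=None):
    """Transpose an M x N list-of-lists into N rows of pitch MP.

    Columns M..MP-1 of every output row hold the (hypothetical) sentinel.
    """
    m, n = len(a), len(a[0])
    mp = -(-m // ne) * ne
    out = [[sentinel] * mp for _ in range(n)]
    for i in range(m):
        for j in range(n):
            out[j][i] = a[i][j]
    return out
```

With an 8-byte strobe width and 2-byte elements, NE is 4, so M = 6 gives MP = 8 (two padding columns per row), while M = 4 or M = 8 gives a dense result.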

Validation

  • make idma_hw_all IDMA_VIDMA_IDS=rw_axi_rw_init_rw_obi — clean.
  • New public tb_idma_transpose_obi_engine (rw_obi + engine, padded checker +
    in-suite negative control): PASS at DataWidth 32 and 64, peak bandwidth for
    aligned cases, no deadlock.
  • tb_idma_transpose_nd (rw_axi engine) regression: PASS.
  • inst64 frontend + engine end-to-end: elaborates clean, OBI→OBI and AXI→OBI
    transpose correct at peak, reject cases reject with no bursts launched.

Caveat

OBI has no bursts, so the write is one beat per grant. For a single linear
master writing to word-interleaved TCDM, consecutive addresses hit different
banks, so the write still sustains ~1 beat/cycle at full beat width. It is,
however, more exposed to TCDM grant contention than AXI (which amortises via
bursts).
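
The bank argument can be checked with a short Python sketch. It assumes the usual word-interleaved mapping (bank index = word address modulo the number of banks); the function names and the bank count are illustrative and not taken from this PR.

```python
def bank_of(addr, word_bytes, num_banks):
    """Bank selected by a byte address under word interleaving (assumed mapping)."""
    return (addr // word_bytes) % num_banks


def banks_for_linear_write(base, num_beats, word_bytes, num_banks):
    """Banks touched by num_beats consecutive word-sized write beats."""
    return [bank_of(base + i * word_bytes, word_bytes, num_banks)
            for i in range(num_beats)]
```

With more than one bank, two consecutive beats never target the same bank, which is why a single linear master can be granted every cycle when nothing else contends for those banks.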

Follow-ups

Copilot AI review requested due to automatic review settings June 24, 2026 09:50

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Comment thread src/backend/tpl/idma_backend.sv.tpl Outdated
Comment on lines +878 to +879
  stream_fifo_optimal_wrap #(
-   .Depth ( 2 ),
+   .Depth ( NumAxInFlight + ComputeFifoDepth ),

Collaborator Author

Explain this in detail: why is it required for the transpose?

Comment thread src/backend/idma_obi_write.sv Outdated
Comment on lines +113 to +115
// one write beat accepted on the bus; drives the compute engine's retire
assign w_beat_done_o = write_happening;

Collaborator Author

There is already a retire id job; why not use that?

Comment thread src/frontend/inst64/idma_inst64_top.sv Outdated
parameter int unsigned NumChannels = 32'd1,
parameter bit TCDMAliasEnable = 1'b0,
parameter int unsigned DMATracing = 32'd0,
/// Compile-time on-the-fly compute feature enables (e.g. transpose)

Collaborator Author

remove verbose comment

Comment thread src/frontend/inst64/idma_inst64_top.sv Outdated
localparam type tf_len_t = logic[TFLenWidth-1:0];
localparam type offset_t = logic[OffsetWidth-1:0];
localparam type strides_t = logic[RepWidth-1:0];
// strides must match addr_t: signed transpose deltas would not sign-extend if narrower

Collaborator Author

remove verbose comment

Comment thread src/frontend/inst64/idma_inst64_top.sv Outdated
Comment on lines +378 to +380
// expand a transpose request into the backend FF engine's NumDim=4 tiled
// walk (compute.enable kept). Aᵀ lands PADDED at an NE-aligned leading
// pitch MP*E, MP = ceil(M/NE)*NE (dense when M is NE-aligned).

Collaborator Author

verbose comments, keep single line

Comment thread src/frontend/inst64/idma_inst64_top.sv Outdated
Comment on lines +609 to +611
// transpose request (register form only): argb spare bits
// carry {enable, mode, tensor_m, tensor_n}. The FF engine
// emits Aᵀ PADDED at NE-aligned leading pitch MP = ceil(M/NE)*NE.

Collaborator Author

more verbose comments

Comment thread src/frontend/inst64/idma_inst64_top.sv Outdated
Comment on lines +623 to +624
// reject malformed transpose: feature off, reserved mode,
// zero dim, twod (cfg[1]), or src/dst not E-aligned (E=1<<mode)

Collaborator Author

more verbose comments

Comment thread test/tb_idma_transpose_obi_engine.sv Outdated

Collaborator Author

This file should only be in the nonfree repo.

Lift the compute gate to any AXI/OBI write port (pure-INIT still rejected); give
the OBI write port the engine's per-beat retire w_beat_done_o = write_happening
(strobe-independent, pulses even for masked beats) and external mask mask_ext_i;
add the multi-write w_beat_done retire mux.
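
As a behavioural illustration of the two OBI signals described above (a Python sketch, not the RTL; the function and argument names are hypothetical): the external mask is ANDed into the byte strobe, while the retire pulse follows `write_happening` alone.

```python
def obi_write_beat(strobe, mask_ext, write_happening):
    """Model one OBI write beat.

    Returns (effective_strobe, w_beat_done):
      - effective_strobe is the byte strobe ANDed with the external mask
      - w_beat_done mirrors write_happening and ignores the strobe
    """
    effective_strobe = strobe & mask_ext
    w_beat_done = bool(write_happening)
    return effective_strobe, w_beat_done
```

A fully masked beat (`mask_ext = 0`) therefore writes no bytes but still produces a retire pulse, so the engine's beat count stays in step with the bus.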

Gate the bypass i_aw_fifo depth on enable_compute: the engine reads a full
NE-beat tile before any write retires, so on combined-aw+w (OBI) the read-meta
bypass FIFO must hold a tile's worth (NumAxInFlight + ComputeFifoDepth, matching
its lockstep twin i_w_dp_req) or the legalizer deadlocks. Non-compute backends
keep Depth 2 (byte-identical to before).
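
The depth rule and the reason for it can be sketched with a toy Python model. This is an assumption-laden illustration, not a description of the actual FIFO: it only captures the statement above that a full tile of read-meta entries is pushed before any entry is popped. All names and numeric values are hypothetical.

```python
def bypass_aw_fifo_depth(enable_compute, num_ax_in_flight, compute_fifo_depth):
    """Depth of the bypass FIFO: tile-sized with compute, 2 otherwise."""
    if enable_compute:
        return num_ax_in_flight + compute_fifo_depth
    return 2


def tile_read_completes(depth, tile_beats):
    """Toy model: push tile_beats entries with no pops in between.

    Returns False if the FIFO fills before the tile is fully read, which
    in this model means nothing can ever drain it (a deadlock).
    """
    occupancy = 0
    for _ in range(tile_beats):
        if occupancy == depth:
            return False
        occupancy += 1
    return True
```

In this model a depth of 2 stalls on any tile longer than two beats, whereas a depth of, say, 4 + 8 = 12 absorbs an 8-beat tile.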
@DanielKellerM DanielKellerM force-pushed the inst64/transpose-engine branch from 9e122a3 to b358048 Compare June 24, 2026 17:17
Decode the transpose DMCPY (argb spare bits) into opt.compute and splice
idma_transpose_midend (engine walk) ahead of the ND midend, gated by
ComputeEnable.transpose with NumDim=4. Widen strides_t to addr_t so signed
transpose deltas sign-extend. Reject malformed requests (feature off, reserved
mode, zero dim, twod, element-misaligned src/dst) and assert the backend carries
the engine.
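
Why the stride width matters can be shown with plain integer arithmetic. The Python sketch below is illustrative only (names and widths are hypothetical): a negative stride stored in fewer bits than the address is zero-extended when added, which moves the address forward instead of backward.

```python
def next_addr(addr, stride_bits, stride_width, addr_width, sign_extend):
    """Add a two's-complement stride of stride_width bits to an address.

    With sign_extend=False the raw bits are zero-extended, which is what
    happens when a narrower unsigned field is added to a wider address.
    """
    stride = stride_bits
    if sign_extend and (stride_bits >> (stride_width - 1)) & 1:
        stride = stride_bits - (1 << stride_width)   # recover negative value
    return (addr + stride) & ((1 << addr_width) - 1)
```

A stride of -4 held in 16 bits (`0xFFFC`) added to `0x1000` gives `0x10FFC` when zero-extended but `0xFFC` when sign-extended. When the stride is as wide as the address, the wrap-around of the addition yields the correct result without any explicit extension.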
@DanielKellerM DanielKellerM force-pushed the inst64/transpose-engine branch from b358048 to f5684c3 Compare June 25, 2026 12:17