Skip to content

inst64: On-the-fly transpose via address generation (no engine)#141

Open
DanielKellerM wants to merge 3 commits into
develfrom
inst64/transpose-snitch
Open

inst64: On-the-fly transpose via address generation (no engine)#141
DanielKellerM wants to merge 3 commits into
develfrom
inst64/transpose-snitch

Conversation

@DanielKellerM

@DanielKellerM DanielKellerM commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds on-the-fly transpose by address generation — no FF engine — and drives it from the inst64 frontend.

PR #112's transpose engine sits on a single AXI write path, so the generator refuses it on multi-write backends (rw_axi_rw_init_rw_obi, i.e. inst64). But OBI/TCDM is random-access: a transpose into/within TCDM is just a strided address program, no tile buffer needed. So idma_transpose_midend gains an AddrGenTranspose mode that emits an element-granular NumDim=3 swapped-stride program (out_T[c][r] = in[r][c]) and clears opt.compute, leaving the backend to run a plain strided copy. The FF engine (#112) stays the full-throughput path for AXI↔AXI on rw_axi.

What's here

  • idma_transpose_midendAddrGenTranspose (element-granular swapped-stride walk, no engine) and optional BankSkew (pads the dst row pitch by one bus-word so the per-column word stride is odd → round-robins all banks on a power-of-2-bank TCDM, ≤1 word/row). The NumDim=4 engine walk is unchanged when AddrGenTranspose=0.
  • idma_inst64_top — decodes the transpose DMCPY into opt.compute, splices the expander ahead of idma_nd_midend, selects the address-gen path via the params, and rejects malformed requests (no hardware / reserved mode / zero dim / unaligned dst).
  • Regressions — standalone TBs driving idma_nd_midend + a stock rw_axi / rw_obi backend with the swapped-stride program, checking out_T[c][r] == in[r][c] with no compute engine. Cover int8/fp16/fp32, square/rectangular/odd, DataWidth 32 and 64; the OBI path uses native obi_sim_mem.

Notes

  • Address-gen is element-granular (M·N transactions): ideal on random-access OBI/TCDM, slower than a burst on AXI — hence the engine remains for AXI↔AXI.
  • All changes are in src/ + test/; no codegen/templatization. The inst64 end-to-end integration harness is maintained out of tree.
  • Rebased on current devel.

Copilot AI review requested due to automatic review settings June 23, 2026 09:48

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@DanielKellerM DanielKellerM force-pushed the inst64/transpose-snitch branch from 76c0de1 to dea5345 Compare June 23, 2026 11:32
@DanielKellerM DanielKellerM changed the title inst64: Drive on-the-fly transpose; add the snitch integration harness inst64: On-the-fly transpose via address generation (no engine) Jun 23, 2026
@DanielKellerM DanielKellerM force-pushed the inst64/transpose-snitch branch 2 times, most recently from 7906c22 to e9835ba Compare June 23, 2026 13:40

@DanielKellerM DanielKellerM left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

Comment thread src/midend/idma_transpose_midend.sv Outdated
Comment on lines +8 to +11
/// Transpose geometry expander for the generic idma_nd_midend. Two modes:
/// engine (NumDim=4 tiled walk feeding the FF transpose engine) and address-gen
/// (element-granular swapped-stride walk, no engine, for backends without the
/// engine e.g. multi-write OBI/TCDM). Non-transpose passes through.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix verbose comments, should be 1 line

Comment thread src/midend/idma_transpose_midend.sv Outdated
Comment on lines +62 to +67
m = $signed({{(WorkW-TensorW){1'b0}}, tm}); // M
n = $signed({{(WorkW-TensorW){1'b0}}, tn}); // N
log2ne = $signed(WorkW'(Log2Strb)) - $signed({{(WorkW-ModeW){1'b0}}, mode});
ne = $signed(WorkW'(1)) <<< log2ne; // tile side (elements)
yt = (m + ne - 1) >>> log2ne; // ceil(M/NE)
nt = (n + ne - 1) >>> log2ne; // ceil(N/NE)
nxe = n <<< mode; // N*E (E = 1<<mode)
mpe = yt <<< Log2Strb; // MP*E = YT*NE*E = YT*StrbWidth
strb_c = $signed(WorkW'(StrbWidth)); // NE*E (one tile-row = StrbWidth B)
m = $signed({{(WorkW-TensorW){1'b0}}, tm}); // M
n = $signed({{(WorkW-TensorW){1'b0}}, tn}); // N

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we remove this?

Comment thread src/midend/idma_transpose_midend.sv Outdated
Comment on lines +70 to +73
// Element-granular swapped-stride walk (out_T[c][r] = in[r][c]),
// dst a contiguous N x M transpose. No engine: compute is cleared
// so the backend runs a plain strided copy. Correct on any
// protocol (ideal on random-access OBI/TCDM; slow on burst AXI).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix verbose comments, should be 1 line

Comment thread test/tb_idma_addrgen_transpose.sv Outdated
Comment on lines +8 to +13
// Address-gen transpose proof: NO compute engine. A plain rw_axi backend
// transposes an M x N matrix purely via a 2D ND program with swapped strides
// and a one-element (EB-byte) inner burst. out_T[c][r] = in[r][c], dst is a
// contiguous N x M transpose (no padding). Validates that iDMA's read->FIFO->
// write datapath transposes by addressing alone, the basis for the OBI/TCDM
// multi-write transpose path.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix verbose comments, should be 1 line

@DanielKellerM DanielKellerM force-pushed the inst64/transpose-snitch branch from e9835ba to 58d644f Compare June 23, 2026 13:58
idma_transpose_midend gains AddrGenTranspose: an element-granular NumDim=3
swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) that clears
opt.compute, so a backend with no FF transpose engine (e.g. multi-write OBI/
TCDM) runs a plain strided copy. The NumDim=4 engine walk is unchanged when
AddrGenTranspose=0.
Decode the transpose DMCPY into opt.compute and splice idma_transpose_midend
ahead of idma_nd_midend. This multi-write backend has no FF engine, so transpose
is always address generation (AddrGenTranspose hardcoded). Reject malformed
requests (feature off, reserved mode, zero dim, twod, element-misaligned src or
dst).
TBs drive a transpose request through idma_transpose_midend (so its stride
expansion is exercised) into idma_nd_midend + a stock rw_axi / rw_obi backend,
checking out_T[c][r]==in[r][c] with no compute engine. Cover int8/fp16/fp32
(square + rectangular), DataWidth 32 and 64, plus an in-suite negative control
that a corrupted dst is caught; OBI uses native obi_sim_mem. Wired into idma.mk.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants