inst64: On-the-fly transpose via address generation (no engine) by DanielKellerM · Pull Request #141 · pulp-platform/iDMA

DanielKellerM · 2026-06-23T09:48:28Z

Summary

Adds on-the-fly transpose by address generation — no FF engine — and drives it from the inst64 frontend.

PR #112's transpose engine sits on a single AXI write path, so the generator refuses it on multi-write backends (rw_axi_rw_init_rw_obi, i.e. inst64). But OBI/TCDM is random-access: a transpose into/within TCDM is just a strided address program, no tile buffer needed. So idma_transpose_midend gains an AddrGenTranspose mode that emits an element-granular NumDim=3 swapped-stride program (out_T[c][r] = in[r][c]) and clears opt.compute, leaving the backend to run a plain strided copy. The FF engine (#112) stays the full-throughput path for AXI↔AXI on rw_axi.

What's here

idma_transpose_midend — AddrGenTranspose (element-granular swapped-stride walk, no engine) and optional BankSkew (pads the dst row pitch by one bus-word so the per-column word stride is odd → round-robins all banks on a power-of-2-bank TCDM, ≤1 word/row). The NumDim=4 engine walk is unchanged when AddrGenTranspose=0.
idma_inst64_top — decodes the transpose DMCPY into opt.compute, splices the expander ahead of idma_nd_midend, selects the address-gen path via the params, and rejects malformed requests (no hardware / reserved mode / zero dim / unaligned dst).
Regressions — standalone TBs driving idma_nd_midend + a stock rw_axi / rw_obi backend with the swapped-stride program, checking out_T[c][r] == in[r][c] with no compute engine. Cover int8/fp16/fp32, square/rectangular/odd, DataWidth 32 and 64; the OBI path uses native obi_sim_mem.

Notes

Address-gen is element-granular (M·N transactions): ideal on random-access OBI/TCDM, slower than a burst on AXI — hence the engine remains for AXI↔AXI.
All changes are in src/ + test/; no codegen/templatization. The inst64 end-to-end integration harness is maintained out of tree.
Rebased on current devel.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

DanielKellerM

fix

DanielKellerM · 2026-06-23T13:46:41Z

+/// Transpose geometry expander for the generic idma_nd_midend. Two modes:
+/// engine (NumDim=4 tiled walk feeding the FF transpose engine) and address-gen
+/// (element-granular swapped-stride walk, no engine, for backends without the
+/// engine e.g. multi-write OBI/TCDM). Non-transpose passes through.


fix verbose comments, should be 1 line

DanielKellerM · 2026-06-23T13:46:52Z

-            m      = $signed({{(WorkW-TensorW){1'b0}}, tm});   // M
-            n      = $signed({{(WorkW-TensorW){1'b0}}, tn});   // N
-            log2ne = $signed(WorkW'(Log2Strb)) - $signed({{(WorkW-ModeW){1'b0}}, mode});
-            ne     = $signed(WorkW'(1)) <<< log2ne;            // tile side (elements)
-            yt     = (m + ne - 1) >>> log2ne;                  // ceil(M/NE)
-            nt     = (n + ne - 1) >>> log2ne;                  // ceil(N/NE)
-            nxe    = n  <<< mode;                              // N*E  (E = 1<<mode)
-            mpe    = yt <<< Log2Strb;                          // MP*E = YT*NE*E = YT*StrbWidth
-            strb_c = $signed(WorkW'(StrbWidth));               // NE*E (one tile-row = StrbWidth B)
+            m = $signed({{(WorkW-TensorW){1'b0}}, tm});   // M
+            n = $signed({{(WorkW-TensorW){1'b0}}, tn});   // N


why did we remove this?

DanielKellerM · 2026-06-23T13:46:59Z

+                // Element-granular swapped-stride walk (out_T[c][r] = in[r][c]),
+                // dst a contiguous N x M transpose. No engine: compute is cleared
+                // so the backend runs a plain strided copy. Correct on any
+                // protocol (ideal on random-access OBI/TCDM; slow on burst AXI).


fix verbose comments, should be 1 line

DanielKellerM · 2026-06-23T13:47:10Z

+// Address-gen transpose proof: NO compute engine. A plain rw_axi backend
+// transposes an M x N matrix purely via a 2D ND program with swapped strides
+// and a one-element (EB-byte) inner burst. out_T[c][r] = in[r][c], dst is a
+// contiguous N x M transpose (no padding). Validates that iDMA's read->FIFO->
+// write datapath transposes by addressing alone, the basis for the OBI/TCDM
+// multi-write transpose path.


fix verbose comments, should be 1 line

idma_transpose_midend gains AddrGenTranspose: an element-granular NumDim=3 swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) that clears opt.compute, so a backend with no FF transpose engine (e.g. multi-write OBI/ TCDM) runs a plain strided copy. The NumDim=4 engine walk is unchanged when AddrGenTranspose=0.

Decode the transpose DMCPY into opt.compute and splice idma_transpose_midend ahead of idma_nd_midend. This multi-write backend has no FF engine, so transpose is always address generation (AddrGenTranspose hardcoded). Reject malformed requests (feature off, reserved mode, zero dim, twod, element-misaligned src or dst).

TBs drive a transpose request through idma_transpose_midend (so its stride expansion is exercised) into idma_nd_midend + a stock rw_axi / rw_obi backend, checking out_T[c][r]==in[r][c] with no compute engine. Cover int8/fp16/fp32 (square + rectangular), DataWidth 32 and 64, plus an in-suite negative control that a corrupted dst is caught; OBI uses native obi_sim_mem. Wired into idma.mk.

Copilot AI review requested due to automatic review settings June 23, 2026 09:48

DanielKellerM requested review from micprog and thommythomaso as code owners June 23, 2026 09:48

Copilot AI reviewed Jun 23, 2026

DanielKellerM mentioned this pull request Jun 23, 2026

inst64: Drive on-the-fly transpose; add the snitch integration harness #113

Closed

DanielKellerM force-pushed the inst64/transpose-snitch branch from 76c0de1 to dea5345 Compare June 23, 2026 11:32

DanielKellerM changed the title ~~inst64: Drive on-the-fly transpose; add the snitch integration harness~~ inst64: On-the-fly transpose via address generation (no engine) Jun 23, 2026

DanielKellerM force-pushed the inst64/transpose-snitch branch 2 times, most recently from 7906c22 to e9835ba Compare June 23, 2026 13:40

DanielKellerM commented Jun 23, 2026

View reviewed changes

DanielKellerM force-pushed the inst64/transpose-snitch branch from e9835ba to 58d644f Compare June 23, 2026 13:58

DanielKellerM added 3 commits June 23, 2026 22:58

DanielKellerM force-pushed the inst64/transpose-snitch branch from 58d644f to a3334c4 Compare June 23, 2026 20:59

DanielKellerM mentioned this pull request Jun 24, 2026

inst64: Peak-bandwidth transpose via the FF engine on the multi-write backend #143

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

inst64: On-the-fly transpose via address generation (no engine)#141

inst64: On-the-fly transpose via address generation (no engine)#141
DanielKellerM wants to merge 3 commits into
develfrom
inst64/transpose-snitch

DanielKellerM commented Jun 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

DanielKellerM left a comment

Uh oh!

DanielKellerM Jun 23, 2026

Uh oh!

DanielKellerM Jun 23, 2026

Uh oh!

DanielKellerM Jun 23, 2026

Uh oh!

DanielKellerM Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

DanielKellerM commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's here

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

DanielKellerM left a comment

Choose a reason for hiding this comment

Uh oh!

DanielKellerM Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

DanielKellerM Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

DanielKellerM Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

DanielKellerM Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DanielKellerM commented Jun 23, 2026 •

edited

Loading