inst64: On-the-fly transpose via address generation (no engine)#141
Open
DanielKellerM wants to merge 3 commits into
Open
inst64: On-the-fly transpose via address generation (no engine)#141DanielKellerM wants to merge 3 commits into
DanielKellerM wants to merge 3 commits into
Conversation
76c0de1 to
dea5345
Compare
7906c22 to
e9835ba
Compare
DanielKellerM
commented
Jun 23, 2026
Comment on lines
+8
to
+11
| /// Transpose geometry expander for the generic idma_nd_midend. Two modes: | ||
| /// engine (NumDim=4 tiled walk feeding the FF transpose engine) and address-gen | ||
| /// (element-granular swapped-stride walk, no engine, for backends without the | ||
| /// engine e.g. multi-write OBI/TCDM). Non-transpose passes through. |
Collaborator
Author
There was a problem hiding this comment.
fix verbose comments, should be 1 line
Comment on lines
+62
to
+67
| m = $signed({{(WorkW-TensorW){1'b0}}, tm}); // M | ||
| n = $signed({{(WorkW-TensorW){1'b0}}, tn}); // N | ||
| log2ne = $signed(WorkW'(Log2Strb)) - $signed({{(WorkW-ModeW){1'b0}}, mode}); | ||
| ne = $signed(WorkW'(1)) <<< log2ne; // tile side (elements) | ||
| yt = (m + ne - 1) >>> log2ne; // ceil(M/NE) | ||
| nt = (n + ne - 1) >>> log2ne; // ceil(N/NE) | ||
| nxe = n <<< mode; // N*E (E = 1<<mode) | ||
| mpe = yt <<< Log2Strb; // MP*E = YT*NE*E = YT*StrbWidth | ||
| strb_c = $signed(WorkW'(StrbWidth)); // NE*E (one tile-row = StrbWidth B) | ||
| m = $signed({{(WorkW-TensorW){1'b0}}, tm}); // M | ||
| n = $signed({{(WorkW-TensorW){1'b0}}, tn}); // N |
Collaborator
Author
There was a problem hiding this comment.
why did we remove this?
Comment on lines
+70
to
+73
| // Element-granular swapped-stride walk (out_T[c][r] = in[r][c]), | ||
| // dst a contiguous N x M transpose. No engine: compute is cleared | ||
| // so the backend runs a plain strided copy. Correct on any | ||
| // protocol (ideal on random-access OBI/TCDM; slow on burst AXI). |
Collaborator
Author
There was a problem hiding this comment.
fix verbose comments, should be 1 line
Comment on lines
+8
to
+13
| // Address-gen transpose proof: NO compute engine. A plain rw_axi backend | ||
| // transposes an M x N matrix purely via a 2D ND program with swapped strides | ||
| // and a one-element (EB-byte) inner burst. out_T[c][r] = in[r][c], dst is a | ||
| // contiguous N x M transpose (no padding). Validates that iDMA's read->FIFO-> | ||
| // write datapath transposes by addressing alone, the basis for the OBI/TCDM | ||
| // multi-write transpose path. |
Collaborator
Author
There was a problem hiding this comment.
fix verbose comments, should be 1 line
e9835ba to
58d644f
Compare
idma_transpose_midend gains AddrGenTranspose: an element-granular NumDim=3 swapped-stride program (out_T[c][r]=in[r][c], contiguous N x M dst) that clears opt.compute, so a backend with no FF transpose engine (e.g. multi-write OBI/ TCDM) runs a plain strided copy. The NumDim=4 engine walk is unchanged when AddrGenTranspose=0.
Decode the transpose DMCPY into opt.compute and splice idma_transpose_midend ahead of idma_nd_midend. This multi-write backend has no FF engine, so transpose is always address generation (AddrGenTranspose hardcoded). Reject malformed requests (feature off, reserved mode, zero dim, twod, element-misaligned src or dst).
TBs drive a transpose request through idma_transpose_midend (so its stride expansion is exercised) into idma_nd_midend + a stock rw_axi / rw_obi backend, checking out_T[c][r]==in[r][c] with no compute engine. Cover int8/fp16/fp32 (square + rectangular), DataWidth 32 and 64, plus an in-suite negative control that a corrupted dst is caught; OBI uses native obi_sim_mem. Wired into idma.mk.
58d644f to
a3334c4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds on-the-fly transpose by address generation — no FF engine — and drives it from the inst64 frontend.
PR #112's transpose engine sits on a single AXI write path, so the generator refuses it on multi-write backends (
rw_axi_rw_init_rw_obi, i.e. inst64). But OBI/TCDM is random-access: a transpose into/within TCDM is just a strided address program, no tile buffer needed. Soidma_transpose_midendgains anAddrGenTransposemode that emits an element-granularNumDim=3swapped-stride program (out_T[c][r] = in[r][c]) and clearsopt.compute, leaving the backend to run a plain strided copy. The FF engine (#112) stays the full-throughput path for AXI↔AXI onrw_axi.What's here
idma_transpose_midend—AddrGenTranspose(element-granular swapped-stride walk, no engine) and optionalBankSkew(pads the dst row pitch by one bus-word so the per-column word stride is odd → round-robins all banks on a power-of-2-bank TCDM, ≤1 word/row). TheNumDim=4engine walk is unchanged whenAddrGenTranspose=0.idma_inst64_top— decodes the transpose DMCPY intoopt.compute, splices the expander ahead ofidma_nd_midend, selects the address-gen path via the params, and rejects malformed requests (no hardware / reserved mode / zero dim / unaligned dst).idma_nd_midend+ a stockrw_axi/rw_obibackend with the swapped-stride program, checkingout_T[c][r] == in[r][c]with no compute engine. Cover int8/fp16/fp32, square/rectangular/odd, DataWidth 32 and 64; the OBI path uses nativeobi_sim_mem.Notes
src/+test/; no codegen/templatization. The inst64 end-to-end integration harness is maintained out of tree.devel.