[Feature] Implement DMA support by BenkangPeng · Pull Request #293 · tancheng/VectorCGRA

BenkangPeng · 2026-06-02T13:55:27Z

This PR introduces CgraDmaRTL, which integrates the CGRA with a DMA engine, enabling direct memory transfers between external DRAM (not implemented yet) and the CGRA's dataSPM.

HobbitQia · 2026-06-04T15:05:31Z

Hi @tancheng @BenkangPeng, I summarized two directions for the DMA design below:

Rely on data controller
- DMA is added as a new client of the DataMemControllerRTL: the DMA engine exchanges data directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.
  
  To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.
- Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.
- Cons: Additional logic is required to feed DMA results into the control memory.

All in controller
- All decoding logic is handled in the controller. The port-handling logic should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.
  
  The packetization logic should also be implemented in the controller module.
- Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).
- Cons: Introduces complex control logic in the controller; results in a slower path.

I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

tancheng · 2026-06-05T08:30:27Z

Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?

HobbitQia · 2026-06-07T14:56:54Z

I am thinking that if we want to enable DMA and traditional load/store to run concurrently, then we need to multiplex the ports of the Data SPM, and I think this logic can be implemented in DataMemController. However, we can also transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we can write our logic entirely in the controller, but with higher latency. I think the former method may give better performance with minimal additions to DataMemController.

tancheng · 2026-06-07T18:15:57Z

I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?

Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2, WDYT?

HobbitQia · 2026-06-08T02:15:57Z

If the DMA data has to go through the controller packet path, there may be extra latency from packetization, and there may be contention between NoC/CPU/tile requests to the SPM? Or do we have two separate paths in the Controller?

tancheng · 2026-06-08T04:21:57Z

> there may be extra latency from packetization

Packing can just be combinational logic before the packet is put into a queue.

> and there may be contention between NoC/CPU/tile requests to the SPM? Or do we have two separate paths in the Controller?

I was thinking about separate paths, which is why I mentioned FROM_CPU and FROM_NOC. I feel that requests targeting the same SPM bank should conflict with each other anyway, WDYT?
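The point that packing can be combinational logic in front of a queue can be illustrated with a small behavioural model in plain Python (not PyMTL). This is only a sketch: `Packet`, `PacketQueue`, `pack_store`, `cycle`, and the numeric value of `CMD_STORE_REQUEST` are hypothetical stand-ins, not the project's actual types.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical command code; the real value is defined in the project's headers.
CMD_STORE_REQUEST = 1


@dataclass(frozen=True)
class Packet:
    cmd: int
    addr: int
    data: int


def pack_store(addr, data):
    """Pure function of its inputs: the software analogue of combinational logic."""
    return Packet(CMD_STORE_REQUEST, addr, data)


class PacketQueue:
    """Bounded FIFO with a val/rdy-style ready check (assumed interface)."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    def rdy(self):
        return len(self.entries) < self.depth

    def enq(self, pkt):
        assert self.rdy(), "enqueue attempted while the queue is full"
        self.entries.append(pkt)


def cycle(queue, word_valid, addr, data):
    """Model one clock cycle: pack and enqueue in the same cycle if possible.

    Returns True when the word was accepted, False when it must be retried.
    """
    if word_valid and queue.rdy():
        queue.enq(pack_store(addr, data))
        return True
    return False
```

In this model the packing step adds no cycle of its own; any extra latency comes from waiting in the queue when it is full.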

HobbitQia · 2026-06-08T04:41:47Z

Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?
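A minimal sketch of the per-bank conflict handling discussed here, in plain Python rather than PyMTL. It assumes one access per bank per cycle and a fixed priority given by list order; the function name `arbitrate` and the source labels are hypothetical, and the real design may use a different policy (for example round-robin).

```python
def arbitrate(requests, num_banks):
    """Grant at most one request per SPM bank in a single cycle.

    requests: list of (source, bank_index) pairs, highest priority first.
    Returns (granted, stalled): a dict bank_index -> source, and the list
    of sources that lost arbitration and must retry in a later cycle.
    """
    granted = {}
    stalled = []
    for source, bank in requests:
        assert 0 <= bank < num_banks, "bank index out of range"
        if bank in granted:
            # Another path already owns this bank in this cycle.
            stalled.append(source)
        else:
            granted[bank] = source
    return granted, stalled
```

Requests from different paths proceed in parallel when they hit different banks and serialize only when they target the same bank.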

tancheng · 2026-06-08T05:21:11Z

Oh, we don't need to distinguish FROM_CPU and FROM_NOC, we can decompose the requests from CPU like:

VectorCGRA/controller/ControllerRTL.py

Line 207 in eb71842

    
           s.recv_from_cpu_pkt_queue.send.rdy @= s.crossbar.recv[kFromCpuCtrlAndDataIdx].rdy

tancheng · 2026-06-15T05:26:49Z

+
+  return mk_bitstruct(new_name, {
+    'dram_data': DramDataType,
+    'dram_mask': DramMaskType,


Can you explain what dram_mask is with a comment?

tancheng · 2026-06-15T05:27:00Z

+    'dram_data': DramDataType,
+    'dram_mask': DramMaskType,
+    'spm_data': SpmDataType,
+    'spm_mask': SpmMaskType,


Can you explain what spm_mask is with a comment?

tancheng · 2026-06-15T05:35:15Z

We should also include the figure 2 into our

tancheng · 2026-06-16T15:46:19Z

@BenkangPeng, since you introduced IntegratedDmaWithCgraRTL, we should update this diagram so that the "Controller", "DataMemController", "SPM", and "Control SPM" are wrapped by a CGRA box, and the entire diagram (except the "CPU") is wrapped by an "IntegratedDmaWithCgra" box.

HobbitQia · 2026-06-16T06:08:23Z

        #   # TODO: Handle other cmd types.
        #   assert(False)

+      if has_dma_ports & s.dma_done.val:


I am thinking about the possible conflict between CMD_DMA_DONE and CMD_COMPLETE, or other commands that are sent back to the CPU. Maybe we need some logic to resolve these conflicts if CMD_DMA_DONE and CMD_COMPLETE are both valid in the same cycle?
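One possible way to resolve such a same-cycle conflict is to latch the DMA-done event and send it once the response channel is free. The plain-Python sketch below assumes CMD_COMPLETE has priority and that `dma_done` is a one-cycle pulse; neither assumption comes from the PR, and the class name `CpuResponseMux` is hypothetical.

```python
class CpuResponseMux:
    """Selects at most one command per cycle to send back to the CPU."""

    def __init__(self):
        # Remembers a DMA-done event that could not be sent yet.
        self.pending_dma_done = False

    def tick(self, complete_valid, dma_done_pulse):
        """Advance one cycle and return the command sent, or None."""
        if dma_done_pulse:
            self.pending_dma_done = True
        if complete_valid:
            # Assumed policy: CMD_COMPLETE wins; the done event stays latched.
            return "CMD_COMPLETE"
        if self.pending_dma_done:
            self.pending_dma_done = False
            return "CMD_DMA_DONE"
        return None
```

A single pending flag merges back-to-back done pulses; a counter or a tag per DMA command would be needed if each completion must be reported separately.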

HobbitQia · 2026-06-16T06:22:01Z

+            s.opcode_ff     <<= s.dma_cmd.msg.opcode
+            s.dram_addr_ff  <<= s.dma_cmd.msg.dram_addr
+            s.spm_addr_ff   <<= s.dma_cmd.msg.spm_addr
+            s.words_left_ff <<= s.dma_cmd.msg.nbytes >> 2 # Converts the transfer size from bytes to words.


There may be a truncation? If we want to transfer 3 bytes, then words_left_ff = 0, so no data will be transferred.

For now I think we don't need to consider such fine-grained data movement, since DMA usually handles large-scale data. I think we can add some assertions or checks to ensure nbytes % 4 == 0?

Yes, I overlooked it. I will add an assertion to ensure it.

The data transfer granularity between DRAM and SPM is now 1 word (4 bytes), with a mask of 1 bit per word.
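The truncation and the proposed guard can be reproduced in a few lines of plain Python. The helper names below are hypothetical; only the `nbytes >> 2` conversion and the `nbytes % 4 == 0` check come from the thread.

```python
BYTES_PER_WORD = 4  # 1 word = 4 bytes, as stated in the thread


def words_from_nbytes(nbytes):
    """Unchecked conversion, mirroring `nbytes >> 2` in the diff."""
    return nbytes >> 2


def checked_words_from_nbytes(nbytes):
    """Same conversion, but rejects sizes that are not whole words."""
    assert nbytes % BYTES_PER_WORD == 0, "nbytes must be a multiple of 4"
    return nbytes >> 2
```

With the unchecked version, a 3-byte request yields 0 words and a 6-byte request yields 1 word, so the trailing bytes are silently dropped; the checked version fails fast instead.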

HobbitQia · 2026-06-16T06:24:22Z

+          bank_index_store_from_dma = trunc((recv_waddr_from_dma - s.address_lower) >> per_bank_addr_nbits, XbarOutWrType)
+        else:
+          bank_index_store_from_dma = XbarOutWrType(num_banks_per_cgra)
+        s.wr_pkt[dma_wr_idx] @= MemWritePktType(dma_wr_idx,                 # src


Is this mask actually used?
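For readers following the hunk above, the bank-index computation can be modelled in plain Python. The condition guarding the `else` branch is not visible in the hunk, so the range check below is an assumption, as are the function name and the `address_upper` parameter.

```python
def bank_index(addr, address_lower, address_upper, per_bank_addr_nbits, num_banks):
    """Map an address to an SPM bank index.

    In-range addresses map to (addr - address_lower) >> per_bank_addr_nbits.
    Out-of-range addresses map to num_banks, i.e. one past the last bank,
    mirroring the `XbarOutWrType(num_banks_per_cgra)` branch in the diff.
    """
    if address_lower <= addr <= address_upper:  # assumed range check
        return (addr - address_lower) >> per_bank_addr_nbits
    return num_banks
```

For example, with 4 banks of 4 addresses each starting at address 0, address 5 falls in bank 1 and address 16 maps to the extra index 4.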

HobbitQia · 2026-06-18T01:31:46Z

I summarized the TODOs and the items that need to be clarified below. If anything is missing, please feel free to add it. @tancheng @BenkangPeng

This PR should address the following items:

- Potential conflict between CMD_DMA_DONE and CMD_COMPLETE. (Marked as a future TODO)
- Double-check: synthesizability of constructs like `if has_dma_ports & s.dma_done.val:`, and correctness of the generated Verilog logic.
- Multi-CGRA support: a single DMA engine forwards data to the controller, which routes the data to the corresponding CGRA instances. (Marked as a future TODO)
- Illustrate the complete data paths:
  - Forward path: DRAM → DMA Engine → Controller → Data SPM Controller
  - Reverse path: Data SPM Controller → Controller → DMA Engine → DRAM
- Details to be clarified:
  - Data granularity (number of words transferred per cycle)
  - Transfer mode (serial vs. batch-parallel) and data packetization
  - Feasibility of extending the current implementation to support single-packet transfer instead of multi-cycle serial transfer
  - Internal FSM design
  - Update the architecture diagram with the above details
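To make the "multi-cycle serial transfer" item concrete, here is a behavioural sketch of the forward path in plain Python, moving one word per loop iteration in the spirit of the `words_left_ff` counter shown earlier. It assumes both memories are word-addressed Python lists; the function name `dma_mvin` and that addressing scheme are illustrative only, and the actual FSM is still to be clarified.

```python
def dma_mvin(dram, spm, dram_addr, spm_addr, nbytes):
    """Copy nbytes from dram to spm, one 4-byte word per modelled cycle.

    dram and spm are lists of words; addresses are word indices (assumed).
    Returns the number of modelled cycles spent moving data.
    """
    assert nbytes % 4 == 0, "nbytes must be a multiple of 4"
    words_left = nbytes >> 2
    cycles = 0
    while words_left > 0:
        spm[spm_addr] = dram[dram_addr]  # one word moved per cycle
        dram_addr += 1
        spm_addr += 1
        words_left -= 1
        cycles += 1
    return cycles
```

A batch-parallel or single-packet variant would move several words per iteration, reducing the cycle count at the cost of wider ports.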

tancheng · 2026-06-23T16:37:21Z

+kAttrOpcode = 'opcode'
+kAttrDramAddr = 'dram_addr'
+kAttrNBytes = 'nbytes'
+kAttrTag = 'tag'


tag -> dram_tag, and comment on what this tag is used for?

tag -> dma_tag, right?

This tag isn't used now. Maybe we will use it to distinguish different DMA commands? @HobbitQia

Right, dma_tag sounds good. You can then leave a comment like:

This tag isn't used now. We may use it to distinguish different DMA commands.

tancheng · 2026-06-23T16:39:10Z

+kAttrNBytes = 'nbytes'
+kAttrTag = 'tag'
+kAttrSpmAddr = 'spm_addr'
+kAttrSpmData = 'spm_data'


We already have kAttrDataAddr, but we are now distinguishing it from DRAM data. Can you file an issue and do the cleanup (consolidating these attributes) in a later PR?

Oh, do you mean that we should rename kAttrDataAddr/kAttrData to kAttrSpmAddr/kAttrSpmData, since kAttrDataAddr/kAttrData in the current codebase actually indicate the address/data of the SPM? Is my understanding correct?

Right, I feel some kAttr constants need to be refactored, but we don't need to do that in this PR.

tancheng · 2026-06-23T16:58:15Z

+    s.recv_from_ctrl_spm_wr_req = RecvIfcRTL(DmaSpmWriteReqType)
+    s.recv_from_ctrl_spm_rd_req = RecvIfcRTL(DmaSpmReadReqType)
+    s.send_to_ctrl_spm_rd_resp = SendIfcRTL(DmaSpmReadRespType)


Why do we need to access ctrl_spm in DataMemControllerRTL?

recv_from_ctrl_spm_wr_req means receiving, from the Controller, the request to write into the SPM.

We connect the Controller to the DataMemController, and the SPM's wr_req, rd_req, and rd_resp are transferred through these 3 ports.

I will update the current design diagram ASAP.

Then please rename it to recv_from_controller_spm_wr_req.

The current name sounds like:

VectorCGRA/controller/ControllerRTL.py

Line 344 in eb71842

s.send_to_ctrl_ring_pkt.msg @= \

BTW, how do we send to ctrl_ring then? Do we not yet distinguish ctrl from data?

BenkangPeng requested review from HobbitQia and tancheng June 2, 2026 13:55

tancheng reviewed Jun 2, 2026

BenkangPeng force-pushed the dma-cgra branch from f41e7a6 to 86f25a4 Compare June 3, 2026 10:29

BenkangPeng commented Jun 3, 2026

Comment thread mem/data/DataMemControllerRTL.py Outdated

tancheng reviewed Jun 3, 2026

Comment thread cgra/CgraTemplateRTL.py Outdated

BenkangPeng mentioned this pull request Jun 4, 2026

[CleanUp][NFC] Standardize line endings to LF #294

Merged

BenkangPeng added 13 commits June 4, 2026 17:53

Add the DmaEngine implementation and the test.

308a213

[Test] Update the test of DmaEngine.

90d9eef

Add DMA support to DataMemControllerRTL and implement corresponding tests.

5e615e1

Add the dma ports into CgraTemplateRTL

30bdc36

Wrap the Cgra and Dma into one single module.

e6c0b3b

[Script] Add the local_CI script file

c3f3dc4

Update .gitignore to ignore the log file

046c860

[Test] Add the test for CgraDmaRTL

4359f1f

[Fix] Fix the bit mismatch error between dma_idx and num_xbar_in_ports.

aff3a8a

[Doc] Add some comments

b2e41e8

[Fix] Fix the bit mismatch by type convertion

5fc388c

Move some constant into common header file

70ae3da

[Refactor] Wrap the signals between dma and dram with SendIfcRTL and RecvIfcRTL. Replace `mem` with `dram` for clarity.

fc589c5

BenkangPeng force-pushed the dma-cgra branch from 86f25a4 to fc589c5 Compare June 4, 2026 10:15

tancheng reviewed Jun 5, 2026

Comment thread cgra/CgraDmaRTL.py Outdated

BenkangPeng added 7 commits June 14, 2026 11:10

[Fix] Use Outport instead of Wire in DmaWireIfcRTL to avoid the RTLIR error of pytml verilog backend.

e69c3de

[CleanUp] Remove the unnecessary ports.

82ac18d

[Feature] Introduce DMA data structure and DMA-to-DRAM write request interface for enhanced data transfer capabilities.

1a1172b

[Refactor] Pass DmaCmdType and DmaDataType into DataMemController and then drives types from them

94cc7d0

[Refactor] Update DmaEngineRTL to use DmaDramWrReqIfcRTL for DRAM write requests and adjust related signal handling for clarity and consistency.

65fb185

[Refactor] Enhance DMA integration in CgraTemplateRTL and ControllerRTL by passing DmaDataType and DmaCmdType as parameters, and updating related type definitions for improved clarity and consistency.

d2d7e7d

[Refactor] Update CgraDmaRTL to utilize DmaDramWrReqIfcRTL for DRAM write requests, enhancing type definitions for DmaCmdType and DmaDataType

28b263b

tancheng reviewed Jun 15, 2026

Comment thread controller/ControllerRTL.py Outdated

Comment thread controller/ControllerRTL.py Outdated

tancheng reviewed Jun 15, 2026

Comment thread cgra/CgraTemplateRTL.py Outdated

BenkangPeng added 2 commits June 15, 2026 10:19

[Fix] Fix the bitwidth mismatch error between DataType and DmaSpmDataType

3af7d8b

[CleanUp] Update DMA attribute references to use new constants for improved clarity

326167d

tancheng reviewed Jun 15, 2026

HobbitQia reviewed Jun 17, 2026

[Rename][NFC] Rename some variables for clarity

63b252f

BenkangPeng added 2 commits June 22, 2026 15:58

Add the assertion to ensure the number of tranfer data is the multiple of 4

bee2bfc

Add assertions to ensure that the number of bytes transferred by DMA is an integer multiple of 4

320c8ec

tancheng reviewed Jun 22, 2026

Comment thread cgra/CgraTemplateRTL.py Outdated

Comment thread cgra/IntegratedCgraWithDmaRTL.py

Comment thread cgra/IntegratedCgraWithDmaRTL.py Outdated

BenkangPeng added 8 commits June 23, 2026 10:38

[Refactor] Remove DmaWireIfcRTL and DmaSpmWireIfcRTL. Use ValRdyRecv/SendRTL to replace them.

711ae02

Split the dma_spm_to_dram into 3 signals.

3c19076

Deprecate the DmaSpmMasterRTL in DMA module

985fc98

Refactor DataMemControllerRTL to replace DmaSpmMinionIfcRTL with ValRdyRecv/SendIfcRTL for improved clarity and consistency in DMA signal handling.

abb4e75

Refactor CgraDmaRTL and CgraTemplateRTL to replace DmaSpmMinionIfcRTL with ValRdyRecv/SendIfcRTL

d17d42c

Add CgraDmaRTL wrapper integrating CGRA with DMA engine and corresponding tests

1427cda

Refactor CgraDmaRTL to replace DmaDramWrReqIfcRTL with new DMA DRAM write request type and update corresponding tests for consistency

28f75eb

Refactor DMA signal handling across multiple components to improve clarity and consistency by renaming signals related to memory requests and responses.

85eee45

tancheng reviewed Jun 23, 2026

tancheng approved these changes Jun 23, 2026
