[Feature] Implement DMA support#293

Open
BenkangPeng wants to merge 35 commits into
tancheng:masterfrom
BenkangPeng:dma-cgra

Conversation

@BenkangPeng

Collaborator

Related issue: coredac/CGRA-SoC#2

This PR introduces CgraDmaRTL, which integrates the CGRA with a DMA engine, enabling direct memory transfers between external DRAM (not implemented yet) and the CGRA's data SPM.

@BenkangPeng BenkangPeng requested review from HobbitQia and tancheng June 2, 2026 13:55
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py
Comment thread mem/dma/DmaEngineRTL.py
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment thread mem/data/DataMemControllerRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
@HobbitQia

Collaborator

Hi @tancheng @BenkangPeng, I summarized two directions for the DMA design below:

  • Rely on data controller

    • DMA is added as a new client of the DataMemControllerRTL, where data in the DMA engine communicates directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.

      To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.

    • Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.

    • Cons: Additional logic is required to feed DMA results into the control memory.

    [Image: diagram of the "Rely on data controller" option]
  • All in controller

    • All decoding logic is handled in the controller. The logic for handling the logic port should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.

      The packetizing logic should also be implemented in the controller module.

    • Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).

    • Cons: Introduces complex control logic in the controller; results in a slower path.

    [Image: diagram of the "All in controller" option]

I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

@tancheng

tancheng commented Jun 5, 2026

Owner

Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?

Comment thread cgra/CgraDmaRTL.py Outdated
@HobbitQia

Collaborator

> Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?

If we want to enable DMA and traditional load/store to run concurrently, then we need to multiplex the ports of the data SPM, and I think this logic can be implemented in DataMemController. Alternatively, we can transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we can write our logic entirely in the controller, but with higher latency. I think the former method may give better performance with minimal additions to DataMemController.

@tancheng

tancheng commented Jun 7, 2026

Owner

> > Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?
>
> If we want to enable DMA and traditional load/store to run concurrently, then we need to multiplex the ports of the data SPM, and I think this logic can be implemented in DataMemController. Alternatively, we can transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we can write our logic entirely in the controller, but with higher latency. I think the former method may give better performance with minimal additions to DataMemController.

I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?

Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2, WDYT?

@HobbitQia

Collaborator

> > > Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?
> >
> > If we want to enable DMA and traditional load/store to run concurrently, then we need to multiplex the ports of the data SPM, and I think this logic can be implemented in DataMemController. Alternatively, we can transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we can write our logic entirely in the controller, but with higher latency. I think the former method may give better performance with minimal additions to DataMemController.
>
> I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?
>
> Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2, WDYT?

If the DMA data goes through the controller packet path, there may be extra latency from packetizing, and there may be contention between NoC/CPU/tile requests to the SPM? Or do we have two separate paths in the controller?

@tancheng

tancheng commented Jun 8, 2026

Owner

> there may be extra latency from packetizing

Packing can just be combinational logic before putting the packet into a queue.

> and there may be contention between NoC/CPU/tile requests to the SPM? Or do we have two separate paths in the controller?

I was thinking about separate paths, which is why I mentioned FROM_CPU and FROM_NOC. I feel that requests targeting the same SPM bank will conflict with each other anyway, WDYT?
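
To illustrate the point that packing can be combinational logic in front of a queue, here is a minimal plain-Python sketch (a behavioral model, not PyMTL3 RTL). `StorePkt`, `pack_store`, and `enqueue_dma_word` are hypothetical names, and the packet fields are an assumption rather than the PR's actual packet type.

```python
from collections import deque
from dataclasses import dataclass


# Hypothetical packet layout; the real packet type in the PR may differ.
@dataclass(frozen=True)
class StorePkt:
    cmd: str
    addr: int
    data: int


def pack_store(addr: int, data: int) -> StorePkt:
    """Pure function of its inputs, modeling combinational packing logic."""
    return StorePkt(cmd="CMD_STORE_REQUEST", addr=addr, data=data)


def enqueue_dma_word(queue: deque, addr: int, data: int) -> None:
    # Packing happens in the same step as the enqueue, so in this model
    # only the queue itself would add latency, not the packing.
    queue.append(pack_store(addr, data))


queue = deque()
enqueue_dma_word(queue, addr=0x10, data=0xABCD)
```

Because `pack_store` holds no state, it corresponds to logic that could sit directly on the queue's input in hardware.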

@HobbitQia

Collaborator

> > there may be extra latency from packetizing
>
> Packing can just be combinational logic before putting the packet into a queue.
>
> > and there may be contention between NoC/CPU/tile requests to the SPM? Or do we have two separate paths in the controller?
>
> I was thinking about separate paths, which is why I mentioned FROM_CPU and FROM_NOC. I feel that requests targeting the same SPM bank will conflict with each other anyway, WDYT?

Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?
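
The conflict handling discussed here could look roughly like the following plain-Python sketch, which grants at most one request per SPM bank per cycle. The function name, the source labels, and the fixed-priority policy are all assumptions for illustration; the PR does not specify an arbitration policy.

```python
def arbitrate_bank_requests(requests, num_banks, priority):
    """Grant at most one request per SPM bank in a single cycle.

    requests: dict mapping a source name to its target bank index.
    priority: list of source names, highest priority first. Fixed priority
              is an assumption; round-robin would fit this sketch equally.
    Returns (granted, stalled), both dicts of source -> bank.
    Sources that are not listed in `priority` are ignored.
    """
    granted, stalled = {}, {}
    busy = [False] * num_banks
    for src in priority:
        if src not in requests:
            continue
        bank = requests[src]
        if busy[bank]:
            stalled[src] = bank  # Bank already taken this cycle; retry later.
        else:
            busy[bank] = True
            granted[src] = bank
    return granted, stalled


granted, stalled = arbitrate_bank_requests(
    {"tile": 1, "dma": 1, "noc": 0},
    num_banks=2,
    priority=["tile", "noc", "dma"],
)
```

In this example the tile and DMA requests both target bank 1, so the DMA request stalls, while the NoC request to bank 0 proceeds in the same cycle.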

@tancheng

tancheng commented Jun 8, 2026

Owner

> > > there may be extra latency from packetizing
> >
> > Packing can just be combinational logic before putting the packet into a queue.
> >
> > > and there may be contention between NoC/CPU/tile requests to the SPM? Or do we have two separate paths in the controller?
> >
> > I was thinking about separate paths, which is why I mentioned FROM_CPU and FROM_NOC. I feel that requests targeting the same SPM bank will conflict with each other anyway, WDYT?
>
> Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?

Oh, we don't need to distinguish FROM_CPU and FROM_NOC; we can decompose the requests from the CPU like:

s.recv_from_cpu_pkt_queue.send.rdy @= s.crossbar.recv[kFromCpuCtrlAndDataIdx].rdy

…interface for enhanced data transfer capabilities.
…te requests and adjust related signal handling for clarity and consistency.
…TL by passing DmaDataType and DmaCmdType as parameters, and updating related type definitions for improved clarity and consistency.
…rite requests, enhancing type definitions for DmaCmdType and DmaDataType
Comment thread controller/ControllerRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment thread cgra/IntegratedCgraWithDmaRTL.py
Comment thread lib/basic/val_rdy/ifcs.py Outdated
Comment thread lib/basic/val_rdy/ifcs.py Outdated
Comment thread lib/basic/val_rdy/ifcs.py Outdated
Comment thread lib/util/common.py Outdated
Comment thread lib/messages.py

return mk_bitstruct(new_name, {
'dram_data': DramDataType,
'dram_mask': DramMaskType,

Owner

Can you add a comment explaining what dram_mask is?

Comment thread lib/messages.py
'dram_data': DramDataType,
'dram_mask': DramMaskType,
'spm_data': SpmDataType,
'spm_mask': SpmMaskType,

Owner

Can you add a comment explaining what spm_mask is?

@tancheng

Owner

> Hi @tancheng @BenkangPeng, I summarized two directions for the DMA design below:
>
>   • Rely on data controller
>
>     • DMA is added as a new client of the DataMemControllerRTL, where data in the DMA engine communicates directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.
>       To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.
>
>     • Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.
>
>     • Cons: Additional logic is required to feed DMA results into the control memory.
>
>       [Image: diagram of the "Rely on data controller" option]
>   • All in controller
>
>     • All decoding logic is handled in the controller. The logic for handling the logic port should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.
>       The packetizing logic should also be implemented in the controller module.
>
>     • Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).
>
>     • Cons: Introduces complex control logic in the controller; results in a slower path.
>
>       [Image: diagram of the "All in controller" option]
>
> I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

We should also include the figure 2 into our

@tancheng

Owner

@BenkangPeng, since you introduce IntegratedDmaWithCgraRTL, we should update this diagram so that "Controller", "DataMemController", "SPM", and "Control SPM" are wrapped by a CGRA box, and the entire diagram (except the "CPU") is wrapped by an "IntegratedDmaWithCgra" box.

# # TODO: Handle other cmd types.
# assert(False)

if has_dma_ports & s.dma_done.val:

Collaborator

I am thinking about the possible conflict between CMD_DMA_DONE and CMD_COMPLETE, or other commands that are sent back to the CPU. Maybe we need some logic to resolve these conflicts if CMD_DMA_DONE and CMD_COMPLETE are both valid in the same cycle?
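
One possible way to handle both responses becoming valid in the same cycle is to buffer them and send one per cycle. The plain-Python sketch below models that idea; `CpuResponseArbiter` is a hypothetical name, and the ordering (CMD_COMPLETE before CMD_DMA_DONE) is an assumption, not something decided in this PR.

```python
from collections import deque


class CpuResponseArbiter:
    """Behavioral model: at most one response goes to the CPU per cycle."""

    def __init__(self):
        self.pending = deque()

    def tick(self, complete_valid: bool, dma_done_valid: bool):
        # Assumed fixed order: CMD_COMPLETE is queued ahead of CMD_DMA_DONE
        # when both are valid in the same cycle.
        if complete_valid:
            self.pending.append("CMD_COMPLETE")
        if dma_done_valid:
            self.pending.append("CMD_DMA_DONE")
        # Send the oldest pending response, or nothing if none is pending.
        return self.pending.popleft() if self.pending else None


arb = CpuResponseArbiter()
first = arb.tick(complete_valid=True, dma_done_valid=True)
second = arb.tick(complete_valid=False, dma_done_valid=False)
```

Here neither response is dropped: the one that loses arbitration is held and sent in the following cycle. A real implementation would also need a bounded queue with backpressure.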

Comment thread mem/dma/DmaEngineRTL.py Outdated
s.opcode_ff <<= s.dma_cmd.msg.opcode
s.dram_addr_ff <<= s.dma_cmd.msg.dram_addr
s.spm_addr_ff <<= s.dma_cmd.msg.spm_addr
s.words_left_ff <<= s.dma_cmd.msg.nbytes >> 2 # Converts the transfer size from bytes to words.

Collaborator

There may be a truncation here: if we want to transfer 3 bytes, then words_left_ff = 0, so no data will be transferred.

For now, I think we don't need to consider such fine-grained data movement, since DMA usually handles large-scale data. Can we add an assertion or a check to ensure nbytes % 4 == 0?
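
The truncation and the suggested check can be sketched in plain Python as follows. `bytes_to_words` and `WORD_BYTES` are hypothetical names used only for illustration; the actual check in the RTL may take a different form.

```python
WORD_BYTES = 4  # One word is 4 bytes, matching the `>> 2` in the snippet above.


def bytes_to_words(nbytes: int) -> int:
    """Convert a transfer size in bytes to words, rejecting partial words."""
    if nbytes % WORD_BYTES != 0:
        # Without this check, 3 >> 2 == 0 and nothing would be transferred.
        raise ValueError(f"nbytes={nbytes} is not a multiple of {WORD_BYTES}")
    return nbytes >> 2


print(bytes_to_words(16))  # 4
```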

Collaborator Author

> There may be a truncation here: if we want to transfer 3 bytes, then words_left_ff = 0, so no data will be transferred.
>
> For now, I think we don't need to consider such fine-grained data movement, since DMA usually handles large-scale data. Can we add an assertion or a check to ensure nbytes % 4 == 0?

Yes, I overlooked it. I will add an assertion to enforce it.

Collaborator Author

Now the data transfer granularity between DRAM and SPM is 1 word (4 bytes), with a mask of 1 bit per word.
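
As a rough illustration of a one-bit-per-word mask, the plain-Python sketch below writes only the words whose mask bit is set. The function name, the word-addressed list standing in for the SPM, and the convention that mask bit i guards word i are assumptions, not taken from the PR.

```python
def masked_write(spm, base_word_addr, words, mask):
    """Write words[i] to spm[base_word_addr + i] only if mask bit i is 1."""
    for i, word in enumerate(words):
        if (mask >> i) & 1:
            spm[base_word_addr + i] = word


spm = [0, 0, 0, 0]
# Mask 0b10 enables only the second word, so 0xA is skipped.
masked_write(spm, base_word_addr=1, words=[0xA, 0xB], mask=0b10)
```

With a granularity of one word per transfer, as described above, `words` would hold a single entry and the mask would be a single bit.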

bank_index_store_from_dma = trunc((recv_waddr_from_dma - s.address_lower) >> per_bank_addr_nbits, XbarOutWrType)
else:
bank_index_store_from_dma = XbarOutWrType(num_banks_per_cgra)
s.wr_pkt[dma_wr_idx] @= MemWritePktType(dma_wr_idx, # src

Collaborator

Is this mask actually used?

@HobbitQia

Collaborator

I summarized the TODOs and the items that need clarification below. If anything is missing, please feel free to add it. @tancheng @BenkangPeng

This PR should address the following items:

  • Potential conflict between CMD_DMA_DONE and CMD_COMPLETE. (Marked as a future TODO)

  • Double-check: synthesizability of constructs like if has_dma_ports & s.dma_done.val:, and correctness of the generated Verilog logic.

  • Multi-CGRA support: a single DMA engine forwards data to the controller, which routes the data to the corresponding CGRA instances. (Marked as a future TODO)

  • Illustrate the complete data paths:

    • Forward path: DRAM → DMA Engine → Controller → Data SPM Controller
    • Reverse path: Data SPM Controller → Controller → DMA Engine → DRAM

    Details to be clarified:

    • Data granularity (number of words transferred per cycle)
    • Transfer mode (serial vs. batch-parallel) and data packetization
    • Feasibility of extending the current implementation to support single-packet transfer instead of multi-cycle serial transfer
    • Internal FSM design
    • Update the architecture diagram with the above details
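
Since the internal FSM is still to be clarified, the following plain-Python model is only one possible shape for the forward path (DRAM to SPM), moving one word per cycle. The state names, the class name, and the word-addressed lists standing in for DRAM and SPM are all assumptions; the controller hop in the real path is omitted.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    TRANSFER = 1
    DONE = 2


class DmaMoveInModel:
    """Hypothetical serial move-in: one word per tick from dram to spm."""

    def __init__(self, dram, spm):
        self.dram, self.spm = dram, spm
        self.state = State.IDLE
        self.dram_addr = self.spm_addr = self.words_left = 0

    def start(self, dram_addr, spm_addr, nbytes):
        assert self.state == State.IDLE and nbytes % 4 == 0
        self.dram_addr, self.spm_addr = dram_addr, spm_addr
        self.words_left = nbytes >> 2  # Bytes to words, as in the PR snippet.
        self.state = State.TRANSFER if self.words_left else State.DONE

    def tick(self):
        if self.state == State.TRANSFER:
            self.spm[self.spm_addr] = self.dram[self.dram_addr]
            self.dram_addr += 1
            self.spm_addr += 1
            self.words_left -= 1
            if self.words_left == 0:
                self.state = State.DONE
        elif self.state == State.DONE:
            self.state = State.IDLE  # The done signal lasts one cycle here.


dram, spm = [1, 2, 3, 4], [0, 0, 0, 0]
dma = DmaMoveInModel(dram, spm)
dma.start(dram_addr=1, spm_addr=0, nbytes=8)
dma.tick()
dma.tick()
```

After the two ticks, two words have been copied and the model sits in `DONE`. A batch-parallel or single-packet variant would move several words per tick instead.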

Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread cgra/IntegratedCgraWithDmaRTL.py
Comment thread cgra/IntegratedCgraWithDmaRTL.py Outdated
kAttrOpcode = 'opcode'
kAttrDramAddr = 'dram_addr'
kAttrNBytes = 'nbytes'
kAttrTag = 'tag'

Owner

tag -> dram_tag, and comment on what this tag is used for?

Collaborator Author

tag -> dma_tag, right?

This tag isn't used now. Maybe we will use it to distinguish different DMA commands? @HobbitQia

Owner

Right, dma_tag sounds good. You can then leave a comment like:

This tag isn't used now. We may use it to distinguish different DMA commands.

kAttrNBytes = 'nbytes'
kAttrTag = 'tag'
kAttrSpmAddr = 'spm_addr'
kAttrSpmData = 'spm_data'

Owner

We already have kAttrDataAddr, but we are now distinguishing it from DRAM data. Can you file an issue and do the cleanup (consolidating these attributes) in a later PR?

Collaborator Author

Oh, do you mean that we should rename kAttrDataAddr/kAttrData to kAttrSpmAddr/kAttrSpmData, since kAttrDataAddr/kAttrData in the current codebase actually refer to the address/data of the SPM? Is my understanding correct?

Owner

Right, I feel some kAttr need to be refactored, but we don't need to do that in this PR.

Comment thread cgra/IntegratedCgraWithDmaRTL.py Outdated
Comment on lines +159 to +161
s.recv_from_ctrl_spm_wr_req = RecvIfcRTL(DmaSpmWriteReqType)
s.recv_from_ctrl_spm_rd_req = RecvIfcRTL(DmaSpmReadReqType)
s.send_to_ctrl_spm_rd_resp = SendIfcRTL(DmaSpmReadRespType)

Owner

Why do we need to access ctrl_spm in DataMemControllerRTL?

Collaborator Author

recv_from_ctrl_spm_wr_req means receiving SPM write requests from the Controller.

We connect the Controller to the DataMemController, and the SPM's wr_req, rd_req, and rd_resp are carried over these 3 ports.

I will update the current design diagram ASAP.

Owner

Then please rename it to recv_from_controller_spm_wr_req.

The current name sounds like:

s.send_to_ctrl_ring_pkt.msg @= \

BTW, how do we send to ctrl_ring then? We don't yet distinguish ctrl from data?
