DAOS-16935 API: Add GPU direct I/O support#18419

Draft
gnailzenh wants to merge 1 commit into master from liang-ttai/b_gpu_direct
@gnailzenh

Collaborator

Extend DAOS to support GPU direct RDMA I/O without changing the wire protocol or d_iov_t ABI. Key changes:

API layer:

  • Add daos_mem_type_t enum (HOST, CUDA, CUDA_MANAGED, ROCM, ZE)
  • Add daos_mem_attr_t side-channel struct for memory attributes
  • Add DAOS_OBJ_IO_GPU_DIRECT flag for existing fetch/update APIs
  • Add daos_obj_fetch_gpu()/daos_obj_update_gpu() wrappers
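
To illustrate how a side-channel attribute can describe GPU memory while the I/O vector layout stays untouched, here is a minimal sketch. Every `sketch_`/`SKETCH_` name, the `ma_type`/`ma_device` fields, and the flag bit value are assumptions made for illustration; the PR text names `daos_mem_type_t`, `daos_mem_attr_t` and `DAOS_OBJ_IO_GPU_DIRECT` but does not show their definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the memory types listed in the PR description. */
typedef enum {
    SKETCH_MEM_TYPE_HOST = 0,
    SKETCH_MEM_TYPE_CUDA,
    SKETCH_MEM_TYPE_CUDA_MANAGED,
    SKETCH_MEM_TYPE_ROCM,
    SKETCH_MEM_TYPE_ZE,
} sketch_mem_type_t;

/* Same three fields as d_iov_t; the layout is deliberately left alone. */
typedef struct {
    void   *iov_buf;
    size_t  iov_buf_len;
    size_t  iov_len;
} sketch_iov_t;

/* Side-channel attributes: one entry kept beside each I/O vector. */
typedef struct {
    sketch_mem_type_t ma_type;   /* where the buffer lives */
    int               ma_device; /* assumed device ordinal, -1 for host */
} sketch_mem_attr_t;

/* Assumed bit value; the real DAOS_OBJ_IO_GPU_DIRECT value is not in the PR text. */
#define SKETCH_IO_GPU_DIRECT (1ULL << 0)

/* Fill the iov and its attribute entry without changing the iov layout. */
static void
sketch_describe(sketch_iov_t *iov, sketch_mem_attr_t *attr, void *buf,
                size_t len, sketch_mem_type_t type, int device)
{
    assert(iov != NULL && attr != NULL);
    iov->iov_buf     = buf;
    iov->iov_buf_len = len;
    iov->iov_len     = len;
    attr->ma_type    = type;
    attr->ma_device  = (type == SKETCH_MEM_TYPE_HOST) ? -1 : device;
}
```

Because the memory type travels in a parallel array rather than inside the vector, callers that never touch GPU memory would see no layout change, which matches the stated goal of preserving the d_iov_t ABI.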

CaRT transport:

  • Add crt_bulk_create_with_mem_attr() for GPU-aware bulk handles
  • Forward memory type to Mercury via hg_bulk_attr
  • Add D_GPU_DIRECT env var and crt_mem_device_enabled() init hook
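
As a rough sketch of an environment-variable gate such as `D_GPU_DIRECT`, the helper below parses a value and reports whether GPU direct should be enabled. The accepted spellings ("1", "true", "on") and both function names are assumptions; the PR text does not state which values the variable accepts or how `crt_mem_device_enabled()` consumes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser. Assumption: "1", "true" or "on" enable the feature;
 * an unset variable or any other value leaves it off.
 */
static bool
sketch_parse_gpu_direct(const char *val)
{
    if (val == NULL)
        return false;
    return strcmp(val, "1") == 0 || strcmp(val, "true") == 0 ||
           strcmp(val, "on") == 0;
}

/* Hypothetical init-time check that reads the process environment once. */
static bool
sketch_gpu_direct_enabled(void)
{
    return sketch_parse_gpu_direct(getenv("D_GPU_DIRECT"));
}
```

Keeping the parser separate from the `getenv()` call makes the policy testable without mutating the process environment.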

Client object layer:

  • Validate GPU direct flag against mem_attrs
  • Propagate ORF_GPU_DIRECT to server RPCs
  • Use crt_bulk_create_with_mem_attr() for GPU buffers
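
One possible reading of "validate GPU direct flag against mem_attrs" is sketched below: the flag and the per-buffer memory types must agree. The exact rule is an assumption (the PR text does not spell it out), as are the `chk_`/`CHK_` names and the flag bit value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory types, mirroring the list in the PR description. */
typedef enum {
    CHK_MEM_HOST = 0,
    CHK_MEM_CUDA,
    CHK_MEM_CUDA_MANAGED,
    CHK_MEM_ROCM,
    CHK_MEM_ZE,
} chk_mem_type_t;

/* Assumed bit value for the GPU direct I/O flag. */
#define CHK_IO_GPU_DIRECT (1ULL << 0)

/*
 * Assumed policy:
 *  - flag set   -> at least one buffer must be device memory
 *  - flag clear -> every buffer must be host memory
 * Returns 0 when flag and attributes agree, -EINVAL otherwise.
 */
static int
chk_validate(uint64_t flags, const chk_mem_type_t *types, size_t nr)
{
    size_t i;
    size_t nr_dev = 0;

    assert(types != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        if (types[i] != CHK_MEM_HOST)
            nr_dev++;
    }

    if (flags & CHK_IO_GPU_DIRECT)
        return nr_dev > 0 ? 0 : -EINVAL; /* flag set, nothing on a device */
    return nr_dev == 0 ? 0 : -EINVAL;    /* device memory without the flag */
}
```

Rejecting mismatches on the client, before any RPC is built, keeps a wrongly described buffer from reaching bulk registration.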

Server object layer:

  • Add GPU direct observability (debug logs, telemetry counter)
  • No behavioral change (Mercury handles GPU RDMA transparently)
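
The observability-only role of the server can be pictured with the sketch below, which counts RPCs carrying a GPU direct flag and changes nothing else. It uses plain C11 atomics rather than the DAOS telemetry API, and the `tm_`/`TM_` names and the flag bit value are assumptions; the PR text only names `ORF_GPU_DIRECT` and mentions a telemetry counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed bit value standing in for ORF_GPU_DIRECT. */
#define TM_ORF_GPU_DIRECT (1U << 0)

/* Hypothetical counter; static storage starts at zero. */
static atomic_uint_fast64_t tm_gpu_direct_ios;

/* Count the RPC if it is flagged as GPU direct; the I/O path is unchanged. */
static void
tm_account_rpc(uint32_t orf_flags)
{
    if (orf_flags & TM_ORF_GPU_DIRECT)
        atomic_fetch_add_explicit(&tm_gpu_direct_ios, 1,
                                  memory_order_relaxed);
}

/* Read the current count, e.g. for a metrics dump. */
static uint64_t
tm_gpu_direct_count(void)
{
    return (uint64_t)atomic_load_explicit(&tm_gpu_direct_ios,
                                          memory_order_relaxed);
}
```

Relaxed ordering is enough here because the counter is a statistic and does not synchronize any other data.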

Build system:

  • Add BUILD_GPU_DIRECT SCons option (off by default)
  • Conditionally enable CUDA/GDRCopy in UCX and FI_HMEM in libfabric

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@github-actions

github-actions Bot commented Jun 3, 2026


Ticket title is 'dfs_write bad address from 32K buffer allocated with malloc_shared'
Status is 'In Progress'
Labels: 'ALCF,alcf_cluster'
Errors are Component should be lower-case
https://daosio.atlassian.net/browse/DAOS-16935

@daosbuild3

Collaborator
