DAOS-16935 API: Add GPU direct I/O support by gnailzenh · Pull Request #18419 · daos-stack/daos

gnailzenh · 2026-06-03T09:44:46Z

Extend DAOS to support GPU direct RDMA I/O without changing the wire protocol or d_iov_t ABI. Key changes:

API layer:

Add daos_mem_type_t enum (HOST, CUDA, CUDA_MANAGED, ROCM, ZE)
Add daos_mem_attr_t side-channel struct for memory attributes
Add DAOS_OBJ_IO_GPU_DIRECT flag for existing fetch/update APIs
Add daos_obj_fetch_gpu()/daos_obj_update_gpu() wrappers
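
For reviewers, the side-channel idea in the API layer can be sketched roughly as below. This is only an illustration: every `sketch_` name, the field names, and the enumerator values are assumptions and are not taken from the patch. Only the three-field iov layout mirrors the existing d_iov_t, which the PR leaves unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same three fields as the existing d_iov_t. */
typedef struct {
    void   *iov_buf;
    size_t  iov_buf_len;
    size_t  iov_len;
} sketch_iov_t;

/* The memory type lives outside the iov, so the iov layout does not grow. */
static_assert(sizeof(sketch_iov_t) == sizeof(void *) + 2 * sizeof(size_t),
              "iov layout must stay as it is");

/* Hypothetical enum; the member list follows the PR description. */
typedef enum {
    SKETCH_MEM_HOST = 0,
    SKETCH_MEM_CUDA,
    SKETCH_MEM_CUDA_MANAGED,
    SKETCH_MEM_ROCM,
    SKETCH_MEM_ZE,
} sketch_mem_type_t;

/* Hypothetical side-channel attribute, passed next to the iov. */
typedef struct {
    sketch_mem_type_t ma_type;
    uint32_t          ma_device; /* assumed field: device ordinal */
} sketch_mem_attr_t;

/* Return 1 if the attribute describes non-host memory, 0 otherwise. */
static int
sketch_mem_is_device(const sketch_mem_attr_t *attr)
{
    /* No attribute supplied: treat the buffer as plain host memory. */
    if (attr == NULL)
        return 0;
    return attr->ma_type != SKETCH_MEM_HOST;
}
```

Keeping the memory type in a separate struct is what would let existing callers that pass no attribute keep working as host-memory I/O.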

CaRT transport:

Add crt_bulk_create_with_mem_attr() for GPU-aware bulk handles
Forward memory type to Mercury via hg_bulk_attr
Add D_GPU_DIRECT env var and crt_mem_device_enabled() init hook
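
A minimal sketch of how an environment switch such as D_GPU_DIRECT might gate the feature at init time. The accepted values are an assumption (unset, empty, or "0" means off; anything else means on), and `sketch_gpu_direct_enabled()` is a hypothetical helper, not the crt_mem_device_enabled() hook added by the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: report whether GPU direct was requested through
 * the environment. Assumed semantics: unset, empty, or "0" disables it.
 */
static int
sketch_gpu_direct_enabled(void)
{
    const char *val = getenv("D_GPU_DIRECT");

    if (val == NULL || val[0] == '\0')
        return 0;
    return strcmp(val, "0") != 0;
}
```

Under these assumed semantics, a client would opt in by exporting `D_GPU_DIRECT=1` before starting.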

Client object layer:

Validate GPU direct flag against mem_attrs
Propagate ORF_GPU_DIRECT to server RPCs
Use crt_bulk_create_with_mem_attr() for GPU buffers
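
The client-side check and flag propagation could look roughly like the sketch below. The rule it enforces is an assumption made for illustration: the flag requires at least one device buffer, and device buffers require the flag. The bit values, the `sketch_` names, and the use of `-EINVAL` instead of a DAOS error code are likewise hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IO_GPU_DIRECT  (1ULL << 0) /* stand-in for the API flag */
#define SKETCH_ORF_GPU_DIRECT (1U << 0)   /* stand-in for the RPC flag */

typedef enum {
    SKETCH_BUF_HOST = 0,
    SKETCH_BUF_DEVICE,
} sketch_buf_kind_t;

/*
 * Validate the API flag against the per-buffer memory kinds and, on
 * success, set the RPC flag that would be sent to the server.
 * Returns 0 on success or -EINVAL on a flag/buffer mismatch.
 */
static int
sketch_validate_gpu_flag(uint64_t api_flags, const sketch_buf_kind_t *kinds,
                         size_t nr, uint32_t *rpc_flags)
{
    size_t i;
    size_t nr_dev = 0;

    /* Count device buffers; a NULL array means "all host memory". */
    for (i = 0; kinds != NULL && i < nr; i++) {
        if (kinds[i] != SKETCH_BUF_HOST)
            nr_dev++;
    }

    if (api_flags & SKETCH_IO_GPU_DIRECT) {
        if (nr_dev == 0)
            return -EINVAL; /* flag set but nothing lives on a device */
        *rpc_flags |= SKETCH_ORF_GPU_DIRECT;
        return 0;
    }

    /* Flag not set: device buffers are rejected rather than guessed at. */
    return nr_dev != 0 ? -EINVAL : 0;
}
```

Rejecting the mismatch on the client keeps a malformed request from ever reaching the bulk-handle creation path.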

Server object layer:

Add GPU direct observability (debug logs, telemetry counter)
No behavioral change (Mercury handles GPU RDMA transparently)

Build system:

Add BUILD_GPU_DIRECT SCons option (off by default)
Conditionally enable CUDA/GDRCopy in UCX and FI_HMEM in libfabric

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

github-actions · 2026-06-03T10:12:24Z

Ticket title is 'dfs_write bad address from 32K buffer allocated with malloc_shared'
Status is 'In Progress'
Labels: 'ALCF,alcf_cluster'
Errors are Component should be lower-case
https://daosio.atlassian.net/browse/DAOS-16935

daosbuild3 · 2026-06-03T10:29:18Z

Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18419/1/testReport/

gnailzenh wants to merge 1 commit into master from liang-ttai/b_gpu_direct
