Use caller CUDA stream for D2H and H2D copies (#20498)

Open
Conarnar wants to merge 1 commit into pytorch:main from Conarnar:export-D109590531

Conarnar (Contributor) commented Jun 24, 2026

Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via getCallerStream()), copy_host_to_device and copy_device_to_host use cudaMemcpyAsync. When no caller stream is set, the synchronous cudaMemcpy path is used as before.

Additionally:

  • Added null-pointer and zero-byte validation: a null dst or src returns Error::InvalidArgument instead of aborting in cudaMemcpy, and zero-byte copies return Error::Ok early.
  • Restricted copies to the single-GPU case (device index 0 or -1) until multi-GPU stream validation is added.
  • Wired the //executorch/extension/cuda:caller_stream dependency in TARGETS.
  • Added extension_cuda dependencies to CMakeLists.txt.
  • Added test_cuda_allocator with coverage for sync/async paths and error handling.
  • Added CI jobs for the unit tests.
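
The control flow described in the summary can be sketched without a GPU. This is a host-only model, not the code in this PR: `Error`, `Stream`, `Path`, `CopyResult`, and `copy_host_to_device_model` are hypothetical names, `std::memcpy` stands in for both `cudaMemcpy` and `cudaMemcpyAsync`, and the order of the three checks is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins; the real code uses ExecuTorch's Error type and
// cudaStream_t.
enum class Error { Ok, InvalidArgument };
struct Stream { int async_copies = 0; };

// Records which branch a copy took so the sketch can be inspected.
enum class Path { None, Sync, Async };
struct CopyResult {
  Error error;
  Path path;
};

// Validate first, then pick the async or sync path. A null caller_stream
// models "no caller stream set".
inline CopyResult copy_host_to_device_model(
    void* dst, const void* src, std::size_t nbytes, int index,
    Stream* caller_stream) {
  if (dst == nullptr || src == nullptr) {
    return {Error::InvalidArgument, Path::None};  // no crash on null
  }
  if (nbytes == 0) {
    return {Error::Ok, Path::None};  // nothing to copy, return early
  }
  if (index != -1 && index != 0) {
    return {Error::InvalidArgument, Path::None};  // single-GPU only for now
  }
  std::memcpy(dst, src, nbytes);  // stand-in for the CUDA copy
  if (caller_stream != nullptr) {
    ++caller_stream->async_copies;  // would be cudaMemcpyAsync on the stream
    return {Error::Ok, Path::Async};
  }
  return {Error::Ok, Path::Sync};  // would be plain cudaMemcpy
}

// Usage: each described case, exercised once.
inline void demo_copy_model() {
  unsigned char src[4] = {1, 2, 3, 4};
  unsigned char dst[4] = {0, 0, 0, 0};
  Stream stream;
  assert(copy_host_to_device_model(nullptr, src, 4, 0, nullptr).error ==
         Error::InvalidArgument);
  assert(copy_host_to_device_model(dst, src, 0, 0, nullptr).error == Error::Ok);
  assert(copy_host_to_device_model(dst, src, 4, 1, nullptr).error ==
         Error::InvalidArgument);
  assert(copy_host_to_device_model(dst, src, 4, -1, nullptr).path == Path::Sync);
  assert(copy_host_to_device_model(dst, src, 4, 0, &stream).path == Path::Async);
  assert(stream.async_copies == 1 && dst[3] == 4);
}
```

Returning the path alongside the error is only for illustration; the allocator methods shown in this thread return an `Error` alone.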

Reviewed By: Gasoonjia

Differential Revision: D109590531

Copilot AI review requested due to automatic review settings June 24, 2026 22:51
pytorch-bot Bot commented Jun 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 533e5de with merge base 55a71e6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label Jun 24, 2026
meta-codesync Bot (Contributor) commented Jun 24, 2026

@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync Bot changed the title from "Use caller CUDA stream for D2H and H2D copies" to "Use caller CUDA stream for D2H and H2D copies (#20498)" Jun 24, 2026
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 24, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch 2 times, most recently from 3d8da75 to 07765c3 Compare June 25, 2026 17:10
Copilot AI review requested due to automatic review settings June 25, 2026 17:10
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment on lines +161 to +168
cudaError_t err = cudaSuccess;
const auto caller_stream = executorch::extension::cuda::getCallerStream();
if (caller_stream) {
  err = cudaMemcpyAsync(
      dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
} else {
  err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
}
Comment on lines +144 to +150
// TODO: validate caller stream device matches index.
// For now assert single-GPU case.
ET_CHECK_OR_RETURN_ERROR(
    index == -1 || index == 0,
    InvalidArgument,
    "CudaAllocator::copy_host_to_device only supports device 0, got %d",
    static_cast<int>(index));
Comment on lines +202 to +208
// TODO: validate caller stream device matches index.
// For now assert single-GPU case.
ET_CHECK_OR_RETURN_ERROR(
    index == -1 || index == 0,
    InvalidArgument,
    "CudaAllocator::copy_device_to_host only supports device 0, got %d",
    static_cast<int>(index));
Comment on lines +78 to +90
cudaStream_t s;
ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
executorch::extension::cuda::CallerStreamGuard g(s);

CudaAllocator& a = CudaAllocator::instance();
auto res = a.allocate(256, 0);
ASSERT_TRUE(res.ok());
void* d = res.get();
std::vector<uint8_t> h(256, 7);
// should take async branch internally, still return Ok
EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok);
a.deallocate(d, 0);
cudaStreamDestroy(s);
Comment on lines +103 to +117
cudaStream_t s;
ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
executorch::extension::cuda::CallerStreamGuard g(s);

CudaAllocator& a = CudaAllocator::instance();
auto res = a.allocate(256, 0);
ASSERT_TRUE(res.ok());
void* d = res.get();
std::vector<uint8_t> h_src(256, 5), h_dst(256, 0);
ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
EXPECT_EQ(h_src, h_dst);

a.deallocate(d, 0);
cudaStreamDestroy(s);
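
The tests above install a stream with `CallerStreamGuard`, and the allocator reads it back through `getCallerStream()`, treating the result like an optional (`if (caller_stream)` followed by `*caller_stream`). The thread does not show how either is implemented. The following is a minimal sketch of one way such a scoped guard could work, assuming a thread-local slot and nesting support. Both are assumptions. `StreamHandle` is a hypothetical stand-in for `cudaStream_t`, and everything lives in a made-up `sketch` namespace.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for cudaStream_t so the sketch builds without CUDA.
using StreamHandle = int;

namespace sketch {

// One slot per thread: a guard on one thread does not affect another.
inline std::optional<StreamHandle>& slot() {
  thread_local std::optional<StreamHandle> current;
  return current;
}

// Returns the stream installed by the innermost live guard, if any.
inline std::optional<StreamHandle> getCallerStream() {
  return slot();
}

// RAII guard: installs a stream on construction and restores the previous
// value on destruction, so guards can nest.
class CallerStreamGuard {
 public:
  explicit CallerStreamGuard(StreamHandle s) : previous_(slot()) {
    slot() = s;
  }
  ~CallerStreamGuard() { slot() = previous_; }
  CallerStreamGuard(const CallerStreamGuard&) = delete;
  CallerStreamGuard& operator=(const CallerStreamGuard&) = delete;

 private:
  std::optional<StreamHandle> previous_;
};

}  // namespace sketch

// Usage: the slot is empty outside any guard and tracks the innermost one.
inline void demo_guard() {
  assert(!sketch::getCallerStream());
  {
    sketch::CallerStreamGuard outer(1);
    assert(*sketch::getCallerStream() == 1);
    {
      sketch::CallerStreamGuard inner(2);
      assert(*sketch::getCallerStream() == 2);
    }
    assert(*sketch::getCallerStream() == 1);
  }
  assert(!sketch::getCallerStream());
}
```

With a guard of this shape, the stream must stay valid for as long as the guard is alive, since any copy issued in that scope would be queued on it.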
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from 07765c3 to b316b71 Compare June 25, 2026 17:59
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from b316b71 to 98081dc Compare June 25, 2026 18:57
Copilot AI review requested due to automatic review settings June 25, 2026 18:57

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment on lines +144 to +150
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
    index == -1 || index == 0,
    InvalidArgument,
    "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d",
    static_cast<int>(index));
Comment on lines +204 to +210
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
    index == -1 || index == 0,
    InvalidArgument,
    "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d",
    static_cast<int>(index));
Comment on lines +161 to +166
cudaError_t err = cudaSuccess;
const auto caller_stream = executorch::extension::cuda::getCallerStream();
if (caller_stream) {
  err = cudaMemcpyAsync(
      dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
  // We don't synchronize the stream here because the caller is expected to
Comment on lines +223 to +228
if (caller_stream) {
  err = cudaMemcpyAsync(
      dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream);
  if (err == cudaSuccess) {
    err = cudaStreamSynchronize(*caller_stream);
  }
Comment on lines +116 to +118
ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
EXPECT_EQ(h_src, h_dst);
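
In the excerpts above, the H2D path only enqueues the copy on the caller stream, while the D2H path enqueues and then calls `cudaStreamSynchronize`. The round-trip test can therefore compare `h_src` and `h_dst` as soon as `copy_device_to_host` returns. The sketch below models that ordering on the host, assuming an in-order queue where work runs only when the stream is drained. `ModelStream`, `h2d_async`, and `d2h_then_sync` are hypothetical names, and the model ignores details of real CUDA behavior such as pinned versus pageable host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical host-only model of an in-order stream: enqueue() defers work,
// synchronize() runs everything queued so far, in order.
class ModelStream {
 public:
  void enqueue(std::function<void()> op) { pending_.push(std::move(op)); }
  void synchronize() {
    while (!pending_.empty()) {
      pending_.front()();
      pending_.pop();
    }
  }
  std::size_t pending() const { return pending_.size(); }

 private:
  std::queue<std::function<void()>> pending_;
};

// H2D in this model: enqueue only. The source buffer must stay alive and
// unchanged until the caller synchronizes.
inline void h2d_async(
    ModelStream& s, void* dev, const void* host, std::size_t n) {
  s.enqueue([=] { std::memcpy(dev, host, n); });
}

// D2H in this model: enqueue, then drain, so the host buffer is valid on
// return. Earlier work on the same stream runs first.
inline void d2h_then_sync(
    ModelStream& s, void* host, const void* dev, std::size_t n) {
  s.enqueue([=] { std::memcpy(host, dev, n); });
  s.synchronize();
}

// Usage: mirrors the round-trip test with 256 bytes of value 5.
inline void demo_round_trip() {
  ModelStream s;
  std::vector<unsigned char> src(256, 5), dev(256, 0), dst(256, 0);
  h2d_async(s, dev.data(), src.data(), src.size());
  assert(s.pending() == 1);  // queued, not yet executed
  assert(dev[0] == 0);       // "device" buffer still untouched
  d2h_then_sync(s, dst.data(), dev.data(), dst.size());
  assert(s.pending() == 0);  // both copies ran, H2D before D2H
  assert(src == dst);
}
```

Without the final drain, `dst` would still hold zeros in this model. That is the hazard the synchronize call in the D2H excerpt guards against.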
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from 98081dc to 1e001a5 Compare June 25, 2026 20:50
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from c657616 to fd2e388 Compare June 26, 2026 21:45
Copilot AI review requested due to automatic review settings June 26, 2026 21:45

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 26, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from fd2e388 to 9bfc44e Compare June 26, 2026 22:50
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 27, 2026
Copilot AI review requested due to automatic review settings June 27, 2026 10:24
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 27, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch 2 times, most recently from 32968c0 to 056c25c Compare June 27, 2026 10:24

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 27, 2026
Summary:
Pull Request resolved: pytorch#20498

@Conarnar Conarnar force-pushed the export-D109590531 branch from 056c25c to 96b9452 Compare June 27, 2026 10:26
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 27, 2026
Copilot AI review requested due to automatic review settings June 27, 2026 19:34
@Conarnar Conarnar force-pushed the export-D109590531 branch from 96b9452 to f042310 Compare June 27, 2026 19:34
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 27, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from f042310 to 0dd28cc Compare June 27, 2026 19:34

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@Conarnar Conarnar force-pushed the export-D109590531 branch from 0dd28cc to f042310 Compare June 27, 2026 19:34
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 28, 2026
Copilot AI review requested due to automatic review settings June 28, 2026 00:10
@Conarnar Conarnar force-pushed the export-D109590531 branch from f042310 to 6ec0025 Compare June 28, 2026 00:10

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 28, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from 6ec0025 to 4b6fde9 Compare June 28, 2026 00:40
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 28, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from 4b6fde9 to 5bfd1b4 Compare June 28, 2026 03:05
@Conarnar Conarnar force-pushed the export-D109590531 branch from 5bfd1b4 to 533e5de Compare June 28, 2026 03:07
Labels

ciflow/cuda, CLA Signed, meta-exported

4 participants