Use caller CUDA stream for D2H and H2D copies (#20498) by Conarnar · Pull Request #20498 · pytorch/executorch

Conarnar · 2026-06-24T22:51:11Z

Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via getCallerStream()), copy_host_to_device and copy_device_to_host use cudaMemcpyAsync. When no caller stream is set, the synchronous cudaMemcpy path is used as before.

Additionally:

- Added null-pointer and zero-byte validation: a null `dst` or `src` returns `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Asserted the single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired the `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for the sync/async paths and error handling.
- Added CI jobs for the unit tests.
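
The validation and dispatch rules listed above can be sketched in plain C++ without a GPU. This is a minimal illustration, not the code from the PR: `copy_sketch`, `Path`, `CopyResult`, and `FakeStream` are hypothetical names, `std::memcpy` stands in for both `cudaMemcpy` and `cudaMemcpyAsync`, and the relative order of the checks is an assumption, since the summary does not state it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>

// Hypothetical stand-ins; these are not the ExecuTorch or CUDA runtime types.
enum class Error { Ok, InvalidArgument };
enum class Path { None, Sync, Async };
using FakeStream = int;

struct CopyResult {
  Error error;  // what the copy would return
  Path path;    // which branch was taken (None if no copy was issued)
};

// Assumed check order: device index, then null pointers, then zero bytes.
CopyResult copy_sketch(
    void* dst,
    const void* src,
    std::size_t nbytes,
    int index,
    std::optional<FakeStream> caller_stream) {
  // Single-GPU restriction: only index -1 (current device) or 0 is accepted.
  if (index != -1 && index != 0) {
    return {Error::InvalidArgument, Path::None};
  }
  // Null pointers are rejected instead of being handed to the copy call.
  if (dst == nullptr || src == nullptr) {
    return {Error::InvalidArgument, Path::None};
  }
  // Zero-byte copies succeed early without issuing any copy.
  if (nbytes == 0) {
    return {Error::Ok, Path::None};
  }
  // std::memcpy stands in for cudaMemcpyAsync / cudaMemcpy here.
  std::memcpy(dst, src, nbytes);
  return {Error::Ok, caller_stream.has_value() ? Path::Async : Path::Sync};
}
```

With no caller stream, `copy_sketch(dst, src, 4, 0, std::nullopt)` reports `Path::Sync`; passing any stream value reports `Path::Async`.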

Reviewed By: Gasoonjia

Differential Revision: D109590531

pytorch-bot · 2026-06-24T22:51:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 533e5de with merge base 55a71e6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-24T22:51:21Z

@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

github-actions · 2026-06-24T22:52:05Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning, preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:

- Added null pointer and zero-byte validation: null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

+  cudaError_t err = cudaSuccess;
+  const auto caller_stream = executorch::extension::cuda::getCallerStream();
+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
+  } else {
+    err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
+  }


+  // TODO: validate caller stream device matches index.
+  // For now assert single-GPU case.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_host_to_device only supports device 0, got %d",
+      static_cast<int>(index));


+  // TODO: validate caller stream device matches index.
+  // For now assert single-GPU case.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_device_to_host only supports device 0, got %d",
+      static_cast<int>(index));


+  cudaStream_t s;
+  ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
+  executorch::extension::cuda::CallerStreamGuard g(s);
+
+  CudaAllocator& a = CudaAllocator::instance();
+  auto res = a.allocate(256, 0);
+  ASSERT_TRUE(res.ok());
+  void* d = res.get();
+  std::vector<uint8_t> h(256, 7);
+  // should take async branch internally, still return Ok
+  EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok);
+  a.deallocate(d, 0);
+  cudaStreamDestroy(s);


+  cudaStream_t s;
+  ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
+  executorch::extension::cuda::CallerStreamGuard g(s);
+
+  CudaAllocator& a = CudaAllocator::instance();
+  auto res = a.allocate(256, 0);
+  ASSERT_TRUE(res.ok());
+  void* d = res.get();
+  std::vector<uint8_t> h_src(256, 5), h_dst(256, 0);
+  ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
+  EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
+  EXPECT_EQ(h_src, h_dst);
+
+  a.deallocate(d, 0);
+  cudaStreamDestroy(s);
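
The tests above install a stream with `CallerStreamGuard` and rely on the allocator picking it up through `getCallerStream()`. The real implementation in `extension/cuda` is not shown on this page; the following is only one plausible shape for such a scoped guard, written with hypothetical names (`sketch::StreamGuard`, `sketch::get_caller_stream`) and an `int` standing in for `cudaStream_t`.

```cpp
#include <cassert>
#include <optional>

namespace sketch {

using Stream = int;  // hypothetical stand-in for cudaStream_t

// Per-thread slot holding the stream installed by the innermost live guard.
inline std::optional<Stream>& slot() {
  thread_local std::optional<Stream> current;
  return current;
}

// Returns the caller stream for this thread, or nullopt if none is set.
inline std::optional<Stream> get_caller_stream() {
  return slot();
}

// RAII guard: installs a stream on construction and restores the previous
// value on destruction, so nested guards unwind in order.
class StreamGuard {
 public:
  explicit StreamGuard(Stream s) : previous_(slot()) {
    slot() = s;
  }
  ~StreamGuard() {
    slot() = previous_;
  }
  StreamGuard(const StreamGuard&) = delete;
  StreamGuard& operator=(const StreamGuard&) = delete;

 private:
  std::optional<Stream> previous_;
};

}  // namespace sketch
```

Under this model, a copy issued while a guard is alive would see a stream and take the async branch, and a copy issued after the guard goes out of scope would fall back to the synchronous path.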


Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:

- Added null pointer and zero-byte validation: null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d",
+      static_cast<int>(index));


+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d",
+      static_cast<int>(index));


+  cudaError_t err = cudaSuccess;
+  const auto caller_stream = executorch::extension::cuda::getCallerStream();
+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
+    // We don't synchronize the stream here because the caller is expected to


+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream);
+    if (err == cudaSuccess) {
+      err = cudaStreamSynchronize(*caller_stream);
+    }
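
One reading of the hunk above is that the device-to-host path synchronizes because the host buffer is usually read as soon as the call returns. The sketch below models that with a toy queue, under loud assumptions: `FakeStream`, `copy_async`, and `copy_to_host_blocking` are hypothetical names, and the queue defers all work until `synchronize()`, which is the worst case rather than how a real CUDA stream schedules work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA stream: queued work runs only when the
// stream is synchronized. Real streams may finish work earlier than this.
class FakeStream {
 public:
  void enqueue(std::function<void()> op) {
    pending_.push_back(std::move(op));
  }
  void synchronize() {
    for (auto& op : pending_) {
      op();
    }
    pending_.clear();
  }

 private:
  std::vector<std::function<void()>> pending_;
};

// Stand-in for cudaMemcpyAsync: records the copy, does not perform it yet.
void copy_async(FakeStream& s, void* dst, const void* src, std::size_t n) {
  s.enqueue([=] { std::memcpy(dst, src, n); });
}

// Stand-in for the device-to-host path: enqueue, then wait, so the host
// buffer is valid by the time the function returns.
void copy_to_host_blocking(
    FakeStream& s, void* dst, const void* src, std::size_t n) {
  copy_async(s, dst, src, n);
  s.synchronize();
}
```

In this model a host buffer filled through `copy_async` alone still holds its old contents until the stream is synchronized, while `copy_to_host_blocking` leaves it ready to compare, as the round-trip test does with `h_src` and `h_dst`.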


+    ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
+    EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
+    EXPECT_EQ(h_src, h_dst);


Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:

- Added null pointer and zero-byte validation: null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531

Summary: Pull Request resolved: pytorch#20498

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:

- Added null pointer and zero-byte validation: null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531

Copilot AI review requested due to automatic review settings June 24, 2026 22:51

Conarnar requested review from kirklandsign and larryliu0820 as code owners June 24, 2026 22:51

meta-cla Bot added the CLA Signed label Jun 24, 2026

meta-codesync Bot added the meta-exported label Jun 24, 2026

meta-codesync Bot temporarily deployed to cadence June 24, 2026 22:51 Inactive

Copilot started reviewing on behalf of Conarnar June 24, 2026 22:51

Copilot AI reviewed Jun 24, 2026

meta-codesync Bot changed the title ~~Use caller CUDA stream for D2H and H2D copies~~ Use caller CUDA stream for D2H and H2D copies (#20498) Jun 24, 2026

Conarnar force-pushed the export-D109590531 branch 2 times, most recently from 3d8da75 to 07765c3 Compare June 25, 2026 17:10

Copilot AI review requested due to automatic review settings June 25, 2026 17:10

Copilot started reviewing on behalf of Conarnar June 25, 2026 17:10

Copilot AI reviewed Jun 25, 2026

Conarnar force-pushed the export-D109590531 branch from 07765c3 to b316b71 Compare June 25, 2026 17:59

Conarnar force-pushed the export-D109590531 branch from b316b71 to 98081dc Compare June 25, 2026 18:57

Copilot AI review requested due to automatic review settings June 25, 2026 18:57

Copilot started reviewing on behalf of Conarnar June 25, 2026 18:58

Copilot AI reviewed Jun 25, 2026

Conarnar force-pushed the export-D109590531 branch from 98081dc to 1e001a5 Compare June 25, 2026 20:50

Conarnar force-pushed the export-D109590531 branch from c657616 to fd2e388 Compare June 26, 2026 21:45

Copilot AI review requested due to automatic review settings June 26, 2026 21:45

Copilot AI reviewed Jun 26, 2026

Conarnar force-pushed the export-D109590531 branch from fd2e388 to 9bfc44e Compare June 26, 2026 22:50

Copilot AI review requested due to automatic review settings June 27, 2026 10:24

Conarnar force-pushed the export-D109590531 branch 2 times, most recently from 32968c0 to 056c25c Compare June 27, 2026 10:24

Copilot AI reviewed Jun 27, 2026

Conarnar force-pushed the export-D109590531 branch from 056c25c to 96b9452 Compare June 27, 2026 10:26

Copilot AI review requested due to automatic review settings June 27, 2026 19:34

Conarnar force-pushed the export-D109590531 branch from 96b9452 to f042310 Compare June 27, 2026 19:34

Conarnar force-pushed the export-D109590531 branch from f042310 to 0dd28cc Compare June 27, 2026 19:34

Copilot AI reviewed Jun 27, 2026

Conarnar force-pushed the export-D109590531 branch from 0dd28cc to f042310 Compare June 27, 2026 19:34

Copilot AI review requested due to automatic review settings June 28, 2026 00:10

Conarnar force-pushed the export-D109590531 branch from f042310 to 6ec0025 Compare June 28, 2026 00:10

Copilot AI reviewed Jun 28, 2026

Conarnar force-pushed the export-D109590531 branch from 6ec0025 to 4b6fde9 Compare June 28, 2026 00:40

Conarnar force-pushed the export-D109590531 branch from 4b6fde9 to 5bfd1b4 Compare June 28, 2026 03:05

Conarnar force-pushed the export-D109590531 branch from 5bfd1b4 to 533e5de Compare June 28, 2026 03:07

Conarnar wants to merge 1 commit into pytorch:main from Conarnar:export-D109590531

4 participants
