Add pure CUDA backend alongside the PTX engine #124
Add a --cuda flag to llama-tornado that selects the TornadoVM CUDA backend, mirroring the existing --opencl/--ptx/--metal plumbing: it loads the tornado.drivers.cuda module and the cuda-exports export list. Also disambiguate --ptx help text (was 'PTX/CUDA').
The CUDA backend is only available in a dev build of TornadoVM (PR #861), so point the JDK21 build at 4.0.2-jdk21-dev. The project's own version is unchanged.
List CUDA among the supported backends, add a --cuda usage example, and note that the CUDA backend requires a TornadoVM build with the CUDA backend from PR #861 (beehive-lab/TornadoVM#861).
Pull request overview
Adds first-class --cuda backend selection to the llama-tornado launcher, wiring it to TornadoVM’s new CUDA driver module/export list and updating docs/build metadata to reflect the new backend option.
Changes:
- Extend launcher backend selection to include `--cuda` (new `Backend.CUDA` + module/export wiring).
- Build against TornadoVM `4.0.2-jdk21-dev` to pick up the unreleased CUDA backend.
- Update README examples/help text to document the CUDA backend and the TornadoVM PR #861 requirement.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| README.md | Documents --cuda usage and updates backend/help text examples. |
| pom.xml | Updates TornadoVM dependency versioning to a -dev build for CUDA support. |
| llama-tornado | Adds Backend.CUDA, --cuda flag, and CUDA module/export configuration. |
- [TornadoVM](https://github.com/beehive-lab/TornadoVM) with OpenCL, PTX, or CUDA backends.
- The `--cuda` backend requires a TornadoVM build that includes the CUDA backend from [TornadoVM PR #861](https://github.com/beehive-lab/TornadoVM/pull/861). This project currently builds against TornadoVM `4.0.2-jdk21-dev`.
--ptx Use PTX/CUDA backend (default: None)
--ptx Use PTX backend (default: None)
--cuda Use CUDA backend (requires TornadoVM built with the CUDA backend) (default: None)
--metal Use Apple Metal backend (macOS only) (default: None)
<tornadovm.base.version>4.0.2</tornadovm.base.version>
<jdk.version.suffix>-jdk21</jdk.version.suffix>
<tornadovm.version>${tornadovm.base.version}${jdk.version.suffix}</tornadovm.version>
<!-- CUDA backend is only available in a dev build of TornadoVM (PR #861) -->
<tornadovm.version>${tornadovm.base.version}${jdk.version.suffix}-dev</tornadovm.version>
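As a rough illustration of the property composition in the `pom.xml` lines above, the resolved version is the base version, the JDK suffix, and the new `-dev` suffix concatenated. The sketch below mirrors that in Python; the function name `tornadovm_version` is hypothetical and not part of the project.

```python
def tornadovm_version(base: str, jdk_suffix: str, dev: bool = False) -> str:
    """Concatenate the pieces the same way the Maven property does."""
    return f"{base}{jdk_suffix}{'-dev' if dev else ''}"


# Values taken from the pom.xml lines above.
released = tornadovm_version("4.0.2", "-jdk21")
cuda_dev = tornadovm_version("4.0.2", "-jdk21", dev=True)
print(released)  # 4.0.2-jdk21
print(cuda_dev)  # 4.0.2-jdk21-dev
```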
Add a cuda variant to the build, standalone-inference, and quarkus-integration backend matrices. The setup-tornadovm action now builds the CUDA backend from the cuda2 branch (TornadoVM PR #861) until it is merged to master; other backends still build from master. Shared inference steps run on CUDA via the matrix; the PTX-only CUDA-graph steps remain gated to ptx.
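The gating described in this commit can be pictured with a small Python sketch. It is only a model of the workflow logic, not the actual CI configuration: the matrix values other than `cuda` and the step names `shared-inference` and `cuda-graph` are assumptions, while the `cuda2` and `master` branch names come from the commit message.

```python
# Assumed matrix values; the commit message only confirms that a "cuda"
# variant was added next to the existing ones.
BACKENDS = ("opencl", "ptx", "cuda")


def tornadovm_branch(backend: str) -> str:
    """Branch the setup action builds from: cuda2 for CUDA, master otherwise."""
    return "cuda2" if backend == "cuda" else "master"


def steps_for(backend: str) -> list[str]:
    """Hypothetical step names: shared steps run on every backend,
    CUDA-graph steps stay gated to ptx."""
    steps = ["shared-inference"]
    if backend == "ptx":
        steps.append("cuda-graph")
    return steps


for backend in BACKENDS:
    print(backend, tornadovm_branch(backend), steps_for(backend))
```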
CUDA vs PTX performance comparison
Benchmarked the new
Total throughput (tokens/s), higher is better
Prefill / decode / total split (tokens/s and seconds)
Prefill = prompt processing, Decode = token generation.
One-time JIT compile cost (ms)
CUDA compiles generated CUDA-C via NVRTC (CUDA-C → PTX), which is heavier than PTX assembly. This is a one-time warm-up cost (amortized over the run), not per-token.
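To make the amortization point concrete, here is a minimal sketch of how a fixed compile cost dilutes into throughput as a run gets longer. It assumes a simple model (constant steady-state rate plus a single up-front compile); the function name and all numbers are illustrative placeholders, not measurements from this PR, and the PR does not state whether its reported throughput includes compile time.

```python
def effective_tokens_per_s(tokens: int, steady_tokens_per_s: float,
                           compile_ms: float) -> float:
    """Throughput when a one-time compile cost is added to the run time."""
    run_s = tokens / steady_tokens_per_s
    return tokens / (run_s + compile_ms / 1000.0)


# Illustrative values only: 50 tokens/s steady state, 500 ms compile.
short_run = effective_tokens_per_s(100, 50.0, 500.0)
long_run = effective_tokens_per_s(1000, 50.0, 500.0)
print(round(short_run, 2), round(long_run, 2))
```

Under this model the longer run ends up closer to the steady-state rate, which is what "amortized over the run" means here.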
Methodology
Takeaways
CUDA vs PTX performance comparison: FP16 models
Companion to the Q8_0 comparison above, this run uses FP16 models. Same methodology: identical GPULlama3 build, only
Total throughput (tokens/s), higher is better
Prefill / decode / total split (tokens/s)
One-time JIT compile cost (ms)
Notes
Takeaways
CUDA vs OpenCL performance comparison: FP16 models
OpenCL counterpart to the FP16 CUDA/PTX comparison above. This is the cleanest apples-to-apples comparison of the three: both backends are built from the same TornadoVM source and commit (
Total throughput (tokens/s), higher is better
Prefill / decode / total split (tokens/s)
One-time JIT compile cost (ms)
Both compile generated kernels at runtime: OpenCL via the driver's OpenCL-C compiler, CUDA via NVRTC (CUDA-C → PTX). One-time warm-up cost, not per-token.
Notes
Takeaways
📊 Performance summary: what this PR adds
This PR adds a first-class CUDA backend path to
Throughput below is tokens/s, averaged over a short and a long prompt (full per-prompt / prefill-decode breakdowns are in the three comments above). OpenCL was measured for FP16 only; both CUDA and OpenCL use the same TornadoVM
Speedup ranges (this PR's CUDA backend):
What's included in this PR
Validation
Hardware: RTX 3070 (cc 8.6), JDK 21.0.2.
Performance: CUDA backend vs PTX and OpenCL
Decode throughput (
Highlights: CUDA is faster than PTX in every case (+10% to +240%), and ≥ OpenCL on most models (up to +15%), trailing OpenCL only slightly on the small Qwen family and Granite-4.0-1B (−4% to −9%). CUDA
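For readers who want to reproduce this kind of percentage from raw tokens/s, a minimal sketch follows. It assumes the figures are relative differences against the baseline backend, with throughput first averaged over a short and a long prompt as in the summary comment; the helper names and every number are illustrative placeholders, not data from this PR.

```python
from statistics import mean


def average_tokens_per_s(short_prompt: float, long_prompt: float) -> float:
    """Average throughput over the short-prompt and long-prompt runs."""
    return mean([short_prompt, long_prompt])


def relative_change_pct(candidate: float, baseline: float) -> float:
    """Positive when the candidate backend is faster than the baseline."""
    return (candidate / baseline - 1.0) * 100.0


# Placeholder tokens/s values, chosen only to show the arithmetic.
cuda = average_tokens_per_s(60.0, 50.0)
ptx = average_tokens_per_s(52.0, 48.0)
print(f"{relative_change_pct(cuda, ptx):+.0f}%")
```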
Summary
Adds a first-class `--cuda` backend path to the `llama-tornado` launcher, mapping to TornadoVM's new CUDA backend (CUDA C → NVRTC → PTX → CUDA Driver API). It complements the existing `--opencl`, `--ptx`, and `--metal` paths.
How it maps to TornadoVM
Backend selection in the launcher is by which TornadoVM driver module is loaded. The new `--cuda` branch mirrors `--ptx` exactly:
- `@$TORNADOVM_HOME/etc/exportLists/cuda-exports`
- `--add-modules ...,tornado.drivers.common,tornado.drivers.cuda`

`--gpu --cuda` behaves like `--gpu --ptx`. The `--ptx` help text was tightened (it previously said "PTX/CUDA") now that CUDA is its own flag.
TornadoVM requirement
The CUDA backend is not yet in a released TornadoVM; it lives in TornadoVM PR #861 (beehive-lab/TornadoVM#861). This PR therefore builds against TornadoVM `4.0.2-jdk21-dev` (a build that includes the CUDA backend). The project's own version is unchanged. The README documents this requirement.
Validation
Built with JDK 21 against TornadoVM `4.0.2-jdk21-dev` and run on an NVIDIA RTX 3070 (device `0:0`, `Backend: CUDA` confirmed via `--print-threads`). All produced coherent output:
- `llama-3.2-1b-instruct-q8_0` (Llama Q8)
- `Llama-3.2-1B-Instruct.FP16` (Llama FP16)
- `granite-3.2-2b-instruct-Q8_0` (Granite)
- `qwen2.5-1.5b-instruct-q8_0` (Qwen)

No regression: `--opencl`, `--ptx`, and `--metal` still parse and wire their respective driver modules/export lists.
Changes
- `llama-tornado`: add `CUDA` to the `Backend` enum, add the `--cuda` argparse flag, add the CUDA module-config branch, update the docstring.
- `pom.xml`: build the JDK21 path against TornadoVM `4.0.2-jdk21-dev`.
- `README.md`: list CUDA among supported backends, add a `--cuda` example, document the PR #861 requirement.
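The launcher change listed above might look roughly like the following. This is a simplified sketch, not the actual `llama-tornado` source: the `DRIVERS` table, `build_parser`, and `module_config` are hypothetical names, the PTX export-list and module names are assumed by analogy with the CUDA ones, and the module list is deliberately shortened because the PR description elides the leading modules as `...`.

```python
import argparse
from enum import Enum


class Backend(Enum):
    PTX = "ptx"
    CUDA = "cuda"  # new in this PR


# (export-list file, driver module) per backend. The CUDA pair comes from
# the PR description; the PTX pair is assumed by analogy.
DRIVERS = {
    Backend.PTX: ("ptx-exports", "tornado.drivers.ptx"),
    Backend.CUDA: ("cuda-exports", "tornado.drivers.cuda"),
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="llama-tornado")
    parser.add_argument("--gpu", action="store_true", help="Run on the GPU")
    # Both flags write to the same destination, so the backend defaults to None.
    parser.add_argument("--ptx", dest="backend", action="store_const",
                        const=Backend.PTX, help="Use PTX backend")
    parser.add_argument("--cuda", dest="backend", action="store_const",
                        const=Backend.CUDA, help="Use CUDA backend")
    return parser


def module_config(backend: Backend, tornado_home: str) -> list[str]:
    """JVM arguments that select the TornadoVM driver for the backend."""
    exports, driver = DRIVERS[backend]
    return [
        f"@{tornado_home}/etc/exportLists/{exports}",
        "--add-modules",
        # The real launcher lists further modules before these two
        # (elided as "..." in the PR description).
        f"tornado.drivers.common,{driver}",
    ]


args = build_parser().parse_args(["--gpu", "--cuda"])
print(module_config(args.backend, "/opt/tornadovm"))
```

Keeping the per-backend differences in one table is a design choice of this sketch; it makes "mirrors `--ptx` exactly" visible as a single extra entry.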