Add pure CUDA backend along the PTX engine by mikepapadim · Pull Request #124 · beehive-lab/GPULlama3.java

mikepapadim · 2026-06-20T11:47:16Z

Summary

Adds a first-class --cuda backend path to the llama-tornado launcher, mapping to TornadoVM's new CUDA backend (CUDA C → NVRTC → PTX → CUDA Driver API). It complements the existing --opencl, --ptx, and --metal paths.

./llama-tornado --gpu --cuda --model <model.gguf> --prompt "hello"

How it maps to TornadoVM

The launcher selects a backend by choosing which TornadoVM driver module to load. The new --cuda branch mirrors --ptx exactly:

export list @$TORNADOVM_HOME/etc/exportLists/cuda-exports
--add-modules ...,tornado.drivers.common,tornado.drivers.cuda

--gpu --cuda behaves like --gpu --ptx. The --ptx help text was tightened (it previously said "PTX/CUDA") now that CUDA is its own flag.
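The mapping described above can be pictured with a minimal Python sketch. It is not the launcher's actual code: the function names `parse_backend` and `module_config` are hypothetical, the PTX names (`ptx-exports`, `tornado.drivers.ptx`) are assumed by analogy with the CUDA ones quoted above, and the modules elided as `...` in the description are left out.

```python
import argparse
from enum import Enum


class Backend(Enum):
    # Only two backends are modelled here; the real enum has more members.
    PTX = "ptx"
    CUDA = "cuda"


def parse_backend(argv):
    """Return the Backend selected on the command line, or None."""
    parser = argparse.ArgumentParser(prog="launcher-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--ptx", dest="backend", action="store_const",
                       const=Backend.PTX, help="Use PTX backend")
    group.add_argument("--cuda", dest="backend", action="store_const",
                       const=Backend.CUDA, help="Use CUDA backend")
    return parser.parse_args(argv).backend


def module_config(backend, tornadovm_home):
    """Build the export-list and module arguments for one backend.

    Assumption: every backend follows the same naming pattern as CUDA.
    """
    name = backend.value
    return [
        f"@{tornadovm_home}/etc/exportLists/{name}-exports",
        "--add-modules",
        f"tornado.drivers.common,tornado.drivers.{name}",
    ]


if __name__ == "__main__":
    chosen = parse_backend(["--cuda"])
    print(module_config(chosen, "/opt/tornadovm"))
```

Putting the two flags in a mutually exclusive group makes argparse reject `--ptx --cuda` together; whether the real launcher enforces this is not stated in the PR.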

TornadoVM requirement

The CUDA backend is not yet in a released TornadoVM; it lives in TornadoVM PR #861 — beehive-lab/TornadoVM#861. This PR therefore builds against TornadoVM 4.0.2-jdk21-dev (a build that includes the CUDA backend). The project's own version is unchanged. The README documents this requirement.

Validation

Built with JDK 21 against TornadoVM 4.0.2-jdk21-dev and run on an NVIDIA RTX 3070 (device 0:0, Backend: CUDA confirmed via --print-threads). All produced coherent output:

Model	Result
`llama-3.2-1b-instruct-q8_0` (Llama Q8)	✅ "The capital of France is Paris."
`Llama-3.2-1B-Instruct.FP16` (Llama FP16)	✅ coherent
`granite-3.2-2b-instruct-Q8_0` (Granite)	✅ coherent
`qwen2.5-1.5b-instruct-q8_0` (Qwen)	✅ coherent

No regression: --opencl, --ptx, and --metal still parse and wire their respective driver modules/export lists.

Changes

llama-tornado: add CUDA to the Backend enum, add the --cuda argparse flag, add the CUDA module-config branch, update the docstring.
pom.xml: build the JDK21 path against TornadoVM 4.0.2-jdk21-dev.
README.md: list CUDA among supported backends, add a --cuda example, document the PR #861 requirement.

Add a --cuda flag to llama-tornado that selects the TornadoVM CUDA backend, mirroring the existing --opencl/--ptx/--metal plumbing: it loads the tornado.drivers.cuda module and the cuda-exports export list. Also disambiguate --ptx help text (was 'PTX/CUDA').

The CUDA backend is only available in a dev build of TornadoVM (PR #861), so point the JDK21 build at 4.0.2-jdk21-dev. The project's own version is unchanged.

List CUDA among the supported backends, add a --cuda usage example, and note that the CUDA backend requires a TornadoVM build with the CUDA backend from PR #861 (beehive-lab/TornadoVM#861).

Copilot

Pull request overview

Adds first-class --cuda backend selection to the llama-tornado launcher, wiring it to TornadoVM’s new CUDA driver module/export list and updating docs/build metadata to reflect the new backend option.

Changes:

Extend launcher backend selection to include --cuda (new Backend.CUDA + module/export wiring).
Build against TornadoVM 4.0.2-jdk21-dev to pick up the unreleased CUDA backend.
Update README examples/help text to document the CUDA backend and the TornadoVM PR #861 requirement.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
README.md	Documents `--cuda` usage and updates backend/help text examples.
pom.xml	Updates TornadoVM dependency versioning to a `-dev` build for CUDA support.
llama-tornado	Adds `Backend.CUDA`, `--cuda` flag, and CUDA module/export configuration.

+- [TornadoVM](https://github.com/beehive-lab/TornadoVM) with OpenCL, PTX, or CUDA backends.
+  - The `--cuda` backend requires a TornadoVM build that includes the CUDA backend from [TornadoVM PR #861](https://github.com/beehive-lab/TornadoVM/pull/861). This project currently builds against TornadoVM `4.0.2-jdk21-dev`.


-  --ptx                 Use PTX/CUDA backend (default: None)
+  --ptx                 Use PTX backend (default: None)
+  --cuda                Use CUDA backend (requires TornadoVM built with the CUDA backend) (default: None)
+  --metal               Use Apple Metal backend (macOS only) (default: None)


+            <tornadovm.base.version>4.0.2</tornadovm.base.version>
            <jdk.version.suffix>-jdk21</jdk.version.suffix>
-            <tornadovm.version>${tornadovm.base.version}${jdk.version.suffix}</tornadovm.version>
+            <!-- CUDA backend is only available in a dev build of TornadoVM (PR #861) -->
+            <tornadovm.version>${tornadovm.base.version}${jdk.version.suffix}-dev</tornadovm.version>


Add a cuda variant to the build, standalone-inference, and quarkus-integration backend matrices. The setup-tornadovm action now builds the CUDA backend from the cuda2 branch (TornadoVM PR #861) until it is merged to master; other backends still build from master. Shared inference steps run on CUDA via the matrix; the PTX-only CUDA-graph steps remain gated to ptx.

mikepapadim · 2026-06-20T12:08:55Z

CUDA vs PTX performance comparison

Benchmarked the new --cuda backend against the released PTX backend on identical hardware and identical GPULlama3 build, across 4 models and two prompt sizes (short input vs long ~150-token input). All numbers are end-to-end llama-tornado runs.

Total throughput (tokens/s) — higher is better

Model	Prompt	CUDA tok/s	PTX tok/s	CUDA speedup
Llama-3.2-1B-Q8	small	56.95	53.20	1.07×
Llama-3.2-1B-Q8	large	60.11	54.06	1.11×
Qwen2.5-1.5B-Q8	small	32.86	17.65	1.86×
Qwen2.5-1.5B-Q8	large	29.12	12.27	2.37×
Qwen3-1.7B-Q8	small	35.46	16.17	2.19×
Qwen3-1.7B-Q8	large	34.66	12.18	2.85×
Granite-3.2-2B-Q8	small	26.59	23.13	1.15×
Granite-3.2-2B-Q8	large	27.20	22.37	1.22×
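For readers who want to re-derive the last column, the speedup is the CUDA total throughput divided by the PTX total throughput. The short sketch below recomputes it from the values in the table; the dictionary layout and the `speedup` helper are illustrative only.

```python
# (model, prompt) -> (CUDA tok/s, PTX tok/s), copied from the table above.
totals = {
    ("Llama-3.2-1B-Q8", "small"): (56.95, 53.20),
    ("Llama-3.2-1B-Q8", "large"): (60.11, 54.06),
    ("Qwen2.5-1.5B-Q8", "small"): (32.86, 17.65),
    ("Qwen2.5-1.5B-Q8", "large"): (29.12, 12.27),
    ("Qwen3-1.7B-Q8", "small"): (35.46, 16.17),
    ("Qwen3-1.7B-Q8", "large"): (34.66, 12.18),
    ("Granite-3.2-2B-Q8", "small"): (26.59, 23.13),
    ("Granite-3.2-2B-Q8", "large"): (27.20, 22.37),
}


def speedup(cuda_tps, ptx_tps):
    """Ratio of CUDA to PTX throughput; values above 1 mean CUDA is faster."""
    return cuda_tps / ptx_tps


# Round to two decimals, matching the precision used in the table.
speedups = {key: round(speedup(*tps), 2) for key, tps in totals.items()}

for (model, prompt), ratio in speedups.items():
    print(f"{model:18} {prompt:5} {ratio:.2f}x")
```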

Prefill / decode / total split (tokens/s and seconds)

Prefill = prompt processing, Decode = token generation. * rows are reported as a single total phase by the engine for that model (no prefill/decode split available).

Model	Prompt	Backend	Prefill tok/s	Decode tok/s	Total tok/s	Total tok	Total s
Llama-3.2-1B-Q8	small	CUDA	63.80	55.65	56.95	123	2.16
Llama-3.2-1B-Q8	small	PTX	58.12	52.06	53.20	107	2.01
Llama-3.2-1B-Q8	large	CUDA	66.33	51.52	60.11	236	3.93
Llama-3.2-1B-Q8	large	PTX	59.74	47.10	54.06	248	4.59
Qwen2.5-1.5B-Q8	small	CUDA *	35.91	32.26	32.86	106	3.23
Qwen2.5-1.5B-Q8	small	PTX *	25.74	16.58	17.65	111	6.29
Qwen2.5-1.5B-Q8	large	CUDA *	31.82	25.92	29.12	250	8.59
Qwen2.5-1.5B-Q8	large	PTX *	16.24	8.73	12.27	237	19.31
Qwen3-1.7B-Q8	small	CUDA	42.62	34.48	35.46	131	3.69
Qwen3-1.7B-Q8	small	PTX	31.89	15.11	16.17	153	9.46
Qwen3-1.7B-Q8	large	CUDA	39.85	29.41	34.66	256	7.39
Qwen3-1.7B-Q8	large	PTX	17.23	8.69	12.18	256	21.02
Granite-3.2-2B-Q8	small	CUDA *	26.29	26.62	26.59	211	7.94
Granite-3.2-2B-Q8	small	PTX *	23.05	23.15	23.13	131	5.66
Granite-3.2-2B-Q8	large	CUDA *	28.33	25.64	27.20	256	9.41
Granite-3.2-2B-Q8	large	PTX *	23.32	21.05	22.37	256	11.44

One-time JIT compile cost (ms)

CUDA compiles generated CUDA-C via NVRTC (CUDA-C → PTX), which is heavier than PTX assembly. This is a one-time warm-up cost (amortized over the run), not per-token.

Model	CUDA JIT (ms)	PTX JIT (ms)
Llama-3.2-1B-Q8	12677	2519
Qwen2.5-1.5B-Q8	23381	4143
Qwen3-1.7B-Q8	24689	3579
Granite-3.2-2B-Q8	31078	5571
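To put the compile cost in context, the sketch below computes the CUDA/PTX JIT ratio and a rough break-even point: the number of generated tokens after which CUDA's faster token rate has paid back its extra compile time. This is a back-of-the-envelope estimate, not a measurement. It assumes the small-prompt total tok/s figures from the throughput table are steady-state rates that exclude compilation, and the helper names are hypothetical.

```python
def jit_ratio(cuda_jit_ms, ptx_jit_ms):
    """How many times longer the CUDA JIT takes than the PTX JIT."""
    return cuda_jit_ms / ptx_jit_ms


def break_even_tokens(cuda_jit_ms, ptx_jit_ms, cuda_tps, ptx_tps):
    """Tokens needed before CUDA's per-token saving covers its extra JIT time.

    Assumes cuda_tps > ptx_tps and that both rates exclude compile time.
    """
    extra_compile_s = (cuda_jit_ms - ptx_jit_ms) / 1000.0
    saved_per_token_s = 1.0 / ptx_tps - 1.0 / cuda_tps
    return extra_compile_s / saved_per_token_s


# JIT times from the table above; tok/s from the small-prompt throughput rows.
llama_ratio = jit_ratio(12677, 2519)
qwen3_ratio = jit_ratio(24689, 3579)
llama_tokens = break_even_tokens(12677, 2519, 56.95, 53.20)
qwen3_tokens = break_even_tokens(24689, 3579, 35.46, 16.17)

print(f"Llama-3.2-1B-Q8: JIT {llama_ratio:.1f}x, break-even ~{llama_tokens:.0f} tokens")
print(f"Qwen3-1.7B-Q8:   JIT {qwen3_ratio:.1f}x, break-even ~{qwen3_tokens:.0f} tokens")
```

Under these assumptions the break-even is roughly 8,200 tokens for Llama-3.2-1B-Q8, where the throughput gap is small, and roughly 630 tokens for Qwen3-1.7B-Q8, where it is large.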

Methodology

Hardware: NVIDIA GeForce RTX 3070 (compute 8.6), device 0:0.
GPULlama3: same feat/cuda-backend build for both backends; only $TORNADOVM_HOME swapped.
CUDA backend: TornadoVM 4.0.2-jdk21-dev (CUDA backend, PR #861).
PTX backend: TornadoVM 4.0.1-jdk21-ptx (released, via SDKMAN).
Settings: --with-prefill-decode --verbose-init -n 256 --seed 42, JSON metrics via llama.metrics.format=json. JDK 21.0.2.
Single cold run per cell (includes model load + JIT); decode/total rates exclude load/compile.

Takeaways

CUDA is faster than PTX on every model/prompt for end-to-end throughput on this GPU, from about +7% (Llama 1B, small prompt) up to 1.86–2.85× on the Qwen models.
The gap widens on larger prompts (prefill-heavy), where CUDA's prefill throughput holds up much better than PTX on the Qwen models.
CUDA's only measured cost is a higher one-time NVRTC compile time than PTX assembly (about 5–7× here: 12.7–31.1 s vs 2.5–5.6 s); it does not affect steady-state token rate.
All four models produce coherent output on CUDA (validated separately).

mikepapadim · 2026-06-20T12:25:33Z

CUDA vs PTX performance comparison — FP16 models

Companion to the Q8_0 comparison above, this run uses FP16 models. Same methodology: identical GPULlama3 build, only $TORNADOVM_HOME swapped between the CUDA (4.0.2-jdk21-dev, PR #861) and released PTX (4.0.1-jdk21-ptx) backends; small (short input) and large (~150-token input) prompts.

Total throughput (tokens/s) — higher is better

Model	Prompt	CUDA tok/s	PTX tok/s	CUDA speedup
Llama-3.2-1B-F16	small	58.72	34.56	1.70×
Llama-3.2-1B-F16	large	63.93	36.51	1.75×
Qwen3-1.7B-F16	small	33.09	14.05	2.36×
Qwen3-1.7B-F16	large	30.82	10.20	3.02×
Granite-3.2-2B-F16	small	25.29	15.00	1.69×
Granite-3.2-2B-F16	large	25.70	14.52	1.77×

Prefill / decode / total split (tokens/s and seconds)

* = engine reports a single total phase for that model (no prefill/decode split).

Model	Prompt	Backend	Prefill tok/s	Decode tok/s	Total tok/s	Total tok	Total s
Llama-3.2-1B-F16	small	CUDA	63.90	57.59	58.72	113	1.92
Llama-3.2-1B-F16	small	PTX	38.02	33.83	34.56	116	3.36
Llama-3.2-1B-F16	large	CUDA	70.93	54.10	63.93	233	3.64
Llama-3.2-1B-F16	large	PTX	38.43	32.82	36.51	218	5.97
Qwen3-1.7B-F16	small	CUDA	38.09	32.38	33.09	132	3.99
Qwen3-1.7B-F16	small	PTX	21.89	13.21	14.05	126	8.97
Qwen3-1.7B-F16	large	CUDA	35.25	26.29	30.82	256	8.31
Qwen3-1.7B-F16	large	PTX	13.53	7.63	10.20	256	25.09
Granite-3.2-2B-F16	small	CUDA *	26.53	25.19	25.29	256	10.12
Granite-3.2-2B-F16	small	PTX *	14.46	15.08	15.00	176	11.73
Granite-3.2-2B-F16	large	CUDA *	26.81	24.17	25.70	256	9.96
Granite-3.2-2B-F16	large	PTX *	14.94	13.92	14.52	256	17.63

One-time JIT compile cost (ms)

Model	CUDA JIT (ms)	PTX JIT (ms)
Llama-3.2-1B-F16	11406	1993
Qwen3-1.7B-F16	21835	3200
Granite-3.2-2B-F16	28228	4818

Notes

DeepSeek-R1-Distill-Qwen-1.5B-F16 failed on both backends with NoSuchElementException: No value present during model setup — a model-loading issue unrelated to the GPU backend, so it is excluded from the comparison.
Hardware/settings: NVIDIA RTX 3070 (compute 8.6), JDK 21.0.2, --with-prefill-decode --verbose-init -n 256 --seed 42, JSON metrics. Single cold run per cell.

Takeaways

On FP16, CUDA's lead over PTX is even larger than on Q8_0 — roughly 1.7× (Llama, Granite) up to ~3× (Qwen3, large prompt) end-to-end.
As with Q8, the gap widens on larger (prefill-heavy) prompts.
CUDA again pays a higher one-time NVRTC compile cost; steady-state token rate is unaffected.

mikepapadim · 2026-06-20T12:35:47Z

CUDA vs OpenCL performance comparison — FP16 models

OpenCL counterpart to the FP16 CUDA/PTX comparison above. This is the cleanest apples-to-apples comparison of the three: both backends are built from the same TornadoVM source and commit (4.0.2-jdk21-dev), so only the code-generation/runtime backend differs. Same GPULlama3 build, same GPU, small + large prompts.

Total throughput (tokens/s) — higher is better

Model	Prompt	CUDA tok/s	OpenCL tok/s	CUDA speedup
Llama-3.2-1B-F16	small	58.72	46.15	1.27×
Llama-3.2-1B-F16	large	63.93	48.20	1.33×
Qwen3-1.7B-F16	small	33.09	26.54	1.25×
Qwen3-1.7B-F16	large	30.82	26.70	1.15×
Granite-3.2-2B-F16	small	25.29	20.25	1.25×
Granite-3.2-2B-F16	large	25.70	19.68	1.31×

Prefill / decode / total split (tokens/s and seconds)

* = engine reports a single total phase for that model (no prefill/decode split).

Model	Prompt	Backend	Prefill tok/s	Decode tok/s	Total tok/s	Total tok	Total s
Llama-3.2-1B-F16	small	CUDA	63.90	57.59	58.72	113	1.92
Llama-3.2-1B-F16	small	OpenCL	50.27	45.40	46.15	131	2.84
Llama-3.2-1B-F16	large	CUDA	70.93	54.10	63.93	233	3.64
Llama-3.2-1B-F16	large	OpenCL	52.00	43.62	48.20	256	5.31
Qwen3-1.7B-F16	small	CUDA	38.09	32.38	33.09	132	3.99
Qwen3-1.7B-F16	small	OpenCL	29.89	26.11	26.54	146	5.50
Qwen3-1.7B-F16	large	CUDA	35.25	26.29	30.82	256	8.31
Qwen3-1.7B-F16	large	OpenCL	29.36	23.75	26.70	256	9.59
Granite-3.2-2B-F16	small	CUDA *	26.53	25.19	25.29	256	10.12
Granite-3.2-2B-F16	small	OpenCL *	19.81	20.32	20.25	169	8.34
Granite-3.2-2B-F16	large	CUDA *	26.81	24.17	25.70	256	9.96
Granite-3.2-2B-F16	large	OpenCL *	19.62	19.77	19.68	256	13.01

One-time JIT compile cost (ms)

Both compile generated kernels at runtime — OpenCL via the driver's OpenCL-C compiler, CUDA via NVRTC (CUDA-C → PTX). One-time warm-up cost, not per-token.

Model	CUDA JIT (ms)	OpenCL JIT (ms)
Llama-3.2-1B-F16	11406	4063
Qwen3-1.7B-F16	21835	8954
Granite-3.2-2B-F16	28228	9166

Notes

Version parity: both backends are TornadoVM 4.0.2-jdk21-dev built from the same commit (CUDA from PR #861); only $TORNADOVM_HOME is swapped. OpenCL device: [NVIDIA CUDA] NVIDIA GeForce RTX 3070 via the OpenCL platform.
DeepSeek-R1-Distill-Qwen-1.5B-F16 again failed on both backends (NoSuchElementException at model setup — model-loading issue, not backend-related) and is excluded.
Settings: RTX 3070, JDK 21.0.2, --with-prefill-decode --verbose-init -n 256 --seed 42, JSON metrics, single cold run per cell.

Takeaways

CUDA is ~1.15–1.33× faster than OpenCL end-to-end on FP16 across these models — a smaller margin than CUDA-vs-PTX, since OpenCL is the stronger of the two existing NVIDIA paths here.
CUDA's JIT (NVRTC) warm-up is heavier than OpenCL's (about 2.4–3.1× in the table above); steady-state token rate is unaffected.
Net ordering on this GPU (FP16): CUDA > OpenCL > PTX.

mikepapadim · 2026-06-20T12:44:22Z

📊 Performance summary — what this PR adds

This PR adds a first-class CUDA backend path to llama-tornado. On an NVIDIA RTX 3070, it is the fastest of the three NVIDIA-capable TornadoVM backends for end-to-end LLM inference: CUDA > OpenCL > PTX.

Throughput below is tokens/s, averaged over a short and a long prompt (full per-prompt / prefill-decode breakdowns are in the three comments above). OpenCL was measured for FP16 only; both CUDA and OpenCL use the same TornadoVM 4.0.2-jdk21-dev build, PTX uses the released 4.0.1-jdk21-ptx.

Model	Precision	CUDA tok/s	PTX tok/s	OpenCL tok/s	CUDA vs PTX	CUDA vs OpenCL
Llama-3.2-1B	Q8_0	58.5	53.6	—	1.09×	—
Llama-3.2-1B	FP16	61.3	35.5	47.2	1.73×	1.30×
Qwen2.5-1.5B	Q8_0	31.0	15.0	—	2.07×	—
Qwen3-1.7B	Q8_0	35.1	14.2	—	2.47×	—
Qwen3-1.7B	FP16	32.0	12.1	26.6	2.64×	1.20×
Granite-3.2-2B	Q8_0	26.9	22.8	—	1.18×	—
Granite-3.2-2B	FP16	25.5	14.8	20.0	1.73×	1.28×
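As a worked example of how one row of this table is derived, the sketch below averages the small- and large-prompt totals for Llama-3.2-1B FP16 (taken from the per-prompt tables in the earlier comments) and then takes the ratios. Whether the PR computed the ratios from rounded or unrounded averages is not stated; this sketch uses the unrounded averages.

```python
from statistics import mean

# Backend -> (small-prompt tok/s, large-prompt tok/s) for Llama-3.2-1B FP16.
llama_fp16 = {
    "CUDA": (58.72, 63.93),
    "PTX": (34.56, 36.51),
    "OpenCL": (46.15, 48.20),
}

# Arithmetic mean over the two prompt sizes, per backend.
avg = {backend: mean(rates) for backend, rates in llama_fp16.items()}

cuda_vs_ptx = avg["CUDA"] / avg["PTX"]
cuda_vs_opencl = avg["CUDA"] / avg["OpenCL"]

print({backend: round(value, 1) for backend, value in avg.items()})
print(f"CUDA vs PTX: {cuda_vs_ptx:.2f}x, CUDA vs OpenCL: {cuda_vs_opencl:.2f}x")
```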

Speedup ranges (this PR's CUDA backend):

vs PTX: 1.09×–2.64× faster (geo-mean ≈ 1.76×)
vs OpenCL (FP16): 1.20×–1.30× faster (geo-mean ≈ 1.26×)
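The geometric means quoted above can be reproduced from the speedup columns of the summary table. A geometric mean is the usual way to average ratios because it treats a 2× speedup and a 0.5× slowdown symmetrically. The sketch uses Python's standard `statistics.geometric_mean` (available from Python 3.8); the list names are illustrative.

```python
from statistics import geometric_mean

# "CUDA vs PTX" column of the summary table, top to bottom.
vs_ptx = [1.09, 1.73, 2.07, 2.47, 2.64, 1.18, 1.73]

# "CUDA vs OpenCL" column (FP16 rows only).
vs_opencl = [1.30, 1.20, 1.28]

geo_ptx = geometric_mean(vs_ptx)
geo_opencl = geometric_mean(vs_opencl)

print(f"vs PTX:    {min(vs_ptx):.2f}x to {max(vs_ptx):.2f}x, geo-mean {geo_ptx:.2f}x")
print(f"vs OpenCL: {min(vs_opencl):.2f}x to {max(vs_opencl):.2f}x, geo-mean {geo_opencl:.2f}x")
```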

What's included in this PR

--cuda launcher flag wired to TornadoVM's CUDA backend (tornado.drivers.cuda + cuda-exports), symmetric with --opencl/--ptx/--metal.
Builds against TornadoVM 4.0.2-jdk21-dev (CUDA backend from TornadoVM PR #861); README documents the requirement.
CI: cuda added to the build / inference / quarkus matrices (CUDA built from the cuda2 branch until merged).

Validation

Coherent output on CUDA for Llama 3.2 1B (Q8 & FP16), Qwen2.5 1.5B, Qwen3 1.7B, Granite 3.2 2B — device confirmed as CUDA / RTX 3070.
No regression: --opencl, --ptx, --metal still parse and wire correctly.

Hardware: RTX 3070 (cc 8.6), JDK 21.0.2. DeepSeek-R1-Distill-Qwen-1.5B-F16 excluded: it fails to load on all three backends (model issue, not backend-specific).

orionpapadakis · 2026-06-23T19:34:02Z

Performance: CUDA backend vs PTX and OpenCL

Decode throughput (eval tok/s) per backend, with CUDA speedup. Benchmark: each model with a fixed prompt, max_tokens=256, 3 reps (mean), RTX 5090 Laptop, CUDA 13.1 toolkit / 13.0 driver, Java 21. cuda-graphs configurations run on PTX & CUDA only (OpenCL N/A).

Model	Size	Quant	Configuration	OpenCL	PTX	CUDA	CUDA vs PTX	CUDA vs OpenCL
Qwen3	0.6B	F16	standard	39.5	11.5	37.7	3.27× (+227%)	0.95× (-5%)
Qwen3	0.6B	F16	prefill-decode	39.6	11.4	37.7	3.29× (+229%)	0.95× (-5%)
Qwen3	0.6B	F16	batch-prefill-decode	38.8	11.4	37.0	3.25× (+225%)	0.96× (-4%)
Qwen3	0.6B	F16	prefill-decode + cuda-graphs	—	12.2	40.5	3.32× (+232%)	—
Qwen3	0.6B	F16	batch-prefill-decode + cuda-graphs	—	12.1	39.3	3.26× (+226%)	—
Qwen3	0.6B	Q8_0	standard	40.9	12.6	39.2	3.11× (+211%)	0.96× (-4%)
Qwen3	0.6B	Q8_0	prefill-decode	40.7	12.5	39.1	3.11× (+211%)	0.96× (-4%)
Qwen3	0.6B	Q8_0	batch-prefill-decode	37.5	12.1	38.2	3.17× (+217%)	1.02× (+2%)
Qwen3	0.6B	Q8_0	prefill-decode + cuda-graphs	—	12.7	42.7	3.35× (+235%)	—
Qwen3	0.6B	Q8_0	batch-prefill-decode + cuda-graphs	—	12.6	42.7	3.40× (+240%)	—
Llama-3.2	1B	F16	standard	59.0	43.1	63.8	1.48× (+48%)	1.08× (+8%)
Llama-3.2	1B	F16	prefill-decode	59.1	43.2	63.3	1.47× (+47%)	1.07× (+7%)
Llama-3.2	1B	F16	batch-prefill-decode	57.8	42.2	62.6	1.48× (+48%)	1.08× (+8%)
Llama-3.2	1B	F16	prefill-decode + cuda-graphs	—	46.8	67.6	1.44× (+44%)	—
Llama-3.2	1B	F16	batch-prefill-decode + cuda-graphs	—	44.8	66.6	1.49× (+49%)	—
Llama-3.2	1B	Q8_0	standard	62.2	60.8	71.3	1.17× (+17%)	1.15× (+15%)
Llama-3.2	1B	Q8_0	prefill-decode	62.1	60.4	71.0	1.18× (+18%)	1.14× (+14%)
Llama-3.2	1B	Q8_0	batch-prefill-decode	60.2	58.9	68.6	1.16× (+16%)	1.14× (+14%)
Llama-3.2	1B	Q8_0	prefill-decode + cuda-graphs	—	71.3	78.4	1.10× (+10%)	—
Llama-3.2	1B	Q8_0	batch-prefill-decode + cuda-graphs	—	68.9	75.8	1.10× (+10%)	—
Qwen2.5	1.5B	F16	standard	30.7	10.3	28.0	2.72× (+172%)	0.91× (-9%)
Qwen2.5	1.5B	Q8_0	standard	31.6	11.8	30.2	2.57× (+157%)	0.96× (-4%)
Granite-3.2	2B	F16	standard	24.8	18.5	28.5	1.54× (+54%)	1.15× (+15%)
Granite-3.2	2B	Q8_0	standard	27.5	26.5	31.1	1.17× (+17%)	1.13× (+13%)
Granite-4.0	1B	F16	standard	24.2	7.9	24.8	3.15× (+215%)	1.03× (+3%)
Granite-4.0	1B	Q8_0	standard	27.6	8.6	26.3	3.07× (+207%)	0.95× (-5%)
Phi-3-mini	3.8B	F16	standard	24.7	11.3	26.8	2.38× (+138%)	1.09× (+9%)
Phi-3-mini	3.8B	Q8_0	standard	26.5	14.3	30.6	2.13× (+113%)	1.15× (+15%)
Mistral	7B	F16	standard	8.4	5.0	8.5	1.70× (+70%)	1.02× (+2%)
Mistral	7B	Q8_0	standard	8.8	6.7	9.4	1.41× (+41%)	1.06× (+6%)
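The two speedup columns pair a ratio with the equivalent percentage change, i.e. percent = (ratio − 1) × 100. The sketch below shows that conversion on a few rows; `format_speedup` is a hypothetical helper. Because it starts from the rounded tok/s values printed in the table, a few rows can differ from the table in the last digit, presumably because the table was computed from unrounded measurements.

```python
def format_speedup(cuda_tps, other_tps):
    """Format CUDA's throughput relative to another backend as 'R× (±P%)'."""
    ratio = cuda_tps / other_tps
    percent = (ratio - 1.0) * 100.0
    return f"{ratio:.2f}× ({percent:+.0f}%)"


# Qwen3 0.6B F16, prefill-decode + cuda-graphs: CUDA 40.5 vs PTX 12.2
print(format_speedup(40.5, 12.2))
# Llama-3.2 1B Q8_0, prefill-decode + cuda-graphs: CUDA 78.4 vs PTX 71.3
print(format_speedup(78.4, 71.3))
# Qwen2.5 1.5B F16, standard: CUDA 28.0 vs OpenCL 30.7
print(format_speedup(28.0, 30.7))
```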

Highlights: CUDA is faster than PTX in every case (+10% to +240%), and ≥ OpenCL on most models (up to +15%), trailing OpenCL only slightly on the small Qwen family and Granite-4.0-1B (−4% to −9%). CUDA + cuda-graphs gives the top result overall (Llama-3.2-1B Q8_0: 78.4 tok/s).

mikepapadim added 3 commits June 20, 2026 14:46

build(pom): build against TornadoVM 4.0.2-jdk21-dev (CUDA backend)

2e2fa90

docs: document CUDA backend and TornadoVM PR #861 requirement

74b88c2

mikepapadim changed the title ~~feat: add CUDA backend path to llama-tornado launcher~~ Add pure CUDA backend along the PTX engine Jun 20, 2026

mikepapadim requested review from mairooni, orionpapadakis and stratika June 20, 2026 11:54

[hack] force ci to run on strix

3f13606

orionpapadakis added 2 commits June 22, 2026 14:03

[hack] force all ci jobs to run on strix

bd4a70b

[hack] use correct runner custom label

3efde92

Revert specific workflow runner labels

dcee6cc

mikepapadim wants to merge 8 commits into main from feat/cuda-backend
