
Add pure CUDA backend along the PTX engine #124

Open
mikepapadim wants to merge 8 commits into main from feat/cuda-backend

Conversation

@mikepapadim

Member

Summary

Adds a first-class --cuda backend path to the llama-tornado launcher, mapping to TornadoVM's new CUDA backend (CUDA C → NVRTC → PTX → CUDA Driver API). It complements the existing --opencl, --ptx, and --metal paths.

./llama-tornado --gpu --cuda --model <model.gguf> --prompt "hello"

How it maps to TornadoVM

The launcher selects a backend by choosing which TornadoVM driver module to load. The new --cuda branch mirrors --ptx exactly:

  • export list @$TORNADOVM_HOME/etc/exportLists/cuda-exports
  • --add-modules ...,tornado.drivers.common,tornado.drivers.cuda

--gpu --cuda behaves like --gpu --ptx. The --ptx help text was tightened (it previously said "PTX/CUDA") now that CUDA is its own flag.
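The wiring described above can be sketched roughly as follows. This is a minimal illustration, not the launcher's actual code: only the `cuda-exports` export list and the `tornado.drivers.common,tornado.drivers.cuda` modules come from this PR description. The helper names `module_args` and `parse_backend`, the mutually exclusive flag group, and the assumption that the other backends follow the same `<name>-exports` / `tornado.drivers.<name>` pattern are hypothetical, and the elided leading modules (`...`) are left out rather than guessed.

```python
import argparse
from enum import Enum
from typing import List, Optional


class Backend(Enum):
    OPENCL = "opencl"
    PTX = "ptx"
    CUDA = "cuda"
    METAL = "metal"


def module_args(backend: Backend, tornado_home: str) -> List[str]:
    """Return backend-specific JVM arguments (hypothetical helper).

    Assumes every backend uses '<name>-exports' and 'tornado.drivers.<name>';
    the PR text only confirms this for CUDA.
    """
    name = backend.value
    return [
        f"@{tornado_home}/etc/exportLists/{name}-exports",
        "--add-modules",
        # The real launcher prepends further modules here (elided in the PR text).
        f"tornado.drivers.common,tornado.drivers.{name}",
    ]


def parse_backend(argv: List[str]) -> Optional[Backend]:
    """Parse one of --opencl/--ptx/--cuda/--metal; None if no flag is given."""
    parser = argparse.ArgumentParser(prog="llama-tornado")
    group = parser.add_mutually_exclusive_group()
    for backend in Backend:
        group.add_argument(
            f"--{backend.value}",
            dest="backend",
            action="store_const",
            const=backend,
        )
    return parser.parse_args(argv).backend


if __name__ == "__main__":
    chosen = parse_backend(["--cuda"])
    print(module_args(chosen, "/opt/tornadovm"))
```

Keeping the backend name as the enum value means adding a backend is a one-line change in this sketch, which is one way to read "mirrors --ptx exactly".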

TornadoVM requirement

The CUDA backend is not yet in a released TornadoVM; it lives in TornadoVM PR #861 (beehive-lab/TornadoVM#861). This PR therefore builds against TornadoVM 4.0.2-jdk21-dev (a build that includes the CUDA backend). The project's own version is unchanged. The README documents this requirement.

Validation

Built with JDK 21 against TornadoVM 4.0.2-jdk21-dev and run on an NVIDIA RTX 3070 (device 0:0, Backend: CUDA confirmed via --print-threads). All produced coherent output:

| Model | Result |
| --- | --- |
| llama-3.2-1b-instruct-q8_0 (Llama Q8) | ✅ "The capital of France is Paris." |
| Llama-3.2-1B-Instruct.FP16 (Llama FP16) | ✅ coherent |
| granite-3.2-2b-instruct-Q8_0 (Granite) | ✅ coherent |
| qwen2.5-1.5b-instruct-q8_0 (Qwen) | ✅ coherent |

No regression: --opencl, --ptx, and --metal still parse and wire their respective driver modules/export lists.

Changes

  • llama-tornado: add CUDA to the Backend enum, add the --cuda argparse flag, add the CUDA module-config branch, update the docstring.
  • pom.xml: build the JDK21 path against TornadoVM 4.0.2-jdk21-dev.
  • README.md: list CUDA among supported backends, add a --cuda example, document the PR #861 requirement.

Add a --cuda flag to llama-tornado that selects the TornadoVM CUDA backend,
mirroring the existing --opencl/--ptx/--metal plumbing: it loads the
tornado.drivers.cuda module and the cuda-exports export list. Also disambiguate
--ptx help text (was 'PTX/CUDA').

The CUDA backend is only available in a dev build of TornadoVM (PR #861), so
point the JDK21 build at 4.0.2-jdk21-dev. The project's own version is unchanged.

List CUDA among the supported backends, add a --cuda usage example, and note
that the CUDA backend requires a TornadoVM build with the CUDA backend from
PR #861 (beehive-lab/TornadoVM#861).
Copilot AI review requested due to automatic review settings June 20, 2026 11:47

Copilot AI left a comment

Contributor

Pull request overview

Adds first-class --cuda backend selection to the llama-tornado launcher, wiring it to TornadoVM’s new CUDA driver module/export list and updating docs/build metadata to reflect the new backend option.

Changes:

  • Extend launcher backend selection to include --cuda (new Backend.CUDA + module/export wiring).
  • Build against TornadoVM 4.0.2-jdk21-dev to pick up the unreleased CUDA backend.
  • Update README examples/help text to document the CUDA backend and the TornadoVM PR #861 requirement.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| README.md | Documents --cuda usage and updates backend/help text examples. |
| pom.xml | Updates TornadoVM dependency versioning to a -dev build for CUDA support. |
| llama-tornado | Adds Backend.CUDA, --cuda flag, and CUDA module/export configuration. |


Comment thread README.md
Comment on lines +69 to +70
- [TornadoVM](https://github.com/beehive-lab/TornadoVM) with OpenCL, PTX, or CUDA backends.
- The `--cuda` backend requires a TornadoVM build that includes the CUDA backend from [TornadoVM PR #861](https://github.com/beehive-lab/TornadoVM/pull/861). This project currently builds against TornadoVM `4.0.2-jdk21-dev`.
Comment thread README.md
- --ptx Use PTX/CUDA backend (default: None)
+ --ptx Use PTX backend (default: None)
+ --cuda Use CUDA backend (requires TornadoVM built with the CUDA backend) (default: None)
  --metal Use Apple Metal backend (macOS only) (default: None)
Comment thread pom.xml
Comment on lines +42 to +45
  <tornadovm.base.version>4.0.2</tornadovm.base.version>
  <jdk.version.suffix>-jdk21</jdk.version.suffix>
- <tornadovm.version>${tornadovm.base.version}${jdk.version.suffix}</tornadovm.version>
+ <!-- CUDA backend is only available in a dev build of TornadoVM (PR #861) -->
+ <tornadovm.version>${tornadovm.base.version}${jdk.version.suffix}-dev</tornadovm.version>
@mikepapadim changed the title from "feat: add CUDA backend path to llama-tornado launcher" to "Add pure CUDA backend along the PTX engine" on Jun 20, 2026
Add a cuda variant to the build, standalone-inference, and quarkus-integration
backend matrices. The setup-tornadovm action now builds the CUDA backend from
the cuda2 branch (TornadoVM PR #861) until it is merged to master; other
backends still build from master. Shared inference steps run on CUDA via the
matrix; the PTX-only CUDA-graph steps remain gated to ptx.
@mikepapadim

Member Author

CUDA vs PTX performance comparison

Benchmarked the new --cuda backend against the released PTX backend on identical hardware and an identical GPULlama3 build, across 4 models and two prompt sizes (a short input and a long, roughly 150-token input). All numbers are from end-to-end llama-tornado runs.

Total throughput (tokens/s) — higher is better

| Model | Prompt | CUDA tok/s | PTX tok/s | CUDA speedup |
| --- | --- | --- | --- | --- |
| Llama-3.2-1B-Q8 | small | 56.95 | 53.20 | 1.07× |
| Llama-3.2-1B-Q8 | large | 60.11 | 54.06 | 1.11× |
| Qwen2.5-1.5B-Q8 | small | 32.86 | 17.65 | 1.86× |
| Qwen2.5-1.5B-Q8 | large | 29.12 | 12.27 | 2.37× |
| Qwen3-1.7B-Q8 | small | 35.46 | 16.17 | 2.19× |
| Qwen3-1.7B-Q8 | large | 34.66 | 12.18 | 2.85× |
| Granite-3.2-2B-Q8 | small | 26.59 | 23.13 | 1.15× |
| Granite-3.2-2B-Q8 | large | 27.20 | 22.37 | 1.22× |

Prefill / decode / total split (tokens/s and seconds)

Prefill = prompt processing, Decode = token generation. * rows are reported as a single total phase by the engine for that model (no prefill/decode split available).

| Model | Prompt | Backend | Prefill tok/s | Decode tok/s | Total tok/s | Total tok | Total s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Q8 | small | CUDA | 63.80 | 55.65 | 56.95 | 123 | 2.16 |
| Llama-3.2-1B-Q8 | small | PTX | 58.12 | 52.06 | 53.20 | 107 | 2.01 |
| Llama-3.2-1B-Q8 | large | CUDA | 66.33 | 51.52 | 60.11 | 236 | 3.93 |
| Llama-3.2-1B-Q8 | large | PTX | 59.74 | 47.10 | 54.06 | 248 | 4.59 |
| Qwen2.5-1.5B-Q8 | small | CUDA * | 35.91 | 32.26 | 32.86 | 106 | 3.23 |
| Qwen2.5-1.5B-Q8 | small | PTX * | 25.74 | 16.58 | 17.65 | 111 | 6.29 |
| Qwen2.5-1.5B-Q8 | large | CUDA * | 31.82 | 25.92 | 29.12 | 250 | 8.59 |
| Qwen2.5-1.5B-Q8 | large | PTX * | 16.24 | 8.73 | 12.27 | 237 | 19.31 |
| Qwen3-1.7B-Q8 | small | CUDA | 42.62 | 34.48 | 35.46 | 131 | 3.69 |
| Qwen3-1.7B-Q8 | small | PTX | 31.89 | 15.11 | 16.17 | 153 | 9.46 |
| Qwen3-1.7B-Q8 | large | CUDA | 39.85 | 29.41 | 34.66 | 256 | 7.39 |
| Qwen3-1.7B-Q8 | large | PTX | 17.23 | 8.69 | 12.18 | 256 | 21.02 |
| Granite-3.2-2B-Q8 | small | CUDA * | 26.29 | 26.62 | 26.59 | 211 | 7.94 |
| Granite-3.2-2B-Q8 | small | PTX * | 23.05 | 23.15 | 23.13 | 131 | 5.66 |
| Granite-3.2-2B-Q8 | large | CUDA * | 28.33 | 25.64 | 27.20 | 256 | 9.41 |
| Granite-3.2-2B-Q8 | large | PTX * | 23.32 | 21.05 | 22.37 | 256 | 11.44 |
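As a sanity check on how the columns relate, the sketch below assumes that "Total tok/s" is simply "Total tok" divided by "Total s" and that the speedup is the ratio of the two totals. The function names are illustrative; the inputs are the Qwen3-1.7B-Q8 large-prompt rows from the table above. Because the table's seconds are rounded, the recomputed figures agree with the reported ones only to about two decimal places.

```python
def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one run (assumed definition of 'Total tok/s')."""
    return tokens / seconds


def speedup(fast: float, slow: float) -> float:
    """Ratio of two throughputs; values above 1 mean 'fast' is quicker."""
    return fast / slow


# Qwen3-1.7B-Q8, large prompt: (total tokens, total seconds) from the table.
cuda_rate = throughput(256, 7.39)
ptx_rate = throughput(256, 21.02)
ratio = speedup(cuda_rate, ptx_rate)

print(f"CUDA {cuda_rate:.2f} tok/s, PTX {ptx_rate:.2f} tok/s, speedup {ratio:.2f}x")
```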

One-time JIT compile cost (ms)

CUDA compiles generated CUDA-C via NVRTC (CUDA-C → PTX), which is heavier than PTX assembly. This is a one-time warm-up cost (amortized over the run), not per-token.

| Model | CUDA JIT (ms) | PTX JIT (ms) |
| --- | --- | --- |
| Llama-3.2-1B-Q8 | 12677 | 2519 |
| Qwen2.5-1.5B-Q8 | 23381 | 4143 |
| Qwen3-1.7B-Q8 | 24689 | 3579 |
| Granite-3.2-2B-Q8 | 31078 | 5571 |
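To make "amortized over the run" concrete, here is a rough break-even estimate. It is a simplified model, not something measured in this PR: it assumes each backend generates tokens at a constant rate (the small-prompt total tok/s from the tables above), ignores model load time, and treats the JIT figures as a fixed up-front cost. The function name `break_even_tokens` is illustrative.

```python
def break_even_tokens(jit_a_ms: float, rate_a: float,
                      jit_b_ms: float, rate_b: float) -> float:
    """Tokens after which backend A (slower compile, faster rate) has
    caught up with backend B in total wall-clock time.

    Solves: jit_a + n / rate_a == jit_b + n / rate_b  for n.
    """
    if rate_a <= rate_b:
        raise ValueError("backend A must have the higher token rate")
    extra_compile_s = (jit_a_ms - jit_b_ms) / 1000.0
    saving_per_token_s = 1.0 / rate_b - 1.0 / rate_a
    return extra_compile_s / saving_per_token_s


# JIT (ms) and small-prompt total tok/s, CUDA first, then PTX.
llama = break_even_tokens(12677, 56.95, 2519, 53.20)
qwen3 = break_even_tokens(24689, 35.46, 3579, 16.17)

print(f"Llama-3.2-1B-Q8: about {llama:.0f} tokens")  # roughly 8,200
print(f"Qwen3-1.7B-Q8:   about {qwen3:.0f} tokens")  # roughly 630
```

Under these assumptions the extra NVRTC time pays for itself quickly on models where CUDA's steady-state lead is large, and much more slowly where the lead is only a few percent.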

Methodology

  • Hardware: NVIDIA GeForce RTX 3070 (compute 8.6), device 0:0.
  • GPULlama3: same feat/cuda-backend build for both backends; only $TORNADOVM_HOME swapped.
  • CUDA backend: TornadoVM 4.0.2-jdk21-dev (CUDA backend, PR #861).
  • PTX backend: TornadoVM 4.0.1-jdk21-ptx (released, via SDKMAN).
  • Settings: --with-prefill-decode --verbose-init -n 256 --seed 42, JSON metrics via llama.metrics.format=json. JDK 21.0.2.
  • Single cold run per cell (includes model load + JIT); decode/total rates exclude load/compile.

Takeaways

  • CUDA is faster than PTX on every model/prompt for end-to-end throughput on this GPU, from ~+7% (Llama 1B) up to ~2–3× on the Qwen families.
  • The gap widens on larger prompts (prefill-heavy), where CUDA's prefill throughput holds up much better than PTX on the Qwen models.
  • CUDA's only cost is a higher one-time NVRTC compile time vs PTX assembly; it does not affect steady-state token rate.
  • All four models produce coherent output on CUDA (validated separately).

@mikepapadim

Member Author

CUDA vs PTX performance comparison — FP16 models

Companion to the Q8_0 comparison above, this run uses FP16 models. Same methodology: identical GPULlama3 build, only $TORNADOVM_HOME swapped between the CUDA (4.0.2-jdk21-dev, PR #861) and released PTX (4.0.1-jdk21-ptx) backends; small (short input) and large (~150-token input) prompts.

Total throughput (tokens/s) — higher is better

| Model | Prompt | CUDA tok/s | PTX tok/s | CUDA speedup |
| --- | --- | --- | --- | --- |
| Llama-3.2-1B-F16 | small | 58.72 | 34.56 | 1.70× |
| Llama-3.2-1B-F16 | large | 63.93 | 36.51 | 1.75× |
| Qwen3-1.7B-F16 | small | 33.09 | 14.05 | 2.36× |
| Qwen3-1.7B-F16 | large | 30.82 | 10.20 | 3.02× |
| Granite-3.2-2B-F16 | small | 25.29 | 15.00 | 1.69× |
| Granite-3.2-2B-F16 | large | 25.70 | 14.52 | 1.77× |

Prefill / decode / total split (tokens/s)

* = engine reports a single total phase for that model (no prefill/decode split).

| Model | Prompt | Backend | Prefill tok/s | Decode tok/s | Total tok/s | Total tok | Total s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-F16 | small | CUDA | 63.90 | 57.59 | 58.72 | 113 | 1.92 |
| Llama-3.2-1B-F16 | small | PTX | 38.02 | 33.83 | 34.56 | 116 | 3.36 |
| Llama-3.2-1B-F16 | large | CUDA | 70.93 | 54.10 | 63.93 | 233 | 3.64 |
| Llama-3.2-1B-F16 | large | PTX | 38.43 | 32.82 | 36.51 | 218 | 5.97 |
| Qwen3-1.7B-F16 | small | CUDA | 38.09 | 32.38 | 33.09 | 132 | 3.99 |
| Qwen3-1.7B-F16 | small | PTX | 21.89 | 13.21 | 14.05 | 126 | 8.97 |
| Qwen3-1.7B-F16 | large | CUDA | 35.25 | 26.29 | 30.82 | 256 | 8.31 |
| Qwen3-1.7B-F16 | large | PTX | 13.53 | 7.63 | 10.20 | 256 | 25.09 |
| Granite-3.2-2B-F16 | small | CUDA * | 26.53 | 25.19 | 25.29 | 256 | 10.12 |
| Granite-3.2-2B-F16 | small | PTX * | 14.46 | 15.08 | 15.00 | 176 | 11.73 |
| Granite-3.2-2B-F16 | large | CUDA * | 26.81 | 24.17 | 25.70 | 256 | 9.96 |
| Granite-3.2-2B-F16 | large | PTX * | 14.94 | 13.92 | 14.52 | 256 | 17.63 |

One-time JIT compile cost (ms)

| Model | CUDA JIT (ms) | PTX JIT (ms) |
| --- | --- | --- |
| Llama-3.2-1B-F16 | 11406 | 1993 |
| Qwen3-1.7B-F16 | 21835 | 3200 |
| Granite-3.2-2B-F16 | 28228 | 4818 |

Notes

  • DeepSeek-R1-Distill-Qwen-1.5B-F16 failed on both backends with NoSuchElementException: No value present during model setup — a model-loading issue unrelated to the GPU backend, so it is excluded from the comparison.
  • Hardware/settings: NVIDIA RTX 3070 (compute 8.6), JDK 21.0.2, --with-prefill-decode --verbose-init -n 256 --seed 42, JSON metrics. Single cold run per cell.

Takeaways

  • On FP16, CUDA's lead over PTX is even larger than on Q8_0 — roughly 1.7× (Llama, Granite) up to ~3× (Qwen3, large prompt) end-to-end.
  • As with Q8, the gap widens on larger (prefill-heavy) prompts.
  • CUDA again pays a higher one-time NVRTC compile cost; steady-state token rate is unaffected.

@mikepapadim

Member Author

CUDA vs OpenCL performance comparison — FP16 models

OpenCL counterpart to the FP16 CUDA/PTX comparison above. This is the cleanest apples-to-apples comparison of the three: both backends are built from the same TornadoVM source and commit (4.0.2-jdk21-dev), so only the code-generation/runtime backend differs. Same GPULlama3 build, same GPU, small + large prompts.

Total throughput (tokens/s) — higher is better

| Model | Prompt | CUDA tok/s | OpenCL tok/s | CUDA speedup |
| --- | --- | --- | --- | --- |
| Llama-3.2-1B-F16 | small | 58.72 | 46.15 | 1.27× |
| Llama-3.2-1B-F16 | large | 63.93 | 48.20 | 1.33× |
| Qwen3-1.7B-F16 | small | 33.09 | 26.54 | 1.25× |
| Qwen3-1.7B-F16 | large | 30.82 | 26.70 | 1.15× |
| Granite-3.2-2B-F16 | small | 25.29 | 20.25 | 1.25× |
| Granite-3.2-2B-F16 | large | 25.70 | 19.68 | 1.31× |

Prefill / decode / total split (tokens/s)

* = engine reports a single total phase for that model (no prefill/decode split).

| Model | Prompt | Backend | Prefill tok/s | Decode tok/s | Total tok/s | Total tok | Total s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-F16 | small | CUDA | 63.90 | 57.59 | 58.72 | 113 | 1.92 |
| Llama-3.2-1B-F16 | small | OpenCL | 50.27 | 45.40 | 46.15 | 131 | 2.84 |
| Llama-3.2-1B-F16 | large | CUDA | 70.93 | 54.10 | 63.93 | 233 | 3.64 |
| Llama-3.2-1B-F16 | large | OpenCL | 52.00 | 43.62 | 48.20 | 256 | 5.31 |
| Qwen3-1.7B-F16 | small | CUDA | 38.09 | 32.38 | 33.09 | 132 | 3.99 |
| Qwen3-1.7B-F16 | small | OpenCL | 29.89 | 26.11 | 26.54 | 146 | 5.50 |
| Qwen3-1.7B-F16 | large | CUDA | 35.25 | 26.29 | 30.82 | 256 | 8.31 |
| Qwen3-1.7B-F16 | large | OpenCL | 29.36 | 23.75 | 26.70 | 256 | 9.59 |
| Granite-3.2-2B-F16 | small | CUDA * | 26.53 | 25.19 | 25.29 | 256 | 10.12 |
| Granite-3.2-2B-F16 | small | OpenCL * | 19.81 | 20.32 | 20.25 | 169 | 8.34 |
| Granite-3.2-2B-F16 | large | CUDA * | 26.81 | 24.17 | 25.70 | 256 | 9.96 |
| Granite-3.2-2B-F16 | large | OpenCL * | 19.62 | 19.77 | 19.68 | 256 | 13.01 |

One-time JIT compile cost (ms)

Both compile generated kernels at runtime — OpenCL via the driver's OpenCL-C compiler, CUDA via NVRTC (CUDA-C → PTX). One-time warm-up cost, not per-token.

| Model | CUDA JIT (ms) | OpenCL JIT (ms) |
| --- | --- | --- |
| Llama-3.2-1B-F16 | 11406 | 4063 |
| Qwen3-1.7B-F16 | 21835 | 8954 |
| Granite-3.2-2B-F16 | 28228 | 9166 |

Notes

  • Version parity: both backends are TornadoVM 4.0.2-jdk21-dev built from the same commit (CUDA from PR #861); only $TORNADOVM_HOME is swapped. OpenCL device: [NVIDIA CUDA] NVIDIA GeForce RTX 3070 via the OpenCL platform.
  • DeepSeek-R1-Distill-Qwen-1.5B-F16 again failed on both backends (NoSuchElementException at model setup — model-loading issue, not backend-related) and is excluded.
  • Settings: RTX 3070, JDK 21.0.2, --with-prefill-decode --verbose-init -n 256 --seed 42, JSON metrics, single cold run per cell.

Takeaways

  • CUDA is ~1.15–1.33× faster than OpenCL end-to-end on FP16 across these models — a smaller margin than CUDA-vs-PTX, since OpenCL is the stronger of the two existing NVIDIA paths here.
  • CUDA's JIT (NVRTC) warm-up is heavier than OpenCL's; steady-state token rate is unaffected.
  • Net ordering on this GPU (FP16): CUDA > OpenCL > PTX.

@mikepapadim

Member Author

📊 Performance summary — what this PR adds

This PR adds a first-class CUDA backend path to llama-tornado. On an NVIDIA RTX 3070, it is the fastest of the three NVIDIA-capable TornadoVM backends for end-to-end LLM inference: CUDA > OpenCL > PTX.

Throughput below is tokens/s, averaged over a short and a long prompt (full per-prompt / prefill-decode breakdowns are in the three comments above). OpenCL was measured for FP16 only; both CUDA and OpenCL use the same TornadoVM 4.0.2-jdk21-dev build, PTX uses the released 4.0.1-jdk21-ptx.

| Model | Precision | CUDA tok/s | PTX tok/s | OpenCL tok/s | CUDA vs PTX | CUDA vs OpenCL |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | Q8_0 | 58.5 | 53.6 | | 1.09× | |
| Llama-3.2-1B | FP16 | 61.3 | 35.5 | 47.2 | 1.73× | 1.30× |
| Qwen2.5-1.5B | Q8_0 | 31.0 | 15.0 | | 2.07× | |
| Qwen3-1.7B | Q8_0 | 35.1 | 14.2 | | 2.47× | |
| Qwen3-1.7B | FP16 | 32.0 | 12.1 | 26.6 | 2.64× | 1.20× |
| Granite-3.2-2B | Q8_0 | 26.9 | 22.8 | | 1.18× | |
| Granite-3.2-2B | FP16 | 25.5 | 14.8 | 20.0 | 1.73× | 1.28× |

Speedup ranges (this PR's CUDA backend):

  • vs PTX: 1.09×–2.64× faster (geo-mean ≈ 1.76×)
  • vs OpenCL (FP16): 1.20×–1.30× faster (geo-mean ≈ 1.26×)
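The averages and geometric means quoted above can be reproduced with a few lines. This sketch assumes the per-model figure is the plain arithmetic mean of the small-prompt and large-prompt totals, and that the geo-mean is taken over the rounded speedup column of the summary table; both helper names are illustrative.

```python
import math
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean, used here to average the small and large prompt runs."""
    return sum(values) / len(values)


def geo_mean(values: Sequence[float]) -> float:
    """Geometric mean, the usual way to summarize a set of speedup ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Llama-3.2-1B Q8_0 on CUDA: small and large prompt totals (tok/s).
llama_q8_cuda = mean([56.95, 60.11])

# Speedup columns from the summary table.
vs_ptx = [1.09, 1.73, 2.07, 2.47, 2.64, 1.18, 1.73]
vs_opencl = [1.30, 1.20, 1.28]

print(f"Llama Q8 CUDA average: {llama_q8_cuda:.1f} tok/s")  # 58.5
print(f"geo-mean vs PTX:    {geo_mean(vs_ptx):.2f}x")       # 1.76
print(f"geo-mean vs OpenCL: {geo_mean(vs_opencl):.2f}x")    # 1.26
```

A geometric mean is used for the ratios because it treats a 2× speedup and a 0.5× slowdown symmetrically, which an arithmetic mean does not.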

What's included in this PR

  • --cuda launcher flag wired to TornadoVM's CUDA backend (tornado.drivers.cuda + cuda-exports), symmetric with --opencl/--ptx/--metal.
  • Builds against TornadoVM 4.0.2-jdk21-dev (CUDA backend from TornadoVM PR #861); README documents the requirement.
  • CI: cuda added to the build / inference / quarkus matrices (CUDA built from the cuda2 branch until merged).

Validation

  • Coherent output on CUDA for Llama 3.2 1B (Q8 & FP16), Qwen2.5 1.5B, Qwen3 1.7B, Granite 3.2 2B — device confirmed as CUDA / RTX 3070.
  • No regression: --opencl, --ptx, --metal still parse and wire correctly.

Hardware: RTX 3070 (cc 8.6), JDK 21.0.2. DeepSeek-Qwen-1.5B-F16 excluded — it fails to load on all three backends (model issue, not backend-specific).

@orionpapadakis

Collaborator

Performance: CUDA backend vs PTX and OpenCL

Decode throughput (eval tok/s) per backend, with CUDA speedup. Benchmark: each model with a fixed prompt, max_tokens=256, 3 reps (mean), RTX 5090 Laptop, CUDA 13.1 toolkit / 13.0 driver, Java 21. cuda-graphs configurations run on PTX & CUDA only (OpenCL N/A).

| Model | Size | Quant | Configuration | OpenCL tok/s | PTX tok/s | CUDA tok/s | CUDA vs PTX | CUDA vs OpenCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 | 0.6B | F16 | standard | 39.5 | 11.5 | 37.7 | 3.27× (+227%) | 0.95× (-5%) |
| Qwen3 | 0.6B | F16 | prefill-decode | 39.6 | 11.4 | 37.7 | 3.29× (+229%) | 0.95× (-5%) |
| Qwen3 | 0.6B | F16 | batch-prefill-decode | 38.8 | 11.4 | 37.0 | 3.25× (+225%) | 0.96× (-4%) |
| Qwen3 | 0.6B | F16 | prefill-decode + cuda-graphs | | 12.2 | 40.5 | 3.32× (+232%) | |
| Qwen3 | 0.6B | F16 | batch-prefill-decode + cuda-graphs | | 12.1 | 39.3 | 3.26× (+226%) | |
| Qwen3 | 0.6B | Q8_0 | standard | 40.9 | 12.6 | 39.2 | 3.11× (+211%) | 0.96× (-4%) |
| Qwen3 | 0.6B | Q8_0 | prefill-decode | 40.7 | 12.5 | 39.1 | 3.11× (+211%) | 0.96× (-4%) |
| Qwen3 | 0.6B | Q8_0 | batch-prefill-decode | 37.5 | 12.1 | 38.2 | 3.17× (+217%) | 1.02× (+2%) |
| Qwen3 | 0.6B | Q8_0 | prefill-decode + cuda-graphs | | 12.7 | 42.7 | 3.35× (+235%) | |
| Qwen3 | 0.6B | Q8_0 | batch-prefill-decode + cuda-graphs | | 12.6 | 42.7 | 3.40× (+240%) | |
| Llama-3.2 | 1B | F16 | standard | 59.0 | 43.1 | 63.8 | 1.48× (+48%) | 1.08× (+8%) |
| Llama-3.2 | 1B | F16 | prefill-decode | 59.1 | 43.2 | 63.3 | 1.47× (+47%) | 1.07× (+7%) |
| Llama-3.2 | 1B | F16 | batch-prefill-decode | 57.8 | 42.2 | 62.6 | 1.48× (+48%) | 1.08× (+8%) |
| Llama-3.2 | 1B | F16 | prefill-decode + cuda-graphs | | 46.8 | 67.6 | 1.44× (+44%) | |
| Llama-3.2 | 1B | F16 | batch-prefill-decode + cuda-graphs | | 44.8 | 66.6 | 1.49× (+49%) | |
| Llama-3.2 | 1B | Q8_0 | standard | 62.2 | 60.8 | 71.3 | 1.17× (+17%) | 1.15× (+15%) |
| Llama-3.2 | 1B | Q8_0 | prefill-decode | 62.1 | 60.4 | 71.0 | 1.18× (+18%) | 1.14× (+14%) |
| Llama-3.2 | 1B | Q8_0 | batch-prefill-decode | 60.2 | 58.9 | 68.6 | 1.16× (+16%) | 1.14× (+14%) |
| Llama-3.2 | 1B | Q8_0 | prefill-decode + cuda-graphs | | 71.3 | 78.4 | 1.10× (+10%) | |
| Llama-3.2 | 1B | Q8_0 | batch-prefill-decode + cuda-graphs | | 68.9 | 75.8 | 1.10× (+10%) | |
| Qwen2.5 | 1.5B | F16 | standard | 30.7 | 10.3 | 28.0 | 2.72× (+172%) | 0.91× (-9%) |
| Qwen2.5 | 1.5B | Q8_0 | standard | 31.6 | 11.8 | 30.2 | 2.57× (+157%) | 0.96× (-4%) |
| Granite-3.2 | 2B | F16 | standard | 24.8 | 18.5 | 28.5 | 1.54× (+54%) | 1.15× (+15%) |
| Granite-3.2 | 2B | Q8_0 | standard | 27.5 | 26.5 | 31.1 | 1.17× (+17%) | 1.13× (+13%) |
| Granite-4.0 | 1B | F16 | standard | 24.2 | 7.9 | 24.8 | 3.15× (+215%) | 1.03× (+3%) |
| Granite-4.0 | 1B | Q8_0 | standard | 27.6 | 8.6 | 26.3 | 3.07× (+207%) | 0.95× (-5%) |
| Phi-3-mini | 3.8B | F16 | standard | 24.7 | 11.3 | 26.8 | 2.38× (+138%) | 1.09× (+9%) |
| Phi-3-mini | 3.8B | Q8_0 | standard | 26.5 | 14.3 | 30.6 | 2.13× (+113%) | 1.15× (+15%) |
| Mistral | 7B | F16 | standard | 8.4 | 5.0 | 8.5 | 1.70× (+70%) | 1.02× (+2%) |
| Mistral | 7B | Q8_0 | standard | 8.8 | 6.7 | 9.4 | 1.41× (+41%) | 1.06× (+6%) |

Highlights: CUDA is faster than PTX in every case (+10% to +240%), and ≥ OpenCL on most models (up to +15%), trailing OpenCL only slightly on the small Qwen family and Granite-4.0-1B (−4% to −9%). CUDA + cuda-graphs gives the top result overall (Llama-3.2-1B Q8_0: 78.4 tok/s).
