Fix build by arthw · Pull Request #12 · arthw/llama.cpp

arthw · 2026-06-07T10:20:47Z

Make sure to read the contributing guidelines before submitting a PR

* Refactored Compressed Tensors NVFP4 support for new base.py
* Support compressed-tensors NVFP4 conversion
* Moved Qwen MTP remap into filter_tensors
* simplify
* pathlib no longer used

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS, but nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks the backend about the declared op, so it tested an elementwise MUL on a q8_0 weight. That used to return true unconditionally, and the weight stayed on GPU by luck. Once supports_op told the truth, the probe got a no and the loader pushed the weight and its matmul to CPU, splitting the graph. Tagging it MUL_MAT asks the real question; the math is unchanged. Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
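
The placement problem described above can be illustrated with a toy model. This is only a sketch: `supports_op`, `pick_buffer_type`, and the rule that an elementwise MUL rejects quantized weights are simplified stand-ins written for this example, not the real ggml backend API.

```python
# Toy stand-ins, not the real ggml API: names and rules here are assumptions.
QUANTIZED_TYPES = {"q8_0", "q5_K"}


def supports_op(op: str, weight_type: str) -> bool:
    """Hypothetical GPU capability check that answers honestly."""
    if op == "MUL_MAT":
        return True  # matmul kernels accept quantized weights
    if op == "MUL":
        return weight_type not in QUANTIZED_TYPES  # elementwise op wants float data
    return False


def pick_buffer_type(declared_op: str, weight_type: str) -> str:
    """The loader probes the op the tensor is *declared* with, not the op actually used."""
    return "GPU" if supports_op(declared_op, weight_type) else "CPU"


before = pick_buffer_type("MUL", "q8_0")      # declared op was wrong
after = pick_buffer_type("MUL_MAT", "q8_0")   # declared op matches real usage
print(before, after)
```

With the wrong declaration the weight lands on the CPU even though the matmul that consumes it would have been supported on the GPU.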

* ci : disable SYCL f16 builds
* ci : extract android and hip into separate workflows
* ci : move webgpu to separate workflow
* ci : move the rpc to a separate workflow
* ci : extract s390x and ppcl jobs
* ci : extract opencl job into a separate workflow

…ing GPU profiling (ggml-org#23457)

* Only run webgpu CI on my fork
* Add webgpu only workflow
* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled
* restore build.yml

…L_MAT pipeline (ggml-org#23594)

* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K
* Fix to editorconfig checking pass
* Remove mul-mat-legacy pipeline
* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx

* initial talkie support, coherent
* reorder to follow convention
* absorb inverse rope
* stop folding scalars to improve quantization
* use broadcasting instead of duplication
* style cleanup
* add scaling support to LoraTorchTensor; use that path in conversion
* use layer_out_scale instead of embd_skip_scale

Create a pool of N threads that grab a chunk of up to 100 tests at a time to iterate through. The number of tests at a time decreases as fewer remain. Each thread uses its own dev and cpu backend, and set_n_threads_fn is not called on the cpu backend.

Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Mutex calls to print_test_result.
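
The chunked work distribution described above might look roughly like the following. This is a minimal sketch, not the test harness itself: `run_tests_in_pool` and `run_one` are hypothetical names, and the shrink rule (`remaining // n_threads`, capped at 100) is an assumption, since the commit message only says the chunk size decreases as fewer tests remain.

```python
import threading

MAX_CHUNK = 100  # upper bound taken from the commit message


def run_tests_in_pool(tests, n_threads, run_one):
    """Run run_one(test) for every test, using workers that pull chunks of indices."""
    lock = threading.Lock()
    next_index = 0
    results = [None] * len(tests)

    def grab_chunk():
        nonlocal next_index
        with lock:
            remaining = len(tests) - next_index
            if remaining == 0:
                return None
            # Assumed shrink rule: split what is left evenly, capped at MAX_CHUNK.
            size = max(1, min(MAX_CHUNK, remaining // n_threads))
            start = next_index
            next_index += size
            return start, start + size

    def worker():
        # Each worker would own its private backend objects here.
        while (chunk := grab_chunk()) is not None:
            for i in range(*chunk):
                results[i] = run_one(tests[i])

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


squares = run_tests_in_pool(list(range(1000)), n_threads=4, run_one=lambda x: x * x)
```

Shrinking the chunk near the end keeps one thread from being left with a large tail while the others sit idle.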

* SYCL: implement ggml_sycl_pool_vmm
* Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM
* Clean up debugging logging
* document GGML_SYCL_DISABLE_VMM
* Multi-stream MoE optimization
* Revert "Multi-stream MoE optimization"
  This reverts commit 938929c.
* Update common.hpp
  Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
* Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM
* add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro)
* Apply suggestions from code review
  Co-authored-by: Alexey Kopytko <alexey@kopytko.com>
* Apply suggestion from @sanmai
* Apply suggestion from @sanmai

---------

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* hexagon: add support for CONCAT with optimized concat_2d_transposed
  qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.
* hex-concat: use fastdiv in generic version
* hex-concat: make checks for transposed a bit more readable
* hex-concat: reorder dma ops for better pipelining
* hex-cont/cpy: optimize CPY and CONT ops
  The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr. This causes runtime divs by that value, which is normally just 4 or 2 (f32/f16).
* hex-get-rows: optimize GET_ROWS for large rows
  We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+) rows. Also bump the DMA queue depth now that we can take advantage of it.
* hex-concat: unroll the inner loops of concat_2d
* hex-concat: more updates to concat_2d to improve perf a bit further
* hex-cpy: fixed n_rows per thread checks in the copy ops
* hmx-fa: fix alignment issues while computing dma sizes
* hex-set-rows: add early returns for idle threads
* hvx-rope: minor optimization to replace loops with fastdiv logic
* hex-rope: replace scalar tail processing with HVX
* hex-rope: optimize rope cache init with HVX
  Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc). Use the helpers to optimize ROPE.
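
For readers unfamiliar with the "fastdiv" idea mentioned in several items above, here is a small sketch of the classic multiply-and-shift technique for dividing by a divisor that is only known at runtime. It is an illustration under assumptions: `init_fastdiv` and `fastdiv` are names chosen for this example, and the Hexagon implementation in the commits may differ in detail.

```python
def init_fastdiv(d: int):
    """Precompute (magic, shift) once per divisor d, with 0 < d < 2**32."""
    assert 0 < d < 2**32
    shift = 0
    while (1 << shift) < d:  # shift = ceil(log2(d))
        shift += 1
    magic = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return magic, shift


def fastdiv(n: int, magic: int, shift: int) -> int:
    """Compute n // d for 0 <= n < 2**32 with a multiply, an add and a shift."""
    hi = (n * magic) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> shift


magic, shift = init_fastdiv(7)
print(fastdiv(100, magic, shift))
```

The expensive part (`init_fastdiv`) runs once outside the loop, so the inner loop no longer contains a division instruction.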

* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d
* vulkan: skip conv2d bounds checks when shapes align with tile sizes
* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d
* vulkan: stage cm2 conv2d accumulator through shmem before global store
* vulkan: add coopmat1 conv2d path
* fallback when using too much shared memory. clean up comments
* Require 16x16x16 and subgroup size 32 or 64
* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values

* ci : skip release workflow on master when commit message contains [no release]
  Assisted-by: llama.cpp:local pi
* ci : restrict sanitizer builds to x86_64 + fix build type
  the spark is apparently too slow for some reason
* tests : fix undefined warning [no ci]

…#23734)

* ci : move [no release] check to dedicated check_release job
  Move the workflow-level `if` condition that skips builds when the commit message contains `[no release]` into a lightweight `check_release` job. All build jobs now depend on it via `needs` and check its output. This ensures the skip logic is evaluated at the job level rather than at the workflow level, which is the recommended approach for conditional jobs.
  Assisted-by: llama.cpp:local pi
* cont : use `fast` runner

…l-org#23763)

* ci : fix undefined sanitizer build to use Debug build type only
* ci : ccache the server builds
* cont : remove ui dependency + reuse ccache for both ubuntu jobs
* tmp : force ccache save
* Revert "tmp : force ccache save"
  This reverts commit a857b03.
* cont : no need for node.js

Mistral explicitly sets `moe` and `llama_4_scaling` to `null` in params.json, breaking `key in dict` checks during conversion. Replace with `dict.get(key) is not None` where this matters. Fixes `convert-hf-to-gguf.py --mistral-format Mistral-Medium-3.5-128B`
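
The difference between the two checks can be seen in a few lines. This is a sketch with an assumed, heavily shortened `params.json` excerpt; only the `moe` and `llama_4_scaling` keys come from the description above, and `has_value` is a helper name made up for the example.

```python
import json

# Illustrative excerpt only: "dim" and its value are placeholders for this example.
params = json.loads('{"dim": 4096, "moe": null, "llama_4_scaling": null}')


def has_value(d: dict, key: str) -> bool:
    """True only when the key is present and not JSON null."""
    return d.get(key) is not None


key_present = "moe" in params            # membership test: the key exists, value is None
value_present = has_value(params, "moe") # value test: None is treated as absent
print(key_present, value_present)
```

A membership test treats an explicit `null` as "feature configured", which is what broke the conversion; testing the value handles both a missing key and an explicit `null`.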

* common : relax sampler name matching
  Currently, in some cases, the alternative names for samplers (like `top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are not recognized by the `common_sampler_types_from_names` function in `common/sampling.cpp`.
  This PR changes the signature of this function to remove the `bool allow_alt_names` flag, and removes all occurrences of the flag from call sites. Therefore, the function will now always match all known names.
  I also changed the logic of the function to unconditionally check the provided sampler names against both the canonical and alternative names, and to be case-insensitive.
  This fixes an issue I was seeing wherein samplers specified in the `llama-server` UI were not recognized as valid when the alternative names were used.
* add more alt names
* cont. fix
* cast to unsigned char for correctness
* common : unify sampler name mapping
* annotate canonical vs. alt sampler name mappings per @CISC
* Update common/sampling.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : auto-generate sampler name aliases per @ngxson
* use merged map for matching
* use `.merge` instead of iterating
* nit: simplify comment
* nit: use insert everywhere, not index assignment

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
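
The matching behavior described above can be sketched in Python (the real code is C++ in `common/sampling.cpp`). Everything here is an assumption made for illustration: the list of canonical names, the rule that aliases are generated by swapping `_` for `-`, and the choice to skip unknown names silently.

```python
# Hypothetical canonical names; the real list lives in common/sampling.cpp.
CANONICAL = ["top_k", "top_p", "min_p", "temperature"]


def build_name_map(canonical):
    """Merge canonical names and auto-generated aliases into one lookup table."""
    name_map = {}
    for name in canonical:
        name_map[name] = name
        name_map[name.replace("_", "-")] = name  # assumed alias rule: top_k -> top-k
    return name_map


def sampler_types_from_names(names, name_map):
    """Resolve user-supplied names case-insensitively; unknown names are skipped."""
    resolved = []
    for raw in names:
        key = raw.lower()
        if key in name_map:
            resolved.append(name_map[key])
    return resolved


NAME_MAP = build_name_map(CANONICAL)
resolved = sampler_types_from_names(["Top-K", "min_p", "bogus"], NAME_MAP)
print(resolved)
```

Using a single merged map means there is no separate code path for alternative names, which is the point of dropping the `allow_alt_names` flag.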

* cuda: reset device in get_memory function if no backend is active
* also count device and host buffers
* exclude hip and musa from counting and device reset
* use device mutex instead of atomic
* undo backend_free function move

…#23991) This allows vec4 loads of the B elements. Also increase BK to 64 when this is enabled. Neither of these alone is consistently faster, but together these give a nice speedup. In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are multiples of 4.

* wip
* ok: lazy bitmap API
* remember to free lazy text
* wip
* add mtmd_helper_video
* support video input on server (base64 input)
* add MTMD_VIDEO config
* add timestamp
* update CLI
* cli: allow auto-completion for video
* add --video arg
* fix build
* update docs
* rename as suggested

…ator (ggml-org#24000)

* Only run webgpu CI on my fork
* Add webgpu only workflow
* handle buffer overlap case for concat operator
* restore build-webgpu.yml
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Run clang-format
* Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache empty, so its kq_mask buffer stays null and asserts at load. Guard each mask on its own buffer in set_input and can_reuse, base and swa. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* models: update converter to support smaller assistants
* models: add masked_embd tensors to gemma4-assist arch
* gemma-4: remove temp debug for conversion
* gemma-4-mtp: filter out masked_embedding tensors during conversion

michaelw9999 and others added 30 commits May 25, 2026 14:16

ui: fix stop/continue during an agentic loop (ggml-org#23356)

5a4126a

CUDA: add fast walsh-hadamard transform (ggml-org#23615)

c1f1e28

* CUDA: add fast walsh-hadamard transform
* review: add unrolls + change size_t -> int
* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
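
As background for the transform named in this commit, here is a plain CPU reference of the fast Walsh-Hadamard transform. It is a sketch for orientation only and says nothing about how the CUDA kernel is organized; the unnormalized convention used here is an assumption.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart within blocks of 2*h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
print(spectrum)  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the unnormalized transform twice returns the input scaled by its length, which makes a convenient self-check.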

snapdragon: bump toolchain docker to v0.7 to fix ui build issues (ggm…

4bead4e

…l-org#23680)

metal : add apple device id (ggml-org#23566)

35c9b1f

Co-authored-by: lvyichen <lvyichen@stepfun.com>

CUDA: missing PDL sync for FWHT, better fallback (ggml-org#23690)

192d8ae

models : Attach Mistral3 NVFP4 weight scales (ggml-org#23629)

6fe90de

convert : support Gemma4ForCausalLM architecture (ggml-org#23682)

dbe9c0c

* convert : support Gemma4ForCausalLM architecture (ggml-org#23674)
* fix indent

---------

Co-authored-by: Oleg Afonin <your.email@example.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : reduce (disable SYCL and CANN builds/releases) (ggml-org#23705)

3dc7684

* ci : reduce [no ci]
* cont : disable sycl, cann + rename caches [no ci]
* cont : cann [no ci]

ci : move sanitizer jobs to self-hosted runners (ggml-org#23713)

ef41a69

ci : move more CPU jobs to self-hosted runners (ggml-org#23715)

678d43d

ci : remove vulkan SDK dep from webgpu job (ggml-org#23718)

3a3ed15

* ci : remove vulkan dep from webgpu build
* cont : add ccache to `ubuntu-24-webgpu-wasm`
* ci : fix name + add wasm test

ci : move macos jobs to the apple workflow + fix names (ggml-org#23721)

5190c2e

ci : do not allocate ccache for 3rd-party hosted runners (ggml-org#23730

0d18aaa

)

* ci : do not allocate ccache for 3rd-party hosted runners [no release]
* cont : add prints [no ci] [no release]

ggml-zendnn : fixed naming of matmul function (ggml-org#20964)

b4c0549

* ggml-zendnn: fixed naming of matmul function
* ggml-zendnn: fixed naming of mul_mat_id function
* ggml-zendnn: fixed print in mul_mat_id

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>

server : fix the log message when using SSL (ggml-org#23393)

7085492

When llama-server is started with SSL key and cert, the log says that it listens on http instead of https. This patch fixes this.

convert: add MiniCPM5 tokenizer support (ggml-org#23384)

9777256

Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers. Co-authored-by: zhangtao <zhangtao2@modelbest.cn>

docs : fix duplicated "the" in granitevision and model-conversion docs (

1d971bb

ggml-org#23767) Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>

CISC and others added 23 commits June 7, 2026 14:43

spec : fix vocab compatibility check (ggml-org#24256)

8a091c4

llama : add Gemma4 MTP (ggml-org#23398)

04eb4c4

kv-cache: follow the source cache size when sharing cells (ggml-org#2…

f0156d1

…4267)

A fitted target context can end up smaller than the draft default; the oversized assistant views then overflow the shared K/V tensors and trip the ggml_view_4d size assert during graph reserve.

kv-cache : avoid kv cells copies (ggml-org#24277)

379ac66

[SYCL] Update compute runtime version to 26.x in docker (ggml-org#24070)

d403f00

* update compute runtime from 25 to 26 in docker
* add comment with old driver for multiple GPUs

metal : fix im2col 1D case (audio models) (ggml-org#24220)

daf6bc9

HIP: add gfx1152 and gfx1153 to RDNA3.5 (ggml-org#24129)

19bba67

cli: fix spinner not show during prompt processing (ggml-org#24283)

715b86a

ggml : bump version to 0.14.0 (ggml/1533)

6a1de6f

sync : ggml

c2b1518

docker: install ffmpeg in the released image (ggml-org#24302)

3ebe862

[ggml-webgpu] Implement 2D workgroups for scale, binary, and unary ops (

3b3da01

ggml-org#24044)

* Only run webgpu CI on my fork
* Add webgpu only workflow
* Implement 2d workgroups for more operations
* fix
* Fix type
* Move back to global_invocation_id

server : do not parse when flushing http headers (ggml-org#24281)

42a0afd

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul fo…

1e1aca0

…r Q4/Q5/Q8 and k-quants (ggml-org#24225)

* ggml-webgpu: Improve prefill speeds + refactor matmul for quants
* Fixes for editorconfig checker

restore SYCL build and release, remove github cache

ff3e4f1

arthw force-pushed the fix_build branch from aa8f42a to ff3e4f1 Compare June 9, 2026 02:22

arthw added 6 commits June 9, 2026 16:56

modify for test only

e630ca4

verify the ccache is used

4ac5e7d

remove debug code change

0aef7d7

rm duplicate action, update key in ccache

002efcd

add action ccache-clear after building in both ubuntu and windows

81bc3d7

set %NUMBER_OF_PROCESSORS% in windows build

0e7ec65

arthw wants to merge 6297 commits into master from fix_build

arthw commented Jun 7, 2026

20 participants
