* Refactored Compressed Tensors NVFP4 support for new base.py * Support compressed-tensors NVFP4 conversion * Moved Qwen MTP remap into filter_tensors * simplify * pathlib no longer used --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
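For reference, the transform named in the commit above can be written as a short in-place butterfly. This is a plain-Python sketch of the standard unnormalized fast Walsh-Hadamard transform, not the CUDA kernel from the commit; the function name `fwht` is chosen here for illustration.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of a sequence.

    The length must be a power of two. This is a reference sketch only;
    it says nothing about how the CUDA kernel is organized.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # One butterfly pass: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

Applying the unnormalized transform twice returns the input scaled by its length, which makes a convenient self-check for any faster implementation.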
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks the backend about the declared op, so it tested an elementwise MUL on a q8_0 weight. That used to return true unconditionally and the weight stayed on GPU by luck. Once supports_op told the truth, the probe got a no and the loader pushed the weight and its matmul to CPU, splitting the graph. Tagging it MUL_MAT asks the real question; the math is unchanged. Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
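The placement problem described above can be shown with a toy model. Everything here is hypothetical: the `SUPPORTED` table, `supports_op`, and `pick_buffer` are stand-ins invented for this sketch and do not mirror the real loader or backend code.

```python
# Hypothetical capability table: (op, weight type) pairs the GPU backend accepts.
SUPPORTED = {("MUL_MAT", "q8_0"), ("MUL_MAT", "f16"), ("MUL", "f32")}


def supports_op(op, weight_type):
    """Toy stand-in for a backend capability query that answers truthfully."""
    return (op, weight_type) in SUPPORTED


def pick_buffer(declared_op, weight_type):
    """Keep the weight on the GPU only if the backend supports the declared op."""
    return "GPU" if supports_op(declared_op, weight_type) else "CPU"


# The weight is really consumed by a matmul, but the probe only sees the declared op.
wrong_tag = pick_buffer("MUL", "q8_0")      # asks about an elementwise multiply
right_tag = pick_buffer("MUL_MAT", "q8_0")  # asks about the op that actually runs
```

In this toy model the mis-tagged weight lands on the CPU while the correctly tagged one stays on the GPU, even though the computation itself never changes.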
* ci : disable SYCL f16 builds * ci : extract android and hip into separate workflows * ci : move webgpu to separate workflow * ci : move the rpc to a separate workflow * ci : extract s390x and ppcl jobs * ci : extract opencl job into a separate workflow
Co-authored-by: lvyichen <lvyichen@stepfun.com>
…ing GPU profiling (ggml-org#23457) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml
…L_MAT pipeline (ggml-org#23594) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
* initial talkie support, coherent * reorder to follow convention * absorb inverse rope * stop folding scalars to improve quantization * use broadcasting instead of duplication * style cleanup * add scaling support to LoraTorchTensor; use that path in conversion * use layer_out_scale instead of embd_skip_scale
Create a pool of N threads that grab a chunk of up to 100 tests at a time to iterate through. The number of tests at a time decreases as fewer remain. Each thread uses its own dev and cpu backend, and set_n_threads_fn is not called on the cpu backend. Fix some TSAN issues that arose: - In init_tensor_uniform, don't use static vector of generators. - Replace gmtime with versions that don't use a global variable. - Mutex calls to print_test_result.
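The chunked work-claiming scheme described above might look roughly like the following. This is a sketch under assumptions: the sizing rule (remaining work divided by thread count, capped at 100) is a guess at "decreases as fewer remain", and `run_tests_in_pool`, `claim`, and `run_one` are names invented here. The real change is in the C++ test harness.

```python
import threading


def run_tests_in_pool(tests, n_threads, run_one, max_chunk=100):
    """Run every test once; workers claim shrinking chunks from a shared cursor."""
    lock = threading.Lock()
    next_index = 0
    chunk_sizes = []  # recorded in claim order, for inspection only

    def claim():
        nonlocal next_index
        with lock:
            remaining = len(tests) - next_index
            if remaining <= 0:
                return None
            # Assumed rule: share what is left across the workers, capped at max_chunk.
            size = max(1, min(max_chunk, remaining // n_threads))
            start = next_index
            next_index += size
            chunk_sizes.append(size)
            return start, start + size

    def worker():
        # Each worker would own its device and CPU backend in the real harness.
        while (span := claim()) is not None:
            for i in range(*span):
                run_one(tests[i])

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return chunk_sizes
```

Large early chunks keep contention on the lock low, while small late chunks stop one thread from holding the last big batch as the others sit idle.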
* SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
* convert : support Gemma4ForCausalLM architecture (ggml-org#23674) * fix indent --------- Co-authored-by: Oleg Afonin <your.email@example.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ci : reduce [no ci] * cont : disable sycl, cann + rename caches [no ci] * cont : cann [no ci]
* hexagon: add support for CONCAT with optimized concat_2d_transposed qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reorder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr. This causes runtime divs by that value which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+) rows. Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc). Use the helpers to optimize ROPE.
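Several items above replace runtime integer division with "fastdiv". One common way to do that is the multiply-and-shift scheme sketched below, where a multiplier and a shift are precomputed once per divisor. This is an illustration of the general technique for 32-bit unsigned values. The Hexagon implementation may use a different formulation, and the names `init_fastdiv` and `fastdiv` are assumptions.

```python
def init_fastdiv(d):
    """Precompute (multiplier, shift) for dividing 32-bit unsigned values by d."""
    if not 0 < d < 2**32:
        raise ValueError("divisor must fit in 32 bits and be non-zero")
    shift = 0
    while shift < 32 and (1 << shift) < d:
        shift += 1  # smallest shift with 2**shift >= d
    multiplier = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return multiplier, shift


def fastdiv(n, multiplier, shift):
    """Compute n // d using one multiply, one add and one shift."""
    high = (n * multiplier) >> 32  # upper 32 bits of the 64-bit product
    return (high + n) >> shift


def fastmod(n, d, multiplier, shift):
    """Compute n % d from the fast quotient."""
    return n - fastdiv(n, multiplier, shift) * d
```

The precomputation costs one real division per divisor, so the scheme pays off when the same divisor (for example a row length or type size) is reused across an inner loop.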
* ci : remove vulkan dep from webgpu build * cont : add ccache to `ubuntu-24-webgpu-wasm` * ci : fix name + add wasm test
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
* ci : skip release workflow on master when commit message contains [no release] Assisted-by: llama.cpp:local pi * ci : restrict sanitizer builds to x86_64 + fix build type the spark is apparently too slow for some reason * tests : fix undefined warning [no ci]
…#23734) * ci : move [no release] check to dedicated check_release job Move the workflow-level `if` condition that skips builds when the commit message contains `[no release]` into a lightweight `check_release` job. All build jobs now depend on it via `needs` and check its output. This ensures the skip logic is evaluated at the job level rather than at the workflow level, which is the recommended approach for conditional jobs. Assisted-by: llama.cpp:local pi * cont : use `fast` runner
* ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
When llama-server is started with SSL key and cert, the log says that it listens on http instead of https. This patch fixes this.
Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers. Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
ggml-org#23767) Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>
…l-org#23763) * ci : fix undefined sanitizer build to use Debug build type only * ci : ccache the server builds * cont : remove ui dependency + reuse ccache for both ubuntu jobs * tmp : force ccache save * Revert "tmp : force ccache save" This reverts commit a857b03. * cont : no need for node.js
…4267) A fitted target context can end up smaller than the draft default; the oversized assistant views then overflow the shared K/V tensors and trip the ggml_view_4d size assert during graph reserve.
Mistral explicitly sets `moe` and `llama_4_scaling` to `null` in params.json, breaking `key in dict` checks during conversion. Replace with `dict.get(key) is not None` where this matters. Fixes `convert-hf-to-gguf.py --mistral-format Mistral-Medium-3.5-128B`
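The difference between the two checks is easy to reproduce. The dictionary below is a made-up stand-in for a parsed params.json (only the two key names come from the commit message), and `has_value` is a helper named here for illustration.

```python
# Hypothetical parsed params.json: JSON null becomes Python None.
params = {"dim": 4096, "moe": None, "llama_4_scaling": None}


def has_value(config, key):
    """True only when the key is present and not explicitly null."""
    return config.get(key) is not None


key_present = "moe" in params          # True: the key exists, even though it is null
value_present = has_value(params, "moe")  # False: null is treated as absent
```

A membership test answers "was the key written?", while the `get` form answers "is there something to use?", which is what the conversion logic needs.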
* common : relax sampler name matching Currently, in some cases, the alternative names for samplers (like `top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are not always recognized by the `common_sampler_types_from_names` function in `common/sampling.cpp`. This PR changes the signature of this function to remove the `bool allow_alt_names` flag, and removes all occurrences of the flag from call sites. Therefore, the function will now always match all known names. I also changed the logic of the function to unconditionally check the provided sampler names against both the canonical and alternative names, and to be case-insensitive. This fixes an issue I was seeing wherein samplers specified in the `llama-server` UI were not recognized as valid when the alternative names were used. * add more alt names * cont. fix * cast to unsigned char for correctness * common : unify sampler name mapping * annotate canonical vs. alt sampler name mappings per @CISC * Update common/sampling.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * common : auto-generate sampler name aliases per @ngxson * use merged map for matching * use `.merge` instead of iterating * nit: simplify comment * nit: use insert everywhere, not index assignment --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
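The matching behaviour described above (one merged map, auto-generated hyphen aliases, case-insensitive lookup) can be sketched as follows. This is a Python illustration of the idea, not the C++ code in `common/sampling.cpp`. The name lists are a hypothetical subset, and dropping unknown names silently is an assumption of this sketch.

```python
CANONICAL = ["top_k", "top_p", "min_p", "temperature"]       # hypothetical subset
EXTRA_ALIASES = {"temp": "temperature", "nucleus": "top_p"}  # hypothetical hand-written aliases


def build_name_map():
    """Merge canonical names, generated hyphen aliases and extra aliases into one map."""
    name_map = {name: name for name in CANONICAL}
    for name in CANONICAL:
        # Auto-generate the alternative spelling: top_k -> top-k.
        name_map.setdefault(name.replace("_", "-"), name)
    for alias, name in EXTRA_ALIASES.items():
        name_map.setdefault(alias, name)  # never override an existing entry
    return name_map


def sampler_types_from_names(names):
    """Resolve user-supplied names to canonical ones, ignoring case."""
    name_map = build_name_map()
    resolved = []
    for raw in names:
        key = raw.lower()
        if key in name_map:
            resolved.append(name_map[key])
    return resolved
```

For example, `sampler_types_from_names(["Top-K", "MIN_P", "bogus", "temp"])` resolves the three known spellings to their canonical names and skips the unknown one.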
* update compute runtime from 25 to 26 in docker * add comment with old driver for multiple GPUs
* cuda: reset device in get_memory function if no backend is active * also count device and host buffers * exclude hip and musa from counting and device reset * use device mutex instead of atomic * undo backend_free function move
…#23991) This allows vec4 loads of the B elements. Also increase BK to 64 when this is enabled. Neither of these alone is consistently faster, but together these give a nice speedup. In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are multiples of 4.
* wip * ok: lazy bitmap API * remember to free lazy text * wip * add mtmd_helper_video * support video input on server (base64 input) * add MTMD_VIDEO config * add timestamp * update CLI * cli: allow auto-completion for video * add --video arg * fix build * update docs * rename as suggested
ggml-org#24044) * Only run webgpu CI on my fork * Add webgpu only workflow * Implement 2d workgroups for more operations * fix * Fix type * Move back to global_invocation_id
…ator (ggml-org#24000) * Only run webgpu CI on my fork * Add webgpu only workflow * handle buffer overlap case for concat operator * restore build-webgpu.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Run clang-format * Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Reese Levine <reeselevine1@gmail.com>
A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache empty, so its kq_mask buffer stays null and asserts at load. Guard each mask on its own buffer in set_input and can_reuse, base and swa. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* models: update converter to support smaller assistants * models: add masked_embd tensors to gemma4-assist arch * gemma-4: remove temp debug for conversion * gemma-4-mtp: filter out masked_embedding tensors during conversion
…r Q4/Q5/Q8 and k-quants (ggml-org#24225) * ggml-webgpu: Improve prefill speeds + refactor matmul for quants * Fixes for editorconfig checker