Add near-complete Codex VSCode Support, full OAI Responses bridge by michaelw9999 · Pull Request #3 · michaelw9999/llama.cpp

michaelw9999 · 2026-05-03T08:57:36Z

This brings in automatic compaction, web_search and file_search, and is super easy to configure, for example:

model = "qwen3.5-4B-NVFP4"
model_provider = "llamacpp"
personality = "friendly"
model_context_window = 128000
model_auto_compact_token_limit = 100000
model_supports_reasoning_summaries = true
model_reasoning_summary = "auto"
model_reasoning_effort = "medium"

[model_providers.llamacpp]
name = "Local llama.cpp"
model = "Qwen3.5-4B-NVFP4.gguf"
base_url = "http://192.168.50.50:43901/v1"
supports_websockets = false

[model_providers.llamacpp.http_headers]
X-Llama-Responses-Web-Search-Wrapper = "tvly"
X-Llama-Responses-File-Search-Wrapper = "rg"
X-Llama-Responses-Reasoning-Budget-Tokens = "minimal=2048,low=4096,medium=8192,high=16384,xhigh=32768"

For automatic compaction to work, you must set model_context_window and model_auto_compact_token_limit. Summary boxes and clickable diffs with the undo button usually need model_supports_reasoning_summaries = true and model_reasoning_summary = "auto".
Just install Tavily (its shell command is tvly) and rg, or any other preferred web search MCP or file search/locator tool; it will be wrapped through the shell and integrated more natively and intuitively. If they are left out, these tools are hidden from the model.
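As a rough illustration of the X-Llama-Responses-Reasoning-Budget-Tokens value format shown above, here is a minimal sketch of parsing a comma-separated list of effort=tokens pairs. The function names parse_reasoning_budgets and budget_for_effort are hypothetical, and the strict error handling is an assumption; the PR's actual parser is not shown here and may behave differently.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the PR): parse "minimal=2048,low=4096,..."
// into a map from effort name to token budget.
// Assumption: malformed entries are rejected with an exception.
std::map<std::string, int> parse_reasoning_budgets(const std::string & header) {
    std::map<std::string, int> budgets;
    std::istringstream in(header);
    std::string entry;
    while (std::getline(in, entry, ',')) {
        const auto eq = entry.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == entry.size()) {
            throw std::invalid_argument("expected name=tokens, got: " + entry);
        }
        // std::stoi throws std::invalid_argument on non-numeric input
        budgets[entry.substr(0, eq)] = std::stoi(entry.substr(eq + 1));
    }
    return budgets;
}

// Return the budget for the given effort, or -1 when that effort is not listed.
int budget_for_effort(const std::map<std::string, int> & budgets, const std::string & effort) {
    const auto it = budgets.find(effort);
    return it == budgets.end() ? -1 : it->second;
}
```

With the header value from the config above and model_reasoning_effort = "medium", a lookup like this would select 8192 tokens.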

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d72b0819db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-03T09:03:12Z

+                    {"sequence_number", seq_num++},
+                    {"output_index",    output_idx++},
                    {"item", json {
+                        {"id",        oai_resp_fc_item_id},


Emit a fresh function-call item id for each added tool call

server_task_result_cmpl_partial::update() only assigns state.oai_resp_fc_item_id after snapshotting state into the chunk fields, so to_json_oaicompat_resp() can emit response.output_item.added with {"id": oai_resp_fc_item_id} from the previous value (often empty on the first streamed tool call). This makes streamed response.function_call_arguments.delta.item_id/final output_item.done.item.id inconsistent with the announced item, which breaks clients that stitch function-call argument deltas by item_id.
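The ordering problem described above can be sketched in isolation. The types stream_state and stream_chunk and the helper make_fc_item_id are simplified, hypothetical stand-ins, not the server's real structures; only the field name oai_resp_fc_item_id comes from the review. The point is just that the id has to be assigned before the state is copied into the chunk.

```cpp
#include <string>

// Hypothetical, simplified stand-ins for the server's streaming state and chunk.
struct stream_state {
    std::string oai_resp_fc_item_id; // id of the tool call currently being streamed
    int         n_tool_calls = 0;
};

struct stream_chunk {
    std::string oai_resp_fc_item_id; // snapshot that serialization would read
};

// Hypothetical id scheme, for illustration only.
static std::string make_fc_item_id(int index) {
    return "fc_" + std::to_string(index);
}

// Problematic order: snapshot first, assign afterwards.
// The chunk carries the previous id (empty on the first tool call).
stream_chunk update_snapshot_first(stream_state & state) {
    stream_chunk chunk;
    chunk.oai_resp_fc_item_id = state.oai_resp_fc_item_id;
    state.oai_resp_fc_item_id = make_fc_item_id(state.n_tool_calls++);
    return chunk;
}

// Suggested order: assign a fresh id, then snapshot.
// The announced item and later deltas share the same id.
stream_chunk update_assign_first(stream_state & state) {
    state.oai_resp_fc_item_id = make_fc_item_id(state.n_tool_calls++);
    stream_chunk chunk;
    chunk.oai_resp_fc_item_id = state.oai_resp_fc_item_id;
    return chunk;
}
```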

chatgpt-codex-connector · 2026-05-03T09:03:12Z

+    if (checkpoints.empty()) {
+        return true;
+    }


Remove stale checkpoint sidecar when no checkpoints exist

When checkpoints is empty, slot_checkpoints_save() returns without touching <filepath>.checkpoints, so reusing the same save filename can leave an old sidecar file behind. A later restore will then load stale checkpoint metadata for a different KV snapshot, which can trigger invalid recurrent-state restore attempts or unnecessary full prompt reprocessing.
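One possible shape for the suggested fix, as a standalone sketch: when there is nothing to save, delete any existing sidecar instead of returning early. The function save_checkpoints_sketch, the slot_checkpoint fields, and the text serialization are all hypothetical; only the `<filepath>.checkpoints` naming comes from the review comment.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical stand-in for a checkpoint record.
struct slot_checkpoint {
    int pos_min;
    int pos_max;
};

// Sketch: write checkpoints to "<filepath>.checkpoints", or remove a stale
// sidecar when there are none so a later restore cannot pick it up.
bool save_checkpoints_sketch(const std::string & filepath,
                             const std::vector<slot_checkpoint> & checkpoints) {
    const std::filesystem::path sidecar = filepath + ".checkpoints";
    if (checkpoints.empty()) {
        std::error_code ec;
        // remove() reports no error when the file does not exist
        std::filesystem::remove(sidecar, ec);
        return !ec;
    }
    std::ofstream out(sidecar, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    for (const auto & c : checkpoints) {
        out << c.pos_min << ' ' << c.pos_max << '\n'; // placeholder format
    }
    return out.good();
}
```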

…gml-org#23869) * spec: add speed-bench support for benchmarking * speed-bench : add trailing newline to requirements.txt * speed-bench : bump datasets to 4.8.0 to fix ty check * server-bench : remove now-unused type: ignore after datasets bump

…23868) After ggml-org#23007 reclassified integrated CUDA/HIP devices as IGPU, the device selection logic dropped the local iGPU whenever any RPC server was added, because RPC devices made `model->devices` non-empty. On systems where the "iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified memory), this caused all tensors to be allocated on the RPC peer alone and model loading to fail. Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer suppress the local iGPU. closes: ggml-org#23858
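The gating change described in this commit message can be illustrated with a small sketch. The types dev_type and device and the function select_devices_sketch are hypothetical simplifications, not llama.cpp's actual device API, and the ordering of the result is an assumption. The key line is the `gpus.empty()` check: testing whether the overall selection is empty instead would drop the iGPU as soon as an RPC peer is present.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified device model for illustration.
enum class dev_type { GPU, IGPU, RPC };

struct device {
    std::string name;
    dev_type    type;
};

// Sketch: RPC peers and discrete GPUs are always selected; integrated GPUs
// are added only when no discrete GPU exists. RPC devices do not count
// toward that check, so they cannot suppress the local iGPU.
std::vector<device> select_devices_sketch(const std::vector<device> & available) {
    std::vector<device> rpc, gpus, igpus;
    for (const auto & d : available) {
        switch (d.type) {
            case dev_type::RPC:  rpc.push_back(d);   break;
            case dev_type::GPU:  gpus.push_back(d);  break;
            case dev_type::IGPU: igpus.push_back(d); break;
        }
    }
    std::vector<device> selected = rpc;
    selected.insert(selected.end(), gpus.begin(), gpus.end());
    if (gpus.empty()) { // gate on discrete GPUs only, not on `selected`
        selected.insert(selected.end(), igpus.begin(), igpus.end());
    }
    return selected;
}
```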

…#23895) * ci : ios use macos-15 again * ci : add and test ccache-clear * cont : fix * cont : set permission * cont : another permission * cont : token * cont : print key * cont : bring back perms * cont : test windows * cont : add token * cont : cleanup * ci : make release jobs clean-up their ccache

…3420) * vulkan: add flash attention bf16 kv support * vulkan: bf16 FA coopmat1 support * vulkan: bf16 FA coopmat2 support * fix FA bf16 f32 fallback * fix FA bf16 coopmat1 shader * fix FA bf16 coopmat2 shader * code cleanup * cleanup comment change * address feedback * add O_TYPE for cm2 FA * use O_TYPE for gqaStore function * reduce BFLOAT16 ifdefs

* loongarch : optimize LSX fp16 load/store with native intrinsics Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in __lsx_f16x4_load and __lsx_f16x4_store. * loongarch : add LSX implementation for q8_0 dot product * loongarch : add LSX implementation for q6_K dot product * loongarch : add LSX implementation for iq4_xs dot product * Improve reduce ops when sun int16 pairs to int32

* webui: add custom CSS injection via config register a customCSS setting in the Developer section under Custom JSON, syncable so it rides the existing ui-config pass through. inject the value into a single style element in the head, reactive on the setting. lets an operator theme a prebuilt binary through --ui-config without rebuilding, and lets a user set it from the settings panel. * ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field * ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head * ui: inject custom CSS through a svelte action instead of a bound element move the textContent write into a use: action on the head style node. the action is the idiomatic way to touch a node, so the no-dom-manipulating lint rule is satisfied without a disable. value stays text through textContent, never parsed as HTML. * Update tools/ui/src/lib/constants/settings-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: address review from @allozaur, rename custom config key to customJson with migration rename the custom config key to customJson across the type, the chat request builder, the settings save check and the custom tools reader, keeping the custom API param name unchanged. add a non destructive migration that copies the legacy custom key to customJson at startup. only render the head style tag when custom CSS is set. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* docs zendnn added information about Q8 support * docs zendnn rm unnecessary data * docs update, links to ZenDNN docs provided * docs zenDNN update: clarified explanation * docs zenDNN update: one more explanation clarified --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>

* remove redundant apple job openvino gpu and cpu test can share the same build and machine Update build-rpc.yml Update build-openvino.yml cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm remove duplicate x86 vulkan combine backend sampling Update server.yml run server on arm as windows is x86 * emdawn on one machine only * fix openvino, remove cpu tag as we dont have many x64 machines with that tag

) Fixes: ggml-org#23927 (comment) The cpu-x64-high-perf job was missing the Linux label in its runs-on specification, causing the runner to not be discovered. All other self-hosted Linux jobs include this label. Assisted-by: llama.cpp:local pi

* ui: add svg block visualizer based on allozaur's mermaid PR * ui: rationalise diagram block styling and pre transforms shared by mermaid and svg * ui: live render streaming svg blocks * ui: also render svg authored in xml code fences * ui: refactor svg block rendering, address review from allozaur - Move the svg size ceiling and DOMPurify config out of sanitize-svg.ts into /constants. - Rename the svg-diagram class to svg-block so the name no longer implies diagrams only. - Replace the svg, xml and svg tag magic strings in the markdown pipeline with shared constants. - Promote the data-svg-rendered marker and its sibling data attributes to constants. * ui: render svg blocks in a shadow root for animation and live zoom Mount each sanitized svg inside an open shadow root so author <style> and keyframe or smil animations run while staying scoped to the host element. Relax the sanitizer to forbid only foreignObject and script, which lets animation, href and external resource refs through for wider compatibility. Render the inline block and the zoom dialog from the same reactive source, so a streaming svg keeps drawing live inside the open zoom popup.

* [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning * [SYCL] Remove per-allocation getenv() for GGML_SYCL_ENABLE_LEVEL_ZERO * bring back the comment * move it up to make sure devices call the shots * move the env detection early * replace g_ggml_sycl_enable_level_zero with a direct call to .ext_oneapi_level_zero * update the comment * switch back to g_ggml_sycl_enable_level_zero with a sentinel * remove the check * Reduce the diff * reword, move lower * move things aroudn * remove forward declaration if favor of a full replace * pre-cache results of zeDeviceGetProperties * put ggml_sycl_get_env back * replace get_sycl_env with ggml_sycl_get_env * add whitespace back * Apply suggestion from @sanmai

* chat: harden peg-native tool call parsing accept an optional leading type: function field in build_json_tools_flat_keys so openai style tool calls parse on templates whose serialization opens on the name field. return a clean error and log the unparsed fragment on a final peg parse failure instead of throwing the raw parser position and input. keep the raw arguments string in func_args_not_string when it is not valid json instead of aborting the prompt render. * chat: surface peg-native parse failures a final peg parse failure threw the raw parser position and input. log the unparsed fragment and raise a clearer error instead, so a model output that does not match the expected format no longer fails silently with an empty assistant turn. minimal change, no behavior change on successful parses. * chat: handle openai style tool calls in peg-native * nits * common: scope OpenAI wrapper grammar trigger via autoparser flag * chat: gate type:function parsing leniency on the analysis flag Thread accept_openai_wrapper from the generator to build_json_tools_flat_keys so the leading "type": "function" field is accepted only when openai_wrapper_trigger is set.

…org#22219) * docs: Add instructions to install `llama.cpp` from conda-forge Signed-off-by: Julien Jerphanion <git@jjerphan.xyz> * Rewording of instructions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* vulkan: add GGML_OP_COL2IM_1D, follow-up to the CPU op * vulkan: col2im_1d bounded gather loop instead of full-K scan with modulo * vulkan: col2im_1d address review from @jeffbolznv * vulkan: col2im_1d return nullptr for unsupported types, address review from @0cc4m

* Add -cl-fp32-correctly-rounded-divide-sqrt to F16=ON builds Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Make GGML_SYCL_F16=ON the default Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Leave F32 the default F16 remains explictly set for example and Dockerfile builds. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Revert changes to examples/sycl/build scripts Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* sycl: support reordered Q4_K and Q5_K MoE MUL_MAT_ID Extend reordered-weight handling to fused MoE MUL_MAT_ID for Q4_K and Q5_K expert tensors and add Q5_K reordered DMMV coverage. Unsupported 3D reorder cases now fall back instead of aborting. * sycl: extend MoE reorder to Q6_K mul_mat_id

… position (ggml-org#24536) * spec: add spec metrics mean acceptance length and acceptance per pos * fix as suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix as suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix as suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix as suggestions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Move post-GEMM MUL required for dequant b4 lora and bias add see ggml-org#23484 : 1. For lora, I would presume we want fully dequantized values before doing the residuals, but this depends on how the LORAs were generated. Literature tells me LORA happens post-mul but pre-bias add ggml-org#8332 2. For ModelOPT, bias-add should happen on [fully-dequantized values](https://github.com/NVIDIA/Model-Optimizer/blob/b49f9b9e2d747af992d78a3aa7f10efe5a8847e1/modelopt/torch/quantization/backends/nvfp4_gemm.py#L59-L64) * Restrict build_ffn for NVFP4 to supported combinations

* ui: add source toggle to mermaid and svg blocks Add a toggle button next to copy and preview that switches a rendered mermaid or svg block to its source code and back. The button is shared by both block types and the rendered view stays the default. The source view reuses the code block scroll container and the highlighted code element captured at transform time, so it matches the app code blocks without highlighting again. Make tall diagrams scroll like text code blocks: safe centering keeps the diagram centered when it fits and falls back to start alignment when it overflows, so the top stays reachable instead of clipping above. Keep the block header opaque and layered above the scrolled diagram, and ignore header clicks in the zoom handler, so a button click never falls through to the zoom dialog. * ui: transparent diagram block header, address review from @allozaur

chatgpt-codex-connector Bot reviewed May 3, 2026

ruixiang63 and others added 29 commits May 29, 2026 23:09

ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)

b22da25

* Add q8_0 and q4_0 set_rows * Add fast(er) quantization set_rows path * formatting/naming * a little more naming * Remove unused constant * Don't override other override * Avoid bitcast * Narrow relaxation

ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)

151f3a9

server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)

0821c5f

* server: in SSE mode, send HTTP headers when slot starts * ref to pr * stream should be false by default

ci : fix s390x release job (ggml-org#23898)

3375285

* ci : fix s390x release job * ci : multi-thread build for `ios-xcode` * ocd : names

ci : update ios-xcode release job to macos-26 (ggml-org#23906)

4c4e91b

* ci : disable libcommon build from xcframework * ocd : fix name * ci : ios-xcode change to macos-26 * cont : pin xcode * cont : pin xcode to minor version

test: (test-llama-archs) log the config name first (ggml-org#23885)

e674b12

metal : restore im2col implementation for large kernels (ggml-org#23901)

2d9b7c8

TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)

8b0e0db

* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs * fix afmoe TP

ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)

d38d50e

opencl: support bf16 by converting to f16 (ggml-org#23839)

d6588da

Support -fa auto in llama-bench (ggml-org#23714)

aa46bda

* Support `-fa auto` in llama-bench Make the default value of `-ngl` -1, similar to other tools. Update README with latest usage and examples * Address review comments

llama: only use one iGPU device by default (ggml-org#23897)

22cadc1

ui: fix ETag truncation with MSVC compiler (ggml-org#23917)

3292da0

vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)

d4c8e2c

* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * lowercase defaults to true * type fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : limit trigger paths for the CPU workflow (ggml-org#23938)

399739d

server : handle If-None-Match weak ETags (ggml-org#23916)

6f165c1

sycl : Optimize Q3_K mul_mat by reorder (ggml-org#23725)

44e211c

[SYCL] Add more types in GET_ROWS OP (ggml-org#23710)

4162522

* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP * correct the link

[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)

a511424

* support Q4_1, Q5_0, Q5_1 * update ut case

common : support manually triggering the reasoning budget end sequence (ggml-org#23949)

5254a79

ServeurpersoCom and others added 29 commits June 15, 2026 08:11

chat: fix whitespace problems once and for all (ggml-org#24624)

a6dff71

* chat: fix whitespace problems once and for all * Purge trailing spaces from grammar generation * Revert "Purge trailing spaces from grammar generation" This reverts commit b0827ec.

metal : add repeat bf16 (ggml-org#24638)

272088b

[SYCL] add to support pool_1d, move pool_1d/2d code to pool.cpp/hpp (ggml-org#24584)

987fbd8

* add to support pool_1d, move pool_1d/2d code to pool.cpp/hpp * update ops.md

sycl : enhance set_rows to support q1_0, mxfp4, nvfp4 (ggml-org#24564)

8872ab5

sycl : fix reorder function; add fp32/fp16 in build script (ggml-org#24578)

72be44f

sycl: fix soft_max_f32 max reduction (ggml-org#24451)

d8a3f52

SYCL: use native subgroup size for K-quant DMMV (ggml-org#21700)

e3bb1ad

wasm : fix fallback symbol collision (ggml-org#24639)

6eab471

vulkan: support more CONCAT types (ggml-org#24579)

9dbc662

mtmd : add post-decode callback (ggml-org#24645)

e3cab40

Assisted-by: pi:llama.cpp/Qwen3.6-27B

chat: fix an "oldie but goodie" grammar generator bug that surfaced during last changes (ggml-org#24653)

0ae3f45

* chat: fix an "oldie but goodie" grammar generator bug that surfaced during last changes * update erroneous case in PEG parser test

chat: include full unparsed prompt in debug (ggml-org#24650)

38d5463

message on parse error

mtmd: fix miscounting n_tokens (ggml-org#24656)

e36a602

chat : fix LFM2 tool-call parsing double-escaping (ggml-org#24667)

7dad2f1

* Add escape test cases * chat : fix LFM2 tool-call parsing double-escaping

[SYCL] Support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND (ggml-org#24363)

fdd1098

* support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND * fix conflict * rebase, support new UT case of repeat, concat

bench : add --offline (ggml-org#24511)

e3a74b2

* bench : add --offline Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add default Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>

vulkan: Support gated_delta_net with S_v=16 (ggml-org#24581)

d5fb104

vulkan: prefer host-visible memory buffers on UMA devices (ggml-org#22930)

32120c1

* implement UMA host-visible memory * update based on 0cc4m's suggestion

spec: add backend sampling support for eagle3 (ggml-org#24655)

a182490

krystophny force-pushed the master branch from 43fb8c0 to c1304d7 Compare June 16, 2026 16:41

michaelw9999 wants to merge 678 commits into michaelw9999:full-openai-responses from krystophny:master

20 participants
