Skip to content

Add near-complete Codex VSCode Support, full OAI Responses bridge#3

Open
michaelw9999 wants to merge 678 commits into
michaelw9999:full-openai-responsesfrom
krystophny:master
Open

Add near-complete Codex VSCode Support, full OAI Responses bridge#3
michaelw9999 wants to merge 678 commits into
michaelw9999:full-openai-responsesfrom
krystophny:master

Conversation

@michaelw9999

Copy link
Copy Markdown
Owner

Things brings in automatic compaction, web_search and file_search and is super easy to configure, for example:

model = "qwen3.5-4B-NVFP4"
model_provider = "llamacpp"
personality = "friendly"
model_context_window = 128000
model_auto_compact_token_limit = 100000
model_supports_reasoning_summaries = true
model_reasoning_summary = "auto"
model_reasoning_effort = "medium"

[model_providers.llamacpp]
name = "Local llama.cpp"
model = "Qwen3.5-4B-NVFP4.gguf"
base_url = "http://192.168.50.50:43901/v1"
supports_websockets = false

[model_providers.llamacpp.http_headers]
X-Llama-Responses-Web-Search-Wrapper = "tvly"
X-Llama-Responses-File-Search-Wrapper = "rg"
X-Llama-Responses-Reasoning-Budget-Tokens = "minimal=2048,low=4096,medium=8192,high=16384,xhigh=32768"

For the automatic compaction to work, you must set model_context_window and model_auto_compact_token_limit. Summary boxes and clickable diffs with the undo button ususally need model_supports_reasoning_summaries = true and model_reasoning_summary = "auto".
Just install tavily (but shell command is tvly) and rg or any other preferred web search MCP or file search/locator tool, it will wrap it through the shell and integrate it more natively and intuitively. If left out, it will hide these tools from the model.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d72b0819db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tools/server/server-task.cpp Outdated
{"sequence_number", seq_num++},
{"output_index", output_idx++},
{"item", json {
{"id", oai_resp_fc_item_id},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Emit a fresh function-call item id for each added tool call

server_task_result_cmpl_partial::update() only assigns state.oai_resp_fc_item_id after snapshotting state into the chunk fields, so to_json_oaicompat_resp() can emit response.output_item.added with {"id": oai_resp_fc_item_id} from the previous value (often empty on the first streamed tool call). This makes streamed response.function_call_arguments.delta.item_id/final output_item.done.item.id inconsistent with the announced item, which breaks clients that stitch function-call argument deltas by item_id.

Useful? React with 👍 / 👎.

Comment thread tools/server/server-context.cpp Outdated
Comment on lines +487 to +489
if (checkpoints.empty()) {
return true;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove stale checkpoint sidecar when no checkpoints exist

When checkpoints is empty, slot_checkpoints_save() returns without touching <filepath>.checkpoints, so reusing the same save filename can leave an old sidecar file behind. A later restore will then load stale checkpoint metadata for a different KV snapshot, which can trigger invalid recurrent-state restore attempts or unnecessary full prompt reprocessing.

Useful? React with 👍 / 👎.

ruixiang63 and others added 29 commits May 29, 2026 23:09
…gml-org#23869)

* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
…23868)

After ggml-org#23007 reclassified integrated CUDA/HIP devices as IGPU, the device
selection logic dropped the local iGPU whenever any RPC server was added,
because RPC devices made `model->devices` non-empty. On systems where the
"iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified
memory), this caused all tensors to be allocated on the RPC peer alone and
model loading to fail.

Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer
suppress the local iGPU.

closes: ggml-org#23858
…#23895)

* ci : ios use macos-15 again

* ci : add and test ccache-clear

* cont : fix

* cont : set permission

* cont : another permission

* cont : token

* cont : print key

* cont : bring back perms

* cont : test windows

* cont : add token

* cont : cleanup

* ci : make release jobs clean-up their ccache
* ci : fix s390x release job

* ci : multi-thread build for `ios-xcode`

* ocd : names
…3420)

* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
* ci : disable libcommon build from xcframework

* ocd : fix name

* ci : ios-xcode change to macos-26

* cont : pin xcode

* cont : pin xcode to minor version
* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs

* fix afmoe TP
* Support `-fa auto` in llama-bench

Make the default value of `-ngl` -1, similar to other tools.

Update README with latest usage and examples

* Address review comments
* webui: add custom CSS injection via config

register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.

* ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field

* ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head

* ui: inject custom CSS through a svelte action instead of a bound element

move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.

* Update tools/ui/src/lib/constants/settings-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: address review from @allozaur, rename custom config key to customJson with migration

rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* docs zendnn added information about Q8 support

* docs zendnn rm unnecessary data

* docs update, links to ZenDNN docs provided

* docs zenDNN update: clarified explanation

* docs zenDNN update: one more explanation clarified

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
…g#18756)

* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* lowercase defaults to true

* type fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* remove redundant apple job

openvino gpu and cpu test can share the same build and machine

Update build-rpc.yml

Update build-openvino.yml

cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm

remove duplicate x86 vulkan

combine backend sampling

Update server.yml

run server on arm as windows is x86

* emdawn on one machine only

* fix openvino, remove cpu tag as we dont have many x64 machines with that tag
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP

* correct the link
* support Q4_1, Q5_0, Q5_1

* update ut case
)

Fixes: ggml-org#23927 (comment)

The cpu-x64-high-perf job was missing the Linux label in its runs-on
specification, causing the runner to not be discovered. All other
self-hosted Linux jobs include this label.

Assisted-by: llama.cpp:local pi
ServeurpersoCom and others added 29 commits June 15, 2026 08:11
* ui: add svg block visualizer based on allozaur's mermaid PR

* ui: rationalise diagram block styling and pre transforms shared by mermaid and svg

* ui: live render streaming svg blocks

* ui: also render svg authored in xml code fences

* ui: refactor svg block rendering, address review from allozaur

- Move the svg size ceiling and DOMPurify config out of sanitize-svg.ts into /constants.
- Rename the svg-diagram class to svg-block so the name no longer implies diagrams only.
- Replace the svg, xml and svg tag magic strings in the markdown pipeline with shared constants.
- Promote the data-svg-rendered marker and its sibling data attributes to constants.

* ui: render svg blocks in a shadow root for animation and live zoom

Mount each sanitized svg inside an open shadow root so author <style> and
keyframe or smil animations run while staying scoped to the host element.
Relax the sanitizer to forbid only foreignObject and script, which lets
animation, href and external resource refs through for wider compatibility.
Render the inline block and the zoom dialog from the same reactive source,
so a streaming svg keeps drawing live inside the open zoom popup.
* chat: fix whitespace problems once and for all

* Purge trailing spaces from grammar generation

* Revert "Purge trailing spaces from grammar generation"

This reverts commit b0827ec.
* [SYCL] Centralize Level Zero detection in ggml_sycl_init

* use the same wording

* get back the warning

* [SYCL] Remove per-allocation getenv() for GGML_SYCL_ENABLE_LEVEL_ZERO

* bring back the comment

* move it up to make sure devices call the shots

* move the env detection early

* replace g_ggml_sycl_enable_level_zero with a direct call to .ext_oneapi_level_zero

* update the comment

* switch back to g_ggml_sycl_enable_level_zero with a sentinel

* remove the check

* Reduce the diff

* reword, move lower

* move things aroudn

* remove forward declaration if favor of a full replace

* pre-cache results of zeDeviceGetProperties

* put ggml_sycl_get_env back

* replace get_sycl_env with ggml_sycl_get_env

* add whitespace back

* Apply suggestion from @sanmai
…gml-org#24584)

* add to support pool_1d, move pool_1d/2d code to pool.cpp/hpp

* update ops.md
Assisted-by: pi:llama.cpp/Qwen3.6-27B
…uring last changes (ggml-org#24653)

* chat: fix an "oldie but goodie" grammar generator bug that surfaced during last changes

* update erroneous case in PEG parser test
* chat: harden peg-native tool call parsing

accept an optional leading type: function field in
build_json_tools_flat_keys so openai style tool calls parse on
templates whose serialization opens on the name field.

return a clean error and log the unparsed fragment on a final peg
parse failure instead of throwing the raw parser position and input.

keep the raw arguments string in func_args_not_string when it is not
valid json instead of aborting the prompt render.

* chat: surface peg-native parse failures

a final peg parse failure threw the raw parser position and input. log
the unparsed fragment and raise a clearer error instead, so a model
output that does not match the expected format no longer fails silently
with an empty assistant turn.

minimal change, no behavior change on successful parses.

* chat: handle openai style tool calls in peg-native

* nits

* common: scope OpenAI wrapper grammar trigger via autoparser flag

* chat: gate type:function parsing leniency on the analysis flag

Thread accept_openai_wrapper from the generator to build_json_tools_flat_keys
so the leading "type": "function" field is accepted only when openai_wrapper_trigger is set.
…org#22219)

* docs: Add instructions to install `llama.cpp` from conda-forge

Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>

* Rewording of instructions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add escape test cases

* chat : fix LFM2 tool-call parsing double-escaping
* vulkan: add GGML_OP_COL2IM_1D, follow-up to the CPU op

* vulkan: col2im_1d bounded gather loop instead of full-K scan with modulo

* vulkan: col2im_1d address review from @jeffbolznv

* vulkan: col2im_1d return nullptr for unsupported types, address review from @0cc4m
* Add -cl-fp32-correctly-rounded-divide-sqrt to F16=ON builds

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Make GGML_SYCL_F16=ON the default

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Leave F32 the default

F16 remains explictly set for example and Dockerfile builds.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Revert changes to examples/sycl/build scripts

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

---------

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
…gml-org#24363)

* support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND

* fix conflict

* rebase, support new UT case of repeat, concat
* sycl: support reordered Q4_K and Q5_K MoE MUL_MAT_ID

Extend reordered-weight handling to fused MoE MUL_MAT_ID for Q4_K and Q5_K expert tensors and add Q5_K reordered DMMV coverage. Unsupported 3D reorder cases now fall back instead of aborting.

* sycl: extend MoE reorder to Q6_K mul_mat_id
* bench : add --offline

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
… position (ggml-org#24536)

* spec: add spec metrics mean acceptance length and acceptance per pos

* fix as suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix as suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix as suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix as suggestions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…2930)

* implement UMA host-visible memory

* update based on 0cc4m's suggestion
* Move post-GEMM MUL required for dequant b4 lora and bias add

see ggml-org#23484 :
1. For lora, I would presume we want fully dequantized values before
   doing the residuals, but this depends on how the LORAs were
generated. Literature tells me LORA happens post-mul but pre-bias add ggml-org#8332
2. For ModelOPT, bias-add should happen on [fully-dequantized
   values](https://github.com/NVIDIA/Model-Optimizer/blob/b49f9b9e2d747af992d78a3aa7f10efe5a8847e1/modelopt/torch/quantization/backends/nvfp4_gemm.py#L59-L64)

* Restrict build_ffn for NVFP4 to supported combinations
* ui: add source toggle to mermaid and svg blocks

Add a toggle button next to copy and preview that switches a rendered
mermaid or svg block to its source code and back. The button is shared by
both block types and the rendered view stays the default.

The source view reuses the code block scroll container and the highlighted
code element captured at transform time, so it matches the app code blocks
without highlighting again.

Make tall diagrams scroll like text code blocks: safe centering keeps the
diagram centered when it fits and falls back to start alignment when it
overflows, so the top stays reachable instead of clipping above.

Keep the block header opaque and layered above the scrolled diagram, and
ignore header clicks in the zoom handler, so a button click never falls
through to the zoom dialog.

* ui: transparent diagram block header, address review from @allozaur
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.