feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent by canyugs · Pull Request #1187 · openabdev/openab

canyugs · 2026-06-24T09:48:33Z

What problem does this solve?

openab-agent can only reach Anthropic via ANTHROPIC_API_KEY (pay-per-token). Codex already supports subscription login, but Claude Pro/Max subscribers cannot use their subscription with the native agent. This adds native Anthropic OAuth (Claude Pro/Max) so users run openab-agent on the Claude subscription they already pay for — no API key — matching the existing Codex experience.

Closes #1186

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1519271476002291752

At a Glance

  openab-agent auth anthropic-oauth          ~/.openab/agent/auth.json
        │ PKCE + browser/paste                ┌────────────────────────────┐
        ▼                                      │ "codex"          (Token)   │
  claude.ai/oauth/authorize ──code──► token   │ "anthropic-oauth"(Token) ◄─┼─ new tenant
  platform.claude.com/v1/oauth/token          │ "<mcp server>"   (Mcp)     │
        │ JSON exchange + scope-less refresh   └────────────────────────────┘
        ▼
  AnthropicProvider (OAuth mode)
    Authorization: Bearer <sk-ant-oat…>
    anthropic-beta: claude-code-20250219,oauth-2025-04-20   x-app: cli
    system[0] = "You are Claude Code, …"   tools: read→Read …
        │
        ▼   api.anthropic.com/v1/messages
  provider select: ANTHROPIC_API_KEY → anthropic-oauth → codex

Prior Art & Industry Research

I looked at how two comparable self-hosted agents authenticate to Anthropic and Codex. The headline finding: neither implements a native Anthropic (Claude Pro/Max) OAuth login — both avoid the PKCE flow and instead lean on a setup-token or on reusing Claude Code's local credentials. This PR (following Pi) does the full native PKCE login, which is strictly more self-contained for pod deployments. Their surrounding architecture, however, validates the storage/refresh choices here.

OpenClaw — supports API keys and subscription OAuth.

Anthropic: no native OAuth login — either a user-supplied setup-token stored in an auth profile, or reuse of the local Claude CLI (claude -p).
Codex: full PKCE — auth.openai.com/oauth/authorize → callback http://127.0.0.1:1455/auth/callback (or manual paste) → token exchange → accountId extracted from the access token. This is byte-for-byte the same flow openab-agent already uses for Codex, corroborating our approach as the de-facto standard.
Storage: ~/.openclaw/agents/<id>/agent/auth-profiles.json, one {access, refresh, expires, accountId} tuple per profile.
Refresh: treats the profile file as a "token sink", refreshes under a file lock, and is careful not to spend copied refresh tokens — the same rotation hazard we handle by persisting the rotated refresh token on every refresh.
Refs: https://docs.openclaw.ai/concepts/oauth , https://docs.openclaw.ai/gateway/authentication

Hermes Agent — PROVIDER_REGISTRY dataclasses in hermes_cli/auth.py declare each provider's auth type + base URLs + env vars; resolve_runtime_provider() is the single resolution entry point.

Anthropic: API key only — no OAuth. Reads ANTHROPIC_API_KEY, and if absent reads ~/.claude/.credentials.json (reuses Claude Code's store). Docs explicitly note Anthropic here is "straightforward API key authentication without refresh token complexity."
Storage: ~/.hermes/auth.json (OAuth tokens + active provider), credential_pool.json (rotation), .env (API keys); auth.json guarded with fcntl/msvcrt file locks.
Refs: https://deepwiki.com/NousResearch/hermes-agent/2.3-authentication-and-providers , https://github.com/NousResearch/hermes-agent

Primary source ported: Pi (earendil-works/pi) — packages/ai/src/utils/oauth/anthropic.ts (PKCE flow, endpoints, scopes; verifier doubles as state) and packages/ai/src/api/anthropic-messages.ts (OAuth headers, Claude Code system block, tool-name normalisation). The OAuth client is Claude Code's public client.

How this PR compares: like both systems, openab-agent keeps a single namespaced credential file (~/.openab/agent/auth.json) with atomic writes + per-refresh rotation handling, and an existing Codex tenant identical to OpenClaw's Codex flow. Unlike both, it adds a native Anthropic PKCE login so subscribers need neither a setup-token nor a local Claude Code install.

Proposed Solution

Add an anthropic-oauth tenant alongside the existing codex tenant in ~/.openab/agent/auth.json:

auth.rs — login_anthropic_browser_flow() (PKCE; verifier doubles as state per Claude's flow); namespaced token store (load/save/get_valid_token/force_refresh _for(provider)); per-provider refresh encoding (Anthropic = JSON, no scope; Codex = form); shared loopback-callback helpers; show_status lists all tenants.
llm.rs — AnthropicProvider gains AnthropicAuth { ApiKey | OAuth }. OAuth mode sends Bearer + Claude Code identity headers (anthropic-beta: claude-code-20250219,oauth-2025-04-20, x-app: cli), prepends the required "You are Claude Code…" system block, normalises built-in tool names to Claude Code casing (read↔Read), and refreshes once on a mid-flight 401. select_provider gains anthropic-oauth; anthropic/auto fall back API-key → OAuth.
acp.rs — session/model selection via AnthropicProvider::auto*() (covers both auth modes); model catalog shows Anthropic models when an API key or OAuth token is present.
main.rs — openab-agent auth anthropic-oauth [--no-browser].

Also bumps the stale default model claude-sonnet-4-20250514 → claude-opus-4-8 (the old dated snapshot returns 404 on the subscription endpoint).

Why this approach?

Consistency: reuses the existing Codex OAuth tenant pattern in the same auth.json, so the two subscription logins coexist without new storage mechanisms.
Proven flow: endpoints/headers/system block are ported verbatim from Pi (a shipping Anthropic-OAuth client), de-risking the Claude-Code-beta specifics.
Refresh safety: each refresh persists the rotated refresh token (OpenClaw/Hermes both flag this as a sharp edge).

Tradeoffs / limitations: depends on Claude Code's public OAuth client and the claude-code-20250219,oauth-2025-04-20 beta headers — if Anthropic changes these, the OAuth path needs updating (API-key path is unaffected). The claude-opus-4-8 default now also applies to API-key mode (Opus is pricier per-token; overridable via OPENAB_AGENT_MODEL). The legacy Dockerfile.native is unrelated and intentionally out of scope (canonical Dockerfile.unified builds the native variant correctly).

Alternatives Considered

API key only (status quo): excludes every Pro/Max subscriber — the whole point.
Reuse Claude Code's local credential store (what Hermes does via ~/.claude/.credentials.json, and OpenClaw via claude -p): rejected — openab-agent runs in a pod with no Claude Code install and no ~/.claude. Owning an anthropic-oauth tenant in our own auth.json keeps it self-contained and matches the Codex tenant already present.
Setup-token paste (OpenClaw's other Anthropic path): rejected — a full PKCE login is a better UX (no manual token minting) and reuses the exact callback/paste plumbing the Codex flow already has.
Separate credential file per provider: rejected — the namespaced single auth.json already serves Codex + MCP; a new file would fragment storage and duplicate the atomic-write/rotation logic.

Validation

Rust:

cargo check / cargo build pass (0 warnings)
cargo test passes — 194 passed, incl. 4 new (authorize-URL, namespaced storage disjointness, OAuth request-body identity+tool-casing, name round-trip)
cargo clippy — no new warnings from this change (6 pre-existing in test-module ENV_LOCK-across-await + mcp/runtime.rs; the OAuth code is clippy-clean)

Manual (real Claude Pro/Max account):

auth anthropic-oauth --no-browser login → token stored under anthropic-oauth, auth status shows valid
Live chat on claude-opus-4-8 (default) / claude-sonnet-4-6 / 4-5 → correct responses
Real tool call — bash executes in-sandbox (echo …$((6*7)) → 42), confirming tool-name normalisation round-trips
Token refresh — forced expiry → live JSON refresh succeeds, access+refresh tokens rotate, request still completes
Canonical image — built Dockerfile.unified --target native; ran the in-image agent end-to-end (chat + tool call) via the stored OAuth token

Copilot

Pull request overview

Adds first-class Anthropic (Claude Pro/Max) OAuth login support to openab-agent, storing credentials as a new anthropic-oauth tenant alongside the existing codex tenant in ~/.openab/agent/auth.json. This extends provider auto-detection and request shaping so Claude subscription users can run the native agent without an ANTHROPIC_API_KEY.

Changes:

Introduces openab-agent auth anthropic-oauth [--no-browser] PKCE login flow and namespaced token load/save/refresh helpers.
Extends AnthropicProvider to support OAuth auth mode (Bearer + Claude Code identity headers/system block + tool-name casing normalization).
Updates ACP provider/model selection and available-model listing to recognize Anthropic OAuth credentials; bumps default Anthropic model to claude-opus-4-8.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
openab-agent/src/main.rs	Adds the `auth anthropic-oauth` CLI subcommand wiring.
openab-agent/src/auth.rs	Implements Anthropic PKCE/OAuth flow and namespaced token storage/refresh.
openab-agent/src/llm.rs	Adds Anthropic OAuth request behavior (headers/system/tool name normalization) and provider selection updates.
openab-agent/src/acp.rs	Uses Anthropic `auto*` selection (API key or OAuth), updates default model and model availability gating.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+/// Load the LLM token stored under `namespace` (`codex` / `anthropic-oauth`).
+pub fn load_tokens_for(namespace: &str) -> Result<TokenStore> {
    let path = auth_path();
    let map = read_auth_file(&path).map_err(|_| {
        anyhow!(
-            "No credentials found at {}. Run `openab-agent auth codex-oauth` first.",
+            "No credentials found at {}. Run `openab-agent auth` first.",
            path.display()
        )
    })?;
-    match map.get(CODEX_NAMESPACE) {
+    match map.get(namespace) {
        Some(AuthEntry::Token(t)) => Ok(t.clone()),
        _ => Err(anyhow!(
-            "No codex credentials in {}. Run `openab-agent auth codex-oauth` first.",
+            "No {namespace} credentials in {}. Run `openab-agent auth` first.",
            path.display()
        )),
    }


+/// Block on the loopback listener for the OAuth redirect, reply 200, return the
+/// authorization code. ponytail: the Codex flow above predates this helper and
+/// still inlines the same logic; fold it in if that path is ever touched again.


+        _ => match AnthropicProvider::auto() {
            Ok(p) => Ok(Box::new(p)),
            Err(_) => match OpenAiProvider::from_auth_store() {
                Ok(p) => Ok(Box::new(p)),
                Err(e) => Err(format!(
-                    "No credentials: set ANTHROPIC_API_KEY or run `openab-agent auth codex-oauth`. {e}"
+                    "No credentials: set ANTHROPIC_API_KEY, or run `openab-agent auth anthropic-oauth` / `auth codex-oauth`. {e}"
                )),
            },


                                    return self.error_response(
                                        id,
                                        -32000,
-                                        &format!("No credentials: set ANTHROPIC_API_KEY or run `openab-agent auth codex-oauth`. {e}"),
+                                        &format!("No credentials: set ANTHROPIC_API_KEY, or run `openab-agent auth anthropic-oauth` / `auth codex-oauth`. {e}"),
                                    )


canyugs · 2026-06-24T12:38:15Z

Thanks for the thorough review — addressed in 73bf9a2.

F1 (🔴 CI workspace) — fixed, with a root-cause correction. The "rebase onto main" suggestion doesn't apply: this branch is already based on current main (d1b0cfb), and the breakage is present-but-dormant there. ci-openab-agent.yml only runs on paths: openab-agent/**; its last run was 2026-06-17 (#1133) — before the workspace restructure (#1146/#1170) made the repo root a workspace without excluding openab-agent. Nothing has touched openab-agent/ on main since, so the check never re-ran — this PR is the first openab-agent/ change to expose it. Fix: exclude = ["openab-agent"] in the root Cargo.toml (the cargo error's own suggestion), plus cargo fmt. Locally the full CI sequence now passes: cargo fmt --check, cargo clippy -- -D warnings, cargo test (194), cargo test -- --ignored (11). Note: committing [workspace] into openab-agent/Cargo.toml instead would have collided with Dockerfile.unified's build-time [workspace] injection (duplicate-key error), so the root exclude is the non-conflicting fix.

F2 (🟡 PKCE state) — fixed + verified live. Now uses an independent 32-byte random state; the verifier stays back-channel-only. Worth recording: claude.ai's authorize rejects a short independent state with Invalid request format, so the state is sized to match the verifier (32 bytes) — value still independent. Verified end-to-end against a real Pro/Max account (browser login → token exchange with state ≠ verifier → live chat).

F3 (🟡 error UX) — fixed. Credential errors now name fully-qualified subcommands (openab-agent auth anthropic-oauth / openab-agent auth codex-oauth) and preserve the underlying read/parse error.

F4 (🟡 comment tag) — fixed. ponytail: → Note:.

Also confirmed the canonical native image still builds: Dockerfile.unified --target native (the path build-images.yml uses) builds clean and runs the OAuth flow end-to-end in-container. The legacy Dockerfile.native is a separate pre-existing issue (missing the [workspace] injection that Dockerfile.unified does) and is intentionally out of scope here.

canyugs · 2026-06-24T13:02:00Z

Follow-up: the CI openab-agent check went green only after one more fix (4ef7c49).

With the workspace exclude in place, fmt/clippy/test ran and passed, but the ACP initialize smoke test then failed (empty response). Root cause: a latent flush-on-shutdown race — the dispatch loop fed responses to a detached stdout-drain task, and on stdin-EOF #[tokio::main] aborted the drain before it flushed the last line. main wins this race by timing; this branch's slightly different startup timing lost it ~85% of the time locally (origin/main binary: 20/20 pass; this branch pre-fix: 3/20). Fix captures the drain handle and bounded-awaits it after the loop so queued output is flushed before return — 20/20 after. All CI openab-agent steps now pass.

brettchien · 2026-06-24T15:43:55Z

Supplementary architectural review (forward-looking — this PR is already LGTM'd and the line-level items
are well covered by the automated review + the author's fixes; CI is green, state is now independent, the
workspace exclude is in). None of the below blocks merging — they're direction/follow-up points for the
planned OAuth revamp + a forthcoming ADR, worth anchoring on record before this lands as the first vendor.

Findings

#	Severity	Finding	Location
1	🟡	`auth.json` read-modify-write is unlocked across processes — concurrent refresh can trip OAuth 2.1 §10.4 token-family revocation. Not introduced here; follow-up, not blocking.	`auth.rs` storage layer
2	🟢	Add `CLAUDE_CODE_OAUTH_TOKEN` env support — unblocks headless/pod/CI now, and is race-free	`llm.rs` / `auth.rs`
3	🟢	Keep per-vendor specifics in a descriptor (sets up the shared multi-vendor adapter as a refactor, not a rewrite)	`auth.rs` / `llm.rs`
4	🟡	Hardcoded default `claude-opus-4-8` is a recurring-staleness/404 risk (no evergreen alias in 4.6+) — recommend dropping the hardcoded default and requiring the model via config/env (fail-loud)	`llm.rs:152`, `acp.rs:385/446`
5	🟢	Stale doc comment still says "PKCE with the verifier doubling as `state`" though the code now uses an independent state	`auth.rs:692`
6	🟢	`getrandom::fill(...).expect(...)` in a `Result`-returning fn → prefer `?`	`auth.rs:461/472`

Detail — F1: cross-process auth.json race (follow-up)

auth.json is multi-tenant (codex / anthropic-oauth / mcp:*) and every writer does
read_auth_file → mutate → write_auth_file with no lock. openab-agent runs one process per session
thread (SessionPool → connection.rs spawns a child per thread), so concurrent threads = concurrent
processes doing that RMW. When a token expires, multiple processes refresh in the same window; with
refresh-token rotation the later ones present an already-rotated token, which under OAuth 2.1 §10.4 reuse
detection can revoke the whole token family → every session logged out at once.

This is not introduced by this PR (Codex already shares the same store), but OAuth widens the exposure,
and API-key users never hit it (no refresh writes). Not merge-blocking for this PR. The fix — a
flock-guarded RMW funnel that all writers (including the MCP CredentialStore) share, with the network
refresh done outside the lock + a re-read inside for single-flight — is planned for the revamp PR. Until
then, F2 (env token) gives a race-free path for fleet/pod use.

Detail — F4: default model staleness

This PR replaces the old default claude-sonnet-4-20250514 precisely because it 404s on the subscription
endpoint (retired there). The replacement claude-opus-4-8 will eventually hit the same fate when it's
retired from the Claude Code surface — i.e. this is a recurring 404 timebomb, not a one-off. And there's no
escape via a floating alias: in the 4.6+ generation, dateless IDs (claude-opus-4-8, claude-sonnet-4-6)
are fixed canonical IDs, not evergreen pointers — Anthropic ships new versions under new IDs and never
re-points an old one. So the default must be bumped by hand each generation.

Recommendation — no hardcoded model default; require it via config/env and fail loud. Since Messages V1
mandates a model and there is no evergreen alias to fall back to, any in-code default is a 404 timebomb.
Resolve model as: ACP/CLI model_override → OPENAB_AGENT_MODEL → error ("no model configured; set
OPENAB_AGENT_MODEL or select one") instead of the hardcoded claude-opus-4-8. This also removes the
silent Opus cost bump for API-key users.

Note this is a behavior change: today's zero-config default goes away, so a model must be set
(deployments via values.yaml/env already do; zero-config npx/local users will need to set
OPENAB_AGENT_MODEL). Worth a clear error message + CHANGELOG line. Drops the three duplicated default
sites (llm.rs:152, acp.rs:385/446).

Detail — F2: support CLAUDE_CODE_OAUTH_TOKEN

claude setup-token mints a 1-year subscription OAuth token. Reading it as a Bearer source — precedence
ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN → stored anthropic-oauth tenant (same shape as pi,
earendil-works/pi#3591) — is only a few lines and:

makes Claude subscription auth usable in headless / pod / CI today (no browser, no loopback, no stdin
paste), which the interactive PKCE flow can't cover in a container; and
is race-free (env token, no auth.json refresh writes) — see F1.

For ops-managed deployments this is arguably the primary path; interactive PKCE is for local self-service.

Detail — F3: per-vendor descriptor

client_id / client_secret / endpoints / scope / token-body-format are the only things that vary between
Codex, Anthropic, and upcoming vendors (Gemini, agy, grok…). Holding them in a small per-vendor struct now
(rather than inlined in the Anthropic flow) makes the planned shared OAuth adapter a clean extraction
instead of a rewrite. No need to build the adapter in this PR — just don't entrench Claude-only assumptions.

Direction / roadmap (tracked in a forthcoming ADR)

A short ADR is being drafted for multi-vendor LLM-provider OAuth + credential storage; this PR is the first
vendor under it. Explicitly out of scope here, listed so the follow-ups are on record:

the shared OAuthVendor adapter + Codex refactor onto it (+ grok as a cheap OpenAI-compatible add);
the flock locked-RMW storage fix (F1);
additional vendors (Gemini / agy via Google Code Assist, etc.);
/auth (PR feat(discord): add /auth slash command for device-flow auth #1185) interactive integration — later roadmap. Note for whenever it's picked up:
Anthropic has no device flow (claude.ai is authorization_code only, Support device-code authentication flow (RFC 8628) for Pro/Max subscription users in headless environments anthropics/claude-code#22992) and the
--no-browser paste reads stdin, which the /auth relay can't feed, so that path will likely be a
two-step code-as-CLI-arg flow. No action needed in this PR.

brettchien · 2026-06-24T15:45:00Z

Follow-up on F4 — one thing worth doing in this PR before it merges: please don't pin claude-opus-4-8 as a hardcoded default.

This PR exists because the previous hardcoded default (claude-sonnet-4-20250514) 404'd on the subscription endpoint after retirement — and claude-opus-4-8 will eventually hit the same fate. There's no escape via an alias: in the 4.6+ generation, dateless IDs are fixed canonical IDs, not evergreen pointers, so a hardcoded default is a recurring 404 timebomb that has to be chased by hand every generation.

Since Messages V1 mandates a model, the cleanest fix is to require it and fail loud rather than bake in a new pin:

resolve model as ACP/CLI model_override → OPENAB_AGENT_MODEL → error ("no model configured; set OPENAB_AGENT_MODEL or select one")
drop the three hardcoded default sites: llm.rs:153, acp.rs:385, acp.rs:446
(the ModelOption catalog entry at acp.rs:509 is fine — that's offering the model, not defaulting to it)

It's a small change, and it also removes the silent Opus cost bump for API-key users. Behavior note: this drops the zero-config default, so a model must be set — deployments via values.yaml/env already do; worth a clear error message + CHANGELOG line for zero-config/local users.

Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing PR openabdev#1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

canyugs · 2026-06-24T16:06:17Z

@brettchien Great call — this is exactly the right framing, thank you. The dateless 4.6+ IDs being fixed canonical IDs (not evergreen pointers) makes any hardcoded default a recurring 404 timebomb, and pinning Opus also quietly raised costs for API-key users. Implemented your fail-loud approach in b28c312:

Model resolution is now: explicit ACP/CLI override → OPENAB_AGENT_MODEL → error (no model configured; set OPENAB_AGENT_MODEL or select a model). No baked-in default.
Dropped the three hardcoded sites (llm.rs anthropic_model_from_env, acp.rs session new/load). anthropic_model() is now fallible; session new/load report the provider's actually-resolved model instead of a fallback.
Kept the claude-opus-4-8 entry in the model catalog (acp.rs) — offering, not defaulting, as you noted.
Two implementation details worth flagging: a model override (CLI/ACP) does not require OPENAB_AGENT_MODEL, and credentials are still checked before the model so a missing key/token surfaces its own error rather than the generic "no model configured". auto() only falls through to OAuth when no API key is present.
Behavior note + CHANGELOG-equivalent: documented in docs/native-agent.md that OPENAB_AGENT_MODEL is required for Anthropic — zero-config now fails loud with a clear message; env/values.yaml deployments are unaffected.

Tests: cargo fmt/clippy -D warnings clean, cargo test 196 + 11 ignored. Added test_session_new_requires_model to lock in the fail-loud contract. This also resolves the earlier #5 (silent Opus cost bump for API-key users).

canyugs · 2026-06-26T16:57:08Z

All review findings addressed across iterations — summary of what was fixed:

F1 (CI red): Root-caused to workspace resolution; added exclude = ["openab-agent"] to root Cargo.toml + fixed ACP drain-shutdown race (4ef7c49). CI green since.
F2 (PKCE state = verifier): Now uses independent 32-byte random state; verifier stays back-channel only (73bf9a2).
F3 (error messages): All credential-error messages now spell out full commands (openab-agent auth anthropic-oauth / openab-agent auth codex-oauth).
F4 (hardcoded default model): Removed per @brettchien's architectural feedback — model resolution is now override → OPENAB_AGENT_MODEL → error, no baked-in default (b28c312).
Bonus: ModelRef parser for provider/model_id format, doc updates, ponytail: tag cleaned up.

Final bot review is LGTM ✅, all 33 CI checks green. Ready for maintainer review.

…g (codex + MCP) (#1190) * docs(adr): openab-agent multi-vendor OAuth & credential storage Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR #1185) auth-trigger model. Surfaced while reviewing PR #1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): address review — prefer oauth2 crate, drop rollout, vendor names - Build the OAuthVendor driver on the official `oauth2` crate (already in-tree via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the Anthropic JSON-token-body quirk is applied via the crate's custom http-client hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative). - Remove the project Rollout-plan section (internal sequencing, not ADR material); keep the race-window mitigation in §5.4. - Use vendor names only; drop internal fleet-agent references from the matrix. - Replace unexplained "ToS-gray" with "ToS-risk" + a definition. - Fix cross-references (crate-qualified paths, §-refs, line numbers) and move the settled model-default decision under "Decisions & open questions". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): per-tenant refresh lock + per-user PKCE keying (Mira review r1) Fold in two release-blocker-class concurrency fixes from Mira's review: - §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on commit" claim was wrong — N processes each send a refresh with the same RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family revocation. Replace with a per-tenant exclusive lock: network refresh held under the tenant lock only (not the global lock), so exactly one refresh per tenant per expiry with no head-of-line blocking across tenants. Extends to mcp:<server> tenants. - §7 /auth: key the pending PKCE verifier+state by the initiating Discord user id instead of a single global entry — prevents concurrent-user overwrite (PKCE mismatch) and session hijack. - §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): bundled-secret storage + pending-entry GC (Mira review r2) - §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX- secret. The value is non-confidential by RFC 8252 / Google docs, but GitHub now push-protects Google secrets by default (changelog 2026-03) and partner-scans them for auto-revoke, so a raw literal is not safe in a public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT a security control) or env-inject at runtime — not raw text. - §7: pending PKCE entries get created_at + a 15-min GC sweep in with_auth_locked so abandoned /auth attempts don't accumulate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): close §9 Q2/Q3 — env-injection default, vendor go/no-go (Brett decisions r3) - §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled client_secret; encode-at-rest (a) is the fallback for bundled zero-config binaries only, framed as scanner-evasion (not security). Cite rclone rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical precedent (§10). - §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in, ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6. - §10: add rclone + GitHub/Google secret-scanning references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): confirm agy secret + ecosystem evidence (GitHub survey r4) GitHub survey 2026-06-24 grounds the agy GO decision: - §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX- secret, a public constant >=20 antigravity-auth repos hardcode verbatim (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...). Literal deliberately NOT pasted into the doc — would trip the very §9 Q2 push-protection, so we dogfood the env/encode decision. Redirect confirmed localhost:51121/oauth-callback. - §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/ openclaw plugins + proxies), proving the integration, while the same ecosystem's anti-ban / quota-locking / multi-account-rotation tooling empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate. - §10: add the antigravity OAuth ecosystem reference set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): agy-vendor vs agy-CLI/ACP clarification + MCP store revamp in scope (r5) Grounded by a codebase survey (2026-06-24): - §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist inference provider — it does NOT run the agy CLI. agy speaks no ACP; the existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated agy-acp adapter that shells out to the agy binary per prompt and polls its SQLite DB. The provider path sidesteps ACP entirely and supersedes the CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO. - §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also correct the stale save_tokens_for name and note with_auth_locked is new. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): fix stale §9 Q4 cross-reference The §9 Q4 tail referenced 'openab-agent-mcp.md open items #1 (reqwest 0.12/0.13 split) and #8 (doctor/runtime two-store split)' — but that ADR's §10 Open Questions has only two items (mcp.json location; native-vs-broker parity), neither matching, and no such numbered items / terms exist anywhere in it. The phantom reference dated to the original draft. Replace with an accurate statement: McpCredentialStore reuses the same TokenStore/auth.json storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both; the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): §9 Q2 — encode-at-rest is the default, env is the alternative (Brett) Brett's call given the 93%-plaintext ecosystem survey: keep the bundled zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style) as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses the prior (b)-default ordering. Framing unchanged: encode-at-rest is scanner-evasion for a non-confidential value, not a security control. Record the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that the real risk mitigated is org-repo push-protection friction, not credential loss. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(auth): cross-process locking for auth.json — codex + MCP (ADR §5.4) Implements the ADR §5.4 invariant. auth.json had NO lock: provider save_tokens and McpCredentialStore::save/clear each did an independent unlocked read-modify-write, so a concurrent codex refresh and MCP save last-writer-wins the whole map, and N processes (one openab-agent per Discord thread) could each refresh the codex token with the same RT_old → OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix) with a non-unix no-op: - with_auth_locked: global exclusive lock across the re-read -> mutate -> atomic-write. ALL writers funnel through it — save_tokens (codex) and McpCredentialStore::save/clear (MCP) — so writers merge onto the latest on-disk state instead of lost-updating. Held only for the fast file RMW, never across network I/O. - lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token / force_refresh take the codex tenant lock (non-blocking try + async backoff + 10s timeout, held across the network refresh), with a double-checked re-read so a process that waited adopts the token another already refreshed → exactly one real refresh per tenant per expiry, no RT_old reuse. Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was optimistic). New test asserts the locked RMW merges codex + MCP tenants without lost-update. fmt + clippy -D warnings + test (191 passed) green. Known gap (for review): the MCP *network* refresh is owned by rmcp's AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP writes get the file-integrity lock but MCP refreshes are not yet tenant- serialised. Closing that needs an rmcp-level hook — proposed as follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(auth): address Mira code review — pending GC + portable WouldBlock Round-2 code review (Mira): - §7 pending-entry GC: add created_at to PendingPasteLogin and sweep AuthEntry::Pending older than 15 min inside with_auth_locked on every write, so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is currently a legacy tombstone with no live writer (PKCE state lives in rmcp's in-memory StateStore); the field + GC land now per ADR §7 and are forward-compatible with the forthcoming /auth two-step flow, and meanwhile sweep legacy stray entries (created_at default 0 reads as ancient). New test covers stale-swept / fresh-kept / real-tenant-untouched. - flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD. - MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) — Mira concurs. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(auth): serialise MCP token refresh cross-process (close §5.4 (b) gap) Closes the known gap from f370110: MCP refreshes now get the same per-tenant serialisation as codex. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in McpRuntimeManager::resolve_oauth_dial via client.get_access_token(). Wrap that call (per-server) with the existing auth::lock_tenant_refresh so only one process refreshes a given server at a time. No explicit double-check needed: rmcp's get_access_token re-load()s auth.json each call and returns the cached token without a network refresh when remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the race adopts the token the winner wrote to the shared file — no second RT_old presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights within one process via its AuthorizationManager Mutex; this closes the cross-process gap. - auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module. - ADR §5.4: document the resolve_oauth_dial serialisation point. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(auth): /simplify cleanups — dedup lock acquire, reuse refresh fd Quality-only cleanups from a 4-angle /simplify pass (no behaviour change): - Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both with_auth_locked and McpCredentialStore::clear through it, so the "global" sidecar name + the cfg(unix) acquire live in one place instead of two copy-pasted blocks. clear flattens (drop the inner run closure). - lock_tenant_refresh opens the lock fd once and re-issues flock on it each retry, instead of re-opening (and re-create_dir_all-ing) the file every 100ms under contention. Removes the now-unused flock_try_exclusive helper. - Document lock_tenant_refresh's double-check contract (the lock only serialises; callers must re-check freshness after acquiring). fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(auth): /code-review — harden GC + refresh-lock contract comments Doc-only clarifications from a /code-review pass (no behaviour change): - PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending entry MUST stamp created_at, else gc_stale_pending sweeps it on the next locked write (latent footgun for the forthcoming /auth two-step writer). - lock_tenant_refresh: correct the contract — reuse-safety comes from loading the refresh token INSIDE the lock (so force_refresh, which always refreshes on a 401, is reuse-safe too); the post-lock expiry re-check is only an optimisation to skip a redundant refresh. Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP server literally named "codex" collides with the codex tenant's auth.json key and refresh lock — committed MCP creds use the bare server name, not mcp:<server> as ADR §5.4 describes) — flagged for a separate change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * build(openab-agent): declare standalone [workspace] to fix CI ci-openab-agent.yml runs cargo fmt/clippy/test/build with working-directory: openab-agent, but the crate is not a member of the parent openab workspace (members = crates/openab-core, openab-gateway) and was not excluded, so cargo errors 'current package believes it's in a workspace when it's not' and every step fails before doing any work. openab-agent is intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so the correct fix is an empty [workspace] table making it its own root. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Revert "build(openab-agent): declare standalone [workspace] to fix CI" This reverts commit b40693c. * ci(openab-agent): append [workspace] in CI + harden ACP smoke test ci-openab-agent.yml ran cargo from working-directory: openab-agent without the [workspace] table the crate needs (it's standalone, not a parent-workspace member), so every step failed with 'believes it's in a workspace when it's not' — the workflow had been red independently of any PR. Dockerfile.unified already works around this by appending the table at build time; replicate that in the workflow rather than committing it to Cargo.toml (which would double-append in the Dockerfile and break the image build). Also harden the ACP smoke test: build the release binary in its own step and bump the response timeout 5s -> 30s so a loaded runner doesn't flake the agent's first-response window (the binary itself responds in <1s locally, verified incl. a clean-HOME release build). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(openab-agent): surface ACP smoke-test stderr + exit code for diagnosis The smoke test produces empty stdout only in CI (the binary responds with agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs). Capture stderr + exit code so the next CI run reveals why stdout is empty. * openab-agent: address review findings on OAuth ADR + auth.json locking Resolves the Changes-Requested review on the multi-vendor OAuth ADR + cross-process auth.json locking PR. No behavioural change beyond log level and a non-unix diagnostic. ADR (docs/adr/openab-agent-oauth.md) - Fix lock-file names to match code: auth.json.global.lock and auth.json.refresh.<tenant>.lock (were auth.json.lock / auth.json.<tenant>.lock). - Add the path parameter to the with_auth_locked pseudocode signature. - Correct the MCP refresh note: initialize_from_store() does the disk reload, not get_access_token. - Correct the crate note: libc::flock directly (rustix is not in-tree). - Document the deliberate fail-open trade-off on tenant-lock timeout. - Mark the default-model removal (Decision 1) as a follow-up; this PR ships the ADR + locking only. Code - auth.rs: escalate the tenant-lock timeout log from warn! to error! and document the fail-open trade-off on lock_tenant_refresh. - auth.rs: non-unix lock_global no-op now warns once instead of silently providing zero cross-process protection. - mcp/runtime.rs: fix the rmcp reload comment and document the cross-module invariant that the refresh lock and credential entry share the server name. CI (ci-openab-agent.yml) - ACP smoke test now fails on a non-zero/timeout exit code, and the agentInfo assertions use { } so exit fails the step rather than just a subshell. Gate: cargo fmt --check, clippy -D warnings, test (192 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * openab-agent: fail closed on a contended refresh lock (F6) Replaces the fail-open tenant-lock timeout with a bounded-refresh + fail-closed design, so a contended lock can no longer reintroduce the double-refresh it exists to prevent. - Bound the refresh network call with an explicit 8s HTTP timeout on both the codex client (auth.rs) and the MCP AuthClient (mcp/runtime.rs), strictly shorter than the 10s lock-acquire deadline. With flock(2) auto-release on death, a live holder always frees the tenant lock before a waiter's deadline, so a lock-acquire timeout is genuinely abnormal. - lock_tenant_refresh now returns RefreshLock { Held, Unavailable, TimedOut }. On TimedOut callers fail closed instead of refreshing unserialised: - codex get_valid_token / force_refresh return a retryable error; - MCP resolve_oauth_dial returns a new OauthDialError::Transient that leaves the server retryable WITHOUT forcing re-login (NeedsAuth) or tripping the circuit breaker (auth-level failures still map to NeedsAuth as before). A filesystem error opening the sidecar returns Unavailable and degrades to a best-effort unserialised refresh rather than blocking every refresh. - Add a fail-closed timeout test (injectable deadline) and update ADR §5.4 to document the bounded-refresh + fail-closed invariant, superseding the earlier fail-open note. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * openab-agent: size lock timeout above worst-case multi-call refresh hold Mira review (PR #1190) found the fail-closed invariant held only for the codex path. The MCP path holds the tenant lock across TWO sequential bounded calls — rmcp's initialize_from_store() (AS discovery) then get_access_token() (refresh) — so the worst-case lock-hold is ~2 x REFRESH_HTTP_TIMEOUT (16s), which could exceed the fixed 10s lock deadline and fail a waiter closed while the holder is still legitimately progressing. Derive REFRESH_LOCK_TIMEOUT from the bound instead of hardcoding it: REFRESH_LOCK_TIMEOUT = MAX_REFRESH_ROUND_TRIPS (2) * REFRESH_HTTP_TIMEOUT + 4s = 20s so the deadline is always above the worst-case hold and only a genuinely stuck holder trips it. Normal-case latency is unchanged (the waiter polls every 100ms and acquires as soon as the holder releases). Update the fn doc and ADR §5.4 to state the multi-call hold explicitly. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: thepagent <hehsieh1010@gmail.com>

…nthropic-oauth # Conflicts: # openab-agent/src/auth.rs

canyugs · 2026-06-27T05:57:20Z

Design update — landing the ADR's `OAuthVendor` abstraction + a centralized config file

This comment brings this PR in line with the multi-vendor OAuth ADR (docs/adr/openab-agent-oauth.md, landed in #1190) and adds the prior-art research that ADR §3 only partially covered for the config-file axis. Following docs/adr/pr-contribution-guidelines.md.

Framing — "decided" vs "landed". #1190 shipped the ADR document + the §5.4 storage locking (with_auth_locked / lock_tenant_refresh) only. The OAuthVendor abstraction (§4.2/§5.1) and the CLAUDE_CODE_OAUTH_TOKEN env route (§5.3) were decided but deferred. This PR is where they get implemented.

0. Discord Discussion

Internal thread, 2026-06-27 (multi-provider auth direction + config-file design).

1. What problem does this solve?

Adding a 2nd subscription-OAuth provider (Codex → now Anthropic Claude) surfaced two gaps:

No auth abstraction. Today each vendor is hand-rolled — separate codex_* / anthropic_* constants, three near-duplicate login flows (login_browser_flow, login_anthropic_browser_flow, login_codex_device_flow), and a if provider == ANTHROPIC_NAMESPACE branch in refresh_token. A 3rd vendor (gemini/grok) would be copy-paste. ADR §4.2 calls for a single OAuthVendor descriptor; it was never implemented.
No centralized config. Default provider/model live only in env vars (OPENAB_AGENT_PROVIDER, OPENAB_AGENT_MODEL) with a hardcoded fallback chain in llm.rs::select_provider. There's no declarative config file, and the CLAUDE_CODE_OAUTH_TOKEN fleet route (ADR §5.3) is unimplemented.

2. At a Glance

                       auth axis  (HOW a credential is obtained/refreshed)
                       ─────────────────────────────────────────────
  OAuthVendor descriptor (code, static)        credential store (auth.json, 0600, locked-RMW)
    ┌──────────────┐                             ┌────────────────────────────────┐
    │ CodexVendor  │   shared oauth2 driver      │ "codex":          Token{...}    │
    │ AnthropicVdr │──►(PKCE / device / refresh)─│ "anthropic-oauth":Token{...}    │
    │ (gemini…)    │   §5.4 per-tenant lock       │ "mcp:<server>":   Mcp{...}      │
    └──────────────┘                             └────────────────────────────────┘
            │                                                  ▲ secrets ONLY here
            ▼ inference axis (HOW a request is sent)           │
    AnthropicProvider / OpenAiProvider / …                     │
            ▲                                                  │
            │ selection (dynamic)                              │
   config.json  ──►  OPENAB_AGENT_MODEL env  ──►  model_override (ACP/CLI)
   "model":"anthropic/claude-sonnet-4-6"   (no secrets — commitable)
   resolution: model_override → config.json → env → error (fail-loud, ADR §9 Q1)
   credential precedence (§5.3): ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN → stored tenant

3. Prior Art & Industry Research

Surveyed how four agents structure config × provider × auth. OpenClaw + Hermes are the mandated references; pi + opencode added because the config-file axis is exactly the gap ADR §3 left open (§3 covered the auth axis well).

Dimension	OpenClaw (req'd)	Hermes (req'd)	pi	opencode
Config format / location	`openclaw.json` (JSON5), global+project merge	`~/.hermes/config.yaml` + `.env`	`~/.pi/agent/` — 3 files (settings/models/auth)	`opencode.json` (JSONC), deep-merge layers
Default model	single `model.primary:"prov/model"`	`model.default:"prov/model"` (+`model.provider`)	two fields `defaultProvider`+`defaultModel`	single `model:"prov/model"`
Provider registry	built-in + `models.providers{}` (custom)	`ProviderProfile` dataclass registry (in code)	`models.json` + `registerProvider()` hook	Models.dev + `provider{}` block
Auth-method encoding	profile key `prov:label` + `type:oauth\|apikey`	`auth_type` string field on profile	cred `type:api_key\|oauth`	cred `type:oauth\|api` in auth store
Secrets location	separate `auth-profiles.json`	separate `.env` / `auth.json`	separate `auth.json` (0600)	separate `auth.json` (0600)
Per-model params	`models[].params` passthrough	`fixed_temperature`/`default_max_tokens`	structured model fields (no temp)	`models.<id>.options` passthrough
Precedence	flag/slash → config order → env → default	CLI → config.yaml → .env → default	flag → auth.json → env → registry	env → auth.json → config

Three signals are unanimous across all four — we adopt them:

Secrets always live in a file separate from config (config stays commitable). We already have this: auth.json.
The credential record is tagged with a type discriminator (oauth vs api_key). We already have this: AuthEntry / TokenStore.
Static provider facts in code, dynamic selection in config, resolved in one place. Hermes is the cleanest exemplar (ProviderProfile registry + resolve_runtime_provider()), and it maps 1:1 onto ADR §5.1's OAuthVendor descriptor + a resolver.

Where we diverge: unlike all four (TS/Python), we keep the cross-process flock-based locked-RMW invariant (ADR §5.4) — OpenClaw and Hermes both file-lock but only for the single-process case; OpenClaw's open refresh-persistence bug cluster (#52037/#48153/#71026) is direct evidence that getting the lock + atomic write-back right matters.

4. Proposed Solution

OAuthVendor trait + descriptors (ADR §5.1): CodexVendor, AnthropicVendor today; new vendor = descriptor only. Collapse the three login flows into one shared PKCE/device driver on the in-tree oauth2 crate; fold the refresh_token provider-branch into vendor.token_body(). Anthropic's JSON-no-scope quirk via the crate's custom-http hook.
Thin config.json (next to auth.json), single provider/model string + optional max_tokens / per-provider base_url. No secrets. Resolution: model_override → config.json → OPENAB_AGENT_MODEL → error (env still overrides; fail-loud per §9 Q1).
Credential precedence (§5.3): ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN → stored anthropic-oauth tenant.
Storage: Anthropic save/refresh already routed through §5.4 locking in this branch (merge with main done — see commit resolving the auth.rs conflict; save_tokens_for → with_auth_locked, get/force_refresh_for → lock_tenant_refresh(namespace)), so the per-tenant RTR race is closed for the Anthropic tenant too.

// ~/.openab/agent/config.json — commitable, secrets stay in auth.json
{
  "model": "anthropic/claude-sonnet-4-6",   // single provider/model string (reuses ModelRef)
  "max_tokens": 8192,                        // optional
  "providers": { "anthropic": { "base_url": "https://api.anthropic.com" } }  // optional, custom only
}

5. Why This Approach

Single provider/model string: 3 of 4 surveyed projects use it, and openab already parses it — ModelRef::parse (llm.rs) + resolve_provider_choice() extract the provider from the provider/ prefix today. Two fields (pi) would add coupling for no gain.
Thin config / secrets-in-auth.json: unanimous prior-art signal; reuses the hardened §5.4 store instead of inventing a second secret file.
OAuthVendor now, not later: with only codex+anthropic in tree this is the lowest-cost moment to abstract; doing it after gemini/grok land means refactoring more call sites.
Linked fix — F4: a single provider/model config string makes the ModelRef::parse HuggingFace-ID bug load-bearing (anthropic/claude-… must split; some-org/some-model must not). F4's fix (split only on known provider prefixes) ships with the config change, not separately.

6. Alternatives Considered

Two fields (defaultProvider+defaultModel, pi-style) — rejected: couples two values openab already derives from one string; only 1/4 projects do it.
Secrets in the config file — rejected: contradicts all four references and would bypass the §5.4 locked store.
Build all vendors now (gemini/grok/agy) — out of scope per ADR §9 Q3 (incremental landing); this PR lands the abstraction + keeps codex/anthropic green.
Layer-3 auto-trigger (auto-login on mid-turn 401) — deferred per ADR §2 / Brett 2026-06-24; /auth is sufficient.

7. Validation

Rust: cargo check, cargo clippy --all-targets, cargo test green after the main merge — 56/56 auth+llm+mcp tests pass, incl. test_anthropic_save_uses_provider_as_key_disjoint_from_codex, with_auth_locked_merges_concurrent_tenants_no_lost_update, lock_tenant_refresh_fails_closed_when_contended, save_uses_rotated_refresh_token_when_present.
To add with the implementation: config.json load + layering test (model_override > config > env), precedence test (ANTHROPIC_API_KEY > CLAUDE_CODE_OAUTH_TOKEN > tenant), F4 ModelRef::parse HuggingFace-ID regression test, and an OAuthVendor round-trip test per descriptor.
Helm / CI / docs: docs update to docs/native-agent.md (vendor model + config.json + env route) lands with the change; no Helm surface.

brettchien · 2026-06-27T06:04:34Z

Now that #1190 has merged, here's a concrete path to rebase this PR onto the new
auth.json locking invariant — plus a couple of best-practice shape changes so the
new Anthropic tenant lands the way the ADR intends rather than as a parallel copy.

Blast radius: the lock layer (with_auth_locked, lock_tenant_refresh/RefreshLock,
the flock helpers, the refresh timeouts, gc_stale_pending) is reused from #1190
verbatim — it was already built tenant-agnostic, so nothing in it is re-implemented. The
only #1190 code touched is lifting the codex orchestration (get_valid_token/force_refresh)
into generic _for(namespace) form so both vendors share one implementation. Genuinely new
code is just the vendor-specific refresh body + the descriptor.

#1190 did the hard part generically, so most of this is wiring the new tenant
into existing primitives, not new infrastructure:

with_auth_locked(path, |map| …) — the global flock-guarded read-modify-write
funnel. Every writer of auth.json must go through it.
lock_tenant_refresh(auth, tenant: &str) -> RefreshLock — already takes a tenant
string; pass "anthropic-oauth". Handle all three arms: Held (hold across the
refresh), Unavailable (best-effort, proceed unserialised), TimedOut
(fail closed — return a retryable error, do not refresh).
REFRESH_HTTP_TIMEOUT (8s) bounds each refresh round-trip so the tenant lock-hold
stays provably under REFRESH_LOCK_TIMEOUT.

1. Route the four unlocked writers through #1190's locks

These four sites in openab-agent/src/auth.rs still do raw read → mutate → write
(this branch predates #1190):

save_tokens_for (:276) → funnel through with_auth_locked (mirrors save_tokens):

fn save_tokens_for(store: &TokenStore) -> Result<()> {
    let store = store.clone();
    with_auth_locked(&auth_path(), move |map| {
        map.insert(store.provider.clone(), AuthEntry::Token(store));
    })
}

get_valid_token_for (:377) → adopt the 5-step locked sequence: fast-path on a
fresh token, then lock_tenant_refresh(.., namespace), re-load inside the lock
(reuse-safety), double-check expiry, one network refresh, commit via save_tokens_for.
force_refresh_for (:386) → same lock_tenant_refresh wrap, but keep its current
shape: it runs after a 401 (token known-bad), so it intentionally skips the expiry
double-check — it loads inside the lock (reuse-safe) and always refreshes.
refresh_token (:401) → build the client with
reqwest::Client::builder().timeout(REFRESH_HTTP_TIMEOUT).build()?. Keep the existing
per-provider body branch (Anthropic JSON-no-scope / Codex form). The Anthropic refresh
is a single round-trip, so it fits the existing MAX_REFRESH_ROUND_TRIPS = 2 budget —
no need to change that constant.
McpCredentialStore::save/clear (:330 / :361) — openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP) #1190 already locked these on
main. This is a straight conflict; take main's locked versions, don't carry the
unlocked copies forward.

2. One locked implementation, not two — and stage it as two commits

After (1), main's codex get_valid_token / force_refresh and the new _for
variants would be near-duplicates of the same security-critical refresh sequence
(token-family-revocation safety). Duplicated security code drifts — one copy gets a
fix the other doesn't. So parameterise main's locked bodies into the _for(namespace)
functions and keep thin wrappers:

pub async fn get_valid_token() -> Result<String> { get_valid_token_for(CODEX_NAMESPACE).await }
pub async fn force_refresh()   -> Result<String> { force_refresh_for(CODEX_NAMESPACE).await }

To keep the codex path's review surface clean, please split this into two commits:

behaviour-preserving refactor — generalise the locked codex path into _for(namespace);
feature — add the anthropic-oauth tenant on top.

That lets a reviewer verify the refactor changes no behaviour separately from the new vendor.

3. Add a concurrency test for the Anthropic tenant

#1190 ships lock_tenant_refresh_fails_closed_when_contended and
with_auth_locked_merges_concurrent_tenants_no_lost_update for codex. Please cover the
new tenant the same way — parameterise (or duplicate) the lock test over "anthropic-oauth"
so single-flight / fail-closed is actually proven for it, not just structurally similar.

4. Shape the new vendor as a descriptor, not inlined constants

Rather than hand-rolling the Anthropic flow with inline constants alongside the codex one,
lift the per-vendor specifics — client_id, authorize/token URLs, scope, token-body
format (JSON-no-scope vs form), grant type — into a small per-vendor descriptor struct, and
drive the shared flow off it. This is the F3 point from the earlier review, and it lines up
with the ADR's OAuthVendor direction (§5.1): adding the second vendor in descriptor shape
keeps the future shared oauth2-crate driver a clean extraction instead of a rewrite.

To be clear on scope: the descriptor struct belongs in this PR; building the shared
oauth2-crate driver and migrating codex onto it is the genuinely separate, ADR-tracked
follow-up — not asked for here.

One deployment note (not blocking): for pod/fleet use, the race-free path is the
CLAUDE_CODE_OAUTH_TOKEN env route (ADR §5.3) — no refresh writes, no lock needed. The
interactive PKCE + per-tenant lock is the self-service path; worth recommending the env
token as the primary fleet mode and treating interactive OAuth as the fallback.

… §5.1) Collapse the hand-rolled per-vendor OAuth surface into a single OAuthVendor trait + CodexVendor/AnthropicVendor descriptors. Adding a subscription-OAuth vendor is now a descriptor, not a new hand-rolled flow: - Unify the codex + anthropic PKCE login flows into one shared `login_pkce_flow` driven by the descriptor; fold the codex flow into the `accept_callback_code` / `code_from_redirect` helpers (the long-standing TODO) and unify the 127.0.0.1 bind. - Drive `refresh_token` and the authorization-code exchange off `vendor.token_body()` instead of an `if provider == ANTHROPIC_NAMESPACE` branch. - Shared pure builders `build_authorize_url` / `token_store_from_payload`, pinned by wire-format unit tests — the login authorize-URL/exchange hit live OAuth servers, so no integration test covers them. The driver keeps the proven reqwest flows; swapping the engine onto `oauth2::BasicClient` (as `mcp/runtime.rs` already does via a custom http hook) is a follow-up internal change invisible to vendor authors — only the descriptor surface lands here. `client_secret()`/`grant()` are the ADR §5.1 surface for later vendors (gemini/agy bundled secret; copilot/kiro device-code). 205 tests pass (4 new wire-format locks). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ADR §5.3) Add `AnthropicAuth::OAuthEnv` for the pre-provisioned long-lived subscription token route (`CLAUDE_CODE_OAUTH_TOKEN`) — the recommended fleet mode (ops mints once, injects as a k8s secret; no interactive flow, no auth.json write, no refresh race). - `AnthropicProvider::auto[_with_model]()` now resolves in ADR §5.3 precedence: `ANTHROPIC_API_KEY` → `CLAUDE_CODE_OAUTH_TOKEN` → stored `anthropic-oauth` tenant. Each source surfaces its own errors rather than falling through to a lower-precedence credential error. - `OAuthEnv` shares the OAuth `Bearer` + Claude Code identity path (extracted into `oauth_headers`); `is_oauth()` covers it so the system block + tool-name casing apply. The 401 force-refresh is gated to the stored tenant only — the env token has no tenant to refresh, so a 401 there surfaces (re-mint) instead of erroring on a missing tenant. 207 tests pass (+2 precedence tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s (review F4) `provider/model` previously split on the first `/`, mis-parsing HuggingFace-style `org/model` ids (e.g. `meta-llama/Llama-3-8B`) for custom/OpenAI-compatible endpoints — `org` became the "provider" and the real id was truncated. Now the prefix is split off only when it's a `KNOWN_PROVIDERS` entry; otherwise the whole string is the model id. Load-bearing for the planned single-string `provider/model` config field. Known prefixes still split unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…reds (review F3) `select_provider`'s auto-detect arm discarded the `AnthropicProvider::auto()` error and fell through to Codex. A present Anthropic credential that failed for a config reason (e.g. a key set but no model) was silently masked — the user got a Codex provider, or a Codex-only error that hid the real cause. Now: if any Anthropic credential source exists (API key / CLAUDE_CODE_OAUTH_TOKEN / stored tenant), an `auto()` failure surfaces as a config error ("credential present but unusable"). Codex fallthrough happens only when no Anthropic credential exists at all, and that final error names both credential routes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(ADR §5.5) Add a small JSON config file next to auth.json so a deployment can declare the default provider/model (and max_tokens) in a file instead of only via env vars — the centralized-config gap the multi-vendor work surfaced. - New `config` module: `{ model, max_tokens }` (single `provider/model` string, reusing ModelRef). Unknown keys tolerated (forward-compat for `providers`/etc.); malformed JSON fails loud; a missing file is an empty config. - Resolution is env-over-config: `anthropic_model` / `anthropic_max_tokens` / `resolve_provider_choice` now fall `OPENAB_AGENT_*` env → config.json → built-in. A pod's injected env stays authoritative over a baked config. - Secrets never live here — they stay in the locked `auth.json` store. Per-provider `base_url` routing is a deliberate follow-up; this lands the default provider/model/params surface. 213 tests pass (+5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ig.json (review F5) - Anthropic credentials section: the ADR §5.3 precedence (ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN fleet route → interactive anthropic-oauth) + the `openab-agent auth anthropic-oauth` login. - New "Configuration file (config.json)" section: schema, env-over-config precedence, secrets-stay-in-auth.json. - "Adding an OAuth vendor": the OAuthVendor descriptor model (ADR §5.1). - Env table: CLAUDE_CODE_OAUTH_TOKEN, OPENAB_AGENT_ANTHROPIC_CLIENT_ID, OPENAB_CONFIG_PATH. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…auth tenant Brett asked to cover the new tenant the same way openabdev#1190's codex lock tests do, so single-flight / fail-closed is proven for it rather than only structurally similar: - `with_auth_locked_merges_anthropic_tenant_no_lost_update` — a concurrent codex write does not clobber a just-written `anthropic-oauth` token (the new tenant rides the same locked RMW funnel). - `lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant` — a held anthropic refresh lock makes a second anthropic acquire fail closed (`TimedOut`), and does NOT block codex (per-tenant isolation — the reason §5.4 uses a per-tenant lock, not the global one). 215 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

canyugs · 2026-06-27T10:36:38Z

Architecture — Before / After

A visual companion to the design-update comment above: how auth/provider selection is shaped before this PR's multi-vendor refactor vs after (ADR §4/§5.1/§5.3/§5.5).

Before — hand-rolled per-vendor, env-only, unlocked new tenant

 SELECTION   env only:  OPENAB_AGENT_PROVIDER / OPENAB_AGENT_MODEL
             + hardcoded default model  (retire -> 404 timebomb)
             ModelRef::parse: split on FIRST '/'  (breaks HF org/model)
                    |
                    v
             select_provider(choice)
               auto(): API key -> stored OAuth            <- 2-tier
               auto-detect: Anthropic err SWALLOWED --.
                    |                                  '-> silently try Codex
        +-----------+-----------+
        v                       v
  AnthropicProvider        OpenAiProvider
  auth: {ApiKey, OAuth}      (codex)
        |
  -- AUTH: hand-rolled per vendor (coupled to inference) -----------
    codex_client_id() / anthropic_client_id()  + inline constants
    login_browser_flow (codex PKCE)      -.
    login_anthropic_browser_flow (PKCE)   |- 3 near-duplicate flows
    login_codex_device_flow (device)     -'
    refresh_token:  if provider==ANTHROPIC {json} else {form}   <- branch
        |
        v
    auth.json  (namespaced: codex / anthropic-oauth / mcp:<srv>)
    writes: raw read -> mutate -> write   X no lock -> RTR race / lost-update

After — two-axis model, 3-tier precedence, locked, config file

 SELECTION   model_override(ACP/CLI) -> OPENAB_AGENT_MODEL -> config.json -> ERROR
             resolve_provider_choice: env -> config.json prefix -> auto
             ModelRef::parse: split only KNOWN providers (HF org/model kept) [F4]
                    |
                    v
             select_provider(choice)
               auto(): ANTHROPIC_API_KEY -> CLAUDE_CODE_OAUTH_TOKEN
                                         -> stored tenant      <- 3-tier [§5.3]
               auto-detect: credential present + fail -> FAIL LOUD [F3]
                    |
  == TWO-AXIS (ADR §4 - auth _|_ inference) ==========================
                    |
    INFERENCE axis  |   AUTH axis  -- trait OAuthVendor [§5.1] --
    AnthropicProvider|     +- CodexVendor      (namespace, client_id,
    OpenAiProvider   |     +- AnthropicVendor   authorize/token url, redirect,
    (1 per wire fmt) |              |            scope, token_body, grant)
        |            |              v
    AnthropicAuth    |   shared driver (reqwest now; oauth2 follow-up)
    {ApiKey, OAuth,  |   login_pkce_flow . device . refresh -- descriptor-driven
     OAuthEnv}<-CLAUDE…             |
        |            |              v
        v            v        auth.json (0600, namespaced)
    config.json {model, max_tokens}   writes:  with_auth_locked (global flock)
    declarative . secrets NEVER here  refresh: lock_tenant_refresh(ns) per-tenant
                                      [§5.4 - anthropic tenant proven by tests]

Core deltas

Dimension	Before	After
Auth abstraction	hand-rolled 2 vendors, 3 duplicate login flows, refresh `if`-branch	`OAuthVendor` trait + descriptor + shared driver (new vendor = a descriptor)
auth / inference	coupled (login flow lives inside the provider)	two orthogonal axes (§4)
Credential precedence	API key → stored OAuth (2-tier)	+ `CLAUDE_CODE_OAUTH_TOKEN` fleet route (3-tier, §5.3)
Storage safety	raw RMW, unlocked, RTR race	`with_auth_locked` + per-tenant `lock_tenant_refresh` (§5.4)
Config source	env-only + hardcoded default model	`config.json` (env-over-config), fail-loud, secrets stay in auth.json
Error handling	Anthropic error swallowed → silent Codex fallthrough	credential-present-but-broken → fail loud (F3)
Model parsing	split on first `/` (breaks HF ids)	split known providers only (F4)

All of @brettchien's rebase roadmap and the review findings are addressed: routed the new tenant through #1190's with_auth_locked / lock_tenant_refresh, shaped both vendors as OAuthVendor descriptors, fail-loud model resolution (no hardcoded default), and added per-tenant lock tests for the anthropic-oauth tenant (single-flight + fail-closed + codex-independence) in f32de33. CI green — handing to maintainer.

chaodu-agent · 2026-06-27T15:57:31Z

LGTM ✅ — Well-architected multi-vendor OAuth with clean credential-source precedence and solid test coverage.

What This PR Does

Adds native Anthropic OAuth (Claude Pro/Max subscription) login to openab-agent, so subscribers can authenticate via PKCE without an API key — matching the existing Codex OAuth experience. Also introduces a centralized config.json for declarative defaults and removes the hardcoded model default (fail-loud on missing model).

How It Works

OAuthVendor trait (ADR §5.1): a static descriptor per vendor (endpoints, client ID, token-body encoding, redirect). CodexVendor and AnthropicVendor are the first two; the shared PKCE driver (login_pkce_flow) parameterises on the descriptor, so a new vendor is a new struct, not a new hand-rolled flow.
Credential precedence (ADR §5.3): ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN env route (fleet/pod) → stored anthropic-oauth tenant in auth.json. A present-but-misconfigured credential fails loud (F3 guard) instead of silently falling through to Codex.
Token refresh: per-tenant flock serialisation (rides the §5.4 locking from openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP) #1190), JSON body for Anthropic (no scope on refresh), form body for Codex. 401 mid-flight triggers one forced refresh.
config.json (ADR §5.5): env-over-config precedence for model/max_tokens. Missing file is fine; malformed file logs a warning and falls back.
ModelRef::parse: provider-qualified anthropic/claude-sonnet-4-6 form; only known provider prefixes split (HuggingFace org/model ids preserved).
Tool-name normalisation: openab-agent's read/write/bash ↔ Claude Code's Read/Write/Bash for the claude-code-20250219 beta header path.

Findings

#	Severity	Finding	Location
1	🟢	Clean vendor abstraction — the `OAuthVendor` trait + descriptor pattern makes adding vendors trivial and the wire-format pinning tests are excellent insurance	`auth.rs`
2	🟢	Fail-loud on missing model eliminates the recurring 404 timebomb (per brettchien's architectural feedback)	`llm.rs:anthropic_model()`
3	🟢	F3 guard in `select_provider` distinguishes "no credential" (fall to Codex) from "credential present but unusable" (fail loud) — prevents subtle silent fallback bugs	`llm.rs:select_provider()`
4	🟢	Comprehensive test coverage: 4 new wire-contract tests, namespaced storage disjointness, per-tenant refresh lock isolation, model precedence chain, OAuth identity + tool casing	across `auth.rs`, `llm.rs`, `config.rs`, `acp.rs`
5	🟢	ACP drain fix (bounded `tokio::time::timeout` on the stdout drain task) prevents lost final responses — correct shutdown ordering	`acp.rs:268`
6	🟢	Per-tenant refresh lock isolation proven by test (`lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant`) — Anthropic lock never head-of-line-blocks Codex	`auth.rs` tests

Baseline Check

PR opened: 2026-06-24
Main already has: API-key-only AnthropicProvider::from_env(), single-tenant Codex OAuth, no config.rs, no vendor abstraction, hardcoded claude-sonnet-4-20250514 default
Net-new value: Full Anthropic PKCE OAuth login, OAuthVendor trait abstraction, CLAUDE_CODE_OAUTH_TOKEN env route, centralized config.json, fail-loud model resolution, ModelRef parser, tool-name normalisation, ACP drain fix, 4+ new test families

What's Good (🟢)

The PR description is exceptionally well-researched (OpenClaw, Hermes, Pi prior art comparison)
Every previous review finding (CI workspace, state/verifier separation, hardcoded default removal, locking integration) has been addressed in this final iteration
The OAuthVendor surface is designed for extension (gemini/grok/agy vendors) without touching the shared driver
unsafe env manipulation in tests is properly scoped with ENV_LOCK guards
The is_multiple_of nits in format.rs and wecom.rs are bonus Rust modernisation (nightly → stable API usage)

CI: check (cargo check/clippy/test) ✅ | all individual smoke-tests ✅ | unified smoke-tests pending (builder just completed, these will land shortly)

canyugs requested a review from thepagent as a code owner June 24, 2026 09:48

Copilot AI review requested due to automatic review settings June 24, 2026 09:48

openab-app Bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 24, 2026

Copilot started reviewing on behalf of canyugs June 24, 2026 09:49 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

openab-app Bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 24, 2026