Skip to content

feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent#1187

Open
canyugs wants to merge 21 commits into
openabdev:mainfrom
canyugs:feat/native-agent-anthropic-oauth
Open

feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent#1187
canyugs wants to merge 21 commits into
openabdev:mainfrom
canyugs:feat/native-agent-anthropic-oauth

Conversation

@canyugs

@canyugs canyugs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What problem does this solve?

openab-agent can only reach Anthropic via ANTHROPIC_API_KEY (pay-per-token). Codex already supports subscription login, but Claude Pro/Max subscribers cannot use their subscription with the native agent. This adds native Anthropic OAuth (Claude Pro/Max) so users run openab-agent on the Claude subscription they already pay for — no API key — matching the existing Codex experience.

Closes #1186

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1519271476002291752

At a Glance

  openab-agent auth anthropic-oauth          ~/.openab/agent/auth.json
        │ PKCE + browser/paste                ┌────────────────────────────┐
        ▼                                      │ "codex"          (Token)   │
  claude.ai/oauth/authorize ──code──► token   │ "anthropic-oauth"(Token) ◄─┼─ new tenant
  platform.claude.com/v1/oauth/token          │ "<mcp server>"   (Mcp)     │
        │ JSON exchange + scope-less refresh   └────────────────────────────┘
        ▼
  AnthropicProvider (OAuth mode)
    Authorization: Bearer <sk-ant-oat…>
    anthropic-beta: claude-code-20250219,oauth-2025-04-20   x-app: cli
    system[0] = "You are Claude Code, …"   tools: read→Read …
        │
        ▼   api.anthropic.com/v1/messages
  provider select: ANTHROPIC_API_KEY → anthropic-oauth → codex

Prior Art & Industry Research

I looked at how two comparable self-hosted agents authenticate to Anthropic and Codex. The headline finding: neither implements a native Anthropic (Claude Pro/Max) OAuth login — both avoid the PKCE flow and instead lean on a setup-token or on reusing Claude Code's local credentials. This PR (following Pi) does the full native PKCE login, which is strictly more self-contained for pod deployments. Their surrounding architecture, however, validates the storage/refresh choices here.

OpenClaw — supports API keys and subscription OAuth.

  • Anthropic: no native OAuth login — either a user-supplied setup-token stored in an auth profile, or reuse of the local Claude CLI (claude -p).
  • Codex: full PKCEauth.openai.com/oauth/authorize → callback http://127.0.0.1:1455/auth/callback (or manual paste) → token exchange → accountId extracted from the access token. This is byte-for-byte the same flow openab-agent already uses for Codex, corroborating our approach as the de-facto standard.
  • Storage: ~/.openclaw/agents/<id>/agent/auth-profiles.json, one {access, refresh, expires, accountId} tuple per profile.
  • Refresh: treats the profile file as a "token sink", refreshes under a file lock, and is careful not to spend copied refresh tokens — the same rotation hazard we handle by persisting the rotated refresh token on every refresh.
  • Refs: https://docs.openclaw.ai/concepts/oauth , https://docs.openclaw.ai/gateway/authentication

Hermes AgentPROVIDER_REGISTRY dataclasses in hermes_cli/auth.py declare each provider's auth type + base URLs + env vars; resolve_runtime_provider() is the single resolution entry point.

Primary source ported: Pi (earendil-works/pi) — packages/ai/src/utils/oauth/anthropic.ts (PKCE flow, endpoints, scopes; verifier doubles as state) and packages/ai/src/api/anthropic-messages.ts (OAuth headers, Claude Code system block, tool-name normalisation). The OAuth client is Claude Code's public client.

How this PR compares: like both systems, openab-agent keeps a single namespaced credential file (~/.openab/agent/auth.json) with atomic writes + per-refresh rotation handling, and an existing Codex tenant identical to OpenClaw's Codex flow. Unlike both, it adds a native Anthropic PKCE login so subscribers need neither a setup-token nor a local Claude Code install.

Proposed Solution

Add an anthropic-oauth tenant alongside the existing codex tenant in ~/.openab/agent/auth.json:

  • auth.rslogin_anthropic_browser_flow() (PKCE; verifier doubles as state per Claude's flow); namespaced token store (load/save/get_valid_token/force_refresh _for(provider)); per-provider refresh encoding (Anthropic = JSON, no scope; Codex = form); shared loopback-callback helpers; show_status lists all tenants.
  • llm.rsAnthropicProvider gains AnthropicAuth { ApiKey | OAuth }. OAuth mode sends Bearer + Claude Code identity headers (anthropic-beta: claude-code-20250219,oauth-2025-04-20, x-app: cli), prepends the required "You are Claude Code…" system block, normalises built-in tool names to Claude Code casing (read↔Read), and refreshes once on a mid-flight 401. select_provider gains anthropic-oauth; anthropic/auto fall back API-key → OAuth.
  • acp.rs — session/model selection via AnthropicProvider::auto*() (covers both auth modes); model catalog shows Anthropic models when an API key or OAuth token is present.
  • main.rsopenab-agent auth anthropic-oauth [--no-browser].

Also bumps the stale default model claude-sonnet-4-20250514claude-opus-4-8 (the old dated snapshot returns 404 on the subscription endpoint).

Why this approach?

  • Consistency: reuses the existing Codex OAuth tenant pattern in the same auth.json, so the two subscription logins coexist without new storage mechanisms.
  • Proven flow: endpoints/headers/system block are ported verbatim from Pi (a shipping Anthropic-OAuth client), de-risking the Claude-Code-beta specifics.
  • Refresh safety: each refresh persists the rotated refresh token (OpenClaw/Hermes both flag this as a sharp edge).

Tradeoffs / limitations: depends on Claude Code's public OAuth client and the claude-code-20250219,oauth-2025-04-20 beta headers — if Anthropic changes these, the OAuth path needs updating (API-key path is unaffected). The claude-opus-4-8 default now also applies to API-key mode (Opus is pricier per-token; overridable via OPENAB_AGENT_MODEL). The legacy Dockerfile.native is unrelated and intentionally out of scope (canonical Dockerfile.unified builds the native variant correctly).

Alternatives Considered

  • API key only (status quo): excludes every Pro/Max subscriber — the whole point.
  • Reuse Claude Code's local credential store (what Hermes does via ~/.claude/.credentials.json, and OpenClaw via claude -p): rejected — openab-agent runs in a pod with no Claude Code install and no ~/.claude. Owning an anthropic-oauth tenant in our own auth.json keeps it self-contained and matches the Codex tenant already present.
  • Setup-token paste (OpenClaw's other Anthropic path): rejected — a full PKCE login is a better UX (no manual token minting) and reuses the exact callback/paste plumbing the Codex flow already has.
  • Separate credential file per provider: rejected — the namespaced single auth.json already serves Codex + MCP; a new file would fragment storage and duplicate the atomic-write/rotation logic.

Validation

Rust:

  • cargo check / cargo build pass (0 warnings)
  • cargo test passes — 194 passed, incl. 4 new (authorize-URL, namespaced storage disjointness, OAuth request-body identity+tool-casing, name round-trip)
  • cargo clippy — no new warnings from this change (6 pre-existing in test-module ENV_LOCK-across-await + mcp/runtime.rs; the OAuth code is clippy-clean)

Manual (real Claude Pro/Max account):

  • auth anthropic-oauth --no-browser login → token stored under anthropic-oauth, auth status shows valid
  • Live chat on claude-opus-4-8 (default) / claude-sonnet-4-6 / 4-5 → correct responses
  • Real tool call — bash executes in-sandbox (echo …$((6*7))42), confirming tool-name normalisation round-trips
  • Token refresh — forced expiry → live JSON refresh succeeds, access+refresh tokens rotate, request still completes
  • Canonical image — built Dockerfile.unified --target native; ran the in-image agent end-to-end (chat + tool call) via the stored OAuth token

@canyugs canyugs requested a review from thepagent as a code owner June 24, 2026 09:48
Copilot AI review requested due to automatic review settings June 24, 2026 09:48
@openab-app openab-app Bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 24, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class Anthropic (Claude Pro/Max) OAuth login support to openab-agent, storing credentials as a new anthropic-oauth tenant alongside the existing codex tenant in ~/.openab/agent/auth.json. This extends provider auto-detection and request shaping so Claude subscription users can run the native agent without an ANTHROPIC_API_KEY.

Changes:

  • Introduces openab-agent auth anthropic-oauth [--no-browser] PKCE login flow and namespaced token load/save/refresh helpers.
  • Extends AnthropicProvider to support OAuth auth mode (Bearer + Claude Code identity headers/system block + tool-name casing normalization).
  • Updates ACP provider/model selection and available-model listing to recognize Anthropic OAuth credentials; bumps default Anthropic model to claude-opus-4-8.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
openab-agent/src/main.rs Adds the auth anthropic-oauth CLI subcommand wiring.
openab-agent/src/auth.rs Implements Anthropic PKCE/OAuth flow and namespaced token storage/refresh.
openab-agent/src/llm.rs Adds Anthropic OAuth request behavior (headers/system/tool name normalization) and provider selection updates.
openab-agent/src/acp.rs Uses Anthropic auto* selection (API key or OAuth), updates default model and model availability gating.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread openab-agent/src/auth.rs
Comment on lines +249 to 264
/// Load the LLM token stored under `namespace` (`codex` / `anthropic-oauth`).
pub fn load_tokens_for(namespace: &str) -> Result<TokenStore> {
let path = auth_path();
let map = read_auth_file(&path).map_err(|_| {
anyhow!(
"No credentials found at {}. Run `openab-agent auth codex-oauth` first.",
"No credentials found at {}. Run `openab-agent auth` first.",
path.display()
)
})?;
match map.get(CODEX_NAMESPACE) {
match map.get(namespace) {
Some(AuthEntry::Token(t)) => Ok(t.clone()),
_ => Err(anyhow!(
"No codex credentials in {}. Run `openab-agent auth codex-oauth` first.",
"No {namespace} credentials in {}. Run `openab-agent auth` first.",
path.display()
)),
}
Comment thread openab-agent/src/auth.rs Outdated
Comment on lines +656 to +658
/// Block on the loopback listener for the OAuth redirect, reply 200, return the
/// authorization code. ponytail: the Codex flow above predates this helper and
/// still inlines the same logic; fold it in if that path is ever touched again.
Comment thread openab-agent/src/llm.rs Outdated
Comment on lines 104 to 111
_ => match AnthropicProvider::auto() {
Ok(p) => Ok(Box::new(p)),
Err(_) => match OpenAiProvider::from_auth_store() {
Ok(p) => Ok(Box::new(p)),
Err(e) => Err(format!(
"No credentials: set ANTHROPIC_API_KEY or run `openab-agent auth codex-oauth`. {e}"
"No credentials: set ANTHROPIC_API_KEY, or run `openab-agent auth anthropic-oauth` / `auth codex-oauth`. {e}"
)),
},
Comment thread openab-agent/src/acp.rs
Comment on lines 347 to 351
return self.error_response(
id,
-32000,
&format!("No credentials: set ANTHROPIC_API_KEY or run `openab-agent auth codex-oauth`. {e}"),
&format!("No credentials: set ANTHROPIC_API_KEY, or run `openab-agent auth anthropic-oauth` / `auth codex-oauth`. {e}"),
)
@openab-app openab-app Bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 24, 2026
@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@canyugs

canyugs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review — addressed in 73bf9a2.

F1 (🔴 CI workspace) — fixed, with a root-cause correction. The "rebase onto main" suggestion doesn't apply: this branch is already based on current main (d1b0cfb), and the breakage is present-but-dormant there. ci-openab-agent.yml only runs on paths: openab-agent/**; its last run was 2026-06-17 (#1133) — before the workspace restructure (#1146/#1170) made the repo root a workspace without excluding openab-agent. Nothing has touched openab-agent/ on main since, so the check never re-ran — this PR is the first openab-agent/ change to expose it. Fix: exclude = ["openab-agent"] in the root Cargo.toml (the cargo error's own suggestion), plus cargo fmt. Locally the full CI sequence now passes: cargo fmt --check, cargo clippy -- -D warnings, cargo test (194), cargo test -- --ignored (11). Note: committing [workspace] into openab-agent/Cargo.toml instead would have collided with Dockerfile.unified's build-time [workspace] injection (duplicate-key error), so the root exclude is the non-conflicting fix.

F2 (🟡 PKCE state) — fixed + verified live. Now uses an independent 32-byte random state; the verifier stays back-channel-only. Worth recording: claude.ai's authorize rejects a short independent state with Invalid request format, so the state is sized to match the verifier (32 bytes) — value still independent. Verified end-to-end against a real Pro/Max account (browser login → token exchange with state ≠ verifier → live chat).

F3 (🟡 error UX) — fixed. Credential errors now name fully-qualified subcommands (openab-agent auth anthropic-oauth / openab-agent auth codex-oauth) and preserve the underlying read/parse error.

F4 (🟡 comment tag) — fixed. ponytail:Note:.

Also confirmed the canonical native image still builds: Dockerfile.unified --target native (the path build-images.yml uses) builds clean and runs the OAuth flow end-to-end in-container. The legacy Dockerfile.native is a separate pre-existing issue (missing the [workspace] injection that Dockerfile.unified does) and is intentionally out of scope here.

@canyugs canyugs closed this Jun 24, 2026
@canyugs canyugs reopened this Jun 24, 2026
@canyugs canyugs closed this Jun 24, 2026
@canyugs canyugs reopened this Jun 24, 2026
@canyugs

canyugs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: the CI openab-agent check went green only after one more fix (4ef7c49).

With the workspace exclude in place, fmt/clippy/test ran and passed, but the ACP initialize smoke test then failed (empty response). Root cause: a latent flush-on-shutdown race — the dispatch loop fed responses to a detached stdout-drain task, and on stdin-EOF #[tokio::main] aborted the drain before it flushed the last line. main wins this race by timing; this branch's slightly different startup timing lost it ~85% of the time locally (origin/main binary: 20/20 pass; this branch pre-fix: 3/20). Fix captures the drain handle and bounded-awaits it after the loop so queued output is flushed before return — 20/20 after. All CI openab-agent steps now pass.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@brettchien

Copy link
Copy Markdown
Contributor

Supplementary architectural review (forward-looking — this PR is already LGTM'd and the line-level items
are well covered by the automated review + the author's fixes; CI is green, state is now independent, the
workspace exclude is in). None of the below blocks merging — they're direction/follow-up points for the
planned OAuth revamp + a forthcoming ADR, worth anchoring on record before this lands as the first vendor.

Findings

# Severity Finding Location
1 🟡 auth.json read-modify-write is unlocked across processes — concurrent refresh can trip OAuth 2.1 §10.4 token-family revocation. Not introduced here; follow-up, not blocking. auth.rs storage layer
2 🟢 Add CLAUDE_CODE_OAUTH_TOKEN env support — unblocks headless/pod/CI now, and is race-free llm.rs / auth.rs
3 🟢 Keep per-vendor specifics in a descriptor (sets up the shared multi-vendor adapter as a refactor, not a rewrite) auth.rs / llm.rs
4 🟡 Hardcoded default claude-opus-4-8 is a recurring-staleness/404 risk (no evergreen alias in 4.6+) — recommend dropping the hardcoded default and requiring the model via config/env (fail-loud) llm.rs:152, acp.rs:385/446
5 🟢 Stale doc comment still says "PKCE with the verifier doubling as state" though the code now uses an independent state auth.rs:692
6 🟢 getrandom::fill(...).expect(...) in a Result-returning fn → prefer ? auth.rs:461/472
Detail — F1: cross-process auth.json race (follow-up)

auth.json is multi-tenant (codex / anthropic-oauth / mcp:*) and every writer does
read_auth_file → mutate → write_auth_file with no lock. openab-agent runs one process per session
thread
(SessionPoolconnection.rs spawns a child per thread), so concurrent threads = concurrent
processes doing that RMW. When a token expires, multiple processes refresh in the same window; with
refresh-token rotation the later ones present an already-rotated token, which under OAuth 2.1 §10.4 reuse
detection can revoke the whole token family → every session logged out at once.

This is not introduced by this PR (Codex already shares the same store), but OAuth widens the exposure,
and API-key users never hit it (no refresh writes). Not merge-blocking for this PR. The fix — a
flock-guarded RMW funnel that all writers (including the MCP CredentialStore) share, with the network
refresh done outside the lock + a re-read inside for single-flight — is planned for the revamp PR. Until
then, F2 (env token) gives a race-free path for fleet/pod use.

Detail — F4: default model staleness

This PR replaces the old default claude-sonnet-4-20250514 precisely because it 404s on the subscription
endpoint
(retired there). The replacement claude-opus-4-8 will eventually hit the same fate when it's
retired from the Claude Code surface — i.e. this is a recurring 404 timebomb, not a one-off. And there's no
escape via a floating alias: in the 4.6+ generation, dateless IDs (claude-opus-4-8, claude-sonnet-4-6)
are fixed canonical IDs, not evergreen pointers
— Anthropic ships new versions under new IDs and never
re-points an old one. So the default must be bumped by hand each generation.

Recommendation — no hardcoded model default; require it via config/env and fail loud. Since Messages V1
mandates a model and there is no evergreen alias to fall back to, any in-code default is a 404 timebomb.
Resolve model as: ACP/CLI model_overrideOPENAB_AGENT_MODELerror ("no model configured; set
OPENAB_AGENT_MODEL or select one") instead of the hardcoded claude-opus-4-8. This also removes the
silent Opus cost bump for API-key users.

Note this is a behavior change: today's zero-config default goes away, so a model must be set
(deployments via values.yaml/env already do; zero-config npx/local users will need to set
OPENAB_AGENT_MODEL). Worth a clear error message + CHANGELOG line. Drops the three duplicated default
sites (llm.rs:152, acp.rs:385/446).

Detail — F2: support CLAUDE_CODE_OAUTH_TOKEN

claude setup-token mints a 1-year subscription OAuth token. Reading it as a Bearer source — precedence
ANTHROPIC_API_KEYCLAUDE_CODE_OAUTH_TOKEN → stored anthropic-oauth tenant (same shape as pi,
earendil-works/pi#3591) — is only a few lines and:

  • makes Claude subscription auth usable in headless / pod / CI today (no browser, no loopback, no stdin
    paste), which the interactive PKCE flow can't cover in a container; and
  • is race-free (env token, no auth.json refresh writes) — see F1.

For ops-managed deployments this is arguably the primary path; interactive PKCE is for local self-service.

Detail — F3: per-vendor descriptor

client_id / client_secret / endpoints / scope / token-body-format are the only things that vary between
Codex, Anthropic, and upcoming vendors (Gemini, agy, grok…). Holding them in a small per-vendor struct now
(rather than inlined in the Anthropic flow) makes the planned shared OAuth adapter a clean extraction
instead of a rewrite. No need to build the adapter in this PR — just don't entrench Claude-only assumptions.

Direction / roadmap (tracked in a forthcoming ADR)

A short ADR is being drafted for multi-vendor LLM-provider OAuth + credential storage; this PR is the first
vendor under it. Explicitly out of scope here, listed so the follow-ups are on record:

@brettchien

Copy link
Copy Markdown
Contributor

Follow-up on F4 — one thing worth doing in this PR before it merges: please don't pin claude-opus-4-8 as a hardcoded default.

This PR exists because the previous hardcoded default (claude-sonnet-4-20250514) 404'd on the subscription endpoint after retirement — and claude-opus-4-8 will eventually hit the same fate. There's no escape via an alias: in the 4.6+ generation, dateless IDs are fixed canonical IDs, not evergreen pointers, so a hardcoded default is a recurring 404 timebomb that has to be chased by hand every generation.

Since Messages V1 mandates a model, the cleanest fix is to require it and fail loud rather than bake in a new pin:

  • resolve model as ACP/CLI model_overrideOPENAB_AGENT_MODELerror ("no model configured; set OPENAB_AGENT_MODEL or select one")
  • drop the three hardcoded default sites: llm.rs:153, acp.rs:385, acp.rs:446
  • (the ModelOption catalog entry at acp.rs:509 is fine — that's offering the model, not defaulting to it)

It's a small change, and it also removes the silent Opus cost bump for API-key users. Behavior note: this drops the zero-config default, so a model must be set — deployments via values.yaml/env already do; worth a clear error message + CHANGELOG line for zero-config/local users.

brettchien added a commit to brettchien/openab that referenced this pull request Jun 24, 2026
Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis
OAuthVendor adapter (auth flow vs inference transport), a cross-process
flock-guarded credential-store invariant for auth.json, the
CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix,
and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing
PR openabdev#1187 (first OAuth vendor).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@canyugs

canyugs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@brettchien Great call — this is exactly the right framing, thank you. The dateless 4.6+ IDs being fixed canonical IDs (not evergreen pointers) makes any hardcoded default a recurring 404 timebomb, and pinning Opus also quietly raised costs for API-key users. Implemented your fail-loud approach in b28c312:

  • Model resolution is now: explicit ACP/CLI override → OPENAB_AGENT_MODELerror (no model configured; set OPENAB_AGENT_MODEL or select a model). No baked-in default.
  • Dropped the three hardcoded sites (llm.rs anthropic_model_from_env, acp.rs session new/load). anthropic_model() is now fallible; session new/load report the provider's actually-resolved model instead of a fallback.
  • Kept the claude-opus-4-8 entry in the model catalog (acp.rs) — offering, not defaulting, as you noted.
  • Two implementation details worth flagging: a model override (CLI/ACP) does not require OPENAB_AGENT_MODEL, and credentials are still checked before the model so a missing key/token surfaces its own error rather than the generic "no model configured". auto() only falls through to OAuth when no API key is present.
  • Behavior note + CHANGELOG-equivalent: documented in docs/native-agent.md that OPENAB_AGENT_MODEL is required for Anthropic — zero-config now fails loud with a clear message; env/values.yaml deployments are unaffected.

Tests: cargo fmt/clippy -D warnings clean, cargo test 196 + 11 ignored. Added test_session_new_requires_model to lock in the fail-loud contract. This also resolves the earlier #5 (silent Opus cost bump for API-key users).

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@canyugs

canyugs commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

All review findings addressed across iterations — summary of what was fixed:

  • F1 (CI red): Root-caused to workspace resolution; added exclude = ["openab-agent"] to root Cargo.toml + fixed ACP drain-shutdown race (4ef7c49). CI green since.
  • F2 (PKCE state = verifier): Now uses independent 32-byte random state; verifier stays back-channel only (73bf9a2).
  • F3 (error messages): All credential-error messages now spell out full commands (openab-agent auth anthropic-oauth / openab-agent auth codex-oauth).
  • F4 (hardcoded default model): Removed per @brettchien's architectural feedback — model resolution is now override → OPENAB_AGENT_MODEL → error, no baked-in default (b28c312).
  • Bonus: ModelRef parser for provider/model_id format, doc updates, ponytail: tag cleaned up.

Final bot review is LGTM ✅, all 33 CI checks green. Ready for maintainer review.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

thepagent added a commit that referenced this pull request Jun 27, 2026
…g (codex + MCP) (#1190)

* docs(adr): openab-agent multi-vendor OAuth & credential storage

Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis
OAuthVendor adapter (auth flow vs inference transport), a cross-process
flock-guarded credential-store invariant for auth.json, the
CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix,
and the /auth (PR #1185) auth-trigger model. Surfaced while reviewing
PR #1187 (first OAuth vendor).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): address review — prefer oauth2 crate, drop rollout, vendor names

- Build the OAuthVendor driver on the official `oauth2` crate (already in-tree
  via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the
  Anthropic JSON-token-body quirk is applied via the crate's custom http-client
  hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative).
- Remove the project Rollout-plan section (internal sequencing, not ADR
  material); keep the race-window mitigation in §5.4.
- Use vendor names only; drop internal fleet-agent references from the matrix.
- Replace unexplained "ToS-gray" with "ToS-risk" + a definition.
- Fix cross-references (crate-qualified paths, §-refs, line numbers) and move
  the settled model-default decision under "Decisions & open questions".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): per-tenant refresh lock + per-user PKCE keying (Mira review r1)

Fold in two release-blocker-class concurrency fixes from Mira's review:

- §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on
  commit" claim was wrong — N processes each send a refresh with the same
  RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family
  revocation. Replace with a per-tenant exclusive lock: network refresh held
  under the tenant lock only (not the global lock), so exactly one refresh per
  tenant per expiry with no head-of-line blocking across tenants. Extends to
  mcp:<server> tenants.
- §7 /auth: key the pending PKCE verifier+state by the initiating Discord user
  id instead of a single global entry — prevents concurrent-user overwrite
  (PKCE mismatch) and session hijack.
- §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): bundled-secret storage + pending-entry GC (Mira review r2)

- §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX-
  secret. The value is non-confidential by RFC 8252 / Google docs, but
  GitHub now push-protects Google secrets by default (changelog 2026-03)
  and partner-scans them for auto-revoke, so a raw literal is not safe in a
  public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT
  a security control) or env-inject at runtime — not raw text.
- §7: pending PKCE entries get created_at + a 15-min GC sweep in
  with_auth_locked so abandoned /auth attempts don't accumulate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): close §9 Q2/Q3 — env-injection default, vendor go/no-go (Brett decisions r3)

- §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled
  client_secret; encode-at-rest (a) is the fallback for bundled zero-config
  binaries only, framed as scanner-evasion (not security). Cite rclone
  rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical
  precedent (§10).
- §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in,
  ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not
  secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6.
- §10: add rclone + GitHub/Google secret-scanning references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): confirm agy secret + ecosystem evidence (GitHub survey r4)

GitHub survey 2026-06-24 grounds the agy GO decision:
- §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX-
  secret, a public constant >=20 antigravity-auth repos hardcode verbatim
  (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...).
  Literal deliberately NOT pasted into the doc — would trip the very §9 Q2
  push-protection, so we dogfood the env/encode decision. Redirect confirmed
  localhost:51121/oauth-callback.
- §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/
  openclaw plugins + proxies), proving the integration, while the same
  ecosystem's anti-ban / quota-locking / multi-account-rotation tooling
  empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate.
- §10: add the antigravity OAuth ecosystem reference set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): agy-vendor vs agy-CLI/ACP clarification + MCP store revamp in scope (r5)

Grounded by a codebase survey (2026-06-24):

- §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist
  inference provider — it does NOT run the agy CLI. agy speaks no ACP; the
  existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated
  agy-acp adapter that shells out to the agy binary per prompt and polls its
  SQLite DB. The provider path sidesteps ACP entirely and supersedes the
  CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO.
- §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json
  has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and
  McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked
  RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also
  correct the stale save_tokens_for name and note with_auth_locked is new.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): fix stale §9 Q4 cross-reference

The §9 Q4 tail referenced 'openab-agent-mcp.md open items #1 (reqwest
0.12/0.13 split) and #8 (doctor/runtime two-store split)' — but that ADR's
§10 Open Questions has only two items (mcp.json location; native-vs-broker
parity), neither matching, and no such numbered items / terms exist anywhere
in it. The phantom reference dated to the original draft. Replace with an
accurate statement: McpCredentialStore reuses the same TokenStore/auth.json
storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both;
the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on
the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(adr): §9 Q2 — encode-at-rest is the default, env is the alternative (Brett)

Brett's call given the 93%-plaintext ecosystem survey: keep the bundled
zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style)
as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses
the prior (b)-default ordering. Framing unchanged: encode-at-rest is
scanner-evasion for a non-confidential value, not a security control. Record
the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that
the real risk mitigated is org-repo push-protection friction, not credential
loss.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(auth): cross-process locking for auth.json — codex + MCP (ADR §5.4)

Implements the ADR §5.4 invariant. auth.json had NO lock: provider
save_tokens and McpCredentialStore::save/clear each did an independent
unlocked read-modify-write, so a concurrent codex refresh and MCP save
last-writer-wins the whole map, and N processes (one openab-agent per Discord
thread) could each refresh the codex token with the same RT_old → OAuth 2.1
§10.4 token-family revocation (fleet-wide logout).

Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix)
with a non-unix no-op:

- with_auth_locked: global exclusive lock across the re-read -> mutate ->
  atomic-write. ALL writers funnel through it — save_tokens (codex) and
  McpCredentialStore::save/clear (MCP) — so writers merge onto the latest
  on-disk state instead of lost-updating. Held only for the fast file RMW,
  never across network I/O.
- lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token /
  force_refresh take the codex tenant lock (non-blocking try + async backoff +
  10s timeout, held across the network refresh), with a double-checked re-read
  so a process that waited adopts the token another already refreshed → exactly
  one real refresh per tenant per expiry, no RT_old reuse.

Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was
optimistic). New test asserts the locked RMW merges codex + MCP tenants without
lost-update. fmt + clippy -D warnings + test (191 passed) green.

Known gap (for review): the MCP *network* refresh is owned by rmcp's
AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP
writes get the file-integrity lock but MCP refreshes are not yet tenant-
serialised. Closing that needs an rmcp-level hook — proposed as follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(auth): address Mira code review — pending GC + portable WouldBlock

Round-2 code review (Mira):
- §7 pending-entry GC: add created_at to PendingPasteLogin and sweep
  AuthEntry::Pending older than 15 min inside with_auth_locked on every write,
  so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is
  currently a legacy tombstone with no live writer (PKCE state lives in rmcp's
  in-memory StateStore); the field + GC land now per ADR §7 and are
  forward-compatible with the forthcoming /auth two-step flow, and meanwhile
  sweep legacy stray entries (created_at default 0 reads as ancient). New test
  covers stale-swept / fresh-kept / real-tenant-untouched.
- flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single
  raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD.
- MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) —
  Mira concurs.

fmt + clippy -D warnings + test (192 passed) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(auth): serialise MCP token refresh cross-process (close §5.4 (b) gap)

Closes the known gap from f370110: MCP refreshes now get the same per-tenant
serialisation as codex.

rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP
refresh explicitly in McpRuntimeManager::resolve_oauth_dial via
client.get_access_token(). Wrap that call (per-server) with the existing
auth::lock_tenant_refresh so only one process refreshes a given server at a
time. No explicit double-check needed: rmcp's get_access_token re-load()s
auth.json each call and returns the cached token without a network refresh when
remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the
race adopts the token the winner wrote to the shared file — no second RT_old
presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights
within one process via its AuthorizationManager Mutex; this closes the
cross-process gap.

- auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module.
- ADR §5.4: document the resolve_oauth_dial serialisation point.

fmt + clippy -D warnings + test (192 passed) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(auth): /simplify cleanups — dedup lock acquire, reuse refresh fd

Quality-only cleanups from a 4-angle /simplify pass (no behaviour change):
- Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both
  with_auth_locked and McpCredentialStore::clear through it, so the "global"
  sidecar name + the cfg(unix) acquire live in one place instead of two
  copy-pasted blocks. clear flattens (drop the inner run closure).
- lock_tenant_refresh opens the lock fd once and re-issues flock on it each
  retry, instead of re-opening (and re-create_dir_all-ing) the file every
  100ms under contention. Removes the now-unused flock_try_exclusive helper.
- Document lock_tenant_refresh's double-check contract (the lock only
  serialises; callers must re-check freshness after acquiring).

fmt + clippy -D warnings + test (192 passed) green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(auth): /code-review — harden GC + refresh-lock contract comments

Doc-only clarifications from a /code-review pass (no behaviour change):
- PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending
  entry MUST stamp created_at, else gc_stale_pending sweeps it on the next
  locked write (latent footgun for the forthcoming /auth two-step writer).
- lock_tenant_refresh: correct the contract — reuse-safety comes from loading
  the refresh token INSIDE the lock (so force_refresh, which always refreshes
  on a 401, is reuse-safe too); the post-lock expiry re-check is only an
  optimisation to skip a redundant refresh.

Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP
server literally named "codex" collides with the codex tenant's auth.json key
and refresh lock — committed MCP creds use the bare server name, not
mcp:<server> as ADR §5.4 describes) — flagged for a separate change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* build(openab-agent): declare standalone [workspace] to fix CI

ci-openab-agent.yml runs cargo fmt/clippy/test/build with
working-directory: openab-agent, but the crate is not a member of the parent
openab workspace (members = crates/openab-core, openab-gateway) and was not
excluded, so cargo errors 'current package believes it's in a workspace when
it's not' and every step fails before doing any work. openab-agent is
intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so
the correct fix is an empty [workspace] table making it its own root.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Revert "build(openab-agent): declare standalone [workspace] to fix CI"

This reverts commit b40693c.

* ci(openab-agent): append [workspace] in CI + harden ACP smoke test

ci-openab-agent.yml ran cargo from working-directory: openab-agent without the
[workspace] table the crate needs (it's standalone, not a parent-workspace
member), so every step failed with 'believes it's in a workspace when it's
not' — the workflow had been red independently of any PR. Dockerfile.unified
already works around this by appending the table at build time; replicate that
in the workflow rather than committing it to Cargo.toml (which would
double-append in the Dockerfile and break the image build).

Also harden the ACP smoke test: build the release binary in its own step and
bump the response timeout 5s -> 30s so a loaded runner doesn't flake the
agent's first-response window (the binary itself responds in <1s locally,
verified incl. a clean-HOME release build).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(openab-agent): surface ACP smoke-test stderr + exit code for diagnosis

The smoke test produces empty stdout only in CI (the binary responds with
agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs).
Capture stderr + exit code so the next CI run reveals why stdout is empty.

* openab-agent: address review findings on OAuth ADR + auth.json locking

Resolves the Changes-Requested review on the multi-vendor OAuth ADR +
cross-process auth.json locking PR. No behavioural change beyond log level
and a non-unix diagnostic.

ADR (docs/adr/openab-agent-oauth.md)
- Fix lock-file names to match code: auth.json.global.lock and
  auth.json.refresh.<tenant>.lock (were auth.json.lock / auth.json.<tenant>.lock).
- Add the path parameter to the with_auth_locked pseudocode signature.
- Correct the MCP refresh note: initialize_from_store() does the disk reload,
  not get_access_token.
- Correct the crate note: libc::flock directly (rustix is not in-tree).
- Document the deliberate fail-open trade-off on tenant-lock timeout.
- Mark the default-model removal (Decision 1) as a follow-up; this PR ships
  the ADR + locking only.

Code
- auth.rs: escalate the tenant-lock timeout log from warn! to error! and
  document the fail-open trade-off on lock_tenant_refresh.
- auth.rs: non-unix lock_global no-op now warns once instead of silently
  providing zero cross-process protection.
- mcp/runtime.rs: fix the rmcp reload comment and document the cross-module
  invariant that the refresh lock and credential entry share the server name.

CI (ci-openab-agent.yml)
- ACP smoke test now fails on a non-zero/timeout exit code, and the agentInfo
  assertions use { } so exit fails the step rather than just a subshell.

Gate: cargo fmt --check, clippy -D warnings, test (192 passed) all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* openab-agent: fail closed on a contended refresh lock (F6)

Replaces the fail-open tenant-lock timeout with a bounded-refresh + fail-closed
design, so a contended lock can no longer reintroduce the double-refresh it
exists to prevent.

- Bound the refresh network call with an explicit 8s HTTP timeout on both the
  codex client (auth.rs) and the MCP AuthClient (mcp/runtime.rs), strictly
  shorter than the 10s lock-acquire deadline. With flock(2) auto-release on
  death, a live holder always frees the tenant lock before a waiter's deadline,
  so a lock-acquire timeout is genuinely abnormal.

- lock_tenant_refresh now returns RefreshLock { Held, Unavailable, TimedOut }.
  On TimedOut callers fail closed instead of refreshing unserialised:
  - codex get_valid_token / force_refresh return a retryable error;
  - MCP resolve_oauth_dial returns a new OauthDialError::Transient that leaves
    the server retryable WITHOUT forcing re-login (NeedsAuth) or tripping the
    circuit breaker (auth-level failures still map to NeedsAuth as before).
  A filesystem error opening the sidecar returns Unavailable and degrades to a
  best-effort unserialised refresh rather than blocking every refresh.

- Add a fail-closed timeout test (injectable deadline) and update ADR §5.4 to
  document the bounded-refresh + fail-closed invariant, superseding the earlier
  fail-open note.

Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* openab-agent: size lock timeout above worst-case multi-call refresh hold

Mira review (PR #1190) found the fail-closed invariant held only for the codex
path. The MCP path holds the tenant lock across TWO sequential bounded calls —
rmcp's initialize_from_store() (AS discovery) then get_access_token() (refresh)
— so the worst-case lock-hold is ~2 x REFRESH_HTTP_TIMEOUT (16s), which could
exceed the fixed 10s lock deadline and fail a waiter closed while the holder is
still legitimately progressing.

Derive REFRESH_LOCK_TIMEOUT from the bound instead of hardcoding it:
  REFRESH_LOCK_TIMEOUT = MAX_REFRESH_ROUND_TRIPS (2) * REFRESH_HTTP_TIMEOUT + 4s
                       = 20s
so the deadline is always above the worst-case hold and only a genuinely stuck
holder trips it. Normal-case latency is unchanged (the waiter polls every 100ms
and acquires as soon as the holder releases). Update the fn doc and ADR §5.4 to
state the multi-call hold explicitly.

Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: thepagent <hehsieh1010@gmail.com>
…nthropic-oauth

# Conflicts:
#	openab-agent/src/auth.rs
@canyugs

canyugs commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Design update — landing the ADR's OAuthVendor abstraction + a centralized config file

This comment brings this PR in line with the multi-vendor OAuth ADR (docs/adr/openab-agent-oauth.md, landed in #1190) and adds the prior-art research that ADR §3 only partially covered for the config-file axis. Following docs/adr/pr-contribution-guidelines.md.

Framing — "decided" vs "landed". #1190 shipped the ADR document + the §5.4 storage locking (with_auth_locked / lock_tenant_refresh) only. The OAuthVendor abstraction (§4.2/§5.1) and the CLAUDE_CODE_OAUTH_TOKEN env route (§5.3) were decided but deferred. This PR is where they get implemented.

0. Discord Discussion

Internal thread, 2026-06-27 (multi-provider auth direction + config-file design).

1. What problem does this solve?

Adding a 2nd subscription-OAuth provider (Codex → now Anthropic Claude) surfaced two gaps:

  1. No auth abstraction. Today each vendor is hand-rolled — separate codex_* / anthropic_* constants, three near-duplicate login flows (login_browser_flow, login_anthropic_browser_flow, login_codex_device_flow), and a if provider == ANTHROPIC_NAMESPACE branch in refresh_token. A 3rd vendor (gemini/grok) would be copy-paste. ADR §4.2 calls for a single OAuthVendor descriptor; it was never implemented.
  2. No centralized config. Default provider/model live only in env vars (OPENAB_AGENT_PROVIDER, OPENAB_AGENT_MODEL) with a hardcoded fallback chain in llm.rs::select_provider. There's no declarative config file, and the CLAUDE_CODE_OAUTH_TOKEN fleet route (ADR §5.3) is unimplemented.

2. At a Glance

                       auth axis  (HOW a credential is obtained/refreshed)
                       ─────────────────────────────────────────────
  OAuthVendor descriptor (code, static)        credential store (auth.json, 0600, locked-RMW)
    ┌──────────────┐                             ┌────────────────────────────────┐
    │ CodexVendor  │   shared oauth2 driver      │ "codex":          Token{...}    │
    │ AnthropicVdr │──►(PKCE / device / refresh)─│ "anthropic-oauth":Token{...}    │
    │ (gemini…)    │   §5.4 per-tenant lock       │ "mcp:<server>":   Mcp{...}      │
    └──────────────┘                             └────────────────────────────────┘
            │                                                  ▲ secrets ONLY here
            ▼ inference axis (HOW a request is sent)           │
    AnthropicProvider / OpenAiProvider / …                     │
            ▲                                                  │
            │ selection (dynamic)                              │
   config.json  ──►  OPENAB_AGENT_MODEL env  ──►  model_override (ACP/CLI)
   "model":"anthropic/claude-sonnet-4-6"   (no secrets — commitable)
   resolution: model_override → config.json → env → error (fail-loud, ADR §9 Q1)
   credential precedence (§5.3): ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN → stored tenant

3. Prior Art & Industry Research

Surveyed how four agents structure config × provider × auth. OpenClaw + Hermes are the mandated references; pi + opencode added because the config-file axis is exactly the gap ADR §3 left open (§3 covered the auth axis well).

Dimension OpenClaw (req'd) Hermes (req'd) pi opencode
Config format / location openclaw.json (JSON5), global+project merge ~/.hermes/config.yaml + .env ~/.pi/agent/ — 3 files (settings/models/auth) opencode.json (JSONC), deep-merge layers
Default model single model.primary:"prov/model" model.default:"prov/model" (+model.provider) two fields defaultProvider+defaultModel single model:"prov/model"
Provider registry built-in + models.providers{} (custom) ProviderProfile dataclass registry (in code) models.json + registerProvider() hook Models.dev + provider{} block
Auth-method encoding profile key prov:label + type:oauth|apikey auth_type string field on profile cred type:api_key|oauth cred type:oauth|api in auth store
Secrets location separate auth-profiles.json separate .env / auth.json separate auth.json (0600) separate auth.json (0600)
Per-model params models[].params passthrough fixed_temperature/default_max_tokens structured model fields (no temp) models.<id>.options passthrough
Precedence flag/slash → config order → env → default CLI → config.yaml → .env → default flag → auth.json → env → registry env → auth.json → config

Three signals are unanimous across all four — we adopt them:

  1. Secrets always live in a file separate from config (config stays commitable). We already have this: auth.json.
  2. The credential record is tagged with a type discriminator (oauth vs api_key). We already have this: AuthEntry / TokenStore.
  3. Static provider facts in code, dynamic selection in config, resolved in one place. Hermes is the cleanest exemplar (ProviderProfile registry + resolve_runtime_provider()), and it maps 1:1 onto ADR §5.1's OAuthVendor descriptor + a resolver.

Where we diverge: unlike all four (TS/Python), we keep the cross-process flock-based locked-RMW invariant (ADR §5.4) — OpenClaw and Hermes both file-lock but only for the single-process case; OpenClaw's open refresh-persistence bug cluster (#52037/#48153/#71026) is direct evidence that getting the lock + atomic write-back right matters.

4. Proposed Solution

  • OAuthVendor trait + descriptors (ADR §5.1): CodexVendor, AnthropicVendor today; new vendor = descriptor only. Collapse the three login flows into one shared PKCE/device driver on the in-tree oauth2 crate; fold the refresh_token provider-branch into vendor.token_body(). Anthropic's JSON-no-scope quirk via the crate's custom-http hook.
  • Thin config.json (next to auth.json), single provider/model string + optional max_tokens / per-provider base_url. No secrets. Resolution: model_override → config.json → OPENAB_AGENT_MODEL → error (env still overrides; fail-loud per §9 Q1).
  • Credential precedence (§5.3): ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN → stored anthropic-oauth tenant.
  • Storage: Anthropic save/refresh already routed through §5.4 locking in this branch (merge with main done — see commit resolving the auth.rs conflict; save_tokens_for → with_auth_locked, get/force_refresh_for → lock_tenant_refresh(namespace)), so the per-tenant RTR race is closed for the Anthropic tenant too.
// ~/.openab/agent/config.json — commitable, secrets stay in auth.json
{
  "model": "anthropic/claude-sonnet-4-6",   // single provider/model string (reuses ModelRef)
  "max_tokens": 8192,                        // optional
  "providers": { "anthropic": { "base_url": "https://api.anthropic.com" } }  // optional, custom only
}

5. Why This Approach

  • Single provider/model string: 3 of 4 surveyed projects use it, and openab already parses it — ModelRef::parse (llm.rs) + resolve_provider_choice() extract the provider from the provider/ prefix today. Two fields (pi) would add coupling for no gain.
  • Thin config / secrets-in-auth.json: unanimous prior-art signal; reuses the hardened §5.4 store instead of inventing a second secret file.
  • OAuthVendor now, not later: with only codex+anthropic in tree this is the lowest-cost moment to abstract; doing it after gemini/grok land means refactoring more call sites.
  • Linked fix — F4: a single provider/model config string makes the ModelRef::parse HuggingFace-ID bug load-bearing (anthropic/claude-… must split; some-org/some-model must not). F4's fix (split only on known provider prefixes) ships with the config change, not separately.

6. Alternatives Considered

  • Two fields (defaultProvider+defaultModel, pi-style) — rejected: couples two values openab already derives from one string; only 1/4 projects do it.
  • Secrets in the config file — rejected: contradicts all four references and would bypass the §5.4 locked store.
  • Build all vendors now (gemini/grok/agy) — out of scope per ADR §9 Q3 (incremental landing); this PR lands the abstraction + keeps codex/anthropic green.
  • Layer-3 auto-trigger (auto-login on mid-turn 401) — deferred per ADR §2 / Brett 2026-06-24; /auth is sufficient.

7. Validation

  • Rust: cargo check, cargo clippy --all-targets, cargo test green after the main merge — 56/56 auth+llm+mcp tests pass, incl. test_anthropic_save_uses_provider_as_key_disjoint_from_codex, with_auth_locked_merges_concurrent_tenants_no_lost_update, lock_tenant_refresh_fails_closed_when_contended, save_uses_rotated_refresh_token_when_present.
  • To add with the implementation: config.json load + layering test (model_override > config > env), precedence test (ANTHROPIC_API_KEY > CLAUDE_CODE_OAUTH_TOKEN > tenant), F4 ModelRef::parse HuggingFace-ID regression test, and an OAuthVendor round-trip test per descriptor.
  • Helm / CI / docs: docs update to docs/native-agent.md (vendor model + config.json + env route) lands with the change; no Helm surface.

@brettchien

Copy link
Copy Markdown
Contributor

Now that #1190 has merged, here's a concrete path to rebase this PR onto the new
auth.json locking invariant — plus a couple of best-practice shape changes so the
new Anthropic tenant lands the way the ADR intends rather than as a parallel copy.

Blast radius: the lock layer (with_auth_locked, lock_tenant_refresh/RefreshLock,
the flock helpers, the refresh timeouts, gc_stale_pending) is reused from #1190
verbatim — it was already built tenant-agnostic, so nothing in it is re-implemented. The
only #1190 code touched is lifting the codex orchestration (get_valid_token/force_refresh)
into generic _for(namespace) form so both vendors share one implementation. Genuinely new
code is just the vendor-specific refresh body + the descriptor.

#1190 did the hard part generically, so most of this is wiring the new tenant
into existing primitives, not new infrastructure:

  • with_auth_locked(path, |map| …) — the global flock-guarded read-modify-write
    funnel. Every writer of auth.json must go through it.
  • lock_tenant_refresh(auth, tenant: &str) -> RefreshLock — already takes a tenant
    string; pass "anthropic-oauth". Handle all three arms: Held (hold across the
    refresh), Unavailable (best-effort, proceed unserialised), TimedOut
    (fail closed — return a retryable error, do not refresh).
  • REFRESH_HTTP_TIMEOUT (8s) bounds each refresh round-trip so the tenant lock-hold
    stays provably under REFRESH_LOCK_TIMEOUT.

1. Route the four unlocked writers through #1190's locks

These four sites in openab-agent/src/auth.rs still do raw read → mutate → write
(this branch predates #1190):

  • save_tokens_for (:276) → funnel through with_auth_locked (mirrors save_tokens):
    fn save_tokens_for(store: &TokenStore) -> Result<()> {
        let store = store.clone();
        with_auth_locked(&auth_path(), move |map| {
            map.insert(store.provider.clone(), AuthEntry::Token(store));
        })
    }
  • get_valid_token_for (:377) → adopt the 5-step locked sequence: fast-path on a
    fresh token, then lock_tenant_refresh(.., namespace), re-load inside the lock
    (reuse-safety), double-check expiry, one network refresh, commit via save_tokens_for.
  • force_refresh_for (:386) → same lock_tenant_refresh wrap, but keep its current
    shape: it runs after a 401 (token known-bad), so it intentionally skips the expiry
    double-check
    — it loads inside the lock (reuse-safe) and always refreshes.
  • refresh_token (:401) → build the client with
    reqwest::Client::builder().timeout(REFRESH_HTTP_TIMEOUT).build()?. Keep the existing
    per-provider body branch (Anthropic JSON-no-scope / Codex form). The Anthropic refresh
    is a single round-trip, so it fits the existing MAX_REFRESH_ROUND_TRIPS = 2 budget —
    no need to change that constant.
  • McpCredentialStore::save/clear (:330 / :361) — openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP) #1190 already locked these on
    main. This is a straight conflict; take main's locked versions, don't carry the
    unlocked copies forward.

2. One locked implementation, not two — and stage it as two commits

After (1), main's codex get_valid_token / force_refresh and the new _for
variants would be near-duplicates of the same security-critical refresh sequence
(token-family-revocation safety). Duplicated security code drifts — one copy gets a
fix the other doesn't. So parameterise main's locked bodies into the _for(namespace)
functions
and keep thin wrappers:

pub async fn get_valid_token() -> Result<String> { get_valid_token_for(CODEX_NAMESPACE).await }
pub async fn force_refresh()   -> Result<String> { force_refresh_for(CODEX_NAMESPACE).await }

To keep the codex path's review surface clean, please split this into two commits:

  1. behaviour-preserving refactor — generalise the locked codex path into _for(namespace);
  2. feature — add the anthropic-oauth tenant on top.

That lets a reviewer verify the refactor changes no behaviour separately from the new vendor.

3. Add a concurrency test for the Anthropic tenant

#1190 ships lock_tenant_refresh_fails_closed_when_contended and
with_auth_locked_merges_concurrent_tenants_no_lost_update for codex. Please cover the
new tenant the same way — parameterise (or duplicate) the lock test over "anthropic-oauth"
so single-flight / fail-closed is actually proven for it, not just structurally similar.

4. Shape the new vendor as a descriptor, not inlined constants

Rather than hand-rolling the Anthropic flow with inline constants alongside the codex one,
lift the per-vendor specifics — client_id, authorize/token URLs, scope, token-body
format (JSON-no-scope vs form), grant type — into a small per-vendor descriptor struct, and
drive the shared flow off it. This is the F3 point from the earlier review, and it lines up
with the ADR's OAuthVendor direction (§5.1): adding the second vendor in descriptor shape
keeps the future shared oauth2-crate driver a clean extraction instead of a rewrite.

To be clear on scope: the descriptor struct belongs in this PR; building the shared
oauth2-crate driver and migrating codex onto it is the genuinely separate, ADR-tracked
follow-up — not asked for here.


One deployment note (not blocking): for pod/fleet use, the race-free path is the
CLAUDE_CODE_OAUTH_TOKEN env route (ADR §5.3) — no refresh writes, no lock needed. The
interactive PKCE + per-tenant lock is the self-service path; worth recommending the env
token as the primary fleet mode and treating interactive OAuth as the fallback.

Can and others added 7 commits June 27, 2026 06:09
… §5.1)

Collapse the hand-rolled per-vendor OAuth surface into a single OAuthVendor
trait + CodexVendor/AnthropicVendor descriptors. Adding a subscription-OAuth
vendor is now a descriptor, not a new hand-rolled flow:

- Unify the codex + anthropic PKCE login flows into one shared `login_pkce_flow`
  driven by the descriptor; fold the codex flow into the `accept_callback_code` /
  `code_from_redirect` helpers (the long-standing TODO) and unify the 127.0.0.1 bind.
- Drive `refresh_token` and the authorization-code exchange off `vendor.token_body()`
  instead of an `if provider == ANTHROPIC_NAMESPACE` branch.
- Shared pure builders `build_authorize_url` / `token_store_from_payload`, pinned by
  wire-format unit tests — the login authorize-URL/exchange hit live OAuth servers,
  so no integration test covers them.

The driver keeps the proven reqwest flows; swapping the engine onto
`oauth2::BasicClient` (as `mcp/runtime.rs` already does via a custom http hook) is a
follow-up internal change invisible to vendor authors — only the descriptor surface
lands here. `client_secret()`/`grant()` are the ADR §5.1 surface for later vendors
(gemini/agy bundled secret; copilot/kiro device-code).

205 tests pass (4 new wire-format locks).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ADR §5.3)

Add `AnthropicAuth::OAuthEnv` for the pre-provisioned long-lived subscription
token route (`CLAUDE_CODE_OAUTH_TOKEN`) — the recommended fleet mode (ops mints
once, injects as a k8s secret; no interactive flow, no auth.json write, no
refresh race).

- `AnthropicProvider::auto[_with_model]()` now resolves in ADR §5.3 precedence:
  `ANTHROPIC_API_KEY` → `CLAUDE_CODE_OAUTH_TOKEN` → stored `anthropic-oauth` tenant.
  Each source surfaces its own errors rather than falling through to a
  lower-precedence credential error.
- `OAuthEnv` shares the OAuth `Bearer` + Claude Code identity path (extracted into
  `oauth_headers`); `is_oauth()` covers it so the system block + tool-name casing
  apply. The 401 force-refresh is gated to the stored tenant only — the env token
  has no tenant to refresh, so a 401 there surfaces (re-mint) instead of erroring
  on a missing tenant.

207 tests pass (+2 precedence tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (review F4)

`provider/model` previously split on the first `/`, mis-parsing HuggingFace-style
`org/model` ids (e.g. `meta-llama/Llama-3-8B`) for custom/OpenAI-compatible
endpoints — `org` became the "provider" and the real id was truncated. Now the
prefix is split off only when it's a `KNOWN_PROVIDERS` entry; otherwise the whole
string is the model id. Load-bearing for the planned single-string `provider/model`
config field. Known prefixes still split unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reds (review F3)

`select_provider`'s auto-detect arm discarded the `AnthropicProvider::auto()`
error and fell through to Codex. A present Anthropic credential that failed for a
config reason (e.g. a key set but no model) was silently masked — the user got a
Codex provider, or a Codex-only error that hid the real cause.

Now: if any Anthropic credential source exists (API key / CLAUDE_CODE_OAUTH_TOKEN /
stored tenant), an `auto()` failure surfaces as a config error ("credential
present but unusable"). Codex fallthrough happens only when no Anthropic
credential exists at all, and that final error names both credential routes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(ADR §5.5)

Add a small JSON config file next to auth.json so a deployment can declare the
default provider/model (and max_tokens) in a file instead of only via env vars —
the centralized-config gap the multi-vendor work surfaced.

- New `config` module: `{ model, max_tokens }` (single `provider/model` string,
  reusing ModelRef). Unknown keys tolerated (forward-compat for `providers`/etc.);
  malformed JSON fails loud; a missing file is an empty config.
- Resolution is env-over-config: `anthropic_model` / `anthropic_max_tokens` /
  `resolve_provider_choice` now fall `OPENAB_AGENT_*` env → config.json → built-in.
  A pod's injected env stays authoritative over a baked config.
- Secrets never live here — they stay in the locked `auth.json` store.

Per-provider `base_url` routing is a deliberate follow-up; this lands the default
provider/model/params surface. 213 tests pass (+5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ig.json (review F5)

- Anthropic credentials section: the ADR §5.3 precedence (ANTHROPIC_API_KEY →
  CLAUDE_CODE_OAUTH_TOKEN fleet route → interactive anthropic-oauth) + the
  `openab-agent auth anthropic-oauth` login.
- New "Configuration file (config.json)" section: schema, env-over-config
  precedence, secrets-stay-in-auth.json.
- "Adding an OAuth vendor": the OAuthVendor descriptor model (ADR §5.1).
- Env table: CLAUDE_CODE_OAUTH_TOKEN, OPENAB_AGENT_ANTHROPIC_CLIENT_ID,
  OPENAB_CONFIG_PATH.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chaodu-agent

This comment has been minimized.

…auth tenant

Brett asked to cover the new tenant the same way openabdev#1190's codex lock tests do, so
single-flight / fail-closed is proven for it rather than only structurally similar:

- `with_auth_locked_merges_anthropic_tenant_no_lost_update` — a concurrent codex
  write does not clobber a just-written `anthropic-oauth` token (the new tenant
  rides the same locked RMW funnel).
- `lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant` — a held
  anthropic refresh lock makes a second anthropic acquire fail closed (`TimedOut`),
  and does NOT block codex (per-tenant isolation — the reason §5.4 uses a
  per-tenant lock, not the global one).

215 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chaodu-agent

This comment has been minimized.

@canyugs

canyugs commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Architecture — Before / After

A visual companion to the design-update comment above: how auth/provider selection is shaped before this PR's multi-vendor refactor vs after (ADR §4/§5.1/§5.3/§5.5).

Before — hand-rolled per-vendor, env-only, unlocked new tenant

 SELECTION   env only:  OPENAB_AGENT_PROVIDER / OPENAB_AGENT_MODEL
             + hardcoded default model  (retire -> 404 timebomb)
             ModelRef::parse: split on FIRST '/'  (breaks HF org/model)
                    |
                    v
             select_provider(choice)
               auto(): API key -> stored OAuth            <- 2-tier
               auto-detect: Anthropic err SWALLOWED --.
                    |                                  '-> silently try Codex
        +-----------+-----------+
        v                       v
  AnthropicProvider        OpenAiProvider
  auth: {ApiKey, OAuth}      (codex)
        |
  -- AUTH: hand-rolled per vendor (coupled to inference) -----------
    codex_client_id() / anthropic_client_id()  + inline constants
    login_browser_flow (codex PKCE)      -.
    login_anthropic_browser_flow (PKCE)   |- 3 near-duplicate flows
    login_codex_device_flow (device)     -'
    refresh_token:  if provider==ANTHROPIC {json} else {form}   <- branch
        |
        v
    auth.json  (namespaced: codex / anthropic-oauth / mcp:<srv>)
    writes: raw read -> mutate -> write   X no lock -> RTR race / lost-update

After — two-axis model, 3-tier precedence, locked, config file

 SELECTION   model_override(ACP/CLI) -> OPENAB_AGENT_MODEL -> config.json -> ERROR
             resolve_provider_choice: env -> config.json prefix -> auto
             ModelRef::parse: split only KNOWN providers (HF org/model kept) [F4]
                    |
                    v
             select_provider(choice)
               auto(): ANTHROPIC_API_KEY -> CLAUDE_CODE_OAUTH_TOKEN
                                         -> stored tenant      <- 3-tier [§5.3]
               auto-detect: credential present + fail -> FAIL LOUD [F3]
                    |
  == TWO-AXIS (ADR §4 - auth _|_ inference) ==========================
                    |
    INFERENCE axis  |   AUTH axis  -- trait OAuthVendor [§5.1] --
    AnthropicProvider|     +- CodexVendor      (namespace, client_id,
    OpenAiProvider   |     +- AnthropicVendor   authorize/token url, redirect,
    (1 per wire fmt) |              |            scope, token_body, grant)
        |            |              v
    AnthropicAuth    |   shared driver (reqwest now; oauth2 follow-up)
    {ApiKey, OAuth,  |   login_pkce_flow . device . refresh -- descriptor-driven
     OAuthEnv}<-CLAUDE…             |
        |            |              v
        v            v        auth.json (0600, namespaced)
    config.json {model, max_tokens}   writes:  with_auth_locked (global flock)
    declarative . secrets NEVER here  refresh: lock_tenant_refresh(ns) per-tenant
                                      [§5.4 - anthropic tenant proven by tests]

Core deltas

Dimension Before After
Auth abstraction hand-rolled 2 vendors, 3 duplicate login flows, refresh if-branch OAuthVendor trait + descriptor + shared driver (new vendor = a descriptor)
auth / inference coupled (login flow lives inside the provider) two orthogonal axes (§4)
Credential precedence API key → stored OAuth (2-tier) + CLAUDE_CODE_OAUTH_TOKEN fleet route (3-tier, §5.3)
Storage safety raw RMW, unlocked, RTR race with_auth_locked + per-tenant lock_tenant_refresh (§5.4)
Config source env-only + hardcoded default model config.json (env-over-config), fail-loud, secrets stay in auth.json
Error handling Anthropic error swallowed → silent Codex fallthrough credential-present-but-broken → fail loud (F3)
Model parsing split on first / (breaks HF ids) split known providers only (F4)

All of @brettchien's rebase roadmap and the review findings are addressed: routed the new tenant through #1190's with_auth_locked / lock_tenant_refresh, shaped both vendors as OAuthVendor descriptors, fail-loud model resolution (no hardcoded default), and added per-tenant lock tests for the anthropic-oauth tenant (single-flight + fail-closed + codex-independence) in f32de33. CI green — handing to maintainer.

@chaodu-agent

Copy link
Copy Markdown
Collaborator

LGTM ✅ — Well-architected multi-vendor OAuth with clean credential-source precedence and solid test coverage.

What This PR Does

Adds native Anthropic OAuth (Claude Pro/Max subscription) login to openab-agent, so subscribers can authenticate via PKCE without an API key — matching the existing Codex OAuth experience. Also introduces a centralized config.json for declarative defaults and removes the hardcoded model default (fail-loud on missing model).

How It Works

  • OAuthVendor trait (ADR §5.1): a static descriptor per vendor (endpoints, client ID, token-body encoding, redirect). CodexVendor and AnthropicVendor are the first two; the shared PKCE driver (login_pkce_flow) parameterises on the descriptor, so a new vendor is a new struct, not a new hand-rolled flow.
  • Credential precedence (ADR §5.3): ANTHROPIC_API_KEYCLAUDE_CODE_OAUTH_TOKEN env route (fleet/pod) → stored anthropic-oauth tenant in auth.json. A present-but-misconfigured credential fails loud (F3 guard) instead of silently falling through to Codex.
  • Token refresh: per-tenant flock serialisation (rides the §5.4 locking from openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP) #1190), JSON body for Anthropic (no scope on refresh), form body for Codex. 401 mid-flight triggers one forced refresh.
  • config.json (ADR §5.5): env-over-config precedence for model/max_tokens. Missing file is fine; malformed file logs a warning and falls back.
  • ModelRef::parse: provider-qualified anthropic/claude-sonnet-4-6 form; only known provider prefixes split (HuggingFace org/model ids preserved).
  • Tool-name normalisation: openab-agent's read/write/bash ↔ Claude Code's Read/Write/Bash for the claude-code-20250219 beta header path.

Findings

# Severity Finding Location
1 🟢 Clean vendor abstraction — the OAuthVendor trait + descriptor pattern makes adding vendors trivial and the wire-format pinning tests are excellent insurance auth.rs
2 🟢 Fail-loud on missing model eliminates the recurring 404 timebomb (per brettchien's architectural feedback) llm.rs:anthropic_model()
3 🟢 F3 guard in select_provider distinguishes "no credential" (fall to Codex) from "credential present but unusable" (fail loud) — prevents subtle silent fallback bugs llm.rs:select_provider()
4 🟢 Comprehensive test coverage: 4 new wire-contract tests, namespaced storage disjointness, per-tenant refresh lock isolation, model precedence chain, OAuth identity + tool casing across auth.rs, llm.rs, config.rs, acp.rs
5 🟢 ACP drain fix (bounded tokio::time::timeout on the stdout drain task) prevents lost final responses — correct shutdown ordering acp.rs:268
6 🟢 Per-tenant refresh lock isolation proven by test (lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant) — Anthropic lock never head-of-line-blocks Codex auth.rs tests
Baseline Check
  • PR opened: 2026-06-24
  • Main already has: API-key-only AnthropicProvider::from_env(), single-tenant Codex OAuth, no config.rs, no vendor abstraction, hardcoded claude-sonnet-4-20250514 default
  • Net-new value: Full Anthropic PKCE OAuth login, OAuthVendor trait abstraction, CLAUDE_CODE_OAUTH_TOKEN env route, centralized config.json, fail-loud model resolution, ModelRef parser, tool-name normalisation, ACP drain fix, 4+ new test families
What's Good (🟢)
  • The PR description is exceptionally well-researched (OpenClaw, Hermes, Pi prior art comparison)
  • Every previous review finding (CI workspace, state/verifier separation, hardcoded default removal, locking integration) has been addressed in this final iteration
  • The OAuthVendor surface is designed for extension (gemini/grok/agy vendors) without touching the shared driver
  • unsafe env manipulation in tests is properly scoped with ENV_LOCK guards
  • The is_multiple_of nits in format.rs and wecom.rs are bonus Rust modernisation (nightly → stable API usage)

CI: check (cargo check/clippy/test) ✅ | all individual smoke-tests ✅ | unified smoke-tests pending (builder just completed, these will land shortly)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent

5 participants