feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent#1187
feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent#1187canyugs wants to merge 21 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds first-class Anthropic (Claude Pro/Max) OAuth login support to openab-agent, storing credentials as a new anthropic-oauth tenant alongside the existing codex tenant in ~/.openab/agent/auth.json. This extends provider auto-detection and request shaping so Claude subscription users can run the native agent without an ANTHROPIC_API_KEY.
Changes:
- Introduces
openab-agent auth anthropic-oauth [--no-browser]PKCE login flow and namespaced token load/save/refresh helpers. - Extends
AnthropicProviderto support OAuth auth mode (Bearer + Claude Code identity headers/system block + tool-name casing normalization). - Updates ACP provider/model selection and available-model listing to recognize Anthropic OAuth credentials; bumps default Anthropic model to
claude-opus-4-8.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| openab-agent/src/main.rs | Adds the auth anthropic-oauth CLI subcommand wiring. |
| openab-agent/src/auth.rs | Implements Anthropic PKCE/OAuth flow and namespaced token storage/refresh. |
| openab-agent/src/llm.rs | Adds Anthropic OAuth request behavior (headers/system/tool name normalization) and provider selection updates. |
| openab-agent/src/acp.rs | Uses Anthropic auto* selection (API key or OAuth), updates default model and model availability gating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Load the LLM token stored under `namespace` (`codex` / `anthropic-oauth`). | ||
| pub fn load_tokens_for(namespace: &str) -> Result<TokenStore> { | ||
| let path = auth_path(); | ||
| let map = read_auth_file(&path).map_err(|_| { | ||
| anyhow!( | ||
| "No credentials found at {}. Run `openab-agent auth codex-oauth` first.", | ||
| "No credentials found at {}. Run `openab-agent auth` first.", | ||
| path.display() | ||
| ) | ||
| })?; | ||
| match map.get(CODEX_NAMESPACE) { | ||
| match map.get(namespace) { | ||
| Some(AuthEntry::Token(t)) => Ok(t.clone()), | ||
| _ => Err(anyhow!( | ||
| "No codex credentials in {}. Run `openab-agent auth codex-oauth` first.", | ||
| "No {namespace} credentials in {}. Run `openab-agent auth` first.", | ||
| path.display() | ||
| )), | ||
| } |
| /// Block on the loopback listener for the OAuth redirect, reply 200, return the | ||
| /// authorization code. ponytail: the Codex flow above predates this helper and | ||
| /// still inlines the same logic; fold it in if that path is ever touched again. |
| _ => match AnthropicProvider::auto() { | ||
| Ok(p) => Ok(Box::new(p)), | ||
| Err(_) => match OpenAiProvider::from_auth_store() { | ||
| Ok(p) => Ok(Box::new(p)), | ||
| Err(e) => Err(format!( | ||
| "No credentials: set ANTHROPIC_API_KEY or run `openab-agent auth codex-oauth`. {e}" | ||
| "No credentials: set ANTHROPIC_API_KEY, or run `openab-agent auth anthropic-oauth` / `auth codex-oauth`. {e}" | ||
| )), | ||
| }, |
| return self.error_response( | ||
| id, | ||
| -32000, | ||
| &format!("No credentials: set ANTHROPIC_API_KEY or run `openab-agent auth codex-oauth`. {e}"), | ||
| &format!("No credentials: set ANTHROPIC_API_KEY, or run `openab-agent auth anthropic-oauth` / `auth codex-oauth`. {e}"), | ||
| ) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Thanks for the thorough review — addressed in F1 (🔴 CI workspace) — fixed, with a root-cause correction. The "rebase onto main" suggestion doesn't apply: this branch is already based on current F2 (🟡 PKCE state) — fixed + verified live. Now uses an independent 32-byte random F3 (🟡 error UX) — fixed. Credential errors now name fully-qualified subcommands ( F4 (🟡 comment tag) — fixed. Also confirmed the canonical native image still builds: |
|
Follow-up: the With the workspace |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Supplementary architectural review (forward-looking — this PR is already LGTM'd and the line-level items Findings
Detail — F1: cross-process auth.json race (follow-up)
This is not introduced by this PR (Codex already shares the same store), but OAuth widens the exposure, Detail — F4: default model stalenessThis PR replaces the old default Recommendation — no hardcoded model default; require it via config/env and fail loud. Since Messages V1 Note this is a behavior change: today's zero-config default goes away, so a model must be set Detail — F2: support CLAUDE_CODE_OAUTH_TOKEN
For ops-managed deployments this is arguably the primary path; interactive PKCE is for local self-service. Detail — F3: per-vendor descriptorclient_id / client_secret / endpoints / scope / token-body-format are the only things that vary between Direction / roadmap (tracked in a forthcoming ADR)A short ADR is being drafted for multi-vendor LLM-provider OAuth + credential storage; this PR is the first
|
|
Follow-up on F4 — one thing worth doing in this PR before it merges: please don't pin This PR exists because the previous hardcoded default ( Since Messages V1 mandates a
It's a small change, and it also removes the silent Opus cost bump for API-key users. Behavior note: this drops the zero-config default, so a model must be set — deployments via values.yaml/env already do; worth a clear error message + CHANGELOG line for zero-config/local users. |
Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing PR openabdev#1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@brettchien Great call — this is exactly the right framing, thank you. The dateless 4.6+ IDs being fixed canonical IDs (not evergreen pointers) makes any hardcoded default a recurring 404 timebomb, and pinning Opus also quietly raised costs for API-key users. Implemented your fail-loud approach in
Tests: |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
All review findings addressed across iterations — summary of what was fixed:
Final bot review is LGTM ✅, all 33 CI checks green. Ready for maintainer review. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…g (codex + MCP) (#1190) * docs(adr): openab-agent multi-vendor OAuth & credential storage Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR #1185) auth-trigger model. Surfaced while reviewing PR #1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): address review — prefer oauth2 crate, drop rollout, vendor names - Build the OAuthVendor driver on the official `oauth2` crate (already in-tree via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the Anthropic JSON-token-body quirk is applied via the crate's custom http-client hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative). - Remove the project Rollout-plan section (internal sequencing, not ADR material); keep the race-window mitigation in §5.4. - Use vendor names only; drop internal fleet-agent references from the matrix. - Replace unexplained "ToS-gray" with "ToS-risk" + a definition. - Fix cross-references (crate-qualified paths, §-refs, line numbers) and move the settled model-default decision under "Decisions & open questions". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): per-tenant refresh lock + per-user PKCE keying (Mira review r1) Fold in two release-blocker-class concurrency fixes from Mira's review: - §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on commit" claim was wrong — N processes each send a refresh with the same RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family revocation. Replace with a per-tenant exclusive lock: network refresh held under the tenant lock only (not the global lock), so exactly one refresh per tenant per expiry with no head-of-line blocking across tenants. Extends to mcp:<server> tenants. - §7 /auth: key the pending PKCE verifier+state by the initiating Discord user id instead of a single global entry — prevents concurrent-user overwrite (PKCE mismatch) and session hijack. - §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): bundled-secret storage + pending-entry GC (Mira review r2) - §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX- secret. The value is non-confidential by RFC 8252 / Google docs, but GitHub now push-protects Google secrets by default (changelog 2026-03) and partner-scans them for auto-revoke, so a raw literal is not safe in a public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT a security control) or env-inject at runtime — not raw text. - §7: pending PKCE entries get created_at + a 15-min GC sweep in with_auth_locked so abandoned /auth attempts don't accumulate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): close §9 Q2/Q3 — env-injection default, vendor go/no-go (Brett decisions r3) - §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled client_secret; encode-at-rest (a) is the fallback for bundled zero-config binaries only, framed as scanner-evasion (not security). Cite rclone rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical precedent (§10). - §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in, ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6. - §10: add rclone + GitHub/Google secret-scanning references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): confirm agy secret + ecosystem evidence (GitHub survey r4) GitHub survey 2026-06-24 grounds the agy GO decision: - §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX- secret, a public constant >=20 antigravity-auth repos hardcode verbatim (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...). Literal deliberately NOT pasted into the doc — would trip the very §9 Q2 push-protection, so we dogfood the env/encode decision. Redirect confirmed localhost:51121/oauth-callback. - §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/ openclaw plugins + proxies), proving the integration, while the same ecosystem's anti-ban / quota-locking / multi-account-rotation tooling empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate. - §10: add the antigravity OAuth ecosystem reference set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): agy-vendor vs agy-CLI/ACP clarification + MCP store revamp in scope (r5) Grounded by a codebase survey (2026-06-24): - §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist inference provider — it does NOT run the agy CLI. agy speaks no ACP; the existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated agy-acp adapter that shells out to the agy binary per prompt and polls its SQLite DB. The provider path sidesteps ACP entirely and supersedes the CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO. - §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also correct the stale save_tokens_for name and note with_auth_locked is new. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): fix stale §9 Q4 cross-reference The §9 Q4 tail referenced 'openab-agent-mcp.md open items #1 (reqwest 0.12/0.13 split) and #8 (doctor/runtime two-store split)' — but that ADR's §10 Open Questions has only two items (mcp.json location; native-vs-broker parity), neither matching, and no such numbered items / terms exist anywhere in it. The phantom reference dated to the original draft. Replace with an accurate statement: McpCredentialStore reuses the same TokenStore/auth.json storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both; the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(adr): §9 Q2 — encode-at-rest is the default, env is the alternative (Brett) Brett's call given the 93%-plaintext ecosystem survey: keep the bundled zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style) as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses the prior (b)-default ordering. Framing unchanged: encode-at-rest is scanner-evasion for a non-confidential value, not a security control. Record the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that the real risk mitigated is org-repo push-protection friction, not credential loss. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(auth): cross-process locking for auth.json — codex + MCP (ADR §5.4) Implements the ADR §5.4 invariant. auth.json had NO lock: provider save_tokens and McpCredentialStore::save/clear each did an independent unlocked read-modify-write, so a concurrent codex refresh and MCP save last-writer-wins the whole map, and N processes (one openab-agent per Discord thread) could each refresh the codex token with the same RT_old → OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix) with a non-unix no-op: - with_auth_locked: global exclusive lock across the re-read -> mutate -> atomic-write. ALL writers funnel through it — save_tokens (codex) and McpCredentialStore::save/clear (MCP) — so writers merge onto the latest on-disk state instead of lost-updating. Held only for the fast file RMW, never across network I/O. - lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token / force_refresh take the codex tenant lock (non-blocking try + async backoff + 10s timeout, held across the network refresh), with a double-checked re-read so a process that waited adopts the token another already refreshed → exactly one real refresh per tenant per expiry, no RT_old reuse. Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was optimistic). New test asserts the locked RMW merges codex + MCP tenants without lost-update. fmt + clippy -D warnings + test (191 passed) green. Known gap (for review): the MCP *network* refresh is owned by rmcp's AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP writes get the file-integrity lock but MCP refreshes are not yet tenant- serialised. Closing that needs an rmcp-level hook — proposed as follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(auth): address Mira code review — pending GC + portable WouldBlock Round-2 code review (Mira): - §7 pending-entry GC: add created_at to PendingPasteLogin and sweep AuthEntry::Pending older than 15 min inside with_auth_locked on every write, so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is currently a legacy tombstone with no live writer (PKCE state lives in rmcp's in-memory StateStore); the field + GC land now per ADR §7 and are forward-compatible with the forthcoming /auth two-step flow, and meanwhile sweep legacy stray entries (created_at default 0 reads as ancient). New test covers stale-swept / fresh-kept / real-tenant-untouched. - flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD. - MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) — Mira concurs. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(auth): serialise MCP token refresh cross-process (close §5.4 (b) gap) Closes the known gap from f370110: MCP refreshes now get the same per-tenant serialisation as codex. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in McpRuntimeManager::resolve_oauth_dial via client.get_access_token(). Wrap that call (per-server) with the existing auth::lock_tenant_refresh so only one process refreshes a given server at a time. No explicit double-check needed: rmcp's get_access_token re-load()s auth.json each call and returns the cached token without a network refresh when remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the race adopts the token the winner wrote to the shared file — no second RT_old presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights within one process via its AuthorizationManager Mutex; this closes the cross-process gap. - auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module. - ADR §5.4: document the resolve_oauth_dial serialisation point. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(auth): /simplify cleanups — dedup lock acquire, reuse refresh fd Quality-only cleanups from a 4-angle /simplify pass (no behaviour change): - Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both with_auth_locked and McpCredentialStore::clear through it, so the "global" sidecar name + the cfg(unix) acquire live in one place instead of two copy-pasted blocks. clear flattens (drop the inner run closure). - lock_tenant_refresh opens the lock fd once and re-issues flock on it each retry, instead of re-opening (and re-create_dir_all-ing) the file every 100ms under contention. Removes the now-unused flock_try_exclusive helper. - Document lock_tenant_refresh's double-check contract (the lock only serialises; callers must re-check freshness after acquiring). fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(auth): /code-review — harden GC + refresh-lock contract comments Doc-only clarifications from a /code-review pass (no behaviour change): - PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending entry MUST stamp created_at, else gc_stale_pending sweeps it on the next locked write (latent footgun for the forthcoming /auth two-step writer). - lock_tenant_refresh: correct the contract — reuse-safety comes from loading the refresh token INSIDE the lock (so force_refresh, which always refreshes on a 401, is reuse-safe too); the post-lock expiry re-check is only an optimisation to skip a redundant refresh. Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP server literally named "codex" collides with the codex tenant's auth.json key and refresh lock — committed MCP creds use the bare server name, not mcp:<server> as ADR §5.4 describes) — flagged for a separate change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * build(openab-agent): declare standalone [workspace] to fix CI ci-openab-agent.yml runs cargo fmt/clippy/test/build with working-directory: openab-agent, but the crate is not a member of the parent openab workspace (members = crates/openab-core, openab-gateway) and was not excluded, so cargo errors 'current package believes it's in a workspace when it's not' and every step fails before doing any work. openab-agent is intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so the correct fix is an empty [workspace] table making it its own root. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Revert "build(openab-agent): declare standalone [workspace] to fix CI" This reverts commit b40693c. * ci(openab-agent): append [workspace] in CI + harden ACP smoke test ci-openab-agent.yml ran cargo from working-directory: openab-agent without the [workspace] table the crate needs (it's standalone, not a parent-workspace member), so every step failed with 'believes it's in a workspace when it's not' — the workflow had been red independently of any PR. Dockerfile.unified already works around this by appending the table at build time; replicate that in the workflow rather than committing it to Cargo.toml (which would double-append in the Dockerfile and break the image build). Also harden the ACP smoke test: build the release binary in its own step and bump the response timeout 5s -> 30s so a loaded runner doesn't flake the agent's first-response window (the binary itself responds in <1s locally, verified incl. a clean-HOME release build). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(openab-agent): surface ACP smoke-test stderr + exit code for diagnosis The smoke test produces empty stdout only in CI (the binary responds with agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs). Capture stderr + exit code so the next CI run reveals why stdout is empty. * openab-agent: address review findings on OAuth ADR + auth.json locking Resolves the Changes-Requested review on the multi-vendor OAuth ADR + cross-process auth.json locking PR. No behavioural change beyond log level and a non-unix diagnostic. ADR (docs/adr/openab-agent-oauth.md) - Fix lock-file names to match code: auth.json.global.lock and auth.json.refresh.<tenant>.lock (were auth.json.lock / auth.json.<tenant>.lock). - Add the path parameter to the with_auth_locked pseudocode signature. - Correct the MCP refresh note: initialize_from_store() does the disk reload, not get_access_token. - Correct the crate note: libc::flock directly (rustix is not in-tree). - Document the deliberate fail-open trade-off on tenant-lock timeout. - Mark the default-model removal (Decision 1) as a follow-up; this PR ships the ADR + locking only. Code - auth.rs: escalate the tenant-lock timeout log from warn! to error! and document the fail-open trade-off on lock_tenant_refresh. - auth.rs: non-unix lock_global no-op now warns once instead of silently providing zero cross-process protection. - mcp/runtime.rs: fix the rmcp reload comment and document the cross-module invariant that the refresh lock and credential entry share the server name. CI (ci-openab-agent.yml) - ACP smoke test now fails on a non-zero/timeout exit code, and the agentInfo assertions use { } so exit fails the step rather than just a subshell. Gate: cargo fmt --check, clippy -D warnings, test (192 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * openab-agent: fail closed on a contended refresh lock (F6) Replaces the fail-open tenant-lock timeout with a bounded-refresh + fail-closed design, so a contended lock can no longer reintroduce the double-refresh it exists to prevent. - Bound the refresh network call with an explicit 8s HTTP timeout on both the codex client (auth.rs) and the MCP AuthClient (mcp/runtime.rs), strictly shorter than the 10s lock-acquire deadline. With flock(2) auto-release on death, a live holder always frees the tenant lock before a waiter's deadline, so a lock-acquire timeout is genuinely abnormal. - lock_tenant_refresh now returns RefreshLock { Held, Unavailable, TimedOut }. On TimedOut callers fail closed instead of refreshing unserialised: - codex get_valid_token / force_refresh return a retryable error; - MCP resolve_oauth_dial returns a new OauthDialError::Transient that leaves the server retryable WITHOUT forcing re-login (NeedsAuth) or tripping the circuit breaker (auth-level failures still map to NeedsAuth as before). A filesystem error opening the sidecar returns Unavailable and degrades to a best-effort unserialised refresh rather than blocking every refresh. - Add a fail-closed timeout test (injectable deadline) and update ADR §5.4 to document the bounded-refresh + fail-closed invariant, superseding the earlier fail-open note. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * openab-agent: size lock timeout above worst-case multi-call refresh hold Mira review (PR #1190) found the fail-closed invariant held only for the codex path. The MCP path holds the tenant lock across TWO sequential bounded calls — rmcp's initialize_from_store() (AS discovery) then get_access_token() (refresh) — so the worst-case lock-hold is ~2 x REFRESH_HTTP_TIMEOUT (16s), which could exceed the fixed 10s lock deadline and fail a waiter closed while the holder is still legitimately progressing. Derive REFRESH_LOCK_TIMEOUT from the bound instead of hardcoding it: REFRESH_LOCK_TIMEOUT = MAX_REFRESH_ROUND_TRIPS (2) * REFRESH_HTTP_TIMEOUT + 4s = 20s so the deadline is always above the worst-case hold and only a genuinely stuck holder trips it. Normal-case latency is unchanged (the waiter polls every 100ms and acquires as soon as the holder releases). Update the fn doc and ADR §5.4 to state the multi-call hold explicitly. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: thepagent <hehsieh1010@gmail.com>
…nthropic-oauth # Conflicts: # openab-agent/src/auth.rs
Design update — landing the ADR's
|
| Dimension | OpenClaw (req'd) | Hermes (req'd) | pi | opencode |
|---|---|---|---|---|
| Config format / location | openclaw.json (JSON5), global+project merge |
~/.hermes/config.yaml + .env |
~/.pi/agent/ — 3 files (settings/models/auth) |
opencode.json (JSONC), deep-merge layers |
| Default model | single model.primary:"prov/model" |
model.default:"prov/model" (+model.provider) |
two fields defaultProvider+defaultModel |
single model:"prov/model" |
| Provider registry | built-in + models.providers{} (custom) |
ProviderProfile dataclass registry (in code) |
models.json + registerProvider() hook |
Models.dev + provider{} block |
| Auth-method encoding | profile key prov:label + type:oauth|apikey |
auth_type string field on profile |
cred type:api_key|oauth |
cred type:oauth|api in auth store |
| Secrets location | separate auth-profiles.json |
separate .env / auth.json |
separate auth.json (0600) |
separate auth.json (0600) |
| Per-model params | models[].params passthrough |
fixed_temperature/default_max_tokens |
structured model fields (no temp) | models.<id>.options passthrough |
| Precedence | flag/slash → config order → env → default | CLI → config.yaml → .env → default | flag → auth.json → env → registry | env → auth.json → config |
Three signals are unanimous across all four — we adopt them:
- Secrets always live in a file separate from config (config stays commitable). We already have this:
auth.json. - The credential record is tagged with a
typediscriminator (oauthvsapi_key). We already have this:AuthEntry/TokenStore. - Static provider facts in code, dynamic selection in config, resolved in one place. Hermes is the cleanest exemplar (
ProviderProfileregistry +resolve_runtime_provider()), and it maps 1:1 onto ADR §5.1'sOAuthVendordescriptor + a resolver.
Where we diverge: unlike all four (TS/Python), we keep the cross-process flock-based locked-RMW invariant (ADR §5.4) — OpenClaw and Hermes both file-lock but only for the single-process case; OpenClaw's open refresh-persistence bug cluster (#52037/#48153/#71026) is direct evidence that getting the lock + atomic write-back right matters.
4. Proposed Solution
OAuthVendortrait + descriptors (ADR §5.1):CodexVendor,AnthropicVendortoday; new vendor = descriptor only. Collapse the three login flows into one shared PKCE/device driver on the in-treeoauth2crate; fold therefresh_tokenprovider-branch intovendor.token_body(). Anthropic's JSON-no-scope quirk via the crate's custom-http hook.- Thin
config.json(next toauth.json), singleprovider/modelstring + optionalmax_tokens/ per-providerbase_url. No secrets. Resolution:model_override → config.json → OPENAB_AGENT_MODEL → error(env still overrides; fail-loud per §9 Q1). - Credential precedence (§5.3):
ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN → stored anthropic-oauth tenant. - Storage: Anthropic save/refresh already routed through §5.4 locking in this branch (merge with
maindone — see commit resolving theauth.rsconflict;save_tokens_for → with_auth_locked,get/force_refresh_for → lock_tenant_refresh(namespace)), so the per-tenant RTR race is closed for the Anthropic tenant too.
// ~/.openab/agent/config.json — commitable, secrets stay in auth.json
{
"model": "anthropic/claude-sonnet-4-6", // single provider/model string (reuses ModelRef)
"max_tokens": 8192, // optional
"providers": { "anthropic": { "base_url": "https://api.anthropic.com" } } // optional, custom only
}5. Why This Approach
- Single
provider/modelstring: 3 of 4 surveyed projects use it, and openab already parses it —ModelRef::parse(llm.rs) +resolve_provider_choice()extract the provider from theprovider/prefix today. Two fields (pi) would add coupling for no gain. - Thin config / secrets-in-auth.json: unanimous prior-art signal; reuses the hardened §5.4 store instead of inventing a second secret file.
OAuthVendornow, not later: with only codex+anthropic in tree this is the lowest-cost moment to abstract; doing it after gemini/grok land means refactoring more call sites.- Linked fix — F4: a single
provider/modelconfig string makes theModelRef::parseHuggingFace-ID bug load-bearing (anthropic/claude-…must split;some-org/some-modelmust not). F4's fix (split only on known provider prefixes) ships with the config change, not separately.
6. Alternatives Considered
- Two fields (
defaultProvider+defaultModel, pi-style) — rejected: couples two values openab already derives from one string; only 1/4 projects do it. - Secrets in the config file — rejected: contradicts all four references and would bypass the §5.4 locked store.
- Build all vendors now (gemini/grok/agy) — out of scope per ADR §9 Q3 (incremental landing); this PR lands the abstraction + keeps codex/anthropic green.
- Layer-3 auto-trigger (auto-login on mid-turn 401) — deferred per ADR §2 / Brett 2026-06-24;
/authis sufficient.
7. Validation
- Rust:
cargo check,cargo clippy --all-targets,cargo testgreen after themainmerge — 56/56 auth+llm+mcp tests pass, incl.test_anthropic_save_uses_provider_as_key_disjoint_from_codex,with_auth_locked_merges_concurrent_tenants_no_lost_update,lock_tenant_refresh_fails_closed_when_contended,save_uses_rotated_refresh_token_when_present. - To add with the implementation: config.json load + layering test (
model_override > config > env), precedence test (ANTHROPIC_API_KEY > CLAUDE_CODE_OAUTH_TOKEN > tenant), F4ModelRef::parseHuggingFace-ID regression test, and anOAuthVendorround-trip test per descriptor. - Helm / CI / docs: docs update to
docs/native-agent.md(vendor model + config.json + env route) lands with the change; no Helm surface.
|
Now that #1190 has merged, here's a concrete path to rebase this PR onto the new Blast radius: the lock layer ( #1190 did the hard part generically, so most of this is wiring the new tenant
1. Route the four unlocked writers through #1190's locksThese four sites in
2. One locked implementation, not two — and stage it as two commitsAfter (1), pub async fn get_valid_token() -> Result<String> { get_valid_token_for(CODEX_NAMESPACE).await }
pub async fn force_refresh() -> Result<String> { force_refresh_for(CODEX_NAMESPACE).await }To keep the codex path's review surface clean, please split this into two commits:
That lets a reviewer verify the refactor changes no behaviour separately from the new vendor. 3. Add a concurrency test for the Anthropic tenant#1190 ships 4. Shape the new vendor as a descriptor, not inlined constantsRather than hand-rolling the Anthropic flow with inline constants alongside the codex one, To be clear on scope: the descriptor struct belongs in this PR; building the shared One deployment note (not blocking): for pod/fleet use, the race-free path is the |
… §5.1) Collapse the hand-rolled per-vendor OAuth surface into a single OAuthVendor trait + CodexVendor/AnthropicVendor descriptors. Adding a subscription-OAuth vendor is now a descriptor, not a new hand-rolled flow: - Unify the codex + anthropic PKCE login flows into one shared `login_pkce_flow` driven by the descriptor; fold the codex flow into the `accept_callback_code` / `code_from_redirect` helpers (the long-standing TODO) and unify the 127.0.0.1 bind. - Drive `refresh_token` and the authorization-code exchange off `vendor.token_body()` instead of an `if provider == ANTHROPIC_NAMESPACE` branch. - Shared pure builders `build_authorize_url` / `token_store_from_payload`, pinned by wire-format unit tests — the login authorize-URL/exchange hit live OAuth servers, so no integration test covers them. The driver keeps the proven reqwest flows; swapping the engine onto `oauth2::BasicClient` (as `mcp/runtime.rs` already does via a custom http hook) is a follow-up internal change invisible to vendor authors — only the descriptor surface lands here. `client_secret()`/`grant()` are the ADR §5.1 surface for later vendors (gemini/agy bundled secret; copilot/kiro device-code). 205 tests pass (4 new wire-format locks). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ADR §5.3) Add `AnthropicAuth::OAuthEnv` for the pre-provisioned long-lived subscription token route (`CLAUDE_CODE_OAUTH_TOKEN`) — the recommended fleet mode (ops mints once, injects as a k8s secret; no interactive flow, no auth.json write, no refresh race). - `AnthropicProvider::auto[_with_model]()` now resolves in ADR §5.3 precedence: `ANTHROPIC_API_KEY` → `CLAUDE_CODE_OAUTH_TOKEN` → stored `anthropic-oauth` tenant. Each source surfaces its own errors rather than falling through to a lower-precedence credential error. - `OAuthEnv` shares the OAuth `Bearer` + Claude Code identity path (extracted into `oauth_headers`); `is_oauth()` covers it so the system block + tool-name casing apply. The 401 force-refresh is gated to the stored tenant only — the env token has no tenant to refresh, so a 401 there surfaces (re-mint) instead of erroring on a missing tenant. 207 tests pass (+2 precedence tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (review F4) `provider/model` previously split on the first `/`, mis-parsing HuggingFace-style `org/model` ids (e.g. `meta-llama/Llama-3-8B`) for custom/OpenAI-compatible endpoints — `org` became the "provider" and the real id was truncated. Now the prefix is split off only when it's a `KNOWN_PROVIDERS` entry; otherwise the whole string is the model id. Load-bearing for the planned single-string `provider/model` config field. Known prefixes still split unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reds (review F3)
`select_provider`'s auto-detect arm discarded the `AnthropicProvider::auto()`
error and fell through to Codex. A present Anthropic credential that failed for a
config reason (e.g. a key set but no model) was silently masked — the user got a
Codex provider, or a Codex-only error that hid the real cause.
Now: if any Anthropic credential source exists (API key / CLAUDE_CODE_OAUTH_TOKEN /
stored tenant), an `auto()` failure surfaces as a config error ("credential
present but unusable"). Codex fallthrough happens only when no Anthropic
credential exists at all, and that final error names both credential routes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(ADR §5.5)
Add a small JSON config file next to auth.json so a deployment can declare the
default provider/model (and max_tokens) in a file instead of only via env vars —
the centralized-config gap the multi-vendor work surfaced.
- New `config` module: `{ model, max_tokens }` (single `provider/model` string,
reusing ModelRef). Unknown keys tolerated (forward-compat for `providers`/etc.);
malformed JSON fails loud; a missing file is an empty config.
- Resolution is env-over-config: `anthropic_model` / `anthropic_max_tokens` /
`resolve_provider_choice` now fall `OPENAB_AGENT_*` env → config.json → built-in.
A pod's injected env stays authoritative over a baked config.
- Secrets never live here — they stay in the locked `auth.json` store.
Per-provider `base_url` routing is a deliberate follow-up; this lands the default
provider/model/params surface. 213 tests pass (+5).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ig.json (review F5) - Anthropic credentials section: the ADR §5.3 precedence (ANTHROPIC_API_KEY → CLAUDE_CODE_OAUTH_TOKEN fleet route → interactive anthropic-oauth) + the `openab-agent auth anthropic-oauth` login. - New "Configuration file (config.json)" section: schema, env-over-config precedence, secrets-stay-in-auth.json. - "Adding an OAuth vendor": the OAuthVendor descriptor model (ADR §5.1). - Env table: CLAUDE_CODE_OAUTH_TOKEN, OPENAB_AGENT_ANTHROPIC_CLIENT_ID, OPENAB_CONFIG_PATH. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
…auth tenant Brett asked to cover the new tenant the same way openabdev#1190's codex lock tests do, so single-flight / fail-closed is proven for it rather than only structurally similar: - `with_auth_locked_merges_anthropic_tenant_no_lost_update` — a concurrent codex write does not clobber a just-written `anthropic-oauth` token (the new tenant rides the same locked RMW funnel). - `lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant` — a held anthropic refresh lock makes a second anthropic acquire fail closed (`TimedOut`), and does NOT block codex (per-tenant isolation — the reason §5.4 uses a per-tenant lock, not the global one). 215 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Architecture — Before / AfterA visual companion to the design-update comment above: how auth/provider selection is shaped before this PR's multi-vendor refactor vs after (ADR §4/§5.1/§5.3/§5.5). Before — hand-rolled per-vendor, env-only, unlocked new tenantAfter — two-axis model, 3-tier precedence, locked, config fileCore deltas
All of @brettchien's rebase roadmap and the review findings are addressed: routed the new tenant through #1190's |
|
LGTM ✅ — Well-architected multi-vendor OAuth with clean credential-source precedence and solid test coverage. What This PR DoesAdds native Anthropic OAuth (Claude Pro/Max subscription) login to How It Works
Findings
Baseline Check
What's Good (🟢)
CI: |
What problem does this solve?
openab-agentcan only reach Anthropic viaANTHROPIC_API_KEY(pay-per-token). Codex already supports subscription login, but Claude Pro/Max subscribers cannot use their subscription with the native agent. This adds native Anthropic OAuth (Claude Pro/Max) so users runopenab-agenton the Claude subscription they already pay for — no API key — matching the existing Codex experience.Closes #1186
Discord Discussion URL: https://discord.com/channels/1491295327620169908/1519271476002291752
At a Glance
Prior Art & Industry Research
I looked at how two comparable self-hosted agents authenticate to Anthropic and Codex. The headline finding: neither implements a native Anthropic (Claude Pro/Max) OAuth login — both avoid the PKCE flow and instead lean on a setup-token or on reusing Claude Code's local credentials. This PR (following Pi) does the full native PKCE login, which is strictly more self-contained for pod deployments. Their surrounding architecture, however, validates the storage/refresh choices here.
OpenClaw — supports API keys and subscription OAuth.
claude -p).auth.openai.com/oauth/authorize→ callbackhttp://127.0.0.1:1455/auth/callback(or manual paste) → token exchange →accountIdextracted from the access token. This is byte-for-byte the same flow openab-agent already uses for Codex, corroborating our approach as the de-facto standard.~/.openclaw/agents/<id>/agent/auth-profiles.json, one{access, refresh, expires, accountId}tuple per profile.Hermes Agent —
PROVIDER_REGISTRYdataclasses inhermes_cli/auth.pydeclare each provider's auth type + base URLs + env vars;resolve_runtime_provider()is the single resolution entry point.ANTHROPIC_API_KEY, and if absent reads~/.claude/.credentials.json(reuses Claude Code's store). Docs explicitly note Anthropic here is "straightforward API key authentication without refresh token complexity."~/.hermes/auth.json(OAuth tokens + active provider),credential_pool.json(rotation),.env(API keys);auth.jsonguarded withfcntl/msvcrtfile locks.Primary source ported: Pi (
earendil-works/pi) —packages/ai/src/utils/oauth/anthropic.ts(PKCE flow, endpoints, scopes; verifier doubles asstate) andpackages/ai/src/api/anthropic-messages.ts(OAuth headers, Claude Code system block, tool-name normalisation). The OAuth client is Claude Code's public client.How this PR compares: like both systems, openab-agent keeps a single namespaced credential file (
~/.openab/agent/auth.json) with atomic writes + per-refresh rotation handling, and an existing Codex tenant identical to OpenClaw's Codex flow. Unlike both, it adds a native Anthropic PKCE login so subscribers need neither a setup-token nor a local Claude Code install.Proposed Solution
Add an
anthropic-oauthtenant alongside the existingcodextenant in~/.openab/agent/auth.json:auth.rs—login_anthropic_browser_flow()(PKCE; verifier doubles asstateper Claude's flow); namespaced token store (load/save/get_valid_token/force_refresh_for(provider)); per-provider refresh encoding (Anthropic = JSON, noscope; Codex = form); shared loopback-callback helpers;show_statuslists all tenants.llm.rs—AnthropicProvidergainsAnthropicAuth { ApiKey | OAuth }. OAuth mode sendsBearer+ Claude Code identity headers (anthropic-beta: claude-code-20250219,oauth-2025-04-20,x-app: cli), prepends the required"You are Claude Code…"system block, normalises built-in tool names to Claude Code casing (read↔Read), and refreshes once on a mid-flight 401.select_providergainsanthropic-oauth;anthropic/auto fall back API-key → OAuth.acp.rs— session/model selection viaAnthropicProvider::auto*()(covers both auth modes); model catalog shows Anthropic models when an API key or OAuth token is present.main.rs—openab-agent auth anthropic-oauth [--no-browser].Also bumps the stale default model
claude-sonnet-4-20250514→claude-opus-4-8(the old dated snapshot returns 404 on the subscription endpoint).Why this approach?
auth.json, so the two subscription logins coexist without new storage mechanisms.Tradeoffs / limitations: depends on Claude Code's public OAuth client and the
claude-code-20250219,oauth-2025-04-20beta headers — if Anthropic changes these, the OAuth path needs updating (API-key path is unaffected). Theclaude-opus-4-8default now also applies to API-key mode (Opus is pricier per-token; overridable viaOPENAB_AGENT_MODEL). The legacyDockerfile.nativeis unrelated and intentionally out of scope (canonicalDockerfile.unifiedbuilds the native variant correctly).Alternatives Considered
~/.claude/.credentials.json, and OpenClaw viaclaude -p): rejected — openab-agent runs in a pod with no Claude Code install and no~/.claude. Owning ananthropic-oauthtenant in our ownauth.jsonkeeps it self-contained and matches the Codex tenant already present.auth.jsonalready serves Codex + MCP; a new file would fragment storage and duplicate the atomic-write/rotation logic.Validation
Rust:
cargo check/cargo buildpass (0 warnings)cargo testpasses — 194 passed, incl. 4 new (authorize-URL, namespaced storage disjointness, OAuth request-body identity+tool-casing, name round-trip)cargo clippy— no new warnings from this change (6 pre-existing in test-moduleENV_LOCK-across-await +mcp/runtime.rs; the OAuth code is clippy-clean)Manual (real Claude Pro/Max account):
auth anthropic-oauth --no-browserlogin → token stored underanthropic-oauth,auth statusshows validclaude-opus-4-8(default) /claude-sonnet-4-6/4-5→ correct responsesbashexecutes in-sandbox (echo …$((6*7))→42), confirming tool-name normalisation round-tripsDockerfile.unified --target native; ran the in-image agent end-to-end (chat + tool call) via the stored OAuth token