Make MCP server startup non-fatal for Oz agent runs #12470
Cloud agent runs previously failed outright when any requested MCP server failed to reach Running within the 20s startup timeout. MCP startup now waits for every server to reach a terminal state, names the servers that failed or were still starting, and continues the run without them by default, surfacing the degradation via task logs and a best-effort run status message. Strictness and the timeout are configurable via --strict-mcp-startup and --mcp-startup-timeout. Co-Authored-By: Oz <oz-agent@warp.dev>
I'm starting a first review of this pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request.
Overview
This PR changes Oz MCP startup handling so requested servers that fail or time out no longer abort non-strict agent runs, adds degradation details to MCP startup errors, and introduces CLI controls for strict startup and timeout tuning.
Concerns
- The changed path is user-perceivable: non-strict runs now continue with a warning/status message, and strict mode/timeout behavior is exposed through new CLI flags. The PR description says the degraded path was not manually tested end to end and does not include a screenshot or screen recording. For this user-facing behavior change, please include screenshots or a short recording demonstrating an Oz agent run with one unreachable MCP server continuing in non-strict mode and failing in strict mode.
- No approved spec context was provided for this implementation PR, and I did not find blocking security issues in the attached diff.
Verdict
Found: 0 critical, 1 important, 0 suggestions
Request changes
captainsafia
left a comment
Looks good to me as far as the structural changes:
- Failures will warn and log instead of erroring by default
- Errors are collected on a per-MCP-server basis instead of all-up, and logged individually
peicodes
left a comment
This will require the customer to provide the new option to see a difference, correct?
@peicodes Nah,
Got it, I had it backwards. Ty for the fix!
    /// Fail the run when any requested MCP server fails to start.
    ///
    /// By default, MCP servers that don't start within the startup timeout are
    /// skipped and the agent runs without their tools.
    #[arg(long = "strict-mcp-startup")]
    pub strict_mcp_startup: bool,
    /// Maximum time to wait for requested MCP servers to start (e.g. `30s`, `1m`).
    #[arg(long = "mcp-startup-timeout", value_name = "DURATION")]
    pub mcp_startup_timeout: Option<humantime::Duration>,
Context: --strict-mcp-startup is a boolean arg, so it defaults to false when unset. Driver logic in driver::driver.rs also sets the startup timeout to 20s when it is unset.
The implication is that when both flags are unset, we fall back to the 20s timeout, and the only behavioral change from the pre-PR state is that servers that fail to start in that window are logged as warnings and skipped instead of failing the run.
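To make the defaulting described in this comment concrete, here is a minimal sketch of how the two flags could resolve into an effective policy. It is not the code from this PR: `McpStartupArgs`, `McpStartupPolicy`, `resolve_policy`, and `DEFAULT_MCP_STARTUP_TIMEOUT` are hypothetical names, and `std::time::Duration` stands in for `humantime::Duration` so the sketch has no dependencies.

```rust
use std::time::Duration;

/// Hypothetical constant mirroring the 20s default mentioned in the comment above.
const DEFAULT_MCP_STARTUP_TIMEOUT: Duration = Duration::from_secs(20);

/// Hypothetical stand-in for the two CLI fields in the diff.
/// The real field is `Option<humantime::Duration>`; std's `Duration` is an
/// assumption made to keep this sketch dependency-free.
#[derive(Debug, Default)]
struct McpStartupArgs {
    strict_mcp_startup: bool,
    mcp_startup_timeout: Option<Duration>,
}

/// Hypothetical resolved settings that a driver could consume.
#[derive(Debug, PartialEq)]
struct McpStartupPolicy {
    strict: bool,
    timeout: Duration,
}

/// A boolean flag is `false` when unset; an unset timeout falls back to 20s.
fn resolve_policy(args: &McpStartupArgs) -> McpStartupPolicy {
    McpStartupPolicy {
        strict: args.strict_mcp_startup,
        timeout: args
            .mcp_startup_timeout
            .unwrap_or(DEFAULT_MCP_STARTUP_TIMEOUT),
    }
}
```

Under these assumptions, calling `resolve_policy(&McpStartupArgs::default())` yields a non-strict policy with a 20s timeout, which matches the "both flags unset" case discussed above.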

Description
Makes MCP server startup non-fatal by default for Oz agent runs.
Previously, cloud agent runs failed outright with "One or more MCP servers failed to start..." when any requested MCP server failed to reach Running within the 20s startup timeout. Several Oz users hit issues where some subset of their servers wouldn't spawn within the (non-configurable) timeout and thus the entire run was borked.
The shape of this solution:
- AgentDriver now waits for every requested server to reach a terminal state (Running or FailedToStart) instead of failing fast on the first failure, and tracks per-server names so degradation is reportable (e.g. failed to start: server_x; still starting after 20s: server_y).
- AgentDriverError::MCPStartupFailed carries those details, which surface in both the strict-mode failure message and the non-strict warning.
- In run_internal, both the spec'd (--mcp) and profile MCP startup results are handled outside record_result (preserving setup_mcp_server_startup telemetry): non-strict runs log a WARN naming the unavailable servers and post a best-effort, message-only update_agent_task status update, then continue. MCPServerNotFound (bad UUID config) stays fatal.
- oz agent run gains two flags: --strict-mcp-startup (restores the old fail-fast behavior) and --mcp-startup-timeout <DURATION> (default 20s), plumbed through AgentDriverOptions.
Servers that connect after the timeout still become usable mid-run: in-flight spawns are never aborted, and the per-request mcp_context is rebuilt from active servers each turn.
Follow-ups
This PR sets up client-side scaffolding that lets us configure how strict we are about spawning MCP servers prior to cloud agent runs, and the duration we wait for those servers to spawn before propagating the first user query. The intended follow-up here is to surface these as configurable options via the Oz web app, which should be a fairly trivial change.
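For readers who want the strict versus non-strict handling from the Description in code form, the following is a rough sketch of the idea, not the PR's implementation. `ServerState::Starting`, `StartupDegradation`, `summarize`, and `apply_policy` are hypothetical names, and the error and warning wording is made up; only the `Running` and `FailedToStart` states and the details format are taken from the Description.

```rust
use std::time::Duration;

/// `Running` and `FailedToStart` are the terminal states named in the PR;
/// `Starting` is a hypothetical label for any server still non-terminal.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ServerState {
    Starting,
    Running,
    FailedToStart,
}

/// Hypothetical summary of servers that were unavailable when the wait ended.
#[derive(Debug, Default, PartialEq)]
struct StartupDegradation {
    failed: Vec<String>,
    still_starting: Vec<String>,
}

impl StartupDegradation {
    fn is_empty(&self) -> bool {
        self.failed.is_empty() && self.still_starting.is_empty()
    }

    /// Renders details in the shape quoted in the Description, e.g.
    /// "failed to start: server_x; still starting after 20s: server_y".
    fn describe(&self, timeout: Duration) -> String {
        let mut parts = Vec::new();
        if !self.failed.is_empty() {
            parts.push(format!("failed to start: {}", self.failed.join(", ")));
        }
        if !self.still_starting.is_empty() {
            parts.push(format!(
                "still starting after {}s: {}",
                timeout.as_secs(),
                self.still_starting.join(", ")
            ));
        }
        parts.join("; ")
    }
}

/// Groups server names by why they are unavailable; running servers are omitted.
fn summarize(states: &[(&str, ServerState)]) -> StartupDegradation {
    let mut out = StartupDegradation::default();
    for (name, state) in states {
        match state {
            ServerState::Running => {}
            ServerState::FailedToStart => out.failed.push((*name).to_owned()),
            ServerState::Starting => out.still_starting.push((*name).to_owned()),
        }
    }
    out
}

/// Strict mode turns any degradation into an error; non-strict mode returns a
/// warning to log and lets the run continue. Message wording is illustrative.
fn apply_policy(
    degradation: &StartupDegradation,
    strict: bool,
    timeout: Duration,
) -> Result<Option<String>, String> {
    if degradation.is_empty() {
        return Ok(None);
    }
    let details = degradation.describe(timeout);
    if strict {
        Err(format!("MCP startup failed ({details})"))
    } else {
        Ok(Some(format!("continuing without MCP servers ({details})")))
    }
}
```

In this sketch the same details string feeds both the strict error and the non-strict warning, which mirrors how the Description says MCPStartupFailed details surface in both paths.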
Testing
Manually tested via local Oz CLI; here's an example of a publicly reachable MCP server (deepwiki) and an unreachable server configured via a URL that explicitly accepts no connections. Notice that the conversation proceeds despite the failure.
- cargo nextest run -p warp -E 'test(error_classification) or test(driver_tests)': 32/32 pass, including a new test asserting MCPStartupFailed details surface in the classified message.
- cargo nextest run -p warp_cli: 175/175 pass (clap arg definitions).
- ./script/format and presubmit's cargo clippy --workspace --exclude warp_completer --all-targets --tests -- -D warnings pass.
Agent Mode
CHANGELOG-OZ: By default, cloud agent runs now continue without configured MCP servers that fail to start within the 20s default startup window, instead of failing the run.
Co-Authored-By: Oz <oz-agent@warp.dev>
Plan: https://staging.warp.dev/drive/notebook/ASfpNkdsDIpw6hkTsmooCo
Conversation: https://staging.warp.dev/conversation/44d42331-9340-4dbe-9e87-276117c147cb