Skip to content

Make MCP server startup non-fatal for Oz agent runs#12470

Merged
vkodithala merged 2 commits into
masterfrom
varoon/oz-mcp-failures-non-fatal
Jun 10, 2026
Merged

Make MCP server startup non-fatal for Oz agent runs#12470
vkodithala merged 2 commits into
masterfrom
varoon/oz-mcp-failures-non-fatal

Conversation

@vkodithala

@vkodithala vkodithala commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Description

Makes MCP server startup non-fatal by default for Oz agent runs.

Previously, cloud agent runs failed outright with "One or more MCP servers failed to start..." when any requested MCP server failed to reach Running within the 20s startup timeout. Several Oz users hit issues where some subset of their servers wouldn't spawn within the (non-configurable) timeout and thus the entire run was borked.

The shape of this solution:

  • The MCP startup waiters in AgentDriver now wait for every requested server to reach a terminal state (Running or FailedToStart) instead of failing fast on the first failure, and track per-server names so degradation is reportable (e.g. failed to start: server_x; still starting after 20s: server_y).
  • AgentDriverError::MCPStartupFailed carries those details, which surface in both the strict-mode failure message and the non-strict warning.
  • In run_internal, both the spec'd (--mcp) and profile MCP startup results are handled outside record_result (preserving setup_mcp_server_startup telemetry): non-strict runs log a WARN naming the unavailable servers and post a best-effort message-only update_agent_task status update, then continue. MCPServerNotFound (bad UUID config) stays fatal.
  • Two new CLI flags on oz agent run: --strict-mcp-startup (restores the old fail-fast behavior) and --mcp-startup-timeout <DURATION> (default 20s), plumbed through AgentDriverOptions.

Servers that connect after the timeout still become usable mid-run: in-flight spawns are never aborted, and the per-request mcp_context is rebuilt from active servers each turn.

Follow-ups

This PR sets up client-side scaffolding that lets us configure how strict we are about spawning MCP servers prior to cloud agent runs, and the duration we wait for those servers to spawn before propagating the first user query. The intended follow-up here is to surface these as configurable options via the Oz web app, which should be a fairly trivial change.

Testing

Manually tested via local Oz CLI; here's an example of a publicly reachable MCP server (deepwiki) and an unreachable server configured via a URL that explicitly accepts no connections. Notice that the conversation proceeds despite the failure.

Screenshot 2026-06-10 at 5.47.07 PM.png

cargo nextest run -p warp -E 'test(error_classification) or test(driver_tests)' — 32/32 pass, including a new test asserting MCPStartupFailed details surface in the classified message.

  • cargo nextest run -p warp_cli — 175/175 pass (clap arg definitions).
  • ./script/format and presubmit's cargo clippy --workspace --exclude warp_completer --all-targets --tests -- -D warnings pass.

Agent Mode

  • Warp Agent Mode - This PR was created via Warp's AI Agent Mode

CHANGELOG-OZ: By default, cloud agent runs now continue without configured MCP servers that fail to start within the 20s default startup window, instead of failing the run.

Co-Authored-By: Oz oz-agent@warp.dev

Plan: https://staging.warp.dev/drive/notebook/ASfpNkdsDIpw6hkTsmooCo
Conversation: https://staging.warp.dev/conversation/44d42331-9340-4dbe-9e87-276117c147cb

Cloud agent runs previously failed outright when any requested MCP
server failed to reach Running within the 20s startup timeout. MCP
startup now waits for every server to reach a terminal state, names the
servers that failed or were still starting, and continues the run
without them by default, surfacing the degradation via task logs and a
best-effort run status message. Strictness and the timeout are
configurable via --strict-mcp-startup and --mcp-startup-timeout.

Co-Authored-By: Oz <oz-agent@warp.dev>
@cla-bot cla-bot Bot added the cla-signed label Jun 10, 2026
Comment thread app/src/ai/agent_sdk/driver/error_classification.rs Outdated
@vkodithala vkodithala marked this pull request as ready for review June 10, 2026 20:25
@oz-for-oss

oz-for-oss Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@vkodithala

I'm starting a first review of this pull request.

You can view the conversation on Warp.

I completed the review and no human review was requested for this pull request.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

@oz-for-oss oz-for-oss Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR changes Oz MCP startup handling so requested servers that fail or time out no longer abort non-strict agent runs, adds degradation details to MCP startup errors, and introduces CLI controls for strict startup and timeout tuning.

Concerns

  • The changed path is user-perceivable: non-strict runs now continue with a warning/status message, and strict mode/timeout behavior is exposed through new CLI flags. The PR description says the degraded path was not manually tested end to end and does not include a screenshot or screen recording. For this user-facing behavior change, please include screenshots or a short recording demonstrating an Oz agent run with one unreachable MCP server continuing in non-strict mode and failing in strict mode.
  • No approved spec context was provided for this implementation PR, and I did not find blocking security issues in the attached diff.

Verdict

Found: 0 critical, 1 important, 0 suggestions

Request changes

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

@captainsafia captainsafia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as far as the structural changes:

- Failures will warn and log instead of erroring by default
- Erros are collected on a per-MCP server basis instead of all-up and logged individually

Comment thread app/src/ai/agent_sdk/driver/error_classification.rs Outdated

@peicodes peicodes left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will require the customer to provide the new option to see a difference, correct?

Copy link
Copy Markdown
Contributor Author

@peicodes Nah, --strict-mcp-startup is a boolean arg, so it's only set to true when provided (default False). Similar case for --mcp-startup-timeout; in the CLI driver we set a default value of 20s: https://github.com/warpdotdev/warp/blob/varoon/oz-mcp-failures-non-fatal/app/src/ai/agent_sdk/driver.rs#L694-L695, this is what it was prior to this change. Meaning the default = we don't block agent startup on MCP spawn and wait for 20s for servers to report back their status.

@peicodes

Copy link
Copy Markdown
Contributor

Got it, I had it backwards. Ty for the fix!

Comment on lines +315 to +323
/// Fail the run when any requested MCP server fails to start.
///
/// By default, MCP servers that don't start within the startup timeout are
/// skipped and the agent runs without their tools.
#[arg(long = "strict-mcp-startup")]
pub strict_mcp_startup: bool,
/// Maximum time to wait for requested MCP servers to start (e.g. `30s`, `1m`).
#[arg(long = "mcp-startup-timeout", value_name = "DURATION")]
pub mcp_startup_timeout: Option<humantime::Duration>,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context: --strict-mcp-startup is a boolean arg, so defaults to false when unset. Driver log in driver::driver.rs makes it so we set startup timeout to 20s when unset, too.

The implication is that when both flags are unset, we revert to a 20s timeout with the only behavioral change from the pre-PR state being that we log warnings for servers that fail to start up in that window.

Copy link
Copy Markdown
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vkodithala vkodithala enabled auto-merge (squash) June 10, 2026 22:02
@vkodithala vkodithala merged commit 9a9439e into master Jun 10, 2026
26 checks passed
@vkodithala vkodithala deleted the varoon/oz-mcp-failures-non-fatal branch June 10, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants