
StreamableHttp client: transparent session re-init (HTTP 404) orphans in-flight requests, hanging call_tool forever #912

@gold-silver-copper


Summary

With the StreamableHttp client transport and reinit_on_expired_session = true (the default), a request that is in flight when the session expires (HTTP 404) can be permanently orphaned: its response is dropped, the pending responder is never completed or errored, and the caller's request future (e.g. peer.call_tool(...)) hangs forever. Because the typed peer methods use no timeout, there is no recovery.

This is a follow-up to #733, which added transparent re-initialization — the re-init path itself has a request-orphaning race.

  • crate: rmcp 1.7.0, feature client + transport-streamable-http-client
  • Observed downstream as modelcontextprotocol-unrelated app hangs; full downstream report: 0xPlaygrounds/rig#1914.

Symptom (from a user log)

INFO  rmcp::transport::streamable_http_client: session expired (HTTP 404), attempting transparent re-initialization
WARN  rmcp::transport::streamable_http_client: sse client event stream terminated with error: Err(TokioJoinError(JoinError::Cancelled(Id(15065))))

…after which the client is stuck. The JoinError::Cancelled WARN is expected (it's streams.abort_all() firing during re-init) — a symptom, not the cause.

Root-cause analysis (rmcp 1.7.0)

A tools/call response is not the return value of the POST — the POST resolves on 202 Accepted, and the actual CallToolResult arrives asynchronously as a ServerMessage on an SSE stream task, matched back to the caller by JSON-RPC id via local_responder_pool.
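The id-based matching described above can be modelled with a small std-only sketch. This is not rmcp's implementation: `ResponderPool`, `register`, `complete` and the `RequestId`/`Response` aliases are hypothetical names chosen for illustration, and `std::sync::mpsc` stands in for the async oneshot channel a real client would use.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for a JSON-RPC request id.
pub type RequestId = u64;
/// Hypothetical stand-in for a response payload or an error message.
pub type Response = Result<String, String>;

/// Simplified model of a responder pool: one pending sender per request id.
#[derive(Default)]
pub struct ResponderPool {
    pending: HashMap<RequestId, Sender<Response>>,
}

impl ResponderPool {
    /// Register an outgoing request and hand the caller the receiving half.
    pub fn register(&mut self, id: RequestId) -> Receiver<Response> {
        let (tx, rx) = channel();
        self.pending.insert(id, tx);
        rx
    }

    /// Called when a message carrying `id` arrives on some stream.
    /// Returns false if nobody is waiting for that id.
    pub fn complete(&mut self, id: RequestId, response: Response) -> bool {
        match self.pending.remove(&id) {
            Some(tx) => tx.send(response).is_ok(),
            None => false,
        }
    }

    /// Number of requests still waiting for a response.
    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```

In this model the only way a caller is woken is a `complete` call with its id, which is why losing the stream that would have carried the response matters so much.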

  1. On SessionExpired with re-init enabled, the worker (src/transport/streamable_http_client.rs) logs re-init (:670), runs perform_reinitialization (:672), then calls streams.abort_all() (:684), which aborts every SSE stream task — including the standalone GET stream carrying outstanding responses, and any in-flight POST-SSE response stream.
  2. The aborted stream is dropped mid-poll. SseAutoReconnectStream (src/transport/.../client_side_sse.rs) only reconnects on Some(Err(..)) or graceful None; a JoinSet abort is neither, so reconnect/last_event_id recovery never runs. The in-flight response is lost.
  3. Re-init replays only the single message that received the 404 (:759); any other concurrently-pending request — or one whose response was mid-delivery on an aborted stream — is never replayed.
  4. Nothing ever completes/errors the orphaned entry in local_responder_pool (src/service.rs:764). Responders are only removed on id-matched response (:1023), id-matched error (:1037), transport-send error (:855), or an explicit cancellation notification (:872) — none of which fire here. The worker has no RequestId/responder concept, so it cannot fail the request on its side either.
  5. The caller's future never resolves: call_tool (src/service/client.rs:365) → send_request (src/service.rs:442) → send_request_with_option(req, PeerRequestOptions::no_options()) → await_response (:321), whose else branch is self.rx.await (:345) with no timeout.

Net: an in-flight request whose response stream is killed by abort_all() during re-init → no response, no error, unbounded await. reinit_on_expired_session defaults to true (:1080), so the path is active by default. This matches the intermittent ("occasionally") nature — it only triggers when a request is in flight at the moment of the 404/re-init.
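The failure mode can be reproduced in miniature with a std-only toy model. Everything here is an assumption made for illustration (`ToyClient`, `reinit_aborting_streams`, `wait_bounded` are hypothetical names, and blocking channels replace async tasks); it only shows the shape of the bug: re-init replaces the streams but leaves the pending responder in place, so the receiver is neither completed nor disconnected.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

pub type RequestId = u64;

/// Toy client state: pending responders plus ids of the "live" stream tasks.
pub struct ToyClient {
    pending: HashMap<RequestId, Sender<String>>,
    live_streams: Vec<u32>,
    next_stream: u32,
}

impl ToyClient {
    pub fn new() -> Self {
        ToyClient { pending: HashMap::new(), live_streams: vec![0], next_stream: 1 }
    }

    /// Registers a responder; the response is expected later on a stream.
    pub fn send_request(&mut self, id: RequestId) -> Receiver<String> {
        let (tx, rx) = channel();
        self.pending.insert(id, tx);
        rx
    }

    /// Models re-init: every stream is aborted and a fresh one is opened,
    /// but the pending map is left untouched.
    pub fn reinit_aborting_streams(&mut self) {
        self.live_streams.clear();
        self.live_streams.push(self.next_stream);
        self.next_stream += 1;
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }

    pub fn live_streams(&self) -> &[u32] {
        &self.live_streams
    }
}

/// Bounded wait, used only so the demonstration terminates.
/// The await described in the issue has no such bound.
pub fn wait_bounded(rx: &Receiver<String>, limit: Duration) -> Result<String, RecvTimeoutError> {
    rx.recv_timeout(limit)
}
```

Because the sender stays alive inside `pending`, the receiver sees a timeout rather than a disconnect. With an unbounded wait, that is a hang.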

Suggested directions (deferring to maintainers on architecture)

  1. Don't silently orphan in-flight requests across re-init. When abort_all() discards streams that may carry outstanding responses, the affected local_responder_pool entries should be failed with a retryable/transport error (so callers get Err instead of hanging) — or, better, all in-flight requests should be replayed after re-init, not just the one that 404'd.
  2. Prefer recovery over abort for the standalone SSE stream: reconnect it under the new session id (with last_event_id) instead of aborting and losing it.
  3. Defensive timeouts for typed methods. Peer::send_request_with_option already supports PeerRequestOptions { timeout } (:457), but every typed peer_req method (call_tool, etc.) hard-codes no_options(). Consider a configurable default request timeout, or timeout-aware typed variants, so a lost response can't wedge a caller indefinitely.
  4. Logging nit: the JoinError::Cancelled WARN at :824 is a by-design consequence of abort_all(); consider downgrading/clarifying so it doesn't read as a transport error.
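As a rough illustration of direction 1, the sketch below drains every pending responder and fails it when streams are discarded, returning the affected ids so a caller could replay them instead. It is a std-only model under stated assumptions, not a patch against rmcp: `Pending`, `fail_all` and `ToyError::SessionReinitialized` are hypothetical names, and where such a hook would live given that the worker has no responder concept is left to the maintainers.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

pub type RequestId = u64;

/// Hypothetical retryable error surfaced to callers whose stream was discarded.
#[derive(Debug, PartialEq)]
pub enum ToyError {
    SessionReinitialized,
}

pub type Outcome = Result<String, ToyError>;

/// Toy pending-request table keyed by request id.
pub struct Pending {
    map: HashMap<RequestId, Sender<Outcome>>,
}

impl Pending {
    pub fn new() -> Self {
        Pending { map: HashMap::new() }
    }

    /// Registers a request and returns the half the caller waits on.
    pub fn register(&mut self, id: RequestId) -> Receiver<Outcome> {
        let (tx, rx) = channel();
        self.map.insert(id, tx);
        rx
    }

    /// Fails every outstanding request and empties the table.
    /// Returns the affected ids in ascending order, e.g. for replay.
    pub fn fail_all(&mut self) -> Vec<RequestId> {
        let mut ids: Vec<RequestId> = self
            .map
            .drain()
            .map(|(id, tx)| {
                // A send error only means the caller already gave up waiting.
                let _ = tx.send(Err(ToyError::SessionReinitialized));
                id
            })
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

Calling `fail_all` at the point where streams are aborted would turn the silent hang into an `Err` the caller can retry on. Replaying the returned ids under the new session would be the more transparent variant.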

Workaround (today)

Callers can avoid the typed call_tool and issue the request with a timeout:

use rmcp::service::PeerRequestOptions;
use std::time::Duration;
// `request` is the ClientRequest::CallToolRequest(..) that call_tool would otherwise build.
peer.send_request_with_option(
    request,
    PeerRequestOptions {
        timeout: Some(Duration::from_secs(60)), // bound the wait instead of awaiting forever
        ..PeerRequestOptions::no_options()
    },
).await

This turns the hang into a ServiceError::Timeout (and sends a cancellation notification), but it doesn't recover the lost request — it just bounds the wait. Downstream, rig added a per-call timeout wrapper as mitigation (0xPlaygrounds/rig#1921).

Reproduction

A fully deterministic pure-rmcp repro is awkward because it needs a server that expires the session while a tools/call response is pending. The trigger conditions are:

  • client with reinit_on_expired_session = true (default);
  • server accepts a tools/call (202, response to be delivered over SSE), then expires/discards the session so a subsequent request/standalone GET returns 404 → re-init → abort_all() kills the SSE stream that would have delivered the pending response.

Happy to attempt a minimal failing integration test against the in-tree streamable-http test server if that would help triage.
