Summary
With the StreamableHttp client transport and reinit_on_expired_session = true (the default), a request that is in flight when the session expires (HTTP 404) can be permanently orphaned: its response is dropped, the pending responder is never completed or errored, and the caller's request future (e.g. peer.call_tool(...)) hangs forever. Because the typed peer methods use no timeout, there is no recovery.
This is a follow-up to #733, which added transparent re-initialization — the re-init path itself has a request-orphaning race.
- crate: rmcp 1.7.0, features client + transport-streamable-http-client
- Observed downstream as modelcontextprotocol-unrelated app hangs; full downstream report: 0xPlaygrounds/rig#1914.
Symptom (from a user log)
INFO rmcp::transport::streamable_http_client: session expired (HTTP 404), attempting transparent re-initialization
WARN rmcp::transport::streamable_http_client: sse client event stream terminated with error: Err(TokioJoinError(JoinError::Cancelled(Id(15065))))
…after which the client is stuck. The JoinError::Cancelled WARN is expected (it's streams.abort_all() firing during re-init) — a symptom, not the cause.
Root-cause analysis (rmcp 1.7.0)
A tools/call response is not the return value of the POST — the POST resolves on 202 Accepted, and the actual CallToolResult arrives asynchronously as a ServerMessage on an SSE stream task, matched back to the caller by JSON-RPC id via local_responder_pool.
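The id-based matching described above can be modelled with a small sketch. This is not rmcp's code: ResponderPool, register, and dispatch are hypothetical names, and a std mpsc channel stands in for whatever one-shot channel the real local_responder_pool uses.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for a responder pool: JSON-RPC id -> waiting caller.
struct ResponderPool {
    pending: HashMap<u64, Sender<String>>,
}

impl ResponderPool {
    fn new() -> Self {
        Self { pending: HashMap::new() }
    }

    /// Called when a request is sent: park a responder under the request id.
    fn register(&mut self, id: u64) -> Receiver<String> {
        let (tx, rx) = channel();
        self.pending.insert(id, tx);
        rx
    }

    /// Called when a message arrives on a stream: route it to the caller by id.
    /// Returns false when no caller is parked under that id.
    fn dispatch(&mut self, id: u64, payload: String) -> bool {
        match self.pending.remove(&id) {
            Some(tx) => tx.send(payload).is_ok(),
            None => false,
        }
    }
}
```

The point of the model is that a caller is only ever woken by dispatch; if the stream that would have called it goes away, the entry simply stays parked.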
- On SessionExpired with re-init enabled, the worker (src/transport/streamable_http_client.rs) logs re-init (:670), runs perform_reinitialization (:672), then calls streams.abort_all() (:684), which aborts every SSE stream task, including the standalone GET stream carrying outstanding responses and any in-flight POST-SSE response stream.
- The aborted stream is dropped mid-poll. SseAutoReconnectStream (src/transport/.../client_side_sse.rs) only reconnects on Some(Err(..)) or graceful None; a JoinSet abort is neither, so reconnect/last_event_id recovery never runs. The in-flight response is lost.
- Re-init replays only the single message that received the 404 (:759); any other concurrently pending request, or one whose response was mid-delivery on an aborted stream, is never replayed.
- Nothing ever completes or errors the orphaned entry in local_responder_pool (src/service.rs:764). Responders are only removed on id-matched response (:1023), id-matched error (:1037), transport-send error (:855), or an explicit cancellation notification (:872), none of which fire here. The worker has no RequestId/responder concept, so it cannot fail the request on its side either.
- The caller's future never resolves: call_tool (src/service/client.rs:365) → send_request (src/service.rs:442) → send_request_with_option(req, PeerRequestOptions::no_options()) → await_response (:321), whose else branch is self.rx.await (:345) with no timeout.
Net: an in-flight request whose response stream is killed by abort_all() during re-init → no response, no error, unbounded await. reinit_on_expired_session defaults to true (:1080), so the path is active by default. This matches the intermittent ("occasionally") nature — it only triggers when a request is in flight at the moment of the 404/re-init.
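To make the failure mode concrete, here is a minimal std-only model of it. It is a sketch under assumptions, not the rmcp code path: wait_across_reinit, the id 7, and the payload string are all hypothetical, and recv_timeout is used only so the example terminates where the real caller would not.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::time::Duration;

/// Models one in-flight request across a re-init (all names hypothetical).
/// `bound` exists only so this sketch returns; the real await has no bound.
fn wait_across_reinit(bound: Duration) -> Result<String, RecvTimeoutError> {
    let mut pending = HashMap::new();
    let (tx, rx) = channel::<String>();
    pending.insert(7u64, tx); // request id 7 is in flight

    // The stream task holds the response it was about to deliver for id 7.
    let stream_task: Option<(u64, String)> = Some((7, "CallToolResult".to_string()));

    // Re-init: the abort drops the stream task before it can deliver.
    drop(stream_task);

    // Nothing removed entry 7, so its sender is still alive: the receiver
    // sees neither a value nor a disconnect and can only time out.
    let outcome = rx.recv_timeout(bound);
    assert!(pending.contains_key(&7)); // the orphaned entry is still parked
    outcome
}
```

Calling wait_across_reinit with any bound yields a timeout rather than a value or a disconnect error, which is the bounded analogue of the unbounded self.rx.await described above.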
Suggested directions (deferring to maintainers on architecture)
- Don't silently orphan in-flight requests across re-init. When abort_all() discards streams that may carry outstanding responses, the affected local_responder_pool entries should be failed with a retryable/transport error (so callers get Err instead of hanging) or, better, all in-flight requests should be replayed after re-init, not just the one that 404'd.
- Prefer recovery over abort for the standalone SSE stream: reconnect it under the new session id (with last_event_id) instead of aborting and losing it.
- Defensive timeouts for typed methods. Peer::send_request_with_option already supports PeerRequestOptions { timeout } (:457), but every typed peer_req method (call_tool, etc.) hard-codes no_options(). Consider a configurable default request timeout, or timeout-aware typed variants, so a lost response can't wedge a caller indefinitely.
- Logging nit: the JoinError::Cancelled WARN at :824 is a by-design consequence of abort_all(); consider downgrading/clarifying so it doesn't read as a transport error.
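As a rough illustration of the first direction, the sketch below fails every parked responder at the point where the streams are aborted. It is one possible shape, not a proposed patch: ResponderPool, fail_all, and RequestError::SessionReinitialized are hypothetical names, and a std mpsc channel stands in for the real responder channel.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical retryable error handed to callers whose stream was aborted.
#[derive(Debug, PartialEq)]
enum RequestError {
    SessionReinitialized,
}

type Reply = Result<String, RequestError>;

/// Hypothetical stand-in for a responder pool: JSON-RPC id -> waiting caller.
struct ResponderPool {
    pending: HashMap<u64, Sender<Reply>>,
}

impl ResponderPool {
    fn new() -> Self {
        Self { pending: HashMap::new() }
    }

    /// Park a responder under the request id and hand the caller its receiver.
    fn register(&mut self, id: u64) -> Receiver<Reply> {
        let (tx, rx) = channel();
        self.pending.insert(id, tx);
        rx
    }

    /// Intended to run alongside the stream abort: every parked caller gets
    /// an error instead of silence. Returns the affected ids in sorted order.
    fn fail_all(&mut self) -> Vec<u64> {
        let mut ids = Vec::new();
        for (id, tx) in self.pending.drain() {
            // A send error only means the caller already gave up; ignore it.
            let _ = tx.send(Err(RequestError::SessionReinitialized));
            ids.push(id);
        }
        ids.sort_unstable();
        ids
    }
}
```

The returned id list is where a replay-based variant could hook in: instead of (or before) failing the callers, the same ids identify which requests would need to be re-sent under the new session.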
Workaround (today)
Callers can avoid the typed call_tool and issue the request with a timeout:
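    use rmcp::service::PeerRequestOptions;

    // `request` is a ClientRequest::CallToolRequest(..)
    let response = peer
        .send_request_with_option(
            request,
            PeerRequestOptions {
                timeout: Some(std::time::Duration::from_secs(60)),
                ..PeerRequestOptions::no_options()
            },
        )
        .await?
        // The timeout is applied while awaiting the response (see await_response above).
        .await_response()
        .await;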
This turns the hang into a ServiceError::Timeout (and sends a cancellation notification), but it doesn't recover the lost request — it just bounds the wait. Downstream, rig added a per-call timeout wrapper as mitigation (0xPlaygrounds/rig#1921).
Reproduction
A fully deterministic pure-rmcp repro is awkward because it needs a server that expires the session while a tools/call response is pending. The trigger conditions are:
- client with reinit_on_expired_session = true (default);
- server accepts a tools/call (202, response to be delivered over SSE), then expires/discards the session so a subsequent request/standalone GET returns 404 → re-init → abort_all() kills the SSE stream that would have delivered the pending response.
Happy to attempt a minimal failing integration test against the in-tree streamable-http test server if that would help triage.