Fix/polling abandoned session leak#732
Open
BenLocal wants to merge 5 commits into
Open
Conversation
…grade When an HTTP long-polling session starts a websocket upgrade its heartbeat is paused (`is_upgrading`). If a client opens the upgrade but never completes the probe handshake while keeping the connection open — e.g. a reverse proxy that forwards the upgrade as a pooled request and never relays the `101`, after which the browser falls back to polling — `upgrade_handshake` blocks forever on `ws.next()`, the heartbeat never times out, and the whole session (its socket, both mpsc channels, the heartbeat task and the hyper connection buffer) is retained forever. Under connect/abandon churn this grows RSS without bound (~19 GiB over 20 days observed in production). Add an `upgrade_timeout` config option (default 10s, matching the engine.io reference `upgradeTimeout`) and bound the upgrade handshake with it. On timeout the session is closed and reclaimed. - engineioxide: `EngineIoConfig::upgrade_timeout` + builder method - socketioxide: `SocketIoBuilder::upgrade_timeout` passthrough - tests: unit regression `ws_upgrade_abandoned_timeout`, plus an e2e test that drives 100 abandoned upgrades over real TCP connections and asserts every session is reclaimed while the client connections stay open Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing e2e test covers an upgrade that stalls while the heartbeat is in
its first ping-wait (reclaimed at ping_timeout pre-fix). The *permanent* leak
from the dhat profile ("heartbeat task never ends") is sharper: the session
first completes a ping/pong cycle so the heartbeat parks in `interval_tick`
BETWEEN pings, and only then starts a ws upgrade that stalls with the
connection held open. The parked heartbeat then sees `is_upgrading()` at the
top of every loop and just `continue`s — it never sends a ping, never reaches
the pong recv-timeout, and so never fires `HeartbeatTimeout`. With the
connection held there is no transport-close reclaim either, so the session is
retained forever.
Add `parked_heartbeat_abandoned_upgrade_is_reclaimed`, which drives exactly
that sequence (handshake -> poll the ping -> pong -> stall the upgrade) over
real TCP and asserts the session is reclaimed. Verified it fails without the
upgrade-timeout fix (permanent leak) and passes with it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
|
Thanks for the fix, I can't properly review and test it as I'm holidays. This will be done by the end of June. In the meantime you can temporarily use this fix with cargo git link. |
Totodore
reviewed
Jun 3, 2026
Totodore
left a comment
Owner
There was a problem hiding this comment.
I won't be able to merge it / release it anytime soon but here are some preliminary comments.
Thanks a lot for the detailed bug report and this fix!
…ebsocket upgrades
…or abandoned parked heartbeat timeout
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #731.
Reclaims engine.io sessions whose WebSocket upgrade is started but never completed, so
they (and their held-open upgrade connection) can no longer leak.
Motivation
Abandoned WebSocket-upgrade sessions are never reclaimed. The engine
Socket, both ofits mpsc channels, the heartbeat task, and the held-open upgrade connection (hyper's
~16 KiB read buffer) are retained forever, so server RSS grows without bound under
connect/abandon churn (#731 reports ~19 GiB over ~20 days behind nginx).
While a session upgrades from HTTP long-polling to WebSocket its heartbeat is paused
(
socket.rs,if self.is_upgrading() { interval_tick.tick().await; continue; }). Theupgrade itself (
transport/ws.rs::on_init)awaitedupgrade_handshake()with nobound. If a client opens the upgrade request but never completes the
2probe/5handshake while keeping the connection open — e.g. a reverse proxy that forwards the
upgrade as a pooled request and never relays the
101, after which the browser silentlyfalls back to polling — then
upgrade_handshake()blocks forever,is_upgrading()staystrueforever, the heartbeat never resumes and so never firesHeartbeatTimeout, andbecause the connection is held open there is no transport-close reclaim either. This
matches the issue's dhat profile: heartbeat tasks that "never end" and read buffers rooted
at
hyper::server::conn::http1::UpgradeableConnection::pollthat are "never dropped".(For contrast, a pure long-polling session that is abandoned without ever upgrading is
already reclaimed by the existing heartbeat timeout — verified under churn — so the leak
is specific to the upgrade window.)
Solution
Bound the upgrade handshake with a new, configurable
upgrade_timeout(default 10 s).On timeout the session is reclaimed via
close_session(DisconnectReason::TransportError),so the heartbeat-paused upgrade window can no longer leak.
New public API:
EngineIoConfig::upgrade_timeout+EngineIoConfigBuilder::upgrade_timeout(Duration)SocketIoBuilder::upgrade_timeout(Duration)Tests (both fail without the fix and pass with it):
engineioxideintegration —ws_upgrade_abandoned_timeout: a stalled upgrade isreclaimed with
DisconnectReason::TransportError.engineioxide-e2e(real hyper server over real TCP):abandoned_ws_upgrades_do_not_leak_sessions— 100 concurrent stalled upgrades, everyclient connection held open; all sessions reclaimed after the upgrade timeout.
parked_heartbeat_abandoned_upgrade_is_reclaimed— the permanent-leak case where theheartbeat parks between pings before the upgrade stalls; only the upgrade timeout can
reclaim it.
cargo fmt,cargo clippy --all-features, andcargo test --all-featuresare all green.