Skip to content

Fix/polling abandoned session leak#732

Open
BenLocal wants to merge 5 commits into
Totodore:mainfrom
BenLocal:fix/polling-abandoned-session-leak
Open

Fix/polling abandoned session leak#732
BenLocal wants to merge 5 commits into
Totodore:mainfrom
BenLocal:fix/polling-abandoned-session-leak

Conversation

@BenLocal

@BenLocal BenLocal commented Jun 2, 2026

Copy link
Copy Markdown

Fixes #731.

Reclaims engine.io sessions whose WebSocket upgrade is started but never completed, so
they (and their held-open upgrade connection) can no longer leak.

Motivation

Abandoned WebSocket-upgrade sessions are never reclaimed. The engine Socket, both of
its mpsc channels, the heartbeat task, and the held-open upgrade connection (hyper's
~16 KiB read buffer) are retained forever, so server RSS grows without bound under
connect/abandon churn (#731 reports ~19 GiB over ~20 days behind nginx).

While a session upgrades from HTTP long-polling to WebSocket its heartbeat is paused
(socket.rs, if self.is_upgrading() { interval_tick.tick().await; continue; }). The
upgrade itself (transport/ws.rs::on_init) awaited upgrade_handshake() with no
bound
. If a client opens the upgrade request but never completes the 2probe / 5
handshake while keeping the connection open — e.g. a reverse proxy that forwards the
upgrade as a pooled request and never relays the 101, after which the browser silently
falls back to polling — then upgrade_handshake() blocks forever, is_upgrading() stays
true forever, the heartbeat never resumes and so never fires HeartbeatTimeout, and
because the connection is held open there is no transport-close reclaim either. This
matches the issue's dhat profile: heartbeat tasks that "never end" and read buffers rooted
at hyper::server::conn::http1::UpgradeableConnection::poll that are "never dropped".

(For contrast, a pure long-polling session that is abandoned without ever upgrading is
already reclaimed by the existing heartbeat timeout — verified under churn — so the leak
is specific to the upgrade window.)

Solution

Bound the upgrade handshake with a new, configurable upgrade_timeout (default 10 s).
On timeout the session is reclaimed via close_session(DisconnectReason::TransportError),
so the heartbeat-paused upgrade window can no longer leak.

New public API:

  • EngineIoConfig::upgrade_timeout + EngineIoConfigBuilder::upgrade_timeout(Duration)
  • SocketIoBuilder::upgrade_timeout(Duration)

Tests (both fail without the fix and pass with it):

  • engineioxide integration — ws_upgrade_abandoned_timeout: a stalled upgrade is
    reclaimed with DisconnectReason::TransportError.
  • engineioxide-e2e (real hyper server over real TCP):
    • abandoned_ws_upgrades_do_not_leak_sessions — 100 concurrent stalled upgrades, every
      client connection held open; all sessions reclaimed after the upgrade timeout.
    • parked_heartbeat_abandoned_upgrade_is_reclaimed — the permanent-leak case where the
      heartbeat parks between pings before the upgrade stalls; only the upgrade timeout can
      reclaim it.

cargo fmt, cargo clippy --all-features, and cargo test --all-features are all green.

BenLocal and others added 2 commits June 2, 2026 02:49
…grade

When an HTTP long-polling session starts a websocket upgrade its heartbeat is
paused (`is_upgrading`). If a client opens the upgrade but never completes the
probe handshake while keeping the connection open — e.g. a reverse proxy that
forwards the upgrade as a pooled request and never relays the `101`, after
which the browser falls back to polling — `upgrade_handshake` blocks forever on
`ws.next()`, the heartbeat never times out, and the whole session (its socket,
both mpsc channels, the heartbeat task and the hyper connection buffer) is
retained forever. Under connect/abandon churn this grows RSS without bound
(~19 GiB over 20 days observed in production).

Add an `upgrade_timeout` config option (default 10s, matching the engine.io
reference `upgradeTimeout`) and bound the upgrade handshake with it. On timeout
the session is closed and reclaimed.

- engineioxide: `EngineIoConfig::upgrade_timeout` + builder method
- socketioxide: `SocketIoBuilder::upgrade_timeout` passthrough
- tests: unit regression `ws_upgrade_abandoned_timeout`, plus an e2e test that
  drives 100 abandoned upgrades over real TCP connections and asserts every
  session is reclaimed while the client connections stay open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing e2e test covers an upgrade that stalls while the heartbeat is in
its first ping-wait (reclaimed at ping_timeout pre-fix). The *permanent* leak
from the dhat profile ("heartbeat task never ends") is sharper: the session
first completes a ping/pong cycle so the heartbeat parks in `interval_tick`
BETWEEN pings, and only then starts a ws upgrade that stalls with the
connection held open. The parked heartbeat then sees `is_upgrading()` at the
top of every loop and just `continue`s — it never sends a ping, never reaches
the pong recv-timeout, and so never fires `HeartbeatTimeout`. With the
connection held there is no transport-close reclaim either, so the session is
retained forever.

Add `parked_heartbeat_abandoned_upgrade_is_reclaimed`, which drives exactly
that sequence (handshake -> poll the ping -> pong -> stall the upgrade) over
real TCP and asserts the session is reclaimed. Verified it fails without the
upgrade-timeout fix (permanent leak) and passes with it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented Jun 2, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 87 untouched benchmarks


Comparing BenLocal:fix/polling-abandoned-session-leak (fc4abd8) with main (1c9fc9a)

Open in CodSpeed

@Totodore Totodore added C-Bug Something isn't working A-engineioxide Area related to engineioxide P-High High priority labels Jun 2, 2026
@Totodore

Totodore commented Jun 2, 2026

Copy link
Copy Markdown
Owner

Thanks for the fix, I can't properly review and test it as I'm holidays. This will be done by the end of June. In the meantime you can temporarily use this fix with cargo git link.

@Totodore Totodore left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able to merge it / release it anytime soon but here are some preliminary comments.

Thanks a lot for the detailed bug report and this fix!

Comment thread crates/engineioxide/src/config.rs Outdated
Comment thread crates/engineioxide/src/transport/ws.rs Outdated
Comment thread e2e/engineioxide/tests/memory_leak.rs Outdated
Comment thread e2e/engineioxide/tests/memory_leak.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-engineioxide Area related to engineioxide C-Bug Something isn't working P-High High priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak: abandoned HTTP long-polling sessions are never reclaimed

2 participants