Skip to content

[SLOP(claude-opus-4-8)] feat(pegboard-envoy): rate-limit envoy ws ingress and cap get_pages, trim unused metrics#5327

Open
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-feat-util-add-rate-limiter-primitive-and-ingress-throttles-for-actor-create-and-gateway-websocket-pnyswwtmfrom
stack/slop-claude-opus-4-8-feat-pegboard-envoy-rate-limit-envoy-ws-ingress-and-cap-get_pages-trim-unused-metrics-tnzyzqrl
Open

[SLOP(claude-opus-4-8)] feat(pegboard-envoy): rate-limit envoy ws ingress and cap get_pages, trim unused metrics#5327
MasterPtato wants to merge 1 commit into
stack/slop-claude-opus-4-8-feat-util-add-rate-limiter-primitive-and-ingress-throttles-for-actor-create-and-gateway-websocket-pnyswwtmfrom
stack/slop-claude-opus-4-8-feat-pegboard-envoy-rate-limit-envoy-ws-ingress-and-cap-get_pages-trim-unused-metrics-tnzyzqrl

Conversation

@MasterPtato

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

MasterPtato commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@claude

claude Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

PR Review

This PR adds a leaky-bucket rate limiter to the pegboard-envoy WebSocket ingress path (mirroring the existing gateway rate limiter), caps the number of pages a single get_pages request may ask for (8192), and removes the prepopulate() metric-initialization functions that were pre-seeding zero-valued time series with empty labels.

Findings

engine/packages/pegboard-envoy/src/ws_to_tunnel_task.rs:486 — Abort and term signals cannot interrupt the envoy rate-limit sleep

ws_to_tunnel_abort_rx and term_signal are handled inside recv_msg, which only runs after acquire() returns. While the leaky bucket is sleeping, those signals cannot interrupt. The gateway version has abort as an outer select! branch and cancels acquire() immediately when abort fires. With the default drip_rate_us = 200 (0.2 ms) this is negligible in practice, but the inconsistency means an operator setting a higher drip rate (e.g. 50 ms to throttle an overloaded runner more aggressively) would see a corresponding delay in graceful-shutdown responsiveness for every envoy connection. The fix is to add the abort as an outer branch in the task_inner loop, the same way it is in pegboard-gateway2.

engine/packages/pegboard-envoy/src/ws_to_tunnel_task.rs:1436 — Oversized get_pages returns a soft error rather than terminating the connection

CLAUDE.md states that the envoy/pegboard-envoy channel is untrusted. Returning SqliteErrorResponse and leaving the connection open means a misbehaving runner can keep sending oversized requests within the rate-limit burst. Because this is a validation failure at a trust boundary, closing the connection would be more principled. That said, CLAUDE.md also explicitly says to return SqliteErrorResponse for validation failures rather than bubble them through the shared connection task, so this follows the stated guidance. The rate limiter already bounds total message volume; the cap adds defense-in-depth regardless.

engine/packages/pegboard-gateway2/src/ws_to_tunnel_task.rs:29 — Minor: ctx.config().pegboard() now called twice

The removed pegboard_config local binding made it obvious the same config snapshot was used for both rate-limit fields. The new code calls ctx.config().pegboard() on two separate lines. Not a bug, but a small readability regression.

What looks good

  • The envoy rate-limiter defaults (16384 burst, 200 us drip = ~5000 sustained msgs/sec) are well-sized for a connection that multiplexes all actors on a runner, while the gateway limit is appropriately tighter.
  • Moving rate_limit.acquire() inside the select! async block in the gateway (rather than before it) is a correct improvement: abort can now cancel the acquire sleep instead of waiting for it to complete.
  • MAX_GET_PAGES_PER_REQUEST = 8192 with the comment citing observed production max (~1024) is a sensible cap with adequate headroom.
  • PegboardEnvoyWs::new(&ctx) is a cleaner API; the removed ctx.clone() at the call site was redundant.
  • Removing prepopulate() is correct. Pre-populating metrics with blank label values was an anti-pattern; sparse series until the first event is the right behavior.

Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant