fix: bound sdk.shutdown() so a hung telemetry flush can't block exit#910
fix: bound sdk.shutdown() so a hung telemetry flush can't block exit#910CryptoQuaternions wants to merge 2 commits into
Conversation
When the iii engine is already gone, the OTel exporter retries forever and sdk.shutdown() never resolves, leaving the process hanging on SIGTERM until it is force-killed (~90s under systemd). Race sdk.shutdown() against a 3s timeout so the handler always reaches process.exit(). Fixes rohitg00#909 Signed-off-by: Quaternions <github@tynes.nl>
|
@CryptoQuaternions is attempting to deploy a commit to the rohitg00's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe worker shutdown() now races ChangesWorker Shutdown Timeout Protection
sequenceDiagram
participant Worker
participant SDK
participant Timer
participant Pidfile
participant Process
Worker->>SDK: invoke sdk.shutdown()
Worker->>Timer: start 3s timeout
alt SDK shutdown completes before timeout
SDK-->>Worker: resolve / error (log warning if error)
Worker->>Pidfile: clearWorkerPidfile()
Worker->>Process: process.exit(0)
else Timeout first
Timer-->>Worker: timeout after 3s (log warning)
Worker->>Pidfile: clearWorkerPidfile()
Worker->>Process: process.exit(0)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~15 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/index.ts (1)
601-604: ⚡ Quick winRefactor comment to avoid explaining WHAT.
The comment on lines 602-604 explains the implementation mechanism ("Race it against a 3s timeout so we always reach process.exit()"), which the coding guidelines advise against. The issue references (line 601) provide valuable WHY context and should remain, but the implementation details could be conveyed through clearer code structure or a more purpose-focused comment.
♻️ Guideline-compliant alternative
Revise to focus on the purpose rather than the mechanism:
- // `#909` / iii-hq/iii#1835: when the iii engine is already gone, the OTel - // exporter retries forever and sdk.shutdown() never resolves, hanging the - // process until it's force-killed. Race it against a 3s timeout so we always - // reach process.exit(); any un-flushed telemetry is dropped on shutdown. + // `#909` / iii-hq/iii#1835: defensive timeout prevents indefinite hangs when + // OTel exporter is stuck retrying; un-flushed telemetry is dropped on exit.The Promise.race structure itself communicates the "how" without needing comment explanation.
As per coding guidelines: "In TypeScript source code, avoid code comments explaining WHAT — use clear naming instead"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/index.ts` around lines 601 - 604, Keep the issue reference but replace the current comment that explains HOW with a purpose-focused comment: state that the block ensures the process exits even if sdk.shutdown() hangs (e.g., "Ensure process exit when OpenTelemetry exporter shutdown hangs (see `#909` / iii-hq/iii#1835)"), remove the "Race it against a 3s timeout..." wording, and optionally extract the Promise.race(sdk.shutdown(), timeout(3000)) logic into a clearly named helper like shutdownWithTimeout to make the mechanism self-documenting; keep references to sdk.shutdown(), Promise.race and the 3s timeout in the code but not in the comment.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/index.ts`:
- Around line 605-610: The current Promise.race usage hides whether the timeout
or sdk.shutdown() completed; change the logic so the timeout promise returns a
sentinel (e.g., { timedOut: true }) and the sdk.shutdown() returns a distinct
value or resolves normally, then await Promise.race([shutdownPromise,
timeoutPromise]) and inspect the result—log a clear timeout message when the
sentinel is returned and separately catch and log errors from sdk.shutdown()
(referencing sdk.shutdown(), the timeout 3000ms Promise, and the Promise.race
call).
---
Nitpick comments:
In `@src/index.ts`:
- Around line 601-604: Keep the issue reference but replace the current comment
that explains HOW with a purpose-focused comment: state that the block ensures
the process exits even if sdk.shutdown() hangs (e.g., "Ensure process exit when
OpenTelemetry exporter shutdown hangs (see `#909` / iii-hq/iii#1835)"), remove the
"Race it against a 3s timeout..." wording, and optionally extract the
Promise.race(sdk.shutdown(), timeout(3000)) logic into a clearly named helper
like shutdownWithTimeout to make the mechanism self-documenting; keep references
to sdk.shutdown(), Promise.race and the 3s timeout in the code but not in the
comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
Address CodeRabbit review: the previous catch() only logged when sdk.shutdown() threw, so the timeout path — the defensive case this guards against — was silent. Log the timeout and shutdown-error cases separately, and trim the comment to the why. Signed-off-by: Quaternions <github@tynes.nl>
What
Wraps
sdk.shutdown()in the SIGTERM/SIGINT handler with a 3s timeout so a hung telemetry flush can't block process exit.Why
When the bundled iii engine has already exited,
sdk.shutdown()→ iii-sdk'sshutdownOtel()callstracerProvider.forceFlush(), which queues spans onto a WebSocket exporter stuck in an infinite reconnect loop (maxRetries: -1). The flush never resolves, so the handler never reachesprocess.exit(0). Under systemd this is a ~90s hang on every shutdown/reboot before the process is SIGKILLed.Root cause is upstream in iii-sdk (filed as iii-hq/iii#1835). This PR is a defensive stopgap so agentmemory exits cleanly regardless of the telemetry layer's state.
How to verify
npm run buildcompiles clean.Fixes #909
Summary by CodeRabbit