Skip to content

fix: bound sdk.shutdown() so a hung telemetry flush can't block exit#910

Open
CryptoQuaternions wants to merge 2 commits into
rohitg00:mainfrom
CryptoQuaternions:fix/909-bound-shutdown-timeout
Open

fix: bound sdk.shutdown() so a hung telemetry flush can't block exit#910
CryptoQuaternions wants to merge 2 commits into
rohitg00:mainfrom
CryptoQuaternions:fix/909-bound-shutdown-timeout

Conversation

@CryptoQuaternions

@CryptoQuaternions CryptoQuaternions commented Jun 11, 2026

Copy link
Copy Markdown

What

Wraps sdk.shutdown() in the SIGTERM/SIGINT handler with a 3s timeout so a hung telemetry flush can't block process exit.

Why

When the bundled iii engine has already exited, sdk.shutdown() → iii-sdk's shutdownOtel() calls tracerProvider.forceFlush(), which queues spans onto a WebSocket exporter stuck in an infinite reconnect loop (maxRetries: -1). The flush never resolves, so the handler never reaches process.exit(0). Under systemd this is a ~90s hang on every shutdown/reboot before the process is SIGKILLed.

Root cause is upstream in iii-sdk (filed as iii-hq/iii#1835). This PR is a defensive stopgap so agentmemory exits cleanly regardless of the telemetry layer's state.

How to verify

  • With the engine already stopped, send SIGTERM to the worker: it exits within ~3s instead of hanging until force-kill.
  • npm run build compiles clean.

Fixes #909

Summary by CodeRabbit

  • Bug Fixes
    • Improved worker shutdown reliability by adding a timeout to prevent indefinite hangs.
    • Logs warnings when shutdown errors occur or when the timeout elapses.
    • Ensures cleanup proceeds and the process exits even if shutdown does not complete (may drop un‑flushed telemetry on timeout).

When the iii engine is already gone, the OTel exporter retries forever
and sdk.shutdown() never resolves, leaving the process hanging on SIGTERM
until it is force-killed (~90s under systemd). Race sdk.shutdown() against
a 3s timeout so the handler always reaches process.exit().

Fixes rohitg00#909

Signed-off-by: Quaternions <github@tynes.nl>
@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

@CryptoQuaternions is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5cb8d649-6ab3-42af-8b4b-dfd38d955518

📥 Commits

Reviewing files that changed from the base of the PR and between 5cb7111 and de901b5.

📒 Files selected for processing (1)
  • src/index.ts

📝 Walkthrough

Walkthrough

The worker shutdown() now races sdk.shutdown() against a 3-second timeout; it logs a warning on SDK error or timeout and then continues to clearWorkerPidfile() and process.exit(0) to avoid hanging during telemetry shutdown.

Changes

Worker Shutdown Timeout Protection

Layer / File(s) Summary
Timeout-wrapped SDK shutdown
src/index.ts
sdk.shutdown() is raced against a 3s timer; timeout or errors are logged as warnings while allowing process cleanup (clearWorkerPidfile()) and process.exit(0) to continue.
sequenceDiagram
  participant Worker
  participant SDK
  participant Timer
  participant Pidfile
  participant Process

  Worker->>SDK: invoke sdk.shutdown()
  Worker->>Timer: start 3s timeout
  alt SDK shutdown completes before timeout
    SDK-->>Worker: resolve / error (log warning if error)
    Worker->>Pidfile: clearWorkerPidfile()
    Worker->>Process: process.exit(0)
  else Timeout first
    Timer-->>Worker: timeout after 3s (log warning)
    Worker->>Pidfile: clearWorkerPidfile()
    Worker->>Process: process.exit(0)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Poem

🐰 Three seconds tick, the logs all chime,
When SDK stalls, we end the time.
Pidfile cleared, the exit's neat,
No frozen boots, no stuck heartbeat.
A hopping fix — quick, light, and sweet!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: bounding sdk.shutdown() with a timeout to prevent hung telemetry flushes from blocking process exit.
Linked Issues check ✅ Passed The PR implements the defensive stopgap timeout around sdk.shutdown() as required by issue #909, with separate logging for timeout and error paths, and successfully prevents shutdown hangs.
Out of Scope Changes check ✅ Passed All changes are focused on the shutdown handler timeout fix; no unrelated modifications to other features or components are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/index.ts (1)

601-604: ⚡ Quick win

Refactor comment to avoid explaining WHAT.

The comment on lines 602-604 explains the implementation mechanism ("Race it against a 3s timeout so we always reach process.exit()"), which the coding guidelines advise against. The issue references (line 601) provide valuable WHY context and should remain, but the implementation details could be conveyed through clearer code structure or a more purpose-focused comment.

♻️ Guideline-compliant alternative

Revise to focus on the purpose rather than the mechanism:

-    // `#909` / iii-hq/iii#1835: when the iii engine is already gone, the OTel
-    // exporter retries forever and sdk.shutdown() never resolves, hanging the
-    // process until it's force-killed. Race it against a 3s timeout so we always
-    // reach process.exit(); any un-flushed telemetry is dropped on shutdown.
+    // `#909` / iii-hq/iii#1835: defensive timeout prevents indefinite hangs when
+    // OTel exporter is stuck retrying; un-flushed telemetry is dropped on exit.

The Promise.race structure itself communicates the "how" without needing comment explanation.

As per coding guidelines: "In TypeScript source code, avoid code comments explaining WHAT — use clear naming instead"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/index.ts` around lines 601 - 604, Keep the issue reference but replace
the current comment that explains HOW with a purpose-focused comment: state that
the block ensures the process exits even if sdk.shutdown() hangs (e.g., "Ensure
process exit when OpenTelemetry exporter shutdown hangs (see `#909` /
iii-hq/iii#1835)"), remove the "Race it against a 3s timeout..." wording, and
optionally extract the Promise.race(sdk.shutdown(), timeout(3000)) logic into a
clearly named helper like shutdownWithTimeout to make the mechanism
self-documenting; keep references to sdk.shutdown(), Promise.race and the 3s
timeout in the code but not in the comment.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/index.ts`:
- Around line 605-610: The current Promise.race usage hides whether the timeout
or sdk.shutdown() completed; change the logic so the timeout promise returns a
sentinel (e.g., { timedOut: true }) and the sdk.shutdown() returns a distinct
value or resolves normally, then await Promise.race([shutdownPromise,
timeoutPromise]) and inspect the result—log a clear timeout message when the
sentinel is returned and separately catch and log errors from sdk.shutdown()
(referencing sdk.shutdown(), the timeout 3000ms Promise, and the Promise.race
call).

---

Nitpick comments:
In `@src/index.ts`:
- Around line 601-604: Keep the issue reference but replace the current comment
that explains HOW with a purpose-focused comment: state that the block ensures
the process exits even if sdk.shutdown() hangs (e.g., "Ensure process exit when
OpenTelemetry exporter shutdown hangs (see `#909` / iii-hq/iii#1835)"), remove the
"Race it against a 3s timeout..." wording, and optionally extract the
Promise.race(sdk.shutdown(), timeout(3000)) logic into a clearly named helper
like shutdownWithTimeout to make the mechanism self-documenting; keep references
to sdk.shutdown(), Promise.race and the 3s timeout in the code but not in the
comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 167c648c-715f-426d-8f01-807f39090fce

📥 Commits

Reviewing files that changed from the base of the PR and between f6f9e3c and 5cb7111.

📒 Files selected for processing (1)
  • src/index.ts

Comment thread src/index.ts Outdated
Address CodeRabbit review: the previous catch() only logged when
sdk.shutdown() threw, so the timeout path — the defensive case this
guards against — was silent. Log the timeout and shutdown-error cases
separately, and trim the comment to the why.

Signed-off-by: Quaternions <github@tynes.nl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

agentmemory hangs ~90s on shutdown (systemd black screen) — iii-sdk telemetry shutdown never resolves

1 participant